Lecture times: Tue & Thu, 5:20pm – 6:55pm
First class on Tue, Sep 28.
Classroom: SVC 2165
Section Time: Wednesday, 3-5 pm
Classroom: SVC 2155
Instructor Information
Dr. Jalal Mahmud
email: jumahmud@ucsc.edu
Office Hours: After class or by appointment
Recommended Textbooks:
Pustejovsky, James, and Amber Stubbs. Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications. O'Reilly Media, Inc., 2012.
Kazil, Jacqueline, and Katharine Jarmul. Data Wrangling with Python: Tips and Tools to Make Your Life Easier. O'Reilly Media, Inc., 2016.
Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly Media, Inc., 2018.
Course Syllabus and Schedule
Week 1 (Lecture 1 and Lecture 2) - Sep 28, Sep 30
- Introduction & Class logistics
- Layers of linguistic description
- What is a corpus? History of different corpora
- Ideal properties of corpora, usage of corpora
- Corpus Annotation and annotation types (Types of data tagging and labelling, types of markup)
- Python and NLTK background
- Introduction to several corpora from NLTK
- Dataset formats
- Introduction to several NLP datasets, such as SemEval, ISEAR, OntoNotes, CoNLL, and LDC corpora
- Introduction to spaCy
Week 2 (Lecture 3 and Lecture 4) - Oct 5, Oct 7
- Additional review of NLTK and spaCy
- Accessing corpora and importing into CSV
- Parsing CSV using Python
- Accessing corpora and importing into JSON
- Parsing JSON using Python
- Mapping JSON data to Python objects
- XML data and accessing XML data
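As an illustration of the Week 2 topics, the following is a minimal sketch of parsing CSV and JSON with the Python standard library and mapping the records to Python objects. The `Token` class, the sample data, and the helper function names are hypothetical and chosen only for this example; they are not course materials.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical record type: one token with its part-of-speech tag."""
    text: str
    pos: str


# Made-up sample data standing in for a corpus export.
CSV_TEXT = "text,pos\nThe,DET\ncat,NOUN\nsleeps,VERB\n"
JSON_TEXT = '[{"text": "The", "pos": "DET"}, {"text": "cat", "pos": "NOUN"}]'


def tokens_from_csv(csv_text):
    """Parse CSV text with a header row into Token objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Token(row["text"], row["pos"]) for row in reader]


def tokens_from_json(json_text):
    """Parse a JSON array of objects and map each one to a Token."""
    return [Token(**item) for item in json.loads(json_text)]


csv_tokens = tokens_from_csv(CSV_TEXT)
json_tokens = tokens_from_json(JSON_TEXT)
print(csv_tokens)
print(json_tokens)
```

Reading from a file instead of a string would replace `io.StringIO(...)` with an open file handle (opened with `newline=""` for CSV).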
Week 3 (Lecture 5 and Lecture 6) - Oct 12, Oct 14
- Parsing XML using Python
- Corpus Analytics
- Dataset distributions and properties, strengths and weaknesses
- n-gram models
- Collocations and Significant Collocations
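As an illustration of the n-gram and collocation topics in Week 3, here is a minimal sketch that extracts n-grams from a token list and scores bigrams by pointwise mutual information (PMI), one common association measure for collocations. The function names and the toy sentence are assumptions made for this example; the course may use NLTK's collocation tools or a different measure.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def bigram_pmi(tokens):
    """Score each observed bigram by PMI (log base 2)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_tokens
        p_w2 = unigram_counts[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy example text, not a real corpus.
tokens = "new york is big and new york is busy".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

PMI tends to favor rare pairs, so in practice a minimum frequency threshold is usually applied before ranking.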
Week 4 (Lecture 7 and Lecture 8) - Oct 19, Oct 21
- Regular Expressions in Python
- NLP data preparation and data wrangling
- Analyzing and Counting NLP data properties
- Data preparation methods
- Data cleaning - motivations and methods
- Normalizing Text
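As an illustration of the regular expression and text normalization topics in Week 4, the sketch below applies a small, fixed sequence of cleaning steps. The specific steps (Unicode NFKC normalization, lowercasing, URL removal, punctuation removal, whitespace collapsing) are assumptions for this example; the right choices depend on the task and the data.

```python
import re
import unicodedata

# Precompiled patterns for the cleaning steps used below.
URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^\w\s]")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Apply a simple, illustrative normalization pipeline to a string."""
    text = unicodedata.normalize("NFKC", text)  # unify compatibility characters
    text = text.lower()
    text = URL_RE.sub(" ", text)                # drop URLs
    text = NON_WORD_RE.sub(" ", text)           # replace punctuation with spaces
    return WHITESPACE_RE.sub(" ", text).strip() # collapse runs of whitespace


print(normalize("Visit  https://example.com NOW!!  It's great."))
```

Note that stripping punctuation splits contractions such as "It's" into two pieces, which may or may not be desirable for a given task.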
Week 5 (Lecture 9 and Lecture 10) - Oct 26, Oct 28
- Pipeline of text preparation
- Managing vocabulary
- Text Vectorization Methods - frequency based, one-hot, TF-IDF
- Text Vectorization Methods - distributed representations
- Data Wrangling with Pandas
- Filtering approaches
- Dealing with duplicate records
- Handling missing values
- Data quality analysis
- Data cleaning practical guidance
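As an illustration of the frequency-based and TF-IDF vectorization topics in Week 5, here is a minimal sketch using only the standard library. It assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed IDF of log(N / df); libraries such as scikit-learn use somewhat different weighting by default, so the numbers will not match theirs.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (sparse TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy documents, invented for this example.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

A term that appears in every document (here "the") receives weight zero under this unsmoothed formula.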
Week 6 (Lecture 11 and Lecture 12) - Nov 2, Nov 4
- More on missing values and practical guidance
- Inconsistent data entry and dealing with inconsistencies
- Fuzzy matching
- Scaling and Normalization
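As an illustration of the inconsistent data entry and fuzzy matching topics in Week 6, the sketch below maps noisy values onto a list of canonical values using `difflib` from the standard library. The `canonicalize` helper, the 0.8 similarity cutoff, and the country list are assumptions for this example; dedicated fuzzy matching libraries offer other similarity measures.

```python
import difflib


def canonicalize(value, canonical_values, cutoff=0.8):
    """Map a noisy string to its closest canonical value, if similar enough.

    Falls back to the cleaned (stripped, lowercased) input when no
    canonical value reaches the similarity cutoff.
    """
    cleaned = value.strip().lower()
    matches = difflib.get_close_matches(cleaned, canonical_values, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


# Hypothetical canonical values and noisy entries.
canonical = ["germany", "france", "spain"]
for raw in [" Germny ", "FRANCE", "Italy"]:
    print(repr(raw), "->", canonicalize(raw, canonical))
```

The cutoff trades precision for recall: a lower value merges more variants but risks mapping distinct values onto each other.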
Week 7 (Lecture 13) - Nov 9
- Parsing Dates
- Data encoding issues
- Encoding conversion
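As an illustration of the date parsing and encoding conversion topics in Week 7, here is a minimal sketch using the standard library. The list of accepted date formats and the helper names are assumptions for this example; real data often needs a longer format list or a dedicated date parsing library.

```python
from datetime import datetime

# Candidate formats tried in order; this list is illustrative only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def latin1_to_utf8(raw):
    """Re-encode bytes known to be Latin-1 as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


print(parse_date("09/28/2021"))
print(parse_date("not a date"))
print(latin1_to_utf8(b"caf\xe9"))
```

The conversion helper assumes the source encoding is already known; when it is not, the encoding has to be detected or guessed first, which is inherently unreliable.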
Week 8 (Lecture 14 and Lecture 15) - Nov 16, Nov 18
- Data Sampling Techniques
- Data Augmentation methods
- Finding Outliers
- Data Crawling
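As an illustration of the outlier detection topic in Week 8, the sketch below flags unusually short or long documents using the interquartile range (IQR) rule. The 1.5 multiplier is the conventional default, and the document lengths are invented for this example; other outlier definitions (for example z-scores) are equally valid.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical document lengths in tokens.
lengths = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(lengths)
print(outliers)
```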
Week 9 (Lecture 16) - Nov 23
- Data Annotation
- Applying annotation standards
- Preparing data for annotation
- Annotation guidelines
- Creating a Gold Standard
Week 10 (Lecture 17 and Lecture 18) - Nov 30, Dec 2
- Annotators
- Evaluating annotation quality and reliability
- Crowdsourcing & Mechanical Turk
- Structuring and storing NLP data
- Example data stores: SQL, MySQL, NoSQL
- Object-Relational Mappers, SQLAlchemy
- Storing data using MongoDB
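As an illustration of the annotation reliability topic in Week 10, here is a minimal sketch of Cohen's kappa, a widely used chance-corrected agreement measure for two annotators. The function name and the toy labels are assumptions for this example; the course may cover other measures (for example Fleiss' kappa or Krippendorff's alpha) for more than two annotators.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for this example.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Raw percent agreement for these labels is 0.75, but kappa is lower because part of that agreement would be expected by chance.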