Lecture times: Tue & Thu, 5:20 pm – 6:55 pm

First class on Tue, Sep 28.

Classroom: SVC 2165

Section Time: Wednesday, 3-5 pm

Section Classroom: SVC 2155

 
Instructor Information

Dr. Jalal Mahmud
email: jumahmud@ucsc.edu
Office Hours: After the class or by appointment

 

 

Recommended Textbooks:

Pustejovsky, James, and Amber Stubbs. Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications. O'Reilly Media, Inc., 2012.

Kazil, Jacqueline, and Katharine Jarmul. Data Wrangling with Python: Tips and Tools to Make Your Life Easier. O'Reilly Media, Inc., 2016.

Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly Media, Inc., 2018.


Course Syllabus and Schedule 

Week 1   (Lecture 1 and Lecture 2)   - Sep 28, Sep 30

 

  • Introduction & Class logistics
  • Layers of linguistic description
  • What is a corpus? History of different corpora
  • Ideal properties of corpora, usage of corpora 
  • Corpus Annotation and annotation types (Types of data tagging and labelling, types of markup)
  • Python and NLTK background
  • Introduction to several corpora from NLTK
  • Dataset formats
  • Introduction to several NLP datasets such as SemEval, ISEAR, OntoNotes, CoNLL, and LDC corpora
  • Introduction to spaCy

 

Week 2   (Lecture 3 and Lecture 4) - Oct 5, Oct 7

  • Additional review of NLTK and spaCy
  • Accessing corpora and importing into CSV
  • Parsing CSV using Python
  • Accessing corpora and importing into JSON
  • Parsing JSON using Python
  • Mapping JSON data to Python objects
  • XML data and accessing XML data
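
As an illustrative preview of the CSV and JSON topics above (not official course code), the sketch below parses a small CSV sample and a JSON record with the Python standard library and maps both onto a Python object. The sample data, the column names, and the Example class are hypothetical and exist only for this illustration.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import List

# Hypothetical sample data; the column names are illustrative only.
CSV_TEXT = "id,text,label\n1,I love this course,joy\n2,The deadline scares me,fear\n"
JSON_TEXT = '{"id": 3, "text": "What a surprise", "label": "surprise"}'


@dataclass
class Example:
    """One labelled text example (hypothetical schema)."""
    id: int
    text: str
    label: str


def parse_csv(text: str) -> List[Example]:
    """Read CSV rows into Example objects, using the header row as keys."""
    reader = csv.DictReader(io.StringIO(text))
    return [Example(int(row["id"]), row["text"], row["label"]) for row in reader]


def parse_json(text: str) -> Example:
    """Decode one JSON object and map its fields onto an Example."""
    data = json.loads(text)
    return Example(**data)


csv_examples = parse_csv(CSV_TEXT)
json_example = parse_json(JSON_TEXT)
print(csv_examples[0].label, json_example.text)
```

Reading from a file instead of a string only requires replacing io.StringIO(text) with an opened file handle.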

 

Week 3  (Lecture 5 and Lecture 6) - Oct 12, Oct 14

  • Parsing XML using Python
  • Corpus Analytics
  • Dataset distributions and properties, strengths and weaknesses
  • n-gram models
  • Collocations and Significant Collocations
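
As an illustrative preview of the n-gram and collocation topics above (not official course code), the sketch below counts bigrams in a toy token list and scores a word pair with pointwise mutual information (PMI), one common way to rank candidate collocations. The toy sentence and the helper names are assumptions made for this example; NLTK provides its own collocation tools, which are not used here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples in tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Toy corpus, chosen only to make the counts easy to check by hand.
tokens = "new york is big and new york is busy".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))


def pmi(w1, w2):
    """PMI of the bigram (w1, w2): log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    pair_count = bigram_counts[(w1, w2)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = pair_count / sum(bigram_counts.values())
    p_x = unigram_counts[w1] / len(tokens)
    p_y = unigram_counts[w2] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


print(bigram_counts.most_common(2))
print(pmi("new", "york"))
```

PMI tends to favor rare pairs, so in practice it is often combined with a minimum frequency threshold.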

Week 4  (Lecture 7 and Lecture 8) - Oct 19, Oct 21

  • Regular expressions in Python
  • NLP data preparation and data wrangling
  • Analyzing and Counting NLP data properties 
  • Data preparation methods 
  • Data cleaning - motivations and methods
  • Normalizing Text
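
As an illustrative preview of the regular expression and text normalization topics above (not official course code), the sketch below chains a few common cleaning steps. The specific steps and their order are assumptions for this example; the right choices depend on the corpus and the downstream task.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")   # http/https URLs up to the next whitespace
NON_WORD_RE = re.compile(r"[^\w\s]")   # anything that is not a word character or whitespace
WHITESPACE_RE = re.compile(r"\s+")     # runs of whitespace


def normalize(text):
    """Apply Unicode normalization, URL removal, lowercasing, and punctuation stripping."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = URL_RE.sub(" ", text)                # drop URLs before touching punctuation
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)           # replace punctuation with spaces
    return WHITESPACE_RE.sub(" ", text).strip() # collapse and trim whitespace


print(normalize("Check THIS out:  https://example.com/a?b=1 !!"))
```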

 

Week 5 (Lecture 9 and Lecture 10) - Oct 26, Oct 28

  • Pipeline of text preparation
  • Managing vocabulary
  • Text Vectorization Methods - frequency based, one-hot, TF-IDF
  • Text Vectorization Methods - distributed representations
  • Data Wrangling with Pandas
  • Filtering approaches
  • Dealing with Duplicate records
  • Handling Missing values
  • Data quality analysis
  • Data cleaning practical guidance
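
As an illustrative preview of the TF-IDF vectorization topic above (not official course code), the sketch below computes TF-IDF weights by hand for three toy documents. It assumes non-empty, pre-tokenized documents and uses the plain idf = log(N / df) form; libraries often apply smoothed variants, so their numbers may differ.

```python
import math
from collections import Counter

# Toy, pre-tokenized documents used only for this illustration.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]


def tf_idf(documents):
    """Return (vocab, vectors) with one TF-IDF vector per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # Term frequency is the count normalized by document length.
        vectors.append([counts[term] / len(doc) * idf[term] for term in vocab])
    return vocab, vectors


vocab, vectors = tf_idf(docs)
print(vocab)
print(vectors[1])
```

A term that occurs in every document, such as "the" here, gets an idf of zero and therefore contributes nothing to any vector.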

 

Week 6  (Lecture 11 and Lecture 12) - Nov 2, Nov 4

  • More on missing values and practical guidance
  • Inconsistent data entry and dealing with inconsistencies
  • Fuzzy matching
  • Scaling and Normalization
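
As an illustrative preview of the inconsistent data entry and fuzzy matching topics above (not official course code), the sketch below maps misspelled entries onto a list of canonical values with the standard library's difflib. The canonical list, the 0.8 cutoff, and the canonicalize helper are assumptions for this example; dedicated fuzzy matching libraries offer other similarity measures.

```python
import difflib

# Hypothetical list of accepted values for a "country" column.
CANONICAL = ["United States", "United Kingdom", "Germany"]


def canonicalize(value, choices=CANONICAL, cutoff=0.8):
    """Return the closest canonical value, or None if nothing is similar enough."""
    cleaned = value.strip().title()  # remove padding and unify capitalization
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in [" germny ", "Unted States", "France"]:
    print(repr(raw), "->", canonicalize(raw))
```

Returning None for values below the cutoff keeps unmatched entries visible for manual review instead of silently forcing them into the nearest category.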

 

Week 7  (Lecture 13) - Nov 9

  • Parsing Dates
  • Data encoding issues
  • Encoding conversion
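
As an illustrative preview of the date parsing and encoding conversion topics above (not official course code), the sketch below tries several date formats in turn and re-encodes Latin-1 bytes as UTF-8. The list of formats and the helper names are assumptions for this example; real data usually needs a format list tailored to its sources.

```python
from datetime import datetime

# Candidate formats, tried in order (an assumed list for illustration).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def parse_date(text):
    """Return a datetime for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


def latin1_to_utf8(raw):
    """Decode bytes as Latin-1, then encode the resulting text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


print(parse_date("09/28/2021"))
print(latin1_to_utf8(b"caf\xe9").decode("utf-8"))
```

Ambiguous strings such as 01/02/2021 match whichever format comes first in the list, so the order encodes an assumption about the data source.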

 

Week 8  (Lecture 14 and Lecture 15) - Nov 16, Nov 18

  • Data Sampling Techniques
  • Data Augmentation methods 
  • Finding Outliers
  • Data Crawling

 

Week 9  (Lecture 16) - Nov 23

  • Data Annotation
  • Applying annotation standards
  • Preparing data for annotation
  • Annotation guidelines
  • Creating a gold standard

Week 10  (Lecture 17 and Lecture 18) - Nov 30, Dec 2

  • Annotators 
  • Evaluating the annotation and its reliability
  • Crowdsourcing & Mechanical Turk
  • Structuring and storing NLP data 
  • Example data stores: SQL, MySQL, NoSQL
  • Object-Relational Mappers, SQLAlchemy
  • Storing data using MongoDB
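
As an illustrative preview of the annotation reliability topic above (not official course code), the sketch below computes Cohen's kappa, a standard chance-corrected agreement measure for two annotators. The label lists are made up for this example, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random with their own label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.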