## SEA-DAT2 course repository
### General Assembly Data Science course
Location: Seattle, WA
Class times: 6:30pm - 9:30pm
Instructor: Jim Byers
Note: Prior to the first day of class, complete the 10-15 hours of pre-work so that you are properly prepared (prework)
Tuesday | Thursday |
---|---|
Research Design and Exploratory data analysis | |
3/15: L01 Introduction to Data Science | 3/17: L02 Research design and Pandas |
3/22: L03 Statistics fundamentals | 3/24: L04 Command Line and Version Control |
3/29: L05 Fetching Data, Project Discussion Deadline | |
Foundations of data modeling | |
3/31: L06 Intro to Regression, Project Question and Dataset Due | |
4/5: L07 Intro to Classification - K nearest neighbor | 4/7: L08 Evaluating Model Fit |
4/12: L09 Classifying with Logistic Regression | 4/14: L10 Advanced model evaluation |
4/19: L11 Standardization and Clustering | 4/21: L12: First Project Presentations + bonus topics |
Data science in the real world | |
4/26: L13 Natural Language Processing | 4/28: L14 Dimensionality reduction, Draft Paper Due |
5/3: L15 Decision Trees | 5/5: L16 Ensembling, Bagging and Random Forests |
5/10: L17 Modeling with Time Series Data I, Peer Review Due | 5/12: L18 Modeling with Time Series Data II |
5/17: L19 Where to go next + bonus topics | 5/19: Final Project Presentations |
Bonus content: SVC - Support Vector Classifier, Naive Bayes Classifier, Intro to Neural Networks | |
[Homework and project submissions form](https://docs.google.com/forms/d/1vKgdubWdc-AzMTYS6f6uTFwQDop3M9uUNbilcuziTQA/viewform?usp=send_form)
Student pre-work. Before this lesson, you should already be able to:
- Define basic data types used in object-oriented programming
- Recall the Python syntax for lists, dictionaries, and functions (see the quick self-check sketch below)
- Create files and navigate directories using the command line interface (for your specific environment)
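If you'd like a quick self-check on those Python topics, here is a minimal sketch (the names and prices are illustrative, not course code) that exercises lists, dictionaries, and functions:

```python
# Quick self-check on the pre-work topics: lists, dictionaries, functions.
fruits = ['apple', 'banana', 'cherry']        # list: ordered, mutable
fruits.append('date')

prices = {'apple': 0.5, 'banana': 0.25}       # dictionary: key-value pairs
prices['cherry'] = 3.0

def total_cost(items, price_table):
    """Sum the prices of the items we know about."""
    return sum(price_table.get(item, 0) for item in items)

print(total_cost(fruits, prices))             # 3.75
```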
After this lesson, you will be able to:
- Describe the roles and components of a successful learning environment
- Define data science and the data science workflow
- Apply the data science workflow to meet your classmates
- Set up your development environment and review Python basics
Topics/Highlights
- Welcome from General Assembly!
- Course overview (slides)
- What is data science (slides)
- Data Science Quiz
- Data Science workflow (slides)
- Hands-on with the Data Science Dev Environment (Anaconda, Spyder IDE, iPython notebooks)
- Discuss the course project: requirements and example projects
- Types of data (slides) and public data sources
- GA's student gallery of projects
- Our very own Kevin McAlear's Hater News DAT project on the GA gallery
Homework:
- Due Mar 17
- Read through the course project information to familiarize yourself with the requirements and example projects. Start thinking about what question you would like to answer in your project.
- Due Tuesday March 22
- Review each concept and each line of code in these Python files: 00_python_beginner_workshop.py and 00_python_intermediate_workshop.py, and complete the coding exercises in the files. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), spend some time before Mar 22nd practicing Python. Use resources such as the documentation, web searches, and the class Slack to get help if you get stuck. Here are some additional resources:
- Introduction to Python does a great job explaining Python essentials and includes tons of example code.
- If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
- If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
- If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
- If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message)
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
Student pre-work. Before this lesson, you should already:
- Have completed the Python pre-work from the class pre-work described here
After this lesson, you will be able to:
- Define a problem and types of data
- Identify data set types
- Define the data science workflow
- Apply the data science workflow in the pandas context
- Write an IPython Notebook to import, format and clean data using the Pandas Library
Topics/Highlights
- Discuss the course project: requirements and example projects
- The why's and how's of a good question (slides)
- Types of datasets (slides)
- Write a research question with raw data (exercise)
- Data science workflow steps 2 (Acquire) and 3 (Understand the data)
- Acquire and Understand data with Pandas
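As a preview of the pandas pattern we'll use in class, here is a minimal sketch; the file name `drinks.csv` and its columns are placeholders rather than the actual class dataset:

```python
# Minimal sketch: import, inspect, and clean a dataset with pandas.
import pandas as pd

df = pd.read_csv('drinks.csv')        # 1. import (placeholder file name)
print(df.head())                      # peek at the first rows
print(df.dtypes)                      # check column types

df = df.dropna(subset=['continent'])  # 2. clean: drop rows missing a value
df['continent'] = df['continent'].str.upper()  # format a text column
```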
Homework:
- Due Tuesday March 22
- To turn in homework, attach files to a personal message in Slack to Jim Byers and Kevin Mcalear
- Review each concept and each line of code in these Python files: 00_python_beginner_workshop.py and 00_python_intermediate_workshop.py, and complete the coding exercises in the files. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), spend some time before Mar 22nd practicing Python. Use resources such as the documentation, web searches, and the class Slack to get help if you get stuck. Here are some additional resources:
- Introduction to Python does a great job explaining Python essentials and includes tons of example code.
- If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
- If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
- If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
- If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message)
Resources:
Python resources
- Want to understand Python's comprehensions? Think in Excel or SQL may be helpful if you are still confused by list comprehensions.
- My code isn't working is a great flowchart explaining how to debug Python errors.
- PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
- If you want to understand Python at a deeper level, Ned Batchelder's Loop Like A Native and Python Names and Values are excellent presentations.
Pandas resources
Name | Description |
---|---|
Official Pandas Tutorials | Wes & Company's selection of tutorials and lectures |
Julia Evans Pandas Cookbook | Great resource with examples from weather, bikes and 311 calls |
Learn Pandas Tutorials | A great series of Pandas tutorials from Dave Rojas |
Research Computing Python Data PYNBs | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas |
By the end of this lesson you will be able to:
- Use NumPy and Pandas libraries to analyze datasets using basic summary statistics: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation
- Create data visualizations - including: line graphs, box plots, and histograms- to discern characteristics and trends in a dataset
- Identify a normal distribution within a dataset using summary statistics and visualization
Topics/Highlights
- Review Homework
- 00_python_beginner_workshop.py
- 00_python_intermediate_workshop.py
- Independent Practice (02_starter_code.ipynb)
- Statistics refresher
- Basic Statistics with Pandas
- Statistics Fundamentals (slides)
- Code-along (notebook)
- Stats demo (notebook)
- Correlation
- What is correlation? (slides)
- Correlation is not causation (fun with a common misconception!)
- Visualization with Pandas (notebook)
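For reference, here is a minimal sketch of the summary statistics and correlation methods from today's lesson, run on a small made-up DataFrame:

```python
# Minimal sketch: basic summary statistics and correlation with pandas.
import pandas as pd

df = pd.DataFrame({'hours_studied': [1, 2, 3, 4, 5],
                   'exam_score':    [52, 60, 71, 75, 84]})

print(df.describe())                  # mean, std, quartiles, min/max
print(df['exam_score'].median())
print(df['hours_studied'].corr(df['exam_score']))  # close to +1 here
df.plot(kind='box')                   # visual check for outliers
```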
Homework:
- Due Thursday March 24
- Windows users: install Git Bash prior to starting the command line pre-class exercise, as you will need a bash-style command window on your Windows laptop in order to do the exercise and, later, to use git
- We recommend Git Bash instead of Git Shell (which uses Powershell).
- Mac users will probably use Terminal, or another command line application of your choice; it is already a bash-style command line interpreter, so there is no need to install anything. Git is part of Mac OS, so it is already installed and ready to use.
- Complete GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows)
- Complete the command line pre-class exercise (code). You do not need to turn in this homework
- Find one link to a resource about statistics that you find especially useful and send it in a slack message to Jim and Kevin. Note this will not be graded against the homework evaluation criteria. Jim will share these links back out on our repo so all can benefit.
Statistics Resources:
- Descriptions of statistics terms in a straightforward way, including density plots
Pandas Resources:
- To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
- If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
- This is a nice, short tutorial on pivot tables in Pandas.
- For working with geospatial data in Python, GeoPandas looks promising. This tutorial uses GeoPandas (and scikit-learn) to build a "linguistic street map" of Singapore.
Visualization Resources:
- Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
- For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
- Read Overview of Python Visualization Tools for a useful comparison of Matplotlib, Pandas, Seaborn, ggplot, Bokeh, Pygal, and Plotly.
- To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are nice one-page references, and the interactive R Graph Catalog has handy filtering capabilities.
- This PowerPoint presentation from Columbia's Data Mining class contains lots of good advice for properly using different types of visualizations.
- Harvard's Data Science course includes an excellent lecture on Visualization Goals, Data Types, and Statistical Graphs (83 minutes), for which the slides are also available.
By the end of this lesson you will be able to:
- clone a GitHub repository to your laptop
- sync your local files with your GitHub repository using git add, commit, push and pull
- use more advanced command line commands such as grep and the pipe (|)
Topics/Highlights
- Review the command line pre-class exercise (code)
- Git and GitHub (slides)
- Intermediate command line (commands)
Homework:
- Complete the command line homework assignment with the Chipotle data.
- Optional: Browse through some more example student projects, which may help to inspire your own project!
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
Command Line Resources:
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.
After this lesson you will be able to:
- Articulate what JSON, APIs and Web scraping are and how they help us fetch data
- Retrieve data from a website using the site’s APIs
- Scrape a web page to extract data
Topics/Highlights:
- Chipotle command line homework due (code)
- Fetching data through APIs
- APIs - key concepts (slides)
- Example of API documentation: The OMDb API - omdbapi.com
- Code along - Access APIs on omdbapi.com (code)
- Exercise - Retrieve US Census language stats through APIs (code)
- Census.gov language statistics page with API description
- Grabbing data using Web scraping (code)
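For a taste of both techniques, here is a minimal sketch assuming the `requests` and `beautifulsoup4` packages are installed. Note that OMDb now requires a free API key, so the `apikey` value below is a placeholder you would fill in yourself:

```python
# Minimal sketch: fetching data via an API vs. scraping raw HTML.
import requests
from bs4 import BeautifulSoup

# 1. API: structured JSON back from a documented endpoint
r = requests.get('http://www.omdbapi.com/',
                 params={'t': 'Blade Runner', 'apikey': 'YOUR_KEY'})
print(r.json().get('Year'))

# 2. Scraping: pull data out of raw HTML when there is no API
html = requests.get('http://example.com').text
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)
```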
Homework:
- If you're using Anaconda, install Seaborn by running `conda install seaborn` at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.) If you're not using Anaconda, install Seaborn using `pip`.
- Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future! This is due on Tuesday (April 5th).
API Resources:
- This Python script to query the U.S. Census API was created by a former DAT student. It's a bit more complicated than the example we used in class, it's very well commented, and it may provide a useful framework for writing your own code to query APIs.
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- The Data Science Toolkit is a collection of location-based and text-related APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
Web Scraping Resources:
- The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
- For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, a former DAT student's well-commented notebook on scraping Craigslist, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
- robotstxt.org has a concise explanation of how to write (and read) the `robots.txt` file.
- import.io and Kimono claim to allow you to scrape websites without writing any code.
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
After this lesson you will be able to:
- Identify the kinds of problems that Linear Regression can solve
- Create a linear regression predictive model
- Evaluate the error of the model's fit to the training data
Topics/Highlights:
- Linear regression (notebook)
- Capital Bikeshare dataset used in a Kaggle competition
- Data dictionary
- Why we should examine data well before building a model: Anscombes_Quartet (notebook)
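As a quick reference, here is a minimal sketch of fitting and evaluating a linear regression in scikit-learn, using made-up numbers in place of the bikeshare data:

```python
# Minimal sketch: fit a linear regression and measure training error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[5], [10], [15], [20], [25]])   # e.g., temperature
y = np.array([100, 180, 270, 350, 460])       # e.g., rental counts

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)          # the learned line
preds = model.predict(X)
print(np.sqrt(mean_squared_error(y, preds)))  # RMSE on the training data
```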
Homework:
- Complete the homework assignment with the Yelp data. This is due on Thursday (4/7).
Linear Regression Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos or read Kevin Markham's quick reference guide to the key points in that chapter.
- This introduction to linear regression is more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- Setosa has an interactive visualization of linear regression.
After this lesson you will be able to:
- Identify the steps to build a predictive model in scikit-learn
- Create a k nearest neighbors (knn) predictive model
- Describe the difference between a supervised and unsupervised model
Topics/Highlights:
- K-nearest neighbors (KNN) and scikit-learn (notebook)
- Exercise with NBA player data (notebook, data, data dictionary)
- Machine learning types and terms (slides)
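The basic scikit-learn pattern (import, instantiate, fit, predict) looks like this minimal sketch using the built-in iris dataset:

```python
# Minimal sketch: KNN classification with the scikit-learn pattern.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)   # 1. instantiate with K=5
knn.fit(X, y)                               # 2. fit on features and labels
print(knn.predict([[3, 5, 4, 2]]))          # 3. predict a new observation
```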
Homework:
- The homework assignment with the Yelp data is due on Thursday (4/7).
- Reading assignment on the bias-variance tradeoff
- Read Kevin Markham's introduction to reproducibility, read Jeff Leek's guide to creating a reproducible analysis, and watch this related Colbert Report video (8 minutes).
- Optional: Quick Pandas exercise (notebook). Complete this exercise to sharpen your understanding of dataframes.
- Work on your project... your first project presentation is in less than three weeks!
KNN Resources:
- (notebook) An example of the steps one would go through using "human learning" to come up with a rule to classify new iris observations based on the Iris dataset. Contains a refresher on many Pandas techniques such as groupby and visualization.
- For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
- KNN supports distance metrics other than Euclidean distance, such as Mahalanobis distance, which takes the scale of the data into account.
- A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
- This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
- Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
Seaborn Resources:
- The official Seaborn website has a series of detailed tutorials and an example gallery.
- Data visualization with Seaborn is a quick tour of some of the popular types of Seaborn plots.
- Visualizing Google Forms Data with Seaborn and How to Create NBA Shot Charts in Python are both good examples of Seaborn usage on real-world data.
Topics/Highlights
- Discuss the reading assignment on the bias-variance tradeoff
- Exploring the bias-variance tradeoff (notebook)
- Model evaluation using train/test split (notebook)
- Exploring the scikit-learn documentation: module reference, user guide, class and function documentation
- Reproducibility
- Discuss assigned readings: introduction, Colbert Report video, cabs article, Tweet, creating a reproducible analysis
- Examples: Classic rock, student project 1, student project 2
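Referring back to the train/test split item above, here is a minimal sketch of the idea, again using the built-in iris data (the `model_selection` module is the import location in recent scikit-learn versions):

```python
# Minimal sketch: hold out data for an honest estimate of accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                           # train on one portion
print(accuracy_score(y_test, knn.predict(X_test)))  # evaluate on the rest
```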
Model Evaluation Resources:
- For a recap of some of the key points from today's lesson, watch Comparing machine learning models in scikit-learn (27 minutes).
- For another explanation of training error versus testing error, the bias-variance tradeoff, and train/test split (also known as the "validation set approach"), watch Hastie and Tibshirani's video on estimating prediction error (12 minutes, starting at 2:34).
- Caltech's Learning From Data course includes a fantastic video on visualizing bias and variance (15 minutes).
- Random Test/Train Split is Not Always Enough explains why random train/test split may not be a suitable model evaluation procedure if your data has a significant time element.
Reproducibility Resources:
- What We've Learned About Sharing Our Data Analysis includes tips from BuzzFeed News about how to publish a reproducible analysis.
- Software development skills for data scientists discusses the importance of writing functions and proper code comments (among other skills), which are highly useful for creating a reproducible analysis.
- Data science done well looks easy - and that is a big problem for data scientists explains how a reproducible analysis demonstrates all of the work that goes into proper data science.
After this lesson you will be able to:
- Describe the kind of problem Logistic regression can solve
- Create a logistic regression model
- Describe the elements of a Confusion Matrix
Topics/Highlights:
- Logistic regression (notebook)
- Exercise with Titanic data (notebook, data, data dictionary)
- Confusion matrix (slides, notebook)
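As a quick reference, here is a minimal sketch of fitting a logistic regression and printing its confusion matrix on tiny made-up data:

```python
# Minimal sketch: logistic regression plus a confusion matrix.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X = [[1], [2], [3], [4], [5], [6], [7], [8]]   # single feature
y = [0, 0, 0, 1, 0, 1, 1, 1]                    # binary outcome

logreg = LogisticRegression()
logreg.fit(X, y)
y_pred = logreg.predict(X)

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y, y_pred))
```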
Homework:
- Work through the code samples in the "Confusion matrix of Titanic predictions" section of the 09_confusion_matrix.ipynb notebook to see an example of turning a multi-value feature into binary (dummy) features
- If you aren't yet comfortable with all of the confusion matrix terminology, watch Rahul Patwari's videos on Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes).
- Video/reading assignment on ROC curves and AUC
- Video/reading assignment on cross-validation
Logistic Regression Resources:
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
- For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
- Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
Confusion Matrix Resources:
- Kevin Markham's simple guide to confusion matrix terminology may be useful to you as a reference.
- This blog post about Amazon Machine Learning contains a neat graphic showing how classification threshold affects different evaluation metrics.
- This notebook (from another DAT course) explains how to calculate "expected value" from a confusion matrix by treating it as a cost-benefit matrix.
After this lesson you will be able to:
- Prepare your data by overcoming issues such as null values
- Measure the accuracy of Logistic Regression with ROC curves and AUC
- Use cross-validation to measure model accuracy more effectively than with train/test split
Topics/Highlights:
- Data preparation (notebook)
- Handling missing values
- Handling categorical features (review)
- Advanced Model evaluation
- ROC curves and AUC
- Discuss the video/reading assignment
- Exercise: drawing an ROC curve (slides)
- Return to the main notebook
- Cross-validation
- Discuss the video/reading assignment and related notebook
- Return to the main notebook
- Exercise with bank marketing data (notebook, data, data dictionary)
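For reference, here is a minimal sketch combining the two ideas from today's lesson, computing a cross-validated AUC; it uses scikit-learn's built-in breast cancer dataset as a stand-in for the class data:

```python
# Minimal sketch: 10-fold cross-validated AUC, more reliable than a
# single train/test split.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
logreg = LogisticRegression(max_iter=5000)  # extra iterations to converge

scores = cross_val_score(logreg, X, y, cv=10, scoring='roc_auc')
print(scores.mean())
```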
Homework:
- Finalize your First Project Presentations! Your first project presentation is next Thursday April 21st.
ROC Resources:
- Rahul Patwari has a great video on ROC Curves (12 minutes).
- An introduction to ROC analysis is a very readable paper on the topic.
- ROC curves can be used across a wide variety of applications, such as comparing different feature sets for detecting fraudulent Skype users, and comparing different classifiers on a number of popular datasets.
Cross-Validation Resources:
- For more on cross-validation, read section 5.1 of An Introduction to Statistical Learning (11 pages) or watch the related videos: K-fold and leave-one-out cross-validation (14 minutes), cross-validation the right and wrong ways (10 minutes).
- If you want to understand the different variations of cross-validation, this paper examines and compares them in detail.
- To learn how to use GridSearchCV and RandomizedSearchCV for parameter tuning, watch How to find the best model parameters in scikit-learn (28 minutes) or read the associated notebook.
Other Resources:
- scikit-learn has extensive documentation on model evaluation.
- Counterfactual evaluation of machine learning models (45 minutes) is an excellent talk about the sophisticated way in which Stripe evaluates its fraud detection model. (These are the associated slides.)
- Visualizing Machine Learning Thresholds to Make Better Business Decisions demonstrates how visualizing precision, recall, and "queue rate" at different thresholds can help you to maximize the business value of your classifier.
By the end of this lesson you will be able to:
- Standardize feature values
- Cluster using K-means
- Compare "how good" the clustering models are
Topics/Highlights
- Review solutions to exercise with bank marketing data (notebook, data, data dictionary)
- Advanced scikit-learn (notebook, dataset description)
- StandardScaler: standardizing features
- Pipeline: chaining steps
- Clustering (slides, notebook, data)
- K-means: documentation, visualization 1, visualization 2
- My clustering of the colors in an image to posterize it: a loop generated clusterings of 1 to 256 clusters, which were then combined into an animated gif. Fun! (repository, gif)
- DBSCAN: documentation, visualization
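For reference, here is a minimal sketch chaining StandardScaler and K-means in a Pipeline, on made-up two-feature data where the features have very different scales:

```python
# Minimal sketch: standardize features, then cluster with K-means.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

X = [[1.0, 200], [1.2, 180], [8.0, 10], [7.5, 30], [8.2, 20], [1.1, 210]]

pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10))
pipe.fit(X)
print(pipe.named_steps['kmeans'].labels_)   # cluster assignment per row
```

Without the scaling step, the second feature would dominate the distance calculation; standardizing first puts both features on an equal footing.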
Homework:
- Prepare your initial project presentation for Thursday!!
- By Tuesday April 26, run the "DBSCAN Clustering" part of the 11_clustering.ipynb notebook to understand how to use the DBSCAN estimator to build a clustering model.
scikit-learn Resources:
- This is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use.
- Practical Data Science in Python is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
- To learn how to use GridSearchCV and RandomizedSearchCV for parameter tuning, watch How to find the best model parameters in scikit-learn (28 minutes) or read the associated notebook.
- Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of tutorials and examples, a library of machine learning tools and extensions, a new book, and a semi-active blog.
- scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching functions and asking questions.
- If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!
Clustering Resources:
- For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides.
- scikit-learn's user guide compares many different types of clustering.
- This PowerPoint presentation from Columbia's Data Mining class provides a good introduction to clustering, including hierarchical clustering and alternative distance metrics.
- An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes) and hierarchical clustering (15 minutes).
- This is an excellent interactive visualization of hierarchical clustering.
- This is a nice animated explanation of mean shift clustering.
- The K-modes algorithm can be used for clustering datasets of categorical features without converting them to numerical values. Here is a Python implementation.
- Here are some fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
By the end of this lesson you will be able to:
- Apply the NLP techniques of Vectorization and Tokenization to text to create features
- Use stop word removal and other techniques to increase the accuracy of your models using these features
- Create features using Stemming and Lemmatization
Topics/Highlights
- Natural language processing (notebook)
- Vectorization/Tokenization
- Stopword Removal
- Other CountVectorizer options
- Intro to TextBlob
- Stemming and Lemmatization
- NLP Exercise (notebook)
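For reference, here is a minimal sketch of tokenization/vectorization with English stop word removal; the sentences are made up, and `get_feature_names_out` assumes a recent scikit-learn:

```python
# Minimal sketch: turn raw text into a document-term matrix of counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The cab arrived late', 'The food was great, the cab was not']

vect = CountVectorizer(stop_words='english')  # drops 'the', 'was', ...
dtm = vect.fit_transform(docs)                # document-term matrix

print(vect.get_feature_names_out())
print(dtm.toarray())                          # token counts per document
```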
Homework:
- Your draft paper is due on Thursday (4/28)! Please submit a link to your project repository (with paper, code, data, and visualizations) before class
NLP Resources:
- If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
- This slide deck defines many of the key NLP terms.
- Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
- A Smattering of NLP in Python provides a nice overview of NLTK
- spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
- If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
- When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
- Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
- Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
- Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
By the end of this lesson you will be able to:
- Apply TF-IDF to text (Natural Language Processing)
- Reduce dimensions using Principal Component Analysis (dimensionality reduction)
Topics/Highlights
- Natural Language Processing continued (notebook)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Using TF-IDF to Summarize a Yelp Review
- Sentiment Analysis
- NLP Exercise continued with TF-IDF (notebook)
- Dimensionality reduction
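For reference, here is a minimal sketch of TF-IDF weighting on made-up sentences; words that are frequent in one document but rare across the corpus receive the highest weights:

```python
# Minimal sketch: TF-IDF weighting of a tiny corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['great tacos great salsa',
        'terrible tacos',
        'the salsa was fine']

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)
print(vect.get_feature_names_out())
print(tfidf.toarray().round(2))   # weights per term, per document
```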
NLP Resources:
- See the NLP resources listed under the previous lesson (L13 Natural Language Processing).
By the end of this lesson you will be able to:
- Create a Regression tree
- Graph and interpret the decision tree
Topics/Highlights
- Decision trees (notebook)
- Part 1: Regression trees
- Exercise with Capital Bikeshare data (notebook, data, data dictionary)
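For reference, here is a minimal sketch of a regression tree in scikit-learn, using made-up bikeshare-like numbers:

```python
# Minimal sketch: a shallow regression tree.
from sklearn.tree import DecisionTreeRegressor

X = [[5], [10], [15], [20], [25], [30]]   # e.g., temperature
y = [120, 150, 310, 400, 420, 380]        # e.g., rental counts

# Limiting max_depth keeps the tree interpretable and less overfit.
tree = DecisionTreeRegressor(max_depth=2, random_state=1)
tree.fit(X, y)
print(tree.predict([[18]]))
```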
Homework:
- Read the "Wisdom of the crowds" section from MLWave's post on Human Ensemble Learning.
- Optional: Read the abstract from Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?, as well as Kaggle CTO Ben Hamner's comment about the paper, paying attention to the mentions of "Random Forests".
Resources:
- scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
- For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
- If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
- The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
- Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
- Finish decision trees lesson (notebook)
- Ensembling, Bagging and Random Forests (notebook)
- Major League Baseball player data from 1986-87
- Data dictionary (see page 7)
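For reference, here is a minimal sketch of a Random Forest regressor on the same kind of made-up data; each tree is fit to a bootstrap sample of the rows, and the trees' predictions are averaged:

```python
# Minimal sketch: an ensemble of bagged trees via a Random Forest.
from sklearn.ensemble import RandomForestRegressor

X = [[5], [10], [15], [20], [25], [30]]
y = [120, 150, 310, 400, 420, 380]

rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(X, y)
print(rf.predict([[18]]))   # average of 100 trees' predictions
```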
Resources:
- scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
- MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
- Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
- Interpretable vs Powerful Predictive Models: Why We Need Them Both is a short post on how the tactics useful in a Kaggle competition are not always useful in the real world.
- Not Even the People Who Write Algorithms Really Know How They Work argues that the decreased interpretability of state-of-the-art machine learning models has a negative impact on society.
- For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
- Large Scale Decision Forests: Lessons Learned is an excellent post from Sift Science about their custom implementation of Random Forests.
- Unboxing the Random Forest Classifier describes a way to interpret the inner workings of Random Forests beyond just feature importances.
- Understanding Random Forests: From Theory to Practice is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.
- Time series intro (slides)
- Exercises (notebook)
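For reference, here is a minimal sketch of pandas time series basics on made-up data; the `resample`/`rolling` methods assume a reasonably recent pandas:

```python
# Minimal sketch: a DatetimeIndex enables resampling and rolling windows.
import numpy as np
import pandas as pd

idx = pd.date_range('2016-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90).cumsum(), index=idx)

print(ts.resample('W').mean().head())      # weekly averages
print(ts.rolling(window=7).mean().tail())  # 7-day rolling mean
```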
SVC resources
- For a more in-depth understanding of Support Vector Machines and SVC, read Chapter 9 of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- SVC videos by the authors of An Introduction to Statistical Learning can be found here.
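For reference, here is a minimal sketch of a Support Vector Classifier on the built-in iris dataset:

```python
# Minimal sketch: SVC with an RBF kernel.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', C=1.0)   # C controls how soft the margin is
clf.fit(X, y)
print(clf.predict([[3, 5, 4, 2]]))
```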
- Conditional probability and Bayes' theorem
- Slides (adapted from Visualizing Bayes' theorem)
- Applying Bayes' theorem to iris classification (notebook)
- Naive Bayes classification
- Applying Naive Bayes to text data in scikit-learn (notebook)
- CountVectorizer documentation
- SMS messages: data, data dictionary
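For reference, here is a minimal sketch of Naive Bayes text classification on made-up messages (a stand-in for the SMS data):

```python
# Minimal sketch: Naive Bayes spam classification on toy messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = ['win cash now', 'free prize winner',
              'see you at lunch', 'meeting at noon']
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vect = CountVectorizer()
X_train = vect.fit_transform(train_msgs)

nb = MultinomialNB()
nb.fit(X_train, labels)
print(nb.predict(vect.transform(['free cash prize'])))   # likely spam
```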
Resources:
- Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
- For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
- For an intuitive explanation of Naive Bayes classification, read this post on airport security.
- For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
- When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
- These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
- Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
- Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
Neural Network resources
- For more background on Artificial Neural Networks, view this series, which describes the logic behind ANNs step by step from a low-level point of view:
- Part 1 https://www.youtube.com/watch?v=bxe2T-V8XRs
- Part 2 https://www.youtube.com/watch?v=UJwK6jAStmg
- Part 3 https://www.youtube.com/watch?v=5u0jaA3qAGk
- Part 4 https://www.youtube.com/watch?v=GlcnxUlrtek
- Part 5 https://www.youtube.com/watch?v=pHMzNW8Agq4
- Part 6 https://www.youtube.com/watch?v=9KM9Td6RVgQ
- Part 7 https://www.youtube.com/watch?v=S4ZUwgesjS8
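If you want to experiment with the core idea from the videos, here is a minimal sketch of a single artificial neuron in NumPy: a weighted sum of inputs passed through a sigmoid (all numbers are made up):

```python
# Minimal sketch: the forward pass of one artificial neuron.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])    # inputs
w = np.array([0.4, -0.6, 0.2])   # learned weights
b = 0.05                         # bias

activation = sigmoid(np.dot(w, x) + b)
print(activation)                # a value between 0 and 1
```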