Learning Data Science

December 5, 2017

Background

BS Accounting
Accountant
MAcc
Marketing Analyst
Product Entry Lead
Financial Analyst
Data Scientist

JHU Data Science Specialization

https://www.coursera.org/specializations/jhu-data-science

The Data Scientist Toolbox
R Programming
Getting and Cleaning Data
Exploratory Data Analysis
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Developing Data Products
Capstone

Capstone Project

Objective: Create a word prediction app similar to SwiftKey on mobile phones

Data Sources: Twitter, news stories, blogs

Data Processing: Tokenize the text; remove stopwords, punctuation, numbers, and symbols; and stem the words. Split into n-grams (n = 1, 2, 3, 4) and sort by frequency. Unigrams: keep the top 5k. Bigrams, trigrams, and quadrigrams: keep the top 5 million.
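
As a rough illustration of the processing steps above, here is a minimal Python sketch. It is not the capstone's actual code. The tiny stopword set and the function names (tokenize, ngram_counts, top_ngrams) are assumptions made for the example, and stemming is left out because the standard library has no stemmer.

```python
import re
from collections import Counter

# Illustrative stopword list only; the list used in the project is not given.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords.

    The regex discards punctuation, numbers, and symbols in one pass.
    Stemming is intentionally omitted in this sketch.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) across all lines."""
    counts = Counter()
    for line in lines:
        toks = tokenize(line)
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def top_ngrams(lines, n, k):
    """Return the k most frequent n-grams with their counts."""
    return ngram_counts(lines, n).most_common(k)

if __name__ == "__main__":
    sample = ["The cat sat on the mat.", "The cat sat on 2 chairs!"]
    print(tokenize(sample[1]))
    print(top_ngrams(sample, 2, 3))
```

In this sketch, top_ngrams(lines, 1, 5000) would correspond to the unigram cutoff, with a larger k for the higher-order n-grams.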

Data Modelling: Process the input (a word or phrase) the same way as the dataset. Take the last 3 words of the phrase and find the most frequent quadrigram that starts with those 3 words; its final word is the prediction. If there is no match, back off to trigrams (last 2 words), then bigrams (last word). If there is still no match, take a word from the unigram list.
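
The back-off lookup described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the app's implementation: the names tokens_of, build_model, and predict_next are hypothetical, stopword removal, stemming, and the top-k cutoffs are skipped, and the unigram fallback is assumed to be the single most frequent word.

```python
import re
from collections import Counter, defaultdict

def tokens_of(text):
    """Lowercase and keep alphabetic tokens (no stopword removal or stemming here)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def build_model(lines, max_n=4):
    """Build a lookup from context (1 to max_n-1 words) to the most frequent next word.

    Returns (best_next, top_unigram). Contexts of different lengths are
    tuples of different lengths, so they can share one dictionary.
    """
    follow = defaultdict(Counter)
    unigrams = Counter()
    for line in lines:
        toks = tokens_of(line)
        unigrams.update(toks)
        for n in range(2, max_n + 1):
            for i in range(len(toks) - n + 1):
                context = tuple(toks[i:i + n - 1])
                follow[context][toks[i + n - 1]] += 1
    best_next = {ctx: c.most_common(1)[0][0] for ctx, c in follow.items()}
    # Assumption: fall back to the single most frequent word.
    top_unigram = unigrams.most_common(1)[0][0] if unigrams else None
    return best_next, top_unigram

def predict_next(phrase, best_next, top_unigram):
    """Try the last 3 words (quadrigram), then 2 (trigram), then 1 (bigram)."""
    toks = tokens_of(phrase)
    for k in (3, 2, 1):
        if len(toks) >= k:
            ctx = tuple(toks[-k:])
            if ctx in best_next:
                return best_next[ctx]
    return top_unigram

if __name__ == "__main__":
    corpus = ["i want to go home", "i want to go home now",
              "you want to eat", "to go away"]
    best_next, top_unigram = build_model(corpus)
    print(predict_next("I want to go", best_next, top_unigram))
```

Precomputing the best next word per context keeps each prediction to at most three dictionary lookups, which suits an interactive app.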

https://debmartin06.shinyapps.io/Capstone_Word_Predictor/