The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.
Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.
We’ll set up caching for this notebook, given how computationally expensive some of the code we will write can get:
# caching up the notebook
knitr::opts_chunk$set(cache = TRUE)
# clear up the environment
rm(list = ls())
We will need some packages from the tidyverse and tidyquant ecosystems, such as dplyr and lubridate, to explore Scotty’s data. Let’s use the pacman package to load the packages (and automatically install any you don’t already have):
# load (or automatically install first) pacman package
if (!require("pacman")) install.packages("pacman")
# load (or automatically install first) the packages
p_load(tidyverse, tidyquant, tm, data.table)
In this workshop, we will get some hands-on experience with machine learning projects. You will need to demonstrate your knowledge of machine learning to solve real-life prediction problems.
In the capstone project, you are expected to build a prediction model and write a clear report on how you built it. You can pick one of the four problems we provide (explained in detail in sections 2.1, 2.2, 3.1, and 4.1). You will need to recall some of your knowledge in:
You can download data here.
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” on its order buttons. In this project, we are going to help them solve some forecasting and classification problems.
Scotty has donated real-time order transaction data, which is available in data/Scotty.csv:
# a little glimpse of Scotty's data
read.csv("data/Scotty.csv") %>% glimpse()## Observations: 366,784
## Variables: 10
## $ timeStamp <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 201...
## $ driverID <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ riderID <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760...
## $ orderStatus <fct> nodrivers, nodrivers, nodrivers, nodrivers, n...
## $ confirmedTimeSec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ srcGeohash <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ srcLong <dbl> -60.87779, 0.00000, 0.00000, 0.00000, 0.00000...
## $ srcLat <dbl> -23.04455, 0.00000, 0.00000, 0.00000, 0.00000...
## $ destLong <dbl> 40.98718, 40.99750, 40.99750, 41.03570, 41.04...
## $ destLat <dbl> 29.02819, 28.85056, 28.85056, 29.06256, 29.00...
There are two options for the project:
It’s almost 2018 and we need to prepare a forecast model to help us get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past experience to prepare for December’s demand.
Fortunately, you already know that time series analysis is more than enough to help us forecast! Based on the data/Scotty.csv data (up to Friday, December 1st 2017), make a time series analysis and forecast report that will be evaluated on the next 5 business days. The report must provide a proper hourly analysis and forecast of Scotty’s demand.
The report should have a clear explanation of:
The data in data/Scotty.csv contains raw, real-time transaction information. Moreover, it also includes some "cancelled" and duplicated "nodrivers" transactions, which do not represent real demand and should not be included in the time series analysis and forecast model. Based on this condition, make a report on how you preprocess the data into a proper time series that represents Scotty’s true demand.
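As a starting point, here is a minimal preprocessing sketch using dplyr and lubridate. The column and status names (timeStamp, riderID, orderStatus, "cancelled", "nodrivers") come from the glimpse above; the deduplication rule and the demandHourly object name are illustrative assumptions only, not the required solution:
# a minimal preprocessing sketch; an illustration, not the required solution
library(tidyverse)
library(lubridate)
demandHourly <- read.csv("data/Scotty.csv", stringsAsFactors = FALSE) %>%
  filter(orderStatus != "cancelled") %>%                 # drop cancelled orders
  mutate(timeStamp = ymd_hms(timeStamp),
         hour = floor_date(timeStamp, unit = "hour")) %>%
  distinct(riderID, hour) %>%                            # collapse duplicated "nodrivers" retries per rider per hour
  group_by(hour) %>%
  summarise(nOrder = n()) %>%                            # hourly demand
  ungroup()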
Rubric:
As you recall from the Time Series and Forecasting course, time series data have trend and seasonality components. The data in data/Scotty.csv is not sampled at a regular frequency, but if we aggregate it to hourly data points, the demand for Scotty’s service should, theoretically, show trend and seasonality components.
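For a quick sanity check, assuming the hourly demandHourly object from the earlier sketch (an assumption, not something given in the data), a classical decomposition with a daily frequency can make those components visible:
# a quick decomposition check, assuming `demandHourly` from the earlier sketch;
# frequency = 24 treats one day as the seasonal period of the hourly series
demandTs <- ts(demandHourly$nOrder, frequency = 24)
demandTs %>%
  decompose() %>%
  plot()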
Rubric:
Once you are confident with your time series analysis, build a forecast model based on your intuition. Note that the model should come with a proper explanation of why it suits our case. The forecasted values should be stored in the data/submissionForecast.csv template:
# some samples of the submission template
read.csv("data/submissionForecast.csv") %>% head()## timeStamp nOrder
## 1 2017-12-04 00:00:00 NA
## 2 2017-12-04 01:00:00 NA
## 3 2017-12-04 02:00:00 NA
## 4 2017-12-04 03:00:00 NA
## 5 2017-12-04 04:00:00 NA
## 6 2017-12-04 05:00:00 NA
and name your submission file with the format YOURNAME_SCOTTY_FORECAST.csv. This submission will be checked against the actual data to calculate the Mean Absolute Error (MAE).
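For reference, filling and writing the template might look like the sketch below; forecastValues is a hypothetical numeric vector of your hourly forecasts, not something provided with the data:
# a minimal sketch of filling the submission template; `forecastValues` is a
# hypothetical vector of hourly forecasts produced by your model
submission <- read.csv("data/submissionForecast.csv")
submission$nOrder <- forecastValues
write.csv(submission, "YOURNAME_SCOTTY_FORECAST.csv", row.names = FALSE)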
Rubric:
Since the data will be aggregated to hourly intervals, we are likely to face a complex seasonality problem. For example, half-hourly data has daily and weekly seasonalities, and may even include monthly and yearly seasonalities if the data is long enough. Unfortunately, classical decomposition cannot handle multiple seasonalities.
To solve the multiple seasonality problem, you might want to consider the msts() function from the forecast package to store the data as a multiple-frequency time series. Here is a quick example on elecdemand, the half-hourly data from the fpp2 package (we only use the "Demand" column):
# load forecast and fpp2 package
p_load(forecast, fpp2)
# convert preloaded elecdemand data as msts object
mstsExample <- msts(elecdemand[, "Demand"], seasonal.periods = c(48, 48 * 7))
# inspect the class
class(mstsExample)
## [1] "msts" "ts"
In this example, \(48\) represents the frequency within a natural period of half-hourly data: a day. If the data is long enough, as in our example, it can also cover another natural period, a week, which is represented here by \(48 \cdot 7 = 336\).
Using the msts object, you can use the mstl() function from the forecast package to decompose the multiple seasonalities:
# multiple seasonality decomposition
mstsExample %>%
head(48 * 7 * 5) %>% # first 5 week example
mstl() %>%
  autoplot()
As you can see from the plot, elecdemand shows clear seasonality patterns for each natural period: daily and weekly seasonalities.
Consequently, you might want to consider multiple seasonality forecast models, such as tbats(), or auto.arima() with Fourier terms as regressors, both of which are supported in the forecast package.
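As a rough sketch of both approaches on the msts object above (the forecast horizon and the number of Fourier terms K are illustrative assumptions, not recommendations):
# a minimal sketch of two multiple-seasonality models; the horizon and K values
# are illustrative assumptions only
fitTbats <- tbats(mstsExample)                       # TBATS handles multiple seasonal periods automatically
fcTbats <- forecast(fitTbats, h = 48 * 7)            # forecast one week ahead
# ARIMA with Fourier terms as external regressors
fitArima <- auto.arima(mstsExample, xreg = fourier(mstsExample, K = c(10, 10)),
                       seasonal = FALSE)
fcArima <- forecast(fitArima, xreg = fourier(mstsExample, K = c(10, 10), h = 48 * 7))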
If you are interested in multiple seasonality problems, check out the more in-depth analysis in one of the chapters of “Forecasting: Principles and Practice”, written by the same author as the forecast package.
Scotty turns out to be a very popular service in Turkey! This is a good thing, but it also brings a new problem: “There is no driver”. Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places.
Fortunately, we know that we can use a classification model to predict which regions and times are risky enough to have this “nodrivers” problem. Based on the data/Scotty.csv data (up to Friday, December 1st 2017), make a prediction model report that will be evaluated on the next 5 business days. The report must provide a proper analysis and prediction, by region and hour, of Scotty’s driver coverage status (Sufficient or Insufficient).
The report should have a clear explanation of:
The data in data/Scotty.csv contains raw, real-time transaction information.
Moreover, it also includes some "cancelled" and duplicated "nodrivers" transactions, which do not represent real demand (and hence the true coverage status) and should not be included in the prediction model. For example, if the statuses of one riderID’s transactions within an hour are nodrivers, nodrivers, completed, that would indicate a single demand. Based on this condition, make a report on how you preprocess the data into a form that represents Scotty’s true coverage status.
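As a hedged sketch of one possible way to derive the hourly coverage labels (the deduplication and labeling rules, as well as the coverageHourly name, are assumptions to be refined in your own report):
# a minimal labeling sketch; column names follow the glimpse above, while the
# deduplication and labeling rules are illustrative assumptions only
library(tidyverse)
library(lubridate)
coverageHourly <- read.csv("data/Scotty.csv", stringsAsFactors = FALSE) %>%
  filter(orderStatus != "cancelled") %>%
  mutate(timeStamp = ymd_hms(timeStamp),
         hour = floor_date(timeStamp, unit = "hour")) %>%
  distinct(riderID, srcGeohash, hour, orderStatus) %>%   # collapse repeated "nodrivers" retries
  group_by(srcGeohash, hour) %>%
  summarise(coverage = ifelse(any(orderStatus == "nodrivers"),
                              "Insufficient", "Sufficient")) %>%
  ungroup()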
Rubric:
Tips: Check out data/submissionClassification.csv to see the categories of the coverage status.
As you recall from the Classification 1 & 2 courses, some exploratory data analysis, such as analyzing the distribution of our target variable, would be very helpful in building the prediction model that follows.
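For instance, assuming the coverageHourly object from the earlier sketch, a quick look at the class balance per region could be (an illustration only):
# a quick class-balance check, assuming `coverageHourly` from the earlier sketch
coverageHourly %>%
  ggplot(aes(x = srcGeohash, fill = coverage)) +
  geom_bar(position = "fill") +
  labs(y = "proportion", title = "Coverage status proportion per region")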
Rubric:
Based on your prior exploratory data analysis, build a strong prediction model that will help us classify coverage status in the future. Note that the model should come with a proper explanation of why it suits our case. The predicted values should be stored in the data/submissionClassification.csv template:
# some samples of the submission template
# the values in the coverage column are just an example
read.csv("data/submissionClassification.csv") %>% head()
## srcGeohash timeStamp coverage
## 1 7 2017-12-04 00:00:00 Sufficient
## 2 7 2017-12-04 01:00:00 Insufficient
## 3 7 2017-12-04 02:00:00 Sufficient
## 4 7 2017-12-04 03:00:00 Insufficient
## 5 7 2017-12-04 04:00:00 Sufficient
## 6 7 2017-12-04 05:00:00 Insufficient
and name your submission file with the format YOURNAME_SCOTTY_CLASSIFICATION.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.
Rubric:
This SMS dataset was collected by Team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message. In this project, we are going to build a classification model to create a spam classifier.
The dataset is available to load in data/sms.csv:
read.csv("data/sms.csv") %>% glimpse()## Observations: 1,805
## Variables: 3
## $ STATUS <fct> ham, ham, spam, ham, spam, spam, spam, spam, spam, spa...
## $ CONTAIN <fct> Sy wa ga sampe2 soalnya, Mbak tiara sy da di cyber yaa...
## $ DATE <fct> 2018-02-28 11:43:00, 2018-02-28 11:43:00, 2018-02-28 0...
Use any approach you find fitting for the problem and create a model that can correctly classify SMS messages into spam/ham.
The drastic decrease in the use of SMS has turned the inbox of this once-very-popular cellphone feature into a junk inbox. But who knows? Someone might contact you the old-school way via SMS, and you might even skip it because the amount of spam in your inbox is just too much. Imagine we are building a spam classifier for SMS-based messages. The SMS messages classified as spam were collected through users’ reports of unwanted SMS.
Make a report of the approach you used in building the classification model. The report should have a clear explanation of:
The data in data/sms.csv contains 3 attributes: DATE, CONTAIN, and STATUS. The label of spam (unwanted SMS) or ham is stored under the STATUS variable. DATE shows the timestamp at which the SMS was received, and CONTAIN holds the content of the message. Show how you feature-engineered the attributes you are going to use in your model.
Rubric:
In building the model, we should consider multiple parameters to come up with the best model. Benchmark the multiple models/approaches you came up with and show your train of thought in choosing the best one.
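A minimal benchmarking sketch with a hold-out split is shown below; dtmTrain (a data frame of features derived from the document-term matrix) and labelTrain (the spam/ham factor) are hypothetical objects from your own preprocessing, and Naive Bayes is just one of the candidates you might compare:
# a minimal hold-out benchmarking sketch; `dtmTrain` and `labelTrain` are
# hypothetical objects produced by your own preprocessing step
library(e1071)
library(caret)
set.seed(100)
idx <- sample(nrow(dtmTrain), size = 0.8 * nrow(dtmTrain))
# candidate 1: Naive Bayes with Laplace smoothing
nbModel <- naiveBayes(x = dtmTrain[idx, ], y = labelTrain[idx], laplace = 1)
nbPred <- predict(nbModel, newdata = dtmTrain[-idx, ])
# compare candidates on the hold-out split
confusionMatrix(nbPred, labelTrain[-idx])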
Rubric:
Based on the best model picked in the previous step, use the model to predict the SMS messages stored in data/submissionSMS.csv. The prediction should contain 2 classes: spam or ham. The predictions should be filled in under the STATUS column of the same file:
read.csv('data/submissionSMS.csv') %>% head()
## DATE
## 1 2018-04-25 14:20:00
## 2 2018-04-25 14:13:00
## 3 2018-04-25 12:26:00
## 4 2018-04-25 12:08:00
## 5 2018-04-24 12:41:00
## 6 2018-04-24 11:42:00
## CONTAIN
## 1 ELITE RELOAD PULSA:Kami ingin menawarkan anda menjadi agen pulsa all operator. harga: V5:4750,V10:8750,V20:18750.Minat invite BBM: D995290A & WA: 085656345887.
## 2 iRing keren cuman buat km, Feni Rose-TERCYDUK,Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!
## 3 Sahabat, registrasi kartu SIM prabayar Anda SEKARANG. Ketik ULANG#NIK#No.KK# SMS ke 4444 atau klik http://im3.do/registrasi s.d. 30 April 2018. Terima kasih.
## 4 sians text ULANG#NIK#No.KK# to 4444.Foreigners bring Passport
## 5 24/04/2018 12:41 Pada No. Rek 378088918 ada dana keluar sebesar Rp 100.000,00. Saldo akhir: Rp 216.309,00.Berita: BILL PAYMENT GOPAY CUST NO 39152454823092
## 6 Sisa kuota GAK AKAN HANGUS & dpt disimpan hg 100GB SETAHUN dgn DATAROLLOVER dr IM3Ooredoo!Gratis dgn pkt Internet bulanan IM3 Ooredoo http://im3.do/dr0424 BMI1
## STATUS
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
and name your submission file with the format YOURNAME_SPAM_CLASSIFICATION.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.
Rubric:
To help you create the model, recall the text mining material from Classification 2. If you’re planning to do Naive Bayes modelling (of course, you are not limited to this approach!) like the one presented in the Classification 2 class, use the tm library for text preprocessing. Here are the steps you might want to pay attention to when doing word tokenization:
library(tm)
#Create corpus object
corpus <- VCorpus(VectorSource(data))
#Inspecting corpus
corpus[[1]]$content## [1] "Dorothy lived in the midst of the great Kansas prairies, with Uncle\nHenry, who was a farmer, and Aunt Em, who was the farmer's wife. Their\nhouse was small, for the lumber to build it had to be carried by wagon\nmany miles. There were four walls, a floor and a roof, which made one\nroom; and this room contained a rusty looking cookstove, a cupboard for\nthe dishes, a table, three or four chairs, and the beds. Uncle Henry\nand Aunt Em had a big bed in one corner, and Dorothy a little bed in\nanother corner."
#Creating custom function
transformer <- content_transformer(function(x, pattern) {
gsub(pattern, " ", x)
})
#Cleaning Examples
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, transformer, "\\n")
corpus <- tm_map(corpus, removePunctuation)
#Separate words by one whitespace
corpus <- tm_map(corpus, stripWhitespace)
corpus[[1]]$content## [1] "dorothy lived in the midst of the great kansas prairies with uncle henry who was a farmer and aunt em who was the farmers wife their house was small for the lumber to build it had to be carried by wagon many miles there were four walls a floor and a roof which made one room and this room contained a rusty looking cookstove a cupboard for the dishes a table three or four chairs and the beds uncle henry and aunt em had a big bed in one corner and dorothy a little bed in another corner"
A list of Indonesian stopwords is provided in data/stopwords-id.txt. You can always adjust the use of each stopword accordingly.
stopwords <- readLines("data/stopwords-id.txt")
## Warning in readLines("data/stopwords-id.txt"): incomplete final line found
## on 'data/stopwords-id.txt'
corpus <- tm_map(corpus, removeWords, stopwords)
dtm <- DocumentTermMatrix(corpus)
freqTerms <- findFreqTerms(dtm, 3)
reduced_dtm <- DocumentTermMatrix(corpus, list(dictionary=freqTerms))
reduced_dtm$dimnames$Terms
## [1] "and" "dorothy" "for" "gray" "great" "had" "one"
## [8] "small" "the" "was"
There are a lot of text mining techniques you can pick up online; always try to implement each technique by fully understanding the process. A lot of text mining is associated with Natural Language Processing concepts that are language-specific, so make sure to implement them correctly. You can also try a stemming technique, which might help you improve your model performance. Check out this GitHub repository for an Indonesian stemming library in R.
The dataset contains receipts collected from a restaurant that serves food and beverages. This restaurant has multiple outlets across the country. In this project, we are trying to forecast the restaurant’s visitors for the next 7 days.
The dataset is available to load in data/fnb_train.csv:
fread("data/fnb_train.csv") %>% glimpse()## Observations: 7,063,969
## Variables: 11
## $ datetime <chr> "2017-12-01 00:47:29", "2017-12-01 00:47:29",...
## $ outlet <chr> "E_46", "E_46", "E_46", "E_46", "E_46", "E_46...
## $ receipt <chr> "A0017765", "A0017765", "A0017765", "A0017765...
## $ item <int> 10100101, 10200029, 10400016, 10500028, 10500...
## $ item_group <chr> "NOODLE_DISH", "RICE_DISH", "SIDE_DISH", "DRI...
## $ item_major_group <chr> "FOOD", "FOOD", "FOOD", "BEVERAGES", "BEVERAG...
## $ qty <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, ...
## $ price <dbl> 3.50, 6.96, 9.92, 4.74, 3.01, 5.48, 3.01, 4.7...
## $ total <dbl> 3.50, 6.96, 9.92, 4.74, 6.01, 5.48, 3.01, 4.7...
## $ payment <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH...
## $ sales_type <chr> "DINE_IN", "DINE_IN", "DINE_IN", "DINE_IN", "...
Use any model you find best suited for this time series problem. The variable we are trying to forecast is the total visitors from the 22nd of February to the 28th of February.
This restaurant chain has operated for quite some time now. The owner wants to analyze his visitors so he can attract more investors in 2018. The dataset provided is receipt data from the 1st of December 2017 to the 21st of February 2018. Can you build the best forecast for this restaurant owner?
Make a report of the approach you use in building the forecasting model. The report should have a clear explanation of:
The data in data/fnb_train.csv consists of receipt-level information. The variable we are trying to predict is the total visitors per outlet for each hour. You can use any variable you find helpful in the modelling process. Show how you prepare your data in the report.
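A minimal sketch of one way to derive hourly visitors per outlet is shown below; treating each distinct receipt as one visitor is an assumption you should justify (or replace) in your report:
# a minimal sketch of deriving hourly visitors per outlet; counting distinct
# receipts as visitors is an assumption, not a given
library(data.table)
library(tidyverse)
library(lubridate)
visitorHourly <- fread("data/fnb_train.csv") %>%
  mutate(datetime = ymd_hms(datetime),
         hour = floor_date(datetime, unit = "hour")) %>%
  group_by(outlet, hour) %>%
  summarise(visitor = n_distinct(receipt)) %>%
  ungroup()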
Rubric:
As you recall from the Classification 1 & 2 courses, some exploratory data analysis, such as analyzing the distribution of our target variable, would be very helpful in building the prediction model that follows.
Rubric:
Based on the best model picked in the previous step, use the model to predict the total visitors per outlet stored in data/fnb_submission.csv. The forecasted values should be filled in under the visitor column of the same file:
# submission example
read.csv('data/fnb_submission.csv') %>% head()
## datetime outlet visitor
## 1 2018-02-22 00:00:00 A NA
## 2 2018-02-22 00:00:00 B NA
## 3 2018-02-22 00:00:00 C NA
## 4 2018-02-22 00:00:00 D NA
## 5 2018-02-22 00:00:00 E NA
## 6 2018-02-22 01:00:00 A NA
and name your submission file with format YOURNAME_F&B_FORECAST.csv. This submission would be checked on the actual data to calculate the RMSE.
Rubric:
After you have finished with the report, save your Rmd file with the format YOURNAME_F&B_FORECASTING.Rmd, and compile your report and prediction submission into a .zip file.
This tweet dataset was collected by Team Algoritma for educational purposes. In this project, we are going to analyze and build a classification model to predict the sentiment of each tweet from “#YoutubeRewind2018”.
The data was downloaded using Twitter’s API. The data is available to load:
read.csv("data/train.csv") %>% head()## text
## 1 They better have T-posing in #YoutubeRewind2018
## 2 Hey folks Please subscribe my channel to watch amazing videos <U+0001F64F><U+0001F64F> @TSeries @pewdiepie #tseriesvspewdiepie #YouTube #youtuber #youtuberewind2018 https://t.co/n0EkVltUcN @Canon @NikonIndia
## 3 I need pewdiepie<U+0001F624><U+0001F624>\n#youtuberewind2018
## 4 If @MrBeastYT is not in #YouTubeRewind2018, we are going to have problems.
## 5 @pewdiepie Waw lets go... #youtuberewind2018
## 6 If they don't even think to invite @pewdiepie to @YouTube #youtuberewind2018 then that will be the biggest sham. This #pewdiepievstseries war is the biggest meme this year on YouTube.
## sentiment_type
## 1 Neutral
## 2 Neutral
## 3 Positive
## 4 Negative
## 5 Neutral
## 6 Neutral
Twitter is an online social network with over 330 million monthly active users as of February 2018. Users on Twitter create short messages called tweets to be shared with other Twitter users, who interact by retweeting and responding. Analyzing each post and understanding the sentiment associated with it helps us find out which key topics or themes resonate well with the audience.
Make a report of the approach you used in building the classification model. The report should have a clear explanation of:
The data in data/train.csv contains 2 attributes: text and sentiment_type. The sentiment labels are Positive, Negative, and Neutral; text holds the content of the tweet. Show how you feature-engineered the attributes you are going to use in your model.
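A hedged sketch of one possible cleaning step before tokenization is shown below; the regular expressions and the tweetClean name are illustrative assumptions and can be adjusted:
# a minimal tweet-cleaning sketch before building the corpus; the regex
# patterns are illustrative assumptions only
library(tidyverse)
library(tm)
tweets <- read.csv("data/train.csv", stringsAsFactors = FALSE)
tweetClean <- tweets$text %>%
  str_remove_all("https?://\\S+") %>%    # drop URLs
  str_remove_all("@\\w+") %>%            # drop user mentions
  str_remove_all("<U\\+\\w+>") %>%       # drop escaped emoji codes
  tolower()
corpusTweets <- VCorpus(VectorSource(tweetClean))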
Rubric:
As you recall in Classification 1 & 2 courses, some exploratory data analysis, like analyzing the distribution of our target variable, would be very helpful in building the following prediction model.
Rubric:
Based on the best model picked in the previous step, build a strong prediction model that will help us classify sentiment classes in the future. Note that the model should come with a proper explanation of why it suits our case. The predicted values should be stored in the data/submissionSA.csv template:
read.csv("data/submissionSA.csv") %>% head()## text
## 1 I can’t say that this years YT REWIND is a let down. It is everything I expected it to be actually. #fortnite and politics. \n#YouTubeRewind2018
## 2 #YouTubeRewind2018 is one of the most well deserved disliked videos. Hope it hits #1 by the end of the week https://t.co/OSpHwIWBuB
## 3 Interesting perspective by, @PhillyD - here's my view: \n@YouTube \n\nHere's a little statement I had to make regarding the whole, #YouTubeRewind/#YouTubeRewind2018 conversation: https://t.co/htBQvjBoHS
## 4 #YouTubeRewind2018 sucked so much!!! Except for my baby @lizakoshy <U+2764><U+FE0F> https://t.co/SxC7h0W9el
## 5 #YouTubeRewind2018 becomes the second-most disliked video of all time https://t.co/tPvr6t9h5T
## 6 The #YouTubeRewind2018 video is the fastest disliked video in the history of Youtube and it's starting to get close to that all-time number 1 spot. Check out my reaction to it through @pewdiepie to see how bad it really is. https://t.co/8Rlm2ZY3ht
## label
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
and name your submission file with the format YOURNAME_CLASSIFICATIONSA.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.
Rubric:
To evaluate a multi-class classification problem, we can use the yardstick package, which will help us calculate multiclass metrics. To install the package:
install.packages("yardstick")
# Development version:
devtools::install_github("tidymodels/yardstick")
To calculate multi-class metrics, we can first save our predictions and ground truth to a data frame.
eval_met <- data.frame(estimate = prediction, truth = submission_label)
# we can set our metrics: accuracy, precision, recall
metrics_multi <- metric_set(accuracy, recall, precision)
eval_met %>%
  metrics_multi(truth = truth, estimate = estimate)
For more metrics, check here.