1 Background

1.1 Algoritma

The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

1.2 Libraries and Setup

We’ll set up caching for this notebook, given how computationally expensive some of the code we will write can get:

# caching up the notebook
knitr::opts_chunk$set(cache = TRUE)

# clear up the environment
rm(list = ls())

We will need some packages from the tidyverse and tidyquant collections, such as dplyr and lubridate, to explore Scotty’s data. Let’s use the pacman package to load the packages, and automatically install them first if you don’t already have them:

# load pacman, installing it first if necessary
if (!require("pacman")) {
  install.packages("pacman")
  library(pacman)
}

# load (or automatically install first) the packages
p_load(tidyverse, tidyquant, tm, data.table)

1.3 Training Objectives

In this workshop, we will do some hands-on work on machine learning related projects. You will need to demonstrate your knowledge of machine learning to solve real-life prediction problems.

In this capstone project, you are expected to build a prediction model and write a clear report on how you built it. You can pick one of the four problems we provided (explained in detail in sections 2.1, 2.2, 3.1, and 4.1). You will need to recall some of your knowledge from:

  • Classification 1
  • Classification 2
  • Time Series and Forecasting

You can download the data here.

2 Scotty

Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in traveling through traffic; the app even references Star Trek’s “beam me up” on its order buttons. In this project, we are going to help them solve some forecasting and classification problems.

Scotty donated real-time order transaction data, which is available in data/Scotty.csv:

# a little glimpse of Scotty's data
read.csv("data/Scotty.csv") %>% glimpse()
## Observations: 366,784
## Variables: 10
## $ timeStamp        <fct> 2017-11-03 19:02:31, 2017-10-01 17:45:56, 201...
## $ driverID         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ riderID          <fct> 5941083dc01c9f3eeeaac726, 59d0d4363d32b861760...
## $ orderStatus      <fct> nodrivers, nodrivers, nodrivers, nodrivers, n...
## $ confirmedTimeSec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ srcGeohash       <fct> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ srcLong          <dbl> -60.87779, 0.00000, 0.00000, 0.00000, 0.00000...
## $ srcLat           <dbl> -23.04455, 0.00000, 0.00000, 0.00000, 0.00000...
## $ destLong         <dbl> 40.98718, 40.99750, 40.99750, 41.03570, 41.04...
## $ destLat          <dbl> 29.02819, 28.85056, 28.85056, 29.06256, 29.00...

There are two options for the project:

  • Forecast Hourly Demands for Scotty
  • Classify “nodrivers” Condition by Region and Hour

2.1 “It’s December already… Bring me the crystal ball”

It’s almost 2018 and we need to prepare a forecast model to help us get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past experience to prepare for December’s demand.

Fortunately, you already know that time series analysis is more than enough to help us forecast! Based on the data/Scotty.csv data (up to Friday, December 1st 2017), make a time series analysis and forecast report that will be evaluated on the next 5 business days. The report must provide a proper hourly-basis analysis and forecast of Scotty’s demand.

The report should have a clear explanation of:

  • Data Preprocess (8 points)
  • Seasonality and Trend Analysis (8 points)
  • Forecast Model for Hourly Demands (8 points)

2.1.1 Data Preprocess (8 points)

The data in data/Scotty.csv contains real-time raw transaction information. Moreover, it also includes some "cancelled" and duplicated "nodrivers" transactions, which do not represent real demand and should not be included in the time series analysis and forecast model. Based on this condition, write a report on how you preprocess the data into a proper time series that represents Scotty’s true demand.
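If you need a starting point, below is a minimal sketch of one possible preprocessing flow. It assumes that "cancelled" orders are dropped and that repeated transactions from the same riderID within the same hour count as a single demand; the exact rules are yours to define and justify in the report:

# a rough sketch only: refine the filtering and de-duplication rules yourself
p_load(lubridate)

scotty <- read.csv("data/Scotty.csv")

demandHourly <- scotty %>%
  mutate(timeStamp = ymd_hms(as.character(timeStamp)),
         hour = floor_date(timeStamp, unit = "hour")) %>%
  filter(orderStatus != "cancelled") %>%
  distinct(riderID, hour) %>%        # treat one rider per hour as one demand
  group_by(hour) %>%
  summarise(nOrder = n())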

Rubric:

  • Give a proper documentation of how you preprocess the data (4 points)
  • Give a proper narrative for each preprocessing step (4 points)

2.1.2 Seasonality and Trend Analysis (8 points)

As you may recall from the Time Series and Forecasting course, a time series has trend and seasonality components. The data in data/Scotty.csv is not sampled at a regular frequency, but if we aggregate it into hourly data points, the demand for Scotty’s service should, theoretically, show both trend and seasonality components.
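For instance, a quick classical decomposition of the hourly data (reusing the hypothetical demandHourly object from the preprocessing sketch above, with a daily frequency of 24) could look like this:

# illustration only: treat demand as an hourly ts with daily frequency and decompose it
demandTs <- ts(demandHourly$nOrder, frequency = 24)
plot(decompose(demandTs))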

Rubric:

  • Give a proper documentation of how you decompose trend and seasonality (4 points)
  • Give an informative analysis and visualization for the trend and seasonality (4 points)

2.1.3 Forecast Model for Hourly Demands (8 points)

Once you are confident with your time series analysis, build a forecast model based on your intuition. Note that the report should properly explain why the model suits our case. The forecasted values should be stored inside the data/submissionForecast.csv template:

# some samples of the submission template
read.csv("data/submissionForecast.csv") %>% head()
##             timeStamp nOrder
## 1 2017-12-04 00:00:00     NA
## 2 2017-12-04 01:00:00     NA
## 3 2017-12-04 02:00:00     NA
## 4 2017-12-04 03:00:00     NA
## 5 2017-12-04 04:00:00     NA
## 6 2017-12-04 05:00:00     NA

and name your submission file with the format YOURNAME_SCOTTY_FORECAST.csv. This submission will be checked against the actual data to calculate the Mean Absolute Error (MAE).
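For reference, MAE is simply the average absolute difference between the forecasted and the actual hourly demand. A minimal sketch, with hypothetical vectors actual and forecasted of equal length:

# hypothetical vectors: actual hourly demand vs. your forecasted values
mae <- mean(abs(actual - forecasted))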

Rubric:

  • Give a proper documentation of how you build the forecast model (4 points)
  • Reach an MAE value below 55 on the actual data (4 points)

2.1.4 Additional Tips

Since the data will be aggregated to hourly data, we are likely to face a complex seasonality problem. For example, half-hourly data has daily and weekly seasonalities, and may even include monthly and yearly seasonalities if the data covers a long enough period. Unfortunately, classical decomposition cannot handle multiple seasonality.

To deal with multiple seasonality, you might want to consider the msts() function from the forecast package to store the data as a multiple-frequency time series. Here is a quick example on elecdemand, a half-hourly dataset from the fpp2 package (we only use the “Demand” column as the example):

# load forecast and fpp2 package
p_load(forecast, fpp2)
# convert preloaded elecdemand data as msts object
mstsExample <- msts(elecdemand[, "Demand"], seasonal.periods = c(48, 48 * 7))

# inspect the class
class(mstsExample)
## [1] "msts" "ts"

In this example, \(48\) represents the frequency within a natural period of half-hourly data: a day. If the data is long enough, as in our example, the series also covers another natural period, a week, which is represented in our example by \(48 \cdot 7 = 336\).

Using the msts object, you can apply the mstl() function from the forecast package to decompose the multiple seasonalities:

# multiple seasonality decomposition
mstsExample %>%
  head(48 * 7 * 5) %>% # first 5 week example
  mstl() %>%
  autoplot()

As you can see from the plot, elecdemand shows clear seasonality patterns for each natural period: daily and weekly seasonalities.

Consequently, you might want to consider some multiple-seasonality forecast models, such as tbats(), or auto.arima() with Fourier terms, which are also supported by the forecast package.
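As a hedged starting point (not a recommended final model), here is how such models could be fitted on the mstsExample object from above; the number of Fourier terms, K = c(10, 10), is an arbitrary illustrative choice:

# sketch: two multiple-seasonality models on the first 5 weeks of the example data
trainExample <- head(mstsExample, 48 * 7 * 5)

# TBATS handles multiple seasonal periods directly
fitTbats <- tbats(trainExample)
fcTbats  <- forecast(fitTbats, h = 48)

# ARIMA with Fourier terms as external regressors
fitArima <- auto.arima(trainExample, seasonal = FALSE,
                       xreg = fourier(trainExample, K = c(10, 10)))
fcArima  <- forecast(fitArima, xreg = fourier(trainExample, K = c(10, 10), h = 48))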

If you are interested in multiple seasonality problems, check out a deeper analysis in one of the chapters of “Forecasting: Principles and Practice”, written by the same author as the forecast package.

2.2 “There is no driver!”

Scotty turned out to be a very popular service in Turkey! This is a good thing, but it also brings a new problem: “There is no driver”. Demand for Scotty began to overload in some regions at some times, and there were not enough drivers at those times and places.

Fortunately, we know that we can use a classification model to predict which regions and hours are risky enough to have this “nodrivers” problem. Based on the data/Scotty.csv data (up to Friday, December 1st 2017), make a prediction model report that will be evaluated on the next 5 business days. The report must provide a proper analysis and prediction, by region and hour, of Scotty’s driver coverage status (Sufficient or Insufficient).

The report should have a clear explanation of:

  • Data Preprocess (8 points)
  • Exploratory Data Analysis (8 points)
  • Prediction Model for Coverage Status by Region and Hour (8 points)

2.2.1 Data Preprocess (8 points)

The data in data/Scotty.csv contains real-time raw transaction information.

Moreover, it also includes some "cancelled" and duplicated "nodrivers" transactions, which do not represent real demand, and hence the coverage status, and should not be included in the prediction model. For example, if the statuses of the transactions of one riderID within an hour are nodrivers, nodrivers, completed, that would indicate one demand. Based on this condition, write a report on how you preprocess the data into a proper dataset that represents Scotty’s true coverage status.
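As one heavily simplified reading (yours may well differ), the sketch below collapses each riderID’s transactions within an hour into a single demand, and labels a region-hour as "Insufficient" whenever at least one of its demands was never served ("nodrivers"); treat this purely as an assumption to be revisited, not as the required definition:

# a rough sketch only: the exact definition of coverage status is part of the exercise
p_load(lubridate)

scotty <- read.csv("data/Scotty.csv")

coverageHourly <- scotty %>%
  mutate(timeStamp = ymd_hms(as.character(timeStamp)),
         hour = floor_date(timeStamp, unit = "hour")) %>%
  filter(orderStatus != "cancelled") %>%
  group_by(srcGeohash, hour, riderID) %>%
  summarise(unserved = all(orderStatus == "nodrivers")) %>%  # one demand per rider per hour
  group_by(srcGeohash, hour) %>%
  summarise(coverage = ifelse(any(unserved), "Insufficient", "Sufficient")) %>%
  ungroup()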

Rubric:

  • Give a proper documentation of how you preprocess the data (4 points)
  • Give a proper narrative for each preprocessing step (4 points)

Tips: Check out data/submissionClassification.csv to see the categories of the coverage status.

2.2.2 Exploratory Data Analysis (8 points)

As you may recall from the Classification 1 & 2 courses, some exploratory data analysis, like analyzing the distribution of our target variable, would be very helpful in building the subsequent prediction model.
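For example, a quick look at the class balance of the coverage label (using the hypothetical coverageHourly object from the preprocessing sketch above) could be as simple as:

# how balanced are the "Sufficient" and "Insufficient" classes?
coverageHourly %>%
  ggplot(aes(x = coverage)) +
  geom_bar() +
  labs(title = "Distribution of coverage status", x = NULL, y = "count")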

Rubric:

  • Give a proper documentation of how you do exploratory data analysis (4 points)
  • Give an informative narration and visualization from your exploratory data analysis (4 points)

2.2.3 Prediction Model for Coverage Status by Region and Hour (8 points)

Based on the prior exploratory data analysis, build a strong prediction model that will help us classify coverage status in the future. Note that the report should properly explain why the model suits our case. The predicted values should be stored inside the data/submissionClassification.csv template:

# some samples of the submission template
# the values in coverage column is just an example
read.csv("data/submissionClassification.csv") %>% head()
##   srcGeohash           timeStamp     coverage
## 1          7 2017-12-04 00:00:00   Sufficient
## 2          7 2017-12-04 01:00:00 Insufficient
## 3          7 2017-12-04 02:00:00   Sufficient
## 4          7 2017-12-04 03:00:00 Insufficient
## 5          7 2017-12-04 04:00:00   Sufficient
## 6          7 2017-12-04 05:00:00 Insufficient

and name your submission file with the format YOURNAME_SCOTTY_CLASSIFICATION.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.
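For reference, these metrics can be computed from a simple confusion matrix. The sketch below assumes hypothetical factors predicted and actual with the same levels, and treats "Insufficient" as the positive class (the choice of positive class is an assumption, not a requirement):

# confusion matrix: rows are predictions, columns are the actual values
cm <- table(predicted, actual)

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- cm["Insufficient", "Insufficient"] / sum(cm[, "Insufficient"])
precision <- cm["Insufficient", "Insufficient"] / sum(cm["Insufficient", ])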

Rubric:

  • Give a proper documentation of how you build the prediction model (4 points)
  • Reach accuracy, recall, and precision values of at least 90% on the actual data (4 points)

3 SMS Spam

This SMS dataset was collected by Team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message. In this project, we are going to build a classification model to create a spam classifier.

The dataset is available to load in data/sms.csv:

read.csv("data/sms.csv") %>% glimpse()
## Observations: 1,805
## Variables: 3
## $ STATUS  <fct> ham, ham, spam, ham, spam, spam, spam, spam, spam, spa...
## $ CONTAIN <fct> Sy wa ga sampe2 soalnya, Mbak tiara sy da di cyber yaa...
## $ DATE    <fct> 2018-02-28 11:43:00, 2018-02-28 11:43:00, 2018-02-28 0...

Use any approach you find fitting for the problem and create a model that can correctly classify an SMS as spam or ham.

3.1 “I didn’t get your message!”

The drastic decrease in the use of SMS has turned this once-very-popular cellphone feature into a junk inbox. But who knows? Someone might contact you the old-school way via SMS, and you might even miss it because the amount of spam in your inbox is just too much. Imagine we are building a spam classifier for SMS messages. The messages labeled as spam were collected through users’ reports of unwanted SMS.

Make a report of the approach you use in building the model classifier. The report should have a clear explanation of:

  • Data Preprocess (8 points)
  • Model Selection (8 points)
  • Predicting Spam SMS (8 points)

3.1.1 Data Preprocess (8 points)

The data in data/sms.csv contains 3 attributes: DATE, CONTAIN, and STATUS. The label of spam (unwanted SMS) or ham is stored under the STATUS variable. DATE shows the timestamp at which the SMS was received and CONTAIN holds the content of the message. Show how you feature-engineered the attributes you are going to use in your model.

Rubric:

  • Give a proper documentation of how you preprocess the data (4 points)
  • Give a proper narrative for each preprocessing step (4 points)

3.1.2 Model Selection (8 points)

In building the model, we should consider multiple parameters when coming up with the best model. Benchmark the multiple models or approaches you came up with and show your train of thought in choosing the best model.

Rubric:

  • Do a benchmarking with multiple approaches (4 points)
  • Give a proper explanation on how the best model is picked (4 points)

3.1.3 Predicting Spam SMS (8 points)

Based on the best model picked in the previous step, use it to predict the SMS messages stored in data/submissionSMS.csv. The prediction should contain 2 classes: spam or ham. The predictions should be filled in under the STATUS column of the same file:

read.csv('data/submissionSMS.csv') %>% head()
##                  DATE
## 1 2018-04-25 14:20:00
## 2 2018-04-25 14:13:00
## 3 2018-04-25 12:26:00
## 4 2018-04-25 12:08:00
## 5 2018-04-24 12:41:00
## 6 2018-04-24 11:42:00
##                                                                                                                                                            CONTAIN
## 1  ELITE RELOAD PULSA:Kami ingin menawarkan anda menjadi agen pulsa all operator. harga: V5:4750,V10:8750,V20:18750.Minat invite BBM: D995290A & WA: 085656345887.
## 2                                                               iRing keren cuman buat km, Feni Rose-TERCYDUK,Rp.0,1/3hr prpnjngan Rp.3190 dengan hnya bls YA lho!
## 3    Sahabat, registrasi kartu SIM prabayar Anda SEKARANG. Ketik ULANG#NIK#No.KK# SMS ke 4444 atau klik http://im3.do/registrasi s.d. 30 April 2018. Terima kasih.
## 4                                                                                                    sians text ULANG#NIK#No.KK# to 4444.Foreigners bring Passport
## 5 24/04/2018 12:41 Pada No. Rek 378088918 ada dana keluar sebesar Rp 100.000,00. Saldo akhir: Rp 216.309,00.Berita: BILL PAYMENT  GOPAY CUST    NO  39152454823092
## 6  Sisa kuota GAK AKAN HANGUS & dpt disimpan hg 100GB SETAHUN dgn DATAROLLOVER dr IM3Ooredoo!Gratis dgn pkt Internet bulanan IM3 Ooredoo http://im3.do/dr0424 BMI1
##   STATUS
## 1     NA
## 2     NA
## 3     NA
## 4     NA
## 5     NA
## 6     NA

and name your submission file with the format YOURNAME_SPAM_CLASSIFICATION.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.

Rubric:

  • Achieve at least 90% of precision on unseen data (2 points)
  • Achieve at least 83% of accuracy on unseen data (2 points)
  • Achieve at least 80% of recall on unseen data (2 points)
  • Achieve at least 85% of specificity on unseen data (2 points)

3.1.4 Text Mining Tips

To help you in creating the model, recall the text mining material from Classification 2. If you are planning to do Naive Bayes modelling (of course, you are not limited to this approach!) like the one presented in the Classification 2 class, use the tm library for text preprocessing. Here are the steps you might want to pay attention to when doing word tokenization:

  1. Use a corpus object to store and preprocess your text data (in the example below, data is assumed to be a character vector of documents).
library(tm)
#Create corpus object
corpus <- VCorpus(VectorSource(data))
#Inspecting corpus
corpus[[1]]$content
## [1] "Dorothy lived in the midst of the great Kansas prairies, with Uncle\nHenry, who was a farmer, and Aunt Em, who was the farmer's wife.  Their\nhouse was small, for the lumber to build it had to be carried by wagon\nmany miles.  There were four walls, a floor and a roof, which made one\nroom; and this room contained a rusty looking cookstove, a cupboard for\nthe dishes, a table, three or four chairs, and the beds.  Uncle Henry\nand Aunt Em had a big bed in one corner, and Dorothy a little bed in\nanother corner."
#Creating custom function
transformer <- content_transformer(function(x, pattern) {
    gsub(pattern, " ", x)
})

#Cleaning Examples
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, transformer, "\\n")
corpus <- tm_map(corpus, removePunctuation)

#Separate words by one whitespace
corpus <- tm_map(corpus, stripWhitespace)
corpus[[1]]$content
## [1] "dorothy lived in the midst of the great kansas prairies with uncle henry who was a farmer and aunt em who was the farmers wife their house was small for the lumber to build it had to be carried by wagon many miles there were four walls a floor and a roof which made one room and this room contained a rusty looking cookstove a cupboard for the dishes a table three or four chairs and the beds uncle henry and aunt em had a big bed in one corner and dorothy a little bed in another corner"
  2. Remove stopwords related to the text domain and language. We have provided a list of common Indonesian stopwords in data/stopwords-id.txt. You can always adjust the use of each stopword accordingly.
stopwords <- readLines("data/stopwords-id.txt")
## Warning in readLines("data/stopwords-id.txt"): incomplete final line found
## on 'data/stopwords-id.txt'
corpus <- tm_map(corpus, removeWords, stopwords)
  3. Use a dictionary list extracted from the train data when creating the document-term matrix, to make sure the token attributes used in both train and test are the same.
dtm <- DocumentTermMatrix(corpus)
freqTerms <- findFreqTerms(dtm, 3)
reduced_dtm <- DocumentTermMatrix(corpus, list(dictionary=freqTerms))

reduced_dtm$dimnames$Terms
##  [1] "and"     "dorothy" "for"     "gray"    "great"   "had"     "one"    
##  [8] "small"   "the"     "was"

There are a lot of text mining techniques you can pick up online; try to implement each technique only after fully understanding the process. A lot of text mining is associated with Natural Language Processing concepts that are language-specific, so make sure to implement them correctly. You can also try a stemming technique, which might help you improve your model performance. Check out this GitHub repository about an Indonesian stemming library in R.
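If you do go down the Naive Bayes route, one common follow-up step is to turn the document-term matrix into presence/absence features before fitting the model. The sketch below is only an illustration of that idea: reduced_dtm stands in for the DTM you would build from the SMS training corpus, and trainLabels is a hypothetical factor of spam/ham labels aligned with its rows:

# sketch: Bernoulli-style features + Naive Bayes from the e1071 package
p_load(e1071)

# convert term counts into "absent"/"present" factors
trainX <- as.data.frame(as.matrix(reduced_dtm) > 0)
trainX[] <- lapply(trainX, function(x) factor(x, levels = c(FALSE, TRUE),
                                              labels = c("absent", "present")))

# fit and predict (trainLabels is hypothetical; use your own spam/ham labels)
modelNb <- naiveBayes(trainX, trainLabels, laplace = 1)
predNb  <- predict(modelNb, newdata = trainX)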

4 Food & Beverage

The dataset contains receipts collected from a restaurant that serves food and beverages. This restaurant has multiple outlets across the country. In this project, we are trying to forecast the restaurant’s visitors for the next 7 days.

The dataset is available to load in data/fnb_train.csv:

fread("data/fnb_train.csv") %>% glimpse()
## Observations: 7,063,969
## Variables: 11
## $ datetime         <chr> "2017-12-01 00:47:29", "2017-12-01 00:47:29",...
## $ outlet           <chr> "E_46", "E_46", "E_46", "E_46", "E_46", "E_46...
## $ receipt          <chr> "A0017765", "A0017765", "A0017765", "A0017765...
## $ item             <int> 10100101, 10200029, 10400016, 10500028, 10500...
## $ item_group       <chr> "NOODLE_DISH", "RICE_DISH", "SIDE_DISH", "DRI...
## $ item_major_group <chr> "FOOD", "FOOD", "FOOD", "BEVERAGES", "BEVERAG...
## $ qty              <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, ...
## $ price            <dbl> 3.50, 6.96, 9.92, 4.74, 3.01, 5.48, 3.01, 4.7...
## $ total            <dbl> 3.50, 6.96, 9.92, 4.74, 6.01, 5.48, 3.01, 4.7...
## $ payment          <chr> "CASH", "CASH", "CASH", "CASH", "CASH", "CASH...
## $ sales_type       <chr> "DINE_IN", "DINE_IN", "DINE_IN", "DINE_IN", "...

Use any model you find best suited for this time series problem. The variable we are trying to forecast is the total visitors from the 22nd of February to the 28th of February.

4.1 “How many visitors will we get next week?”

This restaurant chain has operated for quite some time now. The owner wants to analyze his visitors so he can attract more investors in 2018. The dataset provided is receipt data from the 1st of December 2017 to the 21st of February 2018. Can you build the best forecast for this restaurant owner?

Make a report of the approach you use in building the forecasting model. The report should have a clear explanation of:

  • Data Preprocess (8 points)
  • Exploratory Data Analysis (8 points)
  • Forecast Future Visitors (8 points)

4.1.1 Data Preprocess (8 points)

The data in data/fnb_train.csv consists of receipt-level information. The variable we are trying to predict is the total visitors per outlet for each hour. You can use any variable you find helpful in the modelling process. Show how you prepare your data in the report.
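As a starting point, the sketch below counts distinct receipts per outlet per hour, under the assumption that one receipt roughly corresponds to one visitor (or one visiting group); you are free to argue for a different definition in your report:

# a rough sketch only: one distinct receipt is assumed to approximate one visitor
p_load(lubridate)

fnb <- fread("data/fnb_train.csv")

visitorHourly <- fnb %>%
  mutate(datetime = ymd_hms(datetime),
         hour = floor_date(datetime, unit = "hour")) %>%
  group_by(outlet, hour) %>%
  summarise(visitor = n_distinct(receipt)) %>%
  ungroup()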

Rubric:

  • Give a proper documentation of how you preprocess the data (4 points)
  • Give a proper narrative for each preprocessing step (4 points)

4.1.2 Exploratory Data Analysis (8 points)

As you may recall from previous courses, some exploratory data analysis, like analyzing the distribution of our target variable, would be very helpful in building the subsequent forecast model.

Rubric:

  • Give a proper documentation of how you do exploratory data analysis (4 points)
  • Give an informative narration and visualization from your exploratory data analysis (4 points)

4.1.3 Forecast Future Visitors (8 points)

Based on the best model picked in the previous step, use the model to forecast the total visitors per outlet listed in data/fnb_submission.csv. The forecasted values should be filled in under the visitor column of the same file:

# submission example
read.csv('data/fnb_submission.csv') %>% head()
##              datetime outlet visitor
## 1 2018-02-22 00:00:00      A      NA
## 2 2018-02-22 00:00:00      B      NA
## 3 2018-02-22 00:00:00      C      NA
## 4 2018-02-22 00:00:00      D      NA
## 5 2018-02-22 00:00:00      E      NA
## 6 2018-02-22 01:00:00      A      NA

and name your submission file with the format YOURNAME_F&B_FORECAST.csv. This submission will be checked against the actual data to calculate the RMSE.
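For reference, RMSE is the square root of the mean squared difference between the forecasted and the actual visitor counts, pooled across all outlet-hour pairs. A minimal sketch, with hypothetical vectors actual and forecasted of equal length:

# hypothetical vectors: actual visitors vs. your forecasted values, across all outlets and hours
rmse <- sqrt(mean((actual - forecasted)^2))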

Rubric:

  • Give a proper documentation of how you build the forecast model for multiple outlets (4 points)
  • Reach an RMSE value below 50 on the actual data (RMSE will be calculated across outlets) (4 points)

4.1.4 Submission

After you have finished the report, save your Rmd file with the format YOURNAME_F&B_FORECASTING.Rmd, and compile your report and prediction submission into a .zip file.

5 Sentiment Analysis

This tweet dataset was collected by Team Algoritma for educational purposes. In this project, we are going to analyze and build a classification model to predict the sentiment of each tweet from “#YoutubeRewind2018”.

The data was downloaded using Twitter’s API. The data is available to load:

read.csv("data/train.csv") %>% head()
##                                                                                                                                                                                                                 text
## 1                                                                                                                                                                    They better have T-posing in #YoutubeRewind2018
## 2 Hey folks Please subscribe my channel to watch amazing videos <U+0001F64F><U+0001F64F>  @TSeries @pewdiepie  #tseriesvspewdiepie #YouTube #youtuber  #youtuberewind2018 https://t.co/n0EkVltUcN @Canon @NikonIndia
## 3                                                                                                                                                       I need pewdiepie<U+0001F624><U+0001F624>\n#youtuberewind2018
## 4                                                                                                                                         If @MrBeastYT is not in #YouTubeRewind2018, we are going to have problems.
## 5                                                                                                                                                                       @pewdiepie Waw lets go... #youtuberewind2018
## 6                            If they don't even think to invite @pewdiepie to @YouTube #youtuberewind2018 then that will be the biggest sham. This #pewdiepievstseries war is the biggest meme this year on YouTube.
##   sentiment_type
## 1        Neutral
## 2        Neutral
## 3       Positive
## 4       Negative
## 5        Neutral
## 6        Neutral

5.1 Predict Sentiment from the tweets

Twitter is an online social network with over 330 million monthly active users as of February 2018. Users on Twitter create short messages called tweets to be shared with other Twitter users, who interact by retweeting and responding. Analyzing each post and understanding the sentiment associated with it helps us find out which key topics or themes resonate well with the audience.

Make a report of the approach you use in building the model classifier. The report should have a clear explanation of:

  • Data Preprocess (8 points)
  • Model Selection (8 points)
  • Prediction Model for Sentiment Classes (8 points)

5.1.1 Data Preprocess

The data in train.csv contains 2 attributes: text and sentiment_type. The sentiment labels are Positive, Negative, and Neutral. text holds the content of the tweet. Show how you feature-engineered the attributes you are going to use in your model.

Rubric:

  • Give a proper documentation of how you preprocess the data (4 points)
  • Give a proper narrative for each preprocessing step (4 points)

5.1.2 Model Selection

In building the model, we should consider multiple modelling approaches before settling on the best one. Benchmark the approaches you came up with against each other and show your train of thought in choosing the best model.

Rubric:

  • Do a benchmarking with multiple approaches (4 points)
  • Give a proper explanation on how the best model is picked (4 points)

5.1.3 Prediction Model for Sentiment Classes

Based on the best model picked in the previous step, use it to predict the sentiment class of each tweet. Note that the report should properly explain why the model suits our case. The predicted values should be stored inside the data/submissionSA.csv template:

read.csv("data/submissionSA.csv") %>% head()
##                                                                                                                                                                                                                                                      text
## 1                                                                                                        I can’t say that this years YT REWIND is a let down. It is everything I expected it to be actually. #fortnite and politics. \n#YouTubeRewind2018
## 2                                                                                                                     #YouTubeRewind2018 is one of the most well deserved disliked videos. Hope it hits #1 by the end of the week https://t.co/OSpHwIWBuB
## 3                                              Interesting perspective by, @PhillyD - here's my view: \n@YouTube \n\nHere's a little statement I had to make regarding the whole, #YouTubeRewind/#YouTubeRewind2018 conversation: https://t.co/htBQvjBoHS
## 4                                                                                                                                             #YouTubeRewind2018 sucked so much!!! Except for my baby @lizakoshy <U+2764><U+FE0F> https://t.co/SxC7h0W9el
## 5                                                                                                                                                           #YouTubeRewind2018 becomes the second-most disliked video of all time https://t.co/tPvr6t9h5T
## 6 The #YouTubeRewind2018 video is the fastest disliked video in the history of Youtube and it's starting to get close to that all-time number 1 spot. Check out my reaction to it through @pewdiepie to see how bad it really is. https://t.co/8Rlm2ZY3ht
##   label
## 1    NA
## 2    NA
## 3    NA
## 4    NA
## 5    NA
## 6    NA

and name your submission file with the format YOURNAME_CLASSIFICATIONSA.csv. This submission will be checked against the actual data to calculate the accuracy, recall, and precision.

Rubric:

  • Give a proper documentation of how you build the prediction model (4 points)
  • Reach at least 70% accuracy, 65% recall, and 65% precision on the actual data (4 points)

5.1.4 Multi-class tips

To evaluate a multi-class classification problem, we can use the yardstick package, which helps us calculate multi-class metrics. To install the package:

install.packages("yardstick")

# Development version:
devtools::install_github("tidymodels/yardstick")

To calculate multi-class metrics, we first save our predictions and the ground truth into a data frame. Note that both columns should be factors with the same levels.

eval_met <-  data.frame(estimate = prediction, truth = submission_label)

# we can set our metrics: accuracy, precison, recall
metrics_multi <- metric_set(accuracy, recall, precision)

eval_met %>%
  metrics_multi(truth = truth, estimate = estimate)

You can check more metrics here.