As the final step in the machine learning course, every student should complete 1 case as their machine learning capstone project. In this article, we will explain the cases that you can choose from.
There are 5 different datasets. From each dataset, you can choose 1 case with a set of Rubrics/Requirements that you need to fulfill to get a score.
All datasets used in the capstone can be found in this link.
In the link provided, there are 2 datasets for each case: a train dataset and a test dataset.
The train dataset is used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your predictions for the test dataset to the leaderboard in order to obtain the final model evaluation (more details are provided below). The data scheme is illustrated as follows:
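In code, the train side of that scheme can be sketched as a hold-out split: part of the train dataset is kept aside as your own validation data. This is only a minimal sketch. The data frame `data_train`, its columns, and the 80/20 proportion are assumptions made for illustration, not part of the capstone material.

```r
# Hypothetical stand-in for a case's train dataset (synthetic values)
set.seed(100)
data_train <- data.frame(
  id = 1:100,
  x  = rnorm(100),
  y  = rnorm(100)
)

# Hold out 20% of the train dataset as your own validation set
idx_val   <- sample(nrow(data_train), size = round(0.2 * nrow(data_train)))
train_set <- data_train[-idx_val, ]
val_set   <- data_train[idx_val, ]

nrow(train_set)
nrow(val_set)
```

A random split like this suits the regression and classification cases. For the time series cases, the validation data should instead be the most recent part of the series so that time order is preserved.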
This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan in 2007. It is related to his 1998 research on how to predict the compression strength of a concrete structure.
“Concrete is the most important material in civil engineering,” as said by Prof. I-Cheng Yeh.
Concrete compression strength is determined not only by the water-cement mixture but also by the other ingredients and by how we treat the mixture. Using this dataset, we are going to find “the perfect recipe” to predict the concrete’s compression strength, and explain the relationship between the ingredient concentrations, the age of testing, and the compression strength.
Observing a concrete structure’s strength takes a long time, especially when the resting time is long, say 6 months. Why don’t you just predict the compression strength instead of waiting for 6 months?
Your goal is to predict the compression strength based on the mixture properties.
How much does the structure’s compression strength increase or decrease when you add more water? Can the concrete structure gain more compression strength when you leave it to rest longer?
Your goal is to build a linear regression model that fulfills all assumptions. Interpret how each ingredient and the age of testing affect the concrete compression strength.
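A rough sketch of what fitting and checking such a model can look like in base R is given here. The data is synthetic: the column names are borrowed from the concrete dataset, but the values, the coefficients, and the `log(age)` transformation are assumptions chosen for illustration only. Packages such as lmtest (`bptest()`) and car (`vif()`) are commonly used for further checks and are not shown.

```r
# Synthetic data with column names borrowed from the concrete dataset;
# the values and coefficients are invented for illustration only
set.seed(42)
n <- 200
concrete <- data.frame(
  cement = runif(n, 100, 500),
  water  = runif(n, 120, 250),
  age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
concrete$strength <- 10 + 0.08 * concrete$cement - 0.1 * concrete$water +
  6 * log(concrete$age) + rnorm(n, sd = 3)

# Fit a linear regression; log(age) is an assumed transformation
model <- lm(strength ~ cement + water + log(age), data = concrete)

# Coefficients: expected change in strength per one-unit change in a predictor,
# holding the other predictors constant
coef(model)

# Normality of residuals (Shapiro-Wilk test from the stats package)
shapiro.test(residuals(model))

# Visual checks: residuals vs fitted (linearity, constant variance) and Q-Q plot
plot(model, which = 1:2)
```

Reading the coefficients answers questions like "what happens to strength when water increases by 1 kg", while the test and plots indicate whether the assumptions behind that interpretation hold.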
The Food and Beverage dataset is provided by Dattabot and contains detailed transactions of multiple food and beverage outlets. Using this dataset, we are challenged to do some forecasting and time series analysis to help the outlet’s owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better judgments in 2018. Fortunately, you already know that time series analysis is enough to provide a good forecast and seasonality explanation.
Please make a report of your forecasting result and seasonality explanation for the hourly number of visitors, which will be evaluated on the next 7 days (Monday, December 19th 2017 to Sunday, December 25th 2017)!
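Before forecasting, the transaction log has to be turned into an hourly visitor series. The sketch below shows one possible way in base R. It assumes that one receipt corresponds to one visitor, which is an assumption of this example rather than a definition given by the case, and it uses a tiny synthetic log with the `transaction_date` and `receipt_number` columns.

```r
# Tiny synthetic transaction log; column names follow the dataset description
trx <- data.frame(
  transaction_date = as.POSIXct(c(
    "2017-12-01 10:05:00", "2017-12-01 10:05:00",
    "2017-12-01 10:40:00", "2017-12-01 12:15:00"
  ), tz = "UTC"),
  receipt_number = c("A001", "A001", "A002", "A003")
)

# Round each timestamp down to its hour
trx$datetime <- as.POSIXct(trunc(trx$transaction_date, units = "hours"))

# Assumption: one receipt equals one visitor, so count unique receipts per hour
hourly <- aggregate(trx["receipt_number"],
                    by = list(datetime = trx$datetime),
                    FUN = function(x) length(unique(x)))
names(hourly)[2] <- "visitor"

# Pad hours without transactions with zero visitors to keep the series regular
all_hours <- data.frame(
  datetime = seq(min(hourly$datetime), max(hourly$datetime), by = "hour")
)
hourly <- merge(all_hours, hourly, all.x = TRUE)
hourly$visitor[is.na(hourly$visitor)] <- 0
hourly
```

Padding the missing hours matters because seasonal time series models expect the same number of observations in every period.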
The SMS dataset is collected by team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message.
Someone might contact you through the old-school way of SMS, and you might even miss it because the amount of spam in your inbox is just too much. The messages classified as spam were collected through users’ reports of unwanted SMS. Can we build a spam classifier?
The problem above urges you to classify whether a text message is SPAM or HAM based on its content.
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic: the app even references Star Trek’s “beam me up” in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them in solving some forecasting and classification problems in order to improve their business processes.
It’s almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the year-end demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business’ future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection anymore!
Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
Scotty turns out to be a very popular service in Turkey! Demand for Scotty began to overload in some regions and at some times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.
Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). Your prediction should cover the predicted coverage status for each hour and each area: "sufficient" or "insufficient".
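One way to read "automated" here is a loop that scores several candidate models on a held-out window and keeps the best one. The sketch below illustrates the idea in base R with deliberately simple candidates (overall mean, daily naive, weekly naive) on synthetic data. The candidate list, the 7-day validation window, and all names are assumptions for illustration. In a real submission the candidates would more likely be proper time series models.

```r
# Synthetic hourly demand with a daily (24-hour) pattern, 4 weeks long
set.seed(1)
hours  <- 0:(24 * 28 - 1)
demand <- pmax(0, round(20 + 10 * sin(2 * pi * hours / 24) +
                          rnorm(length(hours), sd = 2)))

# Keep the last 7 days as a validation window (time order preserved)
h     <- 24 * 7
train <- head(demand, -h)
valid <- tail(demand, h)

# Candidate forecasters: each takes a training vector and a horizon
candidates <- list(
  mean_model   = function(y, h) rep(mean(y), h),
  naive_daily  = function(y, h) rep(tail(y, 24), length.out = h),
  naive_weekly = function(y, h) rep(tail(y, 24 * 7), length.out = h)
)

mae <- function(actual, predicted) mean(abs(actual - predicted))

# Score every candidate on the validation window and pick the lowest MAE
scores <- sapply(candidates, function(f) mae(valid, f(train, h)))
best   <- names(which.min(scores))
scores
best
```

Because selection is driven only by the validation error, the same loop can be rerun for every sub-area or every new month without choosing a model by hand.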
The Airline13 dataset provides airline on-time data for all flights that departed from Newark Liberty International Airport to Charlotte Douglas International Airport in 2013. The dataset includes records of flight departures and weather conditions recorded per hour.
Through this dataset, data scientists are challenged to solve one of the most frequent problems in the airline industry: the arrival delay status of a flight. Let’s put our data science knowledge into action and solve this problem using classification algorithms.
Flight delays can put a real strain on travelers, who often have to arrange overnight accommodation until the next batch of flights heads out in the morning. Major airline companies have begun to take such problems into account and are developing a way to give their customers early notification of a flight’s arrival delay status, so that customers get the best service and compensation available.
Using the “Airline13” dataset, build a prediction model to classify the arrival delay status of a flight (using the data_train set: flights and weather records up to November 2013), which will be evaluated on December 2013. Your prediction should cover the predicted arrival delay status for each flight: "Delay" or "Not Delay".
We provide the train dataset as follows:
The observation data consists of the following variables:
id: Id of each cement mixture
cement: The amount of cement (kg) in a \(m^3\) mixture
slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture
flyash: The amount of fly ash (kg) in a \(m^3\) mixture
water: The amount of water (kg) in a \(m^3\) mixture
super_plast: The amount of Superplasticizer (kg) in a \(m^3\) mixture
coarse_agg: The amount of Coarse Aggregate (kg) in a \(m^3\) mixture
fine_agg: The amount of Fine Aggregate (kg) in a \(m^3\) mixture
age: The number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in \(MPa\) unit.
And we provide the test dataset as follows:
Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
id: Id of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\) unit.
# predict target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_rm-concrete-predict.html".
Data Preprocess and Exploratory Data Analysis
Is strength positively correlated with age?
Do strength and cement have a strong correlation?
Does super_plast have a linear correlation with strength?
Model Fitting and Evaluation
Prediction Performance
Conclusion
We provide the train dataset as follows:
The observation data consists of the following variables:
id: Id of each cement mixture
cement: The amount of cement (kg) in a \(m^3\) mixture
slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture
flyash: The amount of fly ash (kg) in a \(m^3\) mixture
water: The amount of water (kg) in a \(m^3\) mixture
super_plast: The amount of Superplasticizer (kg) in a \(m^3\) mixture
coarse_agg: The amount of Coarse Aggregate (kg) in a \(m^3\) mixture
fine_agg: The amount of Fine Aggregate (kg) in a \(m^3\) mixture
age: The number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in \(MPa\) unit.
And we provide the test dataset as follows:
Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
# predict target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
The template contains:
id: Id of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\) unit.
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_concrete-rm-analysis.html".
Data Preprocess
Exploratory Data Analysis
Is strength positively correlated with age?
Do strength and cement have a strong correlation?
Does super_plast have a linear correlation with strength?
Model Fitting and Evaluation
Prediction Performance
Model Interpretation and Improvement Idea(s)
(2 Points) Reported the interpretation of each predictor and explained how much each one affects the concrete compression strength.
(3 Points) Reported all of the assumption checks using a proper testing method and/or visualization. If there is any violation, explain why it happens (e.g. existence of outliers, non-linear relationship, etc.); if there is none, propose a method to improve the model performance (and why it works).
(3 Points) Improved the model to fulfill the assumptions.
Finding the Right Material Composition
Conclusion
The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:
The dataset includes information about:
transaction_date: The timestamp of a transaction
receipt_number: The ID of a transaction
item_id: The ID of an item in a transaction
item_group: The group ID of an item in a transaction
item_major_group: The major-group ID of an item in a transaction
quantity: The quantity of the purchased item
price_usd: The price of the purchased item
total_usd: The total price of the purchased item
payment_type: The payment method
sales_type: The sales method
The test dataset should serve as a template for submission:
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
datetime: Timestamp (equivalent to transaction_date)
visitor: Estimated number of visitor(s)
library(dplyr)
# forecast the target using your model
# (assumes `model` was fitted with the forecast package)
forecast_mod <- forecast(model, ...)
# create submission data; a forecast object keeps its point forecasts in $mean
submission <- data_test %>%
  mutate(visitor = as.numeric(forecast_mod$mean))
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_fnb-ts-single.html".
Data Preprocess
Seasonality Analysis
Model Fitting and Evaluation
Prediction Performance
Conclusion
We provide the train dataset as follows:
The observation data consists of the following variables:
datetime: Timestamp
text: The content of the message
status: The spam/ham label of each message.
We provide the test dataset as follows:
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
datetime: Timestamp
status: The spam/ham label of each message.
library(dplyr)
# predict on the test data
pred_test <- predict(model, ...)
# create submission data
submission <- data_test %>%
  mutate(status = pred_test) %>%
  select(-text)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_sms-cl-spam.html".
Data Preprocess and Exploratory Data Analysis
Model Selection and Evaluation
Prediction Performance
(1 Point) Accuracy in (your own) validation dataset reaches > 80%.
(1 Point) Sensitivity in (your own) validation dataset reaches > 80%.
(1 Point) Specificity in (your own) validation dataset reaches > 85%.
(1 Point) Precision in (your own) validation dataset reaches > 90%.
(2 Points) Accuracy in test dataset reaches > 80%.
(2 Points) Sensitivity in test dataset reaches > 80%.
(2 Points) Specificity in test dataset reaches > 85%.
(2 Points) Precision in test dataset reaches > 90%.
Conclusion
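The four metrics named under Prediction Performance can all be read off a confusion matrix. Below is a small base R sketch with invented labels. Treating "spam" as the positive class is an assumption of this example, and packages such as caret (`confusionMatrix()`) report the same quantities.

```r
# Hypothetical validation labels and predictions (invented for illustration)
actual <- factor(
  c("spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam", "ham", "ham"),
  levels = c("spam", "ham")
)
predicted <- factor(
  c("spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "ham"),
  levels = c("spam", "ham")
)

# Confusion matrix: rows are predictions, columns are actual labels
cm <- table(predicted = predicted, actual = actual)

# Assumption: "spam" is the positive class
tp <- cm["spam", "spam"]; fp <- cm["spam", "ham"]
fn <- cm["ham", "spam"];  tn <- cm["ham", "ham"]

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),  # share of all messages classified correctly
  sensitivity = tp / (tp + fn),       # share of actual spam that was caught
  specificity = tn / (tn + fp),       # share of actual ham that was kept
  precision   = tp / (tp + fp)        # share of predicted spam that is really spam
)
round(metrics, 2)
```

With these toy labels the classifier gets 8 of 10 messages right, so accuracy is 0.8, while sensitivity and precision are both 0.75 and specificity is about 0.83.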
The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:
The dataset includes information about:
id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Start time of the request (timestamp)
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in km)
status: Trip status (every status is considered a demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
The test dataset should serve as a template for submission:
The template contains:
src_sub_area: Request source sub-area
datetime: Timestamp (equivalent to start_time)
demand: Estimated number of demand(s)
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
library(dplyr)
# forecast the target using your model
forecast_mod <- forecast(model, ...)
# create submission data; forecast_mod must be a numeric vector of point
# forecasts ordered like the rows of data_test
submission <- data_test %>%
  mutate(demand = forecast_mod)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-ts.html".
Data Preprocess
Cross-Validation Scheme
Model Selection
Prediction Performance
(1 Point) Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.
(1 Point) Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for all sub-areas in (your own) evaluation dataset.
(2 Points) Reached MAE < 12 for sub-area sxk97 in test dataset.
(2 Points) Reached MAE < 11 for sub-area sxk9e in test dataset.
(2 Points) Reached MAE < 10 for sub-area sxk9s in test dataset.
(2 Points) Reached MAE < 11 for all sub-areas in test dataset.
Conclusion
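The MAE thresholds above are checked per sub-area and across all sub-areas. A minimal base R sketch of that calculation is given below. The demand and forecast numbers are invented, and the `forecast` column name is an assumption of this example.

```r
# Hypothetical actual vs forecast hourly demand for the three sub-areas
eval_df <- data.frame(
  src_sub_area = rep(c("sxk97", "sxk9e", "sxk9s"), each = 4),
  demand       = c(20, 35, 50, 10,  15, 30, 25, 5,  40, 60, 55, 20),
  forecast     = c(25, 30, 40, 12,  10, 33, 31, 9,  45, 52, 60, 14)
)

# Absolute error of every hourly forecast
eval_df$abs_error <- abs(eval_df$demand - eval_df$forecast)

# MAE for each sub-area
mae_by_area <- tapply(eval_df$abs_error, eval_df$src_sub_area, mean)

# MAE across all sub-areas
mae_all <- mean(eval_df$abs_error)

mae_by_area
mae_all
```

On this toy data the per-area MAE values are 5.5, 4.5, and 6, and the overall MAE is about 5.33.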
The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:
The dataset includes information about:
id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Start time of the request (timestamp)
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in km)
status: Trip status (every status is considered a demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
The test dataset should serve as a template for submission:
The template contains:
src_area: Request source area
datetime: Timestamp (equivalent to start_time)
coverage: Estimated coverage status; sufficient or insufficient
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
library(dplyr)
# predict the target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data_test %>%
  mutate(coverage = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-cl-cov.html".
Data Preprocess
Exploratory Data Analysis
Model Fitting and Evaluation
Prediction Performance
(1 Point) Reached Accuracy > 75% in (your own) validation dataset.
(1 Point) Reached Sensitivity > 85% in (your own) validation dataset.
(1 Point) Reached Specificity > 70% in (your own) validation dataset.
(1 Point) Reached Precision > 75% in (your own) validation dataset.
(2 Points) Reached Accuracy > 75% in test dataset.
(2 Points) Reached Sensitivity > 85% in test dataset.
(2 Points) Reached Specificity > 70% in test dataset.
(2 Points) Reached Precision > 75% in test dataset.
Conclusion
The train dataset contains detailed flight records and weather conditions recorded per hour from January 1st 2013 to November 30th 2013. The two datasets can be joined by their recorded time.
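Joining by recorded time can be sketched with base R's `merge()` on the shared `time_hour` column. The two data frames below are tiny synthetic stand-ins. The `id` column on the flight side and the UTC time zone are assumptions of this example, not something the dataset description states.

```r
# Tiny synthetic stand-ins; most column names follow the variable lists
# of this case, the values are invented
flights <- data.frame(
  id        = 1:3,
  carrier   = c("UA", "US", "EV"),
  time_hour = as.POSIXct(c("2013-01-01 06:00:00", "2013-01-01 06:00:00",
                           "2013-01-01 09:00:00"), tz = "UTC")
)
weather <- data.frame(
  time_hour  = as.POSIXct(c("2013-01-01 06:00:00", "2013-01-01 07:00:00"),
                          tz = "UTC"),
  temp       = c(39.0, 39.9),
  wind_speed = c(10.4, 12.7)
)

# Left join: keep every flight, attach the weather recorded at the same hour
flights_weather <- merge(flights, weather, by = "time_hour", all.x = TRUE)

# Flights without a matching weather record get NA and need handling later
flights_weather
```

A left join keeps one row per flight, which is what the submission needs. On the real data, columns that exist in both tables (such as year, month, day, and hour) would receive `.x` and `.y` suffixes unless they are dropped or included in `by`.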
The flight data consists of the following variables:
year, month, day: Date of departure.
dep_time: Actual departure times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay: Departure delays, in minutes. Negative times represent early departures.
arr_delay: Flight arrival status; Delay or Not Delay.
carrier: Two letter carrier (airlines) abbreviation.
flight: Flight number.
tailnum: Plane tail number.
origin, dest: Airports of origin and destination.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight (YYYYMMDD HHMMSS).
The weather data consists of the following variables:
year, month, day, hour: Time of recording.
temp: Temperature in Fahrenheit.
dewp: Dewpoint in Fahrenheit.
humid: Relative humidity.
wind_dir: Wind direction (in degrees).
wind_speed: Wind speed (in mph).
wind_gust: Wind gust speed (in mph).
precip: Precipitation, in inches.
pressure: Sea level pressure in millibars.
visib: Visibility in miles.
time_hour: Date and hour of the recording (YYYYMMDD HHMMSS).
We provide the test dataset as follows:
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
id: Flight id
arr_status: Flight arrival status; Delay or Not Delay.
library(dplyr)
# predict on the test data
pred_test <- predict(model, ...)
# create submission data
submission <- data_test %>%
  mutate(arr_status = pred_test) %>%
  select(id, arr_status)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 rows
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_airline13-cl-del.html".
Data Wrangling
Exploratory Data Analysis
prediction model the most.
Model Fitting and Evaluation
Prediction Performance
(1 Point) Reached Accuracy > 75% in (your own) validation dataset.
(1 Point) Reached Sensitivity > 73% in (your own) validation dataset.
(1 Point) Reached Specificity > 75% in (your own) validation dataset.
(1 Point) Reached Precision > 70% in (your own) validation dataset.
(2 Points) Reached Accuracy > 75% in test dataset.
(2 Points) Reached Sensitivity > 73% in test dataset.
(2 Points) Reached Specificity > 75% in test dataset.
(2 Points) Reached Precision > 70% in test dataset.
Conclusion
After finishing your data preprocessing, modeling, and model evaluation, the next steps are:
Predict on the data_test.csv that comes with the case.
Use the submission-example that each case provides to store your prediction. You should save your prediction as a .csv file.
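Since every case asks for the same columns and observations as its template, a quick self-check before uploading can save a failed submission. The sketch below is only one possible check. The `template` and `submission` data frames are invented stand-ins for the submission-example and your own predictions, and the output file name is hypothetical.

```r
# Hypothetical template and submission; in practice read them with read.csv()
template   <- data.frame(id = c("S1", "S2", "S3"), strength = c(0, 0, 0))
submission <- data.frame(id = c("S1", "S2", "S3"), strength = c(31.2, 45.8, 27.5))

# Same column names in the same order
same_columns <- identical(names(submission), names(template))

# Same number of observations and the same ids in the same order
same_rows <- nrow(submission) == nrow(template) &&
  identical(submission$id, template$id)

# No missing predictions
no_missing <- !anyNA(submission$strength)

# Stop with an error if any check fails
stopifnot(same_columns, same_rows, no_missing)

# Written to a temporary path here; use your own file name for the real upload
out_file <- file.path(tempdir(), "submission-example-check.csv")
write.csv(submission, out_file, row.names = FALSE)
```

The same three checks apply to the other cases once `id` and `strength` are swapped for that case's key and target columns.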