As the final step of the machine learning course, every student must complete 1 case as their machine learning capstone project. This article describes the cases you can choose from.
There are 5 different datasets. From each dataset, you can choose 1 case, each with a set of rubrics/requirements you need to fulfill to get a score.
All datasets used in the capstone can be found in this link.
In the link provided, there are 2 datasets for each case: a train dataset and a test dataset.
The train dataset will be used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your prediction of the test dataset to the leaderboard in order to obtain the final model evaluation (more details are provided below). The data scheme is illustrated as follows:
This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on how to predict the compression strength of a concrete structure.
“Concrete is the most important material in civil engineering,” as Prof. I-Cheng Yeh said.
Concrete compression strength is determined not only by the water-cement mixture but also by other ingredients and by how we treat the mixture. Using this dataset, we are going to find “the perfect recipe” to predict the concrete’s compression strength, and explain how the ingredient concentrations and the age of testing relate to the compression strength.
It takes too long to observe a concrete structure’s strength, especially when the resting time is long, say 6 months. Why not predict the compression strength instead of waiting for 6 months?
Your goal is to predict the compression strength based on the mixture properties.
How much does the structure’s compression strength increase or decrease when you add more water? Can the concrete structure gain more compression strength when you leave it to rest longer?
Your goal is to build a linear regression model that fulfills all assumptions. Interpret how each ingredient and the age of testing affect the concrete compression strength.
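As a rough illustration of what fitting a linear regression and checking its assumptions can look like, here is a minimal base R sketch. It runs on simulated data, not the capstone dataset: the data frame `toy`, the coefficients used to generate it, and the helper `vif_manual()` are all hypothetical, and only the column names `cement`, `water`, `age`, and `strength` are borrowed from the case.

```r
# Simulated stand-in for the concrete data (NOT the real dataset)
set.seed(100)
n <- 200
toy <- data.frame(
  cement = runif(n, 100, 500),
  water  = runif(n, 120, 250),
  age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
toy$strength <- 5 + 0.08 * toy$cement - 0.10 * toy$water +
  6 * log(toy$age) + rnorm(n, sd = 4)

# Fit a linear regression; log(age) is one possible way to handle a
# non-linear effect of resting time
model <- lm(strength ~ cement + water + log(age), data = toy)

# Interpretation: each coefficient is the expected change in strength (MPa)
# for a one-unit increase in that term, holding the others fixed
coefs <- coef(model)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
shapiro_p <- shapiro.test(residuals(model))$p.value

# Multicollinearity: hypothetical helper computing VIF = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vif <- vif_manual(toy, c("cement", "water", "age"))

# Linearity and constant variance are usually inspected visually, e.g. with
# plot(model, which = 1) for the residuals-vs-fitted plot
```

In practice you might prefer packaged tests such as `car::vif()` or `lmtest::bptest()`; the hand-written version above only shows what those checks measure.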
The Food and Beverage dataset is provided by Dattabot and contains detailed transactions of multiple food and beverage outlets. Using this dataset, we are challenged to do some forecasting and time series analysis to help the outlet’s owner make better business decisions.
Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, you already know that time series analysis is enough to provide a good forecast and seasonality explanation.
Please make a report of your forecasting result and seasonality explanation for the hourly number of visitors, which will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!
Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic: the app even references Star Trek’s “beam me up” in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them in solving their problems in order to improve their business processes.
It’s almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare the December forecast. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business’s future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection anymore!
Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
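One way to read “automated forecasting framework” is a loop that scores several candidate models on a hold-out week and keeps the best one. The base R sketch below illustrates that idea under stated assumptions: the hourly series `y` is simulated, and the three candidate forecasters in `candidates` are deliberately simple stand-ins, not the models the course expects you to use.

```r
set.seed(42)
hours <- 24 * 28  # four weeks of simulated hourly demand (toy data)
y <- 20 + 10 * sin(2 * pi * (seq_len(hours) %% 24) / 24) + rnorm(hours, sd = 2)

h <- 24 * 7       # hold out the last 7 days for validation
train <- head(y, -h)
valid <- tail(y, h)

# Candidate forecasters: each takes a training vector and a horizon
candidates <- list(
  mean_fc       = function(x, h) rep(mean(x), h),                 # overall mean
  snaive_daily  = function(x, h) rep(tail(x, 24), length.out = h),  # repeat last day
  snaive_weekly = function(x, h) rep(tail(x, 168), length.out = h)  # repeat last week
)

mae <- function(actual, predicted) mean(abs(actual - predicted))

# Score every candidate on the hold-out week and pick the lowest MAE
scores <- sapply(candidates, function(f) mae(valid, f(train, h)))
best <- names(which.min(scores))

# Re-run the winning candidate on all available data for the final forecast
final_forecast <- candidates[[best]](y, h)
```

The same loop can be applied per sub-area, so that each area ends up with whichever candidate validates best for it.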
Scotty turns out to be a very popular service in Turkey! Demand began to overload in some regions at certain times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.
Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). Your predictions should cover the coverage status for each hour and each area: "sufficient" or "insufficient".
The SMS dataset is collected by team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message.
Someone might contact you the old-school way, through SMS, and you might even miss it because the amount of spam in your inbox is just too much. Messages labeled as spam were collected through users’ reports of unwanted SMS. Can we build a spam classifier?
The problem above urges you to classify whether a text message is SPAM or HAM based on its content.
“Where Were You” is a challenge for those who wish to learn more about solving problems with unstructured data from a collection of images. The data consists of images with 3 different labels: "Beach", "Forest", or "Mountain". The data were collected by scraping images directly from Google image search.
Through this dataset, you are expected to solve an image classification problem by building a model that can extract information from images and give the correct label. If you are familiar with deep learning, this is your chance to learn and implement a deep learning model, which is very good at dealing with unstructured data such as text and images.
Image classification is beneficial in many fields. In social media, a face recognition system automatically detects faces and tags your friends if they are present in your posts. In wildlife conservation, image classification helps researchers label camera-trap images based on animal presence. In this case, you will build an image classifier to help a stock photo website categorize its image database by thematic location. Why is this an important task? You can check how Unsplash, a stock photo website, uses deep learning to organize and tag each image in its collection.
Using the “Where Were You” dataset, build a prediction model to classify the place captured in an image, using the collection of images inside the train folder. Submit your prediction for the images located in the test folder. Your prediction should classify whether the image shows a "Forest", a "Mountain", or a "Beach".
You can find a detailed explanation of each case in the video and the description below.
We provide the train dataset as follows:
The observation data consists of the following variables:
id: ID of each cement mixture
cement: The amount of cement (Kg) in a \(m^3\) mixture
slag: The amount of blast furnace slag (Kg) in a \(m^3\) mixture
flyash: The amount of fly ash (Kg) in a \(m^3\) mixture
water: The amount of water (Kg) in a \(m^3\) mixture
super_plast: The amount of superplasticizer (Kg) in a \(m^3\) mixture
coarse_agg: The amount of coarse aggregate (Kg) in a \(m^3\) mixture
fine_agg: The amount of fine aggregate (Kg) in a \(m^3\) mixture
age: The number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in \(MPa\)
And we provide the test dataset as follows:
Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
id: ID of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\)
# predict target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_rm-concrete-predict.html".
Data Preprocess and Exploratory Data Analysis
Is strength positively correlated with age?
Do strength and cement have a strong correlation?
Does super_plast have a linear correlation with strength?
Model Fitting and Evaluation
Prediction Performance
Interpretation
Conclusion
We provide the train dataset as follows:
The observation data consists of the following variables:
id: ID of each cement mixture
cement: The amount of cement (Kg) in a \(m^3\) mixture
slag: The amount of blast furnace slag (Kg) in a \(m^3\) mixture
flyash: The amount of fly ash (Kg) in a \(m^3\) mixture
water: The amount of water (Kg) in a \(m^3\) mixture
super_plast: The amount of superplasticizer (Kg) in a \(m^3\) mixture
coarse_agg: The amount of coarse aggregate (Kg) in a \(m^3\) mixture
fine_agg: The amount of fine aggregate (Kg) in a \(m^3\) mixture
age: The number of resting days before the compressive strength measurement
strength: Concrete compressive strength measurement in \(MPa\)
And we provide the test dataset as follows:
Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
# predict target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
The template contains:
id: ID of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_concrete-rm-analysis.html".
Data Preprocess
Exploratory Data Analysis
Is strength positively correlated with age?
Do strength and cement have a strong correlation?
Does super_plast have a linear correlation with strength?
Model Fitting and Evaluation
Prediction Performance
Model Interpretation and Improvement Idea(s)
(4 Points) Reported the interpretation of each predictor and explained how much each affects the concrete compression strength.
(4 Points) Reported all assumption checks using a proper testing method and/or visualization. If there is any violation, explain why it happens (e.g. the existence of outliers, a non-linear relationship, etc.); if there is none, propose a method to improve the model performance (and why it works).
(3 Points) Tried improving the model to fulfill the assumptions.
Finding the Right Material Composition
Conclusion
The train dataset contains transaction details from December 1st 2017 to February 18th 2018. The shop opens from 10 am to 10 pm every day:
The dataset includes information about:
transaction_date: The timestamp of a transaction
receipt_number: The ID of a transaction
item_id: The ID of an item in a transaction
item_group: The group ID of an item in a transaction
item_major_group: The major-group ID of an item in a transaction
quantity: The quantity of the purchased item
price_usd: The price of the purchased item
total_usd: The total price of the purchased item
payment_type: The payment method
sales_type: The sales method
The test dataset should serve as a template for submission:
Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
datetime: Timestamp (equivalent to transaction_date)
visitor: Estimated number of visitor(s)
library(dplyr)  # provides %>% and mutate()
# forecast the target using your model
forecast_mod <- forecast(model, ...)
# create submission data
submission <- data_test %>%
  mutate(visitor = forecast_mod)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_fnb-ts-single.html".
Data Preprocess
Seasonality Analysis
Model Fitting and Evaluation
Prediction Performance
Conclusion
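For the Food and Beverage case, the raw data is one row per purchased item, while the target is an hourly visitor count. The base R sketch below shows one possible way to bridge that gap. It uses a tiny made-up data frame `trx`, and it assumes that one unique `receipt_number` counts as one visitor, which is an interpretation rather than something the case states.

```r
# Toy transactions (hypothetical values); column names follow the dataset description
trx <- data.frame(
  transaction_date = as.POSIXct(c(
    "2017-12-01 13:05:00", "2017-12-01 13:05:00",
    "2017-12-01 13:40:00", "2017-12-01 15:10:00"
  ), tz = "UTC"),
  receipt_number = c("A1", "A1", "A2", "A3")
)

# Floor each timestamp to the start of its hour
trx$datetime <- as.POSIXct(trunc(trx$transaction_date, units = "hours"))

# Assumption: one unique receipt = one visitor
hourly <- aggregate(receipt_number ~ datetime, data = trx,
                    FUN = function(x) length(unique(x)))
names(hourly)[2] <- "visitor"

# Pad hours that have no transactions with 0 so the series is regular
grid <- data.frame(
  datetime = seq(min(hourly$datetime), max(hourly$datetime), by = "hour")
)
hourly <- merge(grid, hourly, all.x = TRUE)
hourly$visitor[is.na(hourly$visitor)] <- 0
```

On the real data you would probably also restrict the grid to opening hours before building a time series object, so that closed hours do not appear as zero-visitor observations.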
The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:
The dataset includes information about:
id: Transaction ID
trip_id: Trip ID
driver_id: Driver ID
rider_id: Rider ID
start_time: Start time of the request
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in KM)
status: Trip status (all statuses are considered demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
The test dataset should serve as a template for submission:
The template contains:
src_sub_area: Request source sub-area
datetime: Timestamp (equivalent to start_time)
demand: Estimated number of demand(s)
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
library(dplyr)  # provides %>% and mutate()
# forecast the target using your model
forecast_mod <- forecast(model, ...)
# create submission data
submission <- data_test %>%
  mutate(demand = forecast_mod)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-ts.html".
Data Preprocess
Cross-Validation Scheme
Automated Model Selection
Prediction Performance
(1 Point) Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.
(1 Point) Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for all sub-areas in (your own) evaluation dataset.
(2 Points) Reached MAE < 12 for sub-area sxk97 in test dataset.
(2 Points) Reached MAE < 11 for sub-area sxk9e in test dataset.
(2 Points) Reached MAE < 10 for sub-area sxk9s in test dataset.
(2 Points) Reached MAE < 11 for all sub-areas in test dataset.
Conclusion
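The Prediction Performance rubric above is stated in terms of MAE (mean absolute error), both per sub-area and overall. A minimal base R sketch of that computation follows; the data frame `eval_df` and its numbers are invented for illustration, and the column name `forecast` is an assumption, not part of the submission template.

```r
# Mean absolute error: average size of the forecast errors
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up evaluation data: actual demand vs. forecast for two sub-areas
eval_df <- data.frame(
  src_sub_area = c("sxk97", "sxk97", "sxk9e", "sxk9e"),
  demand       = c(10, 20, 5, 7),
  forecast     = c(12, 17, 5, 10)
)

# MAE per sub-area
mae_by_area <- sapply(split(eval_df, eval_df$src_sub_area),
                      function(d) mae(d$demand, d$forecast))

# MAE across all sub-areas combined
mae_overall <- mae(eval_df$demand, eval_df$forecast)
```

Here `mae_by_area` is 2.5 for sxk97 and 1.5 for sxk9e, and `mae_overall` is 2.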
The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:
The dataset includes information about:
id: Transaction ID
trip_id: Trip ID
driver_id: Driver ID
rider_id: Rider ID
start_time: Start time of the request
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in KM)
status: Trip status (all statuses are considered demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)
The test dataset should serve as a template for submission:
The template contains:
src_area: Request source area
datetime: Timestamp (equivalent to start_time)
coverage: Estimated coverage status; sufficient or insufficient
Please follow the content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
library(dplyr)  # provides %>% and mutate()
# predict the target using your model
pred_test <- predict(model, ...)
# create submission data
submission <- data_test %>%
  mutate(coverage = pred_test)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-cl-cov.html".
Data Preprocess
Exploratory Data Analysis
Model Fitting and Evaluation
Prediction Performance
(1 Point) Reached Accuracy > 75% in (your own) validation dataset.
(1 Point) Reached Sensitivity > 85% in (your own) validation dataset.
(1 Point) Reached Specificity > 70% in (your own) validation dataset.
(1 Point) Reached Precision > 75% in (your own) validation dataset.
(2 Points) Reached Accuracy > 75% in test dataset.
(2 Points) Reached Sensitivity > 85% in test dataset.
(2 Points) Reached Specificity > 70% in test dataset.
(2 Points) Reached Precision > 75% in test dataset.
Interpretation
Conclusion
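The four metrics in the Prediction Performance rubric above all come from the same confusion matrix. The base R sketch below computes them by hand on ten invented predictions. It assumes "insufficient" is treated as the positive class, which the case description does not state explicitly, so check that against the leaderboard before relying on it.

```r
# Assumption: "insufficient" is the positive class
positive <- "insufficient"
negative <- "sufficient"

# Invented labels for illustration only
actual <- factor(c(rep("insufficient", 5), rep("sufficient", 5)),
                 levels = c(positive, negative))
predicted <- factor(c("insufficient", "insufficient", "insufficient",
                      "insufficient", "sufficient",
                      "sufficient", "sufficient", "sufficient",
                      "insufficient", "insufficient"),
                    levels = c(positive, negative))

# Confusion matrix: rows are predictions, columns are actual labels
cm <- table(predicted = predicted, actual = actual)
tp <- cm[positive, positive]  # predicted positive, actually positive
fn <- cm[negative, positive]  # predicted negative, actually positive
fp <- cm[positive, negative]  # predicted positive, actually negative
tn <- cm[negative, negative]  # predicted negative, actually negative

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),  # share of actual positives that were caught
  specificity = tn / (tn + fp),  # share of actual negatives that were caught
  precision   = tp / (tp + fp)   # share of positive predictions that were right
)
```

If you use `caret::confusionMatrix()` instead, its `positive` argument controls the same choice of positive class.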
We provide the train dataset as follows:
The observation data consists of the following variables:
datetime: Timestamp
text: The content of the message
status: The spam/ham label of each message
We provide the test dataset as follows:
Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
datetime: Timestamp
status: The spam/ham label of each message
library(dplyr)  # provides %>%, mutate() and select()
# predict on data test
pred_test <- predict(model, ...)
# create submission data
submission <- data_test %>%
  mutate(status = pred_test) %>%
  select(-text)
# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_sms-cl-spam.html".
Data Preprocess and Exploratory Data Analysis
Model Selection and Evaluation
Prediction Performance
(1 Point) Accuracy in (your own) validation dataset reaches > 80%.
(1 Point) Sensitivity in (your own) validation dataset reaches > 80%.
(1 Point) Specificity in (your own) validation dataset reaches > 85%.
(1 Point) Precision in (your own) validation dataset reaches > 90%.
(2 Points) Accuracy in test dataset reaches > 80%.
(2 Points) Sensitivity in test dataset reaches > 80%.
(2 Points) Specificity in test dataset reaches > 85%.
(2 Points) Precision in test dataset reaches > 90%.
Interpretation
Conclusion
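Classifying a message by its content usually starts with turning raw text into word counts. The base R sketch below shows one bare-bones way to do that on three invented messages; the vector `sms` and the cleaning rules are illustrative assumptions, and a real report would more likely rely on a text-mining package with stopword removal and stemming.

```r
# Invented example messages (not from the SMS dataset)
sms <- c("WIN a free prize now",
         "are you free tonight",
         "Free prize!!! call now")

# Basic cleaning: lowercase, then replace anything that is not a letter or space
clean <- tolower(sms)
clean <- gsub("[^a-z ]", " ", clean)

# Tokenize on runs of spaces
tokens <- strsplit(trimws(clean), " +")

# Vocabulary: every distinct word across all messages
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per message, one column per word, cells are counts
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
```

Each row of `dtm` can then serve as the feature vector of one message for a classifier such as naive Bayes.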
All image data for the train dataset is located inside the data/train folder.
We provide the test dataset as follows:
All image data for the test dataset is located inside the data/test folder.
Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.
The template contains:
id: Image ID from the test folder
label: Image label; Beach, Forest or Mountain.
library(dplyr)    # provides %>%, mutate() and case_when()
library(stringr)  # provides str_remove()
# predict label on array
pred_test <- predict_classes(model, test_x)
# convert encoding to label
decode <- function(x){
  case_when(x == 0 ~ "beach",
            x == 1 ~ "forest",
            x == 2 ~ "mountain"
  )
}
# create data submission
submission <- data.frame(id = test_file_name,
                         label = sapply(pred_test, decode)
) %>%
  mutate(id = str_remove(id, "data/test/")) # remove file path and only keep the file name
# write submission
write.csv(submission, "submission-david.csv")
# check first 3 data
head(submission, 3)
Prepare a report for this case explaining every part listed in “Rubrics” section. Export it as an .html file with format “yourname_wherewereyou-cl.html”.
Data Preprocess and Exploratory Data Analysis
(2 Points) Explore the distribution of the image dimensions (height and width).
(2 Points) Demonstrate and explain how to do image augmentation on the dataset with an image generator.
(2 Points) Explore the label/class distribution of the target variable.
Model Fitting and Evaluation (12 points)
(2 Points) Demonstrate and explain how to prepare cross-validation data for this case.
(4 Points) Demonstrate and explain how to build deep learning architecture.
(4 Points) Demonstrate how to properly do model fitting and evaluation.
(2 Points) Demonstrate and explain how to properly do model selection by comparing models or making adjustment to single model.
Prediction Performance
Conclusion
(2 Points) Write the conclusion of your capstone project.
After finishing your data preprocessing, modeling, and model evaluation, the next steps are:
Predict on the data_test.csv that comes with the case.
Use the submission-example that each case provides to store your prediction. You should save your prediction as a .csv file.
If you’re about to tackle this machine learning capstone project and are looking for clear guidance, we encourage you to explore “Capstone Machine Learning Guidance” by Algoritma, a curated resource with information and insights to help you navigate your project.