Introduction

As the final step of the machine learning course, every student must complete one case as their machine learning capstone project. This article explains the cases you can choose from.

There are 5 different datasets, and each dataset offers one or more cases. Each case comes with a set of rubrics (requirements) that you need to fulfil to earn a score.

Datasets

All datasets used in the capstone can be found at this link.

For each case, the link provides 2 datasets: a train dataset and a test dataset.

The train dataset is used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your predictions for the test dataset to the leaderboard in order to obtain the final model evaluation (more details are provided below). The data scheme is illustrated as follows:

1. Concrete Strength

This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on how to predict compression strength in a concrete structure.

“Concrete is the most important material in civil engineering”, as said by Prof. I-Cheng Yeh.

Concrete compression strength is determined not only by the water-cement mixture but also by other ingredients and by how we treat the mixture. Using this dataset, we are going to find “the perfect recipe” to predict the concrete’s compression strength, and explain how the ingredient concentrations and the age of testing relate to the compression strength.

Concrete: “Will it last forever?”

Observing a concrete structure’s strength takes a long time, especially when the resting time is long, say 6 months. Why not predict the compression strength instead of waiting for 6 months?

Your goal is to predict the compression strength based on the mixture properties.

Concrete: “Can you show me your recipe?”

How much does the structure’s compression strength increase or decrease when you add more water? Does the concrete structure gain more compression strength when you leave it to rest longer?

Your goal is to build a linear regression model that fulfills all assumptions. Interpret how each ingredient and age of testing affect the concrete compression strength.

2. Food and Beverage

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions from multiple food and beverage outlets. Using this dataset, we are challenged to do some forecasting and time series analysis to help the outlets’ owner make better business decisions.

Food & Beverage: “It’s friday night!”

Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better judgments in 2018. Fortunately, you already know that time series analysis is enough to provide a good forecast and seasonality explanation.

Please make a report of your forecasting result and seasonality explanation for the hourly number of visitors. The forecast will be evaluated on the next 7 days (Monday, February 19th 2018 to Sunday, February 25th 2018)!

3. Scotty

Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic. The app even references Star Trek’s “beam me up” in its order buttons.

Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them in solving their problems in order to improve their business processes.

Scotty: “Bring me the crystal ball!”

It’s almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare a forecast for December. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment for the business’ future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection anymore in the future!

Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!

Scotty: “There is no drivers!”

Scotty turns out to be a very popular service in Turkey! Demand for Scotty began to overload in some regions at certain times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.

Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). Your prediction should cover the predicted coverage status for each hour and each area: "sufficient" or "insufficient".

4. SMS

The SMS dataset is collected by team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message.

SMS: “I didn’t get your message!”

Someone might contact you through the old-school way of SMS, and you might even miss it because the amount of spam in your inbox is just too much. The SMS messages labeled as spam were collected through users’ reports of unwanted SMS. Can we build a spam classifier?

The problem above asks you to classify whether a text message is SPAM or HAM based on its content.

5. Where Were You?

“Where Were You” is a challenge for those who wish to learn more about solving problems with unstructured data from a collection of images. The data consists of images with 3 different labels: "Beach", "Forest", or "Mountain". The data were collected by scraping images directly from Google image search.

Through this dataset, you are expected to solve an image classification problem by building a model that can extract information from images and give the correct label. If you are familiar with deep learning, this is your chance to learn and implement a deep learning model, which is very good at dealing with unstructured data such as text and images.

Where Were You: “Image Classification”

Image classification is beneficial in many fields. In social media, a face recognition system can automatically detect faces and tag your friends if they are present in your posts. In wildlife conservation, image classification helps researchers label camera-trap images based on the animals present. In this case, you will build an image classifier to help a stock photo website categorize its image database based on the thematic location. Why is this an important task? You can check how Unsplash, a stock photo website, uses deep learning to organize and create tags for each image in its collection.

Using the “Where Were You” dataset, build a prediction model to classify the place captured in an image, using the collection of images inside the train folder. Submit your predictions for the images located in the test folder. Your prediction should classify whether the image shows a "Forest", a "Mountain", or a "Beach".

Cases

You can find a detailed explanation of each case in the video and the description below.

1. Concrete-Prediction

Data-Concrete: “Will it last forever?”

We provide the train dataset as follows:

The observation data consists of the following variables:

id: Id of each cement mixture,
cement: The amount of cement (kg) in a \(m^3\) mixture,
slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture,
flyash: The amount of fly ash (kg) in a \(m^3\) mixture,
water: The amount of water (kg) in a \(m^3\) mixture,
super_plast: The amount of superplasticizer (kg) in a \(m^3\) mixture,
coarse_agg: The amount of coarse aggregate (kg) in a \(m^3\) mixture,
fine_agg: The amount of fine aggregate (kg) in a \(m^3\) mixture,
age: The number of resting days before the compressive strength measurement,
strength: Concrete compressive strength measurement in \(MPa\) unit.

And we provide the test dataset as follows:

Please follow the submission example below as a template for submission. The values of the target variable do not represent the actual answer. Your process for obtaining the target variable may differ, but the final submission data should have the same columns and observations.

The template contains:

id : Id of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\) unit.

# Predict the target using your model
pred_test <- predict(model, ...)

# Create the submission data
submission <- data.frame(
  id = data_test$id,
  strength = pred_test
)

# Save the data (spell out FALSE; the shorthand F can be reassigned)
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_rm-concrete-predict.html".

Rubrics-Concrete: “Will it last forever?”

Data Preprocess and Exploratory Data Analysis

(2 Points) Demonstrate and explain how to apply data preprocessing to make sure that your data is “ready”, such as handling outliers.
- What data preprocessing do you perform?
- Are there any outliers?
- Do you need to scale the features or the target?
(2 Points) Explore the relation between the target and the features.
- Is strength positively correlated with age?
- Do strength and cement have a strong correlation?
- Does super_plast have a linear correlation with strength?

Model Fitting and Evaluation

(2 Points) Demonstrate how to prepare cross-validation data for this case.
- What is the proportion of the training vs testing dataset?
(2 Points) Demonstrate how to properly do model fitting and evaluation.
- What model do you use?
- How do you evaluate the model?
- Is your model overfitting?
(4 Points) Compare multiple data preprocessing approaches.
- Do you need to normalize the data?
- Do you need to log-transform the variables or scale them with a square root?
(4 Points) Compare multiple models.
- Build at least 2 models, or build one model and tune its parameters later.
- If the model is not satisfactory, what will you do to tune the model?
- Does the tuned model perform better?

Prediction Performance

(3 Points) MAE in (your own) validation dataset reaches < 4.
(3 Points) R-squared in (your own) validation dataset reaches > 90%.
(4 Points) MAE in test dataset reaches < 4.
(4 Points) R-squared in test dataset reaches > 90%.
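
The two metrics above can be computed on your own validation split with a few lines of base R. The following is a minimal sketch, not the official scoring script: the data is synthetic (only the column names follow the concrete dataset), the 80/20 split is an arbitrary choice, and `lm()` stands in for whichever model you actually use.

```r
# Synthetic stand-in for the concrete train data (NOT the real dataset)
set.seed(100)
n <- 200
concrete <- data.frame(
  cement = runif(n, 100, 500),
  water  = runif(n, 120, 250),
  age    = sample(c(3, 7, 28, 90), n, replace = TRUE)
)
concrete$strength <- 0.1 * concrete$cement - 0.15 * concrete$water +
  5 * log(concrete$age) + rnorm(n, sd = 3)

# Hold out 20% of the rows as your own validation set (assumed proportion)
idx_val    <- sample(n, size = round(0.2 * n))
data_train <- concrete[-idx_val, ]
data_val   <- concrete[idx_val, ]

# Placeholder model; replace with the model you want to evaluate
model    <- lm(strength ~ cement + water + log(age), data = data_train)
pred_val <- predict(model, newdata = data_val)

# Mean absolute error: average size of the prediction error, in MPa
mae <- mean(abs(data_val$strength - pred_val))

# R-squared: 1 - (residual sum of squares / total sum of squares)
rsq <- 1 - sum((data_val$strength - pred_val)^2) /
  sum((data_val$strength - mean(data_val$strength))^2)

c(MAE = mae, R_squared = rsq)
```

Computing both numbers on rows the model has never seen is what makes them comparable to the leaderboard score; the same two formulas applied to `data_train` would tell you whether the model is overfitting.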

Interpretation

(2 Points) Use LIME method to interpret the model that you have used.
- Do you need to scale back the data into original value in order to be more interpretable?
- How many features do you use to explain the model?
- What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?
(2 Points) Interpret the first 4 observations of the plot.
- What is the difference between interpreting black box model with LIME and using an interpretable machine learning model?
- How good is the explanation fit? What does it signify?
- What are the most and the least important factors for each observation?

Conclusion

(2 Points) Write the conclusion of your capstone project.
- Is your goal achieved?
- Can the problem be solved by machine learning?
- What model did you use and how is the performance?
- What is the potential business implementation of your capstone project?

2. Concrete-Analysis

Data-Concrete: “Can you show me your recipe?”

We provide the train dataset as follows:

The observation data consists of the following variables:

id: Id of each cement mixture,
cement: The amount of cement (kg) in a \(m^3\) mixture,
slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture,
flyash: The amount of fly ash (kg) in a \(m^3\) mixture,
water: The amount of water (kg) in a \(m^3\) mixture,
super_plast: The amount of superplasticizer (kg) in a \(m^3\) mixture,
coarse_agg: The amount of coarse aggregate (kg) in a \(m^3\) mixture,
fine_agg: The amount of fine aggregate (kg) in a \(m^3\) mixture,
age: The number of resting days before the compressive strength measurement,
strength: Concrete compressive strength measurement in \(MPa\) unit.

And we provide the test dataset as follows:

# Predict the target using your model
pred_test <- predict(model, ...)

# Create the submission data
submission <- data.frame(
  id = data_test$id,
  strength = pred_test
)

# Save the data (spell out FALSE; the shorthand F can be reassigned)
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)

The template contains:

id : Id of each cement mixture
strength: Concrete compressive strength measurement in \(MPa\) unit.

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_concrete-rm-analysis.html".

Rubrics-Concrete: “Can you show me your recipe?”

Data Preprocess

(2 Points) Demonstrate and explain how to apply data transformation, scaling, outlier handling, or any other statistical approach to make sure that your data is “ready”.
- What do you use for data transformation? Log? Log10? Square root?
- Why did you choose that transformation method?
- Which variables need to be transformed or scaled?
- Are there any outliers in the target variable? Why should we care about outliers?
(2 Points) Demonstrate and explain how to properly do feature engineering/variable selection.
- Do you remove some variables? Why?
- What method do you use to remove the variables?

Exploratory Data Analysis

(2 Points) Explore the relation between the target and the features.
- Is strength positively correlated with age?
- Do strength and cement have a strong correlation?
- Does super_plast have a linear correlation with strength?
- Other exploratory activities
(2 Points) Give informative insights from the visualization and/or any other exploratory result.
- How is the data of each variable distributed?
- How are the features correlated with each other?
- Other insights you’ve found

Model Fitting and Evaluation

(2 Points) Demonstrate how to prepare cross-validation data for this case.
- What is the proportion of the training vs testing dataset?
- How and why do you do a cross-validation scheme?
(3 Points) Demonstrate how to properly do model fitting and evaluation.
- What function do you use to build the model?
- How do you evaluate the model performance?

Prediction Performance

(1 Point) MAE in (your own) validation dataset reaches < 7.5.
(1 Point) R-squared in (your own) validation dataset reaches > 65%.
(2 Points) MAE in test dataset reaches < 7.5.
(2 Points) R-squared in test dataset reaches > 65%.

Model Interpretation and Improvement Idea(s)

(4 Points) Report the interpretation of each predictor and explain how large its effect on concrete compression strength is.
- How do you measure the effect of each predictor?
- How do you interpret the standard error of each variable?
- Does the predictor have a significant effect on the concrete compression strength?
(4 Points) Report all of the assumption checks using a proper testing method and/or visualization. If there is any violation, explain why it happens (e.g. existence of outliers, non-linear relationship, etc.); if there is none, propose a method to improve the model performance (and explain why it works).
(3 Points) Try improving the model to fulfill the assumptions.
- Do you need to transform the target variable?
- Do you need to transform the features?
- Should you transform the data using log, square root, Box-Cox or any other method?
- Explain your effort in improving the model and describe your result (achieved/not achieved).
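
As an illustration of the assumption-checking workflow described above, here is a minimal base R sketch. It uses synthetic data in which strength grows with the logarithm of age (an assumption made purely for the example, not a claim about the real dataset) and compares a model on the raw feature with one on the log-transformed feature. Only the Shapiro-Wilk normality test is run; tests such as Breusch-Pagan (`lmtest::bptest()`) or VIF (`car::vif()`) are common additions but need extra packages and are not shown.

```r
# Synthetic data (NOT the real dataset): strength grows with log(age)
set.seed(42)
n   <- 150
age <- runif(n, 1, 100)
dat <- data.frame(age = age,
                  strength = 10 + 8 * log(age) + rnorm(n, sd = 2))

# Candidate 1: raw feature; candidate 2: log-transformed feature
model_raw <- lm(strength ~ age, data = dat)
model_log <- lm(strength ~ log(age), data = dat)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
shapiro_raw <- shapiro.test(residuals(model_raw))
shapiro_log <- shapiro.test(residuals(model_log))

# Linearity and constant variance are usually judged visually, e.g.
# plot(model_raw, which = 1)  # residuals vs fitted
# plot(model_log, which = 1)

# Compare the fit of the two candidates
fit_summary <- data.frame(
  model     = c("raw", "log"),
  r_squared = c(summary(model_raw)$r.squared, summary(model_log)$r.squared),
  shapiro_p = c(shapiro_raw$p.value, shapiro_log$p.value)
)
fit_summary
```

A curved pattern in the residuals-vs-fitted plot of the raw model is the visual cue that a transformation may help; whether the transformed model then passes every test on the real data is something you have to check and report yourself.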

Finding the Right Material Composition

(2 Points) Choose one of the ingredients or age and do a test to find out the difference between compositions.
- Which predictor did you choose? Why?
- How many classes do you create for the chosen ingredient or age?
(2 Points) Do a test to find the right composition to get the maximum concrete compression strength.
- What statistical test do you use to find the difference in mean concrete compression strength?
- What is the optimal composition of ingredients or age to get the maximum or a higher concrete compression strength?
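
One possible way to approach this rubric is to bin the chosen predictor into classes and compare the mean strength between classes with a one-way ANOVA followed by Tukey's HSD. The sketch below assumes `age` is the chosen predictor, uses synthetic data, and picks three arbitrary cut points; other tests may suit your data better.

```r
# Synthetic data (NOT the real dataset)
set.seed(7)
n   <- 180
dat <- data.frame(age = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE))
dat$strength <- 15 + 6 * log(dat$age) + rnorm(n, sd = 3)

# Bin age into three classes; the cut points are an arbitrary assumption
dat$age_class <- cut(dat$age,
                     breaks = c(0, 7, 28, Inf),
                     labels = c("early", "standard", "late"))

# One-way ANOVA: is the mean strength the same in every class?
fit_aov <- aov(strength ~ age_class, data = dat)
p_value <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# Tukey's HSD: which pairs of classes differ?
tukey <- TukeyHSD(fit_aov)

# Mean strength per class, to see which class is highest
class_means <- tapply(dat$strength, dat$age_class, mean)
class_means
```

ANOVA assumes roughly normal residuals and similar variance across classes, so the assumption checks you report for the regression model are worth repeating here.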

Conclusion

(2 Points) Write the conclusion of your capstone project.
- Is your goal achieved?
- Can the problem be solved by machine learning?
- What model did you use and how is the performance?
- What is the potential business implementation of your capstone project?

3. F&B-TimeSeries

Data F&B: “It’s friday night!”

The train dataset contains transaction details from December 1st 2017 to February 18th 2018. The shop opens from 10 am to 10 pm every day:

The dataset includes information about:

transaction_date: The timestamp of a transaction
receipt_number: The ID of a transaction
item_id: The ID of an item in a transaction
item_group: The group ID of an item in a transaction
item_major_group: The major-group ID of an item in a transaction
quantity: The quantity of purchased item
price_usd: The price of purchased item
total_usd: The total price of purchased item
payment_type: The payment method
sales_type: The sales method

The test dataset should serve as a template for submission:

Please follow the content of the submission example below as a template for submission. The values of the target variable do not represent the actual answer. Your process for obtaining the target variable may differ, but the final submission data should have the same columns and observations.

The template contains:

datetime: Timestamp (equivalent to transaction_date)
visitor: Estimated number of visitor(s)

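library(dplyr)

# Forecast the target using your model
forecast_mod <- forecast(model, ...)

# Create the submission data. Assuming forecast() comes from the forecast
# package, it returns a list-like object, so use its point forecasts ($mean)
submission <- data_test %>% 
  mutate(visitor = as.numeric(forecast_mod$mean))

# Save the data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)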

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_fnb-ts-single.html".

Rubrics-F&B: “It’s friday night!”

Data Preprocess

(2 Points) Demonstrate and explain how to properly do data aggregation.
- Do you need to aggregate/summarise the number of visitors before doing time series padding?
- Do you need to filter the time to certain hours after doing time series padding?
- Do you need to replace NA values?
(2 Points) Demonstrate how to properly do time series padding.
- Should you do time series padding?
- Do you need to round the datetime to the hour or minute?
- What are the start and the end of the time interval for time series padding?
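
The aggregation and padding steps asked about above can be sketched in base R as follows. This is only an illustration on five made-up transactions: it assumes that one receipt counts as one visitor, that the padding interval runs from the first to the last calendar day at hourly steps, and that hours 10 to 22 are kept. All three choices are assumptions you should justify in your own report.

```r
# Toy transactions (NOT the real data); several rows can share one receipt
trx <- data.frame(
  transaction_date = as.POSIXct(c("2017-12-01 13:05:00", "2017-12-01 13:40:00",
                                  "2017-12-01 13:40:00", "2017-12-01 15:10:00",
                                  "2017-12-02 10:20:00"), tz = "UTC"),
  receipt_number = c("A1", "A2", "A2", "A3", "A4"),
  stringsAsFactors = FALSE
)

# 1. Floor each timestamp to the hour
hour_key <- format(trx$transaction_date, "%Y-%m-%d %H:00:00", tz = "UTC")

# 2. Aggregate: count unique receipts per hour (assumed visitor definition)
counts  <- tapply(trx$receipt_number, hour_key, function(x) length(unique(x)))
visitor <- data.frame(datetime = as.POSIXct(names(counts), tz = "UTC"),
                      visitor  = as.integer(counts))

# 3. Pad: build the complete hourly grid and look up the counted hours
padded <- data.frame(
  datetime = seq(as.POSIXct("2017-12-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2017-12-02 23:00:00", tz = "UTC"),
                 by = "hour")
)
padded$visitor <- visitor$visitor[match(padded$datetime, visitor$datetime)]

# 4. Hours without any transaction become 0 visitors
padded$visitor[is.na(padded$visitor)] <- 0

# 5. Keep opening hours only (10:00 to 22:00, assumed inclusive)
open_hour <- as.integer(format(padded$datetime, "%H", tz = "UTC"))
padded    <- padded[open_hour >= 10 & open_hour <= 22, ]

head(padded, 5)
```

Padding before filtering matters: without the zero rows, the series would silently skip quiet hours and the seasonal period (here 13 observations per day) would no longer line up.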

Seasonality Analysis

(2 Points) Compare multiple time series decomposition approaches.
- Can you decompose the time series into the observed data, trend, hourly seasonality, weekly seasonality, and the residuals?
(2 Points) Report interpretable hourly and weekly seasonality.
- Can you create a better visualization of hourly and weekly seasonality?
- How do you interpret the seasonality? Describe the interpretation.
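
To make the idea of separating hourly and weekly seasonality concrete, here is a simplified two-step sketch using only base R's `stl()`. It is one possible approach, not the required one: the series is synthetic, 13 observations per day is an assumption based on the 10 am to 10 pm opening hours, and the sequential decomposition is a rough stand-in for dedicated multi-seasonal tools such as `msts()` and `mstl()` from the forecast package.

```r
# Synthetic hourly visitors (NOT the real data): 13 open hours, 4 weeks
set.seed(1)
hours_per_day <- 13
n_days        <- 28
hour_effect <- rep(sin(seq(0, pi, length.out = hours_per_day)) * 20,
                   times = n_days)
day_effect  <- rep(rep(c(0, 0, 0, 0, 10, 15, 5), each = hours_per_day),
                   times = n_days / 7)
visitor <- 30 + hour_effect + day_effect +
  rnorm(hours_per_day * n_days, sd = 2)

# Step 1: extract the within-day (hourly) seasonality
ts_daily  <- ts(visitor, frequency = hours_per_day)
dec_daily <- stl(ts_daily, s.window = "periodic")

# Step 2: remove it, then extract the within-week seasonality
deseason   <- as.numeric(visitor - dec_daily$time.series[, "seasonal"])
ts_weekly  <- ts(deseason, frequency = hours_per_day * 7)
dec_weekly <- stl(ts_weekly, s.window = "periodic")

# Interpretable summaries: effect per opening hour and per day of week
hour_pattern  <- as.numeric(dec_daily$time.series[1:hours_per_day, "seasonal"])
week_seasonal <- as.numeric(dec_weekly$time.series[1:(hours_per_day * 7), "seasonal"])
day_pattern   <- tapply(week_seasonal, rep(1:7, each = hours_per_day), mean)

round(hour_pattern, 1)
round(day_pattern, 1)
```

Plotting `hour_pattern` against the opening hour and `day_pattern` against the weekday usually reads better than the default decomposition plot, because each bar answers one question: which hour, and which day, is busiest.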

Model Fitting and Evaluation

(4 Points) Demonstrate and explain how to prepare cross-validation data for this case.
- Do you need to do cross validation before doing time series analysis?
- How do you split the data into training and testing dataset?
(4 Points) Demonstrate and explain how to properly do model fitting and evaluation.
- What data preprocessing did you use before fitting the model?
- What time series model did you use?
- Can you visualize the actual vs estimated number of visitors?
- How do you evaluate the model performance?
(4 Points) Compare multiple model specifications.
- How many forecasting models will you use?
- Will you use exponential smoothing? Will you use ARIMA?
- How do you evaluate the model performance?
- Can you visualize the actual vs estimated number of visitors?

Prediction Performance

(6 Points) Reached MAE < 6 in (your own) validation dataset.
(6 Points) Reached MAE < 6 in test dataset.

Conclusion

(4 Points) Assumption Checking
- Does your model require assumption checking? If so, what are the results?
- Based on the seasonality, when is the number of visitors highest?

4. Scotty-TimeSeries

Data-Scotty: “Bring me the crystal ball!”

The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:

The dataset includes information about:

id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Start time of request
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in km)
status: Trip status (every status is considered as demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)

The test dataset should serve as a template for submission:

The template contains:

src_sub_area: Request source sub-area
datetime: Timestamp (equivalent to start_time)
demand: Estimated number of demand(s)

library(dplyr)

# Forecast the target using your model
forecast_mod <- forecast(model, ...)

# Create the submission data; forecast_mod must end up as a numeric vector
# with one value per row of data_test, in the same order
submission <- data_test %>% 
  mutate(demand = forecast_mod)

# Save the data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-ts.html".

Rubrics-Scotty: “Bring me the crystal ball!”

Data Preprocess

(2 Points) Demonstrate and explain how to properly do data aggregation.
- Should you floor the date to a specific time level (minutes, hours, or days)?
- How do we group the data for aggregation/summarising?
(2 Points) Demonstrate how to properly do time series padding.
- Should you do time series padding?
- Do you need to round the datetime to the hour or minute?
- What are the start and the end of the time interval for time series padding?

Cross-Validation Scheme

(2 Points) Demonstrate and explain how to prepare cross-validation data for automated model selection.
- How do you cross-validate time series data?
- Do you need to group the data by the source area?
- Do you need to make a nested data frame?
- How many observations will you use as the testing dataset?
(2 Points) Demonstrate and explain how to prepare cross-validation data for “best” model evaluation.
- Do you need to further split the train data into a training set and a validation set?
- How do you split them? Should you use the rolling origin method?
- How much of the data will be used as the validation set?
- Do we forecast with windowed data or expanding data?
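
The rolling origin method and the windowed-versus-expanding question can be illustrated with a small base R helper. `make_rolling_splits()` is a hypothetical function written for this sketch (packages such as rsample offer a `rolling_origin()` function with its own argument semantics), and the six-week series length and one-week assessment window are assumed values.

```r
# Hypothetical helper: row indices for rolling-origin resampling.
#   n          total number of observations, ordered in time
#   initial    size of the first training window
#   assess     size of each validation window
#   skip       how far the origin moves between splits
#   cumulative TRUE = expanding window, FALSE = fixed-size sliding window
make_rolling_splits <- function(n, initial, assess,
                                skip = assess, cumulative = TRUE) {
  train_ends <- seq(from = initial, to = n - assess, by = skip)
  lapply(train_ends, function(end_train) {
    begin_train <- if (cumulative) 1 else end_train - initial + 1
    list(train = begin_train:end_train,
         test  = (end_train + 1):(end_train + assess))
  })
}

# Six weeks of hourly data, four weeks initial training, one week assessed
n_obs  <- 24 * 7 * 6
splits <- make_rolling_splits(n_obs, initial = 24 * 7 * 4, assess = 24 * 7)

# Training and validation ranges of every split
t(sapply(splits, function(s) c(train_from = min(s$train),
                               train_to   = max(s$train),
                               test_from  = min(s$test),
                               test_to    = max(s$test))))

# The same scheme with a fixed-size (windowed) training set
splits_window <- make_rolling_splits(n_obs, initial = 24 * 7 * 4,
                                     assess = 24 * 7, cumulative = FALSE)
```

Unlike random cross-validation, every validation window here lies strictly after its training window, so the model is never evaluated on the past using information from the future.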

Automated Model Selection

(3 Points) Compare multiple preprocess specifications.
- Will different preprocessing give different results?
- How many kinds of preprocess specification will you prepare?
- Will you choose 2 different specifications, a log transformation and a square root transformation? Will you create another preprocessing approach?
(3 Points) Compare multiple seasonality specifications.
- How many seasonality specifications will you create?
- Will you create a model with daily seasonality only?
- Will you create a model with multiple seasonality (daily and weekly)?
(3 Points) Compare multiple model specifications.
- How many forecasting models will you use?
- Will you use exponential smoothing? Will you use ARIMA?
(3 Points) Automate the selection of the best specifications.
- Since we use multiple preprocess specifications, seasonalities, and models, can you make an automated script to summarise the results?
- How do you measure the model performance?
- Which model and specifications have the best performance?
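
The automation idea above boils down to building a grid of specifications, scoring every combination on held-out data, and picking the lowest error. Below is a minimal base R sketch under loud assumptions: the demand series is synthetic, and the two "models" (a seasonal naive forecast and an hour-of-day average) are deliberately simple stand-ins for exponential smoothing or ARIMA, chosen so the example runs without extra packages.

```r
# Synthetic hourly demand (NOT the real data): 3 weeks, daily seasonality
set.seed(3)
period <- 24
n_days <- 21
hour_of_day <- rep(0:(period - 1), times = n_days)
demand <- round(20 + 15 * sin(2 * pi * hour_of_day / period) +
                  rnorm(period * n_days, sd = 3))
demand[demand < 0] <- 0

# Hold out the last 7 days for evaluation
horizon <- period * 7
train   <- head(demand, -horizon)
test    <- tail(demand, horizon)

# Preprocess specifications: a transformation and its inverse
transforms <- list(
  log  = list(fwd = function(x) log(x + 1), inv = function(x) exp(x) - 1),
  sqrt = list(fwd = function(x) sqrt(x),    inv = function(x) x^2)
)

# Model specifications (stand-ins): take a series y, return h forecasts
models <- list(
  seasonal_naive = function(y, h) rep(tail(y, period), length.out = h),
  hourly_mean = function(y, h) {
    hour      <- (seq_along(y) - 1) %% period
    means     <- tapply(y, hour, mean)
    next_hour <- (length(y) + seq_len(h) - 1) %% period
    as.numeric(means[as.character(next_hour)])
  }
)

# Every combination of transformation and model
specs <- expand.grid(transform = names(transforms), model = names(models),
                     stringsAsFactors = FALSE)

# Fit on the transformed scale, back-transform, score with MAE
specs$mae <- mapply(function(tr, md) {
  y_hat <- models[[md]](transforms[[tr]]$fwd(train), horizon)
  mean(abs(test - transforms[[tr]]$inv(y_hat)))
}, specs$transform, specs$model, USE.NAMES = FALSE)

# Best specification by validation MAE
best <- specs[which.min(specs$mae), ]
specs[order(specs$mae), ]
```

Adding a seasonality specification or a real forecasting model means adding one more column to `expand.grid()` or one more entry to `models`; the scoring loop stays the same, which is the point of automating it.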

Prediction Performance

(1 Point) Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.
(1 Point) Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.
(1 Point) Reached MAE < 11 for all sub-areas in (your own) evaluation dataset.
(2 Points) Reached MAE < 12 for sub-area sxk97 in test dataset.
(2 Points) Reached MAE < 11 for sub-area sxk9e in test dataset.
(2 Points) Reached MAE < 10 for sub-area sxk9s in test dataset.
(2 Points) Reached MAE < 11 for all sub-areas in test dataset.

Conclusion

(4 Points) Assumption Checking
- Does your model require assumption checking? If so, what are the results?
- Based on the seasonality, when is demand highest?

5. Scotty-Classification

Data-Scotty: “There is no drivers!”

The train dataset contains detailed transaction details from October 1st 2017 to December 2nd 2017:

The dataset includes information about:

id: Transaction id
trip_id: Trip id
driver_id: Driver id
rider_id: Rider id
start_time: Start time of request
src_lat: Request source latitude
src_lon: Request source longitude
src_area: Request source area
src_sub_area: Request source sub-area
dest_lat: Requested destination latitude
dest_lon: Requested destination longitude
dest_area: Requested destination area
dest_sub_area: Requested destination sub-area
distance: Trip distance (in km)
status: Trip status (every status is considered as demand)
confirmed_time_sec: Time difference from request to confirmation (in seconds)

The test dataset should serve as a template for submission:

The template contains:

src_area: Request source area
datetime: Timestamp (equivalent to start_time)
coverage: Estimated coverage status; sufficient or insufficient

library(dplyr)

# Predict the target using your model
pred_test <- predict(model, ...)

# Create the submission data
submission <- data_test %>% 
  mutate(coverage = pred_test)

# Save the data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_scotty-cl-cov.html".

Rubrics-Scotty: “There is no drivers!”

Data Preprocess

(2 Points) Demonstrated how to properly do data aggregation.
- Should you floor the date to a specific time level (minutes, hours, or days)?
- How do we group the data for aggregation/summarising?
(2 Points) Demonstrated how to properly do time series padding.
- Determine the start and end of the padding interval.
- Pad the time data to a specific time interval (minutes, hours, or days) before doing any EDA or further preprocessing, so that all intervals are equal.
- Fill the NA counts in the new time intervals with 0 or use any other imputation method.

Exploratory Data Analysis

(2 Points) Explored the distribution of the target variable.
- See the overall class proportion of the target variable
- See the class proportion of the target variable in each area (3 areas)
(2 Points) Explored the relation between the target and the features.
- Find patterns or correlations between the target and the features
- Use a heatmap of time (hour) and weekdays, grouped by area, to find the pattern

Model Fitting and Evaluation

(2 Points) Demonstrated how to prepare cross-validation data for this case.
- What is the proportion of the training vs testing dataset?
(2 Points) Demonstrated how to properly do data preprocess and feature engineering.
- Explain the details of data preprocessing
- Explain feature engineering/variable selection, including removing unused variables
- Do upsampling or downsampling (based on the class proportion)
(2 Points) Demonstrated how to properly do model fitting and evaluation.
- What model will be used?
- How do you set the model’s parameters?
(2 Points) Demonstrated how to properly do model selection by comparing models or making adjustments to a single model.
- Is the model overfitting?
- Did you use a confusion matrix?
- Did you use accuracy, precision, sensitivity, and specificity? Which metric is considered the most important in this case?
- How is the sensitivity-specificity trade-off?
- How is the precision-recall trade-off? What is the optimal threshold to get a better trade-off between sensitivity and precision?
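
The four metrics and the threshold trade-off mentioned above can be explored with a confusion matrix built in base R. In this sketch the labels and predicted probabilities are simulated, `evaluate_threshold()` is a hypothetical helper, and treating "insufficient" as the positive class is an assumption; check which class the leaderboard treats as positive before relying on it.

```r
# Simulated labels and predicted probabilities (NOT real model output)
set.seed(11)
n <- 300
actual <- factor(sample(c("insufficient", "sufficient"), n,
                        replace = TRUE, prob = c(0.6, 0.4)),
                 levels = c("insufficient", "sufficient"))
prob_insufficient <- ifelse(actual == "insufficient",
                            rbeta(n, 4, 2), rbeta(n, 2, 4))

# Hypothetical helper: metrics at one probability threshold.
# `positive` is an assumed choice of positive class.
evaluate_threshold <- function(prob, actual, threshold,
                               positive = "insufficient") {
  negative  <- setdiff(levels(actual), positive)
  predicted <- factor(ifelse(prob >= threshold, positive, negative),
                      levels = levels(actual))
  cm <- table(predicted = predicted, actual = actual)
  tp <- cm[positive, positive]; fp <- cm[positive, negative]
  fn <- cm[negative, positive]; tn <- cm[negative, negative]
  c(threshold   = threshold,
    accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # also called recall
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Scan several thresholds to see the trade-off
thresholds <- seq(0.3, 0.7, by = 0.1)
results <- as.data.frame(t(sapply(thresholds, evaluate_threshold,
                                  prob = prob_insufficient, actual = actual)))
round(results, 3)
```

Raising the threshold makes the model more reluctant to predict the positive class, so sensitivity can only fall while specificity can only rise; the table makes it easy to pick the threshold that satisfies all four targets at once, if one exists.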

Prediction Performance

(1 Point) Reached Accuracy > 75% in (your own) validation dataset.
(1 Point) Reached Sensitivity > 85% in (your own) validation dataset.
(1 Point) Reached Specificity > 70% in (your own) validation dataset.
(1 Point) Reached Precision > 75% in (your own) validation dataset.
(2 Points) Reached Accuracy > 75% in test dataset.
(2 Points) Reached Sensitivity > 85% in test dataset.
(2 Points) Reached Specificity > 70% in test dataset.
(2 Points) Reached Precision > 75% in test dataset.

Interpretation

(3 Points) Use LIME method to interpret the model that you have used
- Is there any pre-processing that you need in order to make the model more interpretable?
- How many features do you use to explain the model?
- What is the difference between using LIME compared to interpretable machine learning models such as Decision Tree or metrics such as Variable Importance in Random Forest?
(3 Points) Interpret the first 4 observations of the plot
- What is the difference between interpreting black box model with LIME and using an interpretable machine learning model?
- How good is the explanation fit? What does it signify?
- What are the most and the least important factors for each observation?

Conclusion

(2 Points) Write the conclusion of your capstone project
- Is your goal achieved?
- Can the problem be solved by machine learning?
- What model did you use and how is the performance?
- What is the potential business implementation of your capstone project?

6. SMS

Data-SMS: “I didn’t get your message!”

We provide the train dataset as follows:

The observation data consists of the following variables:

datetime: Timestamp,
text: The content of the message,
status: The spam/ham label of each message.

We provide the test dataset as follows:

The template contains:

datetime : Timestamp
status: The spam/ham label of each message.

library(dplyr)

# Predict on the test data
pred_test <- predict(model, ...)

# Create the submission data
submission <- data_test %>% 
  mutate(status = pred_test) %>% 
  select(-text)

# Save the data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file named in the format "yourname_sms-cl-spam.html".

Rubrics-SMS : “I didn’t get your message!”

Data Preprocess and Exploratory Data Analysis

(2 Points) Demonstrated how to properly do data preprocessing for text data
- What package will you use for text mining?
- Should you remove punctuation or emoticons?
- Will you create a document-term matrix?
(2 Points) Reported a distribution plot of total hourly frequency for each status.
- How do you prepare the data for visualization?
- Will you use a histogram? Heatmap? Boxplot?
(2 Points) Reported some text characteristics related to spam and ham
- What text or tokens can indicate whether a text is spam or ham?
- Is it based on the term frequency of each word or token? Or is it based on the Term Frequency (TF) - Inverse Document Frequency (IDF)?
- Will you use visualization to explain the characteristics of spam or ham?
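
To show what a document-term matrix and TF-IDF weighting look like in practice, here is a small base R sketch on four invented messages. In a real report you would more likely use a text mining package (tm and tidytext are common choices); the cleaning rules below (lower-casing, dropping everything except letters) are assumptions made for the example.

```r
# Four invented messages (NOT from the real dataset)
sms <- data.frame(
  text = c("WIN a FREE prize now!!! Reply WIN",
           "Are we still meeting for lunch?",
           "Free entry: claim your prize today",
           "Lunch at noon, see you"),
  status = c("spam", "ham", "spam", "ham"),
  stringsAsFactors = FALSE
)

# 1. Clean: lower-case, replace non-letters with spaces, squeeze spaces
clean <- tolower(sms$text)
clean <- gsub("[^a-z ]", " ", clean)
clean <- trimws(gsub(" +", " ", clean))

# 2. Tokenise on single spaces
tokens <- strsplit(clean, " ", fixed = TRUE)

# 3. Document-term matrix: one row per message, one column per word
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("sms", seq_along(tokens)), vocab)

# 4. TF-IDF: term frequency within a message, down-weighted for words
#    that appear in many messages
tf    <- dtm / rowSums(dtm)
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Words that occur most often in spam messages
spam_counts <- sort(colSums(dtm[sms$status == "spam", , drop = FALSE]),
                    decreasing = TRUE)
head(spam_counts, 5)
```

Raw counts favour words that are simply common, while TF-IDF highlights words that are frequent in one message but rare overall, which is why the two views can point to different "typical" spam tokens.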

Model Selection and Evaluation

(2 Points) Compared multiple approaches for the text classification task (e.g. Naive Bayes, Random Forest, Deep Learning)
- What model will you use to classify the text?
- How many tokens or words will you use for training the model?
(2 Points) Reported model selection and cross-validation results.
- What percentage (%) of the data is used for training the model?
- How do you choose which model is better? Is it based on accuracy?
- Which model is the best?
(2 Points) Reported which words are important for the prediction problem.
- How do you decide which words are important?
(2 Points) Reported which SMS were incorrectly predicted in your own test dataset.
- Which SMS were incorrectly predicted on the test dataset?
(2 Points) Based on the misclassified SMS, give an analysis of why this might happen.
- Is there any common pattern among the misclassified texts?
- Are there any particular words that are present in most of the misclassified texts?
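
For the question of how much data goes into training, a stratified split keeps the spam/ham proportion the same in both parts. The following is a minimal base R sketch under stated assumptions: the label vector is toy data, the 80/20 proportion and the seed are arbitrary choices, and `stratified_index()` is a hypothetical helper written for this illustration, not part of any package.

```r
set.seed(100)  # arbitrary seed so the split is reproducible

# Toy labels (made up); in the capstone these would come from the status column.
status <- c(rep("ham", 80), rep("spam", 20))

# Hypothetical helper: sample a fixed proportion of row indices within each class.
stratified_index <- function(y, prop = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(status, prop = 0.8)
val_idx   <- setdiff(seq_along(status), train_idx)

# Both parts should show the same class proportions as the full data.
prop.table(table(status[train_idx]))
prop.table(table(status[val_idx]))
```

The same indices can then be used to subset both the text features and the labels, so the two stay aligned.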

Prediction Performance

(1 Point) Accuracy in (your own) validation dataset reaches > 80%.
(1 Point) Sensitivity in (your own) validation dataset reaches > 80%.
(1 Point) Specificity in (your own) validation dataset reaches > 85%.
(1 Point) Precision in (your own) validation dataset reaches > 90%.
(2 Points) Accuracy in test dataset reaches > 80%.
(2 Points) Sensitivity in test dataset reaches > 80%.
(2 Points) Specificity in test dataset reaches > 85%.
(2 Points) Precision in test dataset reaches > 90%.
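
The four metrics above can all be read off a confusion matrix. The sketch below computes them in base R on made-up toy vectors. It assumes "spam" is the positive class, which this document does not state, so confirm the convention before comparing with the leaderboard.

```r
# Toy labels (made up). "spam" is treated as the positive class (an assumption).
lv <- c("spam", "ham")
actual    <- factor(c("spam", "spam", "spam", "ham", "ham",
                      "ham", "ham", "ham", "spam", "ham"), levels = lv)
predicted <- factor(c("spam", "spam", "ham", "ham", "ham",
                      "ham", "spam", "ham", "spam", "ham"), levels = lv)

# Rows are predictions, columns are the true labels.
cm <- table(predicted = predicted, actual = actual)

tp <- cm["spam", "spam"]  # spam correctly flagged
fn <- cm["ham",  "spam"]  # spam missed
fp <- cm["spam", "ham"]   # ham wrongly flagged
tn <- cm["ham",  "ham"]   # ham correctly passed

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),   # also called recall
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)
)
round(metrics, 3)

# Compare against the rubric thresholds listed above.
metrics > c(accuracy = 0.80, sensitivity = 0.80,
            specificity = 0.85, precision = 0.90)
```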

Interpretation

(3 Points) Use the LIME method to interpret the model that you have used
- Is there any pre-processing that you need for the model to be more interpretable?
- How many features do you use to explain the model?
(3 Points) Interpret the first 4 observations of the plot
- What is the difference between interpreting a black box model with LIME and using an interpretable machine learning model?
- How good is the explanation fit? What does it signify?
- What are the factors that support and weaken the possibility of an SMS being classified as spam?

Conclusion

(2 Points) Write the conclusion of your capstone project
- Is your goal achieved?
- Can the problem be solved by machine learning?
- What model did you use and how is the performance?
- What is the potential business implementation of your capstone project?

7. Where Were You-Prediction

Data-Where Were You: “Image Classification”

All images for the training data are located inside the data/train folder.

We provide the test dataset as follows:

All images for the test data are located inside the data/test folder.

The template contains:

id : Image ID from test folder
label: Image label; Beach, Forest or Mountain.

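library(keras)   # model prediction
library(dplyr)   # %>%, mutate(), case_when()
library(stringr) # str_remove()

# Predict the class index for each test image
# (predict_classes() exists only in older keras releases; with newer ones,
# take the arg-max of predict(model, test_x) instead)
pred_test <- predict_classes(model, test_x)

# Convert the encoded class index to its label
decode <- function(x){
  case_when(x == 0 ~ "beach",
            x == 1 ~ "forest",
            x == 2 ~ "mountain"
            )
}

# Create the submission data
submission <- data.frame(id = test_file_name,
                         label = sapply(pred_test, decode)
                         ) %>%
  mutate(id = str_remove(id, "data/test/")) # remove the file path and keep only the file name

# Write the submission without the row-name column
write.csv(submission, "submission-david.csv", row.names = FALSE)

# Check the first 3 rows
head(submission, 3)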
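
Before uploading, it can help to check that the submission matches the template described above (an id column and a label column). The sketch below is only an illustration: `check_submission()` is a hypothetical helper, the file names in the toy data are made up, the lowercase labels follow the `decode()` function above, and the uniqueness check on id is an assumption about what the leaderboard expects.

```r
# Hypothetical helper: stops with an error if the submission does not
# match the expected template, otherwise returns TRUE invisibly.
check_submission <- function(submission,
                             allowed = c("beach", "forest", "mountain")) {
  stopifnot(
    identical(names(submission), c("id", "label")),  # exactly these columns
    !anyNA(submission$label),                        # every image has a label
    all(submission$label %in% allowed),              # only known labels
    !any(duplicated(submission$id))                  # one row per image (assumed)
  )
  invisible(TRUE)
}

# Toy submission with made-up file names.
toy <- data.frame(id    = c("img_001.jpg", "img_002.jpg"),
                  label = c("beach", "mountain"))
check_submission(toy)
```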

Prepare a report for this case explaining every part listed in “Rubrics” section. Export it as an .html file with format “yourname_wherewereyou-cl.html”.

Rubrics-Where Were You: “Image Classification”

Data Preprocess and Exploratory Data Analysis

(2 Points) Explore the distribution of the image dimensions (height and width).
- Does the dataset have images of varying dimensions?
- What are the maximum and minimum dimensions of the image data?
- Why should we be concerned about the dimensions of the image data?
(2 Points) Demonstrate and explain how to augment the image dataset with an image generator.
- Do you resize the images? What are your input image dimensions?
- Do you scale/normalize the images?
- Do you rotate or flip the images?
- Do you use grayscale or RGB color mode?
(2 Points) Explore the label/class distribution of the target variable.
- See the proportion of each class (beach, forest, mountain) in the target variable.
- Is there any class imbalance among labels? What is the effect of class imbalance?

Model Fitting and Evaluation (12 points)

(2 Points) Demonstrate and explain how to prepare cross-validation data for this case.
- What is the proportion of the training vs validation dataset?
- Why do we need to divide the data?
(4 Points) Demonstrate and explain how to build a deep learning architecture.
- Do you use convolutional layers (CNN)?
- How many convolutional layers will you build?
- Do you use a flatten layer? What is the function of a flatten layer?
- What is the activation function for the output layer?
(4 Points) Demonstrate how to properly do model fitting and evaluation.
- What loss function do you use to fit the model?
- How do you adjust the optimizer to fit the model?
- How many epochs do you use to train the model?
(2 Points) Demonstrate and explain how to properly do model selection by comparing models or making adjustments to a single model.
- Is the model overfitting?
- Do you use a confusion matrix?
- Do you use accuracy, precision, sensitivity, or specificity? Which metric is considered the most important in this case?
- What do you do to improve the model performance?

Prediction Performance

(2 Points) Accuracy in (your own) validation dataset reaches > 75%.
(2 Points) Sensitivity of all classes in (your own) validation dataset reaches > 75%.
(2 Points) Specificity of all classes in (your own) validation dataset reaches > 75%.
(2 Points) Precision of all classes in (your own) validation dataset reaches > 75%.
(2 Points) Accuracy in test dataset reaches > 75%.
(2 Points) Sensitivity of all classes in test dataset reaches > 75%.
(2 Points) Specificity of all classes in test dataset reaches > 75%.
(2 Points) Precision of all classes in test dataset reaches > 75%.
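
With three classes, "of all classes" means each metric is computed once per class in a one-vs-rest fashion: the class in question counts as positive and the other two as negative. The base R sketch below shows one way to do this on made-up toy labels. It is an illustration only and may not match the exact computation used by the leaderboard.

```r
classes <- c("beach", "forest", "mountain")

# Toy labels (made up) standing in for validation results.
actual    <- factor(c("beach", "beach", "beach", "forest", "forest", "forest",
                      "mountain", "mountain", "mountain", "mountain"),
                    levels = classes)
predicted <- factor(c("beach", "beach", "forest", "forest", "forest", "mountain",
                      "mountain", "mountain", "mountain", "beach"),
                    levels = classes)

# Rows are predictions, columns are the true labels.
cm <- table(predicted = predicted, actual = actual)

# One-vs-rest metrics: treat class k as positive, everything else as negative.
per_class <- t(sapply(classes, function(k) {
  tp <- cm[k, k]
  fp <- sum(cm[k, ]) - tp   # predicted k, but actually another class
  fn <- sum(cm[, k]) - tp   # actually k, but predicted as another class
  tn <- sum(cm) - tp - fp - fn
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}))

accuracy <- sum(diag(cm)) / sum(cm)

round(per_class, 3)
accuracy

# A metric passes only if every class clears the 75% threshold.
apply(per_class > 0.75, 2, all)
```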

Conclusion

(2 Points) Write the conclusion of your capstone project.

- Is your goal achieved?
- Can the problem be solved by machine learning?
- What model did you use and how is the performance?
- What is the potential business implementation of your capstone project?

Submission

After finishing data preprocessing, modeling, and model evaluation, the next steps are:

Apply your model to the data_test.csv that comes with the case.
Follow the submission-example template provided for each case to store your prediction. You should save your prediction as a .csv file.
Log in to the Algoritma Capstone Leaderboard to see your metrics achievement.
Please screenshot your metrics achievement and attach it to your report or Google Classroom submission.
Submit your HTML report to ML-Capstone Classwork!
Please do not publish your work on RPubs or any other website since the data are confidential.

Guidance

If you are about to tackle this machine learning capstone project and are looking for clear guidance, we strongly encourage you to explore the “Capstone Machine Learning Guidance” resource by Algoritma. It provides accessible information and insights that will help you navigate your project.

Machine Learning Capstone Project

Team Algoritma

June 20, 2024
