Introduction

As the final step in the machine learning course, every student must complete 1 case as their machine learning capstone project. In this article, we explain the cases that you can choose from.

There are 5 different datasets. From each dataset, you can choose 1 case, which comes with a set of rubrics/requirements that you need to fulfill to get a score.

Datasets

All datasets used in the capstone can be found at this link.

For each case, the link provides 2 datasets: a train dataset and a test dataset.

The train dataset will be used to train and evaluate the model, while the test dataset is used for the final evaluation. The final evaluation requires you to submit your prediction of the test dataset to the leaderboard in order to obtain the final model evaluation (more details are provided below). The data scheme is illustrated as follows:

1. Concrete Strength

This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on how to predict the compression strength of a concrete structure.

“Concrete is the most important material in civil engineering,” as Prof. I-Cheng Yeh said.

Concrete compression strength is determined not only by the water-cement mixture but also by the other ingredients and by how we treat the mixture. Using this dataset, we are going to find “the perfect recipe” to predict the concrete’s compression strength, and to explain how the ingredient concentrations and the age of testing relate to the compression strength.

Observing a concrete structure’s strength takes a long time, especially when the resting time is long, say 6 months. Why not predict the compression strength instead of waiting for 6 months?

Your goal is to predict the compression strength based on the mixture properties.

How much does the structure’s compression strength increase or decrease when you add more water? Can the concrete structure gain more compression strength when you leave it to rest longer?

Your goal is to build a linear regression model that fulfills all assumptions. Interpret how each ingredient and age of testing affect the concrete compression strength.



2. Food and Beverage

The Food and Beverage dataset is provided by Dattabot and contains detailed transactions of multiple food and beverage outlets. Using this dataset, we are challenged to do some forecasting and time series analysis to help the outlet’s owner make better business decisions.

Customer behaviour, especially in the food and beverage industry, is highly related to seasonality patterns. The owner wants to analyze the number of visitors so he can make better decisions in 2018. Fortunately, you already know that time series analysis is enough to provide a good forecast and seasonality explanation.

Please make a report of your forecasting results and seasonality explanation for the hourly number of visitors, which will be evaluated on the next 7 days (Monday, December 19th 2017 to Sunday, December 25th 2017)!



3. SMS

The SMS dataset is collected by team Algoritma for educational purposes. It is a real SMS dataset with a spam/ham label for each message.

Someone might contact you the old-school way, through SMS, and you might skip the message because there is simply too much spam in your inbox. The messages labeled as spam were collected from users’ reports of unwanted SMS. Can we build a spam classifier?

The problem above asks you to classify whether a text message is SPAM or HAM based on its content.



4. Scotty

Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficient travel through traffic: the app even references Star Trek’s “beam me up” in its order buttons.

Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them in solving some forecasting and classification problems in order to improve their business processes.

It’s almost the end of 2017 and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s data for December, so we cannot look back at past demand to prepare the December forecast. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business’ future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection anymore!

Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!

Scotty turns out to be a very popular service in Turkey! Demand began to overload in some regions at certain times, and there were not enough drivers at those times and places. Fortunately, we know that we can use a classification model to predict which regions and times are at risk of this “no drivers” problem.

Create a classification model report that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). The prediction should cover the predicted coverage status for each hour and each area: "sufficient" or "insufficient".



5. Airline13

The Airline13 dataset provides airline on-time data for all flights that departed from Newark Liberty International Airport to Charlotte Douglas International Airport in 2013. The dataset includes records of flight departures and weather conditions recorded per hour.

With this dataset, data scientists are challenged to solve one of the most frequent problems in the airline industry: the arrival delay status of a flight. Let’s put our data science knowledge into action and solve this problem using classification algorithms.

Flight delays can put a real strain on travelers, who often have to arrange overnight accommodations until the next batch of flights heads out in the morning. Major airline companies have begun to take account of such problems and are developing ways to give their customers early notification of a flight’s arrival delay status, and to make sure that their customers get the best service and compensation available.

Using the “Airline13” dataset, build a prediction model to classify the arrival delay status of a flight (using the data_train set: flight and weather records up to November 2013), which will be evaluated on December 2013. The prediction should cover the predicted arrival delay status for each flight: "Delay" or "Not Delay".


Cases

1. Concrete-Prediction

Data-Concrete: “Will it last forever?”

We provide the train dataset as follows:

The observation data consists of the following variables:

  • id: ID of each cement mixture,
  • cement: The amount of cement (kg) in a \(m^3\) mixture,
  • slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture,
  • flyash: The amount of fly ash (kg) in a \(m^3\) mixture,
  • water: The amount of water (kg) in a \(m^3\) mixture,
  • super_plast: The amount of superplasticizer (kg) in a \(m^3\) mixture,
  • coarse_agg: The amount of coarse aggregate (kg) in a \(m^3\) mixture,
  • fine_agg: The amount of fine aggregate (kg) in a \(m^3\) mixture,
  • age: The number of resting days before the compressive strength measurement,
  • strength: Concrete compressive strength measurement in \(MPa\).


And we provide the test dataset as follows:

Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

The template contains:

  • id : Id of each cement mixture
  • strength: Concrete compressive strength measurement in \(MPa\).

# predict target using your model
pred_test <- predict(model, ...)

# Create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test
                         )

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# check first 3 data
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_rm-concrete-predict.html".


Rubrics-Concrete: “Will it last forever?”

Data Preprocess and Exploratory Data Analysis

  • (2 Points) Demonstrated how to apply data preprocessing to make sure that your data is “ready”, such as handling outliers.
    • What data preprocessing do you apply?
    • Are there any outliers?
    • Do you need to scale the features or the target?
  • (2 Points) Explored the relation between the target and the features.
    • Is strength positively correlated with age?
    • Do strength and cement have a strong correlation?
    • Does super_plast have a linear correlation with strength?
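
One way to start answering the correlation questions above is to compute the Pearson correlation between each feature and the target. The sketch below is only an illustration: `data_train` here is a small made-up data frame (not the capstone data), and it uses only three of the real column names.

```r
# Toy data standing in for the training set (values are made up for illustration)
set.seed(100)
data_train <- data.frame(
  cement      = runif(50, 100, 500),
  super_plast = runif(50, 0, 30),
  age         = sample(c(3, 7, 28, 90), 50, replace = TRUE)
)
data_train$strength <- 0.08 * data_train$cement +
  5 * log(data_train$age) + rnorm(50, sd = 3)

# Pearson correlation of every feature with the target
features   <- setdiff(names(data_train), "strength")
cor_target <- sapply(data_train[features],
                     function(x) cor(x, data_train$strength))

# Features sorted from most positively to most negatively correlated
round(sort(cor_target, decreasing = TRUE), 2)
```

A correlation close to 0 only rules out a linear relationship, so a scatter plot of each feature against strength is still worth a look.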

Model Fitting and Evaluation

  • (2 Points) Demonstrated how to prepare cross-validation data for this case.
    • What is the proportion of the training vs testing dataset?
  • (2 Points) Demonstrated how to properly do model fitting and evaluation.
    • What model do you use?
    • How do you evaluate the model?
    • Does your model overfit?
  • (4 Points) Compared multiple data preprocessing approaches.
    • Do you need to normalize the data?
    • Do you need to log-transform or scale the variables with square root?
  • (4 Points) Compared multiple models.
    • Build at least 2 models, or build one model and tune its parameters later.
    • If the model is not satisfactory, what will you do to tune it?
    • Does the tuned model perform better?
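
The split-and-compare workflow in this rubric can be sketched as follows. This is a minimal illustration under stated assumptions: the data is simulated, the 80/20 proportion is just one common choice, and the two linear models are examples of "two preprocessing approaches", not a recommended final model.

```r
# Simulated stand-in for the concrete data (made-up coefficients)
set.seed(123)
n <- 200
concrete <- data.frame(
  cement = runif(n, 100, 500),
  water  = runif(n, 120, 250),
  age    = sample(c(3, 7, 14, 28, 56, 90), n, replace = TRUE)
)
concrete$strength <- 0.09 * concrete$cement - 0.1 * concrete$water +
  8 * log(concrete$age) + rnorm(n, sd = 4)

# Hold out 20% of the rows as your own validation set
idx   <- sample(n, size = 0.8 * n)
train <- concrete[idx, ]
valid <- concrete[-idx, ]

# Two candidate specifications: raw age vs. log-transformed age
model_raw <- lm(strength ~ cement + water + age, data = train)
model_log <- lm(strength ~ cement + water + log(age), data = train)

mae <- function(actual, predicted) mean(abs(actual - predicted))

# Compare the candidates on data they have not seen
validation_mae <- c(
  raw_age = mae(valid$strength, predict(model_raw, newdata = valid)),
  log_age = mae(valid$strength, predict(model_log, newdata = valid))
)
validation_mae
```

Comparing the training error with the validation error for the same model is also a quick way to check for overfitting.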

Prediction Performance

  • (2 Points) MAE in (your own) validation dataset reaches < 4.
  • (2 Points) R-squared in (your own) validation dataset reaches > 90%.
  • (4 Points) MAE in test dataset reaches < 4.
  • (4 Points) R-squared in test dataset reaches > 90%.
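
For reference, the two metrics in this rubric can be computed by hand as shown below. The numbers are made up for illustration, and the helper names `mae` and `r_squared` are not from any package. R-squared is computed here as 1 minus the ratio of the residual sum of squares to the total sum of squares; the leaderboard may use a slightly different definition.

```r
# Mean absolute error: average size of the prediction errors
mae <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared: share of the target's variance explained by the predictions
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up strengths (MPa) and predictions
actual    <- c(30, 42, 25, 51)
predicted <- c(28, 45, 27, 50)

mae(actual, predicted)        # (2 + 3 + 2 + 1) / 4 = 2
r_squared(actual, predicted)  # 1 - 18 / 414, roughly 0.957
```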

Conclusion

  • (2 Points) Write the conclusion of your capstone project
    • Is your goal achieved?
    • Can the problem be solved by machine learning?
    • What model did you use and how does it perform?
    • What is the potential business implementation of your capstone project?


2. Concrete-Analysis

Data-Concrete: “Can you show me your recipe?”

We provide the train dataset as follows:

The observation data consists of the following variables:

  • id: ID of each cement mixture,
  • cement: The amount of cement (kg) in a \(m^3\) mixture,
  • slag: The amount of blast furnace slag (kg) in a \(m^3\) mixture,
  • flyash: The amount of fly ash (kg) in a \(m^3\) mixture,
  • water: The amount of water (kg) in a \(m^3\) mixture,
  • super_plast: The amount of superplasticizer (kg) in a \(m^3\) mixture,
  • coarse_agg: The amount of coarse aggregate (kg) in a \(m^3\) mixture,
  • fine_agg: The amount of fine aggregate (kg) in a \(m^3\) mixture,
  • age: The number of resting days before the compressive strength measurement,
  • strength: Concrete compressive strength measurement in \(MPa\).


And we provide the test dataset as follows:

Please follow submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

# predict target using your model
pred_test <- predict(model, ...)

# Create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test
                         )

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# check first 3 data
head(submission, 3)

The template contains:

  • id : Id of each cement mixture
  • strength: Concrete compressive strength measurement in \(MPa\) unit.

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_concrete-rm-analysis.html".


Rubrics-Concrete: “Can you show me your recipe?”

Data Preprocess

  • (2 Points) Demonstrated how to apply data transformation, scaling, outlier handling, or any other statistical approach to make sure that your data is “ready”.
    • What do you use for data transformation? Log? Log10? Square root?
    • Why did you choose that transformation method?
    • Which variables need to be transformed or scaled?
    • Are there any outliers in the target variable? Why should we care about outliers?
  • (2 Points) Demonstrated how to properly do feature engineering/variable selection.
    • Do you remove some variables? Why?
    • What method do you use to remove the variables?

Exploratory Data Analysis

  • (2 Points) Explored the relation between the target and the features.
    • Is strength positively correlated with age?
    • Do strength and cement have a strong correlation?
    • Does super_plast have a linear correlation with strength?
    • Other exploratory activities
    • How is each variable distributed?
    • How are the features correlated with each other?
    • Other insights you’ve found

Model Fitting and Evaluation

  • (2 Points) Demonstrated how to prepare cross-validation data for this case.
    • What is the proportion of the training vs testing dataset?
    • How and why do you do a cross-validation scheme?
  • (2 Points) Demonstrated how to properly do model fitting and evaluation.
    • What function do you use to build the model?
    • How do you evaluate the model performance?

Prediction Performance

  • (1 Point) MAE in (your own) validation dataset reaches < 7.5.
  • (1 Point) R-squared in (your own) validation dataset reaches > 65%.
  • (2 Points) MAE in test dataset reaches < 7.5.
  • (2 Points) R-squared in test dataset reaches > 65%.

Model Interpretation and Improvement Idea(s)

  • (2 Points) Reported the interpretation of each predictor and explained how much it affects the concrete compression strength.

    • How do you measure the effect of each predictor?
    • How do you interpret the standard error of each variable?
    • Does the predictor have a significant effect on the concrete compression strength?
  • (3 Points) Reported all of the assumption checks using a proper testing method and/or visualization. If there is any violation, explain why it happens (e.g. existence of outliers, non-linear relationships, etc.); if there is none, propose a method to improve the model performance (and explain why it works).

  • (3 Points) Improved the model to fulfill the assumptions.

    • How do you improve the model in order to fulfill the assumptions?
    • Do you need to transform the target variable?
    • Do you need to transform the features?
    • Should you transform the data using log, square root, Box-Cox or any other method?
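
As a rough sketch of assumption checking, the example below fits a linear model on the raw target and on the log-transformed target, then checks the residuals of each. Everything here is an assumption for illustration: the data is simulated with multiplicative noise, and `bp_pvalue` is a hand-written Breusch-Pagan-style check (squared residuals regressed on the fitted values), not a function from any package.

```r
# Simulated data with multiplicative noise (made-up coefficients)
set.seed(42)
n <- 150
d <- data.frame(
  cement = runif(n, 100, 500),
  age    = sample(c(3, 7, 28, 90), n, replace = TRUE)
)
d$strength <- exp(1.5 + 0.003 * d$cement + 0.25 * log(d$age) +
                    rnorm(n, sd = 0.2))

model_raw <- lm(strength ~ cement + log(age), data = d)
model_log <- lm(log(strength) ~ cement + log(age), data = d)

# Normality of residuals: Shapiro-Wilk (H0: residuals are normal)
p_normal_raw <- shapiro.test(residuals(model_raw))$p.value
p_normal_log <- shapiro.test(residuals(model_log))$p.value

# Homoscedasticity: regress squared residuals on fitted values and
# compare n * R^2 with a chi-squared distribution (1 degree of freedom).
# H0: the residual variance does not depend on the fitted values.
bp_pvalue <- function(model) {
  sq  <- residuals(model)^2
  fit <- fitted(model)
  aux <- lm(sq ~ fit)
  pchisq(length(sq) * summary(aux)$r.squared, df = 1, lower.tail = FALSE)
}
p_homo_raw <- bp_pvalue(model_raw)
p_homo_log <- bp_pvalue(model_log)

round(c(normal_raw = p_normal_raw, normal_log = p_normal_log,
        homo_raw = p_homo_raw, homo_log = p_homo_log), 4)
```

A small p-value signals a violated assumption. Residual plots (for example `plot(model_raw)`) are a useful visual complement to these tests.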

Finding the Right Material Composition

  • (2 Points) Choose one of the ingredients or age and do a test to find out the difference between compositions.
    • Which predictor did you choose? Why?
    • How many classes do you create for each ingredient or age?
  • (2 Points) Do a test to find the right composition to get the maximum concrete compression strength.
    • What statistical test do you use to find the difference in mean concrete compression strength?
    • What is the optimal composition of ingredients or age to get the maximum or a higher concrete compression strength?
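
One possible way to approach this rubric is to bin a predictor into classes and compare the mean strength between classes with a one-way ANOVA followed by Tukey's HSD. The sketch below assumes simulated data, and the three age classes and their cut points are arbitrary choices for illustration.

```r
# Simulated strengths that grow with the logarithm of age
set.seed(7)
d <- data.frame(age = sample(c(3, 7, 14, 28, 56, 90), 120, replace = TRUE))
d$strength <- 10 + 8 * log(d$age) + rnorm(120, sd = 4)

# Bin age into three classes: (0, 7], (7, 28], (28, Inf)
d$age_class <- cut(d$age, breaks = c(0, 7, 28, Inf),
                   labels = c("early", "standard", "late"))

# One-way ANOVA: is the mean strength equal across classes?
fit <- aov(strength ~ age_class, data = d)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
anova_p

# Pairwise differences between classes with adjusted p-values
TukeyHSD(fit)

# Mean strength per class, to see which class is highest
aggregate(strength ~ age_class, data = d, FUN = mean)
```

ANOVA assumes roughly normal residuals and similar variances between classes, so those are worth checking before trusting the result.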

Conclusion

  • (2 Points) Write the conclusion of your capstone project
    • Is your goal achieved?
    • Can the problem be solved by machine learning?
    • What model did you use and how does it perform?
    • What is the potential business implementation of your capstone project?


3. F&B-TimeSeries

Data F&B: “It’s Friday night!”

The train dataset contains detailed transactions from October 1st 2017 to December 2nd 2017:

The dataset includes information about:

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of purchased item
  • price_usd: The price of purchased item
  • total_usd: The total price of purchased item
  • payment_type: The payment method
  • sales_type: The sales method

The test dataset should serve as a template for submission:

Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

The template contains:

  • datetime: Timestamp (equivalent to transaction_date)
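  • visitor: Estimated number of visitor(s)

# Load the packages used below (assumes the model comes from the forecast package)
library(dplyr)
library(forecast)

# Forecast the target using your model
forecast_mod <- forecast(model, ...)

# Create submission data (a forecast object stores its point forecasts in `$mean`)
submission <- data_test %>% 
  mutate(visitor = as.numeric(forecast_mod$mean))

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)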

# check first 3 data
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_fnb-ts-single.html".

Rubrics-F&B: “It’s Friday night!”

Data Preprocess

  • (2 Points) Demonstrated how to properly do data aggregation.
    • Do you need to aggregate/summarise the number of visitors before doing time series padding?
    • Do you need to filter the time to certain hours after doing time series padding?
    • Do you need to replace NA values?
  • (2 Points) Demonstrated how to properly do time series padding.
    • Should you do time series padding?
    • Do you need to round the datetime to hours or minutes?
    • What are the start and the end of the time interval for time series padding?
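
The aggregation and padding steps above can be sketched in base R as follows. The transactions are made up, and counting one visitor per distinct `receipt_number` is an assumption for illustration; decide for yourself how a "visitor" should be defined.

```r
# Toy transactions: two item rows share receipt A2 (made-up values)
trx <- data.frame(
  transaction_date = as.POSIXct(c("2017-12-01 13:05:00", "2017-12-01 13:40:00",
                                  "2017-12-01 13:40:00", "2017-12-01 16:10:00"),
                                tz = "UTC"),
  receipt_number = c("A1", "A2", "A2", "A3")
)

# Floor every timestamp to the hour
trx$datetime <- as.POSIXct(trunc(trx$transaction_date, units = "hours"))

# Aggregate: number of distinct receipts per hour
hourly <- aggregate(receipt_number ~ datetime, data = trx,
                    FUN = function(x) length(unique(x)))
names(hourly)[2] <- "visitor"

# Pad: build the complete hourly sequence and join the counts onto it
full   <- data.frame(datetime = seq(min(hourly$datetime),
                                    max(hourly$datetime), by = "hour"))
padded <- merge(full, hourly, by = "datetime", all.x = TRUE)

# Hours without any transaction have no visitors
padded$visitor[is.na(padded$visitor)] <- 0
padded
```

In the real data you would probably also filter the padded series to the outlet's opening hours, so that closed hours do not dominate the series with zeros.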

Seasonality Analysis

  • (2 Points) Compared multiple time series decomposition approaches.
    • Can you decompose the time series into the observed data, trend, hourly seasonality, weekly seasonality, and the residuals?
  • (2 Points) Reported interpretable hourly and weekly seasonality.
    • Can you create a better visualization of hourly and weekly seasonality?
    • How do you interpret the seasonality? Describe the interpretation.
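
A decomposition into hourly and weekly seasonality can be sketched with two passes of base R's `stl()`: first extract the 24-hour pattern, then extract the 168-hour pattern from what is left. The series below is simulated, and the two-pass approach is only one option. The forecast package's `msts()` and `mstl()` handle multiple seasonal periods in one call.

```r
# Four weeks of simulated hourly visitors with a daily and a weekly pattern
set.seed(1)
hours          <- 0:(24 * 7 * 4 - 1)
hourly_pattern <- 10 * sin(2 * pi * (hours %% 24) / 24)
weekly_pattern <- 5 * ((hours %/% 24) %% 7 >= 5)  # two busier days per week
visitor <- 30 + hourly_pattern + weekly_pattern +
  rnorm(length(hours), sd = 2)

# Pass 1: hourly seasonality (period = 24 observations)
ts_daily      <- ts(visitor, frequency = 24)
dec_daily     <- stl(ts_daily, s.window = "periodic")
hourly_season <- dec_daily$time.series[, "seasonal"]

# Pass 2: weekly seasonality (period = 168) from the de-seasonalized series
deseason   <- ts(as.numeric(ts_daily - hourly_season), frequency = 24 * 7)
dec_weekly <- stl(deseason, s.window = "periodic")

components <- data.frame(
  observed  = visitor,
  hourly    = as.numeric(hourly_season),
  weekly    = as.numeric(dec_weekly$time.series[, "seasonal"]),
  trend     = as.numeric(dec_weekly$time.series[, "trend"]),
  remainder = as.numeric(dec_weekly$time.series[, "remainder"])
)
head(components, 3)
```

Averaging the `hourly` column by hour of day, and the `weekly` column by day of week, gives compact profiles that are easier to interpret than the raw decomposition plot.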

Model Fitting and Evaluation

  • (4 Points) Demonstrated how to prepare cross-validation data for this case.
    • Do you need to do cross validation before doing time series analysis?
    • How do you split the data into training and testing datasets?
  • (4 Points) Demonstrated how to properly do model fitting and evaluation.
    • What data preprocessing did you use before fitting the model?
    • What time series model did you use?
    • Can you visualize the actual vs estimated number of visitors?
    • How do you evaluate the model performance?
  • (4 Points) Compared multiple model specifications.
    • How many forecasting models will you use?
    • Will you use exponential smoothing? Will you use ARIMA?
    • How do you evaluate the model performance?
    • Can you visualize the actual vs estimated number of visitors?
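
To make the comparison concrete, the sketch below holds out the last 7 days of a simulated hourly series and compares three specifications by MAE: a seasonal naive baseline, Holt-Winters exponential smoothing, and a seasonal ARIMA. The smoothing parameters and ARIMA orders are fixed, arbitrary choices to keep the example deterministic. They are not tuned recommendations.

```r
# Four weeks of simulated hourly visitors with a daily pattern
set.seed(2017)
n       <- 24 * 28
h       <- 24 * 7                       # forecast horizon: 7 days
hours   <- 0:(n - 1)
visitor <- 30 + 10 * sin(2 * pi * (hours %% 24) / 24) + rnorm(n, sd = 2)

# Time-ordered split: never shuffle a time series
train <- ts(visitor[1:(n - h)], frequency = 24)
test  <- visitor[(n - h + 1):n]

mae <- function(actual, predicted) mean(abs(actual - predicted))

# Baseline: repeat the last observed day
snaive_fc <- rep(tail(as.numeric(train), 24), length.out = h)

# Exponential smoothing: Holt-Winters without trend, fixed parameters
hw_fit <- HoltWinters(train, alpha = 0.2, beta = FALSE, gamma = 0.3)
hw_fc  <- as.numeric(predict(hw_fit, n.ahead = h))

# Seasonal ARIMA: AR(1) on the seasonally differenced series
arima_fit <- arima(train, order = c(1, 0, 0),
                   seasonal = list(order = c(0, 1, 0), period = 24))
arima_fc  <- as.numeric(predict(arima_fit, n.ahead = h)$pred)

results <- c(snaive       = mae(test, snaive_fc),
             holt_winters = mae(test, hw_fc),
             arima        = mae(test, arima_fc))
sort(results)
```

Keeping a simple baseline in the comparison shows whether the more complex models actually earn their complexity.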

Prediction Performance

  • (4 Points) Reached MAE < 6 in (your own) validation dataset.
  • (4 Points) Reached MAE < 6 in test dataset.

Conclusion

  • (2 Points) Assumption Checking
    • Does the model meet the autocorrelation assumption?
    • What about the normality of residuals?
    • If the assumptions are not met, what is the cause? How do you handle that?
    • Based on the seasonality, when is the number of visitors highest?
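
The two residual checks can be sketched with base R's `Box.test()` (Ljung-Box test for autocorrelation) and `shapiro.test()` (normality). The series and the ARIMA specification below are assumptions for illustration only.

```r
# Three weeks of simulated hourly visitors
set.seed(10)
hours   <- 0:(24 * 21 - 1)
visitor <- ts(30 + 10 * sin(2 * pi * (hours %% 24) / 24) +
                rnorm(length(hours), sd = 2), frequency = 24)

fit <- arima(visitor, order = c(1, 0, 0),
             seasonal = list(order = c(0, 1, 0), period = 24))
res <- residuals(fit)

# H0: residuals show no autocorrelation up to lag 24
lb <- Box.test(res, lag = 24, type = "Ljung-Box")

# H0: residuals are normally distributed
sw <- shapiro.test(as.numeric(res))

c(ljung_box_p = lb$p.value, shapiro_p = sw$p.value)
```

A small Ljung-Box p-value suggests that the model has left some structure (often a seasonal period) unexplained.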



4. SMS-Prediction

Data SMS : “I didn’t get your message!”

We provide the train dataset as follows:

The observation data consists of the following variables:

  • datetime: Timestamp,
  • text: The content of the message,
  • status: The spam/ham label of each message.


We provide the test dataset as follows:

Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

The template contains:

  • datetime : Timestamp
  • status: The spam/ham label of each message.

# Load dplyr for %>%, mutate() and select()
library(dplyr)

# Predict on data test
pred_test <- predict(model, ...)

# Create submission data
submission <- data_test %>% 
  mutate(status = pred_test) %>% 
  select(-text)

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# check first 3 data
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_sms-cl-spam.html".

Rubrics-SMS : “I didn’t get your message!”

Data Preprocess and Exploratory Data Analysis

  • (2 Points) Demonstrated how to properly preprocess text data.
    • What package will you use for text mining?
    • Should you remove punctuation or emoticons?
    • Will you create a document-term matrix?
  • (2 Points) Reported a distribution plot of total hourly frequency for each status.
    • How do you prepare the data for visualization?
    • Will you use a histogram? A heatmap? A boxplot?
  • (2 Points) Reported some text characteristics related to spam and ham.
    • What text or tokens can indicate whether a text is spam or ham?
    • Is it based on the term frequency of each word or token? Or is it based on the Term Frequency (TF) - Inverse Document Frequency (IDF)?
    • Will you use visualization to explain the characteristics of spam or ham?
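
To illustrate what a document-term matrix and TF-IDF weighting look like, here is a small base R sketch. The four messages and the `tokenize` helper are made up for this example. In the actual project you would more likely use a text mining package.

```r
# Four made-up messages
sms <- data.frame(
  text = c("WIN a FREE prize now!!!", "free cash, claim now",
           "are we still meeting today?", "call me when you are free"),
  status = c("spam", "spam", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Lowercase, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- gsub("[[:punct:]]+", " ", tolower(x))
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}
tokens <- lapply(sms$text, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per message, one column per token
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# TF-IDF: term frequency weighted by how rare the term is across messages
tf    <- dtm / rowSums(dtm)
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Raw term counts per class, for two example tokens
term_by_status <- rowsum(dtm, group = sms$status)
term_by_status[, c("free", "now")]
```

In this toy example "now" appears only in spam while "free" appears in both classes, which shows why a raw count alone can be misleading.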

Model Selection and Evaluation

  • (2 Points) Compared multiple approaches for the text classification task (e.g. Naive Bayes, Random Forest, Deep Learning).
    • What model will you use to classify the text?
    • How many tokens or words will you use for training the model?
  • (2 Points) Reported model selection and cross-validation results.
    • What percentage (%) of the data is used for training the model?
    • How do you choose which one is the better model? Is it based on the accuracy?
    • Which model is the best?
  • (2 Points) Reported which words are important for the prediction problem.
    • How do you decide which words are important?
  • (2 Points) Reported which SMS were incorrectly predicted in your own test dataset.
    • Which SMS were incorrectly predicted on the test dataset?
  • (2 Points) Based on the misclassified SMS, give an analysis of why this might happen.
    • Is there any common pattern among the misclassified texts?
    • Are there any particular words that are present in most of the misclassified texts?

Prediction Performance

  • (1 Point) Accuracy in (your own) validation dataset reaches > 80%.

  • (1 Point) Sensitivity in (your own) validation dataset reaches > 80%.

  • (1 Point) Specificity in (your own) validation dataset reaches > 85%.

  • (1 Point) Precision in (your own) validation dataset reaches > 90%.

  • (2 Points) Accuracy in test dataset reaches > 80%.

  • (2 Points) Sensitivity in test dataset reaches > 80%.

  • (2 Points) Specificity in test dataset reaches > 85%.

  • (2 Points) Precision in test dataset reaches > 90%.
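
The four metrics above all derive from the confusion matrix. The sketch below uses ten made-up labels and treats "spam" as the positive class, which is an assumption: check which class the leaderboard treats as positive.

```r
# Made-up labels, with "spam" as the positive class
lv        <- c("spam", "ham")
actual    <- factor(c("spam", "spam", "spam", "ham", "ham",
                      "ham", "ham", "ham", "ham", "spam"), levels = lv)
predicted <- factor(c("spam", "spam", "ham", "ham", "ham",
                      "ham", "spam", "ham", "ham", "spam"), levels = lv)

cm <- table(predicted = predicted, actual = actual)
cm

tp <- cm["spam", "spam"]  # spam correctly flagged
fp <- cm["spam", "ham"]   # ham wrongly flagged as spam
fn <- cm["ham", "spam"]   # spam that slipped through
tn <- cm["ham", "ham"]    # ham correctly passed

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),  # 8 / 10
  sensitivity = tp / (tp + fn),       # 3 / 4
  specificity = tn / (tn + fp),       # 5 / 6
  precision   = tp / (tp + fp)        # 3 / 4
)
round(metrics, 3)
```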

Conclusion

  • (2 Points) Write the conclusion of your capstone project
    • Is your goal achieved?
    • Can the problem be solved by machine learning?
    • What model did you use and how does it perform?
    • What is the potential business implementation of your capstone project?

5. Scotty-TimeSeries

Data-Scotty: “Bring me the crystal ball!”

The train dataset contains detailed transactions from October 1st 2017 to December 2nd 2017:

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time (timestamp)
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (all status considered as a demand)
  • confirmed_time_sec: Time difference between request and confirmation (in seconds)

The test dataset should serve as a template for submission:

The template contains:

  • src_sub_area: Request source sub-area
  • datetime: Timestamp (equivalent to start_time)
  • demand: Estimated number of demand(s)

Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

# Forecast the target using your model
forecast_mod <- forecast(model, ...)

# Create submission data
submission <- data_test %>% 
  mutate(demand = forecast_mod)

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# check first 3 data
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_scotty-ts.html".


Rubrics-Scotty: “Bring me the crystal ball!”

Data Preprocess

  • (2 Points) Demonstrated how to properly do data aggregation.
    • Should you floor the date to a specific time level (minutes, hours, or days)?
    • How do we group the data for aggregation/summarisation?
  • (2 Points) Demonstrated how to properly do time series padding.
    • Should you do time series padding?
    • Do you need to round the datetime to hours or minutes?
    • What are the start and the end of the time interval for time series padding?

Cross-Validation Scheme

  • (2 Points) Demonstrated how to prepare cross-validation data for model selection.
    • How do you cross-validate data for time series?
    • Do you need to group the data by the source area?
    • Do you need to make a nested data frame?
    • How many observations will you use as the testing dataset?
  • (2 Points) Demonstrated how to prepare cross-validation data for “best” model evaluation.
    • Do you need to further split the train data into a training set and a validation set?
    • How much of the data will be used as the validation set?

Model Selection

  • (2 Points) Compared multiple preprocessing specifications.
    • Will different preprocessing give different results?
    • How many kinds of preprocessing specification will you prepare?
    • Will you choose 2 different specifications: log transformation and square root transformation? Will you create another preprocessing approach?
  • (2 Points) Compared multiple seasonality specifications.
    • How many seasonality specifications will you create?
    • Will you create a model with daily seasonality only?
    • Will you create a model with multiple seasonality (daily and weekly)?
  • (2 Points) Compared multiple model specifications.
    • How many forecasting models will you use?
    • Will you use exponential smoothing? Will you use ARIMA?
  • (2 Points) Best specifications selection.
    • Since we use multiple preprocessing approaches, seasonalities, and models, can you make an automated script to summarise the results?
    • How do you measure the model performance?
    • Which model and specification have the best performance?
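
The idea of an automated selection script can be sketched as a grid of specifications that is evaluated in one loop. The example below is a simplified illustration on a single simulated series: the names `transforms` and `models`, the fixed Holt-Winters parameters, and the candidate list itself are all assumptions. For Scotty you would repeat this for each sub-area and add seasonality specifications to the grid.

```r
# Simulated hourly demand for one sub-area (non-negative counts)
set.seed(3)
n      <- 24 * 28
h      <- 24 * 7
hours  <- 0:(n - 1)
demand <- pmax(0, round(20 + 15 * sin(2 * pi * (hours %% 24) / 24) +
                          rnorm(n, sd = 3)))
train  <- demand[1:(n - h)]
test   <- demand[(n - h + 1):n]

# Preprocessing specifications: a transformation and its inverse
transforms <- list(
  none  = list(forward = identity, backward = identity),
  log1p = list(forward = log1p,    backward = expm1),
  sqrt  = list(forward = sqrt,     backward = function(x) x^2)
)

# Model specifications: each takes a series and a horizon, returns forecasts
models <- list(
  snaive = function(y, h) rep(tail(y, 24), length.out = h),
  holt_winters = function(y, h) {
    fit <- HoltWinters(ts(y, frequency = 24),
                       alpha = 0.2, beta = FALSE, gamma = 0.3)
    as.numeric(predict(fit, n.ahead = h))
  }
)

# Evaluate every combination and record the MAE on the original scale
grid <- expand.grid(transform = names(transforms), model = names(models),
                    stringsAsFactors = FALSE)
grid$mae <- mapply(function(tr, mod) {
  fc <- models[[mod]](transforms[[tr]]$forward(train), h)
  mean(abs(test - transforms[[tr]]$backward(fc)))
}, grid$transform, grid$model)

# Best specification first
grid[order(grid$mae), ]
```

Because every specification is just an entry in a list, adding a new transformation or model does not require touching the evaluation loop.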

Prediction Performance

  • (1 Point) Reached MAE < 12 for sub-area sxk97 in (your own) evaluation dataset.

  • (1 Point) Reached MAE < 11 for sub-area sxk9e in (your own) evaluation dataset.

  • (1 Point) Reached MAE < 10 for sub-area sxk9s in (your own) evaluation dataset.

  • (1 Point) Reached MAE < 11 for all sub-areas in (your own) evaluation dataset.

  • (2 Points) Reached MAE < 12 for sub-area sxk97 in test dataset.

  • (2 Points) Reached MAE < 11 for sub-area sxk9e in test dataset.

  • (2 Points) Reached MAE < 10 for sub-area sxk9s in test dataset.

  • (2 Points) Reached MAE < 11 for all sub-areas in test dataset.

Conclusion

  • (2 Points) Assumption Checking
    • Does the model meet the autocorrelation assumption? What about the normality of residuals?
    • If the assumptions are not met, what is the cause? How do you handle that? Based on the seasonality, when is the demand highest?


6. Scotty-Prediction

Data-Scotty: “There is no drivers!”

The train dataset contains detailed transactions from October 1st 2017 to December 2nd 2017:

The dataset includes information about:

  • id: Transaction id
  • trip_id: Trip id
  • driver_id: Driver id
  • rider_id: Rider id
  • start_time: Request start time (timestamp)
  • src_lat: Request source latitude
  • src_lon: Request source longitude
  • src_area: Request source area
  • src_sub_area: Request source sub-area
  • dest_lat: Requested destination latitude
  • dest_lon: Requested destination longitude
  • dest_area: Requested destination area
  • dest_sub_area: Requested destination sub-area
  • distance: Trip distance (in KM)
  • status: Trip status (all status considered as a demand)
  • confirmed_time_sec: Time difference between request and confirmation (in seconds)


The test dataset should serve as a template for submission:

The template contains:

  • src_area: Request source area
  • datetime: Timestamp (equivalent to start_time)
  • coverage: Estimated coverage status; sufficient or insufficient

Please follow content of the submission-example below as a template for submission. The values on the target variable do not represent the actual answer. Your process to achieve the target variable may differ, but the final submission data should have the same columns and observations.

# Predict the target using your model
pred_test <- predict(model, ...)

# Create submission data
submission <- data_test %>% 
  mutate(coverage = pred_test)

# save data
write.csv(submission, "submission-david.csv", row.names = FALSE)

# check first 3 data
head(submission, 3)

Prepare a report for this case explaining every part listed in the “Rubrics” section. Export it as an .html file with the format "yourname_scotty-cl-cov.html".



Rubrics-Scotty: “There is no drivers!”

Data Preprocess

  • (2 Points) Demonstrated how to properly do data aggregation.
    • Should you floor the date to a specific time level (minutes, hours, or days)?
    • How do we group the data for aggregation/summarisation?
  • (2 Points) Demonstrated how to properly do time series padding.
    • Determine the start and end of the padding interval.
    • Pad the time data to a specific time interval (minutes, hours, or days) before doing any EDA or further preprocessing, so that all intervals are the same.
    • Fill the NA counts in the new time intervals with 0 or any other imputation method.

Exploratory Data Analysis

  • (2 Points) Explored the state in the target distribution.
    • See the overall class proportion of the target variable.
    • See the class proportion of the target variable in each area (3 areas).
  • (2 Points) Explored the relation between the target and the features.
    • Find patterns or correlations between the target and the features.
    • Use a heatmap of time (hour) and weekday, grouped by area, and find the pattern.

Model Fitting and Evaluation

  • (2 Points) Demonstrated how to prepare cross-validation data for this case.
    • What is the proportion of the training vs testing dataset?
  • (2 Points) Demonstrated how to properly do data preprocess and feature engineering.
    • explain the details of data preprocessing
    • explain feature engineering/variable selection, including removing unused variable
    • do upsample or downsample (based on the class proportion)
  • (2 Points) Demonstrated how to properly do model fitting and evaluation.
    • What model should be used?
    • How should the model’s parameters be set?
  • (2 Points) Demonstrated how to properly do model selection by comparing models or adjusting a single model.
    • Is the model overfitting?
    • Did you use a confusion matrix?
    • Did you use accuracy, precision, sensitivity, and specificity? Which metric is considered the most important in this case?
    • How is the sensitivity-specificity trade-off?
    • How is the precision-recall trade-off? What is the optimal threshold to get a better trade-off between sensitivity and precision?
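
To make the metric and threshold questions above concrete, here is a minimal sketch that derives the four metrics from confusion-matrix counts at two different thresholds. The helper `classification_metrics()` and the toy labels and probabilities are hypothetical; in practice the probabilities would come from your own fitted model.

```r
# Hypothetical helper: score predicted probabilities at a given threshold
classification_metrics <- function(actual, prob, threshold = 0.5,
                                   positive = "yes") {
  pred_pos <- prob >= threshold   # predicted positive at this cut-off
  act_pos  <- actual == positive  # truly positive

  # Confusion-matrix cells
  tp <- sum(pred_pos & act_pos)
  fp <- sum(pred_pos & !act_pos)
  fn <- sum(!pred_pos & act_pos)
  tn <- sum(!pred_pos & !act_pos)

  c(
    accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),  # also called recall
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp)
  )
}

# Toy validation labels and predicted probabilities of the positive class
actual <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")
prob   <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)

at_050 <- classification_metrics(actual, prob, threshold = 0.5)
at_025 <- classification_metrics(actual, prob, threshold = 0.25)
print(round(rbind(at_050, at_025), 3))
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises sensitivity from 0.75 to 1 while specificity falls from 0.75 to 0.5 and precision falls from 0.75 to about 0.667; accuracy stays at 0.75. This is the trade-off the rubric asks you to discuss.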

Prediction Performance

  • (1 Point) Reached Accuracy > 75% in (your own) validation dataset.

  • (1 Point) Reached Sensitivity > 85% in (your own) validation dataset.

  • (1 Point) Reached Specificity > 70% in (your own) validation dataset.

  • (1 Point) Reached Precision > 75% in (your own) validation dataset.

  • (2 Points) Reached Accuracy > 75% in test dataset.

  • (2 Points) Reached Sensitivity > 85% in test dataset.

  • (2 Points) Reached Specificity > 70% in test dataset.

  • (2 Points) Reached Precision > 75% in test dataset.

Conclusion

  • (2 Points) Write the conclusion of your capstone project
    • Is your goal achieved?
    • Can the problem be solved by machine learning?
    • What model did you use and how is the performance?
    • What is the potential business implementation of your capstone project?

7. Airline13-Prediction

Data-Airline13: “The Late List”

The train dataset contains detailed flight records and hourly weather conditions recorded from January 1st, 2013 to November 30th, 2013. The two datasets can be joined on their recorded time.

The flight data consists of the following variables:

  • year,month,day: Date of departure.
  • dep_time: Actual departure times (format HHMM or HMM), local tz.
  • sched_dep_time,sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
  • dep_delay: Departure delays, in minutes. Negative times represent early departures.
  • arr_delay: Flight arrival status; Delay or Not Delay.
  • carrier: Two-letter carrier (airline) abbreviation.
  • flight: Flight number.
  • tailnum: Plane tail number.
  • origin,dest: Airports of origin and destination.
  • distance: Distance between airports, in miles
  • hour,minute: Time of scheduled departure broken into hour and minutes.
  • time_hour: Scheduled date and hour of the flight (YYYYMMDD HHMMSS).

The weather data consists of the following variables:

  • year,month,day,hour: Time of recording.
  • temp: Temperature in Fahrenheit.
  • dewp: Dewpoint in Fahrenheit.
  • humid: Relative humidity.
  • wind_dir: Wind direction (in degrees).
  • wind_speed: Wind speed (in mph).
  • wind_gust: Wind gust speed (in mph).
  • precip: Precipitation, in inches.
  • pressure: Sea level pressure in millibars.
  • visib: Visibility in miles.
  • time_hour: Date and hour of the recording (YYYYMMDD HHMMSS).


We provide the test dataset as follows:

Please follow the content of the submission-example below as a template for your submission. The values of the target variable do not represent the actual answers. Your process for obtaining the target variable may differ, but the final submission data should have the same columns and the same number of observations.

The template contains:

  • Id : Flight Id
  • arr_status: Flight arrival status; Delay or Not Delay.
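# Load dplyr for %>%, mutate(), and select()
library(dplyr)

# Predict on the data test
pred_test <- predict(model, ...)

# Create submission data: attach the predictions, keep only the template columns
submission <- data_test %>%
  mutate(arr_status = pred_test) %>%
  select(id, arr_status)

# Save the data without row names
write.csv(submission, "submission-david.csv", row.names = FALSE)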

# check first 3 data
head(submission, 3)
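
Because the submission must have the same columns and observations as the template, a quick check before saving can catch mistakes early. The sketch below is only an illustration: `check_submission()` is a hypothetical helper, and the two toy data frames stand in for the real template and your own predictions.

```r
# Toy stand-ins for the provided template and your own submission
template <- data.frame(
  id = 1:3,
  arr_status = c("Delay", "Not Delay", "Delay")
)
submission <- data.frame(
  id = 1:3,
  arr_status = c("Not Delay", "Not Delay", "Delay")
)

# Hypothetical helper: stop with an error if the submission does not
# match the template's shape or contains unexpected labels
check_submission <- function(submission, template,
                             allowed = c("Delay", "Not Delay")) {
  stopifnot(
    identical(names(submission), names(template)),  # same columns, same order
    nrow(submission) == nrow(template),             # same number of rows
    !anyNA(submission$arr_status),                  # no missing predictions
    all(submission$arr_status %in% allowed)         # only valid class labels
  )
  invisible(TRUE)
}

check_submission(submission, template)
```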

Prepare a report for this case explaining every part listed in “Rubrics” section. Export it as an .html file with format “yourname_airline13-cl-del.html”.

Rubrics-Airline13: “The Late List”

Data Wrangling

  • (2 Points) Demonstrated how to properly do data tidying by joining two data frames.
    • Is there any data preprocessing required before joining two data frames?
    • Do you need to remove certain columns after joining the data frames?
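
A minimal sketch of such a join in base R is shown below. It assumes `time_hour` has already been parsed to the same date-time type in both tables and uses tiny made-up tables; the real data may need extra preprocessing before the keys match.

```r
# Toy flight and weather tables sharing the hourly key `time_hour`
flights <- data.frame(
  id = 1:3,
  hour = c(5, 6, 6),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 06:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  )
)
weather <- data.frame(
  hour = c(5, 6),
  temp = c(39.0, 39.9),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  )
)

# Drop the weather columns that duplicate information already in flights,
# so the join does not create hour.x / hour.y style duplicates
dup_cols <- c("year", "month", "day", "hour")
weather_slim <- weather[, setdiff(names(weather), dup_cols), drop = FALSE]

# Left join: keep every flight, attach the weather recorded for its hour
joined <- merge(flights, weather_slim, by = "time_hour", all.x = TRUE)

print(joined)
```

A left join keeps flights whose hour has no weather record; those rows get NA in the weather columns and need handling later.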

Exploratory Data Analysis

  • (2 Points) Explored which airline company might need this prediction model the most.
    • Do you think the airline with the highest number of delayed flights needs the model the most?
    • Is there any airline that has more delayed flights than non-delayed flights?
  • (2 Points) Explored the proportion of the target variable.
    • What is the target variable?
    • Is there any class imbalance in the target variable?
    • What should you do if there is a class imbalance?
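
The difference between the number of delayed flights and the share of delayed flights can be sketched as follows. The toy data is made up for illustration; only the `carrier` and `arr_status` names follow the variable lists above.

```r
# Toy flights: AA has more delays in total, B6 has the higher delay rate
toy <- data.frame(
  carrier = rep(c("AA", "B6"), times = c(8, 3)),
  arr_status = c(rep("Delay", 3), rep("Not Delay", 5),
                 rep("Delay", 2), "Not Delay")
)

is_delay <- toy$arr_status == "Delay"

# Count and share of delayed flights per carrier
counts <- tapply(is_delay, toy$carrier, sum)
rates  <- tapply(is_delay, toy$carrier, mean)
by_carrier <- data.frame(
  carrier = names(counts),
  n_delay = as.vector(counts),
  delay_rate = as.vector(rates)
)

# Carrier with the most delayed flights vs. the highest delay rate
most_delays  <- as.character(by_carrier$carrier[which.max(by_carrier$n_delay)])
highest_rate <- as.character(by_carrier$carrier[which.max(by_carrier$delay_rate)])

# Carriers with more delayed than non-delayed flights
mostly_late <- by_carrier[by_carrier$delay_rate > 0.5, ]

print(by_carrier)
```

In this toy example the two rankings disagree, which is why the rubric asks whether the highest count alone is a good reason to choose an airline.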

Model Fitting and Evaluation

  • (2 Points) Demonstrated how to properly do data preprocessing and feature engineering.
    • Do you need to transform the data type of some variables?
    • Do you need to separate time into hour and minute?
    • Do you need to remove some variables? How?
  • (2 Points) Demonstrated how to properly handle missing values (includes reasoning for the method applied).
    • Should you check any missing values?
    • Do you need to impute missing values?
    • Should you use median or mean imputation? Why?
  • (2 Points) Demonstrated how to prepare cross-validation data for this case.
    • What is your proportion of training-testing dataset?
    • Do you need to use stratified random sampling during the cross-validation?
  • (2 Points) Demonstrated how to properly do model fitting and evaluation.
    • What model do you use?
    • How do you set the model parameter?
    • Are you more concerned with precision than accuracy for this case? Why?
  • (2 Points) Demonstrated how to properly do model selection by comparing models or adjusting a single model.
    • Which model is better?
    • What kind of adjustment do you need to make in order to improve the performance of your chosen model?
    • Can you adjust the classification threshold to get better model performance?
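
Two of the steps above, the stratified train-validation split and median imputation, could look like the sketch below. The helper `stratified_index()`, the 80/20 proportion, and the toy data are assumptions for illustration. The median is computed on the training rows only, so that no information from the validation rows leaks into preprocessing.

```r
set.seed(100)

# Toy data: 20 delayed and 30 on-time flights, with a few missing gust values
n <- 50
toy <- data.frame(
  wind_gust = round(runif(n, min = 5, max = 35), 1),
  arr_status = rep(c("Delay", "Not Delay"), times = c(20, 30))
)
toy$wind_gust[c(3, 17, 42)] <- NA

# Hypothetical helper: sample a share p of the row indices within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  unlist(picked, use.names = FALSE)
}

train_idx <- stratified_index(toy$arr_status, p = 0.8)
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Median imputation: learn the value on train, apply it to both sets
gust_median <- median(train$wind_gust, na.rm = TRUE)
train$wind_gust[is.na(train$wind_gust)] <- gust_median
valid$wind_gust[is.na(valid$wind_gust)] <- gust_median

# Both sets keep the original 40/60 class proportion
print(prop.table(table(train$arr_status)))
print(prop.table(table(valid$arr_status)))
```

The median is less sensitive to extreme values than the mean, which can matter for skewed weather variables; whether it is the better choice for your data is something the report should justify.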

Prediction Performance

  • (1 Point) Reached Accuracy > 75% in (your own) validation dataset.

  • (1 Point) Reached Sensitivity > 73% in (your own) validation dataset.

  • (1 Point) Reached Specificity > 75% in (your own) validation dataset.

  • (1 Point) Reached Precision > 70% in (your own) validation dataset.

  • (2 Points) Reached Accuracy > 75% in test dataset.

  • (2 Points) Reached Sensitivity > 73% in test dataset.

  • (2 Points) Reached Specificity > 75% in test dataset.

  • (2 Points) Reached Precision > 70% in test dataset.

Conclusion

  • (2 Points) Write the conclusion of your capstone project
    • Is your goal achieved?
    • Can the problem be solved by machine learning?
    • What model did you use and how is the performance?
    • What is the potential business implementation of your capstone project?

Submission

After finishing data preprocessing, modeling, and model evaluation, the next steps are:

  1. Apply your model to the data_test.csv that comes within the case.
  2. Follow the template submission-example that each case provided to store your prediction. You should save your prediction as a .csv file.
  3. Log in to the scoring dashboard to see the metrics you achieved.
  4. Submit your html report to ML-Capstone Classwork!
  5. Please do not publish your work on Rpubs or any other online website, since the data are confidential.