class: clear, center, middle

<br><br><br><br><br>

.font200.grey[Walmart Recruiting - Store Sales Forecasting]

<br><br><br><br>

# .font200[Machine Learning with
<i class="fab fa-r-project faa-pulse animated faa-slow " style=" color:steelblue;"></i>
]

### Bolun Zhang, Nan Li, Rahat Saiful and Zhaohu (Jonathan) Fan

### April 11, 2019

---

# Introduction

This module introduces concepts that are useful for any type of machine learning model:

- modeling process versus a model
- data splitting
- nuances of the R modeling ecosystem
- resampling
- bias-variance trade-off
- model evaluation

<br>

.center.bold[Many of these topics will be put into action in later sections.]

---

# Overview

.pull-left[

* The machine learning process is highly iterative and heuristic-based
* It is common for many ML approaches to be applied, evaluated, and modified before a final, optimal model can be determined
* A proper process needs to be implemented to have confidence in our results

<br><br>

.center.bold.blue[_Not a short sprint!_]

]

.pull-right[

<br><br>

<img src="images/modeling_process.png" style="display: block; margin: auto;" />

]

---

# Prerequisites .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1]

.pull-left[

.center.bold.font110[Packages]

```r
library(rsample)
library(caret)
library(tidyverse)
library(ggplot2)
library(visdat)
library(naniar)
```

]

.pull-right[

.center.bold.font110[Data]

```r
# Walmart data: set the project folder once, then read files by relative path
setwd("C:/Users/Zhaohu/Box/walmart-recruiting-store-sales-forecasting")
#ames <- AmesHousing::make_ames()
features <- read.csv("features.csv", header = TRUE, sep = ",")
stores   <- read.csv("stores.csv", header = TRUE, sep = ",")
data     <- read.csv("train.csv", header = TRUE, sep = ",")

# Join features with store attributes on Store; all = TRUE keeps unmatched rows
data_merge <- merge(features, stores, by = "Store", all = TRUE)
#data_merge1=merge(features,stores,data,by=c("Store"),all = TRUE)
```

]

---

# Prerequisites .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2]

.pull-left[

.center.bold.font110[Missing Data]

```r
vis_dat(data_merge)
```

<img src="Draft_DataMining_Project2_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

]

.pull-right[

.center.bold.font110[Missing Data]

```r
vis_miss(data_merge)
```

<img src="Draft_DataMining_Project2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

]

---

class: center, middle, inverse

.font300.white[Data Splitting]

---

# Generalizability

__Generalizability__: we want an algorithm that not only fits our past data well but, more importantly, .blue[predicts a future outcome accurately].

--

- .bold[Training Set]: these data are used to develop feature sets, train our algorithms, tune hyper-parameters, compare models, and carry out all of the other activities required to reach a final model decision.

- .bold[Test Set]: having chosen a final model, we use these data to obtain an unbiased estimate of the model's performance (generalization error).

--

.pull-left[

<br><br>

.center.bold.red[DO NOT TOUCH THE TEST SET UNTIL THE VERY END!!!]

]

.pull-right[

<img src="images/nope.png" width="30%" height="30%" style="display: block; margin: auto;" />

]

---

# Questions?

<img src="https://media.makeameme.org/created/i-love-questions.jpg" width="50%" height="50%" style="display: block; margin: auto;" />
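
---

# Data Splitting: A Sketch

A minimal sketch of the training/test split described on the Generalizability slide, using `rsample` (already loaded in code chunk 1). The toy data frame, its column names, the 70/30 proportion, and the seed are illustrative assumptions and are not taken from the Walmart files. For time-ordered sales data, a date-based split may be more appropriate than the simple random split shown here.

```r
library(rsample)

# Toy stand-in for the merged data (hypothetical columns and values)
set.seed(123)
toy <- data.frame(
  id           = 1:100,
  Store        = rep(1:5, each = 20),
  Weekly_Sales = round(runif(100, min = 1000, max = 50000), 2)
)

# Simple random split; prop = 0.7 is an assumed proportion
split <- initial_split(toy, prop = 0.7)
train <- training(split)  # feature engineering, tuning, model comparison
test  <- testing(split)   # held out until a final model is chosen

# Row counts of the two partitions
c(train = nrow(train), test = nrow(test))
```

Only `train` would be used while developing and comparing models; `test` is set aside until a final model has been chosen.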