Walmart Recruiting-Store Sales Forecasting

class: clear, center, middle

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .font200.grey[Walmart Recuriting-Store Sales Forecasting]

# .font200[Machine Learning with ]

### Bolun Zhang, Nan Li, Rahat Saiful, and Zhaohu (Jonathan) Fan
### April 11, 2019

---
# Introduction

This module introduces concepts that are useful for any type of machine learning model:

- modeling process versus a model

- data splitting

- nuances of the R modeling ecosystem

- resampling

- bias-variance trade-off

- model evaluation

.center.bold[Many of these topics will be put into action in later sections.]

---
# Overview

.pull-left[

* The machine learning process is highly iterative and heuristic-based

* Many ML approaches are commonly applied, evaluated, and modified before a final, optimal model can be determined

* A proper process needs to be implemented to have confidence in our results

.center.bold.blue[_Not a short sprint!_]

]

.pull-right[

]

---
# Prerequisites .red[ code chunk 1]

.pull-left[

.center.bold.font110[Packages]

```r
library(rsample)
library(caret)
library(tidyverse)
library(ggplot2)
library(visdat)
library(naniar)
```

]

.pull-right[

.center.bold.font110[Data]

```r
# Walmart data: all CSV files live in one project directory
data_dir <- "C:/Users/Zhaohu/Box/walmart-recruiting-store-sales-forecasting"
setwd(data_dir)
features <- read.csv(file.path(data_dir, "features.csv"), header = TRUE, sep = ",")
stores   <- read.csv(file.path(data_dir, "stores.csv"), header = TRUE, sep = ",")
data     <- read.csv(file.path(data_dir, "train.csv"), header = TRUE, sep = ",")
# attach store-level attributes to every row of the features table
data_merge <- merge(features, stores, by = "Store", all = TRUE)
```

]

---
# Prerequisites .red[ code chunk 2]

.pull-left[

.center.bold.font110[Missing Data]

```r
vis_dat(data_merge)
```

]

.pull-right[

.center.bold.font110[Missing Data]

```r
vis_miss(data_merge)
```

<img src="Draft_DataMining_Project2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
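
---
# Missing Data by Column

The plots above show missingness visually; a numeric summary can sit alongside them. The sketch below uses only base R and a small hypothetical data frame (`toy_merge`, with illustrative column names) rather than the real `data_merge`, so the values are not project results.

```r
# Hypothetical stand-in for data_merge; columns and values are illustrative only
toy_merge <- data.frame(
  Store     = c(1, 1, 2, 2),
  MarkDown1 = c(NA, 10.5, NA, NA),
  CPI       = c(211.1, NA, 210.8, 212.0)
)

# Percentage of missing values in each column
pct_missing <- round(100 * colMeans(is.na(toy_merge)), 1)
pct_missing
```

Replacing `toy_merge` with `data_merge` should give the per-column percentages that `vis_miss()` displays graphically.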

---

class: center, middle, inverse

.font300.white[Data Splitting]

---
# Generalizability

__Generalizability__: we want an algorithm that not only fits well to our past data, but more importantly, one that .blue[predicts a future outcome accurately].

- .bold[Training Set]: these data are used to develop feature sets, train our algorithms, tune hyper-parameters, compare across models, and all of the other activities required to reach a final model decision.

- .bold[Test Set]: having chosen a final model, these data are used to obtain an unbiased estimate of the model's performance (generalization error).

.pull-left[

.center.bold.red[DO NOT TOUCH THE TEST SET UNTIL THE VERY END!!!]

]

.pull-right[

]
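
---
# Data Splitting Sketch

A minimal sketch of the training/test split described on the previous slide, using `rsample` (already loaded in the prerequisites). The data frame `toy_sales`, the seed, and the 70/30 ratio are assumptions made for illustration; they are not taken from the project.

```r
library(rsample)

# Hypothetical stand-in for the merged Walmart data (not the real files)
set.seed(123)
toy_sales <- data.frame(
  Store        = rep(1:5, each = 20),
  Weekly_Sales = round(runif(100, 1000, 50000), 2)
)

# Randomly assign about 70% of rows to training; the ratio is an assumption
split <- initial_split(toy_sales, prop = 0.7)
train <- training(split)  # used for feature work, tuning, model comparison
test  <- testing(split)   # held back until a final model is chosen

nrow(train)
nrow(test)
```

A simple random split like this ignores time order. For weekly sales data, a split by date may be more appropriate, which is a design choice this sketch does not address.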

---
# Questions?