# Packages used throughout these notes
libraries <- c("ggplot2", "dplyr", "caTools", "party",
               "randomForest", "nnet", "e1071")

# Attach each package by name
lapply(libraries, library, character.only = TRUE)

Cross validation techniques

Training and testing data - Linear regression
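
The code behind the error value below is not shown, so here is a minimal sketch, assuming a caTools train/test split of mtcars and a simple linear model; the seed, split ratio, and predictor are assumptions, so the exact number may differ.

set.seed(123)                                     # assumed seed
split <- sample.split(mtcars$mpg, SplitRatio = 0.75)
train <- subset(mtcars, split == TRUE)
test  <- subset(mtcars, split == FALSE)

lm_fit <- lm(mpg ~ wt, data = train)              # fit on training data only
sqrt(mean((predict(lm_fit, test) - test$mpg)^2))  # error on the held-out test set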

## [1] 1.974877

Classification
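
A sketch of one way to produce per-class probabilities like those below, assuming a logistic regression on the mtcars transmission flag am; the predictors and the split are assumptions.

set.seed(123)
split <- sample.split(mtcars$am, SplitRatio = 0.75)
train <- subset(mtcars, split == TRUE)
test  <- subset(mtcars, split == FALSE)

log_fit <- glm(am ~ wt + hp, data = train, family = binomial)
p1 <- predict(log_fit, newdata = test, type = "response")  # P(am = 1)

data.frame(car_name = rownames(test), test.mpg = test$mpg,
           test.am = test$am, X0 = 1 - p1, X1 = p1)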

##         car_name test.mpg test.am        X0           X1
## 1      Mazda RX4     21.0       1 0.5000000 0.5000000000
## 2     Datsun 710     22.8       1 0.8807971 0.1192029220
## 3 Hornet 4 Drive     21.4       0 0.9996646 0.0003353501
## 4    Merc 450SLC     15.2       0 0.9996646 0.0003353501
## 5       Fiat 128     32.4       1 0.1192029 0.8807970780
## 6  Maserati Bora     15.0       1 0.9975274 0.0024726232
## 7     Volvo 142E     21.4       1 0.5000000 0.5000000000

Supervised clustering methods
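
The table below compares k-means cluster assignments against the known iris species labels. A minimal sketch, assuming three centers on the four measurement columns; the seed, and therefore the cluster numbering, is an assumption.

set.seed(123)
km_fit <- kmeans(iris[, 1:4], centers = 3)  # unsupervised fit, 3 clusters
table(iris$Species, km_fit$cluster)         # compare to the known labels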

##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 44  6

Tree-based models

### Train - test prediction
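
A minimal sketch of this step, assuming a conditional inference tree from party fit on an mtcars train/test split; the seed and predictors are assumptions. A regression tree predicts the mean mpg of the terminal node each test car lands in, which is why the predicted values below repeat.

set.seed(123)
split <- sample.split(mtcars$mpg, SplitRatio = 0.75)
train <- subset(mtcars, split == TRUE)
test  <- subset(mtcars, split == FALSE)

tree_fit <- ctree(mpg ~ wt + hp + disp, data = train)
data.frame(row.names(test), test$mpg,
           mpg = predict(tree_fit, newdata = test),
           test.class = predict(tree_fit, newdata = test, type = "node"))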

##   row.names.test. test.mpg      mpg test.class
## 1       Mazda RX4     21.0 25.88889          2
## 2      Datsun 710     22.8 25.88889          2
## 3  Hornet 4 Drive     21.4 16.29375          3
## 4     Merc 450SLC     15.2 16.29375          3
## 5        Fiat 128     32.4 25.88889          2
## 6   Maserati Bora     15.0 16.29375          3
## 7      Volvo 142E     21.4 25.88889          2

Neural networks

A neural network, as its name implies, takes its computational form from the way neurons in a biological system work. In essence, for a given list of inputs, a neural network performs a number of processing steps before returning an output. The complexity in neural networks comes in how many processing steps there are, and how complex each particular step might be.

We have a number of aspects in a neural network to be cognizant of:

  • The input layer: This is a layer that takes in a number of features, including a bias node, which is often just an offset parameter.

  • The hidden, or “compute”, layer: This is the layer that computes some function of each feature. The number of nodes in this hidden layer depends on the computation. Sometimes, it might be as simple as one node in this layer. Other times, the picture might be more complex, with multiple hidden layers.

  • The output layer: This is the layer that returns the network’s final computed values.
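
A minimal sketch of the fit traced below, assuming nnet() with a single hidden layer of three units, which matches the 27 weights reported ((4 inputs + bias) × 3 hidden + (3 hidden + bias) × 3 outputs); the seed is an assumption. As the confusion matrix shows, this particular run converges to a poor local minimum that cannot separate versicolor from virginica.

set.seed(123)
nn_fit <- nnet(Species ~ ., data = iris, size = 3)          # 3 hidden units
table(iris$Species, predict(nn_fit, iris, type = "class"))  # confusion matrix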

## # weights:  27
## initial  value 188.800225 
## iter  10 value 69.665454
## iter  20 value 69.326775
## iter  30 value 69.314729
## final  value 69.314718 
## converged
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         28        22
##   virginica       0         21        29

Support vector machines

The idea is that we are taking data and trying to find a plane or a line that can separate the data into different classes.

Suppose that we have n features in our data and m observations, or rows. If n is much greater than m (e.g., n = 1000, m = 10), we would want to use logistic regression. If we have the opposite, we might want to use an SVM instead.
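
A sketch of the fit behind the confusion table below, assuming e1071’s svm() with default settings on the full iris data.

svm_fit <- svm(Species ~ ., data = iris)
table(iris$Species, predict(svm_fit, iris))  # fitted vs. true species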

##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

Sampling statistics and model training in R

  • Population: is the entire collection (or universe) of things under consideration
  • Sample: is a portion of the population that we select for analysis

When we talk about values related to the terms mean, variance, and standard deviation in relation to the total population, these are called parameters. When we talk about those same values, but specific to a certain subset of the data, we call them statistics.

Bias and variance

  • Low bias, low variance: Best-case scenario. Samples are pretty well representative of the population.

  • High bias, low variance: The samples are all pretty consistent, but not really reflective of the population.

  • Low bias, high variance: The samples vary widely in their consistency, but some might be representative of the population.

  • High bias, high variance: Worst-case scenario. The samples are neither consistent nor representative of the population.

Sampling methods:

  • Random sampling
  • Stratified random sampling (ensures representation in each stratum; it can be more accurate than a simple random sample if there is more variation in one stratum than in others)
  • Systematic sampling (you randomly pick a number, n, and then pick every nth data point in the dataset)
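
A sketch of simple random sampling, assuming the k-means cluster assignments from earlier have been attached to iris and a sample size of 112 rows; both, like the seed, are assumptions.

iris$cluster <- as.factor(km_fit$cluster)   # assumed carry-over from the k-means fit
set.seed(123)
sampled <- iris[sample(nrow(iris), 112), ]  # 112 rows drawn without replacement
head(sampled)
summary(iris)     # the full population
summary(sampled)  # the random sample
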
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species cluster
## 83           5.8         2.7          3.9         1.2 versicolor       3
## 113          6.8         3.0          5.5         2.1  virginica       2
## 75           6.4         2.9          4.3         1.3 versicolor       3
## 80           5.7         2.6          3.5         1.0 versicolor       3
## 139          6.0         3.0          4.8         1.8  virginica       3
## 20           5.1         3.8          1.5         0.3     setosa       1
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :50   1:50   
##  versicolor:50   2:46   
##  virginica :50   3:54   
##                         
##                         
## 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.200   Min.   :1.100   Min.   :0.100  
##  1st Qu.:5.175   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.375  
##  Median :5.800   Median :3.000   Median :4.400   Median :1.300  
##  Mean   :5.896   Mean   :3.059   Mean   :3.835   Mean   :1.232  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.900  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :36   1:36   
##  versicolor:36   2:36   
##  virginica :40   3:40   
##                         
##                         
## 

Stratified sampling
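
A sketch using dplyr to draw an equal number of rows from each Species stratum; the per-stratum size of 25 and the seed are assumptions.

set.seed(123)
stratified <- iris %>%
  group_by(Species) %>%  # one stratum per species
  sample_n(25) %>%       # equal draw from each stratum
  ungroup()
summary(as.data.frame(stratified))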

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.400   Min.   :2.200   Min.   :1.200   Min.   :0.100  
##  1st Qu.:5.200   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.350  
##  Median :5.700   Median :3.000   Median :4.500   Median :1.400  
##  Mean   :5.844   Mean   :3.125   Mean   :3.765   Mean   :1.215  
##  3rd Qu.:6.350   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.700   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :25   1:25   
##  versicolor:25   2:22   
##  virginica :25   3:28   
##                         
##                         
## 

Systematic sampling
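
A sketch of a systematic sampler: compute the sampling interval k, pick a random starting point, then take every kth row. The sys_sample() helper is hypothetical; note that if the generated indices overrun the number of rows, the out-of-range indexing produces NA rows like those in the summary below.

sys_sample <- function(data, n) {
  k <- ceiling(nrow(data) / n)                # sampling interval
  start <- sample(1:k, 1)                     # random starting point
  data[seq(start, by = k, length.out = n), ]  # every kth row from the start
}
set.seed(123)
summary(sys_sample(iris, 75))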

##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width  
##  Min.   :4.300   Min.   :2.200   Min.   :1.10   Min.   :0.10  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.55   1st Qu.:0.35  
##  Median :5.700   Median :3.000   Median :4.20   Median :1.30  
##  Mean   :5.847   Mean   :3.051   Mean   :3.74   Mean   :1.18  
##  3rd Qu.:6.400   3rd Qu.:3.250   3rd Qu.:5.10   3rd Qu.:1.80  
##  Max.   :7.900   Max.   :4.400   Max.   :6.70   Max.   :2.50  
##  NA's   :39      NA's   :39      NA's   :39     NA's   :39    
##        Species   cluster  
##  setosa    :25   1   :25  
##  versicolor:25   2   :24  
##  virginica :25   3   :26  
##  NA's      :39   NA's:39  
##                           
##                           
## 

There are two major assumptions that we work with when doing these training/test splits:

  • The data is a fair representation of the actual processes that you want to model (i.e., the subset accurately reflects the population).
  • The processes that you want to model are relatively stable over time, so that a model built with last month’s data should accurately reflect next month’s data.

Training and test in regression modeling
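
The summary below comes from a fit on synthetic data; the data-generating process in this sketch is purely an assumption, chosen only to give a skewed residual distribution like the one shown.

set.seed(123)
x <- rnorm(100, mean = 2)
y <- -14 + 12 * x + rexp(100, rate = 0.2)  # assumed right-skewed noise
summary(lm(y ~ x))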

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6481 -3.7122 -1.9390  0.9698 29.8283 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.6323     1.6335  -8.345 4.63e-13 ***
## x            11.9801     0.7167  16.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.51 on 98 degrees of freedom
## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7377 
## F-statistic: 279.4 on 1 and 98 DF,  p-value: < 2.2e-16

RMSE - Root mean square error

\[ RMSE = \sqrt{\frac{1}{n}\sum(y_{predicted} - y_{actual})^2} \]
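
Translated directly into R, with predicted and actual standing in for a model’s test-set output and the true values:

rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))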

##    predicted    actual         SE
## 2   7.874300  6.383579 0.07407499
## 3  28.504227 34.624423 1.24855995
## 4  11.341893  7.233768 0.56255641
## 5  12.019753  6.505638 1.01351529
## 12 14.678243 11.102747 0.42613909
## 15  4.118657  2.335049 0.10604193
## [1] 6.946493

Error measures

#### MAE - Mean absolute error

\[ MAE = \frac{1}{n}\sum|y_{predicted} - y_{actual}| \]

#### RRSE - Root relative squared error

\[ RRSE = \sqrt{\frac{\sum(y_{predicted}-y_{actual})^2}{\sum(\bar{y}_{predicted}-y_{actual})^2}} \]

#### RAE - Relative absolute error

\[ RAE = \frac{\sum|y_{predicted}-y_{actual}|}{\sum|\bar{y}_{predicted}-y_{actual}|} \]

Where \(\bar{y}_{predicted}\) is the average value of our model’s output and is just a scalar number.
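
These translate into R helpers as follows, with mean(predicted) playing the role of \(\bar{y}_{predicted}\) from the formulas above.

mae  <- function(predicted, actual) mean(abs(predicted - actual))
rrse <- function(predicted, actual)
  sqrt(sum((predicted - actual)^2) / sum((mean(predicted) - actual)^2))
rae  <- function(predicted, actual)
  sum(abs(predicted - actual)) / sum(abs(mean(predicted) - actual))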

Training and test sets: Classification modeling
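
A sketch of the two-class setup behind the tables below, collapsing iris to “setosa” versus “other”; the model, predictors, and split here are all assumptions.

iris$binary <- as.factor(ifelse(iris$Species == "setosa", "setosa", "other"))
set.seed(123)
split <- sample.split(iris$binary, SplitRatio = 0.7)
train <- subset(iris, split == TRUE)
test  <- subset(iris, split == FALSE)

bin_fit <- svm(binary ~ Sepal.Length + Sepal.Width, data = train)
iris_prediction <- predict(bin_fit, test)
table(iris_prediction, test$binary)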

##                
## iris_prediction other setosa
##          other     34      0
##          setosa     0     11

  • True positives: The model predicted “setosa” classes and got them right.
  • True negatives: The model predicted “other” classes and got them right.
  • False positives: The model predicted “setosa” classes, but the correct answer was “other”.
  • False negatives: The model predicted “other” classes, but the correct answer was “setosa”.

##                
## iris_prediction other setosa
##          other     28      3
##          setosa     2     12

Reading the second table with “setosa” as the positive class: TP = 12, TN = 28, FP = 2, FN = 3.

Sensitivity (equivalent to hit rate, or recall): This is the measure to emphasize if you have a lower threshold set for your classification model. You would set a lower bar if you didn’t want to miss out on any plants that could possibly be of the “setosa” type.

\[ Sensitivity = \frac{TP}{TP + FN} \]

Specificity: Logically the same thing as precision, but for the opposite case, when you’re predicting whether a plant isn’t a “setosa” variant.

\[ Specificity = \frac{TN}{TN + FP} \]

Precision (or positive predictive value): The number of correctly predicted positive cases divided by the total number of predicted positives. If you had a model with very high precision, that would be akin to setting a threshold in your model to say, “Only classify a plant as setosa if we are absolutely sure about it.”

\[ Precision = \frac{TP}{TP + FP} \]

Accuracy: The number of correct predictions divided by the total number of cases.

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

F1 score: The harmonic mean (a weighted average) of the precision and recall scores.

\[ F1 = \frac{2TP}{2TP + FP + FN} \]
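
Computed from the counts read off the second confusion table:

TP <- 12; TN <- 28; FP <- 2; FN <- 3
TP / (TP + FN)                   # sensitivity
TN / (TN + FP)                   # specificity
TP / (TP + FP)                   # precision
(TP + TN) / (TP + TN + FP + FN)  # accuracy
2 * TP / (2 * TP + FP + FN)      # F1 score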

Cross validation

Cross-validation is a statistical technique by which you take your entire dataset, split it into a number of small train/test chunks, evaluate the error for each chunk, and then average those final errors. This approach winds up being a more accurate way of assessing whether your modeling approach has any issues that could be hidden in various combinations of the training and test parts of the dataset.

k-fold cross-validation

This involves taking your dataset and splitting it into k chunks. For each of these chunks, you then split the data into a smaller train/test set and evaluate that individual chunk’s error. After you have all the errors for all the chunks, you simply take the average. The advantage of this method is that you can see the error in all aspects of your data, instead of just testing on one specific subset of it.
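
A hand-rolled sketch of 10-fold cross-validation, assuming the synthetic x/y data from the regression section above; the per-fold errors and their mean parallel the output below.

set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = length(y)))  # assign each row to a fold
dat <- data.frame(x, y)
errors <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[folds != i, ])      # train on the other k - 1 folds
  pred <- predict(fit, newdata = dat[folds == i, ])
  sqrt(mean((pred - dat$y[folds == i])^2))         # RMSE on the held-out fold
})
errors
mean(errors)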

##  [1]  7.091023  8.244583 11.774470  5.571905  9.145773  8.569813  9.325125
##  [8]  6.289492  7.452926  7.116796
## [1] 7.325628