libraries <- c("ggplot2", "dplyr", "caTools", "party",
               "randomForest", "nnet", "e1071")

lapply(libraries, library, character.only = TRUE)

Cross validation techniques

Bootstrap cross-validation
Bootstrap 632 cross validation
k-fold cross valadation
repeated cross-validation
leave-one-out cross validation
leave-group-out cross-validation
Out-of-bag cross-validation
Adaptative cross-validation
Adaptative bootstrap cross-validation
Adaptative leave-group-out cross-validation

Training and testing data - Linear regression

# Sample size

split_size <- 0.8
sample_size <- floor(split_size * nrow(mtcars))

set.seed(10)


train_indices <- sample(seq_len(nrow(mtcars)), size = sample_size)


# Splitting dataframe
train <- mtcars[train_indices, ]
test <- mtcars[-train_indices, ]


# Model training

model <- lm(mpg ~ disp, data = train)


# Model prediction
new_data <- dplyr::tibble(disp = test$disp)
test$output <- predict(model, new_data)


# RMSE

sqrt(sum(test$mpg - test$output) ** 2 / nrow(test))

## [1] 1.974877

Classification

mtcars %>% 
  ggplot(aes(x = mpg, y = am)) +
  geom_point() + theme_bw() +
  xlab("Fuel Efficiency (Miles per Gallon)") +
  ylab("Vehicle Transmission Type (0 = Automatic, 1 = Manual)")

# Splitting Y and X
label_train <- train[, 9]
data_train <- train[, -9]

# Logistic regression
l_model <- LogitBoost(data_train, label_train)

# Prediction
Lab <- predict(l_model, test, type = "raw")
l_prediction <-  data.frame(car_name = row.names(test), test$mpg, test$am, Lab)
l_prediction

##         car_name test.mpg test.am        X0           X1
## 1      Mazda RX4     21.0       1 0.5000000 0.5000000000
## 2     Datsun 710     22.8       1 0.8807971 0.1192029220
## 3 Hornet 4 Drive     21.4       0 0.9996646 0.0003353501
## 4    Merc 450SLC     15.2       0 0.9996646 0.0003353501
## 5       Fiat 128     32.4       1 0.1192029 0.8807970780
## 6  Maserati Bora     15.0       1 0.9975274 0.0024726232
## 7     Volvo 142E     21.4       1 0.5000000 0.5000000000

Supervised clustering methods

iris %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  xlab("Petal  Lenght") +
  ylab("Petal Width") +
  theme_minimal()

data_cluster <- tibble(iris$Petal.Length, iris$Petal.Width)

iris_kmeans <- kmeans(data_cluster, 3)
iris$cluster <- as.factor(iris_kmeans$cluster)

iris %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() +
  xlab("Petal  Lenght") +
  ylab("Petal Width") +
  theme_minimal() +
  geom_point(data = as.data.frame(iris_kmeans$centers),
             aes(x = `iris$Petal.Length`, y = `iris$Petal.Width`, col = "Centroid"))

table(iris$Species, iris_kmeans$cluster)

##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 44  6

Tree-based models

tree <- ctree(mpg ~ ., data = mtcars)
plot(tree)

### Train - test prediction

tree_train <- ctree(mpg ~., train)
plot(tree_train)

test$mpg_tree <- predict(tree_train, test)
test$class <- predict(tree_train, test, type = "node")

t_model <- data.frame(row.names(test), test$mpg, test$mpg_tree, test$class)
t_model

##   row.names.test. test.mpg      mpg test.class
## 1       Mazda RX4     21.0 25.88889          2
## 2      Datsun 710     22.8 25.88889          2
## 3  Hornet 4 Drive     21.4 16.29375          3
## 4     Merc 450SLC     15.2 16.29375          3
## 5        Fiat 128     32.4 25.88889          2
## 6   Maserati Bora     15.0 16.29375          3
## 7      Volvo 142E     21.4 25.88889          2

Random Forests

mtcars_rf <- randomForest(mpg ~ ., data = mtcars, ntree = 100,
                          keep.forest = FALSE, importance = TRUE)

plot(mtcars_rf, log = "y", title = "")

Neural networks

A neural network, as its name implies, takes its computational form from the way neurons in a biological system work. In essence, for a given list of inputs, a neural nwtwork performs a number of processing steps before running an output. The complexity in neural networks comes in how many of th processing steps are, and how complex each particular step might be.

We have a number of aspects in a neural network to be cognizant of:

The input layer: This is a layer that takes in a number of features, including a bias node, which id often just an offset parameter
The hidden layer, or compute" layer: This is the layer that computes some function of each feature. The number of nodes in this hidden layer depends on the computation. Sometimes, it might be as simple as one node in this layer. Other times, the picture might be more complex with multiple hidden layers.

iris_nn <- nnet(Species ~ ., data = iris[, -6], size = 3)

## # weights:  27
## initial  value 188.800225 
## iter  10 value 69.665454
## iter  20 value 69.326775
## iter  30 value 69.314729
## final  value 69.314718 
## converged

table(iris$Species, predict(iris_nn, iris, type = "class"))

##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         28        22
##   virginica       0         21        29

Support vector machines

the idea is that we are taking data and trying to find a plane or a line that can separate the data into different classes.

Suppose that we have n features in your data and m observations, or rows . If n is much greater than m (e.g, n = 1000, m = 10), we would want to use a logistic regression. If we have the opossite, we might want to use an SVM instead.

iris_svm <- svm(Species ~ ., data = iris[, -6])
table(iris$Species, predict(iris_svm, iris, type = "class"))

##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

Sampling Statistics and mode training in R

Population: is the entire collection (or universe) of things under consideration
Sample: is a portion of the population that we select for analysis

When we talk about values related to the terms mean, variance and standard deviation in relation to the total population, these are called parameters. When we talk about those same values, but specific toa certain subset of the data, we call them statistics

Bias and variance

low bias and low variance: best-case scenario. Samples are pretty well representative of the population.
High bias, low variance: The samples are all pretty consistent, but not really reflective of the population.
Low bias, high variance: The samples vary widly in their consistency, but some might be representative of the population
Low bias, high variance: the samples are a little more consistent, but not likely to be representative of the population.
Sampling methods.
- Random sampling
- Stratified random sampling ( ensures representations in each strata, it can be more accurate than a simple random sample if there is more variation in one strata than anothers)
- systematic sample (you randomly pick a number, n, and the pick every nth data point in the dataset)

iris_df <- data.frame(iris)

sample_index <- sample(1:nrow(iris_df), nrow(iris) * 0.75, replace = FALSE)
head(iris[sample_index, ])

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species cluster
## 83           5.8         2.7          3.9         1.2 versicolor       3
## 113          6.8         3.0          5.5         2.1  virginica       2
## 75           6.4         2.9          4.3         1.3 versicolor       3
## 80           5.7         2.6          3.5         1.0 versicolor       3
## 139          6.0         3.0          4.8         1.8  virginica       3
## 20           5.1         3.8          1.5         0.3     setosa       1

summary(iris_df)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :50   1:50   
##  versicolor:50   2:46   
##  virginica :50   3:54   
##                         
##                         
##

summary(iris[sample_index, ])

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.200   Min.   :1.100   Min.   :0.100  
##  1st Qu.:5.175   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.375  
##  Median :5.800   Median :3.000   Median :4.400   Median :1.300  
##  Mean   :5.896   Mean   :3.059   Mean   :3.835   Mean   :1.232  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.900  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :36   1:36   
##  versicolor:36   2:36   
##  virginica :40   3:40   
##                         
##                         
##

Stratified sampling

str_sample <- iris %>% group_by(Species) %>% sample_frac(size = 0.5)
summary(str_sample)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.400   Min.   :2.200   Min.   :1.200   Min.   :0.100  
##  1st Qu.:5.200   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.350  
##  Median :5.700   Median :3.000   Median :4.500   Median :1.400  
##  Mean   :5.844   Mean   :3.125   Mean   :3.765   Mean   :1.215  
##  3rd Qu.:6.350   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.700   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   cluster
##  setosa    :25   1:25   
##  versicolor:25   2:22   
##  virginica :25   3:28   
##                         
##                         
##

Systematic sampling

sys_sample <- function(N, n){
  k = ceiling(N/n)
  r = sample(1:k, 1)
  sys_samp = seq(r, r + k * (n + 1), k)
}


sys_index <- sys_sample(nrow(iris), nrow(iris) * 0.75)
summary(iris[sys_index, ])

##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width  
##  Min.   :4.300   Min.   :2.200   Min.   :1.10   Min.   :0.10  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.55   1st Qu.:0.35  
##  Median :5.700   Median :3.000   Median :4.20   Median :1.30  
##  Mean   :5.847   Mean   :3.051   Mean   :3.74   Mean   :1.18  
##  3rd Qu.:6.400   3rd Qu.:3.250   3rd Qu.:5.10   3rd Qu.:1.80  
##  Max.   :7.900   Max.   :4.400   Max.   :6.70   Max.   :2.50  
##  NA's   :39      NA's   :39      NA's   :39     NA's   :39    
##        Species   cluster  
##  setosa    :25   1   :25  
##  versicolor:25   2   :24  
##  virginica :25   3   :26  
##  NA's      :39   NA's:39  
##                           
##                           
##

There are two major assuptions that we work with doing these training/test splits:

The data is a fair representation of the actual processes that you want to model (i.e., the subset accurately reflects the population)
The processes that you want to model are relatively stable over time and that a model built with last month’s data should accurately reflect next month’s data.

Training and test in regression modeling

set.seed(123)

x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(5, 0, 2)

qplot(x, y) + theme_minimal() + geom_smooth(method = "lm", se = FALSE)

linear_model <- lm(y ~ x)
summary(linear_model)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6481 -3.7122 -1.9390  0.9698 29.8283 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.6323     1.6335  -8.345 4.63e-13 ***
## x            11.9801     0.7167  16.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.51 on 98 degrees of freedom
## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7377 
## F-statistic: 279.4 on 1 and 98 DF,  p-value: < 2.2e-16

data_linear <- data.frame(x, y)
data_samples <- sample(1:nrow(data_linear),
                       nrow(data_linear) * 0.7, replace = FALSE)

training_data <- data_linear[data_samples, ]
test_data <- data_linear[-data_samples, ]


train_linear <- lm(y ~ x, training_data)
train_output <- predict(train_linear, test_data)

RSME - Root mean square error

\[ RMSE = \sqrt{ \frac{1}{n} \sum{(y_{predicted} - y_{actual})} ^ 2} \]

RMSE_df <- data.frame( predicted = train_output, actual = test_data$y,
                       SE = ((train_output - test_data$y)^2) / length(train_output))

head(RMSE_df)

##    predicted    actual         SE
## 2   7.874300  6.383579 0.07407499
## 3  28.504227 34.624423 1.24855995
## 4  11.341893  7.233768 0.56255641
## 5  12.019753  6.505638 1.01351529
## 12 14.678243 11.102747 0.42613909
## 15  4.118657  2.335049 0.10604193

sqrt(sum(RMSE_df$SE))

## [1] 6.946493

Cuadratic model

train_quadratic <- lm(y ~ x^2 + x, data = training_data)
quadratic_output <- predict(train_quadratic, test_data)

RMSE_quad_df <- data.frame( predicted = quadratic_output, actual = test_data$y,
                       SE = ((quadratic_output - test_data$y)^2) / length(train_output))

sqrt(sum(RMSE_quad_df$SE))

## [1] 6.946493

Polynomial model

train_polyn <- lm(y ~ poly(x, 4), data = training_data)
polyn_output <- predict(train_polyn, test_data)

RMSE_poly_df <- data.frame( predicted = polyn_output, actual = test_data$y,
                       SE = ((polyn_output - test_data$y)^2) / length(train_output))

sqrt(sum(RMSE_poly_df$SE))

## [1] 0.8836878

Error measures

Mean absolute error

\[ MAE = \frac{1}{n}\sum|y_{predicted} - y_{actual}| \] #### RRSE Root relative squared error

\[ RRSE = \sqrt{\frac{\sum(y_{predicted}-y_{actual})^2}{\sum(\bar{y}_{predicted}-y_{actual})^2}} \] #### RAE Relative Absolute error

\[ RRSE = \frac{\sum|y_{predicted}-y_{actual}|}{\sum|\bar{y}_{predicted}-y_{actual}|} \]

Where \(\bar{y}_{predicted}\) is the average value of our model output and is just a scalar number

Training and test sets: Classification modeling

iris_df <- iris[, -6] %>%
  mutate(Species = as.character(Species),
         Species = if_else(Species != "setosa", "other", Species),
         Species = as.factor(Species)) 

iris_sample <- sample(1:nrow(iris_df), nrow(iris_df) * 0.7,
                      replace = FALSE)

train_iris <- iris_df[iris_sample, ]
test_iris <- iris_df[-iris_sample, ]

iris_rf <- randomForest(Species ~ ., data = train_iris)
iris_prediction <- predict(iris_rf, test_iris)

table(iris_prediction, test_iris$Species)

##                
## iris_prediction other setosa
##          other     34      0
##          setosa     0     11

True positives: The model predicted “setosa” classes and got them right
True negatives: The model predicted “other” classes and got them right
False positives The model predicted “setosa” classes, but the correct answer wat other
False negatives the model “other” classes, but the correct answer was setosa

example <- table(iris_prediction, test_iris$Species)
example[1,1] <- 28
example[1,2] <- 3
example[2,1] <- 2
example[2,2] <- 12

example

##                
## iris_prediction other setosa
##          other     28      3
##          setosa     2     12

TP = 15 TN = 26 FP = 2 FN = 3

Sensitivity (equivalent to hit rate, or recall: often called recall, this is if you have a lower threshold set for your classification model. You would set a lower bar if you didn’t want to miss out on any plants that could possibly be of a “setosa” type

\[ Sensitivity = \frac{TP}{TP + FN} \]

Specificity: Logically the same thing as precision, but for the opposite case when you’re predicting whether a plant isn’t a “setosa” variant

\[ Specificity = \frac{TN}{TN + FP} \]

Precision (Or positive predictive value): The number of positive cases you’ve predicted divided divided by the total predicted positive. If you had a model that had a very high sensitivity, that would be akin to setting a threshold in your model to say, “Only classify a plant as”setosa" if we are absolutely sure about it"

\[ Precision = \frac{TP}{TP + FP} \]

Accuracy: Number of true cases divided by the total true and false cases \[ Accuracy = \frac{TP + TN}{ TP + TN + TP + FN} \]

F1 Score weighted average of precision and recall scores

\[ F1 = \frac{2TP}{2TP + FP + FN} \]

Cross validation

Is a statistical technique by which you take your entire dataset, split it into a number of small train/test chunks, evaluate the error for each chunk, and then average those final errors. This approach winds up being a more accurate way of assessing whether your modeling approach has any issues that could be hidden in various combinations of the combinations of the training and test parts o the dataset

k-fold cross-validation

This involves taking your dataset and splitting it into k chunks. For each of these chunks, you then split the data into a smaller train/test set and then evaluate that individual chunk’s error. After you have all the errors for all the chunks, you simply take the average. the advantage to this method is that you can then see the error in all aspects of your data instead of just testing no one specific subset of it.

set.seed(2)

x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(5, 0, 2)
data <- data.frame(x, y)

data_shuffled <- data[sample(nrow(data)), ]
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE)

errors <- c(0)

for (i in 1:10) {
  
  fold_indexes <- which(folds == i, arr.ind = TRUE)
  
  
  test_data <- data[fold_indexes, ]
  training_data <- data[-fold_indexes, ]
  
  train_linear <- lm(y ~ x, training_data)
  train_output <- predict(train_linear, test_data)
  errors <- c(errors,
              sqrt(sum((train_output - test_data$y)^2/length(train_output))))
  
  
}

errors[2:11]

##  [1]  7.091023  8.244583 11.774470  5.571905  9.145773  8.569813  9.325125
##  [8]  6.289492  7.452926  7.116796

mean(errors[1:11])

## [1] 7.325628

Machine learning introduction

Miguel Arquez Abdala

12/1/2020