For this exercise we will use the following dataset, which contains information about Super Bowl commercials:
library(tidyverse)   # read_csv and the dplyr verbs used below
library(tidymodels)  # rsample, parsnip, recipes, tune, yardstick used throughout
superbowl.tbl <- read_csv("Data/superbowl.csv")
superbowl.tbl <- superbowl.tbl %>%
select(year, animals, celebrity, danger, funny, patriotic, show_product_quickly)
The variable descriptions for the dataset can be found at:
In particular, we are interested in the characteristics of the commercials, which are encoded by the following columns:

- animals
- celebrity
- danger
- funny
- patriotic
- show_product_quickly

Use a linear model to establish how much each of these characteristics is changing through the years. What are your input and response variables and their types? According to your analysis, are commercials becoming less funny lately? Are they increasingly using more celebrities? Make sure to interpret the coefficients associated with each input variable of your linear model.
superbowl.lm <- lm(year ~ ., data = superbowl.tbl)
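Here the response variable is year (numeric) and the input variables are the six characteristic columns, each a binary TRUE/FALSE indicator. The coefficients and p-values interpreted below can be read off the model summary:
summary(superbowl.lm)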
animals (coef: 0.0243, p-val: 0.974): Year is expected to be 0.0243 years higher for commercials that feature animals (animals = TRUE) compared to ones that don’t. This does not mean that featuring animals directly increases the year; the coefficient only describes an association. Since the p-value 0.974 > 0.05, it is not a statistically significant predictor, and if we wanted to simplify the model we might consider removing animals.
celebrity (coef: 2.13, p-val: 0.00680): Year is expected to be 2.13 years higher for commercials that feature celebrities (celebrity = TRUE) compared to ones that don’t. Since the p-value 0.00680 < 0.05, it is a statistically significant predictor, so we should keep this variable.
danger (coef: 0.565, p-val: 0.479): Year is expected to be 0.565 years higher for commercials that involve danger (danger = TRUE) compared to ones that don’t. Since the p-value 0.479 > 0.05, it is not a statistically significant predictor, and we might consider removing danger.
funny (coef: -3.54, p-val: 0.0000395): Year is expected to be 3.54 years lower for commercials that are considered funny (funny = TRUE) compared to ones that aren’t. Since 0.0000395 < 0.05, it is a statistically significant predictor, so we should keep this variable.
patriotic (coef: 2.19, p-val: 0.0342): Year is expected to be 2.19 years higher for commercials that have patriotic elements (patriotic = TRUE) compared to ones that don’t. Since 0.0342 < 0.05, it is a statistically significant predictor, so we should keep this variable.
show_product_quickly (coef: 0.796, p-val: 0.294): Year is expected to be 0.796 years higher for commercials that show the product quickly (show_product_quickly = TRUE) compared to ones that don’t. Since the p-value 0.294 > 0.05, it is not a statistically significant predictor, and we might consider removing show_product_quickly.
Overall, Super Bowl commercials have been getting less funny over time, while increasingly featuring celebrities and patriotic elements.
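As a quick sanity check on these conclusions, we can also look at the raw proportions by decade, without a model. A minimal sketch, assuming the characteristic columns are logical (or 0/1) so that mean() yields a proportion:
superbowl.tbl %>%
  group_by(decade = 10 * floor(year / 10)) %>%  # bucket the years into decades
  summarize(prop_funny = mean(funny),           # share of funny commercials per decade
            prop_celebrity = mean(celebrity))   # share of celebrity commercials per decade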
Consider the following dataset:
image.tbl <- read.csv("Data/cats_dogs.csv") %>% mutate(type = factor(type))
which contains images of 99 cats and 99 dogs, each represented as a 64 by 64 grayscale image. The first 4096 (\(64 \times 64\)) columns correspond to the pixels of the image, and the column type indicates whether the image is a cat or a dog.
The function plotImage allows us to display an image stored as a matrix. Notice that we can use it to plot the images in our dataset by selecting the appropriate row and columns and converting them into a matrix (see the examples below).
plotImage <- function(dat, size = 64){
  # Reshape the pixel vector into a size x size matrix (row-major order)
  imag <- matrix(dat, nrow = size, byrow = TRUE)
  # Display the matrix as a grayscale image
  image(imag, col = grey.colors(256))
}
# Plot the first image (a dog)
plotImage(as.matrix(image.tbl[1,1:4096]))
# Plot the 150th image (a cat)
plotImage(as.matrix(image.tbl[150,1:4096]))
Finally, let’s create training/testing datasets as well as the 10-fold cross-validation folds. Make sure to use these datasets in the following subsections.
set.seed(654321)
image.split <- initial_split(image.tbl, prop = 0.6)
image.train.tbl <- training(image.split)
image.test.tbl <- testing(image.split)
images.folds <- vfold_cv(image.train.tbl, v = 10)
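A quick check that the split keeps the two classes reasonably balanced:
image.train.tbl %>% count(type)  # cats vs. dogs in the training set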
Describe what a LASSO model is and what the penalty term (\(\lambda\)) is. Create a LASSO model that distinguishes between cats and dogs, selecting an optimal penalty using 10-fold cross-validation. Establish the accuracy and confusion matrix of the model using your testing dataset. Show two incorrectly classified images and their probabilities of being a dog or a cat assigned by the model.
LASSO models are used for regularization and variable selection. Instead of minimizing the residual sum of squares alone, we minimize

\[
\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|,
\]

where the penalty term \(\lambda \ge 0\) controls how strongly the coefficients are shrunk toward zero. The larger \(\lambda\) is, the more coefficients are set exactly to zero, which is why LASSO also performs variable selection; an optimal \(\lambda\) is typically chosen by cross-validation, as we do below.
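As an aside, the shrinkage can be seen directly with the glmnet package (a sketch, assuming glmnet is installed; this is separate from the tidymodels pipeline below):
library(glmnet)
x <- as.matrix(image.train.tbl[, 1:4096])                   # pixel predictors as a numeric matrix
y <- image.train.tbl$type                                   # cat/dog labels
lasso.path <- glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 selects the LASSO penalty
plot(lasso.path, xvar = "lambda")                           # coefficient paths shrink to zero as lambda grows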
image.model <-
logistic_reg(mixture = 1, penalty = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
image.recipe <-
recipe(formula = type ~ ., data = image.train.tbl) %>%
step_normalize(all_predictors())
image.wf <- workflow() %>%
add_recipe(image.recipe) %>%
add_model(image.model)
# penalty() ranges are on the log10 scale, i.e. lambda runs from 0.01 to 1
penalty.grid <- grid_regular(penalty(range = c(-2, 0)), levels = 20)
tune.res <- tune_grid(image.wf, resamples = images.folds, grid = penalty.grid)
show_best(tune.res, metric = "accuracy")
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0207 accuracy binary 0.848 10 0.0166 Preprocessor1_Model04
## 2 0.0428 accuracy binary 0.840 10 0.0225 Preprocessor1_Model07
## 3 0.0162 accuracy binary 0.831 10 0.0246 Preprocessor1_Model03
## 4 0.0264 accuracy binary 0.831 10 0.0172 Preprocessor1_Model05
## 5 0.0336 accuracy binary 0.831 10 0.0172 Preprocessor1_Model06
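We can also inspect the cross-validated accuracy across the whole penalty grid visually:
autoplot(tune.res)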
best.penalty <- select_by_one_std_err(tune.res, metric = "accuracy", desc(penalty))
image.final.wf <- finalize_workflow(image.wf, best.penalty)
image.final.fit <- fit(image.final.wf, data = image.train.tbl)
augment(image.final.fit, new_data = image.test.tbl) %>%
conf_mat(truth = type, estimate = .pred_class)
## Truth
## Prediction cat dog
## cat 32 15
## dog 5 28
augment(image.final.fit, new_data = image.test.tbl) %>%
accuracy(truth = type, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.75
test.pred.tbl <- augment(image.final.fit, new_data = image.test.tbl)
missed_class <- test.pred.tbl %>%
  filter(.pred_class != type)  # keep only the misclassified test images
plotImage(as.matrix(missed_class[1,1:4096]))
plotImage(as.matrix(missed_class[2,1:4096]))
missed_class %>% select(.pred_dog, .pred_cat) %>% head(2)
## # A tibble: 2 × 2
## .pred_dog .pred_cat
## <dbl> <dbl>
## 1 0.411 0.589
## 2 0.415 0.585
Both misclassified images are dogs that were classified as cats. For each, the model assigns probability about 0.41 to dog and 0.59 to cat. This makes sense: both probabilities are close to 1/2, so the model was not confident about these predictions.
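Since LASSO sets many coefficients exactly to zero, we can also count how many of the 4096 pixel predictors the final model actually uses. A sketch (tidy() on the extracted glmnet fit returns the coefficients at the fitted penalty):
image.final.fit %>%
  extract_fit_parsnip() %>%  # pull the fitted parsnip model out of the workflow
  tidy() %>%                 # one row per coefficient
  filter(estimate != 0) %>%  # keep only the pixels that survived the penalty
  nrow()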
Describe what a random forest model is and what the parameter mtry refers to in the ranger implementation of random forests. Create a random forest model that distinguishes between cats and dogs by selecting an optimal value of mtry using 10-fold cross-validation. Establish the accuracy and confusion matrix of the model using your testing dataset. Show two incorrectly classified images and their probabilities of being a dog or a cat assigned by the model.
Random forests are ensembles of decision trees. To de-correlate the trees, only a random subset of the variables is considered as split candidates at each split, so that the individual trees are less related to each other and averaging them reduces variance more effectively.
ranger is a fast implementation of random forests, and mtry refers to the number of predictors randomly sampled as split candidates at each split of each tree.
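For intuition, the tuned workflow below ultimately fits something like the following direct ranger call. This is only a sketch: mtry = 18 anticipates the value cross-validation selects further down, whereas ranger’s own classification default would be floor(sqrt(4096)) = 64:
library(ranger)
rf.sketch <- ranger(type ~ ., data = image.train.tbl,
                    mtry = 18,           # assumes the CV-selected value shown below
                    probability = TRUE)  # return class probabilities instead of hard labels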
forest.model <-
rand_forest(mtry = tune()) %>%
set_mode("classification") %>%
set_engine("ranger")
forest.recipe <- recipe(formula = type ~ ., data = image.train.tbl)
forest.wf <- workflow() %>%
  add_recipe(forest.recipe) %>%
  add_model(forest.model)  # mtry = tune() is already set in the model spec
mtry.grid <- grid_regular(mtry(range = c(1, 40)), levels = 10)
tune.res <-
  tune_grid(forest.wf, resamples = images.folds, grid = mtry.grid, metrics = metric_set(accuracy))
show_best(tune.res, metric = "accuracy")
## # A tibble: 5 × 7
## mtry .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 accuracy binary 0.839 10 0.0233 Preprocessor1_Model05
## 2 27 accuracy binary 0.839 10 0.0292 Preprocessor1_Model07
## 3 31 accuracy binary 0.839 10 0.0292 Preprocessor1_Model08
## 4 35 accuracy binary 0.839 10 0.0233 Preprocessor1_Model09
## 5 40 accuracy binary 0.831 10 0.0246 Preprocessor1_Model10
best.mtry <- select_best(tune.res, metric = "accuracy")
forest.final.wf <- finalize_workflow(forest.wf, best.mtry)
forest.final.fit <- fit(forest.final.wf, data = image.train.tbl)
augment(forest.final.fit, new_data = image.test.tbl) %>%
accuracy(truth = type, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.825
augment(forest.final.fit, new_data = image.test.tbl) %>%
conf_mat(truth = type, estimate = .pred_class)
## Truth
## Prediction cat dog
## cat 34 11
## dog 3 32
forest.pred.tbl <- augment(forest.final.fit, new_data = image.test.tbl)
forest_missed <- forest.pred.tbl %>%
  filter(.pred_class != type)  # misclassified test images
plotImage(as.matrix(forest_missed[1,1:4096]))
plotImage(as.matrix(forest_missed[2,1:4096]))
forest_missed %>% select(.pred_dog, .pred_cat) %>% head(2)
## # A tibble: 2 × 2
## .pred_dog .pred_cat
## <dbl> <dbl>
## 1 0.481 0.519
## 2 0.456 0.544
Again, both misclassified images are dogs classified as cats.
In the first image: prob(dog) = 0.481, prob(cat) = 0.519.
In the second image: prob(dog) = 0.456, prob(cat) = 0.544.
As with the LASSO model, both probabilities are close to 1/2, so the forest was not confident about these predictions.
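If we wanted to see which pixels the forest relies on most, one option is to refit with ranger’s impurity-based variable importance enabled. A sketch (importance is a ranger engine argument; mtry = 18 is the value selected above):
forest.imp.model <- rand_forest(mtry = 18) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")  # record impurity decrease per predictor
forest.imp.fit <- fit(forest.wf %>% update_model(forest.imp.model),
                      data = image.train.tbl)
# The ten pixels with the largest impurity decrease
head(sort(extract_fit_engine(forest.imp.fit)$variable.importance,
          decreasing = TRUE), 10)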
Describe in your own words what a boosting model is and what the learning rate is in a boosting model. Create a boosting model that distinguishes between cats and dogs by selecting an optimal learning rate using 10-fold cross-validation. Establish the accuracy and confusion matrix of the model. Show two incorrectly classified images and their probabilities of being a dog or a cat assigned by the model.
In boosting we apply “weak learners” repeatedly to obtain a good model: we build a large number of deliberately under-fitted models and combine them by adding them together, each one fit to the errors left by those before it.
Suppose we have data \(X\), a response variable \(y\), and a learning rate \(\lambda > 0\). We build a sequence of weak learners (small trees) \(\hat f_b\), \(b = 1, \dots, B\), as follows: start with \(\hat f(x) = 0\) and residuals \(r = y\); at each step \(b\), fit a small tree \(\hat f_b\) to the residuals \(r\), update the model \(\hat f(x) \leftarrow \hat f(x) + \lambda \hat f_b(x)\), and update the residuals \(r \leftarrow r - \lambda \hat f_b(x)\). The final model is \(\hat f(x) = \sum_{b=1}^{B} \lambda \hat f_b(x)\). The learning rate \(\lambda\) controls how much each weak learner contributes to the ensemble: smaller values learn more slowly and typically require more trees \(B\), but often generalize better.
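To make the procedure concrete, here is a minimal sketch of this loop for a regression setting, using rpart stumps as the weak learners (illustrative only; the xgboost engine used below implements a more elaborate version of the same idea):
library(rpart)
boost_sketch <- function(x, y, B = 100, lambda = 0.01) {
  r <- y                                 # residuals start as the response itself
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    dat <- data.frame(x, r = r)          # assumes x has no column named r
    trees[[b]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = 1))  # a stump: one split
    r <- r - lambda * predict(trees[[b]], dat)                  # shrink each update by lambda
  }
  trees  # the final prediction is the sum over b of lambda * trees[[b]](x)
}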
(boost.grid <- grid_regular(learn_rate(range = c(-2,0)), levels = 10))
## # A tibble: 10 × 1
## learn_rate
## <dbl>
## 1 0.01
## 2 0.0167
## 3 0.0278
## 4 0.0464
## 5 0.0774
## 6 0.129
## 7 0.215
## 8 0.359
## 9 0.599
## 10 1
boost_recipe <-
recipe(formula = type ~ ., data = image.train.tbl) %>%
step_zv(all_predictors())
boost_spec <-
boost_tree(learn_rate = tune()) %>%
set_mode("classification") %>%
set_engine("xgboost")
boost_workflow <-
workflow() %>%
add_recipe(boost_recipe) %>%
add_model(boost_spec)
boost_tune <- tune_grid(boost_workflow, resamples = images.folds, grid = boost.grid)
best.param <- select_best(boost_tune, metric = "accuracy")
boost.final.wf <- finalize_workflow(boost_workflow, best.param)
boost.final.fit <- fit(boost.final.wf, image.train.tbl)
augment(boost.final.fit, new_data = image.test.tbl) %>%
conf_mat(truth = type, estimate = .pred_class)
## Truth
## Prediction cat dog
## cat 29 15
## dog 8 28
augment(boost.final.fit, new_data = image.test.tbl) %>%
accuracy(truth = type, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.712
boost.pred.tbl <- augment(boost.final.fit, new_data = image.test.tbl)
boost_missed <- boost.pred.tbl %>%
  filter(.pred_class != type)  # misclassified test images
plotImage(as.matrix(boost_missed[1,1:4096]))
plotImage(as.matrix(boost_missed[2,1:4096]))
boost_missed %>% select(.pred_dog, .pred_cat) %>% head(2)
## # A tibble: 2 × 2
## .pred_dog .pred_cat
## <dbl> <dbl>
## 1 0.216 0.784
## 2 0.0714 0.929
These are both dogs misclassified as cats.
In the first image: prob(dog) = 0.216, prob(cat) = 0.784.
In the second image: prob(dog) = 0.0714, prob(cat) = 0.929.
Unlike the LASSO and random forest errors, the boosting model was fairly confident in these incorrect predictions.
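For reference, here are the test accuracies reported above side by side (values transcribed from the outputs, not recomputed):
tribble(
  ~model,          ~test_accuracy,
  "LASSO",          0.75,
  "random forest",  0.825,
  "boosting",       0.712
)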