Homework 4

set.seed(29923)

library(tidyverse)
library(data.table)
library(caret)
library(leaps)
library(glmnet)

About the Data

For this assignment, we will be analyzing data from users of Google Reviews. The file ratings.csv contains (lightly edited) information on the average ratings of thousands of users across a wide variety of categories. All of the user’s ratings were on a scale from 0 to 5, and these values were averaged by category. Each user’s averages for the categories appear in one row of the file. For more details, see http://archive.ics.uci.edu/ml/datasets/Tarvel+Review+Ratings#.

The data includes a variable called user that provides a unique identifier. The set variable divides the data into training and testing sets. Otherwise, all of the variables are categories of ratings.

Using these data, answer the following questions.

Question 1: Preparation and Summarization

1a: Creating an Outcome

For this study, we will be focused on the question of predicting the ratings of accommodations for travelers in terms of all of the other experiences available. Because travelers can either stay in resorts or in hotels_lodging, we will create an overall measure of satisfaction. Add a column to your data set named accommodations. This will be defined as the user’s average of their scores on resorts and hotels_lodging. Show the code for how you constructed the accommodations variable.

df <- read.csv('ratings.csv')
str(df)

'data.frame':   5456 obs. of  26 variables:
 $ user          : Factor w/ 5456 levels "User 1","User 10",..: 1 1112 2223 3334 4445 5013 5124 5235 5346 2 ...
 $ churches      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ resorts       : num  0 0 0 0.5 0 0 5 5 5 5 ...
 $ beaches       : num  3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
 $ parks         : num  3.65 3.65 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
 $ theaters      : num  5 5 5 5 5 5 5 5 5 5 ...
 $ museums       : num  2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 ...
 $ malls         : num  5 5 5 5 5 5 3.03 5 3.03 5 ...
 $ zoo           : num  2.35 2.64 2.64 2.35 2.64 2.63 2.35 2.63 2.62 2.35 ...
 $ restaurants   : num  2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.32 2.32 ...
 $ bars_pubs     : num  2.64 2.65 2.64 2.64 2.64 2.65 2.64 2.64 2.63 2.63 ...
 $ local_services: num  1.7 1.7 1.7 1.73 1.7 1.71 1.73 1.7 1.71 1.69 ...
 $ burger_pizza  : num  1.69 1.69 1.69 1.69 1.69 1.69 1.68 1.68 1.67 1.67 ...
 $ hotels_lodging: num  1.7 1.7 1.7 1.7 1.7 1.69 1.69 1.69 1.68 1.67 ...
 $ juice_bars    : num  1.72 1.72 1.72 1.72 1.72 1.72 1.71 1.71 1.7 1.7 ...
 $ art_galleries : num  1.74 1.74 1.74 1.74 1.74 1.74 1.75 1.74 0.75 0.74 ...
 $ dance_clubs   : num  0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.6 0.6 0.59 ...
 $ swimming_pools: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 ...
 $ gyms          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ bakeries      : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
 $ beauty_spas   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cafes         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ view_points   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monuments     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ gardens       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ set           : Factor w/ 2 levels "test","train": 2 2 2 2 2 2 2 2 1 2 ...

# Average each user's resorts and hotels_lodging ratings
df$accommodations <- (df$resorts + df$hotels_lodging) / 2
head(df)

    user churches resorts beaches parks theaters museums malls  zoo
1 User 1        0     0.0    3.63  3.65        5    2.92     5 2.35
2 User 2        0     0.0    3.63  3.65        5    2.92     5 2.64
3 User 3        0     0.0    3.63  3.63        5    2.92     5 2.64
4 User 4        0     0.5    3.63  3.63        5    2.92     5 2.35
5 User 5        0     0.0    3.63  3.63        5    2.92     5 2.64
6 User 6        0     0.0    3.63  3.63        5    2.92     5 2.63
  restaurants bars_pubs local_services burger_pizza hotels_lodging
1        2.33      2.64           1.70         1.69           1.70
2        2.33      2.65           1.70         1.69           1.70
3        2.33      2.64           1.70         1.69           1.70
4        2.33      2.64           1.73         1.69           1.70
5        2.33      2.64           1.70         1.69           1.70
6        2.33      2.65           1.71         1.69           1.69
  juice_bars art_galleries dance_clubs swimming_pools gyms bakeries
1       1.72          1.74        0.59            0.5    0      0.5
2       1.72          1.74        0.59            0.5    0      0.5
3       1.72          1.74        0.59            0.5    0      0.5
4       1.72          1.74        0.59            0.5    0      0.5
5       1.72          1.74        0.59            0.5    0      0.5
6       1.72          1.74        0.59            0.5    0      0.5
  beauty_spas cafes view_points monuments gardens   set accommodations
1           0     0           0         0       0 train          0.850
2           0     0           0         0       0 train          0.850
3           0     0           0         0       0 train          0.850
4           0     0           0         0       0 train          1.100
5           0     0           0         0       0 train          0.850
6           0     0           0         0       0 train          0.845

1b: Summarization

For each category of rating, including the newly created accommodations variable, show the average and the standard deviation of the recorded values on the training set. Show the results in a table. Round your answers to a reasonable number of decimal places.

category <- c('hotels_lodging','resorts','accommodations','churches','beaches','parks','theaters','museums','malls', 'zoo','restaurants','bars_pubs','local_services', 'burger_pizza' , 'juice_bars' ,'art_galleries','dance_clubs','swimming_pools', 'gyms' , 'bakeries' ,'beauty_spas' ,'cafes',  'view_points',  'monuments','gardens')

train <- df[df['set'] == 'train', category]
test <- df[df['set'] == 'test', category]

avg_sd <- function(x) {
  c(mean = round(mean(x, na.rm = TRUE),2), 
    sd = round(sd(x,na.rm = TRUE),2)
  )
}

sapply(train,FUN = avg_sd)

     hotels_lodging resorts accommodations churches beaches parks theaters
mean           2.12    2.31           2.21     1.46    2.49  2.80     2.96
sd             1.41    1.41           0.89     0.82    1.24  1.31     1.34
     museums malls  zoo restaurants bars_pubs local_services burger_pizza
mean    2.90  3.34 2.54        3.13      2.84           2.55         2.07
sd      1.28  1.41 1.12        1.36      1.31           1.39         1.25
     juice_bars art_galleries dance_clubs swimming_pools gyms bakeries
mean       2.18          2.21        1.19           0.95 0.82     0.98
sd         1.57          1.72        1.09           0.97 0.94     1.22
     beauty_spas cafes view_points monuments gardens
mean        0.99  0.96        1.76      1.54    1.57
sd          1.18  0.93        1.60      1.32    1.17

as.data.table(sapply(train,FUN = avg_sd), keep.rownames = "measure")

   measure hotels_lodging resorts accommodations churches beaches parks
1:    mean           2.12    2.31           2.21     1.46    2.49  2.80
2:      sd           1.41    1.41           0.89     0.82    1.24  1.31
   theaters museums malls  zoo restaurants bars_pubs local_services
1:     2.96    2.90  3.34 2.54        3.13      2.84           2.55
2:     1.34    1.28  1.41 1.12        1.36      1.31           1.39
   burger_pizza juice_bars art_galleries dance_clubs swimming_pools gyms
1:         2.07       2.18          2.21        1.19           0.95 0.82
2:         1.25       1.57          1.72        1.09           0.97 0.94
   bakeries beauty_spas cafes view_points monuments gardens
1:     0.98        0.99  0.96        1.76      1.54    1.57
2:     1.22        1.18  0.93        1.60      1.32    1.17
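With 25 categories the wide table wraps across several lines. A minimal sketch of an alternative layout, using a small made-up data frame (toy and its values are hypothetical, not taken from ratings.csv), transposes the result so that each category is one row:

```r
# Toy ratings with made-up values; the real train data frame has 25 columns.
toy <- data.frame(
  churches = c(0, 1.5, 2, 3.5),
  beaches  = c(3.63, 3.64, 2.5, 4),
  parks    = c(3.65, 3.63, 1, 5)
)

# Same idea as the helper above, but rounding is left to the final display step.
avg_sd <- function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# sapply() returns a 2 x k matrix; t() turns it into one row per category.
summary_tab <- round(t(sapply(toy, avg_sd)), 2)
summary_tab
```

The same call on the training data would give a 25-row table with columns mean and sd.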

Question 2: Linear Regression

2a

Use the training data to create a linear regression model for the accommodations outcome. The predictor variables should include every rating variable except for resorts and hotels_lodging. No other predictors should be used. Build the model and display a summary of the coefficients. Show a summary of the resulting model’s coefficients, rounded to a reasonable number of digits.

# Drop the first two columns (hotels_lodging and resorts) from the train and test sets.
# user and set were already excluded when the columns in category were selected.

train <- train[-c(1,2)]
test <- test[-c(1,2)]

fit.ols <- lm(accommodations ~ ., data = train)

coef.ols <- round(fit.ols$coefficients,2)
coef.ols

   (Intercept)       churches        beaches          parks       theaters 
          0.44           0.19           0.12           0.04           0.11 
       museums          malls            zoo    restaurants      bars_pubs 
         -0.05          -0.04           0.04           0.06           0.00 
local_services   burger_pizza     juice_bars  art_galleries    dance_clubs 
          0.01           0.14           0.18           0.03          -0.01 
swimming_pools           gyms       bakeries    beauty_spas          cafes 
         -0.07           0.04           0.05           0.04          -0.02 
   view_points      monuments        gardens 
         -0.06           0.00           0.03
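The question asks for a summary of the coefficients, which usually means the estimates together with their standard errors, t values, and p-values. A minimal sketch of how that table can be extracted and rounded, shown on the built-in mtcars data as a stand-in for the ratings data (the formula is illustrative only):

```r
# Stand-in model on a built-in dataset; replace with the ratings model as needed.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# summary()$coefficients is a matrix with one row per term and four columns:
# Estimate, Std. Error, t value, Pr(>|t|).
coef_tab <- round(summary(fit)$coefficients, 3)
coef_tab
```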

2b

Based on the linear model’s results, which categories are associated with an increase in the average ratings for accommodations in a statistically significant way? Display the summary of the linear model’s coefficients for this set of variables. This table should be sorted in order of the effect size (the estimated coefficient) to show the strongest effects first.

#extract p.values of the lm model
p.values.ols <- summary(fit.ols)$coefficients[, 4]

#subset variables at the 0.05 significance level
coef.select <- coef.ols[p.values.ols < 0.05]
coef.select

   (Intercept)       churches        beaches          parks       theaters 
          0.44           0.19           0.12           0.04           0.11 
       museums          malls            zoo    restaurants   burger_pizza 
         -0.05          -0.04           0.04           0.06           0.14 
    juice_bars  art_galleries swimming_pools           gyms       bakeries 
          0.18           0.03          -0.07           0.04           0.05 
   beauty_spas    view_points        gardens 
          0.04          -0.06           0.03

#select those whose coefficients are positive, and sort in descending order
sort(coef.select[coef.select>0], decreasing = TRUE)

  (Intercept)      churches    juice_bars  burger_pizza       beaches 
         0.44          0.19          0.18          0.14          0.12 
     theaters   restaurants      bakeries         parks           zoo 
         0.11          0.06          0.05          0.04          0.04 
         gyms   beauty_spas art_galleries       gardens 
         0.04          0.04          0.03          0.03
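The intercept is not a rating category, so it can be left out of the answer. A minimal sketch of one way to filter the full coefficient table by significance and sign while dropping the intercept, again on mtcars as a stand-in (the column names estimate, std_error, t_value, and p_value are chosen here for convenience):

```r
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table as a data frame with easier column names.
tab <- as.data.frame(summary(fit)$coefficients)
names(tab) <- c("estimate", "std_error", "t_value", "p_value")

# Drop the intercept, keep significant positive effects, strongest first.
tab <- tab[rownames(tab) != "(Intercept)", ]
sig_pos <- tab[tab$p_value < 0.05 & tab$estimate > 0, ]
sig_pos <- sig_pos[order(sig_pos$estimate, decreasing = TRUE), ]
round(sig_pos, 3)
```

Flipping the sign condition to `tab$estimate < 0` and sorting in increasing order gives the corresponding table of negative effects.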

2c

Which categories are associated with a decrease in the average ratings for accommodations in a statistically significant way? Display the summary of the linear model’s coefficients for this set of variables. This table should be sorted in order of the effect size (the estimated coefficient) to show the strongest effects first.

sort(coef.select[coef.select< 0], decreasing = FALSE)

swimming_pools    view_points        museums          malls 
         -0.07          -0.06          -0.05          -0.04

2d

Based on the linear model’s results, which categories did not show statistically significant relationships with the accommodations?

coef.unselect <- coef.ols[p.values.ols >= 0.05]
coef.unselect

     bars_pubs local_services    dance_clubs          cafes      monuments 
          0.00           0.01          -0.01          -0.02           0.00

2e

Using the root mean squared error (RMSE) as a metric, how accurate is the linear model in terms of predicting the ratings for accommodations on the testing set?

pred.ols <- predict(fit.ols, newdata = test)
# Square each error before averaging, then take the square root
rmse.ols <- sqrt(mean((pred.ols - test$accommodations)^2))
rmse.ols

[1] 0.03165383
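RMSE squares each prediction error before averaging, so the placement of the parentheses matters. A minimal sketch with made-up numbers (rmse is a helper defined here, not a function from any of the loaded packages) shows how squaring the mean error instead lets positive and negative errors cancel:

```r
# RMSE: square root of the mean of the squared errors.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

actual <- c(1, 2, 3, 4)
pred   <- c(2, 1, 4, 3)        # every prediction is off by exactly 1

rmse(pred, actual)             # 1
sqrt(mean(pred - actual)^2)    # 0, because the errors +1 and -1 cancel
```

The second expression is the absolute value of the mean error (the bias), which can be near zero even when individual predictions are far off.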

Question 3: Selection Procedures

3a

Use forward stepwise regression to create a separate linear regression model of accommodations on the training set. The procedure should start with a model that only includes an intercept and allow the model to grow as large as one including all of the predictors used in Question 2. Show a summary of the resulting model’s coefficients, rounded to a reasonable number of digits.

Note: The capture.output function can be used to prevent R from printing out all of the intermediate calculations performed in stepwise regression. You are not required to use this method, but it will help you to create reports that maintain good readability while using methods like this.

empty.mod <- lm(accommodations ~ 1, data = train)
full.mod <- lm(accommodations ~ ., data = train)

output.forward <- capture.output(forwardStepwise <- step(empty.mod,scope=list(upper=full.mod,lower=empty.mod),direction='forward'))

round(forwardStepwise$coefficients,2)

   (Intercept)     juice_bars        beaches   burger_pizza       churches 
          0.43           0.18           0.12           0.14           0.19 
      theaters    view_points    beauty_spas    restaurants        museums 
          0.11          -0.06           0.04           0.06          -0.05 
      bakeries swimming_pools          parks          malls            zoo 
          0.05          -0.08           0.04          -0.04           0.04 
 art_galleries           gyms        gardens          cafes 
          0.03           0.04           0.03          -0.03
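As an alternative to capture.output, step also accepts trace = 0, which suppresses the printing of intermediate models. A minimal sketch on the built-in mtcars data as a stand-in for the ratings data (the predictors are illustrative only):

```r
# Smallest and largest models that the search may consider.
empty <- lm(mpg ~ 1, data = mtcars)
full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward selection by AIC (the default criterion), with no printed trace.
fwd <- step(
  empty,
  scope = list(lower = formula(empty), upper = formula(full)),
  direction = "forward",
  trace = 0
)

round(coef(fwd), 2)
```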

3b

Use backward stepwise regression to create a separate linear regression model of accommodations on the training set. The procedure should start with the full model you built in Question 2 while allowing the model to become as small as one that only includes an intercept. Show a summary of the resulting model’s coefficients, rounded to a reasonable number of digits.

output.backward <- capture.output(backwardStepwise <- step(full.mod, scope = list(upper = full.mod, lower = empty.mod), direction='backward'))

round(backwardStepwise$coefficients,2)

   (Intercept)       churches        beaches          parks       theaters 
          0.43           0.19           0.12           0.04           0.11 
       museums          malls            zoo    restaurants   burger_pizza 
         -0.05          -0.04           0.04           0.06           0.14 
    juice_bars  art_galleries swimming_pools           gyms       bakeries 
          0.18           0.03          -0.08           0.04           0.05 
   beauty_spas          cafes    view_points        gardens 
          0.04          -0.03          -0.06           0.03

3c

Describe the similarities and differences in the results obtained by forward and backward stepwise selection.

Forward and backward stepwise selection are both greedy feature selection algorithms that look only one step ahead when searching for the best subset. Each approach searches through only 1+p(p+1)/2 models rather than all 2^p possible subsets, so neither is guaranteed to find the best subset. Both rely on a criterion such as AIC, BIC, or adjusted R^2 to choose the single best model; the step function used here compares models by AIC by default.

Forward stepwise selection starts from the empty model and adds one predictor at each step, while backward stepwise selection does the opposite, starting from the full model and removing one predictor at each step. Here both procedures arrive at the same model: the same 18 predictors with the same rounded coefficients, leaving out bars_pubs, local_services, dance_clubs, and monuments. Backward stepwise selection requires the sample size n to be larger than the number of predictors p so that the full model can be fit. In contrast, forward stepwise can be used even when n<p, and so is the only viable subset method when p is very large.
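To make the size of the search concrete, the count 1+p(p+1)/2 can be compared with the 2^p models that an exhaustive best-subset search would fit. A short sketch of the arithmetic for the p = 22 predictors used here:

```r
p <- 22                                   # number of candidate predictors

stepwise_models <- 1 + p * (p + 1) / 2    # models visited by one stepwise pass
all_subsets     <- 2^p                    # models in an exhaustive search

c(stepwise = stepwise_models, best_subset = all_subsets)
# stepwise = 254, best_subset = 4194304
```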

3d

Use the results from the forward selection and backward selection models to make predictions on the testing set. Calculate the RMSE of each set of predictions. Show the RMSE of linear regression, forward selection, and backward selection in a table. Round the results to a reasonable number of digits.

pred.forward <- predict(forwardStepwise, newdata=test)
pred.backward <- predict(backwardStepwise, newdata=test)

# Square each error before averaging, then take the square root
rmse.forward <- sqrt(mean((pred.forward - test$accommodations)^2))
rmse.backward <- sqrt(mean((pred.backward - test$accommodations)^2))

round(data.table(rmse.ols, rmse.forward, rmse.backward),4)

   rmse.ols rmse.forward rmse.backward
1:   0.0317       0.0313        0.0313

Question 4: Regularized Regression

4a

Use ridge regression to create a model of accommodations on the training set. The model should include the same predictors used to build the linear regression above. Display the model’s coefficients, rounded to a reasonable number of digits.

This can be implemented using the glmnet function in the glmnet package. Note that ridge regression is specified when alpha = 0.

x <- data.matrix(train[, 2:23])
y <- train$accommodations
fit.ridge <- cv.glmnet(x, y, alpha = 0)
round(coef(fit.ridge),2)

23 x 1 sparse Matrix of class "dgCMatrix"
                   1
(Intercept)     1.02
churches        0.13
beaches         0.09
parks           0.03
theaters        0.07
museums        -0.03
malls          -0.02
zoo             0.02
restaurants     0.02
bars_pubs       0.00
local_services  0.01
burger_pizza    0.11
juice_bars      0.13
art_galleries   0.03
dance_clubs    -0.02
swimming_pools -0.05
gyms            0.02
bakeries        0.04
beauty_spas     0.03
cafes          -0.03
view_points    -0.04
monuments       0.00
gardens         0.02
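When coef is called on a cv.glmnet fit without the s argument, it reports coefficients at lambda.1se, the largest penalty whose cross-validated error is within one standard error of the minimum, rather than at lambda.min. A minimal sketch on simulated data (x, y, and the true coefficients are made up for illustration) shows how to inspect both choices:

```r
library(glmnet)

set.seed(1)                                  # cross-validation folds are random
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)              # simulated predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(n)          # simulated outcome

cv_ridge <- cv.glmnet(x, y, alpha = 0)       # alpha = 0 gives ridge regression

# The two penalties that cross-validation singles out.
c(lambda.min = cv_ridge$lambda.min, lambda.1se = cv_ridge$lambda.1se)

# Coefficients at each penalty, side by side.
coef_1se <- coef(cv_ridge, s = "lambda.1se")
coef_min <- coef(cv_ridge, s = "lambda.min")
round(cbind(coef_1se, coef_min), 2)
```

Because lambda.1se is the heavier penalty, its coefficients are typically shrunk further toward zero than those at lambda.min.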

4b

Use lasso regression to create a model of accommodations on the training set. The model should include the same predictors used to build the linear regression above. Display the model’s coefficients, rounded to a reasonable number of digits.

This can be implemented using the glmnet function in the glmnet package. Note that lasso regression is specified when alpha = 1.

fit.lasso <- cv.glmnet(x, y, alpha = 1 )
round(coef(fit.lasso),2)

23 x 1 sparse Matrix of class "dgCMatrix"
                   1
(Intercept)     0.92
churches        0.16
beaches         0.10
parks           0.02
theaters        0.07
museums        -0.02
malls           0.00
zoo             0.01
restaurants     0.02
bars_pubs       .   
local_services  .   
burger_pizza    0.13
juice_bars      0.16
art_galleries   0.01
dance_clubs     .   
swimming_pools -0.04
gyms            .   
bakeries        0.03
beauty_spas     0.02
cafes           .   
view_points    -0.03
monuments       .   
gardens         .
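In the printout above, a dot marks a coefficient that the lasso has set exactly to zero, whereas 0.00 is a small nonzero value that rounds to zero. A minimal sketch on simulated data (all names and values are made up for illustration) lists which predictors a lasso fit keeps and which it drops:

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 8
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1.5 * x[, "x1"] - x[, "x2"] + rnorm(n)  # only x1 and x2 matter by construction

cv_lasso <- cv.glmnet(x, y, alpha = 1)       # alpha = 1 gives the lasso

# Convert the sparse coefficient column to a dense matrix for easy indexing.
b <- as.matrix(coef(cv_lasso))

kept    <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
dropped <- setdiff(colnames(x), kept)
list(kept = kept, dropped = dropped)
```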

4c

Use the ridge and lasso regression models to generate predictions on the testing set. Compute the RMSE for each set of predictions. Add these values to the table of RMSE values that includes those for linear regression and the stepwise procedures. Round the table to a reasonable number of digits.

pred.lasso <- predict(fit.lasso, data.matrix(test[,2:23]))
pred.ridge <- predict(fit.ridge, data.matrix(test[,2:23]))
# Square each error before averaging, then take the square root
rmse.lasso <- sqrt(mean((pred.lasso - test$accommodations)^2))
rmse.ridge <- sqrt(mean((pred.ridge - test$accommodations)^2))

round(data.table(rmse.ols, rmse.forward, rmse.backward, rmse.ridge, rmse.lasso),4)

   rmse.ols rmse.forward rmse.backward rmse.ridge rmse.lasso
1:   0.0317       0.0313        0.0313      0.033     0.0279

4d

Comment on the results. Were the results of the models reasonably similar or quite different? What is the reason for this?

The results of the models are reasonably similar to each other, mainly because the number of observations (4,367 training rows) is much larger than the number of features (22). Least squares regression is well suited to a low-dimensional setting like this, so the resulting linear model already performs reasonably well on the test set, and selection or regularization changes little.

Question 5

How would the results for the regularized methods (ridge and lasso) have changed if we had utilized less data in the training set? We will explore this question in the following parts.

5a

Create a reduced training set that only contains the first 250 rows of the training data. Then fit a ridge regression model on this reduced training set with a similar specification to the earlier model. Display the coefficients of the model, rounded to a reasonable number of decimal places.

dim(train)

[1] 4367   23

train.reduced <- train[1:250,]

fit.ridge1 <- cv.glmnet(data.matrix(train.reduced[, 2:23]), train.reduced$accommodations, alpha = 0 )
round(coef(fit.ridge1),2)

23 x 1 sparse Matrix of class "dgCMatrix"
                   1
(Intercept)     1.20
churches        0.11
beaches         0.01
parks          -0.03
theaters       -0.03
museums        -0.10
malls           0.03
zoo             0.03
restaurants     0.00
bars_pubs       0.02
local_services  0.04
burger_pizza    0.01
juice_bars      0.11
art_galleries   0.01
dance_clubs     0.04
swimming_pools  0.03
gyms            0.12
bakeries        0.07
beauty_spas     0.06
cafes           0.02
view_points     0.02
monuments       0.02
gardens         0.10

5b

Now fit a lasso regression model on this reduced training set with a similar specification to the earlier model. Display the coefficients of the model, rounded to a reasonable number of decimal places.

fit.lasso1 <- cv.glmnet(data.matrix(train.reduced[, 2:23]), train.reduced$accommodations, alpha = 1 )
round(coef(fit.lasso1),2)

23 x 1 sparse Matrix of class "dgCMatrix"
                   1
(Intercept)     1.59
churches        0.25
beaches         .   
parks           .   
theaters        .   
museums        -0.14
malls           .   
zoo             .   
restaurants     .   
bars_pubs       .   
local_services  .   
burger_pizza    .   
juice_bars      0.15
art_galleries   .   
dance_clubs     .   
swimming_pools  .   
gyms            .   
bakeries        .   
beauty_spas     .   
cafes           .   
view_points     .   
monuments       .   
gardens         0.14

5c

How different are the coefficients for the full and reduced ridge regression models?

cbind(round(coef(fit.ridge),2),  round(coef(fit.ridge1),2))

23 x 2 sparse Matrix of class "dgCMatrix"
                   1     1
(Intercept)     1.02  1.20
churches        0.13  0.11
beaches         0.09  0.01
parks           0.03 -0.03
theaters        0.07 -0.03
museums        -0.03 -0.10
malls          -0.02  0.03
zoo             0.02  0.03
restaurants     0.02  0.00
bars_pubs       0.00  0.02
local_services  0.01  0.04
burger_pizza    0.11  0.01
juice_bars      0.13  0.11
art_galleries   0.03  0.01
dance_clubs    -0.02  0.04
swimming_pools -0.05  0.03
gyms            0.02  0.12
bakeries        0.04  0.07
beauty_spas     0.03  0.06
cafes          -0.03  0.02
view_points    -0.04  0.02
monuments       0.00  0.02
gardens         0.02  0.10

The two sets of ridge coefficients differ considerably. Seven coefficients change sign between the full and reduced fits (parks, theaters, malls, dance_clubs, swimming_pools, cafes, and view_points). beaches and burger_pizza shrink from about 0.1 to 0.01, while gyms and gardens grow from 0.02 to 0.12 and 0.10. Only churches and juice_bars stay roughly the same.

5d

How different are the coefficients for the full and reduced lasso regression models?

cbind(round(coef(fit.lasso),2), round(coef(fit.lasso1),2))

23 x 2 sparse Matrix of class "dgCMatrix"
                   1     1
(Intercept)     0.92  1.59
churches        0.16  0.25
beaches         0.10  .   
parks           0.02  .   
theaters        0.07  .   
museums        -0.02 -0.14
malls           0.00  .   
zoo             0.01  .   
restaurants     0.02  .   
bars_pubs       .     .   
local_services  .     .   
burger_pizza    0.13  .   
juice_bars      0.16  0.15
art_galleries   0.01  .   
dance_clubs     .     .   
swimming_pools -0.04  .   
gyms            .     .   
bakeries        0.03  .   
beauty_spas     0.02  .   
cafes           .     .   
view_points    -0.03  .   
monuments       .     .   
gardens         .     0.14

The lasso models differ even more. The full-data model keeps 15 of the 22 predictors, while the reduced-data model keeps only four: churches, museums, juice_bars, and gardens. Of these, gardens is dropped by the full-data model, and the coefficients on churches and museums are larger in magnitude in the reduced fit.

5e

Use the ridge and lasso regression models that were fit on the reduced training set to generate predictions on the full testing set. Compute the RMSE for each set of predictions. Add these values to the table of RMSE values that include all of the earlier RMSE results. Round the table to a reasonable number of digits.

pred.ridge1 <- predict(fit.ridge1, data.matrix(test[,2:23]))
pred.lasso1 <- predict(fit.lasso1, data.matrix(test[,2:23]))

# Square each error before averaging, then take the square root
rmse.ridge1 <- sqrt(mean((pred.ridge1 - test$accommodations)^2))
rmse.lasso1 <- sqrt(mean((pred.lasso1 - test$accommodations)^2))

round(data.table(rmse.ols, rmse.forward, rmse.backward, rmse.ridge, rmse.lasso, rmse.ridge1, rmse.lasso1),4)

   rmse.ols rmse.forward rmse.backward rmse.ridge rmse.lasso rmse.ridge1
1:   0.0317       0.0313        0.0313      0.033     0.0279      0.1872
   rmse.lasso1
1:      0.1417

5f

What conclusions can you draw about the usage of selection procedures and regularization methods based upon this work?

Subset selection procedures increase model interpretability by automatically performing feature selection in a multiple regression model. To fit the backward stepwise method, the number of observations n should be larger than the number of predictors p. In contrast, forward stepwise can be used even when n < p.

Regularization methods can yield better prediction accuracy than least squares in certain settings. If n is much larger than p, the least squares estimates tend to have low variance and hence perform well on test observations, which is why all of the full-data models here gave similar results. However, if n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on the test set. When p exceeds n, least squares no longer has a unique solution and its variance is infinite, so it cannot be used at all. Regularization methods accept a small increase in bias in exchange for lower variance and can still perform well on unseen data in these settings. They do not remove the need for data, though: in the table above, the ridge and lasso models fit on only 250 rows had noticeably larger test error, and less stable coefficients, than their full-data counterparts.