summary(daylm)
## 
## Call:
## glm(formula = total ~ yr * mnth, data = daytots)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6392.2   -391.7    122.3    629.0   3043.5  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1231.90     182.04   6.767 2.76e-11 ***
## yr           1888.87     257.45   7.337 6.01e-13 ***
## mnth2         490.06     264.25   1.855 0.064081 .  
## mnth3         834.06     257.45   3.240 0.001252 ** 
## mnth4        1930.43     259.58   7.437 3.00e-13 ***
## mnth5        3149.42     257.45  12.233  < 2e-16 ***
## mnth6        3551.83     259.58  13.683  < 2e-16 ***
## mnth7        3327.48     257.45  12.925  < 2e-16 ***
## mnth8        3177.48     257.45  12.342  < 2e-16 ***
## mnth9        3015.36     259.58  11.616  < 2e-16 ***
## mnth10       2752.32     257.45  10.691  < 2e-16 ***
## mnth11       2173.66     259.58   8.374 2.99e-16 ***
## mnth12       1584.97     257.45   6.156 1.25e-09 ***
## yr:mnth2      -54.39     372.01  -0.146 0.883808    
## yr:mnth3     1363.71     364.08   3.746 0.000195 ***
## yr:mnth4      756.26     367.11   2.060 0.039758 *  
## yr:mnth5       48.03     364.08   0.132 0.895080    
## yr:mnth6       88.40     367.11   0.241 0.809787    
## yr:mnth7      119.71     364.08   0.329 0.742407    
## yr:mnth8      621.19     364.08   1.706 0.088414 .  
## yr:mnth9     1149.63     367.11   3.132 0.001810 ** 
## yr:mnth10     541.13     364.08   1.486 0.137653    
## yr:mnth11    -205.64     367.11  -0.560 0.575549    
## yr:mnth12    -715.00     364.08  -1.964 0.049941 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1027322)
## 
##     Null deviance: 2739535392  on 730  degrees of freedom
## Residual deviance:  726316624  on 707  degrees of freedom
## AIC: 12219
## 
## Number of Fisher Scoring iterations: 2

1.1 What are the in-sample SSE and R2 for this regression?

\(R^2\) in terms of deviance is \[ R^2 = \frac{Deviance_0 - Deviance_R}{Deviance_0} = 1 - \frac{Deviance_R}{Deviance_0},\] where the subscript 0 refers to the null deviance and the subscript R refers to the residual deviance. An equivalent formula is \(R^2 = 1 - SSE/SST\), since for the Gaussian family the residual deviance is the SSE and the null deviance is the total sum of squares.

For this regression, the null deviance is 2739535392 and the residual deviance is 726316624.

The sum of squared errors (SSE) is the sum of each observation's deviation from its fitted value, squared: \[SSE = \Sigma_{i=1}^n(y_i-\hat y_i)^2\] For the Gaussian family this is exactly the residual deviance.

library(rsq)

rsq <- rsq(daylm)
print(paste0("R-squared: ", rsq))
## [1] "R-squared: 0.734875984363624"
###R-squared in terms of deviance --> R2 = (Null_deviance - Residual_Deviance)/ Null Deviance

dev_r2 = 1-  (daylm$deviance/daylm$null.deviance)
print(paste0("R-squared from deviance: ", dev_r2))
## [1] "R-squared from deviance: 0.734875984363624"
# note: rsq.sse() returns the SSE-based (adjusted) R-squared, not the SSE itself
sse <- rsq.sse(daylm, adj = TRUE)
print(paste0("SSE: ", sse))
## [1] "SSE: 0.726251016386769"

1.2
Write out the mathematical formula for daylm and describe it in words. Make sure to describe the probability model that is implied by the deviance we’ve minimized. Do you have any criticisms of this model?

\[\mathbb{E}[Total\ |\ Year,\ Month] = \beta_0 + \beta_{yr}Year + \sum_{m=2}^{12}\beta_m\mathbb{1}[Month = m] + \sum_{m=2}^{12}\gamma_m\,Year\cdot\mathbb{1}[Month = m]\]

The probability model implied by the minimized deviance is:

\[ y \sim \mathcal{N}(x^\prime\beta, \sigma^2 ) \] with density \[ \mathcal{N}(x^\prime\beta, \sigma^2 ) = \frac {\exp\left[\frac{-(y-x^\prime\beta)^2}{2\sigma^2}\right]}{\sqrt{2\pi\sigma^2}}\] Minimizing this deviance is equivalent to least squares.
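
Concretely, for the Gaussian family the deviance is (up to constants) the negative log likelihood, \[ -2\log\mathcal{L}(\beta, \sigma^2) = n\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - x_i^\prime\beta)^2, \] so for any fixed \(\sigma^2\) the \(\beta\) that minimizes the deviance is exactly the least-squares estimate \(\hat\beta = \arg\min_\beta \sum_i (y_i - x_i^\prime\beta)^2\).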

This regression tells us that the expected value of the total variable (the total number of bikes rented in a day) is a function of the year and the month. The model also includes an interaction term to capture the month-by-year effect. For example, the interaction lets April of year 1 and April of year 2 have different expected totals, instead of just a main effect for April plus a main effect for the year. Mathematically, with an interaction the effect on \(\mathbb{E}[y]\) of a unit increase in \(x_j\) is \(\beta_j + \gamma_{jk}x_k\) rather than just \(\beta_j\).

This model is limited in its ability to reflect reality. First, it only captures the month and the year, not the day. Second, it does not account for covariates like weather, which are included in the data set. Other data that would help the model describe reality include proximity to public transit, income levels, or house values around the bike stations.


1.3
A standardized residual for response \(y\) and prediction \(\hat y\) is \(r_i = \frac{y_i - \hat y_i}{\hat\sigma}\), where \(\hat\sigma\) is the estimated standard deviation of the residuals \(y - \hat y\). Calculate the standardized residuals for daylm.

standard_residuals = (daylm$residuals) / sd(daylm$residuals)

Now, we’ll call the outlier p-value \(2 \times p(Z < -|r_i|)\), where \(Z \sim N(0, 1)\). In R, this is 2*pnorm(-abs(std_resids)).

Calculate these p-values.

outlier_pvals = 2*(pnorm(-1 * abs(standard_residuals)))

Describe what null hypothesis distribution they correspond to and why small values indicate a possible outlier day.

These p-values correspond to a null hypothesis under which each standardized residual is drawn from a standard normal distribution, i.e. the day is not an outlier. Under this null, the p-values themselves are uniformly distributed between 0 and 1: a null p-value has a 5% chance of being less than .05 and a 10% chance of being less than .1.

Small p-values indicate a possible outlier day because a residual that extreme would be very unlikely to occur under the null model, so the day is more plausibly an outlier.
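
As a quick sanity check of the uniform-null claim, we can simulate standardized residuals that really are standard normal and apply the same two-sided p-value formula; the result should look flat on [0, 1] (an illustrative sketch, separate from the bike data):

set.seed(0)
z <- rnorm(10000)                 # residuals drawn from the null N(0, 1)
null_pvals <- 2*pnorm(-abs(z))    # same two-sided p-value formula as above
hist(null_pvals, breaks = 20)     # approximately uniform on [0, 1]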

df = data.frame(std_resid = standard_residuals, 
                pvals = outlier_pvals)

df$padjust = p.adjust(df$pvals, method = "BH")


new_df = cbind(daytots, df)

#order df on original pvals
new_df = arrange(new_df, pvals)
#rank p-values
new_df$rank = 1:length(df$pvals)

head(new_df)
##       dteday yr mnth total std_resid        pvals      padjust rank
## 1 2012-10-29  1   10    22 -6.408414 1.470414e-10 1.074873e-07    1
## 2 2012-10-30  1   10  1096 -5.331694 9.730083e-08 3.556345e-05    2
## 3 2012-04-22  1    4  1027 -4.792573 1.646559e-06 4.012115e-04    3
## 4 2012-12-26  1   12   441 -3.558731 3.726503e-04 6.810184e-02    4
## 5 2012-05-14  1    5  2843 -3.484027 4.939303e-04 7.221261e-02    5
## 6 2011-10-29  0   10   627 -3.365728 7.634200e-04 9.301000e-02    6
filter(new_df, padjust < .05)
##       dteday yr mnth total std_resid        pvals      padjust rank
## 1 2012-10-29  1   10    22 -6.408414 1.470414e-10 1.074873e-07    1
## 2 2012-10-30  1   10  1096 -5.331694 9.730083e-08 3.556345e-05    2
## 3 2012-04-22  1    4  1027 -4.792573 1.646559e-06 4.012115e-04    3

1.4  What is the p-value rejection region associated with a 5% False Discovery Rate here?

Which observations (days) are in this rejection region? Do you have any explanation for them?

alpha = .05
m = length(new_df$pvals)

ggplot(new_df) + 
  geom_point(aes(y = pvals, x = rank)) + 
  geom_smooth(data = new_df,
              mapping =  aes(x = rank, y =pvals), 
              method = "lm", color = "Blue")+
  xlab("Index (i)") +
  ylab("P-Value")+
  geom_vline(xintercept = 38, linetype="solid", 
                color = "black", size=.5) +
  geom_abline(intercept = 0, slope = alpha/m, linetype = "dashed", color = 'red') +
  annotate("text", x = 225, y =.9, label = "Cut off for Outlier P-Values =38") +
  geom_vline(xintercept = 3, linetype="dashed", color = "blue", size=.5) 

For the raw outlier p-values, 38 days fall below the .05 cut-off. With the Benjamini-Hochberg (BH) adjusted p-values, only 3 days are in the rejection region, so the 5% False Discovery Rate gives a much smaller rejection region than the raw p-value cut-off. In the graph, the dashed blue vertical line marks the cut-off for the BH-adjusted p-values, the solid black vertical line marks the cut-off for the raw outlier p-values, and the dashed red line shows the BH slope \(\alpha i / m\); ranks up to the last point sitting under that line are rejected.
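
The BH cut-off can also be computed directly rather than read off the plot; a sketch assuming the new_df built above, where the p-value rejection region is everything at or below p_star:

alpha <- 0.05
m <- nrow(new_df)
p_sorted <- sort(new_df$pvals)
k_star <- max(which(p_sorted <= alpha*(1:m)/m))   # largest rank still under the BH line
p_star <- p_sorted[k_star]                        # p-value rejection threshold
sum(new_df$pvals <= p_star)                       # number of days rejected at FDR 5%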

According to the adjusted p-values and the FDR procedure, two of the three flagged days were in October 2012 and one was in April 2012. These days could have been due to extreme or unusual weather or some other event that would otherwise deter bike rentals; notably, the two October days (2012-10-29 and 2012-10-30) coincide with Hurricane Sandy hitting the East Coast.

1.5  Plot the p-value distribution. What does it tell you about the assumptions of the probability model we used for our regression?

ggplot(new_df, aes(x = pvals)) + 
  geom_histogram() + 
  xlab("P Value") +
  ylab("Number of P Values") + 
  ggtitle("Distribution of P Value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We assumed that the distribution of the null p-values was uniform. However, as the histogram shows, the observed p-values are far from uniform, with a large spike near zero. This suggests the iid Gaussian error assumption behind our regression's probability model is violated, which could lead the model to predict the daily rental totals poorly.

2) Lasso Linear Regression and Model Selection

For this question, consider the cv.gamlr object I’ve fit as fitlin.

#### Q2: lasso regression
library(gamlr)
## Warning: package 'gamlr' was built under R version 3.4.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
source("naref.R")
mmbike <- sparse.model.matrix(
    cnt ~ . + yr*mnth + hr*notbizday, 
    data=naref(biketab))[,-1]
y <- log(biketab$cnt)
## note, I need lambda.min.ratio=1e-4 because otherwise we don't get a path
## out to complex enough models (i.e. cv err is still decreasing at termination)
fitlin <- cv.gamlr( mmbike, y, lmr=1e-4, verb=TRUE )
## fold 1,2,3,4,5,done.

2.1
What is our response variable? Describe the columns of our model matrix. How has this model addressed the ‘outlier detection’ of question 1?

The response variable in the model is the log of cnt , which is the log of count of total rental bikes.

The sparse model matrix stores only the non-zero entries, which saves memory and computation; the dots in the printout below are unstored zeros, not replaced values. The naref function relevels each factor so that NA is the reference level, which means every observed level of a categorical variable (rather than all but one) gets its own dummy column. The columns of the sparse matrix therefore differ from the original data frame in two notable ways. First, each categorical variable is expanded into binary dummy columns, one per level, plus the interaction columns implied by yr*mnth and hr*notbizday. Second, where the data frame biketab had 13 columns, mmbike now has 824.

This model has addressed the ‘outlier detection’ of question 1 by including a dummy column for every calendar date (the dteday levels): rather than flagging outlier days after the fact, the lasso can give an unusual day its own nonzero coefficient, so day-specific shocks are absorbed as selected dteday effects instead of being left in the residuals.

print(head(mmbike))
## 6 x 824 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 824 column names 'season1', 'season2', 'season3' ... ]]
##                                                                          
## 1 1 . . . . 1 . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . .
## 2 1 . . . . 1 . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . .
## 3 1 . . . . 1 . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . .
## 4 1 . . . . 1 . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . .
## 5 1 . . . . 1 . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . .
## 6 1 . . . . 1 . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . 1 1 1 . . . -1.334609 0.9473452 -1.553844 1 . .
## 2 . . . . . . . . . . . . 1 1 1 . . . -1.438475 0.8955129 -1.553844 1 . .
## 3 . . . . . . . . . . . . 1 1 1 . . . -1.438475 0.8955129 -1.553844 1 . .
## 4 . . . . . . . . . . . . 1 1 1 . . . -1.334609 0.6363514 -1.553844 1 . .
## 5 . . . . . . . . . . . . 1 1 1 . . . -1.334609 0.6363514 -1.553844 1 . .
## 6 . . . . . . . . . . . . 1 1 . 1 . . -1.334609 0.6363514 -0.821460 1 . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##                                                                          
## 1 . . . . . . . . 0 . . . . . . . . . . . 1 . . . . . . . . . . . . . . .
## 2 . . . . . . . . 0 . . . . . . . . . . . . 1 . . . . . . . . . . . . . .
## 3 . . . . . . . . 0 . . . . . . . . . . . . . 1 . . . . . . . . . . . . .
## 4 . . . . . . . . 0 . . . . . . . . . . . . . . 1 . . . . . . . . . . . .
## 5 . . . . . . . . 0 . . . . . . . . . . . . . . . 1 . . . . . . . . . . .
## 6 . . . . . . . . 0 . . . . . . . . . . . . . . . . 1 . . . . . . . . . .
##                  
## 1 . . . . . . . .
## 2 . . . . . . . .
## 3 . . . . . . . .
## 4 . . . . . . . .
## 5 . . . . . . . .
## 6 . . . . . . . .
dim(biketab)
## [1] 17379    13
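
One way to confirm how the 'outlier days' of question 1 enter this model is to look for the per-day dummy columns among the 824; a sketch assuming mmbike as built above (the exact column-name format, e.g. "dteday2011-01-01", is an assumption about how dteday is stored in biketab):

sum(grepl("^dteday", colnames(mmbike)))                      # one dummy per calendar date
head(colnames(mmbike)[grepl("^dteday", colnames(mmbike))])   # inspect a few of them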

2.2  Describe the criteria used to choose models under select=“1se” and select=“min” rules. What are estimated out-of-sample \(R^2\) for models fit using these \(\lambda\)?

The select = "1se" rule chooses the biggest \(\lambda_t\) whose average out-of-sample deviance is no more than one standard error above the minimum; the select = "min" rule chooses the \(\lambda_t\) with the minimum average out-of-sample deviance. These criteria generally give different amounts of shrinkage and thus different regression coefficients: a larger \(\lambda\) shrinks more coefficients exactly to zero, keeping only the most essential variables in the regression.
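
The estimated out-of-sample \(R^2\) can also be read straight off the cv.gamlr object, comparing the mean out-of-sample deviance at the selected segment to the deviance at the first (essentially intercept-only) segment of the path; a sketch assuming fitlin from above:

# OOS R^2 = 1 - OOS deviance at chosen lambda / OOS deviance of the null model
oos_r2_min <- 1 - fitlin$cvm[fitlin$seg.min]/fitlin$cvm[1]
oos_r2_1se <- 1 - fitlin$cvm[fitlin$seg.1se]/fitlin$cvm[1]
print(c(min = oos_r2_min, onese = oos_r2_1se))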

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.3
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-16
set.seed(1)
train = sample(1:nrow(mmbike), nrow(mmbike)/2)
test = (-train)
y.test = y[test]

lasso.mod <- glmnet(mmbike[train,],
                    y[train], alpha=1)
plot(lasso.mod, xvar="lambda")

min_lam = fitlin$lambda.min

oneSE_lam = fitlin$lambda.1se

minLambda_prediction =  predict(lasso.mod, s=min_lam, newx =mmbike[test,])

oneSE_prediction = predict(lasso.mod, s=oneSE_lam, newx =mmbike[test,])

df2 = data.frame(reality = y.test,
                 min_lambda_pred = minLambda_prediction,
                 oneSE_pred = oneSE_prediction)

df2$minlam_error = df2$reality - df2$X1

df2$onese_error = df2$reality - df2$X1.1

colnames(df2)[2] = "minLambda_prediction"

colnames(df2)[3] = "oneSE_prediction"

df2$minlam_meansub = df2$reality - mean(df2$reality)

df2$onse_meansub = df2$reality - mean(df2$reality)

head(df2)
##     reality minLambda_prediction oneSE_prediction minlam_error onese_error
## 1  2.772589            2.9796454        3.0169801   -0.2070567  -0.2443914
## 4  2.564949            1.7582572        1.7978481    0.8066921   0.7671013
## 5  0.000000            0.6246626        0.6637173   -0.6246626  -0.6637173
## 6  0.000000            0.5497715        0.6202774   -0.5497715  -0.6202774
## 8  1.098612            2.2009414        2.2780117   -1.1023291  -1.1793994
## 12 4.025352            4.3053929        4.3358541   -0.2800412  -0.3105024
##    minlam_meansub onse_meansub
## 1      -1.7643668   -1.7643668
## 4      -1.9720062   -1.9720062
## 5      -4.5369555   -4.5369555
## 6      -4.5369555   -4.5369555
## 8      -3.4383432   -3.4383432
## 12     -0.5116038   -0.5116038

\[ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}\]

#for min lambda
min_rss = (sum((df2$minlam_error)^2))
min_rss
## [1] 1141.405
min_ssto = sum((df2$minlam_meansub)^2)
min_ssto
## [1] 18916.08
r_squared_min = 1 -(min_rss / min_ssto)
print(paste0("R-Squared for Min Lambda: ", r_squared_min))
## [1] "R-Squared for Min Lambda: 0.939659528585033"
#for one se lambda

rss_onese = sum((df2$onese_error)^2) %>% print()
## [1] 1150.303
ssto = sum((df2$onse_meansub)^2) %>% print()
## [1] 18916.08
r_squared_onse = 1 - (sum((df2$onese_error)^2) / sum((df2$onse_meansub)^2))

print(paste0("R-Squared for One SE Lambda: ", r_squared_onse))
## [1] "R-Squared for One SE Lambda: 0.939189169500555"

2.3
Compare AICc, AIC, and BIC selection to each other and to the CV rules.

In general the training mean squared error is smaller than the test mean squared error because we minimize the training RSS, not the test RSS. The training RSS will decrease as more variables are included in the model, but the test RSS will not necessarily decrease. Therefore, in-sample \(R^2\) on the training set cannot be used to select among models with different numbers of variables.

\(AIC_c\):

\[AIC_c = AIC + \frac{2k^2 + 2k}{n-k-1}\] \(AIC_c\) is an extension of \(AIC\) that corrects for small sample sizes. When the sample size is small, \(AIC\) has a higher probability of selecting models with too many parameters, i.e. of overfitting. Here \(n\) denotes the sample size and \(k\) the number of parameters. Essentially, \(AIC_c\) is \(AIC\) with an additional penalty for more parameters; as \(n\) gets very large (approaches infinity), the correction term goes to zero and \(AIC_c\) converges to \(AIC\).

\(C_p\):

\[C_p = \frac{1}{n} * (RSS + 2 d \hat \sigma^2 )\]

\(\hat \sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement, typically computed from the full model containing all predictor variables. The \(C_p\) criterion adds a penalty to the training RSS to adjust for the fact that the training error underestimates the test error; as with the other criteria, the penalty grows as the number of variables \(d\) increases, offsetting the automatic decrease in training RSS. If \(\hat \sigma^2\) is an unbiased estimate of \(\sigma^2\), then \(C_p\) is an unbiased estimate of the test MSE. We choose the model with the lowest \(C_p\).

\(AIC\):

\[AIC = \frac{1}{n \hat \sigma^2} (RSS + 2 d \hat \sigma^2 )\] \(AIC\) is defined for models fit by maximum likelihood; for the Gaussian linear model it is proportional to \(C_p\), so the two select the same model.

\(BIC\)

\[BIC = \frac{1}{n \hat \sigma^2} (RSS + \log(n)\, d\, \hat \sigma^2 )\] \(BIC\) takes on a small value for a model with low test error, so we select the model with the lowest \(BIC\). Unlike \(C_p\), \(BIC\) replaces the \(2 d \hat \sigma^2\) penalty with \(\log(n)\, d\, \hat \sigma^2\), where \(n\) is the number of observations; since \(\log(n) > 2\) for \(n\) greater than 7, BIC generally places a heavier penalty on models with many variables.

\(C_p\) (and hence \(AIC\)) and \(BIC\) therefore apply different penalties as the number of variables (i.e. predictors) increases, and can favor models of different sizes.

Cross Validation

Cross-validation estimates test error directly. It computes the validation-set error or cross-validation error for each model under consideration and then selects the model with the smallest estimated test error. It is better than \(AIC\), \(C_p\), \(BIC\), and \(AIC_c\) in that it provides direct estimates of the test error and makes fewer assumptions about the true model.

The graphs comparing BIC to the validation error and cross-validation error as a function of \(d\) (the number of predictors) use validation errors calculated by selecting 75% of the records as the training set and setting aside 25% as the test set. Both validation and cross-validation point to 10 predictors, but this is essentially the same as choosing 6 predictors because their test errors differ only minimally.
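
In code, the information-criterion selections can be compared to the CV rules on the same gamlr path; a sketch assuming fitlin from above (AICc() ships with the gamlr package, and AIC()/BIC() should dispatch on the gamlr fit through its logLik method):

seg_aicc <- which.min(AICc(fitlin$gamlr))            # segment chosen by corrected AIC
seg_aic  <- which.min(AIC(fitlin$gamlr))             # segment chosen by AIC
seg_bic  <- which.min(BIC(fitlin$gamlr))             # segment chosen by BIC
fitlin$gamlr$lambda[c(seg_aicc, seg_aic, seg_bic)]   # lambdas selected by the criteria
c(fitlin$lambda.min, fitlin$lambda.1se)              # versus the CV-min and CV-1se lambdas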

2.4 

Print the top three dteday effects by absolute value under your preferred selection rule, and describe the implied effect on cnt. Can you explain any of these?

The following variables are chosen using the cross validation method in cv.gamlr.

fitlin_coef = data.frame(as.matrix(coef(fitlin, s = "min")),
                            Variable.Name = row.names(coef(fitlin)))

colnames(fitlin_coef)[1] = "Coef.Val" 

fitlin_coef$Coef.Val = abs(fitlin_coef$Coef.Val)

fitlin_coef_max = arrange(fitlin_coef, desc(Coef.Val))

head(fitlin_coef_max, 4)
##   Coef.Val Variable.Name
## 1 4.841307     intercept
## 2 3.590335           hr3
## 3 3.475308           hr4
## 4 3.059764           hr2

The top effects from the cross-validated model under the minimum-\(\lambda\) rule are the intercept, hr3, hr4, and hr2. The intercept is the expected \(log(cnt)\) when all predictors are at their reference levels. hr3 has an absolute effect of 3.59 on \(log(cnt)\), hr4 an absolute effect of 3.48, and hr2 an absolute effect of 3.06. Because absolute values were taken, the signs are not shown here; since these dummies correspond to the early-morning hours (2-4 AM), the large magnitudes most likely reflect sharply lower rentals overnight relative to the reference hour, rather than peak demand.
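
Since the question asks specifically about dteday effects, one could restrict the ranking to the per-day dummies before taking the top three; a sketch assuming the fitlin_coef data frame built above (the "dteday" prefix in the coefficient names is an assumption about how the factor levels print):

dteday_coef <- filter(fitlin_coef, grepl("^dteday", Variable.Name))
head(arrange(dteday_coef, desc(Coef.Val)), 3)   # top three day effects by |coefficient|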