Question 1.

Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior β ∼ N(0, τ²I), and Gaussian sampling model y ∼ N(Xβ, σ²I). Find the relationship between the regularization parameter λ in the ridge formula, and the variances τ² and σ².
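A sketch of the argument: with prior β ∼ N(0, τ²I) and likelihood y | β ∼ N(Xβ, σ²I), the log-posterior is, up to an additive constant, −(1/(2σ²))||y − Xβ||² − (1/(2τ²))||β||². Maximizing it is therefore the same as minimizing ||y − Xβ||² + (σ²/τ²)||β||², which is exactly the ridge criterion with λ = σ²/τ². Setting the gradient to zero gives β̂ = (XᵀX + λI)⁻¹Xᵀy, the ridge estimate, and because the posterior is Gaussian (its exponent is quadratic in β) its mode coincides with its mean. Hence the ridge estimate is both the posterior mode and the posterior mean, with λ = σ²/τ².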

Question 2.

This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

Question 2.a.

For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

rm(list=ls())
setwd("/Users/Mughundhan/UIC/UIC Academics/FALL 2017/BIZ ANALYTICS STATS/Assignment 2")
library(MASS)
data(Boston)
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
# summary(lm(crim ~ zn, data = Boston))
# summary(lm(crim ~ indus, data = Boston))
summary(lm(crim ~ chas, data = Boston))
## 
## Call:
## lm(formula = crim ~ chas, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## chas         -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
# summary(lm(crim ~ nox, data = Boston))
# summary(lm(crim ~ rm, data = Boston))
# summary(lm(crim ~ age, data = Boston))
# summary(lm(crim ~ dis, data = Boston))
# summary(lm(crim ~ rad, data = Boston))
# summary(lm(crim ~ tax, data = Boston))
# summary(lm(crim ~ ptratio, data = Boston))
# summary(lm(crim ~ black, data = Boston))
# summary(lm(crim ~ lstat, data = Boston))
# summary(lm(crim ~ medv, data = Boston))
Observation:
  1. Predictors with p-value < 0.05 (statistically significant): zn, indus, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv
  2. Predictor with p-value > 0.05 (not significant): chas
  3. The plots below show the same pattern.
 par(mfrow=c(2,2)) #Fill by rows: Row, Cols
# par(mar = rep(2, 4)) #Setting Margins
plot(crim ~ . - crim, data = Boston)

Based on the individual regression summaries and the plots above, there is a statistically significant association between each of the predictors [zn, indus, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv] and the response crim; chas is the only predictor without one.
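Rather than pasting all thirteen summaries, the slope p-values can be collected in one pass (a sketch; reformulate() builds the one-predictor formula crim ~ predictor for each column):

predictors <- setdiff(colnames(Boston), "crim")
# p-value of the slope (row 2, column 4 = Pr(>|t|)) of each simple regression
simple_pvals <- sapply(predictors, function(p)
  summary(lm(reformulate(p, response = "crim"), data = Boston))$coefficients[2, 4])
round(sort(simple_pvals), 4)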
Question 2.b.

Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0: βj = 0?

summary(lm(crim ~ . - crim, data = Boston))
## 
## Call:
## lm(formula = crim ~ . - crim, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16
Observation:

We reject the null hypothesis H0: βj = 0 when the p-value < 0.05 (95% confidence level).

  1. Predictors that have p-value < 0.05: zn, dis, rad, black, medv
  2. Predictors that have p-value > 0.05: indus, chas, nox, rm, age, tax, ptratio, lstat

Thus, at the 5% level we can reject the null hypothesis for the predictors zn, dis, rad, black, and medv.
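The same list can be read off programmatically from the fitted model (a sketch):

mlr <- lm(crim ~ ., data = Boston)
coefs <- summary(mlr)$coefficients
# predictors (intercept excluded) whose p-value is below 0.05
rownames(coefs)[-1][coefs[-1, 4] < 0.05]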

Question 2.c.

How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

univcof <- lm(crim ~ zn, data = Boston)$coefficients[2]
univcof <- append(univcof, lm(crim ~ indus, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ chas, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ nox, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ rm, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ age, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ dis, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ rad, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ tax, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ ptratio, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ black, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ lstat, data = Boston)$coefficients[2])
univcof <- append(univcof, lm(crim ~ medv, data = Boston)$coefficients[2])
fooBoston <- (lm(crim ~ . - crim, data = Boston))
fooBoston$coefficients[2:14]
##            zn         indus          chas           nox            rm 
##   0.044855215  -0.063854824  -0.749133611 -10.313534912   0.430130506 
##           age           dis           rad           tax       ptratio 
##   0.001451643  -0.987175726   0.588208591  -0.003780016  -0.271080558 
##         black         lstat          medv 
##  -0.007537505   0.126211376  -0.198886821
plot(univcof, fooBoston$coefficients[2:14], main = "Simple Linear Regression vs. Multiple Linear Regression", 
    xlab = "Simple Linear Regression", ylab = "Multiple Linear Regression", col="blue")

Observation:
  1. There is a substantial difference between the coefficients from the simple linear regressions and those from the multiple linear regression. This difference arises because:
  • In simple linear regression, the slope represents the average effect of an increase in the predictor, ignoring all other predictors.
  • In multiple linear regression, the slope represents the average effect of an increase in the predictor while holding the other predictors fixed.
  2. It therefore makes sense for the multiple regression to suggest no relationship between the response and some of the predictors.
  3. The simple regressions imply the opposite because many of the predictors are strongly correlated with one another, so each one partly acts as a proxy for the others.
Question 2.d.

Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form Y = β0 + β1X + β2X² + β3X³ + ε.

summary(lm(crim ~ poly(zn, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(zn, 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.821 -4.614 -1.294  0.473 84.130 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.6135     0.3722   9.709  < 2e-16 ***
## poly(zn, 3)1 -38.7498     8.3722  -4.628  4.7e-06 ***
## poly(zn, 3)2  23.9398     8.3722   2.859  0.00442 ** 
## poly(zn, 3)3 -10.0719     8.3722  -1.203  0.22954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared:  0.05824,    Adjusted R-squared:  0.05261 
## F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06
summary(lm(crim ~ poly(indus, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(indus, 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -2.514  0.054  0.764 79.713 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.614      0.330  10.950  < 2e-16 ***
## poly(indus, 3)1   78.591      7.423  10.587  < 2e-16 ***
## poly(indus, 3)2  -24.395      7.423  -3.286  0.00109 ** 
## poly(indus, 3)3  -54.130      7.423  -7.292  1.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2552 
## F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(nox, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(nox, 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.110 -2.068 -0.255  0.739 78.302 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.6135     0.3216  11.237  < 2e-16 ***
## poly(nox, 3)1  81.3720     7.2336  11.249  < 2e-16 ***
## poly(nox, 3)2 -28.8286     7.2336  -3.985 7.74e-05 ***
## poly(nox, 3)3 -60.3619     7.2336  -8.345 6.96e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.2928 
## F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(rm, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(rm, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.485  -3.468  -2.221  -0.015  87.219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.6135     0.3703   9.758  < 2e-16 ***
## poly(rm, 3)1 -42.3794     8.3297  -5.088 5.13e-07 ***
## poly(rm, 3)2  26.5768     8.3297   3.191  0.00151 ** 
## poly(rm, 3)3  -5.5103     8.3297  -0.662  0.50858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared:  0.06779,    Adjusted R-squared:  0.06222 
## F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07
summary(lm(crim ~ poly(age, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(age, 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.762 -2.673 -0.516  0.019 82.842 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.6135     0.3485  10.368  < 2e-16 ***
## poly(age, 3)1  68.1820     7.8397   8.697  < 2e-16 ***
## poly(age, 3)2  37.4845     7.8397   4.781 2.29e-06 ***
## poly(age, 3)3  21.3532     7.8397   2.724  0.00668 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared:  0.1742, Adjusted R-squared:  0.1693 
## F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(dis, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.757  -2.588   0.031   1.267  76.378 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.6135     0.3259  11.087  < 2e-16 ***
## poly(dis, 3)1 -73.3886     7.3315 -10.010  < 2e-16 ***
## poly(dis, 3)2  56.3730     7.3315   7.689 7.87e-14 ***
## poly(dis, 3)3 -42.6219     7.3315  -5.814 1.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared:  0.2778, Adjusted R-squared:  0.2735 
## F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(rad, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(rad, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.381  -0.412  -0.269   0.179  76.217 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.6135     0.2971  12.164  < 2e-16 ***
## poly(rad, 3)1 120.9074     6.6824  18.093  < 2e-16 ***
## poly(rad, 3)2  17.4923     6.6824   2.618  0.00912 ** 
## poly(rad, 3)3   4.6985     6.6824   0.703  0.48231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared:    0.4,  Adjusted R-squared:  0.3965 
## F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(tax, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(tax, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.273  -1.389   0.046   0.536  76.950 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.6135     0.3047  11.860  < 2e-16 ***
## poly(tax, 3)1 112.6458     6.8537  16.436  < 2e-16 ***
## poly(tax, 3)2  32.0873     6.8537   4.682 3.67e-06 ***
## poly(tax, 3)3  -7.9968     6.8537  -1.167    0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3651 
## F-statistic:  97.8 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(ptratio, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(ptratio, 3), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.833 -4.146 -1.655  1.408 82.697 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.614      0.361  10.008  < 2e-16 ***
## poly(ptratio, 3)1   56.045      8.122   6.901 1.57e-11 ***
## poly(ptratio, 3)2   24.775      8.122   3.050  0.00241 ** 
## poly(ptratio, 3)3  -22.280      8.122  -2.743  0.00630 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared:  0.1138, Adjusted R-squared:  0.1085 
## F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13
summary(lm(crim ~ poly(black, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(black, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.096  -2.343  -2.128  -1.439  86.790 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3536  10.218   <2e-16 ***
## poly(black, 3)1 -74.4312     7.9546  -9.357   <2e-16 ***
## poly(black, 3)2   5.9264     7.9546   0.745    0.457    
## poly(black, 3)3  -4.8346     7.9546  -0.608    0.544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared:  0.1498, Adjusted R-squared:  0.1448 
## F-statistic: 29.49 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(lstat, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(lstat, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3392  10.654   <2e-16 ***
## poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
summary(lm(crim ~ poly(medv, 3), data = Boston))
## 
## Call:
## lm(formula = crim ~ poly(medv, 3), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.427  -1.976  -0.437   0.439  73.655 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.614      0.292  12.374  < 2e-16 ***
## poly(medv, 3)1  -75.058      6.569 -11.426  < 2e-16 ***
## poly(medv, 3)2   88.086      6.569  13.409  < 2e-16 ***
## poly(medv, 3)3  -48.033      6.569  -7.312 1.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared:  0.4202, Adjusted R-squared:  0.4167 
## F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16
Observation:
  1. Evidence of a non-linear relationship: for indus, nox, age, dis, ptratio, and medv, both the squared and the cubed terms are statistically significant (we reject the null hypothesis that the coefficients on these higher-order terms are zero), so their association with crim appears clearly non-linear. (A compact check is sketched after this list.)
  2. For zn, rm, rad, tax, and lstat only the squared term is significant, so the evidence of non-linearity is weaker; for black neither higher-order term is significant. chas is omitted here because it is binary, so a cubic polynomial cannot be fit to it.
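A compact way to check this (a sketch; chas is excluded because it is binary, so a degree-3 polynomial cannot be fit to it):

preds <- setdiff(colnames(Boston), c("crim", "chas"))
# p-values of the squared and cubed terms in crim ~ poly(predictor, 3)
poly_pvals <- t(sapply(preds, function(p) {
  f <- as.formula(paste0("crim ~ poly(", p, ", 3)"))
  summary(lm(f, data = Boston))$coefficients[3:4, 4]
}))
colnames(poly_pvals) <- c("quadratic", "cubic")
round(poly_pvals, 4)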

Question 3.

Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.

Question 3.a.

Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

hours <- 40
UG_GPA <- 3.5
prob_A <- function(x1,x2){ z <- exp(-6 + 0.05*x1 + 1*x2); return( round(z/(1+z),4))}
prob_A(hours,UG_GPA)
## [1] 0.3775
Solution: The estimated probability of getting an A is 0.3775 (37.75%).
Question 3.b.

How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

hours <- seq(45,55,1)
probs <- mapply(hours, 3.5, FUN=prob_A)
probs_tab <- cbind(probs, hours)
probs_tab
##        probs hours
##  [1,] 0.4378    45
##  [2,] 0.4502    46
##  [3,] 0.4626    47
##  [4,] 0.4750    48
##  [5,] 0.4875    49
##  [6,] 0.5000    50
##  [7,] 0.5125    51
##  [8,] 0.5250    52
##  [9,] 0.5374    53
## [10,] 0.5498    54
## [11,] 0.5622    55
probs_tab[probs==0.50, ]
## probs hours 
##   0.5  50.0
Solution: The student in part (a) needs to study for 50 hours in order to have a 50% chance of getting an A in the class.
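The same answer falls out in closed form: at a 50% probability the log-odds are zero, so −6 + 0.05·x1 + 1·3.5 = 0, giving x1 = 2.5/0.05 = 50 hours. As a quick check in R:

# hours needed for a 50% chance, solved from the log-odds equation
(0 - (-6) - 1 * 3.5) / 0.05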

Question 4.

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

Question 4.a.

Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

library(ISLR)
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
pairs(Weekly)

cor(Weekly[,-ncol(Weekly)])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
Solution:
  1. There is a high correlation (about 0.84) between Volume and Year: trading volume grows roughly exponentially over time (see the plot sketch after this list).
  2. Some years show noticeably more variation in the weekly returns than others.
  3. The pairs plot has a “violin” shape when the Lag variables are plotted against Year, suggesting some autocorrelation in the variability of the returns: volatile periods persist for a while before dying down.
  4. There is very little correlation among the Lag variables themselves.
  5. Direction appears only weakly related to the lags, perhaps slightly to Lag1 and Lag5.
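A quick way to visualize the Volume pattern noted in item 1 (a sketch):

# weekly trading volume over time; the upward, roughly exponential trend
# drives the 0.84 correlation between Volume and Year
plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume", col = "darkblue")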
Question 4.b.

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

log_fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, family=binomial, data=Weekly)
summary(log_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
Solution: Only Lag2 appears statistically significant (p = 0.0296); it is the only predictor with a p-value below 0.05.
Question 4.c.

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm_probs=predict(log_fit,Weekly,type="response")
glm_pred=rep("Down",nrow(Weekly))
glm_pred[glm_probs > 0.50]="Up"
table(glm_pred,Weekly$Direction)
##         
## glm_pred Down  Up
##     Down   54  48
##     Up    430 557
#mean(glm_pred==Weekly$Direction)
correct_pred <- round((54+557)/(54+557+48+430), 4)
Right_Up <- round(557/(48+557), 4)
Right_Down <- round(54/(54+430), 4)
error_rate <- 1-correct_pred
cbind(correct_pred, Right_Up, Right_Down, error_rate)
##      correct_pred Right_Up Right_Down error_rate
## [1,]       0.5611   0.9207     0.1116     0.4389
Solution:
  1. The confusion matrix is telling us that the model does not fit the data particularly well.
  2. The “Up” direction is captured well: of the weeks in which the market actually went up, 92.07% were predicted to go up.
  3. The “Down” direction is captured poorly: only 11.16% of the actual down weeks were predicted to go down; in other words, the model predicts “Up” for the vast majority of weeks. (See the sketch after this list.)
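The same rates can be computed directly from the table object instead of hard-coding the counts (a sketch):

cm <- table(glm_pred, Weekly$Direction)
sum(diag(cm)) / sum(cm)                 # overall fraction of correct predictions
cm["Up", "Up"] / sum(cm[, "Up"])        # fraction of actual Up weeks predicted Up
cm["Down", "Down"] / sum(cm[, "Down"])  # fraction of actual Down weeks predicted Down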
Question 4.d.

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train=Weekly$Year <= 2008
Weekly_test=Weekly[!train,]
log_fit = glm(Direction ~ Lag2, family=binomial, data=Weekly, subset=train)
contrasts(Weekly$Direction)
##      Up
## Down  0
## Up    1
summary(log_fit)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm_probs=predict(log_fit, Weekly_test, type="response")
glm_pred=rep("Down",nrow(Weekly_test))
glm_pred[glm_probs > 0.50]="Up"
table(glm_pred, Weekly_test$Direction)
##         
## glm_pred Down Up
##     Down    9  5
##     Up     34 56
correct_pred <- round((9+56)/104, 4)
Right_Up <- round(56/(56+5), 4)
Right_Down <- round(9/(9+34), 4)
error_rate <- 1-correct_pred
log_details <- cbind(correct_pred, Right_Up, Right_Down, error_rate)
log_details
##      correct_pred Right_Up Right_Down error_rate
## [1,]        0.625    0.918     0.2093      0.375
Solution:
  1. The percentage of correct predictions is 62.5%; thus, the error rate is 37.5%.
  2. The “Up” direction is captured well: 91.8% of the actual up weeks are predicted to go up.
  3. The “Down” direction is captured poorly: only 20.93% of the actual down weeks are predicted to go down. (A reusable helper for these metrics is sketched after this list.)
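Since parts (d) through (h) repeat the same calculation, a small helper could replace the hard-coded counts (a sketch; perf is a hypothetical name, and the predictions are coerced to a factor so the table keeps both rows even when a method never predicts "Down"):

perf <- function(pred, truth) {
  cm <- table(factor(pred, levels = c("Down", "Up")), truth)
  acc <- sum(diag(cm)) / sum(cm)
  c(correct_pred = acc,
    Right_Up   = cm["Up", "Up"] / sum(cm[, "Up"]),
    Right_Down = cm["Down", "Down"] / sum(cm[, "Down"]),
    error_rate = 1 - acc)
}
round(perf(glm_pred, Weekly_test$Direction), 4)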
Question 4.e.

Repeat (d) using LDA.

lda_fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
lda_fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
lda_class = predict(lda_fit, Weekly_test)$class
table(lda_class, Weekly_test$Direction)
##          
## lda_class Down Up
##      Down    9  5
##      Up     34 56
correct_pred <- round((9+56)/104, 4)
Right_Up <- round(56/(56+5), 4)
Right_Down <- round(9/(9+34), 4)
error_rate <- 1-correct_pred
lda_details <- cbind(correct_pred, Right_Up, Right_Down, error_rate)
lda_details
##      correct_pred Right_Up Right_Down error_rate
## [1,]        0.625    0.918     0.2093      0.375
Solution: The LDA results are identical to those of logistic regression.
  1. The percentage of correct predictions is 62.5%; thus, the error rate is 37.5%.
  2. The “Up” direction is captured well: 91.8% of the actual up weeks are predicted to go up.
  3. The “Down” direction is captured poorly: only 20.93% of the actual down weeks are predicted to go down.
Question 4.f.

Repeat (d) using QDA.

qda_fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda_fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda_class = predict(qda_fit, Weekly_test)$class
table(qda_class, Weekly_test$Direction)
##          
## qda_class Down Up
##      Down    0  0
##      Up     43 61
correct_pred <- round((0+61)/104, 4)
Right_Up <- round(61/61, 4)
Right_Down <- round(0/43, 4)
error_rate <- 1-correct_pred
qda_details <- cbind(correct_pred, Right_Up, Right_Down, error_rate)
qda_details
##      correct_pred Right_Up Right_Down error_rate
## [1,]       0.5865        1          0     0.4135
Solution: QDA predicts “Up” for every week in the test period; LDA and logistic regression perform better on this test data.
  1. The percentage of correct predictions is 58.65%; thus, the error rate is 41.35%.
  2. Every actual up week is predicted to go up (100%), simply because every week is predicted “Up”.
  3. None of the actual down weeks are predicted to go down (0%).
Question 4.g.

Repeat (d) using KNN with K = 1.

library(class)
train_X=Weekly[train,"Lag2",drop=F]
test_X=Weekly[!train,"Lag2",drop=F]
train_Direction=Weekly[train,"Direction",drop=T]
test_Direction=Weekly[!train,"Direction",drop=T]
set.seed(1)
knn_pred=knn(train_X,test_X,train_Direction,k=1)
table(knn_pred,test_Direction)
##         test_Direction
## knn_pred Down Up
##     Down   21 30
##     Up     22 31
correct_pred <- round((21+31)/104, 4)
Right_Up <- round(31/61, 4)
Right_Down <- round(21/43, 4)
error_rate <- 1-correct_pred
knn_details <- cbind(correct_pred, Right_Up, Right_Down, error_rate)
knn_details
##      correct_pred Right_Up Right_Down error_rate
## [1,]          0.5   0.5082     0.4884        0.5
mean(knn_pred==test_Direction)
## [1] 0.5
Solution: KNN with K = 1 performs no better than random guessing on this test data.
  1. The percentage of correct predictions is 50%; thus, the error rate is 50%.
  2. Only 50.82% of the actual up weeks are predicted to go up.
  3. Only 48.84% of the actual down weeks are predicted to go down.
Question 4.h.

Which of these methods appears to provide the best results on this data?

log_details * 100
##      correct_pred Right_Up Right_Down error_rate
## [1,]         62.5     91.8      20.93       37.5
lda_details * 100
##      correct_pred Right_Up Right_Down error_rate
## [1,]         62.5     91.8      20.93       37.5
qda_details * 100
##      correct_pred Right_Up Right_Down error_rate
## [1,]        58.65      100          0      41.35
knn_details * 100
##      correct_pred Right_Up Right_Down error_rate
## [1,]           50    50.82      48.84         50
Solution: LDA and logistic regression have the lowest error rate and perform equally well on the test data.

Comparing the results of the four methods (a consolidated table is sketched below), we conclude the following:

  • KNN (K = 1): essentially random guessing (50% accuracy)
  • QDA: predicts “Up” for every test week, so it never identifies a down week
  • LDA and Logistic: identical predictions and the lowest test error rate of the methods considered
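For a side-by-side view, the four metric rows computed above can be stacked (a sketch reusing the objects already defined):

comparison <- rbind(log_details, lda_details, qda_details, knn_details)
rownames(comparison) <- c("Logistic", "LDA", "QDA", "KNN (K=1)")
comparison * 100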
Question 4.i.

Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

set.seed(1)
knn_pred=knn(train_X, test_X,train_Direction,k=4)
table(knn_pred,test_Direction)
##         test_Direction
## knn_pred Down Up
##     Down   20 17
##     Up     23 44
mean(knn_pred==test_Direction)
## [1] 0.6153846
knn_pred=knn(train_X, test_X,train_Direction,k=10)
table(knn_pred,test_Direction)
##         test_Direction
## knn_pred Down Up
##     Down   20 21
##     Up     23 40
mean(knn_pred==test_Direction)
## [1] 0.5769231
knn_pred=knn(train_X, test_X,train_Direction,k=100)
table(knn_pred,test_Direction)
##         test_Direction
## knn_pred Down Up
##     Down   10 12
##     Up     33 49
mean(knn_pred==test_Direction)
## [1] 0.5673077
Observation: Increasing K helps KNN; accuracy rises to about 0.62 at K = 4 and settles near 0.57 for K = 10 and K = 100.
qda_fit = qda(Direction ~ Lag1, data=Weekly, subset=train)
qda_fit
## Call:
## qda(Direction ~ Lag1, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1
## Down  0.289444444
## Up   -0.009213235
qda_class=predict(qda_fit,Weekly_test)$class
table(qda_class,Weekly_test$Direction)
##          
## qda_class Down Up
##      Down    0  0
##      Up     43 61
mean(qda_class==Weekly_test$Direction)
## [1] 0.5865385
qda_fit = qda(Direction ~ Lag1 + Lag2, data=Weekly, subset=train)
qda_fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2
## Down  0.289444444 -0.03568254
## Up   -0.009213235  0.26036581
qda_class=predict(qda_fit,Weekly_test)$class
table(qda_class,Weekly_test$Direction)
##          
## qda_class Down Up
##      Down    7 10
##      Up     36 51
mean(qda_class==Weekly_test$Direction)
## [1] 0.5576923
qda_fit = qda(Direction ~ Lag1 + Lag2 + Lag3, data=Weekly, subset=train)
qda_fit
## Call:
## qda(Direction ~ Lag1 + Lag2 + Lag3, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2       Lag3
## Down  0.289444444 -0.03568254 0.17080045
## Up   -0.009213235  0.26036581 0.08404044
qda_class=predict(qda_fit,Weekly_test)$class
table(qda_class,Weekly_test$Direction)
##          
## qda_class Down Up
##      Down    6 10
##      Up     37 51
mean(qda_class==Weekly_test$Direction)
## [1] 0.5480769
qda_fit = qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data=Weekly, subset=train)
qda_fit
## Call:
## qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2       Lag3       Lag4
## Down  0.289444444 -0.03568254 0.17080045 0.15925624
## Up   -0.009213235  0.26036581 0.08404044 0.09220956
qda_class=predict(qda_fit,Weekly_test)$class
table(qda_class,Weekly_test$Direction)
##          
## qda_class Down Up
##      Down    9 16
##      Up     34 45
mean(qda_class==Weekly_test$Direction)
## [1] 0.5192308
Observation: QDA performs progressively worse as more predictors are added.
lda_fit = lda(Direction ~ Lag1 + Lag2, data=Weekly, subset=train)
lda_fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2
## Down  0.289444444 -0.03568254
## Up   -0.009213235  0.26036581
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.3013148
## Lag2  0.2982579
lda_class = predict(lda_fit, Weekly_test)$class
table(lda_class, Weekly_test$Direction)
##          
## lda_class Down Up
##      Down    7  8
##      Up     36 53
mean(lda_class==Weekly_test$Direction)
## [1] 0.5769231
lda_fit = lda(Direction ~ Lag1 + Lag2 + Volume, data=Weekly, subset=train)
lda_fit
## Call:
## lda(Direction ~ Lag1 + Lag2 + Volume, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2   Volume
## Down  0.289444444 -0.03568254 1.266966
## Up   -0.009213235  0.26036581 1.156529
## 
## Coefficients of linear discriminants:
##               LD1
## Lag1   -0.2979204
## Lag2    0.2366224
## Volume -0.3545069
lda_class = predict(lda_fit, Weekly_test)$class
table(lda_class, Weekly_test$Direction)
##          
## lda_class Down Up
##      Down   27 33
##      Up     16 28
mean(lda_class==Weekly_test$Direction)
## [1] 0.5288462
lda_fit = lda(Direction ~ Lag1 + Lag2 + Volume + Lag3, data=Weekly, subset=train)
lda_fit
## Call:
## lda(Direction ~ Lag1 + Lag2 + Volume + Lag3, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2   Volume       Lag3
## Down  0.289444444 -0.03568254 1.266966 0.17080045
## Up   -0.009213235  0.26036581 1.156529 0.08404044
## 
## Coefficients of linear discriminants:
##                LD1
## Lag1   -0.29112134
## Lag2    0.22615149
## Volume -0.36829715
## Lag3   -0.07080671
lda_class = predict(lda_fit, Weekly_test)$class
table(lda_class, Weekly_test$Direction)
##          
## lda_class Down Up
##      Down   30 37
##      Up     13 24
mean(lda_class==Weekly_test$Direction)
## [1] 0.5192308
Observation: Like QDA, LDA performs progressively worse as more predictors are added.
train=Weekly$Year <= 2008
Weekly_test=Weekly[!train,]

logit_fit = glm(Direction ~ Lag1+Lag2, family=binomial, data=Weekly, subset=train)
glm_probs = predict(logit_fit, Weekly_test, type="response")
glm_pred = rep("Down", nrow(Weekly_test))
glm_pred[glm_probs > 0.50]="Up"
table(glm_pred, Weekly_test$Direction)
##         
## glm_pred Down Up
##     Down    7  8
##     Up     36 53
mean(glm_pred==Weekly_test$Direction)
## [1] 0.5769231
logit_fit = glm(Direction ~ Lag1+Lag2+Volume, family=binomial, data=Weekly, subset=train)
glm_probs = predict(logit_fit, Weekly_test, type="response")
glm_pred = rep("Down", nrow(Weekly_test))
glm_pred[glm_probs > 0.50]="Up"
table(glm_pred, Weekly_test$Direction)
##         
## glm_pred Down Up
##     Down   27 33
##     Up     16 28
mean(glm_pred==Weekly_test$Direction)
## [1] 0.5288462
logit_fit = glm(Direction ~ Lag1+Lag2+Volume+Lag3, family=binomial, data=Weekly, subset=train)
glm_probs = predict(logit_fit, Weekly_test, type="response")
glm_pred = rep("Down", nrow(Weekly_test))
glm_pred[glm_probs > 0.50]="Up"
table(glm_pred, Weekly_test$Direction)
##         
## glm_pred Down Up
##     Down   30 37
##     Up     13 24
mean(glm_pred==Weekly_test$Direction)
## [1] 0.5192308
Observation: Like QDA and LDA, logistic regression also performs progressively worse as more predictors are added; it does best with Lag2 alone.
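One combination the exercise suggests but that is not tried above is an interaction term; a sketch (results not shown):

# logistic regression with a Lag1:Lag2 interaction, evaluated on the 2009-2010 hold-out
logit_int = glm(Direction ~ Lag2 + Lag1:Lag2, family=binomial, data=Weekly, subset=train)
int_probs = predict(logit_int, Weekly_test, type="response")
int_pred = ifelse(int_probs > 0.50, "Up", "Down")
table(int_pred, Weekly_test$Direction)
mean(int_pred==Weekly_test$Direction)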