##1. Question 4.7.1 pg 168

Given: \(P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}}\)

Show: \(\frac{P(X)}{1-P(X)} = e^{\beta_0 + \beta_1 X}\)

\(P(X)(1+e^{\beta_0 + \beta_1 X}) = e^{\beta_0 + \beta_1 X}\)

\(P(X) + P(X)\,e^{\beta_0 + \beta_1 X} = e^{\beta_0 + \beta_1 X}\)

\(P(X) = e^{\beta_0 + \beta_1 X} - P(X)\,e^{\beta_0 + \beta_1 X}\)

\(P(X) = e^{\beta_0 + \beta_1 X}\,[1 - P(X)]\)

\(\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X}\)

Hence:

\(\frac{P(X)}{1-P(X)} = e^{\beta_0 + \beta_1 X}\)
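
As a quick numerical sanity check, the two sides of the identity can be compared in R for arbitrary values of \(\beta_0\), \(\beta_1\), and \(X\) (the specific numbers below are illustrative only):

```r
# Arbitrary coefficient and predictor values (illustrative only)
beta_0 <- 0.5
beta_1 <- -1.2
X      <- 2

# P(X) as defined above
p <- exp(beta_0 + beta_1 * X) / (1 + exp(beta_0 + beta_1 * X))

# The odds P(X) / (1 - P(X)) should equal exp(beta_0 + beta_1 * X)
p / (1 - p)
exp(beta_0 + beta_1 * X)
```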

##2. Question 4.7.10(a-d) pg 171

#Problem 10 a)

First I will load the Weekly data set found in the ISLR library. Then I will produce some numerical and graphical summaries of the data.
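
A sketch of the code for this step (the specific graphical summaries, pairs() and a histogram of Volume, are assumptions about which plots were produced):

```r
library(ISLR)

# Numerical summary of every column in the Weekly data set
summary(Weekly)

# Graphical summaries: pairwise scatterplots and a histogram of Volume
pairs(Weekly)
hist(Weekly$Volume)
```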

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260

When looking at the data, the only variables that appear correlated are Volume and Year: as time goes on, Volume increases, following a roughly exponential pattern that starts off slowly and then takes off. In addition, the histogram of Volume shows a positive skew.
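
A sketch of how the Volume/Year relationship can be checked numerically and graphically (Direction is a factor and must be dropped before calling cor(); the plotting call is an assumption):

```r
# Correlation matrix of the numeric columns (Direction is a factor, so drop it)
cor(Weekly[, names(Weekly) != "Direction"])

# Volume over time: roughly exponential growth
plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")
```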

#Problem 10 b)

Now I will fit a logistic regression model (glm with the binomial family) with Direction as the response and all of the lag variables plus Volume as predictors, and then output the summary. I will also create a binary response variable for direction: 0 if Direction is Down, 1 otherwise.
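
A sketch of the fitting code consistent with the call shown in the summary below (the object name glm_mod10_1 and the Direction01 column are assumptions, chosen to match the prediction names used later):

```r
# Binary response: 0 if the market went Down, 1 otherwise
Weekly$Direction01 <- ifelse(Weekly$Direction == "Down", 0, 1)

# Logistic regression of Direction on the five lag variables and Volume
glm_mod10_1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                   family = binomial, data = Weekly)
summary(glm_mod10_1)
```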

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Looking at the model summary, Lag2 is the only predictor that appears statistically significant (p ≈ 0.03).

#Problem 10 c)

Now I will build a confusion matrix based on the model that was just fit.
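
A sketch of one way to build the confusion matrix (the 0.5 cutoff and the row/column orientation are assumptions):

```r
# Predicted probabilities on the full data set
glm_mod10_1_probs <- predict(glm_mod10_1, type = "response")

# Classify as 1 when the predicted probability exceeds 0.5
glm_mod10_1_preds <- ifelse(glm_mod10_1_probs > 0.5, 1, 0)

# Rows: actual binary direction; columns: predicted class
table(Weekly$Direction01, glm_mod10_1_preds)
```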

##    glm_mod10_1_preds
##       0   1
##   0 430  54
##   1 557  48

The confusion matrix reveals that the accuracy of the model is roughly 56%, meaning the error rate is roughly 44%. The precision of the model is rather low, roughly 11%, indicating that only a small percentage of the predicted results are relevant. The recall is roughly 53%, which means that about half of the positive cases were correctly classified. In short, the model does not do well at predicting whether the market has a positive or negative return.

#Problem 10 d)

Now I will subset the data to contain only the years 1990 through 2008. I will then fit a logistic regression model with Lag2 as the only predictor, and output the confusion matrix and the overall fraction of correct predictions on the held-out data (the years 2009 and 2010).
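
A sketch of the hold-out split and the Lag2-only fit (the object names other than glm_10d_preds, and the 0.5 cutoff, are assumptions):

```r
# Training years: 1990 through 2008; held-out test years: 2009 and 2010
train <- Weekly$Year <= 2008
Weekly_test <- Weekly[!train, ]

# Logistic regression with Lag2 as the only predictor, fit on the training years
glm_10d <- glm(Direction ~ Lag2, family = binomial, data = Weekly, subset = train)

# Predict on the held-out years and tabulate against the actual binary response
glm_10d_probs <- predict(glm_10d, Weekly_test, type = "response")
glm_10d_preds <- ifelse(glm_10d_probs > 0.5, "Up", "Down")
table(Weekly_test$Direction01, glm_10d_preds)

# Overall fraction of correct predictions on 2009-2010
mean(glm_10d_preds == Weekly_test$Direction)
```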

##    glm_10d_preds
##     Down Up
##   0    9 34
##   1    5 56

It appears this model is better at predicting whether the market will have a positive or negative return. The recall, precision, and accuracy are all higher than for the previous model. Since the accuracy is 62.5%, the error rate for the model is 37.5%.

##3. Question 4.7.11(a,b,c,f) pg 172

#Problem 11 a)

I will load the data set called Auto found in the ISLR library. Using this data set I will predict whether a given car gets high or low gas mileage based on various variables within the data set. First, I calculate the median value of mpg and then create a binary variable within the data set named mpg01: 1 is assigned if mpg is above the median value, and 0 otherwise.
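
A sketch of this step (treating an mpg exactly equal to the median as low mileage is one possible convention):

```r
library(ISLR)

# Binary indicator: 1 if mpg is above the median, 0 otherwise
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
table(Auto$mpg01)
```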

#Problem 11 b)

Now I will explore the data set to see which variables would be useful in predicting mpg01. I will use the pairs plot and ggpairs plot to see scatter plots and boxplots. In this data set, name is not going to be useful for exploring or modeling: there are 301 different names in a data set of only 392 rows.
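
A sketch of the exploratory plots (the GGally package is assumed for ggpairs(), and Auto_sub is a working copy with the name column dropped):

```r
library(ggplot2)
library(GGally)

# Drop the name column: 301 distinct values in 392 rows, not useful as a predictor
Auto_sub <- Auto[, names(Auto) != "name"]

# Pairwise scatterplots, plus ggpairs coloured by the mpg01 indicator
pairs(Auto_sub)
ggpairs(Auto_sub, mapping = aes(colour = factor(mpg01)))

# Example boxplot: weight split by mpg01
boxplot(weight ~ mpg01, data = Auto_sub)
```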

I believe the boxplots reveal the most about mpg01. Covariates such as cylinders, weight, displacement, and horsepower show a clear separation between the two mpg01 groups, which suggests they are associated with mpg01; year may also have a weaker association. However, when fitting a model it is important that the covariates do not also share a strong association among themselves.

There are also some other categorical variables: cylinders, which takes values between 4 and 8, and origin, which represents 1 = American, 2 = European, 3 = Japanese. I want to look at some bar plots of these with the mpg01 variable outlined.
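
A sketch of the bar plots (using ggplot2 with the fill aesthetic to outline mpg01 is an assumption about how the counts were drawn):

```r
# Counts of cylinders, split by the mpg01 indicator
ggplot(Auto_sub, aes(x = factor(cylinders), fill = factor(mpg01))) +
  geom_bar(position = "dodge") +
  labs(x = "Cylinders", fill = "mpg01")

# Counts of origin (1 = American, 2 = European, 3 = Japanese), split by mpg01
ggplot(Auto_sub, aes(x = factor(origin), fill = factor(mpg01))) +
  geom_bar(position = "dodge") +
  labs(x = "Origin", fill = "mpg01")
```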

From the bar-graph counts, cylinders and mpg01 appear to have a strong association. For example, four-cylinder cars mostly have a high MPG rating, while eight-cylinder cars mostly have a low MPG rating.

Looking at the origin bar graphs, American cars have the lowest MPG ratings, whereas European and Japanese cars have proportionally more vehicles with high MPG ratings.

#Problem 11 c)

Now I will split the data into training and testing sets. The training data will contain 70% of the observations and the testing data the other 30%. I will set a seed of 123 for reproducibility. Before splitting, I also remove any variables (excluding mpg01) that have a correlation coefficient of 0.80 or higher with another variable.
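
A hedged sketch of this step: using caret::findCorrelation with a 0.80 cutoff, and dropping the raw mpg column alongside mpg01, are assumptions (any approach that removes one variable from each highly correlated pair would do). The names training and testing match the ones appearing in the model output further below.

```r
library(caret)

set.seed(123)

# Candidate predictors: drop the response mpg01 and the raw mpg it was built from
predictors <- Auto_sub[, !(names(Auto_sub) %in% c("mpg", "mpg01"))]

# Flag variables with a pairwise correlation of 0.80 or higher and drop them
high_corr  <- findCorrelation(cor(predictors), cutoff = 0.80, names = TRUE)
keep       <- setdiff(names(predictors), high_corr)
Auto_model <- cbind(predictors[, keep], mpg01 = Auto_sub$mpg01)
names(Auto_model)

# 70/30 train/test split
train_idx <- sample(nrow(Auto_model), size = round(0.7 * nrow(Auto_model)))
training  <- Auto_model[train_idx, ]
testing   <- Auto_model[-train_idx, ]
```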

## [1] "weight"       "acceleration" "year"         "origin"      
## [5] "mpg01"

#Problem 11 f)

After removing the strongly correlated covariates, weight, acceleration, year, and origin remain. As previously mentioned, all of these variables appear to have a useful relationship with mpg01 for prediction, although I did not previously discuss acceleration. Nevertheless, I will use all of the remaining covariates in the logistic regression model and then produce a summary of the model.
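
A sketch of the fit matching the call in the summary below (the object name glm_11 is an assumption, chosen to match the prediction name used later):

```r
glm_11 <- glm(mpg01 ~ weight + acceleration + year + origin,
              family = binomial, data = training)
summary(glm_11)
```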

## 
## Call:
## glm(formula = mpg01 ~ weight + acceleration + year + origin, 
##     family = binomial, data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.17703  -0.07045   0.01414   0.19423   2.25828  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.759e+01  6.701e+00  -4.117 3.84e-05 ***
## weight       -5.859e-03  9.351e-04  -6.266 3.71e-10 ***
## acceleration  2.511e-01  1.204e-01   2.085    0.037 *  
## year          5.180e-01  9.834e-02   5.268 1.38e-07 ***
## origin        3.002e-01  3.547e-01   0.846    0.397    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.830  on 273  degrees of freedom
## Residual deviance:  98.407  on 269  degrees of freedom
## AIC: 108.41
## 
## Number of Fisher Scoring iterations: 8

It appears I was wrong to include origin in the model. Its p-value reveals that there is no statistically significant relationship in the model, so I will remove it and refit.
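
A sketch of the refit without origin, matching the call in the summary below (reusing the glm_11 name is an assumption):

```r
# Drop origin and refit the logistic regression
glm_11 <- glm(mpg01 ~ weight + acceleration + year,
              family = binomial, data = training)
summary(glm_11)
```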

## 
## Call:
## glm(formula = mpg01 ~ weight + acceleration + year, family = binomial, 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.25409  -0.06870   0.01543   0.19712   2.22266  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.656e+01  6.524e+00  -4.071 4.69e-05 ***
## weight       -6.023e-03  9.114e-04  -6.608 3.89e-11 ***
## acceleration  2.416e-01  1.185e-01   2.039   0.0414 *  
## year          5.188e-01  9.833e-02   5.276 1.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.830  on 273  degrees of freedom
## Residual deviance:  99.137  on 270  degrees of freedom
## AIC: 107.14
## 
## Number of Fisher Scoring iterations: 8

Now I will compile a confusion matrix by applying the model fit on the training data to the testing data.
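
A sketch of the test-set evaluation (the 0.5 cutoff and the construction of the accuracy table are assumptions):

```r
# Predict on the test set and classify at a 0.5 cutoff
glm_11_probs <- predict(glm_11, testing, type = "response")
glm_11_preds <- ifelse(glm_11_probs > 0.5, 1, 0)

# Rows: actual mpg01 in the test set; columns: predicted class
conf_mat <- table(testing$mpg01, glm_11_preds)
conf_mat

# Accuracy and error rate
data.frame(accuracy   = sum(diag(conf_mat)) / sum(conf_mat),
           error_rate = 1 - sum(diag(conf_mat)) / sum(conf_mat))
```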

##    glm_11_preds
##      0  1
##   0 55  5
##   1  8 50
##    accuracy error_rate
## 1 0.8898305  0.1101695

It appears the model’s error rate is roughly 11.02%.

##4. Problem 4

  1. Write a function in RMD that calculates the misclassification rate, sensitivity, and specificity. The inputs for this function are a cutoff point, predicted probabilities, and original binary response. Test your function using the model from 4.7.10 b. (Post any questions you might have regarding this on the discussion board, this needs to be an actual function, using the function() command, not just a chunk of code). This will be something you will want to use throughout the semester, since we will be calculating these a lot!
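
Here is a sketch of such a function (the argument and function names are my own; it uses the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP) with class 1 treated as positive, so the exact values returned depend on which class is taken as positive and how the confusion matrix is oriented):

```r
# Misclassification rate, sensitivity, and specificity at a given cutoff
class_metrics <- function(cutoff, probs, actual) {
  preds <- ifelse(probs > cutoff, 1, 0)

  misclass    <- mean(preds != actual)
  sensitivity <- sum(preds == 1 & actual == 1) / sum(actual == 1)
  specificity <- sum(preds == 0 & actual == 0) / sum(actual == 0)

  c(misclass, sensitivity, specificity)
}

# Example: evaluate the model from 4.7.10 b at a 0.5 cutoff
class_metrics(0.5, predict(glm_mod10_1, type = "response"), Weekly$Direction01)
```
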
## [1] 0.4389348 0.4356636 0.4705882

It appears the function is correct when tested against the model from 10 b. The confusion matrix for 10 b is:

##    glm_mod10_1_preds
##       0   1
##   0 430  54
##   1 557  48

So the misclassification rate is (430 + 48) / sum(confusion_matrix) = 0.4389348, the sensitivity is 430 / (430 + 557) = 0.4356636, and the specificity is 48 / (48 + 54) = 0.4705882.

Therefore the values check out.