1. Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and produce estimated coefficients \(\hat{\beta}_0 = -6\), \(\hat{\beta}_1 = 0.05\), \(\hat{\beta}_2 = 1\).

(a). Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

Answer: We predict the probability using the logistic model \(\hat{p}(X)=\dfrac{e^{\hat{\beta}_0+\hat{\beta}_1X_1+\hat{\beta}_2X_2}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X_1+\hat{\beta}_2X_2}}\).

We have \(\hat{\beta}_0=-6\), \(\hat{\beta}_1=0.05\), \(\hat{\beta}_2=1\), \(X_1=40\), and \(X_2=3.5\). Substituting these values, \(\hat{p}(X)=\dfrac{e^{-6+0.05\cdot 40+1\cdot 3.5}}{1+e^{-6+0.05\cdot 40+1\cdot 3.5}}=\dfrac{e^{-0.5}}{1+e^{-0.5}}\approx 0.3775\).

The probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class is approximately 0.3775.

(b). How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

Setting \(\hat{p}(X)=0.5\) gives \(0.5=\dfrac{e^{-6+0.05X_1+3.5}}{1+e^{-6+0.05X_1+3.5}}\). Cross-multiplying, \(0.5\left(1+e^{-6+0.05X_1+3.5}\right)=e^{-6+0.05X_1+3.5}\), which simplifies to \(e^{-6+0.05X_1+3.5}=1\).

Taking the log of both sides and using \(\log(1)=0\), we get \(0.05X_1-2.5=0\), so \(X_1=2.5/0.05=50\).

Therefore, the student in part (a) would need to study 50 hours to have a 50% chance of getting an A in the class.
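As a quick sanity check, both calculations can be reproduced in R. This is a minimal sketch; the variable names b0, b1, b2 are chosen here for illustration, and plogis() is base R's logistic CDF.

## estimated coefficients from the fitted model
b0 <- -6; b1 <- 0.05; b2 <- 1
## (a) predicted probability for 40 hours of study and a GPA of 3.5
plogis(b0 + b1 * 40 + b2 * 3.5)
## (b) a 50% chance requires the log-odds to be zero: b0 + b1*X1 + b2*3.5 = 0
(0 - b0 - b2 * 3.5) / b1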

2. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

With logistic regression, the error rate is 20% on the training data and 30% on the test data, so the average error rate (averaged over the training and test sets) is \((20+30)/2=25\%\), and the test error rate is 30%.

In the case of KNN with K = 1, the training error rate is 0%: each training point is its own nearest neighbor, so \(P(Y=j \mid X=x_i)=I(y_i=j)\), which equals 1 if \(y_i=j\) and 0 otherwise, and no training observation is misclassified. Since the reported 18% is the average of the training and test error rates, the test error rate must be \(2\times 18\%=36\%\), which is greater than the 30% test error rate of logistic regression.

So, in this case, it is better to select logistic regression because of its lower test error rate.
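The 0% training error of 1-NN is easy to verify empirically. The sketch below (using the built-in iris data purely for illustration) predicts the training set back with class::knn and K = 1.

library(class)
## with K = 1 each training point is its own nearest neighbor,
## so predicting the training set back yields a 0% training error
train_x <- iris[, 1:4]
pred <- knn(train = train_x, test = train_x, cl = iris$Species, k = 1)
mean(pred != iris$Species)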

3. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a). Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

Answer:

library(ISLR)
data("Auto")
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name         mpg01    
##  amc matador       :  5   Min.   :0.0  
##  ford pinto        :  5   1st Qu.:0.0  
##  toyota corolla    :  5   Median :0.5  
##  amc gremlin       :  4   Mean   :0.5  
##  amc hornet        :  4   3rd Qu.:1.0  
##  chevrolet chevette:  4   Max.   :1.0  
##  (Other)           :365

(b). Explore the data graphically in order to investigate the association between “mpg01” and the other features. Which of the other features seem most likely to be useful in predicting “mpg01”? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

Correlation

cor(Auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
library(corrplot)
corrplot::corrplot.mixed(cor(Auto[, -9]), upper="circle")

Scatterplot matrix

pairs(Auto[, -9])

Boxplots

par(mfrow=c(2,3))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")

The boxplots indicate an association between “mpg01” and “cylinders”, “weight”, “displacement”, and “horsepower”: cars with above-median mpg tend to have fewer cylinders, lower weight, smaller displacement, and lower horsepower, consistent with the negative correlations seen above.
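This can also be checked numerically by comparing the group means of the candidate predictors across the two mpg01 classes; the snippet below is a small supplementary check using base R's aggregate().

## mean of each candidate predictor within each mpg01 class;
## large gaps between the two groups suggest useful predictors
aggregate(cbind(cylinders, displacement, horsepower, weight) ~ mpg01,
          data = Auto, FUN = mean)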

(c). Split the data into a training set and a test set.

set.seed(123)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data<- Auto[train, ]
testing_data <- Auto[test, ]
mpg01.test <- mpg01[test]

(d). Perform LDA on the training data in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What is the test error of the model obtained?

library(MASS)
lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4817518 0.5182482 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.795455 3659.008     274.3333  130.88636
## 1  4.218310 2343.599     117.0528   79.71127
## 
## Coefficients of linear discriminants:
##                        LD1
## cylinders    -0.4483062756
## weight       -0.0012201696
## displacement  0.0001078142
## horsepower    0.0046253024
lda_pred = predict(lda_model, testing_data)
names(lda_pred)
## [1] "class"     "posterior" "x"
## compute the confusion matrix from the predicted classes
pred.lda <- lda_pred
table(pred.lda$class, mpg01.test)
##    mpg01.test
##      0  1
##   0 53  3
##   1 11 51
mean(pred.lda$class != mpg01.test)
## [1] 0.1186441

We conclude that the LDA model has a test error rate of 11.86%.

(e). Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda_model
## Call:
## qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data = training_data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4817518 0.5182482 
## 
## Group means:
##   cylinders horsepower   weight acceleration
## 0  6.795455  130.88636 3659.008     14.65303
## 1  4.218310   79.71127 2343.599     16.46620
# compute the confusion matrix
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)
##          
## qda.class  0  1
##         0 56  3
##         1  8 51
mean(qda.class != testing_data$mpg01)
## [1] 0.09322034

We may conclude that the QDA model has a test error rate of 9.32%.

(f). Perform logistic regression on the training data in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What is the test error of the model obtained?

glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
summary(glm_model)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
##     family = binomial, data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4870  -0.1763   0.1231   0.3633   3.2024  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  11.6661375  1.9885822   5.867 4.45e-09 ***
## cylinders     0.0365940  0.3817545   0.096  0.92363    
## weight       -0.0023706  0.0008319  -2.850  0.00438 ** 
## displacement -0.0106969  0.0096956  -1.103  0.26991    
## horsepower   -0.0327332  0.0154688  -2.116  0.03434 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.48  on 273  degrees of freedom
## Residual deviance: 147.50  on 269  degrees of freedom
## AIC: 157.5
## 
## Number of Fisher Scoring iterations: 7
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)
##         mpg01.test
## pred.glm  0  1
##        0 54  3
##        1 10 51
mean(pred.glm != mpg01.test)
## [1] 0.1101695

The test error rate for logistic regression is about 11.02%.

(g). Perform KNN on the training data, with several values of K, in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
str(Auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : num  0 0 0 0 0 0 0 0 0 0 ...
## standardize the predictors (drop the name factor, col 9, and mpg01, col 10)
data = scale(Auto[,-c(9,10)])
set.seed(1234)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data = data[train, c("cylinders", "horsepower", "weight", "acceleration")]
testing_data = data[test, c("cylinders", "horsepower", "weight", "acceleration")]
## KNN takes the training response variable separately
train.mpg01 = Auto$mpg01[train]

## we also need the testing response separately for assessing the model later on
test.mpg01 = Auto$mpg01[test]
library(class)
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.mpg01, k = 1)
table(knn_pred_y, test.mpg01)
##           test.mpg01
## knn_pred_y  0  1
##          0 50  4
##          1  6 58
mean(knn_pred_y != test.mpg01)
## [1] 0.08474576

The test error rate is 8.47% for K = 1.

Using a for loop to find the optimal value of K:

error_rate = NULL
for(i in 1:dim(testing_data)[1]){
  set.seed(1234)
  knn_pred_y = knn(training_data, testing_data, train.mpg01, k = i)
  error_rate[i] = mean(test.mpg01 != knn_pred_y)
}

### find the minimum error rate
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.06779661
### get the index of that error rate, which is the k
K = which(error_rate == min_error_rate)
print(K)
## [1] 4

When we train a KNN model with K = 4, we get the lowest test error rate, about 6.78%.

library(ggplot2)
qplot(1:dim(testing_data)[1], error_rate,
      xlab = "K", ylab = "Error Rate",
      geom = c("point", "line"))

4. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
library(MASS)
data("Boston")
crim01 <- rep(0, length(Boston$crim))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crim01)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv           crim01   
##  Min.   : 1.73   Min.   : 5.00   Min.   :0.0  
##  1st Qu.: 6.95   1st Qu.:17.02   1st Qu.:0.0  
##  Median :11.36   Median :21.20   Median :0.5  
##  Mean   :12.65   Mean   :22.53   Mean   :0.5  
##  3rd Qu.:16.95   3rd Qu.:25.00   3rd Qu.:1.0  
##  Max.   :37.97   Max.   :50.00   Max.   :1.0
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]

Logistic regression

fit.glm1 <- glm(crim01 ~ . - crim01 - crim, data = Boston, family = binomial)
fit.glm1
## 
## Call:  glm(formula = crim01 ~ . - crim01 - crim, family = binomial, 
##     data = Boston)
## 
## Coefficients:
## (Intercept)           zn        indus         chas          nox  
##  -34.103704    -0.079918    -0.059389     0.785327    48.523782  
##          rm          age          dis          rad          tax  
##   -0.425596     0.022172     0.691400     0.656465    -0.006412  
##     ptratio        black        lstat         medv  
##    0.368716    -0.013524     0.043862     0.167130  
## 
## Degrees of Freedom: 505 Total (i.e. Null);  492 Residual
## Null Deviance:       701.5 
## Residual Deviance: 211.9     AIC: 239.9
library(corrplot)
corrplot::corrplot.mixed(cor(Boston[, -1]), upper="circle")

fit.glm <- glm(crim01 ~ nox + indus + age + rad, data = Boston.train, family = binomial)
probs <- predict(fit.glm, Boston.test, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, crim01.test)
##         crim01.test
## pred.glm  0  1
##        0 65 15
##        1  4 68
mean(pred.glm != crim01.test)
## [1] 0.125

For logistic regression, we have a test error rate of 12.5%.

LDA model

fit.lda <- lda(crim01 ~ nox + indus + age + rad, data = Boston.train)
pred.lda <- predict(fit.lda, Boston.test)
table(pred.lda$class, crim01.test)
##    crim01.test
##      0  1
##   0 66 20
##   1  3 63
mean(pred.lda$class != crim01.test)
## [1] 0.1513158

For the LDA model, we have a test error rate of 15.1%.

KNN classification

## standardize the predictors (drop crim, col 1, and crim01, col 15)
data = scale(Boston[,-c(1,15)])
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
training_data = data[train, c("nox", "indus", "age", "rad")]
testing_data = data[test, c("nox", "indus", "age", "rad")]
## KNN takes the training response variable separately
train.crime01 = Boston$crim01[train]
## we also need the testing response separately for assessing the model later on
test.crime01 = Boston$crim01[test]
library(class)
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.crime01, k = 1)
table(knn_pred_y, test.crime01)
##           test.crime01
## knn_pred_y  0  1
##          0 62  7
##          1  7 76
mean(knn_pred_y != test.crime01)
## [1] 0.09210526

For KNN with K = 1, we have a test error rate of 9.21%.

error_rate = NULL
for(i in 1:dim(testing_data)[1]){
  set.seed(1234)
  knn_pred_y = knn(training_data, testing_data, train.crime01, k = i)
  error_rate[i] = mean(test.crime01 != knn_pred_y)
}

### find the minimum error rate
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.06578947
### get the index of that error rate, which is the k
K = which(error_rate == min_error_rate)
print(K)
## [1] 3

The minimum test error rate is 6.58%, at K = 3.

library(ggplot2)
qplot(1:dim(testing_data)[1], error_rate,
      xlab = "K", ylab = "Error Rate",
      geom = c("point", "line"))
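To summarize the findings, the test error rates reported above can be collected into a small comparison table (the values below are simply transcribed from the runs shown in this document):

results <- data.frame(
  method = c("Logistic regression", "LDA", "KNN (K = 1)", "KNN (K = 3)"),
  test_error = c(0.125, 0.1513, 0.0921, 0.0658)
)
results

On this split, KNN with K = 3 achieves the lowest test error rate, followed by KNN with K = 1, logistic regression, and LDA.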