STATS216 Homework 1

Shijia Bian, Jan 29, 2014

1. For each of parts (a) through (e), indicate whether you would expect a flexible statistical learning method to perform better or worse than an inflexible method. Give reasons in each case.

(a) The number of observations n is extremely large, and the number of predictors p is small.

A flexible model is preferred. With few predictors the observations lie in a low-dimensional space, and the very large n gives us enough data to train a flexible model without overfitting. If we also knew something about the form of the underlying relationship, for example whether it is linear or non-linear, we could choose between inflexible and flexible models with more confidence.

(b) The number of predictors p is extremely large, and the number of observations n is small.

An inflexible model is preferred. We do not have enough observations in the high-dimensional space to train a flexible model, so the priority is to capture the overall trend of the observations. An inflexible model, such as a linear model, meets this need and effectively avoids overfitting the data.

(c) The variance of the error terms, i.e. σ²=Var(ϵ), is extremely high.

An inflexible model is preferred. A flexible model tends to fit everything, including the errors, and this data set has large errors. An inflexible model is therefore a good option for learning the trend and pattern of the data.

(d) The relationship between the predictors and response is highly non-linear, and σ² is small.

A flexible model is preferred. We are told that the relationship is highly non-linear, so a linear model cannot estimate it accurately. Moreover, the variance of the error term is small, so there is little risk that the flexible model ends up fitting the errors. The flexible model still needs to be checked for overfitting.

(e) The relationship between the predictors and response is highly non-linear, and σ² is large.

An inflexible model is preferred. Although the underlying relationship is highly non-linear, a flexible model might mistakenly fit the large errors as well. An inflexible model can do better here because it gives a clearer and more stable picture of the overall trend in the data set.
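The reasoning in parts (c) through (e) can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the true function `sin(2 * pi * x)`, the sample size of 30, the noise levels 0.1 and 2, and the use of a degree-10 polynomial as the "flexible" model are all choices made here, not part of the homework.

```r
# Minimal simulation sketch; every setting below is an illustrative assumption.
set.seed(1)

f <- function(x) sin(2 * pi * x)  # assumed highly non-linear truth

# Simulate one training and one test set, fit both models,
# and return the test MSE of each.
compare_once <- function(n, sigma, degree) {
  train <- data.frame(x = seq(0, 1, length.out = n))
  train$y <- f(train$x) + rnorm(n, sd = sigma)
  test <- data.frame(x = runif(2000))
  test$y <- f(test$x) + rnorm(2000, sd = sigma)
  mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
  c(inflexible = mse(lm(y ~ x, data = train)),
    flexible   = mse(lm(y ~ poly(x, degree), data = train)))
}

# Average the test MSE over repeated simulations to smooth out sampling noise.
compare <- function(n, sigma, degree = 10, reps = 100) {
  rowMeans(replicate(reps, compare_once(n, sigma, degree)))
}

small_noise <- compare(n = 30, sigma = 0.1)  # setting of part (d)
large_noise <- compare(n = 30, sigma = 2)    # setting of part (e)
print(rbind(small_noise, large_noise))
```

With small noise the flexible fit should have the lower average test MSE, while with large noise the extra variance of the flexible fit should outweigh the bias of the straight line.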

2. Explain whether each scenario below is a regression, classification or unsupervised learning problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a supervised regression problem. We are interested in inference. n=500, p=3.

(b) Our website has collected the ratings of 1000 different restaurants by 10,000 customers. Each customer has rated about 100 restaurants, and we would like to recommend restaurants to customers who have not yet been there.

This is a supervised classification problem. We are interested in prediction. Assuming that each customer has rated exactly 100 restaurants, there are 10,000 × 100 = 1,000,000 ratings in total, so each of the 1,000 restaurants receives 1,000 ratings on average. Thus n=1,000 and p=1 for each restaurant. With the 1,000 ratings each restaurant receives, we can classify whether the restaurant should be recommended by choosing a proper threshold.

(c) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a supervised classification problem. We are interested in prediction. n=20, p=13.

(d) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression problem, and we are interested in the prediction. p=3, n=52.

3. In this next question we consider some real-life applications of statistical learning:

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

i. Assume that disease A has four types, and that a set of 10 symptoms might indicate which type a patient has. We are interested in predicting the type of the disease from the symptoms a patient shows. We have records of 10,000 past patients with disease A, including their 10 symptoms. By fitting a classification model to these 10,000 samples, we can predict the type of disease A for a new patient. The response is the type of the disease (four classes). The predictors are the 10 symptoms. We are interested in prediction here.

ii. A home mortgage ends in one of two states: default or non-default. The state of the mortgage is related to four variables: income, education, bank balance and previous credit. We are interested in whether a potential buyer will default in the future. We have 1,000 samples from our past clients; by fitting a classification model to these 1,000 samples, we can predict whether the buyer will default. The response is default/non-default. The predictors are the buyer's income, education, bank balance and previous credit. We are interested in prediction here.

iii. Before shipping, a chipset company needs to examine whether each chipset is defective. There are two outcomes: defective and not defective. The outcome is related to a set of variables measured on the individual chipset. We have the examination results for 10,000 past chipsets together with their values of these variables. We can fit a classification model to predict whether a given chipset is defective. The response is whether the chipset is defective. The predictors are the set of related variables. We are interested in prediction here.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

i. The next-day temperature is related to a set of current-day weather factors, say 5 factors. In order to predict the next-day temperature, we need to build a regression model of the next-day temperature on those current-day variables. We have data from the past 10 years, which gives 3652 observations: n=3652, p=5. The response is the next-day temperature, and the predictors are the current day’s weather variables. The goal of this application is prediction.

ii. A child’s height is related to four factors: mother’s height, father’s height, daily diet and daily exercise. We want to build a regression model of the child’s height on these factors, so that we can examine how strong the association is between the child’s height and the four factors. We have data collected from 1000 children and their parents. Thus, n=1000, p=4. The response is the child’s height. The predictors are the four factors mentioned. The goal of this application is inference.

iii. Graduate GPA is related to a set of factors: undergraduate GPA, GRE score, the hours the student is willing to spend studying in college and the attendance rate in undergraduate school. In order to predict the GPA of a potential candidate for the graduate school, we need to build a regression model of graduate GPA on the four factors. We have collected the related data from 1000 graduate students in this graduate school. The response is the graduate GPA the student obtains. The predictors are the four factors. n=1000, p=4. The goal of this application is prediction.

(c) Describe three real-life applications in which cluster analysis might be useful.

i. Society is divided into low-income, middle and high-income classes. We want to cluster 1,000 people into these 3 classes. The only information we have on each person is their monthly income and education.

ii. According to the shopping behavior and income of the customers, the market wants to cluster the customers into different segments in order to target the right customer group.

iii. The Internet database clusters the Internet users into different social communities by examining the users’ Internet searching history.
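As a rough illustration of the first clustering example, the sketch below runs k-means on simulated income and education data. The data, the group means and the choice of `kmeans()` with three centers are assumptions made for this example only; the homework does not prescribe a clustering method.

```r
set.seed(42)

# Simulated people (hypothetical data): three groups with different
# typical monthly income and years of education.
n_per_group <- 100
people <- data.frame(
  income = c(rnorm(n_per_group, 2000, 300),
             rnorm(n_per_group, 5000, 500),
             rnorm(n_per_group, 12000, 1500)),
  education = c(rnorm(n_per_group, 10, 1.5),
                rnorm(n_per_group, 14, 1.5),
                rnorm(n_per_group, 18, 1.5))
)

# Standardize so that income (large units) does not dominate the distance.
scaled <- scale(people)

# k-means with 3 clusters; nstart restarts guard against poor local optima.
km <- kmeans(scaled, centers = 3, nstart = 20)

# Cluster sizes and the average income/education within each cluster.
print(table(km$cluster))
print(aggregate(people, by = list(cluster = km$cluster), FUN = mean))
```

No class labels are used anywhere in the fit, which is what makes this an unsupervised problem.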

4. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

Before reading the data into R, it can be viewed in Excel or a text editor.

(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE, 
    sep = ",")

(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

rownames(college) = college[, 1]
fix(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college = college[, -1]
fix(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
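The two steps above can be tried on a small toy data frame without the CSV file. The column name `X` and all values below are made up for illustration.

```r
# Toy stand-in for College.csv (hypothetical values), names in the first column.
toy <- data.frame(
  X = c("College A", "College B", "College C"),
  Private = c("Yes", "No", "Yes"),
  Apps = c(1660, 2186, 1428)
)

rownames(toy) <- toy[, 1]  # copy the names into the row names
toy <- toy[, -1]           # then drop the now-redundant name column

print(toy)
print(colnames(toy))

# With a real file, read.csv() can do both steps at once via row.names = 1.
```

After these steps the first data column is `Private`, and the names are kept only as row labels.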

(c)

i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##  Private        Apps           Accept          Enroll       Top10perc   
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.0  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.0  
##            Median : 1558   Median : 1110   Median : 434   Median :23.0  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.6  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.0  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.0  
##    Top25perc      F.Undergrad     P.Undergrad       Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836   Max.   :21700  
##    Room.Board       Books         Personal         PhD       
##  Min.   :1780   Min.   :  96   Min.   : 250   Min.   :  8.0  
##  1st Qu.:3597   1st Qu.: 470   1st Qu.: 850   1st Qu.: 62.0  
##  Median :4200   Median : 500   Median :1200   Median : 75.0  
##  Mean   :4358   Mean   : 549   Mean   :1341   Mean   : 72.7  
##  3rd Qu.:5050   3rd Qu.: 600   3rd Qu.:1700   3rd Qu.: 85.0  
##  Max.   :8124   Max.   :2340   Max.   :6800   Max.   :103.0  
##     Terminal       S.F.Ratio     perc.alumni       Expend     
##  Min.   : 24.0   Min.   : 2.5   Min.   : 0.0   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.5   1st Qu.:13.0   1st Qu.: 6751  
##  Median : 82.0   Median :13.6   Median :21.0   Median : 8377  
##  Mean   : 79.7   Mean   :14.1   Mean   :22.7   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.5   3rd Qu.:31.0   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.8   Max.   :64.0   Max.   :56233  
##    Grad.Rate    
##  Min.   : 10.0  
##  1st Qu.: 53.0  
##  Median : 65.0  
##  Mean   : 65.5  
##  3rd Qu.: 78.0  
##  Max.   :118.0

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first 10 columns of a matrix A using A[,1:10].

pairs(college[, 1:10])

[Figure: scatterplot matrix of the first ten columns of college]

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

boxplot(college$Outstate ~ college$Private, col = c("blue", "green"), main = "Outstate versus Private", 
    xlab = "Private", ylab = "Outstate")

[Figure: side-by-side boxplots of Outstate by Private]

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
fix(college)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

There are 78 elite universities and 699 non-elite universities.

summary(college$Elite)

##  No Yes 
## 699  78

boxplot(college$Outstate ~ college$Elite, col = c("blue", "green"), main = "Outstate versus Elite", 
    xlab = "Elite", ylab = "Outstate")

[Figure: side-by-side boxplots of Outstate by Elite]
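The binning step can also be written in one line with `ifelse()`. This is a sketch on a short made-up vector standing in for `college$Top10perc`, not a replacement for the code above.

```r
# Hypothetical Top10perc values standing in for college$Top10perc.
top10perc <- c(23, 16, 60, 89, 50, 51, 12)

# One-step equivalent of rep("No", n) followed by indexed assignment.
# Note that a value of exactly 50 does not exceed 50, so it stays "No".
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

print(table(elite))
```

Setting `levels` explicitly keeps the order `No`, `Yes` even if one of the groups happens to be empty.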

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

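The plot shows histograms of Accept, Enroll and Top10perc. The blue histograms use breaks = 6 and the green histograms use breaks = 10. R treats a single number passed as breaks as a suggestion only, so the actual number of bins can differ from the requested value.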

par(mfcol = c(2, 3))
# Accept, Enroll and Top10perc, each with breaks = 6 (blue) and breaks = 10 (green)
hist(college$Accept, breaks = 6, freq = TRUE, col = "blue", main = "Histogram", 
    xlab = "Accept", ylab = "Value")
hist(college$Accept, breaks = 10, freq = TRUE, col = "green", main = "Histogram", 
    xlab = "Accept", ylab = "Value")
hist(college$Enroll, breaks = 6, freq = TRUE, col = "blue", main = "Histogram", 
    xlab = "Enroll", ylab = "Value")
hist(college$Enroll, breaks = 10, freq = TRUE, col = "green", main = "Histogram", 
    xlab = "Enroll", ylab = "Value")
hist(college$Top10perc, breaks = 6, freq = TRUE, col = "blue", main = "Histogram", 
    xlab = "Top10perc", ylab = "Value")
hist(college$Top10perc, breaks = 10, freq = TRUE, col = "green", main = "Histogram", 
    xlab = "Top10perc", ylab = "Value")

[Figure: histograms of Accept, Enroll and Top10perc, blue with breaks = 6 and green with breaks = 10]
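When an exact number of bins is wanted, one option is to pass a vector of break points instead of a single number. The sketch below uses simulated right-skewed data as a stand-in for a variable such as `Accept`; the data and variable names are assumptions for illustration.

```r
set.seed(7)
x <- rexp(500, rate = 1 / 2000)  # hypothetical right-skewed counts

# A single number is only a suggestion; R rounds to "pretty" break points.
h_suggested <- hist(x, breaks = 6, plot = FALSE)

# A vector of break points is used exactly: 6 points give 5 bins.
h_exact <- hist(x, breaks = seq(min(x), max(x), length.out = 6), plot = FALSE)

cat("bins with breaks = 6:      ", length(h_suggested$counts), "\n")
cat("bins with explicit breaks: ", length(h_exact$counts), "\n")
```

Using `plot = FALSE` returns the histogram object without drawing it, which makes it easy to inspect the bin counts.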

5. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set of approximately equal size.

The training set has 389 samples and the test set has 388 samples.

library(ISLR)
set.seed(13435)
train <- sample(777, 389)
train

##   [1] 234 449 501 283 543 339 252  13 337 708 635 369 371 294 205  39 723
##  [18] 408 648 197 435 384 728 256 520 748 395 705 222 201 549 133   6 189
##  [35] 398 365 507 319 622 439 521  25 463 357 122 242 127  54 280 220 636
##  [52] 564 140 444 117  45 453 258  98 239 112  42 522  64 495 250 652 674
##  [69] 488 372 659 204 301 771 490 773 733 586 246 164 377 594 396 174 231
##  [86]  87 305 706 487 494  22 380 178 472 148 169 138 370 108 762 415 691
## [103] 221 505 525 128 227 461 673 207  34 428 747 251   8 599 509 406 598
## [120] 410 551 595 757 470 208 191 467 393 725 361  18 343 665 623 775 362
## [137] 209 424  29  69 581 523 263 675 755  51 532 639 479 538 268 625 658
## [154] 206 405  47 695 617 620 694 486 188 331 456 276 310 644 304 344 223
## [171] 226  36 468 130 631 137 769 143   9 669 560  67 160 438 368 132 297
## [188] 565 163 524 173 394  48 754 386 249 571 345  57 713 508 306 271 264
## [205]  17 419  89 611 632 296 325 473 259 101 290 504 535 247  75 716 645
## [222] 257 216 416 376 458 619 340 317 455 528 554 483 327  74 709 654 135
## [239] 392 228 761  71 281  20 477 526 664 350 284 724 185 766 265  59 261
## [256]  52 182  68 735 142 701 518 666 736 626 577 556 153 237 704  27 110
## [273] 333 621 177 722 500 425 759 196 230  16 184 548  10 443 506 603 573
## [290] 155 224 726 531 740 537 176 503 480  65 124 316 383 422 111 742 389
## [307] 100 676 597 329  24 572 321  21  53 576 379 318 616 745 634 421 457
## [324] 749 373 437 703 336 442 253 285 600 154 299   1  49 167  15 462 719
## [341] 346 433 198 641 170 233  73 758 566 245 624   3 409 382 404 238 190
## [358] 266  94 649 193 485 657 347   4 269  61 777  83 464  32 313  85 278
## [375] 562 375 332 618 605 552 693 553 540 550 559 670 499 568 195

college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE, 
    sep = ",")
rownames(college) = college[, 1]
college = college[, -1]
dim(college)

## [1] 777  18

college <- college[, -c(3, 4)]
attach(college)
dim(college)

## [1] 777  16

trainCollege = college[train, ]
testCollege = college[-train, ]
attach(trainCollege)

## The following objects are masked from college:
## 
##     Apps, Books, Expend, F.Undergrad, Grad.Rate, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc

attach(testCollege)

## The following objects are masked from trainCollege:
## 
##     Apps, Books, Expend, F.Undergrad, Grad.Rate, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc
## The following objects are masked from college:
## 
##     Apps, Books, Expend, F.Undergrad, Grad.Rate, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc

(b) Fit a linear model using least squares on the training set, and report the training and test error obtained. Do not include the Elite predictor, or the Accept or Enroll predictors in the regression.

The mean squared error on the training set is 2096944, and the mean squared error on the test set is 5255540. (To reproduce these values, restart R and run only the code for 5(a) and 5(b). Otherwise objects left over from other code in the session, such as a redefined train or college, can change the result.)

lm.fit <- lm(Apps ~ ., data = trainCollege)
# train MSE
mseTrain <- mean((trainCollege$Apps - predict(lm.fit, trainCollege))^2)
# test MSE
mseTest <- mean((testCollege$Apps - predict(lm.fit, testCollege))^2)
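The same train/test workflow can be sketched without `attach()` by always passing the data frame explicitly. The example below uses simulated data in place of the College data; the helper name `mse()` and the variables `x1`, `x2` and `y` are hypothetical.

```r
# Small helper; the name mse() is an illustrative choice, not part of base R.
mse <- function(actual, predicted) mean((actual - predicted)^2)

set.seed(13435)
# Simulated stand-in for the College data (hypothetical variables x1, x2, y).
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

# Split the rows into two halves.
train_idx <- sample(n, n / 2)
train_set <- sim[train_idx, ]
test_set <- sim[-train_idx, ]

# Fit on the training half only.
fit <- lm(y ~ ., data = train_set)

mse_train <- mse(train_set$y, predict(fit, train_set))
mse_test <- mse(test_set$y, predict(fit, test_set))

# The square root puts the error back on the scale of the response.
cat("train MSE:", mse_train, " RMSE:", sqrt(mse_train), "\n")
cat("test MSE: ", mse_test, " RMSE:", sqrt(mse_test), "\n")
```

Because nothing is attached, rerunning the block in the same session cannot pick up a stale copy of a variable.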

(c) Comment on the results obtained. How accurately can we predict the number of college applications received? What are the most important predictors?

The least squares model does not predict very well when we regress the number of applications on all the other variables. The training MSE is very large, a little more than 2 million, and the test MSE is larger still, a little more than 5 million. Taking square roots, the typical prediction error is roughly 1,450 applications on the training set and 2,290 on the test set, which is large compared with the median of 1,558 applications reported by `summary(college)`. Thus, this is not a very good model.

According to the output of `summary()` below, `F.Undergrad`, `Room.Board`, `Grad.Rate`, `Expend` and `(Intercept)` have p-values below 0.001. `F.Undergrad` is the most important predictor, with a p-value below 2e-16. `PrivateYes`, `perc.alumni` and `Top10perc` are also significant at the 0.01 level, with p-values of about 0.0024, 0.0028 and 0.0050. `Outstate` is not significant in this fit (p-value 0.149). We do not care much about `(Intercept)`.

summary(lm.fit)

## 
## Call:
## lm(formula = Apps ~ ., data = trainCollege)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4665   -724    -48    489   7095 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.57e+03   7.69e+02   -3.34  0.00091 ***
## PrivateYes  -8.84e+02   2.89e+02   -3.06  0.00237 ** 
## Top10perc    3.17e+01   1.12e+01    2.82  0.00499 ** 
## Top25perc   -1.17e+01   9.22e+00   -1.27  0.20574    
## F.Undergrad  5.95e-01   2.58e-02   23.03  < 2e-16 ***
## P.Undergrad -2.48e-02   5.86e-02   -0.42  0.67275    
## Outstate     5.42e-02   3.75e-02    1.45  0.14914    
## Room.Board   4.26e-01   9.56e-02    4.45  1.1e-05 ***
## Books       -6.34e-02   4.34e-01   -0.15  0.88400    
## Personal    -1.49e-01   1.27e-01   -1.18  0.24014    
## PhD          7.43e+00   9.49e+00    0.78  0.43394    
## Terminal    -1.12e+01   9.91e+00   -1.13  0.25852    
## S.F.Ratio    1.82e+01   2.50e+01    0.73  0.46703    
## perc.alumni -2.45e+01   8.14e+00   -3.01  0.00277 ** 
## Expend       8.06e-02   2.40e-02    3.36  0.00086 ***
## Grad.Rate    2.31e+01   5.92e+00    3.90  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1480 on 373 degrees of freedom
## Multiple R-squared:  0.823,  Adjusted R-squared:  0.815 
## F-statistic:  115 on 15 and 373 DF,  p-value: <2e-16
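Reading the most significant predictors off a long coefficient table is error-prone, so it can help to sort the table by p-value. The sketch below does this on simulated data; the variables `strong`, `weak` and `noise` are hypothetical, and ranking by p-value is only one rough way to judge importance.

```r
set.seed(2014)
# Simulated data (hypothetical): `strong` drives the response far more than `weak`.
n <- 200
d <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$strong + 0.2 * d$weak + rnorm(n)

fit <- lm(y ~ ., data = d)

# coef(summary(fit)) is a matrix with one row per term and columns
# Estimate, Std. Error, t value and Pr(>|t|).
coefs <- coef(summary(fit))
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

# Sort terms from smallest to largest p-value.
ranked <- coefs[order(coefs[, "Pr(>|t|)"]), , drop = FALSE]
print(round(ranked, 4))
```

Applied to a fitted model such as `lm.fit`, the same three lines would list its terms in order of significance.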

6. Using the same setup as in the previous question, form a new outcome variable Y which equals one if the number of applications is greater than or equal to the overall median and zero otherwise. Fit a logistic regression model to Y and report the training and test misclassification rates, and the most important predictors. As above, do not include the Elite predictor, or the Accept or Enroll predictors in the regression. Compare the results of this analysis to that of the linear regression approach in the previous question.

The function `contrasts()` shows that R has created a dummy variable that equals 1 for `zero`, so the fitted probabilities are probabilities of the class `zero`. For the training set, the misclassification rate is $latex \frac{14+15}{389}=0.07455013$. For the test set, the misclassification rate is $latex \frac{11+26}{388}=0.09536082$. Although the misclassification rate is somewhat higher on the test set than on the training set, the accuracy is high on both. In this sense the classification setting works better than the linear regression approach of the previous question, although a misclassification rate and an MSE are measured on different scales and cannot be compared directly.

Here, the most important predictors are `F.Undergrad` and `(Intercept)`, but we do not care much about the `(Intercept)`.

set.seed(12363235)
# sample the training set first (glm() is in the stats package, which ships with R)
train <- sample(777, 389)
train

##   [1] 615  86 591 758 128 646 426 358 116 647 493  72 750  28 282  25 185
##  [18] 346 302 539 313  30 139 482 415  83 614 135 729  34 118 262 744 182
##  [35] 682 150 690 298 638 146 714 461 559 438 251 473  92 325  87 551 630
##  [52] 305 324 718 239 151 299 409 607 740 330 348 737 280 129 509 284 694
##  [69]   9 707 575 701 384 334 326 121 596 199 616 417 697  81 212 526 765
##  [86] 502 688  96 106 480 586 470 712 498  12 166 443  42 652 250 241 593
## [103] 161 456  37 386 468 181 440 692 171  54 735 510 567 528  70 479 233
## [120]  67 297 144 643 452 332 300 276 767 410  89 114 268 382 100 354  60
## [137]  10  11 254 208  73 505  53 418 172 626 734 704 390 495   2 308 776
## [154]  27 469  94 589 472   6 353 474 530 273 543 252 213 546 477 140 565
## [171] 115  16 419 602 743 466 111 196 668  43 387 709 573 383 211 587 403
## [188] 631 487 739 768 501 155  26 751 537 519  82 237  33 320 434 259 191
## [205]  22 201 376 636 671 220 515 731 404 331 703 547 637  68 362 293 691
## [222]  51 514 229 613 603 689 433 307 656 566 649  98 342 420 629 363 580
## [239] 407 206 571 742 352 650 521 296 448 677 711 414 157 588 716  78 103
## [256] 680  24 564 119 338 594 570 713 356 549  18 507 664 655 402 635 327
## [273] 361 113 777 459 560  29 162 455 535 371 717 683 406 625 175 274 632
## [290] 464 368 578  59 203 489 427 708 496 598 416 722 266 618 741 601 599
## [307] 511 715 279 525 221 281 209 318 538 733 223 441 193  31 413 460 585
## [324] 314 265 670 544 143  47 195 494 761 553 148 752 101 679 475 524 366
## [341] 339 595 513  57 772 552  88 244 303 555 388 654 125 436  32  23 306
## [358] 397 699 264 666 491 458 312 645 542 336 536 202 529 422 429 263 432
## [375] 608 659 774 492 190 160 532  38  77 173 133 311 188 393 243

# prepare the data frame
college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE, 
    sep = ",")
rownames(college) = college[, 1]
college = college[, -1]
dim(college)

## [1] 777  18

# delete the predictors we are not using
college <- college[, -c(3, 4)]
medApp <- median(college$Apps)
newY = rep("one", 777)
newY[college$Apps < medApp] = "zero"
newY = factor(newY)
college <- data.frame(college, newY)
college = college[, -2]
attach(college)

## The following object is masked _by_ .GlobalEnv:
## 
##     newY
## The following objects are masked from testCollege:
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from trainCollege:
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from college (position 5):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc

trainCollege = college[train, ]
testCollege = college[-train, ]
attach(trainCollege)

## The following object is masked _by_ .GlobalEnv:
## 
##     newY
## The following objects are masked from college (position 3):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, newY, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc
## The following objects are masked from testCollege:
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from trainCollege (position 5):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from college (position 6):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc

attach(testCollege)

## The following object is masked _by_ .GlobalEnv:
## 
##     newY
## The following objects are masked from trainCollege (position 3):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, newY, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc
## The following objects are masked from college (position 4):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, newY, Outstate,
##     P.Undergrad, perc.alumni, Personal, PhD, Private, Room.Board,
##     S.F.Ratio, Terminal, Top10perc, Top25perc
## The following objects are masked from testCollege (position 5):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from trainCollege (position 6):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc
## The following objects are masked from college (position 7):
## 
##     Books, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad,
##     perc.alumni, Personal, PhD, Private, Room.Board, S.F.Ratio,
##     Terminal, Top10perc, Top25perc


glm.fit <- glm(formula = newY ~ ., family = binomial, data = trainCollege)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# training set: predicted probabilities of class "zero" and the resulting labels
glm.probs.train = predict(glm.fit, type = "response")
glm.pred = rep("one", 389)
glm.pred[glm.probs.train > 0.5] = "zero"
contrasts(trainCollege$newY)

##      zero
## one     0
## zero    1

# training set
table(glm.pred, trainCollege$newY)

##         
## glm.pred one zero
##     one  174   14
##     zero  15  186

(table(glm.pred, trainCollege$newY)[1, 2] + table(glm.pred, trainCollege$newY)[2, 
    1])/389

## [1] 0.07455


# test set
glm.probs.test = predict(glm.fit, newdata = testCollege, type = "response")
glm.pred.test = rep("one", 388)
glm.pred.test[glm.probs.test > 0.5] = "zero"
contrasts(testCollege$newY)

##      zero
## one     0
## zero    1

# test set confusion matrix
table(glm.pred.test, testCollege$newY)

##              
## glm.pred.test one zero
##          one  174   11
##          zero  26  177

(table(glm.pred.test, testCollege$newY)[1, 2] + table(glm.pred.test, testCollege$newY)[2, 
    1])/388

## [1] 0.09536
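The two misclassification rates can be recomputed directly from the confusion matrices printed above. The counts below are copied from that output; the helper name `misclass_rate()` is an illustrative choice.

```r
# Confusion matrices copied from the output above
# (rows = predicted class, columns = true class).
train_tab <- matrix(c(174, 15, 14, 186), nrow = 2,
                    dimnames = list(pred = c("one", "zero"),
                                    truth = c("one", "zero")))
test_tab <- matrix(c(174, 26, 11, 177), nrow = 2,
                   dimnames = list(pred = c("one", "zero"),
                                   truth = c("one", "zero")))

# Misclassification rate = off-diagonal count / total count.
misclass_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)

cat("train:", misclass_rate(train_tab), "\n")
cat("test: ", misclass_rate(test_tab), "\n")
```

Working from the diagonal avoids indexing individual cells such as `[1, 2]` and `[2, 1]`, and the same helper works for more than two classes.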