1. For each of parts (a) through (e), indicate whether you would expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method. Give reasons in each case.
(a) The number of observations n is extremely large, and the number of predictors p is small.
A flexible model is generally preferred. With very many observations in a low-dimensional predictor space, the data cover that space densely, so we have enough data to train a flexible model without overfitting. If we also knew more about the form of the underlying relationship, such as whether it is linear or non-linear, we could choose between the inflexible and flexible models with more confidence.
(b) The number of predictors p is extremely large, and the number of observations n is small.
An inflexible model is preferred. With few observations spread over a high-dimensional space, there is not enough data to train a flexible model reliably. What matters most is recovering the overall trend of the observations, and an inflexible method such as a linear model does this while avoiding overfitting.
(c) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.
An inflexible model is preferred. A flexible method tends to fit everything in the training data, including the errors, and in this data set the errors are large. An inflexible model is therefore the better option for learning the trend and overall pattern of the data.
(d) The relationship between the predictors and response is highly non-linear, and σ² is small.
A flexible model is preferred. We are told that the relationship is highly non-linear, so a linear model cannot estimate it accurately. Moreover, the variance of the error term is small, so there is little risk that the flexible model ends up fitting the errors. The flexibility should still be controlled to guard against overfitting.
(e) The relationship between the predictors and response is highly non-linear, and σ² is large.
An inflexible model is preferred. Although the underlying relationship is highly non-linear, a flexible model would also fit the large errors. An inflexible model is likely to do better here because it gives a clearer and more stable picture of the overall pattern in the data set.
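The trade-off described in parts (c) to (e) can be explored with a small simulation. This is only a sketch under assumptions that are not part of the exercise: the true function `f`, the sample size, the two noise levels and the polynomial degrees are all made up for illustration. It compares the test MSE of a straight-line fit (inflexible) with that of a degree-8 polynomial (flexible).

```r
# Hypothetical simulation: inflexible (degree 1) vs flexible (degree 8) fits
# on data generated from an assumed non-linear truth, at two noise levels.
set.seed(1)
f <- function(x) sin(2 * pi * x)  # assumed true relationship (illustrative)

test_mse <- function(n, sigma, degree) {
  train <- data.frame(x = runif(n))
  train$y <- f(train$x) + rnorm(n, sd = sigma)
  test <- data.frame(x = runif(1000))
  test$y <- f(test$x) + rnorm(1000, sd = sigma)
  fit <- lm(y ~ poly(x, degree), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
}

# One row per (noise level, flexibility) combination
results <- expand.grid(sigma = c(0.1, 2), degree = c(1, 8))
results$mse <- mapply(function(s, d) test_mse(n = 50, sigma = s, degree = d),
                      results$sigma, results$degree)
results
```

With small noise the flexible fit should have a much lower test MSE, because the straight line cannot follow the sine curve. With large noise the gap typically narrows or reverses, although a single simulated data set can go either way.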
2. Explain whether each scenario below is a regression, classification or unsupervised learning problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a supervised regression problem. We are interested in inference. n=500, p=3.
(b) Our website has collected the ratings of 1000 different restaurants by 10,000 customers. Each customer has rated about 100 restaurants, and we would like to recommend restaurants to customers who have not yet been there.
This is a supervised classification problem. We are interested in prediction. Assume that each customer has rated exactly 100 restaurants; then each restaurant receives on average 10,000 × 100 / 1,000 = 1,000 ratings. Thus n=1,000 and p=1 for each restaurant. With the 1,000 ratings a restaurant receives, we can classify whether it should be recommended or not by choosing a suitable threshold.
(c) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a supervised classification learning problem. We are interested in the prediction. p=13, n=20
(d) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression problem, and we are interested in the prediction. p=3, n=52.
3. In this next question we consider some real-life applications of statistical learning:
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
i. Suppose that disease A has four types, and that a set of 10 symptoms may indicate which type a patient has. We want to predict a patient's type of the disease from their symptoms. We have records of 10,000 past patients with disease A, including their 10 symptoms. By training a classifier on the 10,000 samples, we can predict the type of disease A a new patient has. The response is the type of the disease (four classes). The predictors are the 10 symptoms. We are interested in prediction here.
ii. A home mortgage ends in one of two states: default or non-default. The state of the mortgage is related to four variables: income, education, bank balance and credit history. We want to know whether a potential buyer will default in the future. We have 1,000 samples from past clients; by fitting a classifier to these 1,000 samples we can predict whether the buyer will default or not. The response is default/non-default. The predictors are the buyer's income, education, bank balance and credit history. We are interested in prediction here.
iii. Before shipping, a chipset company needs to check whether each chipset is defective. There are two outcomes: defective and not defective. The outcome is related to a set of variables measured on the individual chipset. We have the test results for 10,000 past chipsets together with these variables, so we can train a classifier to predict whether a given chipset is defective. The response is whether the chipset is defective or not. The predictors are the set of related variables. We are interested in prediction here.
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
i. The next-day temperature is related to a set of current-day weather factors, say 5 factors. To predict the next-day temperature, we build a regression model of the next-day temperature on the current day’s variables. We have data from the past 10 years, which gives 3652 observations: n=3652, p=5. The response is the next-day temperature, and the predictors are the current day’s weather variables. The goal of this application is prediction.
ii. A child's height is related to four factors: mother’s height, father’s height, daily diet and daily exercise. We want to build a regression model of the child's height on these factors, so that we can examine how strong the association is between the child's height and each of the four factors. We have data collected from 1000 children and their parents. Thus, n=1000, p=4. The response is the child's height. The predictors are the four factors mentioned. The goal of this application is inference.
iii. Graduate GPA is related to a set of factors: undergraduate GPA, GRE score, the hours the student is willing to spend studying, and the attendance rate in undergraduate school. To predict the GPA of a potential candidate for the graduate school, we build a regression model of graduate GPA on the four factors. We have collected the relevant data from 1000 graduate students in this graduate school. The response is the graduate GPA the student obtains. The predictors are the four factors. n=1000, p=4. The goal of this application is prediction.
(c) Describe three real-life applications in which unsupervised learning might be useful.
i. Society is divided into low-income, middle-class and high-income groups. We want to cluster 1,000 people into these 3 groups. The only information we have on each of the 1,000 people is their monthly income and education.
ii. According to the shopping behavior and income of the customers, the market wants to cluster the customers into different segments in order to target the right customer group.
iii. The Internet database clusters the Internet users into different social communities by examining the users’ Internet searching history.
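Application i can be sketched with k-means clustering. Everything in the example is an assumption made for illustration: the data are simulated (300 people rather than 1,000), the group means are invented, and k-means is just one possible clustering method, not one named in the exercise.

```r
set.seed(42)
# Hypothetical data: monthly income (USD) and years of education
people <- data.frame(
  income = c(rnorm(100, 2000, 300), rnorm(100, 5000, 500),
             rnorm(100, 12000, 1500)),
  education = c(rnorm(100, 11, 1.5), rnorm(100, 15, 1.5),
                rnorm(100, 18, 1.5))
)

# Standardize first so income (large scale) does not dominate the distances
km <- kmeans(scale(people), centers = 3, nstart = 20)

table(km$cluster)                                         # cluster sizes
aggregate(people, by = list(cluster = km$cluster), FUN = mean)  # cluster profiles
```

No response variable is used anywhere, which is what makes this an unsupervised problem. The cluster labels 1 to 3 are arbitrary, and it is up to the analyst to interpret them as income groups.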
4. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
Before reading the data into R, it can be viewed in Excel or a text editor.
(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE,
sep = ",")
(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) = college[, 1]
fix(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college = college[, -1]
fix(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
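The steps in (a) and (b) can be tried without the real file. The sketch below writes a tiny stand-in CSV to a temporary file (the two rows are invented and are not taken from College.csv) and then applies the same row-name handling. It also passes `stringsAsFactors = TRUE`: from R 4.0.0 onwards `read.csv()` keeps strings as character vectors by default, so this argument is what makes a column like `Private` a factor.

```r
# Tiny stand-in for College.csv (hypothetical rows, illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Private,Apps",
             "College A,Yes,1660",
             "College B,No,2186"), tmp)

demo <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)

# Use the first column as row names, then drop it from the data
rownames(demo) <- demo[, 1]
demo <- demo[, -1]

str(demo)
rownames(demo)
```

`str(demo)` is a non-interactive alternative to `fix()` for checking that the name column is gone and that `Private` is a factor.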
(c)
i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.0
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.0
## Median : 1558 Median : 1110 Median : 434 Median :23.0
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.6
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.0
## Max. :48094 Max. :26330 Max. :6392 Max. :96.0
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96 Min. : 250 Min. : 8.0
## 1st Qu.:3597 1st Qu.: 470 1st Qu.: 850 1st Qu.: 62.0
## Median :4200 Median : 500 Median :1200 Median : 75.0
## Mean :4358 Mean : 549 Mean :1341 Mean : 72.7
## 3rd Qu.:5050 3rd Qu.: 600 3rd Qu.:1700 3rd Qu.: 85.0
## Max. :8124 Max. :2340 Max. :6800 Max. :103.0
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.5 Min. : 0.0 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.5 1st Qu.:13.0 1st Qu.: 6751
## Median : 82.0 Median :13.6 Median :21.0 Median : 8377
## Mean : 79.7 Mean :14.1 Mean :22.7 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.5 3rd Qu.:31.0 3rd Qu.:10830
## Max. :100.0 Max. :39.8 Max. :64.0 Max. :56233
## Grad.Rate
## Min. : 10.0
## 1st Qu.: 53.0
## Median : 65.0
## Mean : 65.5
## 3rd Qu.: 78.0
## Max. :118.0
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first 10 columns of a matrix A using A[,1:10].
pairs(college[, 1:10])
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
boxplot(college$Outstate ~ college$Private, col = c("blue", "green"), main = "Outstate versus Private",
xlab = "Private", ylab = "Outstate")
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
fix(college)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
There are 78 elite universities and 699 non-elite universities.
summary(college$Elite)
## No Yes
## 699 78
boxplot(college$Outstate ~ college$Elite, col = c("blue", "green"), main = "Outstate versus Elite",
xlab = "Elite", ylab = "Outstate")
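The binning step can also be written in one line with `ifelse()`. This is an equivalent sketch, not the approach used above, and the `top10` values are made up for illustration.

```r
# Hypothetical Top10perc values for six schools (illustration only)
top10 <- c(23, 16, 60, 89, 38, 51)

# "Yes" when more than 50% of students come from the top 10% of their class
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Setting `levels` explicitly keeps the order `No`, `Yes` even if one of the groups happens to be empty.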
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
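The plot includes histograms for Accept, Enroll and Top10perc. The blue histograms are drawn with breaks = 6 and the green ones with breaks = 10. Note that hist() treats a single number given to breaks as a suggestion, so the number of bins actually drawn may differ from the requested value.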
par(mfcol = c(2, 3))
# Accept, Enroll and Top10perc: blue uses breaks = 6, green uses breaks = 10
hist(college$Accept, breaks = 6, freq = TRUE, col = "blue", main = "Histogram",
xlab = "Accept", ylab = "Value")
hist(college$Accept, breaks = 10, freq = TRUE, col = "green", main = "Histogram",
xlab = "Accept", ylab = "Value")
hist(college$Enroll, breaks = 6, freq = TRUE, col = "blue", main = "Histogram",
xlab = "Enroll", ylab = "Value")
hist(college$Enroll, breaks = 10, freq = TRUE, col = "green", main = "Histogram",
xlab = "Enroll", ylab = "Value")
hist(college$Top10perc, breaks = 6, freq = TRUE, col = "blue", main = "Histogram",
xlab = "Top10perc", ylab = "Value")
hist(college$Top10perc, breaks = 10, freq = TRUE, col = "green", main = "Histogram",
xlab = "Top10perc", ylab = "Value")
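Because `hist()` only takes a single `breaks` number as a hint, it can be useful to check how many bins were actually produced. The sketch below uses simulated right-skewed data as a stand-in for a variable such as Accept; the data and the seed are assumptions for illustration.

```r
set.seed(3)
x <- rexp(500, rate = 1 / 2000)  # hypothetical right-skewed variable

# plot = FALSE returns the histogram object without drawing it
h6 <- hist(x, breaks = 6, plot = FALSE)
h10 <- hist(x, breaks = 10, plot = FALSE)
c(requested_6 = length(h6$counts), requested_10 = length(h10$counts))

# To force an exact number of equal-width bins, pass explicit break points:
# 7 break points give exactly 6 bins
h_exact <- hist(x, breaks = seq(min(x), max(x), length.out = 7), plot = FALSE)
length(h_exact$counts)
```

Passing a vector of break points is the reliable way to get a specific number of bins when comparing histograms across variables.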
5. This exercise uses the College data set.
(a) Split the data set into a training set and a test set of approximately equal size.
The training set has 389 samples and the test set has 388 samples.
library(ISLR)
set.seed(13435)
train <- sample(777, 389)
train
## [1] 234 449 501 283 543 339 252 13 337 708 635 369 371 294 205 39 723
## [18] 408 648 197 435 384 728 256 520 748 395 705 222 201 549 133 6 189
## [35] 398 365 507 319 622 439 521 25 463 357 122 242 127 54 280 220 636
## [52] 564 140 444 117 45 453 258 98 239 112 42 522 64 495 250 652 674
## [69] 488 372 659 204 301 771 490 773 733 586 246 164 377 594 396 174 231
## [86] 87 305 706 487 494 22 380 178 472 148 169 138 370 108 762 415 691
## [103] 221 505 525 128 227 461 673 207 34 428 747 251 8 599 509 406 598
## [120] 410 551 595 757 470 208 191 467 393 725 361 18 343 665 623 775 362
## [137] 209 424 29 69 581 523 263 675 755 51 532 639 479 538 268 625 658
## [154] 206 405 47 695 617 620 694 486 188 331 456 276 310 644 304 344 223
## [171] 226 36 468 130 631 137 769 143 9 669 560 67 160 438 368 132 297
## [188] 565 163 524 173 394 48 754 386 249 571 345 57 713 508 306 271 264
## [205] 17 419 89 611 632 296 325 473 259 101 290 504 535 247 75 716 645
## [222] 257 216 416 376 458 619 340 317 455 528 554 483 327 74 709 654 135
## [239] 392 228 761 71 281 20 477 526 664 350 284 724 185 766 265 59 261
## [256] 52 182 68 735 142 701 518 666 736 626 577 556 153 237 704 27 110
## [273] 333 621 177 722 500 425 759 196 230 16 184 548 10 443 506 603 573
## [290] 155 224 726 531 740 537 176 503 480 65 124 316 383 422 111 742 389
## [307] 100 676 597 329 24 572 321 21 53 576 379 318 616 745 634 421 457
## [324] 749 373 437 703 336 442 253 285 600 154 299 1 49 167 15 462 719
## [341] 346 433 198 641 170 233 73 758 566 245 624 3 409 382 404 238 190
## [358] 266 94 649 193 485 657 347 4 269 61 777 83 464 32 313 85 278
## [375] 562 375 332 618 605 552 693 553 540 550 559 670 499 568 195
college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE,
sep = ",")
rownames(college) = college[, 1]
college = college[, -1]
dim(college)
## [1] 777 18
college <- college[, -c(3, 4)]
dim(college)
## [1] 777 16
# split by the sampled row indices; later code refers to columns via data = ... or $
trainCollege = college[train, ]
testCollege = college[-train, ]
(b) Fit a linear model using least squares on the training set, and report the training and test error obtained. Do not include the Elite predictor, or the Accept or Enroll predictors in the regression.
The mean squared error on the training set is 2096944. The mean squared error on the test set is 5255540. (To reproduce these numbers, restart RStudio and run only the code for 5(a) and 5(b); objects left over from other exercises can otherwise change the result.)
lm.fit <- lm(Apps ~ ., data = trainCollege)
# train MSE
mseTrain <- mean((trainCollege$Apps - predict(lm.fit, trainCollege))^2)
# test MSE
mseTest <- mean((testCollege$Apps - predict(lm.fit, testCollege))^2)
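The same train/test MSE calculation can be wrapped in a small helper. The sketch below uses the built-in `mtcars` data rather than the College data, so it runs without the CSV file; the helper name `mse`, the model `mpg ~ wt + hp` and the seed are assumptions chosen for illustration.

```r
# Sketch on the built-in mtcars data (not the College data)
set.seed(1)
idx <- sample(nrow(mtcars), nrow(mtcars) / 2)
train_df <- mtcars[idx, ]
test_df <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train_df)

# Mean squared error of a fitted model on a given data frame
mse <- function(model, data, response) {
  mean((data[[response]] - predict(model, newdata = data))^2)
}

errors <- c(train = mse(fit, train_df, "mpg"), test = mse(fit, test_df, "mpg"))
errors
sqrt(errors)  # RMSE, in the units of the response
```

Taking the square root puts the error back on the scale of the response, which makes it easier to judge whether an MSE in the millions is large for a count of applications.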
(c) Comment on the results obtained. How accurately can we predict the number of college applications received? What are the most important predictors?
The least squares model does not predict very well when we regress the number of applications on all the other variables. The training MSE is a little more than 2 million, which corresponds to a root mean squared error of about 1,450 applications. On the test set the MSE is a little more than 5 million (a root mean squared error of about 2,290 applications). Both are large next to the median of 1,558 applications shown in the summary of the data. Thus, this is not a very good model.
According to the output of `summary()`, the variables `F.Undergrad`, `Room.Board`, `Expend` and `Grad.Rate` have p-values less than 0.001. `F.Undergrad` is the most important predictor, as its p-value is less than 2e-16. `Private`, `Top10perc` and `perc.alumni` are also significant, with p-values of about 0.002, 0.005 and 0.003. `Outstate` is not significant in this fit (p-value 0.149). The `(Intercept)` is significant as well (p-value 0.00091), but we do not care much about the `(Intercept)`.
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = trainCollege)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4665 -724 -48 489 7095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.57e+03 7.69e+02 -3.34 0.00091 ***
## PrivateYes -8.84e+02 2.89e+02 -3.06 0.00237 **
## Top10perc 3.17e+01 1.12e+01 2.82 0.00499 **
## Top25perc -1.17e+01 9.22e+00 -1.27 0.20574
## F.Undergrad 5.95e-01 2.58e-02 23.03 < 2e-16 ***
## P.Undergrad -2.48e-02 5.86e-02 -0.42 0.67275
## Outstate 5.42e-02 3.75e-02 1.45 0.14914
## Room.Board 4.26e-01 9.56e-02 4.45 1.1e-05 ***
## Books -6.34e-02 4.34e-01 -0.15 0.88400
## Personal -1.49e-01 1.27e-01 -1.18 0.24014
## PhD 7.43e+00 9.49e+00 0.78 0.43394
## Terminal -1.12e+01 9.91e+00 -1.13 0.25852
## S.F.Ratio 1.82e+01 2.50e+01 0.73 0.46703
## perc.alumni -2.45e+01 8.14e+00 -3.01 0.00277 **
## Expend 8.06e-02 2.40e-02 3.36 0.00086 ***
## Grad.Rate 2.31e+01 5.92e+00 3.90 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1480 on 373 degrees of freedom
## Multiple R-squared: 0.823, Adjusted R-squared: 0.815
## F-statistic: 115 on 15 and 373 DF, p-value: <2e-16
6. Using the same setup as in the previous question, form a new outcome variable Y which equals one if the number of applications is greater than or equal to the overall median and zero otherwise. Fit a logistic regression model to Y and report the training and test misclassification rates, and the most important predictors. As above, do not include the Elite predictor, or the Accept or Enroll predictors in the regression. Compare the results of this analysis to that of the linear regression approach in the previous question.
The function `contrasts()` shows that R has created a dummy variable that equals 1 for `zero`, so the fitted probabilities are probabilities of the class `zero`. For the training set, the misclassification rate is $latex \frac{14+15}{389}=0.07455013$. For the test set, the misclassification rate is $latex \frac{11+26}{388}=0.09536082$. Although the misclassification rate for the test set is somewhat higher than for the training set, the accuracy is high for both. Predicting whether a college is above or below the median number of applications therefore works better than predicting the exact number with the linear regression approach in the previous question, although a misclassification rate and an MSE are not directly comparable.
Here, the most important predictors are `F.Undergrad` and `(Intercept)`. But we do not care much about the `(Intercept)`.
set.seed(12363235)
# sample the training set first
train <- sample(777, 389)
train
## [1] 615 86 591 758 128 646 426 358 116 647 493 72 750 28 282 25 185
## [18] 346 302 539 313 30 139 482 415 83 614 135 729 34 118 262 744 182
## [35] 682 150 690 298 638 146 714 461 559 438 251 473 92 325 87 551 630
## [52] 305 324 718 239 151 299 409 607 740 330 348 737 280 129 509 284 694
## [69] 9 707 575 701 384 334 326 121 596 199 616 417 697 81 212 526 765
## [86] 502 688 96 106 480 586 470 712 498 12 166 443 42 652 250 241 593
## [103] 161 456 37 386 468 181 440 692 171 54 735 510 567 528 70 479 233
## [120] 67 297 144 643 452 332 300 276 767 410 89 114 268 382 100 354 60
## [137] 10 11 254 208 73 505 53 418 172 626 734 704 390 495 2 308 776
## [154] 27 469 94 589 472 6 353 474 530 273 543 252 213 546 477 140 565
## [171] 115 16 419 602 743 466 111 196 668 43 387 709 573 383 211 587 403
## [188] 631 487 739 768 501 155 26 751 537 519 82 237 33 320 434 259 191
## [205] 22 201 376 636 671 220 515 731 404 331 703 547 637 68 362 293 691
## [222] 51 514 229 613 603 689 433 307 656 566 649 98 342 420 629 363 580
## [239] 407 206 571 742 352 650 521 296 448 677 711 414 157 588 716 78 103
## [256] 680 24 564 119 338 594 570 713 356 549 18 507 664 655 402 635 327
## [273] 361 113 777 459 560 29 162 455 535 371 717 683 406 625 175 274 632
## [290] 464 368 578 59 203 489 427 708 496 598 416 722 266 618 741 601 599
## [307] 511 715 279 525 221 281 209 318 538 733 223 441 193 31 413 460 585
## [324] 314 265 670 544 143 47 195 494 761 553 148 752 101 679 475 524 366
## [341] 339 595 513 57 772 552 88 244 303 555 388 654 125 436 32 23 306
## [358] 397 699 264 666 491 458 312 645 542 336 536 202 529 422 429 263 432
## [375] 608 659 774 492 190 160 532 38 77 173 133 311 188 393 243
# prepare the data frame
college <- read.csv("/Users/shijiabian/Documents/College.csv", header = TRUE,
sep = ",")
rownames(college) = college[, 1]
college = college[, -1]
dim(college)
## [1] 777 18
# delete the predictors we are not using
college <- college[, -c(3, 4)]
medApp <- median(college$Apps)
newY = rep("one", 777)
newY[college$Apps < medApp] = "zero"
newY = factor(newY)
college <- data.frame(college, newY)
college = college[, -2]
# split by the sampled row indices; glm() below finds newY via data = trainCollege
trainCollege = college[train, ]
testCollege = college[-train, ]
glm.fit <- glm(formula = newY ~ ., family = binomial, data = trainCollege)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# training-set fitted probabilities
glm.probs.train = predict(glm.fit, type = "response")
glm.pred = rep("one", 389)
glm.pred[glm.probs.train > 0.5] = "zero"
contrasts(trainCollege$newY)
## zero
## one 0
## zero 1
# training set
table(glm.pred, trainCollege$newY)
##
## glm.pred one zero
## one 174 14
## zero 15 186
# training misclassification rate: share of predictions that differ from the truth
mean(glm.pred != trainCollege$newY)
## [1] 0.07455
# test set
glm.probs.test = predict(glm.fit, newdata = testCollege, type = "response")
glm.pred.test = rep("one", 388)
glm.pred.test[glm.probs.test > 0.5] = "zero"
contrasts(testCollege$newY)
## zero
## one 0
## zero 1
# test set
table(glm.pred.test, testCollege$newY)
##
## glm.pred.test one zero
## one 174 11
## zero 26 177
# test misclassification rate: share of predictions that differ from the truth
mean(glm.pred.test != testCollege$newY)
## [1] 0.09536
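The labelling logic used above depends on which factor level `glm()` treats as the "success". The sketch below reproduces the pattern on simulated data; the variable names, the data-generating rule and the seed are all assumptions for illustration. With levels `one` and `zero` in alphabetical order, `glm()` models the probability of the second level, `zero`, which is why probabilities above 0.5 are labelled `zero`.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
# Hypothetical outcome: "one" is more likely when x is large
y <- factor(ifelse(x + rnorm(n) > 0, "one", "zero"))
d <- data.frame(x, y)

fit <- glm(y ~ x, family = binomial, data = d)

# For a factor response, glm() models P(y == second level); here that is "zero"
levels(d$y)
p_second <- predict(fit, type = "response")

# Label by level position instead of hard-coding the class names
pred <- ifelse(p_second > 0.5, levels(d$y)[2], levels(d$y)[1])

conf <- table(predicted = pred, actual = d$y)
conf
err <- mean(pred != d$y)  # misclassification rate
err
```

Indexing `levels()` rather than typing the labels by hand avoids accidentally swapping the two classes, which would turn a low error rate into a high one.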