1). The sample size $n$ is extremely large and the number of predictors $p$ is small: a flexible method is better, because with such a large sample we can use the data to train a more complex model without overfitting.
2). The number of predictors $p$ is extremely large and the number of observations $n$ is small: an inflexible method is better, because a flexible method is likely to overfit the small sample.
3). An inflexible method is better, because a flexible method would fit the noise in the errors rather than the true relationship.
4). A flexible method is better: since the relationship is non-linear, an inflexible method would incur high bias.
5). It depends on how non-linear the relationship is relative to how large $\sigma^2$ is: a flexible method handles a non-linear relationship better, but a high $\sigma^2$ introduces too much noise for it to exploit.
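To make the bias-variance trade-off above concrete, here is a minimal simulation sketch (the data and the choice of 10 effective degrees of freedom are invented for illustration): an inflexible linear fit versus a flexible smoothing spline on a noisy non-linear signal.
## hypothetical data: non-linear truth plus noise
set.seed(1)
x = runif(100, 0, 10)
y = sin(x) + rnorm(100, sd = 0.5)
linFit = lm(y ~ x)                      # inflexible fit: high bias
splFit = smooth.spline(x, y, df = 10)   # flexible fit: lower bias, higher variance
mean(residuals(linFit)^2)               # training MSE of the linear fit
mean((y - predict(splFit, x)$y)^2)      # training MSE of the spline fit
The flexible fit will typically achieve the lower training MSE, but with a small $n$ or a very large $\sigma^2$ that advantage would not carry over to new observations.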
For each scenario, the type of problem, the goal, and the values of n and p:
1). Regression, inference, n = 500, p = 3
2). Classification, prediction, n = 10,000 * 100 = 1,000,000, p = 1
3). Classification, prediction, n = 20, p = 13
4). Regression, prediction, n = 52, p = 3
1). A shopping mall wants to predict whether male or female shoppers will spend more money. They record the last 5 years of sales, shopping frequency, and time spent shopping, with all data broken down by gender.
Response: male or female
Predictors: sales, shopping frequency, time
Goal: Prediction
2). A rating agency rates stocks from AAA to DDD. In order to do that, they record each company's sales, number of employees, and its ratings over the previous 5 years.
Response: ratings
Predictors: company sales, number of employees, previous ratings
Goal: prediction
3). Whether my application to Stanford University will be approved or rejected.
Response: Approve or reject
Predictors: GPA, work experience, research experience
Goal: prediction
1). A fast-food restaurant wants to predict how much revenue it can make next year. They collect last year's weekly records; each week's record includes advertising cost, personnel cost, material cost, and revenue.
Response: next year's revenue
Predictors: advertising cost, personnel cost, material cost
Goal: prediction
2). YouTube wants to know which factors affect how much time people spend watching a video. They have a sample of 10,000 videos. For each video they collect its category, its length, whether ads are inserted, and the number of subscribers of the YouTuber.
Response: time spent on a video
Predictors: video category, video length, whether ads are inserted, number of subscribers of the YouTuber
Goal: Inference
3). Birth rate in the U.S.
Response: Birth rate
Predictors: number of hospitals, number of people who are married, household income
Goal: prediction
1). Banks want to divide their credit card holders into different groups based on spending-behavior variables such as monthly balance, FICO score, and income.
2). A restaurant wants to divide its customers into different groups based on food preference, time spent in the restaurant, and gender.
3). A university wants to cluster its students into different groups based on GPA, major, and research experience.
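As an illustrative sketch of how such a grouping might be computed (the data below is simulated, and only numeric features are used because k-means requires numeric inputs), k-means clustering of students could look like this:
## hypothetical numeric student features, for illustration only
students = data.frame(GPA = runif(300, 2, 4), researchYears = rpois(300, 1))
## standardize the features, then cluster into 3 groups
km = kmeans(scale(students), centers = 3, nstart = 20)
table(km$cluster)   # size of each cluster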
setwd("D:/One Drive/OneDrive/Document/Study/Stanford/Introduction to Statistical Learning/homework/hw1")
college = read.csv("college.csv")
rownames(college) = college[, 1]  # use the college names as row names
college = college[, -1]           # drop the redundant name column
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[, 1:10])
plot(college$Private, college$Outstate, xlab = "Private", ylab = "Outstate")
Elite = rep("No",nrow(college))
Elite[college$Top10perc >50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college ,Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite", ylab = "Outstate")
par(mfcol = c(2, 2))
hist(college$Grad.Rate, xlab = "Grad Rate", ylab = "Frequency")
hist(college$Expend, xlab = "Expend", ylab = "Frequency")
hist(college$PhD, xlab = "PhD", ylab = "Frequency")
hist(college$Personal, xlab = "Personal", ylab = "Frequency")
## shuffle the row indices and split them into two halves (a 50/50 train/test split)
indices <- split(sample(nrow(college), nrow(college), replace=FALSE), as.factor(1:2))
trainingSet = college[indices[[1]], ]
testSet = college[-indices[[1]], ]
fit <- lm(Apps ~ . - Accept - Enroll - Elite, data = trainingSet)
summary(fit)
##
## Call:
## lm(formula = Apps ~ . - Accept - Enroll - Elite, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6198 -721 -73 487 31918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.266e+03 1.183e+03 -2.761 0.006051 **
## PrivateYes 7.097e+01 4.185e+02 0.170 0.865445
## Top10perc 1.847e+01 1.685e+01 1.096 0.273737
## Top25perc -2.763e-01 1.323e+01 -0.021 0.983343
## F.Undergrad 7.243e-01 3.457e-02 20.952 < 2e-16 ***
## P.Undergrad -7.227e-02 8.415e-02 -0.859 0.390986
## Outstate -1.252e-02 5.534e-02 -0.226 0.821191
## Room.Board 5.022e-01 1.482e-01 3.388 0.000779 ***
## Books 3.166e-01 7.244e-01 0.437 0.662310
## Personal -3.203e-01 1.894e-01 -1.691 0.091684 .
## PhD -1.164e+00 1.423e+01 -0.082 0.934832
## Terminal -1.735e+01 1.549e+01 -1.120 0.263350
## S.F.Ratio 4.934e+01 3.786e+01 1.303 0.193239
## perc.alumni -1.168e+01 1.313e+01 -0.890 0.374172
## Expend 1.201e-01 4.057e-02 2.959 0.003282 **
## Grad.Rate 1.890e+01 8.763e+00 2.156 0.031697 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2183 on 373 degrees of freedom
## Multiple R-squared: 0.7424, Adjusted R-squared: 0.7321
## F-statistic: 71.68 on 15 and 373 DF, p-value: < 2.2e-16
We use the training MSE and the test MSE to measure the quality of the fit.
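For a fitted model $\hat{f}$, the MSE over a set of $n$ observations is the average squared prediction error,
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2,$$
computed below on the training rows and on the held-out test rows respectively.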
trainingMSE = mean(fit$residuals^2)
testMSE = mean((testSet$Apps - predict.lm(fit, testSet)) ^ 2)
summary(trainingMSE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4569400 4569400 4569400 4569400 4569400 4569400
summary(testMSE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3010673 3010673 3010673 3010673 3010673 3010673
As we can see above, both `trainingMSE` and `testMSE` are large (roughly 4.6 million and 3.0 million, respectively). Also, $R^2$ is only 0.7424 for this linear model, so we can conclude that the linear model does not fit this data very well. However, the F-statistic is 71.68, far greater than 1, which suggests that at least one of the predictors must be related to `Apps`.
`F.Undergrad`, `Room.Board` and `Expend` have the smallest p-values and are the most important predictors; `Grad.Rate` and `Personal` form a second tier.
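As a quick follow-up sketch (not part of the original analysis), one could refit using only the predictors flagged above and compare its test MSE with the full model's:
## sketch: reduced model with only the most significant predictors
reducedFit = lm(Apps ~ F.Undergrad + Room.Board + Expend + Grad.Rate, data = trainingSet)
mean((testSet$Apps - predict(reducedFit, testSet))^2)   # test MSE of the reduced model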
med = median(college$Apps)
Y = rep(0, nrow(college))
Y[college$Apps >= med] = 1
Y = as.factor(Y)
college = data.frame(college, Y)
## exclude unwanted variables
college = subset(college, select = -c(Accept, Enroll, Elite, Apps))
indices <- split(sample(nrow(college), nrow(college), replace=FALSE), as.factor(1:2))
trainingSet = college[indices[[1]], ]
testSet = college[-indices[[1]], ]
fit <- glm(formula = Y ~ ., family = binomial, data = trainingSet)
summary(fit)
##
## Call:
## glm(formula = Y ~ ., family = binomial, data = trainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9708 -0.2077 0.0000 0.0292 3.1984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.319e+01 2.963e+00 -4.452 8.52e-06 ***
## PrivateYes 2.362e-01 1.059e+00 0.223 0.8236
## Top10perc 8.371e-03 4.114e-02 0.203 0.8388
## Top25perc -3.807e-03 3.351e-02 -0.114 0.9095
## F.Undergrad 3.618e-03 5.038e-04 7.183 6.84e-13 ***
## P.Undergrad 3.603e-04 5.524e-04 0.652 0.5142
## Outstate 3.008e-04 1.182e-04 2.545 0.0109 *
## Room.Board 3.278e-04 2.892e-04 1.133 0.2570
## Books 4.030e-04 1.762e-03 0.229 0.8191
## Personal -5.774e-04 5.285e-04 -1.093 0.2746
## PhD 3.754e-04 2.459e-02 0.015 0.9878
## Terminal 1.647e-02 2.674e-02 0.616 0.5380
## S.F.Ratio -5.711e-02 7.555e-02 -0.756 0.4498
## perc.alumni -1.898e-02 2.549e-02 -0.744 0.4566
## Expend -1.226e-05 1.132e-04 -0.108 0.9137
## Grad.Rate 2.482e-02 2.048e-02 1.212 0.2254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.14 on 388 degrees of freedom
## Residual deviance: 134.23 on 373 degrees of freedom
## AIC: 166.23
##
## Number of Fisher Scoring iterations: 9
## calculate training misclassification rate
trainingProbs = predict(fit, type = "response")
trainingPred = rep(0, nrow(trainingSet))
trainingPred[trainingProbs > 0.5] = 1
table(trainingPred, trainingSet$Y)
##
## trainingPred 0 1
## 0 182 15
## 1 9 183
## training misclassification rate
1 - mean(trainingPred == trainingSet$Y)
## [1] 0.06169666
## calculate test misclassification rate
testProbs = predict(fit, newdata = testSet, type = "response")
testPred = rep(0, nrow(testSet))
testPred[testProbs > 0.5] = 1
table(testPred, testSet$Y)
##
## testPred 0 1
## 0 179 18
## 1 18 173
## test misclassification rate
1 - mean(testPred == testSet$Y)
## [1] 0.09278351
The error rates for the training set and the test set are both in the 6%-9% range, so the logistic regression fits this classification problem well. The most significant predictors here are `F.Undergrad` and `Outstate`; `Grad.Rate`, which was significant in the linear model, is not significant here. Since `F.Undergrad` is highly significant in both models, we can conclude it is the factor most strongly related to `Apps`.
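As a small interpretive sketch based on the fitted `fit` object above, the logistic-regression coefficients can be converted into odds ratios with `exp()`; for example, the multiplicative change in the odds of `Apps` being above the median for every additional 1,000 full-time undergraduates:
exp(1000 * coef(fit)["F.Undergrad"])   # odds multiplier per 1,000 additional full-time undergraduates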