1). The sample size $n$ is extremely large and the number of predictors $p$ is small: a flexible method is better, because with such a large sample we can use the data to train a more complex model without overfitting.
2). The number of predictors $p$ is extremely large and the number of observations $n$ is small: an inflexible method is better, because a flexible method is likely to overfit the small sample.
3). An inflexible method is better, because a flexible method would fit the noise in the errors rather than the true relationship.
4). A flexible method is better: since the relationship is non-linear, an inflexible method would incur high bias.
5). It depends on how non-linear the relationship is relative to how large $\sigma^2$ is: a flexible method handles a non-linear relationship better, but a high $\sigma^2$ introduces too much noise for it to exploit.
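To make the bias-variance trade-off above concrete, here is a minimal simulation sketch (the data and the choice of 10 effective degrees of freedom are invented for illustration): an inflexible linear fit versus a flexible smoothing spline on a noisy non-linear signal.
## hypothetical data: non-linear truth plus noise
set.seed(1)
x = runif(100, 0, 10)
y = sin(x) + rnorm(100, sd = 0.5)
linFit = lm(y ~ x)                      # inflexible fit: high bias
splFit = smooth.spline(x, y, df = 10)   # flexible fit: lower bias, higher variance
mean(residuals(linFit)^2)               # training MSE of the linear fit
mean((y - predict(splFit, x)$y)^2)      # training MSE of the spline fit
The flexible fit will typically achieve the lower training MSE, but with a small $n$ or a very large $\sigma^2$ that advantage would not carry over to new observations.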
For each scenario, the type of problem, the goal, and the values of n and p:
1). Regression, inference, n = 500, p = 3
2). Classification, prediction, n = 10,000 * 100 = 1,000,000, p = 1
3). Classification, prediction, n = 20, p = 13
4). Regression, prediction, n = 52, p = 3
1). A shopping mall wants to predict whether male or female shoppers will spend more money. They record the last 5 years of sales, shopping frequency, and time spent shopping, with all data broken down by gender.
Response: male or female
Predictors: sales, shopping frequency, time
Goal: Prediction
2). A rating agency rates stocks from AAA to DDD. In order to do that, they record each company's sales, number of employees, and its ratings over the previous 5 years.
Response: ratings
Predictors: company sales, number of employees, previous ratings
Goal: prediction
3). Whether my application to Stanford University will be approved or rejected.
Response: Approve or reject
Predictors: GPA, work experience, research experience
Goal: prediction
1). A fast-food restaurant wants to predict how much revenue it can make next year. They collect last year's weekly records; each week's record includes advertising cost, personnel cost, material cost, and revenue.
Response: next year's revenue
Predictors: advertising cost, personnel cost, material cost
Goal: prediction
2). YouTube wants to know which factors affect how much time people spend watching a video. They have a sample of 10,000 videos. For each video they collect its category, its length, whether ads are inserted, and the number of subscribers of the YouTuber.
Response: time spent on a video
Predictors: video category, video length, whether ads are inserted, number of subscribers of the YouTuber
Goal: Inference
3). Birth rate in the U.S.
Response: Birth rate
Predictors: number of hospitals, number of people who are married, household income
Goal: prediction
1). Banks want to divide their credit card holders into different groups based on spending-behavior variables such as monthly balance, FICO score, and income.
2). A restaurant wants to divide its customers into different groups based on food preference, time spent in the restaurant, and gender.
3). A university wants to cluster its students into different groups based on GPA, major, and research experience.
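As an illustrative sketch of how such a grouping might be computed (the data below is simulated, and only numeric features are used because k-means requires numeric inputs), k-means clustering of students could look like this:
## hypothetical numeric student features, for illustration only
students = data.frame(GPA = runif(300, 2, 4), researchYears = rpois(300, 1))
## standardize the features, then cluster into 3 groups
km = kmeans(scale(students), centers = 3, nstart = 20)
table(km$cluster)   # size of each cluster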
setwd("D:/One Drive/OneDrive/Document/Study/Stanford/Introduction to Statistical Learning/homework/hw1")
college = read.csv("college.csv")
rownames(college) = college[, 1]  # use the college names as row names
college = college[, -1]           # drop the redundant name column
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[, 1:10])
plot(college$Private, college$Outstate, xlab = "Private", ylab = "Outstate")
Elite = rep("No",nrow(college))
Elite[college$Top10perc >50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college ,Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite", ylab = "Outstate")
par(mfcol = c(2, 2))
hist(college$Grad.Rate, xlab = "Grad Rate", ylab = "Frequency")
hist(college$Expend, xlab = "Expend", ylab = "Frequency")
hist(college$PhD, xlab = "PhD", ylab = "Frequency")
hist(college$Personal, xlab = "Personal", ylab = "Frequency")
## shuffle the row indices and split them into two halves (a 50/50 train/test split)
indices <- split(sample(nrow(college), nrow(college), replace=FALSE), as.factor(1:2))
trainingSet = college[indices[[1]], ]
testSet = college[-indices[[1]], ]
fit <- lm(Apps ~ . - Accept - Enroll - Elite, data = trainingSet)
summary(fit)
##
## Call:
## lm(formula = Apps ~ . - Accept - Enroll - Elite, data = trainingSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6198 -721 -73 487 31918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.266e+03 1.183e+03 -2.761 0.006051 **
## PrivateYes 7.097e+01 4.185e+02 0.170 0.865445
## Top10perc 1.847e+01 1.685e+01 1.096 0.273737
## Top25perc -2.763e-01 1.323e+01 -0.021 0.983343
## F.Undergrad 7.243e-01 3.457e-02 20.952 < 2e-16 ***
## P.Undergrad -7.227e-02 8.415e-02 -0.859 0.390986
## Outstate -1.252e-02 5.534e-02 -0.226 0.821191
## Room.Board 5.022e-01 1.482e-01 3.388 0.000779 ***
## Books 3.166e-01 7.244e-01 0.437 0.662310
## Personal -3.203e-01 1.894e-01 -1.691 0.091684 .
## PhD -1.164e+00 1.423e+01 -0.082 0.934832
## Terminal -1.735e+01 1.549e+01 -1.120 0.263350
## S.F.Ratio 4.934e+01 3.786e+01 1.303 0.193239
## perc.alumni -1.168e+01 1.313e+01 -0.890 0.374172
## Expend 1.201e-01 4.057e-02 2.959 0.003282 **
## Grad.Rate 1.890e+01 8.763e+00 2.156 0.031697 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2183 on 373 degrees of freedom
## Multiple R-squared: 0.7424, Adjusted R-squared: 0.7321
## F-statistic: 71.68 on 15 and 373 DF, p-value: < 2.2e-16
We use the training MSE and the test MSE to measure the quality of the fit.
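For a fitted model $\hat{f}$, the MSE over a set of $n$ observations is the average squared prediction error,
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2,$$
computed below on the training rows and on the held-out test rows respectively.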
trainingMSE = mean(fit$residuals^2)
testMSE = mean((testSet$Apps - predict.lm(fit, testSet)) ^ 2)
summary(trainingMSE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4569400 4569400 4569400 4569400 4569400 4569400
summary(testMSE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3010673 3010673 3010673 3010673 3010673 3010673
As we can see above, both `trainingMSE` and `testMSE` are large (roughly 4.6 million and 3.0 million, respectively). Also, $R^2$ is only 0.7424 for this linear model, so we can conclude that the linear model does not fit this data very well. However, the F-statistic is 71.68, far greater than 1, which suggests that at least one of the predictors must be related to `Apps`.
`F.Undergrad`, `Room.Board` and `Expend` have the smallest p-values and are the most important predictors; `Grad.Rate` and `Personal` form a second tier.
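As a quick follow-up sketch (not part of the original analysis), one could refit using only the predictors flagged above and compare its test MSE with the full model's:
## sketch: reduced model with only the most significant predictors
reducedFit = lm(Apps ~ F.Undergrad + Room.Board + Expend + Grad.Rate, data = trainingSet)
mean((testSet$Apps - predict(reducedFit, testSet))^2)   # test MSE of the reduced model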
med = median(college$Apps)
Y = rep(0, nrow(college))
Y[college$Apps >= med] = 1
Y = as.factor(Y)
college = data.frame(college, Y)
## exclude unwanted variables
college = subset(college, select = -c(Accept, Enroll, Elite, Apps))
indices <- split(sample(nrow(college), nrow(college), replace=FALSE), as.factor(1:2))
trainingSet = college[indices[[1]], ]
testSet = college[-indices[[1]], ]
fit <- glm(formula = Y ~ ., family = binomial, data = trainingSet)
summary(fit)
##
## Call:
## glm(formula = Y ~ ., family = binomial, data = trainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9708 -0.2077 0.0000 0.0292 3.1984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.319e+01 2.963e+00 -4.452 8.52e-06 ***
## PrivateYes 2.362e-01 1.059e+00 0.223 0.8236
## Top10perc 8.371e-03 4.114e-02 0.203 0.8388
## Top25perc -3.807e-03 3.351e-02 -0.114 0.9095
## F.Undergrad 3.618e-03 5.038e-04 7.183 6.84e-13 ***
## P.Undergrad 3.603e-04 5.524e-04 0.652 0.5142
## Outstate 3.008e-04 1.182e-04 2.545 0.0109 *
## Room.Board 3.278e-04 2.892e-04 1.133 0.2570
## Books 4.030e-04 1.762e-03 0.229 0.8191
## Personal -5.774e-04 5.285e-04 -1.093 0.2746
## PhD 3.754e-04 2.459e-02 0.015 0.9878
## Terminal 1.647e-02 2.674e-02 0.616 0.5380
## S.F.Ratio -5.711e-02 7.555e-02 -0.756 0.4498
## perc.alumni -1.898e-02 2.549e-02 -0.744 0.4566
## Expend -1.226e-05 1.132e-04 -0.108 0.9137
## Grad.Rate 2.482e-02 2.048e-02 1.212 0.2254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 539.14 on 388 degrees of freedom
## Residual deviance: 134.23 on 373 degrees of freedom
## AIC: 166.23
##
## Number of Fisher Scoring iterations: 9
## calculate training misclassification rate
trainingProbs = predict(fit, type = "response")
trainingPred = rep(0, nrow(trainingSet))
trainingPred[trainingProbs > 0.5] = 1
table(trainingPred, trainingSet$Y)
##
## trainingPred 0 1
## 0 182 15
## 1 9 183
## training misclassification rate
1 - mean(trainingPred == trainingSet$Y)
## [1] 0.06169666
## calculate test misclassification rate
testProbs = predict(fit, newdata = testSet, type = "response")
testPred = rep(0, nrow(testSet))
testPred[testProbs > 0.5] = 1
table(testPred, testSet$Y)
##
## testPred 0 1
## 0 179 18
## 1 18 173
## test misclassification rate
1 - mean(testPred == testSet$Y)
## [1] 0.09278351
The error rates for the training set and the test set are both in the 6%-9% range, so the logistic regression fits this classification problem well. The most significant predictors here are `F.Undergrad` and `Outstate`; `Grad.Rate`, which was significant in the linear model, is not significant here. Since `F.Undergrad` is highly significant in both models, we can conclude it is the factor most strongly related to `Apps`.
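As a small interpretive sketch based on the fitted `fit` object above, the logistic-regression coefficients can be converted into odds ratios with `exp()`; for example, the multiplicative change in the odds of `Apps` being above the median for every additional 1,000 full-time undergraduates:
exp(1000 * coef(fit)["F.Undergrad"])   # odds multiplier per 1,000 additional full-time undergraduates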