library(ISLR)
library(MASS)
A flexible method would work better in this case since the sample size is so large. Additionally, since the number of predictors p is small, the model might still be interpretable.
For the opposite case, an inflexible method might be better. To fit a flexible method we need a large sample size; otherwise the fit will have very high variance. Additionally, we might need an inflexible model in order to understand the effects of the large number of predictors.
We might want a more flexible method if the relationship between the predictors and the response is non-linear. An inflexible model in this case might be highly biased.
Flexible models are prone to higher variance because they fit the training set so closely; when the variance of the error terms is extremely high, a flexible model will chase that noise, so an inflexible model might work better.
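Not part of the exercise, but a small simulated sketch makes the trade-off concrete: an inflexible linear fit versus a flexible smoothing spline on noisy non-linear data (the data here are made up).
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.5)      # non-linear truth plus noise
linfit <- lm(y ~ x)                     # inflexible: badly biased here
splfit <- smooth.spline(x, y, df = 20)  # flexible: low bias, higher variance
mean(residuals(linfit)^2)               # training MSE of the linear fit
mean((y - predict(splfit, x)$y)^2)      # training MSE of the spline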
Regression - response is quantitative
Inference - they set out to find out which factors affect CEO salary
n = 500
p = 3
Classification - response variable is binary
Prediction - they want to know whether a product will be successful or a failure
n = 20
p = 13
Regression - response is quantitative
Prediction - they want to predict percent change in the US dollar value
n = 52
p = 3
Some of the advantages of a more flexible approach are the following: decreased bias, (usually) more accurate predictions, and the ability to capture complex, non-linear relationships. Some of the disadvantages are the following: increased variance, higher computational cost, and decreased ability to interpret the relationship between predictors and response.
A more flexible approach might be preferred when the sample size is large; this way, the possibility for overfitting is lessened. A flexible approach might also be preferred if you have a lot of computing power, if the relationship between the predictors and the response is non-linear, and if you just want a prediction without much regard for the actual relationship between each predictor and the response.
A less flexible approach might be preferred in cases when the goal is to gain information about the relationship between the response and the predictors. It might also be preferred in cases with a small sample size, or when there is not much computing power.
To transform this from a regression to a classification problem, we could instead classify CEO salary into a set of ranges. For example: [<100k, 100k-200k, 200k-500k, 500k-1mil, >1mil] could be the five groups, and each response would be classified into one of them.
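A quick sketch of that binning with cut(); the salary values here are invented for illustration.
salary <- c(85000, 150000, 420000, 750000, 2000000)
groups <- cut(salary, breaks = c(0, 1e5, 2e5, 5e5, 1e6, Inf),
              labels = c("<100k", "100k-200k", "200k-500k", "500k-1mil", ">1mil"))
table(groups)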
To make this a regression problem, we could instead predict the probability of success, a quantitative response between 0 and 1 (for example, with logistic regression).
Instead of the percent change in the US dollar value, they could record whether the dollar rises or falls each week. In this case, we have two classes for the response variable (rising or falling).
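A minimal sketch of that recoding; the weekly percent changes below are invented.
pct_change <- c(0.4, -1.2, 0.1, -0.3)
direction <- factor(ifelse(pct_change > 0, "Rise", "Fall"))
table(direction)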
# a)
# I did not use read.csv() since the data frame
# was included in the ISLR package
college <- College
# b)
# the college dataset in the ISLR package
# was already cleaned according to part b).
# I'll write the code anyway and comment it out
# to show I know how to do it
# fix(college)
# rownames(college) = college[,1]
# fix(college)
# college = college[,-1]
# fix(college)  # interactive data editor; commented out so the document knits
# c)
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[,1:10])
# Private is a factor, so plot() produces side-by-side boxplots
plot(college$Private, college$Outstate, xlab = "Private", ylab = "Out-of-state Tuition")
# create Elite: "Yes" if more than 50% of new students came from
# the top 10% of their high school class
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = ">50% of students come from the top 10% of their HS", ylab = "Out-of-state tuition")
par(mfrow= c(2,2))
hist(college$F.Undergrad, main = "Number of full-time undergraduates")
hist(college$Room.Board, main = "Room and board costs")
hist(college$Expend, main = "Instructional Expenditure per Student")
hist(college$Grad.Rate, main = "Graduation Rate")
par(mfrow = c(1,1))
plot(college$Outstate, college$Expend, xlab = "Out-of-state Tuition", ylab = "Instructional Expenditure per Student")
cor(college$Outstate, college$Expend)
## [1] 0.6727786
college["University of Pittsburgh",]
## Private Apps Accept Enroll Top10perc
## University of Pittsburgh-Main Campus No 8586 6383 2503 25
## Top25perc F.Undergrad P.Undergrad Outstate
## University of Pittsburgh-Main Campus 59 13138 4289 10786
## Room.Board Books Personal PhD Terminal
## University of Pittsburgh-Main Campus 4560 400 900 93 93
## S.F.Ratio perc.alumni Expend Grad.Rate
## University of Pittsburgh-Main Campus 7.8 10 13789 66
## Elite
## University of Pittsburgh-Main Campus No
The vast majority of colleges have fewer than 5,000 full-time undergraduates, which makes sense logically, but still surprised me. The average graduation rate is 65.46%, which was lower than I expected. I was also interested in plotting instructional expenditure per student vs. out-of-state tuition; the variables had a moderate correlation of about 0.67. Lastly, I couldn’t help but look up how Pitt fared in this dataset. 59% of new students were in the top quarter of their high school class, and 93% of faculty have PhDs. Instructional expenditure per student is $13,789, which is fairly average among these colleges. A stat the institution might not want us to know: only 10% of alumni donate, which is on the very low end (at the first quartile). That said, this dataset is from 1995, and things may have changed significantly in the past 26 years.
auto <- Auto
apply(auto, 2, anyNA)
## mpg cylinders displacement horsepower weight acceleration
## FALSE FALSE FALSE FALSE FALSE FALSE
## year origin name
## FALSE FALSE FALSE
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative predictors: origin, name
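Since origin is coded numerically but is really qualitative, it could be recoded as a factor (per the ISLR documentation, 1 = American, 2 = European, 3 = Japanese). I do this on a copy so the later chunks are unaffected.
auto2 <- auto
auto2$origin <- factor(auto2$origin, levels = c(1, 2, 3),
                       labels = c("American", "European", "Japanese"))
summary(auto2$origin)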
# keep only the quantitative columns (drop origin and name)
autoquant <- auto[, 1:(dim(auto)[2] - 2)]
apply(autoquant, 2, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
# c)
apply(autoquant, 2, mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
apply(autoquant, 2, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
# d)
# remove observations 10 through 85
autoquant <- autoquant[-c(10:85), ]
# helper returning the range, mean, and sd of a numeric vector
mult <- function(x) {
  c(range = range(x), mean = mean(x), sd = sd(x))
}
apply(autoquant, 2, mult)
## mpg cylinders displacement horsepower weight acceleration
## range1 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## range2 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## sd 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year
## range1 70.000000
## range2 82.000000
## mean 77.145570
## sd 3.106217
# e)
pairs(auto)
cor(auto[,1:7])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## acceleration year
## mpg 0.4233285 0.5805410
## cylinders -0.5046834 -0.3456474
## displacement -0.5438005 -0.3698552
## horsepower -0.6891955 -0.4163615
## weight -0.4168392 -0.3091199
## acceleration 1.0000000 0.2903161
## year 0.2903161 1.0000000
After viewing the scatterplots between all predictors in the dataset, there seemed to be many pairs of predictors that are highly correlated. Some of the strongest positively correlated variables are the following: cylinders and displacement, cylinders and weight, displacement and horsepower, displacement and weight, and horsepower and weight. Some of the strongest negatively correlated variables are the following: mpg and cylinders, mpg and displacement, mpg and horsepower, mpg and weight, and horsepower and acceleration.
Cylinders, displacement, horsepower, and weight are all strongly correlated with mpg (|r| > 0.77) and might be useful in predicting it; acceleration and year show only moderate correlations. However, those strong predictors are also highly correlated with one another, so we may not need all of them to obtain an adequate prediction for mpg.
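As a rough follow-up (not required by the exercise), a linear model on those four predictors could be fit and inspected; output omitted.
fit <- lm(mpg ~ cylinders + displacement + horsepower + weight, data = auto)
summary(fit)  # collinearity among these predictors may inflate standard errors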
The data frame has 506 rows, with each row representing a town in Boston. It has 14 columns which each represent demographic, real estate, and environmental data about each town.
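Those dimensions can be confirmed directly:
dim(Boston)
## [1] 506  14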
par(mar=c(1,1,1,1))
pairs(Boston)
Nitrogen oxide concentration is negatively correlated with the weighted mean of distances to the five employment centers. Crime is also negatively correlated with this distance variable. This would seem to back up the traditional view of the “suburbs”: the further you are from the center of a city, the cleaner the air and lower the crime (based on only these three variables).
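The correlations behind that claim can be pulled out directly (output omitted; both values should come out negative):
cor(Boston$dis, Boston[, c("nox", "crim")])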
cor(Boston$crim, Boston)
## crim zn indus chas nox rm age
## [1,] 1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
## dis rad tax ptratio black lstat medv
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046
The two predictors most highly correlated with higher per capita crime rate are: property tax rate and accessibility to radial highways.
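A compact way to get that ranking from the correlation matrix; based on the values printed above, rad (about 0.63) and tax (about 0.58) lead the list.
sort(abs(cor(Boston)[1, -1]), decreasing = TRUE)[1:5]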
sapply(Boston, range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 0.32
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 396.90
## lstat medv
## [1,] 1.73 5
## [2,] 37.97 50
The highest per capita crime rate is about 89, which is very high. The highest property tax rate is $711 per $10,000 of property value, which is high relative to the minimum of $187. Lastly, the pupil-teacher ratio has a fairly wide range, but the maximum of 22 students per teacher does not seem extraordinarily high considering my elementary and middle schools had 25-30 students per teacher.
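For reference, a quick way to see those three distributions side by side (output omitted):
summary(Boston[, c("crim", "tax", "ptratio")])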
sum(Boston$chas)
## [1] 35
35 towns bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05 students per teacher.
Boston[Boston$medv == min(Boston$medv),]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90 30.59
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98
## medv
## 399 5
## 406 5
Suburbs 399 and 406 have the lowest median value of owner-occupied homes. Both have relatively high crime rates, high NOx concentrations, a high proportion of units built before 1940, short distances to Boston's employment centers, high property tax rates, and a high proportion of black residents.
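For context, the overall column medians can be lined up against those two rows (output omitted):
round(sapply(Boston, median), 2)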
seven <- Boston[Boston$rm > 7,]
eight <- Boston[Boston$rm > 8,]
nrow(seven)
## [1] 64
nrow(eight)
## [1] 13
There are 64 suburbs averaging more than 7 rooms per dwelling, and 13 averaging more than 8.
sapply(eight, range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.02009 0 2.68 0 0.4161 8.034 8.4 1.8010 2 224 13.0 354.55
## [2,] 3.47428 95 19.58 1 0.7180 8.780 93.9 8.9067 24 666 20.2 396.90
## lstat medv
## [1,] 2.47 21.9
## [2,] 7.44 50.0
sapply(Boston, range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 0.32
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 396.90
## lstat medv
## [1,] 1.73 5
## [2,] 37.97 50
In the towns averaging more than eight rooms per dwelling, crime rates are uniformly low relative to the other suburbs, and the proportion of black residents is high. There is also a lower percentage of “lower-status” residents and high median values of owner-occupied homes.
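The same comparison can be made with column means instead of ranges (a sketch; output omitted):
round(rbind(rm_gt_8 = sapply(eight, mean), all = sapply(Boston, mean)), 2)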