library(ISLR)
library(MASS)

Conceptual Exercises

1) Which method would be better: flexible or inflexible?

a)

A flexible method would likely work better in this case: with such a large sample size, a flexible fit can be estimated without excessive variance. Additionally, since the number of predictors p is small, the model might still be interpretable.

b)

For the opposite case, an inflexible method might be better. Fitting a flexible method requires a large sample size; otherwise the fit will have very high variance. Additionally, an inflexible model makes it easier to understand the effects of the large number of predictors.

c)

We might want a more flexible method if the relationship between the predictors and the response is non-linear, since an inflexible model would be highly biased in this case.

d)

Flexible models are prone to higher variance because they fit the training set so closely; when the variance of the error terms is high, a flexible model would end up fitting the noise, so an inflexible model might work better.

2) Classification or regression and inference or prediction

a)

Regression - response is quantitative
Inference - they set out to find out which factors affect CEO salary
n = 500
p = 3

b)

Classification - response variable is binary
Prediction - they want to know whether a product will be successful or a failure
n = 20
p = 13

c)

Regression - response is quantitative
Prediction - they want to predict percent change in the US dollar value
n = 52
p = 3

3)

a)

To transform this from a regression problem into a classification problem, we could bin CEO salary into a set of ranges. For example, the groups could be [<100k, 100k-200k, 200k-500k, 500k-1M, >1M], and each response would be classified into one of these five salary classes (see the cut() sketch after part c).

b)

To make this a regression problem, we could instead predict the probability of success, for example with logistic regression, which outputs a probability between 0 and 1.

c)

Instead of the percent change in the US dollar value, we could record whether the dollar rises or falls. The response variable then has two classes (rising or falling).
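As a quick illustration of part (a), R's cut() can bin a quantitative salary variable into the five classes above. The salary values here are made up for the example:

# hypothetical CEO salaries, in thousands of dollars
salary <- c(85, 150, 320, 750, 2400)

# bin the quantitative response into the five salary classes
cut(salary,
    breaks = c(0, 100, 200, 500, 1000, Inf),
    labels = c("<100k", "100k-200k", "200k-500k", "500k-1M", ">1M"))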

5)

Some of the advantages of a more flexible approach are the following: decreased bias, (usually) more accurate predictions, and the ability to capture complex, non-linear relationships. Some of the disadvantages are the following: increased variance, greater computational cost, and a decreased ability to interpret the relationship between the predictors and the response.

A more flexible approach might be preferred when the sample size is large, since a large sample lessens the risk of overfitting. It might also be preferred when plenty of computing power is available, when the relationship between the predictors and the response is non-linear, or when the goal is simply accurate prediction, without much regard for the relationship between each predictor and the response.

A less flexible approach might be preferred when the goal is inference about the relationship between the response and the predictors, when the sample size is small, or when computing power is limited.
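A minimal simulated sketch of this tradeoff (the data and the choice of df = 25 are invented for illustration): a very flexible smoothing spline chases the noise in a small sample, while a linear fit is stable but biased when the true relationship is non-linear.

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.5)  # non-linear truth plus noise

inflexible <- lm(y ~ x)                     # rigid fit: low variance, but biased here
flexible <- smooth.spline(x, y, df = 25)    # wiggly fit: low bias, high variance

plot(x, y)
abline(inflexible, col = "blue")
lines(predict(flexible, sort(x)), col = "red")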

Applied Exercises

8)

# a)

# I did not use read.csv() since the data frame
# was included in the ISLR package 

college <- College

# b)

# the college dataset in the ISLR package
# was already cleaned according to part b). 
# I'll write the code anyway and comment it out
# to show I know how to do it

# fix(college)
# rownames(college) = college[,1]
# fix(college)

# college = college[,-1]

# view the data in R's spreadsheet-style editor
fix(college)

# c) 

summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
pairs(college[,1:10])

plot(college$Private, college$Outstate, xlab = "Private", ylab = "Out-of-state Tuition")

# flag colleges where more than 50% of new students
# came from the top 10% of their high school class
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate, xlab = ">50% of students from the top 10% of their HS class", ylab = "Out-of-state tuition")

par(mfrow = c(2, 2))
hist(college$F.Undergrad, main = "Number of full-time undergraduates")
hist(college$Room.Board, main = "Room and board costs")
hist(college$Expend, main = "Instructional Expenditure per Student")
hist(college$Grad.Rate, main = "Graduation Rate")

par(mfrow = c(1,1))
plot(college$Outstate, college$Expend, xlab = "Out-of-state Tuition", ylab = "Instructional Expenditure per Student")

cor(college$Outstate, college$Expend)
## [1] 0.6727786
college["University of Pittsburgh",]
##                                      Private Apps Accept Enroll Top10perc
## University of Pittsburgh-Main Campus      No 8586   6383   2503        25
##                                      Top25perc F.Undergrad P.Undergrad Outstate
## University of Pittsburgh-Main Campus        59       13138        4289    10786
##                                      Room.Board Books Personal PhD Terminal
## University of Pittsburgh-Main Campus       4560   400      900  93       93
##                                      S.F.Ratio perc.alumni Expend Grad.Rate
## University of Pittsburgh-Main Campus       7.8          10  13789        66
##                                      Elite
## University of Pittsburgh-Main Campus    No

Quick Summary of the Data

The vast majority of colleges have fewer than 5,000 full-time undergraduates, which makes sense logically, but still surprised me. The average graduation rate is 65.46%, which was lower than I expected. I was also interested in plotting instructional expenditure per student against out-of-state tuition; the two variables have a moderate correlation of about 0.67. Lastly, I couldn't help but look up how Pitt fared in this dataset. 59% of new students were in the top quarter of their high school class, and 93% of faculty have PhDs. Also, instructional expenditure per student is $13,789, which is fairly average among colleges. A stat the institution might not really want us to know: only 10% of alumni donate, which is on the very low end (below the first quartile). That said, this dataset is from 1995, and things may have changed significantly in the past 26 years.

9)

a)

auto <- Auto
apply(auto, 2, anyNA)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE 
##         year       origin         name 
##        FALSE        FALSE        FALSE
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative predictors: origin, name (origin is stored numerically, but per ?Auto its values 1, 2, and 3 code for American, European, and Japanese)
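To make that coding explicit, origin can be converted to a factor (a side note beyond what the exercise asks):

# per ?Auto: 1 = American, 2 = European, 3 = Japanese
origin_f <- factor(auto$origin, levels = c(1, 2, 3),
                   labels = c("American", "European", "Japanese"))
table(origin_f)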

b)

# keep the quantitative columns by dropping the last two (origin and name)
autoquant <- auto[, 1:(ncol(auto) - 2)]
apply(autoquant, 2, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82
# c)

apply(autoquant, 2, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
apply(autoquant, 2, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737
# d)

# remove observations 10 through 85
autoquant <- autoquant[-(10:85), ]

# range, mean, and standard deviation of a numeric vector
mult <- function(x) {
        c(range = range(x), mean = mean(x), sd = sd(x))
}

apply(autoquant, 2, mult)
##              mpg cylinders displacement horsepower    weight acceleration
## range1 11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## range2 46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## mean   24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## sd      7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##             year
## range1 70.000000
## range2 82.000000
## mean   77.145570
## sd      3.106217
# e)

pairs(auto)

cor(auto[,1:7])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

The scatterplot matrix suggests many pairs of highly correlated predictors. Among the strongest positive correlations: cylinders and displacement, cylinders and weight, displacement and horsepower, displacement and weight, and horsepower and weight. Among the strongest negative correlations: mpg and cylinders, mpg and displacement, mpg and horsepower, mpg and weight, and horsepower and acceleration.

f)

Cylinders, displacement, horsepower, and weight might all be useful in predicting mpg, since each is strongly correlated with mpg (|r| > 0.75); year and acceleration are only moderately correlated with it. However, the strong predictors are also highly correlated with one another, so we may not need all of them to obtain an adequate prediction for mpg.
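As a quick sanity check (a sketch that goes slightly beyond the exercise), a small linear model shows that a couple of these variables already account for much of the variation in mpg:

# proportion of variance in mpg explained by weight and year alone
fit <- lm(mpg ~ weight + year, data = auto)
summary(fit)$r.squared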

10)

a)

The Boston data frame (from the MASS package) has 506 rows, each representing a suburb of Boston, and 14 columns containing demographic, real-estate, and environmental measurements for each suburb.

b)

par(mar = c(1, 1, 1, 1))
pairs(Boston)

Nitrogen oxide concentration is negatively correlated with the weighted mean of distances to the five Boston employment centers, and crime rate is also negatively correlated with this distance variable. This seems to back up the traditional view of the "suburbs": the further you are from the center of a city, the cleaner the air and the lower the crime (based on these three variables alone).
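The two correlations behind this claim can be checked directly (both come out negative):

cor(Boston$dis, Boston$nox)
cor(Boston$dis, Boston$crim)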

c)

cor(Boston$crim, Boston)
##      crim         zn     indus        chas       nox         rm       age
## [1,]    1 -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343
##             dis       rad       tax   ptratio      black     lstat       medv
## [1,] -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046

The two predictors most highly correlated with per capita crime rate are accessibility to radial highways (rad, r ≈ 0.63) and the property tax rate (tax, r ≈ 0.58).
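To rank all of the correlations with crim at once, they can be sorted by absolute value (a small convenience sketch):

# correlations with crim, strongest first (crim itself comes out on top at 1)
crim_cor <- cor(Boston)[, "crim"]
crim_cor[order(abs(crim_cor), decreasing = TRUE)]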

d)

sapply(Boston, range)
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50

The highest per capita crime rate is about 89, which is very high. The highest property tax rate is $711 per $10,000, which is relatively high compared to the minimum of $187. Lastly, the pupil-teacher ratio has a somewhat large range, but the maximum of 22 students per teacher does not seem extraordinarily high considering my elementary and middle schools had 25-30 students per teacher.
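Because the crime distribution is so skewed, it is worth counting how many suburbs sit in that extreme tail (the threshold of 25 is arbitrary):

# number of suburbs with a per capita crime rate above 25
sum(Boston$crim > 25)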

e)

sum(Boston$chas)
## [1] 35

Since chas is a 0/1 dummy variable, its sum counts the towns: 35 towns bound the Charles River.

f)

median(Boston$ptratio)
## [1] 19.05

The median pupil-teacher ratio is 19.05 students per teacher.

g)

Boston[Boston$medv == min(Boston$medv),]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98
##     medv
## 399    5
## 406    5

Suburbs 399 and 406 have the lowest median value of owner-occupied homes ($5,000). Both have relatively high crime rates, high NOx concentrations, all of their units built before 1940, short distances to the employment centers, high property tax rates, and high values of the black variable (an index defined as 1000(Bk - 0.63)^2, not a direct proportion of black residents).
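As a rough way to quantify how extreme these two suburbs are, each of their values can be placed at its empirical percentile within the full data set (a quick sketch; 1.0 marks the maximum):

# empirical percentile of suburb 399 for every variable
sapply(names(Boston), function(v) round(ecdf(Boston[[v]])(Boston[399, v]), 2))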

h)

seven <- Boston[Boston$rm > 7,]
eight <- Boston[Boston$rm > 8,]
nrow(seven)
## [1] 64
nrow(eight)
## [1] 13

There are 64 suburbs averaging more than 7 rooms per dwelling, and 13 averaging more than 8 rooms per dwelling.

sapply(eight, range)
##         crim zn indus chas    nox    rm  age    dis rad tax ptratio  black
## [1,] 0.02009  0  2.68    0 0.4161 8.034  8.4 1.8010   2 224    13.0 354.55
## [2,] 3.47428 95 19.58    1 0.7180 8.780 93.9 8.9067  24 666    20.2 396.90
##      lstat medv
## [1,]  2.47 21.9
## [2,]  7.44 50.0
sapply(Boston, range)
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50

In the suburbs averaging more than eight rooms per dwelling, crime rates are uniformly low relative to the other suburbs, and values of the black index are high. These suburbs also have a lower percentage of "lower-status" residents and high median values of owner-occupied homes.
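One last comparison, as a quick sketch: the column means of the rm > 8 subset set side by side with those of the full data set.

# compare the average suburb with rm > 8 against the average overall
round(rbind(rm_gt_8 = colMeans(eight), all = colMeans(Boston)), 2)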