Number 2

  a. Regression. Inference. n = 500, p = 3.
  b. Classification. Prediction. n = 20, p = 13.
  c. Regression. Prediction. n = 52, p = 3.

Number 5

An advantage of a flexible model is that it can fit many different possible functional forms for f, which reduces bias. A disadvantage of a very flexible model, compared with a less flexible one, is that it requires estimating more parameters and can overfit by following the noise in the training data. A more flexible model may be ideal when the goal is prediction. On the other hand, if we are interested in inference, a more restrictive model (such as a linear model) would be ideal because it is easier to interpret.
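The trade-off described above can be illustrated with a small simulation. This is only a sketch: the "true" function, the noise level, the sample sizes, and the polynomial degree are all assumptions chosen for illustration, not part of the exercise.

```r
# Simulated illustration: training vs. test error for a rigid and a flexible fit.
set.seed(1)
f <- function(x) sin(2 * x)                      # assumed "true" f
train <- data.frame(x = runif(30, 0, 3))
train$y <- f(train$x) + rnorm(30, sd = 0.3)
test <- data.frame(x = runif(1000, 0, 3))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fits <- list(
  linear   = lm(y ~ x, data = train),            # inflexible
  degree12 = lm(y ~ poly(x, 12), data = train)   # very flexible
)
results <- sapply(fits, function(fit) c(train = mse(fit, train), test = mse(fit, test)))
print(results)
```

Because the degree-12 polynomial contains the straight line as a special case, its training error cannot be larger than the linear fit's. The test error has no such guarantee, and that gap is where overfitting shows up.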

Number 6

A parametric statistical learning approach reduces the problem of estimating f to one of estimating a set of parameters, because it assumes a particular functional form for f. A non-parametric method avoids that assumption. An advantage of a parametric method is that estimating a small set of parameters (for example, the coefficients of a linear model) is generally easier and requires fewer observations than fitting an entirely arbitrary function. A disadvantage is that if the chosen form is far from the true f, the estimate will be poor.
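As a rough sketch of the distinction, the code below fits a parametric model (lm) and a non-parametric one (loess) to the same simulated data. The linear truth, sample size, and noise level are assumptions made only for this illustration.

```r
# Parametric vs. non-parametric fits on the same simulated data.
set.seed(2)
x <- sort(runif(100, 0, 10))
y <- 2 + 0.5 * x + rnorm(100)      # assumed linear truth

param_fit <- lm(y ~ x)             # assumes f(x) = b0 + b1 * x
nonparam_fit <- loess(y ~ x)       # local regression, no global form assumed

print(coef(param_fit))             # the whole fit is summarized by two numbers
print(head(predict(nonparam_fit))) # loess yields fitted values, not a formula
```

The parametric fit is fully described by its two coefficients, while the loess fit can only be described through its fitted values, which is why non-parametric methods typically need more observations.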

Number 8

Part A

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ISLR)
data(College)
write.csv(College, "/Users/admin/Desktop/college.csv")
setwd("/Users/admin/Desktop")
college <- read.csv("college.csv")

Part B

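rownames(college) <- college[, 1]  # keep college names as row labels
college <- college[, -1]           # drop the name column from the data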

Part C

summary(college)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00
college[,1] = as.numeric(factor(college[,1]))
pairs(college[,1:10])

# Side-by-side boxplots of Outstate for each value of Private (1 = No, 2 = Yes)
boxplot(Outstate ~ Private, data = college)

Elite <- rep ("No", nrow (college))
Elite[college$Top10perc > 50] <- " Yes "
Elite <- as.factor (Elite)
college <- data.frame (college , Elite)
summary(college)
##     Private           Apps           Accept          Enroll       Top10perc    
##  Min.   :1.000   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  1st Qu.:1.000   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##  Median :2.000   Median : 1558   Median : 1110   Median : 434   Median :23.00  
##  Mean   :1.727   Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##  3rd Qu.:2.000   3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##  Max.   :2.000   Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate        Elite    
##  Min.   : 10.00    Yes : 78  
##  1st Qu.: 53.00   No   :699  
##  Median : 65.00              
##  Mean   : 65.46              
##  3rd Qu.: 78.00              
##  Max.   :118.00
# Side-by-side boxplots of Outstate for each level of Elite
boxplot(Outstate ~ Elite, data = college)

par(mfrow = c(2, 2))
hist(college$Top10perc)

hist(college$Room.Board)

hist(college$Enroll)

hist(college$Books)

From the data above, we can compare multiple variables through histograms and boxplots. For instance, we can see that most room and board costs are in the $3,000-5,000 range. We can also see that at most colleges roughly 10-30% of new students were in the top ten percent of their high school class. The estimated book cost is around $500 at most colleges.

From the scatterplot matrix, we can see that there is a strong positive relationship between some variables. For example, applications received (Apps), applications accepted (Accept), and new students enrolled (Enroll) move closely together, as do Top10perc and Top25perc.
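A visual reading of a scatterplot matrix can be checked numerically with a correlation matrix. The sketch below assumes the ISLR package is installed and uses its copy of College directly; the choice of five columns is only for illustration.

```r
# Pairwise correlations for a handful of College variables.
vars <- c("Apps", "Accept", "Enroll", "Top10perc", "Top25perc")
cor_mat <- cor(ISLR::College[, vars])
print(round(cor_mat, 2))
```

Values near 1 or -1 correspond to the tight, nearly linear clouds in the pairs plot, and values near 0 correspond to panels with no visible trend.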

Number 9

Part A

data("Auto")
write.csv(Auto, "/Users/admin/Desktop/auto.csv")
Auto <- read.csv("auto.csv")
Auto <- na.omit(Auto)
head(Auto)
as.factor(Auto$origin)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 2 2 2 2 1 1 1 1 1 3 1 3 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 2 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
##  [75] 1 2 2 2 2 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 3 3
## [112] 1 1 2 1 1 2 2 2 2 1 2 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 1 2 2 3
## [149] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 3 1 2 1 2 2 2 2 3 2 2 1 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 2 3 3 1 2 1 2 3 2 1 1 1 1 3 1 2 1 3 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 3 2 3 2 3 2 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 3 3 1 3 1 1 3 2 2 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2
## [297] 1 2 1 1 1 3 2 1 1 1 1 2 3 1 3 1 1 1 1 2 3 3 3 3 3 1 3 2 2 2 2 3 3 2 3 3 2
## [334] 3 1 1 1 1 1 3 1 3 3 3 3 3 1 1 1 2 3 3 3 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [371] 3 3 1 1 3 3 3 3 3 3 1 1 1 1 3 1 1 1 2 1 1 1
## Levels: 1 2 3
Auto$origin <- as.factor(Auto$origin)

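name and origin are the only qualitative variables; the rest (mpg, cylinders, displacement, horsepower, weight, acceleration, and year) are quantitative.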

Part B

# Range of each quantitative variable (columns 2-8; column 1 holds the row names written by write.csv)
sapply(Auto[2:8], function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##         37.6          5.0        387.0        184.0       3527.0         16.8 
##         year 
##         12.0

Part C

sapply(Auto[c(2:8)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
sapply(Auto[c(2:8)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

Part D

newauto <- Auto[-c(10:85), ]
sapply(newauto[2:8], function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##         35.6          5.0        387.0        184.0       3348.0         16.3 
##         year 
##         12.0
sapply(newauto[c(2:8)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
sapply(newauto[c(2:8)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217
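The range, mean, and standard deviation computed in Parts B to D can also be collected into one table. The helper below is a hypothetical convenience function (describe_numeric is not part of any package), shown on a toy data frame rather than on Auto.

```r
# Hypothetical helper: range, mean, and sd for each numeric column of a data frame.
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]   # keep only numeric columns
  data.frame(
    range = sapply(num, function(x) diff(range(x, na.rm = TRUE))),
    mean  = sapply(num, mean, na.rm = TRUE),
    sd    = sapply(num, sd, na.rm = TRUE)
  )
}

toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40),
                  label = c("w", "x", "y", "z"))
full_summary <- describe_numeric(toy)
subset_summary <- describe_numeric(toy[-(2:3), ])   # same summary with rows 2-3 removed
print(full_summary)
print(subset_summary)
```

Applied to the homework data, the same calls would be describe_numeric(Auto[2:8]) and describe_numeric(Auto[-(10:85), 2:8]), assuming Auto has been read in as above.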

Part E

Auto$cylinders <- as.factor(Auto$cylinders)
pairs(Auto[, c(2:8)])

par(mfrow = c(1, 2))
hist(Auto$acceleration)
hist(Auto$mpg)

We can see that year and mpg have a positive correlation, but year and cylinders do not have a linear relationship or correlation. We can also see that the majority of cars have an mpg of about 20-30.

From the plots above, year appears positively related to mpg and may therefore be useful in predicting it.
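One way to back up a reading of the plots is to rank the quantitative variables by the strength of their linear correlation with mpg. This sketch assumes the ISLR package is installed and uses its copy of Auto directly, so it does not depend on the CSV round trip above.

```r
# Correlation of each quantitative variable with mpg, strongest first.
auto <- ISLR::Auto
num_vars <- c("cylinders", "displacement", "horsepower",
              "weight", "acceleration", "year")
mpg_cor <- sapply(auto[num_vars], function(x) cor(x, auto$mpg))
print(round(mpg_cor[order(abs(mpg_cor), decreasing = TRUE)], 2))
```

Correlation only measures linear association, so a variable with a curved relationship to mpg may be more useful for prediction than its coefficient suggests.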

Number 10

Part A

library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
?Boston

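There are 506 rows and 13 variables in the data set. Each row is a census tract (suburb) of Boston, and each column records a housing-related measurement for that tract, such as the per capita crime rate or the median home value.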

Part B

data("Boston")
write.csv(Boston, "/Users/admin/Desktop/boston.csv")
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)
pairs(Boston)

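The scatterplot matrix shows both positive and negative associations between pairs of variables. For example, medv rises with rm and falls with lstat, and nox falls as dis increases.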

Part C

Several predictors are associated with per capita crime rate. lstat (the percentage of lower-status population) has a moderate positive correlation with crime rate (about 0.46), but the correlation matrix in Part D shows that rad (0.63) and tax (0.58) have the strongest linear association with crim. Other variables such as dis (weighted distance to Boston employment centers), medv (median home value), and rm (average rooms per dwelling) have weaker, negative associations with crime.
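The predictors can be ranked by the strength of their linear association with crim by taking one row of the correlation matrix. This is a minimal sketch that assumes the ISLR2 package is installed and uses its copy of Boston.

```r
# Absolute correlation of every other variable with crim, strongest first.
crim_cor <- cor(ISLR2::Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]   # drop crim's correlation with itself
print(round(sort(abs(crim_cor), decreasing = TRUE), 2))
```

Sorting on the absolute value puts strong negative associations alongside strong positive ones, which is easier to scan than the full 13 by 13 matrix.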

Part D

cor(Boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##              lstat       medv
## crim     0.4556215 -0.3883046
## zn      -0.4129946  0.3604453
## indus    0.6037997 -0.4837252
## chas    -0.0539293  0.1752602
## nox      0.5908789 -0.4273208
## rm      -0.6138083  0.6953599
## age      0.6023385 -0.3769546
## dis     -0.4969958  0.2499287
## rad      0.4886763 -0.3816262
## tax      0.5439934 -0.4685359
## ptratio  0.3740443 -0.5077867
## lstat    1.0000000 -0.7376627
## medv    -0.7376627  1.0000000
par(mfrow = c(1, 3))
hist(Boston$crim)
hist(Boston$tax)
hist(Boston$ptratio)

Many Boston suburbs have a per capita crime rate close to 0. There is also a high frequency of tracts with a tax value near 700 (property tax rate per $10,000 of value). In the ptratio histogram, there is a spike in frequency at a pupil-teacher ratio of about 20.

Part E

ChasPos <- subset(Boston, chas == 1)

There are 35 census tracts that bound the Charles River.
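A count like this can be read off directly by summing a logical vector, without creating a subset first. The vector below is a toy stand-in for Boston$chas, not the real data.

```r
# Counting rows that satisfy a condition: sum() of a logical vector.
chas <- c(0, 1, 0, 0, 1, 1, 0)      # toy stand-in for Boston$chas
n_river <- sum(chas == 1)           # number of TRUE values
share_river <- mean(chas == 1)      # proportion of TRUE values
print(n_river)
print(share_river)
```

On the homework data, the equivalent calls would be sum(Boston$chas == 1) or nrow(ChasPos).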

Part F

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

The median pupil-teacher ratio is 19.05.

Part G

Boston <- data.frame("CTract" = c(1:length(Boston$crim)), Boston)
filter(Boston, medv == min(medv))
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

Census tracts 399 and 406 have the lowest median value, 5 (in $1,000s). For these two tracts, most of the values are similar, with the exception of crim, the per capita crime rate, which is much lower for tract 399. Both tracts, however, have a higher per capita crime rate than most of the data set, as the summary of crim above shows.
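To say how unusual a particular tract's value is, one option is to convert it to a percentile with ecdf(). The numbers below are toy values chosen for illustration, not Boston data.

```r
# Where does a value sit in a variable's distribution? ecdf() gives the percentile.
crim <- c(0.01, 0.05, 0.2, 0.3, 1.5, 4.0, 9.0, 40.0)   # toy values
percentile_of <- ecdf(crim)
pct <- percentile_of(c(9.0, 40.0))   # share of observations at or below each value
print(pct)
```

On the homework data, ecdf(Boston$crim)(Boston$crim[Boston$medv == min(Boston$medv)]) would give the crime-rate percentiles of the lowest-medv tracts, assuming Boston is loaded as above.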

Part H

MoreThan7Rm <- subset(Boston, rm > 7)
MoreThan8Rm <- subset(Boston, rm > 8)

There are 64 census tracts that average more than 7 rooms per dwelling and 13 that average more than 8 rooms per dwelling, a difference of 51. Tracts averaging more than 8 rooms per dwelling are therefore far less common.