Question 2

Part A

This is a Regression problem because the response (CEO salary) is a numerical/quantitative value. We are more interested in Inference because the goal is to understand which inputs affect the output and how. N = 500 (the number of observations) and P = 3 (the number of inputs/predictors: profit, number of employees, and industry; CEO salary is the response and is not counted).

Part B

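This is a Classification problem because the response is a categorical/qualitative value (success or failure). We are more interested in Prediction because we want to know the outcome for a new product. N = 20 and P = 13 (price charged, marketing budget, competition price, and ten other variables; the success/failure response is not counted).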

Part C

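This is a Regression problem because the response is a numerical/quantitative value. We are more interested in Prediction because we want to know the percentage change itself (the output). N = 52 (one observation per week of the year) and P = 3 (the % change in the US, British, and German markets; the % change in the USD/Euro exchange rate is the response and is not counted).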

Question 5

Advantage: A very flexible approach can fit non-linear relationships more closely, which decreases the bias. Disadvantage: A very flexible approach requires estimating a larger number of parameters, tends to overfit by following the noise in the training data, and increases variance. A more flexible approach is preferable when we are interested in Prediction rather than the interpretability of the results. A less flexible approach is preferable when we are interested in Inference and the interpretability of the results.
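
The trade-off can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated (a sine curve plus noise), and the names dat, fit_lin, fit_flex, and mse are made up for this example. It fits a straight line and a degree-10 polynomial to the same training half and reports training and test error for each.

```r
# Simulated data (an assumption for illustration): a non-linear truth plus noise.
set.seed(1)
n <- 100
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)
train <- 1:50
test <- 51:100

# Inflexible fit: a straight line. Flexible fit: a degree-10 polynomial.
fit_lin  <- lm(y ~ x, data = dat, subset = train)
fit_flex <- lm(y ~ poly(x, 10), data = dat, subset = train)

# Mean squared error of a fitted model on the given rows of dat.
mse <- function(fit, rows) {
  mean((dat$y[rows] - predict(fit, newdata = dat[rows, ]))^2)
}

train_mse <- c(linear = mse(fit_lin, train), flexible = mse(fit_flex, train))
test_mse  <- c(linear = mse(fit_lin, test),  flexible = mse(fit_flex, test))
train_mse
test_mse
```

The flexible fit can never have a larger training error than the line, because the polynomial model contains the straight line as a special case. Whether it also wins on the test rows depends on the data, and comparing test_mse is how overfitting would show up.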

Question 6

Parametric: assumes a functional form for f and therefore reduces the problem of estimating f down to estimating a set of parameters. Non-Parametric: does not assume a form for f and therefore requires a very large number of observations to estimate f accurately. Advantages: A Parametric approach simplifies the modeling of f to just a few parameters, so fewer observations are needed than with a Non-Parametric approach. Disadvantages: if the assumed form is far from the true f, the estimate of f will be poor.
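
As a rough illustration of the difference, the sketch below fits both kinds of model to simulated data. The data and the names fit_par and fit_np are assumptions made for this example; lm() stands in for a parametric method and loess() for a non-parametric one.

```r
set.seed(2)
x <- seq(0, 10, length.out = 200)
y <- 3 + 2 * x + rnorm(200)  # the true f is linear here (an assumption of this sketch)

# Parametric: assume f(x) = b0 + b1 * x, so only two numbers are estimated.
fit_par <- lm(y ~ x)
coef(fit_par)

# Non-parametric: loess makes no global assumption about the shape of f.
# It fits local regressions, so there is no small set of coefficients to report.
fit_np <- loess(y ~ x)
head(predict(fit_np))
```

The parametric fit is summarised entirely by its two coefficients, while the loess fit can only be inspected through its predictions.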

Question 8

Part A

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
data(College)
college <- read.csv("College.csv")

Part B

head(college[, 1:5])
##                              X Private Apps Accept Enroll
## 1 Abilene Christian University     Yes 1660   1232    721
## 2           Adelphi University     Yes 2186   1924    512
## 3               Adrian College     Yes 1428   1097    336
## 4          Agnes Scott College     Yes  417    349    137
## 5    Alaska Pacific University     Yes  193    146     55
## 6            Albertson College     Yes  587    479    158
rownames <- college[,1]
college <- college[,-1]
head(college[, 1:5])
##   Private Apps Accept Enroll Top10perc
## 1     Yes 1660   1232    721        23
## 2     Yes 2186   1924    512        16
## 3     Yes 1428   1097    336        22
## 4     Yes  417    349    137        60
## 5     Yes  193    146     55        16
## 6     Yes  587    479    158        38

Part C

summary(college)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

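# Private is read in as character; pairs() needs numeric or factor columns
college$Private <- as.factor(college$Private)
pairs(college[, 1:10])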

plot(as.factor(college$Private), college$Outstate, xlab = "Private University", ylab = "Out of State tuition", main = "Outstate Tuition")

Elite = rep("No", nrow(college))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate,xlab = "Elite University", ylab ="Out of State tuition", main = "Outstate Tuition")

par(mfrow = c(2,2))
hist(college$Books, col = 1, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 2, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 3, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 4, xlab = "% alumni", ylab = "Count")

summary(college$PhD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.66   85.00  103.00
high.phd = college[college$PhD == 103, ]
nrow(high.phd)
## [1] 1
rownames[as.numeric(rownames(high.phd))]
## [1] "Texas A&M University at Galveston"

Texas A&M University at Galveston is recorded as having 103% of its faculty holding a PhD, which is impossible for a percentage and is most likely a data-entry error.
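
The same kind of check can be applied to every percentage column at once; the summary above also shows a maximum Grad.Rate of 118. The sketch below uses a toy data frame with hypothetical school names and made-up values, so it illustrates the check rather than reproducing the College data.

```r
# Toy data with hypothetical school names; the percentages are made up.
toy <- data.frame(
  school    = c("School A", "School B", "School C"),
  PhD       = c(75, 103, 62),
  Grad.Rate = c(65, 118, 53)
)

# A percentage should lie in [0, 100]; flag any row that violates this.
pct_cols <- c("PhD", "Grad.Rate")
out_of_range <- apply(toy[pct_cols], 1, function(r) any(r < 0 | r > 100))
toy[out_of_range, ]
```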

Question 9

Part A

auto = read.csv("Auto.csv", na.strings = "?")
auto = na.omit(auto)
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

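The 'name' variable is the only one stored as text and is clearly qualitative. 'origin' is stored as an integer but is a category code rather than a measurement, so it is arguably qualitative as well. The rest are quantitative.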

Part B

sapply(auto[, -c(9)], range)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

Part C

sapply(auto[, -c(9)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531
sapply(auto[, -c(9)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8050075    1.7057832  104.6440039   38.4911599  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182

Part D

new_auto = auto[-c(10:85), -c(9)]
sapply(new_auto, range)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3
sapply(new_auto, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266
sapply(new_auto, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910

Part E

auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
par(mfrow = c(3,2))
hist(auto$mpg, col = 1, xlab = "mpg", ylab = "Count")
hist(auto$displacement, col = 2, xlab = "displacement", ylab = "Count")
hist(auto$horsepower, col = 3, xlab = "horsepower", ylab = "Count")
hist(auto$weight, col = 4, xlab = "weight", ylab = "Count")
hist(auto$acceleration, col = 5, xlab = "acceleration", ylab = "Count")

Question 10

Part A

library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14

Part B

par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)

Part C

hist(Boston$crim, breaks = 50)

About 96% of the data has crim < 20 (only 18 of the 506 observations have crim > 20).
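
The share of observations below a cutoff can be computed directly rather than read off the histogram. A minimal sketch, assuming the Boston data frame from MASS is available as above; prop_below and n_below are names introduced here for illustration.

```r
library(MASS)  # provides the Boston data frame

# mean() of a logical vector gives the proportion of TRUE values.
prop_below <- mean(Boston$crim < 20)
n_below <- sum(Boston$crim < 20)
c(count = n_below, proportion = round(prop_below, 3))
```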

pairs(Boston[Boston$crim < 20, ])

There is a possible relationship between crim and nox, rm, age, dis, lstat, and medv

Part D

hist(Boston$crim, breaks = 50)

nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)

nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)

nrow(Boston[Boston$ptratio > 20, ])
## [1] 201

Part E

nrow(Boston[Boston$chas == 1, ])
## [1] 35

Part F

median(Boston$ptratio)
## [1] 19.05

Part G

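# which.min() gives the position of the (first) smallest medv; min() gives the value 5, which would select row 5
row.names(Boston[which.min(Boston$medv), ])
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666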
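
Indexing by the minimum value and indexing by the position of the minimum are easy to confuse. The sketch below shows the difference on a toy data frame with made-up values; the column names mirror Boston only for readability.

```r
# Toy data frame (made-up values) where the smallest medv is not in row 5.
toy <- data.frame(medv = c(24, 21.6, 5, 33.4, 36.2, 28.7),
                  tax  = c(296, 242, 666, 222, 222, 222))

min(toy$medv)        # the smallest value itself
which.min(toy$medv)  # the position (row index) of the smallest value

# Indexing rows with the value selects row 5; indexing with the position
# selects the row that actually holds the minimum.
toy[min(toy$medv), "tax"]        # tax of row 5, not of the minimum-medv row
toy[which.min(toy$medv), "tax"]  # tax of the minimum-medv row
```

Note that which.min() returns only the first position when several rows tie for the minimum; which(toy$medv == min(toy$medv)) would return all of them.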

Part H

nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13