This is a Regression problem because the response is a numerical (quantitative) value. We are more interested in Inference because the goal is to understand how the inputs affect the output. N = 500 (the number of observations) and P = 3 (the number of predictors; the response itself is not counted).
This is a Classification problem because the response is a categorical (qualitative) value. We are more interested in Prediction because the goal is to know the output itself. N = 20 and P = 13.
This is a Regression problem because the response is a numerical (quantitative) value. We are more interested in Prediction because the goal is to estimate the percentage change (the output). N = 52 and P = 3.
Advantage: A very flexible approach can fit non-linear relationships more closely and so decreases bias. Disadvantages: A very flexible approach requires estimating a larger number of parameters, tends to overfit by following the noise in the training data, and increases variance. A more flexible approach is preferable when we care about Prediction more than the interpretability of the results. A less flexible approach is preferable when we care about Inference and the interpretability of the results.
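The trade-off above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" function `f`, the noise level, and the polynomial degrees are made up for illustration and are not part of the exercise.

```r
# Illustrative simulation: compare rigid and flexible polynomial fits
# on data generated from an assumed non-linear f.
set.seed(1)
f <- function(x) sin(2 * x)            # assumed "true" f, chosen for illustration
n <- 100
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(n, 0, 3))
test$y <- f(test$x) + rnorm(n, sd = 0.3)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

degrees <- c(1, 3, 15)                 # increasing flexibility
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

Because the polynomial models are nested, the training MSE cannot increase with the degree. The test MSE carries no such guarantee, which is where the cost of extra variance can show up.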
Parametric: assumes a form for f and therefore reduces the problem of estimating f to estimating a set of parameters. Non-Parametric: does not assume a form for f and therefore requires a very large number of observations to estimate f accurately. Advantages: A Parametric approach simplifies the modeling of f to just a few parameters, so fewer observations are needed than with a Non-Parametric approach. Disadvantages: If the assumed form is far from the true f, the resulting estimate of f will be poor.
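As a minimal sketch of the distinction, the code below fits the same simulated data with `lm()` (parametric: a straight line with two coefficients) and `loess()` (non-parametric: local regression with no global form). The data-generating line and noise level are assumptions made purely for illustration.

```r
# Simulated data from an assumed linear truth
set.seed(2)
df <- data.frame(x = seq(0, 10, length.out = 200))
df$y <- 2 + 0.5 * df$x + rnorm(200, sd = 1)

param_fit <- lm(y ~ x, data = df)        # parametric: estimates intercept and slope
nonparam_fit <- loess(y ~ x, data = df)  # non-parametric: smooths locally

n_params <- length(coef(param_fit))      # number of parameters in the linear fit
nonparam_pred <- predict(nonparam_fit, newdata = df)
print(n_params)
```

The linear fit is summarised entirely by its coefficients, while the loess fit has no comparable small set of parameters and has to be evaluated point by point.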
library(ISLR)
data(College)
college <- read.csv("College.csv")
head(college[, 1:5])
## X Private Apps Accept Enroll
## 1 Abilene Christian University Yes 1660 1232 721
## 2 Adelphi University Yes 2186 1924 512
## 3 Adrian College Yes 1428 1097 336
## 4 Agnes Scott College Yes 417 349 137
## 5 Alaska Pacific University Yes 193 146 55
## 6 Albertson College Yes 587 479 158
rownames <- college[,1]
college <- college[,-1]
head(college[, 1:5])
## Private Apps Accept Enroll Top10perc
## 1 Yes 1660 1232 721 23
## 2 Yes 2186 1924 512 16
## 3 Yes 1428 1097 336 22
## 4 Yes 417 349 137 60
## 5 Yes 193 146 55 16
## 6 Yes 587 479 158 38
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
pairs(college[, 1:10])
plot(as.factor(college$Private), college$Outstate, xlab = "Private University", ylab = "Out of State tuition", main = "Outstate Tuition")
Elite = rep("No", nrow(college))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate,xlab = "Elite University", ylab ="Out of State tuition", main = "Outstate Tuition")
par(mfrow = c(2,2))
hist(college$Books, col = 1, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 2, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 3, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 4, xlab = "% alumni", ylab = "Count")
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
high.phd = college[college$PhD == 103, ]
nrow(high.phd)
## [1] 1
rownames[as.numeric(rownames(high.phd))]
## [1] "Texas A&M University at Galveston"
Texas A&M University at Galveston is recorded as having 103% of its faculty holding a PhD. A percentage cannot exceed 100, so this is most likely a data-entry error.
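A quick range check can flag values like this across several percentage columns at once (the summary above also shows a maximum Grad.Rate of 118). The helper `count_out_of_range` and the toy data frame are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: count values outside [lower, upper] in each named column
count_out_of_range <- function(df, cols, lower = 0, upper = 100) {
  sapply(df[cols], function(v) sum(v < lower | v > upper, na.rm = TRUE))
}

# Toy data standing in for a few percentage columns
toy <- data.frame(
  PhD = c(62, 75, 103),
  Grad.Rate = c(53, 118, 65),
  Terminal = c(71, 82, 100)
)
flagged <- count_out_of_range(toy, c("PhD", "Grad.Rate", "Terminal"))
print(flagged)
```

With the data loaded as above, the same call could be applied as `count_out_of_range(college, c("PhD", "Grad.Rate"))`.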
auto = read.csv("Auto.csv", na.strings = "?")
auto = na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
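The 'name' variable is the only one stored as text, so it is clearly qualitative. The rest are stored as numbers, although 'origin' is a numeric code for a category and is better treated as qualitative.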
sapply(auto[, -c(9)], range)
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 9.0 3 68 46 1613 8.0 70 1
## [2,] 46.6 8 455 230 5140 24.8 82 3
sapply(auto[, -c(9)], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year origin
## 75.979592 1.576531
sapply(auto[, -c(9)], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.8050075 1.7057832 104.6440039 38.4911599 849.4025600 2.7588641
## year origin
## 3.6837365 0.8055182
new_auto = auto[-c(10:85), -c(9)]
sapply(new_auto, range)
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0 3 68 46 1649 8.5 70 1
## [2,] 46.6 8 455 230 4997 24.8 82 3
sapply(new_auto, mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year origin
## 77.145570 1.601266
sapply(new_auto, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year origin
## 3.106217 0.819910
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
par(mfrow = c(3,2))
hist(auto$mpg, col = 1, xlab = "mpg", ylab = "Count")
hist(auto$displacement, col = 2, xlab = "displacement", ylab = "Count")
hist(auto$horsepower, col = 3, xlab = "horsepower", ylab = "Count")
hist(auto$weight, col = 4, xlab = "weight", ylab = "Count")
hist(auto$acceleration, col = 5, xlab = "acceleration", ylab = "Count")
library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
hist(Boston$crim, breaks = 50)
Only 18 of the 506 observations have crim > 20, so roughly 96% of the data lies at or below that level.
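Rather than reading the share off the histogram, it can be computed directly. A minimal sketch, assuming the MASS package is installed (the variable name `prop_below` is introduced here for illustration):

```r
library(MASS)  # provides the Boston data frame

# Taking the mean of a logical vector gives the proportion of TRUE values
prop_below <- mean(Boston$crim < 20)
print(prop_below)
```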
pairs(Boston[Boston$crim < 20, ])
In the pairs plot, crim appears to be related to nox, rm, age, dis, lstat, and medv.
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
nrow(Boston[Boston$chas == 1, ])
## [1] 35
median(Boston$ptratio)
## [1] 19.05
row.names(Boston[which.min(Boston$medv), ])
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666
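When selecting the row with the lowest value, note that `min(x)` returns the smallest value while `which.min(x)` returns its position. Using a value as a row index only picks the intended row by coincidence. The toy data frame below is hypothetical and serves only to show the difference.

```r
# Toy data: the minimum medv (5) occurs in rows 4 and 6
toy <- data.frame(
  medv = c(24, 21.6, 34.7, 5, 36.2, 5),
  tax = c(296, 242, 242, 666, 222, 666)
)

min_value <- min(toy$medv)                    # smallest value, not a row position
first_min_row <- which.min(toy$medv)          # position of the first minimum
all_min_rows <- which(toy$medv == min_value)  # positions of every tied minimum

wrong_tax <- toy[min_value, "tax"]            # row 5, chosen by accident
right_tax <- toy[first_min_row, "tax"]        # row that actually holds the minimum
print(c(wrong = wrong_tax, right = right_tax))
```

`which()` with an equality test is the safer choice when ties matter, since `which.min()` reports only the first minimum.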
nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13