An advantage of a very flexible model is that it can fit a much wider range of possible shapes for f. A disadvantage, compared with a less flexible model, is that it is more prone to overfitting and harder to interpret. A more flexible model may be ideal for predictive modeling. On the other hand, if we are interested in inference, a more restrictive model (such as a linear model, which is relatively inflexible but easy to interpret) would be preferable.
A parametric statistical learning approach reduces the problem of estimating f to one of estimating a fixed set of parameters, whereas a non-parametric method avoids assuming a particular functional form for f. An advantage of a parametric method over a non-parametric approach is that estimating a small set of parameters (for example, the coefficients of a linear model) is generally simpler and requires fewer observations. A disadvantage is that if the assumed functional form is far from the true f, the resulting estimate will be poor.
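As a small illustration (a sketch using hypothetical simulated data, not part of the original analysis), a parametric approach estimates a fixed set of coefficients, while a non-parametric smoother lets the data determine the shape of f:
# Hypothetical simulated data for illustration only
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
# Parametric: assume y = b0 + b1*x and estimate the two parameters
param_fit <- lm(y ~ x)
# Non-parametric: a smoothing spline makes no fixed assumption about the form of f
nonparam_fit <- smooth.spline(x, y)
If the true f is highly non-linear (as here), the linear fit misses the curvature, which is the risk of a parametric assumption; the spline adapts to it at the cost of interpretability.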
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ISLR)
data(College)
write.csv(College, "/Users/admin/Desktop/college.csv")
setwd("/Users/admin/Desktop")
college <- read.csv("college.csv")
# Drop the first column (the college names written out by write.csv)
college <- college[, -1]
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
# Convert Private from character to numeric codes (1 = No, 2 = Yes) so it can appear in the pairs plot
college[, 1] <- as.numeric(factor(college[, 1]))
pairs(college[,1:10])
# Side-by-side boxplots of out-of-state tuition by Private status
boxplot(Outstate ~ Private, data = college)
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
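Since the tidyverse is already loaded, an equivalent way to create Elite (a sketch of an alternative, not the code used above) would be:
# dplyr version of the same Elite variable
college <- college %>%
  mutate(Elite = factor(if_else(Top10perc > 50, "Yes", "No")))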
summary(college)
## Private Apps Accept Enroll Top10perc
## Min. :1.000 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## 1st Qu.:1.000 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median :2.000 Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean :1.727 Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.:2.000 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :2.000 Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :699
## 1st Qu.: 53.00 Yes: 78
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# Side-by-side boxplots of out-of-state tuition by Elite status
boxplot(Outstate ~ Elite, data = college)
par(mfrow = c(2, 2))
hist(college$Top10perc)
hist(college$Room.Board)
hist(college$Enroll)
hist(college$Books)
From the histograms and boxplots above, we can compare the distributions of several variables. For instance, most room and board costs fall in the $3,000-$5,000 range, at most colleges roughly 10-30% of new students were in the top ten percent of their high school class, and the estimated book cost is around $500 for most students.
From the scatterplot matrix, we can see strong positive relationships between some variables, such as the number of applications (Apps), acceptances (Accept), and new students enrolled (Enroll).
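To attach numbers to what the scatterplot matrix suggests, we can compute the correlations directly (a quick check; output not shown here):
# Correlations among application, acceptance, and enrollment counts
cor(college[, c("Apps", "Accept", "Enroll")])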
data("Auto")
write.csv(Auto, "/Users/admin/Desktop/auto.csv")
Auto <- read.csv("auto.csv")
Auto <- na.omit(Auto)
head(Auto)
as.factor(Auto$origin)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 2 2 2 2 1 1 1 1 1 3 1 3 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 2 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
## [75] 1 2 2 2 2 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 3 3
## [112] 1 1 2 1 1 2 2 2 2 1 2 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 1 2 2 3
## [149] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 3 1 2 1 2 2 2 2 3 2 2 1 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 2 3 3 1 2 1 2 3 2 1 1 1 1 3 1 2 1 3 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 3 2 3 2 3 2 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 3 3 1 3 1 1 3 2 2 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2
## [297] 1 2 1 1 1 3 2 1 1 1 1 2 3 1 3 1 1 1 1 2 3 3 3 3 3 1 3 2 2 2 2 3 3 2 3 3 2
## [334] 3 1 1 1 1 1 3 1 3 3 3 3 3 1 1 1 2 3 3 3 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [371] 3 3 1 1 3 3 3 3 3 3 1 1 1 1 3 1 1 1 2 1 1 1
## Levels: 1 2 3
Auto$origin <- as.factor(Auto$origin)
Name and origin are the only qualitative variables in this data set.
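One way to confirm this is to inspect the class of each column:
# Check the class of each column; character and factor columns are qualitative
sapply(Auto, class)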
# Range (max minus min) of each quantitative predictor
sapply(Auto[, 2:8], function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
## mpg cylinders displacement horsepower weight acceleration
## 37.6 5.0 387.0 184.0 3527.0 16.8
## year
## 12.0
sapply(Auto[c(2:8)], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
sapply(Auto[c(2:8)], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
# Remove observations 10 through 85 and recompute the ranges
newauto <- Auto[-c(10:85), ]
sapply(newauto[, 2:8], function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
## mpg cylinders displacement horsepower weight acceleration
## 35.6 5.0 387.0 184.0 3348.0 16.3
## year
## 12.0
sapply(newauto[c(2:8)], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
sapply(newauto[c(2:8)], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
Auto$cylinders <- as.factor(Auto$cylinders)
pairs(Auto[, c(2:8)])
par(mfrow = c(1, 2))
hist(Auto$acceleration)
hist(Auto$mpg)
We can see that year and mpg are positively correlated, while year and cylinders do not show a clear linear relationship. We can also see that the majority of cars get roughly 20-30 mpg.
From the plots above, year appears to have a clear relationship with mpg and may be useful in predicting it.
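As a rough check of that last point (a sketch, not part of the original analysis; output not shown), a simple linear model of mpg on year can be fit:
# Simple linear regression of mpg on model year
fit_year <- lm(mpg ~ year, data = Auto)
summary(fit_year)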
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
??boston
There are 506 rows and 13 variables in the data set. Each row corresponds to a census tract in a Boston suburb, and the columns describe housing and neighborhood characteristics.
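These dimensions can be confirmed directly (output not shown):
# Number of rows and columns in the ISLR2 Boston data
dim(ISLR2::Boston)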
data("Boston")
write.csv(Boston, "/Users/admin/Desktop/boston.csv")
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)
pairs(Boston)
Some variables are correlated with one another, both negatively and positively.
As the correlation matrix below shows, the predictors most strongly associated with per-capita crime rate are rad (accessibility to radial highways, correlation about 0.63) and tax (property-tax rate, about 0.58), followed by lstat (percent lower-status population), nox, and indus. Variables such as dis (distance to employment centres), medv (median home value), and rm (average rooms) have weaker, negative associations with crime.
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## lstat medv
## crim 0.4556215 -0.3883046
## zn -0.4129946 0.3604453
## indus 0.6037997 -0.4837252
## chas -0.0539293 0.1752602
## nox 0.5908789 -0.4273208
## rm -0.6138083 0.6953599
## age 0.6023385 -0.3769546
## dis -0.4969958 0.2499287
## rad 0.4886763 -0.3816262
## tax 0.5439934 -0.4685359
## ptratio 0.3740443 -0.5077867
## lstat 1.0000000 -0.7376627
## medv -0.7376627 1.0000000
par(mfrow = c(1, 3))
hist(Boston$crim)
hist(Boston$tax)
hist(Boston$ptratio)
There are many Boston suburbs with a per-capita crime rate near zero. There is also a cluster of tracts with a full-value property-tax rate of roughly 650-700 per $10,000. In the ptratio histogram, there is a spike in frequency around a pupil-teacher ratio of about 20.
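To put rough counts on these observations (a quick sketch; the thresholds are illustrative and output is not shown):
# Tracts with near-zero crime, very high tax rates, or a pupil-teacher ratio of 20 or more
sum(Boston$crim < 1)
sum(Boston$tax > 600)
sum(Boston$ptratio >= 20)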
ChasPos <- subset(Boston, chas == 1)
There are 35 census tracts that bound the Charles River.
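That count comes directly from the subset created above:
# Number of tracts bordering the Charles River
nrow(ChasPos)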
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
The median pupil-teacher ratio is 19.05.
# Add a census-tract index so individual tracts can be identified, then find the lowest median home value
Boston <- data.frame(CTract = seq_len(nrow(Boston)), Boston)
filter(Boston, medv == min(medv))
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Census tracts 399 and 406 have the lowest median home value (medv = 5, i.e. $5,000). The two tracts are similar on most predictors, except that the per-capita crime rate is much lower for tract 399 than for tract 406. Both, however, have much higher crime rates than most of the data set: each is well above the third quartile of crim (about 3.68), as the summary above shows.
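One way to see where these two tracts sit in the crime-rate distribution (a quick sketch; output not shown):
# Crime rates for tracts 399 and 406, and the share of tracts with lower crime than tract 399
Boston$crim[c(399, 406)]
mean(Boston$crim < Boston$crim[399])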
MoreThan7Rm <- subset(Boston, rm > 7)
MoreThan8Rm <- subset(Boston, rm > 8)
There are 64 census tracts that average more than 7 rooms per dwelling and 13 census tracts that average more than 8 rooms per dwelling; significantly fewer tracts average more than 8, a difference of 51.
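The counts come directly from the two subsets:
# Number of tracts averaging more than 7 and more than 8 rooms per dwelling
nrow(MoreThan7Rm)
nrow(MoreThan8Rm)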