Regression, Inference, n = 500, p = 3
Classification, Prediction, n = 20, p = 13
Regression, Prediction, n = 52, p = 3
A very flexible approach can be a disadvantage when the model overfits the training data, which leads to a higher test MSE. The advantage of not imposing a strict shape on the data is lower bias, but variance increases in exchange. A less flexible approach may be preferred when we want to make inferences and understand the relationship between X and Y. Less flexible approaches are also simpler and easier to describe than very flexible models.
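As a rough illustration of the trade-off described above, the sketch below fits polynomials of increasing degree to simulated data and compares training and test MSE. The true function `f`, the sample sizes, and the noise level are all assumptions chosen for the demo, not part of the exercise.

```r
# Minimal sketch on simulated data: training MSE can only go down as
# flexibility grows, while test MSE eventually stops improving.
set.seed(1)
f <- function(x) sin(2 * x)                 # hypothetical true regression function
n <- 50
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(1000, 0, 3))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)   # higher degree = more flexible
  c(degree = d,
    train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
}))
print(round(mse, 3))
```

Reading down the `train` column shows the fit to the training data improving with every added degree; the `test` column is the one that matters for prediction.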
In statistical learning terms, a parametric approach assumes a functional form for f and then only has to estimate that form's parameters, while a non-parametric approach makes no explicit assumption about the form of f and instead tries to get as close to the data as possible without being too wiggly.
Parametric: Parametric models are more structured and follow a defined shape, so they are not very flexible. They do not require as many observations to fit, and they usually take less computation time.
Non-Parametric: Non-parametric models are very flexible and have a better chance of matching the shape of f because they can take a wider range of shapes. A large disadvantage of non-parametric methods is that they require many more observations to make an accurate estimate of f.
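To make the contrast concrete, here is a small sketch comparing a parametric fit (`lm`, which assumes a straight line) with a non-parametric one (`loess`, local regression) on simulated data. The true function, sample size, and noise level are assumptions for illustration only.

```r
# Sketch: parametric vs non-parametric estimates of a non-linear f.
set.seed(2)
f <- function(x) sin(x)                     # hypothetical true f
n <- 200
d <- data.frame(x = runif(n, 0, 2 * pi))
d$y <- f(d$x) + rnorm(n, sd = 0.3)

fit_lm <- lm(y ~ x, data = d)               # parametric: two coefficients, fixed linear shape
fit_lo <- loess(y ~ x, data = d)            # non-parametric: no global form assumed for f

# Compare each estimate with the true f on a grid inside the observed x range
grid <- data.frame(x = seq(0.5, 5.5, by = 0.1))
err_lm <- mean((predict(fit_lm, grid) - f(grid$x))^2)
err_lo <- mean((predict(fit_lo, grid) - f(grid$x))^2)
c(linear = err_lm, loess = err_lo)
```

With plenty of observations and a curved f, the flexible fit tracks f more closely; the linear model stays easier to describe because it is fully summarized by its intercept and slope.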
View(college)
summary(college)
## ...1 Private Apps Accept
## Length:777 Length:777 Min. : 81 Min. : 72
## Class :character Class :character 1st Qu.: 776 1st Qu.: 604
## Mode :character Mode :character Median : 1558 Median : 1110
## Mean : 3002 Mean : 2019
## 3rd Qu.: 3624 3rd Qu.: 2424
## Max. :48094 Max. :26330
## Enroll Top10perc Top25perc F.Undergrad
## Min. : 35 Min. : 1.00 Min. : 9.0 Min. : 139
## 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992
## Median : 434 Median :23.00 Median : 54.0 Median : 1707
## Mean : 780 Mean :27.56 Mean : 55.8 Mean : 3700
## 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005
## Max. :6392 Max. :96.00 Max. :100.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0
## Median : 353.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 855.3 Mean :10441 Mean :4358 Mean : 549.4
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.0 Median :13.60
## Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
## Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
## perc.alumni Expend Grad.Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :21.00 Median : 8377 Median : 65.00
## Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :118.00
pairs(college[ ,3:13])
college$Private = as.factor(college$Private)
plot(college$Private, college$Outstate)
Elite=rep("No",nrow(college))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate)
par(mfrow = c(2, 2))
hist(college$Apps)
hist(college$Grad.Rate)
hist(college$S.F.Ratio)
hist(college$Expend)
college_glm = glm(Private ~.-...1, data = college, family = binomial)
summary(college_glm)
##
## Call:
## glm(formula = Private ~ . - ...1, family = binomial, data = college)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.832e-02 1.904e+00 -0.010 0.99233
## Apps -4.002e-04 2.382e-04 -1.680 0.09296 .
## Accept -9.533e-05 4.634e-04 -0.206 0.83702
## Enroll 1.470e-03 8.710e-04 1.688 0.09139 .
## Top10perc 5.025e-02 3.371e-02 1.491 0.13609
## Top25perc -6.060e-03 2.005e-02 -0.302 0.76245
## F.Undergrad -4.257e-04 1.472e-04 -2.892 0.00383 **
## P.Undergrad 2.079e-06 1.367e-04 0.015 0.98787
## Outstate 7.250e-04 1.172e-04 6.185 6.22e-10 ***
## Room.Board 1.211e-04 2.690e-04 0.450 0.65252
## Books 1.932e-03 1.354e-03 1.427 0.15369
## Personal -3.746e-04 2.706e-04 -1.384 0.16625
## PhD -6.917e-02 2.693e-02 -2.568 0.01022 *
## Terminal -2.618e-02 2.567e-02 -1.020 0.30786
## S.F.Ratio -8.065e-02 6.301e-02 -1.280 0.20053
## perc.alumni 4.686e-02 2.108e-02 2.223 0.02620 *
## Expend 1.799e-04 1.198e-04 1.501 0.13324
## Grad.Rate 1.636e-02 1.184e-02 1.382 0.16696
## EliteYes -3.191e+00 1.219e+00 -2.617 0.00887 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 910.75 on 776 degrees of freedom
## Residual deviance: 232.98 on 758 degrees of freedom
## AIC: 270.98
##
## Number of Fisher Scoring iterations: 8
I ran a logistic regression of Private on all the other variables and found that F.Undergrad, Outstate, PhD, perc.alumni, and EliteYes are significant predictors of a university being Private at an alpha level of 0.05.
college_glm = glm(Private ~ `F.Undergrad` + Outstate + PhD
+ `perc.alumni` + Elite, data = college, family = binomial)
summary(college_glm)
##
## Call:
## glm(formula = Private ~ F.Undergrad + Outstate + PhD + perc.alumni +
## Elite, family = binomial, data = college)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.349e-01 8.938e-01 -0.375 0.707911
## F.Undergrad -4.795e-04 6.326e-05 -7.579 3.47e-14 ***
## Outstate 8.228e-04 8.545e-05 9.629 < 2e-16 ***
## PhD -6.991e-02 1.456e-02 -4.802 1.57e-06 ***
## perc.alumni 6.608e-02 1.967e-02 3.359 0.000781 ***
## EliteYes -1.426e+00 8.892e-01 -1.604 0.108819
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 910.75 on 776 degrees of freedom
## Residual deviance: 260.86 on 771 degrees of freedom
## AIC: 272.86
##
## Number of Fisher Scoring iterations: 7
When the model is refit with only those significant predictors, Elite is no longer significant in predicting Private (p = 0.109).
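One way to check whether the dropped predictors matter jointly is a likelihood ratio test on the two nested models. The sketch below only reuses the residual deviances and degrees of freedom printed in the two summaries; the variable names are made up for the illustration.

```r
# Likelihood ratio test from the reported deviances (values copied from the
# two glm summaries; object names here are hypothetical).
dev_full    <- 232.98; df_full    <- 758   # all predictors
dev_reduced <- 260.86; df_reduced <- 771   # five predictors only

lr_stat <- dev_reduced - dev_full          # drop in deviance from adding 13 terms
lr_df   <- df_reduced - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p = p_value)
```

If both fits were kept under separate names (here the second fit overwrites `college_glm`), `anova(reduced_fit, full_fit, test = "Chisq")` would carry out the same comparison directly. A small p-value would suggest that the dropped predictors, although individually non-significant, still carry some information as a group.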
auto = read_csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Auto.csv")
## Rows: 397 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): horsepower, name
## dbl (7): mpg, cylinders, displacement, weight, acceleration, year, origin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
auto = na.omit(auto)
summary(auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Length:397
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 Class :character
## Median :23.00 Median :4.000 Median :146.0 Mode :character
## Mean :23.52 Mean :5.458 Mean :193.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :46.60 Max. :8.000 Max. :455.0
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2223 1st Qu.:13.80 1st Qu.:73.00 1st Qu.:1.000
## Median :2800 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.56 Mean :75.99 Mean :1.574
## 3rd Qu.:3609 3rd Qu.:17.10 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
## name
## Length:397
## Class :character
## Mode :character
##
##
##
According to the summary, everything is numeric except horsepower and name. Horsepower should be a number, so I will convert it to numeric.
auto$horsepower = as.numeric(auto$horsepower)
## Warning: NAs introduced by coercion
auto = na.omit(auto)
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
NAs were introduced when converting horsepower to numeric, and those rows have been removed (392 of the 397 rows remain). Now only ‘name’ is character and everything else is numeric.
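The coercion warning usually means some entries are not numbers at all. The sketch below assumes the missing horsepower values are coded as "?" (an assumption about the file, not something shown in the output) and uses a three-row toy CSV with made-up rows to show both the coercion route and declaring the missing-value marker at read time.

```r
# Toy CSV with made-up rows; "?" stands in for a missing horsepower value.
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
15,165,car c"

# Route 1: read as-is, then coerce. The "?" entry becomes NA with a warning.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$horsepower)                        # character, because of the "?"
hp <- suppressWarnings(as.numeric(raw$horsepower))
sum(is.na(hp))                               # number of entries that failed to convert

# Route 2: declare the missing-value marker up front, so no coercion is needed.
clean <- read.csv(text = csv_text, na.strings = "?")
str(clean)
nrow(na.omit(clean))                         # rows left after dropping missing values
```

readr's `read_csv()` has a similar `na` argument, which would avoid the separate `as.numeric()` step. Counting the NAs before calling `na.omit()` makes it easy to confirm how many rows are about to be dropped.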
Range of auto
sapply(auto[, 1:7], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
Mean of auto
sapply(auto[, 1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
Standard Deviation of auto
sapply(auto[, 1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
Subsample Range
auto2 <- auto[-c(10:85), ]
sapply(auto2[, 1:7], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
Subsample Mean
sapply(auto2[, 1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
Subsample Standard Deviation
sapply(auto2[, 1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
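The range, mean, and standard deviation can also be collected in a single table per data set. The helper below is a hypothetical convenience function (its name and the toy data frame are made up for the demo), shown on a tiny example so it runs on its own.

```r
# Hypothetical helper: one row per variable, with min, max, mean and sd.
describe <- function(df) {
  t(sapply(df, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  }))
}

# Made-up toy data just to show the shape of the result
toy <- data.frame(mpg = c(18, 15, 36, 27),
                  weight = c(3504, 3693, 1800, 2600))
d <- describe(toy)
print(d)
```

Under the same assumptions, `describe(auto[, 1:7])` and `describe(auto[-c(10:85), 1:7])` would give the full-sample and subsample tables side by side for comparison.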
pairs(auto[ ,1:8])
par(mfrow = c(1, 1))
plot(auto$weight, auto$acceleration)
There is a slight negative correlation: as weight increases, acceleration decreases.
plot(auto$weight, auto$mpg)
Correlation: as weight increases, miles per gallon decreases.
Yes, weight seems to be a good predictor of mpg because there is a clear downward trend: above roughly 3500 in weight, mpg falls from ~20 to ~10. We can also fit a model to check this.
auto_lm = lm(mpg~weight, data = auto)
summary(auto_lm)
##
## Call:
## lm(formula = mpg ~ weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9736 -2.7556 -0.3358 2.1379 16.5194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.216524 0.798673 57.87 <2e-16 ***
## weight -0.007647 0.000258 -29.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared: 0.6926, Adjusted R-squared: 0.6918
## F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16
Weight is a significant predictor of mpg in this linear model (p < 2e-16), and it explains about 69% of the variance in mpg (R-squared = 0.6926).
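To see what the fitted coefficients mean in practice, the sketch below plugs the estimates from the summary into the fitted line. The helper name `predict_mpg` and the example weight of 3000 are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the lm summary for mpg ~ weight
b0 <- 46.216524    # intercept
b1 <- -0.007647    # change in mpg per one-unit increase in weight

# Hypothetical helper: predicted mpg from the fitted line
predict_mpg <- function(weight) b0 + b1 * weight

predict_mpg(3000)  # predicted mpg for a car of weight 3000
b1 * 1000          # expected change in mpg for each additional 1000 units of weight
```

With the fitted model object available, `predict(auto_lm, newdata = data.frame(weight = 3000))` should give the same prediction up to rounding of the printed coefficients.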
#install.packages("ISLR2")
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
boston = ISLR2::Boston
dim(boston)
## [1] 506 13
?ISLR2::Boston
## starting httpd help server ... done
There are 506 rows and 13 columns: each row is a suburb of Boston, and each column is one of 13 variables describing it, which can be used to help predict housing value. The explanation of each column can be found on the help page that was called.
pairs(boston)
There is a lot going on in the pairwise scatterplots, but some pairs that look correlated at first glance are zn & crim, indus & nox, lstat & medv, and rad & tax. There are possibly more that are just hard to see with Mark I eyeballs.
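Since a 13-by-13 grid of scatterplots is hard to read by eye, a correlation matrix can be used to rank the pairs numerically. The function below is a hypothetical helper (its name and the toy data are made up), demonstrated on simulated columns so that it is self-contained.

```r
# Hypothetical helper: list the n variable pairs with the largest |correlation|.
top_correlations <- function(df, n = 5) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)     # each pair once, no diagonal
  out <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                    var2 = colnames(cm)[idx[, "col"]],
                    r    = cm[idx],
                    stringsAsFactors = FALSE)
  head(out[order(-abs(out$r)), ], n)
}

# Simulated toy data: a and b are built to be strongly related, c is noise
set.seed(3)
a   <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))
top <- top_correlations(toy, n = 3)
print(top)
```

Applied as `top_correlations(boston)`, this would give a ranked list to check the eyeballed pairs against. Note that correlation only measures linear association, so curved relationships visible in the scatterplots can still rank low.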
par(mfrow = c(2, 2))
plot(boston$tax, boston$crim)
plot(boston$ptratio, boston$crim)
plot(boston$medv, boston$crim)
plot(boston$rm, boston$crim)
More crime seems to occur in the higher tax range (~650 to be specific). More crime seems to occur when the pupil-teacher ratio is ~20. Most crime occurs where the median home value is low, around ~10 (medv is in $1000s, so about $10,000). More crime occurs in suburbs averaging around 4-6 rooms per dwelling.
boston_lm = lm(crim ~., data = boston)
summary(boston_lm)
##
## Call:
## lm(formula = crim ~ ., data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Using a linear model to find significant predictors of crim (alpha = 0.05), the significant predictors were zn, dis, rad, and medv. zn and rad have positive coefficients, while dis and medv have negative coefficients. Holding the other predictors fixed, crime per capita tends to be higher as zn and rad increase and lower as dis and medv increase. Based on these four variables alone, a suburb predicted to have a lot of crime would have higher zn and rad and lower dis and medv.
par(mfrow=c(1,1))
hist(boston$crim,breaks=50)
range(boston$crim)
## [1] 0.00632 88.97620
Most places have low crime, but the distribution has a long right tail that reaches a per capita crime rate of ~89.
hist(boston$tax)
range(boston$tax)
## [1] 187 711
Most places have a tax rate between ~200-400, then there is a gap from 400-600, and a big jump in frequency occurs around the 700 tax range.
hist(boston$ptratio)
range(boston$ptratio)
## [1] 12.6 22.0
There is a very high frequency at a pupil-teacher ratio of ~20, which could be influencing our earlier look at crime vs ptratio in b.
dim(subset(Boston, chas == 1))
## [1] 35 13
35 suburbs bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
boston[boston$medv == min(boston$medv), ]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Compared to the overall ranges, both suburbs have crim well above the 3rd quartile, indus at the 3rd quartile, nox above the 3rd quartile, rm below the 1st quartile, age at the maximum, dis below the 1st quartile, rad at the maximum, tax at the 3rd quartile, ptratio at the 3rd quartile, and lstat above the 3rd quartile.
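Instead of comparing each value against the quartiles by hand, the position of one row within every column can be computed directly. The helper below is hypothetical (the name and the toy data are made up for the demo) and returns, for each variable, the share of rows whose value is at or below the chosen row's value.

```r
# Hypothetical helper: for row i, the fraction of rows with a value <= row i's
# value, computed column by column (1 = at the maximum, small = near the minimum).
position_in_data <- function(df, i) {
  sapply(df, function(col) mean(col <= col[i]))
}

# Made-up toy data: row 4 has the highest crim and the lowest rm
toy <- data.frame(crim = c(0.1, 0.2, 0.3, 9.0),
                  rm   = c(6.5, 6.2, 6.8, 5.4))
pos <- position_in_data(toy, 4)
print(pos)
```

Under the same assumptions, `position_in_data(boston, which.min(boston$medv))` would summarize where the first lowest-medv suburb falls on every predictor at once.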
dim(subset(boston, rm > 7))
## [1] 64 13
dim(subset(boston, rm > 8))
## [1] 13 13
64 suburbs average more than 7 rooms per dwelling, and 13 average more than 8.
summary((subset(boston, rm > 8)))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
Crime is very low: 75% of the suburbs averaging more than 8 rooms have a per capita crime rate below 0.58. Their median home values range from 21.9 to 50, with a mean of 44.2 (in $1000s)!