Regression problem, inference. n = 500, p = 3.
Classification problem, prediction. n = 20, p = 13.
Regression problem, prediction. n = 52, p = 3.
A flexible approach is preferred when prediction is the primary concern and interpretability is not, whereas a less flexible approach is preferred when interpretability is the primary concern.
Advantages of a flexible approach: it can capture non-linear relationships and has lower bias.
Disadvantages of a flexible approach: it can overfit the training data and has higher variance.
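The trade-off can be illustrated with a small simulation. This is a minimal sketch on made-up data: the true function `sin(2 * x)`, the noise level, the sample size, and the polynomial degree are all assumptions chosen only for illustration. It compares a straight-line fit with a degree-10 polynomial fit on training and test data.

```r
# Simulated data: every setting below is an illustrative assumption.
set.seed(1)
n <- 50
f <- function(x) sin(2 * x)                  # assumed "true" f
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(n, 0, 3))
test$y <- f(test$x) + rnorm(n, sd = 0.3)

# Mean squared error of a fitted model on a data frame d.
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

rigid    <- lm(y ~ x, data = train)            # inflexible: straight line
flexible <- lm(y ~ poly(x, 10), data = train)  # flexible: degree-10 polynomial

train_mse <- c(rigid = mse(rigid, train), flexible = mse(flexible, train))
test_mse  <- c(rigid = mse(rigid, test),  flexible = mse(flexible, test))
print(rbind(train_mse, test_mse))
```

Because the straight line is a special case of the polynomial, the flexible fit can never have a higher training MSE. Whether its test MSE is also lower depends on how non-linear f is and how many observations are available.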
A parametric approach assumes a fixed functional form for f and estimates its set of parameters.
A non-parametric approach does not assume a specific form for f, so it requires a very large number of observations to estimate f accurately.
The advantage of a parametric approach to regression or classification is that it reduces the problem of estimating f to estimating a small set of parameters, so it requires fewer observations and less computing power than a non-parametric approach.
The disadvantage of a parametric approach to regression or classification is that the estimate of f can be inaccurate. This happens when the chosen model is far from the true form of f.
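As a rough sketch of the difference, the code below fits a linear model (two parameters) and a simple k-nearest-neighbour average to the same simulated data. The data, the choice of `log1p(x)` as the true f, the value of k, and the helper `knn_predict` are all hypothetical and written only for this illustration.

```r
# Simulated data; the true f and k are assumptions for illustration.
set.seed(2)
x <- sort(runif(100, 0, 10))
y <- log1p(x) + rnorm(100, sd = 0.2)

# Parametric: assume f(x) = b0 + b1 * x and estimate two parameters.
param_fit <- lm(y ~ x)
coef(param_fit)

# Non-parametric: average the responses of the k nearest neighbours.
# knn_predict is a hypothetical helper, not a library function.
knn_predict <- function(x0, x, y, k = 10) {
  sapply(x0, function(p) mean(y[order(abs(x - p))[seq_len(k)]]))
}

grid <- seq(0, 10, length.out = 5)
cbind(x = grid,
      linear = predict(param_fit, data.frame(x = grid)),
      knn    = knn_predict(grid, x, y))
```

The linear model is summarised entirely by its two coefficients, while the nearest-neighbour estimate has no fixed form and needs the full data set at prediction time.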
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
data(College)
head(College[,1:5])
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
fix(College)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(College[,1:10])
plot(College$Private, College$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Elite <- rep("No", nrow(College))
Elite[College$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
College$Elite <- Elite
summary(College$Elite)
## No Yes
## 699 78
plot(College$Elite, College$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
par(mfrow = c(2,2))
hist(College$Apps, col = 2, xlab = "Applications", ylab = "Count")
hist(College$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(College$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(College$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
plot(perc.alumni ~Private, data = College)
Alumni of private universities tend to donate at a higher rate (higher perc.alumni) than alumni of public universities.
reg <- lm(Grad.Rate ~ PhD,data= College)
plot(Grad.Rate ~PhD, data = College)
abline(reg, col="blue")
Schools where a larger percentage of the faculty hold a PhD tend to have a higher graduation rate.
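A scatterplot with a fitted line shows the direction of the relationship but not its strength. One way to quantify it, sketched here under the assumption that the ISLR package is installed, is to look at the correlation, the R-squared, and the slope estimate. The object names `r`, `fit`, and `r_squared` are introduced only for this example.

```r
library(ISLR)   # provides the College data used above

# Strength of the linear association between PhD and Grad.Rate.
r <- cor(College$PhD, College$Grad.Rate)

fit <- lm(Grad.Rate ~ PhD, data = College)
r_squared <- summary(fit)$r.squared   # in simple regression this equals r^2

c(correlation = r, r_squared = r_squared)
coef(summary(fit))["PhD", ]           # slope estimate, std. error, t value, p-value
```

A positive slope with a small R-squared would mean the trend is real on average but explains little of the school-to-school variation in graduation rate.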
data("Auto")
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
sapply(Auto[, 1:7], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
sapply(Auto[, 1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
sapply(Auto[, 1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
subset <- Auto[-c(10:85), 1:7]
sapply(subset, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
sapply(subset, mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
sapply(subset, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
pairs(Auto)
plot(horsepower ~ weight, Auto)
As weight increases, the horsepower of the vehicle tends to increase.
plot(mpg~ year, Auto)
Vehicles from later model years tend to have better mpg.
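The two visual impressions above can be checked numerically. The following is a small sketch, assuming the ISLR package is installed; `num` and `mpg_cor` are names introduced only for this example, and Pearson correlation captures only the linear part of each relationship.

```r
library(ISLR)   # provides the Auto data used above

# Pearson correlations for the two relationships plotted above.
cor(Auto$weight, Auto$horsepower)
cor(Auto$year, Auto$mpg)

# Correlation of mpg with the other quantitative columns, strongest first.
num <- Auto[, 1:7]
mpg_cor <- cor(num)["mpg", -1]
mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
```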
library(MASS)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
data("Boston")
dim(Boston)
## [1] 506 14
The data set has 506 rows and 14 columns. Each row is an observation (a suburb) and each column is a variable.
pairs(Boston)
p1 <- ggplot(Boston, aes(x= medv,y =crim))+
geom_jitter()
p1
Comparing medv to crim shows a negative relationship between median home value and crime rate: as the median home value increases, the crime rate falls.
p2 <- ggplot(Boston, aes(x=lstat, y= crim))+
geom_jitter()
p2
Comparing lstat to crim shows that as lstat (the percentage of lower-status population) increases, so does the crime rate in the neighborhood.
p3 <- ggplot(Boston, aes(x=zn, y= crim))+geom_jitter()
p4 <- ggplot(Boston, aes(x=indus, y= crim))+geom_jitter()
p5 <- ggplot(Boston, aes(x=chas, y= crim))+geom_jitter()
p6 <- ggplot(Boston, aes(x=nox, y= crim))+geom_jitter()
p7 <- ggplot(Boston, aes(x=rm, y= crim))+geom_jitter()
p8 <- ggplot(Boston, aes(x=age, y= crim))+geom_jitter()
p9 <- ggplot(Boston, aes(x=dis, y= crim))+geom_jitter()
p10 <- ggplot(Boston, aes(x=rad, y= crim))+geom_jitter()
p11 <- ggplot(Boston, aes(x=tax, y= crim))+geom_jitter()
p12 <- ggplot(Boston, aes(x=ptratio, y= crim))+geom_jitter()
p13 <- ggplot(Boston, aes(x=black, y= crim))+geom_jitter()
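# multiplot() is not provided by the packages loaded above; it is assumed to be
# defined elsewhere (for example, the Rmisc package has a function of this name).
multiplot(p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p1, p2, cols = 4)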
corrplot(cor(Boston), type= "upper")
Based on the graphs, these predictors might provide insight into the crime rate:
age, medv, lstat, dis
The relationships between medv, lstat, and crim were explored in 10(b).
age vs. crim
As the proportion of owner-occupied units built before 1940 increases, so does the crime rate in the area.
dis vs. crim
The closer a town is to the employment centers, the more crime it has.
Based on the correlation plot, rad, tax, and nox could also be considered predictors of crim.
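The correlation plot can be complemented by listing the correlations with crim directly. A minimal sketch, where `crim_cor` is a name introduced only for this example and Pearson correlation measures only linear association:

```r
library(MASS)   # provides the Boston data used above

# Correlation of crim with each of the other 13 variables.
crim_cor <- cor(Boston)["crim", -1]

# Rank by absolute value so strong negative associations are not overlooked.
round(crim_cor[order(abs(crim_cor), decreasing = TRUE)], 2)
```

Sorting by absolute value puts negatively related predictors such as medv and dis alongside the positively related ones, which makes the candidates easier to compare than reading them off the plot.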
dim(subset(Boston, crim > quantile(Boston$crim, 0.75)))
## [1] 127 14
dim(subset(Boston, ptratio > quantile(Boston$ptratio, 0.75)))
## [1] 56 14
dim(subset(Boston, tax > quantile(Boston$tax, 0.75)))
## [1] 5 14
Using the 75th percentile of each variable as the cutoff:
127 towns can be considered high-crime areas.
5 towns can be considered high-tax areas.
56 towns can be considered to have a high pupil-teacher ratio.
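The count of 5 for tax is much smaller than the 127 for crim even though both use the same percentile cutoff. One plausible explanation is that many towns share exactly the third-quartile tax value, and the strict `>` comparison excludes ties. The sketch below checks this; `q_tax`, `n_above`, and `n_tied` are names introduced only for this example.

```r
library(MASS)   # provides the Boston data used above

q_tax <- unname(quantile(Boston$tax, 0.75))
q_tax

n_above <- sum(Boston$tax >  q_tax)   # strictly above the 75th percentile
n_tied  <- sum(Boston$tax == q_tax)   # exactly equal to it
c(above = n_above, tied = n_tied, at_or_above = n_above + n_tied)
```

If the tied count is large, using `>=` instead of `>` gives a very different picture of how many towns count as high-tax.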
sum(Boston$chas ==1 )
## [1] 35
median(Boston$ptratio)
## [1] 19.05
subset(Boston, medv == min(medv))
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90 30.59
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98
## medv
## 399 5
## 406 5
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
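Observations #399 and #406 have the lowest median value of owner-occupied homes (medv = 5).
Compared with the overall ranges, age is at its maximum, medv at its minimum, zn at its minimum, and rad at its maximum. Neither suburb bounds the Charles River (chas = 0).
tax is at the 3rd quartile, while crim and lstat are both above their 3rd quartiles.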
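Rather than reading each value against the summary table by eye, the position of the two lowest-medv suburbs within every variable's distribution can be computed with the empirical CDF. A minimal sketch, with `lowest` and `pct` as names introduced only for this example:

```r
library(MASS)   # provides the Boston data used above

lowest <- subset(Boston, medv == min(medv))

# For each variable, the fraction of all suburbs whose value is at or below
# the value observed in each of the lowest-medv suburbs.
pct <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(lowest[[v]]))
rownames(pct) <- rownames(lowest)
round(pct, 2)
```

A value of 1 means the suburb is at the maximum for that variable, and values near 0 mean it is near the minimum.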
### (h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
sum(Boston$rm >7)
## [1] 64
sum(Boston$rm >8)
## [1] 13
summary(subset(Boston,rm >8))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
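64 suburbs average more than seven rooms per dwelling and 13 average more than eight.
The suburbs averaging more than eight rooms have relatively lower crime than Boston overall in terms of the mean (0.72 versus 3.61) and the maximum (3.47 versus 88.98), although their median crime rate is higher (0.52 versus 0.26). They also have a lower lstat (median 4.14 versus 11.36) and a higher medv (median 48.3 versus 21.2).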
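The comparison with the full data set can be put side by side in a single table. A minimal sketch, where `big` and `vars` are names introduced only for this example and the choice of crim, lstat, and medv as the columns to compare is an assumption:

```r
library(MASS)   # provides the Boston data used above

big  <- subset(Boston, rm > 8)
vars <- c("crim", "lstat", "medv")

# Means and medians for the >8-room suburbs next to those for all suburbs.
rbind(
  mean_rm_gt_8   = colMeans(big[, vars]),
  mean_all       = colMeans(Boston[, vars]),
  median_rm_gt_8 = apply(big[, vars], 2, median),
  median_all     = apply(Boston[, vars], 2, median)
)
```

Showing both the mean and the median matters here because crim is heavily right-skewed, so the two can point in different directions.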