Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
college = read.csv("College.csv")
rownames(college) = college[, 1]  # use the college-name column as row names
fix(college)
college = college[, -1]           # drop the now-redundant name column
fix(college)
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
pairs(college[,2:11])
plot(as.factor(college$Private), college$Outstate, xlab = "Private", ylab= "Out of State Tuition ")
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"  # no leading space in the label
Elite = as.factor(Elite)
college = data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
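A similar indicator can be built in one step with `ifelse()` and `factor()`. This is only a sketch on made-up percentages (the vector `top10perc` is a stand-in for `college$Top10perc`), but it shows how setting `levels` explicitly keeps the order of the categories from depending on how the labels happen to sort.

```r
# Hypothetical percentages standing in for college$Top10perc
top10perc <- c(23, 16, 60, 89, 38, 51, 50)

# ifelse() labels each value; factor() fixes the level order explicitly
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```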
plot(college$Elite, college$Outstate,
xlab = "Elite School", ylab= "Out of State Tuition ")
par(mfrow=c(2,2))
hist(college$Enroll)
hist(college$PhD)
hist(college$F.Undergrad)
hist(college$Apps)
num_cols <- unlist(lapply(college, is.numeric))
x1 = college[,num_cols] #Taking all the numeric columns
library(tidyverse)
## -- Attaching packages --------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
x1 %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
cormat <- round(cor(x1),2)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Exploring the data further shows that most of the variables are unimodal and that many are skewed to the right. Pearson's correlations show some relationships that make intuitive sense. For example, the student-faculty ratio (S.F.Ratio) is negatively correlated with expenditure per student (Expend): as the S.F. ratio increases, expenditure per student goes down, because there are more students per faculty member.
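The skewness read off the density plots can also be put into a number. The sketch below uses simulated data rather than the College columns, and `sample_skewness` is a helper name I made up for illustration; a positive value indicates a longer right tail.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# Positive values indicate a longer right tail. (Helper name is hypothetical.)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Simulated stand-ins: a right-skewed variable and a roughly symmetric one
skewed    <- rlnorm(500, meanlog = 7, sdlog = 1)
symmetric <- rnorm(500, mean = 65, sd = 15)

round(c(skewed = sample_skewness(skewed),
        symmetric = sample_skewness(symmetric)), 2)
```

Assuming `x1` holds the numeric College columns as above, `sapply(x1, sample_skewness)` would give one value per variable.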
Auto = read.csv("Auto.csv")
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
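The `chr` type for horsepower suggests that some entries are not numbers. If the file marks missing values with a token such as "?" (an assumption about this copy of Auto.csv), the `na.strings` argument of `read.csv()` can turn them into `NA` at read time. A minimal sketch on a small inline CSV:

```r
# A tiny inline CSV standing in for Auto.csv; "?" marks a missing horsepower
csv_text <- "mpg,horsepower,weight
18,130,3504
25,?,2046
15,165,3693"

# na.strings tells read.csv which tokens to treat as NA
auto_demo <- read.csv(text = csv_text, na.strings = "?")

str(auto_demo)                    # horsepower is now read as a number
sum(is.na(auto_demo$horsepower))  # count of missing horsepower values
```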
| Variable | Quantitative vs Qualitative |
|---|---|
| mpg | Quantitative |
| cylinders | Quantitative |
| displacement | Quantitative |
| horsepower | Quantitative |
| weight | Quantitative |
| acceleration | Quantitative |
| year | Qualitative |
| origin | Qualitative |
| name | Qualitative |
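A table like this can be cross-checked against the storage type of each column. The sketch below uses a made-up data frame with the same kinds of columns as Auto. Integer-coded categories such as year and origin look numeric to R until they are converted to factors, so the check only agrees with the table after that conversion.

```r
# Toy data frame with the same kinds of columns as Auto (values are made up)
demo <- data.frame(
  mpg    = c(18, 25, 31),
  year   = c(70L, 76L, 82L),
  origin = c(1L, 3L, 2L),
  name   = c("car a", "car b", "car c"),
  stringsAsFactors = FALSE
)

# Integer-coded categories look numeric until they are converted to factors
demo$year   <- as.factor(demo$year)
demo$origin <- as.factor(demo$origin)

var_type <- sapply(demo, function(col) {
  if (is.numeric(col)) "Quantitative" else "Qualitative"
})
var_type
```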
Auto$horsepower <- as.numeric(Auto$horsepower) #Making it numeric
## Warning: NAs introduced by coercion
Auto$origin <- as.factor(Auto$origin)
Auto$year <- as.factor(Auto$year)
num_cols <- unlist(lapply(Auto, is.numeric))
x2 = na.omit(Auto[,num_cols])
lapply(x2,range)
## $mpg
## [1] 9.0 46.6
##
## $cylinders
## [1] 3 8
##
## $displacement
## [1] 68 455
##
## $horsepower
## [1] 46 230
##
## $weight
## [1] 1613 5140
##
## $acceleration
## [1] 8.0 24.8
lapply(x2,mean)
## $mpg
## [1] 23.44592
##
## $cylinders
## [1] 5.471939
##
## $displacement
## [1] 194.412
##
## $horsepower
## [1] 104.4694
##
## $weight
## [1] 2977.584
##
## $acceleration
## [1] 15.54133
lapply(x2,sd)
## $mpg
## [1] 7.805007
##
## $cylinders
## [1] 1.705783
##
## $displacement
## [1] 104.644
##
## $horsepower
## [1] 38.49116
##
## $weight
## [1] 849.4026
##
## $acceleration
## [1] 2.758864
subsetx2 = x2[-(10:85), ]  # drop the 10th through 85th rows
lapply(subsetx2,range)
## $mpg
## [1] 11.0 46.6
##
## $cylinders
## [1] 3 8
##
## $displacement
## [1] 68 455
##
## $horsepower
## [1] 46 230
##
## $weight
## [1] 1649 4997
##
## $acceleration
## [1] 8.5 24.8
lapply(subsetx2,mean)
## $mpg
## [1] 24.40443
##
## $cylinders
## [1] 5.373418
##
## $displacement
## [1] 187.2405
##
## $horsepower
## [1] 100.7215
##
## $weight
## [1] 2935.972
##
## $acceleration
## [1] 15.7269
lapply(subsetx2,sd)
## $mpg
## [1] 7.867283
##
## $cylinders
## [1] 1.654179
##
## $displacement
## [1] 99.67837
##
## $horsepower
## [1] 35.70885
##
## $weight
## [1] 811.3002
##
## $acceleration
## [1] 2.693721
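The range, mean, and standard deviation can also be collected into one table per data set, which makes the full and reduced versions easier to compare. This is a sketch on `mtcars`, which ships with R and is used here only as a stand-in for the Auto columns; `describe` is a hypothetical helper name.

```r
# mtcars ships with R and is used here only as a stand-in for the Auto columns
cars_num <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]

# One row per variable: min, max, mean, sd (helper name is hypothetical)
describe <- function(df) {
  t(sapply(df, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  }))
}

full_stats <- describe(cars_num)
# Negative indices drop rows; -(10:20) removes the 10th through 20th rows
subset_stats <- describe(cars_num[-(10:20), ])

round(full_stats, 2)
round(subset_stats, 2)
```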
x2 %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
cormat <- round(cor(x2),2)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
pairs(x2)
Exploring the Auto data shows strong linear relationships between many of the variables. Acceleration stands out as the one variable without a strong linear relationship with the others.
On the topic of predicting miles per gallon, our plots suggest that many variables have a linear relationship with mpg. Displacement, horsepower, and weight all have a negative relationship with mpg, so modeling mpg should be fairly straightforward. One possible issue is multicollinearity: displacement, horsepower, and weight all have positive linear relationships with each other.
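One common way to gauge multicollinearity is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below computes it by hand on `mtcars`, whose `disp`, `hp`, and `wt` columns play the role of displacement, horsepower, and weight. It is a stand-in, not a result for the Auto data.

```r
# mtcars ships with R; disp, hp and wt play the role of
# displacement, horsepower and weight (a stand-in, not the Auto data)
predictors <- mtcars[, c("disp", "hp", "wt")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

Values close to 1 indicate little overlap with the other predictors. Larger values mean the coefficient for that predictor would be estimated less precisely.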
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506 14
The Boston dataset contains 506 rows and 14 columns. Each row represents a suburb of Boston, and each column is a variable describing that suburb, such as the per capita crime rate or the average number of rooms per dwelling.
pairs(Boston)
Based on the pairwise scatterplots, one relationship of note is between nox and dis: the closer a suburb is to the employment centers, the higher its nitrogen oxide concentration, which makes sense. Another is between nox and age, where I think spatial autocorrelation is at play. Age is positively correlated with nox, I suspect because older homes tend to be closer to the employment centers. I don't think the age of a home would itself influence the nitrogen oxides in the atmosphere, unless the home has a special kind of chimney.
Regarding the per capita crime rate by town, the predictors that stand out in the scatterplots are age and dis, and both associations are fairly weak: crime tends to rise with age and fall with dis. Suburbs farther from the employment centers have less crime and, as mentioned above, the older homes are closer to those centers.
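Weak associations are hard to judge by eye in a 14 by 14 scatterplot matrix, so it may help to rank the correlations with crim directly. A minimal sketch, assuming the `Boston` data frame from MASS as loaded above:

```r
library(MASS)  # provides the Boston data frame used above

# Pearson correlation of every other variable with crim
crim_cor <- cor(Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]

# Order by absolute strength so the strongest associations come first
round(crim_cor[order(abs(crim_cor), decreasing = TRUE)], 2)
```

The sign of each entry shows the direction of the association, which is easy to misread from the plots alone.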
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
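Looking at the ranges of the variables, we can see that some suburbs have very high values for crime rate (crim), proportion of older homes (age), residential land zoned for large lots (zn), nitrogen oxide concentration (nox), share of lower-status population (lstat), and median home value (medv).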
sum(Boston$chas)
## [1] 35
There are 35 suburbs that bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio across the suburbs is 19.05.
Boston[Boston$medv == min(Boston$medv),]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90 30.59
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98
## medv
## 399 5
## 406 5
Two suburbs share the lowest median value of owner-occupied homes. Both have a high crime rate and nitrogen oxide levels above the third quartile. Age is at its maximum (100) and dis is close to its minimum. Both have a high pupil-teacher ratio (20.2, the third quartile). The black value is also high, as is the share of lower-status population.
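To make "high" and "low" concrete, each value for these two suburbs can be expressed as a percentile rank within the whole data set. This is a sketch using the empirical CDF from base R, assuming the MASS `Boston` data frame; a value of 1.00 means the suburb sits at the maximum of that variable.

```r
library(MASS)

lowest <- Boston[Boston$medv == min(Boston$medv), ]

# ecdf(x)(v) is the share of suburbs whose value is <= v,
# so 1.00 means the suburb sits at the maximum of that variable
pct_rank <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(lowest[[v]]))
rownames(pct_rank) <- rownames(lowest)

round(t(pct_rank), 2)
```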
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
Boston[Boston$rm > 8,]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 98 0.12083 0 2.89 0 0.4450 8.069 76.0 3.4952 2 276 18.0 396.90 4.21
## 164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 388.45 3.32
## 205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 390.55 2.88
## 225 0.31533 0 6.20 0 0.5040 8.266 78.3 2.8944 8 307 17.4 385.05 4.14
## 226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 382.00 4.63
## 227 0.38214 0 6.20 0 0.5040 8.040 86.5 3.2157 8 307 17.4 387.38 3.13
## 233 0.57529 0 6.20 0 0.5070 8.337 73.3 3.8384 8 307 17.4 385.91 2.47
## 234 0.33147 0 6.20 0 0.5070 8.247 70.4 3.6519 8 307 17.4 378.95 3.95
## 254 0.36894 22 5.86 0 0.4310 8.259 8.4 8.9067 7 330 19.1 396.90 3.54
## 258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 389.70 5.12
## 263 0.52014 20 3.97 0 0.6470 8.398 91.5 2.2885 5 264 13.0 386.86 5.91
## 268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 384.54 7.44
## 365 3.47428 0 18.10 1 0.7180 8.780 82.9 1.9047 24 666 20.2 354.55 5.29
## medv
## 98 38.7
## 164 50.0
## 205 50.0
## 225 44.8
## 226 50.0
## 227 37.6
## 233 41.7
## 234 48.3
## 254 42.8
## 258 50.0
## 263 48.8
## 268 50.0
## 365 21.9
Suburbs that average more than 8 rooms per dwelling generally have low crime rates and a low share of lower-status population. The median value of owner-occupied homes is also generally high. Strangely, despite the high home values, the property-tax rate is below average in all but one of these suburbs.
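The comparison with the rest of the data can be made side by side by putting the group means next to the overall means. A minimal sketch, assuming the MASS `Boston` data frame; the choice of the four variables is mine and simply mirrors the ones discussed above.

```r
library(MASS)

many_rooms <- Boston[Boston$rm > 8, ]

# Side-by-side means for a few variables discussed above
vars <- c("crim", "lstat", "medv", "tax")
comparison <- rbind(
  rm_over_8   = colMeans(many_rooms[, vars]),
  all_suburbs = colMeans(Boston[, vars])
)
round(comparison, 2)
```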