Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
*** (a): Regression and inference; n = 500 and p = 3 ***
*** (b): Classification and prediction; n = 20 and p = 13 ***
*** (c): Regression and prediction; n = 52 and p = 3 ***
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
*** Flexible Approach Advantages: lower bias and a better fit for non-linear models. Flexible Approach Disadvantages: a greater number of parameters must be estimated, the fit can follow the noise too closely (overfitting), and the variance increases. ***
*** A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results ***
*** A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results ***
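As a small illustration of this trade-off, the sketch below fits both a simple linear model and a very flexible polynomial to simulated data (the data, seed, and polynomial degree are arbitrary choices, not from the assignment); the flexible fit has lower training error but largely tracks the noise.
# Sketch: flexible vs. inflexible fits on simulated data (illustrative only)
set.seed(1)
x <- runif(50, 0, 10)
y <- sin(x) + rnorm(50, sd = 0.5)      # the true f is non-linear
inflexible <- lm(y ~ x)                # few parameters: higher bias, lower variance
flexible   <- lm(y ~ poly(x, 15))      # many parameters: lower bias, higher variance
mean(resid(inflexible)^2)              # training MSE of the simple fit
mean(resid(flexible)^2)                # training MSE of the flexible fit (smaller, but may overfit)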
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f.
A non-parametric approach does not assume a particular form of f and so requires a very large sample to accurately estimate f.
Advantages: - Modeling f is simplified to estimating a small set of parameters, so far fewer observations are needed than for a non-parametric fit.
Disadvantages: - If the assumed form of f is wrong, the estimate of f will be inaccurate. - The model can still overfit the observations if an overly flexible form is chosen.
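A brief sketch of the contrast on simulated data (hypothetical, not part of the assignment): lm() assumes a linear form for f and estimates two parameters, while loess() makes no such assumption and instead smooths locally, which requires more observations.
# Sketch: parametric vs. non-parametric estimates of f (illustrative only)
set.seed(1)
x <- runif(200, 0, 10)
y <- log(1 + x) + rnorm(200, sd = 0.3)
para    <- lm(y ~ x)        # parametric: assumes f(x) = b0 + b1*x
nonpara <- loess(y ~ x)     # non-parametric: no fixed functional form
ord <- order(x)
plot(x, y, col = "grey", main = "Parametric vs non-parametric fit")
lines(x[ord], predict(para)[ord], col = "red", lwd = 2)      # straight-line fit
lines(x[ord], predict(nonpara)[ord], col = "blue", lwd = 2)  # local smoother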
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
setwd("C:/Users/brend/OneDrive/Desktop/UTSA/Summer 2021/Algo 2/Assignment 1")
college <- read.csv("College.csv")
head(college[, 1:5])
## X Private Apps Accept Enroll
## 1 Abilene Christian University Yes 1660 1232 721
## 2 Adelphi University Yes 2186 1924 512
## 3 Adrian College Yes 1428 1097 336
## 4 Agnes Scott College Yes 417 349 137
## 5 Alaska Pacific University Yes 193 146 55
## 6 Albertson College Yes 587 479 158
fix(college)
Each row should be identified by the name of the corresponding university so that R does not try to perform calculations on the names. Here the names from the first column are saved into a separate vector, rownames, for later lookups, and that column is then removed from the data frame.
rownames = college[, 1]   # save the university names in a separate vector for later lookups
fix(college)
college = college[, -1]   # drop the name column so only the measured variables remain
head(college[, 1:5])
## Private Apps Accept Enroll Top10perc
## 1 Yes 1660 1232 721 23
## 2 Yes 2186 1924 512 16
## 3 Yes 1428 1097 336 22
## 4 Yes 417 349 137 60
## 5 Yes 193 146 55 16
## 6 Yes 587 479 158 38
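For reference, the ISLR lab stores the university names as actual row names of the data frame rather than in a separate vector; a minimal sketch of that alternative, using a copy named college2 so the data frame above is untouched:
# Alternative from the ISLR lab: make the names proper row names
college2 <- read.csv("College.csv")
rownames(college2) <- college2[, 1]   # university names become row names
college2 <- college2[, -1]            # remove the now-redundant name column
head(rownames(college2))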
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
#college[,1:10]
# Use the ISLR built-in College data here: its Private column is already a factor,
# while the CSV-loaded college stores Private as plain character text.
pairs(College[, 1:10])
plot(College$Private, College$Outstate, xlab = "Private", xlim = c(0, 2.5),
     ylab = "Outstate", main = "Outstate vs Private")
Elite=rep("No",nrow(college))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college ,Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite", xlim = c(0, 2.5), ylab = "Outstate", main = "Outstate vs Elite")
par(mfrow = c(2,2))
hist(college$Enroll, col = 10, xlab = "Enroll", ylab = "Count")
hist(college$Top10perc, col = 10, xlab = "Top10perc", ylab = "Count")
hist(college$Personal, col = 5, xlab = "Personal", ylab = "Count")
hist(college$Grad.Rate, col = 5, xlab = "Grad.Rate", ylab = "Count")
# summary(college$Room.Board)
summary(college$Grad.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 53.00 65.00 65.46 78.00 118.00
gradrate <- college[college$Grad.Rate > 99,]
nrow(gradrate)
## [1] 11
rownames[as.numeric(rownames(gradrate))]  # look up the saved university names by row index
## [1] "Amherst College" "Cazenovia College"
## [3] "College of Mount St. Joseph" "Grove City College"
## [5] "Harvard University" "Harvey Mudd College"
## [7] "Lindenwood College" "Missouri Southern State College"
## [9] "Santa Clara University" "Siena College"
## [11] "University of Richmond"
I thought it was interesting to find that these 11 schools report graduation rates above 99%. I am curious to see how their curricula are structured.
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
data("Auto")
summary(complete.cases(Auto))
## Mode TRUE
## logical 392
sapply(Auto, class)
## mpg cylinders displacement horsepower weight acceleration
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## year origin name
## "numeric" "numeric" "factor"
quant <- sapply(Auto, is.numeric)
quant
## mpg cylinders displacement horsepower weight acceleration
## TRUE TRUE TRUE TRUE TRUE TRUE
## year origin name
## TRUE TRUE FALSE
sapply(Auto[, quant], range)
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 9.0 3 68 46 1613 8.0 70 1
## [2,] 46.6 8 455 230 5140 24.8 82 3
Ranges: mpg: 9.0 - 46.6; cylinders: 3 - 8; displacement: 68 - 455; horsepower: 46 - 230; weight: 1613 - 5140; acceleration: 8.0 - 24.8; year: 70 - 82; origin: 1 - 3.
sapply(Auto[, quant], function(x) signif(c(mean(x), sd(x)), 2))
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 23.0 5.5 190 100 3000 16.0 76.0 1.60
## [2,] 7.8 1.7 100 38 850 2.8 3.7 0.81
Row [1,] is the mean and row [2,] is the standard deviation.
delete_obs <- sapply(Auto[-10:-85, quant], function(x) round(c(range(x), mean(x), sd(x)), 2))  # drop observations 10 through 85
rownames(delete_obs) <- c("min", "max", "mean", "sd")
delete_obs
## mpg cylinders displacement horsepower weight acceleration year origin
## min 11.00 3.00 68.00 46.00 1649.00 8.50 70.00 1.00
## max 46.60 8.00 455.00 230.00 4997.00 24.80 82.00 3.00
## mean 24.40 5.37 187.24 100.72 2935.97 15.73 77.15 1.60
## sd 7.87 1.65 99.68 35.71 811.30 2.69 3.11 0.82
pairs(Auto)
In the pairs plot, horsepower and weight show a clearly linear relationship, as do horsepower and displacement. Displacement and weight also look roughly linear, with a bit more scatter.
The mpg panels suggest that displacement, horsepower, and weight have the strongest relationships with mpg and would likely be the most useful predictors of mpg.
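One quick way to back up this reading of the pairs plot (not part of the original output) is to look at the correlations of mpg with the other quantitative variables:
# Correlation of mpg with each quantitative predictor
round(cor(Auto[, quant])["mpg", ], 2)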
library(MASS)
#Boston
#?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
The Boston data frame has 506 rows and 14 columns. Each row represents a suburb of Boston and each column one of 14 variables recorded for it, such as the per-capita crime rate (crim) and the median value of owner-occupied homes (medv).
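These counts are easy to confirm directly:
dim(Boston)    # 506 rows (suburbs), 14 columns (variables)
names(Boston)  # the 14 variables, e.g. crim, rm, medv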
pairs(Boston)
There appear to be relationships between crim and several of the other variables, for example rm and age.
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
Looking at the correlations between crim and the other variables: rad (0.63) and tax (0.58) have the strongest positive correlations with crim, followed by lstat, nox, and indus, while dis, black, and medv have moderate negative correlations with it. Note that these values are correlation coefficients, not p-values, so comparing them to a 0.05 significance level is not meaningful.
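The crim column of the correlation matrix is easier to read when sorted; a quick sketch:
# Correlations of crim with every variable, sorted from most positive to most negative
sort(cor(Boston)[, "crim"], decreasing = TRUE)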
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
#?Boston
library(ggplot2)
qplot(Boston$crim, binwidth=5 , xlab = "Crime rate", ylab="Number of Suburbs" )
Most suburbs have very low crime rates, but a small number have very high rates: the distribution is strongly right-skewed, with a median of 0.26 and a maximum of about 89.
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
qplot(Boston$tax, binwidth=50 , xlab = "Full-value property-tax rate per $10,000", ylab="Number of Suburbs")
A little over 125 suburbs have high property-tax rates, above 600 per $10,000 of full value.
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
qplot(Boston$ptratio, binwidth=5, xlab ="Pupil-teacher ratio by town", ylab="Number of Suburbs")
Over 150 suburbs have a large pupil-teacher ratio, falling roughly in the 17.5 to 22.5 range of the histogram.
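To put rough counts behind the three comments above, a quick tally of the upper tails (the thresholds are judgment calls read off the histograms, not values given in the assignment):
sum(Boston$crim > 20)      # suburbs with a very high crime rate
sum(Boston$tax > 600)      # suburbs with a property-tax rate above 600
sum(Boston$ptratio > 20)   # suburbs with a high pupil-teacher ratio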
nrow(subset(Boston, chas ==1))
## [1] 35
35 suburbs bound the Charles River.
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
The median pupil-teacher ratio is 19.05.
least_occupied <- Boston[order(Boston$medv),]
least_occupied[1,]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
## medv
## 399 5
Suburb 399 has the lowest median value of owner-occupied homes.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Comparing suburb 399 to the summary of Boston as a whole: its crime rate of 38.35 is far above the third quartile of 3.68, its age is at the maximum of 100, it has the minimum medv, its tax rate of 666 sits at the third quartile, its pupil-teacher ratio of 20.2 is close to the maximum of 22, and its nox level is well above average. Taken together, these values show that suburb 399 is not a suburb people would want to live in.
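One way to make this comparison systematic (a sketch, not part of the original assignment) is to compute, for each variable, the fraction of suburbs with values at or below suburb 399's value; values near 1 mean the suburb sits at the top of that variable's range.
# Percentile rank of suburb 399 for every variable in Boston
suburb399 <- least_occupied[1, ]
round(sapply(names(Boston), function(v) mean(Boston[[v]] <= suburb399[[v]])), 2)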
rm_7 <- subset(Boston, rm>7)
nrow(rm_7)
## [1] 64
64 suburbs average more than 7 rooms per dwelling.
rm_8 <- subset(Boston, rm>8)
nrow(rm_8)
## [1] 13
13 suburbs average more than 8 rooms per dwelling.
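To say something about the suburbs that average more than eight rooms per dwelling, one quick comparison (a sketch) is their column means against the data set as a whole:
# Column means for the rm > 8 suburbs next to the means over all suburbs
round(rbind(rm_gt_8 = colMeans(rm_8), all_suburbs = colMeans(Boston)), 2)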