Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
The scenario is a regression problem, because the response, CEO salary, is quantitative. We are most interested in inference, because the goal is to understand which factors affect CEO salary rather than to predict it. The predictors are profit, number of employees, and industry; the first two are quantitative and industry is qualitative. n=500 and p=3.
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
The scenario is a classification problem, because the response takes one of two values, success or failure, so it is qualitative rather than quantitative. We are most interested in prediction, because we want to know how the new product will turn out. n=20 and p=13 (price, marketing budget, competition price, and the ten other variables).
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
The scenario is a regression problem, because the response, the % change in the exchange rate, is quantitative. We are most interested in prediction, as the scenario states. n=52 (one observation per week of 2012) and p=3 (the % changes in the US, British, and German markets).
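As a small illustration of how n and p are counted, the sketch below builds a stand-in data frame shaped like scenario (b). Every value is simulated, and the names `products`, `outcome`, and `predictors` are hypothetical. It only shows that n is the number of rows and p is the number of columns other than the response.

```r
# Hypothetical stand-in for scenario (b); all values are simulated.
set.seed(1)
n_products <- 20
predictors <- as.data.frame(matrix(rnorm(n_products * 13), nrow = n_products))
products <- data.frame(
  outcome = factor(sample(c("failure", "success"), n_products, replace = TRUE)),
  predictors
)

n <- nrow(products)      # number of observations
p <- ncol(products) - 1  # number of predictors: every column except the response
is_classification <- is.factor(products$outcome)  # qualitative response
cat("n =", n, " p =", p, " classification:", is_classification, "\n")
```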
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantages of a very flexible approach over a less flexible one are that it can fit non-linear relationships and therefore has lower bias. The disadvantages are that it has higher variance, usually requires estimating more parameters, and can overfit by following the noise in the training data, which increases the test error. It is also harder to interpret. A more flexible approach is preferred when the sample size is large, the true relationship is strongly non-linear, and we care mainly about prediction. A less flexible approach is preferred when the sample size is small, the noise in the data is high, or the goal is inference and the results need to be interpretable.
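The trade-off can be seen in a small simulation. This is only a sketch under assumed settings: the true function, noise level, sample sizes, and polynomial degrees are all choices made for illustration, not part of the exercise. Flexibility is controlled by the degree of a polynomial fit with `lm()`.

```r
# Simulated data: the true relationship is non-linear (an assumption for the demo).
set.seed(42)
true_f <- function(x) sin(2 * pi * x)
make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = true_f(x) + rnorm(n, sd = 0.3))
}
train <- make_data(30)
test <- make_data(1000)

# Mean squared error of a fitted model on a given data set.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Fit a polynomial of degree d and report training and test error.
fit_and_score <- function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}

degrees <- c(1, 3, 12)  # low, moderate, and high flexibility
results <- as.data.frame(t(sapply(degrees, fit_and_score)))
print(results)
```

Training error can only go down as the degree grows, because each polynomial contains the lower-degree ones. Test error typically falls at first and then rises again once the fit starts to follow the noise.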
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
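A parametric statistical learning approach assumes a functional form for f (for example, that f is linear in the predictors), which reduces the problem of estimating f to estimating a fixed set of parameters. A non-parametric approach makes no explicit assumption about the form of f and instead tries to get close to the data points without being too rough or wiggly. The advantages of a parametric approach are that estimating a few parameters is simpler, needs fewer observations, and gives a model that is easier to interpret. Its disadvantage is that the assumed form may not match the true f, in which case the estimate will be poor; choosing a more flexible form to compensate can lead to overfitting. A non-parametric approach can fit a much wider range of shapes for f, but it generally needs many more observations to estimate f accurately.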
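A minimal sketch of the contrast, on simulated data: `lm()` stands in for a parametric method and `loess()` for a non-parametric one. The true function, noise level, and `span = 0.3` are assumptions made for illustration.

```r
set.seed(7)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
dat <- data.frame(x = x, y = y)

# Parametric: assume f(x) = b0 + b1 * x, so only two numbers are estimated.
linear_fit <- lm(y ~ x, data = dat)

# Non-parametric: local regression, with no global form assumed for f.
loess_fit <- loess(y ~ x, data = dat, span = 0.3)

linear_mse <- mean(residuals(linear_fit)^2)
loess_mse <- mean(residuals(loess_fit)^2)

print(coef(linear_fit))  # the whole linear model is summarised by these two values
cat("training MSE, linear:", linear_mse, " loess:", loess_mse, "\n")
```

Because the simulated relationship is not linear, the straight-line assumption is wrong here and the linear fit carries a large error. When the assumed form is close to the truth, the parametric fit would usually be the more efficient choice.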
College <- read.csv(file="C:/Users/amand/Downloads/College.csv")
head(College[ , 1:4])
## X Private Apps Accept
## 1 Abilene Christian University Yes 1660 1232
## 2 Adelphi University Yes 2186 1924
## 3 Adrian College Yes 1428 1097
## 4 Agnes Scott College Yes 417 349
## 5 Alaska Pacific University Yes 193 146
## 6 Albertson College Yes 587 479
rownames=College[,1]
fix(College)
College=College[,-1]
fix(College)
summary(College)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
A=read.csv(file="C:/Users/amand/Downloads/College.csv",header=T)
College[,1]=as.numeric(factor(College[,1]))
pairs(College[, 1:10])
College=read.csv(file="C:/Users/amand/Downloads/College.csv",header=T)
Elite=rep("No",nrow(College))
Elite[College$Top10perc>50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College,Elite)
summary(College$Elite)
## No Yes
## 699 78
par(mfrow = c(2,2))
hist(College$Top25perc, col= 2,xlab = "Top25perc",ylab = "Count")
hist(College$PhD, col = 3,xlab = "PhD",ylab = "Count")
hist(College$Grad.Rate, col = 4,xlab = "Grad Rate",ylab = "Count")
hist(College$Expend,col = 5,xlab = "Expend",ylab = "Count")
summary(College$Top25perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 41.0 54.0 55.8 69.0 100.0
summary(College$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
summary(College$Grad.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 53.00 65.00 65.46 78.00 118.00
summary(College$Expend)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3186 6751 8377 9660 10830 56233
weird.phd=College[College$PhD==103, ]
nrow(weird.phd)
## [1] 1
rownames[as.numeric(rownames(weird.phd))]
## [1] "Texas A&M University at Galveston"
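# Grad.Rate peaks at 118, not 103, so test for rates above 100;
# Grad.Rate==103 matches no rows and returns character(0)
weird.grad.rate=College[College$Grad.Rate>100, ]
nrow(weird.grad.rate)
rownames[as.numeric(rownames(weird.grad.rate))]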
Auto=read.csv(file="C:/Users/amand/Downloads/Auto.csv")
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
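The `chr` type for horsepower suggests that the column contains a non-numeric marker for missing values. The sketch below assumes that marker is "?" and uses a small made-up sample, not the real file, to show how `na.strings` and `na.omit()` could handle it, so the column would not have to be dropped later.

```r
# Small made-up sample in the same layout as Auto.csv; "?" marks a missing value.
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
22,95,car c"

raw <- read.csv(text = csv_text)                      # "?" forces a non-numeric column
clean <- read.csv(text = csv_text, na.strings = "?")  # "?" is read as NA instead

print(class(raw$horsepower))
print(class(clean$horsepower))

clean <- na.omit(clean)  # drop the rows that contain missing values
print(nrow(clean))
```

For the real file, the same idea would be `read.csv(file, na.strings = "?")` followed by `na.omit()`, assuming "?" is indeed the marker used there.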
summary(Auto[,-c(4,9)])
## mpg cylinders displacement weight acceleration
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613 Min. : 8.00
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 1st Qu.:2223 1st Qu.:13.80
## Median :23.00 Median :4.000 Median :146.0 Median :2800 Median :15.50
## Mean :23.52 Mean :5.458 Mean :193.5 Mean :2970 Mean :15.56
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 3rd Qu.:3609 3rd Qu.:17.10
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140 Max. :24.80
## year origin
## Min. :70.00 Min. :1.000
## 1st Qu.:73.00 1st Qu.:1.000
## Median :76.00 Median :1.000
## Mean :75.99 Mean :1.574
## 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :82.00 Max. :3.000
sapply(Auto[,-c(4,9)],mean)
## mpg cylinders displacement weight acceleration year
## 23.515869 5.458438 193.532746 2970.261965 15.555668 75.994962
## origin
## 1.574307
sapply(Auto[,-c(4,9)],sd)
## mpg cylinders displacement weight acceleration year
## 7.8258039 1.7015770 104.3795833 847.9041195 2.7499953 3.6900049
## origin
## 0.8025495
newauto=Auto[-c(10:85),-c(4,9)]
sapply(newauto,mean)
## mpg cylinders displacement weight acceleration year
## 24.438629 5.370717 187.049844 2933.962617 15.723053 77.152648
## origin
## 1.598131
sapply(newauto,sd)
## mpg cylinders displacement weight acceleration year
## 7.9081842 1.6534857 99.6353853 810.6429384 2.6805138 3.1112298
## origin
## 0.8161627
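The range of each quantitative predictor can be obtained in the same `sapply()` style as the means and standard deviations. The sketch uses the built-in `mtcars` data purely as a stand-in so that it runs without Auto.csv. The column choice is arbitrary.

```r
# mtcars ships with base R and is only a stand-in for the Auto data here.
quantitative <- mtcars[, c("mpg", "disp", "hp", "wt")]

# range() returns c(min, max), so sapply() gives a 2-row matrix, one column per variable.
ranges <- sapply(quantitative, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```

With the data above, the equivalent call would be `sapply(Auto[,-c(4,9)], range)`.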
library(base)
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked _by_ '.GlobalEnv':
##
## Auto, College
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats)
library(graphics)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
pairs(Boston)
par(mfrow=c(1,3))
boxplot(Boston$crim, xlab = "per capita crime rate by town", main = "Box-plot of crime rate")
boxplot(Boston$tax, xlab = "full-value property-tax rate per $10,000", main = "Box-plot of tax-rate")
boxplot(Boston$ptratio, xlab = "pupil-teacher ratio by town", main = "Box-plot of Pupil-teacher ratio")
table(Boston$chas)
##
## 0 1
## 471 35
median(Boston$ptratio)
## [1] 19.05
Boston%>%mutate(Census_tracts=c(1:length(Boston$crim)))%>%filter(medv == min(medv))
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 2 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
## Census_tracts
## 1 399
## 2 406
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
more7 = Boston %>% filter(rm > 7)
more8 = Boston %>% filter(rm > 8)
data.frame("more_than_7_rooms" = c(length(more7$rm)),
"more_than_8_rooms" = c(length(more8$rm)))
## more_than_7_rooms more_than_8_rooms
## 1 64 13
boxplot(more8, main="Boxplot of data which have more than 8 rooms per dwelling")