Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
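# Read the data from the working directory; data(College) is only needed
# when loading the copy bundled with the ISLR package instead.
college <- read.csv("College.csv")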
Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
head(college[, 1:5])
rownames <- college[, 1]
college <- college[, -1]
head(college[, 1:5])
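An alternative to keeping the names in a separate vector is to attach them to the data frame as row names, so that they stay with the rows through any later subsetting. The following is a minimal sketch on a made-up three-row data frame; `toy` and all of its values are hypothetical and are not taken from College.csv.

```r
# Toy stand-in for the first columns of College.csv (made-up values).
toy <- data.frame(
  Name = c("College A", "College B", "College C"),
  Apps = c(1200, 3400, 560),
  PhD  = c(70, 103, 85),
  stringsAsFactors = FALSE
)

# Use the first column as row names, then drop it from the data.
rownames(toy) <- toy[, 1]
toy <- toy[, -1]

# Row names survive subsetting, so no separate lookup vector is needed.
odd_rows <- toy[toy$PhD > 100, ]
rownames(odd_rows)
```

Row names must be unique, so this only works when no two institutions share a name.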
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
college$Private <- as.factor(college$Private)
pairs(college[, 1:10])
plot(college$Private, college$Outstate,
     xlab = "Private University",
     ylab = "Out of State tuition in USD",
     main = "Outstate Tuition Plot")
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)
## No Yes
## 699 78
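The same binning can be written in one step with ifelse(). This is a small sketch on a made-up vector (the values in `top10perc` are hypothetical), not a replacement for the code above.

```r
# Made-up percentages of new students from the top 10% of their class.
top10perc <- c(12, 55, 50, 78, 31)

# ifelse() labels each value in one step; factor() fixes the level order.
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Setting `levels` explicitly keeps "No" first even if a subset happens to contain no "No" values.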
Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
plot(college$Elite, college$Outstate,
     xlab = "Elite University",
     ylab = "Out of State tuition in USD",
     main = "Outstate Tuition Plot")
par(mfrow = c(2,2))
hist(college$Books, xlab = "Books", ylab = "Count")
hist(college$PhD, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, xlab = "% alumni", ylab = "Count")
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
The maximum of PhD is 103, which is impossible for a percentage of faculty and suggests a data-entry error. Let us see how many universities have this value and what their names are.
faculty.phd <- college[college$PhD == 103, ]
nrow(faculty.phd)
## [1] 1
rownames[as.numeric(rownames(faculty.phd))]
## [1] "Texas A&M University at Galveston"
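The summary output earlier also shows a maximum Grad.Rate of 118, so the same check is worth running on every percentage column at once. The sketch below does this on a made-up data frame; `toy`, its values and the helper `out_of_range()` are hypothetical names introduced for illustration.

```r
# Toy data frame with two percentage columns (made-up values).
toy <- data.frame(
  PhD       = c(70, 103, 85, 92),
  Grad.Rate = c(65, 80, 118, 54),
  row.names = c("College A", "College B", "College C", "College D")
)

# A percentage should lie in [0, 100]; flag anything outside that range.
out_of_range <- function(x) x < 0 | x > 100

# Count offending values per column, then list the rows involved.
bad_counts <- sapply(toy, function(x) sum(out_of_range(x)))
bad_rows   <- rownames(toy)[apply(sapply(toy, out_of_range), 1, any)]

bad_counts
bad_rows
```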
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Which of the predictors are quantitative, and which are qualitative?
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Quantitative: mpg, displacement, horsepower, weight, acceleration and year
Qualitative: cylinders, origin and name (cylinders and origin are stored as numbers, but they should be converted to factors because they are categorical rather than continuous)
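Once the categorical columns have been converted to factors, R can report the split itself. The following sketch uses a made-up data frame with a few Auto-like columns; the rows and the `categorical` vector are assumptions for illustration.

```r
# Toy stand-in for a few Auto columns (made-up rows).
toy <- data.frame(
  mpg       = c(18, 26.5, 31),
  cylinders = c(8L, 4L, 4L),
  origin    = c(1L, 2L, 3L),
  name      = c("car a", "car b", "car c"),
  stringsAsFactors = FALSE
)

# Columns judged categorical are converted to factors explicitly.
categorical <- c("cylinders", "origin", "name")
toy[categorical] <- lapply(toy[categorical], as.factor)

# After conversion, is.numeric() separates the two kinds of column.
is_quant <- sapply(toy, is.numeric)
names(toy)[is_quant]   # quantitative
names(toy)[!is_quant]  # qualitative
```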
What is the range of each quantitative predictor? You can answer this using the range() function.
range(auto$mpg)
## [1] 9.0 46.6
range(auto$cylinders)
## [1] 3 8
range(auto$displacement)
## [1] 68 455
range(auto$weight)
## [1] 1613 5140
range(auto$acceleration)
## [1] 8.0 24.8
range(auto$year)
## [1] 70 82
range(auto$origin)
## [1] 1 3
What is the mean and standard deviation of each quantitative predictor?
sapply(auto[, -c(4, 9)], mean)
## mpg cylinders displacement weight acceleration year
## 23.445918 5.471939 194.411990 2977.584184 15.541327 75.979592
## origin
## 1.576531
sapply(auto[, -c(4, 9)], sd)
## mpg cylinders displacement weight acceleration year
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641 3.6837365
## origin
## 0.8055182
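The range, mean and standard deviation can also be collected in a single table with one sapply() call. This is a sketch on made-up numbers; `toy` and its values are hypothetical.

```r
# Toy numeric data (made-up values).
toy <- data.frame(
  mpg    = c(18, 15, 26, 31),
  weight = c(3504, 3693, 2130, 1985)
)

# One row per statistic, one column per variable.
stats <- sapply(toy, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})

round(stats, 2)
```

Because the anonymous function returns a named vector, sapply() simplifies the result to a matrix whose row names are the statistic names.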
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
subset <- auto[-c(10:85), -c(4,9)]
sapply(subset, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
sapply(subset, mean)
## mpg cylinders displacement weight acceleration year
## 24.404430 5.373418 187.240506 2935.971519 15.726899 77.145570
## origin
## 1.601266
sapply(subset, sd)
## mpg cylinders displacement weight acceleration year
## 7.867283 1.654179 99.678367 811.300208 2.693721 3.106217
## origin
## 0.819910
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
auto$horsepower <- as.factor(auto$horsepower)
auto$name <- as.factor(auto$name)
pairs(auto)
Four-cylinder vehicles appear to get more miles per gallon than the other vehicles. Weight, displacement and horsepower all appear to be inversely related to mpg. There is also an overall increase in mpg over the years: it almost doubled in one decade.
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
From the plots above, mpg is negatively correlated with weight, displacement and horsepower.
cor(auto$mpg, auto$weight)
## [1] -0.8322442
cor(auto$mpg, auto$displacement)
## [1] -0.8051269
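# horsepower was converted to a factor above, so go through as.character()
# to recover the original values rather than the factor level codes
auto$horsepower <- as.numeric(as.character(auto$horsepower))
cor(auto$mpg, auto$horsepower)
## [1] -0.7784268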
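Because horsepower was converted to a factor before plotting, care is needed when turning it back into numbers: as.numeric() applied to a factor returns the internal level codes, not the original values. A minimal sketch with made-up horsepower values (the vector below is hypothetical):

```r
# Made-up horsepower values stored as a factor.
hp_factor <- factor(c(130, 165, 88, 46))

# as.numeric() on a factor yields the level codes (1, 2, 3, ...).
codes <- as.numeric(hp_factor)

# Going through as.character() recovers the original numbers.
values <- as.numeric(as.character(hp_factor))

codes
values
```

A correlation computed on the level codes is closer to a rank correlation than to the Pearson correlation of the actual values, so the two can differ noticeably.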
This exercise involves the Boston housing data set.
To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
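In these plots, the higher crime rates occur mainly in suburbs with high nox, a high proportion of old homes (age) and short distances to employment centres (dis). Crime and tax rate are positively, not inversely, related: the suburbs with the highest crime rates also have high tax rates.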
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
hist(Boston$crim, breaks = 50)
Most suburbs have a very low crime rate: only 18 of the 506 suburbs have crim > 20, so roughly 96% of the data falls at or below that value.
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and nox, rm, age, dis, lstat and medv.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
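Counts like these are easier to compare when expressed as shares of all suburbs. Since a logical vector is treated as 0/1, sum() gives the count and mean() gives the proportion. The sketch below uses ten made-up crime rates, not the Boston data.

```r
# Made-up per-capita crime rates for ten suburbs.
crim <- c(0.02, 0.5, 1.3, 25.0, 0.09, 3.7, 88.9, 0.6, 0.01, 9.2)

# sum() of a logical vector counts the TRUEs; mean() gives their share.
n_high     <- sum(crim > 20)
share_high <- mean(crim > 20)

n_high
share_high
```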
How many of the suburbs in this data set bound the Charles river?
nrow(Boston[Boston$chas == 1, ])
## [1] 35
What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
Which suburb of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
row.names(Boston[which.min(Boston$medv), ])
## [1] "399"
low.med <- Boston[order(Boston$medv), ] # sort rows by medv in ascending order
low.med[1, ]
Suburb 399 has the lowest median value of owner-occupied homes (medv = 5) among the suburbs of Boston.
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax
## [1] 666
Its tax rate lies near the top of the overall range.
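Indexing rows with min() and with which.min() gives different results: min() returns the smallest value, which R then interprets as a row number, while which.min() returns the position of the smallest value. A short sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data frame (made-up values); row names are the default "1".."4".
toy <- data.frame(
  medv = c(24.0, 21.6, 34.7, 2.0),
  tax  = c(296, 242, 222, 666)
)

# min() returns the VALUE 2, so this selects row 2, not the minimum row.
wrong_row <- toy[min(toy$medv), ]

# which.min() returns the POSITION of the minimum, here row 4.
right_row <- toy[which.min(toy$medv), ]

wrong_row
right_row
```

When several rows tie for the minimum, which.min() returns only the first; `which(toy$medv == min(toy$medv))` would list all of them.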
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
nrow(Boston[Boston$rm > 7, ])
## [1] 64
64 of the suburbs average more than seven rooms per dwelling.
nrow(Boston[Boston$rm > 8, ])
## [1] 13
13 of the suburbs average more than eight rooms per dwelling.
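One way to comment on the suburbs with more than eight rooms is to put their medians next to the medians of the full data set. The following is a sketch on a made-up data frame with three Boston-like columns; the values are hypothetical and say nothing about the real data.

```r
# Toy stand-in for three Boston columns (made-up values).
toy <- data.frame(
  rm    = c(6.1, 8.3, 5.9, 8.7, 6.6),
  medv  = c(21, 48, 18, 50, 24),
  lstat = c(12, 4, 17, 3, 10)
)

# Rows averaging more than eight rooms per dwelling.
big <- toy[toy$rm > 8, ]

# Stack the subset medians on top of the overall medians.
comparison <- rbind(
  over_8_rooms = sapply(big, median),
  all_suburbs  = sapply(toy, median)
)

comparison
```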