Conceptual Questions
This scenario is a regression problem because the desired outcome is quantitative, and we are most interested in inference because we want to know which factors affect the salary. N = 500 and p = 3.
This scenario is a classification problem because the desired outcome is qualitative, and we are most interested in prediction because we want to know whether the launch will be a success or a failure. N = 20 and p = 13 (price charged, marketing budget, competition price, and ten other variables).
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
This scenario is a regression problem because the desired outcome is quantitative, and we are most interested in prediction because we want to know what the % change could be. N = 52 and p = 3 (the % changes in the US, British, and German markets; the % change in the USD/Euro is the response).
The advantage of a very flexible model is a potential reduction in bias, meaning our predictions could be more accurate than those of a less flexible model. The disadvantage is potentially high variance, meaning the fitted model may change considerably from one training sample to another, which can lead to overfitting.
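The trade-off described above can be illustrated with a small simulation. This is only a sketch: the true function, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for flexibility are all assumptions made for illustration.

```r
# Sketch: training vs. test error as flexibility (polynomial degree) grows.
# f, the noise sd, and the sample sizes are illustrative assumptions.
set.seed(1)
f <- function(x) sin(2 * x)                      # assumed "true" f
sim <- function(n) {
  x <- runif(n, 0, 3)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}
train <- sim(50)
test  <- sim(1000)

degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)        # degree-d polynomial fit
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
}))
rownames(mse) <- degrees
round(mse, 3)                                    # one row per degree
```

Training error can only go down as the degree increases because the models are nested. Test error typically falls at first and then stops improving or rises once the extra flexibility starts fitting noise.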
A very flexible model could be preferred when we care more about accurate prediction than about inference. For example, if we want to predict fraud, we might care less about understanding the relationship and more about correctly identifying the cases. A less flexible approach could be preferred when the goal is inference, that is, when we are interested in which factors affect the outcome rather than in prediction.
The difference between a parametric and a non-parametric approach is that a parametric approach assumes a functional form for f and estimates its parameters, while a non-parametric approach does not. Instead of assuming a form, a non-parametric approach tries to get an estimate that follows the data points closely, so it can take a wide range of shapes. The advantages of a parametric approach to regression or classification are that it simplifies the problem, the model is usually easier to interpret, and overfitting may be avoided. The disadvantage is that the assumed form may not match the true shape of f, resulting in inaccurate predictions.
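To make the contrast concrete, here is a minimal sketch on simulated data. The data-generating process, the choice of `lm()` as the parametric fit, and `loess()` with `span = 0.3` as the non-parametric fit are assumptions chosen for illustration, not part of the exercise.

```r
# Sketch: a parametric (linear) fit vs. a non-parametric (loess) fit.
# The simulated relationship below is an illustrative assumption.
set.seed(2)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x + sin(x) + rnorm(100, sd = 0.4)
d <- data.frame(x, y)

param_fit    <- lm(y ~ x, data = d)                 # assumes f is linear in x
nonparam_fit <- loess(y ~ x, data = d, span = 0.3)  # local fitting, no global form

coef(param_fit)                   # two interpretable numbers: intercept and slope
mse_param    <- mean(residuals(param_fit)^2)        # training MSE, linear
mse_nonparam <- mean(residuals(nonparam_fit)^2)     # training MSE, loess
c(linear = mse_param, loess = mse_nonparam)
```

The linear fit is summarized by an intercept and a slope, which is what makes it easy to interpret, but it cannot follow the wiggle in the data. The loess fit follows the wiggle but has no comparable two-number summary.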
Applied Questions
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
fix(College)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00
##            Median : 1558   Median : 1110   Median : 434   Median :23.00
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
##    Top25perc      F.Undergrad     P.Undergrad         Outstate
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
##    Room.Board       Books           Personal         PhD
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
##     Terminal       S.F.Ratio      perc.alumni        Expend
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
##    Grad.Rate
##  Min.   : 10.00
##  1st Qu.: 53.00
##  Median : 65.00
##  Mean   : 65.46
##  3rd Qu.: 78.00
##  Max.   :118.00
pairs(College[,1:10])
plot(College$Private, College$Outstate)
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
Elite =rep("No", nrow(College))
Elite[College$Top10perc>50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College, Elite)
summary(Elite)
## No Yes
## 699 78
plot(College$Elite, College$Outstate)
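The binning step above can also be written in one line with `ifelse()`. The sketch below uses a small made-up vector (`top10`, hypothetical values standing in for `College$Top10perc`) so that it runs without the ISLR package.

```r
# Sketch: bin a percentage into a two-level factor.
# top10 holds hypothetical values standing in for College$Top10perc.
top10 <- c(12, 51, 50, 96, 23, 75)

# Strictly greater than 50 counts as "Yes", so a value of exactly 50 is "No".
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Setting `levels` explicitly keeps "No" as the first level even if a subset happens to contain no "No" rows, which keeps later plots and tables in a consistent order.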
par(mfrow=c(2,2))
hist(College$Apps)
hist(College$Accept)
hist(College$Top10perc)
hist(College$Top25perc)
According to the box plot below of the number of applicants accepted, fewer students are accepted at private colleges than at public ones.
plot(College$Private, College$Accept, ylab="Number of Applications Accepted", xlab="Private College = Yes")
According to the box plots below of the top 10% and top 25% variables, it seems that private colleges enroll a larger share of students who fall into these categories.
par(mfrow=c(1,2))
plot(College$Private, College$Top10perc, ylab="Students from top 10% of H.S. class (%)", xlab="Private College = Yes")
plot(College$Private, College$Top25perc, ylab="Students from top 25% of H.S. class (%)", xlab="Private College = Yes")
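Readings taken from box plots can be checked numerically with group medians. The helper below, `group_medians()`, is a hypothetical name introduced here, and the `toy` data frame is made up so the sketch runs without the ISLR package.

```r
# Sketch: medians of selected columns within each level of a grouping column.
# group_medians() and the toy data are illustrative assumptions.
group_medians <- function(data, group, vars) {
  aggregate(data[vars], by = data[group], FUN = median)
}

toy <- data.frame(
  Private   = factor(c("No", "No", "Yes", "Yes", "Yes")),
  Accept    = c(2000, 3000, 500, 700, 900),
  Top10perc = c(20, 24, 30, 22, 40)
)
res <- group_medians(toy, "Private", c("Accept", "Top10perc"))
res
```

With the real data loaded, the same call would be `group_medians(College, "Private", c("Accept", "Top10perc", "Top25perc"))`.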
According to the structure of the data, the quantitative predictors are mpg, displacement, horsepower, weight, acceleration, and year. The qualitative predictors are cylinders, origin, and name.
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Quants <- c(1,3,4,5,6,7)
QuantsAuto <- Auto[,Quants]
QualsAuto <- Auto[,-Quants]
sapply(QuantsAuto, range)
## mpg displacement horsepower weight acceleration year
## [1,] 9.0 68 46 1613 8.0 70
## [2,] 46.6 455 230 5140 24.8 82
sapply(QuantsAuto, mean)
## mpg displacement horsepower weight acceleration year
## 23.44592 194.41199 104.46939 2977.58418 15.54133 75.97959
sapply(QuantsAuto, sd)
## mpg displacement horsepower weight acceleration year
## 7.805007 104.644004 38.491160 849.402560 2.758864 3.683737
Subset9b <- QuantsAuto[-c(10:85),]
sapply(Subset9b, range)
## mpg displacement horsepower weight acceleration year
## [1,] 11.0 68 46 1649 8.5 70
## [2,] 46.6 455 230 4997 24.8 82
sapply(Subset9b, mean)
## mpg displacement horsepower weight acceleration year
## 24.40443 187.24051 100.72152 2935.97152 15.72690 77.14557
sapply(Subset9b, sd)
## mpg displacement horsepower weight acceleration year
## 7.867283 99.678367 35.708853 811.300208 2.693721 3.106217
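The three separate `sapply()` calls above could be collected into one table. This is a sketch only: `describe()` is a hypothetical helper name and the `toy` data frame is made up so the code runs without the ISLR package.

```r
# Sketch: min, max, mean, and sd for every column in one call.
# describe() and the toy data are illustrative assumptions.
describe <- function(df) {
  sapply(df, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
}

toy <- data.frame(mpg = c(18, 15, 36, 27), weight = c(3504, 3693, 2100, 2500))
full   <- describe(toy)
subset <- describe(toy[-(2:3), ])   # drop rows 2-3, analogous to dropping rows 10-85
full
subset
```

Negative indexing removes rows by position, which is the same idea used for `Subset9b`.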
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$origin <- as.factor(Auto$origin)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
According to the plots below, mpg decreases as horsepower increases, and vehicles with 8 cylinders have lower mpg than vehicles with fewer cylinders.
par(mfrow=c(1,2))
plot(Auto$horsepower, Auto$mpg, xlab="Horsepower", ylab="mpg")
plot(Auto$cylinders, Auto$mpg, xlab="Cylinders", ylab="mpg")
Yes, because the plots above indicate that horsepower and cylinders are related to mpg, so using these variables to predict mpg could be viable.
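One way to put a number on this kind of visual impression is a correlation and a quick linear fit. The sketch below uses the built-in `mtcars` data purely as a stand-in (an assumption made so the code runs with base R alone); its `hp` and `cyl` columns play the roles of `horsepower` and `cylinders` in `Auto`.

```r
# Sketch: quantify how horsepower and cylinders relate to mpg.
# mtcars is a stand-in data set; swap in Auto with horsepower/cylinders to rerun.
r_hp <- cor(mtcars$hp, mtcars$mpg)                  # linear association, hp vs. mpg
fit  <- lm(mpg ~ hp + factor(cyl), data = mtcars)   # cylinders treated as qualitative
r2   <- summary(fit)$r.squared                      # share of mpg variance explained

r_hp
r2
```

A clearly negative correlation and a sizeable R-squared would support using these variables as predictors, although a scatter plot remains the better guide to non-linear patterns.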
How many rows are in this data set? How many columns? What do the rows and columns represent?
There are 506 rows and 14 columns. Each row represents a Boston suburb, and each column represents a characteristic of that suburb, such as the per capita crime rate by town, the average number of rooms per dwelling, and the property-tax rate.
library(MASS)
Boston
?Boston
## starting httpd help server ... done
It seems that some of the predictors are correlated, but most are not. The top-left plot indicates that the median value of owner-occupied homes tends to increase as the number of rooms increases. The bottom-left plot indicates that suburbs with a smaller percentage of lower-status residents tend to have more rooms per dwelling. The top-right plot (median value against age) and the bottom-right plot (age against tax) are widely scattered, with no clear pattern.
par(mfrow = c(2, 2))
plot(Boston$rm, Boston$medv)
plot(Boston$age, Boston$medv)
plot(Boston$lstat, Boston$rm)
plot(Boston$tax, Boston$age)
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Yes, it seems that the crime rate increases as the age of homes increases. The weighted mean of distances to five Boston employment centers also seems to be associated with the crime rate: the crime rate is higher where the distance is shorter.
par(mfrow = c(3, 3))
plot(Boston$zn, Boston$crim)
plot(Boston$indus, Boston$crim)
plot(Boston$chas, Boston$crim)
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
plot(Boston$rad, Boston$crim)
plot(Boston$tax, Boston$crim)
plot(Boston$ptratio, Boston$crim)
plot(Boston$black, Boston$crim)
plot(Boston$lstat, Boston$crim)
plot(Boston$medv, Boston$crim)
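The scatter plots above can be complemented by the correlation of each predictor with `crim`. This is a sketch that assumes the MASS package is installed; Pearson correlation only captures linear association, so it is a rough summary rather than a replacement for the plots.

```r
# Sketch: correlation of every other column with the per capita crime rate.
library(MASS)

predictors <- setdiff(names(Boston), "crim")
crim_cor <- sapply(Boston[predictors], function(x) cor(x, Boston$crim))

# Largest positive associations first, largest negative last.
round(sort(crim_cor, decreasing = TRUE), 2)
```

Predictors near the top or bottom of the sorted list are the ones most worth inspecting in the plots.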
According to the histogram, most suburbs have a low crime rate, with a long right tail of suburbs that have a high crime rate. Crime rates are as low as 0.00632 and as high as 88.98.
hist(Boston$crim, breaks=50)
range(Boston$crim)
## [1] 0.00632 88.97620
The histogram of the tax rates is quite uneven. Most suburbs are on the left side, with a separate group at the far right. There are tax rates as low as 187 and as high as 711.
hist(Boston$tax, breaks=50)
range(Boston$tax)
## [1] 187 711
The histogram of the pupil-teacher ratio is also quite uneven. There are ratios as low as 12.6 and as high as 22.0.
hist(Boston$ptratio, breaks=50)
range(Boston$ptratio)
## [1] 12.6 22.0
There are 35 suburbs that bound the Charles River.
chascat <- as.factor(Boston$chas)
summary(chascat)
## 0 1
## 471 35
The median pupil-teacher ratio is 19.05.
median(Boston$ptratio)
## [1] 19.05
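The suburb with the lowest median value is row 399 (which.min() returns the first row that attains the minimum). Compared with the overall ranges, several of its predictors sit at the extremes: age (100), rad (24), and black (396.9) are at their maximums and zn (0) is at its minimum, while crim (38.35), tax (666), ptratio (20.2), and lstat (30.59) are toward the high end.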
which.min(Boston$medv)
## [1] 399
print(Boston[399,])
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
## medv
## 399 5
sapply(Boston, range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 0.32
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 396.90
## lstat medv
## [1,] 1.73 5
## [2,] 37.97 50
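Comparing one row against the ranges by eye can be made more systematic by computing, for each variable, the share of suburbs whose value is at or below that row's value. The sketch below assumes the MASS package is installed; the "at or below" definition of a percentile is a choice made here for simplicity.

```r
# Sketch: where does the lowest-medv suburb sit within each variable?
library(MASS)

i <- which.min(Boston$medv)          # first row attaining the minimum medv

# Share of suburbs with a value <= this suburb's value, per variable.
pct <- sapply(names(Boston), function(v) mean(Boston[[v]] <= Boston[i, v]))
round(100 * pct)

# which.min() reports only the first minimum, so list any ties as well.
tied <- which(Boston$medv == min(Boston$medv))
tied
```

Heavily tied variables such as zn and chas can show a large percentage even at their minimum value, so those entries should be read alongside the ranges.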
There are 64 suburbs that average more than 7 rooms per dwelling.
dim(Boston[Boston$rm>7,])
## [1] 64 14
There are 13 suburbs that average more than 8 rooms per dwelling.
dim(Boston[Boston$rm>8,])
## [1] 13 14
The crime rates in these suburbs are at the low end: they range from 0.02 to 3.47, compared with a maximum of 88.98 across all suburbs.
Room8 <- Boston[Boston$rm>8,]
sapply(Room8, range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.02009 0 2.68 0 0.4161 8.034 8.4 1.8010 2 224 13.0 354.55
## [2,] 3.47428 95 19.58 1 0.7180 8.780 93.9 8.9067 24 666 20.2 396.90
## lstat medv
## [1,] 2.47 21.9
## [2,] 7.44 50.0
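Ranges alone can hide where the bulk of the values lie, so a side-by-side comparison of medians may help. This is a sketch that assumes the MASS package is installed; the choice of the median as the summary is an assumption made here because it is less affected by extreme values than the mean.

```r
# Sketch: medians for suburbs averaging more than 8 rooms vs. all suburbs.
library(MASS)

big <- Boston$rm > 8                 # logical index of the large-dwelling suburbs
cmp <- rbind(
  all     = sapply(Boston, median),
  rm_gt_8 = sapply(Boston[big, ], median)
)
round(cmp[, c("crim", "lstat", "medv")], 2)
```

Rows of `cmp` can be compared column by column; only three columns are printed here to keep the output short.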