Conceptual Questions

  1. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This scenario is a regression problem because the response, CEO salary, is quantitative. We are most interested in inference because we want to know which factors affect salary. Here n = 500 and p = 3 (profit, number of employees, and industry).

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

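This scenario is a classification problem because the response (success or failure) is qualitative. We are most interested in prediction because we want to know whether the launch will succeed or fail. Here n = 20 and p = 13 (price, marketing budget, competition price, and ten other variables; the success/failure label is the response, not a predictor).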

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

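This scenario is a regression problem because the response, the % change in the USD/Euro exchange rate, is quantitative. We are most interested in prediction because we want to know what the % change could be. Here n = 52 and p = 3 (the % changes in the US, British, and German markets; the USD/Euro change is the response).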
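
As a minimal sketch of how n and p are counted, the snippet below builds a simulated stand-in for scenario (a). The data frame `firms` and all of its values are hypothetical; only its shape matters. The response column is left out when counting predictors.

```r
# Hypothetical stand-in for the firm data in scenario (a); values are simulated.
set.seed(1)
firms <- data.frame(
  profit     = rnorm(500),
  employees  = rpois(500, lambda = 1000),
  industry   = factor(sample(c("tech", "retail", "energy"), 500, replace = TRUE)),
  ceo_salary = rnorm(500, mean = 1e6, sd = 2e5)   # response variable
)

n <- nrow(firms)                                   # number of observations
p <- length(setdiff(names(firms), "ceo_salary"))   # predictors only, response excluded
c(n = n, p = p)
```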

  2. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

The advantage of a very flexible model is a potential reduction in bias, meaning it can follow the true relationship more closely than a less flexible model. The disadvantage is potentially high variance, meaning the fitted model may change substantially when it is trained on a new sample, so it can overfit.

A very flexible model could be preferred when we care more about accurate prediction than about inference. For example, if we want to predict fraud, we might care less about understanding the relationship and more about correctly identifying the cases. A less flexible approach could be preferred when the goal is inference, that is, when we are interested in which factors affect the outcome rather than in prediction.
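
The trade-off can be explored with a small simulation. This is only a sketch under assumptions: the true function, the noise level, and the use of polynomial degree as a stand-in for flexibility are all choices made for illustration, not part of the exercise.

```r
# Simulated data (an assumption for illustration): a nonlinear truth plus noise.
set.seed(1)
f <- function(x) sin(2 * pi * x)
train <- data.frame(x = runif(100))
train$y <- f(train$x) + rnorm(100, sd = 0.3)
test <- data.frame(x = runif(1000))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Polynomial degree stands in for "flexibility".
degrees <- c(1, 3, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
results
```

Training error can only go down as the degree increases, because the models are nested. The test error column is the one to watch when judging whether the extra flexibility is worth it.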

  3. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

The difference between parametric and non-parametric approaches is that a parametric approach assumes a functional form for f and estimates its parameters, while a non-parametric approach does not. Instead of assuming a form, a non-parametric method seeks an estimate of f that gets close to the data points, so it can take on a wide range of shapes. The advantages of a parametric approach to regression or classification are that it simplifies the problem of estimating f, the model is often easier to interpret, and over-fitting may be avoided. The disadvantage is that the assumed form may not match the true shape of f, resulting in inaccurate predictions.
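
A short sketch of the contrast, using simulated data and two base R fitters: `lm()` as the parametric method and `loess()` as one possible non-parametric method. The data, the span value, and the choice of `loess()` are assumptions made for illustration.

```r
set.seed(2)
x <- sort(runif(200))
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
dat <- data.frame(x = x, y = y)

# Parametric: assume f is linear, so only two coefficients are estimated.
fit_lm <- lm(y ~ x, data = dat)
coef(fit_lm)

# Non-parametric: loess makes no global assumption about the shape of f.
fit_lo <- loess(y ~ x, data = dat, span = 0.3)

# Training error of each fit on the same data.
c(linear = mean(residuals(fit_lm)^2),
  loess  = mean(residuals(fit_lo)^2))
```

The linear fit is summarised by two numbers and is easy to interpret, but it cannot follow the curved shape. The loess fit follows the shape but has no simple set of coefficients to read off.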

Applied Questions

  1. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

library(ISLR)
fix(College)
i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(College[,1:10])

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(College$Private, College$Outstate)

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

Elite =rep("No", nrow(College))
Elite[College$Top10perc>50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College, Elite)

summary(Elite)
##  No Yes 
## 699  78
plot(College$Elite, College$Outstate)
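
The binning step can also be written in one line. The sketch below uses a short hypothetical vector `top10` rather than the College data, chosen so that the boundary case at exactly 50 is visible.

```r
# Hypothetical Top10perc values chosen to show the boundary case at 50.
top10 <- c(10, 50, 51, 96)

# One-step equivalent of the rep() / index / as.factor() sequence.
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))
table(elite)
```

Because the comparison is strict, a value of exactly 50 is labelled "No", which matches the wording "exceeds 50%".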

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow=c(2,2))
hist(College$Apps)
hist(College$Accept)
hist(College$Top10perc)
hist(College$Top25perc)
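
The exercise also asks for differing numbers of bins, which the `breaks` argument of `hist()` controls. A minimal sketch follows, using a simulated right-skewed variable `apps` as a hypothetical stand-in for a column such as Apps.

```r
# Simulated right-skewed counts standing in for a variable such as Apps.
set.seed(3)
apps <- round(rexp(777, rate = 1 / 3000))

par(mfrow = c(1, 2))
h_coarse <- hist(apps, breaks = 10, main = "breaks = 10")
h_fine   <- hist(apps, breaks = 50, main = "breaks = 50")

# Number of bins actually used (a single number for breaks is only a suggestion).
c(coarse = length(h_coarse$counts), fine = length(h_fine$counts))
```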

vi. Continue exploring the data, and provide a brief summary of what you discover.

According to the box plot below of the number of applicants accepted, private colleges accept fewer students than public colleges.

plot(College$Private, College$Accept, ylab="Number of Applications Accepted", xlab="Private College")

According to the box plots below of Top10perc and Top25perc, it seems that private colleges enroll a larger share of students who fall into these categories.

par(mfrow=c(1,2))
plot(College$Private, College$Top10perc, ylab="% of new students from top 10% of H.S. class", xlab="Private College")
plot(College$Private, College$Top25perc, ylab="% of new students from top 25% of H.S. class", xlab="Private College")

  2. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
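
The Auto data frame in the ISLR package already has 392 complete rows, so nothing needs to be removed here. When reading the raw file instead, missing values are marked with `?`, and a sketch of the clean-up looks like this. The inline CSV rows are hypothetical and stand in for the real file.

```r
# Tiny inline CSV (hypothetical rows) where "?" marks a missing value.
csv_text <- "mpg,horsepower,weight
18,130,3504
25,?,2046
15,165,3693"

auto_raw   <- read.csv(text = csv_text, na.strings = "?")
auto_clean <- na.omit(auto_raw)   # drop rows containing any NA

c(before = nrow(auto_raw), after = nrow(auto_clean))
```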

According to the structure of the data, the quantitative predictors are mpg, displacement, horsepower, weight, acceleration, and year. The qualitative predictors are cylinders, origin, and name.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
(b) What is the range of each quantitative predictor? You can answer this using the range() function.
Quants <- c(1,3,4,5,6,7)
QuantsAuto <- Auto[,Quants]
QualsAuto <- Auto[,-Quants]

sapply(QuantsAuto, range)
##       mpg displacement horsepower weight acceleration year
## [1,]  9.0           68         46   1613          8.0   70
## [2,] 46.6          455        230   5140         24.8   82
(c) What is the mean and standard deviation of each quantitative predictor?
sapply(QuantsAuto, mean)
##          mpg displacement   horsepower       weight acceleration         year 
##     23.44592    194.41199    104.46939   2977.58418     15.54133     75.97959
sapply(QuantsAuto, sd)
##          mpg displacement   horsepower       weight acceleration         year 
##     7.805007   104.644004    38.491160   849.402560     2.758864     3.683737
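
The three separate `sapply()` calls can be combined into one table. The sketch below uses the built-in `mtcars` data as a stand-in so that it runs without the ISLR package; the column names are those of `mtcars`, not of Auto.

```r
# mtcars (built into R) stands in for Auto so the snippet runs without ISLR.
quant_cols <- c("mpg", "disp", "hp", "wt", "qsec")

stats <- sapply(mtcars[, quant_cols], function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})
round(stats, 2)
```

The same call should work on `QuantsAuto` in place of `mtcars[, quant_cols]`.
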
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Subset9b <- QuantsAuto[-c(10:85),] 
sapply(Subset9b, range)
##       mpg displacement horsepower weight acceleration year
## [1,] 11.0           68         46   1649          8.5   70
## [2,] 46.6          455        230   4997         24.8   82
sapply(Subset9b, mean)
##          mpg displacement   horsepower       weight acceleration         year 
##     24.40443    187.24051    100.72152   2935.97152     15.72690     77.14557
sapply(Subset9b, sd)
##          mpg displacement   horsepower       weight acceleration         year 
##     7.867283    99.678367    35.708853   811.300208     2.693721     3.106217
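
A quick check of what the negative index does, using a placeholder data frame `df` with the same number of rows as Auto. The `id` column is hypothetical and is only there to track which rows survive.

```r
# Placeholder data frame with the same number of rows as Auto (392).
df <- data.frame(id = 1:392)

kept <- df[-(10:85), , drop = FALSE]   # negative index drops rows 10 to 85

nrow(kept)                       # 392 - 76 rows remain
c(9, 10, 85, 86) %in% kept$id    # rows 9 and 86 are kept, 10 and 85 are dropped
```
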
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$origin <- as.factor(Auto$origin)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

According to the plots below, mpg decreases as horsepower increases, and vehicles with 8 cylinders have lower mpg than vehicles with fewer cylinders.

par(mfrow=c(1,2))
plot(Auto$horsepower, Auto$mpg, xlab="Horsepower", ylab="mpg")
plot(Auto$cylinders, Auto$mpg, xlab="Cylinders", ylab="mpg")

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes. The plots above indicate that horsepower and cylinders are both associated with mpg, so including these variables when predicting mpg could be worthwhile.
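
A numerical check can back up what the plots suggest. The sketch below again uses the built-in `mtcars` data as a stand-in for Auto, so the column names (`hp`, `wt`, `disp`, `qsec`, `cyl`) are those of `mtcars` and the numbers will differ from Auto.

```r
# mtcars (built into R) is used as a stand-in for Auto; column names differ.
vars <- c("hp", "wt", "disp", "qsec")
cors <- sapply(mtcars[, vars], function(x) cor(x, mtcars$mpg))
sort(cors)

# Median mpg by number of cylinders, mirroring the box plot by cylinders.
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
med_by_cyl
```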

  3. This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.

How many rows are in this data set? How many columns? What do the rows and columns represent?

There are 506 rows and 14 columns. Each row represents a Boston suburb (town), and each column is a variable recorded for it, such as per capita crime rate, average number of rooms per dwelling, and property tax rate.

library(MASS)
Boston
?Boston
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

Some pairs of predictors appear correlated, but most do not. The top-left plot shows the median value of owner-occupied homes (medv) tending to increase with the average number of rooms (rm). The bottom-left plot shows that suburbs with a higher percentage of lower-status population (lstat) tend to have fewer rooms per dwelling. The top-right plot (medv against age) and the bottom-right plot (age against tax) are much more scattered, with no clear pattern.

par(mfrow = c(2, 2))
plot(Boston$rm, Boston$medv)
plot(Boston$age, Boston$medv)
plot(Boston$lstat, Boston$rm)
plot(Boston$tax, Boston$age)
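
As an alternative to plotting pairs one at a time, `pairs()` can draw a whole scatterplot matrix, and `cor()` gives a numerical companion to it. The subset of columns in `vars` is an arbitrary choice made for this sketch.

```r
library(MASS)   # provides the Boston data set

vars <- c("crim", "rm", "age", "lstat", "medv")
pairs(Boston[, vars])

# Correlation matrix for the same columns, rounded for readability.
cor_mat <- cor(Boston[, vars])
round(cor_mat, 2)
```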

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Yes. The crime rate tends to be higher in suburbs with a larger proportion of older homes (age). The weighted mean of distances to five Boston employment centers (dis) also appears to be related to the crime rate: the crime rate is higher at short distances.

par(mfrow = c(3, 3))
plot(Boston$zn, Boston$crim)
plot(Boston$indus, Boston$crim)
plot(Boston$chas, Boston$crim)
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
plot(Boston$rad, Boston$crim)
plot(Boston$tax, Boston$crim)

plot(Boston$ptratio, Boston$crim)
plot(Boston$black, Boston$crim)
plot(Boston$lstat, Boston$crim)
plot(Boston$medv, Boston$crim)
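
The thirteen plots can be summarised with the correlation of each column against crim. This is only a rough guide: Pearson correlation measures linear association, and crim is heavily skewed, so the plots remain the better evidence.

```r
library(MASS)

# Correlation of every other column with crim, sorted from low to high.
crim_cor <- cor(Boston)[, "crim"]
crim_cor <- sort(crim_cor[names(crim_cor) != "crim"])
round(crim_cor, 2)
```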

(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

According to the histogram, most of the suburbs have a low crime rate, with a few in the right tail that have a high crime rate. Crime rates range from as low as 0.00632 to as high as 88.98.

hist(Boston$crim, breaks=50)

range(Boston$crim)
## [1]  0.00632 88.97620

The histogram of the tax rates is quite uneven: most suburbs are on the left, and a separate group sits at the far right. Tax rates range from as low as 187 to as high as 711.

hist(Boston$tax, breaks=50)

range(Boston$tax)
## [1] 187 711

The histogram of the pupil-teacher ratio is also quite uneven. There are ratios as low as 12.6 and as high as 22.0.

hist(Boston$ptratio, breaks=50)

range(Boston$ptratio)
## [1] 12.6 22.0
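
One way to make "particularly high" concrete is the box-plot rule, which flags values above Q3 + 1.5 * IQR. The helper `count_high()` is a hypothetical name introduced for this sketch, and the rule itself is just one possible convention.

```r
library(MASS)

# Count values above Q3 + 1.5 * IQR, the usual box-plot rule for high outliers.
count_high <- function(x) {
  sum(x > quantile(x, 0.75) + 1.5 * IQR(x))
}
high_counts <- sapply(Boston[, c("crim", "tax", "ptratio")], count_high)
high_counts
```
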
(e) How many of the suburbs in this data set bound the Charles river?

There are 35 suburbs that bound the Charles River.

chascat <- as.factor(Boston$chas)
summary(chascat)
##   0   1 
## 471  35
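
Since chas is stored as a 0/1 dummy variable, the same count can be obtained without converting to a factor. A minimal sketch:

```r
library(MASS)

sum(Boston$chas == 1)    # suburbs where the dummy equals 1 (bounds the river)
table(Boston$chas)       # counts for both values of the dummy variable
```
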
(f) What is the median pupil-teacher ratio among the towns in this data set?

The median pupil-teacher ratio is 19.05.

median(Boston$ptratio)
## [1] 19.05
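(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

The suburb with the lowest median value is observation 399, with medv = 5. Compared with the overall ranges, its age (100), rad (24), and black (396.9) are at the maximum, and its crim (38.35), indus (18.1), nox (0.693), tax (666), ptratio (20.2), and lstat (30.59) are toward the high end. Its zn (0) and chas (0) are at the minimum, dis (1.4896) is close to the minimum, and rm (5.453) is below the middle of its range. High crime, old housing, and a high lstat are consistent with a low median home value.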

which.min(Boston$medv)
## [1] 399
print(Boston[399,])
##        crim zn indus chas   nox    rm age    dis rad tax ptratio black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9 30.59
##     medv
## 399    5
sapply(Boston, range)
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50
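
Comparing against the range alone can hide where a value sits relative to the other suburbs. The sketch below computes, for each column, the share of suburbs with a value at or below that of the selected suburb. Note that `which.min()` returns only the first index attaining the minimum, so the second line lists every row that attains it.

```r
library(MASS)

i <- which.min(Boston$medv)                # first row attaining the minimum
which(Boston$medv == min(Boston$medv))     # all rows attaining it, in case of ties

# Share of suburbs whose value is at or below that of suburb i, per column.
pct <- sapply(Boston, function(x) mean(x <= x[i]))
round(pct, 2)
```

A share of 1 means the suburb is at the maximum for that column, and a share near 0 means it is at or near the minimum.
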
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

There are 64 suburbs that average more than 7 rooms per dwelling.

dim(Boston[Boston$rm>7,])
## [1] 64 14

There are 13 suburbs that average more than 8 rooms per dwelling.

dim(Boston[Boston$rm>8,])
## [1] 13 14

Crime rates in these suburbs fall at the low end of the overall range (0.02 to 3.47, against an overall maximum of 88.98).

Room8 <- Boston[Boston$rm>8,]
sapply(Room8, range)
##         crim zn indus chas    nox    rm  age    dis rad tax ptratio  black
## [1,] 0.02009  0  2.68    0 0.4161 8.034  8.4 1.8010   2 224    13.0 354.55
## [2,] 3.47428 95 19.58    1 0.7180 8.780 93.9 8.9067  24 666    20.2 396.90
##      lstat medv
## [1,]  2.47 21.9
## [2,]  7.44 50.0
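
Ranges are sensitive to single extreme values, so a side-by-side comparison of medians may give a fairer picture of these suburbs. This is a sketch; the object names `room8` and `cmp` and the choice of columns to display are assumptions made for illustration.

```r
library(MASS)

room8 <- Boston[Boston$rm > 8, ]

# Medians for all suburbs versus the suburbs averaging more than eight rooms.
cmp <- rbind(all     = sapply(Boston, median),
             rm_gt_8 = sapply(room8, median))
round(cmp[, c("crim", "lstat", "medv")], 2)
```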