Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a regression problem because our response variable, salary, is continuous. We are most interested in making an inference about the relationships between the predictors and a CEO’s salary. Here \(n = 500\), the top 500 firms in the US, and \(p = 3\): profit, number of employees, and industry.
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a classification problem because we have a categorical response variable. We are interested in making a prediction about whether a product will be a success. Here \(n = 20\), the similar products that were previously launched, and \(p = 13\): the price charged for the product, marketing budget, competition price, and the ten other variables.
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence, we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression problem because we are interested in predicting a continuous outcome, the percent change in the USD/Euro exchange rate. Here \(n = 52\), one observation for each week of 2012, and \(p = 3\): the % change in the US market, the % change in the British market, and the % change in the German market.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Flexible models are able to fit the data more closely and can make better predictions than less flexible methods. However, this usually requires estimating a greater number of parameters, making the model more complex and less interpretable, and can lead to overfitting the data. Due to their close fit, more flexible methods tend to have lower bias (how far our average estimate is from the true value) and higher variance (how much our estimate of \(f\) would change given new data). When we are most interested in making predictions, or when the data are highly nonlinear, more flexible models are preferred.
Alternatively, when inference is the goal, less flexible models are preferred. More restrictive approaches, like linear regression, make it easier to understand the relationship between the response variable and the predictors. Given their relative inflexibility, a small change in the data will not cause a significant change in our estimate of \(f\), resulting in a smaller variance. However, these models do not follow the exact shape of the data as closely as more flexible methods, leading to a higher bias.
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
The key difference between the approaches is that parametric methods make an assumption about the distributional form, or the shape, of \(f\) while non-parametric methods do not. Since parametric methods assume a form of the data, they reduce the problem of estimating \(f\) down to estimating a small number of parameters. There is the potential disadvantage that the model chosen will not closely enough match the true unknown form of \(f\), in which case the model will not fit the data well. Non-parametric methods avoid this problem since essentially no assumptions about the distributional form of the data are made. Thus, they are able to fit a wider range of possible shapes than parametric methods. However, these approaches require a very large number of observations to obtain an accurate estimate of \(f\).
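A small simulated illustration (the data and model choices here are mine, not from the text): a linear model is parametric, reducing the problem to estimating two coefficients, while loess makes no assumption about the form of \(f\).

```r
# Simulated data from a nonlinear f, fit two ways
set.seed(1)
x <- runif(200, 0, 10)
y <- log(1 + x) + rnorm(200, sd = 0.2)

para    <- lm(y ~ x)     # parametric: assumes a linear form, estimates 2 parameters
nonpara <- loess(y ~ x)  # non-parametric: lets the local data determine the shape

plot(x, y, col = "grey")
abline(para, col = "red", lwd = 2)  # the linear fit misses the curvature (bias)
ord <- order(x)
lines(x[ord], fitted(nonpara)[ord], col = "blue", lwd = 2)  # loess tracks it
```

The linear fit is easy to interpret but biased here because the assumed form is wrong; the loess fit follows the curve, but non-parametric methods like it need many more observations to estimate \(f\) accurately.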
This exercise relates to the College data set, which can be found in the file `College.csv`. It contains a number of variables for 777 different universities and colleges in the US.

Use the `read.csv()` function to read the data into R. Call the loaded data `college`. Look at the data using the `fix()` function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
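A minimal version of those commands, assuming `College.csv` sits in the working directory:

```r
college <- read.csv("College.csv")  # read the data into R
rownames(college) <- college[, 1]   # keep the university names as row names
college <- college[, -1]            # drop the name column from the data itself
fix(college)                        # view the data in a spreadsheet-style window
```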
Now you should see that the first data column is `Private`. Note that another column labeled `row.names` now appears before the `Private` column. However, this is not a data column but rather the name that R is giving to each row.
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
Use the `summary()` function to produce a numerical summary of the variables in the data set.

## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the `pairs()` function to produce a scatterplot matrix of the first ten columns or variables of the data. Then use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Private`.
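The scatterplot matrix was presumably produced along these lines (a sketch):

```r
# Scatterplot matrix of the first ten variables
pairs(college[, 1:10])
```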
plot(college$Private, college$Outstate, xlab="",
ylab="Out-of-State Tuition, $",
main="Out-of-State Tuition for Students Attending Private Universities")
Create a new qualitative variable, called `Elite`, by binning the `Top10perc` variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Use the `summary()` function to see how many elite universities there are. Next, use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Elite`.

Elite=rep("No",nrow(college))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab="",
ylab="Out-of-State Tuition, $",
main="Out-of-State Tuition for Universities Classified as Elite")
Use the `hist()` function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command `par(mfrow=c(2,2))` useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

attach(college)
par(mfrow=c(2,2))
hist(Top10perc, breaks = 40)
hist(Room.Board, breaks = 30)
hist(PhD, breaks = 20)
hist(Grad.Rate, breaks = 10)
Continue exploring the data, and provide a brief summary of what you discover.
Below are additional scatterplots exploring some of the relationships between the variables. There is a positive correlation between the share of students in the top 25% of their class and the graduation rate. The number of students who enroll has a strong positive linear relationship with full-time undergraduate enrollment, but a much weaker one with part-time enrollment. Graduation rates also trend more positively with full-time enrollment than with part-time enrollment.
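A sketch of the plots described above (the variable pairs are my reading of the description):

```r
par(mfrow = c(1, 3))
plot(college$Top25perc, college$Grad.Rate,
     xlab = "Top 25% of High School Class", ylab = "Graduation Rate")
plot(college$Enroll, college$F.Undergrad,
     xlab = "Students Enrolled", ylab = "Full-Time Undergraduates")
plot(college$Enroll, college$P.Undergrad,
     xlab = "Students Enrolled", ylab = "Part-Time Undergraduates")
```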
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Which of the predictors are quantitative and which are qualitative?
The quantitative predictors are `mpg`, `displacement`, `horsepower`, `weight`, and `acceleration`. `name` is a qualitative variable. `cylinders`, `origin`, and `year` take on a small set of discrete values, so I would treat them as qualitative or categorical.
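The summary below was presumably produced along these lines (assuming `Auto.csv` codes missing values as `?`):

```r
# Read the data and drop rows with missing values
auto <- na.omit(read.csv("Auto.csv", na.strings = "?"))
summary(auto)
```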
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
# Range (dplyr and tidyr provide %>%, across(), and pivot_longer())
library(dplyr)
library(tidyr)
auto %>%
  summarise(across(
    .cols = c(1, 3:6),
    .fns = list(Min = min, Max = max))) %>%
  pivot_longer(cols = everything(),
               names_sep = "_",
               names_to = c("variable", ".value"))
## # A tibble: 5 x 3
## variable Min Max
## <chr> <dbl> <dbl>
## 1 mpg 9 46.6
## 2 displacement 68 455
## 3 horsepower 46 230
## 4 weight 1613 5140
## 5 acceleration 8 24.8
# Mean and standard deviation
auto %>%
summarise(across(c(1,3:6),
list(Mean = mean, SD = sd))) %>%
pivot_longer(everything(),
names_sep = "_",
names_to = c("variable", ".value"))
## # A tibble: 5 x 3
## variable Mean SD
## <chr> <dbl> <dbl>
## 1 mpg 23.4 7.81
## 2 displacement 194. 105.
## 3 horsepower 104. 38.5
## 4 weight 2978. 849.
## 5 acceleration 15.5 2.76
auto %>%
  slice(-(10:85)) %>%  # drop the 10th through 85th observations
  summarise(across(c(1, 3:6),
                   list(Min = min, Max = max, Mean = mean, SD = sd))) %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("variable", ".value"))
## # A tibble: 5 x 5
## variable Min Max Mean SD
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mpg 11 46.6 24.4 7.87
## 2 displacement 68 455 187. 99.7
## 3 horsepower 46 230 101. 35.7
## 4 weight 1649 4997 2936. 811.
## 5 acceleration 8.5 24.8 15.7 2.69
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
The values in `origin` represent three different regions: America, Europe, and Japan. Based on the boxplots of `mpg`, cars from Japan tend to get better gas mileage, followed by European cars.
library(ggplot2)
auto$origin <- factor(auto$origin,
                      labels = c("American", "European", "Japanese"))
ggplot(auto, aes(mpg, origin, color = origin)) +
  geom_boxplot() +
  labs(title = "Fuel Efficiency by Vehicle Origin",
       x = "Miles per Gallon", y = "") +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "none")
auto %>%
filter(cylinders %in% c(4,6,8)) %>%
ggplot(aes(weight, horsepower)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Weight") +
ylab("Horsepower") +
facet_wrap(~cylinders, scales = "free")
These graphs show the relationship between a vehicle’s weight and its horsepower for each number of cylinders. In cars with four or eight cylinders, horsepower increases with weight. However, this relationship isn’t as strong in six-cylinder vehicles.
ggplot(auto, aes(x=year+1900, y=mpg)) +
geom_jitter() +
labs(x = "",
y = "Miles per Gallon",
title = "Gas Mileage Over Time")
This scatterplot shows that gas mileage has improved over time.
ggplot(auto, aes(weight, mpg, col=factor(cylinders))) +
geom_point() +
labs(x = "Weight",
y = "Miles per Gallon",
col = "Cylinders",
title = "Fuel Efficiency Improves in Lighter Vehicles")
This graph shows that a negative relationship exists between gas mileage and a vehicle’s weight. Four-cylinder vehicles tend to be lighter and get better gas mileage while eight-cylinder vehicles tend to be heavier and have lower gas mileage.
ggplot(auto, aes(acceleration, mpg, col=factor(cylinders))) +
geom_point() +
labs(x = "Acceleration",
y = "Miles per Gallon",
col = "Cylinders",
title = "Gas Mileage Increases with Acceleration")
This scatterplot shows that as acceleration increases, gas mileage also tends to increase, especially in four-cylinder vehicles but not as much in eight-cylinder vehicles.
Suppose that we wish to predict gas mileage on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.
Yes. Other than the name of the vehicle, the rest of the variables do appear to be correlated with `mpg`, some positively and some negatively, and would be useful in predicting gas mileage. `weight`, `displacement`, `horsepower`, and `cylinders` have a negative relationship with `mpg`, while `acceleration`, `year`, and `origin` have a positive relationship.
This exercise involves the `Boston` housing data set, which is part of the `MASS` library in R.
How many rows are in this data set? How many columns? What do the rows and columns represent?
There are 506 rows and 14 columns. Each row represents a suburb of Boston and each column represents a feature or attribute recorded for that suburb.
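Checking the dimensions (a sketch):

```r
library(MASS)  # the Boston data ships with MASS
dim(Boston)
```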
## [1] 506 14
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
Based on the pairwise scatterplots below, median house prices tend to increase as the number of rooms increases, and housing prices decrease in suburbs with a higher percentage of lower-status population. There is a negative relationship between age and distance: older homes tend to be located closer to Boston employment centers, and crime rates also tend to be higher in those areas.
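A sketch of the kind of scatterplot matrix described (the variable subset is my choice):

```r
# Pairwise scatterplots among a handful of the predictors
pairs(Boston[, c("crim", "rm", "age", "dis", "lstat", "medv")])
```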
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Let’s look at a correlation matrix of the predictors.
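A minimal way to do so, sorted by each predictor’s correlation with `crim`:

```r
# Correlations of every variable with per capita crime rate, largest first
round(sort(cor(Boston)[, "crim"], decreasing = TRUE), 2)
```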
Based on the correlation matrix and scatterplots above, several variables are associated with per capita crime rate. The variables with the strongest positive relationship to crime rate are `rad` and `tax`, which means that crime rates tend to be greater where property taxes are higher and where there is better accessibility to highways. Additionally, suburbs with lower median home values, closer proximity to Boston employment centers, and a higher percentage of lower-status population tend to have higher crime rates.
Now let’s explore these relationships using a linear regression model to predict Boston crime rates.
# Full regression model
library(olsrr)  # provides the stepwise selection helper below
fit <- lm(crim ~ ., data = Boston)
# Stepwise selection
fit.step <- ols_step_both_p(fit, pent = 0.05, prem = 0.05, details = FALSE)
fit.step
##
## Stepwise Selection Summary
## --------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------------
## 1 rad addition 0.391 0.390 46.5480 3367.5725 6.7178
## 2 lstat addition 0.421 0.418 21.9300 3344.4026 6.5592
## 3 black addition 0.429 0.425 16.8870 3339.5281 6.5213
## --------------------------------------------------------------------------------------
##
## Call:
## lm(formula = crim ~ rad + lstat + black, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.023 -1.713 -0.281 0.873 77.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.372585 1.641557 -0.227 0.82054
## rad 0.488172 0.040422 12.077 < 2e-16 ***
## lstat 0.213596 0.047447 4.502 8.39e-06 ***
## black -0.009472 0.003615 -2.620 0.00905 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared: 0.4286, Adjusted R-squared: 0.4252
## F-statistic: 125.5 on 3 and 502 DF, p-value: < 2.2e-16
Stepwise selection helps us find a good subset of predictors. Using a p-value cutoff of 0.05, the three most significant predictors of crime rate are `rad`, `lstat`, and `black`. This model shows associations similar to those we saw in the correlation matrix, and it allows us to quantify the relationships. On average, the crime rate is expected to increase by 0.49 when the index of accessibility to highways increases by one unit. For every 1% increase in the lower-status share of the population, the per capita crime rate increases by 0.21. When `black` (a transformation of the proportion of Black residents) increases by one unit, the crime rate decreases by 0.01.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
There is a large range in per capita crime rates, with the middle half of suburbs falling between 0.08 and 3.68 (the first and third quartiles). However, several fall outside this range, as seen in the histogram below, with some reaching a crime rate as high as 89. There are also several suburbs with markedly higher property tax rates, up to the maximum of $711 per $10,000 of assessed value. The range of pupil-teacher ratios is smaller, with a greater number of suburbs approaching the maximum of 22 students per teacher.
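The figures quoted above come from the column summaries (a sketch):

```r
# Five-number summaries for crime rate, property tax rate, and pupil-teacher ratio
summary(Boston[, c("crim", "tax", "ptratio")])
```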
## crim tax ptratio
## Min. : 0.00632 Min. :187.0 Min. :12.60
## 1st Qu.: 0.08205 1st Qu.:279.0 1st Qu.:17.40
## Median : 0.25651 Median :330.0 Median :19.05
## Mean : 3.61352 Mean :408.2 Mean :18.46
## 3rd Qu.: 3.67708 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :88.97620 Max. :711.0 Max. :22.00
par(mfrow=c(1,3))
# Crime rates (suburbs above the median only, to show the long right tail)
Boston %>%
  filter(crim > median(crim)) %>%
  with(hist(crim, breaks = 25))
# Tax rates
Boston %>%
with(hist(tax, breaks=25))
# Pupil-teacher ratio
Boston %>%
with(hist(ptratio, breaks=25))
How many of the suburbs in this data set bound the Charles river?
There are 35 suburbs that bound the Charles River.
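A count of the `chas` dummy variable reproduces this (assuming dplyr is loaded):

```r
# chas = 1 for suburbs that bound the Charles River
Boston %>% count(chas)
```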
## chas n
## 1 0 471
## 2 1 35
What is the median pupil-teacher ratio among the towns in this data set?
The median ratio is 19.05 pupils per teacher.
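A one-liner gives this directly:

```r
median(Boston$ptratio)
```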
## [1] 19.05
Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
There are two suburbs with the lowest median home value. When compared to the overall ranges of the other predictors, they tend to fall into the higher end of the range, particularly in crime rate, age, property taxes, and percentage of lower-status population.
# Suburbs with lowest median home value
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
kable(Boston %>%
        filter(medv == min(medv)), format = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                position = "left")
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38.35 | 0 | 18.1 | 0 | 0.69 | 5.45 | 100 | 1.49 | 24 | 666 | 20.2 | 396.90 | 30.59 | 5 |
67.92 | 0 | 18.1 | 0 | 0.69 | 5.68 | 100 | 1.43 | 24 | 666 | 20.2 | 384.97 | 22.98 | 5 |
# Ranges of each variable among all suburbs
kable(Boston %>%
summarise(across(
.cols = where(is.numeric),
list(Range=range))), format = "html", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F) %>%
scroll_box(width = "800px")
crim_Range | zn_Range | indus_Range | chas_Range | nox_Range | rm_Range | age_Range | dis_Range | rad_Range | tax_Range | ptratio_Range | black_Range | lstat_Range | medv_Range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0 | 0.46 | 0 | 0.38 | 3.56 | 2.9 | 1.13 | 1 | 187 | 12.6 | 0.32 | 1.73 | 5 |
88.98 | 100 | 27.74 | 1 | 0.87 | 8.78 | 100.0 | 12.13 | 24 | 711 | 22.0 | 396.90 | 37.97 | 50 |
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
There are 64 suburbs averaging more than seven rooms per dwelling and 13 averaging more than eight. On average, suburbs with more than eight rooms per dwelling have lower crime rates, lower property taxes, and higher median home values than suburbs with fewer rooms per home.
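The output below was presumably produced along these lines:

```r
# Counts of suburbs averaging more than seven and more than eight rooms
Boston %>% filter(rm > 7) %>% count()
Boston %>% filter(rm > 8) %>% count()
# Mean of every variable for suburbs averaging more than eight rooms...
colMeans(Boston[Boston$rm > 8, ])
# ...and for all other suburbs, for comparison
colMeans(Boston[Boston$rm <= 8, ])
```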
## n
## 1 64
## n
## 1 13
## crim zn indus chas nox rm
## 0.7187954 13.6153846 7.0784615 0.1538462 0.5392385 8.3485385
## age dis rad tax ptratio black
## 71.5384615 3.4301923 7.4615385 325.0769231 16.3615385 385.2107692
## lstat medv
## 4.3100000 44.2000000
## crim zn indus chas nox rm
## 3.68985513 11.30425963 11.24379310 0.06693712 0.55510264 6.23021095
## age dis rad tax ptratio black
## 68.49675456 3.80466349 9.60446247 410.43002028 18.51075051 355.92154158
## lstat medv
## 12.87306288 21.96146045