Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a regression problem because our response variable, salary, is continuous. We are most interested in making an inference about the relationships between the predictors and a CEO’s salary. Here \(n = 500\), the top 500 firms in the US, and \(p = 3\): profit, number of employees, and industry.
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a classification problem because we have a categorical response variable. We are interested in making a prediction about whether a product will be a success. Here \(n = 20\), the similar products that were previously launched, and \(p = 13\): the price charged for the product, marketing budget, competition price, and the ten other variables.
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence, we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression problem because we are interested in predicting a continuous outcome, the percent change in the USD/Euro exchange rate. Here \(n = 52\), one observation for each week of 2012, and \(p = 3\): the % change in the US market, the % change in the British market, and the % change in the German market.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Flexible models are able to fit the data more closely and can make better predictions than less flexible methods. However, this usually requires estimating a greater number of parameters, making the model more complex and less interpretable, and can lead to overfitting the data. Due to their close fit, more flexible methods tend to have lower bias (how far our average estimate is from the true value) and higher variance (how much our estimate of \(f\) would change given new data). When we are most interested in making predictions, or when the data are highly nonlinear, more flexible models are preferred.
Alternatively, when inference is the goal, less flexible models are preferred. More restrictive approaches, like linear regression, make it easier to understand the relationship between the response variable and the predictors. Given their relative inflexibility, a small change in the data will not cause a significant change in our estimate of \(f\), resulting in a smaller variance. However, these models do not follow the exact shape of the data as closely as more flexible methods, leading to a higher bias.
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
The key difference between the approaches is that parametric methods make an assumption about the distributional form, or the shape, of \(f\) while non-parametric methods do not. Since parametric methods assume a form of the data, they reduce the problem of estimating \(f\) down to estimating a small number of parameters. There is the potential disadvantage that the model chosen will not closely enough match the true unknown form of \(f\), in which case the model will not fit the data well. Non-parametric methods avoid this problem since essentially no assumptions about the distributional form of the data are made. Thus, they are able to fit a wider range of possible shapes than parametric methods. However, these approaches require a very large number of observations to obtain an accurate estimate of \(f\).
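A small simulated illustration (the data and model choices here are mine, not from the text): a linear model is parametric, reducing the problem to estimating two coefficients, while loess makes no assumption about the form of \(f\).

```r
# Simulated data from a nonlinear f, fit two ways
set.seed(1)
x <- runif(200, 0, 10)
y <- log(1 + x) + rnorm(200, sd = 0.2)

para    <- lm(y ~ x)     # parametric: assumes a linear form, estimates 2 parameters
nonpara <- loess(y ~ x)  # non-parametric: lets the local data determine the shape

plot(x, y, col = "grey")
abline(para, col = "red", lwd = 2)  # the linear fit misses the curvature (bias)
ord <- order(x)
lines(x[ord], fitted(nonpara)[ord], col = "blue", lwd = 2)  # loess tracks it
```

The linear fit is easy to interpret but biased here because the assumed form is wrong; the loess fit follows the curve, but non-parametric methods like it need many more observations to estimate \(f\) accurately.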
This exercise relates to the College data set, which can be found in the file `College.csv`. It contains a number of variables for 777 different universities and colleges in the US.

Use the `read.csv()` function to read the data into R. Call the loaded data `college`. Look at the data using the `fix()` function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
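A minimal version of those commands, assuming `College.csv` sits in the working directory:

```r
college <- read.csv("College.csv")  # read the data into R
rownames(college) <- college[, 1]   # keep the university names as row names
college <- college[, -1]            # drop the name column from the data itself
fix(college)                        # view the data in a spreadsheet-style window
```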
Now you should see that the first data column is `Private`. Note that another column labeled `row.names` now appears before the `Private` column. However, this is not a data column but rather the name that R is giving to each row.
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
Use the `summary()` function to produce a numerical summary of the variables in the data set.

## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the `pairs()` function to produce a scatterplot matrix of the first ten columns or variables of the data. Then use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Private`.
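The scatterplot matrix was presumably produced along these lines (a sketch):

```r
# Scatterplot matrix of the first ten variables
pairs(college[, 1:10])
```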
plot(college$Private, college$Outstate, xlab="",
ylab="Out-of-State Tuition, $",
main="Out-of-State Tuition for Students Attending Private Universities")
Create a new qualitative variable, called `Elite`, by binning the `Top10perc` variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Use the `summary()` function to see how many elite universities there are. Next, use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Elite`.

Elite=rep("No",nrow(college))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab="",
ylab="Out-of-State Tuition, $",
main="Out-of-State Tuition for Universities Classified as Elite")
Use the `hist()` function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command `par(mfrow=c(2,2))` useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

attach(college)
par(mfrow=c(2,2))
hist(Top10perc, breaks = 40)
hist(Room.Board, breaks = 30)
hist(PhD, breaks = 20)
hist(Grad.Rate, breaks = 10)
Continue exploring the data, and provide a brief summary of what you discover.
Below are additional scatterplots exploring some of the relationships between the variables. There is a positive correlation between the share of students in the top 25% of their class and the graduation rate. The number of students who enroll has a strong positive linear relationship with full-time undergraduate enrollment, but a much weaker one with part-time enrollment. Graduation rates also trend more positively with full-time enrollment than with part-time enrollment.
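A sketch of the plots described above (the variable pairs are my reading of the description):

```r
par(mfrow = c(1, 3))
plot(college$Top25perc, college$Grad.Rate,
     xlab = "Top 25% of High School Class", ylab = "Graduation Rate")
plot(college$Enroll, college$F.Undergrad,
     xlab = "Students Enrolled", ylab = "Full-Time Undergraduates")
plot(college$Enroll, college$P.Undergrad,
     xlab = "Students Enrolled", ylab = "Part-Time Undergraduates")
```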
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Which of the predictors are quantitative and which are qualitative?
The quantitative predictors are `mpg`, `displacement`, `horsepower`, `weight`, and `acceleration`. `name` is a qualitative variable. `cylinders`, `origin`, and `year` take on a small set of discrete values, so I would treat them as qualitative or categorical.
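The summary below was presumably produced along these lines (assuming `Auto.csv` codes missing values as `?`):

```r
# Read the data and drop rows with missing values
auto <- na.omit(read.csv("Auto.csv", na.strings = "?"))
summary(auto)
```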
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
# Range (dplyr and tidyr provide %>%, across(), and pivot_longer())
library(dplyr)
library(tidyr)
auto %>%
  summarise(across(
    .cols = c(1, 3:6),
    .fns = list(Min = min, Max = max))) %>%
  pivot_longer(cols = everything(),
               names_sep = "_",
               names_to = c("variable", ".value"))
## # A tibble: 5 x 3
## variable Min Max
## <chr> <dbl> <dbl>
## 1 mpg 9 46.6
## 2 displacement 68 455
## 3 horsepower 46 230
## 4 weight 1613 5140
## 5 acceleration 8 24.8
# Mean and standard deviation
auto %>%
summarise(across(c(1,3:6),
list(Mean = mean, SD = sd))) %>%
pivot_longer(everything(),
names_sep = "_",
names_to = c("variable", ".value"))
## # A tibble: 5 x 3
## variable Mean SD
## <chr> <dbl> <dbl>
## 1 mpg 23.4 7.81
## 2 displacement 194. 105.
## 3 horsepower 104. 38.5
## 4 weight 2978. 849.
## 5 acceleration 15.5 2.76
auto %>%
  slice(-(10:85)) %>%  # drop the 10th through 85th observations
  summarise(across(c(1, 3:6),
                   list(Min = min, Max = max, Mean = mean, SD = sd))) %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("variable", ".value"))
## # A tibble: 5 x 5
## variable Min Max Mean SD
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mpg 11 46.6 24.4 7.87
## 2 displacement 68 455 187. 99.7
## 3 horsepower 46 230 101. 35.7
## 4 weight 1649 4997 2936. 811.
## 5 acceleration 8.5 24.8 15.7 2.69
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
The values in `origin` represent three different regions: America, Europe, and Japan. Based on the boxplots of `mpg`, cars from Japan tend to get better gas mileage, followed by European cars.
library(ggplot2)
auto$origin <- factor(auto$origin,
                      labels = c("American", "European", "Japanese"))
ggplot(auto, aes(mpg, origin, color = origin)) +
  geom_boxplot() +
  labs(title = "Fuel Efficiency by Vehicle Origin",
       x = "Miles per Gallon", y = "") +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "none")
auto %>%
filter(cylinders %in% c(4,6,8)) %>%
ggplot(aes(weight, horsepower)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Weight") +
ylab("Horsepower") +
facet_wrap(~cylinders, scales = "free")
These graphs show the relationship between a vehicle’s weight and its horsepower for each number of cylinders. In cars with four or eight cylinders, horsepower increases with weight. However, this relationship isn’t as strong in six-cylinder vehicles.
ggplot(auto, aes(x=year+1900, y=mpg)) +
geom_jitter() +
labs(x = "",
y = "Miles per Gallon",
title = "Gas Mileage Over Time")
This scatterplot shows that gas mileage has improved over time.
ggplot(auto, aes(weight, mpg, col=factor(cylinders))) +
geom_point() +
labs(x = "Weight",
y = "Miles per Gallon",
col = "Cylinders",
title = "Fuel Efficiency Improves in Lighter Vehicles")
This graph shows that a negative relationship exists between gas mileage and a vehicle’s weight. Four-cylinder vehicles tend to be lighter and get better gas mileage while eight-cylinder vehicles tend to be heavier and have lower gas mileage.
ggplot(auto, aes(acceleration, mpg, col=factor(cylinders))) +
geom_point() +
labs(x = "Acceleration",
y = "Miles per Gallon",
col = "Cylinders",
title = "Gas Mileage Increases with Acceleration")
This scatterplot shows that as acceleration increases, gas mileage also tends to increase, especially in four-cylinder vehicles but not as much in eight-cylinder vehicles.
Suppose that we wish to predict gas mileage on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.
Yes. Other than the name of the vehicle, the rest of the variables do appear to be correlated with `mpg`, some positively and some negatively, and would be useful in predicting gas mileage. `weight`, `displacement`, `horsepower`, and `cylinders` have a negative relationship with `mpg`, while `acceleration`, `year`, and `origin` have a positive relationship.
This exercise involves the `Boston` housing data set, which is part of the `MASS` library in R.
How many rows are in this data set? How many columns? What do the rows and columns represent?
There are 506 rows and 14 columns. Each row represents a suburb of Boston and each column represents a feature or attribute recorded for that suburb.
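Checking the dimensions (a sketch):

```r
library(MASS)  # the Boston data ships with MASS
dim(Boston)
```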
## [1] 506 14
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
Based on the pairwise scatterplots below, median house prices tend to increase as the number of rooms increases, and housing prices decrease in suburbs with a higher percentage of lower-status population. There is a negative relationship between age and distance: older homes tend to be located closer to Boston employment centers, and crime rates also tend to be higher in those areas.
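A sketch of the kind of scatterplot matrix described (the variable subset is my choice):

```r
# Pairwise scatterplots among a handful of the predictors
pairs(Boston[, c("crim", "rm", "age", "dis", "lstat", "medv")])
```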
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Let’s look at a correlation matrix of the predictors.
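A minimal way to do so, sorted by each predictor’s correlation with `crim`:

```r
# Correlations of every variable with per capita crime rate, largest first
round(sort(cor(Boston)[, "crim"], decreasing = TRUE), 2)
```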
Based on the correlation matrix and scatterplots above, several variables are associated with per capita crime rate. The variables with the strongest positive relationship to crime rate are `rad` and `tax`, which means that crime rates tend to be greater where property taxes are higher and where there is better accessibility to highways. Additionally, suburbs with lower median home values, closer proximity to Boston employment centers, and a higher percentage of lower-status population tend to have higher crime rates.
Now let’s explore these relationships using a linear regression model to predict Boston crime rates.
# Full regression model
library(olsrr)  # provides the stepwise selection helper below
fit <- lm(crim ~ ., data = Boston)
# Stepwise selection
fit.step <- ols_step_both_p(fit, pent = 0.05, prem = 0.05, details = FALSE)
fit.step
##
## Stepwise Selection Summary
## --------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------------
## 1 rad addition 0.391 0.390 46.5480 3367.5725 6.7178
## 2 lstat addition 0.421 0.418 21.9300 3344.4026 6.5592
## 3 black addition 0.429 0.425 16.8870 3339.5281 6.5213
## --------------------------------------------------------------------------------------
##
## Call:
## lm(formula = crim ~ rad + lstat + black, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.023 -1.713 -0.281 0.873 77.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.372585 1.641557 -0.227 0.82054
## rad 0.488172 0.040422 12.077 < 2e-16 ***
## lstat 0.213596 0.047447 4.502 8.39e-06 ***
## black -0.009472 0.003615 -2.620 0.00905 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared: 0.4286, Adjusted R-squared: 0.4252
## F-statistic: 125.5 on 3 and 502 DF, p-value: < 2.2e-16
Stepwise selection helps us find a good subset of predictors. Using a p-value cutoff of 0.05, the three most significant predictors of crime rate are `rad`, `lstat`, and `black`. This model shows associations similar to those we saw in the correlation matrix, and it allows us to quantify the relationships. On average, the crime rate is expected to increase by 0.49 when the index of accessibility to highways increases by one unit. For every 1% increase in the lower-status share of the population, the per capita crime rate increases by 0.21. When `black` (a transformation of the proportion of Black residents) increases by one unit, the crime rate decreases by 0.01.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
There is a large range in per capita crime rates, with the middle half of suburbs falling between 0.08 and 3.68 (the first and third quartiles). However, several fall outside this range, as seen in the histogram below, with some reaching a crime rate as high as 89. There are also several suburbs with markedly higher property tax rates, up to the maximum of $711 per $10,000 of assessed value. The range of pupil-teacher ratios is smaller, with a greater number of suburbs approaching the maximum of 22 students per teacher.
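The figures quoted above come from the column summaries (a sketch):

```r
# Five-number summaries for crime rate, property tax rate, and pupil-teacher ratio
summary(Boston[, c("crim", "tax", "ptratio")])
```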
## crim tax ptratio
## Min. : 0.00632 Min. :187.0 Min. :12.60
## 1st Qu.: 0.08205 1st Qu.:279.0 1st Qu.:17.40
## Median : 0.25651 Median :330.0 Median :19.05
## Mean : 3.61352 Mean :408.2 Mean :18.46
## 3rd Qu.: 3.67708 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :88.97620 Max. :711.0 Max. :22.00
par(mfrow=c(1,3))
# Crime rates (suburbs above the median only, to show the long right tail)
Boston %>%
  filter(crim > median(crim)) %>%
  with(hist(crim, breaks = 25))
# Tax rates
Boston %>%
with(hist(tax, breaks=25))
# Pupil-teacher ratio
Boston %>%
with(hist(ptratio, breaks=25))
How many of the suburbs in this data set bound the Charles river?
There are 35 suburbs that bound the Charles River.
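A count of the `chas` dummy variable reproduces this (assuming dplyr is loaded):

```r
# chas = 1 for suburbs that bound the Charles River
Boston %>% count(chas)
```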
## chas n
## 1 0 471
## 2 1 35
What is the median pupil-teacher ratio among the towns in this data set?
The median ratio is 19.05 pupils per teacher.
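A one-liner gives this directly:

```r
median(Boston$ptratio)
```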
## [1] 19.05
Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
There are two suburbs with the lowest median home value. When compared to the overall ranges of the other predictors, they tend to fall into the higher end of the range, particularly in crime rate, age, property taxes, and percentage of lower-status population.
# Suburbs with lowest median home value
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
kable(Boston %>%
        filter(medv == min(medv)), format = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                position = "left")
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38.35 | 0 | 18.1 | 0 | 0.69 | 5.45 | 100 | 1.49 | 24 | 666 | 20.2 | 396.90 | 30.59 | 5 |
67.92 | 0 | 18.1 | 0 | 0.69 | 5.68 | 100 | 1.43 | 24 | 666 | 20.2 | 384.97 | 22.98 | 5 |
# Ranges of each variable among all suburbs
kable(Boston %>%
summarise(across(
.cols = where(is.numeric),
list(Range=range))), format = "html", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F) %>%
scroll_box(width = "800px")
crim_Range | zn_Range | indus_Range | chas_Range | nox_Range | rm_Range | age_Range | dis_Range | rad_Range | tax_Range | ptratio_Range | black_Range | lstat_Range | medv_Range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0 | 0.46 | 0 | 0.38 | 3.56 | 2.9 | 1.13 | 1 | 187 | 12.6 | 0.32 | 1.73 | 5 |
88.98 | 100 | 27.74 | 1 | 0.87 | 8.78 | 100.0 | 12.13 | 24 | 711 | 22.0 | 396.90 | 37.97 | 50 |
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
There are 64 suburbs averaging more than seven rooms per dwelling and 13 averaging more than eight. On average, suburbs with more than eight rooms per dwelling have lower crime rates, lower property taxes, and higher median home values than suburbs with fewer rooms per home.
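The output below was presumably produced along these lines:

```r
# Counts of suburbs averaging more than seven and more than eight rooms
Boston %>% filter(rm > 7) %>% count()
Boston %>% filter(rm > 8) %>% count()
# Mean of every variable for suburbs averaging more than eight rooms...
colMeans(Boston[Boston$rm > 8, ])
# ...and for all other suburbs, for comparison
colMeans(Boston[Boston$rm <= 8, ])
```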
## n
## 1 64
## n
## 1 13
## crim zn indus chas nox rm
## 0.7187954 13.6153846 7.0784615 0.1538462 0.5392385 8.3485385
## age dis rad tax ptratio black
## 71.5384615 3.4301923 7.4615385 325.0769231 16.3615385 385.2107692
## lstat medv
## 4.3100000 44.2000000
## crim zn indus chas nox rm
## 3.68985513 11.30425963 11.24379310 0.06693712 0.55510264 6.23021095
## age dis rad tax ptratio black
## 68.49675456 3.80466349 9.60446247 410.43002028 18.51075051 355.92154158
## lstat medv
## 12.87306288 21.96146045