Load some necessary packages:
library(ggplot2)
library(dplyr)
Question 1
What is the best description of interaction?
Answer: B
A. Allowing an explanatory variable to affect a
response (i.e., including an explanatory variable in a model).
B. Allowing the effect of an explanatory variable
on the response variable to depend on another explanatory variable.
C. Allowing the effect of an explanatory variable
on another explanatory variable to depend on the response variable.
D. Forcing the effect of an explanatory variable on
a response to be 0.
E. Forcing the effect of an explanatory variable on
a response to be independent of other explanatory variables.
Question 2
College.csv contains the following data on 195
colleges:
GradRate: Proportion of students earning a degree in 4
years (ranges from 0 to 1)
AdmisRate: Admission rate; that is, proportion of
applicants admitted to the college (ranges from 0 to 1)
Type: School type (Private or Public)
Read in the data:
College = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/College.csv')
Note: Two private schools have a listed graduation rate of 0%. We
can discuss potential reasons for why this is, and what consequences
this may have on our results.
Use this dataset to answer the following questions.
(a) Make the scatterplot of GradRate by AdmisRate, distinguishing
the data points according to whether they are a public/private college
(e.g., using colors or shapes), and add the parallel lines model to the
scatterplot.
lm(formula=GradRate~AdmisRate+Type,data=College)
##
## Call:
## lm(formula = GradRate ~ AdmisRate + Type, data = College)
##
## Coefficients:
## (Intercept) AdmisRate TypePublic
## 0.8714 -0.3504 -0.2820
ggplot(data=College,aes(x=AdmisRate,y=GradRate))+geom_point(aes(color=Type))+geom_abline(intercept=0.8714,slope=-0.3504,color="red")+geom_abline(intercept=0.5894,slope=-0.3504,color="blue")

(b) Fit a model for GradRate that uses AdmisRate and Type as
explanatory variables, but allows the effect of these explanatory
variables to depend on one another (i.e., fit an interaction model).
Interpret each of the coefficients returned by R.
lm(formula=GradRate~AdmisRate+Type+AdmisRate:Type,data=College)
##
## Call:
## lm(formula = GradRate ~ AdmisRate + Type + AdmisRate:Type, data = College)
##
## Coefficients:
## (Intercept) AdmisRate TypePublic
## 0.8522 -0.3053 -0.2167
## AdmisRate:TypePublic
## -0.1156
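As a rough sketch of how these four coefficients can be read, the snippet below turns them into one line per school type. Private is the reference category, since only TypePublic appears in the output. The intercept (0.8522) is the predicted graduation rate for a private college with an admission rate of 0, and AdmisRate (-0.3053) is the slope for private colleges. TypePublic (-0.2167) is the public-minus-private difference in predicted graduation rate at an admission rate of 0, and AdmisRate:TypePublic (-0.1156) is how much more negative the slope is for public colleges. The variable names are illustrative, and the values are copied from the printed output rather than taken from the fitted object.

```r
# Coefficients copied from the printed lm() output for the interaction model.
b0       <- 0.8522   # (Intercept): intercept of the private-college line
b_admis  <- -0.3053  # AdmisRate: slope of the private-college line
b_public <- -0.2167  # TypePublic: shift in intercept for public colleges
b_inter  <- -0.1156  # AdmisRate:TypePublic: shift in slope for public colleges

# One row per school type: the line implied by the interaction model.
college_lines <- data.frame(
  Type      = c("Private", "Public"),
  intercept = c(b0, b0 + b_public),
  slope     = c(b_admis, b_admis + b_inter)
)
print(college_lines)
```

The Public row reproduces the intercept and slope passed to geom_abline() in (c).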
(c) Re-make the scatterplot from (a) and add the lines corresponding
to the interaction model to this scatterplot.
ggplot(data=College,aes(x=AdmisRate,y=GradRate))+geom_point(aes(color=Type))+geom_abline(intercept=0.8522,slope=-0.3053,color="red")+geom_abline(intercept=0.6355,slope=-0.4209,color="blue")

(d) Do you think the interaction model is worthwhile? Briefly
explain your reasoning.
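Yes. The interaction model lets the private and public lines have different slopes (-0.3053 for private versus -0.4209 for public), so graduation rate is allowed to fall faster with admission rate for public colleges, which the parallel lines model cannot show.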
Question 3
Think about the relationship between the following variables:
speed of a bicyclist
steepness of the road: a quantitative variable measured by
the grade (rise over run) where 0 means flat, positive means uphill, and
negative means downhill
fitness of the rider: a categorical variable with three
levels: unfit, average, athletic
On paper, sketch out a graph of speed versus steepness for
reasonable models of each of the following forms:
(a) speed ~ steepness
(b) speed ~ fitness
(c) speed ~ steepness + fitness
(d) speed ~ steepness + fitness + steepness:fitness
Question 4
Read in the data:
WeightLifting = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/WeightLifting.csv')
Use this dataset to answer the following questions.
(a) Make a scatterplot of TotalKg versus BodyweightKg,
distinguishing the data points by Equipment (using, say, color or
shape).
ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))

Optional note: In geom_point(), we can control the transparency of
the data points using the “alpha” parameter, making it easier to graph
large amounts of data.
(b) Fit a model for TotalKg that uses BodyweightKg and Equipment
(but no interaction). Add the model to the scatterplot, and use the
resulting graph to comment on the size/magnitude of the coefficients
labeled: BodyweightKg, EquipmentSingle-ply, and
EquipmentWraps.
lm(formula=TotalKg~BodyweightKg+Equipment,data=WeightLifting)
##
## Call:
## lm(formula = TotalKg ~ BodyweightKg + Equipment, data = WeightLifting)
##
## Coefficients:
## (Intercept) BodyweightKg EquipmentSingle-ply
## 73.940 3.278 -17.831
## EquipmentWraps
## 146.174
ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))+geom_abline(intercept=73.940,slope=3.278,color="red")+geom_abline(intercept=56.109,slope=3.278,color="green")+geom_abline(intercept=220.114,slope=3.278,color="blue")

(c) Now fit a model for TotalKg that uses BodyweightKg and
Equipment, including interaction. Re-create the scatterplot and add this
model to it.
lm(formula=TotalKg~BodyweightKg+Equipment+BodyweightKg:Equipment,data=WeightLifting)
##
## Call:
## lm(formula = TotalKg ~ BodyweightKg + Equipment + BodyweightKg:Equipment,
## data = WeightLifting)
##
## Coefficients:
## (Intercept) BodyweightKg
## 123.1997 2.6991
## EquipmentSingle-ply EquipmentWraps
## 0.8450 -66.1298
## BodyweightKg:EquipmentSingle-ply BodyweightKg:EquipmentWraps
## -0.1609 2.4332
ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))+
geom_abline(intercept=123.1997,slope=2.6991,color="red")+
geom_abline(intercept=124.0447,slope=2.5382,color="green")+
geom_abline(intercept=57.0699,slope=5.1323,color="blue")

(d) Use the resulting graph from (c) to comment on the magnitude of
the coefficients labeled: BodyweightKg:EquipmentSingle-ply and
BodyweightKg:EquipmentWraps.
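The magnitude of BodyweightKg:EquipmentSingle-ply (0.1609) is much smaller than
the magnitude of BodyweightKg:EquipmentWraps (2.4332). On the graph, the
Single-ply line is nearly parallel to the reference line (slope 2.5382 versus
2.6991), while the Wraps line is much steeper (slope 5.1323).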
(e) The coefficient labels EquipmentSingle-ply and
EquipmentWraps have different meanings in the models from (b)
and (c). In which model are their interpretations contextually
meaningful, and in which are they not? Explain why this is the
case.
Both give a change in intercept from the reference category, but their
interpretations are contextually meaningful only in (b). In (b), the lines are
parallel, so each coefficient is the difference in predicted TotalKg from the
reference category at any bodyweight. In (c), the slopes differ, so each
coefficient is that difference only at BodyweightKg = 0, and no lifter has a
bodyweight of 0 kg.
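To make the contrast concrete, here is a small sketch of the predicted gap between Single-ply and the reference equipment category under each model. The function names and the bodyweights are illustrative choices, not values from the data; the coefficients are copied from the printed output in (b) and (c).

```r
# Coefficients copied from the printed output in (b) and (c).
single_ply_b       <- -17.831  # EquipmentSingle-ply, model (b)
single_ply_c       <- 0.8450   # EquipmentSingle-ply, model (c)
single_ply_slope_c <- -0.1609  # BodyweightKg:EquipmentSingle-ply, model (c)

# Predicted TotalKg gap (Single-ply minus reference) at a given bodyweight.
gap_b <- function(bodyweight_kg) rep(single_ply_b, length(bodyweight_kg))
gap_c <- function(bodyweight_kg) single_ply_c + single_ply_slope_c * bodyweight_kg

bodyweights <- c(0, 60, 90, 120)  # illustrative values, not taken from the data
gaps <- data.frame(bodyweights,
                   gap_b = gap_b(bodyweights),
                   gap_c = gap_c(bodyweights))
print(gaps)
```

Under (b) the gap is the same at every bodyweight, while under (c) the coefficient 0.8450 matches the gap only in the row where the bodyweight is 0.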
Question 5
The drawings below show some data involving three variables:
D: a quantitative variable
A: a quantitative variable
G: a categorical variable with two levels: S and
K
On paper, sketch a graph of the fitted model function for each of
the following:
(a) D ~ 1
(b) D ~ G
(c) D ~ A
Question 6
This question refers to data (SCI.csv) on the survival
times of 2498 individuals after sustaining a spinal cord injury. The
variables in SCI.csv are:
Time: Survival time after spinal cord injury (in
months)
Age: Age of the individual at the time of the injury (in
years)
Sex: The patient’s biological sex recorded in binary
fashion, Male or Female
Cause: Cause of the spinal cord injury, categorized as:
Fall, MotorVehicle, or Other
ISS: Injury severity score (on a scale from 1-75, with
larger values indicating greater severity)
ISSCat: Injury severity categorized as Mild (ISS <= 8),
Moderate (9 <= ISS <= 15), or Severe (ISS >= 16)
Urban: Did the injury occur in an urban location (Yes or
No)
YOI: Year of the injury (2004-2013)
Read in the data:
SCI = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/SCI.csv')
(a) Fit a model for survival time using Cause and Age, and which
includes interaction.
lm(formula=Time~Age+Cause+Age:Cause,data=SCI)
##
## Call:
## lm(formula = Time ~ Age + Cause + Age:Cause, data = SCI)
##
## Coefficients:
## (Intercept) Age CauseMotorVehicle
## 79.20468 -0.56294 0.78136
## CauseOther Age:CauseMotorVehicle Age:CauseOther
## -8.36409 0.05206 0.27734
ggplot(data=SCI,aes(x=Age,y=Time))+geom_point(aes(color=Cause))+
geom_abline(intercept=79.20468,slope=-0.56294,color="red")+
geom_abline(intercept=79.20468+0.78136,slope=-0.56294+0.05206,color="green")+
geom_abline(intercept=79.20468-8.36409,slope=-0.56294+0.27734,color="blue")
## Warning: Removed 2 rows containing missing values (`geom_point()`).

(b) Provide careful interpretations of the coefficients labeled
Age and CauseOther:Age.
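Age (-0.56294) is the slope for the reference cause, Fall: among people injured
in a fall, each additional year of age at injury is associated with a predicted
survival time that is about 0.56 months shorter. CauseOther:Age (0.27734) is the
change in slope from Fall to Other: for injuries from other causes, the slope is
-0.56294 + 0.27734 = -0.2856 months per year of age.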
(c) According to the model from (a), what is the predicted survival
time for a 22-year old individual who sustained a spinal cord injury in
a motor vehicle accident?
68.74668 months, or roughly 5 years and 9 months
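The arithmetic behind this prediction can be sketched as a small helper that picks the intercept and slope for a given cause. The helper name predict_time is hypothetical, and the coefficients are the rounded values printed in (a), so results may differ slightly from predict() on the fitted model. Fall is treated as the reference category because it is the only cause without its own coefficients.

```r
# Coefficients copied from the printed lm() output in (a).
coefs <- c(
  "(Intercept)"           = 79.20468,
  "Age"                   = -0.56294,
  "CauseMotorVehicle"     = 0.78136,
  "CauseOther"            = -8.36409,
  "Age:CauseMotorVehicle" = 0.05206,
  "Age:CauseOther"        = 0.27734
)

# Hypothetical helper: predicted survival time (months) for an age and cause.
predict_time <- function(age, cause = c("Fall", "MotorVehicle", "Other")) {
  cause <- match.arg(cause)
  intercept <- coefs[["(Intercept)"]]
  slope <- coefs[["Age"]]
  if (cause != "Fall") {
    # Non-reference causes shift both the intercept and the slope.
    intercept <- intercept + coefs[[paste0("Cause", cause)]]
    slope <- slope + coefs[[paste0("Age:Cause", cause)]]
  }
  intercept + slope * age
}

print(predict_time(22, "MotorVehicle"))
```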
(d) Suppose two individuals (A and B) both sustained a spinal cord
injury from a motor vehicle accident, but that A was 10 years younger
than B when their injury occurred. Using the model from (a), answer the
following:
Can we calculate A’s predicted survival time? If ‘yes’, calculate it, and if
‘no’, explain why not.
No. We are only given the difference in their ages, not either person's actual
age. A could be 20 and B 30, or A could be 65 and B 75; both are plausible given
the provided information, and the prediction depends on the actual age.
Can we calculate the difference between A’s and B’s predicted survival time?
If ‘yes’, calculate it, and if ‘no’, explain why not.
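Yes. The slope for motor vehicle injuries is -0.56294 + 0.05206 = -0.51088 months per year of age, so A's predicted survival time is 5.1088 months longer than B's (equivalently, B's minus A's is -5.1088).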
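The difference depends only on the motor vehicle slope, because the intercept cancels when one prediction is subtracted from the other. A minimal sketch, with illustrative variable names and the rounded coefficients from (a):

```r
# Slope for motor vehicle injuries: reference slope plus the interaction term.
slope_motor_vehicle <- -0.56294 + 0.05206

# A is 10 years younger than B, so A's age minus B's age is -10.
age_gap_a_minus_b <- -10

# Difference in predicted survival time (A minus B), in months.
diff_a_minus_b <- slope_motor_vehicle * age_gap_a_minus_b
print(diff_a_minus_b)
```

A positive result means the younger person, A, has the longer predicted survival time.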
Question 7
Bike sharing is becoming a more popular means of transportation in
many cities. BikeShare.csv contains information from Capital
Bikeshare, a bike-sharing service in the Washington, DC area.
Specifically, every row in the file corresponds to information from a
day. In total, there are 731 data points, for each day over a 2-year
stretch (2011-2012). Our primary research goal is to understand what
factors are related to total number of riders on a given day so that you
can help Capital Bikeshare plan its services. The variables and their
meanings are listed below:
date: Date in format YYYY-MM-DD
season: Season (winter, spring, summer, or fall)
year: 2011 or 2012
month: 3-letter month abbreviation
day_of_week: 3-letter abbreviation for day of week
weekend: TRUE if the case is a weekend
holiday: Is the day a holiday? (yes or no)
temp_actual: Actual temperature in degrees Fahrenheit
temp_feel: What the temperature “feels like” in degrees
Fahrenheit
humidity: Fraction from 0 to 1 giving the humidity
level
windspeed: Wind speed in miles per hour
weather_cat: Three possible weather categories:
categ1 = Clear, Few clouds, Partly cloudy
categ2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
categ3 = Light Snow, Light Rain/Thunderstorm + Scattered clouds
riders_casual: Count of daily rides by casual users
(non-registered users)
riders_registered: Count of daily rides by registered
users.
riders_total: Count of total daily rides
(riders_casual + riders_registered)
Read in the data:
BikeShare = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/BikeShare.csv')
(a) Make a model for the daily total number of riders that uses
season as the only explanatory variable. Report, and
interpret, the coefficients from this model.
lm(formula=riders_total ~ season,data=BikeShare)
##
## Call:
## lm(formula = riders_total ~ season, data = BikeShare)
##
## Coefficients:
## (Intercept) seasonspring seasonsummer seasonwinter
## 4728.2 264.2 916.1 -2124.0
The intercept (4728.2) is the average daily total number of riders in the
fall, the reference season. The seasonspring coefficient (264.2) is the
difference in average daily riders between spring and fall, and the same goes
for seasonsummer (916.1 more than fall) and seasonwinter (2124.0 fewer than
fall).
(b) Now make a model for riders_total that uses
both season and temp_actual as
explanatory variables. Briefly interpret each of the coefficients from
this model.
lm(formula=riders_total~temp_actual+season,data=BikeShare)
##
## Call:
## lm(formula = riders_total ~ temp_actual + season, data = BikeShare)
##
## Coefficients:
## (Intercept) temp_actual seasonspring seasonsummer seasonwinter
## -617.61 84.57 -494.15 -852.68 -1342.87
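The intercept (-617.61) is where the fall line hits the y-axis, i.e., the
predicted number of riders on a 0-degree fall day. The temp_actual coefficient
(84.57) is the common slope of the lines: each additional degree is associated
with about 85 more riders in every season. The coefficient for each season is
the difference between that season's intercept and the fall intercept, i.e., the
difference in predicted riders from fall at the same temperature.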
(c) Make a scatterplot of riders_total by
temp_actual which distinguishes the data points
according to season.
ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))

(d) Add the model from (b), for all seasons, to the scatterplot from
(c).
ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))+
geom_abline(intercept=-617.61,slope=84.57,color="red")+
geom_abline(intercept=-617.61-494.15,slope=84.57,color="green")+
geom_abline(intercept=-617.61-852.68,slope=84.57,color="aquamarine")+
geom_abline(intercept=-617.61-1342.87,slope=84.57,color="purple")

(e) Use the model coefficients from (b) to find the prediction (or
fitted value) for the total number of riders on an 80-degree summer
day.
5295.31
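This fitted value can be reproduced by adding the summer shift to the fall intercept and then applying the common slope. The sketch below uses the rounded coefficients printed in (b); the helper name predict_riders is hypothetical.

```r
# Coefficients copied from the printed lm() output in (b).
intercept_fall <- -617.61
slope_temp     <- 84.57
season_shift   <- c(fall = 0, spring = -494.15, summer = -852.68, winter = -1342.87)

# Hypothetical helper: parallel-lines prediction for a temperature and season.
predict_riders <- function(temp_actual, season) {
  intercept_fall + season_shift[[season]] + slope_temp * temp_actual
}

print(predict_riders(80, "summer"))
```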
(f) The model in (b) assumes that the effect of
temp_actual is identical regardless of
season. Explain in your own words why this might not be
a good assumption to make.
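Because the same temperature can mean different things in different seasons.
More people are likely to be doing things outdoors on a summer day than on a
winter day even if the temperature is the same, in part because winter days
tend to feel colder (for example, from wind chill). A one-degree increase may
therefore change ridership by different amounts depending on the season.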
(g) Fit a model for riders_total that allows the
effect of temp_actual to depend on
season. Interpret the coefficients from this
model.
lm(formula=riders_total~temp_actual+season+temp_actual:season,data=BikeShare)
##
## Call:
## lm(formula = riders_total ~ temp_actual + season + temp_actual:season,
## data = BikeShare)
##
## Coefficients:
## (Intercept) temp_actual seasonspring
## -640.263 84.929 -816.646
## seasonsummer seasonwinter temp_actual:seasonspring
## 7055.404 -3424.828 4.424
## temp_actual:seasonsummer temp_actual:seasonwinter
## -94.092 38.635
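One way to read these eight coefficients is to convert them into a separate line for each season. Fall is the reference season, so the intercept (-640.263) and temp_actual (84.929) are the fall intercept and slope. Each season coefficient is that season's shift in intercept from fall (the difference in predicted riders at 0 degrees), and each temp_actual:season coefficient is that season's shift in slope from fall. The sketch below, with an illustrative data frame name and the rounded printed values, carries out the additions.

```r
# Coefficients copied from the printed lm() output in (g); fall is the reference.
intercept_fall <- -640.263
slope_fall     <- 84.929

season_lines <- data.frame(
  season    = c("fall", "spring", "summer", "winter"),
  # Fall intercept plus each season's intercept shift.
  intercept = intercept_fall + c(0, -816.646, 7055.404, -3424.828),
  # Fall slope plus each season's interaction (slope shift).
  slope     = slope_fall + c(0, 4.424, -94.092, 38.635)
)
print(season_lines)
```

The summer row stands out: its slope is slightly negative, while the other three seasons have positive slopes.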
(h) Remake the scatterplot from (c), and add to it the lines
corresponding to the model from (g).
ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))+
geom_abline(intercept=-640.263,slope=84.929,color="red")+
geom_abline(intercept=-640.263-816.646,slope=84.929+4.424,color="green")+
geom_abline(intercept=-640.263+7055.404,slope=84.929-94.092,color="aquamarine")+
geom_abline(intercept=-640.263-3424.828,slope=84.929+38.635,color="purple")

(i) Does it seem like an interaction model is worthwhile in this
context? Briefly explain.
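Yes, because it shows that temperature is positively related to total ridership
in 3 seasons (fall, spring, and winter), but also that in summer the fitted
slope is slightly negative (84.929 - 94.092 = -9.163), which the parallel lines
model cannot capture.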
(j) Recalculate the prediction for the total number of riders on an
80-degree summer day, based on the model in (g).
5682.101, from (-640.263 + 7055.404) + (84.929 - 94.092) × 80
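As a check on the arithmetic, the sketch below rebuilds the summer line from the rounded coefficients printed in (g) and compares the result with the parallel-lines prediction from (e). The variable names are illustrative, and predict() on the fitted model could differ slightly because it uses unrounded coefficients.

```r
# Summer line from the interaction model in (g): fall values plus summer shifts.
intercept_summer <- -640.263 + 7055.404
slope_summer     <- 84.929 - 94.092

# Prediction for an 80-degree summer day under the interaction model.
pred_interaction <- intercept_summer + slope_summer * 80

# Same day under the parallel-lines model in (b), for comparison.
pred_parallel <- -617.61 - 852.68 + 84.57 * 80

print(c(interaction = pred_interaction,
        parallel    = pred_parallel,
        difference  = pred_interaction - pred_parallel))
```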
Question 8
The graph below shows (schematically) a possible relationship
between used car price, mileage, and the car model year:
Consider the model price ~ mileage*year, and treat year as a
categorical variable with 2005 as the reference group.
What will be the sign of the coefficient on mileage?
A. Negative
B. Zero
C. Positive
What will be the sign of the coefficient on year2007?
A. Negative
B. Zero
C. Positive
What will be the sign of the interaction coefficient?
A. Negative
B. Zero
C. Positive
D. There is no interaction coefficient.
Question 9
Consider this made-up model of a response variable (y) as a function
of two quantitative explanatory variables (x1q and x2q), and a
categorical explanatory variable (xcat) with levels A or B:
We obtain the following coefficients:
(a) Suppose there are 2 data points (#1 and #2) with identical
values of x2q and both in category B, but where #1 has a value of x1q
which is 1 larger than that of #2. How much larger will the predicted
value for #1 be in comparison to #2?
0.025
(b) Suppose there are 2 data points (#1 and #2) with identical
values of x1q and both in category B, but where #2 has a value of x2q
which is 1 larger than that of #1. How much larger will the predicted
value for #2 be in comparison to #1?
0.06667
(c) Put (a) and (b) together: #1 and #2 are both in category B, but
#1 has a value of x1q which is 1 larger than that of #2, AND, #2 has a
value of x2q which is 1 larger than that of #1. Which data point will
have a larger predicted value, and by how much?
#2, by about 0.04 (0.06667 - 0.025 = 0.04167)
Question 10
Suppose you found a summary table presenting mean prices of pine
trees. The trees are either the Red Pine or the White Pine, and they are
classified as Short or Tall:
| Height | Red | White | Both Colors |
| Short | $12 | $18 | $15 |
| Tall | $20 | $34 | $27 |
| Both Heights | $16 | $26 | $21 |
For example, the table tells us the average price of Tall and Red
pines was $20, and overall, the White Pines averaged $26. The average
price of all the trees in the data was $21. Based on this summary table,
answer the following questions:
(a) For the model price ~ color, state how the
coefficients are labeled and their values. You may assume that Red is
the reference category for the color variable.
Intercept = 16, colorWhite = 10
(b) For the model price ~ height, state how the
coefficients are labeled and their values. You may assume that Short is
the reference category for the height variable.
Intercept = 15, heightTall = 12
(c) The model price ~ height * color, has
4 coefficients. State how the coefficients are labeled and their
values.
Intercept = 12 (Short-Red mean), heightTall = 8 (20 - 12), colorWhite = 6 (18 - 12), heightTall:colorWhite = 8 (34 - 12 - 8 - 6)
(d) The model price ~ height + color, has
3 coefficients. It would be difficult to figure out the coefficients
from this model, but suppose you are told their values:
Intercept = 10
heightTall = 12
colorWhite = 10
According to this model, what are the fitted values for the
following types of trees: Short-Red, Short-White,
Tall-Red, and Tall-White?
Short-Red = 10, Short-White = 10 + 10 = 20, Tall-Red = 10 + 12 = 22, Tall-White = 10 + 12 + 10 = 32
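The coefficients in (a) through (d) can be checked with a small sketch that fits each model to the four cell means from the summary table. This assumes every height-by-color cell contains the same number of trees, which is consistent with the marginal means in the table but is not stated in the question. The data frame trees is made up from the table, not real data.

```r
# One row per cell of the summary table (assumes equal cell sizes).
trees <- data.frame(
  height = factor(c("Short", "Short", "Tall", "Tall"), levels = c("Short", "Tall")),
  color  = factor(c("Red", "White", "Red", "White"), levels = c("Red", "White")),
  price  = c(12, 18, 20, 34)
)

coef_color  <- coef(lm(price ~ color, data = trees))           # (a)
coef_height <- coef(lm(price ~ height, data = trees))          # (b)
coef_inter  <- coef(lm(price ~ height * color, data = trees))  # (c)

# (d): additive model and its fitted value for each type of tree.
additive <- lm(price ~ height + color, data = trees)
fitted_additive <- cbind(trees[c("height", "color")], fitted = fitted(additive))

print(coef_color)
print(coef_height)
print(coef_inter)
print(coef(additive))
print(fitted_additive)
```

The interaction model reproduces the four cell means exactly, while the additive model's fitted values differ from the cell means by 2 in each cell.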