STAT 155 Topic 4 In-class Activity

Load some necessary packages:

library(ggplot2)

library(dplyr)

Question 1

What is the best description of interaction?

A. Allowing an explanatory variable to affect a response (i.e., including an explanatory variable in a model).

B. Allowing the effect of an explanatory variable on the response variable to depend on another explanatory variable.

C. Allowing the effect of an explanatory variable on another explanatory variable to depend on the response variable.

D. Forcing the effect of an explanatory variable on a response to be 0.

E. Forcing the effect of an explanatory variable on a response to be independent of other explanatory variables.

Question 2

College.csv contains the following data on 195 colleges:

GradRate: Proportion of students earning a degree in 4 years (ranges from 0 to 1)

AdmisRate: Admission rate; that is, proportion of applicants admitted to the college (ranges from 0 to 1)

Type: School type (Private or Public)

Read in the data:

College = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/College.csv')

Note: Two private schools have a listed graduation rate of 0%. We can discuss potential reasons for why this is, and what consequences this may have on our results.

Use this dataset to answer the following questions.

(a) Make the scatterplot of GradRate by AdmisRate, distinguishing the data points according to whether they are a public/private college (e.g., using colors or shapes), and add the parallel lines model to the scatterplot.

lm(formula=GradRate~AdmisRate+Type,data=College)

## 
## Call:
## lm(formula = GradRate ~ AdmisRate + Type, data = College)
## 
## Coefficients:
## (Intercept)    AdmisRate   TypePublic  
##      0.8714      -0.3504      -0.2820

ggplot(data=College,aes(x=AdmisRate,y=GradRate))+geom_point(aes(color=Type))+
  geom_abline(intercept=0.8714,slope=-0.3504,color="red")+
  geom_abline(intercept=0.5894,slope=-0.3504,color="blue")

(b) Fit a model for GradRate that uses AdmisRate and Type as explanatory variables, but allows the effect of these explanatory variables to depend on one another (i.e., fit an interaction model). Interpret each of the coefficients returned by R.

lm(formula=GradRate~AdmisRate+Type+AdmisRate:Type,data=College)

## 
## Call:
## lm(formula = GradRate ~ AdmisRate + Type + AdmisRate:Type, data = College)
## 
## Coefficients:
##          (Intercept)             AdmisRate            TypePublic  
##               0.8522               -0.3053               -0.2167  
## AdmisRate:TypePublic  
##              -0.1156
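One way to read the four coefficients: with one quantitative and one two-level categorical explanatory variable, the interaction model amounts to fitting a separate line for each group. The sketch below checks this on simulated data. The data frame `sim` and its columns `x`, `g`, and `y` are hypothetical stand-ins, not the College data, and the numbers used to generate them are arbitrary.

```r
# Hypothetical simulated data; NOT the College dataset.
set.seed(155)
n <- 100
sim <- data.frame(
  x = runif(n),
  g = factor(rep(c("Private", "Public"), each = n / 2))
)
sim$y <- ifelse(sim$g == "Private",
                0.85 - 0.30 * sim$x,
                0.64 - 0.42 * sim$x) + rnorm(n, sd = 0.05)

# One interaction model for both groups ...
fit_int <- lm(y ~ x + g + x:g, data = sim)
b <- coef(fit_int)

# ... versus a separate simple regression within each group.
fit_private <- lm(y ~ x, data = subset(sim, g == "Private"))
fit_public  <- lm(y ~ x, data = subset(sim, g == "Public"))

# Reference group (Private): intercept and slope are (Intercept) and x.
private_line <- c(intercept = b[["(Intercept)"]], slope = b[["x"]])

# Other group (Public): add the gPublic and x:gPublic adjustments.
public_line <- c(intercept = b[["(Intercept)"]] + b[["gPublic"]],
                 slope     = b[["x"]] + b[["x:gPublic"]])

# The two printed tables should agree row by row.
print(rbind(private_line, public_line))
print(rbind(coef(fit_private), coef(fit_public)))
```

Read this way, the group coefficient is the difference in intercepts between the two lines and the interaction coefficient is the difference in slopes.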

(c) Re-make the scatterplot from (a) and add the lines corresponding to the interaction model to this scatterplot.

ggplot(data=College,aes(x=AdmisRate,y=GradRate))+geom_point(aes(color=Type))+
  geom_abline(intercept=0.8522,slope=-0.3053,color="red")+
  geom_abline(intercept=0.6355,slope=-0.4209,color="blue")

(d) Do you think the interaction model is worthwhile? Briefly explain your reasoning.

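yes, because it lets public and private schools have different slopes: -0.3053 for private versus -0.4209 for public, so the predicted graduation rate falls faster with admission rate at public schools.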

(e) Your answer to (d) is a way of making an informal inference (or conclusion) about one of the coefficients from the interaction model. Which coefficient does it tell us something about, and what do you think it tells us?

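it tells us about the AdmisRate:TypePublic coefficient (-0.1156), which is the difference in AdmisRate slopes between public and private schools. Saying the interaction model is worthwhile amounts to concluding that this coefficient is meaningfully different from 0, i.e., that the relationship between AdmisRate and GradRate differs by Type.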

Question 3

Think about the relationship between the following variables:

speed of a bicyclist

steepness of the road: a quantitative variable measured by the grade (rise over run) where 0 means flat, positive means uphill, and negative means downhill

fitness of the rider: a categorical variable with three levels: unfit, average, athletic

On paper, sketch out a graph of speed versus steepness for reasonable models of each of the following forms:

(a) speed ~ steepness

(b) speed ~ fitness

(c) speed ~ steepness + fitness

(d) speed ~ steepness + fitness + steepness:fitness

Question 4

WeightLifting.csv contains data from recent weightlifting competitions in the United States. Each case is an athlete, and we have information on the following variables for 2178 athletes:

TotalKg: total amount the athlete lifted in the competition in kilograms (kg)

BodyweightKg: athlete’s body weight in kilograms (kg)

SWR: strength-to-weight ratio (TotalKg/BodyweightKg)

Equipment: categorical with 3 levels:

raw (athlete did not use any helpful equipment)
single-ply (equipment that helps the athlete lift more weight)
wraps (equipment that helps the athlete lift more weight)

Age: age in years

Sex: biological sex, recorded as a binary; M (male) and F (female)

Read in the data:

WeightLifting = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/WeightLifting.csv')

Use this dataset to answer the following questions.

(a) Make a scatterplot of TotalKg versus BodyweightKg, distinguishing the data points by Equipment (using, say, color or shape).

ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))

Optional note: In geom_point(), we can control the transparency of the data points using the “alpha” parameter, making it easier to graph large amounts of data.

(b) Fit a model for TotalKg that uses BodyweightKg and Equipment (but no interaction). Add the model to the scatterplot, and use the resulting graph to comment on the size/magnitude of the coefficients labeled: BodyweightKg, EquipmentSingle-ply, and EquipmentWraps.

lm(formula=TotalKg~BodyweightKg+Equipment,data=WeightLifting)

## 
## Call:
## lm(formula = TotalKg ~ BodyweightKg + Equipment, data = WeightLifting)
## 
## Coefficients:
##         (Intercept)         BodyweightKg  EquipmentSingle-ply  
##              73.940                3.278              -17.831  
##      EquipmentWraps  
##             146.174

ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))+
  geom_abline(intercept=73.940,slope=3.278,color="red")+
  geom_abline(intercept=56.109,slope=3.278,color="green")+
  geom_abline(intercept=220.114,slope=3.278,color="blue")

(c) Now fit a model for TotalKg that uses BodyweightKg and Equipment, including interaction. Re-create the scatterplot and add this model to it.

lm(formula=TotalKg~BodyweightKg+Equipment+BodyweightKg:Equipment,data=WeightLifting)

## 
## Call:
## lm(formula = TotalKg ~ BodyweightKg + Equipment + BodyweightKg:Equipment, 
##     data = WeightLifting)
## 
## Coefficients:
##                      (Intercept)                      BodyweightKg  
##                         123.1997                            2.6991  
##              EquipmentSingle-ply                    EquipmentWraps  
##                           0.8450                          -66.1298  
## BodyweightKg:EquipmentSingle-ply       BodyweightKg:EquipmentWraps  
##                          -0.1609                            2.4332

ggplot(data=WeightLifting,aes(x=BodyweightKg,y=TotalKg))+geom_point(aes(color=Equipment))+
  geom_abline(intercept=123.1997,slope=2.6991,color="red")+
  geom_abline(intercept=124.0447,slope=2.5382,color="green")+
  geom_abline(intercept=57.0699,slope=5.1323,color="blue")
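The intercepts and slopes passed to geom_abline() come from adding each level's adjustments to the reference line. A minimal sketch of that arithmetic, using the coefficients printed in (c): `line_for()` is a hypothetical helper, and the row label `Raw` assumes raw is the reference level, since it is the only level without its own coefficient.

```r
# Coefficients copied from the printed lm() output in (c).
b <- c(
  "(Intercept)"                      = 123.1997,
  "BodyweightKg"                     = 2.6991,
  "EquipmentSingle-ply"              = 0.8450,
  "EquipmentWraps"                   = -66.1298,
  "BodyweightKg:EquipmentSingle-ply" = -0.1609,
  "BodyweightKg:EquipmentWraps"      = 2.4332
)

# Hypothetical helper: intercept and slope for one non-reference level.
line_for <- function(level) {
  c(intercept = b[["(Intercept)"]] + b[[paste0("Equipment", level)]],
    slope     = b[["BodyweightKg"]] + b[[paste0("BodyweightKg:Equipment", level)]])
}

# One row per Equipment level; the reference level uses the base coefficients.
eq_lines <- rbind(
  Raw          = c(intercept = b[["(Intercept)"]], slope = b[["BodyweightKg"]]),
  "Single-ply" = line_for("Single-ply"),
  Wraps        = line_for("Wraps")
)
print(eq_lines)
```

Computing the lines from a named coefficient vector rather than retyping sums by hand makes transcription slips easier to spot.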

(d) Use the resulting graph from (c) to comment on the magnitude of the coefficients labeled: BodyweightKg:EquipmentSingle-ply and BodyweightKg:EquipmentWraps.

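the magnitude of BodyweightKg:EquipmentSingle-ply (-0.1609) is much smaller than the magnitude of BodyweightKg:EquipmentWraps (2.4332): the single-ply line is nearly parallel to the reference line, while the wraps line is much steeper.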

(e) The coefficient labels EquipmentSingle-ply and EquipmentWraps have different meanings in the models from (b) and (c). In which model are their interpretations contextually meaningful, and in which are they not? Explain why this is the case.

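both are changes in intercept from the reference group, but they are only contextually meaningful in (b). In (b), the lines are parallel, so each coefficient is the difference in predicted TotalKg between that equipment type and the reference at any fixed body weight. In (c), the lines are not parallel, so each coefficient is the difference in predicted TotalKg only at BodyweightKg = 0, which is not a meaningful body weight; at realistic body weights the gap also depends on the interaction coefficients.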
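To see the contrast numerically, the sketch below computes the predicted gap between wraps and the reference level at several body weights under each model, using the printed coefficients. The body weights chosen are illustrative assumptions, `gap_b()` and `gap_c()` are hypothetical helpers, and the reference level is assumed to be raw.

```r
# Coefficients copied from the printed output in (b) and (c).
wraps_shift_b <- 146.174    # EquipmentWraps, model without interaction
wraps_shift_c <- -66.1298   # EquipmentWraps, model with interaction
wraps_slope_c <- 2.4332     # BodyweightKg:EquipmentWraps

# Predicted TotalKg gap (wraps minus reference) at a given body weight.
gap_b <- function(bodyweight) rep(wraps_shift_b, length(bodyweight))
gap_c <- function(bodyweight) wraps_shift_c + wraps_slope_c * bodyweight

# 0 kg is included only to show what EquipmentWraps measures in (c).
weights <- c(0, 60, 80, 100)
gaps <- data.frame(BodyweightKg = weights,
                   gap_b = gap_b(weights),
                   gap_c = gap_c(weights))
print(gaps)
```

In the (b) column the gap is the same at every body weight, so the coefficient summarizes it completely. In the (c) column the coefficient matches the gap only in the 0 kg row.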

Question 5

The drawings below show some data involving three variables:

D: a quantitative variable

A: a quantitative variable

G: a categorical variable with two levels: S and K

On paper, sketch a graph of the fitted model function for each of the following:

(a) D ~ 1

(b) D ~ G

(c) D ~ A

(d) D ~ A*G

Question 6

This question refers to data (SCI.csv) on the survival times of 2498 individuals after sustaining a spinal cord injury. The variables in SCI.csv are:

Time: Survival time after spinal cord injury (in months)

Age: Age of the individual at the time of the injury (in years)

Sex: The patient’s biological sex recorded in binary fashion, Male or Female

Cause: Cause of the spinal cord injury, categorized as: Fall, MotorVehicle, or Other

ISS: Injury severity score (on a scale from 1-75, with larger values indicating greater severity)

ISSCat: Injury severity categorized as Mild (ISS <= 8), Moderate (9 <= ISS <= 15), or Severe (ISS >= 16)

Urban: Did the injury occur in an urban location (Yes or No)

YOI: Year of the injury (2004-2013)

Read in the data:

SCI = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/SCI.csv')

(a) Fit a model for survival time using Cause and Age, and which includes interaction.

lm(formula=Time~Age+Cause+Age:Cause,data=SCI)

## 
## Call:
## lm(formula = Time ~ Age + Cause + Age:Cause, data = SCI)
## 
## Coefficients:
##           (Intercept)                    Age      CauseMotorVehicle  
##              79.20468               -0.56294                0.78136  
##            CauseOther  Age:CauseMotorVehicle         Age:CauseOther  
##              -8.36409                0.05206                0.27734

ggplot(data=SCI,aes(x=Age,y=Time))+geom_point(aes(color=Cause))+
  geom_abline(intercept=79.20468,slope=-0.56294,color="red")+
  geom_abline(intercept=79.20468+0.78136,slope=-0.56294+0.05206,color="green")+
  geom_abline(intercept=79.20468-8.36409,slope=-0.56294+0.27734,color="blue")

## Warning: Removed 2 rows containing missing values (`geom_point()`).

(b) Provide careful interpretations of the coefficients labeled Age and CauseOther:Age.

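Age (-0.56294) is the slope for the reference cause (Fall): among fall injuries, each additional year of age at injury is associated with a predicted survival time about 0.56 months shorter. CauseOther:Age (0.27734) is the difference between the Age slope for injuries with cause Other and the Age slope for falls, so the slope for Other is -0.56294 + 0.27734 = -0.2856 months per year of age.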

(c) According to the model from (a), what is the predicted survival time for a 22-year old individual who sustained a spinal cord injury in a motor vehicle accident?

68.74668 months (about 5 years and 9 months)
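The arithmetic behind this prediction can be sketched from the printed coefficients in (a). `predict_mv()` is a hypothetical helper written for illustration; with the fitted model object available, predict() would serve the same purpose.

```r
# Coefficients copied from the printed lm() output in (a).
b <- c("(Intercept)"           = 79.20468,
       "Age"                   = -0.56294,
       "CauseMotorVehicle"     = 0.78136,
       "CauseOther"            = -8.36409,
       "Age:CauseMotorVehicle" = 0.05206,
       "Age:CauseOther"        = 0.27734)

# Hypothetical helper: predicted Time (months) for a motor vehicle injury.
predict_mv <- function(age) {
  intercept <- b[["(Intercept)"]] + b[["CauseMotorVehicle"]]
  slope     <- b[["Age"]] + b[["Age:CauseMotorVehicle"]]
  intercept + slope * age
}

pred_22 <- predict_mv(22)
print(pred_22)
```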

(d) Suppose two individuals (A and B) both sustained a spinal cord injury from a motor vehicle accident, but that A was 10 years younger than B when their injury occurred. Using the model from (a), answer the following:

Can we calculate A’s predicted survival time? If ‘yes’, calculate it, and if ‘no’, explain why not.

no, because we are only given their relative ages, not an absolute age: A could be 20 and B 30, or A could be 65 and B 75, and both are plausible given the provided information.

Can we calculate the difference between A’s and B’s predicted survival time? If ‘yes’, calculate it, and if ‘no’, explain why not.

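yes: A’s predicted survival time is 5.1088 months longer than B’s. The motor vehicle slope is -0.56294 + 0.05206 = -0.51088 months per year of age, so being 10 years younger changes the prediction by (-10)(-0.51088) = 5.1088 months (equivalently, B minus A is -5.1088).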
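The difference can be computed without knowing either age because the motor vehicle line has a single slope. A short check, using the printed coefficients from (a) and two assumed starting ages for A (20 and 65, chosen only for illustration):

```r
# Motor vehicle line, from the printed coefficients in (a).
mv_intercept <- 79.20468 + 0.78136
mv_slope     <- -0.56294 + 0.05206

predict_mv <- function(age) mv_intercept + mv_slope * age

# Two assumed ages for A; B is 10 years older in each case.
age_A <- c(20, 65)
diff_AB <- predict_mv(age_A) - predict_mv(age_A + 10)

# Both entries equal -10 * mv_slope, whatever A's age is.
print(diff_AB)
```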

Question 7

date: Date in format YYYY-MM-DD

season: Season (winter, spring, summer, or fall)

year: 2011 or 2012

month: 3-letter month abbreviation

day_of_week: 3-letter abbreviation for day of week

weekend: TRUE if the case is a weekend

holiday: Is the day a holiday? (yes or no)

temp_actual: Actual temperature in degrees Fahrenheit

temp_feel: What the temperature “feels like” in degrees Fahrenheit

humidity: Fraction from 0 to 1 giving the humidity level

windspeed: Wind speed in miles per hour

weather_cat: Three possible weather categories:

categ1 = Clear, Few clouds, Partly cloudy
categ2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
categ3 = Light Snow, Light Rain/Thunderstorm + Scattered clouds

riders_casual: Count of daily rides by casual users (non-registered users)

riders_registered: Count of daily rides by registered users.

riders_total: Count of total daily rides (riders_casual + riders_registered)

Read in the data:

BikeShare = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/BikeShare.csv')

(a) Make a model for the daily total number of riders that uses season as the only explanatory variable. Report, and interpret, the coefficients from this model.

lm(formula=riders_total ~ season,data=BikeShare)

## 
## Call:
## lm(formula = riders_total ~ season, data = BikeShare)
## 
## Coefficients:
##  (Intercept)  seasonspring  seasonsummer  seasonwinter  
##       4728.2         264.2         916.1       -2124.0

the intercept (4728.2) is the average daily total number of riders in the fall, the reference season. The seasonspring coefficient (264.2) is the difference between the average daily total in spring and in fall, and likewise for seasonsummer (916.1 more than fall) and seasonwinter (2124.0 fewer than fall).

(b) Now make a model for riders_total that uses both season and temp_actual as explanatory variables. Briefly interpret each of the coefficients from this model.

lm(formula=riders_total~temp_actual+season,data=BikeShare)

## 
## Call:
## lm(formula = riders_total ~ temp_actual + season, data = BikeShare)
## 
## Coefficients:
##  (Intercept)   temp_actual  seasonspring  seasonsummer  seasonwinter  
##      -617.61         84.57       -494.15       -852.68      -1342.87

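the intercept (-617.61) is where the fall line hits the y axis, i.e., the predicted ridership on a 0-degree fall day, which is an extrapolation. The temp_actual coefficient (84.57) is the common slope of the lines: each additional degree is associated with about 85 more riders, holding season fixed. The coefficient for each season is the difference between that season's intercept and the fall intercept, i.e., the difference in predicted ridership from fall at the same temperature.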

(c) Make a scatterplot of riders_total by temp_actual which distinguishes the data points according to season.

ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))

(d) Add the model from (b), for all seasons, to the scatterplot from (c).

ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))+
  geom_abline(intercept=-617.61,slope=84.57,color="red")+
  geom_abline(intercept=-617.61-494.15,slope=84.57,color="green")+
  geom_abline(intercept=-617.61-852.68,slope=84.57,color="aquamarine")+
  geom_abline(intercept=-617.61-1342.87,slope=84.57,color="purple")

(e) Use the model coefficients from (b) to find the prediction (or fitted value) for the total number of riders on an 80-degree summer day.

5295.31
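This fitted value follows from plugging an 80-degree summer day into the model from (b). A small sketch of the arithmetic with the printed coefficients:

```r
# Coefficients copied from the printed lm() output in (b).
b <- c("(Intercept)"  = -617.61,
       "temp_actual"  = 84.57,
       "seasonspring" = -494.15,
       "seasonsummer" = -852.68,
       "seasonwinter" = -1342.87)

# Summer day: the seasonsummer indicator is 1, the other season indicators are 0.
pred_summer_80 <- b[["(Intercept)"]] + b[["temp_actual"]] * 80 + b[["seasonsummer"]]
print(pred_summer_80)
```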

(f) The model in (b) assumes that the effect of temp_actual is identical regardless of season. Explain in your own words why this might not be a good assumption to make.

because a one-degree increase in temperature probably does not change ridership by the same amount in every season: a warmer day in winter may bring many more riders out, while a warmer day in summer, when it is already hot, may add few riders or even discourage riding.

(g) Fit a model for riders_total that allows the effect of temp_actual to depend on season. Interpret the coefficients from this model.

lm(formula=riders_total~temp_actual+season+temp_actual:season,data=BikeShare)

## 
## Call:
## lm(formula = riders_total ~ temp_actual + season + temp_actual:season, 
##     data = BikeShare)
## 
## Coefficients:
##              (Intercept)               temp_actual              seasonspring  
##                 -640.263                    84.929                  -816.646  
##             seasonsummer              seasonwinter  temp_actual:seasonspring  
##                 7055.404                 -3424.828                     4.424  
## temp_actual:seasonsummer  temp_actual:seasonwinter  
##                  -94.092                    38.635

(h) Remake the scatterplot from (c), and add to it the lines corresponding to the model from (g).

ggplot(data=BikeShare,aes(x=temp_actual,y=riders_total))+geom_point(aes(color=season))+
  geom_abline(intercept=-640.263,slope=84.929,color="red")+
  geom_abline(intercept=-640.263-816.646,slope=84.929+4.424,color="green")+
  geom_abline(intercept=-640.263+7055.404,slope=84.929-94.092,color="aquamarine")+
  geom_abline(intercept=-640.263-3424.828,slope=84.929+38.635,color="purple")

(i) Does it seem like an interaction model is worthwhile in this context? Briefly explain.

yes, because it shows that temperature does affect total ridership for 3 seasons, but also that as you get

(j) Recalculate the prediction for the total number of riders on an 80-degree summer day, based on the model in (g).

5682.101, computed from the displayed coefficients as (-640.263 + 7055.404) + (84.929 - 94.092)(80)
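The per-season lines and the summer prediction can be rebuilt from the printed coefficients in (g). The sketch below assumes fall is the reference season (it has no coefficient of its own); `season_lines` is a name introduced here for illustration.

```r
# Coefficients copied from the printed lm() output in (g).
b <- c("(Intercept)"              = -640.263,
       "temp_actual"              = 84.929,
       "seasonspring"             = -816.646,
       "seasonsummer"             = 7055.404,
       "seasonwinter"             = -3424.828,
       "temp_actual:seasonspring" = 4.424,
       "temp_actual:seasonsummer" = -94.092,
       "temp_actual:seasonwinter" = 38.635)

seasons <- c("spring", "summer", "winter")

# Intercept and slope of each season's line; fall gets no adjustment.
season_lines <- data.frame(
  season    = c("fall", seasons),
  intercept = b[["(Intercept)"]] + c(0, unname(b[paste0("season", seasons)])),
  slope     = b[["temp_actual"]] +
    c(0, unname(b[paste0("temp_actual:season", seasons)])),
  row.names = NULL
)
print(season_lines)

# Prediction for an 80-degree summer day.
summer <- season_lines[season_lines$season == "summer", ]
pred_summer_80 <- summer$intercept + summer$slope * 80
print(pred_summer_80)
```

The slope column also makes the interaction visible: the summer slope is slightly negative, while the other three seasons have clearly positive slopes.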

Question 8

The graph below shows (schematically) a possible relationship between used car price, mileage, and the car model year:

Consider the model price ~ mileage*year, and treat year as a categorical variable with 2005 as the reference group.

What will be the sign of the coefficient on mileage?

A. Negative

B. Zero

C. Positive

D. We don’t have enough information to answer this question.

What will be the sign of the coefficient on year2007?

A. Negative

B. Zero

C. Positive

D. We don’t have enough information to answer this question.

What will be the sign of the interaction coefficient?

A. Negative

B. Zero

C. Positive

D. There is no interaction coefficient.

E. We don’t have enough information to answer this question.

Question 9

Consider this made-up model of a response variable (y) as a function of two quantitative explanatory variables (x1q and x2q), and a categorical explanatory variable (xcat) with levels A or B:

y ~ x1q*xcat + x2q*xcat

We obtain the following coefficients:

(Intercept)	x1q	x2q	xcatB	x1q:xcatB	x2q:xcatB
16	0.4	0.3	3	0.01	0.02

(a) Suppose there are 2 data points (#1 and #2) with identical values of x2q and both in category B, but where #1 has a value of x1q which is 1 larger than that of #2. How much larger will the predicted value for #1 be in comparison to #2?

0.41 (the x1q slope in category B is 0.4 + 0.01)

(b) Suppose there are 2 data points (#1 and #2) with identical values of x1q and both in category B, but where #2 has a value of x2q which is 1 larger than that of #1. How much larger will the predicted value for #2 be in comparison to #1?

0.32 (the x2q slope in category B is 0.3 + 0.02)

(c) Put (a) and (b) together: #1 and #2 are both in category B, but #1 has a value of x1q which is 1 larger than that of #2, AND, #2 has a value of x2q which is 1 larger than that of #1. Which data point will have a larger predicted value, and by how much?

#1, by 0.09 (0.41 - 0.32)
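These differences can be checked by writing the model as a prediction function. In the sketch below, `predict_y()` is a hypothetical helper built from the coefficient table, and the baseline values x1q = 5 and x2q = 2 are arbitrary assumptions; the differences do not depend on them.

```r
# Coefficients from the table above.
b <- c("(Intercept)" = 16, "x1q" = 0.4, "x2q" = 0.3, "xcatB" = 3,
       "x1q:xcatB" = 0.01, "x2q:xcatB" = 0.02)

# Hypothetical helper: prediction from y ~ x1q*xcat + x2q*xcat.
predict_y <- function(x1q, x2q, xcat) {
  is_b <- as.numeric(xcat == "B")   # indicator for category B
  b[["(Intercept)"]] + b[["x1q"]] * x1q + b[["x2q"]] * x2q +
    is_b * (b[["xcatB"]] + b[["x1q:xcatB"]] * x1q + b[["x2q:xcatB"]] * x2q)
}

# (a) #1 has x1q one unit larger than #2; same x2q, both in B.
diff_a <- predict_y(6, 2, "B") - predict_y(5, 2, "B")

# (b) #2 has x2q one unit larger than #1; same x1q, both in B.
diff_b <- predict_y(5, 3, "B") - predict_y(5, 2, "B")

# (c) #1 = (x1q 6, x2q 2) and #2 = (x1q 5, x2q 3): prediction for #1 minus #2.
diff_c <- predict_y(6, 2, "B") - predict_y(5, 3, "B")

print(c(a = diff_a, b = diff_b, c = diff_c))
```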

Question 10

Suppose you found a summary table presenting mean prices of pine trees. The trees are either the Red Pine or the White Pine, and they are classified as Short or Tall:

	Red	White	Both Colors
Short	$ 12	$ 18	$ 15
Tall	$ 20	$ 34	$ 27
Both Heights	$ 16	$ 26	$ 21

For example, the table tells us the average price of Tall and Red pines was $20, and overall, the White Pines averaged $26. The average price of all the trees in the data was $21. Based on this summary table, answer the following questions:

(a) For the model price ~ color, state how the coefficients are labeled and their values. You may assume that Red is the reference category for the color variable.

(Intercept) = 16, colorWhite = 10

(b) For the model price ~ height, state how the coefficients are labeled and their values. You may assume that Short is the reference category for the height variable.

(Intercept) = 15, heightTall = 12

(c) The model price ~ height * color, has 4 coefficients. State how the coefficients are labeled and their values.

(Intercept) = 12, heightTall = 8, colorWhite = 6, heightTall:colorWhite = 8

(d) The model price ~ height + color, has 3 coefficients. It would be difficult to figure out the coefficients from this model, but suppose you are told their values:

Intercept = 10

heightTall = 12

colorWhite = 10

According to this model, what are the fitted values for the following types of trees: Short-Red, Short-White, Tall-Red, and Tall-White?

Short-Red = 10, Short-White = 20, Tall-Red = 22, Tall-White = 32
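The coefficients in (c) and the fitted values in (d) can both be derived from the summary table. The sketch below relies on the fact that the interaction model's fitted values are the four cell means; the object names are introduced here for illustration.

```r
# Cell means from the summary table (rows: height, columns: color).
cell_means <- matrix(c(12, 18,
                       20, 34),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(height = c("Short", "Tall"),
                                     color  = c("Red", "White")))

# price ~ height * color, with Short and Red as the reference categories.
int_coefs <- c(
  "(Intercept)" = cell_means["Short", "Red"],
  "heightTall"  = cell_means["Tall", "Red"] - cell_means["Short", "Red"],
  "colorWhite"  = cell_means["Short", "White"] - cell_means["Short", "Red"],
  "heightTall:colorWhite" =
    cell_means["Tall", "White"] - cell_means["Tall", "Red"] -
    cell_means["Short", "White"] + cell_means["Short", "Red"]
)
print(int_coefs)

# price ~ height + color, using the coefficient values given in (d).
add_coefs <- c("(Intercept)" = 10, "heightTall" = 12, "colorWhite" = 10)
fitted_add <- add_coefs[["(Intercept)"]] +
  outer(c(Short = 0, Tall = 1) * add_coefs[["heightTall"]],
        c(Red = 0, White = 1) * add_coefs[["colorWhite"]],
        "+")
print(fitted_add)
```

Comparing `fitted_add` with `cell_means` shows that the additive model does not reproduce the four cell means exactly, which is what the interaction coefficient accounts for.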
