Load some necessary packages:
library(ggplot2)
library(dplyr)
Question 1
CSHA.csv has data on 509 patients from the Canadian Study
of Health and Ageing, a study which examined survival times from onset
of Alzheimer’s disease. For each patient, the following variables were
recorded:
Education: Number of years of education
AAO: Age at onset of Alzheimer’s (in years)
Sex: A patient’s biological sex recorded in binary fashion:
male (M) or female (F)
Survival: Time from onset of Alzheimer’s until death (in
days)
Using this data, we will be exploring a few models of
Survival using Sex, Education, and
AAO.
Read in the data:
CSHA = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/CSHA.csv')
(a) In this context, what is the response variable and what are the
explanatory variables?
The response variable is Survival. The explanatory variables are
Sex, Education, and AAO.
(c) State the “optimal” value of \(b\) returned by R, and explain what this
number represents in the context of the question.
2299. This is the mean survival time, in days, from onset of
Alzheimer’s until death for the patients in the data.
(d) Explain in your own words what we mean by “optimal” in part
(c).
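“Optimal” means that this value of \(b\) makes the sum of squared
residuals as small as possible. When the model is a single number, the
value that does this is the mean of the response variable.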
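One way to check this idea is to compare candidate values of \(b\) directly. The sketch below uses a small made-up vector `y` (hypothetical values, not the CSHA data) and a helper `sse()` that is introduced only for illustration. It suggests that a numerical search, an intercept-only `lm()` fit, and `mean()` all arrive at the same number.

```r
# Hypothetical survival times in days (made up for illustration, not CSHA data)
y <- c(1200, 1850, 2300, 2750, 3400)

# Sum of squared residuals when every observation is predicted by the constant b
sse <- function(b) sum((y - b)^2)

# Search numerically for the b with the smallest sum of squared residuals
best <- optimize(sse, interval = range(y))$minimum

# An intercept-only least squares fit should land on the same value
b_lm <- unname(coef(lm(y ~ 1)))

c(numerical = best, lm = b_lm, mean = mean(y))
```

The three numbers should agree up to the tolerance of the numerical search.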
(f) Make a scatterplot using Survival and
Education, and add the best fitting line to the scatterplot of
the data.
# Insert your R code below:
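# geom_smooth(method="lm") adds the least squares line for Survival ~ Education
ggplot(data=CSHA,aes(x=Education,y=Survival))+geom_point()+geom_smooth(method="lm",se=FALSE)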

(g) Do you think Education is a good explanatory variable
for modeling Survival? Explain your reasoning; what are you
looking for in order to answer this question?
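No. I am looking for a clear trend in the scatterplot, and Education
does not show one: the points are grouped vertically along certain
values of Education, with no visible upward or downward pattern in
Survival across those values.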
(h) Interpret the slope of the model from (e) in the context of the
problem.
The slope is the change in predicted (mean) survival time, in days,
associated with each additional year of education.
(i) Repeat parts (e), (f), (g), and (h), except using AAO
instead of Education.
lm(formula=Survival~AAO,data=CSHA)
##
## Call:
## lm(formula = Survival ~ AAO, data = CSHA)
##
## Coefficients:
## (Intercept) AAO
## 10259.1 -97.6
Slope:-97.6 Intercept:10259.1
ggplot(data=CSHA,aes(x=AAO,y=Survival))+geom_point()+geom_abline(intercept=10259.1,slope=-97.6)

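AAO appears to be a much better predictor of survival time than
Education: the line of best fit shows a clear negative trend in this
graph. The slope of -97.6 means that each additional year of age at
onset is associated with a predicted survival time that is 97.6 days
shorter.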
(j) Do you think AAO is a better or worse predictor of
Survival than Education? Briefly explain your
reasoning.
Better. AAO shows a clear negative association with Survival, while
Education shows no clear trend. The steepness of the slope is not by
itself evidence, because a slope depends on the units of the
explanatory variable.
(k) Suppose someone had onset of Alzheimer’s at age 70 and survived
for 2500 days. Calculate the fitted value (i.e. the model’s prediction)
and the residual for this individual from the Survival ~
AAO model.
fitted=-97.6*70+10259.1
residual=2500-3427.1
Fitted value = 3427.1 days; residual = -927.1 days (this person
survived 927.1 days less than the model predicts).
(l) What would happen to the slope in the Survival ~
AAO model if we had measured AAO in days instead of in
years? Answer this question conceptually, and if you’d like to verify
using R, there is some code below to help you create a new
AAOdays variable in the CSHA dataframe.
The slope would become 365 times smaller in magnitude (-97.6/365 is
about -0.2674), because one day is a much smaller step in age than one
year. The intercept would not change, and neither would the message or
conclusion reached from the original slope.
# Create a new AAOdays variable in CSHA:
CSHA = CSHA %>% mutate( AAOdays = AAO*365 )
# Then, re-fit the model using AAOdays instead of AAO:
lm(formula=Survival~AAOdays,data=CSHA)
##
## Call:
## lm(formula = Survival ~ AAOdays, data = CSHA)
##
## Coefficients:
## (Intercept) AAOdays
## 10259.1039 -0.2674
(m) Would measuring AAO in days change anything about the
quality of the AAO variable for modeling Survival? Why
is this an important realization when looking at model coefficients in
general?
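No. Changing the units of AAO only relabels the x-axis: the fitted
values, the residuals, and the strength of the relationship are exactly
the same. This matters because the size of a coefficient depends on the
units of its variable, so a numerically small slope (such as -0.2674)
does not by itself mean that a variable is a weak predictor.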
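A quick way to see how a change of units affects model quality is to fit the same model twice on simulated data. The sketch below is an illustration only: the data frame `sim` and its values are made up (they are not the CSHA data), with column names borrowed from this question for readability.

```r
set.seed(1)
# Simulated data (hypothetical, not the CSHA file): age at onset in years and
# a survival time in days with a negative linear trend plus noise
sim <- data.frame(AAO = runif(100, 60, 95))
sim$Survival <- 10000 - 95 * sim$AAO + rnorm(100, sd = 800)
sim$AAOdays <- sim$AAO * 365

fit_years <- lm(Survival ~ AAO, data = sim)
fit_days  <- lm(Survival ~ AAOdays, data = sim)

# Compare the coefficients of the two fits
coef(fit_years)
coef(fit_days)

# Compare the fitted values and R-squared of the two fits
all.equal(unname(fitted(fit_years)), unname(fitted(fit_days)))
c(years = summary(fit_years)$r.squared, days = summary(fit_days)$r.squared)
```

If the reasoning in (l) and (m) is right, only the slope should differ between the two fits, by a factor of 365.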
Question 2
In studies of employment discrimination, several attributes of
employees are often considered, for example, age, biological sex, race,
years of experience, salary, whether promoted, whether laid off,
etc.
For each of the following questions, state the response variable and
the explanatory variable.
1. Are males paid more than females?
Explanatory variable: Sex Response variable: Salary
4. Are older employees more likely to be laid off than younger
employees?
Explanatory variable: Age Response variable: Whether Laid-Off
Question 3
The data set trees.csv provides measurements of the girth,
height, and volume of timber in 31 felled black cherry trees. Note that
girth is the diameter of the tree (in inches) measured at 4 ft 6 inches
above the ground, height is measured in feet, and volume is measured in
cubic feet.
Read in the data:
trees = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/trees.csv')
You can click on the name of the dataframe from the Environment tab
to see the data and the variable names. Use this data to answer the
following questions:
(a) In a model with girth and volume, which do you think would be
the response variable, and which would be the explanatory variable?
Briefly explain your choices.
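Girth would be my explanatory variable and Volume would be my
response variable. Girth is easy to measure on a standing tree, while
the volume of timber is hard to measure without felling the tree, so it
makes sense to use girth to predict volume. The choice does matter: the
fitted line for Volume ~ Girth is not the same as the line for
Girth ~ Volume.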
(b) Based on your answer to (a), make a scatterplot of the response
(y-axis) versus the explanatory (x-axis). Would you say that your
explanatory is a good predictor for your response? Briefly explain.
ggplot(data=trees,aes(x=Girth,y=Volume))+geom_point()

ggplot(data=trees,aes(x=Volume,y=Girth))+geom_point()

Yes. Girth is a good predictor of Volume because the scatterplot
shows a clear positive trend between the two. Reversing the axes shows
the same type of trend, since the strength of the association does not
depend on which variable is on which axis.
(c) Make a simple linear regression model for your choice of
response using your choice of explanatory. Report the coefficients of
this model, and add a depiction of the model to your scatterplot from
(b).
lm(formula=Volume~Girth,data=trees)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Coefficients:
## (Intercept) Girth
## -36.943 5.066
ggplot(data=trees,aes(x=Girth,y=Volume))+geom_point()+geom_abline(intercept=-36.943,slope=5.066)
Intercept:-36.943 Slope:5.066
(d) Interpret the two model coefficients obtained in (c). Explain
why one of these coefficients is contextually meaningless.
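The slope, 5.066, means that each additional inch of girth is
associated with a predicted volume that is 5.066 cubic feet larger. The
intercept, -36.943, is the predicted volume (in cubic feet) of a tree
with a girth of 0 inches. It is contextually meaningless because a tree
cannot have zero girth or a negative volume, and a girth of 0 is far
outside the range of the data.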
(e) Using the model from (c), find the fitted value and the residual
for a hypothetical tree that was observed to have a volume of 33.8 cubic
feet and a girth of 12.9 inches.
fitted4=5.066*12.9-36.943
residual4=33.8-28.4084
Fitted value = 28.4084 cubic feet; residual = 5.3916 cubic feet
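The same calculation can be sketched with `predict()`, which uses the unrounded coefficients. This version assumes that R's built-in `datasets::trees` data frame contains the same 31 black cherry trees as trees.csv, with columns named Girth, Height, and Volume; that match is an assumption, not something checked against the CSV file.

```r
# Assumption: the built-in datasets::trees matches trees.csv
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Hypothetical tree from part (e): girth 12.9 inches, observed volume 33.8 cubic feet
new_tree <- data.frame(Girth = 12.9)
observed_volume <- 33.8

# Fitted value from the model, then residual = observed - fitted
fitted_value <- unname(predict(fit, newdata = new_tree))
residual_value <- observed_volume - fitted_value

round(c(fitted = fitted_value, residual = residual_value), 2)
```

Small differences from the hand calculation would come from rounding the coefficients to three decimal places.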
(f) dplyr is a very useful package which makes it easy to
add variables to a data frame, use only certain variables/observations,
or manipulate data in countless other valuable ways. For example, to
create a new variable that categorizes trees as either short, medium, or
tall, you can type:
trees = trees %>% mutate( HeightCategorical = cut( Height ,
breaks=c(60,70,80,90) ,
labels = c("Short","Medium","Tall") )
)
For each of the following parts, explain in words what the commands
are doing:
Part 1
trees = trees %>% mutate( GirthCategorical = cut( Girth,
breaks = c(8,12,22),
labels = c("Small","Big")
)
)
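This command adds a new variable, GirthCategorical, to trees. Trees
with a girth above 8 and up to 12 inches are labeled Small, and trees
with a girth above 12 and up to 22 inches are labeled Big.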
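To see exactly where `cut()` places the boundaries, it can help to try it on a few values that sit on and around the break points. The girths below are made up for illustration; the breaks and labels are the ones from Part 1.

```r
# Made-up girths (inches) chosen to sit on and around the break points
girth <- c(8, 8.3, 12, 12.1, 20.6)

# Same breaks and labels as in Part 1
category <- cut(girth, breaks = c(8, 12, 22), labels = c("Small", "Big"))

data.frame(girth, category)
```

By default the intervals are open on the left and closed on the right, so a girth of exactly 12 falls in Small, while a girth of exactly 8 falls in no interval and becomes NA.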
Part 2
trees %>% select( HeightCategorical , GirthCategorical ) %>% table( )
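This keeps only the two categorical variables and cross-tabulates
them, giving the number of trees in each combination of height category
and girth category.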
Part 3
trees %>% group_by(HeightCategorical) %>% summarize( Median = median(Volume),
Mean = mean(Volume),
SD = sd(Volume)
)
This computes the median, mean, and standard deviation of Volume
separately for each height category.
Part 4
shorts = trees %>% filter( HeightCategorical == "Short" )
lm( Volume ~ Girth , data=shorts )
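This creates a data frame, shorts, containing only the trees in the
Short height category, and then fits the model Volume ~ Girth using only
those trees.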
Question 4
Here is a paragraph from a New York Times article:
It has long been said that regular physical activity and better
sleep go hand in hand. But only recently have scientists sought to find
out precisely to what extent. One study published looked for answers by
having healthy children wear actigraphs (devices that measure movement)
and then seeing whether more movement and activity during the day meant
improved sleep at night. The study found that sleep onset latency (the
time it takes to fall asleep once in bed) ranged from as little as
roughly 10 minutes for some children to more than 40 minutes for others.
But physical activity during the day and sleep onset at night were
closely linked: every hour of sedentary activity during the day resulted
in an additional three minutes in the time it took to fall asleep at
night. And the children who fell asleep faster ultimately slept longer,
getting an extra hour of sleep for every 10-minute reduction in the time
it took them to drift off.
(a) There are two models described in this passage with two
different response variables. What are the two response variables?
Sleep onset latency (the time it takes to fall asleep, in minutes)
and the amount of sleep at night.
(b) For each of the two response variables that you stated in (a),
what is the explanatory variable being used to model it?
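For sleep onset latency, the explanatory variable is hours of
sedentary activity during the day. For the amount of sleep, the
explanatory variable is sleep onset latency.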
(c) Suppose that you are comparing two groups of children. Group A
has 3 hours of sedentary activity each day, Group B has 8 hours of
sedentary activity. Which of these statements is best supported by the
article? (Select only 1)
Answer: C. Group A has 5 fewer sedentary hours per day, and 5 hours ×
3 minutes per hour = 15 minutes.
A. The children in Group A will take, on average, 3 minutes
less time to fall asleep.
B. The children in Group B will have, on average, 10 minutes less
sleep each night.
C. The children in Group A will take, on average, 15 minutes less
time to fall asleep.
D. The children in Group B will have, on average, 45 minutes less
sleep each night.
(d) Again comparing the two groups of children from (c), which of
these statements is supported by the article? (Select only 1)
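Answer: A. A 15-minute reduction in the time to fall asleep, at one
extra hour of sleep per 10-minute reduction, is about 1.5 hours.
A. The children in Group A will get, on average, about an hour
and a half of extra sleep compared to the Group B children.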
B. The children in Group A will get, on average, about 15 minutes
more sleep than the Group B children.
C. The two groups will get about the same amount of sleep.
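The arithmetic behind parts (c) and (d) can be written out as a short calculation. The two rates come from the article; the variable names are made up for this sketch.

```r
# Rates quoted in the article
latency_min_per_sedentary_hour <- 3  # extra minutes to fall asleep per sedentary hour
sleep_min_per_10min_latency <- 60    # extra minutes of sleep per 10-minute drop in latency

# Sedentary hours per day for the two groups in part (c)
sedentary_a <- 3
sedentary_b <- 8

# Part (c): difference in time to fall asleep (Group B minus Group A), in minutes
latency_diff <- (sedentary_b - sedentary_a) * latency_min_per_sedentary_hour

# Part (d): extra sleep for Group A compared with Group B, in minutes
extra_sleep <- latency_diff / 10 * sleep_min_per_10min_latency

c(latency_diff_min = latency_diff, extra_sleep_min = extra_sleep)
```

This treats both relationships as exactly linear, which is how the article describes them.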
Question 5
This question refers to data (SCI.csv) on the survival
times of 2498 individuals after sustaining a spinal cord injury,
however, you will NOT need to run R commands to arrive at your
responses. The variables in SCI.csv are:
Time: Survival time after spinal cord injury (in
months)
Age: Age of the individual at the time of the injury (in
years)
Sex: The patient’s biological sex recorded in binary
fashion, Male or Female
Cause: Cause of the spinal cord injury, categorized as:
Fall, MotorVehicle, or Other
ISS: Injury severity score (on a scale from 1-75, with
larger values indicating greater severity)
ISSCat: Injury severity categorized as Mild (ISS <= 8),
Moderate (9 <= ISS <= 15), or Severe (ISS >= 16)
Urban: Did the injury occur in an urban location (Yes or
No)
YOI: Year of the injury (2004-2013)
Below is one R command and the output it produced:
lm( Time ~ ISS , data=SCI )
##
## Call:
## lm(formula = Time ~ ISS, data = SCI)
##
## Coefficients:
## (Intercept) ISS
## 61.6222 -0.2846
(a) What are the units associated with the (Intercept) coefficient,
61.6222?
months
(b) What are the units associated with the ISS coefficient,
-0.2846?
Months per ISS point: the slope is a change in the response (months)
for each one-point increase on the ISS scale.
(c) Suppose we multiplied all of the ISS scores by 10, and re-fit
the (Time ~ ISS) model, then the value of the (Intercept) coefficient
would:
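Answer: C. Multiplying ISS by 10 rescales only the explanatory
variable, so the prediction at ISS = 0 is still 61.6222 months.
A. Become 10 times larger.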
B. Become 10 times smaller.
C. Remain unchanged.
D. Increase, additively, by 10 (i.e., become 71.6222)
E. Decrease by 10 (i.e., become 51.6222)
(d) Suppose we multiplied all of the ISS scores by 10, and re-fit
the (Time ~ ISS) model, then the value of the ISS coefficient
would:
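Answer: B. The same change in survival time is now spread over 10
times as many ISS units, so the slope becomes -0.02846.
A. Become 10 times larger.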
B. Become 10 times smaller.
C. Remain unchanged.
D. Increase, additively, by 10 (i.e., become 9.7154)
E. Decrease by 10 (i.e., become −10.2846)
(e) Suppose we measured survival time in years instead of months,
but left the ISS scores as in the original data. If we re-fit the (Time
~ ISS) model, then the value of the (Intercept) coefficient would:
Answer: B. The intercept is in the units of the response, so it
becomes 61.6222/12, or about 5.135 years.
A. Become 12 times larger.
B. Become 12 times smaller.
C. Remain unchanged.
D. Increase, additively, by 12 (i.e., become 73.6222)
E. Decrease by 12 (i.e., become 49.6222)
(f) Suppose we measured survival time in years instead of months but
left the ISS scores as in the original data. If we re-fit the (Time ~
ISS) model, then the value of the ISS coefficient would:
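Answer: B. The slope is in response units per ISS point, so it
becomes -0.2846/12, or about -0.0237 years per point.
A. Become 12 times larger.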
B. Become 12 times smaller.
C. Remain unchanged.
D. Increase, additively, by 12 (i.e., become 11.7154)
E. Decrease by 12 (i.e., become −12.2846)
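Parts (c) through (f) can be checked empirically without the real data. The sketch below simulates a data frame `sim` (hypothetical values, not SCI.csv) and re-fits the model after rescaling the explanatory variable and, separately, the response.

```r
set.seed(2)
# Simulated data (hypothetical, not SCI.csv): severity score and survival in months
sim <- data.frame(ISS = sample(1:75, 200, replace = TRUE))
sim$Time <- 60 - 0.3 * sim$ISS + rnorm(200, sd = 10)

base    <- coef(lm(Time ~ ISS, data = sim))
iss_x10 <- coef(lm(Time ~ I(ISS * 10), data = sim))  # explanatory multiplied by 10
years   <- coef(lm(I(Time / 12) ~ ISS, data = sim))  # response in years, not months

# Rows: original fit, ISS multiplied by 10, Time measured in years
rbind(base, iss_x10, years)
```

Rescaling the explanatory variable should change only the slope, while rescaling the response should change both coefficients by the same factor.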
Question 6
Answer the following TRUE or FALSE questions. If you answer FALSE,
re-write the statement so that it is TRUE.
(a) TRUE or FALSE: A residual is defined to be the difference
between the observed value of a response variable and the observed value
of an explanatory variable in the model.
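FALSE. A residual is defined to be the difference between the
observed value of a response variable and the fitted value (the model’s
prediction) of that response variable.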
(b) TRUE or FALSE: The Principle of Least Squares states that
coefficients in a model should be chosen such that the average of the
residuals is as small as possible.
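FALSE. The Principle of Least Squares states that coefficients in a
model should be chosen such that the sum of the squared residuals is as
small as possible.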
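A small experiment shows why the average of the residuals cannot be the criterion. In the sketch below, the data and the helper `resid_for_slope()` are made up for illustration: every line through the point of means has residuals that average to zero, whatever its slope, so only the sum of squared residuals can tell the lines apart.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Residuals for a line with a chosen slope, forced through (mean(x), mean(y))
resid_for_slope <- function(slope) y - (mean(y) + slope * (x - mean(x)))

slopes <- c(-5, 0, 2)

# Mean residual for three very different slopes
mean_resid <- sapply(slopes, function(s) mean(resid_for_slope(s)))

# Sum of squared residuals for the same three slopes
sse <- sapply(slopes, function(s) sum(resid_for_slope(s)^2))

# Sum of squared residuals for the least squares slope from lm()
ls_slope <- unname(coef(lm(y ~ x))[2])
sse_ls <- sum(resid_for_slope(ls_slope)^2)

rbind(slopes, mean_resid, sse)
sse_ls
```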
Question 7
Suppose an insurance company wants to relate the amount of fire
damage in major residential fires to the distance between the burning
house and the nearest fire station. The study is conducted in a large
suburb of a major city. A sample of 15 recent fires in this suburb is
selected. The amount of damage (in thousands of dollars) and the
distance between the fire and the nearest fire station (in miles) are
recorded for each fire, with the data provided in
FIREDAM.csv.
Read in the data:
FIREDAM = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/FIREDAM.csv')
(a) What is the appropriate response variable?
Damage
(b) What is the appropriate explanatory variable?
Distance
(c) Make a scatterplot with the variables from (a) and (b) on the
appropriate axes.
ggplot(data=FIREDAM,aes(x=distance,y=damage))+geom_point()

(e) Explain what we mean by “best” in (d).
“Best” means least squares: of all possible lines, this one has the
smallest sum of squared residuals.
(f) Add the best fitting line to a scatterplot of the data.
ggplot(data=FIREDAM,aes(x=distance,y=damage))+geom_point()+geom_abline(slope=4.919,intercept=10.278)

(g) Interpret the slope in the context of the problem.
For each additional mile between the fire and the nearest fire
station, the predicted damage increases by 4.919 thousand dollars (about
$4,919).
(h) According to the model in (d), what is the fitted value for the
amount of damage when the fire is 3 miles from the closest fire
station.
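# Fitted values for all 15 fires (used again in part (i))
Fitted=4.919*FIREDAM$distance+10.278
# Fitted value for a fire 3 miles from the nearest fire station
Fitted3=4.919*3+10.278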
Fitted value = 25.035 thousand dollars, i.e. about $25,035 in damage
(i) For the model in (d), write some short R code to verify that the
mean of the residuals is 0, and find the sum of squared residuals.
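Residual5=FIREDAM$damage-Fitted
mean(Residual5)
## [1] 0.001013333
The mean of the residuals is 0 apart from rounding: Fitted was
computed from coefficients rounded to three decimal places.
# Sum of squared residuals:
sum(Residual5^2)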
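Both checks can also be run on the fitted model object, which avoids rounding the coefficients by hand. The sketch below uses a made-up data frame `fires` (hypothetical values, not FIREDAM.csv), so its numbers will not match the real data.

```r
# Made-up distances (miles) and damages (thousands of dollars), for illustration only
fires <- data.frame(
  distance = c(0.7, 1.1, 1.8, 2.3, 3.0, 3.8, 4.6, 5.5),
  damage   = c(14.1, 17.3, 17.8, 23.1, 22.3, 26.2, 31.3, 36.0)
)

fit <- lm(damage ~ distance, data = fires)

# resid() uses the unrounded coefficients, so the mean is 0 up to floating-point error
mean_resid <- mean(resid(fit))

# Sum of squared residuals: the quantity that least squares minimizes
ssr <- sum(resid(fit)^2)

c(mean_resid = mean_resid, ssr = ssr)
```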
Question 8
Consider this graph which appeared in a letter on gun violence in
the United States, published in the January 2017 issue of the Journal of
the American Medical Association:

(a) What message are the authors of the letter trying to convey with
this graph?
The authors are comparing how much funding the US devotes to
different threats to the population with how large an impact each of
those threats has.
(b) In what ways does this relate to material from our course?
It relates somewhat to the Fox News data that we looked at in the
last ICA, and it is also tangentially related to the CSHA data along a
different dimension.