Load some necessary packages:

library(ggplot2)

library(dplyr)


Question 1

CSHA.csv has data on 509 patients from the Canadian Study of Health and Ageing, a study which examined survival times from onset of Alzheimer’s disease. For each patient, the following variables were recorded:

Education: Number of years of education

AAO: Age at onset of Alzheimer’s (in years)

Sex: A patient’s biological sex recorded in binary fashion: male (M) or female (F)

Survival: Time from onset of Alzheimer’s until death (in days)


Using this data, we will be exploring a few models of Survival using Sex, Education, and AAO.


Read in the data:

CSHA = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/CSHA.csv')


(a) In this context, what is the response variable and what are the explanatory variables?

The response variable is Survival. The explanatory variables are Sex, Education, and AAO.


(b) Use R to create a model of the form \(y = b\), where \(y\) represents your response variable, and \(b\) represents some constant.

# Insert your R code below:
lm(formula=Survival~1,data=CSHA)
## 
## Call:
## lm(formula = Survival ~ 1, data = CSHA)
## 
## Coefficients:
## (Intercept)  
##        2299


(c) State the “optimal” value of \(b\) returned by R, and explain what this number represents in the context of the question.

b = 2299. This is the mean survival time, in days, from onset of Alzheimer’s until death among the 509 patients in the data.


(d) Explain in your own words what we mean by “optimal” in part (c).

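“Optimal” means that this value of \(b\) makes the sum of squared residuals (the squared differences between each observed Survival and \(b\)) as small as possible. For a model with only a constant, the value that does this is the sample mean.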
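The link between “optimal” and the mean can be checked numerically. The sketch below uses a small made-up vector of survival times (not the CSHA data) and a hypothetical helper `ssr()`. It searches for the constant that minimizes the sum of squared residuals and compares it with the mean and with the `lm()` intercept.

```r
# Made-up survival times in days, for illustration only (not from CSHA)
y <- c(1200, 1800, 2300, 2900, 3300)

# Sum of squared residuals for the constant model y = b
ssr <- function(b) sum((y - b)^2)

# Numerically search for the b that minimizes the sum of squared residuals
best <- optimize(ssr, interval = range(y))$minimum

best                # numerical minimizer
mean(y)             # sample mean
coef(lm(y ~ 1))     # intercept reported by lm()
```

All three values should agree up to the tolerance of the numerical search.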

(e) Use R to build a model of the form \(y = mx + b\) with \(y\) = Survival and \(x\) = Education. State the slope and intercept of the best fitting line.

# Insert your R code below:
lm(formula=Survival~Education,data=CSHA)
## 
## Call:
## lm(formula = Survival ~ Education, data = CSHA)
## 
## Coefficients:
## (Intercept)    Education  
##      2012.7         33.1

Intercept=2012.7 Slope=33.1


(f) Make a scatterplot using Survival and Education, and add the best fitting line to the scatterplot of the data.

# Insert your R code below:
ggplot(data=CSHA,aes(x=Education,y=Survival))+geom_point()+geom_abline(intercept=2012.7,slope=33.1)
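A best fitting line can also be drawn without copying rounded coefficients by hand. The sketch below is one way to do it, using a made-up data frame `toy` (not the CSHA values) so that it runs on its own. `geom_smooth(method = "lm")` fits the least-squares line to the plotted variables.

```r
library(ggplot2)

# Made-up data for illustration only (not the CSHA values)
toy <- data.frame(Education = c(6, 8, 10, 12, 12, 14, 16, 18),
                  Survival  = c(1900, 2500, 1700, 2600, 2100, 2900, 2200, 2700))

# geom_smooth(method = "lm") fits Survival ~ Education and draws the
# least-squares line; se = FALSE hides the confidence band
p <- ggplot(data = toy, aes(x = Education, y = Survival)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

With the real data, replacing `toy` by `CSHA` should give the same line as `geom_abline()` with the `lm()` coefficients, up to rounding.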


(g) Do you think Education is a good explanatory variable for modeling Survival? Explain your reasoning; what are you looking for in order to answer this question?

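No. The scatterplot shows no clear trend: the points are stacked vertically over a wide range of Survival at each value of Education, so knowing Education tells us little about Survival. To answer this I am looking for points that cluster closely around the fitted line.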


(h) Interpret the slope of the model from (e) in the context of the problem.

Each additional year of education is associated with a predicted (average) survival time about 33.1 days longer.


(i) Repeat parts (e), (f), (g), and (h), except using AAO instead of Education.

lm(formula=Survival~AAO,data=CSHA)
## 
## Call:
## lm(formula = Survival ~ AAO, data = CSHA)
## 
## Coefficients:
## (Intercept)          AAO  
##     10259.1        -97.6

Slope:-97.6 Intercept:10259.1

ggplot(data=CSHA,aes(x=AAO,y=Survival))+geom_point()+geom_abline(intercept=10259.1,slope=-97.6)

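Yes, AAO appears to be a reasonable explanatory variable for Survival, and a much better one than Education: the scatterplot shows a clear negative trend around the line of best fit. The slope means that each additional year of age at onset is associated with a predicted survival time about 97.6 days shorter.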


(j) Do you think AAO is a better or worse predictor of Survival than Education? Briefly explain your reasoning.

Better. AAO shows a clear negative association with Survival, while Education shows no clear trend. The steeper slope is not itself evidence, because the size of a slope depends on the units of the explanatory variable.


(k) Suppose someone had onset of Alzheimer’s at age 70 and survived for 2500 days. Calculate the fitted value (i.e. the model’s prediction) and the residual for this individual from the Survival ~ AAO model.

fitted=-97.6*70+10259.1
residual=2500-3427.1

fitted value=3427.1 residual=-927.1
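Written out as a worked calculation, with \(\hat{y}\) for the fitted value and \(e\) for the residual (symbols introduced here for illustration):

```latex
\begin{align*}
\hat{y} &= 10259.1 + (-97.6)(70) = 3427.1 \text{ days} \\
e &= y - \hat{y} = 2500 - 3427.1 = -927.1 \text{ days}
\end{align*}
```

The negative residual means this individual survived about 927 days less than the model predicts for someone with onset at age 70.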

(l) What would happen to the slope in the Survival ~ AAO model if we had measured AAO in days instead of in years? Answer this question conceptually, and if you’d like to verify using R, there is some code below to help you create a new AAOdays variable in the CSHA dataframe.

The slope would become 365 times smaller in magnitude (about -97.6/365 ≈ -0.267 days of survival per day of AAO), because a one-unit increase in x would now be one day instead of one year. The intercept, the fitted line, and the conclusion we draw from it would not change.

# Create a new AAOdays in CSHA by un-commenting and completing the line below:
CSHA = CSHA %>% mutate( AAOdays = AAO*365 )
# Then, re-fit the model using AAOdays instead of AAO:
lm(formula=Survival~AAOdays,data=CSHA)
## 
## Call:
## lm(formula = Survival ~ AAOdays, data = CSHA)
## 
## Coefficients:
## (Intercept)      AAOdays  
##  10259.1039      -0.2674


(m) Would measuring AAO in days change anything about the quality of the AAO variable for modeling Survival? Why is this an important realization when looking at model coefficients in general?

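No. Changing the units of AAO only rescales the slope; the fitted values and residuals are exactly the same, so AAO is just as good (or bad) an explanatory variable in days as in years. This matters because the size of a coefficient depends on the units of its variable, so we cannot judge how useful a variable is from the magnitude of its coefficient alone.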
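The effect of changing units can be illustrated with simulated data. The sketch below does not use the CSHA values; the variable names and the numbers used to generate the data are made up for illustration. It fits the same model with the explanatory variable in years and in days, then compares the slopes, fitted values, and sum of squared residuals.

```r
# Simulated data for illustration only (not the CSHA values)
set.seed(1)
aao_years <- runif(50, min = 60, max = 95)
survival  <- 10000 - 95 * aao_years + rnorm(50, sd = 800)
aao_days  <- aao_years * 365

fit_years <- lm(survival ~ aao_years)
fit_days  <- lm(survival ~ aao_days)

# The slope in years is 365 times the slope in days ...
slope_ratio <- unname(coef(fit_years)["aao_years"] / coef(fit_days)["aao_days"])
slope_ratio

# ... but the fitted values and the sum of squared residuals are unchanged
same_fit <- all.equal(unname(fitted(fit_years)), unname(fitted(fit_days)))
same_fit
sum(resid(fit_years)^2)
sum(resid(fit_days)^2)
```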


Question 2

In studies of employment discrimination, several attributes of employees are often considered, for example, age, biological sex, race, years of experience, salary, whether promoted, whether laid off, etc.

For each of the following questions, state the response variable and the explanatory variable.


1. Are males paid more than females?

Explanatory variable: Sex Response variable: Salary

2. On average, how much extra salary is a year of experience worth?

Explanatory variable: Years of experience Response variable: Salary


3. Are white employees more likely than black employees to be promoted?

Explanatory variable: Race Response variable: Whether Promoted


4. Are older employees more likely to be laid off than younger employees?

Explanatory variable: Age Response variable: Whether Laid-Off




Question 3

The data set trees.csv provides measurements of the girth, height, and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 inches above the ground, height is measured in feet, and volume is measured in cubic feet.

Read in the data:

trees = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/trees.csv')


You can click on the name of the dataframe from the Environment tab to see the data and the variable names. Use this data to answer the following questions:


(a) In a model with girth and volume, which do you think would be the response variable, and which would be the explanatory variable? Briefly explain your choices.

Girth would be my explanatory variable and Volume would be my response variable. Girth is easy to measure on a tree, while the volume of timber is the harder-to-measure quantity we would want to predict from it.

(b) Based on your answer to (a), make a scatterplot of the response (y-axis) versus the explanatory (x-axis). Would you say that your explanatory is a good predictor for your response? Briefly explain.

ggplot(data=trees,aes(x=Girth,y=Volume))+geom_point()

ggplot(data=trees,aes(x=Volume,y=Girth))+geom_point()

I would say that Girth is a good predictor of Volume because the scatterplot shows a clear positive trend between the two. Reversing the axes, as in the second plot, shows the same trend.

(c) Make a simple linear regression model for your choice of response using your choice of explanatory. Report the coefficients of this model, and add a depiction of the model to your scatterplot from (b).

lm(formula=Volume~Girth,data=trees)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Coefficients:
## (Intercept)        Girth  
##     -36.943        5.066
ggplot(data=trees,aes(x=Girth,y=Volume))+geom_point()+geom_abline(intercept=-36.943,slope=5.066)

Intercept:-36.943 Slope:5.066

(d) Interpret the two model coefficients obtained in (c). Explain why one of these coefficients is contextually meaningless.

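Slope: each additional inch of girth is associated with a predicted volume about 5.066 cubic feet larger. Intercept: the model predicts a volume of -36.943 cubic feet for a tree with a girth of 0 inches. The intercept is contextually meaningless because a volume cannot be negative and a tree cannot have a girth of 0.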

(e) Using the model from (c), find the fitted value and the residual for a hypothetical tree that was observed to have a volume of 33.8 cubic feet and a girth of 12.9 inches.

fitted4=5.066*12.9-36.943
residual4=33.8-28.4084

fitted4=28.4084 residual4=5.3916
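The same fitted value and residual can be obtained from the model object with `predict()`, which avoids retyping rounded coefficients. The sketch below assumes that R’s built-in `datasets::trees` data frame contains the same 31 black cherry trees as trees.csv (the variable names and the coefficients above suggest so, but this is an assumption). `new_tree` is a name introduced here for illustration.

```r
# Assumption: the built-in datasets::trees matches trees.csv
fit <- lm(Volume ~ Girth, data = datasets::trees)

# Hypothetical tree with a girth of 12.9 inches
new_tree <- data.frame(Girth = 12.9)

fitted_value   <- unname(predict(fit, newdata = new_tree))
residual_value <- 33.8 - fitted_value   # observed volume minus fitted volume

fitted_value
residual_value
```

Because `predict()` uses the unrounded coefficients, the results can differ from the hand calculation in the last decimal places.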


(f) dplyr is a very useful package which makes it easy to add variables to a data frame, use only certain variables/observations, or manipulate data in countless other valuable ways. For example, to create a new variable that categorizes trees as either short, medium, or tall, you can type:

trees = trees %>% mutate( HeightCategorical = cut( Height ,
                                                   breaks=c(60,70,80,90) ,
                                                   labels = c("Short","Medium","Tall") )
                          )


For each of the following parts, explain in words what the commands are doing:


Part 1

trees = trees %>% mutate( GirthCategorical = cut( Girth,
                                                  breaks = c(8,12,22),
                                                  labels = c("Small","Big")
                                                  )
                          )

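This command adds a new variable, GirthCategorical, to the trees data frame. It labels each tree “Small” if its girth is greater than 8 and at most 12 inches, and “Big” if its girth is greater than 12 and at most 22 inches.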
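How `cut()` treats the break points can be checked with a few made-up girth values (the vector below is illustrative, not taken from the data). By default each interval is open on the left and closed on the right.

```r
# Made-up girths in inches, chosen to sit on and around the break at 12
girth <- c(8.3, 12.0, 12.1, 20.6)

# (8, 12] becomes "Small" and (12, 22] becomes "Big"
girth_cat <- cut(girth, breaks = c(8, 12, 22), labels = c("Small", "Big"))
girth_cat
# → Small Small Big Big
```

A girth of exactly 12 falls in “Small”. A girth of exactly 8 or anything above 22 would become `NA`.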


Part 2

trees %>% select( HeightCategorical , GirthCategorical ) %>% table( )

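This command keeps only the two categorical variables and builds a two-way table that counts how many trees fall in each combination of height category and girth category.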


Part 3

trees %>% group_by(HeightCategorical) %>% summarize( Median = median(Volume), 
                                                     Mean = mean(Volume),
                                                     SD = sd(Volume)
                                                     )

This command splits the trees by height category and, within each category, computes the median, mean, and standard deviation of Volume.


Part 4

shorts = trees %>% filter( HeightCategorical == "Short" )

lm( Volume ~ Girth , data=shorts )

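The first command keeps only the trees labeled “Short” and stores them in a new data frame called shorts. The second fits a simple linear regression of Volume on Girth using only those short trees.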


Question 4

Here is a paragraph from a New York Times article:

It has long been said that regular physical activity and better sleep go hand in hand. But only recently have scientists sought to find out precisely to what extent. One study published looked for answers by having healthy children wear actigraphs (devices that measure movement) and then seeing whether more movement and activity during the day meant improved sleep at night. The study found that sleep onset latency (the time it takes to fall asleep once in bed) ranged from as little as roughly 10 minutes for some children to more than 40 minutes for others. But physical activity during the day and sleep onset at night were closely linked: every hour of sedentary activity during the day resulted in an additional three minutes in the time it took to fall asleep at night. And the children who fell asleep faster ultimately slept longer, getting an extra hour of sleep for every 10-minute reduction in the time it took them to drift off.

(a) There are two models described in this passage with two different response variables. What are the two response variables?

Sleep onset latency (the number of minutes it takes to fall asleep) and sleep duration (the total time asleep).


(b) For each of the two response variables that you stated in (a), what is the explanatory variable being used to model it?

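For sleep onset latency, the explanatory variable is hours of sedentary activity during the day. For sleep duration, the explanatory variable is sleep onset latency.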


(c) Suppose that you are comparing two groups of children. Group A has 3 hours of sedentary activity each day, and Group B has 8 hours of sedentary activity. Which of these statements is best supported by the article? (Select only 1)

C

A. The children in Group A will take, on average, 3 minutes less time to fall asleep.

B. The children in Group B will have, on average, 10 minutes less sleep each night.

C. The children in Group A will take, on average, 15 minutes less time to fall asleep.

D. The children in Group B will have, on average, 45 minutes less sleep each night.


(d) Again comparing the two groups of children from (c), which of these statements is supported by the article? (Select only 1)

A

A. The children in Group A will get, on average, about an hour and a half of extra sleep compared to the Group B children.

B. The children in Group A will get, on average, about 15 minutes more sleep than the Group B children.

C. The two groups will get about the same amount of sleep.
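The arithmetic behind (c) and (d), using the two rates quoted in the article (3 extra minutes to fall asleep per sedentary hour, and 1 extra hour of sleep per 10-minute reduction in the time to fall asleep):

```latex
\begin{align*}
\text{difference in sedentary time} &= 8 - 3 = 5 \text{ hours} \\
\text{difference in time to fall asleep} &= 5 \times 3 = 15 \text{ minutes} \\
\text{difference in sleep duration} &= \frac{15}{10} \times 1 \text{ hour} = 1.5 \text{ hours}
\end{align*}
```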




Question 5

This question refers to data (SCI.csv) on the survival times of 2498 individuals after sustaining a spinal cord injury; however, you will NOT need to run R commands to arrive at your responses. The variables in SCI.csv are:

Time: Survival time after spinal cord injury (in months)

Age: Age of the individual at the time of the injury (in years)

Sex: The patient’s biological sex recorded in binary fashion, Male or Female

Cause: Cause of the spinal cord injury, categorized as: Fall, MotorVehicle, or Other

ISS: Injury severity score (on a scale from 1-75, with larger values indicating greater severity)

ISSCat: Injury severity categorized as Mild (ISS <= 8), Moderate (9 <= ISS <= 15), or Severe (ISS >= 16)

Urban: Did the injury occur in an urban location (Yes or No)

YOI: Year of the injury (2004-2013)


Below is one R command and the output it produced:

lm( Time ~ ISS , data=SCI )
## 
## Call:
## lm(formula = Time ~ ISS, data = SCI)
## 
## Coefficients:
## (Intercept)          ISS  
##     61.6222      -0.2846


(a) What are the units associated with the (Intercept) coefficient, 61.6222?

months

(b) What are the units associated with the ISS coefficient, -0.2846?

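Months per ISS point: the coefficient is the change in predicted survival time (in months) for each one-point increase in the injury severity score.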

(c) Suppose we multiplied all of the ISS scores by 10, and re-fit the (Time ~ ISS) model, then the value of the (Intercept) coefficient would:

C

A. Become 10 times larger.

B. Become 10 times smaller.

C. Remain unchanged.

D. Increase, additively, by 10 (i.e., become 71.6222)

E. Decrease by 10 (i.e., become 51.6222)


(d) Suppose we multiplied all of the ISS scores by 10, and re-fit the (Time ~ ISS) model, then the value of the ISS coefficient would:

B

A. Become 10 times smaller.

B. Become 10 times smaller.

C. Remain unchanged.

D. Increase, additively, by 10 (i.e., become 9.7154)

E. Decrease by 10 (i.e., become −10.2846)


(e) Suppose we measured survival time in years instead of months, but left the ISS scores as in the original data. If we re-fit the (Time ~ ISS) model, then the value of the (Intercept) coefficient would:

B

A. Become 12 times larger.

B. Become 12 times smaller.

C. Remain unchanged.

D. Increase, additively, by 12 (i.e., become 73.6222)

E. Decrease by 12 (i.e., become 49.6222)


(f) Suppose we measured survival time in years instead of months but left the ISS scores as in the original data. If we re-fit the (Time ~ ISS) model, then the value of the ISS coefficient would:

B

A. Become 12 times larger.

B. Become 12 times smaller.

C. Remain unchanged.

D. Increase, additively, by 12 (i.e., become 11.7154)

E. Decrease by 12 (i.e., become −12.2846)
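The rescaling questions in (c) through (f) can be checked algebraically. In the sketch below, \(b_0 = 61.6222\) and \(b_1 = -0.2846\) denote the fitted coefficients (the symbols are introduced here for illustration). Rewriting the fitted line in the new units leaves the residuals unchanged or rescales them all by the same factor, so the rewritten line is still the least-squares line.

```latex
\begin{align*}
\widehat{\text{Time}} &= b_0 + b_1\,\text{ISS}
  = b_0 + \frac{b_1}{10}\,(10 \cdot \text{ISS})
  && \text{intercept unchanged, slope 10 times smaller} \\
\frac{\widehat{\text{Time}}}{12} &= \frac{b_0}{12} + \frac{b_1}{12}\,\text{ISS}
  && \text{intercept and slope both 12 times smaller}
\end{align*}
```

With ISS multiplied by 10, the coefficients would be about 61.6222 and -0.02846. With Time in years, they would be about 5.1352 and -0.0237.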




Question 6

Answer the following TRUE or FALSE questions. If you answer FALSE, re-write the statement so that it is TRUE.


(a) TRUE or FALSE: A residual is defined to be the difference between the observed value of a response variable and the observed value of an explanatory variable in the model.

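FALSE. A residual is defined to be the difference between the observed value of a response variable and the fitted (predicted) value of the response variable from the model.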


(b) TRUE or FALSE: The Principle of Least Squares states that coefficients in a model should be chosen such that the average of the residuals is as small as possible.

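FALSE. The Principle of Least Squares states that coefficients in a model should be chosen such that the sum of the squared residuals is as small as possible.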
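One way to see why the average of the residuals is the wrong criterion in (b): a poorly fitting line can still have residuals that average to zero. The sketch below uses made-up data and compares the least-squares line with a flat line at the mean of y.

```r
# Made-up data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Residuals from a deliberately poor model: a flat line at mean(y)
bad_resid <- y - mean(y)

# Both sets of residuals average to zero (up to floating-point error) ...
mean(resid(fit))
mean(bad_resid)

# ... but the least-squares line has a much smaller sum of squared residuals
sum(resid(fit)^2)
sum(bad_resid^2)
```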




Question 7

Suppose an insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is conducted in a large suburb of a major city. A sample of 15 recent fires in this suburb is selected. The amount of damage (in thousands of dollars) and the distance between the fire and the nearest fire station (in miles) are recorded for each fire, with the data provided in FIREDAM.csv.


Read in the data:

FIREDAM = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/FIREDAM.csv')


(a) What is the appropriate response variable?

Damage

(b) What is the appropriate explanatory variable?

Distance


(c) Make a scatterplot with the variables from (a) and (b) on the appropriate axes.

ggplot(data=FIREDAM,aes(x=distance,y=damage))+geom_point()


(d) Consider the model form \(y = mx + b\), where \(y\) is the response variable and \(x\) is the explanatory variable. Use R to find the “best” (or “optimal”) coefficients for this model.

lm(formula=damage~distance,data=FIREDAM)
## 
## Call:
## lm(formula = damage ~ distance, data = FIREDAM)
## 
## Coefficients:
## (Intercept)     distance  
##      10.278        4.919


(e) Explain what we mean by “best” in (d).

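“Best” means that the slope and intercept are chosen to make the sum of squared residuals as small as possible (the Principle of Least Squares).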


(f) Add the best fitting line to a scatterplot of the data.

ggplot(data=FIREDAM,aes(x=distance,y=damage))+geom_point()+geom_abline(slope=4.919,intercept=10.278)


(g) Interpret the slope in the context of the problem.

Each additional mile between the fire and the nearest fire station is associated with about 4.919 thousand dollars (roughly $4,919) more predicted damage.


(h) According to the model in (d), what is the fitted value for the amount of damage when the fire is 3 miles from the closest fire station.

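Fitted3=4.919*3+10.278 # fitted damage for a fire 3 miles from the station
Fitted=4.919*FIREDAM$distance+10.278 # fitted damage for every fire in the data, used in (i)

Fitted value=25.035 (thousands of dollars)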

(i) For the model in (d), write some short R code to verify that the mean of the residuals is 0, and find the sum of squared residuals.

Residual5=FIREDAM$damage-Fitted
mean(Residual5)
## [1] 0.001013333
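The mean above is close to, but not exactly, 0. The most likely reason is that the fitted values were computed from coefficients rounded to three decimals. The sketch below takes the residuals directly from the model object with `resid()` and also computes the sum of squared residuals. It uses a made-up data frame `fires` (not the FIREDAM values) so that it runs on its own.

```r
# Made-up stand-in for FIREDAM (values are illustrative only)
fires <- data.frame(distance = c(0.7, 1.8, 2.6, 3.4, 4.3, 5.5),
                    damage   = c(14.1, 17.8, 19.6, 26.2, 31.3, 36.0))

fit <- lm(damage ~ distance, data = fires)

res <- resid(fit)   # residuals based on the unrounded coefficients
mean(res)           # zero up to floating-point error
sum(res^2)          # sum of squared residuals
```

With the real data, replacing `fires` by `FIREDAM` should give a mean residual that is zero up to floating-point error.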




Question 8

Consider this graph which appeared in a letter on gun violence in the United States, published in the January 2017 issue of the Journal of the American Medical Association:



(a) What message are the authors of the letter trying to convey with this graph?

The graph compares how much funding the US devotes to various threats to the population with how large an impact each of those threats has.


(b) In what ways does this relate to material from our course?

I guess it relates to the fox news data that we looked at from the last ICA a bit, but it is also tangentially related to the CSHA data in another dimension