Assignment 2 - Simple Linear Regression

Created by John Palowitch, UNC Chapel Hill

Simple Linear Regression



LSD Data Set

In 1968, a study was performed to assess the effect of human consumption of LSD on math test scores. Seven subjects were given a dose of LSD; the concentration of LSD in their body tissue was then measured, and they were subsequently scored on a math test. In the following questions, we refer to the outcome of this study as the “LSD” data set.

NOTE: If you have trouble downloading or reading lsd.txt, the file has only 8 lines, so you may be able to improvise (for example, by entering the data manually).

  1. In this question you will perform a basic simple linear regression analysis with the LSD data. Load the lsd data set into R with the given code line. Print the data set.

    lsd <- read.table("lsd.txt", header = TRUE)
    lsd
    ##   Tissue_conc Score
    ## 1        1.17 78.93
    ## 2        2.97 58.20
    ## 3        3.26 67.47
    ## 4        4.69 37.47
    ## 5        5.83 45.65
    ## 6        6.00 32.92
    ## 7        6.41 29.97
  1. Consider the simple linear regression model \[ \text{Score} = \beta_0 + \beta_1 \cdot \text{Tissue\_conc} + \epsilon \] where \(\epsilon \sim N(0, \sigma^2)\). Write code to calculate the quantities below for the regression of Score on Tissue_conc, without using R’s lm function (i.e., using only the formulas given in class). Print these quantities, in order. Note that you may know some of these quantities by another name.
    1. \(r\) and \(R^2\)
    2. \(\hat \beta_1\)
    3. \(\hat \beta_0\)
    4. \(\hat \sigma^2\)
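    For reference, the code below relies on the closed-form least-squares estimators (a sketch, assuming these match the formulas given in class), where \(x\) denotes Tissue_conc, \(y\) denotes Score, \(r\) is their sample correlation, and \(s_x, s_y\) are the sample standard deviations: \[ \hat\beta_1 = r\,\frac{s_y}{s_x}, \quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x, \quad \hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2, \quad R^2 = r^2. \]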
    n <- nrow(lsd)
    r <- cor(lsd$Score, lsd$Tissue_conc)
    R2 <- r^2
    beta1_hat <- r * sd(lsd$Score) / sd(lsd$Tissue_conc)
    beta0_hat <- mean(lsd$Score) - beta1_hat * mean(lsd$Tissue_conc)
    sig2_hat <- sum((lsd$Score - beta0_hat - beta1_hat * lsd$Tissue_conc)^2) / (n - 2)
    
    c(r, R2)
    ## [1] -0.9369285  0.8778350
    beta1_hat
    ## [1] -9.009466
    beta0_hat
    ## [1] 89.12387
    sqrt(sig2_hat)
    ## [1] 7.125747
  2. Compute and store an lm (linear model) object associated with the regression. Print the summary() output of the linear model. Ensure the quantities from part (a) match what your summary output shows.

    lsd.lm <- lm(Score ~ Tissue_conc, data = lsd)
    summary(lsd.lm)
    ## 
    ## Call:
    ## lm(formula = Score ~ Tissue_conc, data = lsd)
    ## 
    ## Residuals:
    ##       1       2       3       4       5       6       7 
    ##  0.3472 -4.1658  7.7170 -9.3995  9.0513 -2.1471 -1.4032 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   89.124      7.048  12.646 5.49e-05 ***
    ## Tissue_conc   -9.009      1.503  -5.994  0.00185 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 7.126 on 5 degrees of freedom
    ## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8534 
    ## F-statistic: 35.93 on 1 and 5 DF,  p-value: 0.001854
  3. Use the output of lm to find each of the following. You can write your answers on a sheet of paper to be handed in. Be sure to note that they are answers to “1c. Part i”, etc.
    1. The \(t\)-statistic associated with the estimate of \(\beta_1\)
    2. The \(p\)-value for the hypotheses \(H_0:\beta_1 = 0\) vs. \(H_1:\beta_1 \ne 0\)
    3. The 95% confidence interval for \(\beta_1\) (This will require calculation; see the code sketch after this list.)
    4. \(SE_{\hat \beta_1}\)
    5. Interpret the estimates “Intercept” and “Tissue_conc” in terms of the scenario.
    6. According to this regression, what proportion of the variation in test scores can be explained by the model?
    7. Would you conclude that “Tissue_conc” can be used to explain differences in test scores? Explain.
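    The confidence interval requested above can be computed from the slope estimate and its standard error, or with R’s confint function (a minimal sketch; both approaches should agree):

    # 95% CI for beta_1: estimate +/- t-quantile * SE, with n - 2 degrees of freedom
    est <- coef(summary(lsd.lm))["Tissue_conc", "Estimate"]
    se  <- coef(summary(lsd.lm))["Tissue_conc", "Std. Error"]
    est + c(-1, 1) * qt(0.975, df = df.residual(lsd.lm)) * se
    # Equivalently:
    confint(lsd.lm, "Tissue_conc", level = 0.95)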


  1. Produce a scatterplot of your data, and use R to draw the regression line on the plot.

    plot(lsd$Tissue_conc, lsd$Score,
         xlab = "Tissue Concentration",
         ylab = "Test Score", main = "LSD Experiment w/Regression Line")
    abline(lsd.lm)
    1. Compute and print the predicted score of a student with an LSD tissue concentration of 2. Interpret this number. If we administer the test to a student with this level of LSD tissue concentration, are we guaranteed to observe the number you computed? Why or why not? (Answer on your paper sheet as 1e.)
    print(beta0_hat + 2 * beta1_hat)
    ## [1] 71.10494
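    Equivalently, the same prediction can be obtained from the fitted lm object (a quick sketch; it should match the manual calculation above):

    predict(lsd.lm, newdata = data.frame(Tissue_conc = 2))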

    1. Compute and print the predicted score of a student with no LSD tissue concentration. Interpret this number on paper.
    print(beta0_hat)
    ## [1] 89.12387
    1. Compute and print the predicted score of a student with LSD tissue concentration of 10. Interpret this number. Does anything seem fishy? Why do you suppose this is? (paper)
    print(beta0_hat + beta1_hat * 10)
    ## [1] -0.9707904
    1. Write a paragraph on paper giving your conclusions about the effect of LSD on math test performance. You may use both the lm output and your plot. In particular, interpret the meaning of the \(p\)-value and confidence interval you found in part c.
    2. Plot the residuals versus x and make a QQ plot of the residuals.
    #Your Code here
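    A minimal sketch of the two requested plots, using the residuals from lsd.lm:

    # Residuals versus the predictor
    plot(lsd$Tissue_conc, resid(lsd.lm),
         xlab = "Tissue Concentration", ylab = "Residuals",
         main = "Residuals vs Tissue Concentration")
    abline(h = 0, lty = 2)
    # Normal Q-Q plot of the residuals
    qqnorm(resid(lsd.lm))
    qqline(resid(lsd.lm))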
    1. Comment on the two diagnostic plots (paper). Do you have any concerns about our linear model?

PGA (Professional Golf Association) Data Set

  1. The PGA (Professional Golf Association) data set contains the average driving distance and accuracies for pro golfers in 2008. Use the code line below to load the data set into R. Consider the regression of Accuracy (response) on Distance (predictor). Accuracy is the fairway accuracy over the players’ holes that year. Distance is their average driving distance that year. Do the following:

    1. Perform and print the summary output of the regression of Accuracy on Distance. Then, on paper, interpret the estimates “(Intercept)” and “Distance” in terms of the scenario. Then find each of the following and put their values on paper.
    pga <- read.table("pga.txt", header = TRUE, sep = "\t")
    pga.slr <- lm(Accuracy ~ Distance, data = pga)
    summary(pga.slr)
    ## 
    ## Call:
    ## lm(formula = Accuracy ~ Distance, data = pga)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -22.0314  -3.1952   0.3083   3.4181  15.5215 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 102.80154    3.31858   30.98   <2e-16 ***
    ## Distance     -0.13937    0.01227  -11.36   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 5.118 on 352 degrees of freedom
    ## Multiple R-squared:  0.2682, Adjusted R-squared:  0.2661 
    ## F-statistic:   129 on 1 and 352 DF,  p-value: < 2.2e-16
  1. The \(p\)-value for the hypotheses \(H_0:\beta_1 = 0\) vs. \(H_1:\beta_1 \ne 0\)
  2. The 95% confidence interval for \(\beta_1\) (This will require calculation; see the code sketch after this list.)
  3. \(SE_{\hat \beta_1}\)
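  As with the LSD regression, the confidence interval for \(\beta_1\) can be computed from the estimate and its standard error, or with confint (a minimal sketch):

    est <- coef(summary(pga.slr))["Distance", "Estimate"]
    se  <- coef(summary(pga.slr))["Distance", "Std. Error"]
    est + c(-1, 1) * qt(0.975, df = df.residual(pga.slr)) * se
    confint(pga.slr, "Distance", level = 0.95)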


    1. Use R to make a scatterplot of the data, and draw the line of best fit (remember to label axes and title your plot).
    plot(pga$Distance, pga$Accuracy,
         xlab = "Distance", ylab = "Accuracy",
         main = "2008 PGA statistics")
    abline(pga.slr)
    1. R has built-in functions to make plots associated with post-regression model diagnostics. Use the code below to create these plots. We are interested in only the first two diagnostic plots offered automatically by R, so I have added the argument which = 1:2.
    plot(pga.slr, which=1:2)    
    1. On paper, write 2-3 sentences commenting on and interpreting the plots.



  1. You will notice that in the PGA data set, there is an extra variable for gender. Looking at the plots you produced in the previous problem, can you tell if this variable is important? Make at least one plot below that could help you guess whether or not gender should play a role in your regression, or in any analysis of the data. On paper, write a few sentences explaining your plot.
plot(pga$Distance, pga$Accuracy,
     xlab = "Distance", ylab = "Accuracy",
     main = "2008 PGA statistics",
     col = pga$Gender + 1)
legend("topright", legend = c("Female", "Male"), col = c("red", "green"), pch = 1)

  1. In a code block below, split up the data set into two separate data sets: one for females, one for males. (As in the Marathon data set, Gender 1 represents females; I believe this follows the convention of coding the levels in alphabetical order, ‘female’ before ‘male’.) Then:
    1. Perform a regression analysis for each, print the summary outputs, and make only the residual vs. fitted value and Q-Q diagnostic plots for each regression (by setting the which parameter of plot to 1:2). Do the residuals appear more normally distributed for men or for women? Answer on paper and explain how you arrived at your answer.
    # Regressions for split data
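    # The question asks for two explicit data sets; a minimal sketch of that split
    # (Gender 1 = female, 2 = male, per the note above). The lm() calls below reach
    # the same fits via the subset= argument, so these frames are not strictly needed:
    pga.F <- subset(pga, Gender == 1)
    pga.M <- subset(pga, Gender == 2)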
    pga.slr.M <- lm(Accuracy ~ Distance, data = pga, subset = Gender == 2)
    pga.slr.F <- lm(Accuracy ~ Distance, data = pga, subset = Gender == 1)
    summary(pga.slr.M)
    ## 
    ## Call:
    ## lm(formula = Accuracy ~ Distance, data = pga, subset = Gender == 
    ##     2)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -15.1837  -2.7969   0.2126   2.8715  11.0186 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 174.92546   10.44771   16.74   <2e-16 ***
    ## Distance     -0.38789    0.03631  -10.68   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 4.349 on 195 degrees of freedom
    ## Multiple R-squared:  0.3692, Adjusted R-squared:  0.3659 
    ## F-statistic: 114.1 on 1 and 195 DF,  p-value: < 2.2e-16
    summary(pga.slr.F)
    ## 
    ## Call:
    ## lm(formula = Accuracy ~ Distance, data = pga, subset = Gender == 
    ##     1)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -23.6777  -2.6583   0.9829   3.6346  10.2339 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 130.89331   10.92765  11.978  < 2e-16 ***
    ## Distance     -0.25649    0.04424  -5.797 3.66e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 5.246 on 155 degrees of freedom
    ## Multiple R-squared:  0.1782, Adjusted R-squared:  0.1729 
    ## F-statistic: 33.61 on 1 and 155 DF,  p-value: 3.662e-08
    # Plots for split data
    plot(pga.slr.M, which = 1:2)

    plot(pga.slr.F, which = 1:2)


    1. Re-create the scatter plot of the data set (with the original overall regression line), but now plot the two gender-specific regression lines along with the main one. Choose different colors for the three lines and include a legend.
    # Scatterplot showing all lines
    plot(pga$Distance, pga$Accuracy,
         xlab = "Distance", ylab = "Accuracy",
         main = "2008 PGA statistics")
    abline(pga.slr)                   # overall fit (black)
    abline(pga.slr.M, col = "green")  # male-only fit
    abline(pga.slr.F, col = "red")    # female-only fit
    legend("topright", legend = c("Overall", "Female", "Male"),
           col = c("black", "red", "green"), lty = 1)
    1. On paper, write a few sentences stating whether splitting up the data appears to improve the analysis of this data set. You may use either (or both) quantitative support from the lm summary output or qualitative support from your plots.
    2. Extra Credit: Let \(\beta_{M}\) and \(\beta_{F}\) be the true regression coefficients for the two regressions. On a separate sheet of paper, write down the \(t\)-statistic associated with the hypothesis test \(H_0:\beta_{M} = \beta_{F}\) vs. \(H_1:\beta_{M}\ne\beta_{F}\). In a code block below, compute and print the \(p\)-value for this hypothesis test. On paper, write a few sentences explaining and interpreting this result in non-technical terms. \[ t = \frac{\hat\beta_F - \hat\beta_M}{\sqrt{SE_{\hat\beta_F}^2 + SE_{\hat\beta_M}^2}},\;\;\;df = \min\{n_F, n_M\} \]
    # Computation of p-value
    Beta.F <- summary(pga.slr.F)$coefficients[2, 1]  # slope estimate, females
    Beta.M <- summary(pga.slr.M)$coefficients[2, 1]  # slope estimate, males
    SE.F <- summary(pga.slr.F)$coefficients[2, 2]    # slope SE, females
    SE.M <- summary(pga.slr.M)$coefficients[2, 2]    # slope SE, males
    t <- (Beta.F - Beta.M) / sqrt(SE.F^2 + SE.M^2)   # t-statistic for equality of slopes
    DF <- min(table(pga$Gender))                     # df = min(n_F, n_M), per the formula above
    2 * pt(abs(t), df = DF, lower.tail = FALSE)
    ## [1] 0.02301723