Created by John Palowitch, UNC Chapel Hill
LSD DATA SET
In 1968, a study was conducted to assess the effect of LSD on math test scores in humans. Seven subjects were each given a dose of LSD; the concentration of LSD in their body tissue was then measured, and they subsequently took a math test. In the following questions, we refer to the data from this study as the “LSD” data set.
NOTE: If you have trouble getting and reading lsd.txt, the file only has 8 lines. You may be able to improvise.
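If lsd.txt is unavailable, one way to improvise is to type the seven rows in by hand. The sketch below assumes the values are exactly those shown in the printed data set further down, and uses the text argument of read.table so that the result has the same column names as the file-based load.

```r
# Rebuild the LSD data inline instead of reading lsd.txt.
# Values are copied from the printed data set in this document.
lsd <- read.table(text = "
Tissue_conc Score
1.17 78.93
2.97 58.20
3.26 67.47
4.69 37.47
5.83 45.65
6.00 32.92
6.41 29.97
", header = TRUE)

str(lsd)  # structure check: 7 observations of 2 numeric variables
```

The header line plus seven data rows accounts for the 8 lines mentioned in the note.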
In this question you will perform a basic simple linear regression analysis with the LSD data. Load the lsd data set into R with the given code line. Print the data set.
lsd <- read.table("lsd.txt", header = TRUE)
lsd
##   Tissue_conc Score
## 1        1.17 78.93
## 2        2.97 58.20
## 3        3.26 67.47
## 4        4.69 37.47
## 5        5.83 45.65
## 6        6.00 32.92
## 7        6.41 29.97
Compute the following quantities for the regression of Score on Tissue_conc, without using R’s lm function (i.e., using only the formulas given in class): \(r\), \(R^2\), \(\hat \beta_1\), \(\hat \beta_0\), and \(\hat \sigma\). Print these quantities, in order. Note that you may know some of these quantities by another name.
n <- nrow(lsd)
r <- cor(lsd$Score, lsd$Tissue_conc)
R2 <- r^2
beta1_hat <- r * sd(lsd$Score) / sd(lsd$Tissue_conc)
beta0_hat <- mean(lsd$Score) - beta1_hat * mean(lsd$Tissue_conc)
sig2_hat <- sum((lsd$Score - beta0_hat - beta1_hat * lsd$Tissue_conc)^2) / (n - 2)
c(r, R2)
## [1] -0.9369285 0.8778350
beta1_hat
## [1] -9.009466
beta0_hat
## [1] 89.12387
sqrt(sig2_hat)
## [1] 7.125747
Compute and store an lm (linear model) object associated with the regression. Print the summary() output of the linear model. Ensure the quantities from part (a) match what your summary output shows.
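For reference, the hand computations in part (a) correspond to the standard simple linear regression formulas below, with \(x\) = Tissue_conc and \(y\) = Score. This is a sketch that assumes these are the formulas given in class; \(s_x\) and \(s_y\) denote the sample standard deviations returned by sd(), and \(r\) is the sample correlation returned by cor().

```latex
\[
\hat\beta_1 = r\,\frac{s_y}{s_x}, \qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad
\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2, \qquad
R^2 = r^2 .
\]
```

In the summary() output, \(\hat\beta_0\) and \(\hat\beta_1\) appear in the Estimate column, \(\hat\sigma\) is the residual standard error, and \(R^2\) is the Multiple R-squared.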
lsd.lm <- lm(Score ~ Tissue_conc, data = lsd)
summary(lsd.lm)
##
## Call:
## lm(formula = Score ~ Tissue_conc, data = lsd)
##
## Residuals:
##       1       2       3       4       5       6       7
##  0.3472 -4.1658  7.7170 -9.3995  9.0513 -2.1471 -1.4032
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   89.124      7.048  12.646 5.49e-05 ***
## Tissue_conc   -9.009      1.503  -5.994  0.00185 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.126 on 5 degrees of freedom
## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8534
## F-statistic: 35.93 on 1 and 5 DF,  p-value: 0.001854
Use lm to find each of the following. You can write your answers on a sheet of paper to be handed in. Be sure to note that they are answers to “1c. Part i”, etc.
Produce a scatterplot of your data, and use R to draw the regression line on the plot.
plot(lsd$Tissue_conc, lsd$Score,
     xlab = "Tissue Concentration",
     ylab = "Test Score", main = "LSD Experiment w/Regression Line")
abline(lsd.lm)
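# Fitted score at Tissue_conc = 2
print(beta0_hat + 2 * beta1_hat)
## [1] 71.10494
# Fitted score at Tissue_conc = 0 (the intercept)
print(beta0_hat)
## [1] 89.12387
# Fitted score at Tissue_conc = 10
print(beta0_hat + beta1_hat * 10)
## [1] -0.9707904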
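The same fitted values can be obtained from the lm object with predict(), which avoids retyping the coefficient arithmetic. This is a minimal sketch: the data frame is rebuilt inline from the values printed earlier so that the block runs on its own, and new_conc is a name introduced here for illustration.

```r
# LSD data, copied from the printed data set so the block is self-contained
lsd <- data.frame(
  Tissue_conc = c(1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41),
  Score       = c(78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97)
)
lsd.lm <- lm(Score ~ Tissue_conc, data = lsd)

# Concentrations at which to evaluate the fitted line (illustrative name)
new_conc <- data.frame(Tissue_conc = c(2, 0, 10))
preds <- predict(lsd.lm, newdata = new_conc)
preds

# Observed range of the predictor, to see which values are extrapolations
range(lsd$Tissue_conc)
```

Concentrations 0 and 10 fall outside the observed range of 1.17 to 6.41, so those two values are extrapolations; the negative fitted score at 10 is a sign that the line should not be trusted that far out.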
Interpret the lm output and your plot. In particular, interpret the meaning of the \(p\)-value and confidence interval you found in part c.
# Your code here
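To support the interpretation, the slope's t statistic, p-value, and confidence interval can be reproduced from the estimate and its standard error. The sketch below assumes a 95% confidence level and rebuilds the data inline so that it runs on its own; confint() is used only as a cross-check, and the variable names are introduced here for illustration.

```r
lsd <- data.frame(
  Tissue_conc = c(1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41),
  Score       = c(78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97)
)
lsd.lm <- lm(Score ~ Tissue_conc, data = lsd)

coefs  <- coef(summary(lsd.lm))
est    <- coefs["Tissue_conc", "Estimate"]
se     <- coefs["Tissue_conc", "Std. Error"]
df_res <- df.residual(lsd.lm)  # n - 2 residual degrees of freedom

# Two-sided test of H0: beta1 = 0
t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

# Confidence interval (assumed 95%): estimate +/- t quantile * standard error
ci_manual <- est + c(-1, 1) * qt(0.975, df = df_res) * se

c(t = t_stat, p = p_value)
ci_manual
confint(lsd.lm, "Tissue_conc", level = 0.95)  # should agree with ci_manual
```

The interval lies entirely below zero, which is consistent with the small p-value: under the model, each additional unit of tissue concentration is associated with a lower expected score.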
Fit a simple linear regression of Accuracy (response) on Distance (predictor). Accuracy is the fairway accuracy over the players’ holes that year. Distance is their average driving distance that year. Do the following:
pga <- read.table("pga.txt", header = TRUE, sep = "\t")
pga.slr <- lm(Accuracy ~ Distance, data = pga)
summary(pga.slr)
##
## Call:
## lm(formula = Accuracy ~ Distance, data = pga)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.0314  -3.1952   0.3083   3.4181  15.5215
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.80154    3.31858   30.98   <2e-16 ***
## Distance     -0.13937    0.01227  -11.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.118 on 352 degrees of freedom
## Multiple R-squared:  0.2682, Adjusted R-squared:  0.2661
## F-statistic:   129 on 1 and 352 DF,  p-value: < 2.2e-16
\(SE_{\hat \beta_1}\)
plot(pga$Distance, pga$Accuracy,
     xlab = "Distance", ylab = "Accuracy",
     main = "2008 PGA statistics")
abline(pga.slr)
# Diagnostic plots: which = 1:2 selects Residuals vs Fitted and Normal Q-Q
plot(pga.slr, which = 1:2)
plot(pga$Distance, pga$Accuracy,
     xlab = "Distance", ylab = "Accuracy",
     main = "2008 PGA statistics",
     col = pga$Gender + 1)
# Gender is coded 1 = female, 2 = male, so the points use palette colors 2 and 3
legend("topright", legend = c("Female", "Male"), col = c(2, 3), pch = 1)
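Fit the regression separately for men and for women, and produce the diagnostic plots for each fit (set the which parameter of plot to 1:2). Do the residuals appear more normally distributed for men or for women? Answer on paper and explain how you arrived at your answer.
# Regressions for split data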
pga.slr.M <- lm(Accuracy ~ Distance, data = pga, subset = Gender == 2)
pga.slr.F <- lm(Accuracy ~ Distance, data = pga, subset = Gender == 1)
summary(pga.slr.M)
##
## Call:
## lm(formula = Accuracy ~ Distance, data = pga, subset = Gender ==
##     2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.1837  -2.7969   0.2126   2.8715  11.0186
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 174.92546   10.44771   16.74   <2e-16 ***
## Distance     -0.38789    0.03631  -10.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.349 on 195 degrees of freedom
## Multiple R-squared:  0.3692, Adjusted R-squared:  0.3659
## F-statistic: 114.1 on 1 and 195 DF,  p-value: < 2.2e-16
summary(pga.slr.F)
##
## Call:
## lm(formula = Accuracy ~ Distance, data = pga, subset = Gender ==
##     1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.6777  -2.6583   0.9829   3.6346  10.2339
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.89331   10.92765  11.978  < 2e-16 ***
## Distance     -0.25649    0.04424  -5.797 3.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.246 on 155 degrees of freedom
## Multiple R-squared:  0.1782, Adjusted R-squared:  0.1729
## F-statistic: 33.61 on 1 and 155 DF,  p-value: 3.662e-08
# Plots for split data
plot(pga.slr.M, which = 1:2)
plot(pga.slr.F, which = 1:2)
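The visual comparison of the two Normal Q-Q plots can be complemented with a number: the correlation between the ordered residuals and their theoretical normal quantiles, which is close to 1 when the Q-Q points fall on a straight line. The sketch below is not part of the assignment and does not use the PGA data; it runs on simulated residuals (res_normal, res_skewed, and the helper qq_cor are hypothetical names introduced here) purely to show the mechanics.

```r
set.seed(1)

# Simulated stand-ins for two sets of regression residuals (not PGA data)
res_normal <- rnorm(200)     # roughly normal residuals
res_skewed <- 1 - rexp(200)  # left-skewed residuals centered near 0

# Correlation between the two axes of a Normal Q-Q plot (hypothetical helper)
qq_cor <- function(res) {
  qq <- qqnorm(res, plot.it = FALSE)  # theoretical (x) and sample (y) quantiles
  cor(qq$x, qq$y)
}

c(normal = qq_cor(res_normal), skewed = qq_cor(res_skewed))
```

With the real fits, the same helper could be applied to resid(pga.slr.M) and resid(pga.slr.F); the group whose value is closer to 1 has the straighter Q-Q plot.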
# Scatterplot showing all lines
plot(pga$Distance, pga$Accuracy,
     xlab = "Distance", ylab = "Accuracy",
     main = "2008 PGA statistics")
abline(pga.slr)
abline(pga.slr.M, col = "green")
abline(pga.slr.F, col = "red")
legend("topright", legend = c("Female", "Male"), col = c("red", "green"), lty = 1)
Do the slopes for men and women differ? Support your answer with the lm summary output or qualitative support from your plots.
# Computation of p-value
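# Slope estimates and standard errors from the two fits
Beta.F <- summary(pga.slr.F)$coefficients[2, 1]
Beta.M <- summary(pga.slr.M)$coefficients[2, 1]
SE.F <- summary(pga.slr.F)$coefficients[2, 2]
SE.M <- summary(pga.slr.M)$coefficients[2, 2]
# Test statistic for the difference in slopes (named t_stat to avoid masking base::t)
t_stat <- (Beta.F - Beta.M) / sqrt(SE.F^2 + SE.M^2)
# Degrees of freedom: size of the smaller gender group
DF <- min(table(pga$Gender))
# Two-sided p-value
2 * pt(abs(t_stat), df = DF, lower.tail = FALSE)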
## [1] 0.02301723
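An alternative to the two-sample calculation above is to fit a single model with a Distance-by-gender interaction: the interaction coefficient equals the difference between the two group slopes, and its t test uses a pooled residual variance. This is a swapped-in technique rather than the approach used above, and the sketch runs on simulated data, since pga.txt is not reproduced in this document. The group means, slopes, and noise level are made-up values, and sim, Male, fit_int, fit_f, and fit_m are names introduced here.

```r
set.seed(1)

# Simulated data with the same group sizes as the split fits (values are made up)
n_f <- 157
n_m <- 197
sim <- data.frame(
  Gender   = rep(c(1, 2), times = c(n_f, n_m)),
  Distance = c(rnorm(n_f, mean = 247, sd = 9), rnorm(n_m, mean = 288, sd = 9))
)
sim$Accuracy <- ifelse(sim$Gender == 1,
                       131 - 0.26 * sim$Distance,
                       175 - 0.39 * sim$Distance) + rnorm(nrow(sim), sd = 5)
sim$Male <- as.numeric(sim$Gender == 2)  # 0 = female, 1 = male

# One model with separate intercepts and slopes for each group
fit_int <- lm(Accuracy ~ Distance * Male, data = sim)

# Separate fits, for comparison with the interaction coefficient
fit_f <- lm(Accuracy ~ Distance, data = sim, subset = Gender == 1)
fit_m <- lm(Accuracy ~ Distance, data = sim, subset = Gender == 2)
slope_diff <- coef(fit_m)[["Distance"]] - coef(fit_f)[["Distance"]]

# Estimate, standard error, t value, and p-value for the slope difference
coef(summary(fit_int))["Distance:Male", ]
slope_diff
```

On the real data, the analogous call would be lm(Accuracy ~ Distance * factor(Gender), data = pga); the interaction row of its summary gives a p-value for the slope difference that can be compared with the value computed above.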