Homework #5

Problem 2

Use the purity-hydrocarbon data from Homework #2 and R to answer the following questions. The data are hosted on my github and the link is provided on the WISE page.

The purity of oxygen produced by a fractionation process is thought to be related to the percentage of hydrocarbons in the main condenser of the processing unit. Data for twenty samples are contained in the dataset. The dataset contains two columns: Purity (%) and Hydrocarbon (%).

# Load Data (make sure the URL is on one line)

oxygen<-read.csv("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH376LMT/main/data/oxygenPurity.csv", header = TRUE)

purity<-oxygen$purity
hydro<-oxygen$hydro

First, fit a simple linear regression model to the data (this has already been done on previous homeworks). Then perform the following tasks:

a) Test the hypothesis H_0: β_1 = 0.

Show all work from scratch and confirm with the output table.

State the hypothesis, derive the test statistic, state the degrees of freedom, find the p-value, and state the conclusion.

The test statistic is 3.386119.

mod<-lm(purity~hydro, data = oxygen)

n<-dim(oxygen)[1]

beta_1<-mod$coefficients[2]


ss_res <- sum(mod$residuals^2)

ms_res<- ss_res/(n-2)

se_b1<-sqrt(ms_res/sum((hydro-mean(hydro))^2))


t_stat <- beta_1/se_b1

t_stat

##    hydro 
## 3.386119
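For reference, the code above follows the usual slope test in simple linear regression. The notation below (t_0, S_xx, MS_Res) is assumed standard textbook notation and does not come from the assignment itself:

```latex
t_0 = \frac{\hat{\beta}_1}{\operatorname{se}(\hat{\beta}_1)}, \qquad
\operatorname{se}(\hat{\beta}_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}}, \qquad
MS_{Res} = \frac{SS_{Res}}{n-2}, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

Under H_0 the statistic t_0 follows a t distribution with n - 2 degrees of freedom. With the rounded values from the output table, 11.801 / 3.485 is approximately 3.386.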

The p-value for a two-sided t-test is 0.003291122.

# obtain the p value for a two sided test.
pt(abs(t_stat), df = n-2, lower.tail = FALSE)*2

##       hydro 
## 0.003291122

There are 18 degrees of freedom.

df <- n-2
df

## [1] 18

Confirm with the output table

summary(mod)

## 
## Call:
## lm(formula = purity ~ hydro, data = oxygen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6724 -3.2113 -0.0626  2.5783  7.3037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   77.863      4.199  18.544 3.54e-13 ***
## hydro         11.801      3.485   3.386  0.00329 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.597 on 18 degrees of freedom
## Multiple R-squared:  0.3891, Adjusted R-squared:  0.3552 
## F-statistic: 11.47 on 1 and 18 DF,  p-value: 0.003291

Hypotheses: H_0: β_1 = 0 versus H_A: β_1 ≠ 0. With a test statistic of t = 3.386119 on 18 degrees of freedom, the two-sided p-value is 0.003291122, so we reject the null hypothesis at the 0.05 significance level. There is strong evidence of a linear relationship between the percentage of hydrocarbons and the purity of oxygen.
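The "confirm with the output table" step can also be done programmatically. The sketch below is only an illustration: it uses R's built-in cars data (not the oxygen data), and the names fit, tab, and check are made up for this example. It repeats the from-scratch steps and compares them with the coefficient table, including the fact that F equals t squared in simple linear regression.

```r
# Illustration on R's built-in `cars` data (NOT the oxygen purity data).
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)

# From-scratch slope test, mirroring the steps used for the oxygen model
b1     <- unname(coef(fit)[2])
ms_res <- sum(resid(fit)^2) / (n - 2)
s_xx   <- sum((cars$speed - mean(cars$speed))^2)
se_b1  <- sqrt(ms_res / s_xx)
t_stat <- b1 / se_b1
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# The same quantities pulled from the summary output
tab    <- coef(summary(fit))
f_stat <- unname(summary(fit)$fstatistic["value"])

check <- c(
  t_matches = isTRUE(all.equal(t_stat, tab["speed", "t value"])),
  p_matches = isTRUE(all.equal(p_val, tab["speed", "Pr(>|t|)"])),
  f_is_t_sq = isTRUE(all.equal(t_stat^2, f_stat))
)
print(check)
```

The same pattern should carry over to the oxygen model by swapping in mod and the hydro row of the table.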

Problem 3

Use the NFL data and R to answer the following questions. The data are hosted on my github and the link is provided on the WISE page.

This dataset is for the performance of the 28 National Football League teams in 1976. It is suspected that the number of yards gained rushing by opponents (x8) has an effect on the number of games won by a team (y).

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Load Data (make sure the URL is on one line)
nfl<-read.csv("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH376LMT/main/data/nlf1976.csv", header = TRUE)

a) Fit a simple linear regression model relating games won “y” to yards gained rushing by opponents “x8”.

nfl_mod<-lm(y~x8, data = nfl)
summary(nfl_mod)

## 
## Call:
## lm(formula = y ~ x8, data = nfl)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.804 -1.591 -0.647  2.032  4.580 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.788251   2.696233   8.081 1.46e-08 ***
## x8          -0.007025   0.001260  -5.577 7.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.393 on 26 degrees of freedom
## Multiple R-squared:  0.5447, Adjusted R-squared:  0.5272 
## F-statistic:  31.1 on 1 and 26 DF,  p-value: 7.381e-06

b) Find a 95% confidence interval on the slope.

State the point estimate, degrees of freedom, critical value, standard error, and margin of error. State your findings in the context of the problem.

n<-dim(nfl)[1]
beta_1<-nfl_mod$coefficients[2]
nfl_mod$coefficients

## (Intercept)          x8 
##  21.7882509  -0.0070251

ss_res <- sum(nfl_mod$residuals^2)
ms_res<- ss_res/(n-2)
se_b1 <- sqrt(ms_res/sum((nfl$x8-mean(nfl$x8))^2)) # standard error
se_b1

## [1] 0.00125965

# Confidence Interval for Slope

crt_value<-qt(.975, df=n-2)
crt_value

## [1] 2.055529

# critical value
conf<-beta_1+c(-1,1)*crt_value*se_b1 # confidence interval
conf

## [1] -0.009614347 -0.004435854

beta_1 # point estimate

##         x8 
## -0.0070251

crt_value*se_b1 # margin of error

## [1] 0.002589247

(n-2) # degrees of freedom

## [1] 26

confint(nfl_mod)

##                    2.5 %       97.5 %
## (Intercept) 16.246064040 27.330437725
## x8          -0.009614347 -0.004435854

With 26 degrees of freedom, we are 95% confident that the true slope lies between -0.009614347 and -0.004435854. In context, each additional yard gained rushing by opponents is associated with a decrease of roughly 0.0044 to 0.0096 in the expected number of games won. The point estimate of the slope is -0.0070251, the critical value is 2.055529, the standard error is 0.00125965 (0.001260 in the summary), and the margin of error is 0.002589247.
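Written out, the interval computed above has the usual form for a slope. The symbols are assumed textbook notation; the numbers are the ones printed in the output:

```latex
\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, \operatorname{se}(\hat{\beta}_1)
= -0.0070251 \pm (2.055529)(0.00125965)
= -0.0070251 \pm 0.002589247
```

This gives the endpoints -0.009614347 and -0.004435854 reported by confint(nfl_mod).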

c) Suppose we would like to use this model to predict the number of games a team will win if we can limit the opponents’ yards rushing to 1800 yards.

- Find a point estimate of the number of games won when x8 = 1800

beta_0<-nfl_mod$coefficients[1]
X8<-1800
beta_0+(beta_1*X8)

## (Intercept) 
##     9.14307

A team that limits its opponents’ rushing to 1800 yards is estimated to win about 9 games (fitted value 9.14307).

- Find a 90% prediction interval on the number of games won.

– State the fitted value, degrees of freedom, critical value, standard error for the estimated value, and the margin of error. State your findings in context of the problem.

n<-dim(nfl)[1]


ft_value<-beta_0+(beta_1*X8) # fitted value
ft_value

## (Intercept) 
##     9.14307

crt_value<-qt(.95, df = n-2) # critical value
crt_value

## [1] 1.705618

x_bar<-mean(nfl$x8)

std_err<-sqrt(ms_res*(1+(1/n)+((X8-x_bar)^2/sum((nfl$x8-x_bar)^2)))) # standard error
std_err

## [1] 2.466366

marg<-crt_value*std_err # margin of error
marg

## [1] 4.206679

conf<-ft_value+(c(-1,1)*marg)
conf

## [1]  4.936392 13.349749

If a team limits its opponents to 1800 rushing yards, we predict with 90% confidence (26 degrees of freedom) that the number of games it wins will be between 4.936392 and 13.349749. Since games won must be a whole number, that is roughly 5 to 13 games. The fitted value is 9.14307, the critical value is 1.705618, the standard error for the predicted value is 2.466366, and the margin of error is 4.206679.
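A hand-computed prediction interval like the one above can be cross-checked against predict(). The following is a minimal sketch on R's built-in cars data rather than the NFL data; x0 = 15 is an arbitrary illustrative value, and the names by_hand and built_in are made up for the example.

```r
# Illustration on R's built-in `cars` data (NOT the NFL data).
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
x0  <- 15   # hypothetical new predictor value, chosen only for illustration

ms_res <- sum(resid(fit)^2) / (n - 2)
x_bar  <- mean(cars$speed)
s_xx   <- sum((cars$speed - x_bar)^2)

# Hand computation: fitted value, standard error for a NEW observation,
# and the 90% critical value
fit0   <- unname(coef(fit)[1] + coef(fit)[2] * x0)
se_new <- sqrt(ms_res * (1 + 1 / n + (x0 - x_bar)^2 / s_xx))
t_crit <- qt(0.95, df = n - 2)
by_hand <- c(fit = fit0,
             lwr = fit0 - t_crit * se_new,
             upr = fit0 + t_crit * se_new)

# Built-in equivalent: a 90% prediction interval at x0
built_in <- predict(fit, newdata = data.frame(speed = x0),
                    interval = "prediction", level = 0.90)

print(rbind(by_hand, built_in = built_in[1, ]))
```

For the NFL model, the analogous call would presumably be predict(nfl_mod, newdata = data.frame(x8 = 1800), interval = "prediction", level = 0.90).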

- Find a 90% confidence interval for the mean number of games won.

– State the fitted value, degrees of freedom, critical value, standard error for the estimated value, and margin of error. State your findings in the context of the problem.

n<-dim(nfl)[1]


ft_value<-beta_0+(beta_1*X8) # fitted value
ft_value

## (Intercept) 
##     9.14307

crt_value<-qt(.95, df = n-2) # critical value
crt_value

## [1] 1.705618

x_bar<-mean(nfl$x8)

std_err<-sqrt(ms_res*((1/n)+((X8-x_bar)^2/sum((nfl$x8-x_bar)^2)))) # standard error
std_err

## [1] 0.597594

marg<-crt_value*std_err # margin of error
marg

## [1] 1.019267

conf<-ft_value+(c(-1,1)*marg)
conf

## [1]  8.123803 10.162337

Among teams that limit their opponents to 1800 rushing yards, we are 90% confident (26 degrees of freedom) that the mean number of games won is between 8.123803 and 10.162337 (about 8 to 10 games). The fitted value is 9.14307, the critical value is 1.705618, the standard error for the estimated mean is 0.597594, and the margin of error is 1.019267.

– Compare the last two parts. Explain the difference.

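The prediction interval is wider because it covers a single future observation: on top of the uncertainty in the estimated regression line, it includes the variability of an individual team's games won around that line (the extra "1 +" term in the standard error). The confidence interval covers only the mean number of games won at x8 = 1800, so it reflects just the uncertainty in the estimated line and is narrower (margin of error 1.019267 versus 4.206679).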
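The two standard errors used in the last two parts differ only by one term. In assumed textbook notation, with x_0 = 1800:

```latex
\operatorname{se}_{\text{pred}}(x_0) = \sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}, \qquad
\operatorname{se}_{\text{mean}}(x_0) = \sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}
```

so that

```latex
\operatorname{se}_{\text{pred}}^2(x_0) = MS_{Res} + \operatorname{se}_{\text{mean}}^2(x_0)
```

As a rough check with the printed values, 2.466366^2 - 0.597594^2 is approximately 5.726, which agrees with the squared residual standard error, 2.393^2, up to rounding.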

d) Make a plot that shows the confidence and prediction bands for this model.

confBand<-predict(nfl_mod, interval="confidence")

predBand<-predict(nfl_mod, interval="predict")

## Warning in predict.lm(nfl_mod, interval = "predict"): predictions on current data refer to _future_ responses

colnames(predBand)<-c("fit2", "lwr2", "upr2")


newDF<-cbind(nfl, confBand, predBand)

ggplot(newDF, aes(x=x8, y=y))+
  geom_point()+
  geom_abline(slope=nfl_mod$coefficients[2], intercept=nfl_mod$coefficients[1],
              color="blue", lty=2, lwd=1)+
  geom_line(aes(y=lwr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=upr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=lwr2), color="red", lty=2, lwd=1)+
  geom_line(aes(y=upr2), color="red", lty=2, lwd=1)+
  theme_bw()
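One possible variation is to evaluate both bands on an evenly spaced grid of x values through newdata. Because newdata is supplied, this also avoids the "predictions on current data" warning. The sketch below uses R's built-in cars data instead of the NFL data, and the grid size and column names are arbitrary choices for illustration.

```r
# Illustration on R's built-in `cars` data (NOT the NFL data).
fit  <- lm(dist ~ speed, data = cars)

# Evenly spaced predictor values across the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 100))

conf_band <- predict(fit, newdata = grid, interval = "confidence")
pred_band <- predict(fit, newdata = grid, interval = "prediction")

# One tidy data frame holding the fitted line and both bands
bands <- data.frame(
  speed    = grid$speed,
  fit      = conf_band[, "fit"],
  conf_lwr = conf_band[, "lwr"], conf_upr = conf_band[, "upr"],
  pred_lwr = pred_band[, "lwr"], pred_upr = pred_band[, "upr"]
)

# The prediction band should be wider than the confidence band everywhere,
# and the confidence band should be narrowest near the mean of the predictor
conf_width <- bands$conf_upr - bands$conf_lwr
pred_width <- bands$pred_upr - bands$pred_lwr
narrowest  <- bands$speed[which.min(conf_width)]
print(all(pred_width > conf_width))
print(c(narrowest = narrowest, x_bar = mean(cars$speed)))
```

A data frame like bands can then be passed to geom_line() or geom_ribbon() in the same way as newDF above.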