126 HW3

#plot with fitted regression line
library(alr4)

## Loading required package: car

## Loading required package: carData

## Loading required package: effects

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

attach(ftcollinstemp)
plot(winter~fall, xlab= 'Fall Temp', ylab= 'Winter Temp')
temp.lm = lm(winter~fall)
abline(temp.lm, col='blue')

#testing null hypothesis
summary(temp.lm)

## 
## Call:
## lm(formula = winter ~ fall)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8186 -1.7837 -0.0873  2.1300  7.5896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  13.7843     7.5549   1.825   0.0708 .
## fall          0.3132     0.1528   2.049   0.0428 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.179 on 109 degrees of freedom
## Multiple R-squared:  0.0371, Adjusted R-squared:  0.02826 
## F-statistic:   4.2 on 1 and 109 DF,  p-value: 0.04284

Since the p-value 0.0428 is greater than α = 0.01, we fail to reject the null hypothesis that β^1=0, thus there is not significant evidence to conclude that there is a linear association between average winter temperatures and average fall temperatures.

#percentage of variability
summary(temp.lm)

## 
## Call:
## lm(formula = winter ~ fall)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8186 -1.7837 -0.0873  2.1300  7.5896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  13.7843     7.5549   1.825   0.0708 .
## fall          0.3132     0.1528   2.049   0.0428 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.179 on 109 degrees of freedom
## Multiple R-squared:  0.0371, Adjusted R-squared:  0.02826 
## F-statistic:   4.2 on 1 and 109 DF,  p-value: 0.04284

This value is the coefficient of determination, which is R^2 = 0.0372 from the above summary. 3.7% of the variability in average winter temperatures is accounted for by a linear relationship with average fall temperatures.

detach(ftcollinstemp)

#ANOVA
library(faraway)

## 
## Attaching package: 'faraway'

## The following objects are masked from 'package:alr4':
## 
##     cathedral, pipeline, twins

## The following objects are masked from 'package:car':
## 
##     logit, vif

attach(prostate) 
prostate.lm <-lm(lpsa~lcavol)
prostate.aov <- anova(prostate.lm)
prostate.aov

## Analysis of Variance Table
## 
## Response: lpsa
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## lcavol     1 69.003  69.003  111.27 < 2.2e-16 ***
## Residuals 95 58.915   0.620                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#hypothesis test
summary(prostate.lm, alpha=0.01)

## 
## Call:
## lm(formula = lpsa ~ lcavol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67625 -0.41648  0.09859  0.50709  1.89673 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.50730    0.12194   12.36   <2e-16 ***
## lcavol       0.71932    0.06819   10.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7875 on 95 degrees of freedom
## Multiple R-squared:  0.5394, Adjusted R-squared:  0.5346 
## F-statistic: 111.3 on 1 and 95 DF,  p-value: < 2.2e-16

Because the p-value is very small and less than α = 0.01, we reject the null hypothesis that the slope, β^1 is 0 conclude that there is a linear association between lpsa and lcavol.

#coefficient of determination
anova(prostate.lm)

## Analysis of Variance Table
## 
## Response: lpsa
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## lcavol     1 69.003  69.003  111.27 < 2.2e-16 ***
## Residuals 95 58.915   0.620                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#R2=SSModel/SSTotal
R2= prostate.aov[[2]][[1]]/(prostate.aov[[2]][[1]]+prostate.aov[[2]][[2]])
R2

## [1] 0.5394319

The coefficient of determination, R^2, is 0.5394. Hence, 54% of the variability in lpsa is explained by the model, a linear relationship with lcavol.

##In the ANOVA table from part (a) or using part (b), which quantity represents the variability in lpsa which is left unexplained by the regression?
prostate.aov <- anova(prostate.lm)
prostate.aov

## Analysis of Variance Table
## 
## Response: lpsa
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## lcavol     1 69.003  69.003  111.27 < 2.2e-16 ***
## Residuals 95 58.915   0.620                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The quantity that represents the variability in lcavol which is left unexplained by regression is the sum of squared residuals, which is 58.915 in this case.

detach(prostate)

##Use the lm function in R to fit the regression of the response on the predictor. Draw a scatterplot of the data and add your fitted regression line.
shortleaf <- read.table("shortleaf.txt", header=TRUE)
fit <- lm(shortleaf$Vol~shortleaf$Diam)
shortleaf.lm <- lm(Vol~Diam, data=shortleaf)
shortleaf.lm

## 
## Call:
## lm(formula = Vol ~ Diam, data = shortleaf)
## 
## Coefficients:
## (Intercept)         Diam  
##     -41.568        6.837

plot(data=shortleaf, Vol~Diam, xlab= 'Diameter of Shortleaf Pines', ylab= 'Volume of Shortleaf Pines')
abline(shortleaf.lm, col="blue")

#residuals vs fits
path <- file.path("~","Desktop","CLASSES","PSTAT126","shortleaf.txt")
dat = read.table(path, header = T)
head(dat)

##   Diam Vol
## 1  4.4 2.0
## 2  4.6 2.2
## 3  5.0 3.0
## 4  5.1 4.3
## 5  5.1 3.0
## 6  5.2 2.9

names(dat)

## [1] "Diam" "Vol"

#x: the predictor
#y: the response
x = dat$Diam
y = dat$Vol

fit = lm(y ~ x)
yhat = fitted(fit)
e = y - yhat

plot(x, y, xlab = 'Diameter', ylab = 'Volume', main = 'Tree Volume vs Diameter', ylim = c(min(yhat), max(y)), cex.lab = 1.3, cex.main = 1.5)
lines(x, fitted(fit), col = 2)

plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residuals', main = 'Residuals vs Fits')
abline(h = 0, lty = 2)

#QQ Plot
qqnorm(e)
qqline(e, col="blue")

Not linear, not normal(outlier), unequal variances

#transformation of x and y
lnx = log(x)
lny = log(y)
fit = lm(lny ~ lnx)
yhat = fitted(fit)
e = lny - yhat
summary(lm(lny ~ lnx))

## 
## Call:
## lm(formula = lny ~ lnx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3323 -0.1131  0.0267  0.1177  0.4280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.8718     0.1216  -23.63   <2e-16 ***
## lnx           2.5644     0.0512   50.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared:  0.9736, Adjusted R-squared:  0.9732 
## F-statistic:  2509 on 1 and 68 DF,  p-value: < 2.2e-16

anova(lm(lny ~ lnx))

## Analysis of Variance Table
## 
## Response: lny
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## lnx        1 72.734  72.734    2509 < 2.2e-16 ***
## Residuals 68  1.971   0.029                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(lnx, lny, xlab = 'ln(Diameter)', ylab = 'ln(Volume)', main = 'ln(Volume) vs ln(Diameter)', cex.lab = 1.3, cex.main = 1.5)
lines(lnx, fitted(fit), col = 2)

plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residuals', main = 'Residuals vs Fits (Reponse is ln(Volume) and Predictor is ln(Diameter))')
abline(h = 0, lty = 2)

Because the pattern is not linear, not normal(outliers), and variances are unequal, we must transform both x and y with log.

#nature of association
plot(lnx, lny, xlab = 'ln(Diameter)', ylab = 'ln(Volume)', main = 'ln(Volume) vs ln(Diameter)', cex.lab = 1.3, cex.main = 1.5)
lines(lnx, fitted(fit), col = 2)

The natural logarithm of tree volume is positively linearly related to the natural logarithm of tree diameter. As the natural log of tree diameters increases, the average natural logarithm of the tree volume also increases.

##Is there an association between diameter and volume of shortleaf pines?
summary(lm(lny ~ lnx))

## 
## Call:
## lm(formula = lny ~ lnx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3323 -0.1131  0.0267  0.1177  0.4280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.8718     0.1216  -23.63   <2e-16 ***
## lnx           2.5644     0.0512   50.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared:  0.9736, Adjusted R-squared:  0.9732 
## F-statistic:  2509 on 1 and 68 DF,  p-value: < 2.2e-16

Since the p-value is very small, we therefore reject the null hypothesis that β^1 is 0 and conclude that there is a linear association between the natural log of diameter and natural log of volume of shortleaf pines.

#average volume of all shortleaf pine 20" in diameter
fit$call

## lm(formula = lny ~ lnx)

predict(lm(lny ~ lnx), data.frame(lnx = log(20)), 
        interval = 'prediction', level = .95)

##        fit      lwr      upr
## 1 4.810514 4.461859 5.159169

We predict the volume of all shortleaf pine trees that are 20 inches in median diameter to be exp(4.81)= 122.7316 cubic feet and we are 95% confident that the median volume of all shortleaf pine trees will be exp(4.46)=86.48751 and exp(5.16)=174.1645 cubic feet.

#expected change in volume for a five-fold increase in diameter?
confint(lm(lny ~ lnx), level=0.95)[2,]

##    2.5 %   97.5 % 
## 2.462255 2.666577

For a five-fold increase in diameter the estimated median volume changes by a factor of 5^(2.67)= 73.49367. We are 95% confident that the median volume of all shortleaf pine trees will be 5^(2.46)=52.41628 and 5^(2.67)=73.49367 for each five fold increase in diameter.

126 HW3

Becky Su

2/1/2019