126_HW6

library(alr4)

## Loading required package: car

## Loading required package: carData

## Loading required package: effects

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

attach(UN11)
pairs(~fertility+log(ppgdp)+pctUrban, data=UN11)

There seems to be a correlation between log(ppgdp) and pctUrban, given the linear trends seen in the plots. However, there is little to no correlation between the variables fertility and log(ppgdp) and fertility and pctUrban; there is no consistent linear trend.

m1 <- lm(fertility~log(ppgdp), UN11)
m2 <- lm(fertility~pctUrban, UN11)
summary(m1)

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = UN11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16313 -0.64507 -0.06586  0.62479  3.00517 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.00967    0.36529   21.93   <2e-16 ***
## log(ppgdp)  -0.62009    0.04245  -14.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9305 on 197 degrees of freedom
## Multiple R-squared:   0.52,  Adjusted R-squared:  0.5175 
## F-statistic: 213.4 on 1 and 197 DF,  p-value: < 2.2e-16

summary(m2)

## 
## Call:
## lm(formula = fertility ~ pctUrban, data = UN11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4932 -0.7795 -0.1475  0.6517  2.9029 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.559823   0.213681  21.339   <2e-16 ***
## pctUrban    -0.031045   0.003421  -9.076   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.128 on 197 degrees of freedom
## Multiple R-squared:  0.2948, Adjusted R-squared:  0.2913 
## F-statistic: 82.37 on 1 and 197 DF,  p-value: < 2.2e-16

The p-value for fertility~ log(ppgdp) and fertility ~pctUrban are both less than 2e-16 which are very small, so we reject the null hypothesis and conclude that both the slope coefficients are significantly different from 0.

par(mfrow=c(1,2))
m3 <- lm(pctUrban~log(ppgdp), UN11)
m4 <- lm(log(ppgdp)~pctUrban, UN11)
plot(residuals(m1)~residuals(m3), xlab="Residuals from pctUrban~log(ppgdp)", ylab="Residualsfrom fertility~log(ppgdp)", main="Added variable plot for pctUrban after log(ppgdp)")
m5<- lm(residuals(m1)~residuals(m3), UN11)
abline(m5)
plot(residuals(m2)~residuals(m4), xlab="Residuals from log(ppgdp)~pctUrban)", ylab="Residualsfrom fertility~pctUrban", main="Added variable plot for log(ppgdp) after pctUrban")
m6<- lm(residuals(m2)~residuals(m4), UN11)
abline(m6)

The slope is very small for for added variable plot for pctUrban after log(ppgdp), so pctUrban is not very useful after adjusting for log(ppgdp). However, the slope seems very great for added variable plot for log(ppgdp) after pctUrban. It has negative influence to fertility.

library(faraway)

## 
## Attaching package: 'faraway'

## The following objects are masked from 'package:alr4':
## 
##     cathedral, pipeline, twins

## The following objects are masked from 'package:car':
## 
##     logit, vif

attach(teengamb)

model = lm(gamble ~ . , data = teengamb)
newdata = data.frame(sex=1,status=mean(teengamb$status),income=mean(teengamb$income),verbal=mean(teengamb$verbal))
predict(model, newdata, interval="predict")

##        fit       lwr      upr
## 1 6.124186 -41.19262 53.44099

predict(model, newdata, interval="confidence")

##        fit       lwr     upr
## 1 6.124186 -5.795024 18.0434

The predicted expenditure is 6.124186. The 95% confidence interval is (-5.795024,18.0434). We are 95% confident that the true mean expenditure that women would gamble is between -5.795024 and 18.0434.

newdata1 = data.frame(sex=1,status=max(teengamb$status),income=max(teengamb$income),verbal=max(teengamb$verbal))

predict(model, newdata1, interval="confidence")

##        fit      lwr      upr
## 1 49.18961 12.87959 85.49963

The 95% confidence interval for true mean expenditure for women with maximal values of status, income, and verbal is between 12.87959 and 85.49963. This confidence interval is wider than that of averages because it takes the maximal values.

model1 = lm(sqrt(gamble)~.,data = teengamb)
predict(model1, newdata, interval="prediction")

##        fit       lwr      upr
## 1 2.005021 -2.340873 6.350915

The predicted expenditure is 2.005021. We are 95% confident that the expenditure gambled for an individual woman is between -2.340873 and 6.350915.

library(alr4)
attach(water)

pairs(~OPBPC + OPRC + OPSLAKE, data=water)

There seems to be a correlation between OPBPC and OPSLAKE, less of a correlation between OPBPC and OPRC, and OPRC and OPSLAKE, given the linear trend in the plots.

#avplot
full.lm <- lm(BSAAM~OPBPC + OPRC + OPSLAKE, data=water)
avPlots(full.lm, id=FALSE)

OPBPC contains no additional useful information in predicting y. However, both OPRC and OPSLAKE, do contain useful information in predicting y, because the lines in both plots have positive slopes.

summary(full.lm)

## 
## Call:
## lm(formula = BSAAM ~ OPBPC + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15964.1  -6491.8   -404.4   4741.9  19921.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22991.85    3545.32   6.485  1.1e-07 ***
## OPBPC          40.61     502.40   0.081  0.93599    
## OPRC         1867.46     647.04   2.886  0.00633 ** 
## OPSLAKE      2353.96     771.71   3.050  0.00410 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8304 on 39 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8941 
## F-statistic: 119.2 on 3 and 39 DF,  p-value: < 2.2e-16

The column t value shows you the t-test associated with testing the significance of the parameter listed in the first column. Pr(>|t|) gives you the p-value for that t-test.

red.lm <- lm(BSAAM~1)
anova(red.lm, full.lm)

## Analysis of Variance Table
## 
## Model 1: BSAAM ~ 1
## Model 2: BSAAM ~ OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
## 1     42 2.7351e+10                                  
## 2     39 2.6895e+09  3 2.4662e+10 119.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: β1 = β2 = β3 = 0 H1: β1 = β2 = β3 ≠ 0 Since the p-value is less than 0.05, we reject H0 : β1 = β2 = β3= 0 and conclude that at least one slope parameter is not 0 or at least one predictor is useful.

3e No because it doesn’t test the overall effects of all parameters, only one.

redBS.lm <- lm(BSAAM ~ OPBPC + OPSLAKE)
anova(redBS.lm, full.lm)

## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPBPC + OPSLAKE
## Model 2: BSAAM ~ OPBPC + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)   
## 1     40 3263954031                                
## 2     39 2689509185  1 574444846 8.3299 0.006326 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0 : β1 = β3 = 0 H1: β1 = β3 ≠ 0 Since p-value is less that 0.001, we reject H0 : β1 = β3 = 0 and conclude that at least once of the predictors has a significant linear relationship with BSAAM given the other predictor in the model.

library(alr4)
attach(UBSprices)
prices.lm <- lm(bigmac2009 ~ bread2009 + rice2009)
summary(prices.lm)

## 
## Call:
## lm(formula = bigmac2009 ~ bread2009 + rice2009)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.050 -10.102   0.939   6.448  88.150 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.4384     5.4883  -0.991  0.32641    
## bread2009     1.2340     0.1822   6.772 1.25e-08 ***
## rice2009      0.5544     0.1958   2.831  0.00662 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.53 on 51 degrees of freedom
## Multiple R-squared:  0.6081, Adjusted R-squared:  0.5928 
## F-statistic: 39.57 on 2 and 51 DF,  p-value: 4.218e-11

#linearity
prices.lm <- lm(bigmac2009 ~ bread2009 + rice2009)
pr.resid <- prices.lm$residuals
pr.fitted <- prices.lm$fitted.values
plot(pr.resid~pr.fitted, xlab="Fitted Values", ylab="Residuals", main="Residuals vs Fits Plot")

plot(prices.lm, which=1)

To check for linearity, you want to use a residuals vs fits plot. In this case, the line at 0 has a distinct shape that is not horizontal, hence this model does not meet the assumption of linearity.

#equal variances
prices.lm <- lm(bigmac2009 ~ bread2009 + rice2009)
pr.resid <- prices.lm$residuals
pr.fitted <- prices.lm$fitted.values
plot(pr.resid~pr.fitted, xlab="Fitted Values", ylab="Residuals", main="Residuals vs Fits Plot")
abline(h=0, lty=2)

To check equal/constance variance, we also use residuals vs fitted plot. The distribution of the residuals is well concentrated around 0 for small fitted values, but they get more and more spread out as the fitted values increase, indicating an increased variance. This model does not produce constant variances.

#normality
yhat <- fitted(prices.lm)
e <- bigmac2009~yhat
plot(prices.lm, which=2)

To check for normality, we use the normal QQ plot. The residuals deviate from the diagonal line in both the upper and lower tail. The tails are observed to have larger values than the diagonal line. Thus, we cannot assume normality.

shapiro.test(prices.lm$residual)

## 
##  Shapiro-Wilk normality test
## 
## data:  prices.lm$residual
## W = 0.85297, p-value = 9.652e-06

Since the p-value 9.652e-06 is very small and less than α = 0.05, we reject the null hypothesis that the variable is normally distributed. This finding supports my conclusion in part c, that the residuals of this linear fit is not normally distributed.

detach(UBSprices)

126_HW6

Becky Su

2/24/2019