STOR 455 Final Exam B R Notebook

library(readr)

Turtles455001 = read_csv("http://mclean.web.unc.edu/files/2020/05/Turtles455001.csv")
Cheating455001=read_csv("http://mclean.web.unc.edu/files/2020/05/Cheating455001.csv")

##1

mod1 = lm(Annuli ~ StraightlineCL, data = Turtles455001)
plot(Annuli ~ StraightlineCL, data = Turtles455001)
abline(mod1)

plot(mod1)

hist(mod1$residuals)

The plots above show the residuals versus fitted plot, the normal Q-Q plot, and a histogram of the residuals. The residuals versus fitted plot shows that the data do not meet the conditions for a simple linear model. The residuals do not form a shapeless band around zero: there is a fan shape, there are a few obvious outliers, and the points are clustered together rather than symmetrically distributed. The fan shape means the variance of Y is not the same at each X, so the constant variance condition (homoscedasticity) is not satisfied, and the uneven scatter around zero means the model does not satisfy the second condition of zero mean. In the normal Q-Q plot, most of the points follow the line fairly well, but there are overt outliers at the rightmost end of the plot and the leftmost tail deviates a little from the line, so the residuals are not completely normally distributed. The histogram of the residuals is skewed to the right, which I assume is due to the outliers, so the distribution of the errors is not symmetric about zero. The model therefore does not satisfy the fifth condition of normality.

##2

mod2 = lm(Annuli ~ StraightlineCL + Mass + I(StraightlineCL^2) + I(Mass^2) + I(StraightlineCL*Mass), data = Turtles455001)
plot(Annuli ~ StraightlineCL + Mass, main = "Second Order Model", data = Turtles455001)

anova(mod2)

H₀: β₁ = β₂ = β₃ = β₄ = β₅ = 0 ;

Hₐ: at least one βᵢ ≠ 0

The test shows an F-value of 156.0400 and a p-value of approximately 0. The variation explained by the model terms is large relative to the residual variation, so we can reject the null hypothesis and conclude that there is a relationship between Annuli and the predictors StraightlineCL and Mass.

##3

mod3 = lm(sqrt(Annuli) ~ StraightlineCL, data = Turtles455001)
summary(mod3)

## 
## Call:
## lm(formula = sqrt(Annuli) ~ StraightlineCL, data = Turtles455001)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60719 -0.38651 -0.03166  0.36173  1.92971 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.58653    0.23429   2.503   0.0129 *  
## StraightlineCL  0.03016    0.00202  14.934   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5871 on 278 degrees of freedom
## Multiple R-squared:  0.4451, Adjusted R-squared:  0.4431 
## F-statistic:   223 on 1 and 278 DF,  p-value: < 2.2e-16

plot(Annuli ~ StraightlineCL, data = Turtles455001)
curve(0.34402+0.03538*x+0.0009*x^2, add=TRUE)

I tried several different transformations (including response with sqrt(predictor), sqrt(response) with sqrt(predictor), and sqrt(response) with predictor^3) and decided on sqrt(response) with the untransformed predictor because it showed the greatest improvement of those I tried. I took the equation for the curve from the summary output, sqrt(Annuli) = 0.58653 + 0.03016(StraightlineCL), and squared both sides to return to the original scale: Annuli = 0.34402 + 0.03538(StraightlineCL) + 0.0009(StraightlineCL)^2.
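The coefficients passed to curve() can be checked by squaring the fitted line from summary(mod3). The sketch below does only that arithmetic, using the two printed estimates; `predict_annuli` is a hypothetical helper name, not something from the original notebook.

```r
# Estimates copied from the summary(mod3) output above
b0 <- 0.58653   # intercept on the sqrt(Annuli) scale
b1 <- 0.03016   # slope for StraightlineCL on the sqrt(Annuli) scale

# (b0 + b1 * x)^2 = b0^2 + 2 * b0 * b1 * x + b1^2 * x^2
curve_coefs <- c(constant = b0^2, linear = 2 * b0 * b1, quadratic = b1^2)
print(round(curve_coefs, 5))

# Hypothetical helper: predicted Annuli on the original scale
predict_annuli <- function(x) (b0 + b1 * x)^2
```

Passing `predict_annuli` to curve(..., add = TRUE) would draw the same curve without rounding the coefficients by hand.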

plot(mod3)

hist(mod3$residuals)

The plots above show the residuals versus fitted plot, the normal Q-Q plot, and a histogram of the residuals for the transformed model. The residuals versus fitted plot is an improvement over the one in question 1. There is still a fan shape and the points are still clustered together, so the constant variance condition (homoscedasticity) is not fully met, but the residuals are spread more evenly around zero. Compared to the Q-Q plot in question 1, the rightmost tail and the leftmost tail fit the line better. There are still a few outliers that disrupt the linearity a bit, but overall the points fall on the line. Lastly, the histogram appears to be centered at 0 and roughly normal, whereas the histogram in question 1 was skewed to the right. The transformed model therefore satisfies the fifth condition of normality reasonably well.

##4

Full=lm(Annuli~., data=Turtles455001)
MSE=(summary(Full)$sigma)^2
none=lm(Annuli~1,data=Turtles455001)
step(none,scope=list(upper=Full),scale=MSE)

## Start:  AIC=236.39
## Annuli ~ 1
## 
##                       Df Sum of Sq    RSS      Cp
## + StraightlineCL       1   3146.72 5961.2  60.671
## + MaxCW                1   3126.58 5981.4  61.808
## + ShellHeightatHinge   1   3094.36 6013.6  63.628
## + Mass                 1   3000.91 6107.0  68.906
## + PL_HingetoPosterior  1   2886.99 6220.9  75.339
## + LifeStage            2   2907.62 6200.3  76.174
## + PL_AnteriortoHinge   1   2720.25 6387.7  84.756
## + Sex                  2    977.97 8130.0 185.155
## + Recapture            1    715.32 8392.6 197.988
## + Temp                 1    363.79 8744.2 217.842
## <none>                             9107.9 236.387
## + CaptureMethod        1      2.32 9105.6 238.257
## + Habitat              8    197.50 8910.4 241.233
## 
## Step:  AIC=60.67
## Annuli ~ StraightlineCL
## 
##                       Df Sum of Sq    RSS      Cp
## + LifeStage            2    513.78 5447.4  35.654
## + Temp                 1    411.08 5550.1  39.454
## + ShellHeightatHinge   1    159.01 5802.2  53.690
## + MaxCW                1    145.39 5815.8  54.459
## + Mass                 1    141.47 5819.8  54.681
## + Recapture            1     94.31 5866.9  57.344
## + Sex                  2    105.46 5855.8  58.715
## <none>                             5961.2  60.671
## + PL_HingetoPosterior  1     22.59 5938.6  61.395
## + PL_AnteriortoHinge   1      4.64 5956.6  62.409
## + CaptureMethod        1      0.06 5961.2  62.667
## + Habitat              8    126.93 5834.3  69.502
## - StraightlineCL       1   3146.72 9107.9 236.387
## 
## Step:  AIC=35.65
## Annuli ~ StraightlineCL + LifeStage
## 
##                       Df Sum of Sq    RSS     Cp
## + Temp                 1    352.20 5095.2 17.762
## + Mass                 1    103.39 5344.1 31.815
## + ShellHeightatHinge   1     96.35 5351.1 32.212
## + Recapture            1     86.17 5361.3 32.787
## + MaxCW                1     71.59 5375.8 33.610
## <none>                             5447.4 35.654
## + Sex                  2     62.05 5385.4 36.149
## + PL_HingetoPosterior  1      8.95 5438.5 37.149
## + PL_AnteriortoHinge   1      1.82 5445.6 37.551
## + CaptureMethod        1      0.04 5447.4 37.652
## + Habitat              8    107.37 5340.1 45.590
## - LifeStage            2    513.78 5961.2 60.671
## - StraightlineCL       1    752.88 6200.3 76.174
## 
## Step:  AIC=17.76
## Annuli ~ StraightlineCL + LifeStage + Temp
## 
##                       Df Sum of Sq    RSS     Cp
## + ShellHeightatHinge   1    126.97 4968.3 12.591
## + Mass                 1     96.80 4998.4 14.296
## + Recapture            1     90.71 5004.5 14.639
## + MaxCW                1     60.58 5034.7 16.341
## <none>                             5095.2 17.762
## + CaptureMethod        1      9.93 5085.3 19.202
## + PL_HingetoPosterior  1      8.53 5086.7 19.281
## + PL_AnteriortoHinge   1      4.91 5090.3 19.485
## + Sex                  2     29.89 5065.3 20.075
## + Habitat              8    115.05 4980.2 27.265
## - Temp                 1    352.20 5447.4 35.654
## - LifeStage            2    454.91 5550.1 39.454
## - StraightlineCL       1    814.08 5909.3 61.739
## 
## Step:  AIC=12.59
## Annuli ~ StraightlineCL + LifeStage + Temp + ShellHeightatHinge
## 
##                       Df Sum of Sq    RSS     Cp
## + Recapture            1     62.12 4906.1 11.083
## + Sex                  2     74.88 4893.4 12.363
## <none>                             4968.3 12.591
## + Mass                 1     31.29 4937.0 12.825
## + MaxCW                1     24.28 4944.0 13.220
## + CaptureMethod        1     13.14 4955.1 13.849
## + PL_HingetoPosterior  1      4.00 4964.3 14.366
## + PL_AnteriortoHinge   1      0.82 4967.4 14.545
## - StraightlineCL       1     80.75 5049.0 15.152
## - ShellHeightatHinge   1    126.97 5095.2 17.762
## + Habitat              8    158.53 4809.7 19.638
## - LifeStage            2    388.14 5356.4 30.512
## - Temp                 1    382.83 5351.1 32.212
## 
## Step:  AIC=11.08
## Annuli ~ StraightlineCL + LifeStage + Temp + ShellHeightatHinge + 
##     Recapture
## 
##                       Df Sum of Sq    RSS     Cp
## + Sex                  2     87.24 4818.9 10.156
## <none>                             4906.1 11.083
## + Mass                 1     21.70 4884.4 11.858
## + CaptureMethod        1     18.55 4887.6 12.036
## + MaxCW                1     17.21 4888.9 12.111
## - Recapture            1     62.12 4968.3 12.591
## + PL_HingetoPosterior  1      2.49 4903.7 12.943
## + PL_AnteriortoHinge   1      0.65 4905.5 13.046
## - StraightlineCL       1     77.02 4983.2 13.433
## - ShellHeightatHinge   1     98.38 5004.5 14.639
## + Habitat              8    197.51 4708.6 15.929
## - LifeStage            2    387.06 5293.2 28.943
## - Temp                 1    383.18 5289.3 30.724
## 
## Step:  AIC=10.16
## Annuli ~ StraightlineCL + LifeStage + Temp + ShellHeightatHinge + 
##     Recapture + Sex
## 
##                       Df Sum of Sq    RSS      Cp
## - StraightlineCL       1     13.68 4832.6  8.9288
## + Mass                 1     45.98 4772.9  9.5594
## <none>                             4818.9 10.1560
## + MaxCW                1     24.28 4794.6 10.7850
## + CaptureMethod        1     19.18 4799.7 11.0730
## - Sex                  2     87.24 4906.1 11.0833
## + PL_HingetoPosterior  1      2.08 4816.8 12.0386
## + PL_AnteriortoHinge   1      0.38 4818.5 12.1343
## - Recapture            1     74.49 4893.4 12.3627
## + Habitat              8    179.05 4639.8 16.0437
## - ShellHeightatHinge   1    144.23 4963.1 16.3019
## - LifeStage            2    367.04 5185.9 26.8852
## - Temp                 1    353.44 5172.3 28.1173
## 
## Step:  AIC=8.93
## Annuli ~ LifeStage + Temp + ShellHeightatHinge + Recapture + 
##     Sex
## 
##                       Df Sum of Sq    RSS      Cp
## + Mass                 1     59.65 4772.9  7.5599
## + MaxCW                1     37.62 4795.0  8.8043
## <none>                             4832.6  8.9288
## + CaptureMethod        1     19.36 4813.2  9.8357
## + StraightlineCL       1     13.68 4818.9 10.1560
## + PL_AnteriortoHinge   1      2.51 4830.1 10.7868
## + PL_HingetoPosterior  1      0.79 4831.8 10.8842
## - Recapture            1     78.28 4910.9 11.3499
## - Sex                  2    150.58 4983.2 13.4332
## + Habitat              8    182.17 4650.4 14.6404
## - Temp                 1    356.14 5188.7 27.0423
## - LifeStage            2    405.34 5237.9 27.8211
## - ShellHeightatHinge   1    647.76 5480.3 43.5124
## 
## Step:  AIC=7.56
## Annuli ~ LifeStage + Temp + ShellHeightatHinge + Recapture + 
##     Sex + Mass
## 
##                       Df Sum of Sq    RSS      Cp
## <none>                             4772.9  7.5599
## + CaptureMethod        1     16.23 4756.7  8.6430
## + MaxCW                1     14.35 4758.6  8.7492
## - Mass                 1     59.65 4832.6  8.9288
## - Recapture            1     61.53 4834.5  9.0347
## + PL_HingetoPosterior  1      3.45 4769.5  9.3648
## + PL_AnteriortoHinge   1      0.20 4772.7  9.5487
## + StraightlineCL       1      0.01 4772.9  9.5594
## - Sex                  2    143.80 4916.7 11.6811
## - ShellHeightatHinge   1    113.67 4886.6 11.9797
## + Habitat              8    168.94 4604.0 14.0190
## - LifeStage            2    361.29 5134.2 23.9643
## - Temp                 1    335.81 5108.7 24.5256

## 
## Call:
## lm(formula = Annuli ~ LifeStage + Temp + ShellHeightatHinge + 
##     Recapture + Sex + Mass, data = Turtles455001)
## 
## Coefficients:
##        (Intercept)  LifeStageHatchling   LifeStageJuvenile                Temp  
##          -6.931805            0.446168           -5.340853            0.142505  
## ShellHeightatHinge           Recapture             SexMale          SexUnknown  
##           0.164687            1.033107            1.099360           -1.141631  
##               Mass  
##           0.008538

I used stepwise selection to choose the best model for predicting Annuli. When a scope is supplied, step() searches in both directions by default, so predictors can be dropped as well as added, which is why StraightlineCL enters first and is removed later. I chose the model with the smallest Mallows' Cp. The model includes the predictors LifeStage, Temp, ShellHeightatHinge, Recapture, Sex, and Mass and has a Cp of 7.5599, the lowest of any model the search visited. I created the model in the chunk below.

mod4=lm(Annuli~Recapture+Temp+LifeStage+Sex+Mass+ShellHeightatHinge, data=Turtles455001)
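The Cp values printed by step() can be reproduced by hand from the RSS column. The sketch below assumes n = 280 turtles (inferred from the 278 residual degrees of freedom in the two-coefficient model of question 3) and recovers the full-model MSE from the first row of the trace, since that value is not printed; `mallows_cp` is a hypothetical helper name.

```r
n <- 280                 # assumed sample size (278 residual df + 2 coefficients)
rss_null <- 9107.9       # RSS of Annuli ~ 1, from the step() trace
cp_null  <- 236.387      # Cp of Annuli ~ 1, which has p = 1 coefficient

# Cp = RSS / MSE + 2p - n, so MSE = RSS / (Cp + n - 2p)
mse_full <- rss_null / (cp_null + n - 2 * 1)

# Hypothetical helper: Cp for a model with residual sum of squares rss
# and p estimated coefficients (including the intercept)
mallows_cp <- function(rss, p) rss / mse_full + 2 * p - n

# Final model: RSS = 4772.9 and 9 coefficients in the printed output
cp_final <- mallows_cp(rss = 4772.9, p = 9)
print(round(cp_final, 2))
```

The result agrees with the 7.5599 in the last step of the trace up to rounding of the printed RSS values.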

##5

mod5 = aov(Annuli~factor(LifeStage), data = Turtles455001)
summary(mod5)

##                    Df Sum Sq Mean Sq F value Pr(>F)    
## factor(LifeStage)   2   2908  1453.8   64.95 <2e-16 ***
## Residuals         277   6200    22.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(car)

## Loading required package: carData

leveneTest(Turtles455001$Annuli, Turtles455001$LifeStage)

## Warning in leveneTest.default(Turtles455001$Annuli, Turtles455001$LifeStage):
## Turtles455001$LifeStage coerced to factor.

H₀: variance₁ = variance₂ = variance₃ = … ;

Hₐ: some varianceᵢ ≠ varianceⱼ

##6

mod5.1 = aov(sqrt(Annuli)~factor(LifeStage), data = Turtles455001)
summary(mod5.1)

##                    Df Sum Sq Mean Sq F value Pr(>F)    
## factor(LifeStage)   2  71.85   35.93    98.7 <2e-16 ***
## Residuals         277 100.82    0.36                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with my previous transformation, I took the square root of the response variable. The one-way ANOVA model with sqrt(Annuli) as the response and LifeStage as the predictor is still significant (F = 98.7, p-value < 2e-16).

##7

library(MASS)
mod6 = glm(Recapture~CaptureMethod+Temp+factor(LifeStage) +factor(Sex)+Annuli+Mass+StraightlineCL+MaxCW+PL_AnteriortoHinge+ PL_HingetoPosterior +ShellHeightatHinge, family="binomial", data=Turtles455001)
mod7 = stepAIC(mod6, trace=0)
summary(mod7)

## 
## Call:
## glm(formula = Recapture ~ Annuli + MaxCW + ShellHeightatHinge, 
##     family = "binomial", data = Turtles455001)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9997  -1.0267  -0.3959   1.1298   2.0606  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -9.15627    1.73990  -5.263 1.42e-07 ***
## Annuli              0.05592    0.03024   1.849   0.0644 .  
## MaxCW               0.03882    0.02220   1.749   0.0804 .  
## ShellHeightatHinge  0.07178    0.03547   2.024   0.0430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.21  on 279  degrees of freedom
## Residual deviance: 324.79  on 276  degrees of freedom
## AIC: 332.79
## 
## Number of Fisher Scoring iterations: 5

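I used stepAIC to choose the best model for predicting Recapture. The selected model includes the predictors Annuli, MaxCW, and ShellHeightatHinge. The drop in deviance from the null model, 375.21 - 324.79 = 50.42 on 3 degrees of freedom, has a p-value of approximately 0, which indicates that the model as a whole is effective, although only ShellHeightatHinge is individually significant at the 0.05 level. I created the model in the chunk below.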

mod8 = glm(Recapture~Annuli+MaxCW+ShellHeightatHinge, family = binomial, data =Turtles455001)
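The overall test of this logistic model can be sketched from the two deviances printed in the summary. The snippet uses only those printed numbers and base R's pchisq(); it does not refit the model.

```r
# Deviances and degrees of freedom copied from summary(mod7)
null_dev  <- 375.21; null_df  <- 279
resid_dev <- 324.79; resid_df <- 276

# G statistic: drop in deviance from the intercept-only model
G  <- null_dev - resid_dev
df <- null_df - resid_df

# Upper-tail chi-square probability for the overall test
p_value <- pchisq(G, df = df, lower.tail = FALSE)
print(c(G = G, df = df, p_value = p_value))
```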

##8

mod8 = glm(Recapture~Annuli+MaxCW+ShellHeightatHinge, family = binomial, data =Turtles455001)
summary(mod8)

## 
## Call:
## glm(formula = Recapture ~ Annuli + MaxCW + ShellHeightatHinge, 
##     family = binomial, data = Turtles455001)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9997  -1.0267  -0.3959   1.1298   2.0606  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -9.15627    1.73990  -5.263 1.42e-07 ***
## Annuli              0.05592    0.03024   1.849   0.0644 .  
## MaxCW               0.03882    0.02220   1.749   0.0804 .  
## ShellHeightatHinge  0.07178    0.03547   2.024   0.0430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.21  on 279  degrees of freedom
## Residual deviance: 324.79  on 276  degrees of freedom
## AIC: 332.79
## 
## Number of Fisher Scoring iterations: 5

mod9=glm(Recapture~ShellHeightatHinge, family = binomial, data=Turtles455001)
summary(mod9)

## 
## Call:
## glm(formula = Recapture ~ ShellHeightatHinge, family = binomial, 
##     data = Turtles455001)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1374  -0.9825  -0.5198   1.1365   1.9021  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -7.64823    1.41251  -5.415 6.14e-08 ***
## ShellHeightatHinge  0.12805    0.02453   5.219 1.80e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.21  on 279  degrees of freedom
## Residual deviance: 333.04  on 278  degrees of freedom
## AIC: 337.04
## 
## Number of Fisher Scoring iterations: 5

anova(mod8, mod9, test="Chisq")

H₀: β_Annuli = β_MaxCW = 0 ;

Hₐ: β_Annuli ≠ 0 or β_MaxCW ≠ 0

mod8 is the model created in question 7, and mod9 is the reduced version of mod8 that keeps only the quantitative predictor with the lowest p-value in mod8 (ShellHeightatHinge, p-value 0.0430). The nested test compares the residual deviances of the two models. The residual deviance of the reduced model (333.04 on 278 degrees of freedom) is larger than the residual deviance of the original model (324.79 on 276 degrees of freedom), a difference of G = 8.25 on 2 degrees of freedom, which gives a p-value of about 0.016. The p-value for ShellHeightatHinge is much smaller in mod9 than in mod8, but coefficient p-values from different models do not show which model fits better; the drop-in-deviance test does. Since its p-value is below 0.05, I reject the null hypothesis and conclude that Annuli and MaxCW together add significantly to the model, so the original model is more effective than the model with just ShellHeightatHinge.
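The nested comparison that anova(mod8, mod9, test="Chisq") performs can be checked by hand from the residual deviances in the two summaries. This is a sketch using only the printed values; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom from summary(mod8) and summary(mod9)
dev_full    <- 324.79; df_full    <- 276   # Annuli + MaxCW + ShellHeightatHinge
dev_reduced <- 333.04; df_reduced <- 278   # ShellHeightatHinge only

# Drop-in-deviance statistic and its degrees of freedom
G  <- dev_reduced - dev_full
df <- df_reduced - df_full

# A small p-value favors the full model over the reduced one
p_value <- pchisq(G, df = df, lower.tail = FALSE)
print(round(c(G = G, df = df, p_value = p_value), 4))
```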

##9

exp(confint(mod9))

## Waiting for profiling to be done...

##                           2.5 %      97.5 %
## (Intercept)        2.444119e-05 0.006259808
## ShellHeightatHinge 1.086714e+00 1.196612087

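I am 95% confident that the odds of recapture for a turtle with ShellHeightatHinge equal to 0 (the exponentiated intercept) fall between 2.444e-05 and 0.0063, and that the odds ratio for ShellHeightatHinge falls between 1.087 and 1.197. Because the whole interval for ShellHeightatHinge is above 1, each one-unit increase in ShellHeightatHinge multiplies the odds of a turtle being recaptured by a factor between 1.087 and 1.197, so the odds increase.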
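As a rough cross-check, a Wald interval for the odds ratio can be built from the estimate and standard error in summary(mod9). This is an approximation, not the method used above: confint() on a glm reports a profile-likelihood interval, so the endpoints differ slightly from the printed ones.

```r
# Estimate and standard error for ShellHeightatHinge from summary(mod9)
est <- 0.12805
se  <- 0.02453

# Point estimate of the odds ratio per one-unit increase
odds_ratio <- exp(est)

# Approximate 95% Wald interval, exponentiated to the odds-ratio scale
z <- qnorm(0.975)
wald_ci <- exp(est + c(-1, 1) * z * se)

print(round(c(odds_ratio = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 3))
```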

##10

turtleCaptured = data.frame(CaptureMethod=2, Temp=84.6, RaceM=7, LifeStage="Adult", Sex="Male", Annuli=12, Mass=350.0, StraightlineCL=123.00, MaxCW=96.00, PL_AnteriortoHinge=46.00, PL_HingetoPosterior=80.00, ShellHeightatHinge=52.00, Habitat="field/forest edge (within 6m of boundry)")

predict(mod8, turtleCaptured, type="response")

##         1 
## 0.2639663

Using the model created in number 7, the predicted probability that a turtle like the first one in the dataset is recaptured is approximately 26%.
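The value returned by predict(..., type="response") is a probability, and it can be reproduced from the rounded coefficients in summary(mod8). The sketch below also converts it to odds to show the difference between the two scales; the result matches the printed 0.2639663 only up to rounding of the coefficients.

```r
# Rounded coefficients from summary(mod8)
coefs <- c(intercept = -9.15627, Annuli = 0.05592,
           MaxCW = 0.03882, ShellHeightatHinge = 0.07178)

# Predictor values for the turtle in turtleCaptured (1 multiplies the intercept)
turtle <- c(1, 12, 96, 52)

log_odds <- sum(coefs * turtle)   # linear predictor on the logit scale
prob <- plogis(log_odds)          # inverse logit: probability of recapture
odds <- prob / (1 - prob)         # the same prediction expressed as odds

print(round(c(log_odds = log_odds, prob = prob, odds = odds), 4))
```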

##11

modC=aov(GPA~FRAUD+COPYEXAM+FRAUD*COPYEXAM,data=Cheating455001)
summary(modC)

##                 Df Sum Sq Mean Sq F value Pr(>F)  
## FRAUD            1   0.06  0.0574   0.267 0.6055  
## COPYEXAM         1   0.91  0.9128   4.253 0.0402 *
## FRAUD:COPYEXAM   1   0.02  0.0151   0.070 0.7914  
## Residuals      260  55.80  0.2146                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##12

plot(modC)

hist(modC$residuals)

The plots above show the residuals versus fitted plot, the normal Q-Q plot, and a histogram of the residuals. In the residuals versus fitted plot, the residuals fall in separate vertical groups, one for each combination of answers to the two questions. Each group shows a similar amount of variance, the groups do not overlap, and the points within each group look fairly scattered. The normal Q-Q plot appears roughly linear, but both the right and the left ends show data points that disrupt the linearity, so overall the points do not all fall on the line. The histogram appears to be roughly normal, with the values centered at 0. It skews just a little to the left, which I assume is due to outliers.

##13

anova(modC)

The null hypotheses are that there is no main effect of FRAUD on GPA, no main effect of COPYEXAM on GPA, and no interaction effect of FRAUD and COPYEXAM on GPA. The alternative hypotheses are that there is a main effect of FRAUD on GPA, a main effect of COPYEXAM on GPA, and an interaction effect of FRAUD and COPYEXAM on GPA. The p-values for FRAUD (0.6055) and for the interaction (0.7914) are not significant, so we cannot reject those null hypotheses. The only significant p-value is for COPYEXAM (0.0402) at the 0.05 alpha level, which indicates that there may be a main effect of COPYEXAM on GPA. A Tukey HSD method was not used because the FRAUD and interaction p-values were not significant, so we cannot conclude that there is a main effect of FRAUD on GPA or an interaction effect of FRAUD and COPYEXAM on GPA.

##14

modC1=lm(GPA~COPYEXAM+FRAUD, data=Cheating455001)
df1 = data.frame(COPYEXAM = 1, FRAUD = 0)
predict(modC1, df1, type="response")

##       1 
## 2.99629

df2 = data.frame(COPYEXAM = 0, FRAUD = 1)
predict(modC1, df2, type="response")

##        1 
## 3.131629

df3 = data.frame(COPYEXAM = 1, FRAUD = 1)
predict(modC1, df3, type="response")

##        1 
## 2.982121

Using the new model, the predicted GPA for a student who answers “yes” only to the COPYEXAM question is 2.996. The predicted GPA for a student who answers “yes” only to the FRAUD question is 3.132. Lastly, the predicted GPA for a student who answers “yes” to both the COPYEXAM and FRAUD questions is 2.982. The predicted GPA is lowest when a student answers “yes” to both questions, slightly higher for “yes” to COPYEXAM only, and highest for “yes” to FRAUD only, so a “yes” to COPYEXAM is associated with a larger drop in predicted GPA than a “yes” to FRAUD.
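Because modC1 has no interaction term, its three coefficients can be recovered from the three predictions printed above. The sketch assumes COPYEXAM and FRAUD are coded 0/1, as the data frames passed to predict() suggest; the variable names are illustrative.

```r
# Predictions copied from the output above
pred_copy  <- 2.99629    # COPYEXAM = 1, FRAUD = 0
pred_fraud <- 3.131629   # COPYEXAM = 0, FRAUD = 1
pred_both  <- 2.982121   # COPYEXAM = 1, FRAUD = 1

# In an additive model each effect is the change from switching one answer to "yes"
b_fraud <- pred_both - pred_copy    # effect of FRAUD, holding COPYEXAM at 1
b_copy  <- pred_both - pred_fraud   # effect of COPYEXAM, holding FRAUD at 1
b0      <- pred_copy - b_copy       # predicted GPA when both answers are "no"

print(round(c(intercept = b0, COPYEXAM = b_copy, FRAUD = b_fraud), 4))
```

Both effects come out negative, and the COPYEXAM effect is the larger of the two in magnitude.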
