Due: Saturday, Sep 30, at 11:59 pm

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(fBasics)
## Warning: package 'fBasics' was built under R version 4.3.2
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:fBasics':
## 
##     densityPlot
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.3.2
## 
## Attaching package: 'DescTools'
## 
## The following object is masked from 'package:car':
## 
##     Recode
library(dplyr)

Data: heartbpchol.csv for Exercise 1, bupa.csv for Exercise 2, psych.csv for Exercise 3, cars_new.csv for Exercise 4.

Use the significance level of .05.

.bordered{
  border-style: solid;
  border-color: teal;
  padding: 5px;
  background-color: #DCDCDC;
}

Exercise 1

Analysis of Variance

The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status) (category: High/ Normal/ Optimal) for alive patients.

For the heartbpchol data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.

(A) Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor. Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

  • The analysis of variance table has a p-value of 0.00137 for BP_Status, so BP_Status is statistically significant. The r-squared value of 0.0242, however, is very small. Only 2.42% of the variation in cholesterol can be described by the blood pressure levels. The Levene’s test for homogeneity of variance gives p-value of 0.8332 and it is highly insignificant. Thus, an equal variance assumption is fine here.

  • (Additional note about small R^2) R-square will tend to go smaller for when a categorical predictor has been based on cutoffs for a continuous variable. If there is a linear relationship between the two continuous variables, we would expect the response just to the left of the cutoff to be pretty close to the expected response just to the right of the cutoff, but the observations would be in different groups. The reduced granularity in the predictors from binning tends to flatten out the predictions and reduce explanation of variation.


BP=read_csv("heartbpchol.csv")
## Rows: 541 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): BP_Status
## dbl (1): Cholesterol
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(BP)
## spc_tbl_ [541 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Cholesterol: num [1:541] 221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : chr [1:541] "Optimal" "High" "High" "Normal" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Cholesterol = col_double(),
##   ..   BP_Status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
BP$Cholesterol= as.numeric(BP$Cholesterol)
str(BP)
## spc_tbl_ [541 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Cholesterol: num [1:541] 221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : chr [1:541] "Optimal" "High" "High" "Normal" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Cholesterol = col_double(),
##   ..   BP_Status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
table(BP$BP_Status)
## 
##    High  Normal Optimal 
##     229     245      67
aov.bp <- aov(Cholesterol ~ BP_Status, BP)
summary(aov.bp)
##              Df  Sum Sq Mean Sq F value  Pr(>F)   
## BP_Status     2   25211   12605   6.671 0.00137 **
## Residuals   538 1016631    1890                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.res = lm(Cholesterol~BP_Status,BP)
anova(lm.res)
## Analysis of Variance Table
## 
## Response: Cholesterol
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## BP_Status   2   25211 12605.4  6.6708 0.001375 **
## Residuals 538 1016631  1889.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.res)$r.squared
## [1] 0.02419833
leveneTest(aov.bp)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.1825 0.8332
##       538
par(mfrow=c(2,2))
plot(aov.bp)

(B) Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.

  • We perform all pairwise comparisons via Tukey’s multi-comparison test. Expected cholesterol levels are significantly different for high and normal blood pressures and for high and optimal blood pressures. We expect individuals with high blood pressures to have cholesterol levels 11.54 higher than those with normal blood pressures on average and 18.65 higher than those with optimal blood pressure on average. The difference between normal and optimal blood pressure groups is not significant. Simply speaking, mean of High > mean of (Normal, Optimal).


# performing the post-hoc test, using the Tukey Test.


TukeyHSD(aov.bp)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Cholesterol ~ BP_Status, data = BP)
## 
## $BP_Status
##                      diff       lwr       upr     p adj
## Normal-High    -11.543481 -20.93394 -2.153023 0.0111929
## Optimal-High   -18.646679 -32.83690 -4.456460 0.0059898
## Optimal-Normal  -7.103198 -21.18815  6.981749 0.4624869

Exercise 2: Analysis of Variance

For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphatase are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.

bupa=read_csv("bupa.csv")
## Rows: 345 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): mcv, alkphos, drinkgroup
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Variable Name = Description
  • MCV = mean corpuscular volume
  • alkphos = alkaline phosphate

  • Drink group categorization of the half-pint equivalents of alcoholic beverages drunk per day:

  • group 1: less than 1 drink.
  • group 2: at least 1 but fewer than 3 drinks.
  • group 3: at least 3 but fewer than 6 drinks.
  • group 4: at 6 but fewer than 9 drinks.
  • group 5: 9 or more drinks.
  • (A) Perform a one-way ANOVA for mcv as a function of drink group. Comment on significance of the drink group, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

    • The p-value from F-test for the effect of drink group is less than significance level 0.05. Thus, we can conclude that drink group has a significant effect on MCV. The R-square is around 0.1077 which indicates that around 10.77% of variation of MCV can be described by the model. Result of Levene’s test for testing equal-variance assumption is presented below. The null hypothesis is Homogeneity of MCV variance. The corresponding p-value is greater than 0.05. Thus, we can conclude that, MCV has equal variance.


    bupa$mcv <- as.numeric(bupa$mcv)
    bupa$drinkgroup <- as.factor(bupa$drinkgroup) 
    aov.bupa <- aov(mcv~drinkgroup,bupa)
    summary(aov.bupa)
    ##              Df Sum Sq Mean Sq F value   Pr(>F)    
    ## drinkgroup    4    733  183.29   10.26 7.43e-08 ***
    ## Residuals   340   6073   17.86                     
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    lm.bupa<-lm(mcv~drinkgroup, bupa)
    anova(lm.bupa)
    ## Analysis of Variance Table
    ## 
    ## Response: mcv
    ##             Df Sum Sq Mean Sq F value    Pr(>F)    
    ## drinkgroup   4  733.2 183.294  10.262 7.429e-08 ***
    ## Residuals  340 6073.1  17.862                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm.bupa)$r.squared
    ## [1] 0.1077214
    leveneTest(aov.bupa)
    ## Levene's Test for Homogeneity of Variance (center = median)
    ##        Df F value Pr(>F)
    ## group   4  0.3053 0.8744
    ##       340

    (B) Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

    • The p-value from F-test for the effect of drinkgroup is less than significance level 0.05. Thus, we can conclude that, drinkgroup has a significant effect on alkphos. The R-square is around 0.0427 which indicates that around 4.27% of variation of alkphos can be described by the model. Result of Levene’s test for testing equal-variance assumption is presented below. The null hypothesis is Homogeneity of alkphos variance. The corresponding p-value is greater than significance level 0.05. Thus, we can conclude that, alkphos has equal variance.


    bupa$alkphos <- as.numeric(bupa$alkphos)
    bupa$drinkgroup <- as.factor(bupa$drinkgroup) 
    aov.alkphos <- aov(alkphos~drinkgroup,bupa)
    summary(aov.alkphos)
    ##              Df Sum Sq Mean Sq F value  Pr(>F)   
    ## drinkgroup    4   4946  1236.4   3.792 0.00495 **
    ## Residuals   340 110858   326.1                   
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    lm.alkphos<-lm(alkphos~drinkgroup, bupa)
    anova(lm.alkphos)
    ## Analysis of Variance Table
    ## 
    ## Response: alkphos
    ##             Df Sum Sq Mean Sq F value  Pr(>F)   
    ## drinkgroup   4   4946 1236.41  3.7921 0.00495 **
    ## Residuals  340 110858  326.05                   
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm.alkphos)$r.squared
    ## [1] 0.04270721
    LeveneTest(aov.alkphos)
    ## Levene's Test for Homogeneity of Variance (center = median)
    ##        Df F value Pr(>F)
    ## group   4  0.8089 0.5201
    ##       340

    (C) Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.

    • For the anova model with mcv and drinkgroup, Group5 has significantly greater mcv than group1. And group4 has significantly greater mcv than group1, group2, and group 3. From multi-comparison result for the anova model with alkphos and drinkgroup, group5 has significantly greater alkphos than group1, group2, group3, and group 4. Multi-comparison results indicate that more alcoholic beverages drunk per day associates with higher mean corpuscular volume and higher alkphos. Especially we see higher risk from group4 and 5 like around 4 or more 9 drinks per day.


    Tukey Post-hoc for MCV

    TukeyHSD(aov.bupa)
    ##   Tukey multiple comparisons of means
    ##     95% family-wise confidence level
    ## 
    ## Fit: aov(formula = mcv ~ drinkgroup, data = bupa)
    ## 
    ## $drinkgroup
    ##             diff          lwr      upr     p adj
    ## 2-1  1.241452991 -0.690316778 3.173223 0.3973587
    ## 3-1  0.938131313 -0.697363908 2.573627 0.5157202
    ## 4-1  3.744610282  1.968846495 5.520374 0.0000002
    ## 5-1  3.746031746  0.999127120 6.492936 0.0020039
    ## 3-2 -0.303321678 -2.330666517 1.724023 0.9940309
    ## 4-2  2.503157290  0.361050967 4.645264 0.0127884
    ## 5-2  2.504578755 -0.492214115 5.501372 0.1499436
    ## 4-3  2.806478969  0.927189295 4.685769 0.0005031
    ## 5-3  2.807900433 -0.007037875 5.622839 0.0509321
    ## 5-4  0.001421464 -2.897262735 2.900106 1.0000000

    Tukey Post-hoc for Alkphos

    TukeyHSD(aov.alkphos)
    ##   Tukey multiple comparisons of means
    ##     95% family-wise confidence level
    ## 
    ## Fit: aov(formula = alkphos ~ drinkgroup, data = bupa)
    ## 
    ## $drinkgroup
    ##          diff         lwr       upr     p adj
    ## 2-1 -2.645299 -10.8987258  5.608127 0.9045115
    ## 3-1 -4.056138 -11.0437411  2.931464 0.5035761
    ## 4-1 -1.148743  -8.7356393  6.438152 0.9937509
    ## 5-1 12.572650   0.8365845 24.308715 0.0288892
    ## 3-2 -1.410839 -10.0726073  7.250929 0.9917361
    ## 4-2  1.496556  -7.6555274 10.648639 0.9916117
    ## 5-2 15.217949   2.4142438 28.021654 0.0107154
    ## 4-3  2.907395  -5.1218122 10.936602 0.8583931
    ## 5-3 16.628788   4.6020510 28.655525 0.0016498
    ## 5-4 13.721393   1.3368544 26.105932 0.0214375

    Exercise 3:

    The psychology department at a hypothetical university has been accused of underpaying female faculty members. The data represent salary (in thousands of dollars) for all 22 professors in the department. This problem is from Maxwell and Delaney (2004).

    This includes the interaction term: an interaction term is a multiplication term, not an addition.

    (A) Fit a two-way ANOVA model including sex (F, M) and rank (Assistant, Associate) the interactionterm. What do the Type 1 and Type 3 sums of squares tell us about significance of effects? Is the interaction between sex and rank significant? Also comment on the variation explained by the model.

    • Type I test tells us that sex and rank main effects are both significant since the p-values are all less than 0.05. But Type III test shows significant rank effect but insignificant gender effect based on significance level 0.05. The interaction term between rank and sex is not significant with a p-value of 0.7951 from both Type I and Type III tests. The value of R-square is 0.6648, meaning that around 66.48% of the variation in salary (the response variable) is explained by the model (two-way ANOVA with interaction effect).


    #reviewing for balanced
    
    pd<-read.csv("psych.csv")

    Analysis of Variance Table - Type 1 Test.

    #Type 1 test
    
    #1
    aov.pd <- aov(salary ~ sex * rank, pd)
    summary(aov.pd)
    ##             Df Sum Sq Mean Sq F value   Pr(>F)    
    ## sex          1 155.15  155.15  17.007 0.000637 ***
    ## rank         1 169.82  169.82  18.616 0.000417 ***
    ## sex:rank     1   0.63    0.63   0.069 0.795101    
    ## Residuals   18 164.21    9.12                     
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Analysis of Variance Table - Type 3 Test.

    # Type 3 test
    
    Anova(aov.pd, type=3)
    ## Anova Table (Type III tests)
    ## 
    ## Response: salary
    ##             Sum Sq Df  F value  Pr(>F)    
    ## (Intercept) 8140.2  1 892.2994 < 2e-16 ***
    ## sex           28.0  1   3.0711 0.09671 .  
    ## rank          70.4  1   7.7189 0.01240 *  
    ## sex:rank       0.6  1   0.0695 0.79510    
    ## Residuals    164.2 18                     
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    R-Squared

    lm.res<-lm(salary~sex*rank,pd)
    summary(lm.res)$r.squared
    ## [1] 0.6647566

    Note: This does not include the interaction term - it means that you are fitting a model where the effects of the predictor variables are considered independently, without accounting for any interaction between them.

    (B) Refit the model without the interaction term. Comment on the significance of effects and variation explained. Report and interpret the Type 1 and Type 3 tests of the main effects. Are the main effects of rank and sex significant?

    • Type I and Type III tests tell us that the main effects are both significant since the p-values are all less than 0.05. The value of R-square is 0.6635, meaning that around 66.35% of the variation of salary is explained by the model.


    #1
    aov.pd2 <- aov(salary ~ sex + rank, pd)
    summary(aov.pd2)
    ##             Df Sum Sq Mean Sq F value   Pr(>F)    
    ## sex          1  155.2  155.15   17.88 0.000454 ***
    ## rank         1  169.8  169.82   19.57 0.000291 ***
    ## Residuals   19  164.8    8.68                     
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # Type 3 test
    
    Anova(aov.pd2, type=3)
    ## Anova Table (Type III tests)
    ## 
    ## Response: salary
    ##              Sum Sq Df   F value    Pr(>F)    
    ## (Intercept) 10227.6  1 1178.8469 < 2.2e-16 ***
    ## sex            72.8  1    8.3862 0.0092618 ** 
    ## rank          169.8  1   19.5743 0.0002912 ***
    ## Residuals     164.8 19                        
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #2
    aov.pd3 <- aov(salary ~rank + sex, pd)
    summary(aov.pd3)
    ##             Df Sum Sq Mean Sq F value   Pr(>F)    
    ## rank         1 252.22  252.22  29.071 3.34e-05 ***
    ## sex          1  72.76   72.76   8.386  0.00926 ** 
    ## Residuals   19 164.84    8.68                     
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Anova(aov.pd3, type=3)
    ## Anova Table (Type III tests)
    ## 
    ## Response: salary
    ##              Sum Sq Df   F value    Pr(>F)    
    ## (Intercept) 10227.6  1 1178.8469 < 2.2e-16 ***
    ## rank          169.8  1   19.5743 0.0002912 ***
    ## sex            72.8  1    8.3862 0.0092618 ** 
    ## Residuals     164.8 19                        
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (C) Obtain model diagnostics to validate your Normality assumptions.

    • Normality assumption seems valid as we observe almost straight line on Normal QQ plot.


    lm.res2<-lm(salary~sex+rank,pd)
    summary(lm.res2)$r.squared
    ## [1] 0.6634627
    par(mfrow=c(2,2))
    plot(aov.pd)

    (D) Choose a final model based on your results from parts (a) and (b). Comment on any significant group differences through the post-hoc test. State the differences in salary across different main effect groups and interaction (if included) between them.

    • Based on the results of (3 a) and (3 b), the final model is main effect model, the model in (3 b) – with two main effects of rank and sex. The average salary difference between females and males is 5.333333. The average salary difference between assistant professors and associate professors is 5.377778. Therefore, we claim that the salary of female is significantly lower than the salary of male and the salary of assistant professor is significantly lower than salary of associate professor.


    Exercise 4:

    Use the cars_new.csv. See HW1 for detailed information of variables.

    (A) Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.

    • To find the best main effects ANOVA model for mpg_highway, we first fit 3-way ANOVA model with cylinders, origin, and type and results are as follows. TheType3 analysis shows that Origin is not significant variable in ANOVA model. We remove Origin and re-fit the 2-way main effects ANOVA model with Cylinders and Type. Now, as we can see from the output below, Cylinders and Type are significant with small p-values in Type3 analysis. In other words, we should keep two main effects in our final model.
    • This 2-way main effects ANOVA model describes 45.72% of the variation in highway fuel efficiency.


    cars_new<- read.csv("cars_new.csv")
    cars_new$type <- as.factor(cars_new$type)
    cars_new$origin <- as.factor(cars_new$origin)
    cars_new$cylinders <- as.factor(cars_new$cylinders)
    cars_new$mpg_highway <- as.numeric(cars_new$mpg_highway)
    str(cars_new)
    ## 'data.frame':    180 obs. of  4 variables:
    ##  $ type       : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
    ##  $ origin     : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
    ##  $ cylinders  : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
    ##  $ mpg_highway: num  31 29 28 24 24 24 30 29 30 28 ...
    aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars_new)
    Anova(aov.cars, type=3)
    ## Anova Table (Type III tests)
    ## 
    ## Response: mpg_highway
    ##             Sum Sq  Df   F value  Pr(>F)    
    ## (Intercept)  69548   1 6501.6715 < 2e-16 ***
    ## cylinders     1453   1  135.8499 < 2e-16 ***
    ## origin           1   1    0.0786 0.77948    
    ## type           108   1   10.1018 0.00175 ** 
    ## Residuals     1883 176                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    lm.cars<-lm(mpg_highway~cylinders+type,cars_new)
    summary(lm.cars)$r.squared
    ## [1] 0.4572163

    (B) Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.

    • The Cylinders*Type interaction is also significant when added to the model, so the final model has Cylinders, Type, and their interaction. This 2-way ANOVA model with interaction describes 48.14% of the variation in fuel efficiency.


    aov.car1 <- aov(mpg_highway ~ cylinders*type, cars_new)
    Anova(aov.car1, type=3)
    ## Anova Table (Type III tests)
    ## 
    ## Response: mpg_highway
    ##                Sum Sq  Df  F value    Pr(>F)    
    ## (Intercept)     85471   1 8358.838 < 2.2e-16 ***
    ## cylinders        1558   1  152.397 < 2.2e-16 ***
    ## type              198   1   19.392 1.844e-05 ***
    ## cylinders:type     84   1    8.201  0.004696 ** 
    ## Residuals        1800 176                       
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    lm.car1<-lm(mpg_highway~cylinders*type,cars_new)
    summary(lm.car1)$r.squared
    ## [1] 0.4813821
    plot(aov.car1, 2)

    (C) Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.

    • The following differences of least squares means for main effects tell us that 4 cylinder cars are significantly more efficient than 6 cylinder cars, with 4 cylinders expected to get 5.74 mpg more than 6 cylinders. Sedans are significantly more efficient and expected to get 2.678 mpg more than sports cars with p-value less than 0.05.

    • For the interaction term, only the vehicle, sedan with 4 cylinders, have significant differences with other type and 4cylinder combinations. It is significantly more fuel efficient than the other type and cylinder combinations.


    TukeyHSD(aov.car1)
    ##   Tukey multiple comparisons of means
    ##     95% family-wise confidence level
    ## 
    ## Fit: aov(formula = mpg_highway ~ cylinders * type, data = cars_new)
    ## 
    ## $cylinders
    ##          diff       lwr       upr p adj
    ## 6-4 -5.722662 -6.664343 -4.780981     0
    ## 
    ## $type
    ##                   diff       lwr       upr     p adj
    ## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
    ## 
    ## $`cylinders:type`
    ##                         diff       lwr       upr     p adj
    ## 6:Sedan-4:Sedan   -6.1723315 -7.469178 -4.875485 0.0000000
    ## 4:Sports-4:Sedan  -5.2275641 -8.306639 -2.148489 0.0001079
    ## 6:Sports-4:Sedan  -6.6025641 -9.681639 -3.523489 0.0000006
    ## 4:Sports-6:Sedan   0.9447674 -2.120956  4.010491 0.8546517
    ## 6:Sports-6:Sedan  -0.4302326 -3.495956  2.635491 0.9834567
    ## 6:Sports-4:Sports -1.3750000 -5.521993  2.771993 0.8253946