library(tidyverse)
library(Stat2Data)
library(skimr)
library(agricolae)

5.28 Discrimination: exploratory

The city of New Haven, Connecticut, administered exams (both written and oral) in November and December of 2003 to firefighters hoping to qualify for promotion to either Lieutenant or Captain in the city fire department. A final score consisting of a 60% weight for the written exam and a 40% weight for the oral exam was computed for each person who took the exam. Those people receiving a total score of at least 70% were deemed to be eligible for promotion. In a situation where \(t\) openings were available, the people with the top \(t+2\) scores would be considered for those openings. A concern was raised, however, that the exams were discriminatory with respect to race, and a lawsuit was filed. The data are given in the data file Ricci. For each person who took the exams, there are measurements on their race (black, white, or Hispanic), which position they were trying for (Lieutenant, Captain), scores on the oral and written exams, and the combined score. Here we concentrate on the overall, combined score on the two tests for these people seeking promotion, and we analyze the average score for the three different races.

data("Ricci")
  1. Use a graphical approach to answer the question of whether the average combined score is different for the three races. What do the graphs suggest about any further analysis that could be done? Explain.
Ricci %>%
  group_by(Race) %>%
  skim(Combine)
## Skim summary statistics
##  n obs: 118 
##  n variables: 5 
##  group variables: Race 
## 
## ── Variable type:numeric ──────────────────────────────────────────────
##  Race variable missing complete  n  mean   sd    p0   p25   p50   p75
##     B  Combine       0       27 27 63.74 8.74 45.93 57.66 61.07 72.03
##     H  Combine       0       23 23 65.34 7.14 54.13 60.08 65    69.95
##     W  Combine       0       68 68 72.68 8.83 56.32 68.02 71.64 78.45
##   p100     hist
##  76.6  ▁▂▅▇▁▃▅▆
##  79.68 ▅▅▇▅▅▅▂▃
##  92.81 ▅▂▇▇▆▅▃▂
ggplot(Ricci) + geom_boxplot(aes(x=Race, y=Combine))

  1. Check the conditions necessary for conducting an ANOVA to determine if the combined score is significantly different for at least one race.
a1 = aov(Combine ~ Race, data=Ricci)
anova(a1)
## Analysis of Variance Table
## 
## Response: Combine
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Race        2 1971.7  985.83  13.595 5.014e-06 ***
## Residuals 115 8339.1   72.51                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(a1, which=2)

plot(a1, which=1)
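The residual plots can be supplemented with a quick numeric check of the equal-variance condition. The sketch below is only an illustration: it copies the group standard deviations from the skim() output above and applies the common rule of thumb that the largest group SD should be less than twice the smallest. The threshold of 2 and the names `sds` and `sd_ratio` are assumptions introduced here, not part of the original solution.

```r
# Group standard deviations of Combine, copied from the skim() output above
sds <- c(B = 8.74, H = 7.14, W = 8.83)

# Ratio of the largest to the smallest group SD
sd_ratio <- max(sds) / min(sds)
sd_ratio

# Rule of thumb (assumed here): a ratio below 2 is consistent with
# the equal-variance condition for ANOVA
sd_ratio < 2
```

With these values the ratio is roughly 1.24, so the spreads look similar enough across the three groups under this rule of thumb.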

5.34 Aphid honeydew

Aphids (a type of small insect) produce a form of liquid waste, called honeydew, when they eat plant sap. An experiment was conducted to see whether the amounts of honeydew produced by aphids differ for different combinations of type of aphid and type of host plant. The following ANOVA table was produced with the data from this experiment.

table <- matrix(c("m-1","SST","4.9807", "MST/MSE", "0.000","46","39.87","0.8667", " ", " ", "51", "64.77", " ", " ", " "),ncol=5,byrow=TRUE)
colnames(table) <- c("Df", " Sum Sq", "Mean Sq", "F value", "Pr(>F)")
rownames(table) <- c("aphid race-host plant combina","Error","Total")
table <- as.table(table)
table
##                               Df   Sum Sq Mean Sq F value Pr(>F)
## aphid race-host plant combina m-1 SST     4.9807  MST/MSE 0.000 
## Error                         46  39.87   0.8667                
## Total                         51  64.77
  1. Fill in the three missing values in this ANOVA table. Also show how you calculate them.
table <- matrix(c("5","24.9","4.9807", "5.747", "0.000","46","39.87","0.8667", " ", " ", "51", "64.77", " ", " ", " "),ncol=5,byrow=TRUE)
colnames(table) <- c("Df", " Sum Sq", "Mean Sq", "F value", "Pr(>F)")
rownames(table) <- c("aphid race-host plant combina","Error","Total")
table <- as.table(table)
table
##                               Df  Sum Sq Mean Sq F value Pr(>F)
## aphid race-host plant combina 5  24.9    4.9807  5.747   0.000 
## Error                         46 39.87   0.8667                
## Total                         51 64.77
  1. How many different aphid/plant combinations were considered in this analysis? Explain how you know.
  1. Summarize the conclusion from this ANOVA (in context).
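The arithmetic behind the filled-in table can be sketched as follows, using only the values printed in the original table. The variable names are illustrative, and the p-value line is an extra check computed here rather than a number reported in the exercise.

```r
# Values given in the ANOVA table
df_total  <- 51
df_error  <- 46
ss_total  <- 64.77
ss_error  <- 39.87
ms_groups <- 4.9807
ms_error  <- 0.8667

# Missing entries: degrees of freedom and sums of squares are additive
df_groups <- df_total - df_error   # m - 1
ss_groups <- ss_total - ss_error   # SS for aphid/plant combinations
f_value   <- ms_groups / ms_error  # F = MS(groups) / MSE

# Number of aphid/plant combinations: m = (m - 1) + 1
n_combinations <- df_groups + 1

# Upper-tail p-value of the F statistic (additional check, not in the table)
p_value <- pf(f_value, df_groups, df_error, lower.tail = FALSE)

c(df = df_groups, SS = ss_groups, F = f_value, m = n_combinations)
```

As a consistency check, `ss_groups / df_groups` should come out close to the reported mean square of 4.9807.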

5.38 Meniscus: stiffness

An experiment was conducted to compare three different methods of repairing a meniscus (cartilage in the knee). Eighteen lightly embalmed cadaveric specimens were used, with each being randomly assigned to one of the three treatments: vertical suture, meniscus arrow, FasT-Fix. Each knee was evaluated on three different response variables: load at failure, stiffness, and displacement. The data are located in the file Meniscus. For this exercise we will concentrate on the stiffness response variable (variable name Stiffness).

  1. Give the hypotheses that would be tested in an ANOVA procedure for this dataset.

\[ H_0: \mu_1=\mu_2=\mu_3\\ H_A: \text{at least one } \mu_i \text{ differs from the others} \]

  1. Show that the conditions for ANOVA are met for these data.
data(Meniscus)
a1 <- aov(Stiffness~factor(Method), data=Meniscus)
plot(a1, which=2)

plot(a1, which=1)

ggplot(Meniscus) + geom_boxplot(aes(x=factor(Method), y=Stiffness))

  1. Conduct an ANOVA. Report the ANOVA table and interpret the results. Do the data provide strong evidence that the mean value of stiffness differs based on the type of meniscus repair? Explain.
anova(a1)
## Analysis of Variance Table
## 
## Response: Stiffness
##                Df Sum Sq Mean Sq F value  Pr(>F)  
## factor(Method)  2 10.570   5.285  4.9811 0.02193 *
## Residuals      15 15.915   1.061                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
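To see where the F value and p-value in this table come from, the following sketch recomputes them from the printed mean squares and degrees of freedom. It is a check on the table under the usual one-way ANOVA assumptions, with illustrative variable names; because the mean squares are rounded in the printout, the results agree with the table only approximately.

```r
# Mean squares and degrees of freedom from the ANOVA table above
ms_method <- 5.285
ms_error  <- 1.061
df_method <- 2
df_error  <- 15

# F statistic: ratio of between-group to within-group mean square
f_value <- ms_method / ms_error

# p-value: upper-tail area of the F(2, 15) distribution
p_value <- pf(f_value, df1 = df_method, df2 = df_error, lower.tail = FALSE)

c(F = f_value, p = p_value)
```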

5.46 Meniscus stiffness: Fisher’s LSD

In Exercise 5.38 we discovered that there was a significant difference between the treatments with respect to the response variable stiffness. For this variable, larger values are better (less stiffness to the specimen). The researchers were comparing a potential new treatment (FasT-Fix) to two commonly used treatments (vertical suture and meniscus arrow). Use Fisher’s LSD to determine which differences exist between the treatments and discuss the ramifications of your conclusions for doctors.

a1 = aov(Stiffness~factor(Method), data=Meniscus)
anova(a1)
## Analysis of Variance Table
## 
## Response: Stiffness
##                Df Sum Sq Mean Sq F value  Pr(>F)  
## factor(Method)  2 10.570   5.285  4.9811 0.02193 *
## Residuals      15 15.915   1.061                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(LSD.test(a1,"factor(Method)"))
## $statistics
##   MSerror Df     Mean       CV t.value     LSD
##     1.061 15 7.183333 14.33942 2.13145 1.26757
## 
## $parameters
##         test p.ajusted         name.t ntr alpha
##   Fisher-LSD      none factor(Method)   3  0.05
## 
## $means
##   Stiffness       std r      LCL      UCL Min Max   Q25  Q50   Q75
## 1      7.75 0.9710819 6 6.853692 8.646308 6.3 8.7 7.225 7.80 8.600
## 2      6.10 1.3266499 6 5.203692 6.996308 4.7 8.4 5.200 5.95 6.475
## 3      7.70 0.6928203 6 6.803692 8.596308 6.4 8.3 7.625 7.85 8.150
## 
## $comparison
## NULL
## 
## $groups
##   Stiffness groups
## 1      7.75      a
## 3      7.70      a
## 2      6.10      b
## 
## attr(,"class")
## [1] "group"
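The LSD value reported by LSD.test() can be reproduced by hand, which makes the grouping letters easier to read. The sketch below uses the MSE, error degrees of freedom, group size, and group means printed above; the variable names are illustrative, and the groups are labelled 1, 2, 3 exactly as in the output, without assuming which label corresponds to which repair method.

```r
# Quantities copied from the LSD.test() output above
mse         <- 1.061
df_error    <- 15
n_per_group <- 6
alpha       <- 0.05

# Fisher's LSD: t critical value times the SE of a difference in two means
t_crit <- qt(1 - alpha / 2, df_error)
lsd    <- t_crit * sqrt(mse * (1 / n_per_group + 1 / n_per_group))

# Group means, labelled as in the output
means <- c(`1` = 7.75, `2` = 6.10, `3` = 7.70)

# Absolute pairwise differences; TRUE marks pairs that differ by more than LSD
diffs <- abs(outer(means, means, "-"))
lsd
diffs > lsd
```

Pairs whose absolute difference exceeds `lsd` are declared different, which matches the letters in `$groups`: methods 1 and 3 share the letter a, while method 2 stands alone with b.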

5.52 Words with Friends

Revisit the dataset WordsWithFriends that was analyzed in Section 5.8. In the Case Study analysis, we asked whether the number of blank tiles that a player receives was related to the final score. Our conclusion was that there is a noticeable difference between the final scores of the games, depending on how many blank tiles the player receives. In this exercise we ask the same question, but with respect to the winning margin rather than the final score.

  1. Show that the conditions for ANOVA are met for these data.
data("WordsWithFriends")
a1 <- aov(WinMargin~factor(BlanksNumber), data=WordsWithFriends)
plot(a1, which=2)

plot(a1, which=1)

ggplot(WordsWithFriends) + geom_boxplot(aes(x=factor(BlanksNumber), y=WinMargin))

  1. Conduct an ANOVA. Report the ANOVA table and interpret the results.
anova(a1)
## Analysis of Variance Table
## 
## Response: WinMargin
##                       Df Sum Sq Mean Sq F value   Pr(>F)   
## factor(BlanksNumber)   2   9514  4757.2  6.9884 0.001028 **
## Residuals            441 300202   680.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.58 Salary

A researcher wanted to know if the mean salaries of men and women are different. She chose a stratified random sample of 280 people from the 2000 U.S. Census consisting of men and women from New York State, Oregon, Arizona, and Iowa. The researcher, not understanding much about statistics, had Minitab compute an ANOVA table for her. It is shown below:

table <- matrix(c("1","8190848743","8190848743", "12.45", "0.000","278","1.82913E+11","657958980", " ", " ", "279", "1.91103E+11", " ", " ", " "),ncol=5,byrow=TRUE)
colnames(table) <- c("Df", " Sum Sq", "Mean Sq", "F value", "Pr(>F)")
rownames(table) <- c("sex","Error","Total")
table <- as.table(table)
add <- matrix(c("S=25651","R-sq=4.29","R-sq(adj)=3.94"),ncol=3,byrow=TRUE)
table
##       Df   Sum Sq     Mean Sq    F value Pr(>F)
## sex   1   8190848743  8190848743 12.45   0.000 
## Error 278 1.82913E+11 657958980                
## Total 279 1.91103E+11
add
##      [,1]      [,2]        [,3]            
## [1,] "S=25651" "R-sq=4.29" "R-sq(adj)=3.94"
  1. Is a person’s sex significant in predicting their salary? Explain your conclusions.
  1. What value of \(R^2\) does the ANOVA model have? Is this good? Explain.
  1. The researcher did not look at residual plots. They are shown in Figure 5.32. What conclusions do you reach about the ANOVA after examining these plots? Explain.
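The summary line S, R-sq, and R-sq(adj) under the Minitab table can be recovered from the table entries, which helps with the \(R^2\) question. The sketch below uses only the sums of squares, mean square, and sample size given above; the variable names are illustrative, and the adjusted \(R^2\) formula assumes a single predictor (sex), so that the error degrees of freedom are n - 2.

```r
# Entries from the Minitab ANOVA table above
ss_sex   <- 8190848743
ss_total <- 1.91103e11
ms_error <- 657958980
n        <- 280

# R-squared: share of total variability in salary explained by sex
r_sq <- ss_sex / ss_total

# S: estimated standard deviation of the error term
s <- sqrt(ms_error)

# Adjusted R-squared for one predictor (error df = n - 2)
r_sq_adj <- 1 - (1 - r_sq) * (n - 1) / (n - 2)

round(c(S = s, R_sq_pct = 100 * r_sq, R_sq_adj_pct = 100 * r_sq_adj), 2)
```

An \(R^2\) of about 4.29% means sex accounts for only a small fraction of the variability in salary, even though the F test itself is significant.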