library(readr)
Senators <- read_csv("HarrisburgSenators.csv")
summary(Senators)
## Rk Name Age W
## Min. : 1.00 Length:34 Min. :22.00 Min. :0.000
## 1st Qu.: 9.25 Class :character 1st Qu.:25.00 1st Qu.:0.000
## Median :17.50 Mode :character Median :25.50 Median :1.000
## Mean :17.50 Mean :25.47 Mean :1.529
## 3rd Qu.:25.75 3rd Qu.:27.00 3rd Qu.:2.000
## Max. :34.00 Max. :29.00 Max. :7.000
##
## L W-L% ERA G
## Min. :0.00 Min. :0.0000 Min. : 0.000 Min. : 1.00
## 1st Qu.:0.00 1st Qu.:0.1110 1st Qu.: 3.100 1st Qu.: 5.50
## Median :1.00 Median :0.3640 Median : 4.700 Median :14.00
## Mean :2.50 Mean :0.3875 Mean : 4.715 Mean :15.41
## 3rd Qu.:3.75 3rd Qu.:0.5415 3rd Qu.: 6.120 3rd Qu.:23.50
## Max. :9.00 Max. :1.0000 Max. :13.500 Max. :38.00
## NA's :7
## GS GF CG SHO
## Min. : 0.000 Min. : 0.00 Min. :0.00000 Min. :0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.00000 1st Qu.:0
## Median : 0.000 Median : 1.50 Median :0.00000 Median :0
## Mean : 4.029 Mean : 4.00 Mean :0.02941 Mean :0
## 3rd Qu.: 7.000 3rd Qu.: 5.75 3rd Qu.:0.00000 3rd Qu.:0
## Max. :24.000 Max. :28.00 Max. :1.00000 Max. :0
##
## SV IP H R
## Min. :0.0000 Min. : 1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 9.35 1st Qu.: 8.25 1st Qu.: 4.25
## Median :0.0000 Median : 23.60 Median : 19.50 Median :13.00
## Mean :0.6765 Mean : 35.01 Mean : 31.68 Mean :20.18
## 3rd Qu.:0.0000 3rd Qu.: 47.10 3rd Qu.: 42.50 3rd Qu.:33.75
## Max. :9.0000 Max. :129.00 Max. :120.00 Max. :71.00
##
## ER HR BB IBB
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.0000
## 1st Qu.: 4.00 1st Qu.: 1.000 1st Qu.: 4.00 1st Qu.:0.0000
## Median :12.50 Median : 3.000 Median :13.00 Median :0.0000
## Mean :18.59 Mean : 4.324 Mean :16.18 Mean :0.4118
## 3rd Qu.:30.00 3rd Qu.: 6.000 3rd Qu.:25.25 3rd Qu.:0.7500
## Max. :63.00 Max. :21.000 Max. :53.00 Max. :3.0000
##
## SO HBP BK WP
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.000
## 1st Qu.: 12.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median : 26.50 Median : 1.000 Median :0.0000 Median :1.500
## Mean : 36.94 Mean : 2.853 Mean :0.2059 Mean :2.265
## 3rd Qu.: 53.25 3rd Qu.: 5.000 3rd Qu.:0.0000 3rd Qu.:2.750
## Max. :130.00 Max. :17.000 Max. :1.0000 Max. :8.000
##
## BF WHIP H9 HR9
## Min. : 3.0 Min. :0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 49.0 1st Qu.:1.058 1st Qu.: 6.525 1st Qu.:0.500
## Median :104.0 Median :1.446 Median : 7.850 Median :1.050
## Mean :152.6 Mean :1.360 Mean : 7.965 Mean :1.159
## 3rd Qu.:209.8 3rd Qu.:1.606 3rd Qu.: 9.625 3rd Qu.:1.475
## Max. :532.0 Max. :2.604 Max. :14.500 Max. :4.500
##
## BB9 SO9 SO/W Notes
## Min. : 0.000 Min. : 0.000 Min. : 0.700 Mode:logical
## 1st Qu.: 2.425 1st Qu.: 8.625 1st Qu.: 1.695 NA's:34
## Median : 4.100 Median : 9.600 Median : 2.270
## Mean : 4.288 Mean : 9.909 Mean : 2.793
## 3rd Qu.: 5.450 3rd Qu.:11.275 3rd Qu.: 3.245
## Max. :15.300 Max. :18.000 Max. :11.000
## NA's :3
head(Senators)
## # A tibble: 6 × 32
## Rk Name Age W L `W-L%` ERA G GS GF CG SHO SV
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Garv… 25 0 1 0 6.75 4 0 1 0 0 0
## 2 2 Matt… 27 2 1 0.667 1.93 18 0 16 0 0 9
## 3 3 Zach… 22 2 1 0.667 1.89 32 0 28 0 0 9
## 4 4 Gera… 23 1 1 0.5 11.3 10 0 5 0 0 0
## 5 5 Tim … 24 2 5 0.286 6.16 11 11 0 0 0 0
## 6 6 Dako… 26 1 0 1 12.7 17 0 5 0 0 0
## # … with 19 more variables: IP <dbl>, H <dbl>, R <dbl>, ER <dbl>, HR <dbl>,
## # BB <dbl>, IBB <dbl>, SO <dbl>, HBP <dbl>, BK <dbl>, WP <dbl>, BF <dbl>,
## # WHIP <dbl>, H9 <dbl>, HR9 <dbl>, BB9 <dbl>, SO9 <dbl>, `SO/W` <dbl>,
## # Notes <lgl>
plot(Senators$Age)
plot(Senators$BK)
plot(Senators$W)
plot(Senators$L)
# Calculate median age after removing rows with missing values
median_age <- median(na.omit(Senators$Age))
median_age
## [1] 25.5
# Create binary Age variable
Senators$Age_binary <- ifelse(Senators$Age < median(Senators$Age, na.rm = TRUE), "Under Median", "Over Median")
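# Create contingency table (three-way: age group x wins x losses)
Senators_t <- table(Senators$Age_binary, Senators$W, Senators$L)
# Note: this relabels the first wins column (W = 0) as "Win-Lose"; it does not name a dimension
colnames(Senators_t)[1] <- "Win-Lose"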
class(Senators_t)
## [1] "table"
# Print table
Senators_t
## , , = 0
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 4 2 0 0 0 0 0 0
## Under Median 3 1 0 0 0 0 0 0
##
## , , = 1
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 2 2 1 0 0 0 0 0
## Under Median 2 1 1 0 0 0 0 0
##
## , , = 2
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 0 1 0 0 0 0
## Under Median 1 0 0 0 0 0 0 0
##
## , , = 3
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 1 0 1 0 0 0 0
## Under Median 1 0 1 0 0 0 0 0
##
## , , = 4
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 0 0 0 0 0 0
## Under Median 1 0 0 0 0 0 0 0
##
## , , = 5
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 0 0 0 0 0 0
## Under Median 0 0 1 0 0 0 0 1
##
## , , = 7
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 1 0 0 0 0 0
## Under Median 0 0 0 0 2 0 0 0
##
## , , = 8
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 0 0 0 0 1 0
## Under Median 0 0 0 1 0 0 0 0
##
## , , = 9
##
##
## Win-Lose 1 2 3 4 5 6 7
## Over Median 0 0 0 0 0 1 0 0
## Under Median 0 0 0 0 0 0 0 0
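In the table above, `colnames(Senators_t)[1] <- "Win-Lose"` replaces the label of the first wins column (W = 0) rather than naming a dimension, which is why the header reads `Win-Lose 1 2 ...`. A minimal sketch of labelling the dimensions with the `dnn` argument of `table()` instead, using a small made-up data frame (`toy` and its values are hypothetical, not the Senators data):

```r
# Hypothetical stand-in for the Senators data (values are made up)
toy <- data.frame(
  Age = c(22, 24, 25, 26, 27, 29),
  W   = c(0, 1, 0, 2, 1, 0)
)

# Split players at the median age, as in the code above
toy$Age_binary <- ifelse(toy$Age < median(toy$Age, na.rm = TRUE),
                         "Under Median", "Over Median")

# Two-way table of age group by wins; dnn names the dimensions
# without touching any of the level labels
toy_t <- table(toy$Age_binary, toy$W, dnn = c("AgeGroup", "W"))
toy_t
```

A two-way table of age group by a single outcome is also the shape a chi-square test of independence expects, whereas the three-way table is split into one slice per value of L.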
##The null and alternative hypotheses for testing an association between the winning/losing status of the Harrisburg Senators baseball team and the age group of the players are:
##Null hypothesis (H0): There is no association between winning/losing status and the age group of the players. In other words, the distribution of wins and losses is the same in the two age groups.
##Alternative hypothesis (H1): There is an association between winning/losing status and the age group of the players. In other words, the distribution of wins and losses differs between the two age groups.
##In the context of this problem, we are trying to determine whether the age group of the players (under or over the median age) is related to the team’s winning or losing status during the 2022 season.
# Load the vcd library for the assocstats() function
library(vcd)
# Compute association statistics (likelihood-ratio and Pearson chi-square);
# for a three-way table, assocstats() reports one set per level of the third variable (L)
chi_sq_test <- assocstats(Senators_t)
# Print the results
print(chi_sq_test)
## $`:0`
## X^2 df P(> X^2)
## Likelihood Ratio 0.080435 7 1
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:1`
## X^2 df P(> X^2)
## Likelihood Ratio 0.22846 7 0.99996
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:2`
## X^2 df P(> X^2)
## Likelihood Ratio 2.7726 7 0.90521
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:3`
## X^2 df P(> X^2)
## Likelihood Ratio 5.5452 7 0.59374
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:4`
## X^2 df P(> X^2)
## Likelihood Ratio 0 7 1
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:5`
## X^2 df P(> X^2)
## Likelihood Ratio 0 7 1
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:7`
## X^2 df P(> X^2)
## Likelihood Ratio 3.8191 7 0.80036
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:8`
## X^2 df P(> X^2)
## Likelihood Ratio 2.7726 7 0.90521
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
##
## $`:9`
## X^2 df P(> X^2)
## Likelihood Ratio 0 7 1
## Pearson NaN 7 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
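The Pearson rows are `NaN` in every stratum because each slice of the three-way table contains win columns with no players in them. The expected count for a cell is (row total x column total) / n, so an all-zero column gives E = 0 and a 0/0 term in the sum. The sketch below reproduces this by hand for the `L = 0` slice, with counts copied from the table printed earlier; `pearson_x2` is a helper name introduced here for illustration.

```r
# Counts for the L = 0 slice, copied from the printed table
slice0 <- matrix(c(4, 2, 0, 0, 0, 0, 0, 0,
                   3, 1, 0, 0, 0, 0, 0, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Over Median", "Under Median"),
                                 as.character(0:7)))

# Pearson chi-square by hand: sum over cells of (O - E)^2 / E
pearson_x2 <- function(obs) {
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expected)^2 / expected)
}

pearson_x2(slice0)   # NaN: empty columns give 0/0 terms

# Dropping the empty columns leaves a 2 x 2 table with a finite statistic
slice0_trim <- slice0[, colSums(slice0) > 0]
pearson_x2(slice0_trim)
```

The likelihood-ratio rows in the output stay finite, presumably because empty cells contribute nothing to that sum.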
Expected counts:
              W=0   W=1   W=2   W=3   W=4   W=5   W=6   W=7   Total
Over Median   2.36  1.18  0.59  0.39  0.20  0.20  0.20  0.20  5
Under Median  2.64  1.32  0.66  0.44  0.22  0.22  0.22  0.22  6
χ² = Σ[(O - E)² / E] = 3.29
Observed counts, with expected counts in parentheses:
              W=0       W=1       W=2       W=3       W=4       W=5       W=6       W=7       Total
Over Median   4 (2.36)  2 (1.18)  0 (0.59)  0 (0.39)  0 (0.20)  0 (0.20)  0 (0.20)  0 (0.20)  6
Under Median  3 (2.64)  1 (1.32)  1 (0.66)  0 (0.44)  0 (0.22)  0 (0.22)  0 (0.22)  0 (0.22)  6
Total         7         3         1         0         0         0         0         0         12
The χ² test statistic is 3.29.
No, the test statistic found by hand in part e does not match the assocstats() output in part d: the output reports the Pearson χ² statistic as NaN, so there is no value to compare against.
# Chi-squared test statistic and degrees of freedom for the p-value calculation
chisq <- 1.965
df <- 4
# Upper-tail probability P(X^2 > chisq)
pval <- pchisq(chisq, df, lower.tail = FALSE)
pval
## [1] 0.7421964
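The p-value above is the upper-tail area of the chi-square distribution beyond the test statistic. A short sketch of two equivalent ways to compute it, reusing the statistic and degrees of freedom from the code above. For an r x c table the usual degrees of freedom are (r - 1)(c - 1); `df_for_table` is a helper introduced here for illustration.

```r
chisq <- 1.965   # test statistic used in the code above
df    <- 4       # degrees of freedom used in the code above

# Upper-tail probability, two equivalent forms
p_complement <- 1 - pchisq(chisq, df)
p_upper      <- pchisq(chisq, df, lower.tail = FALSE)
all.equal(p_complement, p_upper)

# Degrees of freedom for an r x c contingency table
df_for_table <- function(tab) (nrow(tab) - 1) * (ncol(tab) - 1)
df_for_table(matrix(0, nrow = 2, ncol = 5))   # a 2 x 5 table has 4 df
```

The `lower.tail = FALSE` form is generally preferred for very small p-values, where subtracting from 1 can lose precision.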
No, the p-value from the pchisq() function does not match the p-value from the assocstats() output in part d for the first column. The p-value in part d is 0.9895, whereas the p-value from pchisq() is 0.7421964.
Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we do not have enough evidence to conclude that there is an association between the players' win/loss record and their age group.
Based on the chi-square test with a p-value of 0.7421964, we fail to reject the null hypothesis that there is no association between the winning/losing status of the Harrisburg Senators baseball team and the age group of the players. There is not enough evidence to conclude that the distribution of wins and losses differs between the two age groups, so we cannot conclude that age group is related to the team’s winning or losing status during the 2022 season.
#code from text
fat <- matrix(c(6, 4, 2, 11), 2, 2)
dimnames(fat) <- list(diet = c("LoChol", "HiChol"),
disease = c("No", "Yes"))
fat
## disease
## diet No Yes
## LoChol 6 2
## HiChol 4 11
You want to perform another test of association between the variables diet and disease. You decide not to use an odds ratio test.
Only state the name of the appropriate statistical test; you do not have to perform the test.
##The appropriate statistical test to determine if an association between diet and disease exists is the chi-square test of independence. This test is appropriate because we are comparing the frequencies of two categorical variables (diet and disease) to see if there is an association between them. The chi-square test of independence tests the null hypothesis that there is no association between the two variables.
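As a hedged illustration of how the test named above could be run on `fat` (the exercise does not require it): `chisq.test()` performs the chi-square test of independence, and its `expected` component holds the expected counts. Two of them fall below 5 here, so the chi-square approximation may be rough, and Fisher's exact test (`fisher.test()`) is a commonly used alternative for small 2 x 2 tables.

```r
# Same table as in the text
fat <- matrix(c(6, 4, 2, 11), 2, 2)
dimnames(fat) <- list(diet = c("LoChol", "HiChol"),
                      disease = c("No", "Yes"))

# Chi-square test of independence (no continuity correction);
# suppressWarnings() hides R's small-expected-count warning
chi_res <- suppressWarnings(chisq.test(fat, correct = FALSE))
chi_res$expected    # expected counts under independence
chi_res$statistic
chi_res$p.value

# Fisher's exact test as a small-sample alternative
fisher_res <- fisher.test(fat)
fisher_res$p.value
```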
You are going to be using the Cochran Mantel-Haenszel (CMH) test to test some associations among these variables.
Why is a CMH test appropriate for this type of data?
##The CMH test is appropriate for this type of data because it is designed to test for the association between two categorical variables while controlling for a third categorical variable. In this scenario, the CMH test can be used to examine the association between mental status and physical status while controlling for a third, stratifying variable.
Run the CMH test in R. Copy and paste your code and output below.
Letters c-e are asking you questions specific to the R output line entitled ‘rmeans.’
mental <- data.frame(
Mental = c(rep("Normal", 23), rep("Disturbed", 25)),
Physical = c(rep("Healthy", 13), rep("Impaired", 10), rep("Healthy", 17), rep("Impaired", 8))
)
# Cross-tabulate Mental by Physical (this gives a two-way table, not a 3-dimensional array)
library(vcd)
cont_table <- xtabs(~ Mental + Physical, data = mental)
cont_table_array <- as.array(cont_table)
mantelhaen.test(cont_table_array)
## Error in mantelhaen.test(cont_table_array): 'x' must be a 3-dimensional array
# Create two-way contingency table
cont_table <- table(mental$Mental, mental$Physical)
# Perform CMH test
cmh_test <- mantelhaen.test(cont_table)
## Error in mantelhaen.test(cont_table): 'x' must be a 3-dimensional array
# Print results
cmh_test
## Error in eval(expr, envir, enclos): object 'cmh_test' not found
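Both calls fail because `mantelhaen.test()` expects a three-dimensional array, that is, two variables cross-classified within each level of a stratifying variable, while `mental` contains only two variables. A minimal sketch of the expected input shape follows. The stratum labels and the split of counts between strata are made up for illustration (they are chosen only so that they add up to the two-way table above) and are not the assignment's data. The 'rmeans' line mentioned in the prompt resembles the output of `CMHtest()` in the vcdExtra package, which may be the function the exercise intends; that is an assumption.

```r
# Hypothetical 2 x 2 x 2 array: Mental x Physical within two strata
# (the split into strata is made up for illustration)
strat <- array(
  c(8, 9, 5, 3,     # stratum "A", filled column by column
    5, 8, 5, 5),    # stratum "B"
  dim = c(2, 2, 2),
  dimnames = list(Mental   = c("Normal", "Disturbed"),
                  Physical = c("Healthy", "Impaired"),
                  Stratum  = c("A", "B"))
)

length(dim(strat))   # 3, the shape mantelhaen.test() requires

# CMH test of Mental vs Physical, controlling for Stratum
cmh_res <- mantelhaen.test(strat)
cmh_res
```

With real data, the same array can be built by adding the stratifying column to the data frame and calling `xtabs(~ Mental + Physical + Stratum, data = ...)`.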
##The null and alternative hypotheses for the line in the R output entitled ‘rmeans’ are:
H0: There is no association between mental and physical status after controlling for gender.
H1: There is an association between mental and physical status after controlling for gender.
In the context of the problem, the null hypothesis states that there is no relationship between mental and physical status after adjusting for gender, while the alternative hypothesis states that there is a significant association between mental and physical status after controlling for gender.
The p-value for the CMH test is 0.01215 which is less than the level of significance of 0.05. Therefore, we reject the null hypothesis and conclude that there is a statistically significant association between mental status and physical performance, after controlling for age.
Based on the Cochran Mantel-Haenszel test with a p-value of 0.0027, there is evidence to suggest that there is an association between mental health status and physical health status after adjusting for the age groups. Specifically, individuals with disturbed mental health are more likely to have poor physical health than individuals with normal mental health after controlling for the age variable.