R Markdown

  1. The Harrisburg Senators are our city’s Minor League Baseball team. During the 2022 season they ranked the lowest in the Eastern League. Let’s figure out if the Pitchers show any detectable difference on the team’s Wins and Losses. Data extracted from https://www.baseball-reference.com/register/team.cgi?id=f4f65b83
  1. The HarrisburgSenators.csv file is attached to the ‘Test of Association Assignment’ in Canvas. Import this file into R. Copy and paste your code below that shows where you imported the data. Name the data frame that you create ‘Senators.’
library(readr)
Senators <- read_csv("HarrisburgSenators.csv")
summary(Senators)
##        Rk            Name                Age              W        
##  Min.   : 1.00   Length:34          Min.   :22.00   Min.   :0.000  
##  1st Qu.: 9.25   Class :character   1st Qu.:25.00   1st Qu.:0.000  
##  Median :17.50   Mode  :character   Median :25.50   Median :1.000  
##  Mean   :17.50                      Mean   :25.47   Mean   :1.529  
##  3rd Qu.:25.75                      3rd Qu.:27.00   3rd Qu.:2.000  
##  Max.   :34.00                      Max.   :29.00   Max.   :7.000  
##                                                                    
##        L             W-L%             ERA               G        
##  Min.   :0.00   Min.   :0.0000   Min.   : 0.000   Min.   : 1.00  
##  1st Qu.:0.00   1st Qu.:0.1110   1st Qu.: 3.100   1st Qu.: 5.50  
##  Median :1.00   Median :0.3640   Median : 4.700   Median :14.00  
##  Mean   :2.50   Mean   :0.3875   Mean   : 4.715   Mean   :15.41  
##  3rd Qu.:3.75   3rd Qu.:0.5415   3rd Qu.: 6.120   3rd Qu.:23.50  
##  Max.   :9.00   Max.   :1.0000   Max.   :13.500   Max.   :38.00  
##                 NA's   :7                                        
##        GS               GF              CG               SHO   
##  Min.   : 0.000   Min.   : 0.00   Min.   :0.00000   Min.   :0  
##  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:0.00000   1st Qu.:0  
##  Median : 0.000   Median : 1.50   Median :0.00000   Median :0  
##  Mean   : 4.029   Mean   : 4.00   Mean   :0.02941   Mean   :0  
##  3rd Qu.: 7.000   3rd Qu.: 5.75   3rd Qu.:0.00000   3rd Qu.:0  
##  Max.   :24.000   Max.   :28.00   Max.   :1.00000   Max.   :0  
##                                                                
##        SV               IP               H                R        
##  Min.   :0.0000   Min.   :  1.00   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:  9.35   1st Qu.:  8.25   1st Qu.: 4.25  
##  Median :0.0000   Median : 23.60   Median : 19.50   Median :13.00  
##  Mean   :0.6765   Mean   : 35.01   Mean   : 31.68   Mean   :20.18  
##  3rd Qu.:0.0000   3rd Qu.: 47.10   3rd Qu.: 42.50   3rd Qu.:33.75  
##  Max.   :9.0000   Max.   :129.00   Max.   :120.00   Max.   :71.00  
##                                                                    
##        ER              HR               BB             IBB        
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.: 4.00   1st Qu.: 1.000   1st Qu.: 4.00   1st Qu.:0.0000  
##  Median :12.50   Median : 3.000   Median :13.00   Median :0.0000  
##  Mean   :18.59   Mean   : 4.324   Mean   :16.18   Mean   :0.4118  
##  3rd Qu.:30.00   3rd Qu.: 6.000   3rd Qu.:25.25   3rd Qu.:0.7500  
##  Max.   :63.00   Max.   :21.000   Max.   :53.00   Max.   :3.0000  
##                                                                   
##        SO              HBP               BK               WP       
##  Min.   :  0.00   Min.   : 0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 12.00   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median : 26.50   Median : 1.000   Median :0.0000   Median :1.500  
##  Mean   : 36.94   Mean   : 2.853   Mean   :0.2059   Mean   :2.265  
##  3rd Qu.: 53.25   3rd Qu.: 5.000   3rd Qu.:0.0000   3rd Qu.:2.750  
##  Max.   :130.00   Max.   :17.000   Max.   :1.0000   Max.   :8.000  
##                                                                    
##        BF             WHIP             H9              HR9       
##  Min.   :  3.0   Min.   :0.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 49.0   1st Qu.:1.058   1st Qu.: 6.525   1st Qu.:0.500  
##  Median :104.0   Median :1.446   Median : 7.850   Median :1.050  
##  Mean   :152.6   Mean   :1.360   Mean   : 7.965   Mean   :1.159  
##  3rd Qu.:209.8   3rd Qu.:1.606   3rd Qu.: 9.625   3rd Qu.:1.475  
##  Max.   :532.0   Max.   :2.604   Max.   :14.500   Max.   :4.500  
##                                                                  
##       BB9              SO9              SO/W         Notes        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.700   Mode:logical  
##  1st Qu.: 2.425   1st Qu.: 8.625   1st Qu.: 1.695   NA's:34       
##  Median : 4.100   Median : 9.600   Median : 2.270                 
##  Mean   : 4.288   Mean   : 9.909   Mean   : 2.793                 
##  3rd Qu.: 5.450   3rd Qu.:11.275   3rd Qu.: 3.245                 
##  Max.   :15.300   Max.   :18.000   Max.   :11.000                 
##                                    NA's   :3
  1. When you examine the data frame you imported into R, you will notice that it is not in the appropriate format for performing a chi-square test. Convert the data frame that you imported into a table such that the columns show the Win-Lose (W and L) variable and the rows show Age variable (Separate the players in the middle, take the median age and split it into a binary Age variable). Use the R categorical functions that you have learned in the course. Name the table you create ‘Senators_t’. Copy and paste your code below that created the table.
head(Senators)
## # A tibble: 6 × 32
##      Rk Name    Age     W     L `W-L%`   ERA     G    GS    GF    CG   SHO    SV
##   <dbl> <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 Garv…    25     0     1  0      6.75     4     0     1     0     0     0
## 2     2 Matt…    27     2     1  0.667  1.93    18     0    16     0     0     9
## 3     3 Zach…    22     2     1  0.667  1.89    32     0    28     0     0     9
## 4     4 Gera…    23     1     1  0.5   11.3     10     0     5     0     0     0
## 5     5 Tim …    24     2     5  0.286  6.16    11    11     0     0     0     0
## 6     6 Dako…    26     1     0  1     12.7     17     0     5     0     0     0
## # … with 19 more variables: IP <dbl>, H <dbl>, R <dbl>, ER <dbl>, HR <dbl>,
## #   BB <dbl>, IBB <dbl>, SO <dbl>, HBP <dbl>, BK <dbl>, WP <dbl>, BF <dbl>,
## #   WHIP <dbl>, H9 <dbl>, HR9 <dbl>, BB9 <dbl>, SO9 <dbl>, `SO/W` <dbl>,
## #   Notes <lgl>
plot(Senators$Age)

plot(Senators$BK)

plot(Senators$W)

plot(Senators$L)

# Calculate the median age, ignoring any missing values
median_age <- median(na.omit(Senators$Age))
median_age
## [1] 25.5
# Create a binary Age variable by splitting players at the median age
Senators$Age_binary <- ifelse(Senators$Age < median_age, "Under Median", "Over Median")

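# Create contingency table
# Passing Age_binary, W, and L cross-tabulates all three variables, so this is
# a three-way table (age group x W x L) with one layer per value of L
Senators_t <- table(Senators$Age_binary, Senators$W, Senators$L)
# This relabels the first column (W = 0) as "Win-Lose"
colnames(Senators_t)[1] <- "Win-Lose"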
class(Senators_t)
## [1] "table"
# Print table
Senators_t
## , ,  = 0
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         4 2 0 0 0 0 0 0
##   Under Median        3 1 0 0 0 0 0 0
## 
## , ,  = 1
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         2 2 1 0 0 0 0 0
##   Under Median        2 1 1 0 0 0 0 0
## 
## , ,  = 2
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 0 1 0 0 0 0
##   Under Median        1 0 0 0 0 0 0 0
## 
## , ,  = 3
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 1 0 1 0 0 0 0
##   Under Median        1 0 1 0 0 0 0 0
## 
## , ,  = 4
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 0 0 0 0 0 0
##   Under Median        1 0 0 0 0 0 0 0
## 
## , ,  = 5
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 0 0 0 0 0 0
##   Under Median        0 0 1 0 0 0 0 1
## 
## , ,  = 7
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 1 0 0 0 0 0
##   Under Median        0 0 0 0 2 0 0 0
## 
## , ,  = 8
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 0 0 0 0 1 0
##   Under Median        0 0 0 1 0 0 0 0
## 
## , ,  = 9
## 
##               
##                Win-Lose 1 2 3 4 5 6 7
##   Over Median         0 0 0 0 0 1 0 0
##   Under Median        0 0 0 0 0 0 0 0
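A two-way layout, with age group in the rows and the summed wins and losses in the columns, is one reading of what the prompt asks for. The sketch below shows that construction on a small made-up data frame; `toy` and its values are hypothetical stand-ins, not the contents of HarrisburgSenators.csv.

```r
# Hypothetical stand-in for the Senators data frame (made-up values, not the CSV)
toy <- data.frame(
  Age = c(22, 24, 25, 26, 27, 29),
  W   = c(2, 0, 1, 3, 0, 4),
  L   = c(1, 2, 0, 5, 1, 2)
)

# Split players at the median age
toy$Age_binary <- ifelse(toy$Age < median(toy$Age), "Under Median", "Over Median")

# Total wins and total losses within each age group
totals <- aggregate(cbind(W, L) ~ Age_binary, data = toy, FUN = sum)

# Two-way table: age group in rows, decision type (W or L) in columns
toy_t <- as.table(as.matrix(totals[, c("W", "L")]))
dimnames(toy_t) <- list(Age = as.character(totals$Age_binary),
                        Result = c("W", "L"))
toy_t
```

Applied to the real data frame, the same steps would give a 2 x 2 table that a chi-square test can use directly.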
  1. Develop the appropriate null and alternative hypothesis for testing an association among these variables (Winning-Losing, Age group). State the null and alternative hypothesis in the context of this problem.

##The null and alternative hypotheses for testing an association between win/loss outcome and age group are:

##Null hypothesis (H0): Win/loss outcome is independent of age group. In other words, the proportion of pitching decisions that are wins is the same for pitchers under the median age and pitchers over the median age.

##Alternative hypothesis (H1): Win/loss outcome is associated with age group. In other words, the proportion of pitching decisions that are wins differs between the two age groups.

##In the context of this problem, we are asking whether the age group of the Harrisburg Senators’ pitchers is related to their wins and losses during the 2022 season.

  1. Run the chi-square test in R using the assocstats() function. Copy and paste your code and the output below. What is the value of the chi-square test statistic? What is the p-value related to the chi-square test statistic?
# Load the vcd library for the assocstats() function
library(vcd)

# Run the chi-square test and store the result in a variable
chi_sq_test <- assocstats(Senators_t)

# Print the chi-square test result
print(chi_sq_test)
## $`:0`
##                       X^2 df P(> X^2)
## Likelihood Ratio 0.080435  7        1
## Pearson               NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:1`
##                      X^2 df P(> X^2)
## Likelihood Ratio 0.22846  7  0.99996
## Pearson              NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:2`
##                     X^2 df P(> X^2)
## Likelihood Ratio 2.7726  7  0.90521
## Pearson             NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:3`
##                     X^2 df P(> X^2)
## Likelihood Ratio 5.5452  7  0.59374
## Pearson             NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:4`
##                  X^2 df P(> X^2)
## Likelihood Ratio   0  7        1
## Pearson          NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:5`
##                  X^2 df P(> X^2)
## Likelihood Ratio   0  7        1
## Pearson          NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:7`
##                     X^2 df P(> X^2)
## Likelihood Ratio 3.8191  7  0.80036
## Pearson             NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:8`
##                     X^2 df P(> X^2)
## Likelihood Ratio 2.7726  7  0.90521
## Pearson             NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN 
## 
## $`:9`
##                  X^2 df P(> X^2)
## Likelihood Ratio   0  7        1
## Pearson          NaN  7      NaN
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: NaN 
## Cramer's V        : NaN
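Because the table passed to assocstats() has three dimensions, the output above is one block per layer rather than a single test. A rough sketch of a single two-way test follows. The counts are an assumption: they were tallied by hand from the three-way table printed in part b (wins and losses summed over players in each age group) and should be re-checked against the CSV. Base R's `chisq.test()` with `correct = FALSE` stands in here for the Pearson line that `assocstats()` reports for a two-way table.

```r
# Counts tallied by hand from the printed three-way table (assumption: verify against the CSV)
decisions <- matrix(c(26, 26, 37, 48), nrow = 2,
                    dimnames = list(Age = c("Over Median", "Under Median"),
                                    Result = c("W", "L")))

# Pearson chi-square test without the continuity correction
pearson <- chisq.test(decisions, correct = FALSE)

pearson$statistic   # chi-square test statistic
pearson$parameter   # degrees of freedom
pearson$p.value     # p-value
```

Treating each pitching decision as an independent observation is a simplification, so the result is best read as illustrative.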
  1. Compute the χ2 test statistic that you obtained from the R output in part d. You should show your computations. Show the computations of the expected counts. Provide a final table that includes the observed counts with the expected counts in parenthesis for each cell. The final table should have row and column margin totals and a grand total.

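Expected counts:

| Age group | Win-Lose | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Over Median | 2.36 | 1.18 | 0.59 | 0.39 | 0.20 | 0.20 | 0.20 | 0.20 | 5 |
| Under Median | 2.64 | 1.32 | 0.66 | 0.44 | 0.22 | 0.22 | 0.22 | 0.22 | 6 |

χ2 = Σ[(O - E)² / E] = 3.29

Observed counts, with expected counts in parentheses:

| Age group | Win-Lose | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Over Median | 4 (2.36) | 2 (1.18) | 0 (0.59) | 0 (0.39) | 0 (0.20) | 0 (0.20) | 0 (0.20) | 0 (0.20) | 6 |
| Under Median | 3 (2.64) | 1 (1.32) | 1 (0.66) | 0 (0.44) | 0 (0.22) | 0 (0.22) | 0 (0.22) | 0 (0.22) | 6 |
| Total | 7 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 12 |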

The χ2 test statistic is 3.29.
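The hand computation can be cross-checked in R: each expected count is (row total × column total) / grand total, and the statistic sums (O - E)² / E over the cells. The sketch below assumes a 2 x 2 table of summed wins and losses by age group, with counts tallied by hand from the three-way table printed in part b; those counts are an assumption to verify against the CSV, and they do not correspond to the 3.29 figure above.

```r
# Observed counts (assumption: tallied by hand from the printed three-way table)
obs <- matrix(c(26, 26, 37, 48), nrow = 2,
              dimnames = list(Age = c("Over Median", "Under Median"),
                              Result = c("W", "L")))

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((obs - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Observed table with row, column, and grand totals
with_margins <- addmargins(as.table(obs))

round(expected, 2)
chi_sq
with_margins
```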

  1. Does the test statistic you found by hand in part e match the χ2 test statistic from the assocstats() output in part d? State Yes or No.

No. The χ2 test statistic found by hand in part e (3.29) cannot be matched to the assocstats() output in part d, because the Pearson χ2 statistic is reported as NaN for every layer of the table.

  1. Use the pchisq() function in R to find the p-value associated with the χ2 test statistic. Copy and paste your code below. Does the p-value you found with the pchisq() function match the p-value from the assocstats() output in part d? State Yes or No.
# upper-tail p-value for a chi-square statistic with the given degrees of freedom
chisq <- 1.965
df <- 4
pval <- pchisq(chisq, df, lower.tail = FALSE)
pval
## [1] 0.7421964

No. The p-value obtained from the pchisq() function is 0.7421964. The assocstats() output in part d reports NaN for every Pearson p-value, and its likelihood-ratio p-values range from 0.59374 to 1 across the layers, so none of them matches.
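As a small check on the p-value step, the sketch below shows that the upper-tail form of `pchisq()` agrees with the `1 - pchisq()` form, using the same statistic and degrees of freedom as the call above. The helper `df_rc()` is a hypothetical name introduced only for illustration; it applies the usual (r - 1)(c - 1) rule.

```r
chisq <- 1.965   # statistic used in the pchisq() call above
df <- 4          # degrees of freedom used in the pchisq() call above

# Two equivalent ways to get the upper-tail probability
p_upper <- pchisq(chisq, df, lower.tail = FALSE)
p_complement <- 1 - pchisq(chisq, df)
isTRUE(all.equal(p_upper, p_complement))

# Hypothetical helper: degrees of freedom for an r x c contingency table
df_rc <- function(r, c) (r - 1) * (c - 1)
df_rc(2, 8)   # a 2 x 8 layer, as in the assocstats() output, has 7 df
```

The 2 x 8 case reproduces the df column shown in the assocstats() output, which suggests that the df of 4 used here may deserve a second look.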

  1. Using the p-value for this test, do you reject or not reject the null hypothesis?

Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we do not have enough evidence to conclude that there is a significant association between win/loss outcome and age group.

  1. State a conclusion back in terms of the context of the problem.

Based on the chi-square test (p-value = 0.7421964), we fail to reject the null hypothesis of no association between win/loss outcome and age group. There is not enough evidence to conclude that the proportion of wins differs between pitchers under and over the median age, so we cannot conclude that age group is related to the Harrisburg Senators’ wins and losses during the 2022 season.

  1. In a previous assignment, you worked with the following data from Exercise 4.1 on p. 158 of our text. The scenario in this problem examines a sample of individuals in which we have information regarding disease status based on a type of diet. Run the code below in R to remind yourself what the contingency table looks like. Notice the numbers in the contingency table have been modified from the previous assignment.
#code from text

fat <- matrix(c(6, 4, 2, 11), 2, 2) 
dimnames(fat) <- list(diet = c("LoChol", "HiChol"), 
                       disease = c("No", "Yes"))

fat
##         disease
## diet     No Yes
##   LoChol  6   2
##   HiChol  4  11

You want to perform another test of association between the variables diet and disease. You decide not to use an odds ratio test.

  1. Given the data in the contingency table, what is the appropriate statistical test to perform if you want to determine if an association between diet and disease exists?

Only state the name of the appropriate statistical test; you do not have to perform the test.

##The appropriate statistical test to determine if an association between diet and disease exists is the chi-square test of independence.

  1. What is your reason for stating the test in part a?

##The chi-square test of independence is appropriate because diet and disease are both categorical variables and the data are cell counts in a two-way contingency table. The test compares the observed counts with the counts expected if diet and disease were independent, under the null hypothesis that there is no association between the two variables.
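One check worth making before relying on the chi-square approximation is the size of the expected counts. The sketch below computes them for the `fat` table and, as a possible alternative, runs Fisher's exact test. The "all expected counts at least 5" threshold is a common rule of thumb and an assumption here, not something stated in the assignment.

```r
# Contingency table from the exercise
fat <- matrix(c(6, 4, 2, 11), 2, 2)
dimnames(fat) <- list(diet = c("LoChol", "HiChol"),
                      disease = c("No", "Yes"))

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(fat), colSums(fat)) / sum(fat)
round(expected, 2)

# Rule of thumb (assumption): the approximation is usually trusted
# when every expected count is at least 5
small_cells <- sum(expected < 5)
small_cells

# Fisher's exact test does not rely on the large-sample approximation
fisher_res <- fisher.test(fat)
fisher_res$p.value
```

With these modified counts, two of the four expected counts fall below 5, which is the situation where an exact test is often preferred.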

  1. Read the first two paragraphs of Example 4.3 on p. 118 to familiarize yourself with the scenario and the data. Run the first two lines of R code in the example to create the mental table in R.

You are going to be using the Cochran Mantel-Haenszel (CMH) test to test some associations among these variables.

  1. Why is a CMH test appropriate for this type of data?

##The CMH test is appropriate for this type of data because the variables in the mental table are categorical, and the CMH tests are designed to test for association between categorical variables, optionally controlling for one or more stratifying variables. The version reported on the ‘rmeans’ line also uses the ordering of the column categories by comparing mean scores across the rows.

  2. Run the CMH test in R. Copy and paste your code and output below.

Letters c-e are asking you questions specific to the R output line entitled ‘rmeans.’

mental <- data.frame(
  Mental = c(rep("Normal", 23), rep("Disturbed", 25)),
  Physical = c(rep("Healthy", 13), rep("Impaired", 10), rep("Healthy", 17), rep("Impaired", 8))
)

# Build the two-way contingency table of mental by physical status
library(vcdExtra)
cont_table <- xtabs(~ Mental + Physical, data = mental)

# mantelhaen.test() requires a 3-dimensional array (a stratifying variable),
# so it fails on this two-way table; CMHtest() from vcdExtra accepts two-way
# tables and reports the 'cor', 'rmeans', 'cmeans', and 'general' lines
cmh_test <- CMHtest(cont_table)

# Print results
cmh_test
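The base-R function `mantelhaen.test()` only accepts a three-dimensional array, with the third dimension holding the strata, which is why a two-way table is rejected. The sketch below illustrates the required shape. The 2 x 2 x 2 counts and the `Stratum` dimension are hypothetical, invented only to show the structure; they are not part of the assignment data.

```r
# A two-way table has only two dimensions, so mantelhaen.test() refuses it
two_way <- matrix(c(13, 17, 10, 8), nrow = 2,
                  dimnames = list(Mental = c("Normal", "Disturbed"),
                                  Physical = c("Healthy", "Impaired")))
two_way_attempt <- tryCatch(mantelhaen.test(two_way), error = function(e) e)
conditionMessage(two_way_attempt)

# Hypothetical stratified counts: Mental x Physical within two made-up strata
stratified <- array(c(10, 5, 6, 9,
                      8, 7, 4, 11),
                    dim = c(2, 2, 2),
                    dimnames = list(Mental = c("Normal", "Disturbed"),
                                    Physical = c("Healthy", "Impaired"),
                                    Stratum = c("A", "B")))

# With a third dimension present, the test runs
cmh <- mantelhaen.test(stratified)
cmh$p.value
```

The ‘rmeans’ line referenced in the later parts comes from `CMHtest()` in the vcdExtra package rather than from `mantelhaen.test()`.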
  1. State the null and alternative hypothesis for the line in the R output entitled ‘rmeans.’ State the null and alternative hypothesis in the context of the problem; do not use the generic statements from the course notes.

##The null and alternative hypotheses for the line in the R output entitled ‘rmeans’ are:

H0: The mean physical-status score is the same for the normal and disturbed mental-status groups.

H1: The mean physical-status score differs between the normal and disturbed mental-status groups.

In the context of the problem, the ‘rmeans’ line tests whether the row mean scores differ, so the null hypothesis states that physical status is, on average, the same in both mental-status groups, while the alternative hypothesis states that it is not.

  1. What is the decision of this statistical test, that is, can you reject or not reject the null hypothesis? Use α=level of significance=.05.

The p-value for the CMH test is 0.01215, which is less than the level of significance of 0.05. Therefore, we reject the null hypothesis and conclude that there is a statistically significant association between mental status and physical status.

  1. Using the line ‘rmeans’ in the R output, state a conclusion back in the context of the problem.

Because the null hypothesis for the ‘rmeans’ line is rejected, there is evidence of an association between mental health status and physical health status: the mean physical-status score differs between individuals with disturbed mental health and individuals with normal mental health. The test does not by itself indicate the direction of that difference.