STAT 410 Lab 4

Group #3

Jackson Bain, Emily Hernandez Rincon, and Sergio Diaz

Exercise 1

The dataset HairEyeColor gives hair and eye color reported by students at the University of Delaware in the 1970s. The data is set up in a three-dimensional array with hair color being the first dimension, eye color the second, and sex being the third (or panel) variable. Use the code below to aggregate the data for both genders into a single matrix called HairEye.

  • Make a mosaic plot or stacked bar chart using base R commands visualizing the relationship between hair and eye color. A good choice for the colors argument is c("burlywood4", "cornflowerblue", "darkseagreen4", "forestgreen"). colors() will give you a list of colors available in base R. There are some other color schemes available too (check Color Palettes in the help menu). If you are making a stacked bar chart, you will need to use prop.table to get proportions before using barplot.

  • Make a mosaic plot without the color argument but with shade=TRUE. Explain what the shading shows you.

  • Determine if all the expected cell counts are at least 5 for the aggregated matrix HairEye. You only need to do this for the smallest expected cell count, so you can find the smallest row total and smallest column total and compute expected count for the corresponding cell. Another option is to run a chi-square test, saving the output as an object (perhaps called HEC), and then looking at the minimum of HEC$expected. attributes(HEC) is a way to see all the things that are stored in the object (not all are printed in the default output).

  • Conduct a test to determine if hair and eye colors are associated. Type out all five steps of the hypothesis test below your code chunk (state hypotheses, check assumptions, compute test statistic, compute p-value, and state conclusions in the context of the problem). Note that instead of copying the numeric values from your code chunk output, you can reference these values directly in the text. For example, I can put r nrow(haircolor) inside backticks to get the number 4. The inline code chunk will print the value when rendered. Use this technique when reporting the test statistic and p-value.
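The aggregation code referred to in the prompt is not shown in this rendered report. The chunk below is a minimal sketch of one way to carry out the steps above in base R; the use of apply() to collapse the sex dimension and the plot titles are assumptions, not necessarily what the assignment supplied.

```r
# Collapse the third (sex) dimension by summing counts over it,
# keeping hair (dimension 1) and eye (dimension 2).
HairEye <- apply(HairEyeColor, c(1, 2), sum)

# Mosaic plot with one color per eye-color level (Brown, Blue, Hazel, Green).
mosaicplot(HairEye,
           color = c("burlywood4", "cornflowerblue", "darkseagreen4", "forestgreen"),
           main = "Hair and Eye Color")

# Same plot without the color argument; shade = TRUE colors each tile
# by its residual from the independence model.
mosaicplot(HairEye, shade = TRUE, main = "Hair and Eye Color (shaded)")

# Chi-square test of independence, saved so its components can be reused.
HEC <- chisq.test(HairEye)
min(HEC$expected)   # smallest expected cell count
HEC                 # default printed output
HEC$statistic       # test statistic
HEC$p.value         # exact p-value (not truncated at 2.2e-16)
```

Saving the test as HEC is what makes inline reporting possible, since HEC$statistic and HEC$p.value can then be referenced directly in the text.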

[1] 7.675676

    Pearson's Chi-squared test

data:  HairEye
X-squared = 138.29, df = 9, p-value < 2.2e-16
X-squared 
 138.2898 
[1] 2.325287e-25

Hypothesis Testing

Step 1 Hypotheses:

Ho (null): Hair color and eye color have no association. Ha (alternative): Hair color and eye color are associated.

Step 2 Assumptions:

All expected cell counts should be at least 5; the smallest expected count is 7.6756757, so this condition is met. The observations are assumed to be independent.
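The smallest expected count can also be checked by hand using the shortcut from the exercise prompt: multiply the smallest row total by the smallest column total and divide by the grand total. A small sketch, again assuming HairEye is built with apply():

```r
# Aggregate over sex (assumed construction of HairEye).
HairEye <- apply(HairEyeColor, c(1, 2), sum)

# Smallest expected count = (smallest row total * smallest column total) / n
min_expected <- min(rowSums(HairEye)) * min(colSums(HairEye)) / sum(HairEye)
min_expected
```

This should agree with the minimum of HEC$expected reported above.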

Step 3 Test Statistic:

The chi-squared test statistic was 138.2898416.

Step 4 P-value:

The p-value of the test was 2.3252868 × 10^{-25}.
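Because the default printout only shows "p-value < 2.2e-16", the exact value has to come from the upper tail of the chi-square distribution. A minimal sketch using the reported statistic and df = (4 - 1)(4 - 1) = 9; the variable names are illustrative only:

```r
stat <- 138.2898416          # chi-square statistic reported in Step 3
df   <- (4 - 1) * (4 - 1)    # (rows - 1) * (columns - 1) for the 4 x 4 table

# Upper-tail probability of a chi-square(9) distribution at the statistic
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
format(p_value, scientific = TRUE)
```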

Step 5 Conclusion:

Since the p-value is 2.3252868 × 10^{-25}, which is less than 0.05, we reject the null hypothesis. There is statistically significant evidence that hair color and eye color are associated.

Exercise 2

The dataset foster in HSAUR3 gives genotypes for rat mothers and their litters.

  • Create a matrix called geno that gives counts of genotype for mother and litter in a two-way table
  • Conduct a test to determine if mother’s genotype and litter’s genotype are associated. Include all five steps of the hypothesis test (you might not have a value for step 3).
   
    A B I J
  A 5 4 3 4
  B 3 5 3 3
  I 4 4 5 3
  J 5 2 3 5

    Fisher's Exact Test for Count Data

data:  geno
p-value = 0.9593
alternative hypothesis: two.sided
[1] 0.9592519
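The output above can be reproduced without loading HSAUR3 by typing in the printed two-way table directly. This is a sketch only: the original chunk presumably built geno from the foster data frame, and the printed table does not say which dimension is mother and which is litter, so the dimensions are left unnamed here.

```r
# Counts copied from the printed table, entered row by row.
genotypes <- c("A", "B", "I", "J")
geno <- matrix(c(5, 4, 3, 4,
                 3, 5, 3, 3,
                 4, 4, 5, 3,
                 5, 2, 3, 5),
               nrow = 4, byrow = TRUE,
               dimnames = list(genotypes, genotypes))

# Fisher's exact test for an r x c table; the result is saved so the
# p-value can be referenced inline.
ft <- fisher.test(geno)
ft$p.value
```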

Hypothesis test

Step 1: State hypotheses

Ho: There is no association between the mother’s genotype and the litter’s genotype. Ha: There is an association between the mother’s genotype and the litter’s genotype.

Step 2: Select a test and check assumptions

For these data, it is appropriate to use Fisher's exact test. The expected cell counts are all below 5, so the chi-square test of independence would fail its expected-count assumption.

Assumptions: the observations are independent, the row and column totals are treated as fixed, and both variables are categorical.
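The low expected counts that motivate Fisher's exact test can be checked from the row and column totals of the table. A short sketch, with the counts typed in from the printed table rather than read from the foster data:

```r
# Counts copied from the printed two-way table, entered row by row.
geno <- matrix(c(5, 4, 3, 4,
                 3, 5, 3, 3,
                 4, 4, 5, 3,
                 5, 2, 3, 5),
               nrow = 4, byrow = TRUE)

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(geno), colSums(geno)) / sum(geno)
round(expected, 2)
range(expected)   # smallest and largest expected counts
```

Every entry of expected falls below 5, which is why the chi-square approximation is not trusted here.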

Step 3: Compute a test statistic

Fisher's exact test computes the p-value directly from the table, so there is no test statistic to report.

Step 4: Find the p-value

The p-value of this exact test was 0.9592519. Since this is well above 0.05, we fail to reject the null hypothesis.

Step 5: State conclusion in the context of the problem

There is not enough evidence to suggest an association between the genotype of the mother rat and the genotype of the litter.

Exercise 3

  • Load ggmosaic. Use a ggplot command with a geom_mosaic geometry to make a mosaic plot of the foster data. Inside the geometry should be aes(x=product(litgen, motgen), fill=litgen). Include a labs line to label the x axis “Mother”, the y axis “Litter” and the title “Rat Genotype”

Exercise 4

A medical journal reported the data below on frequencies of cardiac deaths by day of the week. Conduct a test at the .05 level of the null hypothesis that deaths are evenly distributed over the days of the week. Complete all five steps of the hypothesis test. Include a visualization.

Hypothesis Testing

Step 1: State hypotheses

Ho: Cardiac deaths are evenly distributed across the seven days of the week. Ha: Cardiac deaths are not evenly distributed across the seven days of the week.

Step 2: Select a test and check assumptions

This calls for a chi-squared goodness-of-fit test, since a single categorical variable is being compared to a hypothesized (uniform) distribution. Assumption: all expected counts are ≥ 5. This is met; each expected count was 22 deaths.

Step 3: Compute a test statistic

X-squared statistic: 23.273, df = 7 - 1 = 6

Step 4: Find the p-value

p-value: 0.0007101 (df = 6). Since 0.0007101 < 0.05, we reject the null hypothesis.
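The observed counts are not reproduced in this rendered report, so the sketch below works only from the reported numbers: an expected count of 22 per day and a statistic of 23.273. The variable names are illustrative.

```r
n_days           <- 7
expected_per_day <- 22                            # reported in Step 2
total_deaths     <- expected_per_day * n_days     # implied total number of deaths

stat <- 23.273                                    # statistic reported in Step 3
df   <- n_days - 1

# Upper-tail probability of a chi-square(6) distribution at the statistic
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
p_value
```

With the observed counts stored in a vector (say, a hypothetical deaths), chisq.test(deaths) would run the same goodness-of-fit test, since chisq.test() assumes equal probabilities for a vector by default.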

Step 5: State conclusion in the context of the problem

There is statistically significant evidence that cardiac deaths are not evenly distributed across the seven days of the week.