PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2022

Categorical Data

In this laboratory, we will: (1) create contingency tables, (2) calculate a chi-square statistic and a Cramer’s V, and (3) perform an ANOVA test and Tukey’s test.

Relevant functions: table(), prop.table(), chisq.test(), cramersv(), aov(), TukeyHSD().

1. Importing Data

We start by importing a dataset from a (fake) survey experiment testing whether exposure to positive or negative vignettes about the state of the Canadian economy impacts satisfaction with the federal government and a feeling thermometer for the country’s current prime minister.

Type of vignette (IV) \(\rightarrow\) Political attitudes (DV)

In order to test this relationship, we randomly assign 1,500 participants to one of three experimental treatment conditions: “positive” for the group exposed to the positive vignette, “negative” for the group exposed to the negative vignette, and “control” for the group exposed to no vignette at all.

Condition           n
Positive vignette   500
Negative vignette   500
Control group       500

Because the treatment condition is randomly assigned, we assume that any difference we observe across the scores of the three groups is explained only by the experimental treatment. That’s a property of random assignment: if your sample size is large enough, disparities in observed and unobserved covariates should be close to zero across all groups. Because of this property, if we observe a statistically significant difference in attitudes across the treatment conditions, we can make a causal inference about the effect of these vignettes.

# Set work directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading data
expData <- read.csv("expData.csv")

# Printing out the column names
colnames(expData)
## [1] "X"            "id"           "condition"    "satisfaction" "feelTherm"

Here is a description of the different variables within this dataset:

  • X: row name from the .csv dataset; numerical (discrete)

  • id: unique identification number; numerical (discrete)

  • condition: experimental treatment condition (“positive vignette”, “negative vignette”, “control”); categorical (nominal)

  • satisfaction: level of satisfaction with the current government at the federal level (“dissatisfied”, “neutral”, “satisfied”); categorical (ordinal)

  • feelTherm: feeling thermometer for the current PM of Canada, ranging from 0 to 100; numerical (continuous)

# Printing out the first 10 observations
head(expData,10)
##     X id condition satisfaction feelTherm
## 1   1  1  Negative Dissatisfied  37.62840
## 2   2  2  Negative      Neutral  29.77305
## 3   3  3  Negative Dissatisfied  47.29787
## 4   4  4  Positive    Satisfied  57.72648
## 5   5  5  Negative Dissatisfied  33.80386
## 6   6  6  Positive    Satisfied  58.74346
## 7   7  7   Control Dissatisfied  56.68478
## 8   8  8   Control    Satisfied  41.79225
## 9   9  9   Control    Satisfied  63.78172
## 10 10 10   Control    Satisfied  49.48885

2. Contingency Tables

The first thing we want to do is to display the distribution of our data in a table. Here, we’ll display the relationship between treatment condition and satisfaction with the current government, which are two categorical variables.

Note: you can also create such tables for numerical variables. As a matter of fact, we have done this before in the lab material for Week 4 (Estimation & Inference).
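
For instance, a numerical variable can first be binned into categories with cut() and then tabulated. A minimal sketch, using illustrative break points on the feeling thermometer:

# Binning the feeling thermometer into four intervals, then tabulating
table(cut(expData$feelTherm, breaks = c(0, 25, 50, 75, 100), include.lowest = TRUE))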

2.1 Raw Data

We use the table() function to display the distribution of the raw data. As a convention, we usually display the IV as the rows and the DV as the columns.

# Creating a table 
table(expData$condition,expData$satisfaction)
##           
##            Dissatisfied Neutral Satisfied
##   Control           132     198       170
##   Negative          262     145        93
##   Positive          111     205       184

2.2 Percentaging

We can also display the proportion of observations in each cell (overall), in each row (percentaging across), or in each column (percentaging down) using the prop.table() function. Note that prop.table() returns proportions; multiply by 100 to obtain percentages.

# Creating a table with overall percentages
prop.table(table(expData$condition,expData$satisfaction))
##           
##            Dissatisfied    Neutral  Satisfied
##   Control    0.08800000 0.13200000 0.11333333
##   Negative   0.17466667 0.09666667 0.06200000
##   Positive   0.07400000 0.13666667 0.12266667
# Creating a table with percentages by rows
prop.table(table(expData$condition,expData$satisfaction),1)
##           
##            Dissatisfied Neutral Satisfied
##   Control         0.264   0.396     0.340
##   Negative        0.524   0.290     0.186
##   Positive        0.222   0.410     0.368
# Creating a table with percentages by columns
prop.table(table(expData$condition,expData$satisfaction),2)
##           
##            Dissatisfied   Neutral Satisfied
##   Control     0.2613861 0.3613139 0.3803132
##   Negative    0.5188119 0.2645985 0.2080537
##   Positive    0.2198020 0.3740876 0.4116331
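
Since prop.table() returns proportions, a small optional step makes the row table easier to read as percentages (a minimal sketch):

# Converting row proportions to percentages, rounded to one decimal
round(100 * prop.table(table(expData$condition, expData$satisfaction), 1), 1)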

Just by looking at these tables, does it seem like there is a relationship between the treatment condition and satisfaction with the federal government?

3. Chi-Square

As you will remember from Monday’s lecture, the chi-square is calculated using the following formula:

\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \]

where

\[ f_o = \text{observed frequency in each cell} \] \[ f_e = \text{expected frequency in each cell} = \frac{\sum \text{row} \times \sum \text{column}}{n} \]
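
To see how this formula maps onto code, here is a minimal by-hand sketch (assuming expData is loaded as above); the result should match the chisq.test() output below:

# Computing the chi-square by hand from the contingency table
obs <- table(expData$condition, expData$satisfaction)     # f_o for each cell
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # f_e = row total * column total / n
sum((obs - expected)^2 / expected)                        # the chi-square statistic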

Calculating the chi-square using the chisq.test() function is quite practical, since you don’t have to do these steps by hand. Moreover, the R output automatically gives you the statistical significance of your chi-square, which means that you don’t have to look it up in a table.

# Performing the chi-square test
chisq.test(expData$condition, expData$satisfaction)
## 
##  Pearson's Chi-squared test
## 
## data:  expData$condition and expData$satisfaction
## X-squared = 123.5, df = 4, p-value < 2.2e-16

It’s as simple as that! The output gives us the chi-square statistic (123.5), the degrees of freedom (4), and the p-value (< 2.2e-16, i.e. < 0.00000000000000022).
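
Relatedly, the object returned by chisq.test() stores the expected frequencies \(f_e\), which is handy for checking the standard rule of thumb that all expected counts should be at least 5:

# Inspecting the expected frequencies used by the test
chisq.test(expData$condition, expData$satisfaction)$expected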

You can also calculate Cramer’s V (which evaluates the strength of the relationship between the two variables) using the cramersv() function from the confintr package.

# Loading the confintr package
library(confintr)

# Calculating Cramer's V
cramersv(table(expData$condition,expData$satisfaction))
## [1] 0.2028973
# Note that the order of the variables within the table()
# function does not impact the results
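
To see where this number comes from: Cramer’s V is defined as \(V = \sqrt{\chi^2 / (n \times m)}\), where \(m = \min(r - 1, c - 1)\) and \(r\) and \(c\) are the numbers of rows and columns in the table. A quick back-of-the-envelope check using the rounded chi-square from above:

# Cramer's V by hand: n = 1500 and min(r - 1, c - 1) = 2 here
sqrt(123.5 / (1500 * 2))  # roughly 0.2029, matching cramersv() up to rounding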

What do you conclude from looking at this test? And what is it impossible to conclude from this test alone?

4. ANOVA

When working with a categorical explanatory variable (IV) and a continuous dependent variable (DV), you can use an ANOVA (analysis of variance) test. In our case, this test allows us to assess whether the treatment condition variable (categorical) is related to the feeling thermometer variable (numerical), i.e. whether the mean thermometer score differs across the three groups.

We use the aov() function to perform this test.

# Running an ANOVA test
model <- aov(feelTherm ~ condition, expData) 

# Viewing our results using summary()
summary(model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## condition      2 104744   52372   495.5 <2e-16 ***
## Residuals   1497 158211     106                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like the relationship between the treatment condition and the feeling thermometer towards the PM is statistically significant. But wait: what is the direction of the relationship for each treatment condition?

We cannot know this simply by looking at the ANOVA. Why? Because the ANOVA only tells you whether the three groups (or however many treatment groups you have) differ significantly on your DV. It might be that the positive treatment had a statistically significant effect on the feeling thermometer, but not the negative one. It might also be that both had an effect. We just don’t know at this stage.
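
A quick way to see the direction of the group differences (though not whether each pairwise difference is significant) is to compare the group means; a minimal sketch:

# Mean feeling thermometer score per treatment condition
aggregate(feelTherm ~ condition, data = expData, FUN = mean)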

This is why we use Tukey’s Honest Significant Difference (HSD) test, which allows us to see which pairs of groups differ significantly from each other with regard to the DV.

# Checking whether the difference is significant across all pairs of conditions
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = feelTherm ~ condition, data = expData)
## 
## $condition
##                         diff        lwr        upr   p adj
## Negative-Control  -15.967048 -17.492409 -14.441686 0.0e+00
## Positive-Control    3.108001   1.582639   4.633362 5.7e-06
## Positive-Negative  19.075048  17.549687  20.600410 0.0e+00

Here is how to interpret these results:

  • the negative group has feeling thermometer scores that are lower than the control group’s by 15.967048 points on average, and this difference is statistically significant (p \(<\) 0.001).

  • the positive group has feeling thermometer scores that are higher than the control group’s by 3.108001 points on average, and this difference is also statistically significant (p = 0.0000057).

  • the positive group has feeling thermometer scores that are higher than the negative group’s by 19.075048 points on average, and this difference is also statistically significant (p \(<\) 0.001).
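
If you prefer a visual summary, base R can plot these pairwise differences with their family-wise confidence intervals directly:

# Plotting the pairwise mean differences and their 95% family-wise CIs
plot(TukeyHSD(model))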

Voilà! You are now able to analyse relationships using categorical variables in R.

Exercises

Anti-COVID Treatment \(\rightarrow\) Well-Being

Imagine an experiment in which 600 COVID-19 patients admitted to London’s University Hospital are randomly given either a new anti-COVID medication (“treatment A”), a cup of hot lemon tea with ginger and rum (my grandmother’s recommendation) (“treatment B”), or only a placebo pill (“control”).

  • Load the medicalData.csv file (available on OWL).

  • Using an ANOVA, test whether the treatment condition had a statistically significant effect on the number of days spent in the hospital (i.e. the “days” variable) and, if so, determine which pairs of treatment conditions differ significantly from each other.

# Running an ANOVA test
# model <- aov(...) 

# Viewing our results using summary()
# summary(model)

# Checking whether the difference is significant across all pairs of conditions
# TukeyHSD(model)

# Solution

# Loading data
medicalData <- read.csv("medicalData.csv")

# Running ANOVA model
model <- aov(days ~ condition, medicalData) 
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## condition     2   41.4  20.691   10.51 3.26e-05 ***
## Residuals   597 1175.1   1.968                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Running Tukey's test
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = days ~ condition, data = medicalData)
## 
## $condition
##                               diff        lwr         upr     p adj
## Treatment A-Control     -0.2924783 -0.6221151  0.03715854 0.0938557
## Treatment B-Control     -0.6424305 -0.9720673 -0.31279366 0.0000169
## Treatment B-Treatment A -0.3499522 -0.6795890 -0.02031537 0.0343887

  • One doctor claims that many patients who sought treatment at the hospital now complain of insomnia. We don’t know yet whether this is purely anecdotal or whether it reveals something more worrisome about one of the treatments that was administered. Test whether the treatment conditions had an impact on insomnia symptoms (i.e. the “insomnia” variable) using a chi-square test.

# Performing the chi-square test
# chisq.test(...)

# Solution

# Loading data
medicalData <- read.csv("medicalData.csv")

# Chi-square test
chisq.test(medicalData$condition, medicalData$insomnia)
## 
##  Pearson's Chi-squared test
## 
## data:  medicalData$condition and medicalData$insomnia
## X-squared = 5.5607, df = 2, p-value = 0.06202
# Extra: Cramer's V

# Loading the confintr package
library(confintr)

# Calculating Cramer's V
cramersv(table(medicalData$condition, medicalData$insomnia))
## [1] 0.09626975