Now that we can acquire and process data, let’s learn how to create reports!
This module covers:
- Conducting basic statistical tests on your data, including t-tests, chi-square tests, and Pearson correlations
- Creating publication-quality figures using the ggplot2 library
- Using templates to create reports in your organization’s style
A t-test helps us determine whether the difference in means between two groups, or between paired measurements, is statistically significant. Here are three common ways to run t-tests in R:
# Perform paired t-test (same students, different subjects)
t.test(synthetic_data$math_score, synthetic_data$english_score, paired = TRUE)
Paired t-test
data: synthetic_data$math_score and synthetic_data$english_score
t = -0.32909, df = 49, p-value = 0.7435
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-6.395881 4.595881
sample estimates:
mean difference
-0.9
# Same test, but using with() so columns can be referenced by name
# (the default t.test() method has no data argument):
with(synthetic_data, t.test(math_score, english_score, paired = TRUE))
Paired t-test
data: math_score and english_score
t = -0.32909, df = 49, p-value = 0.7435
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-6.395881 4.595881
sample estimates:
mean difference
-0.9
# Independent samples t-test (comparing two different groups):
t.test(math_score ~ secondary, data = synthetic_data)
Welch Two Sample t-test
data: math_score by secondary
t = -1.2987, df = 33.591, p-value = 0.2029
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-10.592478 2.335125
sample estimates:
mean in group FALSE mean in group TRUE
71.81250 75.94118
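The printed output is handy for reading, but for a report you will often want individual numbers. As a minimal sketch, the example below builds a small stand-in data frame (the name demo_data, the seed, and the score parameters are illustrative assumptions, not the synthetic_data used above) and pulls values out of the object that t.test() returns.

```r
# Hypothetical stand-in for synthetic_data; values are made up for illustration
set.seed(123)
demo_data <- data.frame(
  math_score    = round(rnorm(50, mean = 72, sd = 12)),
  english_score = round(rnorm(50, mean = 73, sd = 12)),
  secondary     = rep(c(TRUE, FALSE), length.out = 50)
)

# t.test() returns a list-like "htest" object; store it instead of printing it
paired_result <- t.test(demo_data$math_score, demo_data$english_score,
                        paired = TRUE)

paired_result$p.value    # p-value as a single number
paired_result$estimate   # mean of the paired differences
paired_result$conf.int   # 95% confidence interval for the mean difference

# The same approach works for the independent samples (Welch) test
welch_result <- t.test(math_score ~ secondary, data = demo_data)
welch_result$estimate    # mean of each group
```

Storing the result this way lets you insert a p-value or confidence interval into report text without copying numbers by hand.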
Correlation measures how strongly two variables are related to each other. Here we’ll create some example data and test the correlation:
# Create two related variables
x <- rnorm(50, mean = 70, sd = 10) # Test 1 scores
y <- x + rnorm(50, mean = 5, sd = 5) # Test 2 scores (related to Test 1)
# Calculate correlation and test statistical significance
cor.test(x, y)
Pearson's product-moment correlation
data: x and y
t = 13.51, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8127800 0.9362718
sample estimates:
cor
0.8898191
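The t statistic in this output comes directly from the correlation coefficient r and the sample size n, via t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below checks that relationship; the seed is an arbitrary choice added for reproducibility, so its numbers will differ from the output shown above.

```r
set.seed(42)  # arbitrary seed (an assumption) so the example is reproducible
x <- rnorm(50, mean = 70, sd = 10)
y <- x + rnorm(50, mean = 5, sd = 5)

result <- cor.test(x, y)

r <- unname(result$estimate)   # Pearson correlation coefficient
n <- length(x)

# Recompute the t statistic by hand and compare with cor.test()
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
all.equal(t_manual, unname(result$statistic))

# Degrees of freedom are n - 2
unname(result$parameter) == n - 2
```

Plugging in the values printed above (r = 0.8898191, n = 50) gives t of roughly 13.51, which matches the reported statistic.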
Chi-square tests help us understand if there’s a relationship between categorical variables:
# Create gender and program variables for our synthetic data
synthetic_data$gender <- sample(c("Male", "Female"), nrow(synthetic_data), replace = TRUE)
synthetic_data$program <- sample(c("Academic", "Applied"), nrow(synthetic_data), replace = TRUE)
# Create a contingency table (counts of each combination)
program_gender <- table(synthetic_data$program, synthetic_data$gender)
# Perform chi-square test of independence
chisq.test(program_gender)
Pearson's Chi-squared test with Yates' continuity correction
data: program_gender
X-squared = 0, df = 1, p-value = 1
           Female Male
Academic       15   13
Applied        12   10
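A statistic of exactly 0 can look suspicious, so it helps to compare the observed counts with the counts expected under independence. The sketch below rebuilds the table printed above by hand (the object name counts is illustrative) and inspects the expected values. It also reruns the test with correct = FALSE to show the effect of the Yates continuity correction that R applies to 2x2 tables by default.

```r
# Rebuild the 2x2 table shown above (rows = program, columns = gender)
counts <- matrix(
  c(15, 13,
    12, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(program = c("Academic", "Applied"),
                  gender  = c("Female", "Male"))
)

with_correction <- chisq.test(counts)   # default: Yates' correction for 2x2
with_correction$expected                # counts expected if independent

without_correction <- chisq.test(counts, correct = FALSE)
without_correction$statistic            # small but non-zero
without_correction$p.value
```

Here every observed count is within 0.5 of its expected count, so the continuity correction shrinks the statistic to essentially 0 and the p-value to 1. Either way, there is no evidence of a relationship, which is what we expect when both variables are sampled at random.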
One of R’s greatest strengths is its ability to create professional, publication-quality figures. We’ll use the ggplot2 package (part of tidyverse) along with a custom theme package that ensures our plots match our organization’s visual identity.
First, we need to install and load a custom package that provides our organization’s visual themes. This only needs to be done once on your computer:
# Install the package manager if you haven't already installed it
# install.packages("devtools")
# Install the TCDSB theme package
devtools::install_github("grousell/tcdsb")
Now, let’s load our required packages:
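The loading code is not shown at this point. A minimal sketch, assuming the later examples rely on the tidyverse (for mutate(), group_by(), summarise(), and ggplot()) and on the tcdsb package installed above, might look like this. The availability check is an addition for illustration, not part of the original workflow.

```r
# Packages the examples in this module appear to rely on (an assumption)
required_packages <- c("tidyverse", "tcdsb")

# Check which of them are not installed yet
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Install these packages first: ",
          paste(missing_packages, collapse = ", "))
} else {
  library(tidyverse)  # ggplot2, dplyr, and related packages
  library(tcdsb)      # organization theme and colors
}
```

Unlike installation, loading with library() has to be repeated in every new R session.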
Let’s start with a simple bar plot using our synthetic data. In ggplot2, we build plots layer by layer, like stacking blocks:
# First, boost academic scores to make the synthetic data more interesting
synthetic_data <- synthetic_data |>
  mutate(math_score = case_when(
    program == "Academic" ~ math_score + 15,
    TRUE ~ math_score
  ))
# Create a basic plot showing average math scores by program
synthetic_data |>
  group_by(program) |>                            # Group data by program type
  summarise(avg_score = mean(math_score)) |>      # Calculate mean for each group
  ggplot(aes(x = program, y = avg_score)) +       # Set up the basic plot structure
  geom_col() +                                    # Add bars to the plot
  labs(title = "Average Math Scores by Program",  # Add labels
       x = "Program",
       y = "Average Score")
Let’s break down what each part does:
- ggplot(aes(x = program, y = avg_score)): Sets up the plot structure, specifying which variables go on which axes
- geom_col(): Adds vertical bars (columns) to our plot
- labs(): Adds title and axis labels to our plot

The TCDSB package provides colors and fonts that match our organization’s style guide.
Now we can create the same plot with our organization’s theme:
synthetic_data |>
  group_by(program) |>
  summarise(avg_score = mean(math_score)) |>
  ggplot(aes(x = program, y = avg_score)) +
  geom_col() +
  labs(title = "Average Math Scores by Program",
       x = "Program",
       y = "Average Score") +
  tcdsb::tcdsb_ggplot_theme()  # Add TCDSB theme
We can also use our organization’s official colors:
synthetic_data |>
  group_by(program) |>
  summarise(avg_score = mean(math_score)) |>
  ggplot(aes(x = program, y = avg_score)) +
  geom_col(fill = tcdsb::tcdsb_board_color) +  # Use TCDSB colors
  labs(title = "Average Math Scores by Program",
       x = "Program",
       y = "Average Score") +
  tcdsb::tcdsb_ggplot_theme()
Let’s create a scatter plot to show the relationship between math and English scores. This is useful for visualizing correlations:
synthetic_data |>
  ggplot(aes(x = math_score, y = english_score)) +
  geom_point() +                            # Add points
  geom_smooth(method = "lm", se = FALSE) +  # Add trend line
  labs(title = "Math vs English Scores",
       x = "Math Score",
       y = "English Score") +
  tcdsb::tcdsb_ggplot_theme()
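The trend line drawn by geom_smooth(method = "lm") is a straight line fitted by ordinary least squares. To see the numbers behind such a line, the sketch below fits the equivalent model with lm() on hypothetical scores (demo_scores and its values are made up for illustration).

```r
# Hypothetical scores, generated only for this illustration
set.seed(2024)
demo_scores <- data.frame(math_score = rnorm(50, mean = 72, sd = 10))
demo_scores$english_score <- 20 + 0.7 * demo_scores$math_score +
  rnorm(50, mean = 0, sd = 6)

# Fit english_score as a linear function of math_score
fit <- lm(english_score ~ math_score, data = demo_scores)

coef(fit)                 # intercept and slope of the fitted line
summary(fit)$r.squared    # share of variance explained

# For a single predictor, R-squared equals the squared Pearson correlation
all.equal(summary(fit)$r.squared,
          cor(demo_scores$math_score, demo_scores$english_score)^2)
```

The last check ties the scatter plot back to the correlation test: a steeper, tighter cloud of points around the line corresponds to a larger correlation.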
And a box plot to show score distributions across programs:
synthetic_data |>
  ggplot(aes(x = program, y = math_score)) +
  geom_boxplot(fill = tcdsb::tcdsb_board_color) +
  labs(title = "Math Score Distribution by Program",
       x = "Program",
       y = "Math Score") +
  tcdsb::tcdsb_ggplot_theme()
These plots can be saved to files using the ggsave() function. By default, the width and height are in inches.
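As a minimal sketch of saving a plot, the example below builds a small bar chart from made-up averages and writes it to a temporary folder. The file name, the 6 x 4 inch size, and the 300 dpi resolution are illustrative choices, not organizational requirements.

```r
library(ggplot2)

# Made-up averages, used only to have something to plot
demo_avgs <- data.frame(
  program   = c("Academic", "Applied"),
  avg_score = c(85, 72)
)

p <- ggplot(demo_avgs, aes(x = program, y = avg_score)) +
  geom_col() +
  labs(title = "Average Math Scores by Program",
       x = "Program",
       y = "Average Score")

# Write to a temporary folder; replace with a path inside your project
out_file <- file.path(tempdir(), "math_scores_by_program.png")

ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

If the plot argument is omitted, ggsave() saves the most recently displayed plot. The file type is inferred from the extension, so changing ".png" to ".pdf" produces a PDF instead.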