In this tutorial, we will explore an education dataset to understand the relationships between various factors, such as gender, race/ethnicity, parental level of education, and student test scores. By performing descriptive statistics and creating visualizations, we aim to uncover patterns and insights that can help us better understand how these factors influence academic performance. This analysis is particularly valuable for educators, policymakers, and researchers who are interested in improving educational outcomes and addressing disparities.
We will use R and R Markdown to conduct our analysis. R is a powerful statistical programming language that allows us to perform complex data manipulation and visualization. R Markdown, on the other hand, is a dynamic document generation tool that integrates R code with narrative text, making it easy to produce reproducible reports. By the end of this tutorial, you will have a better understanding of how to analyze data using R and how to create comprehensive, reproducible reports using R Markdown.
For this tutorial, I am using data provided by Kaggle. The dataset page describes it as follows:
Problem Statement: This [data] understands how the student’s performance (test scores) is affected by other variables such as Gender, Ethnicity, Parental level of education, Lunch and Test preparation course.
Content: This data set consists of the marks secured by the students in various subjects.
Inspiration: To understand the influence of the parent’s background, test preparation etc on students’ performance
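First, we need to load the necessary libraries. For this tutorial, we will use the tidyverse (which includes ggplot2 for data visualization and dplyr for data manipulation), reshape2 for reshaping data, and knitr for creating tables.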
# Loading necessary libraries
library(tidyverse) # For creating plots and data manipulation
library(reshape2) # For reshaping data
library(knitr) # For creating tables in R Markdown
Now we need to load the data that we are using.
# Load the Dataset
ed_data <- read.csv("study_performance.csv")
### First Few Rows of Data
kable(head(ed_data), caption = "First Few Rows of Data")
gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
---|---|---|---|---|---|---|---|
female | group B | bachelor’s degree | standard | none | 72 | 72 | 74 |
female | group C | some college | standard | completed | 69 | 90 | 88 |
female | group B | master’s degree | standard | none | 90 | 95 | 93 |
male | group A | associate’s degree | free/reduced | none | 47 | 57 | 44 |
male | group C | some college | standard | none | 76 | 78 | 75 |
female | group B | associate’s degree | standard | none | 71 | 83 | 78 |
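Before summarizing, it can help to confirm that the columns have the expected types and that nothing is missing. The sketch below is only illustrative: `ed_sample` is a hypothetical stand-in built from the six preview rows above (so it runs without the CSV file), and the 0 to 100 score range is an assumption about this dataset rather than something the source states.

```r
# Hypothetical stand-in for ed_data: the six rows from the preview table above.
ed_sample <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  race_ethnicity = c("group B", "group C", "group B", "group A", "group C", "group B"),
  parental_level_of_education = c("bachelor's degree", "some college", "master's degree",
                                  "associate's degree", "some college", "associate's degree"),
  lunch = c("standard", "standard", "standard", "free/reduced", "standard", "standard"),
  test_preparation_course = c("none", "completed", "none", "none", "none", "none"),
  math_score = c(72, 69, 90, 47, 76, 71),
  reading_score = c(72, 90, 95, 57, 78, 83),
  writing_score = c(74, 88, 93, 44, 75, 78)
)

# Column types and a glimpse of the values
str(ed_sample)

# Number of missing values in each column
missing_per_column <- colSums(is.na(ed_sample))

# Assumption: valid scores lie between 0 and 100
score_cols <- c("math_score", "reading_score", "writing_score")
scores_in_range <- sapply(
  ed_sample[score_cols],
  function(x) all(x >= 0 & x <= 100, na.rm = TRUE)
)
```

Running the same three checks on the full `ed_data` should show quickly whether the `na.rm = TRUE` arguments used later actually matter for this file.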
Let’s first see a table that summarizes the data we have.
## Summary by Gender
summary_gender <- ed_data %>%
  group_by(gender) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
## Summary by Race/Ethnicity
summary_race <- ed_data %>%
  group_by(race_ethnicity) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
## Summary by Parental Level of Education
summary_parent_edu <- ed_data %>%
  group_by(parental_level_of_education) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
## Summary by Lunch
summary_lunch <- ed_data %>%
  group_by(lunch) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
## Summary by Test Preparation Course
summary_test_prep <- ed_data %>%
  group_by(test_preparation_course) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
# Display Tables
### Summary by Gender
kable(summary_gender, caption = "Mean Test Scores by Gender")
gender | mean_math | sem_math | mean_reading | sem_reading | mean_writing | sem_writing |
---|---|---|---|---|---|---|
female | 63.63320 | 0.6806554 | 72.60811 | 0.6317438 | 72.46718 | 0.6522449 |
male | 68.72822 | 0.6539105 | 65.47303 | 0.6345776 | 63.31120 | 0.6428674 |
### Summary by Race/Ethnicity
kable(summary_race, caption = "Mean Test Scores by Race/Ethnicity")
race_ethnicity | mean_math | sem_math | mean_reading | sem_reading | mean_writing | sem_writing |
---|---|---|---|---|---|---|
group A | 61.62921 | 1.5394358 | 64.67416 | 1.6476354 | 62.67416 | 1.6396342 |
group B | 63.45263 | 1.1221805 | 67.35263 | 1.1010915 | 65.60000 | 1.1335692 |
group C | 64.46395 | 0.8315896 | 69.10345 | 0.7836834 | 67.82759 | 0.8389081 |
group D | 67.36260 | 0.8506755 | 70.03053 | 0.8584549 | 70.14504 | 0.8876399 |
group E | 73.82143 | 1.3128845 | 73.02857 | 1.2570845 | 71.40714 | 1.2773582 |
### Summary by Parental Level of Education
kable(summary_parent_edu, caption = "Mean Test Scores by Parental Level of Education")
parental_level_of_education | mean_math | sem_math | mean_reading | sem_reading | mean_writing | sem_writing |
---|---|---|---|---|---|---|
associate’s degree | 67.88288 | 1.0142573 | 70.92793 | 0.9308229 | 69.89640 | 0.9604996 |
bachelor’s degree | 69.38983 | 1.3756873 | 73.00000 | 1.3150639 | 73.38136 | 1.3558464 |
high school | 62.13776 | 1.0385465 | 64.70408 | 1.0094379 | 62.44898 | 1.0061362 |
master’s degree | 69.74576 | 1.9728717 | 75.37288 | 1.7933735 | 75.67797 | 1.7875864 |
some college | 67.12832 | 0.9520797 | 69.46018 | 0.9350610 | 68.84071 | 0.9986054 |
some high school | 63.49721 | 1.1905138 | 66.93855 | 1.1569768 | 64.88827 | 1.1761786 |
### Summary by Lunch
kable(summary_lunch, caption = "Mean Test Scores by Lunch")
lunch | mean_math | sem_math | mean_reading | sem_reading | mean_writing | sem_writing |
---|---|---|---|---|---|---|
free/reduced | 58.92113 | 0.8046069 | 64.65352 | 0.7905625 | 63.02254 | 0.8191423 |
standard | 70.03411 | 0.5376061 | 71.65426 | 0.5445794 | 70.82326 | 0.5646167 |
### Summary by Test Preparation Course
kable(summary_test_prep, caption = "Mean Test Scores by Test Preparation Course")
test_preparation_course | mean_math | sem_math | mean_reading | sem_reading | mean_writing | sem_writing |
---|---|---|---|---|---|---|
completed | 69.69553 | 0.7634261 | 73.89385 | 0.720811 | 74.41899 | 0.7069084 |
none | 64.07788 | 0.5995952 | 66.53427 | 0.570844 | 64.50467 | 0.5919894 |
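The five summaries above repeat the same six expressions with a different grouping column each time. One possible way to reduce the repetition is sketched below; it is not the approach used in this tutorial. The helper names `sem()` and `summarize_scores()` are hypothetical, and `ed_sample` is a small stand-in built from the six preview rows shown earlier. Two differences from the code above are worth noting: the output columns are named `mean_math_score`, `sem_math_score`, and so on, and `sem()` divides by the number of non-missing values, whereas `n()` counts every row in the group, missing or not.

```r
library(dplyr)

# Hypothetical helper: standard error of the mean over the non-missing values.
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Hypothetical helper: mean and SEM of the three score columns per group.
summarize_scores <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(
      across(
        c(math_score, reading_score, writing_score),
        list(mean = ~ mean(.x, na.rm = TRUE), sem = ~ sem(.x)),
        .names = "{.fn}_{.col}"
      ),
      .groups = "drop"
    )
}

# Stand-in data: the six rows from the preview table.
ed_sample <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  math_score = c(72, 69, 90, 47, 76, 71),
  reading_score = c(72, 90, 95, 57, 78, 83),
  writing_score = c(74, 88, 93, 44, 75, 78)
)

by_gender <- summarize_scores(ed_sample, gender)
by_gender
```

With the full dataset, a call such as `summarize_scores(ed_data, lunch)` would play the role of the corresponding block above, apart from the column names.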
Later in this tutorial, we will create scatter plots with linear regression lines to visualize the relationship between parental level of education and each test score. Each plot includes a regression line with a confidence band to show the trend and the uncertainty around it.
First, let’s review some bar graphs to understand the average test scores by different groupings. We will create bar graphs for the mean test scores in math, reading, and writing, grouped by gender, race/ethnicity, parental level of education, lunch status, and test preparation course. Each bar graph will include error bars to represent the standard error of the mean (SEM), which shows how precisely each group mean is estimated (it is the standard deviation divided by the square root of the group size, so it shrinks as the group grows).
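To make the SEM concrete, here is a small worked sketch. It uses only the math scores of the four female students in the preview table shown earlier, so the numbers are illustrative and are not the values in the summary tables; the variable names are hypothetical.

```r
# Math scores of the four female students in the preview rows
female_math <- c(72, 69, 90, 71)

n <- length(female_math)           # group size
group_mean <- mean(female_math)    # height of the bar
sample_sd <- sd(female_math)       # spread of individual scores
sem_value <- sample_sd / sqrt(n)   # uncertainty of the mean

# The error bar spans one SEM on either side of the mean
error_bar <- c(lower = group_mean - sem_value, upper = group_mean + sem_value)
error_bar
```

Note that the standard deviation describes how much individual scores vary, while the SEM describes how much the mean itself would vary from sample to sample.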
This bar graph displays the average math scores for male and female students. The error bars represent the standard error of the mean (SEM), indicating how precisely the mean of each gender group is estimated. By comparing the heights of the bars relative to the error bars, we can see if there are any noticeable differences in math performance between male and female students.
### Math Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_math, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Gender", x="Gender", y="Mean Math Score") +
  theme_minimal()
This bar graph shows the average reading scores for male and female students. As in the previous graph, the error bars represent the SEM. This visualization helps us see whether the gap in reading performance between genders is large relative to the uncertainty in the two means.
### Reading Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_reading, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Gender", x="Gender", y="Mean Reading Score") +
  theme_minimal()
### Writing Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_writing, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Gender", x="Gender", y="Mean Writing Score") +
  theme_minimal()
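Reading a difference off a bar graph can be backed up with a quick calculation. The sketch below is a rough check, not a formal test from the tutorial: it takes the math means and SEMs from the "Mean Test Scores by Gender" table and compares the gap between the means to its combined standard error, assuming the two groups are independent and large enough for a normal approximation. The variable names are hypothetical.

```r
# Values copied from the "Mean Test Scores by Gender" table (math columns)
mean_female <- 63.63320
sem_female  <- 0.6806554
mean_male   <- 68.72822
sem_male    <- 0.6539105

# Gap between the two group means
diff_means <- mean_male - mean_female

# Standard error of the gap, assuming independent groups
se_diff <- sqrt(sem_female^2 + sem_male^2)

# How many standard errors wide the gap is
z_ratio <- diff_means / se_diff
z_ratio
```

A ratio well above 2 suggests the gap is unlikely to be sampling noise alone; running `t.test(math_score ~ gender, data = ed_data)` on the full data would be the more standard way to check this.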
These bar graphs illustrate the average scores for students who completed a test preparation course versus those who did not. The error bars represent the SEM. These visualizations help us see whether completing the course is associated with higher scores; because the data are observational, the graphs alone cannot show that the course caused the difference.
### Math Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_math, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Math Score") +
  theme_minimal()
### Reading Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_reading, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Reading Score") +
  theme_minimal()
### Writing Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_writing, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing),
                width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Writing Score") +
  theme_minimal()
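The bar graphs above all follow one pattern: a bar per group, an error bar of one SEM either side, and a title. As a sketch of how that pattern could be wrapped in a function, the hypothetical helper `plot_mean_scores()` below takes the column names as strings and looks them up with the `.data` pronoun. It uses `geom_col()`, which is shorthand for `geom_bar(stat = "identity")`. The stand-in data frame `summary_gender_demo` holds the math values from the gender table so the block runs on its own.

```r
library(ggplot2)

# Hypothetical helper: bar graph of group means with SEM error bars.
# group_col, mean_col, and sem_col are column names given as strings.
plot_mean_scores <- function(summary_df, group_col, mean_col, sem_col,
                             title, x_label, y_label) {
  ggplot(summary_df, aes(x = .data[[group_col]], y = .data[[mean_col]],
                         fill = .data[[group_col]])) +
    geom_col() +
    geom_errorbar(aes(ymin = .data[[mean_col]] - .data[[sem_col]],
                      ymax = .data[[mean_col]] + .data[[sem_col]]),
                  width = 0.2) +
    labs(title = title, x = x_label, y = y_label, fill = x_label) +
    theme_minimal()
}

# Stand-in for summary_gender, using the math columns of the gender table
summary_gender_demo <- data.frame(
  gender = c("female", "male"),
  mean_math = c(63.63320, 68.72822),
  sem_math = c(0.6806554, 0.6539105)
)

p <- plot_mean_scores(summary_gender_demo, "gender", "mean_math", "sem_math",
                      title = "Math Scores by Gender",
                      x_label = "Gender", y_label = "Mean Math Score")
p
```

The same call with `summary_race`, `summary_parent_edu`, or `summary_lunch` and the matching column names would produce the graphs for the remaining groupings.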
These scatter plots visualize the relationship between parental level of education and each test score (math, reading, and writing). Each point represents an individual student’s score, jittered horizontally to reduce overlap. The linear regression line with a confidence interval helps us see the overall trend; because the x-axis is an ordered category, the line treats the education levels as equally spaced steps.
# Order parental_level_of_education for better visualization
ed_data$parental_level_of_education <- factor(
  ed_data$parental_level_of_education,
  levels = c("some high school", "high school", "some college",
             "associate's degree", "bachelor's degree", "master's degree"))
# Scatter plot for Math Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = math_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Math Scores",
       x = "Parental Level of Education",
       y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
# Scatter plot for Reading Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = reading_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Reading Scores",
       x = "Parental Level of Education",
       y = "Reading Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
# Scatter plot for Writing Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = writing_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Writing Scores",
       x = "Parental Level of Education",
       y = "Writing Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
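The regression lines in these plots rest on a numeric coding of the education levels: with `aes(group = 1)`, the ordered factor is placed at positions 1 through 6 and a straight line is fitted through them. The sketch below makes that coding explicit on a hypothetical stand-in, `ed_sample`, built from the six preview rows shown earlier, and adds a Spearman rank correlation as one way to put a number on the trend. Treating the levels as equally spaced is an assumption of this approach, and six rows are far too few to draw conclusions from.

```r
# Same level order as used for the plots above
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# Stand-in data: parental education and math score from the six preview rows
ed_sample <- data.frame(
  parental_level_of_education = factor(
    c("bachelor's degree", "some college", "master's degree",
      "associate's degree", "some college", "associate's degree"),
    levels = edu_levels
  ),
  math_score = c(72, 69, 90, 47, 76, 71)
)

# Integer position of each level (1 = some high school, 6 = master's degree)
ed_sample$edu_code <- as.integer(ed_sample$parental_level_of_education)

# Straight-line fit on the integer codes, comparable to the plotted line
fit <- lm(math_score ~ edu_code, data = ed_sample)
coef(fit)

# Rank-based correlation, which relies only on the ordering of the levels
rho <- cor(ed_sample$edu_code, ed_sample$math_score, method = "spearman")
rho
```

On the full dataset, the slope of `fit` would estimate the average change in score per step up in parental education under the equal-spacing assumption.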
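In this tutorial, we explored the relationships between various factors and student test scores using an education dataset. We computed descriptive statistics (means and standard errors) for each grouping, drew bar graphs of the mean scores with SEM error bars, and drew scatter plots with linear regression lines relating parental level of education to each test score.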
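Through these analyses, we gained insights into how various factors might be related to student performance. For instance, male students averaged higher in math (68.7 versus 63.6), while female students averaged higher in reading and writing. Students who completed the test preparation course averaged higher in all three subjects, with the largest gap in writing (74.4 versus 64.5). Mean scores also differed across race/ethnicity groups and between lunch categories.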
R Markdown is a powerful tool for creating dynamic, reproducible documents that integrate code, output, and narrative text.
By using R Markdown, you can create comprehensive, professional-quality reports that facilitate better understanding and communication of your data analysis. This makes it an invaluable tool for students, researchers, and professionals alike.