1 Introduction

In this tutorial, we will explore an education dataset to understand the relationships between various factors, such as gender, race/ethnicity, parental level of education, and student test scores. By performing descriptive statistics and creating visualizations, we aim to uncover patterns and insights that can help us better understand how these factors influence academic performance. This analysis is particularly valuable for educators, policymakers, and researchers who are interested in improving educational outcomes and addressing disparities.

We will use R and R Markdown to conduct our analysis. R is a powerful statistical programming language that allows us to perform complex data manipulation and visualization. R Markdown, on the other hand, is a dynamic document generation tool that integrates R code with narrative text, making it easy to produce reproducible reports. By the end of this tutorial, you will have a better understanding of how to analyze data using R and how to create comprehensive, reproducible reports using R Markdown.

1.1 About the Dataset

For this tutorial, we use a dataset hosted on Kaggle. The dataset page describes it as follows:

Problem Statement: This [data] understands how the student’s performance (test scores) is affected by other variables such as Gender, Ethnicity, Parental level of education, Lunch and Test preparation course.

Content: This data set consists of the marks secured by the students in various subjects.

gender: sex of students (male/female)
race/ethnicity: ethnicity of students (group A, B, C, D, E)
parental level of education: parents’ final education (bachelor’s degree, some college, master’s degree, associate’s degree, high school)
lunch: having lunch before test (standard or free/reduced)
test preparation course: complete or not complete before test
math score
reading score
writing score

Inspiration: To understand the influence of the parent’s background, test preparation etc on students’ performance

2 Set Up

2.1 Load Libraries

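First, we need to load the necessary libraries. The tidyverse collection provides ggplot2 for data visualization and dplyr for data manipulation, reshape2 offers helpers for reshaping data, and knitr provides kable() for formatted tables.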

# Loading necessary libraries
library(tidyverse)    # For creating plots and data manipulation
library(reshape2)   # For reshaping data
library(knitr)      # For creating tables in R Markdown
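If any of these packages is not installed, the library() calls above stop with an error. The following is a minimal sketch of a pre-flight check, assuming the three package names used above; the variable names required_pkgs, is_installed, and missing_pkgs are illustrative and not part of the original tutorial.

```r
# Packages loaded in this tutorial
required_pkgs <- c("tidyverse", "reshape2", "knitr")

# requireNamespace() returns TRUE when a package is installed, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report what to install instead of installing automatically
  message(
    "Install before continuing: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  message("All required packages are installed.")
}
```

The check only reports missing packages; installing them is left to the reader so that a knitted report never changes the local library as a side effect.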

2.2 Load the Data

Next, we read the CSV file into a data frame and preview its first few rows.

# Load the Dataset
ed_data <- read.csv("study_performance.csv")

# Preview the first few rows
kable(head(ed_data), caption = "First Few Rows of Data")

First Few Rows of Data
gender	race_ethnicity	parental_level_of_education	lunch	test_preparation_course	math_score	reading_score	writing_score
female	group B	bachelor’s degree	standard	none	72	72	74
female	group C	some college	standard	completed	69	90	88
female	group B	master’s degree	standard	none	90	95	93
male	group A	associate’s degree	free/reduced	none	47	57	44
male	group C	some college	standard	none	76	78	75
female	group B	associate’s degree	standard	none	71	83	78
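Before summarizing, it can help to confirm that the expected columns are present and that no values are missing. The sketch below runs those checks on a small toy data frame copied from the six preview rows above, so it works without the CSV file; toy_data, expected_cols, and the other names are illustrative. In the tutorial itself the same checks would be applied to ed_data.

```r
# Toy data frame copied from the six preview rows shown above
toy_data <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  race_ethnicity = c("group B", "group C", "group B", "group A", "group C", "group B"),
  parental_level_of_education = c("bachelor's degree", "some college", "master's degree",
                                  "associate's degree", "some college", "associate's degree"),
  lunch = c("standard", "standard", "standard", "free/reduced", "standard", "standard"),
  test_preparation_course = c("none", "completed", "none", "none", "none", "none"),
  math_score = c(72, 69, 90, 47, 76, 71),
  reading_score = c(72, 90, 95, 57, 78, 83),
  writing_score = c(74, 88, 93, 44, 75, 78)
)

# Columns the rest of the analysis relies on
expected_cols <- c("gender", "race_ethnicity", "parental_level_of_education", "lunch",
                   "test_preparation_course", "math_score", "reading_score", "writing_score")
missing_cols <- setdiff(expected_cols, names(toy_data))

# Number of missing values per column
na_counts <- colSums(is.na(toy_data))

# Minimum and maximum of each score column (row 1 = min, row 2 = max)
score_cols <- c("math_score", "reading_score", "writing_score")
score_ranges <- sapply(toy_data[score_cols], range)

str(toy_data)
print(missing_cols)
print(na_counts)
print(score_ranges)
```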

3 Exploring the Data

3.1 Descriptive Results

Let’s first see a table that summarizes the data we have.

3.1.1 Create the Summary Dataframes

## Summary by Gender
summary_gender <- ed_data %>%
  group_by(gender) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Race/Ethnicity
summary_race <- ed_data %>%
  group_by(race_ethnicity) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Parental Level of Education
summary_parent_edu <- ed_data %>%
  group_by(parental_level_of_education) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Lunch
summary_lunch <- ed_data %>%
  group_by(lunch) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Test Preparation Course
summary_test_prep <- ed_data %>%
  group_by(test_preparation_course) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
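The five summaries above repeat the same summarize() call with a different grouping column. One possible way to reduce the repetition is a small helper, sketched below under a few assumptions: summarize_scores() and sem() are hypothetical names, the output columns are named mean_math_score, sem_math_score, and so on (slightly different from the names used above), and the SEM divides by the number of non-missing values rather than n(), which matters only if a score column contains NA. The toy data reuses the six preview rows shown earlier.

```r
library(dplyr)

# SEM based on non-missing values only
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Mean and SEM of the three score columns, grouped by one column
summarize_scores <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(
      across(
        c(math_score, reading_score, writing_score),
        list(mean = ~ mean(.x, na.rm = TRUE), sem = sem),
        .names = "{.fn}_{.col}"
      ),
      .groups = "drop"
    )
}

# Toy data: gender and scores from the six preview rows
toy_scores <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  math_score = c(72, 69, 90, 47, 76, 71),
  reading_score = c(72, 90, 95, 57, 78, 83),
  writing_score = c(74, 88, 93, 44, 75, 78)
)

toy_summary <- summarize_scores(toy_scores, gender)
print(toy_summary)
```

With the full dataset, the same helper could be called once per grouping column, for example summarize_scores(ed_data, lunch).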

3.1.2 Display the Tables

# Display Tables

### Summary by Gender
kable(summary_gender, caption = "Mean Test Scores by Gender")

Mean Test Scores by Gender
gender	mean_math	sem_math	mean_reading	sem_reading	mean_writing	sem_writing
female	63.63320	0.6806554	72.60811	0.6317438	72.46718	0.6522449
male	68.72822	0.6539105	65.47303	0.6345776	63.31120	0.6428674

### Summary by Race/Ethnicity
kable(summary_race, caption = "Mean Test Scores by Race/Ethnicity")

Mean Test Scores by Race/Ethnicity
race_ethnicity	mean_math	sem_math	mean_reading	sem_reading	mean_writing	sem_writing
group A	61.62921	1.5394358	64.67416	1.6476354	62.67416	1.6396342
group B	63.45263	1.1221805	67.35263	1.1010915	65.60000	1.1335692
group C	64.46395	0.8315896	69.10345	0.7836834	67.82759	0.8389081
group D	67.36260	0.8506755	70.03053	0.8584549	70.14504	0.8876399
group E	73.82143	1.3128845	73.02857	1.2570845	71.40714	1.2773582

### Summary by Parental Level of Education
kable(summary_parent_edu, caption = "Mean Test Scores by Parental Level of Education")

Mean Test Scores by Parental Level of Education
parental_level_of_education	mean_math	sem_math	mean_reading	sem_reading	mean_writing	sem_writing
associate’s degree	67.88288	1.0142573	70.92793	0.9308229	69.89640	0.9604996
bachelor’s degree	69.38983	1.3756873	73.00000	1.3150639	73.38136	1.3558464
high school	62.13776	1.0385465	64.70408	1.0094379	62.44898	1.0061362
master’s degree	69.74576	1.9728717	75.37288	1.7933735	75.67797	1.7875864
some college	67.12832	0.9520797	69.46018	0.9350610	68.84071	0.9986054
some high school	63.49721	1.1905138	66.93855	1.1569768	64.88827	1.1761786

### Summary by Lunch
kable(summary_lunch, caption = "Mean Test Scores by Lunch")

Mean Test Scores by Lunch
lunch	mean_math	sem_math	mean_reading	sem_reading	mean_writing	sem_writing
free/reduced	58.92113	0.8046069	64.65352	0.7905625	63.02254	0.8191423
standard	70.03411	0.5376061	71.65426	0.5445794	70.82326	0.5646167

### Summary by Test Preparation Course
kable(summary_test_prep, caption = "Mean Test Scores by Test Preparation Course")

Mean Test Scores by Test Preparation Course
test_preparation_course	mean_math	sem_math	mean_reading	sem_reading	mean_writing	sem_writing
completed	69.69553	0.7634261	73.89385	0.720811	74.41899	0.7069084
none	64.07788	0.5995952	66.53427	0.570844	64.50467	0.5919894

4 Data Visualization

We will start with bar graphs that compare mean test scores across groups, then create scatter plots with linear regression lines to visualize how test scores relate to parental level of education. Each scatter plot includes a regression line with a confidence band to show the trend and its uncertainty.

4.1 Bar Graphs

First, let’s review some bar graphs to understand the average test scores by different groupings. We will create bar graphs for the mean test scores in math, reading, and writing, grouped by gender and by test preparation course; the same pattern applies to the other summary tables. Each bar graph includes error bars that represent the standard error of the mean (SEM), which indicates how precisely each group mean is estimated.

4.1.1 Math Scores by Gender

This bar graph displays the average math scores for male and female students. The error bars represent the standard error of the mean (SEM), which reflects the uncertainty in each group’s mean rather than the spread of individual scores. By comparing the heights of the bars, we can see if there are any noticeable differences in math performance between male and female students.

### Math Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_math, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math), width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Gender", x="Gender", y="Mean Math Score") +
  theme_minimal()
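The error bars in this graph span one SEM on either side of the mean. As a rough sketch, the SEM can also be turned into an approximate 95% confidence interval by multiplying it by the normal quantile (about 1.96). The values below are copied from the "Mean Test Scores by Gender" table; the normal approximation and the names gender_means, ci_lower, and ci_upper are assumptions made for this illustration.

```r
# Mean and SEM of math scores, copied from the gender summary table
gender_means <- data.frame(
  gender = c("female", "male"),
  mean_math = c(63.63320, 68.72822),
  sem_math = c(0.6806554, 0.6539105)
)

# Normal quantile for a two-sided 95% interval (approximately 1.96)
z <- qnorm(0.975)

# Approximate 95% confidence interval: mean +/- z * SEM
gender_means$ci_lower <- gender_means$mean_math - z * gender_means$sem_math
gender_means$ci_upper <- gender_means$mean_math + z * gender_means$sem_math

print(gender_means)
```

With these values the two intervals do not overlap, which is consistent with a real difference in mean math scores, although it is not a substitute for a formal test.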

4.1.2 Reading and Writing Scores by Gender

These bar graphs show the average reading and writing scores for male and female students. As in the previous graph, the error bars represent the SEM. The visualizations help us see whether reading and writing performance differs between genders; a formal statistical test would be needed to judge whether any difference is significant.

### Reading Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_reading, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading), width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Gender", x="Gender", y="Mean Reading Score") +
  theme_minimal()

### Writing Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_writing, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing), width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Gender", x="Gender", y="Mean Writing Score") +
  theme_minimal()

4.1.3 Scores by Test Preparation Course

These bar graphs illustrate the average math, reading, and writing scores for students who completed a test preparation course versus those who did not. The error bars represent the SEM. Because the data are observational, the graphs show how completing the course is associated with scores rather than proving that the course caused any difference.

### Math Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_math, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math), width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Math Score") +
  theme_minimal()

### Reading Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_reading, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading), width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Reading Score") +
  theme_minimal()

### Writing Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_writing, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing), width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Writing Score") +
  theme_minimal()
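The three graphs above differ only in the score column they plot. As an alternative sketch, the summary table can be reshaped to long format and drawn as a single faceted figure. The values below are copied from the "Mean Test Scores by Test Preparation Course" table so the example runs on its own; long_test_prep and score_plot are illustrative names, and pivot_longer() comes from tidyr, which is part of the tidyverse.

```r
library(ggplot2)
library(tidyr)

# Summary values copied from the test preparation course table
summary_test_prep <- data.frame(
  test_preparation_course = c("completed", "none"),
  mean_math = c(69.69553, 64.07788),
  sem_math = c(0.7634261, 0.5995952),
  mean_reading = c(73.89385, 66.53427),
  sem_reading = c(0.720811, 0.570844),
  mean_writing = c(74.41899, 64.50467),
  sem_writing = c(0.7069084, 0.5919894)
)

# Split names such as "mean_math" into a statistic ("mean") and a subject ("math"),
# giving one row per course status and subject with columns `mean` and `sem`
long_test_prep <- pivot_longer(
  summary_test_prep,
  cols = -test_preparation_course,
  names_to = c(".value", "subject"),
  names_sep = "_"
)

# One panel per subject, with SEM error bars
score_plot <- ggplot(long_test_prep,
                     aes(x = test_preparation_course, y = mean,
                         fill = test_preparation_course)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.2) +
  facet_wrap(~ subject) +
  labs(title = "Mean Scores by Test Preparation Course",
       x = "Test Preparation Course", y = "Mean Score",
       fill = "Test Preparation Course") +
  theme_minimal()

score_plot
```

Placing the three subjects side by side on a shared y-axis makes it easier to compare the size of the gap across subjects.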

4.2 Scatter Plots

These scatter plots visualize the relationship between parental level of education and each test score. Each point represents an individual student’s score, jittered horizontally to reduce overlap. The linear regression line with its confidence band helps us see the overall trend across education levels.

# Order parental_level_of_education for better visualization
ed_data$parental_level_of_education <- factor(ed_data$parental_level_of_education,
                                              levels = c("some high school", "high school", "some college",
                                                         "associate's degree", "bachelor's degree", "master's degree"))

# Scatter plot for Math Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = math_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Math Scores",
       x = "Parental Level of Education",
       y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for Reading Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = reading_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Reading Scores",
       x = "Parental Level of Education",
       y = "Reading Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for Writing Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = writing_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Writing Scores",
       x = "Parental Level of Education",
       y = "Writing Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'
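Because parental level of education is an ordered factor, the regression line in these plots effectively treats the categories as equally spaced positions from 1 to 6. The sketch below makes that coding explicit on toy values copied from the six preview rows and computes a Spearman rank correlation, which relies only on the ordering of the levels. Using Spearman here is a suggestion rather than part of the original analysis, and the names edu_levels, edu_codes, and rho are illustrative.

```r
# Ordered education levels, as defined before the scatter plots
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# Toy values copied from the six preview rows of the dataset
toy_edu <- c("bachelor's degree", "some college", "master's degree",
             "associate's degree", "some college", "associate's degree")
toy_math <- c(72, 69, 90, 47, 76, 71)

# Convert to a factor; any label that does not match a level exactly becomes NA
edu_factor <- factor(toy_edu, levels = edu_levels)
stopifnot(!anyNA(edu_factor))

# Integer position of each level (1 = "some high school", 6 = "master's degree")
edu_codes <- as.integer(edu_factor)
print(edu_codes)

# Rank-based correlation between education level and math score
rho <- cor(edu_codes, toy_math, method = "spearman")
print(rho)
```

The anyNA() check is worth keeping when working with the full dataset: a mismatch in spelling or apostrophe style between the level strings and the data would silently turn those rows into NA and drop them from the plots.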

5 Conclusion

In this tutorial, we explored the relationships between various factors and student test scores using an education dataset. We performed the following analyses:

Descriptive Statistics: We calculated the mean and standard error of the mean (SEM) for math, reading, and writing scores, grouped by gender, race/ethnicity, parental level of education, lunch status, and test preparation course.
Bar Graphs: We visualized the average test scores with error bars to understand the differences between groups.
Scatter Plots: We examined the correlations between different test scores and parental level of education using scatter plots with linear regression lines and confidence intervals.

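Through these analyses, we gained insights into how various factors relate to student performance. For instance, female students averaged higher in reading and writing (72.6 and 72.5 versus 65.5 and 63.3), male students averaged higher in math (68.7 versus 63.6), and students who completed the test preparation course averaged higher in all three subjects. These are descriptive differences; they do not by themselves show that any factor causes higher or lower scores.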

5.1 Benefits of R Markdown

R Markdown is a powerful tool for creating dynamic, reproducible documents that integrate code, output, and narrative text. Some of the key benefits include:

Reproducibility: R Markdown allows you to embed R code directly within your document. This ensures that your analysis is fully reproducible, as the code and output are always in sync.
Versatility: You can generate reports in various formats, including HTML, PDF, and Word, making it easy to share your findings with others.
Clarity: By combining code, output, and explanations in a single document, R Markdown helps you clearly communicate your analysis and results.
Efficiency: Automating your analysis and report generation saves time and reduces the risk of errors that can occur when manually copying and pasting results.
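As a small illustration of the reproducibility and versatility points above, the sketch below writes a tiny R Markdown file to a temporary location and renders it to HTML when the rmarkdown package and pandoc are available. The file contents and the names rmd_path, rmd_lines, and out_path are hypothetical and chosen only for this example.

```r
# Three backticks, built programmatically so this example can sit inside a code fence
fence <- strrep("`", 3)

# Contents of a minimal R Markdown document: YAML header, inline code, one chunk
rmd_lines <- c(
  "---",
  "title: \"Minimal Example\"",
  "output: html_document",
  "---",
  "",
  "The mean of 1 to 10 is `r mean(1:10)`.",
  "",
  paste0(fence, "{r}"),
  "summary(1:10)",
  fence
)

# Write the document to a temporary file
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render only when rmarkdown and pandoc are both available
out_path <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
}
print(out_path)
```

Changing output_format to "word_document" or "pdf_document" would produce the other formats mentioned above; PDF output additionally requires a LaTeX installation.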

By using R Markdown, you can create comprehensive, professional-quality reports that facilitate better understanding and communication of your data analysis. This makes it an invaluable tool for students, researchers, and professionals alike.

R Markdown Tutorial

FLi Sci Scholars

2024-05-19