1 Introduction

In this tutorial, we will explore an education dataset to understand the relationships between student test scores and factors such as gender, race/ethnicity, and parental level of education. By computing descriptive statistics and creating visualizations, we aim to uncover patterns that show how these factors relate to academic performance. This analysis is particularly valuable for educators, policymakers, and researchers who are interested in improving educational outcomes and addressing disparities.

We will use R and R Markdown to conduct our analysis. R is a powerful statistical programming language that allows us to perform complex data manipulation and visualization. R Markdown, on the other hand, is a dynamic document generation tool that integrates R code with narrative text, making it easy to produce reproducible reports. By the end of this tutorial, you will have a better understanding of how to analyze data using R and how to create comprehensive, reproducible reports using R Markdown.

1.1 About the Dataset

For this tutorial, we use a student performance dataset provided by Kaggle. The dataset page describes it as follows:

Problem Statement: This [data] understands how the student’s performance (test scores) is affected by other variables such as Gender, Ethnicity, Parental level of education, Lunch and Test preparation course.

Content: This data set consists of the marks secured by the students in various subjects.

  • gender: sex of students (male/female)
  • race/ethnicity: ethnicity of students (group A, B, C, D, E)
  • parental level of education: parents’ final education (bachelor’s degree, some college, master’s degree, associate’s degree, high school)
  • lunch: having lunch before test (standard or free/reduced)
  • test preparation course: completed or not completed before test
  • math score
  • reading score
  • writing score

Inspiration: To understand the influence of the parent’s background, test preparation etc on students’ performance

2 Set Up

2.1 Load Libraries

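First, we need to load the necessary libraries. For this tutorial, we will use tidyverse (which includes ggplot2 for data visualization and dplyr for data manipulation), reshape2 for reshaping data, and knitr for rendering tables.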

# Loading necessary libraries
library(tidyverse)    # For creating plots and data manipulation
library(reshape2)   # For reshaping data
library(knitr)      # For creating tables in R Markdown

2.2 Load the Data

Next, we read the dataset from a CSV file into a data frame.

# Load the Dataset
ed_data <- read.csv("study_performance.csv")

### Preview the first few rows
kable(head(ed_data), caption = "First Few Rows of Data")
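First Few Rows of Data

| gender | race_ethnicity | parental_level_of_education | lunch        | test_preparation_course | math_score | reading_score | writing_score |
|:-------|:---------------|:----------------------------|:-------------|:------------------------|-----------:|--------------:|--------------:|
| female | group B        | bachelor’s degree           | standard     | none                    |         72 |            72 |            74 |
| female | group C        | some college                | standard     | completed               |         69 |            90 |            88 |
| female | group B        | master’s degree             | standard     | none                    |         90 |            95 |            93 |
| male   | group A        | associate’s degree          | free/reduced | none                    |         47 |            57 |            44 |
| male   | group C        | some college                | standard     | none                    |         76 |            78 |            75 |
| female | group B        | associate’s degree          | standard     | none                    |         71 |            83 |            78 |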
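
Before summarizing, it can help to confirm that the columns were read with the expected types and that no scores are missing. The sketch below uses a small hand-made data frame, `toy_data`, as an illustrative stand-in: three rows taken from the preview above plus one hypothetical row with a missing math score. With the real file, the same calls would be applied to `ed_data`.

```r
# Illustrative stand-in for ed_data: three rows from the preview above,
# plus one hypothetical row with a missing math score.
toy_data <- data.frame(
  gender = c("female", "female", "female", "male"),
  test_preparation_course = c("none", "completed", "none", "none"),
  math_score = c(72, 69, 90, NA),
  reading_score = c(72, 90, 95, 57),
  writing_score = c(74, 88, 93, 44)
)

# Column types and a preview of the values
str(toy_data)

# Number of missing values per column
missing_counts <- colSums(is.na(toy_data))
print(missing_counts)

# Group sizes, useful for judging how stable each group mean will be
gender_counts <- table(toy_data$gender)
print(gender_counts)
```

If any score column reports missing values, the group sizes used for the standard error should be counted from the non-missing rows only.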

3 Exploring the Data

3.1 Descriptive Results

Let’s first see a table that summarizes the data we have.

3.1.1 Create the Summary Dataframes

## Summary by Gender
summary_gender <- ed_data %>%
  group_by(gender) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Race/Ethnicity
summary_race <- ed_data %>%
  group_by(race_ethnicity) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Parental Level of Education
summary_parent_edu <- ed_data %>%
  group_by(parental_level_of_education) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Lunch
summary_lunch <- ed_data %>%
  group_by(lunch) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )

## Summary by Test Preparation Course
summary_test_prep <- ed_data %>%
  group_by(test_preparation_course) %>%
  summarize(
    mean_math = mean(math_score, na.rm = TRUE),
    sem_math = sd(math_score, na.rm = TRUE) / sqrt(n()),
    mean_reading = mean(reading_score, na.rm = TRUE),
    sem_reading = sd(reading_score, na.rm = TRUE) / sqrt(n()),
    mean_writing = mean(writing_score, na.rm = TRUE),
    sem_writing = sd(writing_score, na.rm = TRUE) / sqrt(n())
  )
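
The five summaries above repeat the same computation with a different grouping column. One possible way to avoid the repetition is a small helper built on dplyr's `across()`. The function name `summarize_scores` and the toy data frame below are assumptions made for illustration, and the resulting column names (for example `mean_math_score`) differ slightly from those used in this tutorial. The helper also counts only non-missing values when computing the SEM, which matters if a score column contains `NA`.

```r
library(dplyr)

# Hypothetical helper: mean and SEM of every *_score column, by one grouping column.
# n is counted from non-missing values so the SEM stays correct if scores are NA.
summarize_scores <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(
      across(
        ends_with("_score"),
        list(
          mean = ~ mean(.x, na.rm = TRUE),
          sem  = ~ sd(.x, na.rm = TRUE) / sqrt(sum(!is.na(.x)))
        ),
        .names = "{.fn}_{.col}"
      ),
      .groups = "drop"
    )
}

# Toy data for illustration only (not the Kaggle file)
toy_scores <- data.frame(
  gender = c("female", "female", "male", "male"),
  math_score = c(70, 80, 60, 66),
  reading_score = c(75, 85, 55, 65),
  writing_score = c(74, 88, 50, 60)
)

result <- summarize_scores(toy_scores, "gender")
print(result)
```

With the real data, a call such as `summarize_scores(ed_data, "lunch")` would play the role of one of the blocks above.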

3.1.2 Display the Tables

# Display Tables

### Summary by Gender
kable(summary_gender, caption = "Mean Test Scores by Gender")
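Mean Test Scores by Gender

| gender | mean_math | sem_math  | mean_reading | sem_reading | mean_writing | sem_writing |
|:-------|----------:|----------:|-------------:|------------:|-------------:|------------:|
| female |  63.63320 | 0.6806554 |     72.60811 |   0.6317438 |     72.46718 |   0.6522449 |
| male   |  68.72822 | 0.6539105 |     65.47303 |   0.6345776 |     63.31120 |   0.6428674 |

### Summary by Race/Ethnicity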
kable(summary_race, caption = "Mean Test Scores by Race/Ethnicity")
Mean Test Scores by Race/Ethnicity

| race_ethnicity | mean_math | sem_math  | mean_reading | sem_reading | mean_writing | sem_writing |
|:---------------|----------:|----------:|-------------:|------------:|-------------:|------------:|
| group A        |  61.62921 | 1.5394358 |     64.67416 |   1.6476354 |     62.67416 |   1.6396342 |
| group B        |  63.45263 | 1.1221805 |     67.35263 |   1.1010915 |     65.60000 |   1.1335692 |
| group C        |  64.46395 | 0.8315896 |     69.10345 |   0.7836834 |     67.82759 |   0.8389081 |
| group D        |  67.36260 | 0.8506755 |     70.03053 |   0.8584549 |     70.14504 |   0.8876399 |
| group E        |  73.82143 | 1.3128845 |     73.02857 |   1.2570845 |     71.40714 |   1.2773582 |

### Summary by Parental Level of Education
kable(summary_parent_edu, caption = "Mean Test Scores by Parental Level of Education")
Mean Test Scores by Parental Level of Education

| parental_level_of_education | mean_math | sem_math  | mean_reading | sem_reading | mean_writing | sem_writing |
|:----------------------------|----------:|----------:|-------------:|------------:|-------------:|------------:|
| associate’s degree          |  67.88288 | 1.0142573 |     70.92793 |   0.9308229 |     69.89640 |   0.9604996 |
| bachelor’s degree           |  69.38983 | 1.3756873 |     73.00000 |   1.3150639 |     73.38136 |   1.3558464 |
| high school                 |  62.13776 | 1.0385465 |     64.70408 |   1.0094379 |     62.44898 |   1.0061362 |
| master’s degree             |  69.74576 | 1.9728717 |     75.37288 |   1.7933735 |     75.67797 |   1.7875864 |
| some college                |  67.12832 | 0.9520797 |     69.46018 |   0.9350610 |     68.84071 |   0.9986054 |
| some high school            |  63.49721 | 1.1905138 |     66.93855 |   1.1569768 |     64.88827 |   1.1761786 |

### Summary by Lunch
kable(summary_lunch, caption = "Mean Test Scores by Lunch")
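Mean Test Scores by Lunch

| lunch        | mean_math | sem_math  | mean_reading | sem_reading | mean_writing | sem_writing |
|:-------------|----------:|----------:|-------------:|------------:|-------------:|------------:|
| free/reduced |  58.92113 | 0.8046069 |     64.65352 |   0.7905625 |     63.02254 |   0.8191423 |
| standard     |  70.03411 | 0.5376061 |     71.65426 |   0.5445794 |     70.82326 |   0.5646167 |

### Summary by Test Preparation Course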
kable(summary_test_prep, caption = "Mean Test Scores by Test Preparation Course")
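Mean Test Scores by Test Preparation Course

| test_preparation_course | mean_math | sem_math  | mean_reading | sem_reading | mean_writing | sem_writing |
|:------------------------|----------:|----------:|-------------:|------------:|-------------:|------------:|
| completed               |  69.69553 | 0.7634261 |     73.89385 |    0.720811 |     74.41899 |   0.7069084 |
| none                    |  64.07788 | 0.5995952 |     66.53427 |    0.570844 |     64.50467 |   0.5919894 |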

4 Data Visualization

We will visualize the data in two ways. Bar graphs with error bars compare mean test scores across groups, and scatter plots with linear regression lines show how each test score varies with parental level of education. Each regression line is drawn with a confidence band to show the trend and its uncertainty.

4.1 Bar Graphs

First, let’s review some bar graphs of the average test scores by group. We will plot the mean math, reading, and writing scores by gender and by test preparation course; the same pattern can be applied to race/ethnicity, parental level of education, and lunch status using the corresponding summary data frames. Each bar graph includes error bars showing ±1 standard error of the mean (SEM), which reflects how precisely each group mean is estimated rather than the spread of individual scores.
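
The difference between the spread of individual scores (standard deviation) and the precision of a group mean (SEM) can be sketched with a few numbers. The scores below are hypothetical, and the 95% confidence interval is one common way to turn an SEM into a range; it is shown as an illustration, not as something computed elsewhere in this tutorial.

```r
# Hypothetical scores for one group, including one missing value
scores <- c(62, 70, 75, 81, NA)

n_obs <- sum(!is.na(scores))             # number of usable observations
mean_scores <- mean(scores, na.rm = TRUE)
sd_scores <- sd(scores, na.rm = TRUE)    # spread of individual scores
sem_scores <- sd_scores / sqrt(n_obs)    # precision of the estimated mean

# Approximate 95% confidence interval for the mean using the t distribution
t_crit <- qt(0.975, df = n_obs - 1)
ci_95 <- mean_scores + c(-1, 1) * t_crit * sem_scores

cat("mean:", mean_scores, " sd:", sd_scores, " sem:", sem_scores, "\n")
cat("95% CI:", ci_95, "\n")
```

The SEM shrinks as the group grows, while the standard deviation does not, which is why large groups in the tables above have small SEM values even though individual scores vary widely.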

4.1.1 Math Scores by Gender

This bar graph displays the average math scores for male and female students. The error bars show ±1 SEM, indicating how precisely each group’s mean is estimated. Comparing the heights of the bars shows whether math performance differs noticeably between male and female students.

### Math Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_math, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math), width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Gender", x="Gender", y="Mean Math Score") +
  theme_minimal()

4.1.2 Reading and Writing Scores by Gender

These bar graphs show the average reading and writing scores for male and female students. As in the previous graph, the error bars represent the SEM. Clearly separated error bars suggest a difference between genders, but a formal statistical test is needed to establish significance.

### Reading Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_reading, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading), width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Gender", x="Gender", y="Mean Reading Score") +
  theme_minimal()

### Writing Scores by Gender
ggplot(summary_gender, aes(x=gender, y=mean_writing, fill=gender)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing), width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Gender", x="Gender", y="Mean Writing Score") +
  theme_minimal()
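
Error bars alone do not establish whether a difference is statistically significant. As a hedged sketch of one way to check, the example below runs a Welch two-sample t-test with base R's `t.test()`. The data are simulated (the group means and standard deviation are assumptions chosen only to resemble the reading scores), so the printed result says nothing about the real dataset.

```r
set.seed(42)  # reproducible simulated data; NOT the Kaggle scores

# Hypothetical reading scores for two groups of 200 students each
sim_data <- data.frame(
  gender = rep(c("female", "male"), each = 200),
  reading_score = c(rnorm(200, mean = 72, sd = 14),
                    rnorm(200, mean = 65, sd = 14))
)

# Welch two-sample t-test (the t.test() default, which allows unequal variances)
test_result <- t.test(reading_score ~ gender, data = sim_data)
print(test_result)

# The estimated group means and the p-value can be extracted directly
group_means <- test_result$estimate
p_value <- test_result$p.value
```

With the real data, the same call would use `ed_data` in place of `sim_data`.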

4.1.3 Scores by Test Preparation Course

These bar graphs illustrate the average math, reading, and writing scores for students who completed a test preparation course versus those who did not. The error bars represent the SEM. Because the data are observational, these graphs show an association between test preparation and scores rather than a proven causal impact.

### Math Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_math, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_math - sem_math, ymax=mean_math + sem_math), width=0.2, position=position_dodge(0.9)) +
  labs(title="Math Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Math Score") +
  theme_minimal()

### Reading Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_reading, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_reading - sem_reading, ymax=mean_reading + sem_reading), width=0.2, position=position_dodge(0.9)) +
  labs(title="Reading Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Reading Score") +
  theme_minimal()

### Writing Scores by Test Preparation Course
ggplot(summary_test_prep, aes(x=test_preparation_course, y=mean_writing, fill=test_preparation_course)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_writing - sem_writing, ymax=mean_writing + sem_writing), width=0.2, position=position_dodge(0.9)) +
  labs(title="Writing Scores by Test Preparation Course", x="Test Preparation Course", y="Mean Writing Score") +
  theme_minimal()
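
The three near-identical plots for each grouping can also be drawn as one faceted figure by reshaping the summary table to long format. The sketch below assumes tidyr's `pivot_longer()` for the reshaping. The data frame `summary_gender_demo` is a hand-typed copy of the gender summary table from Section 3.1.2, rounded to two decimals, so the block runs on its own.

```r
library(tidyr)
library(ggplot2)

# Gender summary values copied (rounded) from the table in Section 3.1.2
summary_gender_demo <- data.frame(
  gender = c("female", "male"),
  mean_math = c(63.63, 68.73),
  sem_math = c(0.68, 0.65),
  mean_reading = c(72.61, 65.47),
  sem_reading = c(0.63, 0.63),
  mean_writing = c(72.47, 63.31),
  sem_writing = c(0.65, 0.64)
)

# Reshape to one row per gender and subject, with columns "mean" and "sem".
# ".value" tells pivot_longer to use the part before "_" as the new column name.
summary_long <- pivot_longer(
  summary_gender_demo,
  cols = -gender,
  names_to = c(".value", "subject"),
  names_sep = "_"
)

# One panel per subject instead of three separate plots
p <- ggplot(summary_long, aes(x = gender, y = mean, fill = gender)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.2) +
  facet_wrap(~ subject) +
  labs(title = "Mean Scores by Gender", x = "Gender", y = "Mean Score") +
  theme_minimal()
print(p)
```

Putting the subjects on a shared y-axis makes it easier to compare the size of the gap across math, reading, and writing.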

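4.2 Scatter Plots

These scatter plots show how math, reading, and writing scores vary with parental level of education. Each point represents an individual student, jittered horizontally to reduce overplotting. The linear regression line with its confidence band shows the overall trend. Because the x-axis is an ordered factor, the line treats the education levels as equally spaced positions, so it is best read as a rough trend rather than a precise slope.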

# Order parental_level_of_education for better visualization
ed_data$parental_level_of_education <- factor(ed_data$parental_level_of_education,
                                              levels = c("some high school", "high school", "some college",
                                                         "associate's degree", "bachelor's degree", "master's degree"))

# Scatter plot for Math Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = math_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Math Scores",
       x = "Parental Level of Education",
       y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for Reading Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = reading_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Reading Scores",
       x = "Parental Level of Education",
       y = "Reading Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for Writing Scores
ggplot(ed_data, aes(x = parental_level_of_education, y = writing_score)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5) +
  geom_smooth(method = "lm", aes(group = 1), se = TRUE) +
  labs(title = "Correlation between Parental Level of Education and Writing Scores",
       x = "Parental Level of Education",
       y = "Writing Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
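
Since parental level of education is an ordered category rather than a number, a rank-based measure is one reasonable complement to the fitted lines above. The sketch below uses Spearman's rank correlation via base R's `cor.test()` on simulated data; the sample size, the slope, and the noise level are assumptions for illustration only, not properties of the Kaggle dataset.

```r
set.seed(1)  # simulated data for illustration only

edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# Hypothetical students: education level drawn at random,
# with scores rising slightly as the level increases
edu <- factor(sample(edu_levels, 300, replace = TRUE),
              levels = edu_levels, ordered = TRUE)
math_score <- 60 + 2 * as.integer(edu) + rnorm(300, sd = 12)

# Spearman's rho uses only the rank order of the levels, not their spacing.
# exact = FALSE avoids a warning about ties, which are unavoidable here.
rho_test <- cor.test(as.integer(edu), math_score,
                     method = "spearman", exact = FALSE)
print(rho_test)

# Per-level medians make no linearity assumption at all
medians <- tapply(math_score, edu, median)
print(medians)
```

With the real data, `ed_data$parental_level_of_education` (after the factor ordering above) and a score column would take the place of `edu` and `math_score`.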

5 Conclusion

In this tutorial, we explored the relationships between various factors and student test scores using an education dataset. We performed the following analyses:

  1. Descriptive Statistics: We calculated the mean and standard error of the mean (SEM) for math, reading, and writing scores, grouped by gender, race/ethnicity, parental level of education, lunch status, and test preparation course.
  2. Bar Graphs: We visualized the average test scores with error bars to understand the differences between groups.
  3. Scatter Plots: We examined the correlations between different test scores and parental level of education using scatter plots with linear regression lines and confidence intervals.

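Through these analyses, we gained insights into how various factors relate to student performance. For instance, female students averaged higher in reading (72.6 vs. 65.5) and writing (72.5 vs. 63.3), while male students averaged higher in math (68.7 vs. 63.6). Students who completed the test preparation course averaged higher on all three tests, and mean scores also differed across race/ethnicity groups.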

5.1 Benefits of R Markdown

R Markdown is a powerful tool for creating dynamic, reproducible documents that integrate code, output, and narrative text. Some of the key benefits include:

  • Reproducibility: R Markdown allows you to embed R code directly within your document. This ensures that your analysis is fully reproducible, as the code and output are always in sync.
  • Versatility: You can generate reports in various formats, including HTML, PDF, and Word, making it easy to share your findings with others.
  • Clarity: By combining code, output, and explanations in a single document, R Markdown helps you clearly communicate your analysis and results.
  • Efficiency: Automating your analysis and report generation saves time and reduces the risk of errors that can occur when manually copying and pasting results.
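
To make the points above concrete, here is a minimal sketch of what an R Markdown source file contains and how it could be rendered from R. The file name `demo_report.Rmd` and the report title are hypothetical. The render step runs only if the rmarkdown package and pandoc are available, and the code-chunk fence is assembled with `strrep()` purely so that this example can be displayed inside a code block.

```r
# Build a minimal R Markdown file as a character vector:
# a YAML header, one line of narrative text, and one R code chunk.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Student Performance Report\"",
  "output: html_document",
  "---",
  "",
  "Mean of three example math scores:",
  "",
  paste0(fence, "{r}"),
  "mean(c(72, 69, 90))",
  fence
)

# Write the file to a temporary directory (hypothetical file name)
rmd_path <- file.path(tempdir(), "demo_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  message("Report written to: ", out_file)
}
```

Changing `output: html_document` to `pdf_document` or `word_document` in the header switches the output format without touching the analysis code.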

By using R Markdown, you can create comprehensive, professional-quality reports that facilitate better understanding and communication of your data analysis. This makes it an invaluable tool for students, researchers, and professionals alike.