In this project, I look at the Students Performance in Exams dataset, which I found on Kaggle (https://www.kaggle.com/spscientist/students-performance-in-exams). It was mostly clean to begin with and contains 8 variables: 5 character (string) columns and 3 integer columns. These include gender, race/ethnicity, parental level of education, lunch, and test preparation course, as well as math, reading, and writing scores. I will start with some exploratory data analysis and try to uncover any interesting patterns in the dataset.

Loading Necessary Libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Setting the Working Directory and Loading the Data

setwd("~/Desktop/DATA110")
data <- read_csv("StudentsPerformance.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   gender = col_character(),
##   `race/ethnicity` = col_character(),
##   `parental level of education` = col_character(),
##   lunch = col_character(),
##   `test preparation course` = col_character(),
##   `math score` = col_double(),
##   `reading score` = col_double(),
##   `writing score` = col_double()
## )

Cleaning the Dataset

Before I begin my analysis, I will rename some of the columns to remove spaces and slashes and make them more convenient to work with.

names(data) <- str_replace_all(names(data), c(" " = "_", "/" = "_"))
data
## # A tibble: 1,000 x 8
##    gender race_ethnicity parental_level_… lunch test_preparatio… math_score
##    <chr>  <chr>          <chr>            <chr> <chr>                 <dbl>
##  1 female group B        bachelor's degr… stan… none                     72
##  2 female group C        some college     stan… completed                69
##  3 female group B        master's degree  stan… none                     90
##  4 male   group A        associate's deg… free… none                     47
##  5 male   group C        some college     stan… none                     76
##  6 female group B        associate's deg… stan… none                     71
##  7 female group B        some college     stan… completed                88
##  8 male   group B        some college     free… none                     40
##  9 male   group D        high school      free… completed                64
## 10 female group B        high school      free… none                     38
## # … with 990 more rows, and 2 more variables: reading_score <dbl>,
## #   writing_score <dbl>

As part of cleaning the data, I am creating a new column called average, which records each student's average score across the three exams. After creating the new column, I recode the test preparation course values and convert the parental education and lunch columns to factors with explicit level orderings for use in my analysis later.

data <- data %>%
  mutate(test_preparation_course = ifelse(test_preparation_course == "completed", "Completed", "Not Completed")) %>%
  mutate(average = (math_score + reading_score + writing_score)/3)

data$parental_level_of_education<-factor(data$parental_level_of_education, levels=c("some high school", "some college","high school", "associate's degree", "bachelor's degree", "master's degree"))
data$lunch<-factor(data$lunch, levels=c("free/reduced", "standard"))
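The two factor conversions above leave the other character columns (gender, race_ethnicity, and test_preparation_course) as plain character vectors. If those were needed as factors as well, a minimal sketch using dplyr's across() could handle them in one step:

# Optional sketch: convert any remaining character columns to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))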

Exploratory Data Analysis

For my initial analysis, I am creating a bar graph which shows the distribution of students by parental education and type of lunch.

p1<- data %>%
  ggplot() +
  geom_bar(mapping = aes(x = parental_level_of_education, fill = lunch), position = "dodge") +
  labs(title = "Distribution of Students by Parental Education and Type of Lunch", 
       subtitle = "Grouped by Successful Completion of Test Prep Course", 
       caption = "Source: www.kaggle.com") +
  ylab("Count") +
  facet_wrap(test_preparation_course ~ .) +
  theme(axis.title.x=element_blank(), 
        axis.text.x = element_text(angle = 80, vjust = .55, hjust = 0.5)) +
  scale_fill_manual(name = "Lunch", 
                    labels= c("Free/Reduced", "Standard"), 
                    values = c("free/reduced" = "#f07167", "standard" = "#0081a7"))
p1

The second plot I am creating is a histogram with average student score on the x-axis. I found some interesting patterns in this visualization, which I discuss in my essay below.

p2<- data %>%
  ggplot() +
  geom_histogram(mapping = aes(x = average, fill = lunch), position = "dodge", alpha = 0.8, bins = 30) +
  labs(title = "Distribution of Students by Average Score and Type of Lunch",
       subtitle = "Grouped by Successful Completion of Test Prep Course", 
       caption = "Source: www.kaggle.com") +
  facet_wrap(test_preparation_course ~ .) +
  xlab("Average Score (from 0 to 100)") +
  ylab("Frequency") +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 80, vjust = .55, hjust = 0.5)) +
  scale_fill_manual(name = "Lunch", 
                    labels= c("Free/Reduced", "Standard"), 
                    values = c("#335c67" , "#f6bd60"))
p2

Linear Regression

For my third visualization, I am creating a boxplot that shows the distribution of average exam scores for four parental education levels (high school through master's degree). I later use these variables to fit a linear model and perform statistical analysis to assess how well that model fits, with average exam score as my dependent variable and parental education as the independent variable.

p3 <- data %>%
  filter(parental_level_of_education %in% c("high school", "associate's degree", "bachelor's degree", "master's degree")) %>%
  ggplot() + 
  geom_boxplot(mapping = aes(x = parental_level_of_education, y = average, 
                             fill = parental_level_of_education), color = "#2e294e", alpha = 0.8) +
  labs(title = "Frequency of Student Exam Scores by Parental Education", 
       caption = "Source: www.kaggle.com", fill = "Parental Education") +
  ylab("Frequency of Average scores") +
  theme_linedraw() +
  theme(axis.title.y=element_blank()) +
  theme_gray() +
  theme(axis.title.y=element_blank(), 
        strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(), 
        legend.position = "top") +
  coord_flip() +
  scale_fill_manual(name = "Parental Education", 
                    labels = c("High School", "Associate's", "Bachelor's", "Master's"), 
                    values=c("#14213d", "#0096c7", "#fca311", "#e5e5e5"))

ggplotly(p3)

According to the model summary below, even though the median residual is quite close to 0 and the 1Q and 3Q values are approximately the same distance from 0, there is a relatively large gap between the minimum and maximum residuals, which could indicate a poor fit for some observations or that the residuals are not distributed evenly around the mean. In addition to the intercept, slopes, and p-values, we also need to look at the adjusted R-squared. It is about 0.047, meaning only about 5% of the variation in the observations is explained by the model; the remaining 95% of the variation is likely not explained by it.

fit1 <- lm(average ~ parental_level_of_education, data = data)
summary(fit1)
## 
## Call:
## lm(formula = average ~ parental_level_of_education, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.108  -9.259   0.874   9.895  33.892 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                     65.108      1.041  62.571
## parental_level_of_educationsome college          3.368      1.393   2.418
## parental_level_of_educationhigh school          -2.011      1.439  -1.397
## parental_level_of_educationassociate's degree    4.461      1.398   3.190
## parental_level_of_educationbachelor's degree     6.816      1.651   4.129
## parental_level_of_educationmaster's degree       8.491      2.090   4.063
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## parental_level_of_educationsome college        0.01578 *  
## parental_level_of_educationhigh school         0.16265    
## parental_level_of_educationassociate's degree  0.00147 ** 
## parental_level_of_educationbachelor's degree  3.95e-05 ***
## parental_level_of_educationmaster's degree    5.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.92 on 994 degrees of freedom
## Multiple R-squared:  0.05131,    Adjusted R-squared:  0.04654 
## F-statistic: 10.75 on 5 and 994 DF,  p-value: 4.381e-10
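To check the residual pattern described above more directly, a couple of quick diagnostic plots could be drawn; this is just a sketch (output not shown):

# Sketch: inspect the residual distribution of fit1
hist(resid(fit1), breaks = 30, main = "Residuals of fit1", xlab = "Residual")
plot(fit1, which = 2)  # normal Q-Q plot of the residuals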

In order to improve on my previous model, I am creating another plot but this time changing my independent variable to student math scores while keeping my dependent variable the same.

p4 <- data %>%
   mutate(Range = ifelse(average < 20, "0-19", 
                        ifelse(average < 40, "20-39",
                               ifelse(average < 60, "40-59",
                                      ifelse(average < 80, "60-79", "80-100"))))) %>%
  ggplot() + 
  geom_jitter(mapping = aes(x = math_score, y = average, color = Range), alpha = 0.8) +
  geom_smooth(method = lm, mapping = aes(x = math_score, y = average), col = "#ffbe0b") +
  labs(caption = "Source: www.kaggle.com") +
  xlab("Math score (from 0 - 100)") +
  ylab("Average (reading, writing & math scores)") +
  annotate("text", x = 80, y = 25, label = "Average Exam Score of Students \n by Math Score") +
  theme_gray() +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
        legend.position = "top") +
  scale_color_manual(values=c("#c5d86d", "#1b998b", "#f46036", "#d7263d", "#2e294e" )) 

ggplotly(p4)
## `geom_smooth()` using formula 'y ~ x'

According to the output this time, we observe a large improvement over the previous model. The correlation coefficient, which we could not compute before because the explanatory variable was not numerical, is 0.92, a value that falls between -1 and 1, inclusive. According to the class notes, “the correlation coefficient tells how strong or weak the correlation is. Values closer to +/- 1 are strong correlation (the sign is determined by the linear slope), values close to +/- 0.5 are weak correlation, and values close to zero have no correlation.” So we observe a very strong correlation in this case.

The model has the equation: average score = 0.8639(math_score) + 10.68

The slope may be interpreted as the following: For each unit increase in math score, there is a predicted increase of 0.8639 in the average student score.

The next part of the summary we observe is the residuals, which show that the min and max are roughly the same distance from 0, as are 1Q and 3Q, and that the median is quite close to 0 as well. This is already a good indication that this model is an improvement over the previous one. The p-value to the right of math_score has three asterisks, which suggests it is a meaningful variable for explaining the linear increase in the average score. Lastly, we look at the adjusted R-squared value, which is 0.84: 84% of the variation in the observations may be explained by the model. Overall, this model is a big improvement over the previous one.

cor(data$math_score, data$average)
## [1] 0.9187458
fit2 <- lm(average ~ math_score, data = data)
summary(fit2)
## 
## Call:
## lm(formula = average ~ math_score, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.402  -4.107   0.081   4.188  14.626 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.67868    0.79685   13.40   <2e-16 ***
## math_score   0.86387    0.01175   73.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.632 on 998 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.8439 
## F-statistic:  5403 on 1 and 998 DF,  p-value: < 2.2e-16
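To make the slope interpretation concrete, predictions for two math scores one point apart should differ by roughly 0.864; below is a small sketch using predict(), where the scores 70 and 71 are arbitrary examples of my own choosing:

# Sketch: predicted average scores for math scores of 70 and 71
predict(fit2, newdata = data.frame(math_score = c(70, 71)))
# the difference between the two predictions equals the slope, ~0.864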

Summary and Analysis

The topic of my data is Student Performance in Exams. It records exam scores earned by high school students in the United States. There are 5 string and 3 integer columns in the dataset: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score. Of these, the only numeric columns are math score, reading score, and writing score. The data came from Kaggle and was mostly clean to begin with. I chose this dataset because I was interested in exploring any relationships that may exist between student performance in math, reading, and writing and some of the other variables in the data, such as parental level of education and whether or not the students completed a test prep course.

An article published by Brookings.edu reported that despite the education reform that began in the 1980s, when many states raised standards and increased graduation requirements, academic performance has largely remained unchanged over the past 25 years, and the performance gaps between racial groups remain large. According to the article, “the current K-12 reform movement began in 1983 when a federal panel, the National Commission on Excellence in Education, issued its report A Nation at Risk” (Ravitch). It also states that one point of general agreement is that public schools currently demand too little of their students. Through the variables provided in this data, I wished to study which factors may be hindering academic progress and success and to look for any interesting patterns that I could research.

An interesting pattern emerged during exploratory data analysis. The first plot I created was a bar graph of the distribution of students by parental education and lunch classification. I further divided this plot into two subplots based on the test preparation column, which represents whether or not a student successfully completed the test prep course. As I didn't find anything particularly striking, I created a similar plot, but this time replacing parental education level with average exam score as my explanatory variable while keeping my response variable the same.

This time an interesting pattern emerged. I noticed that among students who completed the test prep course, there were students from both classifications of the lunch variable who scored near 100. On the other hand, among students who did not complete a test prep course there were no students in the Free/Reduced lunch classification who scored near 100. Similarly, in the first group (students who completed the prep course), none of the students scored below a 25. On the other hand, out of the students who did not complete a prep course, the only students who scored below a 25 were in the free/reduced lunch classification. This is something that I found interesting and wished to explore further.
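One way to follow up on this pattern would be to count the extreme scorers in each lunch and prep-course group; below is a rough sketch, where the cutoffs of 95 for "near 100" and 25 for the low end are my own choices rather than anything defined in the data:

# Sketch: count very high and very low average scores by group
data %>%
  group_by(test_preparation_course, lunch) %>%
  summarise(near_100 = sum(average >= 95),
            below_25 = sum(average < 25),
            .groups = "drop")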

I also wish I could have shown a plot in highcharter. The visualizations done in highcharter I have seen so far look so interesting and I am hoping to incorporate something similar into my final project.

Reference: Ravitch, Diane. “Student Performance Today.” Brookings, Brookings, 28 July 2016, www.brookings.edu/research/student-performance-today/.