In this project, I look at the Students Performance in Exams dataset which I found on Kaggle (https://www.kaggle.com/spscientist/students-performance-in-exams). It was mostly clean to begin with and contains 8 different variables, 5 of which are String columns and 3 are Integers. It includes variables such as, gender, race/ethnicity, parental level of education, lunch, and test preparation course, among other. I will start with some exploratory data analysis and try to uncover any interesting patterns in the dataset.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
setwd("~/Desktop/DATA110")
data <- read_csv("StudentsPerformance.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## gender = col_character(),
## `race/ethnicity` = col_character(),
## `parental level of education` = col_character(),
## lunch = col_character(),
## `test preparation course` = col_character(),
## `math score` = col_double(),
## `reading score` = col_double(),
## `writing score` = col_double()
## )
Before I begin my analysis, I will rename some of the columns to get rid of any spaces and make it more convenient to work with them.
names(data) <- str_replace_all(names(data), c(" " = "_", "/" = "_"))
data
## # A tibble: 1,000 x 8
## gender race_ethnicity parental_level_… lunch test_preparatio… math_score
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 female group B bachelor's degr… stan… none 72
## 2 female group C some college stan… completed 69
## 3 female group B master's degree stan… none 90
## 4 male group A associate's deg… free… none 47
## 5 male group C some college stan… none 76
## 6 female group B associate's deg… stan… none 71
## 7 female group B some college stan… completed 88
## 8 male group B some college free… none 40
## 9 male group D high school free… completed 64
## 10 female group B high school free… none 38
## # … with 990 more rows, and 2 more variables: reading_score <dbl>,
## # writing_score <dbl>
As part of cleaning the data, I am creating a new column called average which records average student exam scores. After creating a new column, I am converting the chr columns to factors to be used in my analysis later.
data <- data %>%
mutate(test_preparation_course = ifelse(test = data$test_preparation_course == "completed", yes = "Completed", no = "Not Completed")) %>%
mutate(average = (math_score + reading_score + writing_score)/3)
data$parental_level_of_education<-factor(data$parental_level_of_education, levels=c("some high school", "some college","high school", "associate's degree", "bachelor's degree", "master's degree"))
data$lunch<-factor(data$lunch, levels=c("free/reduced", "standard"))
data$lunch<-factor(data$lunch, levels=c("free/reduced", "standard"))
For my initial analysis, I am creating a bar graph which shows the distribution of students by parental education and type of lunch.
p1<- data %>%
ggplot() +
geom_bar(mapping = aes(x = parental_level_of_education, fill = lunch), position = "dodge") +
labs(title = "Distribution of Students by Parental Education and Type of Lunch",
subtitle = "Grouped by Successful Completion of Test Prep Course",
caption = "Source: www.kaggle.com") +
ylab("Count") +
facet_wrap(test_preparation_course ~ .) +
theme(axis.title.x=element_blank(),
axis.text.x = element_text(angle = 80, vjust = .55, hjust = 0.5)) +
scale_fill_manual(name = "Lunch",
labels= c("Free/Reduced", "Standard"),
values = c("free/reduced" = "#f07167", "standard" = "#0081a7"))
p1
The second plot I am creating is a histogram where average student scores is my explanatory variable. I found some interesting patterns in this visualization which I discuss in my essay below.
p2<- data %>%
ggplot() +
geom_histogram(mapping = aes(x = average, fill = lunch), position = "dodge", alpha = 0.8, bins = 30) +
labs(title = "Distribution of Students by Average Score and Type of Lunch",
subtitle = "Grouped by Successful Completion of Test Prep Course",
caption = "Source: www.kaggle.com") +
facet_wrap(test_preparation_course ~ .) +
xlab("Average Score (from 0 to 100)") +
ylab("Frequency") +
theme(axis.text.x = element_text(angle = 80, vjust = .55, hjust = 0.5)) +
theme_gray() +
scale_fill_manual(name = "Lunch",
labels= c("Free/Reduced", "Standard"),
values = c("#335c67" , "#f6bd60"))
p2
For my third visualization, I am creating a boxplot that shows the frequency of average exam scores grouped by parental education. I later use these variables to fit a linear model and perform statistical analysis to test the accuracy of that model where average test scores is my dependent variable and parental education is the independent variable.
p3 <- data %>%
filter(parental_level_of_education %in% c("high school", "associate's degree", "bachelor's degree", "master's degree")) %>%
ggplot() +
geom_boxplot(mapping = aes(x = parental_level_of_education, y = average,
fill = parental_level_of_education), color = "#2e294e", alpha = 0.8) +
labs(title = "Frequency of Student Exam Scores by Parental Education",
caption = "Source: www.kaggle.com", fill = "Parental Education") +
ylab("Frequency of Average scores") +
theme_linedraw() +
theme(axis.title.y=element_blank()) +
theme_gray() +
theme(axis.title.y=element_blank(),
strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top") +
coord_flip() +
scale_fill_manual(name = "Parental Education",
labels = c("High School", "Associate's", "Bachelor's", "Master's"),
values=c("#14213d", "#0096c7", "#fca311", "#e5e5e5"))
ggplotly(p3)
According to the output, even though the median is quite close to 0 and the values for 1Q and 3Q are approximately the same distance away from 0, there is a relatively large gap between min and max residuals in the summary which could indicate a lack of accuracy of this model. It could also indicate that the residuals are not distributed evenly around the mean. In addition to the intercept, slope and the p-value, we also need to look at the adjusted r-squared. It states that 0.47 or about 5% of the variation in the observations may be explained by the model. This means that 95% of the variation in the data is likely not explained by this model.
fit1 <- lm(average ~ parental_level_of_education, data = data)
summary(fit1)
##
## Call:
## lm(formula = average ~ parental_level_of_education, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.108 -9.259 0.874 9.895 33.892
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 65.108 1.041 62.571
## parental_level_of_educationsome college 3.368 1.393 2.418
## parental_level_of_educationhigh school -2.011 1.439 -1.397
## parental_level_of_educationassociate's degree 4.461 1.398 3.190
## parental_level_of_educationbachelor's degree 6.816 1.651 4.129
## parental_level_of_educationmaster's degree 8.491 2.090 4.063
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## parental_level_of_educationsome college 0.01578 *
## parental_level_of_educationhigh school 0.16265
## parental_level_of_educationassociate's degree 0.00147 **
## parental_level_of_educationbachelor's degree 3.95e-05 ***
## parental_level_of_educationmaster's degree 5.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 994 degrees of freedom
## Multiple R-squared: 0.05131, Adjusted R-squared: 0.04654
## F-statistic: 10.75 on 5 and 994 DF, p-value: 4.381e-10
In order to improve on my previous model, I am creating another plot but this time changing my independent variable to student math scores while keeping my dependent variable the same.
p4 <- data %>%
mutate(average = (math_score + reading_score + writing_score)/3) %>%
mutate(Range = ifelse(average < 20, "0-19",
ifelse(average < 40, "20-39",
ifelse(average < 60, "40-59",
ifelse(average < 80, "60-79", "80-100"))))) %>%
ggplot() +
geom_jitter(mapping = aes(x = math_score, y = average, color = Range), alpha = 0.8) +
geom_smooth(method = lm, col = "#ffbe0b", mapping = aes(x = average, y = math_score)) +
labs(caption = "Source: www.kaggle.com") +
xlab("Math score (from 0 - 100)") +
ylab("Average (reading, writing & math scores)") +
annotate("text", x = 80, y = 25, label = "Average Exam Score of Students \n by Math Score") +
theme_gray()+
theme(legend.position = "bottom") +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top") +
scale_color_manual(values=c("#c5d86d", "#1b998b", "#f46036", "#d7263d", "#2e294e" ))
ggplotly(p4)
## `geom_smooth()` using formula 'y ~ x'
According to the output this time, we observe a large improvement over the previous model. The correlation coefficient in this case, which we couldn’t compute before as the explanatory variable was not numerical, is 0.92. This is a value between -1 and 1, inclusive. According to the class notes, “the correlation coefficient tells how strong or weak the correlation is. Values closer to +/- 1 are strong correlation (the sign is determined by the linear slope), values close to +/- 0.5 are weak correlation, and values close to zero have no correlation.” So, we observe a significantly strong correlation in this case.
The model has the equation: average score = 0.8639(math_score) + 10.68
The slope may be interpreted as the following: For each unit increase in math score, there is a predicted increase of 0.8639 in the average student score.
The next part of the summary we observe are the residuals which shows that min and max are roughly the same distance away from 0, as are 1Q and 3Q and at the same time the median is quite close to 0 as well. This is already a good indication that this model is an improvement over the previous one. The p-value on the right of math_score has 3 asterisks which suggests it is a meaningful variable to explain the linear increase in the average_score. lastly, we look at the adjusted r-squared value which is 0.84. In other words, 84% of the variation in the observations may be explained by the model. In the end, we observe that overall this model is a big improvement over the previous one.
cor(data$math_score, data$average)
## [1] 0.9187458
fit2 <- lm(average ~ math_score, data = data)
summary(fit2)
##
## Call:
## lm(formula = average ~ math_score, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.402 -4.107 0.081 4.188 14.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.67868 0.79685 13.40 <2e-16 ***
## math_score 0.86387 0.01175 73.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.632 on 998 degrees of freedom
## Multiple R-squared: 0.8441, Adjusted R-squared: 0.8439
## F-statistic: 5403 on 1 and 998 DF, p-value: < 2.2e-16
The topic of my data is Student Performance in Exams. It records exam scores earned by high school students in the United States. There are 5 string and 3 integer columns in the dataset which are as follows: race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score. Out of these, the only three numeric columns are math score, reading score and writing score. The data came from Kaggle and was mostly clean to begin with. I chose this dataset because I was interested in exploring any relationships that may exist between student performance in math, reading and writing and some of the other variables that exist in the data such as students’ parental level of education and the impact of whether or not the students completed any test prep courses.
An article published by Brookings.edu, reported that despite the education reform that began in the 1980s when many states raised standards and increased graduation requirements, academic performance has largely remained unchanged over the past 25 years and the performance gaps between racial groups remain large. According to the article, “the current K-12 reform movement began in 1983 when a federal panel, the National Commission on Excellence in Education, issued its report A Nation at Risk” (Ravitch). It also states that something that is generally agreed is that public schools currently demand too little of their students. Through the variables provided in this data I wished to study what might be some of the variables that may be hindering academic progress and success and look for any interesting patterns that I could research.
An interesting pattern I observed was during exploratory data analysis. My first plot that I created was a bar graph of distribution of students by parental education and classification of their lunch. I further divided this plot into two subplots based on the test preparation column. This column represents whether or not a student successfully completed the test prep course. As I didn’t find anything particularly striking, I created a similar plot but this type replacing my explanatory variable from parental education level to average student exam score while keeping my response variable the same.
This time an interesting pattern emerged. I noticed that among students who completed the test prep course, there were students from both classifications of the lunch variable who scored near 100. On the other hand, among students who did not complete a test prep course there were no students in the Free/Reduced lunch classification who scored near 100. Similarly, in the first group (students who completed the prep course), none of the students scored below a 25. On the other hand, out of the students who did not complete a prep course, the only students who scored below a 25 were in the free/reduced lunch classification. This is something that I found interesting and wished to explore further.
I also wish I could have shown a plot in highcharter. The visualizations done in highcharter I have seen so far look so interesting and I am hoping to incorporate something similar into my final project.
Reference: Ravitch, Diane. “Student Performance Today.” Brookings, Brookings, 28 July 2016, www.brookings.edu/research/student-performance-today/.