An Exploratory Data Analysis of Students’ Performance

Disclaimer: This report was produced for an assignment in an introductory data mining and visualisation class, and some of the analysis may contain errors. The findings should not be generalised.

Introduction

This project’s aim is to analyze and discuss Students’ Performance using various visualisation techniques in R. The dataset used in this project was taken from: https://www.kaggle.com/spscientist/students-performance-in-exams

Libraries

library(tidyverse)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(stringr)

Data Set

data<-read.csv("/Users/Stephie/Desktop/StudentsPerformance.csv")
head(data) # first six rows of the data set
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Questions

  1. Most students are a part of which ethnic group?
  2. What is the distribution of students’ overall performance in Math, Reading and Writing?
  3. Which gender showed greater participation in the test preparation course?
  4. Are students’ scores dependent on their gender?
  5. Does parental level of education influence students’ scores?
  6. Do children that complete the test preparation course attain higher scores?
  7. Is there a correlation between students’ reading and writing scores?
  8. Are students’ math and writing scores correlated?
  9. Is students’ completion of the test preparation course associated with their parents’ level of education?
  10. What is the frequency of free and standard lunches among students of all ethnicities?

Dimensions of the dataset

dimension<-dim(data)
cat("The dimension of the dataset is:", dimension)
## The dimension of the dataset is: 1000 8

Summary of the dataset

summary(data)
##     gender          race.ethnicity     parental.level.of.education
##  Length:1000        Length:1000        Length:1000                
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##     lunch           test.preparation.course   math.score     reading.score   
##  Length:1000        Length:1000             Min.   :  0.00   Min.   : 17.00  
##  Class :character   Class :character        1st Qu.: 57.00   1st Qu.: 59.00  
##  Mode  :character   Mode  :character        Median : 66.00   Median : 70.00  
##                                             Mean   : 66.09   Mean   : 69.17  
##                                             3rd Qu.: 77.00   3rd Qu.: 79.00  
##                                             Max.   :100.00   Max.   :100.00  
##  writing.score   
##  Min.   : 10.00  
##  1st Qu.: 57.75  
##  Median : 69.00  
##  Mean   : 68.05  
##  3rd Qu.: 79.00  
##  Max.   :100.00

Findings in the dataset: There are three numeric variables (math score, reading score and writing score) which show a common maximum value of 100. Students’ scores show great variation across the three subjects but reading score has the narrowest range. The five non-numeric variables are gender, race/ethnicity, parental level of education, lunch and test preparation course.

Question 1: Most students are a part of which ethnic group?

ggplot(data, aes(x = race.ethnicity, fill = gender)) +
  geom_bar(position = "stack") + ggtitle("Distribution of Students") +
  ylab("Students") + xlab("Groups")

The stacked bar plot shows that most students belong to group C, followed by groups D, B and E; group A has the fewest students.

Question 2: What is the distribution of students’ overall performance in Math, Reading and Writing?

data <- data %>% mutate(subjects.scores = math.score + reading.score + writing.score)

# Histogram on the density scale so the kernel density curve can be overlaid
ggplot(data, aes(x = subjects.scores)) +
  geom_histogram(bins = 40, aes(y = after_stat(density)), color = '#2980B9', fill = '#00AFBB') +
  geom_density(alpha = 0.3) +
  ggtitle("Distribution of Scores") +
  ylab("Density") + xlab("Overall Performance")

The histogram shows that the distribution of students’ overall scores is negatively (left) skewed: the modal score is higher than the mean score. Students tend to attain marks in the upper half of the range rather than the lower half.
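
One way to check the skew numerically rather than by eye is to compare the mean with the median and compute a moment-based skewness coefficient. The sketch below runs on simulated left-skewed totals, not the actual data; `sample_skewness` and `totals` are hypothetical names, and `data$subjects.scores` could be passed in place of `totals`.

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Simulated left-skewed totals on a 0-300 scale (illustrative only, not the real scores).
set.seed(1)
totals <- pmax(300 - rexp(1000, rate = 1 / 50), 0)

skew <- sample_skewness(totals)
mean_below_median <- mean(totals) < median(totals)

cat("Skewness:", round(skew, 2), "\n")
cat("Mean below median:", mean_below_median, "\n")
```

A negative coefficient together with a mean below the median is consistent with a left-skewed distribution.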

Question 3: Which gender showed greater participation in the test preparation course?

participation<-table(data$test.preparation.course,data$gender)
participation
##            
##             female male
##   completed    184  174
##   none         334  308

The table shows that a slightly higher proportion of males than females completed the test preparation course. Although more females (184) than males (174) completed the course, there are also more females in the dataset (518 versus 482), so the completion rate is about 36.1% for males (174 of 482) and about 35.5% for females (184 of 518). The difference is small.
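
The comparison of proportions can be made explicit with `prop.table()`. The sketch below rebuilds the table from the counts printed above so that it runs on its own; `completion_share` is a hypothetical name.

```r
# Counts copied from the table above (rows: course status, columns: gender).
participation <- matrix(
  c(184, 334, 174, 308),
  nrow = 2,
  dimnames = list(course = c("completed", "none"),
                  gender = c("female", "male"))
)

# margin = 2 divides each cell by its column total,
# giving the share of each course status within each gender.
completion_share <- prop.table(participation, margin = 2)
round(completion_share, 3)
```

In the original session the same call should work directly on the `participation` table created with `table()`.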

Question 4: Are students’ scores dependent on their gender?

p <- ggplot(data, aes(x=gender, y= subjects.scores)) + 
  geom_boxplot(color = '#000000',  fill ='#00AFBB') + ggtitle("Distribution of Scores")+
  ylab("Scores") + xlab("Gender") + coord_flip()
p

The box plot shows that females have higher overall scores than males. This is indicated by the median score (the line inside each box) and the lowest score for females being higher than those for males. Males also have fewer outliers than females.

Question 5: Does parental level of education influence students’ scores?

p <- ggplot(data, aes(x=parental.level.of.education, y= subjects.scores)) + 
  geom_boxplot(color = '#000000',  fill ='#00AFBB') + ggtitle("Distribution of Scores")+
  ylab("Scores") + xlab("Parental Level of Education")
p

The box plot shows that children of parents with a master’s degree scored the highest, whereas students whose parents had a high school level education scored the lowest. This suggests that parental level of education is associated with students’ scores, since students whose parents have a higher level of education tend to achieve better grades.
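
By default the boxes are drawn in alphabetical order of the education level, which can hide the trend. A minimal sketch of reordering the categories from lowest to highest level follows. The six level names, including "some high school", are an assumption about the values in the dataset, and `education` is a small hypothetical sample standing in for `data$parental.level.of.education`.

```r
# Assumed ordering of the education levels, lowest to highest.
education_levels <- c("some high school", "high school", "some college",
                      "associate's degree", "bachelor's degree", "master's degree")

# Hypothetical sample of values (not the real column).
education <- c("master's degree", "high school", "some college",
               "bachelor's degree", "some high school", "associate's degree")

# Setting the factor levels explicitly makes ggplot2 draw the boxes
# in this order instead of alphabetically.
education_ordered <- factor(education, levels = education_levels)
levels(education_ordered)
```

Applying the same `factor()` call to the real column before plotting should place the boxes in educational order along the axis.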

Question 6: Do children that complete the test preparation course attain higher scores?

# x is the overall score; geom_density() puts the estimated density on y
ggplot(data, aes(x = subjects.scores, fill = test.preparation.course)) +
  geom_density(alpha = .3) +
  ggtitle("Test Preparation Course Density Plot") +
  xlab("Overall Score") + ylab("Density")

The density plot displays a noticeable difference between the scores of students who did not take the test preparation course and those who completed it. The scores of students who completed the course are generally higher than those of students who did not. The modal score is greater for prepared students, and their scores are concentrated close to that mode.
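
A density plot shows the difference visually; a two-sample test is one way to check whether it could be due to chance. The sketch below runs a Welch t-test on simulated scores. The group sizes (358 and 642) come from the participation table, but the means and standard deviation are made up for illustration, and `scores` and `welch` are hypothetical names.

```r
# Simulated total scores (illustrative only, not the real data).
set.seed(42)
scores <- data.frame(
  course = rep(c("completed", "none"), times = c(358, 642)),
  total  = c(rnorm(358, mean = 220, sd = 40), rnorm(642, mean = 195, sd = 40))
)

# Welch two-sample t-test: does the mean total differ between the two groups?
welch <- t.test(total ~ course, data = scores)
welch$estimate   # group means
welch$p.value    # small values indicate a difference unlikely to be due to chance
```

On the real data the equivalent call would presumably be `t.test(subjects.scores ~ test.preparation.course, data = data)`.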

Question 7: Is there a correlation between students’ reading and writing scores?

ggplot(data, aes(x = reading.score, y = writing.score)) + 
  geom_point(color = '#00AFBB') + 
  ggtitle("Writing Score vs Reading Score")+
  xlab("Reading Score") + ylab("Writing Score")

The scatterplot shows clearly that there is a strong positive linear relationship between students’ reading and writing scores. Writing scores tend to increase as reading scores increase.

Question 8: Are students’ math and writing scores correlated?

ggplot(data, aes(x = math.score, y = writing.score)) + 
  geom_point(color = '#00AFBB') + 
  ggtitle("Writing Score vs Math Score")+
  xlab("Math Score") + ylab("Writing Score")

The scatterplot shows that there is a positive linear relationship between students’ writing and math scores. Writing scores tend to increase as math scores increase.

Question 9: Is students’ completion of the test preparation course associated with their parents’ level of education?

ggplot(data, aes(x = parental.level.of.education, fill = test.preparation.course)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90)) + 
  ylab("Frequency") + xlab("Parental Level of Education") + coord_flip()

The bar plot shows that students whose parents attained an associate’s degree had the highest number of completions of the test preparation course, while students whose parents had only a high school education had the lowest rate of completion relative to the number of non-completions in the same category. This suggests that children of parents with a lower level of education are less inclined to complete the test preparation course.

Question 10: What is the frequency of free and standard lunches among students of all ethnicities?

ggplot(data, aes(x = race.ethnicity, fill = lunch)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~lunch) + 
  ylab("Frequency") + xlab("Groups")

The faceted bar plot shows that standard lunches are more frequent than free/reduced lunches in every racial group.

Simple Linear Modeling

Splitting Data into Train and Test

data$gender <- as.factor(data$gender)
data$race.ethnicity <- as.factor(data$race.ethnicity)
data$test.preparation.course <- as.factor(data$test.preparation.course)
data$lunch <- as.factor(data$lunch)
data$parental.level.of.education <- as.factor(data$parental.level.of.education)

set.seed(130) 
sampleSize <- floor(.75*nrow(data))
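# Note: replace = TRUE samples with replacement, so train can contain repeated rows
# and test (all rows never drawn) can hold more than 25% of the data.
trainIndexes <- sample(seq_len(nrow(data)), sampleSize, replace = TRUE) 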
train <- data[trainIndexes, ]
test <- data[-trainIndexes, ]
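
A 75/25 split is usually drawn without replacement so that the training and test sets are disjoint and have the intended sizes. The sketch below works on row indices only, assuming 1000 rows as reported by `dim(data)`; `train_idx`, `test_idx` and `boot_idx` are hypothetical names. It is shown for comparison and is not the split that produced the outputs in this report.

```r
set.seed(130)
n <- 1000                                  # number of rows reported by dim(data)
sample_size <- floor(0.75 * n)

# sample() draws without replacement by default, so every index is unique.
train_idx <- sample(seq_len(n), sample_size)
test_idx  <- setdiff(seq_len(n), train_idx)

# For comparison: the same draw with replacement repeats some rows,
# leaving fewer distinct rows in the training set.
set.seed(130)
boot_idx <- sample(seq_len(n), sample_size, replace = TRUE)

c(train = length(train_idx), test = length(test_idx),
  distinct_with_replacement = length(unique(boot_idx)))
```

With the real data frame, `data[train_idx, ]` and `data[test_idx, ]` would then give the two subsets.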

Model 1

Predictor Variable: Reading Score

The first model uses the entire dataset. Reading score is the predictor (independent variable) and writing score is the dependent variable. This model can help to determine the relationship between reading and writing scores.

#Simple linear model
mod1<-lm(writing.score~reading.score, data = data)
mod1
## 
## Call:
## lm(formula = writing.score ~ reading.score, data = data)
## 
## Coefficients:
##   (Intercept)  reading.score  
##       -0.6676         0.9935
summary(mod1)
## 
## Call:
## lm(formula = writing.score ~ reading.score, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9573  -2.9573   0.0363   3.1026  15.0557 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.667554   0.693792  -0.962    0.336    
## reading.score  0.993531   0.009814 101.233   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9112 
## F-statistic: 1.025e+04 on 1 and 998 DF,  p-value: < 2.2e-16

Model 1 was fitted on the entire dataset, with reading score as the predictor and writing score as the dependent variable. The p-value for the reading score coefficient is less than 0.05, which suggests that there is a statistically significant relationship between the variables.
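
In a linear model with a single predictor, the multiple R-squared equals the squared Pearson correlation between the two variables, so the correlation between reading and writing scores can be recovered from the summary. The sketch below does this with the printed value and then checks the identity on R's built-in `cars` data, which is unrelated to this dataset; `r_reading_writing` and `fit` are hypothetical names.

```r
# Multiple R-squared reported for Model 1.
r_squared <- 0.9113

# With a single predictor and a positive slope, the Pearson correlation
# is the positive square root of R-squared.
r_reading_writing <- sqrt(r_squared)
round(r_reading_writing, 3)

# Sanity check of the identity on R's built-in cars data set.
fit <- lm(dist ~ speed, data = cars)
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared,
                                   cor(cars$speed, cars$dist)^2))
identity_holds
```

The result, a correlation of roughly 0.95, agrees with the strong positive relationship seen in the scatterplot for Question 7.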

Model 2

Predictor Variable: Reading Score

The second model uses the training dataset. Reading score is the predictor (independent variable) and writing score is the dependent variable. This model can help to determine the relationship between reading and writing scores.

mod2<-lm(writing.score~reading.score, data = train)
mod2
## 
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
## 
## Coefficients:
##   (Intercept)  reading.score  
##       -0.6068         0.9940
summary(mod2)
## 
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0679  -2.9956  -0.0348   3.0089  14.9682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.60680    0.78475  -0.773     0.44    
## reading.score  0.99398    0.01105  89.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.401 on 748 degrees of freedom
## Multiple R-squared:  0.9153, Adjusted R-squared:  0.9152 
## F-statistic:  8084 on 1 and 748 DF,  p-value: < 2.2e-16

Model 2 was fitted on the training set, with reading score as the predictor and writing score as the dependent variable. The p-value for the reading score coefficient is less than 0.05, which suggests that there is a statistically significant relationship between the variables.
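
The fitted line can be applied by hand to see what the model predicts. The sketch below uses the coefficients printed by `summary(mod2)`; `predict_writing` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients as printed by summary(mod2).
intercept <- -0.60680
slope     <-  0.99398

# Hypothetical helper: predicted writing score for a given reading score.
predict_writing <- function(reading) intercept + slope * reading

# Rows 2 and 6 of the data have reading scores of 90 and 83.
predict_writing(c(90, 83))
```

Up to rounding of the coefficients, these values agree with the predictions reported for rows 2 and 6 in the prediction table later in the report.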

Model 3

Predictor Variable: Reading Score

The third model uses the training dataset. Reading score is the predictor (independent variable) and math score is the dependent variable. This model can help to determine the relationship between reading and math scores.

mod3<-lm(math.score~reading.score, data = train)
mod3
## 
## Call:
## lm(formula = math.score ~ reading.score, data = train)
## 
## Coefficients:
##   (Intercept)  reading.score  
##        5.9058         0.8698
summary(mod3)
## 
## Call:
## lm(formula = math.score ~ reading.score, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.1774  -6.2467   0.0066   6.1245  24.5981 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.90582    1.52372   3.876 0.000116 ***
## reading.score  0.86981    0.02147  40.522  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.546 on 748 degrees of freedom
## Multiple R-squared:  0.687,  Adjusted R-squared:  0.6866 
## F-statistic:  1642 on 1 and 748 DF,  p-value: < 2.2e-16

Model 3 was fitted on the training set, with reading score as the predictor and math score as the dependent variable. The p-values for this model are less than 0.05, which suggests that there is a statistically significant relationship between the variables.

Model Selection using AIC & BIC

AIC and BIC measure the goodness of fit of an estimated statistical model while penalising the number of parameters, and can be used for model selection. The model with the lowest AIC and BIC is preferred.

Model 2:

AIC(mod2)
## [1] 4355.303
BIC(mod2)
## [1] 4369.163

Model 3:

AIC(mod3)
## [1] 5350.627
BIC(mod3)
## [1] 5364.487

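Comparing models 2 and 3, model 2 has the lower AIC and BIC. Note that the two models predict different response variables (writing score and math score), and AIC and BIC are strictly comparable only between models fitted to the same response, so this comparison is indicative only.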
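
For a Gaussian linear model, both criteria can be computed from the residual sum of squares. The sketch below defines a hypothetical helper, `ic_from_rss`, checks it against R's own `AIC()` and `BIC()` on the built-in `cars` data, and then approximately reconstructs the Model 2 values from its printed summary. The residual sum of squares is recovered from the rounded residual standard error, so small discrepancies are expected.

```r
# AIC and BIC for a Gaussian linear model, computed from the residual sum of squares.
# k counts the estimated parameters: the regression coefficients plus the error variance.
ic_from_rss <- function(rss, n, k) {
  log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  c(AIC = -2 * log_lik + 2 * k, BIC = -2 * log_lik + log(n) * k)
}

# Check the formula against R's own AIC() and BIC() on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
manual  <- ic_from_rss(sum(resid(fit)^2), n = nrow(cars), k = 3)
builtin <- c(AIC = AIC(fit), BIC = BIC(fit))

# Approximate reconstruction for Model 2 from its printed summary:
# residual standard error 4.401 on 748 degrees of freedom, n = 750.
mod2_ic <- ic_from_rss(4.401^2 * 748, n = 750, k = 3)
round(mod2_ic, 1)
```

The only difference between the two criteria is the penalty per parameter: 2 for AIC and log(n) for BIC.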

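pred_writingscore<-predict(mod2,test)

Model Two Prediction

# Pair each actual writing score in the test set with its prediction
# (this step was not shown originally; column names follow the printed table)
actuals_prediction <- data.frame(actuals = test$writing.score, predicteds = pred_writingscore)
head(actuals_prediction)
##    actuals predicteds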
## 2       88   88.85107
## 6       78   81.89323
## 7       92   93.82095
## 8       39   42.13418
## 9       67   63.00768
## 12      43   51.07997

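For the rows shown, the difference between the actual and predicted values is mostly small: five of the six predictions are within 4 points of the actual score, although row 12 is off by about 8 points.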
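
The size of the prediction errors can be summarised with the mean absolute error (MAE) and the root mean squared error (RMSE). The sketch below uses only the six rows printed in the table above, so it illustrates the calculation and is not an estimate for the whole test set; `errors`, `mae` and `rmse` are hypothetical names.

```r
# The six rows printed in the prediction table.
actuals    <- c(88, 78, 92, 39, 67, 43)
predicteds <- c(88.85107, 81.89323, 93.82095, 42.13418, 63.00768, 51.07997)

errors <- predicteds - actuals
mae  <- mean(abs(errors))      # mean absolute error
rmse <- sqrt(mean(errors^2))   # root mean squared error

round(c(MAE = mae, RMSE = rmse, largest = max(abs(errors))), 2)
```

RMSE weights large misses more heavily than MAE, so a single poor prediction such as row 12 raises it more.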

correlation_accuracy <- cor(actuals_prediction) 
correlation_accuracy
##             actuals predicteds
## actuals    1.000000   0.952187
## predicteds 0.952187   1.000000

In conclusion, model two makes close predictions on the test set: the correlation between the actual and predicted writing scores is about 0.95, which is close to 1.