This project’s aim is to analyze and discuss Students’ Performance using various visualisation techniques in R. The dataset used in this project was taken from: https://www.kaggle.com/spscientist/students-performance-in-exams
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(stringr)
data<-read.csv("/Users/Stephie/Desktop/StudentsPerformance.csv")
head(data) # first six rows of the data set
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
dimension<-dim(data)
cat("The dimension of the dataset is:", dimension)
## The dimension of the dataset is: 1000 8
summary(data)
## gender race.ethnicity parental.level.of.education
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## lunch test.preparation.course math.score reading.score
## Length:1000 Length:1000 Min. : 0.00 Min. : 17.00
## Class :character Class :character 1st Qu.: 57.00 1st Qu.: 59.00
## Mode :character Mode :character Median : 66.00 Median : 70.00
## Mean : 66.09 Mean : 69.17
## 3rd Qu.: 77.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00
## writing.score
## Min. : 10.00
## 1st Qu.: 57.75
## Median : 69.00
## Mean : 68.05
## 3rd Qu.: 79.00
## Max. :100.00
Findings in the dataset: There are three numeric variables (math score, reading score and writing score) which show a common maximum value of 100. Students’ scores show great variation across the three subjects but reading score has the narrowest range. The five non-numeric variables are gender, race/ethnicity, parental level of education, lunch and test preparation course.
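The range comparison can be checked directly from the minimum and maximum values printed by `summary(data)`. The snippet below is a minimal sketch: the data frame `score_summary` is a hypothetical helper built by hand from those printed values, not part of the original analysis.

```r
# Minimum and maximum values copied from the summary() output above
# (score_summary is a hypothetical helper, not part of the original script)
score_summary <- data.frame(
  subject = c("math", "reading", "writing"),
  min = c(0, 17, 10),
  max = c(100, 100, 100)
)

# Range = max - min for each subject
score_summary$range <- score_summary$max - score_summary$min

# Sort so the narrowest range comes first
score_summary[order(score_summary$range), ]
```

With the full dataset loaded, the same figures could be obtained from the score columns themselves, for example with `sapply()` and `range()`.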
ggplot(data, aes(x = race.ethnicity, fill = gender)) +
geom_bar(position = "stack") + ggtitle("Distribution of Students") +
ylab("Students") + xlab("Groups")
The stacked bar plot shows that most students belong to group C, followed by groups D, B and E; group A has the fewest students.
data <- data%>% mutate(subjects.scores = math.score + reading.score + writing.score)
ggplot(data, aes(x = subjects.scores)) + geom_histogram(bins = 40, aes(y = after_stat(density)), color = '#2980B9', fill = '#00AFBB') + geom_density(alpha = 0.3)+
ggtitle("Distribution of Scores")+
ylab("Density") + xlab("Overall Performance")
The histogram shows that the distribution of students’ overall scores is negatively (left) skewed: the modal score is higher than the mean score. Students tend to score in the upper half of the range rather than the lower half.
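Skewness can also be quantified rather than read off the plot. The following is a minimal sketch of a moment-based skewness coefficient. The function `skewness()` and the vector `toy_scores` are hypothetical and defined here only for illustration; with the real data, `data$subjects.scores` would be passed instead.

```r
# Moment-based skewness: third central moment divided by the
# 1.5th power of the second central moment (population form)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / (mean(d^2))^1.5
}

# Toy left-skewed scores: one low value pulls the mean below the bulk of the data
toy_scores <- c(2, 8, 9, 9, 10, 10)

skewness(toy_scores)  # negative for a left-skewed sample
mean(toy_scores)      # lies below the most frequent values
```

A negative result agrees with a left-skewed histogram; values near zero would indicate an approximately symmetric distribution.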
participation<-table(data$test.preparation.course,data$gender)
participation
##
## female male
## completed 184 174
## none 334 308
The table shows that, in absolute terms, more females (184) than males (174) completed the test preparation course. Relative to group size, however, completion was slightly more common among males: 174 of 482 males (about 36.1%) completed the course, compared with 184 of 518 females (about 35.5%).
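The proportions within each gender can be computed with `prop.table()`. The sketch below rebuilds the contingency table by hand from the counts printed above so that it runs on its own; with the dataset loaded, the existing `participation` table could be passed directly.

```r
# Counts copied from the table above (rows: course status, columns: gender)
participation <- matrix(
  c(184, 334, 174, 308),
  nrow = 2,
  dimnames = list(course = c("completed", "none"),
                  gender = c("female", "male"))
)

# margin = 2 divides each column by its total, giving proportions within gender
col_props <- prop.table(participation, margin = 2)
round(col_props, 3)
```

Column-wise proportions are used here because the question is what share of each gender completed the course, not what share of completers belongs to each gender.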
p <- ggplot(data, aes(x=gender, y= subjects.scores)) +
geom_boxplot(color = '#000000', fill ='#00AFBB') + ggtitle("Distribution of Scores")+
ylab("Scores") + xlab("Gender") + coord_flip()
p
The box plot shows that females have higher overall scores than males. This is indicated by the median score and the lowest score for females being higher than those of males. Males also have fewer outliers than females.
p <- ggplot(data, aes(x=parental.level.of.education, y= subjects.scores)) +
geom_boxplot(color = '#000000', fill ='#00AFBB') + ggtitle("Distribution of Scores")+
ylab("Scores") + xlab("Parental Level of Education")
p
From the box plot, children of parents with a master’s degree scored the highest, whereas students whose parents had a high school education scored the lowest. This suggests that parental level of education is associated with students’ scores: students whose parents have a higher level of education tend to achieve better grades.
ggplot(data, aes(x =subjects.scores, fill = test.preparation.course)) +
geom_density(alpha = .3) +
ggtitle("Test Preparation Course Density Plot")+
xlab("Overall Score") + ylab("Density")
The density plot shows a clear difference between the scores of students who did not take the test preparation course and those who completed it. Scores of students who completed the course are generally higher, and the peak of their distribution (the modal score) lies at a higher score than that of unprepared students.
ggplot(data, aes(x = reading.score, y = writing.score)) +
geom_point(color = '#00AFBB') +
ggtitle("Writing Score vs Reading Score")+
xlab("Reading Score") + ylab("Writing Score")
The scatterplot shows clearly that there is a strong positive linear relationship between students’ reading and writing scores. Writing scores tend to increase as reading scores increase.
ggplot(data, aes(x = parental.level.of.education, fill = test.preparation.course)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90)) +
ylab("Frequency") + xlab("Parental Level of Education")+ coord_flip()
The bar plot shows that students whose parents attained an associate’s degree completed the test preparation course most often in absolute terms, while students whose parents had only a high school education had the lowest completion rate relative to the number of non-completers in the same category. This suggests that children of parents with a lower level of education are less inclined to complete the test preparation course.
ggplot(data, aes(x = race.ethnicity, fill = lunch)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))+
facet_wrap(~lunch)+
ylab("Frequency") + xlab("Groups")
The comparative bar plot can be interpreted to deduce that standard lunches are clearly more frequent across all the racial groups when compared to free/reduced lunches.
data$gender <- as.factor(data$gender)
data$race.ethnicity <- as.factor(data$race.ethnicity)
data$test.preparation.course <- as.factor(data$test.preparation.course)
data$lunch <- as.factor(data$lunch)
data$parental.level.of.education <- as.factor(data$parental.level.of.education)
set.seed(130)
sampleSize <- floor(.75*nrow(data))
# Note: replace = TRUE samples with replacement, so train can contain duplicate rows
trainIndexes <- sample(seq_len(nrow(data)), sampleSize, replace = TRUE)
train <- data[trainIndexes, ]
test <- data[-trainIndexes, ]
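Because the split above calls `sample()` with `replace = TRUE`, the training set can contain repeated rows, and the test set (every row never drawn) typically ends up larger than 25% of the data. A conventional 75/25 split samples without replacement. The sketch below illustrates this on a hypothetical stand-in data frame `toy`; the names `toy`, `train_idx`, `train_toy` and `test_toy` are assumptions introduced for illustration and are not used in the rest of the analysis, whose printed results come from the original split.

```r
set.seed(130)

# Hypothetical stand-in for the 1000-row student data frame
toy <- data.frame(id = 1:1000, score = rnorm(1000))

n <- nrow(toy)

# sample() draws without replacement by default, so every index is unique
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

nrow(train_toy)           # 750 rows, no duplicates
nrow(test_toy)            # the remaining 250 rows
anyDuplicated(train_idx)  # 0 means no index was drawn twice
```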
The first model uses the entire dataset. Reading score is the predictor (independent variable) and writing score is the dependent variable. This model helps to determine the relationship between reading and writing scores.
#Simple linear model
mod1<-lm(writing.score~reading.score, data = data)
mod1
##
## Call:
## lm(formula = writing.score ~ reading.score, data = data)
##
## Coefficients:
## (Intercept) reading.score
## -0.6676 0.9935
summary(mod1)
##
## Call:
## lm(formula = writing.score ~ reading.score, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9573 -2.9573 0.0363 3.1026 15.0557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.667554 0.693792 -0.962 0.336
## reading.score 0.993531 0.009814 101.233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared: 0.9113, Adjusted R-squared: 0.9112
## F-statistic: 1.025e+04 on 1 and 998 DF, p-value: < 2.2e-16
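Model 1 used the entire dataset, with reading score as the predictor and writing score as the dependent variable. The p-value for the reading score coefficient is below 0.05 (< 2e-16), which suggests a statistically significant relationship between the two variables. The slope of 0.9935 means that each additional reading point is associated with an increase of about 0.99 points in the predicted writing score, and the R-squared of 0.9113 indicates that reading score explains about 91% of the variance in writing score.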
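In a simple linear regression with one predictor, R-squared equals the squared Pearson correlation between the two variables, so the correlation behind the scatterplot can be recovered from the model output. The snippet below is a small arithmetic sketch using the values printed by `summary(mod1)`; the variable names are hypothetical.

```r
# Values copied from the summary(mod1) output above
r_squared <- 0.9113
slope     <- 0.9935

# The correlation takes the sign of the slope
r <- sign(slope) * sqrt(r_squared)
round(r, 4)
```

With the dataset loaded, the same value could be obtained directly with `cor(data$reading.score, data$writing.score)`.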
The second model uses the training dataset. Reading score is the predictor (independent variable) and writing score is the dependent variable. This model helps to determine the relationship between reading and writing scores.
mod2<-lm(writing.score~reading.score, data = train)
mod2
##
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
##
## Coefficients:
## (Intercept) reading.score
## -0.6068 0.9940
summary(mod2)
##
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0679 -2.9956 -0.0348 3.0089 14.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.60680 0.78475 -0.773 0.44
## reading.score 0.99398 0.01105 89.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.401 on 748 degrees of freedom
## Multiple R-squared: 0.9153, Adjusted R-squared: 0.9152
## F-statistic: 8084 on 1 and 748 DF, p-value: < 2.2e-16
Model 2 used the training set, with reading score as the predictor and writing score as the dependent variable. The p-value for the reading score coefficient is below 0.05 (< 2e-16), which suggests a statistically significant relationship between the variables. The R-squared of 0.9153 is close to that of Model 1.
The third model uses the training dataset. Reading score is the predictor (independent variable) and math score is the dependent variable. This model helps to determine the relationship between reading and math scores.
mod3<-lm(math.score~reading.score, data = train)
mod3
##
## Call:
## lm(formula = math.score ~ reading.score, data = train)
##
## Coefficients:
## (Intercept) reading.score
## 5.9058 0.8698
summary(mod3)
##
## Call:
## lm(formula = math.score ~ reading.score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.1774 -6.2467 0.0066 6.1245 24.5981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.90582 1.52372 3.876 0.000116 ***
## reading.score 0.86981 0.02147 40.522 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.546 on 748 degrees of freedom
## Multiple R-squared: 0.687, Adjusted R-squared: 0.6866
## F-statistic: 1642 on 1 and 748 DF, p-value: < 2.2e-16
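Model 3 used the training set, with reading score as the predictor and math score as the dependent variable. The p-value for the reading score coefficient is below 0.05 (< 2e-16), which suggests a statistically significant relationship between the variables. The R-squared of 0.687 is noticeably lower than for Model 2, so reading score explains math score less well than it explains writing score.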
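AIC and BIC measure the goodness of fit of an estimated statistical model while penalising its complexity, and can be used for model selection. The model with the lowest AIC and BIC is preferred. Strictly, these criteria are comparable only between models fitted to the same response variable; models 2 and 3 predict different scores, so the comparison below is indicative only.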
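To make the two criteria concrete, the sketch below recomputes them by hand from the log-likelihood, using AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k counts the estimated parameters. It uses R's built-in `cars` dataset purely for illustration, so the numbers are unrelated to the student data; `fit`, `aic_manual` and `bic_manual` are hypothetical names.

```r
# Illustration on the built-in cars dataset (not the student data)
fit <- lm(dist ~ speed, data = cars)

k  <- length(coef(fit)) + 1    # intercept, slope, and the residual variance
ll <- as.numeric(logLik(fit))  # maximised log-likelihood
n  <- nobs(fit)                # number of observations

aic_manual <- 2 * k - 2 * ll
bic_manual <- log(n) * k - 2 * ll

# Both comparisons should return TRUE
all.equal(aic_manual, AIC(fit))
all.equal(bic_manual, BIC(fit))
```

BIC penalises each parameter by log(n) instead of 2, so it favours simpler models more strongly once n exceeds about 7.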
Model 2:
AIC(mod2)
## [1] 4355.303
BIC(mod2)
## [1] 4369.163
Model 3:
AIC(mod3)
## [1] 5350.627
BIC(mod3)
## [1] 5364.487
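pred_writingscore <- predict(mod2, test)
# Pair each actual writing score in the test set with its prediction
actuals_prediction <- data.frame(actuals = test$writing.score, predicteds = pred_writingscore)
head(actuals_prediction)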
## actuals predicteds
## 2 88 88.85107
## 6 78 81.89323
## 7 92 93.82095
## 8 39 42.13418
## 9 67 63.00768
## 12 43 51.07997
For the six rows shown, the predicted writing scores differ from the actual scores by roughly 1 to 8 points.
correlation_accuracy <- cor(actuals_prediction)
correlation_accuracy
## actuals predicteds
## actuals 1.000000 0.952187
## predicteds 0.952187 1.000000
In conclusion, model 2 predicts writing scores closely on the test rows: the correlation between actual and predicted values is 0.952, which is close to 1.
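Correlation measures how closely predictions track the actual values, but not how large the errors are. Error metrics such as the mean absolute error (MAE) and root mean squared error (RMSE) complement it. The sketch below computes both for the six actual/predicted pairs printed earlier. It is an illustration only: the vectors are copied by hand from that output, the names `actuals`, `predicteds`, `mae` and `rmse` are hypothetical, and a full evaluation would use every row of the test set.

```r
# Six actual/predicted pairs copied from the head() output above
actuals    <- c(88, 78, 92, 39, 67, 43)
predicteds <- c(88.85107, 81.89323, 93.82095, 42.13418, 63.00768, 51.07997)

errors <- actuals - predicteds

mae  <- mean(abs(errors))    # average size of the error, in score points
rmse <- sqrt(mean(errors^2)) # penalises large errors more heavily

c(MAE = mae, RMSE = rmse)
```

Both metrics are expressed in score points, which makes them easier to interpret than a correlation coefficient when judging how far off a single prediction is likely to be.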