Assignment 1

Students Performance in Exams

Introduction

The “Students Performance in Exams” data set was retrieved from Kaggle. It records the marks obtained by students in two subjects, Mathematics and Language. Because the data set description does not specify which language is assessed, it is referred to simply as the “Language subject” in this assignment. The Language subject is evaluated in two parts, reading and writing, so the data set records three scores: Mathematics, Reading and Writing. The data set contains 1000 observations with the 8 attributes listed below.

Gender
Race/Ethnicity
Parental Level of Education
Lunch
Test Preparation Course
Math Score
Reading Score
Writing Score

Packages Info

There are a few packages that need to be installed in advance for this assignment.

dplyr provides a grammar of data manipulation: a set of easy-to-use verbs for common data manipulation tasks. The functions mutate(), group_by() and summarise() are used in this assignment. mutate() adds new variables to an existing data set, while group_by() and summarise() compute summaries of the Mathematics and Language scores grouped by the categorical variables (Gender, Race/Ethnicity, Parental Level of Education and Test Preparation Course).
ggplot2 is used to create graphical visualizations of the data. For instance, geom_bar() is used to create the bar charts in this assignment.
sqldf is a package for running SQL statements on R data frames, for example to extract, update, replace or create data. In this assignment, it is used to create a new data frame.
corrplot provides graphical displays of correlation matrices and confidence intervals. In this assignment, the correlation between the numerical variables (Mathematics and Language scores) is calculated with cor() and visualized with this package.
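
Before loading the packages, it can help to check which of them are already installed. The following is a minimal sketch using base R only; the variable names required_pkgs, is_installed and missing_pkgs are illustrative and not part of the original assignment.

```r
# Packages used in this assignment.
required_pkgs <- c("dplyr", "ggplot2", "sqldf", "corrplot")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to install the missing packages from CRAN.
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```

The install.packages() call is left commented out so that running the check never changes the local library by accident.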

# Load these packages before we start to do data preparation and data analysis.
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(sqldf)

## Warning: package 'sqldf' was built under R version 4.0.5

## Loading required package: gsubfn

## Warning: package 'gsubfn' was built under R version 4.0.5

## Loading required package: proto

## Warning: package 'proto' was built under R version 4.0.5

## Loading required package: RSQLite

## Warning: package 'RSQLite' was built under R version 4.0.5

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.0.5

## corrplot 0.84 loaded

Data Preparation

The data set is imported into R as a data frame named Students_Performance.

Students_Performance <- read.csv("StudentsPerformance.csv")

The first six observations of the data set are displayed below.

head(Students_Performance)

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

The last six observations of the data set are displayed below.

tail(Students_Performance)

##      gender race.ethnicity parental.level.of.education        lunch
## 995    male        group A                 high school     standard
## 996  female        group E             master's degree     standard
## 997    male        group C                 high school free/reduced
## 998  female        group C                 high school free/reduced
## 999  female        group D                some college     standard
## 1000 female        group D                some college free/reduced
##      test.preparation.course math.score reading.score writing.score
## 995                     none         63            63            62
## 996                completed         88            99            95
## 997                     none         62            55            55
## 998                completed         59            71            65
## 999                completed         68            78            77
## 1000                    none         77            86            86

The structure of the data set is shown below.

str(Students_Performance)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

As we can see, gender, race/ethnicity, parental level of education, lunch and test preparation course are stored as character vectors, so we convert them into factors.

Students_Performance$gender <- as.factor(Students_Performance$gender)
Students_Performance$race.ethnicity <- as.factor(Students_Performance$race.ethnicity)
Students_Performance$parental.level.of.education <- as.factor(Students_Performance$parental.level.of.education)
Students_Performance$lunch <- as.factor(Students_Performance$lunch)
Students_Performance$test.preparation.course <- as.factor(Students_Performance$test.preparation.course)
str(Students_Performance)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...
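
The five conversions above could also be written as a single step that converts every character column at once. The sketch below uses base R on a small toy data frame; the data frame toy and its values are illustrative assumptions, not the real data set.

```r
# Toy data frame (illustrative values) with two character columns
# and one integer column.
toy <- data.frame(
  gender = c("female", "male", "female"),
  lunch = c("standard", "free/reduced", "standard"),
  math.score = c(72L, 47L, 90L),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Numeric columns are left untouched because only the columns flagged by is.character() are reassigned.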

The lunch attribute is not used in this analysis, so it is removed from the data set.

Students_Performance <- subset(Students_Performance, select=-c(lunch))
str(Students_Performance)

## 'data.frame':    1000 obs. of  7 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

Next, we add a column for the overall Language score, calculated as the mean of the reading and writing scores.

Students_Performance <- mutate(Students_Performance, language.score = (reading.score + writing.score)/2)
head(Students_Performance)

##   gender race.ethnicity parental.level.of.education test.preparation.course
## 1 female        group B           bachelor's degree                    none
## 2 female        group C                some college               completed
## 3 female        group B             master's degree                    none
## 4   male        group A          associate's degree                    none
## 5   male        group C                some college                    none
## 6 female        group B          associate's degree                    none
##   math.score reading.score writing.score language.score
## 1         72            72            74           73.0
## 2         69            90            88           89.0
## 3         90            95            93           94.0
## 4         47            57            44           50.5
## 5         76            78            75           76.5
## 6         71            83            78           80.5
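
As a quick cross-check of the new column, the same average can be computed with base R's rowMeans(). The sketch below re-enters the six reading and writing scores from the head() output by hand; the data frame name scores is an illustrative assumption.

```r
# Reading and writing scores of the first six students,
# copied from the head() output.
scores <- data.frame(
  reading.score = c(72, 90, 95, 57, 78, 83),
  writing.score = c(74, 88, 93, 44, 75, 78)
)

# rowMeans() averages the selected columns row by row;
# unname() drops the row names from the result.
scores$language.score <- unname(
  rowMeans(scores[, c("reading.score", "writing.score")])
)

scores$language.score
```

The result should agree with the language.score column printed by head().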

The data set is now ready for analysis.

Data Analysis

Let’s look at the summary of this data set.

summary(Students_Performance)

##     gender    race.ethnicity     parental.level.of.education
##  female:518   group A: 89    associate's degree:222         
##  male  :482   group B:190    bachelor's degree :118         
##               group C:319    high school       :196         
##               group D:262    master's degree   : 59         
##               group E:140    some college      :226         
##                              some high school  :179         
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00  
##  language.score  
##  Min.   : 13.50  
##  1st Qu.: 58.50  
##  Median : 69.50  
##  Mean   : 68.61  
##  3rd Qu.: 79.00  
##  Max.   :100.00

Based on the summary shown above, we can conclude the following:

The total observations are comprised of 518 females and 482 males.
In terms of race/ethnicity, group C is the largest group (319 students) and group A is the smallest (89 students).
The number of students whose parents hold an associate’s degree (222) is close to the number whose parents attended college without completing a degree (226, denoted by “some college”), while only 59 students have parents with a master’s degree.
358 students have completed the test preparation course, while 642 students did not complete it.
The mean scores for the subjects of Mathematics and Language are 66.09 and 68.61 respectively.
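
The counts in the summary can be turned into percentages, which makes it easier to judge how large each group is relative to the whole sample. The sketch below re-enters the race/ethnicity counts from the summary() output by hand; the variable names race_counts and race_share are illustrative.

```r
# Counts copied from the summary() output.
race_counts <- c("group A" = 89, "group B" = 190, "group C" = 319,
                 "group D" = 262, "group E" = 140)

# Share of each group in percent, rounded to one decimal place.
race_share <- round(100 * prop.table(race_counts), 1)
race_share

# Largest and smallest groups by count.
names(which.max(race_counts))
names(which.min(race_counts))
```

On the full data frame, the same shares could be obtained with prop.table(table(Students_Performance$race.ethnicity)).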

Next, the analysis will be conducted based on different categorical groups. Besides, the correlation between numerical variables, i.e., Mathematics and Language scores will be calculated as well.

Gender

First of all, let’s look at the data analysis of the subject scores factored by Gender.

Students_Performance %>%
    group_by(gender) %>%
    summarise(LanguageScore = mean(language.score)) %>%
    ggplot(mapping = aes(x=gender, y=LanguageScore, fill=gender)) + 
    geom_bar(stat="identity") + 
    labs(x="Gender", y="Language Scores", title="Language Scores by Gender") + 
    labs(fill="Gender") + 
    theme_bw()

The graph above shows that females outperformed males in the Language subject.

Students_Performance %>%
    group_by(gender) %>%
    summarise(math_Mean = mean(math.score)) %>%
    ggplot(mapping = aes(x=gender, y=math_Mean, fill=gender)) + 
    geom_bar(stat="identity") + 
    labs(x="Gender", y="Mathematics Scores", title="Mathematics Scores by Gender") + 
    labs(fill="Gender") + 
    theme_bw()

However, males scored higher than females in Mathematics on average.

Therefore, on average, females achieved higher scores in Language, while males achieved higher scores in Mathematics.
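
The two bar charts each summarise one subject. As a complement, both means and the group sizes can be placed in a single table with one summarise() call. The sketch below runs on the six rows shown in the head() output, re-entered by hand; these rows are not representative of all 1000 students, and the names toy and by_gender are illustrative.

```r
library(dplyr)

# Six rows copied from the head() output; far too few to draw conclusions from.
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  math.score = c(72, 69, 90, 47, 76, 71),
  language.score = c(73, 89, 94, 50.5, 76.5, 80.5)
)

# One row per gender with the group size and both mean scores.
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(
    n = n(),
    math_mean = mean(math.score),
    language_mean = mean(language.score)
  )

by_gender
```

Replacing toy with Students_Performance would give the values behind both charts in one table.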

Race/Ethnicity

Next, let’s look at the data analysis of the subject scores factored by Race/Ethnicity.

Students_Performance %>%
    group_by(race.ethnicity) %>%
    summarise(LanguageScore = mean(language.score)) %>%
    ggplot(mapping = aes(x=race.ethnicity, y=LanguageScore, fill=race.ethnicity)) + 
    geom_bar(stat="identity") + 
    labs(x="Race/Ethnicity", y="Language Scores", title="Language Scores by Race/Ethnicity") + 
    labs(fill="Race/Ethnicity") + 
    theme_bw()

Although the language scores differ little between groups, group E performed slightly better overall.

Students_Performance %>%
    group_by(race.ethnicity) %>%
    summarise(mathMean = mean(math.score)) %>%
    ggplot(mapping = aes(x=race.ethnicity, y=mathMean, fill=race.ethnicity)) + 
    geom_bar(stat="identity") + 
    labs(x="Race/Ethnicity", y="Mathematics Scores", title="Mathematics Scores by Race/Ethnicity") + 
    labs(fill="Race/Ethnicity") + 
    theme_bw()

Group E performed well not only in Language but also in Mathematics.

To conclude, group E achieved the highest mean scores in both Mathematics and Language, while group A performed poorly in both subjects.

Parental Level of Education

Moreover, let’s examine the effects of Parental Level of Education on scores.

Students_Performance %>%
    group_by(parental.level.of.education) %>%
    summarise(LanguageScore = mean(language.score)) %>%
    ggplot(mapping = aes(x=parental.level.of.education, y=LanguageScore, fill=parental.level.of.education)) + 
    geom_bar(stat="identity") + 
    labs(x="Parental Level of Education", y="Language Scores", title="Language Scores by Parent's Education Background") + 
    labs(fill="Parental Level of Education") + 
    theme_bw() + 
    theme(axis.text.x=element_text(angle=65, vjust=0.6))

Students whose parents earned a Master’s degree scored highest in Language, followed by those whose parents have a Bachelor’s degree and then an Associate’s degree.

Students_Performance %>%
    group_by(parental.level.of.education) %>%
    summarise(mathMean = mean(math.score)) %>%
    ggplot(mapping = aes(x=parental.level.of.education, y=mathMean, fill=parental.level.of.education)) + 
    geom_bar(stat="identity") + 
    labs(x="Parental Level of Education", y="Mathematics Scores", title="Mathematics Scores by Parent's Education Background") + 
    labs(fill="Parental Level of Education") + 
    theme_bw() + 
    theme(axis.text.x=element_text(angle=65, vjust=0.6))

The ranking of the groups is the same as for the Language scores. Overall, however, the mean Mathematics scores are slightly lower than the mean Language scores, in line with the overall means of 66.09 and 68.61 reported in the summary.

To summarize, students whose parents have a Master’s degree scored best in both Language and Mathematics, whereas students whose parents finished only high school performed the worst in both subjects.
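
Because as.factor() sorts levels alphabetically (the str() output lists "associate's degree" first), the bars in these two charts appear in alphabetical rather than educational order. The sketch below shows one way to set an explicit level order. The ordering in edu_levels is an assumption, in particular the relative placement of "some college" and "associate's degree", and the sample vector edu is illustrative.

```r
# Assumed ordering from lowest to highest attainment; the placement of
# "some college" relative to "associate's degree" is a judgment call.
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# Illustrative sample containing each of the six categories once.
edu <- c("bachelor's degree", "some college", "master's degree",
         "associate's degree", "high school", "some high school")

# factor() with explicit levels keeps the given order instead of sorting.
edu_ordered <- factor(edu, levels = edu_levels)
levels(edu_ordered)
```

Applied to the real data, the same call on Students_Performance$parental.level.of.education would make ggplot2 draw the bars in this order.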

Test Preparation Course

Next, let’s compare the scores of students who completed the Test Preparation Course with those of students who did not.

Students_Performance %>%
    group_by(test.preparation.course) %>%
    summarise(LanguageScore = mean(language.score)) %>%
    ggplot(mapping = aes(x=test.preparation.course, y=LanguageScore, fill=test.preparation.course)) + 
    geom_bar(stat="identity") + 
    labs(x="Test Preparation Course", y="Language Scores", title="Language Scores by Test Preparation Course") + 
    labs(fill="Test Preparation Course") + 
    theme_bw()

Students who completed the test preparation course scored higher in Language, on average, than those who did not complete it.

Students_Performance %>%
    group_by(test.preparation.course) %>%
    summarise(mathMean = mean(math.score)) %>%
    ggplot(mapping = aes(x=test.preparation.course, y=mathMean, fill=test.preparation.course)) + 
    geom_bar(stat="identity") + 
    labs(x="Test Preparation Course", y="Mathematics Scores", title="Mathematics Scores by Test Preparation Course") + 
    labs(fill="Test Preparation Course") + 
    theme_bw()

The same holds for Mathematics.

In general, students who completed the Test Preparation Course performed better in both subjects than those who did not.
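
The bar charts in the preceding sections all follow one pattern: group by a categorical variable, take the mean of a score, and plot the result. A minimal sketch of a reusable helper is given below. The function plot_mean_score() and its argument names are hypothetical and not part of the original assignment; geom_col() is used as the equivalent of geom_bar(stat="identity"), and the toy data frame re-enters six rows from the head() output.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of the mean of `score_col`
# for each level of `group_col` (both given as column-name strings).
plot_mean_score <- function(data, group_col, score_col, group_label, score_label) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(mean_score = mean(.data[[score_col]]), .groups = "drop") %>%
    ggplot(aes(x = .data[[group_col]], y = mean_score, fill = .data[[group_col]])) +
    geom_col() +
    labs(x = group_label, y = score_label, fill = group_label,
         title = paste(score_label, "by", group_label)) +
    theme_bw()
}

# Six rows copied from the head() output; not representative of the full data.
toy <- data.frame(
  test.preparation.course = c("none", "completed", "none", "none", "none", "none"),
  math.score = c(72, 69, 90, 47, 76, 71)
)

p <- plot_mean_score(toy, "test.preparation.course", "math.score",
                     "Test Preparation Course", "Mathematics Scores")
```

Calling print(p) draws the chart. Passing column names as strings keeps the helper easy to call in a loop over several grouping variables.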

Correlation between Scores

Lastly, let’s determine the correlation between the scores of Mathematics and Language.

# Create a new data frame that only consists of Mathematics and Language scores.
Students_Performance_Scores <- sqldf('SELECT "math.score", "language.score" FROM Students_Performance') 
corrplot(cor(Students_Performance_Scores), method="square", type="upper", addCoef.col = "white", mar=c(1,1,1,1))

The correlation coefficient of 0.82 indicates that Mathematics and Language scores are strongly positively correlated: students who score well in Mathematics tend to score well in Language, and vice versa.
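
The coefficient itself comes from cor(), which computes the Pearson correlation by default; corrplot() only draws the resulting matrix. The sketch below shows the calculation in base R on the six rows from the head() output, re-entered by hand. With so few rows the value will differ from the 0.82 obtained on all 1000 students, and the names scores and cor_matrix are illustrative.

```r
# Six rows copied from the head() output; not representative of the full data.
scores <- data.frame(
  math.score = c(72, 69, 90, 47, 76, 71),
  language.score = c(73, 89, 94, 50.5, 76.5, 80.5)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(scores)
cor_matrix

# The single coefficient between the two subjects.
cor(scores$math.score, scores$language.score)
```

On the full data frame, the two columns could also be selected without SQL, for example with Students_Performance[, c("math.score", "language.score")].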