The “Students Performance in Exams” data set is retrieved from Kaggle. It consists of the marks obtained by students in two different subjects, i.e., Mathematics and Language. Since the language of the Language subject is not specified in the data set description, we will simply refer to it as the “Language subject” in this assignment. The evaluation for the Language subject is divided into two parts: reading and writing. Therefore, the scores for Mathematics, Reading and Writing are recorded in this data set. The data set contains a total of 1000 observations and 8 attributes: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score and writing score.
There are a few packages that need to be installed in advance for this assignment.
dplyr provides a grammar of data manipulation, i.e., a set of easy-to-use verbs for common data manipulation tasks. The functions mutate(), group_by() and summarise() will be used in this assignment. mutate() adds new variables to an existing data set, while group_by() and summarise() compute summaries of certain variables (Mathematics and Language scores) grouped by the categorical variables (Gender, Race/Ethnicity, Parental Level of Education and Test Preparation Course).
ggplot2 is used to create graphical visualizations of the data. For instance, geom_bar() is used to create the bar charts in this assignment.
sqldf is a package for running SQL statements on R data frames, e.g., to extract, update, replace or create data. In this assignment, it is used to create a new data frame.
corrplot provides a graphical display of a correlation matrix and confidence intervals. In this assignment, the correlation between the numerical variables (Mathematics and Language scores) is computed with cor() and visualized with this package.
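As a minimal sketch of the installation step mentioned above, the snippet below checks which of the four packages are not yet installed. The helper name missing_packages() is hypothetical and not part of the original assignment; the actual install call is left commented out because it needs a configured CRAN mirror.

```r
# Hypothetical helper (an assumption, not part of the original report):
# returns the names in `required` that do not appear in `installed`.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# The four packages used in this assignment.
required <- c("dplyr", "ggplot2", "sqldf", "corrplot")

# Uncomment to install whatever is missing:
# to_install <- missing_packages(required)
# if (length(to_install) > 0) install.packages(to_install)
```

Passing the installed package names as an argument keeps the check itself free of side effects, so it can be tried without touching the library.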
# Load these packages before we start to do data preparation and data analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.0.5
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.0.5
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.0.5
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.0.5
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.5
## corrplot 0.84 loaded
The data set is imported into RStudio under the name Students_Performance.
Students_Performance <- read.csv("StudentsPerformance.csv")
The first six observations of the data set are displayed below.
head(Students_Performance)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
The last six observations of the data set are displayed below.
tail(Students_Performance)
## gender race.ethnicity parental.level.of.education lunch
## 995 male group A high school standard
## 996 female group E master's degree standard
## 997 male group C high school free/reduced
## 998 female group C high school free/reduced
## 999 female group D some college standard
## 1000 female group D some college free/reduced
## test.preparation.course math.score reading.score writing.score
## 995 none 63 63 62
## 996 completed 88 99 95
## 997 none 62 55 55
## 998 completed 59 71 65
## 999 completed 68 78 77
## 1000 none 77 86 86
The structure of the data set is shown below.
str(Students_Performance)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
As we can see, gender, race/ethnicity, parental level of education, lunch and test preparation course are stored as character vectors, so we convert them into factors.
Students_Performance$gender <- as.factor(Students_Performance$gender)
Students_Performance$race.ethnicity <- as.factor(Students_Performance$race.ethnicity)
Students_Performance$parental.level.of.education <- as.factor(Students_Performance$parental.level.of.education)
Students_Performance$lunch <- as.factor(Students_Performance$lunch)
Students_Performance$test.preparation.course <- as.factor(Students_Performance$test.preparation.course)
str(Students_Performance)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
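The five conversions above can also be written as a single step. The following is only a sketch on a toy data frame (a three-row stand-in built from the first rows shown earlier, not the real file); it converts every character column to a factor and leaves the integer scores untouched.

```r
# Toy stand-in for Students_Performance (an assumption for illustration only).
toy <- data.frame(
  gender     = c("female", "female", "female"),
  lunch      = c("standard", "standard", "standard"),
  math.score = c(72L, 69L, 90L),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert all of them at once.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

The same pattern should apply to the full data set by replacing toy with Students_Performance, assuming all character columns are meant to be categorical.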
The lunch attribute is not used in this analysis, hence it can be removed from the data set.
Students_Performance <- subset(Students_Performance, select=-c(lunch))
str(Students_Performance)
## 'data.frame': 1000 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
Besides, we can add one more column for the overall Language score, computed as the mean of the reading and writing scores.
Students_Performance <- mutate(Students_Performance, language.score = (reading.score + writing.score)/2)
head(Students_Performance)
## gender race.ethnicity parental.level.of.education test.preparation.course
## 1 female group B bachelor's degree none
## 2 female group C some college completed
## 3 female group B master's degree none
## 4 male group A associate's degree none
## 5 male group C some college none
## 6 female group B associate's degree none
## math.score reading.score writing.score language.score
## 1 72 72 74 73.0
## 2 69 90 88 89.0
## 3 90 95 93 94.0
## 4 47 57 44 50.5
## 5 76 78 75 76.5
## 6 71 83 78 80.5
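As a quick check of this calculation, the sketch below recomputes the Language score for the first six observations in base R. The reading and writing values are copied from the head() output; rowMeans() is used here only as an alternative to the mutate() call, not as the method of this assignment.

```r
# Reading and writing scores of the first six observations.
reading <- c(72, 90, 95, 57, 78, 83)
writing <- c(74, 88, 93, 44, 75, 78)

# Row-wise mean of the two parts gives the Language score.
language <- rowMeans(cbind(reading, writing))
language
```

The result is 73.0, 89.0, 94.0, 50.5, 76.5 and 80.5, which matches the language.score column shown above.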
The data set is now ready for analysis.
Let’s look at the summary of this data set.
summary(Students_Performance)
## gender race.ethnicity parental.level.of.education
## female:518 group A: 89 associate's degree:222
## male :482 group B:190 bachelor's degree :118
## group C:319 high school :196
## group D:262 master's degree : 59
## group E:140 some college :226
## some high school :179
## test.preparation.course math.score reading.score writing.score
## completed:358 Min. : 0.00 Min. : 17.00 Min. : 10.00
## none :642 1st Qu.: 57.00 1st Qu.: 59.00 1st Qu.: 57.75
## Median : 66.00 Median : 70.00 Median : 69.00
## Mean : 66.09 Mean : 69.17 Mean : 68.05
## 3rd Qu.: 77.00 3rd Qu.: 79.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00 Max. :100.00
## language.score
## Min. : 13.50
## 1st Qu.: 58.50
## Median : 69.50
## Mean : 68.61
## 3rd Qu.: 79.00
## Max. :100.00
Based on the summary shown above, we can conclude the following:
Next, the analysis will be conducted based on different categorical groups. Besides, the correlation between numerical variables, i.e., Mathematics and Language scores will be calculated as well.
First of all, let’s look at the subject scores grouped by Gender.
Students_Performance %>%
group_by(gender) %>%
summarise(LanguageScore = mean(language.score)) %>%
ggplot(mapping = aes(x=gender, y=LanguageScore, fill=gender)) +
geom_bar(stat="identity") +
labs(x="Gender", y="Language Scores", title="Language Scores by Gender") +
labs(fill="Gender") +
theme_bw()
The graph above shows that females outperformed males in the Language subject.
Students_Performance %>%
group_by(gender) %>%
summarise(math_Mean = mean(math.score)) %>%
ggplot(mapping = aes(x=gender, y=math_Mean, fill=gender)) +
geom_bar(stat="identity") +
labs(x="Gender", y="Mathematics Scores", title="Mathematics Scores by Gender") +
labs(fill="Gender") +
theme_bw()
However, males scored higher on average in Mathematics than females.
Therefore, we can conclude that, on average, females achieved higher scores in Language, while males achieved higher scores in Mathematics.
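The two group means can also be placed side by side in one table instead of two charts. The sketch below uses aggregate() from base R on a six-row toy sample (the first six observations shown earlier); with so few rows the numbers are not representative of the full data set and serve only to show the shape of the output.

```r
# Six-row toy sample (an assumption for illustration, not the full data).
toy <- data.frame(
  gender         = c("female", "female", "female", "male", "male", "female"),
  math.score     = c(72, 69, 90, 47, 76, 71),
  language.score = c(73, 89, 94, 50.5, 76.5, 80.5)
)

# Mean of both scores for each gender, returned as one data frame.
by_gender <- aggregate(cbind(math.score, language.score) ~ gender,
                       data = toy, FUN = mean)
by_gender
```

Replacing toy with Students_Performance should give the same means that the two bar charts display.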
Next, let’s look at the subject scores grouped by Race/Ethnicity.
Students_Performance %>%
group_by(race.ethnicity) %>%
summarise(LanguageScore = mean(language.score)) %>%
ggplot(mapping = aes(x=race.ethnicity, y=LanguageScore, fill=race.ethnicity)) +
geom_bar(stat="identity") +
labs(x="Race/Ethnicity", y="Language Scores", title="Language Scores by Race/Ethnicity") +
labs(fill="Race/Ethnicity") +
theme_bw()
Although there is not much difference in the mean language scores between groups, group E performed slightly better overall.
Students_Performance %>%
group_by(race.ethnicity) %>%
summarise(mathMean = mean(math.score)) %>%
ggplot(mapping = aes(x=race.ethnicity, y=mathMean, fill=race.ethnicity)) +
geom_bar(stat="identity") +
labs(x="Race/Ethnicity", y="Mathematics Scores", title="Mathematics Scores by Race/Ethnicity") +
labs(fill="Race/Ethnicity") +
theme_bw()
Group E performed well not only in Language, but in Mathematics too.
To conclude, group E achieved the highest mean scores in both Mathematics and Language, while group A had the lowest mean scores in both subjects.
Moreover, let’s examine the effects of Parental Level of Education on scores.
Students_Performance %>%
group_by(parental.level.of.education) %>%
summarise(LanguageScore = mean(language.score)) %>%
ggplot(mapping = aes(x=parental.level.of.education, y=LanguageScore, fill=parental.level.of.education)) +
geom_bar(stat="identity") +
labs(x="Parental Level of Education", y="Language Scores", title="Language Scores by Parent's Education Background") +
labs(fill="Parental Level of Education") +
theme_bw() +
theme(axis.text.x=element_text(angle=65, vjust=0.6))
Students whose parents earned a Master’s degree scored higher in Language, followed by those whose parents have a Bachelor’s degree and an Associate’s degree.
Students_Performance %>%
group_by(parental.level.of.education) %>%
summarise(mathMean = mean(math.score)) %>%
ggplot(mapping = aes(x=parental.level.of.education, y=mathMean, fill=parental.level.of.education)) +
geom_bar(stat="identity") +
labs(x="Parental Level of Education", y="Mathematics Scores", title="Mathematics Scores by Parent's Education Background") +
labs(fill="Parental Level of Education") +
theme_bw() +
theme(axis.text.x=element_text(angle=65, vjust=0.6))
The ranking of the groups is the same as for the Language scores. However, the mean scores for Mathematics tend to be slightly lower than the mean scores for Language, in line with the overall means of 66.09 and 68.61 reported in the summary.
To summarize, students whose parents have a Master’s degree scored best on average in both Language and Mathematics, whereas students whose parents have only graduated from high school performed the worst in both subjects.
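Rankings like the one described here can be read directly from sorted group means rather than from the bar heights. The sketch below does this with tapply() and sort() on a six-row toy sample (the first six observations shown earlier); the resulting order reflects only those six rows, not the full data set.

```r
# Six-row toy sample (an assumption for illustration, not the full data).
toy <- data.frame(
  parental.level.of.education = c("bachelor's degree", "some college",
                                  "master's degree", "associate's degree",
                                  "some college", "associate's degree"),
  math.score = c(72, 69, 90, 47, 76, 71)
)

# Mean Mathematics score per group, sorted from highest to lowest.
group_means <- tapply(toy$math.score, toy$parental.level.of.education, mean)
ranked <- sort(group_means, decreasing = TRUE)
ranked
```

Running the same two lines on Students_Performance, once for math.score and once for language.score, makes it easy to compare the two orderings.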
Next, let’s compare the mean scores of those who have completed the Test Preparation Course with those who have not.
Students_Performance %>%
group_by(test.preparation.course) %>%
summarise(LanguageScore = mean(language.score)) %>%
ggplot(mapping = aes(x=test.preparation.course, y=LanguageScore, fill=test.preparation.course)) +
geom_bar(stat="identity") +
labs(x="Test Preparation Course", y="Language Scores", title="Language Scores by Test Preparation Comparison") +
labs(fill="Test Preparation Course") +
theme_bw()
Apparently, students who have completed the test preparation course show higher marks in Language than those who did not complete it.
Students_Performance %>%
group_by(test.preparation.course) %>%
summarise(mathMean = mean(math.score)) %>%
ggplot(mapping = aes(x=test.preparation.course, y=mathMean, fill=test.preparation.course)) +
geom_bar(stat="identity") +
labs(x="Test Preparation Course", y="Mathematics Scores", title="Mathematics Scores by Test Preparation Comparison") +
labs(fill="Test Preparation Course") +
theme_bw()
The same holds for Mathematics.
In general, students who completed the Test Preparation Course performed better in both subjects than those who did not.
Lastly, let’s determine the correlation between the scores of Mathematics and Language.
# Create a new data frame that only consists of Mathematics and Language scores.
Students_Performance_Scores <- sqldf('SELECT "math.score", "language.score" FROM Students_Performance')
corrplot(cor(Students_Performance_Scores), method="square", type="upper", addCoef.col = "white", mar=c(1,1,1,1))
The correlation coefficient of 0.82 indicates that Mathematics and Language scores are strongly positively correlated, which means that students who score well in Mathematics tend to perform well in Language as well, and vice versa.
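The correlation matrix passed to corrplot() can also be obtained without sqldf by selecting the two columns directly. This is only a sketch of that alternative on a six-row toy sample (the first six observations shown earlier), so its coefficient will differ from the 0.82 of the full data set; cor() computes the Pearson correlation by default.

```r
# Six-row toy sample (an assumption for illustration, not the full data).
toy <- data.frame(
  math.score     = c(72, 69, 90, 47, 76, 71),
  language.score = c(73, 89, 94, 50.5, 76.5, 80.5)
)

# Select the two numeric columns by name and compute the correlation matrix.
scores <- toy[, c("math.score", "language.score")]
r_matrix <- cor(scores)
r_matrix
```

The resulting 2 x 2 matrix has the same structure as the one plotted above, with ones on the diagonal and the coefficient of interest in the off-diagonal cells.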