Author: "Zeenat Wilson"
Date: "5/25/2020"
Disclaimer:
This project is in fulfillment of the 2019/2020 requirement for the 4th year Database Systems Design and Information Management Systems II (CSE4202) course offered by the University of Guyana, South America.The result therein should not be used as reference at this point in time; as the outputs may be subjected to errors and may not reflect the true outputs from following stricts guidelines and assumptions needed to obtain those.
This work be will continuous updated over a period of time to reflect such results
,The following analysis is based on Exploratory Data Analysis (EDA) and ultimately Predictive Model Building in R. It anaylzes students performance in Reading and Math via the provided dataset and provides the answers to ten(10) pertinent questions throughout the process.
Dataset: Based on marks secured by high school students located in the United States. Source: https://www.kaggle.com/spscientist/students-performance-in-exams
Load required libraries:
This is the first step in R.
library(Amelia)
library(corrplot)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(kableExtra)
library(knitr)
library(tidyverse)
Load the dataset:
The data required for the analysis must be imported into R.
Original_Data<-read.csv("C:\\Users\\Zeenet\\Desktop\\Marks_Lab\\StudentsPerformance.csv")
str(Original_Data) # Structural make up of the dataset
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
This is a breakdown of the dataset. Its dimension can be seen. It has 8 variables and 1000 observations. Three numerical and five categorical variables are in the set.
Data_Head <- head(Original_Data) # First six rows of data
Data_Head %>%
knitr::kable(caption = "First six(6) rows of the data")%>% kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| gender | race.ethnicity | parental.level.of.education | lunch | test.preparation.course | math.score | reading.score | writing.score |
|---|---|---|---|---|---|---|---|
| female | group B | bachelor’s degree | standard | none | 72 | 72 | 74 |
| female | group C | some college | standard | completed | 69 | 90 | 88 |
| female | group B | master’s degree | standard | none | 90 | 95 | 93 |
| male | group A | associate’s degree | free/reduced | none | 47 | 57 | 44 |
| male | group C | some college | standard | none | 76 | 78 | 75 |
| female | group B | associate’s degree | standard | none | 71 | 83 | 78 |
Some of the data present in the set.
For the purpose of this project, only the variables defined or zoomed into here will be use to conduct the proceeding analysis.
Adjusted_Data <- Original_Data[,-c(2,4,8)] # Drops the specifies columns from the dataset
names(Adjusted_Data) # Names of the working variables
## [1] "gender" "parental.level.of.education"
## [3] "test.preparation.course" "math.score"
## [5] "reading.score"
These are the selected variables. race.ethnicity and lunch (two categorical variables) have been removed.
This step is necessary to ensure that there are no missing observations present for the defined variables.
missmap(Adjusted_Data, main="Missing Map") # Highlights missing observations
There are no missing values present.
This step is not mandatory but can be useful for modifying the names of the variables into more comfortable ones to work with.
names(Adjusted_Data)[1] <- "Gender"
names(Adjusted_Data)[2] <- "ParentsEducationLevel"
names(Adjusted_Data)[3] <- "TestPrepCourse"
names(Adjusted_Data)[4] <- "MathScore"
names(Adjusted_Data)[5] <- "ReadingScore"
names(Adjusted_Data)
## [1] "Gender" "ParentsEducationLevel" "TestPrepCourse"
## [4] "MathScore" "ReadingScore"
Here, the variables names are updated
What are the characteristics of each feature present in the dataset?
The answer to this question will provide insights as to the statistical nature of the working data through breaking down the attributes of each variable.
But first, a visual inspection of the variables to see what information about their shape and distribution can be gathered.
# Barchart plot of the categorical variables
b1 <- ggplot(Adjusted_Data, aes(x=Gender)) + geom_bar(fill="darkgreen", alpha=0.6) + ggtitle("Frequency of Male and Female") + theme_bw() + theme(plot.title = element_text(size = rel(0.9)))
b2 <- ggplot(Adjusted_Data, aes(x=ParentsEducationLevel)) + geom_bar(fill="black", alpha=0.6) + ggtitle("Frequency of The Parents Educational Levels") + theme_bw() + theme(plot.title = element_text(size = rel(0.9)))
b3 <- ggplot(Adjusted_Data, aes(x=TestPrepCourse)) + geom_bar(fill="deeppink3", alpha=0.7) + ggtitle("Frequency of Completing a Test Preparation Course or Not") + theme_bw() + theme(plot.title = element_text(size = rel(0.9)))
grid.arrange(b1, b3, b2) # Displays a grid of pictures of the saved objects
Observations from the diagram above show that there are more females than males marginally present in the dataset.
Secondly, the students who would have completed a test prep course before the exam is just over half the set of those who have not taken it.
And lastly, there is quite a bit of variability running across the 6 levels of the parents’ education. The least popular educational level attained among the parents is a master degree. Moving up the list is the bachelors degree which almost double that amount. The associate degree and some college level almost tied as the most popular educational level among the parents, closely followed the by the high school and some high school levels. Altogther, most parents would have attained an associate degree or less.
The following boxplots displays the distribution of the the Math and Reading scores. They deliver a visual representation of their five summary statics and also highlight possible cases of outliers they may have.
The Math Case:
# Boxplot
boxplot(Adjusted_Data$MathScore, main="Box Plot of the Math Scores",
xlab="Math",
ylab="Scores",
col="pink")
For the Math scores above, the range of values is from 0 - 100. Despite closely following a relatively normal distribution, the scores are skewed slightly to the left as depicted by the longer bottom whisker and consists of the lower scores. This left skewness indicates that greater variability resides at the lower end of the scores. The presence of outliers is also noticeable as they are values that exist to varying degrees beyond the lower whisker in the plot.
A minimum score of zero is the only one existing at that value and could suggest that the student did not take the exam or probably was severely lacking in the knowledge or skills required for this particular exam.
The first quarter which represents twenty five percent (25%) of the scores is just around 55 and tells us that over seventy five (75%) of the students passed the Math exam. In addition, twenty five percent (25%) would have also scored As and the median score which is the center of the distribution and lies just above the 60 mark, reflects the overall all high pass rate.
The Reading case:
# Boxplot
boxplot(Adjusted_Data$ReadingScore, main="Box Plot of the Reading Scores",
xlab="Reading",
ylab="Scores",
col="grey")
At first sight, the distribution of the Reading scores bares similarity to the previous Math scores. However, they are some notable differences.
Although they are slightly skewed to the left as in the first case, the minimum score is closer to the 20 mark and does not extend all the way down to zero. This highlights that every student would have wrote the exam and would have at least obtained one mark.
The presence of outliers are fewer than in the Math case. The lower twenty five percent (25%) of the scores while closer to 60 than in Math, are also below the 60 mark which indicates that most of the students would have also passed their Reading exam. However, the median value is further away from the 60 mark which represents a better overall performance for the Reading exam.
The upper twenty five (25%) scored all A’s with seventy five percent (75%) of the students scoring less.Next a quantification of the variable measurement to confirm what was observed visually about them.
SumStat <- summary(Adjusted_Data) # Statistical summary
# Outputs a formatted table of the values
SumStat %>%
knitr::kable(caption = "Summary statistic of the dataset")%>%
kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| Gender | ParentsEducationLevel | TestPrepCourse | MathScore | ReadingScore | |
|---|---|---|---|---|---|
| female:518 | associate’s degree:222 | completed:358 | Min. : 0.00 | Min. : 17.00 | |
| male :482 | bachelor’s degree :118 | none :642 | 1st Qu.: 57.00 | 1st Qu.: 59.00 | |
| NA | high school :196 | NA | Median : 66.00 | Median : 70.00 | |
| NA | master’s degree : 59 | NA | Mean : 66.09 | Mean : 69.17 | |
| NA | some college :226 | NA | 3rd Qu.: 77.00 | 3rd Qu.: 79.00 | |
| NA | some high school :179 | NA | Max. :100.00 | Max. :100.00 |
The values recorded here confirmed the initial findings found from the visual inspection of the variables plotted in the graphs above.
There are more female students accounted for than males, less students completed a test preparatory course than those who did not and the least populous education level among the parents is a masters degree level closely followed by the bachelors degree.
The Math and Reading measured values also serve as confirmation for their visual analysis findings. Reading received a better average mark than Math, with the latter a mean of 66.09 while the former a mark of 69.17.
How are the two individual scores distributed across the male and female students?
A density plot will be use to explore this distribution.
The Math case:
# Density plot for the Math scores
ggplot(Adjusted_Data, aes(x=MathScore, fill= Gender)) + geom_density(alpha=0.4) + theme_classic() + ggtitle("Male and Female Distribution Across the Math Scores") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")
According to the diagram, more males capped the highest grades in the vicinity of the top twenty five percent (25%) and very few shared the lower grades in the bottom twenty five percent (25%) with the females.The females basically dominated the lower end of the Math scores ranging from zero to about the 70 mark score. This information shows clearly that the male outperformed the females in the Math exam.
The Reading case:
# Density plot for the Reading scores
ggplot(Adjusted_Data, aes(x=ReadingScore, fill = Gender)) + geom_density(alpha=0.4) + theme_classic() + ggtitle("Male and Female Distribution Across the Reading Scores") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")
In this setting the case is reversed. The lower grades are concentrated among the males while the females dominated the higher ones. Here however, we have a narrower bracket between the two groups competing for the lower twenty five percent (25%) of the scores. We also have more females scoring around seventy five (75) and over than for the males in the similar Math case.
In what way does the educational level of their parents, influence a student completing a test preparatory course?
A two-way contingency table will be used to derived any influence if any, the former variable have on the latter.
# Outputs a contingency table
compare <- table(Adjusted_Data$ParentsEducationLevel, Adjusted_Data$TestPrepCourse)
compare %>%
knitr::kable(caption = "Two Way Comparison of The Parents Educational Level and Taking a Test Preparatory Course")%>%
kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| completed | none | |
|---|---|---|
| associate’s degree | 82 | 140 |
| bachelor’s degree | 46 | 72 |
| high school | 56 | 140 |
| master’s degree | 20 | 39 |
| some college | 77 | 149 |
| some high school | 77 | 102 |
Strikingly, the parents with the highest educational levels, namely a master and bachelors degree does not equate to more students completing a test prep course. Instead it resulted in the students being least likely to complete the course. Perhaps the students felt they were knowledgeable enough and did not need to take such a course before writing the exams.
On the flip side, the lower levels of education starting with the associate degree, has the most occurrences of students who completed the test. This pattern follows a downward trend.The result is very clear that students who came from parental background without a first degree and above, are more inclined to take some form of prep course before taking the two exams.
Is completing or not completing a test preparatory course significantly influence by a student’s gender?
We will enlist the the help of a dodged-bar gragh plot, followed by a confirmation chi square test to answer this question.
A chi square test is a test for significance between two categorical variables. The null hypothesis states there is no dependency between the two variables.
# Dodge bargrap plot
ggplot(Adjusted_Data, aes(x=TestPrepCourse, color=Gender, fill=Gender)) + geom_bar(position="dodge") + theme_bw() + ggtitle("Comparison of Male to Female Amount Across Test Prep Course Categories") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + coord_flip()
For each category of the test prep course, the female to male ratio seems only marginally different. To explore this difference we will conduct a chi square test to find out if it is significant and if indeed gender is not playing a role in whether a student complete a test prep course or not.
# Contingency table of values
ChiTest <- table(Adjusted_Data$TestPrepCourse, Adjusted_Data$Gender)
Test <- chisq.test(ChiTest) # Performs chi square test
Test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ChiTest
## X-squared = 0.015529, df = 1, p-value = 0.9008
Because the p-value is more that 0.05%, we cannot reject the null hypothesis that the two variables are independent of each other.
Hence there is no significant association between completing a test prep course or not and being a male or female. That is, being a male or female has no bearing on whether a student complete a test prep course or not.They are two isolated events. This proved what was suspected from the previous plot.What role does the parental level of education for a student plays, in their over all performance for each exam and how is this affected by Gender?
The use of a histogram to show the distribution of each scores across gender then faceted by the parents educational level will help to answer this question.
The Math case:
# Faceted histogram
ggplot(Adjusted_Data, aes(x = MathScore, color=Gender)) +
geom_histogram() + facet_wrap(~ParentsEducationLevel) + theme(panel.grid.major = element_blank()) + ggtitle("Gender Distributed across the Math scores Segmented by Parents Education Level") + labs(y="Count") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")
Overall we have few but better scores from students whose parents obtained a masters degree when compared to the others. This category received the the lowest minimum grade. There is a notable trend between the students performances and the level of education their parents would have obtained. The higher the education the better the students would have scored.
When the performance is filter down by gender we see that while most females are approximately attaining most of the grades at all levels, they are also soley responsible for the lowest scores. This bracket vary accross all the the levels but the spread is worse at the high school and some high school levels.
We can therefore conclude, that while stemming from a higher educational background increases your chances of performing better in a Math exam, this is undermined by being a female to some extend as it simultaneous increases the odds of you securing one of the lowest scores in Math.
The Reading Case:
# Faceted histogram
ggplot(Adjusted_Data, aes(x = ReadingScore, color=Gender)) + geom_histogram() + facet_wrap(~ParentsEducationLevel) + theme(panel.grid.major = element_blank()) + ggtitle("Gender Distributed across the Reading scores Segmented by Parents Education Level") + labs(y="Count") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")
Quite interestingly; the higher the educational background the students came from, their chances of passing Reading increases. Conversely, the fail counts increases as the educational level decreases. The spread of scores also increases the lower this level is.
At the higher levels, most males are responsible for the lower scores while there is a fluctuation between the two groups for the lower grades at a lower educational background. Lastly, the females dominated the performances capping most of the highest scores across all levels. Thus, a student performance on a Reading exam increases, the higher their education background, but shares an inverse relationship with being a male.The use of a scatter plot is ideal to capture the relationship that may exist between two numerical variables.
# Scatter plot to show the relationship
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9) + theme_classic() + ggtitle("A plot of Math Verses Reading Scores") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_smooth()
The output above shows a linear relationship occuring between the two set of scores. This relationship is also positive, that is, as the Math scores increases, the Reading scores also increasing in the same direction and vice versa. The rate of the increase can only be deduce through further analysis and modeling.
The line through the data shows how the data moves as the scores increases. It hightlights a wider variance for the lower scores in the data .
How strong is the relationship between the two performances?
The following plot utilizes the Pearson’s correlation measurement to display the strength of the corellation between the two performances.
# Extract the two columns from the data
Relation <- select (Adjusted_Data, ReadingScore, MathScore)
R <- cor(Relation) # Calculates correlation
corrplot(R, method="number") # Graphical display of correlation
As seen from the above diagram, there is a strong relationship that exist among the two performances. They are collinear with each other in a strong way.
Note: Since there is a collinear relationship between the performances it is possible to use one of the scores to predict the other.
What is the fail and pass rate proportions of the exams relative to each other?
A simple proportion table will help to answer this question
# Creating new column
Adjusted_DataPF <- Adjusted_Data %>%
mutate(MathPassFail = ifelse(MathScore >= 50, "PassMath", "FailMath")) %>%
mutate(ReadingPassFail = ifelse(ReadingScore >= 50, "PassReading", "FailReading"))
c <- table(as.factor(Adjusted_DataPF$MathPassFail), as.factor(Adjusted_DataPF$ReadingPassFail))
PassFaillPercent <- prop.table(c)
PassFaillPercent %>%
knitr::kable(caption = "Proportion of The Exams Fail and Pass Rate Relative To Each Other")%>%
kableExtra::kable_styling(bootstrap_options = "striped", "condensed")%>%
row_spec(0, background ="#f0efef")
| FailReading | PassReading | |
|---|---|---|
| FailMath | 0.059 | 0.076 |
| PassMath | 0.031 | 0.834 |
The table depicts a high pass rate between the two subjects. A whooping 83.4% who passed Math, also passed their Reading exams; explaining why the distribution increased in density substantially after the 50 mark, as seen on the previous scatter plot between the two scores.
The lowest percent of 3.1% went to the group of students who pass Math but fail Reading. In between there are more students who failed Math and passed Reading possessing a rate of 7.6% than those who failed both Math and Reading with only 5.9%. These low rates explained why the data was sparsely distributed at the lower scores.A faceted dodged-bar gragh will help us to drill through the data from the right angle to uncover how these variablesinteract to affect the final outcome of the students’ Math scores.
# Faceted bar graph
Data <- Adjusted_Data %>%
mutate(MathScore_PassFail = cut(MathScore, breaks= c(0, 49, 50, 100)))
ggplot(Data, aes(x = MathScore_PassFail, color=TestPrepCourse)) + geom_bar(position='dodge') + facet_wrap(~Gender) + ggtitle("A plot of Math Verses Reading Scores") + theme(plot.title = element_text(size = rel(1.2), hjust=0.5)) + labs(y="Count") #+ scale_fill_manual(values = c("blue", "green"))
This plot reveals an interesting development. Our focus is on bins one and two. Examining the scores from this perspective shows that they are some amount of irregularities present in the Math scores. Although earlier analysis would have revealed the range of the data to be between 0-100, there is obviously a category unaccounted for. Nevertheless, students who did not take a test preparatory course fall in this group and all are females.
The data is also revealing to us that completing a test preparatory course increases a student chances of passing a Math exam and this is unaffected by the gender of the student. However, for the females this seem to have slightly more marginal effect than the males as more females who did not completed one failed when compared to the smaller set of males who did not complete a test prep course and fail.
What is the best fitted model to predict the performance of a student in a Math exam?
The variable of interest to be predicted is the MathScore. This is also known as the dependent or response variable.
This is usually the first stage before creating your model. It is an essential stage if you are doing out of sample prediction
The dataset was divided into 80-20% train to test set ratio via a random sampling without replacement process.
View of the training data set
# Split procedure
set.seed(2)
sampleSize <- floor(.80*nrow(Adjusted_Data))
split <- sample(seq_len(nrow(Adjusted_Data)), sampleSize, replace = FALSE)
Adjusted_Train <- Adjusted_Data[split, ]
Adjusted_Test <- Adjusted_Data[-split, ]
head(Adjusted_Train) %>%
knitr::kable(caption = "Dataset used for training the model- first six rows")%>%
kableExtra::kable_styling(bootstrap_options = "striped", "hover")%>%
row_spec(0, background ="#f0efef")
| Gender | ParentsEducationLevel | TestPrepCourse | MathScore | ReadingScore | |
|---|---|---|---|---|---|
| 853 | female | some college | none | 87 | 85 |
| 975 | female | some college | none | 54 | 63 |
| 710 | female | associate’s degree | completed | 42 | 61 |
| 774 | female | bachelor’s degree | none | 62 | 78 |
| 416 | male | high school | none | 84 | 73 |
| 392 | female | some college | none | 77 | 68 |
Data that will be used for training the model.
A view of the test dataset
head(Adjusted_Test) %>%
knitr::kable(caption = "Dataset used for training the model- first six rows ")%>%
kableExtra::kable_styling(bootstrap_options = "striped", "condensed")%>%
row_spec(0, background ="#f0efef")
| Gender | ParentsEducationLevel | TestPrepCourse | MathScore | ReadingScore | |
|---|---|---|---|---|---|
| 6 | female | associate’s degree | none | 71 | 83 |
| 7 | female | some college | completed | 88 | 95 |
| 15 | female | master’s degree | none | 50 | 53 |
| 18 | female | some high school | none | 18 | 32 |
| 19 | male | master’s degree | completed | 46 | 42 |
| 24 | female | some high school | none | 69 | 73 |
Data that will be used for testing the prediction accuracy of the model.
This model uses the Reading Score variable as the predictor variable. This variable was chosen as the independent variable because it was proven to have a strong linear relation with the Math response variable.
# Constructing model1 using the Reading score as the predictor variable
SingleModel <- lm(MathScore~ReadingScore, data=Adjusted_Train)
summary(SingleModel)
##
## Call:
## lm(formula = MathScore ~ ReadingScore, data = Adjusted_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5245 -6.3847 0.1679 6.4824 24.4755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.24264 1.48635 4.873 1.33e-06 ***
## ReadingScore 0.85318 0.02107 40.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.752 on 798 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.6723
## F-statistic: 1640 on 1 and 798 DF, p-value: < 2.2e-16
From the results above, the Reading score variable is significantly contributing to this model since its p value is less than 0.05%. Approximated 67.27% of the variation in the Math scores could be explained by the Reading scores from our model. And, according to the slope value (ReadingScore Estimated value), for every increase in a Reading score, the Math score increases by 0.85318 of that value. Over all the model is significant as the F statistics shows its p-value to be less than 0.5%. This prove that the slope of the predictor variable is not equal to zero.
The fitted regression equation for the MathScore is 7.24264 + 0.85318ReadingScore
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9) + theme_classic() + ggtitle("A plot of The Fitted Regression Line") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_abline(intercept = 7.24264, slope = 0.85318, color="blue", size=1.1, alpha=0.5)
A graph depicting the regression line from the SingleModel cutting through the data. Notice the bulk of the data resides above the line proving it is not explaining a reasonable amount of the variations in the Math scores
Test preparatory course was included alongside the Reading score in the construction of this model because previous analysis highlighted that taking a Prep course influence the outcome of the students’s performances to some degree
# Constructing model2 using the Reading score and Test Prep Course as the predictor variables
DualModel <- lm(MathScore~ReadingScore + TestPrepCourse, data=Adjusted_Train)
summary(DualModel)
##
## Call:
## lm(formula = MathScore ~ ReadingScore + TestPrepCourse, data = Adjusted_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5039 -6.3833 0.0499 6.6196 24.1822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.32747 1.69632 3.730 0.000205 ***
## ReadingScore 0.85949 0.02181 39.415 < 2e-16 ***
## TestPrepCoursenone 0.74721 0.66779 1.119 0.263507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.75 on 797 degrees of freedom
## Multiple R-squared: 0.6732, Adjusted R-squared: 0.6724
## F-statistic: 821 on 2 and 797 DF, p-value: < 2.2e-16
This model utilized two predictors. One numerical and one categorical. Here we see from the stars that only the Reading Score variable is significantly contributing to this model. The TestPrepCourse category that did not complete the course is not adding any significant contribution to this model since its p value is greater than 0.05%. However, the model over all is significant as shown by the very small F statistic p value which happens to be the same as the value in the previous model. Apart from that, despite there is a minute difference in the R squared values between this model and the last, this model does not seems to perform any better than using the Readingscore as the sole predictor for a Mathscore.
The fitted regression equation is 6.32747 + 0.85949ReadingScore + 0.74721TestPrepCourse
For those who did not completed the test prep course, 7.07468 + 0.85949*ReadingScore is the fitted regression line.
Note: To calculate the regression equation for those who completed the test prep course, the TestPrepCourse value will be 0 since that is the reference variable (line will be the same as the first regression line) and 1 in the case for those who did not completed
The respective equations will allow you to gain insight into the expected Math score a student is expected to obtain given that particular condition.
# Plot the resulting graph
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9) + theme_classic() + ggtitle("A plot of The Fitted Regression Lines") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_abline(intercept = 6.32747, slope = 0.85949, color="Deeppink", size=1.3) + geom_abline(intercept = 7.07468, slope = 0.85949, color="turquoise", size=1.2) + theme(legend.position="right") + geom_text(aes(x = 100, y = 80, label = " ", color ="ReadingSore")) + theme(legend.position="right") + geom_text(aes(x = 100, y = 80, label = " ", color ="TestPrepCourse")) + labs(colour = "Fitted Lines")
As confirmed by the model summary, the addition of TestPrepCourse to the ReadingScore did not improve this model performance in any notable way. Notice the two regression lines, although parallel, are closely touching each other.
A combination of the three variables were use to construct this model to find out if it would yeild better performance from a model when compared to the previous models with the other predictors. Additionally, the two categorical variables were proven to be independent of each other thereby not violating the multicollinear assumption of multiple regression between them. They can independently contribute to this model without risk of the model being over fitted.
# Constructing model3 using the Reading score, Test Prep Course and Gender as the predictor variables
TriModel <- lm(MathScore~ReadingScore + TestPrepCourse + Gender, data=Adjusted_Train)
summary(TriModel)
##
## Call:
## lm(formula = MathScore ~ ReadingScore + TestPrepCourse + Gender,
## data = Adjusted_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2417 -4.0858 -0.0762 4.1558 17.3290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.7699 1.3881 -5.598 2.99e-08 ***
## ReadingScore 0.9706 0.0169 57.416 < 2e-16 ***
## TestPrepCoursenone 1.6339 0.5008 3.262 0.00115 **
## Gendermale 12.0251 0.4798 25.062 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.546 on 796 degrees of freedom
## Multiple R-squared: 0.8174, Adjusted R-squared: 0.8167
## F-statistic: 1187 on 3 and 796 DF, p-value: < 2.2e-16
This model seemed to have performed better than the previous two. The coefficient values show that all the predictor variables are contributing significantly to this model as all their p-values are less than 0.05%. Even further, in this model a combination of the three variables are explaining a whooping 81.74% of the variability in the Math scores. A sharp increased from the R squared values in the last two models. The overall significance of the model appears to remain the same in value as compared to the other models. The value is too small to notice any change in value if they are any.
Note:The intercept values are rarely important when constructing models, however, an interesting observation is that it is negative for this particular model.
The fitted regression equation for this model is -7.7699 + 0.9706ReadingScore + 1.6339TestPrepCourse + 12.0251Gender.
Note:
To calculate the regression equation for those who completed the test prep course, the TestPrepCourse value will be 0 since that is the reference variable and 1 in the case for those who did not completed.
To calculate the regression equation for the female gender, the value will be 0 since that is the reference variable and 1 in the case of male gender.
The individaul regression equation for the categorical variables will allow you to calculate the expected value for a Math score given the respective condition(s). For example the expect value of the student Math score given the fact they have completed a test prep course etc.
# Plot the resuting graph
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9) + theme_classic() + ggtitle("A plot of The Fitted Regression Lines") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_abline(intercept = -7.7699, slope = 0.9706, color="green", size=1.2, alpha=0.6) + geom_abline(intercept = -6.136, slope = 0.9706, color="blue", size=1.2, alpha=0.6) + geom_abline(intercept = 4.2516, slope = 0.9706, color="Deeppink", size=1.2, alpha=0.9) + theme(legend.position="right") + geom_text(aes(x = 100, y = 80, label = " ", color ="ReadingSore")) + theme(legend.position="right") + geom_text(aes(x = 100, y = 80, label = " ", color ="TestPrepCourse")) + theme(legend.position="right") + geom_text(aes(x = 100, y = 80, label = " ", color ="Gender")) + labs(colour = "Fitted Lines")
This visualization clearly shows an improved model from the previous set. We see that the addition of gender to the other two variables from DualModel indeed helped to addressed more of the variation in the MathScores. Consequently, we see that not only is gender playing a big part in this model but it is the male rather than the female group that is explaining this variation. This is also explaining the higher section of the grades better than the other variables in the model. We can see that it supports earlier analysis which revealed that the males students were the better performers in Math.
After creating and evaluating your models it is essential that you compare the models to each other to guide your selection of the best fitted model among the set. There are several ways this can be done, but here, they will be judged based on AIC, BIC, R-Squared values and their Overall p-Values.
# Calculates The AIC and BIC values
A <- round(AIC(SingleModel),4)
B <- round(AIC(DualModel), 4)
C <- round(AIC(TriModel), 4)
D <- round(BIC(SingleModel), 4)
E <- round(BIC(DualModel), 4)
G <- round(BIC(TriModel), 4)
# Procedure collects models values and output to table
Rsquare1 <- round(summary(SingleModel)$r.squared, 4)
Rsquare2 <- round(summary(DualModel)$r.squared, 4)
Rsquare3 <- round(summary(TriModel)$r.squared, 4)
Single <- summary(SingleModel)
Dual <- summary(DualModel)
Tri <- summary(TriModel)
coeff1 <- round(coef(Single), 4)
coeff2 <- round(coef(Dual), 4)
coeff3 <- round(coef(Tri), 4)
table= matrix(NA, nrow = 3, ncol = 7)
colnames(table) = c("Model","Intercept", "Slope", "P-Value", "AIC", "BIC", "R-Square")
table[1,] = c("SingleModel", coeff1[1,1], coeff1[1,2], "2.2e-16", A, D, Rsquare1)
table[2,] = c("DualModel", coeff2[1,1], coeff2[1,2], "2.2e-16", B, E, Rsquare2)
table[3,] = c("TriModel", coeff3[1,1], coeff3[1,2], "2.2e-16", C, G, Rsquare3)
table %>%
knitr::kable(caption = "Model Comparison")%>%
kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| Model | Intercept | Slope | P-Value | AIC | BIC | R-Square |
|---|---|---|---|---|---|---|
| SingleModel | 7.2426 | 1.4864 | 2.2e-16 | 5745.0833 | 5759.1371 | 0.6727 |
| DualModel | 6.3275 | 1.6963 | 2.2e-16 | 5745.8276 | 5764.566 | 0.6732 |
| TriModel | -7.7699 | 1.3881 | 2.2e-16 | 5282.4586 | 5305.8817 | 0.8174 |
Based on these values the best model for predicting a student Mathscore is the TRIMODEl which consist of a combination of contributions from 3 influential varibles on the model. Despite the p-value is constant for all the models, the derived TriModel received the highest rating in term of its performance from both Akaike’s information criterion (AIC) and Bayesian information criterion (BIC) as well as possessing the highest R squared values.
This makes TriModel the clear winner among the models and provides the answer to the question, what is the best fitted model to predict the performance of a student in a Math exam.
TriModel, a product of Multiple Regression is the best fitted model in this *CONTEXT.
A view of the predict object which contains the true and predicted values
Predicted <- predict(TriModel, Adjusted_Test)
Predicted_Values <- data.frame(cbind(TrueValues=Adjusted_Test$MathScore, PredictedValues=Predicted))
head(Predicted_Values) %>%
knitr::kable(caption = "First six rows of the Predicted object values")%>%
kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| TrueValues | PredictedValues | |
|---|---|---|
| 6 | 71 | 74.42250 |
| 7 | 88 | 84.43566 |
| 15 | 50 | 45.30494 |
| 18 | 18 | 24.92265 |
| 19 | 46 | 45.01969 |
| 24 | 69 | 64.71664 |
A snapshot showing how closely the the predicted values align with the actual values.
A look at the correlation between the original and predicted values.
Predicted_Accuracy <- cor(Predicted_Values)
Predicted_Accuracy %>%
knitr::kable(caption = "Correlation between the True and Predicted values")%>%
kableExtra::kable_styling(bootstrap_options = "striped")%>%
row_spec(0, background ="#f0efef")
| TrueValues | PredictedValues | |
|---|---|---|
| TrueValues | 1.000000 | 0.896823 |
| PredictedValues | 0.896823 | 1.000000 |
And there we have it, the correlation between the actual values and the predicted values is strong and positive which confirms the high performance of TriModel, the selected best fitted model.