Exploratory Data Analysis And Regression Modeling

Author: "Zeenat Wilson"
Date: "5/25/2020"

Disclaimer:

This project is in fulfillment of the 2019/2020 requirement for the 4th year Database Systems Design
and Information Management Systems II (CSE4202) course offered by the University of Guyana,
South America.The result therein should not be used as reference at this point in time;
as the outputs may be subjected to errors and may not reflect
the true outputs from following stricts guidelines and
assumptions needed to obtain those.

This work be will continuous updated over a period of time to reflect such results

The following analysis is based on Exploratory Data Analysis (EDA) and ultimately Predictive Model Building in R. It anaylzes students performance in Reading and Math via the provided dataset and provides the answers to ten(10) pertinent questions throughout the process.

Dataset: Based on marks secured by high school students located in the United States.
Source: https://www.kaggle.com/spscientist/students-performance-in-exams

Load required libraries:

This is the first step in R.

library(Amelia)
library(corrplot)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(kableExtra)
library(knitr)
library(tidyverse)

Load the dataset:

The data required for the analysis must be imported into R.

Original_Data<-read.csv("C:\\Users\\Zeenet\\Desktop\\Marks_Lab\\StudentsPerformance.csv")

A Preliminary view of the data

str(Original_Data) # Structural make up of the dataset

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

This is a breakdown of the dataset. Its dimension can be seen. It has 8 variables and 1000 observations. Three numerical and five categorical variables are in the set.

A partial view of the data

Data_Head <- head(Original_Data) # First six rows of data

Data_Head %>%
 knitr::kable(caption = "First six(6) rows of the data")%>%  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

First six(6) rows of the data
gender	race.ethnicity	parental.level.of.education	lunch	test.preparation.course	math.score	reading.score	writing.score
female	group B	bachelor’s degree	standard	none	72	72	74
female	group C	some college	standard	completed	69	90	88
female	group B	master’s degree	standard	none	90	95	93
male	group A	associate’s degree	free/reduced	none	47	57	44
male	group C	some college	standard	none	76	78	75
female	group B	associate’s degree	standard	none	71	83	78

Some of the data present in the set.

Isolating variables of interest

For the purpose of this project, only the variables defined or zoomed into here will be use to conduct the proceeding analysis.

Adjusted_Data <- Original_Data[,-c(2,4,8)] # Drops the specifies columns from the dataset
names(Adjusted_Data) # Names of the working variables

## [1] "gender"                      "parental.level.of.education"
## [3] "test.preparation.course"     "math.score"                 
## [5] "reading.score"

These are the selected variables. race.ethnicity and lunch (two categorical variables) have been removed.

A check for completeness of the data

This step is necessary to ensure that there are no missing observations present for the defined variables.

missmap(Adjusted_Data, main="Missing Map") # Highlights missing observations

There are no missing values present.

Renaming the variables

This step is not mandatory but can be useful for modifying the names of the variables into more comfortable ones to work with.

names(Adjusted_Data)[1] <- "Gender"
names(Adjusted_Data)[2] <- "ParentsEducationLevel"
names(Adjusted_Data)[3] <- "TestPrepCourse"
names(Adjusted_Data)[4] <- "MathScore"
names(Adjusted_Data)[5] <- "ReadingScore"
names(Adjusted_Data)

## [1] "Gender"                "ParentsEducationLevel" "TestPrepCourse"       
## [4] "MathScore"             "ReadingScore"

Here, the variables names are updated

Univariate Analysis

Univariate analysis is useful for dissecting variables individually and will be employed to answer the following question.

Question #1

What are the characteristics of each feature present in the dataset?

The answer to this question will provide insights as to the statistical nature of the working data through breaking down the attributes of each variable.

But first, a visual inspection of the variables to see what information about their shape and distribution can be gathered.

Visual representation of the categorical variables

# Barchart plot of the categorical variables
b1 <- ggplot(Adjusted_Data, aes(x=Gender)) + geom_bar(fill="darkgreen", alpha=0.6)  + ggtitle("Frequency of Male and Female") + theme_bw() + theme(plot.title = element_text(size = rel(0.9))) 

b2 <- ggplot(Adjusted_Data, aes(x=ParentsEducationLevel)) + geom_bar(fill="black", alpha=0.6)  + ggtitle("Frequency of The Parents Educational Levels") + theme_bw() + theme(plot.title = element_text(size = rel(0.9)))

b3 <- ggplot(Adjusted_Data, aes(x=TestPrepCourse)) + geom_bar(fill="deeppink3", alpha=0.7)  + ggtitle("Frequency of Completing a Test Preparation Course or Not") + theme_bw() + theme(plot.title = element_text(size = rel(0.9)))

grid.arrange(b1, b3, b2) # Displays a grid of pictures of the saved objects

Observations from the diagram above show that there are more females than males marginally present in the dataset.

Secondly, the students who would have completed a test prep course before the exam is just over half the set of those who have not taken it.

And lastly, there is quite a bit of variability running across the 6 levels of the parents’ education. The least popular educational level attained among the parents is a master degree. Moving up the list is the bachelors degree which almost double that amount. The associate degree and some college level almost tied as the most popular educational level among the parents, closely followed the by the high school and some high school levels. Altogther, most parents would have attained an associate degree or less.

Visual representation of the numerical variables

The following boxplots displays the distribution of the the Math and Reading scores. They deliver a visual representation of their five summary statics and also highlight possible cases of outliers they may have.

The Math Case:

# Boxplot
boxplot(Adjusted_Data$MathScore, main="Box Plot of the Math Scores",
xlab="Math",
ylab="Scores",
col="pink")

For the Math scores above, the range of values is from 0 - 100. Despite closely following a relatively normal distribution, the scores are skewed slightly to the left as depicted by the longer bottom whisker and consists of the lower scores. This left skewness indicates that greater variability resides at the lower end of the scores. The presence of outliers is also noticeable as they are values that exist to varying degrees beyond the lower whisker in the plot.

A minimum score of zero is the only one existing at that value and could suggest that the student did not take the exam or probably was severely lacking in the knowledge or skills required for this particular exam.

The first quarter which represents twenty five percent (25%) of the scores is just around 55 and tells us that over seventy five (75%) of the students passed the Math exam. In addition, twenty five percent (25%) would have also scored As and the median score which is the center of the distribution and lies just above the 60 mark, reflects the overall all high pass rate.

Visual representation of the numerical variables

The Reading case:

# Boxplot
boxplot(Adjusted_Data$ReadingScore, main="Box Plot of the Reading Scores",
xlab="Reading",
ylab="Scores",
col="grey")

At first sight, the distribution of the Reading scores bares similarity to the previous Math scores. However, they are some notable differences.

Although they are slightly skewed to the left as in the first case, the minimum score is closer to the 20 mark and does not extend all the way down to zero. This highlights that every student would have wrote the exam and would have at least obtained one mark.

The presence of outliers are fewer than in the Math case. The lower twenty five percent (25%) of the scores while closer to 60 than in Math, are also below the 60 mark which indicates that most of the students would have also passed their Reading exam. However, the median value is further away from the 60 mark which represents a better overall performance for the Reading exam.

The upper twenty five (25%) scored all A’s with seventy five percent (75%) of the students scoring less.

Quantified measurement of the variables

Next a quantification of the variable measurement to confirm what was observed visually about them.

SumStat <- summary(Adjusted_Data) # Statistical summary

# Outputs a formatted table of the values
SumStat %>%
  knitr::kable(caption = "Summary statistic of the dataset")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

Summary statistic of the dataset
Gender	ParentsEducationLevel	TestPrepCourse	MathScore	ReadingScore
female:518	associate’s degree:222	completed:358	Min. : 0.00	Min. : 17.00
male :482	bachelor’s degree :118	none :642	1st Qu.: 57.00	1st Qu.: 59.00
NA	high school :196	NA	Median : 66.00	Median : 70.00
NA	master’s degree : 59	NA	Mean : 66.09	Mean : 69.17
NA	some college :226	NA	3rd Qu.: 77.00	3rd Qu.: 79.00
NA	some high school :179	NA	Max. :100.00	Max. :100.00

The values recorded here confirmed the initial findings found from the visual inspection of the variables plotted in the graphs above.

There are more female students accounted for than males, less students completed a test preparatory course than those who did not and the least populous education level among the parents is a masters degree level closely followed by the bachelors degree.

The Math and Reading measured values also serve as confirmation for their visual analysis findings. Reading received a better average mark than Math, with the latter a mean of 66.09 while the former a mark of 69.17.

Bivariate and Multivariate Analysis

These types of analyses conveys information about how two variables interact with or affect each other. They will be used to answers the following questions.

Question #2

How are the two individual scores distributed across the male and female students?

A density plot will be use to explore this distribution.

The Math case:

# Density plot for the Math scores
ggplot(Adjusted_Data, aes(x=MathScore, fill= Gender)) + geom_density(alpha=0.4) + theme_classic() +  ggtitle("Male and Female Distribution Across the Math Scores") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")

According to the diagram, more males capped the highest grades in the vicinity of the top twenty five percent (25%) and very few shared the lower grades in the bottom twenty five percent (25%) with the females.The females basically dominated the lower end of the Math scores ranging from zero to about the 70 mark score. This information shows clearly that the male outperformed the females in the Math exam.

The Reading case:

# Density plot for the Reading scores
ggplot(Adjusted_Data, aes(x=ReadingScore, fill = Gender)) + geom_density(alpha=0.4) + theme_classic() +  ggtitle("Male and Female Distribution Across the Reading Scores") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")

In this setting the case is reversed. The lower grades are concentrated among the males while the females dominated the higher ones. Here however, we have a narrower bracket between the two groups competing for the lower twenty five percent (25%) of the scores. We also have more females scoring around seventy five (75) and over than for the males in the similar Math case.

Question #3

In what way does the educational level of their parents, influence a student completing a test preparatory course?

A two-way contingency table will be used to derived any influence if any, the former variable have on the latter.

# Outputs a contingency table 
compare <- table(Adjusted_Data$ParentsEducationLevel, Adjusted_Data$TestPrepCourse)

compare %>%
  knitr::kable(caption = "Two Way Comparison of The Parents Educational Level and Taking a Test Preparatory Course")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

Two Way Comparison of The Parents Educational Level and Taking a Test Preparatory Course
	completed	none
associate’s degree	82	140
bachelor’s degree	46	72
high school	56	140
master’s degree	20	39
some college	77	149
some high school	77	102

Strikingly, the parents with the highest educational levels, namely a master and bachelors degree does not equate to more students completing a test prep course. Instead it resulted in the students being least likely to complete the course. Perhaps the students felt they were knowledgeable enough and did not need to take such a course before writing the exams.

On the flip side, the lower levels of education starting with the associate degree, has the most occurrences of students who completed the test. This pattern follows a downward trend.

The result is very clear that students who came from parental background without a first degree and above, are more inclined to take some form of prep course before taking the two exams.

Question #4

Is completing or not completing a test preparatory course significantly influence by a student’s gender?

We will enlist the the help of a dodged-bar gragh plot, followed by a confirmation chi square test to answer this question.

A chi square test is a test for significance between two categorical variables. The null hypothesis states there is no dependency between the two variables.

# Dodge bargrap plot
ggplot(Adjusted_Data, aes(x=TestPrepCourse, color=Gender, fill=Gender)) + geom_bar(position="dodge") + theme_bw() +  ggtitle("Comparison of Male to Female Amount Across Test Prep Course Categories") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + coord_flip()

For each category of the test prep course, the female to male ratio seems only marginally different. To explore this difference we will conduct a chi square test to find out if it is significant and if indeed gender is not playing a role in whether a student complete a test prep course or not.

# Contingency table of values
ChiTest <- table(Adjusted_Data$TestPrepCourse, Adjusted_Data$Gender)
Test <- chisq.test(ChiTest) # Performs chi square test
Test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ChiTest
## X-squared = 0.015529, df = 1, p-value = 0.9008

Because the p-value is more that 0.05%, we cannot reject the null hypothesis that the two variables are independent of each other.

Hence there is no significant association between completing a test prep course or not and being a male or female. That is, being a male or female has no bearing on whether a student complete a test prep course or not.They are two isolated events. This proved what was suspected from the previous plot.

Question #5

What role does the parental level of education for a student plays, in their over all performance for each exam and how is this affected by Gender?

The use of a histogram to show the distribution of each scores across gender then faceted by the parents educational level will help to answer this question.

The Math case:

# Faceted histogram  
ggplot(Adjusted_Data, aes(x = MathScore, color=Gender)) +
  geom_histogram() + facet_wrap(~ParentsEducationLevel) + theme(panel.grid.major = element_blank()) + ggtitle("Gender Distributed across the Math scores Segmented by Parents Education Level") + labs(y="Count") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")

Overall we have few but better scores from students whose parents obtained a masters degree when compared to the others. This category received the the lowest minimum grade. There is a notable trend between the students performances and the level of education their parents would have obtained. The higher the education the better the students would have scored.

When the performance is filter down by gender we see that while most females are approximately attaining most of the grades at all levels, they are also soley responsible for the lowest scores. This bracket vary accross all the the levels but the spread is worse at the high school and some high school levels.

We can therefore conclude, that while stemming from a higher educational background increases your chances of performing better in a Math exam, this is undermined by being a female to some extend as it simultaneous increases the odds of you securing one of the lowest scores in Math.

The Reading Case:

# Faceted histogram
ggplot(Adjusted_Data, aes(x = ReadingScore, color=Gender)) +  geom_histogram() + facet_wrap(~ParentsEducationLevel) + theme(panel.grid.major = element_blank()) + ggtitle("Gender Distributed across the Reading scores Segmented by Parents Education Level") + labs(y="Count") + theme(plot.title = element_text(size = rel(1.1), hjust=0.5)) + labs(y="Density")

Quite interestingly; the higher the educational background the students came from, their chances of passing Reading increases. Conversely, the fail counts increases as the educational level decreases. The spread of scores also increases the lower this level is.

At the higher levels, most males are responsible for the lower scores while there is a fluctuation between the two groups for the lower grades at a lower educational background. Lastly, the females dominated the performances capping most of the highest scores across all levels. Thus, a student performance on a Reading exam increases, the higher their education background, but shares an inverse relationship with being a male.

Question #6

What is the nature of the relationship between a student’s Math and Reading performances?

The use of a scatter plot is ideal to capture the relationship that may exist between two numerical variables.

# Scatter plot to show the relationship
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9)  + theme_classic() + ggtitle("A plot of Math Verses Reading Scores") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_smooth()

The output above shows a linear relationship occuring between the two set of scores. This relationship is also positive, that is, as the Math scores increases, the Reading scores also increasing in the same direction and vice versa. The rate of the increase can only be deduce through further analysis and modeling.

The line through the data shows how the data moves as the scores increases. It hightlights a wider variance for the lower scores in the data .

Question #7

How strong is the relationship between the two performances?

The following plot utilizes the Pearson’s correlation measurement to display the strength of the corellation between the two performances.

# Extract the two columns from the data
Relation <- select (Adjusted_Data, ReadingScore, MathScore)
R <- cor(Relation) # Calculates correlation
 corrplot(R, method="number") # Graphical display of correlation

As seen from the above diagram, there is a strong relationship that exist among the two performances. They are collinear with each other in a strong way.

Note: Since there is a collinear relationship between the performances it is possible to use one of the scores to predict the other.

Question #8

What is the fail and pass rate proportions of the exams relative to each other?

A simple proportion table will help to answer this question

# Creating new column
Adjusted_DataPF <- Adjusted_Data %>%
  mutate(MathPassFail = ifelse(MathScore >= 50, "PassMath", "FailMath")) %>%
  mutate(ReadingPassFail = ifelse(ReadingScore >= 50, "PassReading", "FailReading"))
                               

c <- table(as.factor(Adjusted_DataPF$MathPassFail), as.factor(Adjusted_DataPF$ReadingPassFail))
PassFaillPercent <- prop.table(c)

PassFaillPercent %>%
  knitr::kable(caption = "Proportion of The Exams Fail and Pass Rate Relative To Each Other")%>%
  kableExtra::kable_styling(bootstrap_options = "striped", "condensed")%>%
  row_spec(0, background ="#f0efef")

Proportion of The Exams Fail and Pass Rate Relative To Each Other
	FailReading	PassReading
FailMath	0.059	0.076
PassMath	0.031	0.834

The table depicts a high pass rate between the two subjects. A whooping 83.4% who passed Math, also passed their Reading exams; explaining why the distribution increased in density substantially after the 50 mark, as seen on the previous scatter plot between the two scores.

The lowest percent of 3.1% went to the group of students who pass Math but fail Reading. In between there are more students who failed Math and passed Reading possessing a rate of 7.6% than those who failed both Math and Reading with only 5.9%. These low rates explained why the data was sparsely distributed at the lower scores.

Question #9

How does completing a test preparatory course or not and being male or female, impact a student’s pass/fail performance in Math?

A faceted dodged-bar gragh will help us to drill through the data from the right angle to uncover how these variablesinteract to affect the final outcome of the students’ Math scores.

# Faceted bar graph
Data <- Adjusted_Data %>%
  mutate(MathScore_PassFail = cut(MathScore, breaks= c(0, 49, 50, 100)))
ggplot(Data, aes(x = MathScore_PassFail, color=TestPrepCourse)) + geom_bar(position='dodge') + facet_wrap(~Gender) + ggtitle("A plot of Math Verses Reading Scores") + theme(plot.title = element_text(size = rel(1.2), hjust=0.5)) + labs(y="Count") #+ scale_fill_manual(values = c("blue", "green"))

This plot reveals an interesting development. Our focus is on bins one and two. Examining the scores from this perspective shows that they are some amount of irregularities present in the Math scores. Although earlier analysis would have revealed the range of the data to be between 0-100, there is obviously a category unaccounted for. Nevertheless, students who did not take a test preparatory course fall in this group and all are females.

The data is also revealing to us that completing a test preparatory course increases a student chances of passing a Math exam and this is unaffected by the gender of the student. However, for the females this seem to have slightly more marginal effect than the males as more females who did not completed one failed when compared to the smaller set of males who did not complete a test prep course and fail.

Modeling

This final question explores both Linear and Multiply Regression.

Question #10

What is the best fitted model to predict the performance of a student in a Math exam?

The variable of interest to be predicted is the MathScore. This is also known as the dependent or response variable.

Splitting the Dataset into train and test data

This is usually the first stage before creating your model. It is an essential stage if you are doing out of sample prediction

The dataset was divided into 80-20% train to test set ratio via a random sampling without replacement process.

View of the training data set

# Split procedure
set.seed(2)
sampleSize <- floor(.80*nrow(Adjusted_Data))
split <- sample(seq_len(nrow(Adjusted_Data)), sampleSize, replace = FALSE) 
Adjusted_Train <- Adjusted_Data[split, ]
Adjusted_Test <- Adjusted_Data[-split, ]


head(Adjusted_Train) %>%
  knitr::kable(caption = "Dataset used for training the model- first six rows")%>%
  kableExtra::kable_styling(bootstrap_options = "striped", "hover")%>%
  row_spec(0, background ="#f0efef")

Dataset used for training the model- first six rows
	Gender	ParentsEducationLevel	TestPrepCourse	MathScore	ReadingScore
853	female	some college	none	87	85
975	female	some college	none	54	63
710	female	associate’s degree	completed	42	61
774	female	bachelor’s degree	none	62	78
416	male	high school	none	84	73
392	female	some college	none	77	68

Data that will be used for training the model.

A view of the test dataset

head(Adjusted_Test) %>%
  knitr::kable(caption = "Dataset used for training the model- first six rows ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped", "condensed")%>%
  row_spec(0, background ="#f0efef")

Dataset used for training the model- first six rows
	Gender	ParentsEducationLevel	TestPrepCourse	MathScore	ReadingScore
6	female	associate’s degree	none	71	83
7	female	some college	completed	88	95
15	female	master’s degree	none	50	53
18	female	some high school	none	18	32
19	male	master’s degree	completed	46	42
24	female	some high school	none	69	73

Data that will be used for testing the prediction accuracy of the model.

Creation of Model #1

This model uses the Reading Score variable as the predictor variable. This variable was chosen as the independent variable because it was proven to have a strong linear relation with the Math response variable.

# Constructing model1 using the Reading score as the predictor variable
SingleModel <- lm(MathScore~ReadingScore, data=Adjusted_Train)

summary(SingleModel)

## 
## Call:
## lm(formula = MathScore ~ ReadingScore, data = Adjusted_Train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5245  -6.3847   0.1679   6.4824  24.4755 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.24264    1.48635   4.873 1.33e-06 ***
## ReadingScore  0.85318    0.02107  40.501  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.752 on 798 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.6723 
## F-statistic:  1640 on 1 and 798 DF,  p-value: < 2.2e-16

From the results above, the Reading score variable is significantly contributing to this model since its p value is less than 0.05%. Approximated 67.27% of the variation in the Math scores could be explained by the Reading scores from our model. And, according to the slope value (ReadingScore Estimated value), for every increase in a Reading score, the Math score increases by 0.85318 of that value. Over all the model is significant as the F statistics shows its p-value to be less than 0.5%. This prove that the slope of the predictor variable is not equal to zero.

The fitted regression equation for the MathScore is 7.24264 + 0.85318ReadingScore

View of the plotted SingleModel

ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9)  + theme_classic() + ggtitle("A plot of The  Fitted Regression Line") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_abline(intercept = 7.24264, slope = 0.85318, color="blue", size=1.1, alpha=0.5)

A graph depicting the regression line from the SingleModel cutting through the data. Notice the bulk of the data resides above the line proving it is not explaining a reasonable amount of the variations in the Math scores

Creation of Model #2

Test preparatory course was included alongside the Reading score in the construction of this model because previous analysis highlighted that taking a Prep course influence the outcome of the students’s performances to some degree

# Constructing model2 using the Reading score and Test Prep Course  as the predictor variables
DualModel <- lm(MathScore~ReadingScore + TestPrepCourse, data=Adjusted_Train)
summary(DualModel)

## 
## Call:
## lm(formula = MathScore ~ ReadingScore + TestPrepCourse, data = Adjusted_Train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5039  -6.3833   0.0499   6.6196  24.1822 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.32747    1.69632   3.730 0.000205 ***
## ReadingScore        0.85949    0.02181  39.415  < 2e-16 ***
## TestPrepCoursenone  0.74721    0.66779   1.119 0.263507    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.75 on 797 degrees of freedom
## Multiple R-squared:  0.6732, Adjusted R-squared:  0.6724 
## F-statistic:   821 on 2 and 797 DF,  p-value: < 2.2e-16

This model utilized two predictors. One numerical and one categorical. Here we see from the stars that only the Reading Score variable is significantly contributing to this model. The TestPrepCourse category that did not complete the course is not adding any significant contribution to this model since its p value is greater than 0.05%. However, the model over all is significant as shown by the very small F statistic p value which happens to be the same as the value in the previous model. Apart from that, despite there is a minute difference in the R squared values between this model and the last, this model does not seems to perform any better than using the Readingscore as the sole predictor for a Mathscore.

The fitted regression equation is 6.32747 + 0.85949ReadingScore + 0.74721TestPrepCourse

For those who did not completed the test prep course, 7.07468 + 0.85949*ReadingScore is the fitted regression line.

Note: To calculate the regression equation for those who completed the test prep course, the TestPrepCourse value will be 0 since that is the reference variable (line will be the same as the first regression line) and 1 in the case for those who did not completed

The respective equations will allow you to gain insight into the expected Math score a student is expected to obtain given that particular condition.

A view of the plotted DualModel

# Plot the resulting graph
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9)  + theme_classic() + ggtitle("A plot of The  Fitted Regression Lines") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5)) + geom_abline(intercept = 6.32747, slope = 0.85949, color="Deeppink", size=1.3) + geom_abline(intercept = 7.07468, slope = 0.85949, color="turquoise", size=1.2) + theme(legend.position="right")  + geom_text(aes(x = 100, y = 80, label = " ", color ="ReadingSore")) + theme(legend.position="right")  + geom_text(aes(x = 100, y = 80, label = " ", color ="TestPrepCourse")) + labs(colour = "Fitted Lines")

As confirmed by the model summary, the addition of TestPrepCourse to the ReadingScore did not improve this model performance in any notable way. Notice the two regression lines, although parallel, are closely touching each other.

Creation of the Model #3

A combination of the three variables were use to construct this model to find out if it would yeild better performance from a model when compared to the previous models with the other predictors. Additionally, the two categorical variables were proven to be independent of each other thereby not violating the multicollinear assumption of multiple regression between them. They can independently contribute to this model without risk of the model being over fitted.

# Constructing model3 using the Reading score, Test Prep Course and Gender as the predictor variables
TriModel <- lm(MathScore~ReadingScore + TestPrepCourse + Gender, data=Adjusted_Train)
summary(TriModel)

## 
## Call:
## lm(formula = MathScore ~ ReadingScore + TestPrepCourse + Gender, 
##     data = Adjusted_Train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2417  -4.0858  -0.0762   4.1558  17.3290 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -7.7699     1.3881  -5.598 2.99e-08 ***
## ReadingScore         0.9706     0.0169  57.416  < 2e-16 ***
## TestPrepCoursenone   1.6339     0.5008   3.262  0.00115 ** 
## Gendermale          12.0251     0.4798  25.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.546 on 796 degrees of freedom
## Multiple R-squared:  0.8174, Adjusted R-squared:  0.8167 
## F-statistic:  1187 on 3 and 796 DF,  p-value: < 2.2e-16

This model seemed to have performed better than the previous two. The coefficient values show that all the predictor variables are contributing significantly to this model as all their p-values are less than 0.05%. Even further, in this model a combination of the three variables are explaining a whooping 81.74% of the variability in the Math scores. A sharp increased from the R squared values in the last two models. The overall significance of the model appears to remain the same in value as compared to the other models. The value is too small to notice any change in value if they are any.

Note:The intercept values are rarely important when constructing models, however, an interesting observation is that it is negative for this particular model.

The fitted regression equation for this model is -7.7699 + 0.9706ReadingScore + 1.6339TestPrepCourse + 12.0251Gender.

Note:
To calculate the regression equation for those who completed the test prep course, the TestPrepCourse value will be 0 since that is the reference variable and 1 in the case for those who did not completed.

To calculate the regression equation for the female gender, the value will be 0 since that is the reference variable and 1 in the case of male gender.

The individaul regression equation for the categorical variables will allow you to calculate the expected value for a Math score given the respective condition(s). For example the expect value of the student Math score given the fact they have completed a test prep course etc.

View of the plotted TriModel

# Plot the resuting graph
ggplot(Adjusted_Data, aes(x=MathScore, y=ReadingScore)) + geom_point(color="brown", alpha=0.9)  + theme_classic() + ggtitle("A plot of The  Fitted Regression Lines") + theme(plot.title = element_text(size = rel(1.3), hjust=0.5))  + geom_abline(intercept = -7.7699, slope = 0.9706, color="green", size=1.2, alpha=0.6) + geom_abline(intercept = -6.136, slope = 0.9706, color="blue", size=1.2, alpha=0.6) + geom_abline(intercept = 4.2516, slope = 0.9706, color="Deeppink", size=1.2, alpha=0.9) + theme(legend.position="right")  + geom_text(aes(x = 100, y = 80, label = " ", color ="ReadingSore")) + theme(legend.position="right")  + geom_text(aes(x = 100, y = 80, label = " ", color ="TestPrepCourse"))  + theme(legend.position="right")  + geom_text(aes(x = 100, y = 80, label = " ", color ="Gender")) + labs(colour = "Fitted Lines")

This visualization clearly shows an improved model from the previous set. We see that the addition of gender to the other two variables from DualModel indeed helped to addressed more of the variation in the MathScores. Consequently, we see that not only is gender playing a big part in this model but it is the male rather than the female group that is explaining this variation. This is also explaining the higher section of the grades better than the other variables in the model.
We can see that it supports earlier analysis which revealed that the males students were the better performers in Math.

Model Selection

After creating and evaluating your models it is essential that you compare the models to each other to guide your selection of the best fitted model among the set. There are several ways this can be done, but here, they will be judged based on AIC, BIC, R-Squared values and their Overall p-Values.

# Calculates The AIC and BIC values
A <- round(AIC(SingleModel),4)
B <- round(AIC(DualModel), 4)
C <- round(AIC(TriModel), 4)
D <- round(BIC(SingleModel), 4)
E <- round(BIC(DualModel), 4)
G <- round(BIC(TriModel), 4)

# Procedure collects models values and output to table
Rsquare1 <- round(summary(SingleModel)$r.squared, 4)
Rsquare2 <- round(summary(DualModel)$r.squared, 4)
Rsquare3 <- round(summary(TriModel)$r.squared, 4)

Single <- summary(SingleModel)
Dual <- summary(DualModel)
Tri <- summary(TriModel) 

coeff1 <- round(coef(Single), 4)
coeff2 <- round(coef(Dual), 4)
coeff3 <- round(coef(Tri), 4)


table= matrix(NA, nrow = 3, ncol = 7)
colnames(table) = c("Model","Intercept", "Slope", "P-Value", "AIC", "BIC", "R-Square")

table[1,] = c("SingleModel", coeff1[1,1], coeff1[1,2], "2.2e-16", A, D, Rsquare1)
table[2,] = c("DualModel", coeff2[1,1], coeff2[1,2], "2.2e-16", B, E, Rsquare2)
table[3,] = c("TriModel", coeff3[1,1], coeff3[1,2], "2.2e-16",  C, G, Rsquare3)

table %>%
  knitr::kable(caption = "Model Comparison")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

Model Comparison
Model	Intercept	Slope	P-Value	AIC	BIC	R-Square
SingleModel	7.2426	1.4864	2.2e-16	5745.0833	5759.1371	0.6727
DualModel	6.3275	1.6963	2.2e-16	5745.8276	5764.566	0.6732
TriModel	-7.7699	1.3881	2.2e-16	5282.4586	5305.8817	0.8174

Based on these values the best model for predicting a student Mathscore is the TRIMODEl which consist of a combination of contributions from 3 influential varibles on the model. Despite the p-value is constant for all the models, the derived TriModel received the highest rating in term of its performance from both Akaike’s information criterion (AIC) and Bayesian information criterion (BIC) as well as possessing the highest R squared values.

This makes TriModel the clear winner among the models and provides the answer to the question, what is the best fitted model to predict the performance of a student in a Math exam.

TriModel, a product of Multiple Regression is the best fitted model in this *CONTEXT.

Prediction

A view of the predict object which contains the true and predicted values

Predicted <- predict(TriModel, Adjusted_Test)
Predicted_Values <- data.frame(cbind(TrueValues=Adjusted_Test$MathScore, PredictedValues=Predicted))  

head(Predicted_Values) %>%
  knitr::kable(caption = "First six rows of the Predicted object values")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

First six rows of the Predicted object values
	TrueValues	PredictedValues
6	71	74.42250
7	88	84.43566
15	50	45.30494
18	18	24.92265
19	46	45.01969
24	69	64.71664

A snapshot showing how closely the the predicted values align with the actual values.

Correlation Accuracy of the Prediction

A look at the correlation between the original and predicted values.

Predicted_Accuracy <- cor(Predicted_Values)
Predicted_Accuracy %>%
  knitr::kable(caption = "Correlation between the True and Predicted values")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")%>%
  row_spec(0, background ="#f0efef")

Correlation between the True and Predicted values
	TrueValues	PredictedValues
TrueValues	1.000000	0.896823
PredictedValues	0.896823	1.000000

And there we have it, the correlation between the actual values and the predicted values is strong and positive which confirms the high performance of TriModel, the selected best fitted model.