Disclaimer: This report was prepared to fulfil an assignment for an introductory data mining class. Some of the analysis may contain errors, which will be corrected over time; the findings should therefore not be generalised.

Purpose of Document / Introduction

The purpose of this document is to analyse the Students Performance in Exams dataset and to select a best-fit model for it. The dataset was taken from https://www.kaggle.com/spscientist/students-performance-in-exams.

Dimension of Dataset

What is the dimension of the dataset?

Findings I

## The dimension of the data set is: 1000 9

Summary of the dataset

What are the summary statistics of the Students Performance in Exams dataset?

Findings II
A summary of the Students Performance in Exams dataset is shown below. As seen earlier, the dataset has 1000 rows and 9 columns; the original dataset, however, had 1000 rows and 8 columns. An average.score column, holding the average of the three scores, was appended to it.

##     gender       race.ethnicity     parental.level.of.education
##  female:518   amerindian: 89    associate's degree:222         
##  male  :482   african   :190    bachelor's degree :118         
##               chinese   :319    high school       :196         
##               indian    :262    master's degree   : 59         
##               portuguese:140    some college      :226         
##                                 some high school  :179         
##           lunch     test.preparation.course   math.score     reading.score   
##  free/reduced:355   completed:358           Min.   :  0.00   Min.   : 17.00  
##  standard    :645   none     :642           1st Qu.: 57.00   1st Qu.: 59.00  
##                                             Median : 66.00   Median : 70.00  
##                                             Mean   : 66.09   Mean   : 69.17  
##                                             3rd Qu.: 77.00   3rd Qu.: 79.00  
##                                             Max.   :100.00   Max.   :100.00  
##  writing.score    average.score   
##  Min.   : 10.00   Min.   :  9.00  
##  1st Qu.: 57.75   1st Qu.: 58.33  
##  Median : 69.00   Median : 68.33  
##  Mean   : 68.05   Mean   : 67.77  
##  3rd Qu.: 79.00   3rd Qu.: 77.67  
##  Max.   :100.00   Max.   :100.00
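
The appended average column can be sketched as follows. The report's output comes from R; this sketch uses Python's standard library purely for illustration. The two sample rows are made up and the helper name `add_average_score` is hypothetical; only the column names follow the summary above.

```python
# Hypothetical rows; only the column names follow the summary output above.
rows = [
    {"math.score": 72, "reading.score": 72, "writing.score": 74},
    {"math.score": 69, "reading.score": 90, "writing.score": 88},
]

SCORE_COLUMNS = ("math.score", "reading.score", "writing.score")

def add_average_score(row):
    """Return a copy of the row with an average.score column appended."""
    scores = [row[name] for name in SCORE_COLUMNS]
    return {**row, "average.score": sum(scores) / len(scores)}

augmented = [add_average_score(row) for row in rows]
for row in augmented:
    print(f"{row['average.score']:.2f}")
```

Each row gains one column, which is how an 8-column dataset becomes a 9-column one.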

Test Preparation Completed

How many males and females completed the test preparation course?

Findings III
The table below shows that more females (184) than males (174) completed test preparation. It also shows that more females (334) than males (308) did not prepare. Adding the figures for each gender shows that the dataset contains more females (518) than males (482), so the completion rates are in fact similar: about 36% for each gender.

##         
##          completed none
##   female       184  334
##   male         174  308
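
Because the two genders are not equally represented, comparing raw counts can mislead. The sketch below, written in Python for illustration, takes the four counts from the table above and derives the row totals and the share of each gender that completed the course. The variable names are illustrative only.

```python
# Counts copied from the gender by test.preparation.course table above.
counts = {
    "female": {"completed": 184, "none": 334},
    "male": {"completed": 174, "none": 308},
}

# Row totals give the number of students of each gender.
totals = {gender: sum(cells.values()) for gender, cells in counts.items()}

# Share of each gender that completed the preparation course.
completion_rate = {
    gender: cells["completed"] / totals[gender]
    for gender, cells in counts.items()
}

for gender in counts:
    print(f"{gender}: n={totals[gender]}, completed={completion_rate[gender]:.1%}")
```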

Best performed exam

What exam is the best performed?

Findings IV
Based on the histograms below, the best-performed exam was reading. Writing came next, with math yielding the lowest performance, which matches the mean scores in the summary above (reading 69.17, writing 68.05, math 66.09).

Gender performance in each subject area

Which gender performs better in each subject area?

Findings V
The box plots below indicate that males perform better in math, while females perform better in reading and writing.

Performance based upon test preparation

How does the performance of students who prepared for a test differ from that of students who did not?

Findings VI
The bar plot below shows that students who prepared for their exams (completed) performed lower than those who did not prepare (none).

Parental level of education and its impact

What is the impact of a parent’s level of education on his/her child’s test preparation?

Findings VII
The bar graph below suggests that parental level of education has no major impact on whether students prepare for exams. For example, the some high school group, one of the lower education levels, includes a relatively large number of students who prepared for their exams. The balance between completed and none fluctuates across the education levels, and in some of them the two are equal.

Relationship between reading and writing scores

Is there a relationship between the reading and writing scores?

Findings VIII
The scatter plot below shows that there is a positive linear relationship between the reading and writing scores.
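
The strength of a linear relationship like this one is commonly summarised with the Pearson correlation coefficient. The following is a minimal Python sketch of that calculation; the scores are hypothetical and are not taken from the dataset, and `pearson_r` is an illustrative helper rather than the method used in the report.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical reading and writing scores, not taken from the dataset.
reading = [45, 58, 66, 72, 81, 95]
writing = [43, 60, 63, 74, 79, 97]

r = pearson_r(reading, writing)
print(f"r = {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship, a value close to 0 indicates none.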

Performance based upon lunch

Do students who receive a standard lunch perform better?

Findings IX
The box plots below show that students who received a standard lunch performed better than those who had a free/reduced lunch.

Ethnicity

Do students of a particular ethnicity perform better?

Findings X
Based on the bar plot below, Portuguese students perform best, with the highest average score of 68.25. The plot also conveys how many students of each ethnicity are in the dataset; for example, there are over 300 Chinese students.
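
A per-group average of this kind can be sketched as below, in Python for illustration. The records are hypothetical and do not reproduce the dataset or the 68.25 figure; the sketch only shows how group counts and means would be derived from (race.ethnicity, average.score) pairs.

```python
from collections import defaultdict

# Hypothetical (race.ethnicity, average.score) pairs, for illustration only.
records = [
    ("portuguese", 70.0), ("portuguese", 66.0),
    ("chinese", 68.0), ("chinese", 64.0), ("chinese", 69.0),
    ("indian", 65.0),
]

def group_means(pairs):
    """Return (count, mean) of the score for each group label."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for group, score in pairs:
        sums[group] += score
        sizes[group] += 1
    return {group: (sizes[group], sums[group] / sizes[group]) for group in sums}

summary = group_means(records)
best_group = max(summary, key=lambda group: summary[group][1])
print(summary)
print("highest mean:", best_group)
```

Reporting the count next to each mean helps when group sizes differ widely, as they do in this dataset.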


Modeling

Division of dataset

The dataset was divided into two parts, train and test. The dimensions of train and test, respectively, are shown below:

## [1] 778   9
## [1] 222   9
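
The report does not state how the split was made. One common approach, sketched here in Python under that caveat, is to shuffle the row indices with a fixed seed and cut them at a chosen fraction. The function name, the seed, and the 0.222 fraction are assumptions chosen only so that the sizes match the 778/222 split above.

```python
import random

def split_train_test(rows, test_fraction, seed):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Stand-in for the 1000-row dataset; each "row" is just its index here.
dataset = list(range(1000))

# 0.222 is an assumed fraction, picked to match the sizes reported above.
train_rows, test_rows = split_train_test(dataset, test_fraction=0.222, seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, so the models can be refitted on exactly the same rows.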

Linear regression model1

Model 1 is fitted on the full dataset to assess how strongly the writing score depends on the reading score. The reading score coefficient is statistically significant (p < 2e-16), with a slope of about 0.99 and an R-squared of 0.911; hence, a student who does well in reading is likely to do well in writing.

## 
## Call:
## lm(formula = writing.score ~ reading.score, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9573  -2.9573   0.0363   3.1026  15.0557 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.667554   0.693792  -0.962    0.336    
## reading.score  0.993531   0.009814 101.233   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9112 
## F-statistic: 1.025e+04 on 1 and 998 DF,  p-value: < 2.2e-16
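
For a single predictor, the least-squares intercept and slope have a closed form. The Python sketch below illustrates that calculation on hypothetical scores (not the dataset) and then applies the coefficients printed in the summary above to a reading score of 70. The helper `fit_simple_ols` is illustrative and is not the R `lm` routine that produced the output.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical reading and writing scores, not taken from the dataset.
reading = [50, 60, 70, 80, 90]
writing = [48, 61, 69, 78, 91]

intercept, slope = fit_simple_ols(reading, writing)
print(f"writing = {intercept:.3f} + {slope:.3f} * reading")

# Coefficients reported for model1, applied to a reading score of 70.
model1_prediction = -0.667554 + 0.993531 * 70
print(f"{model1_prediction:.2f}")
```

With a slope this close to 1 and an intercept this close to 0, the reported model predicts a writing score almost equal to the reading score.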

Linear regression model2

Model 2 fits the same relationship on the train dataset. Like model 1, it shows that the reading score is a statistically significant predictor of the writing score (slope about 1.00, R-squared 0.910).

## 
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9112  -2.9056   0.1022   3.1633  15.1155 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.84889    0.79533  -1.067    0.286    
## reading.score  0.99556    0.01124  88.599   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.516 on 776 degrees of freedom
## Multiple R-squared:   0.91,  Adjusted R-squared:  0.9099 
## F-statistic:  7850 on 1 and 776 DF,  p-value: < 2.2e-16

Linear regression model3

Model 3 uses the train data to assess whether the math score depends on the reading score. The reading score is again statistically significant (slope about 0.86), but the fit is weaker (R-squared 0.666); hence, a student who does well in reading is likely, though less reliably, to do well in math.

## 
## Call:
## lm(formula = math.score ~ reading.score, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2250  -6.4997  -0.1159   6.3501  24.7750 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.43890    1.54609   4.165 3.47e-05 ***
## reading.score  0.86008    0.02184  39.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.778 on 776 degrees of freedom
## Multiple R-squared:  0.6664, Adjusted R-squared:  0.666 
## F-statistic:  1550 on 1 and 776 DF,  p-value: < 2.2e-16
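
The R-squared and F-statistic lines of a linear model summary carry the same information: with p predictors and d residual degrees of freedom, R-squared equals pF / (pF + d). The Python sketch below uses that identity as a consistency check on the three summaries. The function name is illustrative; the inputs are the values printed in the outputs.

```python
def r_squared_from_f(f_statistic, df_model, df_residual):
    """Recover R-squared from the overall F statistic of a linear model."""
    explained = f_statistic * df_model
    return explained / (explained + df_residual)

# F statistics and degrees of freedom as printed in the three model summaries.
reported = {
    "model1": (1.025e4, 1, 998),
    "model2": (7850, 1, 776),
    "model3": (1550, 1, 776),
}

recovered = {
    name: r_squared_from_f(f_stat, df_model, df_resid)
    for name, (f_stat, df_model, df_resid) in reported.items()
}

for name, value in recovered.items():
    print(name, round(value, 3))
```

The recovered values agree with the printed R-squared figures to rounding. They show that reading explains about 91% of the variance in writing but only about 67% of the variance in math.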

Model Selection

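The best-fit model was selected on the basis of the lowest AIC (Akaike information criterion) and BIC (Bayesian information criterion) scores. In the output below, the first three values are the AIC scores of models 1 to 3 and the last three are the corresponding BIC scores. Model 2 (mod2) has the lowest value on both criteria and was selected. Note that AIC and BIC are strictly comparable only between models fitted to the same response on the same observations; model 1 was fitted on all 1000 rows and model 3 predicts the math score rather than the writing score, so the comparison should be read as indicative only.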

## [1] 5862.88
## [1] 4557.622
## [1] 5591.932
## [1] 5877.603
## [1] 4571.592
## [1] 5605.902
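
For a least-squares fit with Gaussian errors, AIC and BIC can be computed from the residual sum of squares, the number of observations, and the number of parameters. The Python sketch below rebuilds the scores from the residual standard errors printed in the model summaries, counting the error variance as one extra parameter. Because those standard errors are rounded, the results match the values above only approximately. The function name is illustrative.

```python
import math

def gaussian_aic_bic(residual_se, n_obs, n_coef):
    """AIC and BIC of a least-squares fit from its residual standard error.

    Uses the Gaussian log-likelihood at the maximum-likelihood variance
    RSS / n and counts the error variance as one extra parameter.
    """
    df_residual = n_obs - n_coef
    rss = residual_se ** 2 * df_residual
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    k = n_coef + 1
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n_obs)
    return aic, bic

# (residual standard error, observations) from the three model summaries.
models = {"mod1": (4.529, 1000), "mod2": (4.516, 778), "mod3": (8.778, 778)}

scores = {
    name: gaussian_aic_bic(rse, n_obs, n_coef=2)
    for name, (rse, n_obs) in models.items()
}

for name, (aic, bic) in scores.items():
    print(f"{name}: AIC about {aic:.1f}, BIC about {bic:.1f}")
```

Both criteria grow with the number of observations. This is why mod1, which has almost the same residual standard error as mod2 but was fitted on 1000 rows, scores higher than mod2.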

Prediction

Model 2 (mod2) was used to make predictions because it was deemed the best fit according to the AIC and BIC scores. In the first six rows of the actual versus predicted table, the predicted writing scores fall within about 3 points of the actual scores, and the correlation between the actual and predicted values is about 0.957. On this evidence, the model can be considered accurate and suitable for further prediction.

##    actuals predicteds
## 5       75   76.80452
## 8       39   41.96004
## 14      70   70.83118
## 17      86   87.75564
## 23      53   52.91117
## 26      72   72.82230
##              actuals predicteds
## actuals    1.0000000  0.9567072
## predicteds 0.9567072  1.0000000
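
The closeness of the predictions can be quantified with simple error measures. The Python sketch below computes the mean absolute error and root mean squared error for the six rows displayed above. These figures describe only those six rows, not the full set of predictions.

```python
# Actual and predicted writing scores copied from the six rows shown above.
actuals = [75, 39, 70, 86, 53, 72]
predicteds = [76.80452, 41.96004, 70.83118, 87.75564, 52.91117, 72.82230]

errors = [actual - predicted for actual, predicted in zip(actuals, predicteds)]

# Mean absolute error and root mean squared error over these six rows only.
mae = sum(abs(e) for e in errors) / len(errors)
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5

print([round(e, 2) for e in errors])
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Five of the six errors are negative, meaning the model slightly over-predicts the writing score in those rows.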
