Disclaimer: This report was prepared as an assignment for an introductory data mining class. Some of the analysis may contain errors, which will be corrected over time, so the findings should not be generalised.
The purpose of this document is to analyse the Students Performance in Exams dataset and to select a best-fit model. The dataset was taken from https://www.kaggle.com/spscientist/students-performance-in-exams.
Findings I
## The dimension of the data set is: 1000 9
Findings II
The output below summarises the Students Performance in Exams dataset. As shown above, the dataset has 1000 rows and 9 columns; the original dataset had 1000 rows and 8 columns. An average.score column, holding the mean of the three scores for each student, was appended.
##     gender      race.ethnicity   parental.level.of.education
##  female:518   amerindian: 89     associate's degree:222
##  male  :482   african   :190     bachelor's degree :118
##               chinese   :319     high school       :196
##               indian    :262     master's degree   : 59
##               portuguese:140     some college      :226
##                                  some high school  :179
##           lunch      test.preparation.course   math.score      reading.score
##  free/reduced:355   completed:358             Min.   :  0.00   Min.   : 17.00
##  standard    :645   none     :642             1st Qu.: 57.00   1st Qu.: 59.00
##                                               Median : 66.00   Median : 70.00
##                                               Mean   : 66.09   Mean   : 69.17
##                                               3rd Qu.: 77.00   3rd Qu.: 79.00
##                                               Max.   :100.00   Max.   :100.00
##  writing.score    average.score
##  Min.   : 10.00   Min.   :  9.00
##  1st Qu.: 57.75   1st Qu.: 58.33
##  Median : 69.00   Median : 68.33
##  Mean   : 68.05   Mean   : 67.77
##  3rd Qu.: 79.00   3rd Qu.: 77.67
##  Max.   :100.00   Max.   :100.00
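The step of appending an average column and then summarising the data can be sketched as follows. This is a minimal sketch, not the report's own code: the small data frame is a hypothetical stand-in for the Kaggle file, and `score_cols` is a name introduced here for illustration.

```r
# Hypothetical stand-in for the Kaggle data (the real file would be read
# with read.csv(); its path is not given in this report).
set.seed(1)
n <- 20
dataset <- data.frame(
  gender        = factor(sample(c("female", "male"), n, replace = TRUE)),
  math.score    = sample(40:100, n, replace = TRUE),
  reading.score = sample(40:100, n, replace = TRUE),
  writing.score = sample(40:100, n, replace = TRUE)
)

# Append the row-wise mean of the three score columns.
score_cols <- c("math.score", "reading.score", "writing.score")
dataset$average.score <- rowMeans(dataset[, score_cols])

# Report the dimensions and a per-column summary.
cat("The dimension of the data set is:", dim(dataset), "\n")
print(summary(dataset))
```

Appending one column raises the column count by one, which is the same change as the step from 8 to 9 columns described above.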
Findings III
The table below indicates that more females than males prepared for their exams (184 against 174). It also shows that more females than males did not prepare (334 against 308). Adding the prepared and unprepared counts for each gender shows that the dataset contains more females (518) than males (482), which accounts for both differences.
##
## completed none
## female 184 334
## male 174 308
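The counts in the table can be checked with a few lines of R. The sketch below re-enters the four reported counts by hand (in the analysis itself the table presumably comes from `table()` on the gender and test.preparation.course columns, which is an assumption) and adds the totals and the share of each gender that completed the course.

```r
# Counts copied from the table above.
counts <- matrix(
  c(184, 334,
    174, 308),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  prep   = c("completed", "none"))
)

# Row and column totals: 518 females, 482 males, 358 completed, 642 none.
print(addmargins(counts))

# Share of each gender in each preparation category (rows sum to 1).
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 3))
```

Looking at shares rather than raw counts separates the effect of group size from the preparation rate itself.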
Findings IV
Based upon the histograms below, reading was the best-performed exam. According to the summary in Findings II, reading has a mean of 69.17, writing follows with a mean of 68.05, and math has the lowest mean at 66.09.
Findings V
The box plots below indicate that males perform better in math, while females perform better in reading and writing.
Findings VI
The bar plot below illustrates that the performance of students who prepared for their exams (completed) is lower than that of students who did not prepare (none).
Findings VII
The bar graph below suggests that the parents' level of education has no major impact on students' preparation for exams. For example, parents with a some high school level of education account for more students who prepared for their exams. Across the education levels, the balance between completed and none fluctuates, and in some groups the two are equal.
Findings VIII
The scatter plot below shows that there is a positive linear relationship between the reading and writing scores.
Findings IX
The box plots below illustrate that students who received a standard lunch performed better than those who had a free/reduced lunch.
Findings X
Based upon the bar plot below, the portuguese group performs best, with the highest average score of 68.25. The plot also shows the number of students in each group; for example, there are over 300 students in the chinese group.
The dataset was divided into two parts, train and test. The dimensions of train and test, respectively, are shown below:
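Group averages and group sizes of this kind can be computed with `tapply()` and `table()`. The sketch below uses a tiny hypothetical data frame, so its numbers are illustrative only and do not reproduce the 68.25 reported above.

```r
# Hypothetical toy data using the report's column names.
dataset <- data.frame(
  race.ethnicity = c("chinese", "chinese", "indian",
                     "portuguese", "portuguese", "african"),
  average.score  = c(60, 70, 66, 72, 68, 64)
)

# Mean average.score per group, and number of students per group.
group_means <- tapply(dataset$average.score, dataset$race.ethnicity, mean)
group_sizes <- table(dataset$race.ethnicity)

print(sort(group_means, decreasing = TRUE))
print(group_sizes)

# Name of the group with the highest mean.
best_group <- names(which.max(group_means))
print(best_group)
```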
## [1] 778 9
## [1] 222 9
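The report does not state how the split was made. One common way to obtain a 778/222 split is to sample row indices without replacement, as in the sketch below; the seed, the synthetic data, and the use of `sample()` are all assumptions.

```r
# Hypothetical 1000-row stand-in for the dataset.
set.seed(123)  # assumed seed, only for reproducibility of this sketch
n <- 1000
reading <- round(runif(n, 20, 100))
dataset <- data.frame(
  reading.score = reading,
  writing.score = round(reading + rnorm(n, sd = 4.5))
)

# Draw 778 row indices for train; the remaining 222 rows form test.
train_idx <- sample(seq_len(n), size = 778)
train <- dataset[train_idx, ]
test  <- dataset[-train_idx, ]

print(dim(train))
print(dim(test))
```

Because rows keep their original row names after indexing, the test rows can later be traced back to the full dataset.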
Model 1 below uses the full dataset to assess how strongly the writing score depends on the reading score. According to the model, the reading score is a statistically significant predictor of the writing score (slope 0.99, p < 2e-16, R-squared 0.91); hence, a student who does well in reading is likely to do well in writing.
##
## Call:
## lm(formula = writing.score ~ reading.score, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9573 -2.9573 0.0363 3.1026 15.0557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.667554 0.693792 -0.962 0.336
## reading.score 0.993531 0.009814 101.233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared: 0.9113, Adjusted R-squared: 0.9112
## F-statistic: 1.025e+04 on 1 and 998 DF, p-value: < 2.2e-16
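The fitted line can be read directly off the coefficient table: predicted writing score = -0.667554 + 0.993531 × reading score. The sketch below applies that equation to a few reading scores; `predict_writing` is a hypothetical helper name introduced here.

```r
# Coefficients copied from the model summary above.
intercept <- -0.667554
slope     <-  0.993531

# Predicted writing score for a given reading score.
predict_writing <- function(reading) intercept + slope * reading

preds <- predict_writing(c(50, 70, 90))
print(preds)
```

Since the slope is close to 1 and the intercept close to 0, the predicted writing score is nearly equal to the reading score.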
In model 2 (mod2), the train dataset is used to assess the significance of the reading score for the writing score. Like model 1, this model shows that the reading score is a statistically significant predictor of the writing score.
##
## Call:
## lm(formula = writing.score ~ reading.score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9112 -2.9056 0.1022 3.1633 15.1155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.84889 0.79533 -1.067 0.286
## reading.score 0.99556 0.01124 88.599 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.516 on 776 degrees of freedom
## Multiple R-squared: 0.91, Adjusted R-squared: 0.9099
## F-statistic: 7850 on 1 and 776 DF, p-value: < 2.2e-16
In model 3, the train data was used to assess the significance of the reading score for the math score; in other words, whether math depends on reading. Based on the model, the reading score is also a statistically significant predictor of the math score; hence, a student who does well in reading is likely to do well in math. The fit is weaker than for writing, with an R-squared of 0.67 against 0.91.
##
## Call:
## lm(formula = math.score ~ reading.score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2250 -6.4997 -0.1159 6.3501 24.7750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.43890 1.54609 4.165 3.47e-05 ***
## reading.score 0.86008 0.02184 39.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.778 on 776 degrees of freedom
## Multiple R-squared: 0.6664, Adjusted R-squared: 0.666
## F-statistic: 1550 on 1 and 776 DF, p-value: < 2.2e-16
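Summaries like the ones above come from `lm()` followed by `summary()`. The sketch below fits two such models on synthetic data (an assumption; the object names follow the report's mod2 naming, with `mod3` assumed for the math model) and pulls out the slope, its p-value, and R-squared.

```r
# Synthetic training data that loosely mimics the reported relationships.
set.seed(42)
n <- 200
reading <- round(runif(n, 30, 100))
train <- data.frame(
  reading.score = reading,
  writing.score = reading + rnorm(n, sd = 4.5),
  math.score    = 6 + 0.86 * reading + rnorm(n, sd = 9)
)

# Simple linear regressions with reading.score as the only predictor.
mod2 <- lm(writing.score ~ reading.score, data = train)
mod3 <- lm(math.score ~ reading.score, data = train)

# Slope estimate and p-value from the coefficient table.
print(coef(summary(mod2))["reading.score", c("Estimate", "Pr(>|t|)")])

# Proportion of variance explained by each model.
r2 <- c(mod2 = summary(mod2)$r.squared, mod3 = summary(mod3)$r.squared)
print(round(r2, 3))
```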
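A best-fit model was selected based upon the lowest AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) scores. The first three values below are the AIC scores of models 1 to 3, and the last three are the corresponding BIC scores. On both criteria, model 2 (mod2) scores lowest (AIC 4557.622, BIC 4571.592) and was selected. Note that model 1 was fitted on all 1000 rows and model 3 has a different response variable, whereas AIC and BIC are strictly comparable only between models fitted to the same response on the same data.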
## [1] 5862.88
## [1] 4557.622
## [1] 5591.932
## [1] 5877.603
## [1] 4571.592
## [1] 5605.902
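These scores can be reproduced approximately from the printed model summaries alone. The sketch below assumes the scores were produced by R's `AIC()` and `BIC()` on the three `lm` fits, and recomputes them from the residual standard error and degrees of freedom using the Gaussian log-likelihood; `ic_from_summary` is a hypothetical helper, and small differences arise because the printed standard errors are rounded.

```r
# AIC and BIC of a Gaussian linear model, from its residual standard error
# (rse) and residual degrees of freedom (df_resid).
ic_from_summary <- function(rse, df_resid, n_coef = 2) {
  n      <- df_resid + n_coef            # number of observations
  rss    <- rse^2 * df_resid             # residual sum of squares
  loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  k      <- n_coef + 1                   # coefficients plus error variance
  c(AIC = -2 * loglik + 2 * k,
    BIC = -2 * loglik + log(n) * k)
}

# Residual standard errors and degrees of freedom from the summaries above.
scores <- rbind(
  mod1 = ic_from_summary(4.529, 998),
  mod2 = ic_from_summary(4.516, 776),
  mod3 = ic_from_summary(8.778, 776)
)
print(round(scores, 2))

# Model with the lowest AIC.
print(rownames(scores)[which.min(scores[, "AIC"])])
```

The formula shows that the log-likelihood scales with the number of observations, so a model fitted on fewer rows tends to receive a lower AIC and BIC regardless of fit quality.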
Model 2 (mod2) was used to make predictions, since it was deemed the best fit according to the AIC and BIC scores. In the first six rows of the actual and predicted table below, the predicted writing scores are within 3 points of the actual scores, and the correlation between actual and predicted values is 0.957. Based upon this output, the model can be considered accurate enough to use for further prediction.
## actuals predicteds
## 5 75 76.80452
## 8 39 41.96004
## 14 70 70.83118
## 17 86 87.75564
## 23 53 52.91117
## 26 72 72.82230
## actuals predicteds
## actuals 1.0000000 0.9567072
## predicteds 0.9567072 1.0000000
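A table and correlation matrix of this shape can be produced with `predict()` and `cor()`. The sketch below is run on synthetic data with an assumed split, so its numbers will not match the output above; the root mean squared error at the end is an extra check that is not part of the original report.

```r
# Synthetic data and split (assumptions made only for this sketch).
set.seed(7)
n <- 300
reading <- round(runif(n, 30, 100))
dataset <- data.frame(
  reading.score = reading,
  writing.score = round(reading + rnorm(n, sd = 4.5))
)
train_idx <- sample(seq_len(n), size = 233)
train <- dataset[train_idx, ]
test  <- dataset[-train_idx, ]

# Fit on train, then predict the held-out test rows.
mod2 <- lm(writing.score ~ reading.score, data = train)
actuals_preds <- data.frame(
  actuals    = test$writing.score,
  predicteds = predict(mod2, newdata = test)
)

print(head(actuals_preds))   # first six actual/predicted pairs
print(cor(actuals_preds))    # correlation between actuals and predictions

# Typical size of a prediction error, in score points.
rmse <- sqrt(mean((actuals_preds$actuals - actuals_preds$predicteds)^2))
print(rmse)
```

Evaluating on rows the model did not see during fitting gives a fairer picture of accuracy than the training fit alone.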