Disclaimer: This report was prepared as an assignment for an introductory data mining class. Some of the analysis may contain errors, which will be corrected over time. The findings should not be generalised.
This report analyses the student-mat.csv dataset. The students in the dataset are taking a Math course.
1. What is the summary of the dataset?
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
Findings: The summary covers all thirty-three variables in the dataset. Ages range from 15 to 22 with a median of 17, so most students are of secondary school age and a few are young adults.
2. What is the structure of the dataset?
## 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
Findings: Every variable is stored either as character (the categorical variables) or as integer. There are 395 observations of 33 variables.
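The split between character and integer columns can be checked programmatically. The sketch below is illustrative only: it rebuilds the first four observations of a few columns from the str() output above (the name `db_head` is hypothetical), lists each column's class, and shows one way the character columns could be converted to factors if a later step needs them.

```r
# First four observations of four columns, copied from the str() output above.
# db_head is a hypothetical name used only for this illustration.
db_head <- data.frame(
  school  = c("GP", "GP", "GP", "GP"),
  age     = c(18L, 17L, 15L, 15L),
  famsize = c("GT3", "GT3", "LE3", "GT3"),
  Medu    = c(4L, 1L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Class of every column: "character" for categorical, "integer" for numeric.
col_classes <- sapply(db_head, class)
print(col_classes)

# Optional step (an assumption, not done in the report): convert the
# character columns to factors so that R treats them as categorical.
db_fct <- db_head
is_chr <- col_classes == "character"
db_fct[is_chr] <- lapply(db_fct[is_chr], factor)
famsize_levels <- levels(db_fct$famsize)
print(famsize_levels)
```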
3. How are the students distributed among the schools?
Findings: The bar chart shows that the students come from two schools. Most are from Gabriel Pereira (GP) and the rest are from Mousinho da Silveira (MS). Students at both schools take the same Math course.
4. How many students travel long distances to get to school?
Findings: The column chart shows that most students have short travel times. This agrees with the summary above, where traveltime has a median of 1 and a third quartile of 2 on a scale that runs from 1 to 4.
5. At what age is alcohol consumed most among the students?
The mean age of the students who consume alcohol is 16.7, which matches the mean age of all students in the summary above.
Findings: The boxplot shows one outlier, a student aged twenty-two. Including this observation would skew the age distribution of the students who consume alcohol, so it is not considered. Most of the students who consume alcohol are therefore around seventeen years of age.
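The outlier can be checked against the usual boxplot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below uses the age quartiles from the summary in section 1. Note that R's boxplot() works from hinges, which can differ slightly from these quartiles, so this is an approximation of what the plot does.

```r
# Age statistics taken from the summary() output in section 1.
q1      <- 16
q3      <- 18
min_age <- 15
max_age <- 22

# Standard 1.5 * IQR fences.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# The maximum age lies above the upper fence; the minimum is inside the fences.
max_is_outlier <- max_age > upper_fence
min_is_outlier <- min_age < lower_fence
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(c(max_is_outlier = max_is_outlier, min_is_outlier = min_is_outlier))
```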
6. Which gender among the students is the most dominant in the consumption of alcohol?
Findings: The pie chart shows that female students account for a larger share of alcohol consumption than their male counterparts.
7. Which guardian within the guardian group is the most dominant?
Findings: From the column chart it can be seen that most of the guardians of the students are mothers.
8. How much spare time do students have?
Findings: The density plot shows that most students report a spare time level of about three. In the summary above, freetime is a rating from 1 to 5 rather than a number of hours. The distribution is roughly symmetrical, centred at or slightly above three. The report takes this spare time to be when most students consume alcohol, although the plot itself does not show this.
9. As you grow older does your study time increase?
Findings: The scatter plot shows that students’ study time does not increase as they grow older.
10. Is there any relationship between the parents’ education?
## Medudb
## 0 1 2 3 4
## 3 59 103 99 131
## Fedudb
## 0 1 2 3 4
## 2 82 115 100 96
The above information is interpreted as follows:
Mother’s education
| None | 4th Grade | 5th to 9th Grade | Secondary | Higher |
|---|---|---|---|---|
| 3 | 59 | 103 | 99 | 131 |
Father’s education
| None | 4th Grade | 5th to 9th Grade | Secondary | Higher |
|---|---|---|---|---|
| 2 | 82 | 115 | 100 | 96 |
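Findings: The two tables show similar distributions, and almost all parents have some form of education: only 3 mothers and 2 fathers have none. Separate frequency tables for each parent cannot show whether the two are related; the correlation below addresses that.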
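The frequency tables can be cross-checked against the summary in section 1. The sketch below re-enters the counts by hand (the vector names are hypothetical), recovers the mean education level for each parent, and computes the share of parents with at least some education.

```r
# Education levels 0 to 4 and the counts from the tables above.
edu_levels  <- 0:4
medu_counts <- c(3, 59, 103, 99, 131)
fedu_counts <- c(2, 82, 115, 100, 96)

# Both tables should account for all 395 students.
n_medu <- sum(medu_counts)
n_fedu <- sum(fedu_counts)

# Weighted means; these should agree with Mean :2.749 and Mean :2.522
# in the summary() output.
mean_medu <- sum(edu_levels * medu_counts) / n_medu
mean_fedu <- sum(edu_levels * fedu_counts) / n_fedu

# Share of parents with some education (any level above 0).
share_some_medu <- 1 - medu_counts[1] / n_medu
share_some_fedu <- 1 - fedu_counts[1] / n_fedu

print(round(c(mean_medu = mean_medu, mean_fedu = mean_fedu), 3))
print(round(c(mother = share_some_medu, father = share_some_fedu), 4))
```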
Correlation
The correlation between mother’s and father’s education is 0.6234551.
Findings: Father’s and mother’s education are positively correlated, and the relationship is moderately strong.
Building a linear model
##
## Call:
## lm(formula = Fedu ~ Medu, data = db)
##
## Coefficients:
## (Intercept) Medu
## 0.8176 0.6197
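Here we are predicting father’s education from mother’s education. The intercept is 0.8176 and the slope for Medu is 0.6197, so each additional level of mother’s education is associated with a predicted increase of about 0.62 in father’s education level.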
##
## Call:
## lm(formula = Fedu ~ Medu, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2966 -0.4374 -0.0571 0.7034 2.5626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8176 0.1160 7.049 8.15e-12 ***
## Medu 0.6197 0.0392 15.808 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8519 on 393 degrees of freedom
## Multiple R-squared: 0.3887, Adjusted R-squared: 0.3871
## F-statistic: 249.9 on 1 and 393 DF, p-value: < 2.2e-16
Findings: The p-value is less than 0.05, so the model is statistically significant. This should be checked before the model is used to predict the dependent variable (father’s education). The independent variable here is mother’s education.
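To make the fitted line concrete, the sketch below plugs the reported coefficients into the regression equation and predicts father’s education for each level of mother’s education. The function name `predict_fedu` is hypothetical; with the fitted model object available, `predict()` would normally be used instead. The last lines recompute the t value for Medu as estimate divided by standard error.

```r
# Coefficients reported by lm(Fedu ~ Medu) above.
intercept <- 0.8176
slope     <- 0.6197

# Hypothetical helper: predicted Fedu = intercept + slope * Medu.
predict_fedu <- function(medu) intercept + slope * medu

# Predictions for every possible level of mother's education (0 to 4).
fedu_hat <- predict_fedu(0:4)
print(data.frame(Medu = 0:4, Fedu_hat = fedu_hat))

# t value = estimate / standard error; compare with 15.808 in the summary.
t_medu <- slope / 0.0392
print(round(t_medu, 2))
```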
AIC and BIC
Akaike’s information criterion (AIC) = 998.3315503
Bayesian information criterion (BIC) = 1010.2682076
Correlation
The correlation between the first period grade and the second period grade is 0.8521181.
Findings: There is a relationship between the first period grade and the second period grade. It has a positive correlation.
Building a linear model
##
## Call:
## lm(formula = G2 ~ G1, data = db)
##
## Coefficients:
## (Intercept) G1
## 0.1796 0.9657
Here we are predicting the second period grade from the first period grade. The intercept is 0.1796 and the slope for G1 is 0.9657.
##
## Call:
## lm(formula = G2 ~ G1, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7676 -0.8363 0.1637 1.1637 4.1981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17957 0.34110 0.526 0.599
## G1 0.96567 0.02992 32.278 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.971 on 393 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7254
## F-statistic: 1042 on 1 and 393 DF, p-value: < 2.2e-16
Findings: The p-value for the model is less than 0.05, so the model is statistically significant, although the intercept on its own is not (p = 0.599). This should be checked before the model is used to predict the dependent variable (the second period grade). The independent variable here is the first period grade.
AIC and BIC
Akaike’s information criterion (AIC) = 1661.0378071
Bayesian information criterion (BIC) = 1672.9744644
Correlation
The correlation between second period grade and final grade is 0.904868.
Findings: There is a relationship between the second period grade and the final grade. It has a positive correlation.
Building a linear model
##
## Call:
## lm(formula = G3 ~ G2, data = db)
##
## Coefficients:
## (Intercept) G2
## -1.393 1.102
Here we are predicting the final grade from the second period grade. The intercept is -1.393 and the slope for G2 is 1.102.
##
## Call:
## lm(formula = G3 ~ G2, data = db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6284 -0.3326 0.2695 1.0653 3.5759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.39276 0.29694 -4.69 3.77e-06 ***
## G2 1.10211 0.02615 42.14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.953 on 393 degrees of freedom
## Multiple R-squared: 0.8188, Adjusted R-squared: 0.8183
## F-statistic: 1776 on 1 and 393 DF, p-value: < 2.2e-16
Findings: The p-value is less than 0.05, so the model is statistically significant. This should be checked before the model is used to predict the dependent variable (the final grade). The independent variable here is the second period grade.
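In a linear regression with a single predictor, Multiple R-squared equals the square of the correlation between the two variables. The sketch below uses this to check the three correlations reported above against the R-squared values in the three model summaries. The vector names are hypothetical.

```r
# Correlations reported in the three "Correlation" subsections.
r <- c(Fedu_Medu = 0.6234551, G2_G1 = 0.8521181, G3_G2 = 0.904868)

# Multiple R-squared values reported by summary(lm(...)) for each model.
r_squared_reported <- c(Fedu_Medu = 0.3887, G2_G1 = 0.7261, G3_G2 = 0.8188)

# With one predictor, R-squared is the squared correlation.
r_squared_from_r <- r^2

print(round(r_squared_from_r, 4))
print(r_squared_reported)
```

The two printed vectors agree to four decimal places, so the correlations and the model summaries are consistent with each other.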
AIC and BIC
Akaike’s information criterion (AIC) = 1653.6607401
Bayesian information criterion (BIC) = 1665.5973974
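The first model was chosen for prediction because it has the lowest AIC and BIC of the three. Note that AIC and BIC are meant for comparing models fitted to the same response variable, so this comparison across Fedu, G2 and G3 is a rough guide only. A lower value does not by itself show that a model is fit enough to predict well.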
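The reported AIC and BIC values can be related to each other directly. For a model with k estimated parameters fitted to n observations, AIC adds a penalty of 2k and BIC adds k log(n) to the same log-likelihood term, so BIC minus AIC equals k (log(n) - 2). The sketch below assumes k = 3 (intercept, slope and residual variance, which is how R counts parameters for an lm fit) and n = 395.

```r
# Number of observations and estimated parameters per model.
n <- 395
k <- 3  # intercept, slope, residual variance (assumed as counted by logLik)

# Expected difference between BIC and AIC for every model of this size.
expected_gap <- k * (log(n) - 2)

# AIC and BIC values reported for the three models above.
aic <- c(Fedu_Medu = 998.3315503, G2_G1 = 1661.0378071, G3_G2 = 1653.6607401)
bic <- c(Fedu_Medu = 1010.2682076, G2_G1 = 1672.9744644, G3_G2 = 1665.5973974)

gap <- bic - aic
print(round(gap, 4))
print(round(expected_gap, 4))
```

Because the three models have the same n and k, the gap is identical for all of them, and BIC ranks the models in exactly the same order as AIC.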
##
## Call:
## lm(formula = Fedu ~ Medu, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2662 -0.4408 -0.0493 0.7338 1.9507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8324 0.1266 6.577 2.01e-10 ***
## Medu 0.6085 0.0426 14.283 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8394 on 314 degrees of freedom
## Multiple R-squared: 0.3938, Adjusted R-squared: 0.3919
## F-statistic: 204 on 1 and 314 DF, p-value: < 2.2e-16
Akaike’s information criterion (AIC) = 790.105425.
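The training summary reports 314 residual degrees of freedom, which for a model with two coefficients implies 316 training rows, or 80% of the 395 observations. The report does not state how the split was made, so the sketch below is an assumption: it draws a random 80% of the row indices for training and keeps the rest for testing. The seed value is arbitrary.

```r
# Hypothetical seed; the report does not say which one was used, if any.
set.seed(100)

n_obs <- 395

# Randomly pick 80% of the row indices for training (assumed proportion,
# inferred from the 314 residual degrees of freedom).
train_idx <- sample(seq_len(n_obs), size = round(0.8 * n_obs))
test_idx  <- setdiff(seq_len(n_obs), train_idx)

# Residual degrees of freedom for a two-coefficient model on the training set.
resid_df <- length(train_idx) - 2

print(c(train = length(train_idx), test = length(test_idx), resid_df = resid_df))
```

With the dataset loaded as `db`, the two subsets would then be `db[train_idx, ]` and `db[test_idx, ]`.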
## actuals predicteds
## 8 4 3.266231
## 9 2 2.657760
## 10 4 2.657760
## 22 4 3.266231
## 24 2 2.049290
## 27 2 2.049290
Min Max accuracy and MAPE
Here we can see that the model predicts reasonably well: in the six rows shown, the predictions (predicteds) differ from the actual values (actuals) by between about 0.05 and 1.34 on the 0 to 4 education scale.
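Min-max accuracy and MAPE put a number on how close the predictions are. The sketch below computes both using only the six rows printed above, so the results illustrate the calculation and are not the values for the full test set. Min-max accuracy is taken here as the mean of min(actual, predicted) / max(actual, predicted), and MAPE as the mean of |predicted - actual| / actual; these are common definitions and are assumed rather than taken from the report.

```r
# The six rows of actual and predicted values printed above.
actuals    <- c(4, 2, 4, 4, 2, 2)
predicteds <- c(3.266231, 2.657760, 2.657760, 3.266231, 2.049290, 2.049290)

# Min-max accuracy: 1 means perfect agreement, lower values mean larger gaps.
min_max_accuracy <- mean(pmin(actuals, predicteds) / pmax(actuals, predicteds))

# Mean absolute percentage error: 0 means perfect agreement.
# Valid here because none of the six actual values is zero.
mape <- mean(abs(predicteds - actuals) / actuals)

print(round(c(min_max_accuracy = min_max_accuracy, mape = mape), 3))
```

MAPE is undefined when an actual value is zero, which can occur for Fedu in the full dataset, so a full-test-set calculation would need to handle those rows separately.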