Data mining using EDA in R


By: Samuel Sarius

Date: May 13, 2020

Update number 6: May 31, 2020

Disclaimer: This report was prepared for an introduction to data mining class assignment. Some of the analysis may contain errors, which will be corrected in later updates. The findings should not be generalised.


This report is an analysis of the student-mat.csv dataset. The students in the dataset are all taking a Math course.

1. What is the summary of the dataset?

##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

Findings: The dataset contains thirty-three variables. The students are aged 15 to 22, so most are of secondary school age and some are young adults.

2. What is the structure of the dataset?

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

Findings: The dataset contains only character (categorical) and integer variables. There is a total of three hundred and ninety-five observations.
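The two outputs above have the shape of R's summary() and str(). The following is a minimal sketch of how they could be produced, assuming the data sits in a data frame called db (the name used in the lm() calls later in this report). The toy rows and the read.csv() arguments are assumptions for illustration, not the code used for this report.

```r
# Toy stand-in for student-mat.csv (values are made up for illustration).
db <- data.frame(
  school = c("GP", "GP", "GP", "MS", "MS"),
  sex    = c("F", "M", "F", "M", "F"),
  age    = c(18L, 17L, 15L, 16L, 22L),
  G3     = c(6L, 6L, 10L, 15L, 11L),
  stringsAsFactors = FALSE
)

# With the real file this would be something like:
# db <- read.csv("student-mat.csv", stringsAsFactors = FALSE)
# (add sep = ";" if your copy of the file is semicolon-delimited)

summary(db)  # quartiles and mean for numbers, length/class/mode for text
str(db)      # dimensions, column types and the first few values

n_obs  <- nrow(db)  # number of observations (395 in the report)
n_vars <- ncol(db)  # number of variables (33 in the report)

# Count how many columns there are of each type.
col_types <- vapply(db, function(x) class(x)[1], character(1))
table(col_types)
```

Tabulating col_types is a quick way to confirm that only character and integer columns are present.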

3. How are the students distributed among the schools?

Findings: From the bar chart it can be seen that the dataset consists of students from two schools. Most of the students are from Gabriel Pereira (GP), with the rest from Mousinho da Silveira (MS). All of them are taking the same Math course.
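A bar chart like the one described here is usually backed by a simple frequency count. The sketch below uses a made-up school vector, so the counts are not the report's; only the GP and MS codes come from the dataset.

```r
# Toy school column (made-up counts, not the report's data).
school <- c("GP", "GP", "GP", "GP", "MS", "MS")

school_counts <- table(school)              # students per school
school_share  <- prop.table(school_counts)  # the same counts as proportions

pdf(NULL)  # headless device so the sketch also runs under Rscript
barplot(school_counts,
        main = "Students per school",
        xlab = "School", ylab = "Number of students")
invisible(dev.off())

# Name of the school with the most students.
majority <- names(school_counts)[which.max(school_counts)]
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped so the chart appears on screen.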

4. How many students travel long distances to get to school?

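Findings: From the column chart it can be seen that most of the students have short travel times and few travel long distances to get to school. The summary above agrees: the median traveltime is 1 and the third quartile is 2 on a scale of 1 to 4.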

5. At what age is alcohol consumed most among the students?

The mean age of the students who consume alcohol is 16.7. This is the same as the mean age of all students in the summary above; the lowest recorded value of Dalc and Walc is 1, so the data do not single out a group of non-drinkers.

Findings: From the boxplot it can be seen that there is one outlier, a student aged twenty-two. Including this value would pull the mean age upward, so it is left out. Most of the students who consume alcohol are therefore around seventeen years of age.
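The outlier can also be identified numerically with boxplot.stats(), which applies the same rule as boxplot(): a point is flagged when it lies more than 1.5 times the interquartile range beyond the hinges. The ages below are made up for illustration.

```r
# Toy ages (made up); one student is much older than the rest.
age <- c(15, 16, 16, 17, 17, 17, 18, 18, 19, 22)

stats    <- boxplot.stats(age)  # default coef = 1.5, as in boxplot()
outliers <- stats$out           # values flagged as outliers

mean_all     <- mean(age)                      # mean with the outlier
mean_trimmed <- mean(age[!age %in% outliers])  # mean without it
median_all   <- median(age)                    # robust to one extreme value
```

With quartiles of 16 and 18, as in the summary above, the upper fence is 18 + 1.5 * 2 = 21, which is why an age of 22 is flagged.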

6. Which gender among the students is the most dominant in the consumption of alcohol?

Findings: From the pie chart it can be seen that the female students are engaging more in the consumption of alcohol than their male counterparts.

7. Which guardian within the guardian group is the most dominant?

Findings: From the column chart it can be seen that most of the guardians of the students are mothers.

8. How much spare time do students have?

Findings: From the density plot it can be seen that most of the students rate their spare time at about three. Note that freetime is recorded on a scale of 1 to 5 (see the summary above), not in hours. This may be the time when most of them consume alcohol, although the plot itself does not show this. The data is evenly spread and roughly symmetrical around three (median 3, mean 3.235).

9. As you grow older does your study time increase?

Findings: From the scatter plot it can be seen that students’ study time does not increase as they grow older.
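A scatter plot of two coarse integer variables can be hard to read, so a rank correlation is a useful numerical check. This is a sketch on made-up values, assuming studytime is treated as an ordinal 1 to 4 scale; a correlation and slope near zero on the real data would support the finding above.

```r
# Toy data (made up): study time is on the dataset's 1-4 scale.
age       <- c(15, 15, 16, 16, 17, 17, 18, 18, 19, 20)
studytime <- c(2, 3, 1, 2, 2, 4, 1, 2, 3, 2)

# Spearman's rank correlation suits an ordinal scale like studytime.
rho <- cor(age, studytime, method = "spearman")

# The fitted slope expresses the same relationship in regression terms.
slope <- coef(lm(studytime ~ age))[["age"]]
```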

10. Is there any relationship between the parents’ education?

## Medudb
##   0   1   2   3   4 
##   3  59 103  99 131
## Fedudb
##   0   1   2   3   4 
##   2  82 115 100  96

The above information is interpreted as follows:

Mother’s education

Level    None   4th Grade   5th to 9th Grade   Secondary   Higher
Count       3          59                103          99      131

Father’s education

Level    None   4th Grade   5th to 9th Grade   Secondary   Higher
Count       2          82                115         100       96

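Findings: From the above tables it can be seen that mothers’ and fathers’ education levels are distributed similarly, which suggests a relationship between them; the correlation below tests this directly. Most parents have some form of education: only 3 mothers and 2 fathers have none.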
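Two separate frequency tables show how each parent's education is distributed, but a cross-tabulation shows how the two go together. The sketch below uses only the first ten Medu and Fedu values printed in the structure output above, so its counts are illustrative; the level labels follow the interpretation given in this section.

```r
# First ten values of Medu and Fedu as shown in the str() output.
Medu <- c(4, 1, 1, 4, 3, 4, 2, 4, 3, 3)
Fedu <- c(4, 1, 1, 2, 3, 3, 2, 4, 2, 4)

edu_levels <- c("None", "4th Grade", "5th to 9th Grade", "Secondary", "Higher")
Medu_f <- factor(Medu, levels = 0:4, labels = edu_levels)
Fedu_f <- factor(Fedu, levels = 0:4, labels = edu_levels)

table(Medu_f)  # one-way counts, like the Medudb table
table(Fedu_f)  # one-way counts, like the Fedudb table

# Joint counts: rows are mother's level, columns are father's level.
cross <- table(Mother = Medu_f, Father = Fedu_f)

# Share of students whose parents have the same education level.
same_level <- sum(diag(cross)) / sum(cross)
```

Large counts on or near the diagonal of cross would indicate that parents tend to have similar education levels.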

Model 1

Correlation

The correlation between mother’s and father’s education is 0.6234551.

Findings: There is a relationship between father’s and mother’s education. The correlation is positive and moderate in strength.

Building a linear model

## 
## Call:
## lm(formula = Fedu ~ Medu, data = db)
## 
## Coefficients:
## (Intercept)         Medu  
##      0.8176       0.6197

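Here we are predicting father’s education (Fedu) from mother’s education (Medu). The intercept is 0.8176 and the slope is 0.6197, so the fitted line is Fedu = 0.8176 + 0.6197 × Medu. Each additional level of mother’s education is associated with about 0.62 of a level more of father’s education.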
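The two coefficients in the output above are the intercept and slope of a straight line, so predictions can be worked out by hand. The sketch below plugs in the reported values; the function name predict_fedu is made up for illustration.

```r
# Coefficients as reported in the lm() output above.
intercept <- 0.8176
slope     <- 0.6197

# Fitted line: predicted Fedu = intercept + slope * Medu
predict_fedu <- function(medu) intercept + slope * medu

predict_fedu(0:4)  # predicted father's education for each mother's level

# In a one-predictor linear model, R-squared is the squared correlation.
r  <- 0.6234551  # correlation reported above
r2 <- r^2        # about 0.3887, the Multiple R-squared of this model
```

For a mother with higher education (Medu = 4), the line gives 0.8176 + 0.6197 * 4 = 3.2964.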

## 
## Call:
## lm(formula = Fedu ~ Medu, data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2966 -0.4374 -0.0571  0.7034  2.5626 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8176     0.1160   7.049 8.15e-12 ***
## Medu          0.6197     0.0392  15.808  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8519 on 393 degrees of freedom
## Multiple R-squared:  0.3887, Adjusted R-squared:  0.3871 
## F-statistic: 249.9 on 1 and 393 DF,  p-value: < 2.2e-16

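Findings: The p value is less than 0.05, so the model is statistically significant. This is worth checking before we use it to predict the dependent variable (father’s education). The independent variable here is mother’s education. The R-squared of 0.3887 means the model explains about 39% of the variation in father’s education.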

AIC and BIC

Akaike’s information criterion (AIC) = 998.3315503
Bayesian information criterion (BIC) = 1010.2682076
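Both criteria come from the model's log-likelihood: AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, where k is the number of estimated parameters and n the number of observations. The sketch below checks this against R's AIC() and BIC() on a small model fitted to the first ten Medu and Fedu values from the structure output, so the numbers it prints are not the report's.

```r
# First ten Medu/Fedu values from the str() output; the report's own
# call is lm(Fedu ~ Medu, data = db) on all 395 rows.
toy <- data.frame(
  Medu = c(4, 1, 1, 4, 3, 4, 2, 4, 3, 3),
  Fedu = c(4, 1, 1, 2, 3, 3, 2, 4, 2, 4)
)
fit <- lm(Fedu ~ Medu, data = toy)

ll <- logLik(fit)
k  <- attr(ll, "df")  # intercept, slope and residual variance
n  <- nobs(fit)

aic_manual <- 2 * k - 2 * as.numeric(ll)
bic_manual <- log(n) * k - 2 * as.numeric(ll)

c(AIC = AIC(fit), manual = aic_manual)
c(BIC = BIC(fit), manual = bic_manual)
```

For the report's model, with n = 395 and k = 3, the formulas imply BIC - AIC = 3 * log(395) - 6, which is about 11.94 and matches the gap between the two values reported above.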

Model 2

Correlation

The correlation between the first period grade and the second period grade is 0.8521181.

Findings: There is a relationship between the first period grade and the second period grade. The correlation is positive and strong.

Building a linear model

## 
## Call:
## lm(formula = G2 ~ G1, data = db)
## 
## Coefficients:
## (Intercept)           G1  
##      0.1796       0.9657

Here we are predicting the second period grade (G2) from the first period grade (G1). The intercept is 0.1796 and the slope is 0.9657, so the fitted line is G2 = 0.1796 + 0.9657 × G1.

## 
## Call:
## lm(formula = G2 ~ G1, data = db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7676  -0.8363   0.1637   1.1637   4.1981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17957    0.34110   0.526    0.599    
## G1           0.96567    0.02992  32.278   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.971 on 393 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7254 
## F-statistic:  1042 on 1 and 393 DF,  p-value: < 2.2e-16

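Findings: The p value is less than 0.05, so the model is statistically significant. This is worth checking before we use it to predict the dependent variable (the second period grade). The independent variable here is the first period grade. The intercept on its own is not significantly different from zero (p = 0.599). The R-squared of 0.7261 means the model explains about 73% of the variation in the second period grade.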

AIC and BIC

Akaike’s information criterion (AIC) = 1661.0378071
Bayesian information criterion (BIC) = 1672.9744644

Model 3

Correlation

The correlation between second period grade and final grade is 0.904868.

Findings: There is a relationship between the second period grade and the final grade. The correlation is positive and strong.

Building a linear model

## 
## Call:
## lm(formula = G3 ~ G2, data = db)
## 
## Coefficients:
## (Intercept)           G2  
##      -1.393        1.102

Here we are predicting the final grade (G3) from the second period grade (G2). The intercept is -1.393 and the slope is 1.102, so the fitted line is G3 = -1.393 + 1.102 × G2.

## 
## Call:
## lm(formula = G3 ~ G2, data = db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6284 -0.3326  0.2695  1.0653  3.5759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.39276    0.29694   -4.69 3.77e-06 ***
## G2           1.10211    0.02615   42.14  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.953 on 393 degrees of freedom
## Multiple R-squared:  0.8188, Adjusted R-squared:  0.8183 
## F-statistic:  1776 on 1 and 393 DF,  p-value: < 2.2e-16

Findings: The p value is less than 0.05, so the model is statistically significant. This is worth checking before we use it to predict the dependent variable (the final grade). The independent variable here is the second period grade. The R-squared of 0.8188 means the model explains about 82% of the variation in the final grade.

AIC and BIC

Akaike’s information criterion (AIC) = 1653.6607401
Bayesian information criterion (BIC) = 1665.5973974

Prediction using a linear model

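The first model was chosen to do a prediction since it has the lowest AIC and BIC of the three. This comparison should be treated with caution: AIC and BIC are only comparable between models fitted to the same response variable on the same data, and the three models here predict different variables (Fedu, G2 and G3). A lower AIC indicates a better trade-off between fit and complexity among comparable models; it does not by itself show that a model is fit enough to do a prediction. The model below is refitted on a training subset of 316 observations (80% of the 395).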

## 
## Call:
## lm(formula = Fedu ~ Medu, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2662 -0.4408 -0.0493  0.7338  1.9507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8324     0.1266   6.577 2.01e-10 ***
## Medu          0.6085     0.0426  14.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8394 on 314 degrees of freedom
## Multiple R-squared:  0.3938, Adjusted R-squared:  0.3919 
## F-statistic:   204 on 1 and 314 DF,  p-value: < 2.2e-16

Akaike’s information criterion (AIC) = 790.105425.

##    actuals predicteds
## 8        4   3.266231
## 9        2   2.657760
## 10       4   2.657760
## 22       4   3.266231
## 24       2   2.049290
## 27       2   2.049290
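A table of actuals against predicteds like the one above typically comes from fitting the model on a training subset and predicting the held-out rows. The following is a sketch under stated assumptions: the 80/20 split is inferred from the 314 residual degrees of freedom, while the seed, the toy data and the name testData are made up for illustration (trainingData, actuals and predicteds are the names that appear in the output).

```r
set.seed(100)  # assumed seed, only so the sketch is repeatable

# Toy data (made up) with the same two columns as the report's model.
db_toy <- data.frame(Medu = sample(0:4, 50, replace = TRUE))
db_toy$Fedu <- pmin(4, pmax(0, db_toy$Medu + sample(-1:1, 50, replace = TRUE)))

# 80% of the rows for training, the rest held out for testing.
train_idx    <- sample(seq_len(nrow(db_toy)), size = floor(0.8 * nrow(db_toy)))
trainingData <- db_toy[train_idx, ]
testData     <- db_toy[-train_idx, ]

# Fit on the training rows only, then predict the unseen test rows.
fit  <- lm(Fedu ~ Medu, data = trainingData)
pred <- predict(fit, newdata = testData)

actuals_preds <- data.frame(actuals = testData$Fedu, predicteds = pred)
head(actuals_preds)
```

Keeping the test rows out of the fit is what makes the comparison a fair check of how the model handles data it has not seen.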

Min Max accuracy and MAPE
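Both measures compare each prediction with its actual value. Min-max accuracy is the mean of min(actual, predicted) / max(actual, predicted), where 1 is a perfect score; MAPE is the mean absolute percentage error, where 0 is perfect. The sketch below shows the common definitions on made-up numbers; the function names are chosen for illustration and the results are not the report's figures.

```r
# Min-max accuracy: mean of min(actual, predicted) / max(actual, predicted).
min_max_accuracy <- function(actuals, predicteds) {
  mean(pmin(actuals, predicteds) / pmax(actuals, predicteds))
}

# MAPE: mean absolute error relative to the actual value.
# Undefined when an actual is 0 (possible here, since Fedu can be 0).
mape <- function(actuals, predicteds) {
  mean(abs(predicteds - actuals) / actuals)
}

# Toy values (made up) to show the arithmetic.
a <- c(2, 4)
p <- c(1, 5)
min_max_accuracy(a, p)  # (1/2 + 4/5) / 2
mape(a, p)              # (1/2 + 1/4) / 2
```

Because father's education can be recorded as 0, rows with an actual value of 0 would need to be handled separately before MAPE is computed on this data.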

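Here we can see that the model did the prediction reasonably well: most of its results (predicteds) are within one education level of the actual data (actuals) from the dataset. The exception among the rows shown is row 10, where the actual value is 4 and the prediction is 2.66. With an R-squared of about 0.39, the predictions should be read as rough estimates.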