Data Analysis and Visualization with R

Dataset Description

 student.mat <- read.csv("~/Dokumente/unistuff/R_Course/Final_Paper/student-mat.csv", sep=";")

I obtained the data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Student+Performance). It was donated by Paulo Cortez and contains information about students and their performance in a math course at two secondary schools in Portugal.

The data set page does not clearly state whether there are missing values (to be exact, the field that should indicate this is itself marked as missing), so I checked for them:

 #First, I build a small function that checks for missing values within a vector (column).
 ### TASK 10
na.test <- function (x) {
  #is.na() already returns a logical vector, so no comparison with TRUE is needed.
  output <- any(is.na(x))
  return(output)
}

#Second, I apply the function to every column. sapply() works column by column on a data frame.
sapply(student.mat, na.test)

##     school        sex        age    address    famsize    Pstatus 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
##       Medu       Fedu       Mjob       Fjob     reason   guardian 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
## traveltime  studytime   failures  schoolsup     famsup       paid 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
## activities    nursery     higher   internet   romantic     famrel 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
##   freetime      goout       Dalc       Walc     health   absences 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
##         G1         G2         G3 
##      FALSE      FALSE      FALSE

The results showed that there are no missing values.
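A more compact check is possible with vectorised base R functions. The following is only a sketch on a small made-up data frame (`toy` is a hypothetical example, not the student data): `colSums(is.na(...))` counts the missing values per column, and `any(is.na(...))` gives a single yes/no answer for the whole data frame.

```r
# Toy data frame for illustration only (not the student data)
toy <- data.frame(age = c(15, NA, 17),
                  sex = c("F", "M", NA),
                  G3  = c(10, 12, 14))

# Number of missing values in each column
na.count <- colSums(is.na(toy))
print(na.count)

# TRUE if there is at least one missing value anywhere in the data frame
has.na <- any(is.na(toy))
print(has.na)
```

Counting instead of only testing has the advantage that one immediately sees how serious the problem is in each column.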

The data set contains 33 attributes and 395 instances (the str() output further down confirms this). The names of the columns are:

names(student.mat)

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"

Since this is an R course, I won’t discuss every attribute in detail. The description on the UCI page is sufficient, so I copied it:

Attributes for student-mat.csv (Math course):

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

Recoding binary values

Since I prefer the binary values 0 and 1 for dichotomous variables, my idea was to recode all columns containing “yes” and “no” into the respective values.

Many data frames store strings as factors, which complicates the recoding. This is why I built a function that converts all factor columns of a data frame into character columns.

### TASK 10
cha.fac <- function (x) {
  ###TASK 11
  ###TASK 1
   for(j in seq_len(ncol(x))) {
     #Checks whether the column is a factor.
     if (is.factor(x[,j])) {
       #If the column is a factor, it is converted into a character column.
       ###TASK 1
       x[,j] <- as.character(x[,j])
     }
   }
  return(x)
}

Now that I have built this function, I can use it to transform the data frame.

str(student.mat)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

student.mat <- cha.fac(student.mat)
str(student.mat)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
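The same conversion can also be written without an explicit loop. The sketch below uses a made-up data frame (`toy` is hypothetical) and `lapply()`; assigning to `toy[]` keeps the object a data frame. Alternatively, strings can be kept as characters right at import with the `stringsAsFactors = FALSE` argument of `read.csv()`.

```r
# Toy data frame with one factor column (illustration only)
toy <- data.frame(paid = factor(c("yes", "no", "yes")),
                  age  = c(15L, 16L, 17L))

# Convert every factor column to character and leave all other columns untouched.
# Assigning to toy[] replaces the columns but keeps the data frame structure.
toy[] <- lapply(toy, function(col) {
  if (is.factor(col)) as.character(col) else col
})

str(toy)
```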

The converted student.mat is easier to work with. Now I build the essential function.

###TASK 10
recode.bin <- function (x) {
  ###TASK 11
  #j runs over the columns. The rows are handled in one step by a vectorised comparison, so no inner loop is needed.
  for(j in seq_len(ncol(x))) {
    #Only character columns that hold nothing but "yes" and "no" are recoded.
    if (is.character(x[,j]) && all(x[,j] %in% c("yes", "no"))) {
      ###TASK 1
      #"yes" becomes 1 and "no" becomes 0. as.integer() makes the column numeric instead of leaving "1" and "0" as text.
      x[,j] <- as.integer(x[,j] == "yes")
    }
  }
  return(x)
}

Now I can recode the data frame.

student.mat <- recode.bin(student.mat)
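As a cross-check of the recoding idea, here is a loop-free sketch on a made-up data frame (`toy` and `is.yes.no` are hypothetical names). It first identifies the columns that contain only "yes" and "no" and then turns the logical comparison `col == "yes"` into the integers 1 and 0.

```r
# Toy data frame: one yes/no column and one other character column
toy <- data.frame(paid = c("yes", "no", "yes"),
                  Mjob = c("teacher", "other", "health"),
                  stringsAsFactors = FALSE)

# A column counts as binary if it holds nothing but "yes" and "no"
is.yes.no <- sapply(toy, function(col) {
  is.character(col) && all(col %in% c("yes", "no"))
})

# The comparison gives TRUE/FALSE, as.integer() turns that into 1/0
toy[is.yes.no] <- lapply(toy[is.yes.no], function(col) as.integer(col == "yes"))

str(toy)
```

Columns such as Mjob, which hold other strings, are left unchanged.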

Questions

Investigation of the performance in G3.
What is the impact of age and sex on performance?
What is the relationship between failures and performance with reference to age?
Impact of the parents’ jobs on their child’s performance.
Relationship between Goout and Performance.

Analysis

1. Investigation of the performance in G3

G3 is the most important column in this data frame: it holds the final grade in the math course. Some of its descriptive statistics are shown below:

###TASK 2
mean(student.mat$G3)

## [1] 10.41519

sd(student.mat$G3)

## [1] 4.581443

median(student.mat$G3)

## [1] 11

table(student.mat$G3)

## 
##  0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 38  1  7 15  9 32 28 56 47 31 31 27 33 16  6 12  5  1

###TASK 7
hist(student.mat$G3, main = "Histogram of Performance in G3", xlab = "Performance in G3", col = "darkgrey")
#I used abline for the vertical lines. 
abline(v=mean(student.mat$G3), col = "red")
abline(v=median(student.mat$G3), col = "blue")
#This is just some labeling. 
text(9, 15, "mean", col = "red")
text(9, 9, round(mean(student.mat$G3), 2) , col = "red")
text(13, 15, "median", col = "blue")
text(13, 9, round(median(student.mat$G3), 2) , col = "blue")
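Instead of placing the labels by hand with `text()`, a legend can carry the same information. The following is a minimal sketch on simulated grades (the vector `G3` here is made up and only mimics a 0 to 20 grading scale); `legend()` places the labels in a corner of the plot, so no coordinates have to be guessed.

```r
# Simulated grades on a 0-20 scale (illustration only, not the student data)
set.seed(6)
G3 <- round(pmin(pmax(rnorm(100, mean = 10, sd = 4), 0), 20))

# Histogram with vertical lines for mean and median
h <- hist(G3, main = "Histogram of Performance in G3",
          xlab = "Performance in G3", col = "darkgrey")
abline(v = mean(G3), col = "red")
abline(v = median(G3), col = "blue")

# The legend replaces the manually positioned text labels
legend("topright",
       legend = c(paste("mean =", round(mean(G3), 2)),
                  paste("median =", median(G3))),
       col = c("red", "blue"), lty = 1)
```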

2. What is the impact of age and sex on performance?

In this section I will investigate the relationship between the three attributes “age”, “sex” and “G3”. First of all, I checked whether there is a difference in performance between boys and girls.

###TASK 3
gender.dif <- t.test(student.mat$G3~student.mat$sex)
#I use the apa() function to show the result (assumed here to be apa() from the yarrr package, which has to be loaded first).
apa(gender.dif)

## [1] "mean difference = 0.95, t(390.57) = -2.07, p = 0.04 (2-tailed)"

The mean G3 values of the two sexes differ significantly (p = 0.04).

Now, I go a step further and also take age into consideration.
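The numbers in such a summary line can also be read directly from the object that `t.test()` returns. The sketch below uses simulated data (`toy` is hypothetical) and shows where the group means, the mean difference, the degrees of freedom and the p-value are stored. By default `t.test()` runs Welch's test, which is why the degrees of freedom are usually not whole numbers.

```r
# Simulated data: two groups of 20 students each (illustration only)
set.seed(1)
toy <- data.frame(sex = rep(c("F", "M"), each = 20),
                  G3  = c(rnorm(20, mean = 10, sd = 4),
                          rnorm(20, mean = 11, sd = 4)))

res <- t.test(G3 ~ sex, data = toy)       # Welch test by default

group.means <- res$estimate               # mean in group F, mean in group M
mean.diff   <- unname(diff(group.means))  # mean of M minus mean of F

print(group.means)
print(mean.diff)
print(res$parameter)   # degrees of freedom
print(res$p.value)     # two-sided p-value
```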

###TASK 
summary(with(student.mat, aov(G3 ~ sex + age + sex*age)))

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## sex           1     89   88.51   4.385 0.03690 * 
## age           1    208  208.24  10.317 0.00143 **
## sex:age       1     81   80.74   4.000 0.04619 * 
## Residuals   391   7892   20.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results reveal significant main effects of sex and age on G3 as well as a significant interaction between sex and age. I want to investigate this interaction further.

#I used the function interaction.plot to create the plot.
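A note on the model formula: in R, `sex * age` already expands to both main effects plus the interaction, so `sex + age + sex*age` and `sex * age` describe the same model. The sketch below checks this on simulated data (`toy` is made up). Note that `aov()` uses sequential sums of squares, so with unbalanced data the order of the terms can affect the main-effect rows of the table.

```r
# Simulated data (illustration only)
set.seed(2)
toy <- data.frame(sex = factor(rep(c("F", "M"), times = 30)),
                  age = sample(15:19, 60, replace = TRUE))
toy$G3 <- 20 - 0.5 * toy$age + rnorm(60)

# Long and short way of writing the same model
fit.long  <- aov(G3 ~ sex + age + sex * age, data = toy)
fit.short <- aov(G3 ~ sex * age, data = toy)

summary(fit.short)

# Both fits have identical coefficients
print(all.equal(coef(fit.long), coef(fit.short)))
```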
with(student.mat, interaction.plot(sex, age, G3))

As you can see, the lines for the different ages lie at different levels of performance, which reflects the main effect of age. Additionally, the crossing or non-parallel lines indicate interactions. Now I plot age on the x-axis.

with(student.mat, interaction.plot(age, sex, G3))

Even though the interaction and the main effect of sex are significant, they are hard to see in this plot. This is why I try to improve the plot. First of all, I checked the distribution of age.

table(student.mat$age)

## 
##  15  16  17  18  19  20  21  22 
##  82 104  98  82  24   3   1   1

As you can see, there are very few students aged 20 or older. This is why I treat them as outliers and remove them from the data frame.

student.mat.2 <- subset(student.mat, age < 20)
table(student.mat.2$age)

## 
##  15  16  17  18  19 
##  82 104  98  82  24

Now, I redo the ANOVA.

summary(with(student.mat.2, aov(G3 ~ sex + age + sex*age)))

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## sex           1     94   93.56   4.665 0.031406 *  
## age           1    240  240.21  11.975 0.000599 ***
## sex:age       1     95   95.50   4.761 0.029714 *  
## Residuals   386   7743   20.06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing the outliers, the p-values became smaller for every term, so all three effects are now more clearly significant. Now, I redo the interaction plots, too.

with(student.mat.2, interaction.plot(sex, age, G3))

with(student.mat.2, interaction.plot(age, sex, G3))

Especially the last plot looks better. However, the interaction seems more difficult to understand. I was wondering why the performance of boys decreases gradually (almost linearly) as they grow older, while the performance of girls stays more constant across ages. The mean values reinforce this impression.

###TASK 9
student.mat.2 %>%
  group_by(age, sex) %>%
  summarise(
    a.mean = mean(G3)
  )

## Source: local data frame [10 x 3]
## Groups: age [?]
## 
##      age   sex    a.mean
##    (int) (chr)     (dbl)
## 1     15     F  9.552632
## 2     15     M 12.727273
## 3     16     F 10.537037
## 4     16     M 11.560000
## 5     17     F 10.482759
## 6     17     M  9.975000
## 7     18     F  9.325581
## 8     18     M  9.794872
## 9     19     F  8.357143
## 10    19     M  8.000000
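The same table of group means can be produced in base R without additional packages. The sketch below uses a tiny made-up data frame (`toy` is hypothetical) and `aggregate()` with a formula that lists both grouping variables on the right-hand side.

```r
# Tiny made-up data set (illustration only)
toy <- data.frame(age = c(15, 15, 15, 16, 16, 16),
                  sex = c("F", "F", "M", "F", "M", "M"),
                  G3  = c(10, 12, 14, 8, 9, 11))

# Mean of G3 for every combination of age and sex
group.means <- aggregate(G3 ~ age + sex, data = toy, FUN = mean)
print(group.means)
```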

Consequently, I checked the correlation between age and the performance for two subsets holding boys and girls separately.

###TASK 4
cor.test1 <- with(subset(student.mat.2, sex == "M"),
     cor.test(G3, age))
cor.test1

## 
##  Pearson's product-moment correlation
## 
## data:  G3 and age
## t = -4.13, df = 181, p-value = 5.535e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4206139 -0.1550039
## sample estimates:
##        cor 
## -0.2934623

apa(cor.test1)

## [1] "r = -0.29, t(181) = -4.13, p < 0.01 (2-tailed)"

###TASK 4
cor.test2 <- with(subset(student.mat.2, sex == "F"),
     (cor.test(G3, age)))
cor.test2

## 
##  Pearson's product-moment correlation
## 
## data:  G3 and age
## t = -0.93582, df = 205, p-value = 0.3505
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1998139  0.0717874
## sample estimates:
##         cor 
## -0.06522112

apa(cor.test2)

## [1] "r = -0.07, t(205) = -0.94, p = 0.35 (2-tailed)"

As expected, there is a significant negative correlation between age and performance for boys (r = -0.29) and no significant correlation for girls (r = -0.07). Therefore, I will focus on the boys. I fit a linear regression of performance on age in order to get further information.

###TASK 5
with(subset(student.mat.2, sex == "M"),
     summary(lm(G3~ age)))

## 
## Call:
## lm(formula = G3 ~ age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6200  -1.5383   0.4617   2.6252   8.4617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.8464     4.3514   6.629 3.75e-10 ***
## age          -1.0818     0.2619  -4.130 5.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.311 on 181 degrees of freedom
## Multiple R-squared:  0.08612,    Adjusted R-squared:  0.08107 
## F-statistic: 17.06 on 1 and 181 DF,  p-value: 5.535e-05

The p-value of the slope is, of course, the same as in the correlation analysis, because both test the same linear relationship. However, with the linear regression we can predict values and show the trend with a regression line: each additional year of age is associated with a G3 score that is about 1.08 points lower. Last but not least, I show the results in a scatter plot:

# First of all, I create an empty plot.
###TASK 6
plot(1,
     xlim = c(15, 19),
     ylim = c(0, 20),
     type = "n",
     main = "Relationship between age and performance",
     xlab = "Age",
     ylab = "Performance in G3"
     )

#Now, I fill in the points.
with(subset(student.mat.2, sex == "M"), 
     points(age, 
            G3,
            pch = 25, 
            col = alpha("blue", 0.2)
            ))

#Finally, I draw the regression line.
with(subset(student.mat.2, sex == "M"),
     abline(lm(G3 ~ age), col = "blue"))
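The link between the correlation test and the simple regression used in this section can be verified directly. The sketch below uses simulated values (`age` and `G3` are made up here): the t statistic of the correlation test equals the t statistic of the regression slope, and the squared correlation equals the R-squared of the regression.

```r
# Simulated data (illustration only)
set.seed(3)
age <- sample(15:19, 50, replace = TRUE)
G3  <- 25 - 0.8 * age + rnorm(50, sd = 3)

ct  <- cor.test(G3, age)
fit <- summary(lm(G3 ~ age))

# t statistic from the correlation test and from the regression slope
t.cor   <- unname(ct$statistic)
t.slope <- fit$coefficients["age", "t value"]
print(c(t.cor, t.slope))

# Squared correlation and R-squared of the regression
r.squared.cor <- unname(ct$estimate)^2
print(c(r.squared.cor, fit$r.squared))
```

In the output above this can be seen as well: (-0.2934623)^2 is approximately 0.08612, the reported Multiple R-squared.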

3. What is the relationship between failures and performance with reference to age?

While removing the students aged 20 or older, I noticed that these students have bad grades. So first, I checked the correlation between failures and age.

cor.test3 <- with(student.mat, cor.test(age, failures))
apa(cor.test3)

## [1] "r = 0.24, t(393) = 4.98, p < 0.01 (2-tailed)"

The results reveal a significant, though modest, positive correlation between failures and age (r = 0.24). This may explain why there are 22-year-olds in a school class. Furthermore, I explored the relationship between age, failures and performance.

with(student.mat, summary(aov(G3 ~ age + failures + age*failures)))

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## age            1    216   215.9  11.976 0.000598 ***
## failures       1    906   906.2  50.261 6.33e-12 ***
## age:failures   1     98    98.4   5.458 0.019986 *  
## Residuals    391   7049    18.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

with(student.mat, summary(lm(G3 ~ age + failures + age*failures)))

## 
## Call:
## lm(formula = G3 ~ age + failures + age * failures)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9865  -1.8901   0.0777   3.0135   9.2300 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.9687     3.3765   5.914 7.29e-09 ***
## age           -0.5321     0.2034  -2.616  0.00925 ** 
## failures      -8.8031     2.8834  -3.053  0.00242 ** 
## age:failures   0.3912     0.1675   2.336  0.01999 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.246 on 391 degrees of freedom
## Multiple R-squared:  0.1476, Adjusted R-squared:  0.141 
## F-statistic: 22.56 on 3 and 391 DF,  p-value: 1.707e-13

The results show that all terms are significant. Both higher age and a higher number of failures are associated with lower performance. The positive interaction term means that the negative effect of age becomes weaker with each additional failure. Finally, I plot the results:

###TASK 6
plot(1,
     xlim = c(15, 22),
     ylim = c(0, 20),
     type = "n",
     main = "Relationship between age and performance",
     xlab = "Age",
     ylab = "Performance in G3"
     )

#People with no failures.
with(subset(student.mat, failures == 0), 
     points(age, 
            G3,
            pch = 21, 
            col = alpha("blue", 0.1),
            bg =alpha("blue", 0.1)
            ))

#People with at least one failure.
with(subset(student.mat, failures > 0), 
     points(age, 
            G3,
            pch = 21, 
            col = alpha("red", 0.1),
            bg =alpha("red", 0.1)
            ))

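#abline() can only draw a line from an intercept and one slope. For this model it uses the intercept and the age coefficient, i.e. the fitted line for students with failures == 0 (see the warning below).
with(student.mat, abline(lm(G3 ~ age + failures + age*failures)))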

## Warning in abline(lm(G3 ~ age + failures + age * failures)): only using the
## first two of 4 regression coefficients
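The warning above arises because `abline()` can only use one intercept and one slope. One possible way to show the interaction is to draw a separate fitted line for each number of failures with `predict()`. The following is only a sketch on simulated data (`toy`, `pred0` and `pred2` are hypothetical names), not the plot of the original analysis.

```r
# Simulated data (illustration only)
set.seed(4)
toy <- data.frame(age      = sample(15:22, 80, replace = TRUE),
                  failures = sample(0:3, 80, replace = TRUE))
toy$G3 <- 20 - 0.5 * toy$age - 2 * toy$failures + rnorm(80)

# Model with both main effects and the interaction
fit <- lm(G3 ~ age * failures, data = toy)

# Predicted G3 over the age range for students with 0 and with 2 failures
ages  <- 15:22
pred0 <- predict(fit, data.frame(age = ages, failures = 0))
pred2 <- predict(fit, data.frame(age = ages, failures = 2))

plot(toy$age, toy$G3, xlab = "Age", ylab = "Performance in G3")
lines(ages, pred0, col = "blue")   # fitted line for failures == 0
lines(ages, pred2, col = "red")    # fitted line for failures == 2
```

The blue line is exactly the line that `abline()` would draw from the first two coefficients.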

4. Impact of the parents’ jobs on their child’s performance.

In the following, I want to investigate the impact of the parents’ jobs on the performance of their child. This is why I conduct an analysis of variance:

summary(with(student.mat, aov(G3 ~ Mjob + Fjob + Mjob*Fjob)))

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## Mjob          4    307   76.66   3.744 0.00532 **
## Fjob          4     67   16.68   0.815 0.51648   
## Mjob:Fjob    15    299   19.95   0.974 0.48215   
## Residuals   371   7597   20.48                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the mother’s job is significantly related to the child’s performance, whereas the father’s job and the interaction are not. To show the size of the effect of the mother’s job, I created a box plot.

#I sort the factor levels according to their impact, which I checked beforehand.
fac.mjob <- with(student.mat, factor(Mjob, levels = c("at_home", "other", "teacher", "services", "health")))
###TASK 8
with(student.mat, boxplot(G3 ~ fac.mjob))

The ANOVA above revealed that there is a difference in mean performance between children whose mothers have different jobs. To confirm this, I ran a one-way ANOVA with the mother’s job as the only factor.

summary(with(student.mat, aov(G3 ~ fac.mjob)))

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## fac.mjob      4    307   76.66   3.754 0.00519 **
## Residuals   390   7963   20.42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since an ANOVA does not tell you which job exactly differs significantly from the others, I test for differences with the help of contrasts.

#I use an ANOVA for the model.
model <- aov(G3 ~ fac.mjob, data = student.mat)

#Contrast 1: the performance of the child is lower when the mother is at home than the average of all other groups (glht() and mcp() are from the multcomp package).
contrast <- rbind("(a1)-(a2+a3+a4+a5)* 1/4"=c(1,-1/4,-1/4,-1/4,-1/4))
summary(glht(model,linfct=mcp(fac.mjob=contrast), alternative="less"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = G3 ~ fac.mjob, data = student.mat)
## 
## Linear Hypotheses:
##                              Estimate Std. Error t value  Pr(<t)   
## (a1)-(a2+a3+a4+a5)* 1/4 >= 0  -1.8577     0.6535  -2.843 0.00235 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

#Contrast 2: the average of all other groups is lower than the performance of the child when the mother works in the health sector.
contrast <- rbind("(a1+a2+a3+a4) * 1/4 - a5"=c(1/4,1/4,1/4,1/4,-1))
summary(glht(model,linfct=mcp(fac.mjob=contrast), alternative="less"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = G3 ~ fac.mjob, data = student.mat)
## 
## Linear Hypotheses:
##                               Estimate Std. Error t value Pr(<t)  
## (a1+a2+a3+a4) * 1/4 - a5 >= 0  -1.8855     0.8159  -2.311 0.0107 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

#Contrast 3: the average of all other groups is not significantly lower than the performance of the child when the mother is a teacher.
contrast <- rbind("(a1+a2+a4+a5)* 1/4 - a3"=c(1/4,1/4,-1,1/4,1/4))
summary(glht(model,linfct=mcp(fac.mjob=contrast), alternative="less"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = G3 ~ fac.mjob, data = student.mat)
## 
## Linear Hypotheses:
##                              Estimate Std. Error t value Pr(<t)
## (a1+a2+a4+a5)* 1/4 - a3 >= 0  -0.5163     0.6578  -0.785  0.216
## (Adjusted p values reported -- single-step method)

As you can see, performance differs significantly from the average of the other groups when the mother works in the health sector (higher) or is at home (lower). When the mother works as a teacher, there is no significant difference.
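To make the contrast estimates less of a black box, the calculation can be reproduced by hand in base R. The sketch below uses a small made-up data set with three job groups (`toy` and `weights` are hypothetical): the estimate is the weighted sum of the group means, and its standard error is based on the residual mean square of the ANOVA. This covers a single contrast only and is not a replacement for the adjusted tests reported above.

```r
# Small made-up data set with three groups of four children each
toy <- data.frame(
  Mjob = factor(rep(c("at_home", "other", "health"), each = 4),
                levels = c("at_home", "other", "health")),
  G3   = c(8, 9, 10, 9,  10, 11, 12, 11,  12, 13, 14, 13))

# Group means in the order of the factor levels
group.means <- tapply(toy$G3, toy$Mjob, mean)

# Contrast: "at_home" against the average of the other two groups
weights  <- c(1, -1/2, -1/2)
estimate <- sum(weights * group.means)

# Standard error from the residual mean square of the one-way ANOVA
fit <- aov(G3 ~ Mjob, data = toy)
mse <- deviance(fit) / df.residual(fit)
se  <- sqrt(mse * sum(weights^2 / table(toy$Mjob)))

t.value <- estimate / se
print(c(estimate = estimate, se = se, t = t.value))
```

The weights of a contrast sum to zero, which is why the estimate is zero when all group means are equal.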

5. Relationship between Goout and Performance.

In the following, I want to investigate the impact of going out on the performance of the child. This is why I fit a linear regression:

lm1 <- with(student.mat, summary(lm(G3 ~ goout )))
lm1

## 
## Call:
## lm(formula = G3 ~ goout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5676  -1.9282   0.4324   3.0718   9.0718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.1141     0.6793  17.833  < 2e-16 ***
## goout        -0.5465     0.2057  -2.656  0.00823 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.547 on 393 degrees of freedom
## Multiple R-squared:  0.01763,    Adjusted R-squared:  0.01513 
## F-statistic: 7.054 on 1 and 393 DF,  p-value: 0.008229

Going out is significantly related to the performance in the math course. I want to visualize this with a scatter plot:

plot(1,
     xlim = c(1, 5),
     ylim = c(0, 20),
     type = "n",
     main = "Relationship between goout and performance",
     xlab = "Goout",
     ylab = "Performance in G3"
     )


with(student.mat, 
     points(goout, 
            G3,
            pch = 21, 
            col = alpha("blue", 0.1),
            bg =alpha("blue", 0.1)
            ))

with(student.mat, abline(lm(G3 ~ goout)))

After checking the plot I noticed that the mean performance is also low when the child rarely goes out. This is why I suspected a different, non-linear relationship. First, I checked the group means:

#The formula is passed without an argument name, because the name of this argument is not the same in all R versions.
aggregate(
  G3 ~ goout,
  data= student.mat, 
  FUN = mean)

##   goout        G3
## 1     1  9.869565
## 2     2 11.194175
## 3     3 10.961538
## 4     4  9.651163
## 5     5  9.037736

The means confirm what I suspected: the first mean is lower than the second and third ones. I therefore expected the relationship between performance and going out to be quadratic. I checked this with a regression analysis.

with(student.mat, summary(lm(G3 ~I(goout^2) )))

## 
## Call:
## lm(formula = G3 ~ I(goout^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3513  -1.9331   0.4051   3.0669   9.0669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.44584    0.41470  27.601  < 2e-16 ***
## I(goout^2)  -0.09454    0.03176  -2.977  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.536 on 393 degrees of freedom
## Multiple R-squared:  0.02205,    Adjusted R-squared:  0.01956 
## F-statistic: 8.861 on 1 and 393 DF,  p-value: 0.003093

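The results show that the squared term is a significant predictor and fits slightly better than the linear term (R-squared of 0.022 instead of 0.018). Note that this model contains only the squared term, so its fitted curve still decreases over the whole range of goout. I create a plot to show the new regression line: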

plot(1,
     xlim = c(1, 5),
     ylim = c(0, 20),
     type = "n",
     main = "Relationship between goout and performance",
     xlab = "Goout",
     ylab = "Performance in G3"
     )


with(student.mat, 
     points(goout, 
            G3,
            pch = 21, 
            col = alpha("blue", 0.1),
            bg =alpha("blue", 0.1)
            ))

#X-axis vector, needed for the lines function.
xx <- c(1,2,3,4,5)

#Regression analysis
i2 <-with(student.mat, (lm(G3 ~ I(goout^2) )))

#Print the line with the lines function. 
with(student.mat, lines(xx, predict(i2, data.frame(goout = xx))))
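Because the model above contains only the squared term, one possible extension (not part of the original analysis) is to include the linear term as well, which allows the curve to rise first and then fall. The sketch below uses simulated data with such a shape (`toy`, `fit.lin` and `fit.quad` are hypothetical names) and compares the two nested models with `anova()`.

```r
# Simulated data with an inverted-U shape (illustration only)
set.seed(5)
goout <- sample(1:5, 200, replace = TRUE)
G3    <- 8 + 2.4 * goout - 0.5 * goout^2 + rnorm(200, sd = 2)
toy   <- data.frame(goout, G3)

# Linear model and model with linear plus quadratic term
fit.lin  <- lm(G3 ~ goout, data = toy)
fit.quad <- lm(G3 ~ goout + I(goout^2), data = toy)

# F test: does the quadratic term improve the fit?
print(anova(fit.lin, fit.quad))

# Draw the fitted curve on a fine grid of x values
xx <- seq(1, 5, by = 0.1)
plot(toy$goout, toy$G3, xlab = "Goout", ylab = "Performance in G3")
lines(xx, predict(fit.quad, data.frame(goout = xx)))
```

Using a fine grid for `xx` instead of the five integer values gives a smooth curve rather than connected line segments.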

Conclusion

The performance of students in a math course depends on different factors. First of all, I found a significant connection between the mother’s job and the performance. Performance is highest when the mother works in the health sector and lowest when the mother is at home. This result is counter-intuitive because one might assume that a mother who is at home has more time for her children.

In contrast, the result that going out is negatively related to performance in a math class is intuitive. Additionally, the results revealed that older students who failed once or several times show lower performance.

While boys show lower performance the older they are, girls remain relatively constant.

Data Analysis and Visualization with R - Final Paper

Marvin Pafla

March 2016
