data <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK/student-mat.csv", header=TRUE, sep=",", dec=".")

head(data)

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

The dataset was found on kaggle.com, and investigates the alcohol consumption of students attending Math and Portugese courses.

The unit of observation is a student attending Portugese or Math courses. It incorporates 33 variables in total, with 395 observations.

Data <- data [c(-1, -5, -6, -7, -8, -9, -10, -11, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24, -25, -26, -29, -30)]

head(Data)

##   sex age address guardian Dalc Walc G1 G2 G3
## 1   F  18       U   mother    1    1  5  6  6
## 2   F  17       U   father    1    1  5  5  6
## 3   F  15       U   mother    2    3  7  8 10
## 4   F  15       U   mother    1    1 15 14 15
## 5   F  16       U   father    1    2  6 10 10
## 6   M  16       U   mother    1    2 15 15 15

Description of variables:

sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
age: student’s age (numeric: from 15 to 22)
address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
G1: first period Math or Portuguese grade (numeric: from 0 to 20)
G2: second period Math or Portuguese grade (numeric: from 0 to 20)
G3: final Math or Portuguese grade (numeric: from 0 to 20, output target)

RQ: Does alcohol consumption of students differ on workdays in comparison to weekends? What are the characteristics of alcohol consumption among students?

Hypothesis Testing Example 1

For this example we will compare Dalc, of the entire sample (urban and rural) to Walc, also of the entire sample.

Data$Difference <- Data$Dalc - Data$Walc

library(ggplot2)
ggplot(Data, aes(x = Difference)) +
  geom_histogram(binwidth = 1, fill = "hotpink2") +
  xlab("Differences")

We perform the Shapiro-Wilk normality test.

H0: variable is normally distributed

H1: variable is not normally distributed

shapiro.test(Data$Difference)

## 
##  Shapiro-Wilk normality test
## 
## data:  Data$Difference
## W = 0.82267, p-value < 2.2e-16

We reject the null hypothesis at p<2.2e-16. We cannot do a paired samples t-test, so we do the non-parametric alternative.

(We can also doublecheck normality this way:)

library(ggpubr)

ggqqplot(Data$Difference)

## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Wilcoxon Signed Rank Test

The Wilcoxon test compares the medians of the differences between the paired observations, rather than the means used in the t-test.

H0: Distribution locations of alcohol consumption are the same

H1: Distribution locations of alcohol consumption are different

wilcox.test(Data$Dalc, Data$Walc,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank test
## 
## data:  Data$Dalc and Data$Walc
## V = 317.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

We reject the null hypothesis and accept H1, at p<2.2e-16. As the p-value suggests that there is a statistically significant difference between the medians of Dalc and Walc. (= The difference between the medians of Dalc and Walc is very likely to be real and not due to random chance.)

library(effectsize)
effectsize(wilcox.test(Data$Dalc, Data$Walc,
                       paired = TRUE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.97             | [-0.98, -0.96]

interpret_rank_biserial(-0.97)

## [1] "very large"
## (Rules: funder2019)

Conclusions: Using the sample data, we find that the consumption of alcohol during weekdays differs from the alcohol consumption during weekends (p<2.2e-16). Moreover, the rank biserial, in this case is a negative correlation coefficient of -0.97 and indicates a strong negative association between the two variables - the difference is very large.

Hypothesis Testing Example 2

For this example we will compare Walc of students living in an urban environment to those students living in a rural one.

Data$addressR <- factor(Data$address, 
                         levels = c("U", "R"), 
                         labels = c("Urban", "Rural"))

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:effectsize':
## 
##     phi

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describeBy(Data$Walc, g = Data$addressR)

## 
##  Descriptive statistics by group 
## group: Urban
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 307 2.22 1.27      2    2.08 1.48   1   5     4 0.67    -0.75 0.07
## ------------------------------------------------------------ 
## group: Rural
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 88 2.53 1.31      2    2.43 1.48   1   5     4 0.41    -0.97 0.14

library(ggplot2)

ggplot(Data, aes(x = Walc)) +
  geom_histogram(binwidth = 1, fill="lightblue") +
  facet_wrap(~addressR, ncol = 1) + 
  ylab("Frequency")

We must perform the Wilcoxon Rank Sum test:

H0: distribution locations are the same

H1: distribution locations are different

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following objects are masked from 'package:effectsize':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:stats':
## 
##     filter

Data %>%
  group_by(addressR) %>%
  shapiro_test(Walc)

## # A tibble: 2 × 4
##   addressR variable statistic        p
##   <fct>    <chr>        <dbl>    <dbl>
## 1 Urban    Walc         0.834 1.87e-17
## 2 Rural    Walc         0.882 8.46e- 7

H0: distribution is normal

H1: distribution is not normal

=> Reject null hypothesis for Urban based on p-value (alpha = 5%).

=> Reject null hypothesis for Rural based on p-value.

wilcox.test(Data$Walc ~ Data$addressR,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  Data$Walc by Data$addressR
## W = 11614, p-value = 0.03684
## alternative hypothesis: true location shift is not equal to 0

Students from rural env. consume more alcohol.

library(effectsize)

effectsize(wilcox.test(Data$Walc ~ Data$addressR,
                      paired = FALSE,
                      correct = FALSE,
                      exact = FALSE,
                      alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.14             | [-0.27,  0.00]

interpret_rank_biserial(-0.14)

## [1] "small"
## (Rules: funder2019)

Conclusions: Based on the sample data, we found that students from urban environments and students from rural environments differ in their alcohol consumption during weekends (p=0.03684) - rural students consume more alcohol during weekends, the difference in distribution is small (r=-0.14).

Hypothesis Testing Example 3

For the final hypothesis testing, we will check if there is a difference between final grades of the students (G3), based on who is their guardian (mother, father, other).

shapiro.test(Data$G3)

## 
##  Shapiro-Wilk normality test
## 
## data:  Data$G3
## W = 0.92873, p-value = 8.836e-13

H0: variable is normally distributed

H1: variable is not normally distributed

We reject the null hypothesis based on p=8.836e-13. We must perform a non-parametric test.

Kruskal-Wallis test

Data1 <- Data [c(-1, -2, -3, -5, -6, -7, -8)]

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data2 <- Data1 %>% 
  mutate(id = seq_len(nrow(Data1)))

library(dplyr)
Data2 <- Data2 %>% 
  select(id, everything())

head(Data2)

##   id guardian G3 Difference addressR
## 1  1   mother  6          0    Urban
## 2  2   father  6          0    Urban
## 3  3   mother 10         -1    Urban
## 4  4   mother 15          0    Urban
## 5  5   father 10         -1    Urban
## 6  6   mother 15         -1    Urban

kruskal.test(G3 ~ guardian, 
             data = Data2)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  G3 by guardian
## Kruskal-Wallis chi-squared = 3.8131, df = 2, p-value = 0.1486

H0: loc. distrib. of happiness are the same in all three groups

H1: at least 1 loc. distrib. is different

=> We fail to reject the null hypothesis based on p=0.1486 (alpha=0.05)

kruskal_effsize(G3 ~ guardian, 
                data = Data2)

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 G3      395 0.00463 eta2[H] small

library(rstatix)

groups_nonpar <- wilcox_test(G3 ~ guardian,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = Data2)

groups_nonpar

## # A tibble: 3 × 9
##   .y.   group1 group2    n1    n2 statistic     p p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>       
## 1 G3    father mother    90   273    12679  0.647 1     ns          
## 2 G3    father other     90    32     1778. 0.048 0.144 ns          
## 3 G3    mother other    273    32     5198. 0.078 0.233 ns

pwc <- Data2 %>%
  wilcox_test(G3 ~ guardian, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(G3 ~ guardian, 
                                data = Data2)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(Data2, x = "guardian", y = "G3", add = "point", ylim=c(0, 20)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = guardian), color = "purple",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "purple", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = guardian)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

Conclusions: We found that at a p-value of 0.15 (which means that there is a 15% chance that the difference in median values between the groups is due to random chance), we fail to reject the null hypothesis. This means that there is not enough evidence to suggest that there is a statistically significant difference in the median values of the the final grade based on grouping by types of guardians (x^2=3.81, p=0.15), moreover, the effect size was small (=0.00463). The post-hoc analysis with wilcox_test was done with a multiple comparison correction (Bonferroni correction), and it indicates that there is no significant difference between the groups as well. In conclusion, based on the analysis, there is no evidence to suggest that there is a statistically significant difference in the median values of the final grades of the students based on who their guardian is.

HW2

Julia Vulić

2023-01-09

Hypothesis Testing Example 1

Wilcoxon Signed Rank Test

Hypothesis Testing Example 2

Hypothesis Testing Example 3

Kruskal-Wallis test