HW1

data <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK 1/student-mat.csv", header=TRUE, sep=",", dec=".")

head(data)

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

The dataset was found on kaggle.com and covers the alcohol consumption of students attending Math and Portuguese courses.

The unit of observation is a student attending a Portuguese or Math course. The dataset contains 33 variables and 395 observations. The variables include:

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less than or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parents’ cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - in a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)

# Keep only the variables used in the analysis, selected by name
Data <- data[, c("sex", "age", "address", "Dalc", "Walc", "health")]

head(Data)

##   sex age address Dalc Walc health
## 1   F  18       U    1    1      3
## 2   F  17       U    1    1      3
## 3   F  15       U    2    3      3
## 4   F  15       U    1    1      5
## 5   F  16       U    1    2      5
## 6   M  16       U    1    2      5

For our analysis we will focus on the following variables: sex, age, address, Dalc, Walc, and health.

RQ: Does students' alcohol consumption differ between workdays and weekends?

#install.packages("lessR")

# Recode sex as a labelled factor for plotting
sex_var <- factor(Data$sex, levels = c("F", "M"), labels = c("Female", "Male"))
sex_table <- table(sex_var)

library(lessR)

PieChart(sex_var, values="%", fill=c("pink", "lightblue"), hole=0)

## >>> Note: sex_var is not in a data frame (table)

## >>> suggestions
## piechart(sex_var, hole=0)  # traditional pie chart
## piechart(sex_var, values="%")  # display %'s on the chart
## piechart(sex_var)  # bar chart
## plot(sex_var)  # bubble plot
## plot(sex_var, values="count")  # lollipop plot 
## 
## --- sex_var --- 
## 
##                Female   Male     Total 
## Frequencies:      208    187       395 
## Proportions:    0.527  0.473     1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 1.116, df = 1, p-value = 0.291

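From the pie chart we can see that there are more females (53%) than males (47%) in the observed sample. The chi-squared test (p-value = 0.291), however, gives no evidence that the two proportions differ from an equal split.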
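
The chi-squared statistic printed by PieChart can be reproduced from the two frequencies alone. The following is a minimal sketch in base R, assuming only the counts shown in the output above (208 and 187); the object names sex_counts and gof are illustrative and not part of the original analysis.

```r
# Counts taken from the frequency table above
sex_counts <- c(Female = 208, Male = 187)

# Goodness-of-fit test against equal proportions (the default for a vector)
gof <- chisq.test(sex_counts)

round(unname(gof$statistic), 3)  # 1.116
round(gof$p.value, 3)            # 0.291
```

Both values agree with the lessR output, which suggests that PieChart runs the same equal-probabilities test.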

#install.packages("rstatix")

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

get_summary_stats(Data)

## # A tibble: 4 × 13
##   variable     n   min   max median    q1    q3   iqr   mad  mean    sd    se
##   <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age        395    15    22     17    16    18     2  1.48 16.7  1.28  0.064
## 2 Dalc       395     1     5      1     1     2     1  0     1.48 0.891 0.045
## 3 Walc       395     1     5      2     1     3     2  1.48  2.29 1.29  0.065
## 4 health     395     1     5      4     3     5     2  1.48  3.55 1.39  0.07 
## # … with 1 more variable: ci <dbl>

The minimum age in our sample is 15 and the maximum is 22, so the range is 7. The median age is 17, meaning that half of the students are 17 or younger and half are 17 or older. The mean age is 16.7 (16.696 before rounding), which is close to the median.

If we look at the alcohol consumption during weekdays (Dalc), we can observe the standard deviation (sd), which measures the dispersion of the sample data around the mean. The standard deviation for Dalc is 0.89, which means that the values of our 395 observations typically differ from the sample mean by about 0.89. The standard error of alcohol consumption during weekdays, on the other hand, is 0.045. The standard error measures how far the sample mean is likely to be from the population mean: it tells us how much the sample mean would vary if we repeated the study with new samples from the same population. It equals the standard deviation divided by the square root of the sample size.
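
The link between the sd and se columns can be checked with a few lines of base R. This sketch assumes only the rounded standard deviations and the sample size printed in the summary table; the object names are illustrative.

```r
n <- 395  # sample size from the summary table

# Rounded standard deviations as printed by get_summary_stats()
sd_reported <- c(age = 1.28, Dalc = 0.891, Walc = 1.29, health = 1.39)

# Standard error of the mean: sd / sqrt(n)
se <- sd_reported / sqrt(n)
round(se, 3)  # age 0.064, Dalc 0.045, Walc 0.065, health 0.070
```

The rounded results match the se column of the table.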

Next, q1 for Walc (alcohol consumption during weekends) is 1, q2 (the median) is 2, and q3 is 3. This means that at least 25% of students report a weekend consumption of 1, the lowest level on the scale; at least 50% report 2 or lower; and at least 75% report 3 or lower. Because Walc takes only the whole numbers 1 to 5, many students share the same value, so these shares are lower bounds rather than exact splits.
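
How quartiles behave on a 1 to 5 scale can be illustrated with a small example. The vector below is hypothetical (it is not the Walc column) and was chosen so that the quartiles are also 1, 2, and 3.

```r
# Hypothetical ratings on a 1-5 scale, not the real data
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 5)

# Quartiles with R's default quantile method (type 7)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Share of observations at or below each quartile
share_at_or_below <- sapply(q, function(v) mean(x <= v))
share_at_or_below
```

With tied values, the share at or below each quartile is at least the nominal 25%, 50%, or 75%, but it can be larger.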

library(ggplot2)
ggplot(Data, aes(x = health)) +
  geom_histogram(binwidth = 1, fill = "hotpink2") +
  xlab("Health")

The graph shows how students rated their current health status. The distribution looks skewed, so we also calculate the skewness and kurtosis to better understand the data:

#install.packages("moments")

library(moments)

## 
## Attaching package: 'moments'

## The following object is masked from 'package:lessR':
## 
##     kurtosis

skewness(Data$health)

## [1] -0.4927233

kurtosis(Data$health)

## [1] 1.983561

Skewness measures the asymmetry of a distribution. In this sample it is -0.49, which indicates a moderate negative skew, also known as a left skew: the longer tail lies on the side of the low health ratings. The distribution is therefore not symmetric.

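Kurtosis measures whether the data are heavy-tailed or light-tailed compared to a normal distribution. The kurtosis() function from the moments package reports Pearson's kurtosis, for which a normal distribution has a value of 3. The kurtosis for our sample is 1.98, which is below 3, so the data are light-tailed or platykurtic (excess kurtosis of about -1.02).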
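
To make the reference value of 3 explicit, both measures can be written out from their moment definitions in base R. This is a sketch under the assumption that the moment-based formulas below correspond to what skewness() and kurtosis() in the moments package compute; the helper names and the example vector are hypothetical.

```r
# k-th central moment with divisor n
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and Pearson kurtosis (normal distribution = 3)
skew_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical 1-5 ratings, not the real health column
x <- c(1, 2, 3, 3, 4, 4, 5, 5, 5, 5)

skew_manual(x)      # negative: the longer tail is on the left
kurt_manual(x)      # compare with 3, not with 0
kurt_manual(x) - 3  # excess kurtosis; negative means lighter tails than normal
```

Subtracting 3 gives the excess kurtosis, which is the version that should be compared with 0.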

library(ggplot2)

Urban_Dalc <- ggplot(Data[Data$address=="U",  ], aes(x = Dalc)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, fill = "pink") +
  ggtitle("Urban")

Urban_Walc <- ggplot(Data[Data$address=="U",  ], aes(x = Walc)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, fill = "purple") +
  ggtitle("Urban")

Rural_Dalc <- ggplot(Data[Data$address=="R",  ], aes(x = Dalc)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, fill = "lightblue") +
  ggtitle("Rural")

Rural_Walc <- ggplot(Data[Data$address=="R",  ], aes(x = Walc)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, fill = "blue") +
  ggtitle("Rural")

library(ggpubr)
ggarrange(Urban_Dalc, Urban_Walc, Rural_Dalc, Rural_Walc,
          ncol= 2, nrow = 2)

In the graphs above, we can observe the differences in alcohol consumption between workdays and weekends, as well as by students' home address (urban or rural). However, we are more interested in how alcohol consumption differs between workdays and weekends than in how it differs by address. Hence we calculate the difference for each student:

Data$Difference <- Data$Dalc - Data$Walc 

library(ggplot2)
ggplot(Data, aes(x = Difference)) +
  geom_histogram(binwidth = 1, fill = "hotpink2") +
  xlab("Differences")

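The graph above shows the distribution of the difference between alcohol consumption on workdays and on weekends (Dalc minus Walc). Negative values indicate that a student reports drinking more on weekends than on workdays, and 0 means the same level on both. Based on the means reported earlier (1.48 for Dalc and 2.29 for Walc), the average difference is about -0.81.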
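
The same calculation can be followed by hand on the six students printed by head(Data). This is only an illustration on those six rows, not a result for the full sample; the object names are illustrative.

```r
# Dalc and Walc for the first six students, as printed by head(Data)
dalc <- c(1, 1, 2, 1, 1, 1)
walc <- c(1, 1, 3, 1, 2, 2)

# Negative values mean more drinking on weekends than on workdays
difference <- dalc - walc

table(difference)  # three students at -1, three at 0
mean(difference)   # -0.5
```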

Julia Vulić

2023-01-09