data <- read.table("C:/Julia/SEB LU/IMB/SEMESTER 2/Multivariate Analysis/HOMEWORK 1/student-mat.csv", header=TRUE, sep=",", dec=".")
head(data)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
The dataset was found on kaggle.com and concerns the alcohol consumption of students attending Math and Portuguese courses.
The unit of observation is a student attending a Portuguese or Math course. The dataset contains 33 variables and 395 observations. The variables include:
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
# Keep only the variables used in the analysis, selected by name
Data <- data[, c("sex", "age", "address", "Dalc", "Walc", "health")]
head(Data)
## sex age address Dalc Walc health
## 1 F 18 U 1 1 3
## 2 F 17 U 1 1 3
## 3 F 15 U 2 3 3
## 4 F 15 U 1 1 5
## 5 F 16 U 1 2 5
## 6 M 16 U 1 2 5
For our analysis we will focus on the following variables: sex, age, address, Dalc, Walc, and health.
RQ: Does students' alcohol consumption differ between workdays and weekends?
#install.packages("lessR")
# Recode sex as a labelled factor for plotting
sex_var <- factor(Data$sex, levels = c("F", "M"), labels = c("Female", "Male"))
sex_table <- table(sex_var)
library(lessR)
##
## lessR 4.2.5 feedback: gerbing@pdx.edu
## --------------------------------------------------------------
## > d <- Read("") Read text, Excel, SPSS, SAS, or R data file
## d is default data frame, data= in analysis routines optional
##
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables.
## Enter: browseVignettes("lessR")
##
## View changes in this and recent versions of lessR.
## Enter: news(package="lessR")
##
## **Newly Revised**: Interactive data analysis.
## Enter: interact()
PieChart(sex_var, values="%", fill=c("pink", "lightblue"), hole=0)
## >>> Note: sex_var is not in a data frame (table)
## >>> Note: sex_var is not in a data frame (table)
## >>> suggestions
## PieChart(sex_var, hole=0) # traditional pie chart
## PieChart(sex_var, values="%") # display %'s on the chart
## BarChart(sex_var) # bar chart
## Plot(sex_var) # bubble plot
## Plot(sex_var, values="count") # lollipop plot
##
## --- sex_var ---
##
## Female Male Total
## Frequencies: 208 187 395
## Proportions: 0.527 0.473 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 1.116, df = 1, p-value = 0.291
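From the pie chart we can see that there are more females (53%) than males (47%) in the observed sample. The chi-squared test reported above (p-value = 0.291) indicates that this difference is not statistically significant at the 5% level.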
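The chi-squared test printed with the pie chart can be reproduced in base R from the two reported counts. This is a minimal sketch: the variable names are illustrative, and it assumes the test is a goodness-of-fit test against equal proportions, as the output's wording suggests.

```r
# Observed counts taken from the frequency table above
counts <- c(Female = 208, Male = 187)

# Goodness-of-fit test against equal proportions (the default for a vector)
test <- chisq.test(counts)

round(unname(test$statistic), 3)  # chi-squared statistic
round(test$p.value, 3)            # p-value
```

The statistic and p-value should agree with the Chisq and p-value lines in the lessR output.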
#install.packages("rstatix")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
get_summary_stats(Data)
## # A tibble: 4 × 13
## variable n min max median q1 q3 iqr mad mean sd se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 395 15 22 17 16 18 2 1.48 16.7 1.28 0.064
## 2 Dalc 395 1 5 1 1 2 1 0 1.48 0.891 0.045
## 3 Walc 395 1 5 2 1 3 2 1.48 2.29 1.29 0.065
## 4 health 395 1 5 4 3 5 2 1.48 3.55 1.39 0.07
## # … with 1 more variable: ci <dbl>
The minimum age in our sample is 15 and the maximum is 22, so the range is 7. The median age is 17, meaning that half of the students are aged 17 or younger and half are aged 17 or older. The mean age is 16.7 (16.696 before rounding), which is close to the median.
If we look at alcohol consumption on workdays (Dalc), we can observe the standard deviation (sd), which measures the dispersion of the sample data around the mean. The standard deviation for Dalc is 0.89, which means that the values of our 395 observations typically differ from the sample mean by about 0.89. The standard error of workday alcohol consumption, on the other hand, is 0.045. The standard error measures how far the sample mean is likely to be from the population mean: it tells us how much the sample mean would vary if we were to repeat the study using new samples from the population.
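The link between the two measures is that the standard error of the mean equals the standard deviation divided by the square root of the sample size. A small sketch using the rounded values from the summary table (the variable names are illustrative):

```r
# Values reported for Dalc in the summary table above
sd_dalc <- 0.891
n <- 395

# Standard error of the mean: sd divided by the square root of n
se_dalc <- sd_dalc / sqrt(n)
round(se_dalc, 3)
```

The result should match the se column for Dalc up to rounding.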
Next, q1 for Walc (alcohol consumption on weekends) is 1, q2 (the median) is 2, and q3 is 3. This means that at least 25% of the students report a value of 1 or lower, at least 50% report 2 or lower, and at least 75% report 3 or lower. Because the scale has only five values, many students share the same value, so these shares are not exact.
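To illustrate how quartiles behave on a coarse rating scale, here is a sketch on a small hypothetical vector (not the survey data). It uses base R's quantile() with its default method, which may differ slightly from the method used by get_summary_stats().

```r
# Hypothetical ratings on a 1-5 scale (not the survey data)
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 5)

# First quartile, median and third quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Share of observations at or below each quartile
sapply(q, function(cut) mean(x <= cut))
```

With tied values the shares at or below each quartile come out above 25%, 50% and 75%, which is why the interpretation is best phrased as "at least".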
library(ggplot2)
ggplot(Data, aes(x = health)) +
geom_histogram(binwidth = 1, fill = "hotpink2") +
xlab("Health")
The graph shows how students rated their current health status. The distribution looks skewed, so we also calculate the skewness and kurtosis to better understand the data:
#install.packages("moments")
library(moments)
##
## Attaching package: 'moments'
## The following object is masked from 'package:lessR':
##
## kurtosis
skewness(Data$health)
## [1] -0.4927233
kurtosis(Data$health)
## [1] 1.983561
Skewness measures the asymmetry of a distribution. In this sample it is -0.49, which indicates a negative skew, also known as a left skew, so the distribution is not symmetric.
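Kurtosis measures whether the sample data is heavy-tailed or light-tailed compared to a normal distribution. The kurtosis() function from the moments package reports kurtosis on a scale where a normal distribution has a value of 3. The kurtosis for our sample is 1.98, which is below 3 and indicates that the data is light-tailed, or platykurtic.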
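Both statistics can be sketched in base R, assuming the usual moment-based definitions (third and fourth central moments scaled by powers of the second). The vector below is hypothetical and only illustrates the arithmetic. Under this definition a normal distribution has a kurtosis of 3.

```r
# Hypothetical ratings on a 1-5 scale (not the survey data)
x <- c(1, 2, 3, 4, 5)

# Central moments of order 2, 3 and 4 (dividing by n, not n - 1)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skew <- m3 / m2^(3/2)  # 0 for a perfectly symmetric sample
kurt <- m4 / m2^2      # a normal distribution would give 3

c(skewness = skew, kurtosis = kurt)
```

For this evenly spread vector the skewness is 0 and the kurtosis is well below 3, the same pattern of light tails seen in the health variable.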
library(ggplot2)
Urban_Dalc <- ggplot(Data[Data$address=="U", ], aes(x = Dalc)) +
theme_linedraw() +
geom_histogram(binwidth = 1, fill = "pink") +
ggtitle("Urban")
Urban_Walc <- ggplot(Data[Data$address=="U", ], aes(x = Walc)) +
theme_linedraw() +
geom_histogram(binwidth = 1, fill = "purple") +
ggtitle("Urban")
Rural_Dalc <- ggplot(Data[Data$address=="R", ], aes(x = Dalc)) +
theme_linedraw() +
geom_histogram(binwidth = 1, fill = "lightblue") +
ggtitle("Rural")
Rural_Walc <- ggplot(Data[Data$address=="R", ], aes(x = Walc)) +
theme_linedraw() +
geom_histogram(binwidth = 1, fill = "blue") +
ggtitle("Rural")
library(ggpubr)
ggarrange(Urban_Dalc, Urban_Walc, Rural_Dalc, Rural_Walc,
ncol= 2, nrow = 2)
In the graphs above, we can observe the differences in alcohol consumption between workdays and weekends, as well as by students' home address (urban or rural). However, we are more interested in the difference between workday and weekend consumption than in the difference by address. Hence we will calculate the difference for each student:
Data$Difference <- Data$Dalc - Data$Walc
library(ggplot2)
ggplot(Data, aes(x = Difference)) +
geom_histogram(binwidth = 1, fill = "hotpink2") +
xlab("Differences")
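The graph above shows the distribution of the difference between workday and weekend alcohol consumption for each student. Because the difference is computed as Dalc - Walc, negative values indicate higher consumption on weekends. Given the sample means reported earlier (1.48 for Dalc and 2.29 for Walc), the mean difference is about -0.81.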
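The per-student difference can also be summarised numerically. The sketch below uses a small hypothetical data frame (not the survey data) to show how to read the sign of the difference and why its mean equals the difference of the two means.

```r
# Hypothetical scores for five students (not the survey data)
toy <- data.frame(Dalc = c(1, 1, 2, 1, 3),
                  Walc = c(1, 3, 3, 2, 5))

# Negative values mean more alcohol on weekends than on workdays
toy$Difference <- toy$Dalc - toy$Walc

table(toy$Difference)   # how often each difference occurs
mean(toy$Difference)    # equals mean(Dalc) - mean(Walc)
```

Applying the same two summary calls to Data$Difference would give the counts behind the histogram and the mean difference for the full sample.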