TEAM: Alex Stephenson, Anna Gorobtsova, Elizaveta Dyachenko, Marina Romanova
COUNTRY: Slovenia
TOPIC: Health and Care
Firstly, we download all the needed libraries and database.
library(rmarkdown)
library(foreign)
library(ggplot2)
library(gapminder)
library(dplyr)
library(psych)
library(corrplot)
library(knitr)
library(data.table)
library(moments)
library(sjPlot)
Slovenia <- read.spss("ESS7SI.sav", use.value.labels = T, to.data.frame = T, na.omit = T)
gndr - Gender
jbexebs - In any job, ever exposed to: breathing in smoke, fumes, powder, dust
H0: There is no significant association between gender and being exposed to breathing smoke in a job.
H1: There is significant association between gender and being exposed to breathing smoke in a job.
chi <- select(Slovenia, c("gndr","jbexebs"))
str(chi)
## 'data.frame': 1224 obs. of 2 variables:
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
## $ jbexebs: Factor w/ 2 levels "Not marked","Marked": 1 1 1 1 1 1 2 1 2 2 ...
## - attr(*, "variable.labels")= Named chr "Title of dataset" "ESS round" "Edition" "Production date" ...
## ..- attr(*, "names")= chr "name" "essround" "edition" "proddate" ...
## - attr(*, "codepage")= int 65001
ggplot(chi, aes(x = jbexebs, fill = gndr)) +
geom_bar() +
labs(x = 'Exposure to breathing smoke in a job', y = 'Number of people', title = 'Being exposed to breathing smoke in a job distributed by gender') +
scale_fill_manual(name = "Gender", values = c("skyblue", "salmon"), labels = c("Male", "Female")) +
scale_x_discrete(labels = c("Not exposed", "Exposed"))
From the barplot it can be seen that there are much less people in Slovenia who are exposed to breathing smoke in a job. Nevertheless, there are more men than women among them.
cont_table <- table(chi$gndr, chi$jbexebs, dnn = c("Gender", "Exposed to breathing smoke in a job"))
cont_table
## Exposed to breathing smoke in a job
## Gender Not marked Marked
## Male 393 170
## Female 589 72
sjp.xtab(chi$jbexebs, chi$gndr, geom.colors = c("skyblue", "salmon", "greenyellow"), axis.titles = "Exposure to breathing smoke in a job", legend.title = "Gender", axis.labels = c("Not exposed", "Exposed"))
test <- chisq.test(chi$gndr, chi$jbexebs)
test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: chi$gndr and chi$jbexebs
## X-squared = 70.206, df = 1, p-value < 2.2e-16
With p-value < 2.2e-16 we decline the H0-hypothesis and can claim that there is a significant association between breathing smoke in a job and gender.
test$stdres
## chi$jbexebs
## chi$gndr Not marked Marked
## Male -8.450889 8.450889
## Female 8.450889 -8.450889
corrplot(test$stdres, is.cor = FALSE)
Positive residuals (blue cells) mean that there is a positive association between the gender and being exposed to breathing smoke in a job.
Negative residuals (red cells) mean that there is a negative association.
assocplot(t(cont_table), main = "Residuals and number of observations")
The cells for females which are exposed to breathing smoke in a job and for males who are not contain fewer observations than it was expected.
Two other cells contain more observations than it was expected.
gndr - Gender
cgtsday - How many cigarettes smoke on typical day
H0: There is no significant difference between genders in amount of cigarettes they smoke every day.
H1: There is significant difference between genders in amount of cigarettes they smoke every day.
t <- select(Slovenia, c("gndr","cgtsday"))
str(t)
## 'data.frame': 1224 obs. of 2 variables:
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
## $ cgtsday: Factor w/ 22 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "variable.labels")= Named chr "Title of dataset" "ESS round" "Edition" "Production date" ...
## ..- attr(*, "names")= chr "name" "essround" "edition" "proddate" ...
## - attr(*, "codepage")= int 65001
t$cgtsday <- as.numeric(as.character(t$cgtsday))
table(t$gndr, t$cgtsday)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 17 20 23 25 28 30 40
## Male 5 1 4 4 7 0 6 1 0 21 1 1 0 17 1 1 57 0 4 1 6 1
## Female 5 6 5 11 9 9 1 3 1 45 0 2 1 16 0 0 31 1 0 0 2 0
t$cgtsday[t$cgtsday > 30] <- 30
ggplot(t, aes(y = cgtsday, x = gndr)) +
geom_boxplot(fill = c("skyblue", "salmon")) +
stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
labs(title = "Difference in amount of cigerettes smoked a day between males and females", x = "Gender", y = "Cigarettes smoked per day")
From the boxplot it is seen that in Slovenia men smoke more cigarettes per day than women.
describeBy(t, t$gndr)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## gndr* 1 563 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## cgtsday 2 139 15.24 7.35 17 15.31 4.45 1 30 29 -0.18
## kurtosis se
## gndr* NaN 0.00
## cgtsday -0.69 0.62
## --------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## gndr* 1 661 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## cgtsday 2 148 11.11 6.45 10 10.97 7.41 1 30 29 0.45
## kurtosis se
## gndr* NaN 0.00
## cgtsday -0.53 0.53
For male skew value is 0.01, so as should be for normal distribution. For female it is 0.45, so less normally but still less than 1.0. The negative kurtosis values show that the data has lighter tails than the standard distribution.
ggplot(t, aes(x = cgtsday, fill = gndr), na.rm = TRUE) +
geom_histogram(binwidth = 5, alpha = 0.75) +
geom_density(alpha = 0.5) +
labs(title = "Smoking habits in Slovenia",x = "Cigarettes smoked per day", y = "Density") +
geom_vline(aes(xintercept = mean(t$cgtsday, na.rm = TRUE), color='mean'), show.legend = TRUE, size = 1) +
geom_vline(aes(xintercept = median(t$cgtsday, na.rm = TRUE), color='median'), show.legend = TRUE, size = 1) +
scale_fill_manual(values = c("skyblue", "salmon"), guide = FALSE) +
scale_color_manual(values = c("hotpink4", "orangered")) +
theme(legend.title = element_blank()) +
facet_grid(t$gndr)
Histogram shows that distribution in both groups is close to normal but still is not.
m <- t %>%
filter(gndr == 'Male')
f <- t %>%
filter(gndr == 'Female')
par(mfrow = c(1,2))
qqnorm(m$cgtsday); qqline(m$cgtsday, col= "red", lty = 5, lwd = 2)
qqnorm(f$cgtsday); qqline(f$cgtsday, col = "red", lty = 5, lwd = 2)
Q-Q plots show that the distributions are not really normal.
var(f$cgtsday, na.rm = T)
## [1] 41.65338
var(m$cgtsday, na.rm = T)
## [1] 54.04118
Variances are not equal.
t.test(f$cgtsday, m$cgtsday, paired = F, var.equal = F)
##
## Welch Two Sample t-test
##
## data: f$cgtsday and m$cgtsday
## t = -5.0444, df = 274.88, p-value = 8.276e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.741407 -2.518072
## sample estimates:
## mean of x mean of y
## 11.11486 15.24460
On average men smoke 15 cigarettes a day whereas women smoke only 11. The t-statistic t(274.88) = - 5 (p-value < 0.001), so we reject the H0-hypothesis and claim that there is significant difference between men and women in amount of cigarettes they smoke in a day.
wilcox.test(x = f$cgtsday, y = m$cgtsday, paired = F)
##
## Wilcoxon rank sum test with continuity correction
##
## data: f$cgtsday and m$cgtsday
## W = 6986.5, p-value = 1.592e-06
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank sum W = 6986.5 (p-value < 0.001), which means that different genders smoke different amount of cigarettes in a day and this difference is statistically significant.