TEAM: Alex Stephenson, Anna Gorobtsova, Elizaveta Dyachenko, Marina Romanova
COUNTRY: Slovenia
YEAR: 2014
TOPIC: Health and Care
INDIVIDUAL CONTRIBUTION:
First, we load the required libraries and the dataset.
library(rmarkdown)
library(foreign)
library(ggplot2)
library(gapminder)
library(dplyr)
library(psych)
library(corrplot)
library(knitr)
library(data.table)
library(moments)
library(sjPlot)
Slovenia <- read.spss("ESS7SI.sav", use.value.labels = TRUE, to.data.frame = TRUE, na.omit = TRUE)

gndr - Gender
jbexebs - In any job, ever exposed to: breathing in smoke, fumes, powder, dust
H0: Gender and being exposed to breathing smoke in a job are independent variables.
H1: Gender and being exposed to breathing smoke in a job are not independent variables.
data1 <- select(Slovenia, c("gndr","jbexebs"))
str(data1)
## 'data.frame': 1224 obs. of 2 variables:
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
## $ jbexebs: Factor w/ 2 levels "Not marked","Marked": 1 1 1 1 1 1 2 1 2 2 ...
## - attr(*, "variable.labels")= Named chr "Title of dataset" "ESS round" "Edition" "Production date" ...
## ..- attr(*, "names")= chr "name" "essround" "edition" "proddate" ...
## - attr(*, "codepage")= int 65001
sjp.xtab(data1$jbexebs, data1$gndr, bar.pos = "stack", geom.colors = c("skyblue", "salmon", "greenyellow"),
         title = "Being exposed to breathing smoke in a job distributed by gender", axis.titles = "Exposure to breathing smoke in a job",
         legend.title = "Gender", axis.labels = c("Not exposed", "Exposed"), show.total = FALSE, margin = "row")

The bar plot shows that far fewer people in Slovenia have been exposed to breathing in smoke at work than have not. Among those who have been exposed, however, men outnumber women.
cont_table <- table(data1$gndr, data1$jbexebs, dnn = c("Gender", "Exposed to breathing smoke in a job"))
cont_table
##         Exposed to breathing smoke in a job
## Gender   Not marked Marked
##   Male          393    170
##   Female        589     72
The contingency table shows that the cell-count assumption of the chi-squared test is met, since every cell contains far more than 5 observations:
test <- chisq.test(data1$gndr, data1$jbexebs)
test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data1$gndr and data1$jbexebs
## X-squared = 70.206, df = 1, p-value < 2.2e-16
With X-squared(1) = 70.206 and p-value < 2.2e-16, we can claim that there is a statistically significant association between gender and exposure to breathing in smoke at work.
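As a cross-check, the statistic can be reproduced directly from the four counts in the contingency table. The sketch below is an illustration rather than part of the original analysis: it rebuilds the table as a matrix (the name `counts` is hypothetical), computes the expected counts under independence, and recomputes the Yates-corrected statistic by hand.

```r
# Counts copied from the contingency table above
# (rows: gender, columns: exposure). `counts` is a hypothetical name.
counts <- matrix(c(393, 589, 170, 72), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 Exposure = c("Not marked", "Marked")))

# Expected counts under independence:
# row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic with Yates' continuity correction for a 2x2 table.
x2_manual <- sum((abs(counts - expected) - 0.5)^2 / expected)

# The same statistic from chisq.test() applied to the matrix.
x2_builtin <- unname(chisq.test(counts)$statistic)

print(round(expected, 1))
print(c(manual = x2_manual, builtin = x2_builtin))
```

The smallest expected count is about 111, well above the usual threshold of 5, and both versions of the statistic agree with the reported 70.206.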
test$stdres
##           data1$jbexebs
## data1$gndr Not marked    Marked
##     Male    -8.450889  8.450889
##     Female   8.450889 -8.450889
corrplot(test$stdres, is.cor = FALSE)

Positive residuals (blue cells) indicate a positive association between gender and exposure to breathing in smoke at work, while negative residuals (red cells) indicate a negative association. Males are therefore positively associated with being exposed, and females are positively associated with not being exposed.
assocplot(t(cont_table), main = "Residuals and number of observations")

The cells for females who are not exposed to breathing in smoke at work and for males who are exposed (black) indicate that the observed counts are higher than expected. The other two cells (red) show that the observed counts are lower than expected. This means that many more men than women are exposed, and many more women than men are not exposed.
We reject H0 and conclude that gender and exposure to breathing in smoke at work are dependent variables. Men are more likely than women to have been exposed to breathing in smoke at work.
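To put a number on this conclusion, the counts in the contingency table can be converted to row percentages. This is a minimal sketch with the counts typed in by hand; `counts` and `row_pct` are hypothetical names, not objects from the original analysis.

```r
# Counts copied from the contingency table above
# (rows: gender, columns: exposure).
counts <- matrix(c(393, 589, 170, 72), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 Exposure = c("Not marked", "Marked")))

# Share of each gender in each exposure category, in percent.
# margin = 1 makes every row sum to 100.
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

In this sample about 30% of men (170 of 563) and about 11% of women (72 of 661) report exposure.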
gndr - Gender
cgtsday - How many cigarettes smoke on typical day
H0: The average number of cigarettes smoked per day by females is not different from that smoked by males.
H1: The average number of cigarettes smoked per day by females is different from that smoked by males.
data2 <- select(Slovenia, c("gndr","cgtsday"))
str(data2)
## 'data.frame': 1224 obs. of 2 variables:
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
## $ cgtsday: Factor w/ 22 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "variable.labels")= Named chr "Title of dataset" "ESS round" "Edition" "Production date" ...
## ..- attr(*, "names")= chr "name" "essround" "edition" "proddate" ...
## - attr(*, "codepage")= int 65001
data2$cgtsday <- as.numeric(as.character(data2$cgtsday))
ggplot(data2, aes(y = cgtsday, x = gndr)) +
  geom_boxplot(fill = c("skyblue", "salmon")) +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  labs(title = "Difference in number of cigarettes smoked per day between males and females", x = "Gender", y = "Cigarettes smoked per day")

The boxplot shows that men in Slovenia smoke more cigarettes per day than women.
describeBy(data2, data2$gndr)
##
## Descriptive statistics by group
## group: Male
##         vars   n  mean   sd median trimmed  mad min max range skew
## gndr*      1 563  1.00 0.00      1    1.00 0.00   1   1     0  NaN
## cgtsday    2 139 15.32 7.54     17   15.31 4.45   1  40    39 0.01
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.19 0.64
## --------------------------------------------------------
## group: Female
##         vars   n  mean   sd median trimmed  mad min max range skew
## gndr*      1 661  2.00 0.00      2    2.00 0.00   2   2     0  NaN
## cgtsday    2 148 11.11 6.45     10   10.97 7.41   1  30    29 0.45
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.53 0.53
For males the skewness is 0.01, close to the value of 0 expected for a normal distribution. For females it is 0.45, which departs further from normality but is still below 1.0. The negative kurtosis values (-0.19 and -0.53) indicate that both distributions have lighter tails than a normal distribution.
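For reference, skewness and excess kurtosis can be computed from the central moments of a sample. The sketch below uses a small made-up vector, not the survey data, and the helper names `skew_moment` and `excess_kurtosis` are hypothetical. `describeBy()` may apply a slightly different small-sample scaling, so its values need not match these formulas to the last digit.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5. Zero for symmetric data.
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second
# central moment, minus 3 so that a normal distribution scores 0.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Made-up, symmetric toy data with one missing value.
toy <- c(5, 10, 10, 15, 15, 15, 20, 20, 25, NA)

print(skew_moment(toy))      # 0 for this symmetric sample
print(excess_kurtosis(toy))  # negative: lighter tails than a normal distribution
```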
ggplot(data2, aes(x = cgtsday, fill = gndr), na.rm = TRUE) +
geom_histogram(binwidth = 5, alpha = 0.75) +
geom_density(alpha = 0.5) +
labs(title = "Smoking habits in Slovenia",x = "Cigarettes smoked per day", y = "Density") +
geom_vline(aes(xintercept = mean(data2$cgtsday, na.rm = TRUE), color='mean'), show.legend = TRUE, size = 1) +
geom_vline(aes(xintercept = median(data2$cgtsday, na.rm = TRUE), color='median'), show.legend = TRUE, size = 1) +
scale_fill_manual(values = c("skyblue", "salmon"), guide = FALSE) +
scale_color_manual(values = c("hotpink4", "orangered")) +
theme(legend.title = element_blank()) +
facet_grid(data2$gndr)

The histogram shows that the distribution in each group is close to normal, but not exactly normal.
m <- data2 %>%
filter(gndr == 'Male')
f <- data2 %>%
filter(gndr == 'Female')
par(mfrow = c(1,2))
qqnorm(m$cgtsday); qqline(m$cgtsday, col= "red", lty = 5, lwd = 2)
qqnorm(f$cgtsday); qqline(f$cgtsday, col = "red", lty = 5, lwd = 2)

The Q-Q plots show that both distributions deviate from normality.
t.test(f$cgtsday, m$cgtsday, paired = F, var.equal = F)
##
## Welch Two Sample t-test
##
## data: f$cgtsday and m$cgtsday
## t = -5.0553, df = 272.2, p-value = 7.898e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.837954 -2.565409
## sample estimates:
## mean of x mean of y
## 11.11486 15.31655
On average, men smoke about 15 cigarettes a day, whereas women smoke about 11. The Welch t-test gives t(272.2) = -5.06 (p-value < 0.001), with a 95% confidence interval for the difference (women minus men) from -5.84 to -2.57, so we can claim that there is a statistically significant difference between men and women in the number of cigarettes they smoke per day.
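The Welch statistic can be reconstructed from the group summaries alone. The following sketch is an illustration, not part of the original analysis: it takes the group sizes and standard deviations from the descriptive statistics and the means from the t-test output, and all variable names are hypothetical. Because the standard deviations are rounded to two decimals, the results agree with the `t.test()` output only approximately.

```r
# Summary statistics as reported in the output above.
n_f <- 148; mean_f <- 11.11486; sd_f <- 6.45   # females
n_m <- 139; mean_m <- 15.31655; sd_m <- 7.54   # males

# Squared standard error of each group mean.
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t-statistic: difference in means over the combined standard error.
se <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference.
p_value <- 2 * pt(-abs(t_stat), welch_df)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, welch_df) * se

print(c(t = t_stat, df = welch_df, p = p_value))
print(ci)
```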
wilcox.test(x = f$cgtsday, y = m$cgtsday, paired = F)
##
## Wilcoxon rank sum test with continuity correction
##
## data: f$cgtsday and m$cgtsday
## W = 6985.5, p-value = 1.581e-06
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank sum test gives W = 6985.5 (p-value < 0.001), which means that the difference between men and women in the number of cigarettes smoked per day is statistically significant.
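The W statistic has a direct reading: for `wilcox.test(x, y)`, R reports the number of (x, y) pairs in which the x value is larger, with ties counted as one half. The sketch below checks this on two small made-up vectors (not the survey data) and then divides the reported W by the number of female-male pairs, using the group sizes from the descriptive statistics. All variable names are hypothetical.

```r
# Made-up toy samples with one tie between the groups.
x <- c(3, 5, 8, 10)
y <- c(5, 9, 12, 15, 20)

# W as reported by wilcox.test(); exact = FALSE avoids the
# warning about exact p-values in the presence of ties.
w_builtin <- unname(wilcox.test(x, y, exact = FALSE)$statistic)

# W counted by hand: pairs with x > y, plus half of the tied pairs.
w_pairs <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

print(c(builtin = w_builtin, pairs = w_pairs))

# Applied to the reported result: W divided by the number of
# female-male pairs (148 women and 139 men with a valid answer).
share <- 6985.5 / (148 * 139)
print(round(share, 2))
```

Read this way, W = 6985.5 corresponds to roughly 34% of female-male pairs in which the woman reports the higher number of cigarettes, which is consistent with the lower female mean.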
We reject H0 and conclude that the average number of cigarettes smoked per day by males is higher than that smoked by females.