PROJECT 2. Chi-squared and T-test

TEAM: Alex Stephenson, Anna Gorobtsova, Elizaveta Dyachenko, Marina Romanova
COUNTRY: Slovenia
TOPIC: Health and Care

Firstly, we download all the needed libraries and database.

library(rmarkdown)
library(foreign)
library(ggplot2)
library(gapminder)
library(dplyr)
library(psych)
library(corrplot)
library(knitr)
library(data.table)
library(moments)
library(sjPlot)

Slovenia <- read.spss("ESS7SI.sav", use.value.labels = T, to.data.frame = T, na.omit = T)

CHI-SQUARED DISTRIBUTION

Variables:

gndr - Gender
jbexebs - In any job, ever exposed to: breathing in smoke, fumes, powder, dust

Hypothesis:

H0: There is no significant association between gender and being exposed to breathing smoke in a job.
H1: There is significant association between gender and being exposed to breathing smoke in a job.

We select necessary variables, make a separate database, and inspect the data.

chi <- select(Slovenia, c("gndr","jbexebs"))
str(chi)

## 'data.frame':    1224 obs. of  2 variables:
##  $ gndr   : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
##  $ jbexebs: Factor w/ 2 levels "Not marked","Marked": 1 1 1 1 1 1 2 1 2 2 ...
##  - attr(*, "variable.labels")= Named chr  "Title of dataset" "ESS round" "Edition" "Production date" ...
##   ..- attr(*, "names")= chr  "name" "essround" "edition" "proddate" ...
##  - attr(*, "codepage")= int 65001

We create a stacked barplot.

ggplot(chi, aes(x = jbexebs, fill = gndr)) +
  geom_bar() +
  labs(x = 'Exposure to breathing smoke in a job', y = 'Number of people', title = 'Being exposed to breathing smoke in a job distributed by gender') +
  scale_fill_manual(name = "Gender", values = c("skyblue", "salmon"), labels = c("Male", "Female")) +
  scale_x_discrete(labels = c("Not exposed", "Exposed"))

From the barplot it can be seen that there are much less people in Slovenia who are exposed to breathing smoke in a job. Nevertheless, there are more men than women among them.

We build and plot contingency table.

cont_table <- table(chi$gndr, chi$jbexebs, dnn = c("Gender", "Exposed to breathing smoke in a job"))
cont_table

##         Exposed to breathing smoke in a job
## Gender   Not marked Marked
##   Male          393    170
##   Female        589     72

sjp.xtab(chi$jbexebs, chi$gndr, geom.colors = c("skyblue", "salmon", "greenyellow"), axis.titles = "Exposure to breathing smoke in a job", legend.title = "Gender", axis.labels = c("Not exposed", "Exposed"))

We apply Chi-squared test formula.

test <- chisq.test(chi$gndr, chi$jbexebs)
test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  chi$gndr and chi$jbexebs
## X-squared = 70.206, df = 1, p-value < 2.2e-16

Conclusion:

With p-value < 2.2e-16 we decline the H0-hypothesis and can claim that there is a significant association between breathing smoke in a job and gender.

We analyze the residuals.

test$stdres

##         chi$jbexebs
## chi$gndr Not marked    Marked
##   Male    -8.450889  8.450889
##   Female   8.450889 -8.450889

corrplot(test$stdres, is.cor = FALSE)

Positive residuals (blue cells) mean that there is a positive association between the gender and being exposed to breathing smoke in a job.

Negative residuals (red cells) mean that there is a negative association.

assocplot(t(cont_table), main = "Residuals and number of observations")

The cells for females which are exposed to breathing smoke in a job and for males who are not contain fewer observations than it was expected.

Two other cells contain more observations than it was expected.

T-TEST

Variables:

gndr - Gender
cgtsday - How many cigarettes smoke on typical day

Hypothesis:

H0: There is no significant difference between genders in amount of cigarettes they smoke every day.
H1: There is significant difference between genders in amount of cigarettes they smoke every day.

We inspect the data, change its type where needed, and remove outliers.

t <- select(Slovenia, c("gndr","cgtsday")) 
str(t)

## 'data.frame':    1224 obs. of  2 variables:
##  $ gndr   : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
##  $ cgtsday: Factor w/ 22 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "variable.labels")= Named chr  "Title of dataset" "ESS round" "Edition" "Production date" ...
##   ..- attr(*, "names")= chr  "name" "essround" "edition" "proddate" ...
##  - attr(*, "codepage")= int 65001

t$cgtsday <- as.numeric(as.character(t$cgtsday))
table(t$gndr, t$cgtsday)

##         
##           1  2  3  4  5  6  7  8  9 10 11 12 13 15 16 17 20 23 25 28 30 40
##   Male    5  1  4  4  7  0  6  1  0 21  1  1  0 17  1  1 57  0  4  1  6  1
##   Female  5  6  5 11  9  9  1  3  1 45  0  2  1 16  0  0 31  1  0  0  2  0

t$cgtsday[t$cgtsday > 30] <- 30

We build a boxpolot.

ggplot(t, aes(y = cgtsday, x = gndr)) + 
  geom_boxplot(fill = c("skyblue", "salmon")) +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  labs(title = "Difference in amount of cigerettes smoked a day between males and females", x = "Gender", y = "Cigarettes smoked per day")

From the boxplot it is seen that in Slovenia men smoke more cigarettes per day than women.

We check the distribution of our data with numbers.

describeBy(t, t$gndr)

## 
##  Descriptive statistics by group 
## group: Male
##         vars   n  mean   sd median trimmed  mad min max range  skew
## gndr*      1 563  1.00 0.00      1    1.00 0.00   1   1     0   NaN
## cgtsday    2 139 15.24 7.35     17   15.31 4.45   1  30    29 -0.18
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.69 0.62
## -------------------------------------------------------- 
## group: Female
##         vars   n  mean   sd median trimmed  mad min max range skew
## gndr*      1 661  2.00 0.00      2    2.00 0.00   2   2     0  NaN
## cgtsday    2 148 11.11 6.45     10   10.97 7.41   1  30    29 0.45
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.53 0.53

For male skew value is 0.01, so as should be for normal distribution. For female it is 0.45, so less normally but still less than 1.0. The negative kurtosis values show that the data has lighter tails than the standard distribution.

We build a histogram.

ggplot(t, aes(x = cgtsday, fill = gndr), na.rm = TRUE) +
  geom_histogram(binwidth = 5, alpha = 0.75) +
  geom_density(alpha = 0.5) +
  labs(title = "Smoking habits in Slovenia",x = "Cigarettes smoked per day", y = "Density") +
  geom_vline(aes(xintercept = mean(t$cgtsday, na.rm = TRUE), color='mean'), show.legend = TRUE, size = 1) +
  geom_vline(aes(xintercept = median(t$cgtsday, na.rm = TRUE), color='median'), show.legend = TRUE, size = 1) +
  scale_fill_manual(values = c("skyblue", "salmon"), guide = FALSE) +
  scale_color_manual(values = c("hotpink4", "orangered")) +
  theme(legend.title = element_blank()) +
  facet_grid(t$gndr)

Histogram shows that distribution in both groups is close to normal but still is not.

We filter our data by gender and build Q-Q plot.

m <- t %>%
  filter(gndr == 'Male')
f <- t %>%
  filter(gndr == 'Female')
par(mfrow = c(1,2))
qqnorm(m$cgtsday); qqline(m$cgtsday, col= "red", lty = 5, lwd = 2)
qqnorm(f$cgtsday); qqline(f$cgtsday, col = "red", lty = 5, lwd = 2)

Q-Q plots show that the distributions are not really normal.

We check if the variances are equal and apply T-test formula.

var(f$cgtsday, na.rm = T)

## [1] 41.65338

var(m$cgtsday, na.rm = T)

## [1] 54.04118

Variances are not equal.

t.test(f$cgtsday, m$cgtsday, paired = F, var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  f$cgtsday and m$cgtsday
## t = -5.0444, df = 274.88, p-value = 8.276e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.741407 -2.518072
## sample estimates:
## mean of x mean of y 
##  11.11486  15.24460

Conclusion:

On average men smoke 15 cigarettes a day whereas women smoke only 11. The t-statistic t(274.88) = - 5 (p-value < 0.001), so we reject the H0-hypothesis and claim that there is significant difference between men and women in amount of cigarettes they smoke in a day.

We apply non-parametric test.

wilcox.test(x = f$cgtsday, y = m$cgtsday, paired = F)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  f$cgtsday and m$cgtsday
## W = 6986.5, p-value = 1.592e-06
## alternative hypothesis: true location shift is not equal to 0

Conclusion:

The Wilcoxon rank sum W = 6986.5 (p-value < 0.001), which means that different genders smoke different amount of cigarettes in a day and this difference is statistically significant.

PROJECT 2. Chi-squared and T-test

The Boring Team

25.02.2019

CHI-SQUARED DISTRIBUTION

Variables:

Hypothesis:

Conclusion:

T-TEST

Variables:

Hypothesis:

Conclusion:

Conclusion: