Chi-squared and T-test

INTRODUCTION

TEAM: Alex Stephenson, Anna Gorobtsova, Elizaveta Dyachenko, Marina Romanova

COUNTRY: Slovenia

YEAR: 2014

TOPIC: Health and Care

INDIVIDUAL CONTRIBUTION:

We chose variables and formulated the hypotheses and conclusions together.
Anna and Elizaveta mostly worked on the Chi-squared test, while Alex and Marina worked on the T-test. We still solved problems together across all parts of the analysis.
For the Chi-squared test, Anna created the barplot and the contingency table and ran the test, while Elizaveta added more text and analyzed the residuals.
For the T-test, Marina created the boxplot and checked normality with Q-Q plots, while Alex created the histogram and ran both the T-test and the non-parametric test.

First, we load the required libraries and the dataset.

library(rmarkdown)
library(foreign)
library(ggplot2)
library(gapminder)
library(dplyr)
library(psych)
library(corrplot)
library(knitr)
library(data.table)
library(moments)
library(sjPlot)

# na.omit is not a read.spss() argument, so it is dropped here; missing values stay as NA
Slovenia <- read.spss("ESS7SI.sav", use.value.labels = TRUE, to.data.frame = TRUE)

CHI-SQUARED TEST

Variables:

gndr - Gender
jbexebs - In any job, ever exposed to: breathing in smoke, fumes, powder, dust

Hypothesis:

H0: Gender and being exposed to breathing smoke in a job are independent variables.
H1: Gender and being exposed to breathing smoke in a job are not independent variables.

We select the necessary variables into a separate data frame and inspect the data.

data1 <- select(Slovenia, c("gndr","jbexebs"))
str(data1)

## 'data.frame':    1224 obs. of  2 variables:
##  $ gndr   : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
##  $ jbexebs: Factor w/ 2 levels "Not marked","Marked": 1 1 1 1 1 1 2 1 2 2 ...
##  - attr(*, "variable.labels")= Named chr  "Title of dataset" "ESS round" "Edition" "Production date" ...
##   ..- attr(*, "names")= chr  "name" "essround" "edition" "proddate" ...
##  - attr(*, "codepage")= int 65001

We create a stacked barplot.

sjp.xtab(data1$jbexebs, data1$gndr, bar.pos = "stack", geom.colors = c("skyblue", "salmon", "greenyellow"), 
         title = "Being exposed to breathing smoke in a job distributed by gender", axis.titles = "Exposure to breathing smoke in a job", 
         legend.title = "Gender", axis.labels = c("Not exposed", "Exposed"), show.total = FALSE, margin = "row")

The barplot shows that far fewer people in Slovenia have been exposed to breathing smoke in a job than have not. Among those exposed, however, there are more men than women.

We build a contingency table and check the assumptions.

cont_table <- table(data1$gndr, data1$jbexebs, dnn = c("Gender", "Exposed to breathing smoke in a job"))
cont_table

##         Exposed to breathing smoke in a job
## Gender   Not marked Marked
##   Male          393    170
##   Female        589     72

The contingency table shows that the assumptions of the Chi-squared test are met:

There are no empty cells.
No more than 20% of cells have an expected count below 5; here the smallest observed count is 72, and every expected count is well above 5.
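
The 20% rule of thumb is usually stated in terms of expected counts under independence rather than observed counts. A minimal sketch of that check, rebuilding the table from the counts printed above (the object names `obs` and `expected` are ours and not part of the original analysis):

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(393, 170,
                589,  72),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Exposure = c("Not marked", "Marked")))

# Expected count per cell under independence: row total * column total / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# Share of cells with an expected count below 5 (rule of thumb: at most 20%)
mean(expected < 5)
```

The same matrix is also available as `chisq.test(obs)$expected`.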

We run the Chi-squared test.

test <- chisq.test(data1$gndr, data1$jbexebs)
test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data1$gndr and data1$jbexebs
## X-squared = 70.206, df = 1, p-value < 2.2e-16

With X-squared(1) = 70.206 and p-value < 2.2e-16 (Yates' continuity correction applied), we can claim that there is a statistically significant association between gender and being exposed to breathing smoke in a job.
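
To make the reported statistic less of a black box, the following sketch recomputes it by hand from the counts in the contingency table. The variable names are ours; the formula is the standard Yates-corrected Pearson statistic for a 2x2 table, which should agree with the `chisq.test()` output up to rounding.

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(393, 170,
                589,  72),
              nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction subtracts 0.5 from every |observed - expected| before squaring
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Upper-tail p-value of a chi-squared distribution with (2 - 1) * (2 - 1) = 1 df
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

round(x2_yates, 3)
p_value
```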

We analyze the residuals.

test$stdres

##           data1$jbexebs
## data1$gndr Not marked    Marked
##     Male    -8.450889  8.450889
##     Female   8.450889 -8.450889

corrplot(test$stdres, is.cor = FALSE)

Blue cells mark positive residuals (more observations than expected under independence), and red cells mark negative residuals (fewer than expected). Males are positively associated with being exposed to breathing smoke in a job, and females are positively associated with not being exposed. With absolute values of about 8.45, far beyond the usual cutoff of 2, every cell contributes to the association.
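
As a sketch of where the values in `test$stdres` come from, the standardized residuals can be recomputed from the observed counts. The object names below are ours; the formula divides each raw residual by its estimated standard error under independence.

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(393, 170,
                589,  72),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Exposure = c("Not marked", "Marked")))

n <- sum(obs)
expected  <- outer(rowSums(obs), colSums(obs)) / n
row_share <- rowSums(obs) / n
col_share <- colSums(obs) / n

# Standardized residual: (O - E) / sqrt(E * (1 - row share) * (1 - column share))
stdres <- (obs - expected) / sqrt(expected * outer(1 - row_share, 1 - col_share))
round(stdres, 3)
```

In a 2x2 table all four residuals share the same absolute value, which is why the output above shows only +8.45 and -8.45.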

assocplot(t(cont_table), main = "Residuals and number of observations")

The black cells (females who are not exposed to breathing smoke in a job and males who are) contain more observations than expected. The other two cells (red) contain fewer observations than expected. It means that far more men than women are exposed to breathing smoke in a job and far more women than men are not.

Conclusion

We reject H0 and conclude that gender and being exposed to breathing smoke in a job are dependent variables. Men are more likely than women to have been exposed to breathing smoke in a job: about 30% of men (170 of 563) versus about 11% of women (72 of 661).
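
The size of the gender difference can be read off the row proportions of the contingency table. A small sketch, with the counts copied from the table above and object names of our own choosing:

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(393, 170,
                589,  72),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Exposure = c("Not marked", "Marked")))

# Share of each exposure category within each gender (every row sums to 1)
row_prop <- prop.table(obs, margin = 1)
round(100 * row_prop, 1)
```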

T-TEST

Variables:

gndr - Gender
cgtsday - How many cigarettes smoke on typical day

Hypothesis:

H0: The average number of cigarettes smoked per day by females is not different from that smoked by males.
H1: The average number of cigarettes smoked per day by females is different from that smoked by males.

We inspect the data and convert the cigarette variable from a factor to a numeric type.

data2 <- select(Slovenia, c("gndr","cgtsday")) 
str(data2)

## 'data.frame':    1224 obs. of  2 variables:
##  $ gndr   : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 1 2 1 2 ...
##  $ cgtsday: Factor w/ 22 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "variable.labels")= Named chr  "Title of dataset" "ESS round" "Edition" "Production date" ...
##   ..- attr(*, "names")= chr  "name" "essround" "edition" "proddate" ...
##  - attr(*, "codepage")= int 65001

data2$cgtsday <- as.numeric(as.character(data2$cgtsday))
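
The detour through `as.character()` matters because calling `as.numeric()` directly on a factor returns the internal level codes, not the labels. A toy illustration (the values are made up for the example and are not survey data):

```r
# A toy factor standing in for cgtsday (illustrative values only)
x <- factor(c("10", "20", "40", NA))

codes  <- as.numeric(x)                # internal level codes
values <- as.numeric(as.character(x))  # the numbers written in the labels

codes
values
```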

We build a boxplot.

ggplot(data2, aes(y = cgtsday, x = gndr)) + 
  geom_boxplot(fill = c("skyblue", "salmon")) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  labs(title = "Difference in the number of cigarettes smoked per day between males and females", x = "Gender", y = "Cigarettes smoked per day")

The boxplot shows that men in Slovenia tend to smoke more cigarettes per day than women: both the median and the mean (marked with a cross) are higher for men.

We check the distribution of the data numerically with descriptive statistics.

describeBy(data2, data2$gndr)

## 
##  Descriptive statistics by group 
## group: Male
##         vars   n  mean   sd median trimmed  mad min max range skew
## gndr*      1 563  1.00 0.00      1    1.00 0.00   1   1     0  NaN
## cgtsday    2 139 15.32 7.54     17   15.31 4.45   1  40    39 0.01
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.19 0.64
## -------------------------------------------------------- 
## group: Female
##         vars   n  mean   sd median trimmed  mad min max range skew
## gndr*      1 661  2.00 0.00      2    2.00 0.00   2   2     0  NaN
## cgtsday    2 148 11.11 6.45     10   10.97 7.41   1  30    29 0.45
##         kurtosis   se
## gndr*        NaN 0.00
## cgtsday    -0.53 0.53

For males the skew is 0.01, essentially the 0 expected for a normal distribution. For females it is 0.45, which is further from normal but still below 1.0. The negative kurtosis values (excess kurtosis, which is 0 for a normal distribution) show that the data have lighter tails than the normal distribution.

We build a histogram.

ggplot(data2, aes(x = cgtsday, fill = gndr)) +
  # Draw the histogram on the density scale so it matches geom_density() and the y-axis label
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.75) +
  geom_density(alpha = 0.5) +
  labs(title = "Smoking habits in Slovenia", x = "Cigarettes smoked per day", y = "Density") +
  # Vertical lines mark the overall mean and median across both genders
  geom_vline(aes(xintercept = mean(data2$cgtsday, na.rm = TRUE), color = 'mean'), show.legend = TRUE, size = 1) +
  geom_vline(aes(xintercept = median(data2$cgtsday, na.rm = TRUE), color = 'median'), show.legend = TRUE, size = 1) +
  scale_fill_manual(values = c("skyblue", "salmon"), guide = "none") +
  scale_color_manual(values = c("hotpink4", "orangered")) +
  theme(legend.title = element_blank()) +
  facet_grid(gndr ~ .)

The histogram shows that the distribution in both groups is close to normal but not exactly normal.

We split the data by gender and build a Q-Q plot for each group.

m <- data2 %>%
  filter(gndr == 'Male')
f <- data2 %>%
  filter(gndr == 'Female')
par(mfrow = c(1,2))
qqnorm(m$cgtsday); qqline(m$cgtsday, col= "red", lty = 5, lwd = 2)
qqnorm(f$cgtsday); qqline(f$cgtsday, col = "red", lty = 5, lwd = 2)

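The Q-Q plots confirm that neither distribution is strictly normal. With 139 and 148 observations per group, the T-test is still reasonably robust to moderate departures from normality.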

We run Welch's two-sample T-test, which does not assume equal variances.

t.test(f$cgtsday, m$cgtsday, paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  f$cgtsday and m$cgtsday
## t = -5.0553, df = 272.2, p-value = 7.898e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.837954 -2.565409
## sample estimates:
## mean of x mean of y 
##  11.11486  15.31655

On average men smoke about 15 cigarettes a day, whereas women smoke about 11. The test gives t(272.2) = -5.06 (p-value < 0.001), and the 95% confidence interval for the difference (women minus men) runs from -5.84 to -2.57 cigarettes, so we can claim that there is a statistically significant difference between men and women in the number of cigarettes they smoke per day.
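
As a cross-check, the Welch statistic can be reconstructed from the group summaries alone. The sketch below uses the group sizes and standard deviations from the `describeBy()` output and the means from the `t.test()` output; the variable names are ours, and because the standard deviations are rounded to two decimals the results match the reported values only approximately.

```r
# Summary statistics taken from the describeBy() and t.test() output above
n_f <- 148; mean_f <- 11.11486; sd_f <- 6.45   # females
n_m <- 139; mean_m <- 15.31655; sd_m <- 7.54   # males

# Squared standard error of each group mean, without pooling the variances
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se  <- sqrt(v_f + v_m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean_f - mean_m) / se
df     <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df), 2)
p_value
round(ci, 2)
```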

Because normality is doubtful, we also apply a non-parametric test (the Wilcoxon rank-sum test) as a check.

wilcox.test(x = f$cgtsday, y = m$cgtsday, paired = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  f$cgtsday and m$cgtsday
## W = 6985.5, p-value = 1.581e-06
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank-sum test gives W = 6985.5 (p-value < 0.001), which means that men and women smoke different numbers of cigarettes per day and that this difference is statistically significant. This agrees with the T-test result.
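
The W statistic itself can be given a rough interpretation. In R, W for the first sample counts the pairs in which the first value exceeds the second, with ties counted as one half, so dividing by the number of pairs estimates the probability that a randomly chosen woman smokes more than a randomly chosen man. This is an addition of ours rather than part of the original analysis; the group sizes are taken from the `describeBy()` output.

```r
w   <- 6985.5  # W reported by wilcox.test(f$cgtsday, m$cgtsday)
n_f <- 148     # women with a non-missing cgtsday value
n_m <- 139     # men with a non-missing cgtsday value

# Share of female-male pairs in which the woman smokes more (ties count as 0.5)
p_f_greater <- w / (n_f * n_m)
round(p_f_greater, 3)
```

A value of 0.5 would indicate no tendency in either direction; a value clearly below 0.5 is consistent with men smoking more.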

Conclusion

We reject H0 and conclude that the average number of cigarettes smoked per day differs by gender: men smoke more on average than women (about 15 versus 11 cigarettes).

The Boring Team

25.02.2019
