Chi-squared and t-test

2BK team: Bakhareva, Borisenko, Kireeva, Kuzmicheva

24/02/2019

Identifying the topic and describing individual contributions

Hello. We are team 2BK. Our topic is “Politics”, and the country we chose to study is Ireland (round 8). The team members are Bakhareva Anastasia, Borisenko Iana, Kireeva Irina and Kuzmicheva Daria. We focus on survey items for Ireland that cover both politics and respondents' personal information.

Individual contributions are as follows:

Anastasia Bakhareva: constructing stacked barplot №2, conclusion about the representativeness of the groups, constructing the histogram and drawing conclusions about the distribution from it, creating the report

Iana Borisenko: constructing the contingency table, constructing stacked barplot №1, building the assocplot and drawing conclusions about the residuals, overall conclusion, creating the report

Irina Kireeva: manipulating the data set, creating the new variable “politics13$lr”, building the boxplot and drawing conclusions from it, descriptive statistics for the t-test, running the Wilcoxon test and drawing conclusions

Daria Kuzmicheva: describing variables with different measurement scales, running the chi-squared test and drawing conclusions, conclusion about the distribution from skew and kurtosis, constructing the Q-Q plot and drawing conclusions

Preparing data for analysis.

First of all, we load all the libraries needed for the analysis.

library(readr)
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(sjmisc) 
library(sjstats) 
library(ggeffects)  
library(sjPlot)
library(knitr)
library(psych)

Next, we load our dataset. It combines data on both politics and personal information.

politics_gender <- read_csv("~/politics_gender.csv")

Here we filter our data to remove the observations that are not useful for the analysis.

politics <- politics_gender %>% 
  select(agea, lrscale, sgnptit, vote)
politics <- na.omit(politics)
politics13 <- politics %>% 
  filter(lrscale != 77) %>% 
  filter(lrscale != 88) %>% 
  filter(lrscale != 99 ) 

politics13 <- politics %>%
  filter(sgnptit != 7) %>%
  filter(sgnptit != 8) %>%
  filter(sgnptit != 9) 
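
Note that the second pipeline above starts again from `politics` rather than from the already filtered `politics13`, so as written the `lrscale` filter does not carry over to the final data set. The following is a minimal sketch of applying both filters in a single step, assuming that is the intent. The toy data frame is hypothetical, and base R `subset()` is used instead of dplyr only to keep the example self-contained.

```r
# Hypothetical toy data using the same column names as the report
politics <- data.frame(
  agea    = c(34, 51, 27, 63, 45),
  lrscale = c(5, 77, 2, 8, 88),
  sgnptit = c(1, 2, 7, 2, 1),
  vote    = c(1, 1, 2, 1, 2)
)

# Codes treated as missing (taken from the filters used above)
lr_missing  <- c(77, 88, 99)
sgn_missing <- c(7, 8, 9)

# One filtering step, so neither condition is lost
politics13 <- subset(
  politics,
  !(lrscale %in% lr_missing) & !(sgnptit %in% sgn_missing)
)

politics13
```

With dplyr, the same effect can be obtained by starting the second pipeline from `politics13` instead of `politics`.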

Describing variables

First, we recode one of the variables into three categories to make it easier to work with. Then, we update our dataset.

politics13$lr <- ifelse(politics13$lrscale <= 3, "Left",
                    ifelse(politics13$lrscale >= 7, "Right", "Middle"))
politics13 <- politics13 %>% 
  select(- lrscale)
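
With the nested `ifelse()` above, every value of 7 or more is labelled "Right", which would include missing-value codes such as 77 if any remain in the data. As an illustration only, the same three-way recoding can be sketched with `cut()`, assuming the valid scale runs from 0 to 10; values outside that range then become `NA` instead of being assigned to a category. The input vector is hypothetical.

```r
# Hypothetical left-right placements, including one missing-value code (77)
lrscale <- c(0, 3, 4, 6, 7, 10, 77)

# Intervals are closed on the right: (-Inf, 3], (3, 6], (6, 10]
lr <- cut(
  lrscale,
  breaks = c(-Inf, 3, 6, 10),
  labels = c("Left", "Middle", "Right")
)

# Counts per category, with NA shown explicitly
table(lr, useNA = "ifany")
```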

Now, let's look at the number of observations and the number of variables.

dim(politics13)
## [1] 2752    4

Next, we describe the chosen variables.

Label <- c("`sgnptit`", "`lr`", "`agea`", "`vote`")
Meaning <- c("Signed petition last 12 months", "Placement on left right scale", "Age", "Voted last national election")
Level_Of_Measurement <- c("Nominal", "Nominal", "Ratio", "Nominal")
Test <- c("Chi-squared test", "Chi-squared test", "T-test for independent samples", "T-test for independent samples")
df <- data.frame(Label, Meaning, Level_Of_Measurement, Test, stringsAsFactors = FALSE)
kable(df)

| Label   | Meaning                        | Level_Of_Measurement | Test                           |
|---------|--------------------------------|----------------------|--------------------------------|
| sgnptit | Signed petition last 12 months | Nominal              | Chi-squared test               |
| lr      | Placement on left right scale  | Nominal              | Chi-squared test               |
| agea    | Age                            | Ratio                | T-test for independent samples |
| vote    | Voted last national election   | Nominal              | T-test for independent samples |

Exploratory analysis of categories

First, we select the variables needed for the chi-squared test. Then we build a contingency table.

politics_chi <- politics13 %>% 
  select(lr, sgnptit)
politics_chi$sgnptit <- factor(politics_chi$sgnptit, labels = c("Yes", "No"), ordered= F,exclude = NA)
ContigencyTable <- table(politics_chi$lr, politics_chi$sgnptit)
kable(ContigencyTable)

|        | Yes | No   |
|--------|-----|------|
| Left   | 142 | 192  |
| Middle | 283 | 1060 |
| Right  | 164 | 911  |
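
The raw counts are easier to compare as row proportions, which is also what the stacked barplots display. A minimal sketch, re-entering the counts from the contingency table by hand (the object names `tab` and `row_share` are illustrative, not part of the original analysis):

```r
# Counts copied from the contingency table above
tab <- matrix(
  c(142, 283, 164,
    192, 1060, 911),
  nrow = 3,
  dimnames = list(c("Left", "Middle", "Right"), c("Yes", "No"))
)

# Share of "Yes" and "No" within each left-right group (rows sum to 1)
row_share <- prop.table(tab, margin = 1)
round(row_share, 3)
```

About 43% of the Left group signed a petition, compared with about 21% of the Middle group and about 15% of the Right group.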

To check whether our categories are suitable for a chi-squared test, we create stacked barplots and analyze them.

Stacked barplot №1

ggplot() +
  geom_bar(data = politics_chi, aes(x = lr, fill = sgnptit), position = "fill")+
  coord_flip()+
  xlab("Party affiliation") + 
  ylab("Percentage of people") +
  ggtitle("Participation in signing petitions due to party affiliation")

Stacked barplot №2

sjp.xtab(politics_chi$lr, politics_chi$sgnptit, type = "bar", margin ="row",
  bar.pos = "stack", title = "Participation in signing petitions due to party affiliation", title.wtd.suffix = NULL,
  axis.titles = NULL, axis.labels = NULL, legend.title = NULL,
  legend.labels = NULL, weight.by = NULL, rev.order = FALSE,
  show.values = TRUE, show.n = TRUE, show.prc = TRUE, show.total = TRUE,
  show.legend = TRUE, show.summary = TRUE, summary.pos = "r",
  string.total = "Total", wrap.title = 50, wrap.labels = 15,
  wrap.legend.title = 20, wrap.legend.labels = 20, geom.size = 0.7,
  geom.spacing = 0.1, geom.colors = "Paired", dot.size = 3,
  smooth.lines = FALSE, grid.breaks = 0.2, expand.grid = FALSE,
  ylim = NULL, vjust = "bottom", hjust = "left", y.offset = NULL,
  coord.flip = TRUE, prnt.plot = TRUE)

Chi-square test

To check whether the distribution of petition signatories across political preferences could have arisen by chance, we ran a chi-squared test. The null hypothesis is that placement on the left-right scale and signing a petition are independent; the alternative is that they are associated.

So, let's run the chi-squared test.

colnames(ContigencyTable) <- c("Petition +", "Petition -")
# the rows of the table are ordered Left, Middle, Right
rownames(ContigencyTable) <- c("L", "M", "R")
chi.test <- chisq.test(ContigencyTable)
chi.test
## 
##  Pearson's Chi-squared test
## 
## data:  ContigencyTable
## X-squared = 112.73, df = 2, p-value < 2.2e-16

Next, we look at the observed counts, the expected counts and the standardized residuals.

kable(chi.test$observed) #for observed

|   | Petition + | Petition - |
|---|------------|------------|
| L | 142        | 192        |
| M | 283        | 1060       |
| R | 164        | 911        |

kable(chi.test$expected) #for expected

|   | Petition + | Petition - |
|---|------------|------------|
| L | 71.48474   | 262.5153   |
| M | 287.43714  | 1055.5629  |
| R | 230.07812  | 844.9219   |

kable(chi.test$stdres) #for residuals

|   | Petition + | Petition - |
|---|------------|------------|
| L | 10.0361808 | -10.0361808 |
| M | -0.4125685 | 0.4125685  |
| R | -6.2946804 | 6.2946804  |

assocplot(t(ContigencyTable), main="Residuals and number of observations" )
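
To make the link between the three tables explicit, the expected counts, the test statistic and the standardized residuals can be recomputed by hand from the observed counts. This is a sketch for illustration: the counts are re-entered from the contingency table, and the object names are not part of the original analysis.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(142, 283, 164,
    192, 1060, 911),
  nrow = 3,
  dimnames = list(c("Left", "Middle", "Right"), c("Yes", "No"))
)

n         <- sum(observed)
row_total <- rowSums(observed)
col_total <- colSums(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(row_total, col_total) / n

# Pearson chi-squared statistic
chi_sq <- sum((observed - expected)^2 / expected)

# Standardized residuals:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_total / n, 1 - col_total / n))

round(expected, 2)
round(chi_sq, 2)
round(std_res, 2)
```

Standardized residuals larger than about 2 in absolute value mark cells that deviate notably from independence. Here that is the case for the Left and Right groups, but not for the Middle group.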

Exploring data for T-test

We start by filtering the data to remove the values that are not useful for our test.

politics_ttest <- politics13 %>% 
  select(agea, vote) %>% 
  filter(vote != 3) %>% 
  filter(vote != 7) %>% 
  filter(vote != 8)

Next, let’s compare the age distributions of the two groups with a boxplot.

politics_ttest$vote <- factor(politics_ttest$vote, labels = c("Yes", "No"), ordered= F,exclude = NA)
ggplot() +
  geom_boxplot(data = politics_ttest, aes(x = vote, y = agea), fill="#A44200", col="#A44200", alpha = 0.5) +
  scale_y_continuous(limits = c(0,100)) +
  xlab("Voted last national election") + 
  ylab("Age") +
  ggtitle("Participation in the election due to age")

Checking normality of distribution

The first way to check normality is to look at the descriptive statistics, in particular skew and kurtosis.

describeBy(politics_ttest, politics_ttest$vote)
## 
##  Descriptive statistics by group 
## group: Yes
##       vars    n  mean     sd median trimmed   mad min max range skew
## agea     1 1940 69.88 119.56     56    55.3 19.27  19 999   980 7.49
## vote*    2 1940  1.00   0.00      1     1.0  0.00   1   1     0  NaN
##       kurtosis   se
## agea     55.29 2.71
## vote*      NaN 0.00
## -------------------------------------------------------- 
## group: No
##       vars   n mean     sd median trimmed   mad min max range skew
## agea     1 593   52 104.78     38   39.56 14.83  16 999   983 8.71
## vote*    2 593    2   0.00      2    2.00  0.00   2   2     0  NaN
##       kurtosis  se
## agea     75.76 4.3
## vote*      NaN 0.0
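
The maximum of 999 in both groups, together with the large gap between mean and median, suggests that a missing-value code for age is still in the data. The sketch below assumes that 999 is such a code and shows how removing it before computing group means changes the picture. The toy data frame and the name `age_valid` are hypothetical.

```r
# Hypothetical toy data; 999 is assumed to be a missing-value code for age
politics_ttest <- data.frame(
  agea = c(25, 40, 999, 61, 33, 999, 72),
  vote = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes")
)

# Group means with the code left in: inflated by the 999 entries
aggregate(agea ~ vote, data = politics_ttest, FUN = mean)

# Drop the assumed missing-value code, then recompute the group means
age_valid <- subset(politics_ttest, agea != 999)
aggregate(agea ~ vote, data = age_valid, FUN = mean)
```

Because the t-test compares means, it is sensitive to such extreme values, while rank-based tests are much less affected.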

Next, we check normality with a histogram.

library(ggplot2)
ggplot(politics_ttest, aes(x = agea, fill = vote)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.7, binwidth = 3) +
  xlim(c(0,100)) +
  geom_density(col = "yellow", fill = "white", alpha = 0.1) +
  geom_vline(aes(xintercept = mean(politics_ttest$agea), color = 'mean'), linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(politics_ttest$agea), color = 'median'), linetype="longdash", size=1) +
  scale_color_manual(name = "Measurement", values = c(median = "#cb3f68", mean = "#824acd")) +
  xlab("Age") + 
  ylab("Density") +
  ggtitle("Age distribution of voters and nonvoters")

Finally, we check normality with a Q-Q plot.

# creating subgroups based on voting / non-voting
voteplus <- subset(politics_ttest, vote == "Yes")
voteminus <- subset(politics_ttest, vote == "No")
par(mfrow = c(1,2))
# the y-axis starts at 18 because that is the age at which the Irish are allowed to vote
# (ylim is set in qqnorm() only; qqline() just adds a line to the existing plot)
qqnorm(voteplus$agea, ylim = c(18, 100), main = "Normal Q-Q Plot for vote+"); qqline(voteplus$agea, col = 2)
qqnorm(voteminus$agea, ylim = c(18, 100), main = "Normal Q-Q Plot for vote-"); qqline(voteminus$agea, col = 2)

Conducting T-test

As with political preferences and signed petitions, we want to test whether voting behavior is related to the respondent’s age, so we conduct a t-test. The null hypothesis is that the mean age of voters and non-voters is the same; the alternative is that the mean ages differ.

Now we are going to run T-test.

t.test(politics_ttest$agea ~ politics_ttest$vote)
## 
##  Welch Two Sample t-test
## 
## data:  politics_ttest$agea by politics_ttest$vote
## t = 3.5136, df = 1103.6, p-value = 0.00046
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.893126 27.857861
## sample estimates:
## mean in group Yes  mean in group No 
##          69.87887          52.00337

Double-checking results with non-parametric test

wilcox.test(agea ~ vote, data = politics_ttest)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  agea by vote
## W = 842130, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
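
The statistic W reported by `wilcox.test()` counts the pairs in which an observation from the first group exceeds one from the second group, with ties counted as one half. Dividing W by the number of possible pairs therefore gives an interpretable effect size. The sketch below first checks this on hypothetical toy vectors and then applies it to the W and group sizes reported above.

```r
# Toy check (hypothetical values, no ties): W equals the number of pairs with x > y
x <- c(5, 7, 9)
y <- c(1, 6)
w_toy     <- unname(wilcox.test(x, y)$statistic)
pairs_toy <- sum(outer(x, y, ">"))

# Values copied from the outputs above
W     <- 842130
n_yes <- 1940
n_no  <- 593

# Estimated probability that a randomly chosen voter is older
# than a randomly chosen non-voter
prob_older <- W / (n_yes * n_no)
round(prob_older, 3)
```

A value of about 0.73 means that in roughly 73% of voter/non-voter pairs the voter is the older one.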

Overall conclusion

Thus, having processed the data and conducted several statistical tests, we can confidently assert the following: