Identifying the topic and describing individual contributions

Hello. We are team 2BK. Our topic is “Politics”, and the country we chose to study is Ireland (round 8). The team members are Bakhareva Anastasia, Borisenko Iana, Kireeva Irina and Kuzmicheva Daria. We focus on survey results about both politics and respondents' personal information in Ireland.

Throughout the report, our hypotheses and conclusions are marked in this way.

Individual contributions are as follows:

Anastasia Bakhareva: constructing stacked barplot №2, conclusion about the representativeness of the groups, constructing the histogram and the conclusion about the distribution based on it, creating the report

Iana Borisenko: constructing the contingency table, constructing stacked barplot №1, building the association plot and the conclusion about residuals, overall conclusion, creating the report

Irina Kireeva: data set manipulations, creating the new variable “politics13$lr”, building the boxplot and its conclusion, descriptive statistics for the t-test, running the Wilcoxon test and its conclusions

Daria Kuzmicheva: describing variables with different measurement scales, running the chi-square test and its conclusion, conclusion about the distribution based on skew and kurtosis, constructing the Q-Q plots and their conclusions

Preparing data for analysis

First of all, we load all the libraries needed for the analysis.

library(readr)
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(sjmisc) 
library(sjstats) 
library(ggeffects)  
library(sjPlot)
library(knitr)
library(psych)

Next, we load our dataset. It combines data on politics with respondents' personal information.

politics_gender <- read_csv("~/politics_gender.csv")

Here we keep only the variables we need and remove the observations that are useless for the analysis.

politics <- politics_gender %>% 
  select(agea, lrscale, sgnptit, vote)
politics <- na.omit(politics)

# Drop the special answer codes of lrscale (77, 88, 99) and sgnptit (7, 8, 9).
# Both filters run in one pipeline so that neither of them is lost.
politics13 <- politics %>%
  filter(!lrscale %in% c(77, 88, 99)) %>%
  filter(!sgnptit %in% c(7, 8, 9))

Describing variables

First, we recode one of the variables to make it easier to work with. Then we update our dataset.

politics13$lr <- ifelse(politics13$lrscale <= 3, "Left",
                    ifelse(politics13$lrscale >= 7, "Right", "Middle"))
politics13 <- politics13 %>% 
  select(- lrscale)
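
The same three-way recoding can also be written with `cut()`, which makes the interval boundaries explicit. This is only a sketch on a made-up vector, `toy_lrscale`, which is an illustrative name and not part of the survey data; it assumes the 0 to 10 scale used above.

```r
# Hypothetical 0-10 left-right scores, not taken from the survey data
toy_lrscale <- c(0, 3, 4, 5, 6, 7, 10)

# Intervals are closed on the right:
# (-Inf, 3] -> "Left", (3, 6] -> "Middle", (6, Inf) -> "Right"
toy_lr <- cut(toy_lrscale,
              breaks = c(-Inf, 3, 6, Inf),
              labels = c("Left", "Middle", "Right"))

# Count how many toy scores fall into each group
table(toy_lr)
```

With either approach, a special answer code such as 77 would be classified as "Right", so such codes have to be removed before recoding.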

Now, let's look at the number of observations and the number of variables.

dim(politics13)

## [1] 2752    4

That is all we need to run the tests: 4 variables and a sufficient number of observations.

The chosen variables are described below.

Label <- c("`sgnptit`", "`lr`", "`agea`", "`vote`" ) 
Meaning <- c("Signed petition last 12 months", "Placement on left right scale", "Age", "Voted last national election")
Level_Of_Measurement <- c("Nominal", "Nominal", "Ratio", "Nominal")
Test <- c("Chi-squared test", "Chi-squared test", "T-test for independent samples", "T-test for independent samples")
df <- data.frame(Label, Meaning, Level_Of_Measurement, Test, stringsAsFactors = FALSE)
kable(df)

Label	Meaning	Level_Of_Measurement	Test
`sgnptit`	Signed petition last 12 months	Nominal	Chi-squared test
`lr`	Placement on left right scale	Nominal	Chi-squared test
`agea`	Age	Ratio	T-test for independent samples
`vote`	Voted last national election	Nominal	T-test for independent samples

Exploratory analysis of categories

First, we select the variables needed for the chi-square test. Then we present a contingency table.

politics_chi <- politics13 %>% 
  select(lr, sgnptit)
politics_chi$sgnptit <- factor(politics_chi$sgnptit, labels = c("Yes", "No"), ordered = FALSE, exclude = NA)
ContigencyTable <- table(politics_chi$lr, politics_chi$sgnptit)
kable(ContigencyTable)

	Yes	No
Left	142	192
Middle	283	1060
Right	164	911

To check whether our categories are suitable for a chi-square test, we create stacked barplots and analyze them.

Stacked barplot №1

ggplot() +
  geom_bar(data = politics_chi, aes(x = lr, fill = sgnptit), position = "fill")+
  coord_flip()+
  xlab("Placement on the left-right scale") + 
  ylab("Proportion of respondents") +
  ggtitle("Petition signing by left-right placement")

Stacked barplot №2

sjp.xtab(politics_chi$lr, politics_chi$sgnptit, type = "bar", margin ="row",
  bar.pos = "stack", title = "Petition signing by left-right placement", title.wtd.suffix = NULL,
  axis.titles = NULL, axis.labels = NULL, legend.title = NULL,
  legend.labels = NULL, weight.by = NULL, rev.order = FALSE,
  show.values = TRUE, show.n = TRUE, show.prc = TRUE, show.total = TRUE,
  show.legend = TRUE, show.summary = TRUE, summary.pos = "r",
  string.total = "Total", wrap.title = 50, wrap.labels = 15,
  wrap.legend.title = 20, wrap.legend.labels = 20, geom.size = 0.7,
  geom.spacing = 0.1, geom.colors = "Paired", dot.size = 3,
  smooth.lines = FALSE, grid.breaks = 0.2, expand.grid = FALSE,
  ylim = NULL, vjust = "bottom", hjust = "left", y.offset = NULL,
  coord.flip = TRUE, prnt.plot = TRUE)

Both plots show that some supporters of every political position signed petitions. However, signing petitions is not very common in Ireland. The “left” group is the most active: 42.5% of its members signed a petition in the last 12 months, compared with 21.1% of the “middle” group and 15.3% of the “right” group.
The assumptions of the chi-square test are met: each observation is independent of all the others (one observation per respondent), and no more than 20% of the expected counts are below 5 (in fact, none are). Therefore, the data are appropriate for a reliable chi-square test.
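
The percentages and the expected-count condition can be checked by hand from the contingency table. The sketch below re-enters the observed counts shown above into a matrix named `observed` (an illustrative name, not an object from the analysis) and uses only base R.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(142,  192,
                     283, 1060,
                     164,  911),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(lr = c("Left", "Middle", "Right"),
                                   sgnptit = c("Yes", "No")))

# Row percentages: share of signers and non-signers within each group
row_pct <- round(100 * prop.table(observed, margin = 1), 1)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

row_pct
min(expected)       # smallest expected count
mean(expected < 5)  # share of cells with an expected count below 5
```

The smallest expected count is far above 5, which is what the second assumption requires.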

Chi-square test

To check whether the observed distribution of petition signers across political preferences could be due to chance, we run a chi-square test. We formulated the following hypotheses:

H0: there is no relation between the political preferences of respondents and whether they signed a petition
H1: there is a relation

So, let's run the chi-square test.

colnames(ContigencyTable) <- c("Petition +", "Petition -")
# Rows of the table are ordered alphabetically: Left, Middle, Right
rownames(ContigencyTable) <- c("L", "M", "R")
chi.test <- chisq.test(ContigencyTable)
chi.test

## 
##  Pearson's Chi-squared test
## 
## data:  ContigencyTable
## X-squared = 112.73, df = 2, p-value < 2.2e-16

The chi-square test gives an extremely small p-value (p < 2.2e-16), so data like ours would be very unlikely if the two variables were independent. We therefore reject H0: the political preferences of respondents and whether they signed a petition are likely to be related.

Next, we look at the observed counts, the expected counts and the standardized residuals.

kable(chi.test$observed) #for observed

	Petition +	Petition -
L	142	192
M	283	1060
R	164	911

kable(chi.test$expected) #for expected

	Petition +	Petition -
L	71.48474	262.5153
M	287.43714	1055.5629
R	230.07812	844.9219

kable(chi.test$stdres) #for residuals

	Petition +	Petition -
L	10.0361808	-10.0361808
M	-0.4125685	0.4125685
R	-6.2946804	6.2946804

assocplot(t(ContigencyTable), main="Residuals and number of observations" )

The plot of residuals confirms the conclusion of the chi-square test: the share of petition signers differs too much between the political groups for the variables to be considered independent. The “left” group stands out most: its standardized residual for signers is about +10, so far more of its members signed petitions than we would expect under independence. The “right” group shows the opposite pattern, with a standardized residual of about -6.3 for signers, so fewer of its members signed than expected. The “middle” group is close to its expected counts (residual of about -0.4).
Thus, both the chi-square test and the residuals provide evidence against independence, and we conclude that the political preferences of respondents and their willingness to sign petitions are related.
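
To make the residuals less of a black box, here is a minimal sketch that recomputes the test statistic and the standardized residuals from the observed counts in the contingency table. The object names (`observed`, `std_res`, `chi_sq`) are illustrative, and the formula used is the usual standardized (adjusted) Pearson residual, which is what `chisq.test()` returns as `stdres`.

```r
# Observed counts copied from the contingency table (rows: Left, Middle, Right)
observed <- matrix(c(142, 192, 283, 1060, 164, 911), nrow = 3, byrow = TRUE,
                   dimnames = list(c("Left", "Middle", "Right"), c("Yes", "No")))

n     <- sum(observed)
row_p <- rowSums(observed) / n   # share of each political group
col_p <- colSums(observed) / n   # share of signers / non-signers

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Standardized residual: (O - E) / sqrt(E * (1 - row share) * (1 - column share))
std_res <- (observed - expected) / sqrt(expected * outer(1 - row_p, 1 - col_p))

chi_sq
round(std_res, 2)
```

Values above roughly 2 in absolute size mark the cells that contribute most to the rejection of independence; here these are the “Left” and “Right” rows.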

Exploring data for T-test

We start by filtering out the values that are useless for our test.

politics_ttest <- politics13 %>% 
  select(agea, vote) %>% 
  filter(vote != 3) %>% 
  filter(vote != 7) %>% 
  filter(vote != 8)

Next, let’s compare the age distributions of the two groups with a boxplot.

politics_ttest$vote <- factor(politics_ttest$vote, labels = c("Yes", "No"), ordered= F,exclude = NA)
ggplot() +
  geom_boxplot(data = politics_ttest, aes(x = vote, y = agea), fill="#A44200", col="#A44200", alpha = 0.5) +
  scale_y_continuous(limits = c(0,100)) +
  xlab("Voted last national election") + 
  ylab("Age") +
  ggtitle("Participation in the election due to age")

The median age of voters is higher than the median age of non-voters. The box for voters is taller than the box for non-voters, so the interquartile range is wider and ages vary more among voters. The whiskers are of similar length in both groups. However, the non-voter group shows some outliers.

Checking normality of distribution

The first way to check normality is to look at descriptive statistics by group.

describeBy(politics_ttest, politics_ttest$vote)

## 
##  Descriptive statistics by group 
## group: Yes
##       vars    n  mean     sd median trimmed   mad min max range skew
## agea     1 1940 69.88 119.56     56    55.3 19.27  19 999   980 7.49
## vote*    2 1940  1.00   0.00      1     1.0  0.00   1   1     0  NaN
##       kurtosis   se
## agea     55.29 2.71
## vote*      NaN 0.00
## -------------------------------------------------------- 
## group: No
##       vars   n mean     sd median trimmed   mad min max range skew
## agea     1 593   52 104.78     38   39.56 14.83  16 999   983 8.71
## vote*    2 593    2   0.00      2    2.00  0.00   2   2     0  NaN
##       kurtosis  se
## agea     75.76 4.3
## vote*      NaN 0.0

Skewness measures the asymmetry of a distribution. The normal distribution is symmetrical, so its skewness equals 0. In the group of voters the skew equals 7.49, and in the group of non-voters it is higher, 8.71. The distribution of age is therefore somewhat more symmetrical among voters, but it is still far from normal: both values are much greater than 1, which indicates strong positive (right) skewness. Note that the maximum of 999 in both groups is not a real age; it appears to be a missing-value code that was not filtered out, and it inflates the mean, standard deviation, skewness and kurtosis reported here.
Kurtosis tells us whether the distribution is peaked and heavy-tailed or flat. The kurtosis of age equals 55.29 among voters and 75.76 among non-voters, so the distribution of the first group (voters) is less sharply peaked than that of the second group, although both values are far above what a normal distribution would give.
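
The `max` column in the output above is 999 in both groups, which cannot be a real age. Assuming it is a missing-value code, the sketch below shows on made-up data how a single such value can dominate skewness. Both `toy_age` and the helper `skewness()` are hypothetical and written for this illustration only; the helper uses the simple moment-based formula, so its values need not match `psych::describeBy()` exactly.

```r
# Moment-based sample skewness (hypothetical helper for this illustration)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up ages with one 999 code at the end (not survey data)
toy_age <- c(19, 25, 34, 41, 47, 52, 56, 60, 63, 68, 71, 75, 80, 85, 999)

skew_with_code    <- skewness(toy_age)                   # 999 kept as if it were an age
skew_without_code <- skewness(toy_age[toy_age != 999])   # 999 dropped first

c(with_999 = skew_with_code, without_999 = skew_without_code)
```

In this toy example the skewness is large and positive only while the 999 is present, so it is worth removing such codes before judging normality from skew and kurtosis.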

Next, we check normality with a histogram.

library(ggplot2)
ggplot(politics_ttest, aes(x = agea, fill = vote)) +
      geom_histogram(aes(y=..density..), position = "identity", alpha = 0.7, binwidth = 3) +
  xlim(c(0,100)) +
  geom_density(col = "yellow", fill = "white", alpha = 0.1) +
  geom_vline(aes(xintercept = mean(politics_ttest$agea), color = 'mean'), linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(politics_ttest$agea), color = 'median'), linetype="longdash", size=1) +
  scale_color_manual(name = "Measurement", values = c(median = "#cb3f68", mean = "#824acd")) +
  xlab("Age") + 
  ylab("Density") +
  ggtitle("Age distribution of voters and nonvoters")

The histogram suggests that neither group is very close to a normal distribution, although the group of voters is slightly closer to it.

Finally, we check normality with Q-Q plots.

# Create subgroups based on voting / non-voting
voteplus <- subset(politics_ttest, vote == "Yes")
voteminus <- subset(politics_ttest, vote == "No")
par(mfrow = c(1, 2))
# The y-axis starts at 18 because that is the age at which the Irish are allowed to vote
qqnorm(voteplus$agea, ylim = c(18, 100), main = "Normal Q-Q Plot for vote+")
qqline(voteplus$agea, col = 2)  # axis limits come from qqnorm; qqline only adds the reference line
qqnorm(voteminus$agea, ylim = c(18, 100), main = "Normal Q-Q Plot for vote-")
qqline(voteminus$agea, col = 2)

Both Q-Q plots show distributions that are skewed to the right (the points at the upper end lie above the line) and sharply peaked (the points do not follow the reference line). However, the first plot shows a slightly flatter peak.
Thus, the first group (voters) is closer to a normal distribution than the second group (non-voters). Nevertheless, all three checks indicate that age is not normally distributed in either group.

Conducting T-test

As with political preferences and signed petitions, we want to test whether voting behavior is related to the respondent’s age, so we conduct a t-test. We formulated the following hypotheses:

H0: the mean age of people who voted and did not vote does not differ.
H1: the mean age does differ and, thus, there is a relation between age and voting behavior

Now we run the t-test.

t.test(politics_ttest$agea ~ politics_ttest$vote)

## 
##  Welch Two Sample t-test
## 
## data:  politics_ttest$agea by politics_ttest$vote
## t = 3.5136, df = 1103.6, p-value = 0.00046
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.893126 27.857861
## sample estimates:
## mean in group Yes  mean in group No 
##          69.87887          52.00337

Statistical conclusion: at the 5% significance level, the null hypothesis is rejected in favor of the alternative (p-value = 0.00046 < α = 0.05).
Substantive conclusion: the mean age differs significantly between those who voted and those who did not vote; in this sample voters are older on average.
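
The Welch statistic reported above can be reproduced from the group summaries alone. The sketch below copies the group means from the t-test output and the standard deviations and group sizes from the descriptive statistics; the variable names are illustrative. Because the standard deviations are rounded to two decimals, the result matches the output only approximately.

```r
# Group summaries copied from the outputs above (Yes = voted, No = did not vote)
mean_yes <- 69.87887; sd_yes <- 119.56; n_yes <- 1940
mean_no  <- 52.00337; sd_no  <- 104.78; n_no  <- 593

# Squared standard errors of the two group means
v_yes <- sd_yes^2 / n_yes
v_no  <- sd_no^2 / n_no

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_yes - mean_no) / sqrt(v_yes + v_no)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_yes + v_no)^2 / (v_yes^2 / (n_yes - 1) + v_no^2 / (n_no - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The Welch version does not assume equal variances in the two groups, which is why the degrees of freedom are not simply the total sample size minus 2.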

Double-checking results with non-parametric test

H0: the age distributions of people who voted and did not vote do not differ in location.
H1: the age distributions are shifted relative to each other and, thus, there is a relation between age and voting behavior

wilcox.test(agea ~ vote, data = politics_ttest)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  agea by vote
## W = 842130, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Statistical conclusion: the p-value is extremely small (p < 2.2e-16), so H0 is rejected.
Substantive conclusion: the Wilcoxon test also supports the conclusion that age differs significantly between those who voted and those who did not vote.
