Contributions: Tolstokoraya Darya: Chi-square
Baturina Elina: literature review, ANOVA
Sorokina Darya: ANOVA
Suetina Anna: T-test
We chose Switzerland as the country for our analysis.
Topic: “Digital and social contacts within family and workplace and their relation to subjective well-being and social exclusion”
Research question: How are digital and social contacts within family and workplace related to subjective well-being and social exclusion?
In this project we explore several sub-questions connected with this topic.
Loading packages
library(dplyr)
library(ggplot2)
library(readr)
library(kableExtra)
library(foreign)
library(rstatix)
library(DescTools)
library(sjstats)
library(psych)
library(sjPlot)
library(corrplot)
library(effsize)
library(coin)
library(RGraphics)
library(rcompanion)
library(car)
Loading the data from ESS round 10
ESS <- read_csv(file = '/Users/admin/Downloads/ESS10/ESS10.csv')
Selecting the country and the needed variables from the dataset
ESS10 <- ESS %>%
filter(cntry == "CH") %>%
select(idno, acchome, domicil, gndr, ttminpnt, speakpnt)
ESS10_1 <- ESS %>%
filter(cntry == "CH") %>%
select(idno, acchome, domicil, gndr, ttminpnt, speakpnt)
Describing variables
Label = c("acchome", "domicil", "gndr", "ttminpnt", "speakpnt")
Meaning = c("Ability to access the Internet from home", "Area of living type", "Gender", "Time to parents in minutes", "Frequency of speaking to parents")
Level_Of_Measurement <- c("Nominal, binary", "Nominal", "Nominal, binary", "Ratio", "Ordinal")
df <- data.frame(Label, Meaning, Level_Of_Measurement, stringsAsFactors = FALSE)
kable(df) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| Label | Meaning | Level_Of_Measurement |
|---|---|---|
| acchome | Ability to access the Internet from home | Nominal, binary |
| domicil | Area of living type | Nominal |
| gndr | Gender | Nominal, binary |
| ttminpnt | Time to parents in minutes | Ratio |
| speakpnt | Frequency of speaking to parents | Ordinal |
Research question: Is there a relation between the respondent’s description of the type of the area of living and their ability to access the internet from home?
Internet use has recently become a well-studied topic in Switzerland. The most complete research on it was conducted within the World Internet Project. The authors collected statistics on Internet use in the country by many parameters (e.g. age, purposes, fears, types of use). The research showed that Switzerland is one of the leading countries in the world in terms of Internet use: around 92% of the population use the Internet. Based on this, we hypothesized that people all over the country have equal access to the Internet, so there should be no difference in access between different types of areas.
Reference: Latzer, Michael; Büchi, Moritz; Festic, Noemi (2020). Internet Use in Switzerland 2011–2019: Trends, Attitudes and Effects. Summary Report from the World Internet Project – Switzerland. Zurich, Switzerland: University of Zurich.
Variables
We use two categorical variables for this test.
First variable: “acchome” (nominal, binary). This variable represents the respondent’s ability to access the Internet from home: “Imagine you wanted to access the Internet. At which of these locations would you be able to do it?” (respondents either marked or did not mark home as a location where they can access the Internet).
ESS10$acchome <- factor(ESS10$acchome, labels = c("Don't have an access", "Have an access"), ordered= F)
class(ESS10$acchome)
## [1] "factor"
summary(ESS10$acchome)
## Don't have an access Have an access
## 104 1419
Second variable: “domicil” (domicile, respondent’s description; nominal). This variable represents the respondent’s description of the type of area where they live. We first recode the values that are not needed for the analysis as missing: 7 = “Refusal”, 8 = “Don’t know”, 9 = “No answer”.
ESS10$domicil[ESS10$domicil == 7 | ESS10$domicil == 8 | ESS10$domicil == 9] <- NA
ESS10$domicil <- factor(ESS10$domicil, labels = c("A big city", "Suburbs or outskirts of big city", "Town or small town", "Country village", "Farm or home in countryside"), ordered= F)
class(ESS10$domicil)
## [1] "factor"
summary(ESS10$domicil)
## A big city Suburbs or outskirts of big city
## 112 164
## Town or small town Country village
## 386 800
## Farm or home in countryside NA's
## 60 1
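The recoding step can also be written so that each label is tied to its numeric code explicitly. The following is a minimal sketch on a made-up vector (the values and the names `domicil_raw`, `domicil_clean` and `domicil_factor` are illustrative assumptions, not ESS data):

```r
# Toy vector standing in for ESS10$domicil (values are illustrative only)
domicil_raw <- c(1, 3, 7, 5, 8, 2, 4, 9, 4)

# Treat the special codes 7, 8 and 9 as missing in one step
domicil_clean <- domicil_raw
domicil_clean[domicil_clean %in% c(7, 8, 9)] <- NA

# Naming the levels explicitly ties each label to its code, so a label
# cannot shift if one category happens to be absent from the data
domicil_labels <- c("A big city", "Suburbs or outskirts of big city",
                    "Town or small town", "Country village",
                    "Farm or home in countryside")
domicil_factor <- factor(domicil_clean, levels = 1:5, labels = domicil_labels)

summary(domicil_factor)
```

With `levels = 1:5` given, the result does not depend on which codes actually occur in the sample.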
Descriptive plot
The descriptive plot below shows the number of respondents living in each type of area, split by their ability to access the Internet from home.
ggplot(ESS10)+
geom_bar(aes(x=domicil, fill=acchome), position="stack", na.rm = TRUE)+
scale_x_discrete(na.translate = FALSE)+
ggtitle("The relationship between the description of the area type and ability to access the internet from home")+
xlab("Description of the area of living")+
ylab("Number of respondents")+
labs(caption = "ESS10, Switzerland")+
theme(axis.text.x = element_text(angle=65, vjust = 0.5))
We see that the great majority of respondents live in a country village, whereas the other types of areas have far fewer residents. Because the group sizes differ so much, the proportions are hard to read from the stacked bar plot, and it is hard to derive valid conclusions from it. To solve this, we build plot_xtab to look at the row proportions.
library(sjPlot)
plot_xtab (ESS10$domicil, ESS10$acchome, margin = "row", bar.pos = "stack",
show.summary = TRUE)
Interpretation: The proportions are approximately equal across area types, so it is hard to tell visually whether the differences are significant. That is why we run a chi-squared test.
Checking assumptions
Assumptions:
The data are independent and the categories are mutually exclusive
At least 5 observations per cell
table(ESS10$acchome, ESS10$domicil)
##
## A big city Suburbs or outskirts of big city
## Don't have an access 8 10
## Have an access 104 154
##
## Town or small town Country village
## Don't have an access 17 60
## Have an access 369 740
##
## Farm or home in countryside
## Don't have an access 9
## Have an access 51
exp<-chisq.test(ESS10$acchome, ESS10$domicil)
exp$expected
## ESS10$domicil
## ESS10$acchome A big city Suburbs or outskirts of big city
## Don't have an access 7.653088 11.20631
## Have an access 104.346912 152.79369
## ESS10$domicil
## ESS10$acchome Town or small town Country village
## Don't have an access 26.37582 54.66491
## Have an access 359.62418 745.33509
## ESS10$domicil
## ESS10$acchome Farm or home in countryside
## Don't have an access 4.099869
## Have an access 55.900131
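All observed counts are at least 5. One expected count (4.10, for “Farm or home in countryside” without access) is slightly below 5, but this affects only 1 of the 10 cells and no expected count is below 1, so under the common rule of thumb (at least 80% of expected counts of 5 or more) the assumption is treated as met.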
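The expected counts can be recomputed directly from the observed table. This is a small sketch that assumes the counts are copied by hand from the `table()` output above; the object names are our own:

```r
# Observed counts copied from the cross-table of acchome by domicil
observed <- matrix(
  c(8, 10, 17, 60, 9,
    104, 154, 369, 740, 51),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    acchome = c("Don't have an access", "Have an access"),
    domicil = c("A big city", "Suburbs or outskirts of big city",
                "Town or small town", "Country village",
                "Farm or home in countryside")
  )
)

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Share of cells with an expected count below 5, and the smallest expected count
share_below_5 <- mean(expected < 5)
min_expected <- min(expected)

round(expected, 2)
share_below_5
min_expected
```

Looking at the share of small expected counts, rather than only at the observed counts, is the usual way to judge whether the chi-squared approximation is acceptable.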
Chi-square Test
H0: There is no association between the type of area of living and the ability to access the Internet from home
HA: There is an association between the type of area of living and the ability to access the Internet from home
chisq.test(ESS10$acchome, ESS10$domicil)
##
## Pearson's Chi-squared test
##
## data: ESS10$acchome and ESS10$domicil
## X-squared = 10.579, df = 4, p-value = 0.03173
The p-value is 0.03173, which is below 0.05, so we reject the null hypothesis: the two categorical variables are not independent, meaning there is an association between the type of area of living and the ability to access the Internet from home. In other words, the share of people who can access the Internet from home differs across the types of areas they live in.
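The test statistic can be reproduced from the observed counts, and an effect size such as Cramér's V can be derived from it. Cramér's V is not part of the original analysis; the sketch below adds it only as an illustration, with counts copied by hand from the cross-table and object names of our own choosing:

```r
# Observed counts copied from the cross-table of acchome by domicil
observed <- matrix(
  c(8, 10, 17, 60, 9,
    104, 154, 369, 740, 51),
  nrow = 2, byrow = TRUE
)

# chisq.test() warns here because one expected count is below 5;
# the warning is suppressed only to keep the output short
chi_test <- suppressWarnings(chisq.test(observed))
chi2 <- unname(chi_test$statistic)

# Cramér's V = sqrt(chi-squared / (n * (min(rows, columns) - 1)))
n <- sum(observed)
k <- min(dim(observed)) - 1
cramers_v <- sqrt(chi2 / (n * k))

chi2
cramers_v
```

A V of this size would usually be read as a weak association, which fits the nearly equal proportions seen in the plot.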
Post-Hoc test
The analysis of the standardized residuals:
res <- chisq.test(ESS10$acchome, ESS10$domicil)
res$stdres
## ESS10$domicil
## ESS10$acchome A big city Suburbs or outskirts of big city
## Don't have an access 0.1349795 -0.3952332
## Have an access -0.1349795 0.3952332
## ESS10$domicil
## ESS10$acchome Town or small town Country village
## Don't have an access -2.1892416 1.0854129
## Have an access 2.1892416 -1.0854129
## ESS10$domicil
## ESS10$acchome Farm or home in countryside
## Don't have an access 2.5581477
## Have an access -2.5581477
Describing the residuals: The residuals of 2.5581477 and -2.5581477 for “Don't have an access” and “Have an access” in the “Farm or home in countryside” category indicate substantial deviations of the observed from the expected counts. There is a positive association between living on a farm or in a home in the countryside and not having access to the Internet from home.
In the “Town or small town” category the residuals (-2.1892416 and 2.1892416) are also beyond -2 and 2, so there is a positive association between living in a town or small town and having access to the Internet from home.
The other values lie between -2 and 2, meaning the observed counts in these cells do not deviate notably from the expected ones.
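The standardized residuals reported by `chisq.test()` can be checked by hand. The sketch below assumes the counts copied from the cross-table and uses the usual formula for adjusted residuals; the object names are ours:

```r
# Observed counts copied from the cross-table of acchome by domicil
observed <- matrix(
  c(8, 10, 17, 60, 9,
    104, 154, 369, 740, 51),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    acchome = c("No access", "Access"),
    domicil = c("Big city", "Suburbs", "Town", "Village", "Farm")
  )
)

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n
row_share <- rowSums(observed) / n
col_share <- colSums(observed) / n

# Standardized residual:
# (observed - expected) / sqrt(expected * (1 - row share) * (1 - column share))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))

round(std_res, 2)

# Cells whose absolute residual exceeds 2
which(abs(std_res) > 2, arr.ind = TRUE)
```

The sign shows the direction: a positive value means more observations in the cell than independence would predict, a negative value means fewer.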
Visualize residuals:
corrplot(chisq.test(ESS10$acchome, ESS10$domicil)$stdres, is.corr = FALSE, method = "number")
Conclusions:
After conducting the chi-squared test we can conclude that there is a relation between the respondent’s description of the type of area of living and their ability to access the Internet from home (the categorical variables “domicil” and “acchome” are not independently distributed). The residual analysis shows which cells contribute most to the test result. In our sample, fewer people from a town or small town lack access to the Internet at home than expected (17 observed against about 26 expected). On the other hand, more people living on a farm or in a home in the countryside lack access to the Internet at home than expected (9 observed against about 4 expected).
Thus, our original hypothesis is not supported: people in Switzerland living in different types of areas have different levels of access to the Internet from home.
Research question: Do Swiss people of different gender (female, male) differ in the mean time in minutes needed to get to their parents’ place of living?
Kolk’s (2016) study examined the geographical distance between children and their parents by gender in Sweden. Unfortunately, the study does not provide data on children of different ages; however, it reports that mothers tend to live closer to their children than fathers do. We applied this logic to our data examination and hypothesized that female children tend to live closer to their parents in Switzerland.
Reference: Kolk, Martin (2016). A Life-Course Analysis of Geographical Distance to Siblings, Parents, and Grandparents in Sweden. Population, Space and Place
Data inspection
We are going to run an independent samples t-test, where: Categorical variable: gndr - Gender of respondents
ESS10_ttminpnt <- ESS10 %>%
select(gndr, ttminpnt) %>%
filter(ttminpnt != 6666) %>%
filter (ttminpnt != 7777) %>%
filter(ttminpnt != 8888) %>%
filter (ttminpnt != 9999)
ESS10_ttminpnt$gndr <- factor(ESS10_ttminpnt$gndr, labels = c("Male", "Female"), ordered= F)
class (ESS10_ttminpnt$gndr)
## [1] "factor"
summary(ESS10_ttminpnt$gndr)
## Male Female
## 391 397
Description of variables: The “gndr” variable is categorical and binary, since it has 2 categories (according to the summary there are 391 males and 397 females). R identified the class of the variable as “numeric”, so we converted it into “factor”, which corresponds to the categorical type of data.
Continuous variable: ttminpnt - Travel time to parent, in minutes
ESS10_ttminpnt$ttminpnt <- as.numeric(ESS10_ttminpnt$ttminpnt)
class(ESS10_ttminpnt$ttminpnt)
## [1] "numeric"
summary(ESS10_ttminpnt$ttminpnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 10.0 30.0 188.1 180.0 4320.0
The variable “ttminpnt” is a continuous variable. R identified it as “integer”, so we converted it into “numeric”. According to the measures of central tendency, the mean travel time to the parents is 188.1 minutes and the median is 30.0 minutes; the minimum is 0 and the maximum is 4320.
Descriptive plot
Boxplot can help to visualize our data:
library (ggplot2)
ggplot(ESS10_ttminpnt)+
geom_boxplot(aes(x=gndr, y=ttminpnt), fill="#FFDDFF", col="#221100",alpha = 0.5)+
ylim(0, 500)+
ggtitle("Minutes spent on getting to the parents by Gender of Respondent")+
xlab("Gender of respondents")+
ylab("Duration of time for getting to the parents in minutes")
Interpretation: based on the plot we see that females (median of approximately 25 minutes) need more time to get to their parents than males (median of approximately 20 minutes). This suggests that the distance between females and their parents is larger than between males and their parents.
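One caveat about reading medians off this plot: in ggplot2, `ylim(0, 500)` sets scale limits, so observations above 500 are dropped before the boxplot statistics are computed, whereas `coord_cartesian(ylim = c(0, 500))` only zooms the view. The sketch below illustrates the effect on simulated right-skewed data (the numbers are simulated, not ESS values, and the variable names are ours):

```r
set.seed(1)

# Simulated right-skewed travel times in minutes (not ESS data)
travel_time <- rlnorm(500, meanlog = 3.5, sdlog = 1.8)

# Median of all observations
median_all <- median(travel_time)

# Median of what remains after discarding values above 500,
# which mimics what a boxplot drawn with ylim(0, 500) is based on
median_trimmed <- median(travel_time[travel_time <= 500])

# Share of observations lost by the limit
share_dropped <- mean(travel_time > 500)

c(all = median_all, trimmed = median_trimmed, dropped = share_dropped)
```

Because only large values are removed, the trimmed median can never exceed the full-data median, which may explain why medians read from a limited plot are lower than those in the descriptive statistics.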
Summary of data inspection:
there are > 300 observations in both groups
in the plot, females need slightly more time to get to their parents (= typically, the distance between them and their parents is longer)
Checking assumptions
Checking the normality assumption for the t-test
Here we check the normality of the distribution of our continuous variable (time to get to parents) by gender.
ggplot(ESS10_ttminpnt, aes(x = ttminpnt, color = gndr, fill = gndr)) +
geom_density(alpha = 0.5) +
labs(title = "Minutes spent on getting to parents by Gender", x = "Duration of time to get to the parents in Minutes", y = "Density") +
theme_classic()
Interpretation: this density plot shows that the distributions are skewed to the right (i.e. the right tail is stretched).
#install.packages("psych")
library(psych)
describeBy(ESS10_ttminpnt, group = ESS10_ttminpnt$gndr)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## gndr 1 391 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## ttminpnt 2 391 195.61 382.47 30 99.64 37.06 0 3000 3000 3.3
## kurtosis se
## gndr NaN 0.00
## ttminpnt 14.71 19.34
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## gndr 1 397 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## ttminpnt 2 397 180.73 352.11 35 103.69 44.48 1 4320 4319 5.34
## kurtosis se
## gndr NaN 0.00
## ttminpnt 49.2 17.67
Interpretation:
Males: the skew (3.3) departs from normality (more than 0.5), and so does the kurtosis (14.71, more than 1), in line with the graph above (very sharp peak and long tail).
Females: the skew (5.34) departs from normality (more than 0.5), and so does the kurtosis (49.2, more than 1), in line with the graph above (very sharp peak and long tail).
In both groups the distribution is skewed and not normal.
Here we also check normality with a Q-Q plot:
qqnorm(ESS10_ttminpnt$ttminpnt)
qqline(ESS10_ttminpnt$ttminpnt)
Interpretation: the Q-Q plot does not look normal (heavy right tail and U-shaped line): the points do not follow the straight line.
Here we also check the normality of our data with the help of a test.
shapiro.test(ESS10_ttminpnt$ttminpnt)
##
## Shapiro-Wilk normality test
##
## data: ESS10_ttminpnt$ttminpnt
## W = 0.54001, p-value < 2.2e-16
Interpretation: according to the Shapiro-Wilk test we reject the null hypothesis of normality (p-value < 0.05), so the distribution is not normal.
Homogeneity of variances assumption
Here is visualization of comparison of the variances in the groups (males and females) with the help of boxplots:
ggplot(ESS10_ttminpnt, aes(x = gndr, y = ttminpnt)) +
ylim(0, 500)+
geom_boxplot() +
stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
ggtitle("Minutes spent on getting to parents by Gender of Respondent")
Interpretation: Women have a wider distribution than men. The median in the female group is slightly higher than in the male group, while the means of women and men are almost the same. Both distributions are skewed: the mean points are displaced far towards the longer tail and do not align with the medians. There are also many outliers (points on the plot).
Here we use a test to check our visual impression.
H0: Variances are equal.
HA: Variances are not equal.
bartlett.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr)
##
## Bartlett test of homogeneity of variances
##
## data: ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## Bartlett's K-squared = 2.683, df = 1, p-value = 0.1014
Interpretation: according to Bartlett's test we fail to reject the null hypothesis (p-value > 0.05), so there is no evidence that the variances of the groups differ.
T-Test
The distributions of the continuous variable are not normal, but the number of observations in both groups is large enough, so we can try to run a t-test (and set the non-parametric test aside for now).
H0: The mean time to get to the parents for males is equal to the mean time to get to the parents for females.
HA: The mean time to get to the parents for males is not equal to the mean time to get to the parents for females.
Note: Bartlett's test did not reject the equality of variances, so Welch’s correction is not strictly required; we still apply it (var.equal = F), since it does not rely on equal variances.
t.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr, var.equal = F)
##
## Welch Two Sample t-test
##
## data: ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## t = 0.56798, df = 778.57, p-value = 0.5702
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -36.54920 66.31075
## sample estimates:
## mean in group Male mean in group Female
## 195.6113 180.7305
Interpretation: according to the Welch two-sample t-test we fail to reject the null hypothesis (p-value > 0.05), so there is no statistically significant difference in the mean time to get to the parents between males and females.
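The Welch statistic and its degrees of freedom can be reconstructed from the group summaries printed by `describeBy()` and `t.test()`. This is a sketch that assumes the rounded means, standard deviations and group sizes shown above, so small rounding differences are possible:

```r
# Group summaries copied from the output above (rounded values)
mean_m <- 195.6113; sd_m <- 382.47; n_m <- 391   # males
mean_f <- 180.7305; sd_f <- 352.11; n_f <- 397   # females

# Squared standard errors of the two group means
se2_m <- sd_m^2 / n_m
se2_f <- sd_f^2 / n_f

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean_m - mean_f) / sqrt(se2_m + se2_f)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_m + se2_f)^2 /
  (se2_m^2 / (n_m - 1) + se2_f^2 / (n_f - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The values agree with the printed test up to rounding, which is a quick way to confirm that the test was run on the intended groups.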
Effect size (t-test)
cohen.d(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr, na.rm = T)
##
## Cohen's d
##
## d estimate: 0.04049353 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.09938187 0.18036893
Interpretation: the Cohen’s d estimate is 0.04049353, a negligible effect size, and its 95% confidence interval includes 0. This means there is very little difference between the mean values of the two groups, which is consistent with the t-test result.
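Cohen's d is the difference in means divided by the pooled standard deviation. The following sketch recomputes it from the rounded group summaries above, assuming the pooled-variance version of d:

```r
# Group summaries copied from the output above (rounded values)
mean_m <- 195.6113; sd_m <- 382.47; n_m <- 391   # males
mean_f <- 180.7305; sd_f <- 352.11; n_f <- 397   # females

# Pooled standard deviation, weighting each variance by its degrees of freedom
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) /
                    (n_m + n_f - 2))

# Cohen's d: standardized difference between the two means
cohens_d <- (mean_m - mean_f) / sd_pooled

cohens_d
```

A difference of about 15 minutes is tiny relative to a spread of several hundred minutes, which is why d is close to zero.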
Non-parametric alternative to the t-test
Since our data are not normally distributed, the t-test is not fully reliable in this case, so we run its non-parametric alternative (the Wilcoxon rank-sum test) to double-check the results.
H0: The distribution of time to get to the parents in minutes is the same for males and females (the location shift is equal to 0).
HA: The distribution of time to get to the parents in minutes differs between males and females (the location shift is not equal to 0).
wilcox.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr)
##
## Wilcoxon rank sum test with continuity correction
##
## data: ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## W = 72632, p-value = 0.1183
## alternative hypothesis: true location shift is not equal to 0
Interpretation: according to the Wilcoxon rank-sum test the p-value is 0.1183, which is greater than 0.05. So we fail to reject the null hypothesis: there is no significant difference in the distribution of time to get to parents between women and men.
wilcox_effsize(ttminpnt ~ gndr, data = ESS10_ttminpnt, na.rm = T)
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 ttminpnt Male Female 0.0556 391 397 small
Interpretation: the effect size is 0.05564254, which is labelled as a small effect and is consistent with the non-parametric test result. It means there is very little difference in the time to get to parents between males and females.
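This effect size is of the form r = Z / sqrt(N). As a rough check, Z can be approximated from the two-sided p-value of the Wilcoxon test. The sketch below assumes this approximation (the reported value is based on the exact Z statistic, so the last digits may differ):

```r
# Values copied from the Wilcoxon test output above
p_value <- 0.1183
n_total <- 391 + 397

# Approximate |Z| that corresponds to a two-sided p-value under normality
z_approx <- qnorm(1 - p_value / 2)

# Effect size r = |Z| / sqrt(N)
r_effect <- z_approx / sqrt(n_total)

r_effect
```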
Conclusions and answer to the RQ: The visualizations and tests show that the data are not normally distributed. According to both the t-test and the Wilcoxon rank-sum test, there is no statistically significant difference between females and males in the time in minutes needed to get to the parents. Thus, there is not enough evidence to state that Swiss people of different gender differ in the mean time spent on getting to their parents.
Research question: Is there a relation between the amount of time people need to get to their parents and the frequency of speaking with them in person in Switzerland?
The study conducted by Schwarz, Trommsdorff, Albert and Mayer examined the quality of adult parent-child relationships. One of the measures they used in the analysis was “residential distance”. They found that residential distance correlates negatively with emotional and instrumental support, especially in mother-child relationships. Therefore we hypothesized that there is a relation between the distance between parent and child and the frequency of their communication.
Reference: Schwarz, Beate; Trommsdorff, Gisela; Albert, Isabelle; Mayer, Boris (2005). Adult Parent–Child Relationships: Relationship Quality, Support, and Reciprocity. Applied Psychology, 54(3), 396–417. doi:10.1111/j.1464-0597.2005.00217.x
Data inspection
ESS10_anova <- ESS %>%
filter(cntry == "CH" & speakpnt <= 7 & ttminpnt != 6666) %>%
select(idno, ttminpnt, speakpnt)
First variable: speakpnt. This variable corresponds to the question “How often do you speak with them in person? Please only include occasions where you are physically in the same location.” It indicates the frequency of speaking to parents in person.
ESS10_anova$speakpnt <- factor(ESS10_anova$speakpnt, labels = c('Several times a day', 'Once a day', 'Several times a week', 'Several times a month',
'Once a month', 'Less often', 'Never' ), ordered = T)
class(ESS10_anova$speakpnt)
## [1] "ordered" "factor"
summary(ESS10_anova$speakpnt)
## Several times a day Once a day Several times a week
## 24 34 177
## Several times a month Once a month Less often
## 234 86 210
## Never
## 36
The second variable, ttminpnt (travel time to parent, in minutes), was described previously.
ESS10_anova$ttminpnt <- as.numeric(ESS10_anova$ttminpnt)
class(ESS10_anova$ttminpnt)
## [1] "numeric"
summary(ESS10_anova$ttminpnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 10.0 30.0 335.6 239.0 8888.0
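The variable “ttminpnt” is a continuous variable. Note that the maximum of 8888 is the “Don’t know” code: the filter above removes only 6666, so the codes 7777, 8888 and 9999 that were excluded in the t-test section can still be present here and inflate the mean and the variance.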
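The maximum of 8888.0 in the summary above coincides with one of the special codes that were filtered out in the t-test section. A filter that removes all four codes at once could look like the sketch below, shown on a made-up data frame (`toy`, `missing_codes` and `toy_clean` are illustrative names, and the values are not ESS data):

```r
# Toy data standing in for the Swiss ESS subset (values are illustrative)
toy <- data.frame(
  ttminpnt = c(0, 15, 30, 240, 6666, 7777, 8888, 9999),
  speakpnt = c(1, 3, 4, 6, 2, 5, 7, 3)
)

# Special codes of ttminpnt that do not represent minutes
missing_codes <- c(6666, 7777, 8888, 9999)

# Keep rows with a valid travel time and a substantive speakpnt category
keep <- !(toy$ttminpnt %in% missing_codes) & toy$speakpnt <= 7
toy_clean <- toy[keep, ]

summary(toy_clean$ttminpnt)
```

Checking the maximum after filtering is a simple safeguard against special codes being treated as real travel times.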
Descriptive plot
Now let's make a box plot to inspect our data
ggplot(ESS10_anova)+
geom_boxplot(aes(x=speakpnt, y=ttminpnt), fill="#367588", col="#6a5acd", alpha = 0.5)+
scale_x_discrete(na.translate = FALSE)+
ggtitle("Relationship between the frequency of live communication with parents and time of people getting to parents")+
xlab("How often speak")+
ylab("Time to parent")+
theme(axis.text = element_text(size = 7, angle=90))
Interpretation: some groups differ visually, while others do not. It is hard to judge the differences because the many outliers compress the boxes.
Let's group the categories by approximate frequency
ESS10_anova$speak <- rep(NA, length(ESS10_anova$speakpnt)) #new variable with grouped data from speakpnt
ESS10_anova$speak [ESS10_anova$speakpnt == "Several times a day"|
ESS10_anova$speakpnt == "Once a day"] <- "Daily"
ESS10_anova$speak [ESS10_anova$speakpnt == "Several times a week" ] <- "Weekly"
ESS10_anova$speak [ESS10_anova$speakpnt == "Several times a month"|
ESS10_anova$speakpnt == "Once a month" ] <- "Monthly"
ESS10_anova$speak [ESS10_anova$speakpnt == "Less often"] <- "Less often"
ESS10_anova$speak [ESS10_anova$speakpnt == "Never" ] <- "Never"
ESS10_anova$speak <- as.factor(ESS10_anova$speak)
ESS10_anova$speak <- factor(ESS10_anova$speak, levels = c("Daily", "Weekly", "Monthly", "Less often", "Never"))
table(ESS10_anova$speak)
##
## Daily Weekly Monthly Less often Never
## 58 177 320 210 36
And make a box plot for the new grouped variable
ggplot(ESS10_anova)+
geom_boxplot(aes(x=speak, y=ttminpnt), fill="#367588", col="#6a5acd", alpha = 0.5)+
scale_x_discrete(na.translate = FALSE)+
ggtitle("Relationship between the frequency of live communication with parents and time of people getting to parents")+
xlab("How often speak")+
ylab("Time to parent")+
theme(axis.text = element_text(size = 7, angle=90))
Interpretation: we still see some differences between the groups with different frequency of speaking with parents in person, but it is hard to judge their significance visually
Checking assumptions for the ANOVA test: homogeneity of variances
H0: variances are equal
H1: variances are not equal
leveneTest(ESS10_anova$ttminpnt ~ ESS10_anova$speak)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 19.414 2.956e-15 ***
## 796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variances are not equal, as the p-value is less than 0.05. Thus, we use var.equal = F in the one-way test below.
Performing the F-test
oneway.test(ESS10_anova$ttminpnt ~ ESS10_anova$speak, var.equal = F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: ESS10_anova$ttminpnt and ESS10_anova$speak
## F = 11.486, num df = 4.00, denom df = 178.95, p-value = 2.541e-08
str(oneway.test(ESS10_anova$ttminpnt ~ ESS10_anova$speak, var.equal = F))
## List of 5
## $ statistic: Named num 11.5
## ..- attr(*, "names")= chr "F"
## $ parameter: Named num [1:2] 4 179
## ..- attr(*, "names")= chr [1:2] "num df" "denom df"
## $ p.value : num 2.54e-08
## $ method : chr "One-way analysis of means (not assuming equal variances)"
## $ data.name: chr "ESS10_anova$ttminpnt and ESS10_anova$speak"
## - attr(*, "class")= chr "htest"
Checking the residuals
one.way.anova <- aov(ESS10_anova$ttminpnt ~ ESS10_anova$speak)
summary(one.way.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## ESS10_anova$speak 4 118573377 29643344 24.19 <2e-16 ***
## Residuals 796 975437988 1225425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is less than 0.05, the difference in the time needed to get to parents across the frequency groups is statistically significant
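The F value in the ANOVA table is the ratio of the two mean squares, and the same table also yields eta-squared, the share of variance explained by the grouping. Eta-squared is not reported in the original analysis; the sketch below adds it as an illustration, using the sums of squares copied from the `summary()` output:

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_between <- 118573377; df_between <- 4
ss_within  <- 975437988; df_within  <- 796

# F = mean square between groups / mean square within groups
f_value <- (ss_between / df_between) / (ss_within / df_within)

# Eta-squared = between-group sum of squares / total sum of squares
eta_squared <- ss_between / (ss_between + ss_within)

c(F = f_value, eta_squared = eta_squared)
```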
Now let's check the second assumption for ANOVA: the normality of residuals
plot(one.way.anova, 2)
We see that the points do not lie along the diagonal line, so the distribution of the residuals is far from normal
Now let's check the normality of residuals using descriptive statistics and a test
anova_residuals <- residuals(one.way.anova)
describe(anova_residuals)
## vars n mean sd median trimmed mad min max range skew
## X1 1 801 0 1104.22 -84.38 -117.09 71.57 -1597.25 8750.35 10347.6 6.14
## kurtosis se
## X1 41.1 39.02
The skew and kurtosis are much greater than 2, so we again see that the residuals are not normal
shapiro.test(x = anova_residuals)
##
## Shapiro-Wilk normality test
##
## data: anova_residuals
## W = 0.33506, p-value < 2.2e-16
The residuals are clearly not normal, as the p-value is far below 0.05
hist(anova_residuals)
Visually we also see that the residuals are not normal: the distribution is skewed to the right and has many outliers
As not all the assumptions for ANOVA are met (namely, our residuals are not normally distributed), we use the non-parametric alternative to ANOVA, the Kruskal-Wallis test, applied here to the original seven categories of speakpnt.
kruskal.test(ESS10_anova$ttminpnt ~ ESS10_anova$speakpnt)
##
## Kruskal-Wallis rank sum test
##
## data: ESS10_anova$ttminpnt by ESS10_anova$speakpnt
## Kruskal-Wallis chi-squared = 391.25, df = 6, p-value < 2.2e-16
The p-value is less than 0.05, so there is a significant difference between the mean ranks of the frequency groups
Post-hoc test for the non-parametric test
DunnTest(ESS10$ttminpnt ~ ESS10$speakpnt)
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## 2-1 -542.97148 8.4e-14 ***
## 3-1 -717.23890 < 2e-16 ***
## 4-1 -657.65260 < 2e-16 ***
## 5-1 -559.38737 < 2e-16 ***
## 6-1 -361.76776 1.1e-15 ***
## 7-1 -295.86223 0.0027 **
## 66-1 133.62426 0.0076 **
## 77-1 -98.62574 1.0000
## 88-1 502.62426 1.0000
## 3-2 -174.26741 0.2120
## 4-2 -114.68111 1.0000
## 5-2 -16.41588 1.0000
## 6-2 181.20373 0.1575
## 7-2 247.10926 0.1575
## 66-2 676.59574 < 2e-16 ***
## 77-2 444.34574 1.0000
## 88-2 1045.59574 0.2439
## 4-3 59.58630 1.0000
## 5-3 157.85153 0.0893 .
## 6-3 355.47114 3.9e-16 ***
## 7-3 421.37667 5.5e-07 ***
## 66-3 850.86316 < 2e-16 ***
## 77-3 618.61316 0.6543
## 88-3 1219.86316 0.0893 .
## 5-4 98.26523 0.9297
## 6-4 295.88484 1.2e-12 ***
## 7-4 361.79037 2.6e-05 ***
## 66-4 791.27686 < 2e-16 ***
## 77-4 559.02686 0.9297
## 88-4 1160.27686 0.1286
## 6-5 197.61961 0.0058 **
## 7-5 263.52514 0.0342 *
## 66-5 693.01163 < 2e-16 ***
## 77-5 460.76163 1.0000
## 88-5 1062.01163 0.2221
## 7-6 65.90553 1.0000
## 66-6 495.39202 < 2e-16 ***
## 77-6 263.14202 1.0000
## 88-6 864.39202 0.6543
## 66-7 429.48649 4.1e-08 ***
## 77-7 197.23649 1.0000
## 88-7 798.48649 0.9297
## 77-66 -232.25000 1.0000
## 88-66 369.00000 1.0000
## 88-77 601.25000 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see significant differences between the following groups: Once a month-Several times a day, Less often-Several times a day, Never-Several times a day, Several times a month-Once a day, Once a month-Once a day, Less often-Once a day, Never-Once a day, Several times a month-Several times a week, Once a month-Several times a week, Less often-Several times a week, Never-Several times a week, Once a month-Several times a month, Less often-Several times a month, Never-Several times a month, Less often-Once a month, Never-Once a month. This list refers to the seven substantive categories; the DunnTest call shown above was run on ESS10 rather than ESS10_anova, so its output also includes the special codes 66, 77 and 88 and its p-values are not directly comparable with this list.
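The header of the Dunn output names the Holm method for adjusting p-values. With seven substantive categories there are 21 pairwise comparisons, which is why an adjustment is needed. The sketch below shows how the Holm adjustment works on three made-up p-values (they are hypothetical and not taken from the analysis):

```r
# Number of pairwise comparisons among 7 groups
n_groups <- 7
n_pairs <- choose(n_groups, 2)

# Hypothetical raw p-values, for illustration only
raw_p <- c(0.01, 0.04, 0.03)

# Holm: sort the p-values, multiply the smallest by m, the next by m - 1, ...,
# then enforce that adjusted values never decrease along the sorted order
holm_p <- p.adjust(raw_p, method = "holm")

n_pairs
holm_p
```

Holm's method controls the family-wise error rate while being less conservative than a plain Bonferroni correction.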
Effect size
epsilonSquared(x = ESS10$ttminpnt, g = ESS10$speakpnt)
## epsilon.squared
## 0.71
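For the filtered data (Kruskal-Wallis H = 391.25, n = 801) epsilon-squared is about 0.489, which represents a large effect, so travel time differs strongly between the groups of frequency of speaking to parents in person. The value 0.71 printed above comes from a call on ESS10, which still contains the special codes, rather than on ESS10_anova.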
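Epsilon-squared for a Kruskal-Wallis test can be obtained from the test statistic alone. The sketch below assumes the common definition, epsilon-squared = H / (n - 1), and takes H from the Kruskal-Wallis output and n from the number of observations in ESS10_anova:

```r
# Kruskal-Wallis statistic and sample size taken from the outputs above
h_stat <- 391.25
n_obs <- 801

# Epsilon-squared = H / (n - 1)
epsilon_squared <- h_stat / (n_obs - 1)

round(epsilon_squared, 3)
```

Comparing this hand calculation with the function output is a quick way to confirm that the effect size was computed on the same data as the test.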
Conclusions and answer to the RQ: In conclusion, we see a relation between residential distance and the frequency of in-person communication in parent-child relations: the larger the distance between parent and child, the less frequently they communicate in person. This supports our research hypothesis.