Contributions:

Tolstokoraya Darya: Chi-square

Baturina Elina: literature review, ANOVA

Sorokina Darya: ANOVA

Suetina Anna: T-test

General information about our research

We chose Switzerland as a country for our analysis.

Topic: “Digital and social contacts within family and workplace and its relation to subjective well-being and social exclusion”

Research question: How are digital and social contacts within the family and the workplace related to subjective well-being and social exclusion?

In this project we explore several sub-questions connected with this topic.

Loading packages

library(dplyr)
library(ggplot2)
library(readr)
library (kableExtra)
library(foreign)
library(rstatix)
library(DescTools) 
library(sjstats)
library(psych)
library(sjPlot)
library(corrplot)
library(effsize)
library(coin)
library(RGraphics)
library(rcompanion)
library(car)

Loading the data from ESS round 10

ESS <- read_csv(file = '/Users/admin/Downloads/ESS10/ESS10.csv')

Selecting the country & needed variables from dataset

ESS10 <- ESS %>% 
  filter(cntry == "CH") %>% 
  select(idno, acchome, domicil, gndr, ttminpnt, speakpnt)

ESS10_1 <- ESS %>% 
  filter(cntry == "CH") %>% 
  select(idno, acchome, domicil, gndr, ttminpnt, speakpnt)

Describing variables

Label <- c("acchome", "domicil", "gndr", "ttminpnt", "speakpnt")
Meaning <- c("Ability to access the Internet from home", "Area of living type", "Gender", "Time to parents in minutes", "Frequency of speaking to parents")
Level_Of_Measurement <- c("Nominal, binary", "Nominal", "Nominal, binary", "Ratio", "Ordinal")
df <- data.frame(Label, Meaning, Level_Of_Measurement, stringsAsFactors = FALSE)

kable(df) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Label     Meaning                                   Level_Of_Measurement
acchome   Ability to access the Internet from home  Nominal, binary
domicil   Area of living type                       Nominal
gndr      Gender                                    Nominal, binary
ttminpnt  Time to parents in minutes                Ratio
speakpnt  Frequency of speaking to parents          Ordinal

Tests

Chi-square

Research question: Is there a relation between the respondent’s description of the type of the area of living and their ability to access the internet from home?

Internet usage has recently become a widely studied topic in Switzerland. The most complete research on it was conducted within the World Internet Project. The authors collected statistics on Internet usage in the country across many parameters (e.g. age, purposes, fears, types of usage). The research showed that Switzerland is one of the leading countries in the world in terms of Internet usage: around 92% of the population use the Internet. Based on this fact, we hypothesized that people all over the country have equal access to the Internet, i.e. that access to the Internet does not differ between types of area.

Reference: Latzer, Michael; Büchi, Moritz; Festic, Noemi (2020). Internet Use in Switzerland 2011–2019: Trends, Attitudes and Effects. Summary Report from the World Internet Project – Switzerland. Zurich, Switzerland: University of Zurich.

Variables

We use two categorical variables for this test.

First variable “acchome” – nominal, binary. This variable represents the respondent's ability to access the Internet from home: “Imagine you wanted to access the Internet. At which of these locations would you be able to do it?” (respondents either marked or did not mark home as a location where they can access the Internet).

ESS10$acchome <- factor(ESS10$acchome, labels = c("Don't have an access", "Have an access"), ordered= F)
class(ESS10$acchome)
## [1] "factor"
summary(ESS10$acchome)
## Don't have an access       Have an access 
##                  104                 1419

Second variable “domicil” (domicile, respondent's description) – nominal. This variable represents the respondent’s description of the type of area where they live. For this variable we first recode as missing the answers that are not needed for the analysis: 7 = “Refusal”, 8 = “Don’t know” and 9 = “No answer”.

ESS10$domicil[ESS10$domicil == 7 | ESS10$domicil == 8 | ESS10$domicil == 9] <- NA
ESS10$domicil <- factor(ESS10$domicil, labels = c("A big city", "Suburbs or outskirts of big city", "Town or small town", "Country village", "Farm or home in countryside"), ordered= F)
class(ESS10$domicil)
## [1] "factor"
summary(ESS10$domicil)
##                       A big city Suburbs or outskirts of big city 
##                              112                              164 
##               Town or small town                  Country village 
##                              386                              800 
##      Farm or home in countryside                             NA's 
##                               60                                1

Descriptive plot

The descriptive plot below shows the number of respondents living in each type of area, split by their ability to access the Internet from home.

ggplot(ESS10)+
  geom_bar(aes(x=domicil, fill=acchome), position="stack", na.rm = TRUE)+
  scale_x_discrete(na.translate = FALSE)+
  ggtitle("The relationship between the description of the area type and ability to access the internet from home")+
  xlab("Description of the area of living")+
  ylab("Number of respondents")+
 labs(caption = "ESS10, Switzerland")+
  theme(axis.text.x = element_text(angle=65, vjust = 0.5))

We see that the majority of respondents live in a country village, whereas the other types of area have far fewer residents. Because the group sizes differ so much, the proportions are hard to read from the stacked bar plot, and it is difficult to derive valid conclusions from it. To solve this issue, we use plot_xtab to look at the row proportions.

library(sjPlot)
plot_xtab (ESS10$domicil, ESS10$acchome, margin = "row", bar.pos = "stack",
         show.summary = TRUE)

Interpretation: The proportions are approximately equal across area types. The differences are small, so it is hard to tell visually whether they are significant. That is why we run a chi-squared test.

Checking assumptions

Assumptions:

  1. Data are independent and the categories are mutually exclusive

  2. At least 5 expected observations per cell

table(ESS10$acchome, ESS10$domicil) 
##                       
##                        A big city Suburbs or outskirts of big city
##   Don't have an access          8                               10
##   Have an access              104                              154
##                       
##                        Town or small town Country village
##   Don't have an access                 17              60
##   Have an access                      369             740
##                       
##                        Farm or home in countryside
##   Don't have an access                           9
##   Have an access                                51
exp<-chisq.test(ESS10$acchome, ESS10$domicil)
exp$expected
##                       ESS10$domicil
## ESS10$acchome          A big city Suburbs or outskirts of big city
##   Don't have an access   7.653088                         11.20631
##   Have an access       104.346912                        152.79369
##                       ESS10$domicil
## ESS10$acchome          Town or small town Country village
##   Don't have an access           26.37582        54.66491
##   Have an access                359.62418       745.33509
##                       ESS10$domicil
## ESS10$acchome          Farm or home in countryside
##   Don't have an access                    4.099869
##   Have an access                         55.900131

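All observed counts are above 5. One expected count (4.10, “Don't have an access” in “Farm or home in countryside”) falls slightly below 5. Since this affects only 1 of 10 cells (fewer than 20%) and no expected count is below 1, we treat the assumption as approximately met.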
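
As a cross-check, the expected counts can be recomputed by hand from the observed table, since each expected count is the row total times the column total divided by the grand total. The sketch below is only an illustration: the counts are copied from the table above, the shortened area labels are ours, and the 20% threshold is the common rule of thumb rather than something taken from the ESS documentation.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(8, 10, 17, 60, 9,
    104, 154, 369, 740, 51),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    acchome = c("Don't have an access", "Have an access"),
    domicil = c("Big city", "Suburbs", "Town", "Village", "Farm")  # shortened labels (ours)
  )
)

# Expected count of a cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Smallest expected count and share of cells with an expected count below 5
min_expected <- min(expected)
share_below_5 <- mean(expected < 5)

print(round(expected, 2))
print(min_expected)
print(share_below_5)
```

If the share of cells below 5 were well above 20%, or any expected count were below 1, an exact test such as fisher.test() would be the more cautious choice.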

Chi-square Test

H0: There is no association between the type of the area of living and ability to access the Internet from home

HA: There is an association between the type of the area of living and ability to access the Internet from home

chisq.test(ESS10$acchome, ESS10$domicil)
## 
##  Pearson's Chi-squared test
## 
## data:  ESS10$acchome and ESS10$domicil
## X-squared = 10.579, df = 4, p-value = 0.03173

Our p-value = 0.03173 is below 0.05, so we reject the null hypothesis: the two categorical variables are not independent, i.e. there is an association between the type of area of living and the ability to access the Internet from home. In other words, the share of respondents who can access the Internet from home differs between the types of area they live in.

Post-Hoc test

The analysis of the standardized residuals:

res <- chisq.test(ESS10$acchome, ESS10$domicil)
res$stdres
##                       ESS10$domicil
## ESS10$acchome          A big city Suburbs or outskirts of big city
##   Don't have an access  0.1349795                       -0.3952332
##   Have an access       -0.1349795                        0.3952332
##                       ESS10$domicil
## ESS10$acchome          Town or small town Country village
##   Don't have an access         -2.1892416       1.0854129
##   Have an access                2.1892416      -1.0854129
##                       ESS10$domicil
## ESS10$acchome          Farm or home in countryside
##   Don't have an access                   2.5581477
##   Have an access                        -2.5581477

Describe residuals: The standardized residuals of 2.5581477 and -2.5581477 for “Don't have an access” and “Have an access” in the “Farm or home in countryside” category indicate substantial deviations between the observed and expected values. There is a positive association between living on a farm or in a home in the countryside and not having access to the Internet from home.

In the “Town or small town” category, the residuals (-2.1892416 and 2.1892416) are also outside the range from -2 to 2. So there is a positive association between living in a town or small town and having access to the Internet from home.

The other values are within the range from -2 to 2, meaning that the observed counts in these cells do not deviate substantially from the expected ones.
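
The rule of thumb used here (flag cells whose standardized residual lies outside the range from -2 to 2) can be sketched as follows. This is a minimal illustration that rebuilds the table from the counts printed above. The shortened labels and the name flagged are ours, and suppressWarnings() is used only because chisq.test() warns when an expected count is below 5.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(8, 10, 17, 60, 9,
    104, 154, 369, 740, 51),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    acchome = c("Don't have an access", "Have an access"),
    domicil = c("Big city", "Suburbs", "Town", "Village", "Farm")  # shortened labels (ours)
  )
)

# Standardized residuals as returned by chisq.test()
std_res <- suppressWarnings(chisq.test(observed))$stdres

# Cells that deviate from independence by more than 2 in absolute value
flagged <- abs(std_res) > 2

print(round(std_res, 2))
print(which(flagged, arr.ind = TRUE))
```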

Visualize residuals:

corrplot(chisq.test(ESS10$acchome, ESS10$domicil)$stdres,  is.corr = FALSE, method = "number")

Conclusions:

After conducting the chi-squared test, we can conclude that there is a relation between the respondent’s description of the type of area of living and their ability to access the Internet from home (i.e. the categorical variables “domicil” and “acchome” are not independently distributed). Based on the residual analysis, we can identify the cells that contribute most to the test result. In towns or small towns, fewer respondents than expected lack access to the Internet at home. On the other hand, among people who live on a farm or in a home in the countryside, more respondents than expected lack access to the Internet at home.

Thus, our original hypothesis is not supported: people in Switzerland living in different types of area differ in their access to the Internet from home.

T-test

Research question: Do Swiss people of different genders (female, male) differ in the mean time, in minutes, needed to get to their parents' place of living?

The research of Kolk (2016) aimed to establish the geographical distance between children and their parents, by gender, in Sweden. Unfortunately, the study does not have data on children of different ages. However, it reports that mothers, in comparison to fathers, tend to live closer to their children. We applied this logic to our data examination and hypothesized that female children tend to live closer to their parents in Switzerland.

Reference: Kolk, Martin (2016). A Life-Course Analysis of Geographical Distance to Siblings, Parents, and Grandparents in Sweden. Population, Space and Place

Data inspection

We are going to run an independent samples t-test, where:

Categorical variable: gndr - Gender of respondents

ESS10_ttminpnt <- ESS10 %>%
  select(gndr, ttminpnt) %>% 
  filter(ttminpnt != 6666) %>% 
  filter (ttminpnt != 7777) %>% 
  filter(ttminpnt != 8888) %>% 
  filter (ttminpnt != 9999)

ESS10_ttminpnt$gndr <- factor(ESS10_ttminpnt$gndr, labels = c("Male", "Female"), ordered= F)
class (ESS10_ttminpnt$gndr)
## [1] "factor"
summary(ESS10_ttminpnt$gndr)
##   Male Female 
##    391    397

Description of variables: The “gndr” variable is categorical and binary, since it has 2 categories (according to the summary, there are 391 males and 397 females). R identified the class of the variable as “numeric”, so we converted it into a “factor”, which corresponds to categorical data.

Continuous variable: ttminpnt - Travel time to parent, in minutes

ESS10_ttminpnt$ttminpnt <- as.numeric(ESS10_ttminpnt$ttminpnt)
class(ESS10_ttminpnt$ttminpnt)
## [1] "numeric"
summary(ESS10_ttminpnt$ttminpnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    10.0    30.0   188.1   180.0  4320.0

The variable “ttminpnt” is a continuous variable. R identified it as “integer”, so we converted it into “numeric”. According to the measures of central tendency, the mean travel time to the parents is 188.1 minutes and the median is 30.0 minutes. The minimum is 0 and the maximum is 4320. The mean is far above the median, which already points to a strongly right-skewed distribution.

Descriptive plot

Boxplot can help to visualize our data:

library (ggplot2)
ggplot(ESS10_ttminpnt)+
  geom_boxplot(aes(x=gndr, y=ttminpnt), fill="#FFDDFF", col="#221100",alpha = 0.5)+
  ylim(0, 500)+
  ggtitle("Minutes spent on getting to the parents by Gender of Respondent")+
  xlab("Gender of respondents")+
  ylab("Duration of time for getting to the parents in minutes")

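Interpretation: based on the plot, females need more time to get to their parents (median of approximately 25 minutes) than males (approximately 20 minutes). Taking travel time as a proxy for distance, this suggests that females live somewhat farther from their parents than males. Note that these medians are read from a plot restricted to 0-500 minutes. The full-sample medians reported below are 30 minutes for males and 35 minutes for females.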
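
One caveat about the plot code: in ggplot2, ylim(0, 500) removes observations outside the limits before the boxplot statistics are computed, whereas coord_cartesian(ylim = c(0, 500)) only zooms the view. The sketch below illustrates the effect on the median with hypothetical travel times (not ESS data). The ggplot2 part runs only if the package is installed.

```r
# Hypothetical travel times in minutes (illustrative values, not ESS data)
travel_time <- c(5, 10, 20, 30, 45, 60, 180, 600, 1200, 4320)

# Median of all observations versus median of what ylim(0, 500) would keep
median_full <- median(travel_time)
median_clipped <- median(travel_time[travel_time <= 500])

print(c(full = median_full, clipped = median_clipped))

# Zooming without dropping data, if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(travel_time), aes(x = "all", y = travel_time)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 500))  # statistics still use every observation
}
```

With heavily right-skewed data such as travel times, the clipped median is lower than the full-sample median, so a zoomed plot is usually the safer way to hide extreme outliers.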

Summary about data inspection:

  • there are more than 300 observations in each group

  • females have a higher median time to get to their parents (= they typically have a longer distance between themselves and their parents)

Checking assumptions

Checking the normality assumption for the t-test

Here we are going to check normality of distribution of our continuous variable (time to get to parents) by Gender.

  1. Density plot
ggplot(ESS10_ttminpnt, aes(x = ttminpnt, color = gndr, fill = gndr)) +
      geom_density(alpha = 0.5) +
      labs(title = "Minutes spent on getting to parents by Gender", x = "Duration of time to get to the parents in Minutes", y = "Density") +
      theme_classic()

Interpretation: this density plot shows that the distributions are skewed to the right (i.e. the right tail is stretched).

  2. Skew and kurtosis
#install.packages("psych")
library(psych)
describeBy(ESS10_ttminpnt, group = ESS10_ttminpnt$gndr)
## 
##  Descriptive statistics by group 
## group: Male
##          vars   n   mean     sd median trimmed   mad min  max range skew
## gndr        1 391   1.00   0.00      1    1.00  0.00   1    1     0  NaN
## ttminpnt    2 391 195.61 382.47     30   99.64 37.06   0 3000  3000  3.3
##          kurtosis    se
## gndr          NaN  0.00
## ttminpnt    14.71 19.34
## ------------------------------------------------------------ 
## group: Female
##          vars   n   mean     sd median trimmed   mad min  max range skew
## gndr        1 397   2.00   0.00      2    2.00  0.00   2    2     0  NaN
## ttminpnt    2 397 180.73 352.11     35  103.69 44.48   1 4320  4319 5.34
##          kurtosis    se
## gndr          NaN  0.00
## ttminpnt     49.2 17.67

Interpretation:

Males: skew (3.3) is far from that of a normal distribution (above 0.5), and so is kurtosis (14.71, above 1), in line with the graph above (very sharp peak and long tail).

Females: skew (5.34) is far from that of a normal distribution (above 0.5), and so is kurtosis (49.2, above 1), in line with the graph above (very sharp peak and long tail).

In both groups the distribution is skewed and not normal.
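
To make the two measures concrete, the sketch below computes moment-based skewness and excess kurtosis by hand for a small hypothetical right-skewed sample (not ESS data). It assumes the simple population-moment formulas. psych::describe() applies a slightly different small-sample variant by default, so its numbers can differ a little from these.

```r
# Hypothetical right-skewed sample (illustrative values, not ESS data)
x <- c(0, 5, 10, 10, 15, 30, 30, 60, 120, 600)

# Central moments
m <- mean(x)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

# Skewness is 0 and excess kurtosis is 0 for a normal distribution
skewness <- m3 / m2^1.5
excess_kurtosis <- m4 / m2^2 - 3

print(c(skewness = skewness, excess_kurtosis = excess_kurtosis))
```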

  3. QQ-plot

Here we also check normality with a Q-Q plot:

qqnorm(ESS10_ttminpnt$ttminpnt)
qqline(ESS10_ttminpnt$ttminpnt)

Interpretation: the Q-Q plot does not look normal (heavy right tail and U-shaped line): the points do not follow the straight line.

  4. Shapiro test

Here we also check the normality of our data with a formal test.

shapiro.test(ESS10_ttminpnt$ttminpnt)
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS10_ttminpnt$ttminpnt
## W = 0.54001, p-value < 2.2e-16

Interpretation: according to the Shapiro-Wilk test, we reject the null hypothesis of normality (p-value < 0.05), so the distribution is not normal.

Homogeneity of variances assumption

Here is a visual comparison of the variances in the two groups (males and females) using boxplots:

ggplot(ESS10_ttminpnt, aes(x = gndr, y = ttminpnt)) + 
  ylim(0, 500) +
  geom_boxplot() +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +
  theme_classic() +
  ggtitle("Minutes spent on getting to parents by Gender of Respondent")

Interpretation: Women have a wider distribution, while men have a narrower one. The median time to reach the parents is slightly higher for women than for men, while the means are almost the same. Both distributions are skewed: the mean points are displaced well towards the longer tail and do not align with the medians. There are also many outliers (points on the plot).

Here we use a test to check our visual impression.

H0: Variances are equal.

HA: Variances are not equal.

bartlett.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## Bartlett's K-squared = 2.683, df = 1, p-value = 0.1014

Interpretation: according to the Bartlett test, we fail to reject the null hypothesis (p-value > 0.05), so we have no evidence that the variances of the groups differ.

T-Test

The distributions of the continuous variable are not normal, but the number of observations in both groups is large enough, so we can still run the t-test (and set the non-parametric alternative aside for now).

H0: The mean time to get to the parents for males is equal to the mean time to get to the parents for females.

HA: The mean time to get to the parents for males is not equal to the mean time to get to the parents for females.

Note: the Bartlett test did not reject the equality of variances, so the pooled-variance t-test would also be admissible. We nevertheless apply Welch’s correction (var.equal = F), which does not assume equal variances and is a safe default.

t.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr, var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## t = 0.56798, df = 778.57, p-value = 0.5702
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -36.54920  66.31075
## sample estimates:
##   mean in group Male mean in group Female 
##             195.6113             180.7305

Interpretation: according to the Welch two-sample t-test, we fail to reject the null hypothesis (p-value > 0.05), so there is no statistically significant difference in the mean time to get to the parents between males and females.
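
The Welch statistic and its degrees of freedom can be reproduced from the group summaries alone. The sketch below plugs in the means from the t-test output and the standard deviations and group sizes from the describeBy output above. Because the standard deviations are rounded to two decimals, the results agree with the printed output only up to rounding.

```r
# Group summaries taken from the outputs above
mean_m <- 195.6113; sd_m <- 382.47; n_m <- 391   # males
mean_f <- 180.7305; sd_f <- 352.11; n_f <- 397   # females

# Squared standard errors of the two group means
se2_m <- sd_m^2 / n_m
se2_f <- sd_f^2 / n_f

# Welch t statistic
t_stat <- (mean_m - mean_f) / sqrt(se2_m + se2_f)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_m + se2_f)^2 / (se2_m^2 / (n_m - 1) + se2_f^2 / (n_f - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```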

Effect size (t-test)

cohen.d(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr, na.rm = T)
## 
## Cohen's d
## 
## d estimate: 0.04049353 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.09938187  0.18036893

Interpretation: the Cohen’s d effect size estimate is 0.04049353. This value indicates a negligible effect size, which means that there is very little difference between the mean values of the two groups being compared (this is consistent with the result of the t-test).
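
Cohen's d is the difference in means divided by the pooled standard deviation. As a minimal sketch, the value above can be recovered from the group summaries printed earlier (means from the t-test output, standard deviations and group sizes from describeBy). The rounded standard deviations make the match approximate.

```r
# Group summaries taken from the outputs above
mean_m <- 195.6113; sd_m <- 382.47; n_m <- 391   # males
mean_f <- 180.7305; sd_f <- 352.11; n_f <- 397   # females

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / (n_m + n_f - 2))

# Cohen's d: standardized mean difference
cohens_d <- (mean_m - mean_f) / sd_pooled

print(cohens_d)
```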

Non-parametric test

Since our data are not normally distributed, the t-test is not fully reliable in this case. So we also run its non-parametric alternative (the Wilcoxon rank-sum test) to double-check the results.

H0: The distribution of time to get to the parents in minutes is the same for males and females (the location shift is equal to 0).

HA: The distribution of time to get to the parents in minutes is shifted between males and females (the location shift is not equal to 0).

wilcox.test(ESS10_ttminpnt$ttminpnt ~ ESS10_ttminpnt$gndr)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ESS10_ttminpnt$ttminpnt by ESS10_ttminpnt$gndr
## W = 72632, p-value = 0.1183
## alternative hypothesis: true location shift is not equal to 0

Interpretation: according to the Wilcoxon rank-sum test, our p-value is 0.1183, which is greater than 0.05. So we fail to reject the null hypothesis: there is no significant difference in the location of the time to get to parents between women and men.

wilcox_effsize(ttminpnt ~ gndr, data = ESS10_ttminpnt, na.rm = T)
## # A tibble: 1 × 7
##   .y.      group1 group2 effsize    n1    n2 magnitude
## * <chr>    <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 ttminpnt Male   Female  0.0556   391   397 small

Interpretation: the effect size is 0.05564254, which is interpreted as a small effect (this is consistent with the result of the non-parametric test). It means that the time to get to parents differs very little between males and females.

Conclusions and answer to the RQ: Based on the visualizations and tests, the data are not normally distributed. According to the tests we ran, there is no statistically significant difference in the mean time in minutes to get to the parents between females and males. Thus, there is not enough evidence to state that Swiss people of different genders differ in the mean time in minutes spent on getting to their parents.
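
The effect size r for the Wilcoxon test is the absolute Z statistic divided by the square root of the total sample size. As a rough sketch, Z can be recovered from the two-sided p-value under the normal approximation. This is an assumption for illustration only, since wilcox_effsize computes Z internally and may differ slightly.

```r
# Values taken from the outputs above
p_value <- 0.1183      # two-sided p-value of the Wilcoxon rank-sum test
n_total <- 391 + 397   # males + females

# Z statistic implied by the two-sided p-value (normal approximation)
z_stat <- qnorm(1 - p_value / 2)

# Effect size r = |Z| / sqrt(N)
r_effect <- abs(z_stat) / sqrt(n_total)

print(r_effect)
```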

ANOVA

Research question: Is there a relation between the amount of time people need to get to their parents and the frequency of speaking with them in person in Switzerland?

The study conducted by Schwarz, Trommsdorff, Albert and Mayer examined the quality of parent-child relationships. One of the measures they used in the analysis was “residential distance”. They found that residential distance is negatively correlated with emotional and instrumental types of support, especially for mother-child relationships. Therefore we hypothesized that there is a relation between the distance between parent and child and the frequency of their communication.

Reference: Schwarz, Beate; Trommsdorff, Gisela; Albert, Isabelle; Mayer, Boris (2005). Adult Parent–Child Relationships: Relationship Quality, Support, and Reciprocity. Applied Psychology, 54(3), 396–417. doi:10.1111/j.1464-0597.2005.00217.x

Data inspection

ESS10_anova <- ESS %>% 
  filter(cntry == "CH" & speakpnt <= 7 & ttminpnt != 6666) %>% 
  select(idno, ttminpnt, speakpnt)

First variable speakpnt – This variable answers the question “How often do you speak with them in person? Please only include occasions where you are physically in the same location.” It indicates the frequency of speaking to parents in person.

ESS10_anova$speakpnt <- factor(ESS10_anova$speakpnt, labels = c('Several times a day', 'Once a day', 'Several times a week', 'Several times a month', 
                                                    'Once a month', 'Less often', 'Never' ), ordered = T)
class(ESS10_anova$speakpnt)
## [1] "ordered" "factor"
summary(ESS10_anova$speakpnt)
##   Several times a day            Once a day  Several times a week 
##                    24                    34                   177 
## Several times a month          Once a month            Less often 
##                   234                    86                   210 
##                 Never 
##                    36

The second variable ttminpnt was described previously - Travel time to parent, in minutes

ESS10_anova$ttminpnt <- as.numeric(ESS10_anova$ttminpnt)
class(ESS10_anova$ttminpnt)
## [1] "numeric"
summary(ESS10_anova$ttminpnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    10.0    30.0   335.6   239.0  8888.0

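The variable “ttminpnt” is a continuous variable. Note that the maximum of 8888 is a non-response code rather than a real travel time: the filter above removes only 6666, so the codes 7777, 8888 and 9999 should also be excluded, as was done for the t-test.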
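
The summary above has a maximum of 8888, which is one of the non-response codes that were filtered out in the t-test section. A minimal sketch of excluding all four codes in one step is shown below. The toy data frame and the names toy, missing_codes and clean are hypothetical and serve only as an illustration.

```r
# Hypothetical mini data set (not ESS data); 6666, 7777, 8888 and 9999
# are the non-response codes removed in the t-test section
toy <- data.frame(
  idno = 1:7,
  ttminpnt = c(10, 30, 6666, 7777, 8888, 9999, 120)
)

missing_codes <- c(6666, 7777, 8888, 9999)

# Keep only rows whose travel time is a real value
clean <- toy[!(toy$ttminpnt %in% missing_codes), ]

print(summary(clean$ttminpnt))
```

Leaving such codes in the data inflates the mean and the variance, so the summary is worth re-checking after filtering.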

Descriptive plot

Now let's make a box plot to inspect our data

ggplot(ESS10_anova)+
  geom_boxplot(aes(x=speakpnt, y=ttminpnt), fill="#367588", col="#6a5acd", alpha = 0.5)+
   scale_x_discrete(na.translate = FALSE)+
  ggtitle("Relationship between the frequency of live communication with parents and time of people getting to parents")+
  xlab("How often speak")+
  ylab("Time to parent")+
  theme(axis.text = element_text(size = 7, angle=90))  

Interpretation: some groups differ visibly, while others do not. It is hard to judge the differences because the many outliers compress the boxes.

Let's group the categories by approximate frequency

ESS10_anova$speak <- rep(NA, length(ESS10_anova$speakpnt)) #new variable with grouped data from speakpnt

ESS10_anova$speak [ESS10_anova$speakpnt == "Several times a day"| 
ESS10_anova$speakpnt == "Once a day"] <- "Daily"

ESS10_anova$speak [ESS10_anova$speakpnt == "Several times a week" ] <- "Weekly"

ESS10_anova$speak [ESS10_anova$speakpnt  == "Several times a month"| 
ESS10_anova$speakpnt  == "Once a month" ] <- "Monthly" 

ESS10_anova$speak [ESS10_anova$speakpnt  == "Less often"] <- "Less often"
ESS10_anova$speak [ESS10_anova$speakpnt  == "Never" ] <- "Never" 

ESS10_anova$speak  <- as.factor(ESS10_anova$speak)

ESS10_anova$speak <- factor(ESS10_anova$speak, levels = c("Daily", "Weekly", "Monthly", "Less often", "Never"))
table(ESS10_anova$speak)
## 
##      Daily     Weekly    Monthly Less often      Never 
##         58        177        320        210         36
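
The same regrouping can be written more compactly in base R by assigning a named list to levels(), where each list name is a new level and its elements are the old levels to merge. The sketch below uses a small hypothetical factor rather than the ESS data.

```r
# Hypothetical answers (not ESS data) with the seven original categories
old_levels <- c("Several times a day", "Once a day", "Several times a week",
                "Several times a month", "Once a month", "Less often", "Never")
speakpnt <- factor(c(old_levels, "Once a day"), levels = old_levels)

# Merge old levels into the five grouped categories
speak <- speakpnt
levels(speak) <- list(
  Daily = c("Several times a day", "Once a day"),
  Weekly = "Several times a week",
  Monthly = c("Several times a month", "Once a month"),
  `Less often` = "Less often",
  Never = "Never"
)

print(table(speak))
```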

And make a box plot for the new groups

ggplot(ESS10_anova)+
  geom_boxplot(aes(x=speak, y=ttminpnt), fill="#367588", col="#6a5acd", alpha = 0.5)+
   scale_x_discrete(na.translate = FALSE)+
  ggtitle("Relationship between the frequency of live communication with parents and time of people getting to parents")+
  xlab("How often speak")+
  ylab("Time to parent")+
  theme(axis.text = element_text(size = 7, angle=90))  

Interpretation: we still see some differences between the groups with different frequency of speaking with parents in person, but it is hard to judge the significance of these differences visually.

Checking assumptions for ANOVA

Homogeneity of variances

H0: variances are equal

H1: variances are not equal

leveneTest(ESS10_anova$ttminpnt ~ ESS10_anova$speak)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   4  19.414 2.956e-15 ***
##       796                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variances are not equal, as the p-value is less than 0.05. Thus, we use var.equal = F in our ANOVA test below.
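
Levene's test with center = median is equivalent to a one-way ANOVA on the absolute deviations of each observation from its group median. The base R sketch below shows this on hypothetical data (three made-up groups with different spread). It illustrates the idea and does not reproduce the result above.

```r
# Hypothetical data (not ESS data): three groups with visibly different spread
y <- c(10, 12, 11, 13, 9,
       5, 20, 35, 1, 50,
       100, 10, 300, 40, 700)
g <- factor(rep(c("A", "B", "C"), each = 5))

# Absolute deviation of each observation from its group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations gives the Levene F statistic and p-value
levene <- anova(lm(abs_dev ~ g))
f_value <- levene[["F value"]][1]
p_value <- levene[["Pr(>F)"]][1]

print(c(F = f_value, p = p_value))
```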

Performing F-test

oneway.test(ESS10_anova$ttminpnt ~ ESS10_anova$speak, var.equal = F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  ESS10_anova$ttminpnt and ESS10_anova$speak
## F = 11.486, num df = 4.00, denom df = 178.95, p-value = 2.541e-08
str(oneway.test(ESS10_anova$ttminpnt ~ ESS10_anova$speak, var.equal = F))
## List of 5
##  $ statistic: Named num 11.5
##   ..- attr(*, "names")= chr "F"
##  $ parameter: Named num [1:2] 4 179
##   ..- attr(*, "names")= chr [1:2] "num df" "denom df"
##  $ p.value  : num 2.54e-08
##  $ method   : chr "One-way analysis of means (not assuming equal variances)"
##  $ data.name: chr "ESS10_anova$ttminpnt and ESS10_anova$speak"
##  - attr(*, "class")= chr "htest"

Checking the residuals

one.way.anova <- aov(ESS10_anova$ttminpnt ~ ESS10_anova$speak)
summary(one.way.anova)
##                    Df    Sum Sq  Mean Sq F value Pr(>F)    
## ESS10_anova$speak   4 118573377 29643344   24.19 <2e-16 ***
## Residuals         796 975437988  1225425                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value is less than 0.05, the difference in the time needed to get to parents across the frequency groups is statistically significant.

Now let's check the second assumption for ANOVA, the normality of residuals
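
The F value in the table is the ratio of the between-group mean square to the residual mean square. As a small worked check, it can be recomputed from the sums of squares and degrees of freedom printed in the aov summary above.

```r
# Sums of squares and degrees of freedom from the aov summary above
ss_between <- 118573377; df_between <- 4
ss_within  <- 975437988; df_within  <- 796

# Mean squares
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```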

plot(one.way.anova, 2)

We see that the points do not lie along the diagonal line, so the distribution of residuals is far from normal.

Now let's check the normality of residuals using a test

anova_residuals <- residuals(one.way.anova)
describe(anova_residuals)
##    vars   n mean      sd median trimmed   mad      min     max   range skew
## X1    1 801    0 1104.22 -84.38 -117.09 71.57 -1597.25 8750.35 10347.6 6.14
##    kurtosis    se
## X1     41.1 39.02

The skew (6.14) and kurtosis (41.1) are far above 2, so we again see that the residuals are not normal.

shapiro.test(x = anova_residuals) 
## 
##  Shapiro-Wilk normality test
## 
## data:  anova_residuals
## W = 0.33506, p-value < 2.2e-16

The residuals are clearly not normal, as the p-value is far below 0.05.

hist(anova_residuals)

Visually we also see that the residuals are not normal: their distribution is skewed to the right and has many outliers.

As not all the assumptions for ANOVA are met (namely, our residuals are not normally distributed), we use the non-parametric alternative to ANOVA, the Kruskal-Wallis test.

kruskal.test(ESS10_anova$ttminpnt ~ ESS10_anova$speakpnt)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ESS10_anova$ttminpnt by ESS10_anova$speakpnt
## Kruskal-Wallis chi-squared = 391.25, df = 6, p-value < 2.2e-16

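The p-value is less than 0.05, so there is a significant difference between the mean ranks of the frequency groups. Note that this test uses the seven original speakpnt categories (df = 6), not the five grouped categories in speak.

Post-hoc test for the non-parametric test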

DunnTest(ESS10$ttminpnt ~ ESS10$speakpnt)
## 
##  Dunn's test of multiple comparisons using rank sums : holm  
## 
##       mean.rank.diff    pval    
## 2-1       -542.97148 8.4e-14 ***
## 3-1       -717.23890 < 2e-16 ***
## 4-1       -657.65260 < 2e-16 ***
## 5-1       -559.38737 < 2e-16 ***
## 6-1       -361.76776 1.1e-15 ***
## 7-1       -295.86223  0.0027 ** 
## 66-1       133.62426  0.0076 ** 
## 77-1       -98.62574  1.0000    
## 88-1       502.62426  1.0000    
## 3-2       -174.26741  0.2120    
## 4-2       -114.68111  1.0000    
## 5-2        -16.41588  1.0000    
## 6-2        181.20373  0.1575    
## 7-2        247.10926  0.1575    
## 66-2       676.59574 < 2e-16 ***
## 77-2       444.34574  1.0000    
## 88-2      1045.59574  0.2439    
## 4-3         59.58630  1.0000    
## 5-3        157.85153  0.0893 .  
## 6-3        355.47114 3.9e-16 ***
## 7-3        421.37667 5.5e-07 ***
## 66-3       850.86316 < 2e-16 ***
## 77-3       618.61316  0.6543    
## 88-3      1219.86316  0.0893 .  
## 5-4         98.26523  0.9297    
## 6-4        295.88484 1.2e-12 ***
## 7-4        361.79037 2.6e-05 ***
## 66-4       791.27686 < 2e-16 ***
## 77-4       559.02686  0.9297    
## 88-4      1160.27686  0.1286    
## 6-5        197.61961  0.0058 ** 
## 7-5        263.52514  0.0342 *  
## 66-5       693.01163 < 2e-16 ***
## 77-5       460.76163  1.0000    
## 88-5      1062.01163  0.2221    
## 7-6         65.90553  1.0000    
## 66-6       495.39202 < 2e-16 ***
## 77-6       263.14202  1.0000    
## 88-6       864.39202  0.6543    
## 66-7       429.48649 4.1e-08 ***
## 77-7       197.23649  1.0000    
## 88-7       798.48649  0.9297    
## 77-66     -232.25000  1.0000    
## 88-66      369.00000  1.0000    
## 88-77      601.25000  1.0000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the call above uses ESS10, which still contains the missing-value codes (groups 66, 77 and 88 of speakpnt, and the non-response codes of ttminpnt), so the printed comparisons are not those of the analysis sample. The test should be run on ESS10_anova. We see significant differences in the following pairs of groups: Once a month-Several times a day, Less often-Several times a day, Never-Several times a day, Several times a month-Once a day, Once a month-Once a day, Less often-Once a day, Never-Once a day, Several times a month-Several times a week, Once a month-Several times a week, Less often-Several times a week, Never-Several times a week, Once a month-Several times a month, Less often-Several times a month, Never-Several times a month, Less often-Once a month, Never-Once a month.
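
The header of the Dunn's test output states that the p-values are Holm-adjusted. As a minimal sketch of what this adjustment does, base R's p.adjust() can be applied to a few hypothetical raw p-values (the numbers below are made up for illustration).

```r
# Hypothetical raw p-values from three pairwise comparisons
raw_p <- c(0.01, 0.04, 0.03)

# Holm: sort ascending, multiply the i-th smallest by (m - i + 1),
# then enforce monotonicity and cap at 1
holm_p <- p.adjust(raw_p, method = "holm")

print(holm_p)
```

Holm-adjusted p-values are never smaller than the raw ones, which is why some comparisons that look significant in isolation are no longer significant after adjustment.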

Effect size

epsilonSquared(x = ESS10$ttminpnt, g = ESS10$speakpnt)
## epsilon.squared 
##            0.71

For the analysis sample we got a result of 0.489 (equivalently H / (n - 1) = 391.25 / 800), which represents a large effect, so there is a strong, statistically significant difference among the groups of frequency of speaking to parents in person. The value 0.71 printed above was computed on ESS10, which still contains the missing-value codes, so the call should use ESS10_anova.

Conclusions and answer to the RQ: In conclusion, we see a relation between residential distance and the frequency of in-person communication in parent-child relations: the larger the distance between parent and child, the less frequently they communicate in person. This gives statistical support to our research hypothesis.
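
Epsilon-squared for the Kruskal-Wallis test can be written as H / (n - 1), where H is the test statistic and n the number of observations. The sketch below plugs in the Kruskal-Wallis statistic reported above and the sample size of ESS10_anova (801, the sum of the speakpnt group counts). It is a hand calculation and not the output of epsilonSquared().

```r
# Values taken from the outputs above
h_statistic <- 391.25   # Kruskal-Wallis chi-squared
n_obs <- 801            # observations in ESS10_anova

# Epsilon-squared effect size
epsilon_squared <- h_statistic / (n_obs - 1)

print(round(epsilon_squared, 3))
```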