Introduction

This is the final project for Data 606, the objective of which is to conduct a reproducible analysis of my own choosing. I chose to use data focused on subjective happiness, as described in more detail below.

Research Question

As someone about to enter the world of parenthood I thought it would be interesting to look at the potential relationship between having children and one’s level of happiness. The specific research question I had planned on asking is: Is there a difference in levels of happiness between non-parents and parents? However, mid-way through my original analysis I realized that there were issues with the data which rendered it less useful so I re-framed the question as Is there a difference in levels of life satisfaction betrween non-parents and parents?

Null hypothesis: There is no difference in average levels of life satisfaction between non-parents and parents
Alternative hypothesis: There is some difference in average levels of life satisfaction between non-parents and parents

My focus was primarily on whether parents (i.e. people with at least one child) are on average more satisfied with their lives than non-parents. I also looked at whether the answer varies depending on the number of children one has and/or the country they live in.

Overview of Data

The data is “Wave 7” from the World Values Survey (WSV), and the details - including a link to where the full data set can be downloaded, is below. However, please note that the date used in this file is an extract I made from that original data set that only includes data for the variables relevant for this project. The original file includes 536 separate variables.

The dataset currently (the website notes that it is still a work in progress) spans 77 countries and includes 69,578 observations covering the years 2017 to 2020.

Citation:

WVS wave 7 (2017-2020): Haerpfer, C., Inglehart, R., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano J., M. Lagos, P. Norris, E. Ponarin & B. Puranen et al. (eds.). 2020. World Values Survey: Round Seven (2017 to 2020) Country-Pooled Datafile. Madrid, Spain & Vienna, Austria: JD Systems Institute & WVSA Secretariat [Version: http://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp].

Method of Data Collection

From the WVS Website “The main method of data collection in the WVS survey is face-to-face interview at respondent’s home / place of residence. Other interview modes employed in WVS-7 include postal survey, self-administered online survey, and telephone interview (in combination with other surveying techniques).”

Dependent and Independent Variables

The dependent (response) variable is level of life satisfaction, which is technical a categorical (ordinal) variable with values ranging from 1 to 10 but per discussion with the professor will be treated as a quantitative variable for this project.
The main independent (explanatory) variable will be whether the respondent is a parent, which is a binary variable (is parent, or is not parent).
The secondary independent variable will be the country they live in, which is categorical (nominal).

Data Prep and Tidying

The following sections are focused on obtaining and preparing the data for analysis.

Load Required Libraries

library(tidyverse)
library(ggplot2)
library(cowplot)
library(kableExtra)
library(countrycode)

Load the Data and Create a Data Frame

# Load data from Github
file = "https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/wvsdata_606project.csv"
df <- read.csv(file, header=TRUE)

# Create data frame using with more readable column names and excluding the row number
names(df) <- c("row_num", "country_code", "level_happiness", "life_satisfaction", "num_children")
df <- df %>% select("country_code", "num_children", "life_satisfaction")

Further Tidying of Data

# Remove rows with NA values
df_new <- na.omit(df)

# Add a variable that matches the region to each country name
df_new <- df_new %>% 
  mutate(region = countrycode(sourcevar = df_new$country_code, origin = "iso3c", destination = "region"))

# Check for NA values in the new column, which indicates 0 issues
df_new %>% filter(is.na(region) == TRUE)

## [1] country_code      num_children      life_satisfaction region           
## <0 rows> (or 0-length row.names)

Initial / Exploratory Analysis

# How many levels of life satisfaction are in the dataset?
df_new %>% group_by(life_satisfaction) %>% summarize("num_rows" = n())

## # A tibble: 10 x 2
##    life_satisfaction num_rows
##                <int>    <int>
##  1                 1     2026
##  2                 2     1013
##  3                 3     2289
##  4                 4     3296
##  5                 5     8566
##  6                 6     7807
##  7                 7    11369
##  8                 8    13688
##  9                 9     7118
## 10                10    12634

# Histogram of number of children
hist(df_new$num_children, xlab = "Number of Children", main = "Histogram: Number of Children")

# Mean of life satisfaction level for top 10 countries w/ mean number of children
df_summarized <- df_new %>% group_by(country_code) %>% 
  summarize("avg_life_satisfaction" = mean(life_satisfaction), "avg_num_children" = mean(num_children))
top_n(arrange(df_summarized, desc(avg_life_satisfaction)), 10)

## # A tibble: 10 x 3
##    country_code avg_life_satisfaction avg_num_children
##    <chr>                        <dbl>            <dbl>
##  1 KGZ                           8.37             2.49
##  2 MEX                           8.15             2.47
##  3 TJK                           7.95             2.79
##  4 PAK                           7.67             2.65
##  5 BOL                           7.47             2.26
##  6 PHL                           7.37             2.64
##  7 JOR                           6.87             3.13
##  8 NGA                           5.61             2.30
##  9 ZWE                           4.95             2.61
## 10 IRQ                           4.46             2.44

# Separate data for parents and non-parents
parents_df <- df_new %>% filter(num_children > 0)
nonparents_df <- df_new %>% filter(num_children == 0)

I then created box plots for each of the groups, but given that these are technical ordinal variables with the same range (and therefore the same median), these charts were not very helpful.

# Boxplots of life satisfaction levels - parents vs non-parents
boxplot(parents_df$life_satisfaction, nonparents_df$life_satisfaction, names = c("parents", "non-parents"), ylab = "Life Satisfaction", main = "Life Satisfaction (Parents vs Non-Parents)")

parents_df %>% group_by(life_satisfaction) %>% summarise(num_observations = n())

## # A tibble: 10 x 2
##    life_satisfaction num_observations
##                <int>            <int>
##  1                 1             1496
##  2                 2              723
##  3                 3             1603
##  4                 4             2266
##  5                 5             6124
##  6                 6             5414
##  7                 7             7810
##  8                 8             9979
##  9                 9             5202
## 10                10             9708

nonparents_df %>% group_by(life_satisfaction) %>% summarise(num_observations = n())

## # A tibble: 10 x 2
##    life_satisfaction num_observations
##                <int>            <int>
##  1                 1              530
##  2                 2              290
##  3                 3              686
##  4                 4             1030
##  5                 5             2442
##  6                 6             2393
##  7                 7             3559
##  8                 8             3709
##  9                 9             1916
## 10                10             2926

Summary Statistics and Inference

To determine whether there is a significant difference between the two population means (parents and non-parents), a T-test was used.

As a first step, we need to evaluate the two Central Limit Conditions:

Independence, extended: The WVS website says ‘All countries employed random probability representative samples of the adult population’, so this condition is satisfied.
Normality: The sample size is significantly larger than 30 for each group, as detailed below, so assuming there are no clear outliers this condition is satisfied.

To further check normality histograms and QQ plots were created for each group:

# Histograms to check normality condition for both groups
hist_nonparents <- ggplot(data=nonparents_df, aes(life_satisfaction)) + geom_histogram(binwidth = 1)
hist_parents <- ggplot(data=parents_df, aes(life_satisfaction)) + geom_histogram(binwidth = 1)

# QQ plots to test normality of data for both groups
qq_nonparents <- ggplot(nonparents_df, aes(sample = life_satisfaction)) +
  stat_qq() +
  stat_qq_line()

# QQ plot to test normality of data for parents
qq_parents <- ggplot(parents_df, aes(sample = life_satisfaction)) +
  stat_qq() +
  stat_qq_line()

plot_grid(hist_nonparents,
          hist_parents,
          qq_nonparents,
          qq_parents,
          labels = c('Non-Parents','Parents'),
          label_x = 0.2,
          ncol = 2)

After confirming the conditions for inference have been met, summary statistics and a t-test can be performed. The t-test was done manually and via an r function for illustrative purposes.

# Find the means of each group
nonparents_mean <- mean(nonparents_df$life_satisfaction)
parents_mean <- mean(parents_df$life_satisfaction)
difference_mean <- parents_mean - nonparents_mean

# Find the standard deviation of each group
nonparents_sd <- sd(parents_df$life_satisfaction)
parents_sd <- sd(nonparents_df$life_satisfaction)

# Find 'N' (number of observations) for each group
nonparents_n <- nonparents_df %>% summarize(n()) %>% as.numeric()
parents_n <- parents_df %>% summarize(n()) %>% as.numeric()

# Degrees of freedom, take the smaller of the two observation counts and subtract 1
degfreedom <- if_else(nonparents_n < parents_n, nonparents_n - 1, parents_n - 1)

# Calculate the standard error
standard_error <- sqrt(
  (nonparents_sd ^ 2 / nonparents_n) +
    (parents_sd ^ 2 / parents_n)
)

# Calculate the t-statistic
tstat_manual <- difference_mean / standard_error

# Find the 95% confidence interval for the difference in means
interval_low <- difference_mean - standard_error
interval_high <- difference_mean + standard_error

# The p-value is smaller than 0.05, so we reject the null hypothesis
pvalue_manual <- 2*pt(-abs(tstat_manual), df=degfreedom)
result <- if_else(pvalue_manual > 0.05, "fail to reject", "reject")

cat("Nonparents:", "\n Mean:",round(nonparents_mean,2), "\n Number of Observations (N):", nonparents_n, "\n Standard Deviation:", nonparents_sd, "\n\n")

## Nonparents: 
##  Mean: 6.93 
##  Number of Observations (N): 19481 
##  Standard Deviation: 2.296118

cat("Parents:", "\n Mean:",round(parents_mean,2), "\n Number of Observations (N):", parents_n, "\n Standard Deviation:", parents_sd, "\n\n")

## Parents: 
##  Mean: 7.12 
##  Number of Observations (N): 50325 
##  Standard Deviation: 2.22387

cat("Standard Error:", standard_error, "\nDegrees of Freedom:", degfreedom, "\n\n")

## Standard Error: 0.01920687 
## Degrees of Freedom: 19480

cat("P-value:", pvalue_manual, "\nT-statistic:", tstat_manual, "\n95% Confidence Interval is between", round(interval_low, 3), "and", round(interval_high, 3), "\n\nThe p-value is", round(pvalue_manual,3), "so we", result, "the null hypothesis")

## P-value: 1.014425e-23 
## T-statistic: 10.05336 
## 95% Confidence Interval is between 0.174 and 0.212 
## 
## The p-value is 0 so we reject the null hypothesis

Results of manually performed t-test

cat("P-value:", pvalue_manual, "\nT-statistic:", tstat_manual, "\n95% Confidence Interval is between", round(interval_low, 3), "and", round(interval_high, 3), "\nThe p-value is", round(pvalue_manual,3), "so we", result, "the null hypothesis")

## P-value: 1.014425e-23 
## T-statistic: 10.05336 
## 95% Confidence Interval is between 0.174 and 0.212 
## The p-value is 0 so we reject the null hypothesis

In addition to the manual t-test and confidence interval calculation above, the t.test function can be used to do the same thing. I will perform a test via this method as well to confirm the findings are consistent (hopefully indicating that my manual calculations were correct).

# Alternatively we can just use the built in t.test function
autotest<- t.test(parents_df$life_satisfaction, nonparents_df$life_satisfaction)
autotest

## 
##  Welch Two Sample t-test
## 
## data:  parents_df$life_satisfaction and nonparents_df$life_satisfaction
## t = 10.196, df = 36471, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1559755 0.2302117
## sample estimates:
## mean of x mean of y 
##  7.120099  6.927006

Searching For (Different) Truth

Additional analysis

I performed some more analysis to see if perhaps there was a difference in regions, which there seems to be.

The data seems to indicate that satisfaction is quite a bit better after 13 children.

# Average life satisfaction by # of children and number of rows
avg_satisfaction_parents <-
  parents_df %>% 
  group_by(num_children) %>% 
  summarize("num_rows" = n(), "avg_life_satisfaction" = mean(life_satisfaction))

# Average levels of life satisfaction by region (descending) and number of rows
avg_satisfaction_parents_region <- parents_df %>% 
  group_by(region) %>% 
  summarise(avg_life_satisfaction = mean(life_satisfaction))
  
plot_avg <- ggplot(data=avg_satisfaction_parents, aes(x=num_children, y=avg_life_satisfaction)) +
    geom_bar(stat="identity") +
    xlab("Number of Children") + ylab("Avg. Life Satisfaction")
plot_region <- ggplot(data=avg_satisfaction_parents_region, aes(x=region, y=avg_life_satisfaction)) +
    geom_bar(stat="identity") + coord_flip() +
    xlab("Region") + ylab("Avg. Life Satisfaction")
plot_grid(plot_avg,
          plot_region,
          label_x = 0.2,
          ncol = 2)

With the new insight from above in mind we can perform another t-test focused on comparing the group of parents with 14 or more children to non-parents.

parents_happy <- parents_df %>% filter(num_children >= 14)

avg_satisfaction_happyparents <-
  parents_happy %>% 
  group_by(num_children) %>% 
  summarize("num_rows" = n(), "avg_life_satisfaction" = mean(life_satisfaction))

## `summarise()` ungrouping output (override with `.groups` argument)

# Check normality assumption again, though technically satisfied with 31 observations.  Not very normal, though no outliers.
plothappyhist <- ggplot(data=parents_happy, aes(life_satisfaction)) + 
  geom_histogram(binwidth = 1) + 
  ggtitle("Life Satisfaction: 14 or more children") +
  xlab("Life Satisfaction") + ylab("Number of Observations")

plotchildrencount <- ggplot(data=avg_satisfaction_happyparents, aes(x=num_children, y=num_rows)) +
    geom_bar(stat="identity") +
    ggtitle("# Observations") +
    xlab("Number of Children") + ylab("Number of Observations")

plot_grid(plothappyhist,
          plotchildrencount,
          label_x = 0.2,
          ncol = 2)

Another t-test

Another t-test (this time using the t-test function rather than manually) was performed against the filtered data set. This time the results were different, with a p-value of greater than 0.5 and 0 falling within the confidence interval - thus the null hypothesis cannot be rejected. In other words, there does not seem to be a statistically significant difference between the levels of life satisfaction for parents with 14 or more children and non-parents.

# Run another t-test, this time focused using data for parents with 14 or more children
t.test(parents_happy$life_satisfaction, nonparents_df$life_satisfaction)

## 
##  Welch Two Sample t-test
## 
## data:  parents_happy$life_satisfaction and nonparents_df$life_satisfaction
## t = 1.8148, df = 30.089, p-value = 0.07954
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09395254  1.59477964
## sample estimates:
## mean of x mean of y 
##  7.677419  6.927006

Smoke and Mirrors

But what if we changed the approach a bit and use a one tailed t-test? Surprise! Based on this test, with a p value of under .05 we can reject say that parents with 14 or more children are on average more satisfied with their lives than non-parents.

# One tailed test, the alternative now being that parents w/ 14 or more children have a greater 
# level of life satisfaction than non-parents
t.test(parents_happy$life_satisfaction, nonparents_df$life_satisfaction, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  parents_happy$life_satisfaction and nonparents_df$life_satisfaction
## t = 1.8148, df = 30.089, p-value = 0.03977
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.04866957        Inf
## sample estimates:
## mean of x mean of y 
##  7.677419  6.927006

Conclusion

In conclusion, after performing a t-test the null hypothesis can be rejected in favor of the alternative hypothesis, which is that there is a difference in the levels of life satisfaction between non-parents and parents. Encouragingly for parents to be it seems the difference is positive, with a 95% confidence interval of approximately 0.2 of a point on a 1 to 10 scale.

Although the mean level of life satisfaction of parents with 14 or more children is actually higher than the whole group, interestingly the null hypothesis holds true (no statistically significant difference). Although there are 31 observations, I think this may be a flaw in the normality assumption. If the same test is done but with a one tailed, ‘greater than’ test, the p-value decreases to less than 0.5 and thus a different result is reached.

This project has been an interesting experiment with data and has helped to illustrate the pitfalls of trying to use t-tests with ordinal variables as well as why one-tailed tests can sometimes be controversial. When I did not like the results of my experiment I simply needed to change the rules of the game. It would be interesting, and perhaps would have been a better choice all in all, to try ANOVA as well.

Data 606 Final Project

Cameron Smith

11/28/2020