Statistical inference with the GSS data

Setup

This analysis was written in R, with RStudio as the IDE. Installing both is advised before proceeding. The R version used in the latest knit of this file is…

R.version.string

## [1] "R version 4.0.2 (2020-06-22)"

…and the RStudio version is 1.3.1073, which can be found under Help -> About RStudio.

I do not know whether different versions of R cause failures or produce different results in the following analysis. The versions are recorded here for documentation.

The next step is optional but recommended for project organization; skip it if you prefer.

It creates a new RStudio project (.Rproj) with a clean working directory in a dedicated folder, using the following instructions.

In RStudio, click File -> New Project... -> New Directory.
Type any project name that you desire. In my case, the project name is GSS Analysis.

The result should be an empty RStudio project whose working directory is automatically set to the project folder. This can be checked with getwd(), whose path should end in /GSS Analysis.

Another option is to clear the current working environment completely, which leaves a blank workspace.

rm(list=ls())                       # Removes all objects in the Global Environment
if (!is.null(dev.list())) dev.off() # Closes all active plots in the Plots tab
cat("\014")                         # Creates a clean R console

Load packages

These are the only non-base packages used in this analysis. Install them first with install.packages() before proceeding.

library(ggplot2)
library(dplyr)
library(statsr)

Here are the corresponding versions used as of the latest knit, for documentation. It is recommended to use the latest versions of these packages, either through update.packages() or the “Update” button in the Packages tab. I do not know whether such version differences affect the results of the analysis.

packageVersion("ggplot2")

## [1] '3.3.2'

packageVersion("dplyr")

## [1] '1.0.2'

packageVersion("statsr")

## [1] '0.2.0'

Load data

The General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society since 1972. The GSS aims to show, explain, and interpret trends using data gathered from standardized surveys. These indicators include the attitudes, behaviors, and attributes of the respondents (General Social Survey, 2012).

The dataset can be found through this website. However, the specific portion of the dataset used for this analysis can only be downloaded through Coursera’s peer-graded assignment. For documentation, the analysis retrieves the data with download.file(), which saves the file to the current working directory. The result is the same as a manual download, but it is convenient to fetch the file without leaving RStudio. If you prefer not to use this method, or the code fails for some reason, download the file manually as instructed, move it to the working directory, and skip the code below.

The URL is stored in an object named gss_url, which is passed to download.file(). You might have to run RStudio as an administrator for the code to work, since permission-related errors may be raised (if it still doesn’t work, download the file manually and move it to the working directory).

# The link downloads an .Rdata file that contains the portion of the GSS dataset.
# This is the link copied from Coursera's peer-graded assignment.
# The actual link is a long URL, varies by user, and changes periodically.

gss_url <- "https://d3c33hcgiwev3.cloudfront.net/_5db435f06000e694f6050a2d43fc7be3_gss.Rdata?..."

download.file(gss_url, destfile = "gss.Rdata", mode = "wb")
download_date <- paste(format(Sys.time(), "%b %d %Y %X"), "-", Sys.timezone())
print(download_date)
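Because the link expires and the download can fail, it may help to skip the download when the file is already present and to catch errors instead of stopping the knit. The following is a minimal sketch of that idea, not part of the original analysis; the helper name download_if_missing and its return values are hypothetical.

```r
# Hypothetical helper: download a file only when it is not already present.
# Returns (invisibly) "exists", "downloaded", or "failed".
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible("exists"))
  }
  # download.file() returns 0 on success; an error is caught and reported
  # so that the manual-download fallback can still be used.
  status <- tryCatch(
    download.file(url, destfile = destfile, mode = "wb"),
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      1L
    }
  )
  invisible(if (identical(as.integer(status), 0L)) "downloaded" else "failed")
}

# Intended usage with the objects defined above (left as a comment here
# because the real URL is user-specific):
# download_if_missing(gss_url, "gss.Rdata")
```

With this approach, re-knitting the document does not download the file again once gss.Rdata is already in the working directory.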

To make sure that the file is in the current working directory, list the directory's contents.

list.files()

## [1] "GSS Analysis.Rproj"           "gss.html"                    
## [3] "gss.Rdata"                    "stat_inf_project.html"       
## [5] "stat_inf_project.Rmd"         "stat_inf_project_rubric.html"

Your listing may differ from the one above; what matters is that it contains gss.Rdata, which means that the data is in the current working directory.

Afterwards, load the dataset the same way as you would manually, i.e. with load(). This loads a data frame named gss.

load("gss.Rdata")
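Since load() restores objects under whatever names they were saved with, it can be useful to confirm which objects a file contains. The sketch below illustrates this with a small made-up file; the object toy and the temporary file are assumptions for the example and are not GSS data.

```r
# Made-up data saved to a temporary .Rdata file, for illustration only.
toy <- data.frame(age = c(54L, 36L),
                  richwork = c("Stop Working", "Continue Working"))
toy_file <- tempfile(fileext = ".Rdata")
save(toy, file = toy_file)

# load() returns the names of the objects it restored.
# Loading into a fresh environment avoids overwriting existing objects.
loaded_env <- new.env()
loaded_names <- load(toy_file, envir = loaded_env)
print(loaded_names)

# Retrieve the restored object by name and confirm that it is a data frame.
restored <- get(loaded_names[1], envir = loaded_env)
print(is.data.frame(restored))
```

Applied to gss.Rdata, the same pattern would be expected to report an object named gss, as described above.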

Part 1: Data

The provided data is a portion of the cumulative GSS file (1972-2012). It contains 57061 records with 114 indicators, e.g. age, sex, social status, etc. Only part of the file was taken, for the purpose of simplification. Furthermore, the portion was obtained through random sampling, so results based on it can be generalized to the American population.

Although the data was taken from the GSS file, the Coursera extract differs from the actual cumulative file: entries that were entirely NA have been removed, and a variable type has been assigned to each column for ease of analysis. In addition, the data provided is treated as a sample of the entire population even though it is a sample of a sample.

Note also that bias and confounding variables may influence the data. The GSS is an observational study, so associations observed between variables do not establish causal relationships, and no result in this analysis can be read as evidence of causality.

Part 2: Research question

Is there sufficient evidence of a difference between the average ages of people who think they would stop working if they had enough money to live on for the rest of their lives and people who think otherwise?

This research question explores the life perspectives of Americans and tests whether age is associated with that perspective. Any notable results could motivate further investigation in separate research.

Applications of these results vary across fields, including philosophy, psychology, data science, and economics. This analysis does not discuss the specific implications for those fields or their possible areas of research.

Part 3: Exploratory data analysis

Before making a statistical inference, we first explore the relevant data. The following code subsets the gss data frame so that no row has NA in the age or richwork column, and stores the subset in the variable working_data.

working_data <- gss %>% filter(!is.na(richwork), !is.na(age)) %>% select(age, richwork)
print(head(working_data))

##   age         richwork
## 1  54     Stop Working
## 2  36 Continue Working
## 3  32 Continue Working
## 4  41     Stop Working
## 5  26 Continue Working
## 6  46 Continue Working

sapply(working_data, FUN = class)

##       age  richwork 
## "integer"  "factor"

Each row holds one respondent's age and response to the question. Overall, 21887 respondents have data for both age and richwork. For more details, visit CODEBOOK.

The column age is an integer column, while richwork is a categorical column with 2 levels: Stop Working and Continue Working. These levels are used as the groups in the following analysis.
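The filtering step and the resulting group sizes can be sanity-checked on a small made-up data frame. The sketch below uses base R instead of dplyr so that it runs without extra packages; the rows are invented for illustration and are not GSS records.

```r
# Made-up rows for illustration only; these are not GSS records.
toy <- data.frame(
  age = c(54L, NA, 32L, 41L, 26L),
  richwork = factor(
    c("Stop Working", "Continue Working", NA, "Stop Working", "Continue Working"),
    levels = c("Continue Working", "Stop Working")
  )
)

# Base R counterpart of filter(!is.na(richwork), !is.na(age)):
# keep only rows where both columns are observed.
keep <- !is.na(toy$age) & !is.na(toy$richwork)
toy_complete <- toy[keep, c("age", "richwork")]

# Group sizes and column classes after filtering.
group_counts <- table(toy_complete$richwork)
print(group_counts)
print(sapply(toy_complete, class))
```

Running table() on working_data$richwork in the same way would show how many respondents fall into each of the two groups.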

To show summary statistics for the data, a summary of age is computed for each response group with the following code:

summary_stats <- tapply(working_data$age, working_data$richwork, FUN = summary)
cw_mean <- round(as.numeric(summary_stats$`Continue Working`[4]),2)
sw_mean <- round(as.numeric(summary_stats$`Stop Working`[4]),2)
print(summary_stats)

## $`Continue Working`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   29.00   37.00   39.17   48.00   89.00 
## 
## $`Stop Working`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   43.00   43.09   53.00   85.00

The results are split by the categorical variable richwork, with summary statistics computed for each group: the minimum, quartiles, median, mean, and maximum age.

To interpret these results, the average age of people who would continue working if they had enough money to live on is approximately 39.17 years. On the other hand, the average age of people who would stop working is approximately 43.09 years.

To illustrate, the following code shows the sample distribution of ages for each richwork response as a pair of histograms. The x-axis shows age, and the height of each bar shows the number of respondents in that age bin. Furthermore, the lines on the two distributions mark the mean of each group, as computed in the previous code.

The inference() function shows only the plots requested through its arguments, specifically verbose and show_eda_plot.

ggplot(data = working_data, aes(x = age)) +
    geom_histogram(fill = "#8FDEE1", binwidth = diff(range(working_data$age)) / 20) +
    facet_grid(richwork ~ .) +
    xlab("age") +
    ylab("richwork") +
    ggtitle("Sample Distribution")

One feature of this graph is the variation in counts across adjacent age bins: the counts rise and fall as age increases. Overall, however, each distribution is unimodal and roughly bell-shaped. The summary statistics above suggest a mild right skew in the Continue Working group, whose mean (39.17) exceeds its median (37). To check the overall shape, the following code draws the same histograms with a different number of bins.

ggplot(data = working_data, aes(x = age)) +
    geom_histogram(fill = "#8FDEE1", bins = 10) +
    facet_grid(richwork ~ .) +
    xlab("age") +
    ylab("richwork") +
    ggtitle("Sample Distribution with 10 Bins")

Based on this graph, the overall shape of the age distributions is clearer than in the previous graph. The peak frequency of the Continue Working group falls around ages 35-40, while that of the Stop Working group falls near age 50. The graph thus gives a visual cue that the average ages of the two groups differ.

Although there is visual evidence of a difference in average ages, it is only an observation about the sample data; the sample means are estimates of the population parameters. The next section examines whether a difference in the population means can be inferred from the sample and whether it is statistically significant.

Part 4: Inference

Condition Checking

Before performing inference calculations, conditions must be checked to ensure the reliability of the tests to be performed.

Independence - Every row corresponds to a unique person who answered the survey, and the sample is a random sample drawn without replacement. Since 21887 is less than 10% of the population, the observations are assumed to be independent, both within and between the two groups.
Normality - Each group is sufficiently large (\(n>30\)), so the sampling distribution of each group mean is assumed to be nearly normal even if the ages themselves are not exactly normal. Furthermore, the histograms in Part 3 show no extreme skew that would undermine this condition.
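The normality condition concerns the sampling distribution of the mean, not the raw data. The simulation below is a hedged illustration of that point using simulated exponential (right-skewed) values rather than GSS ages; the seed, sample size, and number of repetitions are arbitrary choices made for this sketch.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Draw 2000 samples of size 1000 from a right-skewed (exponential) population
# with mean 1 and standard deviation 1, and record each sample mean.
n <- 1000
sample_means <- replicate(2000, mean(rexp(n, rate = 1)))

# Although individual observations are skewed, the sample means cluster
# around 1 with a spread close to the theoretical value 1 / sqrt(n).
mean_of_means <- mean(sample_means)
sd_of_means <- sd(sample_means)
print(c(mean = mean_of_means, sd = sd_of_means, theory_sd = 1 / sqrt(n)))

# A histogram of sample_means would look approximately bell-shaped:
# hist(sample_means, breaks = 30)
```

The two groups in this analysis are larger than the simulated samples and their age distributions are far less skewed, so the same reasoning supports using the theoretical method here.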

With that said, the analysis proceeds to the calculation of the confidence interval (CI).

Confidence Interval

The following code reports the point estimates of the two groups, their standard deviations, the number of respondents per group, and the confidence interval. It uses the inference() function from the statsr package.

# Confidence Interval
results_ci <- inference(data = working_data, y = age, x = richwork
                        ,type = "ci", statistic = "mean"
                        ,method = "theoretical"
                        ,conf_level = 0.95)

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Continue Working = 15341, y_bar_Continue Working = 39.1673, s_Continue Working = 12.8921
## n_Stop Working = 6546, y_bar_Stop Working = 43.0924, s_Stop Working = 13.0654
## 95% CI (Continue Working - Stop Working): (-4.3017 , -3.5485)

The vertical lines mark the mean of each group's distribution. The mean line for the Continue Working category falls at a younger age than the one for the Stop Working category. This agrees with the means calculated with tapply() earlier, where the observed mean for Continue Working is lower than for Stop Working.

At the 95% confidence level, the difference between the average ages of the two groups (Continue Working minus Stop Working) lies between -4.3017 and -3.5485 years. The interval does not include 0, so a true difference of zero is not a plausible value at this confidence level; the Continue Working group is estimated to be about 3.5 to 4.3 years younger on average.
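As a cross-check, the interval can be recomputed by hand from the summary statistics printed by inference(). This is a sketch rather than the internal implementation of statsr; in particular, the conservative degrees of freedom (the smaller group size minus 1) are an assumption made here.

```r
# Summary statistics copied from the inference() output above.
n_cw <- 15341; mean_cw <- 39.1673; sd_cw <- 12.8921  # Continue Working
n_sw <- 6546;  mean_sw <- 43.0924; sd_sw <- 13.0654  # Stop Working

# Point estimate and standard error of the difference in means.
diff_means <- mean_cw - mean_sw
se_diff <- sqrt(sd_cw^2 / n_cw + sd_sw^2 / n_sw)

# Assumption: conservative degrees of freedom, min(n1, n2) - 1.
df_cons <- min(n_cw, n_sw) - 1
t_star <- qt(0.975, df = df_cons)  # critical value for a 95% interval

# Confidence interval: point estimate plus or minus the margin of error.
ci <- diff_means + c(-1, 1) * t_star * se_diff
print(round(ci, 4))
```

Up to rounding of the inputs, the result should agree with the interval reported by inference().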

Hypothesis Testing

To further check the results of the CI, a hypothesis test is performed. The null and alternative hypotheses for this analysis are:

\(H_0\): There is no difference between the population average ages of people who think they would stop working if they had enough money to live on for the rest of their lives and people who think otherwise.

\(H_A\): There is a difference between the population average ages of people who think they would stop working if they had enough money to live on for the rest of their lives and people who think otherwise.

We use the theoretical method to determine whether the difference between the group means is statistically significant. The conditions for this method (independence and normality, checked above) are met, so the method is appropriate.

Assuming that the null hypothesis is true, we calculate the probability of observing a difference between the two group means at least as extreme as the one in the sample. If that probability is less than 5% (\(\alpha=0.05\)), counting both tails (two-sided test), the null hypothesis is rejected in favor of the alternative hypothesis. In short, if the p-value is less than 0.05, we reject the null hypothesis.

The inference() function from the statsr package computes the test statistic and shows it against the null distribution.

Note that the code uses a t-test; however, the degrees of freedom (df) are large enough that the result is practically the same as a z-test.

# Hypothesis Test
results_ht <- inference(data = working_data, y = age, x = richwork
                        ,type = "ht", statistic = "mean", method = "theoretical"
                        ,sig_level = 0.05, null = 0, alternative = "twosided"
                        ,show_eda_plot = FALSE)

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Continue Working = 15341, y_bar_Continue Working = 39.1673, s_Continue Working = 12.8921
## n_Stop Working = 6546, y_bar_Stop Working = 43.0924, s_Stop Working = 13.0654
## H0: mu_Continue Working =  mu_Stop Working
## HA: mu_Continue Working != mu_Stop Working
## t = -20.43, df = 6545
## p_value = < 0.0001

print(results_ht)

## $SE
## [1] 0.1921241
## 
## $df
## [1] 6545
## 
## $t
## [1] -20.42999
## 
## $p_value
## [1] 5.531279e-90

The null distribution of the difference between the two groups can be seen in the graph shown above; the red line corresponds to the computed test statistic, which is -20.4299885. If the null hypothesis were true, the probability of observing a difference at least as extreme as the sample difference of approximately -4 years would be practically 0 (this is the p-value).
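The test statistic and p-value can also be recomputed from the summary statistics as a cross-check. The sketch below takes df = 6545 from the inference() output and compares the t-based p-value with a z-based one; it is an illustration under those inputs, not the code that statsr runs internally.

```r
# Summary statistics copied from the inference() output above.
n_cw <- 15341; mean_cw <- 39.1673; sd_cw <- 12.8921  # Continue Working
n_sw <- 6546;  mean_sw <- 43.0924; sd_sw <- 13.0654  # Stop Working

# Standard error of the difference and the test statistic under H0 (null = 0).
se_diff <- sqrt(sd_cw^2 / n_cw + sd_sw^2 / n_sw)
t_stat <- (mean_cw - mean_sw - 0) / se_diff

# Degrees of freedom as reported by inference().
df_used <- 6545

# Two-sided p-values from the t distribution and from the standard normal.
p_t <- 2 * pt(-abs(t_stat), df = df_used)
p_z <- 2 * pnorm(-abs(t_stat))
print(c(t = t_stat, p_t = p_t, p_z = p_z))
```

Both p-values are vanishingly small, which illustrates why the choice between the t and z distributions makes no practical difference at this sample size.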

To relate this to the CI: the interval does not contain 0, which indicates that the null hypothesis would be rejected. This agrees with the hypothesis test, where the probability of observing data this extreme is close to 0% given that \(H_0\) is true. Therefore, the two results are consistent with each other.

Conclusion and Recommendation

Given the calculated values, we reject \(H_0\) in favor of \(H_A\). In conclusion, there is sufficient evidence of a statistically significant difference between the average ages of people who think they would stop working if they were rich enough to live on their wealth and people who think they would continue working despite being wealthy. Future researchers are encouraged to explore the relationship between these variables in depth.

Relatedly, the results do not show that age causes a person's perspective on working despite being wealthy, since this is only an observational study. Confounding variables and bias may be present, so further research is needed to examine possible causality.