Setup

Load packages

library(car)
library(cowplot)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(ggrepel)
library(ggthemes)
library(inference)
library(infer)
library(lattice)
library(latticeExtra)
library(moments)
library(scales)
library(statsr)
library(tidyverse)
load("~/Desktop/R Programming/Statistics_Coursera/Inferential_Stats/gss.Rdata")

Part 1: Data

The General Social Survey (GSS) is a sociological survey designed to collect demographic characteristics, attitudes, behaviours, and attributes of the adult population in the United States. The GSS aims to monitor societal trends and examine the structural change of American society.

The dataset contains 57,061 English- and Spanish-speaking respondents aged 18 and older, surveyed across the United States between 1972 and 2012.

Because respondents were randomly selected to take part in the survey, the statistical results can be generalised to the adult population living in the United States. From 1972 to 1974, the survey used a modified probability design. A full-probability sample design was used afterwards to produce a high-quality, representative sample of the adult population in the United States.

Causal inference cannot be drawn from the statistical results as respondents were not randomly assigned to treatment and control groups.


Part 2: Research question

Research question 1

According to the Pew Research Center’s survey (1968 - 2015), household income by race shows persistent income inequality, especially between the black and white populations. I would like to investigate whether average household income differs by race among GSS respondents.

Research question 2

The research paper titled “Financial Satisfaction in Old Age: A Satisfaction Paradox or a Result of Accumulated Wealth?” (2008) found that older adults are more financially satisfied than younger ones. I would like to investigate whether age and level of financial satisfaction are related among GSS respondents.


Exploratory data analysis (RQ - 1)

Research question 1

Household income ranges from 383 USD to 180,386 USD per year. The sample contains 41,824 white respondents, 6,956 black respondents, and 2,452 respondents of other races.

cleandat <- gss %>%
  filter(!is.na(coninc), !is.na(race)) %>%
  dplyr::select(coninc, race) 

summary(cleandat)
##      coninc          race      
##  Min.   :   383   White:41824  
##  1st Qu.: 18445   Black: 6956  
##  Median : 35602   Other: 2452  
##  Mean   : 44503                
##  3rd Qu.: 59542                
##  Max.   :180386

The summary statistics show that white households had the highest annual income, with a mean of 47,007 USD and a median of 38,414 USD. Black households had the lowest, with a mean of 30,185 USD and a median of 21,959 USD. Households headed by other races had a mean income of 42,415 USD and a median of 30,861 USD.

with(cleandat, tapply(coninc, race, summary))
## $White
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     383   20129   38414   47007   62946  180386 
## 
## $Black
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     383    9953   21959   30185   41523  180386 
## 
## $Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     383   15572   30861   42415   56059  180386

Figure 1 and Figure 2 show the mean and median income by race, respectively. The average income of other races has been catching up with that of white households: in 2012 it was 52,023 USD for whites and 45,399 USD for other races. At the median, households of other races earned 34,470 USD in 2012, on par with white households.

In contrast, the income of black households has persistently lagged behind that of the other groups, and the gap is most pronounced between white and black households. In 2012, the average income was 32,605 USD for black households, compared with 52,023 USD for white households. The median income was 21,065 USD for black households and 34,470 USD for white households.

plot_dat <- gss %>%
  filter(!is.na(year), !is.na(coninc), !is.na(race)) %>%
  dplyr::select(year ,coninc, race) %>%
  group_by(year, race) %>%
  summarise(mean = mean(coninc),
            median = median(coninc)) 

plot1 <- ggplot(plot_dat, aes(x = year, y = mean)) + 
  geom_line(aes(color = race)) +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank())

data_ends_mean <- plot_dat %>%
  filter(year == 2012) 

plot_out1 <- plot1 +
  geom_text_repel(aes(label = round(mean)), data = data_ends_mean)

plot2 <- ggplot(plot_dat, aes(x = year, y = median)) + 
  geom_line(aes(color = race)) +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank())

data_ends_median <- plot_dat %>%
  filter(year == 2012) 

plot_out2 <- plot2 +
  geom_text_repel(aes(label = round(median)), data = data_ends_median)

plot_grid(plot_out1, plot_out2, labels = c("Fig.1 - Mean Income", "Fig.2 - Median Income"))


Inference (RQ - 1)

State the hypotheses

\[ H_0: \mu_1 = \mu_2 = \mu_3 \]
\[ H_a: \text{the mean income of at least one race differs from that of the others} \]

Check the conditions

within groups: respondents were randomly sampled, so independence can be assumed. The analysis uses the 51,232 respondents (of 57,061) with non-missing income and race, which is far less than 10% of the US adult population.

between groups: the data are not paired, so the groups can be assumed to be independent of each other.

Normality can be checked with a histogram. The original income data in Figure 3 are right-skewed and need to be transformed before performing ANOVA. A cube-root transformation (∛x) is used to make the distribution more symmetric, as shown in Figure 4.

df <- gss %>%
  filter(!is.na(race), !is.na(coninc)) %>%
  dplyr::select(race, coninc) 

df$coninc <- df$coninc^(1/3)

par(mfrow=c(1,2))
untransformed <- hist(gss$coninc, main = "Figure 3 - Untransformed Income", xlab = "") 
transformed <- hist(df$coninc, main = "Figure 4 - Transformed Income", xlab = "")

The skewness function is used to compute the skewness of the transformed income. The value is \(0.0571437\), which is close to zero: the distribution is approximately symmetric, so the normality condition is treated as reasonably met.

skewness(df$coninc)
## [1] 0.0571437
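
To illustrate why a cube root reduces right skew, here is a minimal sketch on simulated data, not on the GSS. The helper `skew()` and the simulated `income_sim` are hypothetical names introduced only for this example; the helper uses the usual moment-based definition (third central moment divided by the second central moment raised to the power 3/2), which is assumed to be in line with what the `skewness` call above computes.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with moments divided by n.
# Hypothetical helper for illustration only.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Simulated right-skewed "incomes" (assumption: log-normal, not the GSS data)
income_sim <- rlnorm(10000, meanlog = 10, sdlog = 1)

skew_raw  <- skew(income_sim)          # skewness of the raw values
skew_root <- skew(income_sim^(1/3))    # skewness after the cube root

c(raw = skew_raw, cube_root = skew_root)
```

On data like these, the cube root pulls in the long right tail, so the second value should be noticeably closer to zero than the first.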

Figure 5 shows that the variability is roughly consistent across the three races.

boxplot(df$coninc ~ df$race, main = "Figure 5 - Income by Race", xlab = "Race", ylab = "Income")  

Method to be used and why and how

The three groups differ greatly in size, so comparing the raw means and medians alone is not enough to judge whether the differences are real. To study whether the mean income (a quantitative variable) is the same across races (a categorical variable), we need to perform ANOVA. In this context, the test explores whether or not income varies by race.

In addition, a 95% confidence interval is used to estimate how much higher the average income of white households is than that of black households and of households of other races.
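
As a rough sketch of what ANOVA computes, the block below builds the F statistic by hand on a small simulated three-group dataset and cross-checks it against `aov()`. The data frame `sim` and all variable names are hypothetical and unrelated to the GSS; the point is only to show how the between-group and within-group variability are compared.

```r
set.seed(42)
# Hypothetical three-group data (not the GSS)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(50, 30, 20))),
  y     = c(rnorm(50, mean = 10), rnorm(30, mean = 11), rnorm(20, mean = 12))
)

n_total     <- nrow(sim)
k           <- nlevels(sim$group)
grand_mean  <- mean(sim$y)
group_means <- tapply(sim$y, sim$group, mean)
group_sizes <- tapply(sim$y, sim$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((sim$y - group_means[as.character(sim$group)])^2)

# F = mean square between groups / mean square within groups
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_by_hand <- pf(f_by_hand, df1 = k - 1, df2 = n_total - k, lower.tail = FALSE)

# Cross-check against aov()
f_aov <- summary(aov(y ~ group, data = sim))[[1]][["F value"]][1]
all.equal(f_by_hand, f_aov)
```

A large F means the group means are spread out far more than the within-group noise would suggest, which is what leads to a small p-value.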

Perform inference & Interpret results

The ANOVA p-value is below \(2 \times 10^{-16}\), which is smaller than 0.05. We thus reject the null hypothesis that all means are equal and conclude that at least one race differs from the other two in terms of (cube-root transformed) income level.

res_aov <- aov(coninc ~ race, data = df)

summary(res_aov)
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## race            2  171524   85762   983.6 <2e-16 ***
## Residuals   51229 4466961      87                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
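
As a quick check on the table above, the F value can be recovered from the sums of squares and degrees of freedom, with \(k = 3\) groups and \(n = 51{,}232\) respondents (so \(n - k = 51{,}229\)). The symbols \(SSG\) and \(SSE\) are labels introduced here for the "race" and "Residuals" sums of squares; the small difference from the printed mean squares is due to rounding in the output.

\[ F = \frac{SSG/(k-1)}{SSE/(n-k)} = \frac{171524/2}{4466961/51229} \approx \frac{85762}{87.2} \approx 983.6 \]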

To see which races differ from the others, we compare the races pairwise using Tukey’s test for post-hoc analysis, as follows:

  1. Black - White
  2. Other - White
  3. Other - Black

The post-hoc analysis shows that all three adjusted p-values are smaller than 0.05. We can thus reject the null hypothesis for each pair and conclude that all three races differ significantly in terms of income. The differences are expressed on the cube-root scale on which the ANOVA was fitted. Figure 6 illustrates the statistical results of the ANOVA.

TukeyHSD(res_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = coninc ~ race, data = df)
## 
## $race
##                  diff       lwr       upr p adj
## Black-White -5.327172 -5.610558 -5.043786     0
## Other-White -1.745903 -2.200641 -1.291165     0
## Other-Black  3.581269  3.067274  4.095264     0

my_comparisons <- list(c("Black", "White"), c("Other", "White"), c("Other", "Black"))

ggboxplot(df, x = "race", y = "coninc",
          color = "race") +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.4)) +
  labs(title = "Figure 6 - Income by Race") +
  stat_compare_means(method = "anova", label.y = 50)  

We are 95% confident that the average annual income of white households is 16,075.82 USD to 17,567.62 USD higher than that of black households. Compared with households of other races, the average annual income of white households is 3,042.47 USD to 6,140.17 USD higher.

w_b_income <- gss %>%
  filter(race %in% c("White", "Black"), !is.na(coninc)) %>%
  dplyr::select(race, coninc) 

compare_income <- droplevels(w_b_income)

inference(y = coninc, x = race, data = compare_income, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_White = 41824, y_bar_White = 47006.7433, s_White = 36405.4758
## n_Black = 6956, y_bar_Black = 30185.0203, s_Black = 28047.6414
## 95% CI (White - Black): (16075.8243 , 17567.6217)
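
The White - Black interval can be approximated from the printed summary statistics alone. The sketch below is not the internal code of `inference()`; it assumes the standard two-sample formula with unpooled variances and the conservative degrees of freedom \(\min(n_1, n_2) - 1\), and the variable names are introduced only for this example.

```r
# Summary statistics copied from the inference() output above
n_w <- 41824; mean_w <- 47006.7433; sd_w <- 36405.4758
n_b <- 6956;  mean_b <- 30185.0203; sd_b <- 28047.6414

# Point estimate and standard error of the difference in means
diff_means <- mean_w - mean_b
se_diff    <- sqrt(sd_w^2 / n_w + sd_b^2 / n_b)

# Assumption: conservative degrees of freedom, min(n1, n2) - 1
t_star <- qt(0.975, df = min(n_w, n_b) - 1)

# 95% confidence interval: estimate +/- critical value * standard error
ci <- diff_means + c(-1, 1) * t_star * se_diff
round(ci, 2)
```

Under these assumptions the result should land very close to the interval reported above; any small gap would come from rounding in the printed statistics or a different choice of degrees of freedom.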

w_o_income <- gss %>%
  filter(race %in% c("White", "Other"), !is.na(coninc)) %>%
  dplyr::select(race, coninc) 

compare_income <- droplevels(w_o_income)

inference(y = coninc, x = race, data = compare_income, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_White = 41824, y_bar_White = 47006.7433, s_White = 36405.4758
## n_Other = 2452, y_bar_Other = 42415.4274, s_Other = 38105.4353
## 95% CI (White - Other): (3042.4667 , 6140.165)


Exploratory data analysis (RQ - 2)

The selected variables are “Age” and “Financial Satisfaction”. The data cover 52,287 respondents aged 18 - 89 who were asked whether they were satisfied, more or less satisfied, or not at all satisfied with their financial situation. Figure 7 shows the distribution of their answers.

The largest share of participants (44.2%) are more or less satisfied with their financial situation. Nearly a third (29.3%) are satisfied with their finances, while 26.6% are not at all satisfied.

df_summary <- gss %>%
  filter(!is.na(age), !is.na(satfin)) %>%
  dplyr::select(age, satfin) 

summary(df_summary)
##       age                   satfin     
##  Min.   :18.00   Satisfied     :15291  
##  1st Qu.:31.00   More Or Less  :23113  
##  Median :43.00   Not At All Sat:13883  
##  Mean   :45.62                         
##  3rd Qu.:59.00                         
##  Max.   :89.00

satfin_pie <- gss %>%
  filter(!is.na(satfin)) %>%
  dplyr::select(satfin) %>%
  count(satfin) %>%
  mutate(prop = round(n * 100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5 * prop)

ggplot(satfin_pie, aes(x = "", y = prop, fill = satfin)) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(aes(label = paste0(prop, "%")), position = position_stack(vjust = 0.5), color = "white") +
  ggtitle(label = "Figure 7 - Financial Satisfaction by Age") +
  coord_polar("y", start = 0) +
  theme_void() +
  theme(plot.background = element_rect(fill = "#F6F4E8", color = NA),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        panel.background = element_rect(fill = "#F6F4E8", color = NA),
        legend.title = element_blank(),
        legend.position = "top") +
  scale_fill_manual(values=c("#d8b365", "#bd0026", "#08519c"))

Inference (RQ - 2)

State the hypotheses

\[ H_0: \text{Age and level of financial satisfaction are independent} \]
\[ H_a: \text{Age and level of financial satisfaction are dependent} \]

Check the conditions

Independence: respondents were randomly sampled, so the sampled observations can be assumed to be independent.

Sample Size: Each particular scenario (i.e. cell) has at least 5 expected cases.
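
The sample-size condition can be checked directly from the expected counts. The sketch below uses a small hypothetical table (the counts and age bands are invented for illustration and are not GSS figures) to show how expected counts are derived and compared against the threshold of 5.

```r
# Hypothetical 3 x 3 table of counts (age band x satisfaction), not GSS data
counts <- matrix(
  c(30, 45, 25,
    40, 50, 20,
    35, 40, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(age_band = c("18-39", "40-59", "60+"),
                  satfin   = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# chisq.test() exposes the same expected counts in its result
all.equal(unname(expected), unname(chisq.test(counts)$expected))

# Sample-size condition: every expected count is at least 5
all(expected >= 5)
```

For the GSS data, the analogous check would be to inspect the `expected` component returned by `chisq.test()` on the age-by-satisfaction table.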

Method to be used and why and how

A chi-square test of independence is used because I would like to explore the relationship between two categorical variables; age is treated here as categorical, with one level per year of age. The test compares the observed frequencies with those expected under independence. In this context, I would like to investigate whether or not there is a significant relationship between age and level of financial satisfaction.

In addition, 95% confidence intervals will be calculated to estimate the mean age of US adults who are satisfied and of those who are dissatisfied with their financial situation.

Perform inference & Interpret results

The p-value is below \(2.2 \times 10^{-16}\), which is less than 0.05. We can reject the null hypothesis and conclude that there is a significant relationship between age and level of financial satisfaction.

tbl <- table(gss$age, gss$satfin) 
chisq.test(tbl)  
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 2110.7, df = 142, p-value < 2.2e-16
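
For reference, the statistic and its degrees of freedom follow the usual definitions, where \(O_{ij}\) and \(E_{ij}\) denote the observed and expected counts in each cell (notation introduced here). The 142 degrees of freedom are consistent with a table of 72 age levels (18 to 89, assuming every age occurs) by 3 satisfaction levels.

\[ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad df = (R - 1)(C - 1) = (72 - 1)(3 - 1) = 142 \]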

We are 95% confident that the average age of US adults who are dissatisfied with their finances is between 41.6 and 42.1 years, while the average age of those who are satisfied with their finances is between 49.8 and 50.4 years.

df2 <- gss %>%
  filter(satfin == "Not At All Sat", !is.na(age)) %>%
  dplyr::select(satfin, age)

inference(y = age, data = df2, statistic = "mean", type = "ci", 
          method = "theoretical", conf_level = 0.95) 
## Single numerical variable
## n = 13883, y-bar = 41.8539, s = 15.5352
## 95% CI: (41.5955 , 42.1124)

df3 <- gss %>%
  filter(satfin == "Satisfied", !is.na(age)) %>%
  dplyr::select(satfin, age)

inference(y = age, data = df3, statistic = "mean", type = "ci", 
          method = "theoretical", conf_level = 0.95) 
## Single numerical variable
## n = 15291, y-bar = 50.1057, s = 18.5818
## 95% CI: (49.8112 , 50.4003)
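
As a worked check, the interval for the dissatisfied group can be approximated from its printed summary statistics, assuming a critical value of \(t^{*} \approx 1.96\) (reasonable for a sample this large):

\[ \bar{y} \pm t^{*}\,\frac{s}{\sqrt{n}} = 41.8539 \pm 1.96 \times \frac{15.5352}{\sqrt{13883}} \approx 41.8539 \pm 0.2584 \]

This gives roughly (41.60, 42.11), in line with the reported interval. The same arithmetic with \(\bar{y} = 50.1057\), \(s = 18.5818\), and \(n = 15291\) gives approximately (49.81, 50.40) for the satisfied group.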


Conclusion

Research Question 1

Research Question 2