The General Social Survey (GSS) is a sociological survey designed to collect demographic characteristics, attitudes, behaviours, and attributes of the adult population in the United States. The GSS aims to monitor societal trends and examine the structural change of American society.
The dataset contains 57,061 English- and Spanish-speaking respondents aged 18 and older, surveyed across the United States between 1972 and 2012.
Because respondents were randomly selected to take part in the survey, the statistical results can be generalised to the adult population living in the United States. From 1972 to 1974, the survey used a modified probability design. A full-probability sample design was adopted afterwards to produce a high-quality, representative sample of the adult population in the United States.
Causal conclusions cannot be drawn from the statistical results: this is an observational study, and respondents were not randomly assigned to treatment and control groups.
Research question 1
According to Pew Research Center’s survey (1968 - 2015), household income by race shows persistent income inequality, especially between the black and white populations. I would like to investigate the following:
Research question 2
The research paper titled “Financial Satisfaction in Old Age: A Satisfaction Paradox or a Result of Accumulated Wealth?” (2008) found that older adults are more financially satisfied than younger ones. I would like to investigate the following:
Research question 1
Household income ranges from 383 USD to 180,386 USD per year. The sample contains 41,824 white respondents, 6,956 black respondents, and 2,452 respondents of other races.
library(dplyr)  # %>%, filter(), select(); the gss data frame is assumed to be loaded already
cleandat <- gss %>%
  filter(!is.na(coninc), !is.na(race)) %>%
  dplyr::select(coninc, race)
summary(cleandat)
##      coninc           race
##  Min.   :   383   White:41824
##  1st Qu.: 18445   Black: 6956
##  Median : 35602   Other: 2452
##  Mean   : 44503
##  3rd Qu.: 59542
##  Max.   :180386
The summary statistics show that white households had the highest annual income, with a mean of 47,007 USD and a median of 38,414 USD. Black households had the lowest, with a mean of 30,185 USD and a median of 21,959 USD. Households of other races fell in between, with a mean of 42,415 USD and a median of 30,861 USD.
## $White
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 383 20129 38414 47007 62946 180386
##
## $Black
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 383 9953 21959 30185 41523 180386
##
## $Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 383 15572 30861 42415 56059 180386
Figure 1 and Figure 2 show the mean and median income by race over time. The mean income of other races has been catching up with that of white households: in 2012 it stood at 45,399 USD, against 52,023 USD for whites. At the median, households of other races earned 34,470 USD in 2012, on par with white households.
In contrast, the income of black households has persistently lagged behind that of the other groups, and the gap is most pronounced between white and black households. In 2012, the mean income of black households was 32,605 USD, compared with 52,023 USD for white households. The median income of black households was 21,065 USD, compared with 34,470 USD for white households.
library(ggplot2)
library(ggthemes)  # theme_wsj(), scale_color_wsj()
library(ggrepel)   # geom_text_repel()
library(cowplot)   # plot_grid()

# Mean and median income for each year and race
plot_dat <- gss %>%
  filter(!is.na(year), !is.na(coninc), !is.na(race)) %>%
  dplyr::select(year, coninc, race) %>%
  group_by(year, race) %>%
  summarise(mean = mean(coninc),
            median = median(coninc))

# 2012 values, used to label the end of each line
data_ends <- plot_dat %>%
  filter(year == 2012)

plot_out1 <- ggplot(plot_dat, aes(x = year, y = mean)) +
  geom_line(aes(color = race)) +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank()) +
  geom_text_repel(aes(label = round(mean)), data = data_ends)

plot_out2 <- ggplot(plot_dat, aes(x = year, y = median)) +
  geom_line(aes(color = race)) +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank()) +
  geom_text_repel(aes(label = round(median)), data = data_ends)

plot_grid(plot_out1, plot_out2, labels = c("Fig.1 - Mean Income", "Fig.2 - Median Income"))
State the hypotheses
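\[ H_0: \mu_1 = \mu_2 = \mu_3 \]
\[ H_a: \text{the mean income of at least one race differs from that of the other two races} \]
Here \(\mu_1\), \(\mu_2\), and \(\mu_3\) denote the mean household incomes of the three race groups.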
Check the conditions
Independence
within groups: respondents were randomly sampled, so independence can be assumed. The sample size of 57,061 is well below 10% of the population.
between groups: the data are not paired, so the groups can be assumed to be independent of each other.
Normality
Normality can be assessed with a histogram. The original income data in Figure 3 are right-skewed and need to be transformed before performing ANOVA. A cube-root transformation (\(\sqrt[3]{x}\)) is applied to make the distribution approximately symmetric, as shown in Figure 4.
df <- gss %>%
  filter(!is.na(race), !is.na(coninc)) %>%
  dplyr::select(race, coninc)
df$coninc <- df$coninc^(1/3)  # cube-root transformation
par(mfrow = c(1, 2))          # draw the two histograms side by side
hist(gss$coninc, main = "Figure 3 - Untransformed Income", xlab = "")
hist(df$coninc, main = "Figure 4 - Transformed Income", xlab = "")
A skewness function is used to compute the skewness of the transformed income. The value is \(0.0571437\), which is close to zero: the distribution is approximately symmetric, so normality can reasonably be assumed.
## [1] 0.0571437
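The source does not show which skewness function was called. As a minimal sketch, assuming the usual moment-based definition (third central moment divided by the 1.5 power of the second), the effect of the cube-root transformation can be illustrated on simulated right-skewed data. The helper `skewness_moment` and the simulated vector `income_sim` are hypothetical stand-ins, not the GSS data.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(123)
# Hypothetical right-skewed "income" values (log-normal), NOT the GSS data
income_sim <- rlnorm(10000, meanlog = 10.4, sdlog = 0.9)

skew_raw <- skewness_moment(income_sim)
skew_cuberoot <- skewness_moment(income_sim^(1/3))

print(c(raw = skew_raw, cube_root = skew_cuberoot))
```

On data like these the transformed skewness is much closer to zero than the raw one, which is the pattern the report relies on.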
Variance
Figure 5 shows that the variability is roughly consistent across the three races.
Method to be used and why and how
The three groups differ greatly in size, so the descriptive means and medians alone do not establish whether the income differences are statistically meaningful. To test whether mean income (a quantitative variable) is the same across races (a categorical variable with three levels), we perform a one-way ANOVA.
In addition, 95% confidence intervals will be used to estimate how much higher the mean income of white households is than that of black households and of households of other races.
Perform inference & Interpret results
ANOVA
The ANOVA p-value is reported as smaller than \(2 \times 10^{-16}\), far below the 0.05 significance level. We therefore reject the null hypothesis that all means are equal and conclude that the mean income of at least one race differs from that of the other two.
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 171524 85762 983.6 <2e-16 ***
## Residuals 51229 4466961 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
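The F value in the table can be recomputed from the sums of squares and degrees of freedom it reports. The sketch below uses only numbers copied from the output above; the variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table above
ss_group <- 171524    # Sum Sq, race
df_group <- 2         # Df, race
ss_resid <- 4466961   # Sum Sq, residuals
df_resid <- 51229     # Df, residuals

ms_group <- ss_group / df_group   # mean square between groups
ms_resid <- ss_resid / df_resid   # mean square within groups
f_stat <- ms_group / ms_resid     # F statistic

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value < 2e-16)
```

The recomputed F statistic agrees with the tabulated 983.6 up to rounding of the printed sums of squares.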
To see which races differ from one another, we compare them pairwise using Tukey’s test for post-hoc analysis:
The post-hoc analysis shows that all three adjusted p-values are smaller than 0.05 (they are displayed as 0 after rounding). We therefore reject the null hypothesis of equal means for every pair and conclude that all three races differ significantly in income. Note that the differences in the output are on the cube-root scale of the transformed income, not in USD. Figure 6 illustrates the ANOVA result.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = coninc ~ race, data = df)
##
## $race
## diff lwr upr p adj
## Black-White -5.327172 -5.610558 -5.043786 0
## Other-White -1.745903 -2.200641 -1.291165 0
## Other-Black 3.581269 3.067274 4.095264 0
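The code that produced this output is not shown in the source. Based on the `Fit:` line, it was presumably a call to `TukeyHSD()` on an `aov` fit. The following is a minimal sketch of that workflow on simulated data; the data frame `sim` and its group sizes, means, and spread are hypothetical and chosen only to make the example runnable.

```r
set.seed(42)
# Hypothetical data on a cube-root-like scale, NOT the GSS data
sim <- data.frame(
  race = factor(rep(c("White", "Black", "Other"), times = c(400, 150, 80)),
                levels = c("White", "Black", "Other")),
  coninc = c(rnorm(400, mean = 34, sd = 9),
             rnorm(150, mean = 29, sd = 9),
             rnorm(80,  mean = 32, sd = 9))
)

fit <- aov(coninc ~ race, data = sim)                # one-way ANOVA
tukey <- TukeyHSD(fit, "race", conf.level = 0.95)    # all pairwise comparisons

print(summary(fit))
print(tukey)
```

Each row of `tukey$race` holds the estimated difference in group means, its family-wise 95% interval, and the adjusted p-value, in the same layout as the output above.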
library(ggpubr)  # ggboxplot(), stat_compare_means()
# Pairs of groups for optional pairwise annotations (not used in the plot below)
my_comparisons <- list(c("Black", "White"), c("Other", "White"), c("Other", "Black"))
ggboxplot(df, x = "race", y = "coninc",
          color = "race") +
  theme_wsj() +
  scale_color_wsj(palette = "colors6") +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.4)) +
  labs(title = "Figure 6 - Income by Race") +
  stat_compare_means(method = "anova", label.y = 50)
Confidence Interval
We are 95% confident that the mean annual income of white households is between 16,075.82 USD and 17,567.62 USD higher than that of black households. Compared with households of other races, the mean annual income of white households is between 3,042.47 USD and 6,140.17 USD higher.
library(statsr)  # inference()
w_b_income <- gss %>%
  filter(race %in% c("White", "Black"), !is.na(coninc)) %>%
  dplyr::select(race, coninc)
compare_income <- droplevels(w_b_income)  # drop the unused "Other" level
inference(y = coninc, x = race, data = compare_income, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_White = 41824, y_bar_White = 47006.7433, s_White = 36405.4758
## n_Black = 6956, y_bar_Black = 30185.0203, s_Black = 28047.6414
## 95% CI (White - Black): (16075.8243 , 17567.6217)
w_o_income <- gss %>%
filter(race %in% c("White", "Other"), !is.na(coninc)) %>%
dplyr::select(race, coninc)
compare_income <- droplevels(w_o_income)
inference(y = coninc, x = race, data = compare_income, statistic = "mean",
          type = "ci", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_White = 41824, y_bar_White = 47006.7433, s_White = 36405.4758
## n_Other = 2452, y_bar_Other = 42415.4274, s_Other = 38105.4353
## 95% CI (White - Other): (3042.4667 , 6140.165)
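Both intervals can be approximately reproduced by hand from the summary statistics printed above. The sketch below assumes the unpooled standard error and the conservative degrees of freedom \(\min(n_1, n_2) - 1\). This is an assumption about how the interval is computed, not something the source states. The function name `ci_diff_means` is hypothetical.

```r
# CI for a difference of two means from summary statistics.
# Assumes unpooled SE and conservative df = min(n1, n2) - 1.
ci_diff_means <- function(n1, m1, s1, n2, m2, s2, conf_level = 0.95) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  df <- min(n1, n2) - 1
  t_star <- qt(1 - (1 - conf_level) / 2, df)
  (m1 - m2) + c(-1, 1) * t_star * se
}

# Summary statistics copied from the inference() output above
ci_white_black <- ci_diff_means(41824, 47006.7433, 36405.4758,
                                6956, 30185.0203, 28047.6414)
ci_white_other <- ci_diff_means(41824, 47006.7433, 36405.4758,
                                2452, 42415.4274, 38105.4353)

print(round(ci_white_black, 2))
print(round(ci_white_other, 2))
```

Under these assumptions the results agree with the reported intervals to within rounding of the printed inputs.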
Research question 2
The selected variables are “Age” and “Financial Satisfaction”. The sample comprises 52,287 respondents aged 18 to 89, who were asked whether they were satisfied, more or less satisfied, or not at all satisfied with their financial situation. Figure 7 shows the distribution of their answers.
The largest group of respondents (44.2%) is more or less satisfied with their financial situation. Nearly a third (29%) are satisfied, while 26.6% are not at all satisfied.
df_summary <- gss %>%
  filter(!is.na(age), !is.na(satfin)) %>%
  dplyr::select(age, satfin)
summary(df_summary)
## age satfin
## Min. :18.00 Satisfied :15291
## 1st Qu.:31.00 More Or Less :23113
## Median :43.00 Not At All Sat:13883
## Mean :45.62
## 3rd Qu.:59.00
## Max. :89.00
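The shares of each satisfaction level can be checked directly from the counts in this summary output. This is a minimal sketch; the vector name `satfin_counts` is introduced here for illustration only.

```r
# Counts copied from the summary() output above
satfin_counts <- c("Satisfied" = 15291,
                   "More Or Less" = 23113,
                   "Not At All Sat" = 13883)

total <- sum(satfin_counts)
satfin_pct <- round(100 * satfin_counts / total, 1)  # percentage per level

print(total)
print(satfin_pct)
```

Small differences from the labels in Figure 7 may arise because the pie chart code filters on `satfin` only, whereas this summary also drops rows with a missing age.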
satfin_pie <- gss %>%
  filter(!is.na(satfin)) %>%
  dplyr::select(satfin) %>%
  count(satfin) %>%
  arrange(desc(satfin)) %>%  # match ggplot's stacking order so each label sits on its own slice
  mutate(prop = round(n * 100 / sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5 * prop)
ggplot(satfin_pie, aes(x = "", y = prop, fill = satfin)) +
  geom_bar(width = 1, stat = "identity") +
  # Labels are taken from the computed percentages rather than hard-coded
  geom_text(aes(y = lab.ypos, label = paste0(prop, "%")), color = "white") +
  ggtitle(label = "Figure 7 - Financial Satisfaction by Age") +
  coord_polar("y", start = 0) +
  theme_void() +
  theme(plot.background = element_rect(fill = "#F6F4E8", color = NA),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        panel.background = element_rect(fill = "#F6F4E8", color = NA),
        legend.title = element_blank(),
        legend.position = "top") +
  scale_fill_manual(values = c("#d8b365", "#bd0026", "#08519c"))
State the hypotheses
\[ H_0: \text{Age and level of financial satisfaction are independent} \]
\[ H_a: \text{Age and level of financial satisfaction are dependent} \]
Check the conditions
Independence: The survey respondents were randomly sampled, so the sampled observations can be assumed to be independent.
Sample Size: Each particular scenario (i.e. cell) has at least 5 expected cases.
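The sample-size condition can be verified from the expected counts that `chisq.test()` returns. The sketch below runs on simulated data, because the check itself is not shown in the source. The vectors `age_sim` and `satfin_sim` are hypothetical stand-ins for the GSS columns, and the sampling probabilities are only loosely based on the shares reported earlier.

```r
set.seed(1)
n <- 5000
# Hypothetical stand-ins for gss$age and gss$satfin, NOT the real data
age_sim <- factor(sample(18:89, n, replace = TRUE), levels = 18:89)
satfin_sim <- factor(
  sample(c("Satisfied", "More Or Less", "Not At All Sat"), n,
         replace = TRUE, prob = c(0.29, 0.44, 0.27)),
  levels = c("Satisfied", "More Or Less", "Not At All Sat")
)

tbl <- table(age_sim, satfin_sim)        # 72 x 3 contingency table
test <- suppressWarnings(chisq.test(tbl))

# Condition check: every expected cell count should be at least 5
print(min(test$expected))
print(all(test$expected >= 5))
```

Applied to the real table, the same two lines would show whether any age and satisfaction combination has too few expected cases for the chi-square approximation.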
Method to be used and why and how
A chi-square test of independence is used to explore the relationship between two categorical variables. Here each year of age is treated as its own category. The test compares the observed frequencies with those expected under independence, which lets us investigate whether there is a significant relationship between age and level of financial satisfaction.
In addition, 95% confidence intervals will be calculated for the mean age of US adults who are satisfied and of those who are dissatisfied with their financial situation.
Perform inference & Interpret results
Chi-square test of independence
The reported p-value is smaller than \(2.2 \times 10^{-16}\), far below 0.05. We therefore reject the null hypothesis and conclude that there is a statistically significant relationship between age and level of financial satisfaction.
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 2110.7, df = 142, p-value < 2.2e-16
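The degrees of freedom in this output are consistent with treating every year of age as its own category. Assuming all 72 ages from 18 to 89 occur in the data, together with the three satisfaction levels:
\[ df = (R - 1)(C - 1) = (72 - 1)(3 - 1) = 142 \]
where \(R\) is the number of age categories and \(C\) is the number of satisfaction levels.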
Confidence Interval
We are 95% confident that the mean age of US adults who are not at all satisfied with their finances is between 41.6 and 42.1 years, while the mean age of those who are satisfied with their finances is between 49.8 and 50.4 years.
df2 <- gss %>%
  filter(satfin == "Not At All Sat", !is.na(age)) %>%
  dplyr::select(satfin, age)
inference(y = age, data = df2, statistic = "mean", type = "ci",
          method = "theoretical", conf_level = 0.95)
## Single numerical variable
## n = 13883, y-bar = 41.8539, s = 15.5352
## 95% CI: (41.5955 , 42.1124)
df3 <- gss %>%
  filter(satfin == "Satisfied", !is.na(age)) %>%
  dplyr::select(satfin, age)
inference(y = age, data = df3, statistic = "mean", type = "ci",
          method = "theoretical", conf_level = 0.95)
## Single numerical variable
## n = 15291, y-bar = 50.1057, s = 18.5818
## 95% CI: (49.8112 , 50.4003)
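As a rough check, the one-sample interval can be evaluated by hand from the reported summary statistics, assuming a critical value \(t^* \approx 1.96\) for samples this large. For the respondents who are not at all satisfied:
\[ \bar{y} \pm t^* \frac{s}{\sqrt{n}} = 41.8539 \pm 1.96 \times \frac{15.5352}{\sqrt{13883}} \approx 41.8539 \pm 0.2584 \]
which gives approximately \((41.60, 42.11)\), in line with the reported interval. The same calculation with \(n = 15291\), \(\bar{y} = 50.1057\), and \(s = 18.5818\) gives approximately \((49.81, 50.40)\) for the satisfied respondents.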
Research Question 1
Research Question 2