Statistical Inference with the GSS dataset

1. Research question

Is there a relationship between social class and the number of children American adults have?

A 2015 Brookings Institution paper argued that social mobility is lower in the U.S.A. than in many industrialized countries. According to the paper, parents’ income is as strong a determinant of economic and social outcomes as any other factor. This is somewhat surprising considering the potency of the ideal of “The American Dream”: the ideal by which success and prosperity are available to every hardworking American regardless of background.

In Africa, where I come from, I have noticed that middle and upper class families tend to have relatively few children upon whom they concentrate resources. Perhaps this partly explains their children’s apparently superior outcomes.

I am interested in finding out if there is a significant difference in the number of children Americans of different social classes have. If there is such a difference, subsequent research could seek to establish what, if any, impact this has on social success measures such as educational attainment and income levels.

I will attempt to answer this question by comparing the average number of children of respondents within different social classes.

2. Data

Since 1972 the General Social Survey (GSS) has been monitoring social change in the U.S. by tracking trends in attitudes and behaviours. The data used here were collected between 1972 and 2012.

Randomly selected individual adults (18 years or older) from across the United States answered questions that covered a broad range of political, social and cultural issues.

Three methods of data collection were used: computer-assisted personal interview (CAPI), face-to-face interview and telephone interview.

Given the probability sampling methods and the geographic coverage of the survey, conclusions from analyses of the data therein are generalizable to the entire adult population of the U.S.

Such conclusions would only go as far as establishing relationships and identifying patterns and trends in the data. Experiments are generally required to establish causal associations; observational studies such as the GSS are insufficient.

Although it is impossible to completely eliminate survey sampling bias, the GSS invests considerable effort in limiting the impact of common sources of bias in probability samples. Some of the steps taken to limit bias are described in the GSS documentation.

Two variables will be analyzed:

class

Description: Subjective social class identification.

Survey Question: If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class?

Type: A categorical variable with 4 levels: lower class, working class, middle class, upper class
Transformation: 1 level (“No Class”) was dropped from the original dataset variable.

children

Description: Number of children.

Survey Question: How many children have you ever had, counting all that were born alive at any time (including any you had from a previous marriage)?

Type: A numerical variable.
Transformation: Renamed from original dataset variable - childs.

3. Setup

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(knitr)
library(gplots)

Load data

load("gss.Rdata")

3.1 Data Preparation

Details of the data pre-processing steps undertaken in this section are included with the code.

# select required variables
gss1 <- dplyr::select(gss, caseid, year, age, sex, class, childs) 

# rename variable (dplyr syntax: new_name = old_name)
gss1 <- dplyr::rename(gss1, children = childs)

# remove missing values from analysis variables;
# filter() takes logical conditions, not named arguments
gss1 <- filter(gss1, !is.na(class), !is.na(children))

# remove factor level; only 1 observation recorded for the level
gss1 <- filter(gss1, class != "No Class") 

# drop unused factor level
gss1$class <- droplevels(gss1$class)
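
To make the effect of these steps concrete, here is a minimal sketch on a made-up data frame. The `toy` object and its values are hypothetical and are not GSS data; base R is used so the sketch runs without extra packages, whereas the report itself relies on dplyr.

```r
# Hypothetical toy data standing in for the GSS extract (not real survey values)
toy <- data.frame(
  class = factor(c("Lower Class", "Middle Class", NA, "No Class", "Upper Class")),
  children = c(3, NA, 2, 1, 0)
)

# keep only rows where both analysis variables are present
toy <- toy[!is.na(toy$class) & !is.na(toy$children), ]

# remove the sparse "No Class" level, then drop the now-empty factor levels
toy <- toy[toy$class != "No Class", ]
toy$class <- droplevels(toy$class)

nrow(toy)          # number of rows that survive the filters
levels(toy$class)  # factor levels that remain after droplevels()
```

Without `droplevels()`, the removed levels would still appear (with zero counts) in tables and plots.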

4. Data Exploration

children

The summary below shows that at least a quarter of respondents do not have any children (the first quartile is 0).

The maximum number is 8. All respondents who have 8 or more children are grouped into this bucket.

summary(gss1$children)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   1.949   3.000   8.000

The data are naturally bounded at 0. This partly explains the right skew apparent in the histogram below.

The distribution is centred around 2.

The plot also indicates the presence of outliers.

ggplot(gss1, aes(x = as.numeric(children))) +
  # one bar per integer count (0 to 8) instead of the default 30 bins
  geom_histogram(binwidth = 1, fill = "steelblue") +
  labs(x = "number of children", y = "respondent count", 
       title = "Frequency Distribution") +
  theme_classic()

The boxplot does a better job of identifying extreme values. Observations beyond the whiskers are outliers. Respondents with more than 7 children may be considered unusual.

boxplot(gss1$children, main = "> 7 Children is Unusual", col = "steelblue")
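
The "more than 7" threshold can be reproduced by hand. The sketch below assumes the boxplot uses the conventional 1.5 × IQR rule (the default `range = 1.5` in base R's `boxplot()`) and takes the quartiles from the `summary()` output above.

```r
# quartiles taken from the summary() output above
q1 <- 0
q3 <- 3
iqr <- q3 - q1

# upper fence under the 1.5 * IQR rule; points above it are drawn as outliers
upper_fence <- q3 + 1.5 * iqr
upper_fence
```

Because counts are whole numbers, any value above this fence means 8 children, the capped top category.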

class

Just over 90% of respondents identify as either working or middle class. Of the remaining 9%, about 6% consider themselves lower class and 3% say they are upper class.

table1 <- table(gss1$class, dnn = "Social Class")

options(digits = 2)
prop.table(table1)

## Social Class
##   Lower Class Working Class  Middle Class   Upper Class 
##         0.059         0.456         0.453         0.032

barplot(table1, col = "steelblue", xlab = "Social Class", main = "Group Counts")
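
As a cross-check, the proportions above can be recomputed from the group sizes reported in the summary table in section 5.3. This is a sketch that hard-codes those printed counts rather than reading them from `gss1`.

```r
# group counts as printed in the summary table in section 5.3
counts <- c("Lower Class" = 3136, "Working Class" = 24389,
            "Middle Class" = 24230, "Upper Class" = 1737)

# share of respondents in each class
props <- counts / sum(counts)
round(props, 3)

# combined share of working and middle class respondents
working_middle <- sum(props[c("Working Class", "Middle Class")])
round(working_middle, 3)
```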

Turning to the distribution of number of children within the different class groups:

The boxplots for working, middle and upper classes look almost identical. The middle 50% of their distributions range from 0 to just below 3.

The upper limits are also identical. However, this is most likely due to the fact that the data are artificially capped at 8.

The boxplot for “lower class” respondents is different from the other 3. The middle 50% ranges between 1 and about 3. The mean is higher and there is more variability in the distribution compared to the others. There are also more outliers present.

ggplot(data = gss1, aes(x = class, y = children)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "", y = "No. of Children", title = "Social Class vs Family Size") +
  theme_classic()

The boxplots seem to suggest weak or no differences between the groups. The next section will rigorously test this hypothesis.

5. Inference

5.1 Hypotheses

\(H_0\): The mean number of children is the same for respondents in all four social classes
\(H_A\): The mean number of children differs between at least one pair of social classes
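
In symbols, the same hypotheses can be written as follows. The notation \(\mu_L, \mu_W, \mu_M, \mu_U\) for the population mean number of children in the lower, working, middle and upper classes is introduced here for illustration and is not part of the GSS documentation.

```latex
\[
H_0:\ \mu_L = \mu_W = \mu_M = \mu_U
\qquad
H_A:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j,\quad i, j \in \{L, W, M, U\}
\]
```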

5.2 Statistical Method

I shall use the analysis of variance (ANOVA) test and the F statistic to compare the mean number of children within the 4 social classes.

Analysis of variance is the generally accepted method for comparing group means when more than 2 groups are being considered.

ANOVA works by partitioning the total variability observed in a response variable into between group variability (attributable to differences between the groups) and within group variability (attributable to differences within the groups).

Our interest lies mainly in the variability in average numbers of children that is attributable to differences between the social classes.
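
The partition described above can be sketched in standard one-way ANOVA notation. The symbols are assumptions introduced here: \(K\) groups, \(n_k\) respondents in group \(k\), \(N\) respondents in total, \(\bar{y}_k\) the mean of group \(k\) and \(\bar{y}\) the overall mean.

```latex
\[
\underbrace{\sum_{k=1}^{K}\sum_{i=1}^{n_k} (y_{ik} - \bar{y})^2}_{\text{total}}
=
\underbrace{\sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2}_{\text{between groups}}
+
\underbrace{\sum_{k=1}^{K}\sum_{i=1}^{n_k} (y_{ik} - \bar{y}_k)^2}_{\text{within groups}}
\]
\[
F = \frac{\text{between-group sum of squares} / (K - 1)}{\text{within-group sum of squares} / (N - K)}
\]
```

A large \(F\) means the group means vary more than would be expected from the variability within groups alone.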

5.3 ANOVA - Conditions

3 key conditions are required to ensure the reliability of ANOVA results.

Independence

Within Groups: Given the GSS survey methodology (random sampling, appropriate sample sizes), we can assume within group independence.
Between Groups: Observations between the groups are independent. Data are not paired, each respondent is present once in only 1 group.

Approximate Normality: The side by side boxplots above show the distributions of data within the groups are unimodal and right skewed.
Equal Variance: Consistent variance across groups is particularly important in cases such as this where sample sizes between groups are unequal. The summary statistics below show just how unbalanced the groups are. However, they also show that variability is reasonably consistent across 3 of the 4 groups; ‘Lower Class’ is more variable.

table1 <- gss1 %>%
  group_by(class) %>%
  summarise(mean = mean(children), median = median(children), sd = sd(children), n = n())

table1

## Source: local data frame [4 x 5]
## 
##           class  mean median    sd     n
##          (fctr) (dbl)  (dbl) (dbl) (int)
## 1   Lower Class   2.4      2   2.1  3136
## 2 Working Class   2.0      2   1.8 24389
## 3  Middle Class   1.9      2   1.7 24230
## 4   Upper Class   2.0      2   1.8  1737

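Despite the skew observed in the within group distributions and the possible violation of the equal variance condition, ANOVA is still applicable as these departures are not so extreme as to invalidate the method’s results. Every group is also large (the smallest has 1,737 respondents), which makes the comparison of group means less sensitive to moderate skew.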
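
One way to put a number on "not so extreme" is to compare the largest and smallest group standard deviations. The sketch below uses the rounded values printed in the summary table; the cut-off of 2 is a common rule of thumb assumed here for illustration, not a criterion stated in the source or enforced by `aov()`.

```r
# group standard deviations as printed in the summary table (rounded values)
sds <- c("Lower Class" = 2.1, "Working Class" = 1.8,
         "Middle Class" = 1.7, "Upper Class" = 1.8)

# ratio of the largest to the smallest standard deviation
sd_ratio <- max(sds) / min(sds)
round(sd_ratio, 2)

# rule of thumb (assumption): a ratio below 2 is usually considered acceptable
sd_ratio < 2
```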

5.4 ANOVA Test - Implementation

The aov function in base R is a good choice to fit an analysis of variance model in our ‘single classification’ setting.

model1 <- aov(children ~ class, data = gss1)
summary(model1)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## class           3    811   270.4    84.9 <2e-16 ***
## Residuals   53488 170415     3.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the summary results the F test is significant: the p-value is below 2e-16, which is effectively 0.
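
The F value and p-value can be recovered from the sums of squares and degrees of freedom in the table. This sketch uses the rounded figures as printed, so the result matches the reported F value only approximately.

```r
# sums of squares and degrees of freedom as printed in the ANOVA table
ss_between <- 811
df_between <- 3
ss_within <- 170415
df_within <- 53488

# mean squares: sum of squares divided by degrees of freedom
ms_between <- ss_between / df_between
ms_within <- ss_within / df_within

# F statistic and its upper-tail p-value under the F distribution
f_stat <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

f_stat           # close to the reported 84.9; inputs are rounded
p_value < 2e-16  # consistent with the "<2e-16" shown in the table
```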

The plot below highlights the differences in the group means. The blue bars delimit the 95% confidence intervals for each mean. The confidence intervals of the means of ‘Lower’ and ‘Upper’ classes are considerably wider than those for ‘Working’ and ‘Middle’ classes. This indicates greater uncertainty around those estimates; unsurprising given the much smaller sample sizes of those 2 groups.

plotmeans(children ~ class, data = gss1, xlab = "Social Class", ylab = "Average No. of Children",
          main = "Group Means\n(blue bars mark 95% confidence interval)", ccol = "red")
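
The link between sample size and interval width can be illustrated with a rough calculation from the summary table. This is a sketch using a normal approximation and the rounded standard deviations; `plotmeans()` computes its own intervals, so the numbers will not match the plot exactly.

```r
# rounded standard deviations and group sizes from the summary table
sds <- c("Lower Class" = 2.1, "Working Class" = 1.8,
         "Middle Class" = 1.7, "Upper Class" = 1.8)
ns <- c("Lower Class" = 3136, "Working Class" = 24389,
        "Middle Class" = 24230, "Upper Class" = 1737)

# approximate 95% confidence interval half-width: z * sd / sqrt(n)
half_width <- qnorm(0.975) * sds / sqrt(ns)
round(half_width, 3)
```

The two small groups have half-widths several times larger than the two large groups, in line with the wider bars in the plot.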

5.5 Results

The significant ANOVA F test provides evidence that average number of children is not the same across social classes in the U.S.A.

Although this is a very helpful first step, it does not tell us which classes differ from one another.

The TukeyHSD() function implements a multiple comparison procedure that solves this problem by testing all the pairwise differences between the means.

The rightmost column (p adj) lists the adjusted p-values that determine the significance of each pairwise test.

The average number of children of lower class respondents differs significantly from that of working, middle and upper class respondents. The small difference between the working and middle classes is also significant.

The upper class does not differ significantly from either the working class (adjusted p-value 1.00) or the middle class (adjusted p-value 0.12).

TukeyHSD(model1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = children ~ class, data = gss1)
## 
## $class
##                               diff    lwr    upr p adj
## Working Class-Lower Class  -0.4439 -0.531 -0.357  0.00
## Middle Class-Lower Class   -0.5361 -0.623 -0.449  0.00
## Upper Class-Lower Class    -0.4375 -0.575 -0.300  0.00
## Middle Class-Working Class -0.0922 -0.134 -0.051  0.00
## Upper Class-Working Class   0.0064 -0.107  0.120  1.00
## Upper Class-Middle Class    0.0986 -0.015  0.212  0.12
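
The same conclusions can be read off the confidence limits: a pairwise difference is significant at the 5% family-wise level exactly when its interval excludes 0. The sketch below hard-codes the printed `lwr` and `upr` values; the `tukey` data frame is a hypothetical helper, not an object returned by `TukeyHSD()`.

```r
# lower and upper confidence limits as printed in the TukeyHSD() output
tukey <- data.frame(
  pair = c("Working-Lower", "Middle-Lower", "Upper-Lower",
           "Middle-Working", "Upper-Working", "Upper-Middle"),
  lwr = c(-0.531, -0.623, -0.575, -0.134, -0.107, -0.015),
  upr = c(-0.357, -0.449, -0.300, -0.051, 0.120, 0.212)
)

# a difference is significant when the interval lies entirely on one side of 0
tukey$significant <- tukey$lwr > 0 | tukey$upr < 0
tukey[, c("pair", "significant")]
```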

6. Conclusion

Combining all the information we have - the ANOVA analysis, the group means plot and the multiple comparison table - we can conclude that lower class Americans do have more children on average than all other social classes. Working class Americans, on average, have more children than middle class Americans.

The emerging pattern is broken by the “upper class” group. They do not have fewer children than the group immediately below them. The relatively small sample size of the “upper class” group is likely having an impact on the precision with which its mean can be estimated. It may be useful, on a future occasion, to rerun the analysis with more balanced sample sizes.

In addition to presumably having lower incomes, lower class Americans have more children on average. This may put family budgets under greater pressure and leave fewer resources available for each child.

Researchers in different fields will no doubt continue studying how this impacts the life outcomes of children born into that social group.