Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(stats)
library(ggridges)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. It has been a significant source of data on what U.S. residents think and feel about issues such as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

In short, the GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes.

This data set is an extract of the General Social Survey (GSS) Cumulative File 1972-2012; all data and coding come from the original dataset.
57061 observations.
114 variables.

Sample collection methodology: According to the GSS information for survey participants, an address is selected at random so that the sample represents a cross-section of the country. The random selection of households from across the United States ensures that the results of the survey are scientifically valid. An adult within the household is then selected at random to complete the interview.

Data collection method implications: Since each respondent is a randomly selected adult from a household chosen by address, we cannot generalize the results to the entire U.S. population, only to adults living in households. The target population was divided into homogeneous strata and then randomly sampled; in other words, only adults living in households across the country had an equal chance of being selected for this survey.

Scope of inference: Each subject in a stratum is equally likely to be selected, so we are dealing with a large-scale observational study whose sample is representative of the population it comes from (adults living in households). Because there is no random assignment, the groups are not essentially the same, and causal conclusions cannot be made.

In short, we have an observational study: the results are generalizable but not causal.

Part 2: Research question

Research question: We want to know whether average family income in constant dollars differs between Hispanic origins. The origins are grouped as mexican, central american, south american and caribbean; origins outside the American continent are not taken into account in this analysis. We will use the following variables from the gss dataset:

coninc: Total family income in constant dollars.
hispanic: Hispanic specified.
year: GSS year for this respondent.

Is there any difference in average family income in constant dollars between different hispanic origins?

Part 3: Exploratory data analysis

Before any fancy analysis, we need to create and clean a new data set from the original gss data set, using some functions of the dplyr package. In gss there are 28 levels for the variable hispanic; we will simplify this by grouping the responses into four groups according to the regions of the American continent (mexican, central american, caribbean and south american). Responses that match none of these groups are dropped.

# Select the variables we will work with 
hispanic <- gss %>% select(coninc,hispanic,year)

# As we are only interested in hispanic responses we filter the data
hispanic <- hispanic %>% filter((hispanic != "Not Hispanic" & !is.na(hispanic)))

# Then create the groups
central <- c("salvadorian|panamanian|guatemalan|nicaraguan|central american|costa rican|honduran")
caribbean <- c("puerto rican|dominican|cuban")
south <- c("south american|chilean|peruvian|venezuelan|argentinian")

hispanic$hispanic <- case_when(grepl("mexican",tolower(hispanic$hispanic)) ~ "mexican",
          grepl(central, tolower(hispanic$hispanic)) ~ "central american",
          grepl(caribbean, tolower(hispanic$hispanic)) ~ "caribbean",
          grepl(south, tolower(hispanic$hispanic)) ~ "south american")

# Now remove the NA within all the columns
hispanic <- hispanic[complete.cases(hispanic),]

# Change the columns' names
names(hispanic) <- c("income","origin","year")

# Change years' type from numeric to character
hispanic$year <- as.character(hispanic$year)
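
The grouping step can be tried in isolation on a few made-up labels. The sketch below is only an illustration: the vector raw is hypothetical and does not reproduce the actual levels of gss$hispanic, but the patterns and the case_when logic are the same as above. It shows that a label matching no pattern (and a missing value) ends up as NA, which is why the NA rows are removed afterwards.

```r
library(dplyr)

# Hypothetical raw labels standing in for levels of gss$hispanic (assumption, not the real levels)
raw <- c("Mexican, Mexican American, Chicano/A", "Puerto Rican", "Cuban",
         "Salvadorian", "Peruvian", "Spain", NA)

# Same regular expressions as in the cleaning code above
central   <- "salvadorian|panamanian|guatemalan|nicaraguan|central american|costa rican|honduran"
caribbean <- "puerto rican|dominican|cuban"
south     <- "south american|chilean|peruvian|venezuelan|argentinian"

# The first matching condition wins; labels matching nothing become NA
grouped <- case_when(
  grepl("mexican", tolower(raw)) ~ "mexican",
  grepl(central,   tolower(raw)) ~ "central american",
  grepl(caribbean, tolower(raw)) ~ "caribbean",
  grepl(south,     tolower(raw)) ~ "south american"
)

# Show each raw label next to its group, then count the groups (including NA)
data.frame(raw, grouped)
table(grouped, useNA = "ifany")
```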

Now we will create a data frame called income with the variables year, origin and income. This lets us see whether average family income follows a trend for each origin.

income <- hispanic %>% select(year,origin,income) %>% group_by(year,origin) %>% summarise(average=mean(income))

ggplot(data=income, aes(x=year, y=average, group=origin, col=origin)) + geom_line(size=0.8) + 
        labs(title="Average Family Income per Year", y="Income in constant dollars",x="Year")

The following table shows the average family income per year and origin. The incomes are rounded to make the table easier to read.

xtabs(round(average,0) ~ origin + year, data=income)

##                   year
## origin              2000  2002  2004  2006  2008  2010  2012
##   caribbean        39474 57076 49828 35026 33485 41417 37369
##   central american 26286 42865 36063 32501 26468 20677 28038
##   mexican          41727 43108 35369 35111 33330 24906 39589
##   south american   45949 75575 32455 56364 56059 65363 56732

Comments:

The average family income for each origin group fluctuates through the years. There does not seem to be a clear trend.
For the year 2012 the highest average family income is for south american and the lowest for central american, while mexican and caribbean are quite similar.
Since this graph is built from aggregated means, it tells us nothing about the spread of the incomes. The average is strongly affected by extreme values, so we need another way to visualize the distribution of income within each origin group.
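
To see how strongly a single extreme value can pull the average, here is a minimal sketch with made-up incomes (the numbers are hypothetical and are not taken from the GSS data). The median barely moves when one very large income is added, while the mean more than doubles.

```r
# Hypothetical family incomes in constant dollars (assumption, not GSS values)
incomes <- c(18000, 22000, 25000, 27000, 30000)

# The same incomes plus one extreme value
with_outlier <- c(incomes, 250000)

mean_before   <- mean(incomes)         # 24400
median_before <- median(incomes)       # 25000
mean_after    <- mean(with_outlier)    # 62000
median_after  <- median(with_outlier)  # 26000

c(mean_before, median_before, mean_after, median_after)
```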

Now we will look at the distribution of family income per origin for the years 2000 to 2012.

# Select the variables we will work with 
income2 <- hispanic %>% select(income,origin)

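# Calculate the means in order to put them in the box plot.
# tapply returns the means in alphabetical order of origin, so the labels
# are taken from its names to keep each mean next to the right origin.
group_means <- tapply(income2$income,INDEX=income2$origin,FUN=mean)
means <- data.frame(origin=names(group_means),mean=as.numeric(group_means),
                    row.names=names(group_means))

# Create the box plot, marking each group mean with a red 'x'
ggplot(data=income2,aes(x=origin, y=income, fill=origin)) + geom_boxplot() + 
        geom_point(data=means, aes(x=origin,y=mean), shape=4, size=5, col="red") + 
        labs(title="Family Income per Origin from 2000 to 2012", x="Origin",y="Income")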

Comments:

We can see that all categories are right skewed.
It is also clear how strongly the average is affected by extreme values in the caribbean and central american categories: the outliers have increased the mean value (the red ‘x’ in the plot).
Even though the average family income of south american is the highest, this category is also the most variable. This can be a critical factor in the result of the statistical analysis.

Part 4: Inference

State hypotheses: We are interested in whether average family income in constant dollars differs between hispanic origins. Therefore our null hypothesis is that there is nothing going on (all group means are equal), while our alternative hypothesis is that at least one group mean differs from the others.

\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}} = \mu_{\text{south american}}\)

\(H_A:\) at least one pair of means differs, that is, \(\mu_i \ne \mu_j\) for some pair of origins \(i \ne j\).

Check conditions:

Independence: Since the survey was conducted on a random sample, we can assume independence within groups. The data for all groups come from the same survey, but this does not mean the groups are dependent on each other.
Normality: As shown below in the normal QQ plots, the assumption of normality is not met.
Constant variance: As shown above in the boxplot, the assumption of constant variance is not met either.

#Create the qqnormal plots
par(mfrow=c(1,4))
for( i in 1:4){
        origin <- row.names(means)[i]
        qqnorm(y=income2$income[income2$origin==origin],main=paste("Normal QQ plot - ",origin,sep=""))
        qqline(y=income2$income[income2$origin==origin])
}
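
The constant-variance condition can also be checked with numbers instead of only by eye. The sketch below uses simulated right-skewed data (the data frame toy is hypothetical, not the GSS data) and compares the group standard deviations. A ratio between the largest and the smallest standard deviation well above roughly 2 is often taken as a warning sign, although that threshold is only a rule of thumb.

```r
set.seed(1)

# Simulated right-skewed incomes for three groups (assumption: stand-in for the real data)
toy <- data.frame(
  origin = rep(c("a", "b", "c"), times = c(200, 100, 30)),
  income = c(rlnorm(200, meanlog = 10.3, sdlog = 0.6),
             rlnorm(100, meanlog = 10.1, sdlog = 0.7),
             rlnorm(30,  meanlog = 10.6, sdlog = 1.0))
)

# Standard deviation of income within each group
group_sd <- tapply(toy$income, INDEX = toy$origin, FUN = sd)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(group_sd) / min(group_sd)

group_sd
sd_ratio
```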

State the method: We already saw that we cannot assume normality. As it stands we cannot run a standard ANOVA on the raw incomes, but we can perform a one-way ANOVA with bootstrapping. In bootstrapping, we assume that for each observation in the sample there may be others like it in the population. We can therefore think of the bootstrap population as one in which each observation from the sample appears many times, and we take samples from this population to get an idea of what sample means from the original population would look like.
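
The resampling idea described above can be sketched in base R. This is only a minimal illustration on a simulated sample (x is hypothetical and the number of replicates is an arbitrary choice); it is not the code used for the analysis, which relies on rep_sample_n from statsr.

```r
set.seed(42)

# Simulated right-skewed sample standing in for one origin group (assumption)
x <- rlnorm(50, meanlog = 10, sdlog = 0.8)

# Number of bootstrap replicates (arbitrary choice for this sketch)
reps <- 1000

# Each replicate: resample x with replacement, same size as x, and keep the mean
boot_means <- replicate(reps, mean(sample(x, size = length(x), replace = TRUE)))

mean(x)                                 # mean of the original sample
sd(boot_means)                          # bootstrap estimate of the standard error
sd(x) / sqrt(length(x))                 # analytic standard error, for comparison
quantile(boot_means, c(0.025, 0.975))   # 95% percentile interval for the mean
```

Even though x is right skewed, a histogram of boot_means is usually much closer to a bell shape, which is the property the analysis relies on.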

Perform inference:

# Sample size for each origin
size <- tapply(income2$income,INDEX = income2$origin,FUN = function(x){length(x)})
size

##        caribbean central american          mexican   south american 
##              326              146             1061               47

First, we need a random sample taken with replacement from the original sample, of the same size as the original sample.

# Create an empty list called bootstrap
bootstrap <- list()

# Fill the list with the bootstrap simulations
for(i in 1:4){
        origin <- row.names(size)[i]
        bootstrap[[i]] <- income2[income2$origin==origin,] %>% 
                rep_sample_n(size=size[i], reps = 100, replace = TRUE) %>%
                group_by(replicate) %>% summarise(income=mean(income)) %>% 
                mutate(origin=origin)
        names(bootstrap)[i] <- paste(origin,"BS",sep="")
}

# Create a data frame with all the bootstrap elements
income3 <- rbind(bootstrap$caribbeanBS,
                 bootstrap$`central americanBS`,
                 bootstrap$mexicanBS,
                 bootstrap$`south americanBS`)

# Create a boxplot of the bootstrapped means
ggplot(data=income3,aes(x=origin, y=income, fill=origin)) + geom_boxplot() + 
        labs(title="Bootstrapped Mean Family Income per Origin from 2000 to 2012", x="Origin",y="Mean income")

Comments:

After bootstrapping, mexican, central american and caribbean seem to be normal and to have roughly equal variability. Sadly, this is not the case for south american, which looks like a smashed potato.
The bad bootstrap result for south american is probably due to its small sample size (47 observations) and high variability.

ggplot(data=income3, aes(x=income,y=origin, fill=origin)) + geom_density_ridges() +
        labs(title="Distributions of Family Income per Origin from 2000 to 2012 \nafter Bootstrapping ",
             x="Income",y="Origin")

State hypotheses: In conclusion, we cannot consider south american, since it is evidently no longer representative of its population, but we can perform ANOVA with the remaining origins. Our null hypothesis is again that there is nothing going on, while our alternative hypothesis is that at least one group mean differs from the others. The hypotheses are now the following:

\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}}\)

\(H_A:\) at least one pair of means differs, that is, \(\mu_i \ne \mu_j\) for some pair of the remaining origins \(i \ne j\).

# Remove south american from income data frame
income4 <- income3 %>% filter(origin!="south american")

# Perform ANOVA
income.anova <- aov(income~origin, data=income4)
summary(income.anova)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## origin        2 6.661e+09 3.330e+09    1069 <2e-16 ***
## Residuals   297 9.250e+08 3.115e+06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
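
The entries of the ANOVA table can be checked by hand from the rounded values printed above. Each mean square is the sum of squares divided by its degrees of freedom, and the F value is the ratio of the two mean squares. The residual degrees of freedom are 297 because the data frame holds 100 bootstrap means for each of the 3 origins (300 rows minus 3 groups).

\(MS_{\text{origin}} = \frac{6.661 \times 10^{9}}{2} \approx 3.330 \times 10^{9}\)

\(MS_{\text{residuals}} = \frac{9.250 \times 10^{8}}{297} \approx 3.115 \times 10^{6}\)

\(F = \frac{MS_{\text{origin}}}{MS_{\text{residuals}}} \approx \frac{3.330 \times 10^{9}}{3.115 \times 10^{6}} \approx 1069\)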

Interpret results:

A p-value is the probability of observing the sample statistics at hand, or something more extreme, if \(H_0\) is true; a small p-value is therefore evidence against the null hypothesis.
The tiny p-value (<2e-16) indicates strong evidence against the null hypothesis that mean family income is the same across the three remaining hispanic origins.
ANOVA does not tell us exactly where the difference lies, merely that there is evidence that one exists.
It is also important to highlight that the ANOVA was performed on the bootstrapped group means (100 per origin), not on the raw incomes, which did not meet the conditions for ANOVA.
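
Since ANOVA does not say where the difference lies, one common follow-up is a pairwise comparison such as Tukey's honest significant differences, available as TukeyHSD in the stats package. The sketch below is not part of the original analysis: it runs on simulated group means (the data frame toy and its numbers are hypothetical), only to show what such a follow-up could look like on an aov object like income.anova.

```r
set.seed(7)

# Simulated stand-in for 100 bootstrapped means per origin (assumption: made-up values)
toy <- data.frame(
  origin = factor(rep(c("caribbean", "central american", "mexican"), each = 100)),
  income = c(rnorm(100, mean = 41000, sd = 1800),
             rnorm(100, mean = 30000, sd = 1800),
             rnorm(100, mean = 36000, sd = 1800))
)

# One-way ANOVA, as in the analysis above
toy_anova <- aov(income ~ origin, data = toy)

# Pairwise differences in means with family-wise 95% confidence intervals
tukey <- TukeyHSD(toy_anova, conf.level = 0.95)

# One row per pair of origins: difference, lower and upper bound, adjusted p-value
tukey$origin
```

Each row compares two origins; a pair whose interval excludes zero (small adjusted p-value) is a pair where the means differ.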