The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The GSS has been a significant source of data, giving a clear perspective on what U.S. residents think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
In short, the GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes.
Sample collection methodology: For GSS survey participants, an address is selected at random so that the sample represents a cross-section of the country. The random selection of households from across the United States ensures that the results of the survey are scientifically valid. A random adult within each selected household is then chosen to complete the interview.
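The two selection steps described above can be sketched in base R. This is a deliberately simplified illustration, not the actual GSS sampling design: the sampling frame, its size, and the column names `address_id`, `n_adults` and `adult_index` are all assumptions made up for the example.

```r
# Hypothetical sampling frame: 1,000 addresses, each with 1 to 4 adults.
set.seed(42)
frame <- data.frame(address_id = 1:1000,
                    n_adults   = sample(1:4, size = 1000, replace = TRUE))

# Stage 1: draw 50 addresses at random, without replacement.
selected <- frame[sample(nrow(frame), size = 50), ]

# Stage 2: pick one adult at random within each selected household.
# sample.int() is used so that a household with a single adult is handled safely.
selected$adult_index <- vapply(selected$n_adults,
                               function(n) sample.int(n, size = 1),
                               integer(1))
head(selected)
```

Every address has the same chance of being drawn in stage 1, while stage 2 picks exactly one respondent per household.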
Data collection method implications: Since the sample consists of one randomly selected adult per randomly selected address, we cannot generalize the results to the entire U.S. population. The target population was divided into homogeneous strata, and each stratum was then randomly sampled; in other words, only adults living in households across the country had an equal chance of being selected for this survey.
Scope of inference: Each subject in a stratum is equally likely to be selected, so we are dealing with a large-scale observational study whose sample is representative of the population it comes from (adults living in households). Because there is no random assignment, the groups are not essentially the same, and causal conclusions cannot be drawn.
In short, we have an observational study: generalizable to household adults, but not causal.
Research question: First, we may wonder whether there is any difference in average family income in constant dollars between different Hispanic origins. The origins can be grouped as Mexican, Central American, South American and Caribbean; origins outside the American continent will not be taken into account for this particular analysis. We will take the following variables from the gss dataset:
coninc: Total family income in constant dollars.
hispanic: Hispanic specified.
year: GSS year for this respondent.
Is there any difference in average family income in constant dollars between different Hispanic origins?
Before any fancy analysis, we need to create and clean a new data set from the original gss data set. We will use some functions from the dplyr package. In gss, the variable hispanic has 28 levels; we will simplify this by grouping the responses into 5 groups depending on the divisions of the American continent.
# Select the variables we will work with
hispanic <- gss %>% select(coninc,hispanic,year)
# As we are only interested in hispanic responses we filter the data
hispanic <- hispanic %>% filter((hispanic != "Not Hispanic" & !is.na(hispanic)))
# Then create the groups
central <- c("salvadorian|panamanian|guatemalan|nicaraguan|central american|costa rican|honduran")
caribbean <- c("puerto rican|dominican|cuban")
south <- c("south american|chilean|peruvian|venezuelan|argentinian")
hispanic$hispanic <- case_when(grepl("mexican",tolower(hispanic$hispanic)) ~ "mexican",
                               grepl(central, tolower(hispanic$hispanic)) ~ "central american",
                               grepl(caribbean, tolower(hispanic$hispanic)) ~ "caribbean",
                               grepl(south, tolower(hispanic$hispanic)) ~ "south american")
# Now remove the NA within all the columns
hispanic <- hispanic[complete.cases(hispanic),]
# Change the columns' names
names(hispanic) <- c("income","origin","year")
# Change years' type from numeric to character
hispanic$year <- as.character(hispanic$year)
Now we will create a data frame called income with the variables year, origin and income. This lets us see whether average family income follows a trend for each origin.
income <- hispanic %>% select(year,origin,income) %>% group_by(year,origin) %>% summarise(average=mean(income))
ggplot(data=income, aes(x=year, y=average, group=origin, col=origin)) + geom_line(size=0.8) +
labs(title="Average Family Income per Year", y="Income in constant dollars",x="Year")
The following table shows the average family income per year and origin. The incomes are rounded to facilitate readability.
##                   year
## origin              2000  2002  2004  2006  2008  2010  2012
##   caribbean        39474 57076 49828 35026 33485 41417 37369
##   central american 26286 42865 36063 32501 26468 20677 28038
##   mexican          41727 43108 35369 35111 33330 24906 39589
##   south american   45949 75575 32455 56364 56059 65363 56732
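The code that produced this table is not shown, so the following is only a sketch of one way such an origin-by-year table of rounded means could be built with base R. The data frame `toy` and its values are made up for illustration; only the column names income, origin and year follow the cleaned data set.

```r
# Toy stand-in for the cleaned data (values are invented for the example).
toy <- data.frame(
  income = c(30000, 50000, 20000, 40000, 45000, 55000),
  origin = c("mexican", "mexican", "caribbean", "caribbean", "mexican", "caribbean"),
  year   = c("2000", "2000", "2000", "2002", "2002", "2002")
)

# Mean income for every origin/year combination, rounded to whole dollars.
avg_table <- round(tapply(toy$income,
                          INDEX = list(origin = toy$origin, year = toy$year),
                          FUN = mean))
avg_table
```

Passing a list of two grouping variables to `tapply()` returns a matrix with one row per origin and one column per year, which matches the layout of the table.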
Comments:
Now we will look at the distribution of family income per origin for the years 2000 to 2012.
# Select the variables we will work with
income2 <- hispanic %>% select(income,origin)
# Calculate the means in order to put them in the box plot
# (the origins are sorted so that they line up with tapply's alphabetical group order)
means <- data.frame(origin=sort(unique(hispanic$origin)),mean=tapply(income2$income,INDEX=income2$origin,FUN=mean))
#Create the box plot
ggplot(data=income2,aes(x=origin, y=income, fill=origin)) + geom_boxplot() +
geom_point(data=means, aes(x=origin,y=mean), pch=4, lwd=5, col="red") +
labs(title="Family Income per Origin from 2000 to 2012", x="Origin",y="Income")
Comments:
State hypotheses: We are interested in whether there is any difference in average family income in constant dollars between different Hispanic origins. Therefore our null hypothesis is that there is nothing going on, while our alternative hypothesis is that at least one group mean differs from the others.
\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}} = \mu_{\text{south american}}\)
\(H_A:\) at least one pair of means differs, i.e. \(\mu_i \ne \mu_j\) for some pair of origins \(i \ne j\)
Check conditions:
Independence: Since the survey was conducted by random sampling, we can assume independence within groups. The data for the groups do come from the same survey, but that does not by itself mean the groups are dependent on one another.
Normality: As shown below in the normal qq plots, the assumption of normality is not met.
Constant variance: As shown above in the boxplot, the assumption of constant variance is not met either.
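The normality and constant-variance conditions can also be checked numerically, in addition to reading them off the plots. The sketch below uses simulated right-skewed data rather than the GSS incomes; the group labels, sample sizes and lognormal parameters are assumptions chosen only to illustrate the calls.

```r
# Simulated right-skewed incomes for three hypothetical groups.
set.seed(1)
sim <- data.frame(
  origin = rep(c("a", "b", "c"), times = c(200, 100, 40)),
  income = c(rlnorm(200, meanlog = 10.3, sdlog = 0.8),
             rlnorm(100, meanlog = 10.0, sdlog = 0.9),
             rlnorm(40,  meanlog = 10.6, sdlog = 1.0))
)

# Constant variance: compare the standard deviations across groups.
group_sd <- tapply(sim$income, sim$origin, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Normality: Shapiro-Wilk p-value per group; a small p-value suggests
# a departure from normality.
shapiro_p <- tapply(sim$income, sim$origin,
                    function(x) shapiro.test(x)$p.value)

group_sd
sd_ratio
shapiro_p
```

A ratio of standard deviations far from 1, or very small Shapiro-Wilk p-values, would point in the same direction as the boxplot and the qq plots.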
#Create the qqnormal plots
par(mfrow=c(1,4))
for( i in 1:4){
origin <- row.names(means)[i]
qqnorm(y=income2$income[income2$origin==origin],main=paste("Normal QQ plot - ",origin,sep=""))
qqline(y=income2$income[income2$origin==origin])
}
State the method: We have already seen that we cannot assume normality in the model. As things stand we cannot proceed with the usual analysis, but we can perform a one-way ANOVA with bootstrapping. In bootstrapping, we assume that for each observation in the sample there may be others like it in the population. So we can think of our bootstrap population as a population where each observation from the sample appears many times. We then take samples from this population to get an idea of what the means from the original population would look like.
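The resampling idea can be illustrated with a few lines of base R. This is a minimal sketch on a made-up sample, not the rep_sample_n call used in the analysis; the vector `x`, the number of replicates `n_reps` and the lognormal parameters are assumptions for the example.

```r
set.seed(123)
# Made-up right-skewed sample standing in for one origin's incomes.
x <- rlnorm(50, meanlog = 10.5, sdlog = 0.9)

# Each bootstrap replicate resamples x with replacement, at the original
# sample size, and records the mean of the resample.
n_reps <- 1000
boot_means <- replicate(n_reps,
                        mean(sample(x, size = length(x), replace = TRUE)))

# The spread of the bootstrap means approximates the standard error of the mean.
c(sample_mean = mean(x),
  boot_se     = sd(boot_means),
  classic_se  = sd(x) / sqrt(length(x)))
```

Because every resample is drawn from the observed values only, each bootstrap mean lies between the smallest and largest observation, and the bootstrap means are far less variable than the raw incomes.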
Perform inference:
# Sample size for each origin
size <- tapply(income2$income,INDEX = income2$origin,FUN = function(x){length(x)})
size
##        caribbean central american          mexican   south american
##              326              146             1061               47
First, we need a random sample taken with replacement from the original sample, of the same size as the original sample.
# Create an empty list called bootstrap
bootstrap <- list()
# Fill the list with the bootstrap simulations
for(i in 1:4){
origin <- row.names(size)[i]
bootstrap[[i]] <- income2[income2$origin==origin,] %>%
rep_sample_n(size=size[i], reps = 100, replace = TRUE) %>%
group_by(replicate) %>% summarise(income=mean(income)) %>%
mutate(origin=origin)
names(bootstrap)[i] <- paste(origin,"BS",sep="")
}
# Create a data frame with all the bootstrap elements
income3 <- rbind(bootstrap$caribbeanBS,
bootstrap$`central americanBS`,
bootstrap$mexicanBS,
bootstrap$`south americanBS`)
# Create a boxplot
ggplot(data=income3,aes(x=origin, y=income, fill=origin)) + geom_boxplot() +
labs(title="Family Income per Origin from 2000 to 2012", x="Origin",y="Income")
Comments:
mexican, central american and caribbean seem to be normal and have roughly equal variability.
Sadly, this is not the case for south american, which looks like a smashed potato. This is probably due to its small sample size and high variability.
ggplot(data=income3, aes(x=income,y=origin, fill=origin)) + geom_density_ridges() +
labs(title="Distributions of Family Income per Origin from 2000 to 2012 \nafter Bootstrapping ",
x="Income",y="Origin")
State hypotheses: In conclusion, we cannot consider south american, since it is evidently no longer representative of its population, but we can perform ANOVA with the remaining origins. Therefore our null hypothesis is that there is nothing going on, while our alternative hypothesis is that at least one group mean differs from the others. The hypotheses are now the following:
\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}}\)
\(H_A:\) at least one pair of means differs, i.e. \(\mu_i \ne \mu_j\) for some pair of origins \(i \ne j\)
# Remove south american from income data frame
income4 <- income3 %>% filter(origin!="south american")
# Perform ANOVA
income.anova <- aov(income~origin, data=income4)
summary(income.anova)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## origin        2 6.661e+09 3.330e+09    1069 <2e-16 ***
## Residuals   297 9.250e+08 3.115e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
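To see where the F value in the table comes from, it can be recomputed by hand from the printed sums of squares and degrees of freedom. The sketch below only reuses the rounded numbers shown in the output above, so the result agrees with the table up to rounding; the variable names are chosen for the example.

```r
# Values copied from the ANOVA table.
ss_origin <- 6.661e9; df_origin <- 2
ss_resid  <- 9.250e8; df_resid  <- 297

# Mean squares are sums of squares divided by their degrees of freedom.
ms_origin <- ss_origin / df_origin
ms_resid  <- ss_resid / df_resid

# The F statistic is the ratio of between-group to within-group mean squares.
f_value <- ms_origin / ms_resid

# Upper-tail probability of the F distribution with (2, 297) degrees of freedom.
p_value <- pf(f_value, df1 = df_origin, df2 = df_resid, lower.tail = FALSE)

round(f_value)
p_value < 2e-16
```

The 297 residual degrees of freedom correspond to the 300 bootstrap means (100 replicates for each of the three remaining origins) minus the 3 groups.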
Interpret results: