Setup

Load data


Part 1: Data

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The GSS has been a source of significant data, giving a clear perspective on what U.S. residents think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

In short, the GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes.

Sample collection methodology: Based on the GSS platform for survey participants, an address is selected at random so that the sample represents a cross-section of the country. The random selection of households from across the United States ensures that the results of the survey are scientifically valid. An adult within the household is then selected at random to complete the interview.

Data collection method implications: Since the sample was obtained by randomly selecting an adult within a household identified by its address, we cannot generalize the results to the entire U.S. population. The population was divided into homogeneous strata and then randomly sampled; in other words, only adults living in households across the country had an equal chance of being selected for this survey.

Scope of inference: Each subject in a stratum is equally likely to be selected, so the sample is representative of the population from which it comes (adults living in households). This is a large-scale observational study: because there is no random assignment, the groups being compared are not essentially the same, and causal conclusions cannot be made.

In short, we have an observational study: the results are generalizable to adults living in households, but they are not causal.


Part 2: Research question

Research question: We may wonder whether average family income in constant dollars differs between Hispanic origins. The origins can be grouped as Mexican, Central American, South American and Caribbean; origins outside the American continent will not be taken into account in this analysis. We will take the following variables from the gss dataset:

Is there any difference in average family income in constant dollars between different Hispanic origins?


Part 3: Exploratory data analysis

Before any fancy analysis, we need to create and clean a new data set from the original gss data set. We will use some functions from the dplyr package. In gss, the variable hispanic has 28 levels; we will simplify this by grouping the responses into 5 groups according to the divisions of the American continent.

Now we will create a data frame called income with the variables year, origin and income. This lets us see whether average family income follows a trend for each origin.
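The regrouping and selection steps described above can be sketched with dplyr as follows. This is only an illustration on a small invented data frame: the level labels in `gss_toy`, the lookup from label to region, and the column name `coninc` for family income are assumptions, not the actual 28 levels or the exact code used for this report.

```r
library(dplyr)

# Toy stand-in for gss; labels, values and the coninc column name are assumed.
gss_toy <- data.frame(
  year     = c(2000, 2002, 2004, 2006, 2008, 2010),
  hispanic = c("Mexican", "Guatemalan", "Cuban", "Peruvian", "Spanish", "Not Hispanic"),
  coninc   = c(41000, 26000, 39000, 46000, 52000, 60000),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from detailed level to a region of the American continent.
income <- gss_toy %>%
  mutate(origin = case_when(
    hispanic == "Mexican"                        ~ "mexican",
    hispanic %in% c("Guatemalan", "Salvadorian") ~ "central american",
    hispanic %in% c("Peruvian", "Colombian")     ~ "south american",
    hispanic %in% c("Cuban", "Puerto Rican")     ~ "caribbean",
    TRUE                                         ~ "other"
  )) %>%
  filter(origin != "other") %>%               # drop origins outside the four groups
  select(year, origin, income = coninc)       # keep and rename the three variables

print(income)
```

Collapsing everything that is not matched into a single "other" group, and then filtering it out, keeps the mapping explicit and makes it easy to see which responses were excluded.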

The following table shows the average family income per year and origin. The incomes were rounded to improve readability.

##                   year
## origin              2000  2002  2004  2006  2008  2010  2012
##   caribbean        39474 57076 49828 35026 33485 41417 37369
##   central american 26286 42865 36063 32501 26468 20677 28038
##   mexican          41727 43108 35369 35111 33330 24906 39589
##   south american   45949 75575 32455 56364 56059 65363 56732
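A table of this shape can be produced with `tapply()` and `round()` in base R. The sketch below uses a few invented rows in a data frame called `income_toy` (a hypothetical name) with the same three columns as income, so its numbers are not the GSS figures.

```r
# Toy data with the same columns as the income data frame (values invented).
income_toy <- data.frame(
  year   = c(2000, 2000, 2000, 2002, 2002, 2002),
  origin = c("mexican", "mexican", "caribbean", "mexican", "caribbean", "caribbean"),
  income = c(40000, 42002, 39474, 43108, 57000, 57152)
)

# Mean income for every origin-by-year cell, rounded to whole dollars.
mean_table <- round(
  with(income_toy, tapply(income, list(origin = origin, year = year), mean))
)

print(mean_table)
```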

Comments:

Next, we look at the distribution of family income by origin for the years 2000 to 2012.
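One simple way to compare distributions across groups is a side-by-side boxplot. The sketch below is a minimal example on simulated, right-skewed values; the data frame `income_toy` and its numbers are invented and only illustrate the call, not the actual GSS distributions.

```r
set.seed(1)

# Simulated right-skewed incomes for two groups (invented values).
income_toy <- data.frame(
  origin = rep(c("caribbean", "mexican"), each = 50),
  income = c(rlnorm(50, meanlog = 10.4, sdlog = 0.8),
             rlnorm(50, meanlog = 10.3, sdlog = 0.8))
)

# Draw one box per origin; boxplot() also returns the statistics it plots.
bp <- boxplot(income ~ origin, data = income_toy,
              main = "Family income by origin (toy data)",
              ylab = "Family income (constant dollars)")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
five_num <- bp$stats
colnames(five_num) <- bp$names
print(round(five_num))
```

A long upper whisker and many points above it would indicate right skew, which matters later when checking the normality condition for ANOVA.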

Comments:


Part 4: Inference

State hypotheses: We are interested in whether average family income in constant dollars differs between Hispanic origins. Our null hypothesis is therefore that nothing is going on (all group means are equal), while our alternative hypothesis is that at least one mean differs from the others.

\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}} = \mu_{\text{south american}}\)

\(H_A:\) at least one of \(\mu_{\text{caribbean}}, \mu_{\text{central american}}, \mu_{\text{mexican}}, \mu_{\text{south american}}\) differs from the others.

Check conditions:

State the method: We already saw that we cannot assume normality, so a standard one-way ANOVA is not justified as it stands. We can, however, perform a one-way ANOVA with bootstrapping. In bootstrapping, we assume that for each observation in the sample there may be others like it in the population. So we can think of our bootstrap population as one in which each observation from the sample appears many times. We then take samples from this population to get an idea of what means from the original population would look like.
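The bootstrap idea described above can be sketched in a few lines of base R. The values in `incomes` are simulated for illustration, and `boot_mean` is a hypothetical helper name; the point is only to show how repeated resampling gives a picture of how the sample mean varies.

```r
set.seed(42)

# Invented right-skewed incomes standing in for one origin group.
incomes <- round(rlnorm(60, meanlog = 10.3, sdlog = 0.7))

# One bootstrap replicate: resample with replacement, same size, take the mean.
boot_mean <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# Repeat many times to approximate the sampling distribution of the mean.
boot_means <- replicate(2000, boot_mean(incomes))

# Percentile interval: the middle 95% of the bootstrap means.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(mean(incomes))
print(ci)
```

No normality assumption is needed here: the spread of `boot_means` comes entirely from the observed values.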

Perform inference:

##        caribbean central american          mexican   south american 
##              326              146             1061               47

First, we need a random sample taken with replacement from the original sample, of the same size as the original sample.
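One common way to combine this resampling step with a one-way ANOVA is to bootstrap the F statistic under the null hypothesis. The sketch below is an assumption about how such a test could be set up, not necessarily the procedure used for this report: the data are simulated, the helper names `f_stat` and `resample_groups` are hypothetical, and the groups are first shifted to a common mean so that the resamples reflect "nothing going on".

```r
set.seed(7)

# Invented data: three groups of unequal size, skewed like income.
toy <- data.frame(
  origin = rep(c("caribbean", "central american", "mexican"), times = c(40, 25, 60)),
  income = c(rlnorm(40, 10.4, 0.7), rlnorm(25, 10.1, 0.7), rlnorm(60, 10.3, 0.7))
)

# F statistic of a one-way ANOVA of income on origin.
f_stat <- function(d) summary(aov(income ~ origin, data = d))[[1]][["F value"]][1]

f_obs <- f_stat(toy)

# Impose the null: shift every group so that all groups share the overall mean.
centred <- toy
centred$income <- toy$income - ave(toy$income, toy$origin) + mean(toy$income)

# One replicate: within each origin, draw with replacement a sample of the
# same size as that group.
resample_groups <- function(d) {
  d$income <- ave(d$income, d$origin,
                  FUN = function(x) sample(x, size = length(x), replace = TRUE))
  d
}

# Bootstrap distribution of F under the null, and the resulting p-value.
f_boot  <- replicate(1000, f_stat(resample_groups(centred)))
p_value <- mean(f_boot >= f_obs)

print(f_obs)
print(p_value)
```

Resampling within each group keeps the group sizes fixed, which matters when the groups are as unbalanced as the counts shown above.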

Comments

State hypotheses: In conclusion, we cannot consider the South American group, since it is evidently no longer representative of its population, but we can perform ANOVA with the remaining origins. Our null hypothesis is again that nothing is going on, while our alternative hypothesis is that at least one mean differs from the others. The hypotheses are now the following:

\(H_0: \mu_{\text{caribbean}} = \mu_{\text{central american}} = \mu_{\text{mexican}}\)

\(H_A:\) at least one of \(\mu_{\text{caribbean}}, \mu_{\text{central american}}, \mu_{\text{mexican}}\) differs from the others.
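For reference, a one-way ANOVA of this form can be fitted in R with `aov()` and summarised with `summary()`. The sketch below uses made-up numbers (three groups of 100 simulated values), so its output will not match the table reported for the GSS data; it only shows the call and where each column of the table comes from.

```r
set.seed(3)

# Invented data: 100 simulated values per origin.
toy <- data.frame(
  origin = factor(rep(c("caribbean", "central american", "mexican"), each = 100)),
  income = c(rnorm(100, 40000, 2000), rnorm(100, 30000, 2000), rnorm(100, 35000, 2000))
)

# Fit the one-way ANOVA and extract the table printed by summary().
fit <- aov(income ~ origin, data = toy)
tab <- summary(fit)[[1]]
print(tab)

# Degrees of freedom: k - 1 for the groups, n - k for the residuals.
df_groups    <- nlevels(toy$origin) - 1
df_residuals <- nrow(toy) - nlevels(toy$origin)
```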

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## origin        2 6.661e+09 3.330e+09    1069 <2e-16 ***
## Residuals   297 9.250e+08 3.115e+06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
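As a consistency check on the table above, each mean square is the sum of squares divided by its degrees of freedom, and the F value is the ratio of the two mean squares. Using the rounded values printed in the table (so the results agree with the printed mean squares only up to rounding):

\[
MS_{\text{origin}} = \frac{6.661 \times 10^{9}}{2} \approx 3.33 \times 10^{9},
\qquad
MS_{\text{residuals}} = \frac{9.250 \times 10^{8}}{297} \approx 3.11 \times 10^{6}
\]

\[
F = \frac{MS_{\text{origin}}}{MS_{\text{residuals}}} = \frac{3.330 \times 10^{9}}{3.115 \times 10^{6}} \approx 1069
\]

The residual degrees of freedom, 297, together with the 2 degrees of freedom for origin, correspond to 300 observations across the three groups.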

Interpret results: