Date: September 10, 2016
Assignment: Data Analysis Project, Inferential Statistics, Coursera

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

# the data file `gss.Rdata` should already be in the working directory
load("gss.Rdata")

Part 1: Data

The data set used in this analysis is the General Social Survey conducted by the National Opinion Research Center. The GSS contains data from 1972 through 2012, with the purpose of “monitoring societal change and studying the growing complexity of the American society.” The aim is to gather data on the attitudes, behaviors, and attributes of contemporary American society, and to compare the United States with other nations. For more information, see http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx. The exact data set used for this assignment is a subset of the full GSS dataset. All missing values have been coded as ‘NA’.

The GSS Codebook can be found at http://gss.norc.org/get-documentation. Appendix A contains a lengthy discussion of the sampling design, which is not a simple random sampling of the population. Certain Primary Sampling Units were chosen in metropolitan and non-metropolitan areas. Block Groups were then selected within the PSUs. Interviewers were instructed to canvass chosen blocks within the BGs, seeking respondents to interview until certain quotas were met. While this is not simple random sampling, the study does use weighting techniques as a corrective measure. Furthermore, the GSS Codebook, in Appendix A, states that “the GSS samples closely resemble distributions reported in the Census and other authoritative sources.” Therefore, I will assume sufficiently random sampling so that generalizations can be made to the general population, keeping in mind the concerns noted above.

Since there is no random assignment of sample subjects, the data cannot be used to answer causation questions.


Part 2: Research question

A current issue in the United States is the state of the economy, specifically the unemployment rate. A recent statistic indicated that 1 in 6 males in the United States is unemployed. See: https://twitter.com/NPR/status/773496188209364992/photo/1. The GSS dataset contains a variable called unemp which records a yes or no answer to the following question asked of each respondent: “At any time during the last ten years, have you been unemployed and looking for work for as long as a month?” Since the GSS data covers several decades, it would be of interest to compare the proportions of ‘Yes’ answers in a recent survey year with a prior year to determine if there has been any change.

This report will compare the proportion of respondents in the most recent survey year, 2012, who reported being unemployed and looking for work for as long as a month at some point in the prior decade, with the corresponding proportion from a decade earlier, survey year 2002.


Part 3: Exploratory data analysis

Before settling on the final research question and survey years to sample, I generally viewed the data with the View(gss) command and reviewed the codebook. I settled on using only the year and unemp variables, but wanted a better understanding of the survey years. The following code will find the unique values of the year variable and plot a histogram of their frequency:

# find unique values of the survey years in the gss$year variable
unique(gss$year)
##  [1] 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987
## [15] 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010
## [29] 2012
# histogram of the frequency that each year appears in gss$year
ggplot(gss, aes(x = year)) + 
    geom_histogram(binwidth = 1, fill = 'steelblue', col = 'black') +
    xlab('Year') +
    ylab('')

I then decided to use the most recent survey year of 2012, and compare it with a decade earlier, 2002. The following script will reduce the large GSS data file down to only the variables needed:

# reduce GSS dataset to just the three columns needed: 'year', 'sex', and 'unemp'
dat <- subset(gss, year == 2002 | year == 2012, select = c(year, sex, unemp))
# check dimensions of new data set and view a summary of the data
dim(dat); summary(dat)
## [1] 4739    3
##       year          sex        unemp     
##  Min.   :2002   Male  :2114   Yes : 757  
##  1st Qu.:2002   Female:2625   No  :1490  
##  Median :2002                 NA's:2492  
##  Mean   :2006                            
##  3rd Qu.:2012                            
##  Max.   :2012

There are quite a few NA values in the unemp variable, so I will remove them, leaving only the ‘Yes’ and ‘No’ answers in the data subset, and then subset further to only the male respondents:

# subset to only the 'Male' respondents, keeping just the 'year' and 'unemp' columns
dat <- subset(dat, sex == 'Male', select = c(year, unemp))
# remove NA values
dat <- na.omit(dat)
# check dimensions of reduced data set
dim(dat)
## [1] 1001    2

The dataset now contains the year categorical variable with only two values, ‘2002’ and ‘2012’, and unemp, also categorical with only two values, ‘Yes’ or ‘No’.

Now that I have the exact subset of the data needed, I want to see a table of this data to get an idea of the number of ‘Yes’ and ‘No’ responses, as well as the proportions:

# produce a table of dat to see the actual numbers of 'Yes' and 'No' responses
table(dat)
##       unemp
## year   Yes  No
##   2002 123 281
##   2012 222 375
# produce a frequency table of the same data, rounding the proportions to two decimal places
round(prop.table(table(dat),1),2)
##       unemp
## year    Yes   No
##   2002 0.30 0.70
##   2012 0.37 0.63

And, finally a bar chart of the data, showing the numbers of sampled respondents for each year, with the proportions of ‘Yes’ and ‘No’ answers:

# changing the year variable to a factor so that the plot function works properly
dat$year  <- as.factor(dat$year)
# plot a bar chart
ggplot(dat, aes(x=year, fill=unemp)) +
    geom_bar(width = 0.6) +
    theme(legend.title=element_blank()) +
    xlab('') +
    ylab('') +
    ggtitle('U.S. males who reported being \n unemployed at some time in the prior ten years') +
    guides(fill=guide_legend(reverse=T)) +
    stat_count(aes(label=..count.., y=0.5*..count..), geom = 'text')

To conclude this EDA section, the following script will calculate the necessary sample statistics (sample size, number of successes, and sample proportions):

# calculate sample sizes for the 2002 and 2012 survey years
n2002 <- sum(dat$year == 2002)
n2012 <- sum(dat$year == 2012)
# calculate number of successes ('Yes' answers) in each sample
n2002_Yes <- sum(dat$year == 2002 & dat$unemp == 'Yes')
n2012_Yes <- sum(dat$year == 2012 & dat$unemp == 'Yes')
# calculate sample proportions for each year
phat2002 <- round((n2002_Yes/n2002), 3)
phat2012 <- round((n2012_Yes/n2012), 3)

From the exploratory data analysis, we can see that the proportion of male respondents who reported unemployment in the ten years prior to the survey rose from \(\hat p_{2002}\) = 0.304, or 30.4% in 2002, to \(\hat p_{2012}\) = 0.372, or 37.2% in 2012. The 2002 sample size was \(n_{2002}\) = 404, and the number of ‘Yes’ answers was 123. The 2012 sample size was \(n_{2012}\) = 597, and the number of ‘Yes’ answers was 222. The next section will compare the two sample proportions to explore whether there is any statistical significance to these measurements of the data.
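
As a quick check only (no new calculation is involved), the values quoted in the preceding paragraph can be printed directly from the objects just computed:

# print the sample sizes, numbers of successes, and sample proportions computed above
c(n_2002 = n2002, yes_2002 = n2002_Yes, phat_2002 = phat2002,
  n_2012 = n2012, yes_2012 = n2012_Yes, phat_2012 = phat2012)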


Part 4: Inference

The final part of this report will examine the two sample proportions to determine what may be inferred about the two populations from 2002 and 2012. Both samples have two categorical variables. The grouping variable is the year, with two values, either “2002” or “2012.” The response variable is the answer to the unemployment question, also with two values, “Yes” or “No.” The “Yes” answer is considered the “success.” To calculate a confidence interval and perform a hypothesis test, the techniques for evaluating categorical variables with two levels will be used.

Confidence Interval

First, a confidence interval will be calculated to estimate the difference in the two population proportions using the sample proportions. We have the following information about the two samples:

Sample size: \(n_{2002}\) = 404; \(n_{2012}\) = 597
No. of successes (‘Yes’): 123; 222
Sample proportions: \(\hat p_{2002}\) = 0.304; \(\hat p_{2012}\) = 0.372

The question asked is: how do the 2002 and 2012 United States male populations compare in respect to the proportions that were unemployed and looking for work for at least one month in the prior ten years?

The parameter of interest is the difference between proportions of all the U.S. male population in 2012 and all the U.S. male population in 2002 that were unemployed and looking for work for at least one month in the prior ten years.

\[p_{2012} - p_{2002}\]

The point estimate is the difference between the proportions of the sampled 2012 population and the sampled 2002 population that were out of work for at least one month in the prior ten years.

\[\hat p_{2012} - \hat p_{2002}\]

The difference between the proportions will be estimated by calculating a confidence interval, which is the point estimate plus or minus the margin of error:

\[\hat p_{2012} - \hat p_{2002} \pm ME\]

\[\hat p_{2012} - \hat p_{2002} \pm z \times SE_{(\hat p_{2012} - \hat p_{2002})}\]

The sample proportion for 2012 is \(\hat p_{2012}\) = 0.372, and for 2002 is \(\hat p_{2002}\) = 0.304.

# set z statistic for a 95% confidence interval
z <- 1.96

To calculate a 95% confidence interval, the z statistic is 1.96.
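
As a quick sanity check (not part of the calculation above), the same critical value can be obtained from the standard normal distribution with qnorm rather than hard-coding 1.96:

# critical value for a two-sided 95% confidence interval from the standard normal
round(qnorm(0.975), 2)
## [1] 1.96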

The standard error for the difference between two proportions is calculated as: \[SE = \sqrt{\frac{\hat p_{2012} (1 - \hat p_{2012})}{n_{2012}} + \frac{\hat p_{2002} (1 - \hat p_{2002})}{n_{2002}}}\]

# calculate standard error
se <- round(sqrt((phat2012 * (1 - phat2012) / n2012) + (phat2002 * (1 - phat2002) / n2002)), 3)

The standard error is \(SE\) = 0.03.

Before continuing, the conditions for inference for comparing two independent proportions will be checked:

First is independence within groups: the observations must be independent within each group. This requires random sampling (discussed in Part 1) and that the 10% condition be met for both samples. The sample sizes of 597 and 404 are certainly less than ten percent of the U.S. male population (the total U.S. population is over 300 million).

Also, between the groups, the two samples must be independent of each other, that is, non-paired. While it is theoretically possible that the same person was interviewed in 2002 and again a decade later in 2012, given the size of the U.S. population and the survey sampling techniques this is very unlikely. Therefore, the independence condition is assumed to be met.

Second is the sample size/skew condition: the samples should meet the success-failure condition using the observed successes and failures:

\[n_1 \hat p_1 \geq 10; n_1 (1 - \hat p_1) \geq 10\] \[n_2 \hat p_2 \geq 10; n_2 (1 - \hat p_2) \geq 10\]

Using the numbers previously calculated:
597 * 0.372 ≈ 222 successes, which is > 10; and 597 * 0.628 ≈ 375 failures, which is also > 10.
404 * 0.304 ≈ 123 successes, which is > 10; and 404 * 0.696 ≈ 281 failures, which is also > 10.
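
The same check can be done in R with the observed counts already computed (a brief sketch; all four values must be at least 10):

# observed successes and failures for each year; all should be >= 10
c(n2012_Yes, n2012 - n2012_Yes, n2002_Yes, n2002 - n2002_Yes) >= 10
## [1] TRUE TRUE TRUE TRUE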

Now that the conditions are known to be met, the confidence interval is calculated as:

\[(\hat p_{2012} - \hat p_{2002}) \pm z \times SE_{(\hat p_{2012} - \hat p_{2002})}\]

\[(0.372 - 0.304) \pm 1.96 \times 0.03\]

\[0.068 \pm 0.059\]

# calculate confidence interval upper and lower values
ciupper <- round((phat2012 - phat2002) + (z * se), 2)
cilower <- round((phat2012 - phat2002) - (z * se), 2)

The confidence interval is: (0.01, 0.13)

To double check the manual calculation of confidence interval, the following script will use the same data subset and the inference function from the statsr package:

# use inference function to calculate the confidence interval
ci <- inference(unemp, year, dat, type = 'ci', statistic = 'proportion', success = 'Yes', method = 'theoretical', order = c('2012', '2002'))
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_2012 = 597, p_hat_2012 = 0.3719
## n_2002 = 404, p_hat_2002 = 0.3045
## 95% CI (2012 - 2002): (0.0081 , 0.1267)

# pull out the upper and lower CI values from the inference result
cilower1 <- round(ci$CI[1], 2)
ciupper1 <- round(ci$CI[2], 2)

The inference function produces an exploratory data plot of the sample sizes and proportions for survey years 2002 and 2012, which matches the manual calculations.

The 95% confidence interval from the inference function, rounded to two decimals places, also matches the manual calculation: (0.01, 0.13).

Confidence Interval Conclusion: What does this mean? We are 95% confident that the proportion of United States males in 2012 that were unemployed and looking for work for at least one month in the prior decade is between 1% and 13% higher than the proportion of United States males in 2002 that were unemployed and looking for work for at least one month in the prior decade.

Hypothesis Test

Next, a hypothesis test will be performed with the same data. The null hypothesis is that there is no difference in the 2002 and 2012 population proportions that were unemployed and looking for work for at least one month in the prior ten years. The alternative hypothesis is that the 2012 proportion is greater than the 2002 proportion.

\[H_0: p_{2012} = p_{2002}\] \[H_A: p_{2012} > p_{2002}\]

Before continuing, the conditions for performing a hypothesis test will be checked. To check the conditions for a hypothesis test of two proportions, the expected proportion of successes is used. Since this is not known, a pooled proportion will be calculated and used:

# calculate pooled proportion
pPool <- round((n2012_Yes + n2002_Yes) / (n2012 + n2002), 3)

The pooled proportion is \(\hat p_{Pool} = 0.345\).

As discussed above, because our sample sizes are far less than 10% of the male population in the United States and the samples are not paired, we can assume independence both within and between groups.

To check the sample size/skew condition, we use the following formula:

\[n_1 \hat p_{Pool} \geq 10; n_1 (1 - \hat p_{Pool}) \geq 10\] \[n_2 \hat p_{Pool} \geq 10; n_2 (1 - \hat p_{Pool}) \geq 10\]

Using the numbers previously calculated:
597 * 0.345 ≈ 206 successes, which is > 10; and 597 * 0.655 ≈ 391 failures, which is also > 10.
404 * 0.345 ≈ 139 successes, which is > 10; and 404 * 0.655 ≈ 265 failures, which is also > 10.
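
As with the confidence interval, this can be confirmed in R using the pooled proportion (a brief sketch; all four expected counts must be at least 10):

# expected successes and failures under the pooled proportion; all should be >= 10
c(n2012 * pPool, n2012 * (1 - pPool), n2002 * pPool, n2002 * (1 - pPool)) >= 10
## [1] TRUE TRUE TRUE TRUE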

The success-failure condition is met, so we can assume the sampling distribution of the difference between the 2012 and 2002 proportions is nearly normal. Therefore, all the test conditions are met.

Next, the standard error will be calculated using \(\hat p_{Pool}\):

# calculate standard error using the pooled proportion
se_ht <- round(sqrt((pPool * (1 - pPool) / n2012) + (pPool * (1 - pPool) / n2002)), 3)

The standard error is \(SE = 0.031\).

Now we are ready to conduct the hypothesis test, at a 5% significance level, evaluating if 2012 males and 2002 males in the United States are equally likely to answer the survey question ‘Yes’ about whether they were unemployed for at least one month in the prior decade.

The point estimate is:

\[\hat p_{2012} - \hat p_{2002} = 0.372 - 0.304 = 0.068\]

# set the null value to zero as the hypothesis test assumes that the difference in 2012 and 2002 proportions is zero
nullvalue <- 0

The null value is \(null = 0\).

The z statistic is calculated as the point estimate minus the null value, divided by the standard error:

# calculate the point estimate, which is the difference in observed proportions
pe <- phat2012 - phat2002
# calculate z statistic using the pooled standard error
z_ht <- round(((pe - nullvalue) / se_ht), 3)

The z statistic is 2.194.

# calculate p value using the z statistic; calculate upper tail value as the hypothesis test was framed as 'greater than'
p_value <- round(pnorm(abs(z_ht), lower.tail = FALSE), 3)

Using the pnorm function, the p-value for this z score is 0.014.

Use the inference function from the statsr package to check the manual calculation:

# use inference function for the hypothesis test
ht <- inference(unemp, year, dat, type = 'ht', statistic = 'proportion', success = 'Yes', method = 'theoretical', alternative = 'greater', order = c('2012', '2002'), show_eda_plot = FALSE, null = 0)
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_2012 = 597, p_hat_2012 = 0.3719
## n_2002 = 404, p_hat_2002 = 0.3045
## H0: p_2012 =  p_2002
## HA: p_2012 > p_2002
## z = 2.2015
## p_value = 0.0139

# get p-value from the result of the inference function
ht_p <- round(ht$p, 3)

The graph produced by the inference function shows the null distribution of the hypothesis test with a line at the p-value. The p-value of 0.014 from the inference function matches the manually calculated p-value of 0.014 (the small difference in the z statistics, 2.2015 versus 2.194, reflects rounding of the proportions and standard error in the manual calculation).

Hypothesis Test Conclusion: What does this mean? Comparing the calculated p-value of 0.014 with the 5% significance level, we find the p-value is less than the significance level, and conclude there is convincing evidence that the proportion of the 2012 U.S. male population that was unemployed for at least a month in the prior decade is greater than the proportion of the 2002 U.S. male population that was unemployed for at least one month in the prior decade. Because of the low p-value, we reject the null hypothesis that there was no difference in proportions.

As a final note, the confidence interval calculated above, (0.01, 0.13), does not include zero, which is consistent with the result of the hypothesis test that the null hypothesis (that the difference in the two proportions is zero) should be rejected.
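
This can be confirmed with a one-line check on the values computed earlier:

# the lower bound of the 95% confidence interval is above zero
cilower > 0
## [1] TRUE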