Statistical Inference with the GSS Data

Setup

Load packages

library(tidyverse)

library(moments)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

This dataset is subject to several potential sources of bias: some groups may remain underrepresented even after weighting, respondents may not accurately recall the information asked of them in the interview, and respondents may alter their answers because they know they are being interviewed. Any bias present can distort the inferences below and increase the possibility of Type I and Type II errors, so it is essential to minimize it as much as possible.

Part 2: Research question

In total, four areas will be considered, one of each of the following types: one numerical variable and one categorical variable of two levels, one numerical variable and one categorical variable of more than two levels, two categorical variables of two levels each, and two categorical variables of more than two levels each.

Area 1: With the advent of technological advances, a possible area of interest is the effect of age on a respondent’s likelihood of completing an in-person or phone-based interview. Is there a significant difference in the average age of respondents by sex?

Area 2: The data set contains various possible responses for political affiliation. Based on this data, is there any difference in the average number of children among Republicans, Democrats, and Independents?

Area 3: For various reasons, individuals may be unable to work and require government aid. Is the proportion of males who received aid greater than the proportion of females who received aid?

Area 4: In the history of the United States, minority groups have been experimented upon without their knowledge, and today this may be linked to a distrust of medicine. Is there a significant difference by race in respondents’ confidence in those who lead medical institutions?

Part 3: Inference

Area 1

The null hypothesis is that there is no difference in the population (here, the nation) average age of male and female respondents, and the alternative hypothesis is that there is a difference in the population average age of male and female respondents. The GSS survey data records the sex variable as “Male” or “Female” and the age variable as a whole number. Since the mean of a numerical variable is being compared across two independent groups, this requires a two-sample t-test. The test has two tails because the alternative is a difference in either direction (“greater than” or “less than”).

There are several conditions to check to confirm that a two-sample t-test is the appropriate choice: independence of sample data, data obtained from a random sample, normally distributed sample data in both groups, and that the sample variances for each group are equal. Note that there are 202 respondents who did not provide an age (62 male respondents and 140 female respondents). Compared to the size of each group, this is a very small number and won’t significantly influence the results.

From the sampling methods, we know that the sample data was independently and randomly obtained.

str(gss$sex)

##  Factor w/ 2 levels "Male","Female": 2 1 2 2 2 1 1 1 2 2 ...

str(gss$age)

##  int [1:57061] 23 70 48 27 61 26 28 27 21 30 ...

gss %>% count(age)

##    age    n
## 1   18  206
## 2   19  777
## 3   20  818
## 4   21  930
## 5   22  970
## 6   23 1130
## 7   24 1109
## 8   25 1231
## 9   26 1216
## 10  27 1253
## 11  28 1314
## 12  29 1177
## 13  30 1289
## 14  31 1200
## 15  32 1291
## 16  33 1232
## 17  34 1262
## 18  35 1247
## 19  36 1230
## 20  37 1205
## 21  38 1221
## 22  39 1087
## 23  40 1154
## 24  41 1092
## 25  42 1076
## 26  43 1102
## 27  44 1043
## 28  45  994
## 29  46 1002
## 30  47  976
## 31  48  989
## 32  49 1011
## 33  50  918
## 34  51  932
## 35  52  897
## 36  53  874
## 37  54  862
## 38  55  781
## 39  56  860
## 40  57  762
## 41  58  810
## 42  59  764
## 43  60  795
## 44  61  709
## 45  62  741
## 46  63  721
## 47  64  631
## 48  65  696
## 49  66  631
## 50  67  683
## 51  68  634
## 52  69  583
## 53  70  610
## 54  71  536
## 55  72  548
## 56  73  488
## 57  74  515
## 58  75  441
## 59  76  439
## 60  77  409
## 61  78  368
## 62  79  320
## 63  80  284
## 64  81  285
## 65  82  248
## 66  83  220
## 67  84  192
## 68  85  168
## 69  86  151
## 70  87  126
## 71  88   99
## 72  89  294
## 73  NA  202

gss %>% count(sex)

##      sex     n
## 1   Male 25146
## 2 Female 31915

gss %>% filter(is.na(age)) %>% count(sex)

##      sex   n
## 1   Male  62
## 2 Female 140

male_age <- gss$age[gss$sex == "Male" & !is.na(gss$age)]
female_age <- gss$age[gss$sex == "Female" & !is.na(gss$age)]

ggplot(gss) + 
  geom_histogram(aes(x = age)) + 
  facet_wrap(~sex) +
  labs(
    title = "Age of Respondents by Sex",
    x = "Respondent Age",
    y = "Total"
  ) +
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 202 rows containing non-finite outside the scale range
## (`stat_bin()`).

Because the age data is skewed, a two-sample t-test may not be an appropriate choice, so start by checking for normality. male_age has a skewness of 0.4469, and female_age has a skewness of 0.4339. Since both values are between 0 and 0.5, this data is approximately symmetric. However, for positively and moderately skewed data, the square-root transformation sqrt(x) brings the distribution closer to the normal distribution, so this technique is employed. Histograms of both groups visually confirm that the transformed data is approximately normal. Due to the sampling methods of the survey, the data is independent.

For the final condition, the F-test below (var.test) estimates the ratio of the group variances at 0.931 with a p-value of 2.772e-09, so the group variances are treated as unequal. In this case, the Welch approximation to the degrees of freedom is used instead of the t-test pooled estimate. This approximation takes the unequal variances into account, which results in a lower degrees of freedom value but a test that remains valid without the equal-variance assumption. This is the default setting for the base-R t.test function. The logic here is that if the variances are equal, a single pooled variance factors out of the standard error, leaving 1/n1 + 1/n2 under the square root; this is where the numerator value of 1 in the pooled calculation comes from. If the variances are unequal, nothing factors out, so the numerator values are the respective sample variances: s1^2/n1 + s2^2/n2.

skewness(male_age, na.rm = TRUE)

## [1] 0.4468543

skewness(female_age, na.rm = TRUE)

## [1] 0.4339224
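As a rough illustration of the skewness check and the effect of the square-root transformation, the sketch below computes moment-based skewness by hand on simulated right-skewed values. The helper sample_skewness and the simulated data are assumptions made for illustration (they are not part of the analysis); the formula is the usual moment-based definition, which should agree with the skewness function used above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# (sample_skewness is a hypothetical helper, not a package function.)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
# Simulated right-skewed values (an assumption; these are not the GSS ages)
x <- rgamma(5000, shape = 2, rate = 0.05)

skew_raw  <- sample_skewness(x)
skew_sqrt <- sample_skewness(sqrt(x))

# The square-root transformation pulls in the long right tail
c(raw = skew_raw, sqrt = skew_sqrt)
```

On data like this the transformed skewness should be noticeably closer to 0, which is the behavior the transformation is relied on for here.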

# Same histogram as before, now on the square-root scale
ggplot(gss %>% mutate(age = sqrt(age))) +
  geom_histogram(aes(x = age)) +
  facet_wrap(~sex) +
  labs(
    title = "Square Root of Respondent Age by Sex",
    x = "Square Root of Respondent Age",
    y = "Total"
  ) +
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 202 rows containing non-finite outside the scale range
## (`stat_bin()`).

male_age <- sqrt(male_age)
female_age <- sqrt(female_age)

var.test(male_age, female_age)

## 
##  F test to compare two variances
## 
## data:  male_age and female_age
## F = 0.93135, num df = 25083, denom df = 31774, p-value = 2.772e-09
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9098136 0.9534278
## sample estimates:
## ratio of variances 
##          0.9313502
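The difference between the Welch and pooled calculations can be sketched on simulated data. The two groups below are made up for illustration (an assumption, not the GSS ages); the by-hand Welch statistic and degrees of freedom are compared against what t.test returns with its default settings.

```r
set.seed(42)
# Simulated groups with different sizes and spreads (assumption; not GSS data)
x <- rnorm(120, mean = 6.6, sd = 1.25)
y <- rnorm(150, mean = 6.7, sd = 1.30)

n1 <- length(x)
n2 <- length(y)
v1 <- var(x)
v2 <- var(y)

# Welch: each group keeps its own variance in the standard error
se_welch <- sqrt(v1 / n1 + v2 / n2)
t_welch  <- (mean(x) - mean(y)) / se_welch

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Pooled: one common variance factors out, leaving 1/n1 + 1/n2
v_pooled  <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled <- sqrt(v_pooled * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

welch <- t.test(x, y)  # var.equal = FALSE is the default

c(t_by_hand = t_welch, t_from_t.test = unname(welch$statistic))
c(df_welch = df_welch, df_pooled = df_pooled)
```

The Welch degrees of freedom never exceed the pooled value n1 + n2 - 2, which is the "lower degrees of freedom" mentioned above.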

Now to consider the theory of the test. In the sample, male respondents are on average 44.95 years old, and female respondents are on average 46.29 years old. The average square-root-transformed ages for males and females are also provided. The t-test is evaluated using the p-value method at a 95% confidence level, which corresponds to a significance level of 100% - 95% = 5%. The significance level is the probability of rejecting the null hypothesis when it is actually true, so it sets how strong the evidence needs to be before concluding that there is a significant difference in the average age of respondents by sex. This should be a small percentage, and 5% is the standard. The test calculates a p-value using specific formulas based on the sample sizes, sample means, and sample standard deviations; if the p-value is smaller than the significance level, then the null hypothesis is rejected in favor of the alternative hypothesis. Because the base-R function t.test is called with a two-sided alternative, the reported p-value already accounts for both tails, so it is compared directly with 0.05 (equivalently, 0.025 in each tail). The p-value is much smaller than 0.05. The test results also contain a 95% confidence interval. Its interpretation is that about 95% of the intervals constructed this way from repeated random samples would be expected to contain the true difference in the population means. Here the interval for the difference in mean sqrt(age), male minus female, runs from -0.1130 to -0.0704; it does not contain 0, which agrees with the test result.

Based on the test results, the null hypothesis is rejected: there is a statistically significant difference between the average (square-root-transformed) age of male and female respondents.

gss %>% 
  group_by(sex) %>% 
  summarize(`average age` = mean(age, na.rm = TRUE))

## # A tibble: 2 × 2
##   sex    `average age`
##   <fct>          <dbl>
## 1 Male            45.0
## 2 Female          46.3

mean(male_age)

## [1] 6.58424

mean(female_age)

## [1] 6.675937

t.test(male_age, female_age, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  male_age and female_age
## t = -8.4458, df = 54637, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.11297786 -0.07041751
## sample estimates:
## mean of x mean of y 
##  6.584240  6.675937
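To make the link between the reported t-statistic, the p-value, and the confidence interval concrete, the sketch below recomputes the latter two from the numbers printed in the output above. The variable names are illustrative, and the standard error is recovered from the rounded output, so the results match only to the printed precision.

```r
# Values copied from the t.test output above
t_stat    <- -8.4458
df_welch  <- 54637
mean_diff <- 6.584240 - 6.675937   # male minus female, on the sqrt(age) scale

# Two-sided p-value: the area in both tails, so no further halving of
# the significance level is needed when comparing against 0.05
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)

# Standard error recovered from t = difference / SE, then the 95% interval
se <- mean_diff / t_stat
ci <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se

p_two_sided < 0.05
round(ci, 4)
```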

Area 2

Note that the variable partyid contains 9 distinct values, two of which are “Other Party” and NA. Recode partyid to represent Republican, Democrat, and Independent. Although it looks redundant, case_when needs the str_detect rule for “Other Party”; otherwise those responses fall through to the default value of NA, changing the data set. case_when also changes the partyid variable from a factor to a character vector, so case_when is wrapped in as_factor to undo this. Next, count partyid to confirm that the appropriate changes were made. 14,553 respondents identified as Republican, 21,157 respondents identified as Democrat, and 20,163 respondents identified as Independent. childs represents the number of children a respondent has; 181 respondents did not answer this. In the sample groups, Independent respondents on average reported having 1.81 children, Democrats on average reported having 2.09 children, and Republicans on average reported having 1.96 children. From the boxplots below, the spread of the data and the medians of the groups appear similar, possibly identical.

gss <- gss %>%
  mutate(
    partyid = as_factor(case_when(
      str_detect(partyid, "Demo") ~ "Democrat",
      str_detect(partyid, "Repu") ~ "Republican",
      str_detect(partyid, "Ind")  ~ "Independent",
      str_detect(partyid, "Oth")  ~ "Other Party"
    ))
  )

gss %>% count(partyid)

##       partyid     n
## 1 Independent 20163
## 2    Democrat 21157
## 3  Republican 14553
## 4 Other Party   861
## 5        <NA>   327
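The fall-through behavior of case_when can be seen on a small toy example. The labels below are hypothetical stand-ins (an assumption; the real partyid levels may be worded differently), and the rule for “Other Party” is left out on purpose to show what happens without it.

```r
library(dplyr)
library(stringr)
library(forcats)

# Hypothetical labels for illustration only
toy <- tibble(partyid = c("Strong Democrat", "Independent", "Other Party",
                          "Strong Republican", NA))

recoded <- toy %>%
  mutate(
    party3 = as_factor(case_when(
      str_detect(partyid, "Demo") ~ "Democrat",
      str_detect(partyid, "Repu") ~ "Republican",
      str_detect(partyid, "Ind")  ~ "Independent"
      # no rule for "Other Party": it matches nothing and becomes NA
    ))
  )

recoded
```

Without the fourth rule, “Other Party” becomes indistinguishable from a true missing answer, which is why the recoding above keeps it as its own category.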

gss %>% 
  filter(partyid %in% c("Republican", "Democrat", "Independent")) %>%
  group_by(partyid) %>%
  summarize(`average number of Children` = mean(childs, na.rm = TRUE))

## # A tibble: 3 × 2
##   partyid     `average number of Children`
##   <fct>                              <dbl>
## 1 Independent                         1.81
## 2 Democrat                            2.09
## 3 Republican                          1.96

gss %>% 
  filter(partyid %in% c("Republican", "Democrat", "Independent")) %>%
  ggplot() +
  geom_boxplot(aes(x = partyid, y = childs)) +
  labs(
    title = "Number of Children by Political Affiliation",
    x = "Political Affiliation",
    y = "Number of Children"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 141 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

To test whether these differences reflect the underlying population, an ANOVA (analysis of variance) test needs to be conducted. There are three conditions for the ANOVA test: the sample data is independently and randomly obtained, the data within each group is approximately normally distributed, and the level of variance in each group is roughly equal.

From the sampling methods, we know that the sample data was independently and randomly obtained.

The histograms below show that the data is right-skewed. A quick check shows that the skewness of each group is close to 1, so the groups are moderately skewed, again calling for the square-root transformation sqrt(x) on childs. Visually, the transformed data more closely follows the normal distribution.

gss %>%
  filter(partyid %in% c("Republican", "Democrat", "Independent")) %>%
  ggplot() +
  geom_histogram(aes(x = childs)) +
  facet_wrap(~partyid) +
  labs(
    title = "Reported Number of Children by Political Affiliation",
    x = "Reported Number of Children",
    y = "Total"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 141 rows containing non-finite outside the scale range
## (`stat_bin()`).

rep_Children <- gss$childs[gss$partyid == "Republican"]
dem_Children <- gss$childs[gss$partyid == "Democrat"]
ind_Children <- gss$childs[gss$partyid == "Independent"]

skewness(rep_Children, na.rm = TRUE)

## [1] 0.9398297

skewness(dem_Children, na.rm = TRUE)

## [1] 1.000218

skewness(ind_Children, na.rm = TRUE)

## [1] 1.075862

gss <- gss %>% mutate(childs = sqrt(childs))

# childs now holds square-root values, so the labels say so
gss %>%
  filter(partyid %in% c("Republican", "Democrat", "Independent")) %>%
  ggplot() +
  geom_histogram(aes(x = childs)) +
  facet_wrap(~partyid) +
  labs(
    title = "Square Root of Reported Number of Children by Political Affiliation",
    x = "Square Root of Reported Number of Children",
    y = "Total"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 141 rows containing non-finite outside the scale range
## (`stat_bin()`).

It is evident that the level of variance in each group is roughly the same. This satisfies the last of the three conditions.

gss %>% 
    filter(partyid %in% c("Republican", "Democrat", "Independent")) %>%
    group_by(partyid) %>%
    summarize(`group variance` = var(childs, na.rm = TRUE))

## # A tibble: 3 × 2
##   partyid     `group variance`
##   <fct>                  <dbl>
## 1 Independent            0.636
## 2 Democrat               0.653
## 3 Republican             0.603

The ANOVA test works by dividing the variance between groups (variation of the group means around the overall mean) by the variance within groups (variation of the group values around their group mean) to create the F-statistic. The null hypothesis is that all group means are equal, and the alternative hypothesis is that at least one group mean differs from the rest. The aov function computes the F-statistic and its p-value, which is compared here with a 5% significance level. For ease of use of the aov function, create a subset of the data called gss_party containing data for Republicans, Democrats, and Independents. The p-value is far below 0.05, so the null hypothesis is rejected, but this does not give any information about which group means are different.

gss_party <- gss %>% filter(partyid %in% c("Republican", "Democrat", "Independent"))

summary(aov(childs ~ partyid, gss_party))

##                Df Sum Sq Mean Sq F value Pr(>F)    
## partyid         2    138   68.82   108.6 <2e-16 ***
## Residuals   55729  35329    0.63                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 141 observations deleted due to missingness
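The F-statistic in the table can be reproduced from the other columns. The sketch below uses the rounded values printed in the summary above, so it matches the output only to the printed precision; the variable names are illustrative.

```r
# Values copied from the aov summary above
ms_between <- 68.82    # Mean Sq for partyid
ss_within  <- 35329    # Sum Sq for Residuals
df_between <- 2
df_within  <- 55729

# Within-group mean square, then the ratio of the two mean squares
ms_within <- ss_within / df_within
f_stat    <- ms_between / ms_within

# The F-test is upper-tailed: a large F means the group means differ
# by more than the within-group noise would explain
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 1)
p_value < 0.05
```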

To determine which group means differ, the Tukey HSD (Honestly Significant Difference) test is needed. This test has the same conditions as the ANOVA test. The Tukey HSD test splits the groups into pairs, takes the absolute value of the difference between each pair of group means, and divides it by a standard error built from the Mean Square Error of the ANOVA test and the group sizes, closely resembling the theory of a t-test. The reported p-values are already adjusted for the three comparisons, so they are compared directly with 0.05; no halving is needed, since the ANOVA F-test and the Tukey test are upper-tailed. From the Tukey HSD test, all three pairings of political affiliations have an adjusted p-value far below 0.05, so every pair of group means differs significantly.

TukeyHSD(aov(childs ~ partyid, gss_party))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = childs ~ partyid, data = gss_party)
## 
## $partyid
##                               diff         lwr         upr    p adj
## Democrat-Independent    0.11390851  0.09551857  0.13229846 0.00e+00
## Republican-Independent  0.07777236  0.05744880  0.09809592 0.00e+00
## Republican-Democrat    -0.03613615 -0.05625456 -0.01601774 7.59e-05
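The pairwise calculation can be sketched on simulated data with unequal group sizes, using the Tukey-Kramer form of the standard error. The groups a, b, and c are made up for illustration (an assumption, not the GSS data); the by-hand adjusted p-value for one pair is compared with the TukeyHSD result.

```r
set.seed(7)
# Simulated three-group data with unequal sizes (assumption; not GSS data)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(40, 55, 70))),
  y     = c(rnorm(40, 1.0), rnorm(55, 1.3), rnorm(70, 1.1))
)

fit   <- aov(y ~ group, data = sim)
mse   <- deviance(fit) / df.residual(fit)   # within-group mean square
means <- tapply(sim$y, sim$group, mean)
n     <- tapply(sim$y, sim$group, length)

# Studentized range statistic for the pair b vs a (Tukey-Kramer form)
se_pair <- sqrt(mse / 2 * (1 / n[["a"]] + 1 / n[["b"]]))
q_stat  <- abs(means[["b"]] - means[["a"]]) / se_pair

# Adjusted p-value from the upper tail of the studentized range distribution
p_hand  <- ptukey(q_stat, nmeans = 3, df = df.residual(fit), lower.tail = FALSE)
p_tukey <- TukeyHSD(fit)$group["b-a", "p adj"]

c(by_hand = p_hand, TukeyHSD = p_tukey)
```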

Area 3

Out of all respondents, 25,146 identified as male and 31,915 identified as female (taken from the variable sex). The variable govaid contains responses to the question “Did you ever - because of sickness, unemployment, or any other reason - receive anything like welfare, unemployment insurance, or other aid from government agencies?” These responses can be “Yes,” “No,” “DK” (don’t know), or NA (no answer was given). There were 4,325 “Yes” responses, 7,760 “No” responses, and 44,976 NA responses to govaid. Since 12,085 respondents still answered “Yes” or “No,” this question can still be explored.

gss %>% count(sex)

##      sex     n
## 1   Male 25146
## 2 Female 31915

gss %>% count(govaid)

##   govaid     n
## 1    Yes  4325
## 2     No  7760
## 3   <NA> 44976

Construct a table of responses with rows representing sex and columns representing the “Yes” and “No” responses to govaid. In total, 12,085 respondents answered govaid, comprising 5,388 male respondents and 6,697 female respondents. 2,090 male respondents answered govaid “Yes,” and 3,298 male respondents answered govaid “No”; 2,235 female respondents answered govaid “Yes,” and 4,462 female respondents answered govaid “No.”

table(gss$sex, gss$govaid)

##         
##           Yes   No
##   Male   2090 3298
##   Female 2235 4462

To answer this, a two-proportion z-test is needed to test for a difference in proportions, since the recorded answers are “Male” or “Female” and “Yes” or “No” (in other words, there is no numerical data aside from the counts of each response type). Two main conditions need to be met to use this test: independence, and the success-failure condition for each group (sample size * proportion of successes >= 10 and sample size * proportion of failures >= 10). The latter is a requirement for the normal approximation to the binomial distribution, which the z-test relies on.

Independence is established due to the sampling methods of this survey. The groups in this survey are independent of each other since, for the purposes of this survey, each respondent is recorded as either male or female, and govaid has only a “Yes” or “No” option present for those who answered.

The success-failure condition is important because it ensures that the sample is large enough for the sampling distribution of the sample proportion to be approximated by the normal distribution. To verify the success-failure condition, define a success as an individual answering govaid as “Yes.” Using the above table, it is clear that this condition is met:

table(gss$sex, gss$govaid)

##         
##           Yes   No
##   Male   2090 3298
##   Female 2235 4462

(2090/5388) * 5388 >= 10

## [1] TRUE

(3298/5388) * 5388 >= 10

## [1] TRUE

(2235/6697) * 6697 >= 10

## [1] TRUE

(4462/6697) * 6697 >= 10

## [1] TRUE

The null hypothesis is that there is no difference in the population proportion (the proportion of successes) between the two groups, and the alternative hypothesis is that the proportion for males is greater than the proportion for females. As in the previous areas, a 95% confidence level and a 5% significance level are chosen. The z-test here is one-sided, and its p-value is compared with 0.05. To complete this, the prop.test function is used. This takes the number of successes per group, the sample size of each group, and the type of alternative hypothesis (here the alternative hypothesis is “male proportion is greater than female proportion”, so “greater” is chosen). prop.test reports the statistic as X-squared, which for two groups corresponds to the square of the z-statistic, here with a continuity correction applied. The p-value is much smaller than 0.05, so the null hypothesis is rejected: the data support the claim that the proportion of males who received government aid is greater than the proportion of females who received government aid.

prop.test(x = c(2090, 2235), n = c(5388, 6697), alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(2090, 2235) out of c(5388, 6697)
## X-squared = 37.887, df = 1, p-value = 3.747e-10
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.03954131 1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.3878990 0.3337315
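The connection between the z-statistic and the X-squared value can be sketched from the counts in the table. This is a by-hand version under the usual pooled-proportion formula; the variable names are illustrative, and correct = FALSE is used in the comparison because the by-hand z-statistic has no continuity correction.

```r
# Counts copied from the table above: "Yes" answers and group sizes
yes <- c(male = 2090, female = 2235)
n   <- c(male = 5388, female = 6697)

p_hat  <- yes / n
p_pool <- sum(yes) / sum(n)   # pooled proportion under the null hypothesis

# Standard error of the difference under the null, then the z-statistic
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z  <- unname((p_hat["male"] - p_hat["female"]) / se)

# One-sided p-value for "male proportion is greater"
p_value <- pnorm(z, lower.tail = FALSE)

# Without the continuity correction, X-squared from prop.test equals z^2
uncorrected <- prop.test(x = unname(yes), n = unname(n),
                         alternative = "greater", correct = FALSE)

c(z_squared = z^2, X_squared = unname(uncorrected$statistic))
c(p_by_hand = p_value, p_prop.test = uncorrected$p.value)
```

The corrected statistic reported above (37.887) is slightly smaller than the uncorrected one, which is the expected effect of the continuity correction.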

Area 4

Respondents could answer the race variable as “White,” “Black,” or “Other.” Note that approximately 81% of respondents identified as “White” (46,350 “White”, 7,926 “Black”, and 2,785 “Other”), so there may be some inaccuracy due to underrepresentation of minorities in the survey. Bias will not be taken into account here as it is beyond the scope of this project and no causal conclusions are being drawn from these examples. conmedic refers to the recorded answers to the question “I am going to name some institutions in this country. As far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?” 17,931 answered “A Great Deal,” 17,159 answered “Only Some,” 3,222 answered “Hardly Any,” and 18,749 did not provide a response.

gss %>% count(race)

##    race     n
## 1 White 46350
## 2 Black  7926
## 3 Other  2785

gss %>% count(conmedic)

##       conmedic     n
## 1 A Great Deal 17931
## 2    Only Some 17159
## 3   Hardly Any  3222
## 4         <NA> 18749

Consider a table of the data. It appears that the responses are split almost equally between “A Great Deal” and “Only Some.” To determine if there is a relationship between race and confidence in medical leadership, a Chi-square independence test is needed. This test is used when both variables are categorical and the only numerical data is the count of each answer type. The conditions of this test are independence (the sample is random and each case only contributes to one cell in the table) and that each cell has at least 5 expected cases.

The first condition is fulfilled by the sampling methods and by the structure of the contingency table: each respondent could only choose one response, so each case contributes to exactly one cell. The expected value of each cell in a Chi-square test is calculated by multiplying its row total by its column total and then dividing the result by the sample size. In total, 38,312 individuals answered conmedic. Rather than calculate everything by hand, think through this conceptually. For the expected value of a cell to be at least 5, the product x of its row and column totals must satisfy x/38,312 >= 5, which means that x >= 191,560. The smallest product belongs to the cell for “Other” and “Hardly Any”: its row total is 747 + 745 + 135 = 1,627 and its column total is 2,608 + 479 + 135 = 3,222, giving 1,627 * 3,222 = 5,242,194, which is well above 191,560. Every other cell has larger totals, so the condition holds for all cells.

chis_sq_matrix <- as.matrix(table(gss$race, gss$conmedic))
chis_sq_matrix

##        
##         A Great Deal Only Some Hardly Any
##   White        14831     13943       2608
##   Black         2353      2471        479
##   Other          747       745        135

It’s not always possible to verify this conceptually, so make an expected value matrix. Make an empty matrix of the same dimensions and use a for loop to calculate the expected value. Again, 38,312 is the total number of answers represented in the table above.

chis_sq_matrix <- as.matrix(table(gss$race, gss$conmedic))
expected_value <- matrix(data = NA, nrow = 3, ncol = 3)

# Expected count for cell (i, j): row i total * column j total / sample size
for (j in 1:3) {
  for (i in 1:3) {
    expected_value[i, j] <- (sum(chis_sq_matrix[i, ]) * sum(chis_sq_matrix[, j])) / 38312
  }
}

expected_value

##            [,1]       [,2]      [,3]
## [1,] 14687.5820 14055.2239 2639.1941
## [2,]  2481.9402  2375.0829  445.9769
## [3,]   761.4778   728.6932  136.8290
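The same expected values can be obtained without a loop. The sketch below retypes the observed counts from the table above (so it runs without the data file) and uses an outer product of the row and column totals; the object names observed and expected are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(14831, 13943, 2608,
     2353,  2471,  479,
      747,   745,  135),
  nrow = 3, byrow = TRUE,
  dimnames = list(race     = c("White", "Black", "Other"),
                  conmedic = c("A Great Deal", "Only Some", "Hardly Any"))
)

# Row total * column total / grand total, for every cell at once
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 4)

# The condition for the test: every expected count is at least 5
all(expected >= 5)
```

chisq.test also stores this matrix in the expected component of its result, which offers another way to check the condition.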

Now that the test conditions are established, consider the Chi-square independence test. This test compares the observed and expected frequencies (numbers in cells) to determine if there is a relationship between the variables. The null hypothesis is that there is no relationship between race and conmedic; the alternative hypothesis is that there is a relationship between race and conmedic. Again, a 95% confidence level and a 5% significance level are being used. There is a bit to unpack here since the results are not as self-explanatory as the other test results.

The Chi-square statistic is calculated by summing the following quantity over all cells: the squared difference between the observed and expected values, (chis_sq_matrix - expected_value)^2, divided by the expected value. Doing this for the tables above gives us 16.347. The Chi-square test uses a critical value to determine if the null hypothesis should be rejected. The test is upper-tailed: only large values of the statistic, meaning large gaps between the observed and expected counts, count as evidence against the null hypothesis, so there is a single critical value. We use the Chi-square distribution with the significance level and the degrees of freedom (three choices per variable, and you always subtract one; multiply them together to get (3-1) * (3-1) = 4 degrees of freedom). If the Chi-square statistic is greater than 9.487729, then the null hypothesis is rejected; the lower-tail quantile 0.710723 computed below is shown only for reference and is not a rejection threshold. Since 16.347 > 9.487729, the null hypothesis is rejected. The p-value agrees: 0.002587 is smaller than alpha = 0.05, with no halving needed. Based on this data, there is evidence of an association between race and conmedic, although the observational nature of the survey means that no causal relationship can be concluded.

chisq.test(gss$race, gss$conmedic)

## 
##  Pearson's Chi-squared test
## 
## data:  gss$race and gss$conmedic
## X-squared = 16.347, df = 4, p-value = 0.002587

qchisq(0.05, 4)

## [1] 0.710723

qchisq(0.05, 4, lower.tail = FALSE)

## [1] 9.487729
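The statistic, critical value, and p-value can be tied together in one short by-hand calculation. The sketch retypes the observed counts from the contingency table (so it runs without the data file); the object names are illustrative.

```r
# Observed counts copied from the contingency table (rows: White, Black, Other)
observed <- matrix(c(14831, 13943, 2608,
                      2353,  2471,  479,
                       747,   745,  135),
                   nrow = 3, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
dof    <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail critical value at the 5% level and the upper-tail p-value
critical <- qchisq(0.05, dof, lower.tail = FALSE)
p_value  <- pchisq(chi_sq, dof, lower.tail = FALSE)

c(statistic = chi_sq, critical = critical, p_value = p_value)

# Reject the null hypothesis when the statistic exceeds the critical value
chi_sq > critical
```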

Reminder: Although the point of statistical inference is to draw conclusions about the population from underlying samples, causal conclusions cannot be made here since this survey is observational and not experimental. The results describe associations only, and generalizing them to the United States as a whole rests on the random sampling and on the sources of bias noted in Part 1. The above examples are meant to demonstrate the theory and application of statistical inference.