This paper forms the final assessment for the Duke University Inferential Statistics course.
The objective is to develop a research question using the General Social Survey (GSS) dataset and then conduct exploratory and inferential analysis in response.
This paper asks “Is there a link between a person’s concern regarding spending on crime and their socio-economic status?”
The analysis examines responses to the survey question regarding current spending on halting rising crime and looks at the proportion of each socio-economic group that responded with either too much, too little or about right.
Using the Chi-square test of independence to compare observed and expected counts, the analysis finds very strong evidence (\(p\)-value < 2.2e-16) that there is indeed a relationship between the two.
library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)
library(ggthemes)
library(plotly)
library(kableExtra)
source('http://bit.ly/dasi_inference')
# set defaults: cache chunks to speed compiling subsequent edits.
knitr::opts_chunk$set(cache=TRUE, echo = TRUE)
The data file is in Rdata (gz compressed) format and is obtained and loaded using the code below.
download.file(
"https://d3c33hcgiwev3.cloudfront.net/_5db435f06000e694f6050a2d43fc7be3_gss.Rdata?Expires=1645056000&Signature=Yxc6MGpqzYdGYkkSlHfGY5tz2VRAWy2eY3IcB-ZvqUWvfiGL3j3Y9wyM-xpSdvACKQiZLHFHDqNtpIZt2TnduwIyKva9bwI1OsuBZAiSiiikJ0mu~LwybmnJs-i8A3AdPACBXMWZuv4440Abg4qPTyti3Lj7sO8~u2xfSoAdyx8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
"./gss.Rdata",
method = "curl"
)
load("./gss.Rdata")
dim(gss)
## [1] 57061 114
The data set consists of 57061 observations, each a set of responses from one respondent. Observations have 114 variables - either direct answers to survey questions, or calculated values based on responses.
The full description of each variable is available in the codebook.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
The GSS has a “replicating core” that emphasizes collection of data on social trends through exact replication of question wording over time. Core items fall into two major categories— socio-demographic/background measures, and replicated measurements on social and political attitudes and behaviors.
The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents that become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only about a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results.
The survey is conducted face-to-face with an in-person interview by NORC at the University of Chicago. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer.
Excerpted from the GSS project description, the GSS Wikipedia page and The General Social Survey (GSS): The Next Decade and Beyond
This is an observational study due to the nature of data collection. It only establishes associations and trends and cannot be used to infer causality.
Although the sampling methodology is randomised, no random assignment was used in the surveys, and caution should be exercised before generalising this data to the entire American population because of possible sources of bias (see below).
Two years ago, I moved from a country with a very low crime incidence to a country, and city, with a high and rising crime rate. The highest rate of increase in the level of crime is disproportionally in the poorer suburbs of the city. While you would think that those most affected would be the most concerned, there is a general sense of apathy towards the situation in those neighbourhoods: “It is what it is, the police won’t help us”.
I’m interested to see if the data would suggest that there is a similar attitude in the US.
Is there a link between a person’s concern regarding spending on crime and their socio-economic status?
The question is an important part of understanding the broader picture of not only how crime is perceived across the socio-economic spectrum, but also as an indication of trust levels those groups have with authorities in tackling the crime problem.
There are three competing schools of thought here:
Point 3 then becomes our null hypothesis, while 1 & 2 collectively becomes our alternative hypothesis.
\(H_0: p_{class}\ and\ p_{crime}\ are\ independent\),
\(H_A: p_{class}\ and\ p_{crime}\ are\ dependent\)
where \(p_{class}\) is the proportion in each class and \(p_{crime}\) is the proportion of responses to the crime question.
In other words:
The variables of interest here will be:
| Variable | Description |
|---|---|
| class | SUBJECTIVE CLASS IDENTIFICATION If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class? LOWER CLASS, WORKING CLASS, MIDDLE CLASS, UPPER CLASS, NO CLASS |
| natcrime | HALTING RISING CRIME RATE We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount. i.e. Halting the rising crime rate. TOO LITTLE, ABOUT RIGHT, TOO MUCH |
| sei | RESPONDENT SOCIOECONOMIC INDEX Calculated value, numeric (1 - 99.7) See Appendix for explanation of this variable. This variable will be used to cross-check the results of the hypothesis test on the class variable where respondents self-identifying as either UPPER or LOWER CLASS were few. Note that the sei variable is not recorded until 1988, nor for the 2012 survey. |
To prepare the data subset, we eliminate the NAs and invalid values in class, sei & natcrime from the gss data table, select only those columns of interest and drop any unused factor levels.
# sei: Missing-data codes: -1.0, 99.8, 99.9
dfSubset <- gss %>%
  filter(
    !is.na(natcrime),
    !is.na(class),
    !is.na(sei),
    class != 'No Class',
    sei >= 0 & sei < 99.8
  ) %>%
  select(year, class, sei, natcrime) %>%
  droplevels()
Because sei was only recorded in certain years, this restricts the subset to surveys from 1988 to 2010 (inclusive).
Next, we create a factor of 4 levels based on the quartiles of the sei variable and label those accordingly:
q_sei <- quantile(dfSubset$sei, c(1:3)/4)
q_sei
##  25%  50%  75%
## 32.4 39.0 63.5
dfSubset <- dfSubset %>%
  mutate(socioeconomic = factor(
    case_when(
      sei <= q_sei[1] ~ '1',
      sei > q_sei[1] & sei <= q_sei[2] ~ '2',
      sei > q_sei[2] & sei <= q_sei[3] ~ '3',
      TRUE ~ '4'
    ),
    labels=c('Lower', 'Lower Middle', 'Upper Middle', 'Upper')
  )
  )
dfSubset %>% sample_n(10)
## year class sei natcrime socioeconomic
## 1 1996 Working Class 36.5 Too Little Lower Middle
## 2 2002 Middle Class 37.7 About Right Lower Middle
## 3 1994 Working Class 29.5 Too Little Lower
## 4 2010 Middle Class 39.0 About Right Lower Middle
## 5 1996 Working Class 44.8 Too Little Upper Middle
## 6 1994 Working Class 69.2 Too Little Upper
## 7 1996 Working Class 38.4 About Right Lower Middle
## 8 2000 Lower Class 28.4 Too Little Lower
## 9 2008 Working Class 33.1 About Right Lower Middle
## 10 2000 Working Class 63.5 About Right Upper Middle
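The quartile banding above can also be sketched with base R's cut(). This is only an illustrative alternative, not the method used in the analysis: band_sei is a hypothetical helper, and the cut points are hard-coded from the printed q_sei values rather than recomputed from the data. Right-closed intervals mirror the `<=` comparisons in the case_when() call.

```r
# Quartile cut points as printed above (hard-coded here for illustration only;
# in the analysis they come from quantile(dfSubset$sei, c(1:3)/4)).
q_sei <- c(32.4, 39.0, 63.5)

# Hypothetical helper: band SEI scores into four right-closed intervals,
# i.e. (-Inf, 32.4], (32.4, 39.0], (39.0, 63.5], (63.5, Inf).
band_sei <- function(sei, cuts) {
  cut(
    sei,
    breaks = c(-Inf, cuts, Inf),
    labels = c('Lower', 'Lower Middle', 'Upper Middle', 'Upper'),
    right = TRUE
  )
}

# A few sei values taken from the sample rows shown above.
sei_example <- c(36.5, 29.5, 39.0, 44.8, 69.2, 63.5)
banded <- band_sei(sei_example, q_sei)
print(data.frame(sei = sei_example, socioeconomic = banded))
```

The bands assigned to these six values should agree with the socioeconomic column in the sample rows, including the boundary cases 39.0 and 63.5.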
Because the surveys cover a long period of time (22 years), it’s a good idea to make sure we’re not incorporating any long term trends over time and skewing the data.
dfclassyearcrime <- dfSubset %>%
  group_by(class, year, natcrime) %>%
  summarise(n=n()) %>%
  mutate(proportion = round(100 * n / sum(n), 1)) %>%
  filter(natcrime=='Too Little') %>%
  select(-n, -natcrime) %>%
  rename(Year = year)
p <- ungroup(dfclassyearcrime) %>%
  pivot_wider(names_from = class, values_from = proportion) %>%
  plot_ly(x=~Year, y=~`Lower Class`, name='Lower Class', type='scatter', mode='lines') %>%
  add_trace(y=~`Working Class`, name='Working Class', mode='lines') %>%
  add_trace(y=~`Middle Class`, name='Middle Class', mode='lines') %>%
  add_trace(y=~`Upper Class`, name='Upper Class', mode='lines') %>%
  config(displayModeBar = FALSE) %>%
  layout(
    xaxis=list(fixedrange=TRUE),
    yaxis=list(title="Proportion of Class (%)")
  )
t <- dfclassyearcrime %>%
  pivot_wider(names_from = Year, values_from = proportion) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("condensed"))
Respondents Answering ‘Too Little’ to the National Crime Spending Question 1988-2010
p
t
| class | 1988 | 1989 | 1990 | 1991 | 1993 | 1994 | 1996 | 1998 | 2000 | 2002 | 2004 | 2006 | 2008 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lower Class | 69.2 | 83.9 | 68.0 | 77.4 | 73.2 | 86.5 | 72.7 | 71.2 | 63.6 | 68.8 | 65.9 | 65.9 | 66.7 | 68.6 |
| Working Class | 68.6 | 75.8 | 74.2 | 67.5 | 72.1 | 79.3 | 74.4 | 68.9 | 66.2 | 60.7 | 63.7 | 66.8 | 66.2 | 62.5 |
| Middle Class | 73.2 | 72.9 | 70.8 | 64.5 | 75.7 | 75.6 | 63.8 | 56.3 | 55.4 | 53.2 | 53.6 | 59.7 | 59.3 | 55.0 |
| Upper Class | 88.2 | 80.0 | 72.2 | 58.3 | 66.7 | 75.0 | 74.4 | 62.7 | 60.0 | 55.1 | 45.0 | 38.1 | 40.6 | 38.7 |
There is indeed a lot of change over the period. As there was a higher frequency of surveys prior to 1994, we’ll exclude those to avoid skewing the subset towards those earlier years. Additionally, since we’re looking at what the latest data looked like rather than trends over time, we will limit the data to the last 10 years of the sub-setted data (2000 - 2010 inclusive).
We can see from the chart above that, over time, there is some indication of a reversal in which classes express the most concern amongst those that responded to this survey. This would be an interesting study in itself, but one that is beyond this research question.
dfSubset <- dfSubset %>%
  filter(year >= 2000)
bp_crime <- ggplot(dfSubset, aes(x=natcrime)) +
  geom_bar(fill='orange') +
  theme_excel_new() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"))
bp_class <- ggplot(dfSubset, aes(x=class)) +
  geom_bar(fill='orange') +
  theme_excel_new() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"))
bp_se <- ggplot(dfSubset, aes(x=socioeconomic)) +
  geom_bar(fill='orange') +
  theme_excel_new() +
  theme(plot.margin = margin(0, 0, 0, 0, "cm"))
bp_crime
Distribution of Responses to Crime Spending
The distribution is heavily weighted towards the ‘Too Little’ response.
Care will be needed to ensure there are sufficient responses from each class to make an analysis meaningful.
bp_class
Distribution of Responses to Class
The distribution is heavily weighted towards the ‘Working Class’ and ‘Middle Class’ responses.
Again, care will be needed to ensure there are sufficient responses from those classes in each crime response category to make an analysis meaningful.
bp_se
Distribution of Socio-Economic band
The distribution is fairly even as expected since this was created from quantile boundaries.
Further analysis is needed to see if this is the better variable to use for the inference models.
allClasses <- dfSubset %>%
  group_by(natcrime) %>%
  summarise(n=n()) %>%
  mutate(class='All Classes') %>%
  relocate(class)
dfclasscrime <- dfSubset %>%
  group_by(class, natcrime) %>%
  summarise(n=n()) %>%
  rbind(allClasses) %>%
  mutate(
    proportion = paste0(
      as.character(n),
      ' (',
      format(round(100 * n / sum(n), 1), nsmall=1),
      '%)'
    )
  ) %>%
  select(-n) %>%
  pivot_wider(names_from = class, values_from = proportion)
colTotals <- dfSubset %>%
  group_by(class) %>%
  summarise(n=n()) %>%
  pivot_wider(names_from = class, values_from = n) %>%
  mutate(natcrime="Total", `All Classes`="") %>%
  relocate(natcrime)
dfclasscrime <- dfclasscrime %>%
  rbind(colTotals) %>%
  rename(Response=natcrime)
t <- dfclasscrime %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("condensed"))
Two-way Table for Respondents Answering the National Crime Spending Question 2000-2010
t
| Response | Lower Class | Working Class | Middle Class | Upper Class | All Classes |
|---|---|---|---|---|---|
| Too Little | 289 (66.7%) | 2032 (64.4%) | 1752 (55.9%) | 113 (47.3%) | 4186 (60.1%) |
| About Right | 108 (24.9%) | 932 (29.5%) | 1198 (38.2%) | 107 (44.8%) | 2345 (33.7%) |
| Too Much | 36 ( 8.3%) | 190 ( 6.0%) | 184 ( 5.9%) | 19 ( 7.9%) | 429 ( 6.2%) |
| Total | 433 | 3154 | 3134 | 239 | |
dasi_inference algorithm in the next section.
Because we are comparing two categorical variables (each with more than 2 levels) as proportions, we will perform a hypothesis test using the Chi-square test of independence.
No confidence-interval inference is possible for this type of analysis.
To recap, the hypothesis to be tested is:
\(H_0: p_{class}\ and\ p_{crime}\ are\ independent\),
\(H_A: p_{class}\ and\ p_{crime}\ are\ dependent\)
Independence
Sample Size
The sub-setted data meets the requirements for the Chi-square test of independence.
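The sample size condition can be checked with a short sketch, assuming the usual rule of thumb that every cell needs an expected count of at least 5. The counts below are typed in by hand from the two-way table above rather than read from dfSubset, and the object names (observed, expected, sample_size_ok) are illustrative only.

```r
# Observed counts copied from the two-way table (2000-2010 subset).
observed <- matrix(
  c(289, 2032, 1752, 113,
    108,  932, 1198, 107,
     36,  190,  184,  19),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    natcrime = c('Too Little', 'About Right', 'Too Much'),
    class = c('Lower Class', 'Working Class', 'Middle Class', 'Upper Class')
  )
)

# chisq.test() returns the expected counts under independence in $expected.
expected <- chisq.test(observed)$expected
print(round(expected, 2))

# Assumed rule of thumb: every expected cell count should be at least 5.
min_expected <- min(expected)
sample_size_ok <- all(expected >= 5)
cat('Smallest expected count:', round(min_expected, 2), '\n')
cat('Sample size condition met:', sample_size_ok, '\n')
```

The smallest expected count belongs to the Upper Class / Too Much cell, which has the smallest row and column totals.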
The chi-square procedure assesses whether the data provide enough evidence that a true relationship between the two variables exists in the population.
The idea behind the test is measuring how far the observed data are from the null hypothesis by comparing the observed counts to the expected counts—the counts that we would expect to see (instead of the observed ones) had the null hypothesis been true. The expected count of each cell is calculated as follows:
\[Expected\ Count = \frac{Column\ Total\ \times\ Row\ Total}{Table\ Total}\]
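As a worked illustration, take the Lower Class / Too Little cell of the two-way table, where the column total is 433, the row total is 4186 and the table total is 6960:

\[Expected\ Count_{Lower\ Class,\ Too\ Little} = \frac{433 \times 4186}{6960} \approx 260.42\]

The observed count in that cell is 289, so more Lower Class respondents answered Too Little than independence would predict.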
Degrees of freedom is calculated as:
\[ Degrees\ of\ freedom\ (df) = (Rows-1)\times(Columns-1)=2\times3=6 \]
The measure of the difference between the observed and expected counts is the chi-square test statistic, whose null distribution is called the chi-square distribution. The chi-square test statistic is calculated as follows:
\[ \chi^2 = \displaystyle\sum_{all\ cells}\frac{(Observed\ Count - Expected\ Count)^2}{Expected\ Count} \]
The \(p\)-value is the probability of obtaining a \(\chi^2\) statistic at least as large as the one observed, that is, of getting counts at least as far from the expected counts as those observed, assuming that the two variables are not related (which is what is claimed by the null hypothesis).
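The formulas above can be traced by hand in a few lines of base R. This is a minimal sketch that assumes the counts from the two-way table are entered manually; it is not the dasi_inference implementation, and the variable names are illustrative.

```r
# Observed counts copied from the two-way table (rows: natcrime, columns: class).
observed <- matrix(
  c(289, 2032, 1752, 113,
    108,  932, 1198, 107,
     36,  190,  184,  19),
  nrow = 3, byrow = TRUE
)

# Expected count = row total * column total / table total, for every cell.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

cat('chi-square =', round(chi_sq, 3), ' df =', deg_free, ' p-value =', p_value, '\n')
```

Only the upper tail is used because larger values of the statistic correspond to larger departures from independence.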
The expected counts, \(\chi^2\) and \(p\)-value will be calculated from the sub-setted data using the inference() function from the dasi_inference script:
- The explanatory variable (x) is the class of the respondent (class).
- The response variable (y) is how the respondent answered the question regarding crime spending (natcrime).
- We use proportion for the estimation type and run a test of ht type (hypothesis test).
- method is theoretical as our expected sample size condition is met.
- alternative is greater since we are looking for the probability of a \(\chi^2\) at least as large as the one observed.
- nsim is set to 100,000.
inference(
  y=dfSubset$natcrime,
  x=dfSubset$class,
  est='proportion',
  type='ht',
  method='theoretical',
  alternative = 'greater',
  nsim = 100000
)
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
##
## Summary statistics:
## x
## y Lower Class Working Class Middle Class Upper Class Sum
## Too Little 289 2032 1752 113 4186
## About Right 108 932 1198 107 2345
## Too Much 36 190 184 19 429
## Sum 433 3154 3134 239 6960
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
## x
## y Lower Class Working Class Middle Class Upper Class
## Too Little 260.42 1896.93 1884.90 143.74
## About Right 145.89 1062.66 1055.92 80.53
## Too Much 26.69 194.41 193.17 14.73
##
## Pearson's Chi-squared test
##
## data: y_table
## X-squared = 87.447, df = 6, p-value < 2.2e-16
The \(p\)-value is below 2.2e-16, meaning that if class and the crime-spending response were truly independent, counts departing this far from the expected counts would almost never arise by chance. The null hypothesis is therefore rejected at any conventional significance level.
Although we can’t produce a confidence interval for this analysis, we can look at the socio-economic classifications created earlier and see if these produce a similar result.
The independence conditions for this variable still hold since they are from the same data sample and respondents can only belong to one group each.
The sample size condition is easily met as each group represents approximately one quarter of the sub-setted respondents.
For this, we just need to swap out class for socioeconomic as the explanatory variable in the inference function:
inference(
  y=dfSubset$natcrime,
  x=dfSubset$socioeconomic,
  est='proportion',
  type='ht',
  method='theoretical',
  alternative = 'greater',
  nsim = 100000
)
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
##
## Summary statistics:
## x
## y Lower Lower Middle Upper Middle Upper Sum
## Too Little 1118 1048 1059 961 4186
## About Right 496 448 681 720 2345
## Too Much 122 102 109 96 429
## Sum 1736 1598 1849 1777 6960
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
## x
## y Lower Lower Middle Upper Middle Upper
## Too Little 1044.09 961.10 1112.06 1068.75
## About Right 584.90 538.41 622.97 598.72
## Too Much 107.00 98.50 113.97 109.53
##
## Pearson's Chi-squared test
##
## data: y_table
## X-squared = 89.266, df = 6, p-value < 2.2e-16
This test yields a similarly high \(\chi^2\) value and resulting low \(p\)-value.
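As a cross-check, the same result can be approximated with base R's chisq.test() applied directly to the summary counts printed above. This is a sketch under the assumption that the counts are typed in by hand; observed_sei and result are illustrative names, and the call is independent of the dasi_inference script.

```r
# Observed counts copied from the summary statistics above
# (rows: natcrime, columns: socioeconomic band).
observed_sei <- matrix(
  c(1118, 1048, 1059, 961,
     496,  448,  681, 720,
     122,  102,  109,  96),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    natcrime = c('Too Little', 'About Right', 'Too Much'),
    socioeconomic = c('Lower', 'Lower Middle', 'Upper Middle', 'Upper')
  )
)

# Pearson's chi-square test of independence on the contingency table.
result <- chisq.test(observed_sei)
print(result)

# The smallest expected count confirms the sample size condition.
cat('Smallest expected count:', round(min(result$expected), 2), '\n')
```

Working from the table of counts rather than the raw data makes the check easy to repeat without re-downloading the GSS file.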
The graph above may also suggest a secondary grouping showing close agreement between those below the average socio-economic index and those above. This is not a conclusion you can draw from this analysis however.
With such a low \(p\)-value, the null hypothesis that there is no relationship between class (or socio-economic status) and concern over spending on crime is easily rejected. The data provide convincing evidence for the alternative hypothesis that there is a relationship.
It should be noted that this test does not explain the nature of the relationship; it simply confirms its existence.
Because we are comparing multi-level categorical variables, only a Chi-square test of independence is possible; there is no corresponding confidence interval method to back up the findings. It was, however, possible to produce similar results using both the self-identified class category and the calculated SEI (socio-economic index).
Even though the tests yielded an extremely low \(p\)-value, this is an observational study. As a result, we cannot claim that class causes the level of concern about spending on crime, only that there is very strong evidence of an association.
Possible sources of bias and confounding include that lower-income people may have a higher level of mistrust of authority and be less likely to give out personal information to a survey, or that attitudes to crime and authority are linked to race, and races are not evenly represented across the socio-economic groupings.
It should be remembered that the survey question asks the respondent “whether you think we’re spending too much money on [halting rising crime], too little money, or about the right amount”. This is not the same as asking the respondent whether they are concerned about the crime rate.
It is also easy to speculate that the first school of thought discussed in the question (those who face crime on a regular basis will be most concerned) is supported; however, this was not tested in the inference model, and the survey data does not include information to suggest that people from lower socio-economic groups face crime at a higher rate. This is for a separate study.
Further studies might include:
SEI scores were originally calculated by Otis Dudley Duncan based on NORC’s 1947 North-Hatt prestige study and the 1950 U.S. Census. Duncan regressed prestige scores for 45 occupational titles on education and income to produce weights that would predict prestige. This algorithm was then used to calculate SEI scores for all occupational categories employed in the 1950 Census classification of occupations. Similar procedures have been used to produce SEI scores based on later NORC prestige studies and censuses.
The GSS contains several sets of SEI scores. They all used procedures similar to those employed by Duncan. For cases coded according to the 1970 US Census codes there are SEI scores developed by Lloyd V. Temme (See Appendix G). These exist for respondent (DOTPRES), spouse (SPDOTPRE), and father (PADOTPRE). For cases coded according to the 1980 US Census codes there are SEI scores developed by Nakao and Treas as part of the GSS’s 1989 occupational prestige study (see above). These exist for respondent (SEI), respondent’s first occupation (FIRSTSEI), father (PASEI), mother (MASEI), and spouse (SPSEI).
Excerpt from the GSS Codebook Appendix