This assignment closes the inferential statistics course from Duke University on Coursera. We use data from the General Social Survey and identify a research question illustrating concepts covered during the course.
The proposed research question is political in nature: measuring how confident people are across several topics under the last 4 presidencies covered by the data. The purpose is to help a presidential candidate pick a campaign slogan that resonates with the American audience, as we interpret it from the GSS data. We will provide convincing evidence that there is a significant decreasing trend in how confident US citizens are across all topics. Additionally, our EDA will identify the topics that are the most affected (finance, business, government, congress, medicine).
Although the topic is political, I certainly do not make any political statement.
library(statsr); library(dplyr); library(ggplot2); library(Hmisc); library(reshape2)
The data can be found here: “gss.Rdata”
Make sure you save the file into your current working directory.
load("gss.Rdata")
dim(gss)
## [1] 57061 114
The dataset comprises 57061 observations (each observation being an answer from a respondent) across 114 variables.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
Variables are either factors or numerical, broken down into 8 top level categories:
- Case identification and year
- Respondent Background
- Personal And family information
- Attitudinal Measures: personal views on national problems such as health, environment, …
- Personal concerns: religion
- Workplace and economic concerns; job security, social class, finance, standard of living
- Controversial social issues: views on abortion, family planning, suicide, violence, …
- Obligations and responsibilities
The version of the GSS dataset we are using for this project is a subset preprocessed for the Coursera course, showing data on American society from 1972 to 2012.
The codebook for this dataset can be found here
Study design
The full codebook can be found here, with the sampling design explained in great detail in Appendix A (starting at page 2867).
There have been many methodology changes over the years but the study is based on random sampling, allowing generalization of inferences to the US population.
However, there was no random assignment, so this remains an observational study: the insights can only speak to association, not causation.
The methods have changed over the decades. Following trends over time can be very interesting (for instance, relating the perception of societal issues to different presidents or to major economic or cultural events), but one has to be wary of the bias these changes introduce. Also, not all respondents answered all questions in all years.
1. Research question: I put myself in the shoes of a political advisor, advising an imaginary presidential candidate on which type of slogan could be used.
Looking at past slogans, we see that some are about building for the future (“Hope”, “Yes We can”, …), some deal with specific issues (“Country First”, …) and others are about restoring a supposedly lost splendor (“Let’s Make America Great Again” from Reagan in 1980).
Let’s imagine that our presidential candidate thinks of reusing Reagan’s 1980 slogan.
The research question is whether the GSS answers on confidence in a variety of public topics corroborate the idea that America has lost its greatness. Put another way: did people lose confidence overall over the last 4 presidencies and, if so, in which specific areas?
First consideration: WHAT IS GREATNESS ABOUT?
The GSS asks citizens about their confidence in the following topics: Finance, Business, Religion, Education, Government, Labor, Press, Medicine, Television, Justice, Science, Congress, Military.
It’s not all, but it covers quite a lot.
Second consideration: HOW DO YOU MEASURE GREATNESS?
There are countless indicators out there, from GNP and national debt to any economic or financial measure known to humankind. Since a political slogan is aimed at voters, we leave it to them to answer this question in an impartial manner; the GSS is an ideal tool for this purpose.
Third consideration: TIME ELEMENT
If this greatness assertion were shared by the American public, when is America supposed to have lost its greatness? This hasn’t been clearly defined, and we assume here that looking into the last 4 presidencies in the data (2 Republican and 2 Democratic presidents), spanning from 1989 to 2012, should be relevant enough.
2. Biases:
- Not all respondents answered the questions in all years, but we consider the answers about confidence to be independent of each other.
- The proposed methodology is to cut the dataset into 4 parts for the last 4 presidencies, but the latest data is from 2012, just 3 years into Obama’s presidency, which lasted 8 years, so there is an obvious bias here. Data up to 2016 would have been preferable to shed more light on the question.
3. Methodology:
- we start with an exploratory data analysis: plots and summary statistics on the proportion of people confident in a specific topic/institution, broken down by presidency.
- as the main confidence metric, we use the proportion of respondents who answered that they have “a great deal” of confidence out of all answers (“only some confidence” is treated as a neutral opinion, and “hardly any” as negative).
- we calculate the average confidence across all topics, then measure if there has been a decreasing or increasing trend over the last 4 presidencies.
- we perform statistical inference on the outcome to determine how significant the trend is.
- we’ll identify the topics performing best and worst.
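The confidence metric described in the list above can be sketched in a few lines of base R. This is a minimal illustration on made-up answers, not the code used in the report; the helper name `prop_great_deal` is hypothetical, and the level labels are assumed to match those mentioned in the code comments of this report.

```r
# Toy answers using the three confidence levels named in this report (made-up data).
answers <- factor(
  c("A Great Deal", "Only Some", "Hardly Any", "A Great Deal", "Only Some"),
  levels = c("A Great Deal", "Only Some", "Hardly Any")
)

# Hypothetical helper: share of "A Great Deal" among the non-missing answers.
prop_great_deal <- function(x) {
  mean(x == "A Great Deal", na.rm = TRUE)
}

prop_great_deal(answers)
```

Comparing by label rather than by position in a frequency table avoids depending on the order of the factor levels.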
4. Important note
I’m definitely not trying to make a political statement or to support the claim of any candidate; this is merely a research question. The idea is to measure which type of slogan could be more effective, and to detect the topics that could be prioritized in the campaign.
1. Subset a dataset with only the relevant variables
gss_small <- gss[,c("year", "confinan", "conbus", "conclerg", "coneduc", "confed", "conlabor", "conpress", "conmedic", "contv", "conjudge", "consci", "conlegis", "conarmy")]
#remove NAs
gss_small <- na.omit(gss_small)
2. Format an empty dataframe to receive positive confidence percentages under each presidency, 4 presidencies x 13 confidence topics:
#create table where we will store all the data, and rename rows and columns
confidenceDF <- data.frame(matrix(NA, nrow = 4, ncol = 13))
rownames(confidenceDF) <- c("Bush_Sr", "Clinton", "Bush_Jr", "Obama")
colnames(confidenceDF) <- c("Finance", "Business", "Religion", "Educ.", "Gov't", "Labor", "Press", "Medicine", "TV", "Justice", "Science", "Congress", "Army")
3. Populate the dataframe with actual proportions of positive perception of topics, per presidency:
#BUSH SENIOR
#store in gss_temp the answers from the Bush Sr. years (1989 to 1992)
gss_temp <- gss_small[gss_small$year>=1989 & gss_small$year<1993,]
#set variable "a" to 1, used for allocating the values to the relevant columns during the loop
a <- 1
#we loop from column 2 (confinan) to column 14, we leave out column one of the dataframe which is the year
for(i in 2:14){
#we store in tempDF the frequency table of the ith column (respectively finance, business, ...); tempDF then contains 2 columns (level, count) and 3 rows (A Great Deal, Only Some, Hardly Any).
tempDF <- data.frame(table(gss_temp[,i]))
#We store the value "A Great Deal"/sum(all answers). We take this as the proportion of positive perception, "Only Some" being neutral and "Hardly Any" being negative.
confidenceDF[1,a] <- tempDF[1,2]/sum(tempDF[,2])
#Add 1 to the variable a at each pass (it ends at column 13 of confidenceDF), thus putting the appropriate value into the relevant column.
a <- a+1
}
#REPEATING PROCESS FOR CLINTON
gss_temp <- gss_small[gss_small$year>=1993 & gss_small$year<2001,]
a <- 1
for(i in 2:14){
tempDF <- data.frame(table(gss_temp[,i]))
confidenceDF[2,a] <- tempDF[1,2]/sum(tempDF[,2])
a <- a+1
}
#REPEATING PROCESS FOR BUSH JUNIOR
gss_temp <- gss_small[gss_small$year>=2001 & gss_small$year<2009,]
a <- 1
for(i in 2:14){
tempDF <- data.frame(table(gss_temp[,i]))
confidenceDF[3,a] <- tempDF[1,2]/sum(tempDF[,2])
a <- a+1
}
#REPEATING PROCESS FOR OBAMA
gss_temp <- gss_small[gss_small$year>=2009,]
a <- 1
for(i in 2:14){
tempDF <- data.frame(table(gss_temp[,i]))
confidenceDF[4,a] <- tempDF[1,2]/sum(tempDF[,2])
a <- a+1
}
4. Average confidence across all topics
What matters the most for our research question is the overall average of confidence across all topics.
round(apply(confidenceDF,1,mean),2)
## Bush_Sr Clinton Bush_Jr   Obama
##    0.26    0.24    0.24    0.23
We notice a slight decreasing trend over time, from about 26% confidence under the presidency of Bush Senior to 23% under Obama’s first term.
We will perform statistical inference in part 4 of this document, calculating a 95% confidence interval and testing whether this result is statistically significant.
For now we continue with the EDA.
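The four near-identical loops of step 3 and the row average of step 4 could also be written as one function. The sketch below is one possible way to do it, run on a small made-up data frame standing in for `gss_small`; the helper name `confidence_by_presidency` is hypothetical, and the year cut-offs are the ones used in the loops above.

```r
# Toy stand-in for gss_small: a year column plus two confidence items (made-up answers).
lv <- c("A Great Deal", "Only Some", "Hardly Any")
toy <- data.frame(
  year     = c(1990, 1991, 1996, 1998, 2004, 2006, 2010, 2012),
  confinan = factor(c("A Great Deal", "Only Some", "A Great Deal", "A Great Deal",
                      "Hardly Any", "Only Some", "Hardly Any", "Only Some"), levels = lv),
  conarmy  = factor(c("Only Some", "Only Some", "A Great Deal", "Hardly Any",
                      "A Great Deal", "A Great Deal", "A Great Deal", "A Great Deal"), levels = lv)
)

# Hypothetical helper: share of "A Great Deal" answers per presidency and topic.
confidence_by_presidency <- function(df, topics) {
  # Left-closed intervals: [1989,1993), [1993,2001), [2001,2009), [2009,Inf)
  presidency <- cut(df$year,
                    breaks = c(1989, 1993, 2001, 2009, Inf),
                    labels = c("Bush_Sr", "Clinton", "Bush_Jr", "Obama"),
                    right = FALSE)
  sapply(topics, function(v) tapply(df[[v]] == "A Great Deal", presidency, mean))
}

toy_conf <- confidence_by_presidency(toy, c("confinan", "conarmy"))
toy_avg  <- rowMeans(toy_conf)  # average confidence across topics, per presidency
toy_avg
```

With the real data, the same call would take the 13 `con*` column names instead of the two used here.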
5. Display and plot the results by topics
confidenceDF
##           Finance  Business  Religion     Educ.     Gov't     Labor
## Bush_Sr 0.1575483 0.2484574 0.2344714 0.2883587 0.2348828 0.1057178
## Clinton 0.2297449 0.2709560 0.2572190 0.2472666 0.1209700 0.1128399
## Bush_Jr 0.2560214 0.1690455 0.2203390 0.2716325 0.1665923 0.1190901
## Obama   0.1062099 0.1528908 0.2068522 0.2646681 0.1571734 0.1143469
##              Press  Medicine         TV   Justice   Science   Congress
## Bush_Sr 0.16042781 0.4672974 0.12957631 0.3800905 0.4310983 0.16824352
## Clinton 0.10120549 0.4334174 0.09545837 0.3211382 0.4269694 0.09055228
## Bush_Jr 0.09455843 0.3782337 0.09277431 0.3264942 0.4165923 0.11351472
## Obama   0.09764454 0.4068522 0.10963597 0.2963597 0.4188437 0.07922912
##              Army
## Bush_Sr 0.4236940
## Clinton 0.3807121
## Bush_Jr 0.5107047
## Obama   0.5434690
Plot in ggplot in order to visualize topics with the most and least confidence
#we need to reformat the dataframe in order to plot this in ggplot, I followed the instructions from this site:
#https://mundosubmundo.kaiux.com/2012/02/solving-ggplot2-doesnt-know-how-to-deal-with-data-of-class-matrix/
Presidents <- rownames(confidenceDF)
Conf_Topics <- colnames(confidenceDF)
confidenceDF_Plot <- structure(c(confidenceDF$Finance, confidenceDF$Business, confidenceDF$Religion, confidenceDF$`Educ.`, confidenceDF$`Gov't`, confidenceDF$Labor, confidenceDF$Press, confidenceDF$Medicine, confidenceDF$TV, confidenceDF$Justice, confidenceDF$Science, confidenceDF$Congress, confidenceDF$Army), .Dim=c(4,13), .Dimnames = list(c(Presidents), c(Conf_Topics)))
confidenceDF_Plot <- melt(confidenceDF_Plot)
names(confidenceDF_Plot) <- c("Presidents", "Conf_Topics", "value")
#plot function
ggplot(data=confidenceDF_Plot, aes(x = Conf_Topics, y = value, fill = Presidents)) + geom_bar(stat = "identity", position = "stack") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Confidence Topics") + ylab("")
Looking at the overall picture, people seem to have generally more confidence in the Military, Science, Medicine and Justice. The poor performers are Television, Labor, Press, Congress and Government.
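The wide-to-long reshaping done before the plot can also be sketched in base R, without listing every column by hand. This is an alternative to the `structure()` plus `melt()` approach, shown on a two-by-two extract with values rounded from the table above; the object names `conf` and `conf_long` are illustrative only.

```r
# Two-presidency, two-topic stand-in for confidenceDF (values rounded from the table above).
conf <- data.frame(Finance = c(0.158, 0.106), Army = c(0.424, 0.543),
                   row.names = c("Bush_Sr", "Obama"))

# as.table() keeps the matrix dimnames; as.data.frame() then returns one row per cell.
conf_long <- as.data.frame(as.table(as.matrix(conf)), responseName = "value")
names(conf_long)[1:2] <- c("Presidents", "Conf_Topics")
conf_long
```

The resulting columns have the same names as those expected by the `ggplot()` call, so the long table could be passed to it directly.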
6. Identifying the confidence topic with the highest variances
#boxplot
boxplot(confidenceDF, las=2)
#the boxplot results should be corroborated with a table showing variances
sort(apply(confidenceDF, 2, var), decreasing=TRUE)
##         Army      Finance     Business        Gov't     Congress
## 5.685479e-03 4.661649e-03 3.377628e-03 2.263270e-03 1.565477e-03
##     Medicine      Justice        Press     Religion        Educ.
## 1.441982e-03 1.242460e-03 9.878487e-04 4.632353e-04 2.895484e-04
##           TV      Science        Labor
## 2.840468e-04 4.636902e-05 3.065310e-05
7. Plotting the evolution of the various topics
ggplot(data=confidenceDF_Plot, aes(x = Presidents, y = value)) +
geom_point() + facet_grid(.~Conf_Topics) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Checking both the variance and the evolution of confidence per topic over time, we notice that the topics that seem to have decreased sharply are:
Finance, Business, Government, Congress, Medicine.
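Since variance does not indicate direction, a simple complement is the change per topic between the first and last presidency. The sketch below uses the Bush Sr. and Obama rows of the table printed in step 5, rounded to 4 decimals; it ignores the Clinton and Bush Jr. rows, so it is only a rough reading aid, and the vector names are illustrative.

```r
topics  <- c("Finance", "Business", "Religion", "Educ.", "Gov't", "Labor", "Press",
             "Medicine", "TV", "Justice", "Science", "Congress", "Army")
# Proportions of "A Great Deal" answers, copied from the printed table (rounded).
bush_sr <- c(0.1575, 0.2485, 0.2345, 0.2884, 0.2349, 0.1057, 0.1604,
             0.4673, 0.1296, 0.3801, 0.4311, 0.1682, 0.4237)
obama   <- c(0.1062, 0.1529, 0.2069, 0.2647, 0.1572, 0.1143, 0.0976,
             0.4069, 0.1096, 0.2964, 0.4188, 0.0792, 0.5435)

# Negative values mean confidence was lower under Obama than under Bush Sr.
change <- setNames(obama - bush_sr, topics)
sort(round(change, 3))
```

Sorting the vector puts the largest drops first, which makes it easy to compare against the list of topics given above.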
We run the inference test on two proportions: average proportion of strongly confident respondents under the Bush Sr. and Obama presidencies.
We will first calculate the confidence interval, then run the hypothesis test.
1.0 - CONFIDENCE INTERVAL
1.1 - Summarizing the data of interest
Here’s the summary of the data; we have to dig the information out of the previously created dataframes:
#Create a summary data frame "SummaryDF" for the summary statistics
summaryDF <- data.frame(matrix(nrow = 2, ncol=4))
#Rename rows and columns
rownames(summaryDF) = c("Lowest prop.", "Highest prop."); colnames(summaryDF) = c("Presidency", "Successes-fav. opinions", "n", "p-hat")
#Put row numbers in tables for both the Obama and Bush Sr. presidencies
#Populate the summary dataframe
#Column 1: Name of the president
summaryDF[1,1] <- "OBAMA"
summaryDF[2,1] <- "BUSH_SR"
#Column 3: n = total respondents to the question
summaryDF[1,3] <- nrow(gss_small[gss_small$year>=2009,])
summaryDF[2,3] <- nrow(gss_small[gss_small$year>=1989 & gss_small$year<1993,])
#Column 4: p-hat, or proportion of respondents showing a great deal of confidence in all topics
summaryDF[1,4] <- round(min(apply(confidenceDF,1,mean)),2)
summaryDF[2,4] <- round(max(apply(confidenceDF,1,mean)),2)
#Column 2: Successes
summaryDF[1,2] <- round(summaryDF[1,4]*summaryDF[1,3])
summaryDF[2,2] <- round(summaryDF[2,4]*summaryDF[2,3])
summaryDF
##               Presidency Successes-fav. opinions    n p-hat
## Lowest prop.       OBAMA                     537 2335  0.23
## Highest prop.    BUSH_SR                     632 2431  0.26
1.2 - Defining parameters of interest
- Parameter of interest: the proportion of all American citizens who are strongly confident, averaged over all surveyed topics, under the Obama presidency MINUS the same proportion under the Bush Sr. presidency. This is UNKNOWN; it is what we are trying to estimate.
- Point estimate: the same difference between the two proportions, but based on the samples.
1.3 - Calculating point estimate
#Point estimate: PE = p_hat(Obama) - p_hat(Bush_Sr)
PE <- summaryDF[1,4]-summaryDF[2,4]
#Standard error: SE
SE <- sqrt((summaryDF[1,4] * (1-summaryDF[1,4]) / summaryDF[1,3]) + (summaryDF[2,4] * (1-summaryDF[2,4]) / summaryDF[2,3]))
PE; SE
## [1] -0.03
## [1] 0.01244951
1.4 - Checking conditions for comparing two independent proportions
- Random sampling without replacement < 10% of the population: VERIFIED
- Independence between groups: this is harder to prove but this was our initial assumption as described in the possible biases: AGREED UPON
- Sample size/skew, success/failure condition:
#n1*p1 >=10
summaryDF[1,3] * summaryDF[1,4] >= 10
## [1] TRUE
#n2*p2 >=10
summaryDF[2,3] * summaryDF[2,4] >= 10
## [1] TRUE
All conditions are verified, we can proceed with the confidence interval.
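The success/failure condition normally requires at least 10 expected successes and at least 10 expected failures in each group, while the checks above only cover the successes. A minimal sketch that tests both sides, using the sample sizes and p-hats from the summary table (the helper name `success_failure_ok` is hypothetical):

```r
# Hypothetical helper: TRUE only if expected successes AND failures are both >= min_count.
success_failure_ok <- function(n, p, min_count = 10) {
  (n * p >= min_count) && (n * (1 - p) >= min_count)
}

# Sample sizes and p-hats taken from the summary table above.
obama_ok   <- success_failure_ok(2335, 0.23)
bush_sr_ok <- success_failure_ok(2431, 0.26)
c(Obama = obama_ok, Bush_Sr = bush_sr_ok)
```

With samples of more than 2,000 respondents and proportions around 0.25, both sides of the condition are met comfortably.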
1.5 - Calculate confidence interval
We use a standard 95% confidence interval.
CI <- PE + c(-1,1) * qnorm(.975) * SE
CI
## [1] -0.054400584 -0.005599416
We are 95% confident that the proportion of American citizens who show strong confidence, averaged across all surveyed topics, is between 0.56 and 5.4 percentage points lower under the Obama presidency than under the Bush Sr. presidency.
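The interval can be wrapped in a small function and expressed directly in percentage points, which makes the bounds easier to read. This is a sketch under the same assumptions as the calculation above (rounded p-hats of 0.23 and 0.26, unpooled standard error); the helper name `ci_diff_prop` is hypothetical.

```r
# Hypothetical helper: normal-approximation confidence interval for p1 - p2.
ci_diff_prop <- function(p1, n1, p2, n2, conf_level = 0.95) {
  se     <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
  z_star <- qnorm(1 - (1 - conf_level) / 2)                 # critical value
  (p1 - p2) + c(-1, 1) * z_star * se
}

# Values from the summary table: Obama first, Bush Sr. second.
ci <- ci_diff_prop(0.23, 2335, 0.26, 2431)
round(100 * ci, 2)  # bounds in percentage points
```

Multiplying by 100 before rounding avoids misreading a bound such as -0.0056 as 0.05%.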
2.0 HYPOTHESIS TEST
2.1 - Hypothesis test
H_0: the null hypothesis is that the proportion of American citizens showing strong confidence across all topics under the Obama presidency is equal to the one under the Bush Sr. presidency, or p(Obama) - p(Bush_Sr) = 0
H_a: the alternative hypothesis, p(Obama) - p(Bush_Sr) != 0
We take a significance level alpha = 0.05.
2.2 - Pooled proportion
#(number of successes Bush_Sr + number of successes Obama) / (n1+n2)
P_hat_Pooled = (summaryDF[1,2]+summaryDF[2,2])/(summaryDF[1,3]+summaryDF[2,3])
P_hat_Pooled
## [1] 0.2452791
2.3 - Conditions with the pooled proportion
All independence conditions have been checked before, now we need to check the success/failure condition with the pooled proportion:
#n1*P_hat_Pooled >=10
summaryDF[1,3] * P_hat_Pooled >= 10
## [1] TRUE
#n2*P_hat_Pooled >=10
summaryDF[2,3] * P_hat_Pooled >= 10
## [1] TRUE
#n1*(1-P_hat_Pooled) >=10
summaryDF[1,3] * (1-P_hat_Pooled) >= 10
## [1] TRUE
#n2*(1-P_hat_Pooled) >=10
summaryDF[2,3] * (1-P_hat_Pooled) >= 10
## [1] TRUE
All 4 conditions are checked.
We calculate then the standard error with the pooled proportion:
SE_pooled <- sqrt((P_hat_Pooled * (1-P_hat_Pooled) / summaryDF[1,3]) + (P_hat_Pooled * (1-P_hat_Pooled) / summaryDF[2,3]))
SE_pooled
## [1] 0.01246707
2.4 - Distribution
Under the null hypothesis, the difference between the two sample proportions approximately follows a normal distribution with mean 0 and standard error equal to SE_pooled: p_hat(Obama) - p_hat(Bush_Sr) ~ N(mean = 0, SE = 0.01247)
2.5 - Calculate Z-SCORE and P-VALUE
Z = ( (p_hat(Obama) - p_hat(Bush_Sr)) - NULL value ) / SE_pooled
#Calculate Z score
Z <- ((summaryDF[1,4] - summaryDF[2,4]) - 0) / SE_pooled
#Calculate the two-sided P-value; -abs(Z) keeps the formula valid whatever the sign of Z
2*pnorm(-abs(Z))
## [1] 0.01611333
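As a cross-check, the pooled two-proportion z-test can be written as a function of the success counts and compared with base R's `prop.test()` without continuity correction, which should return the same two-sided p-value. This sketch uses the counts from the summary table (537 of 2335 and 632 of 2431), so its p-value differs very slightly from the one above, which relies on p-hats rounded to two decimals; the helper name `z_test_two_prop` is hypothetical.

```r
# Hypothetical helper: pooled z-test for the difference between two proportions.
z_test_two_prop <- function(x1, n1, x2, n2) {
  p_pool  <- (x1 + x2) / (n1 + n2)                        # pooled proportion
  se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z       <- (x1 / n1 - x2 / n2) / se_pool
  list(z = z, p_value = 2 * pnorm(-abs(z)))               # two-sided p-value
}

# Counts from the summary table: Obama first, Bush Sr. second.
res   <- z_test_two_prop(537, 2335, 632, 2431)
check <- prop.test(x = c(537, 632), n = c(2335, 2431), correct = FALSE)

res$p_value
check$p.value
```

Both values should be about 0.016, in line with the result reported above.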
2.6 Conclusion:
The P-value of 0.016 is lower than our significance level of 0.05.
Therefore we reject the null hypothesis in favor of the alternative hypothesis: we have convincing evidence that there is a difference between how confident people were across all surveyed topics under the Bush Sr. presidency compared to the Obama presidency.
The result is corroborated by the fact that the confidence interval does not contain our null value, 0.
This decreasing trend in confidence is notably sharp in the following areas: Finance, Business, Government, Congress, Medicine.
The data and the proposed methodology allow us to recommend to the candidate a slogan that mirrors the decrease in the confidence level of Americans, with a focus on the above-mentioned topics.