This analysis performs exploratory data analysis and statistical inference with a General Social Survey (GSS) dataset prepared for use by Coursera students.
We will first prepare the workspace environment by setting global options.
#Install knitr package if necessary and load knitr library
list.of.packages <- c("knitr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings ( suppressMessages ( library ( knitr ) ) )
knitr::opts_chunk$set(fig.width=8, fig.height=4, fig.path='figures/DataAnalysisProject_', echo=TRUE, warning=FALSE, message=FALSE)
#Clear variables
rm ( list = ls ( all = TRUE ) )
#Get and set working directory
setwd ( getwd ( ) )
Install and load required libraries if necessary.
#Check installed status of required packages, and install if necessary
list.of.packages <-
c("dplyr", "ggplot2", "scales", "kableExtra")
new.packages <-
list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings (suppressMessages (library (dplyr)))
suppressWarnings (suppressMessages (library (ggplot2)))
suppressWarnings (suppressMessages (library (scales)))
suppressWarnings (suppressMessages (library (kableExtra)))
Load the data set.
load (url ("https://d18ky98rnyall9.cloudfront.net/_5db435f06000e694f6050a2d43fc7be3_gss.Rdata?Expires=1512950400&Signature=e0MG-vaA6qgj2s~0UUc66fzMRzD1pF5VipKXuLmRpDdK63zaosyMDJnY-TX9WhLjBMH1W-WVGk1iDFO~inCMCbzx01u8ws~ze5fNvgE8Swxj-ejzhusOrgtDqYnoLQrpIEmQjUHduYKTCwHanFbdwIQcYiD79f3ktEUVYsIYRSI_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"))
The vast majority of GSS data is obtained in face-to-face interviews. Computer-assisted personal interviewing (CAPI) began in the 2002 GSS. Under some conditions, when it has proved difficult to arrange an in-person interview with a sampled respondent, GSS interviews may be conducted by telephone. [@http://gss.norc.org/Pages/Faq.aspx]
The target population of the GSS is adults (18+) living in households in the United States. The GSS sample is drawn using an area probability design that randomly selects respondents in households across the United States to take part in the survey. Respondents who become part of the GSS sample are from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary. However, because only a few thousand respondents are interviewed in the main study, every respondent selected is very important to the results. [@https://en.wikipedia.org/wiki/General_Social_Survey]
Because the GSS uses random sampling, the results can be generalized to the target population. However, this is an observational study with no random assignment of participants, so we cannot determine causality.
We will research whether there is a statistically significant difference in family income between political views. This is of interest because political alignment is indicative of personality: conservatives tend to be more conscientious, a trait that should be a predictor of higher earnings potential. [@https://www.economist.com/blogs/democracyinamerica/2012/05/personality-and-polarisation]
For our exploratory data analysis, political views (variable “polviews”) will be our explanatory variable, and “total family income in constant dollars” (variable “coninc”) will be our response variable.
Let’s get a feel for our explanatory variable, “polviews,” by determining the unique values.
unique (gss$polviews)
## [1] <NA> Moderate Slightly Conservative
## [4] Conservative Liberal Extrmly Conservative
## [7] Slightly Liberal Extremely Liberal
## 7 Levels: Extremely Liberal Liberal Slightly Liberal ... Extrmly Conservative
There are 7 unique political views, excluding NA values.
Let’s visualize the distribution of income per political view by using box plots. Note that we will exclude NAs.
ggplot(data = subset(gss,!is.na(polviews) & !is.na(coninc)), aes(x = polviews, y = coninc)) +
geom_boxplot(fill = "#56B4E9") +
labs(title = "Income by Political Views", x = "Political View", y = "Income") +
#format y-scale
scale_y_continuous(labels = dollar, breaks = seq(0, 200000, by = 25000))
We can see that conservative views appear to have higher median incomes, but also more variability. Also, it appears that income decreases as political views become more moderate.
Let’s calculate summary statistics.
#Compute summary stats
GSSSummary <- gss %>%
filter(!is.na(polviews)) %>%
group_by (polviews) %>%
summarise (
Respondents = n (),
MinIncome = min(coninc, na.rm = TRUE),
MaxIncome = max(coninc, na.rm = TRUE),
AverageIncome = mean(coninc, na.rm = TRUE),
MedianIncome = median(coninc, na.rm = TRUE),
IncomeIQR = IQR(coninc, na.rm = TRUE)
) %>%
arrange (desc(AverageIncome))
#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
GSSSummary %>%
kable("html") %>%
kable_styling()

| polviews | Respondents | MinIncome | MaxIncome | AverageIncome | MedianIncome | IncomeIQR |
|---|---|---|---|---|---|---|
| Slightly Conservative | 7691 | 402 | 180386 | 50707.66 | 42083.0 | 44900.00 |
| Conservative | 7092 | 383 | 180386 | 49738.12 | 39695.0 | 46377.25 |
| Slightly Liberal | 6181 | 383 | 180386 | 45256.97 | 36482.0 | 40498.00 |
| Liberal | 5582 | 383 | 180386 | 44259.32 | 34470.0 | 41715.00 |
| Extrmly Conservative | 1506 | 402 | 180386 | 42261.62 | 31854.0 | 41223.00 |
| Moderate | 18494 | 383 | 180386 | 42100.79 | 34470.0 | 37866.00 |
| Extremely Liberal | 1330 | 383 | 178712 | 39147.52 | 29065.5 | 39030.25 |
We do see that “Slightly Conservative” has the highest mean and median income, while “Conservative” has the most variability (the largest IQR). Also, we see the “Moderate” view has the most respondents. Let’s now determine whether income differs significantly between the political views.
Let’s also gauge normality with a quantile-quantile plot.
qplot(sample = coninc, data = subset(gss, !is.na(polviews) & !is.na(coninc)), color=polviews)
We can see our income distribution is right-skewed for each political view. Since we are comparing more than two groups of a categorical explanatory variable, and we want to mitigate the impact of the skewness, we will use a Kruskal-Wallis Test.
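As a quick numeric companion to the quantile-quantile plot, the sketch below shows why a mean sitting above the median is a typical sign of right skew. It uses simulated, hypothetical incomes drawn from a log-normal distribution, not the GSS data; the names `simulated_income`, `income_mean`, and `income_median` and the distribution parameters are assumptions made only for this illustration.

```r
# Simulated, hypothetical incomes from a log-normal distribution (not GSS data)
set.seed(42)
simulated_income <- rlnorm(1000, meanlog = 10.5, sdlog = 0.8)

# In a right-skewed sample the long upper tail pulls the mean above the median
income_mean <- mean(simulated_income)
income_median <- median(simulated_income)
mean_exceeds_median <- income_mean > income_median

print(c(mean = income_mean, median = income_median))
```

The same pattern appears in the summary table above, where AverageIncome exceeds MedianIncome for every political view.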
As mentioned, we will use a Kruskal-Wallis Test for statistical significance. We will also calculate pairwise comparisons with Wilcoxon Rank Sum Tests.
The Kruskal-Wallis test is the non-parametric alternative to the one-way ANOVA. Non-parametric means that the test doesn’t assume your data comes from a particular distribution. The H test is used when the assumptions for ANOVA aren’t met (like the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points. The test determines whether the medians of two or more groups are different. Like most statistical tests, you calculate a test statistic and compare it to a distribution cut-off point. The test statistic used in this test is called the H statistic. The hypotheses for the test are:
H0: the population medians are equal.
H1: the population medians are not equal.
[@http://www.statisticshowto.com/kruskal-wallis/]
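To make the H statistic described above concrete, here is a minimal sketch that computes it by hand from ranks and compares the result with base R's `kruskal.test()`. The toy `values` and `groups` are hypothetical and chosen to contain no ties, so the tie correction that `kruskal.test()` applies has no effect; this is an illustration of the standard formula, not part of the GSS analysis.

```r
# Toy data: three hypothetical groups with no tied values
values <- c(27, 31, 45, 52, 38, 60, 74, 49, 81, 95, 66, 70)
groups <- factor(rep(c("A", "B", "C"), each = 4))

# Rank all observations together, ignoring group membership
ranks <- rank(values)
n_total <- length(values)

# Sum of ranks and sample size per group
rank_sums <- tapply(ranks, groups, sum)
group_sizes <- tapply(ranks, groups, length)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), valid when there are no ties
h_manual <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / group_sizes) -
  3 * (n_total + 1)

# p-value from the chi-squared approximation with k - 1 degrees of freedom
p_manual <- pchisq(h_manual, df = nlevels(groups) - 1, lower.tail = FALSE)

# The built-in test should report the same statistic and p-value
builtin <- kruskal.test(values ~ groups)
h_builtin <- unname(builtin$statistic)
p_builtin <- builtin$p.value
```

Because only the ranks enter the calculation, extreme incomes influence the statistic no more than any other top-ranked value, which is why the test suits skewed data.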
Since this is a non-parametric test and no population parameters are estimated, we won’t construct confidence intervals.
The assumptions for the Kruskal-Wallis Test are:
Based on the GSS methodology, assumptions 1 & 2 are met. Our dependent variable, income, is measured on at least an ordinal scale, so we also meet assumption 3.
First, let’s create a subset of the data to remove NA values.
gsssubset <- subset(gss,!is.na(polviews) & !is.na(coninc))
Now, we will perform the Kruskal-Wallis test and evaluate it at a 0.05 significance level.
kruskal.test(coninc ~ polviews, data = gsssubset)
##
## Kruskal-Wallis rank sum test
##
## data: coninc by polviews
## Kruskal-Wallis chi-squared = 466.68, df = 6, p-value < 2.2e-16
As the p-value is much less than the significance level 0.05, we can conclude that there are significant differences between political views when comparing family income.
Next, we will evaluate the paired combinations of political views with Pairwise Wilcoxon Rank Sum Tests, using Bonferroni correction.
pairwise.wilcox.test(gsssubset$coninc, gsssubset$polviews, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: gsssubset$coninc and gsssubset$polviews
##
##                       Extremely Liberal Liberal Slightly Liberal Moderate
## Liberal               1.2e-06           -       -                -
## Slightly Liberal      1.1e-11           0.08582 -                -
## Moderate              1.5e-06           1.00000 6.7e-06          -
## Slightly Conservative < 2e-16           < 2e-16 < 2e-16          < 2e-16
## Conservative          < 2e-16           < 2e-16 2.7e-08          < 2e-16
## Extrmly Conservative  0.28242           0.37118 0.00063          0.99836
##                       Slightly Conservative Conservative
## Liberal               -                     -
## Slightly Liberal      -                     -
## Moderate              -                     -
## Slightly Conservative -                     -
## Conservative          0.10929               -
## Extrmly Conservative  < 2e-16               3.9e-13
##
## P value adjustment method: bonferroni
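The Bonferroni adjustment reported above can be sketched in a few lines: each unadjusted p-value is multiplied by the number of comparisons and capped at 1. With seven political views there are choose(7, 2) = 21 pairs. The `raw_p` values below are hypothetical and serve only to show the arithmetic; they are not taken from the GSS results.

```r
# Seven political-view groups give choose(7, 2) pairwise comparisons
n_comparisons <- choose(7, 2)

# Hypothetical unadjusted p-values, for illustration only
raw_p <- c(0.0001, 0.004, 0.06)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
adjusted_manual <- pmin(1, raw_p * n_comparisons)

# The same adjustment through base R's p.adjust()
adjusted_builtin <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

print(adjusted_manual)
```

This is a conservative correction: a pair is flagged as significant at the 0.05 level only if its unadjusted p-value is below 0.05 / 21, or roughly 0.0024.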
We can see multiple indications of significance between groups. What’s compelling is that no significant difference exists between “Moderate” and “Liberal,” or between “Moderate” and “Extrmly Conservative,” but the two “slightly” based views, as well as “Conservative,” do differ significantly from “Moderate.”
Furthermore, we do see significant differences between liberal and conservative alignments, but interestingly enough, not between the extreme ends of the spectrum.
Additional research topics could include unpacking the political view further, with additional dimensions such as social and fiscal views within each political view. This would enable multivariate analysis to increase understanding of the between-group variance, e.g. an extremely conservative individual may not be fiscally conservative, and this could explain why the median income is lower for said group.