Research Proposal

Research Question

Is there a relationship between a person's political stance (party id) and family income in constant dollars?

Data - Citation

General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University).

The R dataset can be downloaded from http://bit.ly/dasi_gss_data.

load(url("http://bit.ly/dasi_gss_data"))

Citation for the original data:

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1. Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

Data - Collection

The study spans 40 years, and the collection process was modified nearly every decade (see http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf for details).

The data were collected through household interviews in metropolitan and rural areas of the United States. Multiple levels of stratification (by region, race, age, income and sex) were employed to obtain a representative random sample. About 1500-2000 cases were collected each year, with a slight increase in recent years.

Data - Cases (observational/experimental units)

The cases are adults residing in the United States who were interviewed in their households.

Data - Variables

Party ID:

Answer to the question: “Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?”.

Type of variable: categorical, ordinal (the Other Party level falls outside the Democrat-to-Republican ordering).

summary(gss$partyid)
##    Strong Democrat   Not Str Democrat       Ind,Near Dem 
##               9117              12040               6743 
##        Independent       Ind,Near Rep Not Str Republican 
##               8499               4921               9005 
##  Strong Republican        Other Party               NA's 
##               5548                861                327
str(gss$partyid)
##  Factor w/ 8 levels "Strong Democrat",..: 3 2 4 2 1 3 3 3 1 1 ...

Family Income in Constant Dollars:

Inflation-adjusted family income.

Type of variable: numerical, continuous.

summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829
str(gss$coninc)
##  int [1:57061] 25926 33333 33333 41667 69444 60185 50926 18519 3704 25926 ...

Data - Type of study

The study consists of interviews with a random sample of United States residents about their economic condition, working status, health, beliefs, etc. No variable was assigned or manipulated by the researchers, so the study is observational.

Data - Scope of inference - generalizability

The population of interest consists of all adult US residents living in households. The study employed random sampling, so the results can be generalized to that population.

Data - Scope of inference - causality

The study is observational, so we can only establish association but not causal links between the variables of interest.

Exploratory Data Analysis

The dataset, restricted to the partyid and coninc columns and with rows containing NA values removed, has 50393 cases.
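The subsetting and filtering step described above can be sketched as follows. This is only an illustration on a small made-up data frame called toy (a hypothetical stand-in, not GSS data); with the real data the same two steps would presumably be applied to gss.

```r
# Toy stand-in for gss: made-up values, NOT taken from the survey.
toy <- data.frame(
  partyid = factor(c("Strong Democrat", "Independent", NA,
                     "Strong Republican", "Independent")),
  coninc  = c(25926, NA, 33333, 41667, 18519),
  year    = c(1972, 1972, 1973, 1973, 1974)
)

# Keep only the two variables of interest.
sub <- toy[, c("partyid", "coninc")]

# Drop every row in which either variable is missing.
sub_complete <- sub[complete.cases(sub), ]

nrow(sub_complete)  # 3 complete rows in this toy example
```

complete.cases() returns TRUE only for rows without any NA, so a row is kept only when both partyid and coninc are present.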

partyid:

partyid is a categorical variable. We summarize it with a frequency table and a bar plot.

table(gss$partyid)
## 
##    Strong Democrat   Not Str Democrat       Ind,Near Dem 
##               9117              12040               6743 
##        Independent       Ind,Near Rep Not Str Republican 
##               8499               4921               9005 
##  Strong Republican        Other Party 
##               5548                861
plot(gss$partyid)

We can see that Not Str Democrat has the most cases (12040), followed by Strong Democrat (9117) and Not Str Republican (9005).
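Raw counts are easier to compare as shares. The sketch below re-enters the counts from the table above as a named vector (so that it runs without the data file) and converts them to percentages; on the data frame itself, prop.table(table(gss$partyid)) should give the same proportions.

```r
# Counts copied from the table() output above.
counts <- c(
  "Strong Democrat"    = 9117,
  "Not Str Democrat"   = 12040,
  "Ind,Near Dem"       = 6743,
  "Independent"        = 8499,
  "Ind,Near Rep"       = 4921,
  "Not Str Republican" = 9005,
  "Strong Republican"  = 5548,
  "Other Party"        = 861
)

# Percentage of non-missing responses in each category, largest first.
shares <- round(100 * counts / sum(counts), 1)
sort(shares, decreasing = TRUE)
```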

Family Income in constant USD:

Family income in constant USD is a continuous numerical variable. We summarize it with summary statistics and a histogram.

summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829
hist(gss$coninc)

We can see that the distribution is right skewed, which is consistent with the mean (44500) exceeding the median (35600).
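The skew can also be checked numerically. The sketch below uses a short made-up income vector and a hypothetical helper, skewness(), defined here as the third standardized moment (it is not a base R function). A right-skewed variable typically has a mean above its median and a positive skewness.

```r
# Hypothetical helper: sample skewness as the third standardized moment.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up incomes with one very large value and one missing value.
income <- c(383, 18519, 25926, 33333, 33333, 41667, 60185, 69444, 180400, NA)

mean(income, na.rm = TRUE)    # pulled upward by the large value
median(income, na.rm = TRUE)  # 33333
skewness(income)              # positive for a right-skewed sample
```

Applied to the survey data, the same calls would take gss$coninc in place of income.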

Is there a relationship between political stand and family income in constant dollars?

To explore the relationship between a categorical and a numerical variable, we plot coninc against partyid with ggplot2.

library(ggplot2)
ggplot(gss, aes(x = partyid, y = coninc)) + geom_bar(stat = "identity")
## Warning: Removed 5829 rows containing missing values (position_stack).

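We don’t see a clear positive or negative trend in family income across the party identification scale. The bars for not strong Democrats and not strong Republicans appear tallest, but geom_bar(stat="identity") stacks the individual incomes, so each bar shows the total income of a category and its height depends on the number of respondents as well as on their incomes. Since these are also among the largest groups, this plot alone cannot show whether they are associated with higher family income.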
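A per-group summary separates income level from group size. The sketch below is an illustration on made-up data (the labels Group A and Group B and all values are hypothetical): it compares the per-group total, which is what stacked identity bars display, with the per-group median, and then draws a boxplot of the same comparison using base R. With the survey data the formula would presumably be coninc ~ partyid on gss.

```r
# Made-up data: Group A is larger, Group B has higher individual incomes.
toy <- data.frame(
  partyid = factor(c("Group A", "Group A", "Group A", "Group A",
                     "Group B", "Group B", "Group B")),
  coninc  = c(10000, 20000, 30000, 40000, 50000, 40000, NA)
)

# Total income per group: grows with the number of respondents.
totals <- tapply(toy$coninc, toy$partyid, sum, na.rm = TRUE)

# Median income per group: does not depend on group size.
medians <- tapply(toy$coninc, toy$partyid, median, na.rm = TRUE)

totals   # Group A: 100000, Group B: 90000
medians  # Group A: 25000,  Group B: 45000

# Side-by-side boxplots show the distribution of income within each group.
boxplot(coninc ~ partyid, data = toy,
        xlab = "partyid", ylab = "coninc")
```

In this toy example the larger group has the higher total but the lower median, which is why totals can be misleading when group sizes differ.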

Data set

head(gss[, c("coninc", "partyid")])
##   coninc          partyid
## 1  25926     Ind,Near Dem
## 2  33333 Not Str Democrat
## 3  33333      Independent
## 4  41667 Not Str Democrat
## 5  69444  Strong Democrat
## 6  60185     Ind,Near Dem