General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University).
The R dataset can be downloaded from http://bit.ly/dasi_gss_data.
load(url("http://bit.ly/dasi_gss_data"))
The study spans 40 years, and the collection process was modified nearly every decade.
The data were collected through household interviews in metropolitan and rural areas of the United States. Multiple levels of stratification by region, race, age, income and sex were employed to keep the random sample representative. About 1500-2000 cases were collected each year, with a slight increase in recent years.
The cases are adults residing in the United States, interviewed in their households.
Type of variable: categorical, ordinal.
Levels: Lt High School, High School, Junior College, Bachelor, Graduate
The study consists of interviews with a random sample of United States residents about their economic condition, working status, health, beliefs, etc. No treatment is assigned to the respondents, so the study is observational.
The population of interest is all adult United States residents. Because the study employed random sampling, the results can be generalized to that population.
Because the study is observational, we can only establish associations, not causal links, between the variables of interest.
The dataset, restricted to the partyid and coninc columns and with rows containing NA values removed, has 50393 cases.
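The report does not show how this subset was produced. A minimal sketch of one way to do it is given below; the helper name `keep_complete` and the toy data frame are hypothetical and only stand in for `gss`.

```r
# Hypothetical helper: keep only the requested columns,
# then drop every row that has an NA in any of them.
keep_complete <- function(df, cols) {
  sub <- df[, cols, drop = FALSE]
  sub[complete.cases(sub), , drop = FALSE]
}

# Toy data standing in for gss (values are made up for illustration).
toy <- data.frame(
  partyid = factor(c("Independent", NA, "Strong Democrat", "Other Party")),
  coninc  = c(33333, 25926, NA, 41667),
  year    = c(1972, 1972, 1973, 1973)
)

clean <- keep_complete(toy, c("partyid", "coninc"))
nrow(clean)  # number of complete cases in the toy data
```

With the real data, the equivalent call would presumably be `keep_complete(gss, c("partyid", "coninc"))`, followed by `nrow()` to count the remaining cases.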
partyid:
partyid is a categorical variable. We summarize it with table and plot.
table(gss$partyid)
##
##    Strong Democrat   Not Str Democrat       Ind,Near Dem
##               9117              12040               6743
##        Independent       Ind,Near Rep Not Str Republican
##               8499               4921               9005
##  Strong Republican        Other Party
##               5548                861
plot(gss$partyid)
We can see that Not Str Democrat is the most frequent category (12040 cases), followed by Strong Democrat (9117) and Not Str Republican (9005).
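Reading the ranking off an unsorted table is error-prone, so it can help to sort the counts and express them as percentages. The sketch below uses the counts copied from the `table(gss$partyid)` output above; the variable names `counts`, `ranked` and `shares` are illustrative.

```r
# Counts copied from the table(gss$partyid) output above.
counts <- c(
  "Strong Democrat"    = 9117,
  "Not Str Democrat"   = 12040,
  "Ind,Near Dem"       = 6743,
  "Independent"        = 8499,
  "Ind,Near Rep"       = 4921,
  "Not Str Republican" = 9005,
  "Strong Republican"  = 5548,
  "Other Party"        = 861
)

# Order from most to least frequent, then convert to percentages.
ranked <- sort(counts, decreasing = TRUE)
shares <- round(100 * ranked / sum(ranked), 1)
shares
```

On the full data frame, `sort(table(gss$partyid), decreasing = TRUE)` should give the same ordering without retyping the counts.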
Family Income in constant USD:
Family Income in constant USD is a continuous numerical variable. We summarize it with summary and a histogram.
summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   18400   35600   44500   59500  180000    5829
hist(gss$coninc)
head(gss[, c("coninc", "partyid")])
##   coninc          partyid
## 1  25926     Ind,Near Dem
## 2  33333 Not Str Democrat
## 3  33333      Independent
## 4  41667 Not Str Democrat
## 5  69444  Strong Democrat
## 6  60185     Ind,Near Dem
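Since the two variables are examined together, a per-group summary is a natural next step. The sketch below computes the median income within each party category using only the six rows printed above, so the numbers are illustrative and say nothing about the full sample; the names `first_rows` and `med_by_party` are hypothetical.

```r
# The six rows shown in the head() output above.
first_rows <- data.frame(
  coninc  = c(25926, 33333, 33333, 41667, 69444, 60185),
  partyid = c("Ind,Near Dem", "Not Str Democrat", "Independent",
              "Not Str Democrat", "Strong Democrat", "Ind,Near Dem")
)

# Median family income within each party identification group.
med_by_party <- tapply(first_rows$coninc, first_rows$partyid, median)
med_by_party
```

On the full data, the same idea would presumably be `tapply(gss$coninc, gss$partyid, median, na.rm = TRUE)`, or `boxplot(coninc ~ partyid, data = gss)` for a visual comparison.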