Only children have been the subject of numerous studies, sometimes stigmatized as spoiled brats and other times regarded as high achievers. In this project, we will analyze the relationship between the number of siblings a person has or had and her/his level of education, as well as her/his family income. The question we will try to answer in this analysis is thus whether the number of siblings a person has is related to her/his level of education and family income.
To answer that question, we will use data from the General Social Survey (GSS) Cumulative File 1972-2012, which provides a sample of selected indicators on contemporary American society. Detailed information on this file can be found at https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html.
From the GSS file, we will use the following variables:
- sibs: respondent’s number of brothers and sisters - numeric variable;
- coninc: respondent’s total family income in constant dollars - numeric variable;
- degree: respondent’s highest degree - ordinal variable.
We first set the working directory and load the required libraries:
setwd("~/Repositories/Coursera_DataAnalysis_Duke/Project/")
library(reshape2)
library(ggplot2)
We then load the data (and cache it) and get some summary statistics:
load(url("http://bit.ly/dasi_gss_data"))
# only keep count of siblings, degree and constant family income
data <- gss[,c("sibs","degree","coninc")]
# get statistics
summary(data)
##       sibs                 degree          coninc
##  Min.   : 0.0   Lt High School:11822   Min.   :   383
##  1st Qu.: 2.0   High School   :29287   1st Qu.: 18445
##  Median : 3.0   Junior College: 3070   Median : 35602
##  Mean   : 3.9   Bachelor      : 8002   Mean   : 44503
##  3rd Qu.: 5.0   Graduate      : 3870   3rd Qu.: 59542
##  Max.   :68.0   NA's          : 1010   Max.   :180386
##  NA's   :1679                          NA's   :5829
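The NA's rows at the bottom of the summary can also be obtained directly. The sketch below runs on a small made-up data frame (`toy` and its values are hypothetical, not taken from the GSS file) and counts the missing values per column:

```r
# Toy data frame standing in for the three GSS columns (values are made up)
toy <- data.frame(
  sibs   = c(0, 2, NA, 5),
  degree = factor(c("High School", NA, "Bachelor", "Graduate")),
  coninc = c(18445, NA, NA, 59542)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na.counts <- colSums(is.na(toy))
print(na.counts)

# Share of missing values per column, as a fraction of all rows
na.share <- colMeans(is.na(toy))
print(round(na.share, 2))
```

Applied to the real data, the same two calls would be run on `data` instead of `toy`.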
We see that all three variables contain missing values (NA). To keep these missing values from biasing our analysis, we will fill them in using the following strategy:
# use median number of siblings
sibs.default <- median(data$sibs, na.rm = TRUE)
data[is.na(data$sibs),"sibs"] <- sibs.default
# use "High School"
degree.default <- "High School"
data[is.na(data$degree),"degree"] <- degree.default
# use median income by degree
subdata <- subset(data, !is.na(data$coninc))
aggdata <- aggregate(subdata$coninc, by=list(subdata$degree), FUN=median)
colnames(aggdata) <- c("degree","coninc")
for(d in aggdata$degree)
{
data[is.na(data$coninc) & data$degree == d, "coninc"] <- aggdata[aggdata$degree == d, "coninc"]
}
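The per-degree median imputation can also be written without an explicit loop. The following is only a sketch of an alternative, run on made-up values; `toy`, `group.median` and `missing` are hypothetical names introduced for illustration:

```r
# Toy data (made-up values): two degrees, one missing income in each
toy <- data.frame(
  degree = factor(c("High School", "High School", "High School",
                    "Bachelor", "Bachelor", "Bachelor")),
  coninc = c(20000, 30000, NA, 50000, NA, 70000)
)

# ave() applies FUN within each degree group and returns a vector aligned
# with the rows, so every row gets the median income of its own group
group.median <- ave(toy$coninc, toy$degree,
                    FUN = function(x) median(x, na.rm = TRUE))

# Replace only the missing incomes; observed values are left untouched
missing <- is.na(toy$coninc)
toy$coninc[missing] <- group.median[missing]
print(toy)
```

This assumes that every degree group has at least one observed income; a group with only missing values would keep its NA.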
In this section we will draw a few exploratory plots to get a first impression of each variable and the relationships between variables.
1.1. Number of Siblings
We observe that the distribution of the number of siblings is right-skewed and bounded at zero on the left. Both observations are expected: one cannot have a negative number of siblings, and we expect the count of respondents to decrease as the number of siblings increases:
ggplot(data, aes(x=sibs)) +
geom_histogram(binwidth=1, colour="black", fill="white") +
xlab("siblings") +
ggtitle("Distribution of Number of Siblings")
We see that a very small number of respondents report extreme numbers of siblings (up to 68). These data points are outliers which are likely to skew the results of our analysis. Therefore, we remove these entries from our data and focus on respondents who have 15 or fewer siblings:
data <- data[data$sibs <= 15,]
1.2. Education Level
Most respondents hold a high school diploma as their highest degree:
# degree is a factor, so count its levels with geom_bar (geom_histogram needs a continuous x)
ggplot(data, aes(x=degree)) +
geom_bar(color="black", fill="white") +
ggtitle("Distribution of Degree")
1.3. Family Income
We observe that the distribution of family income is right-skewed and bounded at zero on the left. Here again, both observations are expected: one cannot have a negative income, and we expect the count of respondents to decrease as income increases:
ggplot(data, aes(x=coninc)) +
geom_histogram(binwidth=5000, colour="black", fill="white") +
xlab("income") +
ggtitle("Distribution of Family Income")
2.1. Number of Siblings vs Degree
On the boxplots below, we see that as the level of education increases, the number of siblings decreases. For the two highest degrees (bachelor and graduate), the median number of siblings is similar but the interquartile range (IQR) clearly differs: it is narrower for the higher degree:
ggplot(data, aes(x=degree, y=sibs, fill=degree)) +
geom_boxplot(alpha=0.2) +
xlab("Degree") +
ylab("Siblings") +
ggtitle("Number of Siblings vs Degree")
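The median and IQR that the boxplots display can be computed numerically as well. The sketch below uses made-up sibling counts chosen so that two groups share a median but differ in spread; `toy`, `sibs.median` and `sibs.iqr` are hypothetical names:

```r
# Toy data (made-up values): same median in both groups, different spread
toy <- data.frame(
  degree = factor(c("Bachelor", "Bachelor", "Bachelor", "Bachelor",
                    "Graduate", "Graduate", "Graduate", "Graduate")),
  sibs   = c(1, 2, 3, 6, 1, 2, 3, 3)
)

# tapply() splits sibs by degree and applies the function to each group
sibs.median <- tapply(toy$sibs, toy$degree, median)
sibs.iqr    <- tapply(toy$sibs, toy$degree, IQR)

# One row per statistic, one column per degree
print(rbind(median = sibs.median, IQR = sibs.iqr))
```

On the real data, replacing `toy` with `data` would give one column per degree level.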
2.2. Number of Siblings vs Family Income
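The barplot below stacks the family incomes of all respondents who share a sibling count, so each bar shows a total that depends on how many respondents fall in the group as well as on how much they earn. These totals are highest for respondents with 1 to 3 siblings (the maximum being associated with 2 siblings). Respondents without siblings surprisingly account for a lower total than respondents with 5 siblings: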
ggplot(data, aes(x=sibs, y=coninc)) +
geom_bar(stat="identity", fill="grey") +
xlab("Siblings") +
ylab("Income") +
coord_flip() +
ggtitle("Number of Siblings vs Family Income")
It would be interesting to know the family situation of these respondents. This is outside the scope of this project, but an interesting question to answer would be: do people with fewer siblings tend to marry less often?
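Since geom_bar with stat="identity" adds up every row that shares an x value, a per-group median may give a clearer picture of typical income. The sketch below contrasts the two summaries on made-up values; `toy`, `income.total` and `income.median` are hypothetical names and the numbers are not GSS results:

```r
# Toy data (made-up values): the largest group has the lowest incomes
toy <- data.frame(
  sibs   = c(0, 0, 2, 2, 2, 2, 5),
  coninc = c(40000, 60000, 30000, 30000, 35000, 35000, 45000)
)

# Total income per sibling count: what a stacked identity bar displays
income.total <- aggregate(coninc ~ sibs, data = toy, FUN = sum)

# Median income per sibling count: independent of the group size
income.median <- aggregate(coninc ~ sibs, data = toy, FUN = median)

# Side-by-side comparison of both summaries
print(merge(income.total, income.median, by = "sibs",
            suffixes = c(".total", ".median")))
```

In this toy example the group with 2 siblings has the largest total only because it has the most respondents, which is why the two summaries can rank the groups differently.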
2.3. Degree vs Family Income
As expected, the violin plots below show that family income increases with the education level:
ggplot(data, aes(x=degree, y=coninc, fill=degree)) +
geom_violin(alpha=0.2) +
xlab("Degree") +
ylab("Income") +
ggtitle("Family Income vs Degree")
From the data gathered and the exploratory analysis provided above, there seems to be a negative relationship between the number of siblings and the level of education achieved (i.e. the fewer siblings, the higher the degree). This however does not seem to be entirely true for family income, as we have seen that only children do not necessarily have the highest incomes.
In the second part of this project, we will run hypothesis tests to further analyze these first impressions.