In this study we examine the relationship between a person’s current family income and the income of the family they lived in as a teenager, at age 16. Discovering the factors that might determine whether someone becomes wealthy has always been an interesting research topic. Passion and hard work are commonly believed to be ingredients of financial success, but in this paper we look at what impact the financial background of the family a person grew up in has on the income of that person’s own family in adulthood. Intuitively, we can presume that coming from a wealthier family is beneficial to one’s own financial situation later in life, but here we try to gain some statistically grounded insight into this factor and see how important it really is. So, the question we will try to answer is this:
To make our research more relevant we will concentrate only on the respondents whose age is between 30 and 65, which should represent the years when a person has finished schooling and started working, but is not yet retired. To answer this question we will analyze the available financial data of American citizens, using exploratory data analysis and inferential methods to gain meaningful insight into this topic.
The dataset we are going to use for our research was extracted from the General Social Survey (GSS): a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The dataset’s Codebook lists all variables, the values they take, and the survey questions associated with them. There are a total of 57061 cases and 114 variables in this dataset.
It’s important to note that this is a cumulative data file for surveys conducted between 1972 and 2012 and that not all respondents answered all questions in all years.
Some background on the dataset taken from GSS project description:
“Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.”
The data collection was conducted through computer-assisted personal interview (CAPI), face-to-face interview and telephone interviews. All respondents were US citizens, 18 years old and over, living in households at the time of survey (or non-institutionalised) in the metropolitan and rural areas of the US.
Since the data were collected through a survey, this is an observational study, which prevents us from drawing any causal conclusions about the variables of interest, but still lets us explore the relationships between them.
Although GSS survey spans over four decades and different survey methods were used during that period, multiple levels of stratification for region, race, age, income and sex were employed to guarantee a random sample. Thus, the dataset is considered to consist of random observations taken from US citizens, enabling us to conduct the research which can be generalized to the entire US population.
We will use two variables from the GSS survey, both containing financial data about the respondent, but at different times in their life:
The respondents’ answers fall into one of these 6 categories:
Since the last category, “Lived In Institution”, does not carry any meaningful information about the respondent’s family financial background, and since there were only 9 such cases in the dataset, we removed them. This left us with 39676 observations, which is more than enough for our study.
Also, we used the Age of respondent variable to remove the cases where respondents were younger than 25 or older than 60 years, keeping only the observations where respondents are presumably in the period between college and retirement, when they should be actively working. The age data was then dropped from the final dataset since it wasn’t needed at later stages of the research.
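The filtering steps described above can be sketched in R as follows. This is only a minimal sketch on a made-up data frame, not the code used for this study: the data frame `gss_toy`, its column names (`age`, `income16`, `family_income`) and the helper `prepare_income_data()` are hypothetical, the age bounds are passed as parameters, and dropping rows with missing values is an assumption.

```r
# Hypothetical toy stand-in for the GSS extract (names and values are made up).
gss_toy <- data.frame(
  age = c(22, 34, 45, 58, 71, 40, NA),
  income16 = c("Average", "Below Average", "Lived In Institution",
               "Above Average", "Average", "Far Below Average", "Average"),
  family_income = c(18519, 41667, 25926, 60185, 33333, NA, 50926),
  stringsAsFactors = FALSE
)

# Ordered labels of the five income classes kept in the analysis.
income_levels <- c("Far Below Average", "Below Average", "Average",
                   "Above Average", "Far Above Average")

# Keep respondents inside the age window, drop the "Lived In Institution"
# class and (assumption) rows with missing values, then drop the age column.
prepare_income_data <- function(df, age_min = 25, age_max = 60) {
  keep <- !is.na(df$age) & df$age >= age_min & df$age <= age_max &
    !is.na(df$income16) & df$income16 != "Lived In Institution" &
    !is.na(df$family_income)
  out <- df[keep, c("income16", "family_income")]
  out$income16 <- factor(out$income16, levels = income_levels)
  rownames(out) <- NULL
  out
}

dat_toy <- prepare_income_data(gss_toy)
print(dat_toy)
```

Passing the age bounds as arguments makes it easy to rerun the preparation with a different working-age window.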
Figure 1. Distribution of the respondents by age before removing unneeded observations.
After removing the unwanted observations our prepared dataset contains 24985 observations and its summary looks like this:
##               Income16      FamilyIncome
##  Far Below Average: 2246   Min.   :   383
##  Below Average    : 6440   1st Qu.: 24258
##  Average          :12178   Median : 42215
##  Above Average    : 3672   Mean   : 50962
##  Far Above Average:  449   3rd Qu.: 67035
##                            Max.   :180386
We can observe from this summary that almost half of all observations for the Income16 (family income at age 16) variable fall into the “Average” income category and that the smallest number of cases (below 2%) belongs to the “Far Above Average” category. The contingency table below, with counts and percentages, gives a better look at the proportions of the Income16 categories.
##            Far Below Average Below Average  Average Above Average Far Above Average
## Count                2246.00       6440.00 12178.00        3672.0             449.0
## Percentage              8.99         25.78    48.74          14.7               1.8
We can see from the frequency table that 83.5% of respondents spent their teenage years in families with far below average, below average or average income, while the remaining 16.5% came from families with above average or far above average income.
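The percentages in the frequency table can be re-derived from the reported counts. The sketch below uses only the counts printed above; the object names are illustrative. With the raw data the same table would typically come from `table()` and `prop.table()` applied to the Income16 column.

```r
# Category counts for Income16 as reported in the summary output.
counts <- c("Far Below Average" = 2246, "Below Average" = 6440,
            "Average" = 12178, "Above Average" = 3672,
            "Far Above Average" = 449)

# Percentage of respondents in each category, rounded to two decimals.
pct <- round(100 * counts / sum(counts), 2)
freq_table <- rbind(Count = counts, Percentage = pct)
print(freq_table)

# Share of respondents at or below the "Average" class, and the remainder.
share_up_to_average <- sum(counts[1:3]) / sum(counts)
share_above_average <- 1 - share_up_to_average
print(round(100 * c(share_up_to_average, share_above_average), 1))
```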
Looking at the summary of the FamilyIncome variable, which represents the yearly amount of money in US$ that the respondent’s family earned at the time the survey was taken, we can notice that the distribution of these values is highly right skewed, with a median of 42215 US$, a mean of 50962 US$ and a maximum value of 180386 US$. This can be easily observed in the barplot and histogram of the corresponding variables below:
Figure 2: Distribution plots of the respondents’ family income when they were 16-years old and today.
From these distribution plots we can also observe that both distributions are unimodal, and that the FamilyIncome variable has some outliers, with the highest income values almost 3 times larger than its 3rd quartile value of 67035 US$.
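As a rough check of the outlier remark, the reported quartiles can be plugged into the usual boxplot rule, which flags values above Q3 + 1.5 × IQR. The sketch uses only the numbers from the summary output; the choice of the 1.5 × IQR rule is an assumption, since the text does not say how outliers were identified.

```r
# Quartiles and maximum of FamilyIncome as reported in the summary output.
q1 <- 24258
q3 <- 67035
max_income <- 180386

iqr <- q3 - q1                       # interquartile range
upper_fence <- q3 + 1.5 * iqr        # boxplot rule for flagging high outliers
ratio_max_to_q3 <- max_income / q3   # how far the maximum sits above Q3

print(c(iqr = iqr, upper_fence = upper_fence,
        ratio_max_to_q3 = round(ratio_max_to_q3, 2)))
print(max_income > upper_fence)      # TRUE means the maximum is flagged
```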
Since we are mainly interested in how one’s family financial background impacts future income, we would like to find out what the relationship between these two variables is. Looking at the boxplot in Figure 3 for the categorical Income16 variable versus the continuous FamilyIncome variable, we can see a positive relationship between them: respondents coming from families with higher income tend to have higher family income themselves in their adult years. This relationship is clear and almost linear when comparing the FamilyIncome medians across the Income16 categories, the only exception being respondents from “Far Above Average” income families, whose median is slightly lower than that of the “Above Average” group, which is interesting and can be seen in the plot below:
Figure 3: Respondents’ family income at age 16 in relation to their current family income.
The following table lists the FamilyIncome mean and median values, together with the standard deviations, for each category of the Income16 variable, and supports the relationship between the two variables. This observation gives us a good reason for the next step, which is to check whether the differences between group means are statistically significant or could have arisen by chance, for example due to sampling variability.
##            Income16 Count Mean(FamilyIncome) Median(FamilyIncome) SD(FamilyIncome)
## 1 Far Below Average  2246           39579.10                30806         34245.98
## 2     Below Average  6440           45893.69                39095         34160.96
## 3           Average 12178           51403.73                43779         35452.52
## 4     Above Average  3672           63689.24                53507         41983.03
## 5 Far Above Average   449           64520.57                50926         49337.72
Table 1: Levels of Income16 variable with corresponding number of observations, mean and median values, and standard deviations.
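A table of this kind can be produced with base R by splitting the income values by group. The sketch below is illustrative only: it runs on the 18-row excerpt of the dataset listed at the end of this document rather than on the full data, so its numbers will not match Table 1, and the helper `summarise_by_group()` is a hypothetical name.

```r
income_levels <- c("Far Below Average", "Below Average", "Average",
                   "Above Average", "Far Above Average")

# The 18 rows from the dataset excerpt shown at the end of the document.
excerpt <- data.frame(
  Income16 = factor(c("Average", "Below Average", "Far Below Average",
                      "Below Average", "Average", "Below Average",
                      "Far Below Average", "Average", "Far Below Average",
                      "Average", "Above Average", "Above Average",
                      "Average", "Average", "Average", "Average",
                      "Below Average", "Average"),
                    levels = income_levels),
  FamilyIncome = c(33333, 69444, 25926, 18519, 18519, 18519, 18519, 25926,
                   18519, 25926, 60185, 50926, 83333, 41667, 41667, 41667,
                   69444, 41667)
)

# Count, mean, median and standard deviation of FamilyIncome per group.
# Empty groups (here "Far Above Average") are dropped before summarising.
summarise_by_group <- function(df) {
  groups <- split(df$FamilyIncome, droplevels(df$Income16))
  data.frame(
    Income16 = names(groups),
    Count = vapply(groups, length, integer(1)),
    Mean = vapply(groups, mean, numeric(1)),
    Median = vapply(groups, median, numeric(1)),
    SD = vapply(groups, sd, numeric(1)),
    row.names = NULL
  )
}

group_table <- summarise_by_group(excerpt)
print(group_table)
```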
The goal of this study is to find out whether the mean current family income of respondents differs significantly across the self-reported income classes of their families when they were 16 years old. To compare these means we will use the analysis of variance (ANOVA) test and its F statistic.
We will use statistical inference methods to test the null hypothesis \(H_{0}\), which states that the means of the current family income are the same across all categories of family income at age 16, against the alternative hypothesis \(H_{A}\), which states that at least one pair of means differs.
\(H_{0}: \mu_{FBA} = \mu_{BA} = \mu_{A} = \mu_{AA} = \mu_{FAA}\)
\(H_{A}\) : the average current family income is different for at least one pair of groups
For the ANOVA test to produce meaningful results some conditions need to be fulfilled.
Figure 4. Normal probability plot for the FamilyIncome variable grouped by Income16 levels.
We conclude that the conditions for the ANOVA test are not entirely fulfilled, because of the identified deviations from normality within the groups of the Income16 variable and the larger variance in the “Far Above Average” group, which introduces some uncertainty into our results.
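Both concerns can be illustrated with the group statistics from Table 1. The sketch below is a rough diagnostic, not a formal test: it compares the largest and smallest group standard deviations (a common rule of thumb treats ratios below about 2 as tolerable, which is an assumption and not a criterion stated in this study) and uses mean greater than median as a simple sign of right skew.

```r
# Group statistics copied from Table 1.
group_stats <- data.frame(
  Income16 = c("Far Below Average", "Below Average", "Average",
               "Above Average", "Far Above Average"),
  Mean = c(39579.10, 45893.69, 51403.73, 63689.24, 64520.57),
  Median = c(30806, 39095, 43779, 53507, 50926),
  SD = c(34245.98, 34160.96, 35452.52, 41983.03, 49337.72)
)

# Constant-variance check: ratio of the largest to the smallest group SD.
sd_ratio <- max(group_stats$SD) / min(group_stats$SD)
widest_group <- group_stats$Income16[which.max(group_stats$SD)]

# Near-normality check: a mean above the median suggests a right-skewed group.
group_stats$RightSkewed <- group_stats$Mean > group_stats$Median

print(round(sd_ratio, 2))
print(widest_group)
print(group_stats[, c("Income16", "RightSkewed")])
```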
ANOVA can tell us whether something interesting is going on in our data, namely whether at least one pair of means is statistically different. It uses the F test statistic, which represents a standardized ratio of the variability between the groups to the variability within the groups. A large F statistic represents stronger evidence against the null hypothesis, and to obtain a large F statistic the variability between the sample means needs to be greater than the variability within the groups.
\[F = \frac{\text{variability between groups}}{\text{variability within groups}}\]
anova(lm(FamilyIncome ~ Income16, data=dat))
## Analysis of Variance Table
##
## Response: FamilyIncome
##              Df     Sum Sq    Mean Sq F value    Pr(>F)
## Income16      4 1.1362e+12 2.8404e+11  214.93 < 2.2e-16 ***
## Residuals 24980 3.3013e+13 1.3216e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analyzing the ANOVA results we can see that the obtained F statistic of 214.93 is very large and that the p-value is almost zero. In our case that means that the probability of observing an F statistic of 214.93 or larger if the null hypothesis were true is really low, close to zero. So, given that the p-value is small enough, we reject the null hypothesis and conclude that the data provide statistically significant evidence that at least one pair of mean current family incomes in the US population differs across the family income groups at age 16.
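The F statistic can be re-derived by hand from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses the rounded values printed in the table, so the result agrees with the reported 214.93 only up to rounding; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom from the ANOVA table.
ss_between <- 1.1362e+12   # Income16 row
df_between <- 4            # k - 1 groups
ss_within  <- 3.3013e+13   # Residuals row
df_within  <- 24980        # n - k observations

# Mean squares: variability between groups and within groups.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variability.
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution under the null hypothesis.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 2.2e-16)
```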
Now that we have rejected the null hypothesis we would like to find out which groups of the family income at age 16 variable have different means. We can do that by conducting a pairwise test between each pair of groups, which for \(k = 5\) groups produces \(K = \frac{k(k-1)}{2} = 10\) tests. To account for the resulting inflation of the Type I error rate we’ll apply the Bonferroni correction, which uses a more stringent significance level to reject the null hypothesis \(H_{0}\) by dividing the significance level \(\alpha\) by the number of comparisons performed: \(\alpha^* = \frac{\alpha}{K}\).
pairwise.t.test(dat$FamilyIncome, dat$Income16, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dat$FamilyIncome and dat$Income16
##
##                   Far Below Average Below Average Average Above Average
## Below Average     1.4e-11           -             -       -
## Average           < 2e-16           < 2e-16       -       -
## Above Average     < 2e-16           < 2e-16       < 2e-16 -
## Far Above Average < 2e-16           < 2e-16       6.2e-13 1
##
## P value adjustment method: bonferroni
The results from the pairwise test show that for nine of the ten pairs the p-value is many times lower than the significance level, which means that the difference in means between those groups is statistically significant. Note that with the Bonferroni adjustment R reports p-values that are already multiplied by \(K = 10\), so they are compared with \(\alpha\) itself; this is equivalent to comparing the unadjusted p-values with \(\alpha^* = \frac{\alpha}{K}\), which for the conventional \(\alpha = 0.05\) is 0.005. For the “Above Average - Far Above Average” pair we cannot reject the null hypothesis since its p-value is high, so between these two groups we can’t confirm any significant difference in means (the reported value of exactly 1 arises because Bonferroni-adjusted p-values are capped at 1).
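The Bonferroni arithmetic, and two of the entries in the pairwise table, can be approximately reproduced from Table 1 and the residual mean square of the ANOVA table. This is a sketch under stated assumptions: \(\alpha = 0.05\) is assumed as the conventional level, the inputs are the rounded values printed in this document, and the helper `pooled_pair_p()` is a hypothetical name, so the results match R’s output only approximately.

```r
k <- 5
K <- choose(k, 2)           # number of pairwise comparisons
alpha <- 0.05               # assumed conventional significance level
alpha_star <- alpha / K     # Bonferroni-corrected level

# Pooled variance (residual mean square) and its degrees of freedom.
mse <- 1.3216e+09
df_resid <- 24980

# Group sizes and means from Table 1.
n <- c(FBA = 2246, BA = 6440, A = 12178, AA = 3672, FAA = 449)
m <- c(FBA = 39579.10, BA = 45893.69, A = 51403.73,
       AA = 63689.24, FAA = 64520.57)

# Unadjusted two-sided p-value of a t test with pooled SD for one pair.
pooled_pair_p <- function(g1, g2) {
  se <- sqrt(mse * (1 / n[[g1]] + 1 / n[[g2]]))
  t_stat <- (m[[g2]] - m[[g1]]) / se
  2 * pt(-abs(t_stat), df = df_resid)
}

raw_p <- c(FBA_vs_BA = pooled_pair_p("FBA", "BA"),
           AA_vs_FAA = pooled_pair_p("AA", "FAA"))

# Bonferroni adjustment multiplies each p-value by K and caps it at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni", n = K)

print(c(K = K, alpha_star = alpha_star))
print(signif(raw_p, 2))
print(signif(adj_p, 2))
```

Comparing `adj_p` with `alpha` gives the same decisions as comparing `raw_p` with `alpha_star`.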
This study has shown a positive association between the family income of US citizens when they were teenagers at age 16, living in their parents’ homes, and their current family income when they are adults of age 30 and over. This establishes a link between growing up in a higher-income family and having a higher income in one’s own family later on, but it doesn’t mean we can draw any causal conclusions from this relationship, since the survey was observational and not experimental.
The data used for this research came from the General Social Survey and spanned the years 1972-2012. The observations were considered to be random, so we were able to generalize our conclusions to the whole US population. We performed an initial exploration of the data by looking at the respondents’ current family income relative to the income level group their family belonged to when they were 16 years old, as identified by the respondents themselves.
After visually confirming the relationship between these two variables we conducted the ANOVA test and pairwise comparisons between all pairs of groups, and found that mean current incomes indeed differ significantly for all pairs except the “Above Average - Far Above Average” pair, for which we couldn’t find any significant difference in means. Looking at the boxplot in Figure 3 we can see why that might be, since the median for the “Far Above Average” group is actually slightly lower than for the “Above Average” group (50926 vs. 53507) and its variance is considerably larger.
However, some of the conditions for the tests we conducted were not completely met, so we have to be cautious when interpreting the results and should not consider our conclusion final. A more advanced analysis using other statistical methods could provide more reliable results, and an experimental design would make it possible to explore any causality between a person’s family financial background and future earnings, which might prove to be an interesting research topic.
Official General Social Survey website.
Data Citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11.
Original dataset can be found at: http://doi.org/10.3886/ICPSR34802.v1
The extract from the General Social Survey Cumulative dataset, 1972-2012, modified for the Data Analysis and Statistical Inference course (Duke University) and used in this study, can be downloaded at http://bit.ly/dasi_gss_data.
The excerpt from the dataset used:
##             Income16 FamilyIncome
## 3            Average        33333
## 5      Below Average        69444
## 10 Far Below Average        25926
## 11     Below Average        18519
## 12           Average        18519
## 13     Below Average        18519
## 14 Far Below Average        18519
## 15           Average        25926
## 16 Far Below Average        18519
## 18           Average        25926
## 19     Above Average        60185
## 21     Above Average        50926
## 22           Average        83333
## 26           Average        41667
## 27           Average        41667
## 28           Average        41667
## 37     Below Average        69444
## 38           Average        41667
This RMarkdown document was produced with RStudio v0.99.486 on R v3.2.2.