Exploring the relationship between a person’s family income level at age 16 and their current family income

Introduction

In this study we examine the relationship between a person’s current family income and the income of the family they lived in as a teenager, at age 16. The factors that might determine whether someone becomes wealthy have always been an interesting research topic. Passion and hard work are commonly believed to be among the ingredients of financial success, but in this paper we look at how the financial background of the family a person grew up in relates to the income of their own family in their adult years. Instinctively, we can presume that coming from a wealthier family is beneficial to one’s own financial situation later in life, but here we try to gain some statistically grounded insight into this factor and see how important it really is. The question we will try to answer is this:

What is the relationship between the income level of the family one grew up in and one’s own family income in adult years?

To make our research more relevant we concentrate only on respondents whose age is between 30 and 65, which should represent the years when a person has finished schooling and started working but has not yet retired. To answer this question we analyze the available financial data on American citizens, using exploratory data analysis and inferential methods to gain meaningful insight into this topic.

Data

The dataset we use was extracted from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The dataset’s codebook lists all variables, the values they take, and the survey questions associated with them. There are a total of 57061 cases and 114 variables in this dataset.

It’s important to note that this is a cumulative data file for surveys conducted between 1972 and 2012 and that not all respondents answered all questions in all years.

Some background on the dataset, taken from the GSS project description:

“Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.”

Data collection and survey design:

The data were collected through computer-assisted personal interviews (CAPI), face-to-face interviews and telephone interviews. All respondents were US citizens, 18 years old and over, living in households (that is, non-institutionalised) at the time of the survey, in metropolitan and rural areas of the US.

Since the data were collected through a survey, this is an observational study. That prevents us from drawing any causal conclusions about the variables of interest, but it still lets us explore the relationships between them.

Although the GSS spans more than four decades and different survey methods were used during that period, multiple levels of stratification by region, race, age, income and sex were employed to obtain a random sample. Thus, the dataset is considered to consist of random observations taken from US citizens, which allows us to generalize our findings to the entire US population.

Variables used in this study:

We use two variables from the GSS, both containing financial data about the respondent but at different times in their life:

Income16 - Respondent’s family income at age 16: categorical ordinal variable. Survey question for this variable was “Thinking about the time when you were 16 years old, compared with American families in general then, would you say your family income was far below average, below average, average, above average, or far above average?”

The respondents’ answers fall into one of these 6 categories:

Far Below Average
Below Average
Average
Above Average
Far Above Average
Lived In Institution

Since the last category, “Lived In Institution”, carries no meaningful information about the respondent’s family financial background, and since there were only 9 such cases in the dataset, we removed these cases. That left us with 39676 observations, more than enough for our study.

FamilyIncome - Family income in constant dollars: continuous numerical variable representing the respondent’s inflation-adjusted family income at the time the survey was conducted. We use inflation-adjusted income because our data span a period of more than four decades. The reported income values range from 383 to 180386 US dollars per year.

We also used the Age of respondent variable to remove the cases where respondents were younger than 25 or older than 60 years, keeping only the observations where respondents are presumably between college and retirement, when they should be actively working. Age was then dropped from the final dataset since it was not needed at later stages of our research.
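The filtering steps described above can be sketched roughly as follows. This is only an illustration on invented data: the data frame `gss_demo`, the helper `prepare_income_data`, the column names and the treatment of missing incomes are assumptions, and the age bounds are left as parameters rather than taken as the exact values used in the study.

```r
# Hypothetical miniature stand-in for the GSS extract (values invented for illustration).
gss_demo <- data.frame(
  Age          = c(22, 35, 47, 61, 40, 30),
  Income16     = factor(c("Average", "Below Average", "Lived In Institution",
                          "Average", "Above Average", "Far Below Average")),
  FamilyIncome = c(20000, 33333, 41667, 25926, NA, 18519)
)

# Assumed helper: keeps respondents inside the age window, drops the
# "Lived In Institution" category and rows with missing income,
# then removes the Age column, which is not needed afterwards.
prepare_income_data <- function(df, age_min = 25, age_max = 60) {
  kept <- subset(
    df,
    Income16 != "Lived In Institution" &
      Age >= age_min & Age <= age_max &
      !is.na(FamilyIncome)
  )
  kept$Income16 <- droplevels(kept$Income16)  # forget the now-empty factor levels
  kept$Age <- NULL
  kept
}

dat_demo <- prepare_income_data(gss_demo)
print(dat_demo)
```

Keeping the bounds as function arguments makes it easy to rerun the preparation with a different working-age window.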

Figure 1. Distribution of the respondents by age before removing unneeded observations.

Exploratory data analysis

After removing the unwanted observations our prepared dataset contains 24985 observations and its summary looks like this:

##               Income16      FamilyIncome   
##  Far Below Average: 2246   Min.   :   383  
##  Below Average    : 6440   1st Qu.: 24258  
##  Average          :12178   Median : 42215  
##  Above Average    : 3672   Mean   : 50962  
##  Far Above Average:  449   3rd Qu.: 67035  
##                            Max.   :180386

We can observe from this summary that almost half of all observations for the Income16 (family income at age 16) variable fall into the “Average” income category and that the smallest number of cases (below 2%) belongs to the “Far Above Average” category. The frequency table below gives a better look at the proportions of the Income16 categories.

##            Far Below Average Below Average  Average Above Average Far Above Average
## Count                2246.00       6440.00 12178.00        3672.0             449.0
## Percentage              8.99         25.78    48.74          14.7               1.8
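The percentages in this table follow directly from the category counts. A minimal sketch, assuming only the counts printed above (the names `counts` and `freq_table` are introduced here for illustration):

```r
# Category counts copied from the frequency table above.
counts <- c(
  "Far Below Average" = 2246,
  "Below Average"     = 6440,
  "Average"           = 12178,
  "Above Average"     = 3672,
  "Far Above Average" = 449
)

# prop.table() divides each count by the total; multiply by 100 for percentages.
freq_table <- rbind(
  Count      = counts,
  Percentage = round(100 * prop.table(counts), 2)
)
print(freq_table)

# Combined share of the three lowest categories, in percent.
low_share <- round(100 * sum(counts[1:3]) / sum(counts), 1)
print(low_share)
```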

The frequency table shows that 83.5% of respondents spent their teenage years in families with below-average or average income, while the remaining 16.5% came from families with above-average income.

Looking at the summary of the FamilyIncome variable, which represents the yearly amount in US$ that the respondent’s family earned at the time the survey was taken, we notice that the distribution of these values is highly right skewed, with a median of 42215 and a maximum of 180386 US$. This can be easily observed in the histogram of FamilyIncome below, shown alongside the barplot of Income16:

Figure 2: Distribution plots of the respondents’ family income when they were 16-years old and today.

From these distribution plots we can also observe that both distributions are unimodal, and that the FamilyIncome variable has some outliers, with the highest income values about 2.7 times larger than its third quartile value of 67035 US$.

Since we are mainly interested in how one’s family financial background relates to future income, we would like to find out what the relationship between these two variables is. Looking at the boxplot in Figure 3 of the continuous FamilyIncome variable grouped by the categorical Income16 variable, we see a positive relationship between the two: respondents coming from families with higher income tend to have higher family income themselves in their adult years. The relationship looks strong and close to linear when comparing the FamilyIncome medians across the Income16 categories. The only exception is that respondents from “Far Above Average” income families have a slightly lower median than those from “Above Average” income families, which is interesting and can be seen in the plot below:

Figure 3: Respondents’ family income at age 16 in relation to their current family income.

The following table lists the FamilyIncome mean and median values, together with the standard deviations, for each category of the Income16 variable, and supports the relationship between the two variables. This observation motivates the next step, which is to check whether the differences between group means are statistically significant or could have arisen by chance, for example due to sampling variability.

##            Income16 Count Mean(FamilyIncome) Median(FamilyIncome) SD(FamilyIncome)
## 1 Far Below Average  2246           39579.10                30806         34245.98
## 2     Below Average  6440           45893.69                39095         34160.96
## 3           Average 12178           51403.73                43779         35452.52
## 4     Above Average  3672           63689.24                53507         41983.03
## 5 Far Above Average   449           64520.57                50926         49337.72

Table 1: Levels of Income16 variable with corresponding number of observations, mean and median values, and standard deviations.
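A per-group summary of this kind can be computed with `tapply`. The sketch below is not the code used for Table 1; it runs on the 18-row excerpt listed in the Appendix, so its numbers differ from the table, and the names `excerpt` and `group_summary` are introduced only for illustration.

```r
# The 18 observations from the Appendix excerpt, in the order listed there.
excerpt <- data.frame(
  Income16 = factor(
    c("Average", "Below Average", "Far Below Average", "Below Average",
      "Average", "Below Average", "Far Below Average", "Average",
      "Far Below Average", "Average", "Above Average", "Above Average",
      "Average", "Average", "Average", "Average", "Below Average", "Average"),
    levels = c("Far Below Average", "Below Average", "Average", "Above Average")
  ),
  FamilyIncome = c(33333, 69444, 25926, 18519, 18519, 18519, 18519, 25926, 18519,
                   25926, 60185, 50926, 83333, 41667, 41667, 41667, 69444, 41667)
)

# Count, mean, median and standard deviation of FamilyIncome per Income16 level.
group_summary <- data.frame(
  Count  = as.vector(table(excerpt$Income16)),
  Mean   = as.vector(tapply(excerpt$FamilyIncome, excerpt$Income16, mean)),
  Median = as.vector(tapply(excerpt$FamilyIncome, excerpt$Income16, median)),
  SD     = as.vector(tapply(excerpt$FamilyIncome, excerpt$Income16, sd)),
  row.names = levels(excerpt$Income16)
)
print(group_summary)
```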

Inference

The goal of this study is to find out whether mean current family income differs in a statistically significant way across the self-reported income classes of the respondents’ families when they were 16 years old. To compare these means we use the analysis of variance (ANOVA) test and its F statistic.

We use statistical inference to test the null hypothesis $H_{0}$, which states that the means of current family income are the same across all categories of family income at age 16, against the alternative hypothesis $H_{A}$, which states that at least one pair of means differs.

$H_{0}: \mu_{FBA} = \mu_{BA} = \mu_{A} = \mu_{AA} = \mu_{FAA}$

$H_{A}$ : the average current family income is different for at least one pair of groups

Conditions for ANOVA:

For the ANOVA test to produce meaningful results some conditions need to be fulfilled.

Independence:
- Within groups - from the survey design we know that observations are independent, with the number of observations in each group less than 10% of the total population of US citizens.
- Between groups - Income16 categories are not paired, so they are independent of each other.

Approximate normality: we need to find out whether the distributions for each category are nearly normal. To check normality we use normal probability plots for each category of the Income16 variable. From these plots we can conclude that observations in some groups deviate from normality in the upper quantiles, especially in the three lowest groups, but their sample sizes are large enough that this should not be a significant problem later on.

Figure 4. Normal probability plot for the FamilyIncome variable grouped by Income16 levels.

Equal variance: we need to confirm that the variability is roughly equal across the Income16 groups. From the boxplot in Figure 3 and the standard deviations in Table 1 we observe fairly similar variability in the first four levels of the Income16 variable (standard deviations between about 34000 and 42000), but for the “Far Above Average” income group the variability is noticeably higher (about 49000). Since that group also has a much smaller sample size than the other groups, this might be a problem in later analysis.

We conclude that the conditions for the ANOVA test are not entirely fulfilled because of the identified deviations from normality within the groups of the Income16 variable and the larger variance in the “Far Above Average” group, which introduces some uncertainty into our results.
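As a rough numeric companion to the visual check, the standard deviations from Table 1 can be compared directly. The sketch below is only illustrative: the names `group_sd` and `sd_ratio` are introduced here, and the rule of thumb mentioned in the comment is a common heuristic, not a criterion taken from the original analysis.

```r
# Standard deviations of FamilyIncome per Income16 level, copied from Table 1.
group_sd <- c(
  "Far Below Average" = 34245.98,
  "Below Average"     = 34160.96,
  "Average"           = 35452.52,
  "Above Average"     = 41983.03,
  "Far Above Average" = 49337.72
)

# Ratio of the largest to the smallest group standard deviation.
# A frequently quoted heuristic (an assumption here) treats ratios
# well below 2 as acceptable for ANOVA.
sd_ratio <- max(group_sd) / min(group_sd)

print(names(which.max(group_sd)))  # group with the largest spread
print(round(sd_ratio, 2))
```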

ANOVA test:

ANOVA can tell us whether something interesting is going on in our data, namely whether at least one pair of means is statistically different. It uses the F test statistic, which is the ratio of the variability between the group means to the variability of the observations within the groups. A large F statistic represents stronger evidence against the null hypothesis, and to obtain a large F statistic the variability between the sample means needs to be large relative to the variability within the groups.

\[F = \frac{\text{variability between groups}}{\text{variability within groups}}\]

anova(lm(FamilyIncome ~ Income16, data=dat))

## Analysis of Variance Table
## 
## Response: FamilyIncome
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Income16      4 1.1362e+12 2.8404e+11  214.93 < 2.2e-16 ***
## Residuals 24980 3.3013e+13 1.3216e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the ANOVA results, we see that the obtained F statistic of 214.93 is very large and that the p-value is practically zero. This means that the probability of observing an F statistic of 214.93 or larger, if the null hypothesis were true, is extremely low. Given that the p-value is this small, we reject the null hypothesis and conclude that the data provide statistically significant evidence that at least one pair of mean current family incomes in the US population differs across the family income groups at age 16.

Now that we have rejected the null hypothesis, we would like to find out which groups of the family income at age 16 variable have different means. We can do that by conducting pairwise tests between all groups, which for our $k = 5$ groups produces $K = \frac{k(k-1)}{2} = 10$ tests. To account for the resulting inflation of the Type I error rate we apply the Bonferroni correction, which uses a more stringent significance level for rejecting the null hypothesis $H_{0}$ by dividing the significance level $\alpha$ by the number of comparisons performed: $\alpha^* = \frac{\alpha}{K}$.
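The F statistic can also be reconstructed from the group sizes, means and standard deviations in Table 1, which makes the between/within decomposition concrete. This is a sketch under the assumption that the rounded values in Table 1 are accurate enough; the variable names are introduced here and are not part of the original analysis.

```r
# Group sizes, means and standard deviations copied from Table 1.
n     <- c(2246, 6440, 12178, 3672, 449)
means <- c(39579.10, 45893.69, 51403.73, 63689.24, 64520.57)
sds   <- c(34245.98, 34160.96, 35452.52, 41983.03, 49337.72)

grand_mean <- sum(n * means) / sum(n)

# Between-group and within-group sums of squares.
ssg <- sum(n * (means - grand_mean)^2)
sse <- sum((n - 1) * sds^2)

df_group <- length(n) - 1        # k - 1
df_error <- sum(n) - length(n)   # N - k

# F = mean square between groups / mean square within groups.
f_stat  <- (ssg / df_group) / (sse / df_error)
p_value <- pf(f_stat, df_group, df_error, lower.tail = FALSE)

print(c(F = f_stat, df1 = df_group, df2 = df_error))
print(p_value)
```

Up to rounding in the table, the result should agree with the F value of about 215 and the degrees of freedom (4 and 24980) reported by `anova()`.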

pairwise.t.test(dat$FamilyIncome, dat$Income16, p.adjust.method="bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dat$FamilyIncome and dat$Income16 
## 
##                   Far Below Average Below Average Average Above Average
## Below Average     1.4e-11           -             -       -            
## Average           < 2e-16           < 2e-16       -       -            
## Above Average     < 2e-16           < 2e-16       < 2e-16 -            
## Far Above Average < 2e-16           < 2e-16       6.2e-13 1            
## 
## P value adjustment method: bonferroni

The results of the pairwise tests show that for nine of the ten pairs the p-value is many orders of magnitude below the significance level, which means that the differences in means between those groups are statistically significant. Note that the p-values reported by pairwise.t.test are already Bonferroni-adjusted (each raw p-value is multiplied by $K = 10$ and capped at 1), so they are compared directly with the conventional $\alpha = 0.05$, which is equivalent to comparing the raw p-values with $\alpha^* = 0.05 / 10 = 0.005$. For the pair “Above Average - Far Above Average” we do not reject the null hypothesis, since its adjusted p-value is 1, so between these two groups we cannot confirm any significant difference in means. The value of exactly 1 is a consequence of the cap: the two group means are very close (63689.24 vs. 64520.57), so the raw p-value is large and multiplying it by 10 exceeds 1.
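The adjusted p-value of 1 for the “Above Average - Far Above Average” pair can be approximately reproduced from the summary statistics. The sketch below assumes the pooled standard deviation is the square root of the residual mean square from the ANOVA table (1.3216e+09 on 24980 degrees of freedom), which is how a pooled-SD pairwise t test is usually formed; the variable names are illustrative.

```r
# Inputs copied from Table 1 and the ANOVA table.
n_aa  <- 3672;  mean_aa  <- 63689.24   # "Above Average"
n_faa <- 449;   mean_faa <- 64520.57   # "Far Above Average"
mse      <- 1.3216e9                   # residual mean square (pooled variance)
df_error <- 24980

# Number of pairwise comparisons among k = 5 groups.
n_tests <- choose(5, 2)

# Two-sided t test with pooled SD for this single pair.
se     <- sqrt(mse * (1 / n_aa + 1 / n_faa))
t_stat <- (mean_faa - mean_aa) / se
raw_p  <- 2 * pt(abs(t_stat), df_error, lower.tail = FALSE)

# Bonferroni adjustment: multiply by the number of tests and cap at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_tests)

print(c(t = t_stat, raw_p = raw_p, adjusted_p = adj_p))
```

Because the raw p-value is well above 0.1, multiplying it by 10 exceeds 1 and the adjusted value is reported as exactly 1.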

Conclusion

This study has shown a positive association between the family income of US citizens when they were teenagers at age 16, living in their parents’ homes, and their current family income as adults of age 30 and over. This establishes a link between growing up in a higher-income family and having a higher income in one’s own family later on, but it does not allow us to draw any causal conclusions from this relationship, since the study was observational and not experimental.

The data used for this research came from the General Social Survey and spanned the years 1972-2012. The observations were considered to be random, so we were able to generalize our conclusions to the whole US population. We performed an initial exploration of the data by looking at the respondents’ current family income relative to the family income group they identified themselves as belonging to when they were 16 years old.

After visually confirming the relationship between these two variables, we conducted the ANOVA test and pairwise comparisons between all group pairs. We found that mean current incomes differ significantly for all group pairs except the “Above Average - Far Above Average” pair, for which we could not find any significant difference in means. The boxplot in Figure 3 suggests why that might be: the median for the “Far Above Average” group is actually marginally lower than for the “Above Average” group (50926 vs. 53507), and its variance is considerably larger.

However, some of the conditions for the tests we conducted were not completely met, so we have to be cautious when interpreting the results and should not consider our conclusion final. A more advanced analysis using more sophisticated statistical methods could provide more useful results, and conducting an experiment would give us the option to explore any possible causality between a person’s family financial background and future earnings, which might prove to be an interesting research topic.

References

Official General Social Survey website.
Data Citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11.
Original dataset can be found at: http://doi.org/10.3886/ICPSR34802.v1
Extract from General Social Survey Cumulative dataset, 1972-2012, modified for Data Analysis and Statistical Inference course (Duke University) and used in this study can be downloaded at http://bit.ly/dasi_gss_data.

Appendix

The excerpt from the dataset used:

##             Income16 FamilyIncome
## 3            Average        33333
## 5      Below Average        69444
## 10 Far Below Average        25926
## 11     Below Average        18519
## 12           Average        18519
## 13     Below Average        18519
## 14 Far Below Average        18519
## 15           Average        25926
## 16 Far Below Average        18519
## 18           Average        25926
## 19     Above Average        60185
## 21     Above Average        50926
## 22           Average        83333
## 26           Average        41667
## 27           Average        41667
## 28           Average        41667
## 37     Below Average        69444
## 38           Average        41667

This RMarkdown document was produced with RStudio v0.99.486 on R v3.2.2.