Introduction

At a time when income inequality is soaring, the top 1% of US citizens earn more than 20% of all income, and the wealthiest 10% of US households own 76% of US wealth, according to http://money.cnn.com/2016/12/22/news/economy/us-inequality-worse/ and https://www.washingtonpost.com/news/wonk/wp/2015/05/21/the-top-10-of-americans-own-76-of-the-stuff-and-its-dragging-our-economy-down/?utm_term=.cf24bf757522 . Social scientists are actively investigating why some Americans make more money than others. I used a dataset extracted from the 1994 census database to examine how age and education level relate to whether an individual earns more than $50K in a year.

According to http://www.huffingtonpost.com/steven-strauss/the-connection-between-ed_b_1066401.html , highly educated workers are in high demand, so they are paid more and are often shielded from the brunt of recessions and other economic turmoil. Those with more advanced degrees also tend to have greater job security, according to https://www.bls.gov/emp/ep_chart_001.htm . I wanted to check these claims against the 1994 census dataset by examining the relationship between education and whether an individual earns more than $50K a year. In addition, according to https://dqydj.com/income-change-career-income-increase-age/ , an individual's income tends to increase with age as they climb the career ladder. However, income growth slows between the late 30s and mid 40s for both men and women. At the average age of 38, savings tend to contribute more to the net worth of an individual than income. The research suggests that income increases with age, but only up to a point, which implies that the relationship between income and age is weaker than that between income and education.

I predict that as educational attainment increases, the probability that an individual earns more than $50K increases as well, and that income also increases with age. However, I expect education to be the factor with the stronger relationship with income. This is because age tends to predict how far along an individual is in their career, whereas education tends to predict the career path itself. Individuals with a high level of education and little experience may earn more than those with little education and more experience, according to https://cew.georgetown.edu/wp-content/uploads/2014/11/collegepayoff-complete.pdf .

Methods

The dataset was downloaded from the UCI Machine Learning Repository (URL: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data ). The data was extracted from the 1994 Census database and was cleaned by Barry Becker.

The 1994 Census Dataset was a collection of 51 state random samples. To clean the data, Barry Becker undertook the following steps: “discretized agrossincome into two ranges with threshold 50,000, converted U.S. to US to avoid periods, converted Unknown to ‘?’, and ran MLC++ GenCVFiles to generate data,test,” according to https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names .
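The thresholding step in that description can be illustrated in R. This is only a sketch: the variable name agi and the dollar amounts are made up for illustration, and it is not the code Becker actually ran.

```r
# Hypothetical adjusted gross incomes in dollars (illustrative values only)
agi <- c(12000, 50000, 50001, 87500)

# Split into the two ranges used in the dataset, with the threshold at 50,000
income <- ifelse(agi > 50000, ">50K", "<=50K")

# Count how many individuals fall into each range
table(income)
```

An income of exactly 50,000 falls in the "<=50K" range, which matches the labels used in the dataset.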

I saved the downloaded data to a file called adultDataset.csv and created a data frame called adult by reading this csv file. My analysis only concerned the age, education number, and income (whether the individual earned more than $50K), so I created a separate data frame called adult2 which only had these columns. The education number is an integer that corresponds to a particular level of education, and it increases with the highest level of education attained.

Since Barry Becker already cleaned the source data to display “?” for an NA value, I searched the adult2 data frame with the Ctrl+F command to find a “?”, and none came up. Hence, there were no incomplete cases in my data.
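The same check can also be done programmatically rather than with a manual search. The sketch below uses a small made-up data frame called toy as a stand-in for adult2 (the rows and the column name education_level are hypothetical), counts the "?" placeholders in each column, and then converts them to NA so that complete.cases() can be used.

```r
# Toy stand-in for adult2 (hypothetical rows, not taken from the census file)
toy <- data.frame(
  age = c(39, 50, 38, 53),
  education_level = c(13, 13, 9, 7),
  income = c("<=50K", ">50K", "?", "<=50K"),
  stringsAsFactors = FALSE
)

# Count "?" placeholders per column, ignoring surrounding whitespace
unknown_counts <- sapply(toy, function(column) {
  sum(trimws(as.character(column)) == "?")
})
unknown_counts

# Recode "?" as NA so that complete.cases() can flag incomplete rows
toy[] <- lapply(toy, function(column) {
  column[trimws(as.character(column)) == "?"] <- NA
  column
})
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

In this toy example one row is incomplete. Applied to adult2, a count of zero in every column would confirm the result of the manual search.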

Graphs: First, I used the boxplot() function to plot a box plot comparing the distribution of age between those who make $ <=50K and >50K. Then, I did the same for the distribution of education level between those who make $ <=50K and >50K. Second, I used a for loop to traverse the adult2 dataset and fill a vector with the proportion of individuals who earn > $ 50K for each educational level. I then used the barplot() function to plot the proportion of individuals who earn > $ 50K. Furthermore, I did the same for the proportion of individuals who earn > $50K for each age.
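The per-group proportions described above can also be computed without an explicit loop. The following is a minimal sketch using tapply() on a small made-up data frame (toy and its values are hypothetical, and this is an alternative to the loop used in this report, not a description of it).

```r
# Toy stand-in for adult2 (hypothetical rows, not taken from the census file)
toy <- data.frame(
  education_level = c(9, 9, 9, 13, 13, 16),
  income = c("<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"),
  stringsAsFactors = FALSE
)

# TRUE for individuals labeled as earning more than $50K
high_earner <- trimws(toy$income) == ">50K"

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the proportion of high earners within each education level
prop_high <- tapply(high_earner, toy$education_level, mean)
prop_high
```

The result is a named vector with one entry per education level present in the data, which can be passed directly to barplot().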

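# Download the raw census extract and read it in (the file has no header row)
URL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
download.file(URL, destfile = "adultDataset.csv")
# strip.white = TRUE removes the space that follows each comma in the file
adult <- read.csv("adultDataset.csv", header = FALSE, strip.white = TRUE)
# Keep only age (column 1), education number (column 5), and income (column 15)
adult2 <- adult[, c(1, 5, 15)]
colnames(adult2) <- c("age", "education level", "income")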

Results

# Boxplot comparing the distribution of age between those who make $ <=50K and >50K
boxplot(adult2$age ~ adult2$income,
        xlab = "Income ($)",
        ylab = "Age (Years)",
        main = "Distribution of age by earnings ($ >50K or <= 50K)" )

Graph 1: Distribution of age by earnings ($ >50K or <= 50K)

It is evident from the graph that the median age of individuals who earn $50K or less is lower than that of individuals who earn more than $50K, supporting the notion that earnings generally increase with age. This is because as workers age, they generally advance in their careers through, for example, promotions. This is supported by https://dqydj.com/income-change-career-income-increase-age/ .

Both boxplots have outliers at very high ages. However, the IQR of the age of the low-earning individuals is larger than that of the high-earning individuals, illustrating that low-earning individuals have a greater spread of age than high-earning individuals. On the other hand, there is a clear skew towards larger ages among low-earning individuals, while high-earning individuals have an almost symmetric distribution of ages. The large overlap between the IQRs of high- and low-earning individuals further indicates that age may not be the strongest predictor of earnings. This is possibly because earnings tend to plateau at around age 38, after which savings make up a larger part of an individual's net worth, according to https://dqydj.com/income-change-career-income-increase-age/ .
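The medians and IQRs read off the boxplot can also be computed directly. The sketch below does this with tapply() on a small made-up data frame (toy and its ages are hypothetical and chosen only to make the arithmetic easy to follow).

```r
# Toy stand-in for adult2 (hypothetical ages, not taken from the census file)
toy <- data.frame(
  age = c(20, 30, 40, 50, 40, 44, 48, 52),
  income = rep(c("<=50K", ">50K"), each = 4),
  stringsAsFactors = FALSE
)

# Median age within each income group
medians <- tapply(toy$age, toy$income, median)

# Interquartile range (Q3 - Q1) of age within each income group
iqrs <- tapply(toy$age, toy$income, IQR)

medians
iqrs
```

In this toy data the low-earning group has a median age of 35 and an IQR of 15, while the high-earning group has a median of 46 and an IQR of 6. Running the same two calls on adult2$age and adult2$income would give the exact values behind Graph 1.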

# Boxplot comparing the distribution of education level between those who make
# $ <=50K and >50K
boxplot(adult2$`education level` ~ adult2$income,
        xlab = "Income ($)",
        ylab = "Education Level",
        main = "Education Level by individual's earnings ($ >50K or <= 50K)" )

Graph 2: Distribution of education level by Individual’s earnings

The median education level of high-earning individuals is substantially higher than that of low-earning individuals, supporting the notion that income increases with education level. As education increases, workers become more specialized and are in higher demand. These high-knowledge workers earn more, according to https://cew.georgetown.edu/wp-content/uploads/2014/11/collegepayoff-complete.pdf .

While the boxplot for high-earning individuals only has outliers in the lower education level range, the boxplot for low-earning individuals has outliers in both directions. The IQR of the low-earning individuals is much smaller than that of the high-earning individuals, indicating that high-earning individuals have a greater spread of education level. Both boxplots are roughly symmetric. Since there is very little overlap between the two IQRs, education level can effectively be used as a factor to separate high- and low-earning individuals.

# Bar plot of the proportion of individuals at each education level who earn
# more than $50K. The vector percent50k stores the proportion for each level.
percent50k <- numeric(16)
for (level in 1:16) {
  temp <- adult2[adult2$`education level` == level, ]
  # The mean of a logical vector is the proportion of rows labeled ">50K"
  percent50k[level] <- mean(trimws(as.character(temp$income)) == ">50K")
}
barplot(percent50k, names.arg = 1:16,
        xlab = "Education Level",
        ylab = "Proportion of individuals who earn > $ 50K",
        ylim = c(0, 1.0),
        main = "Proportion of individuals earning > $50K per education level")

Graph 3: Proportion of individuals earning > 50K per education level

As the education level of individuals increases, the proportion of individuals who are high earning increases as well. This graph illustrates a strong relationship between education level and high-earning status: the higher the education level, the higher the probability that the individual is high earning. This matches the published figures on education and earnings. According to https://cew.georgetown.edu/wp-content/uploads/2014/11/collegepayoff-complete.pdf , the more specialized and advanced a degree an individual holds, the larger their lifetime earnings. The same source details the median lifetime earnings by highest education:
  1. Less than High School: $973,000
  2. High School Diploma: $1,304,000
  3. Some College: $1,547,000
  4. Associate’s Degree: $1,727,000
  5. Bachelor’s Degree: $2,268,000
  6. Master’s Degree: $2,671,000
  7. Doctoral Degree: $3,252,000
  8. Professional Degree: $3,648,000

These data explain why, as the individual attains more education, the probability of being a “high earning” individual who earns more than $50K increases.

# Bar plot of the proportion of individuals at each age who earn more than $50K.
# The vector percent50k stores the proportion for each age from 17 to 90.
ages <- 17:90
percent50k <- numeric(length(ages))
for (i in seq_along(ages)) {
  temp <- adult2[adult2$age == ages[i], ]
  # Proportion of rows at this age labeled ">50K" (NaN if no one has this age)
  percent50k[i] <- mean(trimws(as.character(temp$income)) == ">50K")
}
# barplot() returns the x-coordinate of the midpoint of each bar
barMids <- barplot(percent50k, names.arg = ages,
        xlab = "Age (years) (Minimum is 17 and max is 90)",
        ylab = "Proportion of individuals who earn > $ 50K",
        ylim = c(0, 1.0),
        main = "Proportion of individuals earning > $50K per age")

# The x-axis is in bar units, not years, so look up the bar for age 38
abline(v = barMids[ages == 38], col = "red")

Graph 4: Proportion of individuals earning > $50K per age

The proportion of high-earning individuals increases with age up to about 38, the age marked by the red line, and then begins to decline. This is consistent with the model described in https://dqydj.com/income-change-career-income-increase-age/ , where 38 is the average age at which incomes start to flatten out. Because income flattens out around 38, that is roughly where most professionals reach their peak income, which would explain why the graph peaks near that age. Except for the visibly out-of-place values in the high age range, the graph is unimodal. Because the proportion of high earners rises and then falls with age, while it rises steadily with education level, the shape of the graph illustrates why education level is a better predictor of income than age.

Conclusion

It is evident that there is a relationship between education number and income, and that education level is the better factor to use to predict whether an individual earns more than $50K. While conducting the analysis, I was truly surprised by how well the prediction that income flattens out at 38 showed up on one of my graphs. By contrast, I had expected the education number to be a fairly accurate predictor of an individual’s income status.

For future reports, I would recommend recording more attributes of each individual. Not only does the educational level matter, but the individual’s performance at their institution matters as well. If possible, I would create an engineered variable that functions as an individual’s cumulative GPA across all their years of study, so that I could compare not only the educational levels of individuals but also their performance as a means of estimating income. Their subject of study could be another attribute, as the pay discrepancy between two different majors can be quite large.

It would also be interesting to research how education affects income inequality and social mobility. While https://www.washingtonpost.com/news/wonk/wp/2015/05/21/the-top-10-of-americans-own-76-of-the-stuff-and-its-dragging-our-economy-down/?utm_term=.cf24bf757522 suggests more people with degrees could help narrow the gap between the extremely rich and the rest of the country, I would love to pursue research to verify these claims.