Introduction:

People all over the world say or imply that a higher education leads to higher earnings. Is this true for everyone, everywhere? Is it true for people living in New York, one of the 50 states of the U.S., home to about 19 million people, including New York City\(^2\), where over 8 million very diverse people live? For this project I therefore ask: does education play a role in a person’s earnings in New York? Does that role change if the person is female, or is not a U.S. citizen?

As a female New Yorker who was not a U.S. citizen by birth, I want to know the answer to those two questions. I believe many people in New York want to know the answer as well, especially those currently in school, college, or university. Almost everyone hopes for a well-established life with a good source of income and comfortable living, and many people pursue an education to achieve those goals. Education is expensive for most people in the world, so is a higher educational attainment worth the money? Does only education play a role in our earnings, or do factors like gender or citizenship status also have an effect? In this project I will study less than 1% of New Yorkers to see whether educational attainment has an effect on total personal earnings.

Data:

I obtained the data from DATA.GOV (“The home of the U.S. Government’s open data”)\(^1\); it can also be obtained from the United States Census Bureau\(^5\). The details about the data, the codes for the variables, and the accuracy of the data come from the “2009 Data Dictionary” and “2009 PUMS Accuracy” documents\(^6\).

The data was collected by the American Community Survey (ACS) and consists of personal information from people living in New York State in 2009. This is an observational study in which New Yorkers were surveyed (voluntarily) about themselves and their lives. Here is a description of the data, taken from the PDF that comes with the data\(^4\):

“The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). The PUMS dataset include variables for nearly every question on the survey… Each record in the file represents a single person… In the person-level file, individuals are organized into households, making possible the study people within the contexts of their families and other household members. The PUMS contain data on approximately one percent of the United States population.”

The original data contains 188,767 cases with 279 variables. Each case in this data set is the set of responses from one person living in New York in 2009.\(^5\) Since that is too big to work with, I created a subset with all the cases but only 6 variables, and I put this final data set in GitHub. The variables that I chose for this project are “Citizenship Status”, “Educational Attainment”, “SEX”, and “Total Personal Earnings”. I also included two extra variables, “Age” and “Total Personal Income”, because I will discuss them in my conclusion. I also put a .zip file with the original data and the data dictionary in GitHub for anyone who wants to explore or study the original data.

| variable | description | type | data dictionary |
| --- | --- | --- | --- |
| Age | age of the respondent | numerical, discrete / data as numerical | 0 - under 1 year; 1-99 - ages 1 to 99 |
| Citizen_Stat | citizenship status of the respondent | categorical, variable | 1 - born in U.S.; 2 - born in Puerto Rico and surrounding area; 3 - born abroad of American parent(s); 4 - U.S. citizen by naturalization; 5 - Not U.S. citizen. |
| Edu_Attainment | educational attainment of the respondent | categorical, ordinal / data as numerical | N/A - Less than 3 years; 1 - no school; 2-14 - pre-K to 11th grade; 15 - 12th grade no diploma; 16 - regular HS diploma; 17 - GED or equivalent; 18-19 - some college; 20 - associate’s degree; 21 - bachelor’s degree; 22-24 - beyond a bachelor’s degree. |
| SEX | gender of the respondent | categorical, variable / data as numerical | 1 - male; 2 - female |
| Total_Per_Earn | total person’s earnings of the respondent | numerical, discrete | N/A - less than 15 years old; 0 - no earnings; -9999 - loss of $9999 or more; -1 to -9998 - loss $1 to $9998; 1-9999999 - earn of $1 to $9999999 (all whole numbers) |
| Total_Per_Income | total person’s income of the respondent | numerical, discrete | N/A - less than 15 years old; 0 - no earnings; -9999 - loss of $9999 or more; -1 to -9998 - loss $1 to $9998; 1-9999999 - income of $1 to $9999999 (all whole numbers) |

Here is a summary of the data and its distribution:

Table 1: Summary of the data and its distribution
Age Citizen_Stat Edu_Attainment SEX Total_Per_Earn Total_Per_Income
Min. : 0.00 Min. :1.000 Min. : 1.00 Min. :1.000 Min. : -7400 Min. : -13200
1st Qu.:20.00 1st Qu.:1.000 1st Qu.:13.00 1st Qu.:1.000 1st Qu.: 0 1st Qu.: 7000
Median :41.00 Median :1.000 Median :17.00 Median :2.000 Median : 12000 Median : 22400
Mean :40.04 Mean :1.632 Mean :15.88 Mean :1.521 Mean : 31315 Mean : 38837
3rd Qu.:58.00 3rd Qu.:1.000 3rd Qu.:20.00 3rd Qu.:2.000 3rd Qu.: 42100 3rd Qu.: 50000
Max. :94.00 Max. :5.000 Max. :24.00 Max. :2.000 Max. :957000 Max. :1225000
NA NA NA’s :6108 NA NA’s :35877 NA’s :33251

I created a subset of the data with the four variables that I will be looking at during this project: Total_Per_Earn, Edu_Attainment, SEX, and Citizen_Stat. I only included the cases where Total_Per_Earn \(\ge 500\). I excluded negative earnings and earnings below 500 because I do not know the details of how a respondent and their family members could have a loss in earnings or little to no earnings; many factors could have produced those values (for example unemployment or child support). Therefore I will only look at the cases with earnings of at least 500. Summary statistics for this subset are given in Table 2 in the next section.
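The filtering step described above can be sketched in R as follows. This is a minimal illustration, not the project’s actual code: the data frame name `pums` and its values are made up, while `Sub_Per_Earn` is the subset name that appears later in this report.

```r
# Toy stand-in for the six-variable data set (hypothetical name, invented values).
pums <- data.frame(
  Age            = c(34, 12, 51, 27, 45),
  Citizen_Stat   = c(1, 1, 5, 4, 1),
  Edu_Attainment = c(21, 8, 16, 19, 23),
  SEX            = c(2, 1, 1, 2, 2),
  Total_Per_Earn = c(42000, NA, -1200, 300, 88000)
)

# Keep the four analysis variables and only the cases with earnings of at least 500.
# subset() also drops rows where the condition is NA (respondents under 15 years old).
Sub_Per_Earn <- subset(
  pums,
  Total_Per_Earn >= 500,
  select = c(Total_Per_Earn, Edu_Attainment, SEX, Citizen_Stat)
)

print(Sub_Per_Earn)
```

In this toy example only the first and last rows survive, because the others have missing, negative, or below-500 earnings.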

My main focus will be on the two variables Total_Per_Earn and Edu_Attainment. I will first examine whether there is a relationship between “total personal earnings” and “education attainment”. Then I will see whether that relationship changes depending on whether an individual living in New York is male or female. Lastly I will see whether the relationship differs between U.S. citizens and non-U.S. citizens. I will therefore perform three hypothesis tests on the data. The response variable in this project is “Total Personal Earnings”, which is numerical. The explanatory variables are “Educational Attainment”, “Citizenship Status” and “SEX”; these variables are categorical, although they are coded as numbers.

The reason I wish to look at the four variables mentioned above is to understand the relationship between “education attainment” and “total personal earning”. If there is a relationship between the two variables in the sample, then there is a chance that it holds for all people living in New York. I also want to see whether, with the same “education attainment”, a person’s “total personal earning” changes depending on whether the person is female or male, and whether it changes depending on whether the person is a U.S. citizen or a non-U.S. citizen. By the end of this project we will have an understanding of whether gender or citizenship status affects the relationship between “total personal earning” and “education attainment” for people living in New York. Other factors might contribute to any links between these variables (such as type of work or hours of work). However, for this project we will exclude those factors and focus only on the four variables.

Exploratory data analysis:

Let’s look at the summary statistics of Citizen_Stat, Edu_Attainment, SEX and Total_Per_Earn, where Total_Per_Earn is greater than or equal to 500:

Table 2: Summary statistics for the four variables.
vars n mean sd median trimmed mad min max range skew kurtosis se
Citizen_Stat 1 97349 1.75 1.42 1 1.45 0.00 1 5 4 1.46 0.32 0.00
Edu_Attainment 2 97349 18.60 3.34 19 18.93 2.97 1 24 23 -1.58 4.98 0.01
SEX 3 97349 1.49 0.50 1 1.49 0.00 1 2 1 0.02 -2.00 0.00
Total_Per_Earn 4 97349 49189.42 69351.07 33000 37253.27 31134.60 500 957000 956500 5.35 39.03 222.27

Next let’s study each of the variables using visualization. I created six graphs below. Graph 1 and graph 2 are histograms of total personal earning: graph 1 shows total personal earning to the nearest dollar, whose distribution is extremely skewed to the right (Table 2 reports a skew of 5.35), and graph 2 shows the log of total personal earning, which is only slightly skewed. Graph 3 and graph 4 show the distribution of education attainment, the first as a numerical variable and the second as a categorical variable with 5 levels. We see in graph 3 that most respondents in this survey have at least a high school diploma (code 16 or above). Graph 4 shows that, out of the 5 levels, the largest number of respondents have an associate or bachelor degree and the smallest number have no diploma. Graph 5 shows the distribution of gender in the dataset; there are about as many male respondents as female respondents. Graph 6 shows the distribution of citizenship status as two levels, U.S. citizen and not U.S. citizen. We see that most of the respondents are U.S. citizens and only a few are non-U.S. citizens.

Now that we have looked at the distribution of each of the variables, we will visualize the relationships between the variables. Graph 7 shows the relationship between educational attainment and total personal earning. The curve that best describes the relationship between these two variables appears to be exponential. Graph 8 portrays the distribution of the log of total personal earnings by two levels: “high school diploma or below” and “associate degree or above”.

Graph 9a and graph 9b portray the distribution of the log of total personal earnings by five levels. Looking at the histograms and the box plots, we see that as educational attainment goes from “no diploma” to “more than bachelor degree”, the center and the spread of log(Total_Per_Earn) move from below 10 to above 10. Table 3 below shows that the mean total personal earnings differ across the five categories of education attainment: the mean increases as education attainment goes from “no diploma” to “more than bachelor degree”, as the visualization also shows.

Table 3: Summary statistics for the relationship between education attainment by levels and total earning
group1 vars n mean sd median min max se
a. no diploma 1 9362 21280.27 27367.55 15400 500 591000 282.85
b. HS diploma or equivalent 1 23512 30863.62 31400.45 25000 500 957000 204.78
c. some college, no degree 1 19485 33868.34 38915.51 25000 500 691000 278.79
d. associate/bachelor degree 1 29547 57898.03 70645.20 43000 500 957000 410.98
e. higher than bachelor degree 1 15443 96678.85 114998.47 65000 500 957000 925.39
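One way to build the five education levels from the numeric codes is sketched below. The cut points (1-15, 16-17, 18-19, 20-21, 22-24) are an assumption inferred from the data dictionary and the level labels in Table 3; the report does not state exactly how the levels were created, and the vector `edu_code` holds invented example values.

```r
# A few example education codes from the data dictionary range (1 to 24).
edu_code <- c(1, 12, 15, 16, 17, 18, 19, 20, 21, 22, 24)

# Assumed grouping: intervals are (0,15], (15,17], (17,19], (19,21], (21,24].
edu_level <- cut(
  edu_code,
  breaks = c(0, 15, 17, 19, 21, 24),
  labels = c("a. no diploma",
             "b. HS diploma or equivalent",
             "c. some college, no degree",
             "d. associate/bachelor degree",
             "e. higher than bachelor degree"),
  ordered_result = TRUE
)

# Count how many example codes fall into each level.
print(table(edu_level))
```

Using an ordered factor keeps the levels in their natural order in tables and plots.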

Graph 10 and graph 11 portray the relationship between educational attainment and the log of total personal earning, by gender and by citizenship status respectively. These two graphs show that the centers of the data for the different genders and citizenship statuses differ within each of the 5 levels. Graph 12 shows the relationship between all four variables: log(Total_Per_Earn), Edu_Attainment, SEX, and Citizen_Stat. We see in this graph that as educational attainment goes from “no diploma” to “more than bachelor degree”, the center and spread of log(Total_Per_Earn) go higher regardless of gender or citizenship status.

Table 4 shows the relationship between earnings and education attainment, by gender and citizenship status. The summary statistics below change as education attainment goes from “no diploma” to “more than bachelor degree”. For example, the first 5 rows of the table hold the information about respondents who are non-U.S. citizen females, and there the mean total earnings increase from about 16 thousand dollars to about 65 thousand dollars as educational attainment rises. I think this shows there is a high probability of a relationship between educational attainment and total personal earning, even if a person is female and/or a non-U.S. citizen.

Table 4: Summary statistics for the relationship between education attainment, total earnings, sex, and citizenship status.
group1 group2 group3 vars n mean sd median min max se
a. no diploma non-U.S. citizen female 1 874 16462.73 12911.96 14000 500 114000 436.75
b. HS diploma or equivalent non-U.S. citizen female 1 929 19751.25 16546.79 15800 500 250000 542.88
c. some college, no degree non-U.S. citizen female 1 545 23384.77 40409.17 15000 590 591000 1730.94
d. associate/bachelor degree non-U.S. citizen female 1 880 43455.14 60890.46 30000 500 591000 2052.62
e. higher than bachelor degree non-U.S. citizen female 1 495 65207.64 77236.13 50000 600 591000 3471.51
a. no diploma U.S. citizen female 1 2961 15624.69 19720.74 11000 500 591000 362.41
b. HS diploma or equivalent U.S. citizen female 1 9834 24876.77 24242.66 20200 500 591000 244.46
c. some college, no degree U.S. citizen female 1 9195 26874.19 27357.98 21000 500 591000 285.30
d. associate/bachelor degree U.S. citizen female 1 14651 45391.87 48045.36 37000 500 957000 396.93
e. higher than bachelor degree U.S. citizen female 1 7741 70227.04 70520.44 57000 500 957000 801.52
a. no diploma non-U.S. citizen male 1 1554 23548.29 22917.65 19000 500 366000 581.36
b. HS diploma or equivalent non-U.S. citizen male 1 1354 30705.91 36383.76 25000 600 591000 988.78
c. some college, no degree non-U.S. citizen male 1 633 31008.71 31304.95 24600 500 366000 1244.26
d. associate/bachelor degree non-U.S. citizen male 1 978 60728.99 85770.97 38000 500 591000 2742.65
e. higher than bachelor degree non-U.S. citizen male 1 651 116441.38 143115.49 72000 500 957000 5609.14
a. no diploma U.S. citizen male 1 3973 25667.94 34396.61 18400 500 591000 545.70
b. HS diploma or equivalent U.S. citizen male 1 11395 36955.04 35633.46 30900 500 957000 333.81
c. some college, no degree U.S. citizen male 1 9112 41751.89 46895.41 32500 500 691000 491.27
d. associate/bachelor degree U.S. citizen male 1 13038 72713.87 86633.65 52000 500 957000 758.72
e. higher than bachelor degree U.S. citizen male 1 6556 128325.62 144429.16 82000 500 957000 1783.76

Inference:

In this section we will look at the interaction between the variables Total_Per_Earn, Edu_Attainment, SEX, and Citizen_Stat. We will see whether there is a relationship between educational attainment and total personal earning for people living in New York State. Then we will see whether gender plays a role in the relationship between educational attainment and total personal earning. Lastly, we will look at citizenship status as a factor which might or might not affect the link between educational attainment and total personal earning. Statistical inference is the theory, methods, and practice of understanding the quality of parameter estimates, such as an estimate of the population mean. Here we will use several methods of statistical inference: hypothesis tests, p-values, ANOVA, confidence intervals, and a regression model.

First we will look at Total_Per_Earn and Edu_Attainment by conducting the first hypothesis test, where Edu_Attainment is categorical with two levels. Before we do any analysis, let’s check whether the conditions necessary for inference are satisfied. Here is a checklist of the conditions:

  1. Independence within groups: Met
    • Random sample/assignment: The data is a random sample of people who voluntarily took the survey by internet, phone, or paper.
    • If sampling without replacement, \(n < 10\%\) of population: The sample contains about 1% of the United States population.
  2. Independence between groups: Met
    • Each respondent has an educational attainment of either HS diploma or less, or associate degree or higher, but not both.
  3. Sample size / skew: Met
    • \(n\ge{30}\): Table 4 shows that \(n\) is far greater than 30 for every level of every variable.
    • Population distribution should not be extremely skewed: It is not extremely skewed, as we see above in graph 8.

Since the conditions necessary for inference are satisfied, I will perform the hypothesis test. A hypothesis test starts with two hypotheses, the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)). I will not reject the null hypothesis unless the evidence in favor of the alternative hypothesis is very strong, because otherwise there is a high probability that I would commit a Type 1 Error. A Type 1 Error is rejecting the null hypothesis when it is actually true.\(^3\) Here is the first hypothesis test:

\(H_0\): There is no difference between the average total personal earnings of respondents who have an educational attainment of HS diploma or lower and those who have an associate degree or higher.

\[\mu_{diff}=0\]

\(H_A\): There is a difference between the average total personal earnings of respondents who have an educational attainment of HS diploma or lower and those who have an associate degree or higher.

\[\mu_{diff}\ne0\]

Here are the results of the hypothesis test and the confidence interval:

inference(y=Cat_Ed_2level$Total_Per_Earn, x= Cat_Ed_2level$Edu_Attainment, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_associate degree or above = 46183, mean_associate degree or above = 69931.17, sd_associate degree or above = 89542.75
## n_high school diploma or below = 51166, mean_high school diploma or below = 30467.68, sd_high school diploma or below = 34277.59
## Observed difference between means (associate degree or above-high school diploma or below) = 39463.49
## H0: mu_associate degree or above - mu_high school diploma or below = 0 
## HA: mu_associate degree or above - mu_high school diploma or below != 0 
## Standard error = 443.368 
## Test statistic: Z =  89.008 
## p-value =  0
inference(y=Cat_Ed_2level$Total_Per_Earn, x= Cat_Ed_2level$Edu_Attainment, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_associate degree or above = 46183, mean_associate degree or above = 69931.17, sd_associate degree or above = 89542.75
## n_high school diploma or below = 51166, mean_high school diploma or below = 30467.68, sd_high school diploma or below = 34277.59
## Observed difference between means (associate degree or above-high school diploma or below) = 39463.49
## Standard error = 443.368 
## 95 % Confidence interval = ( 38594.5079 , 40332.4783 )

Looking at the results above, we see that the p-value of the hypothesis test is reported as \(0\) (too small to display), which is less than \(0.05\), meaning that we can reject the null hypothesis. From the confidence interval we are 95% confident that the difference between the average total personal earnings of respondents with an associate degree or higher and respondents with a HS diploma or lower is between \(\$38594.5\) and \(\$40332.5\). A confidence interval is a plausible range of values for the population parameter; the population in my case is people living in New York State.

Looking at graph 2 and graph 7, I noticed that it is better to transform total personal earnings to log\(_{10}\) than to keep them in dollars; therefore from here on I will use the log of total personal earning. Next I will conduct a hypothesis test using ANOVA, where Edu_Attainment is categorical with five levels. Analysis of Variance (ANOVA) uses the F test statistic, which is the ratio of the between-group variability to the within-group variability. ANOVA uses a single hypothesis test to check whether the means across many groups are equal.\(^3\) The conditions necessary for the ANOVA test are satisfied, as we see below:
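As a check on the output above, the standard error, test statistic, and confidence interval can be recomputed by hand from the reported group sizes, means, and standard deviations. The sketch below uses the usual formulas for a difference of two independent means; the variable names are my own, and it assumes the `inference` function uses the normal critical value for the 95% interval.

```r
# Summary statistics copied from the inference() output.
n1 <- 46183; mean1 <- 69931.17; sd1 <- 89542.75   # associate degree or above
n2 <- 51166; mean2 <- 30467.68; sd2 <- 34277.59   # high school diploma or below

# Point estimate and standard error of the difference between two independent means.
diff_means <- mean1 - mean2
se <- sqrt(sd1^2 / n1 + sd2^2 / n2)

# Z statistic and two-sided p-value under H0: mu_diff = 0.
z <- diff_means / se
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

# 95% confidence interval: estimate plus or minus the critical value times the standard error.
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se

print(c(se = se, z = z, p_value = p_value))
print(ci)
```

The p-value is so small that it underflows in double precision, which is why the output reports it as 0.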

  1. Independence within and between groups: Met
    • Random sample/assignment: The data is a random sample of people who voluntarily took the survey by internet, phone, or paper. Since the sample is random, each respondent’s educational attainment should be independent of other respondents’.
    • If sampling without replacement, \(n < 10\%\) of population: The sample contains about 1% of the United States population.
  2. Approximately normal: Met
    • \(n\ge{30}\): Table 4 shows that \(n\) is far greater than 30 for every level of every variable.
    • Population distribution should not be extremely skewed: It is not extremely skewed, as we see above in graph 9a.
  3. Constant variance: Met
    • Looking at graph 9b, we see that the variability of the five levels is roughly constant.

Since the conditions necessary for ANOVA are satisfied, I will perform the hypothesis test:

\(H_0\): There is no difference in the average total personal earnings of respondents across the 5 educational attainment levels.

\[\mu_{a}=\mu_{b}=\mu_{c}=\mu_{d}=\mu_{e}\]

\(H_A\): At least one of the average total personal earnings across the 5 educational attainment levels is different.

Df Sum Sq Mean Sq F value Pr(>F)
Edu_Attainment 4 4269.535 1067.3836281 4501.907 0
Residuals 97344 23079.860 0.2370959 NA NA

Looking at the ANOVA test above, we see that the F-value is large (\(>4000\)). Since the F-value is so large, the p-value is reported as 0, which means we can reject the null hypothesis. The p-value is “the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.”\(^3\) This test shows us that at least one of the mean total personal earnings is different.

Next we will fit a linear regression, the line that best represents the relationship between educational attainment (in numerical form from 1 to 24) and log\(_{10}\) of total personal earnings. For the linear regression we use the least squares line because it is easier to compute, it is commonly used, and in many applications a residual twice as large as another is more than twice as bad. The strength of the fit of a linear model is most commonly evaluated using \(R^2\), which tells us what percent of the variability in the response variable is explained by the model. Let’s check the conditions for the least squares line:
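The F-value in the table can be reproduced from the sums of squares and degrees of freedom, which makes the “ratio of between-group to within-group variability” concrete. The following is a small sketch using only the numbers printed in the table; the variable names are my own.

```r
# Values copied from the one-way ANOVA table.
df_group <- 4;      ss_group <- 4269.535    # Edu_Attainment (between groups)
df_resid <- 97344;  ss_resid <- 23079.860   # Residuals (within groups)

# Mean squares are sums of squares divided by their degrees of freedom.
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares.
f_value <- ms_group / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

print(c(ms_group = ms_group, ms_resid = ms_resid, f_value = f_value, p_value = p_value))
```

With an F-value this large the upper-tail probability is smaller than R can represent, so it prints as 0.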

  1. Linearity: Looking at the residual plot below, we see that the points are mostly scattered around \(y=0\), but with a pattern. This means that the relationship between Edu_Attainment and Total_Per_Earn is not linear.

  2. Nearly normal residuals: Looking at the histogram and the normal probability plot below, we see that the residuals are nearly normal.

  3. Constant variability: Looking at the plots below, I can say that the variability of points around the least squares line is not constant.

Since the conditions for the least squares line fail, I will not analyze the graph or the correlation of the line. This shows that the relationship between educational attainment (in numerical form from 1 to 24) and log\(_{10}\) of total personal earnings is not linear. I also tried the original data without any transformation, but that does not work either.

cor(Sub_Per_Earn$Edu_Attainment, log10(Sub_Per_Earn$Total_Per_Earn))
## [1] 0.3315994
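For reference only, in a simple linear regression with one explanatory variable \(R^2\) equals the square of the correlation coefficient, so the correlation printed above would correspond to:

\[R^2 = r^2 = (0.3315994)^2 \approx 0.110\]

That is, a straight line would account for roughly 11% of the variability in log\(_{10}\) of total personal earnings. Because the linearity and constant variability conditions fail, this number should be read only as a rough description and not as a measure of a valid linear fit.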

Next I will perform an ANOVA test to see whether there is a relationship between education attainment and total personal earnings when gender is added as a second explanatory variable. Let’s check the conditions necessary for the ANOVA test below:

  1. Independence within and between groups: Met
    • Random sample/assignment: It is a random sample. Since the sample is random, each respondent’s educational attainment and gender should be independent of other respondents’.
    • If sampling without replacement, \(n < 10\%\) of population: The sample contains about 1% of the United States population.
  2. Approximately normal: Met
    • \(n\ge{30}\): Table 4 shows that \(n\) is far greater than 30 for every level of every variable.
    • Population distribution should not be extremely skewed: It is not extremely skewed, as we see on the Q-Q plot below.
  3. Constant variance: Met
    • Looking at graph 10, we see that the variability for both genders is roughly constant across the five levels.

Since the conditions necessary for ANOVA are satisfied, I will perform the hypothesis test:

\(H_0\): There is no difference in the average total personal earnings of respondents across the 5 educational attainment levels, by gender.

\(H_A\): At least one of the average total personal earnings is different.

Df Sum Sq Mean Sq F value Pr(>F)
Edu_Attainment 4 4269.5345 1067.3836281 4661.025 0
SEX 1 788.1268 788.1268342 3441.573 0
Residuals 97343 22291.7333 0.2290019 NA NA

Looking at the ANOVA test above, we see that the F-value is large for both Edu_Attainment (\(>4000\)) and SEX (\(>3000\)). Since the F-values are so large, the p-values are reported as 0, which means I can reject my null hypothesis.

Next I will perform another ANOVA test to see whether there is a relationship between education attainment and total personal earnings when citizenship status is added as a second explanatory variable. Let’s check the conditions necessary for the ANOVA test below:

  1. Independence within and between groups: Met
    • Random sample/assignment: It is a random sample. Each respondent’s educational attainment and citizenship status should be independent of other respondents’.
    • If sampling without replacement, \(n < 10\%\) of population: The sample contains about 1% of the United States population.
  2. Approximately normal: Met
    • \(n\ge{30}\): Table 4 shows that \(n\) is far greater than 30 for every level of every variable.
    • Population distribution should not be extremely skewed: It is not extremely skewed, as we see on the Q-Q plot below.
  3. Constant variance: Met
    • Looking at graph 11, we see that the variability of the groups is roughly constant.

Since the conditions necessary for ANOVA are satisfied, I will perform the hypothesis test:

\(H_0\): There is no difference in the average total personal earnings of respondents across the 5 educational attainment levels, by citizenship status.

\(H_A\): At least one of the average total personal earnings is different.

Df Sum Sq Mean Sq F value Pr(>F)
Edu_Attainment 4 4269.53451 1067.3836281 4502.141213 0.0000000
Citizen_Stat 1 1.43595 1.4359504 6.056727 0.0138551
Residuals 97343 23078.42415 0.2370836 NA NA

Looking at the ANOVA test above, we see that the F-value is large (\(>4000\)) for Edu_Attainment and about 6 for Citizen_Stat. The p-value for Edu_Attainment is reported as 0, and the p-value for Citizen_Stat is \(0.014\), which is less than \(0.05\), so I can reject my null hypothesis.
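With roughly 97,000 cases even a small effect can produce a small p-value, so it can help to look at how much of the total sum of squares each term accounts for. The sketch below computes that share from the Sum Sq columns of the two ANOVA tables that include a second explanatory variable; it is a simple descriptive measure added for illustration, not part of the original analysis, and the variable names are my own.

```r
# Sum Sq columns copied from the two ANOVA tables with a second explanatory variable.
ss_sex_model <- c(Edu_Attainment = 4269.5345,
                  SEX            = 788.1268,
                  Residuals      = 22291.7333)
ss_cit_model <- c(Edu_Attainment = 4269.53451,
                  Citizen_Stat   = 1.43595,
                  Residuals      = 23078.42415)

# Share of the total sum of squares attributed to each term.
share_sex <- ss_sex_model / sum(ss_sex_model)
share_cit <- ss_cit_model / sum(ss_cit_model)

print(round(share_sex, 4))
print(round(share_cit, 5))
```

By this measure educational attainment accounts for a much larger share of the variability in log earnings than gender does, and citizenship status accounts for a very small share even though its p-value is below 0.05.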

Conclusion:

After completing various statistical analyses on the data, I come to the conclusion that in New York State there is a relationship between an individual’s total earnings and the highest level of education the individual has completed. This project also showed that whether the individual is female or male is associated with earnings alongside educational attainment. Lastly, this project showed that citizenship status plays a role in earnings alongside educational attainment. However, I would like to look at the survey data for other years and states to see whether my conclusion holds true for all individuals, no matter the time period or the state.\(^5\)

During this project I learned a lot about my data. I found out that there were a lot of respondents who have zero or negative earnings. I learned that negative earnings mean a loss of money, and that some people earn less than 10 dollars. I was also surprised to learn that some respondents had little to no education, and that many respondents went to high school but did not get their high school diploma. I learned that in New York State about equal numbers of both genders have earnings greater than 499 dollars. One of the most surprising things about the data was that respondents’ ages ranged from \(0\) to \(94\). I would really like to know more about this data and how it was obtained (whether by simple random sample, cluster sample, or another method).

There is room for a lot of future research with this data; I only analyzed a small part of it. In the future we can look at whether education plays a role in income, or whether there is a relationship between age and earnings/income. Do income and earnings have a linear relationship? There is so much more we can do to study this data set. The possibilities are endless, especially with the original 2009 survey data.\(^4\)

References:

  1. 2009 American Community Survey 1-Year PUMS Housing File. (2015, May 20). Retrieved March 1, 2016, from https://catalog.data.gov/dataset/2009-american-community-survey-1-year-pums-housing-file.

  2. 2010 Census Interactive Population Map. (n.d.). Retrieved May 1, 2016, from https://www.census.gov/2010census/popmap/.

  3. Diez, D. M., Barr, C. D., & Cetinkaya-Rundel, M. (2012). OpenIntro statistics. Lexington, KY: CreateSpace. Can be downloaded from https://www.openintro.org/stat/textbook.php?stat_book=os

  4. Hossain, N. (2016, March). GitHub. Retrieved from https://github.com/nabilahossain/Class-DATA606/tree/master/Project

  5. PUMS Data 2000 - current. (2016, January 15). Retrieved May 21, 2016, from http://www.census.gov/programs-surveys/acs/data/pums.html It is the link: “2009 ACS 1-year PUMS”.

  6. PUMS Technical Documentation. (2015, October 14). Retrieved March 1, 2016, from https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2009.html The two links/pdf that I used are: “2009 Data Dictionary” and “2009 PUMS Accuracy”

  7. Ross, S. (2015). What is the difference between earnings and income? | Investopedia. Retrieved May 1, 2016, from http://www.investopedia.com/ask/answers/070615/what-difference-between-earnings-and-income.asp