Introduction

As the largest city in the United States, New York City is a hub of finance, fashion, technology, and culture. About 8.538 million people live there, and it continues to grow as a leading metropolitan area, with a median housing value of $690,000.

Given how populous and in demand the city is, one might wonder what it is like to live there. Who are its residents, what are their incomes, and how does income vary with housing quality, age, and even the year they moved in?

In this analysis, I focus on total household income (in $) in New York City and ask whether it is related to other predictors, which would help us profile NYC residents by income. Such profiles are useful in many markets, because companies want to know which groups of people to target in order to increase sales. Customer profiles are therefore in high demand.

Exploratory Data Analysis

Data

In this NYC housing data set, we examine a random sample of 299 households and 4 variables. To gain insight into household income in NYC, we analyze the relationship between income and three predictor variables: age, maintenance deficiencies, and move-in year.

Variable        Description
Income          total household income (in $)
Age             respondent’s age (in years)
MaintenanceDef  number of maintenance deficiencies between 2002 and 2005
NYCMove         the year the respondent moved to New York City

Now, let’s examine the first few lines of data

Univariate exploration

As a first step in the analysis we want to explore each variable individually. We use histograms to explore the distributions of continuous variables.

Although graphs are an easy way to visualize the distribution of a variable, numerical summary statistics are also helpful in our analysis.

##      Income           Age        MaintenanceDef    NYCMove    
##  Min.   : 1440   Min.   :26.00   Min.   :0.00   Min.   :1942  
##  1st Qu.:21000   1st Qu.:42.00   1st Qu.:1.00   1st Qu.:1973  
##  Median :39000   Median :49.00   Median :2.00   Median :1985  
##  Mean   :42266   Mean   :50.03   Mean   :1.98   Mean   :1983  
##  3rd Qu.:57800   3rd Qu.:58.00   3rd Qu.:2.00   3rd Qu.:1995  
##  Max.   :98000   Max.   :85.00   Max.   :8.00   Max.   :2004

After examining both the graphs and the summary statistics of our variables, we can make the following observations.

The distribution of household income could be considered unimodal, bimodal, or trimodal, and we would need more data to determine its modality. The distribution is skewed to the right. Income ranges from $1,440 to $98,000 and is centered around a mean of $42,266.

The distribution of age could be considered unimodal and roughly symmetric, close to a normal distribution. The median (49) and mean (50.03) are very similar, so age is centered around 50.

The distribution of maintenance deficiencies is unimodal and skewed to the right. The mean (1.98) and median (2) are very similar for this variable as well.

The distribution of move-in year could be considered unimodal, bimodal, or trimodal, and again we would need more data to determine its modality. The distribution is skewed to the left. The mean (1983) and median (1985) are also similar, so the move-in year is centered around 1983.
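As a rough numerical cross-check of the skewness statements, the gap between mean and median can be expressed in units of the interquartile range, using only the values from the summary table. This is a heuristic sketch, not a formal skewness test; the data frame `stats` and the column `gap_in_iqr` are names introduced here for illustration.

```r
# Values copied from the summary output above.
stats <- data.frame(
  variable = c("Income", "Age", "MaintenanceDef", "NYCMove"),
  q1       = c(21000, 42, 1, 1973),
  median   = c(39000, 49, 2, 1985),
  mean     = c(42266, 50.03, 1.98, 1983),
  q3       = c(57800, 58, 2, 1995)
)

# Heuristic: (mean - median) / IQR.
# Positive values hint at a longer right tail, negative values at a longer left tail.
stats$gap_in_iqr <- (stats$mean - stats$median) / (stats$q3 - stats$q1)
print(stats[, c("variable", "gap_in_iqr")])
```

Under this heuristic Income comes out positive and NYCMove negative, in line with the right and left skew read off the histograms. For a discrete count such as MaintenanceDef the gap is close to zero and not very informative, so the histogram remains the better guide.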

Bivariate exploration

After analyzing the distributions of the individual variables, we can now examine relationships between variables. We focus on how each variable relates to household income. This analysis can help markets and consumer advocate watchdog groups understand consumer profiles and target consumers accordingly.

## [1] 0.03593162

## [1] -0.1681017

## [1] -0.1009987

From the graphs, we can state that a positive relationship exists between income and age, but the association is very weak: income increases only slightly with age on average, and the correlation coefficient is 0.04. Maintenance deficiency is not strongly associated with income either, as an increase in the number of maintenance deficiencies does not always correspond to the same change in income. Still, a weak negative relationship exists: as the number of maintenance deficiencies increases, income decreases on average, with a correlation coefficient of -0.17. Note that the points fall in vertical bands along the x-axis because the number of deficiencies takes only integer values. Move-in year does not show a clear trend with income, but a weak negative relationship exists, with a correlation coefficient of -0.10.
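To judge how weak these associations are, each correlation can be converted to a t statistic with n - 2 degrees of freedom, using the sample size of 299 stated earlier. The sketch below works only from the three printed coefficients; the variable names are introduced here for illustration, and with the raw data the same test is available through `cor.test()`.

```r
n <- 299
r <- c(Age = 0.03593162, MaintenanceDef = -0.1681017, NYCMove = -0.1009987)

# t statistic for H0: the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(cbind(r, t_stat, p_value), 4))
```

Under these assumptions, only the correlation with MaintenanceDef is distinguishable from zero at the 5% level.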

Modeling

Now that we have examined and visualized the relationships among the variables, we can build a linear regression model to predict household income. First, we look at the histogram of the response variable. As noted in the univariate exploration, the distribution of Income is skewed to the right, indicating that a transformation might be needed. For now, we leave the response variable in its original form and will return to a potential transformation when looking at the model diagnostics.

Our bivariate exploratory data analysis indicates that every variable has a relationship with household income, even if it is weak. These variables can therefore be useful in building our model, but before we proceed, we need to check for multicollinearity. A first indication of possible multicollinearity comes from the correlation coefficients between all quantitative variables.

##                Income   Age MaintenanceDef NYCMove
## Income           1.00  0.04          -0.17   -0.10
## Age              0.04  1.00          -0.25   -0.64
## MaintenanceDef  -0.17 -0.25           1.00    0.46
## NYCMove         -0.10 -0.64           0.46    1.00

The correlation coefficients between NYCMove and Age (-0.64) and between NYCMove and MaintenanceDef (0.46) are relatively large in magnitude, indicating possible multicollinearity among these variables. We can also look at the pairs plot of the explanatory variables as another check for multicollinearity.

The pairs plot shows a noticeable linear relationship between MaintenanceDef and NYCMove, so we have more reason to worry about multicollinearity and may not want to include both variables in our model. We can check the variance inflation factors (VIF) for these variables in the full model:

##            Age MaintenanceDef        NYCMove 
##       1.687649       1.267728       1.999724
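As a rough cross-check, the VIFs can be approximated from the predictor correlations alone: they are the diagonal entries of the inverse of the predictor correlation matrix. The sketch below uses the two-decimal correlations printed in the matrix earlier, so it can only agree with the exact values to within rounding; `R_xx` and `vif_approx` are names introduced here.

```r
predictors <- c("Age", "MaintenanceDef", "NYCMove")

# Predictor-only block of the (rounded) correlation matrix shown earlier
R_xx <- matrix(
  c( 1.00, -0.25, -0.64,
    -0.25,  1.00,  0.46,
    -0.64,  0.46,  1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(predictors, predictors)
)

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif_approx <- diag(solve(R_xx))
print(round(vif_approx, 2))
```

The approximate values land within a few hundredths of the VIFs reported above, with NYCMove the largest, as expected from its correlations with both Age and MaintenanceDef.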

None of the VIF values is greater than 2.5, so we choose not to drop any variables from the full model. Having checked multicollinearity, we next look at the residual plots for the model with Age, MaintenanceDef, and NYCMove and check the error assumptions of the model.

1) The first assumption, independence, is not met because the residuals are not spread without pattern. They are concentrated on the right side of the plot.
2) The second assumption, that the errors have mean 0, is met because the residuals are centered around zero on average.
3) The third assumption, constant standard deviation, is met because the residuals have a roughly constant spread above and below zero.
4) In the QQ plot, the points are close to the line. Thus, we can assume that the residuals are normally distributed.
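The residual checks described here can be produced along the following lines. Because the `nyc` data are not reproduced in this report, the sketch fits the same formula to a simulated stand-in (`nyc_sim`, a hypothetical placeholder whose values are made up), so the plots it writes illustrate the mechanics only and not the results discussed above.

```r
set.seed(1)
n <- 299

# Placeholder data: a simulated stand-in for the real `nyc` data frame.
# The numbers below are invented and are NOT the survey data.
nyc_sim <- data.frame(
  Age            = round(runif(n, 26, 85)),
  MaintenanceDef = rpois(n, 2),
  NYCMove        = round(runif(n, 1942, 2004))
)
nyc_sim$Income <- 45000 - 2000 * nyc_sim$MaintenanceDef + rnorm(n, sd = 24000)

fit <- lm(Income ~ Age + MaintenanceDef + NYCMove, data = nyc_sim)

# Write both diagnostic plots to a PDF in the session's temporary directory
out_file <- file.path(tempdir(), "diagnostics.pdf")
pdf(out_file)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)   # residuals should scatter evenly around this line
qqnorm(resid(fit))
qqline(resid(fit))       # points near the line suggest approximate normality
invisible(dev.off())
```

With the real data, replacing `nyc_sim` by `nyc` would give the plots that the four checks refer to.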

Our EDA suggested a linear, though weak, relationship between income and each of the predictors. Also, the R^2 value (2.98%) was the highest compared to smaller models. Most of the diagnostic assumptions appear to be met, but note that the error assumption on independence is not. We therefore cannot confidently claim that our model fits well. But for now, we will proceed.

The equation of our model is:

  Income = 237408.41 - 71.98 * Age - 2273.22 * MaintenanceDef - 94.34 * NYCMove
  
## 
## Call:
## lm(formula = Income ~ Age + MaintenanceDef + NYCMove, data = nyc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37734 -18010  -2878  14971  60171 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    237408.41  278939.01   0.851   0.3954  
## Age               -71.98     144.97  -0.496   0.6199  
## MaintenanceDef  -2273.22     964.72  -2.356   0.0191 *
## NYCMove           -94.34     138.82  -0.680   0.4973  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23960 on 295 degrees of freedom
## Multiple R-squared:  0.02981,    Adjusted R-squared:  0.01995 
## F-statistic: 3.022 on 3 and 295 DF,  p-value: 0.03005
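The quantities at the bottom of this summary are tied together by standard identities, which makes it easy to see how little of the variation in income the model explains. With n = 299 observations and k = 3 predictors:

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = 1 - 0.97019 \times \frac{298}{295} \approx 0.0199
\]

\[
F = \frac{R^2 / k}{(1 - R^2)/(n-k-1)} = \frac{0.02981 / 3}{0.97019 / 295} \approx 3.02
\]

Both agree with the printed values to within rounding.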

We see negative coefficient estimates for all the explanatory variables. Note, however, that the estimate of beta_hat for age is negative, whereas our EDA showed a positive (though very weak) association between age and income. Since a model assumption is violated, one approach is to try a transformation. We created the variable log.Age, the log of Age. First, we perform univariate EDA with this variable (below). Compared to the original univariate EDA of age, the distribution of log.Age is closer to normal, with a median of 3.892 and a mean of 3.882.

Then, we perform bivariate EDA between the response variable and the transformed covariate (below). Compared to the original bivariate EDA, log.Age has a slightly higher correlation with income: the coefficient is now 0.08, up from 0.04. The closer the correlation coefficient is to 1 in absolute value, the stronger the linear relationship between x and y. Thus, the relationship between log.Age and income is slightly stronger, although it remains very weak.

## [1] 0.0753082

Then, we produce residual plots and comment on all four assumptions.

1) The first assumption, independence, is met because the residuals are spread without pattern.
2) The second assumption, that the errors have mean 0, is met because the residuals are centered around zero on average.
3) The third assumption, constant standard deviation, is met because the residuals have a roughly constant spread above and below zero on average.
4) In the QQ plot, the points are relatively close to the line. Thus, we can assume that the residuals are normally distributed.

Our exploratory data analysis with the transformed variable shows that a linear, though weak, relationship exists between income and the predictors, including log.Age. The R^2 value with the transformed variable (2.96%) is about 0.02 percentage points lower than the R^2 value of our original model (2.98%). However, since none of the error assumptions is violated with the transformed variable, we prefer this new model, keeping in mind that it explains only about 3% of the variation in income.

The equation of our model is:

  Income = 71332.45 + 3085.92 * log(Age) - 2325.47 * MaintenanceDef - 18.38 * NYCMove
## 
## Call:
## lm(formula = Income ~ log.Age + MaintenanceDef + NYCMove, data = nyc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37683 -18766  -3110  15577  60282 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     71332.45  283501.72   0.252   0.8015  
## log.Age          3085.92    7031.97   0.439   0.6611  
## MaintenanceDef  -2325.47     964.45  -2.411   0.0165 *
## NYCMove           -18.38     134.99  -0.136   0.8918  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23960 on 295 degrees of freedom
## Multiple R-squared:  0.02964,    Adjusted R-squared:  0.01977 
## F-statistic: 3.003 on 3 and 295 DF,  p-value: 0.03079

We interpret our coefficients further below:

\(\widehat{\beta_{1}}\) = 3085.92: This means that, controlling for the other variables in the model, a one-unit increase in log(Age) is associated with an expected increase of $3,085.92 in household income. Because age enters the model on the log scale, this is not the effect of one additional year of age.

\(\widehat{\beta_{2}}\) = -2325.47: This means that, controlling for the other variables in the model, each additional maintenance deficiency is associated with an expected decrease of $2,325.47 in household income.

\(\widehat{\beta_{3}}\) = -18.38: This means that, controlling for the other variables in the model, each one-year increase in the year the respondent moved to NYC is associated with an expected decrease of $18.38 in household income.
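The coefficient on the logged age variable is easier to read in terms of a percentage change in age. A short derivation, assuming the natural logarithm (the default for `log()` in R) and holding the other predictors fixed:

\[
\Delta\widehat{\text{Income}} = \widehat{\beta_{1}}\,[\log(1.01 \cdot \text{Age}) - \log(\text{Age})] = \widehat{\beta_{1}} \log(1.01) \approx 3085.92 \times 0.00995 \approx 30.7
\]

So a 1% increase in age corresponds to roughly $31 more in expected household income, and a doubling of age to about \(3085.92 \times \log(2) \approx 2139\) dollars.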

In conclusion, this linear regression model has an acceptable amount of multicollinearity, as all the VIFs were less than 2.5. Moreover, the signs of the coefficients are consistent with our EDA and the individual simple regressions. The R^2 of the model, however, is low (about 3%), and only maintenance deficiency is statistically significant at the 5% level. Thus, the model suggests that age, maintenance deficiency, and move-in year are associated with household income, but these associations are weak.

Prediction

Now that we have built a model that satisfies the assumptions, we are interested in predicting the income of a household with three maintenance deficiencies whose respondent is 53 years old and moved to NYC in 1987.

##        fit       lwr      upr
## 1 191385.6 -490586.7 873357.9
##        fit       lwr      upr
## 1 191385.6 -488954.5 871725.6
## [1] 54039.8

Prediction Point Estimate: 54039.8

95% Confidence Interval: (-488954.5 , 871725.6 )

95% Prediction Interval: (-490586.7, 873357.9)
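The point estimate can be checked by hand from the rounded coefficients of the log.Age model. The sketch below assumes natural logs and introduces the names `coefs`, `x_new`, and `x_raw` for illustration. It also evaluates the equation with 53 entered directly in place of log(53), which comes out within a few dollars of the `fit` value of 191385.6 printed above and may explain why the reported intervals are so wide.

```r
# Rounded coefficients from the log.Age model summary
coefs <- c(Intercept = 71332.45, log.Age = 3085.92,
           MaintenanceDef = -2325.47, NYCMove = -18.38)

# Household of interest: age 53, three deficiencies, moved to NYC in 1987
x_new <- c(1, log(53), 3, 1987)   # Age on the log scale, as the model expects
x_raw <- c(1, 53, 3, 1987)        # Age left unlogged, for comparison only

point_estimate <- sum(coefs * x_new)
unlogged_value <- sum(coefs * x_raw)

print(round(c(point_estimate = point_estimate,
              unlogged_value = unlogged_value), 1))
```

Under these assumptions the hand calculation gives roughly $40,087. With the fitted model object, the same household could be passed to `predict()` as `newdata = data.frame(log.Age = log(53), MaintenanceDef = 3, NYCMove = 1987)` with `interval = "confidence"` or `interval = "prediction"`.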

Discussion

In this analysis, we examined how household income in NYC is related to residents’ age, maintenance deficiencies, and move-in year. In the original model, the error assumption on independence was not met and the sign of the coefficient for age was not consistent with the individual simple regressions and our EDA. We therefore transformed this variable and created a new variable, log.Age. With the transformed variable, none of the error assumptions is violated and the signs of the coefficients are all consistent with our simple regressions and EDA. However, only maintenance deficiency is statistically significant at the 5% level (p = 0.0165); age and move-in year are not, and the model explains only about 3% of the variation in income.

I was surprised to see that move-in year has a negative relationship with income, which indicates that households who moved to the city longer ago tend to have higher incomes. One possible reading is that, although NYC housing is in demand and costly today, it may have been even more costly in the past, so that those who could afford to live in the city were people with high incomes.

I would be interested in the occupations of residents in addition to income level, as occupations also provide insight into people’s income and even into where within NYC these groups prefer to live. Adding more variables to build more accurate profiles of NYC residents could be very beneficial to companies and consumer advocate watchdog groups.