Regressions

Part 1 Build a regression model that predicts income using height, weight, and age as the predictors.

First step, is to load in the data and take a look at what we’re dealing with.

#Load the packages
library(rrcov)
library(Hmisc)#makes the histograms
library(corrplot)
library(RColorBrewer)
library(ggplot2)

# load data from file, omit an NAs
hdf <- na.omit(read.csv("homework_3_heights.csv"))

# what are we looking at today?
str(hdf)
## 'data.frame':    6645 obs. of  9 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ income   : int  19000 35000 105000 40000 75000 102000 0 70000 60000 150000 ...
##  $ height   : int  60 70 65 63 66 68 74 64 69 69 ...
##  $ weight   : int  155 156 195 197 190 200 225 160 162 194 ...
##  $ age      : int  53 51 52 54 49 49 48 54 55 54 ...
##  $ marital  : chr  "married" "married" "married" "married" ...
##  $ sex      : chr  "female" "female" "male" "female" ...
##  $ education: int  13 10 16 14 14 18 16 12 12 13 ...
##  $ afqt     : num  6.84 49.44 99.39 44.02 59.68 ...
##  - attr(*, "na.action")= 'omit' Named int [1:361] 19 20 22 46 52 62 67 80 93 102 ...
##   ..- attr(*, "names")= chr [1:361] "19" "20" "22" "46" ...
summary(hdf)
##        X            income           height          weight     
##  Min.   :   1   Min.   :     0   Min.   :52.00   Min.   : 76.0  
##  1st Qu.:1748   1st Qu.:   400   1st Qu.:64.00   1st Qu.:156.0  
##  Median :3503   Median : 30000   Median :67.00   Median :184.0  
##  Mean   :3499   Mean   : 41819   Mean   :67.12   Mean   :188.3  
##  3rd Qu.:5242   3rd Qu.: 55000   3rd Qu.:70.00   3rd Qu.:212.0  
##  Max.   :7006   Max.   :343830   Max.   :84.00   Max.   :480.0  
##       age          marital              sex              education    
##  Min.   :47.00   Length:6645        Length:6645        Min.   : 1.00  
##  1st Qu.:49.00   Class :character   Class :character   1st Qu.:12.00  
##  Median :51.00   Mode  :character   Mode  :character   Median :12.00  
##  Mean   :51.29                                         Mean   :13.24  
##  3rd Qu.:53.00                                         3rd Qu.:15.00  
##  Max.   :56.00                                         Max.   :20.00  
##       afqt       
##  Min.   :  0.00  
##  1st Qu.: 15.04  
##  Median : 36.77  
##  Mean   : 41.20  
##  3rd Qu.: 65.23  
##  Max.   :100.00

The data contains several variables for over 7000 people. Each person has data on their body size, age, income, education, gender, and afqt (I’m going to assume that this is an affluency quartile).

Next step, build the model.

# create model
income_model <- lm(income ~ height + weight + age, data = hdf)
summary(income_model)
## 
## Call:
## lm(formula = income ~ height + weight + age, data = hdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101209  -30938  -10810   14915  324166 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -149746.41   19564.75  -7.654 2.23e-14 ***
## height         3443.90     188.34  18.286  < 2e-16 ***
## weight          -81.96      17.28  -4.743 2.15e-06 ***
## age            -471.19     304.74  -1.546    0.122    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55410 on 6641 degrees of freedom
## Multiple R-squared:  0.05092,    Adjusted R-squared:  0.05049 
## F-statistic: 118.8 on 3 and 6641 DF,  p-value: < 2.2e-16

What is the R square?

The R-squared is 0.05, which is terrible. This is not a good model for income. Surprisingly, height and weight have very low p-values, though.

If someone is 1 inch taller than average, how much more money does the model estimate that they will make?

First, we need to find the average height, weight, and age, then plug those values into the model to get the predicted average income. Then repeat, except use average height + 1 inch, then subtract to find the difference

# calculate average values for height, weight, and age
ave_height <- mean(hdf$height)
ave_weight <- mean(na.omit(hdf$weight))
ave_age <- mean(hdf$age)

# construct new data frame of average values
ave_hdf <- data.frame(ave_height, ave_weight, ave_age)
names(ave_hdf) <- c("height", "weight", "age")

# plug average df values into the model to get predicted average income
pred_ave_income <- predict(income_model, newdata = ave_hdf)

# make new dataframe, with height = ave + 1 inch, all other values the same (averages)
tall_hdf <- data.frame(ave_height + 1, ave_weight, ave_age)
names(tall_hdf) <- c("height", "weight", "age")

# plug slightly taller df into model
pred_tall_income <- predict(income_model, tall_hdf)

# take difference between predicted values
dif_income <- pred_tall_income - pred_ave_income
dif_income
##        1 
## 3443.895

This model estimates that if someone is 1 inch taller than average, they would make $3443.9 more than a person of average height.


Part 2 Build a regression model that predicts income includes height, weight, age, marital, sex, education and afqt as predictors.

Now we’re going to include information that should have more effect on income (marital, sex, education, and afqt).

# create second model with additional variables
mod2 <- lm(income ~ height + weight + age + marital + sex + education + afqt, data = hdf)
summary(mod2)
## 
## Call:
## lm(formula = income ~ height + weight + age + marital + sex + 
##     education + afqt, data = hdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -115521  -25139   -5477   14904  326890 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -63638.10   19707.93  -3.229  0.00125 ** 
## height              293.26     227.77   1.288  0.19796    
## weight              -22.62      15.41  -1.468  0.14227    
## age                -401.81     270.53  -1.485  0.13753    
## maritalmarried     8617.82    1535.86   5.611 2.09e-08 ***
## maritalseparated  -2222.34    2951.22  -0.753  0.45146    
## maritalsingle     -5586.83    1990.67  -2.807  0.00502 ** 
## maritalwidowed     5076.53    4207.44   1.207  0.22764    
## sexmale           24815.77    1744.56  14.225  < 2e-16 ***
## education          5944.87     289.14  20.561  < 2e-16 ***
## afqt                389.42      26.52  14.685  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49100 on 6634 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2545 
## F-statistic: 227.8 on 10 and 6634 DF,  p-value: < 2.2e-16

What is the R square?

The R-squared value is a bit better at 0.2556. Significant variables are marital, sex, education, and afqt.

Provide a theory that describes why height and weight are no longer significant after adding the new variables.

Height and weight are no longer significant because income is more dependent on marital status, gender, education, and afqt.


Part 3 From the previous model add the predictions to the dataset and create the following ggplot Predicted values on the X-axis [HINT: use the predict function i.e. heights$preds <- predict(fancyModel, heights)], Income on the Y-axis, color by education, facet_grid with sex and marital [HINT: facet_grid(sex ~ marital)]

# add predicted income to the height dataframe
hdf$pred_income <- predict(mod2, hdf)

# plot 
# used log(incomes) to better see the data, otherwise x-axis ran from -$50,000 to $125,000+
ggplot(hdf, aes(x = log(pred_income), y = log(income), colour = as.factor(education))) + 
  geom_point() +
  facet_grid(sex ~ marital) +
  ggtitle("Predicted Income vs. Measured Income by Gender & Education") + 
  xlab("log(Predicted Income)") + ylab("log(Measured Income)") +
  labs(color = "Education (yrs)") 

I changed the scale for the x- and y-axis to the log() because the range of the income data was so large it was hard to see the labels and thus examine the results. Another slight improvement is to bin the years of education into three categories: “High School”, “College”, and “Post Graduate.” Also, I made the x- and y-axis have the same limit.

# create levels for education: high (0-12), college (13-16), grad (17+)
hdf$ed_level <- cut(hdf$education, breaks = c(0, 13, 17, 20), 
                    labels = c("High School", "College", "Post Graduate"))

# plot 
# used log(incomes) to better see the data, otherwise x-axis ran from -$50,000 to $125,000+
ggplot(hdf, aes(x = log(pred_income), y = log(income), colour = as.factor(ed_level))) + 
  geom_point() +
  facet_grid(sex ~ marital) +
  ggtitle("Predicted Income vs. Measured Income by Gender & Education") + 
  xlab("log(Predicted Income)") + ylab("log(Measured Income)") +
  labs(color = "Education Level") +
  xlim(0, 13) + ylim(0, 13)

It looks like this model is OK at predicting income with the additional variables. Most of the predicted value is about the same as the measured income (10:~10).