The Relationship Between Family Size and Number of Bedrooms

In this exercise, I test whether there is a relationship between family size and the number of bedrooms in a household. I hypothesize that as family size increases, the number of bedrooms in a household increases roughly linearly: households need more bedrooms to accommodate more family members.

The outcome variable (number of bedrooms) and the predictor variable (family size) both interest me because I believe their relationship will provide some insight into American housing patterns. If family size alone turns out to be a poor predictor, I will also examine the relationship between family size, wages and number of bedrooms, under the hypothesis that if family size does not lead to a larger number of bedrooms, higher wages might (since presumably people who make more money can afford bigger houses).

First, we need to load our libraries and data (2015 ACS on IPUMS):

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(haven)
library(broom)
library(ggplot2)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

Now we are going to construct a few variables. First, we create a new wage variable that recodes the missing-value codes (999998 and 999999) to “NA” (more on this later). Then we adjust the “bedrooms” variable so that missing or not-applicable values (“0” in the data set) become “NA”, households with no bedrooms (“1” in the data set) become “0”, and every other code is reduced by 1 to give the true number of bedrooms (e.g., code “6” means 5 bedrooms, since 6 - 1 = 5).

Finally, we filter to heads of household and remove missing values for both bedrooms and wages.

ipums<-ipums %>% mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))
ipums<-ipums %>% mutate(new_bedrooms=ifelse(bedrooms==0,NA,
                                            ifelse(bedrooms==1,0,bedrooms-1)))
new_ipums<-ipums %>% filter(relate==1 & !is.na(new_bedrooms) & !is.na(mywage))
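To make the recoding concrete, here is a minimal sketch on a handful of made-up codes. The vectors `bedrooms_code` and `incwage_code` are illustrative and not taken from the data; the `ifelse` logic is the same as above.

```r
# Hypothetical raw codes, for illustration only
bedrooms_code <- c(0, 1, 2, 3, 6)
incwage_code  <- c(25000, 999998, 0, 999999, 61000)

# Same logic as above: 0 -> NA, 1 -> 0 bedrooms, otherwise code - 1
new_bedrooms <- ifelse(bedrooms_code == 0, NA,
                       ifelse(bedrooms_code == 1, 0, bedrooms_code - 1))

# Same logic as above: the two special wage codes become NA
mywage <- ifelse(incwage_code %in% c(999998, 999999), NA, incwage_code)

print(new_bedrooms)
print(mywage)
```

Note that a wage of 0 is kept as a real value; only the two special codes are treated as missing.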

Hypothesis 1: There is a relationship between family size and number of bedrooms

Now that the data has been properly filtered, we can construct an OLS linear model to test the hypothesis: number of bedrooms will increase as family size increases.

bedroom_fit<-lm(new_bedrooms~famsize,data=new_ipums)
summary(bedroom_fit)
## 
## Call:
## lm(formula = new_bedrooms ~ famsize, data = new_ipums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6145 -0.7027  0.0244  0.5701  5.5701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.156995   0.005794   372.3   <2e-16 ***
## famsize     0.272875   0.002099   130.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.008 on 117298 degrees of freedom
## Multiple R-squared:  0.1259, Adjusted R-squared:  0.1259 
## F-statistic: 1.69e+04 on 1 and 117298 DF,  p-value: < 2.2e-16

The results here were interesting to me, to say the least. There is a positive relationship between family size and number of bedrooms: each additional family member is associated with about 0.27 more bedrooms on average, or, to put it another way, each new family member influences the household head to increase the number of bedrooms by about 1/4 of a room.

Yet this relationship is weak in explanatory terms. Although the coefficient is highly statistically significant, the model explains only about 13% of the variance in number of bedrooms.
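As a quick worked example, the fitted line can be evaluated by hand. The sketch below hard-codes the two coefficients from the summary output above; the family sizes chosen are arbitrary illustrations.

```r
# Coefficients copied from summary(bedroom_fit) above
b0 <- 2.156995   # intercept
b1 <- 0.272875   # slope for famsize

# Arbitrary family sizes chosen for illustration
famsize <- c(1, 2, 4, 6)

# Predicted number of bedrooms from the fitted line
predicted <- b0 + b1 * famsize
print(round(predicted, 2))
```

Under this model a one-person household is predicted to have roughly 2.4 bedrooms and a six-person household roughly 3.8, so a fivefold difference in family size moves the prediction by less than a bedroom and a half.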

A graphical plot appears to bear out the weakness of this relationship:

ggplot(new_ipums, aes(x=famsize,y=new_bedrooms))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

The plot is, to put it mildly, all over the place. As the numbers have shown us, if there is a relationship between family size and number of bedrooms, it is very weak.

We can now test some of the assumptions of the linear model, such as homoscedasticity (constant residual variance):

plot(bedroom_fit,which=1)

Here we see that the variance of the residuals is clearly non-constant. A Breusch-Pagan test confirms this:

bptest(bedroom_fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  bedroom_fit
## BP = 86.817, df = 1, p-value < 2.2e-16
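The studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the predictors, then multiply the R-squared of that auxiliary regression by the sample size. Here is a minimal sketch on simulated data; the variables and the data-generating process are assumptions for illustration, not the ACS data.

```r
set.seed(42)

# Simulated data whose error spread grows with x (heteroscedastic by design)
n <- 1000
x <- sample(1:8, n, replace = TRUE)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.2 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) BP statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared

# Compare against a chi-squared distribution, df = number of predictors
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p.value = bp_p))
```

A large statistic means the squared residuals are predictable from x, which is exactly what non-constant variance looks like.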

So, variance in the model is clearly non-constant. This is unfortunate, as the usual standard errors and p-values assume constant variance, which forces us to treat the model with even more caution.
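One common response to heteroscedasticity, which this analysis does not use, is to keep the OLS coefficients but compute heteroscedasticity-consistent (White, HC0) standard errors. The rough sketch below does this in base R on simulated data; the data and all variable names are assumptions for illustration only.

```r
set.seed(1)

# Simulated heteroscedastic data (illustrative, not the ACS sample)
n <- 500
x <- sample(1:8, n, replace = TRUE)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, x)
e <- resid(fit)          # OLS residuals

# HC0 sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)      # each row of X scaled by its residual
vc_hc0 <- bread %*% meat %*% bread

robust_se    <- sqrt(diag(vc_hc0))
classical_se <- sqrt(diag(vcov(fit)))

print(rbind(classical = classical_se, robust = robust_se))
```

The coefficients are unchanged; only the uncertainty attached to them is re-estimated without assuming constant variance.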

We can also test for normality of residuals:

plot(bedroom_fit,which=2)

And even use a Kolmogorov-Smirnov test to do the same thing, numerically:

ks.test(resid(bedroom_fit),y=pnorm)
## Warning in ks.test(resid(bedroom_fit), y = pnorm): ties should not be
## present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(bedroom_fit)
## D = 0.096523, p-value < 2.2e-16
## alternative hypothesis: two-sided

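The K-S statistic, D = 0.097, is the largest gap between the empirical cumulative distribution of the bedroom_fit residuals and the normal cumulative distribution. The p-value is much smaller than 0.001, so we reject normality. With a sample this large (over 117,000 households), however, even modest departures from normality produce tiny p-values, so the residuals aren’t exactly normal but also don’t deviate terribly from normal. Still, this means that we need to take these results with a grain of salt.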
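The D statistic reported by ks.test is the largest vertical distance between the sample's empirical CDF and the reference normal CDF. The sketch below computes it by hand on a simulated sample (an assumption for illustration; the real residuals contain ties, which is why R warns above) and checks it against ks.test.

```r
set.seed(123)

# Simulated continuous sample with no ties (illustrative only)
x <- sort(rnorm(200))
n <- length(x)

# Standard normal CDF evaluated at each sorted observation
F0 <- pnorm(x)

# The ECDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so check the gap on both sides of every jump
d_plus  <- max(seq_len(n) / n - F0)
d_minus <- max(F0 - (seq_len(n) - 1) / n)
D_manual <- max(d_plus, d_minus)

D_builtin <- unname(ks.test(x, "pnorm")$statistic)

print(c(manual = D_manual, builtin = D_builtin))
```

D is therefore a distance on the probability scale, between 0 and 1, rather than a percentage deviation of the residuals themselves.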

Hypothesis 2: There is a relationship between family size, wages and number of bedrooms

Perhaps family size + wages will be a better predictor of “number-of-bedrooms”? After all, poor people can’t afford bigger houses even if they need a bigger house, and presumably rich people have more money to spend on bigger houses. Maybe money AND family size together are a better predictor of living arrangements than just family size?

Let’s find out:

# new linear model, with family size + household head wages
bedroom_fit2<-lm(new_bedrooms~famsize+mywage,data=new_ipums)

#see the summary

summary(bedroom_fit2)
## 
## Call:
## lm(formula = new_bedrooms ~ famsize + mywage, data = new_ipums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3770 -0.6255  0.0902  0.6024  5.5826 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.104e+00  5.855e-03  359.41   <2e-16 ***
## famsize     2.605e-01  2.098e-03  124.20   <2e-16 ***
## mywage      2.384e-06  5.178e-08   46.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9993 on 117297 degrees of freedom
## Multiple R-squared:  0.1414, Adjusted R-squared:  0.1414 
## F-statistic:  9662 on 2 and 117297 DF,  p-value: < 2.2e-16

Here, we see that family size and household-head wages do appear to have a positive effect upon number-of-bedrooms, but the R-squared has hardly increased at all. In other words, while a relationship between these variables exists, the overall effect is still very weak.
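The wage coefficient is hard to read in scientific notation, so here is the same number on a more natural scale. The sketch uses only values copied from the two model summaries above; the $10,000 increment is an arbitrary choice for illustration.

```r
# Values copied from the model summaries above
b_wage   <- 2.384e-06   # bedrooms per additional dollar of wages
r2_model1 <- 0.1259     # famsize only
r2_model2 <- 0.1414     # famsize + mywage

# Change in predicted bedrooms per extra $10,000 in wages
per_10k <- b_wage * 10000

# Extra wages associated with one additional predicted bedroom
dollars_per_bedroom <- 1 / b_wage

# Gain in explained variance from adding wages
r2_gain <- r2_model2 - r2_model1

print(c(per_10k = per_10k,
        dollars_per_bedroom = dollars_per_bedroom,
        r2_gain = r2_gain))
```

Holding family size fixed, an extra $10,000 in wages is associated with about 0.024 more bedrooms, and adding wages raises the explained variance by only about 1.5 percentage points.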

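Now we can plot this model and test its assumptions:

# plot observed bedrooms against the model's predicted values;
# summing famsize and mywage on one axis would mix people with dollars
ggplot(new_ipums, aes(x=predict(bedroom_fit2,newdata=new_ipums),y=new_bedrooms))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)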
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

plot(bedroom_fit2,which=1)

plot(bedroom_fit2,which=2)

As we can see, the plots remain largely the same as before: an unclear, weak relationship between the variables, a high degree of heteroscedasticity, and residuals that deviate somewhat from normality. The Kolmogorov-Smirnov test bears this out as well:

ks.test(resid(bedroom_fit2),y=pnorm)
## Warning in ks.test(resid(bedroom_fit2), y = pnorm): ties should not be
## present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(bedroom_fit2)
## D = 0.074495, p-value < 2.2e-16
## alternative hypothesis: two-sided

The residuals of this model are slightly closer to a normal distribution than those of the previous model (D = 0.074 versus 0.097), but not by much.

Conclusions

Perhaps the most surprising thing about this analysis is the discovery that while the number of bedrooms in a household is related to family size and the wages of the household head, the relationship between these variables is actually quite weak: at most they only explain about 14% of the variance in number of bedrooms across the United States.

In other words: the number of bedrooms that Americans have in their houses (and thus the size of their houses) remains largely disconnected from the size of the family, or even the wages of the family’s head.

How could this be?

I can imagine a few possibilities: first, it’s possible that the vast majority of dwellings built in the United States have between (say) 2 and 4 bedrooms. If this is the case, “number of bedrooms” will likely not be strongly associated with any particular variable, as most dwellings are already ‘biased’ towards having a fairly low number of bedrooms.

Secondly, it’s also possible that bedrooms are a relatively ‘inelastic’ product: it would be difficult, for example, to somehow purchase an extra room every time a new child is added to a family, or every time the household head gets a raise. Bedrooms are permanent parts of houses, which must generally be purchased as a whole. So, it doesn’t really make sense to say that “every new member of the family influences the household head to increase the number of bedrooms by 0.27” as I said earlier. Bedrooms simply don’t work that way.

Lesson learned!