DATA 605 - Week 12 Exercise

Linear Regression

Presented below is some data from my current job in real estate development and construction. There are obvious variables that impact the cost of a construction project, such as its size in square feet (SF). The data below represent construction costs of luxury rental apartment projects in NYC. Rentals are more or less a standardized product, so it also makes sense to look at the number of apartments as a predictor of overall project cost, since cost is driven up by trades particular to individual units. For example, an apartment that is twice as large as another has more area to cover with flooring but does not have a second kitchen (think about the added cost of plumbing fixtures, appliances, cabinetry, etc.).

We should keep in mind that a lot of the variability in total project cost could come from factors like unusual foundations and the proportion of commercial and amenity spaces.

Number of Apartments

The data appear roughly linear at first glance, as could logically be expected, so we fit a linear model to the data and discuss the estimates.

Model

The model statistics below show that numapt is a strong predictor of project cost and that it is statistically significant: its p-value (2.4e-05) is well below 0.05. The model explains about 73% of the variability in the data (multiple R-squared of 0.7317; adjusted R-squared of 0.7125). The slope estimates that each additional apartment adds about $428,222 to total project cost, with a standard error of about $69,000. The intercept is not statistically significant (p = 0.419).

numapt.lm <- lm(totalprojectcost ~ numapt, data=projectdata)
summary(numapt.lm)

## 
## Call:
## lm(formula = totalprojectcost ~ numapt, data = projectdata)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -116367146  -38327022   -8420777   60271879  110999737 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35365623   42493331  -0.832    0.419    
## numapt         428222      69308   6.179  2.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67820000 on 14 degrees of freedom
## Multiple R-squared:  0.7317, Adjusted R-squared:  0.7125 
## F-statistic: 38.17 on 1 and 14 DF,  p-value: 2.4e-05
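
As a quick consistency check on the output above, the t value, F-statistic, and adjusted R-squared can be reproduced by hand from the printed estimates. The sketch below uses only numbers copied from the summary. The variable names are my own, and the 95% confidence interval for the slope is an added illustration that assumes the usual t-based interval with 14 residual degrees of freedom.

```r
# Values copied from the summary(numapt.lm) output above
slope     <- 428222        # estimate for numapt
slope_se  <- 69308         # its standard error
r_squared <- 0.7317        # multiple R-squared
df_resid  <- 14            # residual degrees of freedom
n         <- df_resid + 2  # observations: intercept and slope use 2 df

# The t value is the estimate divided by its standard error
t_value <- slope / slope_se

# With a single predictor, the F-statistic equals the squared t value
f_stat <- t_value^2

# Adjusted R-squared penalizes for the number of predictors (here 1)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# 95% confidence interval for the slope (t-based; an added illustration)
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

print(round(c(t = t_value, F = f_stat, adj_R2 = adj_r_squared), 4))
print(round(slope_ci))
```

The results should agree with the printed t value (6.179), F-statistic (38.17), and adjusted R-squared (0.7125) up to rounding, and the interval works out to roughly $280,000 to $577,000 per additional apartment.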

Residual Analysis - Variance

We conduct residual analysis to check the assumptions that make our linear model valid. We need the model residuals to have constant variance. In the plot of residuals against fitted values, the residuals are centered about 0 and there is no discernible pattern, so this assumption appears reasonable.

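plot(fitted(numapt.lm), resid(numapt.lm), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero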

Residual Analysis - Q-Q Plot

We also need to verify that our residuals are normally distributed, and we do so with the help of the Q-Q plot below. The residuals do not stray far from the line, so we conclude that the residuals could have come from a normal distribution and that the assumption of normality is also reasonable.

qqnorm(resid(numapt.lm))
qqline(resid(numapt.lm))
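
A formal test can complement the visual check of the Q-Q plot. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R) to the model residuals. Because the real project data are not reproduced here, the `projectdata` in this block is simulated stand-in data with arbitrary, made-up parameters; only the workflow carries over to the real model, not the numbers.

```r
# Hypothetical stand-in data: simulated, NOT the real project data.
# The ranges and noise level are arbitrary choices for illustration.
set.seed(605)
projectdata <- data.frame(numapt = round(runif(16, 200, 1000)))
projectdata$totalprojectcost <- 4e5 * projectdata$numapt + rnorm(16, sd = 6e7)

# Same model form as in the analysis above
numapt.lm <- lm(totalprojectcost ~ numapt, data = projectdata)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(numapt.lm))
print(sw)
```

A p-value above 0.05 means the test does not reject normality. With only 16 observations the test has limited power, so it is best read alongside the Q-Q plot and not as a replacement for it.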