Presented below is some data from my current job in real estate development and construction. Some variables obviously affect the cost of a construction project, such as its size in square feet. The data below represents construction costs of luxury rental apartment projects in NYC. Rentals are more or less a standardized product, so it also makes sense to look at the number of apartments as a predictor of overall project cost, since cost is driven up by trades particular to individual units. For example, an apartment that is twice as large as another has more area to cover with flooring but will not have a second kitchen (think of the added cost of plumbing fixtures, appliances, cabinetry, etc.).
We should keep in mind that a lot of the variability in total project cost could come from factors like unusual foundations and the proportion of commercial and amenity spaces.
The data appear roughly linear at first glance, as could logically be expected, so we fit a linear model to the data and discuss the estimates.
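The model statistics below show that numapt is indeed a strong predictor of project cost and that it is statistically significant, as its p-value (2.4e-05) is well below 0.05. The model explains about 73% of the variability in total project cost (multiple R-squared of 0.7317; adjusted R-squared of 0.7125). The slope estimate, which is the expected increase in total project cost for each additional apartment, is $428,222 with a standard error of about $69,000. The intercept is not statistically significant (p = 0.419), so we do not interpret it.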
##
## Call:
## lm(formula = totalprojectcost ~ numapt, data = projectdata)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -116367146  -38327022   -8420777   60271879  110999737
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35365623   42493331  -0.832    0.419
## numapt         428222      69308   6.179  2.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67820000 on 14 degrees of freedom
## Multiple R-squared: 0.7317, Adjusted R-squared: 0.7125
## F-statistic: 38.17 on 1 and 14 DF, p-value: 2.4e-05
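As a quick sanity check on the output above, the sketch below recomputes a few quantities using only the numbers printed in the summary (the original `projectdata` is not needed). The variable names are my own, and the 95% confidence level is an assumption chosen for illustration.

```r
# Values copied from the summary(lm) output above
slope    <- 428222   # estimated cost per additional apartment
se_slope <- 69308    # standard error of the slope
df_resid <- 14       # residual degrees of freedom
r2       <- 0.7317   # multiple R-squared
n        <- df_resid + 2  # 16 projects (two coefficients estimated)

# t statistic and two-sided p-value for the slope
t_value <- slope / se_slope
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope (confidence level is an assumption)
t_crit   <- qt(0.975, df = df_resid)
ci_slope <- slope + c(-1, 1) * t_crit * se_slope

# Adjusted R-squared for a model with one predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)

print(round(c(t = t_value, p = p_value), 6))
print(round(ci_slope))
print(round(adj_r2, 4))
```

The interval for the slope is wide, roughly $280,000 to $577,000 per apartment, which is a reminder that the per-unit cost is estimated from only 16 projects.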
We conduct residual analysis to check the assumptions underlying our linear model. First, the model residuals need to have constant variance. The residuals are centered about 0 and show no discernible pattern, so this assumption appears reasonable.
We also need to check that the residuals are approximately normally distributed, and we do so with the help of the Q-Q plot below. The residuals do not deviate much from the line, so we conclude that they could plausibly have come from a normal distribution and that the assumption of normality is also reasonable.
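For readers who want to reproduce these two residual checks, here is a minimal sketch. It runs on SYNTHETIC data generated in the block, not on the project data, so the numbers and plots it produces say nothing about the real projects. The data frame name `synthetic`, the random seed, and the simulation parameters are all assumptions made for illustration; only the column names and the sample size of 16 mirror the model above. The Shapiro-Wilk test at the end is an optional addition and was not part of the original analysis.

```r
# SYNTHETIC data for illustration only; not the author's project data
set.seed(1)
numapt <- round(runif(16, min = 200, max = 1000))
totalprojectcost <- 400000 * numapt + rnorm(16, mean = 0, sd = 6e7)
synthetic <- data.frame(numapt, totalprojectcost)

fit <- lm(totalprojectcost ~ numapt, data = synthetic)
res <- resid(fit)

# Constant variance: residuals vs fitted values should scatter evenly
# around 0, with no funnel shape or curve
plot(fitted(fit), res,
     xlab = "Fitted project cost", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Optional numeric complement to the Q-Q plot (low power with n = 16)
sw <- shapiro.test(res)
print(sw)
```

With the real data, replacing `synthetic` with `projectdata` in the `lm()` call should be the only change needed.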