One of the most important aspects of designing a construction project is the budget. When assembling a budget, cost estimates for the major trades are often provided by contractors based on the information available. Estimating these costs involves extracting quantities of materials from the plans and determining how many man-hours are required to build. This is a detailed, time-consuming exercise that depends on outside information (contractor pricing). For this reason, a linear regression model could be a useful tool for predicting the budget of a new building from information on previous projects.
Given that a number of trade costs can be broken down to a price per unit per trade, for example $x/square foot of concrete or $y/apartment for electrical wiring, it seems natural to turn to linear regression to predict the cost of a construction project based on these variables. While the predictors can be detailed and derived for each trade (superstructure, mechanical, plumbing, etc.), it may save time to use more general predictors for the project, such as building area in SF, landscape area in SF, facade area in SF, number of apartments, number of rooms, average footprint area of a floor, etc.
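As a rough sketch of why per-unit pricing maps onto a linear model, suppose only two general predictors are used. The symbols below are illustrative assumptions, not notation used elsewhere in this post: $\beta_1$ plays the role of a blended dollar-per-SF price, $\beta_2$ a blended dollar-per-apartment price, $\beta_0$ a fixed cost, and $\varepsilon$ the part of the cost the predictors do not explain.

```latex
\text{cost} \;=\; \beta_0
  \;+\; \beta_1 \cdot \text{area (SF)}
  \;+\; \beta_2 \cdot \text{number of apartments}
  \;+\; \varepsilon
```

Fitting a regression to past projects amounts to estimating these unit prices from data rather than collecting them trade by trade.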
The table below contains some typical project cost information. The variable gea refers to gross enclosed area, which is one of the area measures describing the size of a building.
| gea | numapt | avgsfperapt | dolpersf | dolperapt | totalprojectcost | totaltradecost |
|---|---|---|---|---|---|---|
| 636074 | 650 | 979 | 199 | 194787 | 126611699 | 94345699 |
| 477246 | 498 | 958 | 213 | 204396 | 101789064 | 85827762 |
| 703184 | 395 | 1780 | 225 | 400425 | 158167867 | 127464947 |
| 415586 | 394 | 1055 | 317 | 334669 | 131859394 | 106422664 |
| 221313 | 189 | 1171 | 385 | 451390 | 85312660 | 72500532 |
| 258300 | 184 | 1404 | 418 | 587470 | 108094413 | 84055946 |
| 864868 | 835 | 1036 | 341 | 353085 | 294826044 | 241100259 |
| 354622 | 367 | 966 | 266 | 257080 | 94348484 | 74051545 |
| 310053 | 345 | 899 | 272 | 244703 | 84422410 | 66403310 |
| 1146366 | 820 | 1398 | 215 | 300384 | 246314480 | 199096678 |
| 483822 | 584 | 828 | 286 | 237304 | 138585695 | 109187466 |
| 734312 | 714 | 1028 | 347 | 357196 | 255038067 | 202966634 |
| 1195313 | 1028 | 1163 | 403 | 468900 | 482029408 | 389929462 |
| 410239 | 394 | 1041 | 472 | 491253 | 193553729 | 143023902 |
| 855541 | 800 | 1069 | 430 | 459625 | 367700352 | 297597354 |
| 917783 | 798 | 1150 | 455 | 523002 | 417355476 | 339036766 |

| Data summary | |
|---|---|
| Name | data |
| Number of rows | 16 |
| Number of columns | 7 |
| Column type frequency: numeric | 7 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| gea | 0 | 1 | 624038.88 | 306933.63 | 221313 | 396334.75 | 559948.0 | 857872.75 | 1195313 | ▇▂▃▃▂ |
| numapt | 0 | 1 | 562.19 | 252.65 | 184 | 387.25 | 541.0 | 798.50 | 1028 | ▅▇▃▇▂ |
| avgsfperapt | 0 | 1 | 1120.31 | 235.43 | 828 | 975.75 | 1048.0 | 1165.00 | 1780 | ▅▇▁▁▁ |
| dolpersf | 0 | 1 | 327.75 | 91.97 | 199 | 255.75 | 329.0 | 406.75 | 472 | ▇▆▆▃▇ |
| dolperapt | 0 | 1 | 366604.31 | 121466.59 | 194787 | 253985.75 | 355140.5 | 461943.75 | 587470 | ▇▃▅▆▃ |
| totalprojectcost | 0 | 1 | 205375577.62 | 126482953.69 | 84422410 | 106518075.75 | 148376781.0 | 264985061.25 | 482029408 | ▇▁▂▁▂ |
| totaltradecost | 0 | 1 | 164563182.88 | 103168415.18 | 66403310 | 85384808.00 | 118326206.5 | 212500040.25 | 389929462 | ▇▁▂▁▂ |
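
The table can be rebuilt as a data frame and the per-unit columns checked against the raw ones. This is a minimal sketch: the values are copied from the table above, the name `data` matches the one used in the model code, and `check` and `max_gap` are hypothetical helper names introduced here. The check assumes the per-unit columns were rounded to whole numbers.

```r
# Rebuild the project table as a data frame named `data`.
# Values are copied verbatim from the table in this post.
data <- read.table(header = TRUE, text = "
gea numapt avgsfperapt dolpersf dolperapt totalprojectcost totaltradecost
636074   650  979 199 194787 126611699  94345699
477246   498  958 213 204396 101789064  85827762
703184   395 1780 225 400425 158167867 127464947
415586   394 1055 317 334669 131859394 106422664
221313   189 1171 385 451390  85312660  72500532
258300   184 1404 418 587470 108094413  84055946
864868   835 1036 341 353085 294826044 241100259
354622   367  966 266 257080  94348484  74051545
310053   345  899 272 244703  84422410  66403310
1146366  820 1398 215 300384 246314480 199096678
483822   584  828 286 237304 138585695 109187466
734312   714 1028 347 357196 255038067 202966634
1195313 1028 1163 403 468900 482029408 389929462
410239   394 1041 472 491253 193553729 143023902
855541   800 1069 430 459625 367700352 297597354
917783   798 1150 455 523002 417355476 339036766
")

# Recompute the per-unit columns from the raw columns.
check <- data.frame(
  avgsfperapt = data$gea / data$numapt,
  dolpersf    = data$totalprojectcost / data$gea,
  dolperapt   = data$totalprojectcost / data$numapt
)

# Largest gap between recomputed and tabulated values, per column.
# Gaps below 1 are consistent with rounding to whole numbers.
max_gap <- sapply(names(check), function(v) max(abs(check[[v]] - data[[v]])))
print(max_gap)
```

If the gaps are all below 1, then dolpersf and dolperapt carry the response itself (totalprojectcost) divided by a size measure, and avgsfperapt adds nothing beyond gea and numapt.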
The full model yields a high adjusted R-squared of about 96% (multiple R-squared 0.97); however, none of the predictors is individually significant. It should also be noted that the predictors dolpersf and dolperapt are calculated from the known totalprojectcost of previous projects. This information would not be available at prediction time; in fact, it is what we are trying to predict. These variables should therefore be dropped.
lm.full <- lm(totalprojectcost ~ gea + numapt + avgsfperapt + dolpersf + dolperapt, data = data)
summary(lm.full)
##
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt +
## dolpersf + dolperapt, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39066538 -15946458 -149404 22636090 27738253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.730e+08 2.026e+08 -1.347 0.208
## gea 1.482e+02 1.980e+02 0.748 0.472
## numapt 2.479e+05 2.323e+05 1.067 0.311
## avgsfperapt 3.380e+04 1.767e+05 0.191 0.852
## dolpersf 6.164e+05 5.708e+05 1.080 0.306
## dolperapt 1.797e+01 4.928e+02 0.036 0.972
##
## Residual standard error: 26320000 on 10 degrees of freedom
## Multiple R-squared: 0.9711, Adjusted R-squared: 0.9567
## F-statistic: 67.28 on 5 and 10 DF, p-value: 2.268e-07
##
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109146064 -47016722 660641 45273595 106533597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.708e+08 2.784e+08 -0.973 0.350
## gea -2.190e+02 4.779e+02 -0.458 0.655
## numapt 6.950e+05 5.656e+05 1.229 0.243
## avgsfperapt 1.982e+05 2.298e+05 0.863 0.405
##
## Residual standard error: 67950000 on 12 degrees of freedom
## Multiple R-squared: 0.7691, Adjusted R-squared: 0.7114
## F-statistic: 13.32 on 3 and 12 DF, p-value: 0.0003981
The model above uses the variables as they are, but we see that none of the predictors are significant. We note that the ranges of gea and totalprojectcost are quite wide, and larger projects can be expected to be more expensive as roof and amenity areas grow. For this reason, a log transformation of these variables may be justified.
The summary below shows adjusted R-squared rising from about 71% to nearly 77% (multiple R-squared, the percentage of variance explained, rises from 77% to 81%). However, only the intercept is significant at the 5% level. Since we are dealing with only a few predictors, we proceed by dropping the predictors that we expect to contain the least information. Area (gea) is believed to be more important than numapt, so numapt is dropped; avgsfperapt is simply derived from the existing variables (gea divided by numapt), so it is dropped as well.
##
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea) + numapt + avgsfperapt,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40019 -0.11974 -0.04331 0.13593 0.52913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.4765163 7.8185445 2.747 0.0177 *
## log(gea) -0.3766213 0.6857076 -0.549 0.5929
## numapt 0.0028118 0.0014004 2.008 0.0677 .
## avgsfperapt 0.0008023 0.0005347 1.500 0.1593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.281 on 12 degrees of freedom
## Multiple R-squared: 0.8135, Adjusted R-squared: 0.7668
## F-statistic: 17.44 on 3 and 12 DF, p-value: 0.0001132
We end up with a statistically significant model with a single predictor, log(gea), whose coefficient is 0.965. Since both the independent and dependent variables were log-transformed, we can interpret the coefficient as follows: for every 1% increase in gea, totalprojectcost increases by nearly 1% (about 0.965%). For every 10% increase in gea, the project cost increases by ((1.10)^0.965-1)*100 = 9.63%.
##
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45538 -0.22170 0.04466 0.26027 0.39226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2172 1.9697 3.156 0.007 **
## log(gea) 0.9650 0.1489 6.482 1.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3011 on 14 degrees of freedom
## Multiple R-squared: 0.7501, Adjusted R-squared: 0.7322
## F-statistic: 42.02 on 1 and 14 DF, p-value: 1.443e-05
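
The final model and the percentage interpretation of its coefficient can be reproduced with a short sketch. The two vectors are copied from the table in this post; the names `projects`, `lm.loglog`, `pct_cost_change`, and `new_building` are hypothetical, and the 500,000 SF building is an invented input used only to show how a prediction would be back-transformed.

```r
# Gross enclosed area (SF) and total project cost ($), copied from the table.
gea <- c(636074, 477246, 703184, 415586, 221313, 258300, 864868, 354622,
         310053, 1146366, 483822, 734312, 1195313, 410239, 855541, 917783)
totalprojectcost <- c(126611699, 101789064, 158167867, 131859394, 85312660,
                      108094413, 294826044, 94348484, 84422410, 246314480,
                      138585695, 255038067, 482029408, 193553729, 367700352,
                      417355476)
projects <- data.frame(gea, totalprojectcost)

# Log-log model: log(cost) = b0 + b1 * log(gea).
lm.loglog <- lm(log(totalprojectcost) ~ log(gea), data = projects)
summary(lm.loglog)

# Slope on the log scale, i.e. the elasticity of cost with respect to area.
b1 <- unname(coef(lm.loglog)["log(gea)"])

# Percent change in cost implied by a given percent change in gea.
pct_cost_change <- function(pct_gea) ((1 + pct_gea / 100)^b1 - 1) * 100
print(round(c(b1 = b1,
              one_pct = pct_cost_change(1),
              ten_pct = pct_cost_change(10)), 3))

# Back-transform a log-scale prediction for a hypothetical building.
new_building <- data.frame(gea = 500000)
predicted_cost <- exp(predict(lm.loglog, newdata = new_building))
print(predicted_cost)
```

Exponentiating a log-scale prediction gives a typical (median-like) cost rather than an expected mean, so with only 16 projects the result is best read as a rough order of magnitude.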
The information that can be derived from this kind of data could be of great value to developers. In this example, we derived some basic insight from a simple linear regression model, which is useful for explanation purposes. Augmenting the data with more predictors would allow building more advanced models, which could then be used for prediction.