Simple Linear Regression

One of the most important aspects of the design of a construction project is the budget. When assembling a budget, cost estimates for the major trades are often provided by contractors based on the information available. Estimating these costs involves extracting quantities of materials from the plans and determining how many man-hours are required to build. This is a detailed, time-consuming exercise that requires outside information (contractor pricing). For this reason, a linear regression model might be a very useful tool for predicting the budget of a new building using information from previous projects.

Given that a number of trade costs can be broken down to a price per unit per trade, for example $x/square foot of concrete or $y/apartment for electrical wiring, it seems natural to turn to linear regression to predict the cost of a construction project based on these variables. While the predictors can be detailed and derived for each trade (superstructure, mechanical, plumbing, etc.), it may save time to use more general predictors for the project, such as building area in square feet (SF), landscape area in SF, facade area in SF, number of apartments, number of rooms, average floor footprint area, etc.

The table below contains some typical project cost information. The variable gea refers to gross enclosed area, which is one of the area measures describing the size of a building.

Data Exploration

gea      numapt  avgsfperapt  dolpersf  dolperapt  totalprojectcost  totaltradecost
636074   650     979          199       194787     126611699         94345699
477246   498     958          213       204396     101789064         85827762
703184   395     1780         225       400425     158167867         127464947
415586   394     1055         317       334669     131859394         106422664
221313   189     1171         385       451390     85312660          72500532
258300   184     1404         418       587470     108094413         84055946
864868   835     1036         341       353085     294826044         241100259
354622   367     966          266       257080     94348484          74051545
310053   345     899          272       244703     84422410          66403310
1146366  820     1398         215       300384     246314480         199096678
483822   584     828          286       237304     138585695         109187466
734312   714     1028         347       357196     255038067         202966634
1195313  1028    1163         403       468900     482029408         389929462
410239   394     1041         472       491253     193553729         143023902
855541   800     1069         430       459625     367700352         297597354
917783   798     1150         455       523002     417355476         339036766

Data summary
Name data
Number of rows 16
Number of columns 7
_______________________
Column type frequency:
numeric 7
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
gea 0 1 624038.88 306933.63 221313 396334.75 559948.0 857872.75 1195313 ▇▂▃▃▂
numapt 0 1 562.19 252.65 184 387.25 541.0 798.50 1028 ▅▇▃▇▂
avgsfperapt 0 1 1120.31 235.43 828 975.75 1048.0 1165.00 1780 ▅▇▁▁▁
dolpersf 0 1 327.75 91.97 199 255.75 329.0 406.75 472 ▇▆▆▃▇
dolperapt 0 1 366604.31 121466.59 194787 253985.75 355140.5 461943.75 587470 ▇▃▅▆▃
totalprojectcost 0 1 205375577.62 126482953.69 84422410 106518075.75 148376781.0 264985061.25 482029408 ▇▁▂▁▂
totaltradecost 0 1 164563182.88 103168415.18 66403310 85384808.00 118326206.5 212500040.25 389929462 ▇▁▂▁▂
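
For readers who want to follow along, the table can be rebuilt as a data frame roughly as sketched below. The values are copied from the table above. The object name data matches the lm() calls later in the text, while the use of read.table() and base summary() is an assumption about how one might reproduce the summary, not necessarily how it was originally produced.

```r
# Sketch: rebuild the project table as a data frame (values copied from the table).
data <- read.table(header = TRUE, text = "gea numapt avgsfperapt dolpersf dolperapt totalprojectcost totaltradecost
636074 650 979 199 194787 126611699 94345699
477246 498 958 213 204396 101789064 85827762
703184 395 1780 225 400425 158167867 127464947
415586 394 1055 317 334669 131859394 106422664
221313 189 1171 385 451390 85312660 72500532
258300 184 1404 418 587470 108094413 84055946
864868 835 1036 341 353085 294826044 241100259
354622 367 966 266 257080 94348484 74051545
310053 345 899 272 244703 84422410 66403310
1146366 820 1398 215 300384 246314480 199096678
483822 584 828 286 237304 138585695 109187466
734312 714 1028 347 357196 255038067 202966634
1195313 1028 1163 403 468900 482029408 389929462
410239 394 1041 472 491253 193553729 143023902
855541 800 1069 430 459625 367700352 297597354
917783 798 1150 455 523002 417355476 339036766
")

str(data)      # structure: one row per project, one column per variable
summary(data)  # base-R summary: min, quartiles, mean, max per column
```

With only 16 projects, each quartile in the summary rests on a handful of observations, which is worth keeping in mind when reading the model output.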


Modeling

The full model yields a high R-squared of 97% (adjusted R-squared of 96%); however, none of the predictors are significant. It should also be noted that the predictors dolpersf and dolperapt are calculated from the known totalprojectcost of previous projects. This information would not be available at prediction time; in fact, it is what we are trying to predict. These variables should therefore be dropped, which gives the second model below.

## 
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt + 
##     dolpersf + dolperapt, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -39066538 -15946458   -149404  22636090  27738253 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.730e+08  2.026e+08  -1.347    0.208
## gea          1.482e+02  1.980e+02   0.748    0.472
## numapt       2.479e+05  2.323e+05   1.067    0.311
## avgsfperapt  3.380e+04  1.767e+05   0.191    0.852
## dolpersf     6.164e+05  5.708e+05   1.080    0.306
## dolperapt    1.797e+01  4.928e+02   0.036    0.972
## 
## Residual standard error: 26320000 on 10 degrees of freedom
## Multiple R-squared:  0.9711, Adjusted R-squared:  0.9567 
## F-statistic: 67.28 on 5 and 10 DF,  p-value: 2.268e-07
## 
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -109146064  -47016722     660641   45273595  106533597 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.708e+08  2.784e+08  -0.973    0.350
## gea         -2.190e+02  4.779e+02  -0.458    0.655
## numapt       6.950e+05  5.656e+05   1.229    0.243
## avgsfperapt  1.982e+05  2.298e+05   0.863    0.405
## 
## Residual standard error: 67950000 on 12 degrees of freedom
## Multiple R-squared:  0.7691, Adjusted R-squared:  0.7114 
## F-statistic: 13.32 on 3 and 12 DF,  p-value: 0.0003981
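
The point about dolpersf and dolperapt being derived from the response can be checked directly. The sketch below uses only the first three rows of the table (copied verbatim) and assumes the unit costs were rounded to whole dollars; the object name projects is purely illustrative.

```r
# First three rows of the project table, copied verbatim.
projects <- data.frame(
  gea              = c(636074, 477246, 703184),
  numapt           = c(650, 498, 395),
  dolpersf         = c(199, 213, 225),
  dolperapt        = c(194787, 204396, 400425),
  totalprojectcost = c(126611699, 101789064, 158167867)
)

# Recompute the unit costs from the response itself
# (assumption: the table values were rounded to whole dollars).
recomputed_dolpersf  <- round(projects$totalprojectcost / projects$gea)
recomputed_dolperapt <- round(projects$totalprojectcost / projects$numapt)

# Both comparisons hold for these rows, so the two predictors
# carry the response inside them.
all(recomputed_dolpersf == projects$dolpersf)
all(recomputed_dolperapt == projects$dolperapt)
```

Because both columns are just totalprojectcost divided by a size measure, keeping them in the model leaks the answer into the predictors, which explains the very high R-squared of the full model.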

The model above uses the variables as they are, but we see that none of the predictors are significant. We note that the ranges of gea and totalprojectcost are quite wide, and larger projects can be expected to be more expensive as roof and amenity areas grow. For this reason, it may be justified to apply a log transformation to these variables.

The summary below reveals an increase in adjusted R-squared (the percentage of variance explained, adjusted for the number of predictors) from 71% to nearly 77%. However, only the intercept is significant at the 5% level. Since we are only dealing with a few predictors, we proceed by dropping the predictors that we expect to contain the least information. Area (gea) is believed to be more important than numapt, so numapt is dropped; avgsfperapt is simply derived from the existing variables (gea divided by numapt), so it is dropped as well.

## 
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea) + numapt + avgsfperapt, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40019 -0.11974 -0.04331  0.13593  0.52913 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 21.4765163  7.8185445   2.747   0.0177 *
## log(gea)    -0.3766213  0.6857076  -0.549   0.5929  
## numapt       0.0028118  0.0014004   2.008   0.0677 .
## avgsfperapt  0.0008023  0.0005347   1.500   0.1593  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.281 on 12 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.7668 
## F-statistic: 17.44 on 3 and 12 DF,  p-value: 0.0001132

We end up with a statistically significant model with the single predictor log(gea), whose coefficient is 0.965. Since both the independent and dependent variables were log-transformed, we can interpret the coefficient as follows: for every 1% increase in gea, totalprojectcost increases by nearly 1% (0.965%). For every 10% increase in gea, the project cost increases by ((1.10)^0.965 - 1) * 100 = 9.63%.

## 
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45538 -0.22170  0.04466  0.26027  0.39226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.2172     1.9697   3.156    0.007 ** 
## log(gea)      0.9650     0.1489   6.482 1.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3011 on 14 degrees of freedom
## Multiple R-squared:  0.7501, Adjusted R-squared:  0.7322 
## F-statistic: 42.02 on 1 and 14 DF,  p-value: 1.443e-05
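
The log-log fit and its percentage interpretation can be reproduced with a short sketch like the one below. The two vectors are copied from the table; the helper pct_change() is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# gea and totalprojectcost copied from the project table (16 projects).
gea <- c(636074, 477246, 703184, 415586, 221313, 258300, 864868, 354622,
         310053, 1146366, 483822, 734312, 1195313, 410239, 855541, 917783)
totalprojectcost <- c(126611699, 101789064, 158167867, 131859394,
                      85312660, 108094413, 294826044, 94348484,
                      84422410, 246314480, 138585695, 255038067,
                      482029408, 193553729, 367700352, 417355476)

# Log-log model: the slope is the elasticity of cost with respect to area.
fit  <- lm(log(totalprojectcost) ~ log(gea))
beta <- unname(coef(fit)["log(gea)"])

# Hypothetical helper: percent change in cost for a given percent change
# in area, using cost ratio = (area ratio)^beta.
pct_change <- function(pct_increase, elasticity) {
  ((1 + pct_increase / 100)^elasticity - 1) * 100
}

beta                 # slope, about 0.965
pct_change(1, beta)  # effect of a 1% increase in gea
pct_change(10, beta) # effect of a 10% increase in gea, about 9.63%
```

The formula follows from the model itself: if log(cost) = a + b * log(gea), then multiplying gea by a factor k multiplies the cost by k^b, so a 10% larger building (k = 1.10) costs 1.10^b times as much.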

The information that can be derived from this kind of data could be of great value to developers. In this example, we derived some basic insight from a simple linear regression model, which is useful for explanatory purposes. Augmenting the data with more predictors would allow building more advanced models, which could then be used for prediction.