Simple Linear Regression

One of the most important aspects of the design of a construction project is the budget. When assembling a budget, cost estimates for the major trades are often provided by contractors based on the information available. Estimating these costs involves extracting quantities of materials from plans and determining how many man-hours are required to build them. This is a detailed, time-consuming exercise that requires outside information (contractor pricing). For this reason, a linear regression model could be a useful tool for predicting the budget of a new building from information on previous projects.

Given that many trade costs can be broken down to a price per unit, for example $x per square foot of concrete or $y per apartment for electrical wiring, it seems natural to turn to linear regression to predict the cost of a construction project from such variables. While the predictors can be detailed and derived for each trade (superstructure, mechanical, plumbing, etc.), it may save time to use more general predictors for the project, such as building area in SF, landscape area in SF, facade area in SF, number of apartments, number of rooms, average floor footprint area, etc.

The table below contains some typical project cost information. The variable gea refers to gross enclosed area, which is one of the area measures describing the size of a building.

Data Exploration


gea	numapt	avgsfperapt	dolpersf	dolperapt	totalprojectcost	totaltradecost
636074	650	979	199	194787	126611699	94345699
477246	498	958	213	204396	101789064	85827762
703184	395	1780	225	400425	158167867	127464947
415586	394	1055	317	334669	131859394	106422664
221313	189	1171	385	451390	85312660	72500532
258300	184	1404	418	587470	108094413	84055946
864868	835	1036	341	353085	294826044	241100259
354622	367	966	266	257080	94348484	74051545
310053	345	899	272	244703	84422410	66403310
1146366	820	1398	215	300384	246314480	199096678
483822	584	828	286	237304	138585695	109187466
734312	714	1028	347	357196	255038067	202966634
1195313	1028	1163	403	468900	482029408	389929462
410239	394	1041	472	491253	193553729	143023902
855541	800	1069	430	459625	367700352	297597354
917783	798	1150	455	523002	417355476	339036766
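
As a minimal sketch, the table above could be entered into R as a data frame named data, the name used by the model calls later in this post. How the original data was loaded is not shown, so this construction is an assumption. Only the raw columns are typed in (totaltradecost is left out for brevity); the per-unit columns are recomputed to confirm that they are simple ratios of the others.

```r
# Raw columns copied from the table above
data <- data.frame(
  gea = c(636074, 477246, 703184, 415586, 221313, 258300, 864868, 354622,
          310053, 1146366, 483822, 734312, 1195313, 410239, 855541, 917783),
  numapt = c(650, 498, 395, 394, 189, 184, 835, 367,
             345, 820, 584, 714, 1028, 394, 800, 798),
  totalprojectcost = c(126611699, 101789064, 158167867, 131859394,
                       85312660, 108094413, 294826044, 94348484,
                       84422410, 246314480, 138585695, 255038067,
                       482029408, 193553729, 367700352, 417355476)
)

# Per-unit columns recomputed from the raw ones
data$avgsfperapt <- round(data$gea / data$numapt)              # SF per apartment
data$dolpersf    <- round(data$totalprojectcost / data$gea)    # dollars per SF
data$dolperapt   <- round(data$totalprojectcost / data$numapt) # dollars per apartment

head(data, 3)
```

The recomputed values should agree with the corresponding columns of the table, which shows that dolpersf and dolperapt already contain the total project cost.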

Data summary

Name	data
Number of rows	16
Number of columns	7
_______________________
Column type frequency:
numeric	7
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
gea	1	624038.88	306933.63	221313	396334.75	559948.0	857872.75	1195313	▇▂▃▃▂
numapt	1	562.19	252.65	184	387.25	541.0	798.50	1028	▅▇▃▇▂
avgsfperapt	1	1120.31	235.43	828	975.75	1048.0	1165.00	1780	▅▇▁▁▁
dolpersf	1	327.75	91.97	199	255.75	329.0	406.75	472	▇▆▆▃▇
dolperapt	1	366604.31	121466.59	194787	253985.75	355140.5	461943.75	587470	▇▃▅▆▃
totalprojectcost	1	205375577.62	126482953.69	84422410	106518075.75	148376781.0	264985061.25	482029408	▇▁▂▁▂
totaltradecost	1	164563182.88	103168415.18	66403310	85384808.00	118326206.5	212500040.25	389929462	▇▁▂▁▂


Modeling

The full model yields a high R-squared of 97% (adjusted 96%); however, none of the predictors are significant. It should also be noted that the predictors dolpersf and dolperapt are calculated from the known totalprojectcost of previous projects. This information would not be available at prediction time; in fact, it is what we are trying to predict. These variables should therefore be dropped.

lm.full <- lm(totalprojectcost ~ gea + numapt + avgsfperapt + dolpersf + dolperapt, data = data)
summary(lm.full)

## 
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt + 
##     dolpersf + dolperapt, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -39066538 -15946458   -149404  22636090  27738253 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.730e+08  2.026e+08  -1.347    0.208
## gea          1.482e+02  1.980e+02   0.748    0.472
## numapt       2.479e+05  2.323e+05   1.067    0.311
## avgsfperapt  3.380e+04  1.767e+05   0.191    0.852
## dolpersf     6.164e+05  5.708e+05   1.080    0.306
## dolperapt    1.797e+01  4.928e+02   0.036    0.972
## 
## Residual standard error: 26320000 on 10 degrees of freedom
## Multiple R-squared:  0.9711, Adjusted R-squared:  0.9567 
## F-statistic: 67.28 on 5 and 10 DF,  p-value: 2.268e-07

lm.relevant <- lm(totalprojectcost ~ gea + numapt + avgsfperapt, data = data)
summary(lm.relevant)

## 
## Call:
## lm(formula = totalprojectcost ~ gea + numapt + avgsfperapt, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -109146064  -47016722     660641   45273595  106533597 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.708e+08  2.784e+08  -0.973    0.350
## gea         -2.190e+02  4.779e+02  -0.458    0.655
## numapt       6.950e+05  5.656e+05   1.229    0.243
## avgsfperapt  1.982e+05  2.298e+05   0.863    0.405
## 
## Residual standard error: 67950000 on 12 degrees of freedom
## Multiple R-squared:  0.7691, Adjusted R-squared:  0.7114 
## F-statistic: 13.32 on 3 and 12 DF,  p-value: 0.0003981
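
One plausible reason, not examined in the original analysis, why the overall F-test can be significant while no individual coefficient is: the size predictors carry overlapping information. A quick check, assuming the gea and numapt columns from the table, is their correlation.

```r
# Columns copied from the project table
gea <- c(636074, 477246, 703184, 415586, 221313, 258300, 864868, 354622,
         310053, 1146366, 483822, 734312, 1195313, 410239, 855541, 917783)
numapt <- c(650, 498, 395, 394, 189, 184, 835, 367,
            345, 820, 584, 714, 1028, 394, 800, 798)

# Pearson correlation between building area and apartment count
r <- cor(gea, numapt)
round(r, 2)
```

A value close to 1 would mean the two predictors are nearly interchangeable, which inflates the standard errors of their coefficients.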

The model above uses the variables as they are, but we see that none of the predictors are significant. We note that the ranges of gea and totalprojectcost are quite wide, and larger projects can be expected to be more expensive as roof and amenity area grow. For this reason, it may be justified to use a log transformation on these variables.

The summary below reveals an increase in adjusted R-squared (a measure of the percentage of variance explained) from 71% to nearly 77%. However, only the intercept is significant at the 5% level. Since we are dealing with only a few predictors, we proceed by dropping the ones we expect to contain the least information. Area (gea) is believed to be more important than numapt, so numapt is dropped; avgsfperapt is simply derived from the existing variables, so it is dropped as well.

lm.1a <- lm(log(totalprojectcost) ~ log(gea)+numapt+avgsfperapt, data = data)
summary(lm.1a)

## 
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea) + numapt + avgsfperapt, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40019 -0.11974 -0.04331  0.13593  0.52913 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 21.4765163  7.8185445   2.747   0.0177 *
## log(gea)    -0.3766213  0.6857076  -0.549   0.5929  
## numapt       0.0028118  0.0014004   2.008   0.0677 .
## avgsfperapt  0.0008023  0.0005347   1.500   0.1593  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.281 on 12 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.7668 
## F-statistic: 17.44 on 3 and 12 DF,  p-value: 0.0001132

We end up with a statistically significant model with the single predictor log(gea), whose coefficient is 0.965. Since both the independent and dependent variables were log-transformed, we can interpret the coefficient as follows: for every 1% increase in gea, totalprojectcost increases by nearly 1% (0.965%). For every 10% increase in gea, the project cost increases by ((1.10)^0.965 - 1) * 100 = 9.63%.
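
The percentage interpretation can be checked numerically. This is a small sketch: pct_change is a hypothetical helper name, and 0.965 is the log(gea) coefficient quoted above.

```r
# Percent change in cost implied by a log-log slope `beta`
# when the predictor increases by `pct` percent
pct_change <- function(beta, pct) {
  ((1 + pct / 100)^beta - 1) * 100
}

beta <- 0.965
pct_change(beta, 1)   # effect of a 1% increase in gea
pct_change(beta, 10)  # effect of a 10% increase in gea
```

For small changes the result is close to beta times the percentage; for larger changes the exact formula should be used.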

lm.1b <- lm(log(totalprojectcost) ~ log(gea), data = data)
summary(lm.1b)

## 
## Call:
## lm(formula = log(totalprojectcost) ~ log(gea), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45538 -0.22170  0.04466  0.26027  0.39226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.2172     1.9697   3.156    0.007 ** 
## log(gea)      0.9650     0.1489   6.482 1.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3011 on 14 degrees of freedom
## Multiple R-squared:  0.7501, Adjusted R-squared:  0.7322 
## F-statistic: 42.02 on 1 and 14 DF,  p-value: 1.443e-05
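
To illustrate how the fitted model might be used, the coefficients printed above can be plugged in by hand. This is a sketch only: predict_cost is a hypothetical helper, the 500,000 SF building is a made-up input, and exponentiating the log-scale fit gives a rough point estimate that ignores retransformation bias and prediction uncertainty.

```r
# Coefficients copied from the lm.1b summary above
b0 <- 6.2172   # intercept
b1 <- 0.9650   # slope on log(gea)

# Back-transform the log-scale prediction to dollars
predict_cost <- function(gea) {
  exp(b0 + b1 * log(gea))
}

predict_cost(500000)  # hypothetical 500,000 SF building
```

With the fitted object available, predict(lm.1b, newdata = data.frame(gea = 500000), interval = "prediction") would return the same log-scale estimate along with an interval, which can then be exponentiated.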

The information that can be derived from this kind of data could be of great value to developers. In this example, we derived some basic insight from a simple linear regression model, useful for explanation purposes. Augmenting the data with more predictors would allow building more advanced models, which could then be used for prediction.

DATA621 Blog 1

Mael Illien

11/8/2020
