ECON30001/ECOM90001 Tutorial 2 The basic regression model in R

Introduction

This tutorial reviews some basic operations using the econometrics software package R that we will be using in this subject. Specifically, this tutorial reviews:

running an OLS regression in R
plotting actual data values and fitted values
running an OLS regression on a sub-sample of the data in R
some simple data transformations (natural logarithms)
calculation of marginal effects

This tutorial requires one data file:

houseprices_2017.csv

This file can be obtained from the Canvas subject page. In addition the R script file tut2.R provides the program code necessary to complete the tutorial. This R script file uses the following packages which need to be installed prior to running the R script file:

ggplot2 :	for creating graphs and plots in R
stargazer :	for easily generating summary statistics for an R data file
scales :	for displaying thousands with commas in graphs in R
rio :	for easily importing data into R

These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets. Please feel free to play around with the code, particularly the plotting commands for ggplot2() and the table commands for stargazer().

Capstone Project and Final Exam

Please remember that the tutorial material required for the Capstone Project and the Final Exam could differ. For example, you will be required to use R code in your project however R code is not examinable. You will need to interpret the output generated from R code for both the project and the exam.

Question

Download the data file houseprices_2017, from the Canvas page. This file contains data contains on the selling prices of houses in metropolitan Melbourne during the 2016 calendar year. There are several variables of interest:

price =	Selling price, thousands of dollars
distance =	Distance from the C.B.D.,in kilometres
bld_area =	Dwelling size, metres squared
landsize =	Land size, metres squared

large = {\begin{cases} 1 & property on a large lot, land size \geq 650 square metres \\ 0 & property not on a large lot, land size < 650 square metres \end{cases}

(a)

Consider the following econometric model:

$\begin{matrix} (1) & {price}_{i} = β_{0} + β_{1} {bld_area}_{i} + ε_{i} \end{matrix}$

Important

this is the Population Regression Function (PRF) comprising the deterministic part of the population model $Y_{i} = β_{0} + β_{1} X_{i}$ plus the error term, $ε_{i}$

What is the interpretation of the parameter $β_{0}$ ?
What is the interpretation of the parameter $β_{1}$ ?

Solution

The parameter $β_{0}$ represents the mean selling price for a house with a building area of zero. It would represent the value of the land only. However, if the data do not include any properties with zero building area this parameter will not be estimated precisely (out of sample prediction).

The parameter $β_{1}$ represents the marginal effect of an additional square metre of building area on the mean selling price.

(b)

Estimate this model in R and provide a brief description of the point estimates.
Produce a scatter plot of both price and the fitted values against bld_area.
Comment on how well the estimated model fits the data.

Solution

Load the required packages (make sure these are installed first) and import the data

#----------------------------------------
options(scipen=999)               # Do not use scientific notation
library(ggplot2)                  # Flexible graphic facility for R
library(stargazer)                # For producing good quality output from R
library(scales)                   # package to display thousands with commas in graphs
library(rio)
#---------------------------------------
# (1) Import Data File from csv and Save as R Data File
tut2 <- import("houseprices_2017.csv")

then run equation (1)

reg1 <- lm(price ~ bld_area, data=tut2)        # estimate model by OLS, save as reg1
print(summary(reg1), digits=3)                 # print the results on screen


Call:
lm(formula = price ~ bld_area, data = tut2)

Residuals:
   Min     1Q Median     3Q    Max 
 -2603   -403   -131    280   3496 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  685.836     27.727    24.7 <0.0000000000000002 ***
bld_area       2.674      0.152    17.6 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 565 on 2219 degrees of freedom
Multiple R-squared:  0.122, Adjusted R-squared:  0.122 
F-statistic:  308 on 1 and 2219 DF,  p-value: <0.0000000000000002

RSS1 <- deviance(reg1)                            # save the RSS for the model
print(RSS1)                                       # print the RSS on the screen

[1] 709209196

resids1 <- reg1$residuals                         # create a series for the OLS residuals
yhat1 <- reg1$fitted.values                       # create a series for the fitted values

to get a better presentation of the results use stargazer

Important

this is the Sample Regression Function (SRF) - there is no error term but now we obtain the OLS residual which are the difference between the actual $Y_{i}$ and the fitted $Y_{i}$ values e.g. ${\hat{ε}}_{i} = Y_{i} - {\hat{Y}}_{i}$ .

In the SRF, if the model contains an intercept term, then the residuals will always sum to zero. This is an algebraic property of the OLS estimators.

You should note that this is different to the assumption that $E (ε_{i} | X) = 0$ in the PRF.

stargazer(reg1, type="html", dep.var.labels=c("Selling Price ($'000s)"),
          covariate.labels=c("Intercept","Building Area (msq)"),
          digits=4,
          single.row=FALSE,
          intercept.bottom=FALSE,
          out= "tut2_reg1.htm")

**Figure 1: OLS Estimation Results Part (b)**

	Dependent variable:

	Selling Price (’000s)

Intercept	685.8360^***
	(27.7273)

Building Area (msq)	2.6743^***
	(0.1523)


Observations	2,221
R²	0.1220
Adjusted R²	0.1216
Residual Std. Error	565.3385 (df = 2219)
F Statistic	308.4236^*** (df = 1; 2219)

Note:	p<0.1; p<0.05; p<0.01

Interpretation of Results

The estimated coefficient for $β_{0}$ is $685.8360$ which implies that the average value of land alone is $$ 685, 836$ .
The estimated coefficient for $β_{1}$ implies that the average selling price increases by $$ 2, 674.30$ for each additional square metre of building area.
Recall that the dependent variable is measured in thousands of dollars.

Note

note the units used to measure the respective variables when interpreting regression results in your Project, as well as the Final Exam.

Scatter plot and interpretation

Run the following R code chunk to produce the required scatter plot.

ggplot(tut2, aes(x=bld_area)) + 
geom_point(aes(y=price, colour="Actual Data"), size=0.8) +
geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), linewidth=0.6) +  
labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("", 
                    breaks = c("Actual Data", "Fitted Values: Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(axis.text=element_text(size=6), axis.title=element_text(size=6),
  legend.text=element_text(size=6))
ggsave("tut2_graph1.pdf", width = 297, height = 210, units = "mm")

The scatter plot presented in Figure 2 shows that the estimated model appears to adequately fit the data for properties with relatively smaller building areas. However, it tends to considerably ‘under-predict’ selling prices for properties with relatively smaller building areas that sell for relatively large prices.
Additionally, it also tends to ‘over-predict’ selling prices for some properties with relatively larger building areas.

Discussion

Ultimately, there are likely several other factors, beyond just building area, that determine selling prices. These factors have been collected in the random error of the econometric model [equation (1)].
Some important variables might be the distance from the C.B.D, the age of the dwelling, the characteristics of the dwelling (such as the number of bedrooms, number of bathrooms etc. , $\dots$ ), quality of local schools, and proximity to local amenities.
Moreover, at least some of these omitted factors are also likely related to building area.
We will be studying omitted variables and how they affect the estimated parameters of econometric models later in this subject.

(c)

Tip

Two changes here:

we are now using a Multiple Regression Model (MRM), and
the functional form has changed

Note, the difference in interpreting the marginal effects .

In this case, compared to the Simple Linear Regression Model (SLRM) in part (a), they are no longer constant.

Consider the following econometric model:

$\begin{matrix} (2) & {price}_{i} = β_{0} + β_{1} {bld_area}_{i} + β_{2} {bld_area}_{i}^{2} + ε_{i} \end{matrix}$

What is the marginal (or partial) effect of an additional square metre of dwelling size (bld_area) on the selling price?
Estimate this equation in R.
What is the estimated marginal effect of an additional square metre of dwelling size for a home with 300 square metres of building area?

Hint: You will need to generate a new variable representing the squared value of the bld_area variable .

Produce a scatter plot of both price and the fitted values against bld_area.

On the same graph, produce a line plot of the fitted values for the linear model (from part b) and the quadratic model (from part c).
Based upon a visual inspection of the fitted values, which model do you think fits the data better?
Why?

Compare the Sum of Squared Residuals (RSS) for the two models.
Which is smaller?

Based upon the value for the RSS, which model fits the data better?

Solution

For the quadratic model

${price}_{i} = β_{0} + β_{1} {bld_area}_{i} + β_{2} {bld_area}_{i}^{2} + ε_{i}$

The marginal effect is given by:

$\frac{\partial E [price]}{\partial bld_area} = β_{1} + β_{2} bld_area$

From the hint, you need to generate a new variable:

# (3) Extended Model with Building Area Squared
#     create bld_area squared variable
tut2$bld_area2 = tut2$bld_area^2

Now run the regression

# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Living area, Living area squared
reg2 <- lm(price ~ bld_area + bld_area2, data=tut2)       # estimate model by OLS, save as reg2
print(summary(reg2, digits=3))


Call:
lm(formula = price ~ bld_area + bld_area2, data = tut2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1402.4  -390.1  -106.4   293.7  3398.6 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 287.891616  47.691552   6.037        0.00000000184 ***
bld_area      6.419578   0.398159  16.123 < 0.0000000000000002 ***
bld_area2    -0.006542   0.000645 -10.142 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 552.8 on 2218 degrees of freedom
Multiple R-squared:  0.1609,    Adjusted R-squared:  0.1602 
F-statistic: 212.7 on 2 and 2218 DF,  p-value: < 0.00000000000000022

reg2a <- lm(price ~ bld_area +  I(bld_area^2), data=tut2)

Again use stargazer to present the results

stargazer(reg1, reg2, type="html", dep.var.labels=c("Selling Price (&#36; 000's)"),
          covariate.labels=c("Intercept","Building Area" ,"Building Area (Squared)"),
          digits=4,
          single.row=FALSE,
          intercept.bottom=FALSE,
          out= "tut2_reg2.htm")

**Figure 3: OLS Estimation Results Part (c)**

	Dependent variable:

	Selling Price ($ 000’s)
	(1)	(2)

Intercept	685.8360^***	287.8916^***
	(27.7273)	(47.6916)

Building Area	2.6743^***	6.4196^***
	(0.1523)	(0.3982)

Building Area (Squared)		-0.0065^***
		(0.0006)


Observations	2,221	2,221
R²	0.1220	0.1609
Adjusted R²	0.1216	0.1602
Residual Std. Error	565.3385 (df = 2219)	552.7921 (df = 2218)
F Statistic	308.4236^*** (df = 1; 2219)	212.7259^*** (df = 2; 2218)

Note:	p<0.1; p<0.05; p<0.01

The estimation results in Figure 3 can be used to estimate the marginal effect for a dwelling with 300 square metres of building area as

$b_{1} + 2 b_{2} bld_area = 6.419578 + (2 * - 0.006542 * 300) = $ 2.4945$

In this quadratic model, an additional square metre of dwelling space for a house with 300 square metres of dwelling area is estimated to increase the sales price by $$ 2, 495$ . Compare this to the estimated effect in the linear model in part (b) of $$ 2, 674$ (which restricts the marginal effect to be same regardless of the dwelling area). Note that for properties with 300 square metres of dwelling area, the estimated marginal effect for model 2 is remarkably close to the estimated marginal effect for model 1.

To generate plots of the fitted values versus the actual values for Models 1 and 2 use the following

# fitted values from Model 1
yhat1 <- reg1$fitted.values # create a series for the fitted values 
tut2$yhat1 <- reg1$fitted.values   
# fitted values from Model 2
yhat2 <- reg2$fitted.values   # create a series for the fitted values

ggplot(tut2, aes(x=bld_area)) +
  geom_point(aes(y=price, colour = "Actual Data")) +
  geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), size=1) +
  geom_line(aes(y=yhat2, colour="Fitted Values: Quadratic Model"), size=1) +
  labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("",
                      breaks = c("Actual Data", "Fitted Values: Linear Model",
                                 "Fitted Values: Quadratic Model"),
                      values = c("blue", "red", "darkgreen")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(axis.text=element_text(size=12), axis.title=element_text(size=12))

Aside:

Is this likely a “causal” effect? Is it likely that for houses with sufficiently large dwelling areas, an additional square metre of dwelling area is estimated to reduce the selling price? In our simple model, it is likely that this is not a `causal’ effect. Why?

Outliers: There are only a few observations for houses with a large building area and relatively low selling prices. It is feasible that these observations are not really representative of the population of houses sold in Melbourne.
Omitted Variables: The econometric model (2) only relates selling prices to the dwelling area.
There are likely omitted variables, that are related to the dwelling area, that also affect the selling prices. Effectively, the estimated negative relationship between dwelling area and price for large dwellings, really reflects the effects of these omitted characteristics. For example, houses with a larger dwelling area will generally be located in different areas to houses with a smaller dwelling area and these location characteristics might be important determinants of prices. For example, houses with a larger dwelling area tend to located further from the C.B.D and it is this characteristic that is associated with lower prices.

We will be exploring these issues throughout the subject.

Figure 4 confirms that the fitted lines for model 1 and model 2 are quite close to each other, at a building area of 300 square metres.

Note that the estimate of $b_{2} = - 0.006542$ . This implies that the estimated relationship between selling prices and building area is an `inverted u-shape’.

For houses with sufficiently large dwelling areas, an additional square metre of dwelling area is estimated to reduce the selling price.

It appears that the quadratic model fits the data slightly better - it is slightly better at capturing the lower selling prices for houses with a larger building area.
However, it still tends to `under-predict’ selling prices for properties with relatively smaller living areas that sell for relatively large prices.

# Obtain Residual Sum of Squares (RSS)
RSS1 <- deviance(reg1)   # save the RSS for  model 1
# Obtain Residual Sum of Squares (RSS)
RSS2 <- deviance(reg2)  # save the RSS for  model 2
# print the RSS on the screen
print(RSS1)
## [1] 709209196
print(RSS2)
## [1] 677774520

The RSS for the linear model (model 1) is $709, 209, 196$ while for the quadratic model (model 2) it is $677, 774, 520$ .
At first glance, the minimised value of the sum of squared residuals appears lower for the quadratic model so it is tempting to conclude that the quadratic model fits the data better.
This is also confirmed by looking at the $R^{2}$ reported in the estimation output.
For the linear model in part b), the $R^{2}$ is 0.1220 while for the quadratic model the $R^{2}$ is 0.1609.
However, since the quadratic model includes an additional explanatory variable (compared) to the linear model, the RSS must necessarily be lower (and the $R^{2}$ higher) for this model.

(d)

Estimate the econometric model 2, restricting the sample to houses that are on large lots.

Now repeat the estimation for houses not on large lots.
Comment on how the estimations differ.

Hint: You will need to restrict the samples using the variable large.

Solution

Run the following R code chunk; note that we have now subsetted the data by including ,data=subset(tut2, large==1)in our regression command.

# Explanatory variable: Building area, Building area squared
reg3 <- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==1))
RSS3 <- deviance(reg3)      # save the RSS for the model
resids3 <- reg3$residuals   # create a series for the OLS residuals
yhat3 <- reg3$fitted.values # create a series for the fitted values

Now, estimate a model for houses not on large lots, include ,data=subset(tut2, large==0) here.

reg4 <- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==0))
RSS4 <- deviance(reg4)      # save the RSS for the model
resids4 <- reg4$residuals   # create a series for the OLS residuals
yhat4 <- reg4$fitted.values # create a series for the fitted values

Use stargazer to report the regression results

# Stargazer package produces `print-ready` output
stargazer(reg3, reg4, type="html", dep.var.labels=c("Selling Price (&#36; 000's)"),
          covariate.labels=c("Intercept","Building Area", "Building Area (Squared)" ),
          column.labels = c("Large Lots", "Small Lots"),
          digits=4,
          single.row=FALSE,
          intercept.bottom=FALSE,
          out= "tut2_reg3.htm")

**Figure 5: OLS Estimation Results by Large/Small Lots**

	Dependent variable:

	Selling Price ($ 000’s)
	Large Lots	Small Lots
	(1)	(2)

Intercept	10.0514	393.9105^***
	(98.6141)	(54.9578)

Building Area	8.3311^***	5.6890^***
	(0.7475)	(0.4832)

Building Area (Squared)	-0.0083^***	-0.0064^***
	(0.0011)	(0.0008)


Observations	708	1,513
R²	0.2265	0.1167
Adjusted R²	0.2243	0.1155
Residual Std. Error	606.3042 (df = 705)	521.6389 (df = 1510)
F Statistic	103.2194^*** (df = 2; 705)	99.7114^*** (df = 2; 1510)

Note:	p<0.1; p<0.05; p<0.01

The estimation results are presented in Figure 5..
The estimated marginal effect of an additional square metre of dwelling area for a house with 300 square metres of dwelling area is $$ 3, 345.37$ for large lots and $$ 1, 854.77$ for smaller lots.

For larger lots:

$b_{1} + 2 b_{2} bld_area = 8.331051 + (2 * 300 * - 0.008309) = 3.345367$

and for smaller lots:

$b_{1} + 2 b_{2} bld_area = 5.689778 + (2 * 300 * - 0.0063903) = 1.854769$

For dwellings with a building area of 300 square metres, the marginal effect of an additional square metre of dwelling area on selling price is greater for properties on larger lots.
This possibly reflects a preference for yard space. For smaller lots, an additional square metre of dwelling size substantially reduces the available yard space.
For larger lots there is not as large a reduction in yard space so buyers are prepared to pay more for the same square metre increase.

Tip

We are back to running two Simple Linear Regression Models (SLRMs) however Model II has a different functional form than Model I because the dependant variable is logged (e,g, we have a log-linear model).

Again, the interpretation of the marginal effects changes when logs are involved (more on this in upcoming tutorials).

Also note log(price) is the log to base e (not base 10) in R

(e)

Consider the following econometric models:

$\begin{matrix} (1) & {price}_{i} = β_{0} + β_{1} {distance}_{i} + ε_{i} \end{matrix}$

and

$\begin{matrix} (2) & {lnprice}_{i} = β_{0} + β_{1} d i s t a n c e_{i} + ε_{i} \end{matrix}$

where lnprice represents the natural log of the variable price.

Estimate Model I in R.

Produce a scatter plot of price against distance and a line plot of the fitted values from Model I.

Now generate a new variable lnprice as the natural log of the variable price.

Estimate Model II in R.

Produce a scatter plot of lnprice against distance and a line plot of the fitted values from Model II.

Compare the scatter plots for each model (Model I and Model II).
Which estimated model do you think fits the data better? Why?

Solution

Generate a variable for the log of selling price

# (9) Create (log) selling price variable
tut2$lnprice = log(tut2$price)    # generate log(price) variable

Now run the two models using the following R code chunk provided

# (7) OLS Regression with Distance variable
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Distance, in kms
reg5 <- lm(price ~ distance, data=tut2)   # Estimate Model I by OLS
RSS5 <- deviance(reg5)                  # save the RSS for the model
print(RSS5)                            # print the RSS on the screen
## [1] 618782879
resids5 <- reg5$residuals   # create a series for the OLS residuals
yhat5 <- reg5$fitted.values   # create a series for the fitted values
#  (10) OLS Regression with lnprice variable
# Dependent variable: (Log) Sale price, in thousands dollars
# Explanatory variable: Distance
reg6 <- lm(lnprice ~ distance, data=tut2)  # Estimate Model II by OLS
RSS6 <- deviance(reg6)              # save the RSS for the model
print(RSS6)                         # Print the RSS to screen
## [1] 346.2655
resids6 <- reg6$residuals   # create a series for the OLS residuals
yhat6 <- reg6$fitted.values # create a series for the fitted values

Then use stargazer to report the results

# Stargazer package produces `print-ready` output
stargazer(reg5, reg6, type="html", 
          dep.var.labels=c("Selling Price (&#36; 000's)", "(Log) Selling Price"),
          covariate.labels=c("Intercept","Distance" ),
          digits=4,
          single.row=FALSE,
          intercept.bottom=FALSE,
          out= "tut2_reg4.htm")

**Figure 6: OLS Estimation Results: Part (e)**

	Dependent variable:

	Selling Price ($ 000’s)	(Log) Selling Price
	(1)	(2)

Intercept	1,636.5080^***	7.3795^***
	(22.6233)	(0.0169)

Distance	-36.3471^***	-0.0337^***
	(1.3961)	(0.0010)


Observations	2,221	2,221
R²	0.2340	0.3189
Adjusted R²	0.2336	0.3186
Residual Std. Error (df = 2219)	528.0688	0.3950
F Statistic (df = 1; 2219)	677.7707^***	1,038.9910^***

Note:	p<0.1; p<0.05; p<0.01

For Model 1, the mean selling price declines by $$ 36, 347.10$ for each additional kilometre from the C.B.D. The estimated coefficient for the intercept implies that the average price of land alone in the C.B.D. (with a zero distance)
$$ 1, 636, 508$ .

For model 2, The mean selling prices declines by 3.37% for each additional kilometre from the C.B.D. The estimated coefficient for the intercept implies that the average price of land alone in the C.B.D. (with a zero distance) is:

$1000 * e x p (b_{0}) = 1000 * e x p (7.3795) = $ 1, 602, 788$

Later in this subject, we will be studying how to interpret estimates in econometric models involving different functional forms, such as natural logarithms).

Now run the following R code chunk to examine the fitted values from both models.

ggplot(tut2, aes(x=distance)) +
  geom_point(aes(y=price, colour="Actual Data")) +
  geom_line(aes(y=yhat5, colour="Fitted Values: Linear Model"), size=1) +
  labs(x = "Distance from CBD, in kms", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("",
                      breaks = c("Actual Data", "Fitted Values: Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(axis.text=element_text(size=12), axis.title=element_text(size=12))
ggsave("tut2_graph3.pdf")  

ggplot(tut2, aes(x=distance)) +
  geom_point(aes(y=lnprice, colour="Actual Data")) +
  geom_line(aes(y=yhat6, colour="Fitted Values: Log-Linear Model"), size=1) +
  labs(x = "Distance from CBD, in kms", y = "(Log) Selling Price, in thousands  dollars") +
  scale_colour_manual("",
                      breaks = c("Actual Data", "Fitted Values: Log-Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(axis.text=element_text(size=12), axis.title=element_text(size=12))
ggsave("tut2_graph4.pdf")

The fitted values for Model I are presented in Figure 7 and those for Model II in Figure 8.

Comparing the two plots, Model II which uses the natural log of the selling prices appears to fit the data better. The under-prediction of selling prices, relative to the actual data, appears less of an issue in Model II.
As noted in Question 1(e) in Tutorial 1, taking logs reduces the scale in which a variable is measured.

Caution

The dependent variable in Model I is price while in model II it is lnprice.

Since they are not the same e.g. price $\neq$ lnprice , the Total Sum of Squares (TSS) will differ so we can’t use $R^{2}$ , nor ${\bar{R}}^{2}$ to compare the “fit” of the two models here.

Note, since the dependent variable in Model I is different to the dependent variable in Model II, it is not possible to use the $R^{2}$ for these two models to make any judgment about which model is better in terms of goodness of fit.