#----------------------------------------
options(scipen=999) # Do not use scientific notation
library(ggplot2) # Flexible graphic facility for R
library(stargazer) # For producing good quality output from R
library(scales) # package to display thousands with commas in graphs
library(rio)
#---------------------------------------
# (1) Import Data File from csv and Save as R Data File
<- import("houseprices_2017.csv") tut2
Introduction
This tutorial reviews some basic operations using the econometrics software package R that we will be using in this subject. Specifically, this tutorial reviews:
- running an OLS regression in R
- plotting actual data values and fitted values
- running an OLS regression on a sub-sample of the data in R
- some simple data transformations (natural logarithms)
- calculation of marginal effects
This tutorial requires one data file:
- houseprices_2017.csv
This file can be obtained from the Canvas subject page. In addition the R script file tut2.R provides the program code necessary to complete the tutorial. This R script file uses the following packages which need to be installed prior to running the R script file:
ggplot2 : | for creating graphs and plots in R |
stargazer : | for easily generating summary statistics for an R data file |
scales : | for displaying thousands with commas in graphs in R |
rio : | for easily importing data into R |
These can be installed directly in RStudio from the packages tab or by using the command install.packages()
and inserting the name of the package in the brackets. Please feel free to play around with the code, particularly the plotting commands for ggplot2()
and the table commands for stargazer()
.
Please remember that the tutorial material required for the Capstone Project and the Final Exam could differ.
For example, you will be required to use R code in your project however R code is not examinable.
You will need to interpret the output generated from R code for both the project and the exam.
Question
Download the data file houseprices_2017, from the Canvas page.
This file contains data contains on the selling prices of houses in metropolitan Melbourne during the 2016 calendar year. There are several variables of interest:
price = | Selling price, thousands of dollars |
distance = | Distance from the C.B.D.,in kilometres |
bld_area = | Dwelling size, metres squared |
landsize = | Land size, metres squared |
(a)
Consider the following econometric model:
this is the Population Regression Function (PRF) comprising the deterministic part of the population model
What is the interpretation of the parameter
What is the interpretation of the parameter
Solution
The parameter
The parameter
(b)
Estimate this model in R and provide a brief description of the point estimates.
Produce a scatter plot of both price
and the fitted values against bld_area
.
Comment on how well the estimated model fits the data.
Solution
Load the required packages (make sure these are installed first) and import the data
then run equation (1)
<- lm(price ~ bld_area, data=tut2) # estimate model by OLS, save as reg1
reg1 print(summary(reg1), digits=3) # print the results on screen
Call:
lm(formula = price ~ bld_area, data = tut2)
Residuals:
Min 1Q Median 3Q Max
-2603 -403 -131 280 3496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 685.836 27.727 24.7 <0.0000000000000002 ***
bld_area 2.674 0.152 17.6 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 565 on 2219 degrees of freedom
Multiple R-squared: 0.122, Adjusted R-squared: 0.122
F-statistic: 308 on 1 and 2219 DF, p-value: <0.0000000000000002
<- deviance(reg1) # save the RSS for the model
RSS1 print(RSS1) # print the RSS on the screen
[1] 709209196
<- reg1$residuals # create a series for the OLS residuals
resids1 <- reg1$fitted.values # create a series for the fitted values yhat1
to get a better presentation of the results use stargazer
this is the Sample Regression Function (SRF) - there is no error term but now we obtain the OLS residual which are the difference between the actual
In the SRF, if the model contains an intercept term, then the residuals will always sum to zero. This is an algebraic property of the OLS estimators.
You should note that this is different to the assumption that
stargazer(reg1, type="html", dep.var.labels=c("Selling Price ($'000s)"),
covariate.labels=c("Intercept","Building Area (msq)"),
digits=4,
single.row=FALSE,
intercept.bottom=FALSE,
out= "tut2_reg1.htm")
Dependent variable: | |
Selling Price (’000s) | |
Intercept | 685.8360*** |
(27.7273) | |
Building Area (msq) | 2.6743*** |
(0.1523) | |
Observations | 2,221 |
R2 | 0.1220 |
Adjusted R2 | 0.1216 |
Residual Std. Error | 565.3385 (df = 2219) |
F Statistic | 308.4236*** (df = 1; 2219) |
Note: | p<0.1; p<0.05; p<0.01 |
Interpretation of Results
The estimated coefficient for
The estimated coefficient for
Recall that the dependent variable is measured in thousands of dollars.
note the units used to measure the respective variables when interpreting regression results in your Project, as well as the Final Exam.
Scatter plot and interpretation
Run the following R code chunk to produce the required scatter plot.
ggplot(tut2, aes(x=bld_area)) +
geom_point(aes(y=price, colour="Actual Data"), size=0.8) +
geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), linewidth=0.6) +
labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
scale_colour_manual("",
breaks = c("Actual Data", "Fitted Values: Linear Model"),
values = c("blue", "red")) +
scale_y_continuous(labels = comma) +
theme_classic() +
theme(axis.text=element_text(size=6), axis.title=element_text(size=6),
legend.text=element_text(size=6))
ggsave("tut2_graph1.pdf", width = 297, height = 210, units = "mm")
The scatter plot presented in Figure 2 shows that the estimated model appears to adequately fit the data for properties with relatively smaller building areas. However, it tends to considerably ‘under-predict’ selling prices for properties with relatively smaller building areas that sell for relatively large prices.
Additionally, it also tends to ‘over-predict’ selling prices for some properties with relatively larger building areas.
Discussion
Ultimately, there are likely several other factors, beyond just building area, that determine selling prices. These factors have been collected in the random error of the econometric model [equation (1)].
Some important variables might be the distance from the C.B.D, the age of the dwelling, the characteristics of the dwelling (such as the number of bedrooms, number of bathrooms etc. ,
Moreover, at least some of these omitted factors are also likely related to building area.
We will be studying omitted variables and how they affect the estimated parameters of econometric models later in this subject.
(c)
Two changes here:
- we are now using a Multiple Regression Model (MRM), and
- the functional form has changed
Note, the difference in interpreting the marginal effects .
In this case, compared to the Simple Linear Regression Model (SLRM) in part (a), they are no longer constant.
Consider the following econometric model:
What is the marginal (or partial) effect of an additional square metre of dwelling size (bld_area
) on the selling price?
Estimate this equation in R.
What is the estimated marginal effect of an additional square metre of dwelling size for a home with 300 square metres of building area?
Hint: You will need to generate a new variable representing the squared value of the bld_area
variable .
Produce a scatter plot of both price
and the fitted values against bld_area
.
On the same graph, produce a line plot of the fitted values for the linear model (from part b) and the quadratic model (from part c).
Based upon a visual inspection of the fitted values, which model do you think fits the data better?
Why?
Compare the Sum of Squared Residuals (RSS) for the two models.
Which is smaller?
Based upon the value for the RSS, which model fits the data better?
Solution
For the quadratic model
The marginal effect is given by:
From the hint, you need to generate a new variable:
# (3) Extended Model with Building Area Squared
# create bld_area squared variable
$bld_area2 = tut2$bld_area^2 tut2
Now run the regression
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Living area, Living area squared
<- lm(price ~ bld_area + bld_area2, data=tut2) # estimate model by OLS, save as reg2
reg2 print(summary(reg2, digits=3))
Call:
lm(formula = price ~ bld_area + bld_area2, data = tut2)
Residuals:
Min 1Q Median 3Q Max
-1402.4 -390.1 -106.4 293.7 3398.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 287.891616 47.691552 6.037 0.00000000184 ***
bld_area 6.419578 0.398159 16.123 < 0.0000000000000002 ***
bld_area2 -0.006542 0.000645 -10.142 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 552.8 on 2218 degrees of freedom
Multiple R-squared: 0.1609, Adjusted R-squared: 0.1602
F-statistic: 212.7 on 2 and 2218 DF, p-value: < 0.00000000000000022
<- lm(price ~ bld_area + I(bld_area^2), data=tut2) reg2a
Again use stargazer to present the results
stargazer(reg1, reg2, type="html", dep.var.labels=c("Selling Price ($ 000's)"),
covariate.labels=c("Intercept","Building Area" ,"Building Area (Squared)"),
digits=4,
single.row=FALSE,
intercept.bottom=FALSE,
out= "tut2_reg2.htm")
Dependent variable: | ||
Selling Price ($ 000’s) | ||
(1) | (2) | |
Intercept | 685.8360*** | 287.8916*** |
(27.7273) | (47.6916) | |
Building Area | 2.6743*** | 6.4196*** |
(0.1523) | (0.3982) | |
Building Area (Squared) | -0.0065*** | |
(0.0006) | ||
Observations | 2,221 | 2,221 |
R2 | 0.1220 | 0.1609 |
Adjusted R2 | 0.1216 | 0.1602 |
Residual Std. Error | 565.3385 (df = 2219) | 552.7921 (df = 2218) |
F Statistic | 308.4236*** (df = 1; 2219) | 212.7259*** (df = 2; 2218) |
Note: | p<0.1; p<0.05; p<0.01 |
The estimation results in Figure 3 can be used to estimate the marginal effect for a dwelling with 300 square metres of building area as
In this quadratic model, an additional square metre of dwelling space for a house with 300 square metres of dwelling area is estimated to increase the sales price by
To generate plots of the fitted values versus the actual values for Models 1 and 2 use the following
# fitted values from Model 1
<- reg1$fitted.values # create a series for the fitted values
yhat1 $yhat1 <- reg1$fitted.values
tut2# fitted values from Model 2
<- reg2$fitted.values # create a series for the fitted values
yhat2
ggplot(tut2, aes(x=bld_area)) +
geom_point(aes(y=price, colour = "Actual Data")) +
geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), size=1) +
geom_line(aes(y=yhat2, colour="Fitted Values: Quadratic Model"), size=1) +
labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
scale_colour_manual("",
breaks = c("Actual Data", "Fitted Values: Linear Model",
"Fitted Values: Quadratic Model"),
values = c("blue", "red", "darkgreen")) +
scale_y_continuous(labels = comma) +
theme_classic() +
theme(axis.text=element_text(size=12), axis.title=element_text(size=12))
Aside:
Is this likely a “causal” effect? Is it likely that for houses with sufficiently large dwelling areas, an additional square metre of dwelling area is estimated to reduce the selling price? In our simple model, it is likely that this is not a `causal’ effect. Why?
Outliers: There are only a few observations for houses with a large building area and relatively low selling prices. It is feasible that these observations are not really representative of the population of houses sold in Melbourne.
Omitted Variables: The econometric model (2) only relates selling prices to the dwelling area.
There are likely omitted variables, that are related to the dwelling area, that also affect the selling prices. Effectively, the estimated negative relationship between dwelling area and price for large dwellings, really reflects the effects of these omitted characteristics. For example, houses with a larger dwelling area will generally be located in different areas to houses with a smaller dwelling area and these location characteristics might be important determinants of prices. For example, houses with a larger dwelling area tend to located further from the C.B.D and it is this characteristic that is associated with lower prices.
We will be exploring these issues throughout the subject.
Figure 4 confirms that the fitted lines for model 1 and model 2 are quite close to each other, at a building area of 300 square metres.
Note that the estimate of
For houses with sufficiently large dwelling areas, an additional square metre of dwelling area is estimated to reduce the selling price.
It appears that the quadratic model fits the data slightly better - it is slightly better at capturing the lower selling prices for houses with a larger building area.
However, it still tends to `under-predict’ selling prices for properties with relatively smaller living areas that sell for relatively large prices.
# Obtain Residual Sum of Squares (RSS)
<- deviance(reg1) # save the RSS for model 1
RSS1 # Obtain Residual Sum of Squares (RSS)
<- deviance(reg2) # save the RSS for model 2
RSS2 # print the RSS on the screen
print(RSS1)
## [1] 709209196
print(RSS2)
## [1] 677774520
The RSS for the linear model (model 1) is
At first glance, the minimised value of the sum of squared residuals appears lower for the quadratic model so it is tempting to conclude that the quadratic model fits the data better.
This is also confirmed by looking at the
For the linear model in part b), the
However, since the quadratic model includes an additional explanatory variable (compared) to the linear model, the RSS must necessarily be lower (and the
(d)
Estimate the econometric model 2, restricting the sample to houses that are on large lots.
Now repeat the estimation for houses not on large lots.
Comment on how the estimations differ.
Hint: You will need to restrict the samples using the variable large
.
Solution
Run the following R code chunk; note that we have now subsetted the data by including ,data=subset(tut2, large==1)
in our regression command.
# Explanatory variable: Building area, Building area squared
<- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==1))
reg3 <- deviance(reg3) # save the RSS for the model
RSS3 <- reg3$residuals # create a series for the OLS residuals
resids3 <- reg3$fitted.values # create a series for the fitted values yhat3
Now, estimate a model for houses not on large lots, include ,data=subset(tut2, large==0)
here.
<- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==0))
reg4 <- deviance(reg4) # save the RSS for the model
RSS4 <- reg4$residuals # create a series for the OLS residuals
resids4 <- reg4$fitted.values # create a series for the fitted values yhat4
Use stargazer to report the regression results
# Stargazer package produces `print-ready` output
stargazer(reg3, reg4, type="html", dep.var.labels=c("Selling Price ($ 000's)"),
covariate.labels=c("Intercept","Building Area", "Building Area (Squared)" ),
column.labels = c("Large Lots", "Small Lots"),
digits=4,
single.row=FALSE,
intercept.bottom=FALSE,
out= "tut2_reg3.htm")
Dependent variable: | ||
Selling Price ($ 000’s) | ||
Large Lots | Small Lots | |
(1) | (2) | |
Intercept | 10.0514 | 393.9105*** |
(98.6141) | (54.9578) | |
Building Area | 8.3311*** | 5.6890*** |
(0.7475) | (0.4832) | |
Building Area (Squared) | -0.0083*** | -0.0064*** |
(0.0011) | (0.0008) | |
Observations | 708 | 1,513 |
R2 | 0.2265 | 0.1167 |
Adjusted R2 | 0.2243 | 0.1155 |
Residual Std. Error | 606.3042 (df = 705) | 521.6389 (df = 1510) |
F Statistic | 103.2194*** (df = 2; 705) | 99.7114*** (df = 2; 1510) |
Note: | p<0.1; p<0.05; p<0.01 |
The estimation results are presented in Figure 5..
The estimated marginal effect of an additional square metre of dwelling area for a house with 300 square metres of dwelling area is
For larger lots:
and for smaller lots:
For dwellings with a building area of 300 square metres, the marginal effect of an additional square metre of dwelling area on selling price is greater for properties on larger lots.
This possibly reflects a preference for yard space. For smaller lots, an additional square metre of dwelling size substantially reduces the available yard space.
For larger lots there is not as large a reduction in yard space so buyers are prepared to pay more for the same square metre increase.
We are back to running two Simple Linear Regression Models (SLRMs) however Model II has a different functional form than Model I because the dependant variable is logged (e,g, we have a log-linear model).
Again, the interpretation of the marginal effects changes when logs are involved (more on this in upcoming tutorials).
Also note log(price)
is the log to base e (not base 10) in R
(e)
Consider the following econometric models:
and
where lnprice
represents the natural log of the variable price
.
Estimate Model I in R.
Produce a scatter plot of price
against distance
and a line plot of the fitted values from Model I.
Now generate a new variable lnprice
as the natural log of the variable price
.
Estimate Model II in R.
Produce a scatter plot of lnprice
against distance
and a line plot of the fitted values from Model II.
Compare the scatter plots for each model (Model I and Model II).
Which estimated model do you think fits the data better? Why?
Solution
Generate a variable for the log of selling price
# (9) Create (log) selling price variable
$lnprice = log(tut2$price) # generate log(price) variable tut2
Now run the two models using the following R code chunk provided
# (7) OLS Regression with Distance variable
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Distance, in kms
<- lm(price ~ distance, data=tut2) # Estimate Model I by OLS
reg5 <- deviance(reg5) # save the RSS for the model
RSS5 print(RSS5) # print the RSS on the screen
## [1] 618782879
<- reg5$residuals # create a series for the OLS residuals
resids5 <- reg5$fitted.values # create a series for the fitted values
yhat5 # (10) OLS Regression with lnprice variable
# Dependent variable: (Log) Sale price, in thousands dollars
# Explanatory variable: Distance
<- lm(lnprice ~ distance, data=tut2) # Estimate Model II by OLS
reg6 <- deviance(reg6) # save the RSS for the model
RSS6 print(RSS6) # Print the RSS to screen
## [1] 346.2655
<- reg6$residuals # create a series for the OLS residuals
resids6 <- reg6$fitted.values # create a series for the fitted values
yhat6
Then use stargazer to report the results
# Stargazer package produces `print-ready` output
stargazer(reg5, reg6, type="html",
dep.var.labels=c("Selling Price ($ 000's)", "(Log) Selling Price"),
covariate.labels=c("Intercept","Distance" ),
digits=4,
single.row=FALSE,
intercept.bottom=FALSE,
out= "tut2_reg4.htm")
Dependent variable: | ||
Selling Price ($ 000’s) | (Log) Selling Price | |
(1) | (2) | |
Intercept | 1,636.5080*** | 7.3795*** |
(22.6233) | (0.0169) | |
Distance | -36.3471*** | -0.0337*** |
(1.3961) | (0.0010) | |
Observations | 2,221 | 2,221 |
R2 | 0.2340 | 0.3189 |
Adjusted R2 | 0.2336 | 0.3186 |
Residual Std. Error (df = 2219) | 528.0688 | 0.3950 |
F Statistic (df = 1; 2219) | 677.7707*** | 1,038.9910*** |
Note: | p<0.1; p<0.05; p<0.01 |
For Model 1, the mean selling price declines by
For model 2, The mean selling prices declines by 3.37% for each additional kilometre from the C.B.D. The estimated coefficient for the intercept implies that the average price of land alone in the C.B.D. (with a zero distance) is:
Later in this subject, we will be studying how to interpret estimates in econometric models involving different functional forms, such as natural logarithms).
Now run the following R code chunk to examine the fitted values from both models.
ggplot(tut2, aes(x=distance)) +
geom_point(aes(y=price, colour="Actual Data")) +
geom_line(aes(y=yhat5, colour="Fitted Values: Linear Model"), size=1) +
labs(x = "Distance from CBD, in kms", y = "Selling Price, in thousands of dollars") +
scale_colour_manual("",
breaks = c("Actual Data", "Fitted Values: Linear Model"),
values = c("blue", "red")) +
scale_y_continuous(labels = comma) +
theme_classic() +
theme(axis.text=element_text(size=12), axis.title=element_text(size=12))
ggsave("tut2_graph3.pdf")
ggplot(tut2, aes(x=distance)) +
geom_point(aes(y=lnprice, colour="Actual Data")) +
geom_line(aes(y=yhat6, colour="Fitted Values: Log-Linear Model"), size=1) +
labs(x = "Distance from CBD, in kms", y = "(Log) Selling Price, in thousands dollars") +
scale_colour_manual("",
breaks = c("Actual Data", "Fitted Values: Log-Linear Model"),
values = c("blue", "red")) +
scale_y_continuous(labels = comma) +
theme_classic() +
theme(axis.text=element_text(size=12), axis.title=element_text(size=12))
ggsave("tut2_graph4.pdf")
The fitted values for Model I are presented in Figure 7 and those for Model II in Figure 8.
Comparing the two plots, Model II which uses the natural log of the selling prices appears to fit the data better. The under-prediction of selling prices, relative to the actual data, appears less of an issue in Model II.
As noted in Question 1(e) in Tutorial 1, taking logs reduces the scale in which a variable is measured.
The dependent variable in Model I is price
while in model II it is lnprice
.
Since they are not the same e.g. price
lnprice
, the Total Sum of Squares (TSS) will differ so we can’t use
Note, since the dependent variable in Model I is different to the dependent variable in Model II, it is not possible to use the