ECON30001/ECOM90001
Tutorial 2
The basic regression model in R

Introduction

This tutorial reviews some basic operations using the econometrics software package R that we will be using in this subject. Specifically, this tutorial reviews:

  • running an OLS regression in R
  • plotting actual data values and fitted values
  • running an OLS regression on a sub-sample of the data in R
  • some simple data transformations (natural logarithms)
  • calculation of marginal effects

This tutorial requires one data file:

  • houseprices_2017.csv

This file can be obtained from the Canvas subject page. In addition, the R script file tut2.R provides the program code necessary to complete the tutorial. This R script file uses the following packages, which need to be installed before running it:

ggplot2 : for creating graphs and plots in R
modelsummary : for easily generating high quality tables in R
scales : for displaying thousands with commas in graphs in R
rio : for easily importing data into R
tinytable : for rendering and saving the tables produced by modelsummary

These can be installed directly in RStudio from the Packages tab or by using the command install.packages() with the name of the package, in quotation marks, inside the brackets. Please feel free to play around with the code, particularly the ggplot2 plotting commands.
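As a minimal sketch of the command-line route, the snippet below installs only the packages that are not already present. The object names pkgs and to_install are illustrative choices, tinytable is included because the code in this tutorial also loads it, and the interactive() guard is an assumption made here so that nothing is downloaded when the script is run non-interactively.

```r
# Packages used in this tutorial (tinytable is loaded by the code in part (b))
pkgs <- c("ggplot2", "modelsummary", "scales", "rio", "tinytable")

# Keep only the packages that are not yet installed
to_install <- setdiff(pkgs, rownames(installed.packages()))

# Install whatever is missing, but only in an interactive session such as RStudio
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this a second time should do nothing, because to_install will then be empty.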

Capstone Project and Final Exam

Please remember that the tutorial material required for the Capstone Project and the Final Exam could differ.

For example, you will be required to use R code in your project; however, R code is not examinable.

You will need to interpret the output generated from R code for both the project and the exam.

Question

Download the data file houseprices_2017.csv from the Canvas page.

This file contains data on the selling prices of houses in metropolitan Melbourne during the 2016 calendar year. There are several variables of interest:

price = Selling price, thousands of dollars
distance = Distance from the C.B.D., in kilometres
bld_area = Dwelling size, metres squared
landsize = Land size, metres squared

\[ \texttt{large} = \begin{cases} 1 & \text{property on a large lot, land size $\geq 650$ square metres}\\ 0 & \text{property not on a large lot, land size $< 650$ square metres} \end{cases} \]
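The data file already contains the variable large, so it does not need to be created. Purely as an illustration of the definition, a dummy variable of this kind could be built from land size as follows; the land sizes here are hypothetical values, not observations from the data file.

```r
# Hypothetical land sizes in square metres (not taken from the data file)
landsize <- c(420, 650, 980, 649)

# 1 if the lot is at least 650 square metres, 0 otherwise
large <- as.integer(landsize >= 650)
print(large)
```

Note that a lot of exactly 650 square metres counts as large, while one of 649 square metres does not.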

(a)

Consider the following econometric model:

\[\text{price}_i=\beta_0 + \beta_1\, \text{bld{\_}area}_i + \varepsilon_i \tag{1}\]

Important

This is the population model. It comprises a deterministic part, the Population Regression Function (PRF) \(E[Y_i \mid X_i] = \beta_0 + \beta_1\, X_i\), plus the error term, \(\varepsilon_i\).

What is the interpretation of the parameter \(\beta_{0}\)?
What is the interpretation of the parameter \(\beta_{1}\)?

Solution

The parameter \(\beta_{0}\) represents the mean selling price for a house with a building area of zero. It would represent the value of the land only. However, if the data do not include any properties with zero building area this parameter will not be estimated precisely (out of sample prediction).

The parameter \(\beta_{1}\) represents the marginal effect of an additional square metre of building area on the mean selling price.
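To see where these interpretations come from, take the conditional mean of equation (1). This step assumes, as is standard but not stated explicitly above, that the error term has a conditional mean of zero:

\[ E[\text{price}_i \mid \text{bld{\_}area}_i] = \beta_0 + \beta_1\, \text{bld{\_}area}_i \]

Setting the building area to zero leaves \(\beta_0\), and differentiating with respect to the building area gives

\[ \frac{\partial E[\text{price}_i \mid \text{bld{\_}area}_i]}{\partial \text{bld{\_}area}_i} = \beta_1 \]

Because price is measured in thousands of dollars, a value of \(\beta_1 = 1\) would correspond to \(\$1,000\) per additional square metre.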

(b)

Estimate this model in R and provide a brief description of the point estimates.
Produce a scatter plot of both price and the fitted values against bld_area.
Comment on how well the estimated model fits the data.

Solution

Load the required packages (make sure these are installed first) and import the data.

#----------------------------------------
options(scipen=999)               # Do not use scientific notation
library(ggplot2)                  # Flexible graphic facility for R
library(modelsummary)             # For producing good quality output from R 
library(scales)                   # package to display thousands with commas in graphs
library(rio)                      # package for easily importing data into R
library(tinytable)                # package for rendering and saving tables (save_tt)
#---------------------------------------
# (1) Import Data File from csv and Save as R Data File
tut2 <- import("houseprices_2017.csv")
# create custom function for including sample F statistic
# in modelsummary table from the lm function
# only need to do this once per file
glance_custom.lm <- function(x, ...) {
  s <- summary(x)
  f <- s$fstatistic  # value, numdf, dendf
  
  data.frame(
    F_line = sprintf(
      "%.4f (df = %d; %d)",
      f[1], f[2], f[3]
    ),
    p_F = pf(f[1], f[2], f[3], lower.tail = FALSE),
    sigma = s$sigma,
    nobs = nobs(x),
    r.squared = s$r.squared,
    adj.r.squared = s$adj.r.squared
  )
}
# (2)  Estimate the econometric model by OLS
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Building area, square metres
reg1 <- lm(price ~ bld_area, data=tut2)        # estimate model by OLS, save as reg1
print(summary(reg1), digits=3)                 # print the results on screen

Call:
lm(formula = price ~ bld_area, data = tut2)

Residuals:
   Min     1Q Median     3Q    Max 
 -2603   -403   -131    280   3496 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  685.836     27.727    24.7 <0.0000000000000002 ***
bld_area       2.674      0.152    17.6 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 565 on 2219 degrees of freedom
Multiple R-squared:  0.122, Adjusted R-squared:  0.122 
F-statistic:  308 on 1 and 2219 DF,  p-value: <0.0000000000000002
Note

Output from the lm() command, as shown in the Console window.

# put the dependent-variable label in the model name (this becomes the column header)
models1 <- list("Selling Price (&#36; 000's)" = reg1)
# use modelsummary for results:
table1 <- modelsummary(
  models1,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    bld_area = "Building Area (msq)"),
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt   = c(0, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  notes = "Standard errors shown in parentheses"
)


save_tt(table1, "table1.html", overwrite = TRUE)
RSS1 <- deviance(reg1)                            # save the RSS for the model
print(RSS1)                                       # print the RSS on the screen
[1] 709209196
resids1 <- reg1$residuals                         # create a series for the OLS residuals
yhat1 <- reg1$fitted.values                       # create a series for the fitted values
                       Selling Price ($ 000's)
Intercept              685.8360***
                       (27.7273)
Building Area (msq)    2.6743***
                       (0.1523)
Observations           2221
R-squared              0.1220
Adj. R-squared         0.1216
Residual Std. Error    565.3385
F Statistic            308.4236 (df = 1; 2219)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Note

Output using modelsummary stored in the named file “table1.html”

Figure 1: OLS Estimation Results: Part (b)
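As a quick check on how the reported statistics fit together, the residual standard error can be recovered from the RSS and the residual degrees of freedom, since it equals the square root of RSS divided by (n - k - 1). The sketch below simply plugs in the numbers reported for model (1); the object names are illustrative.

```r
# Values reported in the output for model (1)
rss1 <- 709209196   # residual sum of squares, from deviance(reg1)
n    <- 2221        # number of observations
k    <- 1           # number of slope coefficients

# Residual standard error: sqrt(RSS / (n - k - 1))
sigma_hat <- sqrt(rss1 / (n - k - 1))
print(sigma_hat)
```

The result should agree with the reported residual standard error of 565.3385 up to rounding.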


The scatter plot presented in Figure 2 shows that the estimated model appears to adequately fit the data for properties with relatively smaller building areas. However, it tends to considerably ‘under-predict’ selling prices for properties with relatively smaller building areas that sell for relatively large prices.
Additionally, it also tends to ‘over-predict’ selling prices for some properties with relatively larger building areas.

# (2) Plot the actual and fitted values 
ggplot(tut2, aes(x=bld_area)) +
  geom_point(aes(y=price, colour="Actual Data"), size=0.8) +
  geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), linewidth=0.6) +
  labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("",
                      breaks = c("Actual Data", "Fitted Values: Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
        legend.text=element_text(size=8))

Figure 2: Actual and Fitted Values for price: Part (b)


Ultimately, there are likely several other factors, beyond just building area, that determine selling prices. These factors have been collected in the random error of the econometric model [equation (1)].
Some important variables might be the distance from the C.B.D., the age of the dwelling, the characteristics of the dwelling (such as the number of bedrooms and bathrooms), the quality of local schools, and proximity to local amenities.
Moreover, at least some of these omitted factors are also likely related to building area.
We will be studying omitted variables and how they affect the estimated parameters of econometric models later in this subject.

(c)

Tip

Two changes here:

  1. we are now using a Multiple Regression Model (MRM), and
  2. the functional form has changed

Note the difference in interpreting the marginal effects.

In this case, compared to the Simple Linear Regression Model (SLRM) in part (a), they are no longer constant.

Consider the following econometric model:

\[\text{price}_i=\beta_0 + \beta_1\, \text{bld{\_}area}_i + \beta_2\, \text{bld{\_}area}_i^2+\varepsilon_i \tag{2}\]
What is the marginal (or partial) effect of an additional square metre of dwelling size (bld_area) on the selling price?
Estimate this equation in R.
What is the estimated marginal effect of an additional square metre of dwelling size for a home with 300 square metres of building area?

Hint: You will need to generate a new variable representing the squared value of the bld_area variable.

Produce a scatter plot of both price and the fitted values against bld_area.

On the same graph, produce a line plot of the fitted values for the linear model (from part b) and the quadratic model (from part c).
Based upon a visual inspection of the fitted values, which model do you think fits the data better?
Why?

Compare the Sum of Squared Residuals (RSS) for the two models.
Which is smaller?

Based upon the value for the RSS, which model fits the data better?

Solution
For the quadratic model

\[\text{price}_i=\beta_0 + \beta_1\, \text{bld{\_}area}_i + \beta_2\, \text{bld{\_}area}_i^2+\varepsilon_i \]

The marginal effect is given by:

\[ \frac{\partial E[\text{price}]}{\partial \text{bld{\_}area}} = \beta_1+2\beta_2\, \text{bld{\_}area} \]
The estimation results in Figure 3 can be used to estimate the marginal effect for a dwelling with 300 square metres of building area as

\[b_1 + 2b_2\, \text{bld{\_}area} = 6.419578+(2 \times -0.006542 \times 300) \approx 2.4945\]

Since price is measured in thousands of dollars, an additional square metre of dwelling space for a house with 300 square metres of dwelling area is estimated, in this quadratic model, to increase the sales price by \(\$2,495\). Compare this to the estimated effect in the linear model in part (b) of \(\$2,674\) (which restricts the marginal effect to be the same regardless of the dwelling area). Note that for properties with 300 square metres of dwelling area, the estimated marginal effect for model 2 is remarkably close to the estimated marginal effect for model 1.
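To illustrate that the marginal effect in the quadratic model is not constant, the sketch below evaluates it at several building areas. It uses the rounded coefficient estimates quoted in the text rather than the fitted model object; the evaluation points and object names are illustrative assumptions.

```r
# Rounded coefficient estimates for the quadratic model, as quoted in the text
b1_rounded <- 6.419578    # coefficient on bld_area
b2_rounded <- -0.006542   # coefficient on bld_area squared

# Marginal effect of one more square metre, evaluated at a given building area
marginal_effect <- function(area) b1_rounded + 2 * b2_rounded * area

# Illustrative building areas, in square metres
areas <- c(100, 200, 300, 400, 500)
me_by_area <- data.frame(bld_area = areas, marginal_effect = marginal_effect(areas))
print(me_by_area)
```

The estimated marginal effect falls as the building area rises and turns negative by 500 square metres. Because rounded coefficients are used, the value at 300 square metres differs slightly from the figure obtained from the unrounded estimates in R.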

# (3) Extended Model with Building Area Squared
tut2$bld_area2 <- (tut2$bld_area)^2                           # create bld_area squared variable
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Building area, Building area squared
reg2 <- lm(price ~ bld_area + bld_area2, data=tut2)       # estimate model by OLS, save as reg2
# Alternative specification
reg2a <- lm(price ~ bld_area +  I(bld_area^2), data=tut2)
models2 <- list("(1)" = reg1, "(2)" = reg2)

table2 <- modelsummary(
  models2,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    bld_area = "Building Area (msq)",
    bld_area2 = "Building Area (msq) squared"
),
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt   = c(0, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  title = "Dependent variable: Selling Price ($ 000's)",
  notes = "Standard errors shown in parentheses"
)

save_tt(table2, "table2.html", overwrite = TRUE)
table2
Dependent variable: Selling Price ($ 000's)
                               (1)                        (2)
Intercept                      685.8360***                287.8916***
                               (27.7273)                  (47.6916)
Building Area (msq)            2.6743***                  6.4196***
                               (0.1523)                   (0.3982)
Building Area (msq) squared                               -0.0065***
                                                          (0.0006)
Observations                   2221                       2221
R-squared                      0.1220                     0.1609
Adj. R-squared                 0.1216                     0.1602
Residual Std. Error            565.3385                   552.7921
F Statistic                    308.4236 (df = 1; 2219)    212.7259 (df = 2; 2218)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Figure 3: Estimation Results: Part (c)


RSS2 <- deviance(reg2)                                    # save the RSS for the model
print(RSS2)                                               # print the RSS on the screen
[1] 677774520
resids2 <- reg2$residuals                                 # create a series for the OLS residuals
yhat2 <- reg2$fitted.values                               # create a series for the fitted values
# Calculate the  marginal effect for 300 square metres building area
b1 <- coef(reg2)[["bld_area"]]                             # coefficient on bld_area
b2 <- coef(reg2)[["bld_area2"]]                            # coefficient on bld_area squared
me <- b1 + (2*b2*300)
print(me)   
[1] 2.494518
# Using alternative specification
b1a <- coef(reg2a)[["bld_area"]]                             # coefficient on bld_area
b2a <- coef(reg2a)[["I(bld_area^2)"]]                        # coefficient on bld_area squared
mea <- b1a + (2*b2a*300)
print(mea)                                                   # print the estimated marginal effect
[1] 2.494518
# (4) Plot the actual and fitted values
ggplot(tut2, aes(x=bld_area)) +
  geom_point(aes(y=price, colour = "Actual Data"), size = 0.8) +
  geom_line(aes(y=yhat1, colour="Fitted Values: Linear Model"), linewidth=0.6) +
  geom_line(aes(y=yhat2, colour="Fitted Values: Quadratic Model"), linewidth=0.6) +
  geom_vline(xintercept = 300, color = "black", linetype = "dotted", linewidth =0.6) +
    labs(x = "Building Area, square metres", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("", 
                      breaks = c("Actual Data", "Fitted Values: Linear Model", 
                                 "Fitted Values: Quadratic Model"),
                      values = c("blue", "red", "darkgreen")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
        legend.text=element_text(size=8))

Figure 4: Actual and Fitted Values : Part (c)


Figure 4 confirms that the two estimated marginal effects are similar at a building area of 300 square metres: the slopes of the fitted lines for model (1) and model (2) are quite close to each other at that point.

Note that the estimate of \(b_{2}\) is \(-0.006542\). Because this coefficient is negative, the estimated relationship between selling prices and building area has an ‘inverted U-shape’.

For houses with sufficiently large dwelling areas, an additional square metre of dwelling area is estimated to reduce the selling price.
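The building area at which the estimated marginal effect changes sign can be found by setting \(b_1 + 2b_2\, \text{bld{\_}area} = 0\). As a rough calculation using the rounded estimates reported for model (2):

\[ \text{bld{\_}area}^{*} = -\frac{b_1}{2b_2} = \frac{6.419578}{2 \times 0.006542} \approx 490.6 \text{ square metres} \]

Only houses with building areas above roughly 490 square metres fall on the downward-sloping part of the fitted curve.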

We will be looking at issues associated with the appropriate functional form in econometric models, including quadratic functions, in a few weeks.

Aside:

Is this likely a “causal” effect? Is it plausible that, for houses with sufficiently large dwelling areas, an additional square metre of dwelling area actually reduces the selling price? In our simple model, it is likely that this is not a ‘causal’ effect. Why?

  • Outliers: There are only a few observations for houses with a large building area and relatively low selling prices. It is feasible that these observations are not really representative of the population of houses sold in Melbourne.

  • Omitted Variables: The econometric model (2) only relates selling prices to the dwelling area.
    There are likely omitted variables, related to the dwelling area, that also affect selling prices. Effectively, the estimated negative relationship between dwelling area and price for large dwellings really reflects the effects of these omitted characteristics. For example, houses with a larger dwelling area will generally be located in different areas to houses with a smaller dwelling area, and these location characteristics might be important determinants of prices. In particular, houses with a larger dwelling area tend to be located further from the C.B.D., and it is this characteristic that is associated with lower prices.

We will be exploring these issues throughout the subject.

The actual and fitted values for the quadratic model and the linear model (part b) are presented in Figure 4.

It appears that the quadratic model fits the data slightly better: it is slightly better at capturing the lower selling prices for houses with a larger building area.
However, it still tends to ‘under-predict’ selling prices for properties with relatively smaller building areas that sell for relatively large prices.
The RSS for the linear model (model 1) is \(709,209,196\) while for the quadratic model (model 2) it is \(677,774,520\).
At first glance, the minimised value of the sum of squared residuals is lower for the quadratic model, so it is tempting to conclude that the quadratic model fits the data better.
This is also confirmed by looking at the \(R^{2}\) reported in the estimation output.
For the linear model in part (b), the \(R^{2}\) is 0.1220 while for the quadratic model the \(R^{2}\) is 0.1609.
However, since the quadratic model includes an additional explanatory variable compared to the linear model, its RSS cannot be higher (and its \(R^{2}\) cannot be lower), so this comparison alone does not establish that the quadratic model is better. The adjusted \(R^{2}\), which penalises the additional regressor, also rises (from 0.1216 to 0.1602).
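The point that the RSS cannot rise when a regressor is added can be illustrated with a small simulation. The sketch below uses simulated data, not the house price data, and adds a regressor that is pure noise by construction; all object names are illustrative.

```r
set.seed(123)                      # make the simulated data reproducible

# Simulated data: y depends on x only; 'noise' is unrelated to y by construction
n     <- 200
x     <- runif(n, 50, 400)
noise <- rnorm(n)
y     <- 500 + 3 * x + rnorm(n, sd = 300)

small <- lm(y ~ x)                 # model without the irrelevant regressor
big   <- lm(y ~ x + noise)         # model with the irrelevant regressor added

rss_small <- deviance(small)
rss_big   <- deviance(big)

# RSS cannot rise when a regressor is added, so this prints TRUE
print(rss_big <= rss_small)

# Adjusted R-squared penalises the extra regressor and can move either way
print(c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared))
```

Even an irrelevant regressor lowers the RSS, which is why a fall in the RSS on its own is weak evidence of a better model.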

(d)

Estimate the econometric model 2, restricting the sample to houses that are on large lots.

Now repeat the estimation for houses not on large lots.
Comment on how the estimations differ.

Hint: You will need to restrict the samples using the variable large.

Solution

# (5) OLS Regression for Large Lots only
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Building area, Building area squared
reg3 <- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==1))
# print(summary(reg3, digits=3))
RSS3 <- deviance(reg3)                                # save the RSS for the model
print(RSS3)                                           # print the RSS on the screen  
[1] 259161339
resids3 <- reg3$residuals                             # create a series for the OLS residuals
yhat3 <- reg3$fitted.values                           # create a series for the fitted values 
# Calculate the  marginal effect for 300 square metres building area
b1_lge <- coef(reg3)[["bld_area"]]                     # coefficient on bld_area
b2_lge <- coef(reg3)[["bld_area2"]]                    # coefficient on bld_area squared
me_lge <- b1_lge + (2*b2_lge*300)
print(me_lge)                                         # print the estimated marginal effect for large lots
[1] 3.345367
# (6) OLS Regression for (Not) Large Lots
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Building area, Building area squared
reg4 <- lm(price ~ bld_area + bld_area2, data=subset(tut2, large==0))
print(summary(reg4, digits=3))

Call:
lm(formula = price ~ bld_area + bld_area2, data = subset(tut2, 
    large == 0))

Residuals:
    Min      1Q  Median      3Q     Max 
-1150.7  -401.1   -91.2   303.8  3437.3 

Coefficients:
               Estimate  Std. Error t value             Pr(>|t|)    
(Intercept) 393.9104909  54.9577593   7.168   0.0000000000011904 ***
bld_area      5.6889778   0.4832344  11.773 < 0.0000000000000002 ***
bld_area2    -0.0063903   0.0008319  -7.681   0.0000000000000281 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 521.6 on 1510 degrees of freedom
Multiple R-squared:  0.1167,    Adjusted R-squared:  0.1155 
F-statistic: 99.71 on 2 and 1510 DF,  p-value: < 0.00000000000000022
RSS4 <- deviance(reg4)                               # save the RSS for the model
print(RSS4)                                          # print the RSS on the screen
[1] 410881760
resids4 <- reg4$residuals                            # create a series for the OLS residuals
yhat4 <- reg4$fitted.values                          # create a series for the fitted values 
# Calculate the  marginal effect for 300 square metres building area
b1_sml <- coef(reg4)[["bld_area"]]                    # coefficient on bld_area
b2_sml <- coef(reg4)[["bld_area2"]]                   # coefficient on bld_area squared
me_sml <- b1_sml + (2*b2_sml*300)
print(me_sml)                                        # print the estimated marginal effect for small lots
[1] 1.854769
models3 <- list("Large Lots" = reg3, "Small Lots" = reg4)

table3 <- modelsummary(
  models3,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    bld_area = "Building Area (msq)",
    bld_area2 = "Building Area (msq) squared"),   
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt   = c(0, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  title = "Dependent variable: Selling Price ($ 000's)",
  notes = "Standard errors shown in parentheses"
)
table3
Dependent variable: Selling Price ($ 000's)
                               Large Lots                Small Lots
Intercept                      10.0514                   393.9105***
                               (98.6141)                 (54.9578)
Building Area (msq)            8.3311***                 5.6890***
                               (0.7475)                  (0.4832)
Building Area (msq) squared    -0.0083***                -0.0064***
                               (0.0011)                  (0.0008)
Observations                   708                       1513
R-squared                      0.2265                    0.1167
Adj. R-squared                 0.2243                    0.1155
Residual Std. Error            606.3042                  521.6389
F Statistic                    103.2194 (df = 2; 705)    99.7114 (df = 2; 1510)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Figure 5: OLS Estimation Results by Lot Size



The estimation results are presented in Figure 5.
The estimated marginal effect of an additional square metre of dwelling area for a house with 300 square metres of dwelling area is \(\$3,345.37\) for large lots and \(\$1,854.77\) for smaller lots.

For larger lots:

\[ b_1+2b_2\,\text{bld{\_}area}= 8.331051+(2 \times 300 \times -0.008309) \approx 3.345367 \]

and for smaller lots:

\[ b_1+2b_2\,\text{bld{\_}area}= 5.688978+(2 \times 300 \times -0.0063903) \approx 1.854769 \]

For dwellings with a building area of 300 square metres, the marginal effect of an additional square metre of dwelling area on selling price is greater for properties on larger lots.
This possibly reflects a preference for yard space. For smaller lots, an additional square metre of dwelling size substantially reduces the available yard space.
For larger lots there is not as large a reduction in yard space so buyers are prepared to pay more for the same square metre increase.
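The sample restriction in part (d) can be written in more than one way. As a sketch on simulated data (a stand-in data frame called sim, not the tutorial data), the example below checks that subsetting the data frame before calling lm() gives the same estimates as using the subset argument of lm().

```r
set.seed(1)                                   # reproducible simulated data

# Simulated stand-in for the tutorial data (column names mirror tut2)
sim <- data.frame(bld_area = runif(300, 60, 400),
                  large    = rbinom(300, 1, 0.3))
sim$price <- 300 + 5 * sim$bld_area - 0.006 * sim$bld_area^2 + rnorm(300, sd = 100)

# Option 1: subset the data frame before passing it to lm(), as in part (d)
fit_a <- lm(price ~ bld_area + I(bld_area^2), data = subset(sim, large == 1))

# Option 2: let lm() do the restriction through its subset argument
fit_b <- lm(price ~ bld_area + I(bld_area^2), data = sim, subset = large == 1)

# Both approaches use the same observations, so the estimates coincide
print(all.equal(coef(fit_a), coef(fit_b)))
```

Either form is fine; the first keeps the restriction visible in the data argument, while the second avoids creating a separate data frame.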

(e)

Consider the following econometric models:

Tip

We are back to running two Simple Linear Regression Models (SLRMs); however, Model II has a different functional form from Model I because the dependent variable is logged (i.e., we have a log-linear model).

Again, the interpretation of the marginal effects changes when logs are involved (more on this in upcoming tutorials).

Also note that log(price) in R is the log to base e (not base 10).

\[ \text{price}_i = \beta_0 + \beta_1\, \text{distance}_i + \varepsilon_i \tag{I} \]

and

\[ \text{lnprice}_i = \beta_0 + \beta_1\, \text{distance}_i + \varepsilon_i \tag{II} \]

where lnprice represents the natural log of the variable price.

Estimate Model I in R.

Produce a scatter plot of price against distance and a line plot of the fitted values from Model I.

Now generate a new variable lnprice as the natural log of the variable price.

Estimate Model II in R.

Produce a scatter plot of lnprice against distance and a line plot of the fitted values from Model II.

Compare the scatter plots for each model (Model I and Model II).
Which estimated model do you think fits the data better? Why?

Solution

# (7) OLS Regression with Distance variable
# Dependent variable: Sale price, in thousands of dollars
# Explanatory variable: Distance, in kms
reg5 <- lm(price ~ distance, data=tut2)        # Estimate Model I by OLS
# print(summary(reg5, digits=3))                 # Print OLS to screen
RSS5 <- deviance(reg5)                         # save the RSS for the model
print(RSS5)                                    # print the RSS on the screen
[1] 618782879
resids5 <- reg5$residuals                      # create a series for the OLS residuals
yhat5 <- reg5$fitted.values                    # create a series for the fitted values
####################################################################
# (9) Create (log) selling price variable
tut2$lnprice <- log(tut2$price)                # generate log(price) variable
#tut2$lnprice = log(tut2$price)                # generate log(price) variable
#  (10) OLS Regression with lnprice variable
# Dependent variable: (Log) Sale price, in thousands of dollars
# Explanatory variable: Distance
reg6 <- lm(lnprice ~ distance, data=tut2)        # Estimate Model II by OLS  
# summary(reg6)                              # Print OLS results to screen
# Alternative method
reg6a <- lm(I(log(price)) ~ distance, data=tut2)  # Estimate Model II by OLS  
# summary(reg6a)                              # Print OLS results to screen

RSS6 <- deviance(reg6)                     # save the RSS for the model
print(RSS6)                                      # Print the RSS to screen
[1] 346.2655
resids6 <- reg6$residuals                        # create a series for the OLS residuals
yhat6 <- reg6$fitted.values                      # create a series for the fitted values
models4 <- list("Selling Price ($ 000's)" = reg5, "(Log) Selling Price ($ 000's)" = reg6)

table4 <- modelsummary(
  models4,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    distance = "Distance from CBD (km)"),
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt   = c(0, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  notes = "Standard errors shown in parentheses"
)
table4
                          Selling Price ($ 000's)     (Log) Selling Price ($ 000's)
Intercept                 1636.5082***                7.3795***
                          (22.6233)                   (0.0169)
Distance from CBD (km)    -36.3471***                 -0.0337***
                          (1.3961)                    (0.0010)
Observations              2221                        2221
R-squared                 0.2340                      0.3189
Adj. R-squared            0.2336                      0.3186
Residual Std. Error       528.0688                    0.3950
F Statistic               677.7707 (df = 1; 2219)     1038.9908 (df = 1; 2219)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Figure 6: OLS Estimation Results: Part (e)


For Model II, the mean selling price is estimated to decline by approximately 3.37% for each additional kilometre from the C.B.D. The estimated intercept implies that the predicted price of a house located in the C.B.D. (zero distance) is:

\[ 1000 \times \exp(b_0) = 1000 \times \exp(7.3795) \approx \$1,602,788 \]

Later in this subject, we will be studying how to interpret estimates in econometric models involving different functional forms (such as natural logarithms).
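As a preview of that material, the sketch below works through the arithmetic using the rounded estimates reported in Figure 6. It compares the usual approximation (100 times the slope) with the exact percentage change implied by a log-linear model, and converts the intercept back to dollars. The object names are illustrative, and the exact formula is a standard result brought in here rather than something derived in this tutorial.

```r
# Rounded estimates for Model II, as reported in Figure 6
b0 <- 7.3795     # intercept
b1 <- -0.0337    # coefficient on distance

# Approximate percentage change in price per extra kilometre: 100 * b1
pct_approx <- 100 * b1

# Exact percentage change implied by the log-linear form: 100 * (exp(b1) - 1)
pct_exact <- 100 * (exp(b1) - 1)

# Predicted price in dollars at zero distance (price is in thousands of dollars)
price_cbd <- 1000 * exp(b0)

print(c(approx = pct_approx, exact = pct_exact))
print(price_cbd)
```

The approximation gives a decline of 3.37% per kilometre and the exact calculation about 3.31%, so the two are close when the coefficient is small.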

Note, since the dependent variable in Model I is different to the dependent variable in Model II, it is not possible to use the \(R^{2}\) for these two models to make any judgment about which model is better in terms of goodness of fit. 

Now run the following R code chunk to examine the fitted values from both models.

ggplot(tut2, aes(x=distance)) +
  geom_point(aes(y=price, colour="Actual Data"), size= 0.8) +
  geom_line(aes(y=yhat5, colour="Fitted Values: Linear Model"), linewidth=0.6) +  
  labs(x = "Distance from CBD, in kms", y = "Selling Price, in thousands of dollars") +
  scale_colour_manual("", 
                      breaks = c("Actual Data", "Fitted Values: Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
        legend.text=element_text(size=8))

Figure 7: Actual and Fitted Values: Part (e)


The fitted values for Model I are presented in Figure 7 and those for Model II in Figure 8.

ggplot(tut2, aes(x=distance)) +
  geom_point(aes(y=lnprice, colour="Actual Data"), size = 0.8) +
  geom_line(aes(y=yhat6, colour="Fitted Values: Log-Linear Model"), linewidth=0.8) +  
  labs(x = "Distance from CBD, in kms", y = "(Log) Selling Price, in thousands of dollars") +
  scale_colour_manual("", 
                      breaks = c("Actual Data", "Fitted Values: Log-Linear Model"),
                      values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_classic() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
        legend.text=element_text(size=8))

Figure 8: Actual and Fitted Values: Log Selling Price - Part (e)



Comparing the two plots, Model II, which uses the natural log of the selling price, appears to fit the data better. The under-prediction of selling prices, relative to the actual data, appears to be less of an issue in Model II.
As noted in Question 1 (e) in Tutorial 1, taking logs compresses the scale on which a variable is measured, so very high selling prices sit closer to the rest of the data.
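A short illustration of both points, that log() in R is the natural logarithm and that logs compress large values, is sketched below. The prices are hypothetical numbers chosen for the example, not observations from the data file.

```r
# log() in R is the natural logarithm (base e)
print(log(exp(1)))            # 1
print(log(100, base = 10))    # 2, the same as log10(100)

# Hypothetical selling prices in thousands of dollars (not from the data file)
price_example <- c(400, 800, 1600, 3200)

# Each doubling of price adds the same amount, log(2), on the log scale
print(diff(log(price_example)))
```

On the original scale the gaps between these prices grow from 400 to 1,600, whereas on the log scale each step is the same size.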