Introduction

The purpose of this notebook is to explore linear regression techniques from the Coursera specialization in Machine Learning.

First, we load the required packages.

# Load the required packages
library(tidyverse)
library(here)
library(janitor)

Load data

The data we are using for this exploration is the Perth House Prices data as found on Kaggle.

# Load the data
house_prices <- 
  read_csv(here("data/src/all_perth_310121.csv"))

# View the head
head(house_prices)
## # A tibble: 6 × 19
##   ADDRESS           SUBURB  PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA FLOOR_AREA
##   <chr>             <chr>   <dbl>    <dbl>     <dbl> <chr>      <dbl>      <dbl>
## 1 1 Acorn Place     South… 565000        4         2 2            600        160
## 2 1 Addis Way       Wandi  365000        3         2 2            351        139
## 3 1 Ainsley Court   Camil… 287000        3         1 1            719         86
## 4 1 Albert Street   Belle… 255000        2         1 2            651         59
## 5 1 Aman Place      Lockr… 325000        4         1 2            466        131
## 6 1 Amethyst Cresc… Mount… 409000        4         2 1            759        118
## # ℹ 11 more variables: BUILD_YEAR <chr>, CBD_DIST <dbl>, NEAREST_STN <chr>,
## #   NEAREST_STN_DIST <dbl>, DATE_SOLD <chr>, POSTCODE <dbl>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, NEAREST_SCH <chr>, NEAREST_SCH_DIST <dbl>,
## #   NEAREST_SCH_RANK <dbl>

Clean data

We will perform the following transformations to produce a clean dataset to work with for this project:

  • Convert column names to lowercase using the janitor::clean_names() function
  • Change GARAGE, BUILD_YEAR from character to numeric
  • Change POSTCODE from numeric to character
  • Split DATE_SOLD into separate columns for year and month
  • Filter records to include only house prices from a single year (to avoid too much “noise” in the pricing data due to inflation)

Clean column names

We convert the column names to lowercase to make them easier to work with in code.

house_prices_cln <- 
  house_prices |> 
  clean_names()

names(house_prices_cln)
##  [1] "address"          "suburb"           "price"            "bedrooms"        
##  [5] "bathrooms"        "garage"           "land_area"        "floor_area"      
##  [9] "build_year"       "cbd_dist"         "nearest_stn"      "nearest_stn_dist"
## [13] "date_sold"        "postcode"         "latitude"         "longitude"       
## [17] "nearest_sch"      "nearest_sch_dist" "nearest_sch_rank"

Change data types

We correct the data types for some of the fields. Before doing this we check the contents of each column we want to convert.

garage

First we check if it is appropriate to convert this column to numeric.

house_prices_cln |> 
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##    <chr>  <int>
##  1 1       5290
##  2 10        26
##  3 11         7
##  4 12        30
##  5 13         8
##  6 14        13
##  7 16         4
##  8 17         1
##  9 18         3
## 10 2      20724
## # ℹ 16 more rows

All values are legitimate numbers or the string “NULL”, so we can convert this column to numeric. as.numeric() converts the “NULL” strings to NA and emits a coercion warning when it does so.
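The count output above shows only the first ten of the 26 distinct values, so the claim can be checked more directly. The sketch below is a minimal illustration on a made-up vector (garage_chr is a hypothetical stand-in for the garage column, not the real data). It lists every value that is not a plain whole number, then counts the NA values produced by the conversion.

```r
# Hypothetical toy vector standing in for house_prices_cln$garage
garage_chr <- c("1", "2", "10", "NULL", "3", "NULL")

# Distinct values that are not made up solely of digits
non_numeric <- unique(garage_chr[!grepl("^[0-9]+$", garage_chr)])
non_numeric

# as.numeric() returns NA for strings it cannot parse and warns about it;
# suppressWarnings() silences that expected coercion warning
garage_num <- suppressWarnings(as.numeric(garage_chr))

# Number of NA values introduced by the conversion
sum(is.na(garage_num))
```

On the real data, the same check could be applied to house_prices_cln$garage before the mutate() call.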

house_prices_cln <- 
  house_prices_cln |> 
  mutate(garage = as.numeric(garage))

house_prices_cln |> 
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##     <dbl> <int>
##  1      1  5290
##  2      2 20724
##  3      3  2042
##  4      4  1949
##  5      5   362
##  6      6   466
##  7      7    97
##  8      8   129
##  9      9    17
## 10     10    26
## # ℹ 16 more rows

build_year

First we check if it is appropriate to convert this column to numeric.

house_prices_cln |> 
  count(build_year)
## # A tibble: 125 × 2
##    build_year     n
##    <chr>      <int>
##  1 1868           1
##  2 1870           1
##  3 1880           2
##  4 1886           1
##  5 1889           1
##  6 1890           6
##  7 1892           1
##  8 1894           1
##  9 1895           1
## 10 1897           3
## # ℹ 115 more rows

All values are legitimate numbers or the string “NULL”, so we can convert this column to numeric. As with garage, the “NULL” strings become NA.

house_prices_cln <- 
  house_prices_cln |> 
  mutate(build_year = as.numeric(build_year))

house_prices_cln |> 
  count(build_year)
## # A tibble: 125 × 2
##    build_year     n
##         <dbl> <int>
##  1       1868     1
##  2       1870     1
##  3       1880     2
##  4       1886     1
##  5       1889     1
##  6       1890     6
##  7       1892     1
##  8       1894     1
##  9       1895     1
## 10       1897     3
## # ℹ 115 more rows

postcode

Convert from numeric to character.

house_prices_cln <- 
  house_prices_cln |> 
  mutate(postcode = as.character(postcode))

Split data

date_sold

In this case we need to split the date_sold column into separate columns for month and year. First we check that all values are in the same format (mm-yyyy).

house_prices_cln |> 
  mutate(
    date_sold_correct_format = str_detect(
      date_sold, "\\d{2}-\\d{4}"
    )
  ) |> 
  count(date_sold_correct_format)
## # A tibble: 1 × 2
##   date_sold_correct_format     n
##   <lgl>                    <int>
## 1 TRUE                     33656

All values match the desired format so we can now separate the month and year.
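One caveat: a pattern without anchors matches anywhere inside a string, so a value with extra leading or trailing characters would still pass the check. A stricter variant, sketched here with base R's grepl() on made-up values (date_examples is hypothetical, not taken from the dataset), anchors the pattern with ^ and $ so the whole string has to match.

```r
# Hypothetical example values; only the first two are valid mm-yyyy strings
date_examples <- c("09-2018", "02-2019", "112-2020", "3-2017")

# Unanchored: TRUE whenever two digits, a hyphen and four digits occur anywhere
unanchored <- grepl("[0-9]{2}-[0-9]{4}", date_examples)

# Anchored: TRUE only when the entire string is two digits, a hyphen, four digits
anchored <- grepl("^[0-9]{2}-[0-9]{4}$", date_examples)

data.frame(date_examples, unanchored, anchored)
```

The third value passes the unanchored check because it contains "12-2020", but fails the anchored one. The same anchored pattern should work with str_detect() in the chunk above.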

house_prices_cln <- 
  house_prices_cln |> 
  mutate(
    month_sold = as.numeric(str_sub(date_sold, start = 1, end = 2)),
    year_sold = as.numeric(str_sub(date_sold, start = -4, end = -1))
  )

house_prices_cln |> 
  count(year_sold)
## # A tibble: 33 × 2
##    year_sold     n
##        <dbl> <int>
##  1      1988     9
##  2      1989     8
##  3      1990     7
##  4      1991    12
##  5      1992    11
##  6      1993    12
##  7      1994    21
##  8      1995    22
##  9      1996    17
## 10      1997    22
## # ℹ 23 more rows

Filter records

Based on the counts above, we have records for house sales from 1988 to 2020. Let’s filter the data to include only 2020 sales. We can also reduce the number of columns to include only floor_area and price.

house_prices_2020 <- 
  house_prices_cln |> 
  filter(year_sold == 2020) |> 
  select(floor_area, price)

Plot the data

Let’s create a scatter plot showing the relationship between house price and house size. For house price we’ll use the price column and for house size we’ll use floor_area. We will also overlay a fitted linear regression line on the chart.

# Plot the house prices and house size
scatter_price_size <-
  house_prices_2020 |>
    ggplot(
      aes(x = floor_area, y = price)
    ) +
    geom_point() +
    geom_smooth(method = "lm") # adds a linear regression line

scatter_price_size
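geom_smooth(method = "lm") draws the least-squares line but does not report its slope and intercept. A minimal sketch of obtaining them with lm() follows. To stay self-contained it uses a small made-up data frame (toy_prices and its values are illustrative assumptions, not rows from the dataset), constructed so that price is exactly 2000 * floor_area + 50000.

```r
# Hypothetical toy data lying exactly on the line price = 2000 * floor_area + 50000
toy_prices <- data.frame(
  floor_area = c(100, 150, 200, 250),
  price      = c(250000, 350000, 450000, 550000)
)

# Ordinary least-squares fit of price on floor_area
fit <- lm(price ~ floor_area, data = toy_prices)

# Named vector: "(Intercept)" is the intercept, "floor_area" is the slope
coef(fit)
```

On the notebook's data the equivalent call would be lm(price ~ floor_area, data = house_prices_2020).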

Fit a linear regression model

For this exercise we use the simple dataset house_prices_2020 as our training set. We can rename the variables as x_train for the feature variable and y_train for the output variable.

house_prices_2020_lm <- 
  house_prices_2020 |> 
  rename(x_train = floor_area, y_train = price)

head(house_prices_2020_lm)
## # A tibble: 6 × 2
##   x_train y_train
##     <dbl>   <dbl>
## 1     168  565000
## 2     225  570000
## 3     127  315000
## 4     152  482000
## 5     130  360000
## 6     163  420000

We can also store x_train and y_train in separate vectors.

x_train <- house_prices_2020_lm$x_train
y_train <- house_prices_2020_lm$y_train

Number of training examples \(m\)

We use \(m\) to denote the number of training examples.

m <- length(x_train)

paste0("Number of training examples is: ", m)
## [1] "Number of training examples is: 5261"

Model function

The model function for linear regression is represented as \(f_{w,b}(x)=wx+b\).

Different values of \(w\) and \(b\) give you different straight lines on the plot. Let’s try to get a better intuition for this by starting with \(w=100\) and \(b=100\).

w <- 100
b <- 100

paste0("w: ", w)
## [1] "w: 100"
paste0("b: ", b)
## [1] "b: 100"

Now, let’s compute the value of \(f_{w,b}(x^{(i)})\) for each data point. You can write this out explicitly for each data point:

for \(x^{(1)}\), f_wb = w * x_train[1] + b

for \(x^{(2)}\), f_wb = w * x_train[2] + b

and so on.

For a large number of data points, this can get unwieldy and repetitive. So instead, you can calculate the output in a compute_model_output function.

compute_model_output <- function(df, w, b) {
  # Computes the predictions of a linear model
  #  Args:
  #    df (data frame): training data with a numeric column x_train, m rows
  #    w, b (scalar)  : model parameters
  #  Returns:
  #    df with an added column f_wb (the model prediction for each row)
  df <- 
    df |> 
    mutate(f_wb = w * x_train + b)
  
  return(df)
}
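Because arithmetic in R is vectorized, the same calculation also works directly on a numeric vector such as x_train, with no explicit loop. The variant below is a sketch under that assumption. The name compute_model_output_vec is hypothetical, and x_head holds the six x_train values shown in the head() output earlier.

```r
# Hypothetical vector-based variant of compute_model_output
compute_model_output_vec <- function(x, w, b) {
  # w * x + b is evaluated element-wise, giving one prediction per element of x
  w * x + b
}

# The six x_train values from the head() output shown earlier
x_head <- c(168, 225, 127, 152, 130, 163)

f_wb_head <- compute_model_output_vec(x_head, w = 100, b = 100)
f_wb_head
# → 16900 22600 12800 15300 13100 16400
```

The data frame version above is convenient for plotting with ggplot2, while a vector version like this one is handy for quick checks.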

Now let’s call the compute_model_output function and plot the output.

predictions <- 
  compute_model_output(house_prices_2020_lm, w, b)

predictions |> 
  ggplot() +
  geom_point(
    aes(x = x_train, y = y_train, colour = "Actual values")
  ) +
  # f_wb already lies on a straight line, so it can be drawn directly
  geom_line(
    aes(x = x_train, y = f_wb, colour = "Our prediction")
  ) +
  xlab("Size (m2)") +
  ylab("Price (AUD)")

As you can see, setting \(w=100\) and \(b=100\) does not result in a line that fits our data. Use the chunk below to try different values for \(b\) and \(w\).

w <- 3500
b <- 0

predictions <- 
  compute_model_output(house_prices_2020_lm, w, b)

predictions |> 
  ggplot() +
  geom_point(
    aes(x = x_train, y = y_train, colour = "Actual values")
  ) +
  # f_wb already lies on a straight line, so it can be drawn directly
  geom_line(
    aes(x = x_train, y = f_wb, colour = "Our prediction")
  ) +
  xlab("Size (m2)") +
  ylab("Price (AUD)")
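To put a number on how well a given pair of parameters fits, one simple option (not part of the walkthrough itself) is the mean absolute error between the predictions and the actual prices. The sketch below assumes a hypothetical helper, mean_abs_error, and uses only the six rows shown in the head() output earlier, so the figures are illustrative and not results for the full training set.

```r
# The six rows from the head() output shown earlier
x_head <- c(168, 225, 127, 152, 130, 163)
y_head <- c(565000, 570000, 315000, 482000, 360000, 420000)

# Hypothetical helper: average absolute gap between prediction and actual price
mean_abs_error <- function(x, y, w, b) {
  mean(abs((w * x + b) - y))
}

mae_first_guess  <- mean_abs_error(x_head, y_head, w = 100,  b = 100)
mae_second_guess <- mean_abs_error(x_head, y_head, w = 3500, b = 0)

# A smaller value means the line sits closer to these six points
c(first_guess = mae_first_guess, second_guess = mae_second_guess)
```

On these six rows the second pair of parameters gives the smaller error, which agrees with what the plots suggest.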

Now we can use the model with these parameters (\(w=3500\), \(b=0\)) to predict the price of a house with a floor area of 300 m2.

x_i <- 300
cost_300sqm <- w * x_i + b

paste0(cost_300sqm, " dollars")
## [1] "1050000 dollars"
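The same expression predicts prices for several floor areas at once, since w * x + b is evaluated element-wise. A small sketch, where the sizes vector is an illustrative assumption and w and b repeat the values chosen above:

```r
# Parameters chosen above
w <- 3500
b <- 0

# Hypothetical floor areas (m2) to predict for
sizes <- c(100, 200, 300)

# One prediction per floor area
predicted <- w * sizes + b

# format() adds thousands separators; trim = TRUE drops the padding spaces
labels <- paste0(
  format(predicted, big.mark = ",", scientific = FALSE, trim = TRUE),
  " dollars"
)
labels
```

The last element corresponds to the 300 m2 house from the example above.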