The purpose of this notebook is to explore linear regression techniques from the Coursera specialization in Machine Learning.
First, we load the required packages.
# Load the required packages
library(tidyverse)
library(here)
library(janitor)
The data we are using for this exploration is the Perth House Prices data as found on Kaggle.
# Load the data
house_prices <-
read_csv(here("data/src/all_perth_310121.csv"))
# View the head
head(house_prices)
## # A tibble: 6 × 19
## ADDRESS SUBURB PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA FLOOR_AREA
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 Acorn Place South… 565000 4 2 2 600 160
## 2 1 Addis Way Wandi 365000 3 2 2 351 139
## 3 1 Ainsley Court Camil… 287000 3 1 1 719 86
## 4 1 Albert Street Belle… 255000 2 1 2 651 59
## 5 1 Aman Place Lockr… 325000 4 1 2 466 131
## 6 1 Amethyst Cresc… Mount… 409000 4 2 1 759 118
## # ℹ 11 more variables: BUILD_YEAR <chr>, CBD_DIST <dbl>, NEAREST_STN <chr>,
## # NEAREST_STN_DIST <dbl>, DATE_SOLD <chr>, POSTCODE <dbl>, LATITUDE <dbl>,
## # LONGITUDE <dbl>, NEAREST_SCH <chr>, NEAREST_SCH_DIST <dbl>,
## # NEAREST_SCH_RANK <dbl>
We will perform the following transformations to produce a clean dataset to work with for this project:
- Clean the column names with the janitor::clean_names() function
- Convert GARAGE and BUILD_YEAR from character to numeric
- Convert POSTCODE from numeric to character
- Split DATE_SOLD into separate columns for year and month
We convert the column names to lowercase to make them easier to work with in code.
house_prices_cln <-
house_prices |>
clean_names()
names(house_prices_cln)
## [1] "address" "suburb" "price" "bedrooms"
## [5] "bathrooms" "garage" "land_area" "floor_area"
## [9] "build_year" "cbd_dist" "nearest_stn" "nearest_stn_dist"
## [13] "date_sold" "postcode" "latitude" "longitude"
## [17] "nearest_sch" "nearest_sch_dist" "nearest_sch_rank"
We correct the data types for some of the fields. Before doing this we check the contents of each column we want to convert.
garage
First we check if it is appropriate to convert this column to numeric.
house_prices_cln |>
count(garage)
## # A tibble: 26 × 2
## garage n
## <chr> <int>
## 1 1 5290
## 2 10 26
## 3 11 7
## 4 12 30
## 5 13 8
## 6 14 13
## 7 16 4
## 8 17 1
## 9 18 3
## 10 2 20724
## # ℹ 16 more rows
All values are legitimate numbers or the string NULL, so we can
convert this column to numeric. The NULL strings cannot be parsed as
numbers, so as.numeric() converts them to NA and raises a coercion warning.
house_prices_cln <-
house_prices_cln |>
mutate(garage = as.numeric(garage))
house_prices_cln |>
count(garage)
## # A tibble: 26 × 2
## garage n
## <dbl> <int>
## 1 1 5290
## 2 2 20724
## 3 3 2042
## 4 4 1949
## 5 5 362
## 6 6 466
## 7 7 97
## 8 8 129
## 9 9 17
## 10 10 26
## # ℹ 16 more rows
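The coercion used for garage can be tried on a small vector first. This is a minimal sketch with made-up values (garage_raw is a hypothetical stand-in for the real column), wrapping the call in suppressWarnings() only to keep the output quiet:

```r
# Hypothetical stand-in for the character garage column
garage_raw <- c("1", "2", "NULL", "10")

# Strings that do not parse as numbers become NA;
# suppressWarnings() hides the "NAs introduced by coercion" warning
garage_num <- suppressWarnings(as.numeric(garage_raw))

garage_num
sum(is.na(garage_num)) # number of values that became NA
```

Counting the NA values before and after the conversion is a cheap way to confirm that only the NULL strings were affected.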
build_year
First we check if it is appropriate to convert this column to numeric.
house_prices_cln |>
count(build_year)
## # A tibble: 125 × 2
## build_year n
## <chr> <int>
## 1 1868 1
## 2 1870 1
## 3 1880 2
## 4 1886 1
## 5 1889 1
## 6 1890 6
## 7 1892 1
## 8 1894 1
## 9 1895 1
## 10 1897 3
## # ℹ 115 more rows
All values are legitimate numbers or the string NULL, so we can
convert this column to numeric. As with garage, the NULL strings will be
converted to NA.
house_prices_cln <-
house_prices_cln |>
mutate(build_year = as.numeric(build_year))
house_prices_cln |>
count(build_year)
## # A tibble: 125 × 2
## build_year n
## <dbl> <int>
## 1 1868 1
## 2 1870 1
## 3 1880 2
## 4 1886 1
## 5 1889 1
## 6 1890 6
## 7 1892 1
## 8 1894 1
## 9 1895 1
## 10 1897 3
## # ℹ 115 more rows
postcode
We convert this column from numeric to character.
house_prices_cln <-
house_prices_cln |>
mutate(postcode = as.character(postcode))
date_sold
In this case we need to split the date_sold column into
separate columns for month and year. First we check if all values are in
the same format (mm-yyyy).
house_prices_cln |>
mutate(
date_sold_correct_format = str_detect(
date_sold, "\\d{2}-\\d{4}"
)
) |>
count(date_sold_correct_format)
## # A tibble: 1 × 2
## date_sold_correct_format n
## <lgl> <int>
## 1 TRUE 33656
All values match the desired format so we can now separate the month and year.
house_prices_cln <-
house_prices_cln |>
mutate(
month_sold = as.numeric(str_sub(date_sold, start = 1, end = 2)),
year_sold = as.numeric(str_sub(date_sold, start = -4, end = -1))
)
house_prices_cln |>
count(year_sold)
## # A tibble: 33 × 2
## year_sold n
## <dbl> <int>
## 1 1988 9
## 2 1989 8
## 3 1990 7
## 4 1991 12
## 5 1992 11
## 6 1993 12
## 7 1994 21
## 8 1995 22
## 9 1996 17
## 10 1997 22
## # ℹ 23 more rows
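The format check and the split above can be tried on a toy vector. This sketch uses made-up dates and, as an assumption beyond the original check, anchors the pattern with ^ and $ so that a value merely containing an mm-yyyy substring is rejected:

```r
library(stringr)

# Hypothetical sample values; the second is malformed on purpose
dates <- c("09-2018", "1-02-2020")

# The unanchored pattern accepts both, because "02-2020" occurs inside the second value
str_detect(dates, "\\d{2}-\\d{4}")

# The anchored pattern requires the whole string to be mm-yyyy
ok <- str_detect(dates, "^\\d{2}-\\d{4}$")

# Split the well-formed values into numeric month and year
month_sold <- as.numeric(str_sub(dates[ok], start = 1, end = 2))
year_sold <- as.numeric(str_sub(dates[ok], start = -4, end = -1))
```

Whether the stricter pattern changes anything for the Perth data is not shown here. It only matters if some values carry extra characters around the date.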
Based on the counts above, we have records for house sales from 1988
to 2020. Let’s filter the data to include only 2020 sales. We can also
reduce the number of columns to include only floor_area and
price.
house_prices_2020 <-
house_prices_cln |>
filter(year_sold == 2020) |>
select(floor_area, price)
Let’s create a scatter plot showing the relationship between house
prices and house size. For house prices we’ll use the price
column and for house size we’ll use floor_area. We will
also fit a linear regression model to the chart.
# Plot the house prices and house size
scatter_price_size <-
house_prices_2020 |>
ggplot(
aes(x = floor_area, y = price)
) +
geom_point() +
geom_smooth(method = "lm") # adds a linear regression line
scatter_price_size
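geom_smooth(method = "lm") fits an ordinary least-squares line for the plot. A minimal sketch of getting the same kind of fit as numbers with base R's lm(), using a small made-up data frame (toy_houses and its values are hypothetical, not the Perth data):

```r
# Hypothetical toy data: price is exactly 3000 * floor_area + 50000
toy_houses <- data.frame(floor_area = c(100, 150, 200, 250))
toy_houses$price <- 3000 * toy_houses$floor_area + 50000

# Fit price as a linear function of floor_area by least squares
fit <- lm(price ~ floor_area, data = toy_houses)

# Named vector: "(Intercept)" is the intercept, "floor_area" is the slope
coef(fit)
```

The same call with data = house_prices_2020 should give the slope and intercept of the line drawn on the scatter plot.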
For this exercise we use the simple dataset
house_prices_2020 as our training set. We can
rename the variables as x_train for the feature variable
and y_train for the output variable.
house_prices_2020_lm <-
house_prices_2020 |>
rename(x_train = floor_area, y_train = price)
head(house_prices_2020_lm)
## # A tibble: 6 × 2
## x_train y_train
## <dbl> <dbl>
## 1 168 565000
## 2 225 570000
## 3 127 315000
## 4 152 482000
## 5 130 360000
## 6 163 420000
We can also store x_train and y_train in
separate vectors.
x_train <- house_prices_2020_lm$x_train
y_train <- house_prices_2020_lm$y_train
We use \(m\) to denote the number of training examples.
m <- length(x_train)
paste0("Number of training examples is: ", m)
## [1] "Number of training examples is: 5261"
The model function for linear regression is represented as \(f_{w,b}(x)=wx+b\).
Different values of \(w\) and \(b\) give you different straight lines on the plot. Let’s try to get a better intuition for this by starting with \(w=100\) and \(b=100\).
w <- 100
b <- 100
paste0("w: ", w)
## [1] "w: 100"
paste0("b: ", b)
## [1] "b: 100"
Now, let’s compute the value of \(f_{w,b}(x^{(i)})\) for each data point. You can write this out explicitly, one data point at a time:
for \(x^{(1)}\),
f_wb <- w * x_train[1] + b
for \(x^{(2)}\),
f_wb <- w * x_train[2] + b
and so on.
For a large number of data points, this can get unwieldy and
repetitive. So instead, you can calculate the output in a
compute_model_output function.
compute_model_output <- function(df, w, b) {
  # Computes the prediction of a linear model
  # Args:
  #   df (data frame): training data with a numeric column x_train (m rows)
  #   w, b (scalar): model parameters
  # Returns:
  #   df with an added column f_wb holding the model prediction for each row
  df <-
    df |>
    mutate(f_wb = w * x_train + b)
  return(df)
}
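Because R arithmetic is vectorised, the same prediction can also be computed directly on a numeric vector such as x_train. A minimal sketch, assuming a hypothetical helper named predict_linear that is not part of the original notebook:

```r
# Hypothetical helper: applies f_wb = w * x + b to every element of x
predict_linear <- function(x, w, b) {
  w * x + b
}

# Three illustrative floor areas (made-up values)
sizes <- c(100, 150, 300)
preds <- predict_linear(sizes, w = 100, b = 100)
preds
```

The data-frame version keeps predictions next to the training data for plotting, while a vector version like this one is convenient for quick calculations.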
Now let’s call the compute_model_output function and
plot the output.
predictions <-
compute_model_output(house_prices_2020_lm, w, b)
predictions |>
ggplot() +
geom_point(
aes(x = x_train, y = y_train, colour = "Actual values")
) +
geom_smooth(
aes(x = x_train, y = f_wb, colour = "Our prediction"),
method = "lm"
) +
xlab("Size (m2)") +
ylab("Price (AUD)")
As you can see, setting \(w=100\) and \(b=100\) does not result in a line that fits our data. Use the chunk below to try different values for \(b\) and \(w\).
w <- 3500
b <- 0
predictions <-
compute_model_output(house_prices_2020_lm, w, b)
predictions |>
ggplot() +
geom_point(
aes(x = x_train, y = y_train, colour = "Actual values")
) +
geom_smooth(
aes(x = x_train, y = f_wb, colour = "Our prediction"),
method = "lm"
) +
xlab("Size (m2)") +
ylab("Price (AUD)")
Now we can use the model with these parameters (\(w=3500\), \(b=0\)) to predict the price of a house with a floor area of 300 m2.
x_i <- 300
cost_300sqm <- w * x_i + b
paste0(cost_300sqm, " dollars")
## [1] "1050000 dollars"
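Written out with the values chosen above, the prediction is a single substitution into the model:

```latex
\[
f_{w,b}(300) = w \cdot 300 + b = 3500 \cdot 300 + 0 = 1\,050\,000
\]
```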