#|echo: false
library(tidyverse)
library(MASS)
options(rgl.useNULL = TRUE)
library(rgl)datamining
Intro to DS week 7 practical: data mining
In this practical, I’ll cover:
Linear regression prediction models for both single- and multivariable data
Logistical regression for classification
Linear separability of datasets
A single-variable linear model
We’ll be using the ‘hills’ data set in the MASS package with info on hill climb times and horizontal/vertical distance.
Loading in training and test data and EDA
The hills data set has 35 rows. We’ll use 30 for training and 5 for testing, splitting the data accordingly:
hills_train <- hills[1:30,]
hills_test <- hills[31:35,]Let’s do some basic plotting of time vs the other two variables + correlation analysis on the training data:
plot1 <- hills_train |>
ggplot(aes(x = dist, y= time)) +
geom_point()
plot2 <- hills_train |>
ggplot(aes(x = climb, y= time)) +
geom_point()
plot1plot2cor.test(hills_train$dist, hills_train$time)
Pearson's product-moment correlation
data: hills_train$dist and hills_train$time
t = 10.354, df = 28, p-value = 4.438e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7806143 0.9469437
sample estimates:
cor
0.8904587
cor.test(hills_train$climb, hills_train$time)
Pearson's product-moment correlation
data: hills_train$climb and hills_train$time
t = 5.7047, df = 28, p-value = 4.062e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5067420 0.8649637
sample estimates:
cor
0.7331625
From this we can see that distance is better correlated with time than climb is (not surprising).
Fitting the linear model
Let’s try a single variable model to predict time based on climb using the ‘lm’ function:
linear_model <- hills_train |>
lm(formula=time~climb)
summary(linear_model)
Call:
lm(formula = time ~ climb, data = hills_train)
Residuals:
Min 1Q Median 3Q Max
-35.258 -18.227 -5.340 3.839 128.837
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.368522 8.660145 1.659 0.108
climb 0.023553 0.004129 5.705 4.06e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.12 on 28 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.521
F-statistic: 32.54 on 1 and 28 DF, p-value: 4.062e-06
Let’s plot this (you can use either geom_smooth to plot the line, or extract the linear coefficients manually - the geom_smooth line is blue, the manual one is black and they are identical):
coefs_lin <- coef(linear_model)
plotlin <- hills_train |>
ggplot(aes(x = climb, y = time)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_abline(mapping = aes(
slope = coefs_lin['climb'],
intercept = coefs_lin['(Intercept)']
))
plotlinLet’s graph the residuals to examine further where the model is failing by calculating where the model predicts each point in the data should be vs where it actually is - by adding a column each for the predicated values and the residuals to the original training set:
hills_residuals <- hills_train |>
mutate(predicted = predict(linear_model)) |>
mutate(residuals = residuals(linear_model))Now let’s graph them:
hills_residuals |>
ggplot(aes(x=climb, y = time)) +
geom_point(color = 'chartreuse', size = 3) +
geom_point(color = 'grey', size = 2, aes(y = predicted)) +
geom_smooth(method = 'lm', se = FALSE, color = 'black', linewidth = 0.5) +
geom_segment(aes(xend = climb, yend = predicted, color = 'pink'))Or using base R:
plot(linear_model)Predictions using the model
Now we can use the predict function from before, but on the test rather than the training data with the ‘newdata’ argument in predict and then graphing as before:
hills_test_residuals <- hills_test |>
mutate(predicted = predict(linear_model, newdata=hills_test)) |>
mutate(residuals = predicted - time)
hills_test_residuals |>
ggplot(aes(x=climb, y = time)) +
geom_point(color = 'chartreuse', size = 3) +
geom_point(color = 'grey', size = 2, aes(y = predicted)) +
geom_abline(mapping = aes(
slope = coefs_lin['climb'],
intercept = coefs_lin['(Intercept)']
)) +
geom_segment(aes(xend = climb, yend = predicted, color = 'pink'))And then calculating the SSE, i.e. the sum of the residuals squared
sse_climb <- sum(hills_test_residuals$residuals**2)Multi-variable model
What if we wanted to account for both distance AND time? The less colinear they are (i.e. the more scattered) then the better results we’ll get by combining them.
We can build another linear model using lm, but we just specify more variables:
multi_linear_model <- lm(
formula = time ~ climb + dist,
data = hills_train
)Now let’s use THIS model on the test data:
hills_test_multi_residuals <- hills_test |>
mutate(predictions = predict(multi_linear_model, newdata = hills_test)) |>
mutate(residuals = (predictions - time))
layout(matrix(1:2, ncol = 2))
plot(multi_linear_model, which = 1)
plot(linear_model, which = 1)And calculate the SSE again:
SSE_multi <- sum(hills_test_multi_residuals$residuals**2)
SSE_multi[1] 608.8576
This is better!
Logistic regression
EDA
We’re going to classify flowers in the iris dataset, so let’s split it into training and test as before:
iris <- iris[sample(1:nrow(iris)),] #shuffle the data
train_prop = 0.7
iris_train <- iris[1:(train_prop*nrow(iris)),]
iris_test <- iris[(train_prop*nrow(iris)+1):nrow(iris),]Let’s graph the iris data:
iris_train |>
ggplot(aes(x=Sepal.Length,y=Sepal.Width, color = Species)) +
geom_point() +
scale_color_viridis_d()We can see lots of overlap between the points - particularly ‘versicolor’ and others. This means they are unlikely to be linearly separable - in other words, a logistic regression on sepal length/width is not going to be very good at classifying things into ‘versicolor’ and ‘not virginica’. We should try to find combinations of variables that are better at this - perhaps petal length and width?
Binomial logistic regression
Let’s start with a simple binary logistic regression - i.e. virginica or not. Let’s add a column to the training (and test) sets accordingly to indicate this:
iris_train <- iris_train |>
mutate(binary_species = (Species == 'versicolor')*1)
iris_test <- iris_test |>
mutate(binary_species = (Species == 'versicolor')*1)Now we use the glm function to build a logistic model from the training data using all four width and length variables:
iris_binary_model <- glm(
formula = binary_species ~ Petal.Width + Petal.Length + Sepal.Length + Sepal.Width,
family = binomial(link = 'logit'),
data = iris_train
)And use the model to predict species for the test data, and round them to either 0 or 1:
binary_probs <- predict(
iris_binary_model,
newdata = iris_test,
type = 'response'
)
binary_predictions <- ifelse(binary_probs >0.5, 1, 0)And finally calculate how good the model is:
binary_error <- mean(binary_predictions != iris_test$binary_species)
binary_accuracy <- 1-binary_errorSo our accuracy is 0.7777778. That’s pretty good!