datamining

Intro to DS week 7 practical: data mining

In this practical, I’ll cover:

Linear regression prediction models for both single- and multivariable data
Logistical regression for classification
Linear separability of datasets

#|echo: false
library(tidyverse)
library(MASS)
options(rgl.useNULL = TRUE)
library(rgl)

A single-variable linear model

We’ll be using the ‘hills’ data set in the MASS package with info on hill climb times and horizontal/vertical distance.

Loading in training and test data and EDA

The hills data set has 35 rows. We’ll use 30 for training and 5 for testing, splitting the data accordingly:

hills_train <- hills[1:30,]
hills_test <- hills[31:35,]

Let’s do some basic plotting of time vs the other two variables + correlation analysis on the training data:

plot1 <- hills_train |>
  ggplot(aes(x = dist, y= time)) +
  geom_point()
plot2 <- hills_train |>
  ggplot(aes(x = climb, y= time)) +
  geom_point()
plot1

plot2

cor.test(hills_train$dist, hills_train$time)


    Pearson's product-moment correlation

data:  hills_train$dist and hills_train$time
t = 10.354, df = 28, p-value = 4.438e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7806143 0.9469437
sample estimates:
      cor 
0.8904587

cor.test(hills_train$climb, hills_train$time)


    Pearson's product-moment correlation

data:  hills_train$climb and hills_train$time
t = 5.7047, df = 28, p-value = 4.062e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5067420 0.8649637
sample estimates:
      cor 
0.7331625

From this we can see that distance is better correlated with time than climb is (not surprising).

Fitting the linear model

Let’s try a single variable model to predict time based on climb using the ‘lm’ function:

linear_model <- hills_train |>
  lm(formula=time~climb)

summary(linear_model)


Call:
lm(formula = time ~ climb, data = hills_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.258 -18.227  -5.340   3.839 128.837 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.368522   8.660145   1.659    0.108    
climb        0.023553   0.004129   5.705 4.06e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.12 on 28 degrees of freedom
Multiple R-squared:  0.5375,    Adjusted R-squared:  0.521 
F-statistic: 32.54 on 1 and 28 DF,  p-value: 4.062e-06

Let’s plot this (you can use either geom_smooth to plot the line, or extract the linear coefficients manually - the geom_smooth line is blue, the manual one is black and they are identical):

coefs_lin <- coef(linear_model)

plotlin <- hills_train |>
  ggplot(aes(x = climb, y = time)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(mapping = aes(
    slope = coefs_lin['climb'],
    intercept = coefs_lin['(Intercept)']
  ))

plotlin

Let’s graph the residuals to examine further where the model is failing by calculating where the model predicts each point in the data should be vs where it actually is - by adding a column each for the predicated values and the residuals to the original training set:

hills_residuals <- hills_train |>
  mutate(predicted = predict(linear_model)) |>
  mutate(residuals = residuals(linear_model))

Now let’s graph them:

hills_residuals |>
  ggplot(aes(x=climb, y = time)) +
  geom_point(color = 'chartreuse', size = 3) +
  geom_point(color = 'grey', size = 2, aes(y = predicted)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'black', linewidth = 0.5) +
  geom_segment(aes(xend = climb, yend = predicted, color = 'pink'))

Or using base R:

plot(linear_model)

Predictions using the model

Now we can use the predict function from before, but on the test rather than the training data with the ‘newdata’ argument in predict and then graphing as before:

hills_test_residuals <- hills_test |>
  mutate(predicted = predict(linear_model, newdata=hills_test)) |>
  mutate(residuals = predicted - time)

hills_test_residuals |>
  ggplot(aes(x=climb, y = time)) +
  geom_point(color = 'chartreuse', size = 3) +
  geom_point(color = 'grey', size = 2, aes(y = predicted)) +
  geom_abline(mapping = aes(
    slope = coefs_lin['climb'],
    intercept = coefs_lin['(Intercept)']
  )) +
  geom_segment(aes(xend = climb, yend = predicted, color = 'pink'))

And then calculating the SSE, i.e. the sum of the residuals squared

sse_climb <- sum(hills_test_residuals$residuals**2)

Multi-variable model

What if we wanted to account for both distance AND time? The less colinear they are (i.e. the more scattered) then the better results we’ll get by combining them.

We can build another linear model using lm, but we just specify more variables:

multi_linear_model <- lm(
  formula = time ~ climb + dist,
  data = hills_train
)

Now let’s use THIS model on the test data:

hills_test_multi_residuals <- hills_test |>
  mutate(predictions = predict(multi_linear_model, newdata = hills_test)) |>
  mutate(residuals = (predictions - time))

layout(matrix(1:2, ncol = 2))
plot(multi_linear_model, which = 1)
plot(linear_model, which = 1)

And calculate the SSE again:

SSE_multi <- sum(hills_test_multi_residuals$residuals**2)
SSE_multi

[1] 608.8576

This is better!

Logistic regression

EDA

We’re going to classify flowers in the iris dataset, so let’s split it into training and test as before:

iris <- iris[sample(1:nrow(iris)),] #shuffle the data
train_prop = 0.7
iris_train <- iris[1:(train_prop*nrow(iris)),]
iris_test <- iris[(train_prop*nrow(iris)+1):nrow(iris),]

Let’s graph the iris data:

iris_train |>
  ggplot(aes(x=Sepal.Length,y=Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d()

We can see lots of overlap between the points - particularly ‘versicolor’ and others. This means they are unlikely to be linearly separable - in other words, a logistic regression on sepal length/width is not going to be very good at classifying things into ‘versicolor’ and ‘not virginica’. We should try to find combinations of variables that are better at this - perhaps petal length and width?

Binomial logistic regression

Let’s start with a simple binary logistic regression - i.e. virginica or not. Let’s add a column to the training (and test) sets accordingly to indicate this:

iris_train <- iris_train |>
  mutate(binary_species = (Species == 'versicolor')*1)

iris_test <- iris_test |>
  mutate(binary_species = (Species == 'versicolor')*1)

Now we use the glm function to build a logistic model from the training data using all four width and length variables:

iris_binary_model <- glm(
  formula = binary_species ~ Petal.Width + Petal.Length + Sepal.Length + Sepal.Width,
  family = binomial(link = 'logit'),
  data = iris_train
)

And use the model to predict species for the test data, and round them to either 0 or 1:

binary_probs <- predict(
  iris_binary_model,
  newdata = iris_test,
  type = 'response'
)

binary_predictions <- ifelse(binary_probs >0.5, 1, 0)

And finally calculate how good the model is:

binary_error <- mean(binary_predictions != iris_test$binary_species)
binary_accuracy <- 1-binary_error

So our accuracy is 0.7777778. That’s pretty good!