Investigating data:

library(e1071)
attach(cars); str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

We have only two variables here, speed and distance, so we can start working directly on the relationship between these two variables.

plot(speed, dist)

We can see a linear trend between the two variables: the stopping distance increases as the speed increases.
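
To put a number on that trend, we can compute the Pearson correlation between the two variables (a quick sketch using base R; for simple linear regression, the R-squared we obtain below is the square of this correlation, so we expect a value of about 0.81):

cor(speed, dist)  # Pearson correlation; should be about 0.81 here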

Let’s fit a linear model to these variables and check whether it fits well by looking at the R-squared value.

lm_cars <- lm(dist ~ speed) 
summary(lm_cars)$r.squared
## [1] 0.6510794

The value of 0.6511 is the fraction of variance explained by the model. The closer it is to 1.0, the better our model fits. Explaining 65% of the total variance is decent, but let’s admit it could be better for this example.
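
As a sanity check, here is a minimal sketch that recomputes R-squared from its definition, one minus the ratio of the residual sum of squares to the total sum of squares (base R only):

ss_res <- sum(lm_cars$residuals^2)    # residual sum of squares
ss_tot <- sum((dist - mean(dist))^2)  # total sum of squares
1 - ss_res / ss_tot                   # matches summary(lm_cars)$r.squared
## [1] 0.6510794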

Let’s see it on the plot:

plot(dist ~ speed); abline(lm_cars, col = "blue")

Indeed, we see that the actual data points vary about the line, and quite a lot. That is why our explained variance is 0.65: some data points carry high residual variance.
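
One common way to look at that residual variance directly is a residuals-versus-fitted plot; this is a sketch using base graphics, where we would want to see random scatter around zero with no funnel shape:

plot(lm_cars$fitted.values, lm_cars$residuals,
     xlab = "Fitted Values", ylab = "Residuals")  # look for random scatter
abline(h = 0, col = "blue")                       # zero-residual reference line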

To be sure that we have an adequate regression model, let’s run qqnorm on the residuals to see whether the theoretical quantiles of the normal distribution correspond to the residual quantiles.

res <- lm_cars$residuals
qqnorm(res, ylab = "Residual Quantiles"); qqline(res, col = "blue")

The points lie reasonably close to the line, so the residuals look approximately normal. Yes, we are good.
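
Since the model looks adequate, we can also use it for prediction. As an illustration only, here is a sketch that predicts the stopping distance for a hypothetical speed of 21 (the value is made up for the example):

predict(lm_cars, newdata = data.frame(speed = 21), interval = "prediction")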

This is the end of the first part on Machine Learning: linear regression.

Thanks for reading!