Machine learning is a science that enables computer applications to learn without being explicitly programmed. In principle, machine learning develops models or algorithms that can predict an output value with an acceptable error margin, based on a set of known input values. To build these models, advanced statistical analysis techniques are employed. The machine learning process has three phases. The first phase is fitting a model on a data set that contains information about both independent and dependent variables. This data set is usually called the training set, and this phase of the process is called the training phase. The training set is a part of the total data set. During the training phase, we attempt to detect the relationship between the output (response) variable and the input variables (predictors). This relationship is a function of the form: \[Y=f(x_{1},x_{2},\dots,x_{p}) + \varepsilon \]
After training the model, we apply it to another data set, called the test set or validation set. The test set also contains information about both independent and dependent variables. In the validation phase, several alternative models can be compared, to establish which of them has the smallest prediction error on the test set. Moreover, in this phase we can fine-tune (adjust) some model parameters through cross-validation. (Cross-validation is out of scope for this project.)
Finally, the validated model is applied to new data. The new data contain only the input variables, not the output (dependent) variable. In this phase we actually use our model to predict the unknown output values. To assess the quality of our model, we have to wait until the output values become known. As soon as this happens, we can add the new data to our training set and re-train our model. So machine learning is an iterative process.
Machine learning techniques can be divided into two main groups: supervised learning and unsupervised learning.
The OLS technique falls under supervised learning. Let us delve further into OLS regression to understand how supervised learning works.
OLS regression studies the relationship between one numeric dependent variable and one or more independent variables. It belongs to the category of supervised prediction techniques with a numeric explained (response) variable. The model has the form:
\[y_{i} = b_{0} + b_{1}x_{1}+ b_{2}x_{2}+ \dots + b_{k}x_{k}+ \varepsilon\]
Let us load carsales.csv (which can be found in the GitHub project folder, https://github.com/prateekdaniels/STDS-Vignette):
cars <- read.csv("carsales.csv")
View(cars)
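Before modelling, it can help to inspect the structure of the data. A minimal sketch, assuming the file has loaded correctly and contains the columns used below (price, engine, horse, weight):
str(cars)    # column names, types and number of observations
head(cars)   # first six rows of the data frame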
Here, the dependent variable is the car price (price) and the independent variables are the engine size (engine), horsepower (horse) and curb weight (weight).
We shall use the lm() function from the stats package, and then use the summary() function to inspect the residuals and coefficients.
In the lm() function, the dependent variable (price) is specified before the tilde (~) sign, followed by the independent variables.
fit <- lm(price~engine+horse+weight, data = cars)
summary(fit)
##
## Call:
## lm(formula = price ~ engine + horse + weight, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.608  -3.634  -1.153   2.803  34.404
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.893122   3.579521  -5.557 1.21e-07 ***
## engine       -5.634731   1.266796  -4.448 1.67e-05 ***
## horse         0.271187   0.019289  14.059  < 2e-16 ***
## weight        0.004195   0.001442   2.910  0.00416 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.395 on 151 degrees of freedom
## Multiple R-squared:  0.7397, Adjusted R-squared:  0.7345
## F-statistic:   143 on 3 and 151 DF,  p-value: < 2.2e-16
From the output of the summary() function, we have a multiple R-squared of 0.7397, which says that about 74% of the variation in car price is explained by the predictors (independent variables). The F-statistic shows that the model is statistically significant, because its p-value (< 2.2e-16) is well below 5%.
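As a side note, these quantities can also be extracted from the fitted model object programmatically, using standard accessors from base R; a short sketch:
coef(fit)                   # estimated regression coefficients
confint(fit)                # confidence intervals for the coefficients
summary(fit)$r.squared      # multiple R-squared (0.7397 in the output above)
summary(fit)$adj.r.squared  # adjusted R-squared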
Next, we use the predict() function to predict values of the estimated variable for price, i.e. \(\hat{y}\), and store the results in the pred variable.
pred <- predict(fit)
Now, we compute the Mean Squared Error (MSE), which can be calculated manually as follows (the MSE formula itself is given at the end of this section):
mse <- mean((cars$price - pred)^2)
mse
## [1] 53.27727
Thus, the MSE of the model on the full data set is 53.27727.
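As a sanity check, the same MSE can be obtained from the residuals of the fitted model, since the residuals are exactly \(y_{i}-\hat{y}_{i}\); a minimal sketch:
mean(residuals(fit)^2)   # MSE from the model residuals; should match the value above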
Now, we randomly split the cars data frame into a training set and a test set, each containing about 50% of the data. We are interested in whether the model fitted on the training set performs well on the test set too.
Since we have 155 observations, we extract 77 observations at random (77 is approximately half of 155) using the sample() function. The n vector below will provide the row indices for the training set.
n <- sample(155, 77)
n
## [1] 22 67 32 139 52 141 131 147 12 7 38 123 27 135 99 57 111
## [18] 113 84 103 108 58 77 51 73 34 78 61 96 118 68 124 119 106
## [35] 146 35 44 41 29 125 104 98 3 30 85 153 19 11 46 1 65
## [52] 20 62 129 115 71 155 132 75 72 142 36 64 15 37 17 136 122
## [69] 24 66 92 88 63 43 50 83 18
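Note that sample() draws a different random subset each time it is called, so the indices above will not be reproduced exactly on another run. If a reproducible split is wanted, a seed can be set first; a minimal sketch (the seed value 123 is arbitrary):
set.seed(123)          # fix the random number generator so the split is repeatable
n <- sample(155, 77)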
Now, we split the 155 observations into the training set and the test set as described above. First, assign the 77 randomly selected observations to cars_train, which will be our training set.
cars_train <- cars[n,]
View(cars_train)
Now, after creating the training set, the remaining 78 observations need to be assigned to the test set, cars_test.
cars_test <- cars[-n,]
View(cars_test)
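A quick sanity check confirms that the split covers all 155 observations:
nrow(cars_train)   # expected: 77
nrow(cars_test)    # expected: 78 (the remaining observations)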
Now we fit a model on the training set (cars_train) using the lm() function, use the predict() function to obtain the estimated car prices for the training set, and compute the MSE of the training set:
fittrain <- lm(price~engine+horse+weight, data = cars_train)
predtrain <- predict(fittrain)
train_mse <- mean((cars_train$price - predtrain)^2)
train_mse
## [1] 57.53817
Similarly, we fit a model on the test set (cars_test) using the lm() function, use the predict() function to obtain the estimated car prices for the test set, and compute the MSE of the test set:
fittest <- lm(price~engine+horse+weight, data = cars_test)
predtest <- predict(fittest)
test_mse <- mean((cars_test$price - predtest)^2)
test_mse
## [1] 47.62903
In the run shown above, we obtained an MSE of 57.54 for train_mse and 47.63 for test_mse. Note that the MSE values for the training (train_mse) and test (test_mse) sets will vary from run to run, because each time the sample() function is used to split the parent set into training and test sets, the two sets contain a different selection of observations.
Since the MSE values for train_mse and test_mse are reasonably close to each other, we can conclude that our model generalises well.
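As an aside, the code above fits a separate model on the test set. Another common way to check how the training-set model performs on unseen data is to use that model directly to predict the test-set prices. A minimal sketch (the object names pred_holdout and holdout_mse are illustrative, and the resulting MSE will differ from the values shown above):
# Predict test-set prices with the model fitted on the training set
pred_holdout <- predict(fittrain, newdata = cars_test)
holdout_mse  <- mean((cars_test$price - pred_holdout)^2)
holdout_mse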
Thus, the above example has demonstrated how to use the OLS technique within a machine learning workflow.
The practical procedure for performing the OLS analysis relies on the following quantities.
The residual for observation \(i\): \[y_{i}-\hat{y}_{i}\] where,
\(y_{i}\) is the actual value of the response variable and
\(\hat{y}_{i}\) is the estimated (predicted) value of the response variable
The Mean Squared Error: \[MSE=\frac{ \sum_{i=1}^{N}\left( {y_{i}-\hat{y}_{i}} \right)^{2}}{N}\] where,
\(y_{i}\) is the actual value of the response variable,
\(\hat{y}_{i}\) is the estimated value of the response variable, and
\(N\) is the sample size
The coefficient of determination: \[R^2 = 1- \frac{ \sum_{i=1}^{N}\left( {y_{i}-\hat{y}_{i}} \right)^2}{ \sum_{i=1}^{N}\left( {y_{i}-\bar{y}} \right)^2 }\]
where,
\(y_{i}\) is the actual value of the response variable,
\(\hat{y}_{i}\) is the estimated value of the response variable,
\(\bar{y}\) is the average of the output values, and
\(N\) is the sample size
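The R-squared formula above can also be verified manually in R against the value reported by summary(). A minimal sketch for the full-data model fitted earlier:
# R-squared computed from its definition; should match summary(fit)$r.squared
y    <- cars$price
yhat <- predict(fit)
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)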