What Is Machine Learning?


Machine learning is a science that enables computer applications to learn without being explicitly programmed. In essence, machine learning develops models or algorithms that can predict an output value with an acceptable error margin, based on a set of known input values. To build these models, advanced statistical analysis techniques are employed. The machine learning process has three phases. The first phase is fitting a model on a data set that contains information about both the independent and the dependent variables. This data set, a part of the total data set, is usually called the training set, and this phase is called the training phase. During the training phase, we attempt to detect the relationship between the output (response) variable and the input variables (predictors). This relationship is a function of the form: \[Y = f(x_{1}, x_{2}, \ldots, x_{p}) + \varepsilon\]

After training the model, we evaluate it on another data set, called the test set or validation set. The test set also contains information about both the independent and the dependent variables. In the validation phase, several alternative models can be compared to establish which of them has the smallest prediction error on the test set. Moreover, in this phase we can fine-tune (adjust) some model parameters through cross-validation. (Cross-validation is out of scope for this project.)

Finally, the validated model is applied to new data. The new data contain only the input variables, not the output (dependent) variable. In this phase we actually use our model to predict the unknown output values. To assess the quality of our model, we have to wait until the output values become known. As soon as this happens, we can add the new data to our training set and re-train our model. Machine learning is therefore an iterative process.

Machine learning techniques can be divided into two main groups:

  1. Supervised Learning Techniques
  2. Unsupervised Learning Techniques

The OLS technique falls under supervised learning. Let us delve further into OLS regression to understand how supervised learning works.


Ordinary Least Squares Regression (OLS)


OLS regression studies the relationship between one numeric dependent variable and one or more independent variables. It belongs to the category of supervised prediction techniques with a numeric explained variable.

\[y_{i} = b_{0} + b_{1}x_{i1} + b_{2}x_{i2} + \ldots + b_{k}x_{ik} + \varepsilon_{i}\]

Let us load carsales.csv (which can be found in the GitHub project folder, https://github.com/prateekdaniels/STDS-Vignette):

cars <- read.csv("carsales.csv")

View(cars)
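Instead of (or in addition to) the spreadsheet-style viewer, a quick console inspection is often handy. A minimal sketch, assuming the file loads with the columns referenced below:

str(cars)   # compact overview of the columns and their types

head(cars)  # first few rows of the data frame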

Here, the dependent variable is the car price (price) and independent variables are the engine size (engine), horse power (horse) and curb weight (weight).

We shall use the lm() function from the stats package to fit the model, and the summary() function to examine the residuals and coefficients.

In the lm() function, the dependent variable (price) is specified before the tilde (~) sign, followed by the independent variables.

fit <- lm(price~engine+horse+weight, data = cars)

summary(fit)
## 
## Call:
## lm(formula = price ~ engine + horse + weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.608  -3.634  -1.153   2.803  34.404 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.893122   3.579521  -5.557 1.21e-07 ***
## engine       -5.634731   1.266796  -4.448 1.67e-05 ***
## horse         0.271187   0.019289  14.059  < 2e-16 ***
## weight        0.004195   0.001442   2.910  0.00416 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.395 on 151 degrees of freedom
## Multiple R-squared:  0.7397, Adjusted R-squared:  0.7345 
## F-statistic:   143 on 3 and 151 DF,  p-value: < 2.2e-16

From the output of the summary() function, the results show an R squared of 73.97%, which means that about 74% of the variation in car price is explained by the predictors (independent variables). The F-statistic shows that the model is statistically significant, because its p-value of less than 2.2e-16 is well below the 5% significance level.
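The same quantities can also be extracted programmatically, which is convenient when comparing several models later on. These are standard base R accessors:

summary(fit)$r.squared      # the multiple R-squared reported above (0.7397)

summary(fit)$adj.r.squared  # the adjusted R-squared

coef(fit)                   # the estimated coefficients, including the intercept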

Next, we use the predict() function to predict values for the estimated variable of price, i.e. \(\hat{y}\), and store the results in the pred variable:

pred <- predict(fit)

Now, we compute the Mean Squared Error (MSE), which can be calculated manually using the formula given in the Glossary:

mse <- mean((cars$price - pred)^2)

mse
## [1] 53.27727

Thus, the MSE on the full data set is 53.27727.
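Since predict(fit) on the original data returns the fitted values, the same MSE can also be obtained from the model residuals; a quick equivalent check:

mean(residuals(fit)^2)  # residuals are price minus the fitted values, so this equals mse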

Now, we randomly split the cars data frame into a training set and a test set, each containing about 50% of the data. We are interested in whether the model fitted on the training set performs well on the test set too.

Since we have 155 observations, we extract 77 observations at random (77 is approximately half of 155) using the sample() function. The n vector below will provide the row indices for the training set:

n <- sample(155, 77)
n
##  [1]  22  67  32 139  52 141 131 147  12   7  38 123  27 135  99  57 111
## [18] 113  84 103 108  58  77  51  73  34  78  61  96 118  68 124 119 106
## [35] 146  35  44  41  29 125 104  98   3  30  85 153  19  11  46   1  65
## [52]  20  62 129 115  71 155 132  75  72 142  36  64  15  37  17 136 122
## [69]  24  66  92  88  63  43  50  83  18
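Note that sample() draws a different random subset on every run. If you want the split (and all the results below) to be reproducible, you can fix the random seed before sampling; a minimal sketch, where the seed value 42 is arbitrary:

set.seed(42)  # any fixed integer makes the subsequent sample() call reproducible

n <- sample(155, 77)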

Now we need to split the 155 observations into the training set and the test set, as mentioned above. First, assign the 77 randomly selected observations to cars_train, which will be our training set:

cars_train <- cars[n,]

View(cars_train)

After creating the training set, the remaining 78 observations are assigned to the test set, cars_test:

cars_test <- cars[-n,]

View(cars_test)
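A quick sanity check confirms the sizes of the two sets:

nrow(cars_train)  # 77 randomly selected observations

nrow(cars_test)   # the remaining 78 observations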

Now we fit a model on the training set, cars_train, using the lm() function, use the predict() function to obtain the estimated car prices for the training set, and compute the MSE of the training set:

fittrain <- lm(price~engine+horse+weight, data = cars_train)

predtrain <- predict(fittrain)

train_mse <- mean((cars_train$price - predtrain)^2)

train_mse
## [1] 57.53817

Similarly, we compute the values for the test set, cars_test: we use the lm() and predict() functions to obtain the estimated car prices on the test set, and compute the MSE of the test set:

fittest <- lm(price~engine+horse+weight, data = cars_test)

predtest <- predict(fittest)

test_mse <- mean((cars_test$price - predtest)^2)

test_mse
## [1] 47.62903

We have received an MSE of 57.54 and 47.63 for train_mse and test_mse respectively. Note that the MSE values for the training (train_mse) and test (test_mse) sets will vary from run to run, because each time the sample() function is used to slice the parent set into training and test sets, the sets contain a different selection of observations.

Since the MSE values for train_mse and test_mse are reasonably close to each other, we can conclude that the model generalizes well from the training set to the test set.
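Note that fittest above is a separate model fitted on the test set itself. A more conventional held-out evaluation applies the training-set model to the test data instead; a minimal sketch (the resulting value will differ from test_mse above and will vary with the random split):

predholdout <- predict(fittrain, newdata = cars_test)  # training-set model, test-set inputs

holdout_mse <- mean((cars_test$price - predholdout)^2)

holdout_mse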

Thus, the above example demonstrated how to apply the OLS technique within a machine learning workflow.

The practical procedure for performing OLS in a machine learning setting is as follows (a sketch in R follows the list):

  1. Develop several alternative models in the training set
  2. Estimate the prediction accuracy of each model in the test set
  3. Choose the model that has the best accuracy in the test set
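A minimal sketch of these three steps, assuming the cars_train / cars_test split from above and using a second, purely illustrative candidate model with fewer predictors:

fit1 <- lm(price~engine+horse+weight, data = cars_train)

fit2 <- lm(price~horse+weight, data = cars_train)  # hypothetical simpler alternative

mse1 <- mean((cars_test$price - predict(fit1, newdata = cars_test))^2)

mse2 <- mean((cars_test$price - predict(fit2, newdata = cars_test))^2)

c(model1 = mse1, model2 = mse2)  # choose the model with the smaller test MSE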


Glossary


Estimation Error

\[y_{i}-\hat{y}_{i}\] where,

\(y_{i}\) is the actual value of the response variable and

\(\hat{y}_{i}\) is the estimated value of the response variable


Mean Squared Error (MSE)

\[MSE = \frac{\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}{N}\] where,

\(y_{i}\) is the actual value of the response variable and

\(\hat{y}_{i}\) is the estimated value of the response variable

\(N\) is the sample size


R Squared

\[R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}}\]

where,

\(y_{i}\) is the actual value of the response variable

\(\hat{y}_{i}\) is the estimated value of the response variable

\(\bar{y}\) is the average of output values

\(N\) is the sample size
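As a check on this definition, R squared can be reproduced by hand from the full-data model fitted earlier; the result should match summary(fit)$r.squared:

rss <- sum((cars$price - predict(fit))^2)      # residual sum of squares

tss <- sum((cars$price - mean(cars$price))^2)  # total sum of squares

1 - rss/tss                                    # equals the multiple R-squared of 0.7397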