
Introduction to R & Analytics

In the last article (http://rpubs.com/satyajeeetsingh/Introduction-to-R-and-analytics) we covered tables and graphs, which are very practical, hands-on methods for analyzing and understanding a dataset. This also allowed you to get a little more familiar with a tool (R) which we can use for a lot of model building. As datasets get larger and larger, it becomes less practical to rely on filtering and plotting to gain insight into what is going on. That being said, tables and graphs are still extremely useful preliminary tools: when preprocessing the data, we use the various table functions from before specifically to find inconsistencies and odd or out-of-place examples. In situations where we want to use existing data to predict future results, and where there is a lot of data, we need methods that rely more on mathematics and apply a consistent philosophy to all of the samples. Essentially, we need to predict the value “Y” given some input “X”, based on a series of (“X”, “Y”) pairs whose results we already know.

For regression we are trying to predict a continuous variable. Applications of this include:

For classification we are trying to predict a discrete and known number of categories. Applications of this include:

This week we will spend three days on regression: the first two will focus on predicting values, and the third day will motivate classification. Finally, we will look at classification and introduce decision trees. You will get one task that is both assigned and due this week, as well as one task that is assigned now and due halfway through next week.

Simple Linear Regression

Today we cover simple regression. We will explain the fundamentals and mathematics behind it, and then run it on a simple example with a dataset built into R. There is less coding today, but many potentially new concepts. For this class we will flip the “discussion” and “practical” components.

Practical Component

Today we demonstrate simple regression using the cars dataset (seen earlier). But first we will do some quick filtering, just to warm up and understand the data. Please write your results in code blocks under the questions below. The dataset is loaded here:

data("cars")
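If the filtering syntax from the last article is rusty, here is a minimal sketch using a threshold that does not appear in the questions (both lines compute the same thing):

# Mean speed of cars whose stopping distance exceeds 20 (an arbitrary threshold)
mean(cars$speed[cars$dist > 20])

# The same idea with subset(), which some people find more readable
mean(subset(cars, dist > 20)$speed)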

Questions:

  1. What’s the mean speed of cars with distance larger than 30?

  2. What’s the mean distance for cars with speed larger than 12?

  3. What’s the first quartile for the speed of cars with speed greater than 12 and distance greater than 50?

  4. For cars with distance greater than 80, is the mean or median of the speed higher?

  5. Plot distance versus speed below and discuss what trends you see. How would you predict the stopping distance given a speed, in general?

Regression Theory

Simple Regression

Regression can be done in n dimensions, but we will start with a two-dimensional (one input, one output) example. When regression uses only one input to predict an output it is called simple regression. Once we have multiple inputs we call it multiple regression. If we are trying to fit a straight line or plane that best represents the relationship between input and output (or X and Y), then it is called linear regression. In 1 dimension this is just a point (for example the mean), in 2 dimensions this is a line, in 3 dimensions it becomes a plane, and so on.

Equation

The simple linear regression equation is:

\[ y_i = w_0 + w_1x_i + err_i \]

where \(y_i\) stands for the output for input i, \(w_0\) is the intercept term, and \(w_1\) is the slope of the line. This is known as simple linear regression because you are just fitting a line. You have seen the equation for a line in calculus, and this is essentially the same idea. The slightly complicated part is finding which line fits the set of points “best”.

Measuring Error

To decide what the best line is, we have to come up with a way of measuring how good one fit is compared to another. In many cases we can see which line seems best just by observation, but when two different lines look equally good we need a way to differentiate them. For linear regression the common error measurement is known as the residual sum of squares (RSS).

\[RSS(w_0, w_1) = \sum_{i=1}^{n}(y_i - [w_0 + w_1x_i])^2\]

This is an aggregate measure of the distance from the real values to the fitted line. In the initial equation, the noise term (\(err_i\)) is the difference between the actual point and the corresponding point on the line. We square each value because we want a distance, so whether the line is above or below the real point should not affect the measure. In theory there are many options for this error (we could use an absolute-value difference, for example), but we use the squared error because it has many useful properties: it is differentiable everywhere and it penalizes large deviations heavily. But how do we find which line has the lowest RSS?
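To make this concrete, here is a minimal sketch that computes the RSS of any candidate line on the cars data from the practical above (rss is a helper name I am introducing, not a built-in function):

# RSS of the line y = w0 + w1 * x on a given set of points
rss <- function(w0, w1, x, y) {
  sum((y - (w0 + w1 * x))^2)
}

# Two candidate lines for predicting dist from speed; the smaller RSS fits better
rss(-10, 3, cars$speed, cars$dist)
rss(-17.6, 3.9, cars$speed, cars$dist)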

Closed form solution

For linear regression with a squared error, there is a closed form solution we can calculate directly. Closed form essentially means there is a fixed set of mathematical operations that, given certain conditions, always produces the best-fitting line.

If we “matrixify” or “vectorize” the RSS expression above, we get \(argmin_w||y-Xw||^2\), where \(y_{n,1}\) is a vector of the outputs, \(X_{n,2}\) is a matrix whose first column contains 1’s and whose second column contains the X values, and \(w_{2,1}\) is a weight vector containing \(w_0, w_1\). In other words we are trying to pick the \(w\) vector such that the expression \(||y-Xw||^2\) is minimized. If we just expand the expression:

\[||y-Xw||^2 = (y-Xw)^T(y-Xw) = (y^Ty - 2w^TX^Ty + w^TX^TXw)\]

Since we want to minimize this, we can take the derivative with respect to w and set it to 0, giving \(0=-2X^Ty+2X^TXw\), which implies \(w=(X^TX)^{-1}(X^Ty)\). This is an easy way to pick the weight vector, as long as \(X^TX\) is invertible. However, if it is not invertible, or inverting it takes too much time, we need another method.
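As a sanity check, here is a sketch of the closed-form formula applied to the cars data (Xmat, yvec and wHat are names I am introducing purely for illustration); the result should match the coefficients lm() reports later in this notebook:

# Design matrix: a column of 1's (for w0) and a column of speeds (for w1)
Xmat <- cbind(1, cars$speed)
yvec <- cars$dist

# w = (X^T X)^{-1} (X^T y); solve() inverts the 2x2 matrix t(Xmat) %*% Xmat
wHat <- solve(t(Xmat) %*% Xmat) %*% (t(Xmat) %*% yvec)
wHat  # first entry is the intercept w0, second is the slope w1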

Gradient Descent

You know from calculus that finding the minimum of something usually involves some sort of derivative. The RSS is a convex (bowl-shaped) function of the weights, so following its slope downhill will converge to the minimum RSS. (Think about why there cannot be two different lines which both achieve the minimum RSS.)

The algorithm used to find the best-fitting line (within some tolerance) is known as gradient descent. Gradient descent is built into most regression tools, so it is unlikely you will ever have to implement it yourself, but here I will give an overview of how it works. There are many resources online that can explain any of the steps or theory if they are unclear.

The idea is that given a step size (n), a weight vector (w), and a tolerance (err), as long as we have not converged we update the w vector as follows:

\[w^{(t+1)} = w^{(t)} - n\nabla g(w^{(t)})\]

In the case of simple linear regression, we want:

\[min_{w_0,w_1}\sum_{i=1}^{N}(y_i-[w_0+w_1 x_i])^2\]

So we need the gradient of the RSS expression with respect to \(w_0\) and \(w_1\), which (you can check it yourself) evaluates to:

\[ \nabla RSS(w_0, w_1) = \begin{bmatrix} -2\sum_{i=1}^{N}[y_i-(w_0+w_1x_i)] \\ -2\sum_{i=1}^{N}[y_i-(w_0+w_1x_i)]x_i \end{bmatrix} \]

This idea translates to more than 2 dimensions; in that case, rather than finding a local min/max on a curve, it involves finding the local min/max on a plane or hyperplane. By running this algorithm we keep adjusting our line until we are unable to reduce the RSS further within some tolerance. This ensures that we get a best-fitting line. As you can imagine, it also works for curves.
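To make the update rule concrete, here is a rough sketch of gradient descent on the cars data from the practical above; the starting point, step size, tolerance and iteration cap are arbitrary choices of mine, not part of any standard recipe:

x <- cars$speed
y <- cars$dist

w <- c(0, 0)    # start at w0 = 0, w1 = 0
n <- 0.00005    # step size; too large a step makes the updates blow up on this data
err <- 0.01     # stop once the gradient is essentially zero

for (i in 1:100000) {
  # Gradient of the RSS with respect to (w0, w1), as derived above
  grad <- c(-2 * sum(y - (w[1] + w[2] * x)),
            -2 * sum((y - (w[1] + w[2] * x)) * x))
  if (sqrt(sum(grad^2)) < err) break
  w <- w - n * grad
}
w  # should end up close to the coefficients lm() reports below (about -17.6 and 3.9)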

However, does having a low RSS guarantee that the line we predict will be useful?

Regression Package and Implementation

To demonstrate linear regression in R, we will use the built-in “cars” dataset, which has speed and stopping-distance measurements for 50 cars.

Plotting

First we should plot the variables. Here I will explore a few more specific plots, compared to the freestyle ones you used for Titanic.

We will explore scatter plots (which can reveal a trend), boxplots (which can identify outliers), and density plots (to check that the response variable does not deviate too much from a normal distribution, or bell curve).

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
# Scatter Plots
pv1 <- ggplot(cars, aes(x=1:nrow(cars), y=sort(speed))) + geom_point()
pv2 <- ggplot(cars, aes(x=1:nrow(cars), y=sort(dist))) + geom_point()
pv1

pv1 + geom_smooth(method=lm)

pv2

pv2 + geom_smooth(method=lm)

# Box Plots
plot_grid(ggplot(cars, aes(y=dist)) + geom_boxplot(), ggplot(cars, aes(y=speed)) + geom_boxplot(), ncol=2)

# Density Plots
plot_grid(ggplot(cars, aes(x=dist)) + geom_density(), ggplot(cars, aes(x=speed)) + geom_density(), ncol=2)

Creating a linear model

In R, you can define and save a linear model (or any model) the same way we save functions or variables (with the assignment operator, “<-”). Below, we will create a linear model for predicting distance using speed. The key arguments to lm are “formula” and “data”; the formula is formatted as output/Y ~ input(s)/X.

linearModel <- lm(dist ~ speed, data=cars)

print(linearModel)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
help(lm)
## starting httpd help server ... done
  1. QUESTION: In the model above, what do “(Intercept)” and “speed” stand for in terms of the regression equation defined at the beginning of this notebook (\(w_0, w_1\))?

Summarize specific features

We can see more information about the model we built below:

summary(linearModel)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We will discuss how to interpret this output in the next class.

Make predictions

Here is a method for using the linear model above to predict values for new points (i.e. different speeds). To predict “dist” we need a dataframe that has a column named “speed” (as our formula requires a speed to predict a distance). Also note that the predictions are initially labelled “1, 2, 3, …” rather than by the input value, so below I rename them to provide clarity.

# speeds to predict, in vector form
speeds2 <- c(2,3,5,15,26,30)

# Build dataframe with a "speed" column
b <- data.frame(speed = speeds2)

# Make predictions
preds <- predict(linearModel, b)
preds
##          1          2          3          4          5          6 
##  -9.714277  -5.781869   2.082949  41.407036  84.663533 100.393168
# Rename the predictions by their input speeds
names(preds) <- as.character(speeds2)
preds
##          2          3          5         15         26         30 
##  -9.714277  -5.781869   2.082949  41.407036  84.663533 100.393168

Plot Regression Line

Finally, we will plot our regression line on the points we have from the data using the following code:

plot(cars$speed, cars$dist)
abline(linearModel)

Quick Questions for Review

  1. Looking at the plot of speed vs distance, what distance do you think a car travelling at speed 30 would need to stop, according to the model you just built?

  2. What are the predictions you get for the following set of speeds (2,3,5,15,26,30)? Do these predictions make sense?

  3. You will each be given one component of simple linear regression and tomorrow you will explain your particular component to the rest of the class.

Practice Project

  • Load your (preferred) dataset into R
  • Begin exploring the different variables in a new RmD file (write what each column represents, and include some summaries of column values)
  • Begin framing the problem and exploring possible approaches.

Multiple Linear Regression


Multiple Linear Regression (MLR) works almost the same way mechanically as simple linear regression (SLR), except that it accounts for multiple inputs. We can also think of this as fitting a line (or plane) in multiple dimensions. In this class we will also discuss how categorical variables are handled. At a high level, the formula changes from \(y_i = w_0 + w_1x_{i,1} + err_i\) to \(y_i = w_0 + w_1x_{i,1} + w_2x_{i,2} + ... + err_i\). We will discuss how to interpret the coefficients after we discuss the implementation.

Code for multiple linear regression

The code for it is as follows. We use a dataset of Swiss fertility measures and socio-economic indicators for each of 47 French-speaking provinces of Switzerland around 1888. There are 47 observations and 6 variables: Fertility, Agriculture, Examination, Education, Catholic, and Infant.Mortality. Here we will build a model that predicts Education using all the other variables. The “.” in the formula represents “every column except the output”.

data("swiss")
swissModel <- lm(Education ~ ., data=swiss)

This is the same structure as “lm” before, just with additional variables in the “input” part of the formula.

Interpreting Results

We can interpret the results in more detail using the “summary” command.

summary(swissModel)
## 
## Call:
## lm(formula = Education ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3949  -2.3716  -0.2856   2.8108  11.2985 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      32.74414    8.87888   3.688 0.000657 ***
## Fertility        -0.40851    0.08585  -4.758 2.43e-05 ***
## Agriculture      -0.16242    0.04488  -3.619 0.000804 ***
## Examination       0.41980    0.16339   2.569 0.013922 *  
## Catholic          0.10023    0.02150   4.663 3.29e-05 ***
## Infant.Mortality  0.20408    0.28390   0.719 0.476305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.907 on 41 degrees of freedom
## Multiple R-squared:  0.7678, Adjusted R-squared:  0.7395 
## F-statistic: 27.12 on 5 and 41 DF,  p-value: 5.223e-12

Now is a good time to understand what the summary command for the model says:

  • The first element (“Call”) echoes back the formula that was input.
  • The residuals section shows how the residuals are distributed. Residuals are the differences between the actual points and the model’s predictions, and if the residuals are distributed symmetrically around 0 that is a good sign. It is usually a good idea to plot the residuals and check the distribution, although that may not always be possible. Generally, if you see an even spread around 0 with no obvious pattern of dispersion, that is a good sign.
plot(swissModel$residuals, main="Residuals Plot")
abline(h=0)

The coefficients (next row) have many components.

  • The first one is the estimate of the value of \(w_0\), \(w_1\) etc. in the regression equation.

  • Each estimate is interpreted holding the other factors fixed: with all inputs at zero, Education would be about 32.74%, and with every unit increase in Fertility the estimate decreases by ~0.41, and so on for the other features.

  • The standard error estimates the average amount that the coefficient estimate would vary across repeated samples. This means (intuitively) that we want the standard error to be very small compared to the estimate. Using that logic, Fertility, Catholic and Agriculture look like reliable indicators, while Infant.Mortality is certainly a weak one.

  • The coefficient t-value represents how far (how many standard errors) the estimate is from 0. A value close to zero implies that this factor does not impact the output variable much. Therefore, for a useful predictor, we want this value to be as far from zero as possible (as with the intercept, Fertility and even Agriculture).

  • Pr(>|t|) is the p-value which is used in hypothesis testing. To give a tldr version of hypothesis testing:

You are almost always testing whether the null hypothesis is true (meaning that the input you are testing has no effect on the response). If the p-value is larger than 0.05 you typically fail to reject the null hypothesis; otherwise you can reject it, and rejecting it means the input is important. Typically a variable fails to be significant for a reason discussed above (e.g. too much standard error relative to the estimate, or a t-value too close to 0).

  • In the last section there are several more measures of “goodness of fit”, of which the most notable is the adjusted R-squared. The adjusted R-squared estimates how much of the variation in the response variable is explained by the inputs (in this case about 74%), which is not bad, but this depends on the context too.

  • Some people also look at the F-statistic, which is judged by how much larger it is than would be expected given the number of input variables. Either way, the point is to assess how good the “fit” is; the short sketch below shows how to pull these numbers out of the summary object.
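As a small sketch of how to extract these values in R (this reuses the swissModel fitted above; s is just a name I am introducing):

s <- summary(swissModel)

s$coefficients[, "Pr(>|t|)"]   # p-values for each input (plus the intercept)
s$adj.r.squared                # adjusted R-squared
s$fstatistic                   # F-statistic and its degrees of freedom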

Discarding Variables

One byproduct of all of these pieces of information is that we can now discard features that are not pulling their weight. That way, when we are predicting, we need to supply fewer fields and the model runs on fewer variables (both of which save time).

There are two easy ways you can choose to discard variables:

    1. Remove features with a high p-value.
    2. Try all subsets of features and choose the combination with the highest adjusted R-squared.

If we remove the variable that looked weak in our original model (Infant.Mortality), we get the following reduced model:

swissModelReduced <- lm(Education ~ Fertility + Agriculture + Examination + Catholic, data=swiss)

We slightly increase the adjusted R-squared by removing a variable! This is because the adjusted R-squared accounts for how many input variables are used: it rewards explaining the response well while penalizing extra inputs, so dropping an uninformative variable can raise it.
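You can check this claim yourself with a quick sketch comparing the adjusted R-squared of the two models fitted above:

summary(swissModel)$adj.r.squared         # full model, all inputs
summary(swissModelReduced)$adj.r.squared  # model without Infant.Mortality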

Categorical variables

In general, we treat categorical variables much the same way we treat numeric variables. However, the regression treats each category as a separate effect. For example, let’s say we run a linear regression on an output variable Y using a data frame X that consists of two numeric variables and two categorical variables, each with 4 categories (code below):

Y <- unlist(lapply(1:100, function(x) -x^3 + 2*x^2 + sin(x) + cos(x)))
Y <- as.vector(scale(Y)) + rnorm(100, 0, 0.5)

X <- data.frame(ind = 1:100,
                mFn = unlist(lapply(1:100, function(x) -x^3 + 2*x^2 + sin(x) + cos(x))),
                mC1 = as.factor(sample(1:4, 100, replace=TRUE)),
                mC2 = as.factor(sample(c("a","b","c","d"), 100, replace=TRUE)))
X$mFn <- as.vector(scale(X$mFn))

tmpData <- cbind(Y, X)

myTmpModel <- lm(Y ~ ., data=tmpData)

summary(myTmpModel)
## 
## Call:
## lm(formula = Y ~ ., data = tmpData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84351 -0.32991 -0.02175  0.27510  1.20962 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.222236   0.234956   0.946    0.347    
## ind         -0.000812   0.003908  -0.208    0.836    
## mFn          0.897455   0.113379   7.916 5.73e-12 ***
## mC12        -0.182761   0.136485  -1.339    0.184    
## mC13        -0.192612   0.139092  -1.385    0.170    
## mC14        -0.170273   0.130115  -1.309    0.194    
## mC2b        -0.131756   0.128094  -1.029    0.306    
## mC2c        -0.059502   0.122562  -0.485    0.628    
## mC2d        -0.168158   0.133305  -1.261    0.210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.449 on 91 degrees of freedom
## Multiple R-squared:  0.8204, Adjusted R-squared:  0.8046 
## F-statistic: 51.95 on 8 and 91 DF,  p-value: < 2.2e-16

As we can see, each categorical variable has been broken up into 3 separate dummy variables (one for each category except the first). When all of those dummy variables are 0, the observation belongs to the first (baseline) category. Each coefficient is then read as "being in category X changes our estimate by __ relative to the baseline".
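If you want to see exactly how R expands a factor into dummy variables, model.matrix() shows the encoding it uses behind the scenes (here for the mC2 factor built above):

# Each row gets a 1 in the column for its category; category "a" is the baseline and gets all zeros
head(model.matrix(~ mC2, data = X))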

Summary

In summary, all of the above methods are used heavily in traditional experiments and are a large part of statistical theory. This material is often skipped over today, but it is still good to know: if you get strange results in the future, these diagnostics are a good way to work out what happened. Another very important skill from this section is the ability to discard useless variables. In the real world, plain regression is rarely the front end of modern applications, since we have more flexible methods for modelling relationships, but it is still an important method to learn and understand as a foundation.

Practical

You will get a dataset and be required to create a model which can make predictions.

The task is to follow the structure of the notebook above to create a similar-style analysis, and then at the end write a short report explaining what you have learned from the data using multiple regression. Do your work in a separate RmD notebook!

For the dataset you are assigned, you must:

  • Use some filters to highlight features of the data before doing regression
  • Plot the data (use different plots for different relationships)
  • Choose an appropriate problem to solve
  • Build a linear or nonlinear model
  • Make some predictions using your own inputs
  • Draw some conclusions (e.g. which of the multiple inputs predicts the output best, which can we discard, does the model make sense when we extrapolate or interpolate)

Keep in mind that “str” will show all the variables as numeric, but some of them are in fact categories. You can convert them using the “as.factor()” function.

library("mlbench")
library("AppliedPredictiveModeling")
library("nycflights13")
library("nasaweather")
library("MASS")

Set 1

Chemical Manufacturing Process Data Info here: https://rdrr.io/rforge/AppliedPredictiveModeling/man/ChemicalManufacturingProcess.html

data("ChemicalManufacturingProcess")

Set 2

Abalone Data Info here: https://archive.ics.uci.edu/ml/datasets/abalone Google “abalone” to see what they are

data("abalone")

Set 3

Diamonds Data Info: https://ggplot2.tidyverse.org/reference/diamonds.html

data("diamonds")

Set 5

Boston Housing Data Info here: http://ugrad.stat.ubc.ca/R/library/mlbench/html/BostonHousing.html

data("BostonHousing2")

Set 7

NYC Flights data

data("flights")

Set 8

Atmosphere data from NASA

data("atmos")

Set 9

Boston housing data

data("Boston")

Project

  • Exploring data
  • Framing problems

Non-linear Regression and Measuring Error

library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## Warning: package 'readr' was built under R version 3.6.1
## Warning: package 'dplyr' was built under R version 3.6.1
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter()   masks stats::filter()
## x cowplot::ggsave() masks ggplot2::ggsave()
## x dplyr::lag()      masks stats::lag()
## x dplyr::select()   masks MASS::select()
library("caret")
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library("MASS")
library("splines")
library("mgcv")
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
data("Boston")

Types of Non-linear regression

Typically non-linear regression is a more practical application of regression. We use 3 fundamental ideas for non-linear regression:

  • Polynomial Regression
  • Spline Regression
  • Generalized Additive Models

First we will create a data split (into training and testing sets):

trainingSamples <- sample(1:nrow(Boston), round(0.8*nrow(Boston)))
trainData <- Boston[trainingSamples,]
testData <- Boston[-trainingSamples,]

Let’s also build a quick linear model, predicting medv from the lstat column:

modelLmBoston <- lm(medv ~ lstat, data=trainData)
predictions <- modelLmBoston %>% predict(testData)

RMSE(predictions, testData$medv)
## [1] 5.995698
summary(modelLmBoston)
## 
## Call:
## lm(formula = medv ~ lstat, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.138  -4.185  -1.340   2.190  24.308 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.7964     0.6341   54.87   <2e-16 ***
## lstat        -0.9554     0.0430  -22.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.273 on 403 degrees of freedom
## Multiple R-squared:  0.5506, Adjusted R-squared:  0.5494 
## F-statistic: 493.7 on 1 and 403 DF,  p-value: < 2.2e-16

Polynomial Regression

Polynomial regression adds polynomial (e.g. quadratic) terms to a regression. This is usually done by plotting each individual input against the output and then choosing a function that you think best generalizes the relationship; we saw one example of this on a previous day. We will use the Boston Housing dataset and the same pair of variables.

modelSqBoston <- lm(medv ~ poly(lstat, 2, raw=TRUE), data=trainData)
modelPolyBoston <- lm(medv ~ poly(lstat, 6, raw = TRUE), data = trainData)

summary(modelSqBoston)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8127  -3.9684  -0.5957   2.4612  25.2395 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 43.635690   0.978154   44.61   <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.414321   0.138008  -17.49   <2e-16 ***
## poly(lstat, 2, raw = TRUE)2  0.045510   0.004141   10.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.508 on 402 degrees of freedom
## Multiple R-squared:  0.6544, Adjusted R-squared:  0.6527 
## F-statistic: 380.6 on 2 and 402 DF,  p-value: < 2.2e-16
summary(modelPolyBoston)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1645  -3.2842  -0.7063   2.1914  26.7562 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  7.314e+01  5.991e+00  12.208  < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.486e+01  3.171e+00  -4.686 3.83e-06 ***
## poly(lstat, 6, raw = TRUE)2  1.844e+00  6.099e-01   3.024  0.00266 ** 
## poly(lstat, 6, raw = TRUE)3 -1.217e-01  5.544e-02  -2.194  0.02880 *  
## poly(lstat, 6, raw = TRUE)4  4.235e-03  2.560e-03   1.654  0.09893 .  
## poly(lstat, 6, raw = TRUE)5 -7.375e-05  5.787e-05  -1.274  0.20330    
## poly(lstat, 6, raw = TRUE)6  5.088e-07  5.069e-07   1.004  0.31608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.193 on 398 degrees of freedom
## Multiple R-squared:  0.6957, Adjusted R-squared:  0.6912 
## F-statistic: 151.7 on 6 and 398 DF,  p-value: < 2.2e-16
ggplot(trainData, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 6, raw = TRUE))

predictions <- modelPolyBoston %>% predict(testData)

RMSE(predictions, testData$medv)
## [1] 5.323196

Spline Regression

Spline regression fits a smooth curve as a series of polynomial segments; the points where the segments join are called knots. We set the knots at the 1st quartile, median, and 3rd quartile of lstat.

knots <- quantile(trainData$lstat, p = c(0.25, 0.5, 0.75))
modelSpline <- lm (medv ~ bs(lstat, knots = knots), data = trainData)
# Make predictions
predictions <- modelSpline %>% predict(testData)
# Model performance
RMSE(predictions, testData$medv)
## [1] 5.32017
ggplot(trainData, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))

Generalized additive models

Generalized additive models do something like spline regression, but choose the amount of smoothing (and effectively the knots) in an automated way.

modelGAM <- gam(medv ~ s(lstat), data = trainData)
# Make predictions
predictions <- modelGAM %>% predict(testData)
# Model performance
RMSE(predictions, testData$medv)
## [1] 5.293
ggplot(trainData, aes(lstat, medv) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))

These are just some of the available methods. To be fully honest, I don’t do that much regression day-to-day, so it’s a good idea for you to explore these in depth during this mini-assignment.

Measuring Error in Regression

We already know that we can check a model using the “adjusted R-squared” or “deviance explained”. However, there are better ways to test prediction too. These typically involve training your model on a subset of the rows and then testing on the rows that were held out, or perturbing your training data to create testing data. We will discuss cross-validation further when we get to classification, but I will give an example on the board in class for regression.
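Since caret is already loaded for this class, here is one hedged sketch of what that looks like for regression: 10-fold cross-validation of the simple medv ~ lstat model (the fold count and the formula are just illustrative choices of mine):

# Each fold is held out once while the other 9 folds train the model
cvModel <- caret::train(medv ~ lstat, data = Boston,
                        method = "lm",
                        trControl = trainControl(method = "cv", number = 10))
cvModel$results  # RMSE, R-squared and MAE averaged over the held-out folds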

Logistic Regression and Measuring Error

Regression Task: Take Questions

Logistic Regression Introduction

Logistic regression is used to fit a regression curve when y is a categorical variable. A categorical variable (as you can tell from the name) is a variable that takes one of a finite number of categories as its value. For example, in the Titanic data “Survived” is categorical as it is either 1 (survived) or 0 (didn’t survive), and the outcome of a coin flip is categorical as well. We will use the Titanic data to demonstrate this method today. Logistic regression can be extended to multi-category outputs, but it is really built for binary (two-category) outputs.

Conditional Probability

The output of logistic regression is a probability, so it is important to explain the idea of conditional probability to show how it works. Conditional probability is the probability of an event occurring given that another event has occurred. In math it is usually written Pr(a = x | b): the probability that a equals x given that event b has occurred. For independent events such as flips of a coin, Pr(a | b) == Pr(a), e.g. Pr(head | previous flip was a tail) == Pr(head).

Modelling relationship

Say for some abstract categories we want to model a relationship between p(X)=Pr(Y=1 | X) and X. In linear regression we would define a relationship like p(X) = \(\beta_0\) + \(\beta_1\)X. When all of our outputs are either in category 0 or 1, many of the points on this line will clearly not predict well. A simple option would be to find a boundary after which all values of X map to 1 and before which all map to 0. However, the error measure in linear regression punishes outliers much more severely. For instance, if we want a score that is either 1 or 0, and our linear model predicts a score of 100, then the squared error for that point becomes \(99^2\), even though we know that this point probably belongs to category 1.

Logistic regression fixes this problem by using the function: \[p(Y=1|X=x) = \frac{e^{\beta_0+\beta_1X}}{1 + e^{\beta_0+\beta_1X}}\]

This will create an S-shaped curve as you can see here:

sigmoid = function(x) {
  1/(1+exp(-x))
}

x <- seq(-5,5,0.01)
y <- seq(-100,100,0.01)
par(mfrow=c(1,2))
plot(x, sigmoid(x), lwd=.1)
plot(y, sigmoid(y), lwd=.1)

You can also rearrange the above to get \(\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1X}\). The left side of this equation, called the odds, can take values between 0 and infinity, and a low or high value implies a low or high probability. Taking the log of both sides gives \(log(\frac{p(X)}{1-p(X)}) = \beta_0+\beta_1X\), which ties back to linear regression and shows how the weighted coefficients affect the model. This also shows how logistic regression squashes very large inputs to outputs that are nearly constant (close to 0 or 1).
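A tiny numeric sketch of that relationship (the coefficient values here are made up purely for illustration):

b0 <- -3; b1 <- 0.8; xval <- 2          # made-up coefficients and input

logOdds <- b0 + b1 * xval               # the linear part: log(p / (1 - p))
exp(logOdds) / (1 + exp(logOdds))       # back to a probability via the logistic function
plogis(logOdds)                         # R's built-in logistic function gives the same value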

Logistic Regression R Code

Here I will go over some code for logistic regression in R, as well as interpreting the results, using the Titanic data.

Loading Data

These are usual functions for loading data.

library(titanic)
dfTitanic <- titanic_train

Cleaning Data and Removing Missing Data

Recall that in previous modules we did many bits of initial analysis on the data. While this is a good idea in general, it is especially useful when looking for missing or “NA” values. Here we will look for both missing values and the number of unique values for each variable, to see which variables can be used as categorical and which cannot really be used at all.

missingNAVal <- sapply(dfTitanic, function(x) sum(is.na(x) | x == ""))
uniqueVals <- sapply(dfTitanic, function(x) length(unique(x)))
missingNAVal
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
uniqueVals
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4

We can see that Cabin has too many missing values, and Age has a fair number too. We also see that PassengerId, Name and Ticket have too many possible values to use as categories. We know from before that it is possible to create features out of Name, but we will not do that here; we also expect from common sense that age and sex affect the chance of surviving, so we will keep those columns. The next steps are to choose the relevant columns for training and find a way to deal with Age.

In this case, because most ages are available, we can just fill in the missing ages with the average of the known ages. This will not noticeably change the distribution of the age feature. The imputed ages will also have decimal points, which makes them easy to identify in case we need to revert these changes. We will also remove the two rows where “Embarked” is missing, as this does not significantly reduce the total number of rows.

Decisions like these always have to be made when there are missing values; usually they can be justified by some form of field knowledge, and otherwise they should be stated upfront.

trainData <- subset(dfTitanic, select=c(2,3,5,6,7,8,10,12))

trainData$Age[is.na(trainData$Age)] <- mean(trainData$Age, na.rm=TRUE)

trainData <- trainData[trainData$Embarked != "",]
rownames(trainData) <- NULL

Splitting Data

We will also (for now) just do a basic split of the data. You will be required to implement cross-validation properly later.

train <- trainData[1:800,]
test <- trainData[801:889,]

Glm function

To fit a “generalized linear model”, we can use the glm function. Note that the “.” means all columns except the response (which in this case is Survived); in other words, “Survived ~ Age + Sex + Embarked + …” is equivalent to “Survived ~ .”. This will look very similar to the linear models from before, but here there is an additional argument which specifies that we are doing logistic regression.

model <- glm(Survived ~ ., family=binomial(link='logit'), data=train)
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6064  -0.5954  -0.4254   0.6220   2.4165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
## Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
## Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
## Age         -0.037267   0.008195  -4.547 5.43e-06 ***
## SibSp       -0.292920   0.114642  -2.555   0.0106 *  
## Parch       -0.116576   0.128127  -0.910   0.3629    
## Fare         0.001528   0.002353   0.649   0.5160    
## EmbarkedQ   -0.002656   0.400882  -0.007   0.9947    
## EmbarkedS   -0.318786   0.252960  -1.260   0.2076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1065.39  on 799  degrees of freedom
## Residual deviance:  709.39  on 791  degrees of freedom
## AIC: 727.39
## 
## Number of Fisher Scoring iterations: 5

Interpreting Outputs

Using similar ideas from above, we can immediately see that Sex, Pclass, Age and SibSp are the most statistically significant, in that order, while the others are not significant.

There are also ways to analyze the results using ANOVA tables; once again there are whole areas of statistical theory devoted to understanding ANOVA, but I will give a tldr version here.

The idea is to measure whether our features actually matter. In most cases, this means comparing a model with and without the new feature, and seeing if there is an impact on explaining the variation. In R, there is a built in anova function!

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       799    1065.39              
## Pclass    1   83.607       798     981.79 < 2.2e-16 ***
## Sex       1  240.014       797     741.77 < 2.2e-16 ***
## Age       1   17.495       796     724.28 2.881e-05 ***
## SibSp     1   10.842       795     713.43  0.000992 ***
## Parch     1    0.863       794     712.57  0.352873    
## Fare      1    0.994       793     711.58  0.318717    
## Embarked  2    2.187       791     709.39  0.334990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the null deviance and the residual deviance with features are clearly different: we get a much smaller value once the features are added. We can also see how little the residual deviance shrinks after adding the Age variable, and especially after SibSp.

In this case, a larger Pr(>Chi) value indicates that the feature is not important (a model without it explains close to the same amount as a model with it). Unfortunately there is no \(R^2\) value as there is in linear regression, so this is the next easiest way to assess model quality.

Predictions

We will now predict using the “predict” function. Remember that we split the data above for this!

results <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type='response')
results <- ifelse(results > .5, 1, 0)

error <- mean(results != test$Survived)
print(paste('Accuracy', 1-error))
## [1] "Accuracy 0.842696629213483"

Note that logistic regression returns a score (a probability), and since we actually want a classification, the “ifelse” is needed to convert that score into a class before comparing with the actual value. One benefit of returning a score between 0 and 1 is that we can change the “.5” threshold in some cases to perform better on a training/testing set.
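As a quick sketch of that last point, you could recompute the probabilities and sweep a few thresholds to see how the accuracy on this test split changes (the threshold grid is arbitrary):

probs <- predict(model, newdata=subset(test, select=c(2,3,4,5,6,7,8)), type='response')

# Accuracy on the test set for a few candidate thresholds
sapply(c(0.3, 0.4, 0.5, 0.6, 0.7), function(th) mean(ifelse(probs > th, 1, 0) == test$Survived))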

In this example the accuracy is about 84%. Whether this is good or bad depends on the situation. If you are predicting coin flips, a model that is correct 84% of the time would be outstanding. On the other hand, if you have a testing set where 90% of the outcomes go one way and 10% the other, you can get higher accuracy simply by always predicting the majority side than by using this model. In our case, we can use our counts and filters to see the split:

table(test$Survived)
## 
##  0  1 
## 56 33

By this kind of reasoning we can conclude our model is pretty good, but it is usually a good idea to actually check how well naive guessing does and then re-assess. Once again, to judge the accuracy of the model it is good to read the research of experts in the field or people who know more about the data.
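One concrete check is the naive baseline of always predicting the majority class in this test split, which you can compute directly (a one-line sketch; here it comes to roughly 63%, so the 84% above is a genuine improvement):

# Accuracy of always predicting the more common outcome (here, "did not survive")
max(table(test$Survived)) / nrow(test)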

Measuring Performance for Logistic Regression

One more thing to note is false positives versus false negatives. In statistical theory these are also called Type I and Type II errors. A false positive occurs when something is indicated as true when it is in fact false, and a false negative occurs when something is indicated as false when it is in fact true.

In a medical situation, failing to diagnose a disease (a false negative) is much worse than a false positive; you want to be told if you have a problem. A false positive, on the other hand, can be terrible in the Titanic situation (i.e. you think someone has survived when in fact they haven’t). So depending on the type of experiment you are doing, it is good to pick a model that minimizes the more costly of the two errors. That is another reason it is important to understand the type of error being made.

We can use a simple table function to demonstrate (remember the vertical axis is the first element and horizontal is the second element):

table(test$Survived,results)
##    results
##      0  1
##   0 51  5
##   1  9 24

We see that in this case 5 passengers have been predicted to survive when they in fact did not, and 9 passengers have been predicted to not survive when in fact they did.
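If you want those two error rates as numbers, they fall straight out of the confusion table (a short sketch; cm is just a name I am introducing here):

cm <- table(test$Survived, results)

cm["0", "1"] / sum(cm["0", ])   # false positive rate: predicted survived, actually did not
cm["1", "0"] / sum(cm["1", ])   # false negative rate: predicted did not survive, actually did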

Comparing logistic regression to linear regression

Here we will fit an ordinary linear model on the same data and compare its results to the logistic model above.

linModel <- lm(Survived ~ ., data=train)
summary(linModel)
## 
## Call:
## lm(formula = Survived ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0598 -0.2033 -0.0905  0.2329  0.9945 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.3333979  0.0791077  16.855  < 2e-16 ***
## Pclass      -0.1705330  0.0215874  -7.900 9.35e-15 ***
## Sexmale     -0.5146673  0.0298415 -17.247  < 2e-16 ***
## Age         -0.0056649  0.0011470  -4.939 9.60e-07 ***
## SibSp       -0.0383192  0.0142871  -2.682  0.00747 ** 
## Parch       -0.0199596  0.0193733  -1.030  0.30320    
## Fare         0.0002316  0.0003379   0.685  0.49335    
## EmbarkedQ    0.0072926  0.0579485   0.126  0.89989    
## EmbarkedS   -0.0485509  0.0367956  -1.319  0.18739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3803 on 791 degrees of freedom
## Multiple R-squared:  0.3953, Adjusted R-squared:  0.3892 
## F-statistic: 64.64 on 8 and 791 DF,  p-value: < 2.2e-16
results <- predict(linModel, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type='response')
results <- ifelse(results > .5, 1, 0)

error <- mean(results != test$Survived)
print(paste('Accuracy', 1-error))
## [1] "Accuracy 0.786516853932584"

Summary

We use logistic regression in situations where we want a score between 0 and 1, often to classify points into two categories. Linear regression can output any value, and by its nature each error is penalized quadratically. Logistic regression squashes its output through the logistic function, so extreme inputs are penalized by an amount that is effectively bounded. The output of linear regression is easier to interpret, but for categorical outputs it can give much more misleading error measures.

Machine Learning Videos

  • https://www.youtube.com/watch?v=jwSbzNHGflM
  • https://www.youtube.com/watch?v=Y73iUAh56iI
  • https://www.youtube.com/watch?v=aJq6ygTWdao
  • https://www.youtube.com/watch?v=SfvRhqsmU4o
  • https://www.youtube.com/watch?v=C6nonNRoF7g
  • https://www.youtube.com/watch?v=JJlSgm9OByM&t=1s
  • https://www.youtube.com/watch?v=dd1kN_myNDs
  • https://www.youtube.com/watch?v=UoKXJzTYDpw&t=1s
  • https://www.youtube.com/watch?v=hW1_Sidq3m8
  • https://www.youtube.com/watch?v=-R9bJGNHltQ
  • https://www.youtube.com/watch?v=u7kQ5lNfUfg
  • https://www.youtube.com/watch?v=8GUYAVXmhsI
  • https://www.youtube.com/watch?v=F84jaIR5Uxc
  • https://www.youtube.com/watch?v=eHipy_j29Xw
  • https://openai.com/blog/better-language-models/
  • https://openai.com/blog/musenet/