Load Libraries

library(tidyverse)
library(modelr)
library(nycflights13)


0. Introduction


Starting this module, we are going to learn some modeling basics to set foundations for future courses such as data mining or machine learning.


What is a model?

Within the context of data science, a model usually refers to a mathematical model that captures the relationship between variables in a given data set.

For example, a simple linear model can be represented by a function

\[ y = ax + b\]

Here \(y\) is called the target variable or the response variable, and \(x\) can be called a feature, predictor, or independent variable, etc. In a linear model, \(y\) is usually a numeric variable, and \(x\) can be numeric or categorical.

\(a\) and \(b\) are coefficients of the model. They are usually determined by data using different algorithms and approaches.

This model is called a linear model since no matter what values \(a\) and \(b\) take, \(y\) and \(x\) always follow a linear relationship.

On the other hand, a quadratic model has the functional form:

\[ y = ax^2 + bx + c\]

It is obvious that the name “quadratic” comes from the fact that when \(a \neq 0\), \(y\) is a quadratic function of \(x\).
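To see the difference visually, here is a tiny sketch (the coefficient values are made up purely for illustration) that draws one linear and one quadratic function in R:

curve(2 + 0.5 * x, from = -3, to = 3, ylab = "y")                       # linear: y = 0.5x + 2
curve(1 + 0.5 * x - 0.8 * x^2, from = -3, to = 3, add = TRUE, lty = 2)  # quadratic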


Deterministic model vs stochastic model

An important way to classify models is by whether the model formula contains a random error that is not accounted for by the given data. For example, if we have data on \(x\) and \(y\), then the model

\[ \begin{equation}\tag{1} y = \beta_0 + \beta_1 x \end{equation}\]

is a deterministic model because, given any \(x\) value, we can compute a definitive \(y\) value once the parameters \((\beta_0, \beta_1)\) are known.

On the other hand, a model is called stochastic if there is an error term that can be assumed to follow some probability distribution independent of \(x\) but cannot be predicted with certainty. For example,

\[ \begin{equation}\tag{2} y = \beta_0 + \beta_1 x + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2) \end{equation}\]

Here we assume that even for a given \(x\) value, there is still some variance in \(y\) that cannot be accounted for, which is nearly always true for real data. However, formulas (1) and (2) frequently lead to the same fitted model, because the least squares solution for (1) coincides with the maximum-likelihood estimate for (2). You will learn more details in a Linear Regression course.
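To make the distinction concrete, here is a minimal sketch (the values beta0 = 2, beta1 = 3, and sigma = 1.5 are made up for illustration) that simulates data from the stochastic model (2) and then fits it by least squares:

set.seed(1)
beta0 <- 2; beta1 <- 3; sigma <- 1.5

x <- runif(50, 0, 10)
y <- beta0 + beta1 * x + rnorm(50, mean = 0, sd = sigma)  # the epsilon term

# Least squares recovers values close to (but not exactly equal to) the truth,
# because of the random error in each simulated y.
coef(lm(y ~ x))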


Least Square Solution

First, let’s understand the mathematical principle behind building a model. Usually the goal is to find the optimal parameters for a particular model by minimizing the error between the model predictions and the measured data.

Let’s look at the sim1 data set, a toy example from the modelr package. This data set contains artificial data about the relationship between two numeric variables:

sim1
## # A tibble: 30 × 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # ℹ 20 more rows
ggplot(sim1, aes(x,y)) + 
  geom_point()

There is an apparent pattern in the data - a linear relationship. So we are going to use a simple linear model to fit the data:

\[ y = \beta_0 + \beta_1 x\]

However, out of all possible lines (depending on the values of the intercept \(\beta_0\) and the slope \(\beta_1\)), which one is the best? Of course, we hope to minimize the error, and we need a metric for that. The most commonly used metric in statistical modeling is the mean squared error (MSE).

In the figure above (which shifts the \(x\) values of sim1 a little bit to make things more visually clear), the solid line is a linear fit to the data shown as scattered points. The blue vertical segments mark the errors: the length of each segment is the distance between the model prediction and the actual target value.

The mathematical principle here is to minimize the mean (equivalently, the sum) of the squared errors:

\[ \mathrm{find} \; \beta_0, \beta_1 \; \mathrm{to \; minimize \;} \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i})^2\]

where \(\hat{y_i}\) is the model prediction for each \(x_i\) value; in a linear model it is

\[ \hat{y_i} = \beta_0 + \beta_1 x_i \]

Such a solution is called the least squares solution (LSE). The mathematical details of how to find the values of \(\beta_0\) and \(\beta_1\) are introduced in courses such as Regression Analysis.
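To illustrate the principle itself (this is not how lm() actually computes the solution; it uses linear algebra), here is a small sketch that numerically searches for the \(\beta_0, \beta_1\) minimizing the mean squared error on sim1 and compares the result with lm():

# Mean squared error of a candidate (intercept, slope) pair on a data set
mse <- function(beta, data) {
  pred <- beta[1] + beta[2] * data$x
  mean((data$y - pred)^2)
}

best <- optim(c(0, 0), mse, data = sim1)
best$par                       # numerically optimized intercept and slope

coef(lm(y ~ x, data = sim1))   # essentially the same answer from lm()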


Linear vs non-linear model

Linear models are simple, while non-linear models are more flexible since they have more parameters. An important rule of thumb in data modeling is that a more flexible model may not be better than a simple one because of the overfitting problem. Which one to use depends on factors such as data size, data quality, and the actual trend of the data. Let’s look at the following example to understand this.

The figure below is from the textbook “Introduction to Statistical Learning”. The left panel shows wage (in $1,000s) vs age from an income survey of men in the central Atlantic region of the United States. The right panel shows wage vs year for the same data set.

For wage vs age, we see that, on average, wage first increases with age and then decreases after around age 60. This makes sense: people usually earn more as their working experience increases, while after retirement they usually earn less since they no longer work as much.

This increasing-then-decreasing trend can be captured by the quadratic model, but not by the linear model. Therefore, it is appropriate to model the relationship between wage and age as a quadratic function.

On the other hand, for wage vs year, on average there is a slow but steady increase between 2003 and 2009. In this case, a linear model is enough to capture the steadily increasing or decreasing trend.

Although the linear model is only a special case of the quadratic model (the case where the quadratic term vanishes), using the more flexible quadratic model is not necessarily the better choice, because:

  • More flexible (complicated) models are more prone to overfitting, which means the model captures noise rather than useful information in the data set (see the sketch after this list).
  • More flexible (complicated) models take more computational time. In some situations, this can be critical.
  • More flexible (complicated) models require more data to work well, which is sometimes not available.
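As a quick sketch of the first point (the degree 5 used here is chosen arbitrarily), compare a simple linear fit and a much more flexible polynomial fit on sim1: the flexible model always achieves a smaller training error, but its curve chases the noise rather than the trend.

fit_lin  <- lm(y ~ x, data = sim1)
fit_poly <- lm(y ~ poly(x, 5), data = sim1)

# In-sample (training) MSE: never larger for the more flexible model
mean(residuals(fit_lin)^2)
mean(residuals(fit_poly)^2)

# But the flexible fitted curve wiggles to follow noise
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = fitted(fit_lin)), color = "blue") +
  geom_line(aes(y = fitted(fit_poly)), color = "red")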

You will learn more details in a course on statistical learning. This is only a very brief introduction to what modeling is.


1. Simple Linear Regression

Now let’s learn how to do simple modeling with base R. We will do linear regression, which is simple yet very important in modeling data. There are two basic types of linear regression:

  • simple linear regression: there is a single independent variable

\[ y = \beta_0 + \beta_1 x\]

  • multiple linear regression: there is more than one independent variable

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]

The goal is to find the best estimates of the \(\beta_i\) coefficients based on the given data set and to quantify the uncertainties.

For this part, we are going to use the following libraries:

library(MASS)
library(ISLR2)

If you have not installed these packages, please install them. The examples are from section 3.6 of the textbook “Introduction to Statistical Learning, 2nd Edition”.

Now let’s use the Boston data set from the ISLR2 library. The data set records medv (median house value) for 506 census tracts in Boston.

Check the help documentation for the meaning of each variable with ?Boston.

class(Boston)
## [1] "data.frame"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

Let’s start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor.

lm.fit <- lm(medv ~ lstat, data = Boston)

The input medv ~ lstat is called a formula. The variable on the left of ~ is the response, and the variables on the right are predictors. The meaning of medv ~ lstat is precisely a linear model:

\[ medv = \beta_0 + \beta_1 * lstat\]

The lm() function then performs the actual data fitting to find out the best estimated values of \(\beta_0\) and \(\beta_1\). The data argument tells lm() which data set to use.

The result of this linear model is stored in the object lm.fit, which is a list. If we just print it, we get a very brief summary of the model, giving the values of the estimated coefficients.

lm.fit
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95

So \(\beta_0\) (intercept) is estimated to be 34.55, and \(\beta_1\) (coefficient of lstat) is estimated to be -0.95. So our model turns out to be:

\[ \begin{equation} \tag{3} medv = 34.55 - 0.95 * lstat \end{equation}\]

To have a more comprehensive look at the model details, we can use the function summary():

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16


Inferential model vs predictive model

Usually, a model can be used for two main purposes: statistical inference, or prediction on unseen data. With the example above, let’s see how the two purposes are different but related to each other.


Prediction

For prediction, we simply put a new value of lstat into our model as the input variable, and it gives us a predicted output value for \(medv\). The following code predicts values of \(medv\) for \(lstat = 5, 10, 15\).

predict(lm.fit, data.frame(lstat = c(5, 10, 15)))
##        1        2        3 
## 29.80359 25.05335 20.30310

But note that this prediction comes from model (3), which is a deterministic model predicting the average medv value for a given lstat. Usually we also want to know the uncertainty of this prediction. We may do the following:

predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
##        fit      lwr      upr
## 1 25.05335 24.47413 25.63256
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
##        fit      lwr      upr
## 1 25.05335 12.82763 37.27907

So given an lstat of 10 (10 percent lower-status population), the model predicts the average median home value to be 25.053. The 95% confidence interval for this average is [24.47, 25.63].

The 95% interval for an individual value is [12.83, 37.28] (as specified by "prediction" for the interval argument). This is called a prediction interval. It is much wider due to the randomness of the value from home to home. In other words, the random error \(\epsilon\) in equation (2) also plays a role here.
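If you are curious where the wider prediction interval comes from, here is a small sketch (not required) that rebuilds both intervals from their pieces: the confidence interval only uses the uncertainty of the fitted average, while the prediction interval also adds the residual variance \(\sigma^2\) from equation (2).

pred      <- predict(lm.fit, data.frame(lstat = 10), se.fit = TRUE)
sigma_hat <- summary(lm.fit)$sigma                  # residual standard error
t_crit    <- qt(0.975, df = lm.fit$df.residual)

se_mean <- pred$se.fit                              # uncertainty of the average medv
se_new  <- sqrt(sigma_hat^2 + pred$se.fit^2)        # uncertainty of an individual medv

pred$fit + c(-1, 1) * t_crit * se_mean              # reproduces the confidence interval
pred$fit + c(-1, 1) * t_crit * se_new               # reproduces the prediction interval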


Inference

For inference, we usually need to perform some kind of statistical test (such as a hypothesis test) to understand and explain relationships between variables in the underlying population. These inferences often concern the mechanisms and forces shaping the data.

For example, if we want to know whether lstat has a significant impact on medv, we may simply check its p-value in the lm.fit result.

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

The result shows that the p-value for the hypothesis test \(H_0: \beta_1 = 0\) vs \(H_a: \beta_1 \neq 0\) is nearly zero. So we reject \(H_0\), and our data serve as evidence that lstat has a non-negligible effect on medv.
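Instead of reading the p-value off the printed output, we can also pull it out programmatically from the coefficient table returned by summary():

coef_table <- summary(lm.fit)$coefficients
coef_table["lstat", "Estimate"]    # the estimated slope
coef_table["lstat", "Pr(>|t|)"]    # p-value for H0: beta_1 = 0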

We can also check the confidence interval estimates for our parameters:

confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

We will have another lecture on the different statistical tests that are typically performed on tabular data for different data types and purposes.


lm object is a list

We can check the data structure of lm.fit by

typeof(lm.fit)
## [1] "list"
class(lm.fit)
## [1] "lm"
str(lm.fit)
## List of 12
##  $ coefficients : Named num [1:2] 34.55 -0.95
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "lstat"
##  $ residuals    : Named num [1:506] -5.82 -4.27 3.97 1.64 6.71 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:506] -506.86 -152.46 4.43 2.12 7.13 ...
##   ..- attr(*, "names")= chr [1:506] "(Intercept)" "lstat" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:506] 29.8 25.9 30.7 31.8 29.5 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:506, 1:2] -22.4944 0.0445 0.0445 0.0445 0.0445 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:506] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "lstat"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.04 1.02
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 504
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = medv ~ lstat, data = Boston)
##  $ terms        :Classes 'terms', 'formula'  language medv ~ lstat
##   .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. ..$ : chr "lstat"
##   .. ..- attr(*, "term.labels")= chr "lstat"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  $ model        :'data.frame':   506 obs. of  2 variables:
##   ..$ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##   ..$ lstat: num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language medv ~ lstat
##   .. .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. .. ..$ : chr "lstat"
##   .. .. ..- attr(*, "term.labels")= chr "lstat"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  - attr(*, "class")= chr "lm"

We see that there is a lot of information stored in lm.fit! Usually, we don’t work on lm objects directly using $. Rather, we use helper (accessor) functions to access the information we need, such as:

coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494
confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
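A few other commonly used accessor functions for lm objects include:

head(fitted(lm.fit))      # fitted values (model predictions for the training data)
head(residuals(lm.fit))   # residuals (observed medv minus fitted values)
sigma(lm.fit)             # residual standard error
nobs(lm.fit)              # number of observations used in the fit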


Visualize fitting results

In this example, most likely we would like to see how well our model fits the data by plotting the fitted line on top of the raw data:

attach(Boston)
plot(lstat, medv)
abline(lm.fit)

Here we use the attach() function to make the columns of the Boston data set directly available to any relevant functions. The plot() function is the base R function for creating a scatter plot. The abline() function accepts the coefficients from an lm result and draws the corresponding fitted line.
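If you prefer not to attach() the data set (attaching can cause name clashes, and it is easy to forget to undo it with detach()), the same plot can be made by referring to the columns explicitly:

plot(Boston$lstat, Boston$medv)
abline(lm.fit)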

Of course we can do the same by ggplot:

ggplot(Boston) + 
  geom_point(aes(x = lstat, y = medv)) + 
  geom_line(aes(x = lstat, y = coef(lm.fit)[1] + lstat * coef(lm.fit)[2]), color = "blue")

An important step in data modeling is to validate the model assumptions. For example, simple linear regression assumes that the residual errors are centered at zero, have constant variance, and are approximately normally distributed. We can start checking these assumptions with the residual plot:

plot(lm.fit, which = 1)

The dashed line shows the zero-error line, and we expect the errors to be scattered about zero without any dependence on lstat (or, equivalently, on the fitted values). However, here we clearly see that the error does depend on lstat, which casts doubt on the validity of the model.
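Essentially the same diagnostic can be drawn by hand with the accessor functions from earlier, which makes clear what plot(lm.fit, which = 1) is showing:

plot(fitted(lm.fit), residuals(lm.fit))
abline(h = 0, lty = 2)   # the zero-error reference line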

We can also make a normality plot to check the residual distribution:

plot(lm.fit, which = 2)

The errors also deviate from normality in the upper tail (theoretical quantiles between roughly 1 and 3). So a linear model may be problematic here. This is expected, since the raw data show a non-linear trend, and a quadratic model would probably work better.
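The normality plot above is essentially a normal Q-Q plot of the standardized residuals, so a similar check can be done by hand:

qqnorm(rstandard(lm.fit))   # standardized residuals vs theoretical quantiles
qqline(rstandard(lm.fit))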


Fit a polynomial model to data

To fit a polynomial model (in this case a quadratic model) to data, we may do the following:

attach(Boston)
lm.fit2 <- lm(medv ~ poly(lstat, 2))

This fits the data to the model

\[ medv = \beta_0 + \beta_1 * lstat + \beta_2 * lstat^2\]

(Note that poly() uses orthogonal polynomials by default, so the printed coefficients are not the raw \(\beta\)'s above, although the fitted curve is the same; use poly(lstat, 2, raw = TRUE) if you want coefficients on lstat and lstat^2 directly.)

We can check how well the model fits the data with the following code:

plot(lstat, medv)
ix <- sort(lstat, index.return = TRUE)$ix                       # indices ordering observations by lstat
lines(lstat[ix], predict(lm.fit2)[ix], col = "red", lwd = 2)    # fitted quadratic curve
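One common way (not the only one) to formally compare the linear fit and the quadratic fit is a nested-model F test; a very small p-value here supports keeping the quadratic term:

anova(lm.fit, lm.fit2)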

More details on polynomial models will be given in future labs.