0. Introduction


Starting with this module, we are going to learn some modeling basics to lay the foundation for future courses such as data mining or machine learning.

The focus of our course is data visualization and exploration. However, those tasks usually serve the purpose of producing better models. Therefore, understanding some modeling basics leads to a better EDA process, with a clearer sense of the next-step work.


What is a model?

Within the context of data science, a model usually refers to a mathematical model that captures the relationship between variables of a given data set.

For example, a simple linear model can be represented by the function

\[ y = ax + b\]

Here \(y\) is called the target variable or the response variable, and \(x\) can be called a feature, predictor, or independent variable, etc. In a linear model, \(y\) is usually a numeric variable, and \(x\) can be numeric or categorical.

\(a\) and \(b\) are the coefficients of the model. They are usually determined from the data using different algorithms and approaches.

This model is called a linear model since no matter what values \(a\) and \(b\) take, \(y\) and \(x\) always follow a linear relationship.

On the other hand, a quadratic model is in the function form of:

\[ y = ax^2 + bx + c\]

It is obvious that the name “quadratic” comes from the fact that when \(a \neq 0\), \(y\) is a quadratic function of \(x\).
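
To make the contrast concrete, here is a minimal base R sketch (with made-up coefficients, purely for illustration) plotting one function of each type:

# Linear: a straight line for any choice of a and b
curve(2 * x + 1, from = -5, to = 5, ylab = "y")
# Quadratic: curved whenever the coefficient of x^2 is nonzero
curve(-x^2 + 2 * x + 1, from = -5, to = 5, lty = 2, add = TRUE)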


Example

When should we use a linear model, and when a quadratic model? Let’s look at an example to understand this.

The figure below is from the textbook “Introduction to Statistical Learning”. The left panel shows wage (in $1,000s) vs age from an income survey of men in the central Atlantic region of the United States. The right panel shows wage vs year for the same data set.

For wage vs age, we see that, on average, wage first increases with age and then decreases after around age 60. This makes sense: usually people earn more as their working experience increases, but after retirement they earn less since they no longer work as much.

This increasing-then-decreasing trend can be captured by the quadratic model, but not by the linear model. Therefore, it is appropriate to model the relationship between wage and age as a quadratic function.

On the other hand, for wage vs year, on average there is a slow but steady increase between 2003 and 2009. In this case, a linear model is enough to capture the steadily increasing or decreasing trend.

You may wonder: since a linear model is simply a special case of a quadratic model where the quadratic coefficient \(a=0\), why don’t we just always use the more flexible quadratic model? That is because this flexibility comes at a cost:

  • More flexible (complicated) models are more prone to overfitting, which means the model captures noise rather than useful information in the data set (see the sketch after this list).
  • More flexible (complicated) models take more computational time. In some situations, this can be critical.
  • More flexible (complicated) models require more data to work well, which is sometimes not available.
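
As a quick illustration of the first point, here is a minimal sketch on simulated data (hypothetical settings, chosen only for display), using the lm() function introduced in the next part: a degree-10 polynomial chases the noise, while a straight line recovers the true linear trend.

# Simulate a small data set whose true relationship is linear
set.seed(1)
x <- runif(15, 0, 10)
y <- 2 * x + 1 + rnorm(15, sd = 3)
fit_lin  <- lm(y ~ x)            # simple linear fit
fit_poly <- lm(y ~ poly(x, 10))  # very flexible polynomial fit
# The solid line follows the trend; the dashed curve chases the noise
plot(x, y)
abline(fit_lin)
grid_x <- seq(0, 10, length.out = 200)
lines(grid_x, predict(fit_poly, data.frame(x = grid_x)), lty = 2)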

You will learn more details in the course on statistical learning. This is only a very brief introduction to what modeling is.


1. Simple Linear Regression

Now let’s learn how to do simple modeling with base R. We will do linear regression, which is simple yet very important in modeling data. There are two basic types of linear regression:

  • simple linear regression: there is only one independent variable

\[ y = \beta_0 + \beta_1 x\]

  • multiple linear regression: there is more than one independent variable

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]

The goal is to find the best estimates of the \(\beta_i\) coefficients based on a given data set, and to quantify their uncertainties.
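
The most common way to determine these estimates is ordinary least squares: choose the coefficients that minimize the sum of squared differences between the observed and predicted responses. For simple linear regression, this criterion reads

\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \]

The lm() function used below computes exactly these least-squares estimates.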

For this part, we are going to use the following libraries:

library(MASS)
library(ISLR2)
library(tidyverse)  # for as_tibble() and glimpse() used below

If you have not installed these packages, please install them. The examples are from section 3.6 of the textbook “Introduction to Statistical Learning, 2nd Edition”. The electronic version is freely accessible online:

https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

Now let’s use the Boston data set from the ISLR2 library. The data set records medv (median house value, in $1,000s) for 506 census tracts in Boston.

Check the help documentation for the meaning of each variable with ?Boston.

class(Boston)
## [1] "data.frame"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

Boston is a data frame. We can convert it to a tibble.

my_Boston <- as_tibble(Boston)
glimpse(my_Boston)
## Rows: 506
## Columns: 13
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

Let’s start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor.

lm.fit <- lm(medv ~ lstat, data = my_Boston)

The input medv ~ lstat is called a formula. The variable on the left of ~ is the response, and the variables on the right are predictors. The meaning of medv ~ lstat is precisely a linear model:

\[ medv = \beta_0 + \beta_1 * lstat\]

The lm() function then performs the actual fitting to find the best estimates of \(\beta_0\) and \(\beta_1\) from the data. The data argument tells lm() which data set to use.

The result of this linear model is stored in the object lm.fit, which is a list. If we just print it, we get a very brief summary of the model giving the estimated coefficients.

lm.fit
## 
## Call:
## lm(formula = medv ~ lstat, data = my_Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95

So \(\beta_0\) (intercept) is estimated to be 34.55, and \(\beta_1\) (coefficient of lstat) is estimated to be -0.95.

To have a more comprehensive look at the model details, we can use the function summary():

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = my_Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
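
For reference, the numbers in this printout can also be extracted programmatically from the object returned by summary():

fit_summary <- summary(lm.fit)
fit_summary$r.squared     # multiple R-squared
fit_summary$sigma         # residual standard error
fit_summary$coefficients  # the coefficient table as a numeric matrix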

We will learn how to read this output shortly. But let’s first check the data structure of lm.fit:

typeof(lm.fit)
## [1] "list"
class(lm.fit)
## [1] "lm"
str(lm.fit)
## List of 12
##  $ coefficients : Named num [1:2] 34.55 -0.95
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "lstat"
##  $ residuals    : Named num [1:506] -5.82 -4.27 3.97 1.64 6.71 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:506] -506.86 -152.46 4.43 2.12 7.13 ...
##   ..- attr(*, "names")= chr [1:506] "(Intercept)" "lstat" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:506] 29.8 25.9 30.7 31.8 29.5 ...
##   ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:506, 1:2] -22.4944 0.0445 0.0445 0.0445 0.0445 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:506] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "lstat"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.04 1.02
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 504
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = medv ~ lstat, data = my_Boston)
##  $ terms        :Classes 'terms', 'formula'  language medv ~ lstat
##   .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. ..$ : chr "lstat"
##   .. ..- attr(*, "term.labels")= chr "lstat"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  $ model        :'data.frame':   506 obs. of  2 variables:
##   ..$ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##   ..$ lstat: num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language medv ~ lstat
##   .. .. ..- attr(*, "variables")= language list(medv, lstat)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "medv" "lstat"
##   .. .. .. .. ..$ : chr "lstat"
##   .. .. ..- attr(*, "term.labels")= chr "lstat"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(medv, lstat)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
##  - attr(*, "class")= chr "lm"

We see that there is a lot of information stored in lm.fit! Usually, we don’t work on lm objects directly using $. Rather, we use accessor functions to extract the information we need:

coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494
confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
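
As a sanity check, a 95% confidence interval can be reproduced by hand as estimate ± t-quantile × standard error, using the values from the summary() output and the t distribution with 504 degrees of freedom; the result should closely match the lstat row above:

# 95% CI for the lstat coefficient, computed manually
-0.95005 + c(-1, 1) * qt(0.975, df = 504) * 0.03873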

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat:

predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
##        fit      lwr      upr
## 1 25.05335 24.47413 25.63256
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
##        fit      lwr      upr
## 1 25.05335 12.82763 37.27907

So given an lstat of 10 (10 percent of the population with lower status), the model predicts the average medv to be 25.053 (in $1,000s). The 95% CI for this average is [24.47, 25.63].

The 95% interval for an individual medv value is [12.83, 37.28] (as specified by "prediction" for the interval argument). This is called a prediction interval. It has much bigger uncertainty due to the additional randomness of individual observations around the average.
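
Note that predict() accepts several new values at once; for example, confidence intervals at lstat values of 5, 10, and 15 (as in section 3.6 of the textbook):

# Confidence intervals for the average medv at three lstat values
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")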

We can also plot the data set along with the fitting line:

attach(my_Boston)
plot(lstat, medv)
abline(lm.fit)

Here we use the attach() function to tell R to look up variables in the my_Boston data set for subsequent functions. The plot() function is the base R function for creating a scatter plot, and the abline() function accepts coefficients from an lm result and draws the corresponding fitted line.
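
Both functions accept the usual base R graphical options; for example, filled points and a thicker red fit line:

plot(lstat, medv, pch = 20)           # pch = 20 gives filled circles
abline(lm.fit, lwd = 3, col = "red")  # thicker, colored fit line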

An important step in data modeling is to validate the model assumptions. For example, simple linear regression assumes that the residual errors are normally distributed. We can check this assumption by making a residual plot:

plot(lm.fit, which = 1)

The dashed line shows the zero-error line, and we expect the residuals to scatter randomly about zero without any dependence on the fitted values (equivalently, on lstat, since it is the only predictor). However, here we clearly see that the error depends on lstat, which casts doubt on the model’s validity.

We can also make a normal Q-Q plot to check whether the residuals are normally distributed:

plot(lm.fit, which = 2)

The residuals also deviate from normality where the theoretical quantiles lie between 1 and 3. So a linear model may be problematic here. You will learn more details in regression analysis and statistical learning courses.
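
The same diagnostics can also be built by hand from accessor functions; for example, plotting the residuals, or the studentized residuals, against the fitted values:

plot(predict(lm.fit), residuals(lm.fit))  # residuals vs fitted values
plot(predict(lm.fit), rstudent(lm.fit))   # studentized residuals vs fitted values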


2. Multiple Linear Regression


Now let’s look at an example of multiple linear regression. This part is from:

https://www.tmwr.org/base-r.html

We are going to use experimental data from McDonald (2009), by way of Mangiafico (2015), on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: O. exclamationis and O. niveus. The data are contained in a data frame called crickets in the modeldata package with a total of 31 data points.

library(tidyverse)

data(crickets, package = "modeldata")
names(crickets)
## [1] "species" "temp"    "rate"
# Plot the temperature on the x-axis, the chirp rate on the y-axis. The plot
# elements will be colored differently for each species:
ggplot(crickets, 
       aes(x = temp, y = rate, color = species, pch = species, lty = species)) + 
  # Plot points for each data point and color by species
  geom_point(size = 2) + 
  # Show a simple linear model fit created separately for each species:
  geom_smooth(method = lm, se = FALSE) + 
  scale_color_brewer(palette = "Paired") +
  labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")

The data exhibit fairly linear trends for each species. For a given temperature, O. exclamationis appears to chirp more per minute than the other species. For an inferential model, the researchers might have specified the following null hypotheses prior to seeing the data:

  • Temperature has no effect on the chirp rate.
  • There are no differences between the species’ chirp rates.

There may be some scientific or practical value in predicting the chirp rate, but in this example we will focus on inference.

For multiple linear regression, we use the following formula

rate ~ temp + species
## rate ~ temp + species

Here rate is the response, and both temp and species are predictors. The true meaning of this formula is not temp “plus” species, but rather a general linear model:

\[ rate = \beta_0 + \beta_1 * temp + \beta_2 * species\]

Note that species is not a quantitative variable; in the data frame, it is represented as a factor column with levels “O. exclamationis” and “O. niveus”. The vast majority of model functions cannot operate on nonnumeric data. For species, the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values.

In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is “O. exclamationis” and a value of one when the data correspond to “O. niveus”.
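
We can inspect this encoding with the base R function model.matrix(), which returns the numeric design matrix that lm() actually fits; the last column is the 0/1 indicator just described:

head(model.matrix(rate ~ temp + species, data = crickets))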

attach(crickets)
lm.fit2 <- lm(rate ~ temp + species)
lm.fit2
## 
## Call:
## lm(formula = rate ~ temp + species)
## 
## Coefficients:
##      (Intercept)              temp  speciesO. niveus  
##           -7.211             3.603           -10.065

To interpret the result: with each degree of temperature increase, the chirp rate increases by about 3.6 chirps per minute on average, and O. niveus chirps about 10.07 fewer times per minute than O. exclamationis on average.

The only issue in this analysis is the intercept value. It indicates that at 0°C, the predicted chirp rate is negative for both species. While this doesn’t make sense, the data only go as low as 17.2°C, so interpreting the model at 0°C would be an extrapolation, which is a bad idea. That being said, the model fit is good within the applicable range of temperature values; the conclusions should simply be limited to the observed temperature range.
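
As a minimal sketch of using the fitted model for prediction (with hypothetical new temperatures chosen inside the observed range), predict() works just as it did for the simple model:

# Predicted chirp rates for O. exclamationis at several temperatures
new_values <- data.frame(species = "O. exclamationis", temp = 18:22)
predict(lm.fit2, new_values)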