library(tidyverse)
library(modelr)
library(nycflights13)
Starting with this module, we are going to learn some modeling basics that set the foundation for future courses such as data mining or machine learning.
Within the context of data science, a model usually refers to a mathematical model that captures the relationship between the variables of a given data set.
For example, a simple linear model can be represented by a function
\[ y = ax + b\]
Here \(y\) is called the target variable or the response variable, and \(x\) can be called a feature, predictor, or independent variable, etc. In a linear model, \(y\) is usually a numeric variable, and \(x\) can be numeric or categorical.
\(a\) and \(b\) are coefficients of the model. They are usually determined by data using different algorithms and approaches.
This model is called a linear model since no matter what values \(a\) and \(b\) take, \(y\) and \(x\) always follow a linear relationship.
On the other hand, a quadratic model is in the function form of:
\[ y = ax^2 + bx + c\]
It is obvious that the name “quadratic” comes from the fact that when \(a \neq 0\), \(y\) is a quadratic function of \(x\).
An important way to classify models is by whether the model formula contains a random error that is not accounted for by the given data. For example, if we have data on \(x\) and \(y\), then the model
\[ \begin{equation}\tag{1} y = \beta_0 + \beta_1 x \end{equation}\]
is a deterministic model because, given any \(x\) value, we can compute a definite \(y\) value once the parameters \((\beta_0, \beta_1)\) are fixed.
On the other hand, a model is called stochastic if there is an error term that can be assumed to follow some probability distribution independent of \(x\) but cannot be predicted with certainty. For example,
\[ \begin{equation}\tag{2} y = \beta_0 + \beta_1 x + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2) \end{equation}\] Here we assume that even for a given \(x\) value, there is still some variability in \(y\) that cannot be accounted for, which is of course nearly always true for real data. However, formulas (1) and (2) frequently lead to the same fitted model when we use the least squares method for the parameters in (1) and the maximum likelihood estimate for the parameters in (2). You will learn more details in the Linear Regression course.
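As a quick illustration (a minimal sketch with made-up parameter values, not from the text), we can simulate data from a stochastic model like (2) and check that least squares recovers the parameters:
# Simulate y = 2 + 3x + epsilon with epsilon ~ N(0, 1.5^2); the values 2, 3, 1.5 are arbitrary
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 1.5)
coef(lm(y ~ x))   # estimates should be close to (2, 3)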
First, let's understand the mathematical principle behind building a model. Usually the goal is to find the optimal parameters for a particular model by minimizing the error between the model prediction and the measured data.
Let's look at the sim1 data set, which is a toy example in the modelr package. This data set contains artificial data about the relationship between two continuous random variables:
sim1
## # A tibble: 30 × 2
## x y
## <int> <dbl>
## 1 1 4.20
## 2 1 7.51
## 3 1 2.13
## 4 2 8.99
## 5 2 10.2
## 6 2 11.3
## 7 3 7.36
## 8 3 10.5
## 9 3 10.5
## 10 4 12.4
## # ℹ 20 more rows
ggplot(sim1, aes(x,y)) +
geom_point()
There is an apparent pattern in the data - a linear relationship. So we are going to use a simple linear model to fit the data:
\[ y = \beta_0 + \beta_1 x\]
However, out of all possible lines (depending on the values of the intercept \(\beta_0\) and the slope \(\beta_1\)), which one is the best? Of course, we hope to minimize the error, and for that we need a metric. The most commonly used metric in statistical modeling is the mean squared error (MSE):
In the figure above (which shifts the \(x\) values of sim1 a little bit to make things more visually clear), the solid line is a linear fit to the data shown as scattered points. The error for each data point is marked by a blue vertical segment, whose length represents the error (the distance between the model prediction and the actual target value).
The mathematical principle here is to minimize the mean squared error:
\[ \mathrm{find} \; \beta_0, \beta_1 \; \mathrm{to \; minimize \;} \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i})^2\]
where \(\hat{y_i}\) is the model prediction for each \(x_i\) value; in a linear model it is
\[ \hat{y_i} = \beta_0 + \beta_1 x_i \] Such a solution is called the least squares solution (LSE). The mathematical details of how to find the values of \(\beta_0\) and \(\beta_1\) are introduced in courses such as Regression Analysis.
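To see this principle in action, here is a small sketch (not from the text) that minimizes the MSE on sim1 numerically and compares the result with the least squares fit from lm():
# Mean squared error of a candidate line (beta[1] = intercept, beta[2] = slope) on a data set
mse <- function(beta, data) {
  y_hat <- beta[1] + beta[2] * data$x
  mean((data$y - y_hat)^2)
}
opt <- optim(c(0, 0), mse, data = sim1)   # numerical minimization from the starting guess (0, 0)
opt$par                                   # approximately the same as ...
coef(lm(y ~ x, data = sim1))              # ... the least squares coefficients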
Linear models are simple, while non-linear models are more flexible since they have more parameters. An important rule of thumb in data modeling is that a more flexible model is not necessarily better than a simple one, due to the overfitting problem. Which one to use depends on factors such as data size, data quality, and the actual trend in the data. Let's look at the following example to understand this.
The figure below is from the textbook “Introduction to Statistical Learning”. The left figure shows the data of wage (in $1,000) vs age from an income survey of men from the central Atlantic region of the United States. The right figure shows the data of wage vs year for the same data set.
For wage vs age, we see that on average, wage first increases with age and then decreases at around age 60. This makes sense, since usually people earn more as their working experience increases, while after retirement people usually earn less since they don't work as much as before.
This increasing-then-decreasing trend can be captured by a quadratic model, but not by a linear model. Therefore, it is appropriate to model the relationship between wage and age as a quadratic function.
On the other hand, for wage vs year, on average there is a slow but steady increase between 2003 and 2009. In this case, a linear model is enough to capture the steadily increasing or decreasing trend.
Although a linear model is only a special case of the quadratic model (the case where the quadratic term vanishes), using the more flexible quadratic model is not necessarily the better choice, again because of the risk of overfitting.
You will learn more details in the course on statistical learning. This is only a very brief introduction to what modeling is.
Now let's learn how to do simple modeling with base R. We will do linear regression, which is very simple yet very important in modeling data. There are two basic types of linear regression:
- simple linear regression: there is a single independent variable
\[ y = \beta_0 + \beta_1 x\]
- multiple linear regression: there is more than one independent variable
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]
The goal is to find the best estimates for the \(\beta_i\) values based on the given data set, and to quantify their uncertainties. A quick preview of the formula syntax is sketched below.
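Here is a minimal sketch with made-up data (df, x1, x2, and y are placeholder names, not from the text); both types of regression use the same lm() formula syntax that we will meet shortly:
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(50)   # made-up coefficients
fit_simple   <- lm(y ~ x1, data = df)             # simple linear regression
fit_multiple <- lm(y ~ x1 + x2, data = df)        # multiple linear regression
coef(fit_multiple)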
For this part, we are going to use the following libraries:
library(MASS)
library(ISLR2)
If you have not installed these packages, please install them. The examples are from section 3.6 of the textbook “Introduction to Statistical Learning, 2nd Edition”.
Now let's use the Boston data set from the ISLR2 library. The data set records medv (median house value) for 506 census tracts in Boston. Check the help documentation for the meaning of each variable with ?Boston.
class(Boston)
## [1] "data.frame"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
Let's start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor.
lm.fit <- lm(medv ~ lstat, data = Boston)
The input medv ~ lstat is called a formula. The variable on the left of ~ is the response, and the variables on the right are the predictors. The meaning of medv ~ lstat is precisely a linear model:
\[ medv = \beta_0 + \beta_1 * lstat\]
The lm() function then performs the actual data fitting to find the best estimated values of \(\beta_0\) and \(\beta_1\). The data argument tells lm() which data set to use.
The result of this linear model is stored in the object lm.fit, which is a list. If we just print it, we get a very brief summary of the model giving the values of the estimated coefficients.
lm.fit
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
So \(\beta_0\) (the intercept) is estimated to be 34.55, and \(\beta_1\) (the coefficient of lstat) is estimated to be -0.95. So our model turns out to be:
\[ \begin{equation} \tag{3} medv = 34.55 - 0.95 * lstat \end{equation}\]
To get a more comprehensive look at the model details, we can use the summary() function:
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Usually, a model serves two main purposes - statistical inference or prediction on unseen data. With the example above, let's see how these two purposes differ but relate to each other.
For prediction, we simply put a new value of lstat into our model as the input variable, and it gives us a predicted output value for \(medv\). The following code predicts values of \(medv\) for \(lstat = 5, 10, 15\).
predict(lm.fit, data.frame(lstat = c(5, 10, 15)))
## 1 2 3
## 29.80359 25.05335 20.30310
But note that this prediction is from model (3), which is a deterministic model that predicts the average medv value for a given lstat. Usually we also want to know the uncertainty of this prediction. We may do the following:
predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
## fit lwr upr
## 1 25.05335 24.47413 25.63256
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
## fit lwr upr
## 1 25.05335 12.82763 37.27907
So given lstat equal to 10 (10 percent lower status in the population), the model predicts the average median house value to be 25.053, and the 95% CI for this average is [24.47, 25.63]. The 95% interval for the value of an individual house is [12.83, 37.28] (as specified by "prediction" for the interval argument). This is called a prediction interval. It has much bigger uncertainty due to the randomness of values from home to home. In other words, the random error \(\epsilon\) in equation (2) also plays a role here.
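To visualize the difference between the two kinds of intervals, here is a sketch (not from the text) that overlays the confidence band and the (wider) prediction band on the data:
new_data <- data.frame(lstat = seq(min(Boston$lstat), max(Boston$lstat), length.out = 100))
ci <- predict(lm.fit, new_data, interval = "confidence")   # band for the average medv
pi <- predict(lm.fit, new_data, interval = "prediction")   # band for an individual medv
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")
lines(new_data$lstat, ci[, "fit"], lwd = 2)
lines(new_data$lstat, ci[, "lwr"], lty = 2)
lines(new_data$lstat, ci[, "upr"], lty = 2)
lines(new_data$lstat, pi[, "lwr"], lty = 3)
lines(new_data$lstat, pi[, "upr"], lty = 3)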
For inference, we usually need to perform some kind of statistical test (such as a hypothesis test) to understand and explain relationships between variables in the underlying population. These inferences often concern the mechanisms and forces shaping the data.
For example, if we want to know whether lstat has a significant impact on medv, we may simply check its p-value in the lm.fit result.
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
The result shows that the p-value for the hypothesis test \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) is nearly zero. So we should reject \(H_0\), and our data can serve as evidence that lstat does have a non-negligible effect on medv.
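If you want to pull that p-value out programmatically rather than reading it off the printout, one small sketch (not from the text) is to index the coefficient table returned by summary():
coef(summary(lm.fit))                        # coefficient table as a numeric matrix
coef(summary(lm.fit))["lstat", "Pr(>|t|)"]   # p-value for the lstat slope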
We can also check the confidence interval estimates for our parameters:
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
We will have another lecture on the different statistical tests that are typically performed on tabular data for different data types and purposes.
lm object is a list
We can check the data structure of lm.fit by
typeof(lm.fit)
## [1] "list"
class(lm.fit)
## [1] "lm"
str(lm.fit)
## List of 12
## $ coefficients : Named num [1:2] 34.55 -0.95
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "lstat"
## $ residuals : Named num [1:506] -5.82 -4.27 3.97 1.64 6.71 ...
## ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
## $ effects : Named num [1:506] -506.86 -152.46 4.43 2.12 7.13 ...
## ..- attr(*, "names")= chr [1:506] "(Intercept)" "lstat" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:506] 29.8 25.9 30.7 31.8 29.5 ...
## ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:506, 1:2] -22.4944 0.0445 0.0445 0.0445 0.0445 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:506] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "lstat"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.04 1.02
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 504
## $ xlevels : Named list()
## $ call : language lm(formula = medv ~ lstat, data = Boston)
## $ terms :Classes 'terms', 'formula' language medv ~ lstat
## .. ..- attr(*, "variables")= language list(medv, lstat)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "medv" "lstat"
## .. .. .. ..$ : chr "lstat"
## .. ..- attr(*, "term.labels")= chr "lstat"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(medv, lstat)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
## $ model :'data.frame': 506 obs. of 2 variables:
## ..$ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## ..$ lstat: num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language medv ~ lstat
## .. .. ..- attr(*, "variables")= language list(medv, lstat)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "medv" "lstat"
## .. .. .. .. ..$ : chr "lstat"
## .. .. ..- attr(*, "term.labels")= chr "lstat"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(medv, lstat)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "medv" "lstat"
## - attr(*, "class")= chr "lm"
We see that there is a lot of information stored in lm.fit! Usually, we don't work directly on lm objects using $. Rather, we use helper functions to access the information we need, such as:
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
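A few other commonly used accessors (shown here as a brief sketch, not from the text) include:
head(fitted(lm.fit))        # fitted values for each observation
head(residuals(lm.fit))     # residuals (observed minus fitted)
summary(lm.fit)$r.squared   # R-squared of the fit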
In this example, most likely we would like to see how well our model fits the data by plotting the fitted line on top of the raw data:
attach(Boston)
plot(lstat, medv)
abline(lm.fit)
Here we use the attach() function to tell R that we will always use the Boston data set for any relevant functions. The plot() function is the base R function to create a scatter plot. The abline() function accepts coefficients from an lm result and creates the corresponding fitted line.
Of course we can do the same with ggplot:
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_abline(intercept = coef(lm.fit)[1], slope = coef(lm.fit)[2], color = "blue")   # draw the fitted line directly from the lm coefficients
An important step in data modeling is to validate the model assumptions. For example, a simple linear regression assumes that the residual error should be normally distributed. We can check this assumption by making the residual plot:
plot(lm.fit, which = 1)
The dashed line shows the zero-error line, and we expect the residuals to be scattered randomly about zero without any dependence on lstat. However, here we clearly see that the error does depend on lstat, which casts doubt on the validity of the model.
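To check the dependence on lstat directly (the default diagnostic plot uses the fitted values on the x-axis), here is a quick sketch (not from the text):
plot(Boston$lstat, residuals(lm.fit), xlab = "lstat", ylab = "residuals")
abline(h = 0, lty = 2)   # zero-error reference line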
We can also make a normality plot to check the residual distribution:
plot(lm.fit, which = 2)
The residuals are also not quite normally distributed in the upper tail (theoretical quantiles roughly between 1 and 3). So a linear model may be problematic here. This is expected since the raw data show a non-linear trend, so a quadratic model would probably work better here.
To fit a polynomial model (in this case a quadratic model) to data, we may do the following:
lm.fit2 <- lm(medv ~ poly(lstat, 2), data = Boston)   # Boston is already attached, but data = is more explicit
This fits data to the model
\[ medv = \beta_0 + \beta_1 * lstat + \beta_2 * lstat^2\]
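One caveat: by default poly() uses orthogonal polynomials, so the printed coefficients of lm.fit2 are not the raw \(\beta\) values in the formula above, although the fitted values are identical. If you want coefficients on the raw scale, a sketch (not from the text):
lm.fit2_raw <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)   # raw polynomial terms
lm.fit2_alt <- lm(medv ~ lstat + I(lstat^2), data = Boston)           # equivalent formula
coef(lm.fit2_raw)   # these are the beta's in the formula above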
We can check how well the model fits the data by the code:
plot(lstat, medv)
ix <- sort(lstat, index.return = TRUE)$ix                        # indices that order the x values
lines(lstat[ix], predict(lm.fit2)[ix], col = "red", lwd = 2)     # add the fitted quadratic curve
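To check more formally whether the quadratic term improves on the simple linear fit, we can compare the two nested models (a brief sketch, not from the text):
anova(lm.fit, lm.fit2)   # F-test comparing the linear and quadratic fits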
More details on polynomial models will be given in future labs.