Estimation

We can start by recalling some basic notions from the lecture notes. By a multiple linear model we mean something like this:

\[ Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p+ \epsilon \]

where the \(\beta_i\), \(i=0,\dots,p\), are unknown parameters and \(\boldsymbol{\epsilon}\) is the error term, which may (or may not) be assumed to follow a Gaussian distribution, with \(E(\boldsymbol{\epsilon})=\boldsymbol{0}\) and \(Var(\boldsymbol{\epsilon})=\sigma^2I\).

Please make sure you understand why this is the case and what the consequences of these assumptions are for the model's properties (more details in the lecture notes).

In statistical modeling, and in the linear model framework in particular, we assume a linear relationship between the dependent variable and one or more independent variables or predictors, also known as covariates. The term “linear” refers to the fact that the parameters enter the model linearly: the effect of each term on the dependent variable is assumed to be proportional to its coefficient. The predictors themselves, however, do not have to enter the model untransformed. The linear model can be used with non-linear transformations of the predictors, as long as the relationship between the transformed predictors and the dependent variable remains linear in the parameters. This flexibility makes the linear model a powerful tool for analyzing a wide range of data sets, from simple to complex, and for making predictions based on those data. For example:

\[ Y=\beta_0+\beta_1X_1+\beta_2\log{X_2}+\beta_3 X_1 X_2+\epsilon \]

Despite the presence of a logarithmic term and an interaction term, this equation is still a linear model, because it is linear in its parameters, i.e. the coefficients (\(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\)) that multiply each term. The dependent variable \(Y\) is a linear combination of the parameters, which are constants that do not depend on any function or transformation of the predictor variables. The logarithmic term and the interaction term therefore do not violate the linearity assumption of the model. But what about the following model?

\[ Y=\beta_0+\beta_1X_1^{\beta_2}+\epsilon \]

This equation is not a linear model, because it is not linear in the parameters: \(\beta_2\) enters as an exponent of \(X_1\), so the mean of \(Y\) cannot be written as a linear combination of \(\beta_0\), \(\beta_1\) and \(\beta_2\). Equivalently, the derivative of the mean response with respect to \(\beta_2\) depends on the parameters themselves, which violates the linearity-in-parameters assumption.

Nonlinear models, like the one above, require more complex techniques for estimation and inference, such as nonlinear least squares.
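
To make the distinction concrete, here is a minimal sketch with simulated data (the names x1, x2, y, fit_lin and fit_nl are purely illustrative): a model with log and interaction terms can be fitted directly with lm(), whereas the power model above requires a nonlinear routine such as nls(), which needs starting values for the parameters.

set.seed(1)                                   # simulated, illustrative data
x1 <- runif(100, 1, 10)
x2 <- runif(100, 1, 10)
y  <- 1 + 2*x1 + 3*log(x2) + 0.5*x1*x2 + rnorm(100)

# linear in the parameters: lm() handles transformed predictors directly
fit_lin <- lm(y ~ x1 + log(x2) + x1:x2)

# not linear in the parameters: beta_2 enters as an exponent,
# so we need nonlinear least squares and starting values
y_pow  <- 1 + 2*x1^1.5 + rnorm(100)
fit_nl <- nls(y_pow ~ b0 + b1*x1^b2, start = list(b0 = 0, b1 = 1, b2 = 1))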

Now let’s take a look at an example to understand how things work in the R language. The dataset we are going to examine today concerns the number of species found on the various Galapagos Islands.

library(faraway)
data(gala)
head(gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

The gala dataset includes several variables that describe different aspects of the Galapagos Islands: Species, the number of plant species found on each island; Endemics, the number of endemic species; Area, the size of each island in square kilometers; Elevation, the highest elevation on each island in meters; Nearest, the distance to the nearest neighboring island in kilometers; Scruz, the distance to Santa Cruz island, also in kilometers; and Adjacent, the area of the adjacent island in square kilometers. To learn more about these variables, you can access the data description by running the command ?gala in R.

str(gala)
## 'data.frame':    30 obs. of  7 variables:
##  $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
##  $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
##  $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
##  $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
##  $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
##  $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
##  $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...

Everything seems to be formatted correctly. In the lab exam this will not necessarily be the case, so pay attention to this step. We can fit a linear model in R using the lm() function, which fits linear regression models to data. In brief, you need two ingredients: a model formula of the form response ~ predictors, and a data frame containing the variables.

In our case the response is \(y=\) Species, so:

lmod <- lm(Species ~ Area+ Elevation+ Nearest + Scruz + Adjacent, data=gala) # fit the model
summary(lmod)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
##     data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.068221  19.154198   0.369 0.715351    
## Area        -0.023938   0.022422  -1.068 0.296318    
## Elevation    0.319465   0.053663   5.953 3.82e-06 ***
## Nearest      0.009144   1.054136   0.009 0.993151    
## Scruz       -0.240524   0.215402  -1.117 0.275208    
## Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171 
## F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

From this model object we can extract the regression quantities we need: residuals() gives the residuals \(e_i = y_i - \hat{y}_i\), fitted() returns the fitted values \(\hat{y}_i\), df.residual() returns the residual degrees of freedom, deviance() gives the residual sum of squares (RSS), and coef() gives the estimates \(\hat{\beta}_i\) for \(i=0,\dots,p\).

residuals(lmod)
##       Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major 
##   -58.725946    38.273154   -26.330659    14.635734    38.383916   -25.087705 
## Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina 
##    -9.919668    19.018992   -20.314202   -28.785943    49.343513    -3.989598 
##     Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow 
##    62.033276   -59.633796    40.497176   -39.403558   -37.694540    -2.037233 
##        Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador 
##  -111.679486   -42.475375   -23.075807    -5.553122    73.048122   -40.676318 
##    SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf 
##   182.583587   -23.376486    89.383371    -5.805095   -36.935732    -5.700573
fitted(lmod) 
##       Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major 
##  116.7259460   -7.2731544   29.3306594   10.3642660  -36.3839155   43.0877052 
## Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina 
##   33.9196678   -9.0189919   28.3142017   30.7859425   47.6564865   96.9895982 
##     Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow 
##   -4.0332759   64.6337956   -0.4971756  386.4035578   88.6945404    4.0372328 
##        Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador 
##  215.6794862  150.4753750   35.0758066   75.5531221  206.9518779  277.6763183 
##    SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf 
##  261.4164131   85.3764857  195.6166286   49.8050946   52.9357316   26.7005735
# Please make sure that (species- fitted values) gives the residuals
df.residual(lmod) 
## [1] 24
# dof are 24, does it make sense to you ? 
deviance(lmod)
## [1] 89231.37
coef(lmod) 
##  (Intercept)         Area    Elevation      Nearest        Scruz     Adjacent 
##  7.068220709 -0.023938338  0.319464761  0.009143961 -0.240524230 -0.074804832
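
As the comment above suggests, we can quickly verify that the residuals are exactly the observed values minus the fitted values:

max(abs(gala$Species - fitted(lmod) - residuals(lmod))) # should be essentially zero (up to rounding)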

You can even extract other quantities by examining the model object and its summary:

names(lmod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
lmodsum<- summary(lmod)
names(lmodsum)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
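
For instance, the residual standard error, \(R^2\) and the full coefficient table can be pulled directly from the summary object:

lmodsum$sigma         # residual standard error (60.98 in the output above)
lmodsum$r.squared     # multiple R-squared
lmodsum$adj.r.squared # adjusted R-squared
lmodsum$coefficients  # estimates, standard errors, t-values and p-values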

Now we can check the model assumptions:

plot(lmod$residuals)   # index plot of the residuals

qqnorm(lmod$residuals) # normal Q-Q plot of the residuals

plot(lmod, 1:2)        # first two built-in diagnostic plots: Residuals vs Fitted and Normal Q-Q

Are the residuals reasonably normally distributed with mean zero and constant variance?

When applied to a fitted lm object, the plot() function generates four diagnostic plots by default: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.
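
To display all four at once on a single graphics device:

par(mfrow=c(2,2)) # arrange the plots in a 2x2 grid
plot(lmod)        # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage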

In linear regression it is also useful to have some measure of goodness of fit, i.e. a summary of how well the model fits the data. One common choice is \(R^2\); we recall its formulation here:

\[ R^2=1-\frac{\sum_i(\hat{y}_i-y_i)^2}{\sum_{i}(y_i-\bar{y})^2} \]

\(R^2\) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables in a regression model. It is commonly used as a goodness-of-fit measure for linear regression models.

The \(R^2\) value ranges from 0 to 1, where 0 indicates that the model explains none of the variability in the dependent variable, and 1 indicates that the model explains all of the variability. However, it is rare to see an \(R^2\) value of 1 in practice, as it would mean that the model perfectly predicts the dependent variable with no error.

In general, a higher \(R^2\) value indicates a better fit of the model to the data. However, the interpretation of what constitutes a “good” \(R^2\) value depends on the specific context and the goals of the analysis.

For example, in some fields, such as physics or engineering, models with \(R^2\) values of 0.9 or higher may be considered good. In contrast, in social sciences or economics, where human behavior is much more complex and difficult to predict, an \(R^2\) value of 0.2 to 0.3 may be considered a good fit.

It’s important to keep in mind that \(R^2\) should not be the only criterion for evaluating a model’s performance, and it should be used in conjunction with other metrics such as residual plots, cross-validation, and statistical significance tests to ensure that the model is valid and reliable for the specific analysis at hand.
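
As a quick check, we can compute \(R^2\) by hand for the gala fit above; it should match the Multiple R-squared reported by summary() (0.7658 in the output above).

RSS <- deviance(lmod)                              # residual sum of squares
TSS <- sum((gala$Species - mean(gala$Species))^2)  # total sum of squares
1 - RSS/TSS                                        # should equal summary(lmod)$r.squared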

It is a mistake to rely on \(R^2\) as the sole measure of fit. Let's see an example:

Anscombe_data <- read.csv('Anscombe_quartet_data.csv') # Anscombe's quartet: four datasets with nearly identical summary statistics
par(mfrow=c(2,2))
{plot(Anscombe_data$x123, Anscombe_data$y1, col='orange', xlim=c(0, max(Anscombe_data$x123)))
plot(Anscombe_data$x123, Anscombe_data$y2, col='orange', xlim=c(0, max(Anscombe_data$x123)))
plot(Anscombe_data$x123, Anscombe_data$y3, col='orange', xlim=c(0, max(Anscombe_data$x123)))
plot(Anscombe_data$x4, Anscombe_data$y4, col='orange', xlim=c(0, max(Anscombe_data$x4)))}

In the figure we see the famous Anscombe quartet. If we fit a simple linear model to each of these four datasets, it turns out that \(R^2 \approx 0.67\) in all four cases, despite the datasets being very different. Do you see why statistical modelling is more than just math?
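
If you do not have the CSV file at hand, R also ships with a built-in anscombe data frame, so the claim can be checked along these lines (the column names x1–x4 and y1–y4 refer to the built-in dataset, not to the course CSV):

data(anscombe)
sapply(1:4, function(i) {
  fit <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  summary(fit)$r.squared
})
# the four R-squared values are essentially identical (about 0.67)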

An alternative measure of fit is \(\hat{\sigma}\), the residual standard error; its advantage is that it is measured in the units of the response \(Y\). The regression summary returns both quantities, and it is worth always paying attention to both of them. It is also good practice to explore your data before fitting any model you have in mind!
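
Both quantities can be retrieved from the summary object of the gala fit; \(\hat{\sigma}\) can also be computed from the RSS and the residual degrees of freedom:

lmodsum$r.squared                        # R-squared
lmodsum$sigma                            # residual standard error reported by summary()
sqrt(deviance(lmod) / df.residual(lmod)) # the same quantity computed from the RSS and the d.o.f.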

Diagnostics

Now that we have estimated our linear model, it is time to check the assumptions seen in the lecture notes. It is good practice to check the regression diagnostics before using the model for any kind of inference (e.g. tests, confidence intervals, predictions).

We can divide the potential problems into two main categories:

  1. We have assumed that \(\boldsymbol{\epsilon}\sim N(\boldsymbol{0},\sigma^2I)\). With this assumption we are not only saying that the error terms are all centered at zero and have constant variance; we are also saying that they are uncorrelated (do you see why?).

  2. We are assuming that the structural part of the model, \(E(\boldsymbol{Y})=\boldsymbol{X}\boldsymbol{\beta}\), is correct and is the same for all observations, i.e. there is no structural break in the data.

Let's start by checking the independence, constant variance and normality assumptions on the errors \(\boldsymbol{\epsilon}\). As already discussed, the error terms are not observable, but we can estimate them using the residuals \(\boldsymbol{e}\).

Merely plotting the residuals on their own is not enough to verify the constant variance assumption; we also need to see how they behave relative to the fitted values. One of the most important diagnostic plots is therefore the plot of residuals versus fitted values. If all assumptions hold, we should see a symmetric band of points with constant spread in the vertical direction. This plot not only reveals heteroscedasticity but can also detect non-linearity.

res1 <- rnorm(100)                 # well-behaved residuals: constant variance
res2 <- rnorm(100, sd=1:50)        # spread increases: heteroscedasticity
res3.1 <- rnorm(33, mean=-8, sd=1) # mean shifts in blocks: a non-linear pattern
res3.2 <- rnorm(33, mean=0, sd=1)
res3.3 <- rnorm(33, mean=-4, sd=1)
res3 <- c(res3.1, res3.2, res3.3)
par(mfrow=c(1,3))
{plot(res1, xlab='Fitted', ylab='Residuals', main='No problem')
abline(h=0)
plot(res2, xlab='Fitted', ylab='Residuals', main='Heteroscedasticity')
abline(h=0)
plot(res3, xlab='Fitted', ylab='Residuals', main='Non-linear')
abline(h=0)}

Plotting the residuals against a predictor variable that is not included in the model (i.e. plotting \(\boldsymbol{e}\) against a candidate \(\boldsymbol{x}_i\)) can also be helpful. If any pattern or structure is apparent in such a plot, it may suggest that this predictor should be incorporated into the model.
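
For example, in the gala fit above, Endemics was left out of the model, so we can plot the residuals against it and look for structure:

plot(gala$Endemics, residuals(lmod),
     xlab='Endemics', ylab='Residuals') # residuals vs a predictor not in the model
abline(h=0)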

It is often challenging to judge residual plots without prior experience, so we can generate some artificial plots to get an idea of what a good (or bad) model looks like:

par(mfrow=c(1,3))
n <- 200
for(i in 1:3) {x <- runif(n); plot(x, rnorm(n), main='Constant Variance')}              # no problem

for(i in 1:3) {x <- runif(n); plot(x, x*rnorm(n), main='Non-constant Variance')}        # spread grows with x

for(i in 1:3) {x <- runif(n); plot(x, cos(x*pi) + rnorm(n, sd=1), main='Nonlinearity')} # systematic pattern

Since tests and confidence intervals rely heavily on the normality assumption, it is advisable to check its validity. The normality of the residuals can be assessed with a Q-Q (quantile-quantile) plot, a graphical method for comparing the distribution of a sample to a theoretical distribution, such as the normal. The plot displays the sample quantiles on the vertical axis and the theoretical quantiles on the horizontal axis. If the sample follows the theoretical distribution, the points fall approximately along a straight line; departures from a straight line indicate deviations from that distribution. Q-Q plots can be used to assess normality, or to compare a sample to other distributions, such as the uniform or the exponential.

data(savings)
lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings) # savings data from faraway; note this overwrites the gala fit
{qqnorm(residuals(lmod))
qqline(residuals(lmod))}

Normal residuals should fall approximately along the reference line. Just to get an idea, we can simulate draws from some non-normal distributions and see what their Q-Q plots look like:

par(mfrow=c(1,3))
n <- 200

for(i in 1:3) {x <- rnorm(n); qqnorm(x); qqline(x)} # normal distribution (baseline case)

for(i in 1:3) {x <- exp(rnorm(n)); qqnorm(x); qqline(x)} # lognormal distribution 

for(i in 1:3) {x <- rcauchy(n); qqnorm(x); qqline(x)} # cauchy distribution

for(i in 1:3) {x <- runif(n); qqnorm(x); qqline(x)} # uniform distribution

There is no one-size-fits-all remedy for non-normality; the appropriate solution depends on the nature of the problem. In the case of skewed errors, transforming the response variable may be enough to resolve the issue. If non-normality persists despite a transformation, an alternative is to acknowledge it and base inference on a different distribution, or to employ resampling methods.
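
For instance, going back to the gala data, where the response is a positive count, a square-root transformation of the response is a common first attempt. This is only a sketch of the idea, not a claim that it is the best model for these data; the name lmod_sqrt is illustrative.

lmod_sqrt <- lm(sqrt(Species) ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
{qqnorm(residuals(lmod_sqrt))
qqline(residuals(lmod_sqrt))}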

If you would like to formally test whether the residuals follow a normal distribution, you can use the Shapiro-Wilk test.

shapiro.test(residuals(lmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod)
## W = 0.98698, p-value = 0.8524

Under the null hypothesis the residuals are normally distributed. Since the p-value is large, we do not reject the null.

Exercises

  1. Try to obtain the vector of coefficient estimates by using the formula in the lecture notes \(\hat{\boldsymbol{\beta}}=\boldsymbol{(X^TX)^{-1}X^Ty}\).
  2. Try to obtain \(\hat{\sigma}^2\) from the lm() output and compare it with the result obtained using the formula from the lecture notes.
  3. The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with expenditure on gambling as the response and sex, status, income and verbal score as the predictors.
     1. Try to replicate the output of the summary() function by computing all the quantities using the formulas seen in the lecture notes.
     2. What percentage of variation in the response is explained by these predictors?
     3. Which observation has the largest (positive) residual? Give the case number.
     4. Compute the mean and the median of the residuals. Are they symmetric and centered around zero?
     5. Compute the correlation of the residuals with the fitted values.
     6. Compute the correlation of the residuals with income.
     7. With all other predictors held constant, what would be the difference in predicted expenditure on gambling for a male compared to a female?
  4. In this question, we investigate the relative merits of methods for computing the coefficients. Generate some artificial data by:
x <- 1:20
y <- x + rnorm(20)

Fit a polynomial in x for predicting y. Compute \(\hat{\boldsymbol{\beta}}\) in two ways: by using lm() and by the direct calculation described in the lecture notes. At what degree of polynomial does the direct calculation method fail? (Hint: you need to use the I() function when fitting the polynomial, e.g. lm(y ~ x + I(x^2)).)
