Linear Regression review: handout

Introduction

The lm() function is used to fit linear models. It has the following generic structure:

lm(y ~ x, data)

where the first argument specifies the variables used in the model; in this example, the model regresses a response variable y against an explanatory variable x.

The second argument is used only when the dataframe name is not already specified in the first argument.

Running the function creates an object (of class ‘lm’) that contains several components, such as the model coefficients.

The model coefficients are directly displayed upon running lm(), while other components can be accessed through either the $ notation or specific functions like summary().

The NHANES dataset

The National Health and Nutrition Examination Survey (NHANES) consists of a set of surveys and measurements conducted by the US CDC to assess the health and nutritional status of adults and children in the United States. The following example uses data from a sample of 500 adults (individuals ages 21 and older) from the NHANES dataset - nhanes.samp.adult.500.

The following example illustrates the linear relationship between BMI and age in the NHANES data, with age as a predictor of BMI. Adding a best-fitting line to these data using regression techniques would allow for prediction of an individual’s BMI based on their age. The linear model could also be used to investigate questions about the population-level relationship between BMI and age, since the data are a random sample from the population of adults in the United States.

Example 1 of 1: `lm()` with NHANES

Click here for more details

#load the data
library(oibiostat)
data("nhanes.samp.adult.500")

#fitting linear model
lm(BMI ~ Age, data = nhanes.samp.adult.500)

## 
## Call:
## lm(formula = BMI ~ Age, data = nhanes.samp.adult.500)
## 
## Coefficients:
## (Intercept)          Age  
##    28.40113      0.01982

The following example shows a scatterplot with a least squares regression line: Click here for more details

nhanes.samp.adult.500 %>% 
  select(c(Age, BMI)) %>%  ## select age and BMI
  drop_na() %>% ## drop missing values
  ggplot(aes(x = Age, y = BMI)) + 
  geom_point(color = pal_aaas("default", alpha = 1)(9)[6]) + ## generate scatterplot
  geom_smooth(method = lm, se = FALSE, color = pal_aaas("default", alpha = 1)(9)[4]) + ## add best-fitting least squares line
  theme_bw() + 
  xlab("Age (in years)") +
  ylab("BMI") + 
  labs(title = "Scatterplot of BMI and Age in NHANES data (n = 500)", 
       subtitle = "Straight line is the best 'fitted' line.") ##pretty plot produced!

## `geom_smooth()` using formula 'y ~ x'

Next we consider the problem of extracting residuals from a model fit: Click here for more details

#name the model object
model.BMIvsAge = lm(BMI ~ Age, data = nhanes.samp.adult.500)

#extract residuals with residuals()
residuals = residuals(model.BMIvsAge)
#alternatively... extract residuals with $residuals
residuals = model.BMIvsAge$residuals

The following example demonstrates extracting predicted values from the model of BMI versus age in nhanes.samp.adult.500: Click here for more details

#extract predicted values with predict()
predicted = predict(model.BMIvsAge)

#alternatively... extract predicted values with $fitted.values
predicted = model.BMIvsAge$fitted.values

Checking assumptions with residual plots: Click here for more details

There are four assumptions that must be met for a linear model to be considered reasonable: linearity, constant variability, independent observations, and normally distributed residuals. In a scatterplot, we plot the residuals and fitted values.

When a linear model is a good fit for the data, the residuals should scatter around the horizontal line $y = 0$ with no apparent pattern. Does a linear model seem appropriate for these data? Yes

Does the variability of the residuals seem constant across the range of predicted RFFT scores? Yes

Next we consider a histogram of the residuals; we will compare the shape of the histogram with a normal density curve.

data = as_tibble(residuals) %>% rename(x = value)
ggplot(data = data, aes(x)) + 
  geom_histogram(aes(y = ..density..),
                 fill = pal_aaas("default", alpha = 1)(9)[6]) + 
  theme_bw() +
  stat_function(fun = dnorm,
                args = list(mean = mean(data$x),
                            sd = sd(data$x)),
                col = pal_aaas("default", alpha = 1)(9)[5],
                size = 2) +
  ylab("Density") +
  xlab("Residuals")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Do the residuals appear to be normally distributed? ???

Finally, we consider a Q-Q plot for the residuals: the quantile-quantile plot is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as normal. It’s just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from similar distribution, we should see the points forming a line that’s roughly straight.

data = as_tibble(residuals) %>% rename(x = value) 
ggplot(data = data, 
       aes(sample = x)) + 
  stat_qq(color = pal_aaas("default", alpha = 1)(9)[6]) + 
  stat_qq_line(color = pal_aaas("default", alpha = 1)(9)[4], 
               linetype = "dashed",
               size = 2) +
  theme_bw() + 
  ylab("Sample quantiles") + 
  xlab("Theoretical quantiles") +
  labs(title = "Normal Q-Q plot of residuals in NHANES data (n = 500)")

Now do the residuals appear to be normally distributed? ???

Obtaining model summary of coefficients: Click here for more details

summary(model.BMIvsAge)

## 
## Call:
## lm(formula = BMI ~ Age, data = nhanes.samp.adult.500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.586  -4.668  -1.235   3.610  39.575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.40113    0.96172  29.531   <2e-16 ***
## Age          0.01982    0.01825   1.086    0.278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.815 on 495 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.002377,   Adjusted R-squared:  0.0003618 
## F-statistic:  1.18 on 1 and 495 DF,  p-value: 0.278

Linear Regression review: handout

Housekeeping

Introduction

The NHANES dataset

Example 1 of 1: lm() with NHANES

Example 1 of 1: `lm()` with NHANES