Linear Regression

Basic Scatter Plots

To create the basic scatter plots used in the linear regression workshop, use the plot() function in base R.

Usage

plot(x, y, type = "type", ...)

Arguments

x → x data vector

y → y data vector

type = "type" → argument to choose plot type

  • "l" for a line graph
  • "p" for a point graph

... → Other optional inputs, often called graphical parameters. These are typically universal across all graphing functions in base R.

  • Some useful graphical parameters:
    • main = "title" adds a title to the graph
    • xlab = "x label" labels the x axis
    • ylab = "y label" labels the y axis
    • col = "color" changes the color of the main graph element
    • border = "color" changes the border color of bar graphs/histograms
    • lwd = 1 changes the line width (any positive number; 1 is the default)
    • lty = "linetype" changes the line type (line graphs)
      • "solid" (the default)
      • "dashed"
      • "dotted"
      • "dotdash"
      • "longdash"
      • "twodash"

Examples

# A dataset called income_data was previously loaded into R from a .csv file

# create new x and y vectors for simplicity
income = income_data$income
happiness = income_data$happiness

# basic plot
plot(income, happiness, 
     # scatter plot
     type = "p")

# plot with title and axis labels 
plot(income, happiness,
     # scatter plot
     type = "p",
     
     # labels 
     main = "Yearly Income Vs. Happiness",
     xlab = "Income (thousands of USD)",
     ylab = "Happiness (scale of 1 - 10)")

# plot with labels and colors 
plot(income, happiness,
     # scatter plot
     type = "p",
     
     # labels 
     main = "Yearly Income Vs. Happiness",
     xlab = "Income (thousands of USD)",
     ylab = "Happiness (scale of 1 - 10)",
     
     # point color
     col = "plum4",
     # point line width
     lwd = 2)
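
The examples above assume income_data has already been loaded from a .csv file. To try the same plot() calls without that file, a simulated stand-in can be used. This is only a sketch: the name income_data_sim, the sample size, and the numbers used to generate the values are assumptions made for illustration and are not the workshop data.

```r
# Hypothetical stand-in for income_data; all values are simulated, not real
set.seed(42)                                  # make the random draws repeatable
n = 100
sim_income = runif(n, min = 1.5, max = 7.5)
sim_happiness = 0.2 + 0.7 * sim_income + rnorm(n, sd = 0.7)
income_data_sim = data.frame(income = sim_income,
                             happiness = sim_happiness)

# same call pattern as the examples above
plot(income_data_sim$income, income_data_sim$happiness,
     # scatter plot
     type = "p",

     # labels
     main = "Simulated Income Vs. Happiness",
     xlab = "Income (simulated)",
     ylab = "Happiness (simulated)",

     # point color
     col = "plum4")
```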

Linear Regression Trendline

To add a linear trend line to a scatter plot, call abline() on the line directly after plot() in your code.

Usage

abline(lm(y_var ~ x_var), ...)

Arguments

lm(y_var ~ x_var) → Linear model function

  • This function generates the information abline() needs to draw the line, including the slope (m) and intercept (b) coefficients
    • IMPORTANT: The y variable MUST come first inside the lm() function, on the left side of the ~

... → Other optional inputs, often called graphical parameters. abline() only draws a line on an existing plot, so titles and axis labels (main, xlab, ylab) belong in plot() rather than here.

  • Some useful graphical parameters:
    • col = "color" changes the color of the line
    • lwd = 1 changes the line width (any positive number; 1 is the default)
    • lty = "linetype" changes the line type
      • "solid" (the default)
      • "dashed"
      • "dotted"
      • "dotdash"
      • "longdash"
      • "twodash"
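
To see why the order of the variables inside lm() matters, the sketch below fits the same simulated data in both directions. The vectors x_var and y_var and the names fit_yx and fit_xy are made up for illustration. The two fits describe different lines, and only the y ~ x fit matches a plot with x on the horizontal axis.

```r
# Simulated data, for illustration only
set.seed(1)
x_var = runif(50, min = 0, max = 10)
y_var = 1 + 2 * x_var + rnorm(50)

fit_yx = lm(y_var ~ x_var)   # y modeled from x (what abline() needs here)
fit_xy = lm(x_var ~ y_var)   # x modeled from y (a different line)

# compare the coefficients of the two fits
coef(fit_yx)
coef(fit_xy)

# abline() also accepts a stored model, not only an inline lm() call
plot(x_var, y_var, type = "p")
abline(fit_yx, col = "plum4", lwd = 2)
```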

Examples

# basic plot and linear regression line
plot(income, happiness, 
     # scatter plot
     type = "p")
abline(lm(happiness ~ income))

# adding colors, labels, and line types
plot(income, happiness,
     # scatter plot
     type = "p",
     
     # labels 
     main = "Yearly Income Vs. Happiness",
     xlab = "Income (thousands of USD)",
     ylab = "Happiness (scale of 1 - 10)",
     
     # point color
     col = "plum3",
     # point line width
     lwd = 2)
abline(lm(happiness ~ income),
      # line type
      lty = "twodash",
      
      # color
      col = "plum4",
      
      # line width
      lwd = 2
      )

Linear Model Function

The linear model function is crucial for both the visual analysis (it was used above to add the linear trendline to the scatter plot) and the statistical analysis of the linear regression model. In short, the lm() function fits the model and calculates its coefficients, the slope m and the intercept b.

Usage

variable_name = lm(y_var ~ x_var)

Arguments

y_var → y data vector

x_var → x data vector

Notes

  • Calling the lm() function on its own prints some information, but to store the results and access them later, you need to assign the output to a variable.

  • The coefficients are stored in the model in an element called coefficients. They can be isolated using the dollar sign $ and double square brackets [[ ]].
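
As a small sketch of the note above, the coefficients can be pulled out by position, as described, or by name. The data here are simulated and the names example_lm, b_by_position, and so on are assumptions for illustration. coef() is the standard extractor function in base R and returns the same values as the $coefficients element.

```r
# Simulated data, for illustration only
set.seed(2)
x_var = runif(40, min = 0, max = 10)
y_var = 3 + 0.5 * x_var + rnorm(40, sd = 0.3)
example_lm = lm(y_var ~ x_var)

# by position: intercept is first, slope is second
b_by_position = example_lm$coefficients[[1]]
m_by_position = example_lm$coefficients[[2]]

# by name: the slope is named after the x variable in the formula
b_by_name = coef(example_lm)[["(Intercept)"]]
m_by_name = coef(example_lm)[["x_var"]]
```

Looking coefficients up by name can be easier to read later, since it does not rely on remembering which position holds which value.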

Examples

# first create the linear model
income_lm = lm(happiness ~ income)

# view results in console
income_lm
## 
## Call:
## lm(formula = happiness ~ income)
## 
## Coefficients:
## (Intercept)       income  
##      0.2043       0.7138
# Viewing coefficients 
income_lm$coefficients
## (Intercept)      income 
##   0.2042704   0.7138255
# Isolating coefficients 
b = income_lm$coefficients[[1]]
m = income_lm$coefficients[[2]]

# Put parentheses around statements to auto-print them to the console
(b = income_lm$coefficients[[1]])
## [1] 0.2042704
(m = income_lm$coefficients[[2]])
## [1] 0.7138255
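
Once m and b are isolated, they can be used to compute predicted y values with y = m * x + b. The sketch below uses simulated data (not the income data) and hypothetical names such as new_x and by_hand, and compares the hand calculation with base R's predict() function.

```r
# Simulated data, for illustration only
set.seed(3)
x_var = runif(40, min = 0, max = 10)
y_var = 3 + 0.5 * x_var + rnorm(40, sd = 0.3)
example_lm = lm(y_var ~ x_var)

b = example_lm$coefficients[[1]]   # intercept
m = example_lm$coefficients[[2]]   # slope

# predict y for some new x values by hand
new_x = c(2, 5, 8)
by_hand = m * new_x + b

# the same predictions using predict(); the column name in newdata
# must match the x variable name used in the formula
by_predict = predict(example_lm, newdata = data.frame(x_var = new_x))

# TRUE when the two approaches agree
all.equal(by_hand, unname(by_predict))
```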

Testing Assumptions

Linear regression rests on four assumptions, most of which can be checked using the linear model variable and the plot() function. A general description of using plot() for this comes first, followed by explanations and examples of each assumption.

Usage

plot(model_variable) or plot(model_variable, x)

Arguments

model_variable → variable containing the linear model created with lm()

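x → number of the plot to display (plot() receives it as its which argument)

  • 1 for Residuals vs. Fitted
  • 2 for Q-Q plot
  • 3 for Scale-Location
  • 4 for Cook's distance (the fourth plot shown when stepping through the default set is Residuals vs. Leverage, which is number 5)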

Notes

  • Using the plot() function with an lm() linear model variable works a little differently from plotting vectors. It produces 4 plots in a specific order and lets you move through them in one of two ways:
    1. Provide plot() with just the lm() linear model variable and press the enter key to step through the plots
    2. Provide plot() with the lm() linear model variable and x, the number of the specific plot that you want
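
A third option, sketched below, is to show all four default plots on one page by splitting the plotting window into a 2 x 2 grid with par(mfrow = c(2, 2)). The data are simulated and the names example_lm and old_par are made up for illustration.

```r
# Simulated data, for illustration only
set.seed(4)
x_var = runif(60, min = 0, max = 10)
y_var = 1 + 2 * x_var + rnorm(60)
example_lm = lm(y_var ~ x_var)

# split the plotting window into 2 rows and 2 columns,
# saving the previous setting so it can be restored
old_par = par(mfrow = c(2, 2))

# draws the four default diagnostic plots into the grid
plot(example_lm)

# go back to one plot per page
par(old_par)
```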

Examples

  1. Linearity: The relationship between x and the mean of y is linear.
    • First plot → plot(model_variable, 1)

      plot(income_lm, 1)

  2. Normality: For any fixed value of x, y is normally distributed.
    • Second plot → plot(model_variable, 2)

      plot(income_lm, 2)

  3. Homoscedasticity: The variance of the residuals is the same for any value of x.
    • Third plot → plot(model_variable, 3)

      plot(income_lm, 3)

  4. Independence: Observations are independent of each other.
    • This assumption cannot be tested visually. Instead, check it by examining the experimental design

Model Quality Analysis

Once the assumptions of linear regression have been tested and the model has been created, there are various ways within R to test the accuracy of the model. Most of them come from examining the variable that contains the lm() linear model. You can do this using the summary() function.

Usage

summary(model_variable)

Examples

summary(income_lm)
## 
## Call:
## lm(formula = happiness ~ income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16
  • Residuals

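    • The model summary reports a five-number summary of the residuals (minimum, first quartile, median, third quartile, maximum). Calling summary() on the residuals stored in the linear model variable gives the same values plus the mean, and the residuals can also be used to create a boxplot: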

      summary(income_lm$residuals)
      ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
      ## -2.02479 -0.48526  0.04078  0.00000  0.45898  2.37805
      boxplot(income_lm$residuals,
        # main title 
        main = "Income Model",
        # font size 
        cex.main = 2, 
      
        # y label
        ylab = "Residuals",
      
        # colors
        col = "plum3",
        border = "plum4"
        )

    • Residual Standard Error: At the bottom of the summary, the residual standard error of the model is displayed. It reports both the standard error and degrees of freedom.

  • R-Squared Values are located at the bottom of the summary under the residual standard error. The summary reports both a multiple R-squared and an adjusted R-squared measurement.

  • F-Statistic is located under the R-squared values and is also reported with its degrees of freedom.
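
The values described above can also be pulled out of the summary as numbers instead of being read off the printed output. The sketch below uses simulated data, and names such as example_summary and r_squared_by_hand are assumptions for illustration. It also recomputes the multiple R-squared by hand as 1 minus the residual sum of squares divided by the total sum of squares.

```r
# Simulated data, for illustration only
set.seed(5)
x_var = runif(80, min = 0, max = 10)
y_var = 1 + 2 * x_var + rnorm(80)
example_lm = lm(y_var ~ x_var)
example_summary = summary(example_lm)

# residual standard error and its degrees of freedom
example_summary$sigma
df.residual(example_lm)

# multiple and adjusted R-squared
example_summary$r.squared
example_summary$adj.r.squared

# F-statistic with its two degrees of freedom (value, numdf, dendf)
example_summary$fstatistic

# multiple R-squared computed by hand
ss_res = sum(residuals(example_lm)^2)        # residual sum of squares
ss_tot = sum((y_var - mean(y_var))^2)        # total sum of squares
r_squared_by_hand = 1 - ss_res / ss_tot
```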