To create the basic scatter plots used in the linear regression workshop, use the plot() function in base R.
Usage
plot(x, y, type = "type", ...)
Arguments
x
→ x data vector
y
→ y data vector
type = "type"
→ argument to choose plot type
"l" for a line graph
"p" for a point graph
...
→ Other optional inputs, often called graphical parameters. These are typically universal across all graphing functions in base R.
main = "title" adds a title to the graph
xlab = "x label" labels the x axis
ylab = "y label" labels the y axis
col = "color" changes the color of the main graph element
border = "color" changes the border color of bar graphs/histograms
lwd = 1 changes the line width (a positive number, default 1)
lty = "linetype" changes the line type (line graphs); options include "dashed", "dotdash", "longdash", and "twodash"
Examples
# A dataset called income_data was previously loaded into R from a .csv file
# create new x and y vectors for simplicity
income = income_data$income
happiness = income_data$happiness
# basic plot
plot(income, happiness,
# scatter plot
type = "p")
# plot with title and axis labels
plot(income, happiness,
# scatter plot
type = "p",
# labels
main = "Yearly Income Vs. Happiness",
xlab = "Income (thousands of USD)",
ylab = "Happiness (scale of 1 - 10)")
# plot with labels and colors
plot(income, happiness,
# scatter plot
type = "p",
# labels
main = "Yearly Income Vs. Happiness",
xlab = "Income (thousands of USD)",
ylab = "Happiness (scale of 1 - 10)",
# point color
col = "plum4",
# point line width
lwd = 2)
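The examples above all use type = "p". As a minimal sketch of the other option, type = "l", the following draws a line graph with a dashed line. The vectors day and temperature are made-up values for illustration only and are not part of the workshop data.

```r
# Hypothetical data for illustration only (not from income_data)
day = 1:10
temperature = c(12, 14, 13, 15, 18, 17, 19, 21, 20, 22)

# line graph with labels, color, line width, and line type
plot(day, temperature,
     # line graph
     type = "l",
     # labels
     main = "Temperature by Day",
     xlab = "Day",
     ylab = "Temperature (degrees C)",
     # line color, width, and type
     col = "plum4",
     lwd = 2,
     lty = "dashed")
```

Because lty only affects lines, it has no visible effect on a type = "p" scatter plot.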
To add a linear trend line to a plot, call the abline() function on the line directly after your plot() call in your code.
Usage
abline(lm(y_var ~ x_var), ...)
Arguments
lm(y_var ~ x_var)
→ Linear model function. It gives abline() the information needed to add the line to the plot, including the m and b coefficients. The lm() function is covered in more detail later.
...
→ Other optional inputs, often called graphical parameters. These are typically universal across all graphing functions in base R. For abline(), col, lwd, and lty are the ones that affect the line.
main = "title" adds a title to the graph
xlab = "x label" labels the x axis
ylab = "y label" labels the y axis
col = "color" changes the color of the main graph element
border = "color" changes the border color of bar graphs/histograms
lwd = 1 changes the line width (a positive number, default 1)
lty = "linetype" changes the line type (line graphs); options include "dashed", "dotdash", "longdash", and "twodash"
Examples
# basic plot and linear regression line
plot(income, happiness,
# scatter plot
type = "p")
abline(lm(happiness ~ income))
# adding colors, labels, and line types
plot(income, happiness,
# scatter plot
type = "p",
# labels
main = "Yearly Income Vs. Happiness",
xlab = "Income (thousands of USD)",
ylab = "Happiness (scale of 1 - 10)",
# point color
col = "plum3",
# point line width
lwd = 2)
abline(lm(happiness ~ income),
# line type
lty = "twodash",
# color
col = "plum4",
# line width
lwd = 2
)
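abline() can also draw a line from an intercept and slope passed directly through its a and b arguments, which shows what the lm() call is supplying. The sketch below uses simulated data as a stand-in for the workshop's income_data (an assumption made so the example runs on its own), and the names fit, intercept, and slope are illustrative.

```r
# Simulated stand-in for the workshop data (assumption: not the real income_data)
set.seed(1)
income = runif(50, min = 1, max = 8)
happiness = 0.2 + 0.7 * income + rnorm(50, sd = 0.7)

# fit the model once and pull out the intercept and slope
fit = lm(happiness ~ income)
intercept = coef(fit)[[1]]
slope = coef(fit)[[2]]

# scatter plot, then the same trend line that abline(fit) would draw
plot(income, happiness, type = "p")
# a is the intercept, b is the slope
abline(a = intercept, b = slope, col = "plum4", lwd = 2)
```

Passing the model itself, as in abline(lm(happiness ~ income)), is shorter. The explicit form is mainly useful when the coefficients come from somewhere else.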
The linear model function is crucial for both the visual analysis (remember how it was used to add the linear trend line to the scatter plot) and the actual statistical analysis of the linear regression model. In a basic sense, the lm() function calculates the model, including the m (slope) and b (intercept) coefficients.
Usage
variable_name = lm(y_var ~ x_var)
Arguments
y_var
→ y data vector
x_var
→ x data vector
Notes
Calling the lm() function on its own will print some information, but to store the results and access them later, you need to assign the function's result to a variable.
The coefficients are stored in the model in an element called coefficients. They can be isolated using the dollar sign $ and double square brackets [[ ]].
Examples
# first create the linear model
income_lm = lm(happiness ~ income)
# view results in console
income_lm
##
## Call:
## lm(formula = happiness ~ income)
##
## Coefficients:
## (Intercept) income
## 0.2043 0.7138
# Viewing coefficients
income_lm$coefficients
## (Intercept) income
## 0.2042704 0.7138255
# Isolating coefficients
b = income_lm$coefficients[[1]]
m = income_lm$coefficients[[2]]
# Put parentheses around statements to auto-print them to the console
(b = income_lm$coefficients[[1]])
## [1] 0.2042704
(m = income_lm$coefficients[[2]])
## [1] 0.7138255
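Once m and b are isolated, they can be plugged into y = m * x + b to predict a value by hand. The sketch below checks that calculation against R's predict() function. It uses simulated data in place of income_data (an assumption so the example is self-contained), and new_income, by_hand, and by_predict are hypothetical names.

```r
# Simulated stand-in for the workshop data (assumption: not the real income_data)
set.seed(1)
income = runif(50, min = 1, max = 8)
happiness = 0.2 + 0.7 * income + rnorm(50, sd = 0.7)

# create the linear model and isolate the coefficients
income_lm = lm(happiness ~ income)
b = income_lm$coefficients[[1]]
m = income_lm$coefficients[[2]]

# predicted happiness for a hypothetical income of 5 (thousands of USD)
new_income = 5
by_hand = m * new_income + b

# the same prediction using predict(); newdata must use the model's x name
by_predict = predict(income_lm, newdata = data.frame(income = new_income))

# compare the two results (unname() drops the label predict() attaches)
all.equal(by_hand, unname(by_predict))
```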
There are four assumptions of linear regression, most of which can be tested using the linear model variable and the plot() function. A general description of using plot() to test them comes first, followed by explanations and examples of each.
Usage
plot(model_variable)
or plot(model_variable, x)
Arguments
model_variable
→ variable containing the linear model created with lm()
x
→ number of the plot to display
1 for Residuals vs. Fitted
2 for Q-Q plot
3 for Scale-Location
4 for Cook's distance (Residuals vs. Leverage, the fourth plot shown when no number is given, is 5)
Notes
The plot() function works in a somewhat peculiar way when given an lm() linear model variable. It produces 4 plots in a specific order and allows you to view them in one of two ways:
Call plot() with just the lm() linear model variable and use the Enter key to iterate through them
Call plot() with the lm() linear model variable and x, the number of the specific plot that you want
Examples
First plot → plot(model_variable, 1)
plot(income_lm, 1)
Second plot → plot(model_variable, 2)
plot(income_lm, 2)
Third plot → plot(model_variable, 3)
plot(income_lm, 3)
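A third option, not covered above, is to show all of the default diagnostic plots at once by splitting the plotting window with par(mfrow = ...). This is a minimal sketch using simulated data in place of income_data (an assumption so it runs on its own). The name old_par is illustrative.

```r
# Simulated stand-in for the workshop data (assumption: not the real income_data)
set.seed(1)
income = runif(50, min = 1, max = 8)
happiness = 0.2 + 0.7 * income + rnorm(50, sd = 0.7)
income_lm = lm(happiness ~ income)

# split the plotting window into a 2 x 2 grid and remember the old settings
old_par = par(mfrow = c(2, 2))

# draws the four default diagnostic plots, one per grid cell
plot(income_lm)

# restore the previous layout so later plots fill the whole window
par(old_par)
```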
Once the assumptions of linear regression have been tested and the model has been created, there are various ways within R to test the accuracy of the model. Most of them come from examining the variable that contains the lm() linear model. You can do this using the summary() function.
Usage
summary(model_variable)
Examples
summary(income_lm)
##
## Call:
## lm(formula = happiness ~ income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02479 -0.48526 0.04078 0.45898 2.37805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20427 0.08884 2.299 0.0219 *
## income 0.71383 0.01854 38.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
Residuals
The summary includes a five-number summary of the residuals, which provides general information about them. The residuals are also stored in the linear model variable, where they can be summarized on their own and used to create a boxplot:
summary(income_lm$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.02479 -0.48526 0.04078 0.00000 0.45898 2.37805
boxplot(income_lm$residuals,
# main title
main = "Income Model",
# font size
cex.main = 2,
# y label
ylab = "Residuals",
# colors
col = "plum3",
border = "plum4"
)
Residual Standard Error: At the bottom of the summary, the residual standard error of the model is displayed, along with its degrees of freedom.
R-Squared Values: These are located at the bottom of the summary under the residual standard error. Both a multiple R-squared and an adjusted R-squared measurement are reported.
F-Statistic: This is located under the R-squared values and is also reported with degrees of freedom and a p-value.
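The values described above can also be pulled out of the summary as numbers instead of being read off the console. The sketch below shows one way to do that using elements of the object returned by summary(). It uses simulated data in place of income_data (an assumption so it runs on its own), so the numbers will differ from the output shown earlier. The name income_summary is illustrative.

```r
# Simulated stand-in for the workshop data (assumption: not the real income_data)
set.seed(1)
income = runif(50, min = 1, max = 8)
happiness = 0.2 + 0.7 * income + rnorm(50, sd = 0.7)
income_lm = lm(happiness ~ income)

# store the summary so its elements can be accessed with $
income_summary = summary(income_lm)

# residual standard error and its degrees of freedom
income_summary$sigma
df.residual(income_lm)

# multiple and adjusted R-squared
income_summary$r.squared
income_summary$adj.r.squared

# F-statistic: value, numerator df, denominator df
income_summary$fstatistic
```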