In this tutorial, we’ll explore the basics of linear modeling using R. We’ll walk through the steps to perform a simple linear regression analysis, visualize the data, and interpret the results.
Before we start, make sure you have R and RStudio installed.
Linear modeling is a statistical technique used to model the relationship between a dependent variable (response) and one or more independent variables (predictors) by fitting a linear equation.
Simple linear regression models the relationship between a single predictor and the response. The model equation is:
Y = β0 + β1X + ε
Where:
Y is the response variable
X is the predictor variable
β0 is the intercept
β1 is the slope
ε is the error term
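To make the equation concrete, here is a minimal sketch that simulates data from it and checks that a fitted line recovers the parameters. The values chosen for beta0, beta1, the sample size, and the noise level are arbitrary assumptions made for illustration, not part of the tutorial's dataset.

```r
# Simulate data from Y = beta0 + beta1 * X + error (illustrative values only)
set.seed(42)                 # make the random draws reproducible
n     <- 200                 # assumed sample size
beta0 <- 2                   # assumed true intercept
beta1 <- 0.5                 # assumed true slope

x   <- runif(n, min = 0, max = 10)   # predictor values
eps <- rnorm(n, mean = 0, sd = 1)    # random error term
y   <- beta0 + beta1 * x + eps       # response built from the model equation

# Fit a line to the simulated data; the estimates should be close to 2 and 0.5
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because the error term is random, the estimates will be close to, but not exactly equal to, the true values.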
Let’s start by loading a sample dataset. We’ll use the mtcars dataset that comes with R, which contains information about various car models.
# Load the mtcars dataset
data(mtcars)
To take a peek at the first few rows of the dataset, use the following code:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The dataset contains 32 observations on 11 (numeric) variables:
| Variable | Description |
| --- | --- |
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
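The size and column types described above can be checked directly. A minimal sketch using only base R functions (the variable name all_numeric is introduced here for illustration):

```r
# Load the built-in dataset
data(mtcars)

# Number of rows (observations) and columns (variables)
dim(mtcars)

# Check that every column is stored as a numeric vector
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

# Compact overview of each column's type and first few values
str(mtcars)
```

Note that variables such as vs and am are coded as 0/1 numbers even though they represent categories.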
We’ll use mpg as the response variable and hp as the predictor for our simple linear regression.
Now, let’s perform a simple linear regression analysis to model the relationship between mpg and hp.
# Fit a simple linear regression model
model <- lm(mpg ~ hp, data = mtcars)
# Display the summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.7121 -2.1122 -0.8854  1.5819  8.2360
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
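The lm() function fits the model by ordinary least squares, and summary(model) reports the coefficient estimates with their standard errors and p-values, along with overall fit statistics such as R-squared.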
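For a single predictor, the least-squares estimates have a simple closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The sketch below reproduces the lm() estimates by hand; the names fit, slope_manual, and intercept_manual are introduced here for illustration.

```r
# Reproduce the lm() coefficients with the closed-form formulas
data(mtcars)
x <- mtcars$hp
y <- mtcars$mpg

# Slope: cov(x, y) / var(x); intercept: mean(y) - slope * mean(x)
slope_manual     <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

# Fit the same model with lm() for comparison
fit <- lm(mpg ~ hp, data = mtcars)

c(intercept = intercept_manual, slope = slope_manual)
coef(fit)
```

Both approaches should give the same two numbers, up to floating-point rounding.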
To visualize the relationship, we’ll create a scatter plot with the regression line, using the ggplot2 package.
# Install and load the ggplot2 package
# install.packages("ggplot2")
library(ggplot2)
# Create a scatter plot
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Scatter Plot of hp vs. mpg",
       x = "Horsepower",
       y = "Miles per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
This scatter plot shows the relationship between hp and mpg, along with the fitted regression line. The shaded band around the line is the 95% confidence interval that geom_smooth() draws by default.
The plot shows a clear negative relationship between the two variables: as horsepower increases, fuel efficiency (miles per gallon) tends to decrease.
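The direction and strength of this relationship can also be summarized with the Pearson correlation coefficient. In simple linear regression, the squared correlation equals the model's R-squared. A minimal sketch (the names r and fit are introduced here for illustration):

```r
# Correlation between horsepower and fuel efficiency
data(mtcars)
r <- cor(mtcars$hp, mtcars$mpg)
r        # negative, reflecting the downward slope

# Squared correlation, compared with R-squared from the fitted model
fit <- lm(mpg ~ hp, data = mtcars)
r^2
summary(fit)$r.squared
```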
We can evaluate the model’s performance by calculating the mean squared error (MSE).
# Calculate Mean Squared Error (MSE)
predicted <- predict(model)
actual <- mtcars$mpg
mse <- mean((actual - predicted)^2)
mse
## [1] 13.98982
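The lower the MSE, the more closely the model fits the data. Note that this MSE is computed on the same data used to fit the model and is expressed in squared mpg units, so it is most useful for comparing alternative models of the same response.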
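The MSE computed this way divides the sum of squared residuals by n, whereas the residual standard error in the summary() output divides by the residual degrees of freedom (n - 2 here) before taking the square root. The sketch below shows how the two are related; the names fit, res, rmse, and rse are introduced for illustration.

```r
# Relate the MSE to the residual standard error reported by summary()
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)
res <- residuals(fit)
n   <- nrow(mtcars)

mse  <- mean(res^2)                  # divides by n
rmse <- sqrt(mse)                    # back on the mpg scale
rse  <- sqrt(sum(res^2) / (n - 2))   # divides by degrees of freedom

c(mse = mse, rmse = rmse, rse = rse)
sigma(fit)                           # residual standard error from the model
```

The root mean squared error is slightly smaller than the residual standard error because it uses the larger divisor.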
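The summary of the regression model reports the estimated coefficients. The intercept (30.10) is the predicted mpg for a car with 0 horsepower, which is an extrapolation beyond the observed data rather than a meaningful value in itself. The slope (-0.068) means that each additional unit of horsepower is associated with a decrease of about 0.068 miles per gallon on average. The slope's p-value (1.79e-07) indicates that this relationship is statistically significant, and the R-squared of 0.6024 means that horsepower explains about 60% of the variation in mpg.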
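The coefficients can also be extracted programmatically and used for prediction. The following sketch pulls out the estimates with confidence intervals and predicts mpg for two hypothetical cars; the horsepower values 100 and 200 and the names fit, new_cars, and pred are assumptions made for illustration.

```r
# Extract coefficients and predict for new horsepower values
data(mtcars)
fit <- lm(mpg ~ hp, data = mtcars)

# Point estimates and 95% confidence intervals for intercept and slope
coef(fit)
confint(fit, level = 0.95)

# Predicted mpg for hypothetical cars with 100 and 200 horsepower
new_cars <- data.frame(hp = c(100, 200))
pred <- predict(fit, newdata = new_cars, interval = "confidence")
pred
```

The difference between the two predictions equals 100 times the slope, which is one way to read the coefficient in practical terms.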
In this tutorial, we introduced linear modeling in R and performed a simple linear regression analysis using the mtcars dataset that comes with R. You can apply these concepts to analyze and model relationships between variables in other datasets.