Load Packages and Dataset
Packages used to import and analyse the data include
broom - elegant view of the linear regression model’s results
sigr - succinct and correct stat summaries for reports
dplyr - for manipulating data
A simple demo on LM
note that LM was already covered in previous course so below coding parts are relatively straightforward with limited commentaries.
a couple of notes: instead of writing the y~x inside the lm model, we can also define their relationships in as.formula
lm function is from the stats package which is by default loaded
summary(unemployment)## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
fmla <- as.formula(female_unemployment ~ male_unemployment)
fmla## female_unemployment ~ male_unemployment
unemployment_model <- lm(fmla, data = unemployment)
unemployment_model##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
Examining a model
# Print unemployment_model
unemployment_model##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 0.0000197 1 -10.3 26.6 28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
model itself reports coefficients and the model build (dependent and independent variables)
summary function provides more details on fitness of the model
glance - get the details in a tidier form
wrapFTest (within the sigr package) provides the results from F Test
Making some predictions
- note that we used newrates in the prediction to see how much female_unemployment rate is when there is a given unemployment rate for male.
# unemployment is in your workspace
summary(unemployment)## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates## male_unemployment
## 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred## 1
## 4.906757
Multivariate linear regression
This part was covered in previous exercise too - therefore the walk-through is relatively simple below.
head(bloodpressure)## blood_pressure age weight
## 1 132 52 173
## 2 143 59 184
## 3 153 67 194
## 4 162 73 211
## 5 154 64 196
## 6 168 74 220
summary(bloodpressure)## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
Fitting the model
# bloodpressure is in the workspace
summary(bloodpressure)## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
Visualization
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)
# plot the results
ggplot(bloodpressure, aes(prediction, blood_pressure)) +
geom_point() +
geom_abline(color = "blue")