ML_Regression_Supervised Learning

Load Packages and Dataset

Packages used to import and analyse the data include

broom - elegant view of the linear regression model’s results
sigr - succinct and correct stat summaries for reports
dplyr - for manipulating data

A simple demo on LM

note that LM was already covered in previous course so below coding parts are relatively straightforward with limited commentaries.
a couple of notes: instead of writing the y~x inside the lm model, we can also define their relationships in as.formula
lm function is from the stats package which is by default loaded

summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

fmla <- as.formula(female_unemployment ~ male_unemployment)
fmla

## female_unemployment ~ male_unemployment

unemployment_model <- lm(fmla, data = unemployment)
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

Examining a model

# Print unemployment_model
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

# Call summary() on unemployment_model to get more details
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 0.0000197     1  -10.3  26.6  28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)

## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

model itself reports coefficients and the model build (dependent and independent variables)
summary function provides more details on fitness of the model
glance - get the details in a tidier form
wrapFTest (within the sigr package) provides the results from F Test

Making some predictions

note that we used newrates in the prediction to see how much female_unemployment rate is when there is a given unemployment rate for male.

# unemployment is in your workspace
summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

# newrates is in your workspace
newrates

##   male_unemployment
## 1                 5

# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred

##        1 
## 4.906757

Multivariate linear regression

This part was covered in previous exercise too - therefore the walk-through is relatively simple below.

head(bloodpressure)

##   blood_pressure age weight
## 1            132  52    173
## 2            143  59    184
## 3            153  67    194
## 4            162  73    211
## 5            154  64    196
## 6            168  74    220

summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

Fitting the model

# bloodpressure is in the workspace
summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla

## blood_pressure ~ age + weight

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349

summary(bloodpressure_model)

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Visualization

# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(prediction, blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

ML_Regression_Supervised Learning

Kate C

2022-01-10

Load Packages and Dataset

A simple demo on LM

Examining a model

Making some predictions

Multivariate linear regression