Exercise 1 - Simple linear regression

Load the data biomass and plot HHV as a function of carbon.

library(tidymodels)
# or run in console table with "remotes::install_github("tidymodels/parsnip")"
data("biomass") # in model.data package
head(biomass)

##                   sample  dataset carbon hydrogen oxygen nitrogen sulfur    HHV
## 1           Akhrot Shell Training  49.81     5.64  42.94     0.41   0.00 20.008
## 2 Alabama Oak Wood Waste Training  49.50     5.70  41.30     0.20   0.00 19.228
## 3                  Alder Training  47.82     5.80  46.25     0.11   0.02 18.299
## 4                Alfalfa Training  45.10     4.97  35.60     3.30   0.16 18.151
## 5     Alfalfa Seed Straw Training  46.76     5.40  40.72     1.00   0.02 18.450
## 6         Alfalfa Stalks Training  45.40     5.75  40.20     2.04   0.10 18.465

# plot
biomass %>%
ggplot(aes(x = carbon, y = HHV)) +
  geom_point()

Fit a simple linear regression model to the data. With HHV as the response and carbon as the predictor. Is it a good fit?

Both intercept and slope p-value are significant, meaning that we have evidence to reject the null hypothesis in favor of the $b_0$ and $b_1$ are not equal to zero. Meantime, the large R-squared indicates we have 85 % variation in HHV explained by carbon. Thus, the model is a good fit.

# create a model specification
linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") -> lm_spec

# fit the model
fit(lm_spec, HHV ~ carbon, data = biomass)  -> simple_lm_fit

# see the summary table
simple_lm_fit %>%
  pluck("fit") %>%
  summary()

## 
## Call:
## stats::lm(formula = HHV ~ carbon, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9357  -0.5228  -0.0704   0.4119   9.4552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.767605   0.308779   8.963   <2e-16 ***
## carbon      0.339371   0.006256  54.243   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.473 on 534 degrees of freedom
## Multiple R-squared:  0.8464, Adjusted R-squared:  0.8461 
## F-statistic:  2942 on 1 and 534 DF,  p-value: < 2.2e-16

# another way to call summary table
# simple_lm_fit$fit %>% summary()

# using `tidy` to make a table as a tibble
simple_lm_fit %>% 
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    2.77    0.309        8.96 5.29e- 18
## 2 carbon         0.339   0.00626     54.2  2.23e-219

# another way to extract the r-square
simple_lm_fit %>% 
  glance() %>%
  pull(r.squared)

## [1] 0.8463907

Use the model to predict what the HHV of samples with carbon = 10, 20, …, 80.

# create a new HHV data frame
new_hhv <- tibble(carbon = seq(from = 10, to = 80, by = 10))
new_hhv

## # A tibble: 8 x 1
##   carbon
##    <dbl>
## 1     10
## 2     20
## 3     30
## 4     40
## 5     50
## 6     60
## 7     70
## 8     80

predict(simple_lm_fit, new_data = new_hhv) %>%
  rename(predictHHV = .pred)

## # A tibble: 8 x 1
##   predictHHV
##        <dbl>
## 1       6.16
## 2       9.56
## 3      12.9 
## 4      16.3 
## 5      19.7 
## 6      23.1 
## 7      26.5 
## 8      29.9

Produce diagnostics plots. You can use plot() on the $fit object to produce some diagnostics.

par(mfrow = c(2, 2))
simple_lm_fit %>%
  pluck("fit") %>%
  plot()

Exercise 2 - Multiple linear regression

Fit a linear regression model to the data. With HHV as the response and carbon and hydrogen as the predictor. How is the fit compared to the simple linear model?

The p-values of $b_0, b_1, b_2$ are totally significant, and the r-squared is 0.85. We can say that this model and the previous simple linear model are similar. In the cause of we have 85 % variation can explain the models and the predictor(s) is/ are significant.

fit(lm_spec, HHV ~ carbon + hydrogen, data = biomass)  -> multiple_lm_fit

summary(multiple_lm_fit$fit)

## 
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1393  -0.4984  -0.1074   0.3934  10.6733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.02853    0.49815   2.065   0.0394 *  
## carbon       0.34805    0.00646  53.882  < 2e-16 ***
## hydrogen     0.24180    0.05491   4.403 1.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.448 on 533 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8512 
## F-statistic:  1532 on 2 and 533 DF,  p-value: < 2.2e-16

Fit a linear regression model to the data. With HHV as the response and all the molecules as the predictor. How is the fit compared to the previous models?

The mean square error of three models are similar. However, We can see that oxygen and nitrogen are not significant in the full model, meaning that we don’t have evidence that $b_3$ and $b_4$ are not equal to zero. Because all of the predictors of the simple and multiple models are in the significant level, we can conclude that these two models are superior to the full model.

fit(lm_spec, HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass)  -> full_lm_fit
summary(full_lm_fit$fit)

## 
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen + oxygen + nitrogen + 
##     sulfur, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3185  -0.4913  -0.0703   0.4308   9.9135 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.366432   0.700763   0.523 0.601260    
## carbon       0.352207   0.008360  42.130  < 2e-16 ***
## hydrogen     0.254543   0.062828   4.051 5.85e-05 ***
## oxygen       0.007578   0.009212   0.823 0.411108    
## nitrogen    -0.007530   0.058594  -0.129 0.897796    
## sulfur       0.460805   0.138858   3.319 0.000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.437 on 530 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:  0.8535 
## F-statistic: 624.3 on 5 and 530 DF,  p-value: < 2.2e-16

# find out the three model's MSE, which are close to 2
anova(simple_lm_fit$fit) %>% tidy()

## # A tibble: 2 x 6
##   term         df sumsq  meansq statistic    p.value
##   <chr>     <int> <dbl>   <dbl>     <dbl>      <dbl>
## 1 carbon        1 6385. 6385.       2942.  2.23e-219
## 2 Residuals   534 1159.    2.17       NA  NA

anova(multiple_lm_fit$fit) %>% tidy

## # A tibble: 3 x 6
##   term         df  sumsq  meansq statistic    p.value
##   <chr>     <int>  <dbl>   <dbl>     <dbl>      <dbl>
## 1 carbon        1 6385.  6385.      3044.   1.75e-222
## 2 hydrogen      1   40.7   40.7       19.4  1.29e-  5
## 3 Residuals   533 1118.     2.10      NA   NA

anova(full_lm_fit$fit) %>% tidy()

## # A tibble: 6 x 6
##   term         df      sumsq     meansq  statistic    p.value
##   <chr>     <int>      <dbl>      <dbl>      <dbl>      <dbl>
## 1 carbon        1 6385.      6385.      3090.       2.73e-223
## 2 hydrogen      1   40.7       40.7       19.7      1.11e-  5
## 3 oxygen        1    0.00433    0.00433    0.00210  9.63e-  1
## 4 nitrogen      1    0.344      0.344      0.166    6.83e-  1
## 5 sulfur        1   22.8       22.8       11.0      9.67e-  4
## 6 Residuals   530 1095.         2.07      NA       NA

Labs 02 - Week 1 Wednesday

Yunting Chiu

2021-05-20

Exercise 1 - Simple linear regression

Exercise 2 - Multiple linear regression