library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readxl)

Multiple Linear Regression

Dataset

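The dataset I will be using is the Concrete Compressive Strength dataset from the UCI Machine Learning Repository. It records the mixture components, age, and compressive strength of 1,030 concrete samples.

The dataset has the following input variables: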

  • Cement (component 1) – quantitative – kg per m³ of mixture – Input Variable
  • Blast Furnace Slag (component 2) – quantitative – kg per m³ of mixture – Input Variable
  • Fly Ash (component 3) – quantitative – kg per m³ of mixture – Input Variable
  • Water (component 4) – quantitative – kg per m³ of mixture – Input Variable
  • Superplasticizer (component 5) – quantitative – kg per m³ of mixture – Input Variable
  • Coarse Aggregate (component 6) – quantitative – kg per m³ of mixture – Input Variable
  • Fine Aggregate (component 7) – quantitative – kg per m³ of mixture – Input Variable
  • Age – quantitative – Day (1~365) – Input Variable

And the following dependent variable:

  • Concrete compressive strength – quantitative – MPa – Output Variable

We preview the data below.

##   Cement Blast.Furnace.Slag Fly.Ash Water Superplasticizer
## 1  540.0                0.0       0   162              2.5
## 2  540.0                0.0       0   162              2.5
## 3  332.5              142.5       0   228              0.0
## 4  332.5              142.5       0   228              0.0
## 5  198.6              132.4       0   192              0.0
## 6  266.0              114.0       0   228              0.0
##   Course.Aggregate Fine.Aggregate Age Strength
## 1           1040.0          676.0  28    79.99
## 2           1055.0          676.0  28    61.89
## 3            932.0          594.0 270    40.27
## 4            932.0          594.0 365    41.05
## 5            978.4          825.5 360    44.30
## 6            932.0          670.0  90    47.03
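
A preview like the one above could be produced along these lines. This is a sketch rather than the code behind the report: the file name `Concrete_Data.xls` and the helper `load_concrete()` are assumptions, and the column names are copied from the printed output, including its `Course.Aggregate` spelling.

```r
# Column names as they appear in the printed preview.
concrete_cols <- c(
  "Cement", "Blast.Furnace.Slag", "Fly.Ash", "Water", "Superplasticizer",
  "Course.Aggregate", "Fine.Aggregate", "Age", "Strength"
)

# Hypothetical helper: read the spreadsheet and apply the short column names.
# Assumes the sheet holds the nine columns in the order listed above.
load_concrete <- function(path) {
  df <- as.data.frame(readxl::read_excel(path))
  stopifnot(ncol(df) == length(concrete_cols))
  names(df) <- concrete_cols
  df
}

# Usage (the path is an assumption):
# concrete <- load_concrete("Concrete_Data.xls")
# head(concrete)
```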

Exploring the Data

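Distribution of Concrete Strength

The distribution of concrete strength is approximately normal, with a mean of 35.82 MPa and a median of 34.45 MPa; the mean sitting above the median suggests a slight right skew.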

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.33   23.71   34.45   35.82   46.13   82.60
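
The summary above, together with a histogram of the response, could be obtained with a small helper such as the following. The function name `describe_strength()` is hypothetical, and base R graphics are used to keep the sketch free of extra dependencies; the figure in the report may have been drawn differently.

```r
# Draw a histogram of a numeric vector and return its six-number summary.
describe_strength <- function(strength) {
  hist(strength, breaks = 30,
       main = "Concrete compressive strength", xlab = "Strength (MPa)")
  summary(strength)
}

# Usage, assuming `concrete` is the data frame previewed earlier:
# describe_strength(concrete$Strength)
```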

Correlation Plot

The following chart shows the pairwise correlations between the input variables and concrete strength.

## Warning: package 'corrplot' was built under R version 3.6.1
## corrplot 0.84 loaded
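
A correlation plot of this kind is typically built from the matrix returned by `cor()`. The sketch below assumes a data frame named `concrete`; the helper name `strength_correlations()` and the `method = "circle"` choice are illustrative assumptions, since the report does not show its plotting call.

```r
# Correlation matrix of the numeric columns of a data frame.
# A constant column would produce NA entries.
strength_correlations <- function(df) {
  cor(df[sapply(df, is.numeric)])
}

# Usage, assuming `concrete` is the data frame previewed earlier:
# corr <- strength_correlations(concrete)
# corrplot::corrplot(corr, method = "circle")
# sort(corr[, "Strength"], decreasing = TRUE)  # correlations with Strength
```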

Regression

For the multiple linear regression, we use Fly Ash, Superplasticizer, Blast Furnace Slag, Water, Age, and Cement as predictors. Fine Aggregate and Coarse Aggregate (the `Course.Aggregate` column) are left out because their p-values are greater than 0.05.

## 
## Call:
## lm(formula = Strength ~ +Fly.Ash + Superplasticizer + Blast.Furnace.Slag + 
##     Water + Age + Cement, data = concrete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.987  -6.469   0.653   6.547  34.732 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        28.992982   4.213202   6.881 1.03e-11 ***
## Fly.Ash             0.068660   0.007735   8.877  < 2e-16 ***
## Superplasticizer    0.240311   0.084567   2.842  0.00458 ** 
## Blast.Furnace.Slag  0.086472   0.004974  17.385  < 2e-16 ***
## Water              -0.218088   0.021129 -10.322  < 2e-16 ***
## Age                 0.113492   0.005407  20.988  < 2e-16 ***
## Cement              0.105413   0.004246  24.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.41 on 1023 degrees of freedom
## Multiple R-squared:  0.614,  Adjusted R-squared:  0.6118 
## F-statistic: 271.2 on 6 and 1023 DF,  p-value: < 2.2e-16
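
The output above corresponds to a call of the following shape, taken from its `Call:` line. The wrapper name `fit_strength_model()` is hypothetical, and `concrete` is assumed to hold the full dataset. The reported R-squared of 0.614 means the six predictors account for roughly 61% of the variance in strength.

```r
# Fit the six-predictor model shown in the Call: line of the output.
fit_strength_model <- function(df) {
  lm(Strength ~ Fly.Ash + Superplasticizer + Blast.Furnace.Slag +
       Water + Age + Cement, data = df)
}

# Usage, assuming `concrete` holds the full dataset:
# model <- fit_strength_model(concrete)
# summary(model)                # coefficient table, R-squared, F-statistic
# summary(model)$adj.r.squared  # adjusted R-squared
```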

Multiple Regression Line

\[ \hat{y} = 28.99 + 0.069 \cdot \text{Fly.Ash} + 0.240 \cdot \text{Superplasticizer} + 0.086 \cdot \text{Blast.Furnace.Slag} - 0.218 \cdot \text{Water} + 0.113 \cdot \text{Age} + 0.105 \cdot \text{Cement} \]
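
As a worked check of the fitted equation, the sketch below plugs the first row of the data preview into the coefficient estimates from the regression output. The variable names `coefs`, `row1`, `predicted`, and `observed` are introduced here for illustration only.

```r
# Coefficient estimates copied from the regression output.
coefs <- c(
  "(Intercept)" = 28.992982, Fly.Ash = 0.068660, Superplasticizer = 0.240311,
  Blast.Furnace.Slag = 0.086472, Water = -0.218088, Age = 0.113492,
  Cement = 0.105413
)

# Predictor values from row 1 of the data preview.
row1 <- c(Fly.Ash = 0, Superplasticizer = 2.5, Blast.Furnace.Slag = 0,
          Water = 162, Age = 28, Cement = 540)

# Intercept plus the sum of coefficient-times-value terms.
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(row1)] * row1)
observed <- 79.99  # Strength in row 1 of the preview

round(c(predicted = predicted, residual = observed - predicted), 2)
# predicted 54.36, residual 25.63
```

The model underpredicts this sample by about 25.6 MPa, which is large but within the residual range reported in the output (maximum 34.732).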

Residual Analysis

Histogram of Residuals

The histogram of residuals looks approximately normal and is centered near zero (the median residual is 0.653).
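
A histogram like this can be drawn directly from a fitted `lm` object. The helper name `plot_residual_hist()` is hypothetical, and base R graphics are an assumption made for brevity; the report's own figure may have used ggplot2.

```r
# Histogram of the residuals of a fitted linear model.
plot_residual_hist <- function(model) {
  hist(resid(model), breaks = 30,
       main = "Histogram of residuals", xlab = "Residual (MPa)")
}

# Usage, assuming `model` is the fitted regression:
# plot_residual_hist(model)
```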

Residual Plot

The residual plot shows roughly constant variability, but the residuals seem to follow a linear pattern, which suggests the model may not capture all of the structure in the data.
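
A residuals-versus-fitted plot of the kind described can be sketched as follows. The helper name `plot_residuals_vs_fitted()` is hypothetical and base R graphics are assumed; the dashed line at zero makes any systematic trend in the residuals easier to see.

```r
# Scatter plot of residuals against fitted values, with a zero reference line.
# Returns the plotted values invisibly so they can be inspected further.
plot_residuals_vs_fitted <- function(model) {
  plot(fitted(model), resid(model),
       xlab = "Fitted strength (MPa)", ylab = "Residual (MPa)",
       main = "Residuals vs fitted values")
  abline(h = 0, lty = 2)
  invisible(data.frame(fitted = fitted(model), residual = resid(model)))
}

# Usage, assuming `model` is the fitted regression:
# plot_residuals_vs_fitted(model)
```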