The dataset comes from the UCI Machine Learning Repository and contains data on the factors that affect concrete strength.
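The dataset has the following factors: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate (column Course.Aggregate), Fine Aggregate, and Age.
And the following dependent variable: Strength.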
We preview the data below.
concrete <- read.csv("https://raw.githubusercontent.com/mkivenson/Computational-Mathematics/master/Discussions/Concrete%20Strength%20Dataset.csv", header = TRUE)
head(concrete)
##   Cement Blast.Furnace.Slag Fly.Ash Water Superplasticizer
## 1  540.0                0.0       0   162              2.5
## 2  540.0                0.0       0   162              2.5
## 3  332.5              142.5       0   228              0.0
## 4  332.5              142.5       0   228              0.0
## 5  198.6              132.4       0   192              0.0
## 6  266.0              114.0       0   228              0.0
##   Course.Aggregate Fine.Aggregate Age Strength
## 1           1040.0          676.0  28    79.99
## 2           1055.0          676.0  28    61.89
## 3            932.0          594.0 270    40.27
## 4            932.0          594.0 365    41.05
## 5            978.4          825.5 360    44.30
## 6            932.0          670.0  90    47.03
The distribution of concrete strength is approximately normal, with a mean of 35.82 and a median of 34.45.
library(ggplot2)
ggplot(concrete, aes(Strength)) +
  geom_histogram(bins = 25) +
  ggtitle("Distribution of Concrete Strength")
summary(concrete$Strength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.33   23.71   34.45   35.82   46.13   82.60
The following chart shows the correlation between each factor and concrete strength.
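As a rough sketch of how such a correlation view could be computed, the snippet below uses base R's `cor()` to get the Pearson correlation of each factor with Strength. The helper name `strength_correlations` is an assumption made for illustration, and the `demo` data frame holds only the six rows printed by `head(concrete)` for four of the columns, so its numbers are not representative of the full dataset.

```r
# Correlation of each numeric column with a chosen response column.
# `strength_correlations` is a hypothetical helper, not part of the original analysis.
strength_correlations <- function(df, response = "Strength") {
  cors <- cor(df)[, response]            # Pearson correlations with the response
  cors <- cors[names(cors) != response]  # drop the response's correlation with itself
  sort(cors, decreasing = TRUE)          # most positive first
}

# Demo data: the six rows shown by head(concrete), restricted to four columns.
demo <- data.frame(
  Cement   = c(540.0, 540.0, 332.5, 332.5, 198.6, 266.0),
  Water    = c(162, 162, 228, 228, 192, 228),
  Age      = c(28, 28, 270, 365, 360, 90),
  Strength = c(79.99, 61.89, 40.27, 41.05, 44.30, 47.03)
)

strength_correlations(demo)
```

With the full data loaded, the same helper could be called as `strength_correlations(concrete)`, and `corrplot::corrplot(cor(concrete))` would draw the whole correlation matrix as a chart.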
For multiple linear regression, we use Fly Ash, Superplasticizer, Blast Furnace Slag, Water, Age, and Cement as factors. Fine Aggregate and Coarse Aggregate (column Course.Aggregate) are left out because their p-values are greater than 0.05.
lm_concrete <- lm(Strength ~ +
                    #Fine.Aggregate +
                    Fly.Ash +
                    Superplasticizer +
                    Blast.Furnace.Slag +
                    Water +
                    Age +
                    #Course.Aggregate +
                    Cement,
                  data = concrete)
summary(lm_concrete)
##
## Call:
## lm(formula = Strength ~ +Fly.Ash + Superplasticizer + Blast.Furnace.Slag +
## Water + Age + Cement, data = concrete)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.987  -6.469   0.653   6.547  34.732
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        28.992982   4.213202   6.881 1.03e-11 ***
## Fly.Ash             0.068660   0.007735   8.877  < 2e-16 ***
## Superplasticizer    0.240311   0.084567   2.842  0.00458 **
## Blast.Furnace.Slag  0.086472   0.004974  17.385  < 2e-16 ***
## Water              -0.218088   0.021129 -10.322  < 2e-16 ***
## Age                 0.113492   0.005407  20.988  < 2e-16 ***
## Cement              0.105413   0.004246  24.825  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.41 on 1023 degrees of freedom
## Multiple R-squared: 0.614, Adjusted R-squared: 0.6118
## F-statistic: 271.2 on 6 and 1023 DF, p-value: < 2.2e-16
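The p-value screening used to choose the factors can be sketched as follows. This is only an illustration under stated assumptions: `nonsignificant_terms` is a hypothetical helper, and the data is simulated (not the concrete data) so that the block runs on its own. It reads the `Pr(>|t|)` column from the coefficient table of a fitted model and lists the terms above a chosen threshold.

```r
# Terms of a fitted lm whose p-value exceeds `alpha`.
# `nonsignificant_terms` is a hypothetical helper for illustration.
nonsignificant_terms <- function(fit, alpha = 0.05) {
  p <- summary(fit)$coefficients[, "Pr(>|t|)"]  # p-value column of the coefficient table
  p <- p[names(p) != "(Intercept)"]             # the intercept is not a candidate for removal
  names(p)[p > alpha]
}

# Simulated data (not the concrete data): x1 drives y, x2 is unrelated noise.
set.seed(1)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 2 * sim$x1 + rnorm(200)

full_fit <- lm(y ~ x1 + x2, data = sim)
nonsignificant_terms(full_fit)
```

On the real data, the same helper could be applied to a model containing all eight factors, for example `lm(Strength ~ ., data = concrete)`.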
\[ \widehat{\text{Strength}} = 28.993 + 0.069 \cdot \text{Fly.Ash} + 0.240 \cdot \text{Superplasticizer} + 0.086 \cdot \text{Blast.Furnace.Slag} - 0.218 \cdot \text{Water} + 0.113 \cdot \text{Age} + 0.105 \cdot \text{Cement} \]
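To make the fitted equation concrete, the sketch below applies it by hand to the first row of the previewed data. The coefficients are copied from the model summary; the vector names `coefs` and `mix` are chosen here for illustration only.

```r
# Coefficients copied from the summary(lm_concrete) output.
coefs <- c(
  "(Intercept)"      = 28.992982,
  Fly.Ash            = 0.068660,
  Superplasticizer   = 0.240311,
  Blast.Furnace.Slag = 0.086472,
  Water              = -0.218088,
  Age                = 0.113492,
  Cement             = 0.105413
)

# First row of head(concrete); the leading 1 multiplies the intercept.
mix <- c(
  "(Intercept)"      = 1,
  Fly.Ash            = 0,
  Superplasticizer   = 2.5,
  Blast.Furnace.Slag = 0,
  Water              = 162,
  Age                = 28,
  Cement             = 540
)

predicted <- sum(coefs * mix[names(coefs)])  # align by name, then take the dot product
observed  <- 79.99                           # Strength in the first row
c(predicted = predicted, residual = observed - predicted)
```

The predicted strength comes out at about 54.36 against an observed 79.99, a residual of roughly 25.6, which lies within the residual range reported in the summary. With the fitted model object available, `predict(lm_concrete, newdata = concrete[1, ])` should give the same prediction up to rounding of the coefficients.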
The histogram of residuals appears approximately normal.
The residual plot shows roughly constant variability, but the residuals seem to follow a linear pattern, which suggests the model is missing some structure in the data.
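The code for these two residual checks is not shown, so the following is only a sketch of one way to produce them with base R graphics. The helper `residual_diagnostics` is a hypothetical name, the data is simulated (not the concrete data) so that the block runs on its own, and the Shapiro-Wilk test is an addition that was not part of the original analysis.

```r
# Residual histogram and residuals-vs-fitted plot for any fitted lm.
# `residual_diagnostics` is a hypothetical helper for illustration.
residual_diagnostics <- function(fit) {
  res <- resid(fit)
  hist(res, breaks = 25, main = "Histogram of residuals", xlab = "Residual")
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs. fitted")
  abline(h = 0, lty = 2)        # reference line at zero
  invisible(shapiro.test(res))  # formal normality test, returned invisibly
}

# Simulated data (not the concrete data): a linear signal plus normal noise.
set.seed(1)
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(200)

sw <- residual_diagnostics(lm(y ~ x, data = sim))
sw$p.value
```

For the model in this analysis, the call would be `residual_diagnostics(lm_concrete)`. A visible trend in the residuals-vs-fitted plot, like the linear pattern noted above, is the kind of signal this plot is meant to expose.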