When it comes to modeling and prediction, simple linear regression is one of the most common techniques we start with for predicting a variable. In this post, I am going to build a simple regression model and go through some of the functions we can use to understand its output.
Trees dataset
For this linear regression I am going to work with the trees dataset, which contains 31 observations and 3 variables.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
data(trees)
df <- trees
str(df)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
summary(df)
##      Girth           Height       Volume
##  Min.   : 8.30   Min.   :63   Min.   :10.20
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
##  Median :12.90   Median :76   Median :24.20
##  Mean   :13.25   Mean   :76   Mean   :30.17
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
##  Max.   :20.60   Max.   :87   Max.   :77.00
plot(df)
corrplot(cor(df), type="lower", order="alphabet",
         col=brewer.pal(n=10, name="PiYG"))
With the correlation plot we can see how strongly the variables are correlated with each other.
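The plot gives a visual impression, and it can help to look at the numbers behind it as well. A minimal sketch, using only base R and the built-in trees data (the name cor_matrix is just for illustration):

```r
# Numeric version of the correlation plot
data(trees)
df <- trees

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(df), 2)
print(cor_matrix)

# Correlation of each predictor with the response, Volume
print(cor_matrix[c("Girth", "Height"), "Volume"])
```

In this dataset Girth is much more strongly correlated with Volume than Height is, which is worth keeping in mind when reading the model coefficients.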
model1 <- lm(Volume ~ ., data = df)
model1
##
## Call:
## lm(formula = Volume ~ ., data = df)
##
## Coefficients:
## (Intercept)        Girth       Height
##    -57.9877       4.7082       0.3393
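These coefficients define the fitted equation Volume = -57.9877 + 4.7082 * Girth + 0.3393 * Height. As a rough sketch of how to use it, the code below predicts the volume of a made-up tree (Girth 12, Height 75 are hypothetical values picked for illustration, not taken from the post) and checks the result against the same calculation done by hand:

```r
data(trees)
df <- trees
model1 <- lm(Volume ~ ., data = df)

# Hypothetical new tree (illustrative values only)
new_tree <- data.frame(Girth = 12, Height = 75)

# Prediction from the fitted model
pred <- predict(model1, newdata = new_tree)

# Same prediction computed by hand from the estimated coefficients
b <- coef(model1)
by_hand <- b["(Intercept)"] + b["Girth"] * 12 + b["Height"] * 75

print(unname(pred))
print(unname(by_hand))
```

Both values should agree, since predict() applies the same linear equation to the new data.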
summary(model1)
##
## Call:
## lm(formula = Volume ~ ., data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.4065 -2.6493 -0.2876  2.2003  8.4847
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
plot(model1)
From the summary of model1 we can see that the model has an R² of 0.948 (adjusted R² of 0.944), which means that Girth and Height together explain about 95% of the variance in Volume, so this is a good model for predicting the volume of a tree. Linear regression can be very useful for predicting a variable with a linear model.
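The numbers in the summary do not have to be read off the printed output. A small sketch, using base R only, of how they could be pulled out of the summary object (the variable names s, r2, adj_r2, coef_table and ci are just for illustration):

```r
data(trees)
model1 <- lm(Volume ~ ., data = trees)

s <- summary(model1)

# R-squared and adjusted R-squared as plain numbers
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)

# 95% confidence intervals for the coefficients
ci <- confint(model1, level = 0.95)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
print(coef_table)
print(ci)
```

Having these values as objects makes it easier to compare models or report results without copying from the console.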
When working with linear regression, there are some issues that may affect our model:
. Missing data
. Multicollinearity
. Data distribution
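Each of these can be checked quickly on the trees data. The sketch below is one possible way to do it with base R only, and it rests on two assumptions of mine: multicollinearity is measured with a hand-computed variance inflation factor (VIF), and "data distribution" is read as normality of the residuals, tested with shapiro.test(). Other checks would be just as reasonable.

```r
data(trees)
df <- trees
model1 <- lm(Volume ~ ., data = df)

# 1. Missing data: number of NA values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# 2. Multicollinearity: VIF for Girth, computed by hand as 1 / (1 - R^2),
#    where R^2 comes from regressing Girth on the other predictor (Height)
r2_girth <- summary(lm(Girth ~ Height, data = df))$r.squared
vif_girth <- 1 / (1 - r2_girth)
print(vif_girth)

# 3. Data distribution (assumed here to mean residual normality):
#    Shapiro-Wilk test on the model residuals
sw <- shapiro.test(residuals(model1))
print(sw)
```

A VIF close to 1 suggests the predictors are not strongly collinear, and a large Shapiro-Wilk p-value gives no evidence against normally distributed residuals.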