Linear Regression

When it comes to modeling and prediction, simple linear regression is one of the most common techniques we start with for predicting a variable. In this blog post, I am going to fit a simple regression and walk through some of the functions we can use to understand the model's output.

Trees dataset

For this linear regression I am going to work with the trees dataset, which contains 31 observations and 3 variables.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
data(trees)

df <- trees

str(df)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
summary(df)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00
plot(df)

corrplot(cor(df), type="lower", order="alphabet",
         col=brewer.pal(n=10, name="PiYG"))

With the correlation plot we can see how strongly the variables are correlated with one another.
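If we prefer exact numbers to colours, we can also print the correlation matrix directly. This is just an optional check using base R's cor(), the same function that feeds corrplot() above:

# Numeric version of the information shown in the correlation plot
round(cor(df), 2)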

Model

model1 <-  lm(Volume ~ ., data = df)


model1
## 
## Call:
## lm(formula = Volume ~ ., data = df)
## 
## Coefficients:
## (Intercept)        Girth       Height  
##    -57.9877       4.7082       0.3393
summary(model1)
## 
## Call:
## lm(formula = Volume ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
plot(model1)

We can see in the summary of model1 that our model has an R² of about 94%, which indicates it is a good model for predicting the volume of a tree. Linear regression can be very useful for fitting models and predicting variables with a linear model.
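To illustrate how the fitted model can be used, here is a small sketch that predicts the volume of a hypothetical new tree (the Girth and Height values below are made up purely for illustration):

# Hypothetical new observation: Girth = 14 in, Height = 75 ft
new_tree <- data.frame(Girth = 14, Height = 75)

# Point prediction for Volume plus a 95% prediction interval
predict(model1, newdata = new_tree, interval = "prediction")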

When working with linear regression, there are some issues that may affect our model (a few quick checks are sketched after this list):

- Missing data

- Multicollinearity

- Data distribution
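As a rough sketch, each of these issues can be checked with a few common functions. Note that vif() comes from the car package, which is not loaded earlier in this post and would need to be installed separately:

# Missing data: count NAs in each column
colSums(is.na(df))

# Multicollinearity: variance inflation factors for the predictors
library(car)
vif(model1)

# Data distribution: inspect the residuals of the fitted model
shapiro.test(residuals(model1))
hist(residuals(model1))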