In Blog1 I decided to run a simple linear regression model in R and interpret the key components of the R linear model output. The linear regression analysis simply allows us to test the correlation or association between variables. For this example I took dataset ‘trees’ that comes built in with R.

Load the data

data(trees)
head(trees,10)%>%kable() %>%
  kable_styling()
Girth Height Volume
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
 summary(trees) %>%kable() %>%
  kable_styling()
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00

These data includes measurements of the diameter, height and volume of 31 black cherry trees. Tree girth is a measurement of the circumference of tree trunk. So in our case, the ‘Girth’ is the diameter at breast height in inches.

Let’s build simple regression model to see the relationship between Girth and Height to see how the tree diameters change as trees grow higher.

Simple Linear Model

simplemodel <- lm(Girth ~ Height, data = trees)
# let's view the output:
summary(simplemodel)
## 
## Call:
## lm(formula = Girth ~ Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2386 -1.9205 -0.0714  2.7450  4.5384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -6.18839    5.96020  -1.038  0.30772   
## Height       0.25575    0.07816   3.272  0.00276 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.728 on 29 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
## F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

The model above is achieved by using the lm() function in R and the output is called using the summary() function on the model.

Formula Call

As we can see, the first item shown in the output is the formula R used to fit the data. Note the simplicity in the syntax: the formula just needs the predictor (Height) and the target/response variable (Girth), together with the data being used (trees).

The formula for this data is: Girth = -6.18839 + 0.25575 * Height ± 0.07816

Residuals

The next item in the model output talks about the residuals. Residuals are essentially the difference between the actual observed response values (Girth in our case) and the response values that the model predicted. When assessing how well the model fit the data, we should look for a symmetrical distribution across these points on the mean value zero (0).

P- Value

The Pr(>t) acronym found in the model output relates to the probability of observing any value equal or larger than t. Typically, a p-value of 5% or less is a good cut-off point. In our example, p-value < 0.05, meaning that the association of Girth and Height is statistically significant.

Residual Standard Error

Residual Standard Error is measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. In our example, the actual Girth can deviate from the true regression line by approximately 2.728 inch, on average.

Adjusted R-squared

R2 is a measure of the linear relationship between our predictor variable (Height) and our response / target variable (Girth). It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable). In our example, the R2 we get is 0.2445 Or roughly 24% of the variance found in the response variable (Girth) can be explained by the predictor variable (Height). It looks like R2 is not strong. If I was able to choose any metric to predict Girth it wouldn’t be only one variable used for sure. Nevertheless, it’s hard to define what level of R2 is appropriate to claim the model fits well. Essentially, it will vary with the application and the domain studied.In multiple regression settings, the R2 will always increase as more variables are included in the model. That’s why the adjusted R2 is the preferred measure as it adjusts for the number of variables considered.

F-Statistic

F-statistic is a good indicator of whether there is a relationship between our predictor and the response variables. The further the F-statistic is from 1 the better it is. However, how much larger the F-statistic needs to be depends on both the number of data points and the number of predictors.In our example the F-statistic is 10.71 which is not much larger than 1 given the size of our data.

ggplotRegression <- function (fit) {

require(ggplot2)

ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                     "Intercept =",signif(fit$coef[[1]],5 ),
                     " Slope =",signif(fit$coef[[2]], 5),
                     " P =",signif(summary(fit)$coef[2,4], 5)))
}
fit1 <- lm(Girth ~ Height, data = trees)
ggplotRegression(fit1)
## Loading required package: ggplot2
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

hist(simplemodel$residuals)

Predict

#Using the model for manual prediction -6.18839 + 0.25575 * Height

-6.18839 + 0.25575 * 81
## [1] 14.52736
-6.18839 + 0.25575 * 94
## [1] 17.85211
-6.18839 + 0.25575 * 150
## [1] 32.17411
#To predict, use dataframe and store the new values
newheight<-data.frame(Height=c(81,94,150))

#Use predict function predict(model,dataframe)
predict(simplemodel,newheight)
##        1        2        3 
## 14.52712 17.85184 32.17367