What Is Linear Regression?

Linear regression is used to discover the relationship between a set of predictor variables (or independent variables) and a response variable (or dependent variable). If there is only one predictor variable, then it is a simple linear regression. If there is more than one predictor variable, then it becomes a multiple linear regression.

In this example, we will do a simple linear regression where the predictor variable is human height in cm and the response variable is human weight in kg. We will create the linear regression model and examine its coefficients to explain the relationship.

Then we will use the model to predict the weight of a person based on his or her height. We will explain the process and interpret the results along the way.

Create The Model

First, create two vectors: one for the predictor variable and another for the response variable.

height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

Then use the lm() function to fit a linear model. The function takes a formula, in this case weight ~ height. For multiple linear regression, the formula simply takes more variables, such as weight ~ height + age + gender.

relation <- lm(weight ~ height)
print(relation) 
## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##    -38.4551       0.6746
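As noted above, a multiple linear regression only changes the formula. Below is a minimal sketch that assumes hypothetical age (in years) and gender vectors for the same ten people; these values are purely illustrative and are not part of the data used in the rest of this example.

# Hypothetical extra predictors for the same ten people (illustrative only)
age    <- c(34, 41, 28, 45, 22, 25, 39, 37, 33, 24)
gender <- factor(c("M", "F", "M", "M", "F", "F", "M", "F", "F", "M"))

# Multiple linear regression: the formula gains extra terms
multi_relation <- lm(weight ~ height + age + gender)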

Interpret The Result

From the printout of the model, we can see that the coefficient for height is 0.67. A positive coefficient means that height is positively associated with weight: as height increases, so does the predicted weight. In this case, an increase of 1 cm in height is associated with an increase of about 0.67 kg in weight.
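If you want the coefficients as numbers rather than reading them off the printout, coef() extracts them from the fitted model. A quick sketch:

# Extract the intercept and slope from the fitted model
coefs <- coef(relation)
print(coefs["(Intercept)"])  # about -38.46
print(coefs["height"])       # about 0.67 kg per cm of height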

Also, the y-intercept is found to be -38.46. It is the expected mean value of weight when height equals 0. Since a height of zero is impossible in real life, the intercept has no intrinsic meaning here other than completing the regression formula. However, if the height variable is centered so that its mean is 0, the intercept becomes the predicted weight at the average height.
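To illustrate the centering idea, one way (a small sketch, not required for the rest of this example) is to subtract the mean height before fitting:

# Center the predictor so its mean is 0
height_centered <- height - mean(height)
relation_centered <- lm(weight ~ height_centered)

# The intercept is now the predicted weight at the mean height;
# the slope is unchanged
print(coef(relation_centered))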

Test The Model

The linear model needs to be tested for robustness and accuracy. The robustness of the model is evaluated with the F-statistic and its p-value. The accuracy of the model is evaluated with the R-squared value.

# Get the summary of the relationship
print(summary(relation))
## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3002 -1.6629  0.0412  1.8944  3.9775 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38.45509    8.04901  -4.778  0.00139 ** 
## height        0.67461    0.05191  12.997 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.253 on 8 degrees of freedom
## Multiple R-squared:  0.9548, Adjusted R-squared:  0.9491 
## F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

The summary printout above shows the F-statistic and its p-value at the bottom. The F-statistic is the ratio of two variances: the mean square of the regression (the regression sum of squares divided by its degrees of freedom) divided by the mean square of the error (the error sum of squares divided by its degrees of freedom). The further it is above 1, the stronger the indication that there is a relationship between the predictor and the response variables. In this case, the F-statistic of 168.9 is large enough to indicate a relationship between height and weight.

The p-value is the probability of obtaining a result at least this extreme if there were actually no relationship between the variables. If it is lower than the predetermined significance level (typically 0.05, i.e. a 5% significance level or 95% confidence level), then the result is statistically significant. In this case, the p-value of 1.16e-06 is far below 0.05, so the observed relationship between the predictor and response variables is very unlikely to be due to chance.
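These numbers can also be pulled out of the summary object directly. A small sketch: summary() on an lm fit returns a list whose fstatistic component holds the F value and its degrees of freedom, from which pf() recovers the p-value.

s <- summary(relation)

# F-statistic with its numerator and denominator degrees of freedom
fstat <- s$fstatistic
print(fstat)  # value = 168.9, numdf = 1, dendf = 8

# Recompute the p-value of the F test from the F distribution
p_value <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE))
print(p_value)  # about 1.16e-06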

The summary also shows the R-squared, or coefficient of determination. It measures the proportion of the variance in the response that is explained by the model; the closer it is to 1, the better the fit. In this case, a value of 0.9548 means the model explains about 95% of the variance in weight, indicating a highly accurate model.
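Similarly, the R-squared values are available as components of the same summary object. A brief sketch:

# Proportion of variance in weight explained by the model
print(summary(relation)$r.squared)      # about 0.9548
print(summary(relation)$adj.r.squared)  # about 0.9491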

In summary, the model is both robust and accurate. The robustness of the model is validated by the high F-statistic and low p-value. The accuracy of the model is validated by the high R-squared value.

Use Model For Prediction

Now we are going to use the linear regression model to predict the weight of a person with a height of 170 cm. We can use the predict() function, which takes the model object and the new data as arguments.

a <- data.frame(height = 170)
result <-  predict(relation, a)
print(result)
##        1 
## 76.22869

The predicted weight is found to be 76.23 kg. To confirm, plug the intercept, coefficient, and height into the regression equation and we obtain the same result. That is, predicted weight = -38.4551 + 0.67461 * 170 (cm) ≈ 76.23 (kg), using the unrounded coefficients.
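That hand check can also be done in code with the unrounded coefficients, which is a handy sanity test (a small sketch):

# Recompute the prediction manually from the fitted coefficients
b <- coef(relation)
manual_prediction <- unname(b["(Intercept)"] + b["height"] * 170)
print(manual_prediction)  # about 76.23, matching predict()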

Visualize The Model

# Give the chart file a name
png(file = "linearregression.png")
# Plot the chart
plot(height, weight, col = "blue", cex = 1.3, pch = 16,
     main = "Height & Weight Regression",
     xlab = "Height in cm", ylab = "Weight in kg")
# Add the fitted regression line
abline(relation)

The plot shows there is a linear relationship between height and weight.

# Save the file
dev.off()
## png 
##   3