Simple Linear Regression

A Simple Linear Regression is the most basic form of regression, where we have one independent variable \(X\) and one dependent variable \(Y\). The regression equation will take form \(Y=b0+b1*X\).

A Function to perform a Simple Linear Regression

Topic of my first blog post would be creating a basic R function that will perform a Simple Linear Regression Analysis. The function will read in data frame, dependent and independent variables. I will simplify my life by working only with continuous variables. So, both \(X\) and \(Y\) will be continuous.

Data - Country BMI vs Country Life Expectancy

As a working example, I will build regression model for dependency between a country BMI and its life expectancy. My hypothesis is that there should be a negative correlation between BMI and life expectancy.

Reading data

I will use Wikipedia as my source.

library(rvest) 
library(dplyr)
library(knitr)
library(rcompanion)
library(MASS)
library(tidyverse)
library(caret)

bmiURL <- "https://en.wikipedia.org/wiki/List_of_countries_by_body_mass_index"

temp<-bmiURL%>%read_html%>%html_nodes("table")

BMI<-html_table(temp[[2]],header = NA,fill=TRUE)

BMI<-BMI%>%rename("Country"=!!names(.[1]), "BMI" = !!names(.[3]))

BMI$BMI <- as.numeric(as.character(BMI$BMI))

BMI<-BMI[c(1,3)]

LifeExp <- "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"

temp<-LifeExp%>%read_html%>%html_nodes("table")

LifeExp<-html_table(temp[[1]],header = NA,fill=TRUE)

LifeExp<-LifeExp%>%rename("Country"=!!names(.[1]), "LExp" = !!names(.[3]))

LifeExp$LExp<-as.numeric(as.character(LifeExp$LExp))

LifeExp<-LifeExp[c(1,3)]

ourData<-merge(BMI, LifeExp, by.x="Country", by.y="Country")

Now, we have our data ready to be used for the analysis.

Creating the function

The function performs the following 20 steps:

  1. Prints missing data information
  2. Prints top 5 rows
  3. Prints summary of the X and Y
  4. Creates histogram/density functions for X and Y
  5. Prints correlation coefficient and evaluates the coefficient
  6. Plot X vs Y
  7. Split data into train and test
  8. Adds polynomial elements - it needs automation
  9. Eliminates outliers based on Cook’s distance
  10. Applies Tukey transformation and log transformation
  11. Evaluates F Statistics
  12. Evaluates P value for the transformed X
  13. Shows final Regression Function
  14. Prints Adjusted R squire
  15. Plots Residual Plots
  16. Plots Final Model
  17. Shows Robust Model summary for comparison
  18. Test correlation between actual and predicted values
  19. Validation Plots
  20. Prints model summary
SimLinReg<-function(ourData,X,Y){
 
  Xv<-(ourData[[X]])
  
  Yv<-(ourData[[Y]])
  
  print(cat("Step1. Missing independent variable values is",sum(is.na(X)),"\n"))
  
  print(cat("Step1. Missing dependent variable values is",sum(is.na(Y)),"\n"))
  
  ourMod<-lm(Yv~Xv)
  
  print("Step2. Top 5 rows.")
  
  print(kable(head(ourData)))
  
  print("Step3. Data Summary.")
  
  print(kable(summary(ourData)))
  
  hist(Xv, main='Step4. Histogram of Independent Variable')

  plot(density(Xv),main = "Step4. Density of Independent Variable")
  
  hist(Yv, main='Step4. Histogram of Dependent Variable')

  plot(density(Yv),main = "Step4. Density of Dependent Variable.")
  
  print(cat("Step5. Correlation is",cor(Xv,Yv), "\n"))
  
  if (abs(cor(Xv,Yv))<.3){print("Step5. Correlation. !!!Bad - no predictive power. Disregard Model.")}
  else if (abs(cor(Xv,Yv))<.5) {print("Step5. Correlation. Weak - poor predictive power")} else if
  (abs(cor(Xv,Yv))<.7) {print("Step5. Correlation. Moderate - decent predictive power")} 
  else {print("Step5. Correlation. Strong - excelent predictive power")}
  
  
  plot1 <- plot(Yv~Xv, main="Step6. Plot of Independent vs Dependent Variable.")
  abline(coef(ourMod))
  
  # step7
  set.seed(123)
  NData <- data.frame(Yv,Xv)
  colnames(NData)<-c("Yv","Xv")
  training.samples <- NData$Yv %>%
  createDataPartition(p = 0.8, list = FALSE)
  train  <- NData[training.samples, ]
  test <- NData[-training.samples, ]
  
  #Step8
  
  ourMod1<-lm(Yv~Xv+I(Xv^.5)+I(Xv^-1),data=train)
  
  cooksd <- cooks.distance(ourMod1)
  
  inf <- as.numeric(names(cooksd)[(cooksd > 5*mean(cooksd, na.rm=T))]) 
  
  print(cat("Step9. Outliers. ",length(inf),"outliers were taken out\n"))
  
  train<-train[-inf]
  
  # Step10
   
  Yvl= log(transformTukey(train$Yv,plotit=FALSE,returnLambda = TRUE))
   
  train$Yv_tuk<-log(train$Yv^Yvl)
  
  test$Yv_tuk <-log(test$Yv^Yvl)
  
  Xvl = log(transformTukey(train$Xv,plotit=FALSE,returnLambda = TRUE))
  
  train$Xv_tuk = log(train$Xv^Xvl)
  
  test$Xv_tuk = log(test$Xv^Xvl)
   
  ourMod2<-lm(Yv_tuk~Xv_tuk+I(Xv_tuk^.5)+I(Xv_tuk^(-1)),data=train)
  
  f_stat <- summary(ourMod2)$fstatistic 
  pval<-pf(f_stat[1], f_stat[2], f_stat[3], lower=FALSE)
  
  if (pval<.05) {print("Step11. F statistics validates the model")} else {print("Step11. F statistics does not validate the model. Advice not to use")}
  
  pval1<-summary(ourMod2)$coefficient[2,4]
  
  if (pval1<.05) {print("Step12. P value for the independent variable shows good fit")} 
  else {print("Step12. P value for the independent variable shows bad fit. Advice not to use")}
  
  b0<-summary(ourMod2)$coefficient[1,1]
  
  b1<-summary(ourMod2)$coefficient[2,1]
  
  b2<-summary(ourMod2)$coefficient[3,1]
  
  b3<-summary(ourMod2)$coefficient[4,1]
  
  print(cat("Step13. Regression function is Y=",b0,"+",b1,"*x+",b2,"*x^0.5+",b3,"*(1/x)\n"))
  
  aR2<-summary(ourMod2)$adj.r.squared
  
  print(cat("Step14. Adjusted R Squared is",aR2,"\n"))
  
  plot(ourMod2, main="Step15. Residual Plots.")
  
  plot(train$Xv_tuk,(b0+train$Xv_tuk*b1+train$Xv_tuk^.5*b2+(1/train$Xv_tuk)*b3),main="Step16. Plot Final Model.")
  
  print("Step17. Robust Regression for Comparison.")
  
  print(summary(rlm(Yv_tuk~Xv_tuk+I(Xv_tuk^0.5)+I(1/Xv_tuk),data=train)))
  
  Yv_pred <- predict(ourMod2, test) 
  actuals_preds <- data.frame(cbind(actuals=test$Yv_tuk, predicteds=Yv_pred))  # make actuals_predicteds dataframe.
  
  if (cor(actuals_preds)[2]<.5) {print("Step18. Correlation between actual and predicted variables. !!!Bad - disregard the model")}  
  else if (cor(actuals_preds)[2]<.7) {print("Step18. Correlation between actual and predicted variables: weak - poor predictive power")}
  else if (cor(actuals_preds)[2]<.85) {print("Step18. Correlation between actual and predicted variables: moderate - decent predictive power")} 
  else {print("Step18. Correlation between actual and predicted variables. Strong - excelent predictive power")}
  plot(test$Yv_tuk,Yv_pred, main="Step19. Plot of Actual vs Predicted Values.")
  abline(0, 1)
  
  new<- lm (actuals~., data = actuals_preds)
  print(cat("Step19. Predicted R squired is",summary (new)[[9]],"\n"))
  
  print("Step20. Summary of Final Model.")
  
  return(summary(ourMod2))
}

SimLinReg(ourData,"BMI","LExp")
## Step1. Missing independent variable values is 0 
## NULL
## Step1. Missing dependent variable values is 0 
## NULL
## [1] "Step2. Top 5 rows."
## 
## 
## Country                 BMI   LExp
## --------------------  -----  -----
## Afghanistan            21.6   60.5
## Albania                26.1   77.8
## Algeria                26.2   75.6
## Angola                 24.1   52.4
## Antigua and Barbuda    28.1   76.4
## Argentina              27.7   76.3
## [1] "Step3. Data Summary."
## 
## 
##        Country               BMI             LExp     
## ---  -----------------  --------------  --------------
##      Length:176         Min.   :20.50   Min.   :50.10 
##      Class :character   1st Qu.:23.90   1st Qu.:65.95 
##      Mode  :character   Median :25.90   Median :73.60 
##      NA                 Mean   :25.54   Mean   :71.52 
##      NA                 3rd Qu.:27.00   3rd Qu.:76.75 
##      NA                 Max.   :31.90   Max.   :83.70

## Step5. Correlation is 0.5228951 
## NULL
## [1] "Step5. Correlation. Moderate - decent predictive power"

## Step9. Outliers.  5 outliers were taken out
## NULL
## 
##     lambda     W Shapiro.p.value
## 541    3.5 0.974        0.007878
## 
## if (lambda >  0){TRANS = x ^ lambda} 
## if (lambda == 0){TRANS = log(x)} 
## if (lambda <  0){TRANS = -1 * x ^ lambda} 
## 
## 
##     lambda      W Shapiro.p.value
## 446  1.125 0.9824         0.06252
## 
## if (lambda >  0){TRANS = x ^ lambda} 
## if (lambda == 0){TRANS = log(x)} 
## if (lambda <  0){TRANS = -1 * x ^ lambda} 
## 
## [1] "Step11. F statistics validates the model"
## [1] "Step12. P value for the independent variable shows good fit"
## Step13. Regression function is Y= -11093.39 + -14646.06 *x+ 24045.7 *x^0.5+ 699.7548 *(1/x)
## NULL
## Step14. Adjusted R Squared is 0.3012769 
## NULL

## [1] "Step17. Robust Regression for Comparison."
## 
## Call: rlm(formula = Yv_tuk ~ Xv_tuk + I(Xv_tuk^0.5) + I(1/Xv_tuk), 
##     data = train)
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.3348772 -0.0737304 -0.0006243  0.0668326  0.3336795 
## 
## Coefficients:
##               Value       Std. Error  t value    
## (Intercept)   -10811.2901   2983.9158     -3.6232
## Xv_tuk        -14287.7636   3917.0979     -3.6475
## I(Xv_tuk^0.5)  23446.1197   6446.9996      3.6367
## I(1/Xv_tuk)      681.2725    189.3206      3.5985
## 
## Residual standard error: 0.1073 on 139 degrees of freedom
## [1] "Step18. Correlation between actual and predicted variables: moderate - decent predictive power"

## Step19. Predicted R squired is 0.5354083 
## NULL
## [1] "Step20. Summary of Final Model."
## 
## Call:
## lm(formula = Yv_tuk ~ Xv_tuk + I(Xv_tuk^0.5) + I(Xv_tuk^(-1)), 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32883 -0.06924  0.00442  0.07212  0.33641 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -11093.4     3132.4  -3.541 0.000542 ***
## Xv_tuk         -14646.1     4112.0  -3.562 0.000505 ***
## I(Xv_tuk^0.5)   24045.7     6767.8   3.553 0.000521 ***
## I(Xv_tuk^(-1))    699.8      198.7   3.521 0.000582 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1174 on 139 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3013 
## F-statistic: 21.41 on 3 and 139 DF,  p-value: 1.849e-11

Conclusion

Function

The function in development has a major weakness - polynomial portion is not automated. However, it should save a lot of time in developing a simple regression as it automates many other steps.

Model for BMI vs Life Expectancy

My hypothesis was wrong. The relationship between BMI and Life Expectancy by country is much more complex. It is sinusoidal. It drops as BMI goes up, then it goes up and finally drops again. Which makes sense as BMI relates to GDP per capita, which in turn relates to Life Expectancy. High BMI correlates with high GDP per capita which correlates with high Life Expectancy.