A Simple Linear Regression is the most basic form of regression, where we have one independent variable \(X\) and one dependent variable \(Y\). The regression equation will take form \(Y=b0+b1*X\).
Topic of my first blog post would be creating a basic R function that will perform a Simple Linear Regression Analysis. The function will read in data frame, dependent and independent variables. I will simplify my life by working only with continuous variables. So, both \(X\) and \(Y\) will be continuous.
As a working example, I will build regression model for dependency between a country BMI and its life expectancy. My hypothesis is that there should be a negative correlation between BMI and life expectancy.
I will use Wikipedia as my source.
library(rvest)
library(dplyr)
library(knitr)
library(rcompanion)
library(MASS)
library(tidyverse)
library(caret)
bmiURL <- "https://en.wikipedia.org/wiki/List_of_countries_by_body_mass_index"
temp<-bmiURL%>%read_html%>%html_nodes("table")
BMI<-html_table(temp[[2]],header = NA,fill=TRUE)
BMI<-BMI%>%rename("Country"=!!names(.[1]), "BMI" = !!names(.[3]))
BMI$BMI <- as.numeric(as.character(BMI$BMI))
BMI<-BMI[c(1,3)]
LifeExp <- "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"
temp<-LifeExp%>%read_html%>%html_nodes("table")
LifeExp<-html_table(temp[[1]],header = NA,fill=TRUE)
LifeExp<-LifeExp%>%rename("Country"=!!names(.[1]), "LExp" = !!names(.[3]))
LifeExp$LExp<-as.numeric(as.character(LifeExp$LExp))
LifeExp<-LifeExp[c(1,3)]
ourData<-merge(BMI, LifeExp, by.x="Country", by.y="Country")
Now, we have our data ready to be used for the analysis.
The function performs the following 20 steps:
SimLinReg<-function(ourData,X,Y){
Xv<-(ourData[[X]])
Yv<-(ourData[[Y]])
print(cat("Step1. Missing independent variable values is",sum(is.na(X)),"\n"))
print(cat("Step1. Missing dependent variable values is",sum(is.na(Y)),"\n"))
ourMod<-lm(Yv~Xv)
print("Step2. Top 5 rows.")
print(kable(head(ourData)))
print("Step3. Data Summary.")
print(kable(summary(ourData)))
hist(Xv, main='Step4. Histogram of Independent Variable')
plot(density(Xv),main = "Step4. Density of Independent Variable")
hist(Yv, main='Step4. Histogram of Dependent Variable')
plot(density(Yv),main = "Step4. Density of Dependent Variable.")
print(cat("Step5. Correlation is",cor(Xv,Yv), "\n"))
if (abs(cor(Xv,Yv))<.3){print("Step5. Correlation. !!!Bad - no predictive power. Disregard Model.")}
else if (abs(cor(Xv,Yv))<.5) {print("Step5. Correlation. Weak - poor predictive power")} else if
(abs(cor(Xv,Yv))<.7) {print("Step5. Correlation. Moderate - decent predictive power")}
else {print("Step5. Correlation. Strong - excelent predictive power")}
plot1 <- plot(Yv~Xv, main="Step6. Plot of Independent vs Dependent Variable.")
abline(coef(ourMod))
# step7
set.seed(123)
NData <- data.frame(Yv,Xv)
colnames(NData)<-c("Yv","Xv")
training.samples <- NData$Yv %>%
createDataPartition(p = 0.8, list = FALSE)
train <- NData[training.samples, ]
test <- NData[-training.samples, ]
#Step8
ourMod1<-lm(Yv~Xv+I(Xv^.5)+I(Xv^-1),data=train)
cooksd <- cooks.distance(ourMod1)
inf <- as.numeric(names(cooksd)[(cooksd > 5*mean(cooksd, na.rm=T))])
print(cat("Step9. Outliers. ",length(inf),"outliers were taken out\n"))
train<-train[-inf]
# Step10
Yvl= log(transformTukey(train$Yv,plotit=FALSE,returnLambda = TRUE))
train$Yv_tuk<-log(train$Yv^Yvl)
test$Yv_tuk <-log(test$Yv^Yvl)
Xvl = log(transformTukey(train$Xv,plotit=FALSE,returnLambda = TRUE))
train$Xv_tuk = log(train$Xv^Xvl)
test$Xv_tuk = log(test$Xv^Xvl)
ourMod2<-lm(Yv_tuk~Xv_tuk+I(Xv_tuk^.5)+I(Xv_tuk^(-1)),data=train)
f_stat <- summary(ourMod2)$fstatistic
pval<-pf(f_stat[1], f_stat[2], f_stat[3], lower=FALSE)
if (pval<.05) {print("Step11. F statistics validates the model")} else {print("Step11. F statistics does not validate the model. Advice not to use")}
pval1<-summary(ourMod2)$coefficient[2,4]
if (pval1<.05) {print("Step12. P value for the independent variable shows good fit")}
else {print("Step12. P value for the independent variable shows bad fit. Advice not to use")}
b0<-summary(ourMod2)$coefficient[1,1]
b1<-summary(ourMod2)$coefficient[2,1]
b2<-summary(ourMod2)$coefficient[3,1]
b3<-summary(ourMod2)$coefficient[4,1]
print(cat("Step13. Regression function is Y=",b0,"+",b1,"*x+",b2,"*x^0.5+",b3,"*(1/x)\n"))
aR2<-summary(ourMod2)$adj.r.squared
print(cat("Step14. Adjusted R Squared is",aR2,"\n"))
plot(ourMod2, main="Step15. Residual Plots.")
plot(train$Xv_tuk,(b0+train$Xv_tuk*b1+train$Xv_tuk^.5*b2+(1/train$Xv_tuk)*b3),main="Step16. Plot Final Model.")
print("Step17. Robust Regression for Comparison.")
print(summary(rlm(Yv_tuk~Xv_tuk+I(Xv_tuk^0.5)+I(1/Xv_tuk),data=train)))
Yv_pred <- predict(ourMod2, test)
actuals_preds <- data.frame(cbind(actuals=test$Yv_tuk, predicteds=Yv_pred)) # make actuals_predicteds dataframe.
if (cor(actuals_preds)[2]<.5) {print("Step18. Correlation between actual and predicted variables. !!!Bad - disregard the model")}
else if (cor(actuals_preds)[2]<.7) {print("Step18. Correlation between actual and predicted variables: weak - poor predictive power")}
else if (cor(actuals_preds)[2]<.85) {print("Step18. Correlation between actual and predicted variables: moderate - decent predictive power")}
else {print("Step18. Correlation between actual and predicted variables. Strong - excelent predictive power")}
plot(test$Yv_tuk,Yv_pred, main="Step19. Plot of Actual vs Predicted Values.")
abline(0, 1)
new<- lm (actuals~., data = actuals_preds)
print(cat("Step19. Predicted R squired is",summary (new)[[9]],"\n"))
print("Step20. Summary of Final Model.")
return(summary(ourMod2))
}
SimLinReg(ourData,"BMI","LExp")
## Step1. Missing independent variable values is 0
## NULL
## Step1. Missing dependent variable values is 0
## NULL
## [1] "Step2. Top 5 rows."
##
##
## Country BMI LExp
## -------------------- ----- -----
## Afghanistan 21.6 60.5
## Albania 26.1 77.8
## Algeria 26.2 75.6
## Angola 24.1 52.4
## Antigua and Barbuda 28.1 76.4
## Argentina 27.7 76.3
## [1] "Step3. Data Summary."
##
##
## Country BMI LExp
## --- ----------------- -------------- --------------
## Length:176 Min. :20.50 Min. :50.10
## Class :character 1st Qu.:23.90 1st Qu.:65.95
## Mode :character Median :25.90 Median :73.60
## NA Mean :25.54 Mean :71.52
## NA 3rd Qu.:27.00 3rd Qu.:76.75
## NA Max. :31.90 Max. :83.70
## Step5. Correlation is 0.5228951
## NULL
## [1] "Step5. Correlation. Moderate - decent predictive power"
## Step9. Outliers. 5 outliers were taken out
## NULL
##
## lambda W Shapiro.p.value
## 541 3.5 0.974 0.007878
##
## if (lambda > 0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda < 0){TRANS = -1 * x ^ lambda}
##
##
## lambda W Shapiro.p.value
## 446 1.125 0.9824 0.06252
##
## if (lambda > 0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda < 0){TRANS = -1 * x ^ lambda}
##
## [1] "Step11. F statistics validates the model"
## [1] "Step12. P value for the independent variable shows good fit"
## Step13. Regression function is Y= -11093.39 + -14646.06 *x+ 24045.7 *x^0.5+ 699.7548 *(1/x)
## NULL
## Step14. Adjusted R Squared is 0.3012769
## NULL
## [1] "Step17. Robust Regression for Comparison."
##
## Call: rlm(formula = Yv_tuk ~ Xv_tuk + I(Xv_tuk^0.5) + I(1/Xv_tuk),
## data = train)
## Residuals:
## Min 1Q Median 3Q Max
## -0.3348772 -0.0737304 -0.0006243 0.0668326 0.3336795
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -10811.2901 2983.9158 -3.6232
## Xv_tuk -14287.7636 3917.0979 -3.6475
## I(Xv_tuk^0.5) 23446.1197 6446.9996 3.6367
## I(1/Xv_tuk) 681.2725 189.3206 3.5985
##
## Residual standard error: 0.1073 on 139 degrees of freedom
## [1] "Step18. Correlation between actual and predicted variables: moderate - decent predictive power"
## Step19. Predicted R squired is 0.5354083
## NULL
## [1] "Step20. Summary of Final Model."
##
## Call:
## lm(formula = Yv_tuk ~ Xv_tuk + I(Xv_tuk^0.5) + I(Xv_tuk^(-1)),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32883 -0.06924 0.00442 0.07212 0.33641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11093.4 3132.4 -3.541 0.000542 ***
## Xv_tuk -14646.1 4112.0 -3.562 0.000505 ***
## I(Xv_tuk^0.5) 24045.7 6767.8 3.553 0.000521 ***
## I(Xv_tuk^(-1)) 699.8 198.7 3.521 0.000582 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1174 on 139 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3013
## F-statistic: 21.41 on 3 and 139 DF, p-value: 1.849e-11
The function in development has a major weakness - polynomial portion is not automated. However, it should save a lot of time in developing a simple regression as it automates many other steps.
My hypothesis was wrong. The relationship between BMI and Life Expectancy by country is much more complex. It is sinusoidal. It drops as BMI goes up, then it goes up and finally drops again. Which makes sense as BMI relates to GDP per capita, which in turn relates to Life Expectancy. High BMI correlates with high GDP per capita which correlates with high Life Expectancy.