Simple Linear Regression

Building a simple linear regression model to predict the weight of Adipose Tissue from waist circumference

Linear regression is one of the most basic algorithms used to predict a continuous value, and I personally call it an early man’s tool for data science.

Linear regression comes in two types, distinguished by the number of inputs used to regress a value. Simple linear regression, as its name suggests, accepts only one input to produce a corresponding prediction. Multiple linear regression, on the other hand, uses two or more input variables to predict a single continuous output variable.

Tip: When we have a single continuous input variable and a single continuous output variable, we go with simple linear regression.

The equation of simple linear regression is:
Y = ßo + ß1*X + e

where,

    * ßo and ß1 are known by several names: coefficients, parameters or estimates. ßo is the intercept and ß1 is the slope.
    * X is the input variable.
    * e is the error term, also called epsilon.

[Figure: Simple Linear Regression Model]
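
To make the equation concrete, here is a minimal sketch that simulates data from it and lets lm() recover the two coefficients. The values of beta0 and beta1 and the noise level are made up for illustration and have nothing to do with the wc-at data.

```r
# Simulate Y = beta0 + beta1*X + e with made-up values (not from the wc-at data)
set.seed(42)                        # for reproducibility
beta0 <- 2                          # hypothetical intercept
beta1 <- 0.5                        # hypothetical slope
x <- runif(200, min = 0, max = 10)  # the input variable
e <- rnorm(200, mean = 0, sd = 1)   # the error term
y <- beta0 + beta1 * x + e

# lm() estimates the intercept and slope from the simulated data
sim_fit <- lm(y ~ x)
estimates <- coef(sim_fit)
print(estimates)
```

The estimates should land close to the made-up values, but not exactly on them, because of the error term.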

Let us begin our journey to understand simple linear regression by reading the data file (a CSV file here).

# We are reading the data into an object called df. 
# You are free to set an object name of your choice
df <- read.csv('G:/Projects/Datasets/wc-at.csv')

# The View() function will display the file table 
View(df)

# attach() function will help us in managing the contents of the data we are dealing with
# by calling attach(df), R will first search for variables in this dataset by default 
attach(df)
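
attach() is convenient, but calling it more than once on the same data leads to masked variables. A possible alternative, sketched here on a tiny made-up data frame (the name toy and its numbers are assumptions, not values from wc-at.csv), is to refer to columns explicitly, to use with(), or to pass data = to modelling functions.

```r
# Tiny made-up data frame with the same column names as the article's data
toy <- data.frame(Waist = c(70, 80, 90, 100, 110),
                  AT    = c(30, 60, 95, 130, 170))

# Option 1: refer to a column explicitly
mean_waist <- mean(toy$Waist)

# Option 2: evaluate an expression inside the data frame
cor_toy <- with(toy, cor(AT, Waist))

# Option 3: most modelling functions accept a data argument
toy_fit <- lm(AT ~ Waist, data = toy)

print(mean_waist)
print(cor_toy)
print(coef(toy_fit))
```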

As a best practice, we will first perform some basic statistical analysis on the two variables. We will then run a block of code to understand the relationship between them.

First business moment: the measures of central tendency (mean, median and mode) are usually the first thing to look at when data is up for analysis.
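
As a small illustration of the three measures, the sketch below computes them on a made-up sample. Base R provides mean() and median(); the helper stat_mode() is a hypothetical name introduced here, because R's own mode() returns the storage type of an object rather than the most frequent value.

```r
# Made-up sample, only to illustrate the three measures
x <- c(2, 3, 3, 5, 7, 3, 9, 5)

# Hypothetical helper: the most frequent value (first one wins on ties)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mean   <- mean(x)
x_median <- median(x)
x_mode   <- stat_mode(x)

print(c(mean = x_mean, median = x_median, mode = x_mode))
```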

We will look at some common plots statisticians use in their analysis.

A histogram, drawn with hist(), is a basic plot that visually shows the distribution, skewness and kurtosis of the plotted feature. We have plotted both features in the data, i.e. Waist and Adipose Tissue (AT).

From the plots we can infer that AT is positively skewed (right skewed), while Waist looks roughly symmetric. We can calculate the numerical values of skewness and kurtosis using skewness() and kurtosis(), which are available in the e1071 package.
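
For readers who want to see what these numbers measure, here is a minimal sketch of the plain moment-based definitions in base R. The function names and sample vectors are made up for illustration; e1071 offers several variants through its type argument, so its results may differ slightly from these on small samples.

```r
# Moment-based skewness: third central moment scaled by the variance^(3/2)
moment_skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled, minus 3
moment_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)   # made-up symmetric sample
right_skewed <- c(1, 2, 3, 4, 10)  # made-up sample with a long right tail

skew_sym   <- moment_skewness(symmetric)
skew_right <- moment_skewness(right_skewed)
kurt_sym   <- moment_kurtosis(symmetric)

print(c(skew_sym, skew_right, kurt_sym))
```

A positive skewness, as in the second sample, indicates a longer right tail; a negative value indicates a longer left tail.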

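We can also check the distribution with a Q-Q plot. qqnorm() plots the sample quantiles against the quantiles of a normal distribution, and points lying close to a straight line suggest that the data is approximately normal. qqplot() compares the quantiles of two samples with each other.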

A box plot is a highly informative plot for basic analysis of data. It provides a five-point summary: lower whisker, Q1, median, Q3 and upper whisker. It can also be used to understand the distribution of the data.

Tip: If any of the terms are new to you, please drop a comment below and I will explain them in the simplest form I know.
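
The five numbers behind a box plot can be inspected without drawing anything. The sketch below uses boxplot.stats() from base R on a made-up vector; the value 50 is deliberately placed far from the rest so that it shows up as an outlier beyond the upper whisker.

```r
# Made-up sample with one value far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bp <- boxplot.stats(x)

# stats: lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
five_numbers <- bp$stats
# out: points lying beyond the whiskers
outliers <- bp$out

print(five_numbers)  # 1.0 3.0 5.5 8.0 9.0
print(outliers)      # 50
```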

print("The summary of Waist and AT variables in df:")
## [1] "The summary of Waist and AT variables in df:"
summary(df)
##      Waist             AT        
##  Min.   : 63.5   Min.   : 11.44  
##  1st Qu.: 80.0   1st Qu.: 50.88  
##  Median : 90.8   Median : 96.54  
##  Mean   : 91.9   Mean   :101.89  
##  3rd Qu.:104.0   3rd Qu.:137.00  
##  Max.   :121.0   Max.   :253.00

Plots we discussed in the section above.

attach(df)
## The following objects are masked from df (pos = 3):
## 
##     AT, Waist
library(e1071)

hist(as.numeric(df$Waist))

hist(as.numeric(df$AT))

skewness(as.numeric(df$Waist))
## [1] 0.130389
skewness(as.numeric(df$AT))
## [1] 0.5688705
kurtosis(as.numeric(df$Waist))
## [1] -1.141846
kurtosis(as.numeric(df$AT))
## [1] -0.3760059
qqplot(AT,Waist)

boxplot(df, col = c('red','sienna'))

plot(df)

line(Waist, AT)
## 
## Call:
## line(Waist, AT)
## 
## Coefficients:
## [1]  -239.498     3.711

Let us now start the exploratory analysis of these two variables.

We will work on correlation analysis first.

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

As a rule of thumb in this article, a correlation is called strong if cor() between two features is above 0.85, moderate if it lies between 0.65 and 0.85, and low if it is below 0.65.

cor(AT,Waist)
## [1] 0.8185578

As we see here, the correlation between adipose tissue and waist is ~0.82, which means the relationship between them is moderate.
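
To show what cor() computes, the sketch below works out the Pearson correlation by hand on made-up numbers and compares it with cor(). The helper classify_cor() is a hypothetical name that simply encodes the rule-of-thumb thresholds used in this article.

```r
# Made-up data, only for illustration
x <- c(70, 80, 90, 100, 110)
y <- c(30, 60, 95, 130, 170)

# Pearson correlation: covariation of x and y scaled by their spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y)

# Hypothetical helper applying this article's thresholds
classify_cor <- function(r) {
  a <- abs(r)
  if (a > 0.85) "strong" else if (a >= 0.65) "moderate" else "low"
}

print(c(r_manual, r_builtin))
print(classify_cor(0.8185578))  # the value reported for AT and Waist
```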

Let us now start building the simple linear regression model. We will name our model object firstlm, as it is our first linear regression model.

lm(), which stands for linear model, is used to construct the linear regression model. Its main argument is a formula: the output variable we are interested in, followed by ~ and the input variable used to regress the output with the help of an intercept and a slope parameter.

summary() returns the key facts about the model: the residuals, the R-squared value, p-values for the significance of the variables, and the intercept and slope estimates.
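
The individual pieces of a model summary can also be pulled out as numbers instead of being read off the printed report. The sketch below does this on a small made-up data frame (toy and its values are assumptions, not the wc-at data).

```r
# Made-up data with the same column names as the article's data
toy <- data.frame(Waist = c(70, 80, 90, 100, 110, 120),
                  AT    = c(35, 55, 100, 125, 175, 190))

toy_fit <- lm(AT ~ Waist, data = toy)
toy_sum <- summary(toy_fit)

coef_table <- coef(toy_sum)        # estimates, std. errors, t and p values
r2         <- toy_sum$r.squared    # multiple R-squared
adj_r2     <- toy_sum$adj.r.squared
rse        <- toy_sum$sigma        # residual standard error
res        <- residuals(toy_fit)   # one residual per observation

print(coef_table)
print(c(r2 = r2, adj_r2 = adj_r2, rse = rse))
```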

firstlm <- lm(AT~Waist)
summary(firstlm)
## 
## Call:
## lm(formula = AT ~ Waist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.288  -19.143   -2.939   16.376   90.342 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -215.9815    21.7963  -9.909   <2e-16 ***
## Waist          3.4589     0.2347  14.740   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.06 on 107 degrees of freedom
## Multiple R-squared:   0.67,  Adjusted R-squared:  0.667 
## F-statistic: 217.3 on 1 and 107 DF,  p-value: < 2.2e-16
confint(firstlm,level = 0.95)
##                   2.5 %     97.5 %
## (Intercept) -259.190053 -172.77292
## Waist          2.993689    3.92403
predict(firstlm,interval="predict")
## Warning in predict.lm(firstlm, interval = "predict"): predictions on current data refer to _future_ responses
##            fit         lwr       upr
## 1    42.568252 -23.7607107 108.89721
## 2    35.131704 -31.3249765 101.58838
## 3    66.953210   0.9383962 132.96802
## 4    74.389758   8.4385892 140.34093
## 5    42.222366 -24.1122081 108.55694
## 6    32.537559 -33.9671546  99.04227
## 7    63.840237  -2.2056980 129.88617
## 8    72.487385   6.5213726 138.45340
## 9     3.656083 -63.5036005  70.81577
## 10   37.207020 -29.2125284 103.62657
## 11   32.710502 -33.7909536  99.21196
## 12   43.432966 -22.8821078 109.74804
## 13   36.861134 -29.5645231 103.28679
## 14   57.268404  -8.8518878 123.38870
## 15   50.350685 -15.8605336 116.56190
## 16   22.160981 -44.5537679  88.87573
## 17   46.718883 -19.5452517 112.98302
## 18   40.492936 -25.8701771 106.85605
## 19   39.282335 -27.1012331 105.66590
## 20   46.545940 -19.7208032 112.81268
## 21   49.831856 -16.3867039 116.05042
## 22   63.840237  -2.2056980 129.88617
## 23   60.381377  -5.7022296 126.46498
## 24   92.548770  26.6894200 158.40812
## 25   67.644982   1.6367253 133.65324
## 26  102.233576  36.3862036 168.08095
## 27   83.555735  17.6622091 149.44926
## 28   62.456693  -3.6039202 128.51731
## 29   81.480420  15.5758571 147.38498
## 30   69.374412   3.3819768 135.36685
## 31   72.833271   6.8700310 138.79651
## 32   88.744024  22.8729233 154.61513
## 33   98.082945  32.2335934 163.93230
## 34   93.240542  27.3829016 159.09818
## 35  136.822170  70.8074775 202.83686
## 36  110.880725  45.0222774 176.73917
## 37   98.774717  32.9260237 164.62341
## 38  140.281029  74.2316072 206.33045
## 39   60.727263  -5.3524301 126.80696
## 40   57.268404  -8.8518878 123.38870
## 41   72.833271   6.8700310 138.79651
## 42   46.891826 -19.3697083 113.15336
## 43   62.456693  -3.6039202 128.51731
## 44   83.209849  17.3145658 149.10513
## 45   71.103842   5.1264122 137.08127
## 46  154.462353  88.2365608 220.68815
## 47  110.188953  44.3321471 176.04576
## 48  110.880725  45.0222774 176.73917
## 49   59.689606  -6.4019262 125.78114
## 50   58.306062  -7.8017094 124.41383
## 51   94.624085  28.7694706 160.47870
## 52   73.870929   7.9158100 139.82605
## 53   78.713332  12.7922191 144.63445
## 54   45.162396 -21.1255054 111.45030
## 55   55.193088 -10.9531208 121.33930
## 56   55.884860 -10.2525800 122.02230
## 57   87.706367  21.8313711 153.58136
## 58   82.518078  16.6191807 148.41697
## 59   79.750990  13.8363291 145.66565
## 60   73.525043   7.5672497 139.48284
## 61   52.426001 -13.7565798 118.60858
## 62   77.675674  11.7478144 143.60353
## 63   60.035492  -6.0520617 126.12304
## 64  158.612984  92.3252791 224.90069
## 65  197.698095 130.6020356 264.79416
## 66  198.735753 131.6127559 265.85875
## 67  117.798443  51.9163563 183.68053
## 68  148.928178  82.7776990 215.07866
## 69  147.198748  81.0701043 213.32739
## 70  154.116467  87.8956245 220.33731
## 71  154.116467  87.8956245 220.33731
## 72  133.363311  67.3800865 199.34653
## 73  119.527873  53.6378248 185.41792
## 74  129.904451  63.9494297 195.85947
## 75  157.575326  91.3035349 223.84712
## 76  129.904451  63.9494297 195.85947
## 77  140.281029  74.2316072 206.33045
## 78  143.739889  77.6524810 209.82730
## 79  150.657608  84.4844833 216.83073
## 80  161.034186  94.7082219 227.36015
## 81  142.010459  75.9424508 208.07847
## 82  164.493045  98.1096934 230.87640
## 83  164.493045  98.1096934 230.87640
## 84  171.410764 104.9030239 237.91850
## 85  159.304756  93.0062808 225.60323
## 86  143.739889  77.6524810 209.82730
## 87  167.951905 101.5079578 234.39585
## 88  159.304756  93.0062808 225.60323
## 89  202.540498 135.3163441 269.76465
## 90  161.034186  94.7082219 227.36015
## 91  121.257303  55.3584733 187.15613
## 92  148.928178  82.7776990 215.07866
## 93  122.986732  57.0783023 188.89516
## 94  110.880725  45.0222774 176.73917
## 95  119.527873  53.6378248 185.41792
## 96  147.198748  81.0701043 213.32739
## 97  150.657608  84.4844833 216.83073
## 98  126.445592  60.5155029 192.37568
## 99   98.774717  32.9260237 164.62341
## 100 138.551600  72.5199497 204.58325
## 101 150.657608  84.4844833 216.83073
## 102 161.380072  95.0485136 227.71163
## 103 181.787342 115.0691257 248.50556
## 104 133.363311  67.3800865 199.34653
## 105 130.250337  64.2926425 196.20803
## 106 106.730093  40.8795247 172.58066
## 107 136.130398  70.1222603 202.13854
## 108 157.229440  90.9628890 223.49599
## 109 159.304756  93.0062808 225.60323
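
The warning printed by predict() appears because no new data was supplied, so the prediction intervals were computed for the rows the model was fitted on. A minimal sketch of predicting for new waist values is shown below on made-up data (toy, new_waist and their values are assumptions). It also contrasts the two interval types: a confidence interval is for the mean response, a prediction interval is for a single future observation and is therefore wider.

```r
# Made-up data with the same column names as the article's data
toy <- data.frame(Waist = c(70, 80, 90, 100, 110, 120),
                  AT    = c(35, 55, 100, 125, 175, 190))
toy_fit <- lm(AT ~ Waist, data = toy)

# New observations must use the same column name as the model formula
new_waist <- data.frame(Waist = c(85, 105))

conf_int <- predict(toy_fit, newdata = new_waist,
                    interval = "confidence", level = 0.95)
pred_int <- predict(toy_fit, newdata = new_waist,
                    interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```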

Interpreting the summary of model ‘firstlm’:

Firstly, the adjusted R-squared is 0.667, which means that the model explains about 66.7% of the variance in AT. Secondly, the coefficients are ßo = -215.98 and ß1 = 3.4589, and the residual standard error is 33.06.

Using the above parameters, we can write the fitted model as Y = -215.98 + 3.46*X. The residual standard error of 33.06 describes the typical size of the error term e; it is not a constant to be added to the prediction. This equation is used to predict the value of adipose tissue by plugging in the value of waist (row by row).
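
As a quick worked example, the fitted coefficients from the summary output can be plugged in by hand. The function name predict_at() and the waist value of 100 are chosen here purely for illustration.

```r
# Coefficients copied from the summary(firstlm) output
b0 <- -215.9815  # intercept
b1 <- 3.4589     # slope for Waist

# Hypothetical helper: point prediction of AT for a given waist value
predict_at <- function(waist) b0 + b1 * waist

at_100 <- predict_at(100)
print(at_100)  # 129.9085
```

A waist value of 100 lies inside the observed range (63.5 to 121), which is where such predictions are most trustworthy.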

What is the difference between Multiple R-Squared and Adjusted R-Squared?

Multiple R squared is simply a measure of Rsquared for models that have multiple predictor variables. It measures the proportion of variation in the response variable that can be explained by the predictor variables. The fundamental point is that when you add predictors to your model, the multiple Rsquared never decreases, because an extra predictor can always explain some portion of the variance, even if only by chance.

Adjusted Rsquared controls against this increase, and adds penalties for the number of predictors in the model. Therefore it shows a balance between the most parsimonious model, and the best fitting model. Generally, if you have a large difference between your multiple and your adjusted Rsquared that indicates you may have overfit your model.
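
The penalty can be written as adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies this standard formula to the rounded values printed for firstlm; the function name adjusted_r2() is made up for illustration.

```r
# Standard adjusted R-squared formula
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# firstlm reports 107 residual degrees of freedom with 2 estimated
# coefficients (intercept and slope), so n = 109 observations
n <- 107 + 2

adj_first <- adjusted_r2(0.67, n = n, p = 1)
print(round(adj_first, 3))  # 0.667
```

With a single predictor and 109 observations the penalty is tiny, which is why the two values in the summary are so close.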

Can we increase the R-Squared value? In other words, can we make the model fit better?

Yes, we can improve the model, but this involves a lot of trial and error or experimentation with the features used in the model.

We will first plot qqplots to check the normality of the input variables (and sometimes the output variables).

qqplot(AT, Waist)

qqnorm(AT)

qqnorm(Waist)

The skewness values of these variables confirm that the data is right (positively) skewed and not normal.

Hence, we can now apply a basic transformation to the data to check whether it improves the model. We will first apply a logarithmic transformation to both the output variable AT and the input variable Waist.

# Logarithmic transformation
secondlm <- lm(log(AT) ~ log(Waist))  # Regression on log-transformed AT and Waist
summary(secondlm)
## 
## Call:
## lm(formula = log(AT) ~ log(Waist))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96388 -0.21762  0.01988  0.21214  0.79811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.4607     0.9820  -12.69   <2e-16 ***
## log(Waist)    3.7476     0.2176   17.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3358 on 107 degrees of freedom
## Multiple R-squared:  0.7348, Adjusted R-squared:  0.7324 
## F-statistic: 296.5 on 1 and 107 DF,  p-value: < 2.2e-16

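As the legends say, experiments do yield good results: the adjusted R-squared has risen from 0.667 to 0.732, an increase of over 6 percentage points. Keep in mind that the second model explains variance in log(AT) rather than in AT, so the two values are measured on different scales and the comparison is only indicative.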
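
Because secondlm predicts log(AT), its predictions have to be transformed back with exp() before they can be read as adipose tissue values. The sketch below does this by hand with the coefficients printed above; the function name predict_at_log() and the waist value of 100 are chosen for illustration only.

```r
# Coefficients copied from the summary(secondlm) output
a <- -12.4607  # intercept on the log scale
b <- 3.7476    # slope for log(Waist)

# Hypothetical helper: predict log(AT), then back-transform with exp()
predict_at_log <- function(waist) exp(a + b * log(waist))

at_100_log <- predict_at_log(100)

# The same model written on the original scale: AT = exp(a) * Waist^b
at_100_power <- exp(a) * 100^b

print(c(at_100_log, at_100_power))
```

Under the usual assumption of normal errors on the log scale, the back-transformed value is closer to a median than to a mean of AT, so it need not match the prediction from firstlm.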