Linear regression is one of the most basic algorithms for predicting a continuous value, and I personally call it an early man’s tool for data science.
Linear regression comes in two types, distinguished by the number of inputs used to regress a value. Simple linear regression, as its name suggests, accepts only one input variable to produce a prediction. Multiple linear regression, on the other hand, has a single continuous output variable and two or more input variables.
Tip: When we have a single continuous input variable and a single continuous output variable, we go with simple linear regression.
The equation of simple linear regression is:
$$Y = \beta_0 + \beta_1 X + e$$
where,
* β0 and β1 are known by several names, such as coefficients, parameters or estimates.
* X is the input variable.
* e is the error term, also called epsilon.
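To make the roles of the intercept, the slope and the error term concrete, here is a minimal sketch on simulated data. The "true" values of 2.0 and 0.5, the sample size and the noise level are arbitrary assumptions chosen for illustration; they have nothing to do with the dataset used in this article.

```r
# Simulated data only: the true values below are arbitrary, illustrative choices.
set.seed(42)
n     <- 200
beta0 <- 2.0                           # assumed intercept
beta1 <- 0.5                           # assumed slope
x     <- runif(n, min = 0, max = 10)   # input variable X
e     <- rnorm(n, mean = 0, sd = 1)    # error term e
y     <- beta0 + beta1 * x + e         # output generated as beta0 + beta1 * X + e

fit <- lm(y ~ x)   # estimate the intercept and slope from the simulated data
coef(fit)          # the estimates should land near 2.0 and 0.5
```

Because of the error term, the estimates will not match the assumed values exactly, but they should come close with a sample of this size.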
Simple Linear Regression Model
Let us begin our journey to understand simple linear regression by reading the data file (a CSV file here).
# We are reading the data into an object called df.
# You are free to set an object name of your choice
df <- read.csv('G:/Projects/Datasets/wc-at.csv')
# The View() function will display the file table
View(df)
# attach() function will help us in managing the contents of the data we are dealing with
# by calling attach(df), R will first search for variables in this dataset by default
attach(df)
As a best practice, we will first perform some basic statistical analysis on the two variables. We will execute a block of code to understand the relationship between them.
We will look at some common plots statisticians use in their analysis.
A histogram, drawn with hist(), is a basic plot that visually shows the distribution of a feature, including its skewness and kurtosis. We have plotted both features in the data, i.e. Waist and AT (adipose tissue).
From the plots we can infer that AT is positively skewed (right-skewed), while Waist looks roughly symmetric. We can calculate the numerical values of skewness and kurtosis using skewness() and kurtosis(), which are available in the e1071 package.
We can also examine the distribution with a Q-Q plot. qqnorm() plots the sample quantiles against those of a normal distribution, and the closer the points lie to a straight line, the closer the data are to normal; qqplot() compares the quantiles of two samples with each other.
A box plot is a highly informative plot for basic analysis of data. It provides a five-point summary: lower whisker, Q1, median, Q3 and upper whisker. It can be used to understand the distribution of the data as well.
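To show what a skewness or kurtosis number actually measures, here is a small sketch that computes the classical moment-based versions by hand on simulated data. The helper names are my own, and the e1071 functions offer several estimator variants through their type argument, so their output can differ slightly from these hand-written versions.

```r
# Hypothetical helpers (not part of e1071): classical moment-based estimators.
central_moment <- function(x, k) mean((x - mean(x))^k)

moment_skewness <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

moment_excess_kurtosis <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3   # 0 for a normal distribution
}

set.seed(1)
right_skewed <- rexp(1000)    # simulated: long tail to the right
symmetric    <- rnorm(1000)   # simulated: roughly symmetric

moment_skewness(right_skewed)       # clearly positive
moment_skewness(symmetric)          # close to zero
moment_excess_kurtosis(symmetric)   # close to zero
```

A positive skewness means the longer tail is on the right; a negative value means it is on the left.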
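The five numbers that a box plot draws can also be inspected directly. The sketch below uses boxplot.stats() from base R on simulated values; the data, including the one extreme point, are made up for illustration.

```r
# Simulated values plus one deliberately extreme point (illustrative only).
set.seed(7)
x <- c(rnorm(100, mean = 90, sd = 13), 160)

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge (~Q1), median, upper hinge (~Q3), upper whisker
bs$out     # points beyond the whiskers, which boxplot() draws individually
```

Points listed in bs$out are the ones that would appear as separate dots above or below the whiskers.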
Tip: If any of these terms are new to you, please drop a comment below and I will explain them in the simplest form I know.
print("The summary of Waist and AT variables in df:")
## [1] "The summary of Waist and AT variables in df:"
summary(df)
## Waist AT
## Min. : 63.5 Min. : 11.44
## 1st Qu.: 80.0 1st Qu.: 50.88
## Median : 90.8 Median : 96.54
## Mean : 91.9 Mean :101.89
## 3rd Qu.:104.0 3rd Qu.:137.00
## Max. :121.0 Max. :253.00
Here are the plots we discussed in the section above.
attach(df)
## The following objects are masked from df (pos = 3):
##
## AT, Waist
library(e1071)
hist(as.numeric(df$Waist))
hist(as.numeric(df$AT))
skewness(as.numeric(df$Waist))
## [1] 0.130389
skewness(as.numeric(df$AT))
## [1] 0.5688705
kurtosis(as.numeric(df$Waist))
## [1] -1.141846
kurtosis(as.numeric(df$AT))
## [1] -0.3760059
qqplot(AT,Waist)
boxplot(df, col = c('red','sienna'))
plot(df)
line(Waist, AT)
##
## Call:
## line(Waist, AT)
##
## Coefficients:
## [1] -239.498 3.711
Let us now start the exploratory analysis of these two variables.
We will work on correlation analysis first.
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
As a rule of thumb used here, a correlation can be called strong if cor() between two features is above 0.85, moderate if it lies between 0.65 and 0.85, and low if it is below 0.65.
cor(AT,Waist)
## [1] 0.8185578
As we see here, the correlation between adipose tissue and waist is ~0.82, which means the relationship between them is moderate.
Let us now start building the simple linear regression model. We will name our model object firstlm, as it is our first linear regression model.
lm(), which stands for linear model, is used to construct the linear regression model. Its arguments are simple: a formula naming the output variable we are interested in and the input variable used to regress it, with the help of an intercept and a slope parameter.
summary() returns the key facts about the model: the residuals, the R-squared value, the p-values for the significance of the variables, and the intercept and slope estimates.
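For readers curious about what cor() computes, here is a minimal sketch of the Pearson correlation written out by hand and compared with the built-in function. The data are simulated, and the numbers used to generate them are arbitrary assumptions.

```r
# Simulated data (illustrative only): y depends on x plus noise.
set.seed(3)
x <- rnorm(50, mean = 90, sd = 13)
y <- 3 * x + rnorm(50, sd = 30)

# Pearson correlation: co-variation of x and y, scaled by their individual spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)   # built-in Pearson correlation; should agree with r_manual
```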
firstlm <- lm(AT~Waist)
summary(firstlm)
##
## Call:
## lm(formula = AT ~ Waist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.288 -19.143 -2.939 16.376 90.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -215.9815 21.7963 -9.909 <2e-16 ***
## Waist 3.4589 0.2347 14.740 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.06 on 107 degrees of freedom
## Multiple R-squared: 0.67, Adjusted R-squared: 0.667
## F-statistic: 217.3 on 1 and 107 DF, p-value: < 2.2e-16
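For a single input variable, the least-squares estimates that lm() reports have a simple closed form: the slope is the covariance of input and output divided by the variance of the input, and the intercept follows from the two means. The sketch below checks this on simulated data; the generating values are assumptions picked only to resemble the scale of this example.

```r
# Simulated data (illustrative only).
set.seed(11)
x <- runif(100, min = 60, max = 120)
y <- -200 + 3.5 * x + rnorm(100, sd = 30)

b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

c(intercept = b0, slope = b1)
coef(lm(y ~ x))                   # should match the hand-computed values
```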
confint(firstlm,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -259.190053 -172.77292
## Waist 2.993689 3.92403
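The intervals printed by confint() can be reproduced from the summary table: each one is the estimate plus or minus a t quantile times the standard error. The sketch below plugs in the rounded values from the summary output, so it should agree with confint() only up to rounding.

```r
# Values copied (rounded) from the summary(firstlm) output.
est      <- c(intercept = -215.9815, Waist = 3.4589)   # estimates
se       <- c(intercept = 21.7963,   Waist = 0.2347)   # standard errors
df_resid <- 107                                        # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)   # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```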
predict(firstlm,interval="predict")
## Warning in predict.lm(firstlm, interval = "predict"): predictions on current data refer to _future_ responses
## fit lwr upr
## 1 42.568252 -23.7607107 108.89721
## 2 35.131704 -31.3249765 101.58838
## 3 66.953210 0.9383962 132.96802
## 4 74.389758 8.4385892 140.34093
## 5 42.222366 -24.1122081 108.55694
## 6 32.537559 -33.9671546 99.04227
## 7 63.840237 -2.2056980 129.88617
## 8 72.487385 6.5213726 138.45340
## 9 3.656083 -63.5036005 70.81577
## 10 37.207020 -29.2125284 103.62657
## 11 32.710502 -33.7909536 99.21196
## 12 43.432966 -22.8821078 109.74804
## 13 36.861134 -29.5645231 103.28679
## 14 57.268404 -8.8518878 123.38870
## 15 50.350685 -15.8605336 116.56190
## 16 22.160981 -44.5537679 88.87573
## 17 46.718883 -19.5452517 112.98302
## 18 40.492936 -25.8701771 106.85605
## 19 39.282335 -27.1012331 105.66590
## 20 46.545940 -19.7208032 112.81268
## 21 49.831856 -16.3867039 116.05042
## 22 63.840237 -2.2056980 129.88617
## 23 60.381377 -5.7022296 126.46498
## 24 92.548770 26.6894200 158.40812
## 25 67.644982 1.6367253 133.65324
## 26 102.233576 36.3862036 168.08095
## 27 83.555735 17.6622091 149.44926
## 28 62.456693 -3.6039202 128.51731
## 29 81.480420 15.5758571 147.38498
## 30 69.374412 3.3819768 135.36685
## 31 72.833271 6.8700310 138.79651
## 32 88.744024 22.8729233 154.61513
## 33 98.082945 32.2335934 163.93230
## 34 93.240542 27.3829016 159.09818
## 35 136.822170 70.8074775 202.83686
## 36 110.880725 45.0222774 176.73917
## 37 98.774717 32.9260237 164.62341
## 38 140.281029 74.2316072 206.33045
## 39 60.727263 -5.3524301 126.80696
## 40 57.268404 -8.8518878 123.38870
## 41 72.833271 6.8700310 138.79651
## 42 46.891826 -19.3697083 113.15336
## 43 62.456693 -3.6039202 128.51731
## 44 83.209849 17.3145658 149.10513
## 45 71.103842 5.1264122 137.08127
## 46 154.462353 88.2365608 220.68815
## 47 110.188953 44.3321471 176.04576
## 48 110.880725 45.0222774 176.73917
## 49 59.689606 -6.4019262 125.78114
## 50 58.306062 -7.8017094 124.41383
## 51 94.624085 28.7694706 160.47870
## 52 73.870929 7.9158100 139.82605
## 53 78.713332 12.7922191 144.63445
## 54 45.162396 -21.1255054 111.45030
## 55 55.193088 -10.9531208 121.33930
## 56 55.884860 -10.2525800 122.02230
## 57 87.706367 21.8313711 153.58136
## 58 82.518078 16.6191807 148.41697
## 59 79.750990 13.8363291 145.66565
## 60 73.525043 7.5672497 139.48284
## 61 52.426001 -13.7565798 118.60858
## 62 77.675674 11.7478144 143.60353
## 63 60.035492 -6.0520617 126.12304
## 64 158.612984 92.3252791 224.90069
## 65 197.698095 130.6020356 264.79416
## 66 198.735753 131.6127559 265.85875
## 67 117.798443 51.9163563 183.68053
## 68 148.928178 82.7776990 215.07866
## 69 147.198748 81.0701043 213.32739
## 70 154.116467 87.8956245 220.33731
## 71 154.116467 87.8956245 220.33731
## 72 133.363311 67.3800865 199.34653
## 73 119.527873 53.6378248 185.41792
## 74 129.904451 63.9494297 195.85947
## 75 157.575326 91.3035349 223.84712
## 76 129.904451 63.9494297 195.85947
## 77 140.281029 74.2316072 206.33045
## 78 143.739889 77.6524810 209.82730
## 79 150.657608 84.4844833 216.83073
## 80 161.034186 94.7082219 227.36015
## 81 142.010459 75.9424508 208.07847
## 82 164.493045 98.1096934 230.87640
## 83 164.493045 98.1096934 230.87640
## 84 171.410764 104.9030239 237.91850
## 85 159.304756 93.0062808 225.60323
## 86 143.739889 77.6524810 209.82730
## 87 167.951905 101.5079578 234.39585
## 88 159.304756 93.0062808 225.60323
## 89 202.540498 135.3163441 269.76465
## 90 161.034186 94.7082219 227.36015
## 91 121.257303 55.3584733 187.15613
## 92 148.928178 82.7776990 215.07866
## 93 122.986732 57.0783023 188.89516
## 94 110.880725 45.0222774 176.73917
## 95 119.527873 53.6378248 185.41792
## 96 147.198748 81.0701043 213.32739
## 97 150.657608 84.4844833 216.83073
## 98 126.445592 60.5155029 192.37568
## 99 98.774717 32.9260237 164.62341
## 100 138.551600 72.5199497 204.58325
## 101 150.657608 84.4844833 216.83073
## 102 161.380072 95.0485136 227.71163
## 103 181.787342 115.0691257 248.50556
## 104 133.363311 67.3800865 199.34653
## 105 130.250337 64.2926425 196.20803
## 106 106.730093 40.8795247 172.58066
## 107 136.130398 70.1222603 202.13854
## 108 157.229440 90.9628890 223.49599
## 109 159.304756 93.0062808 225.60323
Interpreting the summary of model ‘firstlm’:
Firstly, the adjusted R-squared is 0.667, which means that the model explains about 66.7% of the variance in AT. Secondly, the coefficients are β0 = -215.98 and β1 = 3.4589, and the residual standard error is 33.06.
Using the above parameters, we can write the fitted simple linear model as Y = -215.98 + 3.46*X, where the residual standard error of 33.06 describes the typical size of the error term e rather than a constant added to the prediction. This equation will be used to predict the values of adipose tissue by substituting the value of waist (row by row).
Multiple R-squared is simply the R-squared measure for models that may have multiple predictor variables. It measures the proportion of variation in the response variable that is explained by the predictor variables. The fundamental point is that when you add predictors to your model, the multiple R-squared never decreases, because an extra predictor will almost always explain some portion of the variance, even if only by chance.
Adjusted R-squared controls against this increase by adding a penalty for the number of predictors in the model. It therefore strikes a balance between the most parsimonious model and the best-fitting model. Generally, a large difference between your multiple and your adjusted R-squared indicates that you may have overfit your model.
Yes, we can increase the model accuracy, but this involves a lot of trial and error, or experimentation, with the features used in the model.
We will first draw Q-Q plots to check the normality of the input variables (and sometimes the output variables).
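As a quick illustration of the row-by-row prediction, the sketch below plugs waist values into the fitted intercept and slope reported in the summary. The helper name predict_at and the waist values are hypothetical, chosen only to show the arithmetic; predict() does the same thing from the model object.

```r
# Coefficients copied (rounded) from summary(firstlm).
b0 <- -215.9815   # intercept
b1 <- 3.4589      # slope for Waist

# Hypothetical helper: point prediction of AT for a given waist value.
predict_at <- function(waist) b0 + b1 * waist

predict_at(90)                # a single, made-up waist value
predict_at(c(80, 100, 110))   # vectorised over several made-up values
```

Only the intercept and slope enter the point prediction; the residual standard error tells us how far actual values typically fall from it.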
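The penalty can be written out explicitly. The sketch below uses the usual adjusted R-squared formula with n observations and p predictors; the function name is my own, and n = 109 is taken from the 107 residual degrees of freedom plus the two estimated coefficients.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size and predictor count.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adjusted_r2(0.67, n = 109, p = 1)   # compare with the Adjusted R-squared in summary(firstlm)

# With the same R-squared, more predictors mean a larger penalty:
adjusted_r2(0.67, n = 109, p = 10)
```

With one predictor and 109 observations the penalty is tiny, which is why the multiple and adjusted values in the summary are so close.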
qqplot(AT, Waist)
qqnorm(AT)
qqnorm(Waist)
The skewness values of these variables confirm that the data are right-skewed (positively skewed) and not normal.
Hence, we can now apply some basic transformations to the data to check whether we can increase the accuracy of the model. We will first use the logarithmic transformation, applied here to both the output AT and the input Waist.
# Logarithmic transformation
secondlm <- lm(log(AT) ~ log(Waist)) # Regression using logarithmic transformation
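To see how a normal Q-Q plot reacts to skewness, here is a small sketch on simulated data that adds a reference line with qqline(). The two samples are made up: one is drawn from a normal distribution and one from a right-skewed exponential distribution.

```r
# Simulated samples (illustrative only).
set.seed(5)
normalish <- rnorm(200)   # roughly normal
skewed    <- rexp(200)    # right-skewed

op <- par(mfrow = c(1, 2))   # two plots side by side
qqnorm(normalish, main = "Roughly normal")
qqline(normalish)            # points should hug the line
qqnorm(skewed, main = "Right-skewed")
qqline(skewed)               # points bend away from the line in the upper tail
par(op)                      # restore the previous plot layout
```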
summary(secondlm)
##
## Call:
## lm(formula = log(AT) ~ log(Waist))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96388 -0.21762 0.01988 0.21214 0.79811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.4607 0.9820 -12.69 <2e-16 ***
## log(Waist) 3.7476 0.2176 17.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3358 on 107 degrees of freedom
## Multiple R-squared: 0.7348, Adjusted R-squared: 0.7324
## F-statistic: 296.5 on 1 and 107 DF, p-value: < 2.2e-16
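As the legends say, experiments do yield good results: the R-squared has increased by over 6 percentage points, from 0.67 to 0.73. Note that the second model's R-squared is measured on the log scale of AT, so the two values are not strictly comparable.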
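Since the second model predicts log(AT) from log(Waist), a prediction has to be exponentiated to get back to the original AT scale. The sketch below does this with the coefficients from the summary; the helper name and the waist value are hypothetical, and this is only a naive back-transformation that ignores the bias introduced by exponentiating a prediction made on the log scale.

```r
# Coefficients copied (rounded) from summary(secondlm).
b0_log <- -12.4607   # intercept on the log scale
b1_log <- 3.7476     # slope for log(Waist)

# Hypothetical helper: naive back-transformed prediction of AT.
predict_at_loglog <- function(waist) {
  exp(b0_log + b1_log * log(waist))
}

predict_at_loglog(90)   # a made-up waist value, prediction on the original AT scale
```

In a log-log model the slope can be read as an elasticity: a 1% increase in Waist goes with roughly a 3.7% increase in the predicted AT.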