Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line known as the regression line. The equation of this regression line can then be used to predict the value of Y for any given X.
Dependent variable (target): continuous
Independent variable(s) (predictor(s)): continuous or discrete
Simple linear regression involves one target (Y) and one predictor (X). This demo performs simple linear regression using the least squares method to find the regression line that shows the trend in the data, i.e. the relationship between X and Y. The equation of the regression line in slope-intercept form is:
Y = mX + c
where m = slope of the line and c = Y-intercept
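For reference, the least squares method chooses m and c so that the sum of squared vertical distances between the observed points and the line is as small as possible. A sketch of the standard closed-form solution, assuming n observations (x_i, y_i) with sample means \bar{x} and \bar{y} (this notation is introduced here and is not part of the original demo):

```latex
\min_{m,\,c} \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2
\quad\Longrightarrow\quad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
```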
library(datasets) # the datasets package ships with base R and provides iris
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Since simple linear regression requires just one target and one predictor, let’s take the “Sepal.Width” attribute as our target (Y) and the “Sepal.Length” attribute as our predictor (X) to find out whether any relationship exists between them.
Y<- iris[,"Sepal.Width"] # select Target attribute
X<- iris[,"Sepal.Length"] # select Predictor attribute
head(X)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
head(Y)
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
xycorr<- cor(Y,X, method="pearson") # find Pearson correlation coefficient
xycorr # a value near +1 or -1 implies strong correlation; a value near 0 implies weak correlation
## [1] -0.1175698
plot(Y~X, col=X)
model1<- lm(Y~X)
model1 # provides regression line coefficients i.e. slope and y-intercept
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 3.41895 -0.06188
plot(Y~X, col=X) # scatter plot between X and Y
abline(model1, col="blue", lwd=3) # add regression line to scatter plot to see relationship between X and Y
The graph shows that the line slopes downward, so there is a negative correlation between X and Y. The correlation is weak, however (about -0.12), so Y tends to decrease only slightly, on average, as X increases.
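To gauge how strong the relationship behind this trend is, one option is to test the correlation and look at the share of variance the line explains. The sketch below assumes base R's `cor.test()` and `summary()`; the names `ct` and `r2` are introduced here for illustration and are not part of the original demo.

```r
# Rebuild the variables so this block runs on its own
Y <- iris[, "Sepal.Width"]   # target
X <- iris[, "Sepal.Length"]  # predictor
model1 <- lm(Y ~ X)

# Test whether the Pearson correlation differs from zero
ct <- cor.test(Y, X, method = "pearson")
ct$estimate  # correlation coefficient
ct$p.value   # a large p-value means the data are consistent with no linear relationship

# R-squared: fraction of the variance in Y explained by the line
r2 <- summary(model1)$r.squared
r2
```

In simple linear regression, R-squared equals the squared correlation coefficient, so a correlation of about -0.12 corresponds to an R-squared of roughly 0.014.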
U<- iris[,"Petal.Width"] # select Target
V<- iris[,"Petal.Length"] # select Predictor
xycorr<- cor(U,V, method="pearson")
xycorr
## [1] 0.9628654
plot(U~V, col=V)
model2<- lm(U~V)
model2
##
## Call:
## lm(formula = U ~ V)
##
## Coefficients:
## (Intercept) V
## -0.3631 0.4158
plot(U~V, col=V) # scatter plot between U and V
abline(model2, col="blue", lwd=3) # add regression line to scatter plot to see relationship between U and V
The above graph shows that the line slopes upward, so there is a positive correlation between U and V. With a coefficient of about 0.96 the relationship is strong: as V increases, U tends to increase as well, and vice versa.
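The coefficients reported by `lm()` can be reproduced by hand. A minimal sketch, assuming the standard closed-form least-squares solution (slope = covariance of predictor and target divided by the variance of the predictor; intercept = mean of the target minus slope times mean of the predictor). The names `m_hat` and `c_hat` are illustrative and not part of the original demo.

```r
U <- iris[, "Petal.Width"]   # target
V <- iris[, "Petal.Length"]  # predictor

m_hat <- cov(V, U) / var(V)         # slope
c_hat <- mean(U) - m_hat * mean(V)  # Y-intercept

c(intercept = c_hat, slope = m_hat)

# Compare with the coefficients fitted by lm()
coef(lm(U ~ V))
```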
Now, let’s use the line coefficients from model1 and model2 to predict the value of the target for a given value of the predictor.
# Prediction of 'Sepal.Width' when 'Sepal.Length'= 20
p1<- predict(model1,data.frame("X"=20))
p1
## 1
## 2.181251
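The predicted value of Sepal.Width is 2.1812509 when Sepal.Length = 20. Note that 20 lies far outside the Sepal.Length values observed in iris (4.3 to 7.9), so this is an extrapolation and should be treated with caution.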
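As a cross-check, the same prediction follows from plugging X = 20 into Y = mX + c with the fitted coefficients. A short sketch; `coefs` and `manual_p1` are illustrative names, not part of the original demo.

```r
Y <- iris[, "Sepal.Width"]   # target
X <- iris[, "Sepal.Length"]  # predictor
model1 <- lm(Y ~ X)

coefs <- coef(model1)  # named vector with entries "(Intercept)" and "X"
manual_p1 <- coefs[["(Intercept)"]] + coefs[["X"]] * 20
manual_p1

# Should agree with predict()
predict(model1, data.frame(X = 20))
```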
# Prediction of 'Petal.Width' when 'Petal.Length'= 15
p2<- predict(model2,data.frame("V"=15))
p2
## 1
## 5.873256
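The predicted value of Petal.Width is 5.8732557 when Petal.Length = 15. As before, 15 lies well outside the observed Petal.Length values (1.0 to 6.9), so this prediction is an extrapolation.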
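A point prediction says nothing about how uncertain it is. The sketch below, assuming base R's `range()` and the `interval` argument of `predict()`, checks whether the requested predictor value lies inside the observed range and reports a 95% prediction interval. The names `new_v`, `in_range`, and `p2_int` are illustrative and not part of the original demo.

```r
U <- iris[, "Petal.Width"]   # target
V <- iris[, "Petal.Length"]  # predictor
model2 <- lm(U ~ V)

new_v <- 15
range(V)  # observed predictor range
in_range <- new_v >= min(V) && new_v <= max(V)
in_range  # FALSE means the prediction is an extrapolation

# Point prediction plus lower and upper 95% prediction bounds
p2_int <- predict(model2, data.frame(V = new_v),
                  interval = "prediction", level = 0.95)
p2_int
```

The interval only reflects uncertainty under the assumption that the linear trend continues to hold at the new value, which the data cannot confirm outside the observed range.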
Good luck! :)