Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line known as the regression line. The equation of this regression line can then be used to predict the value of ‘Y’ for any given ‘X’.
Dependent Variable (Target): Continuous
Independent Variable (Predictor(s)): Continuous/Discrete
Simple linear regression involves one target (Y) and one predictor (X). This demo performs simple linear regression using the least squares method to find the regression line that shows the trend in the data, i.e. the relationship between X and Y. The equation of the regression line in slope-intercept form is:
Y = mX + c, where
m = slope of the straight line
c = Y-intercept
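For a single predictor, the least squares estimates have a closed form: m = cov(X, Y) / var(X) and c = mean(Y) - m * mean(X). The following is a minimal sketch that checks this against lm(). The vectors `x` and `y` are made-up illustrative values, not taken from the airquality dataset, and the name `c0` is used only to avoid masking R's `c()` function.

```r
# Illustrative data (hypothetical values, not from the airquality dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for Y = mX + c
m  <- cov(x, y) / var(x)        # slope
c0 <- mean(y) - m * mean(x)     # Y-intercept

# The same estimates obtained from lm()
fit <- lm(y ~ x)
coef(fit)                       # (Intercept) and x should match c0 and m
c(intercept = c0, slope = m)
```

Both lines of output should show the same pair of numbers, which is what "best fit" means here: lm() picks the m and c that minimise the sum of squared vertical distances from the points to the line.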
This demo uses the airquality dataset that ships with R. Details about the dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html
require("datasets")
data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Let’s begin by finding which attributes have missing values. We then need to impute those missing values (NA), which we will do simply by replacing each NA with the average for its month.
col1<- mapply(anyNA,airquality) # apply function anyNA() on all columns of airquality dataset
col1
## Ozone Solar.R Wind Temp Month Day
## TRUE TRUE FALSE FALSE FALSE FALSE
The output shows that only the Ozone and Solar.R attributes contain NA, i.e. missing values.
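anyNA() only says whether a column has missing values, not how many. As a small complementary sketch, the counts per column can be obtained with colSums() and is.na(). The name `aq` is just an illustrative fresh copy of the dataset, so this does not interfere with the steps that follow.

```r
library(datasets)

aq <- datasets::airquality        # fresh, unmodified copy of the dataset
na_counts <- colSums(is.na(aq))   # number of NA entries in each column
na_counts
```

Knowing the counts helps judge how much of each column will end up being imputed rather than observed.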
for (i in seq_len(nrow(airquality))) {
  # Impute monthly mean in Ozone
  if (is.na(airquality[i, "Ozone"])) {
    airquality[i, "Ozone"] <- mean(airquality[which(airquality[, "Month"] == airquality[i, "Month"]), "Ozone"], na.rm = TRUE)
  }
  # Impute monthly mean in Solar.R
  if (is.na(airquality[i, "Solar.R"])) {
    airquality[i, "Solar.R"] <- mean(airquality[which(airquality[, "Month"] == airquality[i, "Month"]), "Solar.R"], na.rm = TRUE)
  }
}
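The same monthly-mean imputation can also be written without an explicit row loop. The sketch below is one possible alternative using ave(), which applies a function within each group. The helper name `impute_group_mean` and the copy `aq` are hypothetical names introduced for illustration only.

```r
aq <- datasets::airquality   # fresh copy, so the original data frame is untouched

# Replace each NA with the mean of the non-missing values in the same group
# (hypothetical helper, not part of any package)
impute_group_mean <- function(x, g) {
  ave(as.numeric(x), g, FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
}

aq$Ozone   <- impute_group_mean(aq$Ozone, aq$Month)
aq$Solar.R <- impute_group_mean(aq$Solar.R, aq$Month)

anyNA(aq)   # checks whether any missing values remain
```

Filling a gap with the group mean leaves that group's mean unchanged, so the order in which rows are filled does not matter.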
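# Rescale the dataset to the 0-1 range using min-max normalization.
# Note: normalize() is applied to the whole data frame at once, so min(x) and max(x)
# are the smallest and largest values across all columns (1 and 334 here),
# and every attribute is shifted and scaled by the same constants.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
airquality <- normalize(airquality) # replace contents of dataset with rescaled values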
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : num 0.1201 0.1051 0.033 0.0511 0.0679 ...
## $ Solar.R: num 0.568 0.351 0.444 0.937 0.541 ...
## $ Wind : num 0.0192 0.021 0.0348 0.0315 0.0399 ...
## $ Temp : num 0.198 0.213 0.219 0.183 0.165 ...
## $ Month : num 0.012 0.012 0.012 0.012 0.012 ...
## $ Day : num 0 0.003 0.00601 0.00901 0.01201 ...
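In the output above, Month and Day are squeezed close to 0 because normalize() was given the whole data frame, so a single minimum and maximum were used for every column. If the goal is for each attribute to span 0 to 1 on its own, one option is to apply the function column by column. The sketch below contrasts the two on a small made-up data frame (`df` and its values are hypothetical).

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Small made-up data frame with very different column ranges
df <- data.frame(a = c(1, 2, 3), b = c(100, 150, 200))

whole  <- normalize(df)                          # one global min (1) and max (200)
by_col <- as.data.frame(lapply(df, normalize))   # each column uses its own min and max

range(whole$a)    # column a stays squeezed near 0
range(by_col$a)   # column a now spans 0 to 1
```

For simple linear regression the choice only changes the units of the coefficients, not the fitted relationship, but it matters when reading or comparing the numbers.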
Yay! Our dataset no longer has missing values. We will now perform linear regression on it!
Since simple linear regression uses just one target and one predictor, let’s take the “Ozone” attribute as our target (Y) and the “Solar.R” attribute as the predictor (X) to find out whether there is any relationship between them.
Y<- airquality[,"Ozone"] # select Target attribute
X<- airquality[,"Solar.R"] # select Predictor attribute
model1<- lm(Y~X)
model1 # provides regression line coefficients i.e. slope and y-intercept
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 0.06509 0.09849
plot(Y~X) # scatter plot between X and Y
abline(model1, col="blue", lwd=3) # add regression line to scatter plot to see relationship between X and Y
The above graph shows that the regression line slopes upwards (slope ≈ 0.098), hence there is a positive correlation between ‘Ozone’ and ‘Solar.R’. In other words, higher values of X tend to go with higher values of Y, and vice versa.
We will now perform linear regression to find the relationship of “Ozone” with “Wind”.
Y<- airquality[,"Ozone"] # select Target attribute
X<- airquality[,"Wind"] # select Predictor attribute
model2<- lm(Y~X)
model2 # provides regression line coefficients i.e. slope and y-intercept
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 0.2364 -4.3410
plot(Y~X) # scatter plot between X and Y
abline(model2, col="blue", lwd=3) # add regression line to scatter plot to see relationship between X and Y
The above graph shows that the regression line slopes downwards (slope ≈ -4.34), hence there is a negative correlation between ‘Ozone’ and ‘Wind’. In other words, higher values of X tend to go with lower values of Y, and vice versa.
From the above two graphs we can conclude that “Solar.R” is positively related to “Ozone” whereas “Wind” is negatively related to it.
Now, let’s use the line coefficients from model1 and model2 to predict the value of the target for any given value of the predictor.
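The sign of the slope gives the direction of each relationship, but not how strong it is. A rough way to check both is the correlation coefficient together with R-squared from summary(). The sketch below makes a different assumption from the walkthrough: it works on the raw data and drops incomplete rows instead of imputing them, so the numbers will not exactly match the models above. The names `aq`, `fit_solar` and `fit_wind` are illustrative.

```r
aq <- datasets::airquality   # raw data; rows with NA are dropped below, not imputed

# Correlation of Ozone with each predictor, using complete pairs only
r_solar <- cor(aq$Ozone, aq$Solar.R, use = "complete.obs")
r_wind  <- cor(aq$Ozone, aq$Wind,    use = "complete.obs")

# lm() omits rows with NA in the variables it uses by default
fit_solar <- lm(Ozone ~ Solar.R, data = aq)
fit_wind  <- lm(Ozone ~ Wind, data = aq)

c(solar = r_solar, wind = r_wind)
c(solar = summary(fit_solar)$r.squared, wind = summary(fit_wind)$r.squared)
```

With a single predictor, R-squared is the square of the correlation, so the second line of output is simply the first one squared.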
# Prediction of 'Ozone' when X (rescaled 'Solar.R') = 10
p1 <- predict(model1, data.frame(X = 10))
p1
## 1
## 1.049993
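The predicted value of “Ozone” is 1.0499933 when X = 10. Keep in mind that model1 was fitted on the rescaled data, so both numbers are on the rescaled scale: X = 10 is not a raw “Solar.R” reading of 10, and it lies well outside the rescaled values in the data, which do not exceed 1.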
# Prediction of 'Ozone' when X (rescaled 'Wind') = 5
p2 <- predict(model2, data.frame(X = 5))
p2
## 1
## -21.46849
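The predicted value of “Ozone” is -21.4684949 when X = 5. Again, this is on the rescaled scale and far outside the range of the data (rescaled values do not exceed 1), which is why the prediction comes out negative. It should not be read as an ozone level at a raw wind speed of 5.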
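To get a prediction that can be read in the original units, one option is to fit the model on unscaled data and name the predictor column explicitly, so that the new data passed to predict() uses the same name. The following is a sketch under those assumptions, not the procedure used above. It repeats the monthly-mean imputation on a fresh copy, and `aq`, `fit_wind` and `new_wind` are illustrative names.

```r
aq <- datasets::airquality   # fresh copy in original units

# Same idea as the walkthrough: fill each NA with the mean for its month
for (col in c("Ozone", "Solar.R")) {
  month_mean <- ave(as.numeric(aq[[col]]), aq$Month,
                    FUN = function(v) mean(v, na.rm = TRUE))
  aq[[col]] <- ifelse(is.na(aq[[col]]), month_mean, aq[[col]])
}

fit_wind <- lm(Ozone ~ Wind, data = aq)   # fitted on raw units, no rescaling

new_wind <- data.frame(Wind = 5)          # column name must match the predictor
p_point  <- predict(fit_wind, newdata = new_wind)
p_int    <- predict(fit_wind, newdata = new_wind, interval = "prediction")

p_point   # point prediction of Ozone at Wind = 5
p_int     # the same fit with lower and upper bounds of a 95% prediction interval
```

The prediction interval is a reminder that a single predicted number hides a fair amount of scatter around the regression line.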
You may also wish to try out data classification, clustering or linear regression from the following links:
k-NN Classification for beginners, Using Airquality Dataset
k-means Clustering for beginners, Using Airquality Dataset
Linear Regression for beginners
Good luck! :)