A simple example of regression is predicting a value y when a value x is known. To do this, we need a model of the relationship between x and y.
The steps to build that relationship are illustrated with the following problem:
Check whether height and weight in the dataset below are well correlated and whether the data can be used for a regression model.
Height in cm: 151, 174, 138, 186, 128, 136, 179, 163, 152, 131, 153, 177, 148, 189, 138, 146, 199, 167, 153, 130
Weight in kg: 63, 81, 56, 91, 47, 57, 76, 72, 62, 48, 65, 84, 59, 93, 49, 55, 79, 75, 66, 49
If so, predict the weight for the following heights: 160, 170, 180 cm.
Setup
library(ggplot2)
library(corrgram)
library(gridExtra)
Dataset
# height in cms
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131, 153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
# weight in kgs
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48, 65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
dfrModel <- data.frame(hght, wght)
dfrModel
##    hght wght
## 1   151   63
## 2   174   81
## 3   138   56
## 4   186   91
## 5   128   47
## 6   136   57
## 7   179   76
## 8   163   72
## 9   152   62
## 10  131   48
## 11  153   65
## 12  177   84
## 13  148   59
## 14  189   93
## 15  138   49
## 16  146   55
## 17  199   79
## 18  167   75
## 19  153   66
## 20  130   49
Exploratory Analysis
# summary statistics for hght & wght
summary(dfrModel$hght)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   128.0   138.0   152.5   156.9   174.8   199.0
summary(dfrModel$wght)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   47.00   55.75   64.00   66.35   76.75   93.00
# ?quantile()
wght.qnt <- quantile(dfrModel$wght, probs=c(.25, .75))
# ?IQR()
wght.max <- 1.5 * IQR(dfrModel$wght)
wght.out <- dfrModel$wght
wght.out[dfrModel$wght < (wght.qnt[1] - wght.max)] <- NA
wght.out[dfrModel$wght > (wght.qnt[2] + wght.max)] <- NA
#print(dfrModel$wght)
#print(wght.out)
print(dfrModel$wght[is.na(wght.out)])
## numeric(0)
hght.qnt <- quantile(dfrModel$hght, probs=c(.25, .75))
hght.max <- 1.5 * IQR(dfrModel$hght)
hght.out <- dfrModel$hght
hght.out[dfrModel$hght < (hght.qnt[1] - hght.max)] <- NA
hght.out[dfrModel$hght > (hght.qnt[2] + hght.max)] <- NA
#print(dfrModel$hght)
#print(hght.out)
print(dfrModel$hght[is.na(hght.out)])
## numeric(0)
# check outliers in wght
wghtPlot <- ggplot(dfrModel, aes(x="", y=wght)) +
geom_boxplot(aes(fill=wght), color="green") +
labs(title="Weight Outliers")
# check outliers in hght
hghtPlot <- ggplot(dfrModel, aes(x="", y=hght)) +
geom_boxplot(aes(fill=hght), color="blue") +
labs(title="Height Outliers")
# show plot
grid.arrange(hghtPlot, wghtPlot, nrow=1, ncol=2)
There are no outliers in either weight or height.
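The quantile-based check used above can be wrapped in a small reusable function. The following is a minimal sketch; `iqr_outliers` and its parameter `k` are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Height (cm) and weight (kg) from the dataset above
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
          153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
          65, 84, 59, 93, 49, 55, 79, 75, 66, 49)

# Hypothetical helper: return the values of v that lie outside the
# fences Q1 - k * IQR and Q3 + k * IQR (k = 1.5 as in the text)
iqr_outliers <- function(v, k = 1.5) {
  qnt <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  lim <- k * IQR(v)
  v[v < qnt[1] - lim | v > qnt[2] + lim]
}

iqr_outliers(hght)   # numeric(0)
iqr_outliers(wght)   # numeric(0)
```

For weight, the quartiles 55.75 and 76.75 give an IQR of 21, so the fences are 24.25 and 108.25 kg; every observed weight lies inside them.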
Correlation
# correlation coefficient
cor(dfrModel$hght, dfrModel$wght)
## [1] 0.944644
#cor(x, y, method = c("pearson", "kendall", "spearman"))
# correlation test
cor.test(dfrModel$hght, dfrModel$wght)
##
## Pearson's product-moment correlation
##
## data: dfrModel$hght and dfrModel$wght
## t = 12.215, df = 18, p-value = 3.788e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8627911 0.9782375
## sample estimates:
## cor
## 0.944644
#cor.test(x, y, method=c("pearson", "kendall", "spearman"))
Observation
A strong positive correlation (r ≈ 0.94) is observed between wght and hght.
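To see where the numbers reported by `cor()` and `cor.test()` come from, the coefficient and its test statistic can be reproduced from their definitions. This is a sketch for illustration; the variable names `r`, `t_stat` and `p_val` are introduced here and do not appear in the original analysis.

```r
# Height (cm) and weight (kg) from the dataset above
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
          153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
          65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
n <- length(hght)

# Pearson's r: covariance scaled by both standard deviations
r <- cov(hght, wght) / (sd(hght) * sd(wght))

# t statistic for H0: true correlation is 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

round(r, 6)        # 0.944644, as reported by cor()
round(t_stat, 3)   # 12.215, as reported by cor.test()
```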
Correlation Visualization
# visualize correlation
#pairs(dfrModel)
plot(dfrModel)
# visualize correlation
# http://www.statmethods.net/advgraphs/correlograms.html
corrgram(dfrModel)
Observation
A strong positive correlation is observed between wght and hght.
Plot Graph
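# base chart: height on the x-axis, weight on the y-axis
plot(dfrModel$hght, dfrModel$wght, col="blue", main="Regression",
cex=1, pch=16, xlab="Height", ylab="Weight")
# add the fitted regression line after the plot has been drawn
abline(lm(wght ~ hght, data=dfrModel))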
# ggplot
ggplot(dfrModel, aes(x=hght, y=wght)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method='lm', formula=y~x) +
labs(title="Weight & Height Regression") +
labs(x="Height") +
labs(y="Weight")
As hght increases, wght also increases, which is consistent with the positive correlation.
Linear Model
x <- dfrModel$hght
y <- dfrModel$wght
slmModel <- lm(y~x)
No errors. Model successfully created.
Show Model
# print summary
summary(slmModel)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.1573  -1.7267   0.7701   2.6045   6.2102
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.55669    8.25032  -4.067 0.000723 ***
## x             0.63675    0.05213  12.215 3.79e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.846 on 18 degrees of freedom
## Multiple R-squared: 0.8924, Adjusted R-squared: 0.8864
## F-statistic: 149.2 on 1 and 18 DF, p-value: 3.788e-10
Observation
R-squared: should be as close to 1 as possible
R-squared is 0.8924, which is above the 0.75 threshold used here and indicates a good fit. The model explains about 89% of the variance in weight.
P-value: should be less than 0.05
P-Value of hght (x) is less than 0.05 …
The model is acceptable and can be used for prediction.
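The coefficient estimates and the R-squared value in the model summary can be checked against the closed-form expressions for simple linear regression. The sketch below assumes the same `x` and `y` vectors as the model; `b0`, `b1`, `fit`, `rss`, `tss` and `r2` are names introduced here for illustration.

```r
# Predictor (height, cm) and response (weight, kg) from the dataset above
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
       153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
       65, 84, 59, 93, 49, 55, 79, 75, 66, 49)

# Least-squares slope and intercept in closed form
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(intercept = b0, slope = b1)   # about -33.557 and 0.637

# R-squared as 1 - (residual sum of squares / total sum of squares)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2  <- 1 - rss / tss

# With a single predictor, R-squared equals the squared correlation
round(r2, 4)            # 0.8924
round(cor(x, y)^2, 4)   # 0.8924
```

This also ties the two observations together: the correlation of about 0.94 and the R-squared of about 0.89 are the same quantity on different scales.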
Test Data
# find wght of a person with height 160, 170, 180
dfrTest <- data.frame(x=c(160,170,180))
#names(dfrTest) <- c("x")
dfrTest
## x
## 1 160
## 2 170
## 3 180
Test Data successfully created.
Predict
result <- predict(slmModel, dfrTest)
print(result)
##        1        2        3
## 68.32394 74.69148 81.05902
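The predictions are along expected lines: the predicted weight rises by about 6.4 kg for each additional 10 cm of height, which matches the fitted slope.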
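The predicted values can be reproduced by hand from the fitted coefficients, and `predict()` can also report an interval around each prediction. The following is a sketch under the same setup as above; `manual` and `pred` are names introduced here, and the 95% level is an assumption chosen for illustration.

```r
# Rebuild the model and test data used in the text
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
       153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
       65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
slmModel <- lm(y ~ x)

# The column must be named x so that predict() can match it
# to the predictor in the model formula
dfrTest <- data.frame(x = c(160, 170, 180))

# Prediction by hand: intercept + slope * height
manual <- coef(slmModel)[["(Intercept)"]] + coef(slmModel)[["x"]] * dfrTest$x
round(manual, 5)   # 68.32394 74.69148 81.05902

# Same point predictions plus lower and upper prediction bounds;
# the result is a matrix with columns fit, lwr and upr
pred <- predict(slmModel, newdata = dfrTest,
                interval = "prediction", level = 0.95)
print(pred)
```

All three test heights lie inside the observed range of 128 to 199 cm, so the model is interpolating rather than extrapolating.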
There are no outliers in the data.
There is a strong correlation (r ≈ 0.94) between the two variables.
In the correlation plot, weight increases as height increases.
The model was created without errors and was tested successfully on the given dataset.
For heights of 160, 170, and 180 cm, the predicted weights are
68.32394, 74.69148, and 81.05902 kg.