Introduction
A simple example of regression is predicting a value y when a value x is known. To do this, we need to establish the relationship between x and y.

The steps to establish this relationship are as follows:

Problem Definition
Check whether the dataset below shows a good correlation and whether it can be used for a regression model.
If yes, predict the weight for the following heights: 160, 170, 180.

Dataset

# height in cms
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131, 153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
# weight in kgs
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48, 65, 84, 59, 93, 49, 55, 79, 75, 66, 49)

Data Location
The data mentioned above is in the R Markdown file given as an assignment.

Data Description
The data was given in an R Markdown file as an assignment.
A data frame with 20 observations on 2 variables.
[, 1] Height in cms
[, 2] Weight in kgs

Setup

library(ggplot2)
library(corrgram)
library(gridExtra)

Dataset

# Build the data frame with named columns directly
dfrModel <- data.frame(Height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131, 153, 177, 148, 189, 138, 146, 199, 167, 153, 130),
                       Weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48, 65, 84, 59, 93, 49, 55, 79, 75, 66, 49))
head(dfrModel)
##   Height Weight
## 1    151     63
## 2    174     81
## 3    138     56
## 4    186     91
## 5    128     47
## 6    136     57

Exploratory Analysis

summary(dfrModel$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   128.0   138.0   152.5   156.9   174.8   199.0
summary(dfrModel$Weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   47.00   55.75   64.00   66.35   76.75   93.00
length(dfrModel$Height)
## [1] 20
length(dfrModel$Weight)
## [1] 20
# Flag outliers with the 1.5 * IQR rule
# First and third quartiles of Height
Height.qnt <- quantile(dfrModel$Height, probs=c(.25, .75))
# Fence width: 1.5 times the interquartile range
Height.max <- 1.5 * IQR(dfrModel$Height)
Height.out <- dfrModel$Height
Height.out[dfrModel$Height < (Height.qnt[1] - Height.max)] <- NA
Height.out[dfrModel$Height > (Height.qnt[2] + Height.max)] <- NA
print(dfrModel$Height)
##  [1] 151 174 138 186 128 136 179 163 152 131 153 177 148 189 138 146 199
## [18] 167 153 130
#print(Height.out)
print(dfrModel$Height[is.na(Height.out)])
## numeric(0)
Weight.qnt <- quantile(dfrModel$Weight, probs=c(.25, .75))
Weight.max <- 1.5 * IQR(dfrModel$Weight)
Weight.out <- dfrModel$Weight
Weight.out[dfrModel$Weight < (Weight.qnt[1] - Weight.max)] <- NA
Weight.out[dfrModel$Weight > (Weight.qnt[2] + Weight.max)] <- NA
print(dfrModel$Weight)
##  [1] 63 81 56 91 47 57 76 72 62 48 65 84 59 93 49 55 79 75 66 49
#print(Weight.out)
print(dfrModel$Weight[is.na(Weight.out)])
## numeric(0)
# Check outliers in Weight
WeightPlot <- ggplot(dfrModel, aes(x="", y=Weight)) +
            geom_boxplot(color="green") +
            labs(title="Weight Outliers")
WeightPlot

# Check outliers in Height
HeightPlot <- ggplot(dfrModel, aes(x="", y=Height)) +
            geom_boxplot(color="blue") +
            labs(title="Height Outliers")

HeightPlot

grid.arrange(WeightPlot, HeightPlot, nrow=1, ncol=2)

Observation
No outliers are present in Height or Weight under the 1.5 * IQR rule.
The outlier count is zero for both variables.
Therefore the dataset can be used for modelling as it is.
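
The outlier check above repeats the same steps for each variable. As a minimal sketch, the rule can be wrapped in a small function. The name find_outliers and its parameter k are hypothetical and not part of the original assignment; the data is repeated so the block runs on its own.

```r
# Hypothetical helper (not in the original analysis): returns the values of v
# that lie below Q1 - k * IQR or above Q3 + k * IQR.
find_outliers <- function(v, k = 1.5) {
  qnt <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * IQR(v)
  v[v < qnt[1] - fence | v > qnt[2] + fence]
}

# Same data as in the Dataset section
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
          153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
          65, 84, 59, 93, 49, 55, 79, 75, 66, 49)

height_outliers <- find_outliers(hght)
weight_outliers <- find_outliers(wght)

# Both counts are expected to be zero, matching numeric(0) above
print(length(height_outliers))
print(length(weight_outliers))
```

An empty result for both variables corresponds to the numeric(0) output shown earlier.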

Correlation

# correlation coefficient
#cor(dfrModel$Weight, dfrModel$Height)
#cor(dfrModel$Height, dfrModel$Weight, method = c("pearson", "kendall", "spearman"))
# correlation test
cor.test(dfrModel$Height, dfrModel$Weight)
## 
##  Pearson's product-moment correlation
## 
## data:  dfrModel$Height and dfrModel$Weight
## t = 12.215, df = 18, p-value = 3.788e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8627911 0.9782375
## sample estimates:
##      cor 
## 0.944644
#cor.test(dfrModel$Height, dfrModel$Weight, method=c("pearson", "kendall", "spearman")) 

Observation
A strong positive correlation is observed between Height and Weight (r = 0.944644, p-value = 3.788e-10).
This means that as Height increases, Weight tends to increase.
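
To make the numbers reported by cor.test() less of a black box, the sketch below recomputes Pearson's r and the t statistic by hand. The variable names r_manual, r_builtin and t_stat are illustrative only, and the data is repeated so the block is self-contained.

```r
# Same data as in the Dataset section
hght <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
          153, 177, 148, 189, 138, 146, 199, 167, 153, 130)
wght <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
          65, 84, 59, 93, 49, 55, 79, 75, 66, 49)

# Pearson's r is the covariance scaled by both standard deviations
r_manual  <- cov(hght, wght) / (sd(hght) * sd(wght))
r_builtin <- cor(hght, wght)

# t statistic for H0: true correlation is 0, on n - 2 degrees of freedom
n <- length(hght)
t_stat <- r_manual * sqrt((n - 2) / (1 - r_manual^2))

print(c(r_manual = r_manual, r_builtin = r_builtin, t = t_stat))
```

The two correlation values should agree with each other and with the cor.test() output above, and t_stat should agree with the reported t on 18 degrees of freedom.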

Correlation Visualization

# Visualize correlation
plot(dfrModel)

Correlation Visualization

# Visualize correlation
pairs(dfrModel)

# Visualize correlation
# http://www.statmethods.net/advgraphs/correlograms.html
corrgram(dfrModel)

Observation
A strong positive correlation is observed between Weight and Height.

#ggplot
ggplot(dfrModel, aes(x=Height, y=Weight)) +
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method='lm', formula=y~x) + 
    labs(title="Height and Weight Regression") +
    labs(x="Height in cms") +
    labs(y="Weight in kgs")

Observation
It is seen that as Height increases, Weight increases.

Linear Model

x <- dfrModel$Height
y <- dfrModel$Weight
slmModel <- lm(y~x) 
# y ~ x models y (Weight) as a linear function of x (Height)

Observation
No errors. Model successfully created.
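
For a single predictor, the least-squares line has a closed form, which gives a quick cross-check of what lm() estimates. The sketch below is an illustration only: it uses the formula interface with a data argument (Weight ~ Height) instead of the x and y vectors used above, and the names slope, intercept and fit are assumptions of this example.

```r
# Same data as in the Dataset section
dfrModel <- data.frame(
  Height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
             153, 177, 148, 189, 138, 146, 199, 167, 153, 130),
  Weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
             65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
)

# Closed-form least-squares estimates for one predictor
slope     <- cov(dfrModel$Height, dfrModel$Weight) / var(dfrModel$Height)
intercept <- mean(dfrModel$Weight) - slope * mean(dfrModel$Height)

# The same model fitted with lm()
fit <- lm(Weight ~ Height, data = dfrModel)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print statements should show the same pair of coefficients.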

Show Model

# Print summary
summary(slmModel)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1573  -1.7267   0.7701   2.6045   6.2102 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.55669    8.25032  -4.067 0.000723 ***
## x             0.63675    0.05213  12.215 3.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.846 on 18 degrees of freedom
## Multiple R-squared:  0.8924, Adjusted R-squared:  0.8864 
## F-statistic: 149.2 on 1 and 18 DF,  p-value: 3.788e-10

Observation
R-squared: Should be as close to 1 as possible.
An R-squared of more than 0.75 (75%) is good; here it is 0.8924.

P-value: Should be less than 0.05.
The p-value of Height (x) is 3.79e-10, which is well below 0.05.
The model is acceptable and we can use it for predictive analytics.
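
The two acceptance checks above can also be read from the summary object in code, which avoids copying numbers from the printed output. This is a minimal sketch: the thresholds 0.75 and 0.05 are the ones stated in this observation, and the names fit, s, r_squared and p_value are illustrative.

```r
# Same data as in the Dataset section
dfrModel <- data.frame(
  Height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
             153, 177, 148, 189, 138, 146, 199, 167, 153, 130),
  Weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
             65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
)

fit <- lm(Weight ~ Height, data = dfrModel)
s   <- summary(fit)

# R-squared and the p-value of the Height coefficient
r_squared <- s$r.squared
p_value   <- s$coefficients["Height", "Pr(>|t|)"]

# With one predictor, R-squared equals the squared Pearson correlation
r_squared_from_cor <- cor(dfrModel$Height, dfrModel$Weight)^2

print(c(r_squared = r_squared, from_cor = r_squared_from_cor, p_value = p_value))
print(r_squared > 0.75 && p_value < 0.05)
```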

Test Data

# Predict Weight for the following heights 160, 170, 180  
dfrTest <- data.frame(x=c(160, 170, 180 ))
#names(dfrTest) <- c("x")
dfrTest 
##     x
## 1 160
## 2 170
## 3 180

Observation
Test Data successfully created.

Predict

result <- predict(slmModel, dfrTest)
# slmModel is the simple linear model fitted above
print(result)
##        1        2        3 
## 68.32394 74.69148 81.05902

Observation
The predictions are in line with expectations.
The predicted values show that as height increases, weight increases.
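
A point prediction says nothing about its uncertainty. As a hedged sketch, the code below reproduces the predictions from the fitted coefficients and adds 95% prediction intervals through the interval argument of predict(). It refits the model with the formula Weight ~ Height so that the new data can use the column name Height; the names fit, pred, manual and in_range are assumptions of this example.

```r
# Same data as in the Dataset section
dfrModel <- data.frame(
  Height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131,
             153, 177, 148, 189, 138, 146, 199, 167, 153, 130),
  Weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48,
             65, 84, 59, 93, 49, 55, 79, 75, 66, 49)
)
fit <- lm(Weight ~ Height, data = dfrModel)

# New heights; the column name must match the predictor in the formula
dfrTest <- data.frame(Height = c(160, 170, 180))

# Check that the new heights lie inside the observed range (no extrapolation)
in_range <- all(dfrTest$Height >= min(dfrModel$Height) &
                dfrTest$Height <= max(dfrModel$Height))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(fit, newdata = dfrTest, interval = "prediction", level = 0.95)

# The same point predictions computed from the coefficients
manual <- unname(coef(fit)[1] + coef(fit)[2] * dfrTest$Height)

print(in_range)
print(cbind(dfrTest, pred))
```

The fit column should match the values printed above, and the lwr and upr columns give the range in which an individual's weight is expected to fall at each height.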