Trees - Predictive Modelling of Volume of Cheery Trees

#INTRODUCTION 
#Trees dataset is used to do predictive modelling. This is a very simple linear regression model used to give an idea of how it works. For this purpose, we have used a dataset called trees. 

#Data
#Trees data is an inbuilt dataset of rstudio. This data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground. 

#This dataset contains 31 observations and 3 variables details of which have been given below.

#[,1]   Girth      numeric   Tree diameter in inches
#[,2]   Height   numeric     Height in ft
#[,3]   Volume   numeric     Volume of timber in cubic ft

#Business Purpose: To investigate the relationship between three variables and checking the significance of variables responsible for growth. Exploring data and building a fitted regressed model. Trees' volume will be the dependent variable and we will check out for the significant variables making a dominating impact on the trees' growth.

#Since data is already prepared so, let's dive into the Exploration part.
data(trees)
dataset <- trees
str(dataset) #All data types are in correct forms.

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

#A data analyst must know when to use which techniques. For e.g. histograms and boxplots aren't needed here. We can use it here but they won't provide us with meaningful insights. Main concern is the relation amongst the variables and not the individual stats and descriptive measures. We will use relevant plots and statistics related to correlation.

#Correlation

library(car)

## Loading required package: carData

scatterplotMatrix(~Girth+Volume+Height, dataset)

cor(dataset, method="spearman")

##            Girth    Height    Volume
## Girth  1.0000000 0.4408387 0.9547151
## Height 0.4408387 1.0000000 0.5787101
## Volume 0.9547151 0.5787101 1.0000000

#Statistical and visualisation depicts that there is a strong correlation between the girth and volume of the trees. 
#A better analyst tries to draw plots having diverse information with less coding. For e.g. above two step coding can be done through one code given below.


library(GGally)

## Loading required package: ggplot2

ggpairs(dataset, columns = 1:3, title="Trees")

#But the question is - is this relation significant ? It means if I take any cherry tree at random will this correlation hold tree or is it just a random case ? 

#But before using any test we must know about the nature of data distribution in two variables. It can be done through visual methods(qqplpots, qqlines, histograms, boxplots etc.) and/or tests(shapiro.test()), and most importantly two concerned variables must have a linear relation.

#Is covariance between Girth and Volume linear?
library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v dplyr   0.7.6
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0

## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()

ggplot(dataset, aes(Girth, Volume)) + geom_point()+geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#smooth isn't perfectly straight but we can assume linearity as it is almost linear. May be one of them isn't linear ?

#Is data normally distributed ? - shapiro.test() and boxplot
library(lattice)
library(e1071)
bwplot(dataset$Girth)  #median value is in middle so it should be normally distributed.

shapiro.test(dataset$Girth) #p>0.05 so data can be assumed to be normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$Girth
## W = 0.94117, p-value = 0.08893

bwplot(dataset$Volume) #median valus is skewed so it isn't normally distributed.

shapiro.test(dataset$Volume) #p<0.05 so data isn't normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$Volume
## W = 0.88757, p-value = 0.003579

c(skewness(dataset$Volume), kurtosis(dataset$Volume)) #Also skeewness and kurtosis aren't equal to 0 and 3 respectively for them being normal.

## [1] 1.0132739 0.2460393

#So, Volume isn't normally distributed but Girth is. So, for conducting hypothesis tests on non-normal distribution we will use non-parametric tests such as cor.test(). So let's use it.

cor.test(dataset$Girth, dataset$Volume,method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  dataset$Girth and dataset$Volume
## t = 20.478, df = 29, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9322519 0.9841887
## sample estimates:
##       cor 
## 0.9671194

#p-value < 2.2e-16, thus, correlation has to be significant.

#Conclusion: In our regressed model synthesis, volume must be there.

#Model Synthesis: let's build our model step by step starting from data-preprocessing.

#Data-preprocessing - formulation of training and testing dataset

library("caTools")
split <- sample.split(dataset$Volume, SplitRatio = 2/3)
training_set <- subset(dataset, split==T)
test_set     <- subset(dataset, split==F)

#Regressed Model of training_set

regressor1 <- lm(Volume~Girth, data=training_set) 
summary(regressor1) #Adjusted R-squared:  0.9242

## 
## Call:
## lm(formula = Volume ~ Girth, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8927 -3.8074 -0.1091  2.9675  8.1133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.5750     4.0612  -9.252 2.91e-08 ***
## Girth         5.1680     0.2962  17.448 1.00e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.253 on 18 degrees of freedom
## Multiple R-squared:  0.9442, Adjusted R-squared:  0.9411 
## F-statistic: 304.4 on 1 and 18 DF,  p-value: 1.002e-12

#predicted values of test_set

(y_predicted_values <- predict(regressor1, newdata=test_set))

##         3         7         8        10        11        14        19 
##  7.903761 19.273458 19.273458 20.307066 20.823871 22.891088 33.227176 
##        20        25        26        30 
## 33.743981 46.664091 51.832135 55.449765

#Let's predict the Volume at Girth=17.5

predict(regressor1, data.frame(Girth=17.5))

##        1 
## 52.86574

#52.61681 which is closer to the predicted value of 55.7

#Visualising training_set results

ggplot() + geom_point(aes(training_set$Girth, training_set$Volume), col="red") +
geom_smooth(aes(training_set$Girth, predict(regressor1, newdata = training_set)), col="blue") + 
ggtitle("Volume Vs.Girth(training_set)") + xlab("Girth") + ylab("Volume")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#Our model has been made and seems accurate now let's check it's proximity with the test results.

#Visualising test_set results

ggplot() + geom_point(aes(test_set$Girth, test_set$Volume), col="red") +
geom_smooth(aes(training_set$Girth, predict(regressor1, newdata = training_set)), col="blue") + 
ggtitle("Volume Vs.Girth(test_set") + xlab("Girth") + ylab("Volume")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#Voila !! Our model seems to pass and we can reply in this model and can say this is the best fitted line but before we can make any conclusion let's also incorporate the influence of height on the volume. Let's study the new model.

#MODEL2: Volume ~ Girth + Height
regressor2 <- lm(Volume~., data=dataset)
summary(regressor2) #Adjusted R-squared:  0.9442

## 
## Call:
## lm(formula = Volume ~ ., data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

#Since Adjusted R-squared[MODEL2]:  0.9442 > Adjusted R-squared[MODEL1]:  0.9242 therefore we must include height in our model. And that will be our best fitted model so far and can be industrially used to predict future Volume of cherries.

#Visualisation -This part has three variables so we need a 3D space visualisation. It can be done through scatterplot3D.

Girth <- seq(8,21,by=0.5)
Height <- seq(60, 90, by=0.5)
pred_grid <- expand.grid(Girth=Girth, Height = Height) 
#The expand.grid() function creates a data frame from all combinations of the factor variables.

pred_grid$Volume2 <- predict(regressor2, newdata = pred_grid)
head(pred_grid, 10)

##    Girth Height     Volume2
## 1    8.0     60  0.03269916
## 2    8.5     60  2.38677941
## 3    9.0     60  4.74085966
## 4    9.5     60  7.09493991
## 5   10.0     60  9.44902017
## 6   10.5     60 11.80310042
## 7   11.0     60 14.15718067
## 8   11.5     60 16.51126092
## 9   12.0     60 18.86534117
## 10  12.5     60 21.21942142

#We have noticed that we have a negative and a very low impractical volume which isn't possible so we need to get rid of Girth<8. We always need to analyse our scaling otherwise it would create unexpected results and model won't be uniformly valid for all readings. Sometimes, we do mistakes but visibility of their negative impact may be beyond the gven dataset/problem in hand. So, as a resonsible analyst we must cross validate our model by comparing it with all the real life cases.

#Let's rectify the scaling of Girth.
Girth <- seq(9,21,by=0.5)
Height <- seq(60, 90, by=0.5)
pred_grid <- expand.grid(Girth=Girth, Height = Height) 
pred_grid$Volume2 <- predict(regressor2, newdata = pred_grid)
head(pred_grid, 10)

##    Girth Height  Volume2
## 1    9.0     60  4.74086
## 2    9.5     60  7.09494
## 3   10.0     60  9.44902
## 4   10.5     60 11.80310
## 5   11.0     60 14.15718
## 6   11.5     60 16.51126
## 7   12.0     60 18.86534
## 8   12.5     60 21.21942
## 9   13.0     60 23.57350
## 10  13.5     60 25.92758

#Situation seems to be resolved now.

#Let's make a scatterplot3D from the predicted Volume and predicted grid.
library(scatterplot3d)
regressor1_scatterplot <- scatterplot3d(pred_grid$Girth, pred_grid$Height, pred_grid$Volume2, pch=1, angle = 60, color="deeppink", xlab = "Girth (Inch)", ylab = "Height (ft)", zlab = "Volume (ft3)", main = "ScatterPlot3D")

#Overlaying our data points on the created scatterplot3D to see how well they fit in.
regressor1_scatterplot$points3d(dataset$Girth, dataset$Height, dataset$Volume, pch=16)

#Let's predict the Volume at Girth=17.5 and Height=82
predict(regressor2, data.frame(Girth=17.5, Height=82))

##        1 
## 52.22375

#52.22375 which is closer to the true value of 55.7 than the first method(52.61681).

#Now you all must be wondering that this model is the best suited but if we think about the natural course of how a tree grows it depends then we will realise that the one parameter is related to the other. For e.g. if the height increases then the stem diameter also increases. And regression model takes into one predictor while keeping all others as constant. It means that we are actually not considering the interaction term to incorporate about the dependence factor; so we will have to add this factor in our synthesis of our model.

#NOTE: This is the part of transformation of independent variables to create new regressed factors which will improve the model's purity. This is a fundamental concept requiring a lot of analytical skills and one must pratice a lot to get used to it otherwise complex models can't be built while doing with real world problems. 

#MODEL3 Volume ~ Height + Grith + Height*Grith 

regressor3 <- lm(Volume~Height+Girth+Height*Girth, data=dataset)
summary(regressor3)

## 
## Call:
## lm(formula = Volume ~ Height + Girth + Height * Girth, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height:Girth  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16

#Adjusted R-squared:  0.9728 

#Since Adjusted R-squared:  0.9728[MODEL3] > Adjusted R-squared:  0.9442[MODEL2]. Finally, this is the best fitted model we can built.

#VISUALISATION: FINAL REGRESSED MODEL

ggplot() + geom_point(aes(dataset$Girth, dataset$Volume), col="red") +
geom_smooth(aes(dataset$Girth, predict(regressor3, newdata = dataset)), col="blue") + 
ggtitle("Volume Vs.Girth") + xlab("Girth") + ylab("Volume")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#Scatterplot3D 

Girth <- seq(9,21,by=0.5)
Height <- seq(60, 90, by=0.5)
pred_grid <- expand.grid(Girth=Girth, Height = Height) 
pred_grid$Volume3 <- predict(regressor3, newdata = pred_grid)
regressor3_scatterplot <- scatterplot3d(pred_grid$Girth, pred_grid$Height, pred_grid$Volume3, pch=1, angle = 60, color="deeppink", xlab = "Girth (Inch)", ylab = "Height (ft)", zlab = "Volume (ft3)", main = "ScatterPlot3D")
regressor3_scatterplot$points3d(dataset$Girth, dataset$Height, dataset$Volume, pch=16)

#Let's predict the Volume at Girth=17.5 and Height=82
predict(regressor3, data.frame(Girth=17.5, Height=82))

##       1 
## 53.7872

#53.7872 which is closer to the true value of 55.7 than the previous method(52.22375).

#If you compare both the plots you will find better fitting of data points on the latest plot. So it is the best fitted model plus the predicted values are also very closest as compared to the first two models. So, we can say this is the best fitted model amongst the all.


#CONCLUSION: So we have invested a trees dataset and with only three attributes we have designed a robust model which can be industrially used for future predictions of volume of cherry tress. Our model is 97.28 accurate which is an acceptable one. We have learnt various phases and learnings involved in the simple and multilinear model. 

#SPECIAL NOTE: We can never say that our model is the best one because there will always be a better version of it possible and actually have been made and used for years. We have learnt that a model is defined by the analyst's analytical, technical, thinking and problem solving skills.

#And the most important and relevant skill is your domain knowledge in the area of problem you are analysing. If you don't like cars you can't be 100% useful to analyse car selling business case. 

#Because interest and experience in the concerned department forms an another useful skill that is DATA INTUITION.

#Thanks for taking out your valuable time for my work. I am an engineering student looking for opportunities so I am bound to make mistakes. Thus, I would be gratefully thankful if you could please suggest rectification and guidance.

#Best regards,
#Puneet, 
#Cont: +918826103939, India

Trees - Predictive Modelling of Volume of Cheery Trees

Puneet

December 9, 2018