1 Introduction

In this work, we predict the petal length of iris flowers of different species by means of a linear regression analysis. The data set contains, for each flower, the sepal length and width, the petal length and width, and the species.

2 Data Analysis

2.1 Import Data

First, we need to import our data set “iris.csv” into RStudio. The readr library is loaded for reading data, although the read.csv() function used below belongs to base R. The other libraries used throughout the project are also loaded here, even though they are needed only later, so that the code keeps a certain order.

library(readr)
library(ggplot2)  
library(DMwR)
library(lawstat)
library(lmtest)
library(car)
library(gridExtra)

Once the libraries are loaded, we can read the data. The first column of our CSV document has to be skipped. To do this, the “colClasses” parameter classifies the column to be omitted as “NULL” and the other columns as NA, so that R chooses the appropriate data type for each attribute.

Iris <- read.csv("data set/iris.csv", colClasses = c("NULL",NA,NA,NA,NA,NA))
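The effect of “colClasses” can be seen in isolation with a small example. This is a minimal sketch that assumes a hypothetical three-column CSV supplied inline through the text argument of read.csv(); the column names and values are illustrative and are not taken from “iris.csv”.

```r
# Hypothetical CSV content, supplied inline so the example is self-contained.
csv_text <- "id,Sepal.Length,Species
1,5.1,setosa
2,4.9,setosa
3,7.0,versicolor"

# "NULL" drops the first column; NA lets read.csv() infer the type of the others.
demo <- read.csv(text = csv_text, colClasses = c("NULL", NA, NA))

str(demo)  # two columns remain: Sepal.Length and Species
```

The vector passed to “colClasses” needs one entry per column of the file, which is why the call in this document lists one "NULL" followed by five NA values.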

Before manipulating the data, it is highly recommended to get to know the data set we are handling. The following functions help to understand the data better and can also help in data preprocessing.

attributes(Iris) # Lists the attributes of the object (column names, class, row names).
summary(Iris) # Prints the min, max, mean, median, and quartiles of each numeric attribute.
str(Iris) # Displays the structure of the data set.
names(Iris) # Returns the names of the attributes (columns) of the data set.

It is also possible to change the names of attributes using the following code line.

names(Iris) <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species")

2.2 Cleaning Data

The most important part of cleaning data is to know whether there are missing values in the data set. We can check for missing values with the summary() function mentioned before, or with the following line of code.

is.na(Iris) # Returns a logical matrix in which TRUE marks a missing value (NA).

In this example there are no missing values, but if there were, a very common treatment is to replace the NA values (missing data) with the mean of the column.

Iris$Sepal_Length[is.na(Iris$Sepal_Length)] <- mean(Iris$Sepal_Length, na.rm = TRUE)
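The line above treats a single column. A minimal sketch of the same idea applied to every numeric column is given below; the data frame toy and its values are hypothetical and serve only as an illustration.

```r
# Toy data frame with missing values (hypothetical values, for illustration only).
toy <- data.frame(
  Sepal_Length = c(5.1, NA, 4.7, 4.6),
  Sepal_Width  = c(3.5, 3.0, NA, 3.1),
  Species      = c("setosa", "setosa", "setosa", "setosa")
)

colSums(is.na(toy))  # number of missing values per column

# Replace NA by the column mean in every numeric column.
numeric_cols <- sapply(toy, is.numeric)
toy[numeric_cols] <- lapply(toy[numeric_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

toy
```

colSums(is.na(...)) gives a compact count per column, which is easier to read than the full logical matrix when the data set has many rows.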

2.3 Initial exploration of data

Apart from the functions mentioned before, there are other tools that help to understand the data by visualizing how it is distributed.

Plots can be drawn with the ggplot() function, which requires the ggplot2 package. To show plots side by side, the grid.arrange() function from the gridExtra package is used.

The first plot is a histogram, which is a graphical display of the distribution of continuous sample data.

# Inside aes(), columns are referenced by name; ggplot() looks them up in the Iris data frame.
hist_Sepal_Length <- ggplot(Iris, aes(Sepal_Length)) + geom_histogram(color = "black", binwidth = 0.2)
hist_Sepal_Width <- ggplot(Iris, aes(Sepal_Width)) + geom_histogram(color = "black", binwidth = 0.2)
hist_Petal_Length <- ggplot(Iris, aes(Petal_Length)) + geom_histogram(color = "black", binwidth = 0.2)
hist_Petal_Width <- ggplot(Iris, aes(Petal_Width)) + geom_histogram(color = "black", binwidth = 0.2)
grid.arrange(hist_Sepal_Length, hist_Sepal_Width, hist_Petal_Length, hist_Petal_Width, nrow = 2, ncol = 2)

To visualize how categorical data is distributed, a bar chart is used. In this data set the attribute “Species” is the only categorical variable.

ggplot(Iris, aes(Species, fill = Species)) + geom_bar(stat = "count", position = "dodge") # fill = Species colors each bar according to its species.

Another commonly used plot is the scatter plot, which represents the values of two different numeric variables. It is used to observe relationships between variables.

scatter_Sepal <- ggplot(Iris, aes(Sepal_Length, Sepal_Width)) + geom_point(color="black")
scatter_Petal <- ggplot(Iris, aes(Petal_Length, Petal_Width)) + geom_point(color="blue")
scatter_Length <- ggplot(Iris, aes(Sepal_Length, Petal_Length)) + geom_point(color="green")
scatter_Width <- ggplot(Iris, aes(Sepal_Width, Petal_Width)) + geom_point(color="red")
scatter_SepalLength_PetalWidth <- ggplot(Iris, aes(Sepal_Length, Petal_Width)) + geom_point(color="yellow")
scatter_SepalWidth_PetalLength <- ggplot(Iris, aes(Sepal_Width, Petal_Length)) + geom_point(color="purple")

grid.arrange(scatter_Sepal, scatter_Petal, scatter_Length, scatter_Width, scatter_SepalLength_PetalWidth, scatter_SepalWidth_PetalLength, nrow=2 ,ncol = 3)

The last plot to be studied is the boxplot. It is a standardized way of displaying the distribution of data based on the five-number summary (minimum, Q1, median, Q3 and maximum). It can reveal outliers, whether the data is symmetrical, how tightly the data is grouped, and whether and how the data is skewed.

par(mfrow=c(2, 2))  # divide the plotting area into 2 rows and 2 columns
boxplot(Iris$Sepal_Length, main="Sepal Length")
boxplot(Iris$Sepal_Width, main="Sepal Width")  
boxplot(Iris$Petal_Length, main="Petal Length") 
boxplot(Iris$Petal_Width, main="Petal Width")

Outliers can be observed in the Sepal_Width variable. The next line extracts the outlier values and stores them in the variable outliers.

outliers <- boxplot(Iris$Sepal_Width, plot=FALSE)$out
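The values that boxplot() reports as outliers are those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces this rule by hand. It assumes the built-in iris data frame as a stand-in for the CSV file, so the column is called Sepal.Width instead of Sepal_Width. Note that boxplot() works with hinges rather than with quantile(); the two can differ slightly for other data, which is why the last line compares the results.

```r
# Built-in iris data stands in for the CSV file (assumption).
x <- iris$Sepal.Width

# Quartiles and interquartile range.
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR below Q1 and above Q3.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

manual_outliers <- x[x < lower | x > upper]
manual_outliers

# Same values as reported by the boxplot statistics for this column.
setequal(manual_outliers, boxplot.stats(x)$out)
```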

This leads to the next part, where problematic data such as outliers are removed.

2.4 Pre-processing

In this part, the data set is modified according to the needs of the analysis. As mentioned before, the previously observed outliers are removed. The procedure is shown in the following lines of code.

Iris_outliers <- Iris[Iris$Sepal_Width %in% outliers, ] # First, find the rows that contain the outliers.
Iris[!(Iris$Sepal_Width %in% outliers), ] # Print the data set without the outliers (logical indexing also works when there are no outliers).

##     Sepal_Length Sepal_Width Petal_Length Petal_Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Iris_dataprep <- Iris[!(Iris$Sepal_Width %in% outliers), ] # Create a new data set without the outliers.

From this point forward our data set is renamed to “Iris_dataprep”.

2.5 Creating training and testing set

Once the data is preprocessed, it is time to create the training and testing sets for the linear regression model.

First, the seed must be set. The seed is a number that you choose as the starting point for generating a sequence of random numbers. Fixing it allows others to reproduce exactly the same results.

set.seed(123)

The next step is to split the data into two sets, a training set and a testing set. A common split is 80/20, which means that the training set contains 80% of the data and the test set the remaining 20%. The next two lines set the size of each set.

trainSize <- round(nrow(Iris_dataprep)*0.8) 
testSize <- nrow(Iris_dataprep)-trainSize

Once the sizes are set, the training and testing sets must be created. The rows are drawn at random, so that neither set depends on the order in which the rows appear in the file.

training_indices <- sample(seq_len(nrow(Iris_dataprep)),size =trainSize) 
testing_indices <- sample(seq_len(nrow(Iris_dataprep)),size =testSize)
# seq_len() --> creates a sequence that starts at 1 and with steps of 1 finishes at the number value
trainSet <- Iris_dataprep[training_indices,]
testSet <- Iris_dataprep[testing_indices,]
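The two sample() calls above are drawn independently of each other, so the same row can end up in both the training and the testing set. A minimal sketch of a split without overlap is shown below. It assumes the built-in iris data frame as a stand-in for Iris_dataprep, and the names data_split, train_demo and test_demo are hypothetical.

```r
set.seed(123)

# Built-in iris stands in for Iris_dataprep (assumption).
data_split <- iris
n          <- nrow(data_split)
train_size <- round(n * 0.8)

# Draw the training rows at random; every remaining row goes to the test set.
train_idx <- sample(seq_len(n), size = train_size)
test_idx  <- setdiff(seq_len(n), train_idx)

train_demo <- data_split[train_idx, ]
test_demo  <- data_split[test_idx, ]

# No row index is shared between the two sets.
length(intersect(train_idx, test_idx))
```

Keeping the two sets disjoint means that the prediction metrics are computed only on rows the model has not seen during fitting.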

2.6 Modelling

Now it is time to run the data through the linear modelling algorithm. A linear regression model is helpful when trying to describe the relationship between two variables, which represent X and Y in the linear equation. X is the predictor variable, also known as the independent variable, because its values are taken as given when making predictions. Y is the response variable, also known as the dependent variable, because its value is modelled as depending on X.

In our Iris example, we are trying to predict the petal length, so it is the dependent variable, whereas the petal width is the independent variable.

IrisLinearMod <- lm(Petal_Length ~ Petal_Width, trainSet)
summary(IrisLinearMod)

## 
## Call:
## lm(formula = Petal_Length ~ Petal_Width, data = trainSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33733 -0.30972 -0.01868  0.23580  1.39028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09107    0.08891   12.27   <2e-16 ***
## Petal_Width  2.22761    0.06084   36.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4958 on 115 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9203 
## F-statistic:  1340 on 1 and 115 DF,  p-value: < 2.2e-16
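For a model with a single predictor, the coefficients reported by lm() have a closed form: the slope is cov(X, Y) / var(X) and the intercept is mean(Y) - slope * mean(X). The sketch below checks this. It assumes the built-in iris data frame and fits on all 150 rows, so the numbers will not match the output above exactly.

```r
# Built-in iris stands in for trainSet (assumption).
x <- iris$Petal.Width   # predictor
y <- iris$Petal.Length  # response

# Closed-form least-squares estimates for simple linear regression.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)

c(intercept = intercept, slope = slope)
coef(fit)  # same two values as the manual calculation
```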

2.6.1 Assumptions of Linear Regression

Building a linear regression model is only half of the work. In order to actually be usable in practice, the model should conform to the assumptions of linear regression.

2.6.1.1 Assumption 1

The first assumption says that the regression model is linear in its parameters. This means that the equation must have the following form: Y = Intercept + Estimate * X. The fitted coefficients can be checked with the next line of code.

print(IrisLinearMod) # y = 1.091 + 2.228 * x (x = Petal_Width, y = Petal_Length)

## 
## Call:
## lm(formula = Petal_Length ~ Petal_Width, data = trainSet)
## 
## Coefficients:
## (Intercept)  Petal_Width  
##       1.091        2.228

2.6.1.2 Assumption 2

The second assumption says that the mean of the residuals is zero. It can be calculated as follows.

mean(IrisLinearMod$residuals)

## [1] 6.20254e-18

Since the mean of the residuals is approximately zero, this assumption holds for this model. For a least-squares fit that includes an intercept, this is true by construction.

2.6.1.3 Assumption 3

The third assumption concerns the homoscedasticity of the residuals, that is, whether they have equal variance. To check it, four diagnostic plots are built with the next lines.

par(mfrow=c(2,2))  
plot(IrisLinearMod) # plot() on an lm object draws four diagnostic plots

From the first plot (top-left), as the fitted values along the x axis increase, the residuals increase and then decrease. This pattern is indicated by the red line, which should be approximately flat if the disturbances are homoscedastic. The plot on the bottom-left also checks this and is more convenient, as the disturbance term on the Y axis is standardized.

In this case, a definite pattern can be seen, so there is heteroscedasticity.
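The visual check can be complemented with a numerical one. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. It assumes the built-in iris data frame instead of trainSet, so the result is only illustrative. The lmtest package loaded earlier provides bptest() for the same purpose.

```r
# Built-in iris stands in for trainSet (assumption).
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Auxiliary regression: squared residuals on the predictor.
sq_res <- residuals(fit)^2
aux    <- lm(sq_res ~ iris$Petal.Width)

n         <- nrow(iris)
bp_stat   <- n * summary(aux)$r.squared                    # LM statistic
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)   # one predictor -> 1 df

c(statistic = bp_stat, p.value = bp_pvalue)
```

A small p-value would speak against equal variance of the residuals, in line with a visible pattern in the diagnostic plots.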

2.6.1.4 Assumption 4

The fourth assumption says that there must be no autocorrelation of the residuals. There are three different ways to check this.

2.6.1.4.1 Using acf plot

acf(IrisLinearMod$residuals)

The X axis corresponds to the lags of the residuals, increasing in steps of 1. If the residuals are not autocorrelated, the correlation (Y axis) from lag 1 onwards drops to a near-zero value inside the dashed blue lines (significance bounds). Here it can be clearly observed that the residuals are not autocorrelated.

2.6.1.4.2 Using runs test

For this function, “lawstat” package must be installed.

runs.test(IrisLinearMod$residuals)

## 
##  Runs Test - Two sided
## 
## data:  IrisLinearMod$residuals
## Standardized Runs Statistic = 0.093652, p-value = 0.9254

This gives a p-value of 0.9254. With this value we cannot reject the null hypothesis of randomness, so there is no evidence that the residuals are autocorrelated.

2.6.1.4.3 Using Durbin-Watson test

For this function, “lmtest” package must be installed.

dwtest(IrisLinearMod)

## 
##  Durbin-Watson test
## 
## data:  IrisLinearMod
## DW = 2.0302, p-value = 0.5648
## alternative hypothesis: true autocorrelation is greater than 0

With a high p-value of 0.5648, we cannot reject the null hypothesis that the true autocorrelation is zero. So the assumption that the residuals should not be autocorrelated is satisfied by this model.
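The DW statistic itself is simple to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 indicate no first-order autocorrelation. The sketch below assumes the built-in iris data frame with shuffled rows as a stand-in for trainSet, so the value differs from the output above.

```r
set.seed(123)

# Built-in iris with rows in random order stands in for trainSet (assumption).
shuffled <- iris[sample(nrow(iris)), ]
fit      <- lm(Petal.Length ~ Petal.Width, data = shuffled)

e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)  # Durbin-Watson statistic, always between 0 and 4
dw
```

The statistic depends on the order of the rows, which is one reason why it is mainly meaningful for data with a natural ordering such as time series.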

2.6.1.5 Assumption 5

The 5th assumption says that the X variables and the residuals must be uncorrelated. To check this, a correlation test between the X variable and the residuals is performed.

cor.test(trainSet$Petal_Width, IrisLinearMod$residuals)

## 
##  Pearson's product-moment correlation
## 
## data:  trainSet$Petal_Width and IrisLinearMod$residuals
## t = -1.1753e-15, df = 115, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.181533  0.181533
## sample estimates:
##          cor 
## -1.09597e-16

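The p-value is high, so the null hypothesis that the true correlation is 0 cannot be rejected, and the assumption holds for this model. Note that least-squares residuals are uncorrelated with the predictor by construction when the model includes an intercept, which is why the estimated correlation is zero up to rounding error.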

2.6.1.6 Assumption 6

The 6th assumption says that the number of observations must be greater than the number of Xs. This can be observed directly from the data. In our case there are 117 observations in trainSet (115 residual degrees of freedom plus 2 estimated coefficients) and a single X variable, Petal_Width.

2.6.1.7 Assumption 7

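The 7th assumption requires that the variability in the X values is positive, which means that the X values in a given sample must not all be the same. In this model X is Petal_Width, so the relevant quantity is var(trainSet$Petal_Width); the call below reports the variance of Petal_Length, the response, which is positive as well: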

var(trainSet$Petal_Length)

## [1] 3.084696

2.6.1.8 Assumption 8

The 8th assumption says that if the Y and X variables have an inverse relationship, the model equation should be specified accordingly:

\(Y = B_1 + B_2 \cdot (1/X)\)
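In R, such a specification can be written with I() inside the formula, so that 1/X is evaluated arithmetically. The following is a minimal sketch on synthetic data; the values 2 and 3 chosen for B1 and B2 and the noise level are assumptions made only for this illustration.

```r
set.seed(1)

# Synthetic data with an inverse relationship (true B1 = 2, B2 = 3).
x <- seq(1, 10, length.out = 50)
y <- 2 + 3 * (1 / x) + rnorm(50, sd = 0.05)

# I() protects 1 / x so that it is computed as a number, not read as formula syntax.
fit_inv <- lm(y ~ I(1 / x))
coef(fit_inv)  # estimates close to 2 and 3
```

The model is still linear in its parameters, because only the predictor is transformed.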

2.6.1.9 Assumption 9

This assumption requires that there is no perfect multicollinearity, meaning that there is no perfect linear relationship between the explanatory variables. It can be checked using the Variance Inflation Factor (VIF). A high VIF means that the information in that variable is already explained by the other X variables in the model, that is, the variable is largely redundant.

The function used here requires the “car” package.

IrisLinearMod_2 <- lm(Petal_Width ~ ., data=Iris_dataprep)  
vif(IrisLinearMod_2)

##                   GVIF Df GVIF^(1/(2*Df))
## Sepal_Length  7.291814  1        2.700336
## Sepal_Width   2.164117  1        1.471094
## Petal_Length 39.654794  1        6.297205
## Species      31.267161  2        2.364679

#corrplot(cor(Iris_dataprep[,-1]))

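As a rule of thumb, the VIF of an X variable should be less than 4. Perfect multicollinearity does not exist here, since vif() returns finite values, but Petal_Length (GVIF = 39.65) and Species (GVIF = 31.27) are far above this threshold, which indicates strong multicollinearity among the predictors of IrisLinearMod_2. The model IrisLinearMod used for prediction has a single predictor, so it is not affected.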
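The VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand. It assumes the built-in iris data frame and, to keep the example simple, only the three numeric predictors, so the values differ from the GVIF table above, which also includes Species.

```r
# Built-in iris stands in for Iris_dataprep (assumption); numeric predictors only.
predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]

# VIF of one predictor = 1 / (1 - R^2) of its regression on the other predictors.
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

manual_vif
```

A VIF of 1 means that the predictor is not explained at all by the others; the value grows without bound as R^2 approaches 1.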

2.6.1.10 Assumption 10

The last assumption says that the residuals should be normally distributed. To check this, the Normal Q-Q plot is examined: the closer the points lie to the line, the closer the residuals are to a normal distribution.

par(mfrow=c(2,2))
plot(IrisLinearMod)

The qqnorm() plot is shown in the top-right.
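A formal test can accompany the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the residuals. It assumes the built-in iris data frame instead of trainSet, so the result is only illustrative.

```r
# Built-in iris stands in for trainSet (assumption).
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
res <- residuals(fit)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(res)
sw$statistic
sw$p.value
```

A small p-value would lead to rejecting normality of the residuals; a large one means that the test finds no evidence against it.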

2.7 Prediction

The last step is to predict the petal length of the iris flowers from the petal width. To do this, we use the prediction function predict().

IrisPrediction <- predict(IrisLinearMod,testSet)

Now it is possible to compare the predictions with the actual values of the Petal_Length variable to check the correlation accuracy.

actuals_preds <- data.frame(cbind(actuals=testSet$Petal_Length, predicteds=IrisPrediction))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy # 95.43%

##              actuals predicteds
## actuals    1.0000000  0.9542997
## predicteds 0.9542997  1.0000000

Finally, other prediction metrics can be calculated, such as MAE, MSE, RMSE and MAPE. The regr.eval() function from the “DMwR” package can be used for this.

regr.eval(actuals_preds$actuals, actuals_preds$predicteds)

##       mae       mse      rmse      mape 
## 0.3804305 0.2509693 0.5009684 0.1246689

# MAE: mean absolute error
# MSE: mean squared error
# RMSE: root mean squared error
# MAPE: mean absolute percentage error
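The four metrics can also be computed directly in base R, which may be useful if the “DMwR” package is not available. This is a minimal sketch; the vectors actual and predicted hold hypothetical values and are not the results of this analysis.

```r
# Hypothetical actual and predicted values, for illustration only.
actual    <- c(1.4, 4.7, 5.1, 6.0)
predicted <- c(1.5, 4.5, 5.4, 5.8)

err <- actual - predicted

mae  <- mean(abs(err))           # mean absolute error
mse  <- mean(err^2)              # mean squared error
rmse <- sqrt(mse)                # root mean squared error
mape <- mean(abs(err / actual))  # mean absolute percentage error, as a fraction

c(mae = mae, mse = mse, rmse = rmse, mape = mape)
```

MAPE is expressed here as a fraction, so a value of 0.12 corresponds to an average error of 12% of the actual value.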

Iris

Elias Lobato

16/12/2019