#Data for statistical analysis

##The iris data is used for the statistical analyses below:

data<-iris

head(iris, 150)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals. The dataset contains a set of 150 records under five attributes - sepal length, sepal width, petal length, petal width and species.
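The structure described above can be checked directly in R. The following is a minimal sketch using only base R; the name `species_counts` is introduced here for illustration and does not appear in the original analysis.

```r
# iris ships with base R (package 'datasets'), so nothing needs to be loaded.
dim(iris)   # number of rows (records) and columns (attributes)
str(iris)   # column types: four numeric measurements and one factor

# Count the samples per species.
species_counts <- table(iris$Species)
species_counts
```

If the description holds, `dim(iris)` reports 150 rows and 5 columns and every species has the same count.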

##Evaluating whether iris petal length is normally distributed:

hist(iris$Petal.Length, col = "light green")

qqnorm(iris$Petal.Length, col = "green")
qqline(iris$Petal.Length, col = "purple")

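Do the points in the normal probability plot follow the line closely enough to call the data nearly normal? A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm, with the same sample size, mean and standard deviation as the petal length data. Because rnorm draws random numbers, the simulated values shown below will differ on every run.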

sim_norm <- rnorm(n = length(iris$Petal.Length), mean = mean(iris$Petal.Length), sd = sd(iris$Petal.Length))

sim_norm
##   [1]  1.8762763  5.6002303  2.3823471  3.0219743  3.3307206  3.0718110
##   [7]  0.5555646  2.3599008  5.3535057  5.2417061  5.1142673  5.3390768
##  [13]  2.1500715  5.5462389  5.6095301  3.6157050  4.6104779  3.9657647
##  [19]  1.5004877 -0.4699319  4.3829814  1.5752873  4.4775034  4.6944013
##  [25]  1.4504341  5.3342467  3.3979144  4.6625426  0.9782925  5.2458986
##  [31]  0.9322574  3.3308966  5.1401980  3.6189246  8.0380070  5.6062768
##  [37]  3.1094491  3.9736391  4.0656420  5.6295651  4.4806246  3.9798064
##  [43]  3.0971954  4.5560109  3.8293571  5.3639456  4.4174113  3.0209347
##  [49]  1.5158933 -0.8284493  3.2586134  4.6409253  2.8666278  4.8992547
##  [55]  3.2110264  7.3460857  1.5825531  2.7943165  3.9251592  4.9845949
##  [61]  2.7125772  5.2399996  4.8511864  0.7951556  8.1246438  3.7524032
##  [67]  7.7763289  2.5352056  0.9576627  4.2823073  6.6978193  4.7319367
##  [73]  7.2657872  3.4622536  3.5287792  1.1739676  1.3850794  2.3585517
##  [79]  4.5696934  4.8128923  4.1847320 -2.1141482  3.6881388  4.1170209
##  [85]  3.9022278  2.8797976  5.3735061  1.6491752  4.8522039  5.1356169
##  [91]  2.7793010  4.0979477  4.7398873  1.7191028  2.2791275  8.0048963
##  [97]  1.6705183  1.9689120  4.8034016  2.9229370  3.2936951  4.9111530
## [103] -0.5629073  3.6379257  3.3357901  5.5260761  0.5602952  5.9022496
## [109]  2.5108217  4.4528925  5.0825149  3.0295305  3.1525971  4.5231471
## [115]  1.0697094  2.6641462  4.4161451  5.8084018  1.2240901  1.5351264
## [121]  6.0170942  2.2273154  2.6974215  3.6477798  6.0496768  4.5588954
## [127]  4.9689509  3.2634917  3.3897339  2.6835755  4.7710149  7.6848889
## [133]  4.0522004  1.7924575  3.5229578  5.4450353  3.8471554  5.1234901
## [139]  4.8321664  3.7041709  3.0712707  1.0481217  3.1550564  0.8452859
## [145]  4.5539199  5.9528513  3.0421248  7.1957239  4.7316452  4.9512266
qqnorm(sim_norm, col = "green")
qqline(sim_norm, col = "purple")
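
A single simulated sample shows only one possible outcome. The sketch below is one way to make the comparison reproducible and to see several simulations side by side. The seed value 123, the 3 x 3 layout and the names `x`, `sim` and `old_par` are arbitrary choices made for illustration, not part of the original analysis.

```r
# Fix the random seed so the simulated samples are the same on every run.
# The seed value 123 is an arbitrary choice for illustration.
set.seed(123)

x <- iris$Petal.Length

# 3 x 3 grid: the observed data first, then eight simulated normal samples.
old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
qqnorm(x, main = "Observed Petal.Length", col = "green")
qqline(x, col = "purple")
for (i in 1:8) {
  sim <- rnorm(n = length(x), mean = mean(x), sd = sd(x))
  qqnorm(sim, main = paste("Simulation", i), col = "green")
  qqline(sim, col = "purple")
}
par(old_par)  # restore the previous plotting layout
```

If the observed panel looks clearly different from all eight simulated panels, that suggests the data are not well described by a normal distribution.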

##Linear Regression plot:

library(psych)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
pairs.panels(iris, method = "pearson", hist.col = "#5D6D7E")

iris %>% 
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% 
  pairs.panels(method = "pearson", hist.col = "#5D6D7E")

cor(iris$Petal.Length, iris$Sepal.Length)
## [1] 0.8717538

Comment on the finding: The relationships between the variables can be seen in the pairs.panels correlation plots shown above. The correlation between Sepal Length and Petal Length is 0.87, which means that the two variables are strongly positively correlated: as sepal length increases (or decreases), petal length also tends to increase (or decrease). In the regression that follows, Sepal Length is treated as the independent variable and Petal Length as the dependent variable.
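
The same correlations can be obtained as numbers rather than read off a plot. This is a small sketch using base R only; `num_cols` and `ct` are names introduced here for illustration, and `cor.test` is added as one possible way to attach a confidence interval to the estimate.

```r
# Pearson correlation matrix of the four numeric columns (Species is excluded).
num_cols <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
round(cor(num_cols, method = "pearson"), 2)

# Test whether the Sepal.Length / Petal.Length correlation differs from zero.
ct <- cor.test(iris$Sepal.Length, iris$Petal.Length, method = "pearson")
ct$estimate   # the sample correlation, about 0.87
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value
```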

##Regression of “Petal.Length” on “Sepal.Length”: Here Petal Length is the dependent variable and Sepal Length is the independent variable

m1 <- lm(Petal.Length ~ Sepal.Length, data = iris)

summary(m1)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
m1$coefficients[1]
## (Intercept) 
##   -7.101443
plot(iris$Petal.Length ~ iris$Sepal.Length, col= "green")
abline(m1, col= "purple")

Comment on the finding: The adjusted R-squared value is 0.7583 (or 75.83%), which indicates the percentage of the variance in the dependent variable (Petal Length) that the independent variable (Sepal Length) explains, adjusted for the number of predictors. The model fits the data well, so the differences between the observed data and the fitted values are relatively small (residual standard error 0.8678).
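
To make the meaning of R-squared concrete, it can be recomputed by hand from the residuals. The sketch below assumes the same model as `m1` above; the names `ss_res`, `ss_tot`, `r2_manual` and `r2_cor` are introduced for illustration.

```r
m1 <- lm(Petal.Length ~ Sepal.Length, data = iris)

y      <- iris$Petal.Length
ss_res <- sum(residuals(m1)^2)    # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean

# Share of the total variation that the model explains.
r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared Pearson correlation.
r2_cor <- cor(iris$Sepal.Length, iris$Petal.Length)^2

c(manual = r2_manual, from_cor = r2_cor, from_summary = summary(m1)$r.squared)
```

All three values should agree with the multiple R-squared of about 0.76 reported by `summary(m1)`.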


##Model diagnostics of the iris data: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

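###Linearity: the scatterplot with the fitted line above gives a first check; the plot of residuals against Sepal.Length below gives a second one

###Constant variability: the same residual plot should show a roughly constant spread of the residuals around zero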

plot(m1$residuals ~ iris$Sepal.Length, col= "blue")
abline(h = 0, lty = 3, col='red')  # adds a horizontal dashed line at y = 0

###Nearly normal residuals: To check this condition, we can look at a histogram

hist(m1$residuals, col = "light blue")

###or a normal probability plot of the residuals

qqnorm(m1$residuals, col= "blue")
qqline(m1$residuals, col= "red")  # adds diagonal line to the normal prob plot

###All of these diagnostic plots can be produced with a single function call

plot(m1)
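
By default `plot(m1)` draws four diagnostic plots one after another: residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage. A possible way to see them together, assuming the same model as above, is to split the plotting area first; `old_par` is a name introduced here for illustration.

```r
m1 <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Show the four default diagnostic plots in a 2 x 2 grid instead of one at a time.
old_par <- par(mfrow = c(2, 2))
plot(m1)
par(old_par)  # restore the previous plotting layout
```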

##Using ggplot functions to plot the iris data:

ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula 'y ~ x'

Comment: se = TRUE draws the confidence band (95% by default) around the fitted line
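
The band around the fitted line is a confidence interval for the mean petal length at each sepal length. The numbers behind such a band can be sketched with `predict()` in base R. The sepal lengths 5, 6 and 7 and the names `new_sepals` and `ci` are hypothetical values chosen for illustration.

```r
m1 <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Hypothetical sepal lengths chosen for illustration.
new_sepals <- data.frame(Sepal.Length = c(5, 6, 7))

# 'fit' is the predicted mean petal length; 'lwr' and 'upr' bound the
# 95% confidence interval for that mean (level = 0.95 is the default).
ci <- predict(m1, newdata = new_sepals, interval = "confidence", level = 0.95)
cbind(new_sepals, ci)
```

Using `interval = "prediction"` instead would give the wider interval for a single new observation rather than for the mean.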

##Using predicted values to plot and analyse the iris data:

###Read the iris dataset

df = iris

###Creating multiple linear model:

Here Petal Width is the dependent variable, and Petal Length and Sepal Width are the independent variables

lm_fit <- lm(Petal.Width ~ Petal.Length + Sepal.Width, data=df)
summary(lm_fit)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Sepal.Width, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53907 -0.11443 -0.01447  0.12168  0.65419 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.70648    0.15133  -4.668 6.78e-06 ***
## Petal.Length  0.42627    0.01045  40.804  < 2e-16 ***
## Sepal.Width   0.09940    0.04231   2.349   0.0201 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2034 on 147 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9288 
## F-statistic: 972.7 on 2 and 147 DF,  p-value: < 2.2e-16

Comment on the finding: Here the adjusted R-squared is 0.9288 (or 92.88%), which means that the model explains most of the variance in the dependent variable (Petal Width), so the differences between the observed data and the fitted values are small.
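
With more than one predictor, the adjusted R-squared differs from the plain R-squared because it penalises the number of predictors. A minimal sketch of that calculation, assuming the same model as `lm_fit` above (the names `n`, `p`, `r2` and `adj_r2_manual` are introduced for illustration):

```r
lm_fit <- lm(Petal.Width ~ Petal.Length + Sepal.Width, data = iris)

n  <- nrow(iris)                  # number of observations
p  <- length(coef(lm_fit)) - 1    # number of predictors (intercept excluded)
r2 <- summary(lm_fit)$r.squared

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(manual = adj_r2_manual, from_summary = summary(lm_fit)$adj.r.squared)
```

Both values should match the adjusted R-squared of about 0.9288 in the summary output.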

Saving the predictions of the model in a new data frame, together with the variables you want to plot against

predicted_df <- data.frame(Petal.width_pred = predict(lm_fit, df), Petal.Length=df$Petal.Length, Sepal.Width=df$Sepal.Width)

###Creating the predicted line of the multiple linear regression

ggplot(data = df, aes(x = Petal.Width, y = Petal.Length)) + 
  geom_point(color='blue') +
  geom_smooth(color='red',data = predicted_df, aes(x=Petal.width_pred, y=Petal.Length))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
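
Another way to judge the predictions, shown here as a sketch and not as part of the original analysis, is to plot the observed values against the predicted ones and to summarise the error in one number. Base R graphics are used so that the example needs no extra packages; `pred` and `rmse` are names introduced for illustration.

```r
df <- iris
lm_fit <- lm(Petal.Width ~ Petal.Length + Sepal.Width, data = df)
pred <- predict(lm_fit, df)

# Points close to the dashed identity line are predicted well.
plot(pred, df$Petal.Width,
     xlab = "Predicted Petal.Width", ylab = "Observed Petal.Width", col = "blue")
abline(a = 0, b = 1, lty = 2, col = "red")

# Root mean squared error of the in-sample predictions.
rmse <- sqrt(mean((df$Petal.Width - pred)^2))
rmse
```

The RMSE is computed on the same data used to fit the model, so it is likely to be somewhat optimistic about performance on new flowers.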
