Second research question: Can we predict sepal length based on petal length, sepal width, petal width, and the species of iris flowers?
1.1 DATA: I used data from the R program (already built in data set)
Unit of Observation: Iris flower Sample Size: 150 observations (flowers) Dependant variable: Sepal length Explanatory variables: Petal length, sepal width, petal width, and the species of iris flowers
Each observation in the dataset provides measurements for four numerical variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one categorical variable (Species). In total there are five variables for each flower.
# Loading the 'iris' dataset
data(iris)
1.2 ANALYSIS
# Displaying the first few rows of the dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Description: - Sepal.Length: length of the sepal in centimeters (Numerical/Continuous). - Sepal.Width: width of the sepal in centimeters (Numerical/Continuous). - Petal.Length:length of the petal in centimeters (Numerical/Continuous). - Petal.Width: width of the petal in centimeters.(Numerical/Continuous) - Species: species of the iris flower (Categorical/Nominal).Categories include Setosa, Versicolor and Virginica.
iris$Species <- as.factor(iris$Species)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Had problems factoring the categorical variable with the code used in class where we used numbers (0,1,2). In order to fix it I used another similar code that I found online. This solved my problems and allowed me to knit the document.
# Summary statistics
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(car)
## Loading required package: carData
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
scatterplotMatrix(iris[, c(1:4)], smooth = FALSE)
For the scatterplot matrix I excluded the 5th column. Dimensions: 4x4 as
there are 4 numerical variables. Dimension of data frame: rows=150,
columns=150*6= 900
Most correlated: petal length and petal width
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
vif_values <- car::vif(fit)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Sepal.Width 2.227466 1 1.492470
## Petal.Length 23.161648 1 4.812655
## Petal.Width 21.021401 1 4.584910
## Species 40.039177 2 2.515482
Detected a problem of multicolinearity. Should drop one variable. Petal length or petal width. In this case I will remove petal width from the regression model. Even the species variable is a problem. Since it is above 2.25, however I will keep it in the regression model in order to not have such a small number of variables.
fit1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
iris$StdResid <- round(rstandard(fit1), 3) #Standardized residuals
iris$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist(iris$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
No outliers detected. No values above 3 or below -3.
shapiro.test(iris$StdResid)
##
## Shapiro-Wilk normality test
##
## data: iris$StdResid
## W = 0.99515, p-value = 0.903
Based on the Shapiro-Wilk test I cannot reject the null hypothesis since p-value is 0.903. Therefore I can conclude that data is normally distributed.
hist(iris$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Based on the histogram there is a large gap. Should be removed.
head(iris[order(-iris$CooksD),], 6)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species StdResid
## 15 5.8 4.0 1.2 0.2 setosa 2.481
## 107 4.9 2.5 4.5 1.7 virginica -2.220
## 101 6.3 3.3 6.0 2.5 virginica -2.540
## 84 6.0 2.7 5.1 1.6 versicolor -1.845
## 85 5.4 3.0 4.5 1.5 versicolor -2.679
## 71 5.9 3.2 4.8 1.8 versicolor -2.102
## CooksD
## 15 0.067
## 107 0.066
## 101 0.040
## 84 0.038
## 85 0.035
## 71 0.033
Based on the cooks distance I will remove ID 107, 101, 15 and 136.
library(dplyr)
ids_to_remove <- c(107, 101, 15, 136)
iris_filtered <- iris %>%
filter(!row_number() %in% ids_to_remove)
# Set a CRAN mirror
options(repos = "https://cran.r-project.org")
# Install the 'olsrr' package
install.packages("olsrr")
##
## The downloaded binary packages are in
## /var/folders/pk/lzttr4pn5m172f0nsbblvy240000gn/T//Rtmpm2Xrbp/downloaded_packages
# Load the 'olsrr' library
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
# Assuming 'fit1' is already defined
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------------
## Response : Sepal.Length
## Variables: fitted values of Sepal.Length
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.127549
## Prob > Chi2 = 0.04219042
Can reject null hypothesis at p-value= 0.001, meaning that variance is not constant. This means that standard errors are biased. Need to use lmROBUST model for regression analysis.
I used the options (repos code) because I kept getting the following error: “trying to use CRAN without setting a mirror”. After seraching online i found this code and it solved the error I had.
# Install and load the estimatr package
install.packages("estimatr")
##
## The downloaded binary packages are in
## /var/folders/pk/lzttr4pn5m172f0nsbblvy240000gn/T//Rtmpm2Xrbp/downloaded_packages
library(estimatr)
# Fit a robust linear regression model
fit_robust <- lm_robust(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
# Print the summary of the robust regression model
summary(fit_robust)
##
## Call:
## lm_robust(formula = Sepal.Length ~ Sepal.Width + Petal.Length +
## Species, data = iris)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 2.3904 0.24739 9.663 2.337e-17 1.9014 2.8793 145
## Sepal.Width 0.4322 0.07911 5.463 1.984e-07 0.2759 0.5886 145
## Petal.Length 0.7756 0.06799 11.409 6.441e-22 0.6413 0.9100 145
## Speciesversicolor -0.9558 0.21232 -4.502 1.374e-05 -1.3754 -0.5362 145
## Speciesvirginica -1.3941 0.29870 -4.667 6.884e-06 -1.9845 -0.8037 145
##
## Multiple R-squared: 0.8633 , Adjusted R-squared: 0.8595
## F-statistic: 261.4 on 4 and 145 DF, p-value: < 2.2e-16
1.3 CONCLUSION
Multiple coefficient of correlation: 86.33% of variability of sepal length is explained by linear effect of all explanatory variables.
Based on F-statistic I can reject null hypothesis and conclude that a linear relationship has been found. (P-value<0.001)
If sepal width increases by 1 cm, sepal length will on average increase by 0.43 cm (p-value<0.001), assuming all other variables stay the same.
If petal length increases by 1 cm, sepal length will on average increase by 0.78 cm (p-value<0.001), assuming all other variables stay the same.
Given the values of all other explanatory variables, the sepal length of species Versicolor is on avarage 0.96 cm less compared to species Setosa.
Given the values of all other explanatory variables, the sepal length of species Versicolor is on avarage 1.4 cm less compared to species Setosa.