Use the iris dataset to build a regression model in R and conduct residual analysis

Predict petal length (Petal.Length) based on sepal length (Sepal.Length)

library(ggplot2)


data(iris)


model <- lm(Petal.Length ~ Sepal.Length, data=iris)
summary(model)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
plot(Petal.Length ~ Sepal.Length, data=iris)
abline(model)

par(mfrow=c(2,2))
plot(model)

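# Extract the residuals and plot them from their own data frame,
# rather than relying on ggplot finding `residuals` in the global environment
residuals <- resid(model)
ggplot(data.frame(residuals = residuals), aes(x = residuals)) +
  geom_histogram(binwidth = 0.2, color = 'black', fill = 'skyblue') +
  ggtitle("Histogram of Residuals")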

shapiro.test(residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.99437, p-value = 0.831

The slope is 1.85843: for every one-unit increase in Sepal.Length, Petal.Length is expected to increase by about 1.86 units. Both coefficients are highly significant (p-value < 2e-16), indicating a strong linear relationship between the two variables.
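As a quick check of this interpretation, the sketch below compares predictions at two sepal lengths one unit apart. The values 5 and 6 and the name `new_flowers` are arbitrary choices for illustration and are not part of the original analysis.

```r
# Refit the model so the example is self-contained
data(iris)
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Hypothetical sepal lengths, one unit apart (illustrative values only)
new_flowers <- data.frame(Sepal.Length = c(5, 6))
pred <- predict(model, newdata = new_flowers)

# The difference between the two predictions should equal the fitted slope
slope_from_pred <- unname(diff(pred))
slope_from_fit  <- unname(coef(model)["Sepal.Length"])
all.equal(slope_from_pred, slope_from_fit)
```

Because the model is a straight line, the same difference is obtained for any pair of sepal lengths that are one unit apart.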

The R-squared value is 0.76, meaning the model explains 76% of the variability in petal length, which is quite high.
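To make the meaning of R-squared concrete, here is a minimal sketch that recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The variable names (`ss_res`, `ss_tot`, and so on) are illustrative assumptions. For a simple regression with one predictor, it should also equal the squared correlation between the two variables.

```r
# Refit the model so the example is self-contained
data(iris)
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Residual sum of squares: variation the model does not explain
ss_res <- sum(resid(model)^2)

# Total sum of squares: variation of petal length around its mean
ss_tot <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)

# R-squared computed three ways
r2_manual  <- 1 - ss_res / ss_tot
r2_cor     <- cor(iris$Sepal.Length, iris$Petal.Length)^2
r2_summary <- summary(model)$r.squared

c(manual = r2_manual, correlation = r2_cor, summary = r2_summary)
```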

The residuals in the Q-Q plot follow a straight line closely, and the roughly bell-shaped histogram of the residuals supports the same conclusion.

The Shapiro-Wilk test on the residuals gives a p-value of 0.831. Since this p-value is greater than 0.05, we fail to reject the null hypothesis of normality: there is not enough evidence to say the residuals are not normally distributed. The residuals are therefore consistent with a normal distribution, so the normality assumption of linear regression appears to hold.
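The decision rule above can be written out explicitly. The following is a minimal sketch, assuming a significance level of 0.05 as in the text; the names `res`, `sw`, `alpha`, and `reject_normality` are illustrative. It also draws a stand-alone Q-Q plot of the residuals with base R's `qqnorm()` and `qqline()`.

```r
# Refit the model so the example is self-contained
data(iris)
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
res <- resid(model)

# Stand-alone normal Q-Q plot of the residuals with a reference line
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = "red")

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
alpha <- 0.05  # significance level assumed in the text

# TRUE would mean normality is rejected at the chosen level
reject_normality <- sw$p.value < alpha
reject_normality
```

With the p-value of 0.831 reported above, `reject_normality` comes out `FALSE`. This means only that the test found no evidence against normality, not that normality has been proven.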