Objective Overview

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data Description: The heart and body weights of samples of male and female cats used for digitalis experiments. The cats were all adults, over 2 kg body weight.

This data frame contains the following columns:

Sex: sex, a factor with levels “F” and “M”.

Bwt: body weight in kg.

Hwt: heart weight in g.

Import Data and Overview

library(MASS)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
data <- MASS::cats
data <- data[, -1] # drop the Sex column, keeping only Bwt and Hwt

Data Overview

str(data)
## 'data.frame':    144 obs. of  2 variables:
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
head(data)
##   Bwt Hwt
## 1 2.0 7.0
## 2 2.0 7.4
## 3 2.0 9.5
## 4 2.1 7.2
## 5 2.1 7.3
## 6 2.1 7.6

As the str() output shows, after dropping the Sex column the data frame has 144 observations of 2 numeric variables: Bwt (body weight in kg) and Hwt (heart weight in g).

Scatter plot

ggplot(data, aes(x = Bwt, y = Hwt))+geom_point()+ggtitle("Association between body weight and heart weight")

Correlation

cor(data$Bwt,data$Hwt)
## [1] 0.8041274
corrplot(cor(data))
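
The point estimate from cor() carries no measure of uncertainty on its own. As a minimal sketch, assuming the same two columns of MASS::cats, cor.test() from base R reports a confidence interval and a p-value for the Pearson correlation. The names cats_num and ct are illustrative and not part of the original analysis.

```r
library(MASS)

# Keep only the two numeric columns used in this analysis
cats_num <- MASS::cats[, c("Bwt", "Hwt")]

# Pearson correlation test between body weight and heart weight
ct <- cor.test(cats_num$Bwt, cats_num$Hwt, method = "pearson")

ct$estimate  # sample correlation, matches cor(data$Bwt, data$Hwt)
ct$conf.int  # 95% confidence interval for the correlation
ct$p.value   # p-value for the null hypothesis of zero correlation
```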

Simple Linear Regression

mod<-lm(Hwt ~ Bwt, data = data)
summary(mod)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

As the summary of the linear regression model shows,

slope = 4.0341

intercept = -0.3567

Therefore, the fitted regression line is Hwt = -0.3567 + 4.0341 × Bwt, with Hwt in g and Bwt in kg.
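
To make the fitted line concrete, the sketch below refits the same model and does two things: it checks that the slope equals the correlation rescaled by the ratio of standard deviations (which is why a correlation of about 0.80 and a slope of about 4.03 g per kg describe the same relationship), and it predicts heart weight for a few body weights. The body weights 2.5, 3.0 and 3.5 kg and the names cats_num, fit, slope_from_r, new_cats and pred are illustrative assumptions.

```r
library(MASS)

cats_num <- MASS::cats[, c("Bwt", "Hwt")]
fit <- lm(Hwt ~ Bwt, data = cats_num)

# In simple linear regression, slope = r * sd(y) / sd(x)
slope_from_r <- cor(cats_num$Bwt, cats_num$Hwt) *
  sd(cats_num$Hwt) / sd(cats_num$Bwt)
all.equal(unname(coef(fit)["Bwt"]), slope_from_r)

# Predicted mean heart weight (g) with 95% confidence intervals
new_cats <- data.frame(Bwt = c(2.5, 3.0, 3.5))
pred <- predict(fit, newdata = new_cats, interval = "confidence")
cbind(new_cats, pred)
```

For a 3 kg cat, the fitted line gives roughly -0.3567 + 4.0341 × 3 ≈ 11.7 g.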

Plots

plot(data, xlab = "Body Weight", ylab = "Heart Weight")
abline(mod) # plot the model

ols_plot_resid_hist(mod) # histogram of residuals, to check normality

ols_plot_resid_fit(mod) # residuals vs. fitted values, to check homoscedasticity

ols_plot_cooksd_chart(mod) # Cook's distance, for identifying influential observations
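
A histogram can hide departures from normality in the tails, so a normal Q-Q plot is a common complement. The sketch below uses only base R and refits the same model; the names cats_num, fit, res and qq are illustrative.

```r
library(MASS)

cats_num <- MASS::cats[, c("Bwt", "Hwt")]
fit <- lm(Hwt ~ Bwt, data = cats_num)
res <- residuals(fit)

# Normal Q-Q plot: points close to the reference line suggest
# approximately normal residuals
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
```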

  1. The histogram of the residuals suggests that they are approximately normally distributed.

  2. The p-value for the slope is below 2e-16, which means that heart weight is statistically significantly associated with body weight.

  3. The adjusted R-squared is 0.6441, which means that the model explains about 64% of the variation in heart weight.

  4. Heart weight is strongly correlated with body weight (r = 0.80), and the fitted slope means that an increase of 1 kg in body weight is associated with an average increase of about 4.03 g in heart weight.

  5. The residuals vs. fitted plot from ols_plot_resid_fit() appears consistent with homoscedasticity. This is a visual check rather than a formal test.

  6. The Cook’s distance chart flags 5 potentially influential observations, located at rows 135, 136, 140, 142, and 144.

  7. Even though the model explains only about 65% of the variation in heart weight, the linear model appears appropriate, since it passes the checks of the linear regression assumptions.
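
The conclusions above rest on visual checks. As a hedged sketch of how they could be backed by numbers using base R only, the code below runs a Shapiro-Wilk test on the residuals, computes a studentized (Koenker) Breusch-Pagan statistic by hand as n times the R-squared of an auxiliary regression of the squared residuals on Bwt, and lists the observations whose Cook's distance exceeds 4/n. The 4/n cutoff is a common rule of thumb and an assumption here, as are the names cats_num, fit, res, sw, aux, bp_stat, bp_p, cd and flagged. The formal tests may or may not agree with the visual impression from the plots.

```r
library(MASS)

cats_num <- MASS::cats[, c("Bwt", "Hwt")]
fit <- lm(Hwt ~ Bwt, data = cats_num)
res <- residuals(fit)
n <- nrow(cats_num)

# Normality: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
sw <- shapiro.test(res)
sw$p.value

# Homoscedasticity: studentized Breusch-Pagan statistic computed by hand.
# Regress squared residuals on the predictor; n * R^2 is compared with
# a chi-squared distribution with 1 degree of freedom (one predictor).
cats_num$res_sq <- res^2
aux <- lm(res_sq ~ Bwt, data = cats_num)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p

# Influence: rows whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
flagged
```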