Assignment R - Multiple regression and diagnostic plots

Load the data & Libraries

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.1

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.2.1

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.2.1

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

heartData <- read.table("heartdata.csv", sep = ",", header = TRUE)
heartDisease <- heartData$heart.disease
Biking <- heartData$biking
Smoking <- heartData$smoking

Colinearity

Using the cor() function to test the relationship between independent variables and make sure they aren’t too highly correlated.

cor(Biking, Smoking)

## [1] 0.01513618

The value is far from 1 and shows low correlation between smoking and biking.

Normality

A normal distribution curve is fit to the histogram.

hist(heartDisease, density = 20, breaks = 20, prob = TRUE, 
     xlab = "Percentage of people with heart disease", 
     ylab = "Density for percentage of people with heart disease", 
     main = "(%) Heart Disease in 500 counties",
     col = "black")
curve(dnorm(x, mean = mean(heartDisease), sd = sd(heartDisease)), 
      col = "darkblue", lwd = 3, add = TRUE, yaxt = "n")

Linearity

Scatter plot for biking and heart disease

ggplot(heartData, aes(x = biking, y = heartDisease)) + 
  geom_point() + 
  geom_smooth(method = lm,  linetype = "dashed",
              color = "darkred", fill = "blue")

## `geom_smooth()` using formula 'y ~ x'

Scatter plot for smoking and heart disease

ggplot(heartData, aes(x = smoking, y = heartDisease)) + 
  geom_point() + 
  geom_smooth(method = lm,  linetype = "dashed",
              color = "darkred", fill = "blue")

## `geom_smooth()` using formula 'y ~ x'

Diagnostic plots

Creating the multi-regression model

model.1 <- lm( heart.disease ~ biking + smoking, data = heartData)
summary(model.1)

## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heartData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model.1)

par(mfrow=c(1,1))

The p-values for coefficients are quit low. The samples on Q-Q plot are mostly on diagonal and it is showing a good fit for normal distribution of the samples. The other residual plots also show an even distribution of residuals around zero. Therefore, it seems that the linear model is a good model in this case.

creating the plotting data

plotting.data <- expand.grid(
  biking = seq(min(heartData$biking), max(heartData$biking), length.out = 30),
  smoking = c(min(heartData$smoking), mean(heartData$smoking), max(heartData$smoking)))

plotting.data$predicted.y <- predict(model.1, newdata = plotting.data)

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
plotting.data$smoking <- as.factor(plotting.data$smoking)

Plotting the data

heart.plot <- ggplot(heartData, aes(x = biking, y = heart.disease)) + 
  geom_point() + 
  geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking), size = 1.25)

heart.plot <-
  heart.plot +
  theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n 
       as a function of biking to work and smoking",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")

heart.plot

Confidence and Prediction band

Confidence Intervals are estimates that are calculated from sample data to determine ranges likely to contain the population parameter(mean, standard deviation)of interest. Confidence interval pertains to a statistic estimated from multiple values, in this case — the regression coefficient. It expresses sampling uncertainty, which comes from the fact that our data is just a random sample of the population we try to model. It can be interpreted as follows: if we had collected many other data sets on heart disease in 500 counties and had fit such a model to each of them, in 95% of the cases the true population coefficient (which we would know should we have data on all cases) would fall within the confidence interval.

Prediction intervals tell you where you can expect to see the next data point sampled. Assume that the data really are randomly sampled from a Gaussian distribution. Collect a sample of data and calculate a prediction interval. Then sample one more value from the population. If you do this many times, you’d expect that next value to lie within that prediction interval in 95% of the samples.The key point is that the prediction interval tells you about the distribution of values, not the uncertainty in determining the population mean.

Prediction intervals must account for both the uncertainty in knowing the value of the population mean, plus data scatter. So a prediction interval is always wider than a confidence interval.

We firs prepare the plotting data, adding the confidence y and prediction y.

plotting.data2 <- expand.grid(
  biking = seq(min(heartData$biking), max(heartData$biking), length.out = 30),
  smoking = c(mean(heartData$smoking)))

plotting.data2$confidence.y <- predict(model.1, newdata = plotting.data2, 
                                       interval = 'confidence')
plotting.data2$predicted.y <- predict(model.1, newdata = plotting.data2,
                                      interval = 'prediction')

plotting.data2$smoking <- round(plotting.data2$smoking, digits = 2)
plotting.data2$smoking <- as.factor(plotting.data2$smoking)

heart.plot2 <- ggplot() + 
  geom_point(data = heartData, aes(x = biking, y = heart.disease)) +
  geom_line(data = plotting.data2, aes(x = biking, y = confidence.y[,"fit"]), 
            color = "red", size = 1.25) +
  geom_ribbon(data = plotting.data2, 
              aes(x = biking, ymin = confidence.y[,"lwr"], 
                  ymax = confidence.y[,"upr"]), color = "green")+
  geom_line(data = plotting.data2, 
            aes(x = biking, y = predicted.y[,"lwr"]), color = "blue", size = 1.25) +
  geom_line(data = plotting.data2, 
            aes(x = biking, y = predicted.y[,"upr"]), color = "blue", size = 1.25) 

heart.plot2 <- heart.plot2 +
  theme_bw() +
  labs(title = "Prediction band for Rates of heart disease (% of population) \n 
       as a function of biking to work for average smoking rate(15.44%)",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")

heart.plot2

Confidence band is so small and is hardly visible in this figure.

Homoskedasticity

Goldfeld-Quandt test to determine if heteroscedasticity is present in a regression model.

The Goldfeld-Quandt test works by removing some number of observations located in the center of the dataset, then testing to see if the spread of residuals is different from the resulting two datasets that are on either side of the central observations.

Null (H0): Homoscedasticity is present. Alternative (HA): Heteroscedasticity is present.

gqtest(model.1, order.by = heartData$heart.disease, data = heartData, fraction = 100)

## 
##  Goldfeld-Quandt test
## 
## data:  model.1
## GQ = 1.2347, df1 = 196, df2 = 196, p-value = 0.07047
## alternative hypothesis: variance increases from segment 1 to 2

Since the p-value is not less than 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to say that heteroscedasticity is present in the regression model.