Assignment 6: Statistical Computing

Multiple Linear Regression in R


Contact
Instagram https://www.instagram.com/nbrigittag/
RPubs https://rpubs.com/naftalibrigitta/
Name Naftali Brigitta Gunawan
Student ID (NIM) 20214920002

How to perform a multiple linear regression

The formula for a multiple linear regression is:

\(y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \varepsilon\)

  • \(y\) = the predicted value of the dependent variable

  • \(\beta_0\) = the y-intercept (the value of y when all independent variables are 0)

  • \(\beta_1 X_1\) = the regression coefficient (\(\beta_1\)) of the first independent variable (\(X_1\))

  • \(\dots\) = the same pattern repeated for each additional independent variable you are testing

  • \(\beta_n X_n\) = the regression coefficient (\(\beta_n\)) of the last independent variable (\(X_n\))

  • \(\varepsilon\) = model error (the variation in y that the model does not explain)

Step 1 : Load the data into R

library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

heart <- read.csv("heart.data.csv")
summary(heart)
##        X             biking          smoking        heart.disease    
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535

From summary(heart), we can identify the variables:

  • Dependent variable = heart.disease

  • Independent variables = biking and smoking

  • X = a row index (1 to 498), not used in the model


Step 2 : Make sure your data meet the assumptions

There are four main assumptions for linear regression.

1. Independence of observations (or no autocorrelation)

We can use the cor() function to check that the two independent variables, biking and smoking, are not too highly correlated with each other.

cor(heart$biking, heart$smoking)
## [1] 0.01513618

The correlation between biking and smoking is very small (about 0.015), so we can include both variables in the model.
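With more than two predictors, a correlation matrix and a variance inflation factor (VIF) give a slightly fuller picture than a single cor() call. The following is a minimal sketch on simulated stand-in data (the data frame `sim` and its values are assumptions for illustration, not the real heart data). With only two predictors the VIF is 1 / (1 - r^2), so the reported correlation of 0.015 corresponds to a VIF of roughly 1.0002.

```r
# Simulated stand-in for the two predictors (assumption: not the real heart data)
set.seed(1)
n <- 498
sim <- data.frame(
  biking  = runif(n, 1, 75),
  smoking = runif(n, 0.5, 30)
)

# Pairwise correlations between the predictors
predictor.cor <- cor(sim[, c("biking", "smoking")])
print(round(predictor.cor, 3))

# VIF for biking: 1 / (1 - R^2), where R^2 comes from
# regressing biking on the other predictor(s)
r2.biking  <- summary(lm(biking ~ smoking, data = sim))$r.squared
vif.biking <- 1 / (1 - r2.biking)
print(vif.biking)
```

A VIF close to 1 suggests the predictor shares very little variance with the others.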

2. Normality

hist(heart$heart.disease)

Because the histogram is roughly bell-shaped (more observations in the middle and fewer in the tails), we can move on to the next step.

3. Linearity

We can use plot() to check linearity visually, making one scatter plot of heart.disease against each independent variable (biking and smoking).

plot(heart.disease ~ biking, data = heart)

plot(heart.disease ~ smoking, data = heart)

Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.

4. Homoscedasticity (or homogeneity of variance)

This means that the size of the prediction error does not change significantly across the range of the predictions. We can test this assumption later, after fitting the linear model.


Step 3 : Perform the linear regression analysis

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart)

summary(heart.disease.lm)
## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

The results are:

  • The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.

  • Because all three variables are measured as a percentage of the population, this means that for every 1 percentage-point increase in biking to work, there is an associated 0.2 percentage-point decrease in the incidence of heart disease. Meanwhile, for every 1 percentage-point increase in smoking, there is an associated 0.178 percentage-point increase in the rate of heart disease.

  • The standard errors for these regression coefficients are very small, and the t statistics are very large in magnitude (-147 and 50.4, respectively). The p values reflect these small errors and large t statistics: for both parameters, p < 2e-16, so effects this large would be extremely unlikely if the true coefficients were zero.
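As a worked illustration of how the fitted coefficients combine, take a hypothetical town where 30% of people bike to work and 10% smoke (values chosen only for illustration, not taken from the data). Plugging them into the fitted equation gives a predicted heart disease rate of about 10.76%:

\(\hat{y} = 14.984658 - 0.200133 \times \text{biking} + 0.178334 \times \text{smoking}\)

\(\hat{y} = 14.984658 - 0.200133 \times 30 + 0.178334 \times 10 = 14.984658 - 6.003990 + 1.783340 \approx 10.76\)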


Step 4 : Check for homoscedasticity

par(mfrow=c(2,2))
plot(heart.disease.lm)

par(mfrow=c(1,1))

The residuals show almost no pattern (the line through them is nearly straight). Based on these residuals, we can say that our model meets the assumption of homoscedasticity.
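A visual check can be complemented by a formal test. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R, on simulated stand-in data (the data frame `sim`, its coefficients, and the error standard deviation are assumptions for illustration, not the real heart data). The lmtest package's bptest() offers a packaged version of this kind of test.

```r
# Simulated stand-in data with constant error variance
# (assumption: not the real heart data)
set.seed(42)
n <- 498
sim <- data.frame(
  biking  = runif(n, 1, 75),
  smoking = runif(n, 0.5, 30)
)
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking +
  rnorm(n, sd = 0.65)

sim.lm <- lm(heart.disease ~ biking + smoking, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$res.sq <- residuals(sim.lm)^2
aux.lm <- lm(res.sq ~ biking + smoking, data = sim)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared against a chi-squared distribution with 2 df (one per predictor)
bp.stat <- n * summary(aux.lm)$r.squared
bp.p    <- pchisq(bp.stat, df = 2, lower.tail = FALSE)
print(c(statistic = bp.stat, p.value = bp.p))
```

A large p value here would mean there is no evidence against constant error variance. A small one would suggest heteroscedasticity.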


Step 5 : Visualize the results with a graph

1. Create a new dataframe with the information needed to plot the model

We can use expand.grid() to create a data frame containing every combination of the parameter values we supply.

plotting.data <- expand.grid(
  biking = seq(min(heart$biking), max(heart$biking), length.out = 30),
  smoking = c(min(heart$smoking), mean(heart$smoking), max(heart$smoking))
)
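To see what expand.grid() produces, here is a small sketch with hypothetical toy values (4 biking values crossed with 3 smoking values, not the real data). The same logic gives 30 x 3 = 90 rows for the grid above.

```r
# Toy grid: 4 biking values crossed with 3 smoking values (hypothetical numbers)
toy.grid <- expand.grid(
  biking  = seq(0, 75, length.out = 4),
  smoking = c(1, 15, 30)
)

# One row per combination: 4 * 3 = 12 rows
print(nrow(toy.grid))

# The first argument (biking) varies fastest
print(head(toy.grid, 5))
```

Because each smoking level is paired with the full range of biking values, the predictions later form one line per smoking level.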

2. Predict the values of heart disease based on your linear model

We will save our predicted y values as a new column in the data frame we just created.

plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)

3. Round the smoking numbers to two decimals

This will make the legend easier to read later on.

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)

4. Change the ‘smoking’ variable into a factor

This allows us to plot the relationship between biking and heart disease at each of the three levels of smoking we chose.

plotting.data$smoking <- as.factor(plotting.data$smoking)

5. Plot the original data

heart.plot <- ggplot(heart, aes(x = biking, y = heart.disease)) +
  geom_point()

heart.plot

6. Add the regression lines

heart.plot <- heart.plot +
  geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking), size = 1.25)

heart.plot

7. Add a theme, title, and axis labels

heart.plot <-
heart.plot +
  theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
      x = "Biking to work (% of population)",
      y = "Heart disease (% of population)",
      color = "Smoking \n (% of population)")

heart.plot


Step 6 : Report your results

In our survey of 498 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease, and between the frequency of smoking and the frequency of heart disease (p < 0.001 for both).

Specifically, we found a 0.2 percentage-point decrease (± 0.0014) in the frequency of heart disease for every 1 percentage-point increase in biking, and a 0.178 percentage-point increase (± 0.0035) in the frequency of heart disease for every 1 percentage-point increase in smoking, where the ± values are standard errors.
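Reports often give 95% confidence intervals in addition to standard errors. The sketch below derives them from the estimates, standard errors, and residual degrees of freedom printed in the model summary (the variable names `est`, `se`, and `ci` are introduced here for illustration). With the fitted model object available, confint(heart.disease.lm) should give essentially the same intervals directly.

```r
# Estimates and standard errors copied from the summary(heart.disease.lm) output
est <- c(biking = -0.200133, smoking = 0.178334)
se  <- c(biking = 0.001366, smoking = 0.003539)
df.resid <- 495  # residual degrees of freedom from the same output

# Critical t value for a two-sided 95% interval
t.crit <- qt(0.975, df = df.resid)

# Interval = estimate +/- t.crit * standard error
ci <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
print(round(ci, 3))
```

This gives roughly -0.203 to -0.197 for biking and 0.171 to 0.185 for smoking.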