Assignment 6: Statistical Computing
Multiple Linear Regression in R
| Contact | \(\downarrow\) |
| Email | naftaligunawan@gmail.com |
| Instagram | https://www.instagram.com/nbrigittag/ |
| RPubs | https://rpubs.com/naftalibrigitta/ |
| Name | Naftali Brigitta Gunawan |
| Student ID (NIM) | 20214920002 |
How to perform a multiple linear regression
The formula for a multiple linear regression is:
\(y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \varepsilon\)
\(y\) = the predicted value of the dependent variable
\(\beta_0\) = the y-intercept (value of y when all independent variables are set to 0)
\(\beta_1 X_1\) = the regression coefficient (\(\beta_1\)) of the first independent variable (\(X_1\))
\(\dots\) = the same for however many independent variables you are testing
\(\beta_n X_n\) = the regression coefficient (\(\beta_n\)) of the last independent variable (\(X_n\))
\(\varepsilon\) = model error (the variation in y that the model does not explain)
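To make the formula concrete, the following is a minimal sketch on simulated data (not the heart dataset). The variable names `x1`, `x2` and the chosen coefficient values are assumptions made only for illustration. It generates y from known coefficients plus random error and checks that lm() recovers values close to them.

```r
# Simulated data with known coefficients (hypothetical values)
set.seed(42)
n   <- 200
x1  <- runif(n, 0, 10)
x2  <- runif(n, 0, 5)
eps <- rnorm(n, mean = 0, sd = 0.5)    # model error
y   <- 2 + 3 * x1 - 1.5 * x2 + eps     # beta0 = 2, beta1 = 3, beta2 = -1.5

fit <- lm(y ~ x1 + x2)
coef(fit)   # estimates should land close to 2, 3 and -1.5
```

The estimates differ slightly from the true values because of the error term; the differences shrink as n grows.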
Step 1 : Load the data into R
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
heart <- read.csv("heart.data.csv")
summary(heart)
##        X             biking          smoking        heart.disease
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535
The conclusions from summary(heart):
Dependent variable = heart.disease
Independent variables = smoking and biking
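The X column in the summary runs from 1 to 498 and looks like a row index rather than a measured variable. As a small sketch under that assumption, the modelling variables can be separated from it and checked for missing values before fitting. The data frame `toy` below is a made-up stand-in for the heart data, used only so the example runs on its own.

```r
# Toy stand-in for the heart data (values are made up for illustration)
toy <- data.frame(
  X = 1:4,
  biking = c(30.8, 65.1, 1.9, 44.8),
  smoking = c(10.9, 2.2, 17.6, 2.8),
  heart.disease = c(11.8, 2.9, 17.2, 6.8)
)

# Keep only the modelling variables; X is assumed to be a row index
toy.model <- toy[, c("biking", "smoking", "heart.disease")]
str(toy.model)
colSums(is.na(toy.model))   # number of missing values per column
```

Dropping the index is optional here, since the model formula used later names its variables explicitly.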
Step 2 : Make sure your data meet the assumptions
There are four main assumptions for linear regression.
1. Independence of observations (or no autocorrelation)
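We can use the cor() function to check the relationship between the two independent variables, biking and smoking, and make sure they are not too highly correlated.
cor(heart$biking, heart$smoking)
## [1] 0.01513618
The correlation between biking and smoking is only about 0.015, so both variables can be included in the model.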
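A related way to express this check is the variance inflation factor (VIF), which for two predictors is 1 / (1 - r^2). The sketch below uses simulated predictors (an assumption, drawn independently over ranges similar to those in summary(heart)) and then applies the same formula to the correlation reported above.

```r
set.seed(1)
n <- 300
biking  <- runif(n, 1, 75)     # simulated, not the real data
smoking <- runif(n, 0.5, 30)   # simulated independently of biking

r <- cor(biking, smoking)
# With two predictors, the R-squared of one regressed on the other equals r^2
r2  <- summary(lm(biking ~ smoking))$r.squared
vif <- 1 / (1 - r2)
c(correlation = r, vif = vif)

# Same formula applied to the correlation reported for the heart data
vif.heart <- 1 / (1 - 0.01513618^2)
vif.heart   # about 1.0002, i.e. essentially no variance inflation
```

A VIF close to 1 means the predictors carry almost no overlapping information.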
2. Normality
hist(heart$heart.disease)
Because the histogram is roughly bell-shaped (more observations in the middle and fewer in the tails), we can move to the next step.
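Strictly speaking, the normality assumption concerns the model residuals rather than the raw dependent variable, so it can also be checked after fitting. The following is a minimal sketch on simulated data (the variables `x` and `y` are assumptions for illustration) using a normal Q-Q plot and the Shapiro-Wilk test from base R.

```r
set.seed(7)
x <- runif(150, 0, 10)
y <- 1 + 2 * x + rnorm(150)   # simulated data with normal errors
fit <- lm(y ~ x)
res <- residuals(fit)

qqnorm(res)                   # points close to the line suggest normality
qqline(res)
sw <- shapiro.test(res)       # Shapiro-Wilk normality test
sw$p.value                    # a large p value gives no evidence against normality
```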
3. Linearity
We can use plot() to visualize the relationships with scatter plots. We make two scatter plots, one for biking and one for smoking.
plot(heart.disease ~ biking, data = heart)
plot(heart.disease ~ smoking, data = heart)
Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.
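When a scatter plot is hard to judge by eye, one option is to overlay a smoother and compare it with a straight-line fit. This is only a sketch on simulated data; the variables `x` and `y` and their coefficients are assumptions, not values from the heart dataset.

```r
set.seed(3)
x <- runif(200, 1, 75)
y <- 15 - 0.2 * x + rnorm(200, sd = 1.5)   # simulated linear relationship

plot(y ~ x)
sm <- lowess(x, y)                 # locally weighted smoother from the stats package
lines(sm, col = "red", lwd = 2)
abline(lm(y ~ x), lty = 2)         # straight-line fit for comparison
```

If the smoother stays close to the dashed line across the range of x, a linear relationship is a reasonable description.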
4. Homoscedasticity (or homogeneity of variance)
This means that the size of the prediction error doesn’t change significantly across the range of predicted values. We can test this assumption later, after fitting the linear model.
Step 3 : Perform the linear regression analysis
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart)
summary(heart.disease.lm)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.1789 -0.4463  0.0362  0.4422  1.9331
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
The results are:
The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.
This means that for every 1 percentage point increase in biking to work, there is an associated 0.2 percentage point decrease in the incidence of heart disease, holding smoking constant. Meanwhile, for every 1 percentage point increase in smoking, there is an associated 0.178 percentage point increase in the rate of heart disease, holding biking constant.
The standard errors for these regression coefficients are very small, and the t statistics are very large (-147 and 50.4, respectively). The p values reflect these small errors and large t statistics: both are below 2e-16, so estimates this far from zero would be extremely unlikely if the true effects were zero.
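As a worked example of reading the coefficient table, the sketch below plugs the reported estimates into the fitted equation for a hypothetical town (30% biking, 10% smoking; these two input values are assumptions chosen for illustration) and recomputes the t statistics as estimate divided by standard error.

```r
# Coefficients and standard errors copied from the summary output
b0 <- 14.984658
b.biking  <- -0.200133
b.smoking <-  0.178334
se.biking  <- 0.001366
se.smoking <- 0.003539

# Predicted heart disease (% of population) for the hypothetical town
pred <- b0 + b.biking * 30 + b.smoking * 10
pred   # about 10.76

# t statistic = estimate / standard error
t.biking  <- b.biking / se.biking
t.smoking <- b.smoking / se.smoking
c(biking = t.biking, smoking = t.smoking)
```

The recomputed t statistics agree with the printed ones (-146.53 and 50.39) up to rounding of the displayed standard errors.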
Step 4 : Check for homoscedasticity
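par(mfrow=c(2,2))   # show the four diagnostic plots in a 2 x 2 grid
plot(heart.disease.lm)
par(mfrow=c(1,1))   # restore the default single-plot layout
The residuals follow the reference lines in the diagnostic plots almost perfectly. Based on these residuals, we can say that our model meets the assumption of homoscedasticity.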
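The visual check can be complemented with a numeric one. The sketch below hand-computes a Breusch-Pagan-style statistic in its studentized (Koenker) form using base R only: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-square distribution. The data are simulated with constant error variance, and the names `x1`, `x2`, `y` are assumptions for illustration.

```r
set.seed(5)
n  <- 400
x1 <- runif(n, 1, 75)
x2 <- runif(n, 0.5, 30)
y  <- 15 - 0.2 * x1 + 0.18 * x2 + rnorm(n, sd = 0.65)   # constant error variance
fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
e2   <- residuals(fit)^2
aux  <- lm(e2 ~ x1 + x2)
stat <- n * summary(aux)$r.squared                   # LM test statistic
p.value <- pchisq(stat, df = 2, lower.tail = FALSE)  # df = number of predictors
c(statistic = stat, p.value = p.value)
```

A small p value would suggest that the error variance changes with the predictors; a large one is consistent with homoscedasticity.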
Step 5 : Visualize the results with a graph
1. Create a new dataframe with the information needed to plot the model
We can use expand.grid() to create a dataframe with the parameters we supply.
plotting.data <- expand.grid(
  biking = seq(min(heart$biking), max(heart$biking), length.out = 30),
  smoking = c(min(heart$smoking), mean(heart$smoking), max(heart$smoking)))
2. Predict the values of heart disease based on your linear model
We will save our predicted y values as a new column in the dataset we just created.
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata = plotting.data)
3. Round the smoking numbers to two decimals
This will make the legend easier to read later on.
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
4. Change the ‘smoking’ variable into a factor
This allows us to plot the relationship between biking and heart disease at each of the three levels of smoking we chose.
plotting.data$smoking <- as.factor(plotting.data$smoking)
5. Plot the original data
heart.plot <- ggplot(heart, aes(x = biking, y = heart.disease)) +
  geom_point()
heart.plot
6. Add the regression lines
heart.plot <- heart.plot +
  geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking), size = 1.25)
heart.plot
7. Make the graph ready for publication
heart.plot <-
heart.plot +
theme_bw() +
labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
x = "Biking to work (% of population)",
y = "Heart disease (% of population)",
color = "Smoking \n (% of population)")
heart.plot
Step 6 : Report your results
In our survey of 498 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease, and between the frequency of smoking and the frequency of heart disease (p < 0.001 for both).
Specifically, we found a 0.2 percentage point decrease (± 0.0014) in the frequency of heart disease for every 1 percentage point increase in biking, and a 0.178 percentage point increase (± 0.0035) in the frequency of heart disease for every 1 percentage point increase in smoking.
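The ± values reported here are standard errors. If confidence intervals are preferred for reporting, they can be pulled from a fitted model with confint(). The sketch below shows one way to assemble estimates, standard errors and 95% intervals into a single table. It runs on simulated data whose coefficients only resemble the fitted model (an assumption made so the example is self-contained), so the printed numbers are not the results of this analysis.

```r
set.seed(11)
n <- 498
biking  <- runif(n, 1, 75)
smoking <- runif(n, 0.5, 30)
# Simulated outcome with coefficients similar to the fitted model (assumption)
heart.disease <- 15 - 0.2 * biking + 0.18 * smoking + rnorm(n, sd = 0.65)
sim <- data.frame(biking, smoking, heart.disease)

fit <- lm(heart.disease ~ biking + smoking, data = sim)

# Estimates, standard errors and 95% confidence intervals in one table
report <- cbind(coef(summary(fit))[, c("Estimate", "Std. Error")],
                confint(fit, level = 0.95))
round(report, 4)
```

Running the same two calls on heart.disease.lm would give the corresponding table for the real data.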