Multiple Linear
| Contact | : \(\downarrow\) |
| Email | dhelaagatha@gmail.com |
| Instagram | https://www.instagram.com/dhelaagatha/ |
| RPubs | https://rpubs.com/dhelaasafiani/ |
| Name | Dhela Agatha |
| Student ID (NIM) | 20214920009 |
| Study program | Statistics |
Load Data and Library
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
heart = read.csv("heart.data.csv")
summary(heart)
## X biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
Dependent variable (Y) = heart.disease
Independent variables:
- X1 = biking
- X2 = smoking
Assumption Testing on The Data Used
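Independence of Predictors (Multicollinearity Check)
cor(heart$biking, heart$smoking)
## [1] 0.01513618
Because the correlation between the independent variables is very low (about 0.015, or 1.5%), multicollinearity is not a concern and the regression estimates should not be noticeably distorted.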
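One way to put a number on how little the two predictors overlap is the variance inflation factor (VIF). The sketch below is only an illustration: it assumes the correlation printed by the cor() call above and relies on the fact that, with exactly two predictors, both share the same VIF of 1 / (1 - r^2). The variable names are not part of the original analysis.

```r
# Pairwise correlation between biking and smoking, copied from the cor() output above
r <- 0.01513618

# With exactly two predictors, regressing one on the other gives R^2 = r^2,
# so both predictors have VIF = 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

print(round(vif, 6))
```

A VIF this close to 1 means the standard errors of the coefficients are essentially not inflated by overlap between biking and smoking.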
Normality test
Test whether the data used are normally distributed.
hist(heart$heart.disease)
shapiro.test(heart$heart.disease)
##
## Shapiro-Wilk normality test
##
## data: heart$heart.disease
## W = 0.98047, p-value = 3.158e-06
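The Shapiro-Wilk p-value (3.158e-06) is below 0.05, so the test formally rejects exact normality. With 498 observations, however, the test detects even small departures, and the histogram is roughly bell-shaped, so the data are treated as approximately normal for this analysis.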
Linearity
The dependent and independent variables must have a clear linear relationship.
plot(heart.disease ~ biking, data=heart)
The plot shows a strong linear relationship between heart.disease and biking: the more frequent the biking, the lower the rate of heart disease.
plot(heart.disease ~ smoking, data=heart)
The linear relationship between smoking and heart.disease is fainter than for biking, but the trend is still visible: the more frequent the smoking, the higher the rate of heart disease.
Homogeneity of Variance
Homogeneity of variance will be checked after the model has been fitted, to confirm that the prediction error does not change substantially across the range of predicted values.
Linear Model
lmheart <- lm(heart.disease ~ biking + smoking , data = heart)
summary(lmheart)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
The summary above gives the estimated coefficients of the model relating heart disease to biking and smoking across the 498 observations. Written as an equation:
\[ Y = 14.985 - 0.200 X_1 + 0.178 X_2 \]
Because the p-values are below 0.05, the model is statistically significant, and it explains about 98% of the variance in heart disease (R-squared = 0.9796).
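The adjusted R-squared and the F-statistic in the summary can be cross-checked from the printed R-squared alone. This is a rough sketch that assumes the rounded values shown in the output (R-squared of 0.9796, 495 residual degrees of freedom, two predictors); the variable names are illustrative.

```r
# Values copied from the summary(lmheart) output above
r_squared <- 0.9796   # Multiple R-squared (rounded as printed)
n <- 498              # observations: 495 residual df + 3 estimated coefficients
p <- 2                # predictors: biking and smoking

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(round(adj_r_squared, 4))
print(signif(f_stat, 3))
```

Both values agree with the printed summary (0.9795 and roughly 1.19e+04) up to rounding of the inputs.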
The negative regression coefficient for biking indicates that more biking is associated with a lower rate of heart disease. Conversely, the positive coefficient for smoking indicates that more smoking is associated with a higher rate of heart disease.
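To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below assumes the coefficient estimates printed in the model summary and a hypothetical town with 30% biking and 10% smoking; those two input values are illustrative and are not a row of the dataset.

```r
# Coefficients copied from the summary(lmheart) output above
b0 <- 14.984658
b_biking <- -0.200133
b_smoking <- 0.178334

# Hypothetical town (illustrative values, not taken from the data)
biking <- 30
smoking <- 10

# Predicted rate of heart disease from the additive model
predicted <- b0 + b_biking * biking + b_smoking * smoking
print(predicted)

# One more percentage point of biking, with smoking held fixed
predicted_more_biking <- b0 + b_biking * (biking + 1) + b_smoking * smoking
print(predicted_more_biking - predicted)
```

The difference between the two predictions equals the biking coefficient, which is exactly what "holding smoking fixed" means in a model without an interaction term.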
Homogeneity of Variance
par(mfrow=c(2,2))
plot(lmheart)
par(mfrow=c(1,1))
Graph Visualization
Create a new dataframe with the information needed to plot the model
Use the function expand.grid() to create a dataframe with the parameters you supply. Within this function we will:
- Create a sequence from the lowest to the highest value of your observed biking data.
- Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.
plotting.data<-expand.grid(
biking = seq(min(heart$biking), max(heart$biking), length.out=30),
smoking=c(min(heart$smoking), mean(heart$smoking), max(heart$smoking)))
Predict the values of heart disease based on your linear model
Next we will save our ‘predicted y’ values as a new column in the dataset we just created.
plotting.data$predicted.y <- predict.lm(lmheart, newdata=plotting.data)
Change the ‘smoking’ variable into a factor
This allows us to plot the relationship between biking and heart disease at each of the three levels of smoking we chose.
plotting.data$smoking <- as.factor(plotting.data$smoking)
Graph Finalization
Original Plot
heart.plot <- ggplot(heart, aes(x=biking, y=heart.disease)) +
geom_point()
heart.plot
Added Graph of 3 Levels of Smoking Intensity
(Think of them as rarely, fairly often, and very often)
heart.plot <- heart.plot +
geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
heart.plot
Final Plot
heart.plot <-
heart.plot +
theme_bw() +
labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
x = "Biking to work (% of population)",
y = "Heart disease (% of population)",
color = "Smoking \n (% of population)")
heart.plot
Results
From the graph and the fitted model, we can conclude that there is a significant relationship between the rate of heart disease and the intensity of biking and smoking.
Based on the model, each 1 percentage-point increase in biking is associated with a 0.2 percentage-point decrease in the rate of heart disease, and each 1 percentage-point increase in smoking is associated with a 0.178 percentage-point increase in the rate of heart disease.
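The three smoking lines in the final plot can be approximated without the raw data. This is a rough sketch that assumes the coefficients printed in the model summary and the minimum, mean, and maximum values printed by summary(heart); the names grid and endpoints are illustrative, and the manual formula stands in for predict() only because the model is purely additive.

```r
# Coefficients copied from the summary(lmheart) output (rounded as printed)
b0 <- 14.984658
b_biking <- -0.200133
b_smoking <- 0.178334

# Ranges copied from the summary(heart) output (rounded as printed)
grid <- expand.grid(
  biking = seq(1.119, 74.907, length.out = 30),
  smoking = c(0.5259, 15.4350, 29.9467)
)

# Manual equivalent of predict() for an additive linear model
grid$predicted.y <- b0 + b_biking * grid$biking + b_smoking * grid$smoking

# Predictions at the lowest and highest biking values for each smoking level
endpoints <- grid[grid$biking %in% range(grid$biking), ]
print(endpoints)
```

Because the model has no interaction term, the three lines are parallel: they share the biking slope and differ only by the smoking coefficient times the gap between smoking levels.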
Moral of the story: don't smoke, and keep biking.