Email         : ferdinand.widjaya@student.matanauniversity.ac.id
RPubs       : https://rpubs.com/ferdnw/
Address : ARA Center, Matana University Tower
   Jl. CBD Barat Kav, RT.1, Curug Sangereng, Kelapa Dua, Tangerang, Banten 15810.

Load Data and Library

library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

heart <- read.csv("heart.data.csv")
summary(heart)

##        X             biking          smoking        heart.disease    
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535

Dependent Variable (Y) = heart.disease
Independent Variables:
- X1 = biking
- X2 = smoking

Assumption Testing on The Data Used

Multicollinearity Test

cor(heart$biking, heart$smoking)

## [1] 0.01513618

Because the correlation between the independent variables is very low (about 0.015, or 1.5%), both can be included in the model and the regression estimates should not be noticeably distorted by multicollinearity.
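As a complementary check, the variance inflation factor (VIF) can be derived from this correlation. The sketch below assumes the standard two-predictor identity VIF = 1 / (1 - r^2) and plugs in the value reported by cor() above; the names `r` and `vif` are illustrative and not part of the original analysis.

```r
# Correlation between biking and smoking, copied from the cor() output above
r <- 0.01513618

# With exactly two predictors, both share the same VIF: 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
print(round(vif, 4))
```

A VIF this close to 1 indicates essentially no variance inflation; values above roughly 5 to 10 are commonly treated as a warning sign.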

Normality test

Test whether the data used are normally distributed.

hist(heart$heart.disease)

shapiro.test(heart$heart.disease)

## 
##  Shapiro-Wilk normality test
## 
## data:  heart$heart.disease
## W = 0.98047, p-value = 3.158e-06

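The p-value is below 0.05, so the Shapiro-Wilk test formally rejects normality. However, with 498 observations the test is sensitive to even small deviations, W is close to 1, and the histogram is roughly bell-shaped, so the data are treated as approximately normal.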

Linearity

The dependent and independent variables must have a clear linear relationship.

plot(heart.disease ~ biking, data=heart)

The plot shows a strong linear relationship between heart.disease and biking: the more people bike, the lower the rate of heart disease.

plot(heart.disease ~ smoking, data=heart)

The relationship between smoking and heart.disease is much fainter than for biking, but it still looks roughly linear: the more people smoke, the higher the rate of heart disease.

Homogeneity of Variance

Homogeneity of variance (homoscedasticity) will be checked after the model has been fitted, to confirm that the prediction error stays roughly constant across the range of predicted values.

Linear Model

lmheart <- lm(heart.disease ~ biking + smoking , data = heart)

summary(lmheart)

## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16

Using the estimated coefficients, the fitted regression equation is:

\[ Y = 14.985 - 0.200 X_1 + 0.178 X_2 \]

Because the p-values of the F-test and of both coefficients are below 0.05, the linear model is statistically significant, and it explains about 98% of the variance in heart.disease (R-squared = 0.9796).
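As a quick consistency check, the adjusted R-squared in the summary can be reproduced by hand from the multiple R-squared. The calculation assumes n = 498 observations (495 residual degrees of freedom plus 3 estimated coefficients) and p = 2 predictors:

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.9796)\,\frac{497}{495} \approx 0.9795 \]

The two values are nearly identical because the sample is large relative to the number of predictors.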

The signs of the regression coefficients show the direction of each effect: the coefficient for biking is negative, so more biking is associated with a lower rate of heart disease, while the coefficient for smoking is positive, so more smoking is associated with a higher rate.
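To make the coefficients concrete, the sketch below evaluates the fitted equation by hand for one hypothetical town. The input values (30% biking, 10% smoking) are made up for illustration; the coefficients are copied from the summary(lmheart) output.

```r
# Coefficients copied from the summary(lmheart) output above
b0 <- 14.984658   # intercept
b1 <- -0.200133   # biking
b2 <-  0.178334   # smoking

# Hypothetical inputs (not taken from the data set)
biking  <- 30     # % of population biking to work
smoking <- 10     # % of population smoking

# Predicted heart disease rate (% of population)
pred <- b0 + b1 * biking + b2 * smoking
print(round(pred, 3))
```

This gives a predicted rate of about 10.76%. With the fitted model in memory, predict(lmheart, newdata = data.frame(biking = 30, smoking = 10)) should agree up to the rounding of the printed coefficients.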

Homogeneity of Variance

par(mfrow=c(2,2))
plot(lmheart)

par(mfrow=c(1,1))
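The diagnostic plots can be supplemented with numeric checks on the residuals. The sketch below is only an illustration under stated assumptions: so that it runs without heart.data.csv, it simulates a stand-in data set (`sim`) with coefficients similar to those in the summary and fits `fit` to it. The heteroscedasticity check is a hand-rolled auxiliary regression in the spirit of the studentized Breusch-Pagan test, not a call to a packaged test.

```r
set.seed(1)

# Simulated stand-in for heart.data.csv (NOT the real data)
n   <- 498
sim <- data.frame(biking  = runif(n, 1, 75),
                  smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.178 * sim$smoking +
  rnorm(n, sd = 0.65)

fit <- lm(heart.disease ~ biking + smoking, data = sim)
res <- residuals(fit)

# Normality of the residuals
sw <- shapiro.test(res)

# Constant variance: regress the squared residuals on the predictors.
# n * R^2 of this auxiliary regression is compared with a chi-squared
# distribution whose degrees of freedom equal the number of predictors (2).
sim$res2 <- res^2
aux      <- lm(res2 ~ biking + smoking, data = sim)
bp_stat  <- n * summary(aux)$r.squared
bp_p     <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_p = bp_p))
```

To run the same checks on the real model, replace `sim` with `heart` and `fit` with `lmheart`. Large p-values in both checks are consistent with normal, homoscedastic residuals; small ones suggest taking a closer look at the diagnostic plots.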

Graph Visualization

Create a new dataframe with the information needed to plot the model

Use the function expand.grid() to create a dataframe with the parameters you supply. Within this function we will:

- Create a sequence from the lowest to the highest value of your observed biking data.
- Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.

plotting.data <- expand.grid(
  biking = seq(min(heart$biking), max(heart$biking), length.out = 30),
  smoking = c(min(heart$smoking), mean(heart$smoking), max(heart$smoking)))
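To see what expand.grid() produces, here is a small self-contained sketch that uses the minimum, mean, and maximum values printed in the summary(heart) output instead of reading the CSV file. The name `grid.demo` is illustrative.

```r
# Ranges copied from the summary(heart) output above
grid.demo <- expand.grid(
  biking  = seq(1.119, 74.907, length.out = 30),  # 30 evenly spaced values
  smoking = c(0.5259, 15.4350, 29.9467)           # min, mean, max
)

# Every biking value is paired with every smoking level: 30 * 3 = 90 rows
print(dim(grid.demo))
print(head(grid.demo, 3))
```

Each of the 90 rows is one (biking, smoking) combination for which the model will later predict a heart disease rate.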

Predict the values of heart disease based on your linear model

Next we will save our ‘predicted y’ values as a new column in the dataset we just created.

plotting.data$predicted.y <- predict(lmheart, newdata = plotting.data)

Change the ‘smoking’ variable into a factor

This allows us to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.

plotting.data$smoking <- as.factor(plotting.data$smoking)
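One practical caveat: converting unrounded numbers to a factor produces long decimal labels in the plot legend. A possible refinement, not part of the original code, is to round before converting. The sketch uses the smoking minimum, mean, and maximum from the summary(heart) output; `smoking.levels` and `smoking.factor` are illustrative names.

```r
# Smoking min, mean, and max copied from the summary(heart) output above
smoking.levels <- c(0.5259, 15.4350, 29.9467)

# Round first so the factor labels (and the legend) stay short
smoking.factor <- as.factor(round(smoking.levels, digits = 1))
print(levels(smoking.factor))
```

In the analysis itself the rounding would be applied to plotting.data$smoking after the predictions have been computed, so the predicted values still come from the unrounded smoking levels.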

Graph Finalization

Original Plot

heart.plot <- ggplot(heart, aes(x=biking, y=heart.disease)) +
  geom_point()

heart.plot

Add Regression Lines for 3 Levels of Smoking Intensity

(Think of these as rarely, fairly often, and very often.)

heart.plot <- heart.plot +
  geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), linewidth=1.25)

heart.plot

Final Plot

heart.plot <- heart.plot +
  theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")

heart.plot

Results

The graph and the fitted model show a significant relationship between the rate of heart disease and the rates of both biking and smoking.

Based on the model, each 1 percentage point increase in biking is associated with a 0.2 percentage point decrease in the rate of heart disease, and each 1 percentage point increase in smoking is associated with a 0.178 percentage point increase.

Moral of the story: don't smoke, and keep biking.

Multiple Linear Regression

Unit 7

Ferdinand Nathaniel Widjaya (20214920006)

April 10, 2023