Using the 'heart.csv' data, test at the 5% significance level whether
biking and smoking affect the percentage of people with heart disease.
1. Loading the data
Data description:
Observations on the percentage of people biking to work each day, the
percentage of people smoking, and the percentage of people with heart
disease.
library(readr)
heart_data <- read_csv("~/Downloads/heart.csv")
## New names:
## • `` -> `...1`
## Rows: 498 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): ...1, biking, smoking, heart.disease
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heart_data)
## # A tibble: 6 × 4
## ...1 biking smoking heart.disease
## <dbl> <dbl> <dbl> <dbl>
## 1 1 30.8 10.9 11.8
## 2 2 65.1 2.22 2.85
## 3 3 1.96 17.6 17.2
## 4 4 44.8 2.80 6.82
## 5 5 69.4 16.0 4.06
## 6 6 54.4 29.3 9.55
summary(heart_data)
## ...1 biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
| Independent variables | Dependent variable |
|---|---|
| Biking: % of people biking to work each day (numeric, continuous) | Heart Disease: % of people with heart disease (numeric, continuous) |
| Smoking: % of people smoking (numeric, continuous) | |
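# Bin each continuous predictor into three equal-width groups (Low/Medium/High).
# The regression further down uses the original numeric columns, not these factors.
heart_data$biking_factor <- cut(heart_data$biking,
                                breaks = 3,
                                labels = c("Low", "Medium", "High"))
heart_data$smoking_factor <- cut(heart_data$smoking,
                                 breaks = 3,
                                 labels = c("Low", "Medium", "High"))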
str(heart_data)
## spc_tbl_ [498 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:498] 1 2 3 4 5 6 7 8 9 10 ...
## $ biking : num [1:498] 30.8 65.13 1.96 44.8 69.43 ...
## $ smoking : num [1:498] 10.9 2.22 17.59 2.8 15.97 ...
## $ heart.disease : num [1:498] 11.77 2.85 17.18 6.82 4.06 ...
## $ biking_factor : Factor w/ 3 levels "Low","Medium",..: 2 3 1 2 3 3 2 1 3 2 ...
## $ smoking_factor: Factor w/ 3 levels "Low","Medium",..: 2 1 2 1 2 3 1 2 2 3 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. biking = col_double(),
## .. smoking = col_double(),
## .. heart.disease = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Which regression model should we build?
Because the dependent variable is continuous and there are two numeric
independent variables, we should use multiple linear regression.
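Written out, the model and the hypotheses tested at the 5% level could be stated as below. This is only a sketch: the symbols $\beta_0, \beta_1, \beta_2$ and $\varepsilon_i$ are notation introduced here for illustration and do not appear in the assignment.

```latex
\[
\mathrm{heart.disease}_i
  = \beta_0 + \beta_1\,\mathrm{biking}_i + \beta_2\,\mathrm{smoking}_i + \varepsilon_i,
\qquad i = 1, \dots, 498
\]
\[
H_0 : \beta_j = 0
\quad \text{versus} \quad
H_1 : \beta_j \neq 0,
\qquad j = 1, 2
\]
% Decision rule: reject H_0 for predictor j when its p-value is below 0.05.
```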
Interpreting the regression model results
model <- lm(heart.disease ~ biking + smoking, data = heart_data)
summary(model)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
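Is the relationship between the independent variables and the dependent
variable statistically significant? Which variables are related to the
dependent variable?
○ The p-values for both biking and smoking are < 2e-16, far below the 5%
significance level, and both are marked ***. Both are therefore significant
predictors: biking and smoking each have a statistically significant effect
on heart disease.
○ Biking (Estimate = -0.200133) has a negative relationship with heart
disease: holding smoking constant, a one-percentage-point increase in biking
is associated with a decrease of about 0.20 percentage points in heart disease.
○ Smoking (Estimate = 0.178334) has a positive relationship with heart
disease: holding biking constant, a one-percentage-point increase in smoking
is associated with an increase of about 0.18 percentage points in heart disease.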
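The 5% decision can also be reproduced by hand from the reported estimates and standard errors, without re-reading the CSV. The sketch below copies the numbers from the `summary(model)` output. The helper `predict_heart()` and the example values (30% biking, 10% smoking) are hypothetical and used only for illustration.

```r
# Coefficient table copied from the summary(model) output.
coefs <- data.frame(
  term      = c("(Intercept)", "biking", "smoking"),
  estimate  = c(14.984658, -0.200133, 0.178334),
  std_error = c(0.080137, 0.001366, 0.003539)
)
df_resid <- 495   # residual degrees of freedom reported in the output
alpha    <- 0.05  # significance level from the problem statement

# Two-sided t-test per coefficient: reject H0 (coefficient = 0) when p < alpha.
coefs$t_value   <- coefs$estimate / coefs$std_error
coefs$p_value   <- 2 * pt(abs(coefs$t_value), df = df_resid, lower.tail = FALSE)
coefs$reject_h0 <- coefs$p_value < alpha

# 95% confidence intervals: estimate +/- critical t value * standard error.
t_crit <- qt(1 - alpha / 2, df = df_resid)
coefs$ci_lower <- coefs$estimate - t_crit * coefs$std_error
coefs$ci_upper <- coefs$estimate + t_crit * coefs$std_error
print(coefs)

# Hypothetical helper: plug predictor values into the fitted equation.
predict_heart <- function(biking, smoking) {
  coefs$estimate[1] + coefs$estimate[2] * biking + coefs$estimate[3] * smoking
}

# Assumed example: 30% of people bike to work and 10% smoke.
predict_heart(biking = 30, smoking = 10)  # about 10.76 (% with heart disease)
```

With the fitted `model` object available, `confint(model)` and `predict(model, newdata = ...)` give the same quantities at full precision. Small differences from the hand calculation come from rounding in the printed table.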