Assignment

Using the dataset 'heart.csv', test at the 5% significance level whether biking and smoking affect the percentage of people with heart disease.
1. Load the data
Data description: observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease.

library(readr)
heart_data <- read_csv("~/Downloads/heart.csv")
## New names:
## • `` -> `...1`
## Rows: 498 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): ...1, biking, smoking, heart.disease
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heart_data)
## # A tibble: 6 × 4
##    ...1 biking smoking heart.disease
##   <dbl>  <dbl>   <dbl>         <dbl>
## 1     1  30.8    10.9          11.8 
## 2     2  65.1     2.22          2.85
## 3     3   1.96   17.6          17.2 
## 4     4  44.8     2.80          6.82
## 5     5  69.4    16.0           4.06
## 6     6  54.4    29.3           9.55
summary(heart_data)
##       ...1           biking          smoking        heart.disease    
##  Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
##  1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
##  Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
##  Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
##  3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
##  Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535
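
The `...1` column is only a row index from an unnamed first column in the file (it runs from 1 to 498 in the summary above), so it carries no information for the analysis. A minimal sketch of one way to keep it out of the data frame, assuming base R's read.csv() is acceptable in place of readr: the inline CSV is a stand-in built from the six rounded rows printed by head(), not the real file, and the name heart_small is hypothetical.

```r
# Stand-in for heart.csv: the six rows shown by head(), with rounded values.
csv_text <- ",biking,smoking,heart.disease
1,30.8,10.9,11.8
2,65.1,2.22,2.85
3,1.96,17.6,17.2
4,44.8,2.80,6.82
5,69.4,16.0,4.06
6,54.4,29.3,9.55"

# row.names = 1 uses the unnamed first column as row names instead of data.
heart_small <- read.csv(text = csv_text, row.names = 1)
str(heart_small)
```

With the real file, the corresponding call would presumably be read.csv("~/Downloads/heart.csv", row.names = 1), assuming the file has the same layout.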
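2. What are the independent and dependent variables? What is the type of each variable?
Independent variables:
○ biking: percentage of people biking to work each day (numeric, continuous)
○ smoking: percentage of people smoking (numeric, continuous)
Dependent variable:
○ heart.disease: percentage of people with heart disease (numeric, continuous)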
3. Convert the independent variables to factors and inspect their levels with str()
# Bin each predictor into three equal-width intervals labelled Low/Medium/High.
heart_data$biking_factor <- cut(heart_data$biking, 
                                breaks = 3, 
                                labels = c("Low", "Medium", "High"))

heart_data$smoking_factor <- cut(heart_data$smoking, 
                                 breaks = 3, 
                                 labels = c("Low", "Medium", "High"))
str(heart_data)
## spc_tbl_ [498 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1          : num [1:498] 1 2 3 4 5 6 7 8 9 10 ...
##  $ biking        : num [1:498] 30.8 65.13 1.96 44.8 69.43 ...
##  $ smoking       : num [1:498] 10.9 2.22 17.59 2.8 15.97 ...
##  $ heart.disease : num [1:498] 11.77 2.85 17.18 6.82 4.06 ...
##  $ biking_factor : Factor w/ 3 levels "Low","Medium",..: 2 3 1 2 3 3 2 1 3 2 ...
##  $ smoking_factor: Factor w/ 3 levels "Low","Medium",..: 2 1 2 1 2 3 1 2 2 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   biking = col_double(),
##   ..   smoking = col_double(),
##   ..   heart.disease = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
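
Calling cut() with breaks = 3 splits the observed range into three equal-width intervals, not into three groups of equal size, so the Low/Medium/High counts can differ. The following sketch illustrates this on a small vector of hypothetical values (not taken from heart.csv); the names x, bins, and labelled are made up for the example.

```r
# Hypothetical toy values spanning roughly the same range as biking.
x <- c(1, 10, 20, 30, 40, 50, 60, 75)

# Without labels, the level names show the interval boundaries cut() chose.
bins <- cut(x, breaks = 3)
print(levels(bins))

# With labels, the same three intervals are named Low, Medium, High.
labelled <- cut(x, breaks = 3, labels = c("Low", "Medium", "High"))

# Counts per level; equal-width bins need not hold equal numbers of values.
print(table(labelled))
```

Running table() on heart_data$biking_factor and heart_data$smoking_factor would show the same kind of per-level counts for the real data.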
4. Which regression model should be fitted?
    Because the dependent variable (heart.disease) is a continuous numeric variable and both independent variables (biking, smoking) are also continuous numeric variables, multiple linear regression is the appropriate model.
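
Written out, the model and the hypotheses tested at the 5% level can be sketched as follows. The symbols (the coefficients β, the error term ε, and the observation index i) are notation chosen for this illustration and do not appear in the assignment.

```latex
\[
\text{heart.disease}_i = \beta_0 + \beta_1\,\text{biking}_i + \beta_2\,\text{smoking}_i + \varepsilon_i
\]
\[
H_0 : \beta_j = 0 \quad \text{vs.} \quad H_1 : \beta_j \neq 0, \qquad j = 1, 2, \qquad \alpha = 0.05
\]
```

A predictor is judged significant when the p-value of its t-test falls below 0.05.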

5. Interpret the results of the regression model

model <- lm(heart.disease ~ biking + smoking, data = heart_data)
summary(model)
## 
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1789 -0.4463  0.0362  0.4422  1.9331 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.984658   0.080137  186.99   <2e-16 ***
## biking      -0.200133   0.001366 -146.53   <2e-16 ***
## smoking      0.178334   0.003539   50.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared:  0.9796, Adjusted R-squared:  0.9795 
## F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16
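
The 5% test can also be read as a 95% confidence interval: a coefficient is significant when its interval excludes zero. The sketch below rebuilds those intervals from the estimates, standard errors, and residual degrees of freedom printed in the summary(model) output above, so it runs without the data file. The variable names (est, se, resid_df, ci) are made up for the example.

```r
# Estimates and standard errors copied from the summary(model) output.
est <- c(biking = -0.200133, smoking = 0.178334)
se  <- c(biking = 0.001366, smoking = 0.003539)
resid_df <- 495  # residual degrees of freedom reported by summary(model)

# Two-sided critical t value for a 5% significance level.
t_crit <- qt(1 - 0.05 / 2, df = resid_df)

# 95% confidence interval: estimate +/- t_crit * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# A coefficient is significant at the 5% level when its interval excludes 0.
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
print(excludes_zero)
```

When the fitted model object is available, confint(model, level = 0.95) computes the same intervals directly.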

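Is the relationship between the independent variables and the dependent variable statistically significant? Which variables are associated with the dependent variable?
○ Both coefficients have p-values < 2e-16 (marked ***), far below the 0.05 significance level, so biking and smoking are both statistically significant. That is, biking and smoking each have a significant effect on heart disease.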

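○ biking (Estimate = -0.200133) has a negative relationship with heart disease: holding smoking constant, each 1 percentage-point increase in biking is associated with a decrease of about 0.20 percentage points in heart disease.
○ smoking (Estimate = 0.178334) has a positive relationship with heart disease: holding biking constant, each 1 percentage-point increase in smoking is associated with an increase of about 0.18 percentage points in heart disease.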
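
To make the coefficients concrete, the fitted equation can be evaluated by hand. This is a minimal sketch that hard-codes the estimates from the summary(model) output; the helper predict_heart() and the input values (30% biking, 10% smoking) are hypothetical choices for illustration.

```r
# Coefficients copied from the summary(model) output.
coefs <- c(intercept = 14.984658, biking = -0.200133, smoking = 0.178334)

# Hypothetical helper: predicted heart disease (%) for given biking and smoking (%).
predict_heart <- function(biking, smoking) {
  unname(coefs["intercept"] + coefs["biking"] * biking + coefs["smoking"] * smoking)
}

# Predicted value for 30% biking and 10% smoking (about 10.76).
base <- predict_heart(30, 10)
print(base)

# Raising biking by one percentage point changes the prediction by the biking
# coefficient; raising smoking by one point changes it by the smoking coefficient.
print(predict_heart(31, 10) - base)
print(predict_heart(30, 11) - base)
```

With the fitted model object available, predict(model, newdata = data.frame(biking = 30, smoking = 10)) should give essentially the same value, up to rounding of the printed coefficients.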