Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Body mass index (BMI) and risk of cardiovascular disease: the Framingham Heart Study

Cases

What are the cases, and how many are there?

The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham and is now on its third generation of participants. Each case is one participant; the dataset used in this project contains 4,240 participants.

Data collection

Describe the method of data collection.

The Framingham Heart Study participants, and their children and grandchildren, voluntarily consented to undergo a detailed medical history, physical examination, and medical tests every two years, creating a wealth of data about physical and mental health, especially about cardiovascular disease. All subjects were white.

Type of study

What type of study is this (observational/experiment)?

This is a prospective, longitudinal observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data were obtained from Kaggle: www.kaggle.com

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is BMI, a quantitative variable. BMI is calculated as the subject's weight in kilograms divided by the square of the height in meters (kg/m²).
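As a quick check of the formula above, here is a minimal R sketch. The function name `bmi` and the example weight and height are illustrative assumptions, not values taken from the dataset.

```r
# Hypothetical helper: BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Illustrative person: 70 kg, 1.75 m tall
example_bmi <- bmi(70, 1.75)
print(round(example_bmi, 2))
```

The dataset already provides BMI as a column, so this helper is only needed if BMI has to be recomputed from raw weight and height.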

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The independent variables include sex (qualitative), age (quantitative), education (qualitative), smoking (qualitative), hypertension (qualitative), diabetes (qualitative), cholesterol (quantitative), and coronary heart disease (qualitative).

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Means will be calculated for all parameters in both men and women and in different age groups. The age group categories are: <30 years, 30 to 39 years, 40 to 49 years, 50 to 59 years, and ≥60 years. The majority of the individuals in the <30 years category were between 20 and 29 years of age, and the majority of the individuals in the ≥60 years category were between 60 and 69 years of age, in both men and women.

Subjects were also divided into 6 groups according to their BMI: <21.00, 21.00 to 22.99, 23.00 to 24.99, 25.00 to 27.49, 27.50 to 29.99, and ≥30.00 kg/m². These ranges were selected because they are similar to those used in other large epidemiological studies of men and women.

To achieve a normal distribution, a logarithmic transformation will be applied to BMI and total cholesterol in men and women. The PROC REG procedure will be used to test the association of BMI (as a continuous variable) with blood pressure, glucose, and plasma lipid levels after adjustment for age effects and exclusion of smokers. The odds ratios for each unit of BMI increase will be determined using PROC LOGIST, after the exclusion of smokers from the analysis to avoid residual effects of smoking.
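The age and BMI groupings described above can be sketched in R with `cut()`. This is only one possible way to build the groups; the helper names `age_group` and `bmi_group`, the label strings, and the example values are assumptions made for illustration.

```r
# Hypothetical helpers; the breaks follow the groups listed in the text.
age_group <- function(age) {
  cut(age,
      breaks = c(-Inf, 30, 40, 50, 60, Inf),
      labels = c("<30", "30-39", "40-49", "50-59", ">=60"),
      right = FALSE)  # intervals are [lower, upper)
}

bmi_group <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 21, 23, 25, 27.5, 30, Inf),
      labels = c("<21.00", "21.00-22.99", "23.00-24.99",
                 "25.00-27.49", "27.50-29.99", ">=30.00"),
      right = FALSE)
}

# Illustrative values only (not rows from the dataset)
ages <- age_group(c(29, 30, 59, 60))
bmis <- bmi_group(c(20.5, 25, 29.99, 30))
print(table(ages))
print(table(bmis))
```

Once the data are loaded, the same helpers could be applied to the `age` and `BMI` columns, and group means computed with, for example, `aggregate()`.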

require(rvest)

## Loading required package: rvest

## Loading required package: xml2

require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(stringr)

## Loading required package: stringr

require(tidyr)

## Loading required package: tidyr

require(ggplot2)

## Loading required package: ggplot2

require(readr)

## Loading required package: readr

## 
## Attaching package: 'readr'

## The following object is masked from 'package:rvest':
## 
##     guess_encoding

require(broom)

## Loading required package: broom

fhs <- read_csv("https://raw.githubusercontent.com/johnpannyc/data-606-final-project/aaa4460bec757f87321b826800b2017a48b3d437/framingham.csv")

## Parsed with column specification:
## cols(
##   male = col_integer(),
##   age = col_integer(),
##   education = col_integer(),
##   currentSmoker = col_integer(),
##   cigsPerDay = col_integer(),
##   BPMeds = col_integer(),
##   prevalentStroke = col_integer(),
##   prevalentHyp = col_integer(),
##   diabetes = col_integer(),
##   totChol = col_integer(),
##   sysBP = col_double(),
##   diaBP = col_double(),
##   BMI = col_double(),
##   heartRate = col_integer(),
##   glucose = col_integer(),
##   TenYearCHD = col_integer()
## )

dim(fhs)

## [1] 4240   16

head(fhs)

## # A tibble: 6 x 16
##    male   age education currentSmoker cigsPerDay BPMeds prevalentStroke
##   <int> <int>     <int>         <int>      <int>  <int>           <int>
## 1     1    39         4             0          0      0               0
## 2     0    46         2             0          0      0               0
## 3     1    48         1             1         20      0               0
## 4     0    61         3             1         30      0               0
## 5     0    46         3             1         23      0               0
## 6     0    43         2             0          0      0               0
## # ... with 9 more variables: prevalentHyp <int>, diabetes <int>,
## #   totChol <int>, sysBP <dbl>, diaBP <dbl>, BMI <dbl>, heartRate <int>,
## #   glucose <int>, TenYearCHD <int>

tail(fhs)

## # A tibble: 6 x 16
##    male   age education currentSmoker cigsPerDay BPMeds prevalentStroke
##   <int> <int>     <int>         <int>      <int>  <int>           <int>
## 1     1    51         3             1         43      0               0
## 2     0    48         2             1         20     NA               0
## 3     0    44         1             1         15      0               0
## 4     0    52         2             0          0      0               0
## 5     1    40         3             0          0      0               0
## 6     0    39         3             1         30      0               0
## # ... with 9 more variables: prevalentHyp <int>, diabetes <int>,
## #   totChol <int>, sysBP <dbl>, diaBP <dbl>, BMI <dbl>, heartRate <int>,
## #   glucose <int>, TenYearCHD <int>

summary(fhs)

##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##                                   NA's   :105                     
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  NA's   :29       NA's   :53                                           
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                    NA's   :50                                     
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :19      NA's   :1        NA's   :388
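The summary above reports missing values in seven variables (for example, 388 for glucose and 19 for BMI). A minimal sketch of how such counts can be tabulated directly is shown below; `count_missing` is a hypothetical helper and `toy` is a made-up data frame used only so the example runs on its own.

```r
# Hypothetical helper: number of missing values in each column
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data frame for illustration only (not the Framingham data)
toy <- data.frame(
  BMI     = c(26.97, NA, 25.34),
  glucose = c(77, NA, NA),
  age     = c(39, 46, 48)
)
missing_counts <- count_missing(toy)
print(missing_counts)

# Rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

With the real data, `count_missing(fhs)` would give the per-column counts. This matters later because `glm()` and `lm()` drop rows with missing values in the model variables by default.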

Male and female participants in the study (0: female; 1: male)

table(fhs$male)

## 
##    0    1 
## 2420 1820

Histogram of age

hist(fhs$age)

Data visualization: histogram of BMI

hist(fhs$BMI, main = "Distribution of BMI in the Framingham Heart Study")

Boxplot of BMI

boxplot(fhs$BMI)

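The plots above suggest that BMI is approximately normally distributed, with a mean of 25.80 and a median of 25.40. The mean slightly exceeds the median and the maximum is 56.80, which points to a mild right skew.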

Ten Year CHD prevalence

table(fhs$TenYearCHD)

## 
##    0    1 
## 3596  644

Among all 4,240 participants, 3,596 did not have TenYearCHD and 644 did.
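The counts above can be turned into proportions. The sketch below uses only the two counts printed by `table(fhs$TenYearCHD)`; the variable names are illustrative.

```r
# Counts copied from the table(fhs$TenYearCHD) output
chd_counts <- c(`0` = 3596, `1` = 644)

# Proportion of participants in each outcome group
chd_props <- chd_counts / sum(chd_counts)
print(round(chd_props, 4))
```

With the data loaded, `prop.table(table(fhs$TenYearCHD))` gives the same proportions. The share with TenYearCHD equal to 1 agrees with the mean of 0.1519 shown in the summary output.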

Current smoker prevalence (0: non-smoker; 1: smoker)

table(fhs$currentSmoker)

## 
##    0    1 
## 2145 2095

Hypertension prevalence (0: normal; 1: hypertension)

table(fhs$prevalentHyp)

## 
##    0    1 
## 2923 1317

Diabetes prevalence

table(fhs$diabetes)

## 
##    0    1 
## 4131  109

ggplot(data = fhs, aes(x = TenYearCHD, y = BMI,group=TenYearCHD)) + 
  geom_boxplot()

## Warning: Removed 19 rows containing non-finite values (stat_boxplot).

ggplot(data = fhs, aes(x = factor(TenYearCHD), y = BMI)) + 
  geom_boxplot()

## Warning: Removed 19 rows containing non-finite values (stat_boxplot).

Explore data

glimpse(fhs)

## Observations: 4,240
## Variables: 16
## $ male            <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0...
## $ age             <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43...
## $ education       <int> 4, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 2, 1, 3, 2, 2...
## $ currentSmoker   <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1...
## $ cigsPerDay      <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentHyp    <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1...
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ totChol         <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 2...
## $ sysBP           <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138....
## $ diaBP           <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71....
## $ BMI             <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.1...
## $ heartRate       <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72...
## $ glucose         <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 6...
## $ TenYearCHD      <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1...

BMI and ten-year CHD incidence

mod0<-glm(TenYearCHD~BMI+age+diabetes+prevalentHyp+currentSmoker, data=fhs, family=binomial)
mod0

## 
## Call:  glm(formula = TenYearCHD ~ BMI + age + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs)
## 
## Coefficients:
##   (Intercept)            BMI            age       diabetes   prevalentHyp  
##      -6.29940        0.01814        0.06908        0.82298        0.63088  
## currentSmoker  
##       0.52487  
## 
## Degrees of Freedom: 4220 Total (i.e. Null);  4215 Residual
##   (19 observations deleted due to missingness)
## Null Deviance:       3571 
## Residual Deviance: 3268  AIC: 3280

summary(mod0)

## 
## Call:
## glm(formula = TenYearCHD ~ BMI + age + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5114  -0.6039  -0.4434  -0.3308   2.5719  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.299401   0.429949 -14.652  < 2e-16 ***
## BMI            0.018144   0.011021   1.646 0.099716 .  
## age            0.069083   0.005764  11.985  < 2e-16 ***
## diabetes       0.822977   0.217329   3.787 0.000153 ***
## prevalentHyp   0.630882   0.096994   6.504 7.80e-11 ***
## currentSmoker  0.524874   0.094503   5.554 2.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3571.5  on 4220  degrees of freedom
## Residual deviance: 3267.9  on 4215  degrees of freedom
##   (19 observations deleted due to missingness)
## AIC: 3279.9
## 
## Number of Fisher Scoring iterations: 5

As a continuous variable, BMI did not reach statistical significance at the 0.05 level (p = 0.0997) after adjusting for age, diabetes, hypertension, and smoking. We therefore recoded it as a categorical variable.
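To make the BMI result easier to read, the log-odds coefficient can be converted to an odds ratio. The sketch below uses only the estimate and standard error printed in `summary(mod0)`; the Wald interval is a standard approximation, and the variable names are illustrative.

```r
# Estimate and standard error for BMI, copied from summary(mod0)
beta_bmi <- 0.018144
se_bmi   <- 0.011021

# Odds ratio per one-unit increase in BMI
or_bmi <- exp(beta_bmi)

# Approximate 95% Wald confidence interval on the odds-ratio scale
ci_bmi <- exp(beta_bmi + c(-1, 1) * qnorm(0.975) * se_bmi)

# Wald z statistic and two-sided p-value
z_bmi <- beta_bmi / se_bmi
p_bmi <- 2 * pnorm(-abs(z_bmi))

print(round(c(OR = or_bmi, lower = ci_bmi[1], upper = ci_bmi[2], p = p_bmi), 4))
```

The interval contains 1, which is consistent with the p-value above 0.05. With the fitted object available, `exp(coef(mod0))` gives the odds ratios for all terms at once.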

The following ranges define the standard BMI categories:

Underweight: BMI less than 18.5
Normal weight: BMI 18.5 to 24.9
Overweight: BMI 25 to 29.9
Obese: BMI 30 or more

We study whether obesity is a risk factor for CHD. Here we create two categories: non-obese (BMI <= 30) and obese (BMI > 30). Note that the code below uses a strict inequality, so a BMI of exactly 30 is coded as non-obese.

fhs$obesity<-ifelse(fhs$BMI>30, 1,0)
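Two details of this recoding are worth checking on a few illustrative values (these are not rows from the dataset): the comparison is strict, and a missing BMI stays missing rather than being coded as 0.

```r
# Illustrative BMI values, including the boundary and a missing value
bmi_values <- c(29.9, 30, 30.1, NA)

# Same rule as in the analysis: 1 if BMI > 30, otherwise 0
obesity_flag <- ifelse(bmi_values > 30, 1, 0)
print(obesity_flag)

# Missing BMI propagates to a missing flag
print(sum(is.na(obesity_flag)))
```

This is why `table(fhs$obesity)` counts fewer participants than the full dataset: the 19 rows with missing BMI have a missing obesity flag.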

glimpse(fhs)

## Observations: 4,240
## Variables: 17
## $ male            <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0...
## $ age             <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43...
## $ education       <int> 4, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 2, 1, 3, 2, 2...
## $ currentSmoker   <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1...
## $ cigsPerDay      <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentHyp    <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1...
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ totChol         <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 2...
## $ sysBP           <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138....
## $ diaBP           <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71....
## $ BMI             <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.1...
## $ heartRate       <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72...
## $ glucose         <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 6...
## $ TenYearCHD      <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ obesity         <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0...

mod2 <- glm( TenYearCHD ~ obesity + male+diabetes + prevalentHyp+currentSmoker, data = fhs,family=binomial)
mod2

## 
## Call:  glm(formula = TenYearCHD ~ obesity + male + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs)
## 
## Coefficients:
##   (Intercept)        obesity           male       diabetes   prevalentHyp  
##      -2.46102        0.05307        0.49228        1.02231        0.96554  
## currentSmoker  
##       0.15445  
## 
## Degrees of Freedom: 4220 Total (i.e. Null);  4215 Residual
##   (19 observations deleted due to missingness)
## Null Deviance:       3571 
## Residual Deviance: 3391  AIC: 3403

summary(mod2)

## 
## Call:
## glm(formula = TenYearCHD ~ obesity + male + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2750  -0.5632  -0.4358  -0.4047   2.2552  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.46102    0.08886 -27.697  < 2e-16 ***
## obesity        0.05307    0.12621   0.421   0.6741    
## male           0.49228    0.09039   5.446 5.15e-08 ***
## diabetes       1.02231    0.21330   4.793 1.64e-06 ***
## prevalentHyp   0.96554    0.09126  10.580  < 2e-16 ***
## currentSmoker  0.15445    0.09150   1.688   0.0914 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3571.5  on 4220  degrees of freedom
## Residual deviance: 3391.4  on 4215  degrees of freedom
##   (19 observations deleted due to missingness)
## AIC: 3403.4
## 
## Number of Fisher Scoring iterations: 4

table(fhs$obesity)

## 
##    0    1 
## 3685  536

Now we focus on participants older than 50 years.

fhs2<- filter(fhs, age>50)

Let us look at the new data frame, fhs2.

glimpse(fhs2)

## Observations: 1,883
## Variables: 17
## $ male            <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0...
## $ age             <int> 61, 63, 52, 52, 52, 60, 61, 60, 59, 61, 54, 56...
## $ education       <int> 3, 1, 1, 1, 3, 1, 3, 1, 1, NA, 1, NA, 1, 1, 2,...
## $ currentSmoker   <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ cigsPerDay      <int> 30, 0, 0, 0, 20, 0, 0, 0, 0, 5, 20, 0, 0, 0, 0...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentHyp    <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1...
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1...
## $ totChol         <int> 225, 205, 260, 234, 215, 260, 272, 247, 209, 1...
## $ sysBP           <dbl> 150.0, 138.0, 141.5, 148.0, 132.0, 110.0, 182....
## $ diaBP           <dbl> 95.0, 71.0, 89.0, 78.0, 82.0, 72.5, 121.0, 88....
## $ BMI             <dbl> 28.58, 33.11, 26.36, 34.17, 25.11, 26.59, 32.8...
## $ heartRate       <int> 65, 60, 76, 70, 71, 65, 85, 72, 90, 72, 96, 72...
## $ glucose         <int> 103, 85, 79, 113, 75, NA, 65, 74, 88, 75, 87, ...
## $ TenYearCHD      <int> 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1...
## $ obesity         <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0...

table(fhs2$obesity)

## 
##    0    1 
## 1597  276

table(fhs2$TenYearCHD)

## 
##    0    1 
## 1452  431
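For context, the TenYearCHD proportion in this older subgroup can be compared with the rest of the sample. The sketch below uses only counts printed in the outputs above (644 of 4,240 overall, 431 of 1,883 among those older than 50); the variable names are illustrative.

```r
# Counts copied from the table() and glimpse() outputs
chd_all     <- 644;  n_all     <- 4240
chd_over_50 <- 431;  n_over_50 <- 1883

# The remaining participants are aged 50 or younger
chd_50_or_under <- chd_all - chd_over_50
n_50_or_under   <- n_all - n_over_50

rate_over_50     <- chd_over_50 / n_over_50
rate_50_or_under <- chd_50_or_under / n_50_or_under

print(round(c(over_50 = rate_over_50, age_50_or_under = rate_50_or_under), 4))
```

The proportion is noticeably higher in the older group, in line with the strong age effect seen in the first model.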

mod3 <- glm( TenYearCHD ~ obesity + male+diabetes + prevalentHyp+currentSmoker, data = fhs2, family=binomial)
mod3

## 
## Call:  glm(formula = TenYearCHD ~ obesity + male + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs2)
## 
## Coefficients:
##   (Intercept)        obesity           male       diabetes   prevalentHyp  
##       -1.9847         0.1058         0.5089         0.7806         0.7520  
## currentSmoker  
##        0.2604  
## 
## Degrees of Freedom: 1872 Total (i.e. Null);  1867 Residual
##   (10 observations deleted due to missingness)
## Null Deviance:       2004 
## Residual Deviance: 1925  AIC: 1937

summary(mod3)

## 
## Call:
## glm(formula = TenYearCHD ~ obesity + male + diabetes + prevalentHyp + 
##     currentSmoker, family = binomial, data = fhs2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3615  -0.7207  -0.6417  -0.5075   2.0560  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.9847     0.1148 -17.287  < 2e-16 ***
## obesity         0.1058     0.1583   0.668  0.50389    
## male            0.5089     0.1183   4.302 1.69e-05 ***
## diabetes        0.7806     0.2484   3.142  0.00168 ** 
## prevalentHyp    0.7520     0.1166   6.451 1.11e-10 ***
## currentSmoker   0.2604     0.1196   2.177  0.02945 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2003.6  on 1872  degrees of freedom
## Residual deviance: 1925.1  on 1867  degrees of freedom
##   (10 observations deleted due to missingness)
## AIC: 1937.1
## 
## Number of Fisher Scoring iterations: 4

Next, we study the men in the over-50 group.

fhs3<-filter(fhs2, male==1)

glimpse(fhs3)

## Observations: 780
## Variables: 17
## $ male            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ age             <int> 52, 61, 54, 56, 52, 54, 51, 56, 53, 57, 60, 53...
## $ education       <int> 1, NA, 1, NA, 1, 2, 4, 4, 1, 1, 1, 1, 4, 3, 1,...
## $ currentSmoker   <int> 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0...
## $ cigsPerDay      <int> 0, 5, 20, 0, 0, 0, 0, 20, 20, 0, 20, 20, 30, 0...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentHyp    <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0...
## $ diabetes        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ totChol         <int> 260, 175, 214, 257, 178, 195, 216, 270, 220, 2...
## $ sysBP           <dbl> 141.5, 134.0, 147.0, 153.5, 160.0, 132.0, 112....
## $ diaBP           <dbl> 89.0, 82.5, 74.0, 102.0, 98.0, 83.5, 66.0, 79....
## $ BMI             <dbl> 26.36, 18.59, 24.71, 28.09, 40.11, 26.21, 23.4...
## $ heartRate       <int> 76, 72, 96, 72, 75, 75, 90, 95, 78, 75, 90, 60...
## $ glucose         <int> 79, 75, 87, 75, 225, 100, 95, 93, 73, 64, 83, ...
## $ TenYearCHD      <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0...
## $ obesity         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

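# Note: lm() ignores `family`, so this is an ordinary least-squares fit
# (a linear probability model), not a logistic regression.
mod4 <- lm( TenYearCHD ~ obesity + diabetes + prevalentHyp+currentSmoker, data = fhs3, family=binomial)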

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded

mod4

## 
## Call:
## lm(formula = TenYearCHD ~ obesity + diabetes + prevalentHyp + 
##     currentSmoker, data = fhs3, family = binomial)
## 
## Coefficients:
##   (Intercept)        obesity       diabetes   prevalentHyp  currentSmoker  
##       0.15968        0.07385        0.26057        0.11473        0.10064

summary(mod4)

## 
## Call:
## lm(formula = TenYearCHD ~ obesity + diabetes + prevalentHyp + 
##     currentSmoker, data = fhs3, family = binomial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7095 -0.2744 -0.2335  0.5511  0.8403 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.15968    0.02677   5.964 3.75e-09 ***
## obesity        0.07385    0.05364   1.377 0.168947    
## diabetes       0.26057    0.07666   3.399 0.000711 ***
## prevalentHyp   0.11473    0.03278   3.499 0.000493 ***
## currentSmoker  0.10064    0.03182   3.162 0.001627 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4379 on 772 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.04509,    Adjusted R-squared:  0.04014 
## F-statistic: 9.114 on 4 and 772 DF,  p-value: 3.388e-07
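Because `lm()` disregards `family` (see the warning above), the coefficients in this output are differences in probability, not log-odds, and are not directly comparable to the earlier `glm()` fits. A minimal sketch of a logistic fit with the same formula is shown below. It runs on simulated toy data, not the Framingham data, and the simulation coefficients and sample size are arbitrary assumptions, so its numbers say nothing about the real result.

```r
# Toy data only: simulated values, NOT the Framingham data.
set.seed(606)
n <- 500
toy <- data.frame(
  obesity       = rbinom(n, 1, 0.15),
  diabetes      = rbinom(n, 1, 0.05),
  prevalentHyp  = rbinom(n, 1, 0.40),
  currentSmoker = rbinom(n, 1, 0.45)
)

# Simulate a binary outcome from an assumed logistic model
lin_pred <- -1.5 + 0.3 * toy$obesity + 0.8 * toy$diabetes +
  0.6 * toy$prevalentHyp + 0.4 * toy$currentSmoker
toy$TenYearCHD <- rbinom(n, 1, plogis(lin_pred))

# glm() honours `family`; lm() would fit ordinary least squares instead
fit_logit <- glm(TenYearCHD ~ obesity + diabetes + prevalentHyp + currentSmoker,
                 data = toy, family = binomial)

# Coefficients are log-odds; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit_logit))
print(round(odds_ratios, 2))
```

Applied to the real subsets, the same call with `data = fhs3` (men) or `data = fhs4` (women) would give results on the same scale as the earlier logistic models.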

Next, we study the women in the over-50 group.

fhs4 <- filter(fhs2, male==0)

glimpse(fhs4)

## Observations: 1,103
## Variables: 17
## $ male            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age             <int> 61, 63, 52, 52, 60, 61, 60, 59, 52, 53, 65, 63...
## $ education       <int> 3, 1, 1, 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1...
## $ currentSmoker   <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0...
## $ cigsPerDay      <int> 30, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 40, 3, 0, 9...
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0...
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ prevalentHyp    <int> 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1...
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ totChol         <int> 225, 205, 234, 215, 260, 272, 247, 209, NA, 31...
## $ sysBP           <dbl> 150.0, 138.0, 148.0, 132.0, 110.0, 182.0, 130....
## $ diaBP           <dbl> 95.0, 71.0, 78.0, 82.0, 72.5, 121.0, 88.0, 85....
## $ BMI             <dbl> 28.58, 33.11, 34.17, 25.11, 26.59, 32.80, 30.3...
## $ heartRate       <int> 65, 60, 70, 71, 65, 85, 72, 90, 70, 76, 90, 95...
## $ glucose         <int> 103, 85, 113, 75, NA, 65, 74, 88, NA, 215, 87,...
## $ TenYearCHD      <int> 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0...
## $ obesity         <dbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...

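# Note: lm() ignores `family`, so this is an ordinary least-squares fit
# (a linear probability model), not a logistic regression.
mod5 <- lm( TenYearCHD ~ obesity + diabetes + prevalentHyp + currentSmoker, data = fhs4,family=binomial)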

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded

mod5

## 
## Call:
## lm(formula = TenYearCHD ~ obesity + diabetes + prevalentHyp + 
##     currentSmoker, data = fhs4, family = binomial)
## 
## Coefficients:
##   (Intercept)        obesity       diabetes   prevalentHyp  currentSmoker  
##      0.122729      -0.007899       0.076696       0.136878       0.001147

summary(mod5)

## 
## Call:
## lm(formula = TenYearCHD ~ obesity + diabetes + prevalentHyp + 
##     currentSmoker, data = fhs4, family = binomial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3375 -0.2596 -0.1227 -0.1227  0.8852 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.122729   0.018800   6.528 1.02e-10 ***
## obesity       -0.007899   0.031249  -0.253    0.800    
## diabetes       0.076696   0.062879   1.220    0.223    
## prevalentHyp   0.136878   0.024132   5.672 1.81e-08 ***
## currentSmoker  0.001147   0.026432   0.043    0.965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.388 on 1091 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.03228,    Adjusted R-squared:  0.02873 
## F-statistic: 9.097 on 4 and 1091 DF,  p-value: 3.14e-07

DATA 606 Final Project

Jun Pan

December 6, 2018