Data Preparation

# load data

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is body mass index (BMI) associated with the risk of cardiovascular disease in participants of the Framingham Heart Study?

Cases

What are the cases, and how many are there? Each case is one participant in the Framingham Heart Study, a long-term, ongoing cardiovascular cohort study of residents of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham and is now on its third generation of participants. The data set used here contains 4,240 cases (rows) and 16 variables.

Data collection

Describe the method of data collection. The Framingham Heart Study participants, and their children and grandchildren, voluntarily consented to undergo a detailed medical history, physical examination, and medical tests every two years, creating a wealth of data about physical and mental health, especially about cardiovascular disease. All subjects were white.

Type of study

What type of study is this (observational/experiment)? This is a prospective, longitudinal observational study.

Data Source

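If you collected the data, state self-collected. If not, provide a citation/link. The data set was obtained from Kaggle (www.kaggle.com); the copy analyzed here is read from the GitHub URL in the code below.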

Dependent Variable

What is the response variable? Is it quantitative or qualitative? The response variable is BMI, a quantitative variable. BMI is calculated as the subject's weight in kilograms divided by the square of the height in meters (kg/m²).
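
As a small illustration of the formula above, the following sketch computes BMI from weight and height. The function name `bmi` and the example subject (70 kg, 1.75 m) are hypothetical and are not part of the Framingham data, which already contains a precomputed `BMI` column.

```r
# BMI = weight (kg) / height (m)^2
# `bmi` is a hypothetical helper written for illustration only.
bmi <- function(weight_kg, height_m) {
  stopifnot(is.numeric(weight_kg), is.numeric(height_m),
            all(height_m > 0, na.rm = TRUE))
  weight_kg / height_m^2
}

# Hypothetical subject: 70 kg, 1.75 m tall
round(bmi(70, 1.75), 2)
## [1] 22.86
```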

Independent Variable

You should have two independent variables, one quantitative and one qualitative. The independent variables are sex (qualitative), age (quantitative), education (qualitative), smoking status (qualitative), hypertension (qualitative), diabetes (qualitative), cholesterol (quantitative), and coronary heart disease (qualitative).

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Means will be calculated for all parameters in both men and women and in different age groups. The age group categories are: <30 years, 30 to 39 years, 40 to 49 years, 50 to 59 years, and ≥60 years. The majority of the individuals in the <30 years category were between 20 and 29 years of age, and the majority of the individuals in the ≥60 years category were between 60 and 69 years of age in both men and women.

Subjects were also divided into 6 groups according to their BMI: <21.00, 21.00 to 22.99, 23.00 to 24.99, 25.00 to 27.49, 27.50 to 29.99, and ≥30.00 kg/m². These ranges are selected because they are similar to those selected in other large epidemiological studies of men and women.

To bring their distributions closer to normal, a logarithmic transformation will be applied to BMI and total cholesterol in men and women. The PROC REG procedure will be used to test the association of BMI (as a continuous variable) with blood pressure, glucose, and plasma lipid levels after adjustment for age effects and exclusion of smokers. The odds ratios for each unit of BMI increase will be determined using PROC LOGIST, after the exclusion of smokers from the analysis to avoid residual effects of smoking.
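
The grouping plan described above could be sketched in base R roughly as follows. This is only an illustration: the data frame `demo` holds the twelve rows printed by `head(fhs)` and `tail(fhs)` later in this report (columns `male`, `age`, and `BMI` only), and the names `demo`, `age_group`, `bmi_group`, `sex`, and `log_BMI` are assumptions introduced here. The intervals are treated as left-closed, so a BMI of exactly 21.00 falls into the 21.00 to 22.99 group.

```r
# Twelve rows copied from the head(fhs) and tail(fhs) output in this report,
# restricted to the columns needed for the grouping.
demo <- data.frame(
  male = c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0),
  age  = c(39, 46, 48, 61, 46, 43, 51, 48, 44, 52, 40, 39),
  BMI  = c(26.97, 28.73, 25.34, 28.58, 23.10, 30.30,
           19.71, 22.00, 19.16, 21.47, 25.60, 20.91)
)

# Age groups: <30, 30-39, 40-49, 50-59, >=60 (left-closed intervals)
demo$age_group <- cut(demo$age,
                      breaks = c(-Inf, 30, 40, 50, 60, Inf),
                      labels = c("<30", "30-39", "40-49", "50-59", ">=60"),
                      right = FALSE)

# BMI groups in kg/m^2, using the six ranges listed in the text
demo$bmi_group <- cut(demo$BMI,
                      breaks = c(-Inf, 21, 23, 25, 27.5, 30, Inf),
                      labels = c("<21.00", "21.00-22.99", "23.00-24.99",
                                 "25.00-27.49", "27.50-29.99", ">=30.00"),
                      right = FALSE)

# Readable sex labels and the log-transformed BMI
demo$sex <- factor(demo$male, levels = c(0, 1), labels = c("female", "male"))
demo$log_BMI <- log(demo$BMI)

# Mean BMI by sex and age group, and the number of subjects per BMI group
aggregate(BMI ~ sex + age_group, data = demo, FUN = mean)
table(demo$bmi_group)
```

Applied to the full data set, the same `cut()` calls would be run on `fhs$age` and `fhs$BMI`; rows with a missing BMI would receive `NA` as their group.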

require(rvest)
## Loading required package: rvest
## Loading required package: xml2
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(stringr)
## Loading required package: stringr
require(tidyr)
## Loading required package: tidyr
require(ggplot2)
## Loading required package: ggplot2
fhs <- read.csv("https://raw.githubusercontent.com/johnpannyc/data-606-final-project/aaa4460bec757f87321b826800b2017a48b3d437/framingham.csv")
dim(fhs)
## [1] 4240   16
head(fhs)
##   male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 1    1  39         4             0          0      0               0
## 2    0  46         2             0          0      0               0
## 3    1  48         1             1         20      0               0
## 4    0  61         3             1         30      0               0
## 5    0  46         3             1         23      0               0
## 6    0  43         2             0          0      0               0
##   prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose
## 1            0        0     195 106.0    70 26.97        80      77
## 2            0        0     250 121.0    81 28.73        95      76
## 3            0        0     245 127.5    80 25.34        75      70
## 4            1        0     225 150.0    95 28.58        65     103
## 5            0        0     285 130.0    84 23.10        85      85
## 6            1        0     228 180.0   110 30.30        77      99
##   TenYearCHD
## 1          0
## 2          0
## 3          0
## 4          1
## 5          0
## 6          0
tail(fhs)
##      male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 4235    1  51         3             1         43      0               0
## 4236    0  48         2             1         20     NA               0
## 4237    0  44         1             1         15      0               0
## 4238    0  52         2             0          0      0               0
## 4239    1  40         3             0          0      0               0
## 4240    0  39         3             1         30      0               0
##      prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose
## 4235            0        0     207 126.5    80 19.71        65      68
## 4236            0        0     248 131.0    72 22.00        84      86
## 4237            0        0     210 126.5    87 19.16        86      NA
## 4238            0        0     269 133.5    83 21.47        80     107
## 4239            1        0     185 141.0    98 25.60        67      72
## 4240            0        0     196 133.0    86 20.91        85      80
##      TenYearCHD
## 4235          0
## 4236          0
## 4237          0
## 4238          0
## 4239          0
## 4240          0
summary(fhs)
##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##                                   NA's   :105                     
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  NA's   :29       NA's   :53                                           
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                    NA's   :50                                     
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :19      NA's   :1        NA's   :388
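
The analysis plan names the SAS procedures PROC REG and PROC LOGIST. A rough R counterpart, offered only as a sketch, could use `lm()` and `glm()` from the stats package. The helper names `non_smokers`, `fit_linear`, and `bmi_odds_ratio` are hypothetical, as are the default outcomes `sysBP` and `prevalentHyp`, since the plan does not say which outcome the odds ratios refer to. Rows with missing values in the model variables are dropped, which matters here because the summary above shows `NA`s in several columns.

```r
# Keep non-smokers with complete data on the variables used in a model.
# `df` is expected to have the column names of the fhs data frame.
non_smokers <- function(df, vars) {
  keep <- df$currentSmoker == 0 & complete.cases(df[, vars, drop = FALSE])
  df[which(keep), , drop = FALSE]
}

# Linear regression of a risk factor on BMI, adjusted for age
# (an assumed R analogue of the PROC REG step).
fit_linear <- function(df, outcome = "sysBP") {
  d <- non_smokers(df, c(outcome, "BMI", "age", "currentSmoker"))
  lm(reformulate(c("BMI", "age"), response = outcome), data = d)
}

# Logistic regression of a binary outcome on BMI, adjusted for age
# (an assumed R analogue of the PROC LOGIST step).
# exp(coefficient of BMI) is the odds ratio per 1 kg/m^2 increase in BMI.
bmi_odds_ratio <- function(df, outcome = "prevalentHyp") {
  d <- non_smokers(df, c(outcome, "BMI", "age", "currentSmoker"))
  fit <- glm(reformulate(c("BMI", "age"), response = outcome),
             data = d, family = binomial())
  unname(exp(coef(fit)["BMI"]))
}

# Possible usage once fhs has been loaded:
# summary(fit_linear(fhs, "sysBP"))
# bmi_odds_ratio(fhs, "prevalentHyp")
```

If the log transformation described in the plan is wanted, `"BMI"` could be replaced by `"log(BMI)"` in the `reformulate()` call; the exponentiated coefficient would then no longer be an odds ratio per unit of BMI.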

Research question