# load data
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Body Mass Index (BMI) and risk of cardiovascular disease: the Framingham study
What are the cases, and how many are there? Each case is one participant in the Framingham Heart Study, a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham and is now on its third generation of participants. The dataset analyzed here contains 4,240 cases and 16 variables.
Describe the method of data collection. The Framingham Heart Study participants, and their children and grandchildren, voluntarily consented to undergo a detailed medical history, physical examination, and medical tests every two years, creating a wealth of data about physical and mental health, especially about cardiovascular disease. All subjects were white.
What type of study is this (observational/experiment)? This is a prospective, observational, longitudinal study.
If you collected the data, state self-collected. If not, provide a citation/link. The data were not self-collected. They were obtained from Kaggle (www.kaggle.com) and are read here from a copy hosted on GitHub.
What is the response variable? Is it quantitative or qualitative? The response variable is BMI, which is a quantitative variable. BMI is calculated as the subject's weight in kilograms divided by the square of the height in meters (kg/m²).
You should have two independent variables, one quantitative and one qualitative. The independent variables include sex (qualitative), age (quantitative), education (qualitative), smoking (qualitative), hypertension (qualitative), diabetes (qualitative), cholesterol (quantitative), and coronary heart disease (qualitative).
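The BMI definition above can be written as a small R function. This is only a sketch: `compute_bmi` is a hypothetical helper, and the weights and heights are made-up illustrative values, since the dataset already ships a precomputed `BMI` column and contains no weight or height columns.

```r
# BMI = weight (kg) / height (m)^2
# `compute_bmi` is a hypothetical helper introduced for illustration only.
compute_bmi <- function(weight_kg, height_m) {
  # Guard against non-positive heights, which would give meaningless values
  stopifnot(all(height_m > 0, na.rm = TRUE))
  weight_kg / height_m^2
}

# Illustrative (made-up) measurements for two subjects
example_bmi <- compute_bmi(weight_kg = c(70, 85), height_m = c(1.75, 1.80))
print(round(example_bmi, 2))
```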
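Several of the qualitative variables are stored as numeric codes (0/1 indicators, and 1 to 4 for education), so R would treat them as numbers unless they are converted to factors. The sketch below shows one way to do that on a small toy data frame that copies a few values from the dataset. The labels ("female"/"male", "no"/"yes") and the assumption that `male = 0` means female are not stated in the source and are inferred from the column name.

```r
# Toy rows that mimic the coding used in the dataset
toy <- data.frame(
  male          = c(1, 0, 1, 0),
  education     = c(4, 2, 1, NA),
  currentSmoker = c(0, 0, 1, 1),
  age           = c(39, 46, 48, 61),
  totChol       = c(195, 250, 245, 225)
)

# Recode qualitative variables as factors (labels are assumptions)
toy$sex <- factor(toy$male, levels = c(0, 1), labels = c("female", "male"))
toy$currentSmoker <- factor(toy$currentSmoker, levels = c(0, 1),
                            labels = c("no", "yes"))
# Education is coded 1-4; treated here as an ordered factor
toy$education <- factor(toy$education, levels = 1:4, ordered = TRUE)

# Quantitative variables (age, totChol) stay numeric
str(toy)
```

With factors in place, `summary()` reports counts per level for the qualitative variables instead of means of the 0/1 codes.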
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
Means will be calculated for all parameters in both men and women and in different age groups. The age group categories are: <30 years, 30 to 39 years, 40 to 49 years, 50 to 59 years, and ≥60 years. The majority of the individuals in the <30 years category were between 20 and 29 years of age, and the majority of the individuals in the ≥60 years category were between 60 and 69 years of age in both men and women. Subjects will also be divided into 6 groups according to their BMI: <21.00, 21.00 to 22.99, 23.00 to 24.99, 25.00 to 27.49, 27.50 to 29.99, and ≥30.00 kg/m². These ranges are selected because they are similar to those selected in other large epidemiological studies of men and women.
To achieve a normal distribution, a logarithmic transformation will be applied to BMI and total cholesterol in men and women. Linear regression (the PROC REG procedure in SAS; lm() in R) will be used to test the association of BMI (as a continuous variable) with blood pressure, glucose, and plasma lipid levels after adjustment for age effects and exclusion of smokers. The odds ratios for each unit of BMI increase will be determined using logistic regression (PROC LOGIST in SAS; glm() with a binomial family in R), after the exclusion of smokers from the analysis to avoid residual effects of smoking.
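The age and BMI groupings described above could be built in R with `cut()`. The following is a minimal sketch on six toy rows copied from the first rows of the dataset. The column names `age_group`, `bmi_group`, and `log_bmi`, and the group labels, are illustrative choices rather than anything fixed by the source.

```r
# Toy data: age and BMI values of the first six subjects
toy <- data.frame(
  age = c(39, 46, 48, 61, 46, 43),
  BMI = c(26.97, 28.73, 25.34, 28.58, 23.10, 30.30)
)

# Age groups: <30, 30-39, 40-49, 50-59, >=60 (left-closed intervals)
toy$age_group <- cut(
  toy$age,
  breaks = c(-Inf, 30, 40, 50, 60, Inf),
  labels = c("<30", "30-39", "40-49", "50-59", ">=60"),
  right  = FALSE
)

# BMI groups: <21, 21-22.99, 23-24.99, 25-27.49, 27.5-29.99, >=30 kg/m^2
toy$bmi_group <- cut(
  toy$BMI,
  breaks = c(-Inf, 21, 23, 25, 27.5, 30, Inf),
  labels = c("<21.00", "21.00-22.99", "23.00-24.99",
             "25.00-27.49", "27.50-29.99", ">=30.00"),
  right  = FALSE
)

# Natural-log transformation of BMI, as planned for the regression step
toy$log_bmi <- log(toy$BMI)

# Cross-tabulate the two groupings
print(table(toy$age_group, toy$bmi_group))
```

Setting `right = FALSE` makes each interval include its lower bound and exclude its upper bound, so a BMI of exactly 25.00 falls in the 25.00-27.49 group.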
require(rvest)
## Loading required package: rvest
## Loading required package: xml2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(stringr)
## Loading required package: stringr
require(tidyr)
## Loading required package: tidyr
require(ggplot2)
## Loading required package: ggplot2
fhs <- read.csv("https://raw.githubusercontent.com/johnpannyc/data-606-final-project/aaa4460bec757f87321b826800b2017a48b3d437/framingham.csv")
dim(fhs)
## [1] 4240 16
head(fhs)
##   male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 1    1  39         4             0          0      0               0
## 2    0  46         2             0          0      0               0
## 3    1  48         1             1         20      0               0
## 4    0  61         3             1         30      0               0
## 5    0  46         3             1         23      0               0
## 6    0  43         2             0          0      0               0
##   prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose
## 1            0        0     195 106.0    70 26.97        80      77
## 2            0        0     250 121.0    81 28.73        95      76
## 3            0        0     245 127.5    80 25.34        75      70
## 4            1        0     225 150.0    95 28.58        65     103
## 5            0        0     285 130.0    84 23.10        85      85
## 6            1        0     228 180.0   110 30.30        77      99
##   TenYearCHD
## 1          0
## 2          0
## 3          0
## 4          1
## 5          0
## 6          0
tail(fhs)
##      male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 4235    1  51         3             1         43      0               0
## 4236    0  48         2             1         20     NA               0
## 4237    0  44         1             1         15      0               0
## 4238    0  52         2             0          0      0               0
## 4239    1  40         3             0          0      0               0
## 4240    0  39         3             1         30      0               0
##      prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose
## 4235            0        0     207 126.5    80 19.71        65      68
## 4236            0        0     248 131.0    72 22.00        84      86
## 4237            0        0     210 126.5    87 19.16        86      NA
## 4238            0        0     269 133.5    83 21.47        80     107
## 4239            1        0     185 141.0    98 25.60        67      72
## 4240            0        0     196 133.0    86 20.91        85      80
##      TenYearCHD
## 4235          0
## 4236          0
## 4237          0
## 4238          0
## 4239          0
## 4240          0
summary(fhs)
##       male             age          education     currentSmoker
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000
##                                   NA's   :105
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000
##  NA's   :29       NA's   :53
##     diabetes          totChol          sysBP           diaBP
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5
##                    NA's   :50
##       BMI          heartRate         glucose         TenYearCHD
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000
##  NA's   :19      NA's   :1        NA's   :388
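The regression steps in the analysis plan (a linear model for log-transformed BMI adjusted for age, and odds ratios per unit of BMI from a logistic model, both after excluding smokers) could be sketched in base R as follows. The data here are simulated, not the Framingham data. The choice of `prevalentHyp` as the binary outcome, the predictors, and the simulation coefficients are all assumptions made for illustration, so the printed numbers carry no scientific meaning.

```r
set.seed(606)  # arbitrary seed so the simulation is reproducible
n <- 300

# Simulated stand-in for the dataset (column names follow the real one)
sim <- data.frame(
  male          = rbinom(n, 1, 0.43),
  age           = sample(32:70, n, replace = TRUE),
  currentSmoker = rbinom(n, 1, 0.49)
)
sim$BMI <- exp(rnorm(n, mean = log(25.5), sd = 0.15))
# Hypothetical relationship: hypertension risk rises with age and BMI
sim$prevalentHyp <- rbinom(n, 1, plogis(-6 + 0.05 * sim$age + 0.10 * sim$BMI))

# Exclude smokers, as described in the analysis plan
nonsmokers <- subset(sim, currentSmoker == 0)

# Linear regression of log(BMI), adjusted for age and sex
fit_lm <- lm(log(BMI) ~ age + male, data = nonsmokers)

# Logistic regression: odds ratio per one-unit increase in BMI
fit_glm <- glm(prevalentHyp ~ BMI + age, family = binomial, data = nonsmokers)
or_bmi  <- exp(coef(fit_glm)["BMI"])
ci_bmi  <- exp(confint.default(fit_glm)["BMI", ])  # Wald 95% interval

print(summary(fit_lm)$coefficients)
print(c(odds_ratio = unname(or_bmi), ci_bmi))
```

On the real data, the columns with missing values shown in the summary would matter here: by default `lm()` and `glm()` drop any row with an `NA` in a model variable, so models that include `glucose`, for example, would be fitted on fewer than 4,240 cases.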