Introduction
This report explores the Framingham Heart Study data set, focusing on the relationships among heart rate, cigarettes per day, blood pressure, and educational attainment. The Framingham Heart Study is a long-term study of cardiovascular disease among residents of Framingham, Massachusetts. The data set contains 4240 observations of 16 features. Among the features of interest, heart rate has 1 missing value, cigarettes per day has 29, and education has 105; the two blood pressure features (sysBP and diaBP) have none. Details of all feature variables are in the R output that follows.
#providing summary of all features in the original data set
summary(heart)
## male age education currentSmoker
## Min. :0.0000 Min. :32.00 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:42.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :49.00 Median :2.000 Median :0.0000
## Mean :0.4292 Mean :49.58 Mean :1.979 Mean :0.4941
## 3rd Qu.:1.0000 3rd Qu.:56.00 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :70.00 Max. :4.000 Max. :1.0000
## NA's :105
## cigsPerDay BPMeds prevalentStroke prevalentHyp
## Min. : 0.000 Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 0.000 Median :0.00000 Median :0.000000 Median :0.0000
## Mean : 9.006 Mean :0.02962 Mean :0.005896 Mean :0.3106
## 3rd Qu.:20.000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :70.000 Max. :1.00000 Max. :1.000000 Max. :1.0000
## NA's :29 NA's :53
## diabetes totChol sysBP diaBP
## Min. :0.00000 Min. :107.0 Min. : 83.5 Min. : 48.0
## 1st Qu.:0.00000 1st Qu.:206.0 1st Qu.:117.0 1st Qu.: 75.0
## Median :0.00000 Median :234.0 Median :128.0 Median : 82.0
## Mean :0.02571 Mean :236.7 Mean :132.4 Mean : 82.9
## 3rd Qu.:0.00000 3rd Qu.:263.0 3rd Qu.:144.0 3rd Qu.: 90.0
## Max. :1.00000 Max. :696.0 Max. :295.0 Max. :142.5
## NA's :50
## BMI heartRate glucose TenYearCHD
## Min. :15.54 Min. : 44.00 Min. : 40.00 Min. :0.0000
## 1st Qu.:23.07 1st Qu.: 68.00 1st Qu.: 71.00 1st Qu.:0.0000
## Median :25.40 Median : 75.00 Median : 78.00 Median :0.0000
## Mean :25.80 Mean : 75.88 Mean : 81.96 Mean :0.1519
## 3rd Qu.:28.04 3rd Qu.: 83.00 3rd Qu.: 87.00 3rd Qu.:0.0000
## Max. :56.80 Max. :143.00 Max. :394.00 Max. :1.0000
## NA's :19 NA's :1 NA's :388
Observations with missing values in the features of interest were omitted during the analysis. At most 135 of the 4240 observations (about 3.2%) are affected, so although some information is lost, the omission did not have a significant impact on the results of the report.
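As a minimal sketch of how such an omission could be carried out in base R, the example below counts missing values per feature and keeps only complete rows. The data frame `toy` and its values are made up for illustration (they are not Framingham observations), and the object names `na_counts`, `toy_complete`, and `dropped_share` are assumptions; only the column names mirror the summary output.

```r
# Toy data with the same column names as the features of interest.
# The values are invented for illustration only.
toy <- data.frame(
  education  = c(1, 2, NA, 4, 3),
  cigsPerDay = c(0, 20, 5, NA, 0),
  heartRate  = c(75, 80, 68, 90, NA),
  sysBP      = c(120, 135, 128, 150, 110),
  diaBP      = c(80, 85, 82, 95, 70)
)

# Number of missing values in each feature
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing value in any feature
toy_complete <- toy[complete.cases(toy), ]

# Proportion of rows lost by dropping incomplete observations
dropped_share <- 1 - nrow(toy_complete) / nrow(toy)
print(dropped_share)
```

Reporting the dropped proportion alongside the per-feature counts makes it easier to judge whether listwise deletion is acceptable for a given data set.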
Exploratory Data Analysis (EDA)
#providing summary of the features of interest in the reduced data set
summary(heart2)
## heart.education heart.cigsPerDay heart.heartRate heart.sysBP
## Min. :1.000 Min. : 0.000 Min. : 44.00 Min. : 83.5
## 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.: 68.00 1st Qu.:117.0
## Median :2.000 Median : 0.000 Median : 75.00 Median :128.0
## Mean :1.979 Mean : 9.006 Mean : 75.88 Mean :132.4
## 3rd Qu.:3.000 3rd Qu.:20.000 3rd Qu.: 83.00 3rd Qu.:144.0
## Max. :4.000 Max. :70.000 Max. :143.00 Max. :295.0
## NA's :105 NA's :29 NA's :1
## heart.diaBP
## Min. : 48.0
## 1st Qu.: 75.0
## Median : 82.0
## Mean : 82.9
## 3rd Qu.: 90.0
## Max. :142.5
##
EDA Objectives
The primary objective of an exploratory data analysis is to assess the data before making any assumptions about it. It helps reveal data entry errors, understand patterns in the data, and identify anomalies.
Feature Variable Analysis
As mentioned, there are missing values in the features of interest: heart rate has 1, cigarettes per day has 29, and education has 105. The blood pressure features contain no missing values. Education is a categorical variable: it takes only the values 1 through 4, and each number represents a specific level of educational attainment. The remaining features are all numerical, with varied ranges.
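Because education is coded as the numbers 1 through 4, R treats it as numeric unless told otherwise. The sketch below shows one way to recode such a variable as an ordered factor; the vector `education` holds made-up values, and treating the levels as ordered is an assumption based on the description of the codes as attainment levels.

```r
# Made-up education codes, including one missing value
education <- c(1, 2, 2, 4, 3, NA, 1)

# Recode as an ordered factor with the four documented levels
education_f <- factor(education, levels = 1:4, ordered = TRUE)

# Frequency table, showing the missing value as its own category
education_counts <- table(education_f, useNA = "ifany")
print(education_counts)
```

With the variable stored as a factor, `summary()` reports counts per level rather than a mean and quartiles, which is more meaningful for a categorical feature.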
Correlations between numerical features
All the numerical features have the potential to be correlated; examining those relationships is part of the purpose of this report.
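One way to examine these relationships is a pairwise correlation matrix. The sketch below uses base R's `cor()` on a made-up data frame (the values are illustrative, not Framingham observations); the choice of `use = "pairwise.complete.obs"` is an assumption about how missing values might be handled, not necessarily the approach taken in this report.

```r
# Toy numerical features; values are invented for illustration only
toy_num <- data.frame(
  cigsPerDay = c(0, 20, 5, NA, 0, 30),
  heartRate  = c(75, 80, 68, 90, 72, 85),
  sysBP      = c(120, 135, 128, 150, 110, 140),
  diaBP      = c(80, 85, 82, 95, 70, 90)
)

# Pearson correlations, using all rows available for each pair of features
cor_mat <- cor(toy_num, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Values near 1 or -1 indicate a strong linear relationship between two features, while values near 0 indicate little linear association.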
Potential dependency between categorical variables
As there is only one categorical variable being used in this analysis, there is no concern for categorical feature dependency.