This report explores the Framingham Heart Study dataset. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The dataset contains 4240 observations with 16 feature variables. The variables are as follows:
sex: Sex of the participant, stored as a binary feature named “male”
age: Age at the time of medical examination in years
education: A categorical feature of the participants’ education, with the levels: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)
currentSmoker: Current cigarette smoking at the time of examination
cigsPerDay: Number of cigarettes smoked each day
BPMeds: Use of anti-hypertensive medication at exam
prevalentStroke: Prevalent stroke (0 = free of disease)
prevalentHyp: Prevalent hypertension (subject was defined as hypertensive if treated)
diabetes: Diabetic according to criteria of first exam treated
totChol: Total cholesterol (mg/dL)
sysBP: Systolic blood pressure (mmHg)
diaBP: Diastolic blood pressure (mmHg)
BMI: Body Mass Index, weight (kg)/height(m)^2
heartRate: Heart rate (beats/minute)
glucose: Blood glucose level (mg/dL)
TenYearCHD: The 10-year risk of coronary heart disease (CHD)
#providing summary of all variables in the original dataset
summary(heart)
##       male             age          education     currentSmoker
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000
##                                   NA's   :105
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000
##  NA's   :29       NA's   :53
##     diabetes          totChol          sysBP           diaBP
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5
##                    NA's   :50
##       BMI          heartRate         glucose         TenYearCHD
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000
##  NA's   :19      NA's   :1        NA's   :388
The variables of interest for this report are sex, heartRate, cigsPerDay, education, sysBP, and diaBP.
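The column names in the reduced summary (heart.male, heart.education, and so on) are consistent with the subset having been built by passing `heart$...` columns directly to `data.frame()`. The following is a minimal sketch of that construction on a small invented data frame; the toy values and the exact construction are assumptions, not the code used for this report.

```r
# Toy stand-in for the full dataset; all values are invented for illustration.
heart <- data.frame(
  male       = c(1, 0, 0, 1),
  education  = c(1, 2, NA, 4),
  cigsPerDay = c(0, 20, 5, NA),
  heartRate  = c(75, 68, 83, 90),
  sysBP      = c(128, 117, 144, 132),
  diaBP      = c(82, 75, 90, 85),
  glucose    = c(78, 71, 87, NA)  # a column that is not of interest
)

# Unnamed arguments such as heart$male are given syntactic names by
# data.frame(), so "heart$male" becomes "heart.male".
heart2 <- data.frame(heart$male, heart$education, heart$cigsPerDay,
                     heart$heartRate, heart$sysBP, heart$diaBP)

names(heart2)
```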
#FEEDBACK ON ASSIGNMENT 2
#The detailed visual analysis (tables and graphics) needs to be included in the report so we can see whether the underlying feature variables need to be engineered. At the same time, you need to explain why a feature engineering procedure is or is not needed for each feature variable.
#You also need to justify how you handle the missing values, with detailed procedures and explanations.
#providing summary of the variables of interest in the reduced dataset
summary(heart2)
##    heart.male     heart.education heart.cigsPerDay heart.heartRate
##  Min.   :0.0000   Min.   :1.000   Min.   : 0.000   Min.   : 44.00
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.: 68.00
##  Median :0.0000   Median :2.000   Median : 0.000   Median : 75.00
##  Mean   :0.4292   Mean   :1.979   Mean   : 9.006   Mean   : 75.88
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:20.000   3rd Qu.: 83.00
##  Max.   :1.0000   Max.   :4.000   Max.   :70.000   Max.   :143.00
##                   NA's   :105     NA's   :29       NA's   :1
##   heart.sysBP     heart.diaBP
##  Min.   : 83.5   Min.   : 48.0
##  1st Qu.:117.0   1st Qu.: 75.0
##  Median :128.0   Median : 82.0
##  Mean   :132.4   Mean   : 82.9
##  3rd Qu.:144.0   3rd Qu.: 90.0
##  Max.   :295.0   Max.   :142.5
##
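#arranging the four plots in a 2 x 2 grid
par(mfrow = c(2,2))
#pie charts for the two categorical variables
male = table(hearts$heart.male)
pie(male, main="Distribution of Sex")
ed = table(hearts$heart.education)
pie(ed, main = "Distribution of Education")
#hist(hearts$heart.cigsPerDay)
#hist(hearts$heart.heartRate)
#scatterplots for the numerical variables
plot(hearts$heart.cigsPerDay, hearts$heart.heartRate, xlab = "Cigarettes per day", ylab = "Heart rate (beats/minute)")
plot(hearts$heart.sysBP, hearts$heart.diaBP, xlab = "Systolic BP (mmHg)", ylab = "Diastolic BP (mmHg)")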
The missing values shown in the summary above were omitted from the dataset of interest before the analysis. While this results in the loss of some information, it is unlikely to have materially affected the results of the report. This was assessed by comparing the number of missing values with the number of available observations. The variable with the most missing values is education, with 105. As there are 4240 total observations, these missing values account for less than 2.5% of the total. Because a row is dropped when any variable of interest is missing, the overall loss is slightly larger: the regression output later in this report has 4105 residual degrees of freedom, which implies 4107 complete rows, so about 3.1% of the observations were removed.
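The arithmetic behind this assessment can be checked directly from the figures printed in this report. The sketch below uses only the NA counts from the summary output and the residual degrees of freedom from the regression output; treating "residual df + 2" as the number of complete rows assumes the regression was fitted with an intercept and one slope on the NA-omitted data.

```r
# Figures taken from the summary and regression output in this report.
n_total   <- 4240
na_counts <- c(education = 105, cigsPerDay = 29, heartRate = 1)

# Percentage of observations missing for each variable of interest.
pct_missing <- round(100 * na_counts / n_total, 2)
pct_missing

# Rows used in the regression: residual df plus the two fitted coefficients.
n_complete <- 4105 + 2
n_dropped  <- n_total - n_complete
pct_dropped <- round(100 * n_dropped / n_total, 2)

c(n_complete = n_complete, n_dropped = n_dropped, pct_dropped = pct_dropped)
```

The dropped-row count is slightly smaller than the sum of the per-variable NA counts (135), which would be expected if a couple of rows are missing more than one variable.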
The primary objective of an exploratory data analysis is to assess the data before making any assumptions about it. This is achieved by inspecting the distributions of the variables, detecting outliers, examining trends, and assessing the associations between the variables.
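As a rough first pass at outlier detection, the quartiles printed in the summary output can be run through the common 1.5 x IQR rule. This is a sketch only: the rule is one convention among several, the helper `tukey_fences` is a hypothetical name introduced here, and the quartiles are those of the data before missing values were removed.

```r
# Hypothetical helper: lower and upper fences under the 1.5 * IQR rule.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Min, quartiles, and max copied from the summary output in this report.
quartiles <- data.frame(
  variable = c("cigsPerDay", "heartRate", "sysBP", "diaBP"),
  min = c(0, 44, 83.5, 48),
  q1  = c(0, 68, 117, 75),
  q3  = c(20, 83, 144, 90),
  max = c(70, 143, 295, 142.5)
)

# One row of fences per variable.
fences <- t(mapply(tukey_fences, quartiles$q1, quartiles$q3))
quartiles$lower <- fences[, "lower"]
quartiles$upper <- fences[, "upper"]

# Flag variables whose extremes fall outside the fences.
quartiles$low_outliers  <- quartiles$min < quartiles$lower
quartiles$high_outliers <- quartiles$max > quartiles$upper

quartiles
```

Under this rule, every one of the four numerical variables has a maximum above its upper fence, which suggests right-skewed distributions or extreme values worth inspecting with boxplots or histograms.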
As mentioned, there are some missing values in the variables of interest. Heart rate contains 1 missing value, cigarettes per day contains 29 missing values, and education contains 105 missing values. Neither the blood pressure variables nor the male variable contain any missing values. Education is a categorical variable: its values are the integers 1 through 4, and each number represents a specific level of educational attainment. Male is a binary categorical variable in which male participants are recorded as 1 and female participants as 0. The remaining variables are all numerical, with varied ranges.
Any pair of the numerical variables may be correlated; assessing those associations is part of the purpose of this report.
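Because male and education are stored as numbers, R treats them as numeric unless told otherwise. A minimal sketch of recoding them as factors follows, using invented values; it assumes the usual reading of an indicator named male (1 = male, 0 = female) and uses the education labels from the variable list above.

```r
# Invented example values; education includes one missing entry.
male      <- c(1, 0, 0, 1, 0)
education <- c(1, 4, 2, NA, 3)

# Assumption: an indicator named "male" codes male as 1 and female as 0.
sex_f <- factor(male, levels = c(0, 1), labels = c("female", "male"))

# Education levels are ordered, so an ordered factor preserves that ranking.
education_f <- factor(
  education,
  levels = 1:4,
  labels = c("some high school", "high school/GED",
             "some college/vocational school", "college"),
  ordered = TRUE
)

table(sex_f)
table(education_f, useNA = "ifany")
```

With factors in place, `summary()` reports counts per level instead of means and quartiles, which is more meaningful for categorical variables.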
An analysis of the dependency between the two categorical variables, male and education, was performed.
#attempting to create a chart that will show male on the x-axis with education appropriately dividing each
mosaicplot(hearts$heart.male ~ hearts$heart.education, data=hearts, col=c("Blue", "Red", "Pink", "Purple"), main = "Sex vs. Education")
There are more women than men in the college-educated group, as well as in the some high school group. The other educational categories, however, do not follow this trend: there are more men in the high school/GED category, as well as in the some college category.
#regression analysis of correlated variables
compare = lm(hearts$heart.male ~ hearts$heart.education, data=hearts)
summary(compare)
##
## Call:
## lm(formula = hearts$heart.male ~ hearts$heart.education, data = hearts)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.443 -0.427 -0.419  0.573  0.581
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.410946   0.016853   24.38   <2e-16 ***
## hearts$heart.education 0.008026   0.007569    1.06    0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4947 on 4105 degrees of freedom
## Multiple R-squared: 0.0002738, Adjusted R-squared: 3.029e-05
## F-statistic: 1.124 on 1 and 4105 DF, p-value: 0.289
par(mfrow = c(2,2))
plot(compare)
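A p-value of 0.289 means the regression provides no evidence of a linear association between the male and education variables. This suggests that sex and education are not strongly related in this dataset, although a non-significant result does not by itself establish independence.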
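Since both variables are categorical, a chi-squared test of independence on their contingency table is a common alternative to a linear regression. The sketch below uses an invented table of counts purely to show the mechanics; the numbers are not from the Framingham data, and this test was not part of the original analysis.

```r
# Invented counts for illustration only: rows are sex, columns are
# education levels 1 to 4.
counts <- matrix(
  c(40, 30, 20, 10,
    35, 30, 22, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  education = c("1", "2", "3", "4"))
)

# Pearson's chi-squared test of independence between rows and columns.
test <- chisq.test(counts)

test$parameter  # degrees of freedom: (2 - 1) * (4 - 1)
test$p.value
```

On the report's data, the equivalent call would presumably be `chisq.test(table(hearts$heart.male, hearts$heart.education))`, with a small p-value indicating that the distribution of education differs between the sexes.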