Introduction

This report explores the Framingham Heart Study dataset. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The dataset contains 4240 observations of 16 variables. The variables are as follows:

sex: The sex of the participant. The feature is a binary variable named “male”

age: Age at the time of medical examination in years

education: A categorical feature of the participants’ education, with the levels: Some high school (1), high school/GED (2), some college/vocational school (3), college (4)

currentSmoker: Current cigarette smoking at the time of examination

cigsPerDay: Number of cigarettes smoked each day

BPMeds: Use of anti-hypertensive medication at the time of examination

prevalentStroke: Prevalent stroke (0 = free of disease)

prevalentHyp: Prevalent hypertension (subject was defined as hypertensive if treated)

diabetes: Diabetic according to criteria of first exam treated

totChol: Total cholesterol (mg/dL)

sysBP: Systolic blood pressure (mmHg)

diaBP: Diastolic blood pressure (mmHg)

BMI: Body Mass Index, weight (kg)/height(m)^2

heartRate: Heart rate (beats/minute)

glucose: Blood glucose level (mg/dL)

TenYearCHD: The 10-year risk of coronary heart disease (CHD)

#providing summary of all variables in the original dataset
summary(heart)
##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##                                   NA's   :105                     
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  NA's   :29       NA's   :53                                           
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                    NA's   :50                                     
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :19      NA's   :1        NA's   :388
The variables of interest for this report are sex (male), heartRate, cigsPerDay, education, sysBP, and diaBP.
#FEEDBACK ON ASSIGNMENT 2
#The detailed visual analysis (tables and graphics) needs to be included in the report so we can see whether the underlying feature variables need to be engineered. At the same time, you need to explain why a feature engineering procedure is or is not needed for each individual feature variable.

#You also need to justify the decision on how you handle the missing values. Need detailed procedures and explanations.

Exploratory Data Analysis (EDA)

#providing summary of the variables of interest in the reduced dataset
summary(heart2)
##    heart.male     heart.education heart.cigsPerDay heart.heartRate 
##  Min.   :0.0000   Min.   :1.000   Min.   : 0.000   Min.   : 44.00  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.: 68.00  
##  Median :0.0000   Median :2.000   Median : 0.000   Median : 75.00  
##  Mean   :0.4292   Mean   :1.979   Mean   : 9.006   Mean   : 75.88  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:20.000   3rd Qu.: 83.00  
##  Max.   :1.0000   Max.   :4.000   Max.   :70.000   Max.   :143.00  
##                   NA's   :105     NA's   :29       NA's   :1       
##   heart.sysBP     heart.diaBP   
##  Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:117.0   1st Qu.: 75.0  
##  Median :128.0   Median : 82.0  
##  Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :295.0   Max.   :142.5  
## 
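#arranging the four plots in a 2x2 grid
par(mfrow = c(2,2))
#pie charts of the two categorical variables
male = table(hearts$heart.male)
pie(male, main="Distribution of Sex")
ed = table(hearts$heart.education)
pie(ed, main = "Distribution of Education")
#hist(hearts$heart.cigsPerDay)
#hist(hearts$heart.heartRate)
#scatterplots of the numerical variables
plot(hearts$heart.cigsPerDay, hearts$heart.heartRate, xlab = "Cigarettes per day", ylab = "Heart rate (beats/minute)", main = "Cigarettes per Day vs. Heart Rate")
plot(hearts$heart.sysBP, hearts$heart.diaBP, xlab = "Systolic BP (mmHg)", ylab = "Diastolic BP (mmHg)", main = "Systolic vs. Diastolic BP")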

The missing values noted above were omitted from the dataset of interest before the analysis. This results in some loss of information, but the loss is small relative to the number of available observations and is unlikely to have had a material impact on the results of the report. The variable with the most missing values is education, with 105; out of 4240 total observations, that is less than 2.5%. Across the three affected variables there are 135 missing entries (105 + 29 + 1), so row-wise deletion removes at most about 3.2% of the observations.
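The bookkeeping behind this decision can be sketched as follows. This is a minimal illustration on a small made-up data frame (`toy` is hypothetical and only stands in for the reduced dataset). It counts the missing values per column, expresses them as a share of the rows, and then drops incomplete rows with `na.omit()`, which is assumed here to correspond to the omission described above.

```r
# Toy stand-in for the reduced dataset (values are invented for illustration)
toy <- data.frame(
  male       = c(1, 0, 0, 1, 0, 1),
  education  = c(1, NA, 2, 4, 3, NA),
  cigsPerDay = c(0, 20, NA, 5, 0, 0),
  heartRate  = c(70, 82, 75, 64, 90, 77)
)

# Number and percentage of missing values in each column
na_count <- colSums(is.na(toy))
na_pct   <- round(100 * na_count / nrow(toy), 1)
print(rbind(count = na_count, percent = na_pct))

# Listwise deletion: keep only the rows with no missing values
toy_complete <- na.omit(toy)

# Share of rows lost to the deletion
rows_lost_pct <- 100 * (nrow(toy) - nrow(toy_complete)) / nrow(toy)
print(rows_lost_pct)
```

Comparing the per-column percentages with the share of rows actually lost shows how much the missing values overlap across variables.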

EDA Objectives

The primary objective of an exploratory data analysis is to assess the data before making any assumptions about it. This is achieved by inspecting the distribution of variables, detecting outliers, examining the trend of the variables, and assessing the associations between the variables.
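As one concrete example of the outlier-detection step, the sketch below applies the common 1.5 × IQR rule (Tukey's fences) to a handful of invented heart-rate values. The vector `heart_rate` is hypothetical, and the rule is only one of several reasonable choices, not necessarily the one used in this report.

```r
# Invented heart-rate values with one extreme reading
heart_rate <- c(68, 72, 75, 75, 80, 83, 70, 77, 143)

# Quartiles and interquartile range
q   <- quantile(heart_rate, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Tukey's fences: values more than 1.5 * IQR beyond the quartiles are flagged
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
outliers <- heart_rate[heart_rate < lower | heart_rate > upper]
print(outliers)

# boxplot() draws individual points using a closely related rule based on hinges
boxplot(heart_rate, main = "Heart rate (toy data)", ylab = "beats/minute")
```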

Feature Variable Analysis

As mentioned, there are some missing values in the variables of interest. Heart rate contains 1 missing value, cigarettes per day contains 29, and education contains 105. Neither the blood pressure variables nor the male variable contains any missing values. Education is a categorical (ordinal) variable: it takes only the values 1 through 4, and each value represents a specific level of educational attainment. Male is a binary categorical variable, where, as the variable name indicates, male is recorded as '1' and female as '0'. The remaining variables are all numerical, with varied ranges.
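One way to make these codings explicit in R is to store the two categorical variables as labelled factors. The sketch below uses invented codes (`male_code` and `edu_code` are hypothetical). The education labels follow the variable list in the introduction, and the assumption that 1 denotes male is taken from the variable name `male`.

```r
# Invented codes standing in for the male and education columns
male_code <- c(1, 0, 0, 1, 0)
edu_code  <- c(1, 2, 4, 3, 2)

# Binary factor: 0 = female, 1 = male (assumed from the variable name "male")
sex <- factor(male_code, levels = c(0, 1), labels = c("Female", "Male"))

# Ordered factor so that the education levels keep their natural ranking
education <- factor(
  edu_code,
  levels  = 1:4,
  labels  = c("Some high school", "High school/GED",
              "Some college/vocational", "College"),
  ordered = TRUE
)

print(table(sex))
print(table(education))
```

Storing the variables this way keeps plots and tables readable and stops modelling functions from treating the codes as continuous numbers.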

Correlations between numerical variables

Any of the numerical variables (cigsPerDay, heartRate, sysBP, and diaBP) may be correlated with one another, and assessing those relationships is part of the purpose of this report.
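A correlation matrix is one simple way to screen the numerical variables for linear relationships. The sketch below is an illustration on invented measurements (the data frame `toy` is hypothetical), using base R's `cor()` with complete cases only and `cor.test()` for a single pair.

```r
# Invented measurements standing in for the four numerical variables
toy <- data.frame(
  cigsPerDay = c(0, 20, 5, NA, 30, 0),
  heartRate  = c(70, 85, 75, 64, 90, 72),
  sysBP      = c(120, 145, 130, 110, 160, 118),
  diaBP      = c(80, 92, 85, 70, 100, 78)
)

# Pairwise Pearson correlations, using only rows with no missing values
cor_matrix <- cor(toy, use = "complete.obs", method = "pearson")
print(round(cor_matrix, 2))

# Formal test for one pair: systolic vs. diastolic blood pressure
bp_test <- cor.test(toy$sysBP, toy$diaBP)
print(bp_test$estimate)
print(bp_test$p.value)
```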

Potential dependency between categorical variables

An analysis of the dependency between the two categorical variables, male and education, was performed. 
#mosaic plot with male on the x-axis and each bar divided by education level
mosaicplot(heart.male ~ heart.education, data=hearts, color=c("blue", "red", "pink", "purple"), main = "Sex vs. Education")

There are more women than men in the college-educated group, and also in the some high school group. The other educational categories, however, do not follow this trend: there are more men in the high school/GED category, as well as in the some college category.
#linear regression of male on education to check for an association
compare = lm(hearts$heart.male ~ hearts$heart.education, data=hearts)
summary(compare)
## 
## Call:
## lm(formula = hearts$heart.male ~ hearts$heart.education, data = hearts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.443 -0.427 -0.419  0.573  0.581 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.410946   0.016853   24.38   <2e-16 ***
## hearts$heart.education 0.008026   0.007569    1.06    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4947 on 4105 degrees of freedom
## Multiple R-squared:  0.0002738,  Adjusted R-squared:  3.029e-05 
## F-statistic: 1.124 on 1 and 4105 DF,  p-value: 0.289
par(mfrow = c(2,2))
plot(compare)

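With a p-value of 0.289, the regression provides no statistically significant evidence of a linear association between the male and education variables. Because this model treats the education codes as a numeric score, the result does not by itself rule out a non-linear dependency between the two categorical variables.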
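Since both male and education are categorical, a chi-squared test of independence on their contingency table is a common complementary check of dependency. The sketch below runs such a test on invented codes (`male_code` and `edu_code` are hypothetical), so its output says nothing about the Framingham data. It is offered as an illustration of the technique, not as the analysis performed in this report.

```r
# Invented codes standing in for the male and education columns
male_code <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1)
edu_code  <- c(1, 1, 1, 3, 4, 4, 1, 2, 2, 3, 3, 4, 2, 1, 3, 4)

# Contingency table of sex by education level
tab <- table(male = male_code, education = edu_code)
print(tab)

# Pearson chi-squared test of independence.
# The toy counts are small, so the p-value is simulated rather than
# taken from the chi-squared approximation.
chi <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(chi$statistic)
print(chi$p.value)
```

With the full dataset the cell counts would be large, so the default call `chisq.test(tab)` without simulation would normally be adequate.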