Basic Data Cleaning and Statistical Methods Implemented Using UCI Heart Disease Data Set

Introduction

In this project, I demonstrate a few techniques for checking the cleanliness of a data set and then run a couple of basic statistical analyses.

The data set used is titled ‘Heart Disease’, which is a data set located on the UCI Machine Learning Repository site. The data set can be found at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

The Heart Disease data set contains a subset of 14 attributes and a total of 303 individual instances. It was donated to UCI on July 1, 1988, so it is outdated as a picture of heart disease. It is used here only to demonstrate data cleaning and a couple of statistical analyses, and it should not be used to draw conclusions about heart disease today.

Loading the Data Set into a new Data Frame

First, I will load the data set into a data frame named ‘UCI_heart’.

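#Load the packages used below: dplyr (%>%, distinct, count), ggplot2, outliers, pander, and car (qqPlot).
for (pkg in c("dplyr", "ggplot2", "outliers", "pander", "car")) library(pkg, character.only = TRUE)
#Loads the data into a new data frame.
UCI_heart <- read.csv("~/R datasets/UCIheart.csv")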

Data Cleaning

First, I want to understand my data frame before starting the data cleaning process.

#Inspect the data frame
head(UCI_heart)

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1

#View the column names in the data frame.
colnames(UCI_heart)

##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"

#View a summary of the data frame.
summary(UCI_heart)

##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

#View data types in the data frame.
str(UCI_heart)

## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

Missing and Duplicated Values

The next step is to check whether the data frame has any missing values. To do so, I sum the number of cells that are NA. The result below shows there are no missing values in the data frame.

#Check if there are any missing values.
sum(is.na(UCI_heart))

## [1] 0
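A single total does not show which columns are affected when it is nonzero. As a minimal sketch on made-up values (the `toy` data frame is hypothetical and is not the heart data), `colSums(is.na(...))` breaks the count down by column:

```r
# Hypothetical toy data frame with two missing cells (not the UCI data).
toy <- data.frame(age = c(63, NA, 41), chol = c(233, 250, NA), sex = c(1, 1, 0))

# Total number of missing cells across the whole data frame.
total_missing <- sum(is.na(toy))

# Missing cells broken down by column.
missing_by_column <- colSums(is.na(toy))
missing_by_column
```

A per-column view like this makes it easier to decide whether to drop rows, drop a column, or impute.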

Next, I want to see if there are any duplicated entries present in my data frame. The duplicated() function is used to see if there are duplicates.

#Identify if there are duplicates in the data frame.
duplicates <- UCI_heart%>%duplicated()

#Displays how many duplicates are present in a table. If a value is not a duplicate, it is placed in 'FALSE'. If the value is a duplicate, it is placed in 'TRUE'.
duplicates_counts <- duplicates%>%table()
duplicates_counts

## .
## FALSE  TRUE 
##   302     1

From the duplicates table, there is one duplicate in the data frame.

#Removes the duplicate rows by only returning the unique rows.
UCI_heart <- UCI_heart%>%distinct()
#Displays how many duplicates are present in the updated data frame.
duplicates_counts_unique <- UCI_heart%>%duplicated()%>%table()
duplicates_counts_unique

## .
## FALSE 
##   302
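The same check can be done without dplyr. The sketch below uses only base R on a made-up data frame (`toy` is hypothetical): `duplicated()` flags every row that repeats an earlier one, and negating it keeps the first occurrence of each row.

```r
# Hypothetical toy data frame in which row 3 repeats row 1.
toy <- data.frame(age = c(63, 37, 63), chol = c(233, 250, 233))

# duplicated() returns TRUE for each row that repeats an earlier row.
n_duplicates <- sum(duplicated(toy))

# Keep only the first occurrence of each row.
toy_unique <- toy[!duplicated(toy), ]
nrow(toy_unique)
```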

Now that the data frame is clean, I will continue with the next steps.

Distribution of Population Between Males and Females

From looking at the data frame and the information about the original data set on UCI’s website, the variable sex is coded with two integers: 0 representing female and 1 representing male.

I would like to see how many participants of each sex are in the data frame. Since sex is stored as an integer, I first recode it as character labels: ‘0’ becomes ‘Female’ and ‘1’ becomes ‘Male’.

#Rename sex variables to Male and Female
UCI_heart$sex[UCI_heart$sex %in% "0"] = "Female"
UCI_heart$sex[UCI_heart$sex %in% "1"] = "Male"
#attach heart dataset with change in sex variable names
attach(UCI_heart)
#View the updated data frame
head(UCI_heart)

##   age    sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   Male  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   Male  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41 Female  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   Male  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57 Female  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   Male  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1
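An alternative to overwriting the integers with character strings is to store the variable as a factor. This is only a sketch on made-up codes (`sex_code` is hypothetical) and is not what the analysis here does. It assumes the same coding, 0 for female and 1 for male.

```r
# Hypothetical 0/1 codes, using the coding 0 = female, 1 = male.
sex_code <- c(1, 1, 0, 1, 0)

# factor() maps each code to a label and records the allowed levels.
sex_factor <- factor(sex_code, levels = c(0, 1), labels = c("Female", "Male"))

# table() counts the observations in each level.
sex_counts <- table(sex_factor)
sex_counts
```

A factor keeps the set of valid categories attached to the column, which helps plotting and modeling functions treat it as categorical.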

With the renaming complete, a bar chart was created to highlight the distribution of males and females in the data frame.

#Create a bar plot showing the amount of males and females in the data
bar_plot <- ggplot(data = UCI_heart, aes(sex)) + geom_bar(aes(fill = sex))
bar_plot

#Count the amount of males and females in the data frame and return a table
count_gender <- UCI_heart%>%count(sex)
count_gender

##      sex   n
## 1 Female  96
## 2   Male 206

From the bar chart and the counts, there are more than twice as many males (206) as females (96). This is something to keep in mind when running further tests: estimates for the smaller female group are less precise, and a larger female sample could lead to different results than those calculated here.

Outliers in the Subsets

In this data frame, I want to see whether there is a relationship between a participant’s age and their cholesterol level. Before modeling that relationship, I use a scatter plot and box plots, split by sex, to check for outliers in the cholesterol values.

Scatter Plot and Box Plot of Cholesterol Levels for Males and Females

First, here is a scatter plot of age against cholesterol level, colored by sex.

scatter <- ggplot(UCI_heart, aes(x = age, y = chol)) +
            geom_point(aes(color = factor(sex))) +
            labs(title = "Relationship between Age and Cholesterol Level", x = "Age", y = "Cholesterol Level",
                 color = "Sex")

scatter

I can also see if there are any outliers present by creating a boxplot for males and females.

#Create a female subset for cholesterol levels
female_chol <- subset(chol, sex %in% c("Female"))
#Create a male subset for cholesterol levels
male_chol <- subset(chol, sex %in% c("Male"))
#Boxplots created for side-by-side comparison
range <- range(female_chol, male_chol)
par(mfrow = c(1,2))
boxplot(female_chol, xlab = 'Female', ylab = 'Cholesterol', ylim = range)
boxplot(male_chol, xlab = 'Male', ylab = 'Cholesterol', ylim = range)

#A data table is created using pander to see the mean, median, and IQR for both subsets
set.caption("Cholesterol levels for Males and Females")
#Creates data table
data.table = rbind(c(mean(female_chol), median(female_chol), IQR(female_chol)),   c(mean(male_chol), median(male_chol), IQR(male_chol)))
#Creates column names
colnames(data.table) = c("Mean", "Median", "IQR")
#Creates row names
rownames(data.table) = c("Female", "Male")
#Calls pander to create the data table
pander(data.table)

Cholesterol levels for Males and Females
	Mean	Median	IQR
Female	261.3	253	82
Male	239.6	235	59.75

#Count the participants of each sex
table(sex)

## sex
## Female   Male 
##     96    206
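The per-group summary can also be built without creating a separate vector for each sex. The sketch below uses `tapply()` on a made-up data frame (`toy` and its values are hypothetical) to compute the same three statistics for every group at once:

```r
# Hypothetical toy data (not the heart data).
toy <- data.frame(sex  = c("Female", "Male", "Female", "Male", "Male"),
                  chol = c(250, 230, 270, 210, 250))

# Apply each summary function to chol within each sex.
group_mean   <- tapply(toy$chol, toy$sex, mean)
group_median <- tapply(toy$chol, toy$sex, median)
group_iqr    <- tapply(toy$chol, toy$sex, IQR)

# Bind the results into one table with a row per group.
summary_table <- cbind(Mean = group_mean, Median = group_median, IQR = group_iqr)
summary_table
```

The resulting matrix has the same layout as the table above and could be passed to `pander()` in the same way.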

Identify and Remove Outliers

From both the scatter plot and box plot, there appears to be an outlier in the female subset with a cholesterol level well above 500. There are many methods for testing for outliers. I chose the chi-squared test for outliers, available in the outliers package as chisq.out.test(). By default it tests the single value farthest from the sample mean, whether that value lies in the left or the right tail of the data.

The status of an outlier can be determined by looking at the p-value reported by the chi-squared test. A p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. For the chi-squared outlier test, the null hypothesis is that the most extreme value is not an outlier, that is, it comes from the same distribution as the rest of the subset. If the p-value is less than 0.05, the null hypothesis is rejected and the value in question is treated as an outlier.

Below is the result for testing the highest value in the female subset:

# Test for outlier on the female subset
chisq.out.test(female_chol)

## 
##  chi-squared test for outlier
## 
## data:  female_chol
## X-squared = 21.627, p-value = 3.311e-06
## alternative hypothesis: highest value 564 is an outlier

From the chi-squared test for outlier, it is determined that the value of 564 is an outlier.
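To make the statistic less of a black box, here is a base R sketch of how I understand the default calculation: the squared distance of the most extreme value from the sample mean, divided by the sample variance, compared against a chi-squared distribution with one degree of freedom. That description of the internals is my assumption about chisq.out.test(), and the values in `x` are made up.

```r
# Hypothetical cholesterol-like values with one extreme reading.
x <- c(210, 225, 240, 250, 255, 265, 270, 564)

# The value farthest from the sample mean is the outlier candidate.
candidate <- x[which.max(abs(x - mean(x)))]

# Squared distance from the mean, scaled by the sample variance.
x_squared <- (candidate - mean(x))^2 / var(x)

# Upper-tail probability from a chi-squared distribution with 1 degree of freedom.
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)
c(candidate = candidate, x_squared = x_squared, p_value = p_value)
```

Because the candidate is itself included in the mean and variance, a very extreme value inflates the denominator, which is one reason this test is usually treated as a rough screen rather than a definitive verdict.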

Another option is the outlier() function in the same package, which returns the value farthest from the sample mean. In the code below, it confirms which value in the female subset is being flagged:

outlier(female_chol)

## [1] 564

Now that it has been determined there is an outlier in the female subset, another subset is created in order to remove the outlier from the original female subset.

#Remove outlier in the female subset
female_chol_2 <- rm.outlier(female_chol)
#Re-run box plots to view the updated box plot
par(mfrow = c(1,2))
boxplot(female_chol_2, xlab = 'Female', ylab = 'Cholesterol', ylim = range)
boxplot(male_chol, xlab = 'Male', ylab = 'Cholesterol', ylim = range)

#Check to see if there are any missing values
sum(is.na(female_chol_2))

## [1] 0

sum(is.na(male_chol))

## [1] 0

#A data table is created using pander to see the mean, median, and IQR for both subsets
set.caption("Cholesterol levels for Males and Females")
#Creates data table
data.table_1 = rbind(c(mean(female_chol_2), median(female_chol_2), IQR(female_chol_2)),   c(mean(male_chol), median(male_chol), IQR(male_chol)))
#Creates column names
colnames(data.table_1) = c("Mean", "Median", "IQR")
#Creates row names
rownames(data.table_1) = c("Female", "Male")
#Calls pander to create the data table
pander(data.table_1)

Cholesterol levels for Males and Females
	Mean	Median	IQR
Female	258.1	252	80
Male	239.6	235	59.75

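#Count the participants of each sex. table(sex) reads the attached UCI_heart column, so Female still shows 96 even though female_chol_2 has one value fewer.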
table(sex)

## sex
## Female   Male 
##     96    206

From the box plot above, there appears to be another outlier present in the female subset. A drawback of the chi-squared test for outliers is that it tests only the single most extreme value. To see if there is another outlier in the female subset, I decided to use a Q-Q plot, drawn with qqPlot() from the car package. A Q-Q plot also helps show whether the data are approximately normally distributed in the male and female subsets.

par(mfrow = c(1,2))
qqPlot(female_chol_2, ylab = "Female")

## [1] 11 84

qqPlot(male_chol, ylab = "Male")

## [1]  68 107

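From looking at the Q-Q plots for the male and female subsets, there does not appear to be an obvious outlier in either subset. The indices printed by qqPlot() (11 and 84 for females, 68 and 107 for males) identify the two most extreme points in each subset, not confirmed outliers. The data also look relatively normally distributed. As such, I’m deciding to continue with my statistical analysis.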

Two Sample T-Test

After cleaning the data and removing outliers, the next step is to begin a statistical analysis. In this sample, I will see if there is a difference in cholesterol levels between males and females.

Two basic statistical tests that can be used to see whether two groups differ are the t-test and the z-test. A z-test is used when the population standard deviation is known and the sample size is greater than 30. If the population standard deviation is not known or the sample size is 30 or less, a t-test is used.

In this case, since the population standard deviation is unknown, a t-test will be used. Further, a two sample t-test will be used since the means of two different samples (males and females) are compared.

#Two-sample t-test comparing cholesterol levels in the female and male subsets
#Null hypothesis: There is no difference in mean cholesterol level between males and females
t.test(female_chol_2, male_chol)

## 
##  Welch Two Sample t-test
## 
## data:  female_chol_2 and male_chol
## t = 2.806, df = 143.66, p-value = 0.005711
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5.472454 31.555241
## sample estimates:
## mean of x mean of y 
##  258.1158  239.6019

The null hypothesis states that there is no difference between the mean cholesterol levels of males and females. Because t.test() does not assume equal variances by default, it ran Welch’s version of the test. The p-value of 0.005711 is less than 0.05, so we can reject the null hypothesis and conclude that there is a statistically significant difference in mean cholesterol between males and females in this sample.
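To show where the t statistic and the fractional degrees of freedom come from, the sketch below computes the Welch statistic by hand and compares it with `t.test()`. The two samples are made up (`group_a` and `group_b` are hypothetical), so the numbers do not correspond to the output above.

```r
# Hypothetical samples (not the heart data).
group_a <- c(250, 262, 241, 275, 258, 249)
group_b <- c(232, 228, 245, 236, 240, 229, 238)

# Variance of each sample mean; the two variances are not pooled.
v_a <- var(group_a) / length(group_a)
v_b <- var(group_b) / length(group_b)

# Welch t statistic: difference in means over its standard error.
t_manual <- (mean(group_a) - mean(group_b)) / sqrt(v_a + v_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (v_a + v_b)^2 /
  (v_a^2 / (length(group_a) - 1) + v_b^2 / (length(group_b) - 1))

# t.test() defaults to var.equal = FALSE, which is the Welch test.
welch <- t.test(group_a, group_b)
c(t_manual = t_manual, df_manual = df_manual)
```

The non-integer degrees of freedom in the output above (143.66) are a sign that this approximation was used rather than the pooled-variance formula.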

Keep in mind that there are more than twice as many males as females in the sample. My recommendation would be to collect more equally sized groups before drawing firm conclusions about a difference in cholesterol between males and females.

Linear Regression

Another statistical method is linear regression. Linear regression is used to see if there is a linear relationship between variables. In this case, since only two variables are being compared to one another, the type of linear regression being used is called simple linear regression. Simple linear regression is used to predict the value of one variable based on the value of another.

The formula for simple linear regression is y’ = ax + b. If there is a linear relationship in the data, we can predict the value of y (dependent variable) based upon x (explanatory variable). In the equation, b is the intercept value (value of y when x = 0) and a is the slope of the line. In the cases below, I want to see if there is a linear relationship between cholesterol level (y value) and age (x value) for each sex. In other words, can I predict a person’s cholesterol level based on their age?
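For a single explanatory variable, the least-squares slope and intercept have a closed form: a is the covariance of x and y divided by the variance of x, and b is the mean of y minus a times the mean of x. The sketch below checks this against `lm()` on made-up values (`age_toy` and `chol_toy` are hypothetical):

```r
# Hypothetical ages and cholesterol values (not the heart data).
age_toy  <- c(35, 42, 50, 57, 63, 70)
chol_toy <- c(205, 221, 230, 248, 251, 270)

# Least-squares slope (a) and intercept (b) for y' = ax + b.
a <- cov(age_toy, chol_toy) / var(age_toy)
b <- mean(chol_toy) - a * mean(age_toy)

# lm() reports the same values as (Intercept) and the age_toy coefficient.
fit <- lm(chol_toy ~ age_toy)
coef(fit)
```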

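First, we remove the cholesterol values that boxplot() flags as outliers from the main data frame. This applies the 1.5 × IQR rule to all participants, so it removes the 564 reading found earlier along with any other flagged values:

#Collect the cholesterol values that boxplot() flags as outliers
outliers <- boxplot(UCI_heart$chol, plot = FALSE)$out

#Keep only the rows whose cholesterol value is not flagged, and still return a table.
#Logical indexing is used so that nothing is dropped by mistake if no value is flagged.
UCI_2 <- UCI_heart[!(UCI_heart$chol %in% outliers), ]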
#View UCI_2 to see if it's still a table
head(UCI_2)

##   age    sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   Male  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   Male  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41 Female  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   Male  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57 Female  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   Male  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1

Now, we can create a simple linear regression model for both the male and female subsets. First, the ggplot2 package is used to plot the simple linear regression lines for both subsets:

ggplot(UCI_2, aes(x = age, y = chol, color = sex)) + geom_point() + geom_smooth(aes(group = sex),method = "lm") + labs(x = "Age", y = "Cholesterol Level")

## `geom_smooth()` using formula 'y ~ x'

The shaded region around each line is the 95% confidence interval for the fitted line, that is, for the mean cholesterol level at each age. It does not describe where individual observations should fall, so points outside the band are expected. What matters here is that the points are widely scattered around both lines. Based on this, I would not be confident that a person’s age can accurately predict their cholesterol level.
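It can help to distinguish two kinds of interval around a regression line: a confidence interval for the mean response at a given x, and a prediction interval for a single new observation. The sketch below computes both with `predict()` on made-up data (`toy` and `new_point` are hypothetical):

```r
# Hypothetical data (not the heart data).
toy <- data.frame(age  = c(35, 42, 50, 57, 63, 70, 45, 60),
                  chol = c(205, 260, 215, 280, 230, 270, 190, 255))
fit <- lm(chol ~ age, data = toy)
new_point <- data.frame(age = 55)

# Interval for the mean cholesterol level at age 55.
conf_int <- predict(fit, newdata = new_point, interval = "confidence", level = 0.95)

# Interval for one new individual aged 55.
pred_int <- predict(fit, newdata = new_point, interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The prediction interval is always wider because it adds the scatter of individual observations to the uncertainty in the fitted line, so it is the more relevant one when judging how well age predicts one person's cholesterol.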

Another way to draw the simple linear regression lines is with base R graphics:

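#Map each sex to a plotting code (Male = 1, Female = 2) that matches the legend.
#as.numeric() on the character labels would return only NAs, so ifelse() is used instead.
#The attached sex column is used so the codes line up with age and chol in plot().
sexsymb <- ifelse(sex == "Male", 1, 2)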

#Creates the plot
plot(age, chol, pch = sexsymb, col = sexsymb+1, ylab = "Cholesterol", xlab = "Age",
     ylim = c(200,300))
#Creates the linear regression line for males
abline(lm(chol~age, subset = sex == "Male"), lty = 1, col = 2)
#Creates the linear regression line for females
abline(lm(chol~age, subset = sex == "Female"), lty = 2, col = 3)
legend("topright", legend = c("Male", "Female"), pch = c(1, 2),
       lty = 1:2, col = 2:3, bty = "n")

#Creates the linear regression equation for Males subset
male_fit <- lm(chol~age, subset = sex == "Male")
#Creates the linear regression equation for Females subset
female_fit <- lm(chol~age, subset = sex == "Female")
#Print the summary of the Males subset to view linear regression equation
summary(male_fit)

## 
## Call:
## lm(formula = chol ~ age, subset = sex == "Male")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.862  -31.148   -3.436   27.141  112.566 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.1655    18.2310  11.034   <2e-16 ***
## age           0.7140     0.3342   2.136   0.0338 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.28 on 204 degrees of freedom
## Multiple R-squared:  0.02188,    Adjusted R-squared:  0.01709 
## F-statistic: 4.564 on 1 and 204 DF,  p-value: 0.03384

#Print the summary of the Females subset to view linear regression equation
summary(female_fit)

## 
## Call:
## lm(formula = chol ~ age, subset = sex == "Female")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -139.919  -40.015   -8.582   32.375  282.290 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 160.9537    38.8906   4.139 7.61e-05 ***
## age           1.8023     0.6888   2.616   0.0104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.17 on 94 degrees of freedom
## Multiple R-squared:  0.06789,    Adjusted R-squared:  0.05797 
## F-statistic: 6.846 on 1 and 94 DF,  p-value: 0.01035

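From this, we can view the simple linear regression plot and formula for the males and females subsets. Note that these lm() calls have no data argument, so they are fit on the attached UCI_heart columns rather than on UCI_2. The female model has 94 residual degrees of freedom, which corresponds to all 96 female rows, so it still includes the 564 reading. Passing data = UCI_2 would fit the models on the trimmed data instead.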

For the males, the equation for predicting the cholesterol level would be cholesterol level = 0.7140*(age) + 201.1655.

For the females, the equation for predicting the cholesterol level would be cholesterol level = 1.8023*(age) + 160.9537.

However, an important statistic to keep in mind is the R² value. R² is known as the coefficient of determination and describes the proportion of the variance in the outcome Y that is explained by the regression model. In other words, it measures how well the linear regression model fits the data.

In the Males regression model, the multiple R² is 0.02188 and the adjusted R² is 0.01709, which means the model explains only about 2% of the variability in cholesterol levels. The Females regression model isn’t much better: its multiple R² is 0.06789 and its adjusted R² is 0.05797, so roughly 6 to 7% of the variability in cholesterol levels is explained by the model.

With such a low R² value for both the Males and Females subsets, it does not appear either regression model is a reliable model for predicting cholesterol levels in males and females based upon a subject’s age.
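Both versions of R² reported by `summary()` can be reproduced from the residuals. The sketch below does this on made-up data (`toy` is hypothetical), using the usual definitions: R² is one minus the residual sum of squares over the total sum of squares, and adjusted R² penalizes that value for the number of predictors.

```r
# Hypothetical data (not the heart data).
toy <- data.frame(age  = c(35, 42, 50, 57, 63, 70, 45, 60),
                  chol = c(205, 260, 215, 280, 230, 270, 190, 255))
fit <- lm(chol ~ age, data = toy)

# Residual and total sums of squares.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$chol - mean(toy$chol))^2)

# Multiple R-squared: share of the variance explained by the model.
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared for n observations and p predictors.
n <- nrow(toy)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(r2 = r2, adj_r2 = adj_r2)
```

With a single predictor the two values are close, and the adjusted value is always the smaller of the two.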

Conclusion

Using the Heart Disease data from the UCI Machine Learning Repository site, I demonstrated basic data cleaning techniques as well as basic statistical methods. The statistical methods included using scatter plots and box plots to view data, as well as how to identify and properly remove an outlier from a data set using the chi-square test for outliers and viewing a qqplot for additional analysis. Next, a two sample t-test was run to see if there was a difference in cholesterol levels between males and females. From the t-test run above, it was concluded there was a difference in cholesterol levels between males and females.

Next, a simple linear regression model was fit for both the Males and Females subsets in order to see if we can predict the cholesterol level of an individual based on their age. After running linear regressions for both subsets, it was determined that age alone does not accurately predict cholesterol levels for males or females. This is not surprising, since the model only used one factor in attempting to predict the cholesterol level. If I were to run another analysis on this data, I would use multiple linear regression modeling for a more accurate prediction of an individual’s cholesterol level.

This analysis was meant to be a simple breakdown on data cleaning and the use of basic statistical analyses. If I were to further analyze this data, I would use more of the variables presented in the data set to see if there is any correlation in the factors presented.

Thank you to the reader for viewing my work.