Downloading and Cleaning Data, Some Preliminary Steps

It is a demonstration of downloading data from an online repository and conducting some intuitive data preparation, and or cleaning activities for further analyses. Thus, I am going to use the hear-disease data which come from the University of California, Irvin online repository, and I am going to use the processed data from Hungarian sample. Here’s the link for the data: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data

1. Downloading Data from a Data Repository

Downloading the data in our machine using the provided url. Here, I am telling r to follow the given path and access the data. In addition, I assigned a vector called ‘location’ to make it easy for me do manipulate afterward.

location <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"

The data is not read, yet. Because I haven’t told r to read the data. Now, let’s tell r to read the data:

heart_data <- read.csv(location, header = FALSE)

And quickly check it the above process worked. I am going to use the ‘head()’ command. It is quick and easy and provides only first 6-rows of the data set.

head(heart_data)

  V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
1 28  1  2 130 132  0  2 185  0   0   ?   ?   ?   0
2 29  1  2 120 243  0  0 160  0   0   ?   ?   ?   0
3 29  1  2 140   ?  0  0 170  0   0   ?   ?   ?   0
4 30  0  1 170 237  0  1 170  0   0   ?   ?   6   0
5 31  0  2 100 219  0  1 150  0   0   ?   ?   ?   0
6 32  0  2 105 198  0  0 165  0   0   ?   ?   ?   0

Hurrah! It worked. However, when we look at the outcome, we see that we are missing the names of the variables. There are 14 variables and none of them have the names. It is pretty normal when we work on the archived data. I have to go back and find out what are these variables and how are they labeled.

Variable Information:

-Column 1: age:– age in years
-Column 2: sex:– sex (1 = male; 0 = female)
-Column 3: cp:– chest pain type

    -- Value 1: typical angina
    -- Value 2: atypical angina
    -- Value 3: non-anginal pain
    -- Value 4: asymptomatic

-Column 4: trestbps:– resting blood pressure (in mm Hg on admission to the hospital)

-Column 5: chol:– serum cholestoral in mg/dl
-Column 6: fbs:– (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
-Column 7: restecg:– resting electrocardiographic results

    -- Value 0: normal
    -- Value 1: having ST-T wave abnormality (T wave    
    inversions and/or ST elevation or depression of > 0.05 mV)
    -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

-Column 8: thalach:– maximum heart rate achieved

-Column 9: exang:– exercise induced angina (1 = yes; 0 = no)
-Column 10: oldpeak:– ST depression induced by exercise relative to rest
-Column 11: slope:– the slope of the peak exercise ST segment

    -- Value 1: upsloping
    -- Value 2: flat
    -- Value 3: downsloping

-Column 12: ca:– number of major vessels (0-3) colored by flourosopy
-Column 13: thal:– 3 = normal; 6 = fixed defect; 7 = reversable defect
-Column 14: num:– the predicted attribute; diagnosis of heart disease (angiographic disease status)

    -- Value 0: < 50% diameter narrowing
    -- Value 1: > 50% diameter narrowing

(Source: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names)

1.a. Assigning Names to the Variables in the Dataset

To label the columns,I used the above information and supplied names accordingly. Remember, we use ‘colnames()’ function.

colnames(heart_data) <- c(
  "age", "sex", "cp", "trestecg", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"
)
head(heart_data)# Checking if that worked

  age sex cp trestecg chol fbs restecg thalach exang oldpeak slope ca thal num
1  28   1  2      130  132   0       2     185     0       0     ?  ?    ?   0
2  29   1  2      120  243   0       0     160     0       0     ?  ?    ?   0
3  29   1  2      140    ?   0       0     170     0       0     ?  ?    ?   0
4  30   0  1      170  237   0       1     170     0       0     ?  ?    6   0
5  31   0  2      100  219   0       1     150     0       0     ?  ?    ?   0
6  32   0  2      105  198   0       0     165     0       0     ?  ?    ?   0

The renaming of the variables definitely worked. However, I noticed some new problems, i.e., I have seen some rows with ‘?’ signs. I assume they are missing data. I want them to be replaced by ‘NA’ instead.

1.b. Replacing Unusual Signs of Missing Data by NAs

Because I don’t want to change the name of the dataset at this point, and I still want my data set to reflect the changes I am going to make, I am am going to use ‘heart_data’ both in and outside of the bracket. Here, the command [heart_data == “?”] is applicable to the whole data set. So, all the “?” sing in the whole data set will be replaced by ‘NA’, the the resulting dataset will still be named ‘heart_data’. Let’s write the code and run.

heart_data[heart_data == "?"] <- NA
head(heart_data)

  age sex cp trestecg chol fbs restecg thalach exang oldpeak slope   ca thal
1  28   1  2      130  132   0       2     185     0       0  <NA> <NA> <NA>
2  29   1  2      120  243   0       0     160     0       0  <NA> <NA> <NA>
3  29   1  2      140 <NA>   0       0     170     0       0  <NA> <NA> <NA>
4  30   0  1      170  237   0       1     170     0       0  <NA> <NA>    6
5  31   0  2      100  219   0       1     150     0       0  <NA> <NA> <NA>
6  32   0  2      105  198   0       0     165     0       0  <NA> <NA> <NA>
  num
1   0
2   0
3   0
4   0
5   0
6   0

I checked the first 6 rows of the new data set and found that the missing data have now been denoted by ‘NA’ rather then the ‘?’ sign.

1.c. Checking Variable Types and Changing them as Needed

It is always prudent to check the variables types when you download the data. If we don’t, we encounter with problems or get wrong results during analysis. The easy way to check the variable types is to use the str()function.

Lets’ check the structure, type of all of the variables included in the heart_data data set.

str(heart_data)

'data.frame':   294 obs. of  14 variables:
 $ age     : int  28 29 29 30 31 32 32 32 33 34 ...
 $ sex     : int  1 1 1 0 0 0 1 1 1 0 ...
 $ cp      : int  2 2 2 1 2 2 2 2 3 2 ...
 $ trestecg: chr  "130" "120" "140" "170" ...
 $ chol    : chr  "132" "243" NA "237" ...
 $ fbs     : chr  "0" "0" "0" "0" ...
 $ restecg : chr  "2" "0" "0" "1" ...
 $ thalach : chr  "185" "160" "170" "170" ...
 $ exang   : chr  "0" "0" "0" "0" ...
 $ oldpeak : num  0 0 0 0 0 0 0 0 0 0 ...
 $ slope   : chr  NA NA NA NA ...
 $ ca      : chr  NA NA NA NA ...
 $ thal    : chr  NA NA NA "6" ...
 $ num     : int  0 0 0 0 0 0 0 0 0 0 ...

We got variable itemized information. There are total of 294 observations under 14 variables:

Age, sex, cd, and num are mared as integers
trestecg, chol, fbs, restecg, thalach, exang, slope, ca, and thl are marked as character vectors, and
num as numerical vector

1.d. Changing the Variable Types

# Into factor
heart_data$sex <- as.factor(heart_data$sex)
heart_data$cp <- as.factor(heart_data$cp)
heart_data$fbs <- as.factor(heart_data$fbs)
heart_data$restecg <- as.factor(heart_data$restecg)
heart_data$exang <- as.factor(heart_data$exang)
heart_data$slope <- as.factor(heart_data$slope)
heart_data$num <- as.factor(heart_data$num)

# Into Integer and Into Factor
heart_data$ca <- as.integer(heart_data$ca)
heart_data$ca <- as.factor(heart_data$ca)

heart_data$thal <- as.integer(heart_data$thal)
heart_data$thal <- as.factor(heart_data$thal)
str(heart_data)

'data.frame':   294 obs. of  14 variables:
 $ age     : int  28 29 29 30 31 32 32 32 33 34 ...
 $ sex     : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 2 2 1 ...
 $ cp      : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 2 3 2 ...
 $ trestecg: chr  "130" "120" "140" "170" ...
 $ chol    : chr  "132" "243" NA "237" ...
 $ fbs     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ restecg : Factor w/ 3 levels "0","1","2": 3 1 1 2 2 1 1 1 1 1 ...
 $ thalach : chr  "185" "160" "170" "170" ...
 $ exang   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ oldpeak : num  0 0 0 0 0 0 0 0 0 0 ...
 $ slope   : Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
 $ ca      : Factor w/ 1 level "0": NA NA NA NA NA NA NA NA NA NA ...
 $ thal    : Factor w/ 3 levels "3","6","7": NA NA NA 2 NA NA NA NA NA NA ...
 $ num     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

In addition, I can see that the sex, cd, fbs, restecg, exang, and num have do have numerical representation to show their various categories. We don’t have to change the numerical representation by the actual categores like ‘male’, ‘female’, to be successfully conduct the analyses, but we can change them if we want to. For the purpose of my own prctice, I am going to go ahead and change the values by the actual categories:

1.e. Checking the Missing Values by Variables

Having a lot of missing values creates problem in data analysis. Especially in the social science, having more than 5% of missing data is problematic. However, it is essential to realize that there’s hardly any dataset (if we collect them) without the missing values. There are various reasons for having some missing values. Thus, it is desirable that we analyze total number of missing cases.

summary(is.na(heart_data))# Gives overall synopsis of missing data by variables

    age             sex              cp           trestecg      
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:294       FALSE:294       FALSE:294       FALSE:293      
                                                 TRUE :1        
    chol            fbs           restecg         thalach       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:271       FALSE:286       FALSE:293       FALSE:293      
 TRUE :23        TRUE :8         TRUE :1         TRUE :1        
   exang          oldpeak          slope             ca         
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:293       FALSE:294       FALSE:104       FALSE:3        
 TRUE :1                         TRUE :190       TRUE :291      
    thal            num         
 Mode :logical   Mode :logical  
 FALSE:28        FALSE:294      
 TRUE :266

mean(is.na(heart_data)) #provides the mean of the missing data

[1] 0.1899903

The outcome of the summary funtion suggests that

most of the varaibles do not have even a single missing data, e.g., age, sex, cp, oldpeak, & num.
Some variable have less than 5 missing values, e.g., trestecg, restecg, thalach, and exang
Some variables do have good amount of missing data, e.g., chol, and fbs, and
Some variables, i.e., ca, thal, & slope have extremelty high number of missing cases, i.e, 291, 266, & 190, respectively, Which are approximately 99%, 90%, and 65% of total available data points. They collectively brought the volume of the mean missing data to approximately 19%. This sheer volume of missing data can easily skew our findings, or our findings lack the validity and/or generalizability.

1.f. Getting Rid of Troublesome Variables from the Model

So, we will be better off if we omit these variables from our data set. There are various ways to do so: we can either remove each column one by one or all three columns, at once. I am going to remove them at once. First of all I will identify their position in the data set, and use NULL to replace them. Slope is in the 11th position, while ‘ca’ and ‘thal’ are in 12th, and 13th positions.

heart_data[11:13] <- list(NULL) #It will change the colum
head(heart_data)

  age sex cp trestecg chol fbs restecg thalach exang oldpeak num
1  28   1  2      130  132   0       2     185     0       0   0
2  29   1  2      120  243   0       0     160     0       0   0
3  29   1  2      140 <NA>   0       0     170     0       0   0
4  30   0  1      170  237   0       1     170     0       0   0
5  31   0  2      100  219   0       1     150     0       0   0
6  32   0  2      105  198   0       0     165     0       0   0

Looks like that worked. I no longer see the ‘slope’ variable. Once again, let’s check the volume of the missing data.

mean(is.na(heart_data)) #provides the mean of the missing data

[1] 0.01082251

After removing these lackluster varaibles, total number of missing data is approximately 1%. Good news!

1.g. Deleting the Missing Data Points (Casewise Deletion)

And, now, lets get rid of missing data points. You don’t have to do it, but I am doing it for the purpose of practicing this command.

heart_data <- na.omit(heart_data) #Removes the missing rows from the data
summary(heart_data)

      age        sex     cp        trestecg             chol           fbs    
 Min.   :28.00   0: 69   1: 10   Length:261         Length:261         0:242  
 1st Qu.:42.00   1:192   2: 92   Class :character   Class :character   1: 19  
 Median :49.00           3: 46   Mode  :character   Mode  :character          
 Mean   :47.77           4:113                                                
 3rd Qu.:54.00                                                                
 Max.   :65.00                                                                
 restecg   thalach          exang      oldpeak       num    
 0:208   Length:261         0:178   Min.   :0.0000   0:163  
 1: 47   Class :character   1: 83   1st Qu.:0.0000   1: 98  
 2:  6   Mode  :character           Median :0.0000          
                                    Mean   :0.6123          
                                    3rd Qu.:1.0000          
                                    Max.   :5.0000

The summary table shows variable wise distribution of the data. We don’t see any missing values.

1.h. Changing Numerical Values to Actual Categories

Following the inmormation given to each variable, I am going to change the numerical categorical values of the variables sed, cp, fbs, restecg, num, and exang by their respective categories.

heart_data$sex <- ifelse(test=heart_data$sex == 0, yes = "Female", no = "Male")

heart_data$exang <- ifelse(test=heart_data$exang == 0, yes = "ExInduced_Angina", no = "No")

heart_data$fbs <- ifelse(test = heart_data$fbs == 0, yes = "FALSE", no = "TRUE")

heart_data$cp = with(heart_data, ifelse(cp == 1, "Typical_Angina", ifelse(cp == 2, "Atypical_Angina", ifelse(cp == 3, "Non_Anginal_Pain","Asymptomatic"))))

heart_data$restecg = with(heart_data, ifelse(restecg == 0, "Normal", ifelse(restecg == 1, "ST-T_WabNorm", "LV_Hypertrophy")))

heart_data$num <- ifelse(test=heart_data$num == 0, yes = "LessThanFifty", no = "MoreThanFifty")

'data.frame':   261 obs. of  11 variables:
 $ age     : int  28 29 30 31 32 32 32 33 34 34 ...
 $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 2 ...
 $ cp      : Factor w/ 4 levels "Asymptomatic",..: 2 2 4 2 2 2 2 3 2 2 ...
 $ trestecg: num  130 120 170 100 105 110 125 120 130 150 ...
 $ chol    : num  132 243 237 219 198 225 254 298 161 214 ...
 $ fbs     : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ restecg : Factor w/ 3 levels "LV_Hypertrophy",..: 1 2 3 3 2 2 2 2 2 3 ...
 $ thalach : num  185 160 170 150 165 184 155 185 190 168 ...
 $ exang   : Factor w/ 2 levels "ExInduced_Angina",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ oldpeak : num  0 0 0 0 0 0 0 0 0 0 ...
 $ num     : Factor w/ 2 levels "LessThanFifty",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int [1:33] 3 28 32 35 45 66 73 76 82 87 ...
  ..- attr(*, "names")= chr [1:33] "3" "28" "32" "35" ...

Now, I can see that my data are almost ready for further analysis. Lets make the “num”–diagnosis of heart disease (angiographic disease status)–an outcome variable, where, LessThanFifty refers to a good condition of the blood vessel narrowed by less then 50%, and MoreThanFifty to a hear disease condition with blood vessel narrowed by more than 50%.

We are now going to check if the condition if male or female both have the comparable cases of heart disease or it pertains to a particular gender using the ‘xtabs’ function.

xtabs(~ num + sex, data = heart_data)

               sex
num             Female Male
  LessThanFifty     57  106
  MoreThanFifty     12   86

We can see the differences in the numbers of male and female with narrowed and/or regular valve size. Lets now check if they reported the chest pain:

xtabs(~ num + cp, data = heart_data)

               cp
num             Asymptomatic Atypical_Angina Non_Anginal_Pain Typical_Angina
  LessThanFifty           37              84               35              7
  MoreThanFifty           76               8               11              3

Its a mixed outcome. There are bunch of people diagnosed with a heart condition, but they did not report any kind of chest pain. Some people definitely did, though.

Let’s keep checking for all the categorical variables:

xtabs(~ num + fbs, data = heart_data)

               fbs
num             FALSE TRUE
  LessThanFifty   157    6
  MoreThanFifty    85   13

xtabs(~ num + restecg, data = heart_data)

               restecg
num             LV_Hypertrophy Normal ST-T_WabNorm
  LessThanFifty              5    130           28
  MoreThanFifty              1     78           19

xtabs(~ num + exang, data = heart_data)

               exang
num             ExInduced_Angina  No
  LessThanFifty              144  19
  MoreThanFifty               34  64

Doing Logistic Regression

Let’s start with a simple model. We will try to predict heart disease based on patient’s gender.

gender_logistic <- glm(num ~ sex, data = heart_data, family = "binomial")
summary(gender_logistic)


Call:
glm(formula = num ~ sex, family = "binomial", data = heart_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0900  -1.0900  -0.6181   1.2674   1.8704  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.5581     0.3176  -4.906 9.28e-07 ***
sexMale       1.3491     0.3492   3.864 0.000112 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 345.46  on 260  degrees of freedom
Residual deviance: 327.84  on 259  degrees of freedom
AIC: 331.84

Number of Fisher Scoring iterations: 3

Let’s see how all the varaibles predict the log odds

logistic <- glm(num~ ., data = heart_data, family = "binomial")
summary(logistic)


Call:
glm(formula = num ~ ., family = "binomial", data = heart_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7682  -0.4879  -0.2655   0.4366   2.4531  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.163497   3.389683  -0.638  0.52330    
age                 -0.009255   0.029526  -0.313  0.75394    
sexMale              1.307881   0.481892   2.714  0.00665 ** 
cpAtypical_Angina   -2.288454   0.528762  -4.328 1.51e-05 ***
cpNon_Anginal_Pain  -0.884304   0.499009  -1.772  0.07637 .  
cpTypical_Angina    -0.368482   0.967228  -0.381  0.70323    
trestecg            -0.001603   0.011514  -0.139  0.88925    
chol                 0.005218   0.002788   1.872  0.06127 .  
fbsTRUE              1.639892   0.779925   2.103  0.03550 *  
restecgNormal        0.862764   1.715431   0.503  0.61500    
restecgST-T_WabNorm  0.470803   1.763641   0.267  0.78951    
thalach             -0.008658   0.010189  -0.850  0.39549    
exangNo              0.943270   0.487140   1.936  0.05283 .  
oldpeak              1.184473   0.280460   4.223 2.41e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 345.46  on 260  degrees of freedom
Residual deviance: 184.70  on 247  degrees of freedom
AIC: 212.7

Number of Fisher Scoring iterations: 6

Calculating McFadden’s Pseudo R-squared

a. Calculating Log-liklihood of the Null Model

ll.null <- logistic$null.deviance/-2

b. Calculating Log-Likelihood of the Final Model

ll.proposed <- logistic$deviance/-2

c. Calculating Pseudo R-squared or Overall Effect Size

(ll.null - ll.proposed)/ll.null

[1] 0.4653618

d. We can use the same log-likelihood statistics to calculate the p-value for the R-squred using the Chi-square distribution

1 - pchisq(2*(ll.null - ll.proposed),df = (length(logistic$coefficients)-1))

[1] 1

e. Plotting the Results to show the Probability of having a Heart Disease

i. Create a New Data Frame to Store Required Variables

predicted.data <- data.frame(probability.of.hd = logistic$fitted.values, num = heart_data$num)

ii. Sorting the data frame from low to high probabilities

predicted.data <- predicted.data[order(predicted.data$probability.of.hd, decreasing = FALSE),]

iii. Adding a new Column to the Data from that ranks the Data form low to high probability

predicted.data$rank <- 1:nrow(predicted.data)

iv. Using ggplot2 and cowplot libraries and drawing the graph

library(ggplot2)
library(cowplot)
ggplot(data=predicted.data, aes(x=rank, y=probability.of.hd))+
  geom_point(aes(color=num), alpha=1, shape=4, stroke=2)+
  xlab("Index")+
  ylab("Predicted Probability of Getting Heart Disease")

Got Ample Support from: https://www.youtube.com/watch?v=C4N3_XJJ-jU