It is a demonstration of downloading data from an online repository and conducting some intuitive data preparation, and or cleaning activities for further analyses. Thus, I am going to use the hear-disease data which come from the University of California, Irvin online repository, and I am going to use the processed data from Hungarian sample. Here’s the link for the data: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data
Downloading the data in our machine using the provided url. Here, I am telling r to follow the given path and access the data. In addition, I assigned a vector called ‘location’ to make it easy for me do manipulate afterward.
location <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"
The data is not read, yet. Because I haven’t told r to read the data. Now, let’s tell r to read the data:
heart_data <- read.csv(location, header = FALSE)
And quickly check it the above process worked. I am going to use the ‘head()’ command. It is quick and easy and provides only first 6-rows of the data set.
head(heart_data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 28 1 2 130 132 0 2 185 0 0 ? ? ? 0
2 29 1 2 120 243 0 0 160 0 0 ? ? ? 0
3 29 1 2 140 ? 0 0 170 0 0 ? ? ? 0
4 30 0 1 170 237 0 1 170 0 0 ? ? 6 0
5 31 0 2 100 219 0 1 150 0 0 ? ? ? 0
6 32 0 2 105 198 0 0 165 0 0 ? ? ? 0
Hurrah! It worked. However, when we look at the outcome, we see that we are missing the names of the variables. There are 14 variables and none of them have the names. It is pretty normal when we work on the archived data. I have to go back and find out what are these variables and how are they labeled.
Variable Information:
-Column 1: age:– age in years
-Column 2: sex:– sex (1 = male; 0 = female)
-Column 3: cp:– chest pain type
-- Value 1: typical angina
-- Value 2: atypical angina
-- Value 3: non-anginal pain
-- Value 4: asymptomatic
-Column 4: trestbps:– resting blood pressure (in mm Hg on admission to the hospital)
-Column 5: chol:– serum cholestoral in mg/dl
-Column 6: fbs:– (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
-Column 7: restecg:– resting electrocardiographic results
-- Value 0: normal
-- Value 1: having ST-T wave abnormality (T wave
inversions and/or ST elevation or depression of > 0.05 mV)
-- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
-Column 8: thalach:– maximum heart rate achieved
-Column 9: exang:– exercise induced angina (1 = yes; 0 = no)
-Column 10: oldpeak:– ST depression induced by exercise relative to rest
-Column 11: slope:– the slope of the peak exercise ST segment
-- Value 1: upsloping
-- Value 2: flat
-- Value 3: downsloping
-Column 12: ca:– number of major vessels (0-3) colored by flourosopy
-Column 13: thal:– 3 = normal; 6 = fixed defect; 7 = reversable defect
-Column 14: num:– the predicted attribute; diagnosis of heart disease (angiographic disease status)
-- Value 0: < 50% diameter narrowing
-- Value 1: > 50% diameter narrowing
(Source: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names)
To label the columns,I used the above information and supplied names accordingly. Remember, we use ‘colnames()’ function.
colnames(heart_data) <- c(
"age", "sex", "cp", "trestecg", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"
)
head(heart_data)# Checking if that worked
age sex cp trestecg chol fbs restecg thalach exang oldpeak slope ca thal num
1 28 1 2 130 132 0 2 185 0 0 ? ? ? 0
2 29 1 2 120 243 0 0 160 0 0 ? ? ? 0
3 29 1 2 140 ? 0 0 170 0 0 ? ? ? 0
4 30 0 1 170 237 0 1 170 0 0 ? ? 6 0
5 31 0 2 100 219 0 1 150 0 0 ? ? ? 0
6 32 0 2 105 198 0 0 165 0 0 ? ? ? 0
The renaming of the variables definitely worked. However, I noticed some new problems, i.e., I have seen some rows with ‘?’ signs. I assume they are missing data. I want them to be replaced by ‘NA’ instead.
Because I don’t want to change the name of the dataset at this point, and I still want my data set to reflect the changes I am going to make, I am am going to use ‘heart_data’ both in and outside of the bracket. Here, the command [heart_data == “?”] is applicable to the whole data set. So, all the “?” sing in the whole data set will be replaced by ‘NA’, the the resulting dataset will still be named ‘heart_data’. Let’s write the code and run.
heart_data[heart_data == "?"] <- NA
head(heart_data)
age sex cp trestecg chol fbs restecg thalach exang oldpeak slope ca thal
1 28 1 2 130 132 0 2 185 0 0 <NA> <NA> <NA>
2 29 1 2 120 243 0 0 160 0 0 <NA> <NA> <NA>
3 29 1 2 140 <NA> 0 0 170 0 0 <NA> <NA> <NA>
4 30 0 1 170 237 0 1 170 0 0 <NA> <NA> 6
5 31 0 2 100 219 0 1 150 0 0 <NA> <NA> <NA>
6 32 0 2 105 198 0 0 165 0 0 <NA> <NA> <NA>
num
1 0
2 0
3 0
4 0
5 0
6 0
I checked the first 6 rows of the new data set and found that the missing data have now been denoted by ‘NA’ rather then the ‘?’ sign.
It is always prudent to check the variables types when you download the data. If we don’t, we encounter with problems or get wrong results during analysis. The easy way to check the variable types is to use the str()function.
Lets’ check the structure, type of all of the variables included in the heart_data data set.
str(heart_data)
'data.frame': 294 obs. of 14 variables:
$ age : int 28 29 29 30 31 32 32 32 33 34 ...
$ sex : int 1 1 1 0 0 0 1 1 1 0 ...
$ cp : int 2 2 2 1 2 2 2 2 3 2 ...
$ trestecg: chr "130" "120" "140" "170" ...
$ chol : chr "132" "243" NA "237" ...
$ fbs : chr "0" "0" "0" "0" ...
$ restecg : chr "2" "0" "0" "1" ...
$ thalach : chr "185" "160" "170" "170" ...
$ exang : chr "0" "0" "0" "0" ...
$ oldpeak : num 0 0 0 0 0 0 0 0 0 0 ...
$ slope : chr NA NA NA NA ...
$ ca : chr NA NA NA NA ...
$ thal : chr NA NA NA "6" ...
$ num : int 0 0 0 0 0 0 0 0 0 0 ...
We got variable itemized information. There are total of 294 observations under 14 variables:
# Into factor
heart_data$sex <- as.factor(heart_data$sex)
heart_data$cp <- as.factor(heart_data$cp)
heart_data$fbs <- as.factor(heart_data$fbs)
heart_data$restecg <- as.factor(heart_data$restecg)
heart_data$exang <- as.factor(heart_data$exang)
heart_data$slope <- as.factor(heart_data$slope)
heart_data$num <- as.factor(heart_data$num)
# Into Integer and Into Factor
heart_data$ca <- as.integer(heart_data$ca)
heart_data$ca <- as.factor(heart_data$ca)
heart_data$thal <- as.integer(heart_data$thal)
heart_data$thal <- as.factor(heart_data$thal)
str(heart_data)
'data.frame': 294 obs. of 14 variables:
$ age : int 28 29 29 30 31 32 32 32 33 34 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 2 2 1 ...
$ cp : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 2 2 2 2 3 2 ...
$ trestecg: chr "130" "120" "140" "170" ...
$ chol : chr "132" "243" NA "237" ...
$ fbs : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ restecg : Factor w/ 3 levels "0","1","2": 3 1 1 2 2 1 1 1 1 1 ...
$ thalach : chr "185" "160" "170" "170" ...
$ exang : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ oldpeak : num 0 0 0 0 0 0 0 0 0 0 ...
$ slope : Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
$ ca : Factor w/ 1 level "0": NA NA NA NA NA NA NA NA NA NA ...
$ thal : Factor w/ 3 levels "3","6","7": NA NA NA 2 NA NA NA NA NA NA ...
$ num : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
In addition, I can see that the sex, cd, fbs, restecg, exang, and num have do have numerical representation to show their various categories. We don’t have to change the numerical representation by the actual categores like ‘male’, ‘female’, to be successfully conduct the analyses, but we can change them if we want to. For the purpose of my own prctice, I am going to go ahead and change the values by the actual categories:
Having a lot of missing values creates problem in data analysis. Especially in the social science, having more than 5% of missing data is problematic. However, it is essential to realize that there’s hardly any dataset (if we collect them) without the missing values. There are various reasons for having some missing values. Thus, it is desirable that we analyze total number of missing cases.
summary(is.na(heart_data))# Gives overall synopsis of missing data by variables
age sex cp trestecg
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:294 FALSE:294 FALSE:294 FALSE:293
TRUE :1
chol fbs restecg thalach
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:271 FALSE:286 FALSE:293 FALSE:293
TRUE :23 TRUE :8 TRUE :1 TRUE :1
exang oldpeak slope ca
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:293 FALSE:294 FALSE:104 FALSE:3
TRUE :1 TRUE :190 TRUE :291
thal num
Mode :logical Mode :logical
FALSE:28 FALSE:294
TRUE :266
mean(is.na(heart_data)) #provides the mean of the missing data
[1] 0.1899903
The outcome of the summary funtion suggests that
So, we will be better off if we omit these variables from our data set. There are various ways to do so: we can either remove each column one by one or all three columns, at once. I am going to remove them at once. First of all I will identify their position in the data set, and use NULL to replace them. Slope is in the 11th position, while ‘ca’ and ‘thal’ are in 12th, and 13th positions.
heart_data[11:13] <- list(NULL) #It will change the colum
head(heart_data)
age sex cp trestecg chol fbs restecg thalach exang oldpeak num
1 28 1 2 130 132 0 2 185 0 0 0
2 29 1 2 120 243 0 0 160 0 0 0
3 29 1 2 140 <NA> 0 0 170 0 0 0
4 30 0 1 170 237 0 1 170 0 0 0
5 31 0 2 100 219 0 1 150 0 0 0
6 32 0 2 105 198 0 0 165 0 0 0
Looks like that worked. I no longer see the ‘slope’ variable. Once again, let’s check the volume of the missing data.
mean(is.na(heart_data)) #provides the mean of the missing data
[1] 0.01082251
After removing these lackluster varaibles, total number of missing data is approximately 1%. Good news!
And, now, lets get rid of missing data points. You don’t have to do it, but I am doing it for the purpose of practicing this command.
heart_data <- na.omit(heart_data) #Removes the missing rows from the data
summary(heart_data)
age sex cp trestecg chol fbs
Min. :28.00 0: 69 1: 10 Length:261 Length:261 0:242
1st Qu.:42.00 1:192 2: 92 Class :character Class :character 1: 19
Median :49.00 3: 46 Mode :character Mode :character
Mean :47.77 4:113
3rd Qu.:54.00
Max. :65.00
restecg thalach exang oldpeak num
0:208 Length:261 0:178 Min. :0.0000 0:163
1: 47 Class :character 1: 83 1st Qu.:0.0000 1: 98
2: 6 Mode :character Median :0.0000
Mean :0.6123
3rd Qu.:1.0000
Max. :5.0000
The summary table shows variable wise distribution of the data. We don’t see any missing values.
Following the inmormation given to each variable, I am going to change the numerical categorical values of the variables sed, cp, fbs, restecg, num, and exang by their respective categories.
heart_data$sex <- ifelse(test=heart_data$sex == 0, yes = "Female", no = "Male")
heart_data$exang <- ifelse(test=heart_data$exang == 0, yes = "ExInduced_Angina", no = "No")
heart_data$fbs <- ifelse(test = heart_data$fbs == 0, yes = "FALSE", no = "TRUE")
heart_data$cp = with(heart_data, ifelse(cp == 1, "Typical_Angina", ifelse(cp == 2, "Atypical_Angina", ifelse(cp == 3, "Non_Anginal_Pain","Asymptomatic"))))
heart_data$restecg = with(heart_data, ifelse(restecg == 0, "Normal", ifelse(restecg == 1, "ST-T_WabNorm", "LV_Hypertrophy")))
heart_data$num <- ifelse(test=heart_data$num == 0, yes = "LessThanFifty", no = "MoreThanFifty")
'data.frame': 261 obs. of 11 variables:
$ age : int 28 29 30 31 32 32 32 33 34 34 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 2 ...
$ cp : Factor w/ 4 levels "Asymptomatic",..: 2 2 4 2 2 2 2 3 2 2 ...
$ trestecg: num 130 120 170 100 105 110 125 120 130 150 ...
$ chol : num 132 243 237 219 198 225 254 298 161 214 ...
$ fbs : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ restecg : Factor w/ 3 levels "LV_Hypertrophy",..: 1 2 3 3 2 2 2 2 2 3 ...
$ thalach : num 185 160 170 150 165 184 155 185 190 168 ...
$ exang : Factor w/ 2 levels "ExInduced_Angina",..: 1 1 1 1 1 1 1 1 1 1 ...
$ oldpeak : num 0 0 0 0 0 0 0 0 0 0 ...
$ num : Factor w/ 2 levels "LessThanFifty",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'omit' Named int [1:33] 3 28 32 35 45 66 73 76 82 87 ...
..- attr(*, "names")= chr [1:33] "3" "28" "32" "35" ...
Now, I can see that my data are almost ready for further analysis. Lets make the “num”–diagnosis of heart disease (angiographic disease status)–an outcome variable, where, LessThanFifty refers to a good condition of the blood vessel narrowed by less then 50%, and MoreThanFifty to a hear disease condition with blood vessel narrowed by more than 50%.
We are now going to check if the condition if male or female both have the comparable cases of heart disease or it pertains to a particular gender using the ‘xtabs’ function.
xtabs(~ num + sex, data = heart_data)
sex
num Female Male
LessThanFifty 57 106
MoreThanFifty 12 86
We can see the differences in the numbers of male and female with narrowed and/or regular valve size. Lets now check if they reported the chest pain:
xtabs(~ num + cp, data = heart_data)
cp
num Asymptomatic Atypical_Angina Non_Anginal_Pain Typical_Angina
LessThanFifty 37 84 35 7
MoreThanFifty 76 8 11 3
Its a mixed outcome. There are bunch of people diagnosed with a heart condition, but they did not report any kind of chest pain. Some people definitely did, though.
Let’s keep checking for all the categorical variables:
xtabs(~ num + fbs, data = heart_data)
fbs
num FALSE TRUE
LessThanFifty 157 6
MoreThanFifty 85 13
xtabs(~ num + restecg, data = heart_data)
restecg
num LV_Hypertrophy Normal ST-T_WabNorm
LessThanFifty 5 130 28
MoreThanFifty 1 78 19
xtabs(~ num + exang, data = heart_data)
exang
num ExInduced_Angina No
LessThanFifty 144 19
MoreThanFifty 34 64
Let’s start with a simple model. We will try to predict heart disease based on patient’s gender.
gender_logistic <- glm(num ~ sex, data = heart_data, family = "binomial")
summary(gender_logistic)
Call:
glm(formula = num ~ sex, family = "binomial", data = heart_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0900 -1.0900 -0.6181 1.2674 1.8704
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5581 0.3176 -4.906 9.28e-07 ***
sexMale 1.3491 0.3492 3.864 0.000112 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 345.46 on 260 degrees of freedom
Residual deviance: 327.84 on 259 degrees of freedom
AIC: 331.84
Number of Fisher Scoring iterations: 3
logistic <- glm(num~ ., data = heart_data, family = "binomial")
summary(logistic)
Call:
glm(formula = num ~ ., family = "binomial", data = heart_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7682 -0.4879 -0.2655 0.4366 2.4531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.163497 3.389683 -0.638 0.52330
age -0.009255 0.029526 -0.313 0.75394
sexMale 1.307881 0.481892 2.714 0.00665 **
cpAtypical_Angina -2.288454 0.528762 -4.328 1.51e-05 ***
cpNon_Anginal_Pain -0.884304 0.499009 -1.772 0.07637 .
cpTypical_Angina -0.368482 0.967228 -0.381 0.70323
trestecg -0.001603 0.011514 -0.139 0.88925
chol 0.005218 0.002788 1.872 0.06127 .
fbsTRUE 1.639892 0.779925 2.103 0.03550 *
restecgNormal 0.862764 1.715431 0.503 0.61500
restecgST-T_WabNorm 0.470803 1.763641 0.267 0.78951
thalach -0.008658 0.010189 -0.850 0.39549
exangNo 0.943270 0.487140 1.936 0.05283 .
oldpeak 1.184473 0.280460 4.223 2.41e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 345.46 on 260 degrees of freedom
Residual deviance: 184.70 on 247 degrees of freedom
AIC: 212.7
Number of Fisher Scoring iterations: 6
ll.null <- logistic$null.deviance/-2
ll.proposed <- logistic$deviance/-2
(ll.null - ll.proposed)/ll.null
[1] 0.4653618
1 - pchisq(2*(ll.null - ll.proposed),df = (length(logistic$coefficients)-1))
[1] 1
predicted.data <- data.frame(probability.of.hd = logistic$fitted.values, num = heart_data$num)
predicted.data <- predicted.data[order(predicted.data$probability.of.hd, decreasing = FALSE),]
predicted.data$rank <- 1:nrow(predicted.data)
library(ggplot2)
library(cowplot)
ggplot(data=predicted.data, aes(x=rank, y=probability.of.hd))+
geom_point(aes(color=num), alpha=1, shape=4, stroke=2)+
xlab("Index")+
ylab("Predicted Probability of Getting Heart Disease")
Got Ample Support from: https://www.youtube.com/watch?v=C4N3_XJJ-jU