Your task is to choose one dataset, then study the data and its associated description (i.e., the “data dictionary”). You should take the data and create an R data frame with a subset of the columns (and, if you like, rows) in the dataset. Your deliverable is the R code to perform these transformation tasks.

I use the “UC Irvine Machine Learning Repository,” an online repository that hosts many datasets. One of them is the Heart Disease dataset, http://archive.ics.uci.edu/ml/datasets/Heart+Disease, which contains four data sets collected at four different locations.

I chose the reprocessed Hungarian Heart Disease Data Set for this assignment. The link to the dataset is http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/reprocessed.hungarian.data. The data was donated to the site in 1988 and was last modified on July 23, 1996.

I downloaded the dataset and read it into a table in R. I also checked how much data I have by counting the columns and rows.

library(RCurl)
## Loading required package: bitops
x <- getURL("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/reprocessed.hungarian.data")
y <- read.table(text = x, header = FALSE)
head(y, 10)
##    V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
## 1  40  1  2 140 289  0  0 172  0 0.0  -9  -9  -9   0
## 2  49  0  3 160 180  0  0 156  0 1.0   2  -9  -9   1
## 3  37  1  2 130 283  0  1  98  0 0.0  -9  -9  -9   0
## 4  48  0  4 138 214  0  0 108  1 1.5   2  -9  -9   3
## 5  54  1  3 150  -9  0  0 122  0 0.0  -9  -9  -9   0
## 6  39  1  3 120 339  0  0 170  0 0.0  -9  -9  -9   0
## 7  45  0  2 130 237  0  0 170  0 0.0  -9  -9  -9   0
## 8  54  1  2 110 208  0  0 142  0 0.0  -9  -9  -9   0
## 9  37  1  4 140 207  0  0 130  1 1.5   2  -9  -9   1
## 10 48  0  2 120 284  0  0 120  0 0.0  -9  -9  -9   0
ncol(y)
## [1] 14
nrow(y)
## [1] 294
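
The output above shows that missing values are coded as -9. As a minimal sketch, assuming one prefers R's own `NA` over the -9 code, `read.table` can convert them while reading through its `na.strings` argument. The names `sample_text`, `y_na`, and `missing_per_column` are illustrative, and the three rows are copied from the `head(y, 10)` output rather than downloaded.

```r
# Three rows copied from the head(y, 10) output above; -9 marks a missing value.
sample_text <- "40 1 2 140 289 0 0 172 0 0.0 -9 -9 -9 0
49 0 3 160 180 0 0 156 0 1.0 2 -9 -9 1
54 1 3 150 -9 0 0 122 0 0.0 -9 -9 -9 0"

# na.strings tells read.table which field values to read as NA.
y_na <- read.table(text = sample_text, header = FALSE, na.strings = "-9")

# Count the missing values in each column.
missing_per_column <- colSums(is.na(y_na))
print(missing_per_column)
```

With `NA` in place, functions such as `mean(..., na.rm = TRUE)` can skip the missing entries instead of averaging in the -9 code.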

Next, I create a subset of the data and name it the HHD table. I also name the columns according to the data dictionary. The data dictionary for the “Heart Disease Data Set” can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names.

HHD <- data.frame(y[c(1:4, 7, 9)])
names(HHD) <- c("Age", "Sex", "Chest_Pain_Type", "Resting_Blood_Pressure", "Resting_EKG", "Exercise_Induced_Angina")
head(HHD, 3)
##   Age Sex Chest_Pain_Type Resting_Blood_Pressure Resting_EKG
## 1  40   1               2                    140           0
## 2  49   0               3                    160           0
## 3  37   1               2                    130           1
##   Exercise_Induced_Angina
## 1                       0
## 2                       0
## 3                       0
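
Selecting columns by position, as in `y[c(1:4, 7, 9)]`, depends on remembering what each position means. A possible alternative, sketched here under the assumption that the 14 columns follow the attribute order of the data dictionary, is to name every column first and then select by name. The objects `dictionary_names`, `y_small`, and `HHD_small` are illustrative, and the two rows are copied from the `head(y, 10)` output.

```r
# Attribute names in the order listed by the data dictionary (heart-disease.names).
dictionary_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                      "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Two rows copied from the head(y, 10) output above.
y_small <- read.table(text = "40 1 2 140 289 0 0 172 0 0.0 -9 -9 -9 0
49 0 3 160 180 0 0 156 0 1.0 2 -9 -9 1",
                      header = FALSE, col.names = dictionary_names)

# Select by name; these are the same six columns as positions 1:4, 7, and 9.
HHD_small <- y_small[c("age", "sex", "cp", "trestbps", "restecg", "exang")]
print(HHD_small)
```

The selected columns can still be renamed afterwards with `names(HHD_small) <- ...` exactly as done for HHD.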

I transform the data in the HHD table, the subset of the Hungarian “Heart Disease Data Set”, by replacing the numeric codes with descriptive labels.

HHD$Sex <- ifelse(HHD$Sex=="0", "female", 
           ifelse(HHD$Sex=="1", "male", 
           ifelse(HHD$Sex=="-9", "missing", "N/A")
))

HHD$Chest_Pain_Type <- ifelse(HHD$Chest_Pain_Type=="1", "typical angina", 
          ifelse(HHD$Chest_Pain_Type=="2", "atypical angina", 
          ifelse(HHD$Chest_Pain_Type=="3", "non-anginal pain", 
          ifelse(HHD$Chest_Pain_Type=="4", "asymptomatic",  
          ifelse(HHD$Chest_Pain_Type=="-9", "missing", "N/A")
))))

HHD$Resting_EKG <- ifelse(HHD$Resting_EKG=="0", "normal", 
          ifelse(HHD$Resting_EKG=="1", "ST-T wave abnormality", 
          ifelse(HHD$Resting_EKG=="2", "probable or definite left ventricular hypertrophy", 
          ifelse(HHD$Resting_EKG=="-9", "missing", "N/A")
)))

HHD$Exercise_Induced_Angina <- ifelse(HHD$Exercise_Induced_Angina=="0", "no", 
           ifelse(HHD$Exercise_Induced_Angina=="1", "yes", 
           ifelse(HHD$Exercise_Induced_Angina=="-9", "missing", "N/A")
))

head(HHD, 10)
##    Age    Sex  Chest_Pain_Type Resting_Blood_Pressure
## 1   40   male  atypical angina                    140
## 2   49 female non-anginal pain                    160
## 3   37   male  atypical angina                    130
## 4   48 female     asymptomatic                    138
## 5   54   male non-anginal pain                    150
## 6   39   male non-anginal pain                    120
## 7   45 female  atypical angina                    130
## 8   54   male  atypical angina                    110
## 9   37   male     asymptomatic                    140
## 10  48 female  atypical angina                    120
##              Resting_EKG Exercise_Induced_Angina
## 1                 normal                      no
## 2                 normal                      no
## 3  ST-T wave abnormality                      no
## 4                 normal                     yes
## 5                 normal                      no
## 6                 normal                      no
## 7                 normal                      no
## 8                 normal                      no
## 9                 normal                     yes
## 10                normal                      no
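
The nested `ifelse()` calls above can also be written as a single code-to-label mapping. The following is a minimal sketch, not the method used in this report: `recode_codes` is a hypothetical helper built on base R's `factor()`, and the sample codes are made up for illustration. One difference to be aware of is that a code outside the listed values becomes `NA` here rather than the string "N/A".

```r
# Hypothetical helper: map numeric codes to labels, with -9 labelled "missing".
recode_codes <- function(codes, code_values, code_labels) {
  as.character(factor(codes,
                      levels = c(code_values, -9),
                      labels = c(code_labels, "missing")))
}

# Illustrative chest pain codes, including one missing value.
chest_pain_codes <- c(2, 3, 4, -9, 1)
chest_pain_labels <- recode_codes(
  chest_pain_codes,
  code_values = 1:4,
  code_labels = c("typical angina", "atypical angina",
                  "non-anginal pain", "asymptomatic")
)
print(chest_pain_labels)
```

The same helper could be reused for the other coded columns by passing their code values and labels, which keeps each mapping on one line.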

I wanted to see the information for patients who had an abnormal EKG at rest and had angina (chest pain) with exercise. So I created another subset named SubHHD, which contains only the rows that I wanted to look at.

SubHHD <- subset(HHD, Resting_EKG != "normal" & Resting_EKG !="missing" & Exercise_Induced_Angina == "yes" )
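
The filter above works by excluding "normal" and "missing", so any other label, including the fallback "N/A", would pass it. A stricter variant, sketched below on made-up data, lists the abnormal labels explicitly with `%in%`. The data frame `toy` and the vector `abnormal_ekg` are illustrative names, not part of the original analysis.

```r
# Toy data frame using the same labels as HHD (values are illustrative).
toy <- data.frame(
  Resting_EKG = c("normal", "ST-T wave abnormality", "missing", "N/A",
                  "probable or definite left ventricular hypertrophy"),
  Exercise_Induced_Angina = c("yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# The two labels that the data dictionary treats as abnormal findings.
abnormal_ekg <- c("ST-T wave abnormality",
                  "probable or definite left ventricular hypertrophy")

# Keep rows whose EKG label is explicitly abnormal and whose angina label is "yes".
toy_subset <- toy[toy$Resting_EKG %in% abnormal_ekg &
                    toy$Exercise_Induced_Angina == "yes", ]
print(toy_subset)
```

On this toy data the explicit list keeps only the ST-T row, while the exclusion-based filter would also keep the "N/A" row.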

I get rid of the unique row names and put the SubHHD table into an R Markdown table using the knitr package.

rownames(SubHHD) <- NULL
library(knitr)
kable(SubHHD, align = "c", caption = "Table 1: List of patients who had an abnormal EKG at rest and had angina with exercise.")
Table 1: List of patients who had an abnormal EKG at rest and had angina with exercise.

| Age | Sex | Chest_Pain_Type | Resting_Blood_Pressure | Resting_EKG | Exercise_Induced_Angina |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 58 | male | atypical angina | 136 | ST-T wave abnormality | yes |
| 53 | male | asymptomatic | 124 | ST-T wave abnormality | yes |
| 54 | female | non-anginal pain | 130 | ST-T wave abnormality | yes |
| 52 | male | asymptomatic | 112 | ST-T wave abnormality | yes |
| 52 | male | asymptomatic | 160 | ST-T wave abnormality | yes |
| 57 | male | atypical angina | 140 | ST-T wave abnormality | yes |
| 65 | male | asymptomatic | 130 | ST-T wave abnormality | yes |
| 59 | female | asymptomatic | 130 | ST-T wave abnormality | yes |
| 56 | male | asymptomatic | 170 | ST-T wave abnormality | yes |
| 56 | male | asymptomatic | 150 | ST-T wave abnormality | yes |
| 61 | female | asymptomatic | 130 | ST-T wave abnormality | yes |
| 50 | male | asymptomatic | 140 | ST-T wave abnormality | yes |
| 47 | male | asymptomatic | 160 | ST-T wave abnormality | yes |
| 59 | male | asymptomatic | 140 | probable or definite left ventricular hypertrophy | yes |
| 50 | male | asymptomatic | 140 | ST-T wave abnormality | yes |
| 46 | male | asymptomatic | 110 | ST-T wave abnormality | yes |
| 53 | male | asymptomatic | 180 | ST-T wave abnormality | yes |
| 48 | male | asymptomatic | 122 | ST-T wave abnormality | yes |
| 45 | male | asymptomatic | 130 | ST-T wave abnormality | yes |
| 61 | male | asymptomatic | 125 | ST-T wave abnormality | yes |
| 57 | female | asymptomatic | 180 | ST-T wave abnormality | yes |

Let us see how many male and female patients in the Hungarian data had an abnormal EKG at rest and angina (chest pain) with exercise. Using the plyr package, I see that only 4 female patients and 17 male patients meet my criteria.

library(plyr)
SubHHD2 <- count(SubHHD, 'Sex')
kable(SubHHD2, align = "c", caption = "Table 2: Number of patients by sex in table 1.")
Table 2: Number of patients by sex in table 1.

| Sex | freq |
|:---:|:---:|
| female | 4 |
| male | 17 |
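
The same tally can be produced without plyr. As a minimal sketch, base R's `table()` counts each value and `as.data.frame()` turns the result into a two-column data frame. The vector `sex` below is a stand-in for `SubHHD$Sex`, built from the counts reported in Table 2, and `sex_counts` is an illustrative name.

```r
# Stand-in for SubHHD$Sex with the counts reported in Table 2.
sex <- rep(c("female", "male"), times = c(4, 17))

# table() tallies each value; as.data.frame() turns the tally into columns.
sex_counts <- as.data.frame(table(Sex = sex), responseName = "freq",
                            stringsAsFactors = FALSE)
print(sex_counts)
```

The resulting data frame has the columns Sex and freq, so it could be passed to `kable()` in the same way as SubHHD2.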