Pre-processing of data from the chronic kidney disease dataset from Kaggle (https://www.kaggle.com/datasets/mansoordaku/ckdisease)
Initially, we will load our data into R using read.csv(). Let’s also load the library ‘tidyverse’, which we will use in our processing.
kidney_data <- read.csv('C:/Data Science YH/R Programming/Final Assignment/data/kidney_disease.csv')
library(tidyverse)
Next, let’s investigate the data structure in the dataset using the str() function.
str(kidney_data)
'data.frame': 400 obs. of 26 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ age : num 48 7 62 48 51 60 68 24 52 53 ...
$ bp : num 80 50 80 70 80 90 70 NA 100 90 ...
$ sg : num 1.02 1.02 1.01 1 1.01 ...
$ al : num 1 4 2 4 2 3 0 2 3 2 ...
$ su : num 0 0 3 0 0 0 0 4 0 0 ...
$ rbc : chr "" "" "normal" "normal" ...
$ pc : chr "normal" "normal" "normal" "abnormal" ...
$ pcc : chr "notpresent" "notpresent" "notpresent" "present" ...
$ ba : chr "notpresent" "notpresent" "notpresent" "notpresent" ...
$ bgr : num 121 NA 423 117 106 74 100 410 138 70 ...
$ bu : num 36 18 53 56 26 25 54 31 60 107 ...
$ sc : num 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
$ sod : num NA NA NA 111 NA 142 104 NA NA 114 ...
$ pot : num NA NA NA 2.5 NA 3.2 4 NA NA 3.7 ...
$ hemo : num 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
$ pcv : chr "44" "38" "31" "32" ...
$ wc : chr "7800" "6000" "7500" "6700" ...
$ rc : chr "5.2" "" "" "3.9" ...
$ htn : chr "yes" "no" "no" "yes" ...
$ dm : chr "yes" "no" "yes" "no" ...
$ cad : chr "no" "no" "no" "no" ...
$ appet : chr "good" "good" "poor" "poor" ...
$ pe : chr "no" "no" "no" "yes" ...
$ ane : chr "no" "no" "yes" "yes" ...
$ classification: chr "ckd" "ckd" "ckd" "ckd" ...
We have an id variable with type ‘integer’, and numerous variables of types ‘numeric’ and ‘character’. From the key to the dataset, we can tell that the parameters ‘pcv’, ‘rc’, and ‘wc’ should probably be numerical, let’s deal with that later. Now, let’s check for presence of NA, along with distributions in the various columns using summary().
summary(kidney_data)
id age bp sg al su
Min. : 0.00 Min. : 2.00 Min. : 50.00 Min. :1.005 Min. :0.000 Min. :0.0000
1st Qu.: 99.75 1st Qu.:42.00 1st Qu.: 70.00 1st Qu.:1.010 1st Qu.:0.000 1st Qu.:0.0000
Median :199.50 Median :55.00 Median : 80.00 Median :1.020 Median :0.000 Median :0.0000
Mean :199.50 Mean :51.48 Mean : 76.47 Mean :1.017 Mean :1.017 Mean :0.4501
3rd Qu.:299.25 3rd Qu.:64.50 3rd Qu.: 80.00 3rd Qu.:1.020 3rd Qu.:2.000 3rd Qu.:0.0000
Max. :399.00 Max. :90.00 Max. :180.00 Max. :1.025 Max. :5.000 Max. :5.0000
NA's :9 NA's :12 NA's :47 NA's :46 NA's :49
rbc pc pcc ba bgr bu
Length:400 Length:400 Length:400 Length:400 Min. : 22 Min. : 1.50
Class :character Class :character Class :character Class :character 1st Qu.: 99 1st Qu.: 27.00
Mode :character Mode :character Mode :character Mode :character Median :121 Median : 42.00
Mean :148 Mean : 57.43
3rd Qu.:163 3rd Qu.: 66.00
Max. :490 Max. :391.00
NA's :44 NA's :19
sc sod pot hemo pcv wc
Min. : 0.400 Min. : 4.5 Min. : 2.500 Min. : 3.10 Length:400 Length:400
1st Qu.: 0.900 1st Qu.:135.0 1st Qu.: 3.800 1st Qu.:10.30 Class :character Class :character
Median : 1.300 Median :138.0 Median : 4.400 Median :12.65 Mode :character Mode :character
Mean : 3.072 Mean :137.5 Mean : 4.627 Mean :12.53
3rd Qu.: 2.800 3rd Qu.:142.0 3rd Qu.: 4.900 3rd Qu.:15.00
Max. :76.000 Max. :163.0 Max. :47.000 Max. :17.80
NA's :17 NA's :87 NA's :88 NA's :52
rc htn dm cad appet
Length:400 Length:400 Length:400 Length:400 Length:400
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
pe ane classification
Length:400 Length:400 Length:400
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
We see here that most of our numeric variables contain a significant amount of NaN values. These have to be dealt with somehow. Let’s try to replace the NA in the ‘age’ column using the following ifelse() segment. This piece of code checks for a testexpression (in this case: is.na(kidney_data$age)), and then implements either arg2 (if TRUE) or arg3 (if FALSE). This way we should change all NA in ‘age’ to the median (arg2).
kidney_data$age <- ifelse(is.na(kidney_data$age), median(kidney_data$age, na.rm=TRUE), kidney_data$age) # Replaces any NA in the 'age' column with the median value.
summary(kidney_data$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 42.00 55.00 51.56 64.00 90.00
We see here that our method removed all of the NA in the ‘age’ column, with minor changes to the other distribution parameters. These minor changes are likely related to that the ‘age’ column had relatively few NA. Obviously, replacing with the median doesn’t affect the median value.
Let’s use another method for the ‘bp’ column. We will instead replace the NA with the mean. The code segment used is very similar.
kidney_data$bp <- ifelse(is.na(kidney_data$bp), mean(kidney_data$bp, na.rm=TRUE), kidney_data$bp) # Replaces any NA value in the 'bp' column with the mean value.
summary(kidney_data$bp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.00 70.00 78.23 76.47 80.00 180.00
As we can see here, now the mean is unchanged and the median is affected instead. It is important to understand that all replacements of missing data will always change your dataset. However, for many purposes, replacing or removing missing data is necessary for further processing.
Let’s revert to the original dataframe and use the median replacement method for the entire dataframe.
kidney_data <- read.csv("C:/Data Science YH/R Programming/Final Assignment/data/kidney_disease.csv")
for (column in (1:ncol(kidney_data))) { # Loops over all columns in kidney_data
kidney_data[,column] <- ifelse(is.na(kidney_data[,column]), mean(kidney_data[,column], na.rm=TRUE), kidney_data[,column]) # Replaces all NA values in the columns with the mean value.
}
summary(kidney_data)
id age bp sg al su
Min. : 0.00 Min. : 2.00 Min. : 50.00 Min. :1.005 Min. :0.000 Min. :0.0000
1st Qu.: 99.75 1st Qu.:42.00 1st Qu.: 70.00 1st Qu.:1.015 1st Qu.:0.000 1st Qu.:0.0000
Median :199.50 Median :54.00 Median : 78.23 Median :1.017 Median :1.000 Median :0.0000
Mean :199.50 Mean :51.48 Mean : 76.47 Mean :1.017 Mean :1.017 Mean :0.4501
3rd Qu.:299.25 3rd Qu.:64.00 3rd Qu.: 80.00 3rd Qu.:1.020 3rd Qu.:2.000 3rd Qu.:0.4501
Max. :399.00 Max. :90.00 Max. :180.00 Max. :1.025 Max. :5.000 Max. :5.0000
rbc pc pcc ba bgr bu
Length:400 Length:400 Length:400 Length:400 Min. : 22 Min. : 1.50
Class :character Class :character Class :character Class :character 1st Qu.:101 1st Qu.: 27.00
Mode :character Mode :character Mode :character Mode :character Median :126 Median : 44.00
Mean :148 Mean : 57.43
3rd Qu.:150 3rd Qu.: 61.75
Max. :490 Max. :391.00
sc sod pot hemo pcv wc
Min. : 0.400 Min. : 4.5 Min. : 2.500 Min. : 3.10 Length:400 Length:400
1st Qu.: 0.900 1st Qu.:135.0 1st Qu.: 4.000 1st Qu.:10.88 Class :character Class :character
Median : 1.400 Median :137.5 Median : 4.627 Median :12.53 Mode :character Mode :character
Mean : 3.072 Mean :137.5 Mean : 4.627 Mean :12.53
3rd Qu.: 3.072 3rd Qu.:141.0 3rd Qu.: 4.800 3rd Qu.:14.62
Max. :76.000 Max. :163.0 Max. :47.000 Max. :17.80
rc htn dm cad appet
Length:400 Length:400 Length:400 Length:400 Length:400
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
pe ane classification
Length:400 Length:400 Length:400
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Voila! We have succesfully cleaned out all of the NA missing data in the dataset and replaced it with the median for each column!
Next, let’s look at the character columns. We start with investigating the ‘rbc’ column.
kidney_data %>% count(rbc)
Here we can see that we have two recorded values: abnormal (47) and normal (201). There’s also a substantial amount of missing data here (152). Missing data in binary categorical parameters is in some sense trickier than for numerical parameters, or categorical parameters with many different factors, as the mean and median doesn’t necessarily tell you much. In many cases, it can be beneficial to make the parameter categorical instead of binary, by introducing a third outcome: ‘no record’. Let’s do that with our ‘rbc’ column.
kidney_data$rbc <- ifelse((kidney_data$rbc == ""), 'no record', kidney_data$rbc) # Replaces any empty values with 'no record'.
table(kidney_data$rbc)
abnormal no record normal
47 152 201
There we go! We have now imputed a new outcome ‘no record’, for our missing data in the ‘rbc’ column. Let’s investigate the other initial character parameters (‘pc’, ‘pcc’, and ‘ba’).
kidney_data %>% count(pc)
kidney_data %>% count(pcc)
kidney_data %>% count(ba)
NA
Count() gives us a dataframe as result, which is easier to work with than a table, like we get from table(). We see here that the ‘pc’ varaiable contains a lot of missing values as well, whereas the ‘pcc’ and ‘ba’ parameters have only 4. Let’s do the same replacement for ‘pc’ as we did for ‘rbc’.
kidney_data$pc <- ifelse((kidney_data$pc == ""), 'no record', kidney_data$pc) # Replaces any empty values with 'no record'.
kidney_data %>% count(pc)
Good. Now what to do with ‘pcc’ and ‘ba’? Let’s keep them as a binary parameters, and instead remove the rows with missing data. Since there’s only 4 rows with missing data in ‘pcc’ and ‘ba’, the impact on our overall data should be small.
kidney_data <- subset(kidney_data, pcc!="") # Removes any rows with empty values.
kidney_data <- subset(kidney_data, ba!="") # Removes any rows with empty values.
kidney_data %>% count(pcc)
kidney_data %>% count(ba)
Excellent! What about the rest of the character parameters? Let’s use count() on them as well.
kidney_data %>% count(htn)
kidney_data %>% count(dm)
kidney_data %>% count(cad)
kidney_data %>% count(appet)
kidney_data %>% count(pe)
kidney_data %>% count(ane)
kidney_data %>% count(classification)
Alright. That’s a lot of parameters! First and foremost, we can see that all of these parameters have binary outcomes. However, some of them have missing data (‘htn’, ‘dm’, ‘cad’, ‘appet’, ‘pe’), and some of them have corrupted data (‘dm’, ‘cad’, ‘classification’).
Let’s deal with the corrupted data first, and start with the ‘dm’ column. First, we’ll just print it out again.
kidney_data %>% count(dm)
So there’s a lot going on here. We have 6 different outcomes, but it is evident that the only true outcomes are ‘no’ or ‘yes’. So it’s a binary parameter. Let’s use an ifelse again, and just check for presence of the characters ‘no’ or ‘yes’. The grepl() function checks if a character is present in a string or not. Here we check if ‘yes’ is present in the string, then we replace it with a clean ‘yes’. Then we do the same for ‘no’, and have a look at our column again.
kidney_data$dm <- ifelse((grepl('yes', kidney_data$dm) == TRUE), 'yes', kidney_data$dm) #Replace any char containing 'yes' with 'Yes' only.
kidney_data$dm <- ifelse((grepl('no', kidney_data$dm) == TRUE), 'no', kidney_data$dm) #Replace any char containing 'no' with 'No' only.
kidney_data %>% count(dm)
Perfect! Now we can do the same with the other two columns with corrupted characters. In the case of ‘classification’, we need to be careful. Notice how both outcomes contain ‘ckd’. This means our standard replacement function will replace every outcome with ‘ckd’. Therefore, let’s change the outcome to ‘neg’ and ‘pos’ instead, and make sure we check for ‘notckd’ first.
kidney_data$cad <- ifelse((grepl('yes', kidney_data$cad) == TRUE), 'yes', kidney_data$cad) #Replace any char containing 'yes' with 'Yes' only.
kidney_data$cad <- ifelse((grepl('no', kidney_data$cad) == TRUE), 'no', kidney_data$cad) #Replace any char containing 'no' with 'No' only.
kidney_data$classification <- ifelse((grepl('notckd', kidney_data$classification) == TRUE), 'neg', kidney_data$classification) #Replace any char containing 'notckd' with 'neg'.
kidney_data$classification <- ifelse((grepl('ckd', kidney_data$classification) == TRUE), 'pos', kidney_data$classification) #Replace any char containing 'ckd' with 'pos'.
kidney_data %>% count(cad)
kidney_data %>% count(classification)
Excellent! Now we have corrected the corrupt data in all of our character columns. Let’s evaluate again.
kidney_data %>% count(ba)
kidney_data %>% count(htn)
kidney_data %>% count(dm)
kidney_data %>% count(cad)
kidney_data %>% count(appet)
kidney_data %>% count(pe)
kidney_data %>% count(ane)
kidney_data %>% count(classification)
It seems like all of these parameters are now binary, but some are missing values. As for ‘pcc’, there are very few rows missing, so the best solution here is probably to remove the rows with missing values. We can use the same method as for ‘pcc’.
kidney_data <- subset(kidney_data, htn!="") # Removes all rows containing missing values.
kidney_data <- subset(kidney_data, dm!="") # Removes all rows containing missing values.
kidney_data <- subset(kidney_data, cad!="") # Removes all rows containing missing values.
kidney_data <- subset(kidney_data, appet!="") # Removes all rows containing missing values.
kidney_data <- subset(kidney_data, pe!="") # Removes all rows containing missing values.
kidney_data <- subset(kidney_data, ane!="") # Removes all rows containing missing values.
nrow(kidney_data)
[1] 393
It seems like in total, we removed only 7 row! Likely, a lot of the missing data were for the same recordings.
Now, we have three more columns that are of type ‘character’, that seem to contain only or mostly numerical values: ‘pcv’, ‘wc’ and ‘rc’. Let’s look at their count() and see what they look like.
kidney_data %>% count(pcv)
kidney_data %>% count(wc)
kidney_data %>% count(rc)
All these parameters have lots of missing values, and they all contain corrupted values. One way of handling this corruption in this case is to search for any characters that are not numeric and replace them with ’’. Let’s try this using the gsub() function.
kidney_data$pcv <- gsub("[^0-9.-]", "", kidney_data$pcv) # Removes any non-numeric characters
kidney_data$wc <- gsub("[^0-9.-]", "", kidney_data$wc) # Removes any non-numeric characters
kidney_data$rc <- gsub("[^0-9.-]", "", kidney_data$rc) # Removes any non-numeric characters
kidney_data %>% count(rc)
kidney_data %>% count(pcv)
kidney_data %>% count(wc)
Seems like it worked fine! Now, what remains is numerical data encoded as character, including lots of missing data points. Let’s convert the three datasets into numerical, and replace the missing data with the mean.
kidney_data$pcv <- as.numeric(kidney_data$pcv) # Force the dataset into type numeric
kidney_data$rc <- as.numeric(kidney_data$rc) # Force the dataset into type numeric
kidney_data$wc <- as.numeric(kidney_data$wc) # Force the dataset into type numeric
for (column in (17:19)) {
kidney_data[,column] <- ifelse(is.na(kidney_data[,column]), mean(kidney_data[,column], na.rm=TRUE), kidney_data[,column]) # Replaces any NA values with the mean.
}
There we go! Let’s have a final look at the summary() and str() of the dataset.
summary(kidney_data)
id age bp sg al su
Min. : 0.0 Min. : 2.00 Min. : 50.00 Min. :1.005 Min. :0.000 Min. :0.0000
1st Qu.: 98.0 1st Qu.:42.00 1st Qu.: 70.00 1st Qu.:1.015 1st Qu.:0.000 1st Qu.:0.0000
Median :196.0 Median :54.00 Median : 80.00 Median :1.017 Median :1.000 Median :0.0000
Mean :197.7 Mean :51.52 Mean : 76.64 Mean :1.017 Mean :1.035 Mean :0.4582
3rd Qu.:298.0 3rd Qu.:64.00 3rd Qu.: 80.00 3rd Qu.:1.020 3rd Qu.:2.000 3rd Qu.:0.4501
Max. :399.0 Max. :90.00 Max. :180.00 Max. :1.025 Max. :5.000 Max. :5.0000
rbc pc pcc ba bgr bu
Length:393 Length:393 Length:393 Length:393 Min. : 22.0 Min. : 1.50
Class :character Class :character Class :character Class :character 1st Qu.:102.0 1st Qu.: 27.00
Mode :character Mode :character Mode :character Mode :character Median :127.0 Median : 45.00
Mean :148.9 Mean : 57.85
3rd Qu.:153.0 3rd Qu.: 65.00
Max. :490.0 Max. :391.00
sc sod pot hemo pcv wc
Min. : 0.400 Min. : 4.5 Min. : 2.500 Min. : 3.10 Min. : 9.00 Min. : 2200
1st Qu.: 1.000 1st Qu.:135.0 1st Qu.: 4.000 1st Qu.:10.80 1st Qu.:34.00 1st Qu.: 6900
Median : 1.400 Median :137.5 Median : 4.627 Median :12.53 Median :38.73 Median : 8408
Mean : 3.114 Mean :137.5 Mean : 4.627 Mean :12.49 Mean :38.73 Mean : 8408
3rd Qu.: 3.072 3rd Qu.:141.0 3rd Qu.: 4.800 3rd Qu.:14.50 3rd Qu.:44.00 3rd Qu.: 9400
Max. :76.000 Max. :163.0 Max. :47.000 Max. :17.80 Max. :54.00 Max. :26400
rc htn dm cad appet
Min. :2.100 Length:393 Length:393 Length:393 Length:393
1st Qu.:4.500 Class :character Class :character Class :character Class :character
Median :4.682 Mode :character Mode :character Mode :character Mode :character
Mean :4.682
3rd Qu.:5.000
Max. :8.000
pe ane classification
Length:393 Length:393 Length:393
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
str(kidney_data)
'data.frame': 393 obs. of 26 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ age : num 48 7 62 48 51 60 68 24 52 53 ...
$ bp : num 80 50 80 70 80 ...
$ sg : num 1.02 1.02 1.01 1 1.01 ...
$ al : num 1 4 2 4 2 3 0 2 3 2 ...
$ su : num 0 0 3 0 0 0 0 4 0 0 ...
$ rbc : chr "no record" "no record" "normal" "normal" ...
$ pc : chr "normal" "normal" "normal" "abnormal" ...
$ pcc : chr "notpresent" "notpresent" "notpresent" "present" ...
$ ba : chr "notpresent" "notpresent" "notpresent" "notpresent" ...
$ bgr : num 121 148 423 117 106 ...
$ bu : num 36 18 53 56 26 25 54 31 60 107 ...
$ sc : num 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
$ sod : num 138 138 138 111 138 ...
$ pot : num 4.63 4.63 4.63 2.5 4.63 ...
$ hemo : num 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
$ pcv : num 44 38 31 32 35 39 36 44 33 29 ...
$ wc : num 7800 6000 7500 6700 7300 ...
$ rc : num 5.2 4.68 4.68 3.9 4.6 ...
$ htn : chr "yes" "no" "no" "yes" ...
$ dm : chr "yes" "no" "yes" "no" ...
$ cad : chr "no" "no" "no" "no" ...
$ appet : chr "good" "good" "poor" "poor" ...
$ pe : chr "no" "no" "no" "yes" ...
$ ane : chr "no" "no" "yes" "yes" ...
$ classification: chr "pos" "pos" "pos" "pos" ...
That’s it! The dataset is now cleaned, and ready for the next step. If the data should be used for machine learning purposes, one might consider converting the categorical parameters to either dummy variables, or in the case of binary outcomes, booleans.
Let’s finish by exprting our new cleaned-up dataset as a csv-file.
write.csv(kidney_data, 'C:/Data Science YH/R Programming/Final Assignment/data/kidney_disease_cleaned.csv')