I downloaded this dataset from Kaggle to practice and advance my skills in data analysis, data science, and machine learning in medical and health research. The aim of this project was to advance my skills with various statistical tools (e.g. Stata, SPSS, Excel, R, Power BI) for analyzing epidemiological research data, and to build dashboards using Power BI, Excel, and R Shiny flexdashboard.
About the data set
The dataset provides a comprehensive collection of patient clinical information, drug exposure profiles, and drug-related biochemical characteristics to support research on the early identification of Chronic Kidney Disease (CKD). It combines real-world–style patient health indicators with detailed properties of nephrotoxic and non-nephrotoxic medications that may influence kidney function.
The data set contains:
Patient Clinical Information
Includes age, gender, blood pressure, blood urea, serum creatinine, albumin levels, random blood glucose, and health conditions such as diabetes and hypertension. These features reflect common clinical factors associated with kidney health.
Drug Exposure Profiles
Each patient is linked to a drug, along with its dosage and duration of use. A separate label indicates whether the drug is considered nephrotoxic.
CKD Risk Classification
Each record includes a CKD risk label derived from clinical biomarkers, health conditions, and drug-related toxicity indicators.
Purpose of the Dataset
• To understand how clinical and drug-specific factors together influence kidney health
• To develop data-driven healthcare applications and decision-support tools
• To evaluate drugs that are related to kidney stress
DATA ANALYSIS PLAN FOR THIS DATA SET
Data Management
• Handling missing data and outliers
• Converting character variables to factors for categorical variables
Data Manipulation
• Subsetting data, filtering, etc.
• Mutating age into categories
Data visualization
• Bar graphs only, for categorical variables
• Histograms and the Shapiro-Wilk test for checking the normality assumption of continuous variables
Statistical data analysis
Descriptive statistics
• Frequencies and percentages for qualitative variables
• Mean and standard deviation for normally distributed continuous variables
• Median and interquartile range for skewed variables
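The rule above (mean and SD when a variable looks normal, median and IQR when it does not) can be sketched in base R. This is only an illustration under stated assumptions: the helper name `summarise_by_normality`, the 0.05 cut-off, and the two demo vectors are hypothetical and are not taken from the CKD dataset.

```r
# Hypothetical helper: pick the summary statistic from a Shapiro-Wilk check.
# Assumes 3 <= length(x) <= 5000, which is the range shapiro.test() accepts.
summarise_by_normality <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  p <- shapiro.test(x)$p.value
  if (p >= alpha) {
    # No evidence against normality: report mean and standard deviation
    sprintf("Mean (SD): %.1f (%.1f)", mean(x), sd(x))
  } else {
    # Normality rejected: report median and interquartile range
    q <- quantile(x, c(0.25, 0.5, 0.75))
    sprintf("Median (IQR): %.1f (%.1f-%.1f)", q[2], q[1], q[3])
  }
}

# Deterministic demo vectors built from theoretical quantiles (not real data)
normal_var <- 120 + 15 * qnorm(ppoints(200))      # symmetric, bell-shaped
skewed_var <- qexp(ppoints(200), rate = 1 / 50)   # strongly right-skewed

summarise_by_normality(normal_var)
summarise_by_normality(skewed_var)
```

With large samples the Shapiro-Wilk test flags even small departures from normality, so it is worth reading the result alongside the histogram.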
Inferential statistics
To understand how clinical and drug-specific factors together influence kidney health, I employed:
i. Bivariate analysis: Chi-square test of association for categorical variables, and the Welch test
ii. Multivariate analysis: logistic regression
Please note: I fitted a multiple logistic regression model to control for confounding variables instead of using the Mantel-Haenszel statistic.
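The two-step approach described above can be sketched on simulated data. Everything in this block is an assumption made for illustration: the data frame `sim`, the binary outcome `high_risk` (the real `ckd_risk_label` has three levels, and how it is handled in the model is not shown at this point), and the coefficients used to simulate it.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical, not the Kaggle data)
sim <- data.frame(
  hypertension = factor(rbinom(n, 1, 0.4), levels = c(0, 1), labels = c("No", "Yes")),
  nephrotoxic  = factor(rbinom(n, 1, 0.5), levels = c(0, 1), labels = c("No", "Yes")),
  blood_urea   = rnorm(n, mean = 40, sd = 10)
)

# Simulated binary outcome driven by all three predictors
lin_pred <- -3 + 0.8 * (sim$hypertension == "Yes") +
  0.6 * (sim$nephrotoxic == "Yes") + 0.04 * sim$blood_urea
sim$high_risk <- rbinom(n, 1, plogis(lin_pred))

# Step i: bivariate association between one categorical factor and the outcome
chisq.test(table(sim$hypertension, sim$high_risk))

# Step ii: multiple logistic regression, adjusting each factor for the others
fit <- glm(high_risk ~ hypertension + nephrotoxic + blood_urea,
           data = sim, family = binomial)

# Adjusted odds ratios with Wald 95% confidence intervals
or_table <- round(exp(cbind(OR = coef(fit), confint.default(fit))), 2)
or_table
```

Each odds ratio in `or_table` is adjusted for the other covariates in the model, which is the role the Mantel-Haenszel approach would otherwise play for a single stratifying variable.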
The statistical package used to analyze this dataset was R.
Why R programming
• Open-source software
• Simple and easy to use
• Well suited to analyzing epidemiological research datasets
PART 3: To develop data-driven healthcare applications and decision-support tools
In this part, I demonstrate my skills in supervised machine learning.
Type of algorithm used:
Linear regression
Logistic regression
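As a rough illustration of how these two algorithms are used in a supervised-learning workflow, the sketch below fits each on a training split and scores it on held-out data. The simulated variables (`age`, `dose`, `urea`, `high_risk`), the 70/30 split, and the 0.5 classification threshold are all assumptions made for the example, not choices documented in this project.

```r
set.seed(123)
n <- 400

# Simulated data (hypothetical): one continuous and one binary target
sim <- data.frame(age = runif(n, 18, 90), dose = runif(n, 50, 500))
sim$urea <- 20 + 0.3 * sim$age + 0.02 * sim$dose + rnorm(n, sd = 5)
sim$high_risk <- rbinom(n, 1, plogis(-4 + 0.05 * sim$age + 0.003 * sim$dose))

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Linear regression for the continuous target, scored by RMSE on the test set
lin_fit <- lm(urea ~ age + dose, data = train)
rmse <- sqrt(mean((test$urea - predict(lin_fit, newdata = test))^2))

# Logistic regression for the binary target, scored by test-set accuracy
log_fit <- glm(high_risk ~ age + dose, data = train, family = binomial)
pred_prob  <- predict(log_fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy <- mean(pred_class == test$high_risk)

c(rmse = rmse, accuracy = accuracy)
```

Scoring on rows the model never saw is what separates this predictive use of regression from the inferential use earlier in the plan.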
# CLEAR WORKING SPACE
rm(list = ls(all.names = TRUE))
#========================================-
# SET WD
setwd("C:/CDK")
#==================================-
# LOAD PACKAGES
#===================================-
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(expss)
Loading required package: maditr
To select rows from data: rows(mtcars, am==0)
Attaching package: 'maditr'
The following objects are masked from 'package:dplyr':
between, coalesce, first, last
The following object is masked from 'package:purrr':
transpose
The following object is masked from 'package:readr':
cols
Attaching package: 'expss'
The following objects are masked from 'package:stringr':
fixed, regex
The following objects are masked from 'package:dplyr':
compute, contains, na_if, recode, vars, where
The following objects are masked from 'package:purrr':
keep, modify, modify_if, when
The following objects are masked from 'package:tidyr':
contains, nest
The following object is masked from 'package:ggplot2':
vars
library(table1)
Attaching package: 'table1'
The following objects are masked from 'package:base':
units, units<-
library(gtsummary)
Attaching package: 'gtsummary'
The following objects are masked from 'package:expss':
contains, vars, where
library(flextable)
Attaching package: 'flextable'
The following object is masked from 'package:gtsummary':
continuous_summary
The following object is masked from 'package:expss':
set_caption
The following object is masked from 'package:purrr':
compose
library(officer)
library(broom)
library(gt)
Attaching package: 'gt'
The following objects are masked from 'package:expss':
contains, gt, tab_caption, vars, where
library(readxl)
Attaching package: 'readxl'
The following object is masked from 'package:officer':
read_xlsx
library(e1071)
Attaching package: 'e1071'
The following object is masked from 'package:ggplot2':
element
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(summarytools)
Registered S3 method overwritten by 'plyr':
method from
[.indexed table1
Attaching package: 'summarytools'
The following objects are masked from 'package:table1':
label, label<-
The following object is masked from 'package:tibble':
view
library(broom.helpers)
Attaching package: 'broom.helpers'
The following objects are masked from 'package:gtsummary':
all_categorical, all_continuous, all_contrasts, all_dichotomous,
all_interaction, all_intercepts
The following objects are masked from 'package:expss':
contains, vars, where
#=====================================-
# LOAD DATA SET
CDK <- read_excel("CDK.xlsx", sheet = "CDK")
#======================================-
# View Data set
#=======================================-
# Use View() with a capital V: lowercase view() is masked by summarytools::view(),
# which only accepts summarytools objects and errors on a data frame
View(CDK)
Section A: Data Processing
# 1.1 DATA CLEANING
# 1.1.1 Keeping variables
CDK <- CDK |>
  select(patient_age, gender, bp_systolic, bp_diastolic,
         blood_urea, blood_glucose_random, diabetes, hypertension,
         drug_name, drug_dosage_mg, exposure_days, nephrotoxic_label,
         ckd_risk_label)
#==================================================-
# Recode categorical variables as factors, attach labels, and derive age groups
CDK <- CDK |>
  mutate(
    diabetes = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes"), exclude = NA),
    hypertension = factor(hypertension, levels = c(0, 1), labels = c("No", "Yes"), exclude = NA),
    nephrotoxic_label = factor(nephrotoxic_label, levels = c(0, 1),
                               labels = c("non-nephrotoxic", "nephrotoxic"), exclude = NA),
    ckd_risk_label = factor(ckd_risk_label, levels = c(0, 1, 2),
                            labels = c("Low risk", "Moderate risk", "High risk"), exclude = NA),
    gender = factor(gender, labels = c("Female", "Male"), exclude = NA),
    # Labels are assigned in the sorted order of the drug names; trailing spaces removed
    drug_name = factor(drug_name,
                       labels = c("Amphotericin-B", "Aspirin", "Cisplatin", "Gentamicin",
                                  "Ibuprofen", "Paracetamol", "Tobramycin", "Vancomycin"),
                       exclude = NA)
  ) |>
  apply_labels(
    patient_age = "Patient Age(Years)",
    gender = "sex",
    bp_systolic = "Systolic blood pressure(mmHg)",
    bp_diastolic = "Diastolic blood pressure(mmHg)",
    blood_urea = "Blood urea(mmol/L)",
    drug_dosage_mg = "Drug dosage(mg)",
    exposure_days = "Days of exposure",
    drug_name = "Drug Type",
    blood_glucose_random = "Blood glucose",
    nephrotoxic_label = "nephrotoxic medication",
    ckd_risk_label = "Risk of chronic Kidney disease"
  ) |>
  mutate(Age_cat = case_when(
    patient_age >= 18 & patient_age <= 22 ~ 1,
    patient_age >= 23 & patient_age <= 27 ~ 2,
    patient_age >= 28 & patient_age <= 32 ~ 3,
    patient_age >= 33 & patient_age <= 37 ~ 4,
    patient_age >= 38 & patient_age <= 42 ~ 5,
    patient_age >= 43 & patient_age <= 47 ~ 6,
    patient_age >= 48 & patient_age <= 52 ~ 7,
    patient_age >= 53 & patient_age <= 57 ~ 8,
    patient_age >= 58 & patient_age <= 62 ~ 9,
    patient_age >= 63 & patient_age <= 67 ~ 10,
    patient_age >= 68 & patient_age <= 72 ~ 11,
    patient_age >= 73 & patient_age <= 77 ~ 12,
    patient_age >= 78 & patient_age <= 82 ~ 13,
    patient_age >= 83 & patient_age <= 87 ~ 14,
    patient_age >= 88 & patient_age <= 92 ~ 15
  )) %>%
  mutate(Age_cat = factor(Age_cat, levels = 1:15,
                          labels = c("18-22", "23-27", "28-32", "33-37", "38-42",
                                     "43-47", "48-52", "53-57", "58-62", "63-67",
                                     "68-72", "73-77", "78-82", "83-87", "88-92"))) |>
  apply_labels(Age_cat = "Patients Age group(years)")
#==================================================-
# Adding another column -> Grouping exposure days of drug use
CDK <- within(CDK, {
  exposure_cat <- NA
  exposure_cat[exposure_days >= 1 & exposure_days <= 4] <- "1-4"
  exposure_cat[exposure_days >= 5 & exposure_days <= 9] <- "5-9"
  exposure_cat[exposure_days >= 10 & exposure_days <= 14] <- "10-14"
  exposure_cat[exposure_days >= 15 & exposure_days <= 19] <- "15-19"
  exposure_cat[exposure_days >= 20 & exposure_days <= 24] <- "20-24"
  exposure_cat[exposure_days >= 25 & exposure_days <= 29] <- "25-29"
  exposure_cat[exposure_days >= 30 & exposure_days <= 34] <- "30-34"
})
#======= label exposure_cat to exposure days
CDK <- apply_labels(CDK, exposure_cat = "exposure days")
#==========================================================-
# 1.1.2 Save data set as CDK.RData
save(CDK, file = "C:/CDK/CDK.RData")
#=================================================-
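The fifteen five-year age bands built with `case_when()` above can also be produced in one step with base R's `cut()`. This is a sketch of an alternative, not the code used in this analysis; the names `age_breaks`, `age_labels`, and the demo vector are hypothetical, and it assumes ages are recorded as whole years.

```r
# Lower edge of every band, plus one closing edge: 18, 23, ..., 88, 93
age_breaks <- seq(18, 93, by = 5)

# Labels "18-22", "23-27", ..., "88-92" generated from the lower edges
lower <- head(age_breaks, -1)
age_labels <- paste(lower, lower + 4, sep = "-")

# Demo ages chosen to hit band boundaries (not real patients)
patient_age <- c(18, 22, 23, 57, 78, 82, 92)

# right = FALSE makes each interval [lower, upper), so 22 stays in "18-22"
age_cat <- cut(patient_age, breaks = age_breaks, labels = age_labels, right = FALSE)
data.frame(patient_age, age_cat)
```

Generating the labels from the breaks avoids hand-typed label strings. For whole-year ages the bands match the `case_when()` version; a fractional age such as 22.5 would fall into "18-22" here but would be left `NA` by the original conditions.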
Section B: Data Visualization
# 2.0 DATA VISUALIZATION----
# 2.1 Checking the distribution of CKD with female data set----
# Filter female data set
#=============================================-
# Explore gender
CDK %>% count(gender, sort = TRUE)
# A tibble: 2 × 2
gender n
<fct> <int>
1 Female 776
2 Male 724
CDKf <- CDK |> filter(gender == "Female")
#========================================-
View(CDKf)
#=========================================-
# 2.2 Visualize categorical variables
#++++++++++++++++++++++++++++++++++++++++++
# 2.2.1 Patient Age categories--------
# Summarized using count()
df <- CDKf %>%
  select(Age_cat, gender) %>%
  count(Age_cat) |>
  ggplot(aes(x = reorder(Age_cat, n), y = n)) +
  geom_bar(stat = "identity", fill = "violet", color = "white") +
  geom_text(aes(label = n), hjust = 1.45) +
  coord_flip() +
  theme_classic() +
  labs(x = "Patients Age(Years)", y = "Count", title = "Female Patients Age Distribution")
df
#=============================================================-
# 2.2.2 Patients Health Condition-----
# 2.2.2.1 Proportion of female with Hypertension-----
df1 <- CDKf %>%
  select(hypertension) %>%
  count(hypertension) %>%
  mutate(Percentage = n / sum(n),
         perce_label = paste0(round(Percentage * 100), "%")) %>%
  ggplot(aes(x = reorder(hypertension, Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  geom_text(aes(label = perce_label), vjust = -0.25) +
  labs(x = "Hypertension status", y = "Percent",
       title = "% of female patient with hypertension problem") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()
df1
#====================================================-
# 2.2.2.2 Proportion of female patients with diabetes-----
df2 <- CDKf %>%
  select(diabetes) %>%
  count(diabetes) %>%
  mutate(Percentage = n / sum(n),
         perce_label = paste0(round(Percentage * 100), "%")) %>%
  ggplot(aes(x = reorder(diabetes, Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "purple", color = "black") +
  geom_text(aes(label = perce_label), vjust = -0.25) +
  labs(x = "Diabetes status", y = "Percent",
       title = "% of female patient with diabetes problem") +
  scale_y_continuous(labels = scales::percent) +
  theme_classic()
df2
#=================================================-
# 2.2.2.3 Proportion of female patients by drug type-----
df3 <- CDKf %>%
  select(drug_name) %>%
  count(drug_name) %>%
  mutate(Percentage = n / sum(n),
         perce_label = paste0(round(Percentage * 100), "%")) %>%
  ggplot(aes(x = reorder(drug_name, Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = perce_label), vjust = -0.25) +
  labs(x = "Drug type", y = "Percent",
       title = "% of female patients by drug used") +
  scale_y_continuous(labels = scales::percent) +
  theme_classic()
df3
#=======================================================-
# 2.2.2.4 Proportion of exposure days----
#============================================-
# Set the level order explicitly: as.factor() would sort the groups
# alphabetically ("1-4", "10-14", ..., "5-9") and scramble the x-axis
exposure_levels <- c("1-4", "5-9", "10-14", "15-19", "20-24", "25-29")
Exposure <- data.frame(
  "exposurecat" = factor(exposure_levels, levels = exposure_levels),
  "Freq" = c(97, 132, 131, 127, 140, 149),
  "Percent" = c("12.5%", "17.0%", "16.9%", "16.4%", "18.0%", "19.2%")
)
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Exposure |>
  ggplot(aes(x = exposurecat, y = as.numeric(Freq))) +
  geom_bar(stat = "identity", color = "black", fill = "dodgerblue1") +
  geom_text(label = with(Exposure, paste(Freq, paste0("(", Percent, ")"))), vjust = -1) +
  ylim(0, 200) +
  labs(title = "Days of drug consumption by female patients",
       y = "Female patients", x = "Days of consumption drug")
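The frequencies and percentages in the `Exposure` table above are typed in by hand. A minimal sketch of deriving the same kind of table from the raw day counts is shown here, so that it would update if the data changed. The demo vector `exposure_days` and the object names are hypothetical; with the real data the input would be the female subset's `exposure_days` column.

```r
# Demo exposure durations in days (not real patients)
exposure_days <- c(1, 3, 5, 9, 10, 14, 15, 20, 25, 29)

# Bin into the same groups, keeping them in numeric order
exposure_cat <- cut(exposure_days,
                    breaks = c(1, 5, 10, 15, 20, 25, 30),
                    labels = c("1-4", "5-9", "10-14", "15-19", "20-24", "25-29"),
                    right = FALSE)

# Count each group and express it as a percentage of the total
freq <- table(exposure_cat)
exposure_tab <- data.frame(
  exposurecat = factor(names(freq), levels = names(freq)),
  Freq = as.integer(freq),
  Percent = sprintf("%.1f%%", 100 * as.numeric(prop.table(freq)))
)
exposure_tab
```

A table built this way can be passed to the same `ggplot()` call, since it has the same three columns.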
#=============================================================-
# 2.3 Visualize continuous variables
# 2.3.1 Distribution of Patient Age (Years)
# 2.3.1.1: Normality Assumption
CDKf %>%
  ggplot(aes(x = patient_age)) +
  geom_histogram(fill = "blue", color = "white") +
  theme_classic() +
  labs(title = "Age distribution of female patient", y = "Counts", x = "Patients Age(Years)")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Testing normality for clarity using the Shapiro-Wilk test
shapiro.test(CDKf$patient_age)
Shapiro-Wilk normality test
data: CDKf$patient_age
W = 0.952, p-value = 3.319e-15
# Note: The normality assumption for age is violated
# Reason: p-value < 0.05, so we reject H0 (that age is normally distributed)
# 2.3.1.2: Identify outliers in patient age----
boxplot(CDKf$patient_age, col = "violet")
#==========================================-
# 2.3.2 TEST NORMALITY ASSUMPTION USING SHAPIRO WILK TEST----
# Please note: for the remaining continuous variables I used----
# only the Shapiro-Wilk test----
#==========================================-
shapiro.test(CDK$bp_systolic)
Shapiro-Wilk normality test
data: CDK$bp_systolic
W = 0.99903, p-value = 0.6182
shapiro.test(CDK$bp_diastolic)
Shapiro-Wilk normality test
data: CDK$bp_diastolic
W = 0.9989, p-value = 0.4988
shapiro.test(CDK$blood_urea)
Shapiro-Wilk normality test
data: CDK$blood_urea
W = 0.99862, p-value = 0.2819
shapiro.test(CDK$blood_glucose_random)
Shapiro-Wilk normality test
data: CDK$blood_glucose_random
W = 0.99846, p-value = 0.1973
shapiro.test(CDK$drug_dosage_mg)
Shapiro-Wilk normality test
data: CDK$drug_dosage_mg
W = 0.95146, p-value < 2.2e-16
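The repeated `shapiro.test()` calls above can be collected into one table. The sketch below is an illustration only: the helper `shapiro_summary` and the `demo` data frame are hypothetical names, and the demo columns are built from theoretical quantiles rather than from the CKD data.

```r
# Hypothetical helper: run Shapiro-Wilk on several columns, return one table
shapiro_summary <- function(data, vars) {
  res <- lapply(vars, function(v) {
    test <- shapiro.test(data[[v]])
    data.frame(variable = v,
               W = unname(test$statistic),
               p_value = test$p.value)
  })
  do.call(rbind, res)
}

# Deterministic demo columns: one bell-shaped, one right-skewed
demo <- data.frame(
  a = qnorm(ppoints(100)),
  b = qexp(ppoints(100))
)

out <- shapiro_summary(demo, c("a", "b"))
out
```

For the dataset used here the call would be something like `shapiro_summary(CDK, c("bp_systolic", "bp_diastolic", "blood_urea", "blood_glucose_random", "drug_dosage_mg"))`, which puts all W statistics and p-values side by side.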
#===========================================-
# N/B: All the variables met the normality assumption except----
# patient age and drug dosage (mg)
#============================================-
# 2.3.2.1 Describe continuous variables----
CSV <- CDK |>
  select(patient_age, bp_diastolic, drug_dosage_mg, bp_systolic,
         blood_urea, blood_glucose_random)
describe(CSV)
#========================================================-
# Reporting patient_age and drug_dosage_mg using table1 function
Table2 <- CDK %>%
  select(patient_age, drug_dosage_mg, nephrotoxic_label)
table1(~ patient_age + drug_dosage_mg | nephrotoxic_label, data = Table2)