Busara has been approached by an international organization called JHG, which is working on sexual reproductive health. They have implemented a program in Kenya where they are giving HIV preventive medicine for free to anyone in need. The pill, which we call DP, must be taken daily to be effective.
JHG has partnered with clinics across the country, who will be distributing DP pills to patients. The DP comes in packages of 30 pills. Once a month, a patient will need to go back to the clinic and receive a new package. JHG is seeing a much lower uptake and demand for the pills than they had expected. They have concluded that it must be a behavioral problem, since they have made the pills entirely free and easily accessible in hundreds of clinics across the country. .
The task is to Explore Continuation Rates and Variables affecting them
The data is comprised of 20 COLUMNS as described above
Notes:The following packages were used(description shown)
library(tidyverse)#For data cleaning and organizing
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)#For plotting
library(knitr)#For use with rmarkdown oin presentation
library(lubridate)#For sdate use and conversion
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readr)#For importing data
library(dplyr)#For making manipulations
library(rmarkdown)#For creating a sharable document
library(janitor)#For easing data cleaning
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)#For easy summary
#LOAD DATA
Test <- read_csv("C:\\Users\\AAH\\Downloads\\DP_continuation - Sheet1.csv")
## Rows: 22131 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): clientsid, Gender, CountyOfBirth, EntryPoint, ReferredOrTransferre...
## dbl (9): Age, Weight_, Height, Received_DP, Refill1month, Refill2months, Re...
## date (2): InitialVisitDate, Lmp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notes: Just the first 6 rows & then look at the structure
head(Test)#FIRST 6 ROWS
## # A tibble: 6 × 20
## clientsid Gender Count…¹ Age Entry…² Refer…³ InitialV…⁴ BP Weight_ Height
## <chr> <chr> <chr> <dbl> <chr> <chr> <date> <chr> <dbl> <dbl>
## 1 10269-07… M Kisii 38 Referr… VCT Si… 2018-03-14 113/… 68 172
## 2 10269-07… F Meru 44 Referr… VCT Si… 2018-04-11 120/… 102 168
## 3 10269-07… M Kiambu 40 Referr… Outrea… 2018-04-04 113/… 0 0
## 4 10269-07… M Bungoma 24 Referr… VCT Si… 2018-04-11 000/… 0 0
## 5 10269-07… F Kiambu 31 Referr… VCT Si… 2018-05-13 000/… 75.5 760.
## 6 10269-07… M Kiambu 36 Referr… Outrea… 2018-05-07 000/… 0 0
## # … with 10 more variables: SignOfSti <chr>, Lmp <date>, Cluster <chr>,
## # FacilityType <chr>, Received_DP <dbl>, Refill1month <dbl>,
## # Refill2months <dbl>, Refill3months <dbl>, Refill6months <dbl>,
## # received_counseling <dbl>, and abbreviated variable names ¹CountyOfBirth,
## # ²EntryPoint, ³ReferredOrTransferredFrom, ⁴InitialVisitDate
tail(Test)#LAST 6 ROWS
## # A tibble: 6 × 20
## clientsid Gender Count…¹ Age Entry…² Refer…³ InitialV…⁴ BP Weight_ Height
## <chr> <chr> <chr> <dbl> <chr> <chr> <date> <chr> <dbl> <dbl>
## 1 99999/20… F Unknown 30 Referr… VCT Si… 2017-11-23 133/… 79 165
## 2 99999/20… F Unknown 35 Referr… VCT Si… 2017-11-17 112/… 76 162
## 3 99999/20… F Unknown 41 Referr… VCT Si… 2017-12-20 114/… 94 154
## 4 99999/20… F Turkana 47 Referr… VCT Si… 2017-11-24 109/… 60 168
## 5 99999/20… F Meru 27 Referr… VCT Si… 2017-04-03 154/… 47 159
## 6 99999/20… F Kakame… 26 Referr… VCT Si… 2017-04-18 109/… 51 163
## # … with 10 more variables: SignOfSti <chr>, Lmp <date>, Cluster <chr>,
## # FacilityType <chr>, Received_DP <dbl>, Refill1month <dbl>,
## # Refill2months <dbl>, Refill3months <dbl>, Refill6months <dbl>,
## # received_counseling <dbl>, and abbreviated variable names ¹CountyOfBirth,
## # ²EntryPoint, ³ReferredOrTransferredFrom, ⁴InitialVisitDate
dim(Test)#Number of rows and columns
## [1] 22131 20
str(Test)#Check column data types and structure of columns
## spc_tbl_ [22,131 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ clientsid : chr [1:22131] "10269-07-1/18" "10269-07-2/18" "10269-07-3/18" "10269-07-4/18" ...
## $ Gender : chr [1:22131] "M" "F" "M" "M" ...
## $ CountyOfBirth : chr [1:22131] "Kisii" "Meru" "Kiambu" "Bungoma" ...
## $ Age : num [1:22131] 38 44 40 24 31 36 20 19 42 31 ...
## $ EntryPoint : chr [1:22131] "Referred" "Referred" "Referred" "Referred" ...
## $ ReferredOrTransferredFrom: chr [1:22131] "VCT Site" "VCT Site" "Outreach" "VCT Site" ...
## $ InitialVisitDate : Date[1:22131], format: "2018-03-14" "2018-04-11" ...
## $ BP : chr [1:22131] "113/82" "120/84" "113/84" "000/00" ...
## $ Weight_ : num [1:22131] 68 102 0 0 75.5 0 0 0 70 74.2 ...
## $ Height : num [1:22131] 172 168 0 0 760 ...
## $ SignOfSti : chr [1:22131] "No" "No" "No" "No" ...
## $ Lmp : Date[1:22131], format: "1900-01-01" "1900-01-01" ...
## $ Cluster : chr [1:22131] "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
## $ FacilityType : chr [1:22131] "Public" "Public" "Public" "Public" ...
## $ Received_DP : num [1:22131] 1 1 1 1 1 1 1 1 1 1 ...
## $ Refill1month : num [1:22131] 1 1 1 1 1 1 0 0 1 0 ...
## $ Refill2months : num [1:22131] 1 1 1 1 0 1 0 0 1 0 ...
## $ Refill3months : num [1:22131] 1 1 1 0 0 1 0 0 1 0 ...
## $ Refill6months : num [1:22131] 1 0 0 0 0 0 0 0 0 0 ...
## $ received_counseling : num [1:22131] 0 1 1 1 1 0 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. clientsid = col_character(),
## .. Gender = col_character(),
## .. CountyOfBirth = col_character(),
## .. Age = col_double(),
## .. EntryPoint = col_character(),
## .. ReferredOrTransferredFrom = col_character(),
## .. InitialVisitDate = col_date(format = ""),
## .. BP = col_character(),
## .. Weight_ = col_double(),
## .. Height = col_double(),
## .. SignOfSti = col_character(),
## .. Lmp = col_date(format = ""),
## .. Cluster = col_character(),
## .. FacilityType = col_character(),
## .. Received_DP = col_double(),
## .. Refill1month = col_double(),
## .. Refill2months = col_double(),
## .. Refill3months = col_double(),
## .. Refill6months = col_double(),
## .. received_counseling = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
View(Test)#Get a full view of the dataset
Notes: Check is done by different simple functions
#Column names clening
Test2<-clean_names(Test)#clean column names for easy machine and human reading
colnames(Test2)#Confirm column names are clean
## [1] "clientsid" "gender"
## [3] "county_of_birth" "age"
## [5] "entry_point" "referred_or_transferred_from"
## [7] "initial_visit_date" "bp"
## [9] "weight" "height"
## [11] "sign_of_sti" "lmp"
## [13] "cluster" "facility_type"
## [15] "received_dp" "refill1month"
## [17] "refill2months" "refill3months"
## [19] "refill6months" "received_counseling"
#Check for duplicate clients
sum(duplicated(Test2))#Count all duplicated values;There are no data duplicated values
## [1] 0
#Now I want to find if this data set contains any abnormal values
summary(Test2)#Check for na values
## clientsid gender county_of_birth age
## Length:22131 Length:22131 Length:22131 Min. : 15.00
## Class :character Class :character Class :character 1st Qu.: 22.00
## Mode :character Mode :character Mode :character Median : 25.00
## Mean : 27.28
## 3rd Qu.: 31.00
## Max. :544.00
##
## entry_point referred_or_transferred_from initial_visit_date
## Length:22131 Length:22131 Min. :2017-02-02
## Class :character Class :character 1st Qu.:2017-10-10
## Mode :character Mode :character Median :2018-02-22
## Mean :2018-02-21
## 3rd Qu.:2018-07-10
## Max. :2018-12-31
##
## bp weight height sign_of_sti
## Length:22131 Min. : 0.00 Min. : 0.0 Length:22131
## Class :character 1st Qu.: 50.00 1st Qu.: 0.0 Class :character
## Mode :character Median : 60.00 Median : 156.0 Mode :character
## Mean : 76.11 Mean : 190.8
## 3rd Qu.: 68.00 3rd Qu.: 164.0
## Max. :79174.00 Max. :1560000.0
## NA's :117 NA's :117
## lmp cluster facility_type received_dp
## Min. :1900-01-01 Length:22131 Length:22131 Min. :1
## 1st Qu.:1900-01-01 Class :character Class :character 1st Qu.:1
## Median :1900-01-01 Mode :character Mode :character Median :1
## Mean :1930-11-01 Mean :1
## 3rd Qu.:2017-05-01 3rd Qu.:1
## Max. :2019-04-04 Max. :1
## NA's :117
## refill1month refill2months refill3months refill6months
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.4762 Mean :0.2759 Mean :0.1788 Mean :0.06593
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## received_counseling
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.521
## 3rd Qu.:1.000
## Max. :1.000
##
We find that lmp, Height and Weight have 117 na values each
Test3 <- Test2[rowSums(is.na(Test2)) > 0,]
View(Test3)
dim(Test3)
## [1] 244 20
244 rows and 20 columns
Notes: This is to reduce the margin of errors in our data This will not have a major effect on the data
Test3<- Test2 %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))
Test4<- Test3 %>% mutate(across(where(is.Date), ~replace_na(., median(., na.rm=TRUE))))
sum(is.na(Test4))
## [1] 361
#This is for the numeric values and the date values respectively
#Convert lmp column to a date data type
Test5 <- Test4 %>%
mutate(lmp = as_date(lmp, format = "%Y-%m-%d"))
str(Test5)
## tibble [22,131 × 20] (S3: tbl_df/tbl/data.frame)
## $ clientsid : chr [1:22131] "10269-07-1/18" "10269-07-2/18" "10269-07-3/18" "10269-07-4/18" ...
## $ gender : chr [1:22131] "M" "F" "M" "M" ...
## $ county_of_birth : chr [1:22131] "Kisii" "Meru" "Kiambu" "Bungoma" ...
## $ age : num [1:22131] 38 44 40 24 31 36 20 19 42 31 ...
## $ entry_point : chr [1:22131] "Referred" "Referred" "Referred" "Referred" ...
## $ referred_or_transferred_from: chr [1:22131] "VCT Site" "VCT Site" "Outreach" "VCT Site" ...
## $ initial_visit_date : Date[1:22131], format: "2018-03-14" "2018-04-11" ...
## $ bp : chr [1:22131] "113/82" "120/84" "113/84" "000/00" ...
## $ weight : num [1:22131] 68 102 0 0 75.5 0 0 0 70 74.2 ...
## $ height : num [1:22131] 172 168 0 0 760 ...
## $ sign_of_sti : chr [1:22131] "No" "No" "No" "No" ...
## $ lmp : Date[1:22131], format: "1900-01-01" "1900-01-01" ...
## $ cluster : chr [1:22131] "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
## $ facility_type : chr [1:22131] "Public" "Public" "Public" "Public" ...
## $ received_dp : num [1:22131] 1 1 1 1 1 1 1 1 1 1 ...
## $ refill1month : num [1:22131] 1 1 1 1 1 1 0 0 1 0 ...
## $ refill2months : num [1:22131] 1 1 1 1 0 1 0 0 1 0 ...
## $ refill3months : num [1:22131] 1 1 1 0 0 1 0 0 1 0 ...
## $ refill6months : num [1:22131] 1 0 0 0 0 0 0 0 0 0 ...
## $ received_counseling : num [1:22131] 0 1 1 1 1 0 1 1 1 1 ...
View(Test5)
Notes:The mutate function under dyplr was used to change the data type as well as format the date arrangement
Test6 <- Test5[rowSums(is.na(Test5)) > 0,]
View(Test6)
The Na values are characters from the bp column and “signsof STI column” These qualitative values will require consultation from the enumerators/Data collection team before cleaning
Notes: We removed all insignificant variables
# Create a bar plot of the received DP variable
ggplot(Test5, aes(x = initial_visit_date, y = received_dp,fill = "blue")) +
scale_x_date(date_labels="%b %y",date_breaks ="3 months")+
geom_bar(stat = "identity") +
ggtitle("Received DP by Quarters") +
xlab("Dates in Quarters") +
ylab("Received DP")
We can see that received dp was highest at around October 16th 2017 A normal distribution is displayed whereby the lowest reception of DP was around at the end and at the start of the data frame
# Calculate continuation rates
Test7 <- Test5 %>%
mutate(continuation_rate = (refill1month + refill2months + refill3months + refill1month) / received_dp)
View(Test7)
We have now included a continuation rate column in the data
# Create line plot of continuation rate
ggplot(Test7, aes(x = initial_visit_date, y = continuation_rate, color = continuation_rate)) +
geom_line() +
scale_x_date(date_labels="%b %y",date_breaks ="3 months")+
ggtitle("Continuation Rate by Month") +
xlab("Month") +
ylab("Continuation Rate")
# Count the number of occurrences of each continuation rate
counts <- Test7 %>%
group_by(continuation_rate) %>%
summarize(count = n())
print(counts)
## # A tibble: 4 × 2
## continuation_rate count
## <dbl> <int>
## 1 0 11592
## 2 2 4434
## 3 3 2147
## 4 4 3958
We can see the exact number of patients who went back for the first to the 6th time
# Create histogram of continuation rate
ggplot(Test7, aes(x=continuation_rate)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
scale_y_continuous(breaks = seq(0, max(counts$count), by = 2000), limits = c(0, max(counts$count)))+
ggtitle("Distribution of Continuation Rate among 22000 patients") +
xlab("Continuation Rate") +
ylab("Patients")
The vast majority of patients did not go for a refill of the deugs and this is confirmed above with more than 50% of patients not going back for a refill
This is very concerning since less than 20% of candidates went back for a refill
data_filtered <- filter(Test7, continuation_rate >= 0)
ggplot(Test7, aes(x = gender, fill = continuation_rate)) +
geom_bar(position = "dodge") +
ggtitle("Continuation Rate by Location") +
xlab("Location") +
ylab("Continuation Rate")
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
We can see there were abot 7700 females and 2700 females overall
Our final formula is Log(p/(1-p) = -9.27321 + 0.15213(Pregnancies)+ 0.03810(Glucose) + 0.08779(BMI)+ 0.90231(DiabetesPedigreeFunction)
The predicted value (probability of diabetes) for pregnancy is Log(p/(1-p)) = -9.27321 + 0.15213(1) + 0.03810(0) + 0.08779(0) + 0.90231(0) Log(p/(1-p)) = -9.12108 exp (-9.12108) / (1+ exp (-9.12108)) = 0.000193
The predicted value (probability of diabetes) for glucose level is Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(1) + 0.08779(0) + 0.90231(0) Log(p/(1-p)) = -9.19911 exp (-9.19911) / (1+ exp (-9.19911)) =0.000101
The predicted value (probability of diabetes) for BMI is Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(0) + 0.08779(1) + 0.90231(0) Log(p/(1-p)) = - 9.18542 exp (-9.18542) / (1+ exp (-9.18542)) =0.000103
The predicted value (probability of diabetes) for diabetes pedigree function is Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(0) + 0.08779(0) + 0.90231 (1) Log(p/(1-p)) = -8.37079 exp (-8.37079) / (1+ exp (-8.37079)) =0.000231
Checking accuracy of our model, the P-value of pregnancies, glucose, BMI and DiabetesPedigreeFunction were less than the level of significance(0.05). This meant that the predictors were probably an excellent addition to the model.
With an accuracy of 0.7756098in our model, indicates that 78% of the time our model classified the patients in a high-risk category when they actually had a high risk of getting diabetes. This shows our model is good.
Assumptions of logistic regression hold ie: * Little or no multicollinearity. For our model there is no multicollinearity between the independent variables * Autocorrelation. This assumption holds as the p-value= 0.3747 is greater than the alpha level of 0.005 and with D = 1.9746 shows positive autocorrelation. * Normality. For logistic regression, since it does not require a linear relationship between dependent and independent variables, normality do not need to be normally distributed. With p-values less than alpha level, the normality assumption does not hold.