JHG INTERVENTION

Background:

Busara has been approached by an international organization called JHG, which works on sexual and reproductive health. JHG has implemented a program in Kenya that provides HIV-preventive medicine free of charge to anyone in need. The pill, which we call DP, must be taken daily to be effective.

JHG has partnered with clinics across the country, which distribute the DP pills to patients. DP comes in packages of 30 pills, so once a month a patient needs to return to the clinic for a new package. JHG is seeing much lower uptake and demand for the pills than expected. They have concluded that it must be a behavioral problem, since the pills are entirely free and easily accessible in hundreds of clinics across the country.

Business task:

The task is to explore continuation rates for DP and the variables affecting them.

Data

The data comprises 22,131 rows and 20 columns; the column names and types are shown in the import and structure output below.

Setting up environment

Notes: The following packages were used (the purpose of each is shown in the comments).

library(tidyverse)#For data cleaning and organizing
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)#For plotting
library(knitr)#For use with R Markdown in presentation
library(lubridate)#For date handling and conversion
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readr)#For importing data
library(dplyr)#For making manipulations
library(rmarkdown)#For creating a sharable document
library(janitor)#For easing data cleaning
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)#For easy summary

IMPORT THE DATA (PREPARE)

#LOAD DATA
Test <- read_csv("C:\\Users\\AAH\\Downloads\\DP_continuation - Sheet1.csv")
## Rows: 22131 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): clientsid, Gender, CountyOfBirth, EntryPoint, ReferredOrTransferre...
## dbl  (9): Age, Weight_, Height, Received_DP, Refill1month, Refill2months, Re...
## date (2): InitialVisitDate, Lmp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

PREVIEW THE DATA AND FIND THE STRUCTURE

Notes: Preview the first and last 6 rows, then look at the structure.

head(Test)#FIRST 6 ROWS
## # A tibble: 6 × 20
##   clientsid Gender Count…¹   Age Entry…² Refer…³ InitialV…⁴ BP    Weight_ Height
##   <chr>     <chr>  <chr>   <dbl> <chr>   <chr>   <date>     <chr>   <dbl>  <dbl>
## 1 10269-07… M      Kisii      38 Referr… VCT Si… 2018-03-14 113/…    68     172 
## 2 10269-07… F      Meru       44 Referr… VCT Si… 2018-04-11 120/…   102     168 
## 3 10269-07… M      Kiambu     40 Referr… Outrea… 2018-04-04 113/…     0       0 
## 4 10269-07… M      Bungoma    24 Referr… VCT Si… 2018-04-11 000/…     0       0 
## 5 10269-07… F      Kiambu     31 Referr… VCT Si… 2018-05-13 000/…    75.5   760.
## 6 10269-07… M      Kiambu     36 Referr… Outrea… 2018-05-07 000/…     0       0 
## # … with 10 more variables: SignOfSti <chr>, Lmp <date>, Cluster <chr>,
## #   FacilityType <chr>, Received_DP <dbl>, Refill1month <dbl>,
## #   Refill2months <dbl>, Refill3months <dbl>, Refill6months <dbl>,
## #   received_counseling <dbl>, and abbreviated variable names ¹​CountyOfBirth,
## #   ²​EntryPoint, ³​ReferredOrTransferredFrom, ⁴​InitialVisitDate
tail(Test)#LAST 6 ROWS
## # A tibble: 6 × 20
##   clientsid Gender Count…¹   Age Entry…² Refer…³ InitialV…⁴ BP    Weight_ Height
##   <chr>     <chr>  <chr>   <dbl> <chr>   <chr>   <date>     <chr>   <dbl>  <dbl>
## 1 99999/20… F      Unknown    30 Referr… VCT Si… 2017-11-23 133/…      79    165
## 2 99999/20… F      Unknown    35 Referr… VCT Si… 2017-11-17 112/…      76    162
## 3 99999/20… F      Unknown    41 Referr… VCT Si… 2017-12-20 114/…      94    154
## 4 99999/20… F      Turkana    47 Referr… VCT Si… 2017-11-24 109/…      60    168
## 5 99999/20… F      Meru       27 Referr… VCT Si… 2017-04-03 154/…      47    159
## 6 99999/20… F      Kakame…    26 Referr… VCT Si… 2017-04-18 109/…      51    163
## # … with 10 more variables: SignOfSti <chr>, Lmp <date>, Cluster <chr>,
## #   FacilityType <chr>, Received_DP <dbl>, Refill1month <dbl>,
## #   Refill2months <dbl>, Refill3months <dbl>, Refill6months <dbl>,
## #   received_counseling <dbl>, and abbreviated variable names ¹​CountyOfBirth,
## #   ²​EntryPoint, ³​ReferredOrTransferredFrom, ⁴​InitialVisitDate
dim(Test)#Number of rows and columns
## [1] 22131    20
str(Test)#Check column data types and structure of columns
## spc_tbl_ [22,131 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ clientsid                : chr [1:22131] "10269-07-1/18" "10269-07-2/18" "10269-07-3/18" "10269-07-4/18" ...
##  $ Gender                   : chr [1:22131] "M" "F" "M" "M" ...
##  $ CountyOfBirth            : chr [1:22131] "Kisii" "Meru" "Kiambu" "Bungoma" ...
##  $ Age                      : num [1:22131] 38 44 40 24 31 36 20 19 42 31 ...
##  $ EntryPoint               : chr [1:22131] "Referred" "Referred" "Referred" "Referred" ...
##  $ ReferredOrTransferredFrom: chr [1:22131] "VCT Site" "VCT Site" "Outreach" "VCT Site" ...
##  $ InitialVisitDate         : Date[1:22131], format: "2018-03-14" "2018-04-11" ...
##  $ BP                       : chr [1:22131] "113/82" "120/84" "113/84" "000/00" ...
##  $ Weight_                  : num [1:22131] 68 102 0 0 75.5 0 0 0 70 74.2 ...
##  $ Height                   : num [1:22131] 172 168 0 0 760 ...
##  $ SignOfSti                : chr [1:22131] "No" "No" "No" "No" ...
##  $ Lmp                      : Date[1:22131], format: "1900-01-01" "1900-01-01" ...
##  $ Cluster                  : chr [1:22131] "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
##  $ FacilityType             : chr [1:22131] "Public" "Public" "Public" "Public" ...
##  $ Received_DP              : num [1:22131] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Refill1month             : num [1:22131] 1 1 1 1 1 1 0 0 1 0 ...
##  $ Refill2months            : num [1:22131] 1 1 1 1 0 1 0 0 1 0 ...
##  $ Refill3months            : num [1:22131] 1 1 1 0 0 1 0 0 1 0 ...
##  $ Refill6months            : num [1:22131] 1 0 0 0 0 0 0 0 0 0 ...
##  $ received_counseling      : num [1:22131] 0 1 1 1 1 0 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   clientsid = col_character(),
##   ..   Gender = col_character(),
##   ..   CountyOfBirth = col_character(),
##   ..   Age = col_double(),
##   ..   EntryPoint = col_character(),
##   ..   ReferredOrTransferredFrom = col_character(),
##   ..   InitialVisitDate = col_date(format = ""),
##   ..   BP = col_character(),
##   ..   Weight_ = col_double(),
##   ..   Height = col_double(),
##   ..   SignOfSti = col_character(),
##   ..   Lmp = col_date(format = ""),
##   ..   Cluster = col_character(),
##   ..   FacilityType = col_character(),
##   ..   Received_DP = col_double(),
##   ..   Refill1month = col_double(),
##   ..   Refill2months = col_double(),
##   ..   Refill3months = col_double(),
##   ..   Refill6months = col_double(),
##   ..   received_counseling = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
View(Test)#Get a full view of the dataset

CLEANING(PROCESS)

CHECK FOR ANY DUPLICATES AND MISSING VALUES

Notes: These checks are done with a few simple functions.

#Column names cleaning
Test2<-clean_names(Test)#clean column names for easy machine and human reading
colnames(Test2)#Confirm column names are clean
##  [1] "clientsid"                    "gender"                      
##  [3] "county_of_birth"              "age"                         
##  [5] "entry_point"                  "referred_or_transferred_from"
##  [7] "initial_visit_date"           "bp"                          
##  [9] "weight"                       "height"                      
## [11] "sign_of_sti"                  "lmp"                         
## [13] "cluster"                      "facility_type"               
## [15] "received_dp"                  "refill1month"                
## [17] "refill2months"                "refill3months"               
## [19] "refill6months"                "received_counseling"
#Check for duplicate clients

sum(duplicated(Test2))#Count duplicated rows; there are no duplicated records in the data
## [1] 0

Further cleaning of the data

#Check whether this data set contains any abnormal values
summary(Test2)#Check for NA values and out-of-range values
##   clientsid            gender          county_of_birth         age        
##  Length:22131       Length:22131       Length:22131       Min.   : 15.00  
##  Class :character   Class :character   Class :character   1st Qu.: 22.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 25.00  
##                                                           Mean   : 27.28  
##                                                           3rd Qu.: 31.00  
##                                                           Max.   :544.00  
##                                                                           
##  entry_point        referred_or_transferred_from initial_visit_date  
##  Length:22131       Length:22131                 Min.   :2017-02-02  
##  Class :character   Class :character             1st Qu.:2017-10-10  
##  Mode  :character   Mode  :character             Median :2018-02-22  
##                                                  Mean   :2018-02-21  
##                                                  3rd Qu.:2018-07-10  
##                                                  Max.   :2018-12-31  
##                                                                      
##       bp                weight             height          sign_of_sti       
##  Length:22131       Min.   :    0.00   Min.   :      0.0   Length:22131      
##  Class :character   1st Qu.:   50.00   1st Qu.:      0.0   Class :character  
##  Mode  :character   Median :   60.00   Median :    156.0   Mode  :character  
##                     Mean   :   76.11   Mean   :    190.8                     
##                     3rd Qu.:   68.00   3rd Qu.:    164.0                     
##                     Max.   :79174.00   Max.   :1560000.0                     
##                     NA's   :117        NA's   :117                           
##       lmp               cluster          facility_type       received_dp
##  Min.   :1900-01-01   Length:22131       Length:22131       Min.   :1   
##  1st Qu.:1900-01-01   Class :character   Class :character   1st Qu.:1   
##  Median :1900-01-01   Mode  :character   Mode  :character   Median :1   
##  Mean   :1930-11-01                                         Mean   :1   
##  3rd Qu.:2017-05-01                                         3rd Qu.:1   
##  Max.   :2019-04-04                                         Max.   :1   
##  NA's   :117                                                            
##   refill1month    refill2months    refill3months    refill6months    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.4762   Mean   :0.2759   Mean   :0.1788   Mean   :0.06593  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                      
##  received_counseling
##  Min.   :0.000      
##  1st Qu.:0.000      
##  Median :1.000      
##  Mean   :0.521      
##  3rd Qu.:1.000      
##  Max.   :1.000      
## 

We find that lmp, height and weight each have 117 NA values.
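
Notes: To confirm which columns carry missing values, a quick per-column count can be run. This is a minimal check on the cleaned data; note that summary() does not report NA counts for character columns, so this also reveals missing values in bp and sign_of_sti.

#Count the NA values in every column, including the character columns not covered by summary()
colSums(is.na(Test2))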

Upon further inspection, we find that the dataset below is a compilation of all rows containing NA values.

Test3 <- Test2[rowSums(is.na(Test2)) > 0,]
View(Test3)
dim(Test3) 
## [1] 244  20

The subset of rows with missing values has 244 rows and 20 columns.

Replacing the missing values with the median values

Notes: This is to reduce the margin of error in our data. It should not have a major effect on the overall distributions.

Test3<- Test2 %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))
Test4<- Test3 %>% mutate(across(where(is.Date), ~replace_na(., median(., na.rm=TRUE))))
sum(is.na(Test4))
## [1] 361
#The first mutate handles the numeric columns and the second handles the date columns
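
Notes: The summary above also shows physically implausible values (an age of 544, heights above 1,500, weights above 79,000, and zero heights and weights). The median replacement only touches NAs. A hedged sketch of how implausible values could additionally be set to missing and re-imputed is shown below; the plausibility cut-offs and the object name Test4b are illustrative assumptions, not values from the source, and would need sign-off from the data collection team.

#Illustrative sketch: treat implausible values as missing, then re-impute with the median
#The plausibility cut-offs below are assumptions for illustration only
Test4b <- Test4 %>%
  mutate(
    age    = replace(age, age > 100, NA),
    weight = replace(weight, weight < 30 | weight > 200, NA),
    height = replace(height, height < 100 | height > 230, NA)
  ) %>%
  mutate(across(where(is.numeric), ~replace_na(., median(., na.rm = TRUE))))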

Next we make sure the lmp column is stored as a date data type.

#Convert lmp column to a date data type
Test5 <- Test4 %>%
  mutate(lmp = as_date(lmp, format = "%Y-%m-%d"))
str(Test5)
## tibble [22,131 × 20] (S3: tbl_df/tbl/data.frame)
##  $ clientsid                   : chr [1:22131] "10269-07-1/18" "10269-07-2/18" "10269-07-3/18" "10269-07-4/18" ...
##  $ gender                      : chr [1:22131] "M" "F" "M" "M" ...
##  $ county_of_birth             : chr [1:22131] "Kisii" "Meru" "Kiambu" "Bungoma" ...
##  $ age                         : num [1:22131] 38 44 40 24 31 36 20 19 42 31 ...
##  $ entry_point                 : chr [1:22131] "Referred" "Referred" "Referred" "Referred" ...
##  $ referred_or_transferred_from: chr [1:22131] "VCT Site" "VCT Site" "Outreach" "VCT Site" ...
##  $ initial_visit_date          : Date[1:22131], format: "2018-03-14" "2018-04-11" ...
##  $ bp                          : chr [1:22131] "113/82" "120/84" "113/84" "000/00" ...
##  $ weight                      : num [1:22131] 68 102 0 0 75.5 0 0 0 70 74.2 ...
##  $ height                      : num [1:22131] 172 168 0 0 760 ...
##  $ sign_of_sti                 : chr [1:22131] "No" "No" "No" "No" ...
##  $ lmp                         : Date[1:22131], format: "1900-01-01" "1900-01-01" ...
##  $ cluster                     : chr [1:22131] "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
##  $ facility_type               : chr [1:22131] "Public" "Public" "Public" "Public" ...
##  $ received_dp                 : num [1:22131] 1 1 1 1 1 1 1 1 1 1 ...
##  $ refill1month                : num [1:22131] 1 1 1 1 1 1 0 0 1 0 ...
##  $ refill2months               : num [1:22131] 1 1 1 1 0 1 0 0 1 0 ...
##  $ refill3months               : num [1:22131] 1 1 1 0 0 1 0 0 1 0 ...
##  $ refill6months               : num [1:22131] 1 0 0 0 0 0 0 0 0 0 ...
##  $ received_counseling         : num [1:22131] 0 1 1 1 1 0 1 1 1 1 ...
View(Test5)

Notes: The mutate function from dplyr, together with as_date from lubridate, was used to enforce the date data type and its format.
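
Notes: The summary also shows that lmp has a minimum and median of 1900-01-01, which looks like a placeholder rather than a real last-menstrual-period date. A hedged sketch for treating that placeholder as missing is shown below; the assumption that 1900-01-01 is a placeholder (and the object name Test5b) would need confirmation from the data collection team.

#Illustrative sketch: treat the apparent 1900-01-01 placeholder in lmp as missing
Test5b <- Test5 %>%
  mutate(lmp = na_if(lmp, as_date("1900-01-01")))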

Upon further inspection, we find that the dataset below contains the rows that still have NA values.

Test6 <- Test5[rowSums(is.na(Test5)) > 0,]
View(Test6)

The remaining NA values are in the character columns bp and sign_of_sti. These qualitative values will require consultation with the enumerators/data collection team before cleaning.
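
Notes: The bp column also mixes readings such as "113/82" with the placeholder "000/00". If the team confirms that "000/00" means "not measured", a hedged sketch for splitting bp into numeric systolic and diastolic columns could look like the following; the column names systolic/diastolic and the object name Test5c are illustrative.

#Illustrative sketch: split bp into numeric systolic/diastolic and treat 000/00 as not measured
Test5c <- Test5 %>%
  separate(bp, into = c("systolic", "diastolic"), sep = "/", convert = TRUE, remove = FALSE) %>%
  mutate(
    systolic  = na_if(systolic, 0),
    diastolic = na_if(diastolic, 0)
  )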

THE DATA IS NOW MACHINE READABLE AND READY FOR ANALYSIS

DATA VISUALIZATION AND INTERPRETATION

Notes: The visualizations below focus on the variables most relevant to continuation; the other variables are left out of the plots.

# Create a bar plot of the received DP variable
ggplot(Test5, aes(x = initial_visit_date, y = received_dp)) +
  scale_x_date(date_labels = "%b %y", date_breaks = "3 months") +
  geom_bar(stat = "identity", fill = "blue") +
  ggtitle("Received DP by Quarters") +
  xlab("Dates in Quarters") +
  ylab("Received DP")

We can see that receipt of DP was highest around 16 October 2017. The counts show a roughly bell-shaped pattern, with the lowest uptake at the start and at the end of the observation window.
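
Notes: Because every patient in the data has received_dp = 1, the bar heights are effectively counts of initiations per date. A hedged sketch that aggregates initiations by calendar month (using lubridate's floor_date) gives a cleaner view; the object name monthly_dp is illustrative.

#Illustrative sketch: count DP initiations per calendar month
monthly_dp <- Test5 %>%
  mutate(month = floor_date(initial_visit_date, "month")) %>%
  count(month, name = "initiations")

ggplot(monthly_dp, aes(x = month, y = initiations)) +
  geom_col(fill = "blue") +
  scale_x_date(date_labels = "%b %y", date_breaks = "3 months") +
  ggtitle("DP Initiations per Month") +
  xlab("Month") +
  ylab("Initiations")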

WE NOW CALCULATE CONTINUATION RATES AND ADD A CONTINUATION RATE COLUMN

# Calculate continuation rates
Test7 <- Test5 %>%
  mutate(continuation_rate = (refill1month + refill2months + refill3months + refill6months) / received_dp)
View(Test7)

We have now included a continuation rate column in the data
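
Notes: Because received_dp is always 1, the continuation_rate column is simply the number of refill milestones a patient reached (0 to 4). The share of patients still continuing at each milestone can also be read directly from the refill indicators; a minimal check:

#Share of all patients who refilled at each milestone
Test7 %>%
  summarise(across(c(refill1month, refill2months, refill3months, refill6months), mean))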

NEXT WE PLOT A TIME SERIES SHOWING THE CONTINUATION RATE ACROSS THE YEARS

# Create line plot of continuation rate
ggplot(Test7, aes(x = initial_visit_date, y = continuation_rate, color = continuation_rate)) +
  geom_line() +
  scale_x_date(date_labels="%b %y",date_breaks  ="3 months")+
  ggtitle("Continuation Rate by Month") +
  xlab("Month") +
  ylab("Continuation Rate")

NEXT WE CHECK THE EXACT NUMBER OF PATIENTS PER CONTINUATION RATE IN A TIBBLE

# Count the number of occurrences of each continuation rate
counts <- Test7 %>%
  group_by(continuation_rate) %>%
  summarize(count = n())
print(counts)
## # A tibble: 4 × 2
##   continuation_rate count
##               <dbl> <int>
## 1                 0 11592
## 2                 2  4434
## 3                 3  2147
## 4                 4  3958

We can see the exact number of patients at each continuation level, from those who never refilled up to those who returned through the six-month refill.

WE THEN VISUALIZE THE ABOVE DATA WITH A HISTOGRAM

# Create histogram of continuation rate
ggplot(Test7, aes(x=continuation_rate)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  scale_y_continuous(breaks = seq(0, max(counts$count), by = 2000), limits = c(0, max(counts$count)))+
  ggtitle("Distribution of Continuation Rate among 22000 patients") +
  xlab("Continuation Rate") +
  ylab("Patients")

Notable Statistics

The vast majority of patients did not go back for a refill of the drugs; this is confirmed above, with more than 50% of patients never returning for a refill.

This is very concerning, and the drop-off deepens over time: fewer than 20% of patients continued through the three-month refill, and under 7% reached the six-month refill.
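
Notes: Since the business task is to explore the variables affecting continuation, a first hedged cut is to compare refill behaviour across groups, for example by counselling status. This is a simple descriptive sketch, not a formal test.

#Illustrative sketch: 1-month refill rate by counselling status
Test7 %>%
  group_by(received_counseling) %>%
  summarise(
    patients = n(),
    refill_1_month_rate = mean(refill1month)
  )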

CONTINUATION RATE BY GENDER

data_filtered <- filter(Test7, continuation_rate >= 0)
ggplot(data_filtered, aes(x = gender, fill = continuation_rate)) +
  geom_bar(position = "dodge") +
  ggtitle("Continuation Rate by Gender") +
  xlab("Gender") +
  ylab("Number of Patients")
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

We can see there were about 7,700 females and about 2,700 males overall.
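
Notes: The warning above appears because continuation_rate is numeric, so ggplot drops it as a fill grouping and the bars show only totals per gender. A hedged sketch that converts it to a factor so the dodged bars actually split by continuation level:

#Illustrative sketch: treat continuation_rate as a discrete level so the dodged fill works
ggplot(data_filtered, aes(x = gender, fill = factor(continuation_rate))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Continuation Level by Gender",
    x = "Gender",
    y = "Number of Patients",
    fill = "Continuation level"
  )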

Results and interpretation

Every patient in the dataset received an initial DP package (received_dp is 1 throughout), but continuation drops off sharply: more than half of patients never came back for the one-month refill, fewer than one in five reached the three-month refill, and under 7% reached the six-month refill.

Uptake of DP itself was not constant, peaking around October 2017 and tapering towards the start and end of the observation window.

Roughly half of patients are recorded as having received counseling, which makes counseling status, together with gender, age, facility type and cluster, a natural set of candidate variables for explaining continuation.

Conclusion

The behavioural problem JHG suspects is visible in the data: the free and accessible DP packages are being picked up once but frequently not refilled, so the largest losses happen at the very first refill. Before modelling the drivers of continuation, the data quality issues flagged above (the 1900-01-01 placeholder in lmp, the "000/00" blood-pressure readings, and the zero or implausible heights and weights) should be resolved with the data collection team, since they would affect any covariate analysis.

A natural next step is to model the probability of a one-month refill against the candidate variables above; a sketch follows below.
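
Notes: A minimal sketch of such a model, assuming a binary one-month refill outcome and using only columns present in the data; the object name refill_model is illustrative and this is a starting point, not a validated model.

#Illustrative sketch: logistic regression of the 1-month refill on candidate variables
refill_model <- glm(
  refill1month ~ gender + age + received_counseling + facility_type + cluster,
  data = Test7,
  family = binomial()
)
summary(refill_model)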
