This analysis aims to investigate the relationship between demographic factors (age, gender, marital status, employment status, and monthly household income) and the likelihood of having health insurance.
Introduction
Target Audience
Libraries
To kick off, let's load the libraries needed for this analysis.
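The loading chunk itself is not shown here. A minimal sketch is given below; the package list is an assumption, inferred only from the functions called in the code later in this report.

```r
# Assumed package list, inferred from the functions used in this report.
library(dplyr)    # %>%, group_by(), summarise(), mutate(), case_when()
library(ggplot2)  # ggplot(), geom_bar(), geom_text(), labs()
library(stringr)  # str_to_lower(), str_detect(), regex()
library(DT)       # datatable()
library(scales)   # comma_format() for axis labels
```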
According to this, several columns do not have the right data types. These include age, marital status, number of children, and gender. To correct this, the appropriate data types are assigned as follows.
health_insurance$date_and_time <- as.Date(health_insurance$date_and_time, format = "%y/%m/%d")
health_insurance$how_many_children_do_you_have_if_any <- as.numeric(health_insurance$how_many_children_do_you_have_if_any)
health_insurance$age <- as.factor(health_insurance$age)
health_insurance$marital_status <- as.factor(health_insurance$marital_status)
health_insurance$gender <- as.factor(health_insurance$gender)
# health_insurance$monthly_household_income <- as.factor(health_insurance$monthly_household_income)
The blanks in this case were replaced with “Unknown”.
Next, the “Monthly Household Income” column has 259 missing values. This is a significant number, so we replace the missing values with “Unknown” rather than dropping those rows.
Next up is the “If yes, which insurance cover” column, which also has missing values. Here, if the patient has insurance cover, we replace the missing value with “Unknown”; if not, we replace it with “Not Applicable”.
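The replacement of blanks and missing values with “Unknown” can be sketched as follows. This is an illustration on made-up toy data, not the original chunk; it assumes that blanks may appear either as `NA` or as empty strings, and the income labels are hypothetical.

```r
library(dplyr)

# Toy data standing in for the real health_insurance table (values are made up).
toy <- data.frame(
  gender = c("Male", "", "Female", NA),
  monthly_household_income = c("Less than 10,000", NA, "10,000-20,000", ""),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as missing, then label them "Unknown".
toy_clean <- toy %>%
  mutate(across(
    c(gender, monthly_household_income),
    ~ if_else(is.na(.x) | .x == "", "Unknown", .x)
  ))
```

If a column has already been converted to a factor, it would need `as.character()` first, since “Unknown” is not yet one of its levels.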
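The conditional rule for the insurance cover column could be written with `case_when()`. The sketch below uses toy data and assumes the insurance question is answered with the strings "Yes" and "No"; the column names are the ones used elsewhere in this report.

```r
library(dplyr)

# Toy data (made up); column names follow the report, values are assumptions.
toy <- data.frame(
  have_you_ever_had_health_insurance = c("Yes", "Yes", "No", "No"),
  if_yes_which_insurance_cover = c("NHIF", NA, NA, NA),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(if_yes_which_insurance_cover = case_when(
    # Keep any cover that was actually recorded.
    !is.na(if_yes_which_insurance_cover) ~ if_yes_which_insurance_cover,
    # Insured, but the cover was not recorded.
    have_you_ever_had_health_insurance == "Yes" ~ "Unknown",
    # Not insured, so the question does not apply.
    TRUE ~ "Not Applicable"
  ))
```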
ggplot(age_group_counts, aes(x = age, y = number_of_patients, fill = age)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Number of Patients by Age Group",
    x = "Age Group",
    y = "Number of Patients"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
Findings: The 18-30 age group had the highest number of patients, whereas the 60+ age group recorded the fewest.
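The chunk that builds `age_group_counts` is not shown. One plausible way to produce a table with the `age` and `number_of_patients` columns the plot expects is sketched below on toy data; the "31-45" label is a made-up placeholder.

```r
library(dplyr)

# Toy data (made up); "31-45" is a hypothetical age-group label.
toy <- data.frame(
  age = factor(c("18-30", "18-30", "31-45", "60+", "18-30", "31-45"))
)

# One row per age group with its patient count, largest group first.
age_group_counts <- toy %>%
  group_by(age) %>%
  summarise(number_of_patients = n()) %>%
  arrange(desc(number_of_patients))
```

The same pattern, grouped by a different column, would presumably give the gender, marital status and employment tables used in the later plots.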
ggplot(gender_group_counts, aes(x = reorder(gender, -number_of_patients), y = number_of_patients)) +
  geom_bar(stat = "identity", fill = c("#B45D58", "#B8647D", "#A875A0")) +
  geom_text(aes(label = number_of_patients), hjust = 0.5, color = "black", size = 3) +
  labs(
    title = "Gender Distribution by Number of Patients",
    x = "Gender",
    y = "Number of Patients"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note
Findings:
Males formed the largest group, followed closely by females; the margin between the two was small. Note also that 10 patients did not register their gender.
3.0 What is the marital status count of the patients?
ggplot(marital_count, aes(x = reorder(marital_status, -number_of_patients), y = number_of_patients)) +
  geom_bar(stat = "identity", fill = c("#B45D58", "#B8647D", "#A875A0", "#848BB8")) +
  geom_text(aes(label = number_of_patients), hjust = 0.5, color = "black", size = 3) +
  labs(
    title = "Distribution of Marital Status Among Patients",
    x = "Marital Status",
    y = "Number of Patients"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note
Findings: The Married category had the highest number of patients, whereas the Divorced category had the fewest. It is also important to note that 13 patients did not register their marital status.
4.0 How many children did the patients have?
children_counts <- health_insurance %>%
  group_by(how_many_children_do_you_have_if_any) %>%
  summarise(count_of_children = n()) %>%
  arrange(desc(count_of_children))

datatable(
  children_counts,
  options = list(pageLength = 10, lengthMenu = c(10, 25, 50)),
  colnames = c("Number of Children Category", "Children Count")
)
Note
Findings: From the table, we can conclude that patients with no children were the largest group.
ggplot(employement_status, aes(x = reorder(employment_status, -employment_count), y = employment_count)) +
  geom_bar(stat = "identity", fill = c("#EF5350", "#D32F2F", "#BA68C8", "#9C27B0")) +
  geom_text(aes(label = employment_count), hjust = 0.5, color = "black", size = 3) +
  labs(
    title = "Employment Status Distribution Among Patients",
    x = "Employment Status",
    y = "Number of Patients"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note
Findings: From the plot, employed patients were the largest group, whereas the self-employed were the smallest. However, the margin between the employed and the unemployed is small. We also have 20 patients who did not record their employment status.
ggplot(employment_insurance2, aes(x = reorder(employment_status, total), y = total, fill = have_you_ever_had_health_insurance)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = total), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) +
  labs(
    title = "Health Insurance Status by Employment Type",
    x = "Employment Status",
    y = "Number of Users",
    fill = "Health Insurance"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
According to the plot, the employed and self-employed categories have the highest numbers of patients with insurance, whereas the unemployed category has many patients without health insurance.
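The two-way table behind this plot (`employment_insurance2`) is not shown. A minimal sketch of how such counts might be built with `count()` follows; the toy values are made up, and only the column names come from the plot code.

```r
library(dplyr)

# Toy data (made up); column names match those used in the plot.
toy <- data.frame(
  employment_status = c("Employed", "Employed", "Unemployed", "Unemployed", "Self-employed"),
  have_you_ever_had_health_insurance = c("Yes", "Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# One row per (employment status, insurance answer) pair, with its count in `total`.
employment_insurance2 <- toy %>%
  count(employment_status, have_you_ever_had_health_insurance, name = "total")
```

Because group sizes differ, comparing the share insured within each employment category, rather than raw counts, may give a fairer picture.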
5.2. Employment, ever insurance, type of insurance
Findings: Patients earning less than 10,000 were the largest group, whereas those earning 30,000-40,000 were the smallest. We also have 255 patients who did not register their monthly household income.
ggplot(ever_had_insurance, aes(x = reorder(have_you_ever_had_health_insurance, -insurance_count), y = insurance_count)) +
  geom_bar(stat = "identity", fill = c("#EF5350", "#BA68C8")) +
  geom_text(aes(label = insurance_count), hjust = 0.5, color = "black", size = 3) +
  labs(
    title = "Have Patients Ever Had Health Insurance?",
    x = "Response (Yes/No)",
    y = "Number of Patients"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
8.0 What type of insurance do the patients have?
# since we have insurance with the same name but different cases, we can standardize them to lower case so as to easily merge them
health_insurance <- health_insurance %>%
  mutate(if_yes_which_insurance_cover = str_to_lower(if_yes_which_insurance_cover))
# we also have same insurance appearing twice in different names, lets normalize it
health_insurance <- health_insurance %>%
  mutate(if_yes_which_insurance_cover = case_when(
    str_detect(if_yes_which_insurance_cover, regex("apa", ignore_case = TRUE)) ~ "APA Insurance",
    str_detect(if_yes_which_insurance_cover, regex("jubilee", ignore_case = TRUE)) ~ "Jubilee Insurance",
    str_detect(if_yes_which_insurance_cover, regex("cic", ignore_case = TRUE)) ~ "CIC Insurance",
    TRUE ~ if_yes_which_insurance_cover
  ))
ggplot(insurance_type, aes(x = reorder(if_yes_which_insurance_cover, insurance_type_count), y = insurance_type_count, fill = insurance_type_count)) +
  geom_bar(stat = "identity", fill = c("#2E7D32", "#388E3C", "#4CAF50", "#81C784", "#C8E6C9", "#FFB74D", "#FF9800", "#F57C00", "#FF5722", "#E57373")) +
  geom_text(aes(label = insurance_type_count), hjust = 0.5, color = "black", size = 3) +
  labs(
    title = "Top 10 Most Common Insurance Types",
    x = "Insurance Type",
    y = "Number of Users"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
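One caveat with substring matching of this kind: a short pattern such as "apa" also matches inside longer words. The sketch below compares the plain pattern with a stricter word-boundary variant on made-up strings ("papaya cover" is a hypothetical entry, not taken from the data). Whether the stricter form is needed depends on what the column actually contains.

```r
library(stringr)

# Made-up sample entries; the last one is a hypothetical false positive.
samples <- c("apa", "APA Insurance", "papaya cover")

# Plain substring match, as used for the normalisation.
plain_match <- str_detect(samples, regex("apa", ignore_case = TRUE))

# Stricter variant: "apa" must stand alone as a whole word.
strict_match <- str_detect(samples, regex("\\bapa\\b", ignore_case = TRUE))
```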
Note
Findings: According to the plot, NHIF is the most common insurance used by the patients. However, a large number of patients do not use any insurance cover (represented as “not applicable”).
9.0 Did patients have insurance cover during their last hospital visit?
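The `insurance_type` table used for the top-10 plot is not shown. A minimal sketch of one way to build it, assuming the ten most frequent values are kept with `slice_max()`, is given below on made-up toy data.

```r
library(dplyr)

# Toy data (made up) standing in for the cleaned insurance cover column.
toy <- data.frame(
  if_yes_which_insurance_cover = c(
    "nhif", "nhif", "nhif",
    "not applicable", "not applicable",
    "APA Insurance"
  ),
  stringsAsFactors = FALSE
)

# Count each insurance type and keep at most the 10 most frequent.
insurance_type <- toy %>%
  count(if_yes_which_insurance_cover, name = "insurance_type_count") %>%
  slice_max(insurance_type_count, n = 10, with_ties = FALSE)
```

Setting `with_ties = FALSE` guarantees no more than ten rows, which matters here because the plot supplies exactly ten fill colours.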
# A tibble: 8 × 3
gender have_you_ever_had_a_cancer_screening_e_g_mammogram_colonoscopy…¹ total
<fct> <chr> <int>
1 Female No 2172
2 Female Unspecified 13
3 Female Yes 842
4 Male No 2471
5 Male Unspecified 14
6 Male Yes 618
7 Unknown No 8
8 Unknown Yes 2
# ℹ abbreviated name:
# ¹have_you_ever_had_a_cancer_screening_e_g_mammogram_colonoscopy_etc
ggplot(cancer_screen, aes(x = gender, y = total, fill = have_you_ever_had_a_cancer_screening_e_g_mammogram_colonoscopy_etc)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = total), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) +
  labs(
    title = "Cancer Screening Status by Gender",
    x = "Gender",
    y = "Number of Patients",
    fill = "Ever Undergone Cancer Screening?"
  ) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_minimal()
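Since the gender groups differ in size, raw counts can be hard to compare. As a rough sketch, the counts from the tibble above can be converted to within-gender percentages. The short column name `screened` is a stand-in introduced here for the long survey column name.

```r
library(dplyr)

# Counts copied from the tibble above; `screened` is a shortened, hypothetical column name.
cancer_screen <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Unknown", "Unknown"),
  screened = c("No", "Unspecified", "Yes", "No", "Unspecified", "Yes", "No", "Yes"),
  total = c(2172, 13, 842, 2471, 14, 618, 8, 2),
  stringsAsFactors = FALSE
)

# Percentage of each response within its gender group.
cancer_screen_share <- cancer_screen %>%
  group_by(gender) %>%
  mutate(share_pct = round(100 * total / sum(total), 1)) %>%
  ungroup()
```

The Unknown group contains only 10 patients, so its percentages should be read with caution.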