We aim to visually explore district-wise data on people and draw meaningful conclusions from the analysis. We first set the working directory.
setwd("C:/Users/salil/Desktop/AllDocuments/AnalyticsEdgeFolder")
We then load the data sets required for the analysis.
master <- readRDS("master.rds")
clinical <- readRDS("clinical.rds")
vital <- readRDS("vital.rds")
Initially, we require only the master data set, so we start with it. We first get an overview of the data and do some basic cleaning, beginning with the removal of rows that have NAs in the required columns.
summary(master)
## beneficiary_id age age_type dob
## Length:933212 Min. : 1.00 Length:933212 Length:933212
## Class :character 1st Qu.: 36.00 Class :character Class :character
## Mode :character Median : 55.00 Mode :character Mode :character
## Mean : 49.54
## 3rd Qu.: 65.00
## Max. :120.00
##
## gender state_id district_id village_id
## Length:933212 Min. :1 Min. : 1.000 Min. : 1
## Class :character 1st Qu.:1 1st Qu.: 4.000 1st Qu.: 3583
## Mode :character Median :1 Median : 7.000 Median : 6515
## Mean :1 Mean : 7.247 Mean : 6541
## 3rd Qu.:1 3rd Qu.:11.000 3rd Qu.: 9746
## Max. :1 Max. :13.000 Max. :12566
##
## van_id date_of_registration marital_status father_name
## Min. : 1.0 Length:933212 Length:933212 Length:933212
## 1st Qu.: 78.0 Class :character Class :character Class :character
## Median :142.0 Mode :character Mode :character Mode :character
## Mean :140.9
## 3rd Qu.:205.0
## Max. :283.0
##
## contact_no occupation community religion
## Length:933212 Length:933212 Length:933212 Length:933212
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## education_status economic_status source_water toilet_home
## Length:933212 Length:933212 Min. :1.000 Length:933212
## Class :character Class :character 1st Qu.:1.000 Class :character
## Mode :character Mode :character Median :1.000 Mode :character
## Mean :1.068
## 3rd Qu.:1.000
## Max. :3.000
## NA's :1
## booklet_issued visit_count created_by created_on
## Length:933212 Min. : 0.000 Length:933212 Length:933212
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 2.000 Mode :character Mode :character
## Mean : 5.987
## 3rd Qu.: 8.000
## Max. :53.000
## NA's :196
colSums(is.na(master))
## beneficiary_id age age_type
## 0 0 0
## dob gender state_id
## 15 1 0
## district_id village_id van_id
## 0 0 0
## date_of_registration marital_status father_name
## 0 1 1
## contact_no occupation community
## 1 1 1
## religion education_status economic_status
## 1 1 1
## source_water toilet_home booklet_issued
## 1 1 701414
## visit_count created_by created_on
## 196 0 0
which(is.na(master$education_status))
## [1] 913053
master[913053,]
## beneficiary_id age age_type dob gender state_id district_id village_id
## 2406586 test 12 Y <NA> <NA> 1 1 1
## van_id date_of_registration marital_status father_name contact_no
## 2406586 1 t_2017-03-29 17:00:00 <NA> <NA> <NA>
## occupation community religion education_status economic_status
## 2406586 <NA> <NA> <NA> <NA> <NA>
## source_water toilet_home booklet_issued visit_count created_by
## 2406586 NA <NA> <NA> NA sa
## created_on
## 2406586 t_2017-03-29 17:00:00
master1 <- as.data.frame(master[which(!is.na(master$education_status)),])
unique(master1$age_type)
## [1] "Y" "D" "M"
The age_type column has three unique values: ‘Y’, ‘M’ and ‘D’, which signify ‘Years’, ‘Months’ and ‘Days’. As we are concerned with the number of years the patient has completed, we set the age to 0 for rows recorded as ‘M’ or ‘D’ and change their age_type to ‘Y’ as well.
master1$age[which(master1$age_type=="D")] <- 0
master1$age[which(master1$age_type=="M")] <- 0
master1$age_type[master1$age_type=="D"] <- c("Y")
master1$age_type[master1$age_type=="M"] <- c("Y")
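The four assignments above can also be written as one vectorised step. The following is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not part of the data set). Using `%in%` rather than `==` means a missing age_type is treated as "not a match" instead of producing NA in the index.

```r
# Toy data standing in for master1 (values are made up for illustration)
toy <- data.frame(
  age      = c(34, 7, 20, 61),
  age_type = c("Y", "M", "D", "Y"),
  stringsAsFactors = FALSE
)

# Rows recorded in months or days are treated as not having completed a year
not_years <- toy$age_type %in% c("M", "D")
toy$age[not_years]      <- 0
toy$age_type[not_years] <- "Y"

toy
```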
With the cleaning done, we start our first visualization task: representing and comparing the educational attainment of people across generations in the various districts of the state. For this we require the ‘district_id’, ‘age’ and ‘education_status’ columns of the master1 data set. We first replace the district ids with the corresponding district names to make the visualization more informative.
unique(master1$district_id)
## [1] 9 8 2 13 4 1 10 3 11 12 5 6 7
district_id <- c(1:13)
district_names <- c("East Godavari", "Srikakulam", "Visakhapatnam", "Vizianagaram", "Guntur", "Krishna", "Nellore", "Prakasam", "West Godavari", "Ananthapur", "Chittoor", "Kadapa", "Kurnool")
lapply(1:13,FUN = function(i){master1$district_id[master1$district_id == district_id[i]] <<- district_names[i]})
## [[1]]
## [1] "East Godavari"
##
## [[2]]
## [1] "Srikakulam"
##
## [[3]]
## [1] "Visakhapatnam"
##
## [[4]]
## [1] "Vizianagaram"
##
## [[5]]
## [1] "Guntur"
##
## [[6]]
## [1] "Krishna"
##
## [[7]]
## [1] "Nellore"
##
## [[8]]
## [1] "Prakasam"
##
## [[9]]
## [1] "West Godavari"
##
## [[10]]
## [1] "Ananthapur"
##
## [[11]]
## [1] "Chittoor"
##
## [[12]]
## [1] "Kadapa"
##
## [[13]]
## [1] "Kurnool"
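Because the district ids are the consecutive integers 1 to 13, the same recoding can be sketched as a plain vector lookup, which avoids `<<-` and the printed list above. The vector `ids` below is a toy stand-in for master1$district_id; the names are those listed in the text.

```r
# District names in id order (position i holds the name for id i)
district_names <- c("East Godavari", "Srikakulam", "Visakhapatnam", "Vizianagaram",
                    "Guntur", "Krishna", "Nellore", "Prakasam", "West Godavari",
                    "Ananthapur", "Chittoor", "Kadapa", "Kurnool")

# Toy ids standing in for master1$district_id
ids <- c(9, 8, 2, 13, 1)

# Indexing the names by id recodes every element in one step
recoded <- district_names[ids]
recoded
```

On the real data this would read `master1$district_id <- district_names[master1$district_id]`, assuming the column is still numeric at that point.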
As the objective is to analyze the data across generations, we categorise people into various age groups.
master2 <- master1[which(master1$age<=18),]
master3 <- master1[which(master1$age>18 & master1$age<=35),]
master4 <- master1[which(master1$age>35 & master1$age<=50),]
master5 <- master1[which(master1$age>50 & master1$age<=70),]
master6 <- master1[which(master1$age>70),]
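As an alternative to five separate subsets, the age bands can be stored as a single factor column with `cut()`. This is a sketch on a toy vector of ages; the band labels are the generation names used in this report, and the break points mirror the conditions above.

```r
# Toy ages chosen to sit on both sides of every boundary
ages <- c(0, 18, 19, 35, 36, 50, 51, 70, 71, 120)

# right = TRUE makes each band include its upper limit, e.g. (18, 35]
age_group <- cut(
  ages,
  breaks = c(-Inf, 18, 35, 50, 70, Inf),
  labels = c("Post Millennials", "Young Adults", "Middle Aged",
             "Experienced", "Old Aged"),
  right = TRUE
)

table(age_group)
```

With such a column in place, `split(master1, master1$age_group)` would give the same five data frames in one list.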
We load the ‘ggplot2’ and ‘plotly’ packages for better visualization of our data.
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
We first derive the education details of the ‘Post Millennials’ generation, which comprises people aged 0 to 18 years. We construct a table and then represent its contents graphically.
table2 <- table(master2$district_id,master2$education_status)
table2
##
## College High School Illiterate Primary School
## Ananthapur 155 778 4223 3578
## Chittoor 227 871 4811 3971
## East Godavari 126 862 2987 3364
## Guntur 100 468 3620 2338
## Kadapa 143 851 5770 3226
## Krishna 70 269 1621 921
## Kurnool 191 988 5603 2568
## Nellore 170 859 4961 5495
## Prakasam 92 616 5252 3068
## Srikakulam 112 600 3354 1562
## Visakhapatnam 161 834 4244 1829
## Vizianagaram 117 349 1797 1144
## West Godavari 50 278 1297 1222
plot2 <- ggplot(master2, aes(x=district_id, fill=education_status)) +
geom_bar() +
labs(title = "Education Status of the Post Millennials") + coord_flip()
plot2 <- ggplotly(plot2)
plot2
We then derive the education details of the ‘Young Adults’ generation, comprising people aged 19 to 35 years.
table3 <- table(master3$district_id,master3$education_status)
table3
##
## College High School Illiterate Primary School
## Ananthapur 1029 3258 7848 3846
## Chittoor 2014 2976 5471 3108
## East Godavari 774 1779 4678 2363
## Guntur 975 1826 7195 3136
## Kadapa 852 1446 7263 2783
## Krishna 774 1564 4237 2089
## Kurnool 940 2658 10401 3948
## Nellore 814 1386 5205 3398
## Prakasam 373 571 5424 941
## Srikakulam 1048 1809 3424 1293
## Visakhapatnam 1663 2065 5383 1247
## Vizianagaram 889 1086 3109 792
## West Godavari 751 1907 2329 1729
plot3 <- ggplot(master3, aes(x=district_id, fill=education_status)) +
geom_bar() +
labs(title = "Education Status of the Young Adults") + coord_flip()
plot3 <- ggplotly(plot3)
plot3
We do the same for the ‘Middle Aged’ generation, comprising people aged 36 to 50.
table4 <- table(master4$district_id,master4$education_status)
table4
##
## College High School Illiterate Primary School
## Ananthapur 243 780 9059 2398
## Chittoor 674 1551 10563 3535
## East Godavari 300 986 10232 4024
## Guntur 307 772 11222 3555
## Kadapa 375 829 10829 3360
## Krishna 244 954 10356 3539
## Kurnool 344 1192 9044 2376
## Nellore 505 842 10803 5133
## Prakasam 301 609 11239 1417
## Srikakulam 262 902 7049 1207
## Visakhapatnam 317 725 10023 1605
## Vizianagaram 194 384 7594 1069
## West Godavari 257 820 6065 3425
plot4 <- ggplot(master4, aes(x=district_id, fill=education_status)) +
geom_bar() +
labs(title = "Education Status of the Middle Aged") + coord_flip()
plot4 <- ggplotly(plot4)
plot4
We move to the next generation, the ‘Experienced’ lot. This group includes people aged 51 to 70.
table5 <- table(master5$district_id,master5$education_status)
table5
##
## College High School Illiterate Primary School
## Ananthapur 207 954 26473 4806
## Chittoor 577 1827 27688 7822
## East Godavari 252 1179 27871 8244
## Guntur 237 1156 31998 8291
## Kadapa 246 870 30295 6539
## Krishna 182 1433 25855 7517
## Kurnool 215 1146 21586 3203
## Nellore 360 868 29281 10822
## Prakasam 200 943 30840 4773
## Srikakulam 293 1513 25944 2858
## Visakhapatnam 195 542 22710 2815
## Vizianagaram 152 474 22640 3692
## West Godavari 255 1083 20619 8298
plot5 <- ggplot(master5, aes(x=district_id, fill=education_status)) +
geom_bar() +
labs(title = "Education Status of the Experienced") + coord_flip()
plot5 <- ggplotly(plot5)
plot5
Finally, we analyze the education status of the old aged, a group that includes everyone above 70 years of age.
table6 <- table(master6$district_id,master6$education_status)
table6
##
## College High School Illiterate Primary School
## Ananthapur 24 136 6485 950
## Chittoor 85 338 6412 1715
## East Godavari 42 183 4093 1134
## Guntur 48 247 7654 2062
## Kadapa 36 152 5866 1451
## Krishna 33 232 3966 1309
## Kurnool 30 220 3906 399
## Nellore 45 114 4226 1432
## Prakasam 31 139 7384 1232
## Srikakulam 36 256 5406 511
## Visakhapatnam 23 56 1908 323
## Vizianagaram 31 57 3670 264
## West Godavari 45 127 4122 1306
plot6 <- ggplot(master6, aes(x=district_id, fill=education_status)) +
geom_bar() +
labs(title = "Education Status of the Old Aged") + coord_flip()
plot6 <- ggplotly(plot6)
plot6
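The districts differ a lot in size, so raw counts can be hard to compare across them. One option is to convert each row of a count table to percentages with `prop.table()`. The sketch below uses two rows copied from the Post Millennials table above (Krishna and Nellore); building the matrix by hand is only for illustration.

```r
# Two rows taken from table2 above
counts <- matrix(
  c( 70, 269, 1621,  921,
    170, 859, 4961, 5495),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    district  = c("Krishna", "Nellore"),
    education = c("College", "High School", "Illiterate", "Primary School")
  )
)

# margin = 1 makes each row (district) sum to 100 percent
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```

On the full data, `prop.table(table2, margin = 1)` would do the same for all thirteen districts, and `geom_bar(position = "fill")` is the ggplot2 counterpart for the stacked bars.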
Our next task is to visualize the occupational structure of males and females. For this we require the ‘gender’ and ‘occupation’ columns of the master1 data set. The occupation column holds a mix of numeric ids and names, so, as with the district ids, we first replace the ids with the corresponding occupation names using the reference data set provided to us.
unique(master1$occupation)
## [1] "8" "2"
## [3] "4" "6"
## [5] "7" "3"
## [7] "1" "5"
## [9] "Cultivation(Agriculture)" "Business"
## [11] "Agricultural labour" "Homemaker"
## [13] "Others" "Unemployed"
## [15] "Govt employee" "Private employee"
OccupationID <- c(1:8)
OccupationName <- c("Cultivation(Agriculture)", "Agricultural labour", "Business", "Unemployed", "Govt employee", "Private employee", "Others", "Homemaker")
lapply(1:8,FUN = function(i){master1$occupation[master1$occupation == OccupationID[i]] <<- OccupationName[i]})
## [[1]]
## [1] "Cultivation(Agriculture)"
##
## [[2]]
## [1] "Agricultural labour"
##
## [[3]]
## [1] "Business"
##
## [[4]]
## [1] "Unemployed"
##
## [[5]]
## [1] "Govt employee"
##
## [[6]]
## [1] "Private employee"
##
## [[7]]
## [1] "Others"
##
## [[8]]
## [1] "Homemaker"
We then split the data into two parts, one comprising males and the other females; the other values in the gender column (‘T’ and ‘N/A’) are not used here.
unique(master1$gender)
## [1] "F" "M" "T" "N/A"
master7 <- master1[master1$gender=="F",]
master8 <- master1[master1$gender=="M",]
We then build a table of the number of females in each occupation.
master7a <- data.frame(table(master7$occupation))
colnames(master7a) <- c("Occupation", "Frequency")
We finally draw a pie chart from the derived information.
master7a$Prop <- round(master7a$Frequency/sum(master7a$Frequency)*100, 2)
master7a$Occupation <- paste0(master7a$Occupation, " ", master7a$Prop, "%")
pie(master7a$Prop, labels = master7a$Occupation, col = rainbow(length(master7a$Occupation)), main="Occupational Structure of Females", cex=0.75)
We repeat the same steps to compute the occupational structure of all the males in the data set.
master8a <- data.frame(table(master8$occupation))
colnames(master8a) <- c("Occupation", "Frequency")
master8a$Prop <- round(master8a$Frequency/sum(master8a$Frequency)*100, 2)
master8a$Occupation <- paste0(master8a$Occupation, " ", master8a$Prop, "%")
pie(master8a$Prop, labels = master8a$Occupation, col = rainbow(length(master8a$Occupation)), main="Occupational Structure of Males", cex=0.7)
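Since the female and male charts follow identical steps, the shared part could be wrapped in a small function. `occupation_shares` below is a hypothetical helper written for this sketch, not something used in the analysis above, and the input vector is made up.

```r
# Hypothetical helper: percentage share of each occupation in a character vector
occupation_shares <- function(occupation) {
  counts <- table(occupation)
  out <- data.frame(
    Occupation = names(counts),
    Frequency  = as.vector(counts),
    stringsAsFactors = FALSE
  )
  out$Prop  <- round(100 * out$Frequency / sum(out$Frequency), 2)
  # Label used on the pie slices, e.g. "Business 25%"
  out$Label <- paste0(out$Occupation, " ", out$Prop, "%")
  out
}

# Toy input standing in for master7$occupation or master8$occupation
toy <- c("Homemaker", "Homemaker", "Business", "Others")
shares <- occupation_shares(toy)
shares
```

A call such as `pie(shares$Prop, labels = shares$Label)` would then draw the chart for either gender.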
Next we examine and visualize the monthly footfall of people. This helps us see which parts of the year bring the largest number of people in for checkups and health consultations, so that doctors can be appointed according to the number of patients registering in each month. For this we require the date_of_registration column of the master data set. We first load the lubridate package, as we will be dealing with dates.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
We then create a separate column to record the month of each registration.
master1$date_of_registration <- strptime(master1$date_of_registration, format = "t_%Y-%m-%d %H:%M:%S")
master1$Month <- months.POSIXt(master1$date_of_registration)
Next, we create another data set that holds the number of registrations in each month.
MonthWiseCount <- data.frame(table(master1$Month))
The data set has no registrations in the month of April. This is probably a gap in the data, so we proceed with the information we have. The table is sorted alphabetically by month name, so we first put the months in calendar order.
MonthWiseCount$Var1 <- factor(MonthWiseCount$Var1, ordered = TRUE, levels = c("January", "February", "March", "May", "June", "July", "August", "September", "October", "November", "December"))
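A variation on this step is to set the factor levels before tabulating, using R's built-in `month.name` constant. This keeps all twelve months in calendar order and shows a missing month as an explicit zero rather than dropping it. The sketch uses a made-up vector of month names and assumes they are in English, since the names returned by `months()` depend on the locale.

```r
# Toy month names standing in for master1$Month (April deliberately absent)
months_seen <- c("May", "January", "May", "December", "March")

# month.name is the built-in vector "January", ..., "December"
month_factor <- factor(months_seen, levels = month.name)

# table() keeps unused levels, so April appears with a count of 0
counts <- table(month_factor)
counts[c("March", "April", "May")]
```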
We first draw a line graph with the names of the months on the x-axis and the number of registrations on the y-axis.
ggplot(MonthWiseCount, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Month") + ylab("Number of Registrations")
To make it more visually interesting, we draw a heatmap, which helps us recognise which district requires how much medical attention during which part of the year. We first build a table of registrations by month and district and then draw the heatmap from it.
MonthNDistrictCount <- data.frame(table(master1$Month, master1$district_id))
MonthNDistrictCount$Var1 <- factor(MonthNDistrictCount$Var1, ordered = TRUE, levels = c("January", "February", "March", "May", "June", "July", "August", "September", "October", "November", "December"))
colnames(MonthNDistrictCount) <- c("Month", "District", "Number of Registrations")
ggplot(MonthNDistrictCount, aes(x=District, y=Month)) + geom_tile(aes(fill=`Number of Registrations`)) + scale_fill_gradient(name="Number of Registrations", low = "White", high="Red") + theme(axis.title.y = element_blank())
We move on to our next task, for which we use the clinical data set. It records the disease or ailment for which the patient visited. We use it to count ailments by gender and show the counts in a multiple bar diagram.
unique(clinical$category_id)
## [1] 0 3 5 1 2 4
The category_id column contains the value ‘0’ along with the others, and the reference table has no entry for it. This is probably due to a glitch while recording the details or a problem in diagnosing the illness. We therefore remove this category.
clinical1 <- clinical[which(clinical$category_id != 0), c(2,3)]
As we did previously, we will assign names to the respective category ids to make the graph more informative.
category_id <- c(1:5)
category_id_names <- c("Reproductive and Child Health", "Communicable Diseases", "Chronic", "Minor Ailments", "Others")
lapply(1:5,FUN = function(i){clinical1$category_id[clinical1$category_id == category_id[i]] <<- category_id_names[i]})
## [[1]]
## [1] "Reproductive and Child Health"
##
## [[2]]
## [1] "Communicable Diseases"
##
## [[3]]
## [1] "Chronic"
##
## [[4]]
## [1] "Minor Ailments"
##
## [[5]]
## [1] "Others"
To get the gender of each patient, we need to merge the clinical data set with the master data set. First we load the dplyr package.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We then select the required columns (beneficiary_id and gender) from the master data set.
master9 <- master1[,c(1,5)]
We then inner join the two data sets on the beneficiary_id column.
Joined1 <- inner_join(clinical1, master9, by="beneficiary_id")
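An inner join keeps only the rows whose key appears in both tables, so clinical records without a matching beneficiary are silently dropped. The sketch below illustrates this on two made-up data frames, using base R's `merge()` as a stand-in for `inner_join()` (its default is also an inner join). All names and values here are assumptions for illustration.

```r
# Toy tables standing in for clinical1 and master9
clinical_toy <- data.frame(
  beneficiary_id = c("a1", "a2", "a3", "a9"),
  category_id    = c(3, 1, 5, 2),
  stringsAsFactors = FALSE
)
master_toy <- data.frame(
  beneficiary_id = c("a1", "a2", "a3", "a4"),
  gender         = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default: only ids present in both survive
joined_toy <- merge(clinical_toy, master_toy, by = "beneficiary_id")

# Ids in the clinical table that found no match in the master table
unmatched <- setdiff(clinical_toy$beneficiary_id, master_toy$beneficiary_id)

joined_toy
unmatched
```

Comparing `nrow(Joined1)` with `nrow(clinical1)` on the real data would show how many records, if any, are lost in the join.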
We now select the subset of rows whose gender is ‘M’ or ‘F’.
Joined2 <- subset(Joined1, gender %in% c("M", "F"))
We assign ‘Male’ to ‘M’ and ‘Female’ to ‘F’.
Joined2$gender[Joined2$gender=="M"] <- "Male"
Joined2$gender[Joined2$gender=="F"] <- "Female"
We finally construct a multiple bar diagram.
Joined3 <- table(Joined2$gender, Joined2$category_id)
barplot(Joined3, beside = TRUE, legend=rownames(Joined3), ylab = "Number of Registrations", col = c("Blue", "Orange"), las=3)
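We can observe from the chart that females account for more registrations than males. This reflects recorded visits and does not by itself show that females are more prone to diseases and ailments.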
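Raw counts depend on how many beneficiaries of each gender are registered, so a comparison of rates can read differently from a comparison of counts. The sketch below uses made-up numbers purely to illustrate the arithmetic; none of the figures come from the data set.

```r
# Made-up numbers for illustration only; they are not taken from the data set
visits     <- c(Female = 600, Male = 400)    # clinical registrations
registered <- c(Female = 1200, Male = 500)   # beneficiaries in the master data

# Visits per registered person, by gender
visits_per_person <- visits / registered
visits_per_person
```

In this toy case females have more visits in total but fewer visits per registered person, which is why the denominators are worth checking before drawing conclusions.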
Our final task for the project is to count the people whose blood pressure falls in each interval. We use the vital data set for the systolic and diastolic blood pressure readings, starting with the systolic.
plot(vital$bp_systolic)
Some values lie outside the concentrated region of points. To exclude them, we filter the data set slightly so that the histogram focuses on the range where the large majority of values lie.
vital1 <- vital[which(vital$bp_systolic>=80 & vital$bp_systolic<=180),c(2,8)]
We then construct a histogram from the given information.
ggplot(vital1, aes(x=bp_systolic)) + geom_histogram(binwidth = 8, color="Red", fill="Blue") + labs(x="Systolic Blood Pressure", y="Number of Patients")
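If the counts per interval are needed as numbers rather than as bars, `cut()` followed by `table()` gives them directly. This is a sketch on a made-up vector of readings; the 20-unit interval width is an assumption chosen for illustration and differs from the bin width of 8 used in the histogram.

```r
# Toy systolic readings within the filtered range of 80 to 180
bp <- c(85, 99, 100, 118, 120, 121, 139, 140, 165, 180)

# Intervals [80,100], (100,120], (120,140], (140,160], (160,180]
intervals <- cut(bp, breaks = seq(80, 180, by = 20), include.lowest = TRUE)

table(intervals)
```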
We perform the same task for diastolic blood pressure.
plot(vital$bp_diastolic)
vital2 <- vital[which(vital$bp_diastolic>=50 & vital$bp_diastolic<=120),c(2,9)]
ggplot(vital2, aes(x=bp_diastolic)) + geom_histogram(binwidth = 8, color="Red", fill="Blue") + labs(x="Diastolic Blood Pressure", y="Number of Patients") + scale_y_continuous(limits = c(0,500000))