Data Visualization: Representation of District Wise People Data

We aim to visually interpret district wise people’s data and draw out some meaningful result from the analysis. We first set the working directory.

setwd("C:/Users/salil/Desktop/AllDocuments/AnalyticsEdgeFolder")

We then load in the data sets which we would require for the analysis.

master <- readRDS("master.rds")
clinical <- readRDS("clinical.rds")
vital <- readRDS("vital.rds")

Initially, we would only require the master data set. We therefore proceed with the same. First we try to get some overview and do some basic cleaning of the data. We begin by removing the NAs from the required columns.

summary(master)

##  beneficiary_id          age           age_type             dob           
##  Length:933212      Min.   :  1.00   Length:933212      Length:933212     
##  Class :character   1st Qu.: 36.00   Class :character   Class :character  
##  Mode  :character   Median : 55.00   Mode  :character   Mode  :character  
##                     Mean   : 49.54                                        
##                     3rd Qu.: 65.00                                        
##                     Max.   :120.00                                        
##                                                                           
##     gender             state_id  district_id       village_id   
##  Length:933212      Min.   :1   Min.   : 1.000   Min.   :    1  
##  Class :character   1st Qu.:1   1st Qu.: 4.000   1st Qu.: 3583  
##  Mode  :character   Median :1   Median : 7.000   Median : 6515  
##                     Mean   :1   Mean   : 7.247   Mean   : 6541  
##                     3rd Qu.:1   3rd Qu.:11.000   3rd Qu.: 9746  
##                     Max.   :1   Max.   :13.000   Max.   :12566  
##                                                                 
##      van_id      date_of_registration marital_status     father_name       
##  Min.   :  1.0   Length:933212        Length:933212      Length:933212     
##  1st Qu.: 78.0   Class :character     Class :character   Class :character  
##  Median :142.0   Mode  :character     Mode  :character   Mode  :character  
##  Mean   :140.9                                                             
##  3rd Qu.:205.0                                                             
##  Max.   :283.0                                                             
##                                                                            
##   contact_no         occupation         community           religion        
##  Length:933212      Length:933212      Length:933212      Length:933212     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  education_status   economic_status     source_water   toilet_home       
##  Length:933212      Length:933212      Min.   :1.000   Length:933212     
##  Class :character   Class :character   1st Qu.:1.000   Class :character  
##  Mode  :character   Mode  :character   Median :1.000   Mode  :character  
##                                        Mean   :1.068                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :3.000                     
##                                        NA's   :1                         
##  booklet_issued      visit_count      created_by         created_on       
##  Length:933212      Min.   : 0.000   Length:933212      Length:933212     
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 2.000   Mode  :character   Mode  :character  
##                     Mean   : 5.987                                        
##                     3rd Qu.: 8.000                                        
##                     Max.   :53.000                                        
##                     NA's   :196

colSums(is.na(master))

##       beneficiary_id                  age             age_type 
##                    0                    0                    0 
##                  dob               gender             state_id 
##                   15                    1                    0 
##          district_id           village_id               van_id 
##                    0                    0                    0 
## date_of_registration       marital_status          father_name 
##                    0                    1                    1 
##           contact_no           occupation            community 
##                    1                    1                    1 
##             religion     education_status      economic_status 
##                    1                    1                    1 
##         source_water          toilet_home       booklet_issued 
##                    1                    1               701414 
##          visit_count           created_by           created_on 
##                  196                    0                    0

which(is.na(master$education_status))

## [1] 913053

master[913053,]

##         beneficiary_id age age_type  dob gender state_id district_id village_id
## 2406586           test  12        Y <NA>   <NA>        1           1          1
##         van_id  date_of_registration marital_status father_name contact_no
## 2406586      1 t_2017-03-29 17:00:00           <NA>        <NA>       <NA>
##         occupation community religion education_status economic_status
## 2406586       <NA>      <NA>     <NA>             <NA>            <NA>
##         source_water toilet_home booklet_issued visit_count created_by
## 2406586           NA        <NA>           <NA>          NA         sa
##                    created_on
## 2406586 t_2017-03-29 17:00:00

master1 <- as.data.frame(master[which(!is.na(master$education_status)),])

unique(master1$age_type)

## [1] "Y" "D" "M"

The age_type column has 3 unique variables, namely ‘Y’, ‘M’ and ‘D’. These signify ‘Years’, ‘Months’ and ‘Days’. As we are concerned with the number of years the patient has completed, we convert the respective age values of ‘M’ and ‘D’ to ‘0’ and convert the age_type value to ‘Y’ as well.

master1$age[which(master1$age_type=="D")] <- 0
master1$age[which(master1$age_type=="M")] <- 0
master1$age_type[master1$age_type=="D"] <- c("Y")
master1$age_type[master1$age_type=="M"] <- c("Y")

After the required cleaning of the data, we start with our first task of visualization. In this we will graphically try to represent and differentiate the educational prosperity of people across several generations living in the various districts of the state. For this we require the ‘district_id’, ‘age’ and ‘education_status’ columns of the master1 data set. We first change the district ids to respective district names to make our visualization informative, detailed and appealing.

unique(master1$district_id)

##  [1]  9  8  2 13  4  1 10  3 11 12  5  6  7

district_id <- c(1:13)
district_names <- c("East Godavari", "Srikakulam", "Visakhapatnam", "Vizianagaram", "Guntur", "Krishna", "Nellore", "Prakasam", "West Godavari", "Ananthapur", "Chittoor", "Kadapa", "Kurnool")
lapply(1:13,FUN = function(i){master1$district_id[master1$district_id == district_id[i]] <<- district_names[i]})

## [[1]]
## [1] "East Godavari"
## 
## [[2]]
## [1] "Srikakulam"
## 
## [[3]]
## [1] "Visakhapatnam"
## 
## [[4]]
## [1] "Vizianagaram"
## 
## [[5]]
## [1] "Guntur"
## 
## [[6]]
## [1] "Krishna"
## 
## [[7]]
## [1] "Nellore"
## 
## [[8]]
## [1] "Prakasam"
## 
## [[9]]
## [1] "West Godavari"
## 
## [[10]]
## [1] "Ananthapur"
## 
## [[11]]
## [1] "Chittoor"
## 
## [[12]]
## [1] "Kadapa"
## 
## [[13]]
## [1] "Kurnool"

As the objective is to analyze the data across generations, we categorise people into various age groups.

master2 <- master1[which(master1$age<=18),]
master3 <- master1[which(master1$age>18 & master1$age<=35),]
master4 <- master1[which(master1$age>35 & master1$age<=50),]
master5 <- master1[which(master1$age>50 & master1$age<=70),]
master6 <- master1[which(master1$age>70),]

We load the ‘ggplot2’ and the ‘plotly’ package for better visualization of our data.

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

We first derive the education details of the generation of ‘Post Millennials’. This generation comprises of people belonging to the age between 0 and 18 years. First we construct a table and then we graphically represent the details mentioned in the table.

table2 <- table(master2$district_id,master2$education_status)
table2

##                
##                 College High School Illiterate Primary School
##   Ananthapur        155         778       4223           3578
##   Chittoor          227         871       4811           3971
##   East Godavari     126         862       2987           3364
##   Guntur            100         468       3620           2338
##   Kadapa            143         851       5770           3226
##   Krishna            70         269       1621            921
##   Kurnool           191         988       5603           2568
##   Nellore           170         859       4961           5495
##   Prakasam           92         616       5252           3068
##   Srikakulam        112         600       3354           1562
##   Visakhapatnam     161         834       4244           1829
##   Vizianagaram      117         349       1797           1144
##   West Godavari      50         278       1297           1222

plot2 <- ggplot(master2, aes(x=district_id, fill=education_status)) +
  geom_bar() +
  labs(title = "Education Status of the Post Millennials") + coord_flip()
plot2 <- ggplotly(plot2)
plot2

We then derive the education details of the generation of ‘Young Adults’ comprising of people between the age of 19 and 35 years.

table3 <- table(master3$district_id,master3$education_status)
table3

##                
##                 College High School Illiterate Primary School
##   Ananthapur       1029        3258       7848           3846
##   Chittoor         2014        2976       5471           3108
##   East Godavari     774        1779       4678           2363
##   Guntur            975        1826       7195           3136
##   Kadapa            852        1446       7263           2783
##   Krishna           774        1564       4237           2089
##   Kurnool           940        2658      10401           3948
##   Nellore           814        1386       5205           3398
##   Prakasam          373         571       5424            941
##   Srikakulam       1048        1809       3424           1293
##   Visakhapatnam    1663        2065       5383           1247
##   Vizianagaram      889        1086       3109            792
##   West Godavari     751        1907       2329           1729

plot3 <- ggplot(master3, aes(x=district_id, fill=education_status)) +
  geom_bar() +
  labs(title = "Education Status of the Young Adults") + coord_flip()
plot3 <- ggplotly(plot3)
plot3

We do the same for the ‘Middle Aged’ generation having people between the age of 36 and 50.

table4 <- table(master4$district_id,master4$education_status)
table4

##                
##                 College High School Illiterate Primary School
##   Ananthapur        243         780       9059           2398
##   Chittoor          674        1551      10563           3535
##   East Godavari     300         986      10232           4024
##   Guntur            307         772      11222           3555
##   Kadapa            375         829      10829           3360
##   Krishna           244         954      10356           3539
##   Kurnool           344        1192       9044           2376
##   Nellore           505         842      10803           5133
##   Prakasam          301         609      11239           1417
##   Srikakulam        262         902       7049           1207
##   Visakhapatnam     317         725      10023           1605
##   Vizianagaram      194         384       7594           1069
##   West Godavari     257         820       6065           3425

plot4 <- ggplot(master4, aes(x=district_id, fill=education_status)) +
  geom_bar() +
  labs(title = "Education Status of the Middle Aged") + coord_flip()
plot4 <- ggplotly(plot4)
plot4

We move to the next generation called the ‘Experienced’ lot. This group includes people between the age of 51 and 70.

table5 <- table(master5$district_id,master5$education_status)
table5

##                
##                 College High School Illiterate Primary School
##   Ananthapur        207         954      26473           4806
##   Chittoor          577        1827      27688           7822
##   East Godavari     252        1179      27871           8244
##   Guntur            237        1156      31998           8291
##   Kadapa            246         870      30295           6539
##   Krishna           182        1433      25855           7517
##   Kurnool           215        1146      21586           3203
##   Nellore           360         868      29281          10822
##   Prakasam          200         943      30840           4773
##   Srikakulam        293        1513      25944           2858
##   Visakhapatnam     195         542      22710           2815
##   Vizianagaram      152         474      22640           3692
##   West Godavari     255        1083      20619           8298

plot5 <- ggplot(master5, aes(x=district_id, fill=education_status)) +
  geom_bar() +
  labs(title = "Education Status of the Experienced") + coord_flip()
plot5 <- ggplotly(plot5)
plot5

We finally do the education status analysis of the old aged people. We include every person in this group who has an age above 70 years.

table6 <- table(master6$district_id,master6$education_status)
table6

##                
##                 College High School Illiterate Primary School
##   Ananthapur         24         136       6485            950
##   Chittoor           85         338       6412           1715
##   East Godavari      42         183       4093           1134
##   Guntur             48         247       7654           2062
##   Kadapa             36         152       5866           1451
##   Krishna            33         232       3966           1309
##   Kurnool            30         220       3906            399
##   Nellore            45         114       4226           1432
##   Prakasam           31         139       7384           1232
##   Srikakulam         36         256       5406            511
##   Visakhapatnam      23          56       1908            323
##   Vizianagaram       31          57       3670            264
##   West Godavari      45         127       4122           1306

plot6 <- ggplot(master6, aes(x=district_id, fill=education_status)) +
  geom_bar() +
  labs(title = "Education Status of the Old Aged") + coord_flip()
plot6 <- ggplotly(plot6)
plot6

Our next task is to visually represent the occupational structure of males and females. For this we would require the ‘gender’ and the ‘occupation’ columns of the master1 data set. As we did in the case of district ids, we would first change the occupation ids to respective occupation names. We make use of the reference data set provided to us.

unique(master1$occupation)

##  [1] "8"                        "2"                       
##  [3] "4"                        "6"                       
##  [5] "7"                        "3"                       
##  [7] "1"                        "5"                       
##  [9] "Cultivation(Agriculture)" "Business"                
## [11] "Agricultural labour"      "Homemaker"               
## [13] "Others"                   "Unemployed"              
## [15] "Govt employee"            "Private employee"

OccupationID <- c(1:8)
OccupationName <- c("Cultivation(Agriculture)", "Agricultural labour", "Business", "Unemployed", "Govt employee", "Private employee", "Others", "Homemaker")
lapply(1:8,FUN = function(i){master1$occupation[master1$occupation == OccupationID[i]] <<- OccupationName[i]})

## [[1]]
## [1] "Cultivation(Agriculture)"
## 
## [[2]]
## [1] "Agricultural labour"
## 
## [[3]]
## [1] "Business"
## 
## [[4]]
## [1] "Unemployed"
## 
## [[5]]
## [1] "Govt employee"
## 
## [[6]]
## [1] "Private employee"
## 
## [[7]]
## [1] "Others"
## 
## [[8]]
## [1] "Homemaker"

We then split the data into two parts, one comprising of males and the other comprising of females.

unique(master1$gender)

## [1] "F"   "M"   "T"   "N/A"

master7 <- master1[master1$gender=="F",]
master8 <- master1[master1$gender=="M",]

We then build a table cumulating the frequencies of females working across various sectors.

master7a <- data.frame(table(master7$occupation))
colnames(master7a) <- c("Ocuupation", "Frequency")

We finally draw a pie chart out of the derived information.

master7a$Prop <- round(master7a$Frequency/sum(master7a$Frequency)*100, 2)
master7a$Ocuupation <- paste(master7a$Ocuupation, master7a$Prop, "%", sep = "")
pie(master7a$Prop, labels = master7a$Ocuupation, col = rainbow(length(master7a$Ocuupation)), main="Occupational Structure of Females", cex=0.75)

We repeat the same thing to compute the occupational structure of all the males in the data set.

master8a <- data.frame(table(master8$occupation))
colnames(master8a) <- c("Ocuupation", "Frequency")
master8a$Prop <- round(master8a$Frequency/sum(master8a$Frequency)*100, 2)
master8a$Ocuupation <- paste(master8a$Ocuupation, master8a$Prop, "%", sep = "")
pie(master8a$Prop, labels = master8a$Ocuupation, col = rainbow(length(master8a$Ocuupation)), main="Occupational Structure of Males", cex=0.7)

The next thing we do is examine and visualize the monthly footfall of people. This will help us sense which parts of the year experiences the maximum number of people coming for checkups and health consults. Doctors can be appointed according to the number of patients getting registered across different months of the year. For this we will require the date of registration column of the master data set. We first load the package lubridate as we will be dealing with dates.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

We then create a different column to register the months of the visits.

master1$date_of_registration <- strptime(master1$date_of_registration, format = "t_%Y-%m-%d %H:%M:%S")
master1$Month <- months.POSIXt(master1$date_of_registration)

Further we create another data set which notes the frequencies of visits according to the various months.

MonthWiseCount <- data.frame(table(master1$Month))

The data set does not have any details of patient visits in the month of April. This is a probable error so we would proceed with whatever information we have with us. The data is sorted alphabetically. We first have to change it as per the desired order.

MonthWiseCount$Var1 <- factor(MonthWiseCount$Var1, ordered = TRUE, levels = c("January", "February", "March", "May", "June", "July", "August", "September", "October", "November", "December"))

We first draw a line graph with the names of the months on the x-axis and the number of registrations on the y axis.

ggplot(MonthWiseCount, aes(x=Var1, y=MonthWiseCount$Freq)) + geom_line(aes(group=1)) + xlab("Month") + ylab("Number of Registrations")

To make it visually interesting, we draw a heatmap. This heatmap will help us recognise which region requires what amount of medical attention during which part of the year. First we make a table revealing the same information and then we go on to build a heatmap.

MonthNDistrictCount <- data.frame(table(master1$Month, master1$district_id))
MonthNDistrictCount$Var1 <- factor(MonthNDistrictCount$Var1, ordered = TRUE, levels = c("January", "February", "March", "May", "June", "July", "August", "September", "October", "November", "December"))
colnames(MonthNDistrictCount) <- c("Month", "District", "Number of Registrations")
ggplot(MonthNDistrictCount, aes(x=District, y=MonthNDistrictCount$Month)) + geom_tile(aes(fill=MonthNDistrictCount$`Number of Registrations`)) + scale_fill_gradient(name="Number of Registrations", low = "White", high="Red") + theme(axis.title.y = element_blank())

We move on to our next task and now we will make use of the clinical data set. The clinical data set records the details of the disease or the ailment the patient had visited for. We will use the data set to record the gender wise count of the ailments. For this purpose we use the multiple bar diagram.

unique(clinical$category_id)

## [1] 0 3 5 1 2 4

The category id column contains a variable ‘0’ along with others. As per the reference table there are no details about the same. This probably would have happened because of some glitch while recording the details or because of issues while detection of the illness. We therefore remove this category.

clinical1 <- clinical[which(!clinical$category_id==0), c(2,3)]

As we did previously, we will assign names to the respective category ids to make the graph more informative.

category_id <- c(1:5)
category_id_names <- c("Reproductive and Child Health", "Communicable Diseases", "Chronic", "Minor Ailments", "Others")
lapply(1:5,FUN = function(i){clinical1$category_id[clinical1$category_id == category_id[i]] <<- category_id_names[i]})

## [[1]]
## [1] "Reproductive and Child Health"
## 
## [[2]]
## [1] "Communicable Diseases"
## 
## [[3]]
## [1] "Chronic"
## 
## [[4]]
## [1] "Minor Ailments"
## 
## [[5]]
## [1] "Others"

To get details about the gender of the patients, we need to merge the clinical data set and the master data set. First we load the dplyr package.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We then categorically pick the required columns from the master data set.

master9 <- master1[,c(1,5)]

We then inner join both the data sets on the beneficiary id column.

Joined1 <- inner_join(clinical1, master9, by="beneficiary_id")

We now select a subset comprising of ‘M’ and ‘F’.

Joined2 <- subset(Joined1,Joined1$gender=="M" | Joined1$gender=="F")

We assign ‘Male’ to ‘M’ and ‘Female’ to ‘F’.

Joined2$gender[Joined2$gender=="M"] <- c("Male")
Joined2$gender[Joined2$gender=="F"] <- c("Female")

We finally construct a multiple bar diagram.

Joined3 <- table(Joined2$gender, Joined2$category_id)
barplot(Joined3, beside = TRUE, legend=rownames(Joined3), ylab = "Number of Registrations", col = c("Blue", "Orange"), las=3)

We can observe from the data that females are more prone to diseases and ailments.

Our final task for the project is to check the count of people having blood pressure belonging to the respective intervals. We will use the vital data set to get the details of the systolic and diastolic blood pressure. We first work on the systolic blood pressure.

plot(vital$bp_systolic)

There are some values which do not belong to concentrated region of points. With the purpose of not including these values, we slightly filter our data set. The purpose of doing this is to let the histogram focus on areas where a large majority of values belong to.

vital1 <- vital[which(vital$bp_systolic>=80 & vital$bp_systolic<=180),c(2,8)]

We then construct a histogram from the given information.

ggplot(vital1, aes(x=bp_systolic)) + geom_histogram(binwidth = 8, color="Red", fill="Blue") + labs(x="Systolic Blood Pressure", y="Number of Patients")

We perform the same task for diastolic blood pressure.

plot(vital$bp_diastolic)

vital2 <- vital[which(vital$bp_diastolic>=50 & vital$bp_diastolic<=120),c(2,9)]
ggplot(vital2, aes(x=bp_diastolic)) + geom_histogram(binwidth = 8, color="Red", fill="Blue") + labs(x="Diastolic Blood Pressure", y="Number of Patients") + scale_y_continuous(limits = c(0,500000))

Data Visualization: Representation of District Wise People Data

Salil Suman Meher

10/07/2020