1 Introduction

Medicare is a medical service provider for free or at lower cost for Australian citizens. It also pays Medicare benefit on behalf of the Deaprtment of Health.1

If you ask any Australian they would say Medicare is the cornerstone of being proud Aussies! I’d say this is debatable and rest this case for now.

The aim of this project is to learn Data Exploration through R language. I believe as a beginner the best way to learn the language is to find dataset that means something to you.

Data source - data.gov.au

2 Data Wrangling

2.1 Library

library(tidyverse)
library(forcats)
library(lubridate)
library(DT)

library(ggmap)
library(rgdal)
library(rgeos)
library(maptools)
library(broom)
library(viridis)
library(treemapify)
library(gridExtra)

2.2 Load data

data1 <- read_csv("D:/R Datawarehouse/medicare/medicares2017.csv")

Glimpse of the data structure

glimpse(data1)
## Observations: 238
## Variables: 11
## $ `OFFICE TYPE`                   <chr> "Medicare", "Medicare", "Medic...
## $ `SITE NAME`                     <chr> "Armadale (WA)", "Armidale (NS...
## $ ADDRESS                         <chr> "40-42 William Street", "246-2...
## $ SUBURB                          <chr> "Armadale", "Armidale", "Newpo...
## $ STATE                           <chr> "WA", "NSW", "VIC", "QLD", "WA...
## $ POSTCODE                        <int> 6112, 2350, 3015, 4814, 6330, ...
## $ LATITUDE                        <dbl> -32.15567, -30.51246, -37.8413...
## $ LONGITUDE                       <dbl> 116.0137, 151.6619, 144.8827, ...
## $ Open                            <time> 08:30:00, 08:30:00, 08:30:00,...
## $ Close                           <chr> "16:30:00", "16:30:00", "16:30...
## $ `Closed for lunch/Office Notes` <chr> "No", "No", "No", "Yes - Irreg...

Let us issue a unique ID to each Medicare office. However before that I just want to confirm that all the SITE NAME are unique which will indicate me that they all are located separately.

data1 %>% 
  group_by(`SITE NAME`) %>% 
  summarise(n=n()) %>% 
  arrange(-n)
## # A tibble: 237 x 2
##       `SITE NAME`     n
##             <chr> <int>
##  1    Charlestown     2
##  2   Airport West     1
##  3         Albany     1
##  4         Albury     1
##  5  Alice Springs     1
##  6         Ararat     1
##  7  Armadale (WA)     1
##  8 Armidale (NSW)     1
##  9        Ashmore     1
## 10       Atherton     1
## # ... with 227 more rows
x <- which(duplicated(data1$`SITE NAME`))

data1 %>% 
  filter(`SITE NAME` == "Charlestown")
## # A tibble: 2 x 11
##   `OFFICE TYPE` `SITE NAME`
##           <chr>       <chr>
## 1      Medicare Charlestown
## 2      Medicare Charlestown
## # ... with 9 more variables: ADDRESS <chr>, SUBURB <chr>, STATE <chr>,
## #   POSTCODE <int>, LATITUDE <dbl>, LONGITUDE <dbl>, Open <time>,
## #   Close <chr>, `Closed for lunch/Office Notes` <chr>

Above table shows SITE NAME Charlestown has duplicate value. Upon further inspection they seem to be in different place. So its safe issue a unique ID to each of them.

2.3 Issue unique ID to each Medicare office

data1 <- data1 %>% 
  mutate(ID = sprintf("ID%03d", row_number())) %>% 
  select(ID, everything())

head(data1, 2)
## # A tibble: 2 x 12
##      ID `OFFICE TYPE`    `SITE NAME`               ADDRESS   SUBURB STATE
##   <chr>         <chr>          <chr>                 <chr>    <chr> <chr>
## 1 ID001      Medicare  Armadale (WA)  40-42 William Street Armadale    WA
## 2 ID002      Medicare Armidale (NSW) 246-248 Beardy Street Armidale   NSW
## # ... with 6 more variables: POSTCODE <int>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, Open <time>, Close <chr>, `Closed for lunch/Office
## #   Notes` <chr>

3 Data structure and content

3.1 Variable names

There are 238 rows and 12 columns in the data.

Below are the variable names:

datatable(data.frame(names(data1)), colnames = "Variables", options = list(
  pageLength =12
))

3.2 Rename variables

First of all, let us rename the variables OFFICE TYPE, SITE NAME Closed for lunch/Office Notes

data1 <- data1 %>% 
  rename(Office_type = `OFFICE TYPE`,
         Closed_for_lunch = `Closed for lunch/Office Notes`)

I can see that SITE NAME and SUBURB have similar data. Let us verify that it is true for all observations.

data1 %>% 
  mutate(similar = (`SITE NAME` == SUBURB)) %>% 
  filter(similar == FALSE) %>% 
  select(`SITE NAME`, SUBURB, similar)
## # A tibble: 31 x 3
##                `SITE NAME`       SUBURB similar
##                      <chr>        <chr>   <lgl>
##  1           Armadale (WA)     Armadale   FALSE
##  2          Armidale (NSW)     Armidale   FALSE
##  3   Townsville Jobseekers   Aitkenvale   FALSE
##  4                Moreland    Brunswick   FALSE
##  5 Seniors & Carers Cairns       Cairns   FALSE
##  6 Seniors & Carers Robina       Robina   FALSE
##  7           Fountain Gate Narre Warren   FALSE
##  8           Warwick Grove      Warwick   FALSE
##  9              Hervey Bay       Pialba   FALSE
## 10           Kawana Waters      Buddina   FALSE
## # ... with 21 more rows

The above table shows that there are only 31 SITE NAME which doesn’t match with its SUBURB. I will not merge these variables. I will also add variables Tot_open_hrs which is the time difference between Closing and Opening hours.

data1 <- data1 %>% 
  mutate(Diff_hr = (hms(Close) - hms(Open))) %>% 
  mutate(Tot_open_hrs = as.numeric(Diff_hr)/(60*60))

3.3 Missing values

We have 2 missing values in the dataset. Strange, since we didn’t have any missing values at the initial stage.

Lets see where are those missing variable

colSums(is.na(data1))
##               ID      Office_type        SITE NAME          ADDRESS 
##                0                0                0                0 
##           SUBURB            STATE         POSTCODE         LATITUDE 
##                0                0                0                0 
##        LONGITUDE             Open            Close Closed_for_lunch 
##                0                0                0                0 
##          Diff_hr     Tot_open_hrs 
##                1                1

Because the varible had “Mon to Fri - Irregular Hours” the time difference between Close and Open couldn’t be calculated therefore it created NA. I have repalced “Mon to Fri - Irregular Hours” with “00:00:00”

# 
# data1 %>% 
#   filter(is.na(Diff_hr))
# 
# data1$Close <- data1$Close %>% 
#   gsub("Mon to Fri - Irregular Hours", format("00:00:00", "%H:%M:%S"))
# 
# strptime(("20:00:00"), format = "%H:%M:%S")
# 
# 
# strftime("22:29:56")
# 
# 
# dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
# times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
# x <- paste(dates, times)
# strptime(x, "%m/%d/%y %H:%M:%S")

Let us re-run the code for time difference again:

data1 <- data1 %>% 
  mutate(Diff_hr = (hms(Close) - hms(Open))) %>% 
  mutate(Tot_open_hrs = as.numeric(Diff_hr)/(60*60))

There are 2 missing values. We will address this probelm later. This will not impact our result.

3.4 Changing data types

Let get into business of assigning the right data structure to each variables

glimpse(data1)
## Observations: 238
## Variables: 14
## $ ID               <chr> "ID001", "ID002", "ID003", "ID004", "ID005", ...
## $ Office_type      <chr> "Medicare", "Medicare", "Medicare", "Medicare...
## $ `SITE NAME`      <chr> "Armadale (WA)", "Armidale (NSW)", "Newport",...
## $ ADDRESS          <chr> "40-42 William Street", "246-248 Beardy Stree...
## $ SUBURB           <chr> "Armadale", "Armidale", "Newport", "Aitkenval...
## $ STATE            <chr> "WA", "NSW", "VIC", "QLD", "WA", "QLD", "NT",...
## $ POSTCODE         <int> 6112, 2350, 3015, 4814, 6330, 4883, 870, 4807...
## $ LATITUDE         <dbl> -32.15567, -30.51246, -37.84136, -19.29713, -...
## $ LONGITUDE        <dbl> 116.0137, 151.6619, 144.8827, 146.7644, 117.8...
## $ Open             <time> 08:30:00, 08:30:00, 08:30:00, 08:30:00, 08:3...
## $ Close            <chr> "16:30:00", "16:30:00", "16:30:00", "16:30:00...
## $ Closed_for_lunch <chr> "No", "No", "No", "Yes - Irregular Hours", "N...
## $ Diff_hr          <S4: Period> 8H 0M 0S, 8H 0M 0S, 8H 0M 0S, 8H 0M 0S...
## $ Tot_open_hrs     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
  1. Office_type

Let see how many unique values are present

unique(data1$Office_type)
## [1] "Medicare"                   "Medicare Self Service Only"

They should be changed to factor

  1. SITE NAME, ADDRESS

We will leve these varibale as character

  1. SUBURB

This can be changed to factor. Let see how many unique SUBURBS are present

length(unique(data1$SUBURB))
## [1] 235
  1. STATE

This can be changed to factor.Let see how many unique STATE are present

(unique(data1$STATE))
## [1] "WA"  "NSW" "VIC" "QLD" "NT"  "TAS" "SA"  "ACT"

8 State. Which is correct!

  1. POSTCODE

Thre is no benefit leaving POSTCODE as characters. We’ll change ti factor. But let see how many unique one are there

length(unique(data1$POSTCODE))
## [1] 236

Interesting, since there 235 unique suburb but 236 post codes.

  1. Closed_for_lunch

Lets see the count for each unique values

table(data1$Closed_for_lunch)
## 
##  12:00:00 to 14:00:00  12:30:00 to 13:30:00                    No 
##                     1                     2                   226 
## Yes - Irregular Hours 
##                     9

Ok! We have to change 12:00:00 to 14:00:00, 12:30:00 to 13:30:00 & Yes - Irregular Hours

data1$Closed_for_lunch <- stringr::str_replace_all(data1$Closed_for_lunch, "12:00:00 to 14:00:00", "YES - 1 hr")

data1$Closed_for_lunch <- stringr::str_replace_all(data1$Closed_for_lunch, "12:30:00 to 13:30:00", "YES - 1 hr")

data1$Closed_for_lunch <-  stringr::str_replace_all(data1$Closed_for_lunch, "Yes - Irregular Hours", "YES - Irregular hrs")

Lets us change above varibale which we marked to factor

data1 <- data1 %>% 
  mutate_at(vars(Office_type, STATE, SUBURB, POSTCODE, Closed_for_lunch), funs(as.factor(.)))

Lets glimpse the structure of the dataset

glimpse(data1)
## Observations: 238
## Variables: 14
## $ ID               <chr> "ID001", "ID002", "ID003", "ID004", "ID005", ...
## $ Office_type      <fctr> Medicare, Medicare, Medicare, Medicare, Medi...
## $ `SITE NAME`      <chr> "Armadale (WA)", "Armidale (NSW)", "Newport",...
## $ ADDRESS          <chr> "40-42 William Street", "246-248 Beardy Stree...
## $ SUBURB           <fctr> Armadale, Armidale, Newport, Aitkenvale, Alb...
## $ STATE            <fctr> WA, NSW, VIC, QLD, WA, QLD, NT, QLD, VIC, VI...
## $ POSTCODE         <fctr> 6112, 2350, 3015, 4814, 6330, 4883, 870, 480...
## $ LATITUDE         <dbl> -32.15567, -30.51246, -37.84136, -19.29713, -...
## $ LONGITUDE        <dbl> 116.0137, 151.6619, 144.8827, 146.7644, 117.8...
## $ Open             <time> 08:30:00, 08:30:00, 08:30:00, 08:30:00, 08:3...
## $ Close            <chr> "16:30:00", "16:30:00", "16:30:00", "16:30:00...
## $ Closed_for_lunch <fctr> No, No, No, YES - Irregular hrs, No, No, No,...
## $ Diff_hr          <S4: Period> 8H 0M 0S, 8H 0M 0S, 8H 0M 0S, 8H 0M 0S...
## $ Tot_open_hrs     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...

Awesome! Before we move into visualisation it would be interesting to add more inforamtion to this dataset such as population of that suburb, that suburb distance from the CBD etc.

Ok I give up! I spent 2 hours findind electoral division based on post code and couldn’t find it.

4 Visualisation

# unique(data1$STATE)
state_levels <- c("NSW", "VIC","QLD","WA", "SA", "TAS", "ACT", "NT")

data1$STATE <- fct_relevel(data1$STATE, state_levels)

count_md <- data1 %>% 
  group_by(STATE, Office_type) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count)

p1 <- count_md %>% 
  ggplot(aes(STATE, Count, fill = Office_type))+
  geom_bar(stat = "identity")+
  geom_text(aes(label = Count), vjust = -.50)+
  labs(title = "Total count of all Medicare offices by State", fill = "Office Type")+
  theme(legend.position = c(0.85, 0.8))
  

p2 <- data1 %>% 
  group_by(STATE) %>% 
  summarise(Tot_hrs = sum(Tot_open_hrs, na.rm = T)) %>% 
  ggplot(aes(area = Tot_hrs, fill = Tot_hrs, label = STATE))+
  geom_treemap()+
  geom_treemap_text(fontface = "italic", colour = "white", place = "centre",grow = TRUE)+
  theme(legend.position = "bottom", legend.background = element_rect(color = "black", 
    fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")+
  labs(fill = "Tot hrs opened in a day", title = "Total operational hrs in a day", caption = "Area of the tile proportional to total opening hours in a given day")+
  scale_fill_viridis()
  

grid.arrange(p1,p2, nrow = 2)

# data1 %>% 
#   filter(SUBURB %in% c("Albury","Charlestown")  & STATE == "NSW")
# data1 %>% 
#   group_by(SUBURB, STATE) %>% 
#   summarise(Count = n()) %>% 
#   arrange(-Count) %>% 
#   filter(Count >= 2) %>% 
#   ggplot(aes(SUBURB, Count))+
#   geom_bar(stat="identity", aes(fill = SUBURB))+
#   theme(legend.position = "none")

p1 <- data1 %>% 
  group_by(SUBURB, STATE) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count) %>% 
  filter(Count >= 2) %>% 
  ggplot(aes(area = Count, fill = as.factor(Count), label = SUBURB))+
  geom_treemap()+
  geom_treemap_text(fontface = "italic", colour = "white", place = "centre",grow = TRUE)+
  labs(fill = "Total Medicare offices", caption = "These are the only Suburbs located in NSW which have 2 Medicare offices out of all AU", title = "Suburbs which have  two Medicare office")+
  theme(legend.position = "bottom", legend.background = element_rect(color = "black", 
    fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")
  

p2 <- data1 %>% 
  filter(Closed_for_lunch != "No") %>% 
  group_by(STATE, Closed_for_lunch, SUBURB) %>% 
  summarise(Count = n()) %>% 
  ggplot(aes(area = Count, fill = Closed_for_lunch, label = SUBURB, subgroup = STATE))+
  geom_treemap()+
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
                             "black", fontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T)+
  geom_treemap_subgroup_border()+
  theme(legend.position = "bottom", legend.background = element_rect(color = "black", 
    fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")+
  labs(fill = "Closed for lunch ?", title = "List of Medicare suburb which closes for lunch")

grid.arrange(p1,p2, nrow = 2)

5 Maps

5.1 Medicare Geospatial Map

aus_ucl <- readOGR(dsn = "1270055004_ucl_2016_aust_shape/UCL_2016_AUST.shp")

aus_ucl@data$id <- rownames(aus_ucl@data)

df_map <- tidy(aus_ucl)

oz_states <- data.frame(state=c("NSW", "NT","QLD", "SA",  "TAS", "VIC", "WA", "ACT"),
                        Region=c("New South Wales","Northern Territory",
                                 "Queensland","South Australia",
                                 "Tasmania","Victoria","Western Australia", "Australian Capital Territory"))
Regions <- data1 %>%
  left_join(oz_states, by = c("STATE"="state")) %>%
  group_by(Region) %>%
  summarise(Count = n()) %>%
  arrange(-Count)

df_map <- df_map %>%
  left_join(aus_ucl@data, by=c("id")) %>%
  left_join(Regions, by=c("STE_NAME16"="Region"))


#########

center_map <- geocode("Australia")

lon <- 133.7751
lat <- -25.2744

g_map <- get_map(location = c(lon, lat), zoom = 4, maptype = "terrain", source = "google")

ggmap(g_map)+
  geom_polygon(data = df_map, aes(long, lat, group = group, fill = Count),color = "black", size = 0.2)+
  scale_fill_viridis()+
  geom_point(data = data1, aes(LONGITUDE, LATITUDE), color = "red", size = .8)+
  labs(x = "Longitude", y = "Latitude", title = "Total Count of Medicare by State and its GPS loactions",
       caption = "Red dots represent location of Medicare by their GPS Coordinates")

5.2 Check Coordinate Reference System (CRS)

aus_ucl@proj4string
## CRS arguments:
##  +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs

Projections: setting and transforming CRS in R2

The Coordinate Reference System (CRS) of spatial objects defines where they are placed on the Earth’s surface. You may have noticed ’proj4string ’in the summary of lnd above: the information that follows represents its CRS. Spatial data should always have a CRS. If no CRS information is provided, and the correct CRS is known, it can be set as follow:

proj4string(aus) <- NA_character_ # remove CRS information from lnd proj4string(aus) <- CRS(“+init=epsg:27700”) # assign a new CRS

R issues a warning when the CRS is changed. This is so the user knows that they are simply changing the CRS, not reprojecting the data. An easy way to refer to different projections is via [EPSG codes.] (http://www.epsg-registry.org/)

Let plot using base plot cmd

plot(aus_ucl)

6 Junk CMDS

# df_map %>% 
#   ggplot(aes(long,lat, group = group, fill = Count))+
#   geom_polygon()+
#   coord_equal()+
#   scale_fill_viridis(option = "A")+
#   labs(x = "Longitude", y = "Latitude", title = "Total count of Medicare by state")


# unique(aus_ucl$STE_NAME16)
# 
# x <- aus_ucl@data  
#   

# 
# data1 <- data1 %>% 
#   mutate(STE_CODE16 = ifelse(STATE == "NSW", 1,
#                              ifelse(STATE == "VIC", 2,
#                                     ifelse(STATE == "QLD", 3,
#                                            ifelse(STATE == "SA", 4,
#                                                   ifelse(STATE =="WA", 5,
#                                                          ifelse(STATE == "TAS", 6,
#                                                                 ifelse(STATE=="NT",7,8))))))))
# NOTE: Better way to write above multiple ifelse statement is by using case_when function.[3](http://dplyr.tidyverse.org/reference/case_when.html)
# unique(aus_ucl$STE_NAME16)
# 
# x <- aus_ucl@data  
#   # head(aus_ucl@data)
  
# map_df <- tidy(aus_ucl)
# aus_ucl$id <- row.names(aus_ucl)
# 
# dim(aus_ucl)
# 
# map_df <- left_join(map_df, aus_ucl@data)
# data1$STE_CODE16 <- as.factor(data1$STE_CODE16)
# 
# map_df2 <- map_df
# 
# data2 <- data1 %>% 
#   select(c(1,2,6,15))
# names(data1)
# # gc()
# map_df2 <- left_join(map_df2, data2)

# map_df %>% 
#   ggplot(aes(long, lat, group = group, fill = STE_NAME16))+
#   geom_polygon()+
#   coord_equal()+
#   geom_point(data = data1, aes(x = LONGITUDE,y =  LATITUDE))
# a <- get_map(location = "Australia")
# 
# ggplot(aes(x=lon, y=lat), data=a) +
# geom_blank() + coord_map("mercator") +
# annotation_raster(ggmap,
# xmin, xmax, ymin, ymax)

# aus_adm <- readOGR("AUS_adm_shp/AUS_adm1.shp")
# ```
# 
# ```{r}
# plot(aus_adm)
# ```
# 
# ```{r}
# aus_adm@data


# Tot_medi_count <- data1 %>% 
#   group_by(STATE) %>% 
#   summarise(Tot_medic = n())
# 
# Tot_medi_count <- Tot_medi_count %>% 
#   mutate(ID_1 = case_when(
#     STATE == "NSW" ~ 5,
#     STATE == "VIC" ~ 10,
#     STATE == "QLD" ~ 7,
#     STATE == "WA" ~ 11,
#     STATE == "SA" ~ 8,
#     STATE == "TAS" ~ 9,
#     STATE == "ACT" ~ 2,
#     STATE == "NT" ~ 6
#   )) %>% 
#   mutate_at(vars(ID_1), funs(as.factor(.)))
# 
# aus_adm@data <-  left_join(aus_adm@data, Tot_medi_count)
# 
# aus_adm_df <- tidy(aus_adm)
# 
# aus_adm$id <- row.names(aus_adm)
# 
# aus_adm_df <- left_join(aus_adm_df, aus_adm@data)
# 
# centroid.df <- as.data.frame(coordinates(aus_adm))
# names(centroid.df) <- c("Longitude", "Latitude")
# 
# aus_adm_df %>% 
#   ggplot(aes(long, lat, na.rm = F, group = group, fill = as.factor(Tot_medic)))+
#   geom_polygon()+
#   coord_equal()+
#   scale_fill_viridis(discrete = T)+
#   geom_text(data = centroid.df, aes(label = Longitude, x = Longitude, y = Latitude))