Medicare is a medical service provider for free or at lower cost for Australian citizens. It also pays Medicare benefit on behalf of the Deaprtment of Health.1
If you ask any Australian they would say Medicare is the cornerstone of being proud Aussies! I’d say this is debatable and rest this case for now.
The aim of this project is to learn Data Exploration through R language. I believe as a beginner the best way to learn the language is to find dataset that means something to you.
Data source - data.gov.au
library(tidyverse)
library(forcats)
library(lubridate)
library(DT)
library(ggmap)
library(rgdal)
library(rgeos)
library(maptools)
library(broom)
library(viridis)
library(treemapify)
library(gridExtra)data1 <- read_csv("D:/R Datawarehouse/medicare/medicares2017.csv")Glimpse of the data structure
glimpse(data1)## Observations: 238
## Variables: 11
## $ `OFFICE TYPE` <chr> "Medicare", "Medicare", "Medic...
## $ `SITE NAME` <chr> "Armadale (WA)", "Armidale (NS...
## $ ADDRESS <chr> "40-42 William Street", "246-2...
## $ SUBURB <chr> "Armadale", "Armidale", "Newpo...
## $ STATE <chr> "WA", "NSW", "VIC", "QLD", "WA...
## $ POSTCODE <int> 6112, 2350, 3015, 4814, 6330, ...
## $ LATITUDE <dbl> -32.15567, -30.51246, -37.8413...
## $ LONGITUDE <dbl> 116.0137, 151.6619, 144.8827, ...
## $ Open <time> 08:30:00, 08:30:00, 08:30:00,...
## $ Close <chr> "16:30:00", "16:30:00", "16:30...
## $ `Closed for lunch/Office Notes` <chr> "No", "No", "No", "Yes - Irreg...
Let us issue a unique ID to each Medicare office. However before that I just want to confirm that all the SITE NAME are unique which will indicate me that they all are located separately.
data1 %>%
group_by(`SITE NAME`) %>%
summarise(n=n()) %>%
arrange(-n)## # A tibble: 237 x 2
## `SITE NAME` n
## <chr> <int>
## 1 Charlestown 2
## 2 Airport West 1
## 3 Albany 1
## 4 Albury 1
## 5 Alice Springs 1
## 6 Ararat 1
## 7 Armadale (WA) 1
## 8 Armidale (NSW) 1
## 9 Ashmore 1
## 10 Atherton 1
## # ... with 227 more rows
x <- which(duplicated(data1$`SITE NAME`))
data1 %>%
filter(`SITE NAME` == "Charlestown")## # A tibble: 2 x 11
## `OFFICE TYPE` `SITE NAME`
## <chr> <chr>
## 1 Medicare Charlestown
## 2 Medicare Charlestown
## # ... with 9 more variables: ADDRESS <chr>, SUBURB <chr>, STATE <chr>,
## # POSTCODE <int>, LATITUDE <dbl>, LONGITUDE <dbl>, Open <time>,
## # Close <chr>, `Closed for lunch/Office Notes` <chr>
Above table shows SITE NAME Charlestown has duplicate value. Upon further inspection they seem to be in different place. So its safe issue a unique ID to each of them.
data1 <- data1 %>%
mutate(ID = sprintf("ID%03d", row_number())) %>%
select(ID, everything())
head(data1, 2)## # A tibble: 2 x 12
## ID `OFFICE TYPE` `SITE NAME` ADDRESS SUBURB STATE
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID001 Medicare Armadale (WA) 40-42 William Street Armadale WA
## 2 ID002 Medicare Armidale (NSW) 246-248 Beardy Street Armidale NSW
## # ... with 6 more variables: POSTCODE <int>, LATITUDE <dbl>,
## # LONGITUDE <dbl>, Open <time>, Close <chr>, `Closed for lunch/Office
## # Notes` <chr>
There are 238 rows and 12 columns in the data.
Below are the variable names:
datatable(data.frame(names(data1)), colnames = "Variables", options = list(
pageLength =12
))First of all, let us rename the variables OFFICE TYPE, SITE NAME Closed for lunch/Office Notes
data1 <- data1 %>%
rename(Office_type = `OFFICE TYPE`,
Closed_for_lunch = `Closed for lunch/Office Notes`)I can see that SITE NAME and SUBURB have similar data. Let us verify that it is true for all observations.
data1 %>%
mutate(similar = (`SITE NAME` == SUBURB)) %>%
filter(similar == FALSE) %>%
select(`SITE NAME`, SUBURB, similar)## # A tibble: 31 x 3
## `SITE NAME` SUBURB similar
## <chr> <chr> <lgl>
## 1 Armadale (WA) Armadale FALSE
## 2 Armidale (NSW) Armidale FALSE
## 3 Townsville Jobseekers Aitkenvale FALSE
## 4 Moreland Brunswick FALSE
## 5 Seniors & Carers Cairns Cairns FALSE
## 6 Seniors & Carers Robina Robina FALSE
## 7 Fountain Gate Narre Warren FALSE
## 8 Warwick Grove Warwick FALSE
## 9 Hervey Bay Pialba FALSE
## 10 Kawana Waters Buddina FALSE
## # ... with 21 more rows
The above table shows that there are only 31 SITE NAME which doesn’t match with its SUBURB. I will not merge these variables. I will also add variables Tot_open_hrs which is the time difference between Closing and Opening hours.
data1 <- data1 %>%
mutate(Diff_hr = (hms(Close) - hms(Open))) %>%
mutate(Tot_open_hrs = as.numeric(Diff_hr)/(60*60))We have 2 missing values in the dataset. Strange, since we didn’t have any missing values at the initial stage.
Lets see where are those missing variable
colSums(is.na(data1))## ID Office_type SITE NAME ADDRESS
## 0 0 0 0
## SUBURB STATE POSTCODE LATITUDE
## 0 0 0 0
## LONGITUDE Open Close Closed_for_lunch
## 0 0 0 0
## Diff_hr Tot_open_hrs
## 1 1
Because the varible had “Mon to Fri - Irregular Hours” the time difference between Close and Open couldn’t be calculated therefore it created NA. I have repalced “Mon to Fri - Irregular Hours” with “00:00:00”
#
# data1 %>%
# filter(is.na(Diff_hr))
#
# data1$Close <- data1$Close %>%
# gsub("Mon to Fri - Irregular Hours", format("00:00:00", "%H:%M:%S"))
#
# strptime(("20:00:00"), format = "%H:%M:%S")
#
#
# strftime("22:29:56")
#
#
# dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
# times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
# x <- paste(dates, times)
# strptime(x, "%m/%d/%y %H:%M:%S")Let us re-run the code for time difference again:
data1 <- data1 %>%
mutate(Diff_hr = (hms(Close) - hms(Open))) %>%
mutate(Tot_open_hrs = as.numeric(Diff_hr)/(60*60))There are 2 missing values. We will address this probelm later. This will not impact our result.
Let get into business of assigning the right data structure to each variables
glimpse(data1)## Observations: 238
## Variables: 14
## $ ID <chr> "ID001", "ID002", "ID003", "ID004", "ID005", ...
## $ Office_type <chr> "Medicare", "Medicare", "Medicare", "Medicare...
## $ `SITE NAME` <chr> "Armadale (WA)", "Armidale (NSW)", "Newport",...
## $ ADDRESS <chr> "40-42 William Street", "246-248 Beardy Stree...
## $ SUBURB <chr> "Armadale", "Armidale", "Newport", "Aitkenval...
## $ STATE <chr> "WA", "NSW", "VIC", "QLD", "WA", "QLD", "NT",...
## $ POSTCODE <int> 6112, 2350, 3015, 4814, 6330, 4883, 870, 4807...
## $ LATITUDE <dbl> -32.15567, -30.51246, -37.84136, -19.29713, -...
## $ LONGITUDE <dbl> 116.0137, 151.6619, 144.8827, 146.7644, 117.8...
## $ Open <time> 08:30:00, 08:30:00, 08:30:00, 08:30:00, 08:3...
## $ Close <chr> "16:30:00", "16:30:00", "16:30:00", "16:30:00...
## $ Closed_for_lunch <chr> "No", "No", "No", "Yes - Irregular Hours", "N...
## $ Diff_hr <S4: Period> 8H 0M 0S, 8H 0M 0S, 8H 0M 0S, 8H 0M 0S...
## $ Tot_open_hrs <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
Let see how many unique values are present
unique(data1$Office_type)## [1] "Medicare" "Medicare Self Service Only"
They should be changed to factor
We will leve these varibale as character
This can be changed to factor. Let see how many unique SUBURBS are present
length(unique(data1$SUBURB))## [1] 235
This can be changed to factor.Let see how many unique STATE are present
(unique(data1$STATE))## [1] "WA" "NSW" "VIC" "QLD" "NT" "TAS" "SA" "ACT"
8 State. Which is correct!
Thre is no benefit leaving POSTCODE as characters. We’ll change ti factor. But let see how many unique one are there
length(unique(data1$POSTCODE))## [1] 236
Interesting, since there 235 unique suburb but 236 post codes.
Lets see the count for each unique values
table(data1$Closed_for_lunch)##
## 12:00:00 to 14:00:00 12:30:00 to 13:30:00 No
## 1 2 226
## Yes - Irregular Hours
## 9
Ok! We have to change 12:00:00 to 14:00:00, 12:30:00 to 13:30:00 & Yes - Irregular Hours
data1$Closed_for_lunch <- stringr::str_replace_all(data1$Closed_for_lunch, "12:00:00 to 14:00:00", "YES - 1 hr")
data1$Closed_for_lunch <- stringr::str_replace_all(data1$Closed_for_lunch, "12:30:00 to 13:30:00", "YES - 1 hr")
data1$Closed_for_lunch <- stringr::str_replace_all(data1$Closed_for_lunch, "Yes - Irregular Hours", "YES - Irregular hrs")Lets us change above varibale which we marked to factor
data1 <- data1 %>%
mutate_at(vars(Office_type, STATE, SUBURB, POSTCODE, Closed_for_lunch), funs(as.factor(.)))Lets glimpse the structure of the dataset
glimpse(data1)## Observations: 238
## Variables: 14
## $ ID <chr> "ID001", "ID002", "ID003", "ID004", "ID005", ...
## $ Office_type <fctr> Medicare, Medicare, Medicare, Medicare, Medi...
## $ `SITE NAME` <chr> "Armadale (WA)", "Armidale (NSW)", "Newport",...
## $ ADDRESS <chr> "40-42 William Street", "246-248 Beardy Stree...
## $ SUBURB <fctr> Armadale, Armidale, Newport, Aitkenvale, Alb...
## $ STATE <fctr> WA, NSW, VIC, QLD, WA, QLD, NT, QLD, VIC, VI...
## $ POSTCODE <fctr> 6112, 2350, 3015, 4814, 6330, 4883, 870, 480...
## $ LATITUDE <dbl> -32.15567, -30.51246, -37.84136, -19.29713, -...
## $ LONGITUDE <dbl> 116.0137, 151.6619, 144.8827, 146.7644, 117.8...
## $ Open <time> 08:30:00, 08:30:00, 08:30:00, 08:30:00, 08:3...
## $ Close <chr> "16:30:00", "16:30:00", "16:30:00", "16:30:00...
## $ Closed_for_lunch <fctr> No, No, No, YES - Irregular hrs, No, No, No,...
## $ Diff_hr <S4: Period> 8H 0M 0S, 8H 0M 0S, 8H 0M 0S, 8H 0M 0S...
## $ Tot_open_hrs <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
Awesome! Before we move into visualisation it would be interesting to add more inforamtion to this dataset such as population of that suburb, that suburb distance from the CBD etc.
Ok I give up! I spent 2 hours findind electoral division based on post code and couldn’t find it.
# unique(data1$STATE)
state_levels <- c("NSW", "VIC","QLD","WA", "SA", "TAS", "ACT", "NT")
data1$STATE <- fct_relevel(data1$STATE, state_levels)
count_md <- data1 %>%
group_by(STATE, Office_type) %>%
summarise(Count = n()) %>%
arrange(-Count)
p1 <- count_md %>%
ggplot(aes(STATE, Count, fill = Office_type))+
geom_bar(stat = "identity")+
geom_text(aes(label = Count), vjust = -.50)+
labs(title = "Total count of all Medicare offices by State", fill = "Office Type")+
theme(legend.position = c(0.85, 0.8))
p2 <- data1 %>%
group_by(STATE) %>%
summarise(Tot_hrs = sum(Tot_open_hrs, na.rm = T)) %>%
ggplot(aes(area = Tot_hrs, fill = Tot_hrs, label = STATE))+
geom_treemap()+
geom_treemap_text(fontface = "italic", colour = "white", place = "centre",grow = TRUE)+
theme(legend.position = "bottom", legend.background = element_rect(color = "black",
fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")+
labs(fill = "Tot hrs opened in a day", title = "Total operational hrs in a day", caption = "Area of the tile proportional to total opening hours in a given day")+
scale_fill_viridis()
grid.arrange(p1,p2, nrow = 2)# data1 %>%
# filter(SUBURB %in% c("Albury","Charlestown") & STATE == "NSW")# data1 %>%
# group_by(SUBURB, STATE) %>%
# summarise(Count = n()) %>%
# arrange(-Count) %>%
# filter(Count >= 2) %>%
# ggplot(aes(SUBURB, Count))+
# geom_bar(stat="identity", aes(fill = SUBURB))+
# theme(legend.position = "none")
p1 <- data1 %>%
group_by(SUBURB, STATE) %>%
summarise(Count = n()) %>%
arrange(-Count) %>%
filter(Count >= 2) %>%
ggplot(aes(area = Count, fill = as.factor(Count), label = SUBURB))+
geom_treemap()+
geom_treemap_text(fontface = "italic", colour = "white", place = "centre",grow = TRUE)+
labs(fill = "Total Medicare offices", caption = "These are the only Suburbs located in NSW which have 2 Medicare offices out of all AU", title = "Suburbs which have two Medicare office")+
theme(legend.position = "bottom", legend.background = element_rect(color = "black",
fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")
p2 <- data1 %>%
filter(Closed_for_lunch != "No") %>%
group_by(STATE, Closed_for_lunch, SUBURB) %>%
summarise(Count = n()) %>%
ggplot(aes(area = Count, fill = Closed_for_lunch, label = SUBURB, subgroup = STATE))+
geom_treemap()+
geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
"black", fontface = "italic", min.size = 0) +
geom_treemap_text(colour = "white", place = "topleft", reflow = T)+
geom_treemap_subgroup_border()+
theme(legend.position = "bottom", legend.background = element_rect(color = "black",
fill = "grey90", size = 1, linetype = "solid"), legend.direction = "horizontal")+
labs(fill = "Closed for lunch ?", title = "List of Medicare suburb which closes for lunch")
grid.arrange(p1,p2, nrow = 2)aus_ucl <- readOGR(dsn = "1270055004_ucl_2016_aust_shape/UCL_2016_AUST.shp")
aus_ucl@data$id <- rownames(aus_ucl@data)
df_map <- tidy(aus_ucl)
oz_states <- data.frame(state=c("NSW", "NT","QLD", "SA", "TAS", "VIC", "WA", "ACT"),
Region=c("New South Wales","Northern Territory",
"Queensland","South Australia",
"Tasmania","Victoria","Western Australia", "Australian Capital Territory"))
Regions <- data1 %>%
left_join(oz_states, by = c("STATE"="state")) %>%
group_by(Region) %>%
summarise(Count = n()) %>%
arrange(-Count)
df_map <- df_map %>%
left_join(aus_ucl@data, by=c("id")) %>%
left_join(Regions, by=c("STE_NAME16"="Region"))
#########
center_map <- geocode("Australia")
lon <- 133.7751
lat <- -25.2744
g_map <- get_map(location = c(lon, lat), zoom = 4, maptype = "terrain", source = "google")
ggmap(g_map)+
geom_polygon(data = df_map, aes(long, lat, group = group, fill = Count),color = "black", size = 0.2)+
scale_fill_viridis()+
geom_point(data = data1, aes(LONGITUDE, LATITUDE), color = "red", size = .8)+
labs(x = "Longitude", y = "Latitude", title = "Total Count of Medicare by State and its GPS loactions",
caption = "Red dots represent location of Medicare by their GPS Coordinates")aus_ucl@proj4string## CRS arguments:
## +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
Projections: setting and transforming CRS in R2
The Coordinate Reference System (CRS) of spatial objects defines where they are placed on the Earth’s surface. You may have noticed ’proj4string ’in the summary of lnd above: the information that follows represents its CRS. Spatial data should always have a CRS. If no CRS information is provided, and the correct CRS is known, it can be set as follow:
proj4string(aus) <- NA_character_ # remove CRS information from lnd proj4string(aus) <- CRS(“+init=epsg:27700”) # assign a new CRS
R issues a warning when the CRS is changed. This is so the user knows that they are simply changing the CRS, not reprojecting the data. An easy way to refer to different projections is via [EPSG codes.] (http://www.epsg-registry.org/)
Let plot using base plot cmd
plot(aus_ucl)# df_map %>%
# ggplot(aes(long,lat, group = group, fill = Count))+
# geom_polygon()+
# coord_equal()+
# scale_fill_viridis(option = "A")+
# labs(x = "Longitude", y = "Latitude", title = "Total count of Medicare by state")
# unique(aus_ucl$STE_NAME16)
#
# x <- aus_ucl@data
#
#
# data1 <- data1 %>%
# mutate(STE_CODE16 = ifelse(STATE == "NSW", 1,
# ifelse(STATE == "VIC", 2,
# ifelse(STATE == "QLD", 3,
# ifelse(STATE == "SA", 4,
# ifelse(STATE =="WA", 5,
# ifelse(STATE == "TAS", 6,
# ifelse(STATE=="NT",7,8))))))))
# NOTE: Better way to write above multiple ifelse statement is by using case_when function.[3](http://dplyr.tidyverse.org/reference/case_when.html)
# unique(aus_ucl$STE_NAME16)
#
# x <- aus_ucl@data
# # head(aus_ucl@data)
# map_df <- tidy(aus_ucl)
# aus_ucl$id <- row.names(aus_ucl)
#
# dim(aus_ucl)
#
# map_df <- left_join(map_df, aus_ucl@data)
# data1$STE_CODE16 <- as.factor(data1$STE_CODE16)
#
# map_df2 <- map_df
#
# data2 <- data1 %>%
# select(c(1,2,6,15))
# names(data1)
# # gc()
# map_df2 <- left_join(map_df2, data2)
# map_df %>%
# ggplot(aes(long, lat, group = group, fill = STE_NAME16))+
# geom_polygon()+
# coord_equal()+
# geom_point(data = data1, aes(x = LONGITUDE,y = LATITUDE))
# a <- get_map(location = "Australia")
#
# ggplot(aes(x=lon, y=lat), data=a) +
# geom_blank() + coord_map("mercator") +
# annotation_raster(ggmap,
# xmin, xmax, ymin, ymax)
# aus_adm <- readOGR("AUS_adm_shp/AUS_adm1.shp")
# ```
#
# ```{r}
# plot(aus_adm)
# ```
#
# ```{r}
# aus_adm@data
# Tot_medi_count <- data1 %>%
# group_by(STATE) %>%
# summarise(Tot_medic = n())
#
# Tot_medi_count <- Tot_medi_count %>%
# mutate(ID_1 = case_when(
# STATE == "NSW" ~ 5,
# STATE == "VIC" ~ 10,
# STATE == "QLD" ~ 7,
# STATE == "WA" ~ 11,
# STATE == "SA" ~ 8,
# STATE == "TAS" ~ 9,
# STATE == "ACT" ~ 2,
# STATE == "NT" ~ 6
# )) %>%
# mutate_at(vars(ID_1), funs(as.factor(.)))
#
# aus_adm@data <- left_join(aus_adm@data, Tot_medi_count)
#
# aus_adm_df <- tidy(aus_adm)
#
# aus_adm$id <- row.names(aus_adm)
#
# aus_adm_df <- left_join(aus_adm_df, aus_adm@data)
#
# centroid.df <- as.data.frame(coordinates(aus_adm))
# names(centroid.df) <- c("Longitude", "Latitude")
#
# aus_adm_df %>%
# ggplot(aes(long, lat, na.rm = F, group = group, fill = as.factor(Tot_medic)))+
# geom_polygon()+
# coord_equal()+
# scale_fill_viridis(discrete = T)+
# geom_text(data = centroid.df, aes(label = Longitude, x = Longitude, y = Latitude))