The following packages will be required or may come in handy for this task.
# readxl package functions are used to read Excel files
library(readxl)
# overcomes the cumbersome nested coding typically required to join multiple functions
library(magrittr)
# Pipes are really powerful when paired with compatible packages such as dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create 'Formattable' Data Structures
library(formattable)
# imputation of the mean, median and mode
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# Tidy messy data
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
# calculation of the Mahalanobis distance
library(MVN)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## sROC 0.1-2 loaded
# for the lattice histogram() function
library(lattice)
Data analysis is the process of collecting, transforming, cleaning, and modelling data with the goal of discovering the required information. Although data modelling, prediction, and statistical inference are vital in any statistical analysis, the quality of the output always depends on how much of the noise in the data we can remove through preprocessing. A typical data preprocessing workflow usually defines five major tasks; however, I do not necessarily follow that order of tasks in this report.
For the data preprocessing, I downloaded four health record datasets, which include 20 unique variables, from the Australian Institute of Health and Welfare website.
The first step in any data preprocessing task is to “get” the data. As the data came as Excel spreadsheets, I used the readxl package to read them in. Because the worksheets contain notes at the beginning, I had to read in data that starts further down in each worksheet. In addition, there were some empty columns, which I had to drop from the data tables.
According to the variable definitions given in the four data sets, the data types of most variables needed to be corrected. I converted them into the proper data types and applied more customised column formatting.
I identified missing values in the data sets and dealt with them on a case-by-case basis. I also tried a multivariate outlier detection method to scan the numeric data for outliers, but it was not successful, so I used a univariate box plot approach instead and handled the outliers using capping.
Then I joined the three tables together, checked whether anything further was needed for the final output, and realised that the data did not need to be reshaped into any other format.
Moreover, I checked the skewness of the Total_beds variable using a histogram; it was right-skewed, so I applied a logarithmic transformation to reshape its distribution and reduce the skewness.
The tidy version of the data extracted from the original source, named “Tidy”, is now ready for the study.
Data and their sources
As a key component of Australia's comprehensive system of health services, hospitals make a great contribution to national health outcomes (Victorian Auditor-General's Report, 2016). With the growing density of the healthcare system, patients now have a significant array of choices when it comes to receiving hospital care, including the option of selecting which type of hospital they wish to use from a long list. Inequality between hospitals in Australia has been an ongoing concern, partly because of differing economies of scale. Therefore, it is important to create a tidy data set about hospitals to help people choose one hospital over another.
The Australian Institute of Health and Welfare (AIHW) is committed to providing statistical information that can be used to promote discussion and make decisions on health and welfare services (AIHW website). Therefore, for my study I downloaded four datasets containing different health records from the AIHW website, https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients.
Import data
Before running any commands, I checked the current working directory with the getwd() function. All subsequent commands are run under the working directory “E:/5. MasterOfAnalytics_RMIT/4. MATH2349 - Data Wrangling/2. Assignments/3. Assignment2”, which is the result of getwd().
The datasets are stored in the .xlsx format. Unlike the xlsx package, the readxl package has no external dependencies (such as Java or Perl), so I used it to import the Excel data. The three worksheets contain notes at the beginning, so I included the skip argument to read in data that starts 12, 13, and 23 rows down in the respective worksheets. R warned that there are a number of empty columns in these data sets (6 empty columns in “ALOS”, 4 in “NWAU”, and 1 in each of the “Admission” and “Specialised” tables); I dropped those columns by their column index numbers, with the minus sign indicating that the variables are to be dropped.
Moreover, I found nearly 4,000 “Total” rows in the Category variable of the Admission dataset, which needed to be deleted.
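As a quick sanity check (my own addition, assuming the Admission table has been read in as shown in the Import data chunk below), the number of such rows can be counted directly:
# Count the "Total" rows in the Category column of the Admission table
sum(Admission$Category == "Total", na.rm = TRUE)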
Variables in the datasets
We can view the headers of the data sets using the head() function. Variable names and short descriptions are given below:
Merge three data sets
These datasets cannot be merged before they are cleaned. Therefore, this step is shifted to the section “Tidy & Manipulate Data II”.
# --------Import data
# Finding the current working directory
getwd()
## [1] "E:/5. MasterOfAnalytics_RMIT/4. MATH2349 - Data Wrangling/2. Assignments/3. Assignment2"
# read "average-length-of-stay-multilevel-data" worksheet by skipping the first 12 rows and assigned a table name as "ALOS"
library(readxl)
ALOS<- read_excel("average-length-of-stay-multilevel-data.xlsx", sheet = "Average length of stay", skip = 12)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30023 / R30023C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30024 / R30024C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30025 / R30025C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30026 / R30026C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30027 / R30027C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30028 / R30028C15: got '‡'
## New names:
## * `` -> ...9
## * `` -> ...11
## * `` -> ...13
## * `` -> ...15
## * `` -> ...17
## * ...
# read "udm-cost-acute-admitted-patients-data" worksheet by skipping the first 13 rows and assigned a table name as "NWAU"
NWAU <- read_excel("udm-cost-acute-admitted-patients-data.xlsx", sheet = "Cost per NWAU", skip = 13)
## New names:
## * `` -> ...7
## * `` -> ...9
## * `` -> ...11
## * `` -> ...13
# read "udm-patient-admission-data" worksheet by skipping the first 23 rows and assigned a table name as "Admission"
Admission <- read_excel("udm-patient-admission-data.xlsx", sheet = "Patient admissions", skip = 23)
## New names:
## * `` -> ...7
# --------Control data
# Drop hidden empty columns by column index numbers
ALOS1 <- ALOS[ -c(9,11,13,15,17,19) ]
NWAU1 <- NWAU[ -c(7,9,11,13) ]
Admission1_1 <- Admission[ -c(7) ]
# Delete "total" rows in Category variable of Admission
str(Admission1_1)
## tibble [47,364 x 6] (S3: tbl_df/tbl/data.frame)
## $ Reporting unit : chr [1:47364] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
## $ Reporting unit type : chr [1:47364] "Hospital" "Hospital" "Hospital" "Hospital" ...
## $ State : chr [1:47364] "NSW" "NSW" "NSW" "NSW" ...
## $ Time period : chr [1:47364] "2011–12" "2012–13" "2013–14" "2014–15" ...
## $ Category : chr [1:47364] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
## $ Number of patient admissions: chr [1:47364] "0" "0" "0" "0" ...
Admission1<-Admission1_1[!(Admission1_1$Category=="Total"),]
str(Admission1)
## tibble [43,417 x 6] (S3: tbl_df/tbl/data.frame)
## $ Reporting unit : chr [1:43417] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
## $ Reporting unit type : chr [1:43417] "Hospital" "Hospital" "Hospital" "Hospital" ...
## $ State : chr [1:43417] "NSW" "NSW" "NSW" "NSW" ...
## $ Time period : chr [1:43417] "2011–12" "2012–13" "2013–14" "2014–15" ...
## $ Category : chr [1:43417] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
## $ Number of patient admissions: chr [1:43417] "0" "0" "0" "0" ...
# view headers
head(ALOS1)
head(NWAU1)
head(Admission1)
Understanding the data structures and variable types in the data sets is also crucial for conducting data preprocessing.
Display the structure; the structure of R objects can be displayed compactly using the str() function. The str() output reveals how R guesses the data type of each variable.
Proper data type conversions; according to the variable definitions given in the four data sets, the data types of most variables needed to be corrected. I used the transmute() function, which adds new variables with the correct data types and drops the existing ones.
More customised column formatting; although the data types are now correct, some variables needed additional formatting, such as currency and percentages, using the formattable package, which resembles conditional formatting in Microsoft Excel.
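As a minimal illustration of these formatters (the numbers below are made up, not taken from the data), percent() and currency() keep the underlying numeric value but change how it is printed:
# formattable examples with made-up values
library(formattable)
percent(0.83)   # prints as 83.00%
currency(4700)  # prints as $4,700.00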
# This is the R chunk for the Understand Section
# display the structure
str(ALOS1)
## tibble [30,021 x 13] (S3: tbl_df/tbl/data.frame)
## $ Reporting unit : chr [1:30021] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
## $ Reporting unit type : chr [1:30021] "Hospital" "Hospital" "Hospital" "Hospital" ...
## $ State : chr [1:30021] "NSW" "NSW" "NSW" "NSW" ...
## $ Local Hospital Network (LHN) : chr [1:30021] "Albury Wodonga Health" "Albury Wodonga Health" "Albury Wodonga Health" "Albury Wodonga Health" ...
## $ Peer group : chr [1:30021] "Large hospitals" "Large hospitals" "Large hospitals" "Large hospitals" ...
## $ Time period : chr [1:30021] "2011–12" "2012–13" "2013–14" "2014–15" ...
## $ Category : chr [1:30021] "Cellulitis" "Cellulitis" "Cellulitis" "Cellulitis" ...
## $ Total number of stays : chr [1:30021] "111" "116" "141" "155" ...
## $ Number of overnight stays : chr [1:30021] "92" "98" "115" "123" ...
## $ Percentage of overnight stays : chr [1:30021] "0.83" "0.84" "0.82" "0.79" ...
## $ Average length of stay (days) : chr [1:30021] "3.9" "3.3" "3.1" "2.5" ...
## $ Peer group average (days) : chr [1:30021] "3.7" "3.5" "3.3" "3.2" ...
## $ Total overnight patient bed days: chr [1:30021] "356" "326" "351" "306" ...
str(NWAU1)
## tibble [388 x 9] (S3: tbl_df/tbl/data.frame)
## $ Reporting unit : chr [1:388] "Armidale Hospital" "Armidale Hospital" "Armidale Hospital" "Auburn Hospital" ...
## $ Reporting unit type : chr [1:388] "Hospital" "Hospital" "Hospital" "Hospital" ...
## $ State : chr [1:388] "NSW" "NSW" "NSW" "NSW" ...
## $ Peer group : chr [1:388] "Medium regional hospitals" "Medium regional hospitals" "Medium regional hospitals" "Medium metropolitan hospitals" ...
## $ Time period : chr [1:388] "2012–13" "2013–14" "2014–15" "2012–13" ...
## $ Number of NWAU : chr [1:388] "6177" "6532" "6729" "NP" ...
## $ Cost per NWAU, constant prices (adjusted for inflation) : chr [1:388] "4700" "4700" "4300" "NP" ...
## $ Peer average cost per NWAU, constant prices (adjusted for inflation): num [1:388] 5100 4940 4950 4300 4200 4080 4460 4440 4460 4450 ...
## $ Percentage of private patients : chr [1:388] "0.17" "0.19" "0.16" "NP" ...
str(Admission1)
## tibble [43,417 x 6] (S3: tbl_df/tbl/data.frame)
## $ Reporting unit : chr [1:43417] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
## $ Reporting unit type : chr [1:43417] "Hospital" "Hospital" "Hospital" "Hospital" ...
## $ State : chr [1:43417] "NSW" "NSW" "NSW" "NSW" ...
## $ Time period : chr [1:43417] "2011–12" "2012–13" "2013–14" "2014–15" ...
## $ Category : chr [1:43417] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
## $ Number of patient admissions: chr [1:43417] "0" "0" "0" "0" ...
# Proper data type conversions;
library(dplyr)
library(formattable)
ALOS2 <- ALOS1 %>% transmute(H_Name = as.factor(`Reporting unit`),
H_Type = as.factor(`Reporting unit type`),
State = as.factor(State),
LHN = as.factor(`Local Hospital Network (LHN)`),
Peer_group = as.factor(`Peer group`),
Time_period = `Time period`,
Category = as.factor(Category),
Total_stays= as.double(`Total number of stays`),
overnight_stays= as.double(`Number of overnight stays`),
overnight_stays_per = percent(as.double(`Percentage of overnight stays`)),
AvgLOS = as.double(`Average length of stay (days)`),
Peer_average= as.double(`Peer group average (days)`),
Total_beds= as.double(`Total overnight patient bed days`))
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning in percent(as.double(`Percentage of overnight stays`)): NAs introduced
## by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
NWAU2 <- NWAU1 %>% transmute(H_Name = as.factor(`Reporting unit`),
H_Type = as.factor(`Reporting unit type`),
State = as.factor(State),
Peer_group = as.factor(`Peer group`),
Time_period = `Time period`,
No_NWAU= as.double(`Number of NWAU`),
Cost_NWAU= currency(as.double(`Cost per NWAU, constant prices (adjusted for inflation)`)),
Peer_Cost_NWAU= currency(as.double(`Peer average cost per NWAU, constant prices (adjusted for inflation)`)),
private_patients_per = percent(as.double(`Percentage of private patients`)))
## Warning: NAs introduced by coercion
## Warning in currency(as.double(`Cost per NWAU, constant prices (adjusted for
## inflation)`)): NAs introduced by coercion
## Warning in percent(as.double(`Percentage of private patients`)): NAs introduced
## by coercion
Admission2 <- Admission1 %>% transmute(H_Name = as.factor(`Reporting unit`),
H_Type = as.factor(`Reporting unit type`),
State = as.factor(State),
Time_period = `Time period`,
Category = as.factor(Category),
No_admissions = as.double(`Number of patient admissions`))
## Warning: NAs introduced by coercion
Identify missing values
In R, a numeric missing value is represented by NA (“not available”), while character missing values are printed as <NA> in data frames and factors.
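A small, self-contained illustration (with made-up values) of how missing values appear and how they can be tested:
# is.na() works for missing values of any type
x <- c(12, NA, 7)                 # numeric vector: the missing entry prints as NA
y <- factor(c("NSW", NA, "Vic"))  # factor: the missing level prints as <NA>
is.na(x)                          # FALSE TRUE FALSE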
According to the results of scanning the data (see the chunk below), there are missing values in seven columns of ALOS2, three columns of NWAU2, and one column of Admission2.
Dealing with missing values
The numbers of missing values identified are too large to justifiably remove those records from the analysis. There is no single good way to deal with missing data, but some ways are better than others, depending on the situation. I made the following decisions about dealing with the missing values.
Missing values in the LHN variable of the ALOS2 data table; there is no logical imputation method for this variable other than searching online to find whether a Local Hospital Network exists for those hospitals. Therefore, I replaced its 600 missing values with the placeholder “Google for LHN”.
Missing values in the Total_stays and overnight_stays variables of the ALOS2 data table, the No_NWAU variable of the NWAU2 data table, and the No_admissions variable of the Admission2 data table; I recoded the missing values in these variables with the mean of each variable, as it is reasonable to apply mean values to missing “total” values.
Missing values in the overnight_stays_per, AvgLOS, and Peer_average variables of the ALOS2 data table and the Cost_NWAU and private_patients_per variables of the NWAU2 data table; I recoded the missing values in these variables with the median of each variable, as it is not reasonable to impute values that are themselves averages with a mean.
Missing values in the Total_beds variable of the ALOS2 data table; if I assume that hospitals in the same Peer_group have similar distributions of total bed days, I can impute the missing values of Total_beds by sampling from the observed values within each Peer_group.
# --------Identify missing values
# Scan the data for missing values
colSums(is.na(ALOS2))
## H_Name H_Type State LHN
## 0 0 0 600
## Peer_group Time_period Category Total_stays
## 0 0 0 7013
## overnight_stays overnight_stays_per AvgLOS Peer_average
## 7017 10565 19770 8445
## Total_beds
## 1804
colSums(is.na(NWAU2))
## H_Name H_Type State
## 0 0 0
## Peer_group Time_period No_NWAU
## 0 0 36
## Cost_NWAU Peer_Cost_NWAU private_patients_per
## 36 0 36
colSums(is.na(Admission2))
## H_Name H_Type State Time_period Category
## 0 0 0 0 0
## No_admissions
## 4081
# --------Dealing with missing values
# (1) ALOS2 Table except Total_beds
library(Hmisc)
ALOS3 <- ALOS2 %>% transmute(H_Name,
H_Type,
State,
LHN = "Google for LHN", na.rm = TRUE,
Peer_group,
Time_period,
Category,
Total_stays = impute(ALOS2$Total_stays, fun = mean),
overnight_stays = impute(ALOS2$overnight_stays, fun = mean),
overnight_stays_per = impute(ALOS2$overnight_stays_per, fun = median),
AvgLOS = impute(ALOS2$AvgLOS, fun = median),
Peer_average = impute(ALOS2$Peer_average, fun = median),
Total_beds)
# Scan the data for missing values again
colSums(is.na(ALOS3))
## H_Name H_Type State LHN
## 0 0 0 0
## Peer_group Time_period Category Total_stays
## 0 0 0 0
## overnight_stays overnight_stays_per AvgLOS Peer_average
## 0 0 0 0
## Total_beds
## 1804
# Total_beds of ALOS2
sample_na <- function(x){
if(all(is.na(x))){ return(NA) }
vals <- x[!is.na(x)]
if(length(vals) == 1){return(vals)}
return(sample(x[!is.na(x)], size = length(x[is.na(x)]), replace = T))
}
ALOS3 <- ALOS3 %>% group_by(Peer_group) %>%
mutate(Total_beds = ifelse(is.na(Total_beds),
sample_na(Total_beds), Total_beds))
# Scan the data for missing values again
colSums(is.na(ALOS3))
## H_Name H_Type State LHN
## 0 0 0 0
## Peer_group Time_period Category Total_stays
## 0 0 0 0
## overnight_stays overnight_stays_per AvgLOS Peer_average
## 0 0 0 0
## Total_beds
## 0
# (2) NWAU2 table
NWAU3 <- NWAU2 %>% transmute(H_Name,
H_Type,
State,
Peer_group,
Time_period,
No_NWAU = impute(NWAU2$No_NWAU, fun = mean),
Cost_NWAU = impute(NWAU2$Cost_NWAU, fun = median),
Peer_Cost_NWAU,
private_patients_per = impute(NWAU2$private_patients_per, fun = median))
# Scan the data for missing values again
colSums(is.na(NWAU3))
## H_Name H_Type State
## 0 0 0
## Peer_group Time_period No_NWAU
## 0 0 0
## Cost_NWAU Peer_Cost_NWAU private_patients_per
## 0 0 0
# (3) Admission2 table
Admission3 <- Admission2 %>% transmute(H_Name,
H_Type,
State,
Time_period,
Category,
No_admissions = impute(Admission2$No_admissions, fun = mean))
# Scan the data for missing values again
colSums(is.na(Admission3))
## H_Name H_Type State Time_period Category
## 0 0 0 0 0
## No_admissions
## 0
As the inclusion of outliers can have a great impact on the analysis results, detecting and dealing with them is also a crucial part of data preprocessing.
Detecting outliers
First, I tried to detect outliers in a multivariate setting using the most commonly used distance metric, the Mahalanobis distance. As the data sets are too large, the vectors allocated for outlier detection exceeded 6 GB and the mvn() call failed (“Error: sample size must be between 3 and 5000”).
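For reference, a minimal sketch of how the Mahalanobis distance could instead be computed with base R's mahalanobis() on a random subset small enough to fit in memory; the column choice and subset size below are illustrative assumptions, not part of the original analysis:
# Mahalanobis distances on an illustrative subset of numeric ALOS3 columns
num_vars <- ALOS3 %>% ungroup() %>% dplyr::select(Total_stays, AvgLOS, Total_beds)
num_sub <- num_vars[sample(nrow(num_vars), 3000), ]
md <- mahalanobis(num_sub, center = colMeans(num_sub), cov = cov(num_sub))
cutoff <- qchisq(0.975, df = ncol(num_sub))  # common chi-square cutoff for flagging outliers
sum(md > cutoff)                             # number of flagged multivariate outliers in the subset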
Then I tried a univariate box plot approach. According to Tukey's method (a small sketch of the rule is given after this list):
• all variables of the ALOS3 and Admission3 data sets seem to have many outliers
• the variables of NWAU3 do not seem to have outliers, except for one outlier in Cost_NWAU
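A small sketch of the rule behind these box plots, applied to a single variable (the variable choice here is illustrative):
# Tukey's rule: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are treated as outliers
x <- ALOS3$AvgLOS
lower <- quantile(x, 0.25) - 1.5 * IQR(x)
upper <- quantile(x, 0.75) + 1.5 * IQR(x)
sum(x < lower | x > upper)  # how many values the rule flags as outliers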
Handling outliers
There are several ways to deal with outliers. I handled the outliers using capping (also known as winsorising), which involves replacing the outliers with the nearest values that are not outliers.
# subset numeric variables using Peer_group
#ALOS3_Large <- ALOS3 %>% filter( Peer_group == "Large hospitals" ) %>% dplyr::select(Total_stays, overnight_stays, overnight_stays_per,AvgLOS,Peer_average,Total_beds)
#ALOS3_Large <- ALOS3_Large[ -c(1) ]
# Multivariate outlier detection using Mahalanobis distance with QQ plots
# ALOS3_Large <- mvn(data = ALOS3_Large, multivariateOutlierMethod = "quan", showOutliers = TRUE)
# univariate box plot approach for ALOS3
# Limit the analysis to five Peer_group levels, as R warns about an incorrect number of dimensions (6) otherwise
ALOS4 <- ALOS3 %>% filter( Peer_group == "Large hospitals" | Peer_group == "Medium hospitals" | Peer_group == "Children's hospitals"| Peer_group == "Major hospitals"| Peer_group == "Small hospitals")
par(mfrow=c(3,2))
boxplot(ALOS4$Total_stays ~ ALOS4$Peer_group, main="Total stays by peer group", ylab = "Total stays", xlab = "peer group")
boxplot(ALOS4$overnight_stays ~ ALOS4$Peer_group, main="Overnight stays by peer group", ylab = "overnight stays", xlab = "peer group")
boxplot(ALOS4$overnight_stays_per ~ ALOS4$Peer_group, main="Percentage of overnight stays by peer group", ylab = "overnight stays per", xlab = "peer group")
boxplot(ALOS4$AvgLOS ~ ALOS4$Peer_group, main="Average length of stay by peer group", ylab = "AvgLOS", xlab = "peer group")
boxplot(ALOS4$Peer_average ~ ALOS4$Peer_group, main="Peer group average by peer group", ylab = "Peer average", xlab = "peer group")
boxplot(ALOS4$Total_beds ~ ALOS4$Peer_group, main="Total bed days by peer group", ylab = "Total_beds", xlab = "peer group")
# univariate box plot approach for NWAU3
NWAU4 <- NWAU3 %>% filter( Peer_group == "Large hospitals" | Peer_group == "Medium hospitals" | Peer_group == "Children's hospitals"| Peer_group == "Major hospitals"| Peer_group == "Small hospitals")
par(mfrow=c(2,2))
boxplot(NWAU4$No_NWAU ~ NWAU4$Peer_group, main="No_NWAU", ylab = "No_NWAU", xlab = "peer group")
boxplot(NWAU4$Cost_NWAU ~ NWAU4$Peer_group, main="Cost_NWAU", ylab = "Cost_NWAU", xlab = "peer group")
boxplot(NWAU4$Peer_Cost_NWAU ~ NWAU4$Peer_group, main="Peer_Cost_NWAU", ylab = "Peer_Cost_NWAU", xlab = "peer group")
boxplot(NWAU4$private_patients_per ~ NWAU4$Peer_group, main="private_patients_per", ylab = "private_patients_per", xlab = "peer group")
# univariate box plot approach for Admission3
Admission4 <- Admission3 %>% filter( State == "ACT" | State == "NSW" | State == "SA"| State == "Qld"| State == "Vic")
par(mfrow=c(1,1))
boxplot(Admission4$No_admissions ~ Admission4$State, main="Number of admissions by state", ylab = "No_admissions", xlab = "State")
# Handle outliers using capping (winsorising): replace outliers with nearby quantile values that are not outliers
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
ALOS3$Total_stays <- ALOS3$Total_stays %>% cap()
ALOS3$overnight_stays <- ALOS3$overnight_stays %>% cap()
ALOS3$overnight_stays_per <- ALOS3$overnight_stays_per %>% cap()
ALOS3$AvgLOS <- ALOS3$AvgLOS %>% cap()
ALOS3$Peer_average <- ALOS3$Peer_average %>% cap()
ALOS3$Total_beds <- ALOS3$Total_beds %>% cap()
NWAU3$Cost_NWAU <- NWAU3$Cost_NWAU %>% cap()
Admission3$No_admissions <- Admission3$No_admissions %>% cap()
I needed to create a variable called “temp” in the three data sets by concatenating the H_Name and Time_period variables, to serve as a common key for joining the three tables.
Then I joined the tables using inner joins. I used this join type because I did not want my tidy data set to contain empty records.
# creating a common variable
ALOS5 <- mutate(ALOS3,
temp = paste(H_Name, Time_period, sep=" / "))
NWAU5 <- mutate(NWAU3,
temp = paste(H_Name, Time_period, sep=" / "))
Admission5 <- mutate(Admission3,
temp = paste(H_Name, Time_period, sep=" / "))
# join tables
join_1 <- ALOS5 %>% inner_join(NWAU5, by = "temp")
join_2 <- join_1 %>% inner_join(Admission5, by = "temp")
join_3 <- join_2 %>% transmute(H_Name = H_Name.x,
H_Type = H_Type.x,
State = State.x,
LHN,
Peer_group = Peer_group.x,
Time_period = Time_period.x,
Category = Category.x,
Total_stays,
overnight_stays,
overnight_stays_per,
AvgLOS,
Peer_average,
Total_beds,
No_NWAU,
Cost_NWAU,
Peer_Cost_NWAU,
private_patients_per,
No_admissions )
According to the tidy data principles and the purpose of my study, the join_3 data set is now tidy and does not need to be reshaped into any other format.
When I tried to detect outliers using z-scores, I found that most of the variables are right-skewed.
When I checked the skewness of the Total_beds variable using a histogram, it was right-skewed. Therefore, I used a logarithmic transformation to reshape the distribution of that variable and reduce the skewness.
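To complement the histogram, here is a quick numeric check of the skewness using a hand-rolled sample-skewness helper (my own function, not from a package):
# sample skewness = mean of cubed z-scores; non-finite values (e.g. log10(0)) are dropped
skewness <- function(x){ x <- x[is.finite(x)]; mean(((x - mean(x)) / sd(x))^3) }
skewness(join_3$Total_beds)        # expected to be positive (right-skewed)
skewness(log10(join_3$Total_beds)) # expected to be much closer to 0 after the transformation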
# histogram to check the distribution of Total_beds of the data set join_3
transform <- hist(join_3$Total_beds)
# logarithmic transformation
log_transform <- log10(join_3$Total_beds)
hist(log_transform)
# Now, the tidy version of the data extracted from the original source is ready for the study
Tidy <- join_3
head(Tidy)
Original data: Australian Institute of Health and Welfare, admitted patients data (MyHospitals), viewed 04 May 2020, https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients.
References for modified direct quotes:
Victorian Auditor-General's Report 2016, Hospital Performance: Length of Stay, February 2016, viewed 04 May 2020, https://www.audit.vic.gov.au/sites/default/files/20160210-Length-of-Stay.pdf.
Slyter, K 2018, Types of Hospitals: Your Go-to Guide for Deciphering the Differences, Rasmussen College, 19 November 2018, viewed 04 May 2020, https://www.rasmussen.edu/degrees/health-sciences/blog/types-of-hospitals/.