Required packages

The following packages will be required or may come in handy for this task.

# readxl package functions are used to read Excel files
library(readxl)

# magrittr pipes overcome the cumbersome nested coding typically required to chain multiple functions
library(magrittr)

# Pipes are really powerful when paired with compatible packages such as dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
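# A minimal illustration of piping with a toy vector (not the assignment data):
# the nested call round(mean(c(1, 5, 9)), 1) can be rewritten as a
# left-to-right chain
c(1, 5, 9) %>% mean() %>% round(1)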
# Create 'Formattable' Data Structures
library(formattable)

# imputation of the mean, median and mode 
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
# Tidy messy data
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
# calculation of the Mahalanobis distance
library(MVN) 
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## sROC 0.1-2 loaded
# provides the lattice histogram() function, an alternative to base hist()
library(lattice)

Executive Summary

Data analysis is a process of collecting, transforming, cleaning, and modelling data with the goal of discovering the required information. Although modelling, prediction, and statistical inference are vital in any statistical analysis, the quality of the output always depends on how far we can minimise the amount of garbage in our data through preprocessing. A typical data preprocessing process usually defines five major tasks, shown below; however, I do not necessarily follow that exact order in this report.

[Image. Source: Ultimate Funny Dog Videos Compilation 2013.]
[Figure: the major data preprocessing tasks. Source: Data Preprocessing, Dr. Anil Dolgun.]


For the data preprocessing I downloaded four health record datasets, which together include 20 unique variables, from the Australian Institute of Health and Welfare website.

The first step in any data preprocessing task is to “GET” the data. As the data came as Excel spreadsheets, I used the readxl package to read them. Because these three worksheets contain notes at the beginning, I had to read in data that starts further down in each worksheet. In addition, there were some empty columns, which I had to drop from the data tables.

According to the variable definitions given in the four data sets, the data types of most variables needed to be corrected. I converted them to the proper data types and applied more customised column formatting.

I identified missing values in the data sets and dealt with them using a case-by-case approach. I also used multivariate outlier detection methods to scan the numeric data for outliers, but this was not successful, so I used a univariate box-plot approach instead and handled the outliers using capping.

Then I joined the three tables together, checked whether anything more needed to be done with the final output, and realised that the data did not need to be reshaped into any other format.

I also checked the skewness of the Total_beds variable using a histogram; it was right-skewed, so I used a logarithmic transformation to reshape the distribution of that variable and reduce the skewness.

Now the tidy version of the data extracted from the original source, named “Tidy”, is ready for the study.

Data

Data and their sources

As a key component of Australia’s comprehensive system of health services, hospitals make a great contribution to national health outcomes (Victorian Auditor-General’s Report, 2016). With the growing density of the healthcare system, patients now have a significant array of choices when it comes to receiving hospital care, which frequently includes selecting the type of hospital they wish to use from a long list. Inequality between hospitals in Australia has been an ongoing concern, partly because of differences in how hospitals exploit economies of scale. It is therefore important to create a tidy data set about hospitals to help people choose one hospital over another.

The Australian Institute of Health and Welfare is committed to providing statistical information that can be used to promote discussion and make decisions on health and welfare services (AIHW website). Therefore, for my study I downloaded four datasets containing different health records from the AIHW website: https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients.

Import data

Before running any commands, I checked the current working directory by running the getwd() function. All subsequent commands assume the working directory “E:/5. MasterOfAnalytics_RMIT/4. MATH2349 - Data Wrangling/2. Assignments/3. Assignment2”, which was the result of getwd().

The datasets are stored in the .xlsx format. Unlike the xlsx package, the readxl package has no external dependencies (such as Java or Perl), so I use it to import the Excel data. The three worksheets contain notes at the beginning, so I include the skip argument to read in data that starts 12, 13, and 23 rows down in the worksheets respectively. R warns that there are a number of empty columns in these data sets (6 empty columns in “ALOS”, 4 in “NWAU” and 1 in the “Admission” table). I drop those columns by their column index numbers, using the minus sign to drop variables.

Moreover, I found more than 4,000 “Total” rows in the Category variable of the Admission dataset, which needed to be deleted.

Variables in the datasets

We can view the headers of the data sets using the head() function. According to the results, the variable names and short descriptions are given below:

Merge three data sets

These datasets cannot be merged before they are cleaned. Therefore, this step is deferred to the section “Tidy & Manipulate Data II”.

# --------Import data

# Finding the current working directory
getwd() 
## [1] "E:/5. MasterOfAnalytics_RMIT/4. MATH2349 - Data Wrangling/2. Assignments/3. Assignment2"
# read "average-length-of-stay-multilevel-data" worksheet by skipping the first 12 rows and assigned a table name as "ALOS"
library(readxl)
ALOS<- read_excel("average-length-of-stay-multilevel-data.xlsx", sheet = "Average length of stay", skip = 12)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30023 / R30023C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30024 / R30024C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30025 / R30025C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30026 / R30026C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30027 / R30027C15: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O30028 / R30028C15: got '‡'
## New names:
## * `` -> ...9
## * `` -> ...11
## * `` -> ...13
## * `` -> ...15
## * `` -> ...17
## * ...
# read "udm-cost-acute-admitted-patients-data" worksheet by skipping the first 13 rows and assigned a table name as "NWAU"
NWAU <- read_excel("udm-cost-acute-admitted-patients-data.xlsx", sheet = "Cost per NWAU", skip = 13)
## New names:
## * `` -> ...7
## * `` -> ...9
## * `` -> ...11
## * `` -> ...13
# read "udm-patient-admission-data" worksheet by skipping the first 23 rows and assigned a table name as "Admission"
Admission <- read_excel("udm-patient-admission-data.xlsx", sheet = "Patient admissions", skip = 23)
## New names:
## * `` -> ...7
# --------Control data

# Drop hidden empty columns by column index numbers
ALOS1 <- ALOS[ -c(9,11,13,15,17,19) ]
NWAU1 <- NWAU[ -c(7,9,11,13) ]
Admission1_1 <- Admission[ -c(7) ]

# Delete "total" rows in Category variable of Admission
str(Admission1_1)
## tibble [47,364 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Reporting unit              : chr [1:47364] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
##  $ Reporting unit type         : chr [1:47364] "Hospital" "Hospital" "Hospital" "Hospital" ...
##  $ State                       : chr [1:47364] "NSW" "NSW" "NSW" "NSW" ...
##  $ Time period                 : chr [1:47364] "2011–12" "2012–13" "2013–14" "2014–15" ...
##  $ Category                    : chr [1:47364] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
##  $ Number of patient admissions: chr [1:47364] "0" "0" "0" "0" ...
Admission1<-Admission1_1[!(Admission1_1$Category=="Total"),]
str(Admission1)
## tibble [43,417 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Reporting unit              : chr [1:43417] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
##  $ Reporting unit type         : chr [1:43417] "Hospital" "Hospital" "Hospital" "Hospital" ...
##  $ State                       : chr [1:43417] "NSW" "NSW" "NSW" "NSW" ...
##  $ Time period                 : chr [1:43417] "2011–12" "2012–13" "2013–14" "2014–15" ...
##  $ Category                    : chr [1:43417] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
##  $ Number of patient admissions: chr [1:43417] "0" "0" "0" "0" ...
# view headers 
head(ALOS1)
head(NWAU1)
head(Admission1)

Understand

Understanding the data structures and variable types in the data sets is also crucial for conducting data preprocessing.

The str() output below shows that most columns were imported as character, so I converted them to the proper data types. In addition, some variables needed more customised column formatting, such as currency and percentages, using the formattable functions percent() and currency(), which resemble conditional formatting in Microsoft Excel.
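As a quick illustration with toy values (not the assignment data), percent() and currency() wrap a numeric vector so that it prints with the corresponding formatting while remaining numeric underneath:

# toy examples of formattable's formatting wrappers
percent(c(0.83, 0.084))   # prints as 83.00% and 8.40%
currency(c(4700, 4300))   # prints as $4,700.00 and $4,300.00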

# This is the R chunk for the Understand Section

# display the structure
str(ALOS1)
## tibble [30,021 x 13] (S3: tbl_df/tbl/data.frame)
##  $ Reporting unit                  : chr [1:30021] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
##  $ Reporting unit type             : chr [1:30021] "Hospital" "Hospital" "Hospital" "Hospital" ...
##  $ State                           : chr [1:30021] "NSW" "NSW" "NSW" "NSW" ...
##  $ Local Hospital Network (LHN)    : chr [1:30021] "Albury Wodonga Health" "Albury Wodonga Health" "Albury Wodonga Health" "Albury Wodonga Health" ...
##  $ Peer group                      : chr [1:30021] "Large hospitals" "Large hospitals" "Large hospitals" "Large hospitals" ...
##  $ Time period                     : chr [1:30021] "2011–12" "2012–13" "2013–14" "2014–15" ...
##  $ Category                        : chr [1:30021] "Cellulitis" "Cellulitis" "Cellulitis" "Cellulitis" ...
##  $ Total number of stays           : chr [1:30021] "111" "116" "141" "155" ...
##  $ Number of overnight stays       : chr [1:30021] "92" "98" "115" "123" ...
##  $ Percentage of overnight stays   : chr [1:30021] "0.83" "0.84" "0.82" "0.79" ...
##  $ Average length of stay (days)   : chr [1:30021] "3.9" "3.3" "3.1" "2.5" ...
##  $ Peer group average (days)       : chr [1:30021] "3.7" "3.5" "3.3" "3.2" ...
##  $ Total overnight patient bed days: chr [1:30021] "356" "326" "351" "306" ...
str(NWAU1)
## tibble [388 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Reporting unit                                                      : chr [1:388] "Armidale Hospital" "Armidale Hospital" "Armidale Hospital" "Auburn Hospital" ...
##  $ Reporting unit type                                                 : chr [1:388] "Hospital" "Hospital" "Hospital" "Hospital" ...
##  $ State                                                               : chr [1:388] "NSW" "NSW" "NSW" "NSW" ...
##  $ Peer group                                                          : chr [1:388] "Medium regional hospitals" "Medium regional hospitals" "Medium regional hospitals" "Medium metropolitan hospitals" ...
##  $ Time period                                                         : chr [1:388] "2012–13" "2013–14" "2014–15" "2012–13" ...
##  $ Number of NWAU                                                      : chr [1:388] "6177" "6532" "6729" "NP" ...
##  $ Cost per NWAU, constant prices (adjusted for inflation)             : chr [1:388] "4700" "4700" "4300" "NP" ...
##  $ Peer average cost per NWAU, constant prices (adjusted for inflation): num [1:388] 5100 4940 4950 4300 4200 4080 4460 4440 4460 4450 ...
##  $ Percentage of private patients                                      : chr [1:388] "0.17" "0.19" "0.16" "NP" ...
str(Admission1)
## tibble [43,417 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Reporting unit              : chr [1:43417] "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" "Albury Wodonga Health [Albury Campus]" ...
##  $ Reporting unit type         : chr [1:43417] "Hospital" "Hospital" "Hospital" "Hospital" ...
##  $ State                       : chr [1:43417] "NSW" "NSW" "NSW" "NSW" ...
##  $ Time period                 : chr [1:43417] "2011–12" "2012–13" "2013–14" "2014–15" ...
##  $ Category                    : chr [1:43417] "Childbirth" "Childbirth" "Childbirth" "Childbirth" ...
##  $ Number of patient admissions: chr [1:43417] "0" "0" "0" "0" ...
# Proper data type conversions;
library(dplyr)
library(formattable)
ALOS2 <- ALOS1 %>% transmute(H_Name = as.factor(`Reporting unit`),
                             H_Type = as.factor(`Reporting unit type`),
                             State = as.factor(State),
                             LHN = as.factor(`Local Hospital Network (LHN)`),
                             Peer_group = as.factor(`Peer group`),
                             Time_period = `Time period`,
                             Category = as.factor(Category),
                             Total_stays= as.double(`Total number of stays`),
                             overnight_stays= as.double(`Number of overnight stays`),
                             overnight_stays_per = percent(as.double(`Percentage of overnight stays`)),
                             AvgLOS = as.double(`Average length of stay (days)`),
                             Peer_average= as.double(`Peer group average (days)`),
                             Total_beds= as.double(`Total overnight patient bed days`))
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
## Warning in percent(as.double(`Percentage of overnight stays`)): NAs introduced
## by coercion
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
NWAU2 <- NWAU1 %>% transmute(H_Name = as.factor(`Reporting unit`),
                             H_Type = as.factor(`Reporting unit type`),
                             State = as.factor(State),
                             Peer_group = as.factor(`Peer group`),
                             Time_period = `Time period`,
                             No_NWAU= as.double(`Number of NWAU`),
                             Cost_NWAU= currency(as.double(`Cost per NWAU, constant prices (adjusted for inflation)`)),
                             Peer_Cost_NWAU= currency(as.double(`Peer average cost per NWAU, constant prices (adjusted for inflation)`)),
                             private_patients_per = percent(as.double(`Percentage of private patients`)))
## Warning: NAs introduced by coercion
## Warning in currency(as.double(`Cost per NWAU, constant prices (adjusted for
## inflation)`)): NAs introduced by coercion
## Warning in percent(as.double(`Percentage of private patients`)): NAs introduced
## by coercion
Admission2 <- Admission1 %>% transmute(H_Name = as.factor(`Reporting unit`),
                             H_Type = as.factor(`Reporting unit type`),
                             State = as.factor(State),
                             Time_period = `Time period`,
                             Category = as.factor(Category),
                             No_admissions = as.double(`Number of patient admissions`))
## Warning: NAs introduced by coercion

Scan I

Identify missing values

In R, a numeric missing value is represented by NA (“not available”), while character missing values print as <NA>. As the data sets are very large, I cannot use the is.na() function alone to inspect each and every missing value. Instead, I identify the total number of missing values in each column by combining is.na() with colSums().

According to the results, missing values were found in 7 columns of ALOS2 (including LHN), 3 columns of NWAU2, and 1 column of Admission2.

Dealing with missing values

The identified missing values are too numerous to justifiably remove from the analysis. There is no single good way to deal with missing data, but some ways are better than others depending on the situation. The decisions I made, implemented in the code below, were: fill LHN with a placeholder value, impute the remaining numeric ALOS2 variables and the NWAU2 variables with their mean or median, fill the missing Total_beds values by randomly sampling observed values within each Peer_group, and impute No_admissions in Admission2 with its mean.

# --------Identify missing values
# Scan the data for missing values
colSums(is.na(ALOS2))
##              H_Name              H_Type               State                 LHN 
##                   0                   0                   0                 600 
##          Peer_group         Time_period            Category         Total_stays 
##                   0                   0                   0                7013 
##     overnight_stays overnight_stays_per              AvgLOS        Peer_average 
##                7017               10565               19770                8445 
##          Total_beds 
##                1804
colSums(is.na(NWAU2))
##               H_Name               H_Type                State 
##                    0                    0                    0 
##           Peer_group          Time_period              No_NWAU 
##                    0                    0                   36 
##            Cost_NWAU       Peer_Cost_NWAU private_patients_per 
##                   36                    0                   36
colSums(is.na(Admission2))
##        H_Name        H_Type         State   Time_period      Category 
##             0             0             0             0             0 
## No_admissions 
##          4081
# --------Dealing with missing values
# (1) ALOS2 Table except Total_beds
library(Hmisc)
ALOS3 <- ALOS2 %>% transmute(H_Name,
                             H_Type,
                             State,
                             LHN = "Google for LHN", na.rm = TRUE,
                             Peer_group,
                             Time_period,
                             Category,
                             Total_stays = impute(ALOS2$Total_stays, fun = mean),
                             overnight_stays = impute(ALOS2$overnight_stays, fun = mean),
                             overnight_stays_per = impute(ALOS2$overnight_stays_per, fun = median),
                             AvgLOS = impute(ALOS2$AvgLOS, fun = median),
                             Peer_average = impute(ALOS2$Peer_average, fun = median),
                             Total_beds)

# Scan the data for missing values again
colSums(is.na(ALOS3))
##              H_Name              H_Type               State                 LHN 
##                   0                   0                   0                   0 
##               na.rm          Peer_group         Time_period            Category 
##                   0                   0                   0                   0 
##         Total_stays     overnight_stays overnight_stays_per              AvgLOS 
##                   0                   0                   0                   0 
##        Peer_average          Total_beds 
##                   0                1804
# Total_beds of ALOS2 
# helper: randomly sample observed (non-NA) values to fill the NAs of x
sample_na <- function(x){
  # if every value is missing, nothing can be sampled
  if(all(is.na(x))){ return(NA) }
  vals <- x[!is.na(x)]
  # with a single observed value, reuse it
  if(length(vals) == 1){ return(vals) }
  # otherwise draw (with replacement) one value per missing entry
  return(sample(vals, size = sum(is.na(x)), replace = TRUE))
  }
ALOS3 <- ALOS3  %>% group_by(Peer_group) %>%
  mutate(Total_beds = ifelse(is.na(Total_beds),
                                 sample_na(Total_beds), Total_beds))

# Scan the data for missing values again
colSums(is.na(ALOS3))
##              H_Name              H_Type               State                 LHN 
##                   0                   0                   0                   0 
##               na.rm          Peer_group         Time_period            Category 
##                   0                   0                   0                   0 
##         Total_stays     overnight_stays overnight_stays_per              AvgLOS 
##                   0                   0                   0                   0 
##        Peer_average          Total_beds 
##                   0                   0
# (2) NWAU2 table
NWAU3 <- NWAU2 %>% transmute(H_Name,
                             H_Type,
                             State,
                             Peer_group,
                             Time_period,
                             No_NWAU = impute(NWAU2$No_NWAU, fun = mean),
                             Cost_NWAU = impute(NWAU2$Cost_NWAU, fun = median),
                             Peer_Cost_NWAU,
                             private_patients_per = impute(NWAU2$private_patients_per, fun = median))

# Scan the data for missing values again
colSums(is.na(NWAU3))
##               H_Name               H_Type                State 
##                    0                    0                    0 
##           Peer_group          Time_period              No_NWAU 
##                    0                    0                    0 
##            Cost_NWAU       Peer_Cost_NWAU private_patients_per 
##                    0                    0                    0
# (3) Admission2 table
Admission3 <- Admission2 %>% transmute(H_Name,
                             H_Type,
                             State,
                             Time_period,
                             Category,
                             No_admissions = impute(Admission2$No_admissions, fun = mean))

# Scan the data for missing values again
colSums(is.na(Admission3))
##        H_Name        H_Type         State   Time_period      Category 
##             0             0             0             0             0 
## No_admissions 
##             0

Scan II

As the inclusion of outliers can have a great impact on analysis results, detecting and dealing with them is also a crucial part of data preprocessing.

Detecting outliers

First I attempted to detect outliers in the multivariate setting using the most commonly used distance metric, the Mahalanobis distance (via the MVN package). However, the data sets are too large: the vectors allocated for outlier detection exceeded 6 GB, and the run failed with an error that the sample size must be between 3 and 5000.
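As a lighter-weight alternative (a sketch only, not the approach ultimately used here), the Mahalanobis distances could be computed directly with the base stats::mahalanobis() function on the imputed numeric columns and compared against a chi-squared cutoff, assuming the columns are complete and the covariance matrix is non-singular:

# Sketch only: Mahalanobis distances on the imputed numeric ALOS3 columns
num_cols <- ALOS3 %>% ungroup() %>%
  dplyr::select(Total_stays, overnight_stays, AvgLOS, Peer_average, Total_beds) %>%
  lapply(as.numeric) %>% as.data.frame()
md <- mahalanobis(num_cols, center = colMeans(num_cols), cov = cov(num_cols))
# flag observations beyond the 97.5% chi-squared quantile as potential outliers
sum(md > qchisq(0.975, df = ncol(num_cols)))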

I then used a univariate box-plot approach instead. According to Tukey’s method:

• all variables of the ALOS3 and Admission3 data sets seem to have many outliers
• no variable of NWAU3 seems to have outliers, except for one outlier in Cost_NWAU

Handling outliers

There are several ways to deal with outliers. I handled them using capping (winsorising), which involves replacing the outliers with the nearest values that are not outliers.

# subset numeric variables using Peer_group
#ALOS3_Large <- ALOS3 %>%  filter( Peer_group == "Large hospitals" ) %>%  dplyr::select(Total_stays, overnight_stays, overnight_stays_per,AvgLOS,Peer_average,Total_beds)
#ALOS3_Large <- ALOS3_Large[ -c(1) ]

# Multivariate outlier detection using Mahalanobis distance with QQ plots
# ALOS3_Large <- mvn(data = ALOS3_Large, multivariateOutlierMethod = "quan", showOutliers = TRUE)

# univariate box plot approach for ALOS3 
# Limit Peer_group to five levels, as R warns about an incorrect number of dimensions (6) otherwise
ALOS4 <- ALOS3 %>% filter( Peer_group == "Large hospitals" | Peer_group == "Medium hospitals" | Peer_group == "Children's hospitals"| Peer_group == "Major hospitals"| Peer_group == "Small hospitals")

par(mfrow=c(3,2))
boxplot(ALOS4$Total_stays ~ ALOS4$Peer_group, main="Total stays by peer group", ylab = "Total stays", xlab = "peer group")

boxplot(ALOS4$overnight_stays ~ ALOS4$Peer_group, main="Overnight stays by peer group", ylab = "overnight stays", xlab = "peer group")

boxplot(ALOS4$overnight_stays_per ~ ALOS4$Peer_group, main="Percentage of overnight stays by peer group", ylab = "overnight stays per", xlab = "peer group")

boxplot(ALOS4$AvgLOS ~ ALOS4$Peer_group, main="Average length of stay by peer group", ylab = "AvgLOS", xlab = "peer group")

boxplot(ALOS4$Peer_average ~ ALOS4$Peer_group, main="Peer group average by peer group", ylab = "Peer average", xlab = "peer group")

boxplot(ALOS4$Total_beds ~ ALOS4$Peer_group, main="Total overnight bed days by peer group", ylab = "Total_beds", xlab = "peer group")

# univariate box plot approach for NWAU3 
NWAU4 <- NWAU3 %>% filter( Peer_group == "Large hospitals" | Peer_group == "Medium hospitals" | Peer_group == "Children's hospitals"| Peer_group == "Major hospitals"| Peer_group == "Small hospitals")

par(mfrow=c(2,2))
boxplot(NWAU4$No_NWAU ~ NWAU4$Peer_group, main="No_NWAU", ylab = "Number of NWAU", xlab = "peer group")

boxplot(NWAU4$Cost_NWAU ~ NWAU4$Peer_group, main="Cost_NWAU", ylab = "Cost per NWAU", xlab = "peer group")

boxplot(NWAU4$Peer_Cost_NWAU ~ NWAU4$Peer_group, main="Peer_Cost_NWAU", ylab = "Peer average cost per NWAU", xlab = "peer group")

boxplot(NWAU4$private_patients_per ~ NWAU4$Peer_group, main="private_patients_per", ylab = "Percentage of private patients", xlab = "peer group")

# univariate box plot approach for Admission3 

Admission4 <- Admission3 %>% filter( State == "ACT" | State == "NSW" | State == "SA"| State == "Qld"| State == "Vic")

par(mfrow=c(1,1))
boxplot(Admission4$No_admissions ~ Admission4$State, main="Patient admissions by state", ylab = "No_admissions", xlab = "State")

# handle outliers using capping (winsorising): replace outliers with the nearest values that are not outliers

cap <- function(x){
    # 5th, 25th, 75th and 95th percentiles of x
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    # values below Q1 - 1.5*IQR are capped at the 5th percentile
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    # values above Q3 + 1.5*IQR are capped at the 95th percentile
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
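# tiny demonstration of cap() on a toy vector (not the assignment data):
# 100 lies above Q3 + 1.5*IQR, so it is replaced by the 95th percentile (55)
cap(c(1:10, 100))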

ALOS3$Total_stays <- ALOS3$Total_stays %>% cap()
ALOS3$overnight_stays <- ALOS3$overnight_stays %>% cap()
ALOS3$overnight_stays_per <- ALOS3$overnight_stays_per %>% cap()
ALOS3$AvgLOS <- ALOS3$AvgLOS %>% cap()
ALOS3$Peer_average <- ALOS3$Peer_average %>% cap()
ALOS3$Total_beds <- ALOS3$Total_beds %>% cap()

NWAU3$Cost_NWAU <- NWAU3$Cost_NWAU %>% cap()  

Admission3$No_admissions <- Admission3$No_admissions %>% cap() 

Tidy & Manipulate Data II

I needed to create a variable called “temp” in each of the three data sets by concatenating H_Name and Time_period, to serve as a common key for joining the tables.

Then I joined the tables using inner joins. I chose this join type because I do not want the tidy data set to contain empty records.

# creating a common variable 
ALOS5 <- mutate(ALOS3,
                temp = paste(H_Name, Time_period, sep=" / "))

NWAU5 <- mutate(NWAU3,
                temp = paste(H_Name, Time_period, sep=" / "))

Admission5 <- mutate(Admission3,
                temp = paste(H_Name, Time_period, sep=" / "))



# join tables

join_1 <- ALOS5 %>% inner_join(NWAU5, by = "temp")

join_2 <- join_1 %>% inner_join(Admission5, by = "temp")
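# optional check (sketch): rows of ALOS5 whose hospital/period key has no
# match in NWAU5 would be dropped by the inner join; anti_join counts them
ALOS5 %>% anti_join(NWAU5, by = "temp") %>% nrow()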

join_3 <- join_2 %>% transmute(H_Name = H_Name.x,
                             H_Type = H_Type.x,
                             State = State.x,
                             LHN,
                             Peer_group = Peer_group.x,
                             Time_period = Time_period.x,
                             Category = Category.x,
                             Total_stays,
                             overnight_stays,
                             overnight_stays_per,
                             AvgLOS,
                             Peer_average,
                             Total_beds,
                             No_NWAU,
                             Cost_NWAU,
                             Peer_Cost_NWAU,
                             private_patients_per,
                             No_admissions )

Tidy & Manipulate Data I

According to the tidy data principles and the purpose of my study, the join_3 data set is now tidy and does not need to be reshaped into any other format.

Transform

When I tried to detect outliers using z-scores, I found that most of the variables are right-skewed.

When I checked the skewness of the Total_beds variable using a histogram, it was right-skewed. Therefore, I used a logarithmic transformation to reshape the distribution of that variable and reduce the skewness.

# histogram to check the distribution of Total_beds of the data set join_3
transform <- hist(join_3$Total_beds)

# logarithmic transformation
log_transform <- log10(join_3$Total_beds)
hist(log_transform)
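# note (alternative sketch, not used in the original run): if any Total_beds
# values were zero, log10() would return -Inf; a shifted transform avoids this
# hist(log10(join_3$Total_beds + 1))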

# Now, the tidy version of the data extracted from the original source is ready for the study
Tidy <- join_3
head(Tidy)              

Citations

Original data: Australian Institute of Health and Welfare, Admitted patients, https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients

Modifying a direct quote: