Required packages

# This is the R chunk for the required packages
library(ggplot2) # Useful for creating plots
library(dplyr)  # Useful for data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr) # Useful for creating nice tables
library(qqplotr)# Useful for qqplots
## 
## Attaching package: 'qqplotr'
## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine
library(car) # Useful for Applied Regression
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(readxl) # Useful for importing excel sheets
library(tinytex) # Useful for compiling LaTeX Documents
library(rlang) # Useful for tidyverse features
## Warning: package 'rlang' was built under R version 4.0.3
library(tidyr) # Useful for Tidy Messy Data
library(outliers)# Useful for tests for outliers
## Warning: package 'outliers' was built under R version 4.0.3

Executive Summary

The main steps of the data pre-processing are as follows:

  1. Get: The two datasets were downloaded from the public open data website( as indicated below) and imported into R. Since the original datasets was large, I checked the description of the datasets and only imported the range of my target variables.Please refer: https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients.

  2. Understand: This step includes understanding my original datasets(as to the nature of each variable as to character, factor, numeric)using the str()function, as well as the joined dataset, using inner_join()function. I checked the data volume, the data structure and got an understanding of each attribute.

  3. Tidy and manipulate: From my understanding of the datasets, I tidied up the messy data to make sure each variable is stored as a column and each observation is one row. I used the filter function. I also generated a new variable using the mutate function. I manipulated the string values in order to create a common key for joining two datasets.

  4. Scan: I checked missing values, special values and obvious error for all variables. I used the is.na function and sapply function for the same. For different type variables, I had different approaches: for categorical variables, I checked the plausibility of values for each column; for numerical columns, I scanned the columns for outliers using boxplots and I used the capping function or winsoring method to contain the outliers .

  5. Transform: In the final step I applied data transformation to the column that has a skewed distributed after capping the outliers.Using log10 function helped to transform variables to return them back to normality. The final dataset can be deemed tidy and suitable for analysis.

Please see the details of the 5 steps in each of the chunks below that explains the functions used in detail.

Data

Please see below: For both datasets of this assignment, I used read_xlsx() function and specified the sheet or table when importing data.

Data set1: For the first dataset, I imported the full dataset and then subset the columns I need. The read_xlsx() filled the unnamed columns with numbers as name. Therefore the indexes of the columns can be easily identified from the name. I inspected the dataset after I imported the full dataset, I also knew that the observations required are from row 2 to row 30021. The first data used for this analysis was sourced from https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients. The data set, Average length of stay in hospital, consists of 30021 of 18 variables. The 18 variables are Reporting unit, Reporting Type, State, Peer group, Time Period, Category, Total number of stays, Number of Overnight Stays,Percentage of Overnight Stays, Average length of stay, Peer group average, Total overnight patient bed days,and in between 6 empty columns showing NA heads, thus totaling 18 columns. The first 12 rows have been skipped. Please see chunk below:

# This is the R chunk for the Data Section
# Dataset 1

ALOS <- read_excel("udm-average-length-of-stay-data (1).xlsx", skip = 12)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30019 / R30019C14: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30020 / R30020C14: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30021 / R30021C14: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30022 / R30022C14: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30023 / R30023C14: got '‡'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in N30024 / R30024C14: got '‡'
## New names:
## * `` -> ...8
## * `` -> ...10
## * `` -> ...12
## * `` -> ...14
## * `` -> ...16
## * ...
head(ALOS)

I have eliminated the empty columns from the data set 1, using the code below, and then I have subsetted the dataset with 7 variables I need; please see below:

ALOS_remove <- ALOS[c(-8,-10,-12,-14,-16,-18)]
head(ALOS_remove)
ALOS1 <- ALOS_remove[1:30021, c(1,3,4,7,8,10,12)]
head(ALOS1)

Tidy and filter Dataset 1: The above data set is untidy and I have used the filter function to remove the NP values, as shown below:

Tidy_ALOS <- ALOS1%>% filter(!`Average length of stay (days)`== "NP")

Dataset 2: The second dataset is also from https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients. However I have taken the Hospital resources 2018-19 AS1 table which has 707 observations and 17 variables(2018-19 is the latest available). The 17 variables are: State, Hospital name, Establishment ID, Medicare provider no., Local Hospital Network code, Local Hospital Network, Remoteness area (code), Remoteness area, Number of available beds, Peer group code, Peer group name, Supplied Establishments data, Supplied Morbidity data, Supplied Emergency department data, Supplied Elective Surgery Waiting times data, Supplied Non-admitted patient aggregate data, IHPA funding designation.

For the second dataset, Hospital resources, I inspected the dataset before importing. I knew that the attributes required are up to ‘H’ columns in AS.1. I did not import the full dataset, but only the table AS.1. Then I subsetted the columns that I am interested. The subset contains: Hospital Name, Local Hospital Network, Remoteness Area Code, Remoteness Area Please see the following chunks:

# 2nd Data set:
HOSPRES <- read_excel("C:/Users/Srikanth/Desktop/Working Directory/Assignment 2 DW/Hospital-resources-2018-19-Tables (1).xlsx", 
    sheet = "A.S1", skip = 10)
head(HOSPRES)

Subsetting the above with variables or columns that I need:

HOSPRES1 <- HOSPRES[, c(2,8)]
head(HOSPRES1)

I chose these two datasets to work on since they have the same attribute region names, “Hospital Name”; they can be joined for further analysis. However, before I can join these two datasets, I need to tidy up the ‘Reporting Unit’ of ALOS1 dataset to ‘Hospitals’ to create a common key for them to merge. Pre-tiding the dataset

  1. I simplified the names as given below, apart from the first column name as “Hospital name” instead of “Reporting Unit”.
# Renaming the columns 
colnames(ALOS1) <- c("Hospital Name", "State", "Peer group", "TotalStays", "OvernightStays", "ALOStay(Days)", "Total Overnight BedDays")
head(ALOS1)

Tidying the second data set: Nothing to tidy as I have taken only the Hospital name and the remoteness area which is not available in the first data set.

Merging datasets/Scanning obvious error Before I join the two datasets, I want to make sure that there are no typos for their common key columns. Therefore I used setdiff() to check those two columns.

# Checking difference before merging
a <- ALOS1$`Hospital Name`
b <- HOSPRES1$`Hospital name`
setdiff(a,b)
##   [1] "Armidale Hospital"                                                  
##   [2] "Auburn Hospital"                                                    
##   [3] "Ballina Hospital"                                                   
##   [4] "Balmain Hospital"                                                   
##   [5] "Balranald Multi Purpose Service"                                    
##   [6] "Bankstown Lidcombe Hospital"                                        
##   [7] "Baradine Multi Purpose Service"                                     
##   [8] "Barham Hospital"                                                    
##   [9] "Barraba Multi Purpose Service"                                      
##  [10] "Batemans Bay Hospital"                                              
##  [11] "Bathurst Health Service"                                            
##  [12] "Batlow/Adelong Multi Purpose Service"                               
##  [13] "Bellinger River District Hospital"                                  
##  [14] "Belmont Hospital"                                                   
##  [15] "Berrigan Multi Purpose Service"                                     
##  [16] "Bingara Multi Purpose Service"                                      
##  [17] "Blacktown Hospital"                                                 
##  [18] "Blayney Multi Purpose Service"                                      
##  [19] "Blue Mountains Hospital"                                            
##  [20] "Boggabri Multi Purpose Service"                                     
##  [21] "Bombala Multi Purpose Service"                                      
##  [22] "Bonalbo Hospital"                                                   
##  [23] "Boorowa Multi Purpose Service"                                      
##  [24] "Bourke Multi Purpose Service"                                       
##  [25] "Bowral Hospital"                                                    
##  [26] "Braidwood Multi Purpose Service"                                    
##  [27] "Brewarrina Multi Purpose Service"                                   
##  [28] "Broken Hill Hospital"                                               
##  [29] "Bulahdelah Hospital"                                                
##  [30] "Bulli Hospital"                                                     
##  [31] "Byron Bay Hospital"                                                 
##  [32] "Calvary Mater Newcastle Hospital"                                   
##  [33] "Camden Hospital"                                                    
##  [34] "Campbelltown Hospital"                                              
##  [35] "Canowindra Hospital"                                                
##  [36] "Canterbury Hospital"                                                
##  [37] "Casino Hospital"                                                    
##  [38] "Cessnock Hospital"                                                  
##  [39] "Cobar Health Service"                                               
##  [40] "Coffs Harbour Hospital"                                             
##  [41] "Coledale Hospital"                                                  
##  [42] "Collarenebri Multi Purpose Service"                                 
##  [43] "Concord Hospital"                                                   
##  [44] "Condobolin Health Service"                                          
##  [45] "Coolah Multi Purpose Service"                                       
##  [46] "Coolamon Multi Purpose Service"                                     
##  [47] "Cooma Hospital"                                                     
##  [48] "Coonabarabran Health Service"                                       
##  [49] "Coonamble Multi Purpose Service"                                    
##  [50] "Cootamundra Hospital"                                               
##  [51] "Coraki Hospital"                                                    
##  [52] "Corowa Hospital"                                                    
##  [53] "Cowra Health Service"                                               
##  [54] "Crookwell Hospital"                                                 
##  [55] "Culcairn Multi Purpose Service"                                     
##  [56] "David Berry Hospital"                                               
##  [57] "Deniliquin Hospital"                                                
##  [58] "Dorrigo Multi Purpose Service"                                      
##  [59] "Dubbo Hospital"                                                     
##  [60] "Dunedoo Multi Purpose Service"                                      
##  [61] "Dungog Hospital"                                                    
##  [62] "Fairfield Hospital"                                                 
##  [63] "Finley Hospital"                                                    
##  [64] "Gilgandra Multi Purpose Service"                                    
##  [65] "Glen Innes Hospital"                                                
##  [66] "Gloucester Hospital"                                                
##  [67] "Gosford Hospital"                                                   
##  [68] "Goulburn Hospital"                                                  
##  [69] "Gower Wilson Multi Purpose Service"                                 
##  [70] "Grafton Base Hospital"                                              
##  [71] "Grenfell Multi Purpose Service"                                     
##  [72] "Griffith Hospital"                                                  
##  [73] "Gulgong Multi Purpose Service"                                      
##  [74] "Gundagai Hospital"                                                  
##  [75] "Gunnedah Hospital"                                                  
##  [76] "Guyra Multi Purpose Service"                                        
##  [77] "Hawkesbury Hospital"                                                
##  [78] "Hay Hospital"                                                       
##  [79] "Henty Multi Purpose Service"                                        
##  [80] "Hillston Hospital"                                                  
##  [81] "Holbrook Hospital"                                                  
##  [82] "Hornsby Ku-ring-gai Hospital"                                       
##  [83] "Inverell Hospital"                                                  
##  [84] "Jerilderie Multi Purpose Service"                                   
##  [85] "John Hunter Hospital"                                               
##  [86] "Junee Multi Purpose Service"                                        
##  [87] "Justice Health Services"                                            
##  [88] "Kempsey Hospital"                                                   
##  [89] "Kiama Hospital"                                                     
##  [90] "Kurri Kurri Hospital"                                               
##  [91] "Kyogle Multi Purpose Service"                                       
##  [92] "Lachlan Health Service - Forbes"                                    
##  [93] "Lachlan Health Service - Parkes"                                    
##  [94] "Lake Cargelligo Multi Purpose Service"                              
##  [95] "Leeton Hospital"                                                    
##  [96] "Lightning Ridge Multi Purpose Service"                              
##  [97] "Lismore Hospital"                                                   
##  [98] "Lithgow Hospital"                                                   
##  [99] "Liverpool Hospital"                                                 
## [100] "Lockhart Hospital"                                                  
## [101] "Macksville Hospital"                                                
## [102] "Maclean Hospital"                                                   
## [103] "Maitland Hospital"                                                  
## [104] "Manilla Hospital"                                                   
## [105] "Manly Hospital"                                                     
## [106] "Manning Hospital"                                                   
## [107] "Merriwa Multi Purpose Service"                                      
## [108] "Milton Ulladulla Hospital"                                          
## [109] "Molong Health Service"                                              
## [110] "Mona Vale Hospital"                                                 
## [111] "Moree Hospital"                                                     
## [112] "Moruya Hospital"                                                    
## [113] "Mount Druitt Hospital"                                              
## [114] "Mudgee Health Service"                                              
## [115] "Mullumbimby Hospital"                                               
## [116] "Murrumburrah-Harden Hospital"                                       
## [117] "Murwillumbah Hospital"                                              
## [118] "Muswellbrook Hospital"                                              
## [119] "Narrabri Hospital"                                                  
## [120] "Narrandera Hospital"                                                
## [121] "Narromine Health Service"                                           
## [122] "Nepean Hospital"                                                    
## [123] "Nimbin Multi Purpose Service"                                       
## [124] "Nyngan Multi Purpose Service"                                       
## [125] "Oberon Multi Purpose Service"                                       
## [126] "Pambula Hospital"                                                   
## [127] "Peak Hill Multipurpose Service"                                     
## [128] "Port Kembla Hospital"                                               
## [129] "Port Macquarie Hospital"                                            
## [130] "Portland Hospital"                                                  
## [131] "Prince of Wales Hospital"                                           
## [132] "Queanbeyan Hospital"                                                
## [133] "Quirindi Hospital"                                                  
## [134] "Royal Hospital for Women"                                           
## [135] "Royal North Shore Hospital"                                         
## [136] "Royal Prince Alfred Hospital"                                       
## [137] "Royal Prince Alfred Institute of Rheumatology & Orthopaedics"       
## [138] "Royal Rehabilitation Hospital"                                      
## [139] "Ryde Hospital"                                                      
## [140] "Rylstone Multi Purpose Service"                                     
## [141] "Scone Hospital"                                                     
## [142] "Shellharbour Hospital"                                              
## [143] "Shoalhaven Hospital"                                                
## [144] "Singleton Hospital"                                                 
## [145] "South East Regional Hospital"                                       
## [146] "Springwood Hospital"                                                
## [147] "St George Hospital NSW"                                             
## [148] "St Joseph's Hospital"                                               
## [149] "St Vincent's Hospital [Darlinghurst]"                               
## [150] "Sutherland Hospital"                                                
## [151] "Sydney Children's Hospital"                                         
## [152] "Sydney Hospital / Sydney Eye Hospital"                              
## [153] "Tamworth Hospital"                                                  
## [154] "Temora Hospital"                                                    
## [155] "Tenterfield Hospital"                                               
## [156] "The Children's Hospital at Westmead"                                
## [157] "Tocumwal Hospital"                                                  
## [158] "Tomaree Hospital"                                                   
## [159] "Tottenham Multipurpose Service"                                     
## [160] "Trangie Multi Purpose Service"                                      
## [161] "Trundle Multi Purpose Health Service"                               
## [162] "Tullamore Multi Purpose Health Service"                             
## [163] "Tumbarumba Multi Purpose Service"                                   
## [164] "Tumut Hospital"                                                     
## [165] "Urana Multi Purpose Service"                                        
## [166] "Urbenville Multi Purpose Service"                                   
## [167] "Vegetable Creek Multi Purpose Service Emmaville"                    
## [168] "Wagga Wagga Hospital"                                               
## [169] "Walcha Multi Purpose Service"                                       
## [170] "Walgett Multipurpose Service"                                       
## [171] "War Memorial Hospital"                                              
## [172] "Warialda Multi Purpose Service"                                     
## [173] "Warren Multi Purpose Service"                                       
## [174] "Wauchope Hospital"                                                  
## [175] "Wee Waa Hospital"                                                   
## [176] "Wellington Health Service"                                          
## [177] "Wentworth Hospital"                                                 
## [178] "Werris Creek Hospital"                                              
## [179] "Westmead Hospital"                                                  
## [180] "Wilcannia Multi Purpose Service"                                    
## [181] "Wilson Memorial Community Hospital, Murrurundi"                     
## [182] "Wingham Hospital"                                                   
## [183] "Wollongong Hospital"                                                
## [184] "Woy Woy Hospital"                                                   
## [185] "Wyalong Hospital"                                                   
## [186] "Wyong Hospital"                                                     
## [187] "Yass Hospital"                                                      
## [188] "Young Hospital"                                                     
## [189] "Albury Wodonga Health [Wodonga Campus]"                             
## [190] "Austin Hospital [Heidelberg]"                                       
## [191] "Bass Coast Health"                                                  
## [192] "Beaufort & Skipton Health Service [Beaufort]"                       
## [193] "Beaufort & Skipton Health Service [Skipton]"                        
## [194] "Boort District Health"                                              
## [195] "Castlemaine Health"                                                 
## [196] "Caulfield Hospital"                                                 
## [197] "Central Gippsland Health Service [Maffra]"                          
## [198] "Cranbourne Integrated Care Centre"                                  
## [199] "Djerriwarrh Health Service [Bacchus Marsh]"                         
## [200] "Edenhope and District Memorial Hospital"                            
## [201] "Gippsland Southern Health Service - Korumburra"                     
## [202] "Gippsland Southern Health Service - Leongatha"                      
## [203] "Healesville Hospital and Yarra Valley Health"                       
## [204] "Heathcote Health"                                                   
## [205] "Heidelberg Repatriation Hospital [Heidelberg West]"                 
## [206] "Mallee Track Health & Community Service [Ouyen]"                    
## [207] "Maryborough District Health Service [Dunolly]"                      
## [208] "Maryborough District Health Service [Maryborough]"                  
## [209] "Otway Health & Community Services [Apollo Bay]"                     
## [210] "Peter MacCallum Cancer Centre"                                      
## [211] "Rochester & Elmore District Health Service"                         
## [212] "Royal Children's Hospital [Parkville]"                              
## [213] "Royal Melbourne Hospital [Parkville]"                               
## [214] "Royal Melbourne Hospital- Royal Park Campus"                        
## [215] "Royal Women's Hospital [Parkville]"                                 
## [216] "Sandringham Hospital"                                               
## [217] "St George's Health Service- Aged Care"                              
## [218] "St Vincent's Hospital [Fitzroy]"                                    
## [219] "Swan Hill District Health [Swan Hill]"                              
## [220] "Terang & Mortlake Health Service [Terang]"                          
## [221] "The Alfred"                                                         
## [222] "The Bendigo Hospital"                                               
## [223] "The Kilmore & District Hospital"                                    
## [224] "The Northern Hospital [Epping]"                                     
## [225] "The Peter James Centre [East Burwood]"                              
## [226] "The Royal Victorian Eye & Ear Hospital"                             
## [227] "University Hospital Geelong"                                        
## [228] "Upper Murray Health & Community Services [Corryong]"                
## [229] "Werribee Mercy Hospital"                                            
## [230] "West Gippsland Healthcare Group [Warragul]"                         
## [231] "West Wimmera Health Service [Rupanyup]"                             
## [232] "Western District Health Service [Coleraine District Health Service]"
## [233] "Western District Health Service [Hamilton]"                         
## [234] "Western District Health Service [Penshurst]"                        
## [235] "Women's at Sandringham"                                             
## [236] "Bundaberg Base Hospital"                                            
## [237] "Mater Adult Hospital"                                               
## [238] "Mater Children's Hospital"                                          
## [239] "Mater Mothers' Hospital"                                            
## [240] "Mount Isa Base Hospital"                                            
## [241] "Nambour General Hospital"                                           
## [242] "Queen Elizabeth II Jubilee Hospital"                                
## [243] "Royal Children's Hospital"                                          
## [244] "St George Hospital Qld"                                             
## [245] "The Townsville Hospital"                                            
## [246] "Wynnum Health Service"                                              
## [247] "Balaklava Soldiers' Memorial District Hospital"                     
## [248] "Booleroo Centre District Hospital and Health Services"              
## [249] "Central Yorke Peninsula Hospital (Maitland)"                        
## [250] "Gumeracha District Soldiers' Memorial Hospital"                     
## [251] "Karoonda and District Soldiers' Memorial Hospital"                  
## [252] "Leigh Creek Health Service"                                         
## [253] "McLaren Vale & Districts War Memorial Hospital Inc"                 
## [254] "Meningie and Districts Memorial Hospital and Health Services"       
## [255] "Millicent and Districts Hospital and Health Service"                
## [256] "Mount Barker District Soldiers' Memorial Hospital"                  
## [257] "Mount Gambier and Districts Health Service"                         
## [258] "Murray Bridge Soldiers' Memorial Hospital"                          
## [259] "Noarlunga Hospital"                                                 
## [260] "Northern Yorke Peninsula Health Service (Wallaroo)"                 
## [261] "Port Augusta Hospital and Regional Health Services"                 
## [262] "Port Broughton and District Hospital and Health Service"            
## [263] "Riverton District Soldiers' Memorial Hospital"                      
## [264] "Southern Yorke Peninsula Health Service (Yorketown)"                
## [265] "Armadale-Kelmscott Memorial Hospital"                               
## [266] "Busselton Health Campus"                                            
## [267] "Denmark Hospital and Health Service"                                
## [268] "Dongara Eneabba Mingenew Health Service"                            
## [269] "Joondalup Health Campus (Public)"                                   
## [270] "Kaleeya Hospital"                                                   
## [271] "Kondinin Districts Health Service"                                  
## [272] "Kununoppin Health Service"                                          
## [273] "Merredin Health Service"                                            
## [274] "Morawa Health Service"                                              
## [275] "Murray District Hospital"                                           
## [276] "North Midlands Health Service"                                      
## [277] "Northampton Kalbarri Health Service"                                
## [278] "Peel Health Campus"                                                 
## [279] "Pingelly Health Centre"                                             
## [280] "Ravensthorpe Health Centre"                                         
## [281] "Royal Perth Hospital Shenton Park Campus"                           
## [282] "Royal Perth Hospital Wellington Street Campus"                      
## [283] "St John of God Midland Public Hospital"                             
## [284] "Swan District Hospital"                                             
## [285] "Wyalkatchem-Koorda and Districts Hospital"                          
## [286] "Mersey Community Hospital"                                          
## [287] "North West Regional Hospital"                                       
## [288] "Gove Hospital"                                                      
## [289] "Calvary Public Hospital"                                            
## [290] "The Canberra Hospital"
ALOS2 <- inner_join(HOSPRES1,ALOS1, by = c("Hospital name" = "Hospital Name"))
dim(ALOS2)
## [1] 15549     8

Understand

I used the str()function to understand the variables in the dataframe as given below in detail:

# This is the R chunk for the Understand Section
str(ALOS2)
## tibble [15,549 x 8] (S3: tbl_df/tbl/data.frame)
##  $ Hospital name          : chr [1:15549] "Albany Hospital" "Albany Hospital" "Albany Hospital" "Albany Hospital" ...
##  $ Remoteness area        : chr [1:15549] "Outer Regional" "Outer Regional" "Outer Regional" "Outer Regional" ...
##  $ State                  : chr [1:15549] "WA" "WA" "WA" "WA" ...
##  $ Peer group             : chr [1:15549] "Medium hospitals" "Medium hospitals" "Medium hospitals" "Medium hospitals" ...
##  $ TotalStays             : chr [1:15549] "75" "54" "46" "61" ...
##  $ OvernightStays         : chr [1:15549] "75" "54" "46" "60" ...
##  $ ALOStay(Days)          : chr [1:15549] "1.8" "1.6" "NP" "1.4" ...
##  $ Total Overnight BedDays: chr [1:15549] "135" "87" "82" "85" ...

After inspecting the dataset, I noticed all the attributes were in character data type, which is incorrect. Hospital name should be character, as it is. Remoteness area should be a factor. State name should be character. Peer group should be a factor. However, OvernightStays and Average length of stay should be numeric So also, Total Overnight BedDays should be numeric.

# Converting character datatype to factor datatype with order
ALOS2$`Remoteness area`  <- factor(ALOS2$`Remoteness area`)

ALOS2$`Peer group`<- factor(ALOS2$`Peer group`)

# Converting character data type to numeric data type
ALOS2$`TotalStays` <-  as.numeric(ALOS2$`TotalStays`)
## Warning: NAs introduced by coercion
ALOS2$`OvernightStays` <-  as.numeric(ALOS2$`OvernightStays`)
## Warning: NAs introduced by coercion
ALOS2$`ALOStay(Days)` <-as.numeric(ALOS2$`ALOStay(Days)`)
## Warning: NAs introduced by coercion
ALOS2$`Total Overnight BedDays` <-  as.numeric(ALOS2$`Total Overnight BedDays`)
## Warning: NAs introduced by coercion
str(ALOS2)
## tibble [15,549 x 8] (S3: tbl_df/tbl/data.frame)
##  $ Hospital name          : chr [1:15549] "Albany Hospital" "Albany Hospital" "Albany Hospital" "Albany Hospital" ...
##  $ Remoteness area        : Factor w/ 5 levels "Inner Regional",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ State                  : chr [1:15549] "WA" "WA" "WA" "WA" ...
##  $ Peer group             : Factor w/ 6 levels "Children's hospitals",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ TotalStays             : num [1:15549] 75 54 46 61 62 70 101 125 125 117 ...
##  $ OvernightStays         : num [1:15549] 75 54 46 60 62 64 101 125 125 117 ...
##  $ ALOStay(Days)          : num [1:15549] 1.8 1.6 NA 1.4 1.4 1.5 4.2 4 3.4 4.1 ...
##  $ Total Overnight BedDays: num [1:15549] 135 87 82 85 84 93 422 502 422 474 ...

I noticed the output ‘NAs introduced by coercion’, which means there were missing values in the dataset. I will deal with this in later,in tidy and manipulate step.

head(ALOS2)

Tidy & Manipulate Data I

My data had a lot of NA values in ALOStay(days).Please see how I am tidying it up below: and, After removing the NA’s, I now manipulate the above data to get Large and Medium hospitals from the Peer group as follows:

# This is the R chunk for the Tidy & Manipulate Data I 

Tidy_ALOS <- ALOS2%>% filter(!`ALOStay(Days)`== "NA") 
#Large hospitals consist of 4411 observations and 12 variables
ALOS_L<- Tidy_ALOS %>% filter(Tidy_ALOS$`Peer group` == "Large hospitals")

#Medium hospitals consist of 2182 observations and 12 variables
ALOS_M<- Tidy_ALOS %>% filter(Tidy_ALOS$`Peer group` == "Medium hospitals")

Tidy & Manipulate Data II

Here, I am creating a new variable, OvernightStayRatio which is calculated as (OvernightStays/TotalStays)*100, using the Mutate function. I am giving it in 2 separate chunks for large and medium hospitals.

ALOS_L1 <- mutate(ALOS_L,"OvernightStayRatio" = (`OvernightStays`/`TotalStays`)*100 )

head(ALOS_L1)
# This is the R chunk for the Tidy & Manipulate Data II 
ALOS_M1 <- mutate(ALOS_M,"OvernightStayRatio" = (`OvernightStays`/`TotalStays`)*100)

head(ALOS_M1)

Scan I

The dataset is scanned for inconsistencies and missing values for which the is.na() function is used to check if we have any.The sum() function is also called to get the total number of missing values.

# For large Hospitals
sum(is.na(ALOS_L1))
## [1] 0
# For Medium Hospitals
sum(is.na(ALOS_M1))
## [1] 0

I am using sapply function for detailed analysis as given below:

# This is the R chunk for the Scan I - Part1:
# Locate index of the missing value for Large Hospitals
sapply(ALOS_L1, function(x) which(is.na(x)))
## $`Hospital name`
## integer(0)
## 
## $`Remoteness area`
## integer(0)
## 
## $State
## integer(0)
## 
## $`Peer group`
## integer(0)
## 
## $TotalStays
## integer(0)
## 
## $OvernightStays
## integer(0)
## 
## $`ALOStay(Days)`
## integer(0)
## 
## $`Total Overnight BedDays`
## integer(0)
## 
## $OvernightStayRatio
## integer(0)

There are no missing values after the scan function in ALOS_L, as they already been eliminated in tidy and manipulate function. Now let us see if we find any further missing values in ALOS_M.

# This is the R chunk for the Scan I - Part2:
# Locate index of the missing value for Medium Hospitals
sapply(ALOS_M1, function(x) which(is.na(x)))
## $`Hospital name`
## integer(0)
## 
## $`Remoteness area`
## integer(0)
## 
## $State
## integer(0)
## 
## $`Peer group`
## integer(0)
## 
## $TotalStays
## integer(0)
## 
## $OvernightStays
## integer(0)
## 
## $`ALOStay(Days)`
## integer(0)
## 
## $`Total Overnight BedDays`
## integer(0)
## 
## $OvernightStayRatio
## integer(0)

There are no missing values after the scan function in medium hospitals peer group,as well, as they already been eliminated in tidy and manipulate function

Scan II

The dataset is further scanned for outliers. I will use the boxplot function in this case to check for outliers in the ALOStay(Days) variable.

par(mfrow = c(1,2))
#Boxplot for Large hospital data set
boxplot(ALOS_L1$`ALOStay(Days)`, col = 'red', main = "Large hospitals", ylab = "ALOStay(Days)")

#Boxplot for Medium hospital data set
boxplot(ALOS_M1$`ALOStay(Days)`, col = 'green', main = "Medium hospitals", ylab = "ALOStay(Days)")

I can see a lot of outliers on the upper quartile, in the boxplots above.The outliers found are further treated using the capping method also known as winsoring where the method involves replacing the outliers with the nearest neighbours that aren’t outliers.

cap <- function(x){
    quantiles <- quantile( x, c( 0.05,0.25, 0.75, 0.95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
Large_cap <- ALOS_L1$`ALOStay(Days)` %>% cap()
Medium_cap <- ALOS_M1$`ALOStay(Days)` %>% cap()

And now again checking with the boxplots:

par(mfrow = c(1,2))
boxplot(Large_cap, main = "Large hospitals", xlab = "OvernightStayRatio", ylab = "ALOStay(Days)", col="red")
boxplot(Medium_cap, main= "Medium hospitals", ylab = "ALOStay(Days)", col="green")

Transform

I decided to transform the Average Length of Stay(Days) in both Large and Medium Hospitals using the log10 function in R. This is done to show the data in a normal distribution.

# This is the R chunk for the Transform Section
Large_log <- log10(ALOS_L1$`ALOStay(Days)`)
Medium_log  <-log10(ALOS_M1$`ALOStay(Days)`)


par(mfrow =c(1,2))
hist(log(Large_log), main = "Large_log",col="red",ylab = "", xlab = "ALOStay(Days)")
hist(log(Medium_log), main = "Medium_log",col = "green", ylab = "", xlab ="ALOStay(Days)")

To infer:

The datasets were majorly tidy data sets, only a few variables needed to converted,cleaned to analyse the dataset.Variables such as ALOStay(Days), ie(Average length of stay(Days)), were plotted to analyse the large and medium hospital data sets.Similarly the other numeric variables can be analysed.

The above datasets can be used for further analysis.