Required packages

library(readr)
library(dplyr)
library(tidyr)
library(MVN)
library(forecast)

Executive Summary

In this report, we preprocess two datasets from the Australian Bureau of Statistics (ABS): personal income and method of travel to work. After several cleaning steps, including checking for missing and infinite values and pivoting rows into columns, each dataset is first reduced to four columns (measure, region code, region name and value), two of which (region name and region code) form a composite key for joining the two datasets*. We also remove redundant columns such as the year, since all of the data was collected in 2016, and relabel the measure codes to make them easier to read.

We then checked for outliers. However, given the nature of the data, we decided it would not be reasonable to remove them.

Finally, the income-over-3000 variable (I3000) was selected for transformation because it has a right-skewed distribution. A Box-Cox transformation was applied, as it performed better for this variable than the other methods we tried.

*Note: These datasets can only be joined after they have been tidied into a suitable format. The merging step is therefore placed in the Tidy section.

Data

After downloading the data as CSV files, we import them into R for preprocessing and use the head function for a quick look at the data. Each dataset has 13 columns.

travel_method <- read_csv("travel_to_work.csv")
income <- read_csv("income.csv")

head(travel_method)

## # A tibble: 6 x 13
##   MEASURE `Data item` REGIONTYPE `Geography Leve~ LGA_2017 Region FREQUENCY
##   <chr>   <chr>       <chr>      <chr>               <int> <chr>  <chr>    
## 1 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 2 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 3 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 4 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 5 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 6 WORK_T~ Used one m~ LGA2017    Local Governmen~    20110 Alpin~ A        
## # ... with 6 more variables: Frequency <chr>, TIME <int>, Time <int>,
## #   Value <int>, `Flag Codes` <chr>, Flags <chr>

head(income)

## # A tibble: 6 x 13
##   MEASURE `Data item` REGIONTYPE `Geography Leve~ LGA_2017 Region FREQUENCY
##   <chr>   <chr>       <chr>      <chr>               <int> <chr>  <chr>    
## 1 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 2 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 3 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 4 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 5 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## 6 PERSIN~ Persons ea~ LGA2017    Local Governmen~    20110 Alpin~ A        
## # ... with 6 more variables: Frequency <chr>, TIME <int>, Time <int>,
## #   Value <dbl>, `Flag Codes` <chr>, Flags <chr>

Understand

Now that we know which columns are present, we look more closely at the data type of each column; most are read as character or integer by default. In this step, we also convert the MEASURE codes to labelled factors so that each travel method and income bracket has a readable name.

str(travel_method)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1034 obs. of  13 variables:
##  $ MEASURE        : chr  "WORK_TRAV_3" "WORK_TRAV_4" "WORK_TRAV_5" "WORK_TRAV_6" ...
##  $ Data item      : chr  "Used one method - Train or tram (no.)" "Used one method - Bus (no.)" "Used one method - Car (as driver or passenger) (no.)" "Used one method - Motor bike/scooter (no.)" ...
##  $ REGIONTYPE     : chr  "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
##  $ Geography Level: chr  "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
##  $ LGA_2017       : int  20110 20110 20110 20110 20110 20110 20110 20110 20110 20110 ...
##  $ Region         : chr  "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
##  $ FREQUENCY      : chr  "A" "A" "A" "A" ...
##  $ Frequency      : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ TIME           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Time           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Value          : int  8 34 3526 30 60 49 375 4132 50 490 ...
##  $ Flag Codes     : chr  NA NA NA NA ...
##  $ Flags          : chr  NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 13
##   .. ..$ MEASURE        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Data item      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ REGIONTYPE     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Geography Level: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LGA_2017       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Region         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ FREQUENCY      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Frequency      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TIME           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Time           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Value          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Flag Codes     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Flags          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

travel_method$MEASURE <- travel_method$MEASURE %>% factor(
  levels = c("WORK_TRAV_3","WORK_TRAV_4","WORK_TRAV_5","WORK_TRAV_6","WORK_TRAV_7","WORK_TRAV_8","WORK_TRAV_9","WORK_TRAV_10","WORK_TRAV_12","WORK_TRAV_14","WORK_TRAV_15","WORK_TRAV_16","WORK_TRAV_17"),
  labels = c("train_tram","bus","car","motorbike_scooter","bicycle","other","walk","total_one","total_more_than_one","worked_from_home","employed_not_go_to_work","not_stated","total_employed"))

levels(travel_method$MEASURE)

##  [1] "train_tram"              "bus"                    
##  [3] "car"                     "motorbike_scooter"      
##  [5] "bicycle"                 "other"                  
##  [7] "walk"                    "total_one"              
##  [9] "total_more_than_one"     "worked_from_home"       
## [11] "employed_not_go_to_work" "not_stated"             
## [13] "total_employed"

income$MEASURE <- income$MEASURE %>% factor(
  levels = c("PERSINC_2","PERSINC_3","PERSINC_4","PERSINC_5","PERSINC_6","PERSINC_7","PERSINC_8"),
  labels = c("I1_499","I500_999","I1000_1999","I2000_2999","I3000","nil_earning","negative_earning")
)

levels(income$MEASURE)

## [1] "I1_499"           "I500_999"         "I1000_1999"      
## [4] "I2000_2999"       "I3000"            "nil_earning"     
## [7] "negative_earning"
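The relabelling pattern used above can be illustrated on a small vector. This is only a sketch: the codes `T_1` to `T_3` and their labels are made up for illustration and are not ABS codes.

```r
# Toy vector of coded categories (hypothetical codes, not the ABS ones).
codes <- c("T_2", "T_1", "T_3", "T_1")

# `levels` lists the codes in the desired order; `labels` supplies the
# replacement names position by position.
modes <- factor(codes,
                levels = c("T_1", "T_2", "T_3"),
                labels = c("bus", "car", "walk"))

levels(modes)        # the level names are now the labels
as.character(modes)  # each code mapped to its label

# A code that is absent from `levels` silently becomes NA, so it is worth
# counting NAs after relabelling.
unknown <- factor(c("T_1", "T_9"), levels = c("T_1", "T_2", "T_3"))
sum(is.na(unknown))
```

Because unlisted codes turn into NA without a warning, a quick `sum(is.na(...))` on the relabelled column is a cheap way to confirm that every code in the data was covered.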

# convert LGA codes to factor
travel_method$LGA_2017 <- as.character(travel_method$LGA_2017) %>% factor()
str(travel_method)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1034 obs. of  13 variables:
##  $ MEASURE        : Factor w/ 13 levels "train_tram","bus",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Data item      : chr  "Used one method - Train or tram (no.)" "Used one method - Bus (no.)" "Used one method - Car (as driver or passenger) (no.)" "Used one method - Motor bike/scooter (no.)" ...
##  $ REGIONTYPE     : chr  "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
##  $ Geography Level: chr  "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
##  $ LGA_2017       : Factor w/ 80 levels "20110","20260",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Region         : chr  "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
##  $ FREQUENCY      : chr  "A" "A" "A" "A" ...
##  $ Frequency      : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ TIME           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Time           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Value          : int  8 34 3526 30 60 49 375 4132 50 490 ...
##  $ Flag Codes     : chr  NA NA NA NA ...
##  $ Flags          : chr  NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 13
##   .. ..$ MEASURE        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Data item      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ REGIONTYPE     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Geography Level: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LGA_2017       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Region         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ FREQUENCY      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Frequency      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TIME           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Time           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Value          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Flag Codes     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Flags          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# convert LGA codes to factor
income$LGA_2017 <- as.character(income$LGA_2017) %>% factor()
str(income)

## Classes 'tbl_df', 'tbl' and 'data.frame':    560 obs. of  13 variables:
##  $ MEASURE        : Factor w/ 7 levels "I1_499","I500_999",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Data item      : chr  "Persons earning $1-$499 per week (%)" "Persons earning $500-$999 per week (%)" "Persons earning $1000-$1999 per week (%)" "Persons earning $2000-$2999 per week (%)" ...
##  $ REGIONTYPE     : chr  "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
##  $ Geography Level: chr  "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
##  $ LGA_2017       : Factor w/ 80 levels "20110","20260",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Region         : chr  "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
##  $ FREQUENCY      : chr  "A" "A" "A" "A" ...
##  $ Frequency      : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ TIME           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Time           : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Value          : num  34.8 27.8 17.5 2.3 1.2 5.1 0.5 32.7 26.5 15.8 ...
##  $ Flag Codes     : chr  NA NA NA NA ...
##  $ Flags          : chr  NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 13
##   .. ..$ MEASURE        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Data item      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ REGIONTYPE     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Geography Level: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LGA_2017       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Region         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ FREQUENCY      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Frequency      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TIME           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Time           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Value          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Flag Codes     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Flags          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# dropping unnecessary columns
travel_method <- travel_method %>% select(MEASURE, LGA_2017, Region, Value)
income <- income %>% select(MEASURE, LGA_2017, Region, Value)

head(travel_method)

## # A tibble: 6 x 4
##   MEASURE           LGA_2017 Region     Value
##   <fct>             <fct>    <chr>      <int>
## 1 train_tram        20110    Alpine (S)     8
## 2 bus               20110    Alpine (S)    34
## 3 car               20110    Alpine (S)  3526
## 4 motorbike_scooter 20110    Alpine (S)    30
## 5 bicycle           20110    Alpine (S)    60
## 6 other             20110    Alpine (S)    49

head(income)

## # A tibble: 6 x 4
##   MEASURE     LGA_2017 Region     Value
##   <fct>       <fct>    <chr>      <dbl>
## 1 I1_499      20110    Alpine (S)  34.8
## 2 I500_999    20110    Alpine (S)  27.8
## 3 I1000_1999  20110    Alpine (S)  17.5
## 4 I2000_2999  20110    Alpine (S)   2.3
## 5 I3000       20110    Alpine (S)   1.2
## 6 nil_earning 20110    Alpine (S)   5.1

Tidy & Manipulate Data I

In this step, we pivot the rows into columns so that each row is uniquely identified by region code and name.

# Tidying data
travel_method <- travel_method %>% spread(MEASURE, Value)
income <- income %>% spread(MEASURE, Value)

head(travel_method)

## # A tibble: 6 x 15
##   LGA_2017 Region train_tram   bus   car motorbike_scoot~ bicycle other
##   <fct>    <chr>       <int> <int> <int>            <int>   <int> <int>
## 1 20110    Alpin~          8    34  3526               30      60    49
## 2 20260    Arara~         10     9  3272               13      32    28
## 3 20570    Balla~        544   370 33525               92     303   282
## 4 20660    Banyu~       6006   831 37204              237     640   365
## 5 20740    Bass ~         38    52  8749               34      93    77
## 6 20830    Baw B~        282    63 15327               67      43   146
## # ... with 7 more variables: walk <int>, total_one <int>,
## #   total_more_than_one <int>, worked_from_home <int>,
## #   employed_not_go_to_work <int>, not_stated <int>, total_employed <int>

head(income)

## # A tibble: 6 x 9
##   LGA_2017 Region I1_499 I500_999 I1000_1999 I2000_2999 I3000 nil_earning
##   <fct>    <chr>   <dbl>    <dbl>      <dbl>      <dbl> <dbl>       <dbl>
## 1 20110    Alpin~   34.8     27.8       17.5        2.3   1.2         5.1
## 2 20260    Arara~   32.7     26.5       15.8        2     1.1         4.9
## 3 20570    Balla~   33.1     25.8       20.2        3.1   1.8         6.8
## 4 20660    Banyu~   26.4     22.1       25.2        6.2   3.8         8.8
## 5 20740    Bass ~   38       26.7       15.1        2.1   1.4         5.7
## 6 20830    Baw B~   32.2     25.9       19.3        3.3   1.8         7.5
## # ... with 1 more variable: negative_earning <dbl>
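As a side note, the same long-to-wide reshaping can be written with `pivot_wider()`, which tidyr (version 1.0.0 and later) offers as the successor to `spread()`. The sketch below uses a made-up data frame, not the ABS data, and assumes a recent tidyr is installed.

```r
library(tidyr)

# Toy long-format data: one row per (region, measure) pair.
long <- data.frame(
  region  = c("A", "A", "B", "B"),
  measure = c("bus", "car", "bus", "car"),
  value   = c(10, 200, 5, 80)
)

# One row per region, one column per measure.
wide <- pivot_wider(long, names_from = measure, values_from = value)
wide
```

`spread(long, measure, value)` produces the same layout for this input; `spread()` still works but is no longer under active development.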

# merging data
data <- travel_method %>% left_join(income, by = c("LGA_2017", "Region"))

head(data)

## # A tibble: 6 x 22
##   LGA_2017 Region train_tram   bus   car motorbike_scoot~ bicycle other
##   <fct>    <chr>       <int> <int> <int>            <int>   <int> <int>
## 1 20110    Alpin~          8    34  3526               30      60    49
## 2 20260    Arara~         10     9  3272               13      32    28
## 3 20570    Balla~        544   370 33525               92     303   282
## 4 20660    Banyu~       6006   831 37204              237     640   365
## 5 20740    Bass ~         38    52  8749               34      93    77
## 6 20830    Baw B~        282    63 15327               67      43   146
## # ... with 14 more variables: walk <int>, total_one <int>,
## #   total_more_than_one <int>, worked_from_home <int>,
## #   employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## #   I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## #   I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>
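A left join keeps every row of the left table and fills unmatched rows with NA, so it can be useful to check for keys that have no partner. The following sketch uses two made-up tables (the column names `code`, `region`, `x` and `y` are hypothetical) to show one way of doing that with `anti_join()`.

```r
library(dplyr)

# Toy tables sharing a two-column key (code, region).
left  <- data.frame(code = c("1", "2", "3"),
                    region = c("A", "B", "C"),
                    x = 1:3)
right <- data.frame(code = c("1", "2", "4"),
                    region = c("A", "B", "D"),
                    y = c(0.1, 0.2, 0.4))

# All rows of `left` are kept; `y` is NA where no match exists.
joined <- left_join(left, right, by = c("code", "region"))

# Rows of `left` whose key has no match in `right`.
unmatched <- anti_join(left, right, by = c("code", "region"))

joined
unmatched
```

An empty `unmatched` result would indicate that every region in the left table was found in the right table.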

Tidy & Manipulate Data II

In this step, we create a new column named ‘public_transport’, which is the sum of train_tram and bus.

data <- data %>% mutate(public_transport = train_tram + bus)

head(data)

## # A tibble: 6 x 23
##   LGA_2017 Region train_tram   bus   car motorbike_scoot~ bicycle other
##   <fct>    <chr>       <int> <int> <int>            <int>   <int> <int>
## 1 20110    Alpin~          8    34  3526               30      60    49
## 2 20260    Arara~         10     9  3272               13      32    28
## 3 20570    Balla~        544   370 33525               92     303   282
## 4 20660    Banyu~       6006   831 37204              237     640   365
## 5 20740    Bass ~         38    52  8749               34      93    77
## 6 20830    Baw B~        282    63 15327               67      43   146
## # ... with 15 more variables: walk <int>, total_one <int>,
## #   total_more_than_one <int>, worked_from_home <int>,
## #   employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## #   I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## #   I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>,
## #   public_transport <int>
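One point to keep in mind with a derived sum is that `+` propagates missing values: if either input is NA, the result is NA. The small sketch below uses made-up vectors to show this, together with an alternative that treats NA as zero. Whether treating NA as zero is appropriate depends on what the missing value means, so the alternative is shown only as an assumption, not as the approach taken in this report.

```r
# Toy counts with one missing value.
train <- c(8L, NA, 544L)
bus   <- c(34L, 5L, 370L)

# Element-wise sum: NA in either input gives NA in the result.
total_strict <- train + bus

# Alternative that ignores NA (i.e. treats it as zero).
total_lenient <- rowSums(cbind(train, bus), na.rm = TRUE)

total_strict
total_lenient
```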

Scan I

In this step, we scan the whole dataset for missing values. Only 5 rows contain missing values, so we decided to remove them. A further check for infinite and NaN values found no issues.

# checking missing data
colSums(is.na(data))

##                LGA_2017                  Region              train_tram 
##                       0                       0                       4 
##                     bus                     car       motorbike_scooter 
##                       0                       0                       2 
##                 bicycle                   other                    walk 
##                       0                       0                       0 
##               total_one     total_more_than_one        worked_from_home 
##                       0                       0                       0 
## employed_not_go_to_work              not_stated          total_employed 
##                       0                       0                       0 
##                  I1_499                I500_999              I1000_1999 
##                       0                       0                       0 
##              I2000_2999                   I3000             nil_earning 
##                       0                       0                       0 
##        negative_earning        public_transport 
##                       0                       4

# show missing data
data[!complete.cases(data),]

## # A tibble: 5 x 23
##   LGA_2017 Region train_tram   bus   car motorbike_scoot~ bicycle other
##   <fct>    <chr>       <int> <int> <int>            <int>   <int> <int>
## 1 22250    Ganna~         NA     5  2880               43      29    42
## 2 22980    Hindm~         NA     5  1468               NA       9    25
## 3 26670    Towon~         NA    10  1632               35       7    27
## 4 27630    Yarri~         NA    11  1624                9      23    48
## 5 29399    Uninc~          4    13    64               NA       3    59
## # ... with 15 more variables: walk <int>, total_one <int>,
## #   total_more_than_one <int>, worked_from_home <int>,
## #   employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## #   I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## #   I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>,
## #   public_transport <int>

# excluding missing data
data <- data[complete.cases(data),]

# checking special values including na again
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

colSums(sapply(data, is.special))

##                LGA_2017                  Region              train_tram 
##                       0                       0                       0 
##                     bus                     car       motorbike_scooter 
##                       0                       0                       0 
##                 bicycle                   other                    walk 
##                       0                       0                       0 
##               total_one     total_more_than_one        worked_from_home 
##                       0                       0                       0 
## employed_not_go_to_work              not_stated          total_employed 
##                       0                       0                       0 
##                  I1_499                I500_999              I1000_1999 
##                       0                       0                       0 
##              I2000_2999                   I3000             nil_earning 
##                       0                       0                       0 
##        negative_earning        public_transport 
##                       0                       0

# checking for NaN
colSums(sapply(select(data, -LGA_2017, -Region), is.nan))

##              train_tram                     bus                     car 
##                       0                       0                       0 
##       motorbike_scooter                 bicycle                   other 
##                       0                       0                       0 
##                    walk               total_one     total_more_than_one 
##                       0                       0                       0 
##        worked_from_home employed_not_go_to_work              not_stated 
##                       0                       0                       0 
##          total_employed                  I1_499                I500_999 
##                       0                       0                       0 
##              I1000_1999              I2000_2999                   I3000 
##                       0                       0                       0 
##             nil_earning        negative_earning        public_transport 
##                       0                       0                       0
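The three base R predicates used in this section overlap, which is why the helper above only needs `is.finite()` for numeric columns. A minimal illustration on a made-up vector:

```r
# One ordinary value followed by each kind of special value.
x <- c(1, NA, NaN, Inf, -Inf)

na_flag     <- is.na(x)       # TRUE for NA and for NaN
nan_flag    <- is.nan(x)      # TRUE only for NaN
finite_flag <- is.finite(x)   # TRUE only for ordinary numbers

rbind(na_flag, nan_flag, finite_flag)
```

Since `!is.finite(x)` is TRUE for NA, NaN, Inf and -Inf alike, it covers all special values in a numeric column. For non-numeric columns such as factors or character vectors, the helper falls back to `is.na()`.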

Scan II

In this step, we scan for outliers. The chi-square Q-Q plot suggests that there are 40 possible outliers. However, because Australian regional data is inherently unbalanced, extreme values are to be expected in the dataset. Therefore, no observations are excluded in this step.

summary(select(data, -LGA_2017, -Region))

##    train_tram         bus              car         motorbike_scooter
##  Min.   :    3   Min.   :   3.0   Min.   :   618   Min.   :  3.0    
##  1st Qu.:   12   1st Qu.:  31.5   1st Qu.:  5553   1st Qu.: 28.5    
##  Median :  233   Median :  78.0   Median : 14643   Median : 86.0    
##  Mean   : 2783   Mean   : 382.3   Mean   : 23911   Mean   :130.6    
##  3rd Qu.: 4734   3rd Qu.: 631.0   3rd Qu.: 37932   3rd Qu.:215.0    
##  Max.   :16541   Max.   :3937.0   Max.   :105512   Max.   :482.0    
##     bicycle           other            walk         total_one     
##  Min.   :   4.0   Min.   : 11.0   Min.   :   62   Min.   :   732  
##  1st Qu.:  40.0   1st Qu.: 71.5   1st Qu.:  382   1st Qu.:  6168  
##  Median : 126.0   Median :175.0   Median :  660   Median : 16772  
##  Mean   : 451.4   Mean   :252.2   Mean   : 1154   Mean   : 29293  
##  3rd Qu.: 288.5   3rd Qu.:424.5   3rd Qu.: 1060   3rd Qu.: 50625  
##  Max.   :4509.0   Max.   :846.0   Max.   :17074   Max.   :115751  
##  total_more_than_one worked_from_home employed_not_go_to_work
##  Min.   :  11        Min.   : 109.0   Min.   :  155          
##  1st Qu.:  63        1st Qu.: 610.5   1st Qu.:  949          
##  Median : 346        Median :1308.0   Median : 2357          
##  Mean   :1605        Mean   :1670.8   Mean   : 3369          
##  3rd Qu.:3040        3rd Qu.:2486.5   3rd Qu.: 5664          
##  Max.   :7172        Max.   :5325.0   Max.   :11648          
##    not_stated     total_employed       I1_499         I500_999    
##  Min.   :  11.0   Min.   :  1046   Min.   :16.80   Min.   :16.30  
##  1st Qu.: 123.0   1st Qu.:  8020   1st Qu.:27.55   1st Qu.:22.90  
##  Median : 266.0   Median : 21741   Median :30.40   Median :25.10  
##  Mean   : 350.6   Mean   : 36289   Mean   :30.41   Mean   :24.46  
##  3rd Qu.: 483.5   3rd Qu.: 60167   3rd Qu.:33.80   3rd Qu.:26.50  
##  Max.   :1288.0   Max.   :137913   Max.   :45.10   Max.   :30.40  
##    I1000_1999      I2000_2999        I3000         nil_earning    
##  Min.   :10.50   Min.   :1.100   Min.   : 0.600   Min.   : 4.900  
##  1st Qu.:17.50   1st Qu.:2.250   1st Qu.: 1.200   1st Qu.: 6.100  
##  Median :20.80   Median :3.200   Median : 1.600   Median : 7.100  
##  Mean   :20.40   Mean   :3.815   Mean   : 2.464   Mean   : 8.007  
##  3rd Qu.:23.35   3rd Qu.:5.300   3rd Qu.: 2.900   3rd Qu.: 8.950  
##  Max.   :29.90   Max.   :9.500   Max.   :11.800   Max.   :18.800  
##  negative_earning public_transport
##  Min.   :0.300    Min.   :    6   
##  1st Qu.:0.400    1st Qu.:   47   
##  Median :0.600    Median :  362   
##  Mean   :0.592    Mean   : 3165   
##  3rd Qu.:0.700    3rd Qu.: 5188   
##  Max.   :1.400    Max.   :17404

results <- mvn(data = select(data, -LGA_2017, -Region), multivariateOutlierMethod = "quan", showOutliers = TRUE)

## Warning in covMcd(data, alpha = alpha): The covariance matrix of the data is singular.
## There are 14 observations (in the entire dataset of 75 obs.) lying
## on the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip
## - m_p) = 0 with (m_1, ..., m_p) the mean of these observations and
## coefficients a_i from the vector a <- c(-0.3839362, -0.3839362, 0,
## 0, 0, 0, 0, 0.3048986, 0.3048986, 0.3048986, 0.3048986, 0.3048986,
## -0.3048986, -1e-07, 2e-07, -2e-07, 1e-06, -3e-07, -1e-07, 5e-06,
## 0.3839362)
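To make the idea behind distance-based outlier detection more concrete, the sketch below flags points whose squared Mahalanobis distance exceeds a chi-square quantile. It is a simplified illustration on synthetic data: it uses the classical mean and covariance from base R, not the robust estimates that the `covMcd` warning above indicates are computed inside `mvn`, and the 0.975 cutoff is an assumption chosen for the example. The last lines also show why a column that is an exact sum of other columns (as public_transport is of train_tram and bus) makes the covariance matrix singular.

```r
set.seed(1)

# Synthetic two-variable data with one planted extreme point.
x <- cbind(a = rnorm(50), b = rnorm(50))
x[1, ] <- c(8, -8)

# Squared Mahalanobis distance of each row from the sample centre.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality, d2 is approximately chi-square with
# ncol(x) degrees of freedom; 0.975 is an illustrative cutoff.
cutoff  <- qchisq(0.975, df = ncol(x))
flagged <- which(d2 > cutoff)
flagged

# Adding a column that is an exact sum of the others makes the
# covariance matrix rank-deficient (singular).
x_dep <- cbind(x, total = x[, "a"] + x[, "b"])
cov_rank <- qr(cov(x_dep))$rank
cov_rank   # smaller than ncol(x_dep)
```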

Transform

Income over 3000 is a suitable variable for illustrating data transformation, since its distribution is right-skewed. After applying and evaluating several transformation methods, we found that the Box-Cox transformation performed better than the others for this variable, reshaping the distribution into a nearly symmetric form. Its result is therefore kept for this report.

hist(data$I3000)

boxcox_x3 <- BoxCox(data$I3000, lambda = "auto")

hist(boxcox_x3)
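For reference, the Box-Cox transformation of a positive value x with parameter lambda is (x^lambda - 1) / lambda when lambda is not zero, and log(x) when lambda is zero. The sketch below implements that formula by hand in base R and applies it to synthetic right-skewed data. The helper names `box_cox` and `skew` are our own and hypothetical, and the fixed lambda values are chosen for illustration rather than estimated as `lambda = "auto"` does.

```r
# Hand-written Box-Cox transform for strictly positive data
# (hypothetical helper, not the forecast package's implementation).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based sample skewness (hypothetical helper).
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(42)
x <- rlnorm(200)   # synthetic right-skewed data

skew_raw <- skew(x)               # strongly positive
skew_log <- skew(box_cox(x, 0))   # much closer to zero

c(raw = skew_raw, transformed = skew_log)

# With lambda = 0.5 the transform is (sqrt(x) - 1) / 0.5.
box_cox(c(1, 4), 0.5)
```

Comparing skewness before and after the transformation is one simple numerical complement to inspecting the two histograms by eye.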

Data Preprocessing Assignment 3

Vinh Loi Chau - s3699871 | Binh Chon Nhut Le - s3256292

21 October 2018