library(readr)
library(dplyr)
library(tidyr)
library(MVN)
library(forecast)
In this report, we preprocess two datasets from the Australian Bureau of Statistics (ABS): personal income and method of travel to work. After several cleaning steps, including checking for missing and infinite values and transposing rows to columns, each dataset is identified by two columns, region name and region code, which together serve as the composite key for joining the two datasets*. We also remove redundant columns such as the year, since all of the data was collected in 2016, and relabel the measures to make them easier to read.
We then tried to identify outliers. However, given the nature of our data, we decided it would not be reasonable to remove them.
Finally, the variable for income over 3000 (I3000) was selected for transformation because its distribution is right-skewed. A Box-Cox transformation was applied, as it performed better for this attribute than the other methods we tried.
*Note: These datasets can only be joined after tidying has put them into a compatible format. Therefore, the merging step is carried out in the Tidy section.
After downloading the data as CSV files, we import them into R for further preprocessing and use the head function for a quick look at the data. Each dataset has 13 columns.
travel_method <- read_csv("travel_to_work.csv")
income <- read_csv("income.csv")
head(travel_method)
## # A tibble: 6 x 13
## MEASURE `Data item` REGIONTYPE `Geography Leve~ LGA_2017 Region FREQUENCY
## <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 2 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 3 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 4 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 5 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 6 WORK_T~ Used one m~ LGA2017 Local Governmen~ 20110 Alpin~ A
## # ... with 6 more variables: Frequency <chr>, TIME <int>, Time <int>,
## # Value <int>, `Flag Codes` <chr>, Flags <chr>
head(income)
## # A tibble: 6 x 13
## MEASURE `Data item` REGIONTYPE `Geography Leve~ LGA_2017 Region FREQUENCY
## <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 2 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 3 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 4 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 5 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## 6 PERSIN~ Persons ea~ LGA2017 Local Governmen~ 20110 Alpin~ A
## # ... with 6 more variables: Frequency <chr>, TIME <int>, Time <int>,
## # Value <dbl>, `Flag Codes` <chr>, Flags <chr>
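The import above lets readr guess each column type. As a minimal sketch, assuming we would rather state the types we depend on, the same import can be written with an explicit `col_types` specification. The temporary file and its two rows are hypothetical stand-ins for the ABS download, and only a few of the 13 columns are reproduced.

```r
library(readr)

# Hypothetical two-row extract with a few of the columns shown above
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "MEASURE,LGA_2017,Region,TIME,Value",
  "PERSINC_2,20110,Alpine (S),2016,34.8",
  "PERSINC_3,20110,Alpine (S),2016,27.8"
), csv_path)

# Declare the types we rely on instead of letting readr guess them
income_demo <- read_csv(
  csv_path,
  col_types = cols(
    MEASURE  = col_character(),
    LGA_2017 = col_integer(),
    Region   = col_character(),
    TIME     = col_integer(),
    Value    = col_double()
  )
)

str(income_demo$Value)
```

Declaring the types up front means a malformed value surfaces as a parsing problem at import time rather than as a silently different column type later on.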
Knowing which columns are present, we look more closely at the data type of each one; most are read as character or integer by default. In this step, we also convert MEASURE to a factor with readable labels for the different travel methods and income brackets.
str(travel_method)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1034 obs. of 13 variables:
## $ MEASURE : chr "WORK_TRAV_3" "WORK_TRAV_4" "WORK_TRAV_5" "WORK_TRAV_6" ...
## $ Data item : chr "Used one method - Train or tram (no.)" "Used one method - Bus (no.)" "Used one method - Car (as driver or passenger) (no.)" "Used one method - Motor bike/scooter (no.)" ...
## $ REGIONTYPE : chr "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
## $ Geography Level: chr "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
## $ LGA_2017 : int 20110 20110 20110 20110 20110 20110 20110 20110 20110 20110 ...
## $ Region : chr "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
## $ FREQUENCY : chr "A" "A" "A" "A" ...
## $ Frequency : chr "Annual" "Annual" "Annual" "Annual" ...
## $ TIME : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Time : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Value : int 8 34 3526 30 60 49 375 4132 50 490 ...
## $ Flag Codes : chr NA NA NA NA ...
## $ Flags : chr NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ MEASURE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Data item : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ REGIONTYPE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Geography Level: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LGA_2017 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Region : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ FREQUENCY : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Frequency : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TIME : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Time : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Value : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Flag Codes : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Flags : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
travel_method$MEASURE <- travel_method$MEASURE %>% factor(
  levels = c("WORK_TRAV_3", "WORK_TRAV_4", "WORK_TRAV_5", "WORK_TRAV_6", "WORK_TRAV_7",
             "WORK_TRAV_8", "WORK_TRAV_9", "WORK_TRAV_10", "WORK_TRAV_12", "WORK_TRAV_14",
             "WORK_TRAV_15", "WORK_TRAV_16", "WORK_TRAV_17"),
  labels = c("train_tram", "bus", "car", "motorbike_scooter", "bicycle",
             "other", "walk", "total_one", "total_more_than_one", "worked_from_home",
             "employed_not_go_to_work", "not_stated", "total_employed"))
levels(travel_method$MEASURE)
## [1] "train_tram" "bus"
## [3] "car" "motorbike_scooter"
## [5] "bicycle" "other"
## [7] "walk" "total_one"
## [9] "total_more_than_one" "worked_from_home"
## [11] "employed_not_go_to_work" "not_stated"
## [13] "total_employed"
income$MEASURE <- income$MEASURE %>% factor(
  levels = c("PERSINC_2", "PERSINC_3", "PERSINC_4", "PERSINC_5", "PERSINC_6", "PERSINC_7", "PERSINC_8"),
  labels = c("I1_499", "I500_999", "I1000_1999", "I2000_2999", "I3000", "nil_earning", "negative_earning")
)
levels(income$MEASURE)
## [1] "I1_499" "I500_999" "I1000_1999"
## [4] "I2000_2999" "I3000" "nil_earning"
## [7] "negative_earning"
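One thing to keep in mind with this kind of relabelling is that `factor()` turns any code that is not listed in `levels` into `NA` without a warning. The sketch below illustrates that behaviour and one possible way to check for it. The code `"PERSINC_99"` is invented for the example and does not appear in the ABS data.

```r
# Hypothetical codes; "PERSINC_99" is deliberately not listed in `levels`
codes <- c("PERSINC_2", "PERSINC_6", "PERSINC_99")

recoded <- factor(
  codes,
  levels = c("PERSINC_2", "PERSINC_6"),
  labels = c("I1_499", "I3000")
)

# Codes that were present before recoding but are NA afterwards
unmatched <- unique(codes[is.na(recoded) & !is.na(codes)])
unmatched
```

An empty `unmatched` vector would indicate that every code in the column was covered by the `levels` argument.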
# convert LGA codes to factor
travel_method$LGA_2017 <- as.character(travel_method$LGA_2017) %>% factor()
str(travel_method)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1034 obs. of 13 variables:
## $ MEASURE : Factor w/ 13 levels "train_tram","bus",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Data item : chr "Used one method - Train or tram (no.)" "Used one method - Bus (no.)" "Used one method - Car (as driver or passenger) (no.)" "Used one method - Motor bike/scooter (no.)" ...
## $ REGIONTYPE : chr "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
## $ Geography Level: chr "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
## $ LGA_2017 : Factor w/ 80 levels "20110","20260",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Region : chr "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
## $ FREQUENCY : chr "A" "A" "A" "A" ...
## $ Frequency : chr "Annual" "Annual" "Annual" "Annual" ...
## $ TIME : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Time : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Value : int 8 34 3526 30 60 49 375 4132 50 490 ...
## $ Flag Codes : chr NA NA NA NA ...
## $ Flags : chr NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ MEASURE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Data item : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ REGIONTYPE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Geography Level: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LGA_2017 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Region : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ FREQUENCY : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Frequency : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TIME : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Time : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Value : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Flag Codes : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Flags : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# convert LGA codes to factor
income$LGA_2017 <- as.character(income$LGA_2017) %>% factor()
str(income)
## Classes 'tbl_df', 'tbl' and 'data.frame': 560 obs. of 13 variables:
## $ MEASURE : Factor w/ 7 levels "I1_499","I500_999",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Data item : chr "Persons earning $1-$499 per week (%)" "Persons earning $500-$999 per week (%)" "Persons earning $1000-$1999 per week (%)" "Persons earning $2000-$2999 per week (%)" ...
## $ REGIONTYPE : chr "LGA2017" "LGA2017" "LGA2017" "LGA2017" ...
## $ Geography Level: chr "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" "Local Government Areas (2017)" ...
## $ LGA_2017 : Factor w/ 80 levels "20110","20260",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Region : chr "Alpine (S)" "Alpine (S)" "Alpine (S)" "Alpine (S)" ...
## $ FREQUENCY : chr "A" "A" "A" "A" ...
## $ Frequency : chr "Annual" "Annual" "Annual" "Annual" ...
## $ TIME : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Time : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Value : num 34.8 27.8 17.5 2.3 1.2 5.1 0.5 32.7 26.5 15.8 ...
## $ Flag Codes : chr NA NA NA NA ...
## $ Flags : chr NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ MEASURE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Data item : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ REGIONTYPE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Geography Level: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LGA_2017 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Region : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ FREQUENCY : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Frequency : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TIME : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Time : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Value : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Flag Codes : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Flags : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# dropping unnecessary columns
travel_method <- travel_method %>% select(MEASURE, LGA_2017, Region, Value)
income <- income %>% select(MEASURE, LGA_2017, Region, Value)
head(travel_method)
## # A tibble: 6 x 4
## MEASURE LGA_2017 Region Value
## <fct> <fct> <chr> <int>
## 1 train_tram 20110 Alpine (S) 8
## 2 bus 20110 Alpine (S) 34
## 3 car 20110 Alpine (S) 3526
## 4 motorbike_scooter 20110 Alpine (S) 30
## 5 bicycle 20110 Alpine (S) 60
## 6 other 20110 Alpine (S) 49
head(income)
## # A tibble: 6 x 4
## MEASURE LGA_2017 Region Value
## <fct> <fct> <chr> <dbl>
## 1 I1_499 20110 Alpine (S) 34.8
## 2 I500_999 20110 Alpine (S) 27.8
## 3 I1000_1999 20110 Alpine (S) 17.5
## 4 I2000_2999 20110 Alpine (S) 2.3
## 5 I3000 20110 Alpine (S) 1.2
## 6 nil_earning 20110 Alpine (S) 5.1
In this step, we transpose rows to columns so that each row is uniquely identified by region name and code.
# Tidying data
travel_method <- travel_method %>% spread(MEASURE, Value)
income <- income %>% spread(MEASURE, Value)
head(travel_method)
## # A tibble: 6 x 15
## LGA_2017 Region train_tram bus car motorbike_scoot~ bicycle other
## <fct> <chr> <int> <int> <int> <int> <int> <int>
## 1 20110 Alpin~ 8 34 3526 30 60 49
## 2 20260 Arara~ 10 9 3272 13 32 28
## 3 20570 Balla~ 544 370 33525 92 303 282
## 4 20660 Banyu~ 6006 831 37204 237 640 365
## 5 20740 Bass ~ 38 52 8749 34 93 77
## 6 20830 Baw B~ 282 63 15327 67 43 146
## # ... with 7 more variables: walk <int>, total_one <int>,
## # total_more_than_one <int>, worked_from_home <int>,
## # employed_not_go_to_work <int>, not_stated <int>, total_employed <int>
head(income)
## # A tibble: 6 x 9
## LGA_2017 Region I1_499 I500_999 I1000_1999 I2000_2999 I3000 nil_earning
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20110 Alpin~ 34.8 27.8 17.5 2.3 1.2 5.1
## 2 20260 Arara~ 32.7 26.5 15.8 2 1.1 4.9
## 3 20570 Balla~ 33.1 25.8 20.2 3.1 1.8 6.8
## 4 20660 Banyu~ 26.4 22.1 25.2 6.2 3.8 8.8
## 5 20740 Bass ~ 38 26.7 15.1 2.1 1.4 5.7
## 6 20830 Baw B~ 32.2 25.9 19.3 3.3 1.8 7.5
## # ... with 1 more variable: negative_earning <dbl>
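For readers on tidyr 1.0.0 or later, the same reshaping can be sketched with `pivot_wider()`, which supersedes `spread()`. The toy data below is hypothetical. It also shows that a region/measure combination with no row in the long table becomes `NA` in the wide table, which may be relevant here because `str(travel_method)` reported 1034 rows rather than the 1040 that 80 regions and 13 measures would give.

```r
library(tidyr)

# Hypothetical long-format extract; region "B" has no train_tram row
long <- data.frame(
  LGA_2017 = c("1", "1", "2"),
  Region   = c("A", "A", "B"),
  MEASURE  = c("train_tram", "bus", "bus"),
  Value    = c(8L, 34L, 5L)
)

# One row per region, one column per measure
wide <- pivot_wider(long, names_from = MEASURE, values_from = Value)
wide
```

In the result, region "B" has `NA` under `train_tram` because that combination was absent from the long table, not because a value was recorded as missing.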
# merging data
data <- travel_method %>% left_join(income, by = c("LGA_2017", "Region"))
head(data)
## # A tibble: 6 x 22
## LGA_2017 Region train_tram bus car motorbike_scoot~ bicycle other
## <fct> <chr> <int> <int> <int> <int> <int> <int>
## 1 20110 Alpin~ 8 34 3526 30 60 49
## 2 20260 Arara~ 10 9 3272 13 32 28
## 3 20570 Balla~ 544 370 33525 92 303 282
## 4 20660 Banyu~ 6006 831 37204 237 640 365
## 5 20740 Bass ~ 38 52 8749 34 93 77
## 6 20830 Baw B~ 282 63 15327 67 43 146
## # ... with 14 more variables: walk <int>, total_one <int>,
## # total_more_than_one <int>, worked_from_home <int>,
## # employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## # I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## # I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>
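A left join keeps every row of the left table, so regions without a partner in the right table pass through with `NA` in the joined columns. As a hedged sketch, `anti_join()` can list such regions before joining. The two small data frames are hypothetical and only mimic the key columns used above.

```r
library(dplyr)

# Hypothetical tidy tables sharing the composite key (LGA_2017, Region)
travel_demo <- data.frame(
  LGA_2017 = c("1", "2", "3"),
  Region   = c("A", "B", "C"),
  car      = c(10L, 20L, 30L)
)
income_demo <- data.frame(
  LGA_2017 = c("1", "2"),
  Region   = c("A", "B"),
  I3000    = c(1.2, 1.1)
)

# Rows of travel_demo with no matching key in income_demo
unmatched <- anti_join(travel_demo, income_demo, by = c("LGA_2017", "Region"))

# The left join keeps region "C" and fills I3000 with NA
joined <- left_join(travel_demo, income_demo, by = c("LGA_2017", "Region"))
```

If `unmatched` has zero rows, every region in the left table found a partner and the join itself introduces no missing values.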
In this step, we create a new column named ‘public_transport’, which is the sum of the train_tram and bus counts.
data <- data %>% mutate(public_transport = train_tram + bus)
head(data)
## # A tibble: 6 x 23
## LGA_2017 Region train_tram bus car motorbike_scoot~ bicycle other
## <fct> <chr> <int> <int> <int> <int> <int> <int>
## 1 20110 Alpin~ 8 34 3526 30 60 49
## 2 20260 Arara~ 10 9 3272 13 32 28
## 3 20570 Balla~ 544 370 33525 92 303 282
## 4 20660 Banyu~ 6006 831 37204 237 640 365
## 5 20740 Bass ~ 38 52 8749 34 93 77
## 6 20830 Baw B~ 282 63 15327 67 43 146
## # ... with 15 more variables: walk <int>, total_one <int>,
## # total_more_than_one <int>, worked_from_home <int>,
## # employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## # I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## # I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>,
## # public_transport <int>
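Note that `+` propagates missing values, so `public_transport` is `NA` whenever either input is `NA`. The sketch below contrasts this with `rowSums(..., na.rm = TRUE)`, which treats a missing count as zero. Whether a missing count should be read as zero is an assumption that depends on the data source, and the values used here are hypothetical.

```r
# Hypothetical counts; the second region has no train_tram value
demo <- data.frame(train_tram = c(8L, NA), bus = c(34L, 5L))

# `+` returns NA if either operand is NA
demo$sum_plus <- demo$train_tram + demo$bus

# rowSums() with na.rm = TRUE skips NA, i.e. treats it as zero
demo$sum_narm <- rowSums(demo[, c("train_tram", "bus")], na.rm = TRUE)

demo
```

The first approach keeps the uncertainty visible, while the second silently fills it in, so the choice is best made deliberately.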
In this step, we scan the whole dataset for missing values. Only 5 rows have missing values, so we decided to remove them. A further check for infinite and NaN values found no issues.
# checking missing data
colSums(is.na(data))
## LGA_2017 Region train_tram
## 0 0 4
## bus car motorbike_scooter
## 0 0 2
## bicycle other walk
## 0 0 0
## total_one total_more_than_one worked_from_home
## 0 0 0
## employed_not_go_to_work not_stated total_employed
## 0 0 0
## I1_499 I500_999 I1000_1999
## 0 0 0
## I2000_2999 I3000 nil_earning
## 0 0 0
## negative_earning public_transport
## 0 4
# show missing data
data[!complete.cases(data),]
## # A tibble: 5 x 23
## LGA_2017 Region train_tram bus car motorbike_scoot~ bicycle other
## <fct> <chr> <int> <int> <int> <int> <int> <int>
## 1 22250 Ganna~ NA 5 2880 43 29 42
## 2 22980 Hindm~ NA 5 1468 NA 9 25
## 3 26670 Towon~ NA 10 1632 35 7 27
## 4 27630 Yarri~ NA 11 1624 9 23 48
## 5 29399 Uninc~ 4 13 64 NA 3 59
## # ... with 15 more variables: walk <int>, total_one <int>,
## # total_more_than_one <int>, worked_from_home <int>,
## # employed_not_go_to_work <int>, not_stated <int>, total_employed <int>,
## # I1_499 <dbl>, I500_999 <dbl>, I1000_1999 <dbl>, I2000_2999 <dbl>,
## # I3000 <dbl>, nil_earning <dbl>, negative_earning <dbl>,
## # public_transport <int>
# excluding missing data
data <- data[complete.cases(data),]
# checking special values including na again
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
colSums(sapply(data, is.special))
## LGA_2017 Region train_tram
## 0 0 0
## bus car motorbike_scooter
## 0 0 0
## bicycle other walk
## 0 0 0
## total_one total_more_than_one worked_from_home
## 0 0 0
## employed_not_go_to_work not_stated total_employed
## 0 0 0
## I1_499 I500_999 I1000_1999
## 0 0 0
## I2000_2999 I3000 nil_earning
## 0 0 0
## negative_earning public_transport
## 0 0
# checking for NaN
colSums(sapply(select(data, -LGA_2017, -Region), is.nan))
## train_tram bus car
## 0 0 0
## motorbike_scooter bicycle other
## 0 0 0
## walk total_one total_more_than_one
## 0 0 0
## worked_from_home employed_not_go_to_work not_stated
## 0 0 0
## total_employed I1_499 I500_999
## 0 0 0
## I1000_1999 I2000_2999 I3000
## 0 0 0
## nil_earning negative_earning public_transport
## 0 0 0
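To make the behaviour of `is.special` concrete, here is a small sketch on hypothetical data containing `NA`, `Inf` and `NaN`. For numeric columns `is.finite()` is `FALSE` for all three, so the separate `is.nan` check acts as a confirmation rather than a different test.

```r
# Same helper as in the report: non-finite for numbers, NA for everything else
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# Hypothetical data with one missing name, one Inf and one NaN
demo <- data.frame(
  region = c("A", NA, "C"),
  value  = c(1, Inf, NaN),
  stringsAsFactors = FALSE
)

# Number of special values per column
special_counts <- colSums(sapply(demo, is.special))
special_counts
```

Here `region` contributes one special value (the `NA`) and `value` contributes two (`Inf` and `NaN`).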
In this step, we scan for outliers. The chi-square Q-Q plot suggests that there are 40 possible outliers. However, given the unbalanced nature of Australian regional data, extreme values are to be expected in the dataset. Therefore, no observations are excluded in this step.
summary(select(data, -LGA_2017, -Region))
## train_tram bus car motorbike_scooter
## Min. : 3 Min. : 3.0 Min. : 618 Min. : 3.0
## 1st Qu.: 12 1st Qu.: 31.5 1st Qu.: 5553 1st Qu.: 28.5
## Median : 233 Median : 78.0 Median : 14643 Median : 86.0
## Mean : 2783 Mean : 382.3 Mean : 23911 Mean :130.6
## 3rd Qu.: 4734 3rd Qu.: 631.0 3rd Qu.: 37932 3rd Qu.:215.0
## Max. :16541 Max. :3937.0 Max. :105512 Max. :482.0
## bicycle other walk total_one
## Min. : 4.0 Min. : 11.0 Min. : 62 Min. : 732
## 1st Qu.: 40.0 1st Qu.: 71.5 1st Qu.: 382 1st Qu.: 6168
## Median : 126.0 Median :175.0 Median : 660 Median : 16772
## Mean : 451.4 Mean :252.2 Mean : 1154 Mean : 29293
## 3rd Qu.: 288.5 3rd Qu.:424.5 3rd Qu.: 1060 3rd Qu.: 50625
## Max. :4509.0 Max. :846.0 Max. :17074 Max. :115751
## total_more_than_one worked_from_home employed_not_go_to_work
## Min. : 11 Min. : 109.0 Min. : 155
## 1st Qu.: 63 1st Qu.: 610.5 1st Qu.: 949
## Median : 346 Median :1308.0 Median : 2357
## Mean :1605 Mean :1670.8 Mean : 3369
## 3rd Qu.:3040 3rd Qu.:2486.5 3rd Qu.: 5664
## Max. :7172 Max. :5325.0 Max. :11648
## not_stated total_employed I1_499 I500_999
## Min. : 11.0 Min. : 1046 Min. :16.80 Min. :16.30
## 1st Qu.: 123.0 1st Qu.: 8020 1st Qu.:27.55 1st Qu.:22.90
## Median : 266.0 Median : 21741 Median :30.40 Median :25.10
## Mean : 350.6 Mean : 36289 Mean :30.41 Mean :24.46
## 3rd Qu.: 483.5 3rd Qu.: 60167 3rd Qu.:33.80 3rd Qu.:26.50
## Max. :1288.0 Max. :137913 Max. :45.10 Max. :30.40
## I1000_1999 I2000_2999 I3000 nil_earning
## Min. :10.50 Min. :1.100 Min. : 0.600 Min. : 4.900
## 1st Qu.:17.50 1st Qu.:2.250 1st Qu.: 1.200 1st Qu.: 6.100
## Median :20.80 Median :3.200 Median : 1.600 Median : 7.100
## Mean :20.40 Mean :3.815 Mean : 2.464 Mean : 8.007
## 3rd Qu.:23.35 3rd Qu.:5.300 3rd Qu.: 2.900 3rd Qu.: 8.950
## Max. :29.90 Max. :9.500 Max. :11.800 Max. :18.800
## negative_earning public_transport
## Min. :0.300 Min. : 6
## 1st Qu.:0.400 1st Qu.: 47
## Median :0.600 Median : 362
## Mean :0.592 Mean : 3165
## 3rd Qu.:0.700 3rd Qu.: 5188
## Max. :1.400 Max. :17404
results <- mvn(data = select(data, -LGA_2017, -Region), multivariateOutlierMethod = "quan", showOutliers = TRUE)
## Warning in covMcd(data, alpha = alpha): The covariance matrix of the data is singular.
## There are 14 observations (in the entire dataset of 75 obs.) lying
## on the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip
## - m_p) = 0 with (m_1, ..., m_p) the mean of these observations and
## coefficients a_i from the vector a <- c(-0.3839362, -0.3839362, 0,
## 0, 0, 0, 0, 0.3048986, 0.3048986, 0.3048986, 0.3048986, 0.3048986,
## -0.3048986, -1e-07, 2e-07, -2e-07, 1e-06, -3e-07, -1e-07, 5e-06,
## 0.3839362)
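The singular-covariance warning is what one would expect when some columns are exact linear combinations of others. The coefficients in the warning appear consistent with `public_transport` being the sum of `train_tram` and `bus`, and with the total columns summing to `total_employed`. The sketch below illustrates the idea with the built-in `mtcars` data as a stand-in, and uses the classical Mahalanobis distance from base R rather than the robust estimator that `mvn()` applies, so it is an illustration of the principle and not a reproduction of the result above.

```r
# Built-in mtcars stands in for the report's data (an assumption for illustration)
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Adding an exact sum of two columns makes the covariance matrix rank deficient
x_dep <- transform(x, disp_hp = disp + hp)
rank_dep <- qr(cov(x_dep))$rank
rank_dep < ncol(x_dep)

# Squared classical Mahalanobis distances on the linearly independent columns
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows beyond the 97.5% chi-square quantile (a common, but arbitrary, cutoff)
cutoff <- qchisq(0.975, df = ncol(x))
flagged <- rownames(x)[d2 > cutoff]
flagged
```

Under this reading, dropping derived columns such as totals and sums before the multivariate check would be one way to avoid the singularity, although the flagged observations may then differ.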
Income over 3000 is a good attribute for illustrating data transformation because its distribution is right-skewed. After applying and evaluating several transformation methods, we found that the Box-Cox transformation performed better than the others for this variable, reshaping the distribution into a nearly symmetric form. Therefore, its result is kept for this report.
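# distribution before transformation
hist(data$I3000)
# Box-Cox transformation with an automatically selected lambda
boxcox_x3 <- BoxCox(data$I3000, lambda = "auto")
# distribution after transformation
hist(boxcox_x3)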
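For reference, the standard one-parameter Box-Cox transform is (x^lambda - 1) / lambda for a non-zero lambda and log(x) when lambda is zero. The sketch below implements it in base R so that the effect of lambda can be seen directly. The helper `box_cox` and the sample values are hypothetical, and the function is not the forecast package's implementation, which also chooses lambda automatically.

```r
# Standard one-parameter Box-Cox transform for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical right-skewed values within the range reported for I3000
x <- c(0.6, 1.2, 1.6, 2.9, 11.8)

box_cox(x, lambda = 0)   # identical to log(x)
box_cox(x, lambda = 1)   # x - 1, so the shape is unchanged
box_cox(x, lambda = 0.5) # intermediate: compresses the right tail less than log
```

Values of lambda below 1 pull in the right tail, which is why the transform can make a right-skewed variable look more symmetric.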