library(ggplot2)
library(readxl)
library(tidyr)
library(dplyr)
library(outliers)
library(forecast)
In this work, the authors have sourced 4 datasets from www.data.vic.gov.au “Towns-in-Time” series of datasets compiled by the Department of planning. Each of the datasets features, population demographics data, dwelling numbers data, summary data and town land area data.
These 4 datasets present a number of pre-processing difficulties in that; first the datasets containing variables in both horizontal and spread format.
The second difficulty is that observations are located disparately in 4 different locations in the spreadsheet.
The third difficulty is that each dataset and it’s corresponding data frames require additional preparative work prior to processing.
The last issue is that the resulting components of 4 excel entities need to be combined prior to processing.
This project demonstrates that each dataset needs to be read into 16 different “R” Data Frames prior to consolidation and transformation before production of a single useful dataset useful for production of statistics for analysis . The data was sourced on 22nd September 2018 (prior to a website relaunch) from:
https://www.data.vic.gov.au/data/dataset/towns-in-time-traralgon-south
https://www.data.vic.gov.au/data/dataset/towns-in-time-tallarook
https://www.data.vic.gov.au/data/dataset/towns-in-time-skipton
https://www.data.vic.gov.au/data/dataset/towns-in-time-smythesdale
Additionally the data has been assessed in an effort to deal with issues associated with its completeness and appropriateness with tests for missing data and the presence of outliers.
The final outcome is a small dataframe containing population data of 4 small Australian towns in regional Victoria for the years 2006 and 2011, with a new variable for population density.
We have selected the datafiles for four seperate towns within Victoria - Smythesdale, Tallarook, Skipton, and Traralgon-South. Each raw datafile has a sections on the area of the local government area, summary, age structure, and dwelling data. Data is based on data collected on censuse nights from 1981 trough to 2011.
To create the datframe for our assignment, we are required to tidy and manipluate the individual data frames prior to the final join.
Each raw datafile for each town has 4 sections of data to read in seperatly. We read in the local government area data for 2006 and 2011, the summary data, the age structure, and dwellings data for each town seperatly.
Before we can construct the final dataframe, the data requires some inital manipulation to put the data in tidy format.
Age1 <- read_xls('Smythesdale.xls',sheet="Data", skip=16,n_max=6)
str(Age1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 8 variables:
## $ Age structure: chr "0-17" "18-24" "25-44" "45-64" ...
## $ 1981 : chr "-" "-" "-" "-" ...
## $ 1986 : num 63.7 18.9 47.7 35.8 15.9 ...
## $ 1991 : num 76.1 18.6 91.1 57.8 12.8 ...
## $ 1996 : num 95 15.7 85.7 51.9 14.1 ...
## $ 2001 : num 100.8 22.1 95 49.9 28.8 ...
## $ 2006 : num 75 20 67 79 27 11
## $ 2011 : num 108 33 122 150 46 20
dim(Age1)
## [1] 6 8
head(Age1)
## # A tibble: 6 x 8
## `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0-17 - 63.7 76.1 95.0 101. 75 108
## 2 18-24 - 18.9 18.6 15.7 22.1 20 33
## 3 25-44 - 47.7 91.1 85.7 95.0 67 122
## 4 45-64 - 35.8 57.8 51.9 49.9 79 150
## 5 65-74 - 15.9 12.8 14.1 28.8 27 46
## 6 75+ - 1.99 0 16.5 15.4 11 20
area1 <- read_xls('Smythesdale.xls',sheet="Data", skip=4,n_max=3)
str(area1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 7 variables:
## $ Golden Plains (S) LGA: chr "Area 2006: 1.4 km²" "Area 2011: 5.92 km²"
## $ X__1 : logi NA NA
## $ X__2 : chr "Location: 16km SW of Ballarat" NA
## $ X__3 : logi NA NA
## $ X__4 : logi NA NA
## $ X__5 : chr "Defined in 2011 as:" NA
## $ X__6 : chr "UCL222098" NA
dim(area1)
## [1] 2 7
head(area1)
## # A tibble: 2 x 7
## `Golden Plains (S)~ X__1 X__2 X__3 X__4 X__5 X__6
## <chr> <lgl> <chr> <lgl> <lgl> <chr> <chr>
## 1 Area 2006: 1.4 km² NA Location: 16km~ NA NA Defined in~ UCL22~
## 2 Area 2011: 5.92 km² NA <NA> NA NA <NA> <NA>
Dwell1 <- read_xls('Smythesdale.xls',sheet="Data", skip=25,n_max=4)
str(Dwell1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of 8 variables:
## $ Dwellings: chr "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
## $ 1981 : chr "-" "-" NA "-"
## $ 1986 : num 59 7 NA 62.9
## $ 1991 : num 82.15 8.95 NA 86.51
## $ 1996 : num 94.1 7.83 NA 91.81
## $ 2001 : num 112 11 NA 114
## $ 2006 : num 107 17 NA 118
## $ 2011 : num 191 23 NA 200
dim(Dwell1)
## [1] 4 8
head(Dwell1)
## # A tibble: 4 x 8
## Dwellings `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Occupied private dwell~ - 59 82.2 94.1 112 107 191
## 2 Unoccupied private dwe~ - 7 8.95 7.83 11 17 23
## 3 <NA> <NA> NA NA NA NA NA NA
## 4 Separate house - 62.9 86.5 91.8 114. 118 200
sum1 <- read_xls('Smythesdale.xls',sheet="Data", skip=8,n_max=5)
str(sum1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 8 variables:
## $ Summary data: chr "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
## $ 1981 : chr "-" "-" "-" "-" ...
## $ 1986 : num 184 119 11 0 84
## $ 1991 : num 256.648 111.085 0.531 0 146.411
## $ 1996 : num 279.02 104.98 1.24 0 161.37
## $ 2001 : num 312 100 0 0 169
## $ 2006 : num 279 105 8 0 152
## $ 2011 : num 479 106 8 0 293
dim(sum1)
## [1] 5 8
head(sum1)
## # A tibble: 5 x 8
## `Summary data` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Total population (enu~ - 184 257. 279. 312 279 479
## 2 Male: Female ratio* - 119. 111. 105. 100 105. 106.
## 3 Visitors on Census ni~ - 11 0.531 1.24 0 8 8
## 4 Population in non-pri~ - 0 0 0 0 0 0
## 5 Population living sam~ - 84 146. 161. 169 152 293
Age2 <- read_xls('Tallarook.xls',sheet="Data", skip=16,n_max=6)
str(Age2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 8 variables:
## $ Age structure: chr "0-17" "18-24" "25-44" "45-64" ...
## $ 1981 : num 64.1 26.62 58.18 50.29 9.86 ...
## $ 1986 : num 68.3 30.1 76.3 33.1 16.1 ...
## $ 1991 : num 66.5 20.8 92.3 44.6 18.8 ...
## $ 1996 : num 74.39 8.93 81.33 58.52 5.95 ...
## $ 2001 : num 77.8 21.1 84.5 59.5 33.6 ...
## $ 2006 : num 61 23 51 84 24 11
## $ 2011 : num 56 12 44 60 14 18
dim(Age2)
## [1] 6 8
head(Age2)
## # A tibble: 6 x 8
## `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0-17 64.1 68.3 66.5 74.4 77.8 61 56
## 2 18-24 26.6 30.1 20.8 8.93 21.1 23 12
## 3 25-44 58.2 76.3 92.3 81.3 84.5 51 44
## 4 45-64 50.3 33.1 44.6 58.5 59.5 84 60
## 5 65-74 9.86 16.1 18.8 5.95 33.6 24 14
## 6 75+ 3.94 6.03 6.94 14.9 14.4 11 18
area2 <- read_xls('Tallarook.xls',sheet="Data", skip=4,n_max=3)
str(area2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 7 variables:
## $ Mitchell (S) LGA: chr "Area 2006: 3.64 km²" "Area 2011: 1.41 km²"
## $ X__1 : logi NA NA
## $ X__2 : chr "Location: 78km north of Melbourne" NA
## $ X__3 : logi NA NA
## $ X__4 : logi NA NA
## $ X__5 : chr "Defined in 2011 as:" NA
## $ X__6 : chr "UCL222105" NA
dim(area2)
## [1] 2 7
head(area2)
## # A tibble: 2 x 7
## `Mitchell (S) LGA` X__1 X__2 X__3 X__4 X__5 X__6
## <chr> <lgl> <chr> <lgl> <lgl> <chr> <chr>
## 1 Area 2006: 3.64 k~ NA Location: 78km n~ NA NA Defined i~ UCL22~
## 2 Area 2011: 1.41 k~ NA <NA> NA NA <NA> <NA>
Dwell2 <- read_xls('Tallarook.xls',sheet="Data", skip=25,n_max=4)
str(Dwell2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of 8 variables:
## $ Dwellings: chr "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
## $ 1981 : num 69 9 NA 75
## $ 1986 : num 77 8 NA 82
## $ 1991 : num 90 10 NA 94
## $ 1996 : num 92 6 NA 95
## $ 2001 : num 108 11 NA 119
## $ 2006 : num 100 19 NA 115
## $ 2011 : num 83 15 NA 95
dim(Dwell2)
## [1] 4 8
head(Dwell2)
## # A tibble: 4 x 8
## Dwellings `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Occupied private dwell~ 69 77 90 92 108 100 83
## 2 Unoccupied private dwe~ 9 8 10 6 11 19 15
## 3 <NA> NA NA NA NA NA NA NA
## 4 Separate house 75.0 82 94 95 119 115 95
sum2 <- read_xls('Tallarook.xls',sheet="Data", skip=8,n_max=5)
str(sum2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 8 variables:
## $ Summary data: chr "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
## $ 1981 : num 213 120 7 10 115
## $ 1986 : num 230 113 7 0 125
## $ 1991 : num 250 89.4 9 0 113
## $ 1996 : num 244 102 12 0 145
## $ 2001 : num 291 102 6 0 196
## $ 2006 : num 254 88.1 3 0 172
## $ 2011 : num 204 94.3 3 0 116
dim(sum2)
## [1] 5 8
head(sum2)
## # A tibble: 5 x 8
## `Summary data` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Total population (enum~ 213 230 250 244 291 254 204
## 2 Male: Female ratio* 120. 113. 89.4 102. 102. 88.1 94.3
## 3 Visitors on Census nig~ 7 7 9 12 6 3 3
## 4 Population in non-priv~ 10 0 0 0 0 0 0
## 5 Population living same~ 115 125 113 145 196 172 116
Age3 <- read_xls('Skipton.xls',sheet="Data", skip=16,n_max=6)
str(Age3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 8 variables:
## $ Age structure: chr "0-17" "18-24" "25-44" "45-64" ...
## $ 1981 : num 150.4 49.8 112.6 118.5 53.8 ...
## $ 1986 : num 146.4 35.2 129.6 126.9 42.9 ...
## $ 1991 : num 115.5 29.8 123.7 103.1 55.1 ...
## $ 1996 : num 104.5 22.5 99.7 118.6 58.7 ...
## $ 2001 : num 106.3 10.8 109.5 125 54.2 ...
## $ 2006 : num 97.4 17 87.3 135.6 68.3 ...
## $ 2011 : num 77 25 67 134 47 90
dim(Age3)
## [1] 6 8
head(Age3)
## # A tibble: 6 x 8
## `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0-17 150. 146. 116. 104. 106. 97.4 77
## 2 18-24 49.8 35.2 29.8 22.5 10.8 17.0 25
## 3 25-44 113. 130. 124. 99.7 110. 87.3 67
## 4 45-64 119. 127. 103. 119. 125. 136. 134
## 5 65-74 53.8 42.9 55.1 58.7 54.2 68.3 47
## 6 75+ 19.9 27.0 34.8 49.0 48.2 66.4 90
area3 <- read_xls('Skipton.xls',sheet="Data", skip=4,n_max=3)
str(area3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 7 variables:
## $ Corangamite (S) LGA: chr "Area 2006: 4.92 km²" "Area 2011: 3.63 km²"
## $ X__1 : logi NA NA
## $ X__2 : chr "Location: 43km SW of Ballarat" NA
## $ X__3 : logi NA NA
## $ X__4 : logi NA NA
## $ X__5 : chr "Defined in 2011 as:" NA
## $ X__6 : chr "UCL222096" NA
dim(area3)
## [1] 2 7
head(area3)
## # A tibble: 2 x 7
## `Corangamite (S) L~ X__1 X__2 X__3 X__4 X__5 X__6
## <chr> <lgl> <chr> <lgl> <lgl> <chr> <chr>
## 1 Area 2006: 4.92 km² NA Location: 43km ~ NA NA Defined i~ UCL22~
## 2 Area 2011: 3.63 km² NA <NA> NA NA <NA> <NA>
Dwell3 <- read_xls('Skipton.xls',sheet="Data", skip=25,n_max=4)
str(Dwell3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of 8 variables:
## $ Dwellings: chr "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
## $ 1981 : num 176 31 NA 195
## $ 1986 : num 191 34 NA 205
## $ 1991 : num 186 48 NA 212
## $ 1996 : num 187 56 NA 229
## $ 2001 : num 192 52 NA 220
## $ 2006 : num 203 41 NA 229
## $ 2011 : num 192 56 NA 248
dim(Dwell3)
## [1] 4 8
head(Dwell3)
## # A tibble: 4 x 8
## Dwellings `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Occupied private dwell~ 176 191 186 187 192 203 192
## 2 Unoccupied private dwe~ 31 34 48 56 52 41 56
## 3 <NA> NA NA NA NA NA NA NA
## 4 Separate house 195. 205. 212 229 220 229 248
sum3 <- read_xls('Skipton.xls',sheet="Data", skip=8,n_max=5)
str(sum3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 8 variables:
## $ Summary data: chr "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
## $ 1981 : num 505 102 20 17 283
## $ 1986 : num 508 105 11 9 274
## $ 1991 : num 462 104 9 3 245
## $ 1996 : num 453 101 11 23 266
## $ 2001 : num 454 93.2 9 19 271
## $ 2006 : num 472 92.7 17 17 254
## $ 2011 : num 439 90.9 15 26 250
dim(sum3)
## [1] 5 8
head(sum3)
## # A tibble: 5 x 8
## `Summary data` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Total population (enum~ 505 508 462 453 454 472 439
## 2 Male: Female ratio* 102 105. 104. 101. 93.2 92.7 90.9
## 3 Visitors on Census nig~ 20 11 9 11 9 17 15
## 4 Population in non-priv~ 17 9 3 23 19 17 26
## 5 Population living same~ 283 274 245 266 271 254 250
Age4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=16,n_max=6)
str(Age4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 8 variables:
## $ Age structure: chr "0-17" "18-24" "25-44" "45-64" ...
## $ 1981 : chr "-" "-" "-" "-" ...
## $ 1986 : chr "-" "-" "-" "-" ...
## $ 1991 : chr "-" "-" "-" "-" ...
## $ 1996 : chr "-" "-" "-" "-" ...
## $ 2001 : chr "-" "-" "-" "-" ...
## $ 2006 : num 122.6 13 110.6 37.9 0 ...
## $ 2011 : num 180 15 125 119 15 0
dim(Age4)
## [1] 6 8
head(Age4)
## # A tibble: 6 x 8
## `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 0-17 - - - - - 123. 180
## 2 18-24 - - - - - 13.0 15
## 3 25-44 - - - - - 111. 125
## 4 45-64 - - - - - 37.9 119
## 5 65-74 - - - - - 0 15
## 6 75+ - - - - - 0 0
area4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=4,n_max=3)
str(area4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 7 variables:
## $ Latrobe (C) LGA: chr "Area 2006: 1.72 km²" "Area 2011: 4.83 km²"
## $ X__1 : logi NA NA
## $ X__2 : chr "Location: 11km south of Traralgon" NA
## $ X__3 : logi NA NA
## $ X__4 : logi NA NA
## $ X__5 : chr "Defined in 2011 as:" NA
## $ X__6 : chr "UCL222112" NA
dim(area4)
## [1] 2 7
head(area4)
## # A tibble: 2 x 7
## `Latrobe (C) LGA` X__1 X__2 X__3 X__4 X__5 X__6
## <chr> <lgl> <chr> <lgl> <lgl> <chr> <chr>
## 1 Area 2006: 1.72 k~ NA Location: 11km s~ NA NA Defined i~ UCL22~
## 2 Area 2011: 4.83 k~ NA <NA> NA NA <NA> <NA>
Dwell4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=25,n_max=4)
str(Dwell4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of 8 variables:
## $ Dwellings: chr "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
## $ 1981 : chr "-" "-" NA "-"
## $ 1986 : chr "-" "-" NA "-"
## $ 1991 : chr "-" "-" NA "-"
## $ 1996 : chr "-" "-" NA "-"
## $ 2001 : chr "-" "-" NA "-"
## $ 2006 : num 86 9 NA 95
## $ 2011 : num 140 10 NA 150
dim(Dwell4)
## [1] 4 8
head(Dwell4)
## # A tibble: 4 x 8
## Dwellings `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Occupied private dwell~ - - - - - 86 140
## 2 Unoccupied private dwe~ - - - - - 9 10
## 3 <NA> <NA> <NA> <NA> <NA> <NA> NA NA
## 4 Separate house - - - - - 95 150
sum4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=8,n_max=5)
str(sum4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 8 variables:
## $ Summary data: chr "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
## $ 1981 : chr "-" "-" "-" "-" ...
## $ 1986 : chr "-" "-" "-" "-" ...
## $ 1991 : chr "-" "-" "-" "-" ...
## $ 1996 : chr "-" "-" "-" "-" ...
## $ 2001 : chr "-" "-" "-" "-" ...
## $ 2006 : num 284 85.6 6 0 141
## $ 2011 : num 454 97.4 9 0 288
dim(sum4)
## [1] 5 8
head(sum4)
## # A tibble: 5 x 8
## `Summary data` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Total population (enum~ - - - - - 284 454
## 2 Male: Female ratio* - - - - - 85.6 97.4
## 3 Visitors on Census nig~ - - - - - 6 9
## 4 Population in non-priv~ - - - - - 0 0
## 5 Population living same~ - - - - - 141 288
To enable the joining of data, we were required to manually input the town name for each of dataset based on the datafile.
sum1 <- mutate(sum1,Town="Smythesdale")
area1 <- mutate(area1,Town="Smythesdale") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell1 <- mutate(Dwell1,Town="Smythesdale")
Age1 <- mutate(Age1,Town="Smythesdale")
sum2 <- mutate(sum2,Town="Tallarook")
area2 <- mutate(area2,Town="Tallarook") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell2 <- mutate(Dwell2,Town="Tallarook")
Age2 <- mutate(Age2,Town="Tallarook")
sum3 <- mutate(sum3,Town="Skipton")
area3 <- mutate(area3,Town="Skipton") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell3 <- mutate(Dwell3,Town="Skipton")
Age3 <- mutate(Age3,Town="Skipton")
sum4 <- mutate(sum4,Town="Traralgon-South")
area4 <- mutate(area4,Town="Traralgon-South") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell4 <- mutate(Dwell4,Town="Traralgon-South")
Age4 <- mutate(Age4,Town="Traralgon-South")
In this step, we join each dataframe for each town into one dataframe of the same type, ready for the final join. We did the final data manipulation for each data file as required.
To set up the summary dataframe, we needed to change the format of some of the variables from character to numeric format for Smythesdale 1981,and Traralgon-South 1981 to 2001 as these variables have been read in as a character, whilst the same variables for Tallarook and Skipton, and 1986-2001 for Smythesdale read in as numeric. NA’s are created by coercion.
sum1$`1981` <- as.numeric(sum1$`1981`)
sum4$`1981` <- as.numeric(sum4$`1981`)
sum4$`1986` <- as.numeric(sum4$`1986`)
sum4$`1991` <- as.numeric(sum4$`1991`)
sum4$`1996` <- as.numeric(sum4$`1996`)
sum4$`2001` <- as.numeric(sum4$`2001`)
Once the characters are converted, we combined the summary data. We then gather the Years as observations with the count as a new variable. We then spread the Town, Year, Male: Female ratio, Population in non-private dwellings, Population living same address 5 years ago, Total population (enumerated), and Visitors on Census night from observations to variables.
popsummary <- bind_rows(sum1,sum2,sum3,sum4)
popsummary <- gather(popsummary,2:8,key="Year",value="Count")
popsummary <- spread(popsummary,key = `Summary data`,value = 'Count')
str(popsummary)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 7 variables:
## $ Town : chr "Skipton" "Skipton" "Skipton" "Skipton" ...
## $ Year : chr "1981" "1986" "1991" "1996" ...
## $ Male: Female ratio* : num 102 104.8 104.4 101.3 93.2 ...
## $ Population in non-private dwellings* : num 17 9 3 23 19 17 26 NA 0 0 ...
## $ Population living same address 5 years ago: num 283 274 245 266 271 ...
## $ Total population (enumerated)* : num 505 508 462 453 454 ...
## $ Visitors on Census night* : num 20 11 9 11 9 17 15 NA 11 0.531 ...
dim(popsummary)
## [1] 28 7
head(popsummary)
## # A tibble: 6 x 7
## Town Year `Male: Female r~ `Population in ~ `Population liv~
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Skip~ 1981 102 17 283
## 2 Skip~ 1986 105. 9 274
## 3 Skip~ 1991 104. 3 245
## 4 Skip~ 1996 101. 23 266
## 5 Skip~ 2001 93.2 19 271
## 6 Skip~ 2006 92.7 17 254
## # ... with 2 more variables: `Total population (enumerated)*` <dbl>,
## # `Visitors on Census night*` <dbl>
Here we join the area data. We then separate the Variable Area Year to separate variables containing the word “Area” and the corresponding year. We then drop all variables except for Year, Area, and Town.
LGA <- bind_rows(area1,area2,area3,area4)
LGA <-separate(LGA, Area_Year,c("Area2","Year"),sep=" ")
LGA <- LGA[,c(2,3,10)]
str(LGA)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of 3 variables:
## $ Year: chr "2006" "2011" "2006" "2011" ...
## $ Area: chr "1.4 km²" "5.92 km²" "3.64 km²" "1.41 km²" ...
## $ Town: chr "Smythesdale" "Smythesdale" "Tallarook" "Tallarook" ...
dim(LGA)
## [1] 8 3
head(LGA)
## # A tibble: 6 x 3
## Year Area Town
## <chr> <chr> <chr>
## 1 2006 1.4 km² Smythesdale
## 2 2011 5.92 km² Smythesdale
## 3 2006 3.64 km² Tallarook
## 4 2011 1.41 km² Tallarook
## 5 2006 4.92 km² Skipton
## 6 2011 3.63 km² Skipton
Similar to the summary data, we need to convert some of the year data from character to numeric variable to enable the join. NA’s are created by coercion.
Dwell1$`1981` <- as.numeric(Dwell1$`1981`)
Dwell4$`1981` <- as.numeric(Dwell4$`1981`)
Dwell4$`1986` <- as.numeric(Dwell4$`1986`)
Dwell4$`1991` <- as.numeric(Dwell4$`1991`)
Dwell4$`1996` <- as.numeric(Dwell4$`1996`)
Dwell4$`2001` <- as.numeric(Dwell4$`2001`)
We then bind the 4 dataframes into one, gather the Year variables into observations, and spread Occupied private dwellings, Unoccupied private dwellings, and Separate house into seperate variables.
Towndwell <- bind_rows(Dwell1,Dwell2,Dwell3,Dwell4)
Towndwell <- gather(Towndwell,2:8,key="Year",value="Count")
Towndwell <- spread(Towndwell,key = `Dwellings`, value="Count")
str(Towndwell)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 6 variables:
## $ Town : chr "Skipton" "Skipton" "Skipton" "Skipton" ...
## $ Year : chr "1981" "1986" "1991" "1996" ...
## $ Occupied private dwellings : num 176 191 186 187 192 ...
## $ Separate house : num 195 205 212 229 220 ...
## $ Unoccupied private dwellings: num 31 34 48 56 52 ...
## $ <NA> : num NA NA NA NA NA NA NA NA NA NA ...
dim(Towndwell)
## [1] 28 6
head(Towndwell)
## # A tibble: 6 x 6
## Town Year `Occupied privat~ `Separate house` `Unoccupied priv~ `<NA>`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Skipt~ 1981 176 195. 31 NA
## 2 Skipt~ 1986 191 205. 34 NA
## 3 Skipt~ 1991 186 212 48 NA
## 4 Skipt~ 1996 187 229 56 NA
## 5 Skipt~ 2001 192 220 52 NA
## 6 Skipt~ 2006 203 229 41 NA
Again, we needed to convert some variables from character to numeric variables. NA’s are created by coercion.
Age1$`1981` <- as.numeric(Age1$`1981`)
Age4$`1981` <- as.numeric(Age4$`1981`)
Age4$`1986` <- as.numeric(Age4$`1986`)
Age4$`1991` <- as.numeric(Age4$`1991`)
Age4$`1996` <- as.numeric(Age4$`1996`)
Age4$`2001` <- as.numeric(Age4$`2001`)
We then bind the Age Structures dataframes into one, gathering the Years into observations and then spreading the age brackets into variables.
Age <- bind_rows(Age1,Age2,Age3,Age4)
Age <- gather(Age,2:8,key="Year",value="Count")
Age <- spread(Age,key = `Age structure`, value="Count")
str(Age)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 8 variables:
## $ Town : chr "Skipton" "Skipton" "Skipton" "Skipton" ...
## $ Year : chr "1981" "1986" "1991" "1996" ...
## $ 0-17 : num 150 146 116 104 106 ...
## $ 18-24: num 49.8 35.2 29.8 22.5 10.8 ...
## $ 25-44: num 112.6 129.6 123.7 99.7 109.5 ...
## $ 45-64: num 119 127 103 119 125 ...
## $ 65-74: num 53.8 42.9 55.1 58.7 54.2 ...
## $ 75+ : num 19.9 27 34.8 49 48.2 ...
dim(Age)
## [1] 28 8
head(Age)
## # A tibble: 6 x 8
## Town Year `0-17` `18-24` `25-44` `45-64` `65-74` `75+`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Skipton 1981 150. 49.8 113. 119. 53.8 19.9
## 2 Skipton 1986 146. 35.2 130. 127. 42.9 27.0
## 3 Skipton 1991 116. 29.8 124. 103. 55.1 34.8
## 4 Skipton 1996 104. 22.5 99.7 119. 58.7 49.0
## 5 Skipton 2001 106. 10.8 110. 125. 54.2 48.2
## 6 Skipton 2006 97.4 17.0 87.3 136. 68.3 66.4
Finally, we did the final join. The final structure, dimensions and head of the data set will be discussed in the following section.
FullData <- full_join(Age,popsummary,Age,by=c("Town","Year"))
FullData <- left_join(FullData, LGA, by=c("Town","Year"))
FullData$Town <- as.factor(FullData$Town)
FullData$Year <- as.ordered(FullData$Year)
We now have a data frame with 28 observations and 14 variables.
Town - a factor variable with 4 levels. Contains the name of the towns.
Year - a ordered factor with 7 levels. Contains the corresponding year of each observation.
0-17 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
18-24 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
25-44 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
45-64 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
65-74 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
75+ - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
Male: Female ratio* - a Numeric variable. The ratio of male to females.
Population in non-private dwellings* - a numeric variable showing how many people were occupoing non-private dwellings on census night.
Population living same address 5 years ago - a numeric variable Shows how many poeple are still living in the same address as years ago.
Total population (enumerated)* - a numeric variable showing the total popluation on census night.
Visitors on Census night* - a numeric variable showing the number of visitors to the towns on census night.
Area - a character variable showing the area kilometers squared.
str(FullData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 14 variables:
## $ Town : Factor w/ 4 levels "Skipton","Smythesdale",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Year : Ord.factor w/ 7 levels "1981"<"1986"<..: 1 2 3 4 5 6 7 1 2 3 ...
## $ 0-17 : num 150 146 116 104 106 ...
## $ 18-24 : num 49.8 35.2 29.8 22.5 10.8 ...
## $ 25-44 : num 112.6 129.6 123.7 99.7 109.5 ...
## $ 45-64 : num 119 127 103 119 125 ...
## $ 65-74 : num 53.8 42.9 55.1 58.7 54.2 ...
## $ 75+ : num 19.9 27 34.8 49 48.2 ...
## $ Male: Female ratio* : num 102 104.8 104.4 101.3 93.2 ...
## $ Population in non-private dwellings* : num 17 9 3 23 19 17 26 NA 0 0 ...
## $ Population living same address 5 years ago: num 283 274 245 266 271 ...
## $ Total population (enumerated)* : num 505 508 462 453 454 ...
## $ Visitors on Census night* : num 20 11 9 11 9 17 15 NA 11 0.531 ...
## $ Area : chr NA NA NA NA ...
dim(FullData)
## [1] 28 14
head(FullData)
## # A tibble: 6 x 14
## Town Year `0-17` `18-24` `25-44` `45-64` `65-74` `75+` `Male: Female r~
## <fct> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Skip~ 1981 150. 49.8 113. 119. 53.8 19.9 102
## 2 Skip~ 1986 146. 35.2 130. 127. 42.9 27.0 105.
## 3 Skip~ 1991 116. 29.8 124. 103. 55.1 34.8 104.
## 4 Skip~ 1996 104. 22.5 99.7 119. 58.7 49.0 101.
## 5 Skip~ 2001 106. 10.8 110. 125. 54.2 48.2 93.2
## 6 Skip~ 2006 97.4 17.0 87.3 136. 68.3 66.4 92.7
## # ... with 5 more variables: `Population in non-private dwellings*` <dbl>,
## # `Population living same address 5 years ago` <dbl>, `Total population
## # (enumerated)*` <dbl>, `Visitors on Census night*` <dbl>, Area <chr>
The final dataframe is almost completly in tidy format, as it was done during the data reading stage. The final step is to split out the area variable, ready for the next step. We create two new variables, the Area Km2 containing the numerical value and a default variable Dot, which we subsequently dropped, containing the characters ‘Km2’. We then convert the Area Km2 to numeric values.
Sub_Group <- separate(FullData, Area,c("Area Km2","Dot"),sep=" ")
Sub_Group <- Sub_Group[,1:14]
Sub_Group$`Area Km2` <- as.numeric(Sub_Group$`Area Km2`)
str(Sub_Group)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 14 variables:
## $ Town : Factor w/ 4 levels "Skipton","Smythesdale",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Year : Ord.factor w/ 7 levels "1981"<"1986"<..: 1 2 3 4 5 6 7 1 2 3 ...
## $ 0-17 : num 150 146 116 104 106 ...
## $ 18-24 : num 49.8 35.2 29.8 22.5 10.8 ...
## $ 25-44 : num 112.6 129.6 123.7 99.7 109.5 ...
## $ 45-64 : num 119 127 103 119 125 ...
## $ 65-74 : num 53.8 42.9 55.1 58.7 54.2 ...
## $ 75+ : num 19.9 27 34.8 49 48.2 ...
## $ Male: Female ratio* : num 102 104.8 104.4 101.3 93.2 ...
## $ Population in non-private dwellings* : num 17 9 3 23 19 17 26 NA 0 0 ...
## $ Population living same address 5 years ago: num 283 274 245 266 271 ...
## $ Total population (enumerated)* : num 505 508 462 453 454 ...
## $ Visitors on Census night* : num 20 11 9 11 9 17 15 NA 11 0.531 ...
## $ Area Km2 : num NA NA NA NA NA 4.92 3.63 NA NA NA ...
dim(Sub_Group)
## [1] 28 14
head(Sub_Group)
## # A tibble: 6 x 14
## Town Year `0-17` `18-24` `25-44` `45-64` `65-74` `75+` `Male: Female r~
## <fct> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Skip~ 1981 150. 49.8 113. 119. 53.8 19.9 102
## 2 Skip~ 1986 146. 35.2 130. 127. 42.9 27.0 105.
## 3 Skip~ 1991 116. 29.8 124. 103. 55.1 34.8 104.
## 4 Skip~ 1996 104. 22.5 99.7 119. 58.7 49.0 101.
## 5 Skip~ 2001 106. 10.8 110. 125. 54.2 48.2 93.2
## 6 Skip~ 2006 97.4 17.0 87.3 136. 68.3 66.4 92.7
## # ... with 5 more variables: `Population in non-private dwellings*` <dbl>,
## # `Population living same address 5 years ago` <dbl>, `Total population
## # (enumerated)*` <dbl>, `Visitors on Census night*` <dbl>, `Area
## # Km2` <dbl>
We want to create a new character - the population density for each town. We do this by calculating the Total population (enumerated)* devided by the Area Km2 variable to create the new variable Population Density.
Sub_Group <- Sub_Group %>%
mutate(`Population Density` = `Total population (enumerated)*`/Sub_Group$`Area Km2`, rm.na=TRUE)
Sub_Group <- Sub_Group[,1:15]
Sub_Group$`Population Density` <- Sub_Group$`Population Density` %>% round(2)
Here we are scanning the datafile for missing variables. As can be seen, there is 6 missing variables in 13 of the varibales, and 20 for the Area variable.
colSums(is.na(Sub_Group))
## Town
## 0
## Year
## 0
## 0-17
## 6
## 18-24
## 6
## 25-44
## 6
## 45-64
## 6
## 65-74
## 6
## 75+
## 6
## Male: Female ratio*
## 6
## Population in non-private dwellings*
## 6
## Population living same address 5 years ago
## 6
## Total population (enumerated)*
## 6
## Visitors on Census night*
## 6
## Area Km2
## 20
## Population Density
## 20
is.na(Sub_Group)
## Town Year 0-17 18-24 25-44 45-64 65-74 75+ Male: Female ratio*
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [26,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Population in non-private dwellings*
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE
## [7,] FALSE
## [8,] TRUE
## [9,] FALSE
## [10,] FALSE
## [11,] FALSE
## [12,] FALSE
## [13,] FALSE
## [14,] FALSE
## [15,] FALSE
## [16,] FALSE
## [17,] FALSE
## [18,] FALSE
## [19,] FALSE
## [20,] FALSE
## [21,] FALSE
## [22,] TRUE
## [23,] TRUE
## [24,] TRUE
## [25,] TRUE
## [26,] TRUE
## [27,] FALSE
## [28,] FALSE
## Population living same address 5 years ago
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE
## [7,] FALSE
## [8,] TRUE
## [9,] FALSE
## [10,] FALSE
## [11,] FALSE
## [12,] FALSE
## [13,] FALSE
## [14,] FALSE
## [15,] FALSE
## [16,] FALSE
## [17,] FALSE
## [18,] FALSE
## [19,] FALSE
## [20,] FALSE
## [21,] FALSE
## [22,] TRUE
## [23,] TRUE
## [24,] TRUE
## [25,] TRUE
## [26,] TRUE
## [27,] FALSE
## [28,] FALSE
## Total population (enumerated)* Visitors on Census night* Area Km2
## [1,] FALSE FALSE TRUE
## [2,] FALSE FALSE TRUE
## [3,] FALSE FALSE TRUE
## [4,] FALSE FALSE TRUE
## [5,] FALSE FALSE TRUE
## [6,] FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE
## [8,] TRUE TRUE TRUE
## [9,] FALSE FALSE TRUE
## [10,] FALSE FALSE TRUE
## [11,] FALSE FALSE TRUE
## [12,] FALSE FALSE TRUE
## [13,] FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE
## [15,] FALSE FALSE TRUE
## [16,] FALSE FALSE TRUE
## [17,] FALSE FALSE TRUE
## [18,] FALSE FALSE TRUE
## [19,] FALSE FALSE TRUE
## [20,] FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE
## [22,] TRUE TRUE TRUE
## [23,] TRUE TRUE TRUE
## [24,] TRUE TRUE TRUE
## [25,] TRUE TRUE TRUE
## [26,] TRUE TRUE TRUE
## [27,] FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE
## Population Density
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] FALSE
## [7,] FALSE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
## [11,] TRUE
## [12,] TRUE
## [13,] FALSE
## [14,] FALSE
## [15,] TRUE
## [16,] TRUE
## [17,] TRUE
## [18,] TRUE
## [19,] TRUE
## [20,] FALSE
## [21,] FALSE
## [22,] TRUE
## [23,] TRUE
## [24,] TRUE
## [25,] TRUE
## [26,] TRUE
## [27,] FALSE
## [28,] FALSE
The variables that are missing are Smythesdale 1981, and Traralgon-South 1986-2001. Smythesdale, Traralgon-South, Tallarook and Skipton have NA’s in Area Km2 and Population Density for all years prior to 2006.
As such, we’ve made the descision to omit the missing values, as we cannot reliably calcule the missing data. This reduces the dataframe to 8 observations.
Sub_Group <- na.omit(Sub_Group)
colSums(is.na(Sub_Group))
## Town
## 0
## Year
## 0
## 0-17
## 0
## 18-24
## 0
## 25-44
## 0
## 45-64
## 0
## 65-74
## 0
## 75+
## 0
## Male: Female ratio*
## 0
## Population in non-private dwellings*
## 0
## Population living same address 5 years ago
## 0
## Total population (enumerated)*
## 0
## Visitors on Census night*
## 0
## Area Km2
## 0
## Population Density
## 0
dim(Sub_Group)
## [1] 8 15
We’ve selected the z-score method to scan for outliers. An examination of the results show no outliers, as there are no absolut valuesof z-scores > 3, we can determine there are no outliers.
z.scores <- Sub_Group %>% scores(type = "z")
z.scores %>% summary()
## Town Year 0-17 18-24
## Min. : NA Min. : NA Min. :-1.0133 Min. :-1.0963
## 1st Qu.: NA 1st Qu.: NA 1st Qu.:-0.6313 1st Qu.:-0.7443
## Median : NA Median : NA Median :-0.2449 Median :-0.1734
## Mean :NaN Mean :NaN Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: 0.3580 3rd Qu.: 0.5304
## Max. : NA Max. : NA Max. : 2.0427 Max. : 1.8742
## NA's :8 NA's :8
## 25-44 45-64 65-74 75+
## Min. :-1.2624 Min. :-1.53513 Min. :-1.3610 Min. :-0.8198
## 1st Qu.:-0.6662 1st Qu.:-0.63529 1st Qu.:-0.6955 1st Qu.:-0.5697
## Median :-0.2229 Median : 0.03866 Median :-0.2104 Median :-0.3803
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9169 3rd Qu.: 0.85252 3rd Qu.: 0.7258 3rd Qu.: 0.1379
## Max. : 1.2791 Max. : 1.23816 Max. : 1.7212 Max. : 1.9079
##
## Male: Female ratio* Population in non-private dwellings*
## Min. :-1.2702 Min. :-0.5249
## 1st Qu.:-0.6490 1st Qu.:-0.5249
## Median :-0.2029 Median :-0.5249
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5940 3rd Qu.:-0.1099
## Max. : 1.4437 Max. : 2.0143
##
## Population living same address 5 years ago Total population (enumerated)*
## Min. :-1.30668 Min. :-1.36181
## 1st Qu.:-0.83571 1st Qu.:-0.75435
## Median : 0.03895 Median : 0.02982
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.76842 3rd Qu.: 0.88689
## Max. : 1.20044 Max. : 1.06802
##
## Visitors on Census night* Area Km2 Population Density
## Min. :-1.1035 Min. :-1.1580 Min. :-1.1436
## 1st Qu.:-0.6621 1st Qu.:-1.0200 1st Qu.:-0.6789
## Median :-0.1226 Median : 0.1146 Median :-0.2861
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3678 3rd Qu.: 0.8079 3rd Qu.: 0.6313
## Max. : 1.6430 Max. : 1.4157 Max. : 1.7294
##
which(abs(z.scores) >3 )
## integer(0)
length(which( abs(z.scores) >3 ))
## [1] 0
To demonstrate the transformation of data, we’ve selected the variable 0-17, which shows a right-skewed distribution. We’ve selected the Box-Cox transformation to transform the data into a normal distribution.
hist(Sub_Group$`0-17`, main = "Distribution of 0-17 age group", col="green")
boxcox_age<- BoxCox(Sub_Group$`75+`,lambda = "auto")
hist(boxcox_age, main = "Box-Cox transformation of 0-17 age group", col="purple")