MATH2349 Semester 2, 2018 - Preparing data for processing - Population density of towns in Victoria, Australia

Required packages

library(ggplot2)
library(readxl)
library(tidyr)
library(dplyr)
library(outliers)
library(forecast)

Executive Summary

In this work, the authors have sourced 4 datasets from www.data.vic.gov.au “Towns-in-Time” series of datasets compiled by the Department of planning. Each of the datasets features, population demographics data, dwelling numbers data, summary data and town land area data.

These 4 datasets present a number of pre-processing difficulties in that; first the datasets containing variables in both horizontal and spread format.

The second difficulty is that observations are located disparately in 4 different locations in the spreadsheet.

The third difficulty is that each dataset and it’s corresponding data frames require additional preparative work prior to processing.

The last issue is that the resulting components of 4 excel entities need to be combined prior to processing.

This project demonstrates that each dataset needs to be read into 16 different “R” Data Frames prior to consolidation and transformation before production of a single useful dataset useful for production of statistics for analysis . The data was sourced on 22nd September 2018 (prior to a website relaunch) from:

https://www.data.vic.gov.au/data/dataset/towns-in-time-traralgon-south

https://www.data.vic.gov.au/data/dataset/towns-in-time-tallarook

https://www.data.vic.gov.au/data/dataset/towns-in-time-skipton

https://www.data.vic.gov.au/data/dataset/towns-in-time-smythesdale

Additionally the data has been assessed in an effort to deal with issues associated with its completeness and appropriateness with tests for missing data and the presence of outliers.

The final outcome is a small dataframe containing population data of 4 small Australian towns in regional Victoria for the years 2006 and 2011, with a new variable for population density.

Read, Tidy, Manipulate, and Join the Data

We have selected the datafiles for four seperate towns within Victoria - Smythesdale, Tallarook, Skipton, and Traralgon-South. Each raw datafile has a sections on the area of the local government area, summary, age structure, and dwelling data. Data is based on data collected on censuse nights from 1981 trough to 2011.

To create the datframe for our assignment, we are required to tidy and manipluate the individual data frames prior to the final join.

Step 1 - Read in the raw data steps

Each raw datafile for each town has 4 sections of data to read in seperatly. We read in the local government area data for 2006 and 2011, the summary data, the age structure, and dwellings data for each town seperatly.

Before we can construct the final dataframe, the data requires some inital manipulation to put the data in tidy format.

Smythesdale

Age1 <- read_xls('Smythesdale.xls',sheet="Data", skip=16,n_max=6)
str(Age1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  8 variables:
##  $ Age structure: chr  "0-17" "18-24" "25-44" "45-64" ...
##  $ 1981         : chr  "-" "-" "-" "-" ...
##  $ 1986         : num  63.7 18.9 47.7 35.8 15.9 ...
##  $ 1991         : num  76.1 18.6 91.1 57.8 12.8 ...
##  $ 1996         : num  95 15.7 85.7 51.9 14.1 ...
##  $ 2001         : num  100.8 22.1 95 49.9 28.8 ...
##  $ 2006         : num  75 20 67 79 27 11
##  $ 2011         : num  108 33 122 150 46 20

dim(Age1)

## [1] 6 8

head(Age1)

## # A tibble: 6 x 8
##   `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>           <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 0-17            -       63.7    76.1   95.0  101.      75    108
## 2 18-24           -       18.9    18.6   15.7   22.1     20     33
## 3 25-44           -       47.7    91.1   85.7   95.0     67    122
## 4 45-64           -       35.8    57.8   51.9   49.9     79    150
## 5 65-74           -       15.9    12.8   14.1   28.8     27     46
## 6 75+             -        1.99    0     16.5   15.4     11     20

area1 <- read_xls('Smythesdale.xls',sheet="Data", skip=4,n_max=3)
str(area1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  7 variables:
##  $ Golden Plains (S) LGA: chr  "Area 2006: 1.4 km²" "Area 2011: 5.92 km²"
##  $ X__1                 : logi  NA NA
##  $ X__2                 : chr  "Location: 16km SW of Ballarat" NA
##  $ X__3                 : logi  NA NA
##  $ X__4                 : logi  NA NA
##  $ X__5                 : chr  "Defined in 2011 as:" NA
##  $ X__6                 : chr  "UCL222098" NA

dim(area1)

## [1] 2 7

head(area1)

## # A tibble: 2 x 7
##   `Golden Plains (S)~ X__1  X__2            X__3  X__4  X__5        X__6  
##   <chr>               <lgl> <chr>           <lgl> <lgl> <chr>       <chr> 
## 1 Area 2006: 1.4 km²  NA    Location: 16km~ NA    NA    Defined in~ UCL22~
## 2 Area 2011: 5.92 km² NA    <NA>            NA    NA    <NA>        <NA>

Dwell1 <- read_xls('Smythesdale.xls',sheet="Data", skip=25,n_max=4)
str(Dwell1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  8 variables:
##  $ Dwellings: chr  "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
##  $ 1981     : chr  "-" "-" NA "-"
##  $ 1986     : num  59 7 NA 62.9
##  $ 1991     : num  82.15 8.95 NA 86.51
##  $ 1996     : num  94.1 7.83 NA 91.81
##  $ 2001     : num  112 11 NA 114
##  $ 2006     : num  107 17 NA 118
##  $ 2011     : num  191 23 NA 200

dim(Dwell1)

## [1] 4 8

head(Dwell1)

## # A tibble: 4 x 8
##   Dwellings               `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                   <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Occupied private dwell~ -        59    82.2   94.1    112     107    191
## 2 Unoccupied private dwe~ -         7     8.95   7.83    11      17     23
## 3 <NA>                    <NA>     NA    NA     NA       NA      NA     NA
## 4 Separate house          -        62.9  86.5   91.8    114.    118    200

sum1 <- read_xls('Smythesdale.xls',sheet="Data", skip=8,n_max=5)
str(sum1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  8 variables:
##  $ Summary data: chr  "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
##  $ 1981        : chr  "-" "-" "-" "-" ...
##  $ 1986        : num  184 119 11 0 84
##  $ 1991        : num  256.648 111.085 0.531 0 146.411
##  $ 1996        : num  279.02 104.98 1.24 0 161.37
##  $ 2001        : num  312 100 0 0 169
##  $ 2006        : num  279 105 8 0 152
##  $ 2011        : num  479 106 8 0 293

dim(sum1)

## [1] 5 8

head(sum1)

## # A tibble: 5 x 8
##   `Summary data`         `1981` `1986`  `1991` `1996` `2001` `2006` `2011`
##   <chr>                  <chr>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Total population (enu~ -        184  257.    279.      312   279    479 
## 2 Male: Female ratio*    -        119. 111.    105.      100   105.   106.
## 3 Visitors on Census ni~ -         11    0.531   1.24      0     8      8 
## 4 Population in non-pri~ -          0    0       0         0     0      0 
## 5 Population living sam~ -         84  146.    161.      169   152    293

Tallarook

Age2 <- read_xls('Tallarook.xls',sheet="Data", skip=16,n_max=6)
str(Age2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  8 variables:
##  $ Age structure: chr  "0-17" "18-24" "25-44" "45-64" ...
##  $ 1981         : num  64.1 26.62 58.18 50.29 9.86 ...
##  $ 1986         : num  68.3 30.1 76.3 33.1 16.1 ...
##  $ 1991         : num  66.5 20.8 92.3 44.6 18.8 ...
##  $ 1996         : num  74.39 8.93 81.33 58.52 5.95 ...
##  $ 2001         : num  77.8 21.1 84.5 59.5 33.6 ...
##  $ 2006         : num  61 23 51 84 24 11
##  $ 2011         : num  56 12 44 60 14 18

dim(Age2)

## [1] 6 8

head(Age2)

## # A tibble: 6 x 8
##   `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 0-17             64.1   68.3   66.5   74.4    77.8     61     56
## 2 18-24            26.6   30.1   20.8    8.93   21.1     23     12
## 3 25-44            58.2   76.3   92.3   81.3    84.5     51     44
## 4 45-64            50.3   33.1   44.6   58.5    59.5     84     60
## 5 65-74             9.86  16.1   18.8    5.95   33.6     24     14
## 6 75+               3.94   6.03   6.94  14.9    14.4     11     18

area2 <- read_xls('Tallarook.xls',sheet="Data", skip=4,n_max=3)
str(area2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  7 variables:
##  $ Mitchell (S) LGA: chr  "Area 2006: 3.64 km²" "Area 2011: 1.41 km²"
##  $ X__1            : logi  NA NA
##  $ X__2            : chr  "Location: 78km north of Melbourne" NA
##  $ X__3            : logi  NA NA
##  $ X__4            : logi  NA NA
##  $ X__5            : chr  "Defined in 2011 as:" NA
##  $ X__6            : chr  "UCL222105" NA

dim(area2)

## [1] 2 7

head(area2)

## # A tibble: 2 x 7
##   `Mitchell (S) LGA` X__1  X__2              X__3  X__4  X__5       X__6  
##   <chr>              <lgl> <chr>             <lgl> <lgl> <chr>      <chr> 
## 1 Area 2006: 3.64 k~ NA    Location: 78km n~ NA    NA    Defined i~ UCL22~
## 2 Area 2011: 1.41 k~ NA    <NA>              NA    NA    <NA>       <NA>

Dwell2 <- read_xls('Tallarook.xls',sheet="Data", skip=25,n_max=4)
str(Dwell2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  8 variables:
##  $ Dwellings: chr  "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
##  $ 1981     : num  69 9 NA 75
##  $ 1986     : num  77 8 NA 82
##  $ 1991     : num  90 10 NA 94
##  $ 1996     : num  92 6 NA 95
##  $ 2001     : num  108 11 NA 119
##  $ 2006     : num  100 19 NA 115
##  $ 2011     : num  83 15 NA 95

dim(Dwell2)

## [1] 4 8

head(Dwell2)

## # A tibble: 4 x 8
##   Dwellings               `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Occupied private dwell~   69       77     90     92    108    100     83
## 2 Unoccupied private dwe~    9        8     10      6     11     19     15
## 3 <NA>                      NA       NA     NA     NA     NA     NA     NA
## 4 Separate house            75.0     82     94     95    119    115     95

sum2 <- read_xls('Tallarook.xls',sheet="Data", skip=8,n_max=5)
str(sum2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  8 variables:
##  $ Summary data: chr  "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
##  $ 1981        : num  213 120 7 10 115
##  $ 1986        : num  230 113 7 0 125
##  $ 1991        : num  250 89.4 9 0 113
##  $ 1996        : num  244 102 12 0 145
##  $ 2001        : num  291 102 6 0 196
##  $ 2006        : num  254 88.1 3 0 172
##  $ 2011        : num  204 94.3 3 0 116

dim(sum2)

## [1] 5 8

head(sum2)

## # A tibble: 5 x 8
##   `Summary data`          `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Total population (enum~   213    230   250     244    291   254    204  
## 2 Male: Female ratio*       120.   113.   89.4   102.   102.   88.1   94.3
## 3 Visitors on Census nig~     7      7     9      12      6     3      3  
## 4 Population in non-priv~    10      0     0       0      0     0      0  
## 5 Population living same~   115    125   113     145    196   172    116

Skipton

Age3 <- read_xls('Skipton.xls',sheet="Data", skip=16,n_max=6)
str(Age3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  8 variables:
##  $ Age structure: chr  "0-17" "18-24" "25-44" "45-64" ...
##  $ 1981         : num  150.4 49.8 112.6 118.5 53.8 ...
##  $ 1986         : num  146.4 35.2 129.6 126.9 42.9 ...
##  $ 1991         : num  115.5 29.8 123.7 103.1 55.1 ...
##  $ 1996         : num  104.5 22.5 99.7 118.6 58.7 ...
##  $ 2001         : num  106.3 10.8 109.5 125 54.2 ...
##  $ 2006         : num  97.4 17 87.3 135.6 68.3 ...
##  $ 2011         : num  77 25 67 134 47 90

dim(Age3)

## [1] 6 8

head(Age3)

## # A tibble: 6 x 8
##   `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 0-17             150.   146.   116.   104.   106.    97.4     77
## 2 18-24             49.8   35.2   29.8   22.5   10.8   17.0     25
## 3 25-44            113.   130.   124.    99.7  110.    87.3     67
## 4 45-64            119.   127.   103.   119.   125.   136.     134
## 5 65-74             53.8   42.9   55.1   58.7   54.2   68.3     47
## 6 75+               19.9   27.0   34.8   49.0   48.2   66.4     90

area3 <- read_xls('Skipton.xls',sheet="Data", skip=4,n_max=3)
str(area3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  7 variables:
##  $ Corangamite (S) LGA: chr  "Area 2006: 4.92 km²" "Area 2011: 3.63 km²"
##  $ X__1               : logi  NA NA
##  $ X__2               : chr  "Location: 43km SW of Ballarat" NA
##  $ X__3               : logi  NA NA
##  $ X__4               : logi  NA NA
##  $ X__5               : chr  "Defined in 2011 as:" NA
##  $ X__6               : chr  "UCL222096" NA

dim(area3)

## [1] 2 7

head(area3)

## # A tibble: 2 x 7
##   `Corangamite (S) L~ X__1  X__2             X__3  X__4  X__5       X__6  
##   <chr>               <lgl> <chr>            <lgl> <lgl> <chr>      <chr> 
## 1 Area 2006: 4.92 km² NA    Location: 43km ~ NA    NA    Defined i~ UCL22~
## 2 Area 2011: 3.63 km² NA    <NA>             NA    NA    <NA>       <NA>

Dwell3 <- read_xls('Skipton.xls',sheet="Data", skip=25,n_max=4)
str(Dwell3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  8 variables:
##  $ Dwellings: chr  "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
##  $ 1981     : num  176 31 NA 195
##  $ 1986     : num  191 34 NA 205
##  $ 1991     : num  186 48 NA 212
##  $ 1996     : num  187 56 NA 229
##  $ 2001     : num  192 52 NA 220
##  $ 2006     : num  203 41 NA 229
##  $ 2011     : num  192 56 NA 248

dim(Dwell3)

## [1] 4 8

head(Dwell3)

## # A tibble: 4 x 8
##   Dwellings               `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Occupied private dwell~   176    191     186    187    192    203    192
## 2 Unoccupied private dwe~    31     34      48     56     52     41     56
## 3 <NA>                       NA     NA      NA     NA     NA     NA     NA
## 4 Separate house            195.   205.    212    229    220    229    248

sum3 <- read_xls('Skipton.xls',sheet="Data", skip=8,n_max=5)
str(sum3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  8 variables:
##  $ Summary data: chr  "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
##  $ 1981        : num  505 102 20 17 283
##  $ 1986        : num  508 105 11 9 274
##  $ 1991        : num  462 104 9 3 245
##  $ 1996        : num  453 101 11 23 266
##  $ 2001        : num  454 93.2 9 19 271
##  $ 2006        : num  472 92.7 17 17 254
##  $ 2011        : num  439 90.9 15 26 250

dim(sum3)

## [1] 5 8

head(sum3)

## # A tibble: 5 x 8
##   `Summary data`          `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Total population (enum~    505   508    462    453   454    472    439  
## 2 Male: Female ratio*        102   105.   104.   101.   93.2   92.7   90.9
## 3 Visitors on Census nig~     20    11      9     11     9     17     15  
## 4 Population in non-priv~     17     9      3     23    19     17     26  
## 5 Population living same~    283   274    245    266   271    254    250

Traralgon-South

Age4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=16,n_max=6)
str(Age4)

## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  8 variables:
##  $ Age structure: chr  "0-17" "18-24" "25-44" "45-64" ...
##  $ 1981         : chr  "-" "-" "-" "-" ...
##  $ 1986         : chr  "-" "-" "-" "-" ...
##  $ 1991         : chr  "-" "-" "-" "-" ...
##  $ 1996         : chr  "-" "-" "-" "-" ...
##  $ 2001         : chr  "-" "-" "-" "-" ...
##  $ 2006         : num  122.6 13 110.6 37.9 0 ...
##  $ 2011         : num  180 15 125 119 15 0

dim(Age4)

## [1] 6 8

head(Age4)

## # A tibble: 6 x 8
##   `Age structure` `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>           <chr>  <chr>  <chr>  <chr>  <chr>   <dbl>  <dbl>
## 1 0-17            -      -      -      -      -       123.     180
## 2 18-24           -      -      -      -      -        13.0     15
## 3 25-44           -      -      -      -      -       111.     125
## 4 45-64           -      -      -      -      -        37.9    119
## 5 65-74           -      -      -      -      -         0       15
## 6 75+             -      -      -      -      -         0        0

area4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=4,n_max=3)
str(area4)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  7 variables:
##  $ Latrobe (C) LGA: chr  "Area 2006: 1.72 km²" "Area 2011: 4.83 km²"
##  $ X__1           : logi  NA NA
##  $ X__2           : chr  "Location: 11km south of Traralgon" NA
##  $ X__3           : logi  NA NA
##  $ X__4           : logi  NA NA
##  $ X__5           : chr  "Defined in 2011 as:" NA
##  $ X__6           : chr  "UCL222112" NA

dim(area4)

## [1] 2 7

head(area4)

## # A tibble: 2 x 7
##   `Latrobe (C) LGA`  X__1  X__2              X__3  X__4  X__5       X__6  
##   <chr>              <lgl> <chr>             <lgl> <lgl> <chr>      <chr> 
## 1 Area 2006: 1.72 k~ NA    Location: 11km s~ NA    NA    Defined i~ UCL22~
## 2 Area 2011: 4.83 k~ NA    <NA>              NA    NA    <NA>       <NA>

Dwell4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=25,n_max=4)
str(Dwell4)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  8 variables:
##  $ Dwellings: chr  "Occupied private dwellings" "Unoccupied private dwellings" NA "Separate house"
##  $ 1981     : chr  "-" "-" NA "-"
##  $ 1986     : chr  "-" "-" NA "-"
##  $ 1991     : chr  "-" "-" NA "-"
##  $ 1996     : chr  "-" "-" NA "-"
##  $ 2001     : chr  "-" "-" NA "-"
##  $ 2006     : num  86 9 NA 95
##  $ 2011     : num  140 10 NA 150

dim(Dwell4)

## [1] 4 8

head(Dwell4)

## # A tibble: 4 x 8
##   Dwellings               `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                   <chr>  <chr>  <chr>  <chr>  <chr>   <dbl>  <dbl>
## 1 Occupied private dwell~ -      -      -      -      -          86    140
## 2 Unoccupied private dwe~ -      -      -      -      -           9     10
## 3 <NA>                    <NA>   <NA>   <NA>   <NA>   <NA>       NA     NA
## 4 Separate house          -      -      -      -      -          95    150

sum4 <- read_xls('Traralgon-South.xls',sheet="Data", skip=8,n_max=5)
str(sum4)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  8 variables:
##  $ Summary data: chr  "Total population (enumerated)*" "Male: Female ratio*" "Visitors on Census night*" "Population in non-private dwellings*" ...
##  $ 1981        : chr  "-" "-" "-" "-" ...
##  $ 1986        : chr  "-" "-" "-" "-" ...
##  $ 1991        : chr  "-" "-" "-" "-" ...
##  $ 1996        : chr  "-" "-" "-" "-" ...
##  $ 2001        : chr  "-" "-" "-" "-" ...
##  $ 2006        : num  284 85.6 6 0 141
##  $ 2011        : num  454 97.4 9 0 288

dim(sum4)

## [1] 5 8

head(sum4)

## # A tibble: 5 x 8
##   `Summary data`          `1981` `1986` `1991` `1996` `2001` `2006` `2011`
##   <chr>                   <chr>  <chr>  <chr>  <chr>  <chr>   <dbl>  <dbl>
## 1 Total population (enum~ -      -      -      -      -       284    454  
## 2 Male: Female ratio*     -      -      -      -      -        85.6   97.4
## 3 Visitors on Census nig~ -      -      -      -      -         6      9  
## 4 Population in non-priv~ -      -      -      -      -         0      0  
## 5 Population living same~ -      -      -      -      -       141    288

Step 2 - Impute the town name for each dataset

To enable the joining of data, we were required to manually input the town name for each of dataset based on the datafile.

Smythesdale

sum1 <- mutate(sum1,Town="Smythesdale")
area1 <- mutate(area1,Town="Smythesdale") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell1 <- mutate(Dwell1,Town="Smythesdale")
Age1 <- mutate(Age1,Town="Smythesdale")

Tallarook

sum2 <- mutate(sum2,Town="Tallarook")
area2 <- mutate(area2,Town="Tallarook") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell2 <- mutate(Dwell2,Town="Tallarook")
Age2 <- mutate(Age2,Town="Tallarook")

Skipton

sum3 <- mutate(sum3,Town="Skipton")
area3 <- mutate(area3,Town="Skipton") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell3 <- mutate(Dwell3,Town="Skipton")
Age3 <- mutate(Age3,Town="Skipton")

Traralgon-South

sum4 <- mutate(sum4,Town="Traralgon-South")
area4 <- mutate(area4,Town="Traralgon-South") %>% separate(1,c("Area_Year","Area"),sep=": ")
Dwell4 <- mutate(Dwell4,Town="Traralgon-South")
Age4 <- mutate(Age4,Town="Traralgon-South")

Step 3 - The inital joins

In this step, we join each dataframe for each town into one dataframe of the same type, ready for the final join. We did the final data manipulation for each data file as required.

Summary data

To set up the summary dataframe, we needed to change the format of some of the variables from character to numeric format for Smythesdale 1981,and Traralgon-South 1981 to 2001 as these variables have been read in as a character, whilst the same variables for Tallarook and Skipton, and 1986-2001 for Smythesdale read in as numeric. NA’s are created by coercion.

sum1$`1981` <- as.numeric(sum1$`1981`)
sum4$`1981` <- as.numeric(sum4$`1981`)
sum4$`1986` <- as.numeric(sum4$`1986`)
sum4$`1991` <- as.numeric(sum4$`1991`)
sum4$`1996` <- as.numeric(sum4$`1996`)
sum4$`2001` <- as.numeric(sum4$`2001`)

Once the characters are converted, we combined the summary data. We then gather the Years as observations with the count as a new variable. We then spread the Town, Year, Male: Female ratio, Population in non-private dwellings, Population living same address 5 years ago, Total population (enumerated), and Visitors on Census night from observations to variables.

popsummary <- bind_rows(sum1,sum2,sum3,sum4)
popsummary <- gather(popsummary,2:8,key="Year",value="Count")
popsummary <- spread(popsummary,key = `Summary data`,value = 'Count')
str(popsummary)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  7 variables:
##  $ Town                                      : chr  "Skipton" "Skipton" "Skipton" "Skipton" ...
##  $ Year                                      : chr  "1981" "1986" "1991" "1996" ...
##  $ Male: Female ratio*                       : num  102 104.8 104.4 101.3 93.2 ...
##  $ Population in non-private dwellings*      : num  17 9 3 23 19 17 26 NA 0 0 ...
##  $ Population living same address 5 years ago: num  283 274 245 266 271 ...
##  $ Total population (enumerated)*            : num  505 508 462 453 454 ...
##  $ Visitors on Census night*                 : num  20 11 9 11 9 17 15 NA 11 0.531 ...

dim(popsummary)

## [1] 28  7

head(popsummary)

## # A tibble: 6 x 7
##   Town  Year  `Male: Female r~ `Population in ~ `Population liv~
##   <chr> <chr>            <dbl>            <dbl>            <dbl>
## 1 Skip~ 1981             102                 17              283
## 2 Skip~ 1986             105.                 9              274
## 3 Skip~ 1991             104.                 3              245
## 4 Skip~ 1996             101.                23              266
## 5 Skip~ 2001              93.2               19              271
## 6 Skip~ 2006              92.7               17              254
## # ... with 2 more variables: `Total population (enumerated)*` <dbl>,
## #   `Visitors on Census night*` <dbl>

Local Government Area

Here we join the area data. We then separate the Variable Area Year to separate variables containing the word “Area” and the corresponding year. We then drop all variables except for Year, Area, and Town.

LGA <- bind_rows(area1,area2,area3,area4)
LGA <-separate(LGA, Area_Year,c("Area2","Year"),sep=" ")
LGA <- LGA[,c(2,3,10)]
str(LGA)

## Classes 'tbl_df', 'tbl' and 'data.frame':    8 obs. of  3 variables:
##  $ Year: chr  "2006" "2011" "2006" "2011" ...
##  $ Area: chr  "1.4 km²" "5.92 km²" "3.64 km²" "1.41 km²" ...
##  $ Town: chr  "Smythesdale" "Smythesdale" "Tallarook" "Tallarook" ...

dim(LGA)

## [1] 8 3

head(LGA)

## # A tibble: 6 x 3
##   Year  Area     Town       
##   <chr> <chr>    <chr>      
## 1 2006  1.4 km²  Smythesdale
## 2 2011  5.92 km² Smythesdale
## 3 2006  3.64 km² Tallarook  
## 4 2011  1.41 km² Tallarook  
## 5 2006  4.92 km² Skipton    
## 6 2011  3.63 km² Skipton

Dwelling Data

Similar to the summary data, we need to convert some of the year data from character to numeric variable to enable the join. NA’s are created by coercion.

Dwell1$`1981` <- as.numeric(Dwell1$`1981`)
Dwell4$`1981` <- as.numeric(Dwell4$`1981`)
Dwell4$`1986` <- as.numeric(Dwell4$`1986`)
Dwell4$`1991` <- as.numeric(Dwell4$`1991`)
Dwell4$`1996` <- as.numeric(Dwell4$`1996`)
Dwell4$`2001` <- as.numeric(Dwell4$`2001`)

We then bind the 4 dataframes into one, gather the Year variables into observations, and spread Occupied private dwellings, Unoccupied private dwellings, and Separate house into seperate variables.

Towndwell <- bind_rows(Dwell1,Dwell2,Dwell3,Dwell4)
Towndwell <- gather(Towndwell,2:8,key="Year",value="Count")
Towndwell <- spread(Towndwell,key = `Dwellings`, value="Count")
str(Towndwell)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  6 variables:
##  $ Town                        : chr  "Skipton" "Skipton" "Skipton" "Skipton" ...
##  $ Year                        : chr  "1981" "1986" "1991" "1996" ...
##  $ Occupied private dwellings  : num  176 191 186 187 192 ...
##  $ Separate house              : num  195 205 212 229 220 ...
##  $ Unoccupied private dwellings: num  31 34 48 56 52 ...
##  $ <NA>                        : num  NA NA NA NA NA NA NA NA NA NA ...

dim(Towndwell)

## [1] 28  6

head(Towndwell)

## # A tibble: 6 x 6
##   Town   Year  `Occupied privat~ `Separate house` `Unoccupied priv~ `<NA>`
##   <chr>  <chr>             <dbl>            <dbl>             <dbl>  <dbl>
## 1 Skipt~ 1981                176             195.                31     NA
## 2 Skipt~ 1986                191             205.                34     NA
## 3 Skipt~ 1991                186             212                 48     NA
## 4 Skipt~ 1996                187             229                 56     NA
## 5 Skipt~ 2001                192             220                 52     NA
## 6 Skipt~ 2006                203             229                 41     NA

Age data

Again, we needed to convert some variables from character to numeric variables. NA’s are created by coercion.

Age1$`1981` <- as.numeric(Age1$`1981`)
Age4$`1981` <- as.numeric(Age4$`1981`)
Age4$`1986` <- as.numeric(Age4$`1986`)
Age4$`1991` <- as.numeric(Age4$`1991`)
Age4$`1996` <- as.numeric(Age4$`1996`)
Age4$`2001` <- as.numeric(Age4$`2001`)

We then bind the Age Structures dataframes into one, gathering the Years into observations and then spreading the age brackets into variables.

Age <- bind_rows(Age1,Age2,Age3,Age4)
Age <- gather(Age,2:8,key="Year",value="Count")
Age <- spread(Age,key = `Age structure`, value="Count")
str(Age)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  8 variables:
##  $ Town : chr  "Skipton" "Skipton" "Skipton" "Skipton" ...
##  $ Year : chr  "1981" "1986" "1991" "1996" ...
##  $ 0-17 : num  150 146 116 104 106 ...
##  $ 18-24: num  49.8 35.2 29.8 22.5 10.8 ...
##  $ 25-44: num  112.6 129.6 123.7 99.7 109.5 ...
##  $ 45-64: num  119 127 103 119 125 ...
##  $ 65-74: num  53.8 42.9 55.1 58.7 54.2 ...
##  $ 75+  : num  19.9 27 34.8 49 48.2 ...

dim(Age)

## [1] 28  8

head(Age)

## # A tibble: 6 x 8
##   Town    Year  `0-17` `18-24` `25-44` `45-64` `65-74` `75+`
##   <chr>   <chr>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 Skipton 1981   150.     49.8   113.     119.    53.8  19.9
## 2 Skipton 1986   146.     35.2   130.     127.    42.9  27.0
## 3 Skipton 1991   116.     29.8   124.     103.    55.1  34.8
## 4 Skipton 1996   104.     22.5    99.7    119.    58.7  49.0
## 5 Skipton 2001   106.     10.8   110.     125.    54.2  48.2
## 6 Skipton 2006    97.4    17.0    87.3    136.    68.3  66.4

Full Join

Finally, we did the final join. The final structure, dimensions and head of the data set will be discussed in the following section.

FullData <- full_join(Age,popsummary,Age,by=c("Town","Year"))
FullData <- left_join(FullData, LGA, by=c("Town","Year"))
FullData$Town <- as.factor(FullData$Town)
FullData$Year <- as.ordered(FullData$Year)

Understand

We now have a data frame with 28 observations and 14 variables.

Town - a factor variable with 4 levels. Contains the name of the towns.
Year - a ordered factor with 7 levels. Contains the corresponding year of each observation.
0-17 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
18-24 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
25-44 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
45-64 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
65-74 - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
75+ - a numeric variable containing the count of occupants in the corresponding age bracket in each observations.
Male: Female ratio* - a Numeric variable. The ratio of male to females.
Population in non-private dwellings* - a numeric variable showing how many people were occupoing non-private dwellings on census night.
Population living same address 5 years ago - a numeric variable Shows how many poeple are still living in the same address as years ago.
Total population (enumerated)* - a numeric variable showing the total popluation on census night.
Visitors on Census night* - a numeric variable showing the number of visitors to the towns on census night.
Area - a character variable showing the area kilometers squared.

str(FullData)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  14 variables:
##  $ Town                                      : Factor w/ 4 levels "Skipton","Smythesdale",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Year                                      : Ord.factor w/ 7 levels "1981"<"1986"<..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ 0-17                                      : num  150 146 116 104 106 ...
##  $ 18-24                                     : num  49.8 35.2 29.8 22.5 10.8 ...
##  $ 25-44                                     : num  112.6 129.6 123.7 99.7 109.5 ...
##  $ 45-64                                     : num  119 127 103 119 125 ...
##  $ 65-74                                     : num  53.8 42.9 55.1 58.7 54.2 ...
##  $ 75+                                       : num  19.9 27 34.8 49 48.2 ...
##  $ Male: Female ratio*                       : num  102 104.8 104.4 101.3 93.2 ...
##  $ Population in non-private dwellings*      : num  17 9 3 23 19 17 26 NA 0 0 ...
##  $ Population living same address 5 years ago: num  283 274 245 266 271 ...
##  $ Total population (enumerated)*            : num  505 508 462 453 454 ...
##  $ Visitors on Census night*                 : num  20 11 9 11 9 17 15 NA 11 0.531 ...
##  $ Area                                      : chr  NA NA NA NA ...

dim(FullData)

## [1] 28 14

head(FullData)

## # A tibble: 6 x 14
##   Town  Year  `0-17` `18-24` `25-44` `45-64` `65-74` `75+` `Male: Female r~
##   <fct> <ord>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>            <dbl>
## 1 Skip~ 1981   150.     49.8   113.     119.    53.8  19.9            102  
## 2 Skip~ 1986   146.     35.2   130.     127.    42.9  27.0            105. 
## 3 Skip~ 1991   116.     29.8   124.     103.    55.1  34.8            104. 
## 4 Skip~ 1996   104.     22.5    99.7    119.    58.7  49.0            101. 
## 5 Skip~ 2001   106.     10.8   110.     125.    54.2  48.2             93.2
## 6 Skip~ 2006    97.4    17.0    87.3    136.    68.3  66.4             92.7
## # ... with 5 more variables: `Population in non-private dwellings*` <dbl>,
## #   `Population living same address 5 years ago` <dbl>, `Total population
## #   (enumerated)*` <dbl>, `Visitors on Census night*` <dbl>, Area <chr>

Tidy & Manipulate Data I

The final dataframe is almost completly in tidy format, as it was done during the data reading stage. The final step is to split out the area variable, ready for the next step. We create two new variables, the Area Km2 containing the numerical value and a default variable Dot, which we subsequently dropped, containing the characters ‘Km2’. We then convert the Area Km2 to numeric values.

Sub_Group <- separate(FullData, Area,c("Area Km2","Dot"),sep=" ") 
Sub_Group <- Sub_Group[,1:14]
Sub_Group$`Area Km2` <- as.numeric(Sub_Group$`Area Km2`)
str(Sub_Group)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  14 variables:
##  $ Town                                      : Factor w/ 4 levels "Skipton","Smythesdale",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Year                                      : Ord.factor w/ 7 levels "1981"<"1986"<..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ 0-17                                      : num  150 146 116 104 106 ...
##  $ 18-24                                     : num  49.8 35.2 29.8 22.5 10.8 ...
##  $ 25-44                                     : num  112.6 129.6 123.7 99.7 109.5 ...
##  $ 45-64                                     : num  119 127 103 119 125 ...
##  $ 65-74                                     : num  53.8 42.9 55.1 58.7 54.2 ...
##  $ 75+                                       : num  19.9 27 34.8 49 48.2 ...
##  $ Male: Female ratio*                       : num  102 104.8 104.4 101.3 93.2 ...
##  $ Population in non-private dwellings*      : num  17 9 3 23 19 17 26 NA 0 0 ...
##  $ Population living same address 5 years ago: num  283 274 245 266 271 ...
##  $ Total population (enumerated)*            : num  505 508 462 453 454 ...
##  $ Visitors on Census night*                 : num  20 11 9 11 9 17 15 NA 11 0.531 ...
##  $ Area Km2                                  : num  NA NA NA NA NA 4.92 3.63 NA NA NA ...

dim(Sub_Group)

## [1] 28 14

head(Sub_Group)

## # A tibble: 6 x 14
##   Town  Year  `0-17` `18-24` `25-44` `45-64` `65-74` `75+` `Male: Female r~
##   <fct> <ord>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>            <dbl>
## 1 Skip~ 1981   150.     49.8   113.     119.    53.8  19.9            102  
## 2 Skip~ 1986   146.     35.2   130.     127.    42.9  27.0            105. 
## 3 Skip~ 1991   116.     29.8   124.     103.    55.1  34.8            104. 
## 4 Skip~ 1996   104.     22.5    99.7    119.    58.7  49.0            101. 
## 5 Skip~ 2001   106.     10.8   110.     125.    54.2  48.2             93.2
## 6 Skip~ 2006    97.4    17.0    87.3    136.    68.3  66.4             92.7
## # ... with 5 more variables: `Population in non-private dwellings*` <dbl>,
## #   `Population living same address 5 years ago` <dbl>, `Total population
## #   (enumerated)*` <dbl>, `Visitors on Census night*` <dbl>, `Area
## #   Km2` <dbl>

Tidy & Manipulate Data II

We want to create a new character - the population density for each town. We do this by calculating the Total population (enumerated)* devided by the Area Km2 variable to create the new variable Population Density.

Sub_Group <- Sub_Group %>% 
    mutate(`Population Density` = `Total population (enumerated)*`/Sub_Group$`Area Km2`, rm.na=TRUE)
Sub_Group <- Sub_Group[,1:15]
Sub_Group$`Population Density` <- Sub_Group$`Population Density` %>% round(2)

Scan I

Here we are scanning the datafile for missing variables. As can be seen, there is 6 missing variables in 13 of the varibales, and 20 for the Area variable.

colSums(is.na(Sub_Group))

##                                       Town 
##                                          0 
##                                       Year 
##                                          0 
##                                       0-17 
##                                          6 
##                                      18-24 
##                                          6 
##                                      25-44 
##                                          6 
##                                      45-64 
##                                          6 
##                                      65-74 
##                                          6 
##                                        75+ 
##                                          6 
##                        Male: Female ratio* 
##                                          6 
##       Population in non-private dwellings* 
##                                          6 
## Population living same address 5 years ago 
##                                          6 
##             Total population (enumerated)* 
##                                          6 
##                  Visitors on Census night* 
##                                          6 
##                                   Area Km2 
##                                         20 
##                         Population Density 
##                                         20

is.na(Sub_Group)

##        Town  Year  0-17 18-24 25-44 45-64 65-74   75+ Male: Female ratio*
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##  [8,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [22,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
## [23,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
## [24,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
## [25,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
## [26,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE                TRUE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE               FALSE
##       Population in non-private dwellings*
##  [1,]                                FALSE
##  [2,]                                FALSE
##  [3,]                                FALSE
##  [4,]                                FALSE
##  [5,]                                FALSE
##  [6,]                                FALSE
##  [7,]                                FALSE
##  [8,]                                 TRUE
##  [9,]                                FALSE
## [10,]                                FALSE
## [11,]                                FALSE
## [12,]                                FALSE
## [13,]                                FALSE
## [14,]                                FALSE
## [15,]                                FALSE
## [16,]                                FALSE
## [17,]                                FALSE
## [18,]                                FALSE
## [19,]                                FALSE
## [20,]                                FALSE
## [21,]                                FALSE
## [22,]                                 TRUE
## [23,]                                 TRUE
## [24,]                                 TRUE
## [25,]                                 TRUE
## [26,]                                 TRUE
## [27,]                                FALSE
## [28,]                                FALSE
##       Population living same address 5 years ago
##  [1,]                                      FALSE
##  [2,]                                      FALSE
##  [3,]                                      FALSE
##  [4,]                                      FALSE
##  [5,]                                      FALSE
##  [6,]                                      FALSE
##  [7,]                                      FALSE
##  [8,]                                       TRUE
##  [9,]                                      FALSE
## [10,]                                      FALSE
## [11,]                                      FALSE
## [12,]                                      FALSE
## [13,]                                      FALSE
## [14,]                                      FALSE
## [15,]                                      FALSE
## [16,]                                      FALSE
## [17,]                                      FALSE
## [18,]                                      FALSE
## [19,]                                      FALSE
## [20,]                                      FALSE
## [21,]                                      FALSE
## [22,]                                       TRUE
## [23,]                                       TRUE
## [24,]                                       TRUE
## [25,]                                       TRUE
## [26,]                                       TRUE
## [27,]                                      FALSE
## [28,]                                      FALSE
##       Total population (enumerated)* Visitors on Census night* Area Km2
##  [1,]                          FALSE                     FALSE     TRUE
##  [2,]                          FALSE                     FALSE     TRUE
##  [3,]                          FALSE                     FALSE     TRUE
##  [4,]                          FALSE                     FALSE     TRUE
##  [5,]                          FALSE                     FALSE     TRUE
##  [6,]                          FALSE                     FALSE    FALSE
##  [7,]                          FALSE                     FALSE    FALSE
##  [8,]                           TRUE                      TRUE     TRUE
##  [9,]                          FALSE                     FALSE     TRUE
## [10,]                          FALSE                     FALSE     TRUE
## [11,]                          FALSE                     FALSE     TRUE
## [12,]                          FALSE                     FALSE     TRUE
## [13,]                          FALSE                     FALSE    FALSE
## [14,]                          FALSE                     FALSE    FALSE
## [15,]                          FALSE                     FALSE     TRUE
## [16,]                          FALSE                     FALSE     TRUE
## [17,]                          FALSE                     FALSE     TRUE
## [18,]                          FALSE                     FALSE     TRUE
## [19,]                          FALSE                     FALSE     TRUE
## [20,]                          FALSE                     FALSE    FALSE
## [21,]                          FALSE                     FALSE    FALSE
## [22,]                           TRUE                      TRUE     TRUE
## [23,]                           TRUE                      TRUE     TRUE
## [24,]                           TRUE                      TRUE     TRUE
## [25,]                           TRUE                      TRUE     TRUE
## [26,]                           TRUE                      TRUE     TRUE
## [27,]                          FALSE                     FALSE    FALSE
## [28,]                          FALSE                     FALSE    FALSE
##       Population Density
##  [1,]               TRUE
##  [2,]               TRUE
##  [3,]               TRUE
##  [4,]               TRUE
##  [5,]               TRUE
##  [6,]              FALSE
##  [7,]              FALSE
##  [8,]               TRUE
##  [9,]               TRUE
## [10,]               TRUE
## [11,]               TRUE
## [12,]               TRUE
## [13,]              FALSE
## [14,]              FALSE
## [15,]               TRUE
## [16,]               TRUE
## [17,]               TRUE
## [18,]               TRUE
## [19,]               TRUE
## [20,]              FALSE
## [21,]              FALSE
## [22,]               TRUE
## [23,]               TRUE
## [24,]               TRUE
## [25,]               TRUE
## [26,]               TRUE
## [27,]              FALSE
## [28,]              FALSE

The variables that are missing are Smythesdale 1981, and Traralgon-South 1986-2001. Smythesdale, Traralgon-South, Tallarook and Skipton have NA’s in Area Km2 and Population Density for all years prior to 2006.

As such, we’ve made the descision to omit the missing values, as we cannot reliably calcule the missing data. This reduces the dataframe to 8 observations.

Sub_Group <- na.omit(Sub_Group)
colSums(is.na(Sub_Group))

##                                       Town 
##                                          0 
##                                       Year 
##                                          0 
##                                       0-17 
##                                          0 
##                                      18-24 
##                                          0 
##                                      25-44 
##                                          0 
##                                      45-64 
##                                          0 
##                                      65-74 
##                                          0 
##                                        75+ 
##                                          0 
##                        Male: Female ratio* 
##                                          0 
##       Population in non-private dwellings* 
##                                          0 
## Population living same address 5 years ago 
##                                          0 
##             Total population (enumerated)* 
##                                          0 
##                  Visitors on Census night* 
##                                          0 
##                                   Area Km2 
##                                          0 
##                         Population Density 
##                                          0

dim(Sub_Group)

## [1]  8 15

Scan II

We’ve selected the z-score method to scan for outliers. An examination of the results show no outliers, as there are no absolut valuesof z-scores > 3, we can determine there are no outliers.

z.scores <- Sub_Group %>%  scores(type = "z")
z.scores %>% summary()

##       Town          Year          0-17             18-24        
##  Min.   : NA   Min.   : NA   Min.   :-1.0133   Min.   :-1.0963  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.:-0.6313   1st Qu.:-0.7443  
##  Median : NA   Median : NA   Median :-0.2449   Median :-0.1734  
##  Mean   :NaN   Mean   :NaN   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: 0.3580   3rd Qu.: 0.5304  
##  Max.   : NA   Max.   : NA   Max.   : 2.0427   Max.   : 1.8742  
##  NA's   :8     NA's   :8                                        
##      25-44             45-64              65-74              75+         
##  Min.   :-1.2624   Min.   :-1.53513   Min.   :-1.3610   Min.   :-0.8198  
##  1st Qu.:-0.6662   1st Qu.:-0.63529   1st Qu.:-0.6955   1st Qu.:-0.5697  
##  Median :-0.2229   Median : 0.03866   Median :-0.2104   Median :-0.3803  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9169   3rd Qu.: 0.85252   3rd Qu.: 0.7258   3rd Qu.: 0.1379  
##  Max.   : 1.2791   Max.   : 1.23816   Max.   : 1.7212   Max.   : 1.9079  
##                                                                          
##  Male: Female ratio* Population in non-private dwellings*
##  Min.   :-1.2702     Min.   :-0.5249                     
##  1st Qu.:-0.6490     1st Qu.:-0.5249                     
##  Median :-0.2029     Median :-0.5249                     
##  Mean   : 0.0000     Mean   : 0.0000                     
##  3rd Qu.: 0.5940     3rd Qu.:-0.1099                     
##  Max.   : 1.4437     Max.   : 2.0143                     
##                                                          
##  Population living same address 5 years ago Total population (enumerated)*
##  Min.   :-1.30668                           Min.   :-1.36181              
##  1st Qu.:-0.83571                           1st Qu.:-0.75435              
##  Median : 0.03895                           Median : 0.02982              
##  Mean   : 0.00000                           Mean   : 0.00000              
##  3rd Qu.: 0.76842                           3rd Qu.: 0.88689              
##  Max.   : 1.20044                           Max.   : 1.06802              
##                                                                           
##  Visitors on Census night*    Area Km2       Population Density
##  Min.   :-1.1035           Min.   :-1.1580   Min.   :-1.1436   
##  1st Qu.:-0.6621           1st Qu.:-1.0200   1st Qu.:-0.6789   
##  Median :-0.1226           Median : 0.1146   Median :-0.2861   
##  Mean   : 0.0000           Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.3678           3rd Qu.: 0.8079   3rd Qu.: 0.6313   
##  Max.   : 1.6430           Max.   : 1.4157   Max.   : 1.7294   
##

which(abs(z.scores) >3 )

## integer(0)

length(which( abs(z.scores) >3 ))

## [1] 0

Transform

To demonstrate the transformation of data, we’ve selected the variable 0-17, which shows a right-skewed distribution. We’ve selected the Box-Cox transformation to transform the data into a normal distribution.

hist(Sub_Group$`0-17`, main = "Distribution of 0-17 age group", col="green")

boxcox_age<- BoxCox(Sub_Group$`75+`,lambda = "auto")
hist(boxcox_age, main = "Box-Cox transformation of 0-17 age group", col="purple")

MATH2349 Semester 2, 2018 - Preparing data for processing - Population density of towns in Victoria, Australia

Assignment 3

Luke Browne - s3689420, Simon Boyd - s3046049

Required packages

Executive Summary

Read, Tidy, Manipulate, and Join the Data

Step 1 - Read in the raw data steps

Smythesdale

Tallarook

Skipton

Traralgon-South

Step 2 - Impute the town name for each dataset

Smythesdale

Tallarook

Skipton

Traralgon-South

Step 3 - The inital joins

Summary data

Local Government Area

Dwelling Data

Age data

Full Join

Understand

Tidy & Manipulate Data I

Tidy & Manipulate Data II

Scan I

Scan II

Transform