MATH2349 Data Wrangling

Required packages

library(readr)
library(knitr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

Executive Summary

Two datasets were downloaded from City of Melbourne’s Open Data repository. One provides population forecasts for each year between 2016 and 2041 by geographic area and gender (other categories present were ignored for the purposes of this investigation.) The other one includes responses to a survey of the City of Melbourne’s population by gender.

Only relevant columns were selected from each dataset. Rows relating to other sub-geographies or population segments were also filtered out.

The population forecasts dataset was untidy. It didn’t have a “Gender” column, rather it forecast the population for each gender in a separate column and one row for each year. For this investigation it was transformed into a dataset with one row for each year and gender with a “Gender” and a “Population” column.

This transformation made it possible to join the two datasets on “Gender” and “Year”. The responses in the survey which were expressed as percentages were converted to proportions and then multiplied with the estimated male and female populations respectively to estimate the total number of people in the City of Melbourne who might say e.g. that they participate in adequate physical activity for 30 minutes or more on 4 or more days per week.

Data

Two data sets have been sourced from the City of Melbourne:

City of Melbourne Population Forecasts 2016 to 2041 - Age and Gender

This dataset contains population forecasts for the city of Melbourne by Age and Gender.

More details on the forecasts can be found here: https://data.melbourne.vic.gov.au/People/City-of-Melbourne-Population-Forecasts-2016-to-204/vtsx-jhki

The data has been directly downloaded in CSV format from this link: https://data.melbourne.vic.gov.au/api/views/vtsx-jhki/rows.csv?accessType=DOWNLOAD

The original dataset contains 59 columns, however, for this analysis only four columns are required, therefore the other columns have been dropped in the first step.

Social Indicators for City of Melbourne Residents 2018

This dataset contains the result of a survey of 1247 residents of the City of Melbourne in 2018.

A more detailed description of the survey can be found here: https://data.melbourne.vic.gov.au/People/Social-Indicators-for-City-of-Melbourne-Residents-/n9ie-cp6t

The data has been directly downloaded in CSV format from this link: https://data.melbourne.vic.gov.au/api/views/n9ie-cp6t/rows.csv?accessType=DOWNLOAD

The original dataset contains 10 columns, however, for this analysis only six columns are required, therefore the other columns have been dropped in the first step.

populationForecast <- 
  read.csv("~/City_of_Melbourne_Population_Forecasts_2016_to_2041_-_Age_and_Gender.csv", 
           stringsAsFactors = TRUE) %>% 
  select(Year, Geography, Male.population, Female.population)

populationForecast %>% head()

socialIndicators <- read.csv("~/Social_Indicators_for_City_of_Melbourne_Residents_2018.csv",
                             stringsAsFactors = TRUE) %>% 
  select(YEAR, DESCRIPTION, RESPONSE, RESPONDENT.GROUP, RESULT, FORMAT)

socialIndicators %>% head()

Understand

Population Forecast Dataset

After dropping columns irrelevant to our investigation, the dataset has these four columns:

Year: An integer column referring to the year for which the population figures have been forecast. Since each value refers to an entire calendar, rather than a particular instant in time, it is appropriate that this is not a date variable. It is however more than just a “label”. For example, it is completely valid to subtract two years from each other to get the difference in years between two forecasts. Also, while it would be possible to replace it with a variable indicating the start date or the end date of each given year, for the purposes of this investigation there is no value in doing so.

Geography: This string has correctly been converted to a factor as part of the load. This factors refer to various geographical areas within the City of Melbourne for which forecasts have been provided. Note however, that these areas aren’t necessary mutually exclusive. E.g. “Carlton” or “Melbourne (CBD)” are sub-divisions of “City of Melbourne” which refers to the entire Local Government Area (LGA). For the purposes of our investigation, we are only interested in forecasts for the entire LGA. Therefore we will apply a filter on this column in a subsequent step.

Male.population: An integer column. It indicates the forecast number of male residents for the given year and geography.

Female.population: An integer column. It indicates the forecast number of female residents for the given year and geography.

Note that we will tidy this dataset in a subsequent step which will introduce a “Gender” column and a “Population” column to replace the two columns above.

str(populationForecast)

## 'data.frame':    364 obs. of  4 variables:
##  $ Year             : int  2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 ...
##  $ Geography        : Factor w/ 14 levels "Carlton","City of Melbourne",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Male.population  : int  71716 78097 83508 88309 93456 98907 102919 107515 112129 117007 ...
##  $ Female.population: int  74383 81040 86455 91501 96783 102358 106433 111252 116069 121137 ...

Social Indicators Dataset

After dropping columns irrelevant to our investigation, the dataset has these six columns:

YEAR: An integer column referring to the year in which the survey was performed. In this dataset the only value is “2018”.

DESCRIPTION: This string has correctly been converted to a factor as part of the load. It describes the social indicator that was measured in the survey.

RESPONSE: The survey response. This is not a “free text” response but a fixed set of answers. This string has correctly been converted to a factor as part of the load.

RESPONDENT.GROUP: This string has correctly been converted to a factor as part of the load. It describes the group surveyed. We will filter on this column since we are only interested in the “Male” and “Female” groups.

RESULT: The numeric result. Either an average score or a percentage (as indicated by the “FORMAT” column).

FORMAT: This string has correctly been converted to a factor as part of the load. The allowable values are either “Average” or “Per cent”. We will apply a filter on this column in a subsequent step since we are only interested in “Per cent” indicators.

str(socialIndicators)

## 'data.frame':    666 obs. of  6 variables:
##  $ YEAR            : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ DESCRIPTION     : Factor w/ 36 levels "Aware of traditional owners of Melbourne",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ RESPONSE        : Factor w/ 10 levels "Able to name both Wurundjeri and Boonwurrung",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ RESPONDENT.GROUP: Factor w/ 18 levels "18-24 years",..: 8 7 9 10 12 14 15 16 18 17 ...
##  $ RESULT          : num  52.3 53.8 57.6 57.7 49.7 55 46.7 46.5 48 60.1 ...
##  $ FORMAT          : Factor w/ 2 levels "Average","Per cent": 2 2 2 2 2 2 2 2 2 2 ...

Tidy & Manipulate Data I

The Population Forecasts Dataset is untidy because population sizes for “Male” and “Female” genders are stored in separate columns (“Male.population” and “Female.population”). To make the dataset tidy, we should have a “Gender” column and a “Population” column. This way, rather than two observations per row, we will have one observation per row and the number of rows will double.

Before applying the gather() function to “unpivot” the dataset, we’ll rename the source columns to “Male” and “Female” so that these names will become the values in our new Gender column. Because we want “Gender” to be a factor, we set factor_key to TRUE.

colnames(populationForecast)[3] <- "Male"
colnames(populationForecast)[4] <- "Female"
populationForecast <- populationForecast %>% 
  gather(`Male`, `Female`, key = "Gender", value = "Population", 
         factor_key = TRUE)

str(populationForecast)

## 'data.frame':    728 obs. of  4 variables:
##  $ Year      : int  2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 ...
##  $ Geography : Factor w/ 14 levels "Carlton","City of Melbourne",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Gender    : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population: int  71716 78097 83508 88309 93456 98907 102919 107515 112129 117007 ...

head(populationForecast)

Tidy & Manipulate Data II

In the Social Indicators dataset we are only in responses in “Per cent” format and only in responses for respondent groups “Male” and “Female”.

First we’ll apply a filter to only select the rows we are interested in.

Now that the “RESULT” column only contains percentages, we can create a new “PROPORTION” column from it by dividing all values by 100.

Since for each question there is only one response (the “yes”/“true” response) we’ll also concatenate the “DESCRIPTION” and “RESPONSE” columns into a new “Indicator” column.

Since the responses are intended to represent the Male and Female population of the entire City of Melbourne LGA, we also want to filter the Population Forecast dataset on the Geography column.

socialIndicators <- socialIndicators %>% filter(FORMAT == "Per cent") %>% 
  filter(RESPONDENT.GROUP == "Male" | RESPONDENT.GROUP == "Female")
socialIndicators <- socialIndicators %>% mutate(PROPORTION = RESULT / 100.0) %>% 
  mutate(Indicator = (paste(DESCRIPTION, RESPONSE, sep = ": ")))

head(socialIndicators)

populationForecast <- 
  populationForecast %>% filter(Geography == 'City of Melbourne')

head(populationForecast)

Scan I

Neither data frame has any missing values.

For the Social Indicators dataset, the proportions are between 0.03 and 0.96. If we had encountered negative values or values above 1 then that would have indicated an issue with the source data. The only “YEAR” value is 2018 as was expected since the survey was only run in 2018.

For the Population Forecast dataset, the years range from 2016 to 2041 and the Population values (which are split by Gender) range from 71716 to 195600. There is no reason to assume that there are any obvious errors.

is.na(socialIndicators) %>% sum

## [1] 0

is.na(populationForecast) %>% sum

## [1] 0

summary(socialIndicators$PROPORTION)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.2233  0.4960  0.4778  0.6927  0.9600

summary(socialIndicators$YEAR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2018    2018    2018    2018    2018    2018

summary(populationForecast$Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2022    2028    2028    2035    2041

summary(populationForecast$Population)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   71716  105554  135299  134589  164130  195600

Scan II

We will visually examine the population by gender estimates for potential outliers using two chart types. A boxplot (shower both genders) and two histograms (one for each gender).

The boxplot does not indicate any outliers.

The histograms for both male and female population estimates show roughly linear growth with no obvious anomalies.

# This is the R chunk for the Scan II
boxplot(populationForecast$Population ~ populationForecast$Gender, 
        main="City of Melbourne Population Forecasts 2016-2041 by gender", 
        ylab = "Population", xlab = "Gender")

populationForecast %>% filter(Gender == 'Male') %>% 
  plot(Population ~ Year, data = ., 
       main="City of Melbourne Male Population Forecast")

populationForecast %>% filter(Gender == 'Female') %>% 
  plot(Population ~ Year, data = ., 
       main="City of Melbourne Female Population Forecast")

Transform

We can now join the two datasets. We will join on two columns, “Year” and “Gender”. Since the Social Indicators were only surveyed for 2018, applying an inner join will remove the population forecasts for other years from the resulting joined dataset.

Having joined the datasets, we can also derive another column called “Count” by multiplying the “Proportion” column previously derived from the Social Indicators dataset and the “Population” column that we created when untidying the Population Forecast dataset and then rounding the result to an integer. Assuming the samples are representative of the population, the Count estimates the total number of people who would have given that response, had the entire population been surveyed and had the population estimate been accurate.

Since we know that this dataset contains data for 2018 and the City of Melbourne, the is no need to display those columns and for aesthetic (space saving) purposes, we’ll only preview the Gender, Indicator and Count columns.

# This is the R chunk for the Transform Section
result <- populationForecast %>% 
  inner_join(socialIndicators, 
             by = c("Year" = "YEAR", "Gender" = "RESPONDENT.GROUP")) %>% 
  mutate(Count = round(PROPORTION * Population)) 


result %>% select(Gender, Indicator, Count) %>% head