Data Wrangling - Australian Cities

Setup

# Load packages 
library(readr)
library(knitr)
library(tidyr) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

Read/Import data

# Read/Import Data
australia_cities <- data.frame(read_csv("au.csv"))

## Rows: 1035 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): city, country, iso2, admin_name, capital
## dbl (4): lat, lng, population, population_proper
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(australia_cities)

Explanation of actions taken:

The data set was downloaded and saved in the “Downloads” folder.
To store the data set as a data frame, the output of read_csv() is passed to data.frame() and the result is assigned to the variable “australia_cities”.
To get a sense of the data frame, I called head(), which shows the first six observations for inspection.
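The import step can be sketched on a small stand-in file, since “au.csv” itself is not reproduced here. The temporary file and its two rows are made up for illustration; the sketch only assumes readr 2.0 or later, where the show_col_types argument mentioned in the message above is available.

```r
library(readr)

# Hypothetical stand-in for "au.csv": two made-up rows written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("city,lat,lng,capital",
             "Sydney,-33.9,151.2,admin",
             "Canberra,-35.3,149.1,primary"), tmp)

# show_col_types = FALSE silences the column specification message
cities_tbl <- read_csv(tmp, show_col_types = FALSE)

# as.data.frame() drops the tibble classes and leaves a plain data frame
cities_df <- as.data.frame(cities_tbl)
class(cities_df)
```

Converting with as.data.frame() is an alternative to wrapping the call in data.frame(); either way the result prints and subsets like a base R data frame.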

Data description

The data set describes Australian cities and provides information such as latitude, longitude, city type and population for 1,035 locations (Alizadeh 2020, para. 1). The source of the data is https://www.kaggle.com/maryamalizadeh/worldcities-australia. The webpage states that “this data had been subsetted from the main data set which was available on https://simplemaps.com/data/world-cities” (Alizadeh 2020, para. 2).

The australia_cities data set includes nine variables:

city : name of the city
lat : abbreviated term for latitude
lng : abbreviated term for longitude
country : referring to Australia
iso2 : country code for Australia (AU)
admin_name : state or territory that the city belongs to
capital : type of city
population : as per the metadata table in the main data set, this variable is the city’s urban population (Simple Maps 2021, All Fields Table)
population_proper : as per the metadata table in the main data set, this variable is the city’s municipal population (Simple Maps 2021, All Fields Table)

Inspect dataset and variables

# Check dimensions of the data frame
dim(australia_cities)

## [1] 1035    9

# Check data types of the variables individually
class(australia_cities$city)

## [1] "character"

class(australia_cities$lat)

## [1] "numeric"

class(australia_cities$lng)

## [1] "numeric"

class(australia_cities$country)

## [1] "character"

class(australia_cities$iso2)

## [1] "character"

class(australia_cities$admin_name)

## [1] "character"

class(australia_cities$capital)

## [1] "character"

class(australia_cities$population)

## [1] "numeric"

class(australia_cities$population_proper)

## [1] "numeric"

# Convert variable australia_cities$admin_name to factor variable and define levels
australia_cities$admin_name <- factor(australia_cities$admin_name, levels = c("New South Wales", "Victoria", "Queensland",
                                                                              "Western Australia","South Australia", "Australian Capital Territory", "Tasmania", "Northern Territory"))
class(australia_cities$admin_name)

## [1] "factor"

levels(australia_cities$admin_name)

## [1] "New South Wales"              "Victoria"                    
## [3] "Queensland"                   "Western Australia"           
## [5] "South Australia"              "Australian Capital Territory"
## [7] "Tasmania"                     "Northern Territory"

# Convert variable australia_cities$capital to factor and define levels
australia_cities$capital <- factor(australia_cities$capital, levels = c("primary", "admin"), exclude = NA)
class(australia_cities$capital)

## [1] "factor"

levels(australia_cities$capital)

## [1] "primary" "admin"

# Check the column names in the data frame
colnames(australia_cities)

## [1] "city"              "lat"               "lng"              
## [4] "country"           "iso2"              "admin_name"       
## [7] "capital"           "population"        "population_proper"

# Define another data frame
australia_cities_1 <- australia_cities

# Rename variables/columns to be more informative
colnames(australia_cities_1) <- c("city", "latitude", "longitude", "country", "country_code", "state", "city_type", "urban_population", "municipal_population")
colnames(australia_cities_1)

## [1] "city"                 "latitude"             "longitude"           
## [4] "country"              "country_code"         "state"               
## [7] "city_type"            "urban_population"     "municipal_population"

head(australia_cities_1)

Explanation of actions taken:

This step asks for the data frame to be inspected thoroughly.
Checked the dimensions of the data frame by using the dim() function.
Inspected the data type of each variable individually by using the class() function and found that “admin_name” and “capital” need to be converted from character variables to nominal categorical variables.
Converted “admin_name” to a categorical (factor) variable by using the factor() function and defined its levels as the Australian states and territories (New South Wales, Victoria, Queensland, Western Australia, South Australia, Australian Capital Territory, Tasmania, Northern Territory).
Converted “capital” to a categorical (factor) variable by using the factor() function and defined its levels as the types of city (primary and admin). This variable also contains “NA”, which was excluded when defining the levels.
Examined the variable names by using the colnames() function and found that they are not very informative.
To keep the original data frame unchanged, so that earlier code that refers to the original column names still runs, saved a copy of the data frame under a new name (australia_cities_1).
Renamed the variables to “city”, “latitude”, “longitude”, “country”, “country_code”, “state”, “city_type”, “urban_population”, and “municipal_population” respectively.
Checked the top six rows of australia_cities_1 by using head() to ensure that the renaming was executed correctly and that the new names match the contents of each variable.
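The inspection and renaming steps above can also be written more compactly. The following is a minimal sketch on a made-up toy data frame (the object names toy, col_classes and toy_renamed are hypothetical), not a replacement for the code in this step.

```r
library(dplyr)

# Toy data frame with made-up values, mirroring three of the original columns
toy <- data.frame(city = c("Sydney", "Canberra"),
                  lat = c(-33.9, -35.3),
                  admin_name = c("New South Wales", "Australian Capital Territory"),
                  stringsAsFactors = FALSE)

# One call returns the class of every column
col_classes <- sapply(toy, class)
col_classes

# rename(new = old) changes columns by name, so the result does not depend on column order
toy_renamed <- toy %>%
  rename(latitude = lat, state = admin_name)
colnames(toy_renamed)
```

Renaming by name is less fragile than assigning a full vector to colnames(), because a change in column order cannot silently attach a name to the wrong column.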

Tidy data

# check variable names
colnames(australia_cities_1)

## [1] "city"                 "latitude"             "longitude"           
## [4] "country"              "country_code"         "state"               
## [7] "city_type"            "urban_population"     "municipal_population"

# Check and tidy observations
n_distinct(australia_cities_1$city)

## [1] 1026

# Show the rows whose city name has already appeared
australia_cities_1 %>% 
  filter(duplicated(city))

# Keep only the first row for each city name
australia_cities_2 <- australia_cities_1[!duplicated(australia_cities_1$city), ]
australia_cities_2

# Remove unnecessary variables
australia_cities_3 <- australia_cities_2 %>% 
  select(-(country:country_code))
australia_cities_3

# Replace values of factor variable with more informative terms
australia_cities_4 <- australia_cities_3
levels(australia_cities_4$city_type)[1] <- "nation-capital"  # was "primary"
levels(australia_cities_4$city_type)[2] <- "state-capital"   # was "admin"
levels(australia_cities_4$city_type)[3] <- "non-capital"     # new level, required before NA can be replaced
australia_cities_4 <- australia_cities_4 %>% 
  mutate(city_type = replace_na(city_type, "non-capital"))
australia_cities_4

Wickham and Grolemund (2017, p. 149) argue that “there are three principles to tidy data:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell”.

They (2017, p. 152) also mention that “there are two common problems with untidy data:

One variable might be spread across multiple columns.
One observation might be scattered across multiple rows”.

Explanation of actions taken:

At this stage, australia_cities_1 is examined against the tidy data principles and the common problems with untidy data (Wickham & Grolemund 2017).
Checked the column headers via the colnames() function. The output shows that each column name corresponds to a variable and that no values are used as column headers.
Checked the number of distinct values in australia_cities_1$city with the n_distinct() function to examine whether each observation has its own row. Although the data frame has 1,035 rows, it contains only 1,026 distinct city names. This means that nine rows repeat a city name and need to be removed from the data frame, which is achieved by combining indexing with the duplicated() function.
Visual inspection of the contents of the data frame indicates that none of the columns contain multiple variables.
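Because duplicated() is applied to the city column alone, rows are treated as duplicates whenever the name matches, even if they refer to places in different states. A minimal sketch of how that could be checked, using a made-up toy data frame (the rows and object names are illustrative, not taken from the data set):

```r
library(dplyr)

# Made-up rows: "Richmond" appears in two states, and one "Perth" row is repeated exactly
toy <- data.frame(city  = c("Richmond", "Richmond", "Perth", "Perth"),
                  state = c("Victoria", "Tasmania", "Western Australia", "Western Australia"),
                  stringsAsFactors = FALSE)

# Flagging on the name alone marks the repeated row and the second Richmond
dup_by_name <- duplicated(toy$city)

# Flagging on name and state together marks only the repeated row
dup_by_name_state <- duplicated(toy[, c("city", "state")])

# distinct() keeps the first row of each city/state combination
toy_unique <- toy %>% distinct(city, state, .keep_all = TRUE)
toy_unique
```

Whether the nine repeated names in the real data are true duplicates or different places with the same name would need to be checked in the same way before dropping them.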

Now, the data frame complies with tidy data principles (Wickham & Grolemund 2017). However, there is still more to be done to turn it into a clean and meaningful data frame:

Variables “country” and “country_code” only identify Australia, so they are not informative for the intended audience of Australian residents. They are therefore removed from the data frame with the select() function.
Variable “city_type” currently takes three different values: “admin” for state capitals such as Sydney, “primary” for the national capital (Canberra), and “NA” for non-capital cities. To make the data easier for the audience to interpret, “admin”, “primary” and “NA” are changed to “state-capital”, “nation-capital”, and “non-capital” respectively.
To replace “primary” and “admin”, levels() is used to assign their new values (“nation-capital” and “state-capital”).
To replace “NA”, “non-capital” must first be added as a factor level. Then, replace_na() from the tidyr package is nested inside mutate() from the dplyr package to overwrite “NA” with “non-capital”.
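The recoding logic can be illustrated in base R on a toy factor. This is a sketch under the assumption that the factor has the same two levels as city_type plus missing values; the four values themselves are made up.

```r
# Toy factor with the same two levels and a missing value (values are illustrative)
city_type <- factor(c("primary", "admin", NA, "admin"), levels = c("primary", "admin"))

# Renaming levels relabels every matching value at once
levels(city_type) <- c("nation-capital", "state-capital")

# A value can only be assigned if it is already a level, so add the level first
levels(city_type) <- c(levels(city_type), "non-capital")
city_type[is.na(city_type)] <- "non-capital"

table(city_type)
```

Calling table() afterwards is a quick way to confirm that no values were lost and that every former NA now falls under “non-capital”.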

Summary statistics

# Grouped summary statistics on urban_population
australia_cities_4 %>% 
  group_by(state) %>% 
  summarise(urban_pop_mean = mean(urban_population),
            urban_pop_median = median(urban_population),
            urban_pop_min = min(urban_population),
            urban_pop_max = max(urban_population),
            urban_pop_Std_dev = sd(urban_population))

# Grouped summary statistics on municipal_population
australia_cities_4 %>% 
  group_by(state) %>% 
  summarise(municipal_pop_mean = mean(municipal_population),
            municipal_pop_median = median(municipal_population),
            municipal_pop_min = min(municipal_population),
            municipal_pop_max = max(municipal_population),
            municipal_pop_Std_dev = sd(municipal_population))

Explanation of actions taken:

The data frame has four numeric variables: latitude, longitude, urban_population, and municipal_population. Summary statistics on latitude and longitude would not be very useful, so I concentrate on urban and municipal population. There are two categorical variables, state and city_type; grouping by state provides the more meaningful summary statistics.
The summary statistics are calculated by piping australia_cities_4 into group_by(), where it is grouped by “state”, and the result is piped into summarise() together with statistical functions such as mean(), median(), min(), max(), and sd().
The standard deviation of both urban_population and municipal_population is “NA” for Australian Capital Territory. The reason is that at least two values are required to calculate a standard deviation, whereas “Australian Capital Territory” consists of only one city in this data frame, Canberra.
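The NA result for a single-city group can be reproduced on a toy example. The sketch below uses made-up population figures and hypothetical group labels “A” and “B”; adding n() to the summary is a suggestion, not part of the original code.

```r
library(dplyr)

# Made-up populations: group "B" has a single city, like Australian Capital Territory here
toy <- data.frame(state = c("A", "A", "B"),
                  urban_population = c(100, 300, 500),
                  stringsAsFactors = FALSE)

toy_summary <- toy %>%
  group_by(state) %>%
  summarise(n_cities = n(),                          # group size explains any NA below
            urban_pop_mean = mean(urban_population),
            urban_pop_sd = sd(urban_population))     # NA when n_cities is 1
toy_summary
```

Reporting the group size next to the standard deviation makes it clear that the NA comes from the sample size rather than from missing data.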

Create a list

# Create a factor vector corresponding to state variable
state <- factor(c("New South Wales", "Victoria", "Queensland", "Western Australia",
                              "South Australia", "Tasmania", "Australian Capital Territory", "Northern Territory"),
                levels = c("New South Wales", "Victoria", "Queensland", "Western Australia",
                           "South Australia", "Tasmania", "Australian Capital Territory", "Northern Territory"))
state

## [1] New South Wales              Victoria                    
## [3] Queensland                   Western Australia           
## [5] South Australia              Tasmania                    
## [7] Australian Capital Territory Northern Territory          
## 8 Levels: New South Wales Victoria Queensland ... Northern Territory

# Create a numeric vector for economic rank of the state in Australia
economic_rank_of_state <- c(1, 2, 3, 4, 5, 6, 7, 8)
economic_rank_of_state

## [1] 1 2 3 4 5 6 7 8

# Combine vectors to create a list
state_economic_performance <- list(state, economic_rank_of_state)
state_economic_performance

## [[1]]
## [1] New South Wales              Victoria                    
## [3] Queensland                   Western Australia           
## [5] South Australia              Tasmania                    
## [7] Australian Capital Territory Northern Territory          
## 8 Levels: New South Wales Victoria Queensland ... Northern Territory
## 
## [[2]]
## [1] 1 2 3 4 5 6 7 8

Explanation of actions taken:

In this step, we are asked to create a list which includes a numeric value for each level of a categorical variable from the data frame. To achieve this, two vectors are created: one is a factor vector and the other is numeric.
For the categorical variable, the “state” vector is produced with eight levels corresponding to the state variable in the data frame.
For the numeric variable, “economic_rank_of_state” is made with values from one to eight to show how each state ranks among the others in economic performance (Reserve Bank of Australia 2021, Output Share by State).
Finally, the two vectors are combined by using the list() function to construct the requested list “state_economic_performance”.
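As a variation on the list() call above, the elements can be given names when the list is created. This is a sketch with only two of the eight states for brevity; the object names named_list and named_df are hypothetical.

```r
# Two short illustrative vectors (a subset of the states, for brevity)
state <- factor(c("New South Wales", "Victoria"),
                levels = c("New South Wales", "Victoria"))
economic_rank_of_state <- c(1, 2)

# Naming the elements lets them be retrieved with $ instead of [[1]] and [[2]]
named_list <- list(state = state, economic_rank_of_state = economic_rank_of_state)
names(named_list)

# The names carry over as column names if the list is later converted to a data frame
named_df <- as.data.frame(named_list)
colnames(named_df)
```

With an unnamed list, as.data.frame() has to generate column names from the list contents, which is why naming the elements up front can save a renaming step.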

Join the list

# Convert the newly created list to a data frame
state_economic_performance_df <- as.data.frame(state_economic_performance)
class(state_economic_performance_df)

## [1] "data.frame"

colnames(state_economic_performance_df)

## [1] "structure.1.8..levels...c..New.South.Wales....Victoria....Queensland..."
## [2] "c.1..2..3..4..5..6..7..8."

colnames(state_economic_performance_df) <- c("state", "economic_rank_of_state")
colnames(state_economic_performance_df)

## [1] "state"                  "economic_rank_of_state"

state_economic_performance_df

# Join the newly created data frame onto the original data frame using left_join() function
australia_cities_5 <- australia_cities_4 %>% 
  left_join(state_economic_performance_df, by = "state")
head(australia_cities_5, 10)

tail(australia_cities_5, 10)

Explanation of actions taken:

This step instructs us to join the newly created list to the original data frame in such a way that the numeric variable is maintained and matched to the categorical variable.
To achieve this, the list must first be converted to a data frame by using the as.data.frame() function.
To join the new data frame “state_economic_performance_df”, state (our categorical variable) is used as the key between the two data frames. The key column must also have the same name in both data frames, so the colnames() function is used to set the names from a character vector.
Data frame “state_economic_performance_df” is then inspected to ensure that it is ready to be joined onto the data frame “australia_cities_4”.
The join is performed with the left_join() function, with “australia_cities_4” as the left data frame and “state_economic_performance_df” as the right data frame. The output of the function is then assigned to “australia_cities_5”.
To check that the join worked as intended, we inspect the top ten rows and bottom ten rows of the “australia_cities_5” data frame.
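Inspecting the first and last rows is one check; two programmatic checks are sketched below on made-up tables (the object names lookup, cities, joined and unmatched are hypothetical, and “Tasmania” is deliberately left out of the lookup to show what a failed match looks like).

```r
library(dplyr)

# Made-up lookup and city tables; "Tasmania" is deliberately missing from the lookup
lookup <- data.frame(state = c("Victoria", "Queensland"),
                     economic_rank_of_state = c(2, 3),
                     stringsAsFactors = FALSE)
cities <- data.frame(city = c("Melbourne", "Brisbane", "Hobart"),
                     state = c("Victoria", "Queensland", "Tasmania"),
                     stringsAsFactors = FALSE)

joined <- cities %>% left_join(lookup, by = "state")

# A left join onto a lookup with unique keys keeps the row count unchanged
nrow(joined) == nrow(cities)

# Unmatched keys show up as NA in the joined column
unmatched <- joined %>% filter(is.na(economic_rank_of_state))
unmatched
```

An empty unmatched table together with an unchanged row count would indicate that every city found exactly one rank.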

Subsetting I

# Subset the first ten observations of the data frame
subset_1 <- australia_cities_5[1:10, ]
subset_1

# Convert to a matrix and check structure
matrix <- as.matrix(subset_1)
class(matrix)

## [1] "matrix" "array"

typeof(matrix)

## [1] "character"

str(matrix)

##  chr [1:10, 1:8] "Sydney" "Melbourne" "Brisbane" "Perth" "Adelaide" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:8] "city" "latitude" "longitude" "state" ...

Explanation of actions taken:

In this step, we are asked to perform subsetting and conversion to a matrix. Index subsetting is used to select the first ten rows with all the variables.
The subsetted data is then converted to a matrix using the as.matrix() function.
In terms of checking the structure of “matrix”, class() shows that it is a matrix/array. The output of typeof() indicates that it is of character type, and str() confirms this with further details.
Unlike “subset_1”, which was a data frame and could hold a different data type in each column, “matrix” can hold only one data type, which in this case is character. This happens because a matrix is an atomic structure, so as.matrix() coerces every column to the most general type present: factors are converted to their character labels, and numbers can always be represented as character strings, whereas city names cannot be represented as numbers.
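The coercion behaviour can be seen on a toy data frame. The values and the object names toy, mixed_matrix and numeric_matrix below are made up for illustration.

```r
# Toy data frame mixing character, numeric and factor columns (values are illustrative)
toy <- data.frame(city = c("Sydney", "Perth"),
                  latitude = c(-33.9, -31.9),
                  state = factor(c("New South Wales", "Western Australia")),
                  stringsAsFactors = FALSE)

# Any non-numeric column forces the whole matrix to character
mixed_matrix <- as.matrix(toy)
typeof(mixed_matrix)

# Keeping only numeric columns gives a numeric (double) matrix instead
numeric_matrix <- as.matrix(toy[, "latitude", drop = FALSE])
typeof(numeric_matrix)
```

If the numbers are needed for calculation after conversion, selecting the numeric columns before calling as.matrix() avoids having to convert the strings back.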

Subsetting II

# Subset the first and the last variables of the data frame
subset_2 <- australia_cities_5 %>% 
  select(city, economic_rank_of_state)
head(subset_2)

# Save the subset as a .RData file
save(subset_2, file = "subset_2.RData")

Explanation of actions taken:

To subset the first and the last variables of the main data frame, the select() function is used with “city” and “economic_rank_of_state” as arguments.
The save() function is then used to save the subset as “subset_2.RData” in the current working directory.
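A saved .RData file can be verified by loading it back. The sketch below uses a made-up stand-in for subset_2 and a temporary file path instead of the working directory; subset_demo, rdata_path and restored are hypothetical names.

```r
# Illustrative stand-in for subset_2, saved to a temporary file rather than the working directory
subset_demo <- data.frame(city = c("Sydney", "Melbourne"),
                          economic_rank_of_state = c(1, 2),
                          stringsAsFactors = FALSE)
rdata_path <- tempfile(fileext = ".RData")
save(subset_demo, file = rdata_path)

# load() restores the object under its original name; a fresh environment keeps it separate
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)
loaded_names
identical(restored$subset_demo, subset_demo)
```

Loading into a separate environment avoids overwriting an object of the same name in the current session.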

Create a new Data Frame

# Create an ordinal variable
data_wrangling_final <- factor(c("NN", "PA", "CR", "DI", "HD"),
                               levels = c("NN", "PA", "CR", "DI", "HD"),
                               ordered = TRUE)
str(data_wrangling_final)

##  Ord.factor w/ 5 levels "NN"<"PA"<"CR"<..: 1 2 3 4 5

levels(data_wrangling_final)

## [1] "NN" "PA" "CR" "DI" "HD"

# Create an integer variable
number_of_students <- c(5L, 8L, 10L, 5L, 2L)
str(number_of_students)

##  int [1:5] 5 8 10 5 2

# create a data frame from the ordinal and the integer variables
data_wrangling_report <- data.frame(data_wrangling_final, number_of_students)
data_wrangling_report

# Create a numeric vector and bind it onto the data frame
mean_study_hours_weekly <- c(3.1, 10.4, 15.7, 21.5, 33.3)
class(mean_study_hours_weekly)

## [1] "numeric"

data_wrangling_report_2 <- cbind(data_wrangling_report, mean_study_hours_weekly)
data_wrangling_report_2

dim(data_wrangling_report_2)

## [1] 5 3

str(data_wrangling_report_2)

## 'data.frame':    5 obs. of  3 variables:
##  $ data_wrangling_final   : Ord.factor w/ 5 levels "NN"<"PA"<"CR"<..: 1 2 3 4 5
##  $ number_of_students     : int  5 8 10 5 2
##  $ mean_study_hours_weekly: num  3.1 10.4 15.7 21.5 33.3

Explanation of actions taken:

This step requires us to construct a data frame with two variables, one integer and one ordinal. The data frame is based on a hypothetical scenario and records the number of students for each academic grade. Key codes for academic grades (NN, PA, CR, DI, and HD) are chosen as the ordinal variable, and the number of students is the integer variable.
“data_wrangling_final” (ordinal variable) is constructed with the factor() function, with levels NN < PA < CR < DI < HD and ordered = TRUE.
“number_of_students” (integer variable) is created with the c() function, and “L” is appended to each value to define it as an integer.
“data_wrangling_report” (data frame) is made by combining the ordinal and integer variables through the data.frame() function. The data frame is then printed for inspection.
The second part of the step asks for a numeric vector to be created and added to the new data frame via the cbind() function. To achieve this, “mean_study_hours_weekly” (numeric vector) is constructed with the c() function.
Finally, “data_wrangling_report_2” (new data frame with three variables) is made by using cbind() to bind “mean_study_hours_weekly” onto “data_wrangling_report”. Attributes of this data frame are checked by applying the dim() and str() functions.
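One practical benefit of declaring the grades as an ordered factor is that comparisons follow the level order. A minimal sketch, reusing the grade codes and student counts from this step under the hypothetical names grade and report:

```r
# Same ordinal grades and hypothetical counts as above, as a small self-contained example
grade <- factor(c("NN", "PA", "CR", "DI", "HD"),
                levels = c("NN", "PA", "CR", "DI", "HD"),
                ordered = TRUE)
number_of_students <- c(5L, 8L, 10L, 5L, 2L)
report <- data.frame(grade, number_of_students)

# Ordered factors support comparisons that follow the level order
credit_or_above <- report[report$grade >= "CR", ]
credit_or_above

# Number of students graded CR, DI or HD
sum(credit_or_above$number_of_students)
```

With an unordered factor the same comparison would return NA with a warning, so the ordered = TRUE argument is what makes this filter possible.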

Create another Data Frame

# Create another data frame with common categorical variable to data_wrangling_report_2
mean_years_experience_data_field <- c(0.7, 3.3, 3.7, 5.8, 4.5)
data_frame_experience <- data.frame(data_wrangling_final, mean_years_experience_data_field)
str(data_frame_experience)

## 'data.frame':    5 obs. of  2 variables:
##  $ data_wrangling_final            : Ord.factor w/ 5 levels "NN"<"PA"<"CR"<..: 1 2 3 4 5
##  $ mean_years_experience_data_field: num  0.7 3.3 3.7 5.8 4.5

data_frame_experience

# join data_frame_experience onto data_wrangling_report_2
data_wrangling_report_complete <- data_wrangling_report_2 %>% 
  inner_join(data_frame_experience, by = "data_wrangling_final")
data_wrangling_report_complete

Explanation of actions taken:

This step asks for a new data frame to be created in such a way that it includes a variable in common with the data frame constructed in the previous step.
“data_frame_experience” (the new data frame) contains two variables, “data_wrangling_final” (the ordinal variable shared with the previous data frame) and “mean_years_experience_data_field” (a numeric variable giving the mean years of experience in the data field).
The two variables are combined through the data.frame() function to create the new data frame.
Finally, inner_join() is used to combine the “data_wrangling_report_2” data frame with the “data_frame_experience” data frame by “data_wrangling_final” (the common variable) to create the “data_wrangling_report_complete” data frame. To confirm that the join succeeded, “data_wrangling_report_complete” is then printed and inspected.
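Here both data frames contain all five grades, so inner_join() keeps every row. The difference from left_join() only appears when a key is missing from one side, as in this sketch with made-up tables (grades, experience and the result names are hypothetical):

```r
library(dplyr)

# Made-up tables: grade "HD" has no matching row in the second table
grades <- data.frame(grade = c("PA", "CR", "HD"),
                     number_of_students = c(8L, 10L, 2L),
                     stringsAsFactors = FALSE)
experience <- data.frame(grade = c("PA", "CR"),
                         mean_years_experience = c(3.3, 3.7),
                         stringsAsFactors = FALSE)

# inner_join() keeps only the keys present in both tables
inner_result <- grades %>% inner_join(experience, by = "grade")
nrow(inner_result)

# left_join() keeps every row of the left table and fills the gaps with NA
left_result <- grades %>% left_join(experience, by = "grade")
nrow(left_result)
```

Comparing the row count of the joined result with the inputs is a quick way to see whether an inner join has silently dropped any grades.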

Reference List

Alizadeh, M 2020, Australia Cities Database, Kaggle, viewed 12 September 2021, https://www.kaggle.com/maryamalizadeh/worldcities-australia.
Reserve Bank of Australia 2021, Composition of the Australian Economy, Reserve Bank of Australia, viewed 12 September 2021, https://www.rba.gov.au/education/resources/snapshots/economy-composition-snapshot/.
Simple Maps 2021, World Cities Database, Simple Maps, viewed 12 September 2021, https://simplemaps.com/data/world-cities.
Wickham, H & Grolemund, G 2017, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, O’Reilly Media, Inc., Sebastopol, CA.

Data Wrangling - Australian Cities

Data Preprocessing

Mohammad Younesi
