Required packages

library(readxl)
library(tidyr)
library(dplyr)
library(lubridate)
library(outliers)
library(stringr)
library(MVN)
library(forecast)

Executive Summary

The purpose of the pre-processing performed in this assignment was to produce a dataset that could be used to explore whether there is a relationship between average maximum temperature, rainfall, windspeed and/or humidity, and population size and/or growth, for locations throughout Australia. The raw data was sourced from the Australian Bureau of Statistics (ABS) and Kaggle (via the Australian Bureau of Meteorology) and pre-processed using the following steps:

Data

The data in this assignment was sourced from the ABS and Kaggle. The ABS data was downloaded from the ABS website on 16/05/2020 using the following link: “https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&32180ds0003_2009-19.xls&3218.0&Data%20Cubes&381DE9FC52CAD374CA258535000B98C0&0&2018-19&25.03.2020&Latest”. The data was stored as an Excel 97-2003 workbook (.xls) file.

According to the ABS, the data contains estimates of resident populations according to Significant Urban Areas of Australia as of June 30 each year from 2009 to 2019 (Australian Bureau of Statistics, 2020). Significant Urban Areas are areas with a population of at least 10,000 based on Statistical Areas Level 2 (SA2s) and can represent a collection of Urban Centres whose collective population is at least 10,000 (Australian Bureau of Statistics, 2020). The ABS data contained the following variables:

The Kaggle data was downloaded from “https://www.kaggle.com/jsphyg/weather-dataset-rattle-package#weatherAUS.csv” and is stored as a Comma Separated Value (CSV) file. The Kaggle data was originally sourced from the Australian Bureau of Meteorology (BoM) website (“http://www.bom.gov.au/climate/data”) and was published to the Kaggle website as a single CSV file by the Kaggle user “Joe Young” (“https://www.kaggle.com/jsphyg”).

According to the Kaggle and BoM websites, the data contains daily weather observations from weather stations across Australia from 01/11/2017 to 25/06/2017 (Kaggle.com, 2019). Not all weather stations in the network observe all variables and, in some locations and where appropriate, observations from a number of stations have been combined to produce a summary for a single location (Bureau of Meteorology, 2020). The data contained the following variables:

setwd("C:/Users/James/Desktop/Post Grad studies/Data Wrangling - MATH2349/Assignment 2")
# Import the ABS population data as "abspop"
abspop <- read_excel("32180ds0003_2009-19.xls", sheet = "Table 1", skip = 6)
New names:
* `` -> ...1
* `` -> ...2
* `` -> ...14
* `` -> ...16
names(abspop)[names(abspop) == "...2"] <- "Sig. Urban Area" #Rename second column to "Sig. Urban Area"
names(abspop) #the columns with relevant data are 2-13
 [1] "...1"            "Sig. Urban Area" "2009"            "2010"            "2011"           
 [6] "2012"            "2013"            "2014"            "2015"            "2016"           
[11] "2017"            "2018"            "2019"            "...14"           "2018-2019"      
[16] "...16"          
tail(abspop,10) #There is a totals row and 6 other rows at the bottom of the data that are not needed
abspop <- abspop[3:(nrow(abspop)-7), 2:13] #select the rows and columns with relevant data in them
head(abspop)
# Import the Kaggle weather data as "weather"
weather <- read.csv("weatherAUS.csv", stringsAsFactors = FALSE)
#Select the columns relevant to fulfilling the goal of the pre-processing
desired_measures <- c("Date", "Location", "MaxTemp", "Rainfall", "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm") #To compare population growth against heat, rain, wind and humidity.
weather <- select(weather, all_of(desired_measures))
head(weather)
## Data will be merged after it has been tidied, manipulated and formatted

Understand

For the ABS population data:

For the Kaggle weather data:

The Year/Date data was converted to an ordered factor after the weather and population data were merged (refer ## Merging the data)

## Review and revise the variables in the population data
str(abspop)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   110 obs. of  12 variables:
 $ Sig. Urban Area: chr  "Not in any Significant Urban Area (NSW)" "Albury - Wodonga" "Armidale" "Ballina" ...
 $ 2009           : chr  "971734" "82307" "23166" "24126" ...
 $ 2010           : chr  "979752" "83245" "23349" "24309" ...
 $ 2011           : chr  "984419" "84195" "23471" "24469" ...
 $ 2012           : chr  "989905" "85229" "23631" "24622" ...
 $ 2013           : chr  "995524" "86564" "23756" "24852" ...
 $ 2014           : chr  "1001028" "87982" "23856" "25066" ...
 $ 2015           : chr  "1006487" "89306" "23979" "25325" ...
 $ 2016           : chr  "1013060" "90836" "24114" "25643" ...
 $ 2017           : chr  "1018827" "92251" "24369" "25930" ...
 $ 2018           : chr  "1025509" "93541" "24482" "26358" ...
 $ 2019           : chr  "1031562" "94837" "24584" "26625" ...
abspop[ ,2:12]<- sapply(abspop[ ,2:12], as.integer) #Convert Year columns to integers
sapply(abspop, class)
Sig. Urban Area            2009            2010            2011            2012            2013 
    "character"       "integer"       "integer"       "integer"       "integer"       "integer" 
           2014            2015            2016            2017            2018            2019 
      "integer"       "integer"       "integer"       "integer"       "integer"       "integer" 
## Review and revise the variables in the weather data
str(weather)
'data.frame':   142193 obs. of  8 variables:
 $ Date        : chr  "2008-12-01" "2008-12-02" "2008-12-03" "2008-12-04" ...
 $ Location    : chr  "Albury" "Albury" "Albury" "Albury" ...
 $ MaxTemp     : num  22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
 $ Rainfall    : num  0.6 0 0 0 1 0.2 0 0 0 1.4 ...
 $ WindSpeed9am: int  20 4 19 11 7 19 20 6 7 15 ...
 $ WindSpeed3pm: int  24 22 26 9 20 24 24 17 28 11 ...
 $ Humidity9am : int  71 44 38 45 82 55 49 48 42 58 ...
 $ Humidity3pm : int  22 25 30 16 33 23 19 19 9 27 ...
weather$Date <- ymd(weather$Date) #lubridate package - convert "Date" to a date type using "ymd()"
sapply(weather, class)
        Date     Location      MaxTemp     Rainfall WindSpeed9am WindSpeed3pm  Humidity9am 
      "Date"  "character"    "numeric"    "numeric"    "integer"    "integer"    "integer" 
 Humidity3pm 
   "integer" 

Tidy & Manipulate Data I

The ABS population data did not conform to the tidy data principles because each observation (estimated population in a given Sig. Urban Area (SUA) in a given year) did not have its own row. Instead, observations were spread across rows and columns, meaning the data was in a wide format. In order to make the ABS population data tidy, the year columns had to be rearranged so they were in rows, and the estimated populations (the values) were in their own column. The ABS population data was tidied using the “gather()” function, which took the year columns and rearranged them to be rows while creating a new column (called “Population”) to house the value of each observation. The Population column was divided by 1,000 to adjust the scale and improve readability. The column was also renamed to “Population(000s)” to ensure its values would be interpreted correctly. Finally, a new column was created to measure the year-on-year growth in population for each SUA, using a technique adapted from the rstudio community website (Community.rstudio.com, 2019). For the first observation in each SUA (2009) the growth was set to a missing value, as setting it to zero could result in inaccurate conclusions being drawn. The missing values were imputed later (refer ## Scan I).

#Tidy the population data
abspop <- abspop %>% gather(Year, Population, -1) %>% select(c(2,1,3))
abspop$Year <- as.integer(abspop$Year)
head(abspop, 3)
#Transform the Population variable to make it easier to read.
abspop <- abspop %>% mutate(Population = Population/1000) %>% rename(`Population(000s)` = Population)
#Add a year-on-year growth column to the population data
abspop <- abspop %>% arrange(`Sig. Urban Area`, Year) %>% 
  mutate(PopGrowthPct = 
  case_when(Year == 2009 ~ as.double(NA),
            Year != 2009 ~ (`Population(000s)`-lag(`Population(000s)`))/lag(`Population(000s)`)*100))

Scan I

The datasets were scanned for missing values and obvious errors before the datasets were manipulated or merged as this made it easier to identify explanations for missing/erroneous values. For example, it would be easier to identify that a missing value in the weather data was due to a weather station being unable to track that measurement before the data is manipulated or merged, because location or time-specific context could be lost when the data is manipulated or merged.

Because variables like temperature, humidity and rainfall can vary from location-to-location, using the mean/median of all locations to impute missing values for a single location could result in unrealistic values being imputed. Therefore, when a missing value was identified, the mean/median for the relevant measure was calculated for that particular location and used to impute the missing value.

The population data was scanned and 110 missing values were identified, being the NAs that were added when the population growth variable was created. Because the distribution of the variable was skewed, the median was used to impute the missing values on a location-by-location basis. There were no special values (“NaN” or infinities) in the ABS population data.

The weather data was scanned and 11,090 missing values were identified, with all missing values being in numeric variables. Before imputing missing values, the variables with normal distributions were identified and their missing values were imputed using means and a technique adapted from stackoverflow (Stackoverflow.com, 2019). The remaining missing values were imputed using medians.

Once the missing values in the weather data had been imputed, no more missing values were present in the data. There were also no special values (“NaN” or infinites).

#Scan the population data for missing values 
sum(is.na(abspop)) #110 NAs in the population data
[1] 110
sum(apply(abspop, 1, is.nan)) #no NAN values
[1] 0
sum(apply(abspop, 1, is.infinite)) #no infinite values
[1] 0
#Impute the NAs in population growth
boxplot(abspop$PopGrowthPct, main = "Distribution of Population Growth (%)") #not normally distributed
#use median to impute missing values in PopGrowthPct as it is not normally distributed
abspop <- abspop %>% group_by(`Sig. Urban Area`) %>%
  mutate(PopGrowthPct = ifelse(is.na(PopGrowthPct), median(PopGrowthPct, na.rm = TRUE), PopGrowthPct))
#Scan the weather data for missing values
initial_weather_NAs <- sum(is.na(weather))
initial_weather_NAs #11,090 NAs in the weather data
[1] 11090
colSums(is.na(weather)) #all columns with missing values are numeric data types
        Date     Location      MaxTemp     Rainfall WindSpeed9am WindSpeed3pm  Humidity9am 
           0            0          322         1406         1348         2630         1774 
 Humidity3pm 
        3610 
##Impute missing values - use mean for normally-distributed variables and median for everything else
par(mfrow = c(2,3))

i <- 1
while (i <= ncol(weather)) { 
  if (is.numeric(weather[,i]) == TRUE)  boxplot(weather[,i], main = names(weather[i]))
  i <- i+1
} #MaxTemp and Humidity3pm appear normally distributed

which(abs(scores(weather$MaxTemp, type = "z")) >3) #0 outliers
integer(0)
which(abs(scores(weather$Humidity3pm, type = "z")) >3) #0 outliers
integer(0)
#MaxTemp and Humidity3pm appear normally distributed with no outliers. Impute them using mean
nrml_dist_vars <- c("MaxTemp", "Humidity3pm")
for (i in nrml_dist_vars) {
  weather[,i] <-  ave(weather[,i], weather$Location, FUN = function(x) 
    ifelse(is.na(x), mean(x,na.rm = TRUE), x))
}
initial_weather_NAs-sum(is.na(weather)) #3,932 fewer NAs
[1] 3932
#For the variables with skewed distributions, impute missing values using median.
weather <- weather %>% group_by(Location) %>%
  mutate(Rainfall = ifelse(is.na(Rainfall), median(Rainfall, na.rm = TRUE), Rainfall))
weather <- weather %>% group_by(Location) %>%
  mutate(WindSpeed9am = ifelse(is.na(WindSpeed9am), median(WindSpeed9am, na.rm = TRUE), WindSpeed9am))
weather <- weather %>% group_by(Location) %>%
  mutate(WindSpeed3pm = ifelse(is.na(WindSpeed3pm), median(WindSpeed3pm, na.rm = TRUE), WindSpeed3pm))
weather <- weather %>% group_by(Location) %>%
  mutate(Humidity9am = ifelse(is.na(Humidity9am), median(Humidity9am, na.rm = TRUE), Humidity9am))
sum(is.na(weather)) #no more missing values in the weather data
[1] 0
sum(apply(weather, 1, is.nan)) #no NaN values in the weather data
[1] 0
sum(apply(weather, 1, is.infinite)) #no infinite values in the weather data
[1] 0

Tidy & Manipulate Data II

While the weather data had daily observations, the ABS population data had yearly observations (population estimates). Therefore, the weather data had to be aggregated to align with the frequency of the ABS population data. The “mutate()” function and “year” argument were used to create a Year column in the weather data, based on the year in the Date column. Once the Year field had been created, mutate was again used to create columns for maximum humidity and wind speed, so that relationships between humidity or windiness and population size/growth could be explored. Finally, in order to align with the population data, the weather data was grouped by year and location and daily measurements were averaged to produce average daily figures for each year and location.

The ABS population data and weather data were then compared to identify any locations that required manipulation in order to match. Although 143 unique names were identified, only 16 names matched. An exploration of the names that did not match revealed that some Sig. Urban Areas had hyphens, which prevented them from matching to very similar Locations that did not (e.g. “Albury” and “Albury - Wodonga”). To resolve this, “gsub()” was used to remove hyphens and everything after them from the names of Sig. Urban Areas using a technique adapted from the “ecology & stats” blog by Steve Walker (Walker, S. C., 2013). This increased the number of matches to 20, an increase of 4 matches. Further examination revealed that the names of some Locations were actually two words without a space separating them (e.g. “CoffsHarbour”). To improve the number of matching locations, spaces within Sig. Urban Area names were removed using “gsub()”, leading to a total of 25 matches.

By manipulating the names of Sig. Urban Areas, the number of matching location names achieved between the two datasets was 25, representing 23% of Sig. Urban Areas and 51% of weather station Locations.

#Group the weather data by year to create a "Year" column using the "Date"" column
weather <- weather %>% mutate(Year = year(Date))
weather$Year <- as.integer(weather$Year)
#Create daily maximums for humidity and windspeed
weather <- weather %>% mutate(MaxHumidity = pmax(Humidity9am, Humidity3pm, na.rm = TRUE),
                              MaxWindspeed = pmax(WindSpeed9am, WindSpeed3pm, na.rm = TRUE))
#Group by year and create average daily values for rain, temperature, humidity and windspeed
avgweather <- weather %>% group_by(Year, Location) %>% 
  summarise("Avg Daily Rainfall" = mean(Rainfall, na.rm = TRUE),
            "Avg MaxTemp" = mean(MaxTemp, na.rm = TRUE),
            "Avg Max Humidity" = mean(MaxHumidity, na.rm = TRUE),
            "Avg Max Windspeed" = mean(MaxWindspeed, na.rm = TRUE))
##Preparation for merging
unique_places <- union(avgweather$Location, abspop$`Sig. Urban Area`)
length(unique_places) #143
[1] 143
matches <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(matches) #16
[1] 16
#Matching on Location and Sig. Urban Area produces 16 matches, could any be altered in order to match?
mismatches <- setdiff(unique_places, matches)
head(sort(mismatches),30) #Hyphenated names and names with spaces appear to be a mismatch issue.
 [1] "Albury"                "Albury - Wodonga"      "Alice Springs"        
 [4] "AliceSprings"          "Armidale"              "Bacchus Marsh"        
 [7] "BadgerysCreek"         "Bairnsdale"            "Ballina"              
[10] "Batemans Bay"          "Bathurst"              "Bowral - Mittagong"   
[13] "Broken Hill"           "Broome"                "Bunbury"              
[16] "Bundaberg"             "Burnie - Wynyard"      "Busselton"            
[19] "Camden Haven"          "Canberra"              "Canberra - Queanbeyan"
[22] "Central Coast"         "Cobar"                 "Coffs Harbour"        
[25] "CoffsHarbour"          "Colac"                 "Dartmoor"             
[28] "Devonport"             "Dubbo"                 "Echuca - Moama"       
#Search for hypens in Location and Sig. Urban Area, are there hyphens in just one?
sum(str_detect(mismatches, pattern = "-"), na.rm = TRUE) #20 hyphens in mismatches
[1] 20
sum(str_detect(avgweather$Location, pattern = "-"), na.rm = TRUE) #no hyphens in "Location"
[1] 0
sum(str_detect(unique(abspop$`Sig. Urban Area`), pattern = "-")) #all hyphens are in `Sig. Urban Area`
[1] 20
#Does removing everything after a hyphen improve the number of matches?
abspop$`Sig. Urban Area` <- gsub(pattern = ' -.*', replacement = '', abspop$`Sig. Urban Area`)
sum(str_detect(abspop$`Sig. Urban Area`, pattern = "-"), na.rm = TRUE) #no more hyphens
[1] 0
matches2 <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(intersect(matches, matches2)) == length(matches) #None of the original matches have been lost
[1] TRUE
length(matches2) #now 20 matches, total matches have improved by 4
[1] 20
#Some Locations are all one word, does removing the spaces in `Sig. Urban Area` improve matching?
abspop$`Sig. Urban Area` <- gsub(pattern = ' ', replacement = '', abspop$`Sig. Urban Area`)
matches3 <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(intersect(matches, matches3)) == length(matches) #None of the original matches have been lost
[1] TRUE
length(matches3) #25 matches, total matches have improved by 5
[1] 25

Merging the Data

“inner_join()” was used to merge the ABS population data and the weather data, with Year and Location (Sig. Urban Area) used as the joining keys. Because the merged data did not require any further string manipulation, “Location” and “Year” were converted to factors so the effect of individual locations or years could be observed. Values in the Year column ranged from 2009 to 2017 and the levels were ordered from smallest to largest.

weatherpop <- inner_join(avgweather, abspop, by = c("Location" = "Sig. Urban Area", "Year"))
sapply(weatherpop, class) #Location is currently a character variable but it should be a factor
              Year           Location Avg Daily Rainfall        Avg MaxTemp   Avg Max Humidity 
         "integer"        "character"          "numeric"          "numeric"          "numeric" 
 Avg Max Windspeed   Population(000s)       PopGrowthPct 
         "numeric"          "numeric"          "numeric" 
weatherpop$Location <- as.factor(weatherpop$Location)
levels(weatherpop$Location)
 [1] "Adelaide"     "Albany"       "Albury"       "AliceSprings" "Ballarat"     "Bendigo"     
 [7] "Brisbane"     "Cairns"       "Canberra"     "CoffsHarbour" "Darwin"       "GoldCoast"   
[13] "Hobart"       "Launceston"   "Melbourne"    "Mildura"      "MountGambier" "Newcastle"   
[19] "Perth"        "Portland"     "Sale"         "Sydney"       "Townsville"   "WaggaWagga"  
[25] "Wollongong"  
#Turn Year into a factor so the effect of individual years can be calculated (if we wish to do so)
weatherpop$Year <- factor(weatherpop$Year, levels = c(2009:2017), ordered = TRUE)
levels(weatherpop$Year)
[1] "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
is.ordered(weatherpop$Year) #Year is now ordered, the smallest year is 2009
[1] TRUE

Scan II

Visual inspection of boxplots was used to scan the merged data for outliers. Univariate outliers were identified in the Rainfall, Humidity, Population and Population Growth variables. The outliers in these variables were examined to determine whether they were likely to be the result of an error, or valid observations that just happened to be outliers. The same variables were also scanned for bivariate outliers by grouping observations according to year and visually inspecting the resulting boxplots.

The “mvn()” function was used to identify multivariate outliers. While 100 multivariate outliers were detected, none were for observations later than 2012. The multivariate outliers between 2009 and 2012 may have been a result of the after effects of the “Millennium drought”, which ended with heavy rainfall in 2009 and 2010, (Heberger, 2012). That said, there is no mention of the drought’s impact on the weather data by the BoM. Given the data was published by the ABS and BoM (agencies who have a responsibility to, and reputation for, providing accurate and reliable data), it is unlikely that the outliers observed were the result of error (e.g. data entry errors, measurement errors etc.), and therefore, it was decided that the outliers should not be removed or altered.

In summary, once location had been taken into account and the reliability of the data sources had been considered, there were no outliers in the merged data that seemed to be likely the result of error. While multivariate outliers were detected, they were not removed from the data because it was unlikely that the observations were erroneous and removing the multivariate outliers could have resulted in the loss of valuable information (i.e. the effect of a drought ending on population size/growth). Similarly, capping and imputing were not performed so as to avoid reducing the strength of any relationships between certain weather conditions and population size or growth.

par(mfrow=c(2,4))
plot(weatherpop$Year, main = "Frequency of Years")
plot(weatherpop$Location, main = "Frequency of Locations")
i <- 3 #produce boxplots of each numeric variable (#3 onwards), using a while loop for efficiency
while (i <= ncol(weatherpop)) { 
  boxplot(weatherpop[,i], main = names(weatherpop[i]))
  i <- i+1
}
#outliers present in Rainfall, Humidity, Population and Population Growth.
par(mfrow=c(1,3))

rainbox <- boxplot(weatherpop$`Avg Daily Rainfall`, main = "Avg Daily Rainfall")
humidbox <- boxplot(weatherpop$`Avg Max Humidity`, main = "Avg Max Humidity")
popbox <- boxplot(weatherpop$`Population(000s)`, main = "Population(000s)")
#Rainfall - univariate
par(mfrow=c(1,1))

rain_outliers <- weatherpop[which(weatherpop$`Avg Daily Rainfall` == rainbox$out[[1]]),c("Location", "Avg Daily Rainfall")]
i <- 2
while (i <= length(rainbox$out)) {
  rain_outliers <- rbind(rain_outliers, weatherpop[which(weatherpop$`Avg Daily Rainfall` == rainbox$out[[i]]) ,c("Location", "Avg Daily Rainfall")])
  i <- i+1
}
rain_outliers #all tropical climates/locations
#Rainfall - bivariate (rain and year)
outliers_rain <- boxplot(weatherpop$`Avg Daily Rainfall` ~ weatherpop$Year, xlab = "Year", ylab = "Avg Daily Rainfall", main = "Distribution of Avg Daily Rainfall by Year")

rain_outlier_locs <- weatherpop[which(weatherpop$`Avg Daily Rainfall` == outliers_rain$out[[1]]),c("Year", "Location", "Avg Daily Rainfall")]
i <- 2
while (i <= length(outliers_rain$out)) {
  rain_outlier_locs <- rbind(rain_outlier_locs, weatherpop[which(weatherpop$`Avg Daily Rainfall` == outliers_rain$out[[i]]) ,c("Year", "Location", "Avg Daily Rainfall")])
  i <- i+1
}
rain_outlier_locs
as.vector(unique(rain_outlier_locs$Location)) #All are tropical climates/locations
[1] "Cairns"       "CoffsHarbour" "Townsville"   "Darwin"       "GoldCoast"   
#Humidity - univariate
humid_outliers <- weatherpop[which(weatherpop$`Avg Max Humidity` == humidbox$out[[1]]),c("Location", "Avg Max Humidity")]
i <- 2
while (i <= length(humidbox$out)) {
  humid_outliers <- rbind(humid_outliers, weatherpop[which(weatherpop$`Avg Max Humidity` == humidbox$out[[i]]) ,c("Location", "Avg Max Humidity")])
  i <- i+1
}
as.vector(unique(humid_outliers$Location)) # All Alice Springs, which is known to a have low humidity
[1] "AliceSprings"
#Humidity - bivariate (humidity and year)
outliers_humid <- boxplot(weatherpop$`Avg Max Humidity` ~ weatherpop$Year, xlab = "Year", ylab = "Avg Max Humidity", main = "Distribution of Avg Max Humidity by Year")

out_humid_locs <- weatherpop[which(weatherpop$`Avg Max Humidity` == outliers_humid$out[[1]]),c("Year", "Location", "Avg Max Humidity")]
i <- 2
while (i <= length(outliers_humid$out)) {
  out_humid_locs <- rbind(out_humid_locs, weatherpop[which(weatherpop$`Avg Max Humidity` == outliers_humid$out[[i]]) ,c("Year", "Location", "Avg Max Humidity")])
  i <- i+1
}
as.vector(unique(out_humid_locs$Location)) #AliceSprings was the only outlier location, year-to-year
[1] "AliceSprings"
#Population - univariate
population_outliers <- weatherpop[which(weatherpop$`Population(000s)` == popbox$out[[1]]),c("Location", "Population(000s)")]
i <- 2
while (i <= length(popbox$out)) {
  population_outliers <- rbind(population_outliers, weatherpop[which(weatherpop$`Population(000s)` == popbox$out[[i]]),c("Location", "Population(000s)")])
  i <- i+1
}
as.vector(unique(population_outliers$Location))
[1] "Adelaide"  "Brisbane"  "Melbourne" "Perth"     "Sydney"   
#Population - bivariate (population and year)
outliers_pop <- boxplot(weatherpop$`Population(000s)` ~ weatherpop$Year, xlab = "Year", ylab = "Population(000s)", main = "Distribution of Population(000s) by Year")

pop_outlier_locs <- weatherpop[which(weatherpop$`Population(000s)` == outliers_pop$out[[1]]),c("Year", "Location", "Population(000s)")]
i <- 2
while (i <= length(outliers_pop$out)) {
  pop_outlier_locs <- rbind(pop_outlier_locs, weatherpop[which(weatherpop$`Population(000s)` == outliers_pop$out[[i]]),c("Year", "Location", "Population(000s)")])
  i <- i+1
}
as.vector(unique(pop_outlier_locs$Location)) #All population outliers are capital cities.
[1] "Adelaide"  "Brisbane"  "Melbourne" "Perth"     "Sydney"   
# As population growth is a result of year and population, use a bivariate scan.
outliers_growth <- boxplot(weatherpop$PopGrowthPct ~ weatherpop$Year, xlab = "Year", ylab = "YoY Population Growth (%)", main = "Distribution of Year-on-Year Population Growth (%) by year")

growth_outlier_locs <- weatherpop[which(weatherpop$PopGrowthPct == outliers_growth$out[[1]]), c("Year", "Location", "PopGrowthPct")]
i <- 2
while (i <= length(outliers_growth$out)) {
  growth_outlier_locs <- rbind(growth_outlier_locs, weatherpop[which(weatherpop$PopGrowthPct == outliers_growth$out[[i]]),c("Year", "Location", "PopGrowthPct")])
  i <- i+1
}
growth_outlier_locs
## Multivariate outliers
multi <- mvn(data = weatherpop[3:8], multivariateOutlierMethod = "quan", showOutliers = TRUE)

multi$multivariateOutliers #observations 1-100 are multivariate outliers (MVOs)
weather_MVOs <- weatherpop[ c(1:100),] 
summary(weather_MVOs) #Every observation from 2009-2012 is a multivariate outlier
      Year            Location  Avg Daily Rainfall  Avg MaxTemp    Avg Max Humidity
 2009   :25   Adelaide    : 4   Min.   :0.2104     Min.   :17.33   Min.   :32.19   
 2010   :25   Albany      : 4   1st Qu.:1.7150     1st Qu.:20.09   1st Qu.:67.51   
 2011   :25   Albury      : 4   Median :2.3792     Median :22.35   Median :70.36   
 2012   :25   AliceSprings: 4   Mean   :2.7836     Mean   :22.89   Mean   :70.46   
 2013   : 0   Ballarat    : 4   3rd Qu.:3.3721     3rd Qu.:25.17   3rd Qu.:75.95   
 2014   : 0   Bendigo     : 4   Max.   :7.6848     Max.   :32.67   Max.   :85.20   
 (Other): 0   (Other)     :76                                                      
 Avg Max Windspeed Population(000s)   PopGrowthPct    
 Min.   : 9.698    Min.   :  10.81   Min.   :-0.4704  
 1st Qu.:17.503    1st Qu.:  53.78   1st Qu.: 0.8263  
 Median :21.082    Median : 116.52   Median : 1.2165  
 Mean   :20.170    Mean   : 650.32   Mean   : 1.2933  
 3rd Qu.:23.113    3rd Qu.: 454.39   3rd Qu.: 1.8104  
 Max.   :27.645    Max.   :4305.28   Max.   : 3.1997  
                                                      

Transform

When it was identified that a variable may benefit from being transformed, a new variable with the same values as the original was created and this was then transformed. This approach was chosen because it would provide more analysis options to anyone wishing to undertake analysis on the dataset. For example, if a non-parametric test could be used to conduct analysis then the original variables could be used. Alternatively, if parametric tests were going to be conducted and ease of interpretation was not a concern, then the transformed variables could be used.

The distributions of the variables in the weatherpop data were explored and 4 variables were identified as having skewed distributions that could benefit from being transformed.

#Check the distribution of each variable and transform if necessary
i <- 3  #don't need to view the distributions of "Year" or "Location" because they're categorical
par(mfrow = c(2,3))
while (i <= ncol(weatherpop)) {
  hist(weatherpop[[i]], main = colnames(weatherpop[i]), xlab = colnames(weatherpop[i]))
  i <- i+1
} #Rainfall and population are positively-skewed; Humidity and windspeed are negatively-skewed.
#Use BoxCox transformation on Population as it needs to be transformed, not just normalised.
par(mfrow = c(1,2))

weatherpop <- weatherpop %>% mutate(bxcx_pop = `Population(000s)`)
weatherpop$bxcx_pop <- BoxCox(weatherpop$`Population(000s)`, lambda = "auto")
hist(weatherpop$`Population(000s)`, main="Distribution of Population(000s)", xlab="Population(000s)")
hist(weatherpop$bxcx_pop, main="BoxCox Population(000s)", xlab="BoxCox transformed Population(000s)")
#Experiment with Transforming the Rainfall, Humidity and Windspeed data
par(mfrow=c(1,2))

hist(sqrt(weatherpop$`Avg Daily Rainfall`),main="Square-root of Avg Daily Rainfall",xlab="Square-root of Avg Daily Rainfall")
hist(log(weatherpop$`Avg Daily Rainfall`),main="Log of Avg Daily Rainfall", xlab="Log of Avg Daily Rainfall")

weatherpop <- weatherpop %>% mutate(sqrt_rain = sqrt(`Avg Daily Rainfall`))
hist((weatherpop$`Avg Max Humidity`)**2,main="Avg Max Humidity squared",xlab="Avg Max Humidity squared")
hist((weatherpop$`Avg Max Humidity`)**3, main="Avg Max Humidity cubed",xlab="Avg Max Humidity cubed")

weatherpop <- weatherpop %>% mutate(humidity_cubed = `Avg Max Humidity`**3) #appeared closer to normal
hist((weatherpop$`Avg Max Windspeed`)**2, main="Avg Max Windspeed squared",xlab="Avg Max Windspeed squared")
hist((weatherpop$`Avg Max Windspeed`)**3,main="Avg Max Windspeed cubed",xlab="Avg Max Windspeed cubed") #Squaring appears closer to normal

weatherpop <- weatherpop %>% mutate(windspeed_sqrd = `Avg Max Windspeed`**2)

References

Australian Bureau of Statistics, 2020. 3218.0 - Regional Population Growth, Australia, 2018-2019 - Population Estimates by Significant Urban Area, 2009 to 2019, viewed 16 May 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument

Australian Bureau of Statistics, 2010. 3218.0 - Regional Population Growth, Australia, 2008-09, viewed 06 June 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/Previousproducts/3218.0Main%20Features102008-09?opendocument&tabname=Summary&prodno=3218.0&issue=2008-09#=&view=

Bureau of Meteorology, 2020. Daily Weather Observations, viewed 16 May 2020, http://www.bom.gov.au/climate/dwo/

Community.rstudio.com, 2019. Growth rate calculation in R, viewed 31 May 2020, https://community.rstudio.com/t/growth-rate-calculation-in-r/38675

Downes, P. M., Hanslow, K., & Tulip, P. (2014). The effect of the mining boom on the Australian economy. Reserve Bank of Australia research discussion paper, (2014-08).

Heberger, M. (2012). Australia’s millennium drought: Impacts and responses. In The world’s water (pp. 97-125). Island Press, Washington, DC.

Kaggle.com, 2019. Rain in Australia - Predict rain tomorrow in Australia, viewed 16 May 2020, https://www.kaggle.com/jsphyg/weather-dataset-rattle-package#weatherAUS.csv

Stackoverflow.com, 2019. Impute missing data with mean by group, viewed 23 May 2020, https://stackoverflow.com/questions/55345593/impute-missing-data-with-mean-by-group

Walker, S. C., 2013. Ecology & stats, ‘Remove (or replace) everything before or after a specified character in R strings’, viewed 25 May 2020, https://stevencarlislewalker.wordpress.com/2013/02/13/remove-or-replace-everything-before-or-aftera-specified-character-in-r-strings/



---
title: "MATH2349 Semester 1, 2020"
author: "James Angus - s3851901"
subtitle: Assignment 2
output:
  html_notebook: default
---
## Required packages 
```{r}
library(readxl)
library(tidyr)
library(dplyr)
library(lubridate)
library(outliers)
library(stringr)
library(MVN)
library(forecast)
```

## Executive Summary 

The purpose of the pre-processing performed in this assignment was to produce a dataset that could be used to explore whether there is a relationship between average maximum temperature, rainfall, windspeed and/or humidity, and population size and/or growth, for locations throughout Australia.
The raw data was sourced from the Australian Bureau of Statistics (ABS) and Kaggle (via the Australian Bureau of Meteorology) and pre-processed using the following steps:

* Importing the raw data using the functions read_excel (read_xl package) and read.csv (base R).
* Understanding the data and apply data type conversions as necessary.
* Tidying and manipulating the ABS population data, including the transformation of an existing column to improve readability and the creation of a column measuring population growth, year-to-year.
* Scanning each dataset for missing and special values (e.g. "NA", "NaN").
* Imputing missing values using mean or median (depending on the distribution of the variable).
* Aggregating daily observations and producing averages to align observations frequencies.
* Manipulation of location names (strings) to improve data similarity.
* Merging the two datasets based on locations and Years.
* Scanning the merged data for outliers, identifying whether the outliers were erroneous or valid observations and justifying their retention.
* Making duplicates of variables and transforming them to provide alternative versions of the variables that could be used for parametric tests (i.e. they satisfy the assumption of normality).

## Data 

The data in this assignment was sourced from the ABS and Kaggle.
The ABS data was downloaded from the ABS website on 16/05/2020 using the following link: "https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&32180ds0003_2009-19.xls&3218.0&Data%20Cubes&381DE9FC52CAD374CA258535000B98C0&0&2018-19&25.03.2020&Latest". The data was stored as an Excel 97-2003 workbook (.xls) file.

According to the ABS, the data contains estimates of resident populations according to Significant Urban Areas of Australia as of June 30 each year from 2009 to 2019 (Australian Bureau of Statistics, 2020). Significant Urban Areas are areas with a population of at least 10,000 based on Statistical Areas Level 2 (SA2s) and can represent a collection of Urban Centres whose collective population is at least 10,000 (Australian Bureau of Statistics, 2020). The ABS data contained the following variables:

* ASGS code - Australian Statistical Geography Standard (ASGS), Significant Urban Area identifier
* Significant Urban Areas - The name of the Significant Urban Area
* Each of the variables from "2009" to "2019" are the estimated resident population of a given Significant Urban Area as of June 30 of the relevant year
* Change 2018-2019 No. - the change in estimated resident population from 2018 to 2019
* Change 2018-2019 % - the percentage change in estimated resident population from 2018 to 2019

The Kaggle data was downloaded from "https://www.kaggle.com/jsphyg/weather-dataset-rattle-package#weatherAUS.csv" and is stored as a Comma Separated Value (CSV) file. The Kaggle data was originally sourced from the Australian Bureau of Meteorology (BoM) website ("http://www.bom.gov.au/climate/data") and was published to the Kaggle website as a single CSV file by the Kaggle user "Joe Young" ("https://www.kaggle.com/jsphyg").

According to the Kaggle and BoM websites, the data contains daily weather observations from weather stations across Australia from 01/11/2017 to 25/06/2017 (Kaggle.com, 2019). Not all weather stations in the network observe all variables and, in some locations and where appropriate, observations from a number of stations have been combined to produce a summary for a single location (Bureau of Meteorology, 2020). The data contained the following variables:

* Date - The date of observation
* Location - The name of the location of the weather station
* MinTemp - The minimum temperature observed in the 24 hours to 9am, measured in degrees Celsius
* MaxTemp - The maximum temperature observed in the 24 hours to 9am, measured in degrees Celsius
* Rainfall - The amount of rainfall recorded on a given day, measured in millimetres (mm)
* Evaporation - The Class A pan evaporation in the 24 hours to 9am, measured in millimetres (mm)
* Sunshine - The number of hours of sunshine observed on a given day
* WindGustDir - The direction of the strongest wind gust observed in the 24 hours to midnight
* WindGustSpeed - Strongest wind gust (in kilometres per hour) observed in the 24 hours to midnight
* WindDir9am - Wind direction measured at 9am
* WindDir3pm - Wind direction measured at 3pm
* WindSpeed9am - Wind speed measured in kilometres per hour (km/h) at 9am
* WindSpeed3pm - Wind speed measured in kilometres per hour (km/h) at 3pm
* Humidity9am - Relative humidity (percent) measured at 9am
* Humidity3pm - Relative humidity (percent) measured at 3pm
* Pressure9am - Atmospheric pressure measured in hectopascals (hPa) at 9am
* Pressure3pm - Atmospheric pressure measured in hectopascals (hPa) at 3pm
* Cloud9am - Fraction of sky obscured by cloud at 9am, measured in eighths (octas)
* Cloud3pm - Fraction of sky obscured by cloud at 3pm, measured in eighths (octas)
* Temp9am - The temperature observed at 9am, measured in degrees Celsius
* Temp3pm - The temperature observed at 3pm, measured in degrees Celsius
* RainToday - Indicates whether rain was observed on the day
* RISK_MM - The amount of rainfall for the next day, measured in millimetres (mm)
* RainTomorrow - Indicates whether rain was observed on following day

```{r}
setwd("C:/Users/James/Desktop/Post Grad studies/Data Wrangling - MATH2349/Assignment 2")
# Import the ABS population data as "abspop"
abspop <- read_excel("32180ds0003_2009-19.xls", sheet = "Table 1", skip = 6)
names(abspop)[names(abspop) == "...2"] <- "Sig. Urban Area" #Rename second column to "Sig. Urban Area"
names(abspop) #the columns with relevant data are 2-13
tail(abspop,10) #There is a totals row and 6 other rows at the bottom of the data that are not needed
abspop <- abspop[3:(nrow(abspop)-7), 2:13] #select the rows and columns with relevant data in them
head(abspop)
# Import the Kaggle weather data as "weather"
weather <- read.csv("weatherAUS.csv", stringsAsFactors = FALSE)
#Select the columns relevant to fulfilling the goal of the pre-processing
desired_measures <- c("Date", "Location", "MaxTemp", "Rainfall", "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm") #To compare population growth against heat, rain, wind and humidity.
weather <- select(weather, all_of(desired_measures))
head(weather)
## Data will be merged after it has been tidied, manipulated and formatted
```

## Understand 

For the ABS population data:

* The data consisted of 110 observations of 12 variables, all of which were character data types.
* The population estimations were imported as characters but were actually whole numbers. Thus, the population estimations were converted from characters to integers.
* While the Significant Urban Areas (SUA) were categorical and could have been converted to factors, the SUAs had to be manipulated later-on to improve the number of matches between SUAs and Locations in the weather data. This manipulation would convert the SUAs back to characters and so, to avoid duplicating effort, SUAs were converted to factors after the datasets were merged.

For the Kaggle weather data:

* The data consisted of 142,193 observations of 8 variables. Two variables were character type (Date and Location), two were numeric type (MaxTemp and Rainfall) and the rest were integer type.
* The Dates were imported as characters but were converted to a date type using the "ymd()" function.
* Although the Locations were categorical and could have been converted to factors, doing so would have been suboptimal because the Significant Urban Areas (SUAs) in the population data were left as characters. With the SUAs being characters, the Locations would be coerced to a character data type when the two were used as keys for merging the two datasets. To avoid this, the Locations were converted to factors after the datasets were merged.

The Year/Date data was converted to an ordered factor after the weather and population data were merged (refer ## Merging the data)

```{r}
## Review and revise the variables in the population data
str(abspop)
abspop[ ,2:12]<- sapply(abspop[ ,2:12], as.integer) #Convert Year columns to integers
sapply(abspop, class)
## Review and revise the variables in the weather data
str(weather)
weather$Date <- ymd(weather$Date) #lubridate package - convert "Date" to a date type using "ymd()"
sapply(weather, class)
```

##	Tidy & Manipulate Data I 

The ABS population data did not conform to the tidy data principles because each observation (estimated population in a given Sig. Urban Area (SUA) in a given year) did not have its own row. Instead, observations were spread across rows and columns, meaning the data was in a wide format. In order to make the ABS population data tidy, the year columns had to be rearranged so they were in rows, and the estimated populations (the values) were in their own column.
The ABS population data was tidied using the "gather()" function, which took the year columns and rearranged them to be rows while creating a new column (called "Population") to house the value of each observation.
The Population column was divided by 1,000 to adjust the scale and improve readability. The column was also renamed to "Population(000s)" to ensure its values would be interpreted correctly.
Finally, a new column was created to measure the year-on-year growth in population for each SUA, using a technique adapted from the rstudio community website (Community.rstudio.com, 2019). For the first observation in each SUA (2009) the growth was set to a missing value, as setting it to zero could result in inaccurate conclusions being drawn. The missing values were imputed later (refer ## Scan I).

```{r}
#Tidy the population data
abspop <- abspop %>% gather(Year, Population, -1) %>% select(c(2,1,3))
abspop$Year <- as.integer(abspop$Year)
head(abspop, 3)
#Transform the Population variable to make it easier to read.
abspop <- abspop %>% mutate(Population = Population/1000) %>% rename(`Population(000s)` = Population)
#Add a year-on-year growth column to the population data
abspop <- abspop %>% arrange(`Sig. Urban Area`, Year) %>% 
  mutate(PopGrowthPct = 
  case_when(Year == 2009 ~ as.double(NA),
            Year != 2009 ~ (`Population(000s)`-lag(`Population(000s)`))/lag(`Population(000s)`)*100))
```

##	Scan I

The datasets were scanned for missing values and obvious errors before the datasets were manipulated or merged as this made it easier to identify explanations for missing/erroneous values. For example, it would be easier to identify that a missing value in the weather data was due to a weather station being unable to track that measurement before the data is manipulated or merged, because location or time-specific context could be lost when the data is manipulated or merged.

Because variables like temperature, humidity and rainfall can vary from location-to-location, using the mean/median of all locations to impute missing values for a single location could result in unrealistic values being imputed. Therefore, when a missing value was identified, the mean/median for the relevant measure was calculated for that particular location and used to impute the missing value.

The population data was scanned and 110 missing values were identified, being the NAs that were added when the population growth variable was created. Because the distribution of the variable was skewed, the median was used to impute the missing values on a location-by-location basis. There were no special values ("NaN" or infinities) in the ABS population data.

The weather data was scanned and 11,090 missing values were identified, with all missing values being in numeric variables. Before imputing missing values, the variables with normal distributions were identified and their missing values were imputed using means and a technique adapted from stackoverflow (Stackoverflow.com, 2019). The remaining missing values were imputed using medians.

Once the missing values in the weather data had been imputed, no more missing values were present in the data. There were also no special values ("NaN" or infinites).

```{r}
#Scan the population data for missing values 
sum(is.na(abspop)) #110 NAs in the population data
sum(apply(abspop, 1, is.nan)) #no NAN values
sum(apply(abspop, 1, is.infinite)) #no infinite values
#Impute the NAs in population growth
boxplot(abspop$PopGrowthPct, main = "Distribution of Population Growth (%)") #not normally distributed
#use median to impute missing values in PopGrowthPct as it is not normally distributed
abspop <- abspop %>% group_by(`Sig. Urban Area`) %>%
  mutate(PopGrowthPct = ifelse(is.na(PopGrowthPct), median(PopGrowthPct, na.rm = TRUE), PopGrowthPct))
#Scan the weather data for missing values
initial_weather_NAs <- sum(is.na(weather))
initial_weather_NAs #11,090 NAs in the weather data
colSums(is.na(weather)) #all columns with missing values are numeric data types
##Impute missing values - use mean for normally-distributed variables and median for everything else
par(mfrow = c(2,3))
i <- 1
while (i <= ncol(weather)) { 
  if (is.numeric(weather[,i]) == TRUE)  boxplot(weather[,i], main = names(weather[i]))
  i <- i+1
} #MaxTemp and Humidity3pm appear normally distributed
which(abs(scores(weather$MaxTemp, type = "z")) >3) #0 outliers
which(abs(scores(weather$Humidity3pm, type = "z")) >3) #0 outliers
#MaxTemp and Humidity3pm appear normally distributed with no outliers. Impute them using mean
nrml_dist_vars <- c("MaxTemp", "Humidity3pm")
for (i in nrml_dist_vars) {
  weather[,i] <-  ave(weather[,i], weather$Location, FUN = function(x) 
    ifelse(is.na(x), mean(x,na.rm = TRUE), x))
}
initial_weather_NAs-sum(is.na(weather)) #3,932 fewer NAs
#For the variables with skewed distributions, impute missing values using median.
weather <- weather %>% group_by(Location) %>%
  mutate(Rainfall = ifelse(is.na(Rainfall), median(Rainfall, na.rm = TRUE), Rainfall))
weather <- weather %>% group_by(Location) %>%
  mutate(WindSpeed9am = ifelse(is.na(WindSpeed9am), median(WindSpeed9am, na.rm = TRUE), WindSpeed9am))
weather <- weather %>% group_by(Location) %>%
  mutate(WindSpeed3pm = ifelse(is.na(WindSpeed3pm), median(WindSpeed3pm, na.rm = TRUE), WindSpeed3pm))
weather <- weather %>% group_by(Location) %>%
  mutate(Humidity9am = ifelse(is.na(Humidity9am), median(Humidity9am, na.rm = TRUE), Humidity9am))
sum(is.na(weather)) #no more missing values in the weather data
sum(apply(weather, 1, is.nan)) #no NaN values in the weather data
sum(apply(weather, 1, is.infinite)) #no infinite values in the weather data
```

##	Tidy & Manipulate Data II

While the weather data had daily observations, the ABS population data had yearly observations (population estimates). Therefore, the weather data had to be aggregated to align with the frequency of the ABS population data. The "mutate()" function and "year" argument were used to create a Year column in the weather data, based on the year in the Date column. Once the Year field had been created, mutate was again used to create columns for maximum humidity and wind speed, so that relationships between humidity or windiness and population size/growth could be explored.
Finally, in order to align with the population data, the weather data was grouped by year and location and daily measurements were averaged to produce average daily figures for each year and location.

The ABS population data and weather data were then compared to identify any locations that required manipulation in order to match. Although 143 unique names were identified, only 16 names matched. An exploration of the names that did not match revealed that some Sig. Urban Areas had hyphens, which prevented them from matching to very similar Locations that did not (e.g. "Albury" and "Albury - Wodonga").  To resolve this, "gsub()" was used to remove hyphens and everything after them from the names of Sig. Urban Areas using a technique adapted from the "ecology & stats" blog by Steve Walker (Walker, S. C., 2013).  This increased the number of matches to 20, an increase of 4 matches. Further examination revealed that the names of some Locations were actually two words without a space separating them (e.g. "CoffsHarbour"). To improve the number of matching locations, spaces within Sig. Urban Area names were removed using "gsub()", leading to a total of 25 matches.

By manipulating the names of Sig. Urban Areas, the number of matching location names achieved between the two datasets was 25, representing 23% of Sig. Urban Areas and 51% of weather station Locations.

```{r}
#Group the weather data by year to create a "Year" column using the "Date"" column
weather <- weather %>% mutate(Year = year(Date))
weather$Year <- as.integer(weather$Year)
#Create daily maximums for humidity and windspeed
weather <- weather %>% mutate(MaxHumidity = pmax(Humidity9am, Humidity3pm, na.rm = TRUE),
                              MaxWindspeed = pmax(WindSpeed9am, WindSpeed3pm, na.rm = TRUE))
#Group by year and create average daily values for rain, temperature, humidity and windspeed
avgweather <- weather %>% group_by(Year, Location) %>% 
  summarise("Avg Daily Rainfall" = mean(Rainfall, na.rm = TRUE),
            "Avg MaxTemp" = mean(MaxTemp, na.rm = TRUE),
            "Avg Max Humidity" = mean(MaxHumidity, na.rm = TRUE),
            "Avg Max Windspeed" = mean(MaxWindspeed, na.rm = TRUE))
##Preparation for merging
unique_places <- union(avgweather$Location, abspop$`Sig. Urban Area`)
length(unique_places) #143
matches <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(matches) #16
#Matching on Location and Sig. Urban Area produces 16 matches, could any be altered in order to match?
mismatches <- setdiff(unique_places, matches)
head(sort(mismatches),30) #Hyphenated names and names with spaces appear to be a mismatch issue.
#Search for hypens in Location and Sig. Urban Area, are there hyphens in just one?
sum(str_detect(mismatches, pattern = "-"), na.rm = TRUE) #20 hyphens in mismatches
sum(str_detect(avgweather$Location, pattern = "-"), na.rm = TRUE) #no hyphens in "Location"
sum(str_detect(unique(abspop$`Sig. Urban Area`), pattern = "-")) #all hyphens are in `Sig. Urban Area`
#Does removing everything after a hyphen improve the number of matches?
abspop$`Sig. Urban Area` <- gsub(pattern = ' -.*', replacement = '', abspop$`Sig. Urban Area`)
sum(str_detect(abspop$`Sig. Urban Area`, pattern = "-"), na.rm = TRUE) #no more hyphens
matches2 <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(intersect(matches, matches2)) == length(matches) #None of the original matches have been lost
length(matches2) #now 20 matches, total matches have improved by 4
#Some Locations are all one word, does removing the spaces in `Sig. Urban Area` improve matching?
abspop$`Sig. Urban Area` <- gsub(pattern = ' ', replacement = '', abspop$`Sig. Urban Area`)
matches3 <- intersect(avgweather$Location, abspop$`Sig. Urban Area`)
length(intersect(matches, matches3)) == length(matches) #None of the original matches have been lost
length(matches3) #25 matches, total matches have improved by 5
```

## Merging the Data

"inner_join()" was used to merge the ABS population data and the weather data, with Year and Location (Sig. Urban Area) used as the joining keys.
Because the merged data did not require any further string manipulation, "Location" and "Year" were converted to factors so the effect of individual locations or years could be observed. Values in the Year column ranged from 2009 to 2017 and the levels were ordered from smallest to largest.

```{r}
weatherpop <- inner_join(avgweather, abspop, by = c("Location" = "Sig. Urban Area", "Year"))
sapply(weatherpop, class) #Location is currently a character variable but it should be a factor
weatherpop$Location <- as.factor(weatherpop$Location)
levels(weatherpop$Location)
#Turn Year into a factor so the effect of individual years can be calculated (if we wish to do so)
weatherpop$Year <- factor(weatherpop$Year, levels = c(2009:2017), ordered = TRUE)
levels(weatherpop$Year)
is.ordered(weatherpop$Year) #Year is now ordered, the smallest year is 2009
```

##	Scan II

Visual inspection of boxplots was used to scan the merged data for outliers. Univariate outliers were identified in the Rainfall, Humidity, Population and Population Growth variables. The outliers in these variables were examined to determine whether they were likely to be the result of an error, or valid observations that just happened to be outliers. The same variables were also scanned for bivariate outliers by grouping observations according to year and visually inspecting the resulting boxplots.

* All outliers in the Rainfall variable were found to occur in tropical locations, where high rainfall is common. Therefore, these outliers were deemed to be valid observations.
* All outliers in Humidity occurred in Alice Springs, which is known to be a very dry location. Therefore, the Humidity outliers were deemed to be valid observations.
* All outliers in Population occurred in capital cities. As people tend to congregate in cities, it made sense that the population outliers occurred in capital cities. Thus, they were deemed valid. 
* Because the Population Growth variable was created using the Year variable, Population Growth was only scanned for bivariate outliers to ensure outliers were not masked by any measurement/data entry errors specific to a given year. Bivariate outliers were identified in Perth, Darwin, Alice Springs and Launceston, with the two largest growths being Darwin (4.07% in 2013) and Perth (2.94% in 2013). Factors such as the Australian mining boom, which lead to increased migration into Western Australia, and the fact that the Northern Territory was the third-fastest growing state in Australia from 2004 to 2009, offered reassurance that these outliers were unlikely to be due to error (Downes et al. 2014), (Australian Bureau of Statistics, 2010). Furthermore, given that the Population Growth variable was based on data provided by the ABS, which is known for providing reliable and accurate data, there was little reason to suspect that the outliers were the result of errors. Thus, they were deemed valid.

The "mvn()" function was used to identify multivariate outliers. While 100 multivariate outliers were detected, none were for observations later than 2012. The multivariate outliers between 2009 and 2012 may have been a result of the after effects of the "Millennium drought", which ended with heavy rainfall in 2009 and 2010, (Heberger, 2012). That said, there is no mention of the drought's impact on the weather data by the BoM. Given the data was published by the ABS and BoM (agencies who have a responsibility to, and reputation for, providing accurate and reliable data), it is unlikely that the outliers observed were the result of error (e.g. data entry errors, measurement errors etc.), and therefore, it was decided that the outliers should not be removed or altered.

In summary, once location had been taken into account and the reliability of the data sources had been considered, there were no outliers in the merged data that seemed to be likely the result of error. While multivariate outliers were detected, they were not removed from the data because it was unlikely that the observations were erroneous and removing the multivariate outliers could have resulted in the loss of valuable information (i.e. the effect of a drought ending on population size/growth). Similarly, capping and imputing were not performed so as to avoid reducing the strength of any relationships between certain weather conditions and population size or growth.

```{r}
par(mfrow=c(2,4))
plot(weatherpop$Year, main = "Frequency of Years")
plot(weatherpop$Location, main = "Frequency of Locations")
i <- 3 #produce boxplots of each numeric variable (#3 onwards), using a while loop for efficiency
while (i <= ncol(weatherpop)) { 
  boxplot(weatherpop[,i], main = names(weatherpop[i]))
  i <- i+1
}
#outliers present in Rainfall, Humidity, Population and Population Growth.
par(mfrow=c(1,3))
rainbox <- boxplot(weatherpop$`Avg Daily Rainfall`, main = "Avg Daily Rainfall")
humidbox <- boxplot(weatherpop$`Avg Max Humidity`, main = "Avg Max Humidity")
popbox <- boxplot(weatherpop$`Population(000s)`, main = "Population(000s)")
#Rainfall - univariate
par(mfrow=c(1,1))
rain_outliers <- weatherpop[which(weatherpop$`Avg Daily Rainfall` == rainbox$out[[1]]),c("Location", "Avg Daily Rainfall")]
i <- 2
while (i <= length(rainbox$out)) {
  rain_outliers <- rbind(rain_outliers, weatherpop[which(weatherpop$`Avg Daily Rainfall` == rainbox$out[[i]]) ,c("Location", "Avg Daily Rainfall")])
  i <- i+1
}
rain_outliers #all tropical climates/locations
#Rainfall - bivariate (rain and year)
outliers_rain <- boxplot(weatherpop$`Avg Daily Rainfall` ~ weatherpop$Year, xlab = "Year", ylab = "Avg Daily Rainfall", main = "Distribution of Avg Daily Rainfall by Year")
rain_outlier_locs <- weatherpop[which(weatherpop$`Avg Daily Rainfall` == outliers_rain$out[[1]]),c("Year", "Location", "Avg Daily Rainfall")]
i <- 2
while (i <= length(outliers_rain$out)) {
  rain_outlier_locs <- rbind(rain_outlier_locs, weatherpop[which(weatherpop$`Avg Daily Rainfall` == outliers_rain$out[[i]]) ,c("Year", "Location", "Avg Daily Rainfall")])
  i <- i+1
}
rain_outlier_locs
as.vector(unique(rain_outlier_locs$Location)) #All are tropical climates/locations
#Humidity - univariate
humid_outliers <- weatherpop[which(weatherpop$`Avg Max Humidity` == humidbox$out[[1]]),c("Location", "Avg Max Humidity")]
i <- 2
while (i <= length(humidbox$out)) {
  humid_outliers <- rbind(humid_outliers, weatherpop[which(weatherpop$`Avg Max Humidity` == humidbox$out[[i]]) ,c("Location", "Avg Max Humidity")])
  i <- i+1
}
as.vector(unique(humid_outliers$Location)) # All Alice Springs, which is known to a have low humidity
#Humidity - bivariate (humidity and year)
outliers_humid <- boxplot(weatherpop$`Avg Max Humidity` ~ weatherpop$Year, xlab = "Year", ylab = "Avg Max Humidity", main = "Distribution of Avg Max Humidity by Year")
out_humid_locs <- weatherpop[which(weatherpop$`Avg Max Humidity` == outliers_humid$out[[1]]),c("Year", "Location", "Avg Max Humidity")]
i <- 2
while (i <= length(outliers_humid$out)) {
  out_humid_locs <- rbind(out_humid_locs, weatherpop[which(weatherpop$`Avg Max Humidity` == outliers_humid$out[[i]]) ,c("Year", "Location", "Avg Max Humidity")])
  i <- i+1
}
as.vector(unique(out_humid_locs$Location)) #AliceSprings was the only outlier location, year-to-year
#Population - univariate
population_outliers <- weatherpop[which(weatherpop$`Population(000s)` == popbox$out[[1]]),c("Location", "Population(000s)")]
i <- 2
while (i <= length(popbox$out)) {
  population_outliers <- rbind(population_outliers, weatherpop[which(weatherpop$`Population(000s)` == popbox$out[[i]]),c("Location", "Population(000s)")])
  i <- i+1
}
as.vector(unique(population_outliers$Location))
#Population - bivariate (population and year)
outliers_pop <- boxplot(weatherpop$`Population(000s)` ~ weatherpop$Year, xlab = "Year", ylab = "Population(000s)", main = "Distribution of Population(000s) by Year")
pop_outlier_locs <- weatherpop[which(weatherpop$`Population(000s)` == outliers_pop$out[[1]]),c("Year", "Location", "Population(000s)")]
i <- 2
while (i <= length(outliers_pop$out)) {
  pop_outlier_locs <- rbind(pop_outlier_locs, weatherpop[which(weatherpop$`Population(000s)` == outliers_pop$out[[i]]),c("Year", "Location", "Population(000s)")])
  i <- i+1
}
as.vector(unique(pop_outlier_locs$Location)) #All population outliers are capital cities.
# As population growth is a result of year and population, use a bivariate scan.
outliers_growth <- boxplot(weatherpop$PopGrowthPct ~ weatherpop$Year, xlab = "Year", ylab = "YoY Population Growth (%)", main = "Distribution of Year-on-Year Population Growth (%) by year")
growth_outlier_locs <- weatherpop[which(weatherpop$PopGrowthPct == outliers_growth$out[[1]]), c("Year", "Location", "PopGrowthPct")]
i <- 2
while (i <= length(outliers_growth$out)) {
  growth_outlier_locs <- rbind(growth_outlier_locs, weatherpop[which(weatherpop$PopGrowthPct == outliers_growth$out[[i]]),c("Year", "Location", "PopGrowthPct")])
  i <- i+1
}
growth_outlier_locs
## Multivariate outliers
multi <- mvn(data = weatherpop[3:8], multivariateOutlierMethod = "quan", showOutliers = TRUE)
multi$multivariateOutliers #observations 1-100 are multivariate outliers (MVOs)
weather_MVOs <- weatherpop[ c(1:100),] 
summary(weather_MVOs) #Every observation from 2009-2012 is a multivariate outlier
```

##	Transform 

When it was identified that a variable may benefit from being transformed, a new variable with the same values as the original was created and this was then transformed. This approach was chosen because it would provide more analysis options to anyone wishing to undertake analysis on the dataset. For example, if a non-parametric test could be used to conduct analysis then the original variables could be used. Alternatively, if parametric tests were going to be conducted and ease of interpretation was not a concern, then the transformed variables could be used.

The distributions of the variables in the weatherpop data were explored and 4 variables were identified as having skewed distributions that could benefit from being transformed.

* Population was transformed using the BoxCox transformation as the variable needed to be scaled and its distribution made normal so it could be used in parametric tests. The transformed variable was assigned to "bxcx_pop".
* The "Avg Daily Rainfall" data was positively skewed. The log transformation over-corrected, resulting in a negatively skewed distribution, and so the square-root transformation was chosen to bring the distribution closer to normal. The transformed variable was assigned to "sqrt_rain".
* Both the "Avg Max Humidity" data and the "Avg Max Windspeed" data were negatively skewed. After trying a couple of transformations on each distribution, visual inspection confirmed that raising the "Avg Max Humidity" data to the power of 3 (cubing) and squaring the "Avg Max Windspeed" brought the distributions of the two variables closest to normal. The transformed variables were assigned to "humidity_cubed" and "windspeed_sqrd".

```{r}
#Check the distribution of each variable and transform if necessary
i <- 3  #don't need to view the distributions of "Year" or "Location" because they're categorical
par(mfrow = c(2,3))
while (i <= ncol(weatherpop)) {
  hist(weatherpop[[i]], main = colnames(weatherpop[i]), xlab = colnames(weatherpop[i]))
  i <- i+1
} #Rainfall and population are positively-skewed; Humidity and windspeed are negatively-skewed.
#Use BoxCox transformation on Population as it needs to be transformed, not just normalised.
par(mfrow = c(1,2))
weatherpop <- weatherpop %>% mutate(bxcx_pop = `Population(000s)`)
weatherpop$bxcx_pop <- BoxCox(weatherpop$`Population(000s)`, lambda = "auto")
hist(weatherpop$`Population(000s)`, main="Distribution of Population(000s)", xlab="Population(000s)")
hist(weatherpop$bxcx_pop, main="BoxCox Population(000s)", xlab="BoxCox transformed Population(000s)")
#Experiment with Transforming the Rainfall, Humidity and Windspeed data
par(mfrow=c(1,2))
hist(sqrt(weatherpop$`Avg Daily Rainfall`),main="Square-root of Avg Daily Rainfall",xlab="Square-root of Avg Daily Rainfall")
hist(log(weatherpop$`Avg Daily Rainfall`),main="Log of Avg Daily Rainfall", xlab="Log of Avg Daily Rainfall")
weatherpop <- weatherpop %>% mutate(sqrt_rain = sqrt(`Avg Daily Rainfall`))
hist((weatherpop$`Avg Max Humidity`)**2,main="Avg Max Humidity squared",xlab="Avg Max Humidity squared")
hist((weatherpop$`Avg Max Humidity`)**3, main="Avg Max Humidity cubed",xlab="Avg Max Humidity cubed")
weatherpop <- weatherpop %>% mutate(humidity_cubed = `Avg Max Humidity`**3) #appeared closer to normal
hist((weatherpop$`Avg Max Windspeed`)**2, main="Avg Max Windspeed squared",xlab="Avg Max Windspeed squared")
hist((weatherpop$`Avg Max Windspeed`)**3,main="Avg Max Windspeed cubed",xlab="Avg Max Windspeed cubed") #Squaring appears closer to normal
weatherpop <- weatherpop %>% mutate(windspeed_sqrd = `Avg Max Windspeed`**2)
```

## References
Australian Bureau of Statistics, 2020. 3218.0 - Regional Population Growth, Australia, 2018-2019 - Population Estimates by Significant Urban Area, 2009 to 2019, viewed 16 May 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument

Australian Bureau of Statistics, 2010. 3218.0 - Regional Population Growth, Australia, 2008-09, viewed 06 June 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/Previousproducts/3218.0Main%20Features102008-09?opendocument&tabname=Summary&prodno=3218.0&issue=2008-09&num=&view=

Bureau of Meteorology, 2020. Daily Weather Observations, viewed 16 May 2020, http://www.bom.gov.au/climate/dwo/

Community.rstudio.com, 2019. Growth rate calculation in R, viewed 31 May 2020, https://community.rstudio.com/t/growth-rate-calculation-in-r/38675

Downes, P. M., Hanslow, K., & Tulip, P. (2014). The effect of the mining boom on the Australian economy. Reserve Bank of Australia research discussion paper, (2014-08).

Heberger, M. (2012). Australia's millennium drought: Impacts and responses. In The world's water (pp. 97-125). Island Press, Washington, DC.

Kaggle.com, 2019. Rain in Australia - Predict rain tomorrow in Australia, viewed 16 May 2020, https://www.kaggle.com/jsphyg/weather-dataset-rattle-package#weatherAUS.csv

Stackoverflow.com, 2019. Impute missing data with mean by group, viewed 23 May 2020, https://stackoverflow.com/questions/55345593/impute-missing-data-with-mean-by-group

Walker, S. C., 2013. Ecology & stats, 'Remove (or replace) everything before or after a specified character in R strings', viewed 25 May 2020, https://stevencarlislewalker.wordpress.com/2013/02/13/remove-or-replace-everything-before-or-aftera-specified-character-in-r-strings/ 

<br>
<br>