MATH2349 Data Wrangling

Required packages

# This is the R chunk for the required packages
library(readxl)
library(tidyr)
library(dplyr)
library(outliers)
library(forecast)
library(knitr)
options(scipen = 999) #See understand section

Executive Summary

The aim of this report is to restructure loan deferral data from Australian banks across the July to August period in 2020 using Rstudio and various packages. The loan deferral data (.xlsx) was obtained from the apra.gov.au website and two separate sheets, one for each months’ data, was read into Rstudio. The current report only was only concerned with descriptive variables of each bank, the total amount of loans deferred (in $millions) and the percentage of loans deferred.

The raw data-set also contains the variable loan type, namely, all, housing and business loans. However, this variable was spread out over the columns. This report will use the Dplyr functions to convert the original data-set from a wide-to-long format. Once converted, the two tables (July and August) will be merged, scanned for missing values/outliers, and subject to other preprocessing tasks.

The resulting product will follow the ‘tidy’ principals to ensure the data is in organised manner and prepared for future data analyses.

Data

Following the restrictions imposed by the government on individuals and businesses due to COVID-19, banks offered temporary loan repayment deferrals to ease the burden for people and small/medium businesses in these tough times. The Australian Prudential Regulation Authority (APRA) has been publishing monthly loan repayment data across loan-lending institutions. Below is the links from where the data was downloaded and the R script to import the data.

July -> https://www.apra.gov.au/temporary-loan-repayment-deferrals-due-to-covid-19-july-2020

August -> https://www.apra.gov.au/temporary-loan-repayment-deferrals-due-to-covid-19-august-2020

# This is the R chunk for the Data Section
deferralsjuly <- read_excel("bankjuly.xlsx", sheet = "Current month", skip = 2)
deferralsaugust <- read_excel("bankaugust.xlsx", sheet ="Current month", skip = 2)

head(deferralsjuly)

head(deferralsaugust)

As previously mentioned, the variables of interest are bank descriptives’, institution name and ABN, type of loan (all, housing, business), amount of loans deferred ($millions), and the percentage of loans deferred (within their type). The following code is the removal of unused variables and the unnecessary first row.

deferralsjuly <- deferralsjuly[-c(1), c(1,2,3,4,9,10,18,19)] #Dropped unused variables and first row removed
deferralsaugust <- deferralsaugust[-c(1), c(1,2,3,4,9,10,18,19)]

head(deferralsjuly)

Tidy & Manipulate Data I

The two tables are of the same structure and are also untidy datasets. Loan type, being all loans, housing loans and business loans, are spread out across the columns. First given is all loans, where data of total amount deferred and percentages are given. In column 5 and 6, data from housing loans is given while business loan data populate the last two columns (see table above). In order to convert the data from a wide to long structure, several steps were required which is outlined below.

Firstly, it was necessary to rename the columns of interest while labeling them with the correct loan type. For example, the variable total amount of loans deferred (in $), we can see R has named them ‘Total…3’,‘…9’ and ‘…18’. Representing total amount deferred for all loans, housing loans and business loans respectively. Although for percentage of loans deferred, loan type is given in the name.

For total amount and percantages of loan deferred variables, the columns were renamed, added with a ’_’ separator and labeled with it’s corresponding loan type (Total ($millions)_Housing). This step was necessary as we are dealing with three variables (loan type) spread across the two existing variables.

#Renamed columns to prepare for pivot function
#July
deferralsjuly <- deferralsjuly %>% rename('Total ($millions)_All' = Total...3, 
                                'Total ($millions)_Housing' = Total...9, 
                                'Total ($millions)_Business' = Total...18)
deferralsjuly <- deferralsjuly %>% rename('% of Loans (Within their type)_All' =  `Share of total loans`,
                                '% of Loans (Within their type)_Housing' = `Share of total housing loans`, 
                                '% of Loans (Within their type)_Business' = `Share of total small and medium business loans`)
#August
deferralsaugust <- deferralsaugust %>% rename('Total ($millions)_All' = Total...3, 
                                          'Total ($millions)_Housing' = Total...9, 
                                          'Total ($millions)_Business' = Total...18)
deferralsaugust <- deferralsaugust %>% rename('% of Loans (Within their type)_All' =  `Share of total loans`,
                                          '% of Loans (Within their type)_Housing' = `Share of total housing loans`, 
                                          '% of Loans (Within their type)_Business' = `Share of total small and medium business loans`)

The two variables of interest are now labeled with their corresponding loan type. Next, it was decided to use the pivot_longer function to achieve the desired result of a ‘long’ format.

# Pivot_longer Codes

deferralsjuly <- deferralsjuly %>% pivot_longer(`Total ($millions)_All` : '% of Loans (Within their type)_Business',
             names_to = c(".value", "Loan Type"),
             names_sep = "_")

deferralsaugust <- deferralsaugust %>% pivot_longer(`Total ($millions)_All` : '% of Loans (Within their type)_Business',
                                                names_to = c(".value", "Loan Type"),
                                                names_sep = "_")

The code above has aspects worth mentioning here. After the function, the code asks to only grab the columns from Total ($millions)_All to % of Loans (Within their type)_Business. This represents all columns besides the descriptive information of the banks. The ‘name_to’ and ‘names_sep’ arguments reflect the reasoning of renaming columns in the structure outlined before.

Specifically, it asks R to;

Create a new variable called ‘Loan type’
From the columns ‘grabbed’ in the first part of the code, the new values populating the Loan type variable will be determined by the ‘.value’ after the ’_’ separator in said columns.

The resulting product (shown below) drops the duplication of the total amount of loans deferred and percentages variables across the columns by creating a new variable. This variable’s values are determined by the column names outlined in the previous section.

head(deferralsaugust)

Understand

With the data now in a ‘tidy’ format, this section focuses on the structure of the variables. From the str() function below, we can see that a few variables are not in their correct type.

str(deferralsjuly)

## tibble [156 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Institution name              : chr [1:156] "Commonwealth Bank of Australia" "Commonwealth Bank of Australia" "Commonwealth Bank of Australia" "National Australia Bank Limited" ...
##  $ ABN                           : num [1:156] 48123123124 48123123124 48123123124 12004044937 12004044937 ...
##  $ Loan Type                     : chr [1:156] "All" "Housing" "Business" "All" ...
##  $ Total ($millions)             : chr [1:156] "62070.2" "45810.9" "13677" "55383.6" ...
##  $ % of Loans (Within their type): chr [1:156] "9.2999999999999999E-2" "0.10100000000000001" "0.26400000000000001" "0.112" ...

Institution name is correct in being a character variable. As for ABN, it is also in the correct form, although without the ‘options(scipen = 999)’ set at the beginning of the report, R automatically converted the ABN variable to an exponential notation (4.81e+10). As the report doesn’t involve any other variables that required this default option on, it was decided to switch it off to better represent the ABN variable.

Next, ‘Loan type’ was converted to a factor variable to represent the 3 groups. Total ($millions) of loans deferred and their percentages were automatically set to character, below is the code where each variable is converted to as.double, % of Loans (Within their type) was then *100 to represent it as a percentage. Both variables were then rounded to the nearest whole number for cleanness.

Lastly, a new variable was created to represent the month being measured, which will come in use for the later merger of the two tables.

# This is the R chunk for the Understand Section
deferralsjuly$`Loan Type` <- factor(deferralsjuly$`Loan Type`, levels = c("All", "Housing", "Business"))
deferralsjuly$`Total ($millions)` <- as.double(deferralsjuly$`Total ($millions)`)  
deferralsjuly$`Total ($millions)` <- round(deferralsjuly$`Total ($millions)`, digits = 0)
deferralsjuly$`% of Loans (Within their type)` <- as.double(deferralsjuly$`% of Loans (Within their type)`)
deferralsjuly$`% of Loans (Within their type)` <- deferralsjuly$`% of Loans (Within their type)`*100
deferralsjuly$`% of Loans (Within their type)` <- round(deferralsjuly$`% of Loans (Within their type)`, digits = 0)
deferralsjuly$Date <- 'July_20'


deferralsaugust$`Loan Type` <- factor(deferralsaugust$`Loan Type`, levels = c("All", "Housing", "Business"))
deferralsaugust$`Total ($millions)` <- as.double(deferralsaugust$`Total ($millions)`)  
deferralsaugust$`Total ($millions)` <- round(deferralsaugust$`Total ($millions)`, digits = 0)
deferralsaugust$`% of Loans (Within their type)` <- as.double(deferralsaugust$`% of Loans (Within their type)`)
deferralsaugust$`% of Loans (Within their type)` <- deferralsaugust$`% of Loans (Within their type)`*100
deferralsaugust$`% of Loans (Within their type)` <- round(deferralsaugust$`% of Loans (Within their type)`, digits = 0)
deferralsaugust$Date <- 'Aug_20'
str(deferralsaugust)

## tibble [153 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Institution name              : chr [1:153] "Commonwealth Bank of Australia" "Commonwealth Bank of Australia" "Commonwealth Bank of Australia" "National Australia Bank Limited" ...
##  $ ABN                           : num [1:153] 48123123124 48123123124 48123123124 12004044937 12004044937 ...
##  $ Loan Type                     : Factor w/ 3 levels "All","Housing",..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ Total ($millions)             : num [1:153] 58842 44760 12487 50486 27109 ...
##  $ % of Loans (Within their type): num [1:153] 9 10 24 10 10 17 7 7 15 10 ...
##  $ Date                          : chr [1:153] "Aug_20" "Aug_20" "Aug_20" "Aug_20" ...

Scan I

To identify missing cases, two new data frames were created under the function !complete.cases (one for each month). This outputs observations where at least one value is ‘NA’ across the variables.

# This is the R chunk for Scan I 
nullJuly <- deferralsjuly[!complete.cases(deferralsjuly),]
nullAugust <- deferralsaugust[!complete.cases(deferralsaugust),]

head(nullJuly)

head(nullAugust)

As shown from the output above, there was no missing information for the bank’s descriptive variables. By observing the first 6 incomplete cases, it seemed that either both Total ($millions) and % of Loans (Within their type) were NA, or one was 0 and the other was NA. The below code checked this assumption across the the two incomplete data frames.

nullJuly %>% filter(`Total ($millions)` >0 | `% of Loans (Within their type)` >0)

nullAugust %>% filter(`Total ($millions)` >0 | `% of Loans (Within their type)` >0)

There were no cases where the Total ($millions) had a value of over 0 and had a NA value in the % of Loans (Within their type) column (and vice versa). This makes logical sense and demonstrates that there are no inconsistencies in the data.

To deal with missing cases in this example, there were two options that were considered. One, converting all NA’s to 0. This would then assume that the data given from the APRA website was not missing, but the given bank simply didn’t have any loans deferred. However, there are observations where 0’s are present, thus it was decided to exclude cases where a value was missing.

CompleteJuly <- deferralsjuly[complete.cases(deferralsjuly),]
CompleteAugust <- deferralsaugust[complete.cases(deferralsaugust),]

Tidy & Manipulate Data II

This section outlines the merging of the July and August data frames using a series of anti_joins and a subsequent full_join. Firstly, an anti_join from the July data frame against the August frame produces an output of July observations that don’t have an observation in August (called ‘UnmatchedJuly’). The same thing was done for the August data frame against the July frame.

Next, the aim was to only retain observations that have a representation in both July and August. For the July set, a second anti_join was computed which dropped all observations that matched with the ‘UnmatchedJuly’ data frame created in the first step. This kept only the July observations that had an observation in August. Same thing was done with the August data frame.

Lastly, the two ‘Matched’ data frames were merged using a full_join, with no duplication of columns. See output below.

# This is the R chunk for the Tidy & Manipulate Data II 
UnmatchedJuly <- CompleteJuly %>% anti_join(CompleteAugust, by = c("ABN", "Loan Type"))
UnmatchedAugust <- CompleteAugust %>% anti_join(CompleteJuly, by = c("ABN", "Loan Type"))

MatchedJuly <- CompleteJuly %>% anti_join(UnmatchedJuly, by = c("ABN", "Loan Type"))
MatchedAugust <- CompleteAugust %>% anti_join(UnmatchedAugust, by = c("ABN", "Loan Type"))

deferralsAugJuly <- MatchedJuly %>% full_join(MatchedAugust)
head(deferralsAugJuly)

Tidy & Manipulate Data III

With the two tables now merged into one deferralsAugJuly data frame, the next step is to manipulate the data to bring some useful insights. In this section, a new variable, namely Total($m) Difference July-Aug, will be created from the existing data. The Total($m) Difference July-Aug variable is simply the change in total amount of loans deferred from July to August.

To compute this, it was decided to factor the variable ‘Date’ into two distinct groups, “July_20” and “Aug_20”. However, as the resulting variable is a product of only one variable grouped by the two months, the resulting vector will be exactly half the size of the existing data frame. Thus it was decided to use the summarise function and create a new data frame while keeping the bank’s descriptive information.

Below is the code and output for the DeferralsSummary data frame. Do note the filter function being piped before the summary was computed, this ensured the resulting data frame was in the required length for the summarise function to work.

deferralsAugJuly$Date <- factor(deferralsAugJuly$Date, levels = c("July_20", "Aug_20"), labels = c("July_20", "Aug_20"))

DeferralsSummary <- deferralsAugJuly %>% filter(Date=="July_20") %>% summarise(`Institution name`=`Institution name`, 
          ABN=ABN, 
          `Loan Type`=`Loan Type`, 
          'Total($m) Difference July-Aug'= deferralsAugJuly$`Total ($millions)`[deferralsAugJuly$Date=="Aug_20"] - deferralsAugJuly$`Total ($millions)`[deferralsAugJuly$Date=="July_20"],
          '% Difference July-Aug' = deferralsAugJuly$`% of Loans (Within their type)`[deferralsAugJuly$Date=="Aug_20"]-deferralsAugJuly$`% of Loans (Within their type)`[deferralsAugJuly$Date=="July_20"])  

head(DeferralsSummary)

Scan II

Scanning for outliers is a necessary step before any data analysis is conducted as it can bias the results. As for an example, if an analyst wanted to look at the difference between loan deferral data between July and August, they would have to look at the ‘differences of scores’ and check if there are outliers having an extreme effect on the test statistic.

Below is the code to check for outliers using the Z-score method. A new column was created in the first line of code which uses the scores function to calculate the z-scores for each observation. Next, a new data frame, Outliers, was created under the condition of Z-scores being higher than 3, or less than -3.

# This is the R chunk for the Scan II
DeferralsSummary$Z_scores <- scores(DeferralsSummary$`Total($m) Difference July-Aug`)

Outliers <- DeferralsSummary %>% filter(Z_scores < -3 | Z_scores > 3)

head(Outliers)

We can see 3 potential sources of outliers in the data set. It was decided to remove these cases as they would bias the results of subsequent data analyses and there is still a healthy amount of observations remaining (n=119).

These were removed using the inverse of the above filtering function. Specifically, only keeping Z scores that fall between -3 and 3.

DeferralsSummary <- DeferralsSummary %>% filter(Z_scores > -3 & Z_scores < 3)

Transform

For a paired sample t-test, there’s an assumption stating that the differences between scores (July-August) must be normally distributed. A common approach to test this is to plot a histogram of the distances of those scores.

# This is the R chunk for the Transform Section
hist(DeferralsSummary$`Total($m) Difference July-Aug`, main = "Histogram of Total($m) Loan Deferral Differential July-Aug", xlab = "Total($m) Loan Deferrals Differential July-Aug")

From the histogram above, we can see the distribution is slightly skewed, with scores bunched up around the 0 mark with some cases tailing off to more extreme negative numbers on the left. Although the shape of this distribution isn’t extremely skewed, transformations can be used to help meet the assumption of normality.

Below is the code to transform the Total($m) Difference July-Aug variable using the Box-Cox method. The ‘forecast’ package was installed to Rstudio, and the BoxCox function is applied to the variable, resulting in a new column containing the transformed values. A histogram is then produced which demonstrates the effect the transformation had on the distribution.

DeferralsSummary$Total_Boxcox <- BoxCox(DeferralsSummary$`Total($m) Difference July-Aug`, lambda = "auto")

head(DeferralsSummary)

hist(DeferralsSummary$Total_Boxcox, main = "Histogram of Total($m) Loan Deferral Differential July-Aug (Box-Cox)", xlab = "Total($m) Loan Deferrals Differential July-Aug (Box-Cox)")