Required packages

For this project the require packages are: [1]tidyr [2]lubridate [3]dplyr [4]ggplot2 [5]car [6] magrittr [7]kableExtra [8]ggthemes [9]forecast

library(tidyr)
library(lubridate)
library(dplyr)
library(ggplot2)
library(car)
library(magrittr)
library(kableExtra)
library(ggthemes)
library(forecast)
options(scipen=999)

1. Introduction

The year 2020 came with much bad news, including COVID-19. Coronavirus has impacted our life in various ways and also drastically transformed our economy. Due to this, there is some change noted in the Australian dollar exchange rate, property price, and stocks. The following assignment mainly focuses on to study this impact only. The report is to present in a PDF format file with a 25-page limit. Also, our dataset needs to meet the 10 minimum requirement.

2. Problem statment

To study the impact of corona on Australian economy data of various type need to be collected. The motive of this assignment is to prepare an analysis of ready data to have some valuable insight into the impact of corona on monetary things. To see how corona increased or decreased the sales of some stock, how the price of the Australian dollar change, and what is the buying pattern of people before and during corona. All these questions can be answered after the data is cleaned and prepared.

3. Executive Summary

Four different data sets were collected 3 from Kaggle and 1 from macrotrends and in total 105 CSV files were processed. Each data was processed individually using fread() and for loop to read 100 stock CSV files and head and structure was checked. Data relevant to Australia was filtered and type conversions were applied on each dataset. Missing-value was identified using summary statistics and colSums() function. Corona dataset was untidy so created a function to process and clean 3 coronavirus CSV files. To deal with missing values in the property data frame cluster/group based median values were used to replace missing data. Outliers in each dataset were managed on each data set individually to some extent as the corona is expected to created outliers and outliers for this type of problem statement. Only the extreme Outliers in the data frame were dropped as after dropping them more outliers were formed. Finally, all the datasets were combined on the date to form a final data frame and data with date 2018-09-02 and above were only selected as the data need to be processed for analyzing the impact of corona on the Australian economy so data older than that is not relevant. Furthur missing value generated after joining datasets were dealt with either filling data using the front-fill method or simply replacing numerical observation with 0 and character observation with Null. Finally, the distribution of the numerical columns in the final data frame was visualized and relevant data transformation was applied and the head, structure, and summary of the data were shown.

4. Data

For this assignment 4 different data sets were downloaded from different sources: [1] Data about real estate of asutralia [2] Data about changed in Australian doller based on US dollar rate [3] Data about corona that includes: (i) Corona confirmed cases (ii) Corona death cases (iii) Corona recovered cases [4] Stock data inside Australia
These data set meets the requirement 2 (Data set should include multiple data types (numerics, characters, factors, etc)

4.1 Data description

AUS Real Estate Sales September 2018 to May 2020.[1] This data was collected from Kaggle and is about the Australian property market. The data is collected by HtAG, it is a web portal for real estate professionals that assists its customers in making property-related decisions based on timely and actionable market information. The inspiration behind making this data set public is to provide the raw sales data to the community to assist with impact analysis of COVID-19 on the Australian property market.

Australian and US dollar exchange rate historical chart.[2] This data was collected from macrotrends and is about the change in the Australian dollar rate value as compared to the US. The data about the exchange rate back to 1991. The website gives an easy visualization of change in the Australian dollar, using a simple line plot with x-axis as the date and y-axis as the dollar value.

Novel Corona Virus 2019 Dataset.[3] This data was collected from Kaggle and is about the cases of COVID-19 in different countries and their states and it informs: [1] Number of cases [2] How many patients recovered [3] How many deaths were recorded due to COVID-19. On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The data is uploaded from Johns Hopkins Github repository (2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE) by one of the Kaggle Grandmaster (SRK ).

Australian Historical Stock Prices.[4] This data set contains historical share price data from the top 100 companies listed on the Australian Securities Exchange (ASX) and is consist of 100 CSV files each file name denotes the stock name. Each CSV file contains the stock price information for one company including [1] Date [2] Opening price [3] High price [4] Low price [5] Closing price [6] Adjusted closing price [7] Trading volume. The data was published under CC0: Public Domain License.

4.2 Reading the data

Reading dataset 1 - Reading the aus-property-sales-sep2018-april2020.csv using read.csv() function keeping stringsAsFactors as FALSE. * Using the dim() function to see the number of variables and observations. * The following data have 11 variables and 290046 observations. * Using the str() function to print the structure of the data. * The data have following columns: [1] date_sold [2] price [3] suburb [4] city_name [5] state [6] lat [7] lon [8] bedrooms [9] property_type [10] loc_pid [11] lga_pid * Used head() function to print top 2.

propDf <- read.csv("aus-property-sales-sep2018-april2020.csv", stringsAsFactors = FALSE)
print(dim(propDf))

## [1] 290046     11

str(propDf)

## 'data.frame':    290046 obs. of  11 variables:
##  $ date_sold    : chr  "2018-09-18 00:00:00" "2018-09-24 00:00:00" "2018-09-26 00:00:00" "2018-09-26 00:00:00" ...
##  $ price        : chr  "NULL" "NULL" "1730000" "1928000" ...
##  $ suburb       : chr  "Darling Point" "Darling Point" "Darling Point" "Darling Point" ...
##  $ city_name    : chr  "Sydney" "Sydney" "Sydney" "Sydney" ...
##  $ state        : chr  "NSW" "NSW" "NSW" "NSW" ...
##  $ lat          : chr  "-33.869565" "-33.872179" "-33.868386" "-33.875465" ...
##  $ lon          : chr  "151.241317" "151.239726" "151.237471" "151.23628" ...
##  $ bedrooms     : int  3 3 2 3 3 2 3 2 3 3 ...
##  $ property_type: chr  "unit" "unit" "unit" "unit" ...
##  $ loc_pid      : chr  "NSW1221" "NSW1221" "NSW1221" "NSW1221" ...
##  $ lga_pid      : chr  "NSW180" "NSW180" "NSW180" "NSW180" ...

print(head(propDf, 2))

##             date_sold price        suburb city_name state        lat
## 1 2018-09-18 00:00:00  NULL Darling Point    Sydney   NSW -33.869565
## 2 2018-09-24 00:00:00  NULL Darling Point    Sydney   NSW -33.872179
##          lon bedrooms property_type loc_pid lga_pid
## 1 151.241317        3          unit NSW1221  NSW180
## 2 151.239726        3          unit NSW1221  NSW180

Reading dataset 2 - Reading the australian-us-dollar-exchange-rate-historical-chart.csv using read.csv() function keeping stringsAsFactors as FALSE and skipping the top 15 rows. * Using the dim() function to see the number of variables and observations. * The following data have 2 variables and 7597 observations. * Using the str() function to print the structure of the data. * The data have the following columns: [1] date [2] value * Used head() function to print top 2 values.

AudDf <- read.csv("australian-us-dollar-exchange-rate-historical-chart.csv", stringsAsFactors = FALSE, skip = 15)
print(dim(AudDf))

## [1] 7597    2

str(AudDf)

## 'data.frame':    7597 obs. of  2 variables:
##  $ date : chr  "1991-01-02" "1991-01-03" "1991-01-04" "1991-01-07" ...
##  $ value: num  0.775 0.779 0.778 0.778 0.777 0.774 0.775 0.775 0.778 0.783 ...

print(head(AudDf, 2))

##         date value
## 1 1991-01-02 0.775
## 2 1991-01-03 0.779

Reading dataset 3.1 - Reading the time_series_covid_19_confirmed.csv using read.csv() function keeping stringsAsFactors as FALSE. * As the main motive of this assignment is to focus on the impact of corona on the Australian economy so, we will filter out the data that is related to Australia only. * Using the subset() function and subsetting on the Country.Region column of the data by keeping it equal to Australia. * Using row names(CoronaConfDf) <- NULL to reset the index into order as subsetting will use the index that data have by default. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 266 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy that full fill the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates * So, each observation is for the different states of Australia. * Using the str() function to print the structure of the data.

CoronaConfDf <- read.csv("time_series_covid_19_confirmed.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaConfDf)))

## [1] "Dim before subsetting  266" "Dim before subsetting  121"

CoronaConfDf <- subset(CoronaConfDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaConfDf)))

## [1] "Dim After subsetting  8"   "Dim After subsetting  121"

rownames(CoronaConfDf) <- NULL
print(dim(CoronaConfDf))

## [1]   8 121

str(CoronaConfDf)

## 'data.frame':    8 obs. of  121 variables:
##  $ Province.State: chr  "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
##  $ Country.Region: chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0
##  $ X1.26.20      : int  0 3 0 0 0 0 1 0
##  $ X1.27.20      : int  0 4 0 0 0 0 1 0
##  $ X1.28.20      : int  0 4 0 0 0 0 1 0
##  $ X1.29.20      : int  0 4 0 1 0 0 1 0
##  $ X1.30.20      : int  0 4 0 3 0 0 2 0
##  $ X1.31.20      : int  0 4 0 2 0 0 3 0
##  $ X2.1.20       : int  0 4 0 3 1 0 4 0
##  $ X2.2.20       : int  0 4 0 2 2 0 4 0
##  $ X2.3.20       : int  0 4 0 2 2 0 4 0
##  $ X2.4.20       : int  0 4 0 3 2 0 4 0
##  $ X2.5.20       : int  0 4 0 3 2 0 4 0
##  $ X2.6.20       : int  0 4 0 4 2 0 4 0
##  $ X2.7.20       : int  0 4 0 5 2 0 4 0
##  $ X2.8.20       : int  0 4 0 5 2 0 4 0
##  $ X2.9.20       : int  0 4 0 5 2 0 4 0
##  $ X2.10.20      : int  0 4 0 5 2 0 4 0
##  $ X2.11.20      : int  0 4 0 5 2 0 4 0
##  $ X2.12.20      : int  0 4 0 5 2 0 4 0
##  $ X2.13.20      : int  0 4 0 5 2 0 4 0
##  $ X2.14.20      : int  0 4 0 5 2 0 4 0
##  $ X2.15.20      : int  0 4 0 5 2 0 4 0
##  $ X2.16.20      : int  0 4 0 5 2 0 4 0
##  $ X2.17.20      : int  0 4 0 5 2 0 4 0
##  $ X2.18.20      : int  0 4 0 5 2 0 4 0
##  $ X2.19.20      : int  0 4 0 5 2 0 4 0
##  $ X2.20.20      : int  0 4 0 5 2 0 4 0
##  $ X2.21.20      : int  0 4 0 5 2 0 4 0
##  $ X2.22.20      : int  0 4 0 5 2 0 4 0
##  $ X2.23.20      : int  0 4 0 5 2 0 4 0
##  $ X2.24.20      : int  0 4 0 5 2 0 4 0
##  $ X2.25.20      : int  0 4 0 5 2 0 4 0
##  $ X2.26.20      : int  0 4 0 5 2 0 4 0
##  $ X2.27.20      : int  0 4 0 5 2 0 4 0
##  $ X2.28.20      : int  0 4 0 5 2 0 4 0
##  $ X2.29.20      : int  0 4 0 9 3 0 7 2
##  $ X3.1.20       : int  0 6 0 9 3 0 7 2
##  $ X3.2.20       : int  0 6 0 9 3 1 9 2
##  $ X3.3.20       : int  0 13 0 11 3 1 9 2
##  $ X3.4.20       : int  0 22 1 11 5 1 10 2
##  $ X3.5.20       : int  0 22 1 13 5 1 10 3
##  $ X3.6.20       : int  0 26 0 13 7 1 10 3
##  $ X3.7.20       : int  0 28 0 13 7 1 11 3
##  $ X3.8.20       : int  0 38 0 15 7 2 11 3
##  $ X3.9.20       : int  0 48 0 15 7 2 15 4
##  $ X3.10.20      : int  0 55 1 18 7 2 18 6
##  $ X3.11.20      : int  0 65 1 20 9 3 21 9
##  $ X3.12.20      : int  0 65 1 20 9 3 21 9
##  $ X3.13.20      : int  1 92 1 35 16 5 36 14
##  $ X3.14.20      : int  1 112 1 46 19 5 49 17
##  $ X3.15.20      : int  1 134 1 61 20 6 57 17
##  $ X3.16.20      : int  2 171 1 68 29 7 71 28
##  $ X3.17.20      : int  2 210 1 78 29 7 94 31
##  $ X3.18.20      : int  3 267 1 94 37 10 121 35
##  $ X3.19.20      : int  4 307 1 144 42 10 121 52
##  $ X3.20.20      : int  6 353 3 184 50 10 121 64
##  $ X3.21.20      : int  9 436 3 221 67 16 229 90
##  $ X3.22.20      : int  19 669 5 259 100 22 355 120
##  $ X3.23.20      : int  32 669 5 319 134 28 355 140
##  $ X3.24.20      : int  39 818 6 397 170 28 411 175
##  $ X3.25.20      : int  39 1029 6 443 170 36 466 175
##  $ X3.26.20      : int  53 1219 12 493 235 47 520 231
##  $ X3.27.20      : int  62 1405 12 555 257 47 574 231
##  $ X3.28.20      : int  71 1617 15 625 287 62 685 278
##  $ X3.29.20      : int  77 1791 15 656 299 66 769 311
##  $ X3.30.20      : int  78 2032 15 689 305 66 821 355
##  $ X3.31.20      : int  80 2032 17 743 337 69 917 364
##  $ X4.1.20       : int  84 2182 19 781 367 69 968 392
##  $ X4.2.20       : int  87 2298 21 835 367 72 1036 400
##  $ X4.3.20       : int  91 2389 22 873 396 74 1085 400
##  $ X4.4.20       : int  93 2493 26 900 407 80 1115 436
##  $ X4.5.20       : int  96 2580 27 907 407 82 1135 453
##  $ X4.6.20       : int  96 2637 28 921 411 86 1158 460
##  $ X4.7.20       : int  96 2686 28 934 411 89 1191 460
##  $ X4.8.20       : int  99 2734 28 943 415 98 1212 481
##  $ X4.9.20       : int  100 2773 28 953 420 111 1228 495
##  $ X4.10.20      : int  103 2822 28 965 428 122 1241 506
##  $ X4.11.20      : int  103 2857 28 974 429 133 1265 514
##  $ X4.12.20      : int  103 2857 28 983 429 133 1268 514
##  $ X4.13.20      : int  102 2863 28 987 429 144 1281 517
##  $ X4.14.20      : int  103 2870 28 998 433 165 1291 527
##  $ X4.15.20      : int  103 2886 28 999 433 165 1299 527
##  $ X4.16.20      : int  103 2897 28 1001 433 169 1299 532
##  $ X4.17.20      : int  103 2926 28 1007 435 180 1302 541
##  $ X4.18.20      : int  103 2936 28 1015 435 188 1319 544
##  $ X4.19.20      : int  103 2957 28 1019 435 195 1328 545
##  $ X4.20.20      : int  104 2963 28 1019 435 200 1329 545
##  $ X4.21.20      : int  104 2969 28 1024 437 201 1336 546
##  $ X4.22.20      : int  104 2971 28 1024 438 205 1336 546
##  $ X4.23.20      : int  104 2976 28 1026 438 207 1337 546
##  $ X4.24.20      : int  105 2982 28 1026 438 207 1343 548
##  $ X4.25.20      : int  106 2994 28 1026 438 207 1346 549
##   [list output truncated]

print(head(CoronaConfDf, 2))

##                 Province.State Country.Region      Lat     Long X1.22.20
## 1 Australian Capital Territory      Australia -35.4735 149.0124        0
## 2              New South Wales      Australia -33.8688 151.2093        0
##   X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        3        4        4        4        4
##   X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1        0       0       0       0       0       0       0       0       0
## 2        4       4       4       4       4       4       4       4       4
##   X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1       0        0        0        0        0        0        0        0
## 2       4        4        4        4        4        4        4        4
##   X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1        0        0        0        0        0        0        0        0
## 2        4        4        4        4        4        4        4        4
##   X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1        0        0        0        0        0       0       0       0
## 2        4        4        4        4        4       6       6      13
##   X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1       0       0       0       0       0       0        0        0
## 2      22      22      26      28      38      48       55       65
##   X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1        0        1        1        1        2        2        3        4
## 2       65       92      112      134      171      210      267      307
##   X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1        6        9       19       32       39       39       53       62
## 2      353      436      669      669      818     1029     1219     1405
##   X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1       71       77       78       80      84      87      91      93
## 2     1617     1791     2032     2032    2182    2298    2389    2493
##   X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1      96      96      96      99     100      103      103      103
## 2    2580    2637    2686    2734    2773     2822     2857     2857
##   X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1      102      103      103      103      103      103      103      104
## 2     2863     2870     2886     2897     2926     2936     2957     2963
##   X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1      104      104      104      105      106      106      106      106
## 2     2969     2971     2976     2982     2994     3002     3004     3016
##   X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1      106      106     106     106     106     107     107     107
## 2     3016     3025    3030    3035    3033    3035    3042    3044
##   X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1     107     107     107      107      107      107      107      107
## 2    3047    3051    3053     3053     3053     3059     3063     3071
##   X5.15.20 X5.16.20 X5.17.20
## 1      107      107      107
## 2     3074     3075     3076

Reading dataset 3.2 - Reading the time_series_covid_19_deaths.csv using read.csv() function keeping stringsAsFactors as FALSE. * Similar to the section 3.1 subsetting and resetting of the index was done. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 266 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy that full fill the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates * So, each observation is for the different states of Australia. * Using the str() function to print the structure of the data.

CoronaDeathDf <- read.csv("time_series_covid_19_deaths.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaDeathDf)))

## [1] "Dim before subsetting  266" "Dim before subsetting  121"

CoronaDeathDf <- subset(CoronaDeathDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaDeathDf)))

## [1] "Dim After subsetting  8"   "Dim After subsetting  121"

rownames(CoronaDeathDf) <- NULL
print(dim(CoronaDeathDf))

## [1]   8 121

str(CoronaDeathDf)

## 'data.frame':    8 obs. of  121 variables:
##  $ Province.State: chr  "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
##  $ Country.Region: chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0
##  $ X1.26.20      : int  0 0 0 0 0 0 0 0
##  $ X1.27.20      : int  0 0 0 0 0 0 0 0
##  $ X1.28.20      : int  0 0 0 0 0 0 0 0
##  $ X1.29.20      : int  0 0 0 0 0 0 0 0
##  $ X1.30.20      : int  0 0 0 0 0 0 0 0
##  $ X1.31.20      : int  0 0 0 0 0 0 0 0
##  $ X2.1.20       : int  0 0 0 0 0 0 0 0
##  $ X2.2.20       : int  0 0 0 0 0 0 0 0
##  $ X2.3.20       : int  0 0 0 0 0 0 0 0
##  $ X2.4.20       : int  0 0 0 0 0 0 0 0
##  $ X2.5.20       : int  0 0 0 0 0 0 0 0
##  $ X2.6.20       : int  0 0 0 0 0 0 0 0
##  $ X2.7.20       : int  0 0 0 0 0 0 0 0
##  $ X2.8.20       : int  0 0 0 0 0 0 0 0
##  $ X2.9.20       : int  0 0 0 0 0 0 0 0
##  $ X2.10.20      : int  0 0 0 0 0 0 0 0
##  $ X2.11.20      : int  0 0 0 0 0 0 0 0
##  $ X2.12.20      : int  0 0 0 0 0 0 0 0
##  $ X2.13.20      : int  0 0 0 0 0 0 0 0
##  $ X2.14.20      : int  0 0 0 0 0 0 0 0
##  $ X2.15.20      : int  0 0 0 0 0 0 0 0
##  $ X2.16.20      : int  0 0 0 0 0 0 0 0
##  $ X2.17.20      : int  0 0 0 0 0 0 0 0
##  $ X2.18.20      : int  0 0 0 0 0 0 0 0
##  $ X2.19.20      : int  0 0 0 0 0 0 0 0
##  $ X2.20.20      : int  0 0 0 0 0 0 0 0
##  $ X2.21.20      : int  0 0 0 0 0 0 0 0
##  $ X2.22.20      : int  0 0 0 0 0 0 0 0
##  $ X2.23.20      : int  0 0 0 0 0 0 0 0
##  $ X2.24.20      : int  0 0 0 0 0 0 0 0
##  $ X2.25.20      : int  0 0 0 0 0 0 0 0
##  $ X2.26.20      : int  0 0 0 0 0 0 0 0
##  $ X2.27.20      : int  0 0 0 0 0 0 0 0
##  $ X2.28.20      : int  0 0 0 0 0 0 0 0
##  $ X2.29.20      : int  0 0 0 0 0 0 0 0
##  $ X3.1.20       : int  0 0 0 0 0 0 0 1
##  $ X3.2.20       : int  0 0 0 0 0 0 0 1
##  $ X3.3.20       : int  0 0 0 0 0 0 0 1
##  $ X3.4.20       : int  0 1 0 0 0 0 0 1
##  $ X3.5.20       : int  0 1 0 0 0 0 0 1
##  $ X3.6.20       : int  0 1 0 0 0 0 0 1
##  $ X3.7.20       : int  0 1 0 0 0 0 0 1
##  $ X3.8.20       : int  0 2 0 0 0 0 0 1
##  $ X3.9.20       : int  0 2 0 0 0 0 0 1
##  $ X3.10.20      : int  0 2 0 0 0 0 0 1
##  $ X3.11.20      : int  0 2 0 0 0 0 0 1
##  $ X3.12.20      : int  0 2 0 0 0 0 0 1
##  $ X3.13.20      : int  0 2 0 0 0 0 0 1
##  $ X3.14.20      : int  0 2 0 0 0 0 0 1
##  $ X3.15.20      : int  0 2 0 0 0 0 0 1
##  $ X3.16.20      : int  0 2 0 0 0 0 0 1
##  $ X3.17.20      : int  0 4 0 0 0 0 0 1
##  $ X3.18.20      : int  0 5 0 0 0 0 0 1
##  $ X3.19.20      : int  0 5 0 0 0 0 0 1
##  $ X3.20.20      : int  0 6 0 0 0 0 0 1
##  $ X3.21.20      : int  0 6 0 0 0 0 0 1
##  $ X3.22.20      : int  0 6 0 0 0 0 0 1
##  $ X3.23.20      : int  0 6 0 0 0 0 0 1
##  $ X3.24.20      : int  0 7 0 0 0 0 0 1
##  $ X3.25.20      : int  0 7 0 0 0 0 0 1
##  $ X3.26.20      : int  0 7 0 1 0 0 3 2
##  $ X3.27.20      : int  0 7 0 1 0 0 3 2
##  $ X3.28.20      : int  0 8 0 1 0 0 3 2
##  $ X3.29.20      : int  0 8 0 2 0 0 4 2
##  $ X3.30.20      : int  1 8 0 2 0 0 4 2
##  $ X3.31.20      : int  1 8 0 2 0 1 4 2
##  $ X4.1.20       : int  1 9 0 2 0 2 4 2
##  $ X4.2.20       : int  1 10 0 4 0 2 5 2
##  $ X4.3.20       : int  1 12 0 4 0 2 7 2
##  $ X4.4.20       : int  2 12 0 4 0 2 8 2
##  $ X4.5.20       : int  2 16 0 4 0 2 8 3
##  $ X4.6.20       : int  2 18 0 4 0 2 10 4
##  $ X4.7.20       : int  2 21 0 4 1 2 11 4
##  $ X4.8.20       : int  2 21 0 4 2 3 12 6
##  $ X4.9.20       : int  2 21 0 4 3 3 12 6
##  $ X4.10.20      : int  2 22 0 4 3 4 13 6
##  $ X4.11.20      : int  2 23 0 5 3 4 14 6
##  $ X4.12.20      : int  2 24 0 5 4 5 14 6
##  $ X4.13.20      : int  2 25 0 5 4 5 14 6
##  $ X4.14.20      : int  2 25 0 5 4 6 14 6
##  $ X4.15.20      : int  3 25 0 5 4 6 14 6
##  $ X4.16.20      : int  3 25 0 5 4 6 14 6
##  $ X4.17.20      : int  3 26 0 5 4 7 14 7
##  $ X4.18.20      : int  3 26 0 6 4 7 14 7
##  $ X4.19.20      : int  3 26 0 6 4 7 14 7
##  $ X4.20.20      : int  3 26 0 6 4 7 14 7
##  $ X4.21.20      : int  3 26 0 6 4 7 14 7
##  $ X4.22.20      : int  3 26 0 6 4 7 14 7
##  $ X4.23.20      : int  3 31 0 6 4 8 16 7
##  $ X4.24.20      : int  3 33 0 6 4 9 16 8
##  $ X4.25.20      : int  3 33 0 6 4 10 16 8
##   [list output truncated]

print(head(CoronaDeathDf, 2))

##                 Province.State Country.Region      Lat     Long X1.22.20
## 1 Australian Capital Territory      Australia -35.4735 149.0124        0
## 2              New South Wales      Australia -33.8688 151.2093        0
##   X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        0        0        0        0        0
##   X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1        0       0       0       0       0       0       0       0       0
## 2        0       0       0       0       0       0       0       0       0
##   X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1       0        0        0        0        0        0        0        0
## 2       0        0        0        0        0        0        0        0
##   X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        0        0        0        0        0
##   X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1        0        0        0        0        0       0       0       0
## 2        0        0        0        0        0       0       0       0
##   X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1       0       0       0       0       0       0        0        0
## 2       1       1       1       1       2       2        2        2
##   X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1        0        0        0        0        0        0        0        0
## 2        2        2        2        2        2        4        5        5
##   X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1        0        0        0        0        0        0        0        0
## 2        6        6        6        6        7        7        7        7
##   X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1        0        0        1        1       1       1       1       2
## 2        8        8        8        8       9      10      12      12
##   X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1       2       2       2       2       2        2        2        2
## 2      16      18      21      21      21       22       23       24
##   X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1        2        2        3        3        3        3        3        3
## 2       25       25       25       25       26       26       26       26
##   X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1        3        3        3        3        3        3        3        3
## 2       26       26       31       33       33       34       34       39
##   X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1        3        3       3       3       3       3       3       3
## 2       40       41      41      42      42      43      44      44
##   X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1       3       3       3        3        3        3        3        3
## 2      44      44      44       44       44       45       45       45
##   X5.15.20 X5.16.20 X5.17.20
## 1        3        3        3
## 2       45       45       46

Reading dataset 3.3 - Reading the time_series_covid_19_recovered.csv using read.csv() function keeping stringsAsFactors as FALSE. * Similar to the section 3.1 subsetting and resetting of the index was done. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 253 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy * this meets the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as: [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates So, each observation is for different state of Australia. * Using the str() function to print the structure of the data.

CoronaRecDf <- read.csv("time_series_covid_19_recovered.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaRecDf)))

## [1] "Dim before subsetting  253" "Dim before subsetting  121"

CoronaRecDf <- subset(CoronaRecDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaRecDf)))

## [1] "Dim After subsetting  8"   "Dim After subsetting  121"

rownames(CoronaRecDf) <- NULL
print(dim(CoronaRecDf))

## [1]   8 121

str(CoronaRecDf)

## 'data.frame':    8 obs. of  121 variables:
##  $ Province.State: chr  "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
##  $ Country.Region: chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ X1.22.20      : int  0 0 0 0 0 0 0 0
##  $ X1.23.20      : int  0 0 0 0 0 0 0 0
##  $ X1.24.20      : int  0 0 0 0 0 0 0 0
##  $ X1.25.20      : int  0 0 0 0 0 0 0 0
##  $ X1.26.20      : int  0 0 0 0 0 0 0 0
##  $ X1.27.20      : int  0 0 0 0 0 0 0 0
##  $ X1.28.20      : int  0 0 0 0 0 0 0 0
##  $ X1.29.20      : int  0 0 0 0 0 0 0 0
##  $ X1.30.20      : int  0 2 0 0 0 0 0 0
##  $ X1.31.20      : int  0 2 0 0 0 0 0 0
##  $ X2.1.20       : int  0 2 0 0 0 0 0 0
##  $ X2.2.20       : int  0 2 0 0 0 0 0 0
##  $ X2.3.20       : int  0 2 0 0 0 0 0 0
##  $ X2.4.20       : int  0 2 0 0 0 0 0 0
##  $ X2.5.20       : int  0 2 0 0 0 0 0 0
##  $ X2.6.20       : int  0 2 0 0 0 0 0 0
##  $ X2.7.20       : int  0 2 0 0 0 0 0 0
##  $ X2.8.20       : int  0 2 0 0 0 0 0 0
##  $ X2.9.20       : int  0 2 0 0 0 0 0 0
##  $ X2.10.20      : int  0 2 0 0 0 0 0 0
##  $ X2.11.20      : int  0 2 0 0 0 0 0 0
##  $ X2.12.20      : int  0 2 0 0 0 0 0 0
##  $ X2.13.20      : int  0 4 0 0 0 0 4 0
##  $ X2.14.20      : int  0 4 0 0 0 0 4 0
##  $ X2.15.20      : int  0 4 0 0 0 0 4 0
##  $ X2.16.20      : int  0 4 0 0 0 0 4 0
##  $ X2.17.20      : int  0 4 0 0 2 0 4 0
##  $ X2.18.20      : int  0 4 0 0 2 0 4 0
##  $ X2.19.20      : int  0 4 0 0 2 0 4 0
##  $ X2.20.20      : int  0 4 0 0 2 0 4 0
##  $ X2.21.20      : int  0 4 0 1 2 0 4 0
##  $ X2.22.20      : int  0 4 0 1 2 0 4 0
##  $ X2.23.20      : int  0 4 0 1 2 0 4 0
##  $ X2.24.20      : int  0 4 0 1 2 0 4 0
##  $ X2.25.20      : int  0 4 0 1 2 0 4 0
##  $ X2.26.20      : int  0 4 0 1 2 0 4 0
##  $ X2.27.20      : int  0 4 0 1 2 0 4 0
##  $ X2.28.20      : int  0 4 0 1 2 0 4 0
##  $ X2.29.20      : int  0 4 0 1 2 0 4 0
##  $ X3.1.20       : int  0 4 0 1 2 0 4 0
##  $ X3.2.20       : int  0 4 0 1 2 0 4 0
##  $ X3.3.20       : int  0 4 0 1 2 0 4 0
##  $ X3.4.20       : int  0 4 0 1 2 0 4 0
##  $ X3.5.20       : int  0 4 0 8 2 0 7 0
##  $ X3.6.20       : int  0 4 0 8 2 0 7 0
##  $ X3.7.20       : int  0 4 0 8 2 0 7 0
##  $ X3.8.20       : int  0 4 0 8 2 0 7 0
##  $ X3.9.20       : int  0 4 0 8 2 0 7 0
##  $ X3.10.20      : int  0 4 0 8 2 0 7 0
##  $ X3.11.20      : int  0 4 0 8 2 0 7 0
##  $ X3.12.20      : int  0 4 0 8 2 0 7 0
##  $ X3.13.20      : int  0 4 0 8 3 0 8 0
##  $ X3.14.20      : int  0 4 0 8 3 0 8 0
##  $ X3.15.20      : int  0 4 0 8 3 0 8 0
##  $ X3.16.20      : int  0 4 0 8 3 0 8 0
##  $ X3.17.20      : int  0 4 0 8 3 0 8 0
##  $ X3.18.20      : int  0 4 0 8 3 0 8 0
##  $ X3.19.20      : int  0 4 0 8 3 3 8 0
##  $ X3.20.20      : int  0 4 0 8 3 3 8 0
##  $ X3.21.20      : int  0 4 0 8 3 3 8 0
##  $ X3.22.20      : int  0 4 0 8 3 3 70 0
##  $ X3.23.20      : int  0 4 0 8 3 3 70 0
##  $ X3.24.20      : int  0 4 0 8 3 3 97 0
##  $ X3.25.20      : int  0 4 0 8 6 3 97 1
##  $ X3.26.20      : int  1 4 0 8 6 3 149 1
##  $ X3.27.20      : int  1 4 0 8 6 3 171 1
##  $ X3.28.20      : int  2 4 0 8 6 5 191 28
##  $ X3.29.20      : int  2 4 0 8 6 5 191 28
##  $ X3.30.20      : int  2 4 0 8 6 5 191 41
##  $ X3.31.20      : int  3 4 0 8 6 5 291 41
##  $ X4.1.20       : int  8 4 0 8 6 5 343 48
##  $ X4.2.20       : int  11 4 0 8 6 5 422 64
##  $ X4.3.20       : int  18 4 0 8 46 5 476 92
##  $ X4.4.20       : int  18 4 1 8 46 5 527 92
##  $ X4.5.20       : int  28 4 1 8 46 5 573 92
##  $ X4.6.20       : int  40 4 2 173 53 26 620 162
##  $ X4.7.20       : int  40 4 2 173 53 26 620 162
##  $ X4.8.20       : int  40 4 2 173 53 26 620 162
##  $ X4.9.20       : int  47 4 2 345 120 48 736 170
##  $ X4.10.20      : int  59 4 2 372 179 48 926 203
##  $ X4.11.20      : int  59 4 2 372 179 48 926 216
##  $ X4.12.20      : int  59 4 2 372 179 48 926 216
##  $ X4.13.20      : int  59 4 2 372 179 48 926 216
##  $ X4.14.20      : int  72 4 6 442 240 53 1118 251
##  $ X4.15.20      : int  72 4 6 442 240 53 1118 251
##  $ X4.16.20      : int  82 4 6 442 279 67 1137 338
##  $ X4.17.20      : int  81 1379 9 442 331 67 1159 340
##  $ X4.18.20      : int  88 1379 9 738 331 67 1172 340
##  $ X4.19.20      : int  88 1379 9 738 331 67 1172 340
##  $ X4.20.20      : int  91 1379 12 738 369 71 1196 435
##  $ X4.21.20      : int  92 1755 15 738 376 74 1202 443
##  $ X4.22.20      : int  93 1926 16 738 388 77 1243 451
##  $ X4.23.20      : int  93 2006 16 738 394 91 1251 458
##  $ X4.24.20      : int  97 2056 20 738 401 106 1254 464
##  $ X4.25.20      : int  98 2193 23 803 402 117 1262 478
##   [list output truncated]

print(head(CoronaRecDf, 2))

##                 Province.State Country.Region      Lat     Long X1.22.20
## 1 Australian Capital Territory      Australia -35.4735 149.0124        0
## 2              New South Wales      Australia -33.8688 151.2093        0
##   X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        0        0        0        0        2
##   X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1        0       0       0       0       0       0       0       0       0
## 2        2       2       2       2       2       2       2       2       2
##   X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1       0        0        0        0        0        0        0        0
## 2       2        2        2        2        4        4        4        4
##   X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1        0        0        0        0        0        0        0        0
## 2        4        4        4        4        4        4        4        4
##   X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1        0        0        0        0        0       0       0       0
## 2        4        4        4        4        4       4       4       4
##   X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1       0       0       0       0       0       0        0        0
## 2       4       4       4       4       4       4        4        4
##   X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1        0        0        0        0        0        0        0        0
## 2        4        4        4        4        4        4        4        4
##   X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1        0        0        0        0        0        0        1        1
## 2        4        4        4        4        4        4        4        4
##   X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1        2        2        2        3       8      11      18      18
## 2        4        4        4        4       4       4       4       4
##   X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1      28      40      40      40      47       59       59       59
## 2       4       4       4       4       4        4        4        4
##   X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1       59       72       72       82       81       88       88       91
## 2        4        4        4        4     1379     1379     1379     1379
##   X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1       92       93       93       97       98       99      100      101
## 2     1755     1926     2006     2056     2193     2208     2227     2266
##   X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1      101      103     103     103     103     103     103     103
## 2     2284     2293    2293    2306    2328    2349    2418    2463
##   X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1     103     103     104      104      104      104      104      104
## 2    2486    2504    2504     2517     2543     2571     2595     2605
##   X5.15.20 X5.16.20 X5.17.20
## 1      104      104      104
## 2     2611     2611     2612

Reading dataset 4 - This data set is read differently than the others because it is not a single file but rather 100 CSV files. * So, a for loop is designed to read all those CSV files inside a directory. * list.files() function from base library is used to count and list all files inside the australian-historical-stock-prices directory. * A for loop is used to iterate on the list of files and each file is read using read.csv() function keeping stringsAsFactors as FALSE. * Used cbind() function to add a new column that has the name of the stock. * rbind() function is used to connect each new stock row-wise, hence Reading a stock data frame with all the stocks into a single data frame. * First and last variable names are changed to date and stockName for easy processing and more understandable name of the variable. * Using the dim() function to see the number of variables and observations. * The following data have 8 variables and 432888 observations. * Used head() function to print top 2 values. * Using the str() function to print the structure of the data.

stock <- data.frame()
for( i in c(1:length(list.files("./australian-historical-stock-prices")))){
  #print(list.files("./australian-historical-stock-prices")[i])
  x <- read.csv(paste("./australian-historical-stock-prices/",list.files("./australian-historical-stock-prices")[i], sep = ""), stringsAsFactors = FALSE)
  x <- cbind(x, gsub(".csv", "", list.files("./australian-historical-stock-prices")[i]))
  stock <- rbind(stock,x)
}
names(stock)[c(1,ncol(stock))] <- c("date","stockName")
print(dim(stock))

## [1] 432888      8

str(stock)

## 'data.frame':    432888 obs. of  8 variables:
##  $ date     : chr  "2015-03-31" "2015-04-01" "2015-04-02" "2015-04-07" ...
##  $ Open     : num  0.555 0.575 0.56 0.545 0.545 0.54 0.53 0.535 0.54 0.54 ...
##  $ High     : num  0.595 0.58 0.565 0.55 0.545 0.54 0.535 0.54 0.54 0.55 ...
##  $ Low      : num  0.53 0.555 0.535 0.54 0.53 0.532 0.53 0.53 0.535 0.54 ...
##  $ Close    : num  0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
##  $ Adj.Close: num  0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
##  $ Volume   : num  4816294 4376660 2779640 392179 668446 ...
##  $ stockName: Factor w/ 100 levels "A2M","AGL","ALL",..: 1 1 1 1 1 1 1 1 1 1 ...

print(head(stock, 2))

##         date  Open  High   Low Close Adj.Close  Volume stockName
## 1 2015-03-31 0.555 0.595 0.530 0.565     0.565 4816294       A2M
## 2 2015-04-01 0.575 0.580 0.555 0.565     0.565 4376660       A2M

5. Understand (Type conversion)

This meets the requirements 3 of the data i.e, Data set should include variables suitable for data type conversions so that you should be able to apply the required data type conversions (e.g., character -> factor, character -> date, numeric -> factor, etc. conversions).

Type conversion for dataset 1 - As initially the data was read keeping stringsAsFactors as FALSE so if the variable has any missing value so that will make the variable as chr. * So, we need to perform type conversion on each variable: * date column was converted to date time using as.Date() function. * price column was converted to numeric using as.numeric() function. * suburb column was converted to factor using as.factor() function. * city_name column was converted to factor using as.factor() function. * state column was converted to factor using as.factor() function. * lat column was converted to numeric using as.numeric() function. * lon column was converted to numeric using as.numeric() function. * bedrooms column was converted to ordered factor using as.factor() function with level as 0 < 1 < 2 < 3 < 4 < 5, this meets the requirement 4 (Data set should include at least one factor variable that needs to be labelled and/or ordered) * property_type column was converted to ordered factor using as.factor() function with level as unit < house < townhouse , this meets the requirement 4 (Data set should include at least one factor variable that needs to be labelled and/or ordered) * loc_pid column was converted to factor using as.factor() function. * lga_pid column was converted to factor using as.factor() function.

names(propDf)[1] <- "date"
propDf$date <- as.Date(propDf$date)
propDf$price <- as.numeric(propDf$price)
propDf$suburb <- as.factor(propDf$suburb)
propDf$city_name <- as.factor(propDf$city_name)
propDf$state <- as.factor(propDf$state)
propDf$lat <- as.numeric(propDf$lat)
propDf$lon <- as.numeric(propDf$lon)
propDf$bedrooms <- factor(propDf$bedrooms, levels = c(0,1,2,3,4,5), ordered = TRUE)
propDf$property_type <- factor(propDf$property_type, levels = c("unit","house","townhouse"), ordered = TRUE)
propDf$loc_pid <- as.factor(propDf$loc_pid)
propDf$lga_pid <- as.factor(propDf$lga_pid)
str(propDf)

## 'data.frame':    290046 obs. of  11 variables:
##  $ date         : Date, format: "2018-09-18" "2018-09-24" ...
##  $ price        : num  NA NA 1730000 1928000 1475000 ...
##  $ suburb       : Factor w/ 2017 levels "Abbotsbury","Abbotsford",..: 489 489 489 489 489 489 489 489 489 489 ...
##  $ city_name    : Factor w/ 6 levels "Adelaide","Brisbane",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ state        : Factor w/ 6 levels "ACT","NSW","QLD",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ lat          : num  -33.9 -33.9 -33.9 -33.9 -33.9 ...
##  $ lon          : num  151 151 151 151 151 ...
##  $ bedrooms     : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 4 4 3 4 4 3 4 3 4 4 ...
##  $ property_type: Ord.factor w/ 3 levels "unit"<"house"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ loc_pid      : Factor w/ 2109 levels "ACT101","ACT102",..: 132 132 132 132 132 132 132 132 132 132 ...
##  $ lga_pid      : Factor w/ 115 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...

summary(propDf)

##       date                price                   suburb      
##  Min.   :2018-09-02   Min.   :     1000   Pakenham   :  1654  
##  1st Qu.:2019-02-19   1st Qu.:   485000   Craigieburn:  1372  
##  Median :2019-07-19   Median :   650000   Point Cook :  1322  
##  Mean   :2019-07-09   Mean   :   805693   Melbourne  :  1314  
##  3rd Qu.:2019-11-27   3rd Qu.:   912000   Frankston  :  1212  
##  Max.   :2020-05-13   Max.   :100000000   Brighton   :  1102  
##                       NA's   :57867       (Other)    :282070  
##      city_name      state             lat              lon       
##  Adelaide : 30019   ACT: 10094   Min.   :-38.48   Min.   :115.6  
##  Brisbane : 29509   NSW: 75762   1st Qu.:-37.79   1st Qu.:144.7  
##  Canberra : 10094   QLD: 29509   Median :-34.76   Median :145.2  
##  Melbourne:108444   SA : 30019   Mean   :-34.63   Mean   :143.3  
##  Perth    : 36218   VIC:108444   3rd Qu.:-33.72   3rd Qu.:151.1  
##  Sydney   : 75762   WA : 36218   Max.   : 31.92   Max.   :153.2  
##                                  NA's   :20       NA's   :20     
##  bedrooms     property_type       loc_pid          lga_pid      
##  0:   193   unit     : 68993   VIC2023:  1654   ACT    : 10094  
##  1: 15192   house    :198921   VIC660 :  1372   BRIS2  :  7913  
##  2: 59883   townhouse: 22132   VIC2082:  1322   BRIS3  :  7204  
##  3:112513                      VIC1634:  1314   VIC127 :  7064  
##  4: 82437                      VIC939 :  1212   NSW329 :  6004  
##  5: 19828                      VIC2453:  1099   VIC125 :  5883  
##                                (Other):282073   (Other):245884

Type conversion for dataset 2 - As initially the data was read keeping stringsAsFactors as FALSE so if the variable has any missing value so that will make the variable as chr. * So, we need to perform type conversion on each variable: * date column was converted to date time using as.Date() function. * value column was converted to numeric using as.numeric() function.

AudDf$date <- as.Date(AudDf$date)
AudDf$value <- as.numeric(AudDf$value)
str(AudDf)

## 'data.frame':    7597 obs. of  2 variables:
##  $ date : Date, format: "1991-01-02" "1991-01-03" ...
##  $ value: num  0.775 0.779 0.778 0.778 0.777 0.774 0.775 0.775 0.778 0.783 ...

summary(AudDf)

##       date                value       
##  Min.   :1991-01-02   Min.   :0.4787  
##  1st Qu.:1998-07-08   1st Qu.:0.6821  
##  Median :2006-01-19   Median :0.7484  
##  Mean   :2006-01-02   Mean   :0.7584  
##  3rd Qu.:2013-08-02   3rd Qu.:0.7960  
##  Max.   :2020-05-16   Max.   :1.1020

6.1 Tidy & Manipulate Data I

The third data (Novel Corona Virus 2019 Data set) is untidy because there is a different observation for each date.
The date column name is a date variable in itself and its observations are different variables.
So, to meet the three principal of tidy data, i.e: [1] Each variable must have its column [2] Each observation must have its row [3] Each value must have its cell. we need to convert the variable header of dates to observations using gather() function on data set.
The date column name has ‘X’ symbol in from so using gsub() to substitute the symbol and removing it with a blank.
Also, replacing 20 with the year 2020 for ease of date-time conversion.
Converting Country.Region and Province.State to factor type.
As we have 3 data sets so instead of making a repetitive code a function called CoronaDataFrameCleaner was created that takes input as data frame object (CoronaDataFrameObject) and Name of the newly created column having cases number (CaseColumnName).

CoronaDataFrameCleaner <- function(CoronaDataFrameObject, CaseColumnName){
  CoronaDataFrameObject <- CoronaDataFrameObject %>% gather(5:dim(CoronaDataFrameObject)[2],
                                                            key = "date", value = "value" )
  names(CoronaDataFrameObject)[ncol(CoronaDataFrameObject)] <- CaseColumnName
  CoronaDataFrameObject$date <- gsub("X", "", CoronaDataFrameObject$date)
  CoronaDataFrameObject$date <- gsub(".20$", ".2020", CoronaDataFrameObject$date)
  CoronaDataFrameObject$date <- as.Date(CoronaDataFrameObject$date, "%m.%d.%Y")
  CoronaDataFrameObject$Country.Region <- as.factor(CoronaDataFrameObject$Country.Region)
  CoronaDataFrameObject$Province.State <- as.factor(CoronaDataFrameObject$Province.State)
  return(CoronaDataFrameObject)
  }

Calling CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaConfDf) and column name as ConfirmedCases.
Printing the structure of newly created data frame using str() function and summary statistics using summary() function.
also showing the head of the data.

CoronaConfDf <- CoronaDataFrameCleaner(CoronaConfDf, 'ConfirmedCases')
str(CoronaConfDf)

## 'data.frame':    936 obs. of  6 variables:
##  $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ date          : Date, format: "2020-01-22" "2020-01-22" ...
##  $ ConfirmedCases: int  0 0 0 0 0 0 0 0 0 0 ...

summary(CoronaConfDf)

##                       Province.State   Country.Region      Lat        
##  Australian Capital Territory:117    Australia:936    Min.   :-41.45  
##  New South Wales             :117                     1st Qu.:-36.06  
##  Northern Territory          :117                     Median :-34.40  
##  Queensland                  :117                     Mean   :-32.00  
##  South Australia             :117                     3rd Qu.:-30.97  
##  Tasmania                    :117                     Max.   :-12.46  
##  (Other)                     :234                                     
##       Long            date            ConfirmedCases  
##  Min.   :115.9   Min.   :2020-01-22   Min.   :   0.0  
##  1st Qu.:136.7   1st Qu.:2020-02-20   1st Qu.:   2.0  
##  Median :145.5   Median :2020-03-20   Median :  28.0  
##  Mean   :141.2   Mean   :2020-03-20   Mean   : 365.6  
##  3rd Qu.:149.6   3rd Qu.:2020-04-18   3rd Qu.: 437.2  
##  Max.   :153.4   Max.   :2020-05-17   Max.   :3076.0  
##

head(CoronaConfDf, 2)

##                 Province.State Country.Region      Lat     Long       date
## 1 Australian Capital Territory      Australia -35.4735 149.0124 2020-01-22
## 2              New South Wales      Australia -33.8688 151.2093 2020-01-22
##   ConfirmedCases
## 1              0
## 2              0

Calling CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaDeathDf) and column name as DeathCases
Printing the structure of newly created data frame using str() function and summary statistics using summary() function.
also showing the head of the data.

CoronaDeathDf <- CoronaDataFrameCleaner(CoronaDeathDf, 'DeathCases')
str(CoronaDeathDf)

## 'data.frame':    936 obs. of  6 variables:
##  $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ date          : Date, format: "2020-01-22" "2020-01-22" ...
##  $ DeathCases    : int  0 0 0 0 0 0 0 0 0 0 ...

summary(CoronaDeathDf)

##                       Province.State   Country.Region      Lat        
##  Australian Capital Territory:117    Australia:936    Min.   :-41.45  
##  New South Wales             :117                     1st Qu.:-36.06  
##  Northern Territory          :117                     Median :-34.40  
##  Queensland                  :117                     Mean   :-32.00  
##  South Australia             :117                     3rd Qu.:-30.97  
##  Tasmania                    :117                     Max.   :-12.46  
##  (Other)                     :234                                     
##       Long            date              DeathCases    
##  Min.   :115.9   Min.   :2020-01-22   Min.   : 0.000  
##  1st Qu.:136.7   1st Qu.:2020-02-20   1st Qu.: 0.000  
##  Median :145.5   Median :2020-03-20   Median : 0.000  
##  Mean   :141.2   Mean   :2020-03-20   Mean   : 3.903  
##  3rd Qu.:149.6   3rd Qu.:2020-04-18   3rd Qu.: 4.000  
##  Max.   :153.4   Max.   :2020-05-17   Max.   :46.000  
##

head(CoronaDeathDf, 2)

##                 Province.State Country.Region      Lat     Long       date
## 1 Australian Capital Territory      Australia -35.4735 149.0124 2020-01-22
## 2              New South Wales      Australia -33.8688 151.2093 2020-01-22
##   DeathCases
## 1          0
## 2          0

Calling CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaRecDf) and column name as RecoveredCases
Printing the structure of newly created data frame using str() function and summary statistics using summary() function.
also showing the head of the data.

CoronaRecDf <- CoronaDataFrameCleaner(CoronaRecDf, 'RecoveredCases')
str(CoronaRecDf)

## 'data.frame':    936 obs. of  6 variables:
##  $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ date          : Date, format: "2020-01-22" "2020-01-22" ...
##  $ RecoveredCases: int  0 0 0 0 0 0 0 0 0 0 ...

summary(CoronaRecDf)

##                       Province.State   Country.Region      Lat        
##  Australian Capital Territory:117    Australia:936    Min.   :-41.45  
##  New South Wales             :117                     1st Qu.:-36.06  
##  Northern Territory          :117                     Median :-34.40  
##  Queensland                  :117                     Mean   :-32.00  
##  South Australia             :117                     3rd Qu.:-30.97  
##  Tasmania                    :117                     Max.   :-12.46  
##  (Other)                     :234                                     
##       Long            date            RecoveredCases  
##  Min.   :115.9   Min.   :2020-01-22   Min.   :   0.0  
##  1st Qu.:136.7   1st Qu.:2020-02-20   1st Qu.:   0.0  
##  Median :145.5   Median :2020-03-20   Median :   4.0  
##  Mean   :141.2   Mean   :2020-03-20   Mean   : 211.6  
##  3rd Qu.:149.6   3rd Qu.:2020-04-18   3rd Qu.: 104.0  
##  Max.   :153.4   Max.   :2020-05-17   Max.   :2612.0  
##

head(CoronaDeathDf, 2)

##                 Province.State Country.Region      Lat     Long       date
## 1 Australian Capital Territory      Australia -35.4735 149.0124 2020-01-22
## 2              New South Wales      Australia -33.8688 151.2093 2020-01-22
##   DeathCases
## 1          0
## 2          0

6.2 Merging Corona cases data

Performing a full join on the corona data frame to form a single data frame and to do this full_join() function was used and merge was done on date and Province.State to match and the state that data relates to.
The resulted dataframe have some redundant variables i.e, Lat.x“,”Long.x“,”Lat.y“,”Long.y“,”Country.Region.x“,”Country.Region.y“,”Country.Region", so dropping the variables using subsetting.
Renamed the variable names using names() function.
Printing the head using head() function and structure using str() function along with summary statistics using summary() function.

coronaJoinDf <- full_join(CoronaConfDf, CoronaDeathDf, by=c("date", "Province.State"))
coronaJoinDf <- full_join(coronaJoinDf, CoronaRecDf, by=c("date", "Province.State"))
coronaJoinDf <- coronaJoinDf[,!(names(coronaJoinDf) %in% c("Lat.x", "Long.x","Lat.y", "Long.y", "Country.Region.x", "Country.Region.y", "Country.Region"))]
names(coronaJoinDf)[c(1,2)] <- c("State", "date")

head(coronaJoinDf, 2)

##                          State       date ConfirmedCases DeathCases
## 1 Australian Capital Territory 2020-01-22              0          0
## 2              New South Wales 2020-01-22              0          0
##        Lat     Long RecoveredCases
## 1 -35.4735 149.0124              0
## 2 -33.8688 151.2093              0

str(coronaJoinDf)

## 'data.frame':    936 obs. of  7 variables:
##  $ State         : Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ date          : Date, format: "2020-01-22" "2020-01-22" ...
##  $ ConfirmedCases: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeathCases    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Lat           : num  -35.5 -33.9 -12.5 -28 -34.9 ...
##  $ Long          : num  149 151 131 153 139 ...
##  $ RecoveredCases: int  0 0 0 0 0 0 0 0 0 0 ...

summary(coronaJoinDf)

##                           State          date            ConfirmedCases  
##  Australian Capital Territory:117   Min.   :2020-01-22   Min.   :   0.0  
##  New South Wales             :117   1st Qu.:2020-02-20   1st Qu.:   2.0  
##  Northern Territory          :117   Median :2020-03-20   Median :  28.0  
##  Queensland                  :117   Mean   :2020-03-20   Mean   : 365.6  
##  South Australia             :117   3rd Qu.:2020-04-18   3rd Qu.: 437.2  
##  Tasmania                    :117   Max.   :2020-05-17   Max.   :3076.0  
##  (Other)                     :234                                        
##    DeathCases          Lat              Long       RecoveredCases  
##  Min.   : 0.000   Min.   :-41.45   Min.   :115.9   Min.   :   0.0  
##  1st Qu.: 0.000   1st Qu.:-36.06   1st Qu.:136.7   1st Qu.:   0.0  
##  Median : 0.000   Median :-34.40   Median :145.5   Median :   4.0  
##  Mean   : 3.903   Mean   :-32.00   Mean   :141.2   Mean   : 211.6  
##  3rd Qu.: 4.000   3rd Qu.:-30.97   3rd Qu.:149.6   3rd Qu.: 104.0  
##  Max.   :46.000   Max.   :-12.46   Max.   :153.4   Max.   :2612.0  
##

7. Tidy & Manipulate Data II

This step meets the requirement 6 (At least one variable needs to be created/mutated from the existing ones).
The stock data set which has data about the sales of stock can be used to create more data out it.
So, the mutation was done using mutate() function to create two variables i.e,
1. StockDifferenceOpenClose - Tell how the price of stock changed when it was first sold on that day and last sold. To do this observations from Open stock value are subtracted with Close stock value.
2. StockDifferenceHighLow - Tell us how much difference is there between the highest and the lowest value of the stock sold on a particular day. To do this observations from High stock value are subtracted with Low stock value.

stock$date <- as.Date(stock$date)
stock <- mutate(stock, 
                StockDifferenceOpenClose = Open - Close,
                StockDifferenceHighLow = High - Low)
head(stock, 2)

##         date  Open  High   Low Close Adj.Close  Volume stockName
## 1 2015-03-31 0.555 0.595 0.530 0.565     0.565 4816294       A2M
## 2 2015-04-01 0.575 0.580 0.555 0.565     0.565 4376660       A2M
##   StockDifferenceOpenClose StockDifferenceHighLow
## 1                    -0.01                  0.065
## 2                     0.01                  0.025

str(stock)

## 'data.frame':    432888 obs. of  10 variables:
##  $ date                    : Date, format: "2015-03-31" "2015-04-01" ...
##  $ Open                    : num  0.555 0.575 0.56 0.545 0.545 0.54 0.53 0.535 0.54 0.54 ...
##  $ High                    : num  0.595 0.58 0.565 0.55 0.545 0.54 0.535 0.54 0.54 0.55 ...
##  $ Low                     : num  0.53 0.555 0.535 0.54 0.53 0.532 0.53 0.53 0.535 0.54 ...
##  $ Close                   : num  0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
##  $ Adj.Close               : num  0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
##  $ Volume                  : num  4816294 4376660 2779640 392179 668446 ...
##  $ stockName               : Factor w/ 100 levels "A2M","AGL","ALL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ StockDifferenceOpenClose: num  -0.01 0.01 0.005 0 0.005 ...
##  $ StockDifferenceHighLow  : num  0.065 0.025 0.03 0.01 0.015 ...

summarise(stock)

## data frame with 0 columns and 1 row

8. Scan I

In this step we will scan data set for any possible missing data.
To check variables that have missing value we pass our data frame to is.na() function with a column sum (colSums()) function.
As can be seen only variable price, lat and lon have missing value.

colSums(is.na(propDf))

##          date         price        suburb     city_name         state 
##             0         57867             0             0             0 
##           lat           lon      bedrooms property_type       loc_pid 
##            20            20             0             0             0 
##       lga_pid 
##             0

First we will target the propDf price variable and create a summary statistics table.
The maximum value is 100000000 and the minimum value is 1000.
The data have high variance and IQR.
Median is smaller than mean that suggests that data is right-skewed.
The data have in total 57867 missing values.

knitr::kable(propDf %>% summarise(Min = min(price,na.rm = TRUE), Max = max(price, na.rm = TRUE), n = n(), Missing = sum(is.na(price)), Q1 = quantile(price ,probs = .25,na.rm = TRUE), Median = median(price, na.rm = TRUE), Q3 = quantile(price, probs = .75,na.rm = TRUE), Mean = mean(price, na.rm = TRUE), SD = sd(price, na.rm = TRUE), IQR = IQR(price ,na.rm = TRUE)), "html", caption = "Table 1: Descriptive Statistics", align = "llllllllll", col.names = c("Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')

Table 1: Descriptive Statistics
Minimum	Maximum	Sample Size	Missing Count	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR
1000	100000000	290046	57867	485000	650000	912000	805693	616272.9	427000

To deal with the missing price value we are going to replace the missing observation or price based on the location and year.
So, first we will list all the unique suburb that has a missing price in our data set.
Use a for loop and filter the data based on the year:
1. from year minimum date of the property data set, i.e, “2018-09-02” to “2018-12-31” and replace the value with the median value of that suburb in the year 2018, as there can be outliers in the data that can greatly affect the mean value so the median is the best approach to replace value here.
2. from year maximum date of the property data set, i.e, “2020-05-13” to date greater than “2018-12-31” and replace the value with the median value of that suburb in the year 2019 and 2020 (as 2020 have less data so merged with 2019), as there can be outliers in the data that can greatly affect the mean value so median is the best approach to replace value here.

missingPrice <- unique(propDf[is.na(propDf$price),'suburb'])
for (location in missingPrice){
  propDf[propDf$suburb == location & is.na(propDf$price) & propDf$date > min(propDf$date) & propDf$date < '2018-12-31', 'price'] = median(propDf[propDf$suburb == i & propDf$date > min(propDf$date) & propDf$date < '2018-12-31', 'price'], na.rm = TRUE) # Dates in 2018
    propDf[propDf$suburb == location & is.na(propDf$price) & propDf$date <= max(propDf$date) & propDf$date > '2018-12-31', 'price'] = median(propDf[propDf$suburb == i & propDf$date <= max(propDf$date) & propDf$date > '2018-12-31', 'price'], na.rm = TRUE) # Dates between 2019 and 2020
}

To deal with the longitude and latitude missing value in our data they are replaced with the mean of the suburb the data is missing.
Also suburb Mambourin have no other data so this observation is dropped.

missingLatLon <- unique(propDf[is.na(propDf$lat),'suburb'])
for (i in missingLatLon){
  propDf[propDf$suburb == i & is.na(propDf$lon), 'lon'] = mean(propDf[propDf$suburb == i, 'lon'], na.rm = TRUE)
  propDf[propDf$suburb == i & is.na(propDf$lat), 'lat'] = mean(propDf[propDf$suburb == i, 'lat'], na.rm = TRUE)
}
propDf <- propDf[!propDf$suburb=='Mambourin',]

Other data sets do not have any missing value so we will proceed to the next step.

colSums(is.na(AudDf))

##  date value 
##     0     0

colSums(is.na(stock))

##                     date                     Open                     High 
##                        0                     1527                     1527 
##                      Low                    Close                Adj.Close 
##                     1527                     1527                     1527 
##                   Volume                stockName StockDifferenceOpenClose 
##                     1527                        0                     1527 
##   StockDifferenceHighLow 
##                     1527

colSums(is.na(coronaJoinDf))

##          State           date ConfirmedCases     DeathCases            Lat 
##              0              0              0              0              0 
##           Long RecoveredCases 
##              0              0

9. Scan II

For our research topic the motive is to prepare a data to analyze the effect on Corona on the Australian economy so data date 2018-09-02 and above were filtered.

print(paste("Minimum date in AudDf is :" ,min(AudDf$date)))

## [1] "Minimum date in AudDf is : 1991-01-02"

print(paste("Minimum date in coronaJoinDf is :" ,min(coronaJoinDf$date)))

## [1] "Minimum date in coronaJoinDf is : 2020-01-22"

print(paste("Minimum date in propDf is :" ,min(propDf$date)))

## [1] "Minimum date in propDf is : 2018-09-02"

print(paste("Minimum date in stock is :" ,min(stock$date)))

## [1] "Minimum date in stock is : 2000-01-03"

AudDf <- subset(AudDf, date >= "2018-09-02")
stock <- subset(stock, date >= "2018-09-02")

propDf have one outlier on the higher side, so only going to keep data smaller than the maximum property price to remove the outlier.
But removing one outlier rather created more outliers, but will not be removed as these are natural and not due to some error but rather genuine price.

boxplot(propDf$price, ylim = c(-100,max(propDf$price, na.rm = TRUE)), main = 'Box plot for price before removing outlier', cex.main=0.5); propDf <- subset(propDf, price < max(propDf$price, na.rm = TRUE)) ; boxplot(propDf$price, ylim = c(-100,max(propDf$price, na.rm = TRUE)), main = 'Box plot for price after removing outlier', cex.main=0.5) # Dropping extrem outlier

As can be seen that mean approx to median that suggests that there is no skewness in the data and standard deviation and IQR is low hence the distribution must be high peak and less spread.
From the box plot it can be seen that there are some outliers, but this is completely normal in low variance data and can also be due to drastic change is exchange rate due to corona.

names(AudDf) <- c('date', 'AudToUsd')
knitr::kable(AudDf %>% summarise(Min = min(AudToUsd,na.rm = TRUE), Max = max(AudToUsd, na.rm = TRUE), n = n(), Missing = sum(is.na(AudToUsd)), Q1 = quantile(AudToUsd ,probs = .25,na.rm = TRUE), Median = median(AudToUsd, na.rm = TRUE), Q3 = quantile(AudToUsd, probs = .75,na.rm = TRUE), Mean = mean(AudToUsd, na.rm = TRUE), SD = sd(AudToUsd, na.rm = TRUE), IQR = IQR(AudToUsd ,na.rm = TRUE)), "html", caption = "Table 2: Descriptive Statistics", align = "llllllllll", col.names = c("Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black'); boxplot(AudDf$AudToUsd, main = 'Box plot for AudToUsd', cex.main=0.5)

Table 2: Descriptive Statistics
Minimum	Maximum	Sample Size	Missing Count	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR
0.57	0.74	535	0	0.68	0.69	0.71	0.69	0.03	0.03

It can be seen from the table that in each variable the median is smaller than mean that suggests that the distribution for these variables has right skewness in them.
Also, Standard Deviation and IQR is high that suggest more spread and less peak.

GatherStock <- stock %>% gather(Open, High, High, Low, Close, Adj.Close, Volume,StockDifferenceOpenClose, StockDifferenceHighLow, key = 'Parameter', value = 'value')
knitr::kable(GatherStock %>% group_by(GatherStock$Parameter) %>% summarise(Min = min(value,na.rm = TRUE), Max = max(value, na.rm = TRUE), n = n(), Q1 = quantile(value ,probs = .25,na.rm = TRUE), Median = median(value, na.rm = TRUE), Q3 = quantile(value, probs = .75,na.rm = TRUE), Mean = mean(value, na.rm = TRUE), SD = sd(value, na.rm = TRUE), IQR = IQR(value ,na.rm = TRUE)), "html", caption = "Table 3: Descriptive Statistics", align = "llllllllll", col.names = c('Parameter Type', "Minimum", "Maximum", "Sample Size","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')

Table 3: Descriptive Statistics
Parameter Type	Minimum	Maximum	Sample Size	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR
Adj.Close	0.95	339.38	40043	4.75	11.70	24.70	22.31	34.83	19.95
Close	0.95	341.00	40043	4.95	12.05	25.43	22.91	35.42	20.48
High	0.98	342.75	40043	5.01	12.19	25.70	23.16	35.79	20.69
Low	0.90	337.03	40043	4.88	11.90	25.15	22.65	35.04	20.27
Open	0.94	339.42	40043	4.94	12.05	25.44	22.90	35.42	20.50
StockDifferenceHighLow	0.00	57.74	40043	0.10	0.22	0.50	0.51	1.12	0.40
StockDifferenceOpenClose	-52.79	30.42	40043	-0.09	0.00	0.07	-0.01	0.82	0.16
Volume	0.00	248837038.00	40043	1017361.50	2530727.00	5364411.00	4471622.35	6320144.83	4347049.50

It can be seen that the volume column has two outliers with value as 248837038 and 193729243.
These outliers are dropped but it rather generated more outliers in volume column that were expected, but not every outlier can be removed because it is natural for stock data to have a different value than normal fashion, i.e some stock is sold more than others.

ggplot(GatherStock, aes(x=Parameter, y=value)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", size = 5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12), axis.text.x = element_text(color = "grey20", size = 8, angle = 90)) + scale_x_discrete(name = "\nVariable Name")+ ggtitle("Boxplot for different variable of stock dataframe before removing outliers\n") + scale_y_continuous(name = 'Value\n'); stock <- stock[!stock$Volume == 248837038,]; stock <- stock[!stock$Volume == 193729243,]; GatherStock <- stock %>% gather(Open, High, High, Low, Close, Adj.Close, Volume,StockDifferenceOpenClose, StockDifferenceHighLow, key = 'Parameter', value = 'value'); ggplot(GatherStock, aes(x=Parameter, y=value)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", size = 5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12), axis.text.x = element_text(color = "grey20", size = 8, angle = 90)) + scale_x_discrete(name = "\nVariable Name")+ ggtitle("Boxplot for different variable of stock dataframe after removing outliers\n") + scale_y_continuous(name = 'Value\n')

10. Merging the data and handling any inconsistency in the data

Finally after individually reading, dealing with missing vale and outliers all the data frames are merged into one to create a final data frame.
The first merge is performed between the rather main data i.e propDf and coronaJoinDf as they have information starting from “2018-09-02” and all the information is crucial in these data frames hence full join is performed using full_join() function keeping date as the key and storing it in finalDf data frame.
After this full join is performed between finalDf and AudDf by date keeping finalDf on the left side as we want to see how corona affected the exchange rate.
Similar, left join is performed between finalDf and stock by date keeping finalDf on the left side as we are interested in seeing how corona affected the stock exchange market.

finalDf <- full_join(propDf, coronaJoinDf, by = 'date')
finalDf <- full_join(finalDf, AudDf, by = 'date')
finalDf <- full_join(finalDf, stock, by = 'date')
str(finalDf)

## 'data.frame':    34205178 obs. of  27 variables:
##  $ date                    : Date, format: "2018-09-26" "2018-09-26" ...
##  $ price                   : num  1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 ...
##  $ suburb                  : Factor w/ 2017 levels "Abbotsbury","Abbotsford",..: 489 489 489 489 489 489 489 489 489 489 ...
##  $ city_name               : Factor w/ 6 levels "Adelaide","Brisbane",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ state                   : Factor w/ 6 levels "ACT","NSW","QLD",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ lat                     : num  -33.9 -33.9 -33.9 -33.9 -33.9 ...
##  $ lon                     : num  151 151 151 151 151 ...
##  $ bedrooms                : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ property_type           : Ord.factor w/ 3 levels "unit"<"house"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ loc_pid                 : Factor w/ 2109 levels "ACT101","ACT102",..: 132 132 132 132 132 132 132 132 132 132 ...
##  $ lga_pid                 : Factor w/ 115 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ State                   : Factor w/ 8 levels "Australian Capital Territory",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ConfirmedCases          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ DeathCases              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Lat                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Long                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ RecoveredCases          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AudToUsd                : num  0.726 0.726 0.726 0.726 0.726 ...
##  $ Open                    : num  10.85 19.25 28.4 8.76 26.2 ...
##  $ High                    : num  10.97 19.78 28.83 8.83 26.87 ...
##  $ Low                     : num  10.64 19.2 28.11 8.69 25.96 ...
##  $ Close                   : num  10.7 19.59 28.71 8.75 26.72 ...
##  $ Adj.Close               : num  10.7 18.04 27.9 8.37 26.29 ...
##  $ Volume                  : num  3794249 2910828 2825963 1424545 447759 ...
##  $ stockName               : Factor w/ 100 levels "A2M","AGL","ALL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ StockDifferenceOpenClose: num  0.15 -0.34 -0.31 0.01 -0.52 ...
##  $ StockDifferenceHighLow  : num  0.33 0.58 0.715 0.145 0.91 ...

head(finalDf, 2)

##         date   price        suburb city_name state       lat      lon
## 1 2018-09-26 1730000 Darling Point    Sydney   NSW -33.86839 151.2375
## 2 2018-09-26 1730000 Darling Point    Sydney   NSW -33.86839 151.2375
##   bedrooms property_type loc_pid lga_pid State ConfirmedCases DeathCases
## 1        2          unit NSW1221  NSW180  <NA>             NA         NA
## 2        2          unit NSW1221  NSW180  <NA>             NA         NA
##   Lat Long RecoveredCases AudToUsd  Open  High   Low Close Adj.Close
## 1  NA   NA             NA   0.7264 10.85 10.97 10.64 10.70   10.7000
## 2  NA   NA             NA   0.7264 19.25 19.78 19.20 19.59   18.0424
##    Volume stockName StockDifferenceOpenClose StockDifferenceHighLow
## 1 3794249       A2M                     0.15                   0.33
## 2 2910828       AGL                    -0.34                   0.58

After joining our data there is a development of several missing data.
Data in columns from coronaJoinDf is mainly missing as it just came in the year 2020 so there will be no data before that.
While it can be seen that data from AudDf and stock data frame also developed missing values that suggest that there were some days that were not in the data.

colSums(is.na(finalDf))

##                     date                    price                   suburb 
##                        0                       32                       32 
##                city_name                    state                      lat 
##                       32                       32                       32 
##                      lon                 bedrooms            property_type 
##                       32                       32                       32 
##                  loc_pid                  lga_pid                    State 
##                       32                       32                 15919394 
##           ConfirmedCases               DeathCases                      Lat 
##                 15919394                 15919394                 15919394 
##                     Long           RecoveredCases                 AudToUsd 
##                 15919394                 15919394                    62728 
##                     Open                     High                      Low 
##                   142213                   142213                   142213 
##                    Close                Adj.Close                   Volume 
##                   142213                   142213                   142213 
##                stockName StockDifferenceOpenClose   StockDifferenceHighLow 
##                   142213                   142213                   142213

To fill the missing values in AudToUsd we used the forward filling method using fill() function.
But as data was generated due to merge and not actually missing or error so replaced every numerical observation with 0 and every missing character observation with Null charter.
After this using the cbind() function to add a new column that uses weekdays() function to tell the day of the week bass on the date, as this can give valuable insight as in what weekday promotes more sale or property, change in stock or exchange rate, and also what weekdays most corona cases were identified, etc.

finalDf <- finalDf %>% fill(AudToUsd)
finalDf$State <- as.character(finalDf$State)
finalDf$city_name <-  as.character(finalDf$city_name)
finalDf$state <- as.character(finalDf$state)
finalDf$property_type <- as.character(finalDf$property_type)
finalDf$loc_pid <- as.character(finalDf$loc_pid)
finalDf$lga_pid <- as.character(finalDf$lga_pid)
finalDf$stockName <- as.character(finalDf$stockName)
finalDf$suburb <- as.character(finalDf$suburb)

finalDf[is.na(finalDf$stockName),'stockName'] <- 'Null'
finalDf[is.na(finalDf$State),'State'] <- 'Null'
finalDf[is.na(finalDf$suburb),c('suburb', 'city_name', 'state', 'property_type', 
                                'loc_pid', 'lga_pid')] <- 'Null'
finalDf[finalDf$stockName == 'Null', c('Open','High', 'Low', 'Close', 'Adj.Close', 'Volume',
                                       'StockDifferenceOpenClose', 'StockDifferenceHighLow')] <- 0
finalDf[finalDf$State == 'Null', c('ConfirmedCases','DeathCases', 'RecoveredCases', 'Long', 'Lat')] <- 0
finalDf[finalDf$suburb == 'Null', c('price','lat', 'lon', 'Long', 'bedrooms')] <- 0
finalDf <- cbind(finalDf, weekdays(finalDf$date))

After these column names are changed to a better describing column names.
All the missing values are now dealt with.

colSums(is.na(finalDf))

##                     date                    price                   suburb 
##                        0                        0                        0 
##                city_name                    state                      lat 
##                        0                        0                        0 
##                      lon                 bedrooms            property_type 
##                        0                        0                        0 
##                  loc_pid                  lga_pid                    State 
##                        0                        0                        0 
##           ConfirmedCases               DeathCases                      Lat 
##                        0                        0                        0 
##                     Long           RecoveredCases                 AudToUsd 
##                        0                        0                        0 
##                     Open                     High                      Low 
##                        0                        0                        0 
##                    Close                Adj.Close                   Volume 
##                        0                        0                        0 
##                stockName StockDifferenceOpenClose   StockDifferenceHighLow 
##                        0                        0                        0 
##   weekdays(finalDf$date) 
##                        0

names(finalDf) <- c('date', 'PropertyPrice', 'PropertySuburb', 'PropertyCityName', 'PropertyState',
                    'PropertyLat', 'PropertyLon', 'PropertyBedrooms','PropertyType', 'PropertyLocPid',
                    'PropertyLgaPid','CoronaState','CoronaConfirmedCases', 'CoronaDeathCases',
                    'CoronaLat', 'CoronaLon', 'CoronaRecoveredCases','AudToUsd', 'StockOpen', 
                    'StockHigh', 'StockLow','StockClose', 'StockAdj.Close', 'StockVolume', 
                    'StockName', 'StockDifferenceOpenClose', 'StockDifferenceHighLow', 'Weekday')

11. Transform

This step meets the requirement 9.
As can be seen from the majority of the histogram of numerical variables in our data is right-skewed so appropriate transformation should be applied to decrease the skewness and convert the distribution into a normal distribution.
All the variables have data on the positive side except StockDifferenceOpenClose variable.

hist(finalDf$PropertyPrice, main="Histogram for PropertyPrice", cex.main=0.5); hist(finalDf$AudToUsd, main="Histogram for AudToUsd", cex.main=0.5); hist(finalDf$StockOpen, main="Histogram for StockOpen", cex.main=0.5); hist(finalDf$StockHigh, main="Histogram for StockHigh", cex.main=0.5); hist(finalDf$StockLow, main="Histogram for StockLow", cex.main=0.5); hist(finalDf$StockClose, main="Histogram for StockClose", cex.main=0.5); hist(finalDf$StockAdj.Close, main="Histogram for StockAdj.Close", cex.main=0.5); hist(finalDf$StockVolume, main="Histogram for StockVolume", cex.main=0.5); hist(finalDf$StockDifferenceOpenClose, main="Histogram for StockDifferenceOpenClose", cex.main=0.5); hist(finalDf$StockDifferenceHighLow, main="Histogram for StockDifferenceHighLow", cex.main=0.5)

It can be seen from previous histograms that majority of numerical variables in our data is right-skewed and in the positive side with no zero(0) so log transformation would be best suited for transforming such data to decrease the skewness and convert the distribution into a normal distribution.
except AudToUsd which is not right-skewed and StockDifferenceOpenClose which has data points in the negative side, so log transformation can not be used on them.
So, for StockDifferenceOpenClose variable reciprocal transformation will be used.
As can be seen all the columns are closely following the normal distribution.

hist(log(finalDf$PropertyPrice), main="Histogram for PropertyPrice", cex.main=0.5); hist(log(finalDf$StockOpen), main="Histogram for StockOpen", cex.main=0.5); hist(log(finalDf$StockHigh), main="Histogram for StockHigh", cex.main=0.5); hist(log(finalDf$StockLow), main="Histogram for StockLow", cex.main=0.5); hist(log(finalDf$StockClose), main="Histogram for StockClose", cex.main=0.5); hist(log(finalDf$StockAdj.Close), main="Histogram for StockAdj.Close", cex.main=0.5); hist(log(finalDf$StockVolume), main="Histogram for StockVolume", cex.main=0.5); hist(1/(finalDf$StockDifferenceOpenClose), main="Histogram for StockDifferenceOpenClose", cex.main=0.5); hist(log(finalDf$StockDifferenceHighLow), main="Histogram for StockDifferenceHighLow", cex.main=0.5)

Finally applying the transformation to the main data frame and also performing typecasting.

head(finalDf, 2)

##         date PropertyPrice PropertySuburb PropertyCityName PropertyState
## 1 2018-09-26       1730000  Darling Point           Sydney           NSW
## 2 2018-09-26       1730000  Darling Point           Sydney           NSW
##   PropertyLat PropertyLon PropertyBedrooms PropertyType PropertyLocPid
## 1   -33.86839    151.2375                2         unit        NSW1221
## 2   -33.86839    151.2375                2         unit        NSW1221
##   PropertyLgaPid CoronaState CoronaConfirmedCases CoronaDeathCases
## 1         NSW180        Null                    0                0
## 2         NSW180        Null                    0                0
##   CoronaLat CoronaLon CoronaRecoveredCases AudToUsd StockOpen StockHigh
## 1         0         0                    0   0.7264     10.85     10.97
## 2         0         0                    0   0.7264     19.25     19.78
##   StockLow StockClose StockAdj.Close StockVolume StockName
## 1    10.64      10.70        10.7000     3794249       A2M
## 2    19.20      19.59        18.0424     2910828       AGL
##   StockDifferenceOpenClose StockDifferenceHighLow   Weekday
## 1                     0.15                   0.33 Wednesday
## 2                    -0.34                   0.58 Wednesday

finalDf$PropertyState <- as.factor(finalDf$PropertyState)
finalDf$PropertyCityName <-  as.factor(finalDf$PropertyCityName)
finalDf$CoronaState <- as.factor(finalDf$CoronaState)
finalDf$PropertyType <- factor(finalDf$PropertyType, levels = c("Null", "unit","house","townhouse"),ordered = TRUE)
finalDf$PropertyLocPid <- as.factor(finalDf$PropertyLocPid)
finalDf$PropertyLgaPid <- as.factor(finalDf$PropertyLgaPid)
finalDf$StockName <- as.factor(finalDf$StockName)
finalDf$PropertySuburb <- as.factor(finalDf$PropertySuburb)
finalDf$PropertyPrice <- log(finalDf$PropertyPrice)
finalDf$StockOpen <- log(finalDf$StockOpen)
finalDf$StockHigh <- log(finalDf$StockHigh)
finalDf$StockLow <- log(finalDf$StockLow)
finalDf$StockClose <- log(finalDf$StockClose)
finalDf$StockAdj.Close <- log(finalDf$StockAdj.Close)
finalDf$StockVolume <-log(finalDf$StockVolume)
finalDf$StockDifferenceOpenClose <- 1/(finalDf$StockDifferenceOpenClose)
finalDf$StockDifferenceHighLow <- log(finalDf$StockDifferenceHighLow)

head(finalDf)

##         date PropertyPrice PropertySuburb PropertyCityName PropertyState
## 1 2018-09-26      14.36363  Darling Point           Sydney           NSW
## 2 2018-09-26      14.36363  Darling Point           Sydney           NSW
## 3 2018-09-26      14.36363  Darling Point           Sydney           NSW
## 4 2018-09-26      14.36363  Darling Point           Sydney           NSW
## 5 2018-09-26      14.36363  Darling Point           Sydney           NSW
## 6 2018-09-26      14.36363  Darling Point           Sydney           NSW
##   PropertyLat PropertyLon PropertyBedrooms PropertyType PropertyLocPid
## 1   -33.86839    151.2375                2         unit        NSW1221
## 2   -33.86839    151.2375                2         unit        NSW1221
## 3   -33.86839    151.2375                2         unit        NSW1221
## 4   -33.86839    151.2375                2         unit        NSW1221
## 5   -33.86839    151.2375                2         unit        NSW1221
## 6   -33.86839    151.2375                2         unit        NSW1221
##   PropertyLgaPid CoronaState CoronaConfirmedCases CoronaDeathCases
## 1         NSW180        Null                    0                0
## 2         NSW180        Null                    0                0
## 3         NSW180        Null                    0                0
## 4         NSW180        Null                    0                0
## 5         NSW180        Null                    0                0
## 6         NSW180        Null                    0                0
##   CoronaLat CoronaLon CoronaRecoveredCases AudToUsd StockOpen StockHigh
## 1         0         0                    0   0.7264  2.384165  2.395164
## 2         0         0                    0   0.7264  2.957511  2.984671
## 3         0         0                    0   0.7264  3.346389  3.361243
## 4         0         0                    0   0.7264  2.170196  2.178155
## 5         0         0                    0   0.7264  3.265759  3.291010
## 6         0         0                    0   0.7264  1.907169  1.907169
##   StockLow StockClose StockAdj.Close StockVolume StockName
## 1 2.364620   2.370244       2.370244    15.14900       A2M
## 2 2.954910   2.975019       2.892725    14.88395       AGL
## 3 3.336125   3.357245       3.328651    14.85436       ALL
## 4 2.161597   2.169054       2.124350    14.16936       ALQ
## 5 3.256557   3.285412       3.269207    13.01201       ALU
## 6 1.895405   1.899832       1.859265    14.69220       ALX
##   StockDifferenceOpenClose StockDifferenceHighLow   Weekday
## 1                 6.666667            -1.10866262 Wednesday
## 2                -2.941176            -0.54472718 Wednesday
## 3                -3.225817            -0.33547274 Wednesday
## 4               100.000000            -1.93102154 Wednesday
## 5                -1.923084            -0.09430848 Wednesday
## 6                20.312817            -2.54135003 Wednesday

str(finalDf)

## 'data.frame':    34205178 obs. of  28 variables:
##  $ date                    : Date, format: "2018-09-26" "2018-09-26" ...
##  $ PropertyPrice           : num  14.4 14.4 14.4 14.4 14.4 ...
##  $ PropertySuburb          : Factor w/ 2000 levels "Abbotsbury","Abbotsford",..: 485 485 485 485 485 485 485 485 485 485 ...
##  $ PropertyCityName        : Factor w/ 7 levels "Adelaide","Brisbane",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ PropertyState           : Factor w/ 7 levels "ACT","NSW","Null",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PropertyLat             : num  -33.9 -33.9 -33.9 -33.9 -33.9 ...
##  $ PropertyLon             : num  151 151 151 151 151 ...
##  $ PropertyBedrooms        : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ PropertyType            : Ord.factor w/ 4 levels "Null"<"unit"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PropertyLocPid          : Factor w/ 2091 levels "ACT101","ACT102",..: 131 131 131 131 131 131 131 131 131 131 ...
##  $ PropertyLgaPid          : Factor w/ 116 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ CoronaState             : Factor w/ 9 levels "Australian Capital Territory",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ CoronaConfirmedCases    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CoronaDeathCases        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CoronaLat               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CoronaLon               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CoronaRecoveredCases    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AudToUsd                : num  0.726 0.726 0.726 0.726 0.726 ...
##  $ StockOpen               : num  2.38 2.96 3.35 2.17 3.27 ...
##  $ StockHigh               : num  2.4 2.98 3.36 2.18 3.29 ...
##  $ StockLow                : num  2.36 2.95 3.34 2.16 3.26 ...
##  $ StockClose              : num  2.37 2.98 3.36 2.17 3.29 ...
##  $ StockAdj.Close          : num  2.37 2.89 3.33 2.12 3.27 ...
##  $ StockVolume             : num  15.1 14.9 14.9 14.2 13 ...
##  $ StockName               : Factor w/ 101 levels "A2M","AGL","ALL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ StockDifferenceOpenClose: num  6.67 -2.94 -3.23 100 -1.92 ...
##  $ StockDifferenceHighLow  : num  -1.1087 -0.5447 -0.3355 -1.931 -0.0943 ...
##  $ Weekday                 : Factor w/ 7 levels "Friday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...

summary(finalDf)

##       date            PropertyPrice      PropertySuburb    
##  Min.   :2018-09-02   Min.   : -Inf   Pakenham  :  255271  
##  1st Qu.:2019-06-19   1st Qu.:13.05   Frankston :  169599  
##  Median :2020-01-28   Median :13.35   Tarneit   :  161499  
##  Mean   :2019-10-17   Mean   : -Inf   Melbourne :  159318  
##  3rd Qu.:2020-02-27   3rd Qu.:13.66   Point Cook:  159181  
##  Max.   :2020-05-17   Max.   :16.12   Sunbury   :  159119  
##                                       (Other)   :33141191  
##   PropertyCityName    PropertyState    PropertyLat      PropertyLon   
##  Adelaide : 4075336   ACT : 1207070   Min.   :-38.48   Min.   :  0.0  
##  Brisbane : 4121829   NSW : 9011640   1st Qu.:-37.76   1st Qu.:144.6  
##  Canberra : 1207070   Null:      32   Median :-34.07   Median :145.2  
##  Melbourne:11615565   QLD : 4121829   Mean   :-34.40   Mean   :143.4  
##  Null     :      32   SA  : 4075336   3rd Qu.:-33.68   3rd Qu.:151.1  
##  Perth    : 4173706   VIC :11615565   Max.   :  0.00   Max.   :153.2  
##  Sydney   : 9011640   WA  : 4173706                                   
##  PropertyBedrooms    PropertyType      PropertyLocPid    
##  0:   25132       Null     :      32   VIC2023:  255271  
##  1: 1898179       unit     : 8516643   VIC939 :  169599  
##  2: 7250252       house    :23035467   VIC2453:  161499  
##  3:13317829       townhouse: 2653036   VIC1634:  159318  
##  4: 9556504                            VIC2082:  159181  
##  5: 2157282                            VIC2394:  159119  
##                                        (Other):33141191  
##  PropertyLgaPid                           CoronaState      
##  ACT    : 1207070   Null                        :15919394  
##  BRIS2  : 1117338   Australian Capital Territory: 2285723  
##  BRIS3  : 1018200   New South Wales             : 2285723  
##  VIC127 :  975817   Northern Territory          : 2285723  
##  NSW283 :  862876   Queensland                  : 2285723  
##  BRIS4  :  772019   South Australia             : 2285723  
##  (Other):28251858   (Other)                     : 6857169  
##  CoronaConfirmedCases CoronaDeathCases    CoronaLat        CoronaLon    
##  Min.   :   0.0       Min.   : 0.0000   Min.   :-41.45   Min.   :  0.0  
##  1st Qu.:   0.0       1st Qu.: 0.0000   1st Qu.:-34.93   1st Qu.:  0.0  
##  Median :   0.0       Median : 0.0000   Median :-12.46   Median :115.9  
##  Mean   :  34.5       Mean   : 0.1923   Mean   :-17.10   Mean   : 75.5  
##  3rd Qu.:   4.0       3rd Qu.: 0.0000   3rd Qu.:  0.00   3rd Qu.:146.0  
##  Max.   :3076.0       Max.   :46.0000   Max.   :  0.00   Max.   :153.4  
##                                                                         
##  CoronaRecoveredCases    AudToUsd        StockOpen       StockHigh    
##  Min.   :   0.000     Min.   :0.5687   Min.   : -Inf   Min.   : -Inf  
##  1st Qu.:   0.000     1st Qu.:0.6602   1st Qu.:1.564   1st Qu.:1.575  
##  Median :   0.000     Median :0.6747   Median :2.485   Median :2.499  
##  Mean   :   3.259     Mean   :0.6721   Mean   : -Inf   Mean   : -Inf  
##  3rd Qu.:   0.000     3rd Qu.:0.6966   3rd Qu.:3.239   3rd Qu.:3.252  
##  Max.   :2612.000     Max.   :0.7357   Max.   :5.827   Max.   :5.837  
##                                                                       
##     StockLow       StockClose    StockAdj.Close   StockVolume   
##  Min.   : -Inf   Min.   : -Inf   Min.   : -Inf   Min.   : -Inf  
##  1st Qu.:1.550   1st Qu.:1.560   1st Qu.:1.530   1st Qu.:14.01  
##  Median :2.466   Median :2.483   Median :2.463   Median :14.94  
##  Mean   : -Inf   Mean   : -Inf   Mean   : -Inf   Mean   : -Inf  
##  3rd Qu.:3.223   3rd Qu.:3.237   3rd Qu.:3.219   3rd Qu.:15.73  
##  Max.   :5.820   Max.   :5.832   Max.   :5.827   Max.   :18.51  
##                                                                 
##    StockName        StockDifferenceOpenClose StockDifferenceHighLow
##  A2M    :  340911   Min.   :-204.0816        Min.   :   -Inf       
##  AGL    :  340911   1st Qu.:  -7.6922        1st Qu.:-2.1628       
##  ALL    :  340911   Median :   0.8929        Median :-1.3093       
##  ALQ    :  340911   Mean   :      Inf        Mean   :   -Inf       
##  ALU    :  340911   3rd Qu.:  12.5000        3rd Qu.:-0.3857       
##  ALX    :  340911   Max.   :      Inf        Max.   : 4.0560       
##  (Other):32159712                                                  
##       Weekday       
##  Friday   :8152045  
##  Monday   :6462713  
##  Saturday :  76536  
##  Sunday   :  10454  
##  Thursday :6262589  
##  Tuesday  :6334771  
##  Wednesday:6906070

References

“AUS Real Estate Sales September 2018 to May 2020”, Kaggle. [Online]. Available: https://www.kaggle.com/htagholdings/aus-real-estate-sales-march-2019-to-april-2020. [Accessed: 07-Jun-2020].
“Australian - US Dollar Exchange Rate (AUD USD) - Historical Chart”, Macrotrends. [Online]. Available: https://www.macrotrends.net/2551/australian-us-dollar-exchange-rate-historical-chart. [Accessed: 07-Jun-2020].
“Novel Corona Virus 2019 Dataset”, Kaggle. [Online]. Available: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. [Accessed: 07-Jun-2020].
“Australian Historical Stock Prices”, Kaggle. [Online]. Available: https://www.kaggle.com/ashbellett/australian-historical-stock-prices. [Accessed: 07-Jun-2020].

MATH2349 Semester 1, 2020

Assignment 2

Rashbir Singh Kohli (s3810585)