For this project the require packages are: [1]tidyr [2]lubridate [3]dplyr [4]ggplot2 [5]car [6] magrittr [7]kableExtra [8]ggthemes [9]forecast
library(tidyr)
library(lubridate)
library(dplyr)
library(ggplot2)
library(car)
library(magrittr)
library(kableExtra)
library(ggthemes)
library(forecast)
options(scipen=999)
The year 2020 came with much bad news, including COVID-19. Coronavirus has impacted our life in various ways and also drastically transformed our economy. Due to this, there is some change noted in the Australian dollar exchange rate, property price, and stocks. The following assignment mainly focuses on to study this impact only. The report is to present in a PDF format file with a 25-page limit. Also, our dataset needs to meet the 10 minimum requirement.
To study the impact of corona on Australian economy data of various type need to be collected. The motive of this assignment is to prepare an analysis of ready data to have some valuable insight into the impact of corona on monetary things. To see how corona increased or decreased the sales of some stock, how the price of the Australian dollar change, and what is the buying pattern of people before and during corona. All these questions can be answered after the data is cleaned and prepared.
Four different data sets were collected 3 from Kaggle and 1 from macrotrends and in total 105 CSV files were processed. Each data was processed individually using fread() and for loop to read 100 stock CSV files and head and structure was checked. Data relevant to Australia was filtered and type conversions were applied on each dataset. Missing-value was identified using summary statistics and colSums() function. Corona dataset was untidy so created a function to process and clean 3 coronavirus CSV files. To deal with missing values in the property data frame cluster/group based median values were used to replace missing data. Outliers in each dataset were managed on each data set individually to some extent as the corona is expected to created outliers and outliers for this type of problem statement. Only the extreme Outliers in the data frame were dropped as after dropping them more outliers were formed. Finally, all the datasets were combined on the date to form a final data frame and data with date 2018-09-02 and above were only selected as the data need to be processed for analyzing the impact of corona on the Australian economy so data older than that is not relevant. Furthur missing value generated after joining datasets were dealt with either filling data using the front-fill method or simply replacing numerical observation with 0 and character observation with Null. Finally, the distribution of the numerical columns in the final data frame was visualized and relevant data transformation was applied and the head, structure, and summary of the data were shown.
AUS Real Estate Sales September 2018 to May 2020.[1] This data was collected from Kaggle and is about the Australian property market. The data is collected by HtAG, it is a web portal for real estate professionals that assists its customers in making property-related decisions based on timely and actionable market information. The inspiration behind making this data set public is to provide the raw sales data to the community to assist with impact analysis of COVID-19 on the Australian property market.
Australian and US dollar exchange rate historical chart.[2] This data was collected from macrotrends and is about the change in the Australian dollar rate value as compared to the US. The data about the exchange rate back to 1991. The website gives an easy visualization of change in the Australian dollar, using a simple line plot with x-axis as the date and y-axis as the dollar value.
Novel Corona Virus 2019 Dataset.[3] This data was collected from Kaggle and is about the cases of COVID-19 in different countries and their states and it informs: [1] Number of cases [2] How many patients recovered [3] How many deaths were recorded due to COVID-19. On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The data is uploaded from Johns Hopkins Github repository (2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE) by one of the Kaggle Grandmaster (SRK ).
Australian Historical Stock Prices.[4] This data set contains historical share price data from the top 100 companies listed on the Australian Securities Exchange (ASX) and is consist of 100 CSV files each file name denotes the stock name. Each CSV file contains the stock price information for one company including [1] Date [2] Opening price [3] High price [4] Low price [5] Closing price [6] Adjusted closing price [7] Trading volume. The data was published under CC0: Public Domain License.
Reading dataset 1 - Reading the aus-property-sales-sep2018-april2020.csv using read.csv() function keeping stringsAsFactors as FALSE. * Using the dim() function to see the number of variables and observations. * The following data have 11 variables and 290046 observations. * Using the str() function to print the structure of the data. * The data have following columns: [1] date_sold [2] price [3] suburb [4] city_name [5] state [6] lat [7] lon [8] bedrooms [9] property_type [10] loc_pid [11] lga_pid * Used head() function to print top 2.
propDf <- read.csv("aus-property-sales-sep2018-april2020.csv", stringsAsFactors = FALSE)
print(dim(propDf))
## [1] 290046 11
str(propDf)
## 'data.frame': 290046 obs. of 11 variables:
## $ date_sold : chr "2018-09-18 00:00:00" "2018-09-24 00:00:00" "2018-09-26 00:00:00" "2018-09-26 00:00:00" ...
## $ price : chr "NULL" "NULL" "1730000" "1928000" ...
## $ suburb : chr "Darling Point" "Darling Point" "Darling Point" "Darling Point" ...
## $ city_name : chr "Sydney" "Sydney" "Sydney" "Sydney" ...
## $ state : chr "NSW" "NSW" "NSW" "NSW" ...
## $ lat : chr "-33.869565" "-33.872179" "-33.868386" "-33.875465" ...
## $ lon : chr "151.241317" "151.239726" "151.237471" "151.23628" ...
## $ bedrooms : int 3 3 2 3 3 2 3 2 3 3 ...
## $ property_type: chr "unit" "unit" "unit" "unit" ...
## $ loc_pid : chr "NSW1221" "NSW1221" "NSW1221" "NSW1221" ...
## $ lga_pid : chr "NSW180" "NSW180" "NSW180" "NSW180" ...
print(head(propDf, 2))
## date_sold price suburb city_name state lat
## 1 2018-09-18 00:00:00 NULL Darling Point Sydney NSW -33.869565
## 2 2018-09-24 00:00:00 NULL Darling Point Sydney NSW -33.872179
## lon bedrooms property_type loc_pid lga_pid
## 1 151.241317 3 unit NSW1221 NSW180
## 2 151.239726 3 unit NSW1221 NSW180
Reading dataset 2 - Reading the australian-us-dollar-exchange-rate-historical-chart.csv using read.csv() function keeping stringsAsFactors as FALSE and skipping the top 15 rows. * Using the dim() function to see the number of variables and observations. * The following data have 2 variables and 7597 observations. * Using the str() function to print the structure of the data. * The data have the following columns: [1] date [2] value * Used head() function to print top 2 values.
AudDf <- read.csv("australian-us-dollar-exchange-rate-historical-chart.csv", stringsAsFactors = FALSE, skip = 15)
print(dim(AudDf))
## [1] 7597 2
str(AudDf)
## 'data.frame': 7597 obs. of 2 variables:
## $ date : chr "1991-01-02" "1991-01-03" "1991-01-04" "1991-01-07" ...
## $ value: num 0.775 0.779 0.778 0.778 0.777 0.774 0.775 0.775 0.778 0.783 ...
print(head(AudDf, 2))
## date value
## 1 1991-01-02 0.775
## 2 1991-01-03 0.779
Reading dataset 3.1 - Reading the time_series_covid_19_confirmed.csv using read.csv() function keeping stringsAsFactors as FALSE. * As the main motive of this assignment is to focus on the impact of corona on the Australian economy so, we will filter out the data that is related to Australia only. * Using the subset() function and subsetting on the Country.Region column of the data by keeping it equal to Australia. * Using row names(CoronaConfDf) <- NULL to reset the index into order as subsetting will use the index that data have by default. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 266 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy that full fill the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates * So, each observation is for the different states of Australia. * Using the str() function to print the structure of the data.
CoronaConfDf <- read.csv("time_series_covid_19_confirmed.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaConfDf)))
## [1] "Dim before subsetting 266" "Dim before subsetting 121"
CoronaConfDf <- subset(CoronaConfDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaConfDf)))
## [1] "Dim After subsetting 8" "Dim After subsetting 121"
rownames(CoronaConfDf) <- NULL
print(dim(CoronaConfDf))
## [1] 8 121
str(CoronaConfDf)
## 'data.frame': 8 obs. of 121 variables:
## $ Province.State: chr "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
## $ Country.Region: chr "Australia" "Australia" "Australia" "Australia" ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0
## $ X1.23.20 : int 0 0 0 0 0 0 0 0
## $ X1.24.20 : int 0 0 0 0 0 0 0 0
## $ X1.25.20 : int 0 0 0 0 0 0 0 0
## $ X1.26.20 : int 0 3 0 0 0 0 1 0
## $ X1.27.20 : int 0 4 0 0 0 0 1 0
## $ X1.28.20 : int 0 4 0 0 0 0 1 0
## $ X1.29.20 : int 0 4 0 1 0 0 1 0
## $ X1.30.20 : int 0 4 0 3 0 0 2 0
## $ X1.31.20 : int 0 4 0 2 0 0 3 0
## $ X2.1.20 : int 0 4 0 3 1 0 4 0
## $ X2.2.20 : int 0 4 0 2 2 0 4 0
## $ X2.3.20 : int 0 4 0 2 2 0 4 0
## $ X2.4.20 : int 0 4 0 3 2 0 4 0
## $ X2.5.20 : int 0 4 0 3 2 0 4 0
## $ X2.6.20 : int 0 4 0 4 2 0 4 0
## $ X2.7.20 : int 0 4 0 5 2 0 4 0
## $ X2.8.20 : int 0 4 0 5 2 0 4 0
## $ X2.9.20 : int 0 4 0 5 2 0 4 0
## $ X2.10.20 : int 0 4 0 5 2 0 4 0
## $ X2.11.20 : int 0 4 0 5 2 0 4 0
## $ X2.12.20 : int 0 4 0 5 2 0 4 0
## $ X2.13.20 : int 0 4 0 5 2 0 4 0
## $ X2.14.20 : int 0 4 0 5 2 0 4 0
## $ X2.15.20 : int 0 4 0 5 2 0 4 0
## $ X2.16.20 : int 0 4 0 5 2 0 4 0
## $ X2.17.20 : int 0 4 0 5 2 0 4 0
## $ X2.18.20 : int 0 4 0 5 2 0 4 0
## $ X2.19.20 : int 0 4 0 5 2 0 4 0
## $ X2.20.20 : int 0 4 0 5 2 0 4 0
## $ X2.21.20 : int 0 4 0 5 2 0 4 0
## $ X2.22.20 : int 0 4 0 5 2 0 4 0
## $ X2.23.20 : int 0 4 0 5 2 0 4 0
## $ X2.24.20 : int 0 4 0 5 2 0 4 0
## $ X2.25.20 : int 0 4 0 5 2 0 4 0
## $ X2.26.20 : int 0 4 0 5 2 0 4 0
## $ X2.27.20 : int 0 4 0 5 2 0 4 0
## $ X2.28.20 : int 0 4 0 5 2 0 4 0
## $ X2.29.20 : int 0 4 0 9 3 0 7 2
## $ X3.1.20 : int 0 6 0 9 3 0 7 2
## $ X3.2.20 : int 0 6 0 9 3 1 9 2
## $ X3.3.20 : int 0 13 0 11 3 1 9 2
## $ X3.4.20 : int 0 22 1 11 5 1 10 2
## $ X3.5.20 : int 0 22 1 13 5 1 10 3
## $ X3.6.20 : int 0 26 0 13 7 1 10 3
## $ X3.7.20 : int 0 28 0 13 7 1 11 3
## $ X3.8.20 : int 0 38 0 15 7 2 11 3
## $ X3.9.20 : int 0 48 0 15 7 2 15 4
## $ X3.10.20 : int 0 55 1 18 7 2 18 6
## $ X3.11.20 : int 0 65 1 20 9 3 21 9
## $ X3.12.20 : int 0 65 1 20 9 3 21 9
## $ X3.13.20 : int 1 92 1 35 16 5 36 14
## $ X3.14.20 : int 1 112 1 46 19 5 49 17
## $ X3.15.20 : int 1 134 1 61 20 6 57 17
## $ X3.16.20 : int 2 171 1 68 29 7 71 28
## $ X3.17.20 : int 2 210 1 78 29 7 94 31
## $ X3.18.20 : int 3 267 1 94 37 10 121 35
## $ X3.19.20 : int 4 307 1 144 42 10 121 52
## $ X3.20.20 : int 6 353 3 184 50 10 121 64
## $ X3.21.20 : int 9 436 3 221 67 16 229 90
## $ X3.22.20 : int 19 669 5 259 100 22 355 120
## $ X3.23.20 : int 32 669 5 319 134 28 355 140
## $ X3.24.20 : int 39 818 6 397 170 28 411 175
## $ X3.25.20 : int 39 1029 6 443 170 36 466 175
## $ X3.26.20 : int 53 1219 12 493 235 47 520 231
## $ X3.27.20 : int 62 1405 12 555 257 47 574 231
## $ X3.28.20 : int 71 1617 15 625 287 62 685 278
## $ X3.29.20 : int 77 1791 15 656 299 66 769 311
## $ X3.30.20 : int 78 2032 15 689 305 66 821 355
## $ X3.31.20 : int 80 2032 17 743 337 69 917 364
## $ X4.1.20 : int 84 2182 19 781 367 69 968 392
## $ X4.2.20 : int 87 2298 21 835 367 72 1036 400
## $ X4.3.20 : int 91 2389 22 873 396 74 1085 400
## $ X4.4.20 : int 93 2493 26 900 407 80 1115 436
## $ X4.5.20 : int 96 2580 27 907 407 82 1135 453
## $ X4.6.20 : int 96 2637 28 921 411 86 1158 460
## $ X4.7.20 : int 96 2686 28 934 411 89 1191 460
## $ X4.8.20 : int 99 2734 28 943 415 98 1212 481
## $ X4.9.20 : int 100 2773 28 953 420 111 1228 495
## $ X4.10.20 : int 103 2822 28 965 428 122 1241 506
## $ X4.11.20 : int 103 2857 28 974 429 133 1265 514
## $ X4.12.20 : int 103 2857 28 983 429 133 1268 514
## $ X4.13.20 : int 102 2863 28 987 429 144 1281 517
## $ X4.14.20 : int 103 2870 28 998 433 165 1291 527
## $ X4.15.20 : int 103 2886 28 999 433 165 1299 527
## $ X4.16.20 : int 103 2897 28 1001 433 169 1299 532
## $ X4.17.20 : int 103 2926 28 1007 435 180 1302 541
## $ X4.18.20 : int 103 2936 28 1015 435 188 1319 544
## $ X4.19.20 : int 103 2957 28 1019 435 195 1328 545
## $ X4.20.20 : int 104 2963 28 1019 435 200 1329 545
## $ X4.21.20 : int 104 2969 28 1024 437 201 1336 546
## $ X4.22.20 : int 104 2971 28 1024 438 205 1336 546
## $ X4.23.20 : int 104 2976 28 1026 438 207 1337 546
## $ X4.24.20 : int 105 2982 28 1026 438 207 1343 548
## $ X4.25.20 : int 106 2994 28 1026 438 207 1346 549
## [list output truncated]
print(head(CoronaConfDf, 2))
## Province.State Country.Region Lat Long X1.22.20
## 1 Australian Capital Territory Australia -35.4735 149.0124 0
## 2 New South Wales Australia -33.8688 151.2093 0
## X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 3 4 4 4 4
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1 0 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4 4
## X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 6 6 13
## X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1 0 0 0 0 0 0 0 0
## 2 22 22 26 28 38 48 55 65
## X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 0 1 1 1 2 2 3 4
## 2 65 92 112 134 171 210 267 307
## X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1 6 9 19 32 39 39 53 62
## 2 353 436 669 669 818 1029 1219 1405
## X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1 71 77 78 80 84 87 91 93
## 2 1617 1791 2032 2032 2182 2298 2389 2493
## X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1 96 96 96 99 100 103 103 103
## 2 2580 2637 2686 2734 2773 2822 2857 2857
## X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1 102 103 103 103 103 103 103 104
## 2 2863 2870 2886 2897 2926 2936 2957 2963
## X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1 104 104 104 105 106 106 106 106
## 2 2969 2971 2976 2982 2994 3002 3004 3016
## X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1 106 106 106 106 106 107 107 107
## 2 3016 3025 3030 3035 3033 3035 3042 3044
## X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1 107 107 107 107 107 107 107 107
## 2 3047 3051 3053 3053 3053 3059 3063 3071
## X5.15.20 X5.16.20 X5.17.20
## 1 107 107 107
## 2 3074 3075 3076
Reading dataset 3.2 - Reading the time_series_covid_19_deaths.csv using read.csv() function keeping stringsAsFactors as FALSE. * Similar to the section 3.1 subsetting and resetting of the index was done. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 266 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy that full fill the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates * So, each observation is for the different states of Australia. * Using the str() function to print the structure of the data.
CoronaDeathDf <- read.csv("time_series_covid_19_deaths.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaDeathDf)))
## [1] "Dim before subsetting 266" "Dim before subsetting 121"
CoronaDeathDf <- subset(CoronaDeathDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaDeathDf)))
## [1] "Dim After subsetting 8" "Dim After subsetting 121"
rownames(CoronaDeathDf) <- NULL
print(dim(CoronaDeathDf))
## [1] 8 121
str(CoronaDeathDf)
## 'data.frame': 8 obs. of 121 variables:
## $ Province.State: chr "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
## $ Country.Region: chr "Australia" "Australia" "Australia" "Australia" ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0
## $ X1.23.20 : int 0 0 0 0 0 0 0 0
## $ X1.24.20 : int 0 0 0 0 0 0 0 0
## $ X1.25.20 : int 0 0 0 0 0 0 0 0
## $ X1.26.20 : int 0 0 0 0 0 0 0 0
## $ X1.27.20 : int 0 0 0 0 0 0 0 0
## $ X1.28.20 : int 0 0 0 0 0 0 0 0
## $ X1.29.20 : int 0 0 0 0 0 0 0 0
## $ X1.30.20 : int 0 0 0 0 0 0 0 0
## $ X1.31.20 : int 0 0 0 0 0 0 0 0
## $ X2.1.20 : int 0 0 0 0 0 0 0 0
## $ X2.2.20 : int 0 0 0 0 0 0 0 0
## $ X2.3.20 : int 0 0 0 0 0 0 0 0
## $ X2.4.20 : int 0 0 0 0 0 0 0 0
## $ X2.5.20 : int 0 0 0 0 0 0 0 0
## $ X2.6.20 : int 0 0 0 0 0 0 0 0
## $ X2.7.20 : int 0 0 0 0 0 0 0 0
## $ X2.8.20 : int 0 0 0 0 0 0 0 0
## $ X2.9.20 : int 0 0 0 0 0 0 0 0
## $ X2.10.20 : int 0 0 0 0 0 0 0 0
## $ X2.11.20 : int 0 0 0 0 0 0 0 0
## $ X2.12.20 : int 0 0 0 0 0 0 0 0
## $ X2.13.20 : int 0 0 0 0 0 0 0 0
## $ X2.14.20 : int 0 0 0 0 0 0 0 0
## $ X2.15.20 : int 0 0 0 0 0 0 0 0
## $ X2.16.20 : int 0 0 0 0 0 0 0 0
## $ X2.17.20 : int 0 0 0 0 0 0 0 0
## $ X2.18.20 : int 0 0 0 0 0 0 0 0
## $ X2.19.20 : int 0 0 0 0 0 0 0 0
## $ X2.20.20 : int 0 0 0 0 0 0 0 0
## $ X2.21.20 : int 0 0 0 0 0 0 0 0
## $ X2.22.20 : int 0 0 0 0 0 0 0 0
## $ X2.23.20 : int 0 0 0 0 0 0 0 0
## $ X2.24.20 : int 0 0 0 0 0 0 0 0
## $ X2.25.20 : int 0 0 0 0 0 0 0 0
## $ X2.26.20 : int 0 0 0 0 0 0 0 0
## $ X2.27.20 : int 0 0 0 0 0 0 0 0
## $ X2.28.20 : int 0 0 0 0 0 0 0 0
## $ X2.29.20 : int 0 0 0 0 0 0 0 0
## $ X3.1.20 : int 0 0 0 0 0 0 0 1
## $ X3.2.20 : int 0 0 0 0 0 0 0 1
## $ X3.3.20 : int 0 0 0 0 0 0 0 1
## $ X3.4.20 : int 0 1 0 0 0 0 0 1
## $ X3.5.20 : int 0 1 0 0 0 0 0 1
## $ X3.6.20 : int 0 1 0 0 0 0 0 1
## $ X3.7.20 : int 0 1 0 0 0 0 0 1
## $ X3.8.20 : int 0 2 0 0 0 0 0 1
## $ X3.9.20 : int 0 2 0 0 0 0 0 1
## $ X3.10.20 : int 0 2 0 0 0 0 0 1
## $ X3.11.20 : int 0 2 0 0 0 0 0 1
## $ X3.12.20 : int 0 2 0 0 0 0 0 1
## $ X3.13.20 : int 0 2 0 0 0 0 0 1
## $ X3.14.20 : int 0 2 0 0 0 0 0 1
## $ X3.15.20 : int 0 2 0 0 0 0 0 1
## $ X3.16.20 : int 0 2 0 0 0 0 0 1
## $ X3.17.20 : int 0 4 0 0 0 0 0 1
## $ X3.18.20 : int 0 5 0 0 0 0 0 1
## $ X3.19.20 : int 0 5 0 0 0 0 0 1
## $ X3.20.20 : int 0 6 0 0 0 0 0 1
## $ X3.21.20 : int 0 6 0 0 0 0 0 1
## $ X3.22.20 : int 0 6 0 0 0 0 0 1
## $ X3.23.20 : int 0 6 0 0 0 0 0 1
## $ X3.24.20 : int 0 7 0 0 0 0 0 1
## $ X3.25.20 : int 0 7 0 0 0 0 0 1
## $ X3.26.20 : int 0 7 0 1 0 0 3 2
## $ X3.27.20 : int 0 7 0 1 0 0 3 2
## $ X3.28.20 : int 0 8 0 1 0 0 3 2
## $ X3.29.20 : int 0 8 0 2 0 0 4 2
## $ X3.30.20 : int 1 8 0 2 0 0 4 2
## $ X3.31.20 : int 1 8 0 2 0 1 4 2
## $ X4.1.20 : int 1 9 0 2 0 2 4 2
## $ X4.2.20 : int 1 10 0 4 0 2 5 2
## $ X4.3.20 : int 1 12 0 4 0 2 7 2
## $ X4.4.20 : int 2 12 0 4 0 2 8 2
## $ X4.5.20 : int 2 16 0 4 0 2 8 3
## $ X4.6.20 : int 2 18 0 4 0 2 10 4
## $ X4.7.20 : int 2 21 0 4 1 2 11 4
## $ X4.8.20 : int 2 21 0 4 2 3 12 6
## $ X4.9.20 : int 2 21 0 4 3 3 12 6
## $ X4.10.20 : int 2 22 0 4 3 4 13 6
## $ X4.11.20 : int 2 23 0 5 3 4 14 6
## $ X4.12.20 : int 2 24 0 5 4 5 14 6
## $ X4.13.20 : int 2 25 0 5 4 5 14 6
## $ X4.14.20 : int 2 25 0 5 4 6 14 6
## $ X4.15.20 : int 3 25 0 5 4 6 14 6
## $ X4.16.20 : int 3 25 0 5 4 6 14 6
## $ X4.17.20 : int 3 26 0 5 4 7 14 7
## $ X4.18.20 : int 3 26 0 6 4 7 14 7
## $ X4.19.20 : int 3 26 0 6 4 7 14 7
## $ X4.20.20 : int 3 26 0 6 4 7 14 7
## $ X4.21.20 : int 3 26 0 6 4 7 14 7
## $ X4.22.20 : int 3 26 0 6 4 7 14 7
## $ X4.23.20 : int 3 31 0 6 4 8 16 7
## $ X4.24.20 : int 3 33 0 6 4 9 16 8
## $ X4.25.20 : int 3 33 0 6 4 10 16 8
## [list output truncated]
print(head(CoronaDeathDf, 2))
## Province.State Country.Region Lat Long X1.22.20
## 1 Australian Capital Territory Australia -35.4735 149.0124 0
## 2 New South Wales Australia -33.8688 151.2093 0
## X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 2 2 2 2
## X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 0 0 0 0 0 0 0 0
## 2 2 2 2 2 2 4 5 5
## X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1 0 0 0 0 0 0 0 0
## 2 6 6 6 6 7 7 7 7
## X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1 0 0 1 1 1 1 1 2
## 2 8 8 8 8 9 10 12 12
## X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1 2 2 2 2 2 2 2 2
## 2 16 18 21 21 21 22 23 24
## X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1 2 2 3 3 3 3 3 3
## 2 25 25 25 25 26 26 26 26
## X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1 3 3 3 3 3 3 3 3
## 2 26 26 31 33 33 34 34 39
## X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1 3 3 3 3 3 3 3 3
## 2 40 41 41 42 42 43 44 44
## X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1 3 3 3 3 3 3 3 3
## 2 44 44 44 44 44 45 45 45
## X5.15.20 X5.16.20 X5.17.20
## 1 3 3 3
## 2 45 45 46
Reading dataset 3.3 - Reading the time_series_covid_19_recovered.csv using read.csv() function keeping stringsAsFactors as FALSE. * Similar to the section 3.1 subsetting and resetting of the index was done. * Using the dim() function to see the number of variables and observations. * Initially before subsetting the data have 253 observations 121 variables and after subsetting, we are left with 8 observations 121 variables. Using the head() function shows us that the observations are rather untidy * this meets the requirement 5 (At least one of the data sets that you use should be Untidy), as the data have variables as: [1] Province.State [2] Country.Region [3] Lat [4] Long [5] Other variables shows the dates So, each observation is for different state of Australia. * Using the str() function to print the structure of the data.
CoronaRecDf <- read.csv("time_series_covid_19_recovered.csv", stringsAsFactors = FALSE)
print(paste("Dim before subsetting ", dim(CoronaRecDf)))
## [1] "Dim before subsetting 253" "Dim before subsetting 121"
CoronaRecDf <- subset(CoronaRecDf, Country.Region == 'Australia')
print(paste("Dim After subsetting ", dim(CoronaRecDf)))
## [1] "Dim After subsetting 8" "Dim After subsetting 121"
rownames(CoronaRecDf) <- NULL
print(dim(CoronaRecDf))
## [1] 8 121
str(CoronaRecDf)
## 'data.frame': 8 obs. of 121 variables:
## $ Province.State: chr "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
## $ Country.Region: chr "Australia" "Australia" "Australia" "Australia" ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ X1.22.20 : int 0 0 0 0 0 0 0 0
## $ X1.23.20 : int 0 0 0 0 0 0 0 0
## $ X1.24.20 : int 0 0 0 0 0 0 0 0
## $ X1.25.20 : int 0 0 0 0 0 0 0 0
## $ X1.26.20 : int 0 0 0 0 0 0 0 0
## $ X1.27.20 : int 0 0 0 0 0 0 0 0
## $ X1.28.20 : int 0 0 0 0 0 0 0 0
## $ X1.29.20 : int 0 0 0 0 0 0 0 0
## $ X1.30.20 : int 0 2 0 0 0 0 0 0
## $ X1.31.20 : int 0 2 0 0 0 0 0 0
## $ X2.1.20 : int 0 2 0 0 0 0 0 0
## $ X2.2.20 : int 0 2 0 0 0 0 0 0
## $ X2.3.20 : int 0 2 0 0 0 0 0 0
## $ X2.4.20 : int 0 2 0 0 0 0 0 0
## $ X2.5.20 : int 0 2 0 0 0 0 0 0
## $ X2.6.20 : int 0 2 0 0 0 0 0 0
## $ X2.7.20 : int 0 2 0 0 0 0 0 0
## $ X2.8.20 : int 0 2 0 0 0 0 0 0
## $ X2.9.20 : int 0 2 0 0 0 0 0 0
## $ X2.10.20 : int 0 2 0 0 0 0 0 0
## $ X2.11.20 : int 0 2 0 0 0 0 0 0
## $ X2.12.20 : int 0 2 0 0 0 0 0 0
## $ X2.13.20 : int 0 4 0 0 0 0 4 0
## $ X2.14.20 : int 0 4 0 0 0 0 4 0
## $ X2.15.20 : int 0 4 0 0 0 0 4 0
## $ X2.16.20 : int 0 4 0 0 0 0 4 0
## $ X2.17.20 : int 0 4 0 0 2 0 4 0
## $ X2.18.20 : int 0 4 0 0 2 0 4 0
## $ X2.19.20 : int 0 4 0 0 2 0 4 0
## $ X2.20.20 : int 0 4 0 0 2 0 4 0
## $ X2.21.20 : int 0 4 0 1 2 0 4 0
## $ X2.22.20 : int 0 4 0 1 2 0 4 0
## $ X2.23.20 : int 0 4 0 1 2 0 4 0
## $ X2.24.20 : int 0 4 0 1 2 0 4 0
## $ X2.25.20 : int 0 4 0 1 2 0 4 0
## $ X2.26.20 : int 0 4 0 1 2 0 4 0
## $ X2.27.20 : int 0 4 0 1 2 0 4 0
## $ X2.28.20 : int 0 4 0 1 2 0 4 0
## $ X2.29.20 : int 0 4 0 1 2 0 4 0
## $ X3.1.20 : int 0 4 0 1 2 0 4 0
## $ X3.2.20 : int 0 4 0 1 2 0 4 0
## $ X3.3.20 : int 0 4 0 1 2 0 4 0
## $ X3.4.20 : int 0 4 0 1 2 0 4 0
## $ X3.5.20 : int 0 4 0 8 2 0 7 0
## $ X3.6.20 : int 0 4 0 8 2 0 7 0
## $ X3.7.20 : int 0 4 0 8 2 0 7 0
## $ X3.8.20 : int 0 4 0 8 2 0 7 0
## $ X3.9.20 : int 0 4 0 8 2 0 7 0
## $ X3.10.20 : int 0 4 0 8 2 0 7 0
## $ X3.11.20 : int 0 4 0 8 2 0 7 0
## $ X3.12.20 : int 0 4 0 8 2 0 7 0
## $ X3.13.20 : int 0 4 0 8 3 0 8 0
## $ X3.14.20 : int 0 4 0 8 3 0 8 0
## $ X3.15.20 : int 0 4 0 8 3 0 8 0
## $ X3.16.20 : int 0 4 0 8 3 0 8 0
## $ X3.17.20 : int 0 4 0 8 3 0 8 0
## $ X3.18.20 : int 0 4 0 8 3 0 8 0
## $ X3.19.20 : int 0 4 0 8 3 3 8 0
## $ X3.20.20 : int 0 4 0 8 3 3 8 0
## $ X3.21.20 : int 0 4 0 8 3 3 8 0
## $ X3.22.20 : int 0 4 0 8 3 3 70 0
## $ X3.23.20 : int 0 4 0 8 3 3 70 0
## $ X3.24.20 : int 0 4 0 8 3 3 97 0
## $ X3.25.20 : int 0 4 0 8 6 3 97 1
## $ X3.26.20 : int 1 4 0 8 6 3 149 1
## $ X3.27.20 : int 1 4 0 8 6 3 171 1
## $ X3.28.20 : int 2 4 0 8 6 5 191 28
## $ X3.29.20 : int 2 4 0 8 6 5 191 28
## $ X3.30.20 : int 2 4 0 8 6 5 191 41
## $ X3.31.20 : int 3 4 0 8 6 5 291 41
## $ X4.1.20 : int 8 4 0 8 6 5 343 48
## $ X4.2.20 : int 11 4 0 8 6 5 422 64
## $ X4.3.20 : int 18 4 0 8 46 5 476 92
## $ X4.4.20 : int 18 4 1 8 46 5 527 92
## $ X4.5.20 : int 28 4 1 8 46 5 573 92
## $ X4.6.20 : int 40 4 2 173 53 26 620 162
## $ X4.7.20 : int 40 4 2 173 53 26 620 162
## $ X4.8.20 : int 40 4 2 173 53 26 620 162
## $ X4.9.20 : int 47 4 2 345 120 48 736 170
## $ X4.10.20 : int 59 4 2 372 179 48 926 203
## $ X4.11.20 : int 59 4 2 372 179 48 926 216
## $ X4.12.20 : int 59 4 2 372 179 48 926 216
## $ X4.13.20 : int 59 4 2 372 179 48 926 216
## $ X4.14.20 : int 72 4 6 442 240 53 1118 251
## $ X4.15.20 : int 72 4 6 442 240 53 1118 251
## $ X4.16.20 : int 82 4 6 442 279 67 1137 338
## $ X4.17.20 : int 81 1379 9 442 331 67 1159 340
## $ X4.18.20 : int 88 1379 9 738 331 67 1172 340
## $ X4.19.20 : int 88 1379 9 738 331 67 1172 340
## $ X4.20.20 : int 91 1379 12 738 369 71 1196 435
## $ X4.21.20 : int 92 1755 15 738 376 74 1202 443
## $ X4.22.20 : int 93 1926 16 738 388 77 1243 451
## $ X4.23.20 : int 93 2006 16 738 394 91 1251 458
## $ X4.24.20 : int 97 2056 20 738 401 106 1254 464
## $ X4.25.20 : int 98 2193 23 803 402 117 1262 478
## [list output truncated]
print(head(CoronaRecDf, 2))
## Province.State Country.Region Lat Long X1.22.20
## 1 Australian Capital Territory Australia -35.4735 149.0124 0
## 2 New South Wales Australia -33.8688 151.2093 0
## X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 2
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20
## 1 0 0 0 0 0 0 0 0 0
## 2 2 2 2 2 2 2 2 2 2
## X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20
## 1 0 0 0 0 0 0 0 0
## 2 2 2 2 2 4 4 4 4
## X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 0 0 0 0 0 0 0 0
## 2 4 4 4 4 4 4 4 4
## X3.20.20 X3.21.20 X3.22.20 X3.23.20 X3.24.20 X3.25.20 X3.26.20 X3.27.20
## 1 0 0 0 0 0 0 1 1
## 2 4 4 4 4 4 4 4 4
## X3.28.20 X3.29.20 X3.30.20 X3.31.20 X4.1.20 X4.2.20 X4.3.20 X4.4.20
## 1 2 2 2 3 8 11 18 18
## 2 4 4 4 4 4 4 4 4
## X4.5.20 X4.6.20 X4.7.20 X4.8.20 X4.9.20 X4.10.20 X4.11.20 X4.12.20
## 1 28 40 40 40 47 59 59 59
## 2 4 4 4 4 4 4 4 4
## X4.13.20 X4.14.20 X4.15.20 X4.16.20 X4.17.20 X4.18.20 X4.19.20 X4.20.20
## 1 59 72 72 82 81 88 88 91
## 2 4 4 4 4 1379 1379 1379 1379
## X4.21.20 X4.22.20 X4.23.20 X4.24.20 X4.25.20 X4.26.20 X4.27.20 X4.28.20
## 1 92 93 93 97 98 99 100 101
## 2 1755 1926 2006 2056 2193 2208 2227 2266
## X4.29.20 X4.30.20 X5.1.20 X5.2.20 X5.3.20 X5.4.20 X5.5.20 X5.6.20
## 1 101 103 103 103 103 103 103 103
## 2 2284 2293 2293 2306 2328 2349 2418 2463
## X5.7.20 X5.8.20 X5.9.20 X5.10.20 X5.11.20 X5.12.20 X5.13.20 X5.14.20
## 1 103 103 104 104 104 104 104 104
## 2 2486 2504 2504 2517 2543 2571 2595 2605
## X5.15.20 X5.16.20 X5.17.20
## 1 104 104 104
## 2 2611 2611 2612
Reading dataset 4 - This data set is read differently than the others because it is not a single file but rather 100 CSV files. * So, a for loop is designed to read all those CSV files inside a directory. * list.files() function from base library is used to count and list all files inside the australian-historical-stock-prices directory. * A for loop is used to iterate on the list of files and each file is read using read.csv() function keeping stringsAsFactors as FALSE. * Used cbind() function to add a new column that has the name of the stock. * rbind() function is used to connect each new stock row-wise, hence Reading a stock data frame with all the stocks into a single data frame. * First and last variable names are changed to date and stockName for easy processing and more understandable name of the variable. * Using the dim() function to see the number of variables and observations. * The following data have 8 variables and 432888 observations. * Used head() function to print top 2 values. * Using the str() function to print the structure of the data.
stock <- data.frame()
for( i in c(1:length(list.files("./australian-historical-stock-prices")))){
#print(list.files("./australian-historical-stock-prices")[i])
x <- read.csv(paste("./australian-historical-stock-prices/",list.files("./australian-historical-stock-prices")[i], sep = ""), stringsAsFactors = FALSE)
x <- cbind(x, gsub(".csv", "", list.files("./australian-historical-stock-prices")[i]))
stock <- rbind(stock,x)
}
names(stock)[c(1,ncol(stock))] <- c("date","stockName")
print(dim(stock))
## [1] 432888 8
str(stock)
## 'data.frame': 432888 obs. of 8 variables:
## $ date : chr "2015-03-31" "2015-04-01" "2015-04-02" "2015-04-07" ...
## $ Open : num 0.555 0.575 0.56 0.545 0.545 0.54 0.53 0.535 0.54 0.54 ...
## $ High : num 0.595 0.58 0.565 0.55 0.545 0.54 0.535 0.54 0.54 0.55 ...
## $ Low : num 0.53 0.555 0.535 0.54 0.53 0.532 0.53 0.53 0.535 0.54 ...
## $ Close : num 0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
## $ Adj.Close: num 0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
## $ Volume : num 4816294 4376660 2779640 392179 668446 ...
## $ stockName: Factor w/ 100 levels "A2M","AGL","ALL",..: 1 1 1 1 1 1 1 1 1 1 ...
print(head(stock, 2))
## date Open High Low Close Adj.Close Volume stockName
## 1 2015-03-31 0.555 0.595 0.530 0.565 0.565 4816294 A2M
## 2 2015-04-01 0.575 0.580 0.555 0.565 0.565 4376660 A2M
This meets the requirements 3 of the data i.e, Data set should include variables suitable for data type conversions so that you should be able to apply the required data type conversions (e.g., character -> factor, character -> date, numeric -> factor, etc. conversions).
Type conversion for dataset 1 - As initially the data was read keeping stringsAsFactors as FALSE so if the variable has any missing value so that will make the variable as chr. * So, we need to perform type conversion on each variable: * date column was converted to date time using as.Date() function. * price column was converted to numeric using as.numeric() function. * suburb column was converted to factor using as.factor() function. * city_name column was converted to factor using as.factor() function. * state column was converted to factor using as.factor() function. * lat column was converted to numeric using as.numeric() function. * lon column was converted to numeric using as.numeric() function. * bedrooms column was converted to ordered factor using as.factor() function with level as 0 < 1 < 2 < 3 < 4 < 5, this meets the requirement 4 (Data set should include at least one factor variable that needs to be labelled and/or ordered) * property_type column was converted to ordered factor using as.factor() function with level as unit < house < townhouse , this meets the requirement 4 (Data set should include at least one factor variable that needs to be labelled and/or ordered) * loc_pid column was converted to factor using as.factor() function. * lga_pid column was converted to factor using as.factor() function.
names(propDf)[1] <- "date"
propDf$date <- as.Date(propDf$date)
propDf$price <- as.numeric(propDf$price)
propDf$suburb <- as.factor(propDf$suburb)
propDf$city_name <- as.factor(propDf$city_name)
propDf$state <- as.factor(propDf$state)
propDf$lat <- as.numeric(propDf$lat)
propDf$lon <- as.numeric(propDf$lon)
propDf$bedrooms <- factor(propDf$bedrooms, levels = c(0,1,2,3,4,5), ordered = TRUE)
propDf$property_type <- factor(propDf$property_type, levels = c("unit","house","townhouse"), ordered = TRUE)
propDf$loc_pid <- as.factor(propDf$loc_pid)
propDf$lga_pid <- as.factor(propDf$lga_pid)
str(propDf)
## 'data.frame': 290046 obs. of 11 variables:
## $ date : Date, format: "2018-09-18" "2018-09-24" ...
## $ price : num NA NA 1730000 1928000 1475000 ...
## $ suburb : Factor w/ 2017 levels "Abbotsbury","Abbotsford",..: 489 489 489 489 489 489 489 489 489 489 ...
## $ city_name : Factor w/ 6 levels "Adelaide","Brisbane",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ state : Factor w/ 6 levels "ACT","NSW","QLD",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ lat : num -33.9 -33.9 -33.9 -33.9 -33.9 ...
## $ lon : num 151 151 151 151 151 ...
## $ bedrooms : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 4 4 3 4 4 3 4 3 4 4 ...
## $ property_type: Ord.factor w/ 3 levels "unit"<"house"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ loc_pid : Factor w/ 2109 levels "ACT101","ACT102",..: 132 132 132 132 132 132 132 132 132 132 ...
## $ lga_pid : Factor w/ 115 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...
summary(propDf)
## date price suburb
## Min. :2018-09-02 Min. : 1000 Pakenham : 1654
## 1st Qu.:2019-02-19 1st Qu.: 485000 Craigieburn: 1372
## Median :2019-07-19 Median : 650000 Point Cook : 1322
## Mean :2019-07-09 Mean : 805693 Melbourne : 1314
## 3rd Qu.:2019-11-27 3rd Qu.: 912000 Frankston : 1212
## Max. :2020-05-13 Max. :100000000 Brighton : 1102
## NA's :57867 (Other) :282070
## city_name state lat lon
## Adelaide : 30019 ACT: 10094 Min. :-38.48 Min. :115.6
## Brisbane : 29509 NSW: 75762 1st Qu.:-37.79 1st Qu.:144.7
## Canberra : 10094 QLD: 29509 Median :-34.76 Median :145.2
## Melbourne:108444 SA : 30019 Mean :-34.63 Mean :143.3
## Perth : 36218 VIC:108444 3rd Qu.:-33.72 3rd Qu.:151.1
## Sydney : 75762 WA : 36218 Max. : 31.92 Max. :153.2
## NA's :20 NA's :20
## bedrooms property_type loc_pid lga_pid
## 0: 193 unit : 68993 VIC2023: 1654 ACT : 10094
## 1: 15192 house :198921 VIC660 : 1372 BRIS2 : 7913
## 2: 59883 townhouse: 22132 VIC2082: 1322 BRIS3 : 7204
## 3:112513 VIC1634: 1314 VIC127 : 7064
## 4: 82437 VIC939 : 1212 NSW329 : 6004
## 5: 19828 VIC2453: 1099 VIC125 : 5883
## (Other):282073 (Other):245884
Type conversion for dataset 2 - As initially the data was read keeping stringsAsFactors as FALSE so if the variable has any missing value so that will make the variable as chr. * So, we need to perform type conversion on each variable: * date column was converted to date time using as.Date() function. * value column was converted to numeric using as.numeric() function.
AudDf$date <- as.Date(AudDf$date)
AudDf$value <- as.numeric(AudDf$value)
str(AudDf)
## 'data.frame': 7597 obs. of 2 variables:
## $ date : Date, format: "1991-01-02" "1991-01-03" ...
## $ value: num 0.775 0.779 0.778 0.778 0.777 0.774 0.775 0.775 0.778 0.783 ...
summary(AudDf)
## date value
## Min. :1991-01-02 Min. :0.4787
## 1st Qu.:1998-07-08 1st Qu.:0.6821
## Median :2006-01-19 Median :0.7484
## Mean :2006-01-02 Mean :0.7584
## 3rd Qu.:2013-08-02 3rd Qu.:0.7960
## Max. :2020-05-16 Max. :1.1020
gather() function on data set.gsub() to substitute the symbol and removing it with a blank.Country.Region and Province.State to factor type.CoronaDataFrameCleaner was created that takes input as data frame object (CoronaDataFrameObject) and Name of the newly created column having cases number (CaseColumnName).CoronaDataFrameCleaner <- function(CoronaDataFrameObject, CaseColumnName){
CoronaDataFrameObject <- CoronaDataFrameObject %>% gather(5:dim(CoronaDataFrameObject)[2],
key = "date", value = "value" )
names(CoronaDataFrameObject)[ncol(CoronaDataFrameObject)] <- CaseColumnName
CoronaDataFrameObject$date <- gsub("X", "", CoronaDataFrameObject$date)
CoronaDataFrameObject$date <- gsub(".20$", ".2020", CoronaDataFrameObject$date)
CoronaDataFrameObject$date <- as.Date(CoronaDataFrameObject$date, "%m.%d.%Y")
CoronaDataFrameObject$Country.Region <- as.factor(CoronaDataFrameObject$Country.Region)
CoronaDataFrameObject$Province.State <- as.factor(CoronaDataFrameObject$Province.State)
return(CoronaDataFrameObject)
}
CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaConfDf) and column name as ConfirmedCases.str() function and summary statistics using summary() function.CoronaConfDf <- CoronaDataFrameCleaner(CoronaConfDf, 'ConfirmedCases')
str(CoronaConfDf)
## 'data.frame': 936 obs. of 6 variables:
## $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ date : Date, format: "2020-01-22" "2020-01-22" ...
## $ ConfirmedCases: int 0 0 0 0 0 0 0 0 0 0 ...
summary(CoronaConfDf)
## Province.State Country.Region Lat
## Australian Capital Territory:117 Australia:936 Min. :-41.45
## New South Wales :117 1st Qu.:-36.06
## Northern Territory :117 Median :-34.40
## Queensland :117 Mean :-32.00
## South Australia :117 3rd Qu.:-30.97
## Tasmania :117 Max. :-12.46
## (Other) :234
## Long date ConfirmedCases
## Min. :115.9 Min. :2020-01-22 Min. : 0.0
## 1st Qu.:136.7 1st Qu.:2020-02-20 1st Qu.: 2.0
## Median :145.5 Median :2020-03-20 Median : 28.0
## Mean :141.2 Mean :2020-03-20 Mean : 365.6
## 3rd Qu.:149.6 3rd Qu.:2020-04-18 3rd Qu.: 437.2
## Max. :153.4 Max. :2020-05-17 Max. :3076.0
##
head(CoronaConfDf, 2)
## Province.State Country.Region Lat Long date
## 1 Australian Capital Territory Australia -35.4735 149.0124 2020-01-22
## 2 New South Wales Australia -33.8688 151.2093 2020-01-22
## ConfirmedCases
## 1 0
## 2 0
CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaDeathDf) and column name as DeathCasesstr() function and summary statistics using summary() function.CoronaDeathDf <- CoronaDataFrameCleaner(CoronaDeathDf, 'DeathCases')
str(CoronaDeathDf)
## 'data.frame': 936 obs. of 6 variables:
## $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ date : Date, format: "2020-01-22" "2020-01-22" ...
## $ DeathCases : int 0 0 0 0 0 0 0 0 0 0 ...
summary(CoronaDeathDf)
## Province.State Country.Region Lat
## Australian Capital Territory:117 Australia:936 Min. :-41.45
## New South Wales :117 1st Qu.:-36.06
## Northern Territory :117 Median :-34.40
## Queensland :117 Mean :-32.00
## South Australia :117 3rd Qu.:-30.97
## Tasmania :117 Max. :-12.46
## (Other) :234
## Long date DeathCases
## Min. :115.9 Min. :2020-01-22 Min. : 0.000
## 1st Qu.:136.7 1st Qu.:2020-02-20 1st Qu.: 0.000
## Median :145.5 Median :2020-03-20 Median : 0.000
## Mean :141.2 Mean :2020-03-20 Mean : 3.903
## 3rd Qu.:149.6 3rd Qu.:2020-04-18 3rd Qu.: 4.000
## Max. :153.4 Max. :2020-05-17 Max. :46.000
##
head(CoronaDeathDf, 2)
## Province.State Country.Region Lat Long date
## 1 Australian Capital Territory Australia -35.4735 149.0124 2020-01-22
## 2 New South Wales Australia -33.8688 151.2093 2020-01-22
## DeathCases
## 1 0
## 2 0
CoronaDataFrameCleaner function and passing the corona confirmed cases (CoronaRecDf) and column name as RecoveredCasesstr() function and summary statistics using summary() function.CoronaRecDf <- CoronaDataFrameCleaner(CoronaRecDf, 'RecoveredCases')
str(CoronaRecDf)
## 'data.frame': 936 obs. of 6 variables:
## $ Province.State: Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ Country.Region: Factor w/ 1 level "Australia": 1 1 1 1 1 1 1 1 1 1 ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ date : Date, format: "2020-01-22" "2020-01-22" ...
## $ RecoveredCases: int 0 0 0 0 0 0 0 0 0 0 ...
summary(CoronaRecDf)
## Province.State Country.Region Lat
## Australian Capital Territory:117 Australia:936 Min. :-41.45
## New South Wales :117 1st Qu.:-36.06
## Northern Territory :117 Median :-34.40
## Queensland :117 Mean :-32.00
## South Australia :117 3rd Qu.:-30.97
## Tasmania :117 Max. :-12.46
## (Other) :234
## Long date RecoveredCases
## Min. :115.9 Min. :2020-01-22 Min. : 0.0
## 1st Qu.:136.7 1st Qu.:2020-02-20 1st Qu.: 0.0
## Median :145.5 Median :2020-03-20 Median : 4.0
## Mean :141.2 Mean :2020-03-20 Mean : 211.6
## 3rd Qu.:149.6 3rd Qu.:2020-04-18 3rd Qu.: 104.0
## Max. :153.4 Max. :2020-05-17 Max. :2612.0
##
head(CoronaDeathDf, 2)
## Province.State Country.Region Lat Long date
## 1 Australian Capital Territory Australia -35.4735 149.0124 2020-01-22
## 2 New South Wales Australia -33.8688 151.2093 2020-01-22
## DeathCases
## 1 0
## 2 0
full_join() function was used and merge was done on date and Province.State to match and the state that data relates to.names() function.head() function and structure using str() function along with summary statistics using summary() function.coronaJoinDf <- full_join(CoronaConfDf, CoronaDeathDf, by=c("date", "Province.State"))
coronaJoinDf <- full_join(coronaJoinDf, CoronaRecDf, by=c("date", "Province.State"))
coronaJoinDf <- coronaJoinDf[,!(names(coronaJoinDf) %in% c("Lat.x", "Long.x","Lat.y", "Long.y", "Country.Region.x", "Country.Region.y", "Country.Region"))]
names(coronaJoinDf)[c(1,2)] <- c("State", "date")
head(coronaJoinDf, 2)
## State date ConfirmedCases DeathCases
## 1 Australian Capital Territory 2020-01-22 0 0
## 2 New South Wales 2020-01-22 0 0
## Lat Long RecoveredCases
## 1 -35.4735 149.0124 0
## 2 -33.8688 151.2093 0
str(coronaJoinDf)
## 'data.frame': 936 obs. of 7 variables:
## $ State : Factor w/ 8 levels "Australian Capital Territory",..: 1 2 3 4 5 6 7 8 1 2 ...
## $ date : Date, format: "2020-01-22" "2020-01-22" ...
## $ ConfirmedCases: int 0 0 0 0 0 0 0 0 0 0 ...
## $ DeathCases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Lat : num -35.5 -33.9 -12.5 -28 -34.9 ...
## $ Long : num 149 151 131 153 139 ...
## $ RecoveredCases: int 0 0 0 0 0 0 0 0 0 0 ...
summary(coronaJoinDf)
## State date ConfirmedCases
## Australian Capital Territory:117 Min. :2020-01-22 Min. : 0.0
## New South Wales :117 1st Qu.:2020-02-20 1st Qu.: 2.0
## Northern Territory :117 Median :2020-03-20 Median : 28.0
## Queensland :117 Mean :2020-03-20 Mean : 365.6
## South Australia :117 3rd Qu.:2020-04-18 3rd Qu.: 437.2
## Tasmania :117 Max. :2020-05-17 Max. :3076.0
## (Other) :234
## DeathCases Lat Long RecoveredCases
## Min. : 0.000 Min. :-41.45 Min. :115.9 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:-36.06 1st Qu.:136.7 1st Qu.: 0.0
## Median : 0.000 Median :-34.40 Median :145.5 Median : 4.0
## Mean : 3.903 Mean :-32.00 Mean :141.2 Mean : 211.6
## 3rd Qu.: 4.000 3rd Qu.:-30.97 3rd Qu.:149.6 3rd Qu.: 104.0
## Max. :46.000 Max. :-12.46 Max. :153.4 Max. :2612.0
##
mutate() function to create two variables i.e,
StockDifferenceOpenClose - Tell how the price of stock changed when it was first sold on that day and last sold. To do this observations from Open stock value are subtracted with Close stock value.StockDifferenceHighLow - Tell us how much difference is there between the highest and the lowest value of the stock sold on a particular day. To do this observations from High stock value are subtracted with Low stock value.stock$date <- as.Date(stock$date)
stock <- mutate(stock,
StockDifferenceOpenClose = Open - Close,
StockDifferenceHighLow = High - Low)
head(stock, 2)
## date Open High Low Close Adj.Close Volume stockName
## 1 2015-03-31 0.555 0.595 0.530 0.565 0.565 4816294 A2M
## 2 2015-04-01 0.575 0.580 0.555 0.565 0.565 4376660 A2M
## StockDifferenceOpenClose StockDifferenceHighLow
## 1 -0.01 0.065
## 2 0.01 0.025
str(stock)
## 'data.frame': 432888 obs. of 10 variables:
## $ date : Date, format: "2015-03-31" "2015-04-01" ...
## $ Open : num 0.555 0.575 0.56 0.545 0.545 0.54 0.53 0.535 0.54 0.54 ...
## $ High : num 0.595 0.58 0.565 0.55 0.545 0.54 0.535 0.54 0.54 0.55 ...
## $ Low : num 0.53 0.555 0.535 0.54 0.53 0.532 0.53 0.53 0.535 0.54 ...
## $ Close : num 0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
## $ Adj.Close : num 0.565 0.565 0.555 0.545 0.54 0.535 0.535 0.54 0.54 0.54 ...
## $ Volume : num 4816294 4376660 2779640 392179 668446 ...
## $ stockName : Factor w/ 100 levels "A2M","AGL","ALL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ StockDifferenceOpenClose: num -0.01 0.01 0.005 0 0.005 ...
## $ StockDifferenceHighLow : num 0.065 0.025 0.03 0.01 0.015 ...
summarise(stock)
## data frame with 0 columns and 1 row
is.na() function with a column sum (colSums()) function.price, lat and lon have missing value.colSums(is.na(propDf))
## date price suburb city_name state
## 0 57867 0 0 0
## lat lon bedrooms property_type loc_pid
## 20 20 0 0 0
## lga_pid
## 0
propDf price variable and create a summary statistics table.right-skewed.knitr::kable(propDf %>% summarise(Min = min(price,na.rm = TRUE), Max = max(price, na.rm = TRUE), n = n(), Missing = sum(is.na(price)), Q1 = quantile(price ,probs = .25,na.rm = TRUE), Median = median(price, na.rm = TRUE), Q3 = quantile(price, probs = .75,na.rm = TRUE), Mean = mean(price, na.rm = TRUE), SD = sd(price, na.rm = TRUE), IQR = IQR(price ,na.rm = TRUE)), "html", caption = "Table 1: Descriptive Statistics", align = "llllllllll", col.names = c("Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')
| Minimum | Maximum | Sample Size | Missing Count | First Quartile | Median | Third Quartile | Mean | Standard Deviation | IQR |
|---|---|---|---|---|---|---|---|---|---|
| 1000 | 100000000 | 290046 | 57867 | 485000 | 650000 | 912000 | 805693 | 616272.9 | 427000 |
missingPrice <- unique(propDf[is.na(propDf$price),'suburb'])
for (location in missingPrice){
propDf[propDf$suburb == location & is.na(propDf$price) & propDf$date > min(propDf$date) & propDf$date < '2018-12-31', 'price'] = median(propDf[propDf$suburb == i & propDf$date > min(propDf$date) & propDf$date < '2018-12-31', 'price'], na.rm = TRUE) # Dates in 2018
propDf[propDf$suburb == location & is.na(propDf$price) & propDf$date <= max(propDf$date) & propDf$date > '2018-12-31', 'price'] = median(propDf[propDf$suburb == i & propDf$date <= max(propDf$date) & propDf$date > '2018-12-31', 'price'], na.rm = TRUE) # Dates between 2019 and 2020
}
Mambourin have no other data so this observation is dropped.missingLatLon <- unique(propDf[is.na(propDf$lat),'suburb'])
for (i in missingLatLon){
propDf[propDf$suburb == i & is.na(propDf$lon), 'lon'] = mean(propDf[propDf$suburb == i, 'lon'], na.rm = TRUE)
propDf[propDf$suburb == i & is.na(propDf$lat), 'lat'] = mean(propDf[propDf$suburb == i, 'lat'], na.rm = TRUE)
}
propDf <- propDf[!propDf$suburb=='Mambourin',]
colSums(is.na(AudDf))
## date value
## 0 0
colSums(is.na(stock))
## date Open High
## 0 1527 1527
## Low Close Adj.Close
## 1527 1527 1527
## Volume stockName StockDifferenceOpenClose
## 1527 0 1527
## StockDifferenceHighLow
## 1527
colSums(is.na(coronaJoinDf))
## State date ConfirmedCases DeathCases Lat
## 0 0 0 0 0
## Long RecoveredCases
## 0 0
print(paste("Minimum date in AudDf is :" ,min(AudDf$date)))
## [1] "Minimum date in AudDf is : 1991-01-02"
print(paste("Minimum date in coronaJoinDf is :" ,min(coronaJoinDf$date)))
## [1] "Minimum date in coronaJoinDf is : 2020-01-22"
print(paste("Minimum date in propDf is :" ,min(propDf$date)))
## [1] "Minimum date in propDf is : 2018-09-02"
print(paste("Minimum date in stock is :" ,min(stock$date)))
## [1] "Minimum date in stock is : 2000-01-03"
AudDf <- subset(AudDf, date >= "2018-09-02")
stock <- subset(stock, date >= "2018-09-02")
propDf have one outlier on the higher side, so only going to keep data smaller than the maximum property price to remove the outlier.boxplot(propDf$price, ylim = c(-100,max(propDf$price, na.rm = TRUE)), main = 'Box plot for price before removing outlier', cex.main=0.5); propDf <- subset(propDf, price < max(propDf$price, na.rm = TRUE)) ; boxplot(propDf$price, ylim = c(-100,max(propDf$price, na.rm = TRUE)), main = 'Box plot for price after removing outlier', cex.main=0.5) # Dropping extrem outlier
names(AudDf) <- c('date', 'AudToUsd')
knitr::kable(AudDf %>% summarise(Min = min(AudToUsd,na.rm = TRUE), Max = max(AudToUsd, na.rm = TRUE), n = n(), Missing = sum(is.na(AudToUsd)), Q1 = quantile(AudToUsd ,probs = .25,na.rm = TRUE), Median = median(AudToUsd, na.rm = TRUE), Q3 = quantile(AudToUsd, probs = .75,na.rm = TRUE), Mean = mean(AudToUsd, na.rm = TRUE), SD = sd(AudToUsd, na.rm = TRUE), IQR = IQR(AudToUsd ,na.rm = TRUE)), "html", caption = "Table 2: Descriptive Statistics", align = "llllllllll", col.names = c("Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black'); boxplot(AudDf$AudToUsd, main = 'Box plot for AudToUsd', cex.main=0.5)
| Minimum | Maximum | Sample Size | Missing Count | First Quartile | Median | Third Quartile | Mean | Standard Deviation | IQR |
|---|---|---|---|---|---|---|---|---|---|
| 0.57 | 0.74 | 535 | 0 | 0.68 | 0.69 | 0.71 | 0.69 | 0.03 | 0.03 |
GatherStock <- stock %>% gather(Open, High, High, Low, Close, Adj.Close, Volume,StockDifferenceOpenClose, StockDifferenceHighLow, key = 'Parameter', value = 'value')
knitr::kable(GatherStock %>% group_by(GatherStock$Parameter) %>% summarise(Min = min(value,na.rm = TRUE), Max = max(value, na.rm = TRUE), n = n(), Q1 = quantile(value ,probs = .25,na.rm = TRUE), Median = median(value, na.rm = TRUE), Q3 = quantile(value, probs = .75,na.rm = TRUE), Mean = mean(value, na.rm = TRUE), SD = sd(value, na.rm = TRUE), IQR = IQR(value ,na.rm = TRUE)), "html", caption = "Table 3: Descriptive Statistics", align = "llllllllll", col.names = c('Parameter Type', "Minimum", "Maximum", "Sample Size","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')
| Parameter Type | Minimum | Maximum | Sample Size | First Quartile | Median | Third Quartile | Mean | Standard Deviation | IQR |
|---|---|---|---|---|---|---|---|---|---|
| Adj.Close | 0.95 | 339.38 | 40043 | 4.75 | 11.70 | 24.70 | 22.31 | 34.83 | 19.95 |
| Close | 0.95 | 341.00 | 40043 | 4.95 | 12.05 | 25.43 | 22.91 | 35.42 | 20.48 |
| High | 0.98 | 342.75 | 40043 | 5.01 | 12.19 | 25.70 | 23.16 | 35.79 | 20.69 |
| Low | 0.90 | 337.03 | 40043 | 4.88 | 11.90 | 25.15 | 22.65 | 35.04 | 20.27 |
| Open | 0.94 | 339.42 | 40043 | 4.94 | 12.05 | 25.44 | 22.90 | 35.42 | 20.50 |
| StockDifferenceHighLow | 0.00 | 57.74 | 40043 | 0.10 | 0.22 | 0.50 | 0.51 | 1.12 | 0.40 |
| StockDifferenceOpenClose | -52.79 | 30.42 | 40043 | -0.09 | 0.00 | 0.07 | -0.01 | 0.82 | 0.16 |
| Volume | 0.00 | 248837038.00 | 40043 | 1017361.50 | 2530727.00 | 5364411.00 | 4471622.35 | 6320144.83 | 4347049.50 |
248837038 and 193729243.ggplot(GatherStock, aes(x=Parameter, y=value)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", size = 5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12), axis.text.x = element_text(color = "grey20", size = 8, angle = 90)) + scale_x_discrete(name = "\nVariable Name")+ ggtitle("Boxplot for different variable of stock dataframe before removing outliers\n") + scale_y_continuous(name = 'Value\n'); stock <- stock[!stock$Volume == 248837038,]; stock <- stock[!stock$Volume == 193729243,]; GatherStock <- stock %>% gather(Open, High, High, Low, Close, Adj.Close, Volume,StockDifferenceOpenClose, StockDifferenceHighLow, key = 'Parameter', value = 'value'); ggplot(GatherStock, aes(x=Parameter, y=value)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", size = 5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12), axis.text.x = element_text(color = "grey20", size = 8, angle = 90)) + scale_x_discrete(name = "\nVariable Name")+ ggtitle("Boxplot for different variable of stock dataframe after removing outliers\n") + scale_y_continuous(name = 'Value\n')
full_join() function keeping date as the key and storing it in finalDf data frame.finalDf and AudDf by date keeping finalDf on the left side as we want to see how corona affected the exchange rate.finalDf and stock by date keeping finalDf on the left side as we are interested in seeing how corona affected the stock exchange market.finalDf <- full_join(propDf, coronaJoinDf, by = 'date')
finalDf <- full_join(finalDf, AudDf, by = 'date')
finalDf <- full_join(finalDf, stock, by = 'date')
str(finalDf)
## 'data.frame': 34205178 obs. of 27 variables:
## $ date : Date, format: "2018-09-26" "2018-09-26" ...
## $ price : num 1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 1730000 ...
## $ suburb : Factor w/ 2017 levels "Abbotsbury","Abbotsford",..: 489 489 489 489 489 489 489 489 489 489 ...
## $ city_name : Factor w/ 6 levels "Adelaide","Brisbane",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ state : Factor w/ 6 levels "ACT","NSW","QLD",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ lat : num -33.9 -33.9 -33.9 -33.9 -33.9 ...
## $ lon : num 151 151 151 151 151 ...
## $ bedrooms : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ property_type : Ord.factor w/ 3 levels "unit"<"house"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ loc_pid : Factor w/ 2109 levels "ACT101","ACT102",..: 132 132 132 132 132 132 132 132 132 132 ...
## $ lga_pid : Factor w/ 115 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ State : Factor w/ 8 levels "Australian Capital Territory",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ConfirmedCases : int NA NA NA NA NA NA NA NA NA NA ...
## $ DeathCases : int NA NA NA NA NA NA NA NA NA NA ...
## $ Lat : num NA NA NA NA NA NA NA NA NA NA ...
## $ Long : num NA NA NA NA NA NA NA NA NA NA ...
## $ RecoveredCases : int NA NA NA NA NA NA NA NA NA NA ...
## $ AudToUsd : num 0.726 0.726 0.726 0.726 0.726 ...
## $ Open : num 10.85 19.25 28.4 8.76 26.2 ...
## $ High : num 10.97 19.78 28.83 8.83 26.87 ...
## $ Low : num 10.64 19.2 28.11 8.69 25.96 ...
## $ Close : num 10.7 19.59 28.71 8.75 26.72 ...
## $ Adj.Close : num 10.7 18.04 27.9 8.37 26.29 ...
## $ Volume : num 3794249 2910828 2825963 1424545 447759 ...
## $ stockName : Factor w/ 100 levels "A2M","AGL","ALL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ StockDifferenceOpenClose: num 0.15 -0.34 -0.31 0.01 -0.52 ...
## $ StockDifferenceHighLow : num 0.33 0.58 0.715 0.145 0.91 ...
head(finalDf, 2)
## date price suburb city_name state lat lon
## 1 2018-09-26 1730000 Darling Point Sydney NSW -33.86839 151.2375
## 2 2018-09-26 1730000 Darling Point Sydney NSW -33.86839 151.2375
## bedrooms property_type loc_pid lga_pid State ConfirmedCases DeathCases
## 1 2 unit NSW1221 NSW180 <NA> NA NA
## 2 2 unit NSW1221 NSW180 <NA> NA NA
## Lat Long RecoveredCases AudToUsd Open High Low Close Adj.Close
## 1 NA NA NA 0.7264 10.85 10.97 10.64 10.70 10.7000
## 2 NA NA NA 0.7264 19.25 19.78 19.20 19.59 18.0424
## Volume stockName StockDifferenceOpenClose StockDifferenceHighLow
## 1 3794249 A2M 0.15 0.33
## 2 2910828 AGL -0.34 0.58
coronaJoinDf is mainly missing as it just came in the year 2020 so there will be no data before that.AudDf and stock data frame also developed missing values that suggest that there were some days that were not in the data.colSums(is.na(finalDf))
## date price suburb
## 0 32 32
## city_name state lat
## 32 32 32
## lon bedrooms property_type
## 32 32 32
## loc_pid lga_pid State
## 32 32 15919394
## ConfirmedCases DeathCases Lat
## 15919394 15919394 15919394
## Long RecoveredCases AudToUsd
## 15919394 15919394 62728
## Open High Low
## 142213 142213 142213
## Close Adj.Close Volume
## 142213 142213 142213
## stockName StockDifferenceOpenClose StockDifferenceHighLow
## 142213 142213 142213
AudToUsd we used the forward filling method using fill() function.Null charter.cbind() function to add a new column that uses weekdays() function to tell the day of the week bass on the date, as this can give valuable insight as in what weekday promotes more sale or property, change in stock or exchange rate, and also what weekdays most corona cases were identified, etc.finalDf <- finalDf %>% fill(AudToUsd)
finalDf$State <- as.character(finalDf$State)
finalDf$city_name <- as.character(finalDf$city_name)
finalDf$state <- as.character(finalDf$state)
finalDf$property_type <- as.character(finalDf$property_type)
finalDf$loc_pid <- as.character(finalDf$loc_pid)
finalDf$lga_pid <- as.character(finalDf$lga_pid)
finalDf$stockName <- as.character(finalDf$stockName)
finalDf$suburb <- as.character(finalDf$suburb)
finalDf[is.na(finalDf$stockName),'stockName'] <- 'Null'
finalDf[is.na(finalDf$State),'State'] <- 'Null'
finalDf[is.na(finalDf$suburb),c('suburb', 'city_name', 'state', 'property_type',
'loc_pid', 'lga_pid')] <- 'Null'
finalDf[finalDf$stockName == 'Null', c('Open','High', 'Low', 'Close', 'Adj.Close', 'Volume',
'StockDifferenceOpenClose', 'StockDifferenceHighLow')] <- 0
finalDf[finalDf$State == 'Null', c('ConfirmedCases','DeathCases', 'RecoveredCases', 'Long', 'Lat')] <- 0
finalDf[finalDf$suburb == 'Null', c('price','lat', 'lon', 'Long', 'bedrooms')] <- 0
finalDf <- cbind(finalDf, weekdays(finalDf$date))
colSums(is.na(finalDf))
## date price suburb
## 0 0 0
## city_name state lat
## 0 0 0
## lon bedrooms property_type
## 0 0 0
## loc_pid lga_pid State
## 0 0 0
## ConfirmedCases DeathCases Lat
## 0 0 0
## Long RecoveredCases AudToUsd
## 0 0 0
## Open High Low
## 0 0 0
## Close Adj.Close Volume
## 0 0 0
## stockName StockDifferenceOpenClose StockDifferenceHighLow
## 0 0 0
## weekdays(finalDf$date)
## 0
names(finalDf) <- c('date', 'PropertyPrice', 'PropertySuburb', 'PropertyCityName', 'PropertyState',
'PropertyLat', 'PropertyLon', 'PropertyBedrooms','PropertyType', 'PropertyLocPid',
'PropertyLgaPid','CoronaState','CoronaConfirmedCases', 'CoronaDeathCases',
'CoronaLat', 'CoronaLon', 'CoronaRecoveredCases','AudToUsd', 'StockOpen',
'StockHigh', 'StockLow','StockClose', 'StockAdj.Close', 'StockVolume',
'StockName', 'StockDifferenceOpenClose', 'StockDifferenceHighLow', 'Weekday')
StockDifferenceOpenClose variable.hist(finalDf$PropertyPrice, main="Histogram for PropertyPrice", cex.main=0.5); hist(finalDf$AudToUsd, main="Histogram for AudToUsd", cex.main=0.5); hist(finalDf$StockOpen, main="Histogram for StockOpen", cex.main=0.5); hist(finalDf$StockHigh, main="Histogram for StockHigh", cex.main=0.5); hist(finalDf$StockLow, main="Histogram for StockLow", cex.main=0.5); hist(finalDf$StockClose, main="Histogram for StockClose", cex.main=0.5); hist(finalDf$StockAdj.Close, main="Histogram for StockAdj.Close", cex.main=0.5); hist(finalDf$StockVolume, main="Histogram for StockVolume", cex.main=0.5); hist(finalDf$StockDifferenceOpenClose, main="Histogram for StockDifferenceOpenClose", cex.main=0.5); hist(finalDf$StockDifferenceHighLow, main="Histogram for StockDifferenceHighLow", cex.main=0.5)
StockDifferenceOpenClose variable reciprocal transformation will be used.hist(log(finalDf$PropertyPrice), main="Histogram for PropertyPrice", cex.main=0.5); hist(log(finalDf$StockOpen), main="Histogram for StockOpen", cex.main=0.5); hist(log(finalDf$StockHigh), main="Histogram for StockHigh", cex.main=0.5); hist(log(finalDf$StockLow), main="Histogram for StockLow", cex.main=0.5); hist(log(finalDf$StockClose), main="Histogram for StockClose", cex.main=0.5); hist(log(finalDf$StockAdj.Close), main="Histogram for StockAdj.Close", cex.main=0.5); hist(log(finalDf$StockVolume), main="Histogram for StockVolume", cex.main=0.5); hist(1/(finalDf$StockDifferenceOpenClose), main="Histogram for StockDifferenceOpenClose", cex.main=0.5); hist(log(finalDf$StockDifferenceHighLow), main="Histogram for StockDifferenceHighLow", cex.main=0.5)
head(finalDf, 2)
## date PropertyPrice PropertySuburb PropertyCityName PropertyState
## 1 2018-09-26 1730000 Darling Point Sydney NSW
## 2 2018-09-26 1730000 Darling Point Sydney NSW
## PropertyLat PropertyLon PropertyBedrooms PropertyType PropertyLocPid
## 1 -33.86839 151.2375 2 unit NSW1221
## 2 -33.86839 151.2375 2 unit NSW1221
## PropertyLgaPid CoronaState CoronaConfirmedCases CoronaDeathCases
## 1 NSW180 Null 0 0
## 2 NSW180 Null 0 0
## CoronaLat CoronaLon CoronaRecoveredCases AudToUsd StockOpen StockHigh
## 1 0 0 0 0.7264 10.85 10.97
## 2 0 0 0 0.7264 19.25 19.78
## StockLow StockClose StockAdj.Close StockVolume StockName
## 1 10.64 10.70 10.7000 3794249 A2M
## 2 19.20 19.59 18.0424 2910828 AGL
## StockDifferenceOpenClose StockDifferenceHighLow Weekday
## 1 0.15 0.33 Wednesday
## 2 -0.34 0.58 Wednesday
finalDf$PropertyState <- as.factor(finalDf$PropertyState)
finalDf$PropertyCityName <- as.factor(finalDf$PropertyCityName)
finalDf$CoronaState <- as.factor(finalDf$CoronaState)
finalDf$PropertyType <- factor(finalDf$PropertyType, levels = c("Null", "unit","house","townhouse"),ordered = TRUE)
finalDf$PropertyLocPid <- as.factor(finalDf$PropertyLocPid)
finalDf$PropertyLgaPid <- as.factor(finalDf$PropertyLgaPid)
finalDf$StockName <- as.factor(finalDf$StockName)
finalDf$PropertySuburb <- as.factor(finalDf$PropertySuburb)
finalDf$PropertyPrice <- log(finalDf$PropertyPrice)
finalDf$StockOpen <- log(finalDf$StockOpen)
finalDf$StockHigh <- log(finalDf$StockHigh)
finalDf$StockLow <- log(finalDf$StockLow)
finalDf$StockClose <- log(finalDf$StockClose)
finalDf$StockAdj.Close <- log(finalDf$StockAdj.Close)
finalDf$StockVolume <-log(finalDf$StockVolume)
finalDf$StockDifferenceOpenClose <- 1/(finalDf$StockDifferenceOpenClose)
finalDf$StockDifferenceHighLow <- log(finalDf$StockDifferenceHighLow)
head(finalDf)
## date PropertyPrice PropertySuburb PropertyCityName PropertyState
## 1 2018-09-26 14.36363 Darling Point Sydney NSW
## 2 2018-09-26 14.36363 Darling Point Sydney NSW
## 3 2018-09-26 14.36363 Darling Point Sydney NSW
## 4 2018-09-26 14.36363 Darling Point Sydney NSW
## 5 2018-09-26 14.36363 Darling Point Sydney NSW
## 6 2018-09-26 14.36363 Darling Point Sydney NSW
## PropertyLat PropertyLon PropertyBedrooms PropertyType PropertyLocPid
## 1 -33.86839 151.2375 2 unit NSW1221
## 2 -33.86839 151.2375 2 unit NSW1221
## 3 -33.86839 151.2375 2 unit NSW1221
## 4 -33.86839 151.2375 2 unit NSW1221
## 5 -33.86839 151.2375 2 unit NSW1221
## 6 -33.86839 151.2375 2 unit NSW1221
## PropertyLgaPid CoronaState CoronaConfirmedCases CoronaDeathCases
## 1 NSW180 Null 0 0
## 2 NSW180 Null 0 0
## 3 NSW180 Null 0 0
## 4 NSW180 Null 0 0
## 5 NSW180 Null 0 0
## 6 NSW180 Null 0 0
## CoronaLat CoronaLon CoronaRecoveredCases AudToUsd StockOpen StockHigh
## 1 0 0 0 0.7264 2.384165 2.395164
## 2 0 0 0 0.7264 2.957511 2.984671
## 3 0 0 0 0.7264 3.346389 3.361243
## 4 0 0 0 0.7264 2.170196 2.178155
## 5 0 0 0 0.7264 3.265759 3.291010
## 6 0 0 0 0.7264 1.907169 1.907169
## StockLow StockClose StockAdj.Close StockVolume StockName
## 1 2.364620 2.370244 2.370244 15.14900 A2M
## 2 2.954910 2.975019 2.892725 14.88395 AGL
## 3 3.336125 3.357245 3.328651 14.85436 ALL
## 4 2.161597 2.169054 2.124350 14.16936 ALQ
## 5 3.256557 3.285412 3.269207 13.01201 ALU
## 6 1.895405 1.899832 1.859265 14.69220 ALX
## StockDifferenceOpenClose StockDifferenceHighLow Weekday
## 1 6.666667 -1.10866262 Wednesday
## 2 -2.941176 -0.54472718 Wednesday
## 3 -3.225817 -0.33547274 Wednesday
## 4 100.000000 -1.93102154 Wednesday
## 5 -1.923084 -0.09430848 Wednesday
## 6 20.312817 -2.54135003 Wednesday
str(finalDf)
## 'data.frame': 34205178 obs. of 28 variables:
## $ date : Date, format: "2018-09-26" "2018-09-26" ...
## $ PropertyPrice : num 14.4 14.4 14.4 14.4 14.4 ...
## $ PropertySuburb : Factor w/ 2000 levels "Abbotsbury","Abbotsford",..: 485 485 485 485 485 485 485 485 485 485 ...
## $ PropertyCityName : Factor w/ 7 levels "Adelaide","Brisbane",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ PropertyState : Factor w/ 7 levels "ACT","NSW","Null",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ PropertyLat : num -33.9 -33.9 -33.9 -33.9 -33.9 ...
## $ PropertyLon : num 151 151 151 151 151 ...
## $ PropertyBedrooms : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ PropertyType : Ord.factor w/ 4 levels "Null"<"unit"<..: 2 2 2 2 2 2 2 2 2 2 ...
## $ PropertyLocPid : Factor w/ 2091 levels "ACT101","ACT102",..: 131 131 131 131 131 131 131 131 131 131 ...
## $ PropertyLgaPid : Factor w/ 116 levels "ACT","BRIS1",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ CoronaState : Factor w/ 9 levels "Australian Capital Territory",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ CoronaConfirmedCases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CoronaDeathCases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CoronaLat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CoronaLon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CoronaRecoveredCases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AudToUsd : num 0.726 0.726 0.726 0.726 0.726 ...
## $ StockOpen : num 2.38 2.96 3.35 2.17 3.27 ...
## $ StockHigh : num 2.4 2.98 3.36 2.18 3.29 ...
## $ StockLow : num 2.36 2.95 3.34 2.16 3.26 ...
## $ StockClose : num 2.37 2.98 3.36 2.17 3.29 ...
## $ StockAdj.Close : num 2.37 2.89 3.33 2.12 3.27 ...
## $ StockVolume : num 15.1 14.9 14.9 14.2 13 ...
## $ StockName : Factor w/ 101 levels "A2M","AGL","ALL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ StockDifferenceOpenClose: num 6.67 -2.94 -3.23 100 -1.92 ...
## $ StockDifferenceHighLow : num -1.1087 -0.5447 -0.3355 -1.931 -0.0943 ...
## $ Weekday : Factor w/ 7 levels "Friday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...
summary(finalDf)
## date PropertyPrice PropertySuburb
## Min. :2018-09-02 Min. : -Inf Pakenham : 255271
## 1st Qu.:2019-06-19 1st Qu.:13.05 Frankston : 169599
## Median :2020-01-28 Median :13.35 Tarneit : 161499
## Mean :2019-10-17 Mean : -Inf Melbourne : 159318
## 3rd Qu.:2020-02-27 3rd Qu.:13.66 Point Cook: 159181
## Max. :2020-05-17 Max. :16.12 Sunbury : 159119
## (Other) :33141191
## PropertyCityName PropertyState PropertyLat PropertyLon
## Adelaide : 4075336 ACT : 1207070 Min. :-38.48 Min. : 0.0
## Brisbane : 4121829 NSW : 9011640 1st Qu.:-37.76 1st Qu.:144.6
## Canberra : 1207070 Null: 32 Median :-34.07 Median :145.2
## Melbourne:11615565 QLD : 4121829 Mean :-34.40 Mean :143.4
## Null : 32 SA : 4075336 3rd Qu.:-33.68 3rd Qu.:151.1
## Perth : 4173706 VIC :11615565 Max. : 0.00 Max. :153.2
## Sydney : 9011640 WA : 4173706
## PropertyBedrooms PropertyType PropertyLocPid
## 0: 25132 Null : 32 VIC2023: 255271
## 1: 1898179 unit : 8516643 VIC939 : 169599
## 2: 7250252 house :23035467 VIC2453: 161499
## 3:13317829 townhouse: 2653036 VIC1634: 159318
## 4: 9556504 VIC2082: 159181
## 5: 2157282 VIC2394: 159119
## (Other):33141191
## PropertyLgaPid CoronaState
## ACT : 1207070 Null :15919394
## BRIS2 : 1117338 Australian Capital Territory: 2285723
## BRIS3 : 1018200 New South Wales : 2285723
## VIC127 : 975817 Northern Territory : 2285723
## NSW283 : 862876 Queensland : 2285723
## BRIS4 : 772019 South Australia : 2285723
## (Other):28251858 (Other) : 6857169
## CoronaConfirmedCases CoronaDeathCases CoronaLat CoronaLon
## Min. : 0.0 Min. : 0.0000 Min. :-41.45 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.:-34.93 1st Qu.: 0.0
## Median : 0.0 Median : 0.0000 Median :-12.46 Median :115.9
## Mean : 34.5 Mean : 0.1923 Mean :-17.10 Mean : 75.5
## 3rd Qu.: 4.0 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.:146.0
## Max. :3076.0 Max. :46.0000 Max. : 0.00 Max. :153.4
##
## CoronaRecoveredCases AudToUsd StockOpen StockHigh
## Min. : 0.000 Min. :0.5687 Min. : -Inf Min. : -Inf
## 1st Qu.: 0.000 1st Qu.:0.6602 1st Qu.:1.564 1st Qu.:1.575
## Median : 0.000 Median :0.6747 Median :2.485 Median :2.499
## Mean : 3.259 Mean :0.6721 Mean : -Inf Mean : -Inf
## 3rd Qu.: 0.000 3rd Qu.:0.6966 3rd Qu.:3.239 3rd Qu.:3.252
## Max. :2612.000 Max. :0.7357 Max. :5.827 Max. :5.837
##
## StockLow StockClose StockAdj.Close StockVolume
## Min. : -Inf Min. : -Inf Min. : -Inf Min. : -Inf
## 1st Qu.:1.550 1st Qu.:1.560 1st Qu.:1.530 1st Qu.:14.01
## Median :2.466 Median :2.483 Median :2.463 Median :14.94
## Mean : -Inf Mean : -Inf Mean : -Inf Mean : -Inf
## 3rd Qu.:3.223 3rd Qu.:3.237 3rd Qu.:3.219 3rd Qu.:15.73
## Max. :5.820 Max. :5.832 Max. :5.827 Max. :18.51
##
## StockName StockDifferenceOpenClose StockDifferenceHighLow
## A2M : 340911 Min. :-204.0816 Min. : -Inf
## AGL : 340911 1st Qu.: -7.6922 1st Qu.:-2.1628
## ALL : 340911 Median : 0.8929 Median :-1.3093
## ALQ : 340911 Mean : Inf Mean : -Inf
## ALU : 340911 3rd Qu.: 12.5000 3rd Qu.:-0.3857
## ALX : 340911 Max. : Inf Max. : 4.0560
## (Other):32159712
## Weekday
## Friday :8152045
## Monday :6462713
## Saturday : 76536
## Sunday : 10454
## Thursday :6262589
## Tuesday :6334771
## Wednesday:6906070