Student number: s3810585

Student email: s3810585@rmit.student.edu.au

1. Setup

The following libraries are loaded for data processing:

library(data.table)
library(tidyverse)
library(lubridate)
library(mltools)

2. Data Description

The dataset used here, COVID19_line_list_data.csv, is part of the Novel Corona Virus 2019 Dataset (Novel Corona Virus 2019 Dataset) and describes individual cases of the recent epidemic disease COVID-19 (What you need to know about coronavirus (COVID-19)). It was taken from Kaggle, where it was uploaded from the Johns Hopkins GitHub repository (2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE) by one of the Kaggle Grandmasters (SRK).
This assignment uses version 34 of the data, uploaded to Kaggle on 16/03/2020.
The data source contains 6 CSV files.
The COVID19_line_list_data.csv file has 27 variables and 1085 observations.
Initially, the data set has the following 27 variables:

Variable Name	Description
id	Index number of each record
case_in_country	Case number within the country
reporting date	Date on which the case was reported
V4	Empty variable (all values NA)
summary	Free-text summary of the case report
location	Name of the city or province
country	Name of the country
gender	Sex of the patient (male or female)
age	Age of the patient
symptom_onset	Date on which symptoms first appeared
If_onset_approximated	Whether the symptom onset date was approximated (1/0)
hosp_visit_date	Date on which the patient visited hospital
exposure_start	Date on which exposure started
exposure_end	Date on which exposure ended
visiting Wuhan	Whether the patient was visiting Wuhan (1/0)
from Wuhan	Whether the patient was from Wuhan (1/0)
death	Whether the patient died, given as a date of death, 1 or 0
recovered	Whether the patient recovered, given as a recovery date, 1 or 0
symptom	Symptoms the patient showed
source	Source of the data
link	URL of the source
V22	Empty variable (all values NA)
V23	Empty variable (all values NA)
V24	Empty variable (all values NA)
V25	Empty variable (all values NA)
V26	Empty variable (all values NA)
V27	Empty variable (all values NA)

3. Read/Import Data

The COVID19_line_list_data.csv dataset was read with the fread() function from the data.table library.
fread() reads files quickly and automatically detects attributes such as:
1. Separator
2. Column class
3. Number of rows
It also does not automatically convert string columns to factors.
The main reasons for using this function are its read speed on large data and its ability to detect each column’s class without extra arguments. (fread by M Dowle)
The first two rows of the data are shown using the head() function (head).

CovidData <- fread("COVID19_line_list_data.csv")
head(CovidData, 2)

##    id case_in_country reporting date V4
## 1:  1              NA      1/20/2020 NA
## 2:  2              NA      1/20/2020 NA
##                                                                                                                                                                                                                                                                                                                                                                                                                 summary
## 1: First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, visited relatives in Wuhan on 12/29/2019, symptoms onset on 01/03/2020, returned to Shenzhen and seek medical care on 01/04/2020, hospitalized on 01/11/2020, sample sent to China CDC for testing on 01/18/2020, confirmed on 01/19/2020. 8 others under medical observation, contact tracing ongoing.
## 2:                                                                                                                                                                    First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arrived in Shanghai from Wuhan on 01/12/2020, symptom onset and visited fever clinic on 01/15/2020, laboratory confirmed on 01/20/2020
##               location country gender age symptom_onset
## 1: Shenzhen, Guangdong   China   male  66      01/03/20
## 2:            Shanghai   China female  56     1/15/2020
##    If_onset_approximated hosp_visit_date exposure_start exposure_end
## 1:                     0        01/11/20     12/29/2019     01/04/20
## 2:                     0       1/15/2020           <NA>     01/12/20
##    visiting Wuhan from Wuhan death recovered symptom
## 1:              1          0     0         0        
## 2:              0          1     0         0        
##                                                    source
## 1:                   Shenzhen Municipal Health Commission
## 2: Official Weibo of Shanghai Municipal Health Commission
##                                                                                                              link
## 1:                                                         http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm
## 2: https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment
##    V22 V23 V24 V25 V26 V27
## 1:  NA  NA  NA  NA  NA  NA
## 2:  NA  NA  NA  NA  NA  NA

#Reference - https://www.rdocumentation.org/packages/data.table/versions/1.8.8/topics/fread
#            https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head
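
As a small illustration of the auto-detection described above, the sketch below passes fread() an in-memory string instead of a file. The sample text, its semicolon separator and the name demo are made up for this example and are not part of the assignment data.

```r
library(data.table)

# Hypothetical three-column sample that uses ";" as the separator
csvText <- "id;reporting_date;gender\n1;1/20/2020;male\n2;1/21/2020;female\n"

# No sep or colClasses are given: fread() works them out from the text
demo <- fread(text = csvText)

print(dim(demo))            # rows and columns detected
print(sapply(demo, class))  # id is read as integer, text columns stay character
```

The text columns come back as character rather than factor, which is the behaviour relied on in the later cleaning steps.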

4. Inspect and Understand

4.1 Checking the dimensions of the data frame.

Printing the dimensions of the dataset, where:
1. the value at index one denotes the number of rows and
2. the value at index two denotes the number of columns.
To find the dimensions I used the dim() function from the base library, which takes a single R object as its argument (dim).

dim(CovidData)

## [1] 1085   27

#Reference - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim

4.2 Summarizing

Summarizing the variables by checking their data types (i.e., character, numeric, integer, factor, and logical).
Checking the column names in the data frame and renaming them if required.
Variable names were printed using the names() function.
The str() function was used to summarise the structure of the data.
The attributes of the first 10 rows were checked using the attributes() function.

print(names(CovidData))
##  [1] "id"                    "case_in_country"      
##  [3] "reporting date"        "V4"                   
##  [5] "summary"               "location"             
##  [7] "country"               "gender"               
##  [9] "age"                   "symptom_onset"        
## [11] "If_onset_approximated" "hosp_visit_date"      
## [13] "exposure_start"        "exposure_end"         
## [15] "visiting Wuhan"        "from Wuhan"           
## [17] "death"                 "recovered"            
## [19] "symptom"               "source"               
## [21] "link"                  "V22"                  
## [23] "V23"                   "V24"                  
## [25] "V25"                   "V26"                  
## [27] "V27"
str(CovidData)
## Classes 'data.table' and 'data.frame':   1085 obs. of  27 variables:
##  $ id                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ case_in_country      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reporting date       : chr  "1/20/2020" "1/20/2020" "1/21/2020" "1/21/2020" ...
##  $ V4                   : logi  NA NA NA NA NA NA ...
##  $ summary              : chr  "First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, vi"| __truncated__ "First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arri"| __truncated__ "First confirmed imported cases in Zhejiang: patient is male, 46, lives in Wuhan, self-driving from Wuhan to Han"| __truncated__ "new confirmed imported COVID-19 pneumonia in Tianjin: female, age 60, recently visited Wuhan, visited fever cli"| __truncated__ ...
##  $ location             : chr  "Shenzhen, Guangdong" "Shanghai" "Zhejiang" "Tianjin" ...
##  $ country              : chr  "China" "China" "China" "China" ...
##  $ gender               : chr  "male" "female" "male" "female" ...
##  $ age                  : num  66 56 46 60 58 44 34 37 39 56 ...
##  $ symptom_onset        : chr  "01/03/20" "1/15/2020" "01/04/20" NA ...
##  $ If_onset_approximated: int  0 0 0 NA NA 0 0 0 0 0 ...
##  $ hosp_visit_date      : chr  "01/11/20" "1/15/2020" "1/17/2020" "1/19/2020" ...
##  $ exposure_start       : chr  "12/29/2019" NA NA NA ...
##  $ exposure_end         : chr  "01/04/20" "01/12/20" "01/03/20" NA ...
##  $ visiting Wuhan       : int  1 0 0 1 0 0 0 1 1 1 ...
##  $ from Wuhan           : int  0 1 1 0 0 1 1 0 0 0 ...
##  $ death                : chr  "0" "0" "0" "0" ...
##  $ recovered            : chr  "0" "0" "0" "0" ...
##  $ symptom              : chr  "" "" "" "" ...
##  $ source               : chr  "Shenzhen Municipal Health Commission" "Official Weibo of Shanghai Municipal Health Commission" "Health Commission of Zhejiang Province" "人民日报官方微博" ...
##  $ link                 : chr  "http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm" "https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment" "http://www.zjwjw.gov.cn/art/2020/1/21/art_1202101_41786033.html" "https://m.weibo.cn/status/4463235401268457?" ...
##  $ V22                  : logi  NA NA NA NA NA NA ...
##  $ V23                  : logi  NA NA NA NA NA NA ...
##  $ V24                  : logi  NA NA NA NA NA NA ...
##  $ V25                  : logi  NA NA NA NA NA NA ...
##  $ V26                  : logi  NA NA NA NA NA NA ...
##  $ V27                  : logi  NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
attributes(CovidData[1:10,])
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $class
## [1] "data.table" "data.frame"
## 
## $.internal.selfref
## <pointer: 0x7feda600d4e0>
## 
## $names
##  [1] "id"                    "case_in_country"      
##  [3] "reporting date"        "V4"                   
##  [5] "summary"               "location"             
##  [7] "country"               "gender"               
##  [9] "age"                   "symptom_onset"        
## [11] "If_onset_approximated" "hosp_visit_date"      
## [13] "exposure_start"        "exposure_end"         
## [15] "visiting Wuhan"        "from Wuhan"           
## [17] "death"                 "recovered"            
## [19] "symptom"               "source"               
## [21] "link"                  "V22"                  
## [23] "V23"                   "V24"                  
## [25] "V25"                   "V26"                  
## [27] "V27"

4.3 Dropping columns that have fewer than 2 unique values

Used a for loop to iterate over the variables in the dataset.
Dropped a column/variable if it has fewer than 2 unique values, i.e. it is constant (here, entirely NA). This resulted in the following columns being dropped:
1. V4
2. V22
3. V23
4. V24
5. V25
6. V26
7. V27

for (headers in colnames(CovidData)){
  if( length(unique(CovidData[[headers]])) < 2 ){
    print(paste('Column ' ,headers, ' is dropped.'))
    CovidData <- select(CovidData,-c(headers)) } }

## [1] "Column  V4  is dropped."

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(headers)` instead of `headers` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

## [1] "Column  V22  is dropped."
## [1] "Column  V23  is dropped."
## [1] "Column  V24  is dropped."
## [1] "Column  V25  is dropped."
## [1] "Column  V26  is dropped."
## [1] "Column  V27  is dropped."
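
The same check can also be written without a loop. The sketch below is one possible base R version on a small made-up data frame (demo and its values are hypothetical); it counts the unique values per column once and keeps only the columns with at least two.

```r
# Hypothetical data: V4 and V22 are entirely NA, like the dropped columns above
demo <- data.frame(id = 1:4,
                   V4 = NA,
                   country = c("China", "Japan", "China", "Nepal"),
                   V22 = NA)

# Number of unique values in each column
uniqueCounts <- vapply(demo, function(col) length(unique(col)), integer(1))

# Names of the constant columns, then drop them in a single step
constantCols <- names(uniqueCounts)[uniqueCounts < 2]
demo <- demo[, uniqueCounts >= 2, drop = FALSE]

print(constantCols)
print(names(demo))
```

Selecting all columns in one step also avoids modifying the data frame while iterating over its column names.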

4.4 Replacing any whitespace in headers with underscores

A few variable names contain a white space.
Each white space is substituted with an underscore ( _ ) using the gsub() function.
The new variable names were assigned using the names() function.

headerNew <- gsub(" ", "_", names(CovidData))
names(CovidData) <- headerNew

4.5 Replaced NA’s in case_in_country

Replaced NA’s with an impossible value of -1.
An NA data point denotes that no case was registered up to that date in the data.
This column can give valuable insight into when the outbreak was first discovered in a particular country.
The is.na() function was used to find the NA values.
A total of 197 missing values were found and replaced with -1.
The rows were kept rather than imputed because missing-data imputation has not been covered yet.

print(paste('Number of missing values: ', 
            dim(CovidData[is.na(CovidData$case_in_country)])[1]))

## [1] "Number of missing values:  197"

CovidData$case_in_country[is.na(CovidData$case_in_country)] <- -1

4.6 Replaced NA’s in If_onset_approximated

Replaced NA’s with an impossible value of -1.
Here an NA data point denotes a missing value, and there are a total of 525 missing values.
Replacing them with an impossible value keeps these rows available, since this variable may or may not give valuable insights during analysis.
The is.na() function was used to find the NA values.
The rows were kept rather than imputed because missing-data imputation has not been covered yet.

print(paste('Number of missing values: ',
            dim(CovidData[is.na(CovidData$If_onset_approximated)])[1]))

## [1] "Number of missing values:  525"

CovidData$If_onset_approximated[is.na(CovidData$If_onset_approximated)] <- -1
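
One consequence of using -1 as a placeholder is that it has to be filtered out again before any numeric summary. The sketch below shows this on a made-up vector (cases is hypothetical, not the assignment data).

```r
# Hypothetical case numbers with two missing entries
cases <- c(NA, 3, 5, NA, 8)

# Count the missing values, then replace them with the impossible value -1
nMissing <- sum(is.na(cases))
cases[is.na(cases)] <- -1

# The placeholder distorts a naive summary, so exclude it explicitly
naiveMean <- mean(cases)              # includes the two -1 placeholders
validMean <- mean(cases[cases >= 0])  # only the real observations

print(c(missing = nMissing, naive = naiveMean, valid = validMean))
```

sum(is.na(x)) is used here as a compact alternative to taking dim() of a filtered table.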

4.7 encoding gender column

Most analyses and the majority of machine learning models cannot work directly with plain text, so text data has to be encoded (converted to numerical form by giving each distinct text value a unique number) before it can be used for prediction.
Males are encoded as 0 and females are encoded as 1.
"female" is substituted before "male" because "male" is a substring of "female".
A total of 183 missing values were found and dropped using the drop_na() function from the tidyverse library.
They were dropped because missing-data imputation has not been covered yet.

CovidData$gender <- gsub("female", 1, CovidData$gender)
CovidData$gender <- gsub("male", 0, CovidData$gender)
print(paste('Number of missing values: ', dim(CovidData[is.na(CovidData$gender)])[1]))

## [1] "Number of missing values:  183"

CovidData <- CovidData %>% drop_na(gender)
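
The order of the two gsub() calls matters, because "male" is a substring of "female". The sketch below, on a made-up vector, shows what happens when the order is reversed and an alternative that does not depend on order: a named lookup vector (genderCodes is a name chosen for this example).

```r
# Hypothetical gender values, including one missing entry
gender <- c("male", "female", NA, "female")

# Reversed order: "male" inside "female" is replaced first, leaving "fe0"
wrongOrder <- gsub("female", 1, gsub("male", 0, gender))

# Order-independent alternative: look each value up in a named vector
genderCodes <- c(male = 0L, female = 1L)
encoded <- unname(genderCodes[gender])

print(wrongOrder)
print(encoded)
```

The lookup returns integers directly and leaves missing values as NA, so they can still be counted or dropped afterwards.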

4.8 If the patient is recovered changing the data point to 1

This column has 3 types of data points.
1. Dates - date on which the patient recovered
2. 0 - The patient has not recovered
3. 1 - The patient has recovered
A date therefore denotes that the patient has successfully recovered (i.e. 1), so dates were replaced with 1.
Used the gsub() function to substitute ( / ) with "", i.e. removed the ( / ) sign.
Then used a for loop to check the character length of each data point using the nchar() function.
If the length is greater than 3, gsub() with a regex pattern (regex) is used to substitute the digits with 1.

CovidData$recovered <- gsub('/', "", CovidData$recovered)
for(count in 1:dim(CovidData)[1]){
  if((nchar(CovidData$recovered[count]) > 3) == TRUE){
    CovidData$recovered[count] <- gsub('[0-9]+', 1, CovidData$recovered[count]) } }

4.9 If the patient died then changing the data point to 1

This column has 3 types of data points.
1. Dates - date on which the patient died
2. 0 - the patient is alive
3. 1 - the patient died
A date therefore denotes that the patient died (i.e. 1), so dates were replaced with 1.
Used the gsub() function to substitute ( / ) with "", i.e. removed the ( / ) sign.
Then used a for loop to check the character length of each data point using the nchar() function.
If the length is greater than 3, gsub() with a regex pattern (regex) is used to substitute the digits with 1.

CovidData$death <- gsub('/', "", CovidData$death) 
for(count in 1:dim(CovidData)[1]){
  if((nchar(CovidData$death[count]) > 3) == TRUE){
    CovidData$death[count] <- gsub('[0-9]+', 1, CovidData$death[count]) } }
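
The date-to-1 conversion used for recovered and death can also be written without a loop. The sketch below is one possible vectorised version on a made-up vector (status and its values are hypothetical); it treats any entry that contains a slash as a date.

```r
# Hypothetical column mixing flags, dates and a missing value
status <- c("0", "1", "2/14/2020", "12/30/1899", NA)

# Entries containing "/" are dates; grepl() returns FALSE for NA
isDate <- grepl("/", status, fixed = TRUE)

# A date means the event happened, so it becomes "1"
status[isDate] <- "1"

print(status)
```

Testing for the slash directly avoids relying on the string length after the slashes have been removed.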

4.10 transforming columns into factors

Changed the following columns to factors:
1. gender
2. from_Wuhan
3. visiting_Wuhan
4. death
5. recovered
6. country
Used transform() function for this (transform).

CovidData <- transform(CovidData, gender=factor(gender), 
                       from_Wuhan=factor(from_Wuhan), 
                       visiting_Wuhan=factor(visiting_Wuhan), 
                       death=factor(death), 
                       recovered=factor(recovered), 
                       country=factor(country) )
# Reference - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/transform

4.11 one_hot encoding the countries

Most analyses and the majority of machine learning models cannot work directly with plain text, so text data has to be encoded into numerical form before it can be used for prediction.
Used the one_hot() function from the mltools library (one_hot).
There are a total of 33 countries in the dataset at this point, as the 33 factor levels in the output of unique() show.
Each country is allocated its own variable, and each row holds a 1 in the variable of its country and 0 in all the others.
The encoded data is saved into a new dataframe named encodedCountry.
With one-hot encoding it is common to drop one column (here the last), because it is redundant: a row with 0 in all remaining columns must belong to the dropped country, so no information is lost.
So, the last column was dropped from encodedCountry.
Dropped the country variable as it has been encoded.
Used the cbind() function, which joins two dataframes column-wise.

print(unique(CovidData$country))

##  [1] China       France      Japan       Malaysia    Nepal      
##  [6] Singapore   South Korea Taiwan      Thailand    USA        
## [11] Vietnam     Australia   Canada      Cambodia    Sri Lanka  
## [16] Germany     UAE         Hong Kong   Italy       Russia     
## [21] UK          India       Phillipines Finland     Spain      
## [26] Sweden      Israel      Lebanon     Kuwait      Bahrain    
## [31] Algeria     Croatia     Switzerland
## 33 Levels: Algeria Australia Bahrain Cambodia Canada China ... Vietnam

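# One-hot encode the country factor (one 0/1 column per country)
encodedCountry <- one_hot(as.data.table(CovidData$country))
# Drop the last encoded column; it is implied by the remaining ones
encodedCountry <- select(encodedCountry, -ncol(encodedCountry))
CovidData <- select(CovidData, -c(country))
CovidData <- cbind(CovidData, encodedCountry)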
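
To make the idea of one-hot encoding and the dropped column concrete, the sketch below builds the encoding by hand in base R for a made-up factor. It does not use mltools; country and the three levels are hypothetical.

```r
# Hypothetical country factor with three levels (China, Japan, Nepal)
country <- factor(c("China", "Japan", "China", "Nepal"))

# One 0/1 column per level: 1 where the row belongs to that country
oneHot <- sapply(levels(country), function(lv) as.integer(country == lv))

# Drop the last column; a row of all zeros now stands for the dropped level
reduced <- oneHot[, -ncol(oneHot), drop = FALSE]

print(oneHot)
print(reduced)
```

Every row of the full encoding sums to 1, which is why one column can be removed without losing information.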

4.12 processing state column

Cleaning the location variable, which holds the city or state.
Removed commas and the province names that follow them.
Removed names of states written in languages other than English.
All of this is done using regex and the gsub() function.

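# Strip province suffixes, hyphens and commas from the location names
CovidData$location <- gsub(" Guangdong", "", CovidData$location)
CovidData$location <- gsub(" Hubei", "", CovidData$location)
CovidData$location <- gsub(" Guangxi", "", CovidData$location)
CovidData$location <- gsub("-", "", CovidData$location)
CovidData$location <- gsub("[,]", "", CovidData$location)
# fixed = TRUE removes the literal text " (陕西)"; the bracket expression
# "[ (陕西)]" would also strip every space and parenthesis from all names
CovidData$location <- gsub(" (陕西)", "", CovidData$location, fixed = TRUE)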
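
A bracket expression such as "[ (陕西)]" matches each listed character on its own, including the space, rather than the whole text " (陕西)". The sketch below compares the two patterns on made-up location values (the third value is hypothetical).

```r
# Hypothetical location values
loc <- c("Shenzhen, Guangdong", "Hong Kong", "Shaanxi (陕西)")

# Bracket expression: removes every space, parenthesis and listed character
charClass <- gsub("[ (陕西)]", "", loc)

# Literal match: removes only the exact text " (陕西)"
literal <- gsub(" (陕西)", "", loc, fixed = TRUE)

print(charClass)
print(literal)
```

With the bracket expression, multi-word names such as "Hong Kong" lose their space, which is usually not intended.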

4.13 Manipulating the reporting date variable

The reporting dates are in month/day/year format (e.g. 1/20/2020), so the reporting_date column was split into separate columns of:
1. reporting_month - The month the case was reported
2. reporting_day - The day the case was reported
3. reporting_year - The year the case was reported
To do this the separate() function was used, and the separation was done on the ( / ) symbol (separate).
Dropped the NA’s in the reporting_day variable.

# The first field of a month/day/year date is the month, so it is named first
CovidData <- separate(CovidData, reporting_date, 
                      into = c('reporting_month', 'reporting_day', 'reporting_year'), 
                      sep = "/")
# Reference - https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate
CovidData <- CovidData %>% drop_na(reporting_day)
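
The sketch below shows separate() on a made-up two-row data frame, assuming the dates are month/day/year as in the sample rows printed earlier. The data frame demo is hypothetical, and convert = TRUE is an optional extra that turns the new columns into integers in the same step.

```r
library(tidyr)

# Hypothetical reporting dates in month/day/year format
demo <- data.frame(id = 1:2,
                   reporting_date = c("1/20/2020", "2/3/2020"))

# Split on "/" and name the pieces in the order they appear in the text
parts <- separate(demo, reporting_date,
                  into = c("reporting_month", "reporting_day", "reporting_year"),
                  sep = "/",
                  convert = TRUE)

print(parts)
```

The names in into are assigned strictly by position, so they have to follow the order of the fields in the source text.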

4.14 Converting character columns to numeric

Expanded years written with only two digits (20 and 19) to 2020 or 2019 by prefixing them with 20 using the paste0() function.
Converted the character type columns of:
1. reporting_day
2. reporting_month
3. reporting_year, to numeric using as.numeric() function.

# Expand two-digit years ("20", "19") to four digits ("2020", "2019")
shortYear <- !is.na(CovidData$reporting_year) & nchar(CovidData$reporting_year) < 3
CovidData$reporting_year[shortYear] <- paste0("20", CovidData$reporting_year[shortYear])

CovidData$reporting_day <- as.numeric(CovidData$reporting_day)
CovidData$reporting_month <- as.numeric(CovidData$reporting_month)
CovidData$reporting_year <- as.numeric(CovidData$reporting_year)
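
As a worked example of the year expansion, the sketch below applies it to a made-up vector (year is hypothetical). It assumes every two-digit year belongs to the 2000s, which holds for the 2019 and 2020 dates in this dataset.

```r
# Hypothetical year strings: four digits, two digits and a missing value
year <- c("2020", "20", "19", NA)

# Only the non-missing two-digit entries need the "20" prefix
shortYear <- !is.na(year) & nchar(year) == 2
year[shortYear] <- paste0("20", year[shortYear])

# Convert to numeric once all entries have four digits
yearNumeric <- as.numeric(year)

print(yearNumeric)
```

Indexing with a logical vector changes only the selected entries and leaves the rest of the column untouched.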

4.15 Handling the symptom variable in the data set

Separated the text data using the separate() function, splitting on the (, ) pattern.
Replaced the NA’s with the (-) symbol using the is.na() function.
Used the nchar() function to find blank entries in symptom1 (fewer than 3 characters) and replaced them with the (-) symbol.
No imputation was attempted because missing-data imputation has not been covered yet.

CovidData <- separate(CovidData, symptom, 
                      into = c('symptom1', 
                               'symptom2', 
                               'symptom3', 
                               'symptom4', 
                               'symptom5'),
                      sep = ", ")

# Mark blank or missing first symptoms with '-'
blankSymptom <- is.na(CovidData$symptom1) | nchar(CovidData$symptom1) < 3
CovidData$symptom1[blankSymptom] <- '-'

CovidData$symptom2[is.na(CovidData$symptom2)] <- '-'
CovidData$symptom3[is.na(CovidData$symptom3)] <- '-'
CovidData$symptom4[is.na(CovidData$symptom4)] <- '-'
CovidData$symptom5[is.na(CovidData$symptom5)] <- '-'

4.16 Dropping NA

Dropped NA values from the following columns:
1. age
2. from_Wuhan
They were dropped because missing-data imputation has not been covered yet.

CovidData <- CovidData %>% drop_na(age)
CovidData <- CovidData %>% drop_na(from_Wuhan)

4.17 Converting NA dates to 01/01/0000

Replaced the missing (NA) dates with the impossible date 01/01/0000.
The rows were kept with this placeholder because missing-data imputation has not been covered yet.

CovidData$exposure_end[is.na(CovidData$exposure_end)] <- '01/01/0000'
CovidData$exposure_start[is.na(CovidData$exposure_start)] <- '01/01/0000'
CovidData$hosp_visit_date[is.na(CovidData$hosp_visit_date)] <- '01/01/0000'
CovidData$symptom_onset[is.na(CovidData$symptom_onset)] <- '01/01/0000'

4.18 Converting strings to proper date format using regex expression

Substituted the dates into the correct format for further processing using sub() and regex in:
1. exposure_end
2. exposure_start
3. hosp_visit_date
4. symptom_onset

# Apply the same substitutions to each of the four date columns
dateColumns <- c('exposure_end', 'exposure_start', 'hosp_visit_date', 'symptom_onset')
for (dateColumn in dateColumns) {
  dates <- CovidData[[dateColumn]]
  dates <- sub('(^1/)', '01/', dates)     # pad month 1 to 01
  dates <- sub('(^2/)', '02/', dates)     # pad month 2 to 02
  dates <- sub('(/20$)', '/2020', dates)  # expand year 20 to 2020
  dates <- sub('(/19$)', '/2019', dates)  # expand year 19 to 2019
  CovidData[[dateColumn]] <- dates
}

4.19 Converting string date in Month/day/year to date time column

Converting the formatted character variables to Date objects using the mdy() function from the lubridate library.
Following columns are converted:
1. exposure_end
2. exposure_start
3. hosp_visit_date
4. symptom_onset

CovidData$exposure_end <- mdy(CovidData$exposure_end)
CovidData$exposure_start <- mdy(CovidData$exposure_start)
CovidData$hosp_visit_date <- mdy(CovidData$hosp_visit_date)
CovidData$symptom_onset <- mdy(CovidData$symptom_onset)
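
The two steps above (normalising the year, then parsing) can be illustrated with base R alone. The sketch below is an alternative to mdy(), not the method used in this report: it uses as.Date() with an explicit format on a made-up vector (raw is hypothetical) and keeps missing dates as NA instead of a placeholder date.

```r
# Hypothetical date strings in the mixed formats seen in the data
raw <- c("01/03/20", "1/15/2020", "12/29/2019", NA)

# Expand a trailing two-digit year to four digits (assumes the 2000s)
fourDigit <- sub("/([0-9]{2})$", "/20\\1", raw)

# Parse month/day/year; single-digit months are accepted by %m
parsed <- as.Date(fourDigit, format = "%m/%d/%Y")

print(parsed)
```

Keeping NA for unknown dates means later date arithmetic returns NA for those rows rather than a misleading interval.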

4.20 Saving the new data

Saving the processed data in the dataframe CovidData as CovidData.csv using the write_csv() function from the readr library.

write_csv(CovidData, 'CovidData.csv')

5. Subsetting I

Subsetted the dataframe to its first 10 rows by specifying the first and last row index separated by ( : ); leaving the position after the comma (,) empty means that all the variables are taken.
Converted the dataframe to a matrix using the as.matrix() function.
The resulting matrix is of type character.
A matrix can hold only one data type, i.e. it is entirely integer, or entirely character, or entirely logical, etc.
When the columns have different types, they are coerced to the most general type, following the order logical < integer < numeric < character.
Because the dataframe contains character columns, and text generally cannot be converted to integer or logical values, as.matrix() converted every variable to character so that all the data shares one common type.
str(CovidMatrix) shows that the matrix has a total of 10 observations and 58 variables.

CovidMatrix <- as.matrix(CovidData[1:10,])
print(paste('Type of data is: ', typeof(CovidMatrix)))

## [1] "Type of data is:  character"

str(CovidMatrix)

##  chr [1:10, 1:58] " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:58] "id" "case_in_country" "reporting_day" "reporting_month" ...
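
The coercion order can be checked on small made-up values. In the sketch below, c() shows each step of the order, and as.matrix() shows that one character column is enough to turn the whole matrix into character (mixed is a hypothetical data frame).

```r
# Each combination is coerced to the more general of the two types
stepInteger   <- typeof(c(TRUE, 1L))   # logical + integer
stepDouble    <- typeof(c(1L, 2.5))    # integer + numeric
stepCharacter <- typeof(c(2.5, "a"))   # numeric + character

# Hypothetical data frame with integer, logical and character columns
mixed <- data.frame(id = 1:2,
                    recovered = c(TRUE, FALSE),
                    country = c("China", "Japan"))
matrixType <- typeof(as.matrix(mixed))

print(c(stepInteger, stepDouble, stepCharacter, matrixType))
```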

6. Subsetting II

Subsetted the first and last variables from the dataframe by passing their column indices as a vector created with c().
Saved the result as an R object named CovidDataRObject using the save() function, passing CovidDataRObject.RData as the file argument to create an RData file.
Loaded the R object using the load() function.

# Select the first and last columns; as.data.frame() makes the column
# indexing behave the same whether CovidData is a data.table or a data.frame
CovidDataRObject <- as.data.frame(CovidData)[, c(1, ncol(CovidData))]
save(CovidDataRObject, file = 'CovidDataRObject.RData')
load('CovidDataRObject.RData')
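
The save() and load() round trip can be tried on a small made-up data frame. In the sketch below, demo, firstLast and the temporary file path are all hypothetical; tempfile() is used so that nothing is written to the working directory.

```r
# Hypothetical data frame with three variables
demo <- data.frame(id = 1:3,
                   age = c(66, 56, 46),
                   country = c("China", "China", "Japan"))

# First and last columns, selected by position
firstLast <- demo[, c(1, ncol(demo))]

# Save to a temporary .RData file, remove the object, then restore it
path <- tempfile(fileext = ".RData")
save(firstLast, file = path)
rm(firstLast)
loadedNames <- load(path)

print(loadedNames)   # names of the objects that load() restored
print(firstLast)
```

load() restores objects under the names they were saved with and returns those names, so no assignment is needed.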

7. Create a new Data Frame

7.1 Creating a new dataframe from scratch

A new data frame named IndipendentDf is created with the variables:
1. PersonID - (Integer) 10 random integer values from 1 to 100, drawn with the sample() function (R Language Generating Random Integers).
2. Power - (Character, then Factor) 10 random character values from (‘Low’, ‘Mid’, ‘High’)
Converted the Power variable to a factor using the transform() function, with the order “Low” < “Mid” < “High”, which results in an ordinal variable.

IndipendentDf <- data.frame(PersonID = sample(1:100, 10), 
                            Power = sample(c('Low', 'Mid', 'High'), 
                                           10, 
                                           replace=TRUE))

IndipendentDf <- transform(IndipendentDf,Power = factor(Power, 
                                                        levels = c('Low', 'Mid', 'High'), 
                                                        ordered=TRUE))
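
What the ordering adds can be seen in a short sketch on a made-up vector (power is hypothetical): an ordered factor supports comparisons and max(), which an unordered factor does not.

```r
# Hypothetical ordinal variable with the same levels as Power
power <- factor(c("Mid", "High", "Low"),
                levels = c("Low", "Mid", "High"),
                ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
belowHigh <- power < "High"
highest <- as.character(max(power))

print(belowHigh)
print(highest)
```

Without levels = c('Low', 'Mid', 'High') the levels would be sorted alphabetically, which would place High before Low.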

7.2 Creating another numeric vector

Created a new vector named NewColumn that holds 10 values drawn from a uniform distribution between 1 and 10 using the runif() function.
Combined IndipendentDf and NewColumn into the same dataframe using the cbind() function, which combines the data column-wise.

NewColumn <- c(runif(10, 1, 10))
IndipendentDf <- cbind(IndipendentDf, NewColumn)

7.3 Checking the new dataframe

The resulting data object has 10 observations and 3 variables.
Checking the structure of IndipendentDf using the str() function.
Checking the attributes of IndipendentDf using the attributes() function.

print(paste('Number of Observations in the dataframe is :', dim(IndipendentDf)[1], " || 
            ",'Number of Variables in the dataframe is :', dim(IndipendentDf)[2]))

## [1] "Number of Observations in the dataframe is : 10  || \n             Number of Variables in the dataframe is : 3"

str(IndipendentDf)

## 'data.frame':    10 obs. of  3 variables:
##  $ PersonID : int  78 20 77 26 1 22 61 28 41 83
##  $ Power    : Ord.factor w/ 3 levels "Low"<"Mid"<"High": 2 3 3 2 1 2 2 1 2 1
##  $ NewColumn: num  9.99 9.14 2.62 6.08 3.16 ...

attributes(IndipendentDf)

## $names
## [1] "PersonID"  "Power"     "NewColumn"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10

8. REFERENCES

What you need to know about coronavirus (COVID-19), n.d., viewed 24 March 2020, <https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/what-you-need-to-know-about-coronavirus-covid-19>
2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE, n.d., viewed 24 March 2020, <https://github.com/CSSEGISandData/COVID-19>
SRK, n.d., viewed 24 March 2020, <https://www.kaggle.com/sudalairajkumar>
fread by M Dowle, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/data.table/versions/1.8.8/topics/fread>
R-core, dim, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim>
Novel Corona Virus 2019 Dataset, n.d., viewed 24 March 2020, <https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data#>
R-core, head, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head>
R-core, transform, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/transform>
Ben Gorman, one_hot, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/mltools/versions/0.3.5/topics/one_hot>
Gskinner, regex, n.d., viewed 24 March 2020, <https://regexr.com>
Hadley Wickham, separate, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate>
James D. McCaffrey, R Language Generating Random Integers, 11 June 2016, viewed 24 March 2020, <https://jamesmccaffrey.wordpress.com/2016/06/11/r-language-generating-random-integers>
Introduction to data.table, 8 December 2019, viewed 24 March 2020, <https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html>
tidyverse: Easily Install and Load the ‘Tidyverse’, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/tidyverse/index.html>
Do more with dates and times in R, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html>
mltools: Machine Learning Tools, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/mltools/index.html>

MATH2349 Semester 1, 2020

Assignment 1

Rashbir Singh Kohli
