Student number: s3810585

Student email: s3810585@rmit.student.edu.au


1. Setup


library(data.table)
library(tidyverse)
library(lubridate)
library(mltools)

2. Data Description


Variable Name            Description
id                       Index number of each record
case_in_country          Case number within the country
reporting date           Date on which the case was reported
V4                       Variable with null data
summary                  Text that summarises the report for that case
location                 Name of the city
country                  Name of the country
gender                   Sex (either male or female)
age                      Age of the patient
symptom_onset            Date on which the symptoms started showing
If_onset_approximated    Whether the symptom onset date was approximated (1/0)
hosp_visit_date          Date the patient visited hospital
exposure_start           Date the exposure started
exposure_end             Date the exposure ended
visiting Wuhan           The patient was visiting Wuhan (1/0)
from Wuhan               The patient was from Wuhan (1/0)
death                    Date the patient died, or whether the patient died (date/1/0)
recovered                Date the patient recovered, or whether the patient recovered (date/1/0)
symptom                  The symptoms that the patient showed
source                   Source of the data
link                     URL to the data
V22                      Variable with null data
V23                      Variable with null data
V24                      Variable with null data
V25                      Variable with null data
V26                      Variable with null data
V27                      Variable with null data

3. Read/Import Data


  • The COVID19_line_list_data.csv dataset was read with the fread() function from the data.table library.
  • fread() reads files quickly and automatically detects attributes such as:
    1. Separator
    2. Column class
    3. Number of rows
  • It also does not automatically convert string columns to factors.
  • The main reasons for using this function were its read speed on large files and its ability to detect each column’s class without extra arguments (fread by M Dowle).
  • The first two rows of the data were shown with the head() function (head).
CovidData <- fread("COVID19_line_list_data.csv")
head(CovidData, 2)
##    id case_in_country reporting date V4
## 1:  1              NA      1/20/2020 NA
## 2:  2              NA      1/20/2020 NA
##                                                                                                                                                                                                                                                                                                                                                                                                                 summary
## 1: First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, visited relatives in Wuhan on 12/29/2019, symptoms onset on 01/03/2020, returned to Shenzhen and seek medical care on 01/04/2020, hospitalized on 01/11/2020, sample sent to China CDC for testing on 01/18/2020, confirmed on 01/19/2020. 8 others under medical observation, contact tracing ongoing.
## 2:                                                                                                                                                                    First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arrived in Shanghai from Wuhan on 01/12/2020, symptom onset and visited fever clinic on 01/15/2020, laboratory confirmed on 01/20/2020
##               location country gender age symptom_onset
## 1: Shenzhen, Guangdong   China   male  66      01/03/20
## 2:            Shanghai   China female  56     1/15/2020
##    If_onset_approximated hosp_visit_date exposure_start exposure_end
## 1:                     0        01/11/20     12/29/2019     01/04/20
## 2:                     0       1/15/2020           <NA>     01/12/20
##    visiting Wuhan from Wuhan death recovered symptom
## 1:              1          0     0         0        
## 2:              0          1     0         0        
##                                                    source
## 1:                   Shenzhen Municipal Health Commission
## 2: Official Weibo of Shanghai Municipal Health Commission
##                                                                                                              link
## 1:                                                         http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm
## 2: https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment
##    V22 V23 V24 V25 V26 V27
## 1:  NA  NA  NA  NA  NA  NA
## 2:  NA  NA  NA  NA  NA  NA
#Reference - https://www.rdocumentation.org/packages/data.table/versions/1.8.8/topics/fread
#            https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head
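The detection behaviour described above can be checked on a small sample. This is a minimal sketch, assuming a hypothetical four-column extract typed in by hand (the values are illustrative, not read from the real file):

```r
library(data.table)

# Hypothetical sample lines in the same shape as the first columns of the file
csv_lines <- c("id,reporting date,country,age",
               "1,1/20/2020,China,66",
               "2,1/20/2020,China,56",
               "3,1/21/2020,Japan,46")

# fread() works out the separator and the column classes on its own
sample_dt <- fread(text = csv_lines)

# Text columns stay character; they are not converted to factors
sapply(sample_dt, class)
head(sample_dt, 2)
```

Checking sapply(sample_dt, class) right after the import is a quick way to confirm that the guessed classes are the expected ones.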

4. Inspect and Understand


4.1 Checking the dimensions of the data frame.

  • Printed the dimensions of the dataset, where:
    1. the value at index one is the number of rows and
    2. the value at index two is the number of columns.
  • The dimensions were found with the dim() function from the base library, which takes a single R object as its argument (dim).
dim(CovidData)
## [1] 1085   27
#Reference - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim

4.2 Summarizing

  • Summarised the variables by checking their data types (i.e., character, numeric, integer, factor, and logical).
  • Checked the column names in the data frame, to rename them if required.
  • Variable names were printed using the names() function.
  • The structure of the data was summarised using the str() function.
  • The attributes of the first 10 rows were checked using the attributes() function.
print(names(CovidData))
##  [1] "id"                    "case_in_country"      
##  [3] "reporting date"        "V4"                   
##  [5] "summary"               "location"             
##  [7] "country"               "gender"               
##  [9] "age"                   "symptom_onset"        
## [11] "If_onset_approximated" "hosp_visit_date"      
## [13] "exposure_start"        "exposure_end"         
## [15] "visiting Wuhan"        "from Wuhan"           
## [17] "death"                 "recovered"            
## [19] "symptom"               "source"               
## [21] "link"                  "V22"                  
## [23] "V23"                   "V24"                  
## [25] "V25"                   "V26"                  
## [27] "V27"
str(CovidData)
## Classes 'data.table' and 'data.frame':   1085 obs. of  27 variables:
##  $ id                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ case_in_country      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reporting date       : chr  "1/20/2020" "1/20/2020" "1/21/2020" "1/21/2020" ...
##  $ V4                   : logi  NA NA NA NA NA NA ...
##  $ summary              : chr  "First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, vi"| __truncated__ "First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arri"| __truncated__ "First confirmed imported cases in Zhejiang: patient is male, 46, lives in Wuhan, self-driving from Wuhan to Han"| __truncated__ "new confirmed imported COVID-19 pneumonia in Tianjin: female, age 60, recently visited Wuhan, visited fever cli"| __truncated__ ...
##  $ location             : chr  "Shenzhen, Guangdong" "Shanghai" "Zhejiang" "Tianjin" ...
##  $ country              : chr  "China" "China" "China" "China" ...
##  $ gender               : chr  "male" "female" "male" "female" ...
##  $ age                  : num  66 56 46 60 58 44 34 37 39 56 ...
##  $ symptom_onset        : chr  "01/03/20" "1/15/2020" "01/04/20" NA ...
##  $ If_onset_approximated: int  0 0 0 NA NA 0 0 0 0 0 ...
##  $ hosp_visit_date      : chr  "01/11/20" "1/15/2020" "1/17/2020" "1/19/2020" ...
##  $ exposure_start       : chr  "12/29/2019" NA NA NA ...
##  $ exposure_end         : chr  "01/04/20" "01/12/20" "01/03/20" NA ...
##  $ visiting Wuhan       : int  1 0 0 1 0 0 0 1 1 1 ...
##  $ from Wuhan           : int  0 1 1 0 0 1 1 0 0 0 ...
##  $ death                : chr  "0" "0" "0" "0" ...
##  $ recovered            : chr  "0" "0" "0" "0" ...
##  $ symptom              : chr  "" "" "" "" ...
##  $ source               : chr  "Shenzhen Municipal Health Commission" "Official Weibo of Shanghai Municipal Health Commission" "Health Commission of Zhejiang Province" "人民日报官方微博" ...
##  $ link                 : chr  "http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm" "https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment" "http://www.zjwjw.gov.cn/art/2020/1/21/art_1202101_41786033.html" "https://m.weibo.cn/status/4463235401268457?" ...
##  $ V22                  : logi  NA NA NA NA NA NA ...
##  $ V23                  : logi  NA NA NA NA NA NA ...
##  $ V24                  : logi  NA NA NA NA NA NA ...
##  $ V25                  : logi  NA NA NA NA NA NA ...
##  $ V26                  : logi  NA NA NA NA NA NA ...
##  $ V27                  : logi  NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
attributes(CovidData[1:10,])
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $class
## [1] "data.table" "data.frame"
## 
## $.internal.selfref
## <pointer: 0x7feda600d4e0>
## 
## $names
##  [1] "id"                    "case_in_country"      
##  [3] "reporting date"        "V4"                   
##  [5] "summary"               "location"             
##  [7] "country"               "gender"               
##  [9] "age"                   "symptom_onset"        
## [11] "If_onset_approximated" "hosp_visit_date"      
## [13] "exposure_start"        "exposure_end"         
## [15] "visiting Wuhan"        "from Wuhan"           
## [17] "death"                 "recovered"            
## [19] "symptom"               "source"               
## [21] "link"                  "V22"                  
## [23] "V23"                   "V24"                  
## [25] "V25"                   "V26"                  
## [27] "V27"

4.3 Dropping columns that have fewer than 2 unique values

  • Used a for loop to iterate over the variables in the dataset.
  • Dropped a column/variable if it has fewer than 2 unique values, which resulted in the following columns being dropped:
    1. V4
    2. V22
    3. V23
    4. V24
    5. V25
    6. V26
    7. V27
for (headers in colnames(CovidData)){
  # A column with fewer than 2 unique values carries no information
  if( length(unique(CovidData[[headers]])) < 2 ){
    print(paste('Column ' ,headers, ' is dropped.'))
    # all_of() states explicitly that headers is an external character vector
    CovidData <- select(CovidData, -all_of(headers)) } }
## [1] "Column  V4  is dropped."
## [1] "Column  V22  is dropped."
## [1] "Column  V23  is dropped."
## [1] "Column  V24  is dropped."
## [1] "Column  V25  is dropped."
## [1] "Column  V26  is dropped."
## [1] "Column  V27  is dropped."
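The same check can also be written without a loop. A minimal sketch in base R, assuming a hypothetical toy data frame in which V4 holds only NA:

```r
# Hypothetical toy data frame: V4 has a single unique value (NA)
toy <- data.frame(id = 1:3,
                  V4 = c(NA, NA, NA),
                  country = c("China", "Japan", "China"))

# Count the unique values of every column in one pass
n_unique <- sapply(toy, function(col) length(unique(col)))

# Keep only the columns with at least two distinct values
toy_clean <- toy[, n_unique >= 2, drop = FALSE]
names(toy_clean)
```

Computing the counts first also makes it easy to inspect n_unique before anything is removed.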

4.4 Replacing any whitespace in headers with underscores

  • A few variable names contain a white space.
  • Each white space was substituted with an underscore ( _ ) using the gsub() function.
  • The new variable names were assigned using the names() function.
headerNew <- gsub(" ", "_", names(CovidData))
names(CovidData) <- headerNew

4.5 Replaced NA’s in case_in_country

  • Replaced NA’s with an impossible value of -1.
  • A missing (NA) data point denotes that no case had been registered in that country up to that date in the data.
  • This column can give valuable insight into when the outbreak was first discovered in a particular country.
  • The is.na() function was used to find the NA values.
  • A total of 197 missing values were found and changed to -1.
  • They were flagged rather than imputed because we didn’t cover the missing data imputation part yet.
print(paste('Number of missing values: ', 
            dim(CovidData[is.na(CovidData$case_in_country)])[1]))
## [1] "Number of missing values:  197"
CovidData$case_in_country[is.na(CovidData$case_in_country)] <- -1

4.6 Replaced NA’s in If_onset_approximated

  • Replaced NA’s with an impossible value of -1.
  • An NA data point denotes a missing value, and there are a total of 525 missing values.
  • Replacing them with an impossible value keeps these rows in the data, as this variable may or may not give valuable insights during analysis.
  • The is.na() function was used to find the NA values.
  • They were flagged rather than imputed because we didn’t cover the missing data imputation part yet.
print(paste('Number of missing values: ',
            dim(CovidData[is.na(CovidData$If_onset_approximated)])[1]))
## [1] "Number of missing values:  525"
CovidData$If_onset_approximated[is.na(CovidData$If_onset_approximated)] <- -1
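One thing to keep in mind with a sentinel such as -1 is that it behaves like a real number in later calculations. A minimal sketch, assuming hypothetical values standing in for a numeric column:

```r
# Hypothetical values standing in for a numeric column with gaps
cases <- c(3, NA, 5, NA, 10)

# Sentinel approach: NA becomes -1
cases_sentinel <- cases
cases_sentinel[is.na(cases_sentinel)] <- -1

# The sentinel silently enters numeric summaries
mean(cases_sentinel)                        # 3.2

# so it has to be filtered out again before summarising
mean(cases_sentinel[cases_sentinel != -1])  # 6

# Keeping NA and skipping it gives the same result
mean(cases, na.rm = TRUE)                   # 6
```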

4.7 Encoding the gender column

  • Most machine learning models cannot interpret plain text, so text data is encoded (converted to numerical form by giving each distinct text value a unique number) before analysis.
  • Males are encoded as 0 and females are encoded as 1.
  • A total of 183 missing values were found and dropped using the drop_na() function from the tidyverse library.
  • They were dropped because we didn’t cover the missing data imputation part yet.
CovidData$gender <- gsub("female", 1, CovidData$gender)
CovidData$gender <- gsub("male", 0, CovidData$gender)
print(paste('Number of missing values: ', dim(CovidData[is.na(CovidData$gender)])[1]))
## [1] "Number of missing values:  183"
CovidData <- CovidData %>% drop_na(gender)
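The order of the two gsub() calls matters because "male" is a substring of "female". A minimal sketch, assuming a hypothetical gender vector, that shows the problem and one order-independent alternative (a named lookup vector, which is not what the code above uses):

```r
gender <- c("male", "female", NA, "female")

# Replacing "male" first would corrupt "female"
gsub("male", 0, gender)   # "female" becomes "fe0"

# A named lookup vector maps whole values, so the order does not matter
codes <- c(male = "0", female = "1")
gender_encoded <- unname(codes[gender])
gender_encoded
```

Missing values stay NA in the lookup version, so they can still be counted and dropped afterwards.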

4.8 If the patient has recovered, changing the data point to 1

  • This column has 3 types of data points.
    1. Dates - date on which the patient recovered
    2. 0 - The patient has not recovered
    3. 1 - The patient has recovered
  • A date therefore denotes that the patient has successfully recovered (i.e., 1).
  • So, dates were replaced with 1.
  • Used the gsub() function to substitute ( / ) with "", i.e., removed the ( / ) sign.
  • Then used a for loop to check the character length of each data point using the nchar() function.
  • If the length is greater than 3, used gsub() with a regex pattern (regex) to substitute the digits with 1.
CovidData$recovered <- gsub('/', "", CovidData$recovered)
for(count in 1:dim(CovidData)[1]){
  if((nchar(CovidData$recovered[count]) > 3) == TRUE){
    CovidData$recovered[count] <- gsub('[0-9]+', 1, CovidData$recovered[count]) } }

4.9 If the patient died, changing the data point to 1

  • This column has 3 types of data points.
    1. Dates - date on which the patient died
    2. 0 - the patient is alive
    3. 1 - the patient died
  • A date therefore denotes that the patient died (i.e., 1).
  • So, dates were replaced with 1.
  • Used the gsub() function to substitute ( / ) with "", i.e., removed the ( / ) sign.
  • Then used a for loop to check the character length of each data point using the nchar() function.
  • If the length is greater than 3, used gsub() with a regex pattern (regex) to substitute the digits with 1.
CovidData$death <- gsub('/', "", CovidData$death) 
for(count in 1:dim(CovidData)[1]){
  if((nchar(CovidData$death[count]) > 3) == TRUE){
    CovidData$death[count] <- gsub('[0-9]+', 1, CovidData$death[count]) } }
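The date-to-1 conversion used for recovered and death can also be done for the whole column at once. A minimal sketch, assuming a hypothetical status vector and treating any entry that contains "/" as a date:

```r
# Hypothetical column mixing 0/1 flags and dates
status <- c("0", "1", "2/12/2020", "0", "02/21/20")

# Any entry containing "/" is a date, which means the event happened
status[grepl("/", status, fixed = TRUE)] <- "1"
status
```

Testing for the "/" directly avoids relying on the length of the string after the slashes are removed.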

4.10 Transforming columns into factors

  • Changed the following columns to factors:
    1. gender
    2. from_Wuhan
    3. visiting_Wuhan
    4. death
    5. recovered
    6. country
  • Used the transform() function for this (transform).
CovidData <- transform(CovidData, gender=factor(gender), 
                       from_Wuhan=factor(from_Wuhan), 
                       visiting_Wuhan=factor(visiting_Wuhan), 
                       death=factor(death), 
                       recovered=factor(recovered), 
                       country=factor(country) )
# Reference - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/transform

4.11 One-hot encoding the countries

  • As in 4.7, the text values have to be converted to numerical form before modelling.
  • Used the one_hot() function from the mltools library (one_hot).
  • There are a total of 33 countries in this dataset.
  • Each of the 33 countries becomes its own variable of 1’s and 0’s: a row has 1 in the column of its country and 0 in all the others.
  • The encoded data is saved into a new dataframe named encodedCountry.
  • In one-hot encoding one column is redundant, because it can be inferred from the others (a row with 0 in every remaining column must belong to the dropped country), so dropping it removes the redundancy without losing information.
  • So, the last column was dropped from encodedCountry.
  • Dropped the country variable as it has been encoded.
  • Used the cbind() function, which joins two dataframes column-wise.
print(unique(CovidData$country))
##  [1] China       France      Japan       Malaysia    Nepal      
##  [6] Singapore   South Korea Taiwan      Thailand    USA        
## [11] Vietnam     Australia   Canada      Cambodia    Sri Lanka  
## [16] Germany     UAE         Hong Kong   Italy       Russia     
## [21] UK          India       Phillipines Finland     Spain      
## [26] Sweden      Israel      Lebanon     Kuwait      Bahrain    
## [31] Algeria     Croatia     Switzerland
## 33 Levels: Algeria Australia Bahrain Cambodia Canada China ... Vietnam
encodedCountry <- one_hot(as.data.table(CovidData$country))
# Drop the last indicator column; it is implied when all remaining columns are 0
encodedCountry <- select(encodedCountry, -ncol(encodedCountry))
CovidData <- select(CovidData,-c(country))
CovidData <- cbind(CovidData, encodedCountry)
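What one-hot encoding produces, and why one column can be dropped, can be seen on a small factor. A minimal sketch in base R (not the mltools implementation), assuming a hypothetical four-row country factor:

```r
country <- factor(c("China", "Japan", "China", "France"))

# One 0/1 indicator column per level
encoded <- sapply(levels(country), function(lv) as.integer(country == lv))
encoded

# Drop one column (here the last); a row of all zeros then stands for that level
encoded_reduced <- encoded[, -ncol(encoded), drop = FALSE]
encoded_reduced
```

In encoded every row sums to 1, which is exactly the redundancy that dropping one column removes.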

4.12 Processing the location column

  • Cleaned the location variable.
  • Removed commas and the values after commas.
  • Removed names of states written in languages other than English.
  • All of this was done using regex and the gsub() function.
# Strip the province names that follow the city name
CovidData$location <- gsub(" Guangdong", "", CovidData$location)
CovidData$location <- gsub(" Hubei", "", CovidData$location)
CovidData$location <- gsub( " Guangxi", "", CovidData$location)
CovidData$location <- gsub( "-", "", CovidData$location)
CovidData$location <- gsub("[,]", "", CovidData$location)
# Remove only the bracketed name; a character class such as "[ (陕西)]" would also delete every space
CovidData$location <- gsub(" ?\\(陕西\\)", "", CovidData$location)
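Removing "a comma and everything after it" can also be expressed as a single anchored pattern instead of one call per province. A minimal sketch, assuming hypothetical location values (the bracketed suffix is written in ASCII here for illustration):

```r
location <- c("Shenzhen, Guangdong", "Hong Kong", "Wuhan, Hubei", "Xi'an (Shaanxi)")

# Remove the comma and everything after it
location <- sub(",.*$", "", location)

# Remove a bracketed suffix; spaces inside names are kept
location <- sub(" \\(.*\\)$", "", location)
location
```

Because both patterns are anchored to the end of the string, names such as "Hong Kong" are left untouched.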

4.13 Manipulating the reporting date variable

  • Split the reporting_date column into separate columns:
    1. reporting_day - The day the case was reported
    2. reporting_month - The month the case was reported
    3. reporting_year - The year the case was reported
  • The separate() function was used for this, splitting on the ( / ) symbol (separate).
  • Dropped the NA’s in the reporting_day variable.
# reporting_date is in month/day/year order (e.g. "1/20/2020"), so the month comes first
CovidData <- separate(CovidData, reporting_date, 
                      into = c('reporting_month', 'reporting_day', 'reporting_year'), 
                      sep = "/")
# Reference - https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate
CovidData <- CovidData %>% drop_na(reporting_day)
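The day, month and year can also be taken from a parsed date, which avoids depending on the position of each piece in the string. A minimal sketch in base R, assuming hypothetical dates in the month/day/year format seen in the data:

```r
reporting_date <- c("1/20/2020", "2/5/2020")

# Parse as month/day/four-digit year
parsed <- as.Date(reporting_date, format = "%m/%d/%Y")

# Extract each part by name rather than by position
parts <- data.frame(
  reporting_day   = as.integer(format(parsed, "%d")),
  reporting_month = as.integer(format(parsed, "%m")),
  reporting_year  = as.integer(format(parsed, "%Y"))
)
parts
```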

4.14 Converting character columns to numeric

  • Expanded years given as only 20 or 19 to 2020 or 2019 using the sub() function.
  • Converted the following character type columns to numeric using the as.numeric() function:
    1. reporting_day
    2. reporting_month
    3. reporting_year
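# Expand two-digit years ("20", "19") to four digits ("2020", "2019"),
# changing only the affected entries rather than the whole column
shortYear <- which(nchar(CovidData$reporting_year) < 3)
CovidData$reporting_year[shortYear] <- sub("^", "20", CovidData$reporting_year[shortYear])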

CovidData$reporting_day <- as.numeric(CovidData$reporting_day)
CovidData$reporting_month <- as.numeric(CovidData$reporting_month)
CovidData$reporting_year <- as.numeric(CovidData$reporting_year)

4.15 Handling the symptom variable in the data set

  • Separated the text data using the separate() function, splitting on the (, ) pattern.
  • Replaced the NA’s with the (-) symbol using the is.na() function.
  • Checked symptom1 for blank entries shorter than 2 characters using the nchar() function and replaced them with the (-) symbol.
  • They were flagged rather than imputed because we didn’t cover the missing data imputation part yet.
CovidData <- separate(CovidData, symptom, 
                      into = c('symptom1', 
                               'symptom2', 
                               'symptom3', 
                               'symptom4', 
                               'symptom5'),
                      sep = ", ")

# Replace blank (or missing) first symptoms with '-', entry by entry
blank <- is.na(CovidData$symptom1) | nchar(trimws(CovidData$symptom1)) < 2
CovidData$symptom1[blank] <- '-'

CovidData$symptom2[is.na(CovidData$symptom2)] <- '-'
CovidData$symptom3[is.na(CovidData$symptom3)] <- '-'
CovidData$symptom4[is.na(CovidData$symptom4)] <- '-'
CovidData$symptom5[is.na(CovidData$symptom5)] <- '-'
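The split-and-pad idea can be shown in base R on a few values. A minimal sketch, assuming hypothetical symptom strings and the same five-column limit:

```r
symptom <- c("fever, cough", "", "fever, cough, chills, sore throat")

# Split every entry on ", "
pieces <- strsplit(symptom, ", ", fixed = TRUE)

# Pad every case to exactly five entries, using "-" for missing symptoms
padded <- t(sapply(pieces, function(p) c(p, rep("-", 5))[1:5]))
colnames(padded) <- paste0("symptom", 1:5)
padded
```

An empty string yields no pieces at all, so that row is filled with "-" in every column.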

4.16 Dropping NA

  • Dropped NA values from the following columns:
    1. age
    2. from_Wuhan
  • They were dropped because we didn’t cover the missing data imputation part yet.
CovidData <- CovidData %>% drop_na(age)
CovidData <- CovidData %>% drop_na(from_Wuhan)

4.17 Converting NA dates to 01/01/0000

  • Replaced the missing (NA) dates with the impossible date 01/01/0000.
  • They were flagged rather than imputed because we didn’t cover the missing data imputation part yet.
CovidData$exposure_end[is.na(CovidData$exposure_end)] <- '01/01/0000'
CovidData$exposure_start[is.na(CovidData$exposure_start)] <- '01/01/0000'
CovidData$hosp_visit_date[is.na(CovidData$hosp_visit_date)] <- '01/01/0000'
CovidData$symptom_onset[is.na(CovidData$symptom_onset)] <- '01/01/0000'

4.18 Converting strings to a proper date format using regex

  • Substituted the dates into a consistent format for further processing, using sub() and regex, in:
    1. exposure_end
    2. exposure_start
    3. hosp_visit_date
    4. symptom_onset
CovidData$exposure_end <- sub('(^1/)', '01/', CovidData$exposure_end)
CovidData$exposure_start <- sub('(^1/)', '01/', CovidData$exposure_start)
CovidData$hosp_visit_date <- sub('(^1/)', '01/', CovidData$hosp_visit_date)
CovidData$symptom_onset <- sub('(^1/)', '01/', CovidData$symptom_onset)
CovidData$exposure_end <- sub('(^2/)', '02/', CovidData$exposure_end)
CovidData$exposure_start <- sub('(^2/)', '02/', CovidData$exposure_start)
CovidData$hosp_visit_date <- sub('(^2/)', '02/', CovidData$hosp_visit_date)
CovidData$symptom_onset <- sub('(^2/)', '02/', CovidData$symptom_onset)
CovidData$exposure_end <- sub('(/20$)', '/2020', CovidData$exposure_end)
CovidData$exposure_start <- sub('(/20$)', '/2020', CovidData$exposure_start)
CovidData$hosp_visit_date <- sub('(/20$)', '/2020', CovidData$hosp_visit_date)
CovidData$symptom_onset <- sub('(/20$)', '/2020', CovidData$symptom_onset)
CovidData$exposure_end <- sub('(/19$)', '/2019', CovidData$exposure_end)
CovidData$exposure_start <- sub('(/19$)', '/2019', CovidData$exposure_start)
CovidData$hosp_visit_date <- sub('(/19$)', '/2019', CovidData$hosp_visit_date)
CovidData$symptom_onset <- sub('(/19$)', '/2019', CovidData$symptom_onset)
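The mixed two-digit and four-digit years can also be handled at parse time. A minimal sketch in base R, assuming a hypothetical helper named parse_mdy (not part of the code above) and month/day/year input:

```r
# Hypothetical helper: picks the year format from the shape of each string
parse_mdy <- function(x) {
  four_digit <- grepl("/[0-9]{4}$", x)
  out <- rep(as.Date(NA), length(x))
  # %Y expects a four-digit year, %y a two-digit year (20 becomes 2020)
  out[four_digit]  <- as.Date(x[four_digit],  format = "%m/%d/%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%m/%d/%y")
  out
}

dates <- parse_mdy(c("01/03/20", "1/15/2020", "12/29/2019", NA))
dates
```

Choosing the format per string matters because "%Y" applied to "20" would be read as the year 20 rather than 2020. Missing inputs stay NA here instead of being given a placeholder date.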

4.19 Converting string dates in month/day/year format to date columns

  • Converted the formatted character variables to Date objects using the mdy() function from the lubridate library.
  • Following columns are converted:
    1. exposure_end
    2. exposure_start
    3. hosp_visit_date
    4. symptom_onset
CovidData$exposure_end <- mdy(CovidData$exposure_end)
CovidData$exposure_start <- mdy(CovidData$exposure_start)
CovidData$hosp_visit_date <- mdy(CovidData$hosp_visit_date)
CovidData$symptom_onset <- mdy(CovidData$symptom_onset)

4.20 Saving the new data

  • Saved the processed data in the dataframe CovidData as CovidData.csv using the write_csv() function from the readr library.
write_csv(CovidData, 'CovidData.csv')

5. Subsetting I


CovidMatrix <- as.matrix(CovidData[1:10,])
print(paste('Type of data is: ', typeof(CovidMatrix)))
## [1] "Type of data is:  character"
str(CovidMatrix)
##  chr [1:10, 1:58] " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:58] "id" "case_in_country" "reporting_day" "reporting_month" ...

6. Subsetting II


CovidDataRObject <- CovidData[c(1,57)]
save(CovidDataRObject, file = 'CovidDataRObject.RData')
load('CovidDataRObject.RData')

7. Create a new Data Frame


7.1 Creating a new dataframe from scratch

  • A new data frame named IndipendentDf is created with the variables:
    1. PersonID - (Integer) 10 random integer values from 1 to 100, drawn with the sample() function (R Language Generating Random Integers).
    2. Power - (Character) (Factor) 10 random character values from (‘Low’, ‘Mid’, ‘High’)
  • Converted the Power variable to a factor using the transform() function, with the order “Low” < “Mid” < “High”, which results in an ordinal variable.
IndipendentDf <- data.frame(PersonID = sample(1:100, 10), 
                            Power = sample(c('Low', 'Mid', 'High'), 
                                           10, 
                                           replace=TRUE))

IndipendentDf <- transform(IndipendentDf,Power = factor(Power, 
                                                        levels = c('Low', 'Mid', 'High'), 
                                                        ordered=TRUE))
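What the ordering adds can be seen by comparing values of an ordered factor. A minimal sketch, assuming a hypothetical fixed Power vector instead of a random sample so that the result is reproducible:

```r
power <- factor(c("Mid", "High", "Low", "Mid"),
                levels = c("Low", "Mid", "High"),
                ordered = TRUE)

# Comparisons follow the declared level order, not the alphabet
power < "High"      # TRUE FALSE TRUE TRUE

# The underlying integer codes follow the same order
as.integer(power)   # 2 3 1 2

# Functions such as max() work on ordered factors
max(power)
```

For a reproducible random sample, set.seed() can be called before sample().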

7.2 Creating another numeric vector

  • Created a new vector named NewColumn that holds 10 uniformly distributed random values between 1 and 10, generated with the runif() function.
  • Combined IndipendentDf and NewColumn into the same dataframe using the cbind() function, which combines the data column-wise.
NewColumn <- c(runif(10, 1, 10))
IndipendentDf <- cbind(IndipendentDf, NewColumn)

7.3 Checking the new dataframe

  • The resulting data object has 10 observations and 3 variables.
  • Checked the structure of IndipendentDf using the str() function.
  • Checked the attributes of IndipendentDf using the attributes() function.
print(paste('Number of Observations in the dataframe is :', dim(IndipendentDf)[1], " || 
            ",'Number of Variables in the dataframe is :', dim(IndipendentDf)[2]))
## [1] "Number of Observations in the dataframe is : 10  || \n             Number of Variables in the dataframe is : 3"
str(IndipendentDf)
## 'data.frame':    10 obs. of  3 variables:
##  $ PersonID : int  78 20 77 26 1 22 61 28 41 83
##  $ Power    : Ord.factor w/ 3 levels "Low"<"Mid"<"High": 2 3 3 2 1 2 2 1 2 1
##  $ NewColumn: num  9.99 9.14 2.62 6.08 3.16 ...
attributes(IndipendentDf)
## $names
## [1] "PersonID"  "Power"     "NewColumn"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10

8. REFERENCES

    1. What you need to know about coronavirus (COVID-19), n.d., viewed 24 March 2020, <https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/what-you-need-to-know-about-coronavirus-covid-19>
    2. 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE, n.d., viewed 24 March 2020, <https://github.com/CSSEGISandData/COVID-19>
    3. SRK, n.d., viewed 24 March 2020, <https://www.kaggle.com/sudalairajkumar>
    4. fread by M Dowle, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/data.table/versions/1.8.8/topics/fread>
    5. R-core (R-core@R-project.org), dim, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim>
    6. Novel Corona Virus 2019 Dataset, n.d., viewed 24 March 2020, <https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data#>
    7. R-core (R-core@R-project.org), head, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head>
    8. R-core (R-core@R-project.org), transform, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/transform>
    9. Ben Gorman, one_hot, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/mltools/versions/0.3.5/topics/one_hot>
    10. Gskinner, regex, n.d., viewed 24 March 2020, <https://regexr.com>
    11. Hadley Wickham, separate, n.d., viewed 24 March 2020, <https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate>
    12. James D. McCaffrey, R Language Generating Random Integers, 11 June 2016, viewed 24 March 2020, <https://jamesmccaffrey.wordpress.com/2016/06/11/r-language-generating-random-integers>
    13. Introduction to data.table, 8 December 2019, viewed 24 March 2020, <https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html>
    14. tidyverse: Easily Install and Load the ‘Tidyverse’, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/tidyverse/index.html>
    15. Do more with dates and times in R, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html>
    16. mltools: Machine Learning Tools, n.d., viewed 24 March 2020, <https://cran.r-project.org/web/packages/mltools/index.html>