4. Inspect and Understand
4.1 Checking the dimensions of the data frame.
- Printing the dimensions of the dataset, where:
- the first value denotes the number of rows and
- the second value denotes the number of columns.
- To find the dimensions, I used the dim() function from the base package, which takes a single R object as its argument (dim).
dim(CovidData)
## [1] 1085 27
# Reference - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim
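As a small aside, nrow() and ncol() return the same two numbers one at a time. The sketch below uses a made-up data frame named toy (an assumption for illustration, not the COVID data).

```r
# Hypothetical toy data frame, used only to illustrate dim(), nrow() and ncol()
toy <- data.frame(id = 1:4, age = c(66, 56, 46, 60))

dims <- dim(toy)      # integer vector: rows first, then columns
n_rows <- nrow(toy)   # same value as dims[1]
n_cols <- ncol(toy)   # same value as dims[2]
print(dims)
```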
4.2 Summarizing
- Summarizing the types of variables by checking the data type (i.e., character, numeric, integer, factor, or logical) of each variable in the data set.
- Checking the column names in the data frame and renaming them if required.
- Variable names were printed using the names() function.
- To summarise the data, the str() function was used.
- Checking the attributes of the first 10 rows of the data using the attributes() function.
print(names(CovidData))
## [1] "id" "case_in_country"
## [3] "reporting date" "V4"
## [5] "summary" "location"
## [7] "country" "gender"
## [9] "age" "symptom_onset"
## [11] "If_onset_approximated" "hosp_visit_date"
## [13] "exposure_start" "exposure_end"
## [15] "visiting Wuhan" "from Wuhan"
## [17] "death" "recovered"
## [19] "symptom" "source"
## [21] "link" "V22"
## [23] "V23" "V24"
## [25] "V25" "V26"
## [27] "V27"
str(CovidData)
## Classes 'data.table' and 'data.frame': 1085 obs. of 27 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ case_in_country : int NA NA NA NA NA NA NA NA NA NA ...
## $ reporting date : chr "1/20/2020" "1/20/2020" "1/21/2020" "1/21/2020" ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ summary : chr "First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, vi"| __truncated__ "First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arri"| __truncated__ "First confirmed imported cases in Zhejiang: patient is male, 46, lives in Wuhan, self-driving from Wuhan to Han"| __truncated__ "new confirmed imported COVID-19 pneumonia in Tianjin: female, age 60, recently visited Wuhan, visited fever cli"| __truncated__ ...
## $ location : chr "Shenzhen, Guangdong" "Shanghai" "Zhejiang" "Tianjin" ...
## $ country : chr "China" "China" "China" "China" ...
## $ gender : chr "male" "female" "male" "female" ...
## $ age : num 66 56 46 60 58 44 34 37 39 56 ...
## $ symptom_onset : chr "01/03/20" "1/15/2020" "01/04/20" NA ...
## $ If_onset_approximated: int 0 0 0 NA NA 0 0 0 0 0 ...
## $ hosp_visit_date : chr "01/11/20" "1/15/2020" "1/17/2020" "1/19/2020" ...
## $ exposure_start : chr "12/29/2019" NA NA NA ...
## $ exposure_end : chr "01/04/20" "01/12/20" "01/03/20" NA ...
## $ visiting Wuhan : int 1 0 0 1 0 0 0 1 1 1 ...
## $ from Wuhan : int 0 1 1 0 0 1 1 0 0 0 ...
## $ death : chr "0" "0" "0" "0" ...
## $ recovered : chr "0" "0" "0" "0" ...
## $ symptom : chr "" "" "" "" ...
## $ source : chr "Shenzhen Municipal Health Commission" "Official Weibo of Shanghai Municipal Health Commission" "Health Commission of Zhejiang Province" "人民日报官方微博" ...
## $ link : chr "http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm" "https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment" "http://www.zjwjw.gov.cn/art/2020/1/21/art_1202101_41786033.html" "https://m.weibo.cn/status/4463235401268457?" ...
## $ V22 : logi NA NA NA NA NA NA ...
## $ V23 : logi NA NA NA NA NA NA ...
## $ V24 : logi NA NA NA NA NA NA ...
## $ V25 : logi NA NA NA NA NA NA ...
## $ V26 : logi NA NA NA NA NA NA ...
## $ V27 : logi NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
attributes(CovidData[1:10,])
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $class
## [1] "data.table" "data.frame"
##
## $.internal.selfref
## <pointer: 0x7feda600d4e0>
##
## $names
## [1] "id" "case_in_country"
## [3] "reporting date" "V4"
## [5] "summary" "location"
## [7] "country" "gender"
## [9] "age" "symptom_onset"
## [11] "If_onset_approximated" "hosp_visit_date"
## [13] "exposure_start" "exposure_end"
## [15] "visiting Wuhan" "from Wuhan"
## [17] "death" "recovered"
## [19] "symptom" "source"
## [21] "link" "V22"
## [23] "V23" "V24"
## [25] "V25" "V26"
## [27] "V27"
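One compact way to tabulate variable types, complementing str(), is to collect the class of every column. This is a minimal sketch on a hypothetical toy data frame; the names toy, col_types and type_counts are assumptions made for the example.

```r
# Hypothetical toy data frame standing in for the real data
toy <- data.frame(id = 1:3,
                  country = c("China", "Japan", "China"),
                  age = c(66, 56, 46),
                  stringsAsFactors = FALSE)

# First class of each column, returned as a named character vector
col_types <- vapply(toy, function(x) class(x)[1], character(1))

# How many columns there are of each type
type_counts <- table(col_types)
print(col_types)
print(type_counts)
```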
4.3 Dropping columns that have fewer than 2 unique values
- Used a for loop to iterate over the variables in the dataset.
- Dropped a column/variable if it has fewer than 2 unique values, which resulted in the following columns being dropped:
- V4
- V22
- V23
- V24
- V25
- V26
- V27
for (headers in colnames(CovidData)) {
  # A column with fewer than 2 unique values carries no information
  if (length(unique(CovidData[[headers]])) < 2) {
    print(paste('Column ', headers, ' is dropped.'))
    CovidData <- select(CovidData, -c(headers))
  }
}
## [1] "Column V4 is dropped."
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(headers)` instead of `headers` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## [1] "Column V22 is dropped."
## [1] "Column V23 is dropped."
## [1] "Column V24 is dropped."
## [1] "Column V25 is dropped."
## [1] "Column V26 is dropped."
## [1] "Column V27 is dropped."
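The same rule (drop any column with fewer than 2 unique values) can also be written without a loop. The following is a base R sketch on a hypothetical toy data frame, not the code that was run for this report.

```r
# Hypothetical toy data frame: V4 holds only NA, so it has a single unique value
toy <- data.frame(id = 1:3,
                  V4 = c(NA, NA, NA),
                  gender = c("male", "female", "male"))

# Number of unique values per column (NA counts as one value)
n_unique <- vapply(toy, function(x) length(unique(x)), integer(1))

dropped <- names(toy)[n_unique < 2]            # columns that will be removed
toy_clean <- toy[, n_unique >= 2, drop = FALSE]
print(dropped)
```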
4.5 Replaced NA’s in case_in_country
- Replaced NA’s with an impossible value of -1.
- A blank (NA) data point denotes that no case was registered till that date in the data.
- This column can give valuable insight into when the outbreak was first discovered in a particular country.
- To find NA values, the is.na() function was used.
- A total of 197 missing values were discovered and transformed to -1.
- No further imputation was done because we didn’t cover the missing data imputation part yet.
print(paste('Number of missing values: ',
dim(CovidData[is.na(CovidData$case_in_country)])[1]))
## [1] "Number of missing values: 197"
CovidData$case_in_country[is.na(CovidData$case_in_country)] <- -1
4.6 Replaced NA’s in If_onset_approximated
- Replaced NA’s with an impossible value of -1.
- A blank (NA) data point denotes a missing value, and there are a total of 525 missing values.
- Replacing them with an impossible value keeps the rows available, as during analysis this variable may or may not give valuable insights.
- To find NA values, the is.na() function was used.
- No further imputation was done because we didn’t cover the missing data imputation part yet.
print(paste('Number of missing values: ',
dim(CovidData[is.na(CovidData$If_onset_approximated)])[1]))
## [1] "Number of missing values: 525"
CovidData$If_onset_approximated[is.na(CovidData$If_onset_approximated)] <- -1
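Counting and replacing missing values follows the same pattern in both columns, so it can be wrapped in a small helper. The function name replace_na_with and the sample vector are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: count the NA values in a vector and replace them with a sentinel
replace_na_with <- function(x, sentinel = -1) {
  n_missing <- sum(is.na(x))      # number of missing values before replacement
  x[is.na(x)] <- sentinel
  list(values = x, n_missing = n_missing)
}

result <- replace_na_with(c(0L, NA, 1L, NA))
print(result$n_missing)
print(result$values)
```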
4.7 Encoding the gender column
- As a data science student, I know that most analyses cannot be run on plain text, so encoding text data is important.
- The majority of machine learning models cannot interpret plain text, so encoding (converting text data to numerical form by giving each distinct text value a unique number) is required for a good prediction.
- Males are encoded as 0 and females are encoded as 1.
- A total of 183 missing values were discovered and dropped using the drop_na() function from the tidyverse library.
- Dropped them because we didn’t cover the missing data imputation part yet.
# "female" must be replaced first, because the pattern "male" also matches inside "female"
CovidData$gender <- gsub("female", 1, CovidData$gender)
CovidData$gender <- gsub("male", 0, CovidData$gender)
print(paste('Number of missing values: ', dim(CovidData[is.na(CovidData$gender)])[1]))
## [1] "Number of missing values: 183"
CovidData <- CovidData %>% drop_na(gender)
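An alternative that does not depend on the order of the substitutions is an exact lookup table. This is a sketch with a hypothetical sample vector; missing values simply stay NA.

```r
# Hypothetical sample of the gender column, including a missing value
gender <- c("male", "female", NA, "female")

# Named lookup vector: each label maps to its numeric code
codes <- c(male = 0L, female = 1L)

# Indexing by name matches whole labels only; NA stays NA
gender_code <- unname(codes[gender])
print(gender_code)
```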
4.8 If the patient is recovered changing the data point to 1
- This column has 3 types of data points.
- Dates - date on which the patient was recovered
- 0 - The patient has not recovered
- 1 - The patient has recovered
- So, a date denotes that the patient has successfully recovered (i.e., 1).
- So, replaced dates with 1.
- Used the gsub() function to substitute ( / ) with "", i.e., removed the ( / ) sign.
- Then used a for loop to check the character length of each data point using the nchar() function.
- If the length is greater than 3, used gsub() with the regex pattern [0-9]+ to substitute the digits with 1 (regex).
CovidData$recovered <- gsub('/', "", CovidData$recovered)
for (count in seq_len(nrow(CovidData))) {
  # More than 3 characters means the value was a date, i.e. the patient recovered
  if (nchar(CovidData$recovered[count]) > 3) {
    CovidData$recovered[count] <- gsub('[0-9]+', 1, CovidData$recovered[count])
  }
}
4.9 If the patient died then changing the data point to 1
- This column has 3 types of data points.
- Dates - date on which patient died
- 0 - the patient is alive
- 1 - the patient died
- So, a date denotes that the patient died (i.e., 1).
- So, replaced dates with 1.
- Used the gsub() function to substitute ( / ) with "", i.e., removed the ( / ) sign.
- Then used a for loop to check the character length of each data point using the nchar() function.
- If the length is greater than 3, used gsub() with the regex pattern [0-9]+ to substitute the digits with 1 (regex).
CovidData$death <- gsub('/', "", CovidData$death)
for (count in seq_len(nrow(CovidData))) {
  # More than 3 characters means the value was a date, i.e. the patient died
  if (nchar(CovidData$death[count]) > 3) {
    CovidData$death[count] <- gsub('[0-9]+', 1, CovidData$death[count])
  }
}
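For both the recovered and death columns, the "date means 1" rule can also be expressed in a vectorised way by testing for the slash directly. The sketch below uses a hypothetical sample vector named status and assumes that every value containing a slash is a date.

```r
# Hypothetical sample mixing 0/1 flags and dates, as in the recovered and death columns
status <- c("0", "1", "2/15/2020", "0", "12/30/19")

# Assumption: any value that contains a slash is a date
is_date <- grepl("/", status, fixed = TRUE)

# Dates become "1"; existing 0/1 flags are kept as they are
flag <- ifelse(is_date, "1", status)
flag_int <- as.integer(flag)
print(flag_int)
```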
4.11 One-hot encoding the countries
- As a data science student, I know that most analyses cannot be run on plain text, so encoding text data is important.
- The majority of machine learning models cannot interpret plain text, so encoding (converting text data to numerical form by giving each distinct text value a unique number) is required for a good prediction.
- Used the one_hot() function from the mltools library (one_hot).
- There are a total of 33 countries in this dataset.
- The 33 countries are allocated as variables and given a unique combination of 1’s and 0’s to uniquely identify them.
- Encoded data is saved into a new dataframe named encodedCountry.
- The principle of one-hot encoding is that one column (here the last) can be dropped: it is fully determined by the others, so dropping it removes a redundant variable while keeping all the information in the dataset.
- So, the last column was dropped from encodedCountry.
- Dropped the country variable as it has been encoded.
- Used the cbind() function, which joins two dataframes column-wise.
print(unique(CovidData$country))
## [1] China France Japan Malaysia Nepal
## [6] Singapore South Korea Taiwan Thailand USA
## [11] Vietnam Australia Canada Cambodia Sri Lanka
## [16] Germany UAE Hong Kong Italy Russia
## [21] UK India Phillipines Finland Spain
## [26] Sweden Israel Lebanon Kuwait Bahrain
## [31] Algeria Croatia Switzerland
## 33 Levels: Algeria Australia Bahrain Cambodia Canada China ... Vietnam
encodedCountry <- one_hot(as.data.table(CovidData$country))
# length() of a data.table is its number of columns, so this drops the last column
encodedCountry <- select(encodedCountry, -c(length(encodedCountry)))
CovidData <- select(CovidData,-c(country))
CovidData <- cbind(CovidData, encodedCountry)
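To make the idea of one-hot encoding with one dropped column concrete, here is a base R sketch using model.matrix() instead of mltools. The toy factor and the object names full and reduced are assumptions made for this example.

```r
# Hypothetical toy factor with three countries
country <- factor(c("China", "Japan", "China", "France"))

# "- 1" removes the intercept so that every level gets its own indicator column
full <- model.matrix(~ country - 1, data = data.frame(country = country))
colnames(full) <- levels(country)

# Dropping the last column loses nothing: a row of all zeros now means "Japan"
reduced <- full[, -ncol(full), drop = FALSE]
print(reduced)
```

Each row of full contains exactly one 1, which is why any single column can be reconstructed from the remaining ones.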
4.12 Processing the location column
- Cleaning the location variable.
- Removed commas and the values after commas.
- Removed names of states written in languages other than English.
- All of this is done using regex and the gsub() function.
- Did it because we didn’t cover the missing data imputation part yet.
CovidData$location <- gsub(" Guangdong", "", CovidData$location)
CovidData$location <- gsub(" Hubei", "", CovidData$location)
CovidData$location <- gsub( " Guangxi", "", CovidData$location)
CovidData$location <- gsub( "-", "", CovidData$location)
CovidData$location <- gsub("[,]", "", CovidData$location)
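# fixed = TRUE matches the literal text " (陕西)"; written as a character class, "[ (陕西)]" would also strip every space and parenthesis
CovidData$location <- gsub(" (陕西)", "", CovidData$location, fixed = TRUE)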
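Removing "commas and the values after commas" can also be done with a single pattern instead of one gsub() call per province. This is a sketch on hypothetical sample values; the vector names location and city are assumptions.

```r
# Hypothetical sample of location values
location <- c("Shenzhen, Guangdong", "Shanghai", "Wuhan, Hubei", "Hong Kong")

# ",.*$" matches the first comma and everything after it;
# trimws() then removes any leftover surrounding whitespace
city <- trimws(sub(",.*$", "", location))
print(city)
```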
4.13 Manipulating the reporting date variable
- Converted the reporting_date column to separate columns of:
- reporting_day - Day the patient reported
- reporting_month - The month the patient reported
- reporting_year - The year the patient reported
- To do this, the separate() function was used, and the separation was done on the ( / ) symbol (separate).
- Dropped the NA’s in the reporting_day variable.
# The dates are stored as month/day/year, so the pieces are named in that order
CovidData <- separate(CovidData, reporting_date,
                      into = c('reporting_month', 'reporting_day', 'reporting_year'),
                      sep = "/")
# Reference - https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate
CovidData <- CovidData %>% drop_na(reporting_day)
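The sample values in the str() output (for example "1/20/2020") are in month/day/year order, so the first piece of each split is the month. The base R sketch below shows the split on hypothetical sample values; the helper pick() is an assumption made for the example.

```r
# Hypothetical sample of reporting dates in month/day/year order, with one missing value
reporting_date <- c("1/20/2020", "2/5/2020", NA)

# Split each date on "/"; a missing date stays a single NA
parts <- strsplit(reporting_date, "/", fixed = TRUE)

# Hypothetical helper: take the i-th piece of every date as an integer
pick <- function(i) as.integer(vapply(parts, function(p) p[i], character(1)))

date_parts <- data.frame(reporting_month = pick(1),
                         reporting_day   = pick(2),
                         reporting_year  = pick(3))
print(date_parts)
```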
4.14 Converting character columns to numeric
- Substituted years that have only two digits (20 and 19) with 2020 or 2019 using the sub() function.
- Converted the character type columns of:
- reporting_day
- reporting_month
- reporting_year, to numeric using as.numeric() function.
# Expand two-digit years to four digits; the anchors ^ and $ leave four-digit years unchanged
CovidData$reporting_year <- sub("^20$", "2020", CovidData$reporting_year)
CovidData$reporting_year <- sub("^19$", "2019", CovidData$reporting_year)
CovidData$reporting_day <- as.numeric(CovidData$reporting_day)
CovidData$reporting_month <- as.numeric(CovidData$reporting_month)
CovidData$reporting_year <- as.numeric(CovidData$reporting_year)
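The year clean-up can be checked on a small vector before it is applied to the full column. This sketch assumes that every two-digit year belongs to the 2000s; the vector year_chr is hypothetical.

```r
# Hypothetical sample of year strings: four-digit, two-digit and missing
year_chr <- c("2020", "20", "19", NA)

# Assumption: two-digit years belong to the 2000s
short <- !is.na(year_chr) & nchar(year_chr) == 2
year_chr[short] <- paste0("20", year_chr[short])

year_num <- as.numeric(year_chr)
print(year_num)
```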
4.15 Handling the symptom variable in the data set
- Separated the text data using the separate() function, splitting on the (, ) pattern.
- Replaced the NA’s with the (-) symbol using the is.na() function.
- Used a for loop to find blank values shorter than 3 characters using the nchar() function and replaced them with the (-) symbol using the sub() function.
- No further imputation was done because we didn’t cover the missing data imputation part yet.
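CovidData <- separate(CovidData, symptom,
                      into = c('symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5'),
                      sep = ", ")
# Replace blank entries in symptom1 with "-", one element at a time
for (i in seq_len(nrow(CovidData))) {
  # isTRUE() skips NA entries instead of raising an error
  if (isTRUE(nchar(CovidData$symptom1[i]) < 3)) {
    CovidData$symptom1[i] <- sub("^.*$", "-", CovidData$symptom1[i])
  }
}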
CovidData$symptom2[is.na(CovidData$symptom2)] <- '-'
CovidData$symptom3[is.na(CovidData$symptom3)] <- '-'
CovidData$symptom4[is.na(CovidData$symptom4)] <- '-'
CovidData$symptom5[is.na(CovidData$symptom5)] <- '-'
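The split-and-pad logic for the symptom text can be illustrated in base R as follows. This is only a sketch on hypothetical sample strings, limited to three symptom columns for brevity; pad() and symptom_matrix are names invented for the example.

```r
# Hypothetical sample of symptom strings, including a blank entry
symptom <- c("fever, cough", "", "fever, cough, fatigue")
max_symptoms <- 3

# Split on ", "; a blank string yields an empty character vector
pieces <- strsplit(symptom, ", ", fixed = TRUE)

# Hypothetical helper: pad with "-" (or truncate) to exactly max_symptoms entries
pad <- function(p) {
  c(p, rep("-", max(0, max_symptoms - length(p))))[seq_len(max_symptoms)]
}

# One row per record, one column per symptom slot
symptom_matrix <- t(vapply(pieces, pad, character(max_symptoms)))
colnames(symptom_matrix) <- paste0("symptom", seq_len(max_symptoms))
print(symptom_matrix)
```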
4.16 Dropping NA
- Dropped NA values from the following columns:
- age
- from_Wuhan
- Did it because we didn’t cover missing data imputation part yet.
CovidData <- CovidData %>% drop_na(age)
CovidData <- CovidData %>% drop_na(from_Wuhan)
4.17 Converting NA dates to 01/01/0000
- Replaced the missing (NA) dates with the impossible date 01/01/0000.
- No further imputation was done because we didn’t cover the missing data imputation part yet.
CovidData$exposure_end[is.na(CovidData$exposure_end)] <- '01/01/0000'
CovidData$exposure_start[is.na(CovidData$exposure_start)] <- '01/01/0000'
CovidData$hosp_visit_date[is.na(CovidData$hosp_visit_date)] <- '01/01/0000'
CovidData$symptom_onset[is.na(CovidData$symptom_onset)] <- '01/01/0000'
4.19 Converting string dates in month/day/year format to date columns
- Converted the formatted character variables to Date objects using the mdy() function from the lubridate library.
- Following columns are converted:
- exposure_end
- exposure_start
- hosp_visit_date
- symptom_onset
CovidData$exposure_end <- mdy(CovidData$exposure_end)
CovidData$exposure_start <- mdy(CovidData$exposure_start)
CovidData$hosp_visit_date <- mdy(CovidData$hosp_visit_date)
CovidData$symptom_onset <- mdy(CovidData$symptom_onset)
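The date columns mix two-digit and four-digit years (for example "01/03/20" and "1/15/2020"). As an illustration of how such values can be brought into a single format, here is a base R sketch that does not use lubridate. It assumes that two-digit years belong to the 2000s; the vector raw is hypothetical.

```r
# Hypothetical sample mixing two-digit and four-digit years, with one missing value
raw <- c("01/03/20", "1/15/2020", NA)

# Assumption: a two-digit year at the end of the string belongs to the 2000s
two_digit <- grepl("/[0-9]{2}$", raw)
raw[two_digit] <- sub("/([0-9]{2})$", "/20\\1", raw[two_digit])

# After normalisation a single format string covers every value
parsed <- as.Date(raw, format = "%m/%d/%Y")
print(parsed)
```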
4.20 Saving the new data
- Saving the processed data in the dataframe CovidData as CovidData.csv using the write_csv() function from the readr library.
write_csv(CovidData, 'CovidData.csv')
7. Create a new Data Frame
7.1 Creating a new dataframe from scratch
- A new data frame named IndipendentDf is created with the following variables:
- PersonID - (Integer) 10 random integer values from 1 to 100, drawn with the sample() function (R Language Generating Random Integers).
- Power - (Character, later Factor) 10 random character values drawn from (‘Low’, ‘Mid’, ‘High’).
- Converted the Power variable to a factor using the transform() function, with the order “Low” < “Mid” < “High”, which results in an ordinal variable.
IndipendentDf <- data.frame(PersonID = sample(1:100, 10),
Power = sample(c('Low', 'Mid', 'High'),
10,
replace=TRUE))
IndipendentDf <- transform(IndipendentDf,Power = factor(Power,
levels = c('Low', 'Mid', 'High'),
ordered=TRUE))
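What the ordering buys in practice is that comparisons and max() become meaningful on the factor. A minimal sketch with a hypothetical three-element vector:

```r
# Hypothetical ordered factor with the same levels as the Power variable
power <- factor(c("Mid", "High", "Low"),
                levels = c("Low", "Mid", "High"),
                ordered = TRUE)

above_low <- power > "Low"   # comparisons follow the level order, not the alphabet
highest <- max(power)        # max() is defined for ordered factors
print(above_low)
print(highest)
```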
7.2 Creating another numeric vector
- Created a new vector named NewColumn that holds 10 uniformly distributed random values between 1 and 10, generated with the runif() function.
- Combined IndipendentDf and NewColumn into the same dataframe using the cbind() function, which combines the data column-wise.
NewColumn <- c(runif(10, 1, 10))
IndipendentDf <- cbind(IndipendentDf, NewColumn)
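Because sample() and runif() draw random values, the printed numbers change from run to run. If reproducible output is wanted, a seed can be fixed first; the seed value 42 below is an arbitrary choice for illustration.

```r
# Fixing the seed makes the random draws repeatable (42 is an arbitrary value)
set.seed(42)
first_draw <- runif(10, min = 1, max = 10)

set.seed(42)
second_draw <- runif(10, min = 1, max = 10)

# With the same seed, both draws are identical
print(identical(first_draw, second_draw))
```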
7.3 Checking the new dataframe
- The resulting data object has 10 observations and 3 variables.
- Checking the structure of IndipendentDf using the str() function.
- Checking the attributes of IndipendentDf using the attributes() function.
print(paste('Number of Observations in the dataframe is :', dim(IndipendentDf)[1], " ||
",'Number of Variables in the dataframe is :', dim(IndipendentDf)[2]))
## [1] "Number of Observations in the dataframe is : 10 || \n Number of Variables in the dataframe is : 3"
str(IndipendentDf)
## 'data.frame': 10 obs. of 3 variables:
## $ PersonID : int 78 20 77 26 1 22 61 28 41 83
## $ Power : Ord.factor w/ 3 levels "Low"<"Mid"<"High": 2 3 3 2 1 2 2 1 2 1
## $ NewColumn: num 9.99 9.14 2.62 6.08 3.16 ...
attributes(IndipendentDf)
## $names
## [1] "PersonID" "Power" "NewColumn"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10