Required Packages


library(readr)
library(outliers)
library(lubridate)
library(tidyr)
library(dplyr)

Executive Summary

Data Wrangling refers to the preprocessing of Data and converting it into an analyzable state. The two datasets used for the assignment are The Covid-19 India Dataset and the Statewise Testing Details dataset.
Both the datasets are available on: https://www.kaggle.com/sudalairajkumar/covid19-in-india
The datasets are merged together for the assignment purpose. Various tasks like data understanding, tidying of data, manipulation, filtering, scanning, transforming have been performed on the dataset. Skewness is checked for the Confirmed cases variable and the neccessary transformation is done to reduce the skewness.

Data

The Dataset used are the covid_19_india[1] which contains the details of the increase in coronavirus cases per day in different states of India and the StatewiseTestingdetails[1] which contains the date wise details of the Total tests carried out each day in different states of India.
The variables in covid_19_india dataset are Date and time of recording, the State/Union Territory for which the information is recorded, ConfirmedIndianNationals refers to the cases reported from Indian citizens, ConfirmedForeignNationals are the cases reported from people who are not Indian Citizens, Cured refers to cases cured till that date, Deaths are the number of deaths due to Covid-19 till that date, and Confirmed refers to the total confirmed cases in India till that date.
The variables in StatewiseTestingdetails dataset are Date which refers to the date of the information recorded, States refers to the State of India for which the information is recorded, TotalSamples refers to the Tests being done till that date, and Positive Refers to the total number of people infected by coronavirus till the date in that State.

covid_19_india <- read_csv("Downloads/covid_19_india.csv")

── Column specification ────────────────────────────────────────────────────────────────────
cols(
  Date = col_character(),
  Time = col_time(format = ""),
  `State/UnionTerritory` = col_character(),
  ConfirmedIndianNational = col_character(),
  ConfirmedForeignNational = col_character(),
  Cured = col_double(),
  Deaths = col_double(),
  Confirmed = col_double()
)
StatewiseTestingDetails <- read_csv("Downloads/StatewiseTestingDetails.csv")

── Column specification ────────────────────────────────────────────────────────────────────
cols(
  Date = col_character(),
  State = col_character(),
  TotalSamples = col_double(),
  Positive = col_double()
)
covid_19_india<- covid_19_india%>%unite(DateTime, Date, Time, sep=" ")
head(covid_19_india)
head(StatewiseTestingDetails)

The two datasets have been merged together as per the assignment specification. The type of the variable date for StatewiseTestingdetails has been converted to character for proper merging of the dataset as the covid_19_india has date stored as a character variable.

StatewiseTestingDetails$Date<-as.character(StatewiseTestingDetails$Date, format="%d/%m/%y")
covid_19_india<-covid_19_india %>%separate(DateTime, into = c("Date", "Time"), sep = " ")
df_join <-full_join(covid_19_india,StatewiseTestingDetails, by = c("Date" = "Date", "State/UnionTerritory"="State"))

head(df_join)

Tidy & Data Manipulate I

It is observed in the dataset covid_19_india, the date and time have been merged into a single colum. Its breaks the Wickham and Grolemund principles as each variable must have its own column. Date and Time can be separated as follows:

covid_19_india<-covid_19_india %>%separate(DateTime, into = c("Date", "Time"), sep = " ")
head(df_join)

The above code has been used before merging the 2 datasets above in the Data section.

Understand

The structure and dimensions of the dataset can be checked using the str() and dim() functions respectively. We can observe that the dataset contains character, time, numeric variables. There are a total of 7266 observations and a total of 10 variables.

dim(df_join)
[1] 7266   10
str(df_join)
tibble [7,266 × 10] (S3: tbl_df/tbl/data.frame)
 $ Date                    : chr [1:7266] "30/1/20" "31/1/20" "1/2/20" "2/2/20" ...
 $ Time                    : chr [1:7266] "18:00:00" "18:00:00" "18:00:00" "18:00:00" ...
 $ State/UnionTerritory    : chr [1:7266] "Kerala" "Kerala" "Kerala" "Kerala" ...
 $ ConfirmedIndianNational : chr [1:7266] "1" "1" "2" "3" ...
 $ ConfirmedForeignNational: chr [1:7266] "0" "0" "0" "0" ...
 $ Cured                   : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Deaths                  : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Confirmed               : num [1:7266] 1 1 2 3 3 3 3 3 3 3 ...
 $ TotalSamples            : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...
 $ Positive                : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...

Earlier on the type of the variable date for StatewiseTestingdetails has been converted to character for proper merging of the dataset as the covid_19_india has date stored as a character variable. In the merged dataset, we can observe that the date is stored as a character variable and should be converted to type Date.

df_join$Date<-as.Date(df_join$Date,format = "%d/%m/%y")
str(df_join)
tibble [7,266 × 10] (S3: tbl_df/tbl/data.frame)
 $ Date                    : Date[1:7266], format: "2020-01-30" "2020-01-31" "2020-02-01" ...
 $ Time                    : chr [1:7266] "18:00:00" "18:00:00" "18:00:00" "18:00:00" ...
 $ State/UnionTerritory    : chr [1:7266] "Kerala" "Kerala" "Kerala" "Kerala" ...
 $ ConfirmedIndianNational : chr [1:7266] "1" "1" "2" "3" ...
 $ ConfirmedForeignNational: chr [1:7266] "0" "0" "0" "0" ...
 $ Cured                   : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Deaths                  : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Confirmed               : num [1:7266] 1 1 2 3 3 3 3 3 3 3 ...
 $ TotalSamples            : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...
 $ Positive                : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...

The State/UnionTerritory variable has been converted to a factor variable. But there is no need to for ordering/labelling. Instead the data has been arranged in increasing Alphabetical order for the variable State/UnionTerritory. Earlier it was arranged based on date which had all the data for states mixed up. The data is more clear now as it is arranged for the States and all the data for a particular state could be found continously.

df_join$`State/UnionTerritory`<-as.factor(df_join$`State/UnionTerritory`)
df_join<-df_join%>% arrange(`State/UnionTerritory`)

head(df_join)
str(df_join)
tibble [7,266 × 10] (S3: tbl_df/tbl/data.frame)
 $ Date                    : Date[1:7266], format: "2020-03-26" "2020-03-27" "2020-03-28" ...
 $ Time                    : chr [1:7266] "18:00:00" "10:00:00" "18:00:00" "19:30:00" ...
 $ State/UnionTerritory    : Factor w/ 42 levels "Andaman and Nicobar Islands",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ConfirmedIndianNational : chr [1:7266] "1" "1" "6" "-" ...
 $ ConfirmedForeignNational: chr [1:7266] "0" "0" "0" "-" ...
 $ Cured                   : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Deaths                  : num [1:7266] 0 0 0 0 0 0 0 0 0 0 ...
 $ Confirmed               : num [1:7266] 1 1 6 9 9 10 10 10 10 10 ...
 $ TotalSamples            : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...
 $ Positive                : num [1:7266] NA NA NA NA NA NA NA NA NA NA ...

Tidy and Manipulate II

A variable Negative has been created using the mutate() function to calculated to the total number of people who were Covid-19 negative based on the TotalSamples and Positive variables.

df_join<-mutate(df_join, NegativeCases = TotalSamples - Positive)
head(df_join)

Scan I

The missing values can be calculated using the colSums() function nested with is.na(). The follwing missing values are observed.

colSums(is.na(df_join))
                    Date                     Time     State/UnionTerritory 
                       0                      179                        0 
 ConfirmedIndianNational ConfirmedForeignNational                    Cured 
                     179                      179                      179 
                  Deaths                Confirmed             TotalSamples 
                     179                      179                     1300 
                Positive            NegativeCases 
                    2772                     2772 

The missing values can be omitted as it would not affect the dataset. Even though about 20% of values are omitted, still it won’t affect the dataset as the future dates contain the cases and samples recorded for all the previous dates. This is because ever date also has the records of the cases recorded till previous date also included in that and not only new cases of that day. Even if we omit values for a particular date, the observations for the next nearest date contains the sum of cases or new cases recorded for all the previous dates and hence it would not affect the case and Sample count of the country Statewise. na.omit() function is used to omit the missing values.

df_join<-na.omit(df_join)
colSums(is.na(df_join))
                    Date                     Time     State/UnionTerritory 
                       0                        0                        0 
 ConfirmedIndianNational ConfirmedForeignNational                    Cured 
                       0                        0                        0 
                  Deaths                Confirmed             TotalSamples 
                       0                        0                        0 
                Positive            NegativeCases 
                       0                        0 
str(df_join)
tibble [4,357 × 11] (S3: tbl_df/tbl/data.frame)
 $ Date                    : Date[1:4357], format: "2020-04-17" "2020-04-24" "2020-04-27" ...
 $ Time                    : chr [1:4357] "17:00:00" "17:00:00" "17:00:00" "17:00:00" ...
 $ State/UnionTerritory    : Factor w/ 42 levels "Andaman and Nicobar Islands",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ConfirmedIndianNational : chr [1:4357] "-" "-" "-" "-" ...
 $ ConfirmedForeignNational: chr [1:4357] "-" "-" "-" "-" ...
 $ Cured                   : num [1:4357] 10 11 11 16 33 33 33 33 33 33 ...
 $ Deaths                  : num [1:4357] 0 0 0 0 0 0 0 0 0 0 ...
 $ Confirmed               : num [1:4357] 11 22 33 33 33 33 33 33 33 33 ...
 $ TotalSamples            : num [1:4357] 1403 2679 2848 3754 6677 ...
 $ Positive                : num [1:4357] 12 27 33 33 33 33 33 33 33 33 ...
 $ NegativeCases           : num [1:4357] 1391 2652 2815 3721 6644 ...
 - attr(*, "na.action")= 'omit' Named int [1:2909] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "names")= chr [1:2909] "1" "2" "3" "4" ...
head(df_join)

Scan II

Now the outliers in the dataset are observed. The variables Cured, Deaths, Confirmed, TotalSamples, Positive, NegativeCases are observed for possible outliers using the boxplot for each variables.

df_join$Confirmed %>%  boxplot(main="Box Plot of Confirmed", ylab="Confirmed", col = "grey")

df_join$Cured %>%  boxplot(main="Box Plot of Cured", ylab="Cured", col = "grey")

df_join$Deaths %>%  boxplot(main="Box Plot of Deaths", ylab="Deaths", col = "grey")

df_join$TotalSamples %>%  boxplot(main="Box Plot of TotalSamples", ylab="TotalSamples", col = "grey")

df_join$Positive %>%  boxplot(main="Box Plot of Positive", ylab="Positive", col = "grey")

df_join$NegativeCases %>%  boxplot(main="Box Plot of NegativeCases", ylab="NegativeCases", col = "grey")

NA
NA

It is observed that all the variables contain outliers. This is because ever date also has the records of the cases recorded till previous date also included in that and not only new cases of that day. So the observation count for that particular day is the observation recorded for that day minus for the previous day which gives us the count of that particular day. Hence, outliers are kept intact.

Transform

By creating a histogram of the variable confimed, it is observed that the variable is right skewed.

hist(df_join$Confirmed)

So mathematical operations have been used to reduce skewness and convert distribution to normal distribution. Using the natural logarithm transformation

hist(log(df_join$Confirmed))

---
title: "Data Wrangling MATH2349 Assignment 2"
author: 'Tanmay Vashisth,  Student Id: s3823176'
date: "25/10/2020"
output:
  html_notebook: default
  html_document: default
---
## Required Packages
<br>
```{r}
library(readr)
library(outliers)
library(lubridate)
library(tidyr)
library(dplyr)
```


## Executive Summary
Data Wrangling refers to the preprocessing of Data and converting it into an analyzable state. The two datasets used for the assignment are The Covid-19 India Dataset and the Statewise Testing Details dataset. <br>
Both the datasets are available on: https://www.kaggle.com/sudalairajkumar/covid19-in-india
<br>
The datasets are merged together for the assignment purpose. Various tasks like data understanding, tidying of data, manipulation, filtering, scanning, transforming have been performed on the dataset. Skewness is checked for the Confirmed cases variable and the neccessary transformation is done to reduce the skewness.
<br> <br>


## Data
The Dataset used are the covid_19_india[1] which contains the details of the increase in coronavirus cases per day in different states of India and the StatewiseTestingdetails[1] which contains the date wise details of the Total tests carried out each day in different states of India.<br>
The variables in covid_19_india dataset are Date and time of recording, the State/Union Territory for which the information is recorded, ConfirmedIndianNationals refers to the cases reported from Indian citizens, ConfirmedForeignNationals are the cases reported from people who are not Indian Citizens, Cured refers to cases cured till that date, Deaths are the number of deaths due to Covid-19 till that date, and Confirmed refers to the total confirmed cases in India till that date.<br>
The variables in StatewiseTestingdetails dataset are Date which refers to the date of the information recorded, States refers to the State of India for which the information is recorded, TotalSamples refers to the Tests being done till that date, and Positive Refers to the total number of people infected by coronavirus till the date in that State.
```{r}
covid_19_india <- read_csv("Downloads/covid_19_india.csv")
```

```{r}
StatewiseTestingDetails <- read_csv("Downloads/StatewiseTestingDetails.csv")
covid_19_india<- covid_19_india%>%unite(DateTime, Date, Time, sep=" ")

```

```{r}
head(covid_19_india)
head(StatewiseTestingDetails)
```

The two datasets have been merged together as per the assignment specification. The type of the variable date for StatewiseTestingdetails has been converted to character for proper merging of the dataset as the covid_19_india has date stored as a character variable.
```{r}
StatewiseTestingDetails$Date<-as.character(StatewiseTestingDetails$Date, format="%d/%m/%y")
covid_19_india<-covid_19_india %>%separate(DateTime, into = c("Date", "Time"), sep = " ")
df_join <-full_join(covid_19_india,StatewiseTestingDetails, by = c("Date" = "Date", "State/UnionTerritory"="State"))

head(df_join)
```
## Tidy & Data Manipulate I
It is observed in the dataset covid_19_india, the date and time have been merged into a single colum. Its breaks the Wickham and Grolemund principles as each variable must have its own column.
Date and Time can be separated as follows:
```{r}
covid_19_india<-covid_19_india %>%separate(DateTime, into = c("Date", "Time"), sep = " ")

```

```{r}
head(df_join)
```

The above code has been used before merging the 2 datasets above in the Data section.

## Understand
The structure and dimensions of the dataset can be checked using the str() and dim() functions respectively. We can observe that the dataset contains character, time, numeric variables. There are a total of 7266 observations and a total of 10 variables.

```{r}
dim(df_join)
str(df_join)
```
Earlier on the type of the variable date for StatewiseTestingdetails has been converted to character for proper merging of the dataset as the covid_19_india has date stored as a character variable.
In the merged dataset, we can observe that the date is stored as a character variable and should be converted to type Date.
```{r}
df_join$Date<-as.Date(df_join$Date,format = "%d/%m/%y")
str(df_join)

```

The State/UnionTerritory variable has been converted to a factor variable. But there is no need to for ordering/labelling. Instead the data has been arranged in increasing Alphabetical order for the variable State/UnionTerritory. Earlier it was arranged based on date which had all the data for states mixed up. The data is more clear now as it is arranged for the States and all the data for a particular state could be found continously.
```{r}
df_join$`State/UnionTerritory`<-as.factor(df_join$`State/UnionTerritory`)
df_join<-df_join%>% arrange(`State/UnionTerritory`)

head(df_join)
str(df_join)
```



## Tidy and Manipulate II
A variable Negative has been created using the mutate() function to calculated to the total number of people who were Covid-19 negative based on the TotalSamples and Positive variables.
```{r}
df_join<-mutate(df_join, NegativeCases = TotalSamples - Positive)
head(df_join)
```

## Scan I
The missing values can be calculated using the colSums() function nested with is.na().
The follwing missing values are observed.
```{r}
colSums(is.na(df_join))

```

The missing values can be omitted as it would not affect the dataset. Even though about 20% of values are omitted, still it won't affect the dataset as the future dates contain the cases and samples recorded for all the previous dates. This is because ever date also has the records of the cases recorded till previous date also included in that and not only new cases of that day. Even if we omit values for a particular date, the observations for the next nearest date contains the sum of cases or new cases recorded for all the previous dates and hence it would not affect the case and Sample count of the country Statewise.
na.omit() function is used to omit the missing values.


```{r}
df_join<-na.omit(df_join)
colSums(is.na(df_join))
str(df_join)
head(df_join)
```


## Scan II

Now the outliers in the dataset are observed. The variables Cured, Deaths, Confirmed, TotalSamples, Positive, NegativeCases are observed for possible outliers using the boxplot for each variables.
```{r}
df_join$Confirmed %>%  boxplot(main="Box Plot of Confirmed", ylab="Confirmed", col = "grey")
df_join$Cured %>%  boxplot(main="Box Plot of Cured", ylab="Cured", col = "grey")
df_join$Deaths %>%  boxplot(main="Box Plot of Deaths", ylab="Deaths", col = "grey")
df_join$TotalSamples %>%  boxplot(main="Box Plot of TotalSamples", ylab="TotalSamples", col = "grey")
df_join$Positive %>%  boxplot(main="Box Plot of Positive", ylab="Positive", col = "grey")
df_join$NegativeCases %>%  boxplot(main="Box Plot of NegativeCases", ylab="NegativeCases", col = "grey")


```
It is observed that all the variables contain outliers. This is because ever date also has the records of the cases recorded till previous date also included in that and not only new cases of that day. So the observation count for that particular day is the observation recorded for that day minus for the previous day which gives us the count of that particular day. Hence, outliers are kept intact.

## Transform
By creating a histogram of the variable confimed, it is observed that the variable is right skewed.
```{r}
hist(df_join$Confirmed)

```

So mathematical operations have been used to reduce skewness and convert distribution to normal distribution. Using the natural logarithm transformation
```{r}
hist(log(df_join$Confirmed))

```

