Assignment 1

Data Frames in R

Introduction

Data frames are one of the main data structures in R. A data frame is a two-dimensional structure that stores data in rows and columns. Its main characteristics are: column names should not be empty, row names should be unique, every column should contain the same number of items, and columns can hold numeric, factor, or character data.

To create a data frame we can call data.frame(), passing the columns as arguments. Another important function is str(), which prints a compact summary of the data frame's structure.
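
As a minimal sketch of both functions, the snippet below builds a small data frame by hand and inspects it. The name `sample_cases` is illustrative only, and the three rows are copied from the Covid-19 data set used later in this assignment.

```r
# Build a small data frame; each named argument becomes one column.
sample_cases <- data.frame(
  State = c("Assam", "Bihar", "Goa"),
  Total.Confirmed.cases = c(42L, 426L, 7L),
  Cured = c(29L, 82L, 7L),
  Death = c(1L, 2L, 0L),
  stringsAsFactors = FALSE  # keep State as character on every R version
)

# Compact overview: number of rows and columns, column names, types,
# and the first few values of each column.
str(sample_cases)
```

All columns must have the same length, otherwise data.frame() stops with an error (or recycles values when the lengths divide evenly).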

In this assignment, I will be working with data frames to populate sample data and perform simple analysis using data frame default methods.

Packages info

Most of the functions used in this assignment are built into R; the only exception is visualization. To run the R Markdown code, the ggplot2 package needs to be installed.

ggplot2:

ggplot2 is the package used for visualization; it provides graphical representations of a data set.

Data Preparation

For this assignment I will be analyzing a data set of Covid-19 cases by state in India. The data consists of four columns: State, Total confirmed cases, Cured, and Death. When loaded into R it also carries an index column named X.

To start analyzing the data set, I will import it from a CSV file into a data frame in R. To do that I will use the built-in function read.csv(), which reads a CSV file into a data frame, and then call head() to print the first 6 rows and verify that the data has been loaded successfully.

covid19Data <- read.csv("covid_india_states.csv")
head(covid19Data)

##   X                       State Total.Confirmed.cases Cured Death
## 1 0 Andaman and Nicobar Islands                    33    16     0
## 2 1              Andhra Pradesh                  1463   403    33
## 3 2           Arunachal Pradesh                     1     1     0
## 4 3                       Assam                    42    29     1
## 5 4                       Bihar                   426    82     2
## 6 5                  Chandigarh                    56    17     0
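
The column names X and Total.Confirmed.cases in the output above are most likely produced by read.csv() itself: by default (check.names = TRUE) it converts headers into syntactically valid R names. The sketch below illustrates this with a tiny temporary file. The file content is an assumption about what the real CSV header looks like, not a copy of it.

```r
# Write a tiny CSV to a temporary file (hypothetical stand-in for the real file).
# The first header field is empty and the third one contains spaces.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",State,Total Confirmed cases,Cured,Death",
  "0,Assam,42,29,1",
  "1,Bihar,426,82,2"
), csv_path)

# read.csv() repairs the header: the empty name becomes "X" and
# spaces are replaced by dots.
demo <- read.csv(csv_path, stringsAsFactors = FALSE)
names(demo)

# Remove the temporary file again.
unlink(csv_path)
```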

Before moving on to the analysis, it is always better to understand the structure of the data. To do that I will use the str() function as follows:

str(covid19Data)

## 'data.frame':    32 obs. of  5 variables:
##  $ X                    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ State                : chr  "Andaman and Nicobar Islands" "Andhra Pradesh" "Arunachal Pradesh" "Assam" ...
##  $ Total.Confirmed.cases: int  33 1463 1 42 426 56 40 3515 7 4395 ...
##  $ Cured                : int  16 403 1 29 82 17 36 1094 7 613 ...
##  $ Death                : int  0 33 0 1 2 0 0 59 0 214 ...

According to the output above, the data frame has one character column and four integer columns that can be used for analysis.

Data Source: https://www.kaggle.com/ravichaubey1506/covid19-india

Data Analysis

I will use several built-in functions to explore the data set stored in covid19Data.

To get the dimensions of the data frame, that is, the total number of rows and columns, we can use dim() as below:

dim(covid19Data)

## [1] 32  5

There are 32 rows and 5 columns in the data frame.
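
When only one of the two dimensions is needed, nrow() and ncol() return it directly. A minimal sketch on a small made-up data frame (`demo` and its columns are illustrative, not part of the data set):

```r
# Hypothetical data frame with 4 rows and 2 columns.
demo <- data.frame(id = 1:4, score = c(10, 20, 30, 40))

dim(demo)   # integer vector: c(rows, columns)
nrow(demo)  # number of rows only
ncol(demo)  # number of columns only
```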

To get the names of the columns we can use names() as below:

names(covid19Data)

## [1] "X"                     "State"                 "Total.Confirmed.cases"
## [4] "Cured"                 "Death"

To read the first 10 rows of the data frame we can use head(). A negative n tells head() to return all rows except the last |n| rows; since the data frame has 32 rows, n = -22 returns the first 10. Passing n = 10 directly gives the same result.

head(covid19Data, n = -22)

##    X                       State Total.Confirmed.cases Cured Death
## 1  0 Andaman and Nicobar Islands                    33    16     0
## 2  1              Andhra Pradesh                  1463   403    33
## 3  2           Arunachal Pradesh                     1     1     0
## 4  3                       Assam                    42    29     1
## 5  4                       Bihar                   426    82     2
## 6  5                  Chandigarh                    56    17     0
## 7  6                Chhattisgarh                    40    36     0
## 8  7                       Delhi                  3515  1094    59
## 9  8                         Goa                     7     7     0
## 10 9                     Gujarat                  4395   613   214
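
The two ways of calling head() can be compared on a stand-in data frame with the same number of rows. In this sketch `demo` is a made-up 32-row data frame, not the Covid-19 data:

```r
# Hypothetical 32-row data frame standing in for the real data.
demo <- data.frame(id = 0:31, value = seq(10, 320, by = 10))

first_ten_a <- head(demo, n = 10)   # positive n: keep the first 10 rows
first_ten_b <- head(demo, n = -22)  # negative n: drop the last 22 rows

# Both calls select the same rows.
identical(first_ten_a, first_ten_b)
```

The positive form is usually easier to read because it does not depend on knowing the total number of rows.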

The tail() function retrieves rows from the end of the data frame. A negative n returns all rows except the first |n| rows, so with 32 rows we can pass n = -22 to get the last 10 rows (n = 10 gives the same result):

tail(covid19Data, n = -22)

##     X         State Total.Confirmed.cases Cured Death
## 23 22        Odisha                   143    41     1
## 24 23    Puducherry                     8     5     0
## 25 24        Punjab                   357    90    19
## 26 25     Rajasthan                  2584   836    58
## 27 26    Tamil Nadu                  2323  1258    27
## 28 27     Telengana                  1039   441    26
## 29 28       Tripura                     2     2     0
## 30 29   Uttarakhand                    57    36     0
## 31 30 Uttar Pradesh                  2281   555    41
## 32 31   West Bengal                   795   139    33
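
The same comparison works for tail(). Again, `demo` is a made-up 32-row data frame used only for illustration:

```r
# Hypothetical 32-row data frame standing in for the real data.
demo <- data.frame(id = 0:31, value = seq(10, 320, by = 10))

last_ten_a <- tail(demo, n = 10)   # positive n: keep the last 10 rows
last_ten_b <- tail(demo, n = -22)  # negative n: drop the first 22 rows

# Both calls select the same rows, and the original row names are kept.
identical(last_ten_a, last_ten_b)
rownames(last_ten_a)
```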

List the names of the states in the data set:

covid19Data['State']

##                          State
## 1  Andaman and Nicobar Islands
## 2               Andhra Pradesh
## 3            Arunachal Pradesh
## 4                        Assam
## 5                        Bihar
## 6                   Chandigarh
## 7                 Chhattisgarh
## 8                        Delhi
## 9                          Goa
## 10                     Gujarat
## 11                     Haryana
## 12            Himachal Pradesh
## 13           Jammu and Kashmir
## 14                   Jharkhand
## 15                   Karnataka
## 16                      Kerala
## 17                      Ladakh
## 18              Madhya Pradesh
## 19                 Maharashtra
## 20                     Manipur
## 21                   Meghalaya
## 22                     Mizoram
## 23                      Odisha
## 24                  Puducherry
## 25                      Punjab
## 26                   Rajasthan
## 27                  Tamil Nadu
## 28                   Telengana
## 29                     Tripura
## 30                 Uttarakhand
## 31               Uttar Pradesh
## 32                 West Bengal
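
Single brackets, as used above, return a one-column data frame. If a plain vector is needed instead, double brackets or the `$` operator can be used. A short sketch on a made-up data frame (`demo` is illustrative only):

```r
# Hypothetical three-row data frame.
demo <- data.frame(
  State = c("Assam", "Bihar", "Goa"),
  Death = c(1L, 2L, 0L),
  stringsAsFactors = FALSE
)

as_frame  <- demo["State"]    # one-column data frame
as_vector <- demo[["State"]]  # character vector
same_vec  <- demo$State       # same vector, shorter syntax

class(as_frame)
class(as_vector)
```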

Summarize the data set with summary() to get descriptive statistics for each column:

summary(covid19Data)

##        X            State           Total.Confirmed.cases     Cured       
##  Min.   : 0.00   Length:32          Min.   :    1.00      Min.   :   0.0  
##  1st Qu.: 7.75   Class :character   1st Qu.:   30.25      1st Qu.:  16.0  
##  Median :15.50   Mode  :character   Median :  228.00      Median :  61.5  
##  Mean   :15.50                      Mean   : 1092.88      Mean   : 283.3  
##  3rd Qu.:23.25                      3rd Qu.: 1145.00      3rd Qu.: 412.5  
##  Max.   :31.00                      Max.   :10498.00      Max.   :1773.0  
##      Death      
##  Min.   :  0.0  
##  1st Qu.:  0.0  
##  Median :  2.5  
##  Mean   : 36.0  
##  3rd Qu.: 28.5  
##  Max.   :459.0
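
The same statistics can also be computed for a single column. The sketch below uses only the first six rows of the data set (copied from the head() output earlier), so its results differ from the full summary above; `sample_cases` is an illustrative name.

```r
# First six rows of the data set, entered by hand.
sample_cases <- data.frame(
  State = c("Andaman and Nicobar Islands", "Andhra Pradesh",
            "Arunachal Pradesh", "Assam", "Bihar", "Chandigarh"),
  Total.Confirmed.cases = c(33L, 1463L, 1L, 42L, 426L, 56L),
  stringsAsFactors = FALSE
)

cases <- sample_cases$Total.Confirmed.cases

col_summary <- summary(cases)  # min, quartiles, median, mean, max
col_mean    <- mean(cases)
col_median  <- median(cases)

# State with the highest number of confirmed cases in this sample.
top_state <- sample_cases$State[which.max(cases)]
top_state
```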

Lastly, for better visualization I will use the ggplot2 package to create a horizontal bar chart of total confirmed cases in India by state:

library(ggplot2)
ggplot(covid19Data, aes(x=Total.Confirmed.cases, y=State)) +
  geom_bar(stat = "identity") +
  ggtitle("Covid-19 cases in India by States")
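
By default the bars are drawn in alphabetical order of State. One possible refinement, shown here only as a sketch on three sample rows, is to sort the bars by case count with reorder(). The name `sample_cases` is illustrative, and geom_col() is used as shorthand for geom_bar(stat = "identity").

```r
# Three rows copied from the data set.
sample_cases <- data.frame(
  State = c("Andhra Pradesh", "Assam", "Bihar"),
  Total.Confirmed.cases = c(1463L, 42L, 426L),
  stringsAsFactors = FALSE
)

# Turn State into a factor whose levels are ordered by case count,
# so that ggplot2 draws the bars in that order.
sample_cases$State <- reorder(sample_cases$State,
                              sample_cases$Total.Confirmed.cases)

# Only build the plot if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sample_cases, aes(x = Total.Confirmed.cases, y = State)) +
    geom_col() +
    ggtitle("Covid-19 cases in India by State (sample rows)")
  # Call print(p) to render the chart.
}
```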

Conclusion

Data frames are a very useful data structure for manipulating a data set and performing some basic analysis before diving deeper into the data. They work well with useful built-in functions such as summary(), mean(), and median(). However, although data frames come with useful functions for analysis, for deeper analysis and visualization we still have to rely on other packages.