Data frames are the primary structure for tabular data in R. A data frame is a two-dimensional structure that stores data in rows and columns. Its main characteristics are: column names should not be empty, row names should be unique, every column should contain the same number of items, and columns can hold numeric, factor, or character data.
To create a data frame we can call data.frame() and pass the data as arguments. Another important function is str(), which prints a compact summary of a data frame's structure.
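As a minimal sketch of these two functions, the block below builds a three-row data frame by hand and inspects it. The values are the first three rows printed later in this report; the name `sample_df` is only illustrative.

```r
# Build a small data frame by hand. Each argument becomes one column,
# and all columns must have the same length.
sample_df <- data.frame(
  State = c("Andaman and Nicobar Islands", "Andhra Pradesh", "Arunachal Pradesh"),
  Total.Confirmed.cases = c(33L, 1463L, 1L),
  Cured = c(16L, 403L, 1L),
  Death = c(0L, 33L, 0L),
  stringsAsFactors = FALSE  # keep State as character on older R versions too
)

# str() prints one line per column with its type and first values
str(sample_df)
```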
In this assignment, I will work with data frames to load sample data and perform a simple analysis using the functions R provides for data frames.
Most of the functions used in this assignment are built into R; the exception is visualization. To run the R Markdown code, the ggplot2 package must be installed.
ggplot2 is the package used for visualization; it produces graphical representations of a data set.
For this assignment I will analyze a data set of Covid-19 cases by state in India. Apart from a row index, the data consists of four columns: state, total confirmed cases, cured, and deaths.
To start analyzing the data set, I will import it from a CSV file into a data frame. The built-in function read.csv() reads the CSV file into a data frame, and head() prints the first six rows so I can verify that the data has loaded successfully.
covid19Data <- read.csv("covid_india_states.csv")
head(covid19Data)
## X State Total.Confirmed.cases Cured Death
## 1 0 Andaman and Nicobar Islands 33 16 0
## 2 1 Andhra Pradesh 1463 403 33
## 3 2 Arunachal Pradesh 1 1 0
## 4 3 Assam 42 29 1
## 5 4 Bihar 426 82 2
## 6 5 Chandigarh 56 17 0
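The loading step can also be tried without the original file. The sketch below feeds read.csv() a few lines of inline text copied from the output above. The empty first header field is an assumption about how the original CSV was written; it is one plausible reason the index column is named X.

```r
# Inline CSV standing in for covid_india_states.csv (three rows copied from
# the output above). The empty first header field is an assumption.
csv_text <- paste(
  ",State,Total.Confirmed.cases,Cured,Death",
  "0,Andaman and Nicobar Islands,33,16,0",
  "1,Andhra Pradesh,1463,403,33",
  "2,Arunachal Pradesh,1,1,0",
  sep = "\n"
)

# read.csv() accepts literal text through its `text` argument
demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A column with an empty header is given the name "X"
names(demo)

# head() takes the number of rows to show as `n` (default 6)
head(demo, n = 2)
```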
Before moving on to the analysis, it is always better to understand the structure of the data. To do that I will use str() as follows:
str(covid19Data)
## 'data.frame': 32 obs. of 5 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ State : chr "Andaman and Nicobar Islands" "Andhra Pradesh" "Arunachal Pradesh" "Assam" ...
## $ Total.Confirmed.cases: int 33 1463 1 42 426 56 40 3515 7 4395 ...
## $ Cured : int 16 403 1 29 82 17 36 1094 7 613 ...
## $ Death : int 0 33 0 1 2 0 0 59 0 214 ...
According to the output above, the data consists of one character column (State) and four integer columns. X appears to be only a row index, so the three count columns are the ones used for analysis.
Data Source: https://www.kaggle.com/ravichaubey1506/covid19-india
I will use several functions to explore the data set assigned to covid19Data.
To get the dimensions of the data frame, that is, the total number of rows and columns, we can use dim() as below:
dim(covid19Data)
## [1] 32 5
There are 32 rows and 5 columns in the data frame.
To get the column names we can use names() as below:
names(covid19Data)
## [1] "X" "State" "Total.Confirmed.cases"
## [4] "Cured" "Death"
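The two values returned by dim() are also available separately. The sketch below uses a placeholder data frame with the same shape as covid19Data; its contents are made up and only the dimensions and column names matter.

```r
# Placeholder data frame with the same shape as covid19Data (32 x 5);
# the column contents are invented for illustration only.
demo <- data.frame(
  X = 0:31,
  State = paste("State", 1:32),
  Total.Confirmed.cases = 1:32,
  Cured = 1:32,
  Death = 0L
)

dim(demo)    # rows first, then columns
nrow(demo)   # rows only
ncol(demo)   # columns only

# For a data frame, names() and colnames() return the same vector
identical(names(demo), colnames(demo))
```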
To read the first 10 rows of the data frame we can use head(). A negative n tells head() to return all rows except the last |n|; since the data frame has 32 rows, passing n = -22 returns the first 10. Passing n = 10 gives the same result.
head(covid19Data, n = -22)
## X State Total.Confirmed.cases Cured Death
## 1 0 Andaman and Nicobar Islands 33 16 0
## 2 1 Andhra Pradesh 1463 403 33
## 3 2 Arunachal Pradesh 1 1 0
## 4 3 Assam 42 29 1
## 5 4 Bihar 426 82 2
## 6 5 Chandigarh 56 17 0
## 7 6 Chhattisgarh 40 36 0
## 8 7 Delhi 3515 1094 59
## 9 8 Goa 7 7 0
## 10 9 Gujarat 4395 613 214
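The equivalence of the positive and negative forms can be checked on any 32-row data frame. A small sketch, with `demo` as a placeholder:

```r
# Placeholder data frame with 32 rows, matching the row count of covid19Data
demo <- data.frame(id = 1:32)

first_by_count <- head(demo, n = 10)   # keep the first 10 rows
first_by_drop  <- head(demo, n = -22)  # drop the last 22 rows

# Both calls select the same rows
identical(first_by_count, first_by_drop)
```

The positive form is usually easier to read, and it does not need to change when the number of rows in the data changes.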
Similarly, tail() with a negative n returns all rows except the first |n|, so passing n = -22 returns the last 10 of the 32 rows:
tail(covid19Data, n = -22)
## X State Total.Confirmed.cases Cured Death
## 23 22 Odisha 143 41 1
## 24 23 Puducherry 8 5 0
## 25 24 Punjab 357 90 19
## 26 25 Rajasthan 2584 836 58
## 27 26 Tamil Nadu 2323 1258 27
## 28 27 Telengana 1039 441 26
## 29 28 Tripura 2 2 0
## 30 29 Uttarakhand 57 36 0
## 31 30 Uttar Pradesh 2281 555 41
## 32 31 West Bengal 795 139 33
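As with head(), a positive n asks for the last rows directly. The sketch below, on a placeholder data frame, also shows that tail() keeps the original row labels, which is why the output above starts at row 23.

```r
# Placeholder data frame with 32 rows, matching the row count of covid19Data
demo <- data.frame(id = 1:32)

last_by_count <- tail(demo, n = 10)   # keep the last 10 rows
last_by_drop  <- tail(demo, n = -22)  # drop the first 22 rows

identical(last_by_count, last_by_drop)

# Row labels are carried over from the original data frame
rownames(last_by_count)
```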
List the names of the states in the data set:
covid19Data['State']
## State
## 1 Andaman and Nicobar Islands
## 2 Andhra Pradesh
## 3 Arunachal Pradesh
## 4 Assam
## 5 Bihar
## 6 Chandigarh
## 7 Chhattisgarh
## 8 Delhi
## 9 Goa
## 10 Gujarat
## 11 Haryana
## 12 Himachal Pradesh
## 13 Jammu and Kashmir
## 14 Jharkhand
## 15 Karnataka
## 16 Kerala
## 17 Ladakh
## 18 Madhya Pradesh
## 19 Maharashtra
## 20 Manipur
## 21 Meghalaya
## 22 Mizoram
## 23 Odisha
## 24 Puducherry
## 25 Punjab
## 26 Rajasthan
## 27 Tamil Nadu
## 28 Telengana
## 29 Tripura
## 30 Uttarakhand
## 31 Uttar Pradesh
## 32 West Bengal
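Single-bracket indexing returns a one-column data frame, which is why the output above is printed with a header and row numbers. When a plain vector is more convenient, `[[` or `$` can be used instead. A short sketch on a three-row stand-in, with values taken from the rows shown earlier:

```r
# Three rows copied from the data shown earlier; `demo` is a stand-in name
demo <- data.frame(
  State = c("Assam", "Bihar", "Goa"),
  Death = c(1L, 2L, 0L),
  stringsAsFactors = FALSE
)

as_frame    <- demo["State"]     # one-column data frame
as_vector   <- demo[["State"]]   # character vector
also_vector <- demo$State        # same vector, shorthand form

class(as_frame)
class(as_vector)
```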
Summarize the data set using summary() to get descriptive statistics for each column:
summary(covid19Data)
## X State Total.Confirmed.cases Cured
## Min. : 0.00 Length:32 Min. : 1.00 Min. : 0.0
## 1st Qu.: 7.75 Class :character 1st Qu.: 30.25 1st Qu.: 16.0
## Median :15.50 Mode :character Median : 228.00 Median : 61.5
## Mean :15.50 Mean : 1092.88 Mean : 283.3
## 3rd Qu.:23.25 3rd Qu.: 1145.00 3rd Qu.: 412.5
## Max. :31.00 Max. :10498.00 Max. :1773.0
## Death
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 2.5
## Mean : 36.0
## 3rd Qu.: 28.5
## Max. :459.0
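The same statistics can be computed for a single column. The sketch below uses only the first ten Total.Confirmed.cases values visible in the str() output, so its results differ from the full-data summary above. It also shows that mode() in R reports how a value is stored, not the most frequent value.

```r
# First ten Total.Confirmed.cases values, copied from the str() output above
cases <- c(33L, 1463L, 1L, 42L, 426L, 56L, 40L, 3515L, 7L, 4395L)

summary(cases)   # the same six statistics, for this vector only
mean(cases)
median(cases)

# mode() is not the statistical mode: it names the storage mode of the vector
mode(cases)
```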
Lastly, for better visualization I will use the ggplot2 package to create a bar chart of total confirmed cases in India by state:
library(ggplot2)
ggplot(covid19Data, aes(x = Total.Confirmed.cases, y = State)) +
  geom_bar(stat = "identity") +
  ggtitle("Covid-19 cases in India by States")
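A possible refinement, sketched here on five rows copied from the head() output, is to sort the bars by case count so the largest values stand out. This is an optional variation and not part of the original analysis. geom_col() is used as the shorthand for geom_bar(stat = "identity"), and `plot_df` is an illustrative name.

```r
library(ggplot2)

# Five rows copied from the head() output earlier in this report
plot_df <- data.frame(
  State = c("Andaman and Nicobar Islands", "Andhra Pradesh",
            "Arunachal Pradesh", "Assam", "Bihar"),
  Total.Confirmed.cases = c(33L, 1463L, 1L, 42L, 426L),
  stringsAsFactors = FALSE
)

# reorder() turns State into a factor whose levels follow the case counts,
# so ggplot2 draws the bars in that order instead of alphabetically
plot_df$State <- reorder(plot_df$State, plot_df$Total.Confirmed.cases)

p <- ggplot(plot_df, aes(x = Total.Confirmed.cases, y = State)) +
  geom_col() +
  labs(title = "Covid-19 cases in India by States",
       x = "Total confirmed cases", y = "State")
print(p)
```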
Data frames are a very useful structure for manipulating a data set and running some basic analysis before taking a deeper dive into the data. Functions such as summary(), str(), and head() work directly on a data frame, and functions such as mean() work on its columns. Note that mode() in R reports an object's storage mode, not the statistical mode. However, although data frames support these basic analysis functions, deeper analysis and visualization still rely on additional packages such as ggplot2.