Introduction

First and foremost, Statistical Analysis mandates the creation and collection of data from different sources and formats. The vast swathes of libraries in R programming language facilitate the collection of data from myriads of formats and sources without a glitch. In this project, we will go through the basics of how to collect and create data then we will dive deep in to data management routines. This project can be used as a reference guide and manual for data analyst, data scientists, machine learning engineers, and learners.

All the code chunks will appear in shaded rectangular boxes and the output of the code chunks will be enclosed in unshaded rectangular boxes.

Data Creation

Objects

We use the term object to refer to data stored in R environment. But, object can denote to other complex data structures like functions. We can store variable data as object in R.

a<-2
b<-3
c<-4

The values 2, 3, and 4 are stored in objects a,b,c respectively. When we evaluate a, b, and c, we get the values stored in those objects.

## [1] 2

## [1] 3

## [1] 4

Vectors

We can also store more than one values in R objects. These are called vectors. Actually, R object with singular value is also a vector but containing only one value, a scalar.

Here, we create a numeric vector x and check the class of the object.

x<-c(1,2,3,4,5) #numeric vector
x

## [1] 1 2 3 4 5

class(x)

## [1] "numeric"

The class of vector x is numeric. We can also create other types of vectors in R: character, logical. In programming language, a vector in R can be classified as one dimensional array.

x<-c("I", "love", "you") #character vector
x

## [1] "I"    "love" "you"

class(x)

## [1] "character"

Here, we have created a character vector x with three different strings as the elements.

x<-c(TRUE, FALSE, TRUE, FALSE, TRUE) #a logical vector of true and false
x

## [1]  TRUE FALSE  TRUE FALSE  TRUE

class(x)

## [1] "logical"

An important fact about vector is that it can only contain similar types of data. If we put numeric and character data in the same vector, R will evaluate all of them as strings.

x<-c(1,2,3, "I", "love", "you" )
x

## [1] "1"    "2"    "3"    "I"    "love" "you"

class(x)

## [1] "character"

Data Frames

The most common way of storing a dataset in R object is in a data frame. Conceptually, we can think of data frame as a table with each row corresponding to a single observation and each column representing different variables.

dat<-data.frame(Student_id=1:10, Physics=c(85,69,72,67,92,89,73,74,56,91), 
                Chemistry=c(81,87,67,73,95,59,78,66,94,71))
dat

##    Student_id Physics Chemistry
## 1           1      85        81
## 2           2      69        87
## 3           3      72        67
## 4           4      67        73
## 5           5      92        95
## 6           6      89        59
## 7           7      73        78
## 8           8      74        66
## 9           9      56        94
## 10         10      91        71

We have created a data frame with three variables based on the results of 10 students in Physics and Chemistry.

The head() command will show the first 6 lines of the data frame.

head(dat)

##   Student_id Physics Chemistry
## 1          1      85        81
## 2          2      69        87
## 3          3      72        67
## 4          4      67        73
## 5          5      92        95
## 6          6      89        59

And the str() function will show the structure of the data frame.

str(dat)

## 'data.frame':    10 obs. of  3 variables:
##  $ Student_id: int  1 2 3 4 5 6 7 8 9 10
##  $ Physics   : num  85 69 72 67 92 89 73 74 56 91
##  $ Chemistry : num  81 87 67 73 95 59 78 66 94 71

We can see, the data frame contains one integer variable and two numeric variables as expected.

The $ symbol is typically used to access the column/variable from a data frame. That extracted column/variable has the properties of a vector.

dat$Physics

##  [1] 85 69 72 67 92 89 73 74 56 91

length() function is used to find the length of any vector.

length(dat$Physics)

## [1] 10

The Physics column in dat data frame has the length of 10.

Factors

Factor is a special data type in R to characterize the categorical variables with different levels.

library(dslabs)
data("murders")
head(murders)

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

class(murders$region)

## [1] "factor"

levels(murders$region)

## [1] "Northeast"     "South"         "North Central" "West"

We have loaded a dataset murders from the dslabs library. This data set contains a column named region with 4 different levels: Northeast, South, North Central, and West. This is a factor column with 4 categories.

We can reorder the appearance of these levels according to the total number of murders in each region.

regions<-murders$region
value<-murders$total
ordered.regions<-reorder(regions,value, FUN = sum)
levels(ordered.regions)

## [1] "Northeast"     "North Central" "West"          "South"

By default, the reorder function works in ascending order.

Matrices

Matrix is another kind of data type or object in R similar to data frame as they are two dimensional:it has rows and columns. But unlike data frame, matrices can not contain different type of data type. Matrices mostly contain numeric data.

But matrices have major advantages over data frames that we can execute matrix algebra operations on them, a powerful mathematical technique ubiquitous in data science, machine learning, and statistical learning.

matrix01<-matrix(data = 1:12, nrow = 3, ncol = 4, byrow = T) #creating a matrix with 3 rows and 
#4 columns. The element will be arranged by row. 
matrix01

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

We can excess matrix elements by the following operations.

matrix01[2,3] #element of second row and third column

## [1] 7

matrix01[3,] #the entire third row

## [1]  9 10 11 12

matrix01[,2] #the entire second column

## [1]  2  6 10

The extracted row and column from the matrix is not a matrix anymore but a vector.

We can also subset the matrix to create smaller matrix.

matrix02<-matrix01[1:2,2:4]
matrix02

##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    6    7    8

Later, we can convert and save the matrix as a data frame.

dat01<-as.data.frame(matrix01)
dat01

##   V1 V2 V3 V4
## 1  1  2  3  4
## 2  5  6  7  8
## 3  9 10 11 12

The columns names have been given automatically but we can give new names to the columns.

column.names<-paste0("Column",1:4)
column.names

## [1] "Column1" "Column2" "Column3" "Column4"

First, we have created the column names. Now, we will assign the names to each column of the data frame.

colnames(dat01)<-column.names
dat01

##   Column1 Column2 Column3 Column4
## 1       1       2       3       4
## 2       5       6       7       8
## 3       9      10      11      12

Data Collection

Importing CSV files

CSV or the comma separated files are the most common file format for storing data. We can import CSV files from our local storage by the following command. The forward slashes “/” has been used in the path of file directory rather than the backward slash "". Otherwise. it might produce an error in reading the file. If it feels way to cumbersome to set the path of the designated file then the function file.choose() can be used to choose the file by clicking on mouse. It is convenient for new learners.

icudata<-read.csv(file="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/ICUData.csv", header = T, stringsAsFactors = TRUE) #importing an icu data set and 
#also converting all character columns to factors. 
head(icudata)

##   ID    sex age          surgery heart.rate temperature bilirubin SAPS.II
## 1  1 female  76            other       98.0        36.5  6.512142      57
## 2  2 female  60 gastrointestinal       80.0        38.1 14.523197      52
## 3  3   male  66   cardiothoracic       99.6        37.4 22.972480      57
## 4  4   male  74            other      110.0        39.1 19.299346      45
## 5  5 female  68            other       94.1        38.5 39.076485      49
## 6  6   male  68   cardiothoracic       88.8        35.1 14.805941      53
##   liver.failure LOS              outcome
## 1             0   1                 died
## 2             0   2                 home
## 3             0   1 secondary care/rehab
## 4             0   2                 home
## 5             0   1                 home
## 6             0   1 secondary care/rehab

TO check the number of observations and variables in the imported dataset, we can use the dim() function.

dim(icudata)

## [1] 500  11

This icudata set has 500 observations and 11 variables with a couple of factor variables.

The read.csv() function is comparatively slow. It is recommended to import any small dataset with read.csv function. We should better use read_csv() function from readr library to import bigger dataset faster than the read.csv() function.

We can also import a CSV file from a URL.

data.url<-read.csv(file = 'https://raw.githubusercontent.com/Statology/Miscellaneous/main/basketball_data.csv')
head(data.url)

##   player assists points
## 1      A       6     12
## 2      B       7     19
## 3      C      14      7
## 4      D       4      6
## 5      E       5     10

class(data.url)

## [1] "data.frame"

We have imported a basketball dataset from a URL saved as a data frame.

Importing Text files

We can use the function readLines() from base R to import and read the lines of any text files. We will read a text file containing the names of popular machine learning algorithms from our local machine.

ML.algorithms<-readLines(con = "C:/Users/PC/Desktop/Machine leanring algo.txt")

## Warning in readLines(con = "C:/Users/PC/Desktop/Machine leanring algo.txt"):
## incomplete final line found on 'C:/Users/PC/Desktop/Machine leanring algo.txt'

ML.algorithms

##  [1] "Linear Regression"                   "Logistic Regression"                
##  [3] "Linear Discriminant Analysis"        "Classification and Regression Trees"
##  [5] "Naive Bayes"                         "K-Nearest Neighbors"                
##  [7] "Learning Vector Quantization"        "Support Vector Machines"            
##  [9] "Random Forest"                       "Boosting"                           
## [11] "AdaBoost"

We can also limit the number of lines to be read from the text file using the argument n.

ML.algorithms.limited<-readLines(con ="C:/Users/PC/Desktop/Machine leanring algo.txt" , n = 5) 
#reading only the first 5 lines
ML.algorithms.limited

## [1] "Linear Regression"                   "Logistic Regression"                
## [3] "Linear Discriminant Analysis"        "Classification and Regression Trees"
## [5] "Naive Bayes"

class(ML.algorithms)

## [1] "character"

class(ML.algorithms.limited)

## [1] "character"

The imported text files are classified as character vector. We can read any line from the character vector by indexing.

ML.algorithms[3] #reading the 3rd line

## [1] "Linear Discriminant Analysis"

We can also convert these character vector in to data frames.

data.algorithms<-data.frame(ML_algorithms=ML.algorithms)
head(data.algorithms)

##                         ML_algorithms
## 1                   Linear Regression
## 2                 Logistic Regression
## 3        Linear Discriminant Analysis
## 4 Classification and Regression Trees
## 5                         Naive Bayes
## 6                 K-Nearest Neighbors

We can access any rows of the data frame by indexing the data frame.

data.algorithms[4,] #extracting 4th row of the data frame

## [1] "Classification and Regression Trees"

Importing zip files

Suppose I have a zip file saved in my local directory. We have to first take a look what’s inside the zip file to extract a particular data set.

unzip(zipfile ="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/murders.zip", 
      list = TRUE) #the list argument will only show the list of files without extracting.

##          Name Length                Date
## 1 murders.csv   1333 2021-08-19 03:38:00

The zip file consists of only one data set named murders. The following command will extract and import the murders data set as a data frame.

library(readr)
data.unzip<-read_csv(file = unzip("C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/murders.zip","murders.csv"))
head(data.unzip)

## # A tibble: 6 x 8
##   state abb   region population PopulationDensi~ murders gunmurders gunownership
##   <chr> <chr> <chr>       <dbl>            <dbl>   <dbl>      <dbl>        <dbl>
## 1 Alab~ AL    South     4779736             94.6     199        135        0.517
## 2 Ariz~ AZ    West      6392017             57.0     352        232        0.311
## 3 Cali~ CA    West     37253956            244.     1811       1257        0.213
## 4 Colo~ CO    West      5029196             49.3     117         65        0.347
## 5 Conn~ CT    North~    3574097            741.      131         97        0.167
## 6 Flor~ FL    South    19687653            360.      987        669        0.245

The murders data set inside the zipped file has been successfully imported as a data frame.

Importing Excel File

We will use the read_excel() function from the readxl library to import excel data file. First, we need to install readxl library. We have already installed the library, so we need to simply load readxl.

library(readxl)
data.excel<-read_excel(path="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/sample excel data.xls", 
sheet = NULL)
head(data.excel)

## # A tibble: 6 x 7
##   OrderDate           Region  Rep     Item   Units `Unit Cost`  Total
##   <dttm>              <chr>   <chr>   <chr>  <dbl>       <dbl>  <dbl>
## 1 2014-09-01 00:00:00 Central Smith   Desk       2      125    250   
## 2 2015-06-17 00:00:00 Central Kivell  Desk       5      125    625   
## 3 2015-09-10 00:00:00 Central Gill    Pencil     7        1.29   9.03
## 4 2015-11-17 00:00:00 Central Jardine Binder    11        4.99  54.9 
## 5 2015-10-31 00:00:00 Central Andrews Pencil    14        1.29  18.1 
## 6 2014-02-26 00:00:00 Central Gill    Pen       27       20.0  540.

class(data.excel)

## [1] "tbl_df"     "tbl"        "data.frame"

We have successfully imported an excel data set on sales from a stationary shop that returned a tibble, a special type of data frame with more flexibility. The sheet argument specifies how many sheet we want to read. Alternatively, we can also specify the range for the columns and rows to be read by the range argument.

Importing TSV File

TSV files are tab separated value files where each data column is separated by tab.

library(readr)
data.tsv<-read_tsv(file="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/Sample TSV files/InventoryEvents.tsv")
head(data.tsv)

## # A tibble: 6 x 5
##   EventDate           EventType Quantity        ItemId SkuId
##   <dttm>              <chr>        <dbl>         <dbl> <chr>
## 1 2013-01-01 10:10:35 out              3 3596710216072 <NA> 
## 2 2013-01-01 10:10:36 out              1 3596710216074 <NA> 
## 3 2013-01-01 10:10:46 out              1 3596710216069 <NA> 
## 4 2013-01-01 10:10:54 out              3 3596710216070 <NA> 
## 5 2013-01-01 10:10:55 out              1 3596710216073 <NA> 
## 6 2013-01-01 10:40:23 out              1 3596710216072 <NA>

The inventory events tsv data set from the local machine has been imported as a tibble.

Importing File with Read.table() function

Read.table is a generic function to read data from any tabular format. We can specify the delimiter and header option as arguments in the read.table() function. By default, the function will not read the first row as the header.

df.table<-read.table(file = "C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/ICUData.csv", 
                     header = TRUE, sep = ",", stringsAsFactors = TRUE) #csv files are comma 
#separated so sep argument has been set to "," 
head(df.table)

##   ID    sex age          surgery heart.rate temperature bilirubin SAPS.II
## 1  1 female  76            other       98.0        36.5  6.512142      57
## 2  2 female  60 gastrointestinal       80.0        38.1 14.523197      52
## 3  3   male  66   cardiothoracic       99.6        37.4 22.972480      57
## 4  4   male  74            other      110.0        39.1 19.299346      45
## 5  5 female  68            other       94.1        38.5 39.076485      49
## 6  6   male  68   cardiothoracic       88.8        35.1 14.805941      53
##   liver.failure LOS              outcome
## 1             0   1                 died
## 2             0   2                 home
## 3             0   1 secondary care/rehab
## 4             0   2                 home
## 5             0   1                 home
## 6             0   1 secondary care/rehab

Data Management and Data Manipulation

Sorting and Ordering

Sorting

We will import the US gun murder data for sorting operation. We will try to get some insights on the safety of different state in the context of gun murders.

library(dslabs)
data("murders")
head(murders)

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

We will sort the total number of murders in increasing order from lowest to highest.

sort(murders$total)

##  [1]    2    4    5    5    7    8   11   12   12   16   19   21   22   27   32
## [16]   36   38   53   63   65   67   84   93   93   97   97   99  111  116  118
## [31]  120  135  142  207  219  232  246  250  286  293  310  321  351  364  376
## [46]  413  457  517  669  805 1257

However, this does not give us any information on which states have which murder total. We don’t see here which state has 351 murder in total. For this purpose, we will look at another function order(). Order works a bit differently than the sorting function. It returns the index that sorts the original vector. We can create a vector and sort it.

x<-c(10,89,2,8,17,13,49)
sort(x)

## [1]  2  8 10 13 17 49 89

Ordering

Rather than sorting the vector, we can get the index of for the sorting operation by order() function and sort it. The order function gives us the index of the elements in the vector for sorting in increasing order.

index<-order(x)
index

## [1] 3 4 1 6 5 7 2

We got the index we need to sort the vector x. Now we can sort it.

x[index]

## [1]  2  8 10 13 17 49 89

As we can see, the results are the same. So, how this helps us getting the name of the states with respect to the total murder? Remember, each index corresponds to a particular row in the data frame. So first, we can get the index according to the total murder. Then we can use the index to order the states according to the total murder.

index.total.murder<-order(murders$total)
murders$state[index.total.murder]

##  [1] "Vermont"              "North Dakota"         "New Hampshire"       
##  [4] "Wyoming"              "Hawaii"               "South Dakota"        
##  [7] "Maine"                "Idaho"                "Montana"             
## [10] "Rhode Island"         "Alaska"               "Iowa"                
## [13] "Utah"                 "West Virginia"        "Nebraska"            
## [16] "Oregon"               "Delaware"             "Minnesota"           
## [19] "Kansas"               "Colorado"             "New Mexico"          
## [22] "Nevada"               "Arkansas"             "Washington"          
## [25] "Connecticut"          "Wisconsin"            "District of Columbia"
## [28] "Oklahoma"             "Kentucky"             "Massachusetts"       
## [31] "Mississippi"          "Alabama"              "Indiana"             
## [34] "South Carolina"       "Tennessee"            "Arizona"             
## [37] "New Jersey"           "Virginia"             "North Carolina"      
## [40] "Maryland"             "Ohio"                 "Missouri"            
## [43] "Louisiana"            "Illinois"             "Georgia"             
## [46] "Michigan"             "Pennsylvania"         "New York"            
## [49] "Florida"              "Texas"                "California"

So, finally we see, Vermont has lowest total of murder and California has the highest total of murder.

Let’s check out if this is indeed true.

murders[murders$state %in% c("Vermont","California"),]

##         state abb    region population total
## 5  California  CA      West   37253956  1257
## 46    Vermont  VT Northeast     625741     2

Yes, we see Vermont has only 2 murders in total and California with the 1257 murders in total.

If we want to sort in decreasing order than we can specify the argument decreasing=TRUE.

sort(murders$total, decreasing = TRUE)

##  [1] 1257  805  669  517  457  413  376  364  351  321  310  293  286  250  246
## [16]  232  219  207  142  135  120  118  116  111   99   97   97   93   93   84
## [31]   67   65   63   53   38   36   32   27   22   21   19   16   12   12   11
## [46]    8    7    5    5    4    2

Maximum and Minimum

If we simply want to find out the maximum value from a vector, we can use the max() function.

max(murders$total)

## [1] 1257

Now, we will look out for the state for this maximum total murder. For this, we can use the which.max() function. which.max function gives the index of the maximum of the vector, in this case, total murder.

murders$state[which.max(murders$total)]

## [1] "California"

Similarly, we can use the functions min() and which.min() to find the minimum number of total murder and the state.

min(murders$total)

## [1] 2

murders$state[which.min(murders$total)]

## [1] "Vermont"

Rank

Rank is another useful function related to sorting and ordering. It gives the ranking of each element in the vector according to their values.

## [1] 10 89  2  8 17 13 49

rank(x)

## [1] 3 7 1 2 5 4 6

The first element in the vector x is 10 which is ranked 3 and 89 is ranked 1 for being the highest value.

Tidy Data

We can say, a data table or data frame is in tidy format if each row represents an observation and columns represent different variable available for each of the observation. Our murders dataset in a proper example of tidy data with each row representing a state with each of the five columns providing a different variable related to these states: name, abbreviation, region, population, and total murders.

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

But not all data comes in such a nice, healthy, and tidy format. There are many examples of wide data, we need to reshape in to tidy format. The tidyr package includes several functions that are useful for tidying up data. We will import a fertility wide format dataset from dslabs package to reshape it in tidy format.

library(tidyverse)
library(dslabs)
path <- system.file("extdata", package="dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)

wide_data[,1:8]

## # A tibble: 2 x 8
##   country     `1960` `1961` `1962` `1963` `1964` `1965` `1966`
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44
## 2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99

Okay, now we call this fertility wide data “not tidy”. Why? Upon close inspection, we see each row has multiple observations of the same type of data(fertility) through different years. In tidy data format, the years as the column/variable heading should be in a “year” variable itself and the values of fertility should be in a column named “fertility”.

Gather

We can use the gather() function from tidyr package to convert this wide data in to tidy data.

tidy.data<-gather(data = wide_data, "year","fertility","1960":"2015")
head(tidy.data)

## # A tibble: 6 x 3
##   country     year  fertility
##   <chr>       <chr>     <dbl>
## 1 Germany     1960       2.41
## 2 South Korea 1960       6.16
## 3 Germany     1961       2.44
## 4 South Korea 1961       5.99
## 5 Germany     1962       2.47
## 6 South Korea 1962       5.79

The data has been transformed in to tidy format with columns year and fertility. In this example, the column country was not gathered and each year resulted in two rows since we have two countries. It is recommended to specify in gather function which column will not be gathered.

tidy.data01<-gather(data = wide_data, year, fertility, -country)
head(tidy.data01)

## # A tibble: 6 x 3
##   country     year  fertility
##   <chr>       <chr>     <dbl>
## 1 Germany     1960       2.41
## 2 South Korea 1960       6.16
## 3 Germany     1961       2.44
## 4 South Korea 1961       5.99
## 5 Germany     1962       2.47
## 6 South Korea 1962       5.79

Both of the converted tidy data looks exactly the same. The class of year column in the tidy data is character. We need to convert this character vector to integer so that we can analysis the data.

tidy.data$year<-as.integer(tidy.data$year)
tidy.data

## # A tibble: 112 x 3
##    country      year fertility
##    <chr>       <int>     <dbl>
##  1 Germany      1960      2.41
##  2 South Korea  1960      6.16
##  3 Germany      1961      2.44
##  4 South Korea  1961      5.99
##  5 Germany      1962      2.47
##  6 South Korea  1962      5.79
##  7 Germany      1963      2.49
##  8 South Korea  1963      5.57
##  9 Germany      1964      2.49
## 10 South Korea  1964      5.36
## # ... with 102 more rows

Now that the data has become tidy, we can apply simple plotting command to the data.

tidy.data %>% ggplot(aes(year,fertility,color=country))+geom_point()

Spread

The spread() function does exactly the opposite of the gather function.

wide.data.again<-tidy.data %>% spread(key = year, value=fertility)
wide.data.again[,1:8]

## # A tibble: 2 x 8
##   country     `1960` `1961` `1962` `1963` `1964` `1965` `1966`
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44
## 2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99

We have recovered the wide data from the tidy data.

Separate

All data wrangling processes are not as simple as the above examples we have shown. We will import another wide dataset with two variables. The dataset is definitely not tidy, and also not optimal for practical data analysis, and inference.

path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")
raw_data <- read_csv(filename)
select(raw_data, 1:5)

## # A tibble: 2 x 5
##   country  `1960_fertility` `1960_life_expec~ `1961_fertility` `1961_life_expec~
##   <chr>               <dbl>             <dbl>            <dbl>             <dbl>
## 1 Germany              2.41              69.3             2.44              69.8
## 2 South K~             6.16              53.0             5.99              53.8

The data set is not in tidy format. Moreover, the data set contains two variables: fertility and life expectancy. We will use the gather function first to collapse the columns but we must be careful not to name the new column because the new key column will contain two different variables. We will sort this out later.

raw.gather.data<-raw_data %>% gather(key,value,-country)
head(raw.gather.data)

## # A tibble: 6 x 3
##   country     key                  value
##   <chr>       <chr>                <dbl>
## 1 Germany     1960_fertility        2.41
## 2 South Korea 1960_fertility        6.16
## 3 Germany     1960_life_expectancy 69.3 
## 4 South Korea 1960_life_expectancy 53.0 
## 5 Germany     1961_fertility        2.44
## 6 South Korea 1961_fertility        5.99

We have to separate the year information from the variable names and create another column for the year.

raw.gather.data %>% separate(key, c("year","variable_name"), "_") #separating the "key" column in to

## # A tibble: 224 x 4
##    country     year  variable_name value
##    <chr>       <chr> <chr>         <dbl>
##  1 Germany     1960  fertility      2.41
##  2 South Korea 1960  fertility      6.16
##  3 Germany     1960  life          69.3 
##  4 South Korea 1960  life          53.0 
##  5 Germany     1961  fertility      2.44
##  6 South Korea 1961  fertility      5.99
##  7 Germany     1961  life          69.8 
##  8 South Korea 1961  life          53.8 
##  9 Germany     1962  fertility      2.47
## 10 South Korea 1962  fertility      5.79
## # ... with 214 more rows

#two new columns: year and variable_name. The separator here is "_"

We have received an warning for 112 rows where the life_expectancy has been truncated to life as the separator only worked for the first “-” separator but not for the second appearance of “" in the variable. We can solve this problem by merging the extra "” with the variables.

raw.gather.data<-raw.gather.data %>% separate(key, c("year","variable_name"), extra="merge")
head(raw.gather.data, n=10)

## # A tibble: 10 x 4
##    country     year  variable_name   value
##    <chr>       <chr> <chr>           <dbl>
##  1 Germany     1960  fertility        2.41
##  2 South Korea 1960  fertility        6.16
##  3 Germany     1960  life_expectancy 69.3 
##  4 South Korea 1960  life_expectancy 53.0 
##  5 Germany     1961  fertility        2.44
##  6 South Korea 1961  fertility        5.99
##  7 Germany     1961  life_expectancy 69.8 
##  8 South Korea 1961  life_expectancy 53.8 
##  9 Germany     1962  fertility        2.47
## 10 South Korea 1962  fertility        5.79

In the last step, we need to spread fertility and life_expectancy as separate variables.

final.tidy.data<-raw.gather.data %>% spread(variable_name,value)
head(final.tidy.data,n=10)

## # A tibble: 10 x 4
##    country year  fertility life_expectancy
##    <chr>   <chr>     <dbl>           <dbl>
##  1 Germany 1960       2.41            69.3
##  2 Germany 1961       2.44            69.8
##  3 Germany 1962       2.47            70.0
##  4 Germany 1963       2.49            70.1
##  5 Germany 1964       2.49            70.7
##  6 Germany 1965       2.48            70.6
##  7 Germany 1966       2.44            70.8
##  8 Germany 1967       2.37            71.0
##  9 Germany 1968       2.28            70.6
## 10 Germany 1969       2.17            70.5

Finally, we have produced the neat and tidy data with one row for each observation with three variables: year, fertility and life expectancy.

Manipulation of Data Frames

The dplyr package from the tidyverse library provides some of the most common operations dealing with data frames. The names are very suggestive of the operations they perform and therefore easy to remember. To add a new column to the existing data frame, we will use mutate function. To filter the data frame conditionally, we will use filter and to subset a data frame, we will use select function.

Adding a column with mutate

We can extend our data frame by adding new columns by manipulating the already existing columns for further analysis. We will add a new rate column in our murder dataset to get the murder rate with respect to the per million population for each of the states.

library(dslabs)
data("murders")
murders.new<-murders %>% mutate(rate=total/population*10^5)
head(murders.new)

##        state abb region population total     rate
## 1    Alabama  AL  South    4779736   135 2.824424
## 2     Alaska  AK   West     710231    19 2.675186
## 3    Arizona  AZ   West    6392017   232 3.629527
## 4   Arkansas  AR  South    2915918    93 3.189390
## 5 California  CA   West   37253956  1257 3.374138
## 6   Colorado  CO   West    5029196    65 1.292453

A new column rate has appeared in the murders data frame.

Subsetting with Filter

Now, we will subset our newly created murders data frame with conditionals applied through the filter function.

murders.new %>% filter(rate<=0.80)

##           state abb        region population total      rate
## 1        Hawaii  HI          West    1360301     7 0.5145920
## 2         Idaho  ID          West    1567582    12 0.7655102
## 3          Iowa  IA North Central    3046355    21 0.6893484
## 4 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 5  North Dakota  ND North Central     672591     4 0.5947151
## 6          Utah  UT          West    2763885    22 0.7959810
## 7       Vermont  VT     Northeast     625741     2 0.3196211

We have filtered the state with murder rate below or equal to 0.80 per million of the population.

Select

Previously, subsetting was applied only to the rows of the data frame. We will not subset the data frame with columns by the select function.

murders.new.select<-murders.new %>% select(state, region, rate)
murders.new.select %>% filter(rate<=0.80)

##           state        region      rate
## 1        Hawaii          West 0.5145920
## 2         Idaho          West 0.7655102
## 3          Iowa North Central 0.6893484
## 4 New Hampshire     Northeast 0.3798036
## 5  North Dakota North Central 0.5947151
## 6          Utah          West 0.7959810
## 7       Vermont     Northeast 0.3196211

We have preferentially selected only the state, region, and rate columns and later filtered them with murder rate equal to or below 0.80 per million of the population.

Summarize

Summarizing data is one of the most important aspects of exploratory data analysis. The summarize function in dplyr package is suitable to get the summary statistics in a very easy-to-understand and intuitive format. In this section, we will import a curated dataset heights which includes male and female heights.

data("heights")
head(heights)

##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

We will filter the male height and get the average height of male. Later, we will learn to get grouped summary.

height.summary<-heights %>% filter(sex=="Male") %>% summarize(average_male=mean(height), 
                                              standard_deviation=sd(height))
height.summary

##   average_male standard_deviation
## 1     69.31475           3.611024

Remember, the summarize function always returns a data frame, not a numeric vector.

class(height.summary)

## [1] "data.frame"

We can access the element of height.summary by the accessor $ syntax.

height.summary$average_male

## [1] 69.31475

height.summary$standard_deviation

## [1] 3.611024

Another thing to remember, the summarize function will always return a single numeric value. If we use any function which returns more than one value then we will get an error. But there’s a tricky way to get summarize function returning more than a single numeric output.

If we want to calculate the average US murder rate then simply calculating the average of rate column in the murders.new will not suffice because all the states are given the same weight. This is wrong:

mean(murders.new$rate)

## [1] 2.779125

The following is the right calculation of average:

sum(murders.new$total)/sum(murders.new$population)*10^5

## [1] 3.034555

In this case, all the states are weighted in proportion to their population size.

Pull

In our previous discussion, we have seen that the summarize function always returns a data frame which is sometimes problematic if we simply need a numeric value as input for some other functions.

We can use the pull function to get around with it. The pull function is able to pull out a value from a data frame.

murders.new %>% summarize(overall_rate=sum(total)/sum(population)*10^5) %>%
  pull(overall_rate)

## [1] 3.034555

This time we got a simple numeric value as output rather than a data frame from the summarize function.

Grouped Summary using group_by

In a data frame, if there are factors or categorical variable then we can group the data by each category. This grouped data can later be used to summarize on each group or category. The heights dataset has a categorical variable sex with two distinct factor:Male, and Female. We can group this dataset using group_by function from dplyr package.

grouped.heights<-heights %>% group_by(sex)
grouped.heights

## # A tibble: 1,050 x 2
## # Groups:   sex [2]
##    sex    height
##    <fct>   <dbl>
##  1 Male       75
##  2 Male       70
##  3 Male       68
##  4 Male       74
##  5 Male       61
##  6 Female     65
##  7 Female     66
##  8 Female     62
##  9 Female     66
## 10 Male       67
## # ... with 1,040 more rows

Although, this grouped dataset looks very similar to the heights dataset but internally this dataset has been divided in to two distinct datasets containing two different groups of Male and Female.

We can get the summary from this grouped dataset.

grouped.heights %>% summarize(average=mean(height), standard_deviation=sd(height))

## # A tibble: 2 x 3
##   sex    average standard_deviation
##   <fct>    <dbl>              <dbl>
## 1 Female    64.9               3.76
## 2 Male      69.3               3.61

The summarize function returned a data frame with average and standard_deviation for both the Male and Female groups.

From the murders.new dataset, we can group by region and get the median rate for each of the region.

murders.new %>% group_by(region) %>% summarize(median_rate=median(rate))

## # A tibble: 4 x 2
##   region        median_rate
##   <fct>               <dbl>
## 1 Northeast            1.80
## 2 South                3.40
## 3 North Central        1.97
## 4 West                 1.29

Arranging the Data Frames

For close inspection, it is sometimes necessary to sort the data frame by the different columns. We have already learned about the sort and order function. But, if we want to sort an entire data table then the arrange function from the dplyr library is a very flexible choice. We will arrange the murders.new dataset first with the population size and later with the murder rate. By defaults, the arrange function orders the dataset in the ascending order.

murders.new %>% arrange(population) %>% head()

##                  state abb        region population total       rate
## 1              Wyoming  WY          West     563626     5  0.8871131
## 2 District of Columbia  DC         South     601723    99 16.4527532
## 3              Vermont  VT     Northeast     625741     2  0.3196211
## 4         North Dakota  ND North Central     672591     4  0.5947151
## 5               Alaska  AK          West     710231    19  2.6751860
## 6         South Dakota  SD North Central     814180     8  0.9825837

murders.new %>% arrange(rate) %>% head()

##           state abb        region population total      rate
## 1       Vermont  VT     Northeast     625741     2 0.3196211
## 2 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 3        Hawaii  HI          West    1360301     7 0.5145920
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5          Iowa  IA North Central    3046355    21 0.6893484
## 6         Idaho  ID          West    1567582    12 0.7655102

If we want to arrange the table in descending order of the population size then we can use the desc function inside the arrange function.

murders.new %>% arrange(desc(population)) %>% head()

##          state abb        region population total     rate
## 1   California  CA          West   37253956  1257 3.374138
## 2        Texas  TX         South   25145561   805 3.201360
## 3      Florida  FL         South   19687653   669 3.398069
## 4     New York  NY     Northeast   19378102   517 2.667960
## 5     Illinois  IL North Central   12830632   364 2.836961
## 6 Pennsylvania  PA     Northeast   12702379   457 3.597751

Top n

For most part of our data manipulation, we have been using the head function to shows the first couple of rows in the whole data frame. But, if we want to see the top n number of rows according to certain column then we can use the top_n function. Here, we will show the top 10 rows from the murders.new dataset according to murder rate. Note than, these rows are not arranged in an ascending or descending order. For that, we need to arrange them explicitly.

murders.new %>% top_n(10, rate)

##                   state abb        region population total      rate
## 1               Arizona  AZ          West    6392017   232  3.629527
## 2              Delaware  DE         South     897934    38  4.231937
## 3  District of Columbia  DC         South     601723    99 16.452753
## 4               Georgia  GA         South    9920000   376  3.790323
## 5             Louisiana  LA         South    4533372   351  7.742581
## 6              Maryland  MD         South    5773552   293  5.074866
## 7              Michigan  MI North Central    9883640   413  4.178622
## 8           Mississippi  MS         South    2967297   120  4.044085
## 9              Missouri  MO North Central    5988927   321  5.359892
## 10       South Carolina  SC         South    4625364   207  4.475323

Dot Operator

The dot operator, “.”, is a very useful operator to pull out a column from a data frame and save it as a vector after manipulation. Later we can calculate certain statistics from this vector. Here we will use the dot operator to pull out the rate column after manipulation then find the median.

rate<-murders %>% filter(region=="South") %>% mutate(rate=total/population*10^5) %>%
  .$rate
median(rate)

## [1] 3.398069

Do

Previously we have stated that if the summarize function have to return multiple outputs then it will produce an error. Most of the base R functions do not understand the group_by tibble and we can’t directly pipe the output of a grouped tibble as the input of a R functions.

The do function comes to the rescue. It essentially creates the bridge between base R functions and the tibbles. For that, we have to create a function that fits in to the tidyverse approach: it receives a data frame and returns a data frame. We will create a function to return minimum, median, and the maximum height from the data frame provided as the argument of the function.

new.height.summary<-function(dat){
  x<-quantile(dat$height, c(0,0.5,1)) #0 for minimum, 0.5 for median and 1 for the maximum. 
  tibble(minimum=x[1], median=x[2], maximum=x[3])
}

Now, instead of using the summarize function we can use our custom-made summary function combined with the do function to get robust summary for both Male and Female when they are grouped by the group_by function.

heights %>% group_by(sex) %>% do(new.height.summary(.))

## # A tibble: 2 x 4
## # Groups:   sex [2]
##   sex    minimum median maximum
##   <fct>    <dbl>  <dbl>   <dbl>
## 1 Female      51   65.0    79  
## 2 Male        50   69      82.7

Voila! That worked perfectly. We had to use the dot operator inside the new.height.summary function to pipe the tibble created by group_by function as the argument of the new.height.summary function. Otherwise it will produce an error by saying argument “dat” is missing.

Joining Tables: alternatives of SQL

For data analysis, one table might not be fully adequate for the purpose. We may require to explore more data tables and at times, join those tables together to compare various statistics. This section will be much more similar to SQL join functions and other set operations as well. We will explore all of them followed by practical examples.

Suppose, we intent to explore the relationship between the population size of each state in US and the number of electoral votes. For this, we will import two different datasets, one containing the US population size and the other with the electoral votes. First, we load the table with population size.

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

And this table contains the number of electoral votes.

##          state electoral_votes clinton trump others
## 1   California              55    61.7  31.6    6.7
## 2        Texas              38    43.2  52.2    4.5
## 3      Florida              29    47.8  49.0    3.2
## 4     New York              29    59.0  36.5    4.5
## 5     Illinois              20    55.8  38.8    5.4
## 6 Pennsylvania              20    47.9  48.6    3.6

Concatenating or joining these two tables straightforward is not savvy since the order of states in both tables are different. We can check this by this command.

identical(results_us_election_2016$state, murders$state)

## [1] FALSE

The join functions described below are built to handle these situations.

Let’s join these two tables by the left_join() function from the dplyr library. We will remove the others column to fit the table in the screen.

tab.joined<-left_join(murders,results_us_election_2016, by="state") %>% select(-others)
head(tab.joined)

##        state abb region population total electoral_votes clinton trump
## 1    Alabama  AL  South    4779736   135               9    34.4  62.1
## 2     Alaska  AK   West     710231    19               3    36.6  51.3
## 3    Arizona  AZ   West    6392017   232              11    45.1  48.7
## 4   Arkansas  AR  South    2915918    93               6    33.7  60.6
## 5 California  CA   West   37253956  1257              55    61.7  31.6
## 6   Colorado  CO   West    5029196    65               9    48.2  43.3

The two tables have been joined successfully. Now, we can navigate and draw a smooth plot from this newly created data table using ggplot library.

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.0.5

tab.joined %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) + #population in millions
geom_point() +
geom_text_repel(max.overlaps = 20) + #for marking points in plots with state abbreviation
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
geom_smooth(method = "lm", se = FALSE) #fitting a linear model

In practice, we will not always be so lucky to get two tables with matching rows, in this case, matching state names in both of the tables.

We will now create two tables with some matching and non matching states.

table01<-slice(murders,1:8) %>% select(state, population)
table01

##         state population
## 1     Alabama    4779736
## 2      Alaska     710231
## 3     Arizona    6392017
## 4    Arkansas    2915918
## 5  California   37253956
## 6    Colorado    5029196
## 7 Connecticut    3574097
## 8    Delaware     897934

table02<-results_us_election_2016 %>% 
  filter(state %in% c("Alabama","Alaska","Arizona","Delaware","Colorado","Ohio","Texas","Florida")) %>%
  select(state,electoral_votes)
table02

##      state electoral_votes
## 1    Texas              38
## 2  Florida              29
## 3     Ohio              18
## 4  Arizona              11
## 5  Alabama               9
## 6 Colorado               9
## 7   Alaska               3
## 8 Delaware               3

Left Join

If we want a table like table01 and simply want to add electoral votes from table02 then we can use the left_join function. We also need to specify the criteria in the argument by which to match the two tables, in this case, state.

table.left<-left_join(table01, table02, by="state")
table.left

##         state population electoral_votes
## 1     Alabama    4779736               9
## 2      Alaska     710231               3
## 3     Arizona    6392017              11
## 4    Arkansas    2915918              NA
## 5  California   37253956              NA
## 6    Colorado    5029196               9
## 7 Connecticut    3574097              NA
## 8    Delaware     897934               3

We can see, three NA’s have been introduced in the electoral_votes column due to the absence of matching states in table02 with table01. The states Arkansas,California, and Connecticut are missing in the second table.

Right Join

Alternatively, we can join table01 and table02 but this time according to the matching states of table02. We will use right_join function here.

table.right<-right_join(table01, table02, by="state")
table.right

##      state population electoral_votes
## 1  Alabama    4779736               9
## 2   Alaska     710231               3
## 3  Arizona    6392017              11
## 4 Colorado    5029196               9
## 5 Delaware     897934               3
## 6    Texas         NA              38
## 7  Florida         NA              29
## 8     Ohio         NA              18

We can see, the preferences have been given to table02 with three NA’s being introduced in the population column as the states Texas, Florida, and Ohio are missing in table01.

Inner Join

If we want to join two tables and keep only those rows with matching information then we can use the inner_join function. This is similar to the set operation intersection.

table.inner<-inner_join(table01,table02, by="state")
table.inner

##      state population electoral_votes
## 1  Alabama    4779736               9
## 2   Alaska     710231               3
## 3  Arizona    6392017              11
## 4 Colorado    5029196               9
## 5 Delaware     897934               3

Only the states common to both of the tables are returned.

Full Join

If we intend to keep all the rows from both of the tables then we can use full_join. It is similar to the set operation union. The non-matching rows will return NA’s.

table.full<-full_join(table01, table02, by="state")
table.full

##          state population electoral_votes
## 1      Alabama    4779736               9
## 2       Alaska     710231               3
## 3      Arizona    6392017              11
## 4     Arkansas    2915918              NA
## 5   California   37253956              NA
## 6     Colorado    5029196               9
## 7  Connecticut    3574097              NA
## 8     Delaware     897934               3
## 9        Texas         NA              38
## 10     Florida         NA              29
## 11        Ohio         NA              18

Semi Join

The semi_join function keep the part of first table for which we have information in the second. It does not add the columns of the second.

table.semi<-semi_join(table01, table02, by="state")
table.semi

##      state population
## 1  Alabama    4779736
## 2   Alaska     710231
## 3  Arizona    6392017
## 4 Colorado    5029196
## 5 Delaware     897934

Anti Join

anti_join function is the opposite of semi_join function. It keeps the information of table01 for which there are no matching information in table02

table.anti<-anti_join(table01, table02, by="state")
table.anti

##         state population
## 1    Arkansas    2915918
## 2  California   37253956
## 3 Connecticut    3574097

Set Operations

Set operations are pretty useful for combining datasets. Defaults R commands for set operations are uniquely applicable on vectors but set operations from dplyr package are applicable also in data frames. Basic set operations are intersection, union, and set difference.

Intersection

We will create two subsets of tables from the previously created tab.joined table.

set01<-tab.joined[1:5,] #selecting the first 5 rows
set02<-tab.joined[3:7,] #selecting rows 3 to 7

To make sure, we use the intersect function from the dplyr rather than the base package, we can use dplyr::intersect command like this:

dplyr::intersect(set01,set02)

##        state abb region population total electoral_votes clinton trump
## 1    Arizona  AZ   West    6392017   232              11    45.1  48.7
## 2   Arkansas  AR  South    2915918    93               6    33.7  60.6
## 3 California  CA   West   37253956  1257              55    61.7  31.6

We get only three rows common in the two tables or sets.

Union

The union function from the dplyr package will unite or combine all the rows in the two tables or sets.

dplyr::union(set01,set02)

##         state abb    region population total electoral_votes clinton trump
## 1     Alabama  AL     South    4779736   135               9    34.4  62.1
## 2      Alaska  AK      West     710231    19               3    36.6  51.3
## 3     Arizona  AZ      West    6392017   232              11    45.1  48.7
## 4    Arkansas  AR     South    2915918    93               6    33.7  60.6
## 5  California  CA      West   37253956  1257              55    61.7  31.6
## 6    Colorado  CO      West    5029196    65               9    48.2  43.3
## 7 Connecticut  CT Northeast    3574097    97               7    54.6  40.9

Now, we get all the rows combined in the two tables or sets.

Set Difference

The setdiff function in dplyr package will subtract the common rows, between first and second set, from the first set.

dplyr::setdiff(set01,set02)

##     state abb region population total electoral_votes clinton trump
## 1 Alabama  AL  South    4779736   135               9    34.4  62.1
## 2  Alaska  AK   West     710231    19               3    36.6  51.3

Set Equality

The setequal function from the dplyr package is a very important function to check whether two sets or tables are equal regardless on the order of the rows.

dplyr::setequal(set01,set02)

## [1] FALSE

set01 and set02 tables are not equal. They have different rows.

This marks the end of this Data Collection and Data Management project. In future string processing, web scraping, and Data Mining procedures in R will be discussed in details.

Data Collection and Data Management Routines in R

Md Sakhawat Hossen

2/7/2022