First and foremost, Statistical Analysis mandates the creation and collection of data from different sources and formats. The vast swathes of libraries in R programming language facilitate the collection of data from myriads of formats and sources without a glitch. In this project, we will go through the basics of how to collect and create data then we will dive deep in to data management routines. This project can be used as a reference guide and manual for data analyst, data scientists, machine learning engineers, and learners.
All the code chunks will appear in shaded rectangular boxes and the output of the code chunks will be enclosed in unshaded rectangular boxes.
We use the term object to refer to data stored in R environment. But, object can denote to other complex data structures like functions. We can store variable data as object in R.
The values 2, 3, and 4 are stored in objects a,b,c respectively. When we evaluate a, b, and c, we get the values stored in those objects.
## [1] 2
## [1] 3
## [1] 4
We can also store more than one values in R objects. These are called vectors. Actually, R object with singular value is also a vector but containing only one value, a scalar.
Here, we create a numeric vector x and check the class of the object.
## [1] 1 2 3 4 5
## [1] "numeric"
The class of vector x is numeric. We can also create other types of vectors in R: character, logical. In programming language, a vector in R can be classified as one dimensional array.
## [1] "I" "love" "you"
## [1] "character"
Here, we have created a character vector x with three different strings as the elements.
## [1] TRUE FALSE TRUE FALSE TRUE
## [1] "logical"
An important fact about vector is that it can only contain similar types of data. If we put numeric and character data in the same vector, R will evaluate all of them as strings.
## [1] "1" "2" "3" "I" "love" "you"
## [1] "character"
The most common way of storing a dataset in R object is in a data frame. Conceptually, we can think of data frame as a table with each row corresponding to a single observation and each column representing different variables.
dat<-data.frame(Student_id=1:10, Physics=c(85,69,72,67,92,89,73,74,56,91),
Chemistry=c(81,87,67,73,95,59,78,66,94,71))
dat## Student_id Physics Chemistry
## 1 1 85 81
## 2 2 69 87
## 3 3 72 67
## 4 4 67 73
## 5 5 92 95
## 6 6 89 59
## 7 7 73 78
## 8 8 74 66
## 9 9 56 94
## 10 10 91 71
We have created a data frame with three variables based on the results of 10 students in Physics and Chemistry.
The head() command will show the first 6 lines of the data frame.
## Student_id Physics Chemistry
## 1 1 85 81
## 2 2 69 87
## 3 3 72 67
## 4 4 67 73
## 5 5 92 95
## 6 6 89 59
And the str() function will show the structure of the data frame.
## 'data.frame': 10 obs. of 3 variables:
## $ Student_id: int 1 2 3 4 5 6 7 8 9 10
## $ Physics : num 85 69 72 67 92 89 73 74 56 91
## $ Chemistry : num 81 87 67 73 95 59 78 66 94 71
We can see, the data frame contains one integer variable and two numeric variables as expected.
The $ symbol is typically used to access the column/variable from a data frame. That extracted column/variable has the properties of a vector.
## [1] 85 69 72 67 92 89 73 74 56 91
length() function is used to find the length of any vector.
## [1] 10
The Physics column in dat data frame has the length of 10.
Factor is a special data type in R to characterize the categorical variables with different levels.
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## [1] "factor"
## [1] "Northeast" "South" "North Central" "West"
We have loaded a dataset murders from the dslabs library. This data set contains a column named region with 4 different levels: Northeast, South, North Central, and West. This is a factor column with 4 categories.
We can reorder the appearance of these levels according to the total number of murders in each region.
regions<-murders$region
value<-murders$total
ordered.regions<-reorder(regions,value, FUN = sum)
levels(ordered.regions)## [1] "Northeast" "North Central" "West" "South"
By default, the reorder function works in ascending order.
Matrix is another kind of data type or object in R similar to data frame as they are two dimensional:it has rows and columns. But unlike data frame, matrices can not contain different type of data type. Matrices mostly contain numeric data.
But matrices have major advantages over data frames that we can execute matrix algebra operations on them, a powerful mathematical technique ubiquitous in data science, machine learning, and statistical learning.
matrix01<-matrix(data = 1:12, nrow = 3, ncol = 4, byrow = T) #creating a matrix with 3 rows and
#4 columns. The element will be arranged by row.
matrix01## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
We can excess matrix elements by the following operations.
## [1] 7
## [1] 9 10 11 12
## [1] 2 6 10
The extracted row and column from the matrix is not a matrix anymore but a vector.
We can also subset the matrix to create smaller matrix.
## [,1] [,2] [,3]
## [1,] 2 3 4
## [2,] 6 7 8
Later, we can convert and save the matrix as a data frame.
## V1 V2 V3 V4
## 1 1 2 3 4
## 2 5 6 7 8
## 3 9 10 11 12
The columns names have been given automatically but we can give new names to the columns.
## [1] "Column1" "Column2" "Column3" "Column4"
First, we have created the column names. Now, we will assign the names to each column of the data frame.
## Column1 Column2 Column3 Column4
## 1 1 2 3 4
## 2 5 6 7 8
## 3 9 10 11 12
CSV or the comma separated files are the most common file format for storing data. We can import CSV files from our local storage by the following command. The forward slashes “/” has been used in the path of file directory rather than the backward slash "". Otherwise. it might produce an error in reading the file. If it feels way to cumbersome to set the path of the designated file then the function file.choose() can be used to choose the file by clicking on mouse. It is convenient for new learners.
icudata<-read.csv(file="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/ICUData.csv", header = T, stringsAsFactors = TRUE) #importing an icu data set and
#also converting all character columns to factors.
head(icudata)## ID sex age surgery heart.rate temperature bilirubin SAPS.II
## 1 1 female 76 other 98.0 36.5 6.512142 57
## 2 2 female 60 gastrointestinal 80.0 38.1 14.523197 52
## 3 3 male 66 cardiothoracic 99.6 37.4 22.972480 57
## 4 4 male 74 other 110.0 39.1 19.299346 45
## 5 5 female 68 other 94.1 38.5 39.076485 49
## 6 6 male 68 cardiothoracic 88.8 35.1 14.805941 53
## liver.failure LOS outcome
## 1 0 1 died
## 2 0 2 home
## 3 0 1 secondary care/rehab
## 4 0 2 home
## 5 0 1 home
## 6 0 1 secondary care/rehab
TO check the number of observations and variables in the imported dataset, we can use the dim() function.
## [1] 500 11
This icudata set has 500 observations and 11 variables with a couple of factor variables.
The read.csv() function is comparatively slow. It is recommended to import any small dataset with read.csv function. We should better use read_csv() function from readr library to import bigger dataset faster than the read.csv() function.
We can also import a CSV file from a URL.
data.url<-read.csv(file = 'https://raw.githubusercontent.com/Statology/Miscellaneous/main/basketball_data.csv')
head(data.url)## player assists points
## 1 A 6 12
## 2 B 7 19
## 3 C 14 7
## 4 D 4 6
## 5 E 5 10
## [1] "data.frame"
We have imported a basketball dataset from a URL saved as a data frame.
We can use the function readLines() from base R to import and read the lines of any text files. We will read a text file containing the names of popular machine learning algorithms from our local machine.
## Warning in readLines(con = "C:/Users/PC/Desktop/Machine leanring algo.txt"):
## incomplete final line found on 'C:/Users/PC/Desktop/Machine leanring algo.txt'
## [1] "Linear Regression" "Logistic Regression"
## [3] "Linear Discriminant Analysis" "Classification and Regression Trees"
## [5] "Naive Bayes" "K-Nearest Neighbors"
## [7] "Learning Vector Quantization" "Support Vector Machines"
## [9] "Random Forest" "Boosting"
## [11] "AdaBoost"
We can also limit the number of lines to be read from the text file using the argument n.
ML.algorithms.limited<-readLines(con ="C:/Users/PC/Desktop/Machine leanring algo.txt" , n = 5)
#reading only the first 5 lines
ML.algorithms.limited## [1] "Linear Regression" "Logistic Regression"
## [3] "Linear Discriminant Analysis" "Classification and Regression Trees"
## [5] "Naive Bayes"
## [1] "character"
## [1] "character"
The imported text files are classified as character vector. We can read any line from the character vector by indexing.
## [1] "Linear Discriminant Analysis"
We can also convert these character vector in to data frames.
## ML_algorithms
## 1 Linear Regression
## 2 Logistic Regression
## 3 Linear Discriminant Analysis
## 4 Classification and Regression Trees
## 5 Naive Bayes
## 6 K-Nearest Neighbors
We can access any rows of the data frame by indexing the data frame.
## [1] "Classification and Regression Trees"
Suppose I have a zip file saved in my local directory. We have to first take a look what’s inside the zip file to extract a particular data set.
unzip(zipfile ="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/murders.zip",
list = TRUE) #the list argument will only show the list of files without extracting. ## Name Length Date
## 1 murders.csv 1333 2021-08-19 03:38:00
The zip file consists of only one data set named murders. The following command will extract and import the murders data set as a data frame.
library(readr)
data.unzip<-read_csv(file = unzip("C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/murders.zip","murders.csv"))
head(data.unzip)## # A tibble: 6 x 8
## state abb region population PopulationDensi~ murders gunmurders gunownership
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab~ AL South 4779736 94.6 199 135 0.517
## 2 Ariz~ AZ West 6392017 57.0 352 232 0.311
## 3 Cali~ CA West 37253956 244. 1811 1257 0.213
## 4 Colo~ CO West 5029196 49.3 117 65 0.347
## 5 Conn~ CT North~ 3574097 741. 131 97 0.167
## 6 Flor~ FL South 19687653 360. 987 669 0.245
The murders data set inside the zipped file has been successfully imported as a data frame.
We will use the read_excel() function from the readxl library to import excel data file. First, we need to install readxl library. We have already installed the library, so we need to simply load readxl.
library(readxl)
data.excel<-read_excel(path="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/sample excel data.xls",
sheet = NULL)
head(data.excel)## # A tibble: 6 x 7
## OrderDate Region Rep Item Units `Unit Cost` Total
## <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2014-09-01 00:00:00 Central Smith Desk 2 125 250
## 2 2015-06-17 00:00:00 Central Kivell Desk 5 125 625
## 3 2015-09-10 00:00:00 Central Gill Pencil 7 1.29 9.03
## 4 2015-11-17 00:00:00 Central Jardine Binder 11 4.99 54.9
## 5 2015-10-31 00:00:00 Central Andrews Pencil 14 1.29 18.1
## 6 2014-02-26 00:00:00 Central Gill Pen 27 20.0 540.
## [1] "tbl_df" "tbl" "data.frame"
We have successfully imported an excel data set on sales from a stationary shop that returned a tibble, a special type of data frame with more flexibility. The sheet argument specifies how many sheet we want to read. Alternatively, we can also specify the range for the columns and rows to be read by the range argument.
TSV files are tab separated value files where each data column is separated by tab.
library(readr)
data.tsv<-read_tsv(file="C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/Sample TSV files/InventoryEvents.tsv")
head(data.tsv)## # A tibble: 6 x 5
## EventDate EventType Quantity ItemId SkuId
## <dttm> <chr> <dbl> <dbl> <chr>
## 1 2013-01-01 10:10:35 out 3 3596710216072 <NA>
## 2 2013-01-01 10:10:36 out 1 3596710216074 <NA>
## 3 2013-01-01 10:10:46 out 1 3596710216069 <NA>
## 4 2013-01-01 10:10:54 out 3 3596710216070 <NA>
## 5 2013-01-01 10:10:55 out 1 3596710216073 <NA>
## 6 2013-01-01 10:40:23 out 1 3596710216072 <NA>
The inventory events tsv data set from the local machine has been imported as a tibble.
Read.table is a generic function to read data from any tabular format. We can specify the delimiter and header option as arguments in the read.table() function. By default, the function will not read the first row as the header.
df.table<-read.table(file = "C:/Users/PC/R Working Directory/Datasets for Statistical Analysis/ICUData.csv",
header = TRUE, sep = ",", stringsAsFactors = TRUE) #csv files are comma
#separated so sep argument has been set to ","
head(df.table)## ID sex age surgery heart.rate temperature bilirubin SAPS.II
## 1 1 female 76 other 98.0 36.5 6.512142 57
## 2 2 female 60 gastrointestinal 80.0 38.1 14.523197 52
## 3 3 male 66 cardiothoracic 99.6 37.4 22.972480 57
## 4 4 male 74 other 110.0 39.1 19.299346 45
## 5 5 female 68 other 94.1 38.5 39.076485 49
## 6 6 male 68 cardiothoracic 88.8 35.1 14.805941 53
## liver.failure LOS outcome
## 1 0 1 died
## 2 0 2 home
## 3 0 1 secondary care/rehab
## 4 0 2 home
## 5 0 1 home
## 6 0 1 secondary care/rehab
We will import the US gun murder data for sorting operation. We will try to get some insights on the safety of different state in the context of gun murders.
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
We will sort the total number of murders in increasing order from lowest to highest.
## [1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
## [16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
## [31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
## [46] 413 457 517 669 805 1257
However, this does not give us any information on which states have which murder total. We don’t see here which state has 351 murder in total. For this purpose, we will look at another function order(). Order works a bit differently than the sorting function. It returns the index that sorts the original vector. We can create a vector and sort it.
## [1] 2 8 10 13 17 49 89
Rather than sorting the vector, we can get the index of for the sorting operation by order() function and sort it. The order function gives us the index of the elements in the vector for sorting in increasing order.
## [1] 3 4 1 6 5 7 2
We got the index we need to sort the vector x. Now we can sort it.
## [1] 2 8 10 13 17 49 89
As we can see, the results are the same. So, how this helps us getting the name of the states with respect to the total murder? Remember, each index corresponds to a particular row in the data frame. So first, we can get the index according to the total murder. Then we can use the index to order the states according to the total murder.
## [1] "Vermont" "North Dakota" "New Hampshire"
## [4] "Wyoming" "Hawaii" "South Dakota"
## [7] "Maine" "Idaho" "Montana"
## [10] "Rhode Island" "Alaska" "Iowa"
## [13] "Utah" "West Virginia" "Nebraska"
## [16] "Oregon" "Delaware" "Minnesota"
## [19] "Kansas" "Colorado" "New Mexico"
## [22] "Nevada" "Arkansas" "Washington"
## [25] "Connecticut" "Wisconsin" "District of Columbia"
## [28] "Oklahoma" "Kentucky" "Massachusetts"
## [31] "Mississippi" "Alabama" "Indiana"
## [34] "South Carolina" "Tennessee" "Arizona"
## [37] "New Jersey" "Virginia" "North Carolina"
## [40] "Maryland" "Ohio" "Missouri"
## [43] "Louisiana" "Illinois" "Georgia"
## [46] "Michigan" "Pennsylvania" "New York"
## [49] "Florida" "Texas" "California"
So, finally we see, Vermont has lowest total of murder and California has the highest total of murder.
Let’s check out if this is indeed true.
## state abb region population total
## 5 California CA West 37253956 1257
## 46 Vermont VT Northeast 625741 2
Yes, we see Vermont has only 2 murders in total and California with the 1257 murders in total.
If we want to sort in decreasing order than we can specify the argument decreasing=TRUE.
## [1] 1257 805 669 517 457 413 376 364 351 321 310 293 286 250 246
## [16] 232 219 207 142 135 120 118 116 111 99 97 97 93 93 84
## [31] 67 65 63 53 38 36 32 27 22 21 19 16 12 12 11
## [46] 8 7 5 5 4 2
If we simply want to find out the maximum value from a vector, we can use the max() function.
## [1] 1257
Now, we will look out for the state for this maximum total murder. For this, we can use the which.max() function. which.max function gives the index of the maximum of the vector, in this case, total murder.
## [1] "California"
Similarly, we can use the functions min() and which.min() to find the minimum number of total murder and the state.
## [1] 2
## [1] "Vermont"
Rank is another useful function related to sorting and ordering. It gives the ranking of each element in the vector according to their values.
## [1] 10 89 2 8 17 13 49
## [1] 3 7 1 2 5 4 6
The first element in the vector x is 10 which is ranked 3 and 89 is ranked 1 for being the highest value.
We can say, a data table or data frame is in tidy format if each row represents an observation and columns represent different variable available for each of the observation. Our murders dataset in a proper example of tidy data with each row representing a state with each of the five columns providing a different variable related to these states: name, abbreviation, region, population, and total murders.
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
But not all data comes in such a nice, healthy, and tidy format. There are many examples of wide data, we need to reshape in to tidy format. The tidyr package includes several functions that are useful for tidying up data. We will import a fertility wide format dataset from dslabs package to reshape it in tidy format.
library(tidyverse)
library(dslabs)
path <- system.file("extdata", package="dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)## # A tibble: 2 x 8
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44
## 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99
Okay, now we call this fertility wide data “not tidy”. Why? Upon close inspection, we see each row has multiple observations of the same type of data(fertility) through different years. In tidy data format, the years as the column/variable heading should be in a “year” variable itself and the values of fertility should be in a column named “fertility”.
We can use the gather() function from tidyr package to convert this wide data in to tidy data.
## # A tibble: 6 x 3
## country year fertility
## <chr> <chr> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
The data has been transformed in to tidy format with columns year and fertility. In this example, the column country was not gathered and each year resulted in two rows since we have two countries. It is recommended to specify in gather function which column will not be gathered.
## # A tibble: 6 x 3
## country year fertility
## <chr> <chr> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
Both of the converted tidy data looks exactly the same. The class of year column in the tidy data is character. We need to convert this character vector to integer so that we can analysis the data.
## # A tibble: 112 x 3
## country year fertility
## <chr> <int> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
## 7 Germany 1963 2.49
## 8 South Korea 1963 5.57
## 9 Germany 1964 2.49
## 10 South Korea 1964 5.36
## # ... with 102 more rows
Now that the data has become tidy, we can apply simple plotting command to the data.
The spread() function does exactly the opposite of the gather function.
## # A tibble: 2 x 8
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44
## 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99
We have recovered the wide data from the tidy data.
All data wrangling processes are not as simple as the above examples we have shown. We will import another wide dataset with two variables. The dataset is definitely not tidy, and also not optimal for practical data analysis, and inference.
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")
raw_data <- read_csv(filename)
select(raw_data, 1:5)## # A tibble: 2 x 5
## country `1960_fertility` `1960_life_expec~ `1961_fertility` `1961_life_expec~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 69.3 2.44 69.8
## 2 South K~ 6.16 53.0 5.99 53.8
The data set is not in tidy format. Moreover, the data set contains two variables: fertility and life expectancy. We will use the gather function first to collapse the columns but we must be careful not to name the new column because the new key column will contain two different variables. We will sort this out later.
## # A tibble: 6 x 3
## country key value
## <chr> <chr> <dbl>
## 1 Germany 1960_fertility 2.41
## 2 South Korea 1960_fertility 6.16
## 3 Germany 1960_life_expectancy 69.3
## 4 South Korea 1960_life_expectancy 53.0
## 5 Germany 1961_fertility 2.44
## 6 South Korea 1961_fertility 5.99
We have to separate the year information from the variable names and create another column for the year.
raw.gather.data %>% separate(key, c("year","variable_name"), "_") #separating the "key" column in to ## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life 69.3
## 4 South Korea 1960 life 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life 69.8
## 8 South Korea 1961 life 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
## # ... with 214 more rows
We have received an warning for 112 rows where the life_expectancy has been truncated to life as the separator only worked for the first “-” separator but not for the second appearance of “" in the variable. We can solve this problem by merging the extra "” with the variables.
raw.gather.data<-raw.gather.data %>% separate(key, c("year","variable_name"), extra="merge")
head(raw.gather.data, n=10)## # A tibble: 10 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life_expectancy 69.3
## 4 South Korea 1960 life_expectancy 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life_expectancy 69.8
## 8 South Korea 1961 life_expectancy 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
In the last step, we need to spread fertility and life_expectancy as separate variables.
## # A tibble: 10 x 4
## country year fertility life_expectancy
## <chr> <chr> <dbl> <dbl>
## 1 Germany 1960 2.41 69.3
## 2 Germany 1961 2.44 69.8
## 3 Germany 1962 2.47 70.0
## 4 Germany 1963 2.49 70.1
## 5 Germany 1964 2.49 70.7
## 6 Germany 1965 2.48 70.6
## 7 Germany 1966 2.44 70.8
## 8 Germany 1967 2.37 71.0
## 9 Germany 1968 2.28 70.6
## 10 Germany 1969 2.17 70.5
Finally, we have produced the neat and tidy data with one row for each observation with three variables: year, fertility and life expectancy.
The dplyr package from the tidyverse library provides some of the most common operations dealing with data frames. The names are very suggestive of the operations they perform and therefore easy to remember. To add a new column to the existing data frame, we will use mutate function. To filter the data frame conditionally, we will use filter and to subset a data frame, we will use select function.
We can extend our data frame by adding new columns by manipulating the already existing columns for further analysis. We will add a new rate column in our murder dataset to get the murder rate with respect to the per million population for each of the states.
library(dslabs)
data("murders")
murders.new<-murders %>% mutate(rate=total/population*10^5)
head(murders.new)## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.824424
## 2 Alaska AK West 710231 19 2.675186
## 3 Arizona AZ West 6392017 232 3.629527
## 4 Arkansas AR South 2915918 93 3.189390
## 5 California CA West 37253956 1257 3.374138
## 6 Colorado CO West 5029196 65 1.292453
A new column rate has appeared in the murders data frame.
Now, we will subset our newly created murders data frame with conditionals applied through the filter function.
## state abb region population total rate
## 1 Hawaii HI West 1360301 7 0.5145920
## 2 Idaho ID West 1567582 12 0.7655102
## 3 Iowa IA North Central 3046355 21 0.6893484
## 4 New Hampshire NH Northeast 1316470 5 0.3798036
## 5 North Dakota ND North Central 672591 4 0.5947151
## 6 Utah UT West 2763885 22 0.7959810
## 7 Vermont VT Northeast 625741 2 0.3196211
We have filtered the state with murder rate below or equal to 0.80 per million of the population.
Previously, subsetting was applied only to the rows of the data frame. We will not subset the data frame with columns by the select function.
murders.new.select<-murders.new %>% select(state, region, rate)
murders.new.select %>% filter(rate<=0.80)## state region rate
## 1 Hawaii West 0.5145920
## 2 Idaho West 0.7655102
## 3 Iowa North Central 0.6893484
## 4 New Hampshire Northeast 0.3798036
## 5 North Dakota North Central 0.5947151
## 6 Utah West 0.7959810
## 7 Vermont Northeast 0.3196211
We have preferentially selected only the state, region, and rate columns and later filtered them with murder rate equal to or below 0.80 per million of the population.
Summarizing data is one of the most important aspects of exploratory data analysis. The summarize function in dplyr package is suitable to get the summary statistics in a very easy-to-understand and intuitive format. In this section, we will import a curated dataset heights which includes male and female heights.
## sex height
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
We will filter the male height and get the average height of male. Later, we will learn to get grouped summary.
height.summary<-heights %>% filter(sex=="Male") %>% summarize(average_male=mean(height),
standard_deviation=sd(height))
height.summary## average_male standard_deviation
## 1 69.31475 3.611024
Remember, the summarize function always returns a data frame, not a numeric vector.
## [1] "data.frame"
We can access the element of height.summary by the accessor $ syntax.
## [1] 69.31475
## [1] 3.611024
Another thing to remember, the summarize function will always return a single numeric value. If we use any function which returns more than one value then we will get an error. But there’s a tricky way to get summarize function returning more than a single numeric output.
If we want to calculate the average US murder rate then simply calculating the average of rate column in the murders.new will not suffice because all the states are given the same weight. This is wrong:
## [1] 2.779125
The following is the right calculation of average:
## [1] 3.034555
In this case, all the states are weighted in proportion to their population size.
In our previous discussion, we have seen that the summarize function always returns a data frame which is sometimes problematic if we simply need a numeric value as input for some other functions.
We can use the pull function to get around with it. The pull function is able to pull out a value from a data frame.
## [1] 3.034555
This time we got a simple numeric value as output rather than a data frame from the summarize function.
In a data frame, if there are factors or categorical variable then we can group the data by each category. This grouped data can later be used to summarize on each group or category. The heights dataset has a categorical variable sex with two distinct factor:Male, and Female. We can group this dataset using group_by function from dplyr package.
## # A tibble: 1,050 x 2
## # Groups: sex [2]
## sex height
## <fct> <dbl>
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
## 7 Female 66
## 8 Female 62
## 9 Female 66
## 10 Male 67
## # ... with 1,040 more rows
Although, this grouped dataset looks very similar to the heights dataset but internally this dataset has been divided in to two distinct datasets containing two different groups of Male and Female.
We can get the summary from this grouped dataset.
## # A tibble: 2 x 3
## sex average standard_deviation
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
The summarize function returned a data frame with average and standard_deviation for both the Male and Female groups.
From the murders.new dataset, we can group by region and get the median rate for each of the region.
## # A tibble: 4 x 2
## region median_rate
## <fct> <dbl>
## 1 Northeast 1.80
## 2 South 3.40
## 3 North Central 1.97
## 4 West 1.29
For close inspection, it is sometimes necessary to sort the data frame by the different columns. We have already learned about the sort and order function. But, if we want to sort an entire data table then the arrange function from the dplyr library is a very flexible choice. We will arrange the murders.new dataset first with the population size and later with the murder rate. By defaults, the arrange function orders the dataset in the ascending order.
## state abb region population total rate
## 1 Wyoming WY West 563626 5 0.8871131
## 2 District of Columbia DC South 601723 99 16.4527532
## 3 Vermont VT Northeast 625741 2 0.3196211
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Alaska AK West 710231 19 2.6751860
## 6 South Dakota SD North Central 814180 8 0.9825837
## state abb region population total rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Hawaii HI West 1360301 7 0.5145920
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Iowa IA North Central 3046355 21 0.6893484
## 6 Idaho ID West 1567582 12 0.7655102
If we want to arrange the table in descending order of the population size then we can use the desc function inside the arrange function.
## state abb region population total rate
## 1 California CA West 37253956 1257 3.374138
## 2 Texas TX South 25145561 805 3.201360
## 3 Florida FL South 19687653 669 3.398069
## 4 New York NY Northeast 19378102 517 2.667960
## 5 Illinois IL North Central 12830632 364 2.836961
## 6 Pennsylvania PA Northeast 12702379 457 3.597751
For most part of our data manipulation, we have been using the head function to shows the first couple of rows in the whole data frame. But, if we want to see the top n number of rows according to certain column then we can use the top_n function. Here, we will show the top 10 rows from the murders.new dataset according to murder rate. Note than, these rows are not arranged in an ascending or descending order. For that, we need to arrange them explicitly.
## state abb region population total rate
## 1 Arizona AZ West 6392017 232 3.629527
## 2 Delaware DE South 897934 38 4.231937
## 3 District of Columbia DC South 601723 99 16.452753
## 4 Georgia GA South 9920000 376 3.790323
## 5 Louisiana LA South 4533372 351 7.742581
## 6 Maryland MD South 5773552 293 5.074866
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Missouri MO North Central 5988927 321 5.359892
## 10 South Carolina SC South 4625364 207 4.475323
The dot operator, “.”, is a very useful operator to pull out a column from a data frame and save it as a vector after manipulation. Later we can calculate certain statistics from this vector. Here we will use the dot operator to pull out the rate column after manipulation then find the median.
rate<-murders %>% filter(region=="South") %>% mutate(rate=total/population*10^5) %>%
.$rate
median(rate)## [1] 3.398069
Previously we have stated that if the summarize function have to return multiple outputs then it will produce an error. Most of the base R functions do not understand the group_by tibble and we can’t directly pipe the output of a grouped tibble as the input of a R functions.
The do function comes to the rescue. It essentially creates the bridge between base R functions and the tibbles. For that, we have to create a function that fits in to the tidyverse approach: it receives a data frame and returns a data frame. We will create a function to return minimum, median, and the maximum height from the data frame provided as the argument of the function.
new.height.summary<-function(dat){
x<-quantile(dat$height, c(0,0.5,1)) #0 for minimum, 0.5 for median and 1 for the maximum.
tibble(minimum=x[1], median=x[2], maximum=x[3])
}Now, instead of using the summarize function we can use our custom-made summary function combined with the do function to get robust summary for both Male and Female when they are grouped by the group_by function.
## # A tibble: 2 x 4
## # Groups: sex [2]
## sex minimum median maximum
## <fct> <dbl> <dbl> <dbl>
## 1 Female 51 65.0 79
## 2 Male 50 69 82.7
Voila! That worked perfectly. We had to use the dot operator inside the new.height.summary function to pipe the tibble created by group_by function as the argument of the new.height.summary function. Otherwise it will produce an error by saying argument “dat” is missing.
For data analysis, one table might not be fully adequate for the purpose. We may require to explore more data tables and at times, join those tables together to compare various statistics. This section will be much more similar to SQL join functions and other set operations as well. We will explore all of them followed by practical examples.
Suppose, we intent to explore the relationship between the population size of each state in US and the number of electoral votes. For this, we will import two different datasets, one containing the US population size and the other with the electoral votes. First, we load the table with population size.
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
And this table contains the number of electoral votes.
## state electoral_votes clinton trump others
## 1 California 55 61.7 31.6 6.7
## 2 Texas 38 43.2 52.2 4.5
## 3 Florida 29 47.8 49.0 3.2
## 4 New York 29 59.0 36.5 4.5
## 5 Illinois 20 55.8 38.8 5.4
## 6 Pennsylvania 20 47.9 48.6 3.6
Concatenating or joining these two tables straightforward is not savvy since the order of states in both tables are different. We can check this by this command.
## [1] FALSE
The join functions described below are built to handle these situations.
Let’s join these two tables by the left_join() function from the dplyr library. We will remove the others column to fit the table in the screen.
tab.joined<-left_join(murders,results_us_election_2016, by="state") %>% select(-others)
head(tab.joined)## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
The two tables have been joined successfully. Now, we can navigate and draw a smooth plot from this newly created data table using ggplot library.
## Warning: package 'ggrepel' was built under R version 4.0.5
tab.joined %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) + #population in millions
geom_point() +
geom_text_repel(max.overlaps = 20) + #for marking points in plots with state abbreviation
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
geom_smooth(method = "lm", se = FALSE) #fitting a linear model In practice, we will not always be so lucky to get two tables with matching rows, in this case, matching state names in both of the tables.
We will now create two tables with some matching and non matching states.
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Arkansas 2915918
## 5 California 37253956
## 6 Colorado 5029196
## 7 Connecticut 3574097
## 8 Delaware 897934
table02<-results_us_election_2016 %>%
filter(state %in% c("Alabama","Alaska","Arizona","Delaware","Colorado","Ohio","Texas","Florida")) %>%
select(state,electoral_votes)
table02## state electoral_votes
## 1 Texas 38
## 2 Florida 29
## 3 Ohio 18
## 4 Arizona 11
## 5 Alabama 9
## 6 Colorado 9
## 7 Alaska 3
## 8 Delaware 3
If we want a table like table01 and simply want to add electoral votes from table02 then we can use the left_join function. We also need to specify the criteria in the argument by which to match the two tables, in this case, state.
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 NA
## 6 Colorado 5029196 9
## 7 Connecticut 3574097 NA
## 8 Delaware 897934 3
We can see, three NA’s have been introduced in the electoral_votes column due to the absence of matching states in table02 with table01. The states Arkansas,California, and Connecticut are missing in the second table.
Alternatively, we can join table01 and table02 but this time according to the matching states of table02. We will use right_join function here.
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Colorado 5029196 9
## 5 Delaware 897934 3
## 6 Texas NA 38
## 7 Florida NA 29
## 8 Ohio NA 18
We can see, the preferences have been given to table02 with three NA’s being introduced in the population column as the states Texas, Florida, and Ohio are missing in table01.
If we want to join two tables and keep only those rows with matching information then we can use the inner_join function. This is similar to the set operation intersection.
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Colorado 5029196 9
## 5 Delaware 897934 3
Only the states common to both of the tables are returned.
If we intend to keep all the rows from both of the tables then we can use full_join. It is similar to the set operation union. The non-matching rows will return NA’s.
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 NA
## 6 Colorado 5029196 9
## 7 Connecticut 3574097 NA
## 8 Delaware 897934 3
## 9 Texas NA 38
## 10 Florida NA 29
## 11 Ohio NA 18
The semi_join function keep the part of first table for which we have information in the second. It does not add the columns of the second.
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Colorado 5029196
## 5 Delaware 897934
anti_join function is the opposite of semi_join function. It keeps the information of table01 for which there are no matching information in table02
## state population
## 1 Arkansas 2915918
## 2 California 37253956
## 3 Connecticut 3574097
Set operations are pretty useful for combining datasets. Defaults R commands for set operations are uniquely applicable on vectors but set operations from dplyr package are applicable also in data frames. Basic set operations are intersection, union, and set difference.
We will create two subsets of tables from the previously created tab.joined table.
To make sure, we use the intersect function from the dplyr rather than the base package, we can use dplyr::intersect command like this:
## state abb region population total electoral_votes clinton trump
## 1 Arizona AZ West 6392017 232 11 45.1 48.7
## 2 Arkansas AR South 2915918 93 6 33.7 60.6
## 3 California CA West 37253956 1257 55 61.7 31.6
We get only three rows common in the two tables or sets.
The union function from the dplyr package will unite or combine all the rows in the two tables or sets.
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## 7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
Now, we get all the rows combined in the two tables or sets.
The setdiff function in dplyr package will subtract the common rows, between first and second set, from the first set.
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
The setequal function from the dplyr package is a very important function to check whether two sets or tables are equal regardless on the order of the rows.
## [1] FALSE
set01 and set02 tables are not equal. They have different rows.
This marks the end of this Data Collection and Data Management project. In future string processing, web scraping, and Data Mining procedures in R will be discussed in details.