library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.0     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
package 'ggplot2' was built under R version 3.6.3
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
bmi <- read_csv("bmi.csv")
Parsed with column specification:
cols(
.default = col_double(),
Country = col_character()
)
See spec(...) for full column specifications.
# Check the class of bmi
class(bmi)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# Check the dimensions of bmi
dim(bmi)
[1] 199 30
# View the column names of bmi
names(bmi)
[1] "Country" "Y1980" "Y1981" "Y1982" "Y1983" "Y1984"
[7] "Y1985" "Y1986" "Y1987" "Y1988" "Y1989" "Y1990"
[13] "Y1991" "Y1992" "Y1993" "Y1994" "Y1995" "Y1996"
[19] "Y1997" "Y1998" "Y1999" "Y2000" "Y2001" "Y2002"
[25] "Y2003" "Y2004" "Y2005" "Y2006" "Y2007" "Y2008"
Since bmi doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str() (for structure) command. In addition to the class and dimensions of your entire dataset, str() will tell you the class of each variable and give you a preview of its contents.
The glimpse() function from dplyr is a slightly cleaner alternative to str(). Both str() and glimpse() give you a preview of your data, which may reveal issues with the way columns are labelled, how variables are encoded, etc.
You can use the summary() command to get a better feel for how your data are distributed, which may reveal unusual or extreme values, unexpected missing data, etc. For numeric variables, this means looking at means, quartiles (including the median), and extreme values. For character or factor variables, you may be curious about the number of times each value appears in the data (i.e. counts), which summary() also reveals.
# Check the structure of bmi
str(bmi)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 199 obs. of 30 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Y1980 : num 21.5 25.2 22.3 25.7 20.9 ...
$ Y1981 : num 21.5 25.2 22.3 25.7 20.9 ...
$ Y1982 : num 21.5 25.3 22.4 25.7 20.9 ...
$ Y1983 : num 21.4 25.3 22.5 25.8 20.9 ...
$ Y1984 : num 21.4 25.3 22.6 25.8 20.9 ...
$ Y1985 : num 21.4 25.3 22.7 25.9 20.9 ...
$ Y1986 : num 21.4 25.3 22.8 25.9 21 ...
$ Y1987 : num 21.4 25.3 22.8 25.9 21 ...
$ Y1988 : num 21.3 25.3 22.9 26 21 ...
$ Y1989 : num 21.3 25.3 23 26 21.1 ...
$ Y1990 : num 21.2 25.3 23 26.1 21.1 ...
$ Y1991 : num 21.2 25.3 23.1 26.2 21.1 ...
$ Y1992 : num 21.1 25.2 23.2 26.2 21.1 ...
$ Y1993 : num 21.1 25.2 23.3 26.3 21.1 ...
$ Y1994 : num 21 25.2 23.3 26.4 21.1 ...
$ Y1995 : num 20.9 25.3 23.4 26.4 21.2 ...
$ Y1996 : num 20.9 25.3 23.5 26.5 21.2 ...
$ Y1997 : num 20.8 25.3 23.5 26.6 21.2 ...
$ Y1998 : num 20.8 25.4 23.6 26.7 21.3 ...
$ Y1999 : num 20.8 25.5 23.7 26.8 21.3 ...
$ Y2000 : num 20.7 25.6 23.8 26.8 21.4 ...
$ Y2001 : num 20.6 25.7 23.9 26.9 21.4 ...
$ Y2002 : num 20.6 25.8 24 27 21.5 ...
$ Y2003 : num 20.6 25.9 24.1 27.1 21.6 ...
$ Y2004 : num 20.6 26 24.2 27.2 21.7 ...
$ Y2005 : num 20.6 26.1 24.3 27.3 21.8 ...
$ Y2006 : num 20.6 26.2 24.4 27.4 21.9 ...
$ Y2007 : num 20.6 26.3 24.5 27.5 22.1 ...
$ Y2008 : num 20.6 26.4 24.6 27.6 22.3 ...
- attr(*, "spec")=
.. cols(
.. Country = col_character(),
.. Y1980 = col_double(),
.. Y1981 = col_double(),
.. Y1982 = col_double(),
.. Y1983 = col_double(),
.. Y1984 = col_double(),
.. Y1985 = col_double(),
.. Y1986 = col_double(),
.. Y1987 = col_double(),
.. Y1988 = col_double(),
.. Y1989 = col_double(),
.. Y1990 = col_double(),
.. Y1991 = col_double(),
.. Y1992 = col_double(),
.. Y1993 = col_double(),
.. Y1994 = col_double(),
.. Y1995 = col_double(),
.. Y1996 = col_double(),
.. Y1997 = col_double(),
.. Y1998 = col_double(),
.. Y1999 = col_double(),
.. Y2000 = col_double(),
.. Y2001 = col_double(),
.. Y2002 = col_double(),
.. Y2003 = col_double(),
.. Y2004 = col_double(),
.. Y2005 = col_double(),
.. Y2006 = col_double(),
.. Y2007 = col_double(),
.. Y2008 = col_double()
.. )
# Check the structure of bmi, the dplyr way
glimpse(bmi)
Observations: 199
Variables: 30
$ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "...
$ Y1980 <dbl> 21.48678, 25.22533, 22.25703, 25.66652, 20.94876,...
$ Y1981 <dbl> 21.46552, 25.23981, 22.34745, 25.70868, 20.94371,...
$ Y1982 <dbl> 21.45145, 25.25636, 22.43647, 25.74681, 20.93754,...
$ Y1983 <dbl> 21.43822, 25.27176, 22.52105, 25.78250, 20.93187,...
$ Y1984 <dbl> 21.42734, 25.27901, 22.60633, 25.81874, 20.93569,...
$ Y1985 <dbl> 21.41222, 25.28669, 22.69501, 25.85236, 20.94857,...
$ Y1986 <dbl> 21.40132, 25.29451, 22.76979, 25.89089, 20.96030,...
$ Y1987 <dbl> 21.37679, 25.30217, 22.84096, 25.93414, 20.98025,...
$ Y1988 <dbl> 21.34018, 25.30450, 22.90644, 25.98477, 21.01375,...
$ Y1989 <dbl> 21.29845, 25.31944, 22.97931, 26.04450, 21.05269,...
$ Y1990 <dbl> 21.24818, 25.32357, 23.04600, 26.10936, 21.09007,...
$ Y1991 <dbl> 21.20269, 25.28452, 23.11333, 26.17912, 21.12136,...
$ Y1992 <dbl> 21.14238, 25.23077, 23.18776, 26.24017, 21.14987,...
$ Y1993 <dbl> 21.06376, 25.21192, 23.25764, 26.30356, 21.13938,...
$ Y1994 <dbl> 20.97987, 25.22115, 23.32273, 26.36793, 21.14186,...
$ Y1995 <dbl> 20.91132, 25.25874, 23.39526, 26.43569, 21.16022,...
$ Y1996 <dbl> 20.85155, 25.31097, 23.46811, 26.50769, 21.19076,...
$ Y1997 <dbl> 20.81307, 25.33988, 23.54160, 26.58255, 21.22621,...
$ Y1998 <dbl> 20.78591, 25.39116, 23.61592, 26.66337, 21.27082,...
$ Y1999 <dbl> 20.75469, 25.46555, 23.69486, 26.75078, 21.31954,...
$ Y2000 <dbl> 20.69521, 25.55835, 23.77659, 26.83179, 21.37480,...
$ Y2001 <dbl> 20.62643, 25.66701, 23.86256, 26.92373, 21.43664,...
$ Y2002 <dbl> 20.59848, 25.77167, 23.95294, 27.02525, 21.51765,...
$ Y2003 <dbl> 20.58706, 25.87274, 24.05243, 27.12481, 21.59924,...
$ Y2004 <dbl> 20.57759, 25.98136, 24.15957, 27.23107, 21.69218,...
$ Y2005 <dbl> 20.58084, 26.08939, 24.27001, 27.32827, 21.80564,...
$ Y2006 <dbl> 20.58749, 26.20867, 24.38270, 27.43588, 21.93881,...
$ Y2007 <dbl> 20.60246, 26.32753, 24.48846, 27.53363, 22.08962,...
$ Y2008 <dbl> 20.62058, 26.44657, 24.59620, 27.63048, 22.25083,...
# View a summary of bmi
summary(bmi)
Country Y1980 Y1981 Y1982
Length:199 Min. :19.01 Min. :19.04 Min. :19.07
Class :character 1st Qu.:21.27 1st Qu.:21.31 1st Qu.:21.36
Mode :character Median :23.31 Median :23.39 Median :23.46
Mean :23.15 Mean :23.21 Mean :23.26
3rd Qu.:24.82 3rd Qu.:24.89 3rd Qu.:24.94
Max. :28.12 Max. :28.36 Max. :28.58
Y1983 Y1984 Y1985 Y1986
Min. :19.10 Min. :19.13 Min. :19.16 Min. :19.20
1st Qu.:21.42 1st Qu.:21.45 1st Qu.:21.47 1st Qu.:21.49
Median :23.57 Median :23.64 Median :23.73 Median :23.82
Mean :23.32 Mean :23.37 Mean :23.42 Mean :23.48
3rd Qu.:25.02 3rd Qu.:25.06 3rd Qu.:25.11 3rd Qu.:25.20
Max. :28.82 Max. :29.05 Max. :29.28 Max. :29.52
Y1987 Y1988 Y1989 Y1990
Min. :19.23 Min. :19.27 Min. :19.31 Min. :19.35
1st Qu.:21.50 1st Qu.:21.52 1st Qu.:21.55 1st Qu.:21.57
Median :23.87 Median :23.93 Median :24.03 Median :24.14
Mean :23.53 Mean :23.59 Mean :23.65 Mean :23.71
3rd Qu.:25.27 3rd Qu.:25.34 3rd Qu.:25.37 3rd Qu.:25.39
Max. :29.75 Max. :29.98 Max. :30.20 Max. :30.42
Y1991 Y1992 Y1993 Y1994
Min. :19.40 Min. :19.45 Min. :19.51 Min. :19.59
1st Qu.:21.60 1st Qu.:21.65 1st Qu.:21.74 1st Qu.:21.76
Median :24.20 Median :24.19 Median :24.27 Median :24.36
Mean :23.76 Mean :23.82 Mean :23.88 Mean :23.94
3rd Qu.:25.42 3rd Qu.:25.48 3rd Qu.:25.54 3rd Qu.:25.62
Max. :30.64 Max. :30.85 Max. :31.04 Max. :31.23
Y1995 Y1996 Y1997 Y1998
Min. :19.67 Min. :19.71 Min. :19.74 Min. :19.77
1st Qu.:21.83 1st Qu.:21.89 1st Qu.:21.94 1st Qu.:22.00
Median :24.41 Median :24.42 Median :24.50 Median :24.49
Mean :24.00 Mean :24.07 Mean :24.14 Mean :24.21
3rd Qu.:25.70 3rd Qu.:25.78 3rd Qu.:25.85 3rd Qu.:25.94
Max. :31.41 Max. :31.59 Max. :31.77 Max. :31.95
Y1999 Y2000 Y2001 Y2002
Min. :19.80 Min. :19.83 Min. :19.86 Min. :19.84
1st Qu.:22.04 1st Qu.:22.12 1st Qu.:22.22 1st Qu.:22.29
Median :24.61 Median :24.66 Median :24.73 Median :24.81
Mean :24.29 Mean :24.36 Mean :24.44 Mean :24.52
3rd Qu.:26.01 3rd Qu.:26.09 3rd Qu.:26.19 3rd Qu.:26.30
Max. :32.13 Max. :32.32 Max. :32.51 Max. :32.70
Y2003 Y2004 Y2005 Y2006
Min. :19.81 Min. :19.79 Min. :19.79 Min. :19.80
1st Qu.:22.37 1st Qu.:22.45 1st Qu.:22.54 1st Qu.:22.63
Median :24.89 Median :25.00 Median :25.11 Median :25.24
Mean :24.61 Mean :24.70 Mean :24.79 Mean :24.89
3rd Qu.:26.38 3rd Qu.:26.47 3rd Qu.:26.53 3rd Qu.:26.59
Max. :32.90 Max. :33.10 Max. :33.30 Max. :33.49
Y2007 Y2008
Min. :19.83 Min. :19.87
1st Qu.:22.73 1st Qu.:22.83
Median :25.36 Median :25.50
Mean :24.99 Mean :25.10
3rd Qu.:26.66 3rd Qu.:26.82
Max. :33.69 Max. :33.90
You can look at all the summaries you want, but at the end of the day, there is no substitute for looking at your data – either in raw table form or by plotting it.
The most basic way to look at your data in R is by printing it to the console. As you may know from experience, the print() command is not even necessary; you can just type the name of the object. The downside to this option is that, for an ordinary data frame, R will attempt to print the entire dataset, which can be a nuisance if the dataset is large. (Tibbles such as bmi are friendlier: by default they print only their first 10 rows and the columns that fit on screen.)
One way around this is to use the head() and tail() commands, which only display the first and last 6 rows of data, respectively. You can view more (or fewer) rows by providing as a second argument to the function the number of rows you wish to view. These functions provide a useful method for quickly getting a sense of your data without overly cluttering the console.
# View the first 6 rows
head(bmi)
# View the first 15 rows
head(bmi, 15)
# View the last 6 rows
tail(bmi)
# View the last 10 rows
tail(bmi, 10)
There are many ways to visualize data.
A histogram, created with the hist() function, takes a vector (i.e. column) of data, breaks it up into intervals, then plots as a vertical bar the number of instances within each interval. A scatter plot, created with the plot() function, takes two vectors (i.e. columns) of data and plots them as a series of (x, y) coordinates on a two-dimensional plane.
# Histogram of BMIs from 2008
hist(bmi$Y2008)
# Scatter plot comparing BMIs from 1980 to those from 2008
plot(x = bmi$Y1980, y = bmi$Y2008)
The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
The easiest way to visualize the effect of gather() is that it makes wide datasets long.
gather(wide_df, my_key, my_val, -col)
# Apply gather() to bmi and save the result as bmi_long
bmi_long <- gather(bmi, year, bmi_val, -Country)
# View the first 20 rows of the result
head(bmi_long, 20)
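Since the result of head(bmi_long, 20) is not shown here, the following minimal sketch illustrates the wide-to-long reshaping on a tiny data frame. The object wide_df and its values are invented for illustration; only the gather() call mirrors the one used on bmi. The last call uses pivot_longer(), which tidyr 1.0.0 and later offer as a newer alternative to gather().

```r
library(tidyr)

# Hypothetical toy data: one row per country, one column per year
wide_df <- data.frame(
  Country = c("A", "B"),
  Y1980 = c(21.5, 25.2),
  Y1981 = c(21.4, 25.3),
  stringsAsFactors = FALSE
)

# Collapse every column except Country into key-value pairs:
# the old column names go into `year`, the cell values into `bmi_val`
long_df <- gather(wide_df, year, bmi_val, -Country)
long_df

# Same reshaping with the newer interface (row order may differ)
long_df2 <- pivot_longer(wide_df, -Country, names_to = "year", values_to = "bmi_val")
long_df2
```

Two countries times two year columns become four rows, each holding one country, one year, and one value.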
The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.
The easiest way to visualize the effect of spread() is that it makes long datasets wide.
spread(long_df, my_key, my_val)
# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)
# View the head of bmi_wide
head(bmi_wide)
The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.
separate(treatments, year_mo, c("year", "month"))
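The treatments data frame in the call above is not defined in these notes, so here is a minimal sketch with a made-up stand-in (the column names patient and year_mo and all values are assumptions). It shows the default behaviour, which splits on any non-alphanumeric character, next to an explicit sep.

```r
library(tidyr)

# Hypothetical stand-in for `treatments`
treatments <- data.frame(
  patient = c("X", "Y"),
  year_mo = c("2010-10", "2012-08"),
  stringsAsFactors = FALSE
)

# Default: split on any run of non-alphanumeric characters
sep_default <- separate(treatments, year_mo, c("year", "month"))

# Explicit separator: same result here, but safer when the values
# could contain other punctuation
sep_dash <- separate(treatments, year_mo, c("year", "month"), sep = "-")

sep_default
```

Note that the new columns are character vectors, so "08" keeps its leading zero.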
Country_ISO <- read_csv("Country_ISO.csv")
Parsed with column specification:
cols(
Country_ISO = col_character()
)
9 parsing failures.
row col expected actual file
41 -- 1 columns 2 columns 'Country_ISO.csv'
42 -- 1 columns 2 columns 'Country_ISO.csv'
79 -- 1 columns 2 columns 'Country_ISO.csv'
95 -- 1 columns 2 columns 'Country_ISO.csv'
96 -- 1 columns 2 columns 'Country_ISO.csv'
... ... ......... ......... .................
See problems(...) for more details.
# Apply separate() to Country_ISO, splitting on "/"
Country_ISO_clean <- separate(Country_ISO, col = Country_ISO, into = c("Country", "ISO"), sep = "/")
Expected 2 pieces. Missing pieces filled with `NA` in 9 rows [41, 42, 79, 95, 96, 107, 108, 119, 197].
# Print the head of the result
head(Country_ISO_clean)
The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.
unite(treatments, year_mo, year, month)
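As a minimal sketch of unite(), assuming a made-up treatments data frame with separate year and month columns (the data are hypothetical, not taken from these notes):

```r
library(tidyr)

# Hypothetical stand-in for `treatments`
treatments <- data.frame(
  patient = c("X", "Y"),
  year = c("2010", "2012"),
  month = c("10", "08"),
  stringsAsFactors = FALSE
)

# Default: the pieces are joined with an underscore
united <- unite(treatments, year_mo, year, month)

# Custom separator via the sep argument
united_dash <- unite(treatments, year_mo, year, month, sep = "-")

united$year_mo
united_dash$year_mo
```

By default the input columns are dropped from the result and replaced by the new combined column.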
We sometimes come across datasets where column names are actually values of a variable (e.g. months of the year). This is often the case when working with repeated measures data, where measurements are taken on subjects of interest on multiple occasions over time. The gather() function is helpful in these situations.
census <- read_csv("census.csv")
Parsed with column specification:
cols(
YEAR = col_double(),
JAN = col_double(),
FEB = col_double(),
MAR = col_double(),
APR = col_double(),
MAY = col_double(),
JUN = col_double(),
JUL = col_double(),
AUG = col_double(),
SEP = col_double(),
OCT = col_double(),
NOV = col_double(),
DEC = col_double()
)
# View the head of census
head(census)
# Gather the month columns
census2 <- gather(census, month, amount, -YEAR)
# Arrange rows by YEAR using dplyr's arrange
census2_arr <- arrange(census2, YEAR)
# View first 20 rows of census2_arr
head(census2_arr, 20)
Sometimes you’ll run into situations where variables are stored in both rows and columns.
spread(pets, type, num)
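The pets object is not defined in these notes. The sketch below assumes a plausible shape for it (an owner column plus type and num, all invented) to show what "variables stored in both rows and columns" looks like before and after spread().

```r
library(tidyr)

# Hypothetical `pets` data: `type` holds what should be column names,
# `num` holds the matching values
pets <- data.frame(
  owner = c("Ann", "Ann", "Bob", "Bob"),
  type = c("dog", "cat", "dog", "cat"),
  num = c(1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Each distinct value of `type` becomes its own column filled from `num`
pets_wide <- spread(pets, type, num)
pets_wide
```

After spreading, there is one row per owner and one column per pet type.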
It’s also fairly common that you will find two variables stored in a single column of data. These variables may be joined by a separator like a dash, underscore, space, or forward slash.
The separate() function comes in handy in these situations.
separate(census_long3, yr_month, c("year", "month"))
As in other programming languages, R is capable of storing data in many different formats, most of which you’ve probably seen by now.
Loosely speaking, the class() function tells you what type of object you’re working with. (There are subtle differences between the class, type, and mode of an object, which you can inspect with class(), typeof(), and mode(), respectively.)
# Make this evaluate to "character"
class("TRUE")
[1] "character"
# Make this evaluate to "numeric"
class(8484.00)
[1] "numeric"
# Make this evaluate to "integer"
class(99L)
[1] "integer"
# Make this evaluate to "factor"
class(factor("factor"))
[1] "factor"
# Make this evaluate to "logical"
class(FALSE)
[1] "logical"
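To make the difference between class, type, and mode a little more concrete, here is a small base R sketch. The helper describe() is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: report class, type, and mode of one object
describe <- function(x) {
  c(class = class(x), type = typeof(x), mode = mode(x))
}

int_info <- describe(99L)            # integer / integer / numeric
dbl_info <- describe(8484.00)        # numeric / double  / numeric
fct_info <- describe(factor("a"))    # factor  / integer / numeric
chr_info <- describe("TRUE")         # character throughout

rbind(int_info, dbl_info, fct_info, chr_info)
```

The factor row is the one most likely to surprise: a factor has class "factor" but is stored internally as integer codes.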
It is often necessary to change, or coerce, the way that variables in a dataset are stored. This could be because of the way they were read into R (with read.csv(), for example) or perhaps the function you are using to analyze the data requires variables to be coded a certain way.
Coercion is done with the as.*() family of functions. For example, as.logical(1) returns TRUE and as.numeric(TRUE) returns 1.
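A few more coercions in base R, as a minimal sketch. The values are arbitrary; the factor example shows a common pitfall, namely that calling as.numeric() directly on a factor returns its internal codes rather than the labels.

```r
as.logical(1)         # TRUE
as.numeric(TRUE)      # 1
as.character(2016)    # "2016"
as.integer("42")      # 42L

# Strings that cannot be interpreted as numbers become NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))

# Factor pitfall: go through character to recover the original values
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # 1 2 1  (internal level codes)
values <- as.numeric(as.character(f))   # 10 20 10
```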
The lubridate package provides parsing functions whose names combine the letters y, m, d, h, m, s, which stand for year, month, day, hour, minute, and second, respectively. The order of the letters in the function name should match the order of the date/time you are attempting to read in, although not all combinations are valid. Notice that the functions are “smart” in that they are capable of parsing multiple formats.
# Load the lubridate package
library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
# Experiment with basic lubridate functions
ymd("2015-08-25")
[1] "2015-08-25"
ymd("2015 August 25")
[1] "2015-08-25"
mdy("August 25, 2015")
[1] "2015-08-25"
hms("13:33:09")
[1] "13H 33M 9S"
ymd_hms("2015/08/25 13.33.09")
[1] "2015-08-25 13:33:09 UTC"
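The letter order matters because the same string can mean different dates. As a rough sketch (the input strings are made up), dmy() reads day first, the accessor functions pull components back out, and a string that cannot be a valid date yields NA with a warning rather than an error:

```r
library(lubridate)

# Day-month-year input needs dmy(), not ymd() or mdy()
d <- dmy("25/08/2015")
d

# Extract components from the parsed date
year(d)
month(d)
day(d)

# An impossible date fails to parse and comes back as NA
bad <- suppressWarnings(ymd("2015-13-45"))
bad
```

Checking for NA after parsing is a quick way to find date strings that do not match the format you assumed.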
One common issue that comes up when cleaning data is the need to remove leading and/or trailing white space. The str_trim() function from stringr makes it easy to do this while leaving intact the part of the string that you actually want.
A similar issue is when you need to pad strings to make them a certain number of characters wide. One example is if you had a bunch of employee ID numbers, some of which begin with one or more zeros. When reading these data in, you find that the leading zeros have been dropped somewhere along the way (probably because the variable was thought to be numeric, in which case leading zeros would be unnecessary). The str_pad() function from stringr can restore them.
# Load the stringr package
library(stringr)
# Trim all leading and trailing whitespace
str_trim(c(" Filip ", "Nick ", " Jonathan"))
[1] "Filip" "Nick" "Jonathan"
# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width=9, side = "left", pad ="0")
[1] "00023485W" "08823453Q" "00000994Z"
In addition to trimming and padding strings, you may need to adjust their case from time to time. Making strings uppercase or lowercase is very straightforward in (base) R thanks to toupper() and tolower(). Each function takes exactly one argument: the character string (or vector/column of strings) to be converted to the desired case.
state <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga", "hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj","nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx", "ut", "vt", "va", "wa", "wv", "wi", "wy")
# Print state abbreviations
state
[1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il"
[14] "in" "ia" "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt"
[27] "ne" "nv" "nh" "nj" "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri"
[40] "sc" "sd" "tn" "tx" "ut" "vt" "va" "wa" "wv" "wi" "wy"
# Make states all uppercase
toupper(state)
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL"
[14] "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT"
[27] "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI"
[40] "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
# lowercase again
tolower(state)
[1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il"
[14] "in" "ia" "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt"
[27] "ne" "nv" "nh" "nj" "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri"
[40] "sc" "sd" "tn" "tx" "ut" "vt" "va" "wa" "wv" "wi" "wy"
The stringr package provides two functions that are very useful for finding and/or replacing patterns in strings: str_detect() and str_replace().
As with all functions in stringr, the first argument of each is the string of interest. The second argument of each is the pattern of interest. In the case of str_detect(), this is the pattern we are searching for. In the case of str_replace(), this is the pattern we want to replace. Finally, str_replace() has a third argument, which is the string to replace with.
str_detect(c("banana", "kiwi"), "a")
[1] TRUE FALSE
str_replace(c("banana", "kiwi"), "a", "o")
[1] "bonana" "kiwi"
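Notice that "banana" became "bonana": str_replace() only replaces the first match in each string. If you want every match replaced, stringr also provides str_replace_all(). A minimal sketch using the same toy vector:

```r
library(stringr)

fruit <- c("banana", "kiwi")

# Replaces only the first "a" in each element
first_only <- str_replace(fruit, "a", "o")

# Replaces every "a" in each element
all_matches <- str_replace_all(fruit, "a", "o")

first_only
all_matches
```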
Missing values in R should be represented by NA, but unfortunately you will not always be so lucky. Before you can deal with missing values, you have to find them in the data.
If missing values are properly coded as NA, the is.na() function will help you find them. Otherwise, if your dataset is too big to just look at the whole thing, you may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.
name <- c("Sarah", "Tom", "David", "Alice")
n_friends <- c(244, NA, 145, 43)
status <- c("Going out!", "","Movie night..." , "")
social_df <- data.frame(name, n_friends, status)
social_df
# Call is.na() on the full social_df to spot all NAs
is.na(social_df)
name n_friends status
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
# Use the any() function to ask whether there are any NAs in the data
any(is.na(social_df))
[1] TRUE
# View a summary() of the dataset
summary(social_df)
name n_friends status
Alice:1 Min. : 43.0 :2
David:1 1st Qu.: 94.0 Going out! :1
Sarah:1 Median :145.0 Movie night...:1
Tom :1 Mean :144.0
3rd Qu.:194.5
Max. :244.0
NA's :1
# Call table() on the status column
table(social_df$status)
                   Going out! Movie night... 
             2              1              1 
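When missing values hide behind several different markers, one possible approach is to keep the "usual suspects" in a vector and test against it with %in%. The sketch below is base R only; the status vector and the list of markers in suspects are assumptions that you would adapt to your own data.

```r
# Hypothetical column with several disguised missing values
status <- c("Going out!", "", "#N/A", "Movie night...", "N/A", NA)

# Assumed list of markers that should count as missing
suspects <- c("", "#N/A", "N/A")

# TRUE where the value is a real NA or one of the markers
is_missing <- is.na(status) | status %in% suspects
n_missing <- sum(is_missing)

# Normalize every marker to a proper NA
status[status %in% suspects] <- NA
status
```

Once everything is coded as NA, functions such as is.na(), complete.cases(), and na.omit() behave as expected.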
Missing values can be a rather complex subject, but here we’ll only look at the simple case where you are simply interested in normalizing and/or removing all missing values from your data. For more information on why this is not always the best strategy, search online for “missing not at random.”
# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA
# Print social_df to the console
social_df
# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)
[1] TRUE FALSE TRUE FALSE
# Use na.omit() to remove all rows with any missing values
na.omit(social_df)
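Before dropping rows, it can help to see how much data you would lose. The sketch below rebuilds social_df as it looks after the empty strings were replaced with NA (the values come from the example above, with stringsAsFactors = FALSE added as a simplifying assumption) and counts missing values per column.

```r
social_df <- data.frame(
  name = c("Sarah", "Tom", "David", "Alice"),
  n_friends = c(244, NA, 145, 43),
  status = c("Going out!", NA, "Movie night...", NA),
  stringsAsFactors = FALSE
)

# Number of NAs in each column
na_per_col <- colSums(is.na(social_df))
na_per_col

# Share of rows that would survive na.omit()
share_complete <- mean(complete.cases(social_df))
share_complete

# Rows that remain after removing incomplete cases
cleaned <- na.omit(social_df)
nrow(cleaned)
```

Here half of the rows are lost, which is a reminder that removing incomplete cases can be costly on small datasets.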
When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.
df2 <- data.frame(A = rnorm(100, 50, 10),
B = c(rnorm(99, 50, 10), 500),
C = c(rnorm(97, 50, 10), 0, 0, -1))
# Look at a summary() of df2
summary(df2)
A B C
Min. :24.91 Min. : 19.55 Min. :-1.00
1st Qu.:43.41 1st Qu.: 41.52 1st Qu.:41.42
Median :51.91 Median : 49.98 Median :47.04
Mean :50.96 Mean : 53.44 Mean :46.60
3rd Qu.:58.64 3rd Qu.: 57.19 3rd Qu.:55.68
Max. :73.29 Max. :500.00 Max. :68.07
# View a histogram of column A (no planted outliers)
hist(df2$A)
# View a histogram of column C (contains 0, 0 and -1)
hist(df2$C)
# View a histogram of column C, but force zeros to be bucketed to the right of zero
hist(df2$C, right = FALSE)
Another useful way of looking at strange values is with boxplots. Simply put, boxplots draw a box around the middle 50% of values for a given variable, with a bolded horizontal line drawn at the median. Values that fall far from the bulk of the data points (i.e. outliers) are denoted by open circles. (If you’re curious about the exact formula for determining what is “far”, check out ?boxplot.stats.)
# View a boxplot of column A
boxplot(df2$A)
# View a boxplot of column C
boxplot(df2$C)
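To see which points a boxplot would draw as open circles without reading them off the plot, you can call boxplot.stats() directly. By default, a point is flagged when it lies more than 1.5 times the box length beyond the box. The vector x below is a small made-up example chosen so the arithmetic is easy to follow.

```r
# Hypothetical data with one obviously extreme value
x <- c(48, 49, 50, 51, 52, 100)

stats <- boxplot.stats(x)   # default coef = 1.5

# Five-number summary used for the box and whiskers:
# lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points beyond the whiskers, i.e. the ones drawn as open circles
outliers <- stats$out
outliers
```

Here the hinges are 49 and 52, so the box length is 3 and anything above 52 + 1.5 * 3 = 56.5 is flagged, which singles out the value 100.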