Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:
The following provides an overview of techniques we’ve learned, including links to the original session.
Objects in R contain single values, multiple values (vectors), and tabular data (data frames).
The Assignment Operator, <-, names and stores one or more values, functions, or data structures.
my_value <- 5 # Store a single value
my_vector <- c(5, 10, 15) # Vectors: Concatenated values
my_dataframe <- data.frame(x = c(1, 2, 3),
y = c("a", "b", "c"),
z = c(TRUE, TRUE, FALSE)) # Data Frames: Tabular structures
Print objects by simply entering the object name or explicitly using the function print().
my_value # Autoprints using only the object name
## [1] 5
print(my_vector) # Explicitly prints with function print()
## [1] 5 10 15
Built-In Objects already exist in R, such as letters, which contains all lowercase letters, or mtcars, a dataset on cars from the 1974 Motor Trend magazine.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Original Session: Intro to R: Operators
Arithmetic Operators in R perform addition, subtraction, multiplication, division, and exponentiation, while parentheses control operator precedence.
Class Numeric data are required.
(5^2 * 4) / 2
## [1] 50
Relational Operators in R are used in relational statements that compare one or a series of values, e.g. <, >, ==, !=.
Class Logical result from relational statements, i.e. TRUE or FALSE.
10 < c(8, 9, 11, 12)
## [1] FALSE FALSE TRUE TRUE
Logical Operators bind multiple relational statements.
OR, i.e. |, requires at least one statement to be TRUE.
5 > 1 | 10 < 5
## [1] TRUE
AND, i.e. &, requires all statements to be TRUE.
5 > 1 & 10 < 5
## [1] FALSE
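Both operators work element by element on vectors, and & is evaluated before | unless parentheses say otherwise. The following is a minimal sketch of both behaviors; the vector scores and its values are made up for illustration.

```r
# Hypothetical values for illustration only
scores <- c(2, 7, 12)

# Element-wise AND: which scores fall strictly between 5 and 10?
in_range <- scores > 5 & scores < 10

# & binds more tightly than |, so parentheses can change the result
no_parens   <- TRUE | FALSE & FALSE      # Evaluated as TRUE | (FALSE & FALSE)
with_parens <- (TRUE | FALSE) & FALSE    # Parentheses force | to run first

print(in_range)
print(no_parens)
print(with_parens)
```

When in doubt about evaluation order, adding parentheses costs nothing and makes the intent explicit.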
Original Session: Intro to R: Operators
The Dollar Sign Operator, i.e. $, subsets or extracts a specific variable from a dataset.
mtcars$mpg # Combine the dataset name and variable name to subset the variable
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Indexing a subset variable is done with brackets, [ & ], and the number or numbers of the element(s) by position.
mtcars$mpg[5] # Combine the dataset, variable, and position to extract a specific value
## [1] 18.7
Index by Row & Column Position using the row number and column number in brackets, separated by a comma, ,.
mtcars[25, 1]
## [1] 19.2
Index by Name using the row name and column name in the same manner.
mtcars["Pontiac Firebird", "mpg"]
## [1] 19.2
Index Multiple Positions by concatenating more than one position number using function c().
mtcars["Honda Civic", c(1, 2, 4, 6)]
## mpg cyl hp wt
## Honda Civic 30.4 4 52 1.615
Subset All Rows or All Columns by leaving the position empty within the brackets.
mtcars[1:5, ] # Subset rows 1-5 and all columns
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
mtcars[, c(1, 2)] # Subset columns 1-2 and all rows
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
## Duster 360 14.3 8
## Merc 240D 24.4 4
## Merc 230 22.8 4
## Merc 280 19.2 6
## Merc 280C 17.8 6
## Merc 450SE 16.4 8
## Merc 450SL 17.3 8
## Merc 450SLC 15.2 8
## Cadillac Fleetwood 10.4 8
## Lincoln Continental 10.4 8
## Chrysler Imperial 14.7 8
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Toyota Corona 21.5 4
## Dodge Challenger 15.5 8
## AMC Javelin 15.2 8
## Camaro Z28 13.3 8
## Pontiac Firebird 19.2 8
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
## Ford Pantera L 15.8 8
## Ferrari Dino 19.7 6
## Maserati Bora 15.0 8
## Volvo 142E 21.4 4
Filter with Relational Operators by placing a relational statement in the row position, in brackets.
mtcars[mtcars$mpg < 15, ] # Subset only cars with less than 15 mpg
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
Assign Subset Data to New Objects using the assignment operator, <-, an object name, and the subset data.
gas_guzzlers <- mtcars[mtcars$mpg < 15, ]
Save Objects to Index Data using the assignment operator, <- and one or more relational statements.
index <- mtcars$cyl == 8 & mtcars$hp > 240 # Store logical values: TRUE or FALSE
dream_cars <- mtcars[index, ] # Use the indexing object in the row position
print(dream_cars) # Print results
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
## Ford Pantera L 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8
Original Session: Intro to R: Subsets & Indices
Classes of both variables and single values dictate how R will recognize and work with them.
Identify Class by using the class() function and inputting either one or more values or an object.
class(10L) # Call class() on a single value; here, "L" indicates an integer
## [1] "integer"
class(c(TRUE, FALSE)) # Call class() on multiple values, e.g. "logical" values
## [1] "logical"
class(mtcars) # Call class() on an object with stored data to determine structure
## [1] "data.frame"
class(mtcars$mpg) # Call class() on a subset variable for the class of its values
## [1] "numeric"
Numeric data include any quantitative data, including:
- numeric, an all-encompassing term for quantitative data
- integer, or values composed of whole numbers
- double, or values with floating decimals
Logical data contain logical values, e.g. TRUE or FALSE.
Under the hood, logical data are represented by 1 and 0.
TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE
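Because TRUE and FALSE behave like 1 and 0, arithmetic functions can count or summarize the results of a relational statement. A small sketch using the built-in mtcars data follows; the object names are only illustrative.

```r
data(mtcars)                         # Load a fresh copy of the built-in dataset

efficient <- mtcars$mpg > 25         # Logical vector: TRUE where mpg exceeds 25

n_efficient     <- sum(efficient)    # Adds up the TRUE values, i.e. counts them
share_efficient <- mean(efficient)   # Proportion of values that are TRUE

print(n_efficient)
print(share_efficient)
```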
Character data contain uncategorized text, e.g. “Onondaga County”.
my_county <- "Onondaga County"
class(my_county)
## [1] "character"
Factor data represent categorical data where each category is a “level”, e.g. gender, race, or census tract.
cylinders <- factor(mtcars$cyl) # Create factors using the factor() function
class(cylinders)
## [1] "factor"
levels(cylinders) # Function levels() prints each category in a factor
## [1] "4" "6" "8"
Coercion is the act of converting values and objects to new classes, usually with an as.*() function, e.g. as.character().
class(mtcars$mpg) # Print the class of variable "mpg"
## [1] "numeric"
mtcars$mpg <- as.character(mtcars$mpg) # Coerce the class from "numeric" to "character"
class(mtcars$mpg) # Re-print the class to confirm changes
## [1] "character"
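Coercion does not always succeed quietly. The sketch below, built on made-up values, shows two common surprises: text that cannot be read as a number becomes NA, and a factor coerced directly to numeric returns its internal codes rather than its labels.

```r
# Hypothetical scraped values; "N/A" cannot be read as a number
raw <- c("10", "20", "N/A")

# Non-numeric text becomes NA (R also issues a warning, suppressed here)
clean <- suppressWarnings(as.numeric(raw))

# Factors store integer codes, so coerce to character before numeric
cyl_factor <- factor(c("4", "6", "8"))
codes  <- as.numeric(cyl_factor)                  # Internal codes
values <- as.numeric(as.character(cyl_factor))    # Original labels as numbers

print(clean)
print(codes)
print(values)
```

Checking the result with class() or summary() after every coercion is a cheap way to catch both problems.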
The Purpose of Coercion is to make R treat your values in the manner you intend.
Function Overloading is the quality in R which allows functions to behave differently depending on object class.
class(mtcars$cyl) # Determine class for variable "cyl", or number of cylinders
## [1] "numeric"
summary(mtcars$cyl) # Print descriptive statistics for numeric data with summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 4.000 6.000 6.188 8.000 8.000
mtcars$cyl <- as.character(mtcars$cyl) # Coerce to class "character" with as.character()
summary(mtcars$cyl) # Function summary() now prints the number of elements
## Length Class Mode
## 32 character character
mtcars$cyl <- as.factor(mtcars$cyl) # Coerce to class "factor" with as.factor()
summary(mtcars$cyl) # Prints each "level" (category) and frequency of each
## 4 6 8
## 11 7 14
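In R, this behavior comes from generic functions that hand the work to a class-specific method. The sketch below imitates the mechanism with a generic named describe(), which is hypothetical and written only for this illustration; it is not a base R function.

```r
# describe() is a made-up generic for this sketch, not part of base R
describe <- function(x) UseMethod("describe")

# One method per class; R picks the method that matches class(x)
describe.numeric <- function(x) paste("Numeric with mean", mean(x))
describe.factor  <- function(x) paste("Factor with", nlevels(x), "levels")
describe.default <- function(x) paste("Object of class", class(x)[1])

describe(c(2, 4, 6))                  # Dispatches to describe.numeric()
describe(factor(c("a", "b", "a")))    # Dispatches to describe.factor()
describe("some text")                 # Falls back to describe.default()
```

Functions like summary() and plot() are generics of this kind, which is why the same call can produce different output for different classes.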
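Identify All Classes in a Dataset by using the function str(), or “structure”, which prints the number of observations and variables, along with the name, class, and first few values of each variable: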
str(iris) # Print the structure of the "iris" dataset, or 150 measures of iris species
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
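When only the classes are needed, for example to find variables that need coercion, one option is to apply class() to every variable. This is a small sketch using sapply(), which is not covered above; iris_classes is an illustrative name.

```r
# Collect the class of every variable in a named character vector
iris_classes <- sapply(iris, class)
print(iris_classes)

# Names of the variables that are not numeric
non_numeric <- names(iris_classes)[iris_classes != "numeric"]
print(non_numeric)
```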
Coercion in Data Visualization is also very important. Observe the following. What do you notice about the x-axis?
data(mtcars)
plot(x = mtcars$cyl,
y = mtcars$mpg,
col = "tomato",
xlab = "Number of Cylinders",
ylab = "Miles per Gallon",
main = "Cylinders vs. MPG")
Note: R identifies two continuous variables and makes a scatterplot, assuming 5- and 7-cylinder engines are missing.
Prevent Categorical Variables from Appearing Continuous by coercing “numeric” variables to class “factor”.
data(mtcars)
plot(x = as.factor(mtcars$cyl), # The only change is nesting the variable in as.factor()
y = mtcars$mpg,
col = "tomato",
xlab = "Number of Cylinders",
ylab = "Miles per Gallon",
main = "Cylinders vs. MPG")
Function Overloading occurs as function plot() now acknowledges the “factor”, creating a box plot.
Coercion in Regression is even more important. Using lm(), we’ll try to create a linear model with the original mtcars.
data(mtcars)
my_lm <- lm(mpg ~ cyl,
data = mtcars)
print(my_lm)
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
Incorrect Interpretation: Per the coefficients, every unit of cyl added reduces mpg by 2.88. This is absurd: cyl only takes the values 4, 6, and 8, so it is a category rather than a continuous measure.
data(mtcars)
my_lm <- lm(mpg ~ as.factor(cyl),
data = mtcars)
print(my_lm)
##
## Call:
## lm(formula = mpg ~ as.factor(cyl), data = mtcars)
##
## Coefficients:
## (Intercept) as.factor(cyl)6 as.factor(cyl)8
## 26.664 -6.921 -11.564
Correct Interpretation: Relative to the 4-cylinder baseline (the intercept, 26.66 mpg), 6-cylinder engines average 6.92 mpg less, while 8-cylinder engines average 11.56 mpg less.
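One way to sanity-check this reading is to compare the coefficients with simple group means: with a single factor predictor, the intercept is the mean of the baseline group and each coefficient is the difference from it. The sketch below assumes the original mtcars data; the object names are illustrative.

```r
data(mtcars)                                          # Restore the original dataset

# Mean mpg for each number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Differences from the 4-cylinder baseline
differences <- group_means - group_means["4"]

print(round(group_means, 3))
print(round(differences, 3))
```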
Coercing Numeric Data is equally important, as demonstrated in the following scenario.
Scenario: Your colleague has written a PDF scraper to extract key Form 990 data, seen in dataset form_990:
form_990 <- data.frame("FY_2017" = c("764882", "240739", "49212"),
"FY_2018" = c("841912", "263997", "41315"),
stringsAsFactors = FALSE,
row.names = c("Programming Expenses",
"Administrative Expenses",
"Fund Development Expenses"))
print(form_990)
## FY_2017 FY_2018
## Programming Expenses 764882 841912
## Administrative Expenses 240739 263997
## Fund Development Expenses 49212 41315
Practice: Find the sum total of all expenses in fiscal years 2017 and 2018.
1. Check the class of each variable using class().
2. Coerce each variable in form_990 using the $ operator.
3. Use sum() to find the total of each fiscal year.
4. Use sum() again on the totals.
Conclusions: Identifying variable classes is a crucial first step in exploratory data analysis. As demonstrated above, failing to identify and coerce classes can be fatal to the accuracy of your analyses and visualizations. We’ve only looked at coercion with “numeric” and “factor” classes, but for nearly every data class (and there are many more), there is a way to coerce it to a more appropriate and actionable class.
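For reference, one possible approach to the Form 990 practice is sketched below. It recreates the dataset without row names for brevity, and the names total_2017, total_2018, and grand_total are illustrative rather than part of the original exercise.

```r
# Recreate the scraped data, in which every value is stored as text
form_990 <- data.frame("FY_2017" = c("764882", "240739", "49212"),
                       "FY_2018" = c("841912", "263997", "41315"),
                       stringsAsFactors = FALSE)

class(form_990$FY_2017)                       # "character", so sum() would fail

# Coerce each variable to numeric using the $ operator
form_990$FY_2017 <- as.numeric(form_990$FY_2017)
form_990$FY_2018 <- as.numeric(form_990$FY_2018)

total_2017  <- sum(form_990$FY_2017)          # Total for fiscal year 2017
total_2018  <- sum(form_990$FY_2018)          # Total for fiscal year 2018
grand_total <- sum(total_2017, total_2018)    # Total across both years

print(c(total_2017, total_2018, grand_total))
```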
Learn More about factor() and as.factor() by calling help(factor) and help(as.factor) within R. In addition, I highly recommend exploring the fourth module in DataCamp’s free Introduction to R.
The following provides an overview of base R functions for data of class “character”. Run the following in R.
url <- "https://tinyurl.com/y9xuc5pa"
construct <- read.csv(file = url, stringsAsFactors = FALSE); rm(url)
These are the records of Quality Structures, Inc., the largest of multiple contractors working on Syracuse International Airport’s 2018 renovations. They were retrieved via a Freedom of Information Act (FOIA) request.
Read the documentation here: REIS GitHub Repository.
Overview: Data of class “character” are often easily distinguishable by their quotation marks, e.g. "this".
Any value you write or store in quotation marks is automatically stored as class “character”. Observe:
my_word <- "perspicacity" # Quotes guarantee that value will be stored as class "character"
class(my_word)
## [1] "character"
print(my_word)
## [1] "perspicacity"
String Manipulation is the act of manipulating text data, most often referred to as strings.
We can think of “strings” as a sequence of characters, which may be alphabetical or numeric.
Pasting is the act of combining multiple strings to form a longer or more complex string, performed with paste().
x <- "I'm"
y <- "learning"
z <- "R!"
paste(x, y, z, sep = " ") # Argument "sep =" specifies the character between pasted strings
## [1] "I'm learning R!"
Notice that we’ve pasted together objects, but you can just as easily input the strings by hand:
paste("I'm", "learning", "R!", sep = " ")
## [1] "I'm learning R!"
The versatility of paste() is often underappreciated at first glance. We could goof off by tampering with sep =:
paste("Millennial:", x, y, z, sep = ", like, ")
## [1] "Millennial:, like, I'm, like, learning, like, R!"
We could do something more useful, like combine names in a character roster. First, let’s create one:
first <- c("Luis", "Cody", "Shannon", "Jamison")
last <- c("Escoboza", "Peck", "Connor", "Crawford")
roster <- data.frame(first, last)
print(roster)
## first last
## 1 Luis Escoboza
## 2 Cody Peck
## 3 Shannon Connor
## 4 Jamison Crawford
Now we can use paste to make a “Surname, First Name” format, like so:
paste(roster$last, roster$first, sep = ", ")
## [1] "Escoboza, Luis" "Peck, Cody" "Connor, Shannon"
## [4] "Crawford, Jamison"
Now we can add it as a new variable using $.
roster$both <- paste(roster$last, roster$first, sep = ", ")
print(roster)
## first last both
## 1 Luis Escoboza Escoboza, Luis
## 2 Cody Peck Peck, Cody
## 3 Shannon Connor Connor, Shannon
## 4 Jamison Crawford Crawford, Jamison
We could also create a sequence of URLs for a web crawler, e.g. adult literacy programs around Dallas, TX:
url <- "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student="
iteration <- as.character(c(1:6))
paste(url, iteration, sep = "")
## [1] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=1"
## [2] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=2"
## [3] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=3"
## [4] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=4"
## [5] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=5"
## [6] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=6"
Conclusions: The National Literacy Directory, which provides the search results for the above pages, is owned by the Dollar General Literacy Foundation - and they absolutely do not want those data shared, despite the tremendous potential it could have in the hands of researchers for ameliorating today’s adult literacy crisis. Fortunately, you can only search within a 25-mile radius, which limits the amount of search options.
Of course, if you just change radius=25 in url to, say, radius=6000, or 1/4 of the circumference of the earth, you’d have every adult education program in the United States, including Hawaii and Alaska. That’s 537 individual pages of search results through which one could sequence, covering 10,730 programs. But you should definitely not do that.
In sum, paste() is extremely useful. Never forget it.
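One detail that often trips people up is the difference between sep =, which goes between the arguments, and collapse =, an argument not used above that joins the resulting elements into one string. A brief sketch with made-up values:

```r
fruits <- c("apples", "pears", "plums")          # Hypothetical values

# sep = goes between the arguments, element by element
by_element <- paste("fresh", fruits, sep = " ")

# collapse = joins the resulting elements into a single string
one_string <- paste(fruits, collapse = ", ")

# paste0() is shorthand for paste(..., sep = "")
item_ids <- paste0("item_", 1:3)

print(by_element)
print(one_string)
print(item_ids)
```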
The following reviews key concepts with which we’ve practiced, emphasizing elements in the first half of the present work.
- Assignment: <-
- Concatenation: c(), which combines values and preserves distinctness
my_object <- c(1, 3, 5)
class(my_object)
## [1] "numeric"
- Arithmetic operators (+, -, *, /)
- Relational operators: <, >, ==, etc.
- Logical operators: | and &
- Logical values: TRUE, FALSE
3 + 3 == 6 & 3 <= 12 / 4
## [1] TRUE
my_object <- 10
my_vector <- c("a", "b", "c")
my_matrix <- matrix(data = 1:4,
nrow = 2,
ncol = 2)
my_dataframe <- data.frame("x" = c(1, 2, 3),
"y" = c("a", "b", "c"))
- Print by object name or with print()
my_vector
## [1] "a" "b" "c"
print(my_dataframe)
## x y
## 1 1 a
## 2 2 b
## 3 3 c
- $ operator, e.g. dataframe_name$variable_name
- Vector position, e.g. vector_name[3]
- Row and column position, e.g. dataframe_name[12, 5]
- Entire row, e.g. df[12, ]
- which()
mtcars$mpg[5]
## [1] 18.7
mtcars[5, "mpg"]
## [1] 18.7
- Filter vectors, e.g. vector[vector > 5]
- Filter rows, e.g. df[df$var1 > 5, ]
- Store filters with <- and comparators, e.g. index <- df$variable < 15
my_filter <- mtcars$mpg > 25
mtcars[my_filter, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
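- class()
- str() on a data frame
- is.*() functions, e.g. is.character()
- Numeric: 5.0 or 55
  - Integer: suffix L, e.g. 5L
  - Double: 5.2
- Logical: TRUE and FALSE
- Character: "Some text."
  - Quotation marks: ""
- Factor: factor()
  - levels()
  - levels = and labels = in factor()
- as.*() functions, e.g. as.numeric()
class(15)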
## [1] "numeric"
is.character("Some text.")
## [1] TRUE
as.logical(0)
## [1] FALSE
as.logical(1)
## [1] TRUE
- Strings, e.g. "This is a string."
- paste()
  - sep = takes a character string with which to separate pasted values
  - paste0() pastes strings with no separator by default
values <- 1:3
suffixes <- c("st", "nd", "rd")
paste("This", "is", "my",
paste0(values, suffixes), "string.",
sep = " ")
## [1] "This is my 1st string." "This is my 2nd string."
## [3] "This is my 3rd string."
Quotation marks in strings are possible by using single quotes: 'Excel is "fine", thanks.'
Strings without quotes are possible using functions:
- noquote() for numbered strings
- writeLines() for bare strings
writeLines("You could make this an error message if you're writing a custom R function.")
## You could make this an error message if you're writing a custom R function.
Convert numeric data to character data using functions:
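- as.character() for simple coercion
- format() for customized formatting via arguments, including:
  - Scientific notation (scientific =)
  - Thousands separators (big.interval = or big.mark =)
  - Alignment (justify =)
- formatC() for C-style formatting via arguments, including:
  - flag = "+" to always print a sign
  - flag = "-" to left-justify
  - flag = "0" to pad with leading zeros
format(x = 00003500,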
big.mark = ",",
drop0trailing = TRUE)
## [1] "3,500"
Further formatting options are available using the “scales” package.
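The arguments listed above can be tried on a few throwaway numbers. The values below are made up, and the sketch only uses format() and formatC() from base R.

```r
with_commas <- format(1234567, big.mark = ",")               # Thousands separators
sci_note    <- format(0.000012, scientific = TRUE)           # Scientific notation
zero_padded <- formatC(42, width = 6, flag = "0")            # Pad with leading zeros
two_decimal <- formatC(3.14159, digits = 2, format = "f")    # Fixed number of decimals
with_sign   <- formatC(5, flag = "+")                        # Always print the sign

print(c(with_commas, sci_note, zero_padded, two_decimal, with_sign))
class(with_commas)                                           # Results are "character"
```

Note that every result is a character string, so format numbers only after all calculations are finished.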
Regular Expressions, in short, are sequences of characters and metacharacters that enable powerful pattern recognition.
For example, the following metacharacters can be used in any pattern = string.
- . indicates “any character”
- * indicates “any number of times” (applied to the preceding character)
- ^ indicates “beginning of string”
- $ indicates “end of string”
- \ indicates “ignore the following character, it’s actually a [insert metacharacter]”
Important Caveat: Since many patterns contain metacharacters, like ., you must keep “regex” in mind.
- Consider pattern = "Census Tract 5.00"
  - Since . is a metacharacter, R interprets this as “5”, any character, and “00”
  - The pattern therefore also matches "Census Tract 5200"
- Escape the metacharacter with \, which must itself be doubled inside an R string
  - pattern = "Census Tract 5\\.00" will help ensure . really means .
Don’t Worry: You won’t memorize “regex” unless you use them every day, but some things stick over time.
Learn More: To learn more about “regex”, I recommend Jenny Bryan’s Stat 545: “Regular Expression in R”.
Even a rudimentary understanding of “regex” is powerful for pattern detection.
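The effect of an unescaped metacharacter is easy to see with base R's grepl(), which returns TRUE wherever a pattern is found. This is a minimal sketch; the vector tracts holds made-up labels.

```r
tracts <- c("Census Tract 5.00", "Census Tract 5200")    # Hypothetical labels

# Unescaped: "." matches any character, so both labels match
unescaped <- grepl(pattern = "Census Tract 5.00", x = tracts)

# Escaped: "\\." matches only a literal period
escaped <- grepl(pattern = "Census Tract 5\\.00", x = tracts)

# fixed = TRUE treats the entire pattern as literal text
literal <- grepl(pattern = "Census Tract 5.00", x = tracts, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```

The backslash is doubled because R first reads the string itself, where a single backslash starts an escape sequence, before passing it to the regex engine.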
Overview: The “stringr” package is designed specifically for working with character data:
Unified, Consistent Framework: All “stringr” functions:
- Begin with str_ for easy autocompletion
Installing & Loading: The following installs and loads package “stringr” if undetected:
if(!require("stringr")){install.packages("stringr")}
## Loading required package: stringr
library(stringr)
The “stringr” equivalent of function paste() is str_c(). Advantages over paste() include:
- NA values propagate, so missing data are not silently pasted as the text "NA"
- The default sep = is "", similar to paste0()
paste("The", "quick", "brown", "fox", NA, "over", "the", "lazy", NA)
## [1] "The quick brown fox NA over the lazy NA"
str_c("The", "quick", "brown", "fox", NA, "over", "the", "lazy", NA)
## [1] NA
str_c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog.",
sep = " ")
## [1] "The quick brown fox jumps over the lazy dog."
Function str_length() determines the number of characters in a given string:
str_length("Duffle kerfuffle.")
## [1] 17
The base R equivalent is function nchar().
Function str_sub() extracts a subset of characters determined by beginning and ending positions:
order_records <- c("The order arrived at 03:02 PM EST, 03 December 2018.",
"The order arrived at 12:19 PM EST, 03 December 2018.",
"The order arrived at 09:53 AM EST, 03 December 2018.")
arrivals <- str_sub(string = order_records, # Input string or vector of strings
start = 22, # Indicate position number to begin extraction
end = 33) # Indicate position number to end extraction
data.frame("arrival_time" = arrivals) # Organize extracted data
## arrival_time
## 1 03:02 PM EST
## 2 12:19 PM EST
## 3 09:53 AM EST
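For comparison, base R's substr() extracts by position in much the same way. The sketch below reuses one of the order records and also counts backward from the end of the string with nchar(); the object names are illustrative.

```r
order_record <- "The order arrived at 03:02 PM EST, 03 December 2018."

# substr() takes start and stop positions, much like str_sub()
arrival <- substr(x = order_record, start = 22, stop = 33)

# Count backward from the end with nchar() to extract the year
n    <- nchar(order_record)
year <- substr(x = order_record, start = n - 4, stop = n - 1)

print(arrival)
print(year)
```

Position-based extraction only works when every string has the same layout, as these records do.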
Several “stringr” functions involve pattern recognition, which can be:
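- A plain string, e.g. pattern = "Tract 32"
- Literal text wrapped in fixed(), e.g. pattern = fixed("Tract 32")
- A regular expression, e.g. pattern = ".* 32"
Detect patterns with str_detect(), which returns TRUE for each string that contains the pattern and FALSE otherwise.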
inconsistent_labels <- c("Tract 32",
"census tract 32.00",
"trAct 32",
"ct 32",
"tract 5.00",
"CT 5")
str_detect(string = inconsistent_labels,
pattern = "32") # Only detects values containing "32"
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
str_detect(string = inconsistent_labels,
pattern = "32|5") # Detects values containing either "32" or "5"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Return values containing a specified pattern using str_subset():
str_subset(string = inconsistent_labels,
pattern = "32")
## [1] "Tract 32" "census tract 32.00" "trAct 32"
## [4] "ct 32"
str_subset(string = inconsistent_labels,
pattern = "00")
## [1] "census tract 32.00" "tract 5.00"
Quantify values containing a specified pattern using str_count():
str_count(string = inconsistent_labels,
pattern = "Tract|tract|trAct")
## [1] 1 1 1 0 1 0
str_count(string = inconsistent_labels,
pattern = "0")
## [1] 0 2 0 0 2 0
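Listing every capitalization in a pattern gets tedious. As a sketch of an alternative, the base R functions grepl() and grep() accept ignore.case = TRUE; they are swapped in here for the “stringr” functions so the example runs without extra packages.

```r
inconsistent_labels <- c("Tract 32", "census tract 32.00", "trAct 32",
                         "ct 32", "tract 5.00", "CT 5")

# grepl() mirrors str_detect(); ignore.case = TRUE covers every capitalization
has_tract <- grepl(pattern = "tract", x = inconsistent_labels,
                   ignore.case = TRUE)

# grep() with value = TRUE mirrors str_subset()
tract_labels <- grep(pattern = "tract", x = inconsistent_labels,
                     ignore.case = TRUE, value = TRUE)

print(has_tract)
print(tract_labels)
```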
Split strings into composite parts with str_split() by specifying a pattern on which to split:
fo_review <- "This is complete crap! I've been waiting 20 minutes on the loading screen to connect to the server! After I connect because of desync and replication issues with the server, I get killed! Have to wait another 5 minutes on the loading screen to respawn! I go into a building and get stuck on a pile of trash so I need to fast travel somewhere!"
split_rev <- str_split(string = fo_review,
pattern = "! ", # Split string for every occurrence of "! "
simplify = TRUE) # "FALSE" returns list, "TRUE" returns matrix
data.frame("sentences" = split_rev[, 1:5], stringsAsFactors = FALSE)
## sentences
## 1 This is complete crap
## 2 I've been waiting 20 minutes on the loading screen to connect to the server
## 3 After I connect because of desync and replication issues with the server, I get killed
## 4 Have to wait another 5 minutes on the loading screen to respawn
## 5 I go into a building and get stuck on a pile of trash so I need to fast travel somewhere!
Find & Replace the detected patterns with:
- str_replace() for only the first pattern detected in each string
- str_replace_all() for all patterns detected in each string
print(inconsistent_labels)
## [1] "Tract 32" "census tract 32.00" "trAct 32"
## [4] "ct 32" "tract 5.00" "CT 5"
str_replace_all(string = inconsistent_labels,
pattern = ".00| |[a-zA-Z]*", # Detect ".00", OR " ", OR any/all letters
replacement = "") # Replace with "", or nothing!
## [1] "32" "32" "32" "32" "5" "5"
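The same cleanup can be sketched in base R with gsub(), which replaces every match. Here the period is escaped so that only a literal ".00" is removed, and the cleaned labels are then coerced to numeric; the object names cleaned and tract_numbers are illustrative.

```r
inconsistent_labels <- c("Tract 32", "census tract 32.00", "trAct 32",
                         "ct 32", "tract 5.00", "CT 5")

# Remove a literal ".00", OR a space, OR any letter
cleaned <- gsub(pattern = "\\.00| |[a-zA-Z]",
                replacement = "",
                x = inconsistent_labels)

# The labels are now consistent and can be coerced to numeric
tract_numbers <- as.numeric(cleaned)

print(cleaned)
print(tract_numbers)
```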
Trimming eliminates any extra spaces surrounding characters using str_trim().
- side = indicates which side to trim: "left", "right", or "both"
str_trim(string = " mad whitespace ",
side = "both")
## [1] "mad whitespace"
Padding is the opposite of trimming, where str_pad() allows you to add characters.
- side = indicates which side to pad: "left", "right", or "both"
- width = indicates the total number of characters after padding
- pad = indicates the character with which to pad
Here, we’ll use Syracuse’s Census Tract 61.02. Notably:
str_pad(string = "6102",
width = 6,
side = "left",
pad = "0")
## [1] "006102"
Use in concert with paste(), paste0(), or str_c() to create a full FIPS code!
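As a rough sketch of that idea, the following pads a tract code and pastes it onto state and county codes. The component values are assumptions chosen for illustration (New York as "36", Onondaga County as "067"), and base R's sprintf() is swapped in for str_pad() so the example runs without “stringr”.

```r
# Illustrative components, assumed for this sketch
state  <- "36"      # State code
county <- "067"     # County code
tract  <- "6102"    # Census Tract 61.02 without the decimal

# Pad the tract to six digits with leading zeros
tract_padded <- sprintf("%06d", as.integer(tract))

# Paste the components together with no separator
fips <- paste0(state, county, tract_padded)

print(tract_padded)
print(fips)
```

The result is stored as class “character” on purpose, since a FIPS code stored as a number would lose any leading zeros.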
Instructions: Run the following code to read in Census Geocoder output: geocoded.
library(readr)
url <- "https://tinyurl.com/y92r2qcd"
names <- c("id", "input", "indicator", "type", "output", "coords",
"line_id", "id_side", "state", "county", "tract", "block")
geocoded <- read_csv(file = url,
col_names = names)
geocoded <- geocoded[which(complete.cases(geocoded)), ]
Challenges: Perform the following tasks using the geocoded dataset:
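1. Coerce variables to appropriate classes using an as.*() function.
2. Extract the zip codes from output.
   - Store them in a new variable, zip, initialized with geocoded$zip <- NA
3. In output, detect which rows contain “EAST SYRACUSE, NY”.
   - Store these rows in es_geocoded
4. Create a new variable, fips, and paste variables state, county, and tract.
   - Initialize it in geocoded using geocoded$fips <- NA
5. Trim tract in geocoded to eliminate leading and trailing zeroes.
   - Store the result in a new variable, ct_abbr, by running geocoded$ct_abbr <- ...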