Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:
Thanks to the implementation of literate programming, I’m able to weave machine-readable R expressions into this work while explaining those expressions in human-readable language.
Note that unformatted font, for example, this, that, and the other thing, is used to indicate machine-readable language, even if it’s used in-line, like these examples. It’s a simple and unobtrusive way to differentiate human-readable language from expressions intended for machine reading. Note this particular formatting, or lack thereof, when you see it - typically, it’s used to flag datasets, variables, entire expressions, function and package names, and operators.
So-called code chunks, unlike unformatted font, are much easier to discern. “Code chunks” allow literate programming authors to insert machine-readable code in human-readable text. Behind the scenes, “code chunks” are often executed, without alerting the reader, in order to produce tables, visualizations, interactive tools, and more. In instructional materials, e.g. the present work, code chunks are used for demonstrative purposes, such as how to use a particular function. The following is an example of two “code chunks”, the first of which executes without output, and the second of which will both execute the expression and print the resulting output.
my_example <- "This is an example of a code chunk."
Now, we’ll both execute and print the results.
print(my_example)
## [1] "This is an example of a code chunk."
When I first began studying R, one of my more regrettable mistakes - apart from not learning R earlier in life - was that I’d read literature on R and simply look at the coding examples. This was an error. If possible, try running every bit of ostensibly non-malicious code you find. There’s a reason most literature on R takes advantage of literate programming via code chunks, so read with RStudio open, and experiment with new expressions in the R console often.
In the previous session, Intro to R: Operators, you learned how to use three different types of operators in R, as well as how to store values in objects using assignment, and how to create vectors.
Arithmetic Operators: Allow R to perform mathematical computations like an advanced calculator while following the order of operations or operator precedence, i.e. “Please excuse my dear aunt Sally” or PEMDAS.
(): Parentheses
^: Exponents
*: Multiplication
/: Division
+: Addition
-: Subtraction
Relational Operators: Allow values in R to be compared and contrasted with some specified condition or criteria.
>: Greater than
<: Less than
>=: Greater than or equal to
<=: Less than or equal to
==: Exactly equal to
!=: Not equal to
Logical Operators: Allow different conditions or criteria using relational operators to be combined.
|: Or, where at least one operand (side) must be TRUE for the expression to be TRUE
&: And, where both operands (sides) must be TRUE for the expression to be TRUE
Logical Values: You also learned that statements using relational and logical operators result in logical values, that is, either TRUE or FALSE. Observe:
15 < 20 | 1 > 100
## [1] TRUE
Recall that because of the | (or) operator, only one operand (side) needs to be TRUE for the expression to evaluate to TRUE. Even though 1 > 100 is obviously FALSE, the entire expression is TRUE because 15 < 20 is TRUE.
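As a rough sketch of how the two logical operators differ, the following stores each side of the expression above in its own object and then joins them with | and with &. The object names left_side, right_side, either, and both are made up for illustration.

```r
# Each relational comparison evaluates to a single logical value
left_side  <- 15 < 20   # TRUE
right_side <- 1 > 100   # FALSE

# | needs at least one TRUE operand; & needs both operands to be TRUE
either <- left_side | right_side
both   <- left_side & right_side

print(either)  # TRUE
print(both)    # FALSE
```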
Refresher: Now that we can recall a bit more about operators, try to evaluate the following expression in your head.
(5 * 5 == 25 | 10 != -10) & 200 + 300 > 400
Once you think you know the answer (remember, it’s TRUE or FALSE), run the code in R and find out!
Assignment Operator: The assignment operator, <-, allows you to store values, expressions, functions, and entire datasets in an object, or a variable which you name, and which may be used in myriad ways.
When you assign something to an object, the name of the object is on the left side of the assignment operator, or <-, while the contents on the right side may be one or more values, a dataset, or even an entire expression like the one above.
Assigning to Objects: Here, we’ll use <- to assign the value 10 to the object: october.
october <- 10
Note that assigning information to an object does not print the object, but it is stored in your workspace.
We can print october simply by running it in the console.
october
## [1] 10
Objects in Expressions: Recall that we can use objects interchangeably, as if they were values. Here, we’ll subtract 2 from october and store the evaluated result in a new object: august. We’ll then call august to print its value.
august <- october - 2
august
## [1] 8
Refresher: Finish the following expression by using the assignment operator, <-, and an integer (e.g. 1, 2, etc.) to assign the appropriate value to object: march.
march
Vectors: One of the most basic data structures is the vector, which contains one or more values of the same class. We learned how to make these by combining values using the function: c(). For example, c(2, 4, 6) combines the individual values 2, 4, and 6.
Storing Vectors: We learned to store multiple values in a vector using the assignment operator, <-. Let’s store 2, 4, and 6 into an object: even.
even <- c(2, 4, 6)
even
## [1] 2 4 6
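To illustrate what “values of the same class” means in practice, here is a minimal sketch using class(), a base R function. The object mixed is a hypothetical example: when values of different classes are combined, R converts them all to one class.

```r
# A vector holds values of a single class
even <- c(2, 4, 6)
class(even)            # "numeric"

# Mixing classes forces R to coerce everything to one class;
# here the numbers are converted to character strings
mixed <- c(2, "four", 6)
class(mixed)           # "character"
mixed                  # "2" "four" "6"
```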
Refresher: Now, create your own vector with function c(). Fill in the underscore, _, with odd numeric values. Make the name of the object: odd.
odd <- c(_, _, _)
Vectors are similar to linear board game formats like that of “Munchkin”. They are one-dimensional, so you only need a single value (or, sometimes, a set of single values) to indicate position. Source: World of Munchkin
Data frames, sometimes spelled dataframe or data.frame, are the most common data structures in the whole of R. A data frame is made up of tabular data, that is, data composed of rows and columns. If you’ve used spreadsheet software like Excel before, chances are you’re already familiar with the data frame; you just didn’t know it!
Appearance: What does a data frame look like? We can use a built-in dataset to demonstrate. Here, we’ll call function head() to observe the first few rows of the dataset: iris.
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
Unlike vectors, data frames are two-dimensional, like chess, Battleship, or Bingo. The values indicating position come in pairs, one to indicate row position and another to indicate column position. Source: Chess Bazaar
Content: What makes a data frame a data frame? Besides being tabular, i.e. being made up of two dimensions (rows and columns), data frames can contain variables of different classes. Let’s look at the dataset iris again. Which column is different from the others?
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
If you recognized that Species is different from the other columns, you’re spot on! We can get a good look at each variable’s class by calling function str(), or structure.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
As you can see, the first four variables are of class numeric:
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
However, one sticks out: Species. R recognizes this as a factor. Don’t worry about factors now, except that they are basically categorical variables (there are 3 “categories” of iris in Species). Only recognize that there are different classes mixed together in a data frame.
Pro Tip: To view a data frame in its entirety, in RStudio, use function View() with the name of the data frame as the only input, e.g. View(iris). Note the capital “V”.
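If you only want the class of each column, without the rest of the str() output, one option is to apply class() to every column with sapply(). This is a small sketch rather than the only approach; col_classes is an illustrative name.

```r
# Look up the class of every column at once
col_classes <- sapply(iris, class)
print(col_classes)

# The factor column stores a fixed set of categories, called levels
levels(iris$Species)   # "setosa" "versicolor" "virginica"
```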
Subsetting is the act of extracting a subset, i.e. pulling out a smaller group of values from a larger group. Suppose we wanted to look at the mpg (miles per gallon) of all the cars in the data frame: mtcars. First, let’s look at the first rows of the dataset with head().
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Extracting Entire Variables: We can subset or extract an entire variable and all of its values by using the $ operator. While this is technically one of a few different, so-called “extraction operators”, it’s most frequently referred to as the dollar sign. Seriously.
You can subset or extract an entire variable by combining the name of the data frame, the dollar sign, and the name of the variable, e.g. mtcars$mpg. Observe:
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Wow, it looks a heck of a lot like a vector, or a series of values combined with c().
Spoiler: It’s a Vector. Yeah. Data frames are simply a series of vectors of the same length. Recall, vectors are just a series of elements. Therefore:
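To make this concrete, here is a minimal sketch that builds a small data frame from three vectors of the same length using data.frame(). The values and the names car, mpg, manual, and my_df are hypothetical and not taken from any dataset.

```r
# Three vectors of the same length (hypothetical example data)
car    <- c("Car A", "Car B", "Car C")
mpg    <- c(21.0, 22.8, 18.7)
manual <- c(TRUE, TRUE, FALSE)

# data.frame() binds the vectors together as columns
my_df <- data.frame(car, mpg, manual)
my_df

# Pulling a column back out with $ returns a vector again
my_df$mpg
is.vector(my_df$mpg)                  # TRUE

# Every column of a data frame has as many elements as there are rows
length(mtcars$mpg) == nrow(mtcars)    # TRUE
```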
Another Example: Take the data frame, faithful, which measures eruption and interim waiting times for the geyser, “Old Faithful”, in Yellowstone National Park. Let’s take a look at the first few rows with function head().
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
Practice: Now, use the dollar sign operator, $, to extract the waiting times in variable: waiting. I’ll wait.
faithful$_
Indexing is essentially subsetting, but more precisely and often by position. It allows you to extract a specific element or series of elements from either a vector or a data frame. Instead of the dollar sign, or at times in combination with it, you can index by using brackets, or bracket notation, with [ and ].
Take a look at the row names of dataset mtcars using function rownames(). Note that it is a vector.
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
Indexing by Position: What if we wanted to extract a specific element (in this case, the name of a car) from this vector? Well, we need only follow the vector with brackets, [ ], containing the position of the element.
To index, or extract, the fourth car, Hornet 4 Drive, just do the following:
rownames(mtcars)[4]
## [1] "Hornet 4 Drive"
Indexing in Variables: We can do precisely the same thing when we subset a variable, e.g. mpg in mtcars. Notably, because they are in the same position, this gives you the mpg of the Hornet 4 Drive!
mtcars$mpg[4]
## [1] 21.4
Takeaway: A vector is a 1-dimensional data structure, meaning it’s simply a series or sequence of values.
When you’re already on the corner of 5th Avenue and East 23rd Street, checking out the Flatiron Building, and you ask a local how to get to 5th Avenue and East 26th Street (you know, to check out the Museum of Mathematics), you’re on the right avenue already, so you only need to know how many blocks to walk.
It’s the same with vectors. You just need to know the number of blocks to get to your destination.
Indexing in Data Frames: Now we’re talking about 2-dimensional data, i.e. tabular data, i.e. data with rows and columns. We still use brackets to index by position. However, because data frames are 2-dimensional, we need 2 numbers.
To index a specific value in a data frame, you need only the brackets, [ and ], with the row position and the column position separated by a comma (,).
Let’s try it to extract the mpg for the Hornet 4 Drive from the mtcars dataset. We know that the car is in the 4th row and mpg is the 1st column, so we need 4 and 1, in that order. Observe:
mtcars[4, 1]
## [1] 21.4
Extracting All Row or All Column Values: We can index an exact value by specifying the row and column position. However, you can also leave one position blank: mtcars[, 1] returns every row of column 1, while mtcars[4, ] returns every column of row 4.
Let’s see that in action by extracting all values in variable mpg (the whole column!). Then we’ll extract all the values in observation Hornet 4 Drive (the whole row!).
mtcars[, 1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
mtcars[4, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Takeaway: A data frame is a 2-dimensional data structure, and therefore requires two numbers to indicate position.
Suppose we’re enjoying the National Museum of Mathematics on 5th Avenue and East 26th Street, but we’re going to be late for a show at Madison Square Garden on 8th Avenue and West 31st Street. We’ll need to go 5 blocks towards Upper Manhattan, and 4 blocks west (accounting for Broadway, of course).
We navigate data frames in the same way.
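The sketch below gathers the approaches covered so far and shows that they all arrive at the same value. It also uses the drop argument of the bracket operator, which the text above does not cover: by default a single column comes back as a plain vector, while drop = FALSE keeps it as a data frame. The by_* object names are illustrative.

```r
# Four ways to reach the same value: the mpg of the Hornet 4 Drive
by_position <- mtcars[4, 1]
by_name     <- mtcars["Hornet 4 Drive", "mpg"]
by_dollar   <- mtcars$mpg[4]
by_mixed    <- mtcars[4, "mpg"]

c(by_position, by_name, by_dollar, by_mixed)   # all 21.4

# A single column comes back as a plain vector by default;
# drop = FALSE keeps it as a one-column data frame
class(mtcars[, 1])                  # "numeric"
class(mtcars[, 1, drop = FALSE])    # "data.frame"
```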
You can use other operators and functions to index, or subset, a range of values within both vectors and data frames. Let’s see what that looks like using the mtcars dataset.
You can create a vector of element positions using the concatenate, or c() function. Suppose you want to subset the first, fifth, and eleventh cars in the dataset. You simply need to plug those values into function c(), i.e. c(1, 5, 11).
Recall that in a data frame, you include the brackets ([ & ]) and a comma (,), with row positions to the left of the comma and column positions to the right.
Now, let’s call the mtcars dataset and index positions 1, 5, and 11 using the concatenate function, c(). Note that we leave the column position (to the right of the comma) blank. By not specifying a position, all positions’ values are returned.
mtcars[c(1, 5, 11), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.62 16.46 0 1 4 4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
We can even subset specific rows and columns by concatenating position values using function c() in the column position, i.e. to the right of the comma in brackets. Here are rows 1, 5, and 11, along with only columns 1 (mpg) and 6 (wt, or weight):
mtcars[c(1, 5, 11), c(1, 6)]
## mpg wt
## Mazda RX4 21.0 2.62
## Hornet Sportabout 18.7 3.44
## Merc 280C 17.8 3.44
Pro Tip: For both rows and columns, if they have names, you can index by name using quotation marks, e.g. "mpg". Let’s look at the mileage, mpg, and number of cylinders, cyl, for the “Honda Civic”. Recall that we still must use function c() for more than one value:
mtcars["Honda Civic", c("mpg", "cyl")]
## mpg cyl
## Honda Civic 30.4 4
Cool, yeah?
Practice: Use one command to extract the cylinders (‘cyl’, column 2), horsepower (hp, column 4), and weight (wt, column 6) for the “Camaro Z28” and “Maserati Bora”, in rows 24 and 31, respectively. Which has more horsepower (hp)?
You can use either position or name, but remember to use quotation marks for each name!
mtcars[c(_, _), c(_, _, _)]
Suppose that instead of listing a series of position numbers, we’d like to provide a range or sequence. We can do this with a simple operator, the colon operator or :. For example, if we want rows 3, 4, 5, and 6, we simply type 3:6, without any use of the concatenate function, c()!
mtcars[3:6, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
What’s more, you can combine ranges with individual positions using function c(), for example:
mtcars[c(3:6, 8, 12), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Pro Tip: Recall that any numerical values can be made into a sequence. Take the following, c(1, 5, 10, 20, 100). We can determine the length of this vector by using function length(). Remember, we use length because vectors are one-dimensional, the same way we measure height and weight. For example:
length(c(1, 5, 10, 20, 100))
## [1] 5
Why 5? Because the length of the vector is 5, that is, it contains 5 elements. We can make a sequence for each position number, i.e. 1 through 5, by using the colon operator, or :.
1:length(c(1, 5, 10, 20, 100))
## [1] 1 2 3 4 5
Know this: You can make sequences from any numbers.
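A few more sequences, sketched here for illustration. Besides the colon operator, this uses two base R functions not covered above, seq() and seq_along(); the object x is just an example vector.

```r
# The colon operator counts by one between any two numbers
10:15        # 10 11 12 13 14 15
5:1          # counts down: 5 4 3 2 1

# seq() allows a custom step size
seq(from = 0, to = 20, by = 5)   # 0 5 10 15 20

# seq_along() builds the positions of a vector directly
x <- c(1, 5, 10, 20, 100)
seq_along(x)                     # 1 2 3 4 5
```

One reason to know seq_along(): for an empty vector, 1:length(x) produces 1 0 rather than nothing, while seq_along(x) returns an empty sequence.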
Practice: Using the colon operator, finish the following command in R to print the first 5 values of mpg in mtcars.
mtcars[_ _ _, "mpg"]
Since we can store information in R using assignment (<-) and an object, it’s helpful to store our indices or position values within them. This is especially true if we have large datasets and many, many positions. Here, we simply use a combination of c() and :, if necessary, to store position values. Like above, we’ll store 3:6, 8, and 12.
index <- c(3:6, 8, 12)
Where does this get us? Well, instead of having messy code with a bunch of operators and digits, we can simply use object index to indicate row position. It looks like this:
mtcars[index, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
This has huge implications!
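The same idea works for columns, too. As a minimal sketch, the following stores row positions and column names in separate objects and uses both inside the brackets. The names row_index, col_index, and small_cars are made up for this example.

```r
# Store row positions and column names in their own objects
# (row_index and col_index are illustrative names)
row_index <- c(1:3, 10)
col_index <- c("mpg", "hp")

# Rows go to the left of the comma, columns to the right
small_cars <- mtcars[row_index, col_index]
small_cars

dim(small_cars)   # 4 rows, 2 columns
```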
Practice: Save a new object, new_index, by storing the values 3, 5, and 11 through 14. Make sure you use the concatenate function, c():
new_index <- c(_)
Now, use object new_index in the row position inside the brackets of mtcars.
mtcars[_, ]
Bonus Challenge: Save your output to a new object, my_cars, and use function mean() to determine the average weight (wt) of the above subset. Remember that you can subset a specific variable, like mpg or wt, by typing the dataset name, mtcars, using the dollar sign operator, $, along with the variable name.
We learned in Introduction to R: Operators that we can use relational and logical operators like “greater than” (>), “less than” (<), and “equal to” (==) to determine which values in a dataset meet some condition we’ve specified. These statements return logical values, i.e. TRUE and/or FALSE. We can actually turn those statements into criteria to perform filtering operations.
Let’s create an index of logical values, TRUE and FALSE, to describe whether a vehicle in mtcars gets good gas mileage, with an mpg value of greater than 20.
mtcars$mpg > 20
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [23] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
Just looking at a sequence of TRUE and FALSE values isn’t very helpful, so what can we do with them? Let’s save them in an object using the assignment operator (<-). We’ll call the object: mpg_index.
mpg_index <- mtcars$mpg > 20
Assigning it to an object doesn’t change anything; it’s still a series of logical values, with the first values seen here:
head(mpg_index)
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
Filtering Operations: In this case, every TRUE and FALSE value corresponds to a row. TRUE values are those values that satisfied our condition, mpg > 20. Like we did with index and new_index, above, we can filter the dataset by using mpg_index in place of a position or collection of positions. Since we’re working with rows, mpg_index nests to the left of the comma (,).
mtcars[mpg_index, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Bob’s your uncle! You performed a filtering operation. These can be quite complex and they certainly are not uncommon!
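You don’t have to store the logical values first. As a sketch, the condition can be written directly in the row position, optionally with column names to the right of the comma, and conditions can be combined with &. The object names efficient and efficient_four are illustrative.

```r
# Write the condition directly in the row position and
# pick columns by name in the column position
efficient <- mtcars[mtcars$mpg > 20, c("mpg", "cyl")]
head(efficient, 3)

# Combine conditions with &: mpg above 20 AND exactly 4 cylinders
efficient_four <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4, ]
nrow(efficient_four)
```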
Determining Position Values: It would be more useful if, rather than a list of TRUE and FALSE logicals, we were able to get our hands on the numeric value of the position. This is easily achieved using the which() function. When called, which() returns the numeric position of every TRUE in a sequence of logical values.
Observe:
which(mpg_index)
## [1] 1 2 3 4 8 9 18 19 20 21 26 27 28 32
Here, we can determine precisely which rows satisfy our condition, mpg > 20. It prints out in a vector, which one could index further if one were so inclined. For example, let’s extract the 5th element and save it as object multi_pass.
multi_pass <- which(mpg_index)[5]
We can simply use object multi_pass as a more refined index on which to subset:
mtcars[multi_pass, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
Chicken. Good.
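Another use of which(), sketched here with a condition of my own choosing: finding the position of the car with the highest mpg, then using that position to look up its name. The object best_position is an illustrative name, and max() is a base R function.

```r
# Positions where mpg equals its maximum
best_position <- which(mtcars$mpg == max(mtcars$mpg))
best_position                      # 20

# Use the position to look up the matching row name
rownames(mtcars)[best_position]    # "Toyota Corolla"
```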
How Many Are True?: Recall that, under the hood, every TRUE is equal to 1, while every FALSE is equal to 0. Let’s prove that with some quick logic:
TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE
Therefore, we can quantify just how many cars in our subset satisfy the condition, mpg > 20, since they are represented by a TRUE. This is performed with the sum() function. Observe:
sum(mpg_index)
## [1] 14
Fourteen cars, neat! That’s less than half the total cars (n = 32). This practice is more useful than may initially meet the eye.
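Because TRUE counts as 1 and FALSE as 0, the same trick extends from counts to proportions. A brief sketch, using base R’s mean() on the logical values:

```r
mpg_index <- mtcars$mpg > 20

sum(mpg_index)    # count of TRUE values: 14
mean(mpg_index)   # proportion of TRUE values: 14 / 32 = 0.4375
```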
Practice: Here’s the scenario. It’s 1972 and due to a mysterious benefactor, you can get any car you want. Let’s create some criteria by which to filter.
mpg should be greater than 15.
150 horsepower, or hp, because we’re insecure (recall the and operator, i.e. &).
6 or 8 cylinders, or cyl (recall the or operator, i.e. |).
Save your criteria in an object: ideal_cars. Use function View() to check it out, i.e. View(ideal_cars). Which will you choose?
Instructions: Run the following code to read in data on Syracuse Housing Code Violations from DataCuse. They’re stored in the object: viola.
library(readr)
url <- "https://tinyurl.com/ybq2anh2"
viola <- read_csv(url)
The following instructions guide you through some methods of how to go about multi-step filtering operations.
Use dim() to determine the number of rows and columns in the dataset, respectively.
dim(viola)
## [1] 13896 24
13,896 rows and 24 columns, so a relatively small dataset. However, we’ll want to consider dimension reduction techniques to eliminate unwanted variables.
Use names() or colnames() and input the dataset to view the variable names.
names(viola)
## [1] "X" "Y"
## [3] "property_address" "property_zip"
## [5] "property_id" "violation_name"
## [7] "violation_date" "comply_by_date"
## [9] "violation_status" "case_number"
## [11] "case_type" "case_open_date"
## [13] "property_owner_name" "inspector_id"
## [15] "property_neighborhood" "vacant_property"
## [17] "owner_address" "owner_city"
## [19] "owner_state" "owner_zip"
## [21] "long" "lat"
## [23] "TNT_NAME" "ObjectId"
Since we’re not mapping any data (today!), we may want to consider eliminating long and lat, or longitude and latitude coordinates. Say we’re not interested in neighborhood boundaries, only property_zip - what else might we remove from location-related variables?
Use str() to examine the structure.
str(viola, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 13896 obs. of 24 variables:
## $ X : num -76.2 -76.2 -76.2 -76.2 -76.2 ...
## $ Y : num 43 43 43 43 43 ...
## $ property_address : chr "443 Tennyson Ave" "439 Tennyson Ave" "435 Tennyson Ave" "421 Tennyson Ave & Burnet Pk" ...
## $ property_zip : chr "\"13204\"" "\"13204\"" "\"13204\"" "\"13204\"" ...
## $ property_id : chr "110.-12-20.0" "110.-12-21.0" "110.-12-23.0" "110.-12-24.0" ...
## $ violation_name : chr "SPCC - Section 27-72 (e) -Trash & Debris" "19 NYCRR Part 1203" "SPCC - Section 27-72 (f) - Overgrowth" "SPCC - Section 27-72 (e) -Trash & Debris" ...
## $ violation_date : POSIXct, format: "2017-09-11" "2018-08-23" ...
## $ comply_by_date : POSIXct, format: "2017-09-26" "2018-09-10" ...
## $ violation_status : chr "Closed" "Open" "Closed" "Closed" ...
## $ case_number : chr "2017-26306" "2018-26134" "2018-20411" "2017-04090" ...
## $ case_type : chr "Property Maintenance-Ext" "Building W/O Permit" "Overgrowth: Private, Occ" "Trash/Debris-Private, Occ" ...
## $ case_open_date : POSIXct, format: "2017-09-05" "2018-08-23" ...
## $ property_owner_name : chr "443 Tennyson Ave Trust &" "Kelly Wolfram" "Yvonne Vervaet" "Ramon Rosario" ...
## $ inspector_id : int 502 248 252 497 413 413 497 413 413 252 ...
## $ property_neighborhood: chr "Tipp Hill" "Tipp Hill" "Tipp Hill" "Tipp Hill" ...
## $ vacant_property : chr "N" "N" "N" "N" ...
## $ owner_address : chr "443 Tennyson Ave" "439 Tennyson Ave" "43124 Valiant Dr" "421 Tennyson Ave & Burnet Pk" ...
## $ owner_city : chr "Syracuse" "Syracuse" "South Riding" "Syracuse" ...
## $ owner_state : chr "NY" "NY" "VA" "NY" ...
## $ owner_zip : chr "\"13204\"" "\"13204\"" "\"20152\"" "\"13204\"" ...
## $ long : num -76.2 -76.2 -76.2 -76.2 -76.2 ...
## $ lat : num 43 43 43 43 43 ...
## $ TNT_NAME : chr "Westside" "Westside" "Westside" "Westside" ...
## $ ObjectId : int 13001 13002 13003 13004 13005 13006 13007 13008 13009 13010 ...
## - attr(*, "spec")=List of 2
## ..- attr(*, "class")= chr "col_spec"
This is likely the most useful base R function for exploratory data analysis, providing dimensions, variable names and classes, the object class for storing the data, and the first few observations. It’s a combination of functions dim(), names(), class(), and head().
Since we’re interested in property_zip, it would be helpful to know the distribution of violations by zip code. Let’s use function table() to determine the number of records for each zip code. Remember to subset variables with $:
table(viola$property_zip)
##
## "13202" "13203" "13204" "13205" "13206" "13207" "13208" "13210" "13214"
## 194 2042 2740 3390 923 789 2476 919 18
## "13215" "13224"
## 4 401
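Counts from table() are often easier to read once sorted or converted to proportions. The sketch below uses a short, made-up vector of zip codes (zips) in place of viola$property_zip so that it runs without downloading the data; sort() and prop.table() are base R functions.

```r
# Hypothetical zip codes standing in for viola$property_zip
zips <- c("13204", "13205", "13204", "13208", "13205", "13205")

zip_counts <- table(zips)

# Sort from most to fewest records
sort(zip_counts, decreasing = TRUE)

# Convert counts to proportions of all records
prop.table(zip_counts)
```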
There’s not much in the way of numeric data, but that doesn’t mean we can’t try to visualize some variables of interest.
Let’s plot the output of table() and see where the most violations occur.
zip_table <- table(viola$property_zip)
barplot(zip_table, col = "tomato")
Each record also has a long, or longitude, and a lat, or latitude. We’ll use the scatter plot function, plot(), from base R, setting x = to long and y = to lat. Note that a geospatial visualization is often just a scatter plot with a useful background.
plot(x = viola$long, y = viola$lat)
If you squint your eyes, it kind of looks like Syracuse!
Now it’s your turn. We’ve done some quick and dirty Exploratory Data Analysis, and let’s imagine we drew conclusions from that to warrant the following filters for downstream analysis. We’re in the pre-processing or processing stage of a data analysis, now. Mess up here, and everything downstream will suffer the effects, as well. No pressure.
Dimension Reduction: We can get rid of a few variables (“dimensions”) that we won’t be using for our filtering. It would save us time, instead, to specify what we’d like to keep, rather than what we want to remove. Therefore, create a filter for property_address, property_zip, violation_date, violation_status, case_type, and property_owner_name. These are columns 3, 4, 7, 9, 11, and 13. Filter by preserving only these columns, using function c(). Store your filtered data in a new object: viola2.
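As a reminder of the pattern, here is a sketch of keeping columns by position using mtcars rather than viola. The positions and the names keep and mtcars_small are chosen only for illustration.

```r
# Positions of the columns to keep (mpg, hp, and wt in mtcars)
keep <- c(1, 4, 6)

# Leave the row position blank to keep every row
mtcars_small <- mtcars[, keep]

names(mtcars_small)   # "mpg" "hp" "wt"
dim(mtcars_small)     # 32 rows, 3 columns
```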
Narrowing Geography: Suppose we’re only interested in property_zip exactly equal to 13206. Instead of columns, like above, create a conditioning filter using == to preserve only properties in 13206, and store the results in object: viola3. First, we need to clean the zip codes a little bit by running the following code.
viola2$property_zip <- gsub(x = viola2$property_zip, pattern = "\"", replacement = "")
Remember to use a conditional statement to detect “exactly equals”, or ==.
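To see what the gsub() cleaning step does, here is a small sketch on a made-up vector, raw_zips, whose values carry literal quotation marks like those in viola. Once the quotation marks are removed, the values can be compared with == in the usual way.

```r
# Hypothetical zip codes with literal quotation marks, as in viola
raw_zips <- c("\"13204\"", "\"13206\"", "\"13206\"")

# Remove every quotation mark
clean_zips <- gsub(pattern = "\"", replacement = "", x = raw_zips)
clean_zips                 # "13204" "13206" "13206"

# The cleaned values can now be compared with ==
clean_zips == "13206"      # FALSE TRUE TRUE
```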
Relevant Status: Let’s eliminate irrelevant cases, i.e. violation_status with the value Closed should be eliminated. You can use != (“does not equal”) or == (“exactly equal to”) depending on how you’d like to tackle it! Store the resulting data in object: viola4.
Relevant Dates: Run the following code to convert January 1, 2018 into a filtering object: date_cutoff. Then, use a filtering operation to preserve only rows where violation_date is greater than or equal to date_cutoff. This preserves only data from 2018 onward. Store the results in object: viola5.
date_cutoff <- as.POSIXlt("2018-01-01")
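Dates can be compared with relational operators just like numbers. The sketch below uses made-up dates and as.POSIXct() with an explicit time zone, rather than the as.POSIXlt() call above, simply to keep the comparison unambiguous; the names dates and cutoff are illustrative.

```r
# Hypothetical violation dates; tz = "UTC" keeps the comparison unambiguous
dates  <- as.POSIXct(c("2017-09-11", "2018-08-23", "2018-01-01"), tz = "UTC")
cutoff <- as.POSIXct("2018-01-01", tz = "UTC")

# >= keeps the cutoff date itself as well as everything after it
dates >= cutoff          # FALSE TRUE TRUE
dates[dates >= cutoff]
```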
Finally, preserve only records with case_type exactly equal to Vacant House. Perform a filtering operation and store your results in object: viola6.
Congratulations: You’ve completed some important data preprocessing for your team using your new skills. Run the following code to see what your team has made to represent the top 10 property owners with the greatest number of open violations for vacant houses in 2018, and only in zip code 13206!
if(!require(dplyr)){install.packages("dplyr")}; library(dplyr)
if(!require(ggplot2)){install.packages("ggplot2")}; library(ggplot2)
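# Count violations per owner; as.data.frame() replaces the deprecated as_data_frame()
viola7 <- as.data.frame(table(viola6$property_owner_name),
                        responseName = "n", stringsAsFactors = FALSE) %>%
  arrange(desc(n)) %>% rename(Name = Var1, Violations = n)
viola8 <- head(viola7, 10)  # keep the ten owners with the most violations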
viola8$Name <- factor(viola8$Name, levels = viola8$Name[order(-viola8$Violations)])
ggplot(viola8, aes(x = Name, y = Violations)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_classic() +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
ggtitle(label = "Top 10 property owners with vacant housing violations",
subtitle = "Open violations in 13206 since January 1, 2018")
This quote didn’t fit anywhere, and it might confuse more than motivate, but I found it relevant.
“You start from the nitty-gritty programming types of aspects. We talk about operators and we talk about data types and we talk about classes and control structures and you work your way up to analyzing or reading in a dataset. It’s usually like the third or fourth lecture that I’m reading in data… It occurred to me that regardless of how you try to start teaching R, you always ended up doing that kind of crap because, like, you had to. Imagine no Tidyverse: If you wanted to take the mean of a variable within groups of another variable, that is a straight up ‘group_by()’ %>% ‘summarize()’ operation. But if you wanted to do that just in base R, you can use the ‘aggregate()’ function or the ‘tapply()’ function, but in order to do that you’d have to subset the column you wanted to group by from the data.frame. But how do you subset out the column? Well you have to use the ‘dollar sign operator’ to subset out the column - but then, but what’s that, right? And then you’ve already fallen down this deep hole… If you wanted to subset rows, you’d have to use the ‘bracket operator’, but what’s that? What’s one bracket versus two brackets? It’s like you always ended up falling down this hole of operators and data types.
There was no way to abstract out those low-level details. And I think the Tidyverse does almost all of that. You still have to know what a logical operation is because that’s how you filter things, but you can know it at a higher level. But there’s no way to work with data in base R without understanding operators and data types, basically… So you kind of ended up teaching that anyway… So you kind of get that stuff out of the way and let [students] see it at least once, and later on when you get caught in that net of analyzing data… you say ‘Oh, we saw factors before and so let’s have a review’."
Roger Peng, PhD., Bloomberg School of Public Health, Johns Hopkins University
“Analogy Corner”. Not So Standard Deviations, Episode 43. 20 June 2017.