readr - get the data
readr will attempt to automatically identify column types during the operation. Using arguments, you can define the operation further, such as skip_empty_rows which gives the option to not parse blank rows into the dataframe, otherwise they will be listed as N/A.
A list of the arguments can be found at: https://readr.tidyverse.org/reference/read_delim.html
dplyr - manipulate the data
So long as you don’t need to any data wrangling or cleaning operations, which is done via tidyr, the dataframe can be manipulated and modified to fit your needs via dplyr. Anyone familiar with SELECT operations in SQL will catch on to the operations in this package quickly.
The main operations of dplyr include:
- filter() - select portions of the dataframe based on their value
- select() - select portions of the dataframe based on their name
- rename() - changes the name of the column while keeping all the data in the column
- mutate() - add to the dataframe by performing operations on other portions of the dataframe
- arrange() - much like ORDER BY; sorts the data by a specific column or columns
- summarise() - condense values into a single value
All of these features can be combined to perform multiple operations within a single command. “%>%” the pipe command links the operations together.
Example:
Dataframe %>%
filter(column1 == “A”)%>%
arrange(desc(column2))
This example we filtered the dataframe on values in column 1 then arranged the result by column2 in descending order.
We are going to focus on two of the most powerful and used operations filter and select
filter - select portions of the dataframe based on their value
filter allows the user to parse the data into separate dataframes or models. Breaking down a large dataframe is essential to operations and readability of the later report.
An example of the operation, we take our dataframe and separate out the Microbreweries from the others and put into it’s own dataframe.
Syntax 1: NEWDATAFRAMENAME <- Existing_Data_frame %>% filter(column_name operator value)
Syntax 2: NEWDATAFRAMENAME <- filter(Existing_Data_frame, column_name operator value)
The operator defines what you are trying to get from the operation; it equals, it does not equal, it’s NA, it’s between or near, etc..
Useful functions include:
* ==, <, <=, etc
* !=
* is.na()
* near()
## # A tibble: 5 x 6
## brewery_name type address website state state_breweries
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Valley Brewing~ Microb~ 1950 W Freemon~ http://www~ cali~ 284
## 2 Trumer Brauerei Microb~ 1404 4th St., ~ http://www~ cali~ 284
## 3 Thirsty Bear B~ Microb~ 661 Howard St.~ http://thi~ cali~ 284
## 4 Telegraph Brew~ Microb~ 416 N. Salsipu~ http://www~ cali~ 284
## 5 Stumptown Brew~ Microb~ 15045 River Rd~ http://www~ cali~ 284
We can perform multiple filter operations in the same command by combining them with either a “|” or a “,” The pipe gives us an OR value while the comma gives us an AND value.
Non_micro_california <- data_raw %>%
filter(type != "Microbrewery", state == "california")
Non_micro_california %>% head(5)## # A tibble: 5 x 6
## brewery_name type address website state state_breweries
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Valley Brewin~ Brewpub PO Box 4653, S~ http://www.~ cali~ 284
## 2 Valley Brewin~ Brewpub 157 Adams St.,~ http://www.~ cali~ 284
## 3 Ukiah Brewing~ Brewpub 102 S. State S~ http://www.~ cali~ 284
## 4 Tustin Brewin~ Brewpub 13011 Newport ~ http://www.~ cali~ 284
## 5 Trumer Brauer~ Region~ 1404 Fourth St~ http://www.~ cali~ 284
select - select portions of the dataframe based on their name
The most used feature of select is selecting columns to return in the dataframe, typically by name, reducing the large dataframe to only what is useful for the operation at hand. Lets look at the two dataframes we have already created and utilize the select function.
NOTE - if you put a - (dash) ahead of the name, you are NOT selecting that column. This is very useful if you only want to remove a few columns from a dataframe.
# reusing the same code and piping in the select statement
microbreweries <- data_raw %>%
filter(type == "Microbrewery")%>%
select(brewery_name, address, website)
microbreweries %>% head(5)## # A tibble: 5 x 3
## brewery_name address website
## <chr> <chr> <chr>
## 1 Valley Brewing Co 1950 W Freemont, Stockton, Cal~ http://www.valleybre~
## 2 Trumer Brauerei 1404 4th St., Berkeley, Califo~ http://www.trumer-in~
## 3 Thirsty Bear Brewi~ 661 Howard St., San Francisco,~ http://thirstybear.c~
## 4 Telegraph Brewing ~ 416 N. Salsipuedes St., Santa ~ http://www.telegraph~
## 5 Stumptown Brewery 15045 River Rd., Guerneville, ~ http://www.stumptown~
# select statement removes two columns not needed from the first verion of the dataframe.
Non_micro_california <- data_raw %>%
filter(type != "Microbrewery", state == "california")%>%
select(-state, -state_breweries)
Non_micro_california %>% head(5)## # A tibble: 5 x 4
## brewery_name type address website
## <chr> <chr> <chr> <chr>
## 1 Valley Brewing ~ Brewpub PO Box 4653, Stockton, C~ http://www.valleyb~
## 2 Valley Brewing ~ Brewpub 157 Adams St., Stockton,~ http://www.valleyb~
## 3 Ukiah Brewing C~ Brewpub 102 S. State St., Ukiah,~ http://www.ukiahbr~
## 4 Tustin Brewing ~ Brewpub 13011 Newport Ave. #100,~ http://www.tustinb~
## 5 Trumer Brauerei RegionalB~ 1404 Fourth St., Berkele~ http://www.trumer-~