Select Certain Data

Select Variables

When working with a large data frame, often we want to only assess specific variables. The select() function allows us to select and/or rename variables.

Select specific columns

# data %>%  select(columns you want)

Select columns between two columns. This code selects all columns between column 5 and column 10.

# data %>% select(col5:col10)

Select all columns EXCEPT the columns between column 5 and column 10

# data %>% select( -(col5:col10) )

Other special functions in select:

Select Rows

The filter() function allows us to identify or select observations in which a particular variable matches a specific value/condition. The condition in the filter() function can be any kind of logical comparison and Boolean operators, such as:

Return rows when the filters are TRUE (this code returns the flights data that have month==1 and day==1).

# flights %>%  filter( month == 1, day == 1)

Filter rows that contain a specific string subset (this code returns rows that have “Germany” in their X column using the str_detect() function).

# pop_ger <-pop %>% filter(str_detect(X, "Germany"))

Select only full rows

# cons[complete.cases(cons),]

Select only non full rows (!)

# cons[!complete.cases(cons),]

Filter Examples

Filter if a variable is one or the other

# flights %>%  filter( month == 11 | month == 12)

Filter rows with relational operators

# flights %>% filter( arr_delay <= 120, dep_delay <= 120 )

Filter all rows EXCEPT a condition

# flights %>% filter( ! (arr_delay > 120 | dep_delay > 120) )

Ordering Data

The arrange() function allows us to order data by variables in ascending or descending order.

Arrange in ascending order for a single variable.

# data %>% arrange(column to arrange))

Arrange in ascending order for multiple variables

# flights %>% arrange( year, month, day )

Arrange in descending order for a single variable.

# data %>% arrange( desc(column to arrange) )

Arrange in decending order for multiple variables

# flights %>% arrange( desc(year, month, day ) )

Creating New Variables

The mutate() function allows us to add new variables while preserving the existing variables. Here is the list of some useful functions used inside the mutate().

Step 1

Select specific variables from the data and store them in a new data frame

# flights_sub<- flights %>% select(arr_delay, dep_delay, air_time)

Step 2A

Create new variables and keep existing variables

# mutate(flights_sub,  
#      gain = arr_delay - dep_delay, 
#      hours = air_time / 60, 
#      gain_per_hour = gain / hours)

Step 2B

Create new variables and remove existing variables

# transmute(flights,
#           gain = arr_delay - dep_delay,
#           hours = air_time / 60,
#           gain_per_hour = gain / hours)

Summarizing Variables

The summarise() (a.k.a. summarize() ) function allows us to perform the majority of summary statistics when performing exploratory data analysis. Here is the list of some useful functions that can be used inside summary().

All functions in this list take a vector of values and returns a single summary value.

Take an average value with summarise. (The na.rm = TRUE argument in mean() function will allow us to ignore the missing values while computing the average.)

# summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) 

Why use summarise and not just mean()?

Summarise() allows us to return multiple summary statistics and have control over which ones we return in a single line of code.

Groupby() and Summarize()

If we want to take the summary statistics grouped by a variable, then we need to use another function called group_by(). group_by() along with summarise() functions will allow us to take and compare summary statistics grouped by a factor variable.

Example 1

Group by destination and find mean delay for each destination.

# flights %>% group_by(dest) %>% summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

Example 2

Providing multiple summary statistics.

Example 3

Count the number of instances per group

  • Filter: selects specific rows that are indigenous
  • Group by: makes groups by age group, then subgroup sex then sub sub group state
  • Summarise: counts the number of indigenous people per group above
  • Arrange: orders the data according to the first group, Age.