dplyr Introduction

2025-11-07

Introduction

Statistics offers numerous tools for working with data. R provides packages that assist in data analysis; this presentation covers the dplyr package, its core functions, and how they work with dataframes.

dplyr's functions can alter dataframes in many different ways, and using them in conjunction with each other allows refined and exact statistical analysis.

Dataframe Introduction

##    Length Width Height
## 1       8     3     20
## 2      23     6     25
## 3       3    10     23
## 4      15     6     19
## 5      11     9     23
## 6      11     7     26
## 7      19     6     25
## 8      23     6     19
## 9      10     4     20
## 10     20     5     22

Boxes is a dataframe with 3 columns and 10 rows that describes the dimensions of 10 fictional boxes. It will be used to show what each dplyr function does and how it can help in statistical analysis.
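
The code that builds Boxes is not shown in the slides. As a minimal sketch, the dataframe could be rebuilt from the values printed above with base R's data.frame(); the column values are copied from the printed output, and storing them as plain numeric vectors is an assumption.

```r
# Rebuild the Boxes dataframe from the values printed on this slide.
# Assumption: the columns are plain numeric vectors.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Inspect the shape and column types.
dim(Boxes)
str(Boxes)
```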

Dataframe Introduction Continued

This 3D plot shows the unaltered Boxes dataframe used in this presentation. Each concept that follows alters this dataframe in a different way.

Dplyr’s Functions

The dplyr library includes a number of functions, each useful in a different way for dataframe analysis and manipulation. Notable dplyr functions are listed below:

filter() - Extract specific rows
arrange() - Order and sort rows
select() - Extract specific columns
mutate() - Create new columns from existing or new data
summarize() - Collapse many values into a single summary
group_by() - Change the scope of other functions so they operate per group

dplyr functions often return tibbles, a modernized version of dataframes. This presentation will not describe the differences between them and will refer to both tibbles and dataframes as dataframes.

Filter

The filter() function reduces a dataframe to only the rows that meet the specified conditions. The example dataframe, Boxes, has 3 columns (Length, Width, and Height) and 10 rows.

filter() is given the dataframe to use, followed by the criteria. Here, only rows with Height \(\le 23\) and Width \(> 6\) are kept.

filter(Boxes, Height <= 23, Width > 6)

##   Length Width Height
## 1      3    10     23
## 2     11     9     23
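
Conditions separated by commas in filter() are combined with a logical AND. The sketch below, which rebuilds Boxes from the printed values so that it runs on its own, compares that form with an explicit `&` and with `|` (OR). The names `both`, `same`, and `either` are hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Comma-separated conditions: a row must satisfy all of them.
both <- filter(Boxes, Height <= 23, Width > 6)

# The same filter written with an explicit AND.
same <- filter(Boxes, Height <= 23 & Width > 6)

# OR keeps rows that satisfy at least one condition.
either <- filter(Boxes, Height <= 23 | Width > 6)

nrow(both)
nrow(either)
```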

Arrange

arrange() sorts the rows of a dataframe by the values of one or more columns, in ascending or descending order, so that the largest and smallest values of a column can be quickly found for further analysis.

arrange(Boxes, Width)

##    Length Width Height
## 1       8     3     20
## 2      10     4     20
## 3      20     5     22
## 4      23     6     25
## 5      15     6     19
## 6      19     6     25
## 7      23     6     19
## 8      11     7     26
## 9      11     9     23
## 10      3    10     23
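
The output above is in ascending order, which is the default for arrange(). As a sketch of descending order, a column can be wrapped in dplyr's desc(); a second column breaks ties. Boxes is rebuilt from the printed values so the block runs on its own, and the name `widest_first` is hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Sort by Width from largest to smallest; rows with equal Width
# are then ordered by ascending Length.
widest_first <- arrange(Boxes, desc(Width), Length)
widest_first
```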

Select Introduction

select() trims a dataframe to only specific columns or specific ranges of columns, allowing quick separation of data for use in other contexts. The trimmed columns can also be stored as their own dataframe.

select(Boxes, Length, Width)

##    Length Width
## 1       8     3
## 2      23     6
## 3       3    10
## 4      15     6
## 5      11     9
## 6      11     7
## 7      19     6
## 8      23     6
## 9      10     4
## 10     20     5
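
The call above names each column individually. As a sketch of the other options mentioned, a range of adjacent columns can be selected with `:`, a column can be dropped with `-`, and the result can be assigned to a new dataframe. Boxes is rebuilt from the printed values; the names `LengthWidth` and `NoHeight` are hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Select the range of columns from Length through Width
# and store it as its own dataframe.
LengthWidth <- select(Boxes, Length:Width)

# Dropping Height keeps the same two columns.
NoHeight <- select(Boxes, -Height)

names(LengthWidth)
```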

Select Example

This graph plots the data from a new dataframe created with the select() call from the previous slide.

Mutate Introduction

mutate() creates new columns on a dataframe, either computed from existing columns or filled with entirely new values.

mutate(Boxes, Volume = Length*Width*Height, Color="Purple")

##    Length Width Height Volume  Color
## 1       8     3     20    480 Purple
## 2      23     6     25   3450 Purple
## 3       3    10     23    690 Purple
## 4      15     6     19   1710 Purple
## 5      11     9     23   2277 Purple
## 6      11     7     26   2002 Purple
## 7      19     6     25   2850 Purple
## 8      23     6     19   2622 Purple
## 9      10     4     20    800 Purple
## 10     20     5     22   2200 Purple

Mutate Example 1

Complex operations can be performed to create new columns using mutate().

Ex: Creation of a new column with values defined by \(\displaystyle \text{NewColumn} = \frac{5\pi}{23} \times \text{Height}\).

##    Length Width Height NewColumn
## 1       8     3     20  13.65910
## 2      23     6     25  17.07387
## 3       3    10     23  15.70796
## 4      15     6     19  12.97614
## 5      11     9     23  15.70796
## 6      11     7     26  17.75683
## 7      19     6     25  17.07387
## 8      23     6     19  12.97614
## 9      10     4     20  13.65910
## 10     20     5     22  15.02501
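
The slide shows only the result. A call along the following lines would compute the same column; this is a sketch that rebuilds Boxes from the printed values, and the name `NewBoxes` is hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# NewColumn = (5 * pi / 23) * Height, computed row by row.
NewBoxes <- mutate(Boxes, NewColumn = 5 * pi / 23 * Height)
NewBoxes
```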

Mutate Example 2

Assume that the units of the Boxes dataframe are meters. Each column can be converted into feet with

\(\text{NewColumn (ft)} = \text{Column (m)} \times 3.281\).

Doing so for each relevant column results in the following dataframe. Note that the new columns, although named with a "meter" prefix, hold the converted values in feet:

##    Length Width Height meterLength meterWidth meterHeight
## 1       8     3     20      26.248      9.843      65.620
## 2      23     6     25      75.463     19.686      82.025
## 3       3    10     23       9.843     32.810      75.463
## 4      15     6     19      49.215     19.686      62.339
## 5      11     9     23      36.091     29.529      75.463
## 6      11     7     26      36.091     22.967      85.306
## 7      19     6     25      62.339     19.686      82.025
## 8      23     6     19      75.463     19.686      62.339
## 9      10     4     20      32.810     13.124      65.620
## 10     20     5     22      65.620     16.405      72.182
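
The code for this conversion is not shown in the slides. One way to sketch it is a single mutate() call with one expression per new column. The column names are kept as printed above so the result matches; the names `m_to_ft` and `FeetBoxes` are hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Conversion factor used on the slide (meters to feet).
m_to_ft <- 3.281

# The new columns keep the names from the printed output,
# but the values they hold are in feet.
FeetBoxes <- mutate(
  Boxes,
  meterLength = Length * m_to_ft,
  meterWidth  = Width * m_to_ft,
  meterHeight = Height * m_to_ft
)
FeetBoxes
```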

Summarize

summarize() collapses a dataframe into a single row, or into one row per group, by applying a function over an entire column. This example takes the mean of Length across the whole dataframe. Because mean() returns a single value here, the result is a 1-row dataframe.

summarize(Boxes, AvgLength = mean(Length, na.rm = TRUE))

##   AvgLength
## 1      14.3
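
summarize() is not limited to one statistic per call. As a sketch, several summaries can be computed at once, each becoming a column of the 1-row result. Boxes is rebuilt from the printed values; the names `BoxSummary`, `MaxHeight`, and `Count` are hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Each named argument becomes one column in the single-row result.
# n() counts the rows being summarized.
BoxSummary <- summarize(
  Boxes,
  AvgLength = mean(Length, na.rm = TRUE),
  MaxHeight = max(Height),
  Count     = n()
)
BoxSummary
```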

Summarize and Group_by

group_by() redefines the scope of other functions. In this example, summarize() is applied to the groups defined by group_by() to obtain the mean Length for each distinct Width value.

groupedBoxes <- group_by(Boxes, Width)
grsumBoxes <- summarize(groupedBoxes, AvgLength = mean(Length))
grsumBoxes

## # A tibble: 7 × 2
##   Width AvgLength
##   <dbl>     <dbl>
## 1     3         8
## 2     4        10
## 3     5        20
## 4     6        20
## 5     7        11
## 6     9        11
## 7    10         3
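
The same grouping can be written as a single pipeline with `%>%`, which avoids the intermediate variable. The sketch below also adds a row count per group with n() to show how many boxes contribute to each mean. Boxes is rebuilt from the printed values; the names `pipedBoxes` and `Count` are hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Group by Width, then summarize each group in one step.
pipedBoxes <- Boxes %>%
  group_by(Width) %>%
  summarize(AvgLength = mean(Length), Count = n())
pipedBoxes
```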

Summarize and Group_by

Using the altered dataframe created in the previous slide, the following plot was created:

Final Example

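This final example combines the functions featured in this presentation, with the exception of arrange(), into a single pipeline. Note that the column named Area holds Length × Width × Height, which is a volume. The plot of the final result is provided on the next slide.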

summaryBoxes <- Boxes %>%
  mutate(Area = Length * Width * Height) %>%
  select(Width, Length, Area) %>%
  filter(Width <= 6) %>%
  group_by(Width) %>%
  summarize(AvgWidth = mean(Width),
            AvgLength = mean(Length),
            AvgArea = mean(Area))

## # A tibble: 4 × 4
##   Width AvgWidth AvgLength AvgArea
##   <dbl>    <dbl>     <dbl>   <dbl>
## 1     3        3         8     480
## 2     4        4        10     800
## 3     5        5        20    2200
## 4     6        6        20    2658
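
The pipeline above does not sort its result. As a minimal sketch, arrange() could be appended as a final step so that the group with the largest average comes first. Boxes is rebuilt from the printed values, and the name `sortedSummary` is hypothetical.

```r
library(dplyr)

# Boxes rebuilt from the values printed earlier in the presentation.
Boxes <- data.frame(
  Length = c(8, 23, 3, 15, 11, 11, 19, 23, 10, 20),
  Width  = c(3, 6, 10, 6, 9, 7, 6, 6, 4, 5),
  Height = c(20, 25, 23, 19, 23, 26, 25, 19, 20, 22)
)

# Same steps as the slide, with arrange() added at the end.
sortedSummary <- Boxes %>%
  mutate(Area = Length * Width * Height) %>%
  select(Width, Length, Area) %>%
  filter(Width <= 6) %>%
  group_by(Width) %>%
  summarize(AvgWidth = mean(Width),
            AvgLength = mean(Length),
            AvgArea = mean(Area)) %>%
  arrange(desc(AvgArea))  # largest average first
sortedSummary
```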

Final Example Plot