Tutorial: Analyzing Star Wars Character Data using Tidyverse and dplyr

In this tutorial, we will explore how to analyze and summarize data from the Star Wars universe using the R programming language. We’ll use the tidyverse package and its dplyr component to perform various data manipulation and analysis tasks on the starwars dataset, which contains information about different characters from the Star Wars series.

Step 1: Loading Libraries and Accessing the Dataset

First, make sure you have the required libraries installed. The tidyverse is a collection of packages for data manipulation and visualization, including dplyr. Install it once with install.packages("tidyverse") and load it in each session with library(tidyverse).

To access information about the starwars dataset, you can use the ?starwars command. This will display the documentation for the dataset, including details about its columns and data.
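The setup described above can be sketched as follows. This is a minimal sketch, not the original post's code: the install line is commented out on the assumption that it only needs to run once, and the glimpse() call is an added convenience for a quick look at the columns.

```r
# Install once per machine (uncomment if the tidyverse is not installed yet)
# install.packages("tidyverse")

library(tidyverse)  # attaches dplyr, which ships the starwars dataset

# Open the dataset documentation; help("starwars") is equivalent to ?starwars
if (interactive()) help("starwars")

# Compact overview of every column and its type
glimpse(starwars)
```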

Step 2: Filtering and Counting Data

We’ll start by filtering rows on specific criteria with the dplyr function filter() and counting the matching rows with nrow().

## [1] 6

## [1] 59

## [1] 0

Each result above is a row count. The filter() function keeps only the rows for which a condition is TRUE (rows where the condition evaluates to NA are dropped), and nrow() counts the rows in the filtered dataset.
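A sketch of the filter() and nrow() pattern, assuming the first two conditions are the same ones that appear in the Step 3 output. The condition that produced the third result (0) cannot be recovered from the text, so it is left out. The names droids and has_mass are chosen here for illustration.

```r
library(dplyr)

# Keep only the rows whose species is exactly "Droid", then count them
droids <- starwars %>% filter(species == "Droid")
nrow(droids)

# TRUE for any recorded mass other than exactly 10; rows with a missing mass
# evaluate to NA and are dropped by filter()
has_mass <- starwars %>% filter(mass > 10 | mass < 10)
nrow(has_mass)
```

The two counts should correspond to the 6 and 59 shown above.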

Step 3: More Complex Filtering and Counting

Next, we’ll explore more complex filtering and counting using logical operators.

## # A tibble: 3 × 2
##   `species == "Droid"`     n
##   <lgl>                <int>
## 1 FALSE                   77
## 2 TRUE                     6
## 3 NA                       4

## # A tibble: 2 × 2
##   `mass > 10 | mass < 10`     n
##   <lgl>                   <int>
## 1 TRUE                       59
## 2 NA                         28

## # A tibble: 3 × 2
##   `!species == "Droid" & mass > 10`     n
##   <lgl>                             <int>
## 1 FALSE                                 6
## 2 TRUE                                 54
## 3 NA                                   27

Here, we pass a logical expression directly to count(). Unlike filter(), count() does not drop any rows: it tallies how many rows evaluate to TRUE, FALSE, or NA for the expression, so missing values stay visible. The | symbol represents logical OR, the & symbol represents logical AND, and ! negates a condition.
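The expressions in the column headers above suggest calls along these lines. This is a reconstruction from the printed output rather than the post's original code, and the result names are chosen here for illustration.

```r
library(dplyr)

# Tally rows by whether the species is "Droid" (TRUE, FALSE, or NA)
droid_counts <- starwars %>% count(species == "Droid")

# TRUE for any recorded mass other than exactly 10, NA when mass is missing
mass_counts <- starwars %>% count(mass > 10 | mass < 10)

# ! binds more loosely than ==, so this reads as !(species == "Droid") & (mass > 10)
combined_counts <- starwars %>% count(!species == "Droid" & mass > 10)

droid_counts
mass_counts
combined_counts
```

In each result the n column sums to the total number of rows, which is a quick way to confirm that no rows were filtered out.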

Step 4: Summarizing Data

Moving on, we’ll learn how to summarize data using the summarise() function.

## # A tibble: 1 × 1
##   mean_height
##         <dbl>
## 1        174.

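Here, the summarise() function collapses the starwars dataset to a single row holding the mean height of all characters. The %>% operator pipes the dataset into the function. Because the height column contains missing values, mean() needs na.rm = TRUE to return a number rather than NA.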
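A minimal sketch of the summary above. The column name mean_height comes from the printed output; the use of na.rm = TRUE is inferred from the fact that the result is a number rather than NA.

```r
library(dplyr)

# Collapse the whole dataset to one row containing the average height in cm
height_summary <- starwars %>%
  summarise(mean_height = mean(height, na.rm = TRUE))

height_summary
```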

Step 5: Grouping and Summarizing

Now, we’ll explore how to group data by certain categories and then summarize within those groups.

## # A tibble: 38 × 2
##    species   mean_height
##    <chr>           <dbl>
##  1 Aleena            79 
##  2 Besalisk         198 
##  3 Cerean           198 
##  4 Chagrian         196 
##  5 Clawdite         168 
##  6 Droid            131.
##  7 Dug              112 
##  8 Ewok              88 
##  9 Geonosian        183 
## 10 Gungan           209.
## # ℹ 28 more rows

Here, we use the group_by() function to group data by the “species” column. Then, we calculate the mean height within each group using summarise().
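The grouped summary can be sketched like this, again assuming na.rm = TRUE so that a single missing height does not turn a whole group's mean into NA.

```r
library(dplyr)

# One output row per species, each holding that species' average height
height_by_species <- starwars %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))

height_by_species
```

Characters with a missing species form their own NA group, so the result has one row per distinct value of species, NA included.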

Step 6: Calculating Mean and Standard Deviation

Let’s calculate both the mean and standard deviation of mass for each gender.

## # A tibble: 3 × 3
##   gender    mean_mass sd_mass
##   <chr>         <dbl>   <dbl>
## 1 feminine       54.7    8.59
## 2 masculine     106.   185.  
## 3 <NA>           48     NA

We use the same approach as before, grouping the data by the “gender” column and then calculating both the mean and standard deviation of mass within each group.
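A sketch of the two-statistic summary, using the column names from the printed output (mean_mass and sd_mass). The na.rm = TRUE arguments are an assumption inferred from the non-missing results.

```r
library(dplyr)

# Several summary columns can be computed in a single summarise() call
mass_by_gender <- starwars %>%
  group_by(gender) %>%
  summarise(
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass   = sd(mass, na.rm = TRUE)
  )

mass_by_gender
```

The NA standard deviation in the last row is most likely because only one character with a missing gender has a recorded mass, and sd() of a single value is NA.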

Step 7: Adding a New Column

Finally, we’ll calculate the Body Mass Index (BMI) for each character and add it as a new column in the dataset.

## # A tibble: 87 × 3
##     mass height   bmi
##    <dbl>  <int> <dbl>
##  1    77    172  26.0
##  2    75    167  26.9
##  3    32     96  34.7
##  4   136    202  33.3
##  5    49    150  21.8
##  6   120    178  37.9
##  7    75    165  27.5
##  8    32     97  34.0
##  9    84    183  25.1
## 10    77    182  23.2
## 11    84    188  23.8
## 12    NA    180  NA  
## 13   112    228  21.5
## 14    80    180  24.7
## 15    74    173  24.7
## 16  1358    175 443. 
## 17    77    170  26.6
## 18   110    180  34.0
## 19    17     66  39.0
## 20    75    170  26.0
## # ℹ 67 more rows

Here, the select() function keeps only the “mass” and “height” columns, and the mutate() function then adds a new column named “bmi”. BMI is mass in kilograms divided by the square of height in meters; height is recorded in centimeters, so it is divided by 100 first. For the first row, 77 / (1.72)^2 ≈ 26.0.
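A sketch of the BMI calculation, assuming select() runs before mutate() (which matches the three columns in the output) and that print() was asked for 20 rows.

```r
library(dplyr)

# Keep the two inputs, then derive BMI = kg / m^2 (height is stored in cm)
bmi_tbl <- starwars %>%
  select(mass, height) %>%
  mutate(bmi = mass / (height / 100)^2)

# Show the first 20 rows; rows with a missing mass or height get an NA bmi
print(bmi_tbl, n = 20)
```

Note that mutate() returns a new tibble and leaves starwars itself unchanged, which is why the result is assigned to bmi_tbl (a name chosen here for illustration).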

Congratulations! You’ve learned how to manipulate and analyze the Star Wars character data using the tidyverse and dplyr packages in R. The same techniques (filtering, counting, summarizing, grouping, and adding columns) apply to many other datasets in exploratory data analysis.

To download code visit here: https://www.data03.online/2023/08/how-to-use-dplyr-in-r.html

How to Use dplyr in R: A Tutorial on Data Manipulation with Examples

data03.online

2023-08-17