In this tutorial, we will explore how to analyze and summarize data from the Star Wars universe using the R programming language. We’ll use the tidyverse package and its dplyr component to perform various data manipulation and analysis tasks on the starwars dataset, which contains information about different characters from the Star Wars series.
First, make sure you have the required libraries installed. The tidyverse package provides a collection of packages for data manipulation and visualization. You can install it and load it using the following code:
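A minimal setup sketch, assuming the package is installed from CRAN. The install line is commented out so that it only runs when you actually need it:

```r
# Install once (assumes a CRAN mirror is configured); uncomment if needed
# install.packages("tidyverse")

# Load the tidyverse, which attaches dplyr along with the other core packages
library(tidyverse)

# The starwars dataset ships with dplyr, so it is available after loading
dim(starwars)
```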
To access information about the starwars dataset, you can use the ?starwars command. This will display the documentation for the dataset, including details about its columns and data.
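As a quick complement to the help page, the columns can also be inspected from the console. This is a small sketch using glimpse(), which is part of dplyr; the variable name cols is only illustrative:

```r
library(dplyr)

# ?starwars opens the help page in an interactive session
# ?starwars

# Compact overview: one line per column with its type and first values
glimpse(starwars)

# Column names only
cols <- names(starwars)
cols
```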
We’ll start by filtering the data on specific criteria with the dplyr function filter() and counting the matching rows with nrow().
## [1] 6
## [1] 59
## [1] 0
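In this code, we use the filter() function to extract subsets of the data based on specific conditions. filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA (for example, because a value is missing) are dropped. We then use nrow() to count the number of rows in each filtered dataset.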
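The filter-then-count pattern can be sketched as follows. The two conditions are borrowed from the count() output later in this tutorial and are an assumption about what was filtered here; the variable names are illustrative:

```r
library(dplyr)

# Number of characters whose species is "Droid"
n_droids <- starwars %>%
  filter(species == "Droid") %>%
  nrow()

# Number of characters with a known mass other than exactly 10:
# rows with a missing mass make the condition NA and are dropped
n_known_mass <- starwars %>%
  filter(mass > 10 | mass < 10) %>%
  nrow()

n_droids
n_known_mass
```

Exact counts depend on the version of the starwars dataset bundled with your dplyr installation.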
Next, we’ll explore more complex filtering and counting using logical operators.
## # A tibble: 3 × 2
## `species == "Droid"` n
## <lgl> <int>
## 1 FALSE 77
## 2 TRUE 6
## 3 NA 4
## # A tibble: 2 × 2
## `mass > 10 | mass < 10` n
## <lgl> <int>
## 1 TRUE 59
## 2 NA 28
## # A tibble: 3 × 2
## `!species == "Droid" & mass > 10` n
## <lgl> <int>
## 1 FALSE 6
## 2 TRUE 54
## 3 NA 27
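Here, we pass a logical expression directly to count(). Unlike filter(), count() does not drop any rows: it tallies how many rows evaluate to TRUE, FALSE, or NA, so missing values get their own row. The | symbol represents logical OR, and the & symbol represents logical AND. Note that ! has lower precedence than == in R, so !species == "Droid" is read as !(species == "Droid").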
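The column headers in the output above show the expressions that were passed to count(), so the calls can be reconstructed roughly like this (the result names are illustrative):

```r
library(dplyr)

# Tally rows by whether the species is "Droid" (TRUE / FALSE / NA)
droid_counts <- starwars %>% count(species == "Droid")

# Tally rows by whether mass is above or below 10; missing mass gives NA
mass_counts <- starwars %>% count(mass > 10 | mass < 10)

# Non-droids heavier than 10; parsed as (!(species == "Droid")) & (mass > 10)
combined_counts <- starwars %>% count(!species == "Droid" & mass > 10)

droid_counts
mass_counts
combined_counts
```

Because count() keeps every row, the n column of each result adds up to the number of rows in starwars.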
Moving on, we’ll learn how to summarize data using the summarise() function.
## # A tibble: 1 × 1
## mean_height
## <dbl>
## 1 174.
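In this code, the summarise() function collapses the starwars dataset to a single row containing the mean height of all characters. Because some heights are missing, mean() has to be called with na.rm = TRUE; otherwise the result would be NA. The %>% operator pipes the dataset into the function.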
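A minimal sketch of such a summary, using the column name mean_height from the output above (the result name overall is illustrative):

```r
library(dplyr)

# Collapse the whole dataset to one row holding the mean height;
# na.rm = TRUE ignores characters whose height is missing
overall <- starwars %>%
  summarise(mean_height = mean(height, na.rm = TRUE))

overall
```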
Now, we’ll explore how to group data by certain categories and then summarize within those groups.
## # A tibble: 38 × 2
## species mean_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid 131.
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # ℹ 28 more rows
Here, we use the group_by() function to group data by the “species” column. Then, we calculate the mean height within each group using summarise().
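The grouped version can be sketched the same way; the only change is a group_by() step before summarise() (the result name by_species is illustrative):

```r
library(dplyr)

# One row per species, each holding the mean height within that species;
# characters with a missing species form their own NA group
by_species <- starwars %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))

by_species
```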
Let’s calculate both the mean and standard deviation of mass for each gender.
## # A tibble: 3 × 3
## gender mean_mass sd_mass
## <chr> <dbl> <dbl>
## 1 feminine 54.7 8.59
## 2 masculine 106. 185.
## 3 <NA> 48 NA
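We use the same approach as before, grouping the data by the “gender” column and then calculating both the mean and standard deviation of mass within each group. The standard deviation for the <NA> group is NA, most likely because that group contains only one known mass, and a standard deviation cannot be computed from a single value.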
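A sketch of this two-statistic summary, using the column names mean_mass and sd_mass from the output above (the result name by_gender is illustrative):

```r
library(dplyr)

# One row per gender with the mean and standard deviation of mass;
# na.rm = TRUE skips characters whose mass is missing
by_gender <- starwars %>%
  group_by(gender) %>%
  summarise(
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass   = sd(mass, na.rm = TRUE)
  )

by_gender
```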
Finally, we’ll calculate the Body Mass Index (BMI) for each character and add it as a new column in the dataset.
## # A tibble: 87 × 3
## mass height bmi
## <dbl> <int> <dbl>
## 1 77 172 26.0
## 2 75 167 26.9
## 3 32 96 34.7
## 4 136 202 33.3
## 5 49 150 21.8
## 6 120 178 37.9
## 7 75 165 27.5
## 8 32 97 34.0
## 9 84 183 25.1
## 10 77 182 23.2
## 11 84 188 23.8
## 12 NA 180 NA
## 13 112 228 21.5
## 14 80 180 24.7
## 15 74 173 24.7
## 16 1358 175 443.
## 17 77 170 26.6
## 18 110 180 34.0
## 19 17 66 39.0
## 20 75 170 26.0
## # ℹ 67 more rows
In this code, the select() function keeps only the “mass” and “height” columns, and the mutate() function adds a new column named “bmi”. BMI is mass in kilograms divided by the square of height in meters; because height is recorded in centimeters, it is divided by 100 first. For the first row, 77 / (1.72)^2 ≈ 26.0.
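A sketch of the BMI calculation, assuming the standard formula of kilograms divided by meters squared, which matches the values shown above (the result name bmi_tbl is illustrative):

```r
library(dplyr)

# Keep mass (kg) and height (cm), then add BMI = kg / m^2;
# rows with a missing mass or height get NA for bmi
bmi_tbl <- starwars %>%
  select(mass, height) %>%
  mutate(bmi = mass / (height / 100)^2)

bmi_tbl
```

Since mutate() adds a column without removing rows, the result has as many rows as starwars itself.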
Congratulations! You’ve learned how to filter, count, summarize, group, and extend the Star Wars character data using dplyr from the tidyverse in R. The same techniques apply to other datasets for exploratory data analysis.
To download the code, visit: https://www.data03.online/2023/08/how-to-use-dplyr-in-r.html