Luke, I Am Your Data Wrangler: Data Joins In Star Wars

Introduction

This code-through shows how to split a data set into two, recombine the pieces with an inner join so that you have all the data you need for a specific analysis, and then analyze specific variables in the joined data.

Content Overview

I am going to split one data set (the starwars data set that ships with the dplyr package) into two data frames and then use a join to combine them again for analysis. I chose to split a single data set because I had trouble locating two separate data sets with enough related data for a join to make sense.

With the joined data, I will determine whether males or females are better represented within the Star Wars franchise. Since the data set includes both “gender” and “sex,” I chose to use “sex” as the variable. To keep the comparison focused, characters whose sex is missing (NA) are excluded, and the single “hermaphroditic” character is left out of the final plot.

Why You Should Care

Joining data sets is important so that you can more easily show relationships between the data. You want to be able to join data sets in different ways depending on what the goal is with the data. In this example, I want to be able to combine the two data sets to look at whether or not there are more female or male characters in Star Wars, so I need to be able to use the data relating to each sex in order to do so.

Determining which gender is more prevalent matters because reviewing the characters in a movie series can reveal underlying bias in which kinds of characters appear most often. Since Star Wars is a sci-fi series with many characters that are not human, it is a good example of how such bias can either be overcome or persist.

Learning Objectives

Specifically, you’ll learn how to break apart a data set, rejoin the pieces with an inner join, and then analyze the joined data for a specific goal.
Breaking apart data is useful when you don’t need all of it and want to present it in a way that is more concise and easier for people to view.

Next, we will rejoin the pieces. This is an important skill because it lets you combine two or more data sets that share a related variable, so that the additional data can be used for further analysis.

Finally, we will analyze the data to determine the prevalence of one variable. This is a skill you can use in many different analytical situations.

Step 1: Break Apart The Dataset

Here, we’ll show how to break a data set into two dataframes. This is a useful skill if you wish to simplify a dataset or choose just one or more parts to analyze. You might have a large data set and need to show only specific data to stakeholders so that your analysis is clear, or you may simply want a smaller dataframe that is easier to focus on.

# The starwars dataset ships with the dplyr package, so we load that first.
library(dplyr)

# Next we break the dataset apart into two data frames, labeled dat1 and dat2.
# We select the columns with a vector of column names and assign each subset to a new dataframe.
dat1 <- starwars[c("name", "birth_year", "species")]
dat2 <- starwars[c("name", "sex", "homeworld")]

# And now we will take a look at the head of our new dataframes
head(dat1)

head(dat2)

Step 2: Join the Datasets

Now with two separate data sets, let’s pretend for a moment that they came this way and our task as a data analyst is to recombine them to make comparisons between variables from each. We will join them on the character names using an inner join, which keeps only the cases found in both data sets.
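Before joining on a column such as name, it can help to confirm that the column identifies each row uniquely and to see which keys appear in only one data frame. The following is a minimal sketch using small toy data frames; the names left and right and their contents are illustrative stand-ins, not the real dat1 and dat2.

```r
# Toy stand-ins for dat1 and dat2 (illustrative only, not the full starwars data)
left  <- data.frame(name = c("Luke", "Leia", "Han"),
                    birth_year = c(19, 19, 29))
right <- data.frame(name = c("Luke", "Leia", "Leia"),
                    homeworld = c("Tatooine", "Alderaan", "Alderaan"))

# anyDuplicated() returns 0 when every key is unique,
# otherwise the index of the first repeated value
left_keys_unique  <- anyDuplicated(left$name) == 0
right_keys_unique <- anyDuplicated(right$name) == 0

# setdiff() lists the keys present in one data frame but missing from the other
only_in_left  <- setdiff(left$name, right$name)
only_in_right <- setdiff(right$name, left$name)
```

A repeated key would produce extra rows in the merged result, and a key found in only one data frame would be dropped by an inner join, so both checks are worth a quick look.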

# Choose the type of join: outer (all rows from both), inner (only shared cases), left (all rows from the left data frame) or right (all rows from the right data frame).
# Here we choose an inner join (all = FALSE) so that any row found in only one data frame is dropped, e.g. if Luke Skywalker were in dat1 but not dat2.
# We also set sort = FALSE so the result is not sorted by name and the head looks similar to the base dataframes.
dat3 <- merge(dat1, dat2, by = "name", all = FALSE, sort = FALSE)

# Now let's view the head.
head(dat3)

If we wanted to preserve rows that appear in only one of the dataframes, we would use all = TRUE, and the columns coming from the other dataframe would be filled with NA for those rows. Left joins (all.x = TRUE) and right joins (all.y = TRUE) do this for one dataframe or the other, but not both.
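To make the difference between the join types concrete, here is a small sketch with two toy data frames whose names only partly overlap. The data frames a and b are made up for illustration; in the real starwars split every name appears in both pieces, so the four joins would return the same rows.

```r
# Toy data frames (illustrative): Han is only in a, Yoda is only in b
a <- data.frame(name = c("Luke", "Leia", "Han"),
                species = c("Human", "Human", "Human"))
b <- data.frame(name = c("Luke", "Leia", "Yoda"),
                sex = c("male", "female", "male"))

inner <- merge(a, b, by = "name")                # only names found in both
outer <- merge(a, b, by = "name", all = TRUE)    # every name from either data frame
left  <- merge(a, b, by = "name", all.x = TRUE)  # every name from a
right <- merge(a, b, by = "name", all.y = TRUE)  # every name from b

# Row counts show what each join keeps
c(inner = nrow(inner), outer = nrow(outer), left = nrow(left), right = nrow(right))
```

In the left join Han keeps his row with NA for sex, and in the right join Yoda keeps his row with NA for species.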

Step 3: Analyze the Data

Next, we will use the data set we combined into one to determine if there are any differences with respect to the sexes of the characters and other variables. In this data set, both “sex” and “gender” were used as different variables. To simplify this lesson, I have chosen to just use “sex” as the variable, as this refers to the birth sex of the character.

# First let's make a table of the sexes to see how many characters of each sex there are.
table(dat3$sex, useNA = "no")

## 
##         female hermaphroditic           male           none 
##             16              1             60              6

There are far more male than female characters: 60 versus 16. Perhaps there is some scriptwriting bias, or perhaps it’s simply a coincidence.
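Raw counts are easier to compare as shares of the total. As a small sketch, the counts from the table output above can be turned into proportions with prop.table(); the variable names sex_counts and sex_share are just illustrative.

```r
# Counts copied from the table() output above
sex_counts <- c(female = 16, hermaphroditic = 1, male = 60, none = 6)

# prop.table() divides each count by the total; round for readability
sex_share <- round(prop.table(sex_counts), 3)
sex_share
```

With the joined data itself, the same idea would be prop.table(table(dat3$sex)).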

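# What is the mean age for each sex?
# birth_year is recorded in years before the Battle of Yavin, so a larger value means an older character.
# aggregate() drops rows with a missing birth_year or sex by default.
aggregate(birth_year ~ sex, dat3, mean)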

Female characters seem to be younger than male characters too (though Yoda, being almost 900 years old, may be skewing the male mean). Jabba the Hutt, meanwhile, is a sprightly young 600 or so by the time of Return of the Jedi.
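One way to check how much an outlier like Yoda pulls the mean is to compare it with the median, which is far less sensitive to extreme values. The sketch below uses a small toy data frame named ages; apart from 896, which is Yoda's birth_year in the starwars data, the rows are illustrative rather than the full data set.

```r
# Toy ages (illustrative subset; 896 is Yoda's birth_year)
ages <- data.frame(
  sex = c("female", "female", "female", "male", "male", "male", "male"),
  birth_year = c(19, 46, 48, 19, 41.9, 57, 896)
)

# Same aggregate() call as above, once with mean and once with median
mean_by_sex   <- aggregate(birth_year ~ sex, ages, mean)
median_by_sex <- aggregate(birth_year ~ sex, ages, median)

mean_by_sex
median_by_sex
```

On the joined data the equivalent call would be aggregate(birth_year ~ sex, dat3, median). If the male median sits much closer to the female median than the means do, the gap in means is mostly driven by a few very old characters.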

Step 4: Visualize the Data

The data we now have can be visualized in different ways to capture the audience’s interest. As we’ve read in this course, the choice of visual aid matters: a poor choice can present data in a way that is easily misunderstood or appears biased.

# ggplot() comes from the ggplot2 package, so we load that first.
library(ggplot2)

# First we will visualize the breakdown of the sexes in a barplot.
# Because some characters have NA for sex, we subset the data within the plotting call to remove those rows.
ggplot(data = subset(dat3, !is.na(sex)), aes(x = sex, fill = sex)) + geom_bar()

What’s more, it can also be used for…

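# ...visualizing the distribution of ages by sex.
# We drop rows with a missing sex or birth_year, plus the single "hermaphroditic" character.
plot_dat <- subset(dat3, !is.na(sex) & !is.na(birth_year) & sex != "hermaphroditic")
ggplot(data = plot_dat, aes(x = sex, y = birth_year, fill = sex)) + geom_violin() + scale_y_log10()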

The distribution of female ages seems much narrower than that of male ages. Of course, again, the male data is skewed by outliers such as Yoda. Jabba, the only hermaphroditic character in the data (with a masculine gender identity), was removed because you can’t make a violin plot from a single point.

Conclusion

In conclusion, you should now see how you can break datasets down to simplify them for your stakeholders, join dataframes based on the data you wish to analyze and then create visuals from the data you joined to present to your stakeholders. You can choose to be in charge of the data you use and analyze by using joins and breaking datasets down into smaller dataframes.

Further Resources

Learn More About Data Joins With The Following:

Resource I. Jesse Lecy & Jamison Crawford (2024). Data Programming for the Social Sciences (DP4SS)
Resource II. Stack Overflow Discussion Board (2024). Stack Overflow

Learn More About Female Representation in Star Wars With The Following:

Resource I. Senseless Sexism In The Galactic Empire
Resource II. Intersectional Look Into The Galaxy
Resource III. Fandomentals: Old Star Wars Vs. New Star Wars Part 2: Sexism
Resource IV. The Force Is Too Strong With This One? Sexism, Star Wars and Female Heroes

Works Cited

This code-through references and cites the following sources:

Source I. Jesse Lecy & Jamison Crawford (2024). Data Programming for the Social Sciences (DP4SS)
Source II. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (2021). R for Data Science (2e)
Source III. Jenny Bryan (2019). Stat545