This code through shows how to split a data set into two parts and then
use an inner join to recombine them, so that all the data you need for a
specific analysis ends up in one place. We will then analyze specific
variables in the joined data.
I am going to split one data set (Star Wars) into two and then use a join to recombine them. I chose to split a single data set because I had trouble locating two separate data sets with enough related data for a join to make sense.
With the joined data, I will determine whether males or females are
more represented within the Star Wars franchise. Since the data set
includes both “gender” and “sex,” I chose to use “sex” as the variable.
I left out “hermaphroditic” as well as any values labeled “N/A” to
keep the data focused.
Joining data sets is important because it makes relationships between the data easier to show. Which kind of join you use depends on your goal for the data. In this example, I want to combine the two data sets to see whether there are more female or male characters in Star Wars, so I need the sex of each character alongside the other variables.
Determining which sex is more prevalent is important because reviewing
the data behind a movie series can reveal underlying bias in the choice
of characters. Since Star Wars is a sci-fi series with many characters
that are not human, it is a good example of how such bias can either be
overcome or persist.
Specifically, you’ll learn how to break apart data sets and rejoin them
with an inner join so that you can then analyze the data for a specific
goal.
Breaking apart data is useful in situations where you don’t need all of
the data and want to present it in a way that is more concise and easier
for people to view.
Next, we will rejoin them. This is an important skill because it lets you combine data sets that share related variables, so the additional data can be used for further analysis.
Finally, we will analyze the data to determine how prevalent each value
of one variable is. This is a skill you can use in many different
analytical situations.
Here, we’ll show how to break a data set into two dataframes. This is
an important skill if you wish to simplify a data set or choose just one
or more parts to analyze. You might have a large data set but need to
show only specific data to stakeholders so that your analysis is clear,
or you may simply wish to work with a smaller dataframe that is easier
to focus on.
# The starwars data set ships with the dplyr package, so we load it first.
library(dplyr)
# First we will break apart the data set into two dataframes, which we will label dat1 and dat2.
# We select the columns with a vector of column names and assign each subset to a new dataframe.
dat1 <- starwars[c("name", "birth_year", "species")]
dat2 <- starwars[c("name", "sex", "homeworld")]
# And now we will take a look at the head of one of our new dataframes.
head(dat1)
Now with two separate data sets, let’s pretend for a moment that they
came this way and our task as a data analyst is to recombine them to
make comparisons between variables from each. We will join them by the
character names, and we are choosing to use an inner join, which keeps
only the rows whose name appears in both data sets.
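Before joining on the name column, it can be worth confirming that the key behaves as expected. The following is a minimal sketch and not part of the original walkthrough: it assumes the starwars data set from the dplyr package, rebuilds dat1 and dat2 the same way as above, and uses the illustrative names key_is_unique and unmatched.

```r
library(dplyr)  # assumption: dplyr is installed; it provides the starwars data set

# Rebuild the two dataframes exactly as in the walkthrough.
dat1 <- starwars[c("name", "birth_year", "species")]
dat2 <- starwars[c("name", "sex", "homeworld")]

# A join key should identify each row once; anyDuplicated() returns 0 when
# no value is repeated.
key_is_unique <- anyDuplicated(dat1$name) == 0 && anyDuplicated(dat2$name) == 0

# Names present in dat1 but absent from dat2 would be dropped by an inner join.
unmatched <- setdiff(dat1$name, dat2$name)

print(key_is_unique)
print(length(unmatched))
```

If the key were not unique, the join would return one row for every matching pair, which can silently inflate the row count.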
# Choose the type of join: outer (all rows), inner (only shared rows), left (all rows from the left dataframe) or right (all rows from the right dataframe).
# Here we choose an inner join so that any character missing from one dataframe is dropped, e.g. if Luke Skywalker were found in dat1 but not dat2.
# We also tell merge() not to sort by name, so the head will be similar to the base dataframes.
dat3 <- merge(dat1, dat2, by = "name", all = FALSE, sort = FALSE)
# Now let's view the head.
head(dat3)
If we wanted to preserve rows where one dataframe is missing a row
found in the other, we would use all = TRUE, and the
missing values in each column would default to NA.
Left joins (all.x = TRUE) and right joins (all.y = TRUE) do this for one
or the other dataframe but not both.
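The difference between the join types is easiest to see on a tiny example. The sketch below uses two hypothetical toy dataframes (toy_left and toy_right are made up for illustration and are not part of the Star Wars walkthrough), where one name appears only on the left and one only on the right.

```r
# Hypothetical toy data: "Leia" is only in toy_left, "Yoda" is only in toy_right.
toy_left  <- data.frame(name = c("Luke", "Leia", "Han"),
                        species = c("Human", "Human", "Human"))
toy_right <- data.frame(name = c("Luke", "Han", "Yoda"),
                        sex = c("male", "male", "male"))

# Inner join: only names found in both dataframes (all = FALSE is the default).
inner_rows <- merge(toy_left, toy_right, by = "name", all = FALSE)

# Full (outer) join: every name from either dataframe; gaps are filled with NA.
full_rows <- merge(toy_left, toy_right, by = "name", all = TRUE)

# Left join: every row of toy_left; right join: every row of toy_right.
left_rows  <- merge(toy_left, toy_right, by = "name", all.x = TRUE)
right_rows <- merge(toy_left, toy_right, by = "name", all.y = TRUE)

# Row counts: inner 2, full 4, left 3, right 3
print(c(inner = nrow(inner_rows), full = nrow(full_rows),
        left = nrow(left_rows), right = nrow(right_rows)))
```

In the full join, Leia has NA for sex and Yoda has NA for species, which is the NA behavior described above.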
Next, we will use the combined data set to look for differences between the sexes of the characters with respect to other variables. In this data set, “sex” and “gender” are separate variables. To simplify this lesson, I have chosen to use only “sex,” which refers to the biological sex of the character.
# First let's make a table of the sexes to see how many characters there are of each.
table(dat3$sex, useNA = "no")
##
##         female hermaphroditic           male           none
##             16              1             60              6
There are far more male than female characters: 60 versus 16. Perhaps there is some scriptwriting bias, or perhaps it’s simply a coincidence.
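To put the counts on a common scale, they can be converted to percentages. This is a small sketch that copies the four counts from the table output above into a named vector (sex_counts and sex_share are illustrative names) and uses base R's prop.table().

```r
# Counts copied from the table of dat3$sex shown above.
sex_counts <- c(female = 16, hermaphroditic = 1, male = 60, none = 6)

# Share of each category, as a percentage rounded to one decimal place.
sex_share <- round(100 * prop.table(sex_counts), 1)
print(sex_share)
# female 19.3, hermaphroditic 1.2, male 72.3, none 7.2

# Male characters per female character.
print(unname(sex_counts["male"] / sex_counts["female"]))
# 3.75
```

On these counts, male characters make up roughly 72% of the characters with a recorded sex, against roughly 19% for female characters.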
Female characters seem to be younger than male characters too (though
Yoda, being almost 900 years old, may be skewing the male mean). Here
birth_year counts years before the Battle of Yavin (BBY), so a larger
value means an older character. Jabba the Hutt, meanwhile, is a
sprightly young 600 years old in Return of the Jedi.
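One way to check the age comparison, and how much a single outlier moves the mean, is to compute the mean and the median of birth_year for each sex. This is a sketch under stated assumptions rather than the original analysis: it assumes the starwars data set from the dplyr package, rebuilds dat3 as in the steps above, and uses the illustrative names known, groups and age_by_sex.

```r
library(dplyr)  # assumption: dplyr is installed; it provides the starwars data set

# Rebuild the joined dataframe as in the walkthrough.
dat1 <- starwars[c("name", "birth_year", "species")]
dat2 <- starwars[c("name", "sex", "homeworld")]
dat3 <- merge(dat1, dat2, by = "name", all = FALSE, sort = FALSE)

# Keep only rows where both sex and birth_year are recorded.
known <- dat3[!is.na(dat3$sex) & !is.na(dat3$birth_year), ]

# Split the birth years by sex, then summarise each group.
groups <- split(known$birth_year, known$sex)
age_by_sex <- data.frame(
  sex    = names(groups),
  n      = vapply(groups, length, integer(1)),
  mean   = vapply(groups, mean, numeric(1)),
  median = vapply(groups, median, numeric(1)),
  row.names = NULL
)
print(age_by_sex)
```

The median is much less sensitive than the mean to a single extreme value such as Yoda, so comparing the two columns indicates how far the outliers are pulling the male mean.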
The data we now have can be visualized in different ways to capture the learner’s interest. As we’ve read in this course, the choice of visual aid is important: a poor choice can present data in a way that is easily misunderstood or appears to be biased.
# ggplot() comes from the ggplot2 package, so we load it first.
library(ggplot2)
# First we will visualize the breakdown of the sexes in a bar plot.
# Because there are some NA values, we also subset our data within the plotting call to remove these rows.
ggplot(data = subset(dat3, !is.na(sex)), aes(x = sex, fill = sex)) +
  geom_bar()
What’s more, it can also be used for…
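# ...visualizing the distribution of ages by sex.
# We drop rows with a missing sex and the single hermaphroditic character, then plot birth_year on a log scale.
ggplot(data = subset(dat3, !is.na(sex) & sex != "hermaphroditic"), aes(x = sex, y = birth_year, fill = sex)) +
  geom_violin() + scale_y_log10()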
The distribution of female ages seems much narrower than that of male
ages. Of course, the male data is again skewed by outliers such as Yoda.
As Jabba is the only hermaphroditic character in the data set, he
(masculine gender identity) was removed, because you can’t make a violin
plot from a single point.
In conclusion, you should now see how to break data sets down to
simplify them for your stakeholders, join dataframes based on the data
you wish to analyze, and then create visuals from the joined data to
present to your stakeholders. Using joins and breaking data sets down
into smaller dataframes puts you in charge of the data you use and
analyze.
Learn More About Data Joins With The Following:
Resource I. Jesse Lecy & Jamison Crawford (2024). Data Programming for the Social Sciences (DP4SS)
Resource II. Stack Overflow Discussion Board (2024). Stack Overflow
Learn More About Female Representation in Star Wars With The Following:
Resource I. Senseless Sexism In The Galactic Empire
Resource II. Intersectional Look Into The Galaxy
Resource III. Fandomentals: Old Star Wars Vs. New Star Wars Part 2: Sexism
Resource IV. The Force Is Too Strong With This One? Sexism, Star Wars and Female Heroes
This code through references and cites the following sources:
Source I. Jesse Lecy & Jamison Crawford (2024). Data Programming for the Social Sciences (DP4SS)
Source II. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (2021). R for Data Science (2e)
Source III. Jenny Bryan (2019). Stat545