Introduction

As the population of the US keeps growing each year, the country continues to be a melting pot of newly adapted cultures. This is not a recent development. Historically, the US is a country fundamentally made by immigrants. Whether your family comes from Europe or Asia, from Oceania or Africa, unless you have a history of being a First Nations citizen, your family has a history of immigration. Because we wanted to learn more about the origins of last names (which are the ‘de facto’ way of guessing one’s ethnicity, though not the only one [more on that below]), we decided to create a database showing the origins of the most frequently occurring surnames in the 2010 Census. Since the Census Bureau has not released data on the most common surnames from 2020, we used the 2010 data to create the database.

It is important to note that one’s last name does NOT define one’s race or ethnicity. Since the US is a melting pot, a last name doesn’t necessarily represent your entire family. An ancestor may have married and adopted a new surname that has nothing to do with the family’s original last name. For that reason, we built the database around the origin of each last name, rather than around which ethnicity carries the last name most often.

Most Common Last Names by Ethnic Origin (2010)

Of the 1,250 last names we took from the Census list (2010), the most common origins were as follows:

  • English/Scottish
  • Spanish/Portuguese
  • German
  • Irish/Welsh
  • East Asian
  • Middle Eastern
  • South Asian
  • Other

Based on the data we obtained from https://www.census.gov/topics/population/genealogy/data/2010_surnames.html, this is our database:

library(ggplot2)
library(readr)
last_names <- read_csv("C:/Users/rooya/Documents/LastNames.csv")
## Rows: 703 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): English/Scottish, Spanish/Portuguese, German, Irish/Welsh, East Asi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Count the non-missing entries in each column (one column per origin category)
category_counts <- colSums(!is.na(last_names))
category_counts_df <- as.data.frame(category_counts)
category_counts_df$Category <- rownames(category_counts_df)
rownames(category_counts_df) <- NULL
names(category_counts_df) <- c("Count", "Category")

my_colors <- c("blue", "purple", "green", "orange", "maroon", "#CCCCFF", "#FFCC99", "#99CC00")

category_counts_df$Category <- factor(
  category_counts_df$Category, 
  levels = category_counts_df$Category[order(category_counts_df$Count, decreasing = TRUE)]
)

ggplot(category_counts_df, aes(x = Category, y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Number of Last Names by Origin Category", x = "Origin Category", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0, 800) +
  scale_fill_manual(values = my_colors)
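
The counting step above relies on the sheet having one column per origin, with shorter columns padded by empty cells. A minimal sketch of why `colSums(!is.na(...))` then gives the number of surnames per category, using a toy table with placeholder names (the `toy_names` data and its contents are assumptions for illustration, not rows from LastNames.csv):

```r
# Toy stand-in for LastNames.csv: one column per origin, padded with NA.
# The placeholder surnames below are hypothetical, not taken from the real file.
toy_names <- data.frame(
  `English/Scottish`   = c("Surname1", "Surname2", "Surname3"),
  `Spanish/Portuguese` = c("Surname4", "Surname5", NA),
  German               = c("Surname6", NA, NA),
  check.names = FALSE
)

# !is.na() marks every filled cell as TRUE and every empty cell as FALSE;
# colSums() then adds up the TRUE values in each column.
toy_counts <- colSums(!is.na(toy_names))
print(toy_counts)
```

This also explains why `read_csv` reports 703 rows: the file is as long as its longest column, and every other column is padded with missing values.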

Something else has to be noted. We kept English/Scottish and Irish/Welsh as two separate categories. English and Scottish surnames are grouped together because some Scottish Gaelic surnames are gaelicised forms of English surnames and, conversely, some English surnames are anglicised forms of Gaelic surnames. Irish and Welsh surnames are grouped together because the Irish names are the more notable of the two, and for data collection purposes we included the Welsh names with them, as Wales also has a certain autonomy of its own.

The numbers were as follows:

  • English/Scottish: 703
  • Spanish/Portuguese: 199
  • German: 169
  • Irish/Welsh: 139
  • East Asian: 31
  • Middle Eastern: 5
  • South Asian: 2
  • Other: 2

Note that EVERY number is a count of last names with that origin among the most common last names in the United States, according to the Census (2010).
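
To put the counts above in proportion, they can be turned into shares of the full list. A minimal sketch, assuming the eight counts are typed in by hand (the variable names `origin_counts` and `origin_share` are ours, for illustration):

```r
# Counts per origin category, copied from the list above.
origin_counts <- c(
  "English/Scottish" = 703, "Spanish/Portuguese" = 199, "German" = 169,
  "Irish/Welsh" = 139, "East Asian" = 31, "Middle Eastern" = 5,
  "South Asian" = 2, "Other" = 2
)

# Total number of classified surnames.
total_names <- sum(origin_counts)

# Share of each category as a percentage, rounded to one decimal place.
origin_share <- round(100 * origin_counts / total_names, 1)
print(data.frame(Count = origin_counts, Percent = origin_share))
```

The total works out to 1,250, which matches the size of the list, so no surname was counted twice or left out.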

In the case of the German last names, we included Scandinavian surnames. We also have to note that, historically, many surnames vary between English and German forms. In other words, many English last names are ultimately “germanicized”, and the same occurs with some German surnames that are anglicized. In the end, we decided that the less anglicized the surname, the more German it was.

For the East Asian category, we focused on countries located in East Asia (China, Taiwan, Japan, and Korea) and Southeast Asia (Vietnam, Thailand, Cambodia, etc.). For South Asia, we specifically chose India, Bangladesh, and Pakistan. For the Middle Eastern category, we also included North Africa. The Other category holds the small number of Slavic, African, and other surnames.

Again, this means that last names of English/Scottish origin are the most common last names in the United States.

We had to create our own database to separate the surnames by origin. This resulted in LastNames.csv, for which we went through the names one by one and sorted them with the help of the surname sources that we list below. Here is a URL of the database: https://docs.google.com/spreadsheets/d/1S4FO-cgXzmrUGycZIdfK_XjA2dP0ws-lcQN3k4sK8b8/edit?usp=sharing.
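
A hand-sorted sheet with one column per origin can be easier to check if it is reshaped so that each row holds one surname and its assigned origin. This is only a sketch of that idea in base R, not the method we used; the `wide` table and its placeholder surnames are hypothetical stand-ins for LastNames.csv:

```r
# Hypothetical wide table: one column per origin, padded with NA.
wide <- data.frame(
  `English/Scottish`   = c("Surname1", "Surname2", "Surname3"),
  `Spanish/Portuguese` = c("Surname4", "Surname5", NA),
  German               = c("Surname6", NA, NA),
  check.names = FALSE
)

# unlist() walks the table column by column, so repeating each column
# name nrow(wide) times lines every surname up with its origin.
long <- data.frame(
  Surname = unlist(wide, use.names = FALSE),
  Origin  = rep(names(wide), each = nrow(wide))
)

# Drop the padding cells, leaving one row per classified surname.
long <- long[!is.na(long$Surname), ]

# Tally surnames per origin; this should agree with colSums(!is.na(wide)).
origin_table <- table(long$Origin)
print(origin_table)
```

In the long layout a surname that was accidentally placed in two columns shows up as a repeated value in `Surname`, which `duplicated(long$Surname)` would flag.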

Hispanic Origins of the Most Common Last Names in the Census (2020)

Since my partner and I are Hispanic, we wanted to know how much the Hispanic population has grown since the last Census. In that list, 199 of the 1,250 most common last names were Spanish, and we wanted to see whether there was a change in the newest Census (2020). In addition, we experimented with writing R code to create this dataset.

What we used was:

library(ggplot2)
library(readr)
last_names <- read_csv("C:/Users/rooya/Documents/LastNames.csv")
# Count the non-missing entries in each column (one column per origin category)
category_counts <- colSums(!is.na(last_names))
category_counts_df <- as.data.frame(category_counts)
category_counts_df$Category <- rownames(category_counts_df)
rownames(category_counts_df) <- NULL
names(category_counts_df) <- c("Count", "Category")

my_colors <- c("blue", "purple", "green", "orange", "maroon", "#CCCCFF", "#FFCC99", "#99CC00")

category_counts_df$Category <- factor(
  category_counts_df$Category, 
  levels = category_counts_df$Category[order(category_counts_df$Count, decreasing = TRUE)]
)

ggplot(category_counts_df, aes(x = Category, y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Number of Last Names by Origin Category", x = "Origin Category", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0, 800) +
  scale_fill_manual(values = my_colors)
  

Because Hispanics are one of the fastest-growing populations in the US, we wanted to separate the surnames into Hispanic and non-Hispanic. Here is our database for that:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Two-row table: how many of the most common last names fall in each group
name <- tibble("Last Names Origins" = c("Non-Hispanic", "Hispanic"),
               "Most Common Last Names per Capita" = c(808, 192))
view(name)

# Bar chart of the two counts, one bar per group
ggplot(name, aes(x = `Last Names Origins`, y = `Most Common Last Names per Capita`, fill = `Last Names Origins`)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  labs(title = "Hispanic Origins on the Most Common Last Names",
       x = "Last Name Origins",
       y = "Most Common Last Names per Capita") +
  theme_minimal()

Again, each number in the graph is a count of the most common last names in the US, 1,000 last names in total, with red representing last names of Hispanic origin and blue representing last names of non-Hispanic origin. The data are as follows:

  • Hispanic Origins on the Most Common Last Names according to the Census (2020): 192
  • Non-Hispanic Origins on the Most Common Last Names according to the Census (2020): 808

Interestingly enough, the Hispanic total came close to 200: 192 of the 1,000 names (19.2%), compared with 199 of the 1,250 names (about 15.9%) in the 2010 list. Because the two lists differ in length, the shares are a fairer comparison than the raw counts. We still felt this was a very important data visualization to create, and we feel that as populations from certain backgrounds and ethnicities grow, the most common last names may become more multicultural in the future.
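
Since the two lists are not the same length (1,250 names versus 1,000), comparing shares is safer than comparing the raw counts of 199 and 192. A minimal sketch of that comparison, with the four numbers typed in from the text above (the `comparison` table and its column names are ours, for illustration):

```r
# Hispanic counts and list sizes for the two surname lists described above.
comparison <- data.frame(
  List     = c("2010 list", "Hispanic/Non-Hispanic list"),
  Hispanic = c(199, 192),
  Total    = c(1250, 1000)
)

# Share of Hispanic-origin surnames in each list, as a percentage.
comparison$Percent <- round(100 * comparison$Hispanic / comparison$Total, 1)

# Difference between the two shares, in percentage points.
share_change <- comparison$Percent[2] - comparison$Percent[1]

print(comparison)
print(share_change)
```

Keep in mind that the first count covers the combined Spanish/Portuguese category, so the two shares are not measuring exactly the same group.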

In Conclusion

Ultimately, as more people come to the US, there will be more diversity in the country. This not only brings a wider range of perspectives and experiences but also enriches American society in numerous areas such as the arts, education, language, and business. As for R, RStudio, and programming, my partner and I didn’t have any programming experience whatsoever. DataCamp introduced us to the concept, and we are very grateful to have seen how useful a tool RStudio can be, especially for analyzing data.

Reflecting on the work, such diversity can enhance creativity and drive innovation by bringing different viewpoints and solutions to complex problems. Surnames may not necessarily define one’s ethnicity, but they do provide an interesting view of one’s origins, which we demonstrated with the help of RStudio.