The purpose of this project is to demonstrate the ability to transform data from various wide formats into a more digestible format for analysis. As part of the project, I will also clean/tidy the data and perform analysis. Below you will see three different data sets that were provided by fellow classmates. In addition to providing the data set, each classmate was asked to suggest analysis that could be completed using the data set. I will show the loading, tidying, and analysis of each dataset below.
First, I’ll load the necessary libraries, which will be used accross all three data sets:
library(tidyverse)
Requested Analysis to perform: “This data could be […] used to compare genre sales across regions and see if there are any differences/similarities across the globe.”
This is a dataset that contains video game sales (copies sold) for 2019. As you can see from the screenshot below, the data about sales region (NA_Sales, PAL_Sales, JP_Sales, Other_Sales) are in a “wide” data format. We’ll need to “gather” these columns to create just two columns, a key and value pair for sales_location, and sales (copies sold). Having the data in a “long” format will allow us to perform the requested analysis more easily.
Let’s begin our tidying process by reading in the data and taking a look at the first few rows. The data is in a CSV file on my GitHub.
games <- readr::read_csv("https://raw.githubusercontent.com/christianthieme/MSDS-DATA607/master/dataset1_video_games_project2.csv")
head(games)
## # A tibble: 6 x 16
## Rank Name Genre ESRB_Rating Platform Publisher Developer Critic_Score
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 Wii ~ Spor~ E Wii Nintendo Nintendo~ 7.7
## 2 2 Supe~ Plat~ <NA> NES Nintendo Nintendo~ 10
## 3 3 Mari~ Raci~ E Wii Nintendo Nintendo~ 8.2
## 4 4 Play~ Shoo~ <NA> PC PUBG Cor~ PUBG Cor~ NA
## 5 5 Wii ~ Spor~ E Wii Nintendo Nintendo~ 8
## 6 6 Poke~ Role~ E GB Nintendo Game Fre~ 9.4
## # ... with 8 more variables: User_Score <dbl>, Total_Shipped <dbl>,
## # Global_Sales <dbl>, NA_Sales <dbl>, PAL_Sales <dbl>, JP_Sales <dbl>,
## # Other_Sales <dbl>, Year <dbl>
As you can see from the first few rows, the sales region columns are in a “wide” format. To fix that we are going to need to gather those sales region columns into just two columns, a key column and a value column. In doing this, some of the columns will become obsolete to us for our analysis, such as “Total Shipped” and “Global Sales”, so we’ll remove those upfront before gathering. After gathering our data, we notice that there are many nulls in our “sales” value. This is because many of the games in this particular data set do not have regional sales information. Since our analysis specifically asks to look at regional sales data, we’ll go ahead and exclude these null values from our analysis. As the last step of our cleaning process, we’ll rename the region names to something more understandable (i.e. NA_Sales = North America).
games_gathered <- games %>% dplyr::select(-Total_Shipped, -Global_Sales) %>%
tidyr::gather(NA_Sales, PAL_Sales, JP_Sales, Other_Sales, key = "sales_location", value = "sales") %>%
filter(!is.na(sales)) %>%
mutate(sales_location = ifelse(sales_location == "NA_Sales", "North America",
ifelse(sales_location == "PAL_Sales", "Europe",
ifelse(sales_location == "JP_Sales", "Japan", "Other"))))
games_gathered
## # A tibble: 48,718 x 12
## Rank Name Genre ESRB_Rating Platform Publisher Developer Critic_Score
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 20 Gran~ Acti~ M PS3 Rockstar~ Rockstar~ 9.4
## 2 21 Gran~ Acti~ M PS4 Rockstar~ Rockstar~ 9.7
## 3 31 Gran~ Acti~ M PS2 Rockstar~ Rockstar~ 9.6
## 4 33 Gran~ Acti~ M X360 Rockstar~ Rockstar~ NA
## 5 35 Call~ Shoo~ M PS4 Activisi~ Treyarch NA
## 6 41 Call~ Shoo~ M X360 Activisi~ Infinity~ 8.7
## 7 42 Call~ Shoo~ M X360 Activisi~ Treyarch 8.8
## 8 46 Red ~ Acti~ M PS4 Rockstar~ Rockstar~ 9.8
## 9 47 Call~ Shoo~ M X360 Activisi~ Treyarch NA
## 10 48 Call~ Shoo~ M PS3 Activisi~ Treyarch NA
## # ... with 48,708 more rows, and 4 more variables: User_Score <dbl>,
## # Year <dbl>, sales_location <chr>, sales <dbl>
Now that our data set is in a “long” format, we can move forward with the requested analysis. Before looking at the data by region, I’d like to get a feel for how many genres there are, and how many copies are sold for each genre.
games_gathered %>% count(Genre) %>% arrange(desc(n)) %>%
ggplot() +
aes(x = reorder(Genre, n), y = n) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), hjust = -.15) +
labs(title = "Copies of Games Sold by Genre (in millions)") +
xlab("Genre") +
ylab("") +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.ticks.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.45)
) +
coord_flip()
In looking at the chart above, there are 20 genres, which will make any chart we create later look very busy and hard to understand. Let’s narrow our dataset down to the top 10 genres by copies sold.
top_ten_g <- games_gathered %>% count(Genre) %>% arrange(desc(n)) %>% top_n(10) %>% dplyr::pull(Genre)
games_filtered <- games_gathered %>%
filter(Genre %in% top_ten_g)
With our new filtered dataset, we are now prepared to take a look at sales by genre by sales region. We’ll do this by creating a grouped bar chart.
grouped_genres <- games_filtered %>% dplyr::group_by(sales_location, Genre) %>% summarise(sales = sum(sales))
ggplot(data = grouped_genres) +
aes(x = reorder(sales_location,desc(sales)), y = sales, fill = reorder(Genre, desc(sales))) +
geom_col(position = "dodge", color = "black") +
geom_text(aes(label = round(sales,0)), position = position_dodge(.9), vjust = -.25) +
labs(title = "Copies Sold by Sales Location", fill = "Genre") +
xlab("Sales Location") +
ylab("Copies Sold in Millions") +
theme(
panel.background = element_rect(fill = "white", color = NA),
plot.title = element_text(hjust = 0.55),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
)
In the chart above it is very clear that sports games, action games and shooter games are all extremely popular in every geography. They are the top three genres in every location except for Japan. Japan’s top genre is role-playing, followed by sports and action. Going in to this analysis, I expected to see much more of a difference between different geographies, however, based on the above, it is pretty clear that for the most part, sports, action, and shooter games will please gamers from every location.
Requested Analysis to perform: “You can group the data by census region or census division. Then organize the rates according to year, changing it from wide data to long data.”
This data set contains information on state marriage rates by state and years. Rates are based on provisional counts of marriages per 1,000 residing in the area. As you can see by looking at the screenshot below, the data set is in a “wide” format with the years across the top for a single state. We will gather these columns and create two columns, a key and a value.
The original data set can be found here. I have created a CSV file containing the data that can be found on my GitHub here. Let’s load the data and then view the first several rows.
marriage <- readr::read_csv("https://raw.githubusercontent.com/christianthieme/MSDS-DATA607/master/marriage_rates_project2.csv")
head(marriage)
## # A tibble: 6 x 23
## state census_division census_region `2016` `2015` `2014` `2013` `2012` `2011`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab~ East South Cen~ South 7.15 7.35 7.81 7.82 8.2 8.4
## 2 Alas~ Pacific West 7.10 7.41 7.51 7.29 7.2 7.8
## 3 Ariz~ Mountain West 5.93 5.92 5.78 5.40 5.6 5.7
## 4 Arka~ West South Cen~ South 9.86 10.0 10.1 9.75 10.9 10.4
## 5 Cali~ Pacific West 6.46 6.18 6.44 6.46 6 5.8
## 6 Colo~ Mountain West 7.43 6.79 7.06 6.45 6.8 7
## # ... with 14 more variables: `2010` <dbl>, `2009` <dbl>, `2008` <dbl>,
## # `2007` <dbl>, `2006` <dbl>, `2005` <dbl>, `2004` <dbl>, `2003` <dbl>,
## # `2002` <dbl>, `2001` <dbl>, `2000` <dbl>, `1999` <dbl>, `1995` <dbl>,
## # `1990` <dbl>
Looking at the first few rows above, it looks like our data came in correctly. Let’s move forward with gathering the year rows into a “key” column and the values into a “value” column:
marriage_gather <- marriage %>% gather("2016":"1990", key = "year", value = "rate")
marriage_gather
## # A tibble: 1,020 x 5
## state census_division census_region year rate
## <chr> <chr> <chr> <chr> <dbl>
## 1 Alabama East South Central South 2016 7.15
## 2 Alaska Pacific West 2016 7.10
## 3 Arizona Mountain West 2016 5.93
## 4 Arkansas West South Central South 2016 9.86
## 5 California Pacific West 2016 6.46
## 6 Colorado Mountain West 2016 7.43
## 7 Connecticut New England Northeast 2016 5.62
## 8 Delaware South Atlantic South 2016 5.61
## 9 District of Columbia <NA> <NA> 2016 8.15
## 10 Florida South Atlantic South 2016 8.13
## # ... with 1,010 more rows
Now that our data is in the correct format, we can move forward with our analysis. The student requesting the analysis said we could perform the analysis by looking at either census division or census region. Since census division has more granularity, let’s select it to perform our analysis as opposed to census region. The first thing we’ll need to do is to group our data frame by census division and year. We’ll then summarize our rates by looking at the mean value. Next, we’ll visualize our data by census division and by year to see if we can spot any trends.
marriage_grouped <- marriage_gather %>% dplyr::group_by(census_division, year) %>% summarize(rate = mean(rate)) %>% filter(!is.na(census_division))
ggplot(marriage_grouped) +
aes(x = as.numeric(year), y = rate, color = census_division) +
geom_line() +
labs(title = "Average Rate of Marriage by Census Division by Year", color = "Census Division") +
ylab("Rate") +
xlab("Year") +
theme_bw() +
theme(
panel.border = element_blank(),
axis.line = element_line(color = "black"),
plot.title = element_text(hjust = 1.8)
)
When looking at data with a time element, its often best to look at the data represented as a line chart as we’ve done above. Looking at the chart above, it is clear that for most of the division, marital rates are decreasing, however, because there are no labels on this chart because of the quantity and grouping of data, it makes it a little hard to see what is happening with the divisions with smaller changes over the years. Let’s see if we can visualize this data in a different way to make it a little easier to tell what’s going on.
ggplot(marriage_grouped) +
aes(x = reorder(year, desc(year)), y = rate) +
geom_col() +
geom_text(aes(label = round(rate,1)), hjust=-.15) +
labs(title = "Average Rate of Marriage by Census Division by Year") +
ylab("Rates are based on provisional counts of marriages per 1,000 residing in the area") +
xlab("Year") +
ylim(0,25) +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.ticks.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.45)
) +
facet_grid(~census_division) +
coord_flip()
Wow. Now the story is clear. Looking just at this chart above we see an incredible trend. The average marriage rate for EVERY census division has decreased since 1990. They don’t all decrease at the same rate, or at the same time, however, the overwhelming trend and story from this data set is that for every census division, the average marriage rate has declined from 1990 to 2016. Many of the declines are fairly small. The most pronounced decline is in the “Mountian” division, which declined from an average of 21.5% to just around 10%, a decrease of over 11% in 26 years.
Requested Analysis to perform: “The problem is that the year variable is spread out into 65 different columns, 1 for each year, that need to be gathered into 1 column. In order to make this dataset tidy we would gather the year columns into one column until we had a 3 column dataset of country, year, and mortality.”
This UNICEF data set gives the under 5 mortality for regions across the years 1990-2018. The data, as shown in the screenshot below, is an estimate of the amount of children under the age of 5 that die per 1,000 children. As you can see, this is a “wide” data set, with the years spanning the columns. We will transform this dat set into a “long” data set and gather the years and values into a key, value pair of columns. First, as you can tell, there is quite a bit of cleanup that needs to be done to make this data set useable. The original data set can be found here.
I have downloaded the data set and stored it in my GitHub here. We will begin cleaning by reading in the data set and removing the top 12 rows which are just explanatory information about the data set. We’ll also remove all rows after the 45th row, because the CSV actually has a second data set at the bottom of the file. Next, we’ll use REGEX to clean the 12th row, which is the actual header row, and create a vector of cleaned header values and make those the column names.
mortality <- readr::read_csv("https://raw.githubusercontent.com/christianthieme/MSDS-DATA607/master/dataset3_under_5_mortality_project2.csv")
mortality_clean <- mortality[c(13:45),c(1:31)]
headers <- (mortality[c(12),c(1:31)])
heads <- stringr::str_extract_all(headers, "(Region Name)|(Uncertainty bounds)|([0-9]{4})")
head <- base::unlist(heads)
colnames(mortality_clean) <- head
head(mortality_clean)
## # A tibble: 6 x 31
## `Region Name` `Uncertainty bo~ `1990` `1991` `1992` `1993` `1994` `1995`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sub-Saharan ~ Lower 176. 175. 174. 172. 172. 169.
## 2 Sub-Saharan ~ Median 180. 178. 177. 175. 175. 172.
## 3 Sub-Saharan ~ Upper 184. 182. 181. 179. 179. 176.
## 4 West and Cen~ Lower 191. 190. 188. 187. 185. 183.
## 5 West and Cen~ Median 197. 196. 194. 193. 191. 188.
## 6 West and Cen~ Upper 204. 202. 201. 199. 197. 195.
## # ... with 23 more variables: `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## # `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>,
## # `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
## # `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>
Now we can work on transforming our data set from a “wide” data set, to a “long” data set.
mortality_long <- mortality_clean %>% gather("1990":"2018", key = "year", value = "deaths")
head(mortality_long)
## # A tibble: 6 x 4
## `Region Name` `Uncertainty bounds` year deaths
## <chr> <chr> <chr> <dbl>
## 1 Sub-Saharan Africa Lower 1990 176.
## 2 Sub-Saharan Africa Median 1990 180.
## 3 Sub-Saharan Africa Upper 1990 184.
## 4 West and Central Africa Lower 1990 191.
## 5 West and Central Africa Median 1990 197.
## 6 West and Central Africa Upper 1990 204.
Next, you’ll notice that this data set actually includes a confidence interval for the estimates they are making (Uncertainty Bounds), however, for our purposes, we will just use the “Median” value for our analysis, so let’s filter out data related to the “Lower” and “Upper” bounds.
mortality_final <- mortality_long %>% filter(`Uncertainty bounds` == "Median")
mortality_final
## # A tibble: 319 x 4
## `Region Name` `Uncertainty bounds` year deaths
## <chr> <chr> <chr> <dbl>
## 1 Sub-Saharan Africa Median 1990 180.
## 2 West and Central Africa Median 1990 197.
## 3 Eastern and Southern Africa Median 1990 164.
## 4 Middle East and North Africa Median 1990 65.0
## 5 South Asia Median 1990 130.
## 6 East Asia and Pacific Median 1990 56.8
## 7 Latin America and Caribbean Median 1990 54.6
## 8 North America Median 1990 11.0
## 9 Europe and Central Asia Median 1990 30.8
## 10 Eastern Europe and Central Asia Median 1990 46.4
## # ... with 309 more rows
Now that we have a cleaned data set, let’s see if we can identify any trends over the years in mortality rates for children under 5 in each region.
ggplot(data = mortality_final) +
aes(x = as.numeric(year), y = deaths, color = `Region Name`) +
geom_line() +
geom_point(size = .9) +
labs(title = "Mortaility Rate of Children Under 5", color = "Region") +
ylab("Death's per 1,000 births") +
xlab("Year") +
theme_bw() +
theme(
panel.border = element_blank(),
axis.line = element_line(color = "black"),
plot.title = element_text(hjust = .70)
)
In looking at the chart above, one thing is clear. The mortaility rate for children under 5 is declining in every region. Some regions have much more drastic declines than others, such as East Asia, Sub-Saharan Africa, West and Central Africa, and Eastern Europe and Central Asia. Others such as North America and Western Europe have enjoyed low mortaility rates for years, and so have only seen a very small decrease in the mortality rate. While it appears some incredible work has been done, we can still see that the regions of East Asia, Sub-Saharan Africa, West and Central Africa, and Eastern Europe and Central Asia still have a lot of work to do to catch up with the other regions of the world.
While the data sets may have come from different places and contained different information, many of the same “tidying” principles applied to each data set. In performing the above tidying, cleaning, and analysis, the following principles were followed:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Following these principles in any tidying exercise will make analysis much easier down the road.