For this week’s project, I read in three different datasets, cleaned and tidied them, and then performed some analysis. The three datasets are: World Population, MTA Daily Ridership, and Candy Hierarchy. For each dataset, the analysis centers on visualization.
The world population dataset includes the populations of countries/territories, along with their capitals and continents, spanning several decades, specifically 1970 to 2022. It also includes ‘Area (km²)’, ‘Density (per km²)’, ‘Growth Rate’, and ‘World Population Percentage.’
world_pop <- read.csv("C:\\Users\\nakes\\OneDrive\\Desktop\\607\\world_population.csv")
head(world_pop, 4)
## Rank CCA3 Country.Territory Capital Continent X2022.Population
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## X2020.Population X2015.Population X2010.Population X2000.Population
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## X1990.Population X1980.Population X1970.Population Area..km..
## 1 10694796 12486631 10752971 652230
## 2 3295066 2941651 2324731 28748
## 3 25518074 18739378 13795915 2381741
## 4 47818 32886 27075 199
## Density..per.km.. Growth.Rate World.Population.Percentage
## 1 63.0587 1.0257 0.52
## 2 98.8702 0.9957 0.04
## 3 18.8531 1.0164 0.56
## 4 222.4774 0.9831 0.00
Tidy Data
In order to tidy this dataset, I pivoted it to long format because the populations for each year were stored as separate variables. Now there is one variable for year and another for the population in that year. Pivoting the dataset longer lets us look at the world's population by year and examine growth.
world_pop2 <- pivot_longer(data = world_pop,
                           cols = starts_with('X'),           # the X1970...X2022 population columns
                           names_to = "Year",
                           values_to = "Population per year",
                           names_pattern = '([0-9]+)') %>%    # keep only the digits of each year
  janitor::clean_names()
head(world_pop2, 4)
## # A tibble: 4 × 11
## rank cca3 country_territory capital continent area_km density_per_km
## <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 652230 63.1
## 2 36 AFG Afghanistan Kabul Asia 652230 63.1
## 3 36 AFG Afghanistan Kabul Asia 652230 63.1
## 4 36 AFG Afghanistan Kabul Asia 652230 63.1
## # ℹ 4 more variables: growth_rate <dbl>, world_population_percentage <dbl>,
## # year <chr>, population_per_year <int>
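As a quick sanity check (a minimal sketch, not part of the original pipeline), each country should now contribute one row per census year:
# Each country should appear once per year (8 rows each)
world_pop2 %>%
  dplyr::count(country_territory) %>%
  head(3)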
Analysis
In my analysis, the first plot shows how much of the entire world’s population each continent makes up. In the second graph, I plotted the population per year by continent.
# World population percentage by continent
ggplot(data = world_pop2, aes(x = continent, y = world_population_percentage, fill = continent)) +
  ggtitle("World Population of Each Continent") +
  ylab("World Population Percentage") + xlab("Continent") +
  scale_fill_brewer(palette = "Set3") +   # fill must be mapped for the brewer palette to apply
  geom_col(position = "dodge")
# Population per year
ggplot(data = world_pop2, aes(x = continent, y = population_per_year, fill = year)) +
  ggtitle("Population of Each Continent by Year") +
  ylab("Population") + xlab("Continent") +
  scale_fill_brewer(palette = "Set3") +
  geom_col(position = "dodge")
For the world population, we can see that Asia makes up most of the population of the entire world, followed by Africa. Oceania makes up the smallest percentage.
For the population by year, we can see that the continents with an increase in population size were Africa, Asia, South America, and North America, while Europe and Oceania mostly stayed the same.
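As a rough numeric check on the first plot (a sketch; filtering to a single year keeps one row per country so each share is counted once):
# Continent shares of world population, largest first
world_pop2 %>%
  filter(year == "2022") %>%
  group_by(continent) %>%
  summarize(share = sum(world_population_percentage)) %>%
  arrange(desc(share))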
The MTA daily ridership dataset includes daily ridership for different modes of transportation from 2020 to 2024. The modes of transportation include subways, buses, the LIRR, Metro-North, Access-A-Ride, bridges and tunnels, and the Staten Island Railway. The dataset also includes each day's ridership as a percent of a comparable pre-pandemic day.
ride <- read_xlsx("C:\\Users\\nakes\\OneDrive\\Desktop\\607\\MTA_Daily_Ridership_Data__Beginning_2020_20240930.xlsx")
head(ride, 3)
## # A tibble: 3 × 15
## Date Subways: Total Estimated Ridershi…¹ Subways: % of Compar…²
## <dttm> <dbl> <dbl>
## 1 2020-03-01 00:00:00 2212965 97
## 2 2020-03-02 00:00:00 5329915 96
## 3 2020-03-03 00:00:00 5481103 98
## # ℹ abbreviated names: ¹`Subways: Total Estimated Ridership`,
## # ²`Subways: % of Comparable Pre-Pandemic Day`
## # ℹ 12 more variables: `Buses: Total Estimated Ridership` <dbl>,
## # `Buses: % of Comparable Pre-Pandemic Day` <dbl>,
## # `LIRR: Total Estimated Ridership` <dbl>,
## # `LIRR: % of Comparable Pre-Pandemic Day` <dbl>,
## # `Metro-North: Total Estimated Ridership` <dbl>, …
Tidy Data
In order to tidy this dataset, I pivoted it to long format because the modes of transportation could be one variable instead of several. I pivoted the two sets of variables separately into two different datasets: the first with date, mode of transport, and total ridership; the second with date, mode of transport, and percent of comparable pre-pandemic ridership. Then I merged them together by date and mode of transport. Pivoting the dataset longer lets us look at each mode of transportation by year and see how the trends changed.
ride <- ride %>% clean_names()
ride2 <- ride %>%
  pivot_longer(cols = c('subways_total_estimated_ridership',
                        'buses_total_estimated_ridership',
                        'lirr_total_estimated_ridership',
                        'metro_north_total_estimated_ridership',
                        'access_a_ride_total_scheduled_trips',
                        'bridges_and_tunnels_total_traffic',
                        'staten_island_railway_total_estimated_ridership'),
               names_to = "mode_of_transport",
               values_to = "total_ride_trip",
               names_pattern = '([A-Za-z]+)') %>%   # keep only the leading word (e.g. 'subways')
  select(date, mode_of_transport, total_ride_trip)
ride3 <- ride %>%
  pivot_longer(cols = c('subways_percent_of_comparable_pre_pandemic_day',
                        'buses_percent_of_comparable_pre_pandemic_day',
                        'lirr_percent_of_comparable_pre_pandemic_day',
                        'metro_north_percent_of_comparable_pre_pandemic_day',
                        'access_a_ride_percent_of_comparable_pre_pandemic_day',
                        'bridges_and_tunnels_percent_of_comparable_pre_pandemic_day',
                        'staten_island_railway_percent_of_comparable_pre_pandemic_day'),
               names_to = "mode_of_transport",
               values_to = "pre_pandemic_percent",
               names_pattern = '([A-Za-z]+)') %>%   # same leading-word pattern as above
  select(date, mode_of_transport, pre_pandemic_percent)
ride4 <- merge(ride3, ride2, by = c('date', 'mode_of_transport'))   # 'by' joins the keys; by.ride2/by.ride3 are not merge() arguments
head(ride4, 7)
## date mode_of_transport pre_pandemic_percent total_ride_trip
## 1 2020-03-01 access 113 19922
## 2 2020-03-01 bridges 98 786960
## 3 2020-03-01 buses 99 984908
## 4 2020-03-01 lirr 100 86790
## 5 2020-03-01 metro 59 55825
## 6 2020-03-01 staten 52 1636
## 7 2020-03-01 subways 97 2212965
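To confirm the merge behaved as intended (a minimal check, not part of the original pipeline), the long table should have one row per date per mode:
# 7 modes per date, so ride4 should have 7 times the rows of the wide table
nrow(ride4) == 7 * nrow(ride)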
Analysis
For the analysis, the goal is to look at ridership over the years by mode of transportation. I extracted the year from the date variable, then computed the mean of both total ridership and pre-pandemic percent by year and mode of transport. Then I plotted each in its own graph.
ride4 <- ride4 %>%
  mutate(year = year(date))   # lubridate::year() extracts the calendar year
ride5 <- ride4 %>%
group_by(year, mode_of_transport) %>%
summarize(mean_total_ride = mean(total_ride_trip))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
ggplot(data=ride5, aes(x=year, y=mean_total_ride, fill=mode_of_transport)) +
ggtitle("Estimated Average Ridership by Year ") +
ylab("Estimated AVerage Count of Rides or Trips") + xlab("Year") +
scale_fill_brewer(palette = "Set1") +
geom_col(position = "dodge")
ride6 <- ride4 %>%
group_by(year, mode_of_transport) %>%
summarize(mean_prepand = mean(pre_pandemic_percent))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
ggplot(data=ride6, aes(x=year, y=mean_prepand, fill=mode_of_transport)) +
ggtitle("Average Percent Comparable Ridership Pre Pandemic by Year ") +
ylab("Average Comparable Riderships Percent") + xlab("Year") +
scale_fill_brewer(palette = "Dark2") +
geom_col(position = "dodge")
From the Estimated Average Ridership by Year plot, we can see that subways were the most popular mode of transportation from 2020 to 2024, increasing every year except from 2023 to 2024. Most of the modes of transportation increased over the past few years. Buses moved from third to second place between 2020 and 2021 and stayed in second. The least used mode of transportation is the Staten Island Railway.
For the Average Percent Comparable Ridership Pre Pandemic by Year plot, Access-A-Ride increased the most. Most of the other modes of transportation also increased; buses, however, appear to have dropped.
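To back up the rankings read off the graphs, the summarized table can be sorted directly (a sketch using ride5 from above):
# Modes ordered by estimated average ridership in the most recent year
ride5 %>%
  filter(year == 2024) %>%
  arrange(desc(mean_total_ride))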
The last dataset I chose was the Candy Hierarchy from 2017, a survey about people’s preferences for the candies they received on Halloween. The original dataset includes questions about whether or not the participant was going out for Halloween, along with gender, age, country, and state/province/county. It then lists many different candies, and participants chose how they felt about receiving each one on Halloween: MEH for indifference, JOY for happiness, and DESPAIR for candy they probably did not want.
candy <- read.csv("C:\\Users\\nakes\\OneDrive\\Desktop\\607\\candyhierarchy2017.csv", header = TRUE, check.names = F)
head(candy, 2)
## Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY
## 1 90258773
## 2 90272821 No Male 44 USA
## Q5: STATE, PROVINCE, COUNTY, ETC Q6 | 100 Grand Bar
## 1
## 2 NM MEH
## Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)
## 1
## 2 DESPAIR
## Q6 | Any full-sized candy bar Q6 | Black Jacks Q6 | Bonkers (the candy)
## 1
## 2 JOY MEH DESPAIR
## Q6 | Bonkers (the board game) Q6 | Bottle Caps Q6 | Box'o'Raisins
## 1
## 2 DESPAIR DESPAIR DESPAIR
## Q6 | Broken glow stick Q6 | Butterfinger Q6 | Cadbury Creme Eggs
## 1
## 2 DESPAIR DESPAIR MEH
## Q6 | Candy Corn
## 1
## 2 MEH
## Q6 | Candy that is clearly just the stuff given out for free at restaurants
## 1
## 2 DESPAIR
## Q6 | Caramellos Q6 | Cash, or other forms of legal tender Q6 | Chardonnay
## 1
## 2 MEH JOY MEH
## Q6 | Chick-o-Sticks (we don\xd5t know what that is) Q6 | Chiclets
## 1
## 2 DESPAIR DESPAIR
## Q6 | Coffee Crisp Q6 | Creepy Religious comics/Chick Tracts
## 1
## 2 DESPAIR DESPAIR
## Q6 | Dental paraphenalia Q6 | Dots Q6 | Dove Bars Q6 | Fuzzy Peaches
## 1
## 2 DESPAIR MEH JOY DESPAIR
## Q6 | Generic Brand Acetaminophen Q6 | Glow sticks Q6 | Goo Goo Clusters
## 1
## 2 DESPAIR DESPAIR DESPAIR
## Q6 | Good N' Plenty Q6 | Gum from baseball cards Q6 | Gummy Bears straight up
## 1
## 2 MEH DESPAIR MEH
## Q6 | Hard Candy Q6 | Healthy Fruit Q6 | Heath Bar
## 1
## 2 MEH DESPAIR MEH
## Q6 | Hershey's Dark Chocolate Q6 | Hershey\xd5s Milk Chocolate
## 1
## 2 JOY JOY
## Q6 | Hershey's Kisses Q6 | Hugs (actual physical hugs)
## 1
## 2 MEH DESPAIR
## Q6 | Jolly Rancher (bad flavor) Q6 | Jolly Ranchers (good flavor)
## 1
## 2 DESPAIR MEH
## Q6 | JoyJoy (Mit Iodine!) Q6 | Junior Mints Q6 | Senior Mints
## 1
## 2 DESPAIR DESPAIR DESPAIR
## Q6 | Kale smoothie Q6 | Kinder Happy Hippo Q6 | Kit Kat Q6 | LaffyTaffy
## 1
## 2 DESPAIR DESPAIR JOY DESPAIR
## Q6 | LemonHeads Q6 | Licorice (not black) Q6 | Licorice (yes black)
## 1
## 2 MEH MEH JOY
## Q6 | Lindt Truffle Q6 | Lollipops Q6 | Mars Q6 | Maynards Q6 | Mike and Ike
## 1
## 2 MEH DESPAIR DESPAIR DESPAIR MEH
## Q6 | Milk Duds Q6 | Milky Way Q6 | Regular M&Ms Q6 | Peanut M&M\xd5s
## 1
## 2 MEH JOY JOY MEH
## Q6 | Blue M&M's Q6 | Red M&M's Q6 | Green Party M&M's Q6 | Independent M&M's
## 1
## 2 JOY JOY JOY JOY
## Q6 | Abstained from M&M'ing. Q6 | Minibags of chips Q6 | Mint Kisses
## 1
## 2 DESPAIR DESPAIR MEH
## Q6 | Mint Juleps Q6 | Mr. Goodbar Q6 | Necco Wafers Q6 | Nerds
## 1
## 2 DESPAIR DESPAIR DESPAIR DESPAIR
## Q6 | Nestle Crunch Q6 | Now'n'Laters Q6 | Peeps Q6 | Pencils Q6 | Pixy Stix
## 1
## 2 JOY DESPAIR DESPAIR DESPAIR DESPAIR
## Q6 | Real Housewives of Orange County Season 9 Blue-Ray
## 1
## 2 DESPAIR
## Q6 | Reese\xd5s Peanut Butter Cups Q6 | Reese's Pieces
## 1
## 2 JOY JOY
## Q6 | Reggie Jackson Bar Q6 | Rolos
## 1
## 2 DESPAIR JOY
## Q6 | Sandwich-sized bags filled with BooBerry Crunch Q6 | Skittles
## 1
## 2 DESPAIR DESPAIR
## Q6 | Smarties (American) Q6 | Smarties (Commonwealth) Q6 | Snickers
## 1
## 2 DESPAIR DESPAIR MEH
## Q6 | Sourpatch Kids (i.e. abominations of nature) Q6 | Spotted Dick
## 1
## 2 DESPAIR DESPAIR
## Q6 | Starburst Q6 | Sweet Tarts Q6 | Swedish Fish
## 1
## 2 MEH DESPAIR MEH
## Q6 | Sweetums (a friend to diabetes) Q6 | Take 5 Q6 | Tic Tacs
## 1
## 2 DESPAIR DESPAIR DESPAIR
## Q6 | Those odd marshmallow circus peanut things Q6 | Three Musketeers
## 1
## 2 DESPAIR JOY
## Q6 | Tolberone something or other Q6 | Trail Mix Q6 | Twix
## 1
## 2 JOY DESPAIR JOY
## Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
## 1
## 2 DESPAIR
## Q6 | Vicodin Q6 | Whatchamacallit Bars Q6 | White Bread
## 1
## 2 DESPAIR DESPAIR DESPAIR
## Q6 | Whole Wheat anything Q6 | York Peppermint Patties Q7: JOY OTHER
## 1
## 2 DESPAIR DESPAIR Mounds
## Q8: DESPAIR OTHER Q9: OTHER COMMENTS
## 1
## 2 Bottom line is Twix is really the only candy worth eating.
## Q10: DRESS Q11: DAY Q12: MEDIA [Daily Dish] Q12: MEDIA [Science]
## 1 NA NA
## 2 White and gold Sunday NA 1
## Q12: MEDIA [ESPN] Q12: MEDIA [Yahoo] Click Coordinates (x, y)
## 1 NA NA
## 2 NA NA (84, 25)
Tidy Data
This dataset was a little difficult to work with because it was a survey not intended for research, where participants could input almost any response, and some of the “candies” were not candy at all.
I decided to pose my own question: how do people feel about popular chocolate bars and candies? To tidy the dataset, I first renamed all the chocolate bars I considered popular, then pivoted these candies longer into one variable since they were stored as separate variables. Then I filtered out missing responses, since participants had the option to skip questions. Pivoting the dataset longer lets us look at the ratings of each popular candy bar, to assess which is preferred over the others.
candy <- candy %>%
clean_names() %>%
rename(
"100 Grand Bar" = q6_100_grand_bar,
"Butterfinger" = q6_butterfinger,
"Heath Bar" = q6_heath_bar,
"Kit Kat" = q6_kit_kat,
"Nestle Crunch" = q6_nestle_crunch,
"Reeses Pieces" = q6_reeses_pieces,
"Snickers" = q6_snickers,
"Twix" = q6_twix,
"Milky Way" = q6_milky_way,
"Peanut M&MS"= q6_peanut_m_m_s,
"Hersheys Milk Chocolate" = q6_hershey_s_milk_chocolate)
## Warning in grepl(x = string, pattern = current_unicode, fixed = TRUE): input
## string 23 is invalid UTF-8
## (the same warning repeats for input strings 41, 64, and 82, once per renamed column)
candy2 <- candy %>%
  pivot_longer(cols = c('100 Grand Bar', 'Butterfinger', 'Heath Bar', 'Kit Kat',
                        'Nestle Crunch', 'Reeses Pieces', 'Snickers', 'Twix',
                        'Milky Way', 'Peanut M&MS', 'Hersheys Milk Chocolate'),
               names_to = "candy",
               values_to = "candy_rating") %>%
  select(internal_id, candy, candy_rating)
candy3 <- candy2 %>%
filter(candy_rating!="")
head(candy3, 10)
## # A tibble: 10 × 3
## internal_id candy candy_rating
## <int> <chr> <chr>
## 1 90272821 100 Grand Bar MEH
## 2 90272821 Butterfinger DESPAIR
## 3 90272821 Heath Bar MEH
## 4 90272821 Kit Kat JOY
## 5 90272821 Nestle Crunch JOY
## 6 90272821 Reeses Pieces JOY
## 7 90272821 Snickers MEH
## 8 90272821 Twix JOY
## 9 90272821 Milky Way JOY
## 10 90272821 Peanut M&MS MEH
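As a quick check that the filter left only the three survey options (a minimal sketch):
# Ratings should be limited to JOY, MEH, and DESPAIR after dropping blanks
unique(candy3$candy_rating)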
Analysis
For my analysis, I counted the candy_rating responses while grouping by candy bar and rating. Then I plotted the results to see how participants generally feel about popular candy bars.
candy5 <- candy3 %>%
  group_by(candy, candy_rating) %>%
  summarize(count_rating = n())   # a plain count; reframe() with table() produced a <table> column that ggplot cannot scale
ggplot(data=candy5, aes(x=candy, y=count_rating, fill = candy_rating)) +
ggtitle("Ratings of Popular Chocolate Candy and Bars in 2017 ") +
ylab("Number of Ratings") + xlab("Chocolate Candy/Bar") +
scale_fill_brewer(palette = "Set1") +
geom_col(position = "dodge") +
theme(axis.text.x = element_text(angle=60, hjust=1))
From my plot, we can see that for JOY, the most popular chocolate candy bar is Kit Kat, followed by Snickers and Twix. Hershey’s Milk Chocolate received the most DESPAIR ratings, and 100 Grand Bars received the most MEH ratings.
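The same ranking can be pulled out numerically (a sketch using candy5 from above):
# Top three candies by number of JOY ratings
candy5 %>%
  filter(candy_rating == "JOY") %>%
  arrange(desc(count_rating)) %>%
  head(3)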
In Project 2, three very different datasets were imported, tidied, cleaned, and analyzed. The first dataset was easier to tidy and clean than the last two. In the future, I would try a different kind of analysis, possibly looking at correlations.