The task is to choose three datasets from the week’s class discussion to practice transforming wide-format datasets into tidy formats suitable for downstream analysis.
The three datasets I choose are: 1)world GDP by country; 2)Japanese population density; 3)a made up datsets about takeout spending habit by generation. The following process steps will be performed on each of the datasets individually:
Create a .CSV file that includes all of the information included in the dataset.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform the analysis requested.
Code R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Dataset 1: World GDP by Country: 1960-2022
Import and load dataset. Original data has dimensions of 266 rows and 65 columns. After transformation to the long format, the new datasets has only 4 columns and 16758 rows. This long format is grouped in a way that’s easier for analysis, filtering and visualization. We are ready for the next step - analysis.
library(tidyr)library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)# create df to read the CSV file from my github repositorygdp_data <-read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/GDP.csv", stringsAsFactors =FALSE,check.names =FALSE) print("origional columns:")
'data.frame': 266 obs. of 65 variables:
$ Country : chr "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa Western and Central" ...
$ Country Code: chr "ABW" "AFE" "AFG" "AFW" ...
$ 1960 : num NA 2.11e+10 5.38e+08 1.04e+10 NA ...
$ 1961 : num NA 2.16e+10 5.49e+08 1.12e+10 NA ...
$ 1962 : num NA 2.35e+10 5.47e+08 1.20e+10 NA ...
$ 1963 : num NA 2.80e+10 7.51e+08 1.27e+10 NA ...
$ 1964 : num NA 2.59e+10 8.00e+08 1.39e+10 NA ...
$ 1965 : num NA 2.95e+10 1.01e+09 1.49e+10 NA ...
$ 1966 : num NA 3.20e+10 1.40e+09 1.59e+10 NA ...
$ 1967 : num NA 3.33e+10 1.67e+09 1.45e+10 NA ...
$ 1968 : num NA 3.63e+10 1.37e+09 1.50e+10 NA ...
$ 1969 : num NA 4.16e+10 1.41e+09 1.70e+10 NA ...
$ 1970 : num NA 4.46e+10 1.75e+09 2.36e+10 NA ...
$ 1971 : num NA 4.92e+10 1.83e+09 2.09e+10 NA ...
$ 1972 : num NA 5.31e+10 1.60e+09 2.54e+10 NA ...
$ 1973 : num NA 6.95e+10 1.73e+09 3.20e+10 NA ...
$ 1974 : num NA 8.54e+10 2.16e+09 4.44e+10 NA ...
$ 1975 : num NA 9.08e+10 2.37e+09 5.17e+10 NA ...
$ 1976 : num NA 9.02e+10 2.56e+09 6.24e+10 NA ...
$ 1977 : num NA 1.02e+11 2.95e+09 6.56e+10 NA ...
$ 1978 : num NA 1.16e+11 3.30e+09 7.15e+10 NA ...
$ 1979 : num NA 1.34e+11 3.70e+09 8.89e+10 NA ...
$ 1980 : num NA 1.71e+11 3.64e+09 1.12e+11 5.93e+09 ...
$ 1981 : num NA 1.76e+11 3.48e+09 2.11e+11 5.55e+09 ...
$ 1982 : num NA 1.68e+11 NA 1.87e+11 5.55e+09 ...
$ 1983 : num NA 1.76e+11 NA 1.38e+11 5.78e+09 ...
$ 1984 : num NA 1.61e+11 NA 1.15e+11 6.13e+09 ...
$ 1985 : num NA 1.37e+11 NA 1.17e+11 7.55e+09 ...
$ 1986 : num 4.06e+08 1.53e+11 NA 1.08e+11 7.07e+09 ...
$ 1987 : num 4.88e+08 1.87e+11 NA 1.11e+11 8.08e+09 ...
$ 1988 : num 5.97e+08 2.05e+11 NA 1.09e+11 8.77e+09 ...
$ 1989 : num 6.96e+08 2.18e+11 NA 1.02e+11 1.02e+10 ...
$ 1990 : num 7.65e+08 2.54e+11 NA 1.22e+11 1.12e+10 ...
$ 1991 : num 8.72e+08 2.77e+11 NA 1.18e+11 1.27e+10 ...
$ 1992 : num 9.59e+08 2.46e+11 NA 1.19e+11 1.51e+10 ...
$ 1993 : num 1.08e+09 2.43e+11 NA 9.93e+10 1.11e+10 ...
$ 1994 : num 1.25e+09 2.40e+11 NA 8.66e+10 3.39e+09 ...
$ 1995 : num 1.32e+09 2.70e+11 NA 1.09e+11 5.56e+09 ...
$ 1996 : num 1.38e+09 2.69e+11 NA 1.26e+11 7.53e+09 ...
$ 1997 : num 1.53e+09 2.83e+11 NA 1.28e+11 7.65e+09 ...
$ 1998 : num 1.67e+09 2.67e+11 NA 1.31e+11 6.51e+09 ...
$ 1999 : num 1.72e+09 2.63e+11 NA 1.38e+11 6.15e+09 ...
$ 2000 : num 1.87e+09 2.85e+11 NA 1.41e+11 9.13e+09 ...
$ 2001 : num 1.90e+09 2.60e+11 NA 1.49e+11 8.94e+09 ...
$ 2002 : num 1.96e+09 2.67e+11 3.85e+09 1.77e+11 1.53e+10 ...
$ 2003 : num 2.04e+09 3.54e+11 4.54e+09 2.05e+11 1.78e+10 ...
$ 2004 : num 2.25e+09 4.40e+11 5.22e+09 2.54e+11 2.36e+10 ...
$ 2005 : num 2.36e+09 5.14e+11 6.23e+09 3.11e+11 3.70e+10 ...
$ 2006 : num 2.47e+09 5.78e+11 6.97e+09 3.97e+11 5.24e+10 ...
$ 2007 : num 2.68e+09 6.63e+11 9.72e+09 4.65e+11 6.53e+10 ...
$ 2008 : num 2.84e+09 7.11e+11 1.02e+10 5.68e+11 8.85e+10 ...
$ 2009 : num 2.55e+09 7.22e+11 1.22e+10 5.08e+11 7.03e+10 ...
$ 2010 : num 2.45e+09 8.64e+11 1.56e+10 5.99e+11 8.17e+10 ...
$ 2011 : num 2.64e+09 9.68e+11 1.82e+10 6.82e+11 1.09e+11 ...
$ 2012 : num 2.62e+09 9.75e+11 2.02e+10 7.38e+11 1.25e+11 ...
$ 2013 : num 2.73e+09 9.86e+11 2.06e+10 8.34e+11 1.33e+11 ...
$ 2014 : num 2.79e+09 1.01e+12 2.06e+10 8.94e+11 1.37e+11 ...
$ 2015 : num 2.96e+09 9.27e+11 2.00e+10 7.69e+11 8.72e+10 ...
$ 2016 : num 2.98e+09 8.85e+11 1.80e+10 6.91e+11 4.98e+10 ...
$ 2017 : num 3.09e+09 1.02e+12 1.89e+10 6.85e+11 6.90e+10 ...
$ 2018 : num 3.28e+09 1.01e+12 1.84e+10 7.67e+11 7.78e+10 ...
$ 2019 : num 3.40e+09 1.00e+12 1.89e+10 8.23e+11 6.93e+10 ...
$ 2020 : num 2.61e+09 9.28e+11 2.01e+10 7.86e+11 5.02e+10 ...
$ 2021 : num 3.13e+09 1.08e+12 1.46e+10 8.44e+11 6.57e+10 ...
$ 2022 : num NA 1.17e+12 NA 8.78e+11 1.07e+11 ...
# Show first few rowsprint("First few rows of original data:")
[1] "First few rows of original data:"
print(head(gdp_data))
Country Country Code 1960 1961 1962
1 Aruba ABW NA NA NA
2 Africa Eastern and Southern AFE 21125015452 21616228139 23506279900
3 Afghanistan AFG 537777811 548888896 546666678
4 Africa Western and Central AFW 10447637853 11173212080 11990534018
5 Angola AGO NA NA NA
6 Albania ALB NA NA NA
1963 1964 1965 1966 1967 1968
1 NA NA NA NA NA NA
2 28048360188 25920665260 29472103270 32014368121 33269509510 36327785495
3 751111191 800000044 1006666638 1399999967 1673333418 1373333367
4 12727688165 13898109284 14929792388 15910837742 14510579889 14968235782
5 NA NA NA NA NA NA
6 NA NA NA NA NA NA
1969 1970 1971 1972 1973 1974
1 NA NA NA NA NA NA
2 41638967621 44629891649 49173371529 53123459912 69482723444 85380645042
3 1408888922 1748886596 1831108971 1595555476 1733333264 2155555498
4 16979315745 23596163865 20936358634 25386169423 31975594565 44416677335
5 NA NA NA NA NA NA
6 NA NA NA NA NA NA
1975 1976 1977 1978 1979 1980
1 NA NA NA NA NA NA
2 90835426418 90212747243 102240575583 116084638702 134256827127 171217790781
3 2366666616 2555555567 2953333418 3300000109 3697940410 3641723322
4 51667190242 62351622300 65595122956 71496496574 88948338390 112439126385
5 NA NA NA NA NA 5930503401
6 NA NA NA NA NA NA
1981 1982 1983 1984 1985 1986
1 NA NA NA NA NA 405586592
2 175859256874 168095657215 175564912386 160646748724 136759437910 153050335916
3 3478787909 NA NA NA NA NA
4 211338060015 187448724920 138384182007 114516348921 116776995133 107886511309
5 5550483036 5550483036 5784341596 6131475065 7554065410 7072536109
6 NA NA NA 1857338012 1897050133 2097326250
1987 1988 1989 1990 1991 1992
1 487709497 596648045 695530726 764804469 872067039 958659218
2 186658478814 204765985926 218241607366 254062093242 276856728336 246088124936
3 NA NA NA NA NA NA
4 110728825942 109438851254 102254998563 122387353859 118039698016 118893094122
5 8084412414 8769836769 10201780977 11229515599 12704558517 15114352005
6 2080796250 2051236250 2253090000 2028553750 1099559028 652174991
1993 1994 1995 1996 1997 1998
1 1083240223 1245810056 1320670391 1379888268 1531843575 1665363128
2 242926405780 239610677917 270327154575 269490833465 283446224788 266652333831
3 NA NA NA NA NA NA
4 99272180411 86636400266 108690885030 126287285163 127602388366 130678128885
5 11051939102 3390500000 5561222222 7526963964 7648377413 6506229607
6 1185315468 1880950858 2392764853 3199640826 2258513974 2545964541
1999 2000 2001 2002 2003 2004
1 1722905028 1873184358 1896648045 1962011173 2044134078 2254748603
2 263024788890 284759318603 259643121973 266529432166 354176768091 440481795991
3 NA NA NA 3854235264 4539496563 5220825049
4 138085971820 140945759314 148529518712 177201164643 205214466071 254264799899
5 6152922943 9129594819 8936079253 15285592487 17812704626 23552057820
6 3212121651 3480355258 3922100794 4348068242 5611496257 7184685782
2005 2006 2007 2008 2009 2010
1 2359776536 2469832402 2677653631 2843016760 2553631285 2453631285
2 513941625354 577586931667 662868036074 710536227817 721901168421 863519541418
3 6226198935 6971383339 9715765105 10249770319 12154835708 15633843662
4 310889578636 396921040009 465485534969 567791156729 508362731367 598521555566
5 36970901025 52381025105 65266415707 88538664888 70307196477 81699526881
6 8052077209 8896074996 10677324873 12881352878 12044205572 11926926654
2011 2012 2013 2014 2015 2016
1 2637988827 2615083799 2727932961 2.791061e+09 2963128492 2983798883
2 967824566959 975354777362 985987130727 1.006526e+12 927348469903 885176429224
3 18190413832 20203572960 20564485419 2.055058e+10 19998143636 18019554403
4 682015858792 737589473577 833948054767 8.943225e+11 768644740597 691363412188
5 109436566876 124998210652 133401582620 1.372444e+11 87219300385 49840491178
6 12890760273 12319834233 12776224537 1.322815e+10 11386853143 11861199831
2017 2018 2019 2020 2021 2022
1 3.092179e+09 3.276188e+09 3.395794e+09 2610038939 3.126019e+09 NA
2 1.021043e+12 1.007196e+12 1.000834e+12 927593321648 1.081998e+12 1.169484e+12
3 1.889635e+10 1.841886e+10 1.890450e+10 20143451706 1.458314e+10 NA
4 6.848988e+11 7.670257e+11 8.225384e+11 786460035395 8.444597e+11 8.778633e+11
5 6.897277e+10 7.779294e+10 6.930911e+10 50241368244 6.568544e+10 1.067136e+11
6 1.301973e+10 1.515642e+10 1.540183e+10 15162734205 1.793057e+10 1.888210e+10
# Transform from wide to long formatgdp_long <- gdp_data |>pivot_longer(cols =-c(Country, `Country Code`), names_to ="Year",values_to ="GDP" ) |>mutate(Year =as.integer(Year), GDP =as.numeric(GDP) ) |>arrange(Country, Year)# Show new datasetprint("Transformed data column names:")
[1] "Transformed data column names:"
print(names(gdp_long))
[1] "Country" "Country Code" "Year" "GDP"
print("Structure of transformed data:")
[1] "Structure of transformed data:"
str(gdp_long)
tibble [16,758 × 4] (S3: tbl_df/tbl/data.frame)
$ Country : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Country Code: chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
$ Year : int [1:16758] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
$ GDP : num [1:16758] 5.38e+08 5.49e+08 5.47e+08 7.51e+08 8.00e+08 ...
Min. 1st Qu. Median Mean 3rd Qu. Max.
1960 1975 1991 1991 2007 2022
Analysis and visualization. From the results shown below, The GDP dataset contains observations from 1960 to 2022. Since we GDP data listed for each year, we can do analysis on the GDP trend, such as find out which region has the top 10 growing GDP. I will also plot contry GDP but group them in regions to find out a a macro trend globally. In the end show plots to conclude analysis for this first dataset.
# top 10 economies in the most recent year (2022)top_economies_2022 <- gdp_long |>filter(Year ==2022&!is.na(GDP)) |>arrange(desc(GDP)) |>head(10) |>pull(Country)print("Top 10 economies in 2022:")
# VISUALIZATIONS# Visualization 1: Top 10 Economies Growth Over Time (1960-2022)top_gdp_data <- gdp_long |>filter(Country %in% top_economies_2022 &!is.na(GDP))ggplot(top_gdp_data, aes(x = Year, y = GDP/1e9, color = Country)) +geom_line(size =1) +scale_y_log10(labels = scales::comma) +labs(title ="GDP Growth of Top 10 Economies (1960-2022)",subtitle ="Log scale - Values in Billions USD",x ="Year", y ="GDP (Billions USD, log scale)",color ="Country") +theme_minimal() +theme(legend.position ="bottom",plot.title =element_text(hjust =0.5),plot.subtitle =element_text(hjust =0.5)) +scale_x_continuous(breaks =seq(1960, 2020, by =10))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Global GDP Trendglobal_trend <- gdp_long |>group_by(Year)|>summarise(World_GDP =sum(GDP, na.rm =TRUE) /1e9) # in unit billionsggplot(global_trend, aes(x = Year, y = World_GDP)) +geom_line(color ="darkblue", size =1.2) +geom_point(color ="darkblue", size =0.5) +labs(title ="Global GDP Trend (1960-2022)",subtitle ="Total World GDP in Billions USD",x ="Year",y ="World GDP (Billions USD)") +theme_minimal() +scale_x_continuous(breaks =seq(1960, 2020, by =5)) +scale_y_continuous(labels = scales::comma)
# Create a simple region classification and compare trendmajor_economies_list <-c("United States", "China", "Japan", "Germany", "United Kingdom","France", "India", "Brazil", "Canada", "Italy", "Russia")# Filter the data for these major economiesmajor_economies <- gdp_long |>filter(Country %in% major_economies_list &!is.na(GDP))# Create region classificationmajor_economies <- major_economies |>mutate(Region =case_when( Country %in%c("United States", "Canada") ~"North America", Country %in%c("Germany", "United Kingdom", "France", "Italy") ~"Europe", Country =="China"~"China", Country =="India"~"India", Country =="Japan"~"Japan", Country =="Brazil"~"Brazil", Country =="Russia"~"Russia",TRUE~"Other" ))# Plot regional comparisonggplot(major_economies, aes(x = Year, y = GDP/1e9, color = Region, group = Country)) +geom_line(size =1) +facet_wrap(~Region, scales ="free_y") +labs(title ="GDP Trends by Major Economy",subtitle ="Selected major economies grouped by region",x ="Year",y ="GDP (Billions USD)") +theme_minimal() +theme(legend.position ="none") +scale_x_continuous(breaks =seq(1960, 2020, by =20))
Dataset 2: Japanese 2015 Population Density
This is a dataset about Japanese population in different cities. City names were spread out in multiple columns, total 47 in the original dataset. Goal is transform dataset to long format, new column city_name will be created to group those cities in one single column, population density will be grouped under column name population_density. We will check and compare columns between original dataset with the transformed dataset in the following code chunk:
library(tidyr)library(dplyr)# Read the population density CSV file from my github repositorypopulation_data <-read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/prefecturepopdensity2015.csv", stringsAsFactors =FALSE,check.names =FALSE,header =TRUE) # original column namesprint("Original columns in population dataset:")
# structure print("Structure of original population data:")
[1] "Structure of original population data:"
str(population_data)
'data.frame': 1 obs. of 47 variables:
$ Aichi : num 1447
$ Akita : num 87.9
$ Aomori : num 136
$ Chiba : num 1207
$ Ehime : num 244
$ Fukui : num 188
$ Fukuoka : num 1023
$ Fukushima: num 139
$ Gifu : num 191
$ Gumma : num 310
$ Hiroshima: num 336
$ Hokkaido : num 68.6
$ Hyogo : num 659
$ Ibaraki : num 479
$ Ishikawa : num 276
$ Iwate : num 83.8
$ Kagawa : num 520
$ Kagoshima: num 179
$ Kanagawa : num 3778
$ Kochi : num 102
$ Kumamoto : num 241
$ Kyoto : num 566
$ Mie : num 314
$ Miyagi : num 320
$ Miyazaki : num 143
$ Nagano : num 155
$ Nagasaki : num 333
$ Nara : num 370
$ Niigata : num 183
$ Oita : int 184
$ Okayama : num 270
$ Okinawa : num 629
$ Osaka : num 4640
$ Saga : num 341
$ Saitama : int 1912
$ Shiga : num 352
$ Shimane : num 104
$ Shizuoka : num 476
$ Tochigi : num 308
$ Tokushima: num 182
$ Tokyo : num 6168
$ Tottori : num 164
$ Toyama : num 251
$ Wakayama : int 204
$ Yamagata : num 120
$ Yamaguchi: num 230
$ Yamanashi: int 187
# Show first few rows of original dataprint(head(population_data))
print(paste("Number of prefectures:", length(unique(population_long$city_name))))
[1] "Number of prefectures: 47"
After getting the long format, we are ready to do population analysis and visualization. As shown below, top cities that has most dense population are: Tokyo, Osaka, Kanagawa… On the other hand, lowest cities with population density are Hokkaido, Iwate, Akita… To show how population were largely concentrate on the top 3 cities, we can see the last visualization that those 3 cities are far more dense compare to other cities.
top_dense <- population_long |>arrange(desc(population_density)) |>head(10)bottom_dense <- population_long |>arrange(population_density) |>head(10)# Bar chart of top 10 most dense prefecturesggplot(top_dense, aes(x =reorder(city_name, population_density), y = population_density)) +geom_bar(stat ="identity", fill ="steelblue") +coord_flip() +labs(title ="Top 10 Most Densely Populated Japanese Prefectures",subtitle ="Population Density (persons/km²) - 2015",x ="Prefecture",y ="Population Density (persons/km²)") +theme_minimal() +geom_text(aes(label =round(population_density, 0)), hjust =-0.1, size =3)
# Bar chart of bottom 10 least dense prefecturesggplot(bottom_dense, aes(x =reorder(city_name, desc(population_density)), y = population_density)) +geom_bar(stat ="identity", fill ="coral") +coord_flip() +labs(title ="Top 10 Least Densely Populated Japanese Prefectures",subtitle ="Population Density (persons/km²) - 2015",x ="Prefecture",y ="Population Density (persons/km²)") +theme_minimal() +geom_text(aes(label =round(population_density, 0)), hjust =-0.1, size =3)
# Dot plot of all prefecturesggplot(population_long, aes(x =reorder(city_name, population_density), y = population_density)) +geom_point(size =3, color ="darkblue") +coord_flip() +labs(title ="Population Density of All Japanese Prefectures",subtitle ="Ordered from lowest to highest density",x ="Prefecture",y ="Population Density (persons/km²)") +theme_minimal() +geom_hline(yintercept =mean(population_long$population_density), linetype ="dashed", color ="red", size =0.5) +annotate("text", x =5, y =mean(population_long$population_density) +100, label ="National Average", color ="red", size =3)
Dataset 3: takeout spending habit by generation.
The original takeout spending dataset contained 12 rows and 7 columns, with 5 generations: Gen Z, Millennials, Gen X, Baby Boomers, Silent Gen. and their monthly spending columns from January to June. After transformation, the dataset was reshaped into long format with 30 rows and 3 columns: generation, month and spending. This transformed datasets is in tidy structure are are ready for data analysis and visualization.
# Read the takeout spending CSV filetakeout_data <-read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/takeoutspending.csv", stringsAsFactors =FALSE,check.names =FALSE)# Print original columns print(names(takeout_data))
# Show structure of original datastr(takeout_data)
'data.frame': 12 obs. of 7 variables:
$ generation: chr "Gen Z" "Millenials" "Gen X" "Baby Boomers" ...
$ Jan : int 185 240 195 120 75 NA NA NA NA NA ...
$ Feb : int 172 228 188 115 70 NA NA NA NA NA ...
$ Mar : int 198 255 205 130 82 NA NA NA NA NA ...
$ Apr : int 210 270 215 140 88 NA NA NA NA NA ...
$ May : int 205 265 210 135 85 NA NA NA NA NA ...
$ Jun : int 225 285 220 145 90 NA NA NA NA NA ...
# Show first few rows of original dataprint("First few rows of original takeout data:")
[1] "First few rows of original takeout data:"
print(head(takeout_data))
generation Jan Feb Mar Apr May Jun
1 Gen Z 185 172 198 210 205 225
2 Millenials 240 228 255 270 265 285
3 Gen X 195 188 205 215 210 220
4 Baby Boomers 120 115 130 140 135 145
5 Silent Gen 75 70 82 88 85 90
6 NA NA NA NA NA NA
# Clean the data first - remove empty rowstakeout_data_clean <- takeout_data |>filter(!is.na(generation) & generation !="") |># Remove rows with empty generationselect(1:7) # Transform from wide to long takeout_long <- takeout_data_clean |>pivot_longer(cols =-generation, # All month columns except generationnames_to ="month",values_to ="spending" ) |>mutate(month =as.character(month), # month in characterspending =as.numeric(spending), # spending is numericrmonth =factor(month, levels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun"), ordered =TRUE) ) |>arrange(generation, month) # Show transformed dataset resultsprint(names(takeout_long))
# A tibble: 30 × 3
generation month spending
<chr> <ord> <dbl>
1 Baby Boomers Jan 120
2 Baby Boomers Feb 115
3 Baby Boomers Mar 130
4 Baby Boomers Apr 140
5 Baby Boomers May 135
6 Baby Boomers Jun 145
7 Gen X Jan 195
8 Gen X Feb 188
9 Gen X Mar 205
10 Gen X Apr 215
# ℹ 20 more rows
print(paste("Total rows:", nrow(takeout_long)))
[1] "Total rows: 30"
print(paste("Number of generations:", length(unique(takeout_long$generation))))
# A tibble: 6 × 2
month avg_spending
<ord> <dbl>
1 Jan 163
2 Feb 155.
3 Mar 174
4 Apr 185.
5 May 180
6 Jun 193
Data analysis and visualization to identify which generation spends the most on takeout per month and to find out monthly spending trends across different generational groups.
The results shows that Millennials lead in takeout spending, with the highest average monthly expenditures. They are closely followed by Gen Z and Gen X, whose spending levels are very similar to each other.
Regarding monthly trends, the analysis shows a upward pattern from January to June across all generations, indicating that takeout spending generally increases as the year progresses from winter to summer months. This seasonal trend, combined with generational differences, could be valuable infomation to study consumer behavior and for company to potential targeting specific generation to promote their food products.
generation_summary <- takeout_long |>group_by(generation) |>summarise(Avg_Monthly =mean(spending, na.rm =TRUE),.groups ='drop' ) |>arrange(desc(Avg_Monthly))# Bar chart - Average monthly spending by generationggplot(generation_summary, aes(x =reorder(generation, Avg_Monthly), y = Avg_Monthly, fill = generation)) +geom_bar(stat ="identity", width =0.7) +coord_flip() +labs(title ="Average Monthly Takeout Spending by Generation",subtitle ="Based on Jan-Jun 2015 data",x ="Generation",y ="Average Monthly Spending ($)") +theme_minimal() +theme(legend.position ="none") +geom_text(aes(label =paste0("$", round(Avg_Monthly, 0))), hjust =-0.2, size =4) +scale_fill_brewer(palette ="Set2") +ylim(0, max(generation_summary$Avg_Monthly) *1.1)
# Line chart - Spending trends over months by generationggplot(takeout_long, aes(x = month, y = spending, group = generation, color = generation)) +geom_line(size =1.2) +geom_point(size =3) +labs(title ="Takeout Spending Trends by Generation (Jan-Jun 2015)",subtitle ="Monthly spending patterns across generations",x ="Month",y ="Spending ($)",color ="Generation") +theme_minimal() +theme(legend.position ="bottom") +scale_color_brewer(palette ="Set1")
#Comparison across monthsggplot(takeout_long, aes(x = month, y = spending, fill = generation)) +geom_bar(stat ="identity", position ="dodge", width =0.7) +labs(title ="Monthly Takeout Spending by Generation",subtitle ="Side-by-side comparison",x ="Month",y ="Spending ($)",fill ="Generation") +theme_minimal() +theme(legend.position ="bottom") +scale_fill_brewer(palette ="Set2")
Summary
This data analysis project focused on transforming three distinct datasets (global GDP, Japanese prefecture population density, and generational takeout spending) from wide format to long format to enable more effective analysis and visualization. Wide format has data of same category across multiple columns, making comparative analysis more difficult. The transformed data made filtering easier, enhanced grouping and aggregation when doing analysis. This data transformation to get tidy data is essential part for complex data analysis.