P2data607

Author

xiaofei

Introduction

The task is to choose three datasets from the week’s class discussion to practice transforming wide-format datasets into tidy formats suitable for downstream analysis.

The three datasets I choose are: 1)world GDP by country; 2)Japanese population density; 3)a made up datsets about takeout spending habit by generation. The following process steps will be performed on each of the datasets individually:

Create a .CSV file that includes all of the information included in the dataset.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform the analysis requested.
Code R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset 1: World GDP by Country: 1960-2022

Import and load dataset. Original data has dimensions of 266 rows and 65 columns. After transformation to the long format, the new datasets has only 4 columns and 16758 rows. This long format is grouped in a way that’s easier for analysis, filtering and visualization. We are ready for the next step - analysis.

library(tidyr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

# create df to read the CSV file from my github repository
gdp_data <- read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/GDP.csv", 
                     stringsAsFactors = FALSE,
                     check.names = FALSE)   

print("origional columns:")

[1] "origional columns:"

print(names(gdp_data))

 [1] "Country"      "Country Code" "1960"         "1961"         "1962"        
 [6] "1963"         "1964"         "1965"         "1966"         "1967"        
[11] "1968"         "1969"         "1970"         "1971"         "1972"        
[16] "1973"         "1974"         "1975"         "1976"         "1977"        
[21] "1978"         "1979"         "1980"         "1981"         "1982"        
[26] "1983"         "1984"         "1985"         "1986"         "1987"        
[31] "1988"         "1989"         "1990"         "1991"         "1992"        
[36] "1993"         "1994"         "1995"         "1996"         "1997"        
[41] "1998"         "1999"         "2000"         "2001"         "2002"        
[46] "2003"         "2004"         "2005"         "2006"         "2007"        
[51] "2008"         "2009"         "2010"         "2011"         "2012"        
[56] "2013"         "2014"         "2015"         "2016"         "2017"        
[61] "2018"         "2019"         "2020"         "2021"         "2022"

print("Structure of original data:")

[1] "Structure of original data:"

str(gdp_data)

'data.frame':   266 obs. of  65 variables:
 $ Country     : chr  "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa Western and Central" ...
 $ Country Code: chr  "ABW" "AFE" "AFG" "AFW" ...
 $ 1960        : num  NA 2.11e+10 5.38e+08 1.04e+10 NA ...
 $ 1961        : num  NA 2.16e+10 5.49e+08 1.12e+10 NA ...
 $ 1962        : num  NA 2.35e+10 5.47e+08 1.20e+10 NA ...
 $ 1963        : num  NA 2.80e+10 7.51e+08 1.27e+10 NA ...
 $ 1964        : num  NA 2.59e+10 8.00e+08 1.39e+10 NA ...
 $ 1965        : num  NA 2.95e+10 1.01e+09 1.49e+10 NA ...
 $ 1966        : num  NA 3.20e+10 1.40e+09 1.59e+10 NA ...
 $ 1967        : num  NA 3.33e+10 1.67e+09 1.45e+10 NA ...
 $ 1968        : num  NA 3.63e+10 1.37e+09 1.50e+10 NA ...
 $ 1969        : num  NA 4.16e+10 1.41e+09 1.70e+10 NA ...
 $ 1970        : num  NA 4.46e+10 1.75e+09 2.36e+10 NA ...
 $ 1971        : num  NA 4.92e+10 1.83e+09 2.09e+10 NA ...
 $ 1972        : num  NA 5.31e+10 1.60e+09 2.54e+10 NA ...
 $ 1973        : num  NA 6.95e+10 1.73e+09 3.20e+10 NA ...
 $ 1974        : num  NA 8.54e+10 2.16e+09 4.44e+10 NA ...
 $ 1975        : num  NA 9.08e+10 2.37e+09 5.17e+10 NA ...
 $ 1976        : num  NA 9.02e+10 2.56e+09 6.24e+10 NA ...
 $ 1977        : num  NA 1.02e+11 2.95e+09 6.56e+10 NA ...
 $ 1978        : num  NA 1.16e+11 3.30e+09 7.15e+10 NA ...
 $ 1979        : num  NA 1.34e+11 3.70e+09 8.89e+10 NA ...
 $ 1980        : num  NA 1.71e+11 3.64e+09 1.12e+11 5.93e+09 ...
 $ 1981        : num  NA 1.76e+11 3.48e+09 2.11e+11 5.55e+09 ...
 $ 1982        : num  NA 1.68e+11 NA 1.87e+11 5.55e+09 ...
 $ 1983        : num  NA 1.76e+11 NA 1.38e+11 5.78e+09 ...
 $ 1984        : num  NA 1.61e+11 NA 1.15e+11 6.13e+09 ...
 $ 1985        : num  NA 1.37e+11 NA 1.17e+11 7.55e+09 ...
 $ 1986        : num  4.06e+08 1.53e+11 NA 1.08e+11 7.07e+09 ...
 $ 1987        : num  4.88e+08 1.87e+11 NA 1.11e+11 8.08e+09 ...
 $ 1988        : num  5.97e+08 2.05e+11 NA 1.09e+11 8.77e+09 ...
 $ 1989        : num  6.96e+08 2.18e+11 NA 1.02e+11 1.02e+10 ...
 $ 1990        : num  7.65e+08 2.54e+11 NA 1.22e+11 1.12e+10 ...
 $ 1991        : num  8.72e+08 2.77e+11 NA 1.18e+11 1.27e+10 ...
 $ 1992        : num  9.59e+08 2.46e+11 NA 1.19e+11 1.51e+10 ...
 $ 1993        : num  1.08e+09 2.43e+11 NA 9.93e+10 1.11e+10 ...
 $ 1994        : num  1.25e+09 2.40e+11 NA 8.66e+10 3.39e+09 ...
 $ 1995        : num  1.32e+09 2.70e+11 NA 1.09e+11 5.56e+09 ...
 $ 1996        : num  1.38e+09 2.69e+11 NA 1.26e+11 7.53e+09 ...
 $ 1997        : num  1.53e+09 2.83e+11 NA 1.28e+11 7.65e+09 ...
 $ 1998        : num  1.67e+09 2.67e+11 NA 1.31e+11 6.51e+09 ...
 $ 1999        : num  1.72e+09 2.63e+11 NA 1.38e+11 6.15e+09 ...
 $ 2000        : num  1.87e+09 2.85e+11 NA 1.41e+11 9.13e+09 ...
 $ 2001        : num  1.90e+09 2.60e+11 NA 1.49e+11 8.94e+09 ...
 $ 2002        : num  1.96e+09 2.67e+11 3.85e+09 1.77e+11 1.53e+10 ...
 $ 2003        : num  2.04e+09 3.54e+11 4.54e+09 2.05e+11 1.78e+10 ...
 $ 2004        : num  2.25e+09 4.40e+11 5.22e+09 2.54e+11 2.36e+10 ...
 $ 2005        : num  2.36e+09 5.14e+11 6.23e+09 3.11e+11 3.70e+10 ...
 $ 2006        : num  2.47e+09 5.78e+11 6.97e+09 3.97e+11 5.24e+10 ...
 $ 2007        : num  2.68e+09 6.63e+11 9.72e+09 4.65e+11 6.53e+10 ...
 $ 2008        : num  2.84e+09 7.11e+11 1.02e+10 5.68e+11 8.85e+10 ...
 $ 2009        : num  2.55e+09 7.22e+11 1.22e+10 5.08e+11 7.03e+10 ...
 $ 2010        : num  2.45e+09 8.64e+11 1.56e+10 5.99e+11 8.17e+10 ...
 $ 2011        : num  2.64e+09 9.68e+11 1.82e+10 6.82e+11 1.09e+11 ...
 $ 2012        : num  2.62e+09 9.75e+11 2.02e+10 7.38e+11 1.25e+11 ...
 $ 2013        : num  2.73e+09 9.86e+11 2.06e+10 8.34e+11 1.33e+11 ...
 $ 2014        : num  2.79e+09 1.01e+12 2.06e+10 8.94e+11 1.37e+11 ...
 $ 2015        : num  2.96e+09 9.27e+11 2.00e+10 7.69e+11 8.72e+10 ...
 $ 2016        : num  2.98e+09 8.85e+11 1.80e+10 6.91e+11 4.98e+10 ...
 $ 2017        : num  3.09e+09 1.02e+12 1.89e+10 6.85e+11 6.90e+10 ...
 $ 2018        : num  3.28e+09 1.01e+12 1.84e+10 7.67e+11 7.78e+10 ...
 $ 2019        : num  3.40e+09 1.00e+12 1.89e+10 8.23e+11 6.93e+10 ...
 $ 2020        : num  2.61e+09 9.28e+11 2.01e+10 7.86e+11 5.02e+10 ...
 $ 2021        : num  3.13e+09 1.08e+12 1.46e+10 8.44e+11 6.57e+10 ...
 $ 2022        : num  NA 1.17e+12 NA 8.78e+11 1.07e+11 ...

# Show first few rows
print("First few rows of original data:")

[1] "First few rows of original data:"

print(head(gdp_data))

                      Country Country Code        1960        1961        1962
1                       Aruba          ABW          NA          NA          NA
2 Africa Eastern and Southern          AFE 21125015452 21616228139 23506279900
3                 Afghanistan          AFG   537777811   548888896   546666678
4  Africa Western and Central          AFW 10447637853 11173212080 11990534018
5                      Angola          AGO          NA          NA          NA
6                     Albania          ALB          NA          NA          NA
         1963        1964        1965        1966        1967        1968
1          NA          NA          NA          NA          NA          NA
2 28048360188 25920665260 29472103270 32014368121 33269509510 36327785495
3   751111191   800000044  1006666638  1399999967  1673333418  1373333367
4 12727688165 13898109284 14929792388 15910837742 14510579889 14968235782
5          NA          NA          NA          NA          NA          NA
6          NA          NA          NA          NA          NA          NA
         1969        1970        1971        1972        1973        1974
1          NA          NA          NA          NA          NA          NA
2 41638967621 44629891649 49173371529 53123459912 69482723444 85380645042
3  1408888922  1748886596  1831108971  1595555476  1733333264  2155555498
4 16979315745 23596163865 20936358634 25386169423 31975594565 44416677335
5          NA          NA          NA          NA          NA          NA
6          NA          NA          NA          NA          NA          NA
         1975        1976         1977         1978         1979         1980
1          NA          NA           NA           NA           NA           NA
2 90835426418 90212747243 102240575583 116084638702 134256827127 171217790781
3  2366666616  2555555567   2953333418   3300000109   3697940410   3641723322
4 51667190242 62351622300  65595122956  71496496574  88948338390 112439126385
5          NA          NA           NA           NA           NA   5930503401
6          NA          NA           NA           NA           NA           NA
          1981         1982         1983         1984         1985         1986
1           NA           NA           NA           NA           NA    405586592
2 175859256874 168095657215 175564912386 160646748724 136759437910 153050335916
3   3478787909           NA           NA           NA           NA           NA
4 211338060015 187448724920 138384182007 114516348921 116776995133 107886511309
5   5550483036   5550483036   5784341596   6131475065   7554065410   7072536109
6           NA           NA           NA   1857338012   1897050133   2097326250
          1987         1988         1989         1990         1991         1992
1    487709497    596648045    695530726    764804469    872067039    958659218
2 186658478814 204765985926 218241607366 254062093242 276856728336 246088124936
3           NA           NA           NA           NA           NA           NA
4 110728825942 109438851254 102254998563 122387353859 118039698016 118893094122
5   8084412414   8769836769  10201780977  11229515599  12704558517  15114352005
6   2080796250   2051236250   2253090000   2028553750   1099559028    652174991
          1993         1994         1995         1996         1997         1998
1   1083240223   1245810056   1320670391   1379888268   1531843575   1665363128
2 242926405780 239610677917 270327154575 269490833465 283446224788 266652333831
3           NA           NA           NA           NA           NA           NA
4  99272180411  86636400266 108690885030 126287285163 127602388366 130678128885
5  11051939102   3390500000   5561222222   7526963964   7648377413   6506229607
6   1185315468   1880950858   2392764853   3199640826   2258513974   2545964541
          1999         2000         2001         2002         2003         2004
1   1722905028   1873184358   1896648045   1962011173   2044134078   2254748603
2 263024788890 284759318603 259643121973 266529432166 354176768091 440481795991
3           NA           NA           NA   3854235264   4539496563   5220825049
4 138085971820 140945759314 148529518712 177201164643 205214466071 254264799899
5   6152922943   9129594819   8936079253  15285592487  17812704626  23552057820
6   3212121651   3480355258   3922100794   4348068242   5611496257   7184685782
          2005         2006         2007         2008         2009         2010
1   2359776536   2469832402   2677653631   2843016760   2553631285   2453631285
2 513941625354 577586931667 662868036074 710536227817 721901168421 863519541418
3   6226198935   6971383339   9715765105  10249770319  12154835708  15633843662
4 310889578636 396921040009 465485534969 567791156729 508362731367 598521555566
5  36970901025  52381025105  65266415707  88538664888  70307196477  81699526881
6   8052077209   8896074996  10677324873  12881352878  12044205572  11926926654
          2011         2012         2013         2014         2015         2016
1   2637988827   2615083799   2727932961 2.791061e+09   2963128492   2983798883
2 967824566959 975354777362 985987130727 1.006526e+12 927348469903 885176429224
3  18190413832  20203572960  20564485419 2.055058e+10  19998143636  18019554403
4 682015858792 737589473577 833948054767 8.943225e+11 768644740597 691363412188
5 109436566876 124998210652 133401582620 1.372444e+11  87219300385  49840491178
6  12890760273  12319834233  12776224537 1.322815e+10  11386853143  11861199831
          2017         2018         2019         2020         2021         2022
1 3.092179e+09 3.276188e+09 3.395794e+09   2610038939 3.126019e+09           NA
2 1.021043e+12 1.007196e+12 1.000834e+12 927593321648 1.081998e+12 1.169484e+12
3 1.889635e+10 1.841886e+10 1.890450e+10  20143451706 1.458314e+10           NA
4 6.848988e+11 7.670257e+11 8.225384e+11 786460035395 8.444597e+11 8.778633e+11
5 6.897277e+10 7.779294e+10 6.930911e+10  50241368244 6.568544e+10 1.067136e+11
6 1.301973e+10 1.515642e+10 1.540183e+10  15162734205 1.793057e+10 1.888210e+10

# Transform from wide to long format
gdp_long <- gdp_data |>
  pivot_longer(
    cols = -c(Country, `Country Code`),  
    names_to = "Year",
    values_to = "GDP"
  ) |>
  mutate(
    Year = as.integer(Year),   
    GDP = as.numeric(GDP)      
  ) |>
  arrange(Country, Year)

# Show new dataset
print("Transformed data column names:")

[1] "Transformed data column names:"

print(names(gdp_long))

[1] "Country"      "Country Code" "Year"         "GDP"

print("Structure of transformed data:")

[1] "Structure of transformed data:"

str(gdp_long)

tibble [16,758 × 4] (S3: tbl_df/tbl/data.frame)
 $ Country     : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Country Code: chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
 $ Year        : int [1:16758] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
 $ GDP         : num [1:16758] 5.38e+08 5.49e+08 5.47e+08 7.51e+08 8.00e+08 ...

print("First 20 rows of transformed dataset:")

[1] "First 20 rows of transformed dataset:"

print(head(gdp_long, 20))

# A tibble: 20 × 4
   Country     `Country Code`  Year         GDP
   <chr>       <chr>          <int>       <dbl>
 1 Afghanistan AFG             1960  537777811.
 2 Afghanistan AFG             1961  548888896.
 3 Afghanistan AFG             1962  546666678.
 4 Afghanistan AFG             1963  751111191.
 5 Afghanistan AFG             1964  800000044.
 6 Afghanistan AFG             1965 1006666638 
 7 Afghanistan AFG             1966 1399999967 
 8 Afghanistan AFG             1967 1673333418 
 9 Afghanistan AFG             1968 1373333367 
10 Afghanistan AFG             1969 1408888922 
11 Afghanistan AFG             1970 1748886596 
12 Afghanistan AFG             1971 1831108971 
13 Afghanistan AFG             1972 1595555476 
14 Afghanistan AFG             1973 1733333264 
15 Afghanistan AFG             1974 2155555498 
16 Afghanistan AFG             1975 2366666616 
17 Afghanistan AFG             1976 2555555567 
18 Afghanistan AFG             1977 2953333418 
19 Afghanistan AFG             1978 3300000109 
20 Afghanistan AFG             1979 3697940410

print("Summary of Year column:")

[1] "Summary of Year column:"

print(summary(gdp_long$Year))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1960    1975    1991    1991    2007    2022

Analysis and visualization. From the results shown below, The GDP dataset contains observations from 1960 to 2022. Since we GDP data listed for each year, we can do analysis on the GDP trend, such as find out which region has the top 10 growing GDP. I will also plot contry GDP but group them in regions to find out a a macro trend globally. In the end show plots to conclude analysis for this first dataset.

# 1. BASIC SUMMARY STATISTICS
print(paste("Total countries:", length(unique(gdp_long$Country))))

[1] "Total countries: 266"

print(paste("Time period:", min(gdp_long$Year, na.rm = TRUE), "to", max(gdp_long$Year, na.rm = TRUE)))

[1] "Time period: 1960 to 2022"

print(paste("Total observations:", nrow(gdp_long)))

[1] "Total observations: 16758"

# top 10 economies in the most recent year (2022)
top_economies_2022 <- gdp_long |>
  filter(Year == 2022 & !is.na(GDP)) |>
  arrange(desc(GDP)) |>
  head(10) |>
  pull(Country)

print("Top 10 economies in 2022:")

[1] "Top 10 economies in 2022:"

print(top_economies_2022)

 [1] "World"                     "High income"              
 [3] "OECD members"              "Post-demographic dividend"
 [5] "IDA & IBRD total"          "Low & middle income"      
 [7] "Middle income"             "IBRD only"                
 [9] "East Asia & Pacific"       "Upper middle income"

# VISUALIZATIONS

# Visualization 1: Top 10 Economies Growth Over Time (1960-2022)
top_gdp_data <- gdp_long |>
  filter(Country %in% top_economies_2022 & !is.na(GDP))

ggplot(top_gdp_data, aes(x = Year, y = GDP/1e9, color = Country)) +
  geom_line(size = 1) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "GDP Growth of Top 10 Economies (1960-2022)",
       subtitle = "Log scale - Values in Billions USD",
       x = "Year", 
       y = "GDP (Billions USD, log scale)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(1960, 2020, by = 10))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

# Global GDP Trend
global_trend <- gdp_long |>
  group_by(Year)|>
  summarise(World_GDP = sum(GDP, na.rm = TRUE) / 1e9)  # in unit billions

ggplot(global_trend, aes(x = Year, y = World_GDP)) +
  geom_line(color = "darkblue", size = 1.2) +
  geom_point(color = "darkblue", size = 0.5) +
  labs(title = "Global GDP Trend (1960-2022)",
       subtitle = "Total World GDP in Billions USD",
       x = "Year",
       y = "World GDP (Billions USD)") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1960, 2020, by = 5)) +
  scale_y_continuous(labels = scales::comma)

# Create a simple region classification and compare trend
major_economies_list <- c("United States", "China", "Japan", "Germany", "United Kingdom","France", "India", "Brazil", "Canada", "Italy", "Russia")

# Filter the data for these major economies
major_economies <- gdp_long |>
  filter(Country %in% major_economies_list & !is.na(GDP))

# Create region classification
major_economies <- major_economies |>
  mutate(Region = case_when(
    Country %in% c("United States", "Canada") ~ "North America",
    Country %in% c("Germany", "United Kingdom", "France", "Italy") ~ "Europe",
    Country == "China" ~ "China",
    Country == "India" ~ "India",
    Country == "Japan" ~ "Japan",
    Country == "Brazil" ~ "Brazil",
    Country == "Russia" ~ "Russia",
    TRUE ~ "Other"
  ))

# Plot regional comparison
ggplot(major_economies, aes(x = Year, y = GDP/1e9, color = Region, group = Country)) +
  geom_line(size = 1) +
  facet_wrap(~Region, scales = "free_y") +
  labs(title = "GDP Trends by Major Economy",
       subtitle = "Selected major economies grouped by region",
       x = "Year",
       y = "GDP (Billions USD)") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1960, 2020, by = 20))

Dataset 2: Japanese 2015 Population Density

This is a dataset about Japanese population in different cities. City names were spread out in multiple columns, total 47 in the original dataset. Goal is transform dataset to long format, new column city_name will be created to group those cities in one single column, population density will be grouped under column name population_density. We will check and compare columns between original dataset with the transformed dataset in the following code chunk:

library(tidyr)
library(dplyr)

# Read the  population density CSV file from my github repository
population_data <- read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/prefecturepopdensity2015.csv", 
                            stringsAsFactors = FALSE,
                            check.names = FALSE,
                            header = TRUE)  

# original column names
print("Original columns in population dataset:")

[1] "Original columns in population dataset:"

print(names(population_data))

 [1] "Aichi"     "Akita"     "Aomori"    "Chiba"     "Ehime"     "Fukui"    
 [7] "Fukuoka"   "Fukushima" "Gifu"      "Gumma"     "Hiroshima" "Hokkaido" 
[13] "Hyogo"     "Ibaraki"   "Ishikawa"  "Iwate"     "Kagawa"    "Kagoshima"
[19] "Kanagawa"  "Kochi"     "Kumamoto"  "Kyoto"     "Mie"       "Miyagi"   
[25] "Miyazaki"  "Nagano"    "Nagasaki"  "Nara"      "Niigata"   "Oita"     
[31] "Okayama"   "Okinawa"   "Osaka"     "Saga"      "Saitama"   "Shiga"    
[37] "Shimane"   "Shizuoka"  "Tochigi"   "Tokushima" "Tokyo"     "Tottori"  
[43] "Toyama"    "Wakayama"  "Yamagata"  "Yamaguchi" "Yamanashi"

# structure 
print("Structure of original population data:")

[1] "Structure of original population data:"

str(population_data)

'data.frame':   1 obs. of  47 variables:
 $ Aichi    : num 1447
 $ Akita    : num 87.9
 $ Aomori   : num 136
 $ Chiba    : num 1207
 $ Ehime    : num 244
 $ Fukui    : num 188
 $ Fukuoka  : num 1023
 $ Fukushima: num 139
 $ Gifu     : num 191
 $ Gumma    : num 310
 $ Hiroshima: num 336
 $ Hokkaido : num 68.6
 $ Hyogo    : num 659
 $ Ibaraki  : num 479
 $ Ishikawa : num 276
 $ Iwate    : num 83.8
 $ Kagawa   : num 520
 $ Kagoshima: num 179
 $ Kanagawa : num 3778
 $ Kochi    : num 102
 $ Kumamoto : num 241
 $ Kyoto    : num 566
 $ Mie      : num 314
 $ Miyagi   : num 320
 $ Miyazaki : num 143
 $ Nagano   : num 155
 $ Nagasaki : num 333
 $ Nara     : num 370
 $ Niigata  : num 183
 $ Oita     : int 184
 $ Okayama  : num 270
 $ Okinawa  : num 629
 $ Osaka    : num 4640
 $ Saga     : num 341
 $ Saitama  : int 1912
 $ Shiga    : num 352
 $ Shimane  : num 104
 $ Shizuoka : num 476
 $ Tochigi  : num 308
 $ Tokushima: num 182
 $ Tokyo    : num 6168
 $ Tottori  : num 164
 $ Toyama   : num 251
 $ Wakayama : int 204
 $ Yamagata : num 120
 $ Yamaguchi: num 230
 $ Yamanashi: int 187

# Show first few rows of original data
print(head(population_data))

   Aichi Akita Aomori  Chiba Ehime Fukui Fukuoka Fukushima  Gifu Gumma
1 1446.9  87.9  135.7 1206.8 244.2 187.8  1023.4     138.8 191.4 310.2
  Hiroshima Hokkaido Hyogo Ibaraki Ishikawa Iwate Kagawa Kagoshima Kanagawa
1     335.5     68.6 659.1   478.6    275.8  83.8  520.5     179.4   3778.2
  Kochi Kumamoto Kyoto   Mie Miyagi Miyazaki Nagano Nagasaki  Nara Niigata Oita
1 102.5    241.2 565.9 314.5  320.5    142.8  154.8    333.4 369.8   183.2  184
  Okayama Okinawa  Osaka  Saga Saitama Shiga Shimane Shizuoka Tochigi Tokushima
1   270.2   628.7 4639.9 341.4    1912 351.8   103.5    475.8   308.2     182.3
   Tokyo Tottori Toyama Wakayama Yamagata Yamaguchi Yamanashi
1 6168.1   163.6  251.2      204    120.4     229.9       187

# Transform from wide to long format
population_long <- population_data |>
  pivot_longer(
    cols = everything(),   
    names_to = "city_name",
    values_to = "population_density"
  ) |>
  mutate(
    population_density = as.numeric(population_density),  
    city_name = as.character(city_name)   
  ) |>
  arrange(city_name)   

# Show transformed dataset results
print("Transformed data column names:")

[1] "Transformed data column names:"

print(names(population_long))

[1] "city_name"          "population_density"

print("Structure of transformed data:")

[1] "Structure of transformed data:"

str(population_long)

tibble [47 × 2] (S3: tbl_df/tbl/data.frame)
 $ city_name         : chr [1:47] "Aichi" "Akita" "Aomori" "Chiba" ...
 $ population_density: num [1:47] 1446.9 87.9 135.7 1206.8 244.2 ...

print("First 20 rows of transformed dataset:")

[1] "First 20 rows of transformed dataset:"

print(head(population_long, 20))

# A tibble: 20 × 2
   city_name population_density
   <chr>                  <dbl>
 1 Aichi                 1447. 
 2 Akita                   87.9
 3 Aomori                 136. 
 4 Chiba                 1207. 
 5 Ehime                  244. 
 6 Fukui                  188. 
 7 Fukuoka               1023. 
 8 Fukushima              139. 
 9 Gifu                   191. 
10 Gumma                  310. 
11 Hiroshima              336. 
12 Hokkaido                68.6
13 Hyogo                  659. 
14 Ibaraki                479. 
15 Ishikawa               276. 
16 Iwate                   83.8
17 Kagawa                 520. 
18 Kagoshima              179. 
19 Kanagawa              3778. 
20 Kochi                  102.

print(paste("Total rows:", nrow(population_long)))

[1] "Total rows: 47"

print(paste("Number of prefectures:", length(unique(population_long$city_name))))

[1] "Number of prefectures: 47"

After getting the long format, we are ready to do population analysis and visualization. As shown below, top cities that has most dense population are: Tokyo, Osaka, Kanagawa… On the other hand, lowest cities with population density are Hokkaido, Iwate, Akita… To show how population were largely concentrate on the top 3 cities, we can see the last visualization that those 3 cities are far more dense compare to other cities.

top_dense <- population_long |>
  arrange(desc(population_density)) |>
  head(10)

bottom_dense <- population_long |>
  arrange(population_density) |>
  head(10)

# Bar chart of top 10 most dense prefectures
ggplot(top_dense, aes(x = reorder(city_name, population_density), y = population_density)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Most Densely Populated Japanese Prefectures",
       subtitle = "Population Density (persons/km²) - 2015",
       x = "Prefecture",
       y = "Population Density (persons/km²)") +
  theme_minimal() +
  geom_text(aes(label = round(population_density, 0)), hjust = -0.1, size = 3)

# Bar chart of bottom 10 least dense prefectures
ggplot(bottom_dense, aes(x = reorder(city_name, desc(population_density)), y = population_density)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  labs(title = "Top 10 Least Densely Populated Japanese Prefectures",
       subtitle = "Population Density (persons/km²) - 2015",
       x = "Prefecture",
       y = "Population Density (persons/km²)") +
  theme_minimal() +
  geom_text(aes(label = round(population_density, 0)), hjust = -0.1, size = 3)

# Dot plot of all prefectures
ggplot(population_long, aes(x = reorder(city_name, population_density), y = population_density)) +
  geom_point(size = 3, color = "darkblue") +
  coord_flip() +
  labs(title = "Population Density of All Japanese Prefectures",
       subtitle = "Ordered from lowest to highest density",
       x = "Prefecture",
       y = "Population Density (persons/km²)") +
  theme_minimal() +
  geom_hline(yintercept = mean(population_long$population_density), 
             linetype = "dashed", color = "red", size = 0.5) +
  annotate("text", x = 5, y = mean(population_long$population_density) + 100, 
           label = "National Average", color = "red", size = 3)

Dataset 3: takeout spending habit by generation.

The original takeout spending dataset contained 12 rows and 7 columns, with 5 generations: Gen Z, Millennials, Gen X, Baby Boomers, Silent Gen. and their monthly spending columns from January to June. After transformation, the dataset was reshaped into long format with 30 rows and 3 columns: generation, month and spending. This transformed datasets is in tidy structure are are ready for data analysis and visualization.

# Read the takeout spending CSV file
takeout_data <- read.csv("https://raw.githubusercontent.com/xiaofeimei1/sale-trend-analysis/refs/heads/main/takeoutspending.csv", 
                         stringsAsFactors = FALSE,
                         check.names = FALSE)

# Print original columns 
print(names(takeout_data))

[1] "generation" "Jan"        "Feb"        "Mar"        "Apr"       
[6] "May"        "Jun"

# Show structure of original data
str(takeout_data)

'data.frame':   12 obs. of  7 variables:
 $ generation: chr  "Gen Z" "Millenials" "Gen X" "Baby Boomers" ...
 $ Jan       : int  185 240 195 120 75 NA NA NA NA NA ...
 $ Feb       : int  172 228 188 115 70 NA NA NA NA NA ...
 $ Mar       : int  198 255 205 130 82 NA NA NA NA NA ...
 $ Apr       : int  210 270 215 140 88 NA NA NA NA NA ...
 $ May       : int  205 265 210 135 85 NA NA NA NA NA ...
 $ Jun       : int  225 285 220 145 90 NA NA NA NA NA ...

# Show first few rows of original data
print("First few rows of original takeout data:")

[1] "First few rows of original takeout data:"

print(head(takeout_data))

    generation Jan Feb Mar Apr May Jun
1        Gen Z 185 172 198 210 205 225
2   Millenials 240 228 255 270 265 285
3        Gen X 195 188 205 215 210 220
4 Baby Boomers 120 115 130 140 135 145
5   Silent Gen  75  70  82  88  85  90
6               NA  NA  NA  NA  NA  NA

# Clean the data first - remove empty rows
takeout_data_clean <- takeout_data |>
  filter(!is.na(generation) & generation != "") |>  # Remove rows with empty generation
  select(1:7)  

# Transform from wide to long 
takeout_long <- takeout_data_clean |>
  pivot_longer(
    cols = -generation,  # All month columns except generation
    names_to = "month",
    values_to = "spending"
  ) |>
  mutate(
    month = as.character(month),  #  month in character
    spending = as.numeric(spending),  #  spending is numericr
    month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"), ordered = TRUE)
  ) |>
  arrange(generation, month)  
# Show transformed dataset results
print(names(takeout_long))

[1] "generation" "month"      "spending"

print("Structure of transformed data:")

[1] "Structure of transformed data:"

str(takeout_long)

tibble [30 × 3] (S3: tbl_df/tbl/data.frame)
 $ generation: chr [1:30] "Baby Boomers" "Baby Boomers" "Baby Boomers" "Baby Boomers" ...
 $ month     : Ord.factor w/ 6 levels "Jan"<"Feb"<"Mar"<..: 1 2 3 4 5 6 1 2 3 4 ...
 $ spending  : num [1:30] 120 115 130 140 135 145 195 188 205 215 ...

print("All rows of transformed dataset:")

[1] "All rows of transformed dataset:"

print(takeout_long)

# A tibble: 30 × 3
   generation   month spending
   <chr>        <ord>    <dbl>
 1 Baby Boomers Jan        120
 2 Baby Boomers Feb        115
 3 Baby Boomers Mar        130
 4 Baby Boomers Apr        140
 5 Baby Boomers May        135
 6 Baby Boomers Jun        145
 7 Gen X        Jan        195
 8 Gen X        Feb        188
 9 Gen X        Mar        205
10 Gen X        Apr        215
# ℹ 20 more rows

print(paste("Total rows:", nrow(takeout_long)))

[1] "Total rows: 30"

print(paste("Number of generations:", length(unique(takeout_long$generation))))

[1] "Number of generations: 5"

print(paste("Months covered:", paste(levels(takeout_long$month), collapse = ", ")))

[1] "Months covered: Jan, Feb, Mar, Apr, May, Jun"

# average spending by generation
print("Average spending by generation:")

[1] "Average spending by generation:"

takeout_long |>
  group_by(generation) |>
  summarise(avg_spending = mean(spending, na.rm = TRUE)) %>%
  arrange(desc(avg_spending))

# A tibble: 5 × 2
  generation   avg_spending
  <chr>               <dbl>
1 Millenials          257. 
2 Gen X               206. 
3 Gen Z               199. 
4 Baby Boomers        131. 
5 Silent Gen           81.7

# average spending by month
print("Average spending by month:")

[1] "Average spending by month:"

takeout_long |>
  group_by(month) |>
  summarise(avg_spending = mean(spending, na.rm = TRUE))

# A tibble: 6 × 2
  month avg_spending
  <ord>        <dbl>
1 Jan           163 
2 Feb           155.
3 Mar           174 
4 Apr           185.
5 May           180 
6 Jun           193

Data analysis and visualization to identify which generation spends the most on takeout per month and to find out monthly spending trends across different generational groups.

The results shows that Millennials lead in takeout spending, with the highest average monthly expenditures. They are closely followed by Gen Z and Gen X, whose spending levels are very similar to each other.

Regarding monthly trends, the analysis shows a upward pattern from January to June across all generations, indicating that takeout spending generally increases as the year progresses from winter to summer months. This seasonal trend, combined with generational differences, could be valuable infomation to study consumer behavior and for company to potential targeting specific generation to promote their food products.

generation_summary <- takeout_long |>
  group_by(generation) |>
  summarise(
    Avg_Monthly = mean(spending, na.rm = TRUE),
    .groups = 'drop'
  ) |>
  arrange(desc(Avg_Monthly))

# Bar chart - Average monthly spending by generation
ggplot(generation_summary, aes(x = reorder(generation, Avg_Monthly), y = Avg_Monthly, fill = generation)) +
  geom_bar(stat = "identity", width = 0.7) +
  coord_flip() +
  labs(title = "Average Monthly Takeout Spending by Generation",
       subtitle = "Based on Jan-Jun 2015 data",
       x = "Generation",
       y = "Average Monthly Spending ($)") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = paste0("$", round(Avg_Monthly, 0))), hjust = -0.2, size = 4) +
  scale_fill_brewer(palette = "Set2") +
  ylim(0, max(generation_summary$Avg_Monthly) * 1.1)

# Line chart - Spending trends over months by generation
ggplot(takeout_long, aes(x = month, y = spending, group = generation, color = generation)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  labs(title = "Takeout Spending Trends by Generation (Jan-Jun 2015)",
       subtitle = "Monthly spending patterns across generations",
       x = "Month",
       y = "Spending ($)",
       color = "Generation") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

#Comparison across months
ggplot(takeout_long, aes(x = month, y = spending, fill = generation)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "Monthly Takeout Spending by Generation",
       subtitle = "Side-by-side comparison",
       x = "Month",
       y = "Spending ($)",
       fill = "Generation") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set2")

Summary

This data analysis project focused on transforming three distinct datasets (global GDP, Japanese prefecture population density, and generational takeout spending) from wide format to long format to enable more effective analysis and visualization. Wide format has data of same category across multiple columns, making comparative analysis more difficult. The transformed data made filtering easier, enhanced grouping and aggregation when doing analysis. This data transformation to get tidy data is essential part for complex data analysis.