Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.
This dataset contains population data for 4 states in the span of 6 years (2015 - 2020) and is sourced from the US Census Bureau.
We will analyse this data set in order to look for a things related to yearly population growth:
#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
library(ggplot2)
population_data <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/Population_data.csv')
population_data
## Name FIPS Abbreviations X2015.Population X2016.Population
## 1 Georgia 14 GA 10067378 10189016
## 2 North Carolina 37 NC 9932862 10036881
## 3 NA NA NA
## 4 South Carolina 45 SC 4817440 4879157
## 5 Virginia 51 VA 8254218 8312400
## X2017.Population X2018.Population X2019.Population X2020.Population
## 1 10325943 10471428 10604413 10722092
## 2 10151700 10291929 10455811 10598314
## 3 NA NA NA NA
## 4 4954035 5036155 5118397 5196026
## 5 8370206 8420184 8483598 8539322
#tidying the data
#Removing blank rows and renaming the columns to shorten for just the year
population_data <- population_data %>%
drop_na() %>%
rename(state = Name, x2015 = X2015.Population, x2016 = X2016.Population, x2017 = X2017.Population, x2018 = X2018.Population, x2019 = X2019.Population, x2020 = X2020.Population)
#remove unused columns
population_data_subset <- subset(population_data, select= -c(FIPS,Abbreviations))
#transforming data - putting years into columns to make long data
population_data_subset <- population_data_subset %>%
gather(key = 'year', value = 'population', 2:7)
population_data_subset
## state year population
## 1 Georgia x2015 10067378
## 2 North Carolina x2015 9932862
## 3 South Carolina x2015 4817440
## 4 Virginia x2015 8254218
## 5 Georgia x2016 10189016
## 6 North Carolina x2016 10036881
## 7 South Carolina x2016 4879157
## 8 Virginia x2016 8312400
## 9 Georgia x2017 10325943
## 10 North Carolina x2017 10151700
## 11 South Carolina x2017 4954035
## 12 Virginia x2017 8370206
## 13 Georgia x2018 10471428
## 14 North Carolina x2018 10291929
## 15 South Carolina x2018 5036155
## 16 Virginia x2018 8420184
## 17 Georgia x2019 10604413
## 18 North Carolina x2019 10455811
## 19 South Carolina x2019 5118397
## 20 Virginia x2019 8483598
## 21 Georgia x2020 10722092
## 22 North Carolina x2020 10598314
## 23 South Carolina x2020 5196026
## 24 Virginia x2020 8539322
#remove x's in year column
population_data_final <- population_data_subset %>%
mutate(across(c('year'), substr, 2, nchar(year)))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c("year"), substr, 2, nchar(year))`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
population_data_final
## state year population
## 1 Georgia 2015 10067378
## 2 North Carolina 2015 9932862
## 3 South Carolina 2015 4817440
## 4 Virginia 2015 8254218
## 5 Georgia 2016 10189016
## 6 North Carolina 2016 10036881
## 7 South Carolina 2016 4879157
## 8 Virginia 2016 8312400
## 9 Georgia 2017 10325943
## 10 North Carolina 2017 10151700
## 11 South Carolina 2017 4954035
## 12 Virginia 2017 8370206
## 13 Georgia 2018 10471428
## 14 North Carolina 2018 10291929
## 15 South Carolina 2018 5036155
## 16 Virginia 2018 8420184
## 17 Georgia 2019 10604413
## 18 North Carolina 2019 10455811
## 19 South Carolina 2019 5118397
## 20 Virginia 2019 8483598
## 21 Georgia 2020 10722092
## 22 North Carolina 2020 10598314
## 23 South Carolina 2020 5196026
## 24 Virginia 2020 8539322
#First I need to make data as a numeric vector because my y axis was not in increasing order
transform(population_data_final, population = as.numeric(population))
## state year population
## 1 Georgia 2015 10067378
## 2 North Carolina 2015 9932862
## 3 South Carolina 2015 4817440
## 4 Virginia 2015 8254218
## 5 Georgia 2016 10189016
## 6 North Carolina 2016 10036881
## 7 South Carolina 2016 4879157
## 8 Virginia 2016 8312400
## 9 Georgia 2017 10325943
## 10 North Carolina 2017 10151700
## 11 South Carolina 2017 4954035
## 12 Virginia 2017 8370206
## 13 Georgia 2018 10471428
## 14 North Carolina 2018 10291929
## 15 South Carolina 2018 5036155
## 16 Virginia 2018 8420184
## 17 Georgia 2019 10604413
## 18 North Carolina 2019 10455811
## 19 South Carolina 2019 5118397
## 20 Virginia 2019 8483598
## 21 Georgia 2020 10722092
## 22 North Carolina 2020 10598314
## 23 South Carolina 2020 5196026
## 24 Virginia 2020 8539322
#checking if column is now numeric
class(population_data_final$population)
## [1] "numeric"
#Determine the summary statistics on population per year of the four states
population_summary <- population_data_final %>%
group_by(year) %>%
summarize(mean = mean(population), median = median(population), min = min(population), max = max(population), na.rm = T) %>%
arrange(mean)
population_summary
## # A tibble: 6 × 6
## year mean median min max na.rm
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 2015 8267974. 9093540 4817440 10067378 TRUE
## 2 2016 8354364. 9174640. 4879157 10189016 TRUE
## 3 2017 8450471 9260953 4954035 10325943 TRUE
## 4 2018 8554924 9356056. 5036155 10471428 TRUE
## 5 2019 8665555. 9469704. 5118397 10604413 TRUE
## 6 2020 8763938. 9568818 5196026 10722092 TRUE
#visualize growth of population per state
ggplot(data=population_data_final, aes(x=year, y= population, group = state)) +
geom_line(aes(color=state))+
geom_point(aes(color=state))
Viewing the yearly state populations plotted, we can see a few things: - Georgia has the largest overall population every year, while South Carolina has the smallest (by a large difference in comparison to the other states.)
Georgia and North Carolina are closer in yearly population than the other two states as their lines are very close to one another.
Virginia seems to have the smallest yearly population growth as their line has the least amount of steepness in comparison to Virginia, North Carolina and Georgia.
#Determining which year has the largest population growth
#sum of population per year
sum2015 = sum( population_data_final$population [ population_data_final$year==2015] )
sum2016 = sum( population_data_final$population [ population_data_final$year==2016] )
sum2017 = sum( population_data_final$population [ population_data_final$year==2017] )
sum2018 = sum( population_data_final$population [ population_data_final$year==2018] )
sum2019 = sum( population_data_final$population [ population_data_final$year==2019] )
sum2020 = sum( population_data_final$population [ population_data_final$year==2020] )
population_total = c(sum2015,sum2016,sum2017,sum2018,sum2019,sum2020)
#add column with years and create dataframe
year = c('2015','2016','2017','2018','2019','2020')
sum_population = as.data.frame(year)
sum_population$population_total <- population_total
sum_population
## year population_total
## 1 2015 33071898
## 2 2016 33417454
## 3 2017 33801884
## 4 2018 34219696
## 5 2019 34662219
## 6 2020 35055754
#calculate percent change per year over year, we will use the collapse package
library(collapse)
## Warning: package 'collapse' was built under R version 4.2.3
## collapse 2.0.10, see ?`collapse-package` or ?`collapse-documentation`
##
## Attaching package: 'collapse'
## The following object is masked from 'package:lubridate':
##
## is.Date
## The following object is masked from 'package:tidyr':
##
## replace_na
## The following object is masked from 'package:stats':
##
## D
sum_population |> fmutate(growth = fgrowth(population_total))
## year population_total growth
## 1 2015 33071898 NA
## 2 2016 33417454 1.044863
## 3 2017 33801884 1.150387
## 4 2018 34219696 1.236061
## 5 2019 34662219 1.293182
## 6 2020 35055754 1.135343
Based on the analysis, we can see the largest population growth of the four states combined is between 2018 and 2019 with a growth of around 1.29. The smallest growth was from 2015 to 2016 at 1.04. An interesting find I saw was that the growth percentage was steadily increasing each year until 2019-2020 where it dropped to 1.13 the second lowest population growth of all 6 years.The recent drop in population growth could be attributed to economical/financial issue pertaining to the year, or the impeding issues cause by the corona virus, however more research will be needed to confirm or deny these assumptions.
For this data set, we will look into the following:
As suggested by Anna Moy and Mohammed Rahman, I used a csv called region in order to classify each country to a region for simpler data categorization.
#Load the csv
global <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/global_inflation_data.csv')
region <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/region.csv')
colnames(region) <- c("country_name", "region")
global_inflation <- merge(global, region, by = "country_name")
head(global_inflation)
## country_name indicator_name X1980
## 1 Afghanistan Annual average inflation (consumer prices) rate 13.4
## 2 Albania Annual average inflation (consumer prices) rate NA
## 3 Algeria Annual average inflation (consumer prices) rate 9.7
## 4 Andorra Annual average inflation (consumer prices) rate NA
## 5 Angola Annual average inflation (consumer prices) rate 46.7
## 6 Antigua and Barbuda Annual average inflation (consumer prices) rate 19.0
## X1981 X1982 X1983 X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992
## 1 22.2 18.2 15.9 20.4 8.7 -2.1 18.4 27.5 71.5 47.4 43.8 58.19
## 2 NA NA NA NA NA NA NA NA NA -0.2 35.7 226.00
## 3 14.6 6.6 7.8 6.3 10.4 14.0 5.9 5.9 9.2 9.3 25.9 31.70
## 4 NA NA NA NA NA NA NA NA NA NA NA NA
## 5 1.4 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 85.3 299.10
## 6 11.5 4.2 2.3 3.8 1.0 0.5 3.6 6.8 4.4 6.6 4.5 3.00
## X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003
## 1 33.99 20.01 14.0 14.01 14.01 14.01 14.01 0.0 -43.4 51.93 35.66
## 2 85.00 22.60 7.8 12.70 33.20 20.60 0.40 0.0 3.1 5.20 2.40
## 3 20.50 29.00 29.8 18.70 5.70 5.00 2.60 0.3 4.2 1.40 4.30
## 4 NA NA NA NA NA NA NA NA NA 3.10 3.10
## 5 1379.50 949.80 2672.2 4146.00 221.50 107.40 248.20 325.0 152.6 108.90 98.20
## 6 3.10 6.50 2.7 3.00 0.40 3.30 1.10 -0.2 1.9 2.40 2.00
## X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016
## 1 16.36 10.57 6.78 8.68 26.42 -6.81 2.18 11.8 6.44 7.39 4.67 -0.66 4.38
## 2 2.90 2.40 2.40 3.00 3.30 2.20 3.60 3.4 2.00 1.90 1.60 1.90 1.30
## 3 4.00 1.40 2.30 3.70 4.90 5.70 3.90 4.5 8.90 3.30 2.90 4.80 6.40
## 4 2.90 3.50 3.70 2.70 4.30 -1.20 1.70 2.6 1.50 0.50 -0.10 -1.10 -0.40
## 5 43.50 23.00 13.30 12.20 12.50 13.70 14.50 13.5 10.30 8.80 7.30 9.20 30.70
## 6 2.00 2.10 1.80 1.40 5.30 -0.60 3.40 3.5 3.40 1.10 1.10 1.00 -0.50
## X2017 X2018 X2019 X2020 X2021 X2022 X2023 X2024 region
## 1 4.98 0.63 2.3 5.44 5.06 13.71 9.1 NA Asia & Pacific
## 2 2.00 2.00 1.4 1.60 2.00 6.70 4.8 4.0 Europe
## 3 5.60 4.30 2.0 2.40 7.20 9.30 9.0 6.8 Arab States
## 4 2.60 1.00 0.5 0.10 1.70 6.20 5.2 3.5 Europe
## 5 29.80 19.60 17.1 22.30 25.80 21.40 13.1 22.3 Africa
## 6 2.40 1.20 1.4 1.10 1.60 7.50 5.0 2.9 South/Latin America
#Tidying the data
#drop unnecessary column and NAs
global_inflation <- subset(global_inflation, select = -c(indicator_name))
global_inflation <- global_inflation %>%
select(country_name, region,X1980,X1980,X1981,X1982,X1983,X1984,X1985,X1986,X1987,X1988,X1989,X1990,X1991,X1992,X1993,X1994,X1995,X1996,X1997,X1998,X1999,X2000,X2001,X2002,X2003,X2004,X2005,X2006,X2007,X2008,X2009,X2010,X2011,X2012,X2013,X2014,X2015,X2016,X2017,X2018,X2019,X2020,X2021,X2022,X2023,X2024) %>%
pivot_longer(`X1980`:`X2024`,names_to = "year", values_to = "CPI_inflation", values_drop_na = TRUE)
#remove x's in year column
global_inflation_final <- global_inflation %>%
mutate(across(c('year'), substr, 2, nchar(year)))
global_inflation_final
## # A tibble: 7,952 × 4
## country_name region year CPI_inflation
## <chr> <chr> <chr> <dbl>
## 1 Afghanistan Asia & Pacific 1980 13.4
## 2 Afghanistan Asia & Pacific 1981 22.2
## 3 Afghanistan Asia & Pacific 1982 18.2
## 4 Afghanistan Asia & Pacific 1983 15.9
## 5 Afghanistan Asia & Pacific 1984 20.4
## 6 Afghanistan Asia & Pacific 1985 8.7
## 7 Afghanistan Asia & Pacific 1986 -2.1
## 8 Afghanistan Asia & Pacific 1987 18.4
## 9 Afghanistan Asia & Pacific 1988 27.5
## 10 Afghanistan Asia & Pacific 1989 71.5
## # ℹ 7,942 more rows
#Determine min and max of the regions, sorting by desc
global_max <- global_inflation_final %>%
group_by(region) %>%
summarize(max_cpi = max(CPI_inflation)) %>%
arrange(desc(max_cpi))
global_min <- global_inflation_final %>%
group_by(region) %>%
summarize(min_cpi = min(CPI_inflation)) %>%
arrange(min_cpi)
global_min
## # A tibble: 9 × 2
## region min_cpi
## <chr> <dbl>
## 1 "Africa" -72.7
## 2 "Asia & Pacific" -71.3
## 3 "South/Latin America" -44.4
## 4 "Europe" -11.7
## 5 "Arab States" -11.3
## 6 "Middle east" -9.9
## 7 "" -2.4
## 8 "North America" -0.3
## 9 "South/Central America" 3.5
global_max
## # A tibble: 9 × 2
## region max_cpi
## <chr> <dbl>
## 1 "South/Latin America" 65374.
## 2 "Africa" 23773.
## 3 "South/Central America" 13110.
## 4 "Europe" 5273.
## 5 "Asia & Pacific" 3102.
## 6 "Middle east" 487.
## 7 "Arab States" 359.
## 8 "" 111.
## 9 "North America" 13.5
#Filter out the US to look at its min/max and determine its lowes CPI avg by sorting
#Highest/lowest CPI year
us_CPI <- global_inflation_final %>%
filter(country_name == "United States") %>%
mutate(min = min(CPI_inflation), max = max(CPI_inflation)) %>%
arrange(desc(CPI_inflation))
us_CPI
## # A tibble: 45 × 6
## country_name region year CPI_inflation min max
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 United States North America 1980 13.5 -0.3 13.5
## 2 United States North America 1981 10.4 -0.3 13.5
## 3 United States North America 2022 8 -0.3 13.5
## 4 United States North America 1982 6.2 -0.3 13.5
## 5 United States North America 1990 5.4 -0.3 13.5
## 6 United States North America 1989 4.8 -0.3 13.5
## 7 United States North America 2021 4.7 -0.3 13.5
## 8 United States North America 1984 4.4 -0.3 13.5
## 9 United States North America 1991 4.2 -0.3 13.5
## 10 United States North America 1988 4.1 -0.3 13.5
## # ℹ 35 more rows
#CPI by year
us_CPI_year <- global_inflation_final %>%
filter(country_name == "United States") %>%
mutate(min = min(CPI_inflation), max = max(CPI_inflation)) %>%
arrange(desc(year))
us_CPI_year
## # A tibble: 45 × 6
## country_name region year CPI_inflation min max
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 United States North America 2024 2.8 -0.3 13.5
## 2 United States North America 2023 4.1 -0.3 13.5
## 3 United States North America 2022 8 -0.3 13.5
## 4 United States North America 2021 4.7 -0.3 13.5
## 5 United States North America 2020 1.3 -0.3 13.5
## 6 United States North America 2019 1.8 -0.3 13.5
## 7 United States North America 2018 2.4 -0.3 13.5
## 8 United States North America 2017 2.1 -0.3 13.5
## 9 United States North America 2016 1.3 -0.3 13.5
## 10 United States North America 2015 0.1 -0.3 13.5
## # ℹ 35 more rows
After tidying and analyzing the data, the highest CPI for all regions is 65374 in the South/Latin America region. The lowest CPI is seen in Africa at -72.7. The drastic difference between the CPIs can be attributed to outliers. When looking at just the US in CPI descending order, we see 2009 had the lowest CPI (min) at -0.3, while 1980 has the highest CPI (max) at 13.5. Looking at year over year and outside 1981 and 1980, the US CPI did not fluctuate much and remained relatively constant.
Dataset: “Bureau
of Labor Statistics - CPI Report for Jan2024”
# Load the data
library(tidyverse)
CPI_Jan <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/news-release-table1-202401.csv")
head(CPI_Jan)
## X
## 1
## 2
## 3 Indent Level
## 4
## 5
## 6 0
## Table.1..Consumer.Price.Index.for.All.Urban.Consumers..CPI.U...U.S..city.average..by.expenditure.category..January.2024
## 1 [1982-84=100, unless otherwise noted]
## 2
## 3 Expenditure category
## 4
## 5
## 6 All items
## X.1 X.2 X.3
## 1
## 2
## 3 Relative\nimportance\nDec.\n2023 Unadjusted indexes Unadjusted indexes
## 4 Jan.\n2023 Dec.\n2023
## 5
## 6 100.000 299.170 306.746
## X.4 X.5 X.6
## 1
## 2
## 3 Unadjusted indexes Unadjusted percent change Unadjusted percent change
## 4 Jan.\n2024 Jan.\n2023-\nJan.\n2024 Dec.\n2023-\nJan.\n2024
## 5
## 6 308.417 3.1 0.5
## X.7 X.8
## 1
## 2
## 3 Seasonally adjusted percent change Seasonally adjusted percent change
## 4 Oct.\n2023-\nNov.\n2023 Nov.\n2023-\nDec.\n2023
## 5
## 6 0.2 0.2
## X.9
## 1
## 2
## 3 Seasonally adjusted percent change
## 4 Dec.\n2023-\nJan.\n2024
## 5
## 6 0.3
#We start tidying the data removing the first two rows, last three rows and empty rows
CPI_Jan <- CPI_Jan[-c(1:5, 16, 25, 45:49), ]
tail(CPI_Jan)
## X
## 39 4
## 40 4
## 41 3
## 42 4
## 43 4
## 44 4
## Table.1..Consumer.Price.Index.for.All.Urban.Consumers..CPI.U...U.S..city.average..by.expenditure.category..January.2024
## 39 Physicians' services(1)
## 40 Hospital services(1)(3)
## 41 Transportation services
## 42 Motor vehicle maintenance and repair(1)
## 43 Motor vehicle insurance
## 44 Airline fares
## X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9
## 39 1.828 415.197 412.930 415.427 0.1 0.6 0.6 0.2 0.6
## 40 1.987 385.064 404.407 410.695 6.7 1.6 0.1 0.5 1.6
## 41 6.294 376.743 409.749 412.643 9.5 0.7 1.0 0.1 1.0
## 42 1.233 371.780 392.897 396.004 6.5 0.8 0.3 -0.3 0.8
## 43 2.794 658.513 780.284 794.142 20.6 1.8 1.2 1.7 1.4
## 44 0.751 264.629 243.348 247.606 -6.4 1.7 -0.2 0.9 1.4
#change column names
new_columns <- c("Indent_Level",
"Expenditure_category",
"Relative_importance_Dec_2023",
"Jan_2023_Unadjusted_indexes",
"Dec_2023_Unadjusted_indexes",
"Jan_2024_Unadjusted_indexes",
"Jan_2023_Jan_2024_Unadjusted_percent_change",
"Dec_2023_Jan_2024_Unadjusted_percent_change",
"Oct_2023_Nov_2023_Seasonally_adjusted_percent_change",
"Nov_2023_Dec_2023_Seasonally_adjusted_percent_change",
"Dec_2023_Jan_2024_Seasonally_adjusted_percent_change")
colnames(CPI_Jan) <- new_columns
#change the type into INT
CPI_Jan$Indent_Level <- as.integer(CPI_Jan$Indent_Level)
head(CPI_Jan)
## Indent_Level Expenditure_category Relative_importance_Dec_2023
## 6 0 All items 100.000
## 7 1 Food 13.555
## 8 2 Food at home 8.167
## 9 3 Cereals and bakery products 1.066
## 10 3 Meats, poultry, fish, and eggs 1.722
## 11 3 Dairy and related products 0.748
## Jan_2023_Unadjusted_indexes Dec_2023_Unadjusted_indexes
## 6 299.170 306.746
## 7 319.136 325.409
## 8 301.435 303.005
## 9 349.294 353.844
## 10 322.737 320.143
## 11 272.040 267.889
## Jan_2024_Unadjusted_indexes Jan_2023_Jan_2024_Unadjusted_percent_change
## 6 308.417 3.1
## 7 327.327 2.6
## 8 305.037 1.2
## 9 354.532 1.5
## 10 319.752 -0.9
## 11 268.941 -1.1
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 6 0.5
## 7 0.6
## 8 0.7
## 9 0.2
## 10 -0.1
## 11 0.4
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 6 0.2
## 7 0.2
## 8 0.0
## 9 0.3
## 10 -0.2
## 11 0.0
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 6 0.2
## 7 0.2
## 8 0.1
## 9 -0.1
## 10 0.3
## 11 0.1
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 6 0.3
## 7 0.4
## 8 0.4
## 9 -0.2
## 10 0.0
## 11 0.2
Finally we change the type of each columns to INT and Numeric
CPI_Jan <- transform(CPI_Jan,
Indent_Level = as.integer(Indent_Level),
Relative_importance_Dec_2023 = as.numeric(Relative_importance_Dec_2023),
Jan_2023_Unadjusted_indexes = as.numeric(Jan_2023_Unadjusted_indexes),
Dec_2023_Unadjusted_indexes = as.numeric(Dec_2023_Unadjusted_indexes),
Jan_2024_Unadjusted_indexes = as.numeric(Jan_2024_Unadjusted_indexes),
Jan_2023_Jan_2024_Unadjusted_percent_change = as.numeric(Jan_2023_Jan_2024_Unadjusted_percent_change),
Dec_2023_Jan_2024_Unadjusted_percent_change = as.numeric(Dec_2023_Jan_2024_Unadjusted_percent_change),
Oct_2023_Nov_2023_Seasonally_adjusted_percent_change = as.numeric(Oct_2023_Nov_2023_Seasonally_adjusted_percent_change),
Nov_2023_Dec_2023_Seasonally_adjusted_percent_change = as.numeric(Nov_2023_Dec_2023_Seasonally_adjusted_percent_change),
Dec_2023_Jan_2024_Seasonally_adjusted_percent_change = as.numeric(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change))
Analyzing the data:
in the next lines, we will answer the question: which categories saw the biggest from December to Jan. Amother part of tidying the data will be using the filter function to analyse the subgroups only which will give us accurate data.
Data shows that commodities expenditures such as Cereals, bakery
products, as well as Energy and gasoline saw the least change.
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change)
## Indent_Level Expenditure_category
## 1 3 Fuel oil
## 2 3 Used cars and trucks
## 3 3 Motor fuel
## 4 3 Apparel
## 5 3 Medical care commodities(1)
## 6 3 Cereals and bakery products
## 7 3 Meats, poultry, fish, and eggs
## 8 3 New vehicles
## 9 3 Dairy and related products
## 10 3 Alcoholic beverages
## 11 3 Tobacco and smoking products(1)
## 12 3 Fruits and vegetables
## 13 3 Other food at home
## 14 3 Shelter
## 15 3 Medical care services
## 16 3 Transportation services
## 17 3 Nonalcoholic beverages and beverage materials
## 18 3 Electricity
## 19 3 Utility (piped) gas service
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 0.084 455.595
## 2 2.012 185.857
## 3 3.372 297.413
## 4 2.512 127.875
## 5 1.489 395.981
## 6 1.066 349.294
## 7 1.722 322.737
## 8 3.684 177.276
## 9 0.748 272.040
## 10 0.854 282.286
## 11 0.542 1388.790
## 12 1.410 351.029
## 13 2.193 264.746
## 14 36.191 369.585
## 15 6.515 601.551
## 16 6.294 376.743
## 17 1.027 213.359
## 18 2.428 266.528
## 19 0.688 285.407
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 393.189 390.877
## 2 186.383 179.410
## 3 282.246 277.709
## 4 125.794 127.946
## 5 410.365 407.879
## 6 353.844 354.532
## 7 320.143 319.752
## 8 178.269 178.595
## 9 267.889 268.941
## 10 287.873 288.758
## 11 1486.900 1491.538
## 12 350.250 354.798
## 13 270.223 271.600
## 14 389.433 391.896
## 15 599.464 605.257
## 16 409.749 412.643
## 17 215.872 220.573
## 18 269.170 276.698
## 19 230.862 234.515
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 -14.2
## 2 -3.5
## 3 -6.6
## 4 0.1
## 5 3.0
## 6 1.5
## 7 -0.9
## 8 0.7
## 9 -1.1
## 10 2.3
## 11 7.4
## 12 1.1
## 13 2.6
## 14 6.0
## 15 0.6
## 16 9.5
## 17 3.4
## 18 3.8
## 19 -17.8
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 -0.6
## 2 -3.7
## 3 -1.6
## 4 1.7
## 5 -0.6
## 6 0.2
## 7 -0.1
## 8 0.2
## 9 0.4
## 10 0.3
## 11 0.3
## 12 1.3
## 13 0.5
## 14 0.6
## 15 1.0
## 16 0.7
## 17 2.2
## 18 2.8
## 19 1.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 -1.1
## 2 1.4
## 3 -4.0
## 4 -0.6
## 5 0.5
## 6 0.3
## 7 -0.2
## 8 0.0
## 9 0.0
## 10 -0.1
## 11 1.1
## 12 0.1
## 13 -0.2
## 14 0.4
## 15 0.5
## 16 1.0
## 17 0.4
## 18 1.0
## 19 1.2
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 -3.3
## 2 0.6
## 3 -0.6
## 4 0.0
## 5 -0.1
## 6 -0.1
## 7 0.3
## 8 0.2
## 9 0.1
## 10 0.1
## 11 -0.1
## 12 0.0
## 13 0.2
## 14 0.4
## 15 0.5
## 16 0.1
## 17 0.2
## 18 0.6
## 19 -0.6
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 -4.5
## 2 -3.4
## 3 -3.3
## 4 -0.7
## 5 -0.6
## 6 -0.2
## 7 0.0
## 8 0.0
## 9 0.2
## 10 0.3
## 11 0.3
## 12 0.4
## 13 0.6
## 14 0.6
## 15 0.7
## 16 1.0
## 17 1.2
## 18 1.2
## 19 2.0
To view expenditures that saw the most change we can use the following code.
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(desc(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change))
## Indent_Level Expenditure_category
## 1 3 Utility (piped) gas service
## 2 3 Nonalcoholic beverages and beverage materials
## 3 3 Electricity
## 4 3 Transportation services
## 5 3 Medical care services
## 6 3 Other food at home
## 7 3 Shelter
## 8 3 Fruits and vegetables
## 9 3 Alcoholic beverages
## 10 3 Tobacco and smoking products(1)
## 11 3 Dairy and related products
## 12 3 Meats, poultry, fish, and eggs
## 13 3 New vehicles
## 14 3 Cereals and bakery products
## 15 3 Medical care commodities(1)
## 16 3 Apparel
## 17 3 Motor fuel
## 18 3 Used cars and trucks
## 19 3 Fuel oil
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 0.688 285.407
## 2 1.027 213.359
## 3 2.428 266.528
## 4 6.294 376.743
## 5 6.515 601.551
## 6 2.193 264.746
## 7 36.191 369.585
## 8 1.410 351.029
## 9 0.854 282.286
## 10 0.542 1388.790
## 11 0.748 272.040
## 12 1.722 322.737
## 13 3.684 177.276
## 14 1.066 349.294
## 15 1.489 395.981
## 16 2.512 127.875
## 17 3.372 297.413
## 18 2.012 185.857
## 19 0.084 455.595
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 230.862 234.515
## 2 215.872 220.573
## 3 269.170 276.698
## 4 409.749 412.643
## 5 599.464 605.257
## 6 270.223 271.600
## 7 389.433 391.896
## 8 350.250 354.798
## 9 287.873 288.758
## 10 1486.900 1491.538
## 11 267.889 268.941
## 12 320.143 319.752
## 13 178.269 178.595
## 14 353.844 354.532
## 15 410.365 407.879
## 16 125.794 127.946
## 17 282.246 277.709
## 18 186.383 179.410
## 19 393.189 390.877
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 -17.8
## 2 3.4
## 3 3.8
## 4 9.5
## 5 0.6
## 6 2.6
## 7 6.0
## 8 1.1
## 9 2.3
## 10 7.4
## 11 -1.1
## 12 -0.9
## 13 0.7
## 14 1.5
## 15 3.0
## 16 0.1
## 17 -6.6
## 18 -3.5
## 19 -14.2
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 1.6
## 2 2.2
## 3 2.8
## 4 0.7
## 5 1.0
## 6 0.5
## 7 0.6
## 8 1.3
## 9 0.3
## 10 0.3
## 11 0.4
## 12 -0.1
## 13 0.2
## 14 0.2
## 15 -0.6
## 16 1.7
## 17 -1.6
## 18 -3.7
## 19 -0.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 1.2
## 2 0.4
## 3 1.0
## 4 1.0
## 5 0.5
## 6 -0.2
## 7 0.4
## 8 0.1
## 9 -0.1
## 10 1.1
## 11 0.0
## 12 -0.2
## 13 0.0
## 14 0.3
## 15 0.5
## 16 -0.6
## 17 -4.0
## 18 1.4
## 19 -1.1
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 -0.6
## 2 0.2
## 3 0.6
## 4 0.1
## 5 0.5
## 6 0.2
## 7 0.4
## 8 0.0
## 9 0.1
## 10 -0.1
## 11 0.1
## 12 0.3
## 13 0.2
## 14 -0.1
## 15 -0.1
## 16 0.0
## 17 -0.6
## 18 0.6
## 19 -3.3
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 2.0
## 2 1.2
## 3 1.2
## 4 1.0
## 5 0.7
## 6 0.6
## 7 0.6
## 8 0.4
## 9 0.3
## 10 0.3
## 11 0.2
## 12 0.0
## 13 0.0
## 14 -0.2
## 15 -0.6
## 16 -0.7
## 17 -3.3
## 18 -3.4
## 19 -4.5
Below we analyze the major categories (indent level = 3);
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(desc(Relative_importance_Dec_2023))
## Indent_Level Expenditure_category
## 1 3 Shelter
## 2 3 Medical care services
## 3 3 Transportation services
## 4 3 New vehicles
## 5 3 Motor fuel
## 6 3 Apparel
## 7 3 Electricity
## 8 3 Other food at home
## 9 3 Used cars and trucks
## 10 3 Meats, poultry, fish, and eggs
## 11 3 Medical care commodities(1)
## 12 3 Fruits and vegetables
## 13 3 Cereals and bakery products
## 14 3 Nonalcoholic beverages and beverage materials
## 15 3 Alcoholic beverages
## 16 3 Dairy and related products
## 17 3 Utility (piped) gas service
## 18 3 Tobacco and smoking products(1)
## 19 3 Fuel oil
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 36.191 369.585
## 2 6.515 601.551
## 3 6.294 376.743
## 4 3.684 177.276
## 5 3.372 297.413
## 6 2.512 127.875
## 7 2.428 266.528
## 8 2.193 264.746
## 9 2.012 185.857
## 10 1.722 322.737
## 11 1.489 395.981
## 12 1.410 351.029
## 13 1.066 349.294
## 14 1.027 213.359
## 15 0.854 282.286
## 16 0.748 272.040
## 17 0.688 285.407
## 18 0.542 1388.790
## 19 0.084 455.595
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 389.433 391.896
## 2 599.464 605.257
## 3 409.749 412.643
## 4 178.269 178.595
## 5 282.246 277.709
## 6 125.794 127.946
## 7 269.170 276.698
## 8 270.223 271.600
## 9 186.383 179.410
## 10 320.143 319.752
## 11 410.365 407.879
## 12 350.250 354.798
## 13 353.844 354.532
## 14 215.872 220.573
## 15 287.873 288.758
## 16 267.889 268.941
## 17 230.862 234.515
## 18 1486.900 1491.538
## 19 393.189 390.877
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 6.0
## 2 0.6
## 3 9.5
## 4 0.7
## 5 -6.6
## 6 0.1
## 7 3.8
## 8 2.6
## 9 -3.5
## 10 -0.9
## 11 3.0
## 12 1.1
## 13 1.5
## 14 3.4
## 15 2.3
## 16 -1.1
## 17 -17.8
## 18 7.4
## 19 -14.2
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 0.6
## 2 1.0
## 3 0.7
## 4 0.2
## 5 -1.6
## 6 1.7
## 7 2.8
## 8 0.5
## 9 -3.7
## 10 -0.1
## 11 -0.6
## 12 1.3
## 13 0.2
## 14 2.2
## 15 0.3
## 16 0.4
## 17 1.6
## 18 0.3
## 19 -0.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 0.4
## 2 0.5
## 3 1.0
## 4 0.0
## 5 -4.0
## 6 -0.6
## 7 1.0
## 8 -0.2
## 9 1.4
## 10 -0.2
## 11 0.5
## 12 0.1
## 13 0.3
## 14 0.4
## 15 -0.1
## 16 0.0
## 17 1.2
## 18 1.1
## 19 -1.1
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 0.4
## 2 0.5
## 3 0.1
## 4 0.2
## 5 -0.6
## 6 0.0
## 7 0.6
## 8 0.2
## 9 0.6
## 10 0.3
## 11 -0.1
## 12 0.0
## 13 -0.1
## 14 0.2
## 15 0.1
## 16 0.1
## 17 -0.6
## 18 -0.1
## 19 -3.3
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 0.6
## 2 0.7
## 3 1.0
## 4 0.0
## 5 -3.3
## 6 -0.7
## 7 1.2
## 8 0.6
## 9 -3.4
## 10 0.0
## 11 -0.6
## 12 0.4
## 13 -0.2
## 14 1.2
## 15 0.3
## 16 0.2
## 17 2.0
## 18 0.3
## 19 -4.5
Conclusion: CPI data contains many more insights that I can be analyzed. Here are the questions we answered in the analysis above:
Which category saw the least change from Dec 2023 - Jan 2024? Ans: Fuel oil, Used cars and trucks, and Motor fuel. A: Fuel oil, Used cars and trucks, and Motor fuel.
Which category saw the biggest change from Dec 2023 - Jan 2024? Ans: Utility Gas services, electricity, and transportation services. A:Utility Gas services, electricity, and transportation services.
Which category saw the biggest change from Jan 2023 - Jan 2024? Ans: Transportation, Tobacco and smoking products, and Shelter. A: Transportation, Tobacco and smoking products, and Shelter.