Dataset 1 - Bureau of Labor Statistics CPI Report, January 2024.
First: Tidying the data.
To get started, we load the tidyverse package, which includes all the packages we need for the assignment, and read in the first dataset: CPI data for January 2024.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
CPI_Jan <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/news-release-table1-202401.csv")
head(CPI_Jan)
## X
## 1
## 2
## 3 Indent Level
## 4
## 5
## 6 0
## Table.1..Consumer.Price.Index.for.All.Urban.Consumers..CPI.U...U.S..city.average..by.expenditure.category..January.2024
## 1 [1982-84=100, unless otherwise noted]
## 2
## 3 Expenditure category
## 4
## 5
## 6 All items
## X.1 X.2 X.3
## 1
## 2
## 3 Relative\nimportance\nDec.\n2023 Unadjusted indexes Unadjusted indexes
## 4 Jan.\n2023 Dec.\n2023
## 5
## 6 100.000 299.170 306.746
## X.4 X.5 X.6
## 1
## 2
## 3 Unadjusted indexes Unadjusted percent change Unadjusted percent change
## 4 Jan.\n2024 Jan.\n2023-\nJan.\n2024 Dec.\n2023-\nJan.\n2024
## 5
## 6 308.417 3.1 0.5
## X.7 X.8
## 1
## 2
## 3 Seasonally adjusted percent change Seasonally adjusted percent change
## 4 Oct.\n2023-\nNov.\n2023 Nov.\n2023-\nDec.\n2023
## 5
## 6 0.2 0.2
## X.9
## 1
## 2
## 3 Seasonally adjusted percent change
## 4 Dec.\n2023-\nJan.\n2024
## 5
## 6 0.3
We start tidying by removing the title and header rows at the top, the trailing footnote rows, and the blank separator rows (rows 1-5, 16, 25, and 45-49);
CPI_Jan <- CPI_Jan[-c(1:5, 16, 25, 45:49), ]
tail(CPI_Jan)
## X
## 39 4
## 40 4
## 41 3
## 42 4
## 43 4
## 44 4
## Table.1..Consumer.Price.Index.for.All.Urban.Consumers..CPI.U...U.S..city.average..by.expenditure.category..January.2024
## 39 Physicians' services(1)
## 40 Hospital services(1)(3)
## 41 Transportation services
## 42 Motor vehicle maintenance and repair(1)
## 43 Motor vehicle insurance
## 44 Airline fares
## X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9
## 39 1.828 415.197 412.930 415.427 0.1 0.6 0.6 0.2 0.6
## 40 1.987 385.064 404.407 410.695 6.7 1.6 0.1 0.5 1.6
## 41 6.294 376.743 409.749 412.643 9.5 0.7 1.0 0.1 1.0
## 42 1.233 371.780 392.897 396.004 6.5 0.8 0.3 -0.3 0.8
## 43 2.794 658.513 780.284 794.142 20.6 1.8 1.2 1.7 1.4
## 44 0.751 264.629 243.348 247.606 -6.4 1.7 -0.2 0.9 1.4
Next, we rename the columns to more descriptive names that are easier to identify.
new_columns <- c("Indent_Level",
"Expenditure_category",
"Relative_importance_Dec_2023",
"Jan_2023_Unadjusted_indexes",
"Dec_2023_Unadjusted_indexes",
"Jan_2024_Unadjusted_indexes",
"Jan_2023_Jan_2024_Unadjusted_percent_change",
"Dec_2023_Jan_2024_Unadjusted_percent_change",
"Oct_2023_Nov_2023_Seasonally_adjusted_percent_change",
"Nov_2023_Dec_2023_Seasonally_adjusted_percent_change",
"Dec_2023_Jan_2024_Seasonally_adjusted_percent_change")
colnames(CPI_Jan) <- new_columns
#convert Indent_Level to integer so we can filter on it below
CPI_Jan$Indent_Level <- as.integer(CPI_Jan$Indent_Level)
head(CPI_Jan)
## Indent_Level Expenditure_category Relative_importance_Dec_2023
## 6 0 All items 100.000
## 7 1 Food 13.555
## 8 2 Food at home 8.167
## 9 3 Cereals and bakery products 1.066
## 10 3 Meats, poultry, fish, and eggs 1.722
## 11 3 Dairy and related products 0.748
## Jan_2023_Unadjusted_indexes Dec_2023_Unadjusted_indexes
## 6 299.170 306.746
## 7 319.136 325.409
## 8 301.435 303.005
## 9 349.294 353.844
## 10 322.737 320.143
## 11 272.040 267.889
## Jan_2024_Unadjusted_indexes Jan_2023_Jan_2024_Unadjusted_percent_change
## 6 308.417 3.1
## 7 327.327 2.6
## 8 305.037 1.2
## 9 354.532 1.5
## 10 319.752 -0.9
## 11 268.941 -1.1
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 6 0.5
## 7 0.6
## 8 0.7
## 9 0.2
## 10 -0.1
## 11 0.4
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 6 0.2
## 7 0.2
## 8 0.0
## 9 0.3
## 10 -0.2
## 11 0.0
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 6 0.2
## 7 0.2
## 8 0.1
## 9 -0.1
## 10 0.3
## 11 0.1
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 6 0.3
## 7 0.4
## 8 0.4
## 9 -0.2
## 10 0.0
## 11 0.2
Finally, we convert each column to the appropriate integer or numeric type.
CPI_Jan <- transform(CPI_Jan,
Indent_Level = as.integer(Indent_Level),
Relative_importance_Dec_2023 = as.numeric(Relative_importance_Dec_2023),
Jan_2023_Unadjusted_indexes = as.numeric(Jan_2023_Unadjusted_indexes),
Dec_2023_Unadjusted_indexes = as.numeric(Dec_2023_Unadjusted_indexes),
Jan_2024_Unadjusted_indexes = as.numeric(Jan_2024_Unadjusted_indexes),
Jan_2023_Jan_2024_Unadjusted_percent_change = as.numeric(Jan_2023_Jan_2024_Unadjusted_percent_change),
Dec_2023_Jan_2024_Unadjusted_percent_change = as.numeric(Dec_2023_Jan_2024_Unadjusted_percent_change),
Oct_2023_Nov_2023_Seasonally_adjusted_percent_change = as.numeric(Oct_2023_Nov_2023_Seasonally_adjusted_percent_change),
Nov_2023_Dec_2023_Seasonally_adjusted_percent_change = as.numeric(Nov_2023_Dec_2023_Seasonally_adjusted_percent_change),
Dec_2023_Jan_2024_Seasonally_adjusted_percent_change = as.numeric(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change))
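As a side note, the same conversion can be written more compactly with dplyr; a minimal sketch, equivalent to the transform() call above, assuming every column other than the category is numeric:
# Compact alternative: convert every column except the category at once
CPI_Jan <- CPI_Jan %>%
  mutate(across(-Expenditure_category, as.numeric),
         Indent_Level = as.integer(Indent_Level))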
In the next lines, we answer the question: which categories saw the biggest change from December 2023 to January 2024? Another part of tidying the data is using the filter() function to analyze only the subgroups (indent level 3), which gives us a like-for-like comparison.
Sorting the seasonally adjusted December-to-January change in ascending order shows that energy- and vehicle-related categories such as fuel oil, used cars and trucks, and motor fuel saw the largest declines, while categories such as cereals and bakery products, meats, and new vehicles barely moved.
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change)
## Indent_Level Expenditure_category
## 1 3 Fuel oil
## 2 3 Used cars and trucks
## 3 3 Motor fuel
## 4 3 Apparel
## 5 3 Medical care commodities(1)
## 6 3 Cereals and bakery products
## 7 3 Meats, poultry, fish, and eggs
## 8 3 New vehicles
## 9 3 Dairy and related products
## 10 3 Alcoholic beverages
## 11 3 Tobacco and smoking products(1)
## 12 3 Fruits and vegetables
## 13 3 Other food at home
## 14 3 Shelter
## 15 3 Medical care services
## 16 3 Transportation services
## 17 3 Nonalcoholic beverages and beverage materials
## 18 3 Electricity
## 19 3 Utility (piped) gas service
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 0.084 455.595
## 2 2.012 185.857
## 3 3.372 297.413
## 4 2.512 127.875
## 5 1.489 395.981
## 6 1.066 349.294
## 7 1.722 322.737
## 8 3.684 177.276
## 9 0.748 272.040
## 10 0.854 282.286
## 11 0.542 1388.790
## 12 1.410 351.029
## 13 2.193 264.746
## 14 36.191 369.585
## 15 6.515 601.551
## 16 6.294 376.743
## 17 1.027 213.359
## 18 2.428 266.528
## 19 0.688 285.407
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 393.189 390.877
## 2 186.383 179.410
## 3 282.246 277.709
## 4 125.794 127.946
## 5 410.365 407.879
## 6 353.844 354.532
## 7 320.143 319.752
## 8 178.269 178.595
## 9 267.889 268.941
## 10 287.873 288.758
## 11 1486.900 1491.538
## 12 350.250 354.798
## 13 270.223 271.600
## 14 389.433 391.896
## 15 599.464 605.257
## 16 409.749 412.643
## 17 215.872 220.573
## 18 269.170 276.698
## 19 230.862 234.515
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 -14.2
## 2 -3.5
## 3 -6.6
## 4 0.1
## 5 3.0
## 6 1.5
## 7 -0.9
## 8 0.7
## 9 -1.1
## 10 2.3
## 11 7.4
## 12 1.1
## 13 2.6
## 14 6.0
## 15 0.6
## 16 9.5
## 17 3.4
## 18 3.8
## 19 -17.8
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 -0.6
## 2 -3.7
## 3 -1.6
## 4 1.7
## 5 -0.6
## 6 0.2
## 7 -0.1
## 8 0.2
## 9 0.4
## 10 0.3
## 11 0.3
## 12 1.3
## 13 0.5
## 14 0.6
## 15 1.0
## 16 0.7
## 17 2.2
## 18 2.8
## 19 1.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 -1.1
## 2 1.4
## 3 -4.0
## 4 -0.6
## 5 0.5
## 6 0.3
## 7 -0.2
## 8 0.0
## 9 0.0
## 10 -0.1
## 11 1.1
## 12 0.1
## 13 -0.2
## 14 0.4
## 15 0.5
## 16 1.0
## 17 0.4
## 18 1.0
## 19 1.2
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 -3.3
## 2 0.6
## 3 -0.6
## 4 0.0
## 5 -0.1
## 6 -0.1
## 7 0.3
## 8 0.2
## 9 0.1
## 10 0.1
## 11 -0.1
## 12 0.0
## 13 0.2
## 14 0.4
## 15 0.5
## 16 0.1
## 17 0.2
## 18 0.6
## 19 -0.6
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 -4.5
## 2 -3.4
## 3 -3.3
## 4 -0.7
## 5 -0.6
## 6 -0.2
## 7 0.0
## 8 0.0
## 9 0.2
## 10 0.3
## 11 0.3
## 12 0.4
## 13 0.6
## 14 0.6
## 15 0.7
## 16 1.0
## 17 1.2
## 18 1.2
## 19 2.0
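Since printing all ten columns makes the result hard to scan, a small optional sketch with select() narrows the printout to the two columns of interest:
# Optional: show only the category and the Dec 2023 - Jan 2024 change
CPI_Jan %>%
  filter(Indent_Level == 3) %>%
  select(Expenditure_category, Dec_2023_Jan_2024_Seasonally_adjusted_percent_change) %>%
  arrange(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change)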
To view the expenditures that saw the largest increases, we sort the same column in descending order.
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(desc(Dec_2023_Jan_2024_Seasonally_adjusted_percent_change))
## Indent_Level Expenditure_category
## 1 3 Utility (piped) gas service
## 2 3 Nonalcoholic beverages and beverage materials
## 3 3 Electricity
## 4 3 Transportation services
## 5 3 Medical care services
## 6 3 Other food at home
## 7 3 Shelter
## 8 3 Fruits and vegetables
## 9 3 Alcoholic beverages
## 10 3 Tobacco and smoking products(1)
## 11 3 Dairy and related products
## 12 3 Meats, poultry, fish, and eggs
## 13 3 New vehicles
## 14 3 Cereals and bakery products
## 15 3 Medical care commodities(1)
## 16 3 Apparel
## 17 3 Motor fuel
## 18 3 Used cars and trucks
## 19 3 Fuel oil
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 0.688 285.407
## 2 1.027 213.359
## 3 2.428 266.528
## 4 6.294 376.743
## 5 6.515 601.551
## 6 2.193 264.746
## 7 36.191 369.585
## 8 1.410 351.029
## 9 0.854 282.286
## 10 0.542 1388.790
## 11 0.748 272.040
## 12 1.722 322.737
## 13 3.684 177.276
## 14 1.066 349.294
## 15 1.489 395.981
## 16 2.512 127.875
## 17 3.372 297.413
## 18 2.012 185.857
## 19 0.084 455.595
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 230.862 234.515
## 2 215.872 220.573
## 3 269.170 276.698
## 4 409.749 412.643
## 5 599.464 605.257
## 6 270.223 271.600
## 7 389.433 391.896
## 8 350.250 354.798
## 9 287.873 288.758
## 10 1486.900 1491.538
## 11 267.889 268.941
## 12 320.143 319.752
## 13 178.269 178.595
## 14 353.844 354.532
## 15 410.365 407.879
## 16 125.794 127.946
## 17 282.246 277.709
## 18 186.383 179.410
## 19 393.189 390.877
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 -17.8
## 2 3.4
## 3 3.8
## 4 9.5
## 5 0.6
## 6 2.6
## 7 6.0
## 8 1.1
## 9 2.3
## 10 7.4
## 11 -1.1
## 12 -0.9
## 13 0.7
## 14 1.5
## 15 3.0
## 16 0.1
## 17 -6.6
## 18 -3.5
## 19 -14.2
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 1.6
## 2 2.2
## 3 2.8
## 4 0.7
## 5 1.0
## 6 0.5
## 7 0.6
## 8 1.3
## 9 0.3
## 10 0.3
## 11 0.4
## 12 -0.1
## 13 0.2
## 14 0.2
## 15 -0.6
## 16 1.7
## 17 -1.6
## 18 -3.7
## 19 -0.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 1.2
## 2 0.4
## 3 1.0
## 4 1.0
## 5 0.5
## 6 -0.2
## 7 0.4
## 8 0.1
## 9 -0.1
## 10 1.1
## 11 0.0
## 12 -0.2
## 13 0.0
## 14 0.3
## 15 0.5
## 16 -0.6
## 17 -4.0
## 18 1.4
## 19 -1.1
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 -0.6
## 2 0.2
## 3 0.6
## 4 0.1
## 5 0.5
## 6 0.2
## 7 0.4
## 8 0.0
## 9 0.1
## 10 -0.1
## 11 0.1
## 12 0.3
## 13 0.2
## 14 -0.1
## 15 -0.1
## 16 0.0
## 17 -0.6
## 18 0.6
## 19 -3.3
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 2.0
## 2 1.2
## 3 1.2
## 4 1.0
## 5 0.7
## 6 0.6
## 7 0.6
## 8 0.4
## 9 0.3
## 10 0.3
## 11 0.2
## 12 0.0
## 13 0.0
## 14 -0.2
## 15 -0.6
## 16 -0.7
## 17 -3.3
## 18 -3.4
## 19 -4.5
Below we rank the same subgroups (indent level 3) by the unadjusted year-over-year percent change (January 2023 to January 2024);
CPI_Jan %>%
filter(Indent_Level == 3) %>%
arrange(desc(Jan_2023_Jan_2024_Unadjusted_percent_change))
## Indent_Level Expenditure_category
## 1 3 Transportation services
## 2 3 Tobacco and smoking products(1)
## 3 3 Shelter
## 4 3 Electricity
## 5 3 Nonalcoholic beverages and beverage materials
## 6 3 Medical care commodities(1)
## 7 3 Other food at home
## 8 3 Alcoholic beverages
## 9 3 Cereals and bakery products
## 10 3 Fruits and vegetables
## 11 3 New vehicles
## 12 3 Medical care services
## 13 3 Apparel
## 14 3 Meats, poultry, fish, and eggs
## 15 3 Dairy and related products
## 16 3 Used cars and trucks
## 17 3 Motor fuel
## 18 3 Fuel oil
## 19 3 Utility (piped) gas service
## Relative_importance_Dec_2023 Jan_2023_Unadjusted_indexes
## 1 6.294 376.743
## 2 0.542 1388.790
## 3 36.191 369.585
## 4 2.428 266.528
## 5 1.027 213.359
## 6 1.489 395.981
## 7 2.193 264.746
## 8 0.854 282.286
## 9 1.066 349.294
## 10 1.410 351.029
## 11 3.684 177.276
## 12 6.515 601.551
## 13 2.512 127.875
## 14 1.722 322.737
## 15 0.748 272.040
## 16 2.012 185.857
## 17 3.372 297.413
## 18 0.084 455.595
## 19 0.688 285.407
## Dec_2023_Unadjusted_indexes Jan_2024_Unadjusted_indexes
## 1 409.749 412.643
## 2 1486.900 1491.538
## 3 389.433 391.896
## 4 269.170 276.698
## 5 215.872 220.573
## 6 410.365 407.879
## 7 270.223 271.600
## 8 287.873 288.758
## 9 353.844 354.532
## 10 350.250 354.798
## 11 178.269 178.595
## 12 599.464 605.257
## 13 125.794 127.946
## 14 320.143 319.752
## 15 267.889 268.941
## 16 186.383 179.410
## 17 282.246 277.709
## 18 393.189 390.877
## 19 230.862 234.515
## Jan_2023_Jan_2024_Unadjusted_percent_change
## 1 9.5
## 2 7.4
## 3 6.0
## 4 3.8
## 5 3.4
## 6 3.0
## 7 2.6
## 8 2.3
## 9 1.5
## 10 1.1
## 11 0.7
## 12 0.6
## 13 0.1
## 14 -0.9
## 15 -1.1
## 16 -3.5
## 17 -6.6
## 18 -14.2
## 19 -17.8
## Dec_2023_Jan_2024_Unadjusted_percent_change
## 1 0.7
## 2 0.3
## 3 0.6
## 4 2.8
## 5 2.2
## 6 -0.6
## 7 0.5
## 8 0.3
## 9 0.2
## 10 1.3
## 11 0.2
## 12 1.0
## 13 1.7
## 14 -0.1
## 15 0.4
## 16 -3.7
## 17 -1.6
## 18 -0.6
## 19 1.6
## Oct_2023_Nov_2023_Seasonally_adjusted_percent_change
## 1 1.0
## 2 1.1
## 3 0.4
## 4 1.0
## 5 0.4
## 6 0.5
## 7 -0.2
## 8 -0.1
## 9 0.3
## 10 0.1
## 11 0.0
## 12 0.5
## 13 -0.6
## 14 -0.2
## 15 0.0
## 16 1.4
## 17 -4.0
## 18 -1.1
## 19 1.2
## Nov_2023_Dec_2023_Seasonally_adjusted_percent_change
## 1 0.1
## 2 -0.1
## 3 0.4
## 4 0.6
## 5 0.2
## 6 -0.1
## 7 0.2
## 8 0.1
## 9 -0.1
## 10 0.0
## 11 0.2
## 12 0.5
## 13 0.0
## 14 0.3
## 15 0.1
## 16 0.6
## 17 -0.6
## 18 -3.3
## 19 -0.6
## Dec_2023_Jan_2024_Seasonally_adjusted_percent_change
## 1 1.0
## 2 0.3
## 3 0.6
## 4 1.2
## 5 1.2
## 6 -0.6
## 7 0.6
## 8 0.3
## 9 -0.2
## 10 0.4
## 11 0.0
## 12 0.7
## 13 -0.7
## 14 0.0
## 15 0.2
## 16 -3.4
## 17 -3.3
## 18 -4.5
## 19 2.0
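If only the extremes are needed, slice_max() and slice_min() pull them directly; a brief sketch:
# Sketch: the three largest year-over-year increases among the subgroups
CPI_Jan %>%
  filter(Indent_Level == 3) %>%
  slice_max(Jan_2023_Jan_2024_Unadjusted_percent_change, n = 3) %>%
  select(Expenditure_category, Jan_2023_Jan_2024_Unadjusted_percent_change)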
### Conclusion
The CPI data contains many more insights that could be analyzed; the analysis above answered which expenditure categories saw the largest and smallest price changes from December 2023 to January 2024, both month over month and year over year.
Dataset: "Bureau of Labor Statistics - CPI Report for January 2024"
Dataset 2 - State Population Data.
First: Tidying the data.
population_data <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/Population_data.csv')
population_data
## Name FIPS Abbreviations X2015.Population X2016.Population
## 1 Georgia 14 GA 10067378 10189016
## 2 North Carolina 37 NC 9932862 10036881
## 3 NA NA NA
## 4 South Carolina 45 SC 4817440 4879157
## 5 Virginia 51 VA 8254218 8312400
## X2017.Population X2018.Population X2019.Population X2020.Population
## 1 10325943 10471428 10604413 10722092
## 2 10151700 10291929 10455811 10598314
## 3 NA NA NA NA
## 4 4954035 5036155 5118397 5196026
## 5 8370206 8420184 8483598 8539322
#cleaning the data
#remove blank rows and shorten the population column names to just the year
population_data <- population_data %>%
drop_na() %>%
rename(state = Name, x2015 = X2015.Population, x2016 = X2016.Population, x2017 = X2017.Population, x2018 = X2018.Population, x2019 = X2019.Population, x2020 = X2020.Population)
#remove unused columns
population_data_subset <- subset(population_data, select= -c(FIPS,Abbreviations))
#transforming data - putting years into columns to make long data
population_data_subset <- population_data_subset %>%
gather(key = 'year', value = 'population', 2:7)
population_data_subset
## state year population
## 1 Georgia x2015 10067378
## 2 North Carolina x2015 9932862
## 3 South Carolina x2015 4817440
## 4 Virginia x2015 8254218
## 5 Georgia x2016 10189016
## 6 North Carolina x2016 10036881
## 7 South Carolina x2016 4879157
## 8 Virginia x2016 8312400
## 9 Georgia x2017 10325943
## 10 North Carolina x2017 10151700
## 11 South Carolina x2017 4954035
## 12 Virginia x2017 8370206
## 13 Georgia x2018 10471428
## 14 North Carolina x2018 10291929
## 15 South Carolina x2018 5036155
## 16 Virginia x2018 8420184
## 17 Georgia x2019 10604413
## 18 North Carolina x2019 10455811
## 19 South Carolina x2019 5118397
## 20 Virginia x2019 8483598
## 21 Georgia x2020 10722092
## 22 North Carolina x2020 10598314
## 23 South Carolina x2020 5196026
## 24 Virginia x2020 8539322
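Note that gather() is superseded in current tidyr; pivot_longer() is the recommended replacement. A sketch producing the same long format:
# pivot_longer() equivalent of the gather() call above
population_data_subset <- population_data %>%
  select(-FIPS, -Abbreviations) %>%
  pivot_longer(cols = x2015:x2020, names_to = "year", values_to = "population")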
#remove x's in year column
population_data_final <- population_data_subset %>%
mutate(across(c('year'), substr, 2, nchar(year)))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c("year"), substr, 2, nchar(year))`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
population_data_final
## state year population
## 1 Georgia 2015 10067378
## 2 North Carolina 2015 9932862
## 3 South Carolina 2015 4817440
## 4 Virginia 2015 8254218
## 5 Georgia 2016 10189016
## 6 North Carolina 2016 10036881
## 7 South Carolina 2016 4879157
## 8 Virginia 2016 8312400
## 9 Georgia 2017 10325943
## 10 North Carolina 2017 10151700
## 11 South Carolina 2017 4954035
## 12 Virginia 2017 8370206
## 13 Georgia 2018 10471428
## 14 North Carolina 2018 10291929
## 15 South Carolina 2018 5036155
## 16 Virginia 2018 8420184
## 17 Georgia 2019 10604413
## 18 North Carolina 2019 10455811
## 19 South Carolina 2019 5118397
## 20 Virginia 2019 8483598
## 21 Georgia 2020 10722092
## 22 North Carolina 2020 10598314
## 23 South Carolina 2020 5196026
## 24 Virginia 2020 8539322
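The warning above comes from passing substr()'s arguments through across()'s `...`; supplying them via an anonymous function, as the warning suggests, avoids it. A small warning-free sketch:
# Warning-free equivalent for stripping the leading "x" from year
population_data_final <- population_data_subset %>%
  mutate(across(year, \(y) substr(y, 2, nchar(y))))
# or, more simply, with stringr (loaded as part of the tidyverse):
# population_data_final <- population_data_subset %>% mutate(year = str_remove(year, "^x"))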
We will create a few visuals and summaries in order to dig into the raw data a bit deeper and see how each state's population changes over time.
#First, make sure population is a numeric vector so the y axis sorts in increasing order
#(note the result must be assigned back for the conversion to stick)
population_data_final <- transform(population_data_final, population = as.numeric(population))
population_data_final
## state year population
## 1 Georgia 2015 10067378
## 2 North Carolina 2015 9932862
## 3 South Carolina 2015 4817440
## 4 Virginia 2015 8254218
## 5 Georgia 2016 10189016
## 6 North Carolina 2016 10036881
## 7 South Carolina 2016 4879157
## 8 Virginia 2016 8312400
## 9 Georgia 2017 10325943
## 10 North Carolina 2017 10151700
## 11 South Carolina 2017 4954035
## 12 Virginia 2017 8370206
## 13 Georgia 2018 10471428
## 14 North Carolina 2018 10291929
## 15 South Carolina 2018 5036155
## 16 Virginia 2018 8420184
## 17 Georgia 2019 10604413
## 18 North Carolina 2019 10455811
## 19 South Carolina 2019 5118397
## 20 Virginia 2019 8483598
## 21 Georgia 2020 10722092
## 22 North Carolina 2020 10598314
## 23 South Carolina 2020 5196026
## 24 Virginia 2020 8539322
#checking if column is now numeric
class(population_data_final$population)
## [1] "numeric"
#Determine the summary statistics on population per year of the four states
population_summary <- population_data_final %>%
group_by(year) %>%
summarize(mean = mean(population, na.rm = TRUE), median = median(population, na.rm = TRUE), min = min(population, na.rm = TRUE), max = max(population, na.rm = TRUE)) %>%
arrange(mean)
population_summary
## # A tibble: 6 × 5
## year mean median min max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2015 8267974. 9093540 4817440 10067378
## 2 2016 8354364. 9174640. 4879157 10189016
## 3 2017 8450471 9260953 4954035 10325943
## 4 2018 8554924 9356056. 5036155 10471428
## 5 2019 8665555. 9469704. 5118397 10604413
## 6 2020 8763938. 9568818 5196026 10722092
#visualize growth of population per state
ggplot(data=population_data_final, aes(x=year, y= population, group = state)) +
geom_line(aes(color=state))+
geom_point(aes(color=state))
Viewing the yearly state populations plotted, we can see a few things:
- Georgia has the largest overall population every year, while South Carolina has the smallest (by a large margin compared to the other states).
- Georgia and North Carolina are closer in yearly population than the other two states; their lines sit very close to one another.
- Virginia appears to have the smallest yearly population growth, as its line is the flattest compared to Georgia, North Carolina, and South Carolina.
#Determining which year has the largest population growth
#sum of population per year
sum2015 = sum( population_data_final$population [ population_data_final$year==2015] )
sum2016 = sum( population_data_final$population [ population_data_final$year==2016] )
sum2017 = sum( population_data_final$population [ population_data_final$year==2017] )
sum2018 = sum( population_data_final$population [ population_data_final$year==2018] )
sum2019 = sum( population_data_final$population [ population_data_final$year==2019] )
sum2020 = sum( population_data_final$population [ population_data_final$year==2020] )
population_total = c(sum2015,sum2016,sum2017,sum2018,sum2019,sum2020)
#add column with years and create dataframe
year = c('2015','2016','2017','2018','2019','2020')
sum_population = as.data.frame(year)
sum_population$population_total <- population_total
sum_population
## year population_total
## 1 2015 33071898
## 2 2016 33417454
## 3 2017 33801884
## 4 2018 34219696
## 5 2019 34662219
## 6 2020 35055754
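As a side note, the six separate sum() calls can be collapsed into one group_by()/summarize() step; a sketch reproducing sum_population:
# Equivalent one-step computation of the yearly totals
sum_population <- population_data_final %>%
  group_by(year) %>%
  summarize(population_total = sum(population)) %>%
  as.data.frame()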
#calculate the year-over-year percent change; we will use the collapse package
library(collapse)
## collapse 2.0.10, see ?`collapse-package` or ?`collapse-documentation`
##
## Attaching package: 'collapse'
## The following object is masked from 'package:lubridate':
##
## is.Date
## The following object is masked from 'package:tidyr':
##
## replace_na
## The following object is masked from 'package:stats':
##
## D
sum_population |> fmutate(growth = fgrowth(population_total))
## year population_total growth
## 1 2015 33071898 NA
## 2 2016 33417454 1.044863
## 3 2017 33801884 1.150387
## 4 2018 34219696 1.236061
## 5 2019 34662219 1.293182
## 6 2020 35055754 1.135343
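The same growth series can also be computed without an extra package, using dplyr::lag(); a small sketch:
# Equivalent year-over-year percent growth with dplyr alone
sum_population %>%
  mutate(growth = (population_total / lag(population_total) - 1) * 100)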
Based on the analysis, the largest combined population growth across the four states was from 2018 to 2019, at about 1.29%. The smallest growth was from 2015 to 2016, at about 1.04%. An interesting finding is that the growth rate increased steadily each year until 2019-2020, when it dropped to about 1.14%, the second-lowest of the five year-over-year changes.
------ Please help with the section below if possible; it is a draft and will be continued and submitted later. ------
Dataset 3 - Nobel Prize Winners.
First we load the data from the stored location on my Github.
nobel <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/complete(1).csv")
Questions to answer with the analysis:
- Which field has received the largest prize amount? Which field has received the lowest?
- What proportion of the awards are given to women vs. men? How has this changed over time?
Looking at the data, there were some empty values in the gender category; on closer inspection these turn out to be organizations. In the code below we replace the empty values in the gender variable with "Organization", and then compare the proportion of awards given to men, women, and organizations.
#read.csv may import the missing genders as NA or as empty strings, so handle both
nobel$gender[is.na(nobel$gender) | nobel$gender == ""] <- "Organization"
### Analysis
gender_freq <- table(nobel$gender)
gender_data <- data.frame(gender = names(gender_freq), frequency = as.numeric(gender_freq))
ggplot(gender_data, aes(x = gender, y = frequency, fill = gender)) +
geom_bar(stat = "identity") +
labs(title = "Frequency of Gender/Organization in Nobel Dataset",
x = "Gender/Organization", y = "Frequency") +
scale_fill_manual(values = c("male" = "lightblue", "female" = "lightpink", "Organization" = "lightgreen"))
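For the question of how the proportion has changed over time, a minimal sketch, assuming the file has an awardYear column as in the Nobel Prize API export (the column name is an assumption):
# Sketch: share of awards going to women, by decade
# (awardYear is an assumed column name in this CSV)
nobel %>%
  mutate(decade = floor(as.numeric(awardYear) / 10) * 10) %>%
  group_by(decade) %>%
  summarize(prop_female = mean(gender == "female")) %>%
  ggplot(aes(x = decade, y = prop_female)) +
  geom_line() +
  geom_point() +
  labs(title = "Proportion of Nobel awards to women by decade",
       x = "Decade", y = "Proportion female")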
As of now I am working on my last dataset and will be finished before the lecture. My remaining work is tidying the data and replacing the NA values with actual values to answer the two questions above. Please advise on the best approach if possible.
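For the prize-amount question, a similar sketch, assuming the file has category and prizeAmount columns as in the Nobel Prize API export (both names are assumptions):
# Sketch: total prize amount by field, largest first
# (category and prizeAmount are assumed column names in this CSV)
nobel %>%
  group_by(category) %>%
  summarize(total_amount = sum(prizeAmount, na.rm = TRUE)) %>%
  arrange(desc(total_amount))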