The purpose of this assignment is to gain practice in preparing different datasets for downstream analysis. The project consists of three main steps, and this R Markdown document follows the steps as requested.
#1: Dataset Selection and Import:
I have chosen three datasets: cancer-poverty, consumer price index, and air quality. These datasets initially appeared interesting and relevant to my topic of interest, cancer and poverty. I prefer the “wide” dataset format, and I’ve decided to work with a combination of wide and long transformations for this project.
#2: Tasks for Each Dataset:
Create a .CSV file (or optionally, a MySQL database) that includes all the information from the dataset.
The goal is to practice tidying and transformations, as described below.
#3: Data Tidying and Transformation:
Read the information from your .CSV file into R.
Utilize tidyr and dplyr as needed to tidy and transform the data.
Provide detailed narrative descriptions of the steps taken, explaining why and how each transformation was performed.
Include information on data cleanup, analysis, and conclusions.
The first step is to choose and import the data. I selected the datasets myself. I’m interested in cancer and want to see whether it’s related to factors such as poverty, lifestyle, and diet.
My next pick is Guillermo Schneider’s data on the consumer price index. This also connects to my interest in poverty and lifestyle, and it’s exactly the kind of information I want to explore. The data looks intriguing, and I’m excited to spend time analyzing and understanding it.
My third choice is Anthony Conrardy’s data on NYC air quality. Air quality could also be linked to cancer, so I want to see whether poor air quality is associated with cancer or other illnesses.
I hope my choices make sense. I plan to connect these datasets and work on them together in a future project.
I am loading the data from local files stored in my Data folder.
Here is my code; it follows the same structure practiced in the previous assignments, with some new tricks used along the way:
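The loading chunk itself is not echoed below, only its output. As a rough sketch of what it does (the file list is taken from the printed messages; the exact helper code is my assumption):
library(tidyverse)
# Hypothetical reconstruction of the un-echoed loading chunk: read each CSV
# from the local Data folder and print its head.
files <- c("Data/Air_Quality_20240302.csv", "Data/cpi-u-202401.csv",
"Data/death.csv", "Data/incd.csv", "Data/PovertyEstimates.csv")
for (f in files) {
df <- read.csv(f) # readr::read_csv() gives the tibble-style print seen for the first file
cat("\nHere is the head of the loaded data from:", f, "\n")
print(head(df))
}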
## [1] "All required packages are installed"
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Here is the head of the loaded data from: Data/Air_Quality_20240302.csv
## # A tibble: 6 × 12
## `Unique ID` `Indicator ID` Name Measure `Measure Info` `Geo Type Name`
## <int> <int> <chr> <chr> <chr> <chr>
## 1 172653 375 Nitrogen di… Mean ppb UHF34
## 2 172585 375 Nitrogen di… Mean ppb UHF34
## 3 336637 375 Nitrogen di… Mean ppb UHF34
## 4 336622 375 Nitrogen di… Mean ppb UHF34
## 5 172582 375 Nitrogen di… Mean ppb UHF34
## 6 667327 375 Nitrogen di… Mean ppb UHF34
## # ℹ 6 more variables: `Geo Join ID` <int>, `Geo Place Name` <chr>,
## # `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
##
## Here is the head of the loaded data from: Data/cpi-u-202401.csv
## [Raw head of the untidied CPI table, which wraps across many lines in the
## original print: two empty rows; a title row, "Consumer Price Index for All
## Urban Consumers (CPI-U): U.S. city average, by expenditure category,
## January 2024 [1982-84=100, unless otherwise noted]"; and a multi-row header
## spanning roughly thirty columns: Indent Level, Expenditure category,
## Relative importance Dec. 2023, thirteen "Unadjusted indexes" columns
## (Jan. 2023 through Jan. 2024), thirteen "Seasonally adjusted indexes"
## columns, unadjusted and seasonally adjusted percent-change columns, plus
## one-month/twelve-month effect, standard-error, and largest/smallest-change
## columns. Every other cell in the preview is NA.]
## Here is the head of the loaded data from: Data/death.csv
## county fips met_objective_of_45_5_1
## 1 United States 0 No
## 2 Perry County, Kentucky 21193 No
## 3 Powell County, Kentucky 21197 No
## 4 North Slope Borough, Alaska 2185 No
## 5 Owsley County, Kentucky 21189 No
## 6 Union County, Florida 12125 No
## age_adjusted_death_rate lower_95_confidence_interval_for_death_rate
## 1 46 45.9
## 2 125.6 108.9
## 3 125.3 100.2
## 4 124.9 73
## 5 118.5 83.1
## 6 113.5 89.9
## upper_95_confidence_interval_for_death_rate average_deaths_per_year
## 1 46.1 157,376
## 2 144.2 43
## 3 155.1 18
## 4 194.7 5
## 5 165.5 8
## 6 141.4 19
## recent_trend_2 recent_5_year_trend_2_in_death_rates
## 1 falling -2.4
## 2 stable -0.6
## 3 stable 1.7
## 4 ** **
## 5 stable 2.2
## 6 falling -2.2
## lower_95_confidence_interval_for_trend upper_95_confidence_interval_for_trend
## 1 -2.6 -2.2
## 2 -2.7 1.6
## 3 0 3.4
## 4 ** **
## 5 -0.4 4.8
## 6 -4.3 0
##
## Here is the head of the loaded data from: Data/incd.csv
## county fips
## 1 US (SEER+NPCR)(1,10) 0
## 2 Autauga County, Alabama(6,10) 1001
## 3 Baldwin County, Alabama(6,10) 1003
## 4 Barbour County, Alabama(6,10) 1005
## 5 Bibb County, Alabama(6,10) 1007
## 6 Blount County, Alabama(6,10) 1009
## age_adjusted_incidence_rate_e_cases_per_100_000 lower_95_confidence_interval
## 1 62.4 62.3
## 2 74.9 65.1
## 3 66.9 62.4
## 4 74.6 61.8
## 5 86.4 71
## 6 69.7 61.2
## upper_95_confidence_interval average_annual_count recent_trend
## 1 62.6 214614 falling
## 2 85.7 43 stable
## 3 71.7 170 stable
## 4 89.4 25 stable
## 5 104.2 23 stable
## 6 79 51 stable
## recent_5_year_trend_in_incidence_rates lower_95_confidence_interval_2
## 1 -2.5 -3
## 2 0.5 -14.9
## 3 3 -10.2
## 4 -6.4 -18.3
## 5 -4.5 -31.4
## 6 -13.6 -27.8
## upper_95_confidence_interval_2
## 1 -2
## 2 18.6
## 3 18.3
## 4 7.3
## 5 32.9
## 6 3.4
##
## Here is the head of the loaded data from: Data/PovertyEstimates.csv
## FIPS_Code Stabr Area_name Attribute Value
## 1 0 US United States POVALL_2021 41393176.0
## 2 0 US United States CI90LBALL_2021 41149497.0
## 3 0 US United States CI90UBALL_2021 41636855.0
## 4 0 US United States PCTPOVALL_2021 12.8
## 5 0 US United States CI90LBALLP_2021 12.7
## 6 0 US United States CI90UBALLP_2021 12.9
At this point, all of the imported data frames are stored in R and named for further tidying and cleanup. In this step we work with the data, and I will be as descriptive as I can in explaining the steps.
As mentioned earlier, one of our primary goals is to change the data format from wide to long and ensure its cleanliness.
Now that we’ve loaded the data, let’s leverage dplyr for the necessary transformations. tidyr comes to our rescue with its versatile functionality, including reshaping data with pivot_longer and pivot_wider, as well as tools for handling missing values (NA) by removal or replacement. If you need a handy reference, see the tidyr cheatsheet.
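Before diving in, here is a toy illustration of the two pivoting verbs on made-up data (not one of the project files):
# Toy example: wide -> long and back (illustrative data only)
toy <- tibble(county = c("A", "B"), `2019` = c(10, 20), `2020` = c(12, 18))
toy_long <- toy %>%
pivot_longer(cols = -county, names_to = "year", values_to = "rate") # wide -> long
toy_long %>%
pivot_wider(names_from = year, values_from = rate) # back to wide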
Our first step involves tidying the data. Upon close examination, I’ve identified several areas for improvement:
Removing NA values
Eliminating empty columns
Identifying headers buried in the middle of the rows
Addressing column names that appear in unexpected places
Disregarding unimportant footnotes
Correcting any misread data
Changing the format of the data to long
Separating cell contents into observation values for season and year
Finally, we’ll pivot and selectively separate columns to convert our data from wide to long format.
Let’s focus our attention on the df_airquality dataframe. Upon reviewing the data, we find that our observations need restructuring, tidying, and cleaning. While the data is relatively organized and tidy, there is still room for improvement.
The proposed improvements are as follows:
Separate the Time Period column into seasons, annual averages, and years.
Conduct statistical analyses to evaluate data quality, examine outliers, and assess the data distribution.
Normalize the data to facilitate comparison and analysis, especially given the large differences in value ranges across indicators.
To this end, I’ve created a new dataframe named df_airquality_norm.
Analysis of the summary statistics suggests that many columns exhibit right-skewed distributions with outliers, particularly in the upper range. Additionally, the Ozone-related values in the Data Value column contain outliers on both the upper and lower ends, which can be seen clearly in the box plots.
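For instance, a quick 1.5*IQR fence count makes the outlier claim concrete; a minimal sketch, assuming the normalized frame df_airquality_norm built in the code below:
# Count 1.5*IQR outliers in the normalized ozone values (a sketch)
ozone <- df_airquality_norm %>%
filter(Name == "Ozone (O3)") %>%
pull(Data_Value_Normalized)
fences <- quantile(ozone, c(0.25, 0.75), na.rm = TRUE) + c(-1.5, 1.5) * IQR(ozone, na.rm = TRUE)
sum(ozone < fences[1] | ozone > fences[2], na.rm = TRUE) # points outside the whiskers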
I also intend to add new columns that split the values in Time Period into seasons and years. To achieve this, I use the separate function from tidyr. First, I split Time Period on a space followed by a digit or a hyphen, using the lookahead regex "(?=\\d|-)", producing Season and Year columns. Then I divide Year into two additional columns, recognizing that the value may be a single year or two years separated by “-”; this is accomplished with the simple regex “-”, resulting in two new columns. Upon reviewing the output, it becomes apparent that some rows contain NA values, which need to be addressed.
I begin by addressing the NA values in the End Year column, which occur when Start Year is a single value without a “-”. In such cases, I set End Year equal to Start Year. Next, I replace the two-character values in the End Year column by prepending “20” to them, converting them into full years.
Finally, I convert the data in the two newly created columns from character to integer. With these adjustments, the data is organized, well structured, and in a wide, tidy format.
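To make the two-step split concrete before applying it, here is a small demo on made-up Time Period values (illustrative strings; the real file may differ slightly):
# Demo of the two-step separate() on sample values
tibble(`Time Period` = c("Winter 2014-15", "Annual Average 2019")) %>%
separate(`Time Period`, into = c("Season", "Year"), sep = " (?=\\d|-)", remove = FALSE) %>%
separate(Year, into = c("Start Year", "End Year"), sep = "-", remove = FALSE)
# The single-year row ends up with an NA End Year, which is exactly the case handled below.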
# Change the type to a tibble
df_airquality <- dplyr::as_tibble(df_airquality)
head(df_airquality)
## # A tibble: 6 × 12
## `Unique ID` `Indicator ID` Name Measure `Measure Info` `Geo Type Name`
## <int> <int> <chr> <chr> <chr> <chr>
## 1 172653 375 Nitrogen di… Mean ppb UHF34
## 2 172585 375 Nitrogen di… Mean ppb UHF34
## 3 336637 375 Nitrogen di… Mean ppb UHF34
## 4 336622 375 Nitrogen di… Mean ppb UHF34
## 5 172582 375 Nitrogen di… Mean ppb UHF34
## 6 667327 375 Nitrogen di… Mean ppb UHF34
## # ℹ 6 more variables: `Geo Join ID` <int>, `Geo Place Name` <chr>,
## # `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
test_df <- df_airquality %>%
filter(Name == 'Annual vehicle miles traveled') %>%
select('Data Value')
# Data normalization
distinc_test <- df_airquality |> distinct(Name)
# Find the maximum of each set of data, grouped by Name
df_airquality_max <- df_airquality %>%
group_by(Name) %>%
summarise(
max_value = max(`Data Value`)
)
# Normalize the data by dividing each `Data Value` by its corresponding maximum value within each Name group
df_airquality_norm <- df_airquality %>%
group_by(Name) %>%
mutate(maximum = max(`Data Value`)) %>% # Calculate the maximum value
mutate(Data_Value_Normalized = `Data Value` / maximum) %>% # Normalize them
ungroup() %>% # Ungroup the data
select(-maximum) # Remove the temporary column added
sum_stat_AQ <- df_airquality_norm %>%
group_by(Name) %>%
summarize(
mean = mean(Data_Value_Normalized),
sd = sd(Data_Value_Normalized),
median = median(Data_Value_Normalized),
skewness = moments::skewness(Data_Value_Normalized),
kurtosis = moments::kurtosis(Data_Value_Normalized)
) %>%
arrange(skewness, kurtosis)
(sum_stat_AQ)
## # A tibble: 18 × 6
## Name mean sd median skewness kurtosis
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ozone (O3) 0.745 0.0803 0.751 -0.503 4.01
## 2 Nitrogen dioxide (NO2) 0.415 0.124 0.409 0.345 3.17
## 3 Fine particles (PM 2.5) 0.497 0.108 0.482 0.621 3.14
## 4 Annual vehicle miles traveled 0.352 0.170 0.314 0.810 3.70
## 5 Annual vehicle miles travelled (cars) 0.351 0.169 0.313 0.834 3.79
## 6 Annual vehicle miles travelled (trucks) 0.336 0.228 0.28 0.888 3.09
## 7 Cardiovascular hospitalizations due to… 0.476 0.165 0.457 0.946 3.77
## 8 Cardiac and respiratory deaths due to … 0.427 0.154 0.397 1.28 4.25
## 9 Deaths due to PM2.5 0.417 0.132 0.390 1.30 5.60
## 10 Asthma emergency departments visits du… 0.255 0.198 0.199 1.31 4.53
## 11 Outdoor Air Toxics - Formaldehyde 0.426 0.137 0.396 1.34 5.03
## 12 Asthma hospitalizations due to Ozone 0.227 0.201 0.157 1.40 4.26
## 13 Respiratory hospitalizations due to PM… 0.332 0.154 0.300 1.53 6.12
## 14 Outdoor Air Toxics - Benzene 0.281 0.154 0.233 1.78 6.72
## 15 Asthma emergency department visits due… 0.168 0.150 0.116 1.82 7.18
## 16 Boiler Emissions- Total NOx Emissions 0.189 0.232 0.0976 1.95 6.07
## 17 Boiler Emissions- Total PM2.5 Emissions 0.121 0.208 0.0263 2.43 8.76
## 18 Boiler Emissions- Total SO2 Emissions 0.110 0.200 0.0196 2.68 10.3
# Let's also plot Q-Q plots and density distributions of the data
df_airquality_norm$Name <- str_wrap(df_airquality_norm$Name, width = 20) # Adjust width as needed
# Plot the Q-Q plots
ggplot(df_airquality_norm, aes(sample = Data_Value_Normalized)) +
geom_qq() +
facet_wrap(~Name)+
theme_minimal()+ # Optional: Adjust plot theme
theme(strip.text = element_text(size = 8),
axis.text = element_text(size = 8)) # Adjust text size
# Let's plot density histograms
ggplot(df_airquality_norm, aes(x = Data_Value_Normalized, fill = Name)) +
geom_density(alpha = 0.5) +
facet_wrap(~ Name, scales = "free") +
theme_minimal() +
theme(strip.text = element_text(size = 5),
axis.text = element_text(size = 5),
legend.text = element_text(size = 5) ) # Adjust text size
# Plot box plots
ggplot(df_airquality_norm, aes(x = Data_Value_Normalized, fill = Name)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~ Name, scales = "free") +
theme_minimal() +
theme(strip.text = element_text(size = 5),
axis.text = element_text(size = 5),
legend.text = element_text(size = 5) ) # Adjust text size
#summary(df_airquality_norm$`Time Period`)
df_airquality_norm %>% filter(is.na(`Time Period`)) %>%
summarize(
count = n()
)
## # A tibble: 1 × 1
## count
## <int>
## 1 0
# Separate Time Period into two columns based on the text and the date.
df_airquality_norm <- df_airquality_norm %>%
separate(`Time Period`, into = c("Season", "Year"), sep = " (?=\\d|-)", remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3246 rows [767, 768, 769,
## 770, 771, 772, 896, 897, 898, 899, 1012, 1013, 1014, 1015, 1043, 1044, 1045,
## 1046, 1047, 1048, ...].
# Separate Year column into Start Year and End Year based on the '-' character
df_airquality_norm <- df_airquality_norm %>%
separate(Year, into = c("Start Year", "End Year"), sep = "-", remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9165 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 10, 11, 13, 15, 16, 18, 19, 20, 22, 24, 25, 27, ...].
# If End Year is NA, set it to be the same as Start Year
df_airquality_norm$`End Year`[is.na(df_airquality_norm$`End Year`)] <- df_airquality_norm$`Start Year`[is.na(df_airquality_norm$`End Year`)]
df_airquality_norm$`End Year` <- ifelse(nchar(df_airquality_norm$`End Year`) == 2, paste0("20", df_airquality_norm$`End Year`), df_airquality_norm$`End Year`)
# Finally, convert the Start Year and End Year columns from character to integer.
df_airquality_norm$`End Year` <- as.integer(df_airquality_norm$`End Year`)
df_airquality_norm$`Start Year` <- as.integer(df_airquality_norm$`Start Year`)
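To verify the separation behaved as intended, a quick look at the distinct combinations helps; a minimal sketch:
# Sanity check: distinct Time Period / Season / Start Year / End Year combinations
df_airquality_norm %>%
distinct(`Time Period`, Season, `Start Year`, `End Year`) %>%
head()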
This dataset appears messier than the previous one. Let’s break down the steps to tidy and organize it:
Empty Rows and Column Placement:
Initially, I noticed empty rows at the beginning of the imported data.
The column headers sit oddly in the middle rows, with parts of each header spread across several lines.
Step 1: Removing the leading empty rows
Step 2: Identifying and removing all remaining empty rows
Step 3: Transitioning from wide to long format
Now, let’s focus on changing the wide structure to a long format. Observing the data, we notice the following:
Two sets of data for each year, broken down by month.
Each set has 13 columns, one per month, under the “Unadjusted indexes” and “Seasonally adjusted indexes” headings.
The data also includes month associations, index report types, and additional information.
Actions for long-format tidying:
Combine the first three rows into column names, adding a delimiter (e.g., " | ") for later use with pivot_longer.
Remove these three rows to simplify the data frame.
Eliminate all columns consisting entirely of NA values (empty columns).
Create a long data frame by splitting the combined column names into “Type,” “Months,” and “More.”
Remaining Challenges:
Although the new data frame is in a long format, it’s not fully organized.
We need to identify the different types of observations and arrange them systematically.
Some Type values contain date information, and the “Months” column sometimes indicates a duration (start and end).
Consider using numeric representations (e.g., 1 for Jan) for clarity.
Additionally, the “Values” column content isn’t numeric; we’ll need to convert it.
Address any odd characters in the “Values” cells by replacing them with NA or numeric values.
It may also make sense to normalize the data (not done yet; see the sketch below).
Finally, I created a new DF and plotted some relevant charts showing the trend of the indexes over 13 months. More work remains, but I stop here to move on to my third dataset.
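As a placeholder for that pending normalization, a minimal sketch mirroring the max-scaling used for the air-quality data could look like this (it assumes the long frame df_CPI_new_L built in the chunk below, with Values already numeric):
# Sketch of the still-pending CPI normalization (not run here)
df_CPI_norm <- df_CPI_new_L %>%
group_by(`Expenditure category`) %>%
mutate(Values_Normalized = Values / max(Values, na.rm = TRUE)) %>%
ungroup()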
# Let's start by removing the two empty rows.
df_CPI_new <- df_CPI[-c(1:2), ]
# Next, remove any row whose values are all NA. To do so, I use rowwise(), check for NA across the columns, flag all-NA rows in a new all_na column, and then filter those rows out of the DF.
# Every column must have a name in the DF; some do not and would need one assigned.
# Rename columns if they are not named (kept for reference):
#colnames(df_CPI_new) <- ifelse(colnames(df_CPI_new) == "", paste0("Col", 1:ncol(df_CPI_new)), colnames(df_CPI_new))
# Combine the content of the first three rows into column names, using " | " as a separator for the later transformation
colnames(df_CPI_new) <- paste(df_CPI_new[1,], df_CPI_new[2,], df_CPI_new[3,], sep = " | ")
# Test to ensure all names are structured as expected for later use
# Define the separator
separator <- " \\| "
# Split column names into three sections using the separator
split_names <- strsplit(colnames(df_CPI_new), separator)
# Identify column names that don't have three sections
wrong_names <- which(sapply(split_names, length) != 3)
# Remove the first three rows from the dataframe
df_CPI_new <- df_CPI_new[-c(1:3), ]
#Test if it works
#df_CPI_na <- df_CPI_new %>%
# rowwise() %>%
# mutate(all_na = all(is.na(c_across())))
df_CPI_new <- df_CPI_new %>%
rowwise() %>%
mutate(all_na = all(is.na(c_across()))) %>%
ungroup() %>%
filter(!all_na)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `all_na = all(is.na(c_across()))`.
## ℹ In row 1.
## Caused by warning:
## ! Using `c_across()` without supplying `cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `cols` instead.
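# The deprecation warning above asks for explicit `cols`; an equivalent that
# avoids rowwise() entirely (dplyr >= 1.0.4) would be, as a sketch:
# df_CPI_new <- df_CPI_new %>% filter(!if_all(everything(), is.na))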
# Find the distinct Indent Levels to use later for filtering and more
indent_levels <- df_CPI_new %>%
select(1) %>%
na.omit() %>%
distinct()
# Use pivot_longer to reshape the data, splitting the combined titles into three different columns
df_CPI_new_L <- df_CPI_new %>%
pivot_longer(cols = (4:ncol(df_CPI_new)) - 1, # columns 3 through ncol-1: the data columns, excluding the trailing all_na flag
names_to = c("Type", "Months", "More"),
names_sep = " \\| ",
values_to = "Values")
# Correct the first two column names
colnames(df_CPI_new_L)[1:2] <- c("Indent Level", "Expenditure category")
# Transform Values to numeric
# Convert the Values column to numeric, after first changing everything to characters
#df_CPI_new_L$Values <- as.numeric(as.character(df_CPI_new_L$Values))
# Some characters are non-numeric; I need to remove them before converting to numbers, but first re-encode everything as UTF-8.
# Remove non-numeric characters from the Values column
# Convert the Values column to UTF-8 encoding
df_CPI_new_L$Values <- iconv(df_CPI_new_L$Values, to = "UTF-8", sub = "byte")
# Keep only number-like characters; replace everything else with the empty string
df_CPI_new_L$Values <- as.character(df_CPI_new_L$Values)
# Clean up the Values column to remove non-numeric characters
df_CPI_new_L$Values <- str_replace_all(df_CPI_new_L$Values, "[^0-9.-]", "")
# Replace non-numeric values with NA
#df_CPI_new_L$Values <- ifelse(grepl("^-?[0-9.]+$", df_CPI_new_L$Values),
# as.numeric(df_CPI_new_L$Values),
# NA)
#df_CPI_new_L$Values <- gsub("[^0-9.-]", "", df_CPI_new_L$Values)
df_CPI_new_L$Values <- as.numeric(as.character(df_CPI_new_L$Values))
## Warning: NAs introduced by coercion
# Find the distinct values in the type-related columns
DF_distincs <- df_CPI_new_L %>%
select(`Indent Level`, Type, Months, More) %>%
distinct() %>%
arrange()
#let's explore the data
#start with summary
sum_test <- df_CPI_new_L %>%
filter(Type == "Unadjusted indexes", `Indent Level` %in% c("0", "1")) %>%
group_by(`Expenditure category`) %>%
summarise(
n = n(),
mean = mean(Values, na.rm = T)
)
print(sum_test)
## # A tibble: 28 × 3
## `Expenditure category` n mean
## <chr> <int> <dbl>
## 1 All items 13 305.
## 2 All items less energy 13 310.
## 3 All items less food 13 302.
## 4 All items less food and energy 13 309.
## 5 All items less food and shelter 13 267.
## 6 All items less food, shelter, and energy 13 269.
## 7 All items less food, shelter, energy, and used cars and trucks 13 273.
## 8 All items less medical care 13 293.
## 9 All items less shelter 13 278.
## 10 Apparel less footwear 13 122.
## # ℹ 18 more rows
# Let's plot a simple chart for the All items, Unadjusted indexes
df_CPI_new_L %>%
filter(Type == "Unadjusted indexes", `Indent Level` == "0", `Expenditure category` == "All items") %>%
arrange(Months) %>%
ggplot(aes(x = Months, y = Values)) +
geom_point(alpha = 0.5, color = "blue") + # Specify a color for the line
theme_minimal() +
theme(strip.text = element_text(size = 5),
axis.text = element_text(size = 5),
legend.text = element_text(size = 5))
# Extract all Unadjusted indexes to a new DF and start exploring them; in this process we also split the Months column into month number and year to plot some comparative data
CPI_Unadjsuted_Monthly <- df_CPI_new_L %>%
filter(Type == "Unadjusted indexes") %>%
group_by(`Expenditure category`)
# To plot, we want to separate month from year and add factors for months
CPI_Unadjsuted_Monthly$Months <- as.character(CPI_Unadjsuted_Monthly$Months) # Convert to character (in case it's not already)
CPI_Unadjsuted_Monthly$Month_numeric <- NA # Create a new column for numeric month
CPI_Unadjsuted_Monthly$Year_numeric <- NA # Create a new column for numeric year
# Loop through each row and extract the month and year
for (i in 1:nrow(CPI_Unadjsuted_Monthly)) {
months <- unlist(strsplit(CPI_Unadjsuted_Monthly$Months[i], "[[:space:]]+")) # Split the string by spaces
month <- match(sub("\\.$", "", months[1]), month.abb) # Get the numeric representation of the month
year <- as.numeric(months[2]) # Extract the year
CPI_Unadjsuted_Monthly$Month_numeric[i] <- month
CPI_Unadjsuted_Monthly$Year_numeric[i] <- year
}
# Convert the month and year columns to factors for the sake of plotting
CPI_Unadjsuted_Monthly$Month_numeric <- factor(CPI_Unadjsuted_Monthly$Month_numeric, levels = 1:12, labels = month.abb)
CPI_Unadjsuted_Monthly$Year_numeric <- factor(CPI_Unadjsuted_Monthly$Year_numeric)
# Reorder the levels of Month_numeric based on Year_numeric, since the year also changes across the series
CPI_Unadjsuted_Monthly$Month_numeric <- factor(CPI_Unadjsuted_Monthly$Month_numeric, levels = unique(CPI_Unadjsuted_Monthly$Month_numeric)[order(CPI_Unadjsuted_Monthly$Year_numeric)])
# Plot the data
CPI_Unadjsuted_Monthly %>%
filter(`Indent Level` == "0", `Expenditure category` == "All items") %>%
ggplot(aes(x = Month_numeric, y = Values)) +
geom_point(alpha = 0.5, color = "blue") + # Specify a color for the line
theme_minimal() +
theme(strip.text = element_text(size = 5),
axis.text = element_text(size = 5),
legend.text = element_text(size = 5))
# Plot a graph comparing the "Unadjusted indexes" of levels 0 and 1 across the 13 months against level 0 (All items)
CPI_Unadjsuted_Monthly %>%
filter(`Indent Level` %in% c("0", "1")) %>%
group_by(`Expenditure category`) %>%
ggplot(aes(x = Month_numeric, y = Values, color = `Expenditure category`)) + # Use color to identify them
geom_point(alpha = 0.5) +
theme_minimal() +
theme(strip.text = element_text(size = 5),
axis.text = element_text(size = 5),
legend.text = element_text(size = 5))
## Warning: Removed 13 rows containing missing values (`geom_point()`).
CPI_Unadjsuted_Monthly %>%
filter(`Indent Level` %in% c("0", "1")) %>%
filter(`Expenditure category` %in% c("Energy", "Food", "Housing", "All items", "Services", "Medical care", "Fuels and utilities")) %>%
ggplot(aes(x = Month_numeric, y = Values, color = `Expenditure category`)) + # Use color to identify them
geom_point(alpha = 0.5) +
theme_minimal() +
theme(strip.text = element_text(size = 8),
axis.text = element_text(size = 8),
legend.text = element_text(size = 8))
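As an aside, the row-wise loop above could be vectorized with separate() and match(); a sketch, assuming every Months value in this filtered frame looks like "Jan.\n2023":
# Vectorized alternative to the month/year extraction loop (a sketch)
CPI_vectorized <- CPI_Unadjsuted_Monthly %>%
separate(Months, into = c("Month_abb", "Year_chr"), sep = "[[:space:]]+", remove = FALSE) %>%
mutate(Month_numeric = match(sub("\\.$", "", Month_abb), month.abb),
Year_numeric = as.numeric(Year_chr))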
These two DFs are pretty clean and organized; one can tell by looking at the data and running some summary statistics.
I start with summary statistics to assess the quality of the data. First, we remove the rows where recent_trend_2 gives no indication (rising, falling, or stable), and then figure out whether the general trend is positive or negative. The overall result indicates that the general trend across all the counties is falling.
# Let's first do some summary statistics and evaluate the data.
# Plot the recent trends and evaluate whether the general trend is positive or negative
# Remove the recent_trend_2 rows with no information
print("take a sneak peek at the DF")
## [1] "take a sneak pick of the DF"
head(df_death,10)
## county fips met_objective_of_45_5_1
## 1 United States 0 No
## 2 Perry County, Kentucky 21193 No
## 3 Powell County, Kentucky 21197 No
## 4 North Slope Borough, Alaska 2185 No
## 5 Owsley County, Kentucky 21189 No
## 6 Union County, Florida 12125 No
## 7 McCreary County, Kentucky 21147 No
## 8 Leslie County, Kentucky 21131 No
## 9 Martin County, Kentucky 21159 No
## 10 Menifee County, Kentucky 21165 No
## age_adjusted_death_rate lower_95_confidence_interval_for_death_rate
## 1 46 45.9
## 2 125.6 108.9
## 3 125.3 100.2
## 4 124.9 73
## 5 118.5 83.1
## 6 113.5 89.9
## 7 111.1 90.6
## 8 110.3 87
## 9 109.1 84.8
## 10 106 76.4
## upper_95_confidence_interval_for_death_rate average_deaths_per_year
## 1 46.1 157,376
## 2 144.2 43
## 3 155.1 18
## 4 194.7 5
## 5 165.5 8
## 6 141.4 19
## 7 134.9 22
## 8 138.5 16
## 9 138.3 14
## 10 144.4 9
## recent_trend_2 recent_5_year_trend_2_in_death_rates
## 1 falling -2.4
## 2 stable -0.6
## 3 stable 1.7
## 4 ** **
## 5 stable 2.2
## 6 falling -2.2
## 7 rising 22.9
## 8 stable 0.8
## 9 stable 1.3
## 10 stable 1.4
## lower_95_confidence_interval_for_trend
## 1 -2.6
## 2 -2.7
## 3 0
## 4 **
## 5 -0.4
## 6 -4.3
## 7 6.9
## 8 -0.7
## 9 -0.8
## 10 -1.4
## upper_95_confidence_interval_for_trend
## 1 -2.2
## 2 1.6
## 3 3.4
## 4 **
## 5 4.8
## 6 0
## 7 41.4
## 8 2.4
## 9 3.4
## 10 4.3
# Calculate summary statistics
trend_summary <- df_death %>%
filter(recent_trend_2 %in% c("stable","falling","rising") ) %>%
summarise(
total = n(),
rising = sum(recent_trend_2 == "rising"),
falling = sum(recent_trend_2 == "falling"),
stable = sum(recent_trend_2 == "stable"),
overall_trend = ifelse(rising > falling, "Rising Cancer Trend", ifelse(rising < falling, "Falling Cancer Trend", "Neutral Trend"))
)
print(trend_summary)
## total rising falling stable overall_trend
## 1 2694 31 867 1796 Falling Cancer Trend
# Reshape the data to long format
trend_summary_long <- tidyr::pivot_longer(trend_summary, cols = c(rising, falling, stable),names_to = "trend_type", values_to = "count")
# Plot a bar chart
ggplot(trend_summary_long, aes(x = trend_type, y = count, fill = trend_type)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "summary of the cancer cases in US", x = "General trends", y = "Counts" )
# Remove rows where recent_trend_2 has no trend indication
df_death <- df_death %>%
filter(recent_trend_2 %in% c("stable","falling","rising") )
# county has the state name at the end after a comma; I would like to separate county into county and State
df_death <- df_death %>%
separate(county, sep = ",",
into = c("county", "State"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1,
## 2238].
#convert the data to numeric
# Convert columns 5 to 8 from character to numeric
df_death <- df_death %>%
mutate_at(vars(5:8), as.numeric)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `average_deaths_per_year =
## .Primitive("as.double")(average_deaths_per_year)`.
## Caused by warning:
## ! NAs introduced by coercion
# Convert columns 10 to 12 from character to numeric
df_death <- df_death %>%
mutate_at(vars(10:12), as.numeric)
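# mutate_at() is superseded in current dplyr; a single across() call covers both
# conversions, as a sketch: df_death <- df_death %>% mutate(across(c(5:8, 10:12), as.numeric))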
state_sum <- df_death %>%
group_by(State) %>%
summarize(
death_sum = sum(average_deaths_per_year, na.rm=TRUE)
)
# Plot total average annual deaths by state
ggplot(state_sum, aes(x = State, y = death_sum)) +
geom_bar(stat = "identity", fill = "plum") +
labs(x = "State", y = "Total average deaths per year", title = "Average Annual Cancer Deaths by State") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))