Data preparation

The purpose of this assignment is to gain practice in preparing different datasets for downstream analysis. The project consists of three main steps, and this R Markdown document follows the steps as requested.

#1: Dataset Selection and Import:

I have chosen three datasets: cancer-poverty, consumer index, and air quality. These datasets initially appeared interesting and relevant to my topic of interest cancer and poverty. I prefer using the “wide” dataset format, and I’ve decided to work with a combination of wide and log transformations for this project.

#2: Tasks for Each Dataset:

  1. Create a .CSV file (or optionally, a MySQL database) that includes all the information from the dataset.

  2. The goal is to practice tidying and transformations, as described below.

#3: Data Tidying and Transformation:

  1. Read the information from your .CSV file into R.

  2. Utilize tidyr and dplyr as needed to tidy and transform the data.

  3. Provide detailed narrative descriptions of the steps taken, explaining why and how each transformation was performed.

  4. Include information on data cleanup, analysis, and conclusions.

Step 1: Choosing and importing datasets

The first step is to choose and bring in data. I chose my own database. I’m interested in cancer and want to see if it’s related to things like poverty, lifestyle, diet, or other factors in life.

My next pick is Guillermo Schneider’s data about the consumer price index. This also connects to my interest in poverty and lifestyle, and it’s the kind of info I’m really interested in. The data looks intriguing and I’m excited to spend time analyzing and understanding it.

My third choice is Anthony Conrardy’s data on NYC air quality. This data could also be linked to cancer, so I want to see if poor air quality might be causing cancer or other illnesses.

I hope my choices make sense. I plan to connect these datasets and work on them together in the future as a project of mine.

#1: Importing datasets to RStudio

I am trying to load data from the local files that are stored in my Data folder.

Here is my code, it follows the same structure practiced in the previous assignments ans some new tricks used in between:

## [1] "All required packages are installed"
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
##  Here is the head of the laoded data from: Data/Air_Quality_20240302.csv
## # A tibble: 6 × 12
##   `Unique ID` `Indicator ID` Name         Measure `Measure Info` `Geo Type Name`
##         <int>          <int> <chr>        <chr>   <chr>          <chr>          
## 1      172653            375 Nitrogen di… Mean    ppb            UHF34          
## 2      172585            375 Nitrogen di… Mean    ppb            UHF34          
## 3      336637            375 Nitrogen di… Mean    ppb            UHF34          
## 4      336622            375 Nitrogen di… Mean    ppb            UHF34          
## 5      172582            375 Nitrogen di… Mean    ppb            UHF34          
## 6      667327            375 Nitrogen di… Mean    ppb            UHF34          
## # ℹ 6 more variables: `Geo Join ID` <int>, `Geo Place Name` <chr>,
## #   `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
## 
##  Here is the head of the laoded data from: Data/cpi-u-202401.csv
##               
## 1         <NA>
## 2         <NA>
## 3 Indent Level
## 4         <NA>
## 5         <NA>
## 6         <NA>
##   Consumer Price Index for All Urban Consumers (CPI-U): U.S. city average, by expenditure category, January 2024
## 1                                                                          [1982-84=100, unless otherwise noted]
## 2                                                                                                           <NA>
## 3                                                                                           Expenditure category
## 4                                                                                                           <NA>
## 5                                                                                                           <NA>
## 6                                                                                                           <NA>
##                                                                         
## 1                             <NA>               <NA>               <NA>
## 2                             <NA>               <NA>               <NA>
## 3 Relative\nimportance\nDec.\n2023 Unadjusted indexes Unadjusted indexes
## 4                             <NA>         Jan.\n2023         Feb.\n2023
## 5                             <NA>               <NA>               <NA>
## 6                             <NA>               <NA>               <NA>
##                                                                              
## 1               <NA>               <NA>               <NA>               <NA>
## 2               <NA>               <NA>               <NA>               <NA>
## 3 Unadjusted indexes Unadjusted indexes Unadjusted indexes Unadjusted indexes
## 4         Mar.\n2023         Apr.\n2023          May\n2023         Jun.\n2023
## 5               <NA>               <NA>               <NA>               <NA>
## 6               <NA>               <NA>               <NA>               <NA>
##                                                                              
## 1               <NA>               <NA>               <NA>               <NA>
## 2               <NA>               <NA>               <NA>               <NA>
## 3 Unadjusted indexes Unadjusted indexes Unadjusted indexes Unadjusted indexes
## 4         Jul.\n2023         Aug.\n2023         Sep.\n2023         Oct.\n2023
## 5               <NA>               <NA>               <NA>               <NA>
## 6               <NA>               <NA>               <NA>               <NA>
##                                                           
## 1               <NA>               <NA>               <NA>
## 2               <NA>               <NA>               <NA>
## 3 Unadjusted indexes Unadjusted indexes Unadjusted indexes
## 4         Nov.\n2023         Dec.\n2023         Jan.\n2024
## 5               <NA>               <NA>               <NA>
## 6               <NA>               <NA>               <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                  Jan.\n2023                  Feb.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                  Mar.\n2023                  Apr.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                   May\n2023                  Jun.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                  Jul.\n2023                  Aug.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                  Sep.\n2023                  Oct.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                          
## 1                        <NA>                        <NA>
## 2                        <NA>                        <NA>
## 3 Seasonally adjusted indexes Seasonally adjusted indexes
## 4                  Nov.\n2023                  Dec.\n2023
## 5                        <NA>                        <NA>
## 6                        <NA>                        <NA>
##                                                        
## 1                        <NA>                      <NA>
## 2                        <NA>                      <NA>
## 3 Seasonally adjusted indexes Unadjusted percent change
## 4                  Jan.\n2024   Jan.\n2023-\nJan.\n2024
## 5                        <NA>                      <NA>
## 6                        <NA>                      <NA>
##                                                                                
## 1                      <NA>                      <NA>                      <NA>
## 2                      <NA>                      <NA>                      <NA>
## 3 Unadjusted percent change Unadjusted percent change Unadjusted percent change
## 4   Oct.\n2023-\nNov.\n2023   Nov.\n2023-\nDec.\n2023   Dec.\n2023-\nJan.\n2024
## 5                      <NA>                      <NA>                      <NA>
## 6                      <NA>                      <NA>                      <NA>
##                                                                        
## 1                               <NA>                               <NA>
## 2                               <NA>                               <NA>
## 3 Seasonally adjusted percent change Seasonally adjusted percent change
## 4            Oct.\n2023-\nNov.\n2023            Nov.\n2023-\nDec.\n2023
## 5                               <NA>                               <NA>
## 6                               <NA>                               <NA>
##                                     
## 1                               <NA>
## 2                               <NA>
## 3 Seasonally adjusted percent change
## 4            Dec.\n2023-\nJan.\n2024
## 5                               <NA>
## 6                               <NA>
##                                                                    
## 1                                                              <NA>
## 2                                                              <NA>
## 3                                                         One Month
## 4 Seasonally adjusted effect on All Items\nDec. 2023-\nJan. 2024(1)
## 5                                                              <NA>
## 6                                                              <NA>
##                                         
## 1                                   <NA>
## 2                                   <NA>
## 3                              One Month
## 4 Standard error, median price change(2)
## 5                                   <NA>
## 6                                   <NA>
##                                                                   
## 1                                                             <NA>
## 2                                                             <NA>
## 3                                                        One Month
## 4 Largest (L) or Smallest (S) seasonally adjusted change since:(3)
## 5                                                             Date
## 6                                                             <NA>
##                                                                   
## 1                                                             <NA>
## 2                                                             <NA>
## 3                                                        One Month
## 4 Largest (L) or Smallest (S) seasonally adjusted change since:(3)
## 5                                                   Percent change
## 6                                                             <NA>
##                                                           
## 1                                                     <NA>
## 2                                                     <NA>
## 3                                             Twelve Month
## 4 Unadjusted effect on All Items\nJan. 2023-\nJan. 2024(1)
## 5                                                     <NA>
## 6                                                     <NA>
##                                         
## 1                                   <NA>
## 2                                   <NA>
## 3                           Twelve Month
## 4 Standard error, median price change(2)
## 5                                   <NA>
## 6                                   <NA>
##                                                          
## 1                                                    <NA>
## 2                                                    <NA>
## 3                                            Twelve Month
## 4 Largest (L) or Smallest (S) unadjusted change since:(3)
## 5                                                    Date
## 6                                                    <NA>
##                                                          
## 1                                                    <NA>
## 2                                                    <NA>
## 3                                            Twelve Month
## 4 Largest (L) or Smallest (S) unadjusted change since:(3)
## 5                                          Percent change
## 6                                                    <NA>
## 
##  Here is the head of the laoded data from: Data/death.csv
##                        county  fips met_objective_of_45_5_1
## 1               United States     0                      No
## 2      Perry County, Kentucky 21193                      No
## 3     Powell County, Kentucky 21197                      No
## 4 North Slope Borough, Alaska  2185                      No
## 5     Owsley County, Kentucky 21189                      No
## 6       Union County, Florida 12125                      No
##   age_adjusted_death_rate lower_95_confidence_interval_for_death_rate
## 1                      46                                        45.9
## 2                   125.6                                       108.9
## 3                   125.3                                       100.2
## 4                   124.9                                          73
## 5                   118.5                                        83.1
## 6                   113.5                                        89.9
##   upper_95_confidence_interval_for_death_rate average_deaths_per_year
## 1                                        46.1                 157,376
## 2                                       144.2                      43
## 3                                       155.1                      18
## 4                                       194.7                       5
## 5                                       165.5                       8
## 6                                       141.4                      19
##   recent_trend_2 recent_5_year_trend_2_in_death_rates
## 1        falling                                 -2.4
## 2         stable                                 -0.6
## 3         stable                                  1.7
## 4             **                                   **
## 5         stable                                  2.2
## 6        falling                                 -2.2
##   lower_95_confidence_interval_for_trend upper_95_confidence_interval_for_trend
## 1                                   -2.6                                   -2.2
## 2                                   -2.7                                    1.6
## 3                                      0                                    3.4
## 4                                     **                                     **
## 5                                   -0.4                                    4.8
## 6                                   -4.3                                      0
## 
##  Here is the head of the laoded data from: Data/incd.csv
##                          county fips
## 1          US (SEER+NPCR)(1,10)    0
## 2 Autauga County, Alabama(6,10) 1001
## 3 Baldwin County, Alabama(6,10) 1003
## 4 Barbour County, Alabama(6,10) 1005
## 5    Bibb County, Alabama(6,10) 1007
## 6  Blount County, Alabama(6,10) 1009
##   age_adjusted_incidence_rate_e_cases_per_100_000 lower_95_confidence_interval
## 1                                            62.4                         62.3
## 2                                            74.9                         65.1
## 3                                            66.9                         62.4
## 4                                            74.6                         61.8
## 5                                            86.4                           71
## 6                                            69.7                         61.2
##   upper_95_confidence_interval average_annual_count recent_trend
## 1                         62.6               214614      falling
## 2                         85.7                   43       stable
## 3                         71.7                  170       stable
## 4                         89.4                   25       stable
## 5                        104.2                   23       stable
## 6                           79                   51       stable
##   recent_5_year_trend_in_incidence_rates lower_95_confidence_interval_2
## 1                                   -2.5                             -3
## 2                                    0.5                          -14.9
## 3                                      3                          -10.2
## 4                                   -6.4                          -18.3
## 5                                   -4.5                          -31.4
## 6                                  -13.6                          -27.8
##   upper_95_confidence_interval_2
## 1                             -2
## 2                           18.6
## 3                           18.3
## 4                            7.3
## 5                           32.9
## 6                            3.4
## 
##  Here is the head of the laoded data from: Data/PovertyEstimates.csv
##   FIPS_Code Stabr     Area_name       Attribute      Value
## 1         0    US United States     POVALL_2021 41393176.0
## 2         0    US United States  CI90LBALL_2021 41149497.0
## 3         0    US United States  CI90UBALL_2021 41636855.0
## 4         0    US United States  PCTPOVALL_2021       12.8
## 5         0    US United States CI90LBALLP_2021       12.7
## 6         0    US United States CI90UBALLP_2021       12.9

At this point, all the 4 DFs are stored into R-Studio and names are assigned to them for further tidying and cleaning them up. In this step, we work with data to and I will be as descriptive as I can to explain the steps.

#2&3: Tidying and data clean up

As mentioned earlier, one of our primary goals is to change the data format from wide to long and ensure its cleanliness.

Now that we’ve loaded the data, let’s leverage dplyr for the necessary transformations.

tidyr comes to our rescue with its versatile functionalities, including reshaping data using pivot_longer and pivot_wider. It also provides tools for handling missing values (NA), either by removal or replacement. If you need a handy tidyr cheatsheet, just click tidyr cheatsheet.

Our first step involves tidying the data. Upon close examination, I’ve identified several areas for improvement:

Finally, we’ll pivot and selectively separate columns to convert our data from wide to long format.

2&3-1: Tidying the AirQuality DF

Let’s focus our attention on the df_airquality dataframe. Upon reviewing the data, we find that our observations necessitate restructuring, tidying, and cleaning. While the data is relatively organized and tidy, areas for improvement can still be identified.

The proposed improvements are as follows:

  • There is a column called Time Period that needs to be separated into seasons, annual averages, and years.

  • Conduct data statistical analyses to evaluate data quality, examine outliers, and assess data distribution.

  • Normalize the data to facilitate comparison and analysis, especially considering significant changes in values.

To this end, I’ve created a new dataframe named *_norm. Analysis of the data statistics suggests that many columns exhibit right-skewed distributions with outliers, particularly in the upper range. Additionally, the Ozone relted values in Data Valuecolumn contain outliers on both the upper and lower ends. This can be seen in the box plots clearly.

I also intend to add a new column to separate the values in Time Period into seasons and years. To achieve this, I utilize the separate function from dplyr. Firstly, I separate the Time Period based on the presence of a character followed by a space and a number, using the regex "(?=\\d|-)" to split the column into Season and Year. Then, I divide the Year into two additional columns, recognizing that the value could be either a character or a number separated by “-”. This is accomplished with the simple regex “-”, resulting in two new columns. Upon reviewing the data and assessing the output, it becomes apparent that some rows contain NA values, which need to be addressed.

I begin by addressing the NA values in the End Year column. This occurs when the Start Year is a single value without a “-”. In such cases, I assign the End Year to be the same as the Start Year. Subsequently, I replace the two-character values in the End Year column by appending “20” to them, thus converting them into full years.

Finally, I convert the data in the two newly created columns from characters to integers. With these adjustments, the data appears organized, well-structured, and in wide and tidy format.

# Change the type to a tibble 
df_airquality <- dplyr::as_tibble(df_airquality)
head(df_airquality)
## # A tibble: 6 × 12
##   `Unique ID` `Indicator ID` Name         Measure `Measure Info` `Geo Type Name`
##         <int>          <int> <chr>        <chr>   <chr>          <chr>          
## 1      172653            375 Nitrogen di… Mean    ppb            UHF34          
## 2      172585            375 Nitrogen di… Mean    ppb            UHF34          
## 3      336637            375 Nitrogen di… Mean    ppb            UHF34          
## 4      336622            375 Nitrogen di… Mean    ppb            UHF34          
## 5      172582            375 Nitrogen di… Mean    ppb            UHF34          
## 6      667327            375 Nitrogen di… Mean    ppb            UHF34          
## # ℹ 6 more variables: `Geo Join ID` <int>, `Geo Place Name` <chr>,
## #   `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
test_df <- df_airquality %>% 
  filter(Name == 'Annual vehicle miles traveled') %>% 
  select('Data Value')
  
#Data nromalization

distinc_test <-  df_airquality |> distinct(Name)

#Fidn maximum of each set of data based grouped by name
df_airquality_max <- df_airquality %>%
  group_by(Name) %>%
  summarise(
    max_value      = max(`Data Value`)
  )

# Normalize the data by dividing each `Data Value` by its corresponding maximum value within each Name group

df_airquality_norm <- df_airquality %>%
  group_by(Name) %>%
  mutate(maximum = max(`Data Value`)) %>%  # Calculate the maximum value
  mutate(Data_Value_Normalized = `Data Value` / maximum) %>%  # Normalize them
  ungroup() %>%  # Ungroup the data
  select(-maximum)  # Remove the temporary column added

  
sum_stat_AQ <- df_airquality_norm %>%
  group_by(Name) %>%
  summarize(
    mean     = mean(Data_Value_Normalized),
    sd       = sd(Data_Value_Normalized),
    median   = median(Data_Value_Normalized),
    skewness = moments::skewness(Data_Value_Normalized),
    kurtosis = moments::kurtosis(Data_Value_Normalized)
  ) %>%
  arrange(skewness, kurtosis)
  
(sum_stat_AQ)
## # A tibble: 18 × 6
##    Name                                     mean     sd median skewness kurtosis
##    <chr>                                   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
##  1 Ozone (O3)                              0.745 0.0803 0.751    -0.503     4.01
##  2 Nitrogen dioxide (NO2)                  0.415 0.124  0.409     0.345     3.17
##  3 Fine particles (PM 2.5)                 0.497 0.108  0.482     0.621     3.14
##  4 Annual vehicle miles traveled           0.352 0.170  0.314     0.810     3.70
##  5 Annual vehicle miles travelled (cars)   0.351 0.169  0.313     0.834     3.79
##  6 Annual vehicle miles travelled (trucks) 0.336 0.228  0.28      0.888     3.09
##  7 Cardiovascular hospitalizations due to… 0.476 0.165  0.457     0.946     3.77
##  8 Cardiac and respiratory deaths due to … 0.427 0.154  0.397     1.28      4.25
##  9 Deaths due to PM2.5                     0.417 0.132  0.390     1.30      5.60
## 10 Asthma emergency departments visits du… 0.255 0.198  0.199     1.31      4.53
## 11 Outdoor Air Toxics - Formaldehyde       0.426 0.137  0.396     1.34      5.03
## 12 Asthma hospitalizations due to Ozone    0.227 0.201  0.157     1.40      4.26
## 13 Respiratory hospitalizations due to PM… 0.332 0.154  0.300     1.53      6.12
## 14 Outdoor Air Toxics - Benzene            0.281 0.154  0.233     1.78      6.72
## 15 Asthma emergency department visits due… 0.168 0.150  0.116     1.82      7.18
## 16 Boiler Emissions- Total NOx Emissions   0.189 0.232  0.0976    1.95      6.07
## 17 Boiler Emissions- Total PM2.5 Emissions 0.121 0.208  0.0263    2.43      8.76
## 18 Boiler Emissions- Total SO2 Emissions   0.110 0.200  0.0196    2.68     10.3
#Let's also plot QQ and density distribution of data 

df_airquality_norm$Name <- str_wrap(df_airquality_norm$Name, width = 20)  # Adjust width as needed

# let's plot Q-Q s 
ggplot(df_airquality_norm, aes(sample = Data_Value_Normalized)) +
  geom_qq() +
  facet_wrap(~Name)+
  theme_minimal()+  # Optional: Adjust plot theme 
  theme(strip.text = element_text(size = 8),
        axis.text = element_text(size = 8))  # Adjust text size

# Let's plot density histograms
ggplot(df_airquality_norm, aes(x = Data_Value_Normalized, fill = Name)) +
  geom_density(alpha = 0.5) + 
  facet_wrap(~ Name, scales = "free") +  
  theme_minimal() +
  theme(strip.text = element_text(size = 5),
        axis.text = element_text(size = 5),
        legend.text = element_text(size = 5) )  # Adjust text size

#plot box plot 
ggplot(df_airquality_norm, aes(x = Data_Value_Normalized, fill = Name)) +
  geom_boxplot(alpha = 0.5) + 
  facet_wrap(~ Name, scales = "free") +  
  theme_minimal() +
  theme(strip.text = element_text(size = 5),
        axis.text = element_text(size = 5),
        legend.text = element_text(size = 5) )  # Adjust text size

#summary(df_airquality_norm$`Time Period`)

df_airquality_norm %>% filter(is.na(`Time Period`)) %>%
  summarize(
    count = n()
  )
## # A tibble: 1 × 1
##   count
##   <int>
## 1     0
#Separate Time Period to two columns based on the text and date.   
df_airquality_norm <- df_airquality_norm %>%
  separate(`Time Period`, into = c("Season", "Year"), sep = " (?=\\d|-)", remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3246 rows [767, 768, 769,
## 770, 771, 772, 896, 897, 898, 899, 1012, 1013, 1014, 1015, 1043, 1044, 1045,
## 1046, 1047, 1048, ...].
# Separate Year column into Start Year and End Year based on the '-' character
df_airquality_norm <- df_airquality_norm %>%
  separate(Year, into = c("Start Year", "End Year"), sep = "-", remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9165 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 10, 11, 13, 15, 16, 18, 19, 20, 22, 24, 25, 27, ...].
# If End Year is NA, set it to be the same as Start Year
df_airquality_norm$`End Year`[is.na(df_airquality_norm$`End Year`)] <- df_airquality_norm$`Start Year`[is.na(df_airquality_norm$`End Year`)]

df_airquality_norm$`End Year` <- ifelse(nchar(df_airquality_norm$`End Year`) == 2, paste0("20", df_airquality_norm$`End Year`), df_airquality_norm$`End Year`)

# and finally we want to change the characters to number for columns Start Year and End Year. 

df_airquality_norm$`End Year` <-  as.integer(df_airquality_norm$`End Year`)
df_airquality_norm$`Start Year` <-  as.integer(df_airquality_norm$`Start Year`)

#2&3-2: Tidting the df_CPI

This dataset appears messier than the previous one. Let’s break down the steps to tidy and organize it:

  1. Empty Rows and Column Placement:

    • Initially, I noticed empty rows at the beginning of the imported data.

    • The columns are oddly placed in the middle, with data appearing on the second line.

  2. Step 1: Removing Empty Rows

    • Our first action is to remove these two empty rows and create a new data frame.
  3. Step 2: Identifying and Removing All Empty Rows

    • Next, we’ll go rowwise to find and eliminate any remaining empty rows from the database.
  4. Step 3: Transitioning from Wide to Long Format

    • Now, let’s focus on changing the wide structure to a long format.

    • Observing the data, we notice the following:

      • Two sets of data for each year, relevant to months.

      • Each set has columns repeated 13 times—one for each month—under the “Unadjusted indexes” and “Seasonally adjusted indexes.”

      • The data includes month associations, index report types, and additional information.

  5. Actions for Long-Format Tidying:

    • Combine the first three rows and add a delimiter (e.g., ” | “) for later use with pivot_longer.

    • Remove these three rows to simplify the data frame.

    • Eliminate all columns with NA values (empty columns).

    • Create a long data frame by splitting the combined column names into “Type,” “Month,” and “More.”

  6. Remaining Challenges:

    • Although the new data frame is in a long format, it’s not fully organized.

    • We need to identify different types of observations and arrange them systematically.

    • Some type columns contain date data, and the “Month” column seems to indicate a duration (start and end).

    • Consider using numeric representations (e.g., 1 for Jan) for clarity.

    • Additionally, the “Values” column content isn’t numeric; we’ll need to convert it.

    • Address any weird characters in the “Values” cells by replacing them with NA or numeric values.

    • Additionally, it also may make sense to normalize the data. (note doneyet)

Finally, I created a new DF and plot some releavnt plots to show the trend of the indexes over 13 months. It is still required work, but I stop here to work on the third dataframe of mine.

# Let's start by removing the two empty rows. 
#tibble initiation 

df_CPI_new <- df_CPI[-c(1:2), ]

# I need to remove the column if those all are NA. To do so, I use rowwise, do the sum on NA accrsoss the colu,n and if all are NA put TRUE at the newly created column of all_na and then remove it from the DF 

# all column must have name in the DF, some does nto and I need to assign them a name 
# Rename columns if they are not named
#colnames(df_CPI_new) <- ifelse(colnames(df_CPI_new) == "", paste0("Col", 1:ncol(df_CPI_new)), colnames(df_CPI_new))

# Combine the content of the first three rows into column names, we use " | " as a seperator to use later for using it to transform the data
# Combine the content of the three first columns into combined column names

colnames(df_CPI_new) <- paste(df_CPI_new[1,], df_CPI_new[2,], df_CPI_new[3,], sep = " | ") 

# Test to esnure all are represented as expected to be used later 
# Define the separator
separator <- " \\| "

# Split column names into three sections using the separator
split_names <- strsplit(colnames(df_CPI_new), separator)

# Identify column names that don't have three sections
wrong_names <- which(sapply(split_names, length) != 3)


# Remove the first three rows from the dataframe
df_CPI_new <- df_CPI_new[-c(1:3), ]


#Test if it works
#df_CPI_na <- df_CPI_new %>%
#  rowwise() %>%
#  mutate(all_na = all(is.na(c_across())))


df_CPI_new <- df_CPI_new %>%
  rowwise() %>%
  mutate(all_na = all(is.na(c_across()))) %>%
  ungroup() %>%
  filter(!all_na)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `all_na = all(is.na(c_across()))`.
## ℹ In row 1.
## Caused by warning:
## ! Using `c_across()` without supplying `cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `cols` instead.
#fidn the distinc Intent Level to use later for filtering and more 
indent_levels <- df_CPI_new %>%
  select(1) %>% 
  na.omit() %>%
  distinct()

#Use pivot_longer to reform the data, we seperate them two three different columns based o nthe title we combined above
df_CPI_new_L <- df_CPI_new %>%
  pivot_longer(cols = 4:ncol(df_CPI_new)-1, 
               names_to = c("Type", "Months", "More"), 
               names_sep = " \\| ", 
               values_to = "Values")

#correc the first column name 
colnames(df_CPI_new_L)[1:2] <-  c("Indent Level", "Expenditure category")

#Transform Values to numeric 
# Convert the Values column to numeric, but first changng everythign to charaters
#df_CPI_new_L$Values <- as.numeric(as.character(df_CPI_new_L$Values))

# There are some charaters that are not numeric like, I need to first remove them all, then convert to number,but before I need to change them all to a encoding UTF-8. 
# Remove non-numeric characters from the Values column 
# Convert the Values column to UTF-8 encoding
df_CPI_new_L$Values <- iconv(df_CPI_new_L$Values, to = "UTF-8", sub = "byte")
#Keep all with numebr like format and replace them with empty values 
df_CPI_new_L$Values <- as.character(df_CPI_new_L$Values)

# Clean up the Values column to remove non-numeric characters
df_CPI_new_L$Values <- str_replace_all(df_CPI_new_L$Values, "[^0-9.-]", "")

# Replace non-numeric values with NA
#df_CPI_new_L$Values <- ifelse(grepl("^-?[0-9.]+$", df_CPI_new_L$Values), 
#                              as.numeric(df_CPI_new_L$Values), 
#                              NA)

#df_CPI_new_L$Values <- gsub("[^0-9.-]", "", df_CPI_new_L$Values)
df_CPI_new_L$Values <- as.numeric(as.character(df_CPI_new_L$Values))
## Warning: NAs introduced by coercion
# find the distinc value in column types 
DF_distincs <- df_CPI_new_L %>%
  select(`Indent Level`, Type, Months, More) %>%
  distinct() %>%
  arrange()

#let's explore the data 

#start with summary 
sum_test <- df_CPI_new_L %>% 
  filter(Type == "Unadjusted indexes", `Indent Level` %in% c("0", "1")) %>%
  group_by(`Expenditure category`) %>%
  summarise(
    n       = n(),
    mean    = mean(Values, na.rm = T)
  )

print(sum_test)
## # A tibble: 28 × 3
##    `Expenditure category`                                             n  mean
##    <chr>                                                          <int> <dbl>
##  1 All items                                                         13  305.
##  2 All items less energy                                             13  310.
##  3 All items less food                                               13  302.
##  4 All items less food and energy                                    13  309.
##  5 All items less food and shelter                                   13  267.
##  6 All items less food, shelter, and energy                          13  269.
##  7 All items less food, shelter, energy, and used cars and trucks    13  273.
##  8 All items less medical care                                       13  293.
##  9 All items less shelter                                            13  278.
## 10 Apparel less footwear                                             13  122.
## # ℹ 18 more rows
#Let's plot a simle line for the All items, Unadjsuted indexes 
df_CPI_new_L %>% 
  filter(Type == "Unadjusted indexes", `Indent Level` == "0", `Expenditure category` == "All items") %>%
  arrange(Months) %>%
  ggplot(aes(x = Months, y = Values)) +  
  geom_point(alpha = 0.5, color = "blue") +  # Specify a color for the line
  theme_minimal() +
  theme(strip.text = element_text(size = 5),
        axis.text = element_text(size = 5),
        legend.text = element_text(size = 5))

#Export all Unadjsuted indexes to a new DF and then start explorign them, in thsi process we also chaneg the column month to seperate the Month number and year to plot some comaprative data 
CPI_Unadjsuted_Monthly <- df_CPI_new_L %>% 
  filter(Type == "Unadjusted indexes") %>%
  group_by(`Expenditure category`)

# to plot we want to seperate mpnth from year and add factors for months 
CPI_Unadjsuted_Monthly$Months <- as.character(CPI_Unadjsuted_Monthly$Months)  # Convert to character (in case it's not already)

CPI_Unadjsuted_Monthly$Month_numeric <- NA  # Create a new column for numeric month
CPI_Unadjsuted_Monthly$Year_numeric <- NA   # Create a new column for numeric year

# Loop through each row and extract the month and year
for (i in 1:nrow(CPI_Unadjsuted_Monthly)) {
  months <- unlist(strsplit(CPI_Unadjsuted_Monthly$Months[i], "[[:space:]]+"))  # Split the string by spaces
  month <- match(sub("\\.$", "", months[1]), month.abb) # Get the numeric representation of the month
  year <- as.numeric(months[2])         # Extract the year
  
  CPI_Unadjsuted_Monthly$Month_numeric[i] <- month
  CPI_Unadjsuted_Monthly$Year_numeric[i] <- year
}

# Convert the month and year columns to factor for sake of plotting 
CPI_Unadjsuted_Monthly$Month_numeric <- factor(CPI_Unadjsuted_Monthly$Month_numeric, levels = 1:12, labels = month.abb)
CPI_Unadjsuted_Monthly$Year_numeric <- factor(CPI_Unadjsuted_Monthly$Year_numeric)

# Reorder the levels of Month_numeric based on Year_numeric, since year alaos changed 
CPI_Unadjsuted_Monthly$Month_numeric <- factor(CPI_Unadjsuted_Monthly$Month_numeric, levels = unique(CPI_Unadjsuted_Monthly$Month_numeric)[order(CPI_Unadjsuted_Monthly$Year_numeric)])

# Plot the data
CPI_Unadjsuted_Monthly %>% 
  filter(`Indent Level` == "0", `Expenditure category` == "All items") %>%
  ggplot(aes(x = Month_numeric, y = Values)) +  
  geom_point(alpha = 0.5, color = "blue") +  # Specify a color for the line
  theme_minimal() +
  theme(strip.text = element_text(size = 5),
        axis.text = element_text(size = 5),
        legend.text = element_text(size = 5))

#plot a graph to compare the "Unadjusted Indexes" of level 0 and 1 across 12 months and compare them agaist the level 0 (All Items)

CPI_Unadjsuted_Monthly %>% 
  filter(`Indent Level` %in% c("0", "1")) %>%
  group_by(`Expenditure category`) %>%
  ggplot(aes(x = Month_numeric, y = Values, color = `Expenditure category`)) +  # Use color to identify them 
  geom_point(alpha = 0.5) + 
  theme_minimal() +
  theme(strip.text = element_text(size = 5),
        axis.text = element_text(size = 5),
        legend.text = element_text(size = 5))
## Warning: Removed 13 rows containing missing values (`geom_point()`).

CPI_Unadjsuted_Monthly %>% 
  filter(`Indent Level` %in% c("0", "1")) %>%
  filter(`Expenditure category` %in% c("Energy", "Food", "Housing", "All items", "Services", "Medical care", "Fuels and utilities")) %>%
  ggplot(aes(x = Month_numeric, y = Values, color = `Expenditure category`)) +  # Use color to identify them 
  geom_point(alpha = 0.5) + 
  theme_minimal() +
  theme(strip.text = element_text(size = 8),
        axis.text = element_text(size = 8),
        legend.text = element_text(size = 8))

#2&3-3: Tidying the df_death and df_incd

These two DFs are pretty clean and organised. One can tell it but looking into the data and run some summary statistics.

I start with some summary statistics to assess the quality of the data. Frist we remove the data in the DF that recent_trend_2 does not have any indication (rising, falling, or stable). and then figure out what is the trend in general is it positive or negative? The overall trend points out that the general trends among all the counties is falling.

# Let's first do some summary statistics and evalaute the data. 

#Plot the recent trends and evalute if general trends if positive or negative 
#remove the recent_trend_2 with no information 
print("take a sneak pick of the DF")
## [1] "take a sneak pick of the DF"
head(df_death,10)
##                         county  fips met_objective_of_45_5_1
## 1                United States     0                      No
## 2       Perry County, Kentucky 21193                      No
## 3      Powell County, Kentucky 21197                      No
## 4  North Slope Borough, Alaska  2185                      No
## 5      Owsley County, Kentucky 21189                      No
## 6        Union County, Florida 12125                      No
## 7    McCreary County, Kentucky 21147                      No
## 8      Leslie County, Kentucky 21131                      No
## 9      Martin County, Kentucky 21159                      No
## 10    Menifee County, Kentucky 21165                      No
##    age_adjusted_death_rate lower_95_confidence_interval_for_death_rate
## 1                       46                                        45.9
## 2                    125.6                                       108.9
## 3                    125.3                                       100.2
## 4                    124.9                                          73
## 5                    118.5                                        83.1
## 6                    113.5                                        89.9
## 7                    111.1                                        90.6
## 8                    110.3                                          87
## 9                    109.1                                        84.8
## 10                     106                                        76.4
##    upper_95_confidence_interval_for_death_rate average_deaths_per_year
## 1                                         46.1                 157,376
## 2                                        144.2                      43
## 3                                        155.1                      18
## 4                                        194.7                       5
## 5                                        165.5                       8
## 6                                        141.4                      19
## 7                                        134.9                      22
## 8                                        138.5                      16
## 9                                        138.3                      14
## 10                                       144.4                       9
##    recent_trend_2 recent_5_year_trend_2_in_death_rates
## 1         falling                                 -2.4
## 2          stable                                 -0.6
## 3          stable                                  1.7
## 4              **                                   **
## 5          stable                                  2.2
## 6         falling                                 -2.2
## 7          rising                                 22.9
## 8          stable                                  0.8
## 9          stable                                  1.3
## 10         stable                                  1.4
##    lower_95_confidence_interval_for_trend
## 1                                    -2.6
## 2                                    -2.7
## 3                                       0
## 4                                      **
## 5                                    -0.4
## 6                                    -4.3
## 7                                     6.9
## 8                                    -0.7
## 9                                    -0.8
## 10                                   -1.4
##    upper_95_confidence_interval_for_trend
## 1                                    -2.2
## 2                                     1.6
## 3                                     3.4
## 4                                      **
## 5                                     4.8
## 6                                       0
## 7                                    41.4
## 8                                     2.4
## 9                                     3.4
## 10                                    4.3
trend_summary <- df_death %>%
  filter(recent_trend_2 %in% c("stable","falling","rising") ) %>%
  select(recent_trend_2) %>%
  summary(
    n          = n(),
    rising     = sum(recent_trend_2=="rising")
    
  )

# Calculate summary statistics
trend_summary <- df_death %>%
  filter(recent_trend_2 %in% c("stable","falling","rising") ) %>%
  summarise(
    total   = n(),
    rising  = sum(recent_trend_2 == "rising"),
    falling = sum(recent_trend_2 == "falling"),
    stable  = sum(recent_trend_2 == "stable"),
    overall_trend = ifelse(rising > falling, "Rising Cancer Trend", ifelse(rising < falling, "Falling Cancer Trend", "Neutral Trend"))
  )

print(trend_summary)
##   total rising falling stable        overall_trend
## 1  2694     31     867   1796 Falling Cancer Trend
# Reshape the data to long format
trend_summary_long <- tidyr::pivot_longer(trend_summary, cols = c(rising, falling, stable),names_to = "trend_type", values_to = "count")

# Plot a bar chart
ggplot(trend_summary_long, aes(x = trend_type, y = count, fill = trend_type)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "summary of the cancer cases in US", x = "General trends", y = "Counts" )

#remove recent_trend_2 without any trend indication 
df_death <- df_death %>%
  filter(recent_trend_2 %in% c("stable","falling","rising") ) %>%
  select(everything())

# county has the state name at the end after a coma, i woudl like to seperate countu to countu and state 

df_death <- df_death %>%
  separate(county, sep = ",",
into = c("county", "State"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1,
## 2238].
#convert the data to numeric 
# Convert columns 5 to 8 from character to numeric
df_death <- df_death %>%
  mutate_at(vars(5:8), as.numeric)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `average_deaths_per_year =
##   .Primitive("as.double")(average_deaths_per_year)`.
## Caused by warning:
## ! NAs introduced by coercion
# Convert columns 10 to 12 from character to numeric
df_death <- df_death %>%
  mutate_at(vars(10:12), as.numeric)

state_sum <- df_death %>%
  group_by(State) %>%
  summarize(
    death_sum     = sum(average_deaths_per_year, na.rm=TRUE)
  )

# Plot death rate by state
ggplot(state_sum, aes(x = State, y = death_sum)) +
  geom_bar(stat = "identity", fill = "plum") +
  labs(x = "State", y = "Total Death Rate", title = "Death Rate by State") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))