When working with data, you should expect to spend a good amount of time on clean-up, but data is not always ‘messy’ or unreadable. For example, a data frame can be perfectly readable yet still not be organized in a way that is useful for analysis. In these cases, we may have to reshape and reorganize the data frame to fit our needs.
The following data is obtained from the National Oceanic and Atmospheric Administration (NOAA) and tracks the starting and ending dates of atmospheric data retrieval at various locations on Earth. The data is collected by specialized sensors called radiosondes, which are attached to weather balloons and record measurements such as wind speed, wind direction, temperature and relative humidity at different heights throughout the ascent.
The data is loaded from a URL into a data frame called ‘radiosonde_df’. I then write the original data to a CSV file called ‘noaa_data.csv’.
#load the tidyverse packages used throughout (dplyr, readr, tidyr, ggplot2) and clear the workspace
library(tidyverse)
rm(list = ls())
noaa_url <- url("https://www.ncei.noaa.gov/data/integrated-global-radiosonde-archive/doc/wmo-sonde-history.txt")
radiosonde_df <- read.table(noaa_url, header = FALSE) %>%
as_tibble() %>%
rename(Station_ID = V1, Beg_Year = V2, Beg_Month = V3, Beg_Day = V4, Beg_Hour = V5,
End_Year = V6, End_Month = V7, End_Day = V8, End_Hour = V9, Code = V10)
#view current state of data-frame
head(radiosonde_df)
## # A tibble: 6 × 10
## Station_ID Beg_Year Beg_Month Beg_Day Beg_Hour End_Year End_Month End_Day
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 AEM00041217 2000 1 1 0 2000 1 22
## 2 AEM00041217 2000 1 23 0 2000 1 23
## 3 AEM00041217 2000 1 23 12 2000 2 17
## 4 AEM00041217 2000 2 18 0 2000 2 18
## 5 AEM00041217 2000 2 18 12 2001 4 21
## 6 AEM00041217 2001 4 23 0 2001 4 23
## # ℹ 2 more variables: End_Hour <int>, Code <int>
write_csv(radiosonde_df, 'noaa_data.csv')
Now I can read my data back in to begin working on it. I read the CSV file into R and assign it to the variable ‘noaa_data_df’, then take a glimpse at the data frame using the head() function. Since this is only a glimpse, I also use the nrow() function to determine the total number of observations (rows) in the data frame. This will help me determine my course of action when transforming the data.
noaa_data_df <- read_csv('noaa_data.csv', show_col_types = FALSE)
head(noaa_data_df)
## # A tibble: 6 × 10
## Station_ID Beg_Year Beg_Month Beg_Day Beg_Hour End_Year End_Month End_Day
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AEM00041217 2000 1 1 0 2000 1 22
## 2 AEM00041217 2000 1 23 0 2000 1 23
## 3 AEM00041217 2000 1 23 12 2000 2 17
## 4 AEM00041217 2000 2 18 0 2000 2 18
## 5 AEM00041217 2000 2 18 12 2001 4 21
## 6 AEM00041217 2001 4 23 0 2001 4 23
## # ℹ 2 more variables: End_Hour <dbl>, Code <dbl>
nrow(noaa_data_df)
## [1] 141552
From the nrow() call above, we see that there are 141,552 rows. Since the data is vast and spans a multitude of countries and stations, I filtered it to only the top three countries with the most active typhoon seasons, namely China, Japan and the Philippines. The string-range comparisons in the filter keep station IDs that begin with the country prefixes ‘CH’ (China), ‘JA’ (Japan) and ‘RP’ (Philippines); the stations are consolidated by country in a later step. Using the nrow() function again on the filtered data, we see that this significantly reduces the number of observations to 1,731, which is much more reasonable to work with.
countries_filtered <- noaa_data_df %>%
filter((Station_ID >= "CH" & Station_ID < "CI") |
(Station_ID >= "RP" & Station_ID < "RQ") |
(Station_ID >= "JA" & Station_ID < "JB"))
head(countries_filtered)
## # A tibble: 6 × 10
## Station_ID Beg_Year Beg_Month Beg_Day Beg_Hour End_Year End_Month End_Day
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CHM00050527 2012 10 18 0 2019 12 31
## 2 CHM00050527 2020 1 1 0 2020 1 20
## 3 CHM00050527 2020 9 9 12 2020 9 9
## 4 CHM00050527 2020 10 8 0 2020 10 8
## 5 CHM00050557 2012 10 1 0 2020 1 19
## 6 CHM00050557 2020 10 8 0 2020 10 8
## # ℹ 2 more variables: End_Hour <dbl>, Code <dbl>
nrow(countries_filtered)
## [1] 1731
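The range-based filter above relies on string ordering. A more explicit version can be written by extracting the two-letter prefix with substr(); the sketch below is only an illustrative alternative (the object name countries_filtered_alt is mine, not part of the original pipeline), and it should select the same stations.
# illustrative alternative: pull out the two-letter prefix and match it directly
countries_filtered_alt <- noaa_data_df %>%
  filter(substr(Station_ID, 1, 2) %in% c("CH", "JA", "RP"))
# should agree with the range-based filter above
nrow(countries_filtered_alt)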
I created a new column called “Country” and assigned the appropriate country based on the station ID. In the filtered data frame, station IDs that start with ‘C’ are all located in China, those that start with ‘R’ are in the Philippines, and those that start with ‘J’ are in Japan.
countries_filtered$Country <- NA
i <- 1
for (row in countries_filtered$Station_ID){
#print(strsplit(row, NULL)[[1]][1])
country <- ifelse(strsplit(row, NULL)[[1]][1] == 'C', 'China',
ifelse(strsplit(row, NULL)[[1]][1] == 'R', 'Philippines',
'Japan'
)
)
countries_filtered$Country[i] <- country
i <- i + 1
}
head(countries_filtered)
## # A tibble: 6 × 11
## Station_ID Beg_Year Beg_Month Beg_Day Beg_Hour End_Year End_Month End_Day
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CHM00050527 2012 10 18 0 2019 12 31
## 2 CHM00050527 2020 1 1 0 2020 1 20
## 3 CHM00050527 2020 9 9 12 2020 9 9
## 4 CHM00050527 2020 10 8 0 2020 10 8
## 5 CHM00050557 2012 10 1 0 2020 1 19
## 6 CHM00050557 2020 10 8 0 2020 10 8
## # ℹ 3 more variables: End_Hour <dbl>, Code <dbl>, Country <chr>
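As a side note, the loop above can also be expressed as a single vectorized step. The sketch below (illustrative only, assuming the same three prefixes; countries_filtered_alt2 is a hypothetical name) should produce the same Country column without explicit indexing.
# illustrative vectorized equivalent of the loop above
countries_filtered_alt2 <- countries_filtered %>%
  mutate(Country = case_when(
    substr(Station_ID, 1, 1) == "C" ~ "China",
    substr(Station_ID, 1, 1) == "R" ~ "Philippines",
    TRUE ~ "Japan"   #everything remaining in the filtered data starts with "J"
  ))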
For my purposes, I decided to work only with the start and end of each data retrieval process expressed in years, along with the country, so I removed all of the other columns.
countries <- countries_filtered[, -c(1, 3:5, 7:10)]
head(countries)
## # A tibble: 6 × 3
## Beg_Year End_Year Country
## <dbl> <dbl> <chr>
## 1 2012 2019 China
## 2 2020 2020 China
## 3 2020 2020 China
## 4 2020 2020 China
## 5 2012 2020 China
## 6 2020 2020 China
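Dropping columns by numeric position works, but it is fragile if the column order ever changes. A name-based version of the same step, using dplyr::select(), might look like this (illustrative only; countries_alt is a hypothetical name).
# illustrative name-based equivalent of the positional column drop above
countries_alt <- countries_filtered %>%
  select(Beg_Year, End_Year, Country)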
I then grouped my data by the start year, which indicates when the radiosondes were deployed, and arranged it by country and, within each country, in increasing chronological order of start year.
countries2 <- countries %>%
group_by(Beg_Year) %>%
arrange(Country, Beg_Year)
countries2
## # A tibble: 1,731 × 3
## # Groups: Beg_Year [21]
## Beg_Year End_Year Country
## <dbl> <dbl> <chr>
## 1 2002 2002 China
## 2 2011 2011 China
## 3 2012 2019 China
## 4 2012 2020 China
## 5 2012 2020 China
## 6 2012 2020 China
## 7 2012 2020 China
## 8 2012 2019 China
## 9 2012 2019 China
## 10 2012 2019 China
## # ℹ 1,721 more rows
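Worth noting: arrange() ignores group_by() unless told otherwise, so the grouping above does not change the row order. If sorting within groups were the goal, a sketch like the following (illustrative; countries2_alt is a hypothetical name) would make that explicit.
# illustrative: sort by start year within each country group
countries2_alt <- countries %>%
  group_by(Country) %>%
  arrange(Beg_Year, .by_group = TRUE)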
Next, I created columns containing the count of deployments for each combination of country, start year and end year. Since every record represents one deployment, Beg_Count and End_Count are both computed with n() and are therefore always equal. Checking the number of rows again shows the data has narrowed down to 126.
countries_count1 <- countries2 %>%
group_by(Country, Beg_Year, End_Year) %>%
summarize(Beg_Count = n(), End_Count = n(), .groups = 'drop')
countries_count1 <- countries_count1[, c(1, 2, 4, 3, 5)]
head(countries_count1)
## # A tibble: 6 × 5
## Country Beg_Year Beg_Count End_Year End_Count
## <chr> <dbl> <int> <dbl> <int>
## 1 China 2002 1 2002 1
## 2 China 2011 1 2011 1
## 3 China 2012 2 2013 2
## 4 China 2012 1 2016 1
## 5 China 2012 14 2019 14
## 6 China 2012 69 2020 69
nrow(countries_count1)
## [1] 126
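For reference, the same per-group counts can be obtained a bit more compactly with dplyr::count(); the sketch below is illustrative (countries_count_alt is a hypothetical name) and produces a single count column rather than the duplicated pair above.
# illustrative: count() collapses group_by() + summarize(n()) into one call
countries_count_alt <- countries2 %>%
  ungroup() %>%
  count(Country, Beg_Year, End_Year, name = "Deploy_Count")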
Since the data frame still has 126 observations, I decided to make better sense of it using a visual. This visual focuses mainly on the start dates of radiosonde deployment for each country. We can see that most deployments began after roughly 2010, regardless of the location.
ggplot(countries_count1, aes(x = Beg_Year, y = Beg_Count, fill = Country)) +
facet_wrap(~ Country) +
geom_bar(stat = "identity")
I created a similar visual for the ending dates of deployment; however, for the rest of the analysis I chose to work with the start dates.
ggplot(countries_count1, aes(x = End_Year, y = End_Count, fill = Country)) +
facet_wrap(~ Country) +
geom_bar(stat = "identity")
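A small stylistic note: geom_bar(stat = "identity") is equivalent to geom_col(), so either chart could be written slightly more concisely, e.g. for the start dates:
# illustrative: geom_col() is shorthand for geom_bar(stat = "identity")
ggplot(countries_count1, aes(x = Beg_Year, y = Beg_Count, fill = Country)) +
  facet_wrap(~ Country) +
  geom_col()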
To narrow down my data further, I chose to look only at deployments that started after 2011 (i.e., from 2012 onward).
filtered_dates <- countries_count1 %>%
filter(Beg_Year > 2011)
filtered_dates
## # A tibble: 58 × 5
## Country Beg_Year Beg_Count End_Year End_Count
## <chr> <dbl> <int> <dbl> <int>
## 1 China 2012 2 2013 2
## 2 China 2012 1 2016 1
## 3 China 2012 14 2019 14
## 4 China 2012 69 2020 69
## 5 China 2013 2 2013 2
## 6 China 2013 2 2019 2
## 7 China 2014 1 2015 1
## 8 China 2014 1 2019 1
## 9 China 2014 1 2020 1
## 10 China 2015 2 2019 2
## # ℹ 48 more rows
After this initial narrowing, I decided to pivot my data to a wider format, but to avoid creating too many columns I consolidated the start dates into ranges. The first range is 2011-2015 and the second is 2016-2020. To do this, I used mutate() with case_when() to create the range columns, group_by() to group accordingly, and summarize() to aggregate the counts within each range. I then used pivot_wider() to transform the data frame so that each range becomes its own count column.
filtered_data <- filtered_dates %>%
mutate(Beg_Year_Range = case_when(
Beg_Year >= 2011 & Beg_Year <= 2015 ~ "2011_2015",
Beg_Year >= 2016 & Beg_Year <= 2020 ~ "2016_2020",
TRUE ~ "Other"
),
End_Year_Range = case_when(
End_Year >= 2011 & End_Year <= 2015 ~ "2011_2015",
End_Year >= 2016 & End_Year <= 2020 ~ "2016_2020",
TRUE ~ "Other"
))
summarized_data <- filtered_data %>%
group_by(Country, Beg_Year_Range, End_Year_Range) %>%
summarize(
Beg_Count_Sum = sum(Beg_Count, na.rm = TRUE),
End_Count_Sum = sum(End_Count, na.rm = TRUE),
.groups = 'drop'
)
# Pivot the data using tidyr to get the year ranges in separate columns
pivoted_data <- summarized_data %>%
pivot_wider(names_from = Beg_Year_Range, values_from = Beg_Count_Sum, names_prefix = "Beg_") %>%
pivot_wider(names_from = End_Year_Range, values_from = End_Count_Sum, names_prefix = "End_")
pivoted_data
## # A tibble: 9 × 5
## Country Beg_2011_2015 Beg_2016_2020 End_2011_2015 End_2016_2020
## <chr> <int> <int> <int> <int>
## 1 China 5 NA 5 NA
## 2 China 94 NA NA 94
## 3 China NA 287 NA 287
## 4 Japan 295 NA 295 NA
## 5 Japan 10 NA NA 10
## 6 Japan NA 196 NA 196
## 7 Philippines 137 NA 137 NA
## 8 Philippines 7 NA NA 7
## 9 Philippines NA 483 NA 483
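The NA entries above simply mean that a given start-range/end-range combination did not occur in that row. If explicit zeros are preferred, pivot_wider() accepts a values_fill argument; the sketch below is an illustrative variant (pivoted_data_filled is a hypothetical name) with the same structure otherwise.
# illustrative: fill missing combinations with 0 instead of NA
pivoted_data_filled <- summarized_data %>%
  pivot_wider(names_from = Beg_Year_Range, values_from = Beg_Count_Sum,
              names_prefix = "Beg_", values_fill = 0) %>%
  pivot_wider(names_from = End_Year_Range, values_from = End_Count_Sum,
              names_prefix = "End_", values_fill = 0)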
Each row in the table above shows a number of radiosonde deployments for a given country. We see that there are three rows per country, one for each observed combination of start range and end range; because every record represents a completed deployment, the beginning count and ending count in a given row always match.
For example, let’s take a look at the first three rows of the data frame above that correspond to China.
- Row 1 shows that 5 radiosondes were deployed during 2011-2015 and collected data until they were recalled, also during 2011-2015.
- Row 2 shows that 94 radiosondes were deployed during 2011-2015 and were not recalled until some time in 2016-2020.
- Row 3 shows that 287 radiosondes were deployed during 2016-2020 and were all recalled during 2016-2020.
To make things a bit simpler, let’s just look at the total number of deployments (beginning) and the total number of recalls (ending) for each country and date range. By totaling the numbers up, we lose the correspondence between start and end dates, but that correspondence is not needed for my analysis. Below is the condensed data frame.
total_counts <- pivoted_data %>%
group_by(Country) %>%
summarize(
Total_Beg_2011_2015 = sum(Beg_2011_2015, na.rm = TRUE),
Total_Beg_2016_2020 = sum(Beg_2016_2020, na.rm = TRUE),
Total_End_2011_2015 = sum(End_2011_2015, na.rm = TRUE),
Total_End_2016_2020 = sum(End_2016_2020, na.rm = TRUE)
)
# View the result
total_counts
## # A tibble: 3 × 5
## Country Total_Beg_2011_2015 Total_Beg_2016_2020 Total_End_2011_2015
## <chr> <int> <int> <int>
## 1 China 99 287 5
## 2 Japan 305 196 295
## 3 Philippines 144 483 137
## # ℹ 1 more variable: Total_End_2016_2020 <int>
In the table above, we see the individual numbers of deployments and recalls for each country for the aforementioned year ranges. I then computed the mean number of deployments across the three countries for each date range: the mean for 2011-2015 is 182.6667 and the mean for 2016-2020 is 322.
mean1 <- total_counts %>%
summarize(Pre2016_Mean = mean(total_counts$Total_Beg_2011_2015))
mean1
## # A tibble: 1 × 1
## Pre2016_Mean
## <dbl>
## 1 183.
mean2 <- total_counts %>%
summarize(Post2015_Mean = mean(total_counts$Total_Beg_2016_2020))
mean2
## # A tibble: 1 × 1
## Post2015_Mean
## <dbl>
## 1 322
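The two summarize() calls above reach back into total_counts with $, which works but bypasses the data masking that summarize() normally provides. Both means can also be computed in one step with across(); the sketch below is illustrative only.
# illustrative: compute both period means in a single summarize() using across()
total_counts %>%
  summarize(across(c(Total_Beg_2011_2015, Total_Beg_2016_2020), mean,
                   .names = "Mean_{.col}"))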
I then added the mean values to the same data frame as my individual counts.
total_counts2 <- total_counts[, -c(4,5)]
total_counts2$Mean_Pre2016 <- mean1$Pre2016_Mean[1]
total_counts2$Mean_Post2015 <- mean2$Post2015_Mean[1]
total_counts2
## # A tibble: 3 × 5
## Country Total_Beg_2011_2015 Total_Beg_2016_2020 Mean_Pre2016 Mean_Post2015
## <chr> <int> <int> <dbl> <dbl>
## 1 China 99 287 183. 322
## 2 Japan 305 196 183. 322
## 3 Philippines 144 483 183. 322
Below, I plotted the mean values together with the deployment counts for each country. The first graph, colored blue, represents the data from 2011-2015, and the second, colored red, represents the data from 2016-2020.
#blue plot for 2011-2015
ggplot(total_counts2, aes(x = Country)) +
geom_bar(aes(y = Total_Beg_2011_2015, fill = "Total_Beg_2011_2015"), stat = "identity", position = "dodge", alpha = 0.5) +
geom_point(aes(y = Mean_Pre2016, color = "Mean_Pre2016"), size = 3) +
geom_line(aes(y = Mean_Pre2016, group = 1, color = "Mean_Pre2016"), linetype = "dashed") +
labs(title = "Comparison of Counts and Means Across Periods",
x = "Country",
y = "Value",
fill = "Counts",
color = "Means") +
theme_minimal() +
scale_fill_manual(values = c("Total_Beg_2011_2015" = "blue")) +
scale_color_manual(values = c("Mean_Pre2016" = "blue"))
#red plot for 2016-2020
ggplot(total_counts2, aes(x = Country)) +
geom_bar(aes(y = Total_Beg_2016_2020, fill = "Total_Beg_2016_2020"),
stat = "identity", position = "dodge", alpha = 0.5) +
geom_point(aes(y = Mean_Post2015, color = "Mean_Post2015"), size = 3) +
geom_line(aes(y = Mean_Post2015, group = 1, color = "Mean_Post2015"), linetype = "dashed") +
labs(title = "Comparison of Counts and Means Across Periods",
x = "Country",
y = "Value",
fill = "Counts",
color = "Means") +
theme_minimal() +
scale_fill_manual(values = c("Total_Beg_2016_2020" = "red")) +
scale_color_manual(values = c("Mean_Post2015" = "red"))
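As an alternative to the two separate charts, the counts could be reshaped to long format and drawn as a single dodged bar chart; the sketch below is illustrative (plot_long is a hypothetical name) and omits the mean reference lines.
# illustrative: one chart comparing both periods side by side
plot_long <- total_counts2 %>%
  pivot_longer(cols = c(Total_Beg_2011_2015, Total_Beg_2016_2020),
               names_to = "Period", values_to = "Deployments")
ggplot(plot_long, aes(x = Country, y = Deployments, fill = Period)) +
  geom_col(position = "dodge", alpha = 0.7) +
  labs(title = "Radiosonde Deployments by Country and Period",
       x = "Country", y = "Deployments", fill = "Period") +
  theme_minimal()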
In my analysis, I found that China had a significantly lower number of deployments prior to 2016, at 99. The Philippines were next with a count of 144, and Japan had a whopping 305 deployments from 2011-2015. This suggests that Japan had a larger presence of severe weather phenomena during this period. When comparing the deployments during 2016-2020, China’s deployment numbers nearly tripled and the Philippines’ more than tripled, whereas Japan’s decreased by more than 100 deployments. This suggests that, post 2015, weather activity in China and the Philippines picked up significantly while in Japan it declined somewhat. When comparing the overall mean deployment counts for the two periods, the mean increases from roughly 183 to 322 from the first date range to the next. This suggests to me that, regardless of the specific country, overall activity in the eastern part of Asia picked up toward the latter end of the decade. Of course, these are just deployment counts, and the increases could be due to other factors such as available financial resources and politics. However, it is reasonable to attribute these numbers to increased interest in understanding the heightened weather activity in Eastern Asia.