NOAA Yearly Weather Data Processing

Processing 1900-2016 Temperature Data



Project Synopsis

According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):

…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.

Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.

Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally


The 2016 Weather Data Exploratory Analysis project was started to review the raw NOAA data, identify areas of uncertainty, and assess their potential impact on reaching a greater than 95 percent scientific certainty.

This is Part 6 of the 2016 Weather Data Exploratory Analysis.


This project is designed to process the NOAA yearly weather data files to be used in the temperature analysis portion of the overall project. The years to be processed are 1900-2016.

The project will use the same criteria for acceptance, cleaning, and processing as Part 5 of the study. However, since the goal is to perform the operations on 117 files, the processing and cleaning steps will be contained within a single processing loop.


The yearly weather data is located at the following:

The files range from the year 1763 through 2017 and are gzip-compressed .csv files.

The specific file range we will be using is:

The description of the files can be found at:
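
For reference, a quick look at the first few rows of one raw file shows the column layout the processing loop relies on. This is a small standalone sketch: the file name is only an example of the yearly naming pattern, and the meaning of the trailing flag columns is an assumption based on the GHCN-Daily by_year documentation rather than something used later in the processing.

## Peek at one raw yearly file (hypothetical example file name)
head(read.csv("~/NOAA Data/RAW WD 1900-2016/1900.csv.gz",
              header = FALSE, stringsAsFactors = FALSE), n = 3)

## Expected layout (the files have no header row):
## V1 = station ID, V2 = observation date (YYYYMMDD), V3 = element code
##   (e.g. TMAX, TMIN, TAVG), V4 = value (tenths of degrees Celsius for the
##   temperature elements), remaining columns = measurement/quality/source flags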


Libraries Required

library(dplyr)        # Data manipulation
library(knitr)        # Dynamic report generation
library(lubridate)    # Date and time processing
library(reshape2)     # Data transposition

Preparing the Processing Loop

All of the yearly data files from 1900-2016 have been downloaded from the NOAA GHCN FTP site and are stored in a local data directory. To prepare for the processing loop, the directory path and file names are gathered, and the temperature variables are defined in a vector. The file names can then be looped through using the vector index.

Set Path

raw_path <- "~/NOAA Data/RAW WD 1900-2016/"

Retrieve File Names

file.names <- dir(raw_path, pattern = ".csv.gz")

Set Temperature Elements

temp_vals <- c("TMAX", "TMIN", "TAVG")

The processed files will be saved in .Rds format to a separate directory. WDT stands for “Weather Data Temperature”.

Set RDS path

rds_path <- "~/NOAA Data/RDS WDT 1900-2016/"

Reporting statistics will be gathered for each year and a separate report file will be created and saved.

Initialize Reporting Variables

rpt_year         <- as.numeric() ## Year of the weather data observations
rpt_yday         <- as.numeric() ## Number of days in the year
rpt_tot_stations <- as.numeric() ## Total number of observation stations in the file
rpt_tmp_stations <- as.numeric() ## Total number of temperature observation stations in the file
rpt_tmp_sta_pct  <- as.numeric() ## Percentage of temperature stations in the file
rpt_ini_pobs     <- as.numeric() ## Initial potential observations based on temp stations and days in the year
rpt_ini_aobs     <- as.numeric() ## Initial actual observations prior to cleaning
rpt_ini_obs_pct  <- as.numeric() ## Percentage of actual observations to potential observations
rpt_ini_cobs     <- as.numeric() ## Initial number of complete observations (TAVG recorded)
rpt_ini_cobs_pct <- as.numeric() ## Percentage of initial complete observations against potential observations
rpt_fin_stations <- as.numeric() ## Final number of stations after cleaning
rpt_lst_stations <- as.numeric() ## Number of stations lost during cleaning
rpt_fin_pobs     <- as.numeric() ## Potential observations remaining after cleaning
rpt_fin_cobs     <- as.numeric() ## Complete observations after cleaning and deriving TAVG as needed
rpt_fin_cobs_pct <- as.numeric() ## Final complete observations percentage against potential observations

Processing Loop

Within the loop, there will be comments above or to the right of the executed command to explain its purpose. There is an additional check in the loop for the column TAVG after the dcast. The reason for this is that there may be years in which none of the observation stations have a TAVG Element recorded. For these years, the TAVG value will be NA and must be derived from TMAX and TMIN exclusively.
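
As a small standalone illustration of why that check is needed (the station IDs and values below are made up, not taken from the data), dcast spreads the Element/Value pairs into one column per element; if no station in a given year recorded TAVG, no TAVG column is created at all:

## Toy long-format data with no TAVG element recorded
toy <- data.frame(ID   = c("STATION1", "STATION1", "STATION2", "STATION2"),
                  Date = c(20160101, 20160101, 20160101, 20160101),
                  V3   = c("TMAX", "TMIN", "TMAX", "TMIN"),
                  V4   = c(150, -20, 80, -55),   ## Tenths of degrees Celsius
                  stringsAsFactors = FALSE)

dcast(toy, ID + Date ~ V3, value.var = "V4")
##         ID     Date TMAX TMIN
## 1 STATION1 20160101  150  -20
## 2 STATION2 20160101   80  -55

The result has only TMAX and TMIN columns, which is exactly the case the TAVG check in the loop guards against before the missing-value and conversion steps.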

Process Yearly Weather Data

for(i in seq_along(file.names)){
        
     raw_file <- paste(raw_path, file.names[i], sep="")
     
     wd_data <- read.csv(raw_file, sep=",", header=FALSE, stringsAsFactors = FALSE)
     
     ## Record year for reporting and creating the .rds file name
     rpt_year[i] <- year(ymd(wd_data$V2[1]))  
     
     ## Record number of days in the year
     rpt_yday[i] <- if(leap_year(rpt_year[i])) {366} else {365}
     
     ## Record total number of stations
     rpt_tot_stations[i] <- length(unique(wd_data$V1))
     
     wd_data <- wd_data %>%
                select(V1, V2, V3, V4) %>%              ## Remove unneeded variables  
                filter(V3 %in% temp_vals) %>%           ## Filter for temperature observations   
                rename(ID   = V1,                       ## Rename the first two columns  
                       Date = V2) %>%
                dcast(ID + Date ~ V3, value.var = "V4") ## Create Tidy data by making Elements into variables
     
     ## Report number of temperature stations
     rpt_tmp_stations[i] <- length(unique(wd_data$ID))
     rpt_tmp_sta_pct[i]  <- round(rpt_tmp_stations[i] / rpt_tot_stations[i], 4)
     
     ## Report initial potential observations
     rpt_ini_pobs[i] <- rpt_tmp_stations[i] * rpt_yday[i]
     
     ## Report initial actual observations
     rpt_ini_aobs[i]    <- nrow(wd_data)
     rpt_ini_obs_pct[i] <- round(rpt_ini_aobs[i] / rpt_ini_pobs[i], 4)
     
     if(!"TAVG" %in% colnames(wd_data)) {               ## Check for presence of TAVG
             wd_data$TAVG <- NA                         ## If not present, create TAVG and set to NA
     }
     
     ## Replace missing observations with NA, convert temps to Fahrenheit
     wd_data <- wd_data %>%                               
                transform(TAVG = ifelse(TAVG == -999, NA, (TAVG / 10 * 1.8) + 32 ),    
                          TMAX = ifelse(TMAX == -999, NA, (TMAX / 10 * 1.8) + 32 ),    
                          TMIN = ifelse(TMIN == -999, NA, (TMIN / 10 * 1.8) + 32 ))
     
     ## Report initial complete observations
     rpt_ini_cobs[i]     <- length(wd_data$TAVG[!is.na(wd_data$TAVG)])
     rpt_ini_cobs_pct[i] <- round(rpt_ini_cobs[i] / rpt_ini_pobs[i], 4)
     
     wd_data <- wd_data %>%
                transform(TAVG = ifelse(TAVG  >  134.1, NA, TAVG ),  ## Replace temps greater than record high    
                          TMAX = ifelse(TMAX  >  134.1, NA, TMAX ),       
                          TMIN = ifelse(TMIN  >  134.1, NA, TMIN )) %>%
                transform(TAVG = ifelse(TAVG  < -128.6, NA, TAVG ),  ## Replace temps less than record low 
                          TMAX = ifelse(TMAX  < -128.6, NA, TMAX ),
                          TMIN = ifelse(TMIN  < -128.6, NA, TMIN )) %>%
                transform(TMIN = ifelse(TMIN  >  TMAX,  NA, TMIN))   ## Replace temps with TMIN > TMAX
     
     wd_data <- wd_data %>%
                mutate(DAVG = (TMAX + TMIN) / 2) %>%                    ## Create derived average temp 
                transform(TAVG = ifelse(is.na(TAVG), DAVG, TAVG)) %>%   ## Replace TAVG with DAVG if NA 
                select(ID, Date, TAVG) %>%                              ## Keep columns needed 
                filter(complete.cases(.))                               ## Keep only complete cases, no NAs
     
     
     ## Report final number of temperature stations
     rpt_fin_stations[i] <- length(unique(wd_data$ID))
     rpt_lst_stations[i] <- rpt_tmp_stations[i] - rpt_fin_stations[i]
     
     ## Report final complete observations
     rpt_fin_pobs[i]     <- rpt_fin_stations[i] * rpt_yday[i]
     rpt_fin_cobs[i]     <- nrow(wd_data)
     rpt_fin_cobs_pct[i] <- round(rpt_fin_cobs[i] / rpt_fin_pobs[i], 4)
     
     wd_data <- wd_data %>%
                transform(Date = ymd(Date)) %>%           ## Convert Date to date format        
                mutate(Yday = yday(Date),                 ## Create Yday
                       Week = week(Date),                 ## Create Week
                       Month = month(Date)) %>%           ## Create Month
                select(ID, Month, Week, Yday, TAVG)       ## Keep and reorder needed columns
     
     rds_name <- paste(rds_path, "ww_data_", rpt_year[i], ".rds", sep = "")
     
     saveRDS(wd_data, rds_name)
     
     rm(wd_data)
}
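
As a quick sanity check after the loop completes (a sketch rather than part of the original workflow; the 1900 file is just one example of the naming pattern), any processed year can be read back to confirm the final structure of ID, Month, Week, Yday, and TAVG:

## Read back one processed year and inspect it
chk_data <- readRDS(paste(rds_path, "ww_data_1900.rds", sep = ""))
str(chk_data)           ## Expect columns: ID, Month, Week, Yday, TAVG (Fahrenheit)
summary(chk_data$TAVG)  ## TAVG should fall within the record bounds used above
rm(chk_data)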

Create Report Table

ww_tdata_rpt <- data.frame(rpt_year, rpt_yday, rpt_tot_stations, rpt_tmp_stations, rpt_tmp_sta_pct,
                           rpt_ini_pobs, rpt_ini_aobs, rpt_ini_obs_pct, rpt_ini_cobs, rpt_ini_cobs_pct,
                           rpt_fin_stations, rpt_lst_stations, rpt_fin_pobs, rpt_fin_cobs, rpt_fin_cobs_pct)

Save Report Table

saveRDS(ww_tdata_rpt, "Reports/ww_tdata_rpt.rds")

Summary Report Review

A summary report has been created from the data gathered during this project. It can be found at the following link:

Summary Report: http://t2tec.com/rpts/wacl/ww_tdata_1900_2016.pdf

But we can look at some key statistics now.

Distribution of Completion Percentage

quantile(ww_tdata_rpt$rpt_fin_cobs_pct, prob = seq(0, 1, length = 11), type = 5)
##      0%     10%     20%     30%     40%     50%     60%     70%     80% 
## 0.85960 0.87442 0.88172 0.88710 0.89421 0.90030 0.90905 0.91368 0.91830 
##     90%    100% 
## 0.92486 0.93600

For the timeframe 1900-2016, the median completion percentage is 90.03%, meaning roughly half of the years fall at or below about 90%. The maximum completion percentage is 93.6%.

Maximum Completion Percentage - Top Ten

tdata_top_ten <- ww_tdata_rpt %>%
                 arrange(desc(rpt_fin_cobs_pct)) %>%
                 select(rpt_year, rpt_fin_cobs_pct) %>%
                 head(10)                    ## Keep only the ten highest years
rpt_year rpt_fin_cobs_pct
1932 0.9360
1933 0.9341
1931 0.9334
1934 0.9318
1935 0.9317
1940 0.9309
1939 0.9279
1937 0.9273
1938 0.9264
2014 0.9255

Nine of the top ten completion percentages for final valid observations occur in 1940 or earlier. Only one recent year, 2014, appears in the top ten.

Minimum Completion Percentage - Bottom Ten

tdata_bot_ten <- ww_tdata_rpt %>%
                 arrange(rpt_fin_cobs_pct) %>%
                 select(rpt_year, rpt_fin_cobs_pct) %>%
                 head(10)                    ## Keep only the ten lowest years
rpt_year rpt_fin_cobs_pct
1901 0.8596
1948 0.8614
1907 0.8644
1900 0.8678
1906 0.8706
1973 0.8714
1904 0.8715
1974 0.8716
2016 0.8726
1903 0.8736

Perhaps the most concerning finding is that 2016, the comparison year on which the NOAA and NASA report is based, has the ninth lowest completion percentage (87.26%) of the 117 years analyzed.

Baseline Years Completion Percentage

tdata_bas_yrs <- ww_tdata_rpt %>%
                 filter(rpt_year >= 1951,
                        rpt_year <= 1980) %>%
                 select(rpt_year, rpt_fin_cobs_pct)
rpt_year rpt_fin_cobs_pct
1951 0.9095
1952 0.9178
1953 0.9130
1954 0.9160
1955 0.9153
1956 0.9139
1957 0.9014
1958 0.9150
1959 0.8782
1960 0.8970
1961 0.8954
1962 0.8961
1963 0.8977
1964 0.9235
1965 0.9136
1966 0.9206
1967 0.9206
1968 0.9178
1969 0.9146
1970 0.9096
1971 0.8954
1972 0.9221
1973 0.8714
1974 0.8716
1975 0.8741
1976 0.8828
1977 0.8754
1978 0.8798
1979 0.8810
1980 0.8745
quantile(tdata_bas_yrs$rpt_fin_cobs_pct)
##       0%      25%      50%      75%     100% 
## 0.871400 0.881450 0.905450 0.915225 0.923500

Once again, nearly half (14 of 30) of the years that make up the mid-century baseline used in the report show complete observation percentages under 90%.


Key Findings

The median completion percentage, after cleaning, is around 90%, which means roughly 1 out of every 10 potential observations is missing from the recorded temperature history between 1900 and 2016. As stated in the previous project, depending on when those missing observations occurred and how the missing data is handled in the NOAA model, there may be an impact on actual global average temperatures.

We also find temperature stations dropping out of the record entirely because they contain zero valid temperature observations.

There were zero years with a 95% completion percentage or better. A thorough review of the processed temperature data, including time of the year and geographic location, will be required to understand the potential impact.
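
This can be confirmed directly from the report table built above (a one-line check, shown here as a sketch):

## Count the years whose final completion percentage reaches 95% or better;
## with a maximum of 93.6%, this count is zero.
sum(ww_tdata_rpt$rpt_fin_cobs_pct >= 0.95)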

During the baseline years, nearly half of the years had complete observation percentages under 90%, and 2016 had one of the lowest observation percentages in the 117-year temperature history.


Next Steps

As mentioned earlier, a summary report with the statistics from this project is available at the link documented above.

There are two immediate next steps as a result of the findings from this project.

Since the historical record used in the NOAA and NASA analysis is incomplete, it is important to understand the potential impact on the conclusions drawn from it.




sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.2  lubridate_1.6.0 knitr_1.16      dplyr_0.5.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     digest_0.6.12    rprojroot_1.2    assertthat_0.2.0
##  [5] plyr_1.8.4       R6_2.2.1         DBI_0.6-1        backports_1.1.0 
##  [9] magrittr_1.5     evaluate_0.10    highr_0.6        rlang_0.1.1     
## [13] stringi_1.1.5    lazyeval_0.2.0   rmarkdown_1.5    tools_3.4.0     
## [17] stringr_1.2.0    yaml_2.1.14      compiler_3.4.0   htmltools_0.3.6 
## [21] tibble_1.3.3