According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):
…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.
Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.
Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally
The 2016 Weather Data Exploratory Analysis project was started to review the raw data from NOAA and identify areas of uncertainty and their potential impact on reaching a greater than 95% scientific certainty.
This is Part 6 of the 2016 Weather Data Exploratory Analysis.
This project is designed to process the NOAA yearly weather data files to be used in the temperature analysis portion of the overall project. The years to be processed are 1900-2016.
The project will use the same criteria for acceptance, cleaning, and processing as Part 5 of the study. However, since the goal is to perform the operations on 117 files, the processing and cleaning steps will be contained within a single processing loop.
The yearly weather data is located at the following:
The files range from the year 1763 through 2017. These files are gzip compressed .csv files.
The specific file range we will be using are:
The description of the files can be found at:
Libraries Required
library(dplyr) # Data manipulation
library(knitr) # Dynamic report generation
library(lubridate) # Date and time processing
library(reshape2) # Data transposition
All of the yearly data files from 1900-2016 have been downloaded from the NOAA GHCN FTP site and are stored in a local data directory. To prepare for the processing loop, the directory path and file names are gathered and the temperature variables are defined to a vector. The file names can be looped through using the vector index.
Set Path
raw_path = "~/NOAA Data/RAW WD 1900-2016/"
Retrieve File Names
file.names <- dir(raw_path, pattern =".csv.gz")
Set Temperature Elements
temp_vals <- c("TMAX", "TMIN", "TAVG")
The processed files will be saved in .Rds format to a separate directory. WDT stands for “Weather Data Temperature”.
Set RDS path
rds_path = "~/NOAA Data/RDS WDT 1900-2016/"
Reporting statistics will be gathered for each year and a separate report file will be created and saved.
Initialize Reporting Variables
rpt_year <- as.numeric() ## Year of the weather data observations
rpt_yday <- as.numeric() ## Number of days in the year
rpt_tot_stations <- as.numeric() ## Total number of observation stations in the file
rpt_tmp_stations <- as.numeric() ## Total number of temperature observation stations in the file
rpt_tmp_sta_pct <- as.numeric() ## Percentage of temperature stations in the file
rpt_ini_pobs <- as.numeric() ## Initial potential observations based on temp stations and days in the year
rpt_ini_aobs <- as.numeric() ## Initial actual observations prior to cleaning
rpt_ini_obs_pct <- as.numeric() ## Percentage of actual observations to potential observations
rpt_ini_cobs <- as.numeric() ## Initial number of complete observations (TAVG recorded)
rpt_ini_cobs_pct <- as.numeric() ## Percentage of initial complete observation against potential observations
rpt_fin_stations <- as.numeric() ## Final number of stations after cleaning
rpt_lst_stations <- as.numeric() ## Number of stations lost during cleaning
rpt_fin_pobs <- as.numeric() ## Potential observations remaining after cleaning
rpt_fin_cobs <- as.numeric() ## Complete observations after cleaning and deriving TAVG as needed
rpt_fin_cobs_pct <- as.numeric() ## Final complete observations percentage again potential observations
Within the loop, there will be comments above or to the right of the executed command to explain its purpose. There is an additional check in the loop for the column TAVG after the dcast. The reason for this is that there may be years in which none of the observation stations have a TAVG Element recorded. For these years, the TAVG value will be NA and must be derived from TMAX and TMIN exclusively.
Process Yearly Weather Data
for(i in 1:length(file.names)){
raw_file <- paste(raw_path, file.names[i], sep="")
wd_data <- read.csv(raw_file, sep=",", header=FALSE, stringsAsFactors = FALSE)
## Record year for reporting and creating the .rds file name
rpt_year[i] <- year(ymd(wd_data$V2[1]))
## Record number of days in the year
rpt_yday[i] <- if(leap_year(rpt_year[i])) {366} else {365}
## Record total number of stations
rpt_tot_stations[i] <- length(unique(wd_data$V1))
wd_data <- wd_data %>%
select(V1, V2, V3, V4) %>% ## Remove unneeded variables
filter(V3 %in% temp_vals) %>% ## Filter for temperature observations
rename(ID = V1, ## Rename the first two columns
Date = V2) %>%
dcast(ID + Date ~ V3, value.var = "V4") ## Create Tidy data by making Elements into variables
## Report number of temperature stations
rpt_tmp_stations[i] <- length(unique(wd_data$ID))
rpt_tmp_sta_pct[i] <- round(rpt_tmp_stations[i] / rpt_tot_stations[i], 4)
## Report initial potential observations
rpt_ini_pobs[i] <- rpt_tmp_stations[i] * rpt_yday[i]
## Report initial actual observations
rpt_ini_aobs[i] <- nrow(wd_data)
rpt_ini_obs_pct[i] <- round(rpt_ini_aobs[i] / rpt_ini_pobs[i], 4)
if(!"TAVG" %in% colnames(wd_data)) { ## Check for presence of TAVG
wd_data$TAVG <- NA ## If not present, create TAVG and set to NA
}
## Replace missing observations with NA, convert temps to Farenheit
wd_data <- wd_data %>%
transform(TAVG = ifelse(TAVG == -999, NA, (TAVG / 10 * 1.8) + 32 ),
TMAX = ifelse(TMAX == -999, NA, (TMAX / 10 * 1.8) + 32 ),
TMIN = ifelse(TMIN == -999, NA, (TMIN / 10 * 1.8) + 32 ))
## Report initial complete observations
rpt_ini_cobs[i] <- length(wd_data$TAVG[!is.na(wd_data$TAVG)])
rpt_ini_cobs_pct[i] <- round(rpt_ini_cobs[i] / rpt_ini_pobs[i], 4)
wd_data <- wd_data %>%
transform(TAVG = ifelse(TAVG > 134.1, NA, TAVG ), ## Replace temps greater than record high
TMAX = ifelse(TMAX > 134.1, NA, TMAX ),
TMIN = ifelse(TMIN > 134.1, NA, TMIN )) %>%
transform(TAVG = ifelse(TAVG < -128.6, NA, TAVG ), ## Replace temps less than record low
TMAX = ifelse(TMAX < -128.6, NA, TMAX ),
TMIN = ifelse(TMIN < -128.6, NA, TMIN )) %>%
transform(TMIN = ifelse(TMIN > TMAX, NA, TMIN)) ## Replace temps with TMIN > TMAX
wd_data <- wd_data %>%
mutate(DAVG = (TMAX + TMIN) / 2) %>% ## Create derived average temp
transform(TAVG = ifelse(is.na(TAVG), DAVG, TAVG)) %>% ## Replace TAVG with DAVG if NA
select(ID, Date, TAVG) %>% ## Keep columns needed
filter(complete.cases(.)) ## Keep only complete cases, no NAs
## Report final number of temperature stations
rpt_fin_stations[i] <- length(unique(wd_data$ID))
rpt_lst_stations[i] <- rpt_tmp_stations[i] - rpt_fin_stations[i]
## Report final complete observations
rpt_fin_pobs[i] <- rpt_fin_stations[i] * rpt_yday[i]
rpt_fin_cobs[i] <- nrow(wd_data)
rpt_fin_cobs_pct[i] <- round(rpt_fin_cobs[i] / rpt_fin_pobs[i], 4)
wd_data <- wd_data %>%
transform(Date = ymd(Date)) %>% ## Convert Date to date format
mutate(Yday = yday(Date), ## Create Yday
Week = week(Date), ## Create Week
Month = month(Date)) %>% ## Create Month
select(ID, Month, Week, Yday, TAVG) ## Keep and reorder needed columns
rds_name <- paste(rds_path, "ww_data_", rpt_year[i], ".rds", sep = "")
saveRDS(wd_data, rds_name)
rm(wd_data)
}
Create Report Table
ww_tdata_rpt <- data.frame(rpt_year, rpt_yday, rpt_tot_stations, rpt_tmp_stations, rpt_tmp_sta_pct,
rpt_ini_pobs, rpt_ini_aobs, rpt_ini_obs_pct, rpt_ini_cobs, rpt_ini_cobs_pct,
rpt_fin_stations, rpt_lst_stations, rpt_fin_pobs, rpt_fin_cobs, rpt_fin_cobs_pct)
Save Report Table
saveRDS(ww_tdata_rpt, "Reports/ww_tdata_rpt.rds")
A summary report has been created from the data gathered during this project. It can be found at the following link:
Summary Report: http://t2tec.com/rpts/wacl/ww_tdata_1900_2016.pdf
But we can look at some key statistics now.
Distribution of Completion Percentage
quantile(ww_tdata_rpt$rpt_fin_cobs_pct, prob = seq(0, 1, length = 11), type = 5)
## 0% 10% 20% 30% 40% 50% 60% 70% 80%
## 0.85960 0.87442 0.88172 0.88710 0.89421 0.90030 0.90905 0.91368 0.91830
## 90% 100%
## 0.92486 0.93600
For the timeframe 1900-2016, 50% of the years have complete observation percentages less than 90%. The maximum completion percentage is 93.6%.
Maximum Completion Percentage - Top Ten
tdata_top_ten <- ww_tdata_rpt %>%
arrange(desc(rpt_fin_cobs_pct)) %>%
select(rpt_year, rpt_fin_cobs_pct)
| rpt_year | rpt_fin_cobs_pct |
|---|---|
| 1932 | 0.9360 |
| 1933 | 0.9341 |
| 1931 | 0.9334 |
| 1934 | 0.9318 |
| 1935 | 0.9317 |
| 1940 | 0.9309 |
| 1939 | 0.9279 |
| 1937 | 0.9273 |
| 1938 | 0.9264 |
| 2014 | 0.9255 |
Nine of the top ten completion percentages for final valid observations are found from 1940 and earlier. Only one “current” year, 2014, is in the top ten.
Minimum Completion Percentage - Bottom Ten
tdata_bot_ten <- ww_tdata_rpt %>%
arrange(rpt_fin_cobs_pct) %>%
select(rpt_year, rpt_fin_cobs_pct)
| rpt_year | rpt_fin_cobs_pct |
|---|---|
| 1901 | 0.8596 |
| 1948 | 0.8614 |
| 1907 | 0.8644 |
| 1900 | 0.8678 |
| 1906 | 0.8706 |
| 1973 | 0.8714 |
| 1904 | 0.8715 |
| 1974 | 0.8716 |
| 2016 | 0.8726 |
| 1903 | 0.8736 |
Perhaps the most concerning finding is that the comparison year, 2016, that NOAA and NASA use as the basis for their paper has the ninth lowest completion percentage, 87.26%, of the 117 years analyzed.
Baseline Years Completion Percentage
tdata_bas_yrs <- ww_tdata_rpt %>%
filter(rpt_year >= 1951,
rpt_year <= 1980) %>%
select(rpt_year, rpt_fin_cobs_pct)
| rpt_year | rpt_fin_cobs_pct |
|---|---|
| 1951 | 0.9095 |
| 1952 | 0.9178 |
| 1953 | 0.9130 |
| 1954 | 0.9160 |
| 1955 | 0.9153 |
| 1956 | 0.9139 |
| 1957 | 0.9014 |
| 1958 | 0.9150 |
| 1959 | 0.8782 |
| 1960 | 0.8970 |
| 1961 | 0.8954 |
| 1962 | 0.8961 |
| 1963 | 0.8977 |
| 1964 | 0.9235 |
| 1965 | 0.9136 |
| 1966 | 0.9206 |
| 1967 | 0.9206 |
| 1968 | 0.9178 |
| 1969 | 0.9146 |
| 1970 | 0.9096 |
| 1971 | 0.8954 |
| 1972 | 0.9221 |
| 1973 | 0.8714 |
| 1974 | 0.8716 |
| 1975 | 0.8741 |
| 1976 | 0.8828 |
| 1977 | 0.8754 |
| 1978 | 0.8798 |
| 1979 | 0.8810 |
| 1980 | 0.8745 |
## 0% 25% 50% 75% 100%
## 0.871400 0.881450 0.905450 0.915225 0.923500
Once again, half of the years that comprise the Mid_Century baseline used in the report indicate complete observation percentages under 90%.
The median completion percentage, after cleaning, is around 90% which translates into 1 out of 10 potential observations missing from the recorded temperature history between 1900 and 2016. As was stated in the previous project, depending on when those missing observations occurred and how the missing data is handled in the NOAA model, there may be an impact to actual global average temperatures.
We also find temperature stations dropping from the record due to containing zero valid observations.
There were zero years with a 95% completion percentage or better. A thorough review of the processed temperature data, including time of the year and geographic location, will be required to understand the potential impact.
During the baseline years, 50% of the years had complete observation percentages less than 90%. And the year 2016 had one of the lowest recorded observation percentages in the 117 year temperature history.
As mentioned earlier, a summary report with the statistics from this project is available at the link documented above.
There are two immediate next steps as a result of the findings from this project.
Since we have an incomplete historical record that was used in the NOAA and NASA analysis, it is important to understand the potential impacts to the conclusions that are drawn.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.2 lubridate_1.6.0 knitr_1.16 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.11 digest_0.6.12 rprojroot_1.2 assertthat_0.2.0
## [5] plyr_1.8.4 R6_2.2.1 DBI_0.6-1 backports_1.1.0
## [9] magrittr_1.5 evaluate_0.10 highr_0.6 rlang_0.1.1
## [13] stringi_1.1.5 lazyeval_0.2.0 rmarkdown_1.5 tools_3.4.0
## [17] stringr_1.2.0 yaml_2.1.14 compiler_3.4.0 htmltools_0.3.6
## [21] tibble_1.3.3