NOAA Yearly Weather Data Processing

Analyzing 1900-2016 Temperature Data



Project Synopsis

According to a report published on January 18, 2017 by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA):

…(the) Earth’s 2016 surface temperatures were the warmest since modern record keeping began in 1880.

Globally-averaged temperatures in 2016 were 1.78 degrees Fahrenheit (0.99 degrees Celsius) warmer than the mid-20th century mean. This makes 2016 the third year in a row to set a new record for global average surface temperatures.

Source: https://www.nasa.gov/press-release/nasa-noaa-data-show-2016-warmest-year-on-record-globally


The 2016 Weather Data Exploratory Analysis project was started to review the raw data from NOAA and identify areas of uncertainty and their potential impact on reaching a greater than 95% scientific certainty.

This is Part 8 of the 2016 Weather Data Exploratory Analysis.


This project analyzes the NOAA yearly weather data files processed in Part 6 of the project. That work found that the historical weather record is missing a significant number of observations, approximately 10%. To determine the potential impact on any scientific conclusions, this part explores the processed yearly temperature data files and measures the degree of missing data by geographic indicator and by time of year. In Part 7 of the project, geographic indicators were assigned to the master station data to reflect hemisphere, quadrant, sector, and temperature zone. These indicators are used to examine the completeness of the data geographically, and the observation month is used to examine completeness by time of year.



Libraries Required

library(dplyr)        # Data manipulation
library(ggplot2)      # Graphics Language for complex plots 
library(knitr)        # Dynamic report generation
library(lubridate)    # Date and time processing
library(reshape2)     # Data transposition
library(stringi)      # String processing

Stations Data

The Stations Data to be used is the modified worldwide master list created in Part 7 of the 2016 Weather Data Exploratory Analysis in which geographic indicators were assigned to the master station data from Part 1 of the project.


Read Stations Data

For this project, the only required data fields are the station ID, hemisphere, quadrant, and temperature zone.

Read Master Station Data

master_stations_in <- readRDS("~/Temp Stations/master_stations_in.rds")
ID           Latitude  Longitude  Elevation  Country               State  Location               FirstYear  LastYear  Hemi  Quad  Zone  Sect
ACW00011604  17.1167   -61.7833   33.14      ANTIGUA AND BARBUDA   NA     ST JOHNS COOLIDGE FLD  1949       1949      N     NW    TR    S30
ACW00011647  17.1333   -61.7833   62.99      ANTIGUA AND BARBUDA   NA     ST JOHNS               1961       1961      N     NW    TR    S30
AE000041196  25.3330   55.5170    111.55     UNITED ARAB EMIRATES  NA     SHARJAH INTER. AIRP    1944       2017      N     NE    TR    S20
AEM00041194  25.2550   55.3640    34.12      UNITED ARAB EMIRATES  NA     DUBAI INTL             1983       2017      N     NE    TR    S20
AEM00041217  24.4330   54.6510    87.93      UNITED ARAB EMIRATES  NA     ABU DHABI INTL         1983       2017      N     NE    TR    S20

Create Master Station Indicator Table

master_indicators <- master_stations_in %>%
                     select(ID, Hemi, Quad, Zone)
ID           Hemi  Quad  Zone
ACW00011604  N     NW    TR
ACW00011647  N     NW    TR
AE000041196  N     NE    TR
AEM00041194  N     NE    TR
AEM00041217  N     NE    TR
AEM00041218  N     NE    TR
AF000040930  N     NE    TE
AFM00040938  N     NE    TE
AFM00040948  N     NE    TE
AFM00040990  N     NE    TE

Data Processing

Preparing the Processing Loop

All of the processed yearly data files from 1900-2016 are stored in a local data directory. To prepare for the processing loop, the directory path and file names are gathered. The file names can be looped through using the vector index.

Set Path

rds_path <- "~/NOAA Data/RDS WDT 1900-2016/"

Retrieve File Names

# pattern is a regular expression: escape the dot and anchor at the end of the name
file.names <- dir(rds_path, pattern = "\\.rds$")
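Because `dir()` treats `pattern` as a regular expression, and the processing loop later reads the year from characters 9-12 of each file name, it can help to check both steps on a few sample names first. The sketch below uses base R only. The file names are hypothetical, since the actual directory listing is not shown in this report.

```r
# Hypothetical file names; the real directory contents are not shown in this report.
example_files <- c("WDT_RDS_1900.rds", "WDT_RDS_2016.rds",
                   "WDT_RDS_2016.rds.bak", "notes_rds.txt")

# Escaping the dot and anchoring at the end keeps only names ending in ".rds".
rds_only <- grep("\\.rds$", example_files, value = TRUE)

# Fixed-position extraction, as in the processing loop (characters 9-12).
years_substr <- as.numeric(substr(rds_only, 9, 12))

# Position-independent alternative: take the first run of four digits.
years_regex <- as.numeric(regmatches(rds_only, regexpr("[0-9]{4}", rds_only)))

print(rds_only)
print(years_substr)
```

The fixed-position approach only works while every file name has the year in the same place. The regular-expression variant is one way to relax that assumption.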

Summary Functions

There are four functions that will be called from within the yearly temperature data processing loop. The functions will allow each year of data to be processed within a single processing loop and create a set of complete summarized data tables for temperature stations and observations broken out by worldwide, hemisphere, quadrant, and temperature zone.

Worldwide Summary Function

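# Summarize one year of station-month counts across all stations worldwide.
#   temp_frame: one row per station ID and Month, with Count (actual
#               observations) and Days (days in that month)
#   temp_year:  the year being processed
WW_Summ_Function <- function(temp_frame, temp_year) {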
        
        temp_frame <- temp_frame %>%
                      filter(complete.cases(.))
        
        temp_stat <- length(unique(temp_frame$ID))
        
        temp_summ <- temp_frame %>%
                     group_by(Month) %>%
                     summarize(ID_Ct = n(),
                               OB_Ct = sum(Count),
                               PO_Ct = sum(Days)) %>%
                     ungroup() %>%
                     mutate(Year         = temp_year,
                            Stations     = temp_stat,
                            Station_Pct  = round(ID_Ct / Stations, 3),
                            Mth_Pot_Obs  = Stations * (PO_Ct / ID_Ct),
                            Mth_Pot_Pct  = round(OB_Ct / Mth_Pot_Obs, 3),
                            Sta_Act_Pct  = round(OB_Ct / PO_Ct, 3)) %>%
                     select(Year, Stations, Month, ID_Ct, Station_Pct, Mth_Pot_Obs, 
                            OB_Ct, Mth_Pot_Pct, PO_Ct, Sta_Act_Pct)
        
        return(temp_summ)
}
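To make the derived columns concrete, the following sketch recomputes them by hand for a single month. It uses the January 1900 values from the ww_month_final output shown later in this report. The lower-case variable names are only for illustration.

```r
# Values taken from the January 1900 row of ww_month_final.
stations <- 2615    # stations reporting at any point during 1900
id_ct    <- 2303    # stations reporting in January
ob_ct    <- 69765   # actual observations in January
po_ct    <- 71393   # days in the month, summed over the reporting stations

station_pct <- round(id_ct / stations, 3)     # share of the year's stations reporting
mth_pot_obs <- stations * (po_ct / id_ct)     # all stations x days in the month
mth_pot_pct <- round(ob_ct / mth_pot_obs, 3)  # completion against all stations
sta_act_pct <- round(ob_ct / po_ct, 3)        # completion against reporting stations

print(c(station_pct, mth_pot_obs, mth_pot_pct, sta_act_pct))
```

Here po_ct / id_ct is simply the 31 days in January, so Mth_Pot_Obs is the number of observations expected if every station counted for the year had reported every day of the month.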

Hemisphere Summary Function

HM_Summ_Function <- function(temp_frame, temp_year) {
        
        stat_cnts <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Hemi) %>%
                     summarize(Count = length(unique(ID))) %>%
                     ungroup()
        
        temp_n_st <- stat_cnts$Count[stat_cnts$Hemi == "N"]
        temp_s_st <- stat_cnts$Count[stat_cnts$Hemi == "S"]
        
        temp_summ <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Hemi, Month) %>%
                     summarize(ID_Ct = n(),
                               OB_Ct = sum(Count),
                               PO_Ct = sum(Days)) %>%
                     ungroup() %>%
                     mutate(Year         = temp_year,
                            Stations     = ifelse(Hemi == "N", temp_n_st, temp_s_st),
                            Station_Pct  = round(ID_Ct / Stations, 3),
                            Mth_Pot_Obs  = Stations * (PO_Ct / ID_Ct),
                            Mth_Pot_Pct  = round(OB_Ct / Mth_Pot_Obs, 3),
                            Sta_Act_Pct  = round(OB_Ct / PO_Ct, 3)) %>%
                     select(Hemi, Year, Stations, Month, ID_Ct, Station_Pct, Mth_Pot_Obs, 
                            OB_Ct, Mth_Pot_Pct, PO_Ct, Sta_Act_Pct)
        
        return(temp_summ)
}

Quadrant Summary Function

QU_Summ_Function <- function(temp_frame, temp_year) {
        
        stat_cnts <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Quad) %>%
                     summarize(Count = length(unique(ID))) %>%
                     ungroup()
        
        quad_pres <- stat_cnts$Quad
        
        temp_ne_st <- ifelse("NE" %in% quad_pres, stat_cnts$Count[stat_cnts$Quad == "NE"], as.integer(0))
        temp_nw_st <- ifelse("NW" %in% quad_pres, stat_cnts$Count[stat_cnts$Quad == "NW"], as.integer(0))
        temp_se_st <- ifelse("SE" %in% quad_pres, stat_cnts$Count[stat_cnts$Quad == "SE"], as.integer(0))
        temp_sw_st <- ifelse("SW" %in% quad_pres, stat_cnts$Count[stat_cnts$Quad == "SW"], as.integer(0))
        
        temp_summ <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Quad, Month) %>%
                     summarize(ID_Ct = n(),
                               OB_Ct = sum(Count),
                               PO_Ct = sum(Days)) %>%
                     ungroup() %>%
                     mutate(Year         = temp_year,
                            Stations     = case_when(.$Quad == "NE" ~ temp_ne_st,
                                                     .$Quad == "NW" ~ temp_nw_st,
                                                     .$Quad == "SE" ~ temp_se_st,
                                                     .$Quad == "SW" ~ temp_sw_st),
                            Station_Pct  = round(ID_Ct / Stations, 3),
                            Mth_Pot_Obs  = Stations * (PO_Ct / ID_Ct),
                            Mth_Pot_Pct  = round(OB_Ct / Mth_Pot_Obs, 3),
                            Sta_Act_Pct  = round(OB_Ct / PO_Ct, 3)) %>%
                     select(Quad, Year, Stations, Month, ID_Ct, Station_Pct, Mth_Pot_Obs, 
                            OB_Ct, Mth_Pot_Pct, PO_Ct, Sta_Act_Pct)
        
        return(temp_summ)
}
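The quadrant function checks whether each quadrant is present before looking up its station count, so a year with no stations in a quadrant yields zero rather than an empty value. One possible alternative, sketched here in base R, is a single match() lookup. Only the NE count of 160 comes from the 1900 output in this report. The NW and SE counts are hypothetical, and SW is omitted on purpose.

```r
# Station counts per quadrant for one year. Only NE = 160 is from the report;
# NW and SE are hypothetical, and SW is deliberately absent.
stat_cnts <- data.frame(Quad  = c("NE", "NW", "SE"),
                        Count = c(160L, 2400L, 40L),
                        stringsAsFactors = FALSE)

all_quads <- c("NE", "NW", "SE", "SW")

# match() returns NA for quadrants missing from stat_cnts; turn those into 0.
quad_counts <- stat_cnts$Count[match(all_quads, stat_cnts$Quad)]
quad_counts[is.na(quad_counts)] <- 0L
names(quad_counts) <- all_quads

print(quad_counts)
```

This keeps the zero-fill behavior of the four ifelse() calls while handling every quadrant in one step.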

Zone Summary Function

TZ_Summ_Function <- function(temp_frame, temp_year) {
        
        stat_cnts <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Zone) %>%
                     summarize(Count = length(unique(ID))) %>%
                     ungroup()
        
        temp_po_st <- stat_cnts$Count[stat_cnts$Zone == "PO"]
        temp_te_st <- stat_cnts$Count[stat_cnts$Zone == "TE"]
        temp_tr_st <- stat_cnts$Count[stat_cnts$Zone == "TR"]
        
        temp_summ <- temp_frame %>%
                     filter(complete.cases(.)) %>%
                     group_by(Zone, Month) %>%
                     summarize(ID_Ct = n(),
                               OB_Ct = sum(Count),
                               PO_Ct = sum(Days)) %>%
                     ungroup() %>%
                     mutate(Year         = temp_year,
                            Stations     = case_when(.$Zone == "PO" ~ temp_po_st,
                                                     .$Zone == "TE" ~ temp_te_st,
                                                     .$Zone == "TR" ~ temp_tr_st),
                            Station_Pct  = round(ID_Ct / Stations, 3),
                            Mth_Pot_Obs  = Stations * (PO_Ct / ID_Ct),
                            Mth_Pot_Pct  = round(OB_Ct / Mth_Pot_Obs, 3),
                            Sta_Act_Pct  = round(OB_Ct / PO_Ct, 3)) %>%
                     select(Zone, Year, Stations, Month, ID_Ct, Station_Pct, Mth_Pot_Obs, 
                            OB_Ct, Mth_Pot_Pct, PO_Ct, Sta_Act_Pct)
        
        return(temp_summ)
}

Processing Loop

The processing loop will perform summary statistical gathering for each of the years in our temperature history related to the number of stations reporting data and the actual number of observations recorded. The geographic indicators for the stations will be used to create summarizations by hemisphere, quadrant, and temperature zone. The resultant tables will then be used to create the station summary data and observation summary data for charting.

Process Yearly Weather Data

for(i in seq_along(file.names)){   # seq_along() is safe even if no files are found
        
     rds_file <- paste(rds_path, file.names[i], sep="")
     
     wdt_data <- readRDS(rds_file)
     
     ## Record year for reporting
     dat_year    <- as.numeric(substr(file.names[i], 9, 12))

     ## Record number of days in the year
     nbr_yday    <- if(leap_year(dat_year)) {366} else {365}
     
     
     ## Summarize yearly temperature data by station ID and Month and join to master indicators table

     monthly_summary <- wdt_data %>%
                        select(ID, Month) %>%
                        group_by(ID, Month) %>%
                        summarize(Count = n()) %>%
                        ungroup() %>%
                        mutate(Days = days_in_month(Month)) %>%
                        transform(Days = ifelse(nbr_yday == 366 & Month == 2, 29, Days))
     
     monthly_summary <- left_join(monthly_summary, master_indicators, by = "ID")
     
     
     ## Worldwide Monthly Summary
     
     ww_month_table  <- WW_Summ_Function(monthly_summary, dat_year)
     
     if(i == 1) { ww_month_final <- ww_month_table } else
                { ww_month_final <- bind_rows(ww_month_final, ww_month_table)}
     
     rm(ww_month_table)
     
     
     ## Hemisphere Monthly Summary
     
     hm_month_table  <- HM_Summ_Function(monthly_summary, dat_year)
     
     if(i == 1) { hm_month_final <- hm_month_table } else
                { hm_month_final <- bind_rows(hm_month_final, hm_month_table)}
     
     rm(hm_month_table)
     
     
     
     ## Quadrant Monthly Summary
     
     qu_month_table  <- QU_Summ_Function(monthly_summary, dat_year)
     
     if(i == 1) { qu_month_final <- qu_month_table } else
                { qu_month_final <- bind_rows(qu_month_final, qu_month_table)}
     
     rm(qu_month_table)
     
     
     ## Temp Zone Monthly Summary
     
     tz_month_table  <- TZ_Summ_Function(monthly_summary, dat_year)
     
     if(i == 1) { tz_month_final <- tz_month_table } else
                { tz_month_final <- bind_rows(tz_month_final, tz_month_table)}
     
     rm(tz_month_table)
}
ww_month_final
Year  Stations  Month  ID_Ct  Station_Pct  Mth_Pot_Obs  OB_Ct  Mth_Pot_Pct  PO_Ct  Sta_Act_Pct
1900  2615      1      2303   0.881        81065        69765  0.861        71393  0.977
1900  2615      2      2339   0.894        73220        64168  0.876        65492  0.980
1900  2615      3      2352   0.899        81065        71241  0.879        72912  0.977
1900  2615      4      2366   0.905        78450        69474  0.886        70980  0.979
1900  2615      5      2334   0.893        81065        70848  0.874        72354  0.979
hm_month_final
Hemi  Year  Stations  Month  ID_Ct  Station_Pct  Mth_Pot_Obs  OB_Ct  Mth_Pot_Pct  PO_Ct  Sta_Act_Pct
N     1900  2574      1      2262   0.879        79794        68546  0.859        70122  0.978
N     1900  2574      2      2298   0.893        72072        63047  0.875        64344  0.980
N     1900  2574      3      2311   0.898        79794        70013  0.877        71641  0.977
N     1900  2574      4      2325   0.903        77220        68283  0.884        69750  0.979
N     1900  2574      5      2293   0.891        79794        69635  0.873        71083  0.980
qu_month_final
Quad  Year  Stations  Month  ID_Ct  Station_Pct  Mth_Pot_Obs  OB_Ct  Mth_Pot_Pct  PO_Ct  Sta_Act_Pct
NE    1900  160       1      151    0.944        4960         4618   0.931        4681   0.987
NE    1900  160       2      153    0.956        4480         4223   0.943        4284   0.986
NE    1900  160       3      151    0.944        4960         4560   0.919        4681   0.974
NE    1900  160       4      152    0.950        4800         4530   0.944        4560   0.993
NE    1900  160       5      154    0.962        4960         4701   0.948        4774   0.985
tz_month_final
Zone  Year  Stations  Month  ID_Ct  Station_Pct  Mth_Pot_Obs  OB_Ct  Mth_Pot_Pct  PO_Ct  Sta_Act_Pct
PO    1900  43        1      36     0.837        1333         1104   0.828        1116   0.989
PO    1900  43        2      36     0.837        1204         998    0.829        1008   0.990
PO    1900  43        3      35     0.814        1333         1068   0.801        1085   0.984
PO    1900  43        4      36     0.837        1290         1069   0.829        1080   0.990
PO    1900  43        5      36     0.837        1333         1103   0.827        1116   0.988
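The processing loop grows each of the four final tables by binding one year at a time. A common alternative is to collect the per-year tables in a list and bind them once at the end. The sketch below shows that pattern in base R. The summarize_year() helper is a hypothetical stand-in for one pass of the loop, not part of the original code.

```r
# Hypothetical stand-in for one pass of the loop: returns a one-row table
# holding the year and its number of days (leap years have 366).
summarize_year <- function(year) {
  leap <- (year %% 4 == 0 && year %% 100 != 0) || (year %% 400 == 0)
  data.frame(Year = year, Days = if (leap) 366L else 365L)
}

years <- c(1900, 1904, 2016)

# Build one table per year, then bind them all in a single step.
yearly_tables <- lapply(years, summarize_year)
all_years     <- do.call(rbind, yearly_tables)

print(all_years)
```

With dplyr loaded, bind_rows() also accepts a list of data frames, so the same pattern would work with the summary functions used in this project.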

Data Prep for Charting

The next step will prepare the data gathered in the processing loop so it can be used in creating the comparison charts to follow. The monthly data will be summarized by year to gather the respective temperature station counts and percentages along with the potential and actual observation counts.

Yearly Data Processing

##Worldwide

ww_year_sum <- ww_month_final %>%
               group_by(Year) %>%
               summarize(Yr_Stations = max(Stations),
                         Av_Stations = round(mean(ID_Ct)),
                         AOBS = sum(OB_Ct),
                         POBS = sum(Mth_Pot_Obs)) %>%
               ungroup() %>%
               mutate(Pt_Stations = round(Av_Stations/Yr_Stations, 3),
                      OBS_Pt = round(AOBS/POBS, 3)) %>%
               select(Year, Yr_Stations, Av_Stations, Pt_Stations, POBS, AOBS, OBS_Pt)


##Hemisphere

hm_year_sum <- hm_month_final %>%
               group_by(Year, Hemi) %>%
               summarize(Yr_Stations = max(Stations),
                         Av_Stations = round(mean(ID_Ct)),
                         AOBS = sum(OB_Ct),
                         POBS = sum(Mth_Pot_Obs)) %>%
               ungroup() %>%
               mutate(Pt_Stations = round(Av_Stations/Yr_Stations, 3),
                      OBS_Pt = round(AOBS/POBS, 3)) %>%
               select(Year, Hemi, Yr_Stations, Av_Stations, Pt_Stations, POBS, AOBS, OBS_Pt)

hm_n_year_sum <- hm_year_sum %>% filter(Hemi == "N")
hm_s_year_sum <- hm_year_sum %>% filter(Hemi == "S")

rm(hm_year_sum)


##Quadrant

qu_year_sum <- qu_month_final %>%
               group_by(Year, Quad) %>%
               summarize(Yr_Stations = max(Stations),
                         Av_Stations = round(mean(ID_Ct)),
                         AOBS = sum(OB_Ct),
                         POBS = sum(Mth_Pot_Obs)) %>%
               ungroup() %>%
               mutate(Pt_Stations = round(Av_Stations/Yr_Stations, 3),
                      OBS_Pt = round(AOBS/POBS, 3)) %>%
               select(Year, Quad, Yr_Stations, Av_Stations, Pt_Stations, POBS, AOBS, OBS_Pt)

qu_ne_year_sum <- qu_year_sum %>% filter(Quad == "NE")
qu_nw_year_sum <- qu_year_sum %>% filter(Quad == "NW")
qu_se_year_sum <- qu_year_sum %>% filter(Quad == "SE")
qu_sw_year_sum <- qu_year_sum %>% filter(Quad == "SW")

rm(qu_year_sum)


##Temperature Zone

tz_year_sum <- tz_month_final %>%
               group_by(Year, Zone) %>%
               summarize(Yr_Stations = max(Stations),
                         Av_Stations = round(mean(ID_Ct)),
                         AOBS = sum(OB_Ct),
                         POBS = sum(Mth_Pot_Obs)) %>%
               ungroup() %>%
               mutate(Pt_Stations = round(Av_Stations/Yr_Stations, 3),
                      OBS_Pt = round(AOBS/POBS, 3)) %>%
               select(Year, Zone, Yr_Stations, Av_Stations, Pt_Stations, POBS, AOBS, OBS_Pt)

tz_po_year_sum <- tz_year_sum %>% filter(Zone == "PO")
tz_te_year_sum <- tz_year_sum %>% filter(Zone == "TE")
tz_tr_year_sum <- tz_year_sum %>% filter(Zone == "TR")

rm(tz_year_sum)
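As a check on what the yearly roll-up computes, the sketch below applies the same calculations in base R to the five 1900 rows of ww_month_final shown earlier. Because only January through May are displayed in this report, the results cover those five months and are not the full-year figures.

```r
# The five monthly rows for 1900 displayed in this report (January to May only).
ww_1900 <- data.frame(
  Year        = 1900,
  Stations    = 2615,
  Month       = 1:5,
  ID_Ct       = c(2303, 2339, 2352, 2366, 2334),
  OB_Ct       = c(69765, 64168, 71241, 69474, 70848),
  Mth_Pot_Obs = c(81065, 73220, 81065, 78450, 81065)
)

yr_stations <- max(ww_1900$Stations)          # stations reporting during the year
av_stations <- round(mean(ww_1900$ID_Ct))     # average stations reporting per month
aobs        <- sum(ww_1900$OB_Ct)             # actual observations
pobs        <- sum(ww_1900$Mth_Pot_Obs)       # potential observations
pt_stations <- round(av_stations / yr_stations, 3)
obs_pt      <- round(aobs / pobs, 3)          # observation completion rate

print(c(yr_stations, av_stations, pt_stations, pobs, aobs, obs_pt))
```

For these five months the completion rate works out to 0.875, which is in line with the monthly Mth_Pot_Pct values of 0.861 to 0.886.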

Data Visualization

Worldwide Station and Observation Charts

I have not included the code for the chart creations using the summarized data. Instead, just the final rendered charts will be displayed along with related commentary on the findings.


Worldwide Temperature Stations (1900-2016)

The number of temperature stations deployed worldwide has increased steadily since 1900. There was an increase in the rate of growth for stations beginning after WW II and continuing until 1970. However, there has been a significant decrease in stations since 2011.

The solid blue line reflects the average number of stations reporting observations per month during the observation year.

Beginning in 1959, we see an increasing gap between the total number of stations reporting for the year and the average monthly reporting. Using the year 2016 as an example, even though there are 14,284 stations reporting temperature observations for the year, during any given month only 13,048 have observations. So, reporting of observations is inconsistent and the gap indicates a growing trend of missing data since 1958.


Worldwide Temperature Observations (1900-2016)

Potential observations are simply the number of stations reporting observations in a given year multiplied by the number of days in the year, accounting for leap years. Using 2016 as an example, with 14,284 stations in the historical record producing temperature observations, the total potential observations for the year are 5,227,944 (14,284 stations × 366 days). The actual observation count for 2016 was 4,561,989. The difference is the gap reflecting missing observations. This gap has been increasing, or remaining at high levels, since 1973.
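The 2016 figures can be reproduced with a few lines of base R. The station and observation counts are the ones quoted in the paragraph above. The is_leap() helper is written here for illustration, whereas the project code uses leap_year() from lubridate.

```r
# Gregorian leap-year rule, written out for illustration.
is_leap <- function(year) (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)

stations_2016 <- 14284     # stations reporting in 2016 (from the text)
actual_2016   <- 4561989   # actual observations in 2016 (from the text)

days_2016      <- if (is_leap(2016)) 366 else 365
potential_2016 <- stations_2016 * days_2016       # 5227944
missing_2016   <- potential_2016 - actual_2016    # 665955
completion     <- round(actual_2016 / potential_2016, 3)

print(c(potential_2016, missing_2016, completion))
```

The resulting completion rate of roughly 87% matches the 2016 figure discussed in the completion percentage section.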

It is critical to note that the total number of actual observations in 2016 fell to levels close to those in 1980, despite the increase in stations over the years.


Worldwide Observation Completion (1900-2016)

The area in red illustrates the gap between potential observations and actual observations and highlights the growing number of missing observations from the worldwide historical temperature record.


Worldwide Observation Completion % (1900-2016)

Since 1900, not a single year has reached a 95% level of complete observations, meaning actual observations equal to at least 95% of the potential observations for the year given the number of stations reporting temperature data. In fact, most years fail to reach 90%.

During the Mid-Century Baseline period (1951-1980), only 16 of the 30 years reached 90% completion.

Most concerning is that 2016, the year highlighted in the NASA and NOAA report, reached only 87% completion.

Why is there so much missing data?

Are the physical temperature stations unable to report, providing inaccurate data, or suffering from quality issues? Or has the data been scrubbed because the observations were deemed incorrect?

Next, the worldwide data will be broken down by Hemispheres, Quadrants, and Temperature Zones to identify geographic trends in the quality of the data.



Hemisphere Station and Observation Charts


Worldwide Temperature Stations - By Hemisphere (1900-2016)

The Northern Hemisphere contains the overwhelming majority of the temperature stations across the globe. The contribution of the Southern Hemisphere to the worldwide historical temperature record is practically non-existent, especially given the percentage of physical land mass contained in that hemisphere.

Furthermore, the Southern Hemisphere has seen near-zero growth in observation stations since 1973.


Worldwide Temperature Observations - By Hemisphere (1900-2016)

Not surprisingly, the actual observations mirror the results of the station data. Clearly, the Southern Hemisphere has minimal impact on the worldwide observation trend: its observation count shows near-zero growth and is insignificant in comparison to the Northern Hemisphere.


Northern Hemisphere Observation Completion (1900-2016)

The Northern Hemisphere results trend in the same manner as the worldwide results due to the overwhelming percentage of observations that comprise the worldwide historical temperature record.


Southern Hemisphere Observation Completion (1900-2016)

By using the same scale as the Northern Hemisphere and worldwide charts, the degree of participation for the Southern Hemisphere is put into proper perspective. Although the amount of missing data in the Southern Hemisphere is significantly lower than in the Northern Hemisphere, the complete under-reporting of half the world skews any global analysis.


Northern Hemisphere Observation Completion % (1900-2016)

When we reviewed the worldwide completion percentages, we saw poor completion statistics indicating large amounts of missing data. Clearly, the Northern Hemisphere is responsible for the majority of missing data and the poor completion rates.


Southern Hemisphere Observation Completion % (1900-2016)

However, the Southern Hemisphere has an even worse issue with regard to providing actual observations in line with potential observations. The Southern Hemisphere is therefore severely under-represented in the worldwide historical temperature record in two areas:

  • The small number of temperature stations.
  • The reliability of the temperature stations to provide observational data.

We can break the Hemispheres into smaller segments by examining the individual Earth quadrants.



Quadrant Station and Observation Charts


Worldwide Temperature Stations - By Quadrant (1900-2016)

The Mid-Century Baseline years (1951-1980) are dominated by the Northwest Quadrant (US and Canada primarily) by a factor of 4 over the Northeast Quadrant and a factor of 8 over the Southeast and Southwest Quadrants combined. In fact, the Southwest Quadrant has virtually no statistical impact on the worldwide historical temperature record.

Of particular concern is the fact that the three lowest-reporting quadrants have seen zero growth or a decline since 1973. Their respective contributions to worldwide statistics, already at relatively insignificant levels, have become proportionally worse over the last 40 years.


Worldwide Temperature Observations - By Quadrant (1900-2016)

The actual observation statistics mirror what was seen in the stations chart. The impact of this is that 75% of the planet must rely on statistical models and inferential methods to attempt a reasonable assumption of what the temperature landscape is. The quality, reliability, and veracity of these models and methods become critical and worthy of examination.


Northeast Quadrant Observation Completion (1900-2016)

The Northeast Quadrant, despite improvements in reducing missing observations, has shown an actual decrease in observations since 1974.


Northwest Quadrant Observation Completion (1900-2016)

The Northwest Quadrant makes up the vast majority of worldwide observations. However, the amount of missing observations has consistently increased since the end of WW II. Furthermore, the quantity of observations has decreased significantly since 2011.

Instead of generating a more comprehensive view of worldwide temperatures and generating more reliable estimates, we have actually moved in the opposite direction by having less reliable data and a decrease in the observational data that are used in the climate models.


Southeast Quadrant Observation Completion (1900-2016)

The amount of missing data in the Southeast Quadrant has improved somewhat over the last decade. However, the Southeast Quadrant has a very small representation in the worldwide historical temperature record. Of particular note is the nearly absent representation during the first half of the Mid-Century Baseline.


Southwest Quadrant Observation Completion (1900-2016)

The Southwest Quadrant has virtually zero contribution to the worldwide historical temperature record and nearly zero representation during the Mid-Century Baseline.

Granted, the Southwest Quadrant primarily consists of the Pacific and Atlantic oceans. However, roughly 80% of South America and 20% of Antarctica, two of the largest land masses on Earth, are contained in this quadrant.


Northeast Quadrant Observation Completion % (1900-2016)

The Northeast Quadrant has been sporadic with regard to actual versus potential observations. This includes the Mid-Century Baseline period as well as the last 35 years. For 2016, the completion percentage was roughly 86%, significantly below the 95% threshold.


Northwest Quadrant Observation Completion % (1900-2016)

The Northwest Quadrant has a fairly consistent completion percentage falling between 90-94% during the Mid-Century Baseline and the following years. However, 2016 saw a dramatic decrease to 87% completion.

Despite the consistent completion percentages, at no time since 1990 has the quadrant reached the 95% threshold.


Southeast Quadrant Observation Completion % (1900-2016)

The Southeast Quadrant has experienced unpredictable and unreliable completion percentages since 1965. During the Mid-Century Baseline, completion drops from very acceptable levels through 1964 to percentages falling nearly to 80%. That trend continued until 2007; since then, some years have reached the 90% level.


Southwest Quadrant Observation Completion % (1900-2016)

More concerning than the under-representation of the Southwest Quadrant is the quality of data reported. The quadrant is consistently missing large amounts of observational data, over 40% at times. This means that, for the small number of stations existing in the quadrant, the reliability and consistency of the data are highly questionable. Attempting to determine temperature patterns with this degree of quality, and the sheer lack of observations, across two large land masses would be scientifically impossible. Any results would be based on overly generalized models and significant data imputation.



Temperature Zone Station and Observation Charts


Worldwide Temperature Stations - By Temperature Zone (1900-2016)

The Mid-Century Baseline period (1951-1980) is clearly dominated by the Temperate Zone. The Polar Zone, despite its incredible importance in studying global temperature data, has a minimal impact on the worldwide temperature record. In fact, both the Polar and Tropical Zones have had virtually zero growth in temperature stations since 1973.


Worldwide Temperature Observations - By Temperature Zone (1900-2016)

The actual observation chart reflects the same type of proportional data that the stations chart provided. One item to note is the significant decrease in observations in both the Temperate and Polar Zones since 2011. Once again, the trend indicates a reduction of data observations being used for scientific studies of temperature data.


Polar Zone Observation Completion (1900-2016)

The Polar Zone appears to have a minimal amount of missing observations. However, the actual number of observations is minimal with very little growth since 1975 and a decrease in observations since 2012.

Keep in mind that the Polar Zone includes Antarctica, Greenland, and large land areas belonging to Canada, the Commonwealth of Independent States, United States, and Scandinavia. These are critical areas in understanding global weather patterns and potential climate change.


Tropical Zone Observation Completion (1900-2016)

The Tropical Zone has improved somewhat in terms of missing observations over the last 10 years. However, the minor increases in observation counts are the result of reducing the missing observations.

Given that the Tropical Zone has the highest temperatures on Earth, due to the overall proximity to the equator, expanding temperature observations would be expected. However, there has been near zero growth in this zone since 1973.


Temperate Zone Observation Completion (1900-2016)

The Temperate Zone, home to the majority of the Earth’s population, has the highest number of actual observations. It also has the largest number of missing observations.

It appears that the focus for placement of temperature stations and gathering observational data is around more heavily populated areas instead of gathering an equal set of observations from around the world.

There appears to be an abundance of data for one-third of the planet and a dearth of data from the remaining two-thirds.


Polar Zone Observation Completion % (1900-2016)

The Polar Zone has many years with completion percentages between 90% and 94%. However, it also has many years that fall well below the 90% mark. Of particular note, 2016 registered only an 80% completion rate. So for the target year in the NASA and NOAA report, 20% of the observations from the Polar Zone are missing. The question then is: how does the missing data impact the overall global temperature estimate?


Tropical Zone Observation Completion % (1900-2016)

Just as we asked how the missing data from the Polar Zone affects worldwide average temperatures, what is the impact of the significant amount of missing data from the Tropical Zone? Are temperatures increasing in the warmest zone of the planet? Are the temperatures decreasing or staying the same? Given that up to 25% of the data since 1973 is missing, the impact on the Mid-Century Baseline and subsequent years of study has to be brought into question.


Temperate Zone Observation Completion % (1900-2016)

The Temperate Zone has been fairly consistent with an average completion rate near 90%. Once again, we see the drop in completion percentage for the year 2016, down to 87.5%.

The stated level of scientific accuracy in the published report was 95%. Given the significant amount of missing data from existing stations, one has to question whether that level of accuracy is valid.

So far we’ve analyzed missing data from a geographic perspective and the results are not encouraging. Now we can review the missing data in terms of the time of the year. This is important because each region of the earth has different temperature cycles. That is, the Northern Hemisphere experiences warmer temperatures during June, July, and August. In the Southern Hemisphere, the warmer temperature months are December, January, and February. Missing data, based on geography, for certain times of the year may impact the assessment of worldwide temperature averages.



Monthly Completion Charts

The best way to identify potential impacts to the average global temperature based on the degree of observation completion by month is to compare the Mid-Century Baseline to the year 2016 for each of the geographic areas we previously covered.


Worldwide Monthly Completion % (Baseline vs 2016)

Worldwide, we see significant amounts of missing data from June through December during the 2016 observation year. All of these months fell well below the Mid-Century Baseline statistics. The Baseline period shows a wide range of completion percentages during the 30-year period as well. From a global perspective, one would think that the missing data in warmer months, along with missing data in cooler months, would even out the potential estimates for average worldwide temperatures. That isn’t necessarily the case. It would depend entirely on whether the temperatures that were recorded reflected normal temperatures or whether the ones that were kept, or recorded, were greater than or less than normal.


Northern Hemisphere Monthly Completion % (Baseline vs 2016)

The Northern Hemisphere is the obvious culprit in the poor completion percentages we saw worldwide. The month of December, one of the colder months of the year, has only an 82.5% completion percentage. So without 17.5% of the observational data, the average temperature for the month will be heavily skewed by the observations that are present. With that much missing data, it’s hard to imagine reaching any reasonable scientific conclusion regarding the temperature trend compared to the Baseline.

We already know that temperatures have a great deal of variance associated with them. If the preponderance of temperatures recorded in December were during warmer, or cooler, periods within the expected variance, the average results would not reflect the true average.
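A small simulation can make this point concrete. The sketch below uses entirely synthetic temperatures for a single hypothetical station; the mean, the standard deviation, and the choice of which days go missing are assumptions made for illustration, and nothing here is drawn from the NOAA files. It compares the average when 17.5% of the days are missing at random with the average when the coldest 17.5% of days are the ones missing.

```r
set.seed(1)

# Synthetic December daily mean temperatures (deg C) for one hypothetical station.
temps     <- rnorm(31, mean = -2, sd = 6)
true_mean <- mean(temps)

# Roughly 17.5% of the month missing, mirroring the December figure above.
n_missing <- round(0.175 * length(temps))

# Case 1: the missing days are scattered at random.
random_kept <- temps[-sample(seq_along(temps), n_missing)]

# Case 2: the coldest days are the ones that are missing.
coldest   <- order(temps)[seq_len(n_missing)]
warm_kept <- temps[-coldest]

result <- c(true         = true_mean,
            random       = mean(random_kept),
            cold_missing = mean(warm_kept))
print(round(result, 2))
```

When the coldest days are dropped, the average of the remaining days is necessarily higher than the true average; dropping the warmest days would push it the other way. With random gaps the average can land on either side of the true value. Which situation applies to the real record cannot be determined from completion percentages alone.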

It is important to note that for both the worldwide and Northern Hemisphere monthly results, the 95% threshold was never met in any month during the Baseline Period or in 2016.


Southern Hemisphere Monthly Completion % (Baseline vs 2016)

The Southern Hemisphere indicates significant missing data during the Baseline years while 2016 is fairly balanced across months. There are years where individual months meet or exceed the 95% threshold. However, during 2016, no individual month reached 95%.


Northeast Quadrant Monthly Completion % (Baseline vs 2016)

The Mid-Century Baseline for the Northeast Quadrant is even more disparate than what we saw in the Southern Hemisphere. We also see a dramatic drop-off in completion percentages in 2016, beginning in April and continuing throughout the year. A meaningful comparison between 2016 and the Baseline will rely heavily on models and statistical imputation, that is, on manufactured data to fill in the missing observations.


Northwest Quadrant Monthly Completion % (Baseline vs 2016)

The Northwest Quadrant, which contains the largest number of stations and observations, has a severe amount of missing data beginning in June. And December drops to 82%, well below the norms we see during the Baseline years. This means that the Northwest Quadrant requires a great deal of manufactured data as well in order to compare against the Baseline.


Southeast Quadrant Monthly Completion % (Baseline vs 2016)

The Southeast Quadrant was fairly consistent in monthly reporting during 2016. However, the Baseline period has a large amount of variance making it difficult to properly determine a true comparison point.


Southwest Quadrant Monthly Completion % (Baseline vs 2016)

To call the Southwest Quadrant a statistical disaster would be an understatement. Given the incredibly small number of observations recorded, as seen earlier, the resultant variance of completion percentages during the Baseline years, and the disappointing percentages in 2016, the entire quadrant, 25% of the planet, is virtually unknown statistically.


Key Findings

Given the substantial amount of missing data, the complete dependence on the Northern Hemisphere (the Northwest Quadrant in particular) for the worldwide historical temperature record, the complete lack of growth in stations and observations for 75% of the world, the decrease in stations and observations in the Polar Zone, and the overall high levels of variance in the monthly completion percentages for the Baseline period, it is impossible to make a scientific statement claiming 95% accuracy about any global temperature trend, be it warming or cooling.

Any assessment will need to use substantial aggregation and imputation of data in order to manufacture sufficient observations to perform even rudimentary statistical analysis. Even then, the complete lack of geodiversity in observation collections, which we have seen in previous segments of this project, requires large-scale assumptions about the relationships of temperatures across wide distances.


Next Steps

James Hansen, in the 1970s, devised a method for aggregating temperature data in order to handle the problem of missing data due to spatial variability he encountered while performing temperature studies. His method is still used today by NASA and NOAA, with some adjustments made over time.

The key item in the model is the correlation of temperature change for stations separated by up to 1,200 km.

The next step in the project will evaluate Hansen’s method and determine the correlations for stations during the Baseline years. A review of the analysis he did will be conducted including the limiting factors he faced and the geographic samplings he chose.
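As a rough preview of what such an evaluation might involve, the sketch below computes great-circle distances between station pairs, flags the pairs within 1,200 km, and correlates their temperature series. It is not Hansen's implementation. The station coordinates, the simulated annual anomalies, the 1951-1980 baseline years, and the helper name `haversine_km` are all hypothetical and chosen only for illustration.

```r
# Great-circle distance in km using the haversine formula
# (mean Earth radius of 6371 km assumed).
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical stations (not taken from the NOAA master station file).
stations <- data.frame(id  = c("A", "B", "C"),
                       lat = c(40, 45, -30),
                       lon = c(-100, -95, 20),
                       stringsAsFactors = FALSE)

# Simulated annual temperature anomalies for the assumed baseline years.
# A and B share a common signal; C is independent of both.
set.seed(7)
years  <- 1951:1980
shared <- rnorm(length(years))
anoms  <- data.frame(A = shared + rnorm(length(years), sd = 0.3),
                     B = shared + rnorm(length(years), sd = 0.3),
                     C = rnorm(length(years)))

# Every unordered pair of stations, with distance and correlation.
pairs  <- t(combn(stations$id, 2))
result <- data.frame(s1 = pairs[, 1], s2 = pairs[, 2], stringsAsFactors = FALSE)
i <- match(result$s1, stations$id)
j <- match(result$s2, stations$id)
result$dist_km     <- haversine_km(stations$lat[i], stations$lon[i],
                                   stations$lat[j], stations$lon[j])
result$within_1200 <- result$dist_km <= 1200
result$cor         <- unname(mapply(function(a, b) cor(anoms[[a]], anoms[[b]]),
                                    result$s1, result$s2))
print(result)
```

With real data, plotting `cor` against `dist_km` for all pairs would show how quickly the correlation falls off with distance, and whether that relationship holds equally well in each hemisphere, quadrant, and temperature zone.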




sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringi_1.1.5   reshape2_1.4.2  lubridate_1.6.0 knitr_1.16     
## [5] ggplot2_2.2.1   dplyr_0.5.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11     magrittr_1.5     munsell_0.4.3    colorspace_1.3-2
##  [5] R6_2.2.1         rlang_0.1.1      highr_0.6        stringr_1.2.0   
##  [9] plyr_1.8.4       tools_3.4.0      grid_3.4.0       gtable_0.2.0    
## [13] DBI_0.6-1        htmltools_0.3.6  yaml_2.1.14      lazyeval_0.2.0  
## [17] assertthat_0.2.0 rprojroot_1.2    digest_0.6.12    tibble_1.3.3    
## [21] evaluate_0.10    rmarkdown_1.5    compiler_3.4.0   scales_0.4.1    
## [25] backports_1.1.0