In this document, I am wondering if we can replicate the claim that “The Earth has not gotten warmer in the past ten years.” (I have seen this argument in some Twitter conversations.) Here I will discuss how to gather weather data and analyze some of it. Note that this is just one analysis for just one city, and furthermore note that “weather” is not the same as “climate”.

The Data

I started out by following the instructions at Climate.gov (https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table). Next, from the NOAA (National Centers for Environmental Information), I was able to request free data from a particular weather station in Merced County (in the state of California in the United States). In this document we will explore 10-year and 100-year data (which the NOAA website sent in under 10 minutes).

library("readr")   #to load CSV files efficiently

merced10 <- read_csv("merced10.csv")
merced100 <- read_csv("merced100.csv")

I downloaded all of the information I could request. This included measurement flags and weather descriptions (e.g. “rain”). At this moment, the data looks like this.

## # A tibble: 2,969 × 66
##              STATION STATION_NAME ELEVATION LATITUDE LONGITUDE     DATE
##                <chr>        <chr>     <dbl>    <dbl>     <dbl>    <int>
## 1  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070601
## 2  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070602
## 3  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070603
## 4  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070604
## 5  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070605
## 6  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070606
## 7  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070607
## 8  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070608
## 9  GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070609
## 10 GHCND:USC00045532 MERCED CA US      46.6 37.28583 -120.5117 20070610
## # ... with 2,959 more rows, and 60 more variables: MDPR <dbl>,
## #   `Measurement Flag` <chr>, `Quality Flag` <chr>, `Source Flag` <int>,
## #   `Time of Observation` <int>, DAPR <int>, `Measurement Flag_1` <chr>,
## #   `Quality Flag_1` <chr>, `Source Flag_1` <chr>, `Time of
## #   Observation_1` <int>, PRCP <dbl>, `Measurement Flag_2` <chr>, `Quality
## #   Flag_2` <chr>, `Source Flag_2` <int>, `Time of Observation_2` <int>,
## #   SNWD <dbl>, `Measurement Flag_3` <chr>, `Quality Flag_3` <chr>,
## #   `Source Flag_3` <int>, `Time of Observation_3` <int>, SNOW <dbl>,
## #   `Measurement Flag_4` <chr>, `Quality Flag_4` <chr>, `Source
## #   Flag_4` <int>, `Time of Observation_4` <int>, TMAX <dbl>, `Measurement
## #   Flag_5` <chr>, `Quality Flag_5` <chr>, `Source Flag_5` <int>, `Time of
## #   Observation_5` <int>, TMIN <dbl>, `Measurement Flag_6` <chr>, `Quality
## #   Flag_6` <chr>, `Source Flag_6` <int>, `Time of Observation_6` <int>,
## #   TOBS <dbl>, `Measurement Flag_7` <chr>, `Quality Flag_7` <chr>,
## #   `Source Flag_7` <int>, `Time of Observation_7` <int>, WT01 <int>,
## #   `Measurement Flag_8` <chr>, `Quality Flag_8` <chr>, `Source
## #   Flag_8` <int>, `Time of Observation_8` <int>, WT05 <int>, `Measurement
## #   Flag_9` <chr>, `Quality Flag_9` <chr>, `Source Flag_9` <int>, `Time of
## #   Observation_9` <int>, WT04 <int>, `Measurement Flag_10` <chr>,
## #   `Quality Flag_10` <chr>, `Source Flag_10` <int>, `Time of
## #   Observation_10` <int>, WT03 <int>, `Measurement Flag_11` <chr>,
## #   `Quality Flag_11` <chr>, `Source Flag_11` <chr>, `Time of
## #   Observation_11` <int>

We can see that for the 10-year data set, there are nearly 3000 observations (gathered daily) and 66 variables. For today’s analysis, we will focus on the TMAX column: the maximum temperature recorded each day. Note that at this point, we have not computed an “average” of any kind.

Upon discussing this analysis with some colleagues, they requested a tutorial so that they can replicate this analysis. At first, I will intentionally do simple analyses to act as an R tutorial to those new to the R computer programming language.

highTemps <- merced10$TMAX
summary(highTemps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9999.0    17.2    24.4  -110.1    32.8    42.8

As we can see, there is a strange matter of this “9999”. For some of the technical details, sometimes scientists are instructed to avoid leaving blanks in the data because that affects the ability to transmit ASCII data (especially in the cartography fields). Thus there is a habit to type in an obviously invalid number to indicate when a measurement could not be made. Here, the average temperature in my city is definitely not freezing!

highTemps[highTemps < -273] <- NA #renames the missing data
summary(highTemps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3.30   17.80   24.40   24.95   33.30   42.80      40

That is, I relegated any recording that claimed to be below absolute zero to be “NA” instead. Now there are only 40 missing values in 2969 observations. All temperatures are in Celsius.

Ten Year Regression

The data ranges from June 1, 2007 to May 31, 2017. I will define the x-axis simply as the number of days since June 1, 2007.

x <- seq(1:length(highTemps))
y <- highTemps
plot(x,y,
     col = "blue",
     main = "Merced Weather", 
     xlab = "days since 20070601", 
     ylab = "high temperature (in Celsius)")

With a few more lines of code, we can furthermore draw a linear regression line onto the graph.

model10 <- lm(y ~ x)
plot(x,y,
     col = "blue",
     main = "Merced Weather", 
     xlab = "days since 20070601", 
     ylab = "high temperature (in Celsius)")
abline(model10,
       col = "red",
       lwd = 3)

The y-intercept of this trend line is 25.3871369 degrees Celsius and the slope is -0.0002908 degrees Celsius per day. That is, our linear regression model indicates \[-0.0002908 * 10 * 365.25\] a temperature change of about -1.06 degrees Celsius over that 10 year time span.

For posterity and for the statistically inclined, here are the full results of the model:

summary(model10)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6078  -7.3507  -0.2819   8.0725  17.6400 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.3871369  0.3256904  77.949   <2e-16 ***
## x           -0.0002908  0.0001898  -1.532    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.774 on 2927 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.0008016,  Adjusted R-squared:  0.0004602 
## F-statistic: 2.348 on 1 and 2927 DF,  p-value: 0.1255

Hundred Year Model

To continue our curiosity, we can repeat this analysis, but over 100 years of data (by simply changing “10” to “100” in the code).

highTemps <- merced100$TMAX
highTemps[highTemps < -273] <- NA #renames the missing data
summary(highTemps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.1    17.2    24.4    24.7    32.8    45.6     436
x <- seq(1:length(highTemps))
y <- highTemps
model100 <- lm(y ~ x)
plot(x,y,
     col = "blue",
     main = "Merced Weather", 
     xlab = "days since 19170601", 
     ylab = "high temperature (in Celsius)")
abline(model100,
       col = "red",
       lwd = 3)

This time the y-intercept of this trend line is about 24.38 degrees Celsius and the slope is 0.00001796 degrees Celsius per day. That is, our linear regression model indicates \[0.00001796 * 100 * 365.25\] a temperature change of about 0.66 degrees Celsius over that 100 year time span.

For posterity and for the statistically inclined, here are the full results of the model:

summary(model100)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5625  -7.7220  -0.0054   7.9337  21.1675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.438e+01  9.646e-02 252.726  < 2e-16 ***
## x           1.796e-05  4.718e-06   3.806 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.053 on 35013 degrees of freedom
##   (436 observations deleted due to missingness)
## Multiple R-squared:  0.0004136,  Adjusted R-squared:  0.0003851 
## F-statistic: 14.49 on 1 and 35013 DF,  p-value: 0.0001412

Summer Focus

Admiteddly, if I continue this with more sophisticated analyses, I would prefer to use more sophisticated tools (thus, here ends the basic tutorial). From there I want to practice using the dplyr package to wrangle the data and ggplot2 to make better graphs.

library("dplyr")   #to display and dissect data easily
library("ggplot2") #better graphs
library("tidyr")   #to wrangle data easily

merced10[merced10 < -273] <- NA #renames the missing data
m10 <- merced10 %>%
  select(STATION_NAME, DATE, TMAX)
m10
## # A tibble: 2,969 × 3
##    STATION_NAME     DATE  TMAX
##           <chr>    <int> <dbl>
## 1  MERCED CA US 20070601  30.6
## 2  MERCED CA US 20070602  31.7
## 3  MERCED CA US 20070603  32.8
## 4  MERCED CA US 20070604  32.8
## 5  MERCED CA US 20070605  32.2
## 6  MERCED CA US 20070606  34.4
## 7  MERCED CA US 20070607  23.3
## 8  MERCED CA US 20070608  27.2
## 9  MERCED CA US 20070609  30.6
## 10 MERCED CA US 20070610  32.2
## # ... with 2,959 more rows

I am wondering what would happen if we focus on the summer months. We need to separate the DATE column into separate columns for year, month, and day.

m10$DATE <- as.Date(as.character(m10$DATE), "%Y%m%d")
m10 <- m10 %>%
  separate(DATE, c("YEAR", "MONTH", "DAY"))
m10
## # A tibble: 2,969 × 5
##    STATION_NAME  YEAR MONTH   DAY  TMAX
## *         <chr> <chr> <chr> <chr> <dbl>
## 1  MERCED CA US  2007    06    01  30.6
## 2  MERCED CA US  2007    06    02  31.7
## 3  MERCED CA US  2007    06    03  32.8
## 4  MERCED CA US  2007    06    04  32.8
## 5  MERCED CA US  2007    06    05  32.2
## 6  MERCED CA US  2007    06    06  34.4
## 7  MERCED CA US  2007    06    07  23.3
## 8  MERCED CA US  2007    06    08  27.2
## 9  MERCED CA US  2007    06    09  30.6
## 10 MERCED CA US  2007    06    10  32.2
## # ... with 2,959 more rows

Disclaimer: I am going to make the coding easier by simply saying that “summer” is either July, August, or September (rather than saying that summer starts particularly on June 21). This discrepancy is off by about 9 days, but should not affect the spirit of the analysis.

m10 <- m10 %>%
  mutate(SUMMER = MONTH %in% c("07", "08", "09"))
m10
## # A tibble: 2,969 × 6
##    STATION_NAME  YEAR MONTH   DAY  TMAX SUMMER
##           <chr> <chr> <chr> <chr> <dbl>  <lgl>
## 1  MERCED CA US  2007    06    01  30.6  FALSE
## 2  MERCED CA US  2007    06    02  31.7  FALSE
## 3  MERCED CA US  2007    06    03  32.8  FALSE
## 4  MERCED CA US  2007    06    04  32.8  FALSE
## 5  MERCED CA US  2007    06    05  32.2  FALSE
## 6  MERCED CA US  2007    06    06  34.4  FALSE
## 7  MERCED CA US  2007    06    07  23.3  FALSE
## 8  MERCED CA US  2007    06    08  27.2  FALSE
## 9  MERCED CA US  2007    06    09  30.6  FALSE
## 10 MERCED CA US  2007    06    10  32.2  FALSE
## # ... with 2,959 more rows

Finally, I will unite the time columns for plotting purposes.

#m10 <- m10 %>%
#  unite(DATE, YEAR, MONTH, DAY)
m10$DAYS <- seq(1:nrow(m10))
ggplot(m10,
       aes(x = DAYS, y = TMAX, color = SUMMER)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm)

It appears that even when we distinguish the summer months from the other seasons, the trend lines appear to have virtually zero slope.

For posterity, we will look at the 100-year data one more time.

merced100[merced100 < -273] <- NA #renames the missing data
m100 <- merced100 %>%
  select(STATION_NAME, DATE, TMAX)
m100$DATE <- as.Date(as.character(m100$DATE), "%Y%m%d")
m100 <- m100 %>%
  separate(DATE, c("YEAR", "MONTH", "DAY")) %>%
  mutate(SUMMER = MONTH %in% c("07", "08", "09"))
m100$DAYS <- seq(1:nrow(m100))
ggplot(m100,
       aes(x = DAYS, y = TMAX, color = SUMMER)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = lm)

We can see that the trend lines over the 100-year time span have a slightly positive slope.