In this document, I am wondering if we can replicate the claim that “The Earth has not gotten warmer in the past ten years.” (I have seen this argument in some Twitter conversations.) Here I will discuss how to gather weather data and analyze some of it. Note that this is just one analysis for just one city, and furthermore note that “weather” is not the same as “climate”.
I started out by following the instructions at Climate.gov
(https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table). Next, from the NOAA (National Centers for Environmental Information), I was able to request free data from a particular weather station in Merced County (in the state of California in the United States). In this document we will explore 10-year and 100-year data (which the NOAA website sent in under 10 minutes).
library("readr") #to load CSV files efficiently
merced10 <- read_csv("merced10.csv")
merced100 <- read_csv("merced100.csv")
I downloaded all of the information I could request. This included measurement flags and weather descriptions (e.g. “rain”). At this moment, the data looks like this.
## # A tibble: 2,969 × 66
## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070601
## 2 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070602
## 3 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070603
## 4 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070604
## 5 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070605
## 6 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070606
## 7 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070607
## 8 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070608
## 9 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070609
## 10 GHCND:USC00045532 MERCED CA US 46.6 37.28583 -120.5117 20070610
## # ... with 2,959 more rows, and 60 more variables: MDPR <dbl>,
## # `Measurement Flag` <chr>, `Quality Flag` <chr>, `Source Flag` <int>,
## # `Time of Observation` <int>, DAPR <int>, `Measurement Flag_1` <chr>,
## # `Quality Flag_1` <chr>, `Source Flag_1` <chr>, `Time of
## # Observation_1` <int>, PRCP <dbl>, `Measurement Flag_2` <chr>, `Quality
## # Flag_2` <chr>, `Source Flag_2` <int>, `Time of Observation_2` <int>,
## # SNWD <dbl>, `Measurement Flag_3` <chr>, `Quality Flag_3` <chr>,
## # `Source Flag_3` <int>, `Time of Observation_3` <int>, SNOW <dbl>,
## # `Measurement Flag_4` <chr>, `Quality Flag_4` <chr>, `Source
## # Flag_4` <int>, `Time of Observation_4` <int>, TMAX <dbl>, `Measurement
## # Flag_5` <chr>, `Quality Flag_5` <chr>, `Source Flag_5` <int>, `Time of
## # Observation_5` <int>, TMIN <dbl>, `Measurement Flag_6` <chr>, `Quality
## # Flag_6` <chr>, `Source Flag_6` <int>, `Time of Observation_6` <int>,
## # TOBS <dbl>, `Measurement Flag_7` <chr>, `Quality Flag_7` <chr>,
## # `Source Flag_7` <int>, `Time of Observation_7` <int>, WT01 <int>,
## # `Measurement Flag_8` <chr>, `Quality Flag_8` <chr>, `Source
## # Flag_8` <int>, `Time of Observation_8` <int>, WT05 <int>, `Measurement
## # Flag_9` <chr>, `Quality Flag_9` <chr>, `Source Flag_9` <int>, `Time of
## # Observation_9` <int>, WT04 <int>, `Measurement Flag_10` <chr>,
## # `Quality Flag_10` <chr>, `Source Flag_10` <int>, `Time of
## # Observation_10` <int>, WT03 <int>, `Measurement Flag_11` <chr>,
## # `Quality Flag_11` <chr>, `Source Flag_11` <chr>, `Time of
## # Observation_11` <int>
We can see that for the 10-year data set, there are nearly 3000 observations (gathered daily) and 66 variables. For today’s analysis, we will focus on the TMAX
column: the maximum temperature recorded each day. Note that at this point, we have not computed an “average” of any kind.
Upon discussing this analysis with some colleagues, they requested a tutorial so that they can replicate this analysis. At first, I will intentionally do simple analyses to act as an R tutorial to those new to the R computer programming language.
highTemps <- merced10$TMAX
summary(highTemps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9999.0 17.2 24.4 -110.1 32.8 42.8
As we can see, there is a strange matter of this “9999”. For some of the technical details, sometimes scientists are instructed to avoid leaving blanks in the data because that affects the ability to transmit ASCII data (especially in the cartography fields). Thus there is a habit to type in an obviously invalid number to indicate when a measurement could not be made. Here, the average temperature in my city is definitely not freezing!
highTemps[highTemps < -273] <- NA #renames the missing data
summary(highTemps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.30 17.80 24.40 24.95 33.30 42.80 40
That is, I relegated any recording that claimed to be below absolute zero to be “NA” instead. Now there are only 40 missing values in 2969 observations. All temperatures are in Celsius.
The data ranges from June 1, 2007 to May 31, 2017. I will define the x-axis simply as the number of days since June 1, 2007.
x <- seq(1:length(highTemps))
y <- highTemps
plot(x,y,
col = "blue",
main = "Merced Weather",
xlab = "days since 20070601",
ylab = "high temperature (in Celsius)")
With a few more lines of code, we can furthermore draw a linear regression line onto the graph.
model10 <- lm(y ~ x)
plot(x,y,
col = "blue",
main = "Merced Weather",
xlab = "days since 20070601",
ylab = "high temperature (in Celsius)")
abline(model10,
col = "red",
lwd = 3)
The y-intercept of this trend line is 25.3871369 degrees Celsius and the slope is -0.0002908 degrees Celsius per day. That is, our linear regression model indicates \[-0.0002908 * 10 * 365.25\] a temperature change of about -1.06 degrees Celsius over that 10 year time span.
For posterity and for the statistically inclined, here are the full results of the model:
summary(model10)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6078 -7.3507 -0.2819 8.0725 17.6400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.3871369 0.3256904 77.949 <2e-16 ***
## x -0.0002908 0.0001898 -1.532 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.774 on 2927 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.0008016, Adjusted R-squared: 0.0004602
## F-statistic: 2.348 on 1 and 2927 DF, p-value: 0.1255
To continue our curiosity, we can repeat this analysis, but over 100 years of data (by simply changing “10” to “100” in the code).
highTemps <- merced100$TMAX
highTemps[highTemps < -273] <- NA #renames the missing data
summary(highTemps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.1 17.2 24.4 24.7 32.8 45.6 436
x <- seq(1:length(highTemps))
y <- highTemps
model100 <- lm(y ~ x)
plot(x,y,
col = "blue",
main = "Merced Weather",
xlab = "days since 19170601",
ylab = "high temperature (in Celsius)")
abline(model100,
col = "red",
lwd = 3)
This time the y-intercept of this trend line is about 24.38 degrees Celsius and the slope is 0.00001796 degrees Celsius per day. That is, our linear regression model indicates \[0.00001796 * 100 * 365.25\] a temperature change of about 0.66 degrees Celsius over that 100 year time span.
For posterity and for the statistically inclined, here are the full results of the model:
summary(model100)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5625 -7.7220 -0.0054 7.9337 21.1675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.438e+01 9.646e-02 252.726 < 2e-16 ***
## x 1.796e-05 4.718e-06 3.806 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.053 on 35013 degrees of freedom
## (436 observations deleted due to missingness)
## Multiple R-squared: 0.0004136, Adjusted R-squared: 0.0003851
## F-statistic: 14.49 on 1 and 35013 DF, p-value: 0.0001412
Admiteddly, if I continue this with more sophisticated analyses, I would prefer to use more sophisticated tools (thus, here ends the basic tutorial). From there I want to practice using the dplyr
package to wrangle the data and ggplot2
to make better graphs.
library("dplyr") #to display and dissect data easily
library("ggplot2") #better graphs
library("tidyr") #to wrangle data easily
merced10[merced10 < -273] <- NA #renames the missing data
m10 <- merced10 %>%
select(STATION_NAME, DATE, TMAX)
m10
## # A tibble: 2,969 × 3
## STATION_NAME DATE TMAX
## <chr> <int> <dbl>
## 1 MERCED CA US 20070601 30.6
## 2 MERCED CA US 20070602 31.7
## 3 MERCED CA US 20070603 32.8
## 4 MERCED CA US 20070604 32.8
## 5 MERCED CA US 20070605 32.2
## 6 MERCED CA US 20070606 34.4
## 7 MERCED CA US 20070607 23.3
## 8 MERCED CA US 20070608 27.2
## 9 MERCED CA US 20070609 30.6
## 10 MERCED CA US 20070610 32.2
## # ... with 2,959 more rows
I am wondering what would happen if we focus on the summer months. We need to separate
the DATE
column into separate columns for year, month, and day.
m10$DATE <- as.Date(as.character(m10$DATE), "%Y%m%d")
m10 <- m10 %>%
separate(DATE, c("YEAR", "MONTH", "DAY"))
m10
## # A tibble: 2,969 × 5
## STATION_NAME YEAR MONTH DAY TMAX
## * <chr> <chr> <chr> <chr> <dbl>
## 1 MERCED CA US 2007 06 01 30.6
## 2 MERCED CA US 2007 06 02 31.7
## 3 MERCED CA US 2007 06 03 32.8
## 4 MERCED CA US 2007 06 04 32.8
## 5 MERCED CA US 2007 06 05 32.2
## 6 MERCED CA US 2007 06 06 34.4
## 7 MERCED CA US 2007 06 07 23.3
## 8 MERCED CA US 2007 06 08 27.2
## 9 MERCED CA US 2007 06 09 30.6
## 10 MERCED CA US 2007 06 10 32.2
## # ... with 2,959 more rows
Disclaimer: I am going to make the coding easier by simply saying that “summer” is either July, August, or September (rather than saying that summer starts particularly on June 21). This discrepancy is off by about 9 days, but should not affect the spirit of the analysis.
m10 <- m10 %>%
mutate(SUMMER = MONTH %in% c("07", "08", "09"))
m10
## # A tibble: 2,969 × 6
## STATION_NAME YEAR MONTH DAY TMAX SUMMER
## <chr> <chr> <chr> <chr> <dbl> <lgl>
## 1 MERCED CA US 2007 06 01 30.6 FALSE
## 2 MERCED CA US 2007 06 02 31.7 FALSE
## 3 MERCED CA US 2007 06 03 32.8 FALSE
## 4 MERCED CA US 2007 06 04 32.8 FALSE
## 5 MERCED CA US 2007 06 05 32.2 FALSE
## 6 MERCED CA US 2007 06 06 34.4 FALSE
## 7 MERCED CA US 2007 06 07 23.3 FALSE
## 8 MERCED CA US 2007 06 08 27.2 FALSE
## 9 MERCED CA US 2007 06 09 30.6 FALSE
## 10 MERCED CA US 2007 06 10 32.2 FALSE
## # ... with 2,959 more rows
Finally, I will unite
the time columns for plotting purposes.
#m10 <- m10 %>%
# unite(DATE, YEAR, MONTH, DAY)
m10$DAYS <- seq(1:nrow(m10))
ggplot(m10,
aes(x = DAYS, y = TMAX, color = SUMMER)) +
geom_point(alpha = 0.5) +
geom_smooth(method = lm)
It appears that even when we distinguish the summer months from the other seasons, the trend lines appear to have virtually zero slope.
For posterity, we will look at the 100-year data one more time.
merced100[merced100 < -273] <- NA #renames the missing data
m100 <- merced100 %>%
select(STATION_NAME, DATE, TMAX)
m100$DATE <- as.Date(as.character(m100$DATE), "%Y%m%d")
m100 <- m100 %>%
separate(DATE, c("YEAR", "MONTH", "DAY")) %>%
mutate(SUMMER = MONTH %in% c("07", "08", "09"))
m100$DAYS <- seq(1:nrow(m100))
ggplot(m100,
aes(x = DAYS, y = TMAX, color = SUMMER)) +
geom_point(alpha = 0.1) +
geom_smooth(method = lm)
We can see that the trend lines over the 100-year time span have a slightly positive slope.