The weather in the Pacific Northwest is notoriusly rainy and gloomy, but the summers are gloriously full of perfect weather, or at least that’s what we believe. How many days are actually perfect?
To answer this question, we must first think about what a perfect day means. The temperature should be warm, but not hot and the sun should be shining most if not all of the day. So we can define a perfect day as being between a certain range of temperatures, say between 60 degrees F and 80 degrees F, and little to no precipitation. When defining a perfect day, the precipitation can’t be assumed to be zero because we will often have a small sprinkle over night, but then beautiful blue-skyed days.
We will start our data collection at NOAA and the EPA.
The NOAA website has available a weather dataset measured at Seattle-Tacoma Airport (SeaTac) spanning all the way back to 1948 containing 50 variables including both max and min temperature and precipitation data and is available for download here: Seattle Daily
The EPA has a pre-generated air quality dataset which contains the air quality index by day for every county in the US and spanning from 01.01.1980 to 11.27.2018: AQI
Let’s start by loading the datasets and taking a look at their structure.
Load the Seattle weather data:
#read in seattle weather
SeattleWeather <- readr::read_csv("SeattleWeather_1948_2017.csv")
#take a look at the data
kable(head(SeattleWeather))
| DATE | PRCP | TMAX | TMIN | RAIN |
|---|---|---|---|---|
| 1948-01-01 | 0.47 | 51 | 42 | TRUE |
| 1948-01-02 | 0.59 | 45 | 36 | TRUE |
| 1948-01-03 | 0.42 | 45 | 35 | TRUE |
| 1948-01-04 | 0.31 | 45 | 34 | TRUE |
| 1948-01-05 | 0.17 | 45 | 32 | TRUE |
| 1948-01-06 | 0.44 | 48 | 39 | TRUE |
kable(tail(SeattleWeather))
| DATE | PRCP | TMAX | TMIN | RAIN |
|---|---|---|---|---|
| 2017-12-09 | 0 | 44 | 29 | FALSE |
| 2017-12-10 | 0 | 49 | 34 | FALSE |
| 2017-12-11 | 0 | 49 | 29 | FALSE |
| 2017-12-12 | 0 | 46 | 32 | FALSE |
| 2017-12-13 | 0 | 48 | 34 | FALSE |
| 2017-12-14 | 0 | 50 | 36 | FALSE |
#look at the structure of the data
str(SeattleWeather)
## Classes 'tbl_df', 'tbl' and 'data.frame': 25551 obs. of 5 variables:
## $ DATE: Date, format: "1948-01-01" "1948-01-02" ...
## $ PRCP: num 0.47 0.59 0.42 0.31 0.17 0.44 0.41 0.04 0.12 0.74 ...
## $ TMAX: int 51 45 45 45 45 48 50 48 50 43 ...
## $ TMIN: int 42 36 35 34 32 39 40 35 31 34 ...
## $ RAIN: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ DATE:List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_date" "collector"
## .. ..$ PRCP: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ TMAX: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ TMIN: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ RAIN: list()
## .. .. ..- attr(*, "class")= chr "collector_logical" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Seattle weather is a dataframe with 25551 observations of 5 variables which are DATE, PRCP, TMAX, TMIN, RAIN. Each observation represents a single day’s meausurements.
TMAX and TMIN are integers representing the maximum and minimum temperatures recorded for that day. RAIN is a logical vector which is true if there is 0.01 inches of rain and false if there is 0.0 inches precipitation whereas PRCP is a numeric representing the amount of daily precipitation measured in inches. DATE is already a date factor so it won’t need to be transformed.
Next we must consider air quality as the recent wildfires on the west coast have greatly impacted everyone’s ability to safely go outside and a good air quality rating should be factored into our definition of a perfect day.
We load each year of AQI data from 1980 to present, filtering for only Seattle, King County, Washington, and then create a single data frame for all the years of data.
#load the code from the EPA and create a table from all the years
#after data was read in and stored in a .csv, code was removed to allow for faster knitting
#data_list <- list()
#for(year in 1980:2018) {
# temp <- tempfile()
# base_file_name <- paste0("daily_aqi_by_county_",year)
# url_name <- paste0("https://aqs.epa.gov/aqsweb/airdata/",base_file_name,".zip")
# csv_name <- paste0(base_file_name,".csv")
# download.file(url_name,temp)
# data <- read_csv(unz(temp, csv_name)) %>%
# filter(`State Name` == "Washington" & `county Name` == "King")
# data_list[[year]] <- data
# unlink(temp)
#}
#AQIdata <- do.call("rbind", data_list)
#load the .csv created from reading in the downloaded AQI year data
AQIdata <- readr::read_csv("AQI_KING.csv")
#take a look at the data
kable(head(AQIdata))
| State Name | county Name | State Code | County Code | Date | AQI | Category | Defining Parameter | Defining Site | Number of Sites Reporting |
|---|---|---|---|---|---|---|---|---|---|
| Washington | King | 53 | 033 | 1980-01-01 | 32 | Good | CO | 53-033-0076 | 11 |
| Washington | King | 53 | 033 | 1980-01-02 | 72 | Moderate | CO | 53-033-0077 | 14 |
| Washington | King | 53 | 033 | 1980-01-03 | 82 | Moderate | CO | 53-033-0062 | 14 |
| Washington | King | 53 | 033 | 1980-01-04 | 128 | Unhealthy for Sensitive Groups | CO | 53-033-0077 | 14 |
| Washington | King | 53 | 033 | 1980-01-05 | 89 | Moderate | CO | 53-033-0077 | 14 |
| Washington | King | 53 | 033 | 1980-01-06 | 61 | Moderate | CO | 53-033-1003 | 14 |
kable(tail(AQIdata))
| State Name | county Name | State Code | County Code | Date | AQI | Category | Defining Parameter | Defining Site | Number of Sites Reporting |
|---|---|---|---|---|---|---|---|---|---|
| Washington | King | 53 | 033 | 2018-07-27 | 64 | Moderate | Ozone | 53-033-0023 | 10 |
| Washington | King | 53 | 033 | 2018-07-28 | 67 | Moderate | Ozone | 53-033-0023 | 10 |
| Washington | King | 53 | 033 | 2018-07-29 | 147 | Unhealthy for Sensitive Groups | Ozone | 53-033-0023 | 10 |
| Washington | King | 53 | 033 | 2018-07-30 | 136 | Unhealthy for Sensitive Groups | Ozone | 53-033-0017 | 10 |
| Washington | King | 53 | 033 | 2018-07-31 | 44 | Good | Ozone | 53-033-0017 | 10 |
| Washington | King | 53 | 033 | 2018-08-01 | 5 | Good | CO | 53-033-0030 | 2 |
#look at the structure of the data
str(AQIdata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 14090 obs. of 10 variables:
## $ State Name : chr "Washington" "Washington" "Washington" "Washington" ...
## $ county Name : chr "King" "King" "King" "King" ...
## $ State Code : int 53 53 53 53 53 53 53 53 53 53 ...
## $ County Code : chr "033" "033" "033" "033" ...
## $ Date : Date, format: "1980-01-01" "1980-01-02" ...
## $ AQI : int 32 72 82 128 89 61 103 51 94 63 ...
## $ Category : chr "Good" "Moderate" "Moderate" "Unhealthy for Sensitive Groups" ...
## $ Defining Parameter : chr "CO" "CO" "CO" "CO" ...
## $ Defining Site : chr "53-033-0076" "53-033-0077" "53-033-0062" "53-033-0077" ...
## $ Number of Sites Reporting: int 11 14 14 14 14 14 13 14 14 10 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 10
## .. ..$ State Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ county Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ State Code : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ County Code : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Date :List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_date" "collector"
## .. ..$ AQI : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Category : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Defining Parameter : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Defining Site : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Number of Sites Reporting: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
The AQI or air quality index data set consists of 14090 observations of 10 variables including State Name, county Name, State Code, County Code, Date, AQI, Category, Defining Parameter, Defining Site, Number of Sites Reporting. Each observation represents a single day’s meausurements. The Date and Category variables are of interest in determining which days had acceptable air quality in Seattle. Once again the Date was read in as a date factor and category is a character.
So now we have 2 dataframes with temperature, precipitation and air quality included. The next step is to combine these dataframes into one table.
Since our AQI dataset only contains data from 1980, we will only have concurrent dates beginning from 01.01.1980 and going to 12.14.2017 which is when the Seattle Weather Data Ends.
### inner join on the dates available
AllWeather <- inner_join(SeattleWeather, AQIdata, by = c("DATE" = "Date"))
#look at how the new table is structured
str(AllWeather)
## Classes 'tbl_df', 'tbl' and 'data.frame': 13860 obs. of 14 variables:
## $ DATE : Date, format: "1980-01-01" "1980-01-02" ...
## $ PRCP : num 0.22 0.19 0 0.13 0.12 0 0.19 0.37 0.15 0 ...
## $ TMAX : int 52 46 46 40 39 35 30 29 34 38 ...
## $ TMIN : int 44 37 33 34 30 26 24 26 26 28 ...
## $ RAIN : logi TRUE TRUE FALSE TRUE TRUE FALSE ...
## $ State Name : chr "Washington" "Washington" "Washington" "Washington" ...
## $ county Name : chr "King" "King" "King" "King" ...
## $ State Code : int 53 53 53 53 53 53 53 53 53 53 ...
## $ County Code : chr "033" "033" "033" "033" ...
## $ AQI : int 32 72 82 128 89 61 103 51 94 63 ...
## $ Category : chr "Good" "Moderate" "Moderate" "Unhealthy for Sensitive Groups" ...
## $ Defining Parameter : chr "CO" "CO" "CO" "CO" ...
## $ Defining Site : chr "53-033-0076" "53-033-0077" "53-033-0062" "53-033-0077" ...
## $ Number of Sites Reporting: int 11 14 14 14 14 14 13 14 14 10 ...
#look at the sumamry statistics of air quality
kable(AllWeather %>%
select_if(is.numeric) %>%
skimr::skim(AQI))
| variable | type | stat | level | value | formatted |
|---|---|---|---|---|---|
| AQI | integer | missing | .all | 0.00000 | 0 |
| AQI | integer | complete | .all | 13860.00000 | 13860 |
| AQI | integer | n | .all | 13860.00000 | 13860 |
| AQI | integer | mean | .all | 56.53550 | 56.54 |
| AQI | integer | sd | .all | 26.53875 | 26.54 |
| AQI | integer | p0 | .all | 7.00000 | 7 |
| AQI | integer | p25 | .all | 39.00000 | 39 |
| AQI | integer | p50 | .all | 49.00000 | 49 |
| AQI | integer | p75 | .all | 66.00000 | 66 |
| AQI | integer | p100 | .all | 220.00000 | 220 |
| AQI | integer | hist | .all | NA | ▂▇▂▁▁▁▁▁ |
#look at the sumamry statistics of max temperature
kable(AllWeather %>%
select_if(is.numeric) %>%
skimr::skim(TMAX))
| variable | type | stat | level | value | formatted |
|---|---|---|---|---|---|
| TMAX | integer | missing | .all | 0.00000 | 0 |
| TMAX | integer | complete | .all | 13860.00000 | 13860 |
| TMAX | integer | n | .all | 13860.00000 | 13860 |
| TMAX | integer | mean | .all | 60.12345 | 60.12 |
| TMAX | integer | sd | .all | 12.67805 | 12.68 |
| TMAX | integer | p0 | .all | 18.00000 | 18 |
| TMAX | integer | p25 | .all | 50.00000 | 50 |
| TMAX | integer | p50 | .all | 59.00000 | 59 |
| TMAX | integer | p75 | .all | 69.00000 | 69 |
| TMAX | integer | p100 | .all | 103.00000 | 103 |
| TMAX | integer | hist | .all | NA | ▁▁▅▇▆▃▂▁ |
#look at the sumamry statistics of min temperature
kable(AllWeather %>%
select_if(is.numeric) %>%
skimr::skim(TMIN))
| variable | type | stat | level | value | formatted |
|---|---|---|---|---|---|
| TMIN | integer | missing | .all | 0.000000 | 0 |
| TMIN | integer | complete | .all | 13860.000000 | 13860 |
| TMIN | integer | n | .all | 13860.000000 | 13860 |
| TMIN | integer | mean | .all | 45.243362 | 45.24 |
| TMIN | integer | sd | .all | 8.771334 | 8.77 |
| TMIN | integer | p0 | .all | 7.000000 | 7 |
| TMIN | integer | p25 | .all | 39.000000 | 39 |
| TMIN | integer | p50 | .all | 45.000000 | 45 |
| TMIN | integer | p75 | .all | 52.000000 | 52 |
| TMIN | integer | p100 | .all | 71.000000 | 71 |
| TMIN | integer | hist | .all | NA | ▁▁▂▆▇▇▃▁ |
#look at the sumamry statistics of precipitation
kable(AllWeather %>%
select_if(is.numeric) %>%
skimr::skim(PRCP))
| variable | type | stat | level | value | formatted |
|---|---|---|---|---|---|
| PRCP | numeric | missing | .all | 3.000000e+00 | 3 |
| PRCP | numeric | complete | .all | 1.385700e+04 | 13857 |
| PRCP | numeric | n | .all | 1.386000e+04 | 13860 |
| PRCP | numeric | mean | .all | 1.051519e-01 | 0.11 |
| PRCP | numeric | sd | .all | 2.436655e-01 | 0.24 |
| PRCP | numeric | p0 | .all | 0.000000e+00 | 0 |
| PRCP | numeric | p25 | .all | 0.000000e+00 | 0 |
| PRCP | numeric | p50 | .all | 0.000000e+00 | 0 |
| PRCP | numeric | p75 | .all | 1.000000e-01 | 0.1 |
| PRCP | numeric | p100 | .all | 5.020000e+00 | 5.02 |
| PRCP | numeric | hist | .all | NA | ▇▁▁▁▁▁▁▁ |
#how many NA's are in the new table
sum(is.na(AllWeather))
## [1] 6
From the numberic summary statistics, we can see that AQI, TMAX, TMIN, and PCRP have little or no missing data, reasonable mins and maxs, and the histograms show that the data is normally distributed.
Now that we have our two datasets joined together, let’s take a look at the max and min temp ranges for each year with air quality.
#plot tmin and tmax against date - color green for good days and red for bad
AllWeather %>%
mutate(year = year(DATE), day_of_year = yday(DATE) ) %>%
ggplot(aes(x=day_of_year, ymin=TMIN, ymax=TMAX, colour=AQI)) +
geom_linerange(size=0.5) +
scale_colour_gradient2(low = "green", mid="yellow", high = "red", midpoint = 100) +
# 125 is the midpoint of the "Unhealthy for Sensitive Groups" level
# 100 is the transition point from "Moderate" to "Unhealthy for Sensitive Groups"
facet_grid( year ~ .) +
xlab("Day of Year") + ylab("Temp (F, min to max)") +
ggtitle("Maximum and Minimum Temperature with Air Quality from 1980 to 2017") +
theme_minimal()
From the plots we can see that air quality tends to be an issue in the summer time. The temperature range is pretty consistent throughout the year.
And now the precipitation by day for each year.
#plot precipation by year
AllWeather %>%
mutate(year = year(DATE), day_of_year = yday(DATE) ) %>%
ggplot(aes(x=day_of_year, y=PRCP)) +
geom_bar(stat="identity") +
facet_grid( year ~ .) +
xlab("Day of Year") + ylab("Precipitation (inches)") +
coord_cartesian(ylim = c(0, 1)) +
scale_y_continuous(breaks=c(0,0.5,1)) +
ggtitle("Precipitation from 1980 to 2017") +
theme_classic()
There are definite gaps in the precipitation each year which makes sense as the summers tend to be very dry. It is interesting to see how the dry weather gaps do not occur consistently at the same time of year.
So, now that we have visualized the weather and air quality in Seattle, we can begin to define what perfect weather is. We can select a “perfect” temperature range, let’s say only the days with TMIN and TMAX between 60 and 80 degrees F, and also days with no rain. We can assume that given this temperature range, the precipitation that falls will be rain rather than snow. One approach to selecting would be to filter the dataset for RAIN being FALSE, but this does not allow for any precipitation at all. Instead,let’s consider the PCRP variable which is inches of rain measured in a day. We can filter for precipitation less than a certain amount of rain in inches to account for the common occurence of rain over night and this approach will allow for sensitivity testing of edge perfect days. The data may be sensitive close to the ranges that we select, so we will want to examine the data near our range to see if there is a great difference in perfect days.
#let's first look at the number of days in the 60-80 degree range
SeattleWeather_TempRange6080 <- AllWeather %>%
filter(TMAX <= 80 & TMIN >= 60)
count(SeattleWeather_TempRange6080)
## # A tibble: 1 x 1
## n
## <int>
## 1 123
#now let's widen the range to 55-85 degree range
SeattleWeather_TempRange5585 <- AllWeather %>%
filter(TMAX <= 85 & TMIN >= 55)
count(SeattleWeather_TempRange5585)
## # A tibble: 1 x 1
## n
## <int>
## 1 1935
| Temp Range (F) | Days within Temp Range | Percentage |
|---|---|---|
| 60 - 80 | 123 | 0.89% |
| 55 - 85 | 1935 | 13.96% |
There’s a vast difference in the percentage of days between the temp range of 60 and 80 which is 0.89%, and between 55 and 85 which is 13.96%. We will go with the later because some people might be a little less sensitive to the temperature range and we can consider more possible perfect days.
Filtering for precipitation range:
#total number of days in the data set
nrow(AllWeather)
## [1] 13860
###how many days with rain less than 0.01
SeattleWeather_noRain0.01 <- AllWeather %>%
filter(PRCP <= 0.01)
count(SeattleWeather_noRain0.01)
## # A tibble: 1 x 1
## n
## <int>
## 1 8553
###how many days with rain less than 0.01
SeattleWeather_noRain0.03 <- AllWeather %>%
filter(PRCP <= 0.03)
count(SeattleWeather_noRain0.03)
## # A tibble: 1 x 1
## n
## <int>
## 1 9175
###how many days with rain less than 0.05
SeattleWeather_noRain0.05 <- AllWeather %>%
filter(PRCP <= 0.05)
count(SeattleWeather_noRain0.05)
## # A tibble: 1 x 1
## n
## <int>
## 1 9652
###or days with rain less than 0.1
SeattleWeather_noRain0.1 <- AllWeather %>%
filter(PRCP <= 0.1)
count(SeattleWeather_noRain0.1)
## # A tibble: 1 x 1
## n
## <int>
## 1 10495
| Rain Amount(In) | Days with Rain | Percentage |
|---|---|---|
| 0.01 | 8553 | 61.71% |
| 0.03 | 9175 | 66.2% |
| 0.05 | 9652 | 69.64% |
| 0.10 | 10495 | 75.72% |
The percentage of days with less than or equal to 0.01 inches of precipitation is 61.71%, 0.05 inches of precipitation is 69.64% while 0.1 is 75.72%. There is much less sensitivity to days within the precipitation range, so we will select 0.05 inches.
Next, we need to choose only the dates where the air quality is listed as Moderate or Good. We can use the Date and Category columns to filter for the moderate to good air quality days.
# Create a new feature on AllWeather if the day is "perfect" or not
AllWeather <- AllWeather %>%
mutate(PerfectDay = (TMAX <= 85 & TMIN >= 55 & PRCP <= 0.05 & AQI <= 100))
#filter the complete weather data set
PerfectWeather <- AllWeather %>% filter(PerfectDay)
str(PerfectWeather)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1534 obs. of 15 variables:
## $ DATE : Date, format: "1980-07-10" "1980-07-14" ...
## $ PRCP : num 0 0.03 0 0.01 0 0 0 0 0 0 ...
## $ TMAX : int 65 68 74 68 68 78 85 76 69 78 ...
## $ TMIN : int 55 55 55 56 57 60 60 59 57 56 ...
## $ RAIN : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
## $ State Name : chr "Washington" "Washington" "Washington" "Washington" ...
## $ county Name : chr "King" "King" "King" "King" ...
## $ State Code : int 53 53 53 53 53 53 53 53 53 53 ...
## $ County Code : chr "033" "033" "033" "033" ...
## $ AQI : int 56 64 81 64 94 94 75 38 69 56 ...
## $ Category : chr "Moderate" "Moderate" "Moderate" "Moderate" ...
## $ Defining Parameter : chr "SO2" "CO" "SO2" "CO" ...
## $ Defining Site : chr "53-033-0057" "53-033-0062" "53-033-0057" "53-033-0051" ...
## $ Number of Sites Reporting: int 11 9 10 9 12 12 10 11 11 11 ...
## $ PerfectDay : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Now we have one table containing 1534 days where the weather falls in our defined temperature range with little to no rain and moderate or good air quality.
Let’s take a look at the distribution of the perfect weather days.
#plot tmin and tmax against date - color green for good days and red for bad
PerfectWeather %>%
mutate(year = year(DATE), day_of_year = yday(DATE) ) %>%
ggplot(aes(x=day_of_year, ymin=TMIN, ymax=TMAX)) +
geom_linerange(size=0.5) +
facet_grid( year ~ .) +
xlab("Day of Year") + ylab("Temp (F, min to max)") +
ggtitle("Distribution of Perfect Weather Days from 1980 to 2017") +
theme_minimal()
We see right away that 1980 and 1981 have fewer perfect days. Everyone in the Pacific Northwest recalls the events of May 18, 1980 when Mt. St. Helens erupted causing extremely poor air quality and possibly affecting the weather of those summers.
Our null hypothesis is that the number of perfect days per year is not changing over time. We can also write this out in an equation form:
H0 = there is no difference in the number of perfect days over time. H1 = there is a difference in the number of perfect days over time.
We can use a linear model to evaluate whether there is a linear relationship between the number of perfect days per as a function of time. The equation for a linear model is \(y = {\beta}_0 + {\beta}_1*x\).
#summarise the number of perfect days by year
PerfectWeatherYears <- PerfectWeather %>%
group_by( year=year(DATE) ) %>%
count()
#linear regression model
weather_model <- lm(n ~ year, PerfectWeatherYears)
summary(weather_model)
##
## Call:
## lm(formula = n ~ year, data = PerfectWeatherYears)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5040 -7.0110 0.5756 8.6573 20.2064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1547.2353 314.8649 -4.914 1.95e-05 ***
## year 0.7944 0.1575 5.042 1.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.65 on 36 degrees of freedom
## Multiple R-squared: 0.4139, Adjusted R-squared: 0.3976
## F-statistic: 25.42 on 1 and 36 DF, p-value: 1.32e-05
#plot the number of perfect days as a function of time
plot(PerfectWeatherYears$n ~ PerfectWeatherYears$year, xlab = "Year", ylab = "Perfect Days / Year")
abline(weather_model)
#summary plot the model
plot(weather_model)
#histogram of the residuals
hist(weather_model$residuals)
Conditions for Regression:
So, we can see from the model coefficients that the number of perfect days is increasing by 0.79 days per year. The p-value for the model 1.310^{-5} or very close to zero and we can therefore reject our null hypothesis that the number of perfect weather days is not changing over time. The weather in the Pacific Northwest does appear to be changing over time with the number of days which fall into our definition of perfect weather increasing at 0.79 days per year. Essentially, the PNW’s reputation as a gloomy and dreary place to live is false as the weather is trending better every year.
Finally, let’s compare the data over the range of years to find how many of these perfect days coincide with weekends or when festivals and events are happening in the area?
How many perfect days fall on weekends? It’s a common myth here in the Seattle area that you know it’s the weekend because it starts raining. Is it true that it rains more on the weekends?
#filter perfect weather dates for weekend dates
temp_filterweekend <- PerfectWeather %>% filter( is.weekend(DATE) )
#number of weekend days that are pefect and the with proportion for perfect weather dates
nrow(temp_filterweekend)
## [1] 428
nrow(temp_filterweekend)/nrow(PerfectWeather)
## [1] 0.2790091
The number of perfect weather days which fall on weekends is 428 which is 27.9% of the perfect weather days. Obviously, the old saying that you know it’s the weekend because it starts raining is false as well.
We need data on events in the area which we can obtain from a local events website:
#read in the url and get data for each month
baseurl <- 'https://www.events12.com/seattle/'
months <- c('january','february','march','april','may','june',
'july','august','september','october','november','december')
dates <- c()
#get the dates for events from each month
for( month in months ) {
url <- paste0(baseurl,month,'/')
events_data <- read_html(url)
datestext <- events_data %>% html_nodes(".date") %>% html_text()
dates <- c(dates, datestext)
}
#create a dataframe of all the dates
ds <- c()
for( datetxt in dates ) {
date <- AsDate(datetxt, on.parse.failure="warn")
ds <- c(ds,as_date(date))
}
#create a dataframe and transform the dates to the same format as PerfectWeather
dates_df = as.data.frame(ds)
dates_df %<>% transmute(date = as_date(ds)) %>% distinct() %>%
filter(!is.na(date)) %>%
filter(year(date) > 2017 & year(date) < 2020) %>%
mutate(month = month(date), day = day(date), year = 2017 ) %>%
mutate(date = mdy( paste0(month,"-",day,"-",year) ) )
#how many event dates are there
nrow(dates_df)
## [1] 127
#create an events table of events dates that fall on perfect days
Seattle_events <- inner_join(PerfectWeather %>% filter(DATE >= "2017-01-01")
, dates_df, by = c("DATE" = "date"))
#number of events on perfect days
nrow(Seattle_events)
## [1] 22
kable(head(Seattle_events$DATE))
| x |
|---|
| 2017-06-01 |
| 2017-07-13 |
| 2017-07-14 |
| 2017-07-18 |
| 2017-07-20 |
| 2017-07-21 |
kable(tail(Seattle_events$DATE))
| x |
|---|
| 2017-08-18 |
| 2017-08-19 |
| 2017-08-30 |
| 2017-08-31 |
| 2017-09-08 |
| 2017-09-09 |
This data set consists of 127 observations of 4 different variables including date, month, day, year. The dates were reformatted in order to join the events dates with our weather data. Obtaining data for prior events proved to be difficult and so we are using the 2017 perfect weather dates to comare to the 2018 event dates. We find the 22 events dates coincide with perfect weather dates and that these dates occur between June 1 and September 30. This is what we would expect as most perfect days occur at or near summer.
Though there a many days in Seattle which are gloomy and dreary, the number of perfect weather days is increasing every year. These perfect weather days tend to occur more in the summer time when the rain dries up; also, thus far the air quality issues of summer wildfires have not greatly impacted the number of perfect summer days. The weekends also do not tend to be more rainy than weekdays in the summer time. So everyone who doubts that Seattle is a beautiful place to live because of the rainy weather should know that with every coming year there are more perfect days on average. This analysis may partially explain, along with economic factors not analyzed here, why the Seattle-metro area is a relocation destination as well as a summer time outdoor activity mecca.