Turn in an HTML file and this R Markdown file after you have edited it.
The assignment uses both the original Madison weather data file, madison-weather.csv, and the official data, madison-weather-official.csv. You may certainly copy and reuse code from lecture as needed.
# Assumes the tidyverse and lubridate are loaded and that mw holds madison-weather.csv
# Shorten the station names, drop unused columns, and lowercase the variable names
mw = mw %>%
  mutate(NAME = case_when(
    NAME == "ARBORETUM UNIVERSITY WIS, WI US" ~ "Arboretum",
    NAME == "CHARMANY FARM, WI US" ~ "Charmany",
    NAME == "MADISON DANE CO REGIONAL AIRPORT, WI US" ~ "Airport",
    NAME == "MADISON WEATHER BUREAU CITY, WI US" ~ "Bureau",
    TRUE ~ NAME)) %>%
  select(-STATION, -ELEVATION, -starts_with("L")) %>%
  rename(name = NAME,
         date = DATE,
         prcp = PRCP,
         snow = SNOW,
         tmax = TMAX,
         tmin = TMIN)
### 1
Use the full Madison weather data set. Create a smaller data set that has each day where both the Madison Weather Bureau and the Airport have both tmax and tmin measured without missing data. Keep the date, name of the weather station, and these two temperatures, but eliminate the other variables and the observations from the other two weather stations. What are the earliest and latest dates, how many observations are there, and how many dates are missing?
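Answer: The earliest date in the data set is January 1, 1869, and the latest is December 31, 2019. There are 55144 observations, and 7 dates in that range are missing. Note that these counts describe the pivoted table before days with a missing temperature are removed; the NA counts in the output show how many such days remain for each station.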
#P1
problem_1 = mw %>%
  select(name, date, tmax, tmin) %>%
  filter(name == "Airport" | name == "Bureau") %>%
  pivot_wider(names_from = name, values_from = c("tmax", "tmin"))
p1_mw_sum = problem_1 %>%
  summarize(first = min(date),
            last = max(date),
            n = n(),
            missing_dates = (last - first + 1) - n,
            na_tmax_a = sum(is.na(tmax_Airport)),
            na_tmin_a = sum(is.na(tmin_Airport)),
            na_tmax_b = sum(is.na(tmax_Bureau)),
            na_tmin_b = sum(is.na(tmin_Bureau))
  ) %>%
  arrange(first)
p1_mw_sum %>% as.data.frame()
## first last n missing_dates na_tmax_a na_tmin_a na_tmax_b
## 1 1869-01-01 2019-12-31 55144 7 days 25832 25832 20785
## na_tmin_b
## 1 20788
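The NA counts above show that the pivoted table still contains days on which a station is missing a temperature. The following is a minimal sketch of one way to keep only the days where both stations report both temperatures. It runs on made-up toy data, and the names `toy`, `both_complete`, and `both_summary` are hypothetical, not part of the assignment.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same long layout as mw (values are made up)
toy = tibble(
  name = c("Airport", "Bureau", "Airport", "Bureau", "Airport", "Bureau"),
  date = as.Date(c("2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02",
                   "2000-01-04", "2000-01-04")),
  tmax = c(30, 31, NA, 28, 25, 27),
  tmin = c(20, 22, 15, 18, 10, 12)
)

both_complete = toy %>%
  filter(name %in% c("Airport", "Bureau")) %>%
  select(name, date, tmax, tmin) %>%
  pivot_wider(names_from = name, values_from = c(tmax, tmin)) %>%
  # drop any day where either station is missing either temperature
  drop_na()

both_summary = both_complete %>%
  summarize(first = min(date),
            last = max(date),
            n = n(),
            # calendar days in the range that have no complete row
            missing_dates = as.integer(last - first) + 1L - n)
```

In the toy data, January 2 is dropped because the Airport maximum is missing, and January 3 never appears at all, so both count as missing dates in the summary.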
### 2
Using the data set from the previous problem, calculate the difference (Airport minus Weather Bureau) of the maximum temperature recorded on each day and plot these differences with a histogram. Describe the notable features of this distribution. What are the mean and standard deviation? What are the median and lower and upper quartiles? What are the most extreme differences in each direction?
Answer: The distribution looks approximately normal and symmetric, with no apparent skew. The mean difference is 1.262134, the standard deviation is 2.954255, the median is 1, and the IQR is 3. There is an extreme value at -20, meaning the Weather Bureau recorded a maximum 20 degrees higher than the Airport on that day.
#P2
problem_2 = problem_1 %>%
  select(date, tmax_Airport, tmax_Bureau) %>%
  drop_na(tmax_Airport, tmax_Bureau) %>%
  mutate(dif_max = tmax_Airport - tmax_Bureau)
p2_mw_sum = problem_2 %>%
  summarise(avg = mean(dif_max),
            dev = sd(dif_max),
            med = median(dif_max),
            iqr = IQR(dif_max))
drop_na(problem_2, dif_max)
## # A tibble: 8,530 x 4
## date tmax_Airport tmax_Bureau dif_max
## <date> <dbl> <dbl> <dbl>
## 1 1939-10-01 67 67 0
## 2 1939-10-02 70 69 1
## 3 1939-10-03 74 74 0
## 4 1939-10-04 81 80 1
## 5 1939-10-05 70 75 -5
## 6 1939-10-06 77 73 4
## 7 1939-10-07 85 84 1
## 8 1939-10-08 79 82 -3
## 9 1939-10-09 75 75 0
## 10 1939-10-10 67 70 -3
## # ... with 8,520 more rows
ggplot(problem_2, aes(dif_max))+
geom_histogram(binwidth=1)
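The question also asks for the lower and upper quartiles and the most extreme differences in each direction, which the summary above does not report. Here is a minimal sketch of how those could be added, using made-up differences; `toy_dif` and `dif_summary` are hypothetical names.

```r
library(dplyr)

# Hypothetical differences (Airport minus Bureau); values are made up
toy_dif = tibble(dif_max = c(-5, -3, -3, 0, 0, 0, 1, 1, 4))

dif_summary = toy_dif %>%
  summarize(avg = mean(dif_max),
            dev = sd(dif_max),
            # lower quartile, median, and upper quartile
            q1 = quantile(dif_max, 0.25, names = FALSE),
            med = median(dif_max),
            q3 = quantile(dif_max, 0.75, names = FALSE),
            # most extreme difference in each direction
            lowest = min(dif_max),
            highest = max(dif_max))
```

The same columns could be computed for `dif_min` in the next problem by swapping the variable name.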
### 3
Repeat the previous problem with the daily minimum temperature.
Answer: The distribution appears skewed left, with a few straggling values on the right. The mean of the differences is -2.632739, the standard deviation is 3.801263, the median is -2, and the IQR is 5, so the minimum-temperature differences vary more than the maximum-temperature differences in the previous problem. There are a few extreme values to the right of the bulk of the distribution.
#P3
problem_3 = problem_1 %>%
  select(date, tmin_Airport, tmin_Bureau) %>%
  drop_na(tmin_Airport, tmin_Bureau) %>%
  mutate(dif_min = tmin_Airport - tmin_Bureau)
p3_mw_sum = problem_3 %>%
  summarise(avg = mean(dif_min),
            dev = sd(dif_min),
            med = median(dif_min),
            iqr = IQR(dif_min))
drop_na(problem_3, dif_min)
## # A tibble: 8,528 x 4
## date tmin_Airport tmin_Bureau dif_min
## <date> <dbl> <dbl> <dbl>
## 1 1939-10-01 33 35 -2
## 2 1939-10-02 38 42 -4
## 3 1939-10-03 48 47 1
## 4 1939-10-04 51 55 -4
## 5 1939-10-05 56 58 -2
## 6 1939-10-06 45 52 -7
## 7 1939-10-07 47 55 -8
## 8 1939-10-08 59 59 0
## 9 1939-10-09 58 59 -1
## 10 1939-10-10 44 44 0
## # ... with 8,518 more rows
ggplot(problem_3, aes(dif_min))+
geom_histogram(binwidth=1)
### 4
Using the official Madison weather data, use code from class to create a data set with the dates and snowfall amounts from January 1, 1884 to the present where you replace missing snowfall totals with zero if the precipitation is zero and drop dates where the snowfall measurements are missing. Find the total annual snowfall by year and plot this. Add a trend line. Describe any patterns. How has total snowfall been changing in Madison? About how much does the actual snowfall in a given year tend to vary from the trend?
Answer: Total annual snowfall appears roughly constant from the start of the data set to the end. It does seem to be getting more erratic from year to year, with the yearly totals falling further from the trend line as time goes on and large deviations becoming more frequent in recent years.
problem_4 = official %>%
  select(date, snow, prcp) %>%
  filter(date >= as.Date("1884-01-01")) %>%
  # a missing snowfall total becomes zero only when precipitation is zero
  mutate(snow = case_when(
    is.na(snow) & prcp == 0 ~ 0,
    TRUE ~ snow)) %>%
  # dates whose snowfall is still missing are dropped
  drop_na(snow) %>%
  # one row per year with the total snowfall for that year
  group_by(year = year(date)) %>%
  summarize(sum_snow = sum(snow))
ggplot(problem_4, aes(x = year, y = sum_snow)) +
  geom_point() +
  geom_smooth()
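The question of how much a given year tends to vary from the trend can be answered with a number rather than by eye. The sketch below assumes a straight-line trend fitted with `lm()`, which is a substitute for the loess curve that `geom_smooth()` draws by default for a data set of this size. The data are made up, and `toy_annual`, `fit`, `resid_sd`, and `typical_miss` are hypothetical names.

```r
# Hypothetical annual totals; values are made up for illustration
toy_annual = data.frame(
  year = 2001:2008,
  sum_snow = c(40, 55, 35, 60, 45, 70, 30, 50)
)

# Straight-line trend of annual snowfall against year (an assumption;
# the plotted smoother may be a curve instead)
fit = lm(sum_snow ~ year, data = toy_annual)

# Residuals are the yearly totals minus the fitted trend values
trend_resid = residuals(fit)

# Two summaries of the typical distance from the trend
resid_sd = sd(trend_resid)
typical_miss = mean(abs(trend_resid))
```

Either `resid_sd` or `typical_miss` gives a rough "about how much" figure in the same units as the snowfall totals.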
### 5
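For each winter season from the 1885-86 winter through the 2018-2019 winter, find the earliest and latest snowfall occurrences. Which winter has the record earliest snowfall and which has the record latest?
Answer: The latest snowfall recorded in the data occurred on April 29, and the earliest occurred on November 2. The year 1970 attached to both dates in the output is only the default year that make_datetime fills in when no year is supplied, so this computation does not identify which winters hold the records.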
problem_5 = official %>%
  mutate(year = year(date),
         year_2 = year + 1) %>%
  group_by(year, year_2) %>%
  filter(year > 1884) %>%
  mutate(winter_start = make_datetime(year = year, month = 11, day = 01),
         winter_end = make_datetime(year = year_2, month = 04, day = 30)) %>%
  filter(date > winter_start, snow != 0) %>%
  mutate(first_snow = first(date)) %>%
  summarise(earliest_by_year = min(first_snow))
problem_5_temp = official %>%
  mutate(year = year(date),
         year_2 = year + 1) %>%
  group_by(year, year_2) %>%
  filter(year > 1885) %>%
  mutate(winter_start = make_datetime(year = year, month = 11, day = 01),
         winter_end = make_datetime(year = year_2, month = 04, day = 30)) %>%
  filter(date + years(1) < winter_end, snow != 0) %>%
  mutate(last_snow = last(date)) %>%
  summarise(latest_by_year = max(last_snow))
p5_max_sum = problem_5_temp %>%
  ungroup() %>%
  mutate(latest_by_year = make_datetime(month = month(latest_by_year), day = day(latest_by_year))) %>%
  summarise(record_latest = max(latest_by_year))
p5_min_sum = problem_5 %>%
  ungroup() %>%
  mutate(earliest_by_year = make_datetime(month = month(earliest_by_year), day = day(earliest_by_year))) %>%
  summarise(record_earliest = min(earliest_by_year))
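A fixed November 1 to April 30 window cannot detect snowfall outside those dates, and collapsing dates onto a placeholder year loses track of which winter each record belongs to. One possible alternative, sketched here on made-up data, labels each date with the winter it falls in and compares dates by their position within that winter. The July 1 cutoff and the names `toy_snow`, `by_winter`, `record_earliest`, and `record_latest` are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical snowfall records; values are made up
toy_snow = tibble(
  date = as.Date(c("2000-10-20", "2000-12-05", "2001-04-10",
                   "2001-11-15", "2002-01-03", "2002-05-02")),
  snow = c(0.5, 3, 1, 2, 4, 0.2)
)

by_winter = toy_snow %>%
  filter(!is.na(snow), snow > 0) %>%
  # a winter is labeled by the year it starts in: July-December belong to
  # that year's winter, January-June to the previous year's winter
  mutate(winter = if_else(month(date) >= 7, year(date), year(date) - 1),
         # days since July 1 of the winter's starting year; spring dates
         # shift by one day in winters that contain February 29
         day_of_winter = as.integer(date - make_date(winter, 7, 1))) %>%
  group_by(winter) %>%
  summarize(earliest = min(date),
            latest = max(date),
            earliest_day = min(day_of_winter),
            latest_day = max(day_of_winter))

# Winters holding the record earliest and record latest snowfall
record_earliest = by_winter %>% slice_min(earliest_day, n = 1)
record_latest = by_winter %>% slice_max(latest_day, n = 1)
```

Because the winter label is carried through the summary, the result names the winter that holds each record as well as the date.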
### 6
Find the record snowfall amount for each day of the year and plot this data. Which day in January or February has the lowest record snowfall?
Answer: In January, the days with the lowest record snowfall are the 10th and the 22nd. In February, the lowest is the 29th, most likely because leap days occur only about once every four years and so have far fewer observations.
# Record (maximum) snowfall for each calendar day, one row per day
problem_6 = official %>%
  drop_na(snow) %>%
  group_by(month = month(date), day = day(date)) %>%
  summarize(record = max(snow)) %>%
  ungroup() %>%
  # 2000 is a leap year, so February 29 gets a valid plotting date
  mutate(record_date = make_date(year = 2000, month = month, day = day))
ggplot(problem_6, aes(record_date, record)) +
  geom_point()
# Days in January with the lowest record; all tied days are kept
p6_sum = problem_6 %>%
  filter(month == 1) %>%
  filter(record == min(record))
# Days in February with the lowest record; all tied days are kept
p6_sum_2 = problem_6 %>%
  filter(month == 2) %>%
  filter(record == min(record))