If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
Temperature anomalies are defined relative to a reference, or base, period, which NASA states is 1951-1980.
Run the code below to load the file:
```r
weather <- read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
                    skip = 1,
                    na = "***")
```
Notice that, when using this function, we added two options: `skip` and `na`. The `skip = 1` option is there because the real data table only starts in row 2, so we need to skip one row. The `na = "***"` option informs R how missing observations in the spreadsheet are coded. When looking at the spreadsheet, you can see that missing data is coded as "***". It is best to specify this here, as otherwise some of the data is not recognized as numeric.

Once the data is loaded, notice that there is an object titled `weather` in the Environment panel. If you cannot see the panel (usually on the top right), go to Tools > Global Options > Pane Layout and tick the checkbox next to Environment. Click on the `weather` object, and the dataframe will pop up in a separate tab. Inspect the dataframe.
For each year and month, the dataframe shows the deviation of temperature from the expected (normal) value. Note also that the dataframe is in wide format.
You have two objectives in this section:
1. Select the year and the twelve month variables from the `weather` dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use the `select()` function.
1. Convert the dataframe from wide to long format. Hint: use the `gather()` or `pivot_longer()` function. Name the new dataframe `tidyweather`, name the variable containing the name of the month `month`, and name the temperature deviation values `delta`.
```r
tidyweather <- weather %>%
  select(1:13) %>%
  pivot_longer(cols = 2:13,
               names_to = "Month",
               values_to = "delta")
```
Inspect your dataframe. It should have three variables now: `Year`, `Month`, and `delta`.
Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called `date` in order to ensure that the `delta` values are plotted chronologically.
```r
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label = TRUE),
         year = year(date))
```
```r
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  theme_bw() +
  labs(title = "Weather Anomalies")
```
Is the effect of increasing temperature more pronounced in some months? Use `facet_wrap()` to produce a separate scatter plot for each month, again with a smoothing line. Your chart should have human-readable labels; that is, each month should be labeled "Jan", "Feb", "Mar" (full or abbreviated month names are fine), not `1`, `2`, `3`.
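Here is a minimal sketch of one way to do this, assuming the `tidyweather` dataframe built above; because `month` was created with `month(date, label = TRUE)`, the facet labels are already the abbreviated month names:

```r
# a sketch, not the only solution: facet the earlier scatter plot by month
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  facet_wrap(~month) +   # `month` is an ordered factor labeled "Jan", "Feb", ...
  theme_bw() +
  labs(title = "Weather Anomalies by Month")
```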
It is sometimes useful to group data into different time periods when studying historical data. For example, we often refer to decades such as the 1970s, 1980s, and 1990s to refer to a period of time. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called `comparison` that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1881 using `filter()`. Then, we use the `mutate()` function to create a new variable `interval` which contains information on which period each observation belongs to. We can assign the different periods using `case_when()`.
```r
comparison <- tidyweather %>%
  filter(Year >= 1881) %>%   # remove years prior to 1881
  # create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))
```
Inspect the `comparison` dataframe by clicking on it in the Environment pane.
Now that we have the `interval` variable, we can create a density plot to study the distribution of monthly deviations (`delta`), grouped by the different time periods we are interested in. Set `fill` to `interval` to group and colour the data by time period.
```r
ggplot(comparison, aes(x = delta, fill = interval)) +
  geom_density(alpha = 0.2) +   # density plot with transparency set to 20%
  theme_bw() +                  # theme
  labs(title = "Density Plot for Monthly Temperature Anomalies",
       y = "Density")           # changing y-axis label to sentence case
```
So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using `group_by()` and `summarise()`, followed by a scatter plot to display the result.
```r
# creating yearly averages
average_annual_anomaly <- tidyweather %>%
  group_by(Year) %>%   # grouping data by Year
  # creating summaries for mean delta
  # use `na.rm = TRUE` to eliminate NA (not available) values
  summarise(annual_average_delta = mean(delta, na.rm = TRUE))

# plotting the data:
ggplot(average_annual_anomaly, aes(x = Year, y = annual_average_delta)) +
  geom_point() +
  # fit the best-fit line, using the LOESS method
  geom_smooth() +
  # change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs(title = "Average Yearly Anomaly",
       y = "Average Annual Delta")
```
NASA points out on their website that:

> A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the `infer` package. Recall that the dataframe `comparison` has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
```r
formula_ci <- comparison %>%
  filter(Year >= 2011) %>%
  summarise(mean_averag_ann = mean(delta, na.rm = TRUE),
            sd_averag_ann = sd(delta, na.rm = TRUE),
            count = n(),
            # get t-critical value with (n-1) degrees of freedom
            t_critical = qt(0.975, count - 1),
            se = sd_averag_ann / sqrt(count),
            margin_of_error = t_critical * se,
            ci_low = mean_averag_ann - margin_of_error,
            ci_high = mean_averag_ann + margin_of_error)

formula_ci

## # A tibble: 1 x 8
##   mean_averag_ann sd_averag_ann count t_critical     se margin_of_error ci_low
##             <dbl>         <dbl> <int>      <dbl>  <dbl>           <dbl>  <dbl>
## 1            1.06         0.280   132       1.98 0.0244          0.0483   1.02
## # ... with 1 more variable: ci_high <dbl>
```
```r
# use the infer package to construct a 95% CI for delta
```
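A minimal sketch of the bootstrap approach is below; the seed, the 1,000 replicates, and the percentile method are assumptions you can vary:

```r
library(infer)

set.seed(1234)   # arbitrary seed, only so the simulation is reproducible

bootstrap_ci <- comparison %>%
  filter(Year >= 2011) %>%        # keep the 2011-present interval only
  filter(!is.na(delta)) %>%       # drop missing deltas before resampling
  specify(response = delta) %>%   # the variable we bootstrap
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%    # mean delta in each bootstrap sample
  get_confidence_interval(level = 0.95, type = "percentile")

bootstrap_ci
```

Compare the resulting interval with `formula_ci`; the two should be close.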
> What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!
Recall the `gapminder` data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join a few dataframes with more data than the gapminder package. Specifically, you will look at data on life expectancy, HIV prevalence, and a set of World Bank development indicators.
You must use the `wbstats` package to download data from the World Bank. The relevant World Bank indicators are `SP.DYN.TFRT.IN`, `SE.PRM.NENR`, `NY.GDP.PCAP.KD`, and `SH.DYN.MORT`.
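One possible way to download these indicators is sketched below; `wb_data()` is the package's main download function, but the date window here is an assumption, so adjust it to your needs:

```r
library(wbstats)

# download the four indicators listed above for all countries
worldbank_data <- wb_data(
  indicator = c("SP.DYN.TFRT.IN",   # fertility rate
                "SE.PRM.NENR",      # primary school enrolment (% net)
                "NY.GDP.PCAP.KD",   # GDP per capita (constant US$)
                "SH.DYN.MORT"),     # mortality rate, under 5
  start_date = 1960,                # an assumption; pick your own window
  end_date = 2016
)
```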
You have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform [join operations](http://r4ds.had.co.nz/relational-data.html). Think about which type of join makes the most sense **and explain why you chose it**.
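As a sketch only, assuming you have tidied each dataframe so that all three share `country` and `year` key columns (those key names, and the tidied `hiv` table, are assumptions that depend on your tidying):

```r
# left_join() keeps every row of the table we start from; whether a left,
# inner, or full join is best is exactly what you should argue in your answer
combined <- life_expectancy %>%
  left_join(worldbank_data, by = c("country", "year")) %>%
  left_join(hiv, by = c("country", "year"))   # `hiv` = your tidied HIV dataframe
```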
1. What is the relationship between HIV prevalence and life expectancy? Generate a scatterplot with a smoothing line to report your results. You may find faceting useful (see the plotting sketch after this list).
1. What is the relationship between fertility rate and GDP per capita? Generate a scatterplot with a smoothing line to report your results. You may find faceting by region useful.
1. Which regions have the most observations with missing HIV data? Generate a bar chart (`geom_col()`), in descending order.
1. How has the under-5 mortality rate changed by region? In each region, find the top 5 countries that have seen the greatest improvement, as well as the 5 countries where mortality rates have improved the least or even deteriorated.
1. Is there a relationship between primary school enrollment and fertility rate?
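As a hedged starting point for the first two questions, here is the scatterplot-plus-smoother pattern; `combined`, `hiv_prevalence`, `life_expectancy`, and `region` are hypothetical names that depend on how you tidied and joined your data:

```r
# a sketch, not the required solution: adjust names to match your joined data
ggplot(combined, aes(x = hiv_prevalence, y = life_expectancy)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  facet_wrap(~region) +   # faceting, as suggested above
  theme_bw() +
  labs(x = "HIV prevalence", y = "Life expectancy")
```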
# Challenge 1: Excess rentals in TfL bike sharing
Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following
```r
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-03-16T17%3A22%3A42/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210329%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210329T233055Z&X-Amz-Expires=300&X-Amz-Signature=acf4b177ea63fc09690e3f7bf59876a585dbc01fbfe9c66932191c489fa99864&X-Amz-SignedHeaders=host]
## Date: 2021-03-29 23:31
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 169 kB
## <ON DISK> C:\Users\USER\AppData\Local\Temp\Rtmp4skoIl\file418c35441004.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                    sheet = "Data",
                    range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>%
  clean_names() %>%
  rename(bikes_hired = number_of_bicycle_hires) %>%
  mutate(year = year(day),
         month = lubridate::month(day, label = TRUE),
         week = isoweek(day))
```
We can easily create a facet grid that plots bikes hired by month and year.
Look at May and June and compare 2020 with the previous years. What's happening?
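For instance, here is one hedged way to sketch such a facet grid; this is an illustration using the `bike` dataframe from above, not necessarily the exact chart in the handout:

```r
# distribution of daily hires, faceted by year (rows) and month (columns)
ggplot(bike, aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(rows = vars(year), cols = vars(month)) +
  theme_bw() +
  labs(x = "Bikes hired per day", y = "Density")
```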
However, the challenge I want you to work on is to reproduce the following two graphs.
The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to the second (weeks 14-26) and fourth (weeks 40-52) quarters.
For both of these graphs, you have to calculate the expected number of rentals per week or month between 2015-2019 and then see how each week/month of 2020 compares to the expected rentals. Think of the calculation `excess_rentals = actual_rentals - expected_rentals`.
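A hedged sketch of the monthly version of this calculation follows; it uses the mean as the expected-value measure, but whether the mean or the median is more appropriate is exactly the question below:

```r
# expected rentals per month, estimated from the 2015-2019 data
expected_monthly <- bike %>%
  filter(year >= 2015, year <= 2019) %>%
  group_by(month) %>%
  summarise(expected_rentals = mean(bikes_hired, na.rm = TRUE))

# compare each year/month since 2015 to its expected level
monthly_excess <- bike %>%
  filter(year >= 2015) %>%
  group_by(year, month) %>%
  summarise(actual_rentals = mean(bikes_hired, na.rm = TRUE),
            .groups = "drop") %>%
  left_join(expected_monthly, by = "month") %>%
  mutate(excess_rentals = actual_rentals - expected_rentals)
```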
Should you use the mean or the median to calculate your expected rentals? Why?
In creating your plots, you may find these links useful:
As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack, and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.