The goal is to test your software installation, to demonstrate
competency in Markdown, and in the basics of ggplot.
Please delete all the intro text I wrote from line 22 to line 69 and start writing your short biography after this blockquote.
Linkedin | Instagram | Twitter
gapminder country comparisonYou have seen the gapminder dataset that has data on
life expectancy, population, and GDP per capita for 142 countries from
1952 to 2007. To get a glimpse of the dataframe, namely to see the
variable names, variable types, etc., we use the glimpse
function. We also want to have a look at the first 20 rows of data.
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
## # A tibble: 20 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## 11 Afghanistan Asia 2002 42.1 25268405 727.
## 12 Afghanistan Asia 2007 43.8 31889923 975.
## 13 Albania Europe 1952 55.2 1282697 1601.
## 14 Albania Europe 1957 59.3 1476505 1942.
## 15 Albania Europe 1962 64.8 1728137 2313.
## 16 Albania Europe 1967 66.2 1984060 2760.
## 17 Albania Europe 1972 67.7 2263554 3313.
## 18 Albania Europe 1977 68.9 2509048 3533.
## 19 Albania Europe 1982 70.4 2780097 3631.
## 20 Albania Europe 1987 72 3075321 3739.
Your task is to produce two graphs of how life expectancy has changed
over the years for the country and the
continent you come from.
I have created the country_data and
continent_data with the code below.
country_data <- gapminder %>%
filter(country == "Greece") # just choosing Greece, as this is where I come from
continent_data <- gapminder %>%
filter(continent == "Europe")First, create a plot of life expectancy over time for the single
country you chose. Map year on the x-axis, and
lifeExp on the y-axis. You should also use
geom_point() to see the actual data points and
geom_smooth(se = FALSE) to plot the underlying trendlines.
You need to remove the comments # from the lines below
for your code to run.
# plot1 <- ggplot(data = ??, mapping = aes(x = ??, y = ??))+
# geom_??() +
# geom_smooth(se = FALSE)+
# NULL
# plot1Next we need to add a title. Create a new plot, or extend plot1,
using the labs() function to add an informative title to
the plot.
Secondly, produce a plot for all countries in the continent
you come from. (Hint: map the country variable to the
colour aesthetic. You also want to map country to the
group aesthetic, so all points for each country are grouped
together).
# ggplot(gapminder, mapping = aes(x = , y = , colour= , group =))+
# geom_?? +
# geom_smooth(se = FALSE) +
# NULLFinally, using the original gapminder data, produce a
life expectancy over time graph, grouped (or faceted) by continent. We
will remove all legends, adding the
theme(legend.position="none") in the end of our ggplot.
# ggplot(data = gapminder , mapping = aes(x = , y = , colour= ))+
# geom_??? +
# geom_smooth(se = FALSE) +
# facet_wrap(~continent) +
# theme(legend.position="none") + #remove all legends
# NULLGiven these trends, what can you say about life expectancy since 1952? Again, don’t just say what’s happening in the graph. Tell some sort of story and speculate about the differences in the patterns.
Type your answer after this blockquote.
country_data <- gapminder %>%
filter(country == "Spain")
ggplot(country_data, aes(x= year, y= lifeExp)) +
geom_point() + geom_smooth(se = FALSE) +
ggtitle ("Life expectancy in Spain over time")## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
continent_data <- gapminder %>%
filter(continent == "Europe")
ggplot(continent_data, aes(x= year, y= lifeExp, colour= country, group = country)) + geom_point() + geom_smooth(se = FALSE) + ggtitle ("Life expectancy in Europe over time")## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(gapminder, aes(x = year, y = lifeExp, colour = continent)) + geom_smooth(se = FALSE) + geom_point() + facet_wrap(~continent) + theme(legend.position="none") + ggtitle ("Development of life expectancy over the world, by continent")## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In my opinion, the primary factor accounting for variations in the growth rate of life expectancy is the availability and integration of technology. Advanced economies like those in Europe, which have consistently embraced and incorporated technology, have witnessed a steady rise in life expectancy. Conversely, in economies where technology’s influence has been more superficial, such as in parts of Africa, life expectancy has remained largely unchanged. Lastly, in economies where technology implementation has been less methodical, like in certain regions of Asia, there has been a rapid short-term growth in life expectancy.
The London Fire Brigade attends a range of non-fire incidents (which we call ‘special services’). These ‘special services’ include assistance to animals that may be trapped or in distress. The data is provided from January 2009 and is updated monthly. A range of information is supplied for each incident including some location information (postcode, borough, ward), as well as the data/time of the incidents. We do not routinely record data about animal deaths or injuries.
Please note that any cost included is a notional cost calculated based on the length of time rounded up to the nearest hour spent by Pump, Aerial and FRU appliances at the incident and charged at the current Brigade hourly rate.
url <- "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv"
animal_rescue <- read_csv(url,
locale = locale(encoding = "CP1252")) %>%
#use janitor::clean_names() to clean names
janitor::clean_names()
# quick look at the dataframe- how many rows- columns, type of variables (characters, numbers, etc )
glimpse(animal_rescue)## Rows: 9,853
## Columns: 31
## $ incident_number <chr> "139091", "275091", "2075091", "2872091"…
## $ date_time_of_call <chr> "01/01/2009 03:01", "01/01/2009 08:51", …
## $ cal_year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009…
## $ fin_year <chr> "2008/09", "2008/09", "2008/09", "2008/0…
## $ type_of_incident <chr> "Special Service", "Special Service", "S…
## $ pump_count <chr> "1", "1", "1", "1", "1", "1", "1", "1", …
## $ pump_hours_total <chr> "2", "1", "1", "1", "1", "1", "1", "1", …
## $ hourly_notional_cost_a <dbl> 255, 255, 255, 255, 255, 255, 255, 255, …
## $ incident_notional_cost_a <chr> "510", "255", "255", "255", "255", "255"…
## $ final_description <chr> "Redacted", "Redacted", "Redacted", "Red…
## $ animal_group_parent <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", …
## $ originof_call <chr> "Person (land line)", "Person (land line…
## $ property_type <chr> "House - single occupancy", "Railings", …
## $ property_category <chr> "Dwelling", "Outdoor Structure", "Outdoo…
## $ special_service_type_category <chr> "Other animal assistance", "Other animal…
## $ special_service_type <chr> "Animal assistance involving livestock -…
## $ ward_code <chr> "E05011467", "E05000169", "E05013756", "…
## $ ward <chr> "Crystal Palace & Upper Norwood", "Woods…
## $ borough_code <chr> "E09000008", "E09000008", "E09000029", "…
## $ borough <chr> "Croydon", "Croydon", "Sutton", "Hilling…
## $ stn_ground_name <chr> "Norbury", "Woodside", "Wallington", "Ru…
## $ uprn <chr> "NULL", "NULL", "NULL", "1.00021E+11", "…
## $ street <chr> "Waddington Way", "Grasmere Road", "Mill…
## $ usrn <chr> "20500146", "NULL", "NULL", "21401484", …
## $ postcode_district <chr> "SE19", "SE25", "SM5", "UB9", "RM3", "RM…
## $ easting_m <chr> "NULL", "534785", "528041", "504689", "N…
## $ northing_m <chr> "NULL", "167546", "164923", "190685", "N…
## $ easting_rounded <dbl> 532350, 534750, 528050, 504650, 554650, …
## $ northing_rounded <dbl> 170050, 167550, 164950, 190650, 192350, …
## $ latitude <chr> "NULL", "51.39095371", "51.36894086", "5…
## $ longitude <chr> "NULL", "-0.064166887", "-0.161985191", …
One of the more useful things one can do with any data set is quick
counts, namely to see how many observations fall within one category.
For instance, if we wanted to count the number of incidents by year, we
would either use group_by()... summarise() or, simply count()
## # A tibble: 15 × 2
## cal_year count
## <dbl> <int>
## 1 2009 568
## 2 2010 611
## 3 2011 620
## 4 2012 603
## 5 2013 585
## 6 2014 583
## 7 2015 540
## 8 2016 604
## 9 2017 539
## 10 2018 610
## 11 2019 604
## 12 2020 758
## 13 2021 885
## 14 2022 1029
## 15 2023 714
## # A tibble: 15 × 2
## cal_year count
## <dbl> <int>
## 1 2009 568
## 2 2010 611
## 3 2011 620
## 4 2012 603
## 5 2013 585
## 6 2014 583
## 7 2015 540
## 8 2016 604
## 9 2017 539
## 10 2018 610
## 11 2019 604
## 12 2020 758
## 13 2021 885
## 14 2022 1029
## 15 2023 714
Once we count() how many incidents we have per year, we
can pipe %>% the table to a ggplot and draw a simple
time series chart.
animal_rescue %>%
count(cal_year, name="count") %>%
# we dont have all the data for 2023, so let us filter it out
filter(cal_year < 2023) %>%
# the result of count() is a dataframe, so we pass it to
ggplot() +
# map year (cal_year) on the x-axis, count on the y-axis
aes( x = cal_year,
y = count)+
# we just want a time-series, line graph
geom_line()+
# also add the points to make graph easier to read
geom_point()+
# make sure y-axis starts at zero
expand_limits(y = 0)+
# add labels
labs(
title = "Animal rescue incidents have almost doubled post Covid-19",
subtitle = "Animal rescue incidents attended by the LBF",
x = NULL,
y = NULL,
caption = "Source: https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb") +
theme_minimal() +
# change the theme, so title is left-aligned
theme(plot.title.position = "plot") +
# add one final layer of NULL, so if you comment out any lines
# you never end up with a hanging `+` that awaits another ggplot layer
NULLIncidents over time.
Let us try to see how many incidents we have by animal group. Again, we can do this either using group_by() and summarise(), or by using count()
animal_rescue %>%
group_by(animal_group_parent) %>%
#group_by and summarise will produce a new column with the count in each animal group
summarise(count = n()) %>%
# mutate adds a new column; here we calculate the percentage
mutate(percent = round(100*count/sum(count),2)) %>%
# arrange() sorts the data by percent. Since the default sorting is min to max and we would like to see it sorted
# in descending order (max to min), we use arrange(desc())
arrange(desc(percent))## # A tibble: 28 × 3
## animal_group_parent count percent
## <chr> <int> <dbl>
## 1 Cat 4866 49.4
## 2 Bird 2001 20.3
## 3 Dog 1455 14.8
## 4 Fox 534 5.42
## 5 Unknown - Domestic Animal Or Pet 227 2.3
## 6 Horse 213 2.16
## 7 Deer 167 1.69
## 8 Unknown - Wild Animal 113 1.15
## 9 Squirrel 85 0.86
## 10 Unknown - Heavy Livestock Animal 51 0.52
## # ℹ 18 more rows
animal_rescue %>%
#count does the same thing as group_by and summarise
# name = "count" will call the column with the counts "count" ( exciting, I know)
# and 'sort=TRUE' will sort them from max to min
count(animal_group_parent, name="count", sort=TRUE) %>%
mutate(percent = round(100*count/sum(count),2))## # A tibble: 28 × 3
## animal_group_parent count percent
## <chr> <int> <dbl>
## 1 Cat 4866 49.4
## 2 Bird 2001 20.3
## 3 Dog 1455 14.8
## 4 Fox 534 5.42
## 5 Unknown - Domestic Animal Or Pet 227 2.3
## 6 Horse 213 2.16
## 7 Deer 167 1.69
## 8 Unknown - Wild Animal 113 1.15
## 9 Squirrel 85 0.86
## 10 Unknown - Heavy Livestock Animal 51 0.52
## # ℹ 18 more rows
Do you see anything strange in these tables?
Finally, let us have a loot at the notional cost for rescuing each of these animals. As the LFB says,
Please note that any cost included is a notional cost calculated based on the length of time rounded up to the nearest hour spent by Pump, Aerial and FRU appliances at the incident and charged at the current Brigade hourly rate.
There is two things we will do:
incident_notional_cost_a
for each animal_group_parentincident_notional_cost_a by
animal_group_parent.Before we go on, however, we need to fix
incident_notional_cost_a as it is stored as a
chr, or character, rather than a number.
# what type is variable incident_notional_cost from dataframe `animal_rescue`
typeof(animal_rescue$incident_notional_cost_a)## [1] "character"
# readr::parse_number() will convert any numerical values stored as characters into numbers
animal_rescue <- animal_rescue %>%
# we use mutate() to use the parse_number() function and overwrite the same variable
mutate(incident_notional_cost_a = parse_number(incident_notional_cost_a))
# incident_notional_cost from dataframe `animal_rescue` is now 'double' or numeric
typeof(animal_rescue$incident_notional_cost_a)## [1] "double"
Now that incident_notional_cost_a is numeric, let us
quickly calculate summary statistics for each animal group.
animal_rescue %>%
# group by animal_group_parent
group_by(animal_group_parent) %>%
# filter resulting data, so each group has at least 6 observations
filter(n()>6) %>%
# summarise() will collapse all values into 3 values: the mean, median, and count
# we use na.rm=TRUE to make sure we remove any NAs, or cases where we do not have the incident cos
summarise(mean_incident_cost = mean (incident_notional_cost_a, na.rm=TRUE),
median_incident_cost = median (incident_notional_cost_a, na.rm=TRUE),
sd_incident_cost = sd (incident_notional_cost_a, na.rm=TRUE),
min_incident_cost = min (incident_notional_cost_a, na.rm=TRUE),
max_incident_cost = max (incident_notional_cost_a, na.rm=TRUE),
count = n()) %>%
# sort the resulting data in descending order. You choose whether to sort by count or mean cost.
arrange(desc(count))## # A tibble: 17 × 7
## animal_group_parent mean_incident_cost median_incident_cost sd_incident_cost
## <chr> <dbl> <dbl> <dbl>
## 1 Cat 360. 333 164.
## 2 Bird 356. 333 138.
## 3 Dog 362. 326 202.
## 4 Fox 386. 346 191.
## 5 Unknown - Domestic … 334. 298 116.
## 6 Horse 734. 596 527.
## 7 Deer 419. 346 265.
## 8 Unknown - Wild Anim… 418. 339 302.
## 9 Squirrel 324. 333 55.1
## 10 Unknown - Heavy Liv… 374. 260 260.
## 11 cat 353. 336 110.
## 12 Hamster 326. 310. 88.1
## 13 Snake 375. 352 122.
## 14 Rabbit 325. 333 40.0
## 15 Ferret 314. 336 41.1
## 16 Cow 599. 436 451.
## 17 Sheep 355. 339 114.
## # ℹ 3 more variables: min_incident_cost <dbl>, max_incident_cost <dbl>,
## # count <int>
Compare the mean and the median for each animal group. What do you think this is telling us? Anything else that stands out? Any outliers?
Finally, let us plot a few plots that show the distribution of incident_cost for each animal group.
# base_plot
base_plot <- animal_rescue %>%
group_by(animal_group_parent) %>%
filter(n()>6) %>%
ggplot(aes(x=incident_notional_cost_a))+
facet_wrap(~animal_group_parent, scales = "free")+
theme_bw()
base_plot + geom_histogram()Which of these four graphs do you think best communicates the
variability of the incident_notional_cost_a values? Also,
can you please tell some sort of story (which animals are more expensive
to rescue than others, the spread of values) and speculate about the
differences in the patterns.
Using LFB’s incident_notional_cost_a, plot a line graph
showing the total incident notional cost between 2009 - 2022.
Knit the completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
If you want to, please answer the following