The goal is to test your software installation, to demonstrate competency in Markdown, and in the basics of ggplot.

Please delete all the intro text I wrote from line 22 to line 69 and start writing your short biography after this blockquote.

Sebastian Oliver’s biography

LinkedIn | Instagram | Twitter

Education

  • The University of Warwick
  • London Business School

Employers

  • Oxbow Partners
  • Razor Group
  • Stealth

Hobbies

  • Football
    • RCD Mallorca
    • Real Madrid
  • Sailing
    • Laser pico
    • Minicat
  • Travelling
    • Southeast Asia
    • South America
  • Poker

Task 2: gapminder country comparison

You have seen the gapminder dataset that has data on life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007. To get a glimpse of the dataframe, namely to see the variable names, variable types, etc., we use the glimpse() function. We also want to have a look at the first 20 rows of data.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
head(gapminder, 20) # look at the first 20 rows of the dataframe
## # A tibble: 20 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
## 13 Albania     Europe     1952    55.2  1282697     1601.
## 14 Albania     Europe     1957    59.3  1476505     1942.
## 15 Albania     Europe     1962    64.8  1728137     2313.
## 16 Albania     Europe     1967    66.2  1984060     2760.
## 17 Albania     Europe     1972    67.7  2263554     3313.
## 18 Albania     Europe     1977    68.9  2509048     3533.
## 19 Albania     Europe     1982    70.4  2780097     3631.
## 20 Albania     Europe     1987    72    3075321     3739.

Your task is to produce two graphs of how life expectancy has changed over the years for the country and the continent you come from.

I have created the country_data and continent_data with the code below.

country_data <- gapminder %>% 
  filter(country == "Greece") # just choosing Greece, as this is where I come from

continent_data <- gapminder %>% 
  filter(continent == "Europe")

First, create a plot of life expectancy over time for the single country you chose. Map year on the x-axis and lifeExp on the y-axis. You should also use geom_point() to see the actual data points and geom_smooth(se = FALSE) to plot the underlying trendlines. You need to remove the comment marks (#) from the lines below for your code to run.

# plot1 <- ggplot(data = ??, mapping = aes(x = ??, y = ??))+
#   geom_??() +
#   geom_smooth(se = FALSE)+
#   NULL 

# plot1

Next we need to add a title. Create a new plot, or extend plot1, using the labs() function to add an informative title to the plot.

# plot1<- plot1 +
#   labs(title = " ",
#       x = " ",
#       y = " ") +
#   NULL


# plot1

Secondly, produce a plot for all countries in the continent you come from. (Hint: map the country variable to the colour aesthetic. You also want to map country to the group aesthetic, so all points for each country are grouped together).

# ggplot(gapminder, mapping = aes(x =  , y =  , colour= , group =))+
#   geom_?? + 
#   geom_smooth(se = FALSE) +
#   NULL

Finally, using the original gapminder data, produce a life expectancy over time graph, grouped (or faceted) by continent. We will remove all legends by adding theme(legend.position="none") at the end of our ggplot.

# ggplot(data = gapminder , mapping = aes(x =  , y =  , colour= ))+
#   geom_??? + 
#   geom_smooth(se = FALSE) +
#   facet_wrap(~continent) +
#   theme(legend.position="none") + #remove all legends
#   NULL

Given these trends, what can you say about life expectancy since 1952? Again, don’t just say what’s happening in the graph. Tell some sort of story and speculate about the differences in the patterns.

Type your answer after this blockquote.

Task 2

Part 1

country_data <- gapminder %>%
    filter(country == "Spain")

ggplot(country_data, aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(se = FALSE) +
  ggtitle("Life expectancy in Spain over time")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Part 2

continent_data <- gapminder %>%
  filter(continent == "Europe")

ggplot(continent_data, aes(x = year, y = lifeExp, colour = country, group = country)) + 
  geom_point() + 
  geom_smooth(se = FALSE) + 
  ggtitle("Life expectancy in Europe over time")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Part 3

ggplot(gapminder, aes(x = year, y = lifeExp, colour = continent)) + 
  geom_smooth(se = FALSE) + 
  geom_point() + 
  facet_wrap(~continent) + 
  theme(legend.position = "none") + 
  ggtitle("Development of life expectancy over the world, by continent")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Graph interpretation

In my opinion, the primary factor accounting for variations in the growth rate of life expectancy is the availability and integration of technology. Advanced economies like those in Europe, which have consistently embraced and incorporated technology, have witnessed a steady rise in life expectancy. Conversely, in economies where technology’s influence has been more superficial, such as in parts of Africa, life expectancy has remained largely unchanged. Lastly, in economies where technology implementation has been less methodical, like in certain regions of Asia, there has been a rapid short-term growth in life expectancy.

Task 3: Animal rescue incidents attended by the London Fire Brigade

The London Fire Brigade attends a range of non-fire incidents (which we call ‘special services’). These ‘special services’ include assistance to animals that may be trapped or in distress. The data is provided from January 2009 and is updated monthly. A range of information is supplied for each incident, including some location information (postcode, borough, ward) as well as the date/time of the incidents. We do not routinely record data about animal deaths or injuries.

Please note that any cost included is a notional cost calculated based on the length of time rounded up to the nearest hour spent by Pump, Aerial and FRU appliances at the incident and charged at the current Brigade hourly rate.

url <- "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/01007433-55c2-4b8a-b799-626d9e3bc284/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv"

animal_rescue <- read_csv(url,
                          locale = locale(encoding = "CP1252")) %>% 
  
  #use janitor::clean_names() to clean names
  janitor::clean_names()

# quick look at the dataframe: how many rows and columns, type of variables (characters, numbers, etc.)
glimpse(animal_rescue)
## Rows: 9,853
## Columns: 31
## $ incident_number               <chr> "139091", "275091", "2075091", "2872091"…
## $ date_time_of_call             <chr> "01/01/2009 03:01", "01/01/2009 08:51", …
## $ cal_year                      <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009…
## $ fin_year                      <chr> "2008/09", "2008/09", "2008/09", "2008/0…
## $ type_of_incident              <chr> "Special Service", "Special Service", "S…
## $ pump_count                    <chr> "1", "1", "1", "1", "1", "1", "1", "1", …
## $ pump_hours_total              <chr> "2", "1", "1", "1", "1", "1", "1", "1", …
## $ hourly_notional_cost_a        <dbl> 255, 255, 255, 255, 255, 255, 255, 255, …
## $ incident_notional_cost_a      <chr> "510", "255", "255", "255", "255", "255"…
## $ final_description             <chr> "Redacted", "Redacted", "Redacted", "Red…
## $ animal_group_parent           <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", …
## $ originof_call                 <chr> "Person (land line)", "Person (land line…
## $ property_type                 <chr> "House - single occupancy", "Railings", …
## $ property_category             <chr> "Dwelling", "Outdoor Structure", "Outdoo…
## $ special_service_type_category <chr> "Other animal assistance", "Other animal…
## $ special_service_type          <chr> "Animal assistance involving livestock -…
## $ ward_code                     <chr> "E05011467", "E05000169", "E05013756", "…
## $ ward                          <chr> "Crystal Palace & Upper Norwood", "Woods…
## $ borough_code                  <chr> "E09000008", "E09000008", "E09000029", "…
## $ borough                       <chr> "Croydon", "Croydon", "Sutton", "Hilling…
## $ stn_ground_name               <chr> "Norbury", "Woodside", "Wallington", "Ru…
## $ uprn                          <chr> "NULL", "NULL", "NULL", "1.00021E+11", "…
## $ street                        <chr> "Waddington Way", "Grasmere Road", "Mill…
## $ usrn                          <chr> "20500146", "NULL", "NULL", "21401484", …
## $ postcode_district             <chr> "SE19", "SE25", "SM5", "UB9", "RM3", "RM…
## $ easting_m                     <chr> "NULL", "534785", "528041", "504689", "N…
## $ northing_m                    <chr> "NULL", "167546", "164923", "190685", "N…
## $ easting_rounded               <dbl> 532350, 534750, 528050, 504650, 554650, …
## $ northing_rounded              <dbl> 170050, 167550, 164950, 190650, 192350, …
## $ latitude                      <chr> "NULL", "51.39095371", "51.36894086", "5…
## $ longitude                     <chr> "NULL", "-0.064166887", "-0.161985191", …

One of the more useful things one can do with any dataset is quick counts, namely to see how many observations fall within each category. For instance, if we wanted to count the number of incidents by year, we would either use group_by() and summarise(), or simply count().

animal_rescue %>% 
  dplyr::group_by(cal_year) %>% 
  summarise(count=n())
## # A tibble: 15 × 2
##    cal_year count
##       <dbl> <int>
##  1     2009   568
##  2     2010   611
##  3     2011   620
##  4     2012   603
##  5     2013   585
##  6     2014   583
##  7     2015   540
##  8     2016   604
##  9     2017   539
## 10     2018   610
## 11     2019   604
## 12     2020   758
## 13     2021   885
## 14     2022  1029
## 15     2023   714
animal_rescue %>% 
  count(cal_year, name="count")
## # A tibble: 15 × 2
##    cal_year count
##       <dbl> <int>
##  1     2009   568
##  2     2010   611
##  3     2011   620
##  4     2012   603
##  5     2013   585
##  6     2014   583
##  7     2015   540
##  8     2016   604
##  9     2017   539
## 10     2018   610
## 11     2019   604
## 12     2020   758
## 13     2021   885
## 14     2022  1029
## 15     2023   714

Once we count() how many incidents we have per year, we can pipe (%>%) the table into ggplot() and draw a simple time-series chart.

animal_rescue %>% 
  count(cal_year, name="count") %>% 
  
  # we don't have all the data for 2023, so let us filter it out
  filter(cal_year < 2023) %>% 
  
  # the result of count() is a dataframe, so we pass it straight to ggplot()
  ggplot() + 
  
  # map year (cal_year) on the x-axis, count on the y-axis
  aes( x = cal_year,
       y = count)+
  
  # we just want a time-series, line graph
  geom_line()+
  
  # also add the points to make graph easier to read
  geom_point()+
  
  # make sure y-axis starts at zero
  expand_limits(y = 0)+
  
  # add labels
  labs(
    title = "Animal rescue incidents have almost doubled post Covid-19",
    subtitle = "Animal rescue incidents attended by the LBF",
    x = NULL,
    y = NULL,
    caption = "Source: https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb") +
  
  theme_minimal() + 
  
  # change the theme, so title is left-aligned
  theme(plot.title.position = "plot") +
  
  # add one final layer of NULL, so if you comment out any lines
  # you never end up with a hanging `+` that awaits another ggplot layer
  NULL
Incidents over time.

Let us try to see how many incidents we have by animal group. Again, we can do this either using group_by() and summarise(), or by using count()

animal_rescue %>% 
  group_by(animal_group_parent) %>% 
  
  #group_by and summarise will produce a new column with the count in each animal group
  summarise(count = n()) %>% 
  
  # mutate adds a new column; here we calculate the percentage
  mutate(percent = round(100*count/sum(count),2)) %>% 
  
  # arrange() sorts the data by percent. Since the default sorting is min to max and we would like to see it sorted
  # in descending order (max to min), we use arrange(desc()) 
  arrange(desc(percent))
## # A tibble: 28 × 3
##    animal_group_parent              count percent
##    <chr>                            <int>   <dbl>
##  1 Cat                               4866   49.4 
##  2 Bird                              2001   20.3 
##  3 Dog                               1455   14.8 
##  4 Fox                                534    5.42
##  5 Unknown - Domestic Animal Or Pet   227    2.3 
##  6 Horse                              213    2.16
##  7 Deer                               167    1.69
##  8 Unknown - Wild Animal              113    1.15
##  9 Squirrel                            85    0.86
## 10 Unknown - Heavy Livestock Animal    51    0.52
## # ℹ 18 more rows
animal_rescue %>% 
  
  #count does the same thing as group_by and summarise
  # name = "count" will call the column with the counts "count" ( exciting, I know)
  # and 'sort=TRUE' will sort them from max to min
  count(animal_group_parent, name="count", sort=TRUE) %>% 
  mutate(percent = round(100*count/sum(count),2))
## # A tibble: 28 × 3
##    animal_group_parent              count percent
##    <chr>                            <int>   <dbl>
##  1 Cat                               4866   49.4 
##  2 Bird                              2001   20.3 
##  3 Dog                               1455   14.8 
##  4 Fox                                534    5.42
##  5 Unknown - Domestic Animal Or Pet   227    2.3 
##  6 Horse                              213    2.16
##  7 Deer                               167    1.69
##  8 Unknown - Wild Animal              113    1.15
##  9 Squirrel                            85    0.86
## 10 Unknown - Heavy Livestock Animal    51    0.52
## # ℹ 18 more rows

Do you see anything strange in these tables?

Finally, let us have a look at the notional cost for rescuing each of these animals. As the LFB says,

Please note that any cost included is a notional cost calculated based on the length of time rounded up to the nearest hour spent by Pump, Aerial and FRU appliances at the incident and charged at the current Brigade hourly rate.

There are two things we will do:

  1. Calculate the mean and median incident_notional_cost_a for each animal_group_parent
  2. Plot a boxplot to get a feel for the distribution of incident_notional_cost_a by animal_group_parent.

Before we go on, however, we need to fix incident_notional_cost_a as it is stored as a chr, or character, rather than a number.

# what type is the variable incident_notional_cost_a from dataframe `animal_rescue`?
typeof(animal_rescue$incident_notional_cost_a)
## [1] "character"
# readr::parse_number() will convert any numerical values stored as characters into numbers
animal_rescue <- animal_rescue %>% 

  # we use mutate() to use the parse_number() function and overwrite the same variable
  mutate(incident_notional_cost_a = parse_number(incident_notional_cost_a))

# incident_notional_cost_a from dataframe `animal_rescue` is now 'double', or numeric
typeof(animal_rescue$incident_notional_cost_a)
## [1] "double"

Now that incident_notional_cost_a is numeric, let us quickly calculate summary statistics for each animal group.

animal_rescue %>% 
  
  # group by animal_group_parent
  group_by(animal_group_parent) %>% 
  
  # filter resulting data, so each group has more than 6 observations
  filter(n()>6) %>% 
  
  # summarise() will collapse each group into summary statistics: the mean, median, SD, min, max, and count  
  # we use na.rm=TRUE to make sure we remove any NAs, or cases where we do not have the incident cost
  summarise(mean_incident_cost = mean(incident_notional_cost_a, na.rm=TRUE),
            median_incident_cost = median(incident_notional_cost_a, na.rm=TRUE),
            sd_incident_cost = sd(incident_notional_cost_a, na.rm=TRUE),
            min_incident_cost = min(incident_notional_cost_a, na.rm=TRUE),
            max_incident_cost = max(incident_notional_cost_a, na.rm=TRUE),
            count = n()) %>% 
  
  # sort the resulting data in descending order. You choose whether to sort by count or mean cost.
  arrange(desc(count))
## # A tibble: 17 × 7
##    animal_group_parent  mean_incident_cost median_incident_cost sd_incident_cost
##    <chr>                             <dbl>                <dbl>            <dbl>
##  1 Cat                                360.                 333             164. 
##  2 Bird                               356.                 333             138. 
##  3 Dog                                362.                 326             202. 
##  4 Fox                                386.                 346             191. 
##  5 Unknown - Domestic …               334.                 298             116. 
##  6 Horse                              734.                 596             527. 
##  7 Deer                               419.                 346             265. 
##  8 Unknown - Wild Anim…               418.                 339             302. 
##  9 Squirrel                           324.                 333              55.1
## 10 Unknown - Heavy Liv…               374.                 260             260. 
## 11 cat                                353.                 336             110. 
## 12 Hamster                            326.                 310.             88.1
## 13 Snake                              375.                 352             122. 
## 14 Rabbit                             325.                 333              40.0
## 15 Ferret                             314.                 336              41.1
## 16 Cow                                599.                 436             451. 
## 17 Sheep                              355.                 339             114. 
## # ℹ 3 more variables: min_incident_cost <dbl>, max_incident_cost <dbl>,
## #   count <int>

Compare the mean and the median for each animal group. What do you think this is telling us? Anything else that stands out? Any outliers?
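
One thing that does stand out (my own note, not an official answer) is that the grouping is case-sensitive: the table above lists both Cat and cat as separate groups. Below is a minimal sketch of how one might normalise the casing before counting or summarising, assuming stringr is available as part of the tidyverse.

# normalise case so "cat" and "Cat" are counted as one group (str_to_title is from stringr)
animal_rescue %>% 
  mutate(animal_group_parent = str_to_title(animal_group_parent)) %>% 
  count(animal_group_parent, name = "count", sort = TRUE)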

Finally, let us draw a few plots that show the distribution of incident_notional_cost_a for each animal group.

# base_plot
base_plot <- animal_rescue %>% 
  group_by(animal_group_parent) %>% 
  filter(n()>6) %>% 
  ggplot(aes(x=incident_notional_cost_a))+
  facet_wrap(~animal_group_parent, scales = "free")+
  theme_bw()

base_plot + geom_histogram()

base_plot + geom_density()

base_plot + geom_boxplot()

base_plot + stat_ecdf(geom = "step", pad = FALSE) +
  scale_y_continuous(labels = scales::percent)

Which of these four graphs do you think best communicates the variability of the incident_notional_cost_a values? Also, please tell some sort of story (which animals are more expensive to rescue than others, the spread of values) and speculate about the differences in the patterns.

Bonus Question: Total LFB Animal Rescue Cost over time

Using LFB’s incident_notional_cost_a, plot a line graph showing the total incident notional cost between 2009 and 2022.
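
A minimal sketch of one possible approach (my own hedged attempt, assuming incident_notional_cost_a has already been converted to numeric as above):

# one possible sketch: total notional cost per calendar year, 2009-2022
animal_rescue %>% 
  filter(cal_year >= 2009, cal_year <= 2022) %>% 
  group_by(cal_year) %>% 
  summarise(total_cost = sum(incident_notional_cost_a, na.rm = TRUE)) %>% 
  ggplot(aes(x = cal_year, y = total_cost)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +
  labs(title = "Total notional cost of LFB animal rescue incidents, 2009-2022",
       x = NULL,
       y = "Total notional cost (GBP)",
       caption = "Source: https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb") +
  theme_minimal() +
  NULL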

Submit the assignment

Knit the completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

Details

If you want to, please answer the following

  • Who did you collaborate with: TYPE NAMES HERE
  • Approximately how much time did you spend on this problem set: ANSWER HERE
  • What, if anything, gave you the most trouble: ANSWER HERE