Project 1 - Midterm

Author

Steve Donfack

The data set we are working with today records road traffic deaths by country from 1990 to 2019, published by the Global Burden of Disease Collaborative Network in 2019. It counts the total number of deaths from road traffic incidents, including vehicle drivers and passengers, motorcyclists, cyclists, and pedestrians. The data set has 6 variables:

* Entity: the name of the country
* Code: the code for each country
* Year
* Deaths: the number of deaths
* Sidedness
* Historical_Population: the total population

To analyse this data set, we are going to compare how road traffic deaths evolved over the years in five selected African countries: Cameroon, Nigeria, Kenya, Côte d'Ivoire, and Senegal.

Load the libraries and the data set

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/steve/OneDrive/Desktop/DATA 110/WEEK 3")
Road_traffic_death <- read_csv("Road traffic death.csv")
Rows: 8010 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Entity, Code
dbl (4): Year, Deaths, Sidedness, Historical_Population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
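
The message above suggests declaring the column types explicitly. A minimal sketch of what that could look like, assuming the six column names shown in the column specification; the temporary file and the name `road_sample` are stand-ins for illustration, not part of the original analysis:

```r
library(readr)

# Stand-in for "Road traffic death.csv": a temporary file with the same six
# columns (assumption: the real file uses exactly these column names).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Entity,Code,Year,Deaths,Sidedness,Historical_Population",
  "Afghanistan,AFG,1990,4154,0,12412311",
  "Afghanistan,AFG,1991,4472,0,13299016"
), tmp)

# Declaring the types up front documents what each column should contain
# and quiets the column specification message.
road_sample <- read_csv(
  tmp,
  col_types = cols(
    Entity = col_character(),
    Code = col_character(),
    Year = col_double(),
    Deaths = col_double(),
    Sidedness = col_double(),
    Historical_Population = col_double()
  )
)
```

With the real file, the same `col_types` argument would be passed to the `read_csv("Road traffic death.csv")` call.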

Let's clean our data set

Let's remove every row that has an NA in the variables "Deaths", "Historical_Population", or "Code"

Road_traffic_death_na <- Road_traffic_death |>
  filter(!is.na(Deaths) & !is.na(Historical_Population) & !is.na(Code)) 
head(Road_traffic_death_na)
# A tibble: 6 × 6
  Entity      Code   Year Deaths Sidedness Historical_Population
  <chr>       <chr> <dbl>  <dbl>     <dbl>                 <dbl>
1 Afghanistan AFG    1990   4154         0              12412311
2 Afghanistan AFG    1991   4472         0              13299016
3 Afghanistan AFG    1992   5106         0              14485543
4 Afghanistan AFG    1993   5681         0              15816601
5 Afghanistan AFG    1994   6001         0              17075728
6 Afghanistan AFG    1995   6211         0              18110662
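
As a point of comparison, the same cleaning rule can be written with `drop_na()` from tidyr. This is only a sketch on toy rows: the tibble `toy` and its values are hypothetical and are not taken from the data set.

```r
library(dplyr)
library(tidyr)

# Toy rows for illustration only (hypothetical values).
toy <- tibble(
  Entity = c("A", "B", "C", "D"),
  Code = c("AAA", NA, "CCC", "DDD"),
  Deaths = c(10, 20, NA, 40),
  Historical_Population = c(1000, 2000, 3000, NA)
)

# Same rule as the filter() call above: keep rows where all three columns are present
kept_filter <- toy |>
  filter(!is.na(Deaths) & !is.na(Historical_Population) & !is.na(Code))

# Alternative with drop_na(), naming only the columns to check
kept_drop <- toy |>
  drop_na(Deaths, Historical_Population, Code)
```

Both versions keep only the rows where all three columns are filled in; `drop_na()` is shorter when several columns are involved.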

Use group_by and summarise to create a new table

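Let's create a new table that keeps only those 5 countries (Cameroon, Nigeria, Kenya, Côte d'Ivoire, Senegal) and, grouped by country, holds the deaths, the historical population, and the year for each record. Note that the code below keeps one row per country and year; it does not compute averages.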

by_death <- Road_traffic_death_na |>
  filter(Entity %in% c("Cameroon", "Cote d'Ivoire", "Nigeria", "Kenya", "Senegal")) |>
  group_by(Entity) |>
  summarise(Deaths = Deaths,
            population = Historical_Population,
            year = Year)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'Entity'. You can override using the
`.groups` argument.
head(by_death)
# A tibble: 6 × 4
# Groups:   Entity [1]
  Entity   Deaths population  year
  <chr>     <dbl>      <dbl> <dbl>
1 Cameroon   2434   11780086  1990
2 Cameroon   2569   12137912  1991
3 Cameroon   2721   12499499  1992
4 Cameroon   2862   12864091  1993
5 Cameroon   2992   13230978  1994
6 Cameroon   3133   13599984  1995
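
The deprecation warning above points to `reframe()`. Here is a hedged sketch of two alternatives, using only the six Cameroon rows printed above so that it runs on its own; the names `cameroon`, `by_year`, `by_country`, `avg_deaths`, and `avg_population` are introduced here for illustration:

```r
library(dplyr)

# Cameroon rows copied from the printed output above; the full table
# would contain all five countries.
cameroon <- tibble(
  Entity = "Cameroon",
  Year = 1990:1995,
  Deaths = c(2434, 2569, 2721, 2862, 2992, 3133),
  Historical_Population = c(11780086, 12137912, 12499499,
                            12864091, 13230978, 13599984)
)

# One row per country and year: reframe() does what the warning suggests
# and returns an ungrouped table.
by_year <- cameroon |>
  group_by(Entity) |>
  reframe(Deaths = Deaths,
          population = Historical_Population,
          year = Year)

# One row per country: summarise() with mean() gives actual averages.
by_country <- cameroon |>
  group_by(Entity) |>
  summarise(avg_deaths = mean(Deaths),
            avg_population = mean(Historical_Population),
            .groups = "drop")
```

The first version reproduces the shape of `by_death` without the warning; the second is what a table of average deaths and average population per country would look like.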

Let's make our scatterplot

# One point per country and year; point size encodes the number of deaths
ggplot(by_death, aes(year, Entity, colour = Entity)) +
  geom_point(aes(size = Deaths), alpha = 1) +
  scale_color_manual(values = c("Senegal" = "#8a0707", "Nigeria" = "#7d1fcf", "Kenya" = "#d1135c", "Cote d'Ivoire" = "#ff5e00", "Cameroon" = "#327829")) +
  scale_size_area() +
  theme_minimal() +
  labs(x = "Year",
       y = "Countries",
       size = "Number of deaths",
       caption = "Global Burden of Disease Study 2019",
       title = "Road traffic deaths per country from 1990 to 2019")

To clean this data set, I used the is.na() function to remove all the rows with NA values in the variables I use. I also filtered the data set to keep only the countries and variables I needed. This visualization shows how road traffic deaths evolved in the five African countries we selected (Kenya, Cameroon, Côte d'Ivoire, Senegal, and Nigeria) from 1990 to 2019. As the visualization shows, the number of road traffic deaths changes over the years in most of these countries, except for Senegal, where it stays fairly stable. For this data set, I wish I could have shown deaths and historical population at the same time, to compare how both variables evolve over the years, using a line chart.
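
One possible way to compare deaths and population in a single line chart is to plot deaths per 100,000 people. The sketch below is only an illustration of that idea, not part of the original analysis; it uses the six Cameroon rows from the `by_death` output so that it runs on its own, and the names `by_death_sample`, `rates`, `deaths_per_100k`, and `rate_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Cameroon rows copied from the by_death output; the full by_death table
# has the same columns and would work the same way.
by_death_sample <- tibble(
  Entity = "Cameroon",
  Deaths = c(2434, 2569, 2721, 2862, 2992, 3133),
  population = c(11780086, 12137912, 12499499,
                 12864091, 13230978, 13599984),
  year = 1990:1995
)

# Deaths per 100,000 people combines both variables into one comparable rate
rates <- by_death_sample |>
  mutate(deaths_per_100k = Deaths / population * 1e5)

# One line per country showing how the rate changes over time
rate_plot <- ggplot(rates, aes(year, deaths_per_100k, colour = Entity)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Year",
       y = "Deaths per 100,000 people",
       colour = "Country")
```

Printing `rate_plot` draws the chart. With all five countries in the table, each country gets its own line, which makes it easier to see whether deaths grow faster or slower than the population.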