Project One

These are the answers to the project questions.

Question 1

Convert the Country column into factor data.

Answer to Q1

library(readr)
airpopn <- read_csv("air_pollution.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (3): Year, PM2.5_air_pollution, Deaths
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The chunk above reads the CSV file into the R environment. To convert the Country column to factor data, the chunk below is used.

as.factor(airpopn$Country)
##  [1] Benin    Benin    Benin    Benin    Benin    Benin    Benin    Benin   
##  [9] Benin    Benin    Benin    Benin    Cameroon Cameroon Cameroon Cameroon
## [17] Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon
## [25] Chad     Chad     Chad     Chad     Chad     Chad     Chad     Chad    
## [33] Chad     Chad     Chad     Chad     Niger    Niger    Niger    Niger   
## [41] Niger    Niger    Niger    Niger    Niger    Niger    Niger    Niger   
## [49] Nigeria  Nigeria  Nigeria  Nigeria  Nigeria  Nigeria  Nigeria  Nigeria 
## [57] Nigeria  Nigeria  Nigeria  Nigeria 
## Levels: Benin Cameroon Chad Niger Nigeria

In order to replace the Country column in airpopn with the factor data, the chunk below is used.

airpopn$Country <- as.factor(airpopn$Country)
airpopn
## # A tibble: 60 × 4
##    Country  Year PM2.5_air_pollution Deaths
##    <fct>   <dbl>               <dbl>  <dbl>
##  1 Benin    1990                40.2   269.
##  2 Benin    1995                38.1   253.
##  3 Benin    2000                39.8   240.
##  4 Benin    2005                34.3   220.
##  5 Benin    2010                32.4   209.
##  6 Benin    2011                32.6   207.
##  7 Benin    2012                32.0   206.
##  8 Benin    2013                31.4   203.
##  9 Benin    2014                29.7   199.
## 10 Benin    2015                40.5   198.
## # … with 50 more rows
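As an alternative to converting after the read, the column type can be declared while reading. The sketch below assumes a small toy file written to a temporary path (the values are invented for illustration and are not the project data); only `read_csv()`, `cols()` and `col_factor()` from readr are used.

```r
library(readr)

# Toy stand-in for air_pollution.csv; the values are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Year,PM2.5_air_pollution,Deaths",
  "Benin,2010,30,200",
  "Niger,2010,90,250",
  "Benin,2011,31,198"
), tmp)

# Parse Country as a factor at read time; the other columns are still guessed.
toy <- read_csv(
  tmp,
  col_types = cols(Country = col_factor()),
  show_col_types = FALSE
)

is.factor(toy$Country)
levels(toy$Country)
```

Note that `col_factor()` without explicit levels orders them by first appearance in the file, whereas `as.factor()` sorts them alphabetically, so the two approaches can differ in level order.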

Question 1b

How many countries are in the dataset?

Answer to Q1b

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
airpopn %>% distinct(Country)
## # A tibble: 5 × 1
##   Country 
##   <fct>   
## 1 Benin   
## 2 Cameroon
## 3 Chad    
## 4 Niger   
## 5 Nigeria

There are five (5) countries in the dataset namely Benin, Cameroon, Chad, Niger and Nigeria.
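If only the count is needed rather than the list, it can be computed directly. This is a minimal sketch on an invented toy data frame (not the project data), using `n_distinct()` from dplyr and base R's `nlevels()`.

```r
library(dplyr)

# Toy data, invented for illustration.
toy <- data.frame(
  Country = factor(c("Benin", "Benin", "Chad", "Niger", "Niger"))
)

n_countries <- n_distinct(toy$Country)  # number of distinct values present
n_levels    <- nlevels(toy$Country)     # number of factor levels

n_countries
n_levels
```

The two numbers agree here, but `nlevels()` also counts levels that no longer occur in the data (for example after filtering), so `n_distinct()` is the safer choice for counting countries actually present.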

Question 2

Using a suitable visualization, what is the relationship between PM2.5_air_pollution and Deaths, coloured by Country?

Answer to Q2

To visualize the above, a scatter plot or a line plot can be used.

library(ggplot2)

ggplot(airpopn, aes(PM2.5_air_pollution, Deaths)) +
  geom_point() + ## scatter plot
  ggtitle("Chart: One")

ggplot(airpopn, aes(PM2.5_air_pollution, Deaths, colour = Country)) +
  geom_point() + ## scatter plot coloured by country (default points ignore `fill`)
  ggtitle("Chart: Two")

From the above, Benin has the lowest PM2.5 air pollution while Niger recorded the highest.
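The question also asks about the relationship between the two variables, which a correlation coefficient can help quantify alongside the scatter plot. The sketch below uses an invented toy data frame with placeholder countries "A" and "B" (not the project data) and computes Pearson's correlation overall and within each country.

```r
library(dplyr)

# Toy data, invented for illustration: A rises together, B moves in opposite directions.
toy <- data.frame(
  Country = factor(rep(c("A", "B"), each = 4)),
  PM2.5_air_pollution = c(10, 20, 30, 40, 50, 60, 70, 80),
  Deaths = c(100, 110, 120, 130, 90, 80, 70, 60)
)

# Correlation across all rows, ignoring country.
overall_cor <- cor(toy$PM2.5_air_pollution, toy$Deaths)

# Correlation within each country.
by_country <- toy %>%
  group_by(Country) %>%
  summarise(r = cor(PM2.5_air_pollution, Deaths), .groups = "drop")

by_country
```

Comparing the overall and per-country values matters because a pooled correlation can hide or even reverse the pattern seen inside each country.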

Question 2a

Which country has the highest PM2.5_air_pollution and Deaths in the visualization?

Answer to Q2a

Niger recorded the highest PM2.5_air_pollution and deaths.

Question 3

Using a suitable visualization, compare the distribution of PM2.5_air_pollution by country.

Answer to Q3

To achieve this, either a histogram or a boxplot can be used.

ggplot(airpopn, aes(PM2.5_air_pollution, fill= Country)) +
  geom_histogram() +
  ggtitle("Chart: Three") +
  facet_wrap(~Country)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(airpopn, aes(PM2.5_air_pollution, fill= Country)) +
  geom_boxplot() +
  ggtitle("Chart: Four") 

The charts above show that Benin has the lowest air pollution while Niger has the highest.
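The reading of the charts can be backed up with a numeric summary per country. This is a sketch on invented toy data with placeholder countries "A" and "B"; the summary column names `median_pm25` and `iqr_pm25` are chosen here for illustration.

```r
library(dplyr)

# Toy data, invented for illustration.
toy <- data.frame(
  Country = factor(rep(c("A", "B"), each = 4)),
  PM2.5_air_pollution = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Centre and spread of the pollution values within each country.
pm_summary <- toy %>%
  group_by(Country) %>%
  summarise(
    median_pm25 = median(PM2.5_air_pollution),
    iqr_pm25    = IQR(PM2.5_air_pollution),
    .groups = "drop"
  )

pm_summary
```

The median and interquartile range correspond to the centre line and box height of a boxplot, so the table and the chart can be checked against each other.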

Question 4

Using a suitable visualization, compare the distribution of Deaths by country.

Answer to Q4

Here, a histogram or a boxplot is used to compare the distributions.

ggplot(airpopn, aes(Deaths, fill= Country)) +
  geom_histogram() +
  ggtitle("Chart: Five") +
  facet_wrap(~Country)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(airpopn, aes(Deaths, fill= Country)) +
  geom_boxplot() +
  ggtitle("Chart: Six")

From Charts Five and Six, Nigeria recorded the lowest deaths, while Niger appears to have recorded the highest.

Question 5

Which year had the least air pollution in all countries?

Answer to Q5

In order to answer the above, a line plot is used, with Year on the x-axis, air pollution on the y-axis, and colour mapped to Country.

ggplot(airpopn, aes(Year, PM2.5_air_pollution, color = Country)) +
  geom_line() +
  ggtitle("Chart: Seven")

From chart 7, Benin had the least air pollution among the countries.
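Since the question asks for a year rather than a country, one possible reading is "the year with the lowest average PM2.5 across all countries". The sketch below follows that assumption on invented toy data with placeholder countries "A" and "B"; the column name `mean_pm25` is chosen here for illustration.

```r
library(dplyr)

# Toy data, invented for illustration.
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(c(2010, 2011, 2012), times = 2),
  PM2.5_air_pollution = c(30, 20, 25, 60, 50, 70)
)

# Average pollution across countries for each year.
yearly <- toy %>%
  group_by(Year) %>%
  summarise(mean_pm25 = mean(PM2.5_air_pollution), .groups = "drop")

# Keep the year with the smallest average.
cleanest <- yearly %>% slice_min(mean_pm25, n = 1)

cleanest
```

Other readings are possible, for example the year in which each country individually reached its minimum, which would need `group_by(Country)` followed by `slice_min()` instead.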

Question 5b

Which country has the highest air pollution over time?

Answer to Q5b

Here, a line plot is used with Year on the x-axis, air pollution on the y-axis, and colour mapped to Country.

ggplot(airpopn, aes(Year, PM2.5_air_pollution, color = Country)) +
  geom_line() +
  ggtitle("Chart: Eight")

From the chart, it is clear that Niger has the highest air pollution over time.

Question 6

Create a function to Calculate the cumulative Deaths of a country from 2010 to 2017

Answer to Q6

To get the cumulative deaths of each country:

library(dplyr)
gts <- airpopn %>% 
  group_by(Country) %>%
  mutate(cumDeaths=cumsum(Deaths))

gts
## # A tibble: 60 × 5
## # Groups:   Country [5]
##    Country  Year PM2.5_air_pollution Deaths cumDeaths
##    <fct>   <dbl>               <dbl>  <dbl>     <dbl>
##  1 Benin    1990                40.2   269.      269.
##  2 Benin    1995                38.1   253.      522.
##  3 Benin    2000                39.8   240.      762.
##  4 Benin    2005                34.3   220.      983.
##  5 Benin    2010                32.4   209.     1192.
##  6 Benin    2011                32.6   207.     1399.
##  7 Benin    2012                32.0   206.     1605.
##  8 Benin    2013                31.4   203.     1808.
##  9 Benin    2014                29.7   199.     2007.
## 10 Benin    2015                40.5   198.     2204.
## # … with 50 more rows

The chunk above creates a data frame of the whole dataset with a running cumulative total of deaths for each country.

airpopn_2010_17 <- airpopn %>% 
  filter(Year %in% c(2010:2017)) %>%
  group_by(Country) %>%
  mutate(cumDeaths=cumsum(Deaths))

airpopn_2010_17
## # A tibble: 40 × 5
## # Groups:   Country [5]
##    Country   Year PM2.5_air_pollution Deaths cumDeaths
##    <fct>    <dbl>               <dbl>  <dbl>     <dbl>
##  1 Benin     2010                32.4   209.      209.
##  2 Benin     2011                32.6   207.      417.
##  3 Benin     2012                32.0   206.      622.
##  4 Benin     2013                31.4   203.      826.
##  5 Benin     2014                29.7   199.     1024.
##  6 Benin     2015                40.5   198.     1222.
##  7 Benin     2016                38.4   193.     1414.
##  8 Benin     2017                39.0   189.     1604.
##  9 Cameroon  2010                59.3   196.      196.
## 10 Cameroon  2011                58.1   194.      389.
## # … with 30 more rows

In order to get the required years, filter() is used to extract the rows from 2010 to 2017, as done in the chunk above. The new dataset of countries from 2010 to 2017 is airpopn_2010_17.

To create a function that computes the cumulative sum for a single country, the chunk below is used.

cumDeaths <- function(df, ctry){
  # df   = the data frame (must contain Country and Deaths columns)
  # ctry = the country name
  new_df <- df[df$Country == ctry, ] # keep only the rows for that country
  cumsum(new_df$Deaths)              # running total of deaths, in row order
}

In the function I created, df specifies the dataset and ctry specifies the country.

The above can be tested by supplying the dataset name and the country.

For example:

cumDeaths(df=airpopn_2010_17, ctry = 'Nigeria')
## [1]  171.6087  340.5777  508.2280  673.9928  836.2246  996.9611 1153.3075
## [8] 1304.6900

The function can also be used on other datasets, such as the dataset before the years were filtered.
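Because the function above relies on the data already being filtered to 2010 to 2017, a variant that takes the year range as arguments may be closer to what the question asks. The sketch below is one possible version; the name `cum_deaths_between` and the arguments `from` and `to` are hypothetical, and the toy data are invented for illustration.

```r
# Hypothetical variant: filters by country AND year range inside the function.
cum_deaths_between <- function(df, ctry, from = 2010, to = 2017) {
  rows <- df[df$Country == ctry & df$Year >= from & df$Year <= to, ]
  rows <- rows[order(rows$Year), ]  # make the running total follow time order
  cumsum(rows$Deaths)
}

# Toy data, invented for illustration; rows are deliberately out of order.
toy <- data.frame(
  Country = c("A", "A", "A", "A", "B"),
  Year    = c(2009, 2011, 2010, 2018, 2010),
  Deaths  = c(5, 20, 10, 40, 7)
)

result <- cum_deaths_between(toy, "A")
result
```

Sorting by Year inside the function means the cumulative sum does not depend on the row order of the input, which the original version implicitly assumes.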

Question 7

Using rvest package and your web scraping skills get the definition of Air pollution from Wikipedia on this link https://en.wikipedia.org/wiki/Air_pollution

Answer to Q7

In order to scrape from the web, rvest needs to be loaded:

library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(dplyr)

airpolution <- read_html("https://en.wikipedia.org/wiki/Air_pollution")

airpolution %>% 
  html_elements(css = "#mw-content-text > div.mw-parser-output > p:nth-child(12)") %>% 
  html_text()
## [1] "Air pollution is the contamination of air due to the presence of substances in the atmosphere that are harmful to the health of humans and other living beings, or cause damage to the climate or to materials.[1] There are many different types of air pollutants, such as gases (including ammonia, carbon monoxide, sulfur dioxide, nitrous oxides, methane, carbon dioxide and chlorofluorocarbons), particulates (both organic and inorganic), and biological molecules. Air pollution can cause diseases, allergies, and even death to humans; it can also cause harm to other living organisms such as animals and food crops, and may damage the natural environment (for example, climate change, ozone depletion or habitat degradation) or built environment (for example, acid rain).[2] Air pollution can be caused by both human activities and natural phenomena.[3]"
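
The `p:nth-child(12)` selector depends on the exact position of the paragraph, which can shift whenever the page is edited. A less position-dependent sketch is to take the first non-empty paragraph of the article body. The example below runs offline on invented markup built with `minimal_html()`; the helper name `first_paragraph` is hypothetical, and the assumption that the first non-empty paragraph holds the definition would need checking against the live page.

```r
library(rvest)

# Stand-in for the downloaded page; the markup and text are invented for illustration.
page <- minimal_html('
  <div id="mw-content-text"><div class="mw-parser-output">
    <p class="mw-empty-elt"></p>
    <p>Air pollution is a toy definition used only for this sketch.</p>
    <p>A second paragraph.</p>
  </div></div>')

# Hypothetical helper: first paragraph of the article body that contains text.
first_paragraph <- function(doc) {
  nodes <- html_elements(doc, "#mw-content-text .mw-parser-output > p")
  paras <- html_text2(nodes)
  paras <- paras[nzchar(trimws(paras))]  # drop empty placeholder paragraphs
  paras[1]
}

definition <- first_paragraph(page)
definition
```

To apply it to the real article, `page` would be replaced by the result of `read_html()` on the Wikipedia URL used above.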

This R Markdown document was produced and submitted by

Yusuf Akintunde Azeez