These are the answers to the project questions.
Convert the Country column into factor data
Answer to Q1
library(readr)
airpopn <- read_csv("air_pollution.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (3): Year, PM2.5_air_pollution, Deaths
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The chunk above reads the CSV file into the R environment. To convert
the Country column to factor data, the chunk below
is used.
as.factor(airpopn$Country)
## [1] Benin Benin Benin Benin Benin Benin Benin Benin
## [9] Benin Benin Benin Benin Cameroon Cameroon Cameroon Cameroon
## [17] Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon Cameroon
## [25] Chad Chad Chad Chad Chad Chad Chad Chad
## [33] Chad Chad Chad Chad Niger Niger Niger Niger
## [41] Niger Niger Niger Niger Niger Niger Niger Niger
## [49] Nigeria Nigeria Nigeria Nigeria Nigeria Nigeria Nigeria Nigeria
## [57] Nigeria Nigeria Nigeria Nigeria
## Levels: Benin Cameroon Chad Niger Nigeria
In order to replace the Country column in airpopn with the factor data, the chunk below is used.
airpopn$Country <- as.factor(airpopn$Country)
airpopn
## # A tibble: 60 × 4
## Country Year PM2.5_air_pollution Deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Benin 1990 40.2 269.
## 2 Benin 1995 38.1 253.
## 3 Benin 2000 39.8 240.
## 4 Benin 2005 34.3 220.
## 5 Benin 2010 32.4 209.
## 6 Benin 2011 32.6 207.
## 7 Benin 2012 32.0 206.
## 8 Benin 2013 31.4 203.
## 9 Benin 2014 29.7 199.
## 10 Benin 2015 40.5 198.
## # … with 50 more rows
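As an alternative sketch, the column can be parsed as a factor while the file is being read, using the `col_types` argument of `read_csv()`. The inline text below is a made-up stand-in for air_pollution.csv (the country labels "A" and "B" and all values are placeholders, not project data), used only so the example runs on its own.

```r
library(readr)

# Hypothetical stand-in for air_pollution.csv; values are placeholders
csv_text <- I("Country,Year,PM2.5_air_pollution,Deaths
A,1990,40,200
B,1990,90,300
")

# Parse Country as a factor at read time; other columns are guessed
demo <- read_csv(csv_text,
                 col_types = cols(Country = col_factor()),
                 show_col_types = FALSE)

is.factor(demo$Country)
levels(demo$Country)
```

With the real file, the first argument would be "air_pollution.csv" instead of the inline text.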
How many countries are in the dataset?
Answer to Q1b
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
airpopn %>% distinct(Country)
## # A tibble: 5 × 1
## Country
## <fct>
## 1 Benin
## 2 Cameroon
## 3 Chad
## 4 Niger
## 5 Nigeria
There are five (5) countries in the dataset, namely
Benin, Cameroon, Chad, Niger and Nigeria.
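If only the count is needed rather than the list, it can be obtained directly. The sketch below uses a small made-up data frame (the labels "A", "B" and "C" are placeholders) in place of airpopn.

```r
library(dplyr)

# Hypothetical stand-in for airpopn with three placeholder countries
demo <- data.frame(Country = factor(c("A", "A", "B", "C")))

n_countries <- n_distinct(demo$Country)  # number of distinct values present
n_levels    <- nlevels(demo$Country)     # number of factor levels

n_countries
n_levels
```

Note that `nlevels()` counts every level of the factor, including levels that no longer appear in the data after filtering, whereas `n_distinct()` counts only the values present.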
Using a suitable visualization, what is the relationship between PM2.5_air_pollution and Deaths, coloured by Country?
Answer to Q2
To visualize the above, a scatter plot or a line plot can be used.
library(ggplot2)
ggplot(airpopn, aes(PM2.5_air_pollution, Deaths)) +
geom_point() + # scatter plot
ggtitle("Chart: One")
ggplot(airpopn, aes(PM2.5_air_pollution, Deaths, colour = Country)) + # colour, not fill: default points ignore fill
geom_point() + # scatter plot coloured by country
ggtitle("Chart: Two")
From the charts above, Benin has the lowest PM2.5 air pollution while Niger recorded the highest.
Which country has the highest PM2.5_air_pollution and Deaths in the visualization?
Answer to Q2a
Niger recorded the highest PM2.5_air_pollution and deaths.
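A reading taken from a chart can be cross-checked numerically. The sketch below, on a made-up data frame with placeholder countries "A" and "B", uses `slice_max()` from dplyr to pull out the rows with the largest pollution and death values; applying the same calls to airpopn would give the figures for the real data.

```r
library(dplyr)

# Hypothetical stand-in for airpopn; values are placeholders
demo <- data.frame(
  Country = c("A", "A", "B", "B"),
  PM2.5_air_pollution = c(30, 35, 80, 90),
  Deaths = c(200, 190, 250, 260)
)

# Row with the single highest pollution value
top_pollution <- demo %>% slice_max(PM2.5_air_pollution, n = 1)

# Row with the single highest death value
top_deaths <- demo %>% slice_max(Deaths, n = 1)

top_pollution
top_deaths
```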
Using a suitable visualization, compare the distribution of PM2.5_air_pollution by country
Answer to Q3
To achieve this, either a histogram or a boxplot can be used.
ggplot(airpopn, aes(PM2.5_air_pollution, fill= Country)) +
geom_histogram() +
ggtitle("Chart: Three") +
facet_wrap(~Country)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(airpopn, aes(PM2.5_air_pollution, fill= Country)) +
geom_boxplot() +
ggtitle("Chart: Four")
The charts above show that Benin has the lowest air pollution while Niger has the highest.
Using a suitable visualization, compare the distribution of Deaths by country
Answer to Q4
Here, a histogram and a boxplot are used to compare the distributions.
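The ranking suggested by the boxplot can be backed up with per-country summary statistics. This is a minimal sketch on made-up data (placeholder countries "A" and "B"); the column names `median_pm25`, `min_pm25` and `max_pm25` are chosen here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for airpopn; values are placeholders
demo <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  PM2.5_air_pollution = c(30, 35, 32, 80, 90, 85)
)

# One row per country, sorted from lowest to highest median pollution
summary_tbl <- demo %>%
  group_by(Country) %>%
  summarise(
    median_pm25 = median(PM2.5_air_pollution),
    min_pm25    = min(PM2.5_air_pollution),
    max_pm25    = max(PM2.5_air_pollution),
    .groups = "drop"
  ) %>%
  arrange(median_pm25)

summary_tbl
```

The first row of the result is the country with the lowest median and the last row the one with the highest, which are the same quantities the boxplot draws as its centre lines.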
ggplot(airpopn, aes(Deaths, fill= Country)) +
geom_histogram() +
ggtitle("Chart: Five") +
facet_wrap(~Country)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(airpopn, aes(Deaths, fill= Country)) +
geom_boxplot() +
ggtitle("Chart: Six")
From Charts Five and Six, Nigeria recorded the lowest deaths while Niger appears to have recorded the highest.
Which year had the least air pollution in all countries?
Answer to Q5
In order to answer this, a line plot is used, with Year on the x axis, air pollution on the y axis, and lines coloured by Country.
ggplot(airpopn, aes(Year, PM2.5_air_pollution, color = Country)) +
geom_line() +
ggtitle("Chart: Seven")
From Chart Seven, Benin had the lowest air pollution among the countries.
Which country has the highest air pollution over time?
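Since the question asks for a year, one way to answer it numerically is to average the pollution values across countries for each year and take the minimum. The sketch below assumes that "least air pollution in all countries" means the lowest cross-country mean, and it runs on made-up data with placeholder countries "A" and "B".

```r
library(dplyr)

# Hypothetical stand-in for airpopn; values are placeholders
demo <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(2010, 2011, 2010, 2011),
  PM2.5_air_pollution = c(30, 20, 80, 70)
)

# Mean pollution per year across all countries, then the lowest year
lowest_year <- demo %>%
  group_by(Year) %>%
  summarise(mean_pm25 = mean(PM2.5_air_pollution), .groups = "drop") %>%
  slice_min(mean_pm25, n = 1)

lowest_year
```

Replacing `demo` with airpopn would return the year with the lowest average PM2.5 value in the project data.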
Answer to Q5b
Here a line plot is used, with Year on the x axis, air pollution on the y axis, and lines coloured by Country.
ggplot(airpopn, aes(Year, PM2.5_air_pollution, color = Country)) +
geom_line() +
ggtitle("Chart: Eight")
From the chart, it is clear that Niger has the highest air pollution over time.
Create a function to calculate the cumulative Deaths of a country from 2010 to 2017
Answer to Q6
To get the cumulative deaths of each country, the chunk below is used.
library(dplyr)
gts <- airpopn %>%
group_by(Country) %>%
mutate(cumDeaths=cumsum(Deaths))
gts
## # A tibble: 60 × 5
## # Groups: Country [5]
## Country Year PM2.5_air_pollution Deaths cumDeaths
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Benin 1990 40.2 269. 269.
## 2 Benin 1995 38.1 253. 522.
## 3 Benin 2000 39.8 240. 762.
## 4 Benin 2005 34.3 220. 983.
## 5 Benin 2010 32.4 209. 1192.
## 6 Benin 2011 32.6 207. 1399.
## 7 Benin 2012 32.0 206. 1605.
## 8 Benin 2013 31.4 203. 1808.
## 9 Benin 2014 29.7 199. 2007.
## 10 Benin 2015 40.5 198. 2204.
## # … with 50 more rows
The chunk above creates a data frame of the whole dataset with the cumulative deaths for each country.
airpopn_2010_17 <- airpopn %>%
filter(Year %in% c(2010:2017)) %>%
group_by(Country) %>%
mutate(cumDeaths=cumsum(Deaths))
airpopn_2010_17
## # A tibble: 40 × 5
## # Groups: Country [5]
## Country Year PM2.5_air_pollution Deaths cumDeaths
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Benin 2010 32.4 209. 209.
## 2 Benin 2011 32.6 207. 417.
## 3 Benin 2012 32.0 206. 622.
## 4 Benin 2013 31.4 203. 826.
## 5 Benin 2014 29.7 199. 1024.
## 6 Benin 2015 40.5 198. 1222.
## 7 Benin 2016 38.4 193. 1414.
## 8 Benin 2017 39.0 189. 1604.
## 9 Cameroon 2010 59.3 196. 196.
## 10 Cameroon 2011 58.1 194. 389.
## # … with 30 more rows
In order to get the required years, filter() is used to extract the rows from 2010 to 2017, as done in the chunk above. The new dataset of countries from 2010 to 2017 is airpopn_2010_17.
To create a function that computes the cumulative sum of deaths for a country, the chunk below is used.
cumDeaths <- function(df, ctry){
#df = the dataframe
#ctry = The country name
new_df <- df[df$Country==ctry,] #Country name variable is - Country
cumsum(new_df$Deaths)
}
In this function, df specifies the dataset and ctry specifies the country.
The function can be tested by supplying the dataset name and the country,
for example:
cumDeaths(df=airpopn_2010_17, ctry = 'Nigeria')
## [1] 171.6087 340.5777 508.2280 673.9928 836.2246 996.9611 1153.3075
## [8] 1304.6900
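The function can also be used on other datasets, such as the full dataset before the years were filtered. In that case the cumulative sum covers every year in that dataset, not only 2010 to 2017.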
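Because cumDeaths() relies on the data being filtered beforehand, a variant that applies the year range itself may be convenient. The sketch below is one possible version: the name `cum_deaths_between` and the arguments `from` and `to` are hypothetical, and the demo data frame uses placeholder countries "A" and "B" with made-up values.

```r
# Cumulative deaths for one country within a year range (inclusive).
# df   = data frame with Country, Year and Deaths columns
# ctry = country name
# from, to = first and last year to include (defaults match the question)
cum_deaths_between <- function(df, ctry, from = 2010, to = 2017) {
  rows <- df[df$Country == ctry & df$Year >= from & df$Year <= to, ]
  rows <- rows[order(rows$Year), ]  # make sure years are in order
  data.frame(Year = rows$Year, cumDeaths = cumsum(rows$Deaths))
}

# Hypothetical stand-in for airpopn; values are placeholders
demo <- data.frame(
  Country = c("A", "A", "A", "B"),
  Year = c(2005, 2010, 2011, 2010),
  Deaths = c(100, 10, 20, 5)
)

res <- cum_deaths_between(demo, "A")
res
```

Here the 2005 row is excluded by the default range, so the running total for "A" starts at 2010. Returning the years next to the totals makes it easier to see which year each cumulative value belongs to.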
Using the rvest package and your web scraping skills, get the definition of air pollution from Wikipedia at this link: https://en.wikipedia.org/wiki/Air_pollution
Answer to Q7
In order to scrape from the web, rvest needs to be loaded first:
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(dplyr)
airpolution <- read_html("https://en.wikipedia.org/wiki/Air_pollution")
airpolution %>%
html_elements(css = "#mw-content-text > div.mw-parser-output > p:nth-child(12)") %>%
html_text()
## [1] "Air pollution is the contamination of air due to the presence of substances in the atmosphere that are harmful to the health of humans and other living beings, or cause damage to the climate or to materials.[1] There are many different types of air pollutants, such as gases (including ammonia, carbon monoxide, sulfur dioxide, nitrous oxides, methane, carbon dioxide and chlorofluorocarbons), particulates (both organic and inorganic), and biological molecules. Air pollution can cause diseases, allergies, and even death to humans; it can also cause harm to other living organisms such as animals and food crops, and may damage the natural environment (for example, climate change, ozone depletion or habitat degradation) or built environment (for example, acid rain).[2] Air pollution can be caused by both human activities and natural phenomena.[3]"
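The selector used above, `p:nth-child(12)`, depends on the exact layout of the page when it was scraped and may stop matching if the article is edited. A less position-dependent sketch is to collect all paragraphs in the article body and keep the first one that is not empty. This assumes the definition is the first non-empty paragraph, which may not always hold. The HTML below is a made-up stand-in built with `minimal_html()` so the example runs without a network connection.

```r
library(rvest)

# Hypothetical miniature page that mimics an article body with an
# empty leading paragraph followed by real text
page <- minimal_html('
  <div class="mw-parser-output">
    <p class="mw-empty-elt"></p>
    <p>First real paragraph.</p>
    <p>Second paragraph.</p>
  </div>')

# Text of every paragraph that is a direct child of the article body
paras <- page %>%
  html_elements("div.mw-parser-output > p") %>%
  html_text2()

# Keep the first paragraph that actually contains text
first_para <- paras[nzchar(paras)][1]
first_para
```

With the real page, `page` would be replaced by the airpolution object returned by read_html().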
Yusuf Akintunde Azeez