Report template – HTML Quarto output

Unit of Study | Project X

Published

March 25, 2025

1. Introduction and description of data

After considering the six data sets, each covering an important topic, I chose to perform my EDA on data set 5: a survey on COVID-19-era perceived air pollution in multiple countries, including Australia. During this decade, COVID-19 and air pollution have emerged as two critical global challenges with the potential to shape our future, as we are exposed to new viruses and polluted air and need to carry out more tasks efficiently indoors. The main objective of my analysis is to compare perceived air pollution before and during COVID-19 in order to assess how restricted human activity (industrial, transportation) affected it.

For my analysis, I chose the following countries:

  1. Australia - our home country
  2. India, China and the USA - high air pollution and strict lockdowns

Variables:

  1. Pollution level of a given country before the pandemic - categorical, ordinal data
  2. Pollution level of a given country during the pandemic - categorical, ordinal data
  3. Country - categorical, nominal data (only considered as a variable when plotting the relationship between variables)

These variables were chosen for the statistical analysis because they directly support the objective and offer the most scope for identifying patterns that merit further research.
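Because both pollution variables are ordinal, it can help to store the ratings as ordered factors so that tables and comparisons follow the rating scale rather than alphabetical order. The sketch below is illustrative only: the six labels and their spelling are assumed from the ratings mentioned in this report, and the responses are made up.

```r
# Assumed rating scale, ordered from least to most polluted
levels_assumed <- c("very low", "low", "average",
                    "high", "very high", "extremely high")

# Made-up example responses (not taken from the survey)
ratings <- c("low", "very low", "average", "low", "high")

# Ordered factor: R now knows "very low" < "low" < "average" < ...
ratings_ord <- factor(ratings, levels = levels_assumed, ordered = TRUE)

# Frequency table in scale order, including ratings with zero responses
table(ratings_ord)

# Ordinal comparisons become meaningful
ratings_ord[1] < ratings_ord[3]
```

Passing a factor like this to aes(x = ...) also makes geom_bar() draw the bars in scale order instead of alphabetically.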

2. Exploratory Analysis of Single Variables

Air pollution in Australia before covid-19

Code
library(readxl)
Warning: package 'readxl' was built under R version 4.4.3
Code
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.3
Code
library(modeest)
Warning: package 'modeest' was built under R version 4.4.3
Code
air_pollution <- read_excel("data/airpollution.xlsx")
New names:
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="AU")
New names:
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
Code
mfv(air_pollution$`air pollution before pandemic aus`)
[1] "very low"

Mode: “very low”
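The mode here comes from mfv() in the modeest package. As a cross-check that needs no extra package, the same value can be read from a frequency table. This is a minimal sketch with made-up ratings; mode_of is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: return the most frequent value(s) of a vector.
# Missing values are ignored; if several values tie, all are returned.
mode_of <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[counts == max(counts)]
}

# Made-up example responses (not taken from the survey)
ratings <- c("low", "very low", "low", "average", "very low", "low")

mode_of(ratings)
```

Returning every tied value matters for survey ratings, where two categories can easily share the highest count.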

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution before pandemic aus`)) +
  geom_bar(fill = 'brown', color = 'black') +
  labs(
    title = "Air pollution level before covid-19 in Australia",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in Australia during covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="AU")
New names:
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
Code
mfv(air_pollution$`air pollution during pandemic aus`)
[1] "very low"

Mode: “very low”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution during pandemic aus` )) +
  geom_bar(fill = 'darkgreen', color = 'black') +
  labs(
    title = "Air pollution level during covid-19 in Australia",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in China before covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="CH")
mfv(air_pollution$`air pollution before pandemic china`)
[1] "average"

Mode: “average”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution before pandemic china` )) +
  geom_bar(fill = 'brown', color = 'black') +
  labs(
    title = "Air pollution level before covid-19 in China",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in China during covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="CH")
mfv(air_pollution$`air pollution during pandemic china`)
[1] "average"

Mode: “average”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution during pandemic china` )) +
  geom_bar(fill = 'darkgreen', color = 'black') +
  labs(
    title = "Air pollution level during covid-19 in China",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in India before covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="IN")
mfv(air_pollution$`air pollution before pandemic indi`)
[1] "high"

Mode: “high”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution before pandemic indi` )) +
  geom_bar(fill = 'brown', color = 'black') +
  labs(
    title = "Air pollution level before covid-19 in India",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in India during covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="IN")
mfv(air_pollution$`air pollution during pandemic indi`)
[1] "very low"

Mode: “very low”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution during pandemic indi` )) +
  geom_bar(fill = 'darkgreen', color = 'black') +
  labs(
    title = "Air pollution level during covid-19 in India",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in USA before covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="USA")
mfv(air_pollution$`air pollution before pandemic usa`)
[1] "average"

Mode: “average”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution before pandemic usa` )) +
  geom_bar(fill = 'brown', color = 'black') +
  labs(
    title = "Air pollution level before covid-19 in USA",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

Air pollution in USA during covid-19

Typical values

Code
library(modeest)
air_pollution <- read_excel("data/airpollution.xlsx", sheet="USA")
mfv(air_pollution$`air pollution during pandemic usa`)
[1] "low"

Mode: “low”

Visual representation

Code
ggplot(data = air_pollution, mapping = aes(x = `air pollution during pandemic usa` )) +
  geom_bar(fill = 'darkgreen', color = 'black') +
  labs(
    title = "Air pollution level during covid-19 in USA",
    x = "Pollution level",
    y = "Frequency"
  ) +
  theme_minimal()

3. Relationships between variables

Below is the frequency distribution of the pollution levels before and during COVID-19 for each country. The country sheets were combined into a single data frame to produce a stacked bar chart for exploring the relationship between the variables.

Code
# readxl and ggplot2 are installed once from the console, not inside the report
library(readxl)
library(ggplot2)


au_data <- read_excel("data/airpollution.xlsx", sheet = "AU")
New names:
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
Code
ch_data <- read_excel("data/airpollution.xlsx", sheet = "CH")
usa_data <- read_excel("data/airpollution.xlsx", sheet = "USA")
in_data <- read_excel("data/airpollution.xlsx", sheet = "IN")


au_data$country <- "Australia"
ch_data$country <- "China"
usa_data$country <- "USA"
in_data$country <- "India"


combined_data <- rbind(  
  data.frame(country = "Australia", time = "Before", pollution = au_data$`air pollution before pandemic aus`),
  data.frame(country = "Australia", time = "During", pollution = au_data$`air pollution during pandemic aus`),
  data.frame(country = "China", time = "Before", pollution = ch_data$`air pollution before pandemic china`),
  data.frame(country = "China", time = "During", pollution = ch_data$`air pollution during pandemic china`),
  data.frame(country = "India", time = "Before", pollution = in_data$`air pollution before pandemic indi`),
  data.frame(country = "India", time = "During", pollution = in_data$`air pollution during pandemic indi`),
  data.frame(country = "USA", time = "Before", pollution = usa_data$`air pollution before pandemic usa`),
  data.frame(country = "USA", time = "During", pollution = usa_data$`air pollution during pandemic usa`)
)


ggplot(combined_data, aes(x = time, fill = pollution)) +
  geom_bar(position = "stack") +
  facet_wrap(~ country, scales = "free") +
  labs(
    title = "Air Pollution Levels Before and During COVID-19",
    x = "Time Period",
    y = "Frequency",
    fill = "Pollution Level"
  ) +
  theme_minimal()
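Because the number of respondents differs between countries, raw frequencies are hard to compare across panels. One option, sketched here on a made-up stand-in for combined_data, is to convert the counts to shares within each country and time period. The toy data frame and its values are invented for illustration.

```r
# Made-up stand-in with the same three columns as combined_data
toy <- data.frame(
  country   = rep(c("Australia", "India"), each = 6),
  time      = rep(rep(c("Before", "During"), each = 3), times = 2),
  pollution = c("low", "low", "average",        # Australia, before
                "very low", "low", "very low",  # Australia, during
                "high", "high", "average",      # India, before
                "very low", "low", "very low")  # India, during
)

# Counts by country x time x rating
counts <- xtabs(~ country + time + pollution, data = toy)

# Shares within each country/time pair (each pair sums to 1)
shares <- prop.table(counts, margin = c(1, 2))

round(shares["Australia", "Before", ], 2)
```

In ggplot2, geom_bar(position = "fill") applies the same normalisation to the stacked bars, so every bar has height 1 and only the composition differs.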

4. Discussions and findings

Key insights

Researchers carried out an investigation to determine how increased human activity contributes to air pollution in several countries. After forming a hypothesis, the key part of data analysis is collecting data from reliable sources. For this investigation, members of the public were given surveys and asked to rate the air pollution level they experienced before the spread of the COVID-19 virus and during the spread (in 2020), when human activity was restricted.

From the data collected, I selected the two variables across four countries: Australia, China, India and the USA. For the individual analysis of each variable, the mode (the most frequently recorded rating) was computed as the typical value. A pattern emerged: the modal pollution level either decreased or stayed constant once the COVID-19 period began. The individual graphs show the distribution of the ratings given by the public, which varies by variable and by country. The bars for ‘average’, ‘high’ and ‘very high’ tend to be taller in the brown (before) bar plots, whereas the bars for ‘very low’, ‘low’ and ‘average’ tend to be taller in the green (during) bar plots.

In the stacked bar chart comparing the countries and the two variables, Australia shows ‘low’ as the most frequent rating before COVID-19, while during COVID-19 the ratings are evenly split between ‘low’ and ‘very low’. China shows ‘average’ as the most frequent rating before the pandemic and an even split between ‘average’ and ‘low’ during it. This comparison is useful because the modes gave similar ratings and the bar chart enhanced our understanding of the data. For India, the most frequent ratings before and during the pandemic are ‘high’ and ‘very low’ respectively, and for the USA they are ‘average’ and ‘low’. The results read from the bar chart may differ from the modes measured in the individual analysis, because the chart values were read by eye and estimated roughly. Even so, recording both sets of values is useful: where they agree, they strengthen our understanding of the results. Although it is not our objective, we can also compare air pollution levels between countries before and during COVID-19, which would be useful for exploring and adopting strategies observed in other countries. The lowest pollution level before COVID-19 was recorded in Australia, while the lowest levels during COVID-19 were recorded in Australia and India. The ratings “extremely high” (yellow) and “very high” (purple) are also absent from the bars for the variable “pollution during the pandemic”.
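Reading the tallest segment of a stacked bar by eye is approximate. The most frequent rating per country and period can instead be computed directly from the combined data. The sketch below uses a made-up stand-in for combined_data; mode_of is a hypothetical helper and the ratings are invented.

```r
# Hypothetical helper: most frequent value(s); ties are all returned
mode_of <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

# Made-up stand-in with the same three columns as combined_data
toy <- data.frame(
  country   = rep(c("Australia", "India"), each = 6),
  time      = rep(rep(c("Before", "During"), each = 3), times = 2),
  pollution = c("low", "low", "average",
                "very low", "low", "very low",
                "high", "high", "average",
                "very low", "very low", "low")
)

# One group of ratings per country/time pair, named e.g. "Australia.Before"
groups <- split(toy$pollution, list(toy$country, toy$time), drop = TRUE)

# Modal rating per group; tied ratings are joined with " / "
modes <- sapply(groups, function(x) paste(mode_of(x), collapse = " / "))
modes
```

Comparing this output with the chart shows immediately where a visual reading and the computed mode disagree.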

In conclusion, the data support the view that perceived air pollution levels fell as we entered the COVID-19 era. The statistical measures and visual diagrams indicate that the regulations and restrictions in place during the pandemic contributed to better environmental health overall.
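The conclusion rests on comparing modes. If each row of a country sheet holds one respondent's two ratings (an assumption about the data layout that is not confirmed here), the shift can also be counted respondent by respondent. The rating scale and the responses below are assumed and made up for illustration.

```r
# Assumed rating scale, ordered from least to most polluted
levels_assumed <- c("very low", "low", "average",
                    "high", "very high", "extremely high")

# Made-up paired responses: element i of each vector is the same respondent
before <- c("average", "high", "low", "average", "very high")
during <- c("low", "average", "low", "very low", "high")

# Position of each rating on the scale (1 = very low, 6 = extremely high)
b <- as.integer(factor(before, levels = levels_assumed))
d <- as.integer(factor(during, levels = levels_assumed))

# Direction of change for each respondent
change <- factor(sign(d - b),
                 levels = c(-1, 0, 1),
                 labels = c("lower", "same", "higher"))

table(change)
```

A table dominated by "lower" and "same" would support the pattern described above more directly than a comparison of modes alone.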

Limitations

This research has limitations that affect its accuracy and results. For some countries, despite a large population, only a small sample of respondents provided data (sample size). It is also uncertain whether the public can accurately recall and rate air pollution before the pandemic, which could lead to over- or underestimation. Furthermore, the analysis relies on a rating system given to the public. This could affect our results through bias (geographical, personal experience), a lack of standardized measurements, the absence of data collected by experts in the field, the absence of air quality measurement equipment, differing education levels, the exclusion of other external factors contributing to pollution, etc.

Potential Future Research

  • Conducting research based on objective air quality data with research experts and quality measuring equipment.

  • Conducting research by measuring factors that change with air pollution levels, such as water quality, biodiversity, and the composition of the air (e.g. CO2 levels).

  • Data could be collected from the same area/state which potentially had higher energy consumption, increased human activity and high industrial activity.

5. References and Source Summary

  1. GitHub. (2025). GitHub Copilot (Mar 16 version) [Large language model]. https://github.com/copilot

GitHub Copilot was used to create the code for the stacked bar plot, including the code chunk that combines the data with the rbind function and the ggplot code for the plot itself. It was highly useful, as it produced a high-quality visual representation of all the data, making analysis and interpretation easier.