In this assignment, several Exploratory Data Analysis (EDA) methods will be employed to uncover patterns in multivariate data, specifically air pollution data for US cities.
EDA methods such as Chi plots, for investigating independence, will be used to explore the correlation between variables such as Rainfall, NOX, SO2, Mortality, Education, and Population density across US cities.
The Chi plot is a graphical tool used for investigating the independence between two variables. It helps in visualizing the correlation and can be particularly useful in detecting deviations from independence.
Use Cases
The Chi plot is constructed as follows:
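One standard construction, due to Fisher and Switzer, can be sketched in base R as follows. This is a minimal illustration and not the MVA implementation; the helper name `chi_plot_coords` and the simulated inputs are assumptions made for the example. For each observation it compares the joint empirical distribution with the product of the two marginal ones, and draws approximate 95% control lines at ±1.78/sqrt(n).

```r
# Hypothetical helper: chi-plot coordinates (lambda_i, chi_i) for paired data.
chi_plot_coords <- function(x, y) {
  n <- length(x)
  Fx <- Gy <- H <- numeric(n)
  for (i in seq_len(n)) {
    xo <- x[-i]
    yo <- y[-i]
    Fx[i] <- mean(xo <= x[i])               # marginal proportion for x
    Gy[i] <- mean(yo <= y[i])               # marginal proportion for y
    H[i]  <- mean(xo <= x[i] & yo <= y[i])  # joint proportion
  }
  S <- sign((Fx - 0.5) * (Gy - 0.5))
  # chi_i is undefined (NaN/Inf) when Fx or Gy equals 0 or 1
  chi <- (H - Fx * Gy) / sqrt(Fx * (1 - Fx) * Gy * (1 - Gy))
  lambda <- 4 * S * pmax((Fx - 0.5)^2, (Gy - 0.5)^2)
  data.frame(lambda = lambda, chi = chi)
}

# Simulated independent data (NOT the air pollution data)
set.seed(1)
x <- rnorm(50)
y <- rnorm(50)
cc <- chi_plot_coords(x, y)

band <- 1.78 / sqrt(length(x))  # approximate 95% control limits
plot(cc$lambda, cc$chi, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "lambda", ylab = "chi")
abline(h = c(-band, band), lty = 2)
```

Under independence most points should fall between the dashed lines; systematic excursions above or below the band indicate positive or negative association.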
Bivariate boxplots will also be employed to investigate the distribution, scale, and location of multivariate data. Bivariate boxplots are a useful extension of the traditional univariate boxplot to two dimensions. They provide a graphical representation of the relationship between two variables, allowing for the detection of bivariate outliers and the understanding of the joint distribution of the data. Here are some specific uses of bivariate boxplots:
Use cases
Detection of Bivariate Outliers
Understanding Correlation
Assessing Distribution Shape: shape, spread, and center of the bivariate distribution, highlighting any asymmetry or skewness in the data.
Comparison Across Groups: When used with different groups (e.g., different categories of a third variable), bivariate boxplots can be used to compare the bivariate distributions across these groups.
Visualizing Joint Variation: They help in visualizing how the variability in one variable is associated with the variability in another, thus giving a clearer picture of the joint variation.
R will be used for the analysis. Functions such as bvbox(), chiplot(), and faces() will be loaded from the R packages MVA and aplpack.
airpol.full<-read.table('airpoll.txt',header=TRUE)
city.names <- as.character(airpol.full[,1])
airpol.data <- airpol.full[,1:8]
Starplot
# Define a function to create a star plot for a given variable
create_star_plot <- function(variable) {
options(repr.plot.width = 6, repr.plot.height = 6) # Adjust size as needed
city_colors <- rainbow(length(unique(airpol.data$City))) # Generate unique colors for each city
city_fill_colors <- city_colors[match(airpol.data$City, unique(airpol.data$City))] # Match colors to cities
stars(as.matrix(airpol.data[variable]), draw.segments = TRUE,
labels = airpol.data[, 1], # Use city names as labels
main = variable, # Main title is variable name
scale = TRUE, # Scale plots to the same maximum radius
flip.labels = TRUE, # Labels on the same side
col.segments = city_fill_colors, # Segment colors (stars() applies these per variable, not per city)
len = 0.6) # Control the length of each star plot
legend("topright", legend = variable, fill = rainbow(length(airpol.data[, 1])),
title = "Measurement of Interest", cex = 0.5)
}
# Create star plots for each variable
for (variable in names(airpol.data)[-1]) { # Exclude the first column (city names)
create_star_plot(variable)
#plot.new()
}
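With `scale = TRUE`, `stars()` rescales each column independently so that its minimum maps to 0 and its maximum to 1. The sketch below reproduces that rescaling on a toy data frame so the plotted radii are easier to interpret; the helper name `rescale01` and the toy values are assumptions for illustration only.

```r
# Hypothetical helper: min-max rescaling of one column to [0, 1]
rescale01 <- function(v) {
  rng <- range(v, na.rm = TRUE)
  (v - rng[1]) / (rng[2] - rng[1])
}

# Toy values, not taken from the air pollution data
toy <- data.frame(a = c(10, 25, 40), b = c(2, 2.5, 4))
scaled <- as.data.frame(lapply(toy, rescale01))
print(scaled)
```

Because each variable is rescaled on its own, star sizes are comparable across cities within a plot but not across variables.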
Based on the star plots, the following is a summary of which cities stand out on each variable:
Discussion
The diverse environmental and demographic factors observed in the air pollution data for the USA illustrate the complexity of how different variables interact and impact air quality and public health. Here is a brief account of the patterns observed above.
Rainfall: Cities like Miami and New Orleans with high rainfall might experience less air pollution due to the washing effect of rain, which helps to clear particulate matter from the air.
Education: Cities with higher education levels, such as Boston and Los Angeles, often have better awareness and implementation of pollution control measures, leading to more proactive environmental policies.
Population Density (Popden): High population density in cities like New York and Los Angeles typically results in increased vehicle emissions and industrial activities, contributing to higher pollution levels.
Nonwhite Population Percentage: Cities with higher percentages of nonwhite populations, such as Birmingham and Memphis, often face environmental justice issues, where marginalized communities are disproportionately affected by pollution.
NOx Levels: High NOx levels in cities like Los Angeles and San Francisco are primarily due to heavy traffic and industrial activities, leading to more significant air pollution problems.
SO2 Levels: Elevated SO2 levels in Baltimore, Chicago, and Pittsburgh are typically associated with industrial emissions, particularly from coal-burning power plants and manufacturing industries.
Mortality Rate: The high mortality rates in cities like Baltimore and New Orleans may be linked to poor air quality, as long-term exposure to pollutants like SO2 and NOx can lead to severe health problems, including respiratory and cardiovascular diseases.
# Install and load aplpack package
if (!requireNamespace("aplpack", quietly = TRUE)) {
install.packages("aplpack")
}
library(aplpack)
# Create Chernoff faces plot with city labels
faces(airpol.data[, c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality")],
face.type = 1,
scale = TRUE,
labels = airpol.data$City,
plot.faces = TRUE,
nrow.plot = 4,
ncol.plot = 3) # Number of columns must be a whole number
## effect of variables:
## modified item Var
## "height of face " "Rainfall"
## "width of face " "Education"
## "structure of face" "Popden"
## "height of mouth " "Nonwhite"
## "width of mouth " "NOX"
## "smiling " "SO2"
## "height of eyes " "Mortality"
## "width of eyes " "Rainfall"
## "height of hair " "Education"
## "width of hair " "Popden"
## "style of hair " "Nonwhite"
## "height of nose " "NOX"
## "width of nose " "SO2"
## "width of ear " "Mortality"
## "height of ear " "Rainfall"
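The printed mapping indicates that the seven variables are recycled in order across the 15 facial features, so each variable drives two or three features. The short base R sketch below reproduces that pattern from the table above; it illustrates the recycling only and does not call into aplpack.

```r
vars <- c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality")
features <- c("height of face", "width of face", "structure of face",
              "height of mouth", "width of mouth", "smiling",
              "height of eyes", "width of eyes", "height of hair",
              "width of hair", "style of hair", "height of nose",
              "width of nose", "width of ear", "height of ear")

# Recycle the 7 variables over the 15 features, as in the printed table
mapping <- data.frame(feature = features,
                      variable = rep_len(vars, length(features)),
                      stringsAsFactors = FALSE)

# Features controlled by each variable
by_variable <- split(mapping$feature, mapping$variable)
print(by_variable[["Rainfall"]])
```

When reading the faces, it helps to remember that, for example, face height, eye width, and ear height all encode Rainfall.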
Write a short paragraph explaining the main conclusions from the scatterplot matrix.
# Define colors for the plots
plot_colors <- c("red", "blue", "green", "grey", "purple", "black")
# Assign the colors to the pairs function
pairs(
~ Rainfall + Education + Popden + Nonwhite + NOX + SO2 + Mortality,
data = airpol.data,
pch = 8, # Use star shape
col = plot_colors, # Apply colors
cex = 1.5
)
Discussion
Creating a scatterplot matrix for the air pollution dataset allows us to visualize the relationships between different variables. From the scatterplot matrix above, the following main conclusions can be drawn:
Correlation Patterns:
We can observe patterns of correlation between pairs of variables as below.
Positive correlation:
Negative correlation:
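The visual impressions from a scatterplot matrix can be checked numerically by ranking variable pairs by their correlation. The sketch below is one way to do this; the helper name `rank_pairs` is hypothetical and the demonstration uses simulated data, not the air pollution data. On the real data it could be called as `rank_pairs(airpol.data[, -1])`, assuming the first column holds the city names.

```r
# Hypothetical helper: list all variable pairs, ordered by |Spearman rho|
rank_pairs <- function(df) {
  r <- cor(df, method = "spearman")
  idx <- which(upper.tri(r), arr.ind = TRUE)  # each pair once
  out <- data.frame(var1 = rownames(r)[idx[, 1]],
                    var2 = colnames(r)[idx[, 2]],
                    rho  = r[idx],
                    stringsAsFactors = FALSE)
  out[order(-abs(out$rho)), ]
}

# Simulated demonstration data (NOT the air pollution data)
set.seed(1)
a <- rnorm(30)
demo <- data.frame(a = a,
                   b = a + rnorm(30, sd = 0.3),   # built to rise with a
                   c = -a + rnorm(30, sd = 0.3),  # built to fall with a
                   d = rnorm(30))                 # unrelated noise
res <- rank_pairs(demo)
print(res)
```

Spearman's rank correlation is used here because it is less sensitive to the skewed variables and outlying cities that scatterplots often reveal; Pearson's coefficient would be an equally reasonable choice.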
library(MVA)
## Loading required package: HSAUR2
## Loading required package: tools
# Inspecting bivariate independence between Rainfall and Mortality
chiplot(airpol.data$Rainfall, airpol.data$Mortality, main = 'Chi plot of Rainfall and Mortality', col = 'green')
# Inspecting bivariate independence between Education and Mortality
chiplot(airpol.data$Education, airpol.data$Mortality, main = 'Chi plot of Education and Mortality', col = 'red')
# Inspecting bivariate independence between NOX and Nonwhite
chiplot(airpol.data$NOX, airpol.data$Nonwhite, main = 'Chi plot of NOX and Nonwhite', col = 'grey')
# Inspecting bivariate independence between NOX and Education
chiplot(airpol.data$NOX, airpol.data$Education, main = 'Chi plot of NOX and Education', col = 'grey')
Discussion
From the above chi plots the following observations can be made:
Inspecting bivariate independence between Rainfall and Mortality
There seems to be a slight deviation from independence, as the number of points within the horizontal band is almost equal to the number above it. This might suggest a weak to moderate positive association between Rainfall and Mortality.
Inspecting bivariate independence between Education and Mortality
There is a clear deviation from independence, as fewer points fall within the horizontal band than below it. This suggests a strong negative association between Education and Mortality.
Inspecting bivariate independence between NOX and Nonwhite
There seems to be no deviation from independence, as the majority of points fall within the horizontal band.
Explain what the plot tells you about the relationship between the two variables. Do you see any outliers? If so, which cities are they?
# Load the MVA library
library(MVA)
# Specify the cities
cities <- airpol.data$City
# Identify the indices of the specified cities
city_indices <- match(cities, airpol.data$City)
# Select the variables "Education" and "Mortality"
x <- airpol.data[, c("Education", "Mortality")]
# Create the bivariate boxplot
bvbox(x, mtitle = "Bivariate Box plot", xlab = "Education", ylab = "Mortality", col = 'red')
# Add text labels for the specified cities
text(x$Education[city_indices], x$Mortality[city_indices], labels = cities, pos = 1)
# Select the variables "Mortality" and "Rainfall"
y <- airpol.data[, c("Rainfall", "Mortality")]
# Create the bivariate boxplot
bvbox(y, mtitle = "Bivariate Box plot", xlab = "Rainfall", ylab = "Mortality",col='green')
# Add text labels for the specified cities
text(y$Rainfall[city_indices], y$Mortality[city_indices], labels = cities, pos = 1)
Discussion
From the bivariate boxplot of Education and Mortality, we can observe a negative relationship between the two variables, as the ellipse tilts downwards from left to right.
Importantly, LancasPA appears as an outlier, since its point lies outside the fence.
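Outliers read off a plot can be cross-checked numerically. The sketch below uses squared Mahalanobis distances with a chi-square cutoff, which is a simpler, non-robust stand-in and not the method `bvbox()` uses to draw its fences. The helper name `flag_bivariate_outliers`, the 0.99 level, and the simulated data are all assumptions for illustration.

```r
# Hypothetical helper: indices of rows whose squared Mahalanobis distance
# exceeds a chi-square quantile (non-robust; not the bvbox() fence)
flag_bivariate_outliers <- function(xy, level = 0.99) {
  xy <- as.matrix(xy)
  d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
  which(d2 > qchisq(level, df = ncol(xy)))
}

# Simulated data with one planted outlier in row 41 (NOT the real data)
set.seed(2)
demo <- cbind(x = rnorm(40), y = rnorm(40))
demo <- rbind(demo, c(8, -8))
flagged <- flag_bivariate_outliers(demo)
print(flagged)
```

On the real data, a call such as `airpol.data$City[flag_bivariate_outliers(airpol.data[, c("Education", "Mortality")])]` would list the flagged cities, which can then be compared with the points outside the fence.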
Explain what the plot tells you about the relationships among the three variables and comment on any notable cities.
# Load the necessary library
library(MVA)
# Define the y-axis limits
ylim <- with(airpol.data, range(Mortality))
# Create the base plot
plot(Mortality ~ Education, data = airpol.data,
xlab = "Education",
ylab = "Mortality",
pch = 10,
ylim = ylim,
main = "Bubble Plot of Education vs. Mortality")
# Add bubbles representing Population Density with city labels
with(airpol.data, {
symbols(Education, Mortality, circles = Popden / 400,
inches = 0.5, add = TRUE, bg = rgb(0.1, 0.2, 0.5, 0.5))
text(Education, Mortality, labels = City, pos = 3, cex = 0.7, col = "black")
})
Discussion
The bubble plot demonstrates a negative relationship between education and mortality.
According to the literature, higher education is likely to translate into reduced mortality; this could be a result of the socio-economic privileges associated with higher education.
On the other hand, some notable densely populated cities, such as BaltimMD, combine low education levels with high mortality.
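When reading bubble sizes, note that `symbols()` with a numeric `inches` value rescales the circles so that the largest has that size, so a constant divisor applied to the third variable does not change relative sizes. Circle radii are proportional to the supplied values, so area grows with the square of the variable. The sketch below shows a common alternative in which area, not radius, is proportional to the third variable; the data are simulated and the variable names are placeholders.

```r
# Simulated stand-in values (NOT the air pollution data)
set.seed(3)
education <- runif(15, 9, 13)
mortality <- 1300 - 35 * education + rnorm(15, sd = 25)
popden    <- runif(15, 1500, 9000)

# Radius chosen so that circle AREA is proportional to popden
radius <- sqrt(popden / pi)

plot(education, mortality, type = "n",
     xlab = "Education", ylab = "Mortality",
     main = "Bubble plot with area proportional to Popden")
symbols(education, mortality, circles = radius, inches = 0.3,
        add = TRUE, bg = rgb(0.1, 0.2, 0.5, 0.5))
```

Area-based scaling tends to make dense cities look less dominant than radius-based scaling; either choice is defensible as long as it is stated alongside the plot.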