Introduction

In this assignment, several Exploratory Data Analysis (EDA) methods will be employed to uncover patterns in multivariate data, specifically air pollution data for US cities.

Methods

EDA methods such as the following are employed:

  • Chi plot

The chi plot is a graphical tool for investigating whether two variables are independent. It is used here to explore the association between pairs of variables such as Rainfall, NOX, SO2, Mortality, Education, and Population density across US cities, and it is particularly useful for detecting deviations from independence.

Use Cases

  1. Correlation Analysis: Chi plots can be used to explore the relationship between pairs of variables, such as environmental factors (Rainfall, NOX, SO2) and health outcomes (Mortality) or socio-economic variables (Education, Population density).
  2. Independence Testing: By plotting Chi plots for different pairs of variables, one can visually assess if the variables are independent or if there is some form of association.
  3. Exploratory Data Analysis: Chi plots are useful in the initial stages of data analysis to identify potential relationships or dependencies that warrant further statistical testing or modeling.
  4. Outlier Detection: They can also help in identifying outliers in the data, which may indicate special cases or errors in data collection.

How It Is Constructed

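The chi plot (Fisher and Switzer, 1985) is constructed as follows:

  1. Compute Empirical Proportions: For each observation \((x_i, y_i)\), compute over the other \(n - 1\) observations the proportion \(F_i\) with \(x_j \le x_i\), the proportion \(G_i\) with \(y_j \le y_i\), and the joint proportion \(H_i\) with both \(x_j \le x_i\) and \(y_j \le y_i\). These quantities depend only on the ranks of the data.
  2. Compute Chi Values: \(\chi_i = (H_i - F_i G_i) / \sqrt{F_i (1 - F_i) G_i (1 - G_i)}\), which measures how far the joint proportion departs from the product expected under independence.
  3. Compute Lambda Values: \(\lambda_i = 4 S_i \max\{(F_i - 1/2)^2, (G_i - 1/2)^2\}\) with \(S_i = \mathrm{sign}\{(F_i - 1/2)(G_i - 1/2)\}\), which measures the distance of observation \(i\) from the bivariate median.
  4. Plot Chi Values: Plot the pairs \((\lambda_i, \chi_i)\), both of which lie in \([-1, 1]\), together with a horizontal band around zero whose half-width is proportional to \(1/\sqrt{n}\), where \(n\) is the total number of observations. Under independence most points fall inside the band; points lying systematically above (below) the band indicate positive (negative) association.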
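The statistics behind a chi plot can be computed directly. The sketch below follows the Fisher and Switzer definitions rather than the MVA source code. The helper name chi_stats and the simulated x and y are assumptions made for illustration, and the band half-width 1.78/sqrt(n) is the value usually quoted for roughly 95% coverage.

```r
# Sketch of the chi-plot statistics; chi_stats is a hypothetical helper,
# not part of MVA. Assumes numeric vectors of equal length without ties.
chi_stats <- function(x, y) {
  n <- length(x)
  Fx <- Gy <- Hxy <- numeric(n)
  for (i in seq_len(n)) {
    others <- seq_len(n)[-i]                                # all points except i
    Fx[i]  <- mean(x[others] <= x[i])                       # marginal proportion in x
    Gy[i]  <- mean(y[others] <= y[i])                       # marginal proportion in y
    Hxy[i] <- mean(x[others] <= x[i] & y[others] <= y[i])   # joint proportion
  }
  S      <- sign((Fx - 0.5) * (Gy - 0.5))
  chi    <- (Hxy - Fx * Gy) / sqrt(Fx * (1 - Fx) * Gy * (1 - Gy))
  lambda <- 4 * S * pmax((Fx - 0.5)^2, (Gy - 0.5)^2)
  data.frame(lambda = lambda, chi = chi)
}

# Simulated, positively associated data (illustrative only)
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
res <- chi_stats(x, y)
res <- res[is.finite(res$chi), ]   # extreme points give 0/0 and are dropped

n <- length(x)
plot(res$lambda, res$chi, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "lambda", ylab = "chi")
abline(h = c(-1, 1) * 1.78 / sqrt(n), lty = 2)   # approximate 95% band
```

For real data the same helper could be called on two columns of the data frame, for example Education and Mortality, to see which cities drive the points outside the band.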

  • Bivariate Boxplots

Bivariate boxplots are also employed to investigate the distribution, scale, and location of pairs of variables. They extend the traditional univariate boxplot to two dimensions, giving a graphical representation of the relationship between two variables that allows bivariate outliers to be detected and the joint distribution of the data to be understood. Specific uses of bivariate boxplots include:

Use Cases

  • Detection of Bivariate Outliers

  • Understanding Correlation

  • Assessing Distribution Shape: the shape, spread, and center of the bivariate distribution, highlighting any asymmetry or skewness in the data.

  • Comparison Across Groups: When used with different groups (e.g., different categories of a third variable), bivariate boxplots can be used to compare the bivariate distributions across these groups.

  • Visualizing Joint Variation: They help in visualizing how the variability in one variable is associated with the variability in another, thus giving a clearer picture of the joint variation.

Analysis

R is used for the analysis. The functions bvbox() and chiplot() come from the MVA package and faces() comes from the aplpack package; stars(), pairs(), and symbols() are part of base R graphics.

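# Read the data; the first column holds the city labels
airpol.full <- read.table('airpoll.txt', header = TRUE)
city.names <- as.character(airpol.full[, 1])
# Keep the city column plus the seven measured variables
airpol.data <- airpol.full[, 1:8]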

Results and Discussions

1a. Do a star plot to display all 7 variables.

  • Write a short paragraph explaining what the plots tell you about the cities. You can include the “labels” argument to label the drawings for both the stars function and the faces function, e.g.:

Starplot

# Define a function to create a star plot for a single variable
create_star_plot <- function(variable) {
  options(repr.plot.width = 6, repr.plot.height = 6)  # Plot size (only has an effect in Jupyter/IRkernel)
  seg_color <- "red"  # col.segments colors segments (variables), not cities, so one variable needs one color

  stars(as.matrix(airpol.data[variable]), draw.segments = TRUE,
        labels = as.character(airpol.data[, 1]),  # Use city names as labels
        main = variable,  # Main title is variable name
        scale = TRUE,  # Rescale the variable to [0, 1] so the largest value gets the full radius
        flip.labels = TRUE,  # Alternate label heights from star to star to reduce overlap
        col.segments = seg_color,
        len = 0.6)  # Scale factor for the segment radii

  legend("topright", legend = variable, fill = seg_color,
         title = "Measurement of Interest", cex = 0.5)
}

# Create star plots for each variable
for (variable in names(airpol.data)[-1]) {  # Exclude the first column (city names)
  create_star_plot(variable)
  #plot.new()
}
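The loop above draws one plot per variable. Since the task asks for all seven variables in one display, a single stars() call with one segment per variable is an alternative worth sketching. The data frame toy below is a made-up stand-in for airpol.data, with the same column names but random values, so the resulting picture says nothing about real cities.

```r
# Made-up stand-in for airpol.data: one label column plus seven variables.
# The values are illustrative only and are NOT taken from airpoll.txt.
set.seed(42)
toy <- data.frame(
  City      = paste0("city", LETTERS[1:6]),
  Rainfall  = runif(6, 10, 60),
  Education = runif(6, 9, 12.5),
  Popden    = runif(6, 1500, 9000),
  Nonwhite  = runif(6, 1, 40),
  NOX       = runif(6, 1, 60),
  SO2       = runif(6, 1, 250),
  Mortality = runif(6, 800, 1100)
)

vars     <- toy[, -1]              # drop the label column
seg_cols <- rainbow(ncol(vars))    # one color per variable (segment)

# One star per city; each segment's radius is that city's value of one
# variable after the column has been rescaled to [0, 1]
loc <- stars(vars, labels = toy$City, draw.segments = TRUE,
             col.segments = seg_cols, scale = TRUE,
             main = "Star plot of all seven variables")
legend("bottomright", legend = names(vars), fill = seg_cols, cex = 0.6)
```

With the real data, vars would be airpol.data[, -1] and the labels would be the city column, which lets all seven measurements for a city be compared within one star.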

Based on the star plots, here is a summary of which cities stand out on each variable:

  1. Rainfall: miamiFL, neworlLA, and birmhmAL had the highest rainfall amounts.
  2. Education: bostonMA, losangCA, and washDC have residents with relatively high levels of education.
  3. Population Density (Popden): yorkPA, losangCA, and newyrkNY had the highest population density.
  4. Nonwhite Population Percentage: birmhmAL, memphsTN, and neworlLA lead in nonwhite population percentage.
  5. NOx Levels: losangCA and sanfrnCA have the highest NOX levels.
  6. SO2 Levels: baltimMD, chicagIL, and pittsbPA lead in SO2 levels.
  7. Mortality Rate: baltimMD and neworlLA have the highest mortality rates.

Discussion

The diverse environmental and demographic factors observed in the air pollution data for the USA illustrate the complexity of how different variables interact and impact air quality and public health. Here is a brief account of the patterns observed above.

  • Rainfall: Cities like Miami and New Orleans with high rainfall might experience less air pollution due to the washing effect of rain, which helps to clear particulate matter from the air.

  • Education: Cities with higher education levels, such as Boston and Los Angeles, often have better awareness and implementation of pollution control measures, leading to more proactive environmental policies.

  • Population Density (Popden): High population density in cities like New York and Los Angeles typically results in increased vehicle emissions and industrial activities, contributing to higher pollution levels.

  • Nonwhite Population Percentage: Cities with higher percentages of nonwhite populations, such as Birmingham and Memphis, often face environmental justice issues, where marginalized communities are disproportionately affected by pollution.

  • NOx Levels: High NOx levels in cities like Los Angeles and San Francisco are primarily due to heavy traffic and industrial activities, leading to more significant air pollution problems.

  • SO2 Levels: Elevated SO2 levels in Baltimore, Chicago, and Pittsburgh are typically associated with industrial emissions, particularly from coal-burning power plants and manufacturing industries.

  • Mortality Rate: The high mortality rates in cities like Baltimore and New Orleans may be linked to poor air quality, as long-term exposure to pollutants like SO2 and NOx can lead to severe health problems, including respiratory and cardiovascular diseases.

1b. Display all 7 variables using Chernoff faces.

# Install and load aplpack package
if (!requireNamespace("aplpack", quietly = TRUE)) {
  install.packages("aplpack")
}
library(aplpack)
# Create Chernoff faces plot with city labels
faces(airpol.data[, c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality")], 
      face.type = 1, 
      scale = TRUE, 
      labels = airpol.data$City, 
      plot.faces = TRUE, 
      nrow.plot = 4, 
      ncol.plot = 3)  # Number of columns per page must be a whole number

## effect of variables:
##  modified item       Var        
##  "height of face   " "Rainfall" 
##  "width of face    " "Education"
##  "structure of face" "Popden"   
##  "height of mouth  " "Nonwhite" 
##  "width of mouth   " "NOX"      
##  "smiling          " "SO2"      
##  "height of eyes   " "Mortality"
##  "width of eyes    " "Rainfall" 
##  "height of hair   " "Education"
##  "width of hair   "  "Popden"   
##  "style of hair   "  "Nonwhite" 
##  "height of nose  "  "NOX"      
##  "width of nose   "  "SO2"      
##  "width of ear    "  "Mortality"
##  "height of ear   "  "Rainfall"

2. Produce a scatterplot matrix for this air pollution data set.

Write a short paragraph explaining the main conclusions from the scatterplot matrix.

# Colors for the points; pairs() recycles them across observations (rows),
# so they are decorative and do not encode any variable
plot_colors <- c("red", "blue", "green", "grey", "purple", "black")

# Draw the scatterplot matrix of the seven variables
pairs(
  ~ Rainfall + Education + Popden + Nonwhite + NOX + SO2 + Mortality,
  data = airpol.data,
  pch = 8,  # Asterisk symbol
  col = plot_colors,
  cex = 1.5
)

Discussion

Creating a scatterplot matrix for the air pollution dataset allows us to visualize the relationships between different variables. From the scatterplot matrix above, the following main conclusions can be drawn:

Correlation Patterns:

We can observe patterns of correlation between pairs of variables as below.

Positive correlation:

  • Mortality is positively correlated with Nonwhite: as the percentage of the nonwhite population in a city increases, the mortality rate also tends to increase markedly.
  • Rainfall is positively correlated with Mortality: cities with higher rainfall tend to have slightly higher mortality rates.
  • SO2 is positively correlated with Mortality: cities with higher levels of sulfur dioxide tend to have moderately higher mortality rates.

Negative correlation:

  • Education is negatively correlated with Mortality: cities with higher levels of education tend to have lower mortality rates.
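Impressions read off a scatterplot matrix can be cross-checked numerically with a correlation matrix. The sketch below uses a small data frame, toy, whose values are made up for illustration and are not taken from airpoll.txt. Spearman's rank correlation is chosen here as an assumption, because it is less sensitive to a few extreme cities than the Pearson coefficient.

```r
# Made-up illustrative values; NOT the air pollution data
toy <- data.frame(
  Education = c(9.5, 10.2, 10.8, 11.3, 11.9, 12.4),
  Nonwhite  = c(25, 18, 20, 9, 12, 5),
  Mortality = c(1020, 990, 985, 930, 940, 880)
)

# Rank-based correlation matrix for every pair of columns
r <- cor(toy, method = "spearman")
print(round(r, 2))
```

For the real data the call would be along the lines of cor(airpol.data[, -1], method = "spearman"), which gives one coefficient for each panel of the scatterplot matrix.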

3. Produce chiplots for A FEW pairs of variables in the air pollution data set, and write comments about those.

library(MVA)
# Inspecting bivariate independence between Rainfall and Mortality
chiplot(x = airpol.data$Rainfall, y = airpol.data$Mortality,
        main = 'Chi plot of Rainfall and Mortality', col = 'green')

# Inspecting bivariate independence between Education and Mortality
chiplot(x = airpol.data$Education, y = airpol.data$Mortality,
        main = 'Chi plot of Education and Mortality', col = 'red')

# Inspecting bivariate independence between NOX and Nonwhite
chiplot(x = airpol.data$NOX, y = airpol.data$Nonwhite,
        main = 'Chi plot of NOX and Nonwhite', col = 'grey')

# Inspecting bivariate independence between NOX and Education
chiplot(x = airpol.data$NOX, y = airpol.data$Education,
        main = 'Chi plot of NOX and Education', col = 'grey')

Discussion

From the above chi plots the following observations can be made:

Inspecting bivariate independence between Rainfall and Mortality

There seems to be a slight deviation from independence: roughly as many points lie above the horizontal band as within it. This suggests a weak to moderate positive association between Rainfall and Mortality.

Inspecting bivariate independence between Education and Mortality

There is a clear deviation from independence: more points lie below the horizontal band than within it. This suggests a strong negative association between Education and Mortality.

Inspecting bivariate independence between NOX and Nonwhite

There seems to be no deviation from independence, as the majority of points fall within the horizontal band.
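A visual reading of a chi plot can be complemented by a formal rank-based test of independence. The sketch below applies Kendall's tau through cor.test() from base R. This is a swapped-in technique, not part of the chi plot itself, and the six Education and Mortality values are made up for illustration.

```r
# Made-up illustrative values; NOT the air pollution data
education <- c(9.5, 10.2, 10.8, 11.3, 11.9, 12.4)
mortality <- c(1020, 990, 985, 930, 940, 880)

# Kendall's tau tests H0: the two variables are independent
kt <- cor.test(education, mortality, method = "kendall")
tau <- unname(kt$estimate)
print(tau)         # sign gives the direction of the association
print(kt$p.value)  # small values speak against independence
```

A negative tau with a small p-value would agree with a chi plot in which most points fall below the band.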

4. Do a bivariate boxplot of the pair of variables “Education” and “Mortality” from the air pollution data set.

Explain what the plot tells you about the relationship between the two variables. Do you see any outliers? If so, which cities are they?

# Load the MVA library
library(MVA)

# Specify the cities
cities <- airpol.data$City

# Identify the indices of the specified cities
city_indices <- match(cities, airpol.data$City)

# Select the variables "Education" and "Mortality"
x <- airpol.data[, c("Education", "Mortality")]

# Create the bivariate boxplot
bvbox(x, mtitle = "Bivariate Box plot", xlab = "Education", ylab = "Mortality", col = 'red')

# Add text labels for the specified cities
text(x$Education[city_indices], x$Mortality[city_indices], labels = cities, pos = 1)

# Select the variables "Mortality" and "Rainfall"
y <- airpol.data[, c("Rainfall", "Mortality")]

# Create the bivariate boxplot
bvbox(y, mtitle = "Bivariate Box plot", xlab = "Rainfall", ylab = "Mortality",col='green')

# Add text labels for the specified cities
text(y$Rainfall[city_indices], y$Mortality[city_indices], labels = cities, pos = 1)

Discussion

The plot shows a negative relationship between Education and Mortality, as the ellipses tilt downwards from left to right.

Importantly, lancasPA appears as an outlier, as its point lies outside the fence.
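An outlier judged by eye can be cross-checked by ranking the cities on their distance from the center of the point cloud. The sketch below uses classical squared Mahalanobis distances from base R, which is an assumption made for simplicity and not the computation inside bvbox(). The data frame toy holds made-up values, not the air pollution data.

```r
# Made-up illustrative values; NOT the air pollution data
toy <- data.frame(
  City      = paste0("city", LETTERS[1:8]),
  Education = c(9.6, 10.1, 10.5, 10.9, 11.2, 11.6, 12.0, 9.0),
  Mortality = c(1010, 990, 975, 950, 940, 915, 890, 850)
)

xy <- as.matrix(toy[, c("Education", "Mortality")])

# Squared Mahalanobis distance of each city from the sample mean,
# taking the scale and correlation of the two variables into account
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
names(d2) <- toy$City

# Cities ordered from most to least extreme
print(round(sort(d2, decreasing = TRUE), 2))
```

Because the sample mean and covariance are themselves pulled towards extreme points, this classical version can understate how unusual a city is, which is one reason to prefer robust estimates when drawing the boxplot.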

5. Do a bubble plot with “Education” and “Mortality” on the axes and “Population Density” represented by the bubbles.

Explain what the plot tells you about the relationships among the three variables and comment on any notable cities.

# Load the necessary library
library(MVA)

# Define the y-axis limits
ylim <- with(airpol.data, range(Mortality))

# Create the base plot
plot(Mortality ~ Education, data = airpol.data,
     xlab = "Education",
     ylab = "Mortality",
     pch = 10,
     ylim = ylim,
     main = "Bubble Plot of Education vs. Mortality")

# Add bubbles representing Population Density with city labels
with(airpol.data, {
  symbols(Education, Mortality, circles = Popden / 400,
          inches = 0.5, add = TRUE, bg = rgb(0.1, 0.2, 0.5, 0.5))
  text(Education, Mortality, labels = City, pos = 3, cex = 0.7, col = "black")
})
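One design choice in a bubble plot is whether the third variable drives the radius or the area of each circle. In symbols() the circles argument sets radii, and inches rescales them so that the largest circle has the stated radius, so dividing Popden by a constant does not change the picture. The sketch below, on made-up values, makes the area proportional to population density instead. That is an assumption about what reads best, not a correction of the plot above.

```r
# Made-up illustrative values; NOT the air pollution data
toy <- data.frame(
  Education = c(9.8, 10.9, 12.1),
  Mortality = c(1000, 940, 880),
  Popden    = c(1000, 4000, 9000)
)

# Radius chosen so that circle AREA (pi * r^2) equals Popden
radius <- sqrt(toy$Popden / pi)

symbols(toy$Education, toy$Mortality, circles = radius,
        inches = 0.4,                      # radius of the largest bubble, in inches
        bg = rgb(0.1, 0.2, 0.5, 0.5),
        xlab = "Education", ylab = "Mortality",
        main = "Bubble area proportional to population density")
```

With radius scaling, a city nine times as dense gets a bubble with 81 times the area; with the square-root scaling its bubble has nine times the area.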

Discussion

The bubble plot shows a negative relationship between education and mortality.

According to the literature, higher education tends to go with lower mortality, which may reflect the socio-economic privileges associated with higher education.

On the other hand, there are notable cities, such as baltimMD, that are densely populated, score very low on education, and experience high mortality.

References:

  • For the chi plot method and its application in investigating independence, Fisher and Switzer (1985) provide the foundational treatment.
  • Bivariate boxplots as a method for exploring multivariate data were introduced by Goldberg and Iglewicz (1992), building on Tukey’s work on exploratory data analysis (Tukey, 1977).
  • R software and its libraries for statistical analysis are extensively documented in the literature, with R Core Team (2021) providing comprehensive guidance.