Objective:

Perform a principal component analysis on the Air Pollution data set and make an inference.

Data:

Air Pollution Data Set:

This data set can be downloaded through the R package HSAUR2, and it has 41 observations of 7 variables.

These variables are (names as they appear in the data frame):

SO2: SO2 content of air in micrograms per cubic metre
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year
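
The analysis below loads the data through the MVA/HSAUR2 packages; as a minimal alternative, the data set can also be pulled in directly (a sketch, assuming HSAUR2 is installed):

# Load the data set from HSAUR2 and check its structure
data("USairpollution", package = "HSAUR2")
str(USairpollution)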

The data were originally collected to assess the impact of these factors on the sulphur dioxide (SO2) pollution level in the air.

Method:

  1. Explore the data set.

  2. Find the covariance and correlation matrices and decide which one to use.

  3. Describe the eigenvalues and their fractional contributions to the variance.

  4. Give the eigenvectors for the principal components to retain.

  5. Make appropriate plots of pairs of principal components.

  6. Interpret the results and the PC plots.

Goal:

We will use principal component analysis to explore which principal components are the best predictors of air pollution (the SO2 level).


DATA ANALYSIS

Air Pollution Data Set
library(corrplot)
## corrplot 0.90 loaded
library(ggbiplot)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.0.5
## Loading required package: scales
## Loading required package: grid
library(ggplot2)
library(MVA)
## Warning: package 'MVA' was built under R version 4.0.5
## Loading required package: HSAUR2
## Warning: package 'HSAUR2' was built under R version 4.0.5
## Loading required package: tools
library(HSAUR2,quietly = TRUE)

df = USairpollution

Exploring the Data Set

head(df)
##             SO2 temp manu popul wind precip predays
## Albany       46 47.6   44   116  8.8  33.36     135
## Albuquerque  11 56.8   46   244  8.9   7.77      58
## Atlanta      24 61.5  368   497  9.1  48.34     115
## Baltimore    47 55.0  625   905  9.6  41.31     111
## Buffalo      11 47.1  391   463 12.4  36.11     166
## Charleston   31 55.2   35    71  6.5  40.75     148
nrow(df)
## [1] 41
print(row.names(df))
##  [1] "Albany"         "Albuquerque"    "Atlanta"        "Baltimore"     
##  [5] "Buffalo"        "Charleston"     "Chicago"        "Cincinnati"    
##  [9] "Cleveland"      "Columbus"       "Dallas"         "Denver"        
## [13] "Des Moines"     "Detroit"        "Hartford"       "Houston"       
## [17] "Indianapolis"   "Jacksonville"   "Kansas City"    "Little Rock"   
## [21] "Louisville"     "Memphis"        "Miami"          "Milwaukee"     
## [25] "Minneapolis"    "Nashville"      "New Orleans"    "Norfolk"       
## [29] "Omaha"          "Philadelphia"   "Phoenix"        "Pittsburgh"    
## [33] "Providence"     "Richmond"       "Salt Lake City" "San Francisco" 
## [37] "Seattle"        "St. Louis"      "Washington"     "Wichita"       
## [41] "Wilmington"

There are a total of 41 observations of 7 variables.

First, let's check the observations for a few of the cities.

df[c(12,7,15),]
##          SO2 temp manu popul wind precip predays
## Denver    17 51.9  454   515  9.0  12.95      86
## Chicago  110 50.6 3344  3369 10.4  34.44     122
## Hartford  56 49.1  412   158  9.0  43.37     127

At first glance, it looks like population and manufacturing may drive the high SO2 level in Chicago.

To get a general idea about the data, we can compare the column means of the bottom and top 5 cities, i.e., those with the minimum and maximum SO2 values.

#sorting data
newdata <- df[order(df$SO2),]
# Creating new data set for bottom 5 cities by SO2 values
first5<-head(newdata,5)
colMeans(first5)
##     SO2    temp    manu   popul    wind  precip predays 
##   9.200  64.320 405.600 667.800  10.400  44.116  96.200
# Creating new data set for top 5 cities by SO2 values
last5<-tail(newdata,5)
colMeans(last5)
##      SO2     temp     manu    popul     wind   precip  predays 
##   79.800   51.060 1346.600 1353.800   10.180   37.666  132.800

This comparison suggests that Manufacturing and Population are positively associated with the SO2 level.

library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(df[,-1])

The ggpairs plot gives us a scatter plot for each pair of variables in the lower triangle, their correlation coefficients in the upper triangle, and a density plot for each variable on the diagonal.
The density plots for Manufacturing and Population are strongly right-skewed, which is evidence that there are some outliers.
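
To back up this visual impression with a number, here is a quick sketch that computes the sample skewness of each variable by hand (so no extra package is needed); large positive values indicate strong right skew:

# Sample skewness: positive values indicate a right-skewed distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sapply(df[,-1], skew), 2)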

Box Plot Analysis

boxplot(df)

We can see that Manufacturing and Population have some notable outliers. Let's find those outliers and remove them.

# Outliers for Manufacturing
boxplot.stats(df$manu)$out
## [1] 3344 1007 1064 1692
# Outliers for Population
boxplot.stats(df$popul)$out
## [1] 3369 1513 1950

Note that we will only remove the maximum outlier in each variable from the data.

out_Manu<-boxplot.stats(df$manu)$out
out_Popul<-boxplot.stats(df$popul)$out

out_ind1 <- which(df$manu %in% max(c(out_Manu)))
out_ind2 <- which(df$popul %in% max(c(out_Popul)))

# Let's find out which cities are the outliers
df[out_ind1,]
##         SO2 temp manu popul wind precip predays
## Chicago 110 50.6 3344  3369 10.4  34.44     122
df[out_ind2,]
##         SO2 temp manu popul wind precip predays
## Chicago 110 50.6 3344  3369 10.4  34.44     122
outs <- unique(c(out_ind1, out_ind2)) # Chicago appears in both lists

We see that Chicago is the outlier in both variables, so I will remove it from the data.

df<- df[-outs,]

boxplot(df)

We can see that Manufacturing and Population size have a much bigger variance than the other variables. Therefore, it is better to standardize the data. We will also check the covariance matrix to see the variance of each variable.

Pair Plot Analysis

library(GGally)
ggpairs(df[,-1])

The density plots look less skewed after removing the outlier.

Notice that temp is negatively correlated with most of the other variables. It may make sense to use “negative temperature” for interpretation reasons.
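
If we wanted to follow that suggestion, a minimal sketch would be to flip the sign before the analysis (we keep the original temp below, so this is illustration only):

# Illustration only: replace temp by its negative so that it points the
# same way as the variables it is negatively correlated with
df_neg <- df
df_neg$temp <- -df_neg$temp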

Covariance Matrix Analysis

S <- var(df[,-1])
round(S,2)
##            temp      manu     popul   wind precip predays
## temp      52.88   -402.86    105.52  -3.57  33.39  -83.44
## manu    -402.86 107513.51 110705.83 124.07 -44.19 1406.33
## popul    105.52 110705.83 143708.40 111.08 -13.65   75.08
## wind      -3.57    124.07    111.08   2.07  -0.17    6.17
## precip    33.39    -44.19    -13.65  -0.17 141.98  159.26
## predays  -83.44   1406.33     75.08   6.17 159.26  718.88

The covariance matrix confirms what we saw in the boxplot: Population and Manufacturing have a much higher variance than the other variables, so the data need to be scaled. Otherwise, the components derived from the covariance matrix would be completely dominated by the Population and Manufacturing variables.

So, if variables are on very different scales or have very different variances, a principal component analysis of the data should be performed on the correlation matrix.
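
We can see this dominance concretely with a quick sketch that looks at the first eigenvector of the covariance matrix S computed above:

# First eigenvector of the covariance matrix S: nearly all of the weight
# falls on manu and popul, so an unscaled PCA would ignore the rest
round(eigen(S)$vectors[,1], 2)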

Correlation Matrix Analysis

C <- cor(df[,-1])

corrplot(C, method = "number")

PC Analysis

Eigenvalues

eig<- eigen(C)
round(eig$values,2)
## [1] 2.11 1.50 1.44 0.75 0.12 0.08
round(eig$values/sum(eig$values),3)
## [1] 0.352 0.250 0.240 0.125 0.020 0.013

The first three eigenvalues are greater than one and together account for about 84% of the variance of the original variables.
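
As a quick sanity check on that figure, the cumulative proportions can be computed in one line:

# Cumulative proportion of variance explained by the first k components
round(cumsum(eig$values) / sum(eig$values), 3)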

Scree Plot

#perform PCA on the six predictor variables (SO2 excluded), standardized
results <- prcomp(df[,-1], scale = TRUE)
#calculate total variance explained by each principal component
var_explained = results$sdev^2 / sum(results$sdev^2)

#create scree plot
library(ggplot2)

qplot(c(1:6), var_explained) + 
  geom_line() + 
  xlab("Principal Component") + 
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)

The scree plot does not show a clear elbow (breaking) point, so, consistent with the eigenvalues greater than one, I will retain the first three components.

PCA with Standardization

We will use the correlation matrix (cor = TRUE in princomp).

# PCA with standardizing
df.pc = princomp(df[,-1],cor=TRUE)
summary(df.pc,loadings=T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.4523018 1.2245852 1.2012367 0.8669370 0.34219183
## Proportion of Variance 0.3515301 0.2499348 0.2404950 0.1252633 0.01951588
## Cumulative Proportion  0.3515301 0.6014649 0.8419599 0.9672232 0.98673904
##                            Comp.6
## Standard deviation     0.28207400
## Proportion of Variance 0.01326096
## Cumulative Proportion  1.00000000
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## temp     0.292  0.287  0.644  0.289  0.493  0.302
## manu    -0.623  0.220  0.171 -0.181 -0.276  0.652
## popul   -0.561  0.344  0.278         0.225 -0.656
## wind    -0.378 -0.155 -0.256  0.867  0.112       
## precip         -0.484  0.638  0.179 -0.539 -0.188
## predays -0.263 -0.702        -0.300  0.571  0.126

As you can see, we have 6 principal components because the analysis uses 6 predictor variables.

Interpretation: We need to examine the coefficients (loadings) of each component.

The first component might be regarded as an index of living conditions, because Manufacturing and Population have the largest loadings (both negative); low scores on this component indicate a relatively poor, heavily industrialized environment.

The second component discloses information about the rainfall status of the cities, loading most heavily on precip and predays.

The third component has high loadings on temp and precip, both in the same direction, so this component gives an idea about the overall climate of a city.
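
Since summary() blanks out small loadings by default, here is a short sketch that prints every coefficient of the three retained components:

# Full loadings of the three retained components; unclass() drops the
# print method that hides small values
round(unclass(df.pc$loadings)[, 1:3], 2)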

ggbiplot(df.pc)

We can notice that Population & Manufacturing point in the same direction, as do Precipitation Days & Precipitation; this is a sign that each pair is positively correlated.

Another point we can draw here is that temp points in the opposite direction from wind and predays, which is a sign that they are negatively correlated with each other.

ggbiplot(df.pc,labels = rownames(df.pc$scores))

This plot shows that Philadelphia and Detroit may have the highest SO2 levels due to the effect of manufacturing and population.

It looks like Buffalo and Seattle are the cities with the most precipitation days. Cleveland, Milwaukee, and Minneapolis are the windy cities, and Salt Lake City appears to be the driest city.
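
We can check these readings against the actual scores of the cities mentioned (princomp keeps the row names of df, so we can index by city name):

# Scores of selected cities on the first three components
round(df.pc$scores[c("Buffalo", "Seattle", "Salt Lake City"), 1:3], 2)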

par(mfrow=c(2,2))
plot(df$SO2~df.pc$scores[,1],xlab="PC1",ylab="SO2", pch=19)
abline(lm(df$SO2~ df.pc$scores[,1]), col="red")
plot(df$SO2~df.pc$scores[,2],xlab="PC2",ylab="SO2",pch=19)
abline(lm(df$SO2~ df.pc$scores[,2]), col="red")
plot(df$SO2~df.pc$scores[,3],xlab="PC3",ylab="SO2",pch=19)
abline(lm(df$SO2~ df.pc$scores[,3]), col="red")
plot(df$SO2~df.pc$scores[,4],xlab="PC4",ylab="SO2",pch=19)
abline(lm(df$SO2~ df.pc$scores[,4]), col="red")

We can use the PC scores to determine which principal components are the best predictors of the SO2 level in the air.

In these score plots, we can see that PC3 shows almost no relationship with the SO2 level. Since we called PC3 the climate component, this component is not a good predictor.
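
To quantify what the score plots show, here is a quick sketch of the correlation between SO2 and each component's scores:

# Correlation of SO2 with each principal component score; PC3 should be
# close to zero
round(cor(df$SO2, df.pc$scores), 2)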


summary(lm(df$SO2~ df.pc$scores))
## 
## Call:
## lm(formula = df$SO2 ~ df.pc$scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.093  -9.182  -1.065   5.937  48.738 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          28.050      2.349  11.942 1.59e-13 ***
## df.pc$scoresComp.1   -5.521      1.617  -3.414  0.00171 ** 
## df.pc$scoresComp.2   -4.242      1.918  -2.212  0.03402 *  
## df.pc$scoresComp.3   -1.430      1.955  -0.731  0.46989    
## df.pc$scoresComp.4   -7.516      2.709  -2.774  0.00903 ** 
## df.pc$scoresComp.5  -18.246      6.864  -2.658  0.01202 *  
## df.pc$scoresComp.6   19.226      8.327   2.309  0.02735 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.86 on 33 degrees of freedom
## Multiple R-squared:  0.5297, Adjusted R-squared:  0.4442 
## F-statistic: 6.196 on 6 and 33 DF,  p-value: 0.0001971

Here we apply regression analysis using all of the principal components, because even a component with a small variance may be a significant predictor of the SO2 variable.

Except for the third component, the components are good predictors of the response variable (SO2). Specifically, the first principal component is the most predictive of SO2. It is also clear that components with small variance, such as the fourth, fifth, and sixth, still have a significant relationship with the response.


Note that we could also apply a multiple linear regression model on the original variables to determine which are the best predictors. However, there is possible collinearity between Manufacturing and Population; we can address this issue by simply dropping one of those variables.
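
For reference, a minimal sketch of that check: the correlation between manu and popul, followed by the regression refit with popul dropped.

# Collinearity check between manu and popul
cor(df$manu, df$popul)
# Refit the regression on the original variables, dropping popul
summary(lm(SO2 ~ temp + manu + wind + precip + predays, data = df))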