Topics for today!

  1. Processing the data
  2. PCA idea
  3. PCA example

Data for today's lecture

setwd("~/Desktop/R Materials/mih140/Lecture 18 - Data Processing II")
air <- read.table("AirPollution.txt", sep = "\t", header = TRUE,
                  quote = "", allowEscapes = TRUE)

Topic 1: Processing the data: Pruning, Scaling, Searching for correlation

boxplot(air) # R will make a boxplot for each numeric feature

# Seems like the second feature (SolarRadiation) has some outliers

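# Prune with the 1.5 * IQR rule: keep values in [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# quantile() defaults to min, Q1, median, Q3, max, so request Q1 and Q3 explicitly
solar_quantiles = quantile(air$SolarRadiation, probs = c(0.25, 0.75))
iqr = solar_quantiles[2] - solar_quantiles[1]
U = solar_quantiles[2] + 1.5*iqr # upper fence
L = solar_quantiles[1] - 1.5*iqr # lower fence
air_clean = air[air$SolarRadiation >= L & air$SolarRadiation <= U,]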
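
The same pruning rule can be wrapped in a small reusable function. This is only a sketch: prune_iqr() and the toy data frame are hypothetical names made up for illustration, not part of the lecture code or of any package.

```r
# Hypothetical helper: keep the rows whose value in `column` lies
# inside the fences [Q1 - k*IQR, Q3 + k*IQR].
prune_iqr <- function(df, column, k = 1.5) {
  q <- quantile(df[[column]], probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  keep <- !is.na(df[[column]]) & df[[column]] >= lower & df[[column]] <= upper
  df[keep, , drop = FALSE]
}

# Toy data (made up): the value 100 is an obvious outlier in x
toy <- data.frame(x = c(1, 2, 3, 4, 5, 100), y = 6:1)
toy_clean <- prune_iqr(toy, "x")
nrow(toy_clean) # 5 rows remain, the row with x = 100 is dropped
```

On the lecture data the equivalent call would be prune_iqr(air, "SolarRadiation").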

# Standardize the data: each feature gets mean 0 and standard deviation 1
air_clean = scale(air_clean)
boxplot(air_clean)
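
To see what scale() does, here is a minimal sketch on made-up numbers (the matrix m is an assumption, not the air data). It checks that scale() matches subtracting each column mean and dividing by each column standard deviation.

```r
set.seed(1)
m <- matrix(rnorm(20, mean = 10, sd = 3), ncol = 2) # made-up data

scaled <- scale(m) # what the lecture code calls

# The same thing by hand: center each column, then divide by its sd
centered <- sweep(m, 2, colMeans(m), "-")
manual <- sweep(centered, 2, apply(m, 2, sd), "/")

all.equal(as.numeric(scaled), as.numeric(manual)) # TRUE
round(colMeans(scaled), 10) # column means are 0
apply(scaled, 2, sd)        # column standard deviations are 1
```

Note that scale() returns a matrix, not a data frame, so after this step air_clean is a matrix.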

# Look for correlation in the data 

# install.packages("corrplot") # run this once if corrplot is not installed yet
library(corrplot)
## corrplot 0.84 loaded
corrplot.mixed(cor(air_clean), lower = "number", upper = "ellipse")

So now our data is cleaned and scaled, and we can see that many of the features are correlated. Next we would like to reduce the dimension of our data using PCA, which essentially filters out this correlation.

Topic 2: PCA Idea

Qu: What is PCA?

A way to reduce the dimension of the data, i.e. the number of features.

Qu 2: Why would you want to do that?

Analysis gets harder when there are many features (e.g. on the midterm): the curse of dimensionality, multicollinearity, etc.

Qu 3: Okay, so how does PCA reduce the dimension?

Consider this motivating example: Suppose I want to understand the stock market. What information do I use? I can track the performance of every Fortune 500 company in the US, OR I can use indices like the S&P 500 or the Dow Jones, which are single numbers that aggregate the information about many companies at once. PCA does this aggregating automatically.

Inputs to PCA: A dataset with k scaled numeric features.

Outputs of PCA: A new dataset with k “indices”, ranked by how much of the variance each one explains.

Terminology: We call these indices components, and they are ordered as follows: The first component explains as much of the variance in the data as possible. The second component explains as much of the remaining variance as possible, and so on, until all of the variance is explained.

Upside: We can use the first few components to “explain” most of the variance in the data, reducing our number of features.
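
To make the ordering of components concrete, here is a minimal sketch that builds the "indices" by hand on simulated data. It assumes the standard construction of PCA through the eigen-decomposition of the covariance matrix, which these notes do not spell out, and every name in it (z, X, e, scores) is made up for illustration.

```r
set.seed(42)
n <- 200
z <- rnorm(n) # hidden common factor shared by features a and b
X <- cbind(a = z + rnorm(n, sd = 0.3),
           b = z + rnorm(n, sd = 0.3),
           c = rnorm(n)) # c is unrelated to a and b
X <- scale(X) # k = 3 scaled numeric features

e <- eigen(cov(X)) # eigenvalues come back in decreasing order
var_explained <- e$values / sum(e$values)
scores <- X %*% e$vectors # the k new "indices", one column per component

round(var_explained, 3)         # share of variance per component
round(cumsum(var_explained), 3) # running total, ends at 1
```

Because a and b carry almost the same information, the first component should absorb most of their joint variance, which is the sense in which a few components can stand in for many features.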

Topic 3: PCA Example

# Base R has two functions for PCA: prcomp() and princomp(). We will use the second one.
pca_air = princomp(air_clean) # This does PCA
pca_air$loadings # These are the linear combinations of the base features that make the indices
## 
## Loadings:
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Wind            0.237  0.278  0.643  0.173  0.561  0.224  0.241
## SolarRadiation -0.206 -0.527  0.224  0.778 -0.156              
## CO             -0.551        -0.114         0.573  0.110 -0.585
## NO             -0.378  0.435 -0.407  0.291         0.450  0.461
## NO2            -0.498  0.200  0.197               -0.745  0.338
## O3             -0.325 -0.567  0.160 -0.508         0.331  0.417
## HC             -0.319  0.308  0.541 -0.143 -0.566  0.266 -0.314
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
plot(pca_air)

summary(pca_air) # need 6 components to get over 95% of variance
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.5103460 1.1631856 1.0841576 0.8424814 0.79869743
## Proportion of Variance 0.3338261 0.1980001 0.1720094 0.1038695 0.09335379
## Cumulative Proportion  0.3338261 0.5318262 0.7038356 0.8077051 0.90105889
##                            Comp.6     Comp.7
## Standard deviation     0.72381661 0.39011161
## Proportion of Variance 0.07666983 0.02227128
## Cumulative Proportion  0.97772872 1.00000000
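
The proportions in this summary can be recomputed from the standard deviations alone. A minimal sketch, with the numbers copied from the printed output (on the real object they are stored in pca_air$sdev); the variable names are made up:

```r
# Standard deviations copied from the summary(pca_air) output
sdev <- c(1.5103460, 1.1631856, 1.0841576, 0.8424814,
          0.79869743, 0.72381661, 0.39011161)

prop_var <- sdev^2 / sum(sdev^2) # proportion of variance per component
cum_var <- cumsum(prop_var)      # cumulative proportion
round(cum_var, 4)

# Smallest number of components whose cumulative proportion reaches 95%
n_needed <- which(cum_var >= 0.95)[1]
n_needed # 6
```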
# To access the transformed data use scores
comp_1 = pca_air$scores[,1] # Isolates first component
comp_2 = pca_air$scores[,2] # Isolates second component
plot(comp_1, comp_2) # Looks like a cloud, no collinearity
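
The "cloud" can also be checked numerically: component scores are uncorrelated with each other by construction. A minimal sketch on simulated data (toy and pca_toy are made-up names, not the air data):

```r
set.seed(7)
n <- 150
base <- rnorm(n) # shared signal that makes u and v correlated
toy <- scale(cbind(u = base + rnorm(n, sd = 0.5),
                   v = base + rnorm(n, sd = 0.5),
                   w = rnorm(n)))

pca_toy <- princomp(toy)

round(cor(toy), 2)            # original features: u and v are correlated
round(cor(pca_toy$scores), 2) # component scores: off-diagonal entries are 0
```

The same check on the lecture data would be round(cor(pca_air$scores), 2).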

PCA Example Continued

# Example: Suppose we want to predict the CO in the air from the concentrations of NO, NO2, and O3

air_particles = scale(air[,c("CO", "NO", "NO2", "O3")])
pca_air_particles = princomp(air_particles[,c("NO", "NO2", "O3")])
comp_1 = pca_air_particles$scores[,1] # Isolates first component
comp_2 = pca_air_particles$scores[,2] # Isolates second component
air_particles = data.frame(air_particles, comp_1, comp_2)

model_1 = lm(data = air_particles, CO ~ NO + NO2 + O3)
model_2 = lm(data = air_particles, CO ~ comp_1 + comp_2)
model_3 = lm(data = air_particles, CO ~ comp_1)
# Compare the fits, e.g. with summary(model_1)$r.squared. Note the very similar performance!

model_a = lm(data = air_particles, CO ~ NO)
model_b = lm(data = air_particles, CO ~ NO2)
model_c = lm(data = air_particles, CO ~ O3)
# All much worse than the one component model!
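
One way to make "performance" concrete is to compare the in-sample R-squared of each model. That metric is an assumption here, since the notes do not say how the models were compared. The sketch below uses simulated data rather than AirPollution.txt, so its numbers are not the lecture's; sim, pollution, and r2() are made-up names.

```r
set.seed(123)
n <- 300
pollution <- rnorm(n) # hypothetical shared driver of all four gases
sim <- data.frame(
  NO  = pollution + rnorm(n, sd = 0.6),
  NO2 = pollution + rnorm(n, sd = 0.6),
  O3  = 0.3 * pollution + rnorm(n, sd = 1),
  CO  = pollution + rnorm(n, sd = 0.5)
)
sim <- as.data.frame(scale(sim))

# PCA on the three predictors only, as in the lecture example
pca_sim <- princomp(sim[, c("NO", "NO2", "O3")])
sim$comp_1 <- pca_sim$scores[, 1]
sim$comp_2 <- pca_sim$scores[, 2]

# Helper: in-sample R-squared of a linear model fit on sim
r2 <- function(formula) summary(lm(formula, data = sim))$r.squared

fits <- c(all_three = r2(CO ~ NO + NO2 + O3),
          two_comps = r2(CO ~ comp_1 + comp_2),
          one_comp  = r2(CO ~ comp_1),
          NO_only   = r2(CO ~ NO),
          NO2_only  = r2(CO ~ NO2),
          O3_only   = r2(CO ~ O3))
round(fits, 3)
```

On the lecture's models the same comparison is summary(model_1)$r.squared, summary(model_2)$r.squared, and so on. The component models can never beat the full three-feature model in-sample, because the components are linear combinations of those features; the point is how little is lost.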

# To see how that one component is constructed, try:
pca_air_particles$loadings[,1]
##        NO       NO2        O3 
## 0.6769666 0.7287989 0.1028021
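
These loadings are the weights of a linear combination: the first component's score for a row is the weighted sum of that row's centered feature values. A minimal sketch that rebuilds the scores by hand, on simulated data with made-up names (toy, pca_toy, w):

```r
set.seed(99)
toy <- scale(matrix(rnorm(300), ncol = 3,
                    dimnames = list(NULL, c("NO", "NO2", "O3"))))
pca_toy <- princomp(toy)

w <- pca_toy$loadings[, 1] # weights of the first component

# Center the data with the means princomp stored, then take the weighted sum
manual_comp_1 <- sweep(toy, 2, pca_toy$center, "-") %*% w

all.equal(as.numeric(manual_comp_1), as.numeric(pca_toy$scores[, 1])) # TRUE
sum(w^2) # the weights of one component have squared sum 1
```

Read this way, the lecture's first component weights NO and NO2 heavily and O3 only lightly.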