Objective:
Perform a principal component analysis on the Mali Farm data set and make an inference.
Data:
Mali Farm Data Set:
Data from survey of 76 farmers in the Sikasso region of Mali (West Africa) on (from columns 1 through9) on family size (Family), distance in km to nearest passable road (DistRd), hectares of cotton, maize, sorghum, millet planted in the year 2000 (Cotton, Maize, Sorg,
Method:
Explore the Dataset
Find Covariance and Correlation Matrix and make decision which one to use.
Describe eigen values and fractional contributions to the variance.
Give the eigen vectors for the principal components to retain.
Make appropriate plots of pairs of principal components.
Interpret the the result and PC plots.
library(corrplot)
## corrplot 0.90 loaded
library(AMR)
## Warning: package 'AMR' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv("malifarmdata.csv")
getwd()
## [1] "C:/Users/myavu/OneDrive/Desktop/CSU/Old/STAA 574_2 Multivariate Analysis/HW/HW4"
Exploring Data set
head(df)
## Family DistRD Cotton Maize Sorg Millet Bull Cattle Goats
## 1 12 80 1.5 1.0 3 0.25 2 0 1
## 2 54 8 6.0 4.0 0 1.00 6 32 5
## 3 11 13 0.5 1.0 0 0.00 0 0 0
## 4 21 13 2.0 2.5 1 0.00 1 0 5
## 5 61 30 3.0 5.0 0 0.00 4 21 0
## 6 20 70 0.0 2.0 3 0.00 2 0 3
nrow(df)
## [1] 76
There are total of 76 data that has 9 variables.
Family column represent the family size,
DisctRd column is the distance in km from the farm to the nearest road,
Cotton, Maize, Sorghum and Millet’s columns represent the planted area in hectares,
Bull, Cattle and Goats’ columns represent the number of livestock.
Box Plot Analysis
boxplot(df)
We can see that DistRd, Family and Cattle have some notable outliers, let’s find those outliers and remove them.
# Outliers for DistRd
boxplot.stats(df$DistRD)$out
## [1] 80 500 500 100 100 90 90
# Outliers for Family
boxplot.stats(df$Family)$out
## [1] 145
# Outliers for Cattle
boxplot.stats(df$Cattle)$out
## [1] 32 21 13 104 17 14 17 23 14
I will only remove the outliers which are comparable bigger than others. So,I will remove 500 from DistRd, 145 from Family, and 104 from Cattle.
First, I will find their index number;
out_DistRd<-boxplot.stats(df$DistRD)$out
out_Family<-boxplot.stats(df$Family)$out
out_Cattle<-boxplot.stats(df$Cattle)$out
out_ind1 <- which(df$DistRD %in% max(c(out_DistRd)))
out_ind2 <- which(df$Family %in% max(c(out_Family)))
out_ind3 <- which(df$Cattle %in% max(c(out_Cattle)))
outs <- c(out_ind1,out_ind2,out_ind3)
Then, remove them from the data.
df<- df[-outs,]
boxplot(df)
This plot is more interpretable. We can see that Family, and DistRd have bigger variance than the other inputs.Therefore, it is better to normalize the data. We will also check the covariance matrix to see the variance of each variable.
Pairs Plot Analysis
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(df)
GGPairs plot provide us a scatter plot between two variables along with their correlation coefficient.
We can notice that DistRD is not correlated with other variables,and there is a correlation between Family-size & Cotton, Family-size & Bull and Bull & Cotton.
Covariance Matrix Analysis
S <- var(df)
round(S,2)
## Family DistRD Cotton Maize Sorg Millet Bull Cattle Goats
## Family 390.25 -15.02 41.40 23.29 14.23 21.73 37.19 70.17 27.67
## DistRD -15.02 582.76 -2.08 4.77 -9.79 -4.45 1.76 9.31 15.98
## Cotton 41.40 -2.08 7.67 3.57 2.09 2.16 5.84 10.64 3.99
## Maize 23.29 4.77 3.57 3.24 -0.10 0.70 2.91 6.12 0.35
## Sorg 14.23 -9.79 2.09 -0.10 3.45 1.50 1.52 0.72 1.71
## Millet 21.73 -4.45 2.16 0.70 1.50 4.91 1.91 1.78 2.14
## Bull 37.19 1.76 5.84 2.91 1.52 1.91 6.59 11.04 5.01
## Cattle 70.17 9.31 10.64 6.12 0.72 1.78 11.04 41.22 9.49
## Goats 27.67 15.98 3.99 0.35 1.71 2.14 5.01 9.49 14.99
Because the variables have different units, the data needs to be scaled. Otherwise, the column of DistRd and Family will be dominating other variables and will have the highest weight on PCs. So, we may not get the accurate interpretation.
As you can notice, the variance in DistRd is 582.76 and Family is 390.25 are too high compared to the variance in other variables.
Correlation Matrix Analysis
C <- cor(df)
corrplot((C) , method = "number")
After scaling the data, DistRd has no effect on other variables. Correlation matix satisfy the information we obtain through pairs plot.
There is a high correlation between Family-size & Cotton, Family-size & Bull and Cotton & Bull. Nevertheless, there might be a Collinearity among Family, Cotton and Bull, since they all have a high correlation.
Eigen Values
eig<- eigen(C)
round(eig$values,2)
## [1] 4.19 1.44 1.08 0.79 0.60 0.37 0.24 0.17 0.12
round(eig$values/sum(eig$values),2)
## [1] 0.47 0.16 0.12 0.09 0.07 0.04 0.03 0.02 0.01
round(cumsum(eig$values)/sum(eig$values),2)
## [1] 0.47 0.62 0.75 0.83 0.90 0.94 0.97 0.99 1.00
Eigenvalues explain the variance of the data.
The proportion of the variance explained by each principal component are: 1st PC explains about 43% 2nd PC explains about 16% 3rd PC explains about 13% 4th PC explains about 10% 5th PC explains about 7% The rest of the PCs explain 11% of the variability.
Criteria for PCs Selection
We can use scree plot and cumulative total to determine how many PC’s to keep.
Cumulative Sum
round(cumsum(eig$values)/sum(eig$values),2)
## [1] 0.47 0.62 0.75 0.83 0.90 0.94 0.97 0.99 1.00
Scree Plot
#perform PCA
results <- prcomp(df, scale = TRUE)
#calculate total variance explained by each principal component
var_explained = results$sdev^2 / sum(results$sdev^2)
#create scree plot
library(ggplot2)
qplot(c(1:9), var_explained) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot") +
ylim(0, 1)
A scree plot displays how much variation each principal components captures from the data.
We can also plot using eigen-values.
plot(eig$values, type = "b")
Based on the scree plot, and cumulative total, I would like to choose first five principal component analysis.
Eigen Vector Analysis
Let’s see the first five eigen vectors.
round(eig$vectors,2)[,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.43 -0.07 0.10 0.17 0.01
## [2,] 0.01 0.50 -0.57 0.50 -0.38
## [3,] 0.45 0.01 0.13 -0.03 -0.22
## [4,] 0.35 0.35 0.39 0.24 -0.08
## [5,] 0.20 -0.60 -0.11 -0.06 -0.64
## [6,] 0.24 -0.42 -0.12 0.62 0.53
## [7,] 0.45 0.07 -0.03 -0.15 -0.03
## [8,] 0.36 0.28 0.01 -0.37 0.22
## [9,] 0.25 -0.05 -0.69 -0.35 0.25
And now, let’s see the first five PCs.
pca<-princomp(df,cor=T)
loadings<-unclass(pca$loadings)[,1:5]
round(loadings,2)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Family 0.43 0.07 0.10 0.17 0.01
## DistRD 0.01 -0.50 -0.57 0.50 -0.38
## Cotton 0.45 -0.01 0.13 -0.03 -0.22
## Maize 0.35 -0.35 0.39 0.24 -0.08
## Sorg 0.20 0.60 -0.11 -0.06 -0.64
## Millet 0.24 0.42 -0.12 0.62 0.53
## Bull 0.45 -0.07 -0.03 -0.15 -0.03
## Cattle 0.36 -0.28 0.01 -0.37 0.22
## Goats 0.25 0.05 -0.69 -0.35 0.25
As we can see, loadings and eigen vectors have the same value. It is ok to have them different sign because they just show the directions.
For example, if we look at PC2 and Eigen vector 2, Family an DistRD have the same values but opposite directions: PC2: (-0.07 , 0.50) and (0.07 , -0.50).
All the variables except Distance to Road load about equally on the first component. This component might be called a Farm Size component.
So we can say that eigenvectors represents the directions and, eigen values determine their magnitudes.
Plot of pairs of PCs
biplot(pca,choices = c(1,2))
biplot(pca,choices = c(1,3))
biplot(pca,choices = c(1,4))
biplot(pca,choices = c(1,5))