##1 Introduction

The Idea of Dimension Reduction

The dimension of a data set is the number of variables required to describe each observation in it. Dimension reduction is the process of finding a low-dimensional data set that stays similar to the original high-dimensional one.

PCA

Principal component analysis (PCA) reduces the dimension of a large data set, with the aim of revealing the internal structure of the data in a way that explains its variance. In statistics, it is used in exploratory data analysis and for building predictive models. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, the principal components. This transformation comes at the expense of some accuracy. The idea behind this popular dimension reduction method is to trade a little accuracy for simplicity while preserving as much information as possible.

For example, two variables such as GDP and Population might be linearly correlated. The function of PCA in this case would be to replace the two correlated variables with a single feature that captures the linear relation between them.
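As a minimal sketch of this idea, the following uses simulated data. The variables `population` and `gdp` and all the numbers are made up for illustration and do not come from any real data set. Two strongly correlated variables are passed to `prcomp()`, and the first principal component alone should carry most of the variance.

```r
# Simulated, purely illustrative data: gdp is roughly proportional to population
set.seed(42)
population <- rnorm(200, mean = 50, sd = 10)
gdp <- 2 * population + rnorm(200, mean = 0, sd = 3)
toy <- data.frame(population = population, gdp = gdp)

# Standardize both variables and run PCA
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_share, 3))

# One-dimensional representation: scores on the first component
toy_1d <- toy_pca$x[, 1]
```

Keeping only `toy_1d` is the "trade a little accuracy for simplicity" step: two columns become one, and the variance left in the second component is what gets discarded.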

MCA

Multiple Correspondence Analysis (MCA) takes multiple categorical variables and reduces them by first creating a matrix with values of zero or one. If a categorical variable has more than two classes, this method binarizes it. Take, for instance, a diet feature with the classes carnivore, vegetarian and vegan. Each class gets its own binary column, so the feature is represented by 3 binary columns (e.g., carnivore is represented by the pattern [1,0,0]; vegetarian by the pattern [0,1,0]; vegan by the pattern [0,0,1]). In this way, MCA creates an individuals × variables matrix where the rows represent individuals and the columns are dummy variables. The next step is to apply a standard correspondence analysis to this matrix. The result is a set of linear combinations that carry as much information as possible from all categorical features. Because this data set has only two categorical variables, we do not apply MCA to it; the method is described here for reference only.
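The binarization step can be sketched in base R as follows. This is an illustration only: the `diet` vector is made-up data, and `model.matrix()` is used as one convenient way to build the indicator matrix, not necessarily what an MCA implementation does internally.

```r
# Made-up example data: one categorical variable with three levels
diet <- factor(c("carnivore", "vegetarian", "vegan", "vegetarian"),
               levels = c("carnivore", "vegetarian", "vegan"))

# Indicator (dummy) matrix: one 0/1 column per level, no intercept column
indicator <- model.matrix(~ diet - 1)
colnames(indicator) <- levels(diet)
print(indicator)
```

Each row contains exactly one 1, in the column of the class that individual belongs to.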

##2 Data set analysis

The data set contains GPS tracking information for 3 birds: Eric, Nico and Sanne. The data is taken from the Kaggle platform. Website link: https://www.kaggle.com/turhancankargin/bird-migration

We name the data frame bmd (bird migration data).

bmd <- read.csv("bird_migration.csv")
dim(bmd)
## [1] 61920     9
# use colnames() to view all the column names
colnames(bmd)
## [1] "X"                  "altitude"           "date_time"         
## [4] "device_info_serial" "direction"          "latitude"          
## [7] "longitude"          "speed_2d"           "bird_name"
# use head() to view the first few rows of data
head(bmd)
##   X altitude              date_time device_info_serial  direction latitude
## 1 0       71 2013-08-15 00:18:08+00                851 -150.46975 49.41986
## 2 1       68 2013-08-15 00:48:07+00                851 -136.15114 49.41988
## 3 2       68 2013-08-15 01:17:58+00                851  160.79748 49.42031
## 4 3       73 2013-08-15 01:47:51+00                851   32.76936 49.42036
## 5 4       69 2013-08-15 02:17:42+00                851   45.19123 49.42033
## 6 5       54 2013-08-15 02:47:38+00                851  -46.34448 49.42037
##   longitude  speed_2d bird_name
## 1  2.120733 0.1500000      Eric
## 2  2.120746 2.4383601      Eric
## 3  2.120885 0.5966574      Eric
## 4  2.120859 0.3101612      Eric
## 5  2.120887 0.1931321      Eric
## 6  2.120840 2.9047719      Eric
# use str() to view types of data
str(bmd)
## 'data.frame':    61920 obs. of  9 variables:
##  $ X                 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ altitude          : int  71 68 68 73 69 54 57 65 59 107 ...
##  $ date_time         : chr  "2013-08-15 00:18:08+00" "2013-08-15 00:48:07+00" "2013-08-15 01:17:58+00" "2013-08-15 01:47:51+00" ...
##  $ device_info_serial: int  851 851 851 851 851 851 851 851 851 851 ...
##  $ direction         : num  -150.5 -136.2 160.8 32.8 45.2 ...
##  $ latitude          : num  49.4 49.4 49.4 49.4 49.4 ...
##  $ longitude         : num  2.12 2.12 2.12 2.12 2.12 ...
##  $ speed_2d          : num  0.15 2.438 0.597 0.31 0.193 ...
##  $ bird_name         : chr  "Eric" "Eric" "Eric" "Eric" ...
# use summary() to view more details
summary(bmd)
##        X            altitude         date_time         device_info_serial
##  Min.   :    0   Min.   :-1010.00   Length:61920       Min.   :833.0     
##  1st Qu.:15480   1st Qu.:    2.00   Class :character   1st Qu.:833.0     
##  Median :30960   Median :   14.00   Mode  :character   Median :851.0     
##  Mean   :30960   Mean   :   52.31                      Mean   :849.3     
##  3rd Qu.:46439   3rd Qu.:   84.00                      3rd Qu.:864.0     
##  Max.   :61919   Max.   : 6965.00                      Max.   :864.0     
##                                                                          
##    direction           latitude       longitude          speed_2d     
##  Min.   :-179.998   Min.   :12.35   Min.   :-17.626   Min.   : 0.000  
##  1st Qu.: -89.680   1st Qu.:15.39   1st Qu.:-16.761   1st Qu.: 0.410  
##  Median : -10.983   Median :30.42   Median : -9.662   Median : 1.209  
##  Mean   :  -4.611   Mean   :30.23   Mean   : -8.953   Mean   : 2.559  
##  3rd Qu.:  81.965   3rd Qu.:50.00   3rd Qu.:  2.604   3rd Qu.: 3.059  
##  Max.   : 180.000   Max.   :51.52   Max.   :  4.858   Max.   :63.488  
##  NA's   :443                                          NA's   :443     
##   bird_name        
##  Length:61920      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The column names correspond to:

- X: id of the event.
- altitude: vertical distance between a reference datum and a point or object.
- date_time: year-month-day hh:mm:ss (24 hour clock).
- device_info_serial: serial number of the tracking device.
- direction: direction as an angle in the range [-180, 180].
- latitude: latitude in decimal degrees where the sample or observation was collected.
- longitude: longitude in decimal degrees where the sample or observation was collected.
- speed_2d: flying speed.
- bird_name: name of the bird.

The data set contains numerical as well as categorical variables. To make the analysis more accurate, we separate them and analyze the numerical variables with PCA and summarize the categorical variables on their own (MCA would be the matching method, but it is not applied here, as noted in the introduction). We omit the X (id) variable because it does not carry any information.

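# numerical variables, selected by name rather than by position
num_bmd <- bmd[, c("altitude", "direction", "latitude", "longitude", "speed_2d")]
# categorical variables: device serial number and bird name
cat_bmd <- bmd[, c("device_info_serial", "bird_name")]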
# Numerical variables
summary(num_bmd)
##     altitude          direction           latitude       longitude      
##  Min.   :-1010.00   Min.   :-179.998   Min.   :12.35   Min.   :-17.626  
##  1st Qu.:    2.00   1st Qu.: -89.680   1st Qu.:15.39   1st Qu.:-16.761  
##  Median :   14.00   Median : -10.983   Median :30.42   Median : -9.662  
##  Mean   :   52.31   Mean   :  -4.611   Mean   :30.23   Mean   : -8.953  
##  3rd Qu.:   84.00   3rd Qu.:  81.965   3rd Qu.:50.00   3rd Qu.:  2.604  
##  Max.   : 6965.00   Max.   : 180.000   Max.   :51.52   Max.   :  4.858  
##                     NA's   :443                                         
##     speed_2d     
##  Min.   : 0.000  
##  1st Qu.: 0.410  
##  Median : 1.209  
##  Mean   : 2.559  
##  3rd Qu.: 3.059  
##  Max.   :63.488  
##  NA's   :443

The mean altitude is 52.31, the mean direction is -4.611, the mean latitude is 30.23, the mean longitude is -8.953 and the mean speed_2d is 2.559. Both direction and speed_2d have 443 missing values.

# Categorical variables
library(ggplot2)
library(reshape2)

plot.bmd <- melt(cat_bmd, id.vars = NULL)
ggplot(data = plot.bmd) +
  geom_bar(aes(x = value)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  facet_wrap(~ variable, scales = "free", ncol = 4)

The three birds (Eric, Nico and Sanne) have almost the same number of observations.

| BirdName | DevName | Total |
|----------|---------|-------|
| Eric     | 851     | 19795 |
| Nico     | 864     | 21121 |
| Sanne    | 833     | 21004 |

# build a data.table to count the observations per bird and per device
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
X <- data.table(bmd)
X[,.(Total = length(bird_name)),.(BirdName = bird_name)]
##    BirdName Total
## 1:     Eric 19795
## 2:     Nico 21121
## 3:    Sanne 21004
X[,.(Total = length(device_info_serial)),.(DevName = device_info_serial)]
##    DevName Total
## 1:     851 19795
## 2:     864 21121
## 3:     833 21004

##3 PCA for numerical variables

Correlations and variances between variables

Correlation is a bivariate analysis that measures the strength and the direction of the association between two variables. There are three common types of correlation:

- Pearson: product-moment correlation, suited to continuous variables measured on an interval scale.
- Kendall: rank correlation, suited to ordinal (graded) data.
- Spearman: rank correlation, suited to continuous or ordinal data once they are converted to ranks.

corp <- cor(num_bmd, method="pearson", use="complete.obs") 
# cork <- cor(num_bmd, method="kendall", use="complete.obs")
# cors <- cor(num_bmd, method="spearman", use="complete.obs")
print(corp, digits= 1)
##           altitude direction latitude longitude speed_2d
## altitude     1.000    -0.009     0.33     0.332    0.207
## direction   -0.009     1.000    -0.02    -0.022   -0.005
## latitude     0.328    -0.021     1.00     0.983    0.024
## longitude    0.332    -0.022     0.98     1.000    0.001
## speed_2d     0.207    -0.005     0.02     0.001    1.000
# print(cork, digits= 1)
# print(cors, digits= 1)

We use the Pearson correlation for the rest of the analysis.
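To make explicit what `cor(..., method = "pearson")` computes, here is a small sketch on made-up vectors `x` and `y` (they are not taken from the bird data). The Pearson coefficient is the covariance divided by the product of the two standard deviations, and the Spearman coefficient is the Pearson coefficient of the ranks.

```r
# Made-up illustrative data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.5)

# Pearson: covariance scaled by both standard deviations
pearson_manual <- cov(x, y) / (sd(x) * sd(y))

# Spearman: Pearson correlation applied to the ranks
spearman_manual <- cor(rank(x), rank(y), method = "pearson")

# Both manual values agree with the built-in methods
print(all.equal(pearson_manual, cor(x, y, method = "pearson")))
print(all.equal(spearman_manual, cor(x, y, method = "spearman")))
```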

library(corrplot) 
## corrplot 0.84 loaded
corrplot.mixed(corp, lower = "square", upper = "number", tl.col = "black")

The results show a very strong positive correlation between latitude and longitude (0.98), moderate positive correlations between altitude and latitude (0.33) and between altitude and longitude (0.33), and a weak positive correlation between altitude and speed_2d (0.21). All other pairs are essentially uncorrelated.

corrplot::corrplot(corp, method= "color", order = "hclust", tl.pos = 'n')

More plots

source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(num_bmd)

## $r
##           direction speed_2d altitude latitude longitude
## direction         1                                     
## speed_2d    -0.0045        1                            
## altitude    -0.0088     0.21        1                   
## latitude     -0.021    0.024     0.33        1          
## longitude    -0.022    0.001     0.33     0.98         1
## 
## $p
##           direction speed_2d altitude latitude longitude
## direction         0                                     
## speed_2d       0.26        0                            
## altitude      0.029        0        0                   
## latitude    1.1e-07  2.2e-09        0        0          
## longitude   3.3e-08      0.8        0        0         0
## 
## $sym
##           direction speed_2d altitude latitude longitude
## direction 1                                             
## speed_2d            1                                   
## altitude                     1                          
## latitude                     .        1                 
## longitude                    .        B        1        
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

PCA itself is run with prcomp() from base R; the factoextra library is used to visualize the results.

# load factoextra for the PCA plots
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# omit rows with missing values, then center, scale and run PCA
pca <- prcomp(na.omit(num_bmd), center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.4757 1.0530 0.9996 0.8355 0.12806
## Proportion of Variance 0.4355 0.2218 0.1998 0.1396 0.00328
## Cumulative Proportion  0.4355 0.6573 0.8571 0.9967 1.00000
#plot 
fviz_eig(pca)

The scree plot shows the percentage of variance explained by each component. From summary(pca), the proportions of variance for PC1 to PC5 are 0.4355, 0.2218, 0.1998, 0.1396 and 0.00328. PC1 therefore explains about 44% of the variance, and 3 components are needed to explain about 85% of it (cumulative proportion 0.8571).
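These proportions can be recomputed by hand from the standard deviations. The sketch below copies the five standard deviations from the summary(pca) output rather than reloading the data; the variable names `sdev`, `prop_var`, `cum_var` and `n_comp` are introduced here for illustration only.

```r
# Standard deviations of the five components, copied from summary(pca)
sdev <- c(1.4757, 1.0530, 0.9996, 0.8355, 0.12806)

# The variance of each component is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)
print(round(prop_var, 4))
print(round(cum_var, 4))

# Smallest number of components whose cumulative share reaches 85%
n_comp <- which(cum_var >= 0.85)[1]
print(n_comp)
```

With a fitted object, the same calculation would start from `pca$sdev` instead of the hard-coded vector.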

library(gridExtra)
p_fv <- fviz_pca_var(pca, col.var = "dodgerblue1", repel = TRUE)
p_fi <- fviz_pca_ind(pca, col.ind = "cos2", geom = "point", gradient.cols = c("#FFBD2E", "#79EBD6", "#4BBAE9"))
grid.arrange(p_fv, p_fi, nrow=1)

On the left we can see a graphical representation of the relations between variables, in which positively correlated variables are grouped together. The plot shows that the variables fall into three groups. The first is latitude and longitude, which confirms the correlation results. The second group is speed_2d and the third is altitude. A special case is direction, whose arrow stays very close to the center of the plot, which suggests that it is barely represented on the first two components.

Let’s check the variable contributions to the first 4 dimensions, the first 3 of which explain about 85% of the variance.

library(pdp)
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pca, choice = "var", axes = 3)
PC4 <- fviz_contrib(pca, choice = "var", axes = 4)
grid.arrange(PC1, PC2, PC3, PC4)

The most important component consists of latitude and longitude. The second one consists almost only of speed_2d, and the third one almost only of direction. The fourth one consists of altitude and speed_2d.
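The contribution percentages can also be computed by hand. The sketch below uses the built-in `USArrests` data as a stand-in, since it ships with R (it is not the bird data). For a `prcomp` fit, each column of `rotation` is a unit vector, so the squared loadings times 100 give the percentage contribution of each variable to that component. This should correspond to what fviz_contrib() plots, but that correspondence is an assumption here and was not checked against the package source.

```r
# Stand-in data set shipped with R (not the bird data)
fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix is a unit vector, so squared
# loadings times 100 are percentage contributions to that component
contrib <- fit$rotation^2 * 100
print(round(contrib, 1))

# Variable that contributes most to each component
top_var <- rownames(contrib)[apply(contrib, 2, which.max)]
names(top_var) <- colnames(contrib)
print(top_var)
```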

library(ClusterR)
## Loading required package: gtools
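# standardize the complete rows (zero mean, unit variance)
num_bmd_cs <- center_scale(na.omit(num_bmd), mean_center = TRUE, sd_scale = TRUE)
# keep the scores on the first two principal components
num_bmd_pca <- princomp(num_bmd_cs)$scores[,1:2] 
# k-means on the two-dimensional scores, with 6 clusters
num_bmd_km <- KMeans_rcpp(num_bmd_pca, clusters=6, num_init=5, max_iters=10000) 

# treat the cluster labels as a factor so each cluster gets its own discrete color
cluster_color <- factor(num_bmd_km$clusters)

picture <- ggplot(as.data.frame(num_bmd_pca)) +
      geom_point(aes(x = Comp.1, y = Comp.2, color = cluster_color)) +
      theme(legend.position = "right") +
      ggtitle("Numerical variables clusters")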

picture

As we can see from the plot, most of the points are gathered in the bottom-left corner.
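For readers without ClusterR, the same sequence of steps (standardize, project onto the first two components, cluster the scores) can be sketched with base R only. This is a variation, not the code used in this analysis: `stats::kmeans()` replaces `KMeans_rcpp()`, the built-in `iris` measurements stand in for the bird data, and the choice of 3 clusters is arbitrary.

```r
set.seed(123)  # k-means starts from random centers

# Stand-in numerical data shipped with R (not the bird data)
num_data <- na.omit(iris[, 1:4])

# Standardize, run PCA, and keep the scores on the first two components
scores <- prcomp(num_data, center = TRUE, scale. = TRUE)$x[, 1:2]

# Cluster the two-dimensional scores
km <- kmeans(scores, centers = 3, nstart = 5, iter.max = 100)

# Number of observations in each cluster
print(table(km$cluster))
```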

Conclusions

Although the percentage of variance explained in the 2-dimensional space (about 66%) is not high enough to assume that all points have been assigned correctly, the results seem reasonable. Latitude is strongly related to longitude, and both are moderately related to altitude. So as the birds fly further north and east during migration, they tend to be at higher altitude.