LopezJD_T2_Assignment

Discussion Questions

1. Use the R command X <- iris to assign Fishers’ iris dataset to the data matrix X. Using the head(X) command summarize what each column of the dataset is measuring and represents. Assign Y as a new matrix of dimension 150 by 4 which has the values of X without the species label.

X<- iris
#summary
head(X)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

#Y matrix
Y <- iris[,1:4]

2. Compute and interpret (in summary English) each of the summary statistics \(\bar{X}\),\(S\),\(R\) using R.

Calculating \(\bar{X}\)

#mean 
apply(Y,2,mean)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Calculating \(S\)

#covariance
var(Y)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
## Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
## Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
## Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

Calculating \(R\)

#correlation
cor(Y)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

3. Visualize the dataset by making a scatterplot of Sepal Length vs. Sepal Width, a scatterplot of Petal Length vs. Petal Width. The pairs function and page 422 is useful here. Use your plots and stats from #2 to comment on any evident correlations.

pairs(Y, main = "Plotting Pairs of Scatterplot",
      pch = 21, col = c("red", "yellow", "blue")[unclass(X$Species)],oma=c(3,3,3,15))
par(xpd=T)
legend("bottomright",fill=c("red", "yellow", "blue"),legend = levels(X$Species))

As we had noticed in the correlation matrix before, there is a great degree of correlation among Sepal.Length and Petal.Length, as well as Sepal.Length and Petal Width. We can also notice a strong correlation among Petal.Length and Petal.Width. Being a positive strong correlation, it tells us that as Sepal Length grows larger, so does Petal Length and Width.

However, we can also notice that Sepal.Length has a high correlation with Length and Width of the petal because of the versicolor and virginica species. We can notice that setosa species does not help to sustain a high correlation. In the following graph, we will observe the correlation graph and matrix without setosa.

df2<-iris[iris$Species!="setosa",]
df2$Species<-droplevels(df2$Species)



pairs(df2[,1:4], main = "Plotting Pairs of Scatterplot",
      pch = 21, col = c("red", "blue")[df2$Species], oma=c(3,3,3,15))

cor(df2[1:4])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.5538548    0.8284787   0.5937094
## Sepal.Width     0.5538548   1.0000000    0.5198023   0.5662025
## Petal.Length    0.8284787   0.5198023    1.0000000   0.8233476
## Petal.Width     0.5937094   0.5662025    0.8233476   1.0000000

As we can notice, the correlation of Sepal Length and Sepal.Width actually increased from -0.11 to 0.55 by taking out Setosa out of the equation.

Numerical Questions

Instructions: Answer the following using the R statistical computing platform. Your answer should include the code you wrote plus the output of such code and English rhetoric / coding comments where necessary.

N1. The iris dataset is a native dataset to R. Obtain a matrix of scatter plots for the overall dataset (without species), and the three subsets according to species. Obtain an average of the four characteristics by species and, using the faces function from the aplpack package, plot the Chernoff faces. Do the Chernoff faces offer enough insight to identify the group? [20 pts]

Overall DataSet

library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(iris, columns=1:4) + 
  ggtitle("Corrleation Iris Data + Density Plots by Species")

According to Species

ggpairs(iris, columns=1:4, aes(color=Species)) + 
  ggtitle("Corrleation Iris Data + Density Plots by Species")

Average of Characteristics by Species

iris%>%group_by(Species)%>%summarise(avg.Sepal.Length=mean(Sepal.Length),avg.Sepal.Width = mean(Sepal.Width),avg.Petal.Length = mean(Petal.Length), avg.Petal.Width = mean(Petal.Width))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 5
##   Species    avg.Sepal.Length avg.Sepal.Width avg.Petal.Length avg.Petal.Width
##   <fct>                 <dbl>           <dbl>            <dbl>           <dbl>
## 1 setosa                 5.01            3.43             1.46           0.246
## 2 versicolor             5.94            2.77             4.26           1.33 
## 3 virginica              6.59            2.97             5.55           2.03

Using faces from aplpack. Note: We will be plotting 100 faces as a guideline of the usefulness of the graph.

library(aplpack)

aplpack::faces(iris[1:100,1:4])

## effect of variables:
##  modified item       Var           
##  "height of face   " "Sepal.Length"
##  "width of face    " "Sepal.Width" 
##  "structure of face" "Petal.Length"
##  "height of mouth  " "Petal.Width" 
##  "width of mouth   " "Sepal.Length"
##  "smiling          " "Sepal.Width" 
##  "height of eyes   " "Petal.Length"
##  "width of eyes    " "Petal.Width" 
##  "height of hair   " "Sepal.Length"
##  "width of hair   "  "Sepal.Width" 
##  "style of hair   "  "Petal.Length"
##  "height of nose  "  "Petal.Width" 
##  "width of nose   "  "Sepal.Length"
##  "width of ear    "  "Sepal.Width" 
##  "height of ear   "  "Petal.Length"

To understand how nicely the program performs, let’s compare face #16 with face #2

df3<-rbind(apply(iris[1:100,1:4],2,mean),
apply(iris[1:100,1:4],2,sd),
iris[c(2,16),1:4])

rownames(df3)<-c("mean","Sd","#2","#16")

df3

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## mean    5.4710000   3.0990000     2.861000   0.7860000
## Sd      0.6416983   0.4787389     1.449549   0.5651531
## #2      4.9000000   3.0000000     1.400000   0.2000000
## #16     5.7000000   4.4000000     1.500000   0.4000000

We can clearly notice that Sepal Length for observation #2 is well below 1 standard deviation from the average, whereas #16 is above the average. The height of face 16 is noticeably larger than #2. Similarly, #16 has an incredible large ratio of Sepal.Width to average, whereas #2 is just about the average. This can also be appreciated in the width of the face. When it comes to petal length, both observations are about 1 st. dev. below average. This can be noticed in the small eyes they have. Finally, petal width is significantly smaller than average for #2 than it is for #16, and this can be seen in the height of their mouth.

N2. Equation 14.7 gives the Mahalanbonis distance. Use this to obtain the “distance” of the observations from the entire dataset for the board stiffness dataset and investigate for outliers. Repeat this for the iris dataset as well. [20 pts]

from SMLOutliers Library, using Dataset called: stiff

library(SMLoutliers)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

#loading data
data(stiff)

#calculating Maha distance for Data 
mahaDistance<-mahalanobis(stiff,colMeans(stiff),var(stiff))
stiff$MahaDistance<- mahaDistance

stiff1<-arrange(stiff,desc(MahaDistance))

#identifying outliers (first 5 outliers)
stiff1$outliers<- c(rep("outlier",5),rep("not-outlier",nrow(stiff1)-5))

#plotting outliers using PCA
pca_estudio<-prcomp(stiff1[,1:4], center = TRUE,scale. = TRUE)


ggplotly(plotPCA(pca_estudio,vectorVariables = c("x1","x2"),groups =stiff1$outliers))

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:plotly':
## 
##     arrange, mutate, rename, summarise

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

#viewing outliers
stiff1[1:5,]

##     x1   x2   x3   x4 MahaDistance outliers
## 1 1954 2149 1180 1281    16.847407  outlier
## 2 2983 2794 2412 2581    12.264755  outlier
## 3 2276 2189 1547 2111     9.898038  outlier
## 4 2119 1700 1815 2222     7.616644  outlier
## 5 2326 2301 2065 2234     6.283763  outlier

Outliers in IRIS Dataset

iris_forMaha<-iris
#calculating Maha distance for Data 
mahaDistance1<-mahalanobis(iris_forMaha[,1:4],colMeans(iris_forMaha[,1:4]),var(iris_forMaha[,1:4]))
iris_forMaha$MahaDistance<- mahaDistance1

iris_forMaha1<-arrange(iris_forMaha,desc(MahaDistance))


#identifying outliers (first 5 outliers)
iris_forMaha1$outliers<- c(rep("outlier",5),rep("not-outlier",nrow(iris_forMaha1)-5))

#plotting outliers using PCA
pca_estudio1<-prcomp(iris_forMaha1[,1:4], center = TRUE,scale. = TRUE)


#viewing ggplotly
ggplotly(plotPCA(pca_estudio1,vectorVariables = c("Petal.Length","Sepal.Width"),groups =iris_forMaha1$outliers)+
  xlim(-4,4),
  tooltip="all")

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

#viewing outliers
iris_forMaha1[1:5,]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species MahaDistance
## 1          7.9         3.8          6.4         2.0 virginica     13.10109
## 2          6.1         2.6          5.6         1.4 virginica     12.88033
## 3          7.7         3.8          6.7         2.2 virginica     12.81307
## 4          6.9         3.1          5.1         2.3 virginica     12.44138
## 5          4.5         2.3          1.3         0.3    setosa     11.42403
##   outliers
## 1  outlier
## 2  outlier
## 3  outlier
## 4  outlier
## 5  outlier

Yet another way of viewing outliers using Maha Distance:

#viewing maha Distance with Chi Square Distribution
iris_forMaha2<-arrange(iris_forMaha1,MahaDistance)
### por aqui inicia
df_chiSquare<-data.frame(x=qchisq((1:nrow(iris_forMaha1) - 1/2) / nrow(iris_forMaha1), df = 6),
           y=sort(mahaDistance1),
           z=iris_forMaha2$outliers)

df_chiSquare$fitted.values<- lm(y~x,df_chiSquare)$fitted.values

ggplot(df_chiSquare)+geom_point(aes(x=x,y=y,col=z),size=2)+ 
  scale_color_manual(values = c("white", "red"))+
  geom_line(aes(x=x,y=fitted.values),col="blue")+labs(y="Ordered Distances",x="Quantile")+theme_dark()

Discussion Questions

Instructions: Answer the following in a mathematically sound argument (proof) using combinations of English / symbols as needed.

T1. Let Y be a random vector with the following mean vector and variance/covariance matrix:

\[\mu=\begin{equation*} \mathbf{}\left[\begin{matrix} -1\\ 1\\ 4 \end{matrix}\right] \end{equation*} \]

\[\sum=\begin{equation*} \mathbf{}\left[\begin{matrix} 7 & 4 & 0\\ 4 & 8 & 2\\ 0 & 2 & 9 \end{matrix}\right] \end{equation*} \]

Define the univariate transformations \(x=y_1-3y_2+y_3\), \(z_1=y_1+y_2+y_3\) ,\(z_2=4y_1+y_2-y_3\) . Form the random vector \(z=(z_1,z_2)'\) .

\[\mu=\begin{equation*} \mathbf{}\left[\begin{matrix} E(y_1)\\ E(y_2)\\ E(y_3) \end{matrix}\right] \end{equation*} \]

Defining Transformations
\(x=y_1-3y_2+y_3\)
\(x=-1-3(1)+4\)
\(x=0\)

\(z_1=y_1+y_2+y_3\)
\(z_1=-1+1+4\)
\(z_1=4\)

\(z_2=4y_1+y_2-y_3\)
\(z_2=4(-1)+1-4\)
\(z_2=-7\)

Creating z vector \[z=\begin{equation*} \mathbf{}\left[\begin{matrix} 4\\ -7 \end{matrix}\right] \end{equation*} \]

Compute each of \(E(x)\),\(var(x)\),\(E(z)\),\(cov(z)\).

\(E(x)=0\)

\[var(x) = A*CovMatrix*A^T\]

A<-c(1,-3,1) CovMatrix<- matrix(c(7,4,0,4,8,2,0,2,9), byrow = T,ncol=3) t(as.matrix(A))%*%CovMatrix%*%as.matrix(A)

## [,1] ## [1,] 52

\(var(x) = 52\)

\[E(z)=\begin{equation*} \mathbf{}\left[\begin{matrix} 4\\ -7 \end{matrix}\right] \end{equation*} \]

\[var(z)=\begin{equation*} \mathbf{}\left[\begin{matrix} var(z_1) & cov(z1,z2) \\ cov(z1,z2) & var(z_2) \end{matrix}\right] \end{equation*} \]

#variance of z1 A<-c(1,1,1) CovMatrix<- matrix(c(7,4,0,4,8,2,0,2,9), byrow = T,ncol=3) t(as.matrix(A))%*%CovMatrix%*%as.matrix(A)

## [,1] ## [1,] 36

\(var(z_1) = 36\)

#variance of z2 A<-c(4,1,-1) CovMatrix<- matrix(c(7,4,0,4,8,2,0,2,9), byrow = T,ncol=3) t(as.matrix(A))%*%CovMatrix%*%as.matrix(A)

## [,1] ## [1,] 157

\(var(z_2) = 157\)

Total Covariance Matrix

\(A^T\sum A = var(z)\)

#total Covariance matrix m<-matrix(c(1,4,1,1,1,-1),nrow=2) t(t(m))%*%matrix(c(7,4,0,4,8,2,0,2,9),nrow = 3)%*%t(m)

## [,1] [,2] ## [1,] 36 47 ## [2,] 47 157

\[var(z)=\begin{equation*} \mathbf{}\left[\begin{matrix} 36 & 47 \\ 47 & 157 \end{matrix}\right] \end{equation*} \]

T2. Let X be drawn from a 3-dimensional normal distribution with mean vector \(\mu'=[-1,0,4]\) and variance/covariance matrix:
\[\sum=\begin{equation*} \mathbf{}\left[\begin{matrix} 1 & -7 & 0\\ -7 & 9 & 0\\ 0 & 0 & 2 \end{matrix}\right] \end{equation*} \] We want to determine what combinations or pairs of the random variables are independent. Determine if each pair of random vectors are independent:
a) \(X_1\) and \(X_2\)
b) \(X_2\) and \(X_3\)
c) (\(X_1\),\(X_3\) ) and \(X_2\)

\(x_1\)and\(x_2\) have a covariance of \(Cov_{12}=-7\), which is different from \(var(x_1)+var(x_2)=10\). This means that they are not independent. They are correlated.

\(x_2\)and\(x_3\) have a covariance of \(Cov_{23}=0\), which is different from \(var(x_2)+var(x_3)=11\). This means that they are not independent. They are correlated.

\(x_1\)and\(x_3\) have a covariance of \(Cov_{13}=0\), which is different from \(var(x_1)+var(x_3)=3\). This means that they are not independent. They are correlated. And we had also seen that \(x_1\) and \(x_2\) where correlated, as \(x_2\) and \(x_3\), therefore, (\(X_1\),\(X_3\) ) and \(X_2\) must also be correlated.

Is it true that each of \(X_1\),\(X_2\),\(X_3\) are individually univariate normal given that \(X\) is multivariate normal? Give reasoning on why or why not. [40 pts]

Proof taken from: https://www.youtube.com/watch?v=YgExEVji7xs&t=1057s

Let us recall the formula for PDF of Univariate Gaussian is

\[\frac{1}{(2\pi\sigma^2)^{1/2}}e^{-0.5*(\frac{x-\mu}{\sigma})^2}\]

We will see from the multiplication of univariate normal distirbution pdf’s how we can approximate the multivariate normal distribution for 3 variables, whose formula is :

\[\frac{1}{(2\pi)^{3/2}(|\sum|)^{1/2}}e^{-0.5*({x-\mu})^{T}\sum^{-1}(x-\mu)}\]

And this will prove that a mvn pdf is composed of individual normal distribution pdf’s.

Let’s start out by joining 2 independent univariate normald distribution pdf’s:

\(f(x_1,x_2) = f(x1)*f(x2)\)

\[f(x_1,x_2) = \frac{1}{(2\pi\sigma^2)^{1/2}}e^{-0.5*(\frac{x-\mu}{\sigma})^2} * \frac{1}{(2\pi\sigma^2)^{1/2}}e^{-0.5*(\frac{x-\mu}{\sigma})^2} \]

We can notice that anything to the left of the exponent is static, let’s explore the result of that multiplication first:

\[\frac{1}{(2\pi)^{2/2}(\sigma_1^2\sigma_2^2)^{1/2}}\]

then the exponent part:

\[e^{-0.5[(\frac{x_1-\mu_1}{\sigma_1})^2+(\frac{x_2-\mu_2}{\sigma_2})^2]}\]

now, we know that the determinant of the covariance matrix of \(f(x_1,x_2)\) is:

\[|\sum|=\begin{equation*} \mathbf{}|\left[\begin{matrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{matrix}\right] \end{equation*}| \]

Now, we know they are independent, therefor \(\sigma_{12}=0\) So, the determinant is :

\(|\sum|= \sigma_1^2\sigma_2^2 - 0\) \(|\sum|= \sigma_1^2\sigma_2^2\)

Then, the square root of the determinant, is:

\(\sqrt{|\sum|}= \sqrt{\sigma_1^2\sigma_2^2}\)

Hence, we can subsitute in our static component of our equation to be:

\[\frac{1}{(2\pi)^{2/2}\sqrt{|\sum|}}\]

from here we can also generalize that \(2\pi\) is raised to \(d/2\) where d is the different number of univariate pdf’s together, in the case of 3 univariate distributions, it would be raised to \(3/2\)

Then, we have just seen how the constant term of the mvn for 3 univariate normal pdf’s is actually derived from such distributions, leaving us with:

\[\frac{1}{(2\pi)^{3/2}(|\sum|)^{1/2}}\]

now, let’s take a look into the exponent part:

We know from the univariate pdf, the formula could as well be considered:

\[(x-\mu)(\sigma^2)^{-1}(x-\mu)\] so, that formula multivariate case and in matrix form will become:

\[(x-\mu)^T\sum^{-1}(x-\mu)\]

therefore, joining both static and exponent forms formulas, we can conclude the formula becomes:

\[\frac{1}{(2\pi)^{3/2}(|\sum|)^{1/2}}e^{-0.5*({x-\mu})^{T}\sum^{-1}(x-\mu)}\]

End of proof.

LopezJD_T2_Assignment

Jorge D Lopez-Saavedra

9/20/2020