We load up a bunch of packages to start.

library(HSAUR2)
library(MVA)
library(car)
library(MASS)
library(rgl)
library(ggplot2)
library(plyr)
library(psych)
library(lattice)
library(fpp)
library(forecast)
#library(rpart)
#library(mice)
#library(h2o)
#library(party)
#library(randomForest)
#library(neuralnet)
#library(fpp)
#library(nnet)
#library(forecast)
options(warn=-1)

Remember, in R, matrices can be specified directly.

x=matrix(c(65,72,84,61,100,110,140,110,0,1,1,0,30,50,40,30),nrow=4)
x
##      [,1] [,2] [,3] [,4]
## [1,]   65  100    0   30
## [2,]   72  110    1   50
## [3,]   84  140    1   40
## [4,]   61  110    0   30
#convert to data frame from matrix
newx=as.data.frame(x)
colnames(newx)=c("Ht","Wt","Gender","Age")
newx
##   Ht  Wt Gender Age
## 1 65 100      0  30
## 2 72 110      1  50
## 3 84 140      1  40
## 4 61 110      0  30

We can use read.csv(), read.table(), or similar functions to pull data into R. Note: to read from the clipboard, try mydata=read.table(file="clipboard", header=TRUE).

mydata=read.csv("c:/Users/Lawrence/Google Drive/Courses & Syllabi/Texas Tech/ISQS 6348/mydata.csv")

Here, we have measurements of chest, waist, hips, and gender. Could body size and body shape be summarized in some way by combining the three measurements into a single number? Principal Components Analysis (PCA) is one method that might be used. (In fact, Exploratory Factor Analysis with orthogonal rotation is essentially the same thing, but Confirmatory Factor Analysis is not; we will discuss these later.)

mypca=princomp(x=mydata[1:20,1:3])
summary(mypca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3
## Standard deviation     4.1730777 2.0412227 1.50543412
## Proportion of Variance 0.7302475 0.1747181 0.09503436
## Cumulative Proportion  0.7302475 0.9049656 1.00000000
## 
## Loadings:
##       Comp.1 Comp.2 Comp.3
## Chest -0.516         0.854
## Waist -0.782 -0.444 -0.438
## Hips  -0.351  0.894 -0.280
plot(mypca)
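
As a quick check (an addition, not part of the original output), the principal component scores are uncorrelated by construction:

#off-diagonal correlations of the component scores should be (near) zero
round(cor(mypca$scores), 3)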

We could also see whether observations cluster together. Perhaps there are two clusters associated with the chest measurement.

mykmeans=kmeans(mydata$Chest, 2, iter.max=100, nstart=25)
plot(mydata$Chest, col=mykmeans$cluster)
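
One common heuristic for choosing the number of clusters (a sketch added here, not part of the original analysis) is to plot the total within-cluster sum of squares against k and look for an elbow:

#total within-cluster sum of squares for k = 1..5 clusters
wss=sapply(1:5, function(k) kmeans(mydata$Chest, k, iter.max=100, nstart=25)$tot.withinss)
plot(1:5, wss, type="b", xlab="Number of clusters k", ylab="Total within-cluster SS")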

Are there subtypes of body shapes among the men and among the women, within which individuals are of similar shape and between which body shapes differ? Cluster analysis can help answer this.

dm=dist(mydata[1:3])
plot(cs <- hclust(dm, method = "single"))

plot(cs <- hclust(dm, method = "complete"))

plot(cs <- hclust(dm, method = "average"))
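
If we want hard cluster assignments rather than just dendrograms, cutree() cuts a tree at a chosen number of groups; here is a sketch comparing a two-group cut of the complete-linkage tree with gender:

#cut the complete-linkage tree into two groups and cross-tabulate with gender
groups=cutree(hclust(dm, method="complete"), k=2)
table(groups, mydata$Gender)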

Sometimes, we may want to use factor analysis to reduce the data to one or more predetermined factors. The difference between CFA and EFA using principal components is that CFA assumes the number of factors (constructs) is known in advance and then tests that assumption.

So let’s say we are interested in compressing 32 student score observations on 6 tests (English, Math, etc.) into 2 factors measuring what we believe are STEM vs. non-STEM skills. We will reduce the 32 x 6 matrix to a 32 x 2 matrix. Literally, we multiply the 32 x 6 matrix by a 6 x 2 matrix to produce a 32 x 2 matrix of linear combinations of the original variables. Now, how that 6 x 2 matrix is formed depends on what we want to do (later). Of course, we typically want to build the transformation matrix on standardized variables, as the results are not scale invariant.
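
To make the matrix multiplication concrete, here is a minimal sketch using simulated scores and a placeholder weight matrix (the real weights would come from the factor or component solution, not from runif()):

#stand-in 32 x 6 score matrix; the actual test data are read in below
set.seed(1)
X=matrix(sample(40:90, 32*6, replace=TRUE), nrow=32, ncol=6)
Z=scale(X)                      #standardize first: results are not scale invariant
W=matrix(runif(6*2), nrow=6)    #placeholder 6 x 2 transformation matrix
Fscores=Z %*% W                 #32 x 2 matrix of linear combinations
dim(Fscores)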

mydata2=read.csv("c:/Users/Lawrence/Google Drive/Courses & Syllabi/Texas Tech/ISQS 6348/mydata2.csv")
mydata2
##    math english history geography chemistry physics
## 1    60      70      75        58        53      42
## 2    80      65      66        75        70      76
## 3    53      60      50        48        45      43
## 4    85      79      71        77        68      79
## 5    45      80      80        84        44      46
## 6    46      68      69        78        52      70
## 7    75      60      53        52        62      56
## 8    51      71      80        69        47      69
## 9    73      66      50        55        49      52
## 10   80      76      59        51        46      57
## 11   74      77      51        75        49      53
## 12   84      76      73        62        69      75
## 13   80      78      68        69        48      47
## 14   54      67      55        54        49      69
## 15   46      75      52        60        65      58
## 16   61      61      61        83        45      55
## 17   72      79      53        77        67      44
## 18   46      69      71        52        49      53
## 19   85      80      65        71        45      58
## 20   48      74      73        66        60      61
## 21   67      70      67        59        64      69
## 22   77      68      60        72        57      45
## 23   59      67      58        56        69      68
## 24   57      75      67        59        51      42
## 25   52      69      65        57        63      73
## 26   62      76      69        78        49      62
## 27   80      69      52        57        62      42
## 28   67      64      54        63        65      78
## 29   45      68      76        67        69      48
## 30   85      72      73        67        60      62
## 31   69      74      75        79        58      43
## 32   49      80      80        64        53      59
describe(mydata2)
##           vars  n  mean    sd median trimmed   mad min max range  skew
## math         1 32 64.59 14.06   64.5   64.46 19.27  45  85    40  0.02
## english      2 32 71.34  5.97   70.5   71.62  6.67  60  80    20 -0.20
## history      3 32 64.72  9.72   66.5   64.62 11.86  50  80    30 -0.08
## geography    4 32 65.44 10.22   65.0   65.27 12.60  48  84    36  0.14
## chemistry    5 32 56.31  8.81   55.0   56.15 11.12  44  70    26  0.14
## physics      6 32 57.94 11.88   57.5   57.50 17.05  42  79    37  0.20
##           kurtosis   se
## math         -1.54 2.48
## english      -1.03 1.05
## history      -1.37 1.72
## geography    -1.26 1.81
## chemistry    -1.57 1.56
## physics      -1.30 2.10
myfact=factanal(mydata2,factors=2,method="mle",rotation="varimax", scores="regression")
myfact
## 
## Call:
## factanal(x = mydata2, factors = 2, scores = "regression", rotation = "varimax",     method = "mle")
## 
## Uniquenesses:
##      math   english   history geography chemistry   physics 
##     0.957     0.605     0.573     0.661     0.166     0.816 
## 
## Loadings:
##           Factor1 Factor2
## math               0.205 
## english    0.625         
## history    0.653         
## geography  0.583         
## chemistry          0.911 
## physics            0.426 
## 
##                Factor1 Factor2
## SS loadings      1.164   1.058
## Proportion Var   0.194   0.176
## Cumulative Var   0.194   0.370
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 5.63 on 4 degrees of freedom.
## The p-value is 0.228
myfact$PVAL
## objective 
## 0.2282669
myfact$scores
##           Factor1      Factor2
##  [1,]  0.07656331 -0.445861139
##  [2,] -0.01046649  1.538989748
##  [3,] -1.80807988 -1.334293605
##  [4,]  1.08804978  1.403372862
##  [5,]  1.66651650 -1.278137642
##  [6,]  0.38965995 -0.357348420
##  [7,] -1.55735606  0.513339178
##  [8,]  0.74422596 -0.836924160
##  [9,] -1.23131651 -0.805582309
## [10,] -0.38097172 -1.027095087
## [11,]  0.05667861 -0.746421417
## [12,]  0.53442126  1.445873912
## [13,]  0.60634477 -0.845611734
## [14,] -0.94894694 -0.718857249
## [15,] -0.45983745  0.782814299
## [16,] -0.23700422 -1.139754374
## [17,]  0.27466822  0.980563260
## [18,] -0.28454750 -0.818519826
## [19,]  0.69365003 -1.053352257
## [20,]  0.51789896  0.370288290
## [21,] -0.15757100  0.840370418
## [22,] -0.23247326  0.003015474
## [23,] -0.79208678  1.281216150
## [24,]  0.08275151 -0.662768574
## [25,] -0.34040041  0.724457001
## [26,]  0.84002647 -0.665683004
## [27,] -0.95485684  0.440801288
## [28,] -0.89549468  0.975946014
## [29,]  0.27100905  1.165249333
## [30,]  0.41978345  0.466672382
## [31,]  0.93189145  0.121220035
## [32,]  1.09727042 -0.317978846
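
Not part of the original output, but as a quick comparison: an oblique rotation such as promax lets the two factors correlate, which is often more realistic when the underlying skills are related.

#refit with an oblique (promax) rotation and inspect the loadings
myfact_pro=factanal(mydata2, factors=2, rotation="promax")
myfact_pro$loadings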


Now, let’s talk some basic statistics, shall we? Using library(psych), we can take a look at some summary statistics for the body measurements.

describe(mydata)
##        vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## Chest     1 20 37.00 2.58   36.5   36.81 2.22  33  43    10  0.53    -0.46
## Waist     2 20 28.00 3.54   28.5   28.06 5.19  22  33    11 -0.08    -1.49
## Hips      3 20 37.05 2.44   37.0   37.12 2.22  32  42    10 -0.18    -0.42
## Gender    4 20  0.50 0.51    0.5    0.50 0.74   0   1     1  0.00    -2.10
##          se
## Chest  0.58
## Waist  0.79
## Hips   0.55
## Gender 0.11
mybox=boxplot(mydata[,1:3], horizontal = TRUE, col = "Red")

par(mfrow=c(2,2))
for (i in 1:3) myhist=hist(mydata[,i])

We also need to look at relationships among the data. A scatterplot matrix helps.

library(lattice) 
splom(mydata[,1:3],groups=mydata[,4],panel=panel.superpose, col=c("red","blue"), key=list(title="Measurements by Gender", columns=2, col=c("red","blue"),text=list(c("female","male"))))

Sometimes, we might want to look at the data in 3d.

library(rgl)
plot3d(mydata[,1:3],col="red",type="s")
decorate3d(xlab="Chest",ylab="Waist",zlab="Hips")

Also, we might want to look at distances among points to see if there are natural clusters.

mydist=dist(scale(mydata[,1:3],center=FALSE))
plot(mydist)

plot(clust <- hclust(mydist, method = "complete"))

And we can take a look at the covariance matrix and correlation matrix.

 cov(mydata[,1:3])
##          Chest     Waist     Hips
## Chest 6.631579  6.368421 3.000000
## Waist 6.368421 12.526316 3.578947
## Hips  3.000000  3.578947 5.944737
 cor(mydata[,1:3])
##           Chest     Waist      Hips
## Chest 1.0000000 0.6987336 0.4778004
## Waist 0.6987336 1.0000000 0.4147413
## Hips  0.4778004 0.4147413 1.0000000
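
A small sanity check (an addition, not in the original output): the correlation matrix is simply the covariance matrix of the standardized data.

#cor(X) equals cov(scale(X)); correlations are covariances of standardized variables
all.equal(cov(scale(mydata[,1:3])), cor(mydata[,1:3]))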

One of the most important distributions in multivariate statistics is the multivariate normal. What is that? It is the analogue of the univariate Gaussian for multiple variables, and its density takes on a nice “hill” shape with elliptical contours. Let’s take a look at one.
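
For reference, the density of a p-variate normal with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ is

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),$$

and the quadratic form in the exponent is the squared Mahalanobis distance that we will use again below when checking multivariate normality.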

#simulate a bivariate normal sample
library(MASS)
bivn <- mvrnorm(10000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))

# now we do a kernel density estimate
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)

par(mfrow=c(2,3)) #plot space config

# plot
contour(bivn.kde)
image(bivn.kde)
persp(bivn.kde, phi = 45, theta = 30)

image(bivn.kde); contour(bivn.kde, add = T)

# fancy perspective
persp(bivn.kde, phi = 45, theta = 30, shade = .1, border = NA)

Now, we may want to check whether our data are normal. We could test the marginal distributions one at a time (e.g., with a QQ plot).

par(mfrow=c(1,1))
#QQ plot of the first marginal against a matching normal sample
qqplot(bivn[,1], rnorm(50, mean(bivn[,1]), sd(bivn[,1])))
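
We could also add a formal test of each marginal, such as the Shapiro-Wilk test (a quick sketch, not in the original notes; shapiro.test() accepts at most 5000 observations, so we use the first 5000 rows). Keep in mind that normal marginals do not by themselves guarantee joint multivariate normality.

#Shapiro-Wilk p-values for each marginal (first 5000 rows; the test caps n at 5000)
apply(bivn[1:5000, ], 2, function(col) shapiro.test(col)$p.value)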

Alternatively, we could reduce the dataset to squared Mahalanobis distances from the mean vector and check them against a chi-square distribution with a QQ plot.

x=mydata[1:3]
Xbar=colMeans(x)
S=cov(x)
#squared Mahalanobis distance of each observation from the mean vector
d=apply(x,MARGIN=1, function(x) t(x-Xbar)%*%solve(S)%*%(x-Xbar))
par(mfrow=c(1,1))
#compare the distances to chi-square quantiles with df = 3 (one per variable)
qqplot(qchisq(ppoints(length(d)), df=3), d, main="QQ Plot of d vs. Chi-Square(3)",
       xlab="Chi-Square(3) quantiles", ylab="Squared Mahalanobis distance")
abline(0,1)
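
The same distances can be obtained with the built-in mahalanobis() function, which makes a handy cross-check on the apply() version above:

#built-in equivalent of the loop above (returns squared Mahalanobis distances)
d2=mahalanobis(x, center=Xbar, cov=S)
all.equal(as.numeric(d), as.numeric(d2))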

If we have issues with multivariate normality, we can always try transforming the data. Here is an example using the Box-Cox family.

par(mfrow=c(2,2))
myt2=powerTransform(cbind(fuel$Carbon,fuel$City)~1)
myt2$lambda
##        Y1        Y2 
##  1.268839 -1.726699
Carbon2=fuel$Carbon^myt2$lambda[1]
City2=fuel$City^myt2$lambda[2]
B4trans=kde2d(fuel$Carbon,fuel$City, n=100)
image(B4trans,main="Before Xform");contour(B4trans,add=T)
persp(B4trans,phi=45,theta=30,shade=.1,border=NA,main="Before Xform")
threed=kde2d(Carbon2,City2,n=100) 
image(threed,main="Post Xform");contour(threed,add=T)
persp(threed,phi=45,theta=30,shade=.1,border=NA,main="Post Xform")
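
One caveat worth noting (an addition to these notes): raising the data to a negative power, as with the City lambda above, reverses the ordering of the values. The car package's bcPower() applies the scaled Box-Cox form (y^lambda - 1)/lambda, which is monotone increasing even for negative lambda; a minimal sketch:

#scaled Box-Cox transform; order-preserving even when lambda is negative
Carbon_bc=bcPower(fuel$Carbon, myt2$lambda[1])
City_bc=bcPower(fuel$City, myt2$lambda[2])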