Assignment 1

Question 1 (30 marks)

The data file ‘chocolates.dat’ contains data on 6 price, weight and nutritional variables for eleven chocolate bar varieties. The data was gathered in April 2008 in Brisbane. The variables are:

Provide R code, output and written interpretation for parts a) to d) of this question. Test for multivariate normality (MVN):

When testing for multivariate normality, we consider both the univariate distributions of each variable and then the multivariate distribution. There are serval tests that provide a good indication of distribution and skewedness, and in the result that the distributions deviate from normality, the fix will depend on which factor is at cause. If it is skew, than a transformation can be used to lessen the deviation, or if outliers are may be removed to make the distribution more normal-ish. Below we look at these methods to determine if our chocolate dataset is MVN. First, we read in our chocolate data set and review to ensure the data is clean and that we have an understanding of the variables and structure.

#read and preview data
choc <- read.table("chocolates.dat", header = TRUE)
attach(choc)
str(choc)

## 'data.frame':    11 obs. of  6 variables:
##  $ Size   : num  50 40 78 55 60 60 50 40 55 60 ...
##  $ Price  : num  1.76 2.88 1.79 2.33 1.62 1.62 2.56 2.75 2.33 2.58 ...
##  $ Energy : int  2003 2057 2186 1930 1980 1890 2030 2180 1623 1980 ...
##  $ Protein: num  4.6 9.9 7 3.5 10.2 4.7 5.6 5.5 2.2 8.5 ...
##  $ Fat    : num  26.5 23 28.4 24.5 22.9 19.5 20.4 26.8 8.1 20.6 ...
##  $ Carbo  : num  59 60.9 59.7 56.4 59.9 67.9 67.4 67.3 73.3 63.3 ...

summary(choc)

##       Size           Price           Energy        Protein      
##  Min.   :40.00   Min.   :1.620   Min.   :1623   Min.   : 2.200  
##  1st Qu.:46.25   1st Qu.:1.775   1st Qu.:1950   1st Qu.: 4.650  
##  Median :55.00   Median :2.330   Median :1980   Median : 5.500  
##  Mean   :53.68   Mean   :2.273   Mean   :1984   Mean   : 6.064  
##  3rd Qu.:60.00   3rd Qu.:2.665   3rd Qu.:2044   3rd Qu.: 7.750  
##  Max.   :78.00   Max.   :2.880   Max.   :2186   Max.   :10.200  
##       Fat            Carbo      
##  Min.   : 8.10   Min.   :56.40  
##  1st Qu.:20.20   1st Qu.:59.80  
##  Median :22.90   Median :63.30  
##  Mean   :21.88   Mean   :64.01  
##  3rd Qu.:25.50   3rd Qu.:67.65  
##  Max.   :28.40   Max.   :73.30

Consider univariate QQ plots and histograms and univariate Shapiro-Wilks tests of normality for each of the six variables. (6 marks)

###MVN package using the choc dataframe
library(MVN)

## Warning: package 'MVN' was built under R version 3.3.3

## sROC 0.1-2 loaded

##A.univariate tests of normality plots for all variables in the dataset
uniPlot(choc, type = "qqplot")

uniPlot(choc, type = "histogram")

The univariate QQ plots and histograms give a good indication of skewedness and distribution for a single variable. The QQ plot compares the actual distribution to the theoretical spread ifthe data were normal, the ideal result is a 45 degree line, of close to it. If the data does not follow this line, then it will show if there is a strong skew or outliers at one end. The histograms provide a quick indication of the skew and kurtosis of the data, with the ideal shope taking a normal distribution, it is easy to spot variables with wide or narrow kurtosis.

The results show Size, Protein and Carbo do not deviate much from 45 in the qq plot, while price appears steep with a curve at either end, and Energy and Fat seem flat with a few outliers heavily affecting the line. A review of the histogram output shows similar results with Price having a wide spread,and fat and energy being right skewed, suporting the outliers to the left of the distribution seen in the qq plots.

###MVN package using the choc dataframe
library(MVN)

##A.univariate tests of normality plots for all variables in the dataset
uniPlot(choc, type = "qqplot")

uniPlot(choc, type = "histogram")

#univariate normality tests: SW-Shapiro Wilk's; desc=descriptive stats= size,mean,sd,q5,skew,kurtosis
uniNorm(choc, type = "SW", desc = TRUE)

## $`Descriptive Statistics`
##          n     Mean Std.Dev  Median     Min     Max     25th     75th
## Size    11   53.682  11.141   55.00   40.00   78.00   46.250   60.000
## Price   11    2.273   0.488    2.33    1.62    2.88    1.775    2.665
## Energy  11 1984.455 151.435 1980.00 1623.00 2186.00 1950.000 2043.500
## Protein 11    6.064   2.565    5.50    2.20   10.20    4.650    7.750
## Fat     11   21.882   5.478   22.90    8.10   28.40   20.200   25.500
## Carbo   11   64.009   5.266   63.30   56.40   73.30   59.800   67.650
##           Skew Kurtosis
## Size     0.560   -0.400
## Price   -0.224   -1.792
## Energy  -0.840    0.509
## Protein  0.328   -1.277
## Fat     -1.177    0.938
## Carbo    0.195   -1.433
## 
## $`Shapiro-Wilk's Normality Test`
##    Variable Statistic   p-value Normality
## 1   Size       0.9122    0.2590    YES   
## 2   Price      0.8731    0.0850    YES   
## 3  Energy      0.8843    0.1181    YES   
## 4  Protein     0.9375    0.4921    YES   
## 5    Fat       0.8543    0.0485    NO    
## 6   Carbo      0.9410    0.5323    YES

The Shapiro-Wilk’s Normality test offers an evaluation of each variable with a p-value to determine if each is of normal distribution. For the test of normality, we want to accept H0 which means a p-value of p>0.05 is desired. A p-value of p<0.05 implies we reject H0 for the alternate hypothesis, the data is not normally distributed. In this case, only Fat has been shown to significantly deviate from normality with a p-value of 0.0485. Referring back to our histograms, fat had the most right skew and may benefit from a transormation or elimination of outliers. We will accept these results as having shown our data to be univariate normalish for the purposes of our analysis, and will proceed to consider the multivariate case.

Consider perspective and contour plots for Energy and Carbo. What is an inherent problem with using these plots to assess MVN? (4 marks)

###check multivariate vaiables
##B. perspective and contour plots for Energy and Carbo
(chocEC <- c("Energy","Carbo"))

## [1] "Energy" "Carbo"

(chocSS <- choc[chocEC])

##                 Energy Carbo
## Bounty            2003  59.0
## Milo.Bar          2057  60.9
## KitKat            2186  59.7
## Cherry.Ripe       1930  56.4
## Snickers          1980  59.9
## Mars              1890  67.9
## Crunchie          2030  67.4
## Tim.Tam           2180  67.3
## Turkish.Delight   1623  73.3
## Maltesers         1980  63.3
## MandMs            1970  69.0

resultEC <- hzTest(chocSS, qqplot=FALSE)
par(mfrow = c(1,2))
mvnPlot(resultEC, type = "persp", default = TRUE) # perspective plot
mvnPlot(resultEC, type = "contour", default = TRUE) # contour plot

par(mfrow = c(1,1)) #reset plot area

We now focus in on Energy and Carbo, remembering from the historgrams analysis that energy was quite right skewed and carbo had a slighty narrow kurtosis. We will now look at the contour and perspective plots for more depth. The perspective plot shows the main peak is slightly right skewed in the energy direction and a mini peak in the right of carbo. Results also presented in the contour plot, seeing the peak more clearly to be slightly right skewed also for carbo and again supporting the small peak in the right of carbo and left of energy. THis distribution may present some issues with MVN due to the appearance of the secondary peak, but will need further analysis to determine the effects on the normality.

Consider the Mardia, Henze-Zirkler and Royston tests of MVN based on all six chocolate bar variables. Include in your interpretation: (10 marks)

##C. MVN tests Mardia'S, HZ, Roy
(resultM <- mardiaTest(choc, qqplot = T))

##    Mardia's Multivariate Normality Test 
## --------------------------------------- 
##    data : choc 
## 
##    g1p            : 18.41365 
##    chi.skew       : 33.75836 
##    p.value.skew   : 0.9919058 
## 
##    g2p            : 37.92329 
##    z.kurtosis     : -1.70549 
##    p.value.kurt   : 0.088103 
## 
##    chi.small.skew : 46.2702 
##    p.value.small  : 0.8198427 
## 
##    Result         : Data are multivariate normal. 
## ---------------------------------------

(resultHZ <- hzTest(choc, qqplot = F))

##   Henze-Zirkler's Multivariate Normality Test 
## --------------------------------------------- 
##   data : choc 
## 
##   HZ      : 0.8484293 
##   p-value : 0.2058381 
## 
##   Result  : Data are multivariate normal. 
## ---------------------------------------------

(resultR <- roystonTest(choc, qqplot = F))

##   Royston's Multivariate Normality Test 
## --------------------------------------------- 
##   data : choc 
## 
##   H       : 8.774458 
##   p-value : 0.06982361 
## 
##   Result  : Data are multivariate normal. 
## ---------------------------------------------

. The Chi-Square QQ plot and describe how it is constructed and its relationship to the univariate normal QQ plots as part of your interpretation.

Having considered the univariate distribution, there is need for a multivariate interpetation to determine if the dataset is MVN. Mardia, Henze-Zirkler and Royston tests are three fairly robust multivariate tests which indicate multivariate normal distributions of the data. Looking first at the results of Mardia’s Multivariate Normality test and the Chi-Square QQ plot produced, the data is determined to be MVN. The effects of the skew (p-value 0.99) and kurtosis (p-value 0.09) are determined to be fairly undamaging on the normality of the data as both p-values are greater than 0.05. Dispite the S-shape of the data seen in the Chi-Squared QQ plot, the data is found to not deviate from normal.

Specifically, the CHi-Squared QQ Plot offers a visualisation of how the data should be spread were it to be a normal distribution along the x-axis and then along the y-axis we can display the actual distribution of the data. If the two have a 1:1 relationship or form a 45 degree line (were the scales on each axis identical) then the data is normally distributed. THe Chi-squared QQ plot is based on Mahalanobis distances, calculating a distance for each point from the origin and using the chi distribution to calculate the theoretical spread of the data, this is similar to the univariat eqq plot discussed earlier which calulates the z scores that should occur equally to the number of data points we have based on a normal distribution to be compared with the actual z-score of the data.

Henze-Zirkler’s test results aslo indicate MVN distribution with a p-value of 0.2 sufficiently larger than 0.05 which supports accepting the null hypothysis. Finally, the Royston test results again are found to show no significant deviation from multivariate normality with an H value of 8.77 calculated to have a probability of 0.0698, greater than 0.05 and supporting Ho.

With all three tests providing insignificant deviation from normality, the data is found to be multivariate normal. This does not imply that each variable was individually normally distributed, as we saw with the results of the Shapiro Wilk’s test, but once all the variables are considered together the effects of Carbo did not result in a large overall deviation from normality for the multivariate analysis.

. What is a key limitation of these MVN statistical tests?

The key limitations to MVN statistical tests are the need for knowledge of the area to correctly interpret the results. The results of the tests are not neccessarily black and white, there are several tests to be considered because there are several ways in which to view the data to determine if the data can be considered to not deviate far from normality. As there is some deviation allowed before it may be considered statistically significant that the data is not normal, multiple test view the data from different measures in order to best determine an overall finding of the data. It may be that 1 or 2 of the tests find the data to be MVN and if the research knows his data and area thouroughly can better make the dcision on if this is ok to proceed or if the data needs to be adjusted before proceeding. Often a cause of data deviating from normality can be due to outliers and it will be only with knowledge of the data that one can determine if these outliers can be eliminated or if they are an important result of a particular effect within the data and need be keep in the data to correcly representthe population.

. Suggest two ways that you might improve univariate and multivariate normality for data sets in general.

In certain instances it may be neccessary to adjust the data in order to fit to MVN distribution and proceed with analysis. In certain cases where a variable skew is effecting the overall distribtution, the data may need to be transformed to eliminate this effect. Consider a log transfromation or a root transfrom to reduce right skew and a power transform to reduce left skew. Another option may be to examine the actual data points and any outliers that may be eliminated to create a better subset or more clean subset of data to work with. This cleaning step should always be done to ensure the best results and most complete set of data is used in the analysis. There may be issues of sample size and at times mroe data may be needed to improve normality.

Produce a draftsman display and a star plot (by bar type) for the 6 variables. Interpret these plots. What are the y and x axes on plot [3,2] of the draftsman plot? (5 marks)

#plot draftsman
plot(choc)

#[3,2] is (Price,Energy)

The draftsman plot provides a whole picture analysis of each variable and its correlation with the other variables in the data set. It is a great starting point for familiarizing yourself with the data and quickly determining if there are real issues with individual variables deviating from normality. It also provides a quick synopsis of the correlations that might be seen in the correlation matrix for the data. For the data set shown in the results above, the axis for the plot in location [3,2] is Price on the X-axis and Energy on the Y-axis.

Looking at the draftsman plot for the chocolate dataset, we can see very few strong correlations between the variables. This may be do to some extreme values distorting the data, or a consiquence of the small dataset being studied. Probably the strongest correlation visible would be that of energy and fat, which makes intuitive sense. Also, protein and carbs appear to have a negative correlation while protein and fat appear to have a weak correlation. The draftsmans plot for this data highlights the need to test for univariate and multivariate normality and consider the size of the dataset at hand.

#plot starplot
stars(choc,draw.segments=T, key.loc=c(7,2),main="Nutrition of Chocolate Bars")

The stars plot is a plot essential for analysing the data relative to the other data points in the data. It provides a visualisation for those data points in the data set that may be largest of smallest in a particular variable. It is essential for this plot that not too many variables are considered at once or it may make the graph hard to interpret. For the chocolate data, we have constructed a startplot of the 6 nutrition variables for the 11 chocolate bars in the data. The graph easily shows differences in the chocolate bars relative to each other.

Take for instance Turkish Delight, which clearly sits at the extremely low range for energy, protein and fat, but it is important to reinforce that this is only relative to the other chocolate bars in the data not all chocolate bars. We can also see that KitKat is cnversely low on price and carbs but high in energy, protein and fat. Bounty is similar to KitKat with even smaller size and protein. Cheery ripe is lowest in carbs and also low in protein. For these examples, one can see how easy it is to pull out facts about the data relative to teh dataset in a quick and easy manor.

Produce the correlation and covariance matrices in R. Explain the difference between these matrices in detail (i.e. explain clearly how the values are adjusted mathematically and the effect of these changes). Would using the covariance matrix in PCA on the Chocolate data be appropriate? (5 marks)

cor(choc)

##                Size       Price      Energy     Protein        Fat
## Size     1.00000000 -0.69057719 -0.00251376  0.07496077  0.0884535
## Price   -0.69057719  1.00000000  0.10235868  0.02051982 -0.1499203
## Energy  -0.00251376  0.10235868  1.00000000  0.50443709  0.9000381
## Protein  0.07496077  0.02051982  0.50443709  1.00000000  0.3872488
## Fat      0.08845350 -0.14992025  0.90003805  0.38724877  1.0000000
## Carbo   -0.27403643  0.30551507 -0.47037565 -0.42777201 -0.7398280
##              Carbo
## Size    -0.2740364
## Price    0.3055151
## Energy  -0.4703756
## Protein -0.4277720
## Fat     -0.7398280
## Carbo    1.0000000

cov(choc)

##               Size       Price       Energy      Protein         Fat
## Size    124.113636 -3.75754545    -4.240909   2.14227273   5.3986364
## Price    -3.757545  0.23854182     7.570636   0.02570909  -0.4011455
## Energy   -4.240909  7.57063636 22932.472727 195.95818182 746.6990909
## Protein   2.142273  0.02570909   195.958182   6.58054545   5.4422727
## Fat       5.398636 -0.40114545   746.699091   5.44227273  30.0136364
## Carbo   -16.076818  0.78577273  -375.104545  -5.77863636 -21.3438182
##                Carbo
## Size     -16.0768182
## Price      0.7857727
## Energy  -375.1045455
## Protein   -5.7786364
## Fat      -21.3438182
## Carbo     27.7309091

The covariance matrix procides a measure of inter-variable variance, when an element on the diagonal is considered it refers to the variance of a single variable. The off diagonals provide a calculation for the association of two variables in the dataset. The matrix will be a symmetric matrix as the association between the variables i and j will be the same as the association between the variables j and i. The covariance matrix can not be interpreted without understandign the data, specifically the means and standard deviations of each individual variable. Becausethe data is not centered or standardised, the variance or covariance for each set of variable may differ quite drastically from another in the data.

The correlation matrix solves this issues, by determining the covariance for the data after being centered and standardised, the diagonal values, or variance of a single variable will be 1 while the off diagonals will range from -1 to 1. The negative value signifies a negative correlation, while a positive a positive correlation. A value close to zero can be interpreted as the variables having little correlation while a value near one implies the variables have a strong correlation. Specifically the calulation for centering and standardising the data means for each variable the mean of the variable is subtracted and then divided by the standard deviation to insure a mean of zero and a standard deviation of 1 for that variable.

For multivariate analysis, the dataset is required to be correlated, if the correlation matrix reveals very weak correlations between variables than the dataset is best analyised using univariate methods. If the correlation matrix reveals some highly strong coreelations near one, then it should be considered that the variables with this correlation represent duplicate information and one should be removed.

Question 2 (30 marks)

The data file ‘countries.dat’ contains data compiled in 2005 on 6 variables related to ‘population health’ indicators of 30 countries from 4 Regions.

Region | The region of the country: Africa, Asia, Europe or South.America.
Irrigated | The area of irrigated land (in square kilometres) Population | The population (in millions)
Under.14 | The percentage of the population under 14
Life.expectancy | The life expectancy at birth (in years)
Literacy.Rate | The reported literacy rate (in percent)
Unemployment | The unemployment rate (in percent)

Provide R code, output and written interpretation for parts a) to c) of this question.

In the context of MANOVA, list the potential dependent and independent variables. (2 marks)

The dataset ‘countries.dat’ provides a snapshot of the health of 30 countries from 4 regions in 2005. THis data can be used to group a given country in onne of the four regions by evaluating its population health metrics. In this method, Region would be the independent variable, and the chosen population health variables used would make up the dependent variables.

This sort of comparison, consisting of 4 groups or categories rather than 2 would require the use of ANAOVA if there was only 1 dependent variable, but as in this case we wish to consider 5 dependent variables (Population, Under.14, Life.expectancy, Literacy.Rate and Unemployment) the appropriate test will be MANOVA.

#read and preview data
ctry <- read.table("countries.dat", header = TRUE)
attach(ctry)
str(ctry)

## 'data.frame':    30 obs. of  7 variables:
##  $ Region         : Factor w/ 4 levels "Africa","Asia",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ Irrigated      : int  210 32460 60 660 60 12700 460 1930 31000 498720 ...
##  $ Population     : num  15.8 69.5 19.9 30.8 1.8 ...
##  $ Under.14       : num  42.4 34.6 41.2 42 42.7 32 47.4 38.7 35 25 ...
##  $ Life.expectancy: num  54.6 63.7 57.2 47.5 40.6 ...
##  $ Literacy.Rate  : num  63.4 51.4 64.5 78.1 38 81.8 78.2 85 56 81.5 ...
##  $ Unemployment   : num  30 11.5 20 50 35 30 50 50 35.2 10 ...

summary(ctry)

##            Region    Irrigated        Population          Under.14    
##  Africa       :8   Min.   :    10   Min.   :   0.440   Min.   :14.17  
##  Asia         :9   1st Qu.:   645   1st Qu.:   9.875   1st Qu.:18.62  
##  Europe       :9   Median :  4550   Median :  26.500   Median :29.35  
##  South.America:4   Mean   : 32204   Mean   :  91.615   Mean   :29.01  
##                    3rd Qu.: 24975   3rd Qu.:  77.300   3rd Qu.:38.65  
##                    Max.   :498720   Max.   :1273.100   Max.   :47.40  
##  Life.expectancy Literacy.Rate     Unemployment   
##  Min.   :37.10   Min.   : 38.00   Min.   : 2.700  
##  1st Qu.:58.03   1st Qu.: 78.12   1st Qu.: 5.775  
##  Median :70.35   Median : 84.40   Median : 9.950  
##  Mean   :65.89   Mean   : 82.33   Mean   :16.557  
##  3rd Qu.:77.15   3rd Qu.: 98.75   3rd Qu.:23.750  
##  Max.   :80.80   Max.   :100.00   Max.   :50.000

Using MANOVA in R, test for differences in ‘population health’ among the four regional groups using only Population, Under.14, Life.expectancy, Literacy.Rate and Unemployment to represent ‘population health’. Include tests using all four test statistics covered in this course and interpret output. First check the correlation matrix. What does it suggest about the potential for the variables to be MVN distributed? (Do not test for MVN) (13 marks)

The correlation matrix shows a relatively strong correlation for some while a few are quite low, but this should be a sufficient correlation between the variables to continue with multivariate tests under the assumption that the data is multivariate normal. There could be reason from this matrix to do further investigation and possibly correct the distribution to be more normalish, but for the purposes of these tests which are quite robust we will not look further into it and assume the data is MVN.

#subset variables to look at population health
(ctry1 <- cbind(Population, Under.14, Life.expectancy, Literacy.Rate, Unemployment))

##       Population Under.14 Life.expectancy Literacy.Rate Unemployment
##  [1,]      15.80    42.40           54.60          63.4         30.0
##  [2,]      69.50    34.60           63.70          51.4         11.5
##  [3,]      19.90    41.20           57.20          64.5         20.0
##  [4,]      30.80    42.00           47.50          78.1         50.0
##  [5,]       1.80    42.70           40.60          38.0         35.0
##  [6,]      43.60    32.00           48.10          81.8         30.0
##  [7,]       9.80    47.40           37.30          78.2         50.0
##  [8,]      11.40    38.70           37.10          85.0         50.0
##  [9,]     131.30    35.00           60.54          56.0         35.2
## [10,]    1273.10    25.00           71.60          81.5         10.0
## [11,]       7.20    17.70           79.70          92.2          4.5
## [12,]     228.40    30.10           68.30          83.8         17.5
## [13,]     126.80    14.60           80.80          99.0          4.7
## [14,]       5.60    42.80           53.50          57.0          5.7
## [15,]      22.20    34.50           71.10          83.5          2.8
## [16,]     144.60    40.50           61.50          42.7          6.0
## [17,]      79.90    32.10           69.60          93.7         25.0
## [18,]      10.30    16.10           74.60          99.9          8.7
## [19,]       5.30    18.60           76.70         100.0          5.3
## [20,]       5.20    18.00           77.60         100.0          9.8
## [21,]      59.60    18.70           78.90          99.0          9.7
## [22,]      83.00    15.60           77.60          99.0          9.9
## [23,]      57.70    14.17           79.10          98.0         10.4
## [24,]       0.44    18.90           77.30         100.0          2.7
## [25,]      10.10    17.00           75.90          87.4          4.3
## [26,]      59.60    18.90           77.80          99.0          5.5
## [27,]      37.40    26.50           75.20          96.2         15.0
## [28,]     174.50    28.60           63.24          83.3          7.1
## [29,]       8.30    38.50           64.10          83.1         11.4
## [30,]      15.30    27.30           75.90          95.2          9.0

#review correlation matrix
(corC <- round(cor(ctry1),digits=3))

##                 Population Under.14 Life.expectancy Literacy.Rate
## Population           1.000   -0.082           0.110        -0.036
## Under.14            -0.082    1.000          -0.866        -0.765
## Life.expectancy      0.110   -0.866           1.000         0.647
## Literacy.Rate       -0.036   -0.765           0.647         1.000
## Unemployment        -0.105    0.642          -0.828        -0.333
##                 Unemployment
## Population            -0.105
## Under.14               0.642
## Life.expectancy       -0.828
## Literacy.Rate         -0.333
## Unemployment           1.000

The four test results to be considered will offer support to determine if there is a sinficant difference in the means of the regions, enough to consider the regions to be from different populations. The four tests, Pillai’s, Wilk’s, Roy’s, and Hotellings-Lawley all have slightly different methodology and provide a slightly different view for interpreting the results but all give a similar result. The tests all assume that the dependent variables are MVN, and that the variables have equal covariance matrices, and are robust to unequal sample sizes given they are >=5.

The data we wish to study, separates the 30 countries into 4 different region, Asia, Africa, Europe and South America with sample sizes of 9,8,9,4 respectively. South America has a small samle size of only 4, and this will need to be considered in the analysis. This low sample size may cuase the mean and variance to be a poor estimate of the population.

We run the four tests on a subset of the variables available in the dataset, looking only at Population, Under.14, Life.expectancy, Literacy.Rate, Unemployment, which is considered to represent a view of population health. The results of the four tests are shown below.

The first test calculates Wilk’s lambda as 0.1157, a comparison of the within sample variation to the total between and within variation. This value is low and suggests that the total variation is much larger than the within sample variation. The large total variation inplies large between group variation signifying the sample are not from the same population and are therefore significantly different. This can also be seen in the test result p value at a significance of p<0.001.

The next test calculates Roy’s largest root as a linear combination of the variables choosen as to maximise the between sample sum of square to the within sample sum of squares. Again, a similar concept of interpretation through comparing the variation within and between samples. As with the Wilk’s test, if the between sample variation is fond to be large, than the samples can be considered to be of different populations.The results of this test show the maximum root is 3.1001 which is largre with a significance of p<0.001.

The third test, Pillai’s trace is also based on the concept of eigan values, with a large trace value signaling a large between sample variation and reason to conclude the sample are not from the same population. The results of this test back those of Wilk’s and Roy, with again a significance level of p<0.001 and Pillai’s trace of 1.3546 indicating a large between sample variation for significantly diffeferent populations, or in this analysis regions.

The last test results to be considered are the Hotelling-Lawley’s. The Lawes-Hotelling trace is a sum of trace of the eiganvalues, with a large value desired to determine differing populations for the samples tested. The results indicate a significantly large trace of 4.0544 with significance of p<0.001 and therefore a large between sample variation is detected implying the samples so not come from the same population.

The results of all four tests support each other, and determine the four regions to be from different populations and the means of each sample population to be significantly different. This result is a useful result which would allow the distinction of each of the regions as distinct populations with unique classification. The benefit to such an outcome is the ability to later predict through the availability of population health record whic region the country may be classified as.

##check the 4 tests using manova....
ctry.manova <- manova(ctry1~as.factor(Region), data=ctry)

#Wilks test: compares within variation to total variation, and low lambda implies different populations
summary(ctry.manova, test = 'Wilks')

##                   Df  Wilks approx F num Df den Df    Pr(>F)    
## as.factor(Region)  3 0.1157   4.8266     15 61.134 4.839e-06 ***
## Residuals         26                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Roy test
summary(ctry.manova, test = 'Roy')

##                   Df    Roy approx F num Df den Df    Pr(>F)    
## as.factor(Region)  3 3.1001    14.88      5     24 1.096e-06 ***
## Residuals         26                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Pillai test:  #most robust to deviations from MVN or unequal covariances
summary(ctry.manova, test = 'Pillai')

##                   Df Pillai approx F num Df den Df   Pr(>F)    
## as.factor(Region)  3 1.3546   3.9519     15     72 3.84e-05 ***
## Residuals         26                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Hotelling-Lawley
summary(ctry.manova, test = 'Hotelling-Lawley')

##                   Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
## as.factor(Region)  3           4.0544   5.5861     15     62 5.414e-07 ***
## Residuals         26                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Produce output that specifically compares each of the Regions with each other (you should have 6 comparisons) using Hotelling’s T2 t-test equivalent and a significance level of 0.05. Determine the multiple test corrected significance level. Do not provide R output; instead reproduce and complete the following table for all comparisons and interpret. How may sample sizes have affected these results and those in part b)? (15 marks)

Hotelling’s T2 t-test is a generalization of the t-est which allows for comparison of means for more than 1 dependant variable, by testing whether a vecor of means differs rather than the means of one dependent variable for two populations. This test is considered robust to the assumption of multivariate normality and to equal covariance matrices if the two sample sizes are roughly equal. For our comparisons below, combinations of all four regions, the sample sizes range from 4 to 9, with specifically South America(4), Africa(8), Asia(9), Europe(9) and so can be assumed roughly euivalient. Although the small sample size (as noted above in the MVN Manova test analysis) may require some caution when interpreting.

Hotelling’s T2 t-test takes into account the two covariance matrices for the sample populations being tested. The use of the combined covariance matrix in the calculation allows consideration for the correlation between variables and tests for differences between groups. As a Multivariate test, it accounts for multiple dependent variable in the analysis without the need for a correction parameter within a single test of two regions. Yet, in our case, we are using all combinations of regions on the Hotelling’s T2 test to test for multiple pairwise relationships, and so we will need to consider Bonferroni’s correction.

The results of the 6 pair wise comparisons are provided in the table below. In these results, it is easy to see the correction parameter highly reduces the level by which the result must reach to be significant, from p<0.05 to p<0.0083, the effect of which is to reverse the significnce of 3 pair wise comparisons, leaving only 2 significat results. Remebering the samll sample size of South America, we need to be carful when interpreting Europe-South America significant result as the limited data may have not provided a meaningful representation of the population. Using the corrected p value significance esults, we can interpret the results are a view on which populations aresignificantly different, after determining in part b that there was a significant difference in the population means. The results here imply this difference has arisen from only 2 pairs (or 3 countries) Africa-Europe and Europe-South America. And all other pairs may be insignificant and unable to determine as separate populations.

Comparison Results

Regions	Hotelling’s p-value	Significant (0.05) (Y/N)	Significant correction (0.008) (Y/N)
Africa-Asia	0.043	Y	N
Africa-Europe	0.000002	Y	Y
Africa-S.America	0.048	Y	N
Asia-Europe	0.019	Y	N
Asia-S.America	0.674	N	N
Europe-S.America	0.006	Y	Y

Question 3 (30 marks)

The data file ‘body.dat’ contains data for 6 body dimension variables measured on 44 men and women in Australia. All variables are measured in centimetres. Provide R code, output and written interpretation for all analyses.

The body dataset is read and inital review of the structure and variables is preformed. The data looks to be clean and so we have proceeded to check for normality, both univariate and multivariate.

body <- read.table("body.dat", header = TRUE)
attach(body)
str(body)

## 'data.frame':    44 obs. of  6 variables:
##  $ shoulder.girth: num  106 110 115 104 108 ...
##  $ chest.girth   : num  89.5 97 97.5 97 97.5 ...
##  $ waist.girth   : num  71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
##  $ hip.girth     : num  93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
##  $ thigh.girth   : num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
##  $ bicep.girth   : num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...

summary(body)

##  shoulder.girth    chest.girth      waist.girth      hip.girth     
##  Min.   : 91.90   Min.   : 78.10   Min.   :57.90   Min.   : 84.60  
##  1st Qu.: 98.97   1st Qu.: 87.10   1st Qu.:69.03   1st Qu.: 93.50  
##  Median :106.65   Median : 93.65   Median :74.15   Median : 95.90  
##  Mean   :106.24   Mean   : 92.94   Mean   :74.60   Mean   : 96.24  
##  3rd Qu.:112.40   3rd Qu.: 98.28   3rd Qu.:80.78   3rd Qu.: 98.53  
##  Max.   :123.50   Max.   :106.90   Max.   :88.30   Max.   :112.10  
##   thigh.girth     bicep.girth   
##  Min.   :50.00   Min.   :24.40  
##  1st Qu.:53.92   1st Qu.:28.10  
##  Median :57.00   Median :31.60  
##  Mean   :56.93   Mean   :31.53  
##  3rd Qu.:59.15   3rd Qu.:34.42  
##  Max.   :66.50   Max.   :42.40

plot(body)

##check univariate normality
library(MVN)
uniNorm(body, type = "SW", desc = T) #all univariate normal

## $`Descriptive Statistics`
##                 n    Mean Std.Dev Median  Min   Max   25th    75th   Skew
## shoulder.girth 44 106.239   9.258 106.65 91.9 123.5 98.975 112.400  0.203
## chest.girth    44  92.941   7.284  93.65 78.1 106.9 87.100  98.275 -0.116
## waist.girth    44  74.600   7.672  74.15 57.9  88.3 69.025  80.775 -0.094
## hip.girth      44  96.236   5.258  95.90 84.6 112.1 93.500  98.525  0.338
## thigh.girth    44  56.932   3.911  57.00 50.0  66.5 53.925  59.150  0.370
## bicep.girth    44  31.532   4.081  31.60 24.4  42.4 28.100  34.425  0.239
##                Kurtosis
## shoulder.girth   -1.112
## chest.girth      -1.044
## waist.girth      -1.000
## hip.girth         0.554
## thigh.girth      -0.265
## bicep.girth      -0.462
## 
## $`Shapiro-Wilk's Normality Test`
##         Variable Statistic   p-value Normality
## 1 shoulder.girth    0.9545    0.0807    YES   
## 2  chest.girth      0.9725    0.3694    YES   
## 3  waist.girth      0.9725    0.3685    YES   
## 4   hip.girth       0.9809    0.6687    YES   
## 5  thigh.girth      0.9681    0.2599    YES   
## 6  bicep.girth      0.9793    0.6060    YES

##check multivariate normality
mardiaTest(body, qqplot = T) #chi plot - very little deviation from 1:1 fit, good result

##    Mardia's Multivariate Normality Test 
## --------------------------------------- 
##    data : body 
## 
##    g1p            : 7.227748 
##    chi.skew       : 53.00349 
##    p.value.skew   : 0.5890206 
## 
##    g2p            : 45.58609 
##    z.kurtosis     : -0.817112 
##    p.value.kurt   : 0.4138644 
## 
##    chi.small.skew : 57.71673 
##    p.value.small  : 0.4115828 
## 
##    Result         : Data are multivariate normal. 
## ---------------------------------------

#mardia's result mvn

hzTest(body, qqplot = F)

##   Henze-Zirkler's Multivariate Normality Test 
## --------------------------------------------- 
##   data : body 
## 
##   HZ      : 1.085819 
##   p-value : 0.0002851522 
## 
##   Result  : Data are not multivariate normal. 
## ---------------------------------------------

#hz not mvn

roystonTest(body, qqplot = F)

##   Royston's Multivariate Normality Test 
## --------------------------------------------- 
##   data : body 
## 
##   H       : 5.349142 
##   p-value : 0.3377938 
## 
##   Result  : Data are multivariate normal. 
## ---------------------------------------------

#roy's mvn

The draftsman plot is reviewed for all variables, and in general has produced a good result. The data looks mainly correlated, some higher than others and some a little more weak, but in general a positive result to continue. The univariate Shapiro-Wilks normality test has been proformed and findings, shown above, show all varaibles to not significantly diveate from normality. Next, we check multivariate normaliy. The Chi-Squared QQ-Plot provides another good visual that the data is somewhat normal, as the data points lie around the 1:1 line. The three tests, Mardia’s, Henze-Zirkler’s, and Royston’s are run and the results of Maria’s and Royston’s show MVN, while Henze-Zirkler’sshows a significant deviation from normality with a p-value of 0.0003. The data appears to be normalish from all but one test run, and so with some caution we can proceed with our multivariate analysis.

Perform PCA analysis on all six body measurement variables. Include in your answer (15 marks):

cov(body)

##                shoulder.girth chest.girth waist.girth hip.girth
## shoulder.girth      85.708007    55.08489    46.96488  14.75368
## chest.girth         55.084894    53.05410    47.52558  21.09569
## waist.girth         46.964884    47.52558    58.85302  26.96814
## hip.girth           14.753679    21.09569    26.96814  27.64888
## thigh.girth          9.321998    11.92658    15.79791  17.44323
## bicep.girth         29.763858    25.22123    24.02093  10.89486
##                thigh.girth bicep.girth
## shoulder.girth    9.321998   29.763858
## chest.girth      11.926575   25.221226
## waist.girth      15.797907   24.020930
## hip.girth        17.443235   10.894863
## thigh.girth      15.293848    7.631057
## bicep.girth       7.631057   16.652452

cor(body)  #as differing scales between shoulder and bicep use this one

##                shoulder.girth chest.girth waist.girth hip.girth
## shoulder.girth      1.0000000   0.8168877   0.6612690 0.3030755
## chest.girth         0.8168877   1.0000000   0.8505181 0.5508019
## waist.girth         0.6612690   0.8505181   1.0000000 0.6685407
## hip.girth           0.3030755   0.5508019   0.6685407 1.0000000
## thigh.girth         0.2574779   0.4186951   0.5265708 0.8482617
## bicep.girth         0.7878425   0.8485308   0.7673020 0.5077429
##                thigh.girth bicep.girth
## shoulder.girth   0.2574779   0.7878425
## chest.girth      0.4186951   0.8485308
## waist.girth      0.5265708   0.7673020
## hip.girth        0.8482617   0.5077429
## thigh.girth      1.0000000   0.4781754
## bicep.girth      0.4781754   1.0000000

#calculate PCs on correlation matrix
(body.prcomp <- prcomp(body, center=TRUE, scale=TRUE ))

## Standard deviations:
## [1] 2.0338621 1.0869961 0.5438815 0.4231910 0.3446413 0.2969330
## 
## Rotation:
##                      PC1         PC2        PC3         PC4        PC5
## shoulder.girth 0.3862980 -0.46474668  0.4264986 -0.60267416  0.1560384
## chest.girth    0.4516065 -0.25175685 -0.2228306 -0.08260076 -0.4401488
## waist.girth    0.4476787 -0.03815777 -0.6495013  0.06525361  0.5964472
## hip.girth      0.3740616  0.54535179 -0.1790435 -0.24637024 -0.5114866
## thigh.girth    0.3354521  0.60989241  0.4664227  0.00605201  0.3785681
## bicep.girth    0.4404078 -0.22314942  0.3114271  0.75164354 -0.1457377
##                       PC6
## shoulder.girth  0.2555763
## chest.girth    -0.6945954
## waist.girth     0.1275481
## hip.girth       0.4563945
## thigh.girth    -0.3931953
## bicep.girth     0.2702794

names(body.prcomp)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

body.prcomp$rotation #print only rotations/loadings/eigen vectors

##                      PC1         PC2        PC3         PC4        PC5
## shoulder.girth 0.3862980 -0.46474668  0.4264986 -0.60267416  0.1560384
## chest.girth    0.4516065 -0.25175685 -0.2228306 -0.08260076 -0.4401488
## waist.girth    0.4476787 -0.03815777 -0.6495013  0.06525361  0.5964472
## hip.girth      0.3740616  0.54535179 -0.1790435 -0.24637024 -0.5114866
## thigh.girth    0.3354521  0.60989241  0.4664227  0.00605201  0.3785681
## bicep.girth    0.4404078 -0.22314942  0.3114271  0.75164354 -0.1457377
##                       PC6
## shoulder.girth  0.2555763
## chest.girth    -0.6945954
## waist.girth     0.1275481
## hip.girth       0.4563945
## thigh.girth    -0.3931953
## bicep.girth     0.2702794

EV = body.prcomp$sdev^2 #eigen values are variance

#Scree plots
screeplot(body.prcomp)

screeplot(body.prcomp, type="lines")

#calc %weighting of variance in prominant PCs
v = sum(EV) #verify = 6, sum of correlation or # of variables
(pcweight <- round(EV / v * 100, digits=0))

## [1] 69 20  5  3  2  1

#69% 20% 5% 3% 2% 1%
(pcweight12 <- sum(pcweight[1:2])) #1&2 only pcs over 1 and acct together for almost 90% & scree shows elbow at 3

## [1] 89

#Visualise PC1 & PC2
load = body.prcomp$rotation

# DotPlot PC1
sorted.loadings = load[order(load[,1]),1]
Main="Loadings Plot for PC1" 
xlabs="Variable Loadings"
dotchart(sorted.loadings,main=Main,xlab=xlabs,cex=1.5,col="red")

#represents overall average structure as all variables similar weighting

# DotPlot PC2
sorted.loadings = load[order(load[,2]),2]
Main="Loadings Plot for PC2"
xlabs="Variable Loadings"
dotchart(sorted.loadings,main=Main,xlab=xlabs,cex=1.5,col="red")

#represents relationship between upper and lower body

# Now draw the BiPlot
biplot(body.prcomp,cex=c(1,0.7),choices=c(1,2)) #default PC1 and PC2

#can verify waist contributes minimal and shows contrast between upper and lower body of pc2
#limited by the % of var acct for b 2 pcs

#test if PC's are correlated : shd be orthogonal = 0
round(cor(body.prcomp$x[,1],body.prcomp$x[,2]),digits=3) #orthogonal

## [1] 0

1.A brief explanation of your choice between the correlation and covariance matrix as the basis of the PCA analysis.
For the analysis, the scale of the variables were considered. The body data set compares measurements of various part of the body, with the largest of the measurements being shoulder girth with a mean of 107cm and the smallest measurement the bicep girth with a mean of 32cm. The determination was made on this data set to scale and standardize the data, as such to use the correlation matrix. THe variables should not hold different weighing of importance in the analysis and with the range spreading 3 fold as described above, the decision to se the correlation matrix will ensure variables are of equal weight i the analysis.

Reviewing the results for the crrelation matrix, the association of variables range from 0.3 - 0.9, and so the data is considered suitable for Principle COmponent Analysis. In is ideal in PCA to havecorrelated variables, not uncorrelated, which would prove the analysis useless, and not too correlated whcih would find the variables representing duplicate results. SO with this data set having a nice range of semi correlated variables, we will apply PCA methodoglogies and see what reduction can be acheived.

2.Justification for your choice of the number of PCs to interpret.

The goal of PCA is too produce a smaller or reduced set of components that will simlify the analysis, although the goal is to preform a reduction technique the ideal results will contain the loss of minimal information. There are several acceptable methods of deciding how many components to keep in the analysis but all must consider the loss of information and the gain of simplicity through variable reduction as a balance. Method include keeping principle components with eiganvalues greater than one, but this is often considered too stringent and accounts for too little of the original variance; percent wise PC deetermination would keep the minimum amount of PC’s to account for 90% of the variance; while another method the Scree plot is a visual technique which displays the drop in importance of the components in discribing the data meaningfully. n this method, there can be multiple elbows in the graph and the right amount should be determined in consideration with the other techniques.

This data has been analysed using prcomp in R, and results have provided the standard deviations (which are squared to calculte the variance or eiganvalues) and the rotations (or the variable loadings or eigan vectors). The eigan values of the 6 components represent the proportion of variance accounted for by each component. Only the first two PC’s eiganvalues are greater than one. Now considering the percent of variance each eiganvalue carries, the first two otal to 89%, although slightly under the ideal 90% target still an accepably large proportion of the variance for only 2 components. The third PC is now considered to see if the additional information is worth while, the eigenvalue for the third PC is substantially lower than the second, and by including PC3, the total variance is only improved minamilly to 94%. Now to review the Scree plot to help with the final decision, it is seen that there are elbows at 2 and 3, implying the decision is to keep only 1 or 2 components. As PC1 would account for only 69% of the variation, the final decision is made confidently to keep the first two components only to be used for the remaining of the analysis.

3.Interpretation of a biplot of the first two PCs in conjunction with the PC loadings and state the correlation between the first two PCs.

Next, the loadings for each of the two principle components are graphed for visual analysis. THis process orders the weighting of each variable in the PC and can be useful in understanding any implied interpretations of the components, or latent variables. The loading plot for PC1 (displayed below) shows miimal range for all the original variables, approximately 0.32 to 0.46. THis similar spread of weightings accross all variables implies PC1 can be interpreted as an average size discription of the data. Next, the loadings plot for PC2 is depicted and reviewed. This plot has a much larger range of variables, from -0.5 to 0.6, there is a visible dirrefentiation in the weightings of thigh and hip variables (both positive values) to bicep, chest, and shoulder (all negative values), with waist contributing minimally. The biplot is diplayed as an ordination method to review to supplement these results. The biplot provdies a visual joint view of the contributions of PC1 and PC2 and supports the determination of the loading plots. In PC1 the average weightings of each variable shown can be interpreted as an average size component an in PC2 the contrast between the upper and lower body variables show an interpretation of body shape. Together it is clear how size and shape can provide a good interpretation of body part measurements.

4.Interpretation of the relationships among people numbered 42, 39 and 20.

Using the above descriptions to understand the information accounted for by P1 and PC2, we can isolate a few of the population to interpret in more depth. Person 42 sits in the centre of the PC2 range, representing a proportional upper to lower body relationship, and is found at the far left of PC1, at the bottom range of average size. Person 39 is at the top extreme of PC2, with a larger than the lower end of the lower to upper body ratio and near the right of PC1 representing a greater than average size. Person 20 is similar above average size (PC1) as Person 39 but is found to have a normalish lower to upper body

5.Was this analysis appropriate and has it been useful?

Principle component analysis was found to be a reasonable application for this dataset, he correlation matrix howed adequate variable assocciation for the method, and results found the original six variables could be redued to two principle components representing average siza and body proportion. These results both provide successful reduction and intuitive results when applied to the data.

Perform a Factor Analysis on all six variables (apply no rotation) using the PCA above to determine the number of factors. Interpret the output including the (10 marks):

For the same data set, body, that we just preformed Principle Component Analysis on and found the best number of components to include in the analysis was 2. We can now use our knowledge of this data, the variables, the correlation between them and the number of components in the PCA reduction technique, to further our understanding of the data through exploratory Factor Analysis (EFA). We have choosen EFA to explore this data set, as we are not experts in the field of body girth and do not have many preconceived ideas of theories about the relationships that will exist between the variables in the dataset. We will use our gained knowledge through PCA to help us with the initial step of our factor analysis.

##                shoulder.girth chest.girth waist.girth hip.girth
## shoulder.girth      1.0000000   0.8168877   0.6612690 0.3030755
## chest.girth         0.8168877   1.0000000   0.8505181 0.5508019
## waist.girth         0.6612690   0.8505181   1.0000000 0.6685407
## hip.girth           0.3030755   0.5508019   0.6685407 1.0000000
## thigh.girth         0.2574779   0.4186951   0.5265708 0.8482617
## bicep.girth         0.7878425   0.8485308   0.7673020 0.5077429
##                thigh.girth bicep.girth
## shoulder.girth   0.2574779   0.7878425
## chest.girth      0.4186951   0.8485308
## waist.girth      0.5265708   0.7673020
## hip.girth        0.8482617   0.5077429
## thigh.girth      1.0000000   0.4781754
## bicep.girth      0.4781754   1.0000000

Recalling the the correlation matrix for the given data set showed a range of variable relationships with correlations ranging from 0.3-0.9. The spread of association strength accross variables will be good for proceeding with our analysis as all low correlations or all high correlations would limit the usefulness of FA, but we must also consider that this range of correlations will imply multiple factors may be needed to best explain the variance found in the dataset. Now, as our Principle Component Analysis found the best number of components to be used in th eanalysis was 2, we will use this for the beginnings of our factor analysis.

##to find eigen values and vectors from PCA for step 1 of FA no rotation
body.prcomp <- prcomp(body, center=TRUE, scale=TRUE )

#transpose the data so the original variables are composed of the factors
Fvec<-round(t(body.prcomp$rotation),digits=3)
#eigen values are variance so square std dev
FEV<-round(t(body.prcomp$sdev^2), digits=3)
#join the column of EValues to the Evectors  
results <- cbind(t(FEV),Fvec)

The results shown above contain both the eigen values and eigen vecots, to be used to create our variable equations with factor loadings. The eigen values (in column 2) are needed to convert the component loadings into factor loadings, serving to standardize the components and transform into factors. This will be done for each original variable which can now be written in terms of the 6 corresponfding factors. Now, we will only consider the first 2 factors for the analysis as determined by the PCA and discussed above.

For example of how to create the original variable equations with factor loadings, we can look at the first variable shoulder girth, now taking the square root of each eigen value and multiplying by the vector component associated with each PC, we get a transformed vector with the loadings for the first two factors factor.

(shoulder <- round(t(results[1:2,1]^(1/2)*results[1:2,2]),digit=2))

##       PC1   PC2
## [1,] 0.79 -0.51

(communality <- sum(shoulder^2))

## [1] 0.8842

The calculation is given above, and this would result in the equation: shoulder.girth = 0.79F1 - 0.51F2 + e1
Now, this could easily be continued to produce the 6 variable equations, all containing only the choosen 2 factors and an associated error.From each resulting equation we could determine the communalities as the sum of the squares of each factor loading. But we need to Again for our example, the shoulder the communality would be 0.88. The communality for each original variable should be high as the two factors should account for a large portion of the variance in the variable. The fact that we began with PCA, or using eigenvectors and eigen values, we know the factors are orthogonal basis, it is only the error or variance terms associated with each original variable that will not be orthogonal to each other. This breaks the assumptions of FA technique but assuming we achieve high communalities than the method is considered robust to this violation.

#results to display are the fa anlysis and loadings w adj cutoff
library(psych)

## Warning: package 'psych' was built under R version 3.3.3

## 
## Attaching package: 'psych'

## The following object is masked from 'package:MVN':
## 
##     mardia

(body.fa <- fa(r = Fcorr, nfactors = 2, n.obs = 44, rotate = "none", fm = "ml"))

## Factor Analysis using method =  ml
## Call: fa(r = Fcorr, nfactors = 2, n.obs = 44, rotate = "none", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                 ML1   ML2   h2    u2 com
## shoulder.girth 0.35  0.80 0.77 0.230 1.4
## chest.girth    0.60  0.76 0.93 0.069 1.9
## waist.girth    0.70  0.55 0.80 0.199 1.9
## hip.girth      1.00 -0.06 1.00 0.005 1.0
## thigh.girth    0.85 -0.08 0.72 0.276 1.0
## bicep.girth    0.55  0.70 0.79 0.211 1.9
## 
##                        ML1  ML2
## SS loadings           2.98 2.03
## Proportion Var        0.50 0.34
## Cumulative Var        0.50 0.83
## Proportion Explained  0.60 0.40
## Cumulative Proportion 0.60 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  15  and the objective function was  5.91 with Chi Square of  237.39
## The degrees of freedom for the model are 4  and the objective function was  0.18 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  44 with the empirical chi square  0.79  with prob <  0.94 
## The total number of observations was  44  with Likelihood Chi Square =  7.04  with prob <  0.13 
## 
## Tucker Lewis Index of factoring reliability =  0.947
## RMSEA index =  0.131  and the 90 % confidence intervals are  0 0.215
## BIC =  -8.1
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 ML1  ML2
## Correlation of scores with factors             1.00 0.97
## Multiple R square of scores with factors       1.00 0.94
## Minimum correlation of possible factor scores  0.99 0.88

print(body.fa$loadings, cutoff = 0.5)

## 
## Loadings:
##                ML1    ML2   
## shoulder.girth         0.804
## chest.girth     0.596  0.759
## waist.girth     0.703  0.554
## hip.girth       0.996       
## thigh.girth     0.847       
## bicep.girth     0.550  0.697
## 
##                  ML1   ML2
## SS loadings    2.983 2.026
## Proportion Var 0.497 0.338
## Cumulative Var 0.497 0.835

#want only some variables on f1 and rest on f2 with only some variables on both factors
(body.fa$communalities)

## shoulder.girth    chest.girth    waist.girth      hip.girth    thigh.girth 
##      0.7695398      0.9307001      0.8010906      0.9950000      0.7238053 
##    bicep.girth 
##      0.7890376

(body.fa$uniquenesses)

## shoulder.girth    chest.girth    waist.girth      hip.girth    thigh.girth 
##    0.230460112    0.069300017    0.198908797    0.004992736    0.276194117 
##    bicep.girth 
##    0.210962857

We now will do the full analysis using the fa function found in the Psych package in R on our 44 observations, limiting the output to two factors, having no rotation and method set to ‘ml’ or maximum liklihood to see the results for all 6 original variables in terms of two factors. Note that the results for shoulder will differ slightly to those shown above as we ar enow using a slightly different means of calulation the factors through the FA function.

1.Variance explained

The output of the fa model displays both the variance accounted for by each factor and the cumulative variance of the factors, for our two factors is 83%, which is an acceptable amount of the variance and so we will continue to analyse the results.

2.Chi-square test

The results shown above include a chi-squared test with the hypothesis the the model is a perfect fit, the resulting p-value greater than 0.05 indicates we keep the null hypothesis and therefore assume the model with 2 factors sufficiently represents the data.

3.Variable loadings

We now need to consider how each variable is loaded on the two factors. The ideal result will have as many variables as possible largely dependent on only one factors and minimal variables dependent on both. A result of this sort should simplify our interpretation of the factors and the underlying representation of the interrelationships between the original variables.

Looking at the loadings output where we have increased the cutoff to 0.5 to isolate those variables with larger weightings on each factor, we see that shoulder, hip and thigh are highly loaded on a single factor. While chest, waist and bicep are loading on both factors. It is this result that indicates that a rotation may be in order to optimize the transformation so that each factor represents a minimal number of variables.

4.Difference in uniqueness values for the variables hip.girth and thigh.girth

Finally, of importance in the fa results output yet to be considered is the communalities of each variable. The communalities for the 6 body variables have been dislayed above and range from 0.724 - 0.995. As the goal of maximum liklihood method, as with other methods, is to optimise communalities and minimize uniqueness, the communalities for chest and hip are very high while thigh is quite a lower but acceptable, with shoulder and waist and bicep sitting in between and again at a semi high result.

Repeat the FA with a varimax rotation and calculate the communalities. Interpret (5 marks):

Using the same Psych package and fa function, we repeat our analysis this time using a verimax rotation. The concept of a verimax rotation is to create factors with a small number of large loadings and a large number of minimal loadings, which will isolate the factor to few original variables and simplify the interpretation of an underlying meaning for the factor to be describing in the data. Verimax specifically applies a rotation that will maximise variance of the loadings or produce the best opposition of the variables. The resulting factor matrix will be a combination ofthe loadings and the pattern matrix which describes the variable - factor correlations.

The verimax rotation will maintain orthogonality. The rotation does not alter the pattern matrix and so the variance accounted for by the two factors remains at 83% and the chi squared test still maintains the 2 factors are suffiecent for modelling the data. What we hope to achieve from the rotation is a simplified relationship between variables displayed through the factors.

####Psych package FA method w Rotation
body.far <- fa(r = Fcorr, nfactors = 2, n.obs = 44, rotate = "varimax", fm = "ml")

print(body.far$loadings, cutoff = 0.5)

## 
## Loadings:
##                ML2   ML1  
## shoulder.girth 0.874      
## chest.girth    0.910      
## waist.girth    0.751      
## hip.girth            0.961
## thigh.girth          0.828
## bicep.girth    0.837      
## 
##                  ML2   ML1
## SS loadings    2.965 2.044
## Proportion Var 0.494 0.341
## Cumulative Var 0.494 0.835

(body.far$communalities)

## shoulder.girth    chest.girth    waist.girth      hip.girth    thigh.girth 
##      0.7695398      0.9307001      0.8010906      0.9950000      0.7238053 
##    bicep.girth 
##      0.7890376

(body.far$uniquenesses)

## shoulder.girth    chest.girth    waist.girth      hip.girth    thigh.girth 
##    0.230460112    0.069300017    0.198908797    0.004992736    0.276194117 
##    bicep.girth 
##    0.210962857

1.Changes in the variable loadings

The results of the fa function using a verimax rotation successfully produce large loadings of each variable on only one factor. The results are an optimal outcome for isolating the factors to minimal variables and will provide a means of simpiler interpretation. We can now look at the results in more detail to see that factor 1 is heavily loaded for hip and thigh while factor 2 is heavily loaded for the shoulder, waist, chest, and bicep variables. Factor 1 may therefore be interpreted as lower body size or perhaps lower body fitness, while factor 2 can be interpreted as a measure of upper body size or prehaps again upper body fitnes.

2.The communalities

The communalities can not be changed by applying a verimax rotation, as it also does not change the chi-squared statistic, or cumulative % variance. Thus, the communality results shown after the verimax rotation match the earlier results shown with no rotation and therefore have been unchanged.

Assignment 1

[Laura Sturmer]

2017-04-18

Question 1 (30 marks)

Question 2 (30 marks)

Question 3 (30 marks)