Homework 4 - STAT 601

##Kevin Kuipers (Completed by myself) ##September 18, 2018

Problem 1.

The data from contains the velocities of 82 galaxies from six well-separated conic sections of space (Postman et al., 1986, Roeder, 1990). The data are intended to shed light on whether or not the observable universe contains superclusters of galaxies surrounded by large voids. The evidence for the existence of superclusters would be the multimodality of the distribution of velocities.(8.1 Handbook)

##Problem 1 Part a)

a.) Construct histograms using the following functions: 

    -hist() and ggplot()+geom_histogram()
  
    -truehist() and ggplot+geom_histogram() (pay attention to the y-axis!)
  
    -qplot()
    
Comment on the shape and distribution of the variable based on the three plots. 
(Hint: Also play around with binning)

First I will load the libraries needed for Part a. Then I will plot the histograms using hist() and ggoplot() + geom_histogram()

library(MASS)
library(tidyverse)
library(dplyr)
data(galaxies)

#creating data frame from the galaxies data set and naming the variable velocity 

velocity <- galaxies[1:length(galaxies)]

gal_dat <- data.frame(
  velocity
)

#ploting histogram of the galaxies data set with velocity for variable 
hist(gal_dat$velocity, xlab="Velocity",ylab="Frequency", main="Histogram of Galaxies Dataset (Velocity)")

#using ggplot for creating a histogram of the galaxies data set with velocity for variable 
ggplot(gal_dat, aes(velocity)) +
  geom_histogram() + 
  labs(title="Histogram of Galaxies Dataset (Velocity)",xlab="Velocity",ylab="Frequency")

##My Repsonse: It appears there could be 3-4 data clusters. 1 cluster around 10,000, the main cluster of the data around 20,000, and possibly another two clusters around 32,000-35,000

Now I will use truehist and ggplot()+geom_histogram

truehist(gal_dat$velocity, main="Histogram of Galaxies Dataset (Velocity)", xlab="Velocity",ylab="Frequency")

ggplot(gal_dat, aes(velocity)) +
  geom_histogram()+
  labs(title = "Histogram of Galaxies Dataset (Velocity)", xlab="Velocity",ylab="Frequency")

##My Response It appears that truehist converted the frequency count to scientific notation. It also colored the histogram blue.

#Problem 1 Part b)

b.) Create a new variable \textit{loggalaxies} = $\log$(galaxies). 
Construct histograms using the functions in part a.) 
and comment on the shape and differences.

I will create a data frame containing both velocity and the log of velocity. Then I will display the same histograms above but looking and the log of velocity

#taking the log of velocity 
log_velocity <- log(velocity)

#Creating data frame with both velocity and log of velocity 
gal_dat2 <- data.frame(
  velocity,
  log_velocity
)

#ploting histogram of the galaxies data set with the log of velocity for variable 
hist(log_velocity, main = "Histogram of Galaxies Dataset (Log of Velocity)", xlab="Log of Velocity",ylab="Frequency")

#using ggplot for creating a histogram of the galaxies data set with the log of velocity for variable 
ggplot(gal_dat2, aes(log_velocity)) +
  geom_histogram() +
  labs(title = "Histogram of Galaxies Dataset (Log of Velocity)", xlab="Log of Velocity", ylab="Frequency")

#looking at the truehist of the log of velocity 
truehist(log_velocity,main="Histogram of Galaxies Dataset (Velocity)", xlab="Velocity",ylab="Frequency")

##My Response:

The data still appears to contain several clusters. However, the main cluster seems to have shifted to the right a little more. Just looking at the main cluster of data it seems to be more of a normal curve than before.

#Problem 1 Part c)

c.) Construct kernel density estimates using two different 
choices of kernel functions and three choices of bandwidth 
(one that is too large and oversmooths, one that is too small 
and undersmooths, one that appears appropriate.) 
Therefore you should have six different kernel density estimates plots. 
Discuss your results. 
You can use the log scale or original scale for the variable.

From the lecture it appears to me that Gaussian kernel and triangular kernel looked the neatest for density. So I will use both Gaussain kernel and triangular kernel on the galaxies data set. While peforming these results I will be using the log of velocity. The plots will be histograms with density curves overlain on them.

#loading KernSmooth library
library(KernSmooth)

#ploting the histograms with density curves

#ploting histogram with Gaussian kernal density bandwith=2
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Gaussian Kernal Density of Velocity: BandWidth=2", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=2),lwd=1)

#ploting histogram with Gaussian kernal density bandwith=0.04707
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Gaussian Kernal Density of Velocity: BandWidth=0.04707", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=0.04707),lwd=1)

#ploting histogram with Gaussian kernal density bandwith=0.2707
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Gaussian Kernal Density of Velocity: BandWidth=0.2707", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=0.2707),lwd=1)

#ploting histogram with triangular kernal density bandwith=2
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Triangular Kernal Density of Velocity: BandWidth=2", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=2, window="triangular"),lwd=1)

#ploting histogram with triangular kernal density bandwith=0.04707
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Triangular Kernal Density of Velocity: BandWidth=0.04707", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=0.04707, window="triangular"),lwd=1)

#ploting histogram with triangular kernal density bandwith=0.2707
hist(log_velocity, xlab="Log of Velocity", ylab="Frequency",main="Triangular Kernal Density of Velocity: BandWidth=0.2707", probability = TRUE, boarder="grey")
lines(density(log_velocity, width=0.2707, window="triangular"),lwd=1)

##My Response:

When looking at the base-R histograms and density plots with the binwidth = 0.2707 I would only suggest 3-4 super clusters. However, ggplot has little more features and can reveal more instight. Just judging on the histograms from ggplots I would hyphothesize that there are 4 super clusters due to two left tails, one main central source of data, and then one right tail. Lets look at the density plots in ggplot.

##ggplot of histogram with a gaussian kernal density curve at adjust=10
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='gaussian',adjust=10) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Gaussian Kernal Density of Velocity: adjustment=10", xlab="Log of Velocity", ylab="Frequency")

##ggplot of histogram with a gaussian kernal density curve at adjust=0.1
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='gaussian',adjust=0.1) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Gaussian Kernal Density of Velocity: adjustment=0.1", xlab="Log of Velocity", ylab="Frequency")

##ggplot of histogram with a gaussian kernal density curve at adjust=1
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='gaussian',adjust=1.0) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Gaussian Kernal Density of Velocity: adjustment=1.0", xlab="Log of Velocity", ylab="Frequency")

##ggplot of histogram with a triangular kernal density curve at adjust=10
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='triangular',adjust=10) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Triangular Kernal Density of Velocity: adjustment=10", xlab="Log of Velocity", ylab="Frequency")

##ggplot of histogram with a triangular kernal density curve at adjust=0.1
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='triangular',adjust=0.1) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Triangular Kernal Density of Velocity: adjustment=0.1", xlab="Log of Velocity", ylab="Frequency")

##ggplot of histogram with a triangular kernal density curve at adjust=1
ggplot(gal_dat2, aes(x=log_velocity, y=..density..)) +
  geom_histogram(fill="cornsilk", color="grey60", size=0.2) +
  geom_density(kernal='triangular',adjust=1.0) + 
  xlim(min(log_velocity), max(log_velocity)) + 
  labs(title="Triangular Kernal Density of Velocity: adjustment=1.0", xlab="Log of Velocity", ylab="Frequency")

##Problem 1 part d)

d.) What is your conclusion about the 
possible existence of superclusterd of galaxies? 
How many superclusters (1,2, 3, ... )?

##My Response: The density plots with to high a bandwidth are way too smooth and lump the entire data set into one smooth curve. When bandwidth is too small the density curves are too many curves.

When looking at just the right bandwidth for the histograms with the density plots over layed I feel like there coule be a case 3 and a case 4 superclustered galaxies. In my case, I will go with the posibility of 4 superclusters.

##Problem 1 part e)

e.) How many clusters did it find? 
Did it match with your answer from (d) above? 
Report parameter estimates and BIC of the best model.

library(mclust)

#computing a model 
gal_model <- Mclust(gal_dat2$log_velocity)
summary(gal_model, parameters=TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust E (univariate, equal variance) model with 6 components: 
## 
##  log-likelihood  n df    BIC     ICL
##        52.67732 82 12 52.474 45.3563
## 
## Clustering table:
##  1  2  3  4  5  6 
##  7  2 36 31  3  3 
## 
## Mixing probabilities:
##          1          2          3          4          5          6 
## 0.08536585 0.02441079 0.44458577 0.35938860 0.04966360 0.03658538 
## 
## Means:
##         1         2         3         4         5         6 
##  9.179991  9.688363  9.897559 10.042100 10.162541 10.405219 
## 
## Variances:
##           1           2           3           4           5           6 
## 0.001482092 0.001482092 0.001482092 0.001482092 0.001482092 0.001482092

#looking at how many data sets mCluster creates for clustering the data
mclustBIC(gal_dat2$log_velocity)

## Bayesian Information Criterion (BIC): 
##           E          V
## 1 -19.41401 -19.414006
## 2  39.77275  34.678455
## 3  50.52974  35.857394
## 4  41.71664  46.716877
## 5  52.23989  33.543710
## 6  52.47400  22.615285
## 7  43.66060  15.036094
## 8  34.84621   3.507817
## 9  27.39640  -6.073024
## 
## Top 3 models based on the BIC criterion: 
##      E,6      E,5      E,3 
## 52.47400 52.23989 50.52974

##My Response:

This is very interesting to me. mClust function determines that there are 6 tables and thus 6 superclusters when using the log of velocity. My initial hypthosize of suggesting 4 supercluster appears to be incorrect due the mClust model summary. Looking over, the histograms again and density plots I can see how it determined 6 super clusters due to the distribution and tails. I wonder if this is the case for looking at the model with out taking the log of velocity.

##mClust without Log of Velocity

library(mclust)
gal_model2 <- Mclust(gal_dat2$velocity)
summary(gal_model2, parameters=TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 4 components: 
## 
##  log-likelihood  n df       BIC       ICL
##        -765.694 82 11 -1579.862 -1598.907
## 
## Clustering table:
##  1  2  3  4 
##  7 35 32  8 
## 
## Mixing probabilities:
##          1          2          3          4 
## 0.08440635 0.38660329 0.37116156 0.15782880 
## 
## Means:
##         1         2         3         4 
##  9707.492 19804.259 22879.486 24459.536 
## 
## Variances:
##          1          2          3          4 
##   177296.7   436160.9  1261611.3 34437115.3

mclustBIC(gal_dat2$velocity)

## Bayesian Information Criterion (BIC): 
##           E         V
## 1 -1622.361 -1622.361
## 2 -1631.243 -1595.403
## 3 -1584.016 -1592.299
## 4 -1592.828 -1579.862
## 5 -1592.299 -1593.277
## 6 -1601.228 -1604.069
## 7 -1588.610 -1611.538
## 8 -1597.427 -1625.804
## 9 -1600.709 -1633.494
## 
## Top 3 models based on the BIC criterion: 
##       V,4       E,3       E,7 
## -1579.862 -1584.016 -1588.610

##My Response:

This is interesting. mclust suggests 4 different superclusters on the data set without taking the log of velocity. And this supports my original hypothesis of 4 superclusters. And even looking at BIC criterion it stuggests 4.

##Problem 2.

The data from gives the birth and death rates for 69 countries (from Hartigan, 1975). (8.2 Handbook)

a.) Produce a scatterplot of the data and 
overlay a contour plot of the estimated bivariate density.

First I will use contour function to plot the data and then ggplot to create a contour plot with a scatter plot overlayed

#Loading library and data set
library(HSAUR3)
library("KernSmooth")
data(birthdeathrates)

#Creating contour plot 
birthdeathrates1 <- bkde2D(birthdeathrates, bandwidth=sapply(birthdeathrates, dpik))

#plotting the contour plot
contour(x = birthdeathrates1$x1, y=birthdeathrates1$x2, z= birthdeathrates1$fhat,
      xlab="Birth Rate",
      ylab='Death Rate',
      main='contour Scatterplot: Birth Rate Vs. Death Rate')
points(birthdeathrates, pch=1, col='red')

#using ggplot for contour plot
ggplot(data=birthdeathrates, aes(x=birth, y=death)) + 
  geom_density_2d(aes(color=..level..)) +
  scale_color_gradient(low='blue',high='orange') +
  theme_bw() + 
  geom_point() +
  labs(title='contour Scatterplot: Birth Rate Vs. Death Rate',
       xlab='Birth Rate',
       ylab='Death Rate')

##Problem 2 part b)

b.) Does the plot give you any interesting insights 
into the possible structure of the data?

##My Response

Yes, it appears the birth is higher than the death rate. In fact, where most of the clustering occurs around 20 for the birth rate the death rate is around 10. Therefore, it seems like the birth rate is twice the amount of the death rate. It also appears that as the birth rate increases the death rate very slowly increases. Then the death rate starts to increases once the birth rate reaches 40. But this is not the case for all data points. It likes that data Kind of resembles a parabola.

##Problem 2 part c)

c.) Construct the perspective plot (persp() in R, 
GGplot is not required for this question).

#Since a contour plot has already been done just a few modifications and additions to the code will produce a perspective plot using persp()

#3d perspective plot 
persp(x = birthdeathrates1$x1, y=birthdeathrates1$x2, z=birthdeathrates1$fhat,
      xlab = 'Birth Rates',
      ylab = 'Death Rate',
      zlab = 'Estimated Density',
      main = 'Perspective Plot of Birth Rate vs. Death Rate',
      theta = -150, 
      axes=TRUE, 
      box=TRUE,
      phi=25,
      col='orange')

##Problem 2 part d)

d.) Model-based clustering (Mclust). 
Provide plot of the summary of your fit 
(BIC, classification, uncertainty, and density).

First I will fit a model using the mclust library and print a summary of the model with its parameters.

library(mclust)
dbmod1 <- Mclust(birthdeathrates)
summary(dbmod1, parameters=TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 4 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -424.4194 69 12 -899.6481 -906.4841
## 
## Clustering table:
##  1  2  3  4 
##  2 17 38 12 
## 
## Mixing probabilities:
##          1          2          3          4 
## 0.02898652 0.24555002 0.55023375 0.17522972 
## 
## Means:
##           [,1]     [,2]      [,3]      [,4]
## birth 55.94967 43.80396 19.922913 33.730672
## death 29.34960 12.09411  9.081348  8.535812
## 
## Variances:
## [,,1]
##         birth   death
## birth 10.2108  0.0000
## death  0.0000 10.2108
## [,,2]
##         birth   death
## birth 10.2108  0.0000
## death  0.0000 10.2108
## [,,3]
##         birth   death
## birth 10.2108  0.0000
## death  0.0000 10.2108
## [,,4]
##         birth   death
## birth 10.2108  0.0000
## death  0.0000 10.2108

##Problem 2 part d - BIC plot

Next I will plot the BIC of the model using the plot function. The plot function will be easy. The ggplot function will require creating a data frame in order plot the BIC models from the mclust model.

#ploting BIC using base r-function
plot(dbmod1, what='BIC')

#Creating Data Frame from the BIC section of the mclust model. This is done for plotting in ggplot2 library
EII <- dbmod1$BIC[,1]
VII <- dbmod1$BIC[,2]
EEI <- dbmod1$BIC[,3]
VEI <- dbmod1$BIC[,4]
EVI <- dbmod1$BIC[,5]
VVI <- dbmod1$BIC[,6]
EEE <- dbmod1$BIC[,7]
EVE <- dbmod1$BIC[,8]
VEE <- dbmod1$BIC[,9]
VVE <- dbmod1$BIC[,10]
EEV <- dbmod1$BIC[,11]
VEV <- dbmod1$BIC[,12]
EVV <- dbmod1$BIC[,13]
VVV <- dbmod1$BIC[,14]
bicseq <- seq(from=1, to=9)


df <- data.frame (
  EII,
  VII,
  EEI,
  VEI,
  EVI,
  VVI,
  EEE,
  EVE,
  VEE,
  VVE,
  EEV,
  VEV,
  EVV,
  VVV,
  bicseq
)

#plotting BIC using ggplot 
ggplot(df, aes(x=bicseq)) + 
  geom_line(aes(y=EII, color='EII')) + 
  geom_line(aes(y=VII, color='VII')) + 
   geom_line(aes(y=EEI, color='EEI')) +
    geom_line(aes(y=VEI, color='VEI')) +
  geom_line(aes(y=EVI, color='EVI')) +
    geom_line(aes(y=VVI, color='VVI')) +
    geom_line(aes(y=EEE, color='EEE')) +
    geom_line(aes(y=EVE, color='EVE')) +
    geom_line(aes(y=VEE, color='VEE')) +
    geom_line(aes(y=VVE, color='VVE')) +
    geom_line(aes(y=EEV, color='EEV')) +
  geom_line(aes(y=VEV, color='VEV')) +
  geom_line(aes(y=EVV, color='EVV')) +
    geom_line(aes(y=VVV, color='VVV')) +
  scale_x_discrete(limits=c(1,2,3,4,5,6,7,8,9)) + labs(x="Number of Components",y="BIC", title="BIC Plot of Mclust Model for Birth vs Death Rates")

Looking at the summary and the BIC plot it appears the EII model with 4 compoents is the best option since it contains the lowest BIC

##Problem 2 part d - classification plot

Next lets look at classification plot of the model using the plot function. A dataframe will also be created using the classification vector from the mClust model and inserted into the birthdeathrates data set in order to look at ggplot classification

#using the plot function to plot the classification of the model 
plot(dbmod1, what='classification')

#Creating a datafame and inserting the classification vector from the mclust model into the birthdeathrates data set
class.dbrates <- data.frame(birthdeathrates, 'classification'=as.character(dbmod1$classification))

library(factoextra)
#Using ggplot but fviz_mclust function from the library factoextra to plot the classification of the model
fviz_mclust(dbmod1, what='classification', geom='point')

#ggplot of the model to plot the classification of the model
ggplot(class.dbrates, aes(x=birth, y=death, col=classification)) + 
  geom_point() +
  labs(x='Birth Rates',y='Death Rates', title='GGPlot of Classification') +
  stat_ellipse(aes(x=birth, y=death, group=classification))

From the classification plot it displays the clusters of the model of death rates vs birth rates. It appears that as birth rates increase by 10 a new cluster begins.

##Problem 2 part d - Uncertainty plot

Now lets look at look at uncertainty using the plot() function. And a ggplot of the uncertainty model will be plotted as well.

#Using the plot function to plot the uncetainty of the model
plot(dbmod1, what='uncertainty')

#using fvis_mclust for ggplot for plotting the uncertainty of the model
fviz_mclust(dbmod1, what='uncertainty')

The classification plot displays the model’s uncertainty by the points that will fail to be predicted.

##Problem 2 part d - density plot

Now lets look at the density of the model using the plot(). ggplot will also be used to produce a density plot.

#plot function for plotting density of the model
plot(dbmod1, what='density')

#using ggplot for plotting density of the model 
ggplot(data=birthdeathrates, aes(x=birth, y=death)) + 
  geom_density2d(aes(color=..level..)) +
  scale_color_gradient(low='blue',high='orange') +
  theme_bw() + 
  labs(title='contour plot: Birth Rate Vs. Death Rate',
       xlab='Birth Rate',
       ylab='Death Rate')

The log density contour plot continues to display the birth as twice the amount of the death rate. It also displays the clusters containing the frequent observations.

##Problem Part e)

e.) Discuss the results (structure of data, outliers, etc.). 
Write a discussion in the context of the problem.

To me the scatterplot/contour plots of the birth rates vs death rates look like a parabola resembles the data set. In fact, I going to use ggplot and a quadratic term to the scatterplot with the line cutting through the data points.

ggplot(data=birthdeathrates, aes(x=birth, y=death)) + geom_point() + geom_smooth(method='lm', formula='y~poly(x,2)') + labs(title='Scatterplot of birth rates vs death rates', xlab='birth rates', ylab='death rates')

Next lets look at the mclustBIC() to see which models are the best

mclustBIC(birthdeathrates)

## Bayesian Information Criterion (BIC): 
##          EII        VII       EEI       VEI       EVI       VVI       EEE
## 1 -1006.5723 -1006.5723 -961.7502 -961.7502 -961.7502 -961.7502 -950.3669
## 2  -936.3442  -914.8037 -938.6127 -911.9710 -924.7310 -915.6217 -933.9448
## 3  -906.7729  -907.3547 -905.7403 -908.3174 -911.0701 -916.6248 -909.8428
## 4  -899.6481  -921.0631 -903.7704 -920.1226 -906.1018 -922.7386 -906.5496
## 5  -907.1378  -924.0068 -915.6050 -921.8611 -912.6162 -928.8162 -914.7571
## 6  -914.1679  -925.3259 -911.3484 -929.7137 -926.3244 -946.8290 -914.6918
## 7  -912.5610  -940.0067 -916.3920 -941.5804 -933.6770 -961.1733 -925.9343
## 8  -924.2724  -953.6153 -928.2698 -955.4928 -947.8093 -979.3765 -932.5095
## 9  -934.9379         NA -940.9908        NA        NA        NA -945.1889
##         EVE       VEE       VVE       EEV       VEV       EVV        VVV
## 1 -950.3669 -950.3669 -950.3669 -950.3669 -950.3669 -950.3669  -950.3669
## 2 -923.7050 -915.4055 -909.3891 -916.4290 -911.3583 -920.4713  -915.5710
## 3 -916.3323 -912.5420 -918.3377 -913.3972 -914.0597 -916.1073  -920.1468
## 4        NA -921.7029 -930.5803 -920.0012 -932.9836 -929.8081  -935.1407
## 5        NA -931.6311 -946.4479 -936.9447 -938.7558        NA  -952.6602
## 6        NA -943.2135 -951.2986 -930.4589 -952.1768        NA  -973.6995
## 7        NA -958.8094 -966.6536 -978.8477 -970.2239        NA  -994.2301
## 8        NA -967.8431 -981.9471 -992.3116 -987.2295        NA -1006.1989
## 9        NA        NA        NA -972.9489        NA        NA         NA
## 
## Top 3 models based on the BIC criterion: 
##     EII,4     EEI,4     EEI,3 
## -899.6481 -903.7704 -905.7403

##Discussion:

It seems like the data best resembles a quadratic function. It still seems like the death rates are half the amount of the birth rates. But as birth rate increases from 5-25 to death rate very slightly declines. Then as the birth rates are between 25-30ish the death rate seems to stay within the same range. But then as birth rates increase beyond 30 the death rates seem to increase again. Overall, the death rate is not keeping up with birthrate. In fact, this data shows that the birthrates are higher than the death rates. More people are being born than people dying. It also appears that the top 3 models based on BIC criterion are:

EEI - 4 clusters
EEI - 4 clusters
EEI - 3 clusters

##Problem 3. A sex difference in the age of onset of schizophrenia was noted by Kraepelin (1919). Subsequent epidemiological studies of the disorder have consistently shown an earlier onset in men than in women. One model that has been suggested to explain this observed difference is known as the subtype model which postulates two types of schizophrenia, one characterized by early onset, typical symptoms and poor premorbid competence; and the other by late onset, atypical symptoms and good premorbid competence. The early onset type is assumed to be largely a disorder of men and the late onset largely a disorder of women. Fit finite mixutres of normal densities separately to the onset data for men and women given in the data from . See if you can produce some evidence for or against the subtype model. (8.3 Handbook)

##Histograms and Density Plots

I will first subset the data by male and female to display histograms with density curves. I will use the density function to produce the lines on the histograms. I will also use the adjust command and adjust it by 1.

data(schizophrenia)

#Creating data frame
age <- schizophrenia$age
gender <- schizophrenia$gender

schizophrenia <- data.frame(
  age,
  gender
)

#Subsetting the data: one male subset and one female subset
male <- subset(schizophrenia, schizophrenia$gender=='male')
female <- subset(schizophrenia, schizophrenia$gender=='female')

#Creating male histogram
hist(male$age, xlab="Age", ylab="Frequency",main="Gaussian Kernal of Males with schizophrenia: adjust=1", probability = TRUE, boarder="red")
#density line for males with adjust=1
lines(density(male$age, adjust=1))

#Creating female histogram
hist(female$age, xlab="Age", ylab="Frequency",main="Gaussian Kernal of Females with schizophrenia: adjust=1", probability = TRUE, boarder="red")
#Density line for females with adjust=1
lines(density(female$age, adjust=1))

#Creating ggplot histogram with both female and male for comparison with density curves
ggplot(schizophrenia, aes(x=age, y=..density..)) +
  geom_histogram(fill='cornsilk', color='grey60', size=0.2) + 
  geom_density(kernal='gaussian', adjust=1.0) +
  xlim(min(age),max(age)) +
  labs(title = 'Gaussian Kernal Density of schizophrenia: adjustment=1.0',x='Age',y='Frequency') +
  facet_grid(gender~.)

From the histograms with gaussian kernal density overlapping it, it appears for males the centering of the date is in their 20s. Females, on the other hand, seem to have a wider spread, thus it seems schizophrenia for women is through various ages in life. The ggplot reveals a little more information pertaining to the males. For males it tapers off after the 20s and then reappears in the 40s and then dies down and comes back again somewhat in the 60s.

##Fitting to a model

I will use mclust to fit each subsetted data to a model. It will have a male model and a female model based on age. First I will compute the male model and then the female model.

male_model <- Mclust(male$age)
female_model <- Mclust(female$age)

##Male Model summary:

summary(male_model, parameter=TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 2 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -520.9747 152  5 -1067.069 -1134.392
## 
## Clustering table:
##  1  2 
## 99 53 
## 
## Mixing probabilities:
##         1         2 
## 0.5104189 0.4895811 
## 
## Means:
##        1        2 
## 20.23922 27.74615 
## 
## Variances:
##          1          2 
##   9.395305 111.997525

It appears that males are diagnosed with two different clusters. The mean age for the first cluster is around 20 and the second cluster around 28.

##Female model summary

summary(female_model, parameter=TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust E (univariate, equal variance) model with 2 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -373.6992 99  4 -765.7788 -774.8935
## 
## Clustering table:
##  1  2 
## 74 25 
## 
## Mixing probabilities:
##         1         2 
## 0.7472883 0.2527117 
## 
## Means:
##        1        2 
## 24.93517 46.85570 
## 
## Variances:
##        1        2 
## 44.55641 44.55641

It appears that females are diagnosed with two different clusters. The mean age for the first cluster is around 25 and the second cluster is around 47.

##BIC plot of both female and male models

Next I will plot the BIC of the model using the plot function. The first plot BIC will be the male model and the second plot will be the female model

##Male Model BIC

plot(male_model$BIC)

##Female Model BIC

plot(female_model$BIC)

It appears from the BIC models that both males and females do better with 2 clusters. When there two clusters the BIC tends to be at its lowest.

Density plots of both female and male models

Next I will plot the density of each model using the plot function. The first plot will be a density plot from the male model and the second plot will be from the female model

plot(male_model, what='density', xlab="Age",ylab="Density", main="Schizophrenia Male Model Density")
title(main="Schizophrenia Male Model Density")

plot(female_model, what='density', xlab="Age",ylab="Density", main="Schizophrenia Female Model Density")
title(main="Schizophrenia Female Model Density")

From the density plots of each model it appears men cluster around their 20s for schizophrenia and then starts to taper off. Women with schizophrenia tend to cluster in their 20s but then seems to cluster again around 40-50s.

##Uncertainty plots - Male and Female models

Next I will plot the uncertainy of each model using the plot function. The first plot will be a density plot from the male model and the second plot will be from the female model

plot(male_model, what='uncertainty', xlab='Age',ylab='Uncertainty', main='Schizophrenia: Uncertainty for Male Model')
title(main='Schizophrenia: Uncertainty for Male Model')

plot(female_model, what='uncertainty', xlab='Age',ylab='Uncertainty', main='Schizophrenia: Uncertainty for Female Model')
title(main='Schizophrenia: Uncertainty for Female Model')

From the uncertainty plots it appears that of uncertainty for males is higher in the younger ages, whereas the women tend to higher in the older ages of life.

##Discussion

Putting this all together and looking at the summaries and plots it appears that men tend to suffer from schizophrenia at younger age–in their early 20s. After that point, males with schizophrenia tends to steadily decline. In constrast, women tend to suffer from it at two main clusters. The biggest cluster in their mid 20s and second cluster later in life around the mid 40s. Therefore, it men seem to suffer from it at early stages in life whereas women tend deal with schizophrenia throughout a wider span of life.