The problems on this assignment require you to assess structure of a variable/data using kernel estimation methods. Answer all questions specfied on the problem but if you see something interesting and want to do more analysis please report it as well. The output you generate along with your comments and answers to questions should be on a separate file from the R-code which generated the output.
The galaxies data from MASS contains the velocities of 82 galaxies from six well-separated conic sections of space (Postman et al., 1986 , Roeder, 1990). The data are intended to shed light on whether or not the observable universe contains superclusters of galaxies surrounded by large voids. The evidence for the existence of superclusters would be the multimodality of the distribution of velocities. Construct a histogram of the data and add a variety of kernel estimates of the density function. What do you conclude about the possible existence of superclusters of galaxies? (8.1 Handbook)
Construct three types of histograms using the following functions: (hist(), truehist(), and qplot() ). Comment on the shape and distribution of the variable based on the three plots.
#load all required libraries
library(dplyr)
library(tidyr)
library(MASS)
library(ggplot2)
library(HSAUR3)
library(KernSmooth)
library(flexmix)
library(boot)
library(mclust)
library(gridExtra)
Set the galaxies data frame and add the variable Log_Velocity to create the natural logrithm of the galaxy velocity. Per R documentation there is a typo on row 78 of the data. The value should be 26960 rather than the imputed 26690. We will correct this typo in our initial data step.
#call the galaxies dataset
data(galaxies)
#set the vector of numeric velocities to dataframe
galaxies <- as.data.frame(galaxies)
#rename the single column in the galaxies dataframe to Velocity
names(galaxies) <- 'Velocity'
#replace the typo with the correct numeric value
galaxies[78,1] <- 26960
#add the log velocity column
galaxies <- galaxies %>%
mutate(Log_Velocity = log(Velocity))
The output of this plot shows a somewhat normal distribution of the galaxies data set.
#set the figure position
par(fig=c(0,1,0,1),new=T)
#draw the histogram
hist(galaxies$Velocity,
xlab = 'Velocity',
main = 'Histogram of Galaxy Velocity',
ylab = 'Frequency')
dev.off()
The truehist function outputs the same histogram as hist, but automatically fills the bars with color.
par(fig=c(0,1,0,1),new=T)
#draw the histogram
truehist(galaxies$Velocity,
xlab = 'velocity',
main = 'Histogram of Galaxy Velocity',
ylab = 'Frequency')
dev.off()
Qplot automatically increases the bin size of the histogram, which shows a bimodal distribution with tails that increase on both sides of the histogram.
par(fig=c(0,1,0,1),new=T)
#draw the histogram with ggplot2's qplot function
qplot(galaxies$Velocity) +
labs(title='Histogram of Galaxy Velocity',
x='Velocity of Galaxy',
y='Frequency')
dev.off()
Create a new variable loggalaxies = log(galaxies). Construct three histograms using the above functions and comment on the shape.
This histogram, with the natural log of velocity, shows a tail on the left side of the histogram begining to form. You cannot determine if there is any bimodality from the lack of bins.
par(fig=c(0,1,0,1),new=T)
#draw the histogram of log velocity
hist(galaxies$Log_Velocity,
xlab = 'Natural Log of Galaxy Velocity',
main = 'Histogram of Galaxy Velocity',
ylab = 'Frequency')
dev.off()
This function simply adds color to the hist function.
par(fig=c(0,1,0,1),new=T)
#draw the histogram of log velocity
truehist(galaxies$Log_Velocity,
xlab = 'Natural Log of Galaxy Velocity',
main = 'Histogram of Galaxy Velocity',
ylab = 'Frequency')
dev.off()
Again, there is definitely an increase in the left tail and right tail of this distribution. Bimodality is also present in the histogram.
par(fig=c(0,1,0,1),new=T)
#draw the histogram with ggplot2 of log velocity
qplot(galaxies$Log_Velocity) +
labs(title='Histogram of Galaxy Velocity',
x='Natural Log of Galaxy Velocity',
y='Frequency')
dev.off()
Construct kernel density estimates using 2 different choices of kernel functions and 3 choices of bandwidth (one that is too large and oversmooths, one that is too small and undersmooths, and one that appears appropriate.) Therefore you should have 6 different kernel density estimates plots.Discuss your results. You can use the log scale or original scale for the variable.
#construct a stat density plot with ggplot2, adjust = 1 for less smoothing
p1_g <- ggplot() +
stat_density(kernel='gaussian',adjust=1,aes(x=galaxies$Velocity)) +
labs(title = 'Gaussian Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
#construct a stat density plot with ggplot2, adjust = 2 for moderate smoothing
p2_g <- ggplot() +
stat_density(kernel='gaussian',adjust=2,aes(x=galaxies$Velocity)) +
labs(title = 'Gaussian Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
#construct a stat density plot with ggplot2, adjust = 3 for over-smoothing
p3_g <- ggplot() +
stat_density(kernel='gaussian',adjust=3,aes(x=galaxies$Velocity)) +
labs(title = 'Gaussian Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
#construct a triangular stat density plot, adjust for less smoothing
p1_t <- ggplot() +
stat_density(kernel='triangular',adjust=1,aes(x=galaxies$Velocity)) +
labs(title = 'Triangular Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
#construct a triangular stat density plot, adjust for moderate smoothing
p2_t <- ggplot() +
stat_density(kernel='triangular',adjust=2,aes(x=galaxies$Velocity)) +
labs(title = 'Triangular Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
#construct a triangular stat density plot, adjust for over-smoothing
p3_t <- ggplot() +
stat_density(kernel='triangular',adjust=3,aes(x=galaxies$Velocity)) +
labs(title = 'Triangular Kernal Density of Galaxy Velocity',
x = 'Galaxy Velocity',
y='Density Estimate')
The under-smoothed estimate shows approximately 4 clusters in the density plot. The left and right tails and the bimodality of the density upon it’s center.
#use the grid arrange function to construct a single plot of the two under-smoothed plots
par(fig=c(0,1,0,1),new=T)
grid.arrange(p1_g,p1_t)
dev.off()
The appropriately smoothed estimate shows no bimodality and an increase in the left tail of the density plot.
#use the grid arrange function to construct a single plot of the two moderately smoothed plots
par(fig=c(0,1,0,1),new=T)
grid.arrange(p2_g,p2_t)
dev.off()
The over-smoothed estimate shows no bimodality and no increase in the tails.
#use the grid arrange function to construct a single plot of the two over-smoothed plots
par(fig=c(0,1,0,1),new=T)
grid.arrange(p3_g,p3_t)
dev.off()
What is your conclusion about the possible existence of superclusters of galaxies? How many superclusters (1,2, 3, … )?
Judging by the undersmoothed density estimation and the ggplot2 histogram of the data, I would estimate that there are four superclusters. I deduce this from the bimodality of the histogram, the undersmoothed densit plot, and the two tails that are evident in the density plots and histogram.
Using Mclust function in R fit a finite mixture model. How many clusters did it find? Did it match with your answer from (d) above? Report parameter estimates and BIC of the best model. The qplot function (ggplot2) shows a bimodal distribution with two tails. The tail closest to a velocity of 0 has a much higher frequency than the tail with the highest velocity. The density estimation with the undersmoothed bandwith supports this theory.
The summary shows a total of four superclusters of galaxies with the largest frequencies having an average velocity between 19,800 and 23,000 on the velocity scale.
#create a mclust model and print the summary setting parameters = T
mod <- mod <- Mclust(galaxies$Velocity)
summary(mod,parameters=T)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 4 components:
##
## log.likelihood n df BIC ICL
## -765.7316 82 11 -1579.937 -1598.809
##
## Clustering table:
## 1 2 3 4
## 7 35 32 8
##
## Mixing probabilities:
## 1 2 3 4
## 0.08441927 0.38768587 0.36896338 0.15893147
##
## Means:
## 1 2 3 4
## 9707.522 19806.592 22880.348 24483.603
##
## Variances:
## 1 2 3 4
## 177311.8 437746.2 1231115.8 34305975.7
The density plot of the model shows three distinct superclusters with the far right tail not being as distinct.
#pot the density plot of the model
par(fig=c(0,1,0,1),new=T)
plot(mod,what="density")
dev.off()
After examining the best BIC of the models it is evident that the best possible clustering is at four superclusters, with a three cluster estimate being very close in BIC; most likely due to the low number of observations at the far right tail (high velocity).
#use the mclustBIC function to print the BIC summary of the model
mclustBIC(galaxies$Velocity)
## Bayesian Information Criterion (BIC):
## E V
## 1 -1622.518 -1622.518
## 2 -1631.401 -1595.633
## 3 -1584.673 -1592.408
## 4 -1593.485 -1579.937
## 5 -1593.361 -1593.345
## 6 -1602.266 -1604.112
## 7 -1589.153 -1611.579
## 8 -1597.984 -1625.847
## 9 -1601.089 -1633.533
##
## Top 3 models based on the BIC criterion:
## V,4 E,3 E,7
## -1579.937 -1584.673 -1589.153
The birthdeathrates data from HSAUR3 gives the birth and death rates for 69 countries (from Hartigan, 1975). (8.2 Handbook)
Produce a scatterplot of the data that shows a contour plot of the estimated bivariate density. Put the original points on the contour plot.
Does the plot give you any interesting insights into the possible structure of the data?
Yes, the countour plot shows that most birth rates have a 2:1 ratio to death rates in most countries. There are countries, and clusters of countries, whose birth rate far exceeds that ratio, but only one country has a higher death rate than it does birth rate. Note-Text annotations were added to the plot using the row names (country abbreviations) of countries that had birth/death rates beyond the central cluster. This helped to identify outliers in the data.
#plot the birth rate data into a 2d density countourplot setting. add text annotations of country beyond the central cluster
data(birthdeathrates)
par(fig=c(0,1,0,1),new=T)
ggplot(data=birthdeathrates,aes(birth,death)) +
geom_density2d(aes(colour=..level..)) +
scale_colour_gradient(low="green",high="red") +
theme_bw() +
geom_point() +
geom_text(aes(label=ifelse(death >= 15 | birth >= 35,row.names(birthdeathrates),''),hjust=0, vjust=0)) +
labs(title='Countour Scatterplot of Birth and Death Rates for 69 Countries',
x='Birth Rate',
y='Death Rate')
dev.off()
Construct the perspective plot (persp).
###note - some of this code was used directly from the R documentation of the persp function
#knitr::opts_current$set(fig.width=15,fig.height=15)
x <- seq(from=1,to=69,by=1)
y<- seq(0,2,by=1)
z <- outer(birthdeathrates$birth,birthdeathrates$death)
nrz <- nrow(z)
ncz <- ncol(z)
# Create a function interpolating colors in the range of specified colors
jet.colors <- colorRampPalette( c("blue", "green") )
# Generate the desired number of colors from this palette
nbcol <- 100
color <- jet.colors(nbcol)
# Compute the z-value at the facet centres
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
# Recode facet z-values into color indices
facetcol <- cut(zfacet, nbcol)
#leg <- cbind(1:69,row.names(birthdeathrates))
#leg <- paste(leg[,1],'=',leg[,2])
#leg1 <- leg[1:10]
#leg2 <- leg
#use the persp function to generate a perspective plot of the data
par(fig=c(0,1,0,1),new=T)
persp(x = x, y = x, z = z,
main = 'Perspective Plot of Birth and Death Rates for 69 Countries',
xlab = "x",
ylab = "y",
zlab = expression(K(Birth,Death)),
col= color[facetcol],
theta = 130,
phi = 35,
ticktype='detailed',
axes = TRUE,
box = TRUE)
dev.off()
Use/perform Model-based clustering (Mclust, library mclust). Provide plot of the summary of your fit (BIC, classification, uncertainty, and density).
From the summary of the model it appears there are four clusters with most points being centered in the fourth cluster. The 2:1 birth to death ratio wasevident from the scatter plot.
#create an mclust model and print the summary
mod2 <- Mclust(birthdeathrates)
summary(mod2,parameters=T)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 4 components:
##
## log.likelihood n df BIC ICL
## -424.4183 69 12 -899.6459 -906.4827
##
## Clustering table:
## 1 2 3 4
## 12 17 2 38
##
## Mixing probabilities:
## 1 2 3 4
## 0.1755768 0.2450169 0.0289865 0.5504198
##
## Means:
## [,1] [,2] [,3] [,4]
## birth 33.752373 43.81548 55.94967 19.925287
## death 8.542732 12.09771 29.34961 9.080797
##
## Variances:
## [,,1]
## birth death
## birth 10.21004 0.00000
## death 0.00000 10.21004
## [,,2]
## birth death
## birth 10.21004 0.00000
## death 0.00000 10.21004
## [,,3]
## birth death
## birth 10.21004 0.00000
## death 0.00000 10.21004
## [,,4]
## birth death
## birth 10.21004 0.00000
## death 0.00000 10.21004
This plot shows that the number of clusters with the lowest BIC is 4.
#plot the BIC of the model
plot(mod2,what="BIC")
The classification plot shows the center of each cluster within the scatterplot of birth to death rates.
#plot the classification of the model
plot(mod2,what="classification")
The uncertainty plot ensenuates the points that fall outside of the clusters determined by the model.
#plot the uncertainty of the model
plot(mod2,what="uncertainty")
The density countour plot shows the cluster with the most observations, the cluster with the 2:1 birth to death ratio.
#plot the density of the model
plot(mod2,what="density")
The top three models contain, four, four, and three clusters from the BIC model.
#use the mclustBIC function to print the BIC summary
mclustBIC(birthdeathrates)
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 1 -1006.5723 -1006.5723 -961.7502 -961.7502 -961.7502 -961.7502 -950.3669
## 2 -936.3451 -914.8026 -938.6114 -911.9710 -924.7309 -915.6217 -933.9448
## 3 -906.7730 -916.7347 -905.7403 -908.3213 -911.0692 -917.4011 -909.8437
## 4 -899.6459 -908.7024 -903.7688 -912.3517 -906.1031 -907.6750 -906.5390
## 5 -907.1295 -914.5474 -908.8953 -917.3932 -912.1108 -911.3001 -912.9725
## 6 -909.5614 -929.2061 -913.5805 -933.2733 -921.1098 -928.4247 -917.8177
## 7 -912.5646 NA -924.9581 -931.4482 -931.7108 -943.1337 -930.6014
## 8 -924.0520 NA -928.1996 NA -946.0579 -953.0436 -931.8076
## 9 -934.9837 NA -937.1261 NA -960.9737 -969.6890 -941.9337
## EVE VEE VVE EEV VEV EVV VVV
## 1 -950.3669 -950.3669 -950.3669 -950.3669 -950.3669 -950.3669 -950.3669
## 2 -923.7050 -915.4067 -909.4032 -916.4306 -908.1553 -920.4682 -910.1725
## 3 -916.3679 -924.3171 -920.9178 -913.3934 -916.2099 -916.1145 -917.0500
## 4 NA NA NA -910.5605 -922.0331 NA NA
## 5 NA NA NA -917.9875 -930.4636 NA NA
## 6 NA NA NA -927.4957 -943.8289 NA NA
## 7 NA NA NA -943.3551 -952.2035 NA NA
## 8 NA NA NA -955.0234 NA NA NA
## 9 NA NA NA -955.0388 -965.3098 NA NA
##
## Top 3 models based on the BIC criterion:
## EII,4 EEI,4 EEI,3
## -899.6459 -903.7688 -905.7403
Discuss the results (structure of data, outliers, …). Write a discussion in the context of the problem.
The results indicate four unique clusters with the predominate cluster having a 2:1 birth to death ratio. Death rates generally are held constant, but some countries with extremely high birth rates also have very high death rates. This most likely indicates extreme poverty in those countries, but that cannot be determined with this data alone. There is one country, abbreviated as dem, that has an unusually high death rate and very low birth rate. More citizens are dying as opposed to being born.
A sex difference in the age of onset of schizophrenia was noted by Kraepelin (1919). Subsequent epidemiological studies of the disorder have consistently shown an earlier onset in men than in women. One model that has been suggested to explain this observed difference is known as the subtype model which postulates two types of schizophrenia, one characterized by early onset, typical symptoms and poor premorbid competence, and the other by late onset, atypical symptoms and good premorbid competence. The early onset type is assumed to be largely a disorder of men and the late onset largely a disorder of women. Fit finite mixtures of normal densities separately to the onset data for men and women given in schizophrenia data from HSAUR3. See if you can produce some evidence for or against the subtype model. (8.3 Handbook)
This density plot shows that men are centered around 20 years of age while women can obtain schizophrenia throughout life.
data(schizophrenia)
#plot the schizophrenia data using stat_density within ggplot2, facet by gender
par(fig=c(0,1,0,1),new=T)
ggplot(data=schizophrenia)+
stat_density(kernel='gaussian',adjust=1,aes(age,fill=gender)) +
facet_grid(gender~.) +
labs(title = 'Gaussian Kernal Density of Schizophrenia Diagnosis by Gender',
x = 'Age at the Time of Diagnosis',
y='Density Estimate')
dev.off()
This histogram shows a distribution in men centered in the 20s and distribution for women that is spread evenly with peaks in the 20s and 40s.
#plot a histogram of the data faceting by gender
par(fig=c(0,1,0,1),new=T)
ggplot(data=schizophrenia)+
geom_histogram(aes(age,fill=gender)) +
facet_grid(gender~.) +
labs(title = 'Histogram of Schizophrenia Diagnosis by Gender',
x = 'Age at the Time of Diagnosis',
y='Frequency')
dev.off()
The summary of the male clustering model supports the hypothesis that men show only signs on early onset schizophrenia. They are clustered at 20 and 27 years of age.
#create a male and female data set
male <- schizophrenia %>%
filter(gender=='male')
female <- schizophrenia %>%
filter(gender=='female')
#create a male and female Mclust model
male_mod <- Mclust(male$age)
female_mod <- Mclust(female$age)
#print the male model
summary(male_mod,parameters=T)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 2 components:
##
## log.likelihood n df BIC ICL
## -520.9747 152 5 -1067.069 -1134.392
##
## Clustering table:
## 1 2
## 99 53
##
## Mixing probabilities:
## 1 2
## 0.5104189 0.4895811
##
## Means:
## 1 2
## 20.23922 27.74615
##
## Variances:
## 1 2
## 9.395305 111.997525
The summary of the female model indicates the two clusters occur at age 25 and age 46 which supports earlier studies that women are subject to early and late onset schizophrenia.
#print the female model
summary(female_mod,parameters=T)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust E (univariate, equal variance) model with 2 components:
##
## log.likelihood n df BIC ICL
## -373.6992 99 4 -765.7788 -774.8935
##
## Clustering table:
## 1 2
## 74 25
##
## Mixing probabilities:
## 1 2
## 0.7472883 0.2527117
##
## Means:
## 1 2
## 24.93517 46.85570
##
## Variances:
## 1 2
## 44.55641 44.55641
The density plots clearly show late onset schizophrenia in women and that men are clustered around early onset schizophrenia.
#create a under/over density plot of the male vs female model
par(mfrow=c(2,1))
plot(male_mod,what='density',
main='',
xlab='Diagnosis Age of Males',ylab='Density Estimate')
title(main='Modeled Density Estimate of Schizophrenia Diagnois by Gender')
plot(female_mod,what='density',
main='',
xlab='Diagnosis Age of Females',
ylab='Density Estimate')
dev.off()
The uncertainty plots show a higher uncertainty in men early and a higher uncertainty for women later in life.
#create a under/over uncertainty plot of the male vs female model
par(mfrow=c(2,1))
plot(male_mod,what='uncertainty',
main='',
xlab='Diagnosis Age of Males')
title(main='Uncertanty of Schizophrenia Diagnosis Model by Gender')
plot(female_mod,what='uncertainty',
main='',
xlab='Diagnosis Age of Females')
dev.off()
This plots shows that both the model for men and women have an optimum number of clusters at 2.
#create a under/over BIC plot of the male vs female model
par(mfrow=c(2,1))
plot(male_mod,what='BIC')
title(main='BIC Of Schizophrenia Diagnosis Model by Gender')
plot(female_mod,what='BIC')
dev.off()
From the above summaries and plots it is evident that males generally see a much higher rate of schizophrenia in the early to late 20s than females. Males have two clusters at 20 years of age and 27 years of age whereas females are clustered at 25 years of age and 46 years of age for the diagnosis of schizophrenia. This clustering and density estimation clearly supports the theory for early onset schizophrenia in men and late onset schizophrenia in women.