Exploring Potential Relationships between Globular Cluster Number and Specific Galaxy Properties


Note: It is recommended to knit this .Rmd file to html for best visual results.


BACKGROUND:

Globular Clusters

A globular cluster is loosely defined as a collection of (potentially millions of) stars orbiting a galactic core [1]; these entities are of particular interest to astronomers, because the analysis of their physical properties helps inform models of galaxy evolution [2].


METHODS:

(1) Input data converted to individual vectors representing each variable of interest.

library(data.table)

#create vector with column names

columnVector <- c("Galaxy", "ID", "RightAscension", "Declination", "MorphType", "Distance", "Uncert+/-", "DistEstMeth", "FGExtinct", "AbsVisMag", "Uncert+/-", "KBandMAg", "Uncert+/-", "NGlobClust", "Uncert+/-", "LitSource", "VelocDisp", "Uncert+/-", "EffecRad", "Uncert+/-", "DynamicMass", "Uncert+/-", "TotMass", "Uncert+/-", "MassBH", "UpperUncert", "LowerUncert")

#read in .txt file, obtained from: http://www.physics.mcmaster.ca/~harris/GCS_table.txt


file1 <- "GCS_table.txt"

rawlines <- readLines(file(file1))

#remove unnecessary information

mydat <-rawlines[-c(seq(1:39))]

#remove newline characters

mydat <- gsub("[\n]", " ", mydat)

#create a string from the vector of column names

columns <- paste(columnVector, collapse=" ")

#read data into dataframe

data <- read.table(text = mydat[seq(1:424)],col.names=columnVector,na.strings = c("nd", "nid"))

#attach data for ease of reference

attach(data)

#create vector for each variable:

#Number of Globular Clusters

NGC <- as.vector(as.numeric(NGlobClust))

#Central Black Hole Mass

CBHM <- as.vector(as.numeric(MassBH))

#Dynamical Bulge Mass

DBM <- as.vector(as.numeric(DynamicMass))

#Bulge velocity Dispersion

BVD <- as.vector(as.numeric(VelocDisp))

#Absolute Visual MAgnitude

AVM <- as.vector(as.numeric(AbsVisMag))

#Galaxy type

GT <- as.vector(MorphType)

#detach data

detach(data)

#have a look at first 6 entries of each vector to see if it is in the format we expected

head(NGC)
## [1]  58   4  52 172   6 273
head(CBHM)
## [1] 6.61   NA   NA   NA   NA   NA
head(DBM)
## [1]  9.856     NA 10.879     NA  8.408  8.105
head(BVD)
## [1] 105.0    NA 169.2    NA  22.0  19.9
head(AVM)
## [1] -21.30 -14.84 -20.18 -18.77 -15.46 -15.40
head(GT)
## [1] "Sbc" "Irr" "Sab" "SBm" "E5"  "E3"

(2) Exploratory Data Analysis:

Quick look at what kind of data we have, reveals the following…

#we begin by getting a quick exploratory summary for each vector

#Number of globular clusters:

summary(NGC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   67.25  143.00  144.70  222.00  295.00
hist(NGC, main= "Histogam of Globular Cluster Number",col=terrain.colors(10), xlab = "Number of Globular Clusters", xlim=c(1,295), las=1, breaks= 10)

#Central black hole mass:

summary(CBHM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   7.950   5.944   8.720  10.320     357
hist(CBHM, main= "Histogam of Central Black Hole Mass",col=terrain.colors(6), xlab = "Central Black Hole Mass 1.98892×1030 kg")

#Dynamical Bulge Mass:

summary(DBM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   7.693   9.988  10.870  10.720  11.410  12.730     166
hist(DBM, main= "Histogam of Dynamical Bulge Mass",col=terrain.colors(11), xlab = "Dynamical Bulge Mass 1.98892×1030 kg")

#Bulge velocity dispersion:

summary(BVD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.40   85.15  168.10  167.90  239.20  414.00     148
hist(BVD, main= "Histogam of Bulge velocity Dispersion",col=terrain.colors(9), xlab = "Bulge velocity Dispersion km/s")

#Absolute visual magnitude:

summary(AVM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -24.19  -21.31  -19.59  -19.13  -16.91  -11.17       1
hist(AVM, main= "Histogam of Absolute Visual Magnitude",col=terrain.colors(14), xlab = "Absolute Magnitude (M)")

#Galaxy types:
summary(GT)
##    Length     Class      Mode 
##       422 character character
#if necessary: install.packages('plyr'), size 1.1MB

library(plyr)

names <- count(GT)

plot(names, main="Galaxy Types", xlab="Galaxy Names")


(3) Having a look at general correlations between galaxy properties:

#if necessary: install.packages('PerformanceAnalytics'), size 2MB

AllData <- cbind.data.frame(NGC, CBHM, DBM, BVD, AVM)

library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(AllData)

We see that there appears to be some promising correlations we might be able to explore further:

  1. In our data there was quite a strong positive correlation between:
  • DBM and BVD (0.92)
  1. The strongest negative correlations were found between:

DBM and AVM (-0.94) BVD and AVM (-0.85)

  1. No immediately obvious correlation was found between NGC and any other galaxy property.

(4) Explore simple correlations between NGC and each other galaxy property:

#NGC vs (CBHM, DBM, BVD, AVM)

#install.packages('pander')

library(pander)
z <- lm(NGC ~ CBHM)
y <- lm(NGC ~ DBM)
x <- lm(NGC ~ BVD)
w <- lm(NGC ~ AVM)


panderOptions("digits", 2)
pander(z)
Fitting linear model: NGC ~ CBHM
  Estimate Std. Error t value Pr(>|t|)
CBHM 1.7 2.8 0.61 0.54
(Intercept) 123 20 6.3 3.5e-08
pander(y)
Fitting linear model: NGC ~ DBM
  Estimate Std. Error t value Pr(>|t|)
DBM -1.4 5.6 -0.25 0.8
(Intercept) 164 61 2.7 0.0074
pander(x)
Fitting linear model: NGC ~ BVD
  Estimate Std. Error t value Pr(>|t|)
BVD -0.036 0.057 -0.62 0.53
(Intercept) 157 11 14 4.1e-35
pander(w)
Fitting linear model: NGC ~ AVM
  Estimate Std. Error t value Pr(>|t|)
AVM -2.9 1.5 -1.9 0.064
(Intercept) 90 30 3 0.0027
###

v <- lm(NGC ~ CBHM + DBM + BVD + AVM)
#summary(v)

u <- lm(NGC ~ CBHM * DBM * BVD * AVM)
#summary(u)

pander(v)
Fitting linear model: NGC ~ CBHM + DBM + BVD + AVM
  Estimate Std. Error t value Pr(>|t|)
CBHM 1.2 2.9 0.43 0.67
DBM 96 54 1.8 0.079
BVD -0.4 0.28 -1.4 0.16
AVM 22 19 1.1 0.26
(Intercept) -392 289 -1.4 0.18
pander(u)
Fitting linear model: NGC ~ CBHM * DBM * BVD * AVM ——————————————————————————————————-
  Estimate Std. Error t value Pr(>|t|)
CBHM 2152 1096 2 0.055
DBM 2159 899 2.4 0.02
BVD 28 41 0.69 0.5
AVM -785 303 -2.6 0.012
CBHM:DBM -268 140 -1.9 0.061
CBHM:BVD -2.1 8.2 -0.25 0.8
DBM:BVD -4.7 3.1 -1.5 0.14
CBHM:AVM 97 48 2 0.048
DBM:AVM 95 38 2.5 0.014
BVD:AVM 1.1 2.1 0.5 0.62
CBHM:DBM:BVD 0.51 0.61 0.83 0.41
CBHM:DBM:AVM -12 5.8 -2.1 0.039
CBHM:BVD:AVM -0.067 0.41 -0.16 0.87
DBM:BVD:AVM -0.2 0.16 -1.3 0.21
CBHM:DBM:BVD:AVM 0.021 0.03 0.72 0.48
(Intercept) -17985 6939 -2.6 0.013

Looking at Estimates for Coefficients, We choose to use α <= 0.05 as a cut-off point for models we want to explore further.

The models that survived the α level cut-off are:

NGC ~ DBM : Est: 2159 StdError: 1096

NGC ~ AVM: Est: -785 Stderror: 303

NGC ~ (CBHM:AVM): Est: 97 StdError: 48

NGC ~ (DBM:AVM): Est: 95 StdError: 38

NGC ~ (CBHM:DBM:AVM): Est:-12 StdError: 5.8


5. Model building: Random Forests:

Using random decision forests to find out which galaxy properties influence Number of Globular Clusters the most:

#install.packages('randomForest')

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
fit1 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)

fit1
## 
## Call:
##  randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData,      importance = TRUE, ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8344.032
##                     % Var explained: -13.63
fit2 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)

fit2
## 
## Call:
##  randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData,      importance = TRUE, ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8378.991
##                     % Var explained: -14.11
fit3 <- randomForest(NGC ~ DBM, importance=TRUE, ntree = 2000, na.action=na.omit)

fit3
## 
## Call:
##  randomForest(formula = NGC ~ DBM, importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 10479.99
##                     % Var explained: -37.8
fit4 <- randomForest(NGC ~ AVM, importance=TRUE, ntree = 2000, na.action=na.omit)

fit4
## 
## Call:
##  randomForest(formula = NGC ~ AVM, importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 9902.272
##                     % Var explained: -26.9
fit5 <- randomForest(NGC ~ (CBHM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit5
## 
## Call:
##  randomForest(formula = NGC ~ (CBHM:AVM), importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8113.289
##                     % Var explained: -11.44
fit6 <- randomForest(NGC ~ (DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit6
## 
## Call:
##  randomForest(formula = NGC ~ (DBM:AVM), importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 9028.338
##                     % Var explained: -18.71
fit7 <- randomForest(NGC ~ (CBHM:DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit7
## 
## Call:
##  randomForest(formula = NGC ~ (CBHM:DBM:AVM), importance = TRUE,      ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8252.12
##                     % Var explained: -12.38

From the above analysis, we see that the greatest influencers of NGC are:

NGC ~ DBM (% Var explained: -36.39) NGC ~ AVM (% Var explained: -27.39)


Predictive Model:

model_1 <- lm(NGC ~ DBM + AVM)

model_1
## 
## Call:
## lm(formula = NGC ~ DBM + AVM)
## 
## Coefficients:
## (Intercept)          DBM          AVM  
##    163.7001      -0.8574       0.2954
model_2 <- lm(NGC ~ DBM * AVM)

summary(model_1)
## 
## Call:
## lm(formula = NGC ~ DBM + AVM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.233  -83.871    0.444   78.761  146.589 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 163.7001    60.7823   2.693  0.00755 **
## DBM          -0.8574    17.0497  -0.050  0.95993   
## AVM           0.2954     8.3272   0.035  0.97173   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.71 on 253 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.0002577,  Adjusted R-squared:  -0.007645 
## F-statistic: 0.03261 on 2 and 253 DF,  p-value: 0.9679
summary(model_2)
## 
## Call:
## lm(formula = NGC ~ DBM * AVM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -142.781  -84.086    3.866   77.213  153.489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -297.136    493.994  -0.601    0.548
## DBM           46.159     52.843   0.874    0.383
## AVM          -21.522     24.659  -0.873    0.384
## DBM:AVM        2.214      2.355   0.940    0.348
## 
## Residual standard error: 87.73 on 252 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.003751,   Adjusted R-squared:  -0.008109 
## F-statistic: 0.3163 on 3 and 252 DF,  p-value: 0.8136

DISCUSSION & CONCLUSIONS:

The Galaxy Property which influenced number of galaxy clusters the most, according to this data set and the limited amount of models we investigated is:

Dynamical Bulge Mass:

finalModel <- lm(NGC ~ DBM)

summary(finalModel)
## 
## Call:
## lm(formula = NGC ~ DBM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.142  -83.853    0.517   78.506  146.691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  163.770     60.630   2.701  0.00738 **
## DBM           -1.428      5.635  -0.253  0.80016   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.54 on 254 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.0002528,  Adjusted R-squared:  -0.003683 
## F-statistic: 0.06422 on 1 and 254 DF,  p-value: 0.8002

Limitations of methodology of this analysis are:

  • We focussed on NGC as the sole response variable in our analysis

  • Future work should investigate how NGC influences other galaxy properties, particularly Galaxy Type

  • This was a purely exploratory investigation (literature was not consulted) and therefore a lot of time was spent just getting to know the data, instead of having a Null Hypothesis before building and testing models.


REFERENCES:

Computational Platform:

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Data source:

“Catalog of Globular Cluster Systems in Galaxies”, Harris, W.E., Harris, G.L.H, and Alessi, M. 2013, ApJ 772, 82.

In-text References:

  1. http://www.space.com/29717-globular-clusters.html
  2. https://ned.ipac.caltech.edu/level5/Ashman/Ashman1.html Accessed on 04/03/2017.
  3. Antonucci, R. (1993). “Unified Models for Active Galactic Nuclei and Quasars”. Annual Review of Astronomy and Astrophysics. 31 (1): 473–521. As referenced in Wikipedia Article: https://en.wikipedia.org/wiki/Supermassive_black_hole. Accessed on 05/03/2017.
  4. http://www.cosmotography.com/images/supermassive_blackholes_drive_galaxy_evolution_2.html. Accessed on 05/03/2017. 5.http://adsabs.harvard.edu/abs/2004ApJ…604L..89H. Accessed on 05/03/2017.