Note: It is recommended to knit this .Rmd file to html for best visual results.
A globular cluster is loosely defined as a collection of (potentially millions of) stars orbiting a galactic core [1]; these entities are of particular interest to astronomers, because the analysis of their physical properties helps inform models of galaxy evolution [2].
library(data.table)
#create vector with column names
columnVector <- c("Galaxy", "ID", "RightAscension", "Declination", "MorphType", "Distance", "Uncert+/-", "DistEstMeth", "FGExtinct", "AbsVisMag", "Uncert+/-", "KBandMAg", "Uncert+/-", "NGlobClust", "Uncert+/-", "LitSource", "VelocDisp", "Uncert+/-", "EffecRad", "Uncert+/-", "DynamicMass", "Uncert+/-", "TotMass", "Uncert+/-", "MassBH", "UpperUncert", "LowerUncert")
#read in .txt file, obtained from: http://www.physics.mcmaster.ca/~harris/GCS_table.txt
file1 <- "GCS_table.txt"
rawlines <- readLines(file(file1))
#remove unnecessary information
mydat <-rawlines[-c(seq(1:39))]
#remove newline characters
mydat <- gsub("[\n]", " ", mydat)
#create a string from the vector of column names
columns <- paste(columnVector, collapse=" ")
#read data into dataframe
data <- read.table(text = mydat[seq(1:424)],col.names=columnVector,na.strings = c("nd", "nid"))
#attach data for ease of reference
attach(data)
#create vector for each variable:
#Number of Globular Clusters
NGC <- as.vector(as.numeric(NGlobClust))
#Central Black Hole Mass
CBHM <- as.vector(as.numeric(MassBH))
#Dynamical Bulge Mass
DBM <- as.vector(as.numeric(DynamicMass))
#Bulge velocity Dispersion
BVD <- as.vector(as.numeric(VelocDisp))
#Absolute Visual MAgnitude
AVM <- as.vector(as.numeric(AbsVisMag))
#Galaxy type
GT <- as.vector(MorphType)
#detach data
detach(data)
#have a look at first 6 entries of each vector to see if it is in the format we expected
head(NGC)## [1] 58 4 52 172 6 273
head(CBHM)## [1] 6.61 NA NA NA NA NA
head(DBM)## [1] 9.856 NA 10.879 NA 8.408 8.105
head(BVD)## [1] 105.0 NA 169.2 NA 22.0 19.9
head(AVM)## [1] -21.30 -14.84 -20.18 -18.77 -15.46 -15.40
head(GT)## [1] "Sbc" "Irr" "Sab" "SBm" "E5" "E3"
Quick look at what kind of data we have, reveals the following…
#we begin by getting a quick exploratory summary for each vector
#Number of globular clusters:
summary(NGC)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 67.25 143.00 144.70 222.00 295.00
hist(NGC, main= "Histogam of Globular Cluster Number",col=terrain.colors(10), xlab = "Number of Globular Clusters", xlim=c(1,295), las=1, breaks= 10)#Central black hole mass:
summary(CBHM)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 7.950 5.944 8.720 10.320 357
hist(CBHM, main= "Histogam of Central Black Hole Mass",col=terrain.colors(6), xlab = "Central Black Hole Mass 1.98892×1030 kg")#Dynamical Bulge Mass:
summary(DBM)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.693 9.988 10.870 10.720 11.410 12.730 166
hist(DBM, main= "Histogam of Dynamical Bulge Mass",col=terrain.colors(11), xlab = "Dynamical Bulge Mass 1.98892×1030 kg")#Bulge velocity dispersion:
summary(BVD)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.40 85.15 168.10 167.90 239.20 414.00 148
hist(BVD, main= "Histogam of Bulge velocity Dispersion",col=terrain.colors(9), xlab = "Bulge velocity Dispersion km/s")#Absolute visual magnitude:
summary(AVM)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -24.19 -21.31 -19.59 -19.13 -16.91 -11.17 1
hist(AVM, main= "Histogam of Absolute Visual Magnitude",col=terrain.colors(14), xlab = "Absolute Magnitude (M)")#Galaxy types:
summary(GT)## Length Class Mode
## 422 character character
#if necessary: install.packages('plyr'), size 1.1MB
library(plyr)
names <- count(GT)
plot(names, main="Galaxy Types", xlab="Galaxy Names")#if necessary: install.packages('PerformanceAnalytics'), size 2MB
AllData <- cbind.data.frame(NGC, CBHM, DBM, BVD, AVM)
library(PerformanceAnalytics)## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:data.table':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(AllData)We see that there appears to be some promising correlations we might be able to explore further:
DBM and AVM (-0.94) BVD and AVM (-0.85)
#NGC vs (CBHM, DBM, BVD, AVM)
#install.packages('pander')
library(pander)
z <- lm(NGC ~ CBHM)
y <- lm(NGC ~ DBM)
x <- lm(NGC ~ BVD)
w <- lm(NGC ~ AVM)
panderOptions("digits", 2)
pander(z)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| CBHM | 1.7 | 2.8 | 0.61 | 0.54 |
| (Intercept) | 123 | 20 | 6.3 | 3.5e-08 |
pander(y)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| DBM | -1.4 | 5.6 | -0.25 | 0.8 |
| (Intercept) | 164 | 61 | 2.7 | 0.0074 |
pander(x)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| BVD | -0.036 | 0.057 | -0.62 | 0.53 |
| (Intercept) | 157 | 11 | 14 | 4.1e-35 |
pander(w)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| AVM | -2.9 | 1.5 | -1.9 | 0.064 |
| (Intercept) | 90 | 30 | 3 | 0.0027 |
###
v <- lm(NGC ~ CBHM + DBM + BVD + AVM)
#summary(v)
u <- lm(NGC ~ CBHM * DBM * BVD * AVM)
#summary(u)
pander(v)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| CBHM | 1.2 | 2.9 | 0.43 | 0.67 |
| DBM | 96 | 54 | 1.8 | 0.079 |
| BVD | -0.4 | 0.28 | -1.4 | 0.16 |
| AVM | 22 | 19 | 1.1 | 0.26 |
| (Intercept) | -392 | 289 | -1.4 | 0.18 |
pander(u)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| CBHM | 2152 | 1096 | 2 | 0.055 |
| DBM | 2159 | 899 | 2.4 | 0.02 |
| BVD | 28 | 41 | 0.69 | 0.5 |
| AVM | -785 | 303 | -2.6 | 0.012 |
| CBHM:DBM | -268 | 140 | -1.9 | 0.061 |
| CBHM:BVD | -2.1 | 8.2 | -0.25 | 0.8 |
| DBM:BVD | -4.7 | 3.1 | -1.5 | 0.14 |
| CBHM:AVM | 97 | 48 | 2 | 0.048 |
| DBM:AVM | 95 | 38 | 2.5 | 0.014 |
| BVD:AVM | 1.1 | 2.1 | 0.5 | 0.62 |
| CBHM:DBM:BVD | 0.51 | 0.61 | 0.83 | 0.41 |
| CBHM:DBM:AVM | -12 | 5.8 | -2.1 | 0.039 |
| CBHM:BVD:AVM | -0.067 | 0.41 | -0.16 | 0.87 |
| DBM:BVD:AVM | -0.2 | 0.16 | -1.3 | 0.21 |
| CBHM:DBM:BVD:AVM | 0.021 | 0.03 | 0.72 | 0.48 |
| (Intercept) | -17985 | 6939 | -2.6 | 0.013 |
Looking at Estimates for Coefficients, We choose to use α <= 0.05 as a cut-off point for models we want to explore further.
The models that survived the α level cut-off are:
NGC ~ DBM : Est: 2159 StdError: 1096
NGC ~ AVM: Est: -785 Stderror: 303
NGC ~ (CBHM:AVM): Est: 97 StdError: 48
NGC ~ (DBM:AVM): Est: 95 StdError: 38
NGC ~ (CBHM:DBM:AVM): Est:-12 StdError: 5.8
Using random decision forests to find out which galaxy properties influence Number of Globular Clusters the most:
#install.packages('randomForest')
library(randomForest)## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
fit1 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)
fit1##
## Call:
## randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData, importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 8344.032
## % Var explained: -13.63
fit2 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)
fit2##
## Call:
## randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData, importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 8378.991
## % Var explained: -14.11
fit3 <- randomForest(NGC ~ DBM, importance=TRUE, ntree = 2000, na.action=na.omit)
fit3##
## Call:
## randomForest(formula = NGC ~ DBM, importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 10479.99
## % Var explained: -37.8
fit4 <- randomForest(NGC ~ AVM, importance=TRUE, ntree = 2000, na.action=na.omit)
fit4##
## Call:
## randomForest(formula = NGC ~ AVM, importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 9902.272
## % Var explained: -26.9
fit5 <- randomForest(NGC ~ (CBHM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)
fit5##
## Call:
## randomForest(formula = NGC ~ (CBHM:AVM), importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 8113.289
## % Var explained: -11.44
fit6 <- randomForest(NGC ~ (DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)
fit6##
## Call:
## randomForest(formula = NGC ~ (DBM:AVM), importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 9028.338
## % Var explained: -18.71
fit7 <- randomForest(NGC ~ (CBHM:DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)
fit7##
## Call:
## randomForest(formula = NGC ~ (CBHM:DBM:AVM), importance = TRUE, ntree = 2000, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 8252.12
## % Var explained: -12.38
From the above analysis, we see that the greatest influencers of NGC are:
NGC ~ DBM (% Var explained: -36.39) NGC ~ AVM (% Var explained: -27.39)
model_1 <- lm(NGC ~ DBM + AVM)
model_1##
## Call:
## lm(formula = NGC ~ DBM + AVM)
##
## Coefficients:
## (Intercept) DBM AVM
## 163.7001 -0.8574 0.2954
model_2 <- lm(NGC ~ DBM * AVM)
summary(model_1)##
## Call:
## lm(formula = NGC ~ DBM + AVM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.233 -83.871 0.444 78.761 146.589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.7001 60.7823 2.693 0.00755 **
## DBM -0.8574 17.0497 -0.050 0.95993
## AVM 0.2954 8.3272 0.035 0.97173
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.71 on 253 degrees of freedom
## (166 observations deleted due to missingness)
## Multiple R-squared: 0.0002577, Adjusted R-squared: -0.007645
## F-statistic: 0.03261 on 2 and 253 DF, p-value: 0.9679
summary(model_2)##
## Call:
## lm(formula = NGC ~ DBM * AVM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.781 -84.086 3.866 77.213 153.489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -297.136 493.994 -0.601 0.548
## DBM 46.159 52.843 0.874 0.383
## AVM -21.522 24.659 -0.873 0.384
## DBM:AVM 2.214 2.355 0.940 0.348
##
## Residual standard error: 87.73 on 252 degrees of freedom
## (166 observations deleted due to missingness)
## Multiple R-squared: 0.003751, Adjusted R-squared: -0.008109
## F-statistic: 0.3163 on 3 and 252 DF, p-value: 0.8136
The Galaxy Property which influenced number of galaxy clusters the most, according to this data set and the limited amount of models we investigated is:
Dynamical Bulge Mass:
finalModel <- lm(NGC ~ DBM)
summary(finalModel)##
## Call:
## lm(formula = NGC ~ DBM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.142 -83.853 0.517 78.506 146.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.770 60.630 2.701 0.00738 **
## DBM -1.428 5.635 -0.253 0.80016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.54 on 254 degrees of freedom
## (166 observations deleted due to missingness)
## Multiple R-squared: 0.0002528, Adjusted R-squared: -0.003683
## F-statistic: 0.06422 on 1 and 254 DF, p-value: 0.8002
Limitations of methodology of this analysis are:
We focussed on NGC as the sole response variable in our analysis
Future work should investigate how NGC influences other galaxy properties, particularly Galaxy Type
This was a purely exploratory investigation (literature was not consulted) and therefore a lot of time was spent just getting to know the data, instead of having a Null Hypothesis before building and testing models.
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
“Catalog of Globular Cluster Systems in Galaxies”, Harris, W.E., Harris, G.L.H, and Alessi, M. 2013, ApJ 772, 82.