Exploring Potential Relationships between Globular Cluster Number and Specific Galaxy Properties

Note: It is recommended to knit this .Rmd file to html for best visual results.

BACKGROUND:

Globular Clusters

A globular cluster is loosely defined as a collection of (potentially millions of) stars orbiting a galactic core [1]; these entities are of particular interest to astronomers, because the analysis of their physical properties helps inform models of galaxy evolution [2].

METHODS:

(1) Input data converted to individual vectors representing each variable of interest.

library(data.table)

#create vector with column names

columnVector <- c("Galaxy", "ID", "RightAscension", "Declination", "MorphType", "Distance", "Uncert+/-", "DistEstMeth", "FGExtinct", "AbsVisMag", "Uncert+/-", "KBandMAg", "Uncert+/-", "NGlobClust", "Uncert+/-", "LitSource", "VelocDisp", "Uncert+/-", "EffecRad", "Uncert+/-", "DynamicMass", "Uncert+/-", "TotMass", "Uncert+/-", "MassBH", "UpperUncert", "LowerUncert")

#read in .txt file, obtained from: http://www.physics.mcmaster.ca/~harris/GCS_table.txt


file1 <- "GCS_table.txt"

rawlines <- readLines(file(file1))

#remove unnecessary information

mydat <-rawlines[-c(seq(1:39))]

#remove newline characters

mydat <- gsub("[\n]", " ", mydat)

#create a string from the vector of column names

columns <- paste(columnVector, collapse=" ")

#read data into dataframe

data <- read.table(text = mydat[seq(1:424)],col.names=columnVector,na.strings = c("nd", "nid"))

#attach data for ease of reference

attach(data)

#create vector for each variable:

#Number of Globular Clusters

NGC <- as.vector(as.numeric(NGlobClust))

#Central Black Hole Mass

CBHM <- as.vector(as.numeric(MassBH))

#Dynamical Bulge Mass

DBM <- as.vector(as.numeric(DynamicMass))

#Bulge velocity Dispersion

BVD <- as.vector(as.numeric(VelocDisp))

#Absolute Visual MAgnitude

AVM <- as.vector(as.numeric(AbsVisMag))

#Galaxy type

GT <- as.vector(MorphType)

#detach data

detach(data)

#have a look at first 6 entries of each vector to see if it is in the format we expected

head(NGC)

## [1]  58   4  52 172   6 273

head(CBHM)

## [1] 6.61   NA   NA   NA   NA   NA

head(DBM)

## [1]  9.856     NA 10.879     NA  8.408  8.105

head(BVD)

## [1] 105.0    NA 169.2    NA  22.0  19.9

head(AVM)

## [1] -21.30 -14.84 -20.18 -18.77 -15.46 -15.40

head(GT)

## [1] "Sbc" "Irr" "Sab" "SBm" "E5"  "E3"

(2) Exploratory Data Analysis:

Quick look at what kind of data we have, reveals the following…

#we begin by getting a quick exploratory summary for each vector

#Number of globular clusters:

summary(NGC)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   67.25  143.00  144.70  222.00  295.00

hist(NGC, main= "Histogam of Globular Cluster Number",col=terrain.colors(10), xlab = "Number of Globular Clusters", xlim=c(1,295), las=1, breaks= 10)

#Central black hole mass:

summary(CBHM)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   7.950   5.944   8.720  10.320     357

hist(CBHM, main= "Histogam of Central Black Hole Mass",col=terrain.colors(6), xlab = "Central Black Hole Mass 1.98892×1030 kg")

#Dynamical Bulge Mass:

summary(DBM)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   7.693   9.988  10.870  10.720  11.410  12.730     166

hist(DBM, main= "Histogam of Dynamical Bulge Mass",col=terrain.colors(11), xlab = "Dynamical Bulge Mass 1.98892×1030 kg")

#Bulge velocity dispersion:

summary(BVD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.40   85.15  168.10  167.90  239.20  414.00     148

hist(BVD, main= "Histogam of Bulge velocity Dispersion",col=terrain.colors(9), xlab = "Bulge velocity Dispersion km/s")

#Absolute visual magnitude:

summary(AVM)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -24.19  -21.31  -19.59  -19.13  -16.91  -11.17       1

hist(AVM, main= "Histogam of Absolute Visual Magnitude",col=terrain.colors(14), xlab = "Absolute Magnitude (M)")

#Galaxy types:
summary(GT)

##    Length     Class      Mode 
##       422 character character

#if necessary: install.packages('plyr'), size 1.1MB

library(plyr)

names <- count(GT)

plot(names, main="Galaxy Types", xlab="Galaxy Names")

(3) Having a look at general correlations between galaxy properties:

#if necessary: install.packages('PerformanceAnalytics'), size 2MB

AllData <- cbind.data.frame(NGC, CBHM, DBM, BVD, AVM)

library(PerformanceAnalytics)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:data.table':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

chart.Correlation(AllData)

We see that there appears to be some promising correlations we might be able to explore further:

In our data there was quite a strong positive correlation between:

DBM and BVD (0.92)

The strongest negative correlations were found between:

DBM and AVM (-0.94) BVD and AVM (-0.85)

No immediately obvious correlation was found between NGC and any other galaxy property.

(4) Explore simple correlations between NGC and each other galaxy property:

#NGC vs (CBHM, DBM, BVD, AVM)

#install.packages('pander')

library(pander)
z <- lm(NGC ~ CBHM)
y <- lm(NGC ~ DBM)
x <- lm(NGC ~ BVD)
w <- lm(NGC ~ AVM)


panderOptions("digits", 2)
pander(z)

Fitting linear model: NGC ~ CBHM
	Estimate	Std. Error	t value	Pr(>\|t\|)
CBHM	1.7	2.8	0.61	0.54
(Intercept)	123	20	6.3	3.5e-08

pander(y)

Fitting linear model: NGC ~ DBM
	Estimate	Std. Error	t value	Pr(>\|t\|)
DBM	-1.4	5.6	-0.25	0.8
(Intercept)	164	61	2.7	0.0074

pander(x)

Fitting linear model: NGC ~ BVD
	Estimate	Std. Error	t value	Pr(>\|t\|)
BVD	-0.036	0.057	-0.62	0.53
(Intercept)	157	11	14	4.1e-35

pander(w)

Fitting linear model: NGC ~ AVM
	Estimate	Std. Error	t value	Pr(>\|t\|)
AVM	-2.9	1.5	-1.9	0.064
(Intercept)	90	30	3	0.0027

###

v <- lm(NGC ~ CBHM + DBM + BVD + AVM)
#summary(v)

u <- lm(NGC ~ CBHM * DBM * BVD * AVM)
#summary(u)

pander(v)

Fitting linear model: NGC ~ CBHM + DBM + BVD + AVM
	Estimate	Std. Error	t value	Pr(>\|t\|)
CBHM	1.2	2.9	0.43	0.67
DBM	96	54	1.8	0.079
BVD	-0.4	0.28	-1.4	0.16
AVM	22	19	1.1	0.26
(Intercept)	-392	289	-1.4	0.18

pander(u)

Fitting linear model: NGC ~ CBHM * DBM * BVD * AVM ——————————————————————————————————-
	Estimate	Std. Error	t value	Pr(>\|t\|)
CBHM	2152	1096	2	0.055
DBM	2159	899	2.4	0.02
BVD	28	41	0.69	0.5
AVM	-785	303	-2.6	0.012
CBHM:DBM	-268	140	-1.9	0.061
CBHM:BVD	-2.1	8.2	-0.25	0.8
DBM:BVD	-4.7	3.1	-1.5	0.14
CBHM:AVM	97	48	2	0.048
DBM:AVM	95	38	2.5	0.014
BVD:AVM	1.1	2.1	0.5	0.62
CBHM:DBM:BVD	0.51	0.61	0.83	0.41
CBHM:DBM:AVM	-12	5.8	-2.1	0.039
CBHM:BVD:AVM	-0.067	0.41	-0.16	0.87
DBM:BVD:AVM	-0.2	0.16	-1.3	0.21
CBHM:DBM:BVD:AVM	0.021	0.03	0.72	0.48
(Intercept)	-17985	6939	-2.6	0.013

Looking at Estimates for Coefficients, We choose to use α <= 0.05 as a cut-off point for models we want to explore further.

The models that survived the α level cut-off are:

NGC ~ DBM : Est: 2159 StdError: 1096

NGC ~ AVM: Est: -785 Stderror: 303

NGC ~ (CBHM:AVM): Est: 97 StdError: 48

NGC ~ (DBM:AVM): Est: 95 StdError: 38

NGC ~ (CBHM:DBM:AVM): Est:-12 StdError: 5.8

5. Model building: Random Forests:

Using random decision forests to find out which galaxy properties influence Number of Globular Clusters the most:

#install.packages('randomForest')

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

fit1 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)

fit1

## 
## Call:
##  randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData,      importance = TRUE, ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8344.032
##                     % Var explained: -13.63

fit2 <- randomForest(NGC ~ CBHM + DBM + BVD + AVM, data=AllData, importance=TRUE, ntree = 2000, na.action=na.omit)

fit2

## 
## Call:
##  randomForest(formula = NGC ~ CBHM + DBM + BVD + AVM, data = AllData,      importance = TRUE, ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8378.991
##                     % Var explained: -14.11

fit3 <- randomForest(NGC ~ DBM, importance=TRUE, ntree = 2000, na.action=na.omit)

fit3

## 
## Call:
##  randomForest(formula = NGC ~ DBM, importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 10479.99
##                     % Var explained: -37.8

fit4 <- randomForest(NGC ~ AVM, importance=TRUE, ntree = 2000, na.action=na.omit)

fit4

## 
## Call:
##  randomForest(formula = NGC ~ AVM, importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 9902.272
##                     % Var explained: -26.9

fit5 <- randomForest(NGC ~ (CBHM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit5

## 
## Call:
##  randomForest(formula = NGC ~ (CBHM:AVM), importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8113.289
##                     % Var explained: -11.44

fit6 <- randomForest(NGC ~ (DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit6

## 
## Call:
##  randomForest(formula = NGC ~ (DBM:AVM), importance = TRUE, ntree = 2000,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 9028.338
##                     % Var explained: -18.71

fit7 <- randomForest(NGC ~ (CBHM:DBM:AVM), importance=TRUE, ntree = 2000, na.action=na.omit)

fit7

## 
## Call:
##  randomForest(formula = NGC ~ (CBHM:DBM:AVM), importance = TRUE,      ntree = 2000, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 8252.12
##                     % Var explained: -12.38

From the above analysis, we see that the greatest influencers of NGC are:

NGC ~ DBM (% Var explained: -36.39) NGC ~ AVM (% Var explained: -27.39)

Predictive Model:

model_1 <- lm(NGC ~ DBM + AVM)

model_1

## 
## Call:
## lm(formula = NGC ~ DBM + AVM)
## 
## Coefficients:
## (Intercept)          DBM          AVM  
##    163.7001      -0.8574       0.2954

model_2 <- lm(NGC ~ DBM * AVM)

summary(model_1)

## 
## Call:
## lm(formula = NGC ~ DBM + AVM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.233  -83.871    0.444   78.761  146.589 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 163.7001    60.7823   2.693  0.00755 **
## DBM          -0.8574    17.0497  -0.050  0.95993   
## AVM           0.2954     8.3272   0.035  0.97173   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.71 on 253 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.0002577,  Adjusted R-squared:  -0.007645 
## F-statistic: 0.03261 on 2 and 253 DF,  p-value: 0.9679

summary(model_2)

## 
## Call:
## lm(formula = NGC ~ DBM * AVM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -142.781  -84.086    3.866   77.213  153.489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -297.136    493.994  -0.601    0.548
## DBM           46.159     52.843   0.874    0.383
## AVM          -21.522     24.659  -0.873    0.384
## DBM:AVM        2.214      2.355   0.940    0.348
## 
## Residual standard error: 87.73 on 252 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.003751,   Adjusted R-squared:  -0.008109 
## F-statistic: 0.3163 on 3 and 252 DF,  p-value: 0.8136

DISCUSSION & CONCLUSIONS:

The Galaxy Property which influenced number of galaxy clusters the most, according to this data set and the limited amount of models we investigated is:

Dynamical Bulge Mass:

finalModel <- lm(NGC ~ DBM)

summary(finalModel)

## 
## Call:
## lm(formula = NGC ~ DBM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.142  -83.853    0.517   78.506  146.691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  163.770     60.630   2.701  0.00738 **
## DBM           -1.428      5.635  -0.253  0.80016   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.54 on 254 degrees of freedom
##   (166 observations deleted due to missingness)
## Multiple R-squared:  0.0002528,  Adjusted R-squared:  -0.003683 
## F-statistic: 0.06422 on 1 and 254 DF,  p-value: 0.8002

Limitations of methodology of this analysis are:

We focussed on NGC as the sole response variable in our analysis
Future work should investigate how NGC influences other galaxy properties, particularly Galaxy Type
This was a purely exploratory investigation (literature was not consulted) and therefore a lot of time was spent just getting to know the data, instead of having a Null Hypothesis before building and testing models.

REFERENCES:

Computational Platform:

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Data source:

“Catalog of Globular Cluster Systems in Galaxies”, Harris, W.E., Harris, G.L.H, and Alessi, M. 2013, ApJ 772, 82.

In-text References:

http://www.space.com/29717-globular-clusters.html
https://ned.ipac.caltech.edu/level5/Ashman/Ashman1.html Accessed on 04/03/2017.
Antonucci, R. (1993). “Unified Models for Active Galactic Nuclei and Quasars”. Annual Review of Astronomy and Astrophysics. 31 (1): 473–521. As referenced in Wikipedia Article: https://en.wikipedia.org/wiki/Supermassive_black_hole. Accessed on 05/03/2017.
http://www.cosmotography.com/images/supermassive_blackholes_drive_galaxy_evolution_2.html. Accessed on 05/03/2017. 5.http://adsabs.harvard.edu/abs/2004ApJ…604L..89H. Accessed on 05/03/2017.

VLJCHR004

STA 5075Z: Assignment 1

CG Viljoen

5 March 2017