From the text files given to us, I have removed the part containing the instructions and used the modified files, which contain only the data sets, for the project.
Military pilots sometimes black out when their brains are deprived of oxygen due to G-forces during violent manoeuvres. A study produced similar symptoms by exposing volunteers’ lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data record the ages of eight volunteers and whether they showed syncopal blackout-related signs (pallor, sweating, slow heartbeat, unconsciousness) during an 18 min period. Do a proper exploratory data analysis. Fit a model with justification. Does resistance to blackout decrease with age?
# loading the data set
gforces_data <- read.table("binary_data1.txt", header=TRUE, check.names=FALSE)
# performing exploratory data analysis
str(gforces_data)
## 'data.frame': 8 obs. of 3 variables:
## $ Subject: chr "JW" "JM" "DT" "LK" ...
## $ Age : int 39 42 20 37 20 21 41 52
## $ Signs : int 0 1 0 1 1 0 1 1
summary(gforces_data)
## Subject Age Signs
## Length:8 Min. :20.00 Min. :0.000
## Class :character 1st Qu.:20.75 1st Qu.:0.000
## Mode :character Median :38.00 Median :1.000
## Mean :34.00 Mean :0.625
## 3rd Qu.:41.25 3rd Qu.:1.000
## Max. :52.00 Max. :1.000
cor(gforces_data$Age, gforces_data$Signs)
## [1] 0.5001292
boxplot(Age ~ Signs, data=gforces_data, xlab="Signs", ylab="Age (in years)", col=c("skyblue","darkgreen"))
We observe that the data set contains 8 observations of 3 variables: Subject, Age (integer valued) and Signs (binary valued). The correlation between a volunteer's age and whether they showed syncopal blackout-related signs is 0.50, a moderate positive association. The boxplot shows how age is distributed within each Signs group. The box for volunteers showing no syncopal blackout-related signs is comparatively taller (larger IQR) and sits lower (smaller maximum), suggesting a clear difference between the two groups: the volunteers who showed signs tend to be older. This suggests that the probability of showing syncopal blackout-related signs is higher for older volunteers.
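A small numerical companion to the boxplot (a sketch) is the mean age within each Signs group:
# mean age of volunteers who did and did not show blackout-related signs
tapply(gforces_data$Age, gforces_data$Signs, mean)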
# fitting a logit model
binomial_glm <- glm(Signs ~ Age, data=gforces_data, family=binomial(link="logit"))
Since the response variable is binary (it takes only two distinct values, and no transformation can change that), we fit a binomial model with a logit link function, taking Age (\(x\)) as the explanatory variable and Signs (\(y\)) as the response. The random component of the model is \(y \sim Bin(1, p)\) and the systematic component is \(logit(p)=\beta_{0} + \beta_{1} x\).
# summarizing the model
summary(binomial_glm)
##
## Call:
## glm(formula = Signs ~ Age, family = binomial(link = "logit"),
## data = gforces_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7112 -0.8686 0.5063 0.6939 1.5336
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92086 2.64655 -1.104 0.270
## Age 0.10569 0.08037 1.315 0.188
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10.5850 on 7 degrees of freedom
## Residual deviance: 8.4345 on 6 degrees of freedom
## AIC: 12.435
##
## Number of Fisher Scoring iterations: 4
# finding the odds ratio
exp(coefficients(binomial_glm)[2])
## Age
## 1.111479
# plotting the results of the logistic regression model
x <- seq(18, 54, 0.1)
y <- predict(binomial_glm, list(Age=x), type="response")
plot(jitter(Signs, 0.15) ~ Age, data=gforces_data, pch=20, col="red", xlab="Age (in years)", ylab="Probability of a blackout response")
lines(x, y, lwd=2)
Substituting the values of the estimated coefficients obtained from the summary, we get that the fitted model has the systematic component \(logit(\hat{p})=-2.9209 + 0.1057 x\).
We observe from the summary of the model that the AIC, the null deviance and the residual deviance are all fairly low, indicating that our model choice is reasonable, although with only 8 observations neither coefficient is individually significant at the 5% level. We also observe that only 4 Fisher scoring iterations were used, implying that the iterates converge quickly. Note that the slope (the coefficient of Age) represents the change in the log-odds per unit change in Age; equivalently, each additional year of age multiplies the odds of showing signs by the odds ratio \(e^{0.1057} \approx 1.11\) computed above. The positive slope implies that for two volunteers of ages \(A > B\), the odds of the volunteer of age \(A\) showing syncopal blackout-related signs are greater than the odds for the volunteer of age \(B\). A small amount of jitter has been added to the model plot to avoid overplotting. From the model plot as well, we observe that the fitted probability of showing syncopal blackout-related signs increases with age. Based on the above observations, we conclude that resistance to blackout decreases with age.
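As a quick illustration of this interpretation (a sketch; the ages 25 and 45 are arbitrary illustrative values), we can read fitted probabilities off the model directly:
# fitted probability of showing blackout-related signs at two illustrative ages
predict(binomial_glm, newdata=data.frame(Age=c(25, 45)), type="response")
# the same values by hand: inverse logit of the fitted linear predictor
plogis(coef(binomial_glm)[1] + coef(binomial_glm)[2] * c(25, 45))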
The total July rainfall (in millimetres) at Quilpie, Australia, has been recorded, together with the value of the monthly mean southern oscillation index (SOI). The SOI is the standardized difference between the air pressures at Darwin and Tahiti, and is known to have relationships with rainfall in parts of Australia. Some Australian farmers may delay planting crops until 10 mm of rain (a 'rain threshold') has fallen within a given time frame (a 'rain window'). Determine how the proportion of years exceeding the rain threshold is related to the SOI. Also determine whether the SOI phase has any effect on this proportion. Does the proportion vary with time? All these answers require proper justification.
# loading the data set
quilpie_data <- read.table("binary_data2.txt", header=TRUE, check.names=FALSE)
head(quilpie_data, 4)
## Year Rain SOI Phase Exceed y
## 1 1921 38.4 2.7 2 Yes 1
## 2 1922 0.0 2.0 5 No 0
## 3 1923 0.0 -10.7 3 No 0
## 4 1924 24.4 6.9 2 Yes 1
# making a boxplot
boxplot(SOI ~ y, horizontal=TRUE, data=quilpie_data, xlab="July average SOI", ylab="Rainfall exceeds threshold", col=c("skyblue","darkgreen"))
The boxplot shows the distribution of the July average SOI in the years when the rainfall exceeded the threshold and when it did not. We observe that the box for the years when rainfall did not exceed the threshold is comparatively taller and sits at lower SOI values, suggesting a clear difference between the two groups of years. That is, it suggests a higher probability of the rainfall exceeding the threshold for higher SOI values.
soi_binomial_glm <- glm(y ~ SOI, data=quilpie_data, family=binomial(link="logit"))
summary(soi_binomial_glm)
##
## Call:
## glm(formula = y ~ SOI, family = binomial(link = "logit"), data = quilpie_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0793 -0.9853 0.3245 0.9616 1.9303
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.04813 0.27856 0.173 0.862828
## SOI 0.14642 0.04288 3.415 0.000638 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 94.209 on 67 degrees of freedom
## Residual deviance: 75.901 on 66 degrees of freedom
## AIC: 79.901
##
## Number of Fisher Scoring iterations: 4
linear_model <- lm(y ~ SOI, data=quilpie_data)
x <- seq(-25, 25, 0.1)
y_model <- predict(soi_binomial_glm, list(SOI=x), type="response")
y_linear <- predict(linear_model, list(SOI=x), type="response")
plot(jitter(y, 0.15) ~ SOI, data=quilpie_data, pch=20, col="red", xlab="July average SOI", ylab="Probability of rainfall exceeding threshold")
lines(x, y_linear, lty="dashed", lwd=1)
lines(x, y_model, lwd=2)
\(\mu=P(y=1)\) is the probability that the rainfall exceeds the 10 mm threshold. A direct linear model, with a systematic component of \(\mu=\beta_{0}+\beta_{1} x\), would not be suitable for this data set because \(\mu\) is a probability and so cannot fall below 0 or exceed 1. Instead, we take the systematic component of the model to be \(logit(\mu)=\beta_{0} + \beta_{1} x\), ensuring that \(\mu\) remains between 0 and 1. The random component of the model is \(y \sim Bin(1, \mu)\), since each year contributes a single binary outcome.
From the summary of the adopted model, we observe that only 4 Fisher scoring iterations were used, implying that the iterates converge quickly; the intercept term is not significant, but the coefficient of SOI is highly significant. Substituting the values of the estimated coefficients obtained from the summary, we get that the fitted model has the systematic component \(logit(\hat{\mu})=0.04813 + 0.1464 x\), where \(x\) is the July average SOI.
To determine how the proportion of years exceeding the rain threshold is related to the SOI, we observe that as \(x\) decreases, \(\hat{\mu}\) tends to 0, and as \(x\) increases, \(\hat{\mu}\) tends to 1. This is clearly visible in the plot, where the linear model is also shown for ease of comparison. We also note that when the SOI is 0, the fitted probability of rainfall exceeding 10 mm is approximately 0.50. Based on the above observations, we conclude that the probability of exceeding 10 mm increases with increasing values of the SOI, being less than 0.5 for negative SOI values and more than 0.5 for positive SOI values.
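To support these statements numerically (a sketch; the SOI values -10, 0 and 10 are illustrative choices), we can evaluate the fitted curve directly:
# fitted probability of exceeding 10 mm at illustrative SOI values
predict(soi_binomial_glm, newdata=data.frame(SOI=c(-10, 0, 10)), type="response")
# at SOI = 0 the fitted probability is the inverse logit of the intercept
plogis(coef(soi_binomial_glm)[1])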
cor(quilpie_data$y, quilpie_data$Phase)
## [1] -0.01951958
# creating dummy variables for phase
quilpie_data$Phase <- factor(quilpie_data$Phase)
dummy_variables <- with(quilpie_data, model.matrix( ~ Phase))
# creating the phase model
phase_binomial_model <- glm(y ~ Phase, data=quilpie_data, family=binomial)
summary(phase_binomial_model)
##
## Call:
## glm(formula = y ~ Phase, family = binomial, data = quilpie_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3548 -1.1372 0.3593 1.1278 1.9728
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.57 1495.30 -0.012 0.991
## Phase2 20.27 1495.30 0.014 0.989
## Phase3 15.77 1495.30 0.011 0.992
## Phase4 17.68 1495.30 0.012 0.991
## Phase5 17.47 1495.30 0.012 0.991
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 94.209 on 67 degrees of freedom
## Residual deviance: 65.796 on 63 degrees of freedom
## AIC: 75.796
##
## Number of Fisher Scoring iterations: 16
To determine if there is any effect of phase on the proportion of rainfall exceeding the threshold, we use the SOI phase as the explanatory variable instead of the SOI. The correlation between \(y\) and the numeric phase codes is essentially zero (-0.02), but since phase is a categorical variable this correlation says little by itself; we therefore model phase through dummy variables.
Since the model with SOI as the explanatory variable and the model with Phase as the explanatory variable are not nested, comparing them using the likelihood ratio test would be inappropriate; instead, we compare them by their AIC values. Note that for creating the phase model, we have used the fact that phase takes 5 distinct integer values and treated it as a factor, so that dummy variables serve as the explanatory variables. (The very large standard errors in the phase model arise because no exceedances occurred in phase 1, so its fitted probability is essentially zero, a separation effect; this does not affect the AIC comparison.) From the summaries of the models, we observe that the AIC value is 79.90 for the SOI model and 75.80 for the phase model. This suggests that phase does have an effect on the proportion exceeding the rain threshold and, in fact, gives better predictions than SOI, as its AIC value is comparatively lower.
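The comparison can also be read off directly (a small convenience check):
# AIC values of the two non-nested models side by side
AIC(soi_binomial_glm, phase_binomial_model)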
decade <- cut(quilpie_data$Year, breaks=seq(1920, 1990, by=10))
mean_y <- tapply(quilpie_data$y, decade, "mean")
mean_year <- tapply(quilpie_data$Year, decade, "mean")
# mean_y and mean_year are standalone vectors, so no data argument is needed
plot(mean_y ~ mean_year, type="b", pch=20, xlab="Year", ylab="Proportion of years exceeding threshold")
To determine whether the proportion varies with time, we make a simple plot. Since there are many years of data, we break the years into decades, treat one decade as one unit of time, and see how the proportion varies with it. We observe that the mean of \(y\) oscillates with time but has an overall increasing trend, with the exception of the decade 1930-1940.
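As a complementary, more formal check (a sketch that goes slightly beyond the decade plot), a logistic regression of \(y\) on Year estimates the overall time trend; a positive Year coefficient would support the increasing trend seen above:
# trend check: logistic regression of exceedance on calendar year
year_glm <- glm(y ~ Year, data=quilpie_data, family=binomial)
summary(year_glm)$coefficients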
A study of the habitats of the noisy miner (a small but aggressive native Australian bird) counted the number of noisy miners \(y\) and the number of eucalypt trees \(x\) in two-hectare buloke woodland transects. By looking at the plot of \(x\) vs \(y\), can you make any conclusion? Please provide the basis of the conclusion. Fit a model with proper justification, i.e., provide the reasons behind the proposed model for the data set. Discuss how good the fit is.
# loading the data set
nminer_data <- read.table("count_data1.txt", header=TRUE, check.names=FALSE)
head(nminer_data)
## Euc noisy_miners
## 1 2 0
## 2 10 0
## 3 16 3
## 4 20 2
## 5 19 8
## 6 18 1
summary(nminer_data)
## Euc noisy_miners
## Min. : 0 Min. : 0.00
## 1st Qu.: 4 1st Qu.: 0.00
## Median :12 Median : 1.00
## Mean :12 Mean : 2.71
## 3rd Qu.:17 3rd Qu.: 4.00
## Max. :32 Max. :19.00
# plotting x vs y
plot(jitter(noisy_miners) ~ Euc, data=nminer_data, pch=20, col="red", xlab="Number of eucalypt trees", ylab="Number of noisy miners")
# fitting the Poisson GLM
poisson_glm <- glm(noisy_miners ~ Euc, data=nminer_data, family=poisson(link="log"))
We take a look at the summary of the data, noting in particular the ranges of the variables. While making the plot, we add a small amount of randomness in the vertical direction to avoid overplotting. From the plot, we observe that the number of noisy miners tends to increase with the number of eucalypt trees, and that the spread of the counts grows as the mean grows, so the variance is not constant.
Because the response variable is a count and its randomness does not have a constant variance, the variation in the data may be modeled using a Poisson distribution with \(y \sim Pois(\mu)\), where \(y\) takes non-negative integer values and \(\mu\) is positive. A possible model for the systematic component is \(\log(\mu)=\beta_{0} + \beta_{1} x\), equivalently \(\mu=\exp(\beta_{0} + \beta_{1} x)\), where \(\mu\) is the expected number of noisy miners and \(x\) is the number of eucalypt trees. Note that the link function used here is the logarithm. Using the logarithm ensures that \(\mu\) is positive even when \(\beta_{0}\) and \(\beta_{1}\) range over \((-\infty, \infty)\), and it also models the non-linear form of the relationship between \(\mu\) and \(x\). Combining the two components, we model the data with \(y \sim Pois(\mu)\) (random component) and \(\log(\mu)=\beta_{0} + \beta_{1} x\) (systematic component).
summary(poisson_glm)
##
## Call:
## glm(formula = noisy_miners ~ Euc, family = poisson(link = "log"),
## data = nminer_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1454 -1.2530 -0.9673 0.5634 3.5603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87621 0.28279 -3.098 0.00195 **
## Euc 0.11398 0.01243 9.169 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 150.545 on 30 degrees of freedom
## Residual deviance: 63.318 on 29 degrees of freedom
## AIC: 121.47
##
## Number of Fisher Scoring iterations: 5
# plotting the confidence intervals for the mean
zstar <- qnorm(p=0.975)
new_Euc <- seq(0, 35, length=100)
new_noisy_miner <- predict(poisson_glm, se.fit=TRUE, newdata=data.frame(Euc=new_Euc))
lower_ci <- exp(new_noisy_miner$fit-zstar* new_noisy_miner$se.fit)
upper_ci <- exp(new_noisy_miner$fit+zstar* new_noisy_miner$se.fit)
plot(noisy_miners ~ Euc, data=nminer_data, pch=20, col="red", xlab="Number of eucalypt trees", ylab="Number of noisy miners")
lines(exp(new_noisy_miner$fit) ~ new_Euc, lwd=2)
lines(lower_ci ~ new_Euc, lty=2)
lines(upper_ci ~ new_Euc, lty=2)
To get an idea of how good the fit of the model is, we take a look at the summary of the model. We observe that the coefficient of the number of eucalypt trees is 0.11; that is, the expected log-count increases by 0.11 for each additional eucalypt tree. We observe that both the coefficients are significant and the standard errors are low, indicating a good fit. Also, only 5 Fisher scoring iterations were needed, implying that the iterates converge quickly. Substituting the estimated coefficient values, we conclude that the fitted model has the systematic component \(\log(\hat{\mu})=-0.8762 + 0.1140 x\), equivalently \(\hat{\mu}=\exp(-0.8762 + 0.1140 x)\). We further analyse the fit by looking at some diagnostic plots.
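Because the link is logarithmic, the Euc coefficient acts multiplicatively on the mean: each additional eucalypt tree multiplies the expected number of noisy miners by \(e^{0.1140} \approx 1.12\). A short sketch (the value of 20 trees is an arbitrary illustration):
# multiplicative effect on the expected count of one additional eucalypt tree
exp(coef(poisson_glm)["Euc"])
# a manual fitted mean at 20 trees, checked against predict()
exp(sum(coef(poisson_glm) * c(1, 20)))
predict(poisson_glm, newdata=data.frame(Euc=20), type="response")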
From the plot of the confidence intervals, we observe that the interval is narrow at small tree counts but becomes wider as the number of eucalypt trees increases. This suggests that for lower numbers of eucalypt trees we can predict the number of noisy miners more precisely, which reflects the scarcity of data points with a high number of eucalypt trees.
library(statmod)
# Cook's distance plot
plot(cooks.distance(poisson_glm), type="h")
# Leverages plot
plot(hatvalues(poisson_glm), type="h")
which.max(hatvalues(poisson_glm))
## 11
## 11
quantile_residuals <- qresid(poisson_glm)
which.max(abs(quantile_residuals))
## [1] 7
# locating the most influential observation
which.max(cooks.distance(poisson_glm))
## 17
## 17
From the plots and the extracted values, we observe that only observation 17 is influential. Observation 11 has high leverage but a small residual, so it is not influential; similarly, observation 7 has a large residual but low leverage, so it is not influential either. Only observation 17 combines a reasonably large leverage with a large residual, making it an influential point. Thus, we expect that excluding observation 17 should give us a better fit.
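To verify this expectation, one can refit the model without observation 17 and compare the headline numbers (a sketch using update()):
# refitting the Poisson GLM without the influential observation
poisson_glm_wo17 <- update(poisson_glm, subset=-17)
c(deviance(poisson_glm), deviance(poisson_glm_wo17))
summary(poisson_glm_wo17)$coefficients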
The times to death (in weeks) of two groups of leukaemia patients (grouped according to a morphological variable called the AG factor) were recorded and their white blood cell counts (WBC) were measured. Do an exploratory data analysis to see if the time varies with WBC (or a suitable transformation of it) and AG. The exponential distribution is commonly used to explain survival times. Fit an exponential distribution with the link function taken to be the log function. Evaluate how good the fit is. Can you suggest some other rational model for the given survival time data? Justification is needed.
# loading the data set
leukwbc_data <- read.table("survival_data.txt", header=TRUE, check.names=FALSE)
# exploratory data analysis
str(leukwbc_data)
## 'data.frame': 33 obs. of 3 variables:
## $ WBC : int 2300 750 4300 2600 6000 10500 10000 17000 5400 7000 ...
## $ Time: int 65 156 100 134 16 108 121 4 39 143 ...
## $ AG : int 1 1 1 1 1 1 1 1 1 1 ...
cor(leukwbc_data$WBC, leukwbc_data$Time)
## [1] -0.3294525
cor(log10(leukwbc_data$WBC), leukwbc_data$Time)
## [1] -0.5087889
summary(leukwbc_data)
## WBC Time AG
## Min. : 750 Min. : 1.00 Min. :1.000
## 1st Qu.: 5300 1st Qu.: 4.00 1st Qu.:1.000
## Median : 10500 Median : 22.00 Median :1.000
## Mean : 29165 Mean : 40.88 Mean :1.485
## 3rd Qu.: 32000 3rd Qu.: 65.00 3rd Qu.:2.000
## Max. :100000 Max. :156.00 Max. :2.000
leukwbc_data$WBC_1000 <- (leukwbc_data$WBC/ 1000)
boxplot(Time ~ AG, data=leukwbc_data, xlab="AG factor", ylab="Time to death (in weeks)", col=c("skyblue","darkgreen"))
We observe that the data set contains 33 observations of 3 variables: WBC (integer valued), Time and AG (binary valued). We observe that the correlation between WBC and survival time is -0.33, while the correlation between \(\log_{10}(\text{WBC})\) and survival time is -0.51, suggesting that a log-transformed explanatory variable would yield better results than the untransformed one. From the summary of the data set, we observe that WBC has a very wide range; to account for this, we scale the variable down by 1000 and use the scaled version from now on for our analysis. To see how survival time is spread out with respect to the AG factor, we take a look at the boxplot. The box for AG-positive patients is comparatively taller (larger IQR) and sits markedly higher (larger maximum) than the one for AG-negative patients, suggesting a clear difference between the two groups: survival times for AG-positive patients are substantially longer.
A crucial observation that we make is that the response variable is a positive continuous variable. We know that continuous data over the positive real line may be modeled by the gamma distributions, so moving ahead we will take the random component of the model to follow a gamma distribution; note that the exponential distribution requested in the question is the special case of the gamma with dispersion parameter equal to 1, so this choice covers it. For the systematic component we use the log link, \(\log(\mu)=\eta\), which keeps \(\mu\) positive, and we consider both WBC and \(\log_{10}(\text{WBC})\) as candidate explanatory variables.
library(e1071)
par(mfrow=c(1, 2))
plot(Time ~ WBC_1000, data=leukwbc_data, pch=20, col=ifelse(leukwbc_data$AG==1, "skyblue", "darkgreen"), xlab="WBC (in thousands)", ylab="Time to death (in weeks)")
legend("topright", pch=20, c("AG-positive","AG-negative"), col=c("skyblue", "darkgreen"))
skewness(leukwbc_data$WBC_1000)
## [1] 1.237991
plot(Time ~ log10(WBC_1000), data=leukwbc_data, pch=20, col=ifelse(leukwbc_data$AG==1, "skyblue", "darkgreen"), xlab="Logarithm of WBC (in thousands)", ylab="Time to death (in weeks)")
legend("topright", pch=20, c("AG-positive","AG-negative"), col=c("skyblue", "darkgreen"))
dev.off()
## null device
## 1
skewness(log10(leukwbc_data$WBC_1000))
## [1] -0.08023783
We plot the time to death (in weeks) against WBC, distinguishing between AG-positive and AG-negative patients. To improve on this plot, we then plot time to death (in weeks) against \(\log_{10}(\text{WBC})\), again distinguishing the two groups. The log transformation counteracts the strong right skew of the WBC values; note that from our exploratory data analysis also, we expect the model with \(\log_{10}(\text{WBC})\) to perform better. Comparing the two plots, the explanatory variable in the left plot is highly skewed, whereas after the transformation it is nearly symmetric, which is an improvement. This can also be seen from the skewness values (1.24 versus -0.08). Thus, we conclude that \(\log_{10}(\text{WBC})\) is the better choice and proceed with it.
interaction_glm <- glm(Time ~ AG* log10(WBC_1000), family=Gamma(link="log"), data=leukwbc_data)
summary(interaction_glm)
##
## Call:
## glm(formula = Time ~ AG * log10(WBC_1000), family = Gamma(link = "log"),
## data = leukwbc_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9921 -1.2116 -0.3269 0.2159 1.5647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0203 1.2359 5.680 3.84e-06 ***
## AG -1.8704 0.8292 -2.256 0.0318 *
## log10(WBC_1000) -1.8642 0.9781 -1.906 0.0666 .
## AG:log10(WBC_1000) 0.7548 0.6492 1.162 0.2545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.116173)
##
## Null deviance: 58.138 on 32 degrees of freedom
## Residual deviance: 38.555 on 29 degrees of freedom
## AIC: 301.74
##
## Number of Fisher Scoring iterations: 11
gamma_glm <- glm(Time ~ AG + log10(WBC_1000), family=Gamma(link="log"), data=leukwbc_data)
summary(gamma_glm)
##
## Call:
## glm(formula = Time ~ AG + log10(WBC_1000), family = Gamma(link = "log"),
## data = leukwbc_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1746 -1.2663 -0.4251 0.4962 1.9048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7480 0.6533 8.798 8.26e-10 ***
## AG -1.0176 0.3642 -2.794 0.00898 **
## log10(WBC_1000) -0.7009 0.3167 -2.213 0.03462 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.087715)
##
## Null deviance: 58.138 on 32 degrees of freedom
## Residual deviance: 40.319 on 30 degrees of freedom
## AIC: 301.49
##
## Number of Fisher Scoring iterations: 8
We fit a GLM(gamma, log) model including the interaction term between AG and \(\log_{10}(\text{WBC})\). For comparison's sake, we also fit a GLM(gamma, log) model excluding the interaction term, and we observe that the model without the interaction term is better. In the model with the interaction term not all coefficients are significant, whereas in the other model all coefficients are significant, with lower standard errors and lower p-values. In particular, the coefficient of the interaction term is not significant, suggesting that the interaction term is not necessary. Also, the number of Fisher scoring iterations in the model without the interaction term is lower, indicating that the iterates converge quicker. Thus, we proceed with the model without the interaction term.
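Two short follow-up checks support this choice (sketches): the two models are nested, so an analysis-of-deviance F test compares them formally; and since the exponential is the gamma with dispersion 1, re-summarizing the chosen model with the dispersion fixed at 1 gives the exponential fit requested in the question.
# nested models: analysis-of-deviance F test (appropriate for the gamma family)
anova(gamma_glm, interaction_glm, test="F")
# exponential special case: fix the dispersion parameter at 1
summary(gamma_glm, dispersion=1)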
# Cook's distance plot
plot(cooks.distance(gamma_glm), type="h")
We also take a quick look at Cook's distance plot, which indicates that the 33rd observation is an influential point, suggesting that its removal should improve the model.
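A quick way to gauge its impact (a sketch) is to refit without that observation:
# refitting the gamma GLM without the influential observation
gamma_glm_wo33 <- update(gamma_glm, subset=-33)
summary(gamma_glm_wo33)$coefficients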
new_WBC <- seq(min(leukwbc_data$WBC_1000), max(leukwbc_data$WBC_1000), length=1000)
new_AG_positive <- predict(gamma_glm, newdata=data.frame(WBC_1000=new_WBC, AG=1),
type="response")
new_AG_negative <- predict(gamma_glm, newdata=data.frame(WBC_1000=new_WBC, AG=2),
type="response")
plot(Time ~ log10(WBC_1000), data=leukwbc_data, pch=20, col=ifelse(leukwbc_data$AG==1, "skyblue", "darkgreen"), xlab="Logarithm of WBC (in thousands)", ylab="Time to death (in weeks)")
lines(new_AG_positive ~ log10(new_WBC), lty=1, lwd=2, col="skyblue")
lines(new_AG_negative ~ log10(new_WBC), lty=2, lwd=2, col="darkgreen")
legend("topright", c("AG-positive","AG-negative"), pch=20, col=c("skyblue", "darkgreen"), lty=c(1, 2))
Lastly, we plot the fitted curves for each AG factor over a plot of the observations. We note that the log transformation makes the fitted curves very close to linear on this scale. We observe that the survival time is higher for lower WBC values. Furthermore, for the same WBC, the expected survival time for an AG-positive patient is roughly double or more that for an AG-negative patient.
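The "roughly double" statement can be made precise from the AG coefficient (recall AG is coded 1 for positive, 2 for negative): moving from AG-negative to AG-positive multiplies the expected survival time by \(e^{1.0176} \approx 2.8\). A one-line check:
# expected survival-time ratio, AG-positive relative to AG-negative
exp(-coef(gamma_glm)["AG"])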
Another model that could be used for this survival data set is the inverse Gaussian model, as it is also commonly used for positive continuous data.