1) Use the petrol consumption data set (http://myweb.fsu.edu/jelsner/temp/data/PetrolConsumption.txt) and build a regression tree to predict petrol consumption based on petrol tax, average income, amount of pavement and the proportion of the population with drivers' licences. Plot the tree. Prune the tree leaving only three terminal nodes. Plot the final tree. (20)

setwd("C:/Spatial Statistics")
PetCon = read.table("http://myweb.fsu.edu/jelsner/temp/data/PetrolConsumption.txt", header = TRUE)
head(PetCon)
##   Petrol.Tax Avg.Inc Pavement Prop.DL Petrol.Consumption
## 1        9.0    3571     1976   0.525                541
## 2        9.0    4092     1250   0.572                524
## 3        9.0    3865     1586   0.580                561
## 4        7.5    4870     2351   0.529                414
## 5        8.0    4399      431   0.544                410
## 6       10.0    5342     1333   0.571                457
suppressPackageStartupMessages(library(tree))
# Fit a regression tree for petrol consumption using all four predictors
tr = tree(Petrol.Consumption ~ ., data = PetCon)
plot(tr)
text(tr)

# Prune the tree back to three terminal nodes
tr1 = prune.tree(tr, best = 3)
plot(tr1)
text(tr1)
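
As an optional cross-check (not required by the question), cross-validation from the tree package can suggest how many terminal nodes to keep; a minimal sketch (output not shown):

# Cross-validated deviance against tree size
cvtr = cv.tree(tr)
plot(cvtr$size, cvtr$dev, type = "b", xlab = "Terminal nodes", ylab = "Deviance")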

2) Use the data below to model the probability of O-ring damage as a logistic regression using launch temperature as the explanatory variable. Is the temperature a significant predictor of damage? Is it adequate? What are the odds of damage when launch temperature is 60F relative to the odds of damage when the temperature is 75F? Use the model to predict the probability of damage given a launch temperature of 55F. (30)

Temp = c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78, 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
Damage = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1)
TempD = data.frame(Temp, Damage)
logrm = glm(Damage ~ Temp, data = TempD, family = binomial)
summary(logrm)
## 
## Call:
## glm(formula = Damage ~ Temp, family = binomial, data = TempD)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.7613  -0.3783   0.4524   2.2175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## Temp         -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5

Given the p-value of 0.032, there is moderate evidence that temperature is a significant predictor of damage.

The change in deviance from the null to the full model is:

28.267 - 20.315 = 7.952 on 22 - 21 = 1 degree of freedom. Comparing this drop in deviance against a chi-squared distribution with 1 degree of freedom gives the model p-value:

pchisq(7.952, 1, lower.tail = FALSE)
## [1] 0.004803426

The p-value for the model is 0.0048. Since it is well below 0.01, there is strong evidence that the model is significant and, hence, adequate for explaining damage.
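
As a cross-check, the same drop-in-deviance test can be read directly from the fitted model; a minimal sketch (output not shown):

# Sequential analysis of deviance with a chi-squared test
anova(logrm, test = "Chisq")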

We write the logistic regression model, where pi represents the probability of damage: \[ \hbox{logit}(\pi) = 15.043 - 0.232 \times \hbox{Temp} \] In words: the logarithm of the odds of damage equals 15.043 minus 0.232 times Temp.

So, the odds of damage when the launch temperature is 60F relative to the odds of damage when the temperature is 75F are:

\[ \exp(-0.232 \times (60 - 75)) \] Or, exp(-0.232 * (60 - 75)) = 32.46, or about 32.5.

As a result, the odds of damage at a launch temperature of 60F are approximately 32.5 times the odds of damage at a launch temperature of 75F.
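
The same ratio can be computed from the unrounded fitted coefficient; it differs slightly from the hand calculation above because of rounding:

# Odds ratio for 60F relative to 75F from the fitted coefficient
exp(coef(logrm)["Temp"] * (60 - 75))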

predict(logrm, data.frame(Temp = 55), type = "response")
##         1 
## 0.9066965

Given a launch temperature of 55°F, the model predicts a damage probability of about 90.7% based on these data.
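
A rough 95% interval for this prediction can be sketched by computing the standard error on the logit scale and back-transforming, assuming approximate normality of the linear predictor:

# Approximate 95% interval for the predicted probability at 55F
pr = predict(logrm, data.frame(Temp = 55), type = "link", se.fit = TRUE)
plogis(pr$fit + c(-1.96, 1.96) * pr$se.fit)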

3) Consider a set of medical records for 81 children undergoing a spinal operation. The data are in the data frame kyphosis (rpart package). The variables are: Kyphosis, a binary variable indicating the presence/absence of a post-operative spinal deformity called kyphosis; Age, the age of the child in months; Number, the number of vertebrae involved in the spinal operation; and Start, the beginning of the range of the vertebrae involved in the operation. (50)

suppressPackageStartupMessages(library(rpart))
head(kyphosis)
##   Kyphosis Age Number Start
## 1   absent  71      3     5
## 2   absent 158      3    14
## 3  present 128      4     5
## 4   absent   2      5     1
## 5   absent   1      4    15
## 6   absent   1      2    16
names(kyphosis)
## [1] "Kyphosis" "Age"      "Number"   "Start"
  1. What is the average age of the children in the sample?
mean(kyphosis$Age) / 12  # Age is recorded in months; divide by 12 to get years
## [1] 6.971193
  2. What proportion of children in the sample indicated spinal deformity?
table(kyphosis$Kyphosis)
## 
##  absent present 
##      64      17
sum(kyphosis$Kyphosis == "present") / length(kyphosis$Kyphosis)  # proportion with kyphosis present
## [1] 0.2098765
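
Equivalently, prop.table() converts the table of counts into proportions (output not shown):

prop.table(table(kyphosis$Kyphosis))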
  3. Create a side-by-side boxplot showing the bivariate relationship between presence/absence of kyphosis and the age of the child.
suppressPackageStartupMessages(library(ggplot2))
ggplot(kyphosis, aes(x = Kyphosis, y = Age)) + geom_boxplot() + ylab("Age (months)")

  4. Regress the presence or absence of kyphosis onto age, number of vertebrae and starting range of vertebrae using a logistic regression.
logrmK = glm(Kyphosis ~ ., data = kyphosis, family = binomial)
summary(logrmK)
## 
## Call:
## glm(formula = Kyphosis ~ ., family = binomial, data = kyphosis)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3124  -0.5484  -0.3632  -0.1659   2.1613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.036934   1.449575  -1.405  0.15996   
## Age          0.010930   0.006446   1.696  0.08996 . 
## Number       0.410601   0.224861   1.826  0.06785 . 
## Start       -0.206510   0.067699  -3.050  0.00229 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.234  on 80  degrees of freedom
## Residual deviance: 61.380  on 77  degrees of freedom
## AIC: 69.38
## 
## Number of Fisher Scoring iterations: 5
  5. Which variable appears to be the most important in explaining the presence of Kyphosis?

Based on the coefficient table, Start has the smallest p-value (0.0023) and is the only predictor significant at the 0.05 level, so Start appears to be the most important variable in explaining the presence of kyphosis.
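
As a cross-check on this answer, single-term deletions give a likelihood-ratio test for each predictor; a sketch (output not shown):

# Drop in deviance when each term is removed from the model
drop1(logrmK, test = "Chisq")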