Course code

1. Linear Regression

#Load data
mos.df <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/mos.df.dat",header = TRUE)

#pairwise scatter plots
pairs(mos.df)

#fitting a linear model

mos.lm<-lm(Maxtemp~., data=mos.df)

#summary a linear model
summary(mos.lm)

## 
## Call:
## lm(formula = Maxtemp ~ ., data = mos.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7473 -1.7809 -0.1622  1.8646  7.9175 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -168.64102   28.94328  -5.827 1.24e-08 ***
## Modst          0.08488    0.07805   1.087    0.278    
## Modsp         -0.03270    0.01812  -1.804    0.072 .  
## Modthik        0.03594    0.00245  14.665  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.009 on 365 degrees of freedom
## Multiple R-squared:  0.4704, Adjusted R-squared:  0.466 
## F-statistic: 108.1 on 3 and 365 DF,  p-value: < 2.2e-16

Linear model diagnostics

Residuals in R

influence中一共三个元素，分别是coefficient、sigma、hat:

coefficient表示the estimate of β obtained when the ith observation is deleted;
sigma表示the vector with ith element σˆ−i，表示排除ith observation的standard error;
hat表示leverage：残差和杠杆点等指标来识别和评估模型中的离群点或异常值

The component df.residual of mos.lm contains n − p,the residual degrees of freedom.

# influence
influence <- lm.influence(mos.lm)

β <- influence$coefficients
sigma <- influence$sigma
hii <- influence$hat

#raw residuals

raw_residuals <- mos.lm$residuals

##Standardized deviation.
stddev<-sqrt(sum(mos.lm$residuals^2)/ mos.lm$df.residual)

mos.inf<-lm.influence(mos.lm)

## standardized residuals
stdres<-mos.lm$residuals/(stddev*sqrt(1-mos.inf$hat))


## studentized residuals
studres<-mos.lm$residuals/ (mos.inf$sigma*sqrt(1-mos.inf$hat))

##To plot standardized residuals against fitted values. 

par(mfrow=c(1,1))
plot(mos.lm$fit,stdres, 
     xlab="Fitted values",
     ylab="Standardized residuals")

Plot model

plot of raw residuals against fitted values;
Normal qq plot of standardized residuals;
fitted values against square root of standardized residuals;
Cook’s distance against observation indices.

par(mfrow=c(2,2))
plot(mos.lm)

Model Selection

Drop1&add1

R commands add1 and drop1 that show effect of one variable additions and deletions for a given model.

add1 函数可以用来检验向某一线性回归模型加入一个新的变量所带来的影响。

drop1 函数则是相反的操作，它可以用来检验将某一变量从线性回归模型中删除后，整个模型的拟合好坏是否改善。AIC越小越好，表示去掉某一个predictor之后，模型各项指标的变化。

drop1(mos.lm)

Prediction

The vector fit shows predictions of the mean for the two sets of predictor values.

lwr and upr specify lower and upper confidence limits for 95% intervals for the mean.

mosnew.df<-data.frame(Modst=c(285.1,288.0),
                      Modsp=c(1021.2,1026.8),
                      Modthik=c(5380.4,5388.4))

mos.pred<-predict.lm(mos.lm, newdata=mosnew.df,
                     se.fit=T, interval="confidence", level=0.95)

mos.pred

## $fit
##        fit      lwr      upr
## 1 15.50811 14.81804 16.19818
## 2 15.85861 15.09306 16.62416
## 
## $se.fit
##         1         2 
## 0.3509169 0.3892985 
## 
## $df
## [1] 365
## 
## $residual.scale
## [1] 3.009404

2. Factors

Factor analysis

# Factor
x4 <- factor(c("apple", "orange", "pear"))

# blonds data

blonds <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/blonds.txt",header = TRUE)

# boxplot
par(mfrow=c(1,1))
boxplot(Pain~HairColour,data=blonds)

The following commands produce plots showing the overall mean and means for each level of HairColour (left) and the overall median and medians for each hair colour (right).

# Analyse the data as a one factor experiment
par(mfrow=c(1,2))

blonds$HairColour <- as.factor(blonds$HairColour)

plot.design(blonds)

plot.design(blonds,fun=median)

From the box plots, it seems that there is a difference between groups:

the lighter the hair color, the higher the pain threshold.

##Side by side boxplots showing the effect of each hair colour
par(mfrow=c(1,1))
plot(blonds)

Fitting the model with single factor

aov() 函数可以用于执行方差分析(ANOVA，Analysis of Variance)。方差分析是一种统计方法，用于比较两个或多个样本的均值是否存在显著差异。它通常用于分析因素（factor）对响应变量（response variable）的影响，其中因素可以是一个或多个分类变量。

aov.blond<-aov(Pain ~ HairColour, blonds)
summary(aov.blond)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## HairColour   3   1361   453.6   6.791 0.00411 **
## Residuals   15   1002    66.8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model：线性模型的公式和所使用的数据集名称。

Residuals：残差相关的统计量，例如残差的个数、残差的平均值、残差平方和等。

Df：自由度（degrees of freedom），例如总体自由度、组间自由度、组内自由度等。

Sum Sq：各种和平方值（sums of squares），例如总体平方和、组间平方和、组内平方和等。

Mean Sq：各种均方值（means of squares），例如组间均方和、组内均方和等。

F value：显示 F 统计量的值。

Pr(>F)：显示 P 值，即显著性水平，表示在当前假设下观察到该 F 统计量或更极端结果的概率。

Residual Analysis

##Function resid extracts the residuals

## hist 直方图
hist(resid(aov.blond))

##qqnorm() 是一个用于绘制正态概率图（normal probability plot）的函数。
qqnorm(resid(aov.blond))

##Extract estimates of coefficients in overparametrized model subject to α1 = 0.
##In R, the output of dummy.coef will give coefficient estimates setting α1 = 0 by default.

dummy.coef(aov.blond)

## Full coefficients are 
##                                                                
## (Intercept):         51.2                                      
## HairColour:     DarkBlond DarkBrunette LightBlond LightBrunette
##                       0.0        -13.8        8.0          -8.7

Experiments involving two factors

Data frame planes contains the following variables:

Distance - distance travelled, the response

Paper - the weight of the sheet of paper used, a factor with

levels 1=80 grams and 2=50 grams.

Plane - design of the plane, 1=sophisticated design,2=simple design.

planes <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/planes.txt",header = TRUE)

##Plot of means for different levels for the two factors and plot of global mean
planes$Paper <- as.factor(planes$Paper)
planes$Plane <- as.factor(planes$Plane)
plot.design(planes)

par(mfrow=c(1,2))

##Box plots showing variation among responses for different levels of each factor
boxplot(Distance~Paper,data = planes)

boxplot(Distance~Plane,data = planes)

Fitting the model-No interaction

aov.planes<-aov(Distance ~ Paper+Plane,
                data=planes)
summary(aov.planes)

##             Df   Sum Sq Mean Sq F value Pr(>F)
## Paper        1  1718721 1718721   0.678  0.425
## Plane        1   385641  385641   0.152  0.703
## Residuals   13 32947925 2534456

F-tests for significance of factors indicate that (at least in the model with no interaction) neither factor is helpful for explaining variation in paper plane distance.

Fitting model with interactions

The attach command just allows us to refer to the columns of the data frame as if they were vectors (we can now write Paper instead of planes$Paper for instance).

In the interaction.plot command, the first argument is the factor that appears on the x-axis, there is one line for each level of the factor supplied as the second argument, and the third argument is the response (means of the response are plotted).

attach(planes)

par(mfrow=c(1,1))
interaction.plot(Paper,Plane,Distance)

##fitting model

planes.aov<-aov(Distance ~ Plane+Paper+
                Plane:Paper,data=planes)
summary(planes.aov)

##             Df   Sum Sq  Mean Sq F value   Pr(>F)    
## Plane        1   385641   385641   0.484 0.499861    
## Paper        1  1718721  1718721   2.157 0.167628    
## Plane:Paper  1 23386896 23386896  29.353 0.000156 ***
## Residuals   12  9561029   796752                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Formulas in R

In fitting linear models in R, we have supplied model formulas to the commands lm and aov.

amodel.lm<-lm(response ~ p1+p2)

#In specifying model formulas, we can give transformations of the response and/or predictors:
amodel.lm<-lm(log(response)~ sin(p1)+p2)

#For a two factor experiment with replication, to fit a model with interactions, type
amodel.lm<-lm(response ~ p1+p2+p1:p2)

#To specify that both p1, p2 and their interaction are included we can also use the ‘*’ symbol:
amodel.lm<-lm(response ~ p1*p2)

#One usage of the ‘.’ symbol: will fit a linear model with all the columns of df as predictors (except the response)
amodel.lm<-lm(response ~ ., data=df)

#amodel.lm<-lm(response ˜ ., data=df)
amodel.lm<-lm(response ~ . - p1, data=df)

#An intercept is always fitted unless you remove with a ‘-1’ as in
amodel.lm<-lm(response ~ . -1, data=df)

3. Binomial Regression

titanic <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/titanic.txt",header = TRUE)
donner <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/donner.txt",header = TRUE,sep = ',')

Logistic Regression

Fit a model with Survived as the response and Age as a predictor.

The option na.exclude will remove cases where one or more variables are missing

titanic.glm<-glm(Survived ~ Age, data=titanic,
                 family=binomial, na.action=na.exclude)
summary(titanic.glm)

## 
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = titanic, 
##     na.action = na.exclude)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.081428   0.173862  -0.468   0.6395  
## Age         -0.008795   0.005232  -1.681   0.0928 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1025.6  on 755  degrees of freedom
## Residual deviance: 1022.7  on 754  degrees of freedom
##   (因为不存在，557个观察量被删除了)
## AIC: 1026.7
## 
## Number of Fisher Scoring iterations: 4

### plot model 
par(mfrow=c(2,2))
plot(titanic.glm)

par(mfrow=c(1,1))
plot(x = na.omit(titanic$Age),y = titanic.glm$fitted.values,ylim = c(0,1))

Example tobacco budworm

在Treatment对比方式下，假设有n个水平，那么该方法将按照字母顺序给出n-1个确定性对比。对比组是第一个水平，其他组分别与它进行比较。这通常用于将多个类别的变量编码为虚拟变量以进行回归分析。

在Polynomial对比方式下，该方法使用正交多项式（orthogonal polynomial）进行对比。这种方法可以用来检验某因子的线性效应是否显著。

counts<-c(1,4,9,13,18,20,0,2,6,10,12,16)
n<-rep(20,times=12)
rfreq<-counts/n
sex<-factor(rep(c("M","F"),times=c(6,6)))
# Use log of dose to base 2 as predictor
dose <- rep(c(1,2,4,8,16,32),2)
ldose_1 <- log2(dose)

options(contrasts= c("contr.treatment","contr.poly"))

toxic.glm<-glm(rfreq~sex*ldose_1,family=binomial, weights=n)

summary(toxic.glm)

## 
## Call:
## glm(formula = rfreq ~ sex * ldose_1, family = binomial, weights = n)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.9935     0.5527  -5.416 6.09e-08 ***
## sexM           0.1750     0.7783   0.225    0.822    
## ldose_1        0.9060     0.1671   5.422 5.89e-08 ***
## sexM:ldose_1   0.3529     0.2700   1.307    0.191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   4.9937  on  8  degrees of freedom
## AIC: 43.104
## 
## Number of Fisher Scoring iterations: 4

toxic.glm<-glm(rfreq~I(ldose_1-3)*sex,
               family=binomial, weights=n)
summary(toxic.glm)

## 
## Call:
## glm(formula = rfreq ~ I(ldose_1 - 3) * sex, family = binomial, 
##     weights = n)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.2754     0.2305  -1.195  0.23215    
## I(ldose_1 - 3)        0.9060     0.1671   5.422 5.89e-08 ***
## sexM                  1.2337     0.3770   3.273  0.00107 ** 
## I(ldose_1 - 3):sexM   0.3529     0.2700   1.307  0.19117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   4.9937  on  8  degrees of freedom
## AIC: 43.104
## 
## Number of Fisher Scoring iterations: 4

Sequential test

Deviance table: anova()

Null deviance is the deviance of a model including just an intercept term.

The degrees of freedom for the null deviance is the difference between the number of parameters in the saturated model (the number of observations n, which is 12 here) and the number of parameters in the model including just an intercept (which is 1).

Residual deviance is just the deviance for the fitted model: degrees of freedom is the number of parameters in the saturated model minus number of parameters for the fitted model (12 − 4 = 8 here).

donner.glm<-glm(as.factor(donner$Survived) ~ Age, 
                data=donner,
                na.action=na.exclude, 
                family=binomial)

anova(donner.glm)

titanic.glm<-glm(Survived ~ PClass*Sex,family=binomial,data = titanic)
summary(titanic.glm)

## 
## Call:
## glm(formula = Survived ~ PClass * Sex, family = binomial, data = titanic)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.7006     0.3443   7.843 4.41e-15 ***
## PClass2nd          -0.7223     0.4540  -1.591    0.112    
## PClass3rd          -3.2014     0.3724  -8.598  < 2e-16 ***
## Sexmale            -3.4106     0.3793  -8.992  < 2e-16 ***
## PClass2nd:Sexmale  -0.3461     0.5274  -0.656    0.512    
## PClass3rd:Sexmale   1.8827     0.4283   4.396 1.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1688.1  on 1312  degrees of freedom
## Residual deviance: 1155.9  on 1307  degrees of freedom
## AIC: 1167.9
## 
## Number of Fisher Scoring iterations: 5

anova(titanic.glm)

Predictors are listed in the order they are added in the left column. Predictors: 按照变量输入顺序展示；

Number of parameters added for each predictor in Df column. Df: 表示增加变量的数量；

Deviance column is actually reduction in deviance as each term is added. Deviance：增加该变量，偏差deviance下降多少；

R. Df (residual degrees of freedom) is n − p where p is the number of parameters in the current model. 表示残差自由度

Residual deviance is another name for the deviance. Residual deviance：表示加入该变量后，模型的偏差。

Residuals in binomial regression

pr<-residuals(titanic.glm,type="pearson")
sum(pr^2)

## [1] 1313

dr<-residuals(titanic.glm,type="deviance")
sum(dr)

## [1] -120.3285

4. Choosing models and transformation

Density plot

For unknown reasons, many pregnant dairy cows become recumbent – they lay down – either shortly before or shortly after calving. This condition can be serious, and may lead to death of the cow. Data from a study of blood samples of over 500 cows studied at the Ruakura Animal Health Laboratory in New Zealand. Variable Outcome is the response (survived or died) and predictors are various blood measurements (here we consider only the predictor AST).

We estimate the density of AST within the two groups defined by the response values.

The R function density can be used to estimate a general smooth density function. To plot the density functions for the y = 1 and y = 0 groups on the ame graph with different line types for the two densities:

cows <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/downer.txt",header = TRUE)

library(data.table)
setDT(cows)

plot(density(cows[Outcome==1]$AST,na.rm=T), type="l",xlab="AST",ylab="Density",main = 'Density')

lines(density(cows[Outcome==0]$AST,na.rm=T),lty=2)

#Density estimates for AST for cows which survived (solid line) and cows which died (dotted line)

We know that a logistic regression model with AST and possible AST^2 as predictors will be appropriate if the densities are roughly normal. This does not appear to be the case for the figure. However, using a log transformation on AST does result in densities which are roughly normal, and this suggests that a model involving log(AST) and possibly log(AST)2 would be a good one. Density estimates for log(AST) in the two groups:

plot(density(log(cows[Outcome==1]$AST),na.rm=T),type="l", xlab="log(AST)",ylab="Density",main = 'Log_density')

lines(density(log(cows[Outcome==0]$AST),na.rm=T),lty=2)

# Density estimates for log(AST) for cows which survived (solid line) and cows which died (dotted line)

The log transformation has made the two densities roughly symmetric and unimodal: the scale of the two densities also seems to be similar. The above informal diagnostics suggest that a model involving the single predictor log(AST) might be appropriate. First fit a model involving just AST:

cows.glm<-glm(Outcome ~ log(cows$AST),data=cows, na.action=na.exclude, family=binomial)
summary(cows.glm)

## 
## Call:
## glm(formula = Outcome ~ log(cows$AST), family = binomial, data = cows, 
##     na.action = na.exclude)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     5.4066     0.7380   7.326 2.38e-13 ***
## log(cows$AST)  -1.0889     0.1374  -7.926 2.27e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 569.75  on 428  degrees of freedom
## Residual deviance: 489.08  on 427  degrees of freedom
##   (因为不存在，6个观察量被删除了)
## AIC: 493.08
## 
## Number of Fisher Scoring iterations: 4

Alternatives to logistic regression

Fit of a probit regression model:

cowsprobit.glm<-glm(Outcome~log(AST),
                    data=cows,
                    na.action=na.exclude,
                    family=binomial(link=probit))
summary(cowsprobit.glm)

## 
## Call:
## glm(formula = Outcome ~ log(AST), family = binomial(link = probit), 
##     data = cows, na.action = na.exclude)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.2357     0.4243   7.627  2.4e-14 ***
## log(AST)     -0.6518     0.0780  -8.356  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 569.75  on 428  degrees of freedom
## Residual deviance: 489.05  on 427  degrees of freedom
##   (因为不存在，6个观察量被删除了)
## AIC: 493.05
## 
## Number of Fisher Scoring iterations: 4

Fit of a log-log regression model:

cowslog.glm<-glm(Outcome~log(AST),
                    data=cows,
                    na.action=na.exclude,
                    family=poisson(link=log))
summary(cowslog.glm)

## 
## Call:
## glm(formula = Outcome ~ log(AST), family = poisson(link = log), 
##     data = cows, na.action = na.exclude)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.30687    0.47859   4.820 1.43e-06 ***
## log(AST)    -0.62117    0.09382  -6.621 3.57e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 315.47  on 428  degrees of freedom
## Residual deviance: 267.04  on 427  degrees of freedom
##   (因为不存在，6个观察量被删除了)
## AIC: 597.04
## 
## Number of Fisher Scoring iterations: 5

Logistic discrimination

Logistic discrimination with more than two categories can be carried out in R with the command multinom, which is part of the neural networks library. Type library(nnet) in order to be able to use the command.

amodel.mn<-multinom(formula, data)

A common use of classification and pattern recognition algorithms is in medical diagnosis. Data frame Cushings contains data from patients with Cushing’s syndrome, a hypertensive disorder. Patients with Cushing’s sydnrome can have the disease in one of three forms (a, b or c say). Interested in classifying patients based on predictors Tetrahydrocortisone and Preganetriol (urinary excretion rates for two steroid metabolites). This data set is contained in the R library. Type library(mass) so that you can use the data.

In addition to the predictors Tetrahydrocortisone and Preganetriol, Cushings contains a variable Type indicating which form of Cushing’s syndrome each patient has. Type can have possible values a, b, c and u (a, b and c are indicators for the three different forms of Cushing’s syndrome and u indicates that the type is unknown). The first 21 observations are the ones for patients where the type of Cushing’s syndrome is known.

library(nnet)
library(MASS)

## Warning: 程辑包'MASS'是用R版本4.3.1 来建造的

names(Cushings)

## [1] "Tetrahydrocortisone" "Pregnanetriol"       "Type"

Cf<-data.frame(Type=factor(Cushings[1:21,3]),
               Tetrahydrocortisone=log(Cushings[1:21,2]),
               Pregnanetriol=log(Cushings[1:21,1]))
Cush.multinom<-multinom(Type ~ .,Cf,maxit=250)

## # weights:  12 (6 variable)
## initial  value 23.070858 
## iter  10 value 6.623970
## iter  20 value 6.214841
## iter  30 value 6.182968
## iter  40 value 6.172650
## iter  50 value 6.167699
## iter  60 value 6.162723
## iter  70 value 6.156685
## iter  80 value 6.155298
## iter  90 value 6.153807
## iter 100 value 6.152597
## iter 110 value 6.152041
## iter 120 value 6.151229
## final  value 6.151167 
## converged

summary(Cush.multinom)

## Call:
## multinom(formula = Type ~ ., data = Cf, maxit = 250)
## 
## Coefficients:
##   (Intercept) Tetrahydrocortisone Pregnanetriol
## b   -19.99566          -0.2450327      14.37357
## c   -28.83773           3.3561273      16.23923
## 
## Std. Errors:
##   (Intercept) Tetrahydrocortisone Pregnanetriol
## b    18.43773           0.6691037      13.70268
## c    18.71154           2.0981000      13.35303
## 
## Residual Deviance: 12.30233 
## AIC: 24.30233

First row is set of parameters for log odds of b versus a, second row is for c versus a.

以上代码可能会出现以下报错：

Error in multinom(Type ~ ., Cf, maxit = 250) : 
  unused arguments (Cf, maxit = 250)

由于在一个R package形式的R project中，已定义了一系列函数。当你不论怎么修改一个函数，它的结果总是不变，甚至输入参数都是原来的怎么改都变不了，那可能是你在其他文件中也定义了同名函数！当你调用函数的时候，执行的其实是之前定义的那个版本。可采用指定函数的形式调整：

library(nnet)
library(MASS)
names(Cushings)

## [1] "Tetrahydrocortisone" "Pregnanetriol"       "Type"

Cf<-data.frame(Type=factor(Cushings[1:21,3]),
               Tetrahydrocortisone=log(Cushings[1:21,2]),
               Pregnanetriol=log(Cushings[1:21,1]))

Cush.multinom<-nnet::multinom(Type ~ ., data=Cf,maxit=250)

## # weights:  12 (6 variable)
## initial  value 23.070858 
## iter  10 value 6.623970
## iter  20 value 6.214841
## iter  30 value 6.182968
## iter  40 value 6.172650
## iter  50 value 6.167699
## iter  60 value 6.162723
## iter  70 value 6.156685
## iter  80 value 6.155298
## iter  90 value 6.153807
## iter 100 value 6.152597
## iter 110 value 6.152041
## iter 120 value 6.151229
## final  value 6.151167 
## converged

summary(Cush.multinom)

## Call:
## nnet::multinom(formula = Type ~ ., data = Cf, maxit = 250)
## 
## Coefficients:
##   (Intercept) Tetrahydrocortisone Pregnanetriol
## b   -19.99566          -0.2450327      14.37357
## c   -28.83773           3.3561273      16.23923
## 
## Std. Errors:
##   (Intercept) Tetrahydrocortisone Pregnanetriol
## b    18.43773           0.6691037      13.70268
## c    18.71154           2.0981000      13.35303
## 
## Residual Deviance: 12.30233 
## AIC: 24.30233

args(nnet::multinom)

## function (formula, data, weights, subset, na.action, contrasts = NULL, 
##     Hess = FALSE, summ = 0, censored = FALSE, model = FALSE, 
##     ...) 
## NULL

Poisson regression

Example: multiple killing

A number of homicide incidents in Australia have involved multiple killings (a multiple killing is defined as an incident in which two or more people are murdered). According to available literature, there were 24 multiple killings by firearm between 1987 and 1996 (see table below). The 1996 figure only includes incidents before April 28. Is there any evidence of a trend in the rate of multiple killings?

multkill <- read.table('D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/multkill.txt',header = TRUE)
t <- c(rep(1,9),1/3)
multkill.glm<-glm(Incidents ~ Year+offset(log(t)),
                  family=poisson,data = multkill)
summary(multkill.glm)

## 
## Call:
## glm(formula = Incidents ~ Year + offset(log(t)), family = poisson, 
##     data = multkill)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 286.92697  158.00113   1.816   0.0694 .
## Year         -0.14366    0.07939  -1.810   0.0704 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 19.883  on 9  degrees of freedom
## Residual deviance: 16.451  on 8  degrees of freedom
## AIC: 43.074
## 
## Number of Fisher Scoring iterations: 5

anova(multkill.glm)

Poisson approximation to the binomial

Result from elementary statistics: if Y is a binomial random variable with parameters n and p, and if n → ∞ and p → 0 so that np → λ, then the distribution of Y converges to Poisson with mean λ.

Data collected on the number of nonmelanoma skin cancers among women in Minnesota-St Paul, Minnesota and Dallas-Fort Worth, Texas. Data frame skin contains variables Age (a factor, with levels representing eight different age groups, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84 and 85+), Town (a binary variable, 0 for Minnesota-St Paul and 1 for Dallas-Fort Worth), Cases (the number of cases) and Population (the population in the given town and age group).

In this case, Binomial regression and Poisson regression provide similar results.

skin <- read.table('D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/skin.txt',header = TRUE)
attach(skin)
options(contrasts=c("contr.treatment","contr.poly"))

# manipulate response 
response<-Cases/Population

# binomial regression 
skin.glm<-glm(response~Age+Town,family=binomial, weights=Population, maxit=20)
summary(skin.glm)

## 
## Call:
## glm(formula = response ~ Age + Town, family = binomial, weights = Population, 
##     maxit = 20)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.69364    0.44923 -26.030  < 2e-16 ***
## Age25-34      2.62915    0.46747   5.624 1.86e-08 ***
## Age35-44      3.84627    0.45467   8.459  < 2e-16 ***
## Age45-54      4.59538    0.45104  10.188  < 2e-16 ***
## Age55-64      5.08901    0.45031  11.301  < 2e-16 ***
## Age65-74      5.65031    0.44976  12.563  < 2e-16 ***
## Age75-84      6.20887    0.45756  13.570  < 2e-16 ***
## Age85+        6.18346    0.45783  13.506  < 2e-16 ***
## Town          0.85492    0.05969  14.322  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2330.4637  on 14  degrees of freedom
## Residual deviance:    5.1509  on  6  degrees of freedom
## AIC: 110.1
## 
## Number of Fisher Scoring iterations: 4

skin2.glm<-glm(Cases~Age+Town +offset(log(Population)), family=poisson)
summary(skin2.glm)

## 
## Call:
## glm(formula = Cases ~ Age + Town + offset(log(Population)), family = poisson)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.69207    0.44922 -26.028  < 2e-16 ***
## Age25-34      2.62899    0.46746   5.624 1.87e-08 ***
## Age35-44      3.84558    0.45466   8.458  < 2e-16 ***
## Age45-54      4.59381    0.45103  10.185  < 2e-16 ***
## Age55-64      5.08638    0.45030  11.296  < 2e-16 ***
## Age65-74      5.64569    0.44975  12.553  < 2e-16 ***
## Age75-84      6.20317    0.45751  13.558  < 2e-16 ***
## Age85+        6.17568    0.45774  13.492  < 2e-16 ***
## Town          0.85269    0.05962  14.302  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2327.2912  on 14  degrees of freedom
## Residual deviance:    5.2089  on  6  degrees of freedom
## AIC: 110.19
## 
## Number of Fisher Scoring iterations: 4

在R语言中，options(contrasts=c("contr.treatment","contr.poly"))是一条设置默认对比方式（contrast）的命令。

对比方式是在统计分析中用来比较不同水平（level）或组别之间差异的方法。R语言中提供了一些内置的对比方式，如”contr.treatment”和”contr.poly”。

“contr.treatment”（也称为Treatment contrasts）是默认的对比方式。它将第一个水平/组别作为基准（reference），将其他水平/组别与基准进行对比。例如，对于一个名为”category”的因子变量，如果有三个水平”A”、“B”和”C”，则”contr.treatment”对比方式将创建两个对比：A vs B 和 A vs C。
“contr.poly”（也称为Polynomial contrasts）是另一种常见的对比方式。它将水平/组别之间的差异建模为多项式函数。这种对比方式适用于按照某种顺序排列的水平/组别，可以用于测试随机效应或线性趋势等情况。

Change Poisson Regression Link Function

By default, in R the link function of Poisson Regression in log function, but we can change the link function. Fit a Poisson regression model. Link function here is the identity, no intercept term in the linear predictor.

Equipment dataset gives the numbers of failures of certain electronic devices which can be operated in each of two modes. Here Mode1 and Mode2 are the times spent operating in each mode for each piece of equipment.

equipment <- read.table('D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/equipment.txt',sep=',',header = TRUE)
attach(equipment)
options(contrasts=c("contr.treatment","contr.poly"))
failures.glm<-glm(failures~mode1+mode2-1,family=poisson(link=identity))
summary(failures.glm)

## 
## Call:
## glm(formula = failures ~ mode1 + mode2 - 1, family = poisson(link = identity))
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## mode1  0.16661    0.03486   4.779 1.76e-06 ***
## mode2  0.09040    0.06301   1.435    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance:    Inf  on 9  degrees of freedom
## Residual deviance: 7.5716  on 7  degrees of freedom
## AIC: 54.628
## 
## Number of Fisher Scoring iterations: 5

5. Smoothing

The loess algorithm

Study of the factors affecting patterns of insulin dependent diabetes mellitus in children. Response variable of interest (y) – log(C.peptide) (a measure of how much insulin is being produced in the body). Predictors: age and base.deficit (the latter a measure of acidity). Model mean response as a function of age (x). To draw a scatterplot of log(C.peptide) against age and superimpose the loess smooth:

diabetes <- read.table('D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/diabetes.txt',header = TRUE)

plot(as.numeric(diabetes$Age),log(as.numeric(diabetes$C.Peptide)))
lines(loess.smooth(as.numeric(diabetes$Age),log(as.numeric(diabetes$C.Peptide)),span=2/3))
title("Loess Smooth")

loess.smooth returns a list with two components, x and y.

x contains a set of predictor values (by default 50 equally spaced values between the minimum and maximum of observed predictors).

y is the vector of smoothed values given by loess.smooth.

If a list with components x and y (numeric vectors) is passed to an R plotting function like lines, then the x component will be plotted on the x-axis and the y component on the y-axis.

Default specification of k in loess.smooth: a span of 2/3 (by which we mean k = 2/3n) which is often too large. To set a smaller k, say a span of 0.2, we pass span=0.2 as an argument.

Kernel smoothing

attach(diabetes)

## The following object is masked from skin:
## 
##     Age

As for loess.smooth, output of the function ksmooth is a list with components x and y (numeric vectors, by default x is the set of observed predictors and y gives corresponding values of the kernel smooth).

The argument bandwidth controls the degree of smoothing (default is usually not sensible) and kernel allows specification of the kernel (options box, triangle, parzen and normal, see help for details).

plot(Age,log(C.Peptide))
lines(ksmooth(Age,log(C.Peptide),bandwidth=5.0,kernel="normal"))
title("Kernel smooth")

6. Splines

Cubic smoothing splines

R function smooth.spline for fitting cubic smoothing splines: allows specification of the smoothing parameter through the degrees of freedom of the smoother. Intuitive way of specifying the degree of smoothing (if this were a linear model like a polynomial regression or a regression spline how many parameters would I need?)

plot(Age,log(C.Peptide))
lines(smooth.spline(Age,log(C.Peptide),df=5))
lines(smooth.spline(Age,log(C.Peptide),df=10),lty=2)
lines(smooth.spline(Age,log(C.Peptide),df=20),lty=3)
title("Cubic smoothing spline fits\n df=5 (solid), df=10 (dots), df=20 (dashes)")

Confidence bands for smoothers

Result returned by R command smooth.spline is a list. Component lev gives the diagonal elements of S. Component yin is the vector of the original responses (if the predictor values are all distinct). Component y is the vector of smoothed responses. R commands to compute and plot confidence bands for the diabetes data:

fit<-smooth.spline(Age,log(C.Peptide))
res<-(fit$yin-fit$y)/(1-fit$lev)
sigma<-sqrt(var(res))
upper<-fit$y+1.96*sigma*sqrt(fit$lev)
lower<-fit$y-1.96*sigma*sqrt(fit$lev)
plot(Age,log(C.Peptide))
lines(fit$x,upper,lty=2)
lines(fit$x,lower,lty=2)
lines(fit)
title("Spline smooth with 95 percent pointwise confidence bands")

7. Additive models

Fit additive model

Additive models are fitted in R with the gam function. (type library(mgcv) before use).

Simple usage of the command (there are many additional optional arguments):

gam(formula, data)

where the data argument is a data frame and the formula argument specifies the response and predictors.

Example of a formula for gam:

Specifies log(C.peptide) as the response, and tells gam to fit a linear term for base.deficit and to use a cubic smoothing spline for age. The value returned by the gam function is a list which can be passed to plotting and summary functions (similar to the output of lm).

library(mgcv)

## 载入需要的程辑包：nlme

## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.

## 
## 载入程辑包：'mgcv'

## The following object is masked from 'package:nnet':
## 
##     multinom

diabetes.gam <- gam(log(C.Peptide) ~s(Age)+Base.Deficit)
summary(diabetes.gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(C.Peptide) ~ s(Age) + Base.Deficit
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.612408   0.027826  57.946  < 2e-16 ***
## Base.Deficit 0.008218   0.002630   3.125  0.00337 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value   
## s(Age) 2.469   3.08 6.278  0.0013 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.467   Deviance explained = 51.1%
## GCV = 0.015119  Scale est. = 0.013548  n = 43

To fit an additive model in R with log(C.peptide) as the response and age and base.deficit as predictors (and general smooth terms for both predictors): The graph over the page shows the smooth terms for age and base.deficit, the fitted additive surface, and the fit from a two-dimensional smoother.

diabetes.gam<-gam(log(C.Peptide) ~ s(Age) +s(Base.Deficit), data=diabetes)
plot(diabetes.gam)

Plotting a two-dimensional surface

The list grid contains values for age and base.deficit defining a two-dimensional grid.

expand.grid(grid) creates a data frame with columns age and base.deficit where every combination of the predictor values supplied in grid are given. persp is the three dimensional plotting function: supply values which define the grid and the predictions at those points.

grid<-list(Age=seq(from=0.9,to=15.0,length=50),Base.Deficit=seq(from=-29.0,to=-1.0,length=50))

diabetes.pr<-predict.gam(diabetes.gam,newdata=expand.grid(grid))

diabetes.pr<-matrix(diabetes.pr,nrow=50,ncol=50)

persp(grid$Age,grid$Base.Deficit, diabetes.pr, xlab="age",ylab="base.deficit",zlab="log(C.peptide)", theta=-45,phi=15,d=2.0,tick="detailed")

Example: rock data set

Rock data set from Venables and Ripley (1999). Twelve core samples from petroleum reservoirs are sampled by four cross-sections. For each cross-section, permeability (perm), total area of pores (area), total perimeter of pores (peri) and shape (shape) are measured.

rock <- read.table('D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/rock.dat',header = TRUE)

rock.gam<-gam(log(perm) ~ s(peri)+s(shape),data=rock)
par(mfrow=c(1,1))

plot(rock.gam)

grid<-list(peri=seq(from=309,to=4864,length=50),shape=seq(from=0.1,to=0.46,length=50))
rock.pr<-predict(rock.gam,newdata=expand.grid(grid))
rock.pr<-matrix(rock.pr,nrow=50,ncol=50)
persp(grid$peri,grid$shape,rock.pr,xlab="peri",ylab="shape",zlab="log(perm)",theta=-45)

Fitting an additive model with three predictors: for the rock data, model log(perm) as a function of area, peri and shape.

With three predictors, we cannot plot the full fitted surface, but with an additive model we can still understand what the mean response looks like from the univariate smooth terms. Fitting the model and plotting smooth terms:

rock.gam2<-gam(log(perm) ~ s(area)+s(peri)+s(shape), data=rock, bf.maxit=20)
plot(rock.gam2,se=T)

The argument bf.maxit controls the number of iterations used in the iterative process used to fit the algorithm. The argument se=T when supplied to plot results in confidence bands being plotted for the smooth univariate functions.

Interactions in additive models

Model for the rock data with log(perm) as the response, and peri, shape, and peri*shape as predictors.

rock.gam3<-gam(log(perm) ~ s(peri)+s(shape)+s(I(peri*shape)),data=rock,bf.maxit=50)
grid<-list(peri=seq(from=309,to=4865, length=50),shape=seq(from=0.1,to=0.46, length=50))
rock.pr3<-predict(rock.gam3,newdata=expand.grid(grid))
rock.pr3<-matrix(rock.pr3, ncol=50, nrow=50)
persp(grid$peri,grid$shape,rock.pr3,xlab="peri",ylab="shape",zlab="log(perm)")
title("Additive model with product interaction term")

Important code in Assignment1

Question2 Parameter Estimation

Use your answer to (b) to compute b for the data in Question 1.

As we have already known the formula of paramater of linear model, we can compute them by calculating the formula.

data<-read.csv("D:/Work/Final/HD/MATH 3821/5.同步拓展课/data/sample.csv")

I <- diag(5)
y <- data[,1]
x <- data[,2:5]

x_mat <- cbind(1, as.matrix(x))
y_mat <- matrix(y, ncol = 1)

parameter_q2 <- solve(t(x_mat)%*%x_mat)%*%t(x_mat)%*%y_mat
parameter_q2

##           [,1]
##    -10.9632658
## x1   0.5723703
## x2   2.7589682
## x3  -0.2141964
## x4   3.5506248

代码延申：假设需要simulate linear model，除了计算出parameter之外，还需要simulate ε。How to simulate ε？

y <- t(parameter_q2)%*%t(x_mat)+rnorm(nrow(x_mat))
y

##         [,1]     [,2]     [,3]    [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] 14.7113 8.167012 7.382597 14.9929 18.72117 17.33471 11.67026 13.14129
##          [,9]    [,10]    [,11]    [,12]    [,13]     [,14]    [,15]    [,16]
## [1,] 5.381216 17.53651 9.718616 14.12494 12.20943 -1.659194 10.13878 7.249925
##         [,17]    [,18]    [,19]   [,20]    [,21]    [,22]    [,23]    [,24]
## [1,] 18.73219 17.75996 14.49076 15.0524 13.02746 9.809845 8.729658 6.401057
##         [,25]    [,26]    [,27]   [,28]    [,29]   [,30]
## [1,] 1.964569 9.515478 12.85683 18.3126 6.770664 6.74114

Question5 PRESS and RSS

Let’s get all the parameters from Question3 and Question 4

# Parameter in Question3
model <- lm(y~x1+x2+x3+x4,data=data)
summary(model)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8225 -1.4870 -0.3770  0.9843  4.5725 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.9633     2.3906  -4.586 0.000109 ***
## x1            0.5724     0.5139   1.114 0.275971    
## x2            2.7590     1.3362   2.065 0.049456 *  
## x3           -0.2142     1.3218  -0.162 0.872573    
## x4            3.5506     0.3758   9.448 9.95e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.117 on 25 degrees of freedom
## Multiple R-squared:  0.8669, Adjusted R-squared:  0.8456 
## F-statistic:  40.7 on 4 and 25 DF,  p-value: 1.339e-10

drop1(model)

##遍历直到筛选最优模型
best_model <- step(model)

## Start:  AIC=49.53
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1      0.12 112.14 47.557
## - x1    1      5.56 117.58 48.979
## <none>              112.02 49.526
## - x2    1     19.11 131.13 52.250
## - x4    1    400.02 512.04 93.116
## 
## Step:  AIC=47.56
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x1    1      5.63 117.77 47.026
## <none>              112.14 47.557
## - x2    1    225.28 337.43 78.605
## - x4    1    403.36 515.51 91.318
## 
## Step:  AIC=47.03
## y ~ x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## <none>              117.77 47.026
## - x2    1    219.66 337.43 76.605
## - x4    1    412.76 530.53 90.180

model_without_x3_x1 <- lm(y~x2+x4,data=data)
summary(model_without_x3_x1)

## 
## Call:
## lm(formula = y ~ x2 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1072 -1.3374 -0.4368  1.1231  5.0015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.4712     1.9254  -4.919 3.79e-05 ***
## x2            2.4858     0.3503   7.096 1.25e-07 ***
## x4            3.4284     0.3524   9.728 2.56e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.088 on 27 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.8497 
## F-statistic: 82.96 on 2 and 27 DF,  p-value: 2.958e-12

result_Q3 <- summary(model_without_x3_x1)$coefficients[,"Estimate"]

result_Q3 <- append(result_Q3,0,after = 1)
result_Q3 <- append(result_Q3,0,after = 3)
parameter_q3 <- matrix(result_Q3,ncol = 1)
parameter_q3

##           [,1]
## [1,] -9.471201
## [2,]  0.000000
## [3,]  2.485756
## [4,]  0.000000
## [5,]  3.428390

lambda_q4 <- 1/5
I <- diag(5)
y <- data[,1]
x <- data[,2:5]

x_mat <- cbind(1, as.matrix(x))
y_mat <- matrix(y, ncol = 1)

beta_result <- solve(t(x_mat)%*%x_mat+lambda_q4*I)%*%t(x_mat)%*%y_mat

parameter_q4 <- beta_result
parameter_q4

##          [,1]
##    -8.6490823
## x1  0.2878198
## x2  2.4064777
## x3 -0.1385936
## x4  3.3884281

Calculate PRESS (Predicted Residual Sum of Squares) of every model

PRESS_1<-0
Y <- as.matrix(data[,1])
X1 <- as.matrix(data[,2:5])
X2 <- matrix(rep(1,length(Y)),ncol = 1)
X <- cbind(X2,X1)
for(i in 1:length(Y)){
  Y.i<-Y[-i]
  X.i<-X[-i,]
  beta.i<-solve(t(X.i)%*%X.i)%*%t(X.i)%*%Y.i
  PRESS_1<-PRESS_1+(Y[i]-X[i,]%*%beta.i)^2
}
PRESS_1

##          [,1]
## [1,] 163.3144

PRESS_2<-0
Y <- as.matrix(data[,1])
X1 <- as.matrix(data[,c(3,5)])
X2 <- matrix(rep(1,length(Y)),ncol = 1)
X <- cbind(X2,X1)
for(i in 1:length(Y)){
  Y.i<-Y[-i]
  X.i<-X[-i,]
  beta.i<-solve(t(X.i)%*%X.i)%*%t(X.i)%*%Y.i
  PRESS_2<-PRESS_2+(Y[i]-X[i,]%*%beta.i)^2
}
PRESS_2

##          [,1]
## [1,] 150.1624

PRESS_3<-0
Y <- as.matrix(data[,1])
X1 <- as.matrix(data[,2:5])
X2 <- matrix(rep(1,length(Y)),ncol = 1)
X <- cbind(X2,X1)
for(i in 1:length(Y)){
  Y.i<-Y[-i]
  X.i<-X[-i,]
  I <- diag(5)
  lambda <- 1/5
  beta.i<-solve(t(X.i)%*%X.i+lambda*I)%*%t(X.i)%*%Y.i
  PRESS_3<-PRESS_3+(Y[i]-X[i,]%*%beta.i)^2
}
PRESS_3

##          [,1]
## [1,] 162.8785

Calculate RSS (Residual Sum of Squares) of every model

Y <- as.matrix(data[,1])
X1 <- as.matrix(data[,2:5])
X2 <- matrix(rep(1,length(Y)),ncol = 1)
X <- cbind(X2,X1)

residual_q2 <- sum((Y-X%*%parameter_q2)^2)
residual_q3 <- sum((Y-X%*%parameter_q3)^2)
residual_q4 <- sum((Y-X%*%parameter_q4)^2)
residual_q2

## [1] 112.0248

residual_q3

## [1] 117.7686

residual_q4

## [1] 116.326

Practices in tut

1. Practice1-tute-2-Question2

Fit a logistic regression model with Signs as the response and Age as the predictor. Does the Wald test indicate a significant effect for Age?

z value for age is 1.315159, compare this to the upper 2.5 percentage point of a standard normal distribution (1.96) to determine conclusion of the Wald test at the 5% level. Accept H0 that there is no effect for age here.

age<-c(39,42,20,37,20,21,41,52)
signs<-c(0,1,0,1,1,0,1,1)
blackout.glm<-glm(signs ~ age,family=binomial)
summary(blackout.glm)

## 
## Call:
## glm(formula = signs ~ age, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92086    2.64655  -1.104    0.270
## age          0.10569    0.08037   1.315    0.188
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10.5850  on 7  degrees of freedom
## Residual deviance:  8.4345  on 6  degrees of freedom
## AIC: 12.435
## 
## Number of Fisher Scoring iterations: 4

Determine if there is a significant effect for Age by considering the reduction in the deviance when the covariate Age is added to a model including just an intercept (to determine the upper 5 percentage point of a chi-squared distribution with n degrees of freedom in R, just type qchisq(0.95,n)). Does the conclusion of the test agree with the conclusion you obtained in the first part?

Again age effect does not seem significant (conclusion agrees with that of the Wald test).

qchisq(0.95,1)

## [1] 3.841459

10.58501-8.434522

## [1] 2.150488

2. Practice2-tute-3-Question2

Consider the titanic data set. We considered modelling the variable Survived as a function of Age. After fitting a logistic regression model with Survived as the response and Age as predictor (you will need to remove missing values by including the na.action=na.exclude argument) extract the Pearson and deviance residuals.

(Note: after extracting the Pearson residuals using the residuals function, you will need to remove missing values in the vector of Pearson residuals. If pr is the vector of Pearson residuals, pr[!is.na(pr)] is the vector of non-missing values in pr).

Plot the Pearson residuals against the fitted values and superimpose a loess smooth on this plot. To do this, type lines(scatter.smooth(titanic.glm$fit,pr)). Repeat this procedure for the deviance residuals. Do you think a logistic regression model is appropriate?

titanic=read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/titanic.txt", header=T)
attach(titanic)

## The following object is masked from diabetes:
## 
##     Age

## The following object is masked from skin:
## 
##     Age

titanic.glm<-glm(Survived ~ Age, family=binomial, data=titanic, na.action=na.exclude)
pr<-residuals(titanic.glm,type="pearson")
pr<-pr[!is.na(pr)]
plot(titanic.glm$fit,pr)
lines(loess.smooth(titanic.glm$fit,pr))

dr<-residuals(titanic.glm,type="deviance")
dr<-dr[!is.na(dr)]
plot(titanic.glm$fit,dr)
lines(loess.smooth(titanic.glm$fit,dr))

3. Practice3-tute-4-Question3

The recumbant cows data set discussed in lectures is contained in the file downer.txt. In lectures we discussed modelling the binary variable Outcome (1 if a cow survives, 0 if it dies) as a function of one of the predictors AST (a blood measurement). Following the discussion in your lecture notes, determine an appropriate model for outcome in terms of the predictor variable CK (another blood measurement) and fit the model.

downer=read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/downer.txt", header=T)
attach(downer)
plot(density(log(CK[Outcome==1]),na.rm=T),type="l",xlab="CK",ylab="Density",xlim=c(1,13))
lines(density(log(CK[Outcome==0]),na.rm=T),lty=2)

cows.glm<-glm(Outcome ~ log(CK)+I(log(CK)^2), data=downer, family=binomial,na.action=na.exclude)
anova(cows.glm)

4. Practice4-tute-5-Question3

Download and attach the air data frame, typing attach(air) will allow you to refer to the column vectors in the data frame directly (there are four columns in this data frame, ozone, radiation, temperature and wind). Do a scatter plot of ozone against radiation and superimpose loess, kernel and cubic smoothing spline smooths on the scatter plot (you can get different lines for the different smooths by using the lty argument of lines: passing lty=0 gives a solid line, lty=1 gives dots, lty=2 gives dashes, etc.). You may need to change the default level of smoothing for the kernel and loess smooths, although for the cubic smoothing spline the default choice of smoothing parameter is usually sensible.

air <- read.table("D:/Work/Final/HD/MATH 3821/MATH3821/data/Data Sets - unzipped-20230607/air.txt", header=TRUE)
attach(air)

plot(radiation, ozone)

lines(loess.smooth(radiation, ozone,span = 0.2),lty = 1) # loess smoother

lines(ksmooth(radiation, ozone,bandwidth=50.0,kernel="normal"),lty=2) # Kenerl smoother

lines(smooth.spline(radiation, ozone,df=10),lty=3) # cubic smoothing spline

Probability Distribution Functions in R

[dpqr]distribution_abbreviation 其中第一个字母表示其所指分布的某一方面： d = 密度函数（density） p = 分布函数（distribution function） q = 分位数函数（quantile function） r = 生成随机数（随机偏差）

plot(pretty(c(-3, 3), 100), pnorm(pretty(c(-3, 3), 100), 0, 1))

# 绘制标准正态分布下的概率分布曲线。

plot(pretty(c(-5, 5), 100), dnorm(pretty(c(-5, 5), 100), 0, 1)) # 绘制标准正态分布下的概率密度曲线。

dnorm(0, mean = 0 , sd = 1) # 求算x=0时概率。

## [1] 0.3989423

plot(pretty(c(-3, 3), 100), pnorm(pretty(c(-3, 3), 100), 0, 1))

# 绘制标准正态分布下的概率分布曲线。

pnorm(0.5) # 表示x等于0.5，小于等于0.5的对于的概率。

## [1] 0.6914625

plot(pretty(c(0, 1), 90), qnorm(pretty(c(0, 1), 90), 0, 1))

## Warning in qnorm(pretty(c(0, 1), 90), 0, 1): 产生了NaNs

# 绘制标准正态分布下的概率分布曲线。

qnorm(0.6, mean = 0, sd = 1) # 均值为50，标准差为10的正态分布的0.6分位点值。

## [1] 0.2533471

## [1] 0.2533471

R script summary

Letizia

2023-07-01