Lab 1: Introduction to R and Multiple linear regression analysis

Objectives: Students will be able to use the R language to analyse data using multiple linear regression:
1. perform descriptive statistics
2. transform qualitative independent variables into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression parameters
5. interpret the results

1. Introduction to R

1. Installing R and RStudio

R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.

To work with R, you need to download the program and install it. I recommend that you also install RStudio. RStudio is a separate program that makes R easier to use.

Windows

Install R

Download the latest R installer (.exe) for Windows. Install the downloaded file as you would any other Windows app.

Install RStudio

Now that R is installed, you need to download and install RStudio. First download the installer for Windows. Run the installer (.exe file) and follow the instructions.

Mac

Install R

First download the latest release (“R-version.pkg”) of R. Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio.

Install RStudio

First download the version for Mac. After downloading, double-click the file to open it, and then drag and drop it into your Applications folder.

2. Basic data types

R works with numerous data types. Some of the most basic types to get started are:

  • Decimal values like 4.5 are called numerics.
  • Whole numbers like 4 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) (e.g. “world”) values are called characters.
Numeric and integer values
# create a variable "a" that is a vector of one number.
a <- 7
a = 7  # "=" also assigns, but "<-" is the preferred assignment operator
a
## [1] 7
# create vectors that consists of multiple numbers.
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
# (caution: a variable named "c" masks the built-in c() function)
c <- 1:10
c
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq(from, to, by) function to create a sequence of numbers
d <- seq(1,10,0.5)
d
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.

rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Character values
x <- "Monday"
class(x)
## [1] "character"
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)
## [1] "character"
Factors

A factor is a nominal (categorical) variable with a set of known possible values called levels. They can be created using the as.factor function.

yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
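
Note that the levels above are sorted alphabetically, which is as.factor's default. If you want the levels in a specific order (say, weekday order), you can pass them explicitly to factor(); a small sketch:

# specify the level order explicitly
yy2 <- factor(y, levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
levels(yy2)
## [1] "Mon" "Tue" "Wed" "Thu" "Fri" "Sat" "Sun"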

3. Data structure

Matrix
mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
List

A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.

lst <- list(d,yy,mat2)
lst
## [[1]]
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
## 
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
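
Individual list elements are extracted with double brackets, or with $ if the elements were given names when the list was created. For example:

lst[[2]]   # the second element (the factor of weekdays)
lst2 <- list(seq = d, days = yy, mat = mat2)
lst2$days  # access a named element by name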
Data frame

The data.frame is commonly used for statistical data analysis in R. It is a special type of list that requires that all elements (variables) have the same length.

# Loading
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth, n = 6)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Examine data structure of the dataframe
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

4. Read files

To read a file into R, you need to know the full path (directory) name and the name of the file itself.

  1. Save .xlsx data as .csv file
  2. Check the directory of the data file (right click -> Properties -> Location)
  3. Use read.csv from base R (Slowest method, but works fine for smaller datasets)
#Your directory will be different from mine! 
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
# Read data file and keep it in dt
dt <- read.csv("bank_data.csv", header = TRUE)
head(dt) # show the first six rows of the data
##   age        job marital education balance housing loan duration campaign pdays
## 1  59     admin. married secondary    2343     yes   no     1042        1    -1
## 2  56     admin. married secondary      45      no   no     1467        1    -1
## 3  41 technician married secondary    1270     yes   no     1389        1    -1
## 4  55   services married secondary    2476     yes   no      579        1    -1
## 5  54     admin. married  tertiary     184      no   no      673        2    -1
## 6  42 management  single  tertiary       0     yes  yes      562        2    -1
str(dt)  # show data structure 
## 'data.frame':    1000 obs. of  10 variables:
##  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
##  $ job      : chr  "admin." "admin." "technician" "services" ...
##  $ marital  : chr  "married" "married" "married" "married" ...
##  $ education: chr  "secondary" "secondary" "secondary" "secondary" ...
##  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
##  $ housing  : chr  "yes" "no" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "no" "no" ...
##  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
##  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...

Set “chr” variables as “factor”

dt$job <- as.factor(dt$job)
dt$marital <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing <- as.factor(dt$housing)
dt$loan <- as.factor(dt$loan)
#check data structure again
str(dt)
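
Converting columns one at a time works, but with many character columns it gets tedious. A compact alternative (a sketch, assuming every "chr" column should become a factor; here it is a no-op since we just converted them all):

chr_cols <- sapply(dt, is.character)             # which columns are character?
dt[chr_cols] <- lapply(dt[chr_cols], as.factor)  # convert them all at once
str(dt)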

5. Numerical presentation for qualitative data or categorical data

  • summary() function
  • frequency table with table() function
  • contingency table (Two-Way Frequency Tables)
  • percentage or proportion with prop.table() function
# job, marital, education, housing, loan
summary(dt)
#frequency table for one categorical variable
table(dt$job)
table(dt$housing)
prop.table(table(dt$job))
prop.table(table(dt$job))*100
#Two-Way Frequency Tables
table(dt$job,dt$education)
tb <- table(dt$loan,dt$marital)
# Add margin totals to your table. 
addmargins(tb)
#Calculate percentage
prop.table(tb)*100
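
By default, prop.table() divides each cell by the grand total. To get row or column percentages instead, use the margin argument (1 = rows, 2 = columns):

prop.table(tb, margin = 1)*100  # each row sums to 100
prop.table(tb, margin = 2)*100  # each column sums to 100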

6. Numerical presentation for quantitative data

  • summary() function
  • max() and min() functions
  • mean() function
  • median() function
  • There is no built-in function to compute the statistical mode in R (the base mode() function returns an object's storage mode instead). To compute the mode, use the getmode() function defined below.
#define function to calculate mode
getmode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}
  • range() function
  • quantile(, probs = ) function, where probs can be 0.25, 0.5, 0.75 for the first, second, and third quartiles
  • IQR() function
  • sd() and var() function
#age, balance, duration, campaign, pdays
summary(dt$age)
min(dt$age)
range(dt$age)
mean(dt$age)
getmode(dt$age)
IQR(dt$age)
sd(dt$age)
quantile(dt$age,probs = 0.25)
quantile(dt$age,probs = 0.50)
quantile(dt$age,probs = 0.75)
quantile(dt$age,probs = c(0.25,0.50,0.75))
cv_age <- sd(dt$age)/mean(dt$age)
cv_age
#cv of age by loan (yes,no)
cv1 <- sd(dt$age[dt$loan=="yes"])/mean(dt$age[dt$loan=="yes"])
cv1
cv2 <- sd(dt$age[dt$loan=="no"])/mean(dt$age[dt$loan=="no"])
cv2
  • aggregate() function is used to get the summary statistics of the data by group.
# aggregate: compute summary statistics of data subsets
aggregate(balance~loan, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=mean)
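
aggregate() also accepts user-defined functions, so the CV-by-group calculation above can be done without manual subsetting. A sketch using an anonymous function:

# CV of age by loan group in one call
aggregate(age ~ loan, data = dt, FUN = function(v) sd(v)/mean(v))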

7. Graphical presentation for qualitative data or categorical data

In the case that you have a factor variable, you can display a frequency table or a bar graph.

You can use the good ol’ summary() on a factor variable to display a frequency table.

Bar graph

freq <- table(dt$job)
freq
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##            96           226            45            25           221 
##       retired self-employed      services       student    technician 
##            44            36            82             8           183 
##    unemployed       unknown 
##            31             3
barplot(freq)

To obtain a list of a factor variable’s levels, use levels().

maritallevel <- levels(dt$marital)
maritallevel
## [1] "divorced" "married"  "single"

Stacked Bar Plot with Colors and Legend

counts <- table(dt$marital, dt$loan)
barplot(counts,
        legend = rownames(counts),xlab="loan")

Grouped Bar Plot

counts <- table(dt$marital, dt$loan)
barplot(counts,col=c("blue","red","green"),
        legend = rownames(counts), beside=TRUE,xlab="loan")

Pie chart

counts <- table(dt$marital)
pie(counts, main = "Marital pie chart")

8. Graphical presentation for quantitative data

Histogram

  • To display a histogram of a vector, use hist()
hist(dt$age)

Sometimes, you want to change the title of the plot, and also the labels of the horizontal and vertical axes. To do that, set the parameters main, xlab and ylab:

hist(dt$age, main = "Histogram of age", xlab = "age", ylab = "counts")

To change the number of bins, set the breaks parameter. However, R treats this as a suggestion and will pick a sensible number of bins close to the number that you specify. For example, the following code displays a histogram with the number of bins close to 20:

hist(dt$age, breaks = 20, 
     main = "Histogram of age", 
     xlab = "age", 
     ylab = "counts")

Stem and leaf plot

  • To display a stem and leaf plot of a vector, use stem().
stem(dt$duration)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##    0 | 18890113333334455566666667788899
##    2 | 1333333344555577888901111244456667777788888889999
##    4 | 00000011112222222333333334444445555566666666777777777888888888888899+122
##    6 | 00000000001111111122222222222223333333333333334444444444444444445555+160
##    8 | 00000000001111111111222223333333444444444455555555555556666666666677+103
##   10 | 00000000011111112222222233333333333344444444455555566666666667788888+54
##   12 | 00000111111112233344444444555667778899900011122333344444445556666677
##   14 | 0001122344455556777899900133445556667788
##   16 | 012245667899022234899
##   18 | 1346778885788
##   20 | 2239
##   22 | 7
##   24 | 62
##   26 | 257
##   28 | 
##   30 | 98
##   32 | 5
##   34 | 
##   36 | 
##   38 | 8
#compare with its histogram 

Boxplot

  • To display a boxplot of a vector, use boxplot().
boxplot(dt$age,main="Boxplot of age")

  • To display the boxplot horizontally, set the parameter horizontal = TRUE
boxplot(dt$age, horizontal = TRUE)

You can display boxplots by a grouping variable.

boxplot(dt$age~dt$loan,xlab="loan",ylab="age")
boxplot(dt$age~dt$job,xlab="job",ylab="age")

Scatterplot

Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.

Suppose that you have two variables: var1 and var2. To display a scatter plot between var1 and var2, use plot(x = var1, y = var2).

plot(x = dt$age, y = dt$balance,xlab="age",ylab="balance")
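
Passing a factor to the col parameter colours the points by group, which helps reveal group structure in a scatter plot. A sketch, colouring by loan status:

plot(x = dt$age, y = dt$balance, col = dt$loan, xlab = "age", ylab = "balance")
legend("topright", legend = levels(dt$loan), col = 1:2, pch = 1)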

9. Perform the following numerical data presentation for the bank_data.csv dataset:

    1. Create a frequency table of education, with proportions
    2. Create a two-way frequency table of job by loan category, with percentages
    3. Compute summary statistics of balance
    4. Compute Q1, Q2, and Q3 of balance, and also the IQR
    5. Compute the CV of balance by type of housing
    6. Compute summary statistics of balance by housing
    7. Compute the sd of balance by housing and loan

Perform the following graphical data presentation for the bank_data.csv dataset:

    1. How would you present the education data?
    2. If you would like to examine the number of people with different education and marital status, which graph would you use to present such information?
    3. How would you present the balance data?
    4. If you would like to compare the balance between people with different education, which graph would you use to present such information?
    5. If you would like to examine the relationship between age and duration, which graph would you use to present such information?

2. Multiple linear regression analysis

  1. Import data into R (the file must be in .csv form).
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("SalesPrice.csv",header=TRUE) 
data
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location   #qualitative independent variable with k=3 groups 
  2. Explore the data using descriptive statistics
summary(y)
hist(y,col    = "darkorange",
     border = "dodgerblue")
shapiro.test(y)
summary(x1)
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) #Compute the r between two quantitative variables
boxplot(y~x2,xlab="Location",ylab="Sale price")

NOTE: It is easier to work with graphs when using ggplot2

data_df = data.frame(data) # set data as dataframe type
library(ggplot2) # you might need to install package "ggplot2" first!
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
  geom_point(aes(color=Location))

ggplot(data_df,aes(y=SalesPrice,x=Location))+
  geom_boxplot(aes(color=Location))

  3. Since Location is a qualitative independent variable with k=3 groups,
     we need to transform Location into dummy variables.
     We create k-1 = 3-1 = 2 dummy variables, named x2.dummy and x3.dummy.
#create x2.dummy
x2.dummy= c()   #create null vector
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Street") x2.dummy[i] = 1
  else x2.dummy[i] = 0
}
x2.dummy
##  [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Mall") x3.dummy[i] = 1
  else x3.dummy[i] = 0
}
x3.dummy
##  [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
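
The for loops above spell out the logic, but the same dummy variables are usually created in a single vectorized step. An equivalent sketch using ifelse():

x2.dummy <- ifelse(data$Location == "Street", 1, 0)
x3.dummy <- ifelse(data$Location == "Mall", 1, 0)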
  4. Select independent variables using variable selection methods
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)

#The R function vif() [car package] can be used to detect multicollinearity in a regression model:
car::vif(full)

# Forward selection
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
fw.fit

# Backward elimination
be.fit = step(full,direction="backward")
be.fit

# Stepwise regression
sw.fit = step(full, direction="both")
sw.fit
  5. Obtain the fitted equation and CIs of the betas
summary(sw.fit)
anova(sw.fit)
confint.lm(sw.fit)
  6. Plot the fitted values:
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat 

plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))

  7. Analyse the residuals:
par(mfrow=c(2,2))  #set plot layout as 2 row 2 column
plot(sw.fit)

#get the residuals
res = resid(sw.fit)
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

shapiro.test(res)
# Performs the Durbin-Watson test for autocorrelation of disturbances.
lmtest::dwtest(sw.fit)
# Test the constant variance assumption: Breusch-Pagan Test
lmtest::bptest(sw.fit)

Lab 2: Binary logistic regression

Task 1: Example I: Study hours and Exam

# Load the packages
library(aod)  #for 'Wald test'
library(MASS)
library(rms)  #for 'lrm'
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) #to get pseudo R2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("StudyHours.csv",header=TRUE) 
head(data)
##   Obs Hours Pass
## 1   1  0.50    0
## 2   2  0.75    0
## 3   3  1.00    0
## 4   4  1.25    0
## 5   5  1.50    0
## 6   6  1.75    0
y = data$Pass
x = data$Hours

# Generating the frequency table 
table(y)
## y
##  0  1 
## 10 10
# 1. Building the model
fit.logist = glm(y ~x,family=binomial(link='logit'))
fit.logist
## 
## Call:  glm(formula = y ~ x, family = binomial(link = "logit"))
## 
## Coefficients:
## (Intercept)            x  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:       27.73 
## Residual Deviance: 16.06     AIC: 20.06
anova(fit.logist)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev
## NULL                    19     27.726
## x     1   11.666        18     16.060
summary(fit.logist)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -4.0777     1.7610  -2.316   0.0206 *
## x             1.5046     0.6287   2.393   0.0167 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.060  on 18  degrees of freedom
## AIC: 20.06
## 
## Number of Fisher Scoring iterations: 5
# 2. Perform overall fit test
mod <- lrm(y ~x, data = data)
mod
## Logistic Regression Model
## 
## lrm(formula = y ~ x, data = data)
## 
##                       Model Likelihood    Discrimination    Rank Discrim.    
##                             Ratio Test           Indexes          Indexes    
## Obs            20    LR chi2     11.67    R2       0.589    C       0.895    
##  0             10    d.f.            1    R2(1,20) 0.413    Dxy     0.790    
##  1             10    Pr(> chi2) 0.0006    R2(1,15) 0.509    gamma   0.798    
## max |deriv| 1e-07                         Brier    0.137    tau-a   0.416    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.0777 1.7610 -2.32  0.0206  
## x          1.5046 0.6287  2.39  0.0167
# 3. Perform Wald test:
#wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 1)
wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 5.7, df = 1, P(> X2) = 0.017
# 4. Perform LR test
null <- glm(y ~ 1, data = data,family="binomial")
summary(null)
## 
## Call:
## glm(formula = y ~ 1, family = "binomial", data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.930e-17  4.472e-01       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 27.726  on 19  degrees of freedom
## AIC: 29.726
## 
## Number of Fisher Scoring iterations: 2
full <- glm(y ~ x, data = data,family="binomial")
summary(full)
## 
## Call:
## glm(formula = y ~ x, family = "binomial", data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -4.0777     1.7610  -2.316   0.0206 *
## x             1.5046     0.6287   2.393   0.0167 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.060  on 18  degrees of freedom
## AIC: 20.06
## 
## Number of Fisher Scoring iterations: 5
lrtest(null, full)
## 
## Model 1: y ~ 1
## Model 2: y ~ x
## 
##   L.R. Chisq         d.f.            P 
## 1.166613e+01 1.000000e+00 6.364826e-04
# 5. get LL and pseudo R2
logLik(full)
## 'log Lik.' -8.029878 (df=2)
#To easily get a McFadden's pseudo R2 for a fitted model in R, use the "pscl" package by Simon Jackman and use the pR2 command.
pR2(full)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
##  -8.0298785 -13.8629436  11.6661303   0.4207667   0.4419499   0.5892665
mod1b <- lrm(y ~ x, data = data)
mod1b
## Logistic Regression Model
## 
## lrm(formula = y ~ x, data = data)
## 
##                       Model Likelihood    Discrimination    Rank Discrim.    
##                             Ratio Test           Indexes          Indexes    
## Obs            20    LR chi2     11.67    R2       0.589    C       0.895    
##  0             10    d.f.            1    R2(1,20) 0.413    Dxy     0.790    
##  1             10    Pr(> chi2) 0.0006    R2(1,15) 0.509    gamma   0.798    
## max |deriv| 1e-07                         Brier    0.137    tau-a   0.416    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.0777 1.7610 -2.32  0.0206  
## x          1.5046 0.6287  2.39  0.0167
## CIs using profiled log-likelihood
confint(fit.logist)
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept) -8.5853315 -1.268378
## x            0.5293557  3.145138
## CIs using standard errors
confint.default(fit.logist)
##                  2.5 %     97.5 %
## (Intercept) -7.5291793 -0.6262476
## x            0.2723838  2.7369071
logLik(fit.logist)
## 'log Lik.' -8.029878 (df=2)
## odds ratios only
exp(coef(fit.logist))
## (Intercept)           x 
##  0.01694617  4.50255687
## odds ratios and 95% CI
exp(cbind(OR = coef(fit.logist), confint(fit.logist)))
## Waiting for profiling to be done...
##                     OR        2.5 %     97.5 %
## (Intercept) 0.01694617 0.0001868263  0.2812875
## x           4.50255687 1.6978380343 23.2228735
## 6. Prediction with SE
pred_logit = predict(fit.logist, newdata = list(x=x), type = "link",se = TRUE)
pred_logit
## $fit
##          1          2          3          4          5          6          7 
## -3.3253907 -2.9492294 -2.5730680 -2.1969066 -1.8207453 -1.4445839 -1.4445839 
##          8          9         10         11         12         13         14 
## -1.0684226 -0.6922612 -0.3160999  0.0600615  0.4362229  0.8123842  1.1885456 
##         15         16         17         18         19         20 
##  1.9408683  2.3170296  2.6931910  3.0693524  3.4455137  4.1978364 
## 
## $se.fit
##         1         2         3         4         5         6         7         8 
## 1.4715288 1.3310441 1.1947260 1.0641770 0.9417992 0.8312096 0.8312096 0.7377287 
##         9        10        11        12        13        14        15        16 
## 0.6685718 0.6317781 0.6330161 0.6720757 0.7430159 0.8377769 1.0722391 1.2032148 
##        17        18        19        20 
## 1.3398379 1.4805458 1.6242772 1.9180797 
## 
## $residual.scale
## [1] 1
pred = predict(fit.logist, newdata = list(x=x), type = "response",se =  TRUE)
pred
## $fit
##          1          2          3          4          5          6          7 
## 0.03471034 0.04977295 0.07089196 0.10002862 0.13934447 0.19083650 0.19083650 
##          8          9         10         11         12         13         14 
## 0.25570318 0.33353024 0.42162653 0.51501086 0.60735865 0.69261733 0.76648084 
##         15         16         17         18         19         20 
## 0.87444750 0.91027764 0.93662366 0.95561071 0.96909707 0.98519444 
## 
## $se.fit
##          1          2          3          4          5          6          7 
## 0.04930435 0.06295253 0.07869217 0.09580030 0.11294771 0.12835367 0.12835367 
##          8          9         10         11         12         13         14 
## 0.14040383 0.14861538 0.15406389 0.15811139 0.16027266 0.15818702 0.14995198 
##         15         16         17         18         19         20 
## 0.11772013 0.09826927 0.07953248 0.06280310 0.04864376 0.02797779 
## 
## $residual.scale
## [1] 1
pred_prob = 1/(1+exp(-pred_logit$fit))
pred_prob
##          1          2          3          4          5          6          7 
## 0.03471034 0.04977295 0.07089196 0.10002862 0.13934447 0.19083650 0.19083650 
##          8          9         10         11         12         13         14 
## 0.25570318 0.33353024 0.42162653 0.51501086 0.60735865 0.69261733 0.76648084 
##         15         16         17         18         19         20 
## 0.87444750 0.91027764 0.93662366 0.95561071 0.96909707 0.98519444
LCI_pred = 1/(1+exp(-(pred_logit$fit-1.96*pred_logit$se.fit)))
UCI_pred = 1/(1+exp(-(pred_logit$fit+1.96*pred_logit$se.fit)))

plot(x,pred_prob,type="o",ylim=c(-0.1,1.1),xlab="Number of Study Hours",ylab="Prob of passing exam",col="blue")
points(x,y)
lines(x,LCI_pred,col="red",lty=2)
lines(x,UCI_pred,col="red",lty=2)
abline(h=0.5,col="gray")
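
As a side note, the inverse-logit transformation 1/(1+exp(-z)) computed by hand above is available in base R as plogis(), so the same probabilities can be obtained with:

pred_prob = plogis(pred_logit$fit)  # identical to 1/(1+exp(-pred_logit$fit))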

#create data.frame
dtf = data.frame( y =y ,
                  x =x ,
                  pred_logit = pred_logit,
                  pred_prob  = pred_prob,
                  LCI_pred   = LCI_pred,
                  UCI_pred   = UCI_pred
)

dtf
##    y    x pred_logit.fit pred_logit.se.fit pred_logit.residual.scale  pred_prob
## 1  0 0.50     -3.3253907         1.4715288                         1 0.03471034
## 2  0 0.75     -2.9492294         1.3310441                         1 0.04977295
## 3  0 1.00     -2.5730680         1.1947260                         1 0.07089196
## 4  0 1.25     -2.1969066         1.0641770                         1 0.10002862
## 5  0 1.50     -1.8207453         0.9417992                         1 0.13934447
## 6  0 1.75     -1.4445839         0.8312096                         1 0.19083650
## 7  1 1.75     -1.4445839         0.8312096                         1 0.19083650
## 8  0 2.00     -1.0684226         0.7377287                         1 0.25570318
## 9  1 2.25     -0.6922612         0.6685718                         1 0.33353024
## 10 0 2.50     -0.3160999         0.6317781                         1 0.42162653
## 11 1 2.75      0.0600615         0.6330161                         1 0.51501086
## 12 0 3.00      0.4362229         0.6720757                         1 0.60735865
## 13 1 3.25      0.8123842         0.7430159                         1 0.69261733
## 14 0 3.50      1.1885456         0.8377769                         1 0.76648084
## 15 1 4.00      1.9408683         1.0722391                         1 0.87444750
## 16 1 4.25      2.3170296         1.2032148                         1 0.91027764
## 17 1 4.50      2.6931910         1.3398379                         1 0.93662366
## 18 1 4.75      3.0693524         1.4805458                         1 0.95561071
## 19 1 5.00      3.4455137         1.6242772                         1 0.96909707
## 20 1 5.50      4.1978364         1.9180797                         1 0.98519444
##       LCI_pred  UCI_pred
## 1  0.002006035 0.3914564
## 2  0.003841377 0.4157164
## 3  0.007284242 0.4424055
## 4  0.013617431 0.4722486
## 5  0.024924302 0.5062950
## 6  0.044202920 0.5460161
## 7  0.044202920 0.5460161
## 8  0.074856857 0.5932762
## 9  0.118923711 0.6497953
## 10 0.174453851 0.7154871
## 11 0.234935337 0.7859651
## 12 0.292961171 0.8523926
## 13 0.344359378 0.9062510
## 14 0.388529610 0.9443065
## 15 0.459906172 0.9827485
## 16 0.489683624 0.9907637
## 17 0.516770898 0.9951275
## 18 0.541773065 0.9974554
## 19 0.565110751 0.9986804
## 20 0.607877767 0.9996500
#create another variable called "pred_pass"
# if pred_prob <= 0.5  , pred_pass = 0
# if pred_prob > 0.5   , pred_pass = 1
#create pred_pass
pred_pass= c()   #create null vector
for(i in 1:length(y)){
  if(pred_prob[i] <= 0.5) pred_pass[i] = 0
  else pred_pass[i] = 1
}
pred_pass
##  [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
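
The loop is again optional: comparing a vector to a threshold already yields TRUE/FALSE values, which can be coerced to 1/0 directly. A one-line sketch:

pred_pass <- as.numeric(pred_prob > 0.5)  # TRUE/FALSE become 1/0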
#add pred_pass in dtf 
dtf$pred_pass = pred_pass
dtf
##    y    x pred_logit.fit pred_logit.se.fit pred_logit.residual.scale  pred_prob
## 1  0 0.50     -3.3253907         1.4715288                         1 0.03471034
## 2  0 0.75     -2.9492294         1.3310441                         1 0.04977295
## 3  0 1.00     -2.5730680         1.1947260                         1 0.07089196
## 4  0 1.25     -2.1969066         1.0641770                         1 0.10002862
## 5  0 1.50     -1.8207453         0.9417992                         1 0.13934447
## 6  0 1.75     -1.4445839         0.8312096                         1 0.19083650
## 7  1 1.75     -1.4445839         0.8312096                         1 0.19083650
## 8  0 2.00     -1.0684226         0.7377287                         1 0.25570318
## 9  1 2.25     -0.6922612         0.6685718                         1 0.33353024
## 10 0 2.50     -0.3160999         0.6317781                         1 0.42162653
## 11 1 2.75      0.0600615         0.6330161                         1 0.51501086
## 12 0 3.00      0.4362229         0.6720757                         1 0.60735865
## 13 1 3.25      0.8123842         0.7430159                         1 0.69261733
## 14 0 3.50      1.1885456         0.8377769                         1 0.76648084
## 15 1 4.00      1.9408683         1.0722391                         1 0.87444750
## 16 1 4.25      2.3170296         1.2032148                         1 0.91027764
## 17 1 4.50      2.6931910         1.3398379                         1 0.93662366
## 18 1 4.75      3.0693524         1.4805458                         1 0.95561071
## 19 1 5.00      3.4455137         1.6242772                         1 0.96909707
## 20 1 5.50      4.1978364         1.9180797                         1 0.98519444
##       LCI_pred  UCI_pred pred_pass
## 1  0.002006035 0.3914564         0
## 2  0.003841377 0.4157164         0
## 3  0.007284242 0.4424055         0
## 4  0.013617431 0.4722486         0
## 5  0.024924302 0.5062950         0
## 6  0.044202920 0.5460161         0
## 7  0.044202920 0.5460161         0
## 8  0.074856857 0.5932762         0
## 9  0.118923711 0.6497953         0
## 10 0.174453851 0.7154871         0
## 11 0.234935337 0.7859651         1
## 12 0.292961171 0.8523926         1
## 13 0.344359378 0.9062510         1
## 14 0.388529610 0.9443065         1
## 15 0.459906172 0.9827485         1
## 16 0.489683624 0.9907637         1
## 17 0.516770898 0.9951275         1
## 18 0.541773065 0.9974554         1
## 19 0.565110751 0.9986804         1
## 20 0.607877767 0.9996500         1
#Classification table
xtabs(~ y + pred_pass, data = dtf)
##    pred_pass
## y   0 1
##   0 8 2
##   1 2 8

Task 2: Admission data

A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

# Load the packages
library(aod)
library(MASS)
library(rms)
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data2 = read.csv("Admission.csv",header=TRUE) 
head(data2)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
summary(data2)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(data2, sd)
##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602
xtabs(~admit + rank, data = data2)
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
#Using the logit model
data2$rank <- factor(data2$rank)



#Perform variable selection

null <- glm(admit ~ 1, data = data2,family="binomial")
full <- glm(admit ~ gre + gpa + rank, data = data2,family="binomial")

forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
## Start:  AIC=501.98
## admit ~ 1
## 
##        Df Deviance    AIC
## + rank  3   474.97 482.97
## + gre   1   486.06 490.06
## + gpa   1   486.97 490.97
## <none>      499.98 501.98
## 
## Step:  AIC=482.97
## admit ~ rank
## 
##        Df Deviance    AIC
## + gpa   1   462.88 472.88
## + gre   1   464.53 474.53
## <none>      474.97 482.97
## 
## Step:  AIC=472.88
## admit ~ rank + gpa
## 
##        Df Deviance    AIC
## + gre   1   458.52 470.52
## <none>      462.88 472.88
## 
## Step:  AIC=470.52
## admit ~ rank + gpa + gre
backwards = step(full) 
## Start:  AIC=470.52
## admit ~ gre + gpa + rank
## 
##        Df Deviance    AIC
## <none>      458.52 470.52
## - gre   1   462.88 472.88
## - gpa   1   464.53 474.53
## - rank  3   480.34 486.34
bothways = step(null, list(lower=null,upper=full),direction="both")
## Start:  AIC=501.98
## admit ~ 1
## 
##        Df Deviance    AIC
## + rank  3   474.97 482.97
## + gre   1   486.06 490.06
## + gpa   1   486.97 490.97
## <none>      499.98 501.98
## 
## Step:  AIC=482.97
## admit ~ rank
## 
##        Df Deviance    AIC
## + gpa   1   462.88 472.88
## + gre   1   464.53 474.53
## <none>      474.97 482.97
## - rank  3   499.98 501.98
## 
## Step:  AIC=472.88
## admit ~ rank + gpa
## 
##        Df Deviance    AIC
## + gre   1   458.52 470.52
## <none>      462.88 472.88
## - gpa   1   474.97 482.97
## - rank  3   486.97 490.97
## 
## Step:  AIC=470.52
## admit ~ rank + gpa + gre
## 
##        Df Deviance    AIC
## <none>      458.52 470.52
## - gre   1   462.88 472.88
## - gpa   1   464.53 474.53
## - rank  3   480.34 486.34
mod1b <- lrm(admit ~ gre + gpa + rank, data = data2)
mod1b
## Logistic Regression Model
## 
## lrm(formula = admit ~ gre + gpa + rank, data = data2)
## 
##                        Model Likelihood    Discrimination    Rank Discrim.    
##                              Ratio Test           Indexes          Indexes    
## Obs           400    LR chi2      41.46    R2       0.138    C       0.693    
##  0            273    d.f.             5    R2(5,400)0.087    Dxy     0.386    
##  1            127    Pr(> chi2) <0.0001    R2(5,260)0.131    gamma   0.386    
## max |deriv| 2e-06                          Brier    0.195    tau-a   0.168    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -3.9900 1.1400 -3.50  0.0005  
## gre        0.0023 0.0011  2.07  0.0385  
## gpa        0.8040 0.3318  2.42  0.0154  
## rank=2    -0.6754 0.3165 -2.13  0.0328  
## rank=3    -1.3402 0.3453 -3.88  0.0001  
## rank=4    -1.5515 0.4178 -3.71  0.0002
#lrtest
fit <- glm(admit ~ gre + gpa + rank, data = data2, family = "binomial")
summary(fit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = data2)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
confint(fit)
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605
confint.default(fit)
##                     2.5 %       97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre          0.0001202298  0.004408622
## gpa          0.1536836760  1.454391423
## rank2       -1.2957512650 -0.055134591
## rank3       -2.0169920597 -0.663415773
## rank4       -2.3703986294 -0.732528724
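
As in Task 1, exponentiating the coefficients and their CIs puts them on the odds ratio scale:

## odds ratios and 95% CI
exp(cbind(OR = coef(fit), confint(fit)))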
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4:6)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
pR2(fit)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -229.25874624 -249.98825878   41.45902508    0.08292194    0.09845702 
##          r2CU 
##    0.13799580
lrm(admit ~ gre + gpa + rank, data = data2)
## Logistic Regression Model
## 
## lrm(formula = admit ~ gre + gpa + rank, data = data2)
## 
##                        Model Likelihood    Discrimination    Rank Discrim.    
##                              Ratio Test           Indexes          Indexes    
## Obs           400    LR chi2      41.46    R2       0.138    C       0.693    
##  0            273    d.f.             5    R2(5,400)0.087    Dxy     0.386    
##  1            127    Pr(> chi2) <0.0001    R2(5,260)0.131    gamma   0.386    
## max |deriv| 2e-06                          Brier    0.195    tau-a   0.168    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -3.9900 1.1400 -3.50  0.0005  
## gre        0.0023 0.0011  2.07  0.0385  
## gpa        0.8040 0.3318  2.42  0.0154  
## rank=2    -0.6754 0.3165 -2.13  0.0328  
## rank=3    -1.3402 0.3453 -3.88  0.0001  
## rank=4    -1.5515 0.4178 -3.71  0.0002
newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
pred.response <- predict(fit, newdata = newdata1, type = "response")
head(pred.response)
##         1         2         3         4         5         6 
## 0.1726265 0.2921750 0.7384082 0.1783846 0.1183539 0.3699699
newdata1$rankP <- predict(fit, newdata = newdata1, type = "response")
head(newdata1)
##   gre  gpa rank     rankP
## 1 380 3.61    3 0.1726265
## 2 660 3.67    3 0.2921750
## 3 800 4.00    1 0.7384082
## 4 640 3.19    4 0.1783846
## 5 520 2.93    4 0.1183539
## 6 760 3.00    2 0.3699699

Task 3: Buying Cosmetics

# Load the packages
library(aod)  #for 'Wald test'
library(MASS)
library(rms)  #for 'lrm'
library(pscl) #to get pseudo R2

#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data3 = read.csv("BuyingCosmetics.csv",header=TRUE) 
head(data3)
##   Obs Buying Age Income
## 1   1      1  40     15
## 2   2      1  45     27
## 3   3      0  20      8
## 4   4      1  30     17
## 5   5      1  55     45
## 6   6      0  15      3
summary(data3)
##       Obs            Buying            Age            Income     
##  Min.   : 1.00   Min.   :0.0000   Min.   :15.00   Min.   : 3.00  
##  1st Qu.:15.75   1st Qu.:0.0000   1st Qu.:22.00   1st Qu.:12.00  
##  Median :30.50   Median :1.0000   Median :30.00   Median :20.00  
##  Mean   :30.50   Mean   :0.5333   Mean   :32.35   Mean   :21.63  
##  3rd Qu.:45.25   3rd Qu.:1.0000   3rd Qu.:40.00   3rd Qu.:30.00  
##  Max.   :60.00   Max.   :1.0000   Max.   :60.00   Max.   :51.00
#Using the logit model
#Perform variable selection

null <- glm(Buying ~ 1, data = data3,family="binomial")
full <- glm(Buying ~ Age + Income, data = data3,family="binomial")
summary(full)
## 
## Call:
## glm(formula = Buying ~ Age + Income, family = "binomial", data = data3)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.56798    1.22985  -3.714 0.000204 ***
## Age          0.16265    0.05200   3.128 0.001761 ** 
## Income      -0.01801    0.04232  -0.426 0.670439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82.911  on 59  degrees of freedom
## Residual deviance: 59.067  on 57  degrees of freedom
## AIC: 65.067
## 
## Number of Fisher Scoring iterations: 5
lrtest(null, full)
## 
## Model 1: Buying ~ 1
## Model 2: Buying ~ Age + Income
## 
##   L.R. Chisq         d.f.            P 
## 2.384415e+01 2.000000e+00 6.642161e-06
wald.test(b = coef(full), Sigma = vcov(full), Terms = 1)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 13.8, df = 1, P(> X2) = 2e-04
wald.test(b = coef(full), Sigma = vcov(full), Terms = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 9.8, df = 1, P(> X2) = 0.0018
wald.test(b = coef(full), Sigma = vcov(full), Terms = 3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 0.18, df = 1, P(> X2) = 0.67
mod = lrm(Buying ~ Age + Income, data = data3)
mod
## Logistic Regression Model
## 
## lrm(formula = Buying ~ Age + Income, data = data3)
## 
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                              Ratio Test            Indexes          Indexes    
## Obs            60    LR chi2      23.84     R2       0.438    C       0.836    
##  0             28    d.f.             2     R2(2,60) 0.305    Dxy     0.672    
##  1             32    Pr(> chi2) <0.0001    R2(2,44.8)0.386    gamma   0.673    
## max |deriv| 1e-08                           Brier    0.167    tau-a   0.340    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.5680 1.2299 -3.71  0.0002  
## Age        0.1626 0.0520  3.13  0.0018  
## Income    -0.0180 0.0423 -0.43  0.6704
-2*logLik(full)
## 'log Lik.' 59.06665 (df=3)
#To easily get a McFadden's pseudo R2 for a fitted model in R, use the "pscl" package by Simon Jackman and use the pR2 command.
pR2(full)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -29.5333253 -41.4553986  23.8441464   0.2875880   0.3279365   0.4378993
#R2 McFadden
1-as.numeric(logLik(full))/as.numeric(logLik(null))
## [1] 0.287588
#r2 Cox and Snell
n=dim(data3)[1]
r2_cs = 1-(as.numeric(exp(logLik(null))/exp(logLik(full))))^(2/n)
r2_cs
## [1] 0.3279365
#r2 Nagelkerke
r2_N = r2_cs/(1-as.numeric(exp(logLik(null)))^(2/n))
r2_N
## [1] 0.4378993
#forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
#backwards = step(full) 
bothways = step(null, list(lower=null,upper=full),direction="both")
## Start:  AIC=84.91
## Buying ~ 1
## 
##          Df Deviance    AIC
## + Age     1   59.250 63.250
## + Income  1   73.593 77.593
## <none>        82.911 84.911
## 
## Step:  AIC=63.25
## Buying ~ Age
## 
##          Df Deviance    AIC
## <none>        59.250 63.250
## + Income  1   59.067 65.067
## - Age     1   82.911 84.911
summary(bothways)
## 
## Call:
## glm(formula = Buying ~ Age, family = "binomial", data = data3)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.51528    1.22025  -3.700 0.000215 ***
## Age          0.14847    0.03878   3.828 0.000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82.911  on 59  degrees of freedom
## Residual deviance: 59.250  on 58  degrees of freedom
## AIC: 63.25
## 
## Number of Fisher Scoring iterations: 5
confint(bothways)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -7.2076248 -2.3490179
## Age          0.0805232  0.2349225
confint.default(bothways)
##                  2.5 %    97.5 %
## (Intercept) -6.9069200 -2.123635
## Age          0.0724545  0.224479
## odds ratios and 95% CI
exp(cbind(OR = coef(bothways), confint(bothways)))
## Waiting for profiling to be done...
##                     OR        2.5 %     97.5 %
## (Intercept) 0.01094057 0.0007409149 0.09546287
## Age         1.16005425 1.0838539957 1.26481079
newdata1 <- with(data3, data.frame(Age=Age, Income=Income))
pred_response <- predict(bothways, newdata = newdata1, type = "response")
head(pred_response)
##          1          2          3          4          5          6 
## 0.80586986 0.89712932 0.17567316 0.48468624 0.97467705 0.09209862
mod1b <- lrm(Buying ~ Age, data = data3)
mod1b
## Logistic Regression Model
## 
## lrm(formula = Buying ~ Age, data = data3)
## 
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                              Ratio Test            Indexes          Indexes    
## Obs            60    LR chi2      23.66     R2       0.435    C       0.827    
##  0             28    d.f.             1     R2(1,60) 0.315    Dxy     0.654    
##  1             32    Pr(> chi2) <0.0001    R2(1,44.8)0.397    gamma   0.677    
## max |deriv| 3e-09                           Brier    0.167    tau-a   0.331    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.5153 1.2202 -3.70  0.0002  
## Age        0.1485 0.0388  3.83  0.0001
## odds ratios and 95% CI
#exp(cbind(OR = coef(fit_logit_full), confint(fit_logit_full)))

#newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
#pred_response <- predict(fit_logit_full, newdata = newdata1, type = "response")
#head(pred_response)

Lab 3: Multinomial logistic regression

Example: Product Buying

1. Import data

library(readr)

#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
dt <- read_csv("product.csv")
## Rows: 400 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): product, gender
## dbl (5): obs, age, household, position_level, absent
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dt)
## spc_tbl_ [400 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ obs           : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
##  $ product       : chr [1:400] "C" "A" "C" "A" ...
##  $ age           : num [1:400] 57 21 66 36 23 31 37 37 55 66 ...
##  $ household     : num [1:400] 2 7 7 4 0 5 3 0 3 2 ...
##  $ position_level: num [1:400] 2 2 2 2 2 1 3 3 3 4 ...
##  $ gender        : chr [1:400] "Male" "Male" "Male" "Female" ...
##  $ absent        : num [1:400] 10 7 1 6 11 14 12 25 3 18 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   obs = col_double(),
##   ..   product = col_character(),
##   ..   age = col_double(),
##   ..   household = col_double(),
##   ..   position_level = col_double(),
##   ..   gender = col_character(),
##   ..   absent = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
  2. Set the data structure
dt$product <- as.factor(dt$product)
dt$position_level <- as.factor(dt$position_level)
dt$gender <- as.factor(dt$gender)
str(dt)
## spc_tbl_ [400 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ obs           : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
##  $ product       : Factor w/ 3 levels "A","B","C": 3 1 3 1 1 1 1 2 3 2 ...
##  $ age           : num [1:400] 57 21 66 36 23 31 37 37 55 66 ...
##  $ household     : num [1:400] 2 7 7 4 0 5 3 0 3 2 ...
##  $ position_level: Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 1 3 3 3 4 ...
##  $ gender        : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 1 1 ...
##  $ absent        : num [1:400] 10 7 1 6 11 14 12 25 3 18 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   obs = col_double(),
##   ..   product = col_character(),
##   ..   age = col_double(),
##   ..   household = col_double(),
##   ..   position_level = col_double(),
##   ..   gender = col_character(),
##   ..   absent = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
  3. Descriptive statistics
summary(dt)
##       obs        product      age          household     position_level
##  Min.   :  1.0   A:142   Min.   :21.00   Min.   :0.000   1: 47         
##  1st Qu.:100.8   B:117   1st Qu.:30.00   1st Qu.:2.000   2: 99         
##  Median :200.5   C:141   Median :37.00   Median :3.000   3:135         
##  Mean   :200.5           Mean   :41.05   Mean   :3.212   4: 64         
##  3rd Qu.:300.2           3rd Qu.:53.00   3rd Qu.:5.000   5: 55         
##  Max.   :400.0           Max.   :67.00   Max.   :7.000                 
##     gender        absent     
##  Female:232   Min.   : 0.00  
##  Male  :168   1st Qu.: 7.00  
##               Median :14.00  
##               Mean   :14.17  
##               3rd Qu.:21.00  
##               Max.   :31.00
boxplot(age~product,data=dt)

table(dt$product,dt$position_level)
##    
##      1  2  3  4  5
##   A 14 36 44 28 20
##   B 20 29 39 13 16
##   C 13 34 52 23 19
plot(dt$age,dt$absent)

4. Set the base group for the y variable

dt$product <- relevel(dt$product, ref = "A")


  5. Fit the multinomial logistic regression
#Important packages
library(MASS) # for stepAIC
library(rms)  # for lrtest
library(pscl) # to get pseudo R2
#package "nnet" provides multinom()
library(nnet)
null <- multinom(product~1,data=dt)
## # weights:  6 (2 variable)
## initial  value 439.444915 
## final  value 437.908882 
## converged
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
## # weights:  30 (18 variable)
## initial  value 439.444915 
## iter  10 value 268.455865
## iter  20 value 208.024202
## iter  30 value 207.509074
## iter  30 value 207.509072
## iter  30 value 207.509072
## final  value 207.509072 
## converged
summary(full)
## Call:
## multinom(formula = product ~ age + household + position_level + 
##     gender + absent, data = dt)
## 
## Coefficients:
##   (Intercept)       age  household position_level2 position_level3
## B   -4.727535 0.2368700 -0.9628612     -0.34360355      -1.3305304
## C  -10.114127 0.2647383  0.1552594      0.01216742      -0.1824063
##   position_level4 position_level5  genderMale      absent
## B      -1.9177254      -1.9734897 -2.39588762  0.01384059
## C      -0.2653796      -0.7251654  0.09274777 -0.01208771
## 
## Std. Errors:
##   (Intercept)        age  household position_level2 position_level3
## B    1.027988 0.02864214 0.13731517       0.5999365       0.5954285
## C    1.213453 0.02960232 0.09266699       0.6027884       0.5777555
##   position_level4 position_level5 genderMale     absent
## B       0.7839738       0.7471631  0.4633714 0.02489143
## C       0.6440092       0.7116233  0.3694506 0.02293563
## 
## Residual Deviance: 415.0181 
## AIC: 451.0181
  6. Calculate the z value = b/se(b), then the p-value
b = summary(full)$coefficients
se.b = summary(full)$standard.errors
z <- b/se.b
t(z)#transpose of Z
##                          B           C
## (Intercept)     -4.5988214 -8.33500061
## age              8.2699833  8.94315942
## household       -7.0120526  1.67545552
## position_level2 -0.5727332  0.02018522
## position_level3 -2.2345763 -0.31571534
## position_level4 -2.4461600 -0.41207432
## position_level5 -2.6413106 -1.01902993
## genderMale      -5.1705554  0.25104244
## absent           0.5560382 -0.52702743
# p-value for 2-tailed z test
p <- (1 - pnorm(abs(z),mean=0,sd=1))*2
t(p)#transpose of p
##                            B          C
## (Intercept)     4.248879e-06 0.00000000
## age             2.220446e-16 0.00000000
## household       2.348566e-12 0.09384489
## position_level2 5.668254e-01 0.98389562
## position_level3 2.544518e-02 0.75221859
## position_level4 1.443869e-02 0.68028495
## position_level5 8.258598e-03 0.30818876
## genderMale      2.333993e-07 0.80178130
## absent          5.781847e-01 0.59817454
  7. Test if position_level should be included when the rest are in!
#likelihood ratio test
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
## # weights:  30 (18 variable)
## initial  value 439.444915 
## iter  10 value 268.455865
## iter  20 value 208.024202
## iter  30 value 207.509074
## iter  30 value 207.509072
## iter  30 value 207.509072
## final  value 207.509072 
## converged
fit1 <- multinom(product~age+household+gender+absent,data=dt)
## # weights:  18 (10 variable)
## initial  value 439.444915 
## iter  10 value 256.830887
## iter  20 value 214.842594
## iter  30 value 214.837090
## final  value 214.837084 
## converged
anova(fit1,full)
## Likelihood ratio tests of Multinomial Models
## 
## Response: product
##                                                Model Resid. df Resid. Dev
## 1                  age + household + gender + absent       790   429.6742
## 2 age + household + position_level + gender + absent       782   415.0181
##     Test    Df LR stat.    Pr(Chi)
## 1                                 
## 2 1 vs 2     8 14.65602 0.06618923
  8. From fit1, test if “absent” should be included when the rest are in
fit2 <- multinom(product~age+household+gender,data=dt)
## # weights:  15 (8 variable)
## initial  value 439.444915 
## iter  10 value 220.300742
## iter  20 value 215.255121
## final  value 215.248449 
## converged
anova(fit2,fit1)
## Likelihood ratio tests of Multinomial Models
## 
## Response: product
##                               Model Resid. df Resid. Dev   Test    Df  LR stat.
## 1          age + household + gender       792   430.4969                       
## 2 age + household + gender + absent       790   429.6742 1 vs 2     2 0.8227305
##     Pr(Chi)
## 1          
## 2 0.6627448
  9. Use stepAIC for variable selection
stepAIC(full,direction="both")
## Start:  AIC=451.02
## product ~ age + household + position_level + gender + absent
## 
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 342.197670
## iter  20 value 334.868396
## final  value 334.851268 
## converged
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 301.317069
## iter  20 value 276.620201
## final  value 276.458307 
## converged
## # weights:  18 (10 variable)
## initial  value 439.444915 
## iter  10 value 256.830887
## iter  20 value 214.842594
## iter  30 value 214.837090
## final  value 214.837084 
## converged
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 260.954489
## iter  20 value 230.815041
## final  value 230.581071 
## converged
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 229.652679
## iter  20 value 208.237597
## final  value 208.149341 
## converged
##                  Df    AIC
## - absent          2 448.30
## - position_level  8 449.67
## <none>              451.02
## - gender          2 493.16
## - household       2 584.92
## - age             2 701.70
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 229.652679
## iter  20 value 208.237597
## final  value 208.149341 
## converged
## 
## Step:  AIC=448.3
## product ~ age + household + position_level + gender
## 
## # weights:  24 (14 variable)
## initial  value 439.444915 
## iter  10 value 345.161008
## iter  20 value 335.869815
## final  value 335.869805 
## converged
## # weights:  24 (14 variable)
## initial  value 439.444915 
## iter  10 value 286.695124
## iter  20 value 277.917014
## final  value 277.916928 
## converged
## # weights:  15 (8 variable)
## initial  value 439.444915 
## iter  10 value 220.300742
## iter  20 value 215.255121
## final  value 215.248449 
## converged
## # weights:  24 (14 variable)
## initial  value 439.444915 
## iter  10 value 252.565848
## iter  20 value 231.441304
## final  value 231.409442 
## converged
## # weights:  30 (18 variable)
## initial  value 439.444915 
## iter  10 value 268.455865
## iter  20 value 208.024202
## iter  30 value 207.509074
## iter  30 value 207.509072
## iter  30 value 207.509072
## final  value 207.509072 
## converged
##                  Df    AIC
## - position_level  8 446.50
## <none>              448.30
## + absent          2 451.02
## - gender          2 490.82
## - household       2 583.83
## - age             2 699.74
## # weights:  15 (8 variable)
## initial  value 439.444915 
## iter  10 value 220.300742
## iter  20 value 215.255121
## final  value 215.248449 
## converged
## 
## Step:  AIC=446.5
## product ~ age + household + gender
## 
## # weights:  12 (6 variable)
## initial  value 439.444915 
## iter  10 value 341.114844
## final  value 340.866118 
## converged
## # weights:  12 (6 variable)
## initial  value 439.444915 
## iter  10 value 284.455247
## iter  20 value 283.954925
## final  value 283.954920 
## converged
## # weights:  12 (6 variable)
## initial  value 439.444915 
## iter  10 value 239.284726
## iter  20 value 238.014588
## final  value 238.014521 
## converged
## # weights:  27 (16 variable)
## initial  value 439.444915 
## iter  10 value 229.652679
## iter  20 value 208.237597
## final  value 208.149341 
## converged
## # weights:  18 (10 variable)
## initial  value 439.444915 
## iter  10 value 256.830887
## iter  20 value 214.842594
## iter  30 value 214.837090
## final  value 214.837084 
## converged
##                  Df    AIC
## <none>              446.50
## + position_level  8 448.30
## + absent          2 449.67
## - gender          2 488.03
## - household       2 579.91
## - age             2 693.73
## Call:
## multinom(formula = product ~ age + household + gender, data = dt)
## 
## Coefficients:
##   (Intercept)       age  household genderMale
## B   -5.229876 0.2247682 -0.9143186 -2.2522987
## C  -10.339791 0.2608011  0.1518043  0.1087794
## 
## Residual Deviance: 430.4969 
## AIC: 446.4969
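Because the result of stepAIC() was not assigned, only its printout appears above. In practice, save the returned model; setting trace = FALSE also suppresses the lengthy refitting log (a minimal sketch, reusing the full model object):

best <- stepAIC(full, direction = "both", trace = FALSE)  # keep the selected model quietly
best  # same final model as above: product ~ age + household + gender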
  10. overall fit test (null model vs. the selected model)
anova(null,fit2)
## Likelihood ratio tests of Multinomial Models
## 
## Response: product
##                      Model Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1                        1       798   875.8178                              
## 2 age + household + gender       792   430.4969 1 vs 2     6 445.3209       0
  11. obtain the fitted equations
summary(fit2)
## Call:
## multinom(formula = product ~ age + household + gender, data = dt)
## 
## Coefficients:
##   (Intercept)       age  household genderMale
## B   -5.229876 0.2247682 -0.9143186 -2.2522987
## C  -10.339791 0.2608011  0.1518043  0.1087794
## 
## Std. Errors:
##   (Intercept)        age  household genderMale
## B   0.8699672 0.02746372 0.12695214  0.4347157
## C   1.0956022 0.02854696 0.09074394  0.3624084
## 
## Residual Deviance: 430.4969 
## AIC: 446.4969
  12. get odds ratios
exp(summary(fit2)$coefficients)
##    (Intercept)      age household genderMale
## B 5.354189e-03 1.252032 0.4007896  0.1051572
## C 3.232109e-05 1.297969 1.1639324  1.1149163
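Note that summary() for a multinom fit reports coefficients and standard errors but no p-values. They can be computed by hand with the same Wald calculation used for polr in Lab 4 below, together with Wald-type 95% CIs on the odds-ratio scale (a minimal sketch):

s <- summary(fit2)
z <- s$coefficients / s$standard.errors         # Wald z statistics
2 * pnorm(abs(z), lower.tail = FALSE)           # two-sided p-values
exp(s$coefficients - 1.96 * s$standard.errors)  # lower 95% limits of the ORs
exp(s$coefficients + 1.96 * s$standard.errors)  # upper 95% limits of the ORs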
  13. Get the pseudo-R2
library(pscl) 
pR2(fit2)
## fitting null model for pseudo-r2
## # weights:  6 (2 variable)
## initial  value 439.444915 
## final  value 437.908882 
## converged
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -215.2484490 -437.9088817  445.3208655    0.5084629    0.6715275    0.7561972
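McFadden’s pseudo-R2 in this output is one minus the ratio of the fitted and null log-likelihoods, which you can verify directly (a minimal sketch, reusing the null model fitted earlier):

1 - as.numeric(logLik(fit2)) / as.numeric(logLik(null))  # 1 - (-215.2484)/(-437.9089) ≈ 0.5085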
  14. Get \(\hat{p}\) and \(\hat{y}\)
phat <- predict(fit2,newdata=dt,type="probs")
head(phat) # show the first 6 rows
##              A            B          C
## 1 0.0057578835 0.1908607399 0.80338138
## 2 0.9755735475 0.0001023612 0.02432409
## 3 0.0003204954 0.0008306117 0.99884889
## 4 0.4628612448 0.2089157798 0.32822298
## 5 0.8980469293 0.0889180955 0.01303498
## 6 0.7961988323 0.0049228842 0.19887828
yhat <- predict(fit2,newdata=dt)
head(yhat)
## [1] C A C A A A
## Levels: A B C

15. Classification table and % accuracy

table(dt$product,yhat)
##    yhat
##       A   B   C
##   A 118  15   9
##   B  28  79  10
##   C  16  20 105
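The % accuracy asked for in item 15 is the proportion of correct classifications, i.e. the sum of the diagonal of this table over the total:

tab <- table(dt$product, yhat)
sum(diag(tab)) / sum(tab)  # (118 + 79 + 105)/400 = 0.755, i.e. 75.5% accuracy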

Lab 4: Ordinal logistic regression

import data

library(readr)
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
dp <- read_csv("poverty.csv")
## New names:
## Rows: 5381 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): poverty, religion, degree, country, gender dbl (2): ...1, age
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
  1. set data structure
str(dp)
## spc_tbl_ [5,381 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1    : num [1:5381] 1 2 3 4 5 6 7 8 9 10 ...
##  $ poverty : chr [1:5381] "Too Little" "About Right" "Too Little" "Too Much" ...
##  $ religion: chr [1:5381] "yes" "yes" "yes" "yes" ...
##  $ degree  : chr [1:5381] "no" "no" "no" "yes" ...
##  $ country : chr [1:5381] "USA" "USA" "USA" "USA" ...
##  $ age     : num [1:5381] 44 40 36 25 39 80 48 32 74 30 ...
##  $ gender  : chr [1:5381] "male" "female" "female" "female" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   poverty = col_character(),
##   ..   religion = col_character(),
##   ..   degree = col_character(),
##   ..   country = col_character(),
##   ..   age = col_double(),
##   ..   gender = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
dp$poverty <- as.factor(dp$poverty)
dp$religion <- as.factor(dp$religion)
dp$degree  <- as.factor(dp$degree)
dp$country <- as.factor(dp$country)
dp$gender  <- as.factor(dp$gender)
  2. descriptive statistics
dp$poverty <- ordered(dp$poverty,
                      levels =c("Too Little","About Right","Too Much"))

levels(dp$poverty)
## [1] "Too Little"  "About Right" "Too Much"
# 3. m0: null model
library(MASS)
m0 <- polr(poverty~1,data=dp)

# 4. mf: full model
mf <- polr(poverty~religion+degree+country+age+gender,data=dp)
summary(mf)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = poverty ~ religion + degree + country + age + 
##     gender, data = dp)
## 
## Coefficients:
##                  Value Std. Error t value
## religionyes    0.17973   0.077346   2.324
## degreeyes      0.14092   0.066193   2.129
## countryNorway -0.32235   0.073766  -4.370
## countrySweden -0.60330   0.079494  -7.589
## countryUSA     0.61777   0.070665   8.742
## age            0.01114   0.001561   7.139
## gendermale     0.17637   0.052972   3.329
## 
## Intercepts:
##                        Value   Std. Error t value
## Too Little|About Right  0.7298  0.1041     7.0128
## About Right|Too Much    2.5325  0.1103    22.9496
## 
## Residual Deviance: 10402.59 
## AIC: 10420.59
# 5. calculate p-values
## store table
ctable <- coef(summary(mf))
## 
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2

## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
##                              Value  Std. Error   t value       p value
## religionyes             0.17973194 0.077346025  2.323738  2.013951e-02
## degreeyes               0.14091745 0.066193093  2.128885  3.326381e-02
## countryNorway          -0.32235359 0.073766012 -4.369947  1.242765e-05
## countrySweden          -0.60329785 0.079493879 -7.589237  3.217957e-14
## countryUSA              0.61777260 0.070664754  8.742302  2.284084e-18
## age                     0.01114091 0.001560585  7.138935  9.405696e-13
## gendermale              0.17636863 0.052972239  3.329454  8.701647e-04
## Too Little|About Right  0.72976353 0.104061619  7.012802  2.335919e-12
## About Right|Too Much    2.53247870 0.110349763 22.949562 1.488394e-116
# 6. answer is age
# 7. Test if "age" should be included in the model when the others are in
m1 <- polr(poverty~religion+degree+country+gender,data=dp)
#likelihood ratio test, use anova function
anova(mf,m1)
## Likelihood ratio tests of ordinal regression models
## 
## Response: poverty
##                                        Model Resid. df Resid. Dev   Test    Df
## 1       religion + degree + country + gender      5373   10453.71             
## 2 religion + degree + country + age + gender      5372   10402.59 1 vs 2     1
##   LR stat.      Pr(Chi)
## 1                      
## 2 51.11977 8.689716e-13
## store table
ctable <- coef(summary(m1))
## 
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
##                              Value Std. Error   t value       p value
## religionyes             0.23479820 0.07679364  3.057521  2.231758e-03
## degreeyes               0.09304962 0.06569864  1.416310  1.566848e-01
## countryNorway          -0.32957463 0.07354477 -4.481279  7.419704e-06
## countrySweden          -0.60346856 0.07923965 -7.615740  2.621848e-14
## countryUSA              0.66319466 0.07033120  9.429593  4.116813e-21
## gendermale              0.18646926 0.05284173  3.528826  4.174078e-04
## Too Little|About Right  0.28016816 0.08232821  3.403064  6.663467e-04
## About Right|Too Much    2.06918584 0.08824151 23.449121 1.349496e-121
#8. From m1, test the significance of "degree"
m2 <- polr(poverty~religion+country+gender,data=dp)
anova(m1,m2)
## Likelihood ratio tests of ordinal regression models
## 
## Response: poverty
##                                  Model Resid. df Resid. Dev   Test    Df
## 1          religion + country + gender      5374   10455.71             
## 2 religion + degree + country + gender      5373   10453.71 1 vs 2     1
##   LR stat.   Pr(Chi)
## 1                   
## 2 2.001472 0.1571465
# From the LR test above, p = 0.157 > 0.05, so "degree" is not significant once the others are in
# Variable selection for the ordinal logistic model
library(MASS)
stepAIC(mf,direction="both") # stepwise selection in both directions
## Start:  AIC=10420.59
## poverty ~ religion + degree + country + age + gender
## 
##            Df   AIC
## <none>        10421
## - degree    1 10423
## - religion  1 10424
## - gender    1 10430
## - age       1 10470
## - country   3 10666
## Call:
## polr(formula = poverty ~ religion + degree + country + age + 
##     gender, data = dp)
## 
## Coefficients:
##   religionyes     degreeyes countryNorway countrySweden    countryUSA 
##    0.17973194    0.14091745   -0.32235359   -0.60329785    0.61777260 
##           age    gendermale 
##    0.01114091    0.17636863 
## 
## Intercepts:
## Too Little|About Right   About Right|Too Much 
##              0.7297635              2.5324787 
## 
## Residual Deviance: 10402.59 
## AIC: 10420.59
#9. overall fit test (null model vs. full model)
anova(m0,mf)
## Likelihood ratio tests of ordinal regression models
## 
## Response: poverty
##                                        Model Resid. df Resid. Dev   Test    Df
## 1                                          1      5379   10740.38             
## 2 religion + degree + country + age + gender      5372   10402.59 1 vs 2     7
##   LR stat. Pr(Chi)
## 1                 
## 2 337.7841       0
#11. ORs and their confidence intervals (CI)
ci <- confint(mf) # 95% CI of the log odds
## Waiting for profiling to be done...
## 
## Re-fitting to get Hessian
# odds ratios
exp(coef(mf))
##   religionyes     degreeyes countryNorway countrySweden    countryUSA 
##     1.1968965     1.1513296     0.7244420     0.5470047     1.8547921 
##           age    gendermale 
##     1.0112032     1.1928777
exp(cbind(OR = coef(mf), ci)) # ORs with their 95% CIs
##                      OR     2.5 %    97.5 %
## religionyes   1.1968965 1.0289511 1.3934593
## degreeyes     1.1513296 1.0110549 1.3106194
## countryNorway 0.7244420 0.6267512 0.8369294
## countrySweden 0.5470047 0.4678450 0.6389299
## countryUSA    1.8547921 1.6151018 2.1306390
## age           1.0112032 1.0081175 1.0143029
## gendermale    1.1928777 1.0752794 1.3234479
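# Interpretation sketch: in the proportional-odds model, exp(beta) multiplies
# the odds of a response in a higher category (toward "Too Much"). For example,
# 10 extra years of age multiply those odds by
exp(10 * coef(mf)["age"]) # = 1.0112^10, about 1.12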
#12. pseudo R2
library(pscl) # to get pseudo R2
pR2(mf)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -5.201296e+03 -5.370188e+03  3.377841e+02  3.144993e-02  6.084382e-02 
##          r2CU 
##  7.041132e-02
#13. getting phat
phat <- predict(mf,newdata=dp,type="probs")
head(phat)
##   Too Little About Right  Too Much
## 1  0.3242497   0.4200439 0.2557065
## 2  0.3744021   0.4096330 0.2159649
## 3  0.3848970   0.4065882 0.2085148
## 4  0.3805578   0.4078799 0.2115623
## 5  0.3058650   0.4218762 0.2722588
## 6  0.2770756   0.4221685 0.3007559
#getting yhat
yhat <- predict(mf,newdata=dp)
head(yhat)
## [1] About Right About Right About Right About Right About Right About Right
## Levels: Too Little About Right Too Much
#14. classification table
table(dp$poverty,yhat)
##              yhat
##               Too Little About Right Too Much
##   Too Little        2190         518        0
##   About Right       1483         379        0
##   Too Much           377         434        0
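As in Lab 3, the overall accuracy is the diagonal sum over the table total; note that the fitted model never predicts "Too Much":

tab <- table(dp$poverty, yhat)
sum(diag(tab)) / sum(tab)  # (2190 + 379 + 0)/5381 ≈ 0.477, about 47.7% accuracy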