Objectives: Students are able to use the R language to
analyse data using multiple linear regression:
1. perform descriptive statistics
2. transform a qualitative independent variable into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression parameters
5. interpret the results
R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.
To work with R, you need to download the program and install it. I recommend that you also install RStudio, a separate program that makes R easier to use.
On Windows: download the latest R installer (.exe) and install the downloaded file as you would any other Windows app. Now that R is installed, download the RStudio installer for Windows, run the installer (.exe file), and follow the instructions.
On macOS: first download the latest release (“R-version.pkg”) of R. Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, download the RStudio version for Mac. After downloading, double-click the file to open it, and then drag and drop it into your Applications folder.
R works with numerous data types. Some of the most basic types to get started are:
- Decimal values like 4.5 are called numerics.
- Whole numbers like 4 are called integers. Integers are also numerics.
- Boolean values (TRUE or FALSE) are called logical.
# create variable "a" that is a vector of one number.
a <- 7
a = 7 # "=" also assigns a value, but "<-" is the preferred R style
a
## [1] 7
# create vectors that consists of multiple numbers.
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
# (caution: this overwrites the name "c", which is also R's combine function)
c <- 1:10
c
## [1] 1 2 3 4 5 6 7 8 9 10
# seq(from, to, by) function to create a sequence of numbers
d <- seq(1,10,0.5)
d
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.
rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
x <- "Monday"
class(x)
## [1] "character"
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)
## [1] "character"
A factor is a nominal (categorical) variable with a set of known
possible values called levels. They can be created using the
as.factor function.
yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.
lst <- list(d,yy,mat2)
lst
## [[1]]
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
##
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The data.frame is commonly used for statistical data
analysis in R. It is a special type of list that requires that all
elements (variables) have the same length.
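As a quick illustration (the variable names below are made up for this example), a data.frame can be built directly from equal-length vectors:
# create a small data.frame from two equal-length vectors
df0 <- data.frame(day = c("Mon","Tue","Wed"), sales = c(120, 95, 143))
df0
str(df0)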
# Loading
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth, n = 6)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# Examine data structure of the dataframe
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
To read a file into R, you need to know the full path (directory) name and the name of the file itself.
#Your directory will be different from mine!
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
# Read data file and keep it in dt
dt <- read.csv("bank_data.csv", header = TRUE)
head(dt) # show the first six rows of the data
## age job marital education balance housing loan duration campaign pdays
## 1 59 admin. married secondary 2343 yes no 1042 1 -1
## 2 56 admin. married secondary 45 no no 1467 1 -1
## 3 41 technician married secondary 1270 yes no 1389 1 -1
## 4 55 services married secondary 2476 yes no 579 1 -1
## 5 54 admin. married tertiary 184 no no 673 2 -1
## 6 42 management single tertiary 0 yes yes 562 2 -1
str(dt) # show data structure
## 'data.frame': 1000 obs. of 10 variables:
## $ age : int 59 56 41 55 54 42 56 60 37 28 ...
## $ job : chr "admin." "admin." "technician" "services" ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education: chr "secondary" "secondary" "secondary" "secondary" ...
## $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
## $ housing : chr "yes" "no" "yes" "yes" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
## $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
Set “chr” variables as “factor”
dt$job <- as.factor(dt$job)
dt$marital <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing <- as.factor(dt$housing)
dt$loan <- as.factor(dt$loan)
#check data structure again
str(dt)
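Converting columns one at a time works; as a sketch (assuming these five column names), the same conversions can also be done in one step:
# alternative: convert several character columns to factors at once
cols <- c("job","marital","education","housing","loan")
dt[cols] <- lapply(dt[cols], as.factor)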
For a qualitative variable, use the summary() function, the table() function, and the prop.table() function.
# job, marital, education, housing, loan
summary(dt)
#frequency table for one categorical variable
table(dt$job)
table(dt$housing)
prop.table(table(dt$job))
prop.table(table(dt$job))*100
#Two-Way Frequency Tables
table(dt$job,dt$education)
tb <- table(dt$loan,dt$marital)
# Add margin totals to your table.
addmargins(tb)
#Calculate percentage
prop.table(tb)*100
For a quantitative variable, measures of central tendency include the summary() function, the max() and min() functions, the mean() function, the median() function, and the getmode() function defined below.
#define function to calculate mode
getmode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
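Note that getmode() returns every value tied for the highest frequency, so it can return more than one mode:
getmode(c(1, 2, 2, 3, 3)) # a quick check on a toy vector
## [1] 2 3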
Measures of spread and position include the range() function; the quantile(x, probs = ) function, where probs can be 0.25, 0.5, 0.75 for the first, second, and third quartiles; the IQR() function; and the sd() and var() functions.
#age, balance, duration, campaign, pdays
summary(dt$age)
min(dt$age)
range(dt$age)
mean(dt$age)
getmode(dt$age)
IQR(dt$age)
sd(dt$age)
quantile(dt$age,probs = 0.25)
quantile(dt$age,probs = 0.50)
quantile(dt$age,probs = 0.75)
quantile(dt$age,probs = c(0.25,0.50,0.75))
# coefficient of variation (CV) = sd/mean
cv_age <- sd(dt$age)/mean(dt$age)
cv_age
#cv of age by loan (yes,no)
cv1 <- sd(dt$age[dt$loan=="yes"])/mean(dt$age[dt$loan=="yes"])
cv1
cv2 <- sd(dt$age[dt$loan=="no"])/mean(dt$age[dt$loan=="no"])
cv2
The aggregate() function is used to get the summary statistics of the data by group.
# aggregate: compute summary statistics of data subsets
aggregate(balance~loan, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=mean)
In the case that you have a factor variable, you can display a frequency table or a bar graph.
You can use the good ol’ summary() on a factor variable
to display a frequency table.
freq <- table(dt$job)
freq
##
## admin. blue-collar entrepreneur housemaid management
## 96 226 45 25 221
## retired self-employed services student technician
## 44 36 82 8 183
## unemployed unknown
## 31 3
barplot(freq)
To obtain a list of a factor variable’s levels, use
levels().
maritallevel <- levels(dt$marital)
maritallevel
## [1] "divorced" "married" "single"
Stacked Bar Plot with Colors and Legend
counts <- table(dt$marital, dt$loan)
barplot(counts, col=c("blue","red","green"),
legend = rownames(counts), xlab="loan")
Grouped Bar Plot
counts <- table(dt$marital, dt$loan)
barplot(counts,col=c("blue","red","green"),
legend = rownames(counts), beside=TRUE,xlab="loan")
counts <- table(dt$marital)
pie(counts, main = "Marital pie chart")
To display a histogram, use the hist() function.
hist(dt$age)
Sometimes, you want to change the title of the plot, and also the
labels of the horizontal axis and vertical axis. To do that, set the
parameters main, xlab and ylab.
hist(dt$age, main = "Histogram of age", xlab = "age", ylab = "counts")
To change the number of bins, set the breaks parameter.
However, R will decide on a number of bins close to the
number that you specify. For example, the following code displays a
histogram with the number of bins close to 20:
hist(dt$age, breaks = 20,
main = "Histogram of age",
xlab = "age",
ylab = "counts")
To display a stem-and-leaf plot, use the stem() function.
stem(dt$duration)
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 18890113333334455566666667788899
## 2 | 1333333344555577888901111244456667777788888889999
## 4 | 00000011112222222333333334444445555566666666777777777888888888888899+122
## 6 | 00000000001111111122222222222223333333333333334444444444444444445555+160
## 8 | 00000000001111111111222223333333444444444455555555555556666666666677+103
## 10 | 00000000011111112222222233333333333344444444455555566666666667788888+54
## 12 | 00000111111112233344444444555667778899900011122333344444445556666677
## 14 | 0001122344455556777899900133445556667788
## 16 | 012245667899022234899
## 18 | 1346778885788
## 20 | 2239
## 22 | 7
## 24 | 62
## 26 | 257
## 28 |
## 30 | 98
## 32 | 5
## 34 |
## 36 |
## 38 | 8
#compare with its histogram
To display a boxplot, use the boxplot() function.
boxplot(dt$age,main="Boxplot of age")
To draw it horizontally, set horizontal = TRUE.
boxplot(dt$age, horizontal = TRUE)
You can display boxplots by a grouping variable.
boxplot(dt$age~dt$loan,xlab="loan",ylab="age")
boxplot(dt$age~dt$job,xlab="job",ylab="age")
Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.
Suppose that you have two variables: var1 and
var2. To display a scatter plot between var1
and var2, use plot(x = var1, y = var2).
plot(x = dt$age, y = dt$balance,xlab="age",ylab="balance")
Perform the following graphical data presentation for the bank_data.csv dataset:
## 2. Multiple linear regression analysis
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("SalesPrice.csv",header=TRUE)
data
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location #qualitative independent variable with k=3 groups
summary(y)
hist(y,col = "darkorange",
border = "dodgerblue")
shapiro.test(y)
summary(x1)
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) #Compute the r between two quantitative variables
boxplot(y~x2,xlab="Location",ylab="Sale price")
NOTE: It is easier to work with graphs when using ggplot
data_df = data.frame(data) # set data as dataframe type
library(ggplot2) # you might need to install package "ggplot2" first!
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
geom_point(aes(color=Location))
ggplot(data_df,aes(y=SalesPrice,x=Location))+
geom_boxplot(aes(color=Location))
#create x2.dummy
x2.dummy= c() #create null vector
for(i in 1:length(data$Location)){
if(data$Location[i]=="Street") x2.dummy[i] = 1
else x2.dummy[i] = 0
}
x2.dummy
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
if(data$Location[i]=="Mall") x3.dummy[i] = 1
else x3.dummy[i] = 0
}
x3.dummy
## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
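As a shorter, equivalent sketch, a logical comparison can be coerced to 0/1 directly instead of using a loop:
# alternative: coerce logical comparisons to 0/1
x2.dummy <- as.numeric(data$Location == "Street")
x3.dummy <- as.numeric(data$Location == "Mall")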
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)
#The R function vif() [car package] can be used to detect multicollinearity in a regression model:
car::vif(full)
# Forward selection
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
fw.fit
# Backward elimination
be.fit = step(full,direction="backward")
be.fit
# Stepwise regression
sw.fit = step(full, direction="both")
sw.fit
summary(sw.fit)
anova(sw.fit)
confint.lm(sw.fit)
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat
plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))
par(mfrow=c(2,2)) #set plot layout as 2 rows by 2 columns
plot(sw.fit)
#get the residuals
res = resid(sw.fit)
par(mfrow=c(1,2)) #set plot layout as 1 row by 2 columns
plot(res)
hist(res)
shapiro.test(res)
# Performs the Durbin-Watson test for autocorrelation of disturbances.
lmtest::dwtest(sw.fit)
# Test the constant variance assumption: Breusch-Pagan Test
lmtest::bptest(sw.fit)
# Load the packages
library(aod) #for 'Wald test'
library(MASS)
library(rms) #for 'lrm'
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
library(pscl) #to get pseudo R2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("StudyHours.csv",header=TRUE)
head(data)
## Obs Hours Pass
## 1 1 0.50 0
## 2 2 0.75 0
## 3 3 1.00 0
## 4 4 1.25 0
## 5 5 1.50 0
## 6 6 1.75 0
y = data$Pass
x = data$Hours
# Generating the frequency table
table(y)
## y
## 0 1
## 10 10
# 1. Building the model
fit.logist = glm(y ~x,family=binomial(link='logit'))
fit.logist
##
## Call: glm(formula = y ~ x, family = binomial(link = "logit"))
##
## Coefficients:
## (Intercept) x
## -4.078 1.505
##
## Degrees of Freedom: 19 Total (i.e. Null); 18 Residual
## Null Deviance: 27.73
## Residual Deviance: 16.06 AIC: 20.06
anova(fit.logist)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 19 27.726
## x 1 11.666 18 16.060
summary(fit.logist)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0777 1.7610 -2.316 0.0206 *
## x 1.5046 0.6287 2.393 0.0167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.726 on 19 degrees of freedom
## Residual deviance: 16.060 on 18 degrees of freedom
## AIC: 20.06
##
## Number of Fisher Scoring iterations: 5
# 2. Perform overall fit test
mod <- lrm(y ~x, data = data)
mod
## Logistic Regression Model
##
## lrm(formula = y ~ x, data = data)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 20 LR chi2 11.67 R2 0.589 C 0.895
## 0 10 d.f. 1 R2(1,20) 0.413 Dxy 0.790
## 1 10 Pr(> chi2) 0.0006 R2(1,15) 0.509 gamma 0.798
## max |deriv| 1e-07 Brier 0.137 tau-a 0.416
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.0777 1.7610 -2.32 0.0206
## x 1.5046 0.6287 2.39 0.0167
# 3. Perform Wald test:
#wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 1)
wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.7, df = 1, P(> X2) = 0.017
# 4. Perform likelihood ratio (LR) test
null <- glm(y ~ 1, data = data,family="binomial")
summary(null)
##
## Call:
## glm(formula = y ~ 1, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.930e-17 4.472e-01 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.726 on 19 degrees of freedom
## Residual deviance: 27.726 on 19 degrees of freedom
## AIC: 29.726
##
## Number of Fisher Scoring iterations: 2
full <- glm(y ~ x, data = data,family="binomial")
summary(full)
##
## Call:
## glm(formula = y ~ x, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0777 1.7610 -2.316 0.0206 *
## x 1.5046 0.6287 2.393 0.0167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.726 on 19 degrees of freedom
## Residual deviance: 16.060 on 18 degrees of freedom
## AIC: 20.06
##
## Number of Fisher Scoring iterations: 5
lrtest(null, full)
##
## Model 1: y ~ 1
## Model 2: y ~ x
##
## L.R. Chisq d.f. P
## 1.166613e+01 1.000000e+00 6.364826e-04
# 5. get LL and pseudo R2
logLik(full)
## 'log Lik.' -8.029878 (df=2)
#To easily get a McFadden's pseudo R2 for a fitted model in R, use the "pscl" package by Simon Jackman and use the pR2 command.
pR2(full)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -8.0298785 -13.8629436 11.6661303 0.4207667 0.4419499 0.5892665
mod1b <- lrm(y ~ x, data = data)
mod1b
## Logistic Regression Model
##
## lrm(formula = y ~ x, data = data)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 20 LR chi2 11.67 R2 0.589 C 0.895
## 0 10 d.f. 1 R2(1,20) 0.413 Dxy 0.790
## 1 10 Pr(> chi2) 0.0006 R2(1,15) 0.509 gamma 0.798
## max |deriv| 1e-07 Brier 0.137 tau-a 0.416
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.0777 1.7610 -2.32 0.0206
## x 1.5046 0.6287 2.39 0.0167
## CIs using profiled log-likelihood
confint(fit.logist)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.5853315 -1.268378
## x 0.5293557 3.145138
## CIs using standard errors
confint.default(fit.logist)
## 2.5 % 97.5 %
## (Intercept) -7.5291793 -0.6262476
## x 0.2723838 2.7369071
logLik(fit.logist)
## 'log Lik.' -8.029878 (df=2)
## odds ratios only
exp(coef(fit.logist))
## (Intercept) x
## 0.01694617 4.50255687
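Interpretation: exp(1.5046) ≈ 4.50, so each additional hour of study multiplies the estimated odds of passing the exam by about 4.5.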
## odds ratios and 95% CI
exp(cbind(OR = coef(fit.logist), confint(fit.logist)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.01694617 0.0001868263 0.2812875
## x 4.50255687 1.6978380343 23.2228735
## 6. Prediction with SE
pred_logit = predict(fit.logist, newdata = list(x=x), type = "link",se = TRUE)
pred_logit
## $fit
## 1 2 3 4 5 6 7
## -3.3253907 -2.9492294 -2.5730680 -2.1969066 -1.8207453 -1.4445839 -1.4445839
## 8 9 10 11 12 13 14
## -1.0684226 -0.6922612 -0.3160999 0.0600615 0.4362229 0.8123842 1.1885456
## 15 16 17 18 19 20
## 1.9408683 2.3170296 2.6931910 3.0693524 3.4455137 4.1978364
##
## $se.fit
## 1 2 3 4 5 6 7 8
## 1.4715288 1.3310441 1.1947260 1.0641770 0.9417992 0.8312096 0.8312096 0.7377287
## 9 10 11 12 13 14 15 16
## 0.6685718 0.6317781 0.6330161 0.6720757 0.7430159 0.8377769 1.0722391 1.2032148
## 17 18 19 20
## 1.3398379 1.4805458 1.6242772 1.9180797
##
## $residual.scale
## [1] 1
pred = predict(fit.logist, newdata = list(x=x), type = "response",se = TRUE)
pred
## $fit
## 1 2 3 4 5 6 7
## 0.03471034 0.04977295 0.07089196 0.10002862 0.13934447 0.19083650 0.19083650
## 8 9 10 11 12 13 14
## 0.25570318 0.33353024 0.42162653 0.51501086 0.60735865 0.69261733 0.76648084
## 15 16 17 18 19 20
## 0.87444750 0.91027764 0.93662366 0.95561071 0.96909707 0.98519444
##
## $se.fit
## 1 2 3 4 5 6 7
## 0.04930435 0.06295253 0.07869217 0.09580030 0.11294771 0.12835367 0.12835367
## 8 9 10 11 12 13 14
## 0.14040383 0.14861538 0.15406389 0.15811139 0.16027266 0.15818702 0.14995198
## 15 16 17 18 19 20
## 0.11772013 0.09826927 0.07953248 0.06280310 0.04864376 0.02797779
##
## $residual.scale
## [1] 1
pred_prob = 1/(1+exp(-pred_logit$fit))
pred_prob
## 1 2 3 4 5 6 7
## 0.03471034 0.04977295 0.07089196 0.10002862 0.13934447 0.19083650 0.19083650
## 8 9 10 11 12 13 14
## 0.25570318 0.33353024 0.42162653 0.51501086 0.60735865 0.69261733 0.76648084
## 15 16 17 18 19 20
## 0.87444750 0.91027764 0.93662366 0.95561071 0.96909707 0.98519444
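The manual inverse-logit transform above reproduces the type = "response" predictions; base R's plogis() (the standard logistic CDF) computes the same thing, so an equivalent sketch is:
pred_prob2 <- plogis(pred_logit$fit) # plogis(z) = 1/(1+exp(-z))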
LCI_pred = 1/(1+exp(-(pred_logit$fit-1.96*pred_logit$se.fit)))
UCI_pred = 1/(1+exp(-(pred_logit$fit+1.96*pred_logit$se.fit)))
plot(x,pred_prob,type="o",ylim=c(-0.1,1.1),xlab="Number of Study Hours",ylab="Prob of passing exam",col="blue")
points(x,y)
lines(x,LCI_pred,col="red",lty=2)
lines(x,UCI_pred,col="red",lty=2)
abline(h=0.5,col="gray")
#create data.frame
dtf = data.frame( y =y ,
x =x ,
pred_logit = pred_logit,
pred_prob = pred_prob,
LCI_pred = LCI_pred,
UCI_pred = UCI_pred
)
dtf
## y x pred_logit.fit pred_logit.se.fit pred_logit.residual.scale pred_prob
## 1 0 0.50 -3.3253907 1.4715288 1 0.03471034
## 2 0 0.75 -2.9492294 1.3310441 1 0.04977295
## 3 0 1.00 -2.5730680 1.1947260 1 0.07089196
## 4 0 1.25 -2.1969066 1.0641770 1 0.10002862
## 5 0 1.50 -1.8207453 0.9417992 1 0.13934447
## 6 0 1.75 -1.4445839 0.8312096 1 0.19083650
## 7 1 1.75 -1.4445839 0.8312096 1 0.19083650
## 8 0 2.00 -1.0684226 0.7377287 1 0.25570318
## 9 1 2.25 -0.6922612 0.6685718 1 0.33353024
## 10 0 2.50 -0.3160999 0.6317781 1 0.42162653
## 11 1 2.75 0.0600615 0.6330161 1 0.51501086
## 12 0 3.00 0.4362229 0.6720757 1 0.60735865
## 13 1 3.25 0.8123842 0.7430159 1 0.69261733
## 14 0 3.50 1.1885456 0.8377769 1 0.76648084
## 15 1 4.00 1.9408683 1.0722391 1 0.87444750
## 16 1 4.25 2.3170296 1.2032148 1 0.91027764
## 17 1 4.50 2.6931910 1.3398379 1 0.93662366
## 18 1 4.75 3.0693524 1.4805458 1 0.95561071
## 19 1 5.00 3.4455137 1.6242772 1 0.96909707
## 20 1 5.50 4.1978364 1.9180797 1 0.98519444
## LCI_pred UCI_pred
## 1 0.002006035 0.3914564
## 2 0.003841377 0.4157164
## 3 0.007284242 0.4424055
## 4 0.013617431 0.4722486
## 5 0.024924302 0.5062950
## 6 0.044202920 0.5460161
## 7 0.044202920 0.5460161
## 8 0.074856857 0.5932762
## 9 0.118923711 0.6497953
## 10 0.174453851 0.7154871
## 11 0.234935337 0.7859651
## 12 0.292961171 0.8523926
## 13 0.344359378 0.9062510
## 14 0.388529610 0.9443065
## 15 0.459906172 0.9827485
## 16 0.489683624 0.9907637
## 17 0.516770898 0.9951275
## 18 0.541773065 0.9974554
## 19 0.565110751 0.9986804
## 20 0.607877767 0.9996500
#create another variable called "pred_pass"
# if pred_prob <= 0.5 , pred_pass = 0
# if pred_prob > 0.5 , pred_pass = 1
#create pred_pass
pred_pass= c() #create null vector
for(i in 1:length(y)){
if(pred_prob[i] <= 0.5) pred_pass[i] = 0
else pred_pass[i] = 1
}
pred_pass
## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
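Equivalently, the loop can be replaced by a vectorized comparison (a one-line sketch):
pred_pass <- as.numeric(pred_prob > 0.5) # TRUE/FALSE coerced to 1/0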
#add pred_pass in dtf
dtf$pred_pass = pred_pass
dtf
## y x pred_logit.fit pred_logit.se.fit pred_logit.residual.scale pred_prob
## 1 0 0.50 -3.3253907 1.4715288 1 0.03471034
## 2 0 0.75 -2.9492294 1.3310441 1 0.04977295
## 3 0 1.00 -2.5730680 1.1947260 1 0.07089196
## 4 0 1.25 -2.1969066 1.0641770 1 0.10002862
## 5 0 1.50 -1.8207453 0.9417992 1 0.13934447
## 6 0 1.75 -1.4445839 0.8312096 1 0.19083650
## 7 1 1.75 -1.4445839 0.8312096 1 0.19083650
## 8 0 2.00 -1.0684226 0.7377287 1 0.25570318
## 9 1 2.25 -0.6922612 0.6685718 1 0.33353024
## 10 0 2.50 -0.3160999 0.6317781 1 0.42162653
## 11 1 2.75 0.0600615 0.6330161 1 0.51501086
## 12 0 3.00 0.4362229 0.6720757 1 0.60735865
## 13 1 3.25 0.8123842 0.7430159 1 0.69261733
## 14 0 3.50 1.1885456 0.8377769 1 0.76648084
## 15 1 4.00 1.9408683 1.0722391 1 0.87444750
## 16 1 4.25 2.3170296 1.2032148 1 0.91027764
## 17 1 4.50 2.6931910 1.3398379 1 0.93662366
## 18 1 4.75 3.0693524 1.4805458 1 0.95561071
## 19 1 5.00 3.4455137 1.6242772 1 0.96909707
## 20 1 5.50 4.1978364 1.9180797 1 0.98519444
## LCI_pred UCI_pred pred_pass
## 1 0.002006035 0.3914564 0
## 2 0.003841377 0.4157164 0
## 3 0.007284242 0.4424055 0
## 4 0.013617431 0.4722486 0
## 5 0.024924302 0.5062950 0
## 6 0.044202920 0.5460161 0
## 7 0.044202920 0.5460161 0
## 8 0.074856857 0.5932762 0
## 9 0.118923711 0.6497953 0
## 10 0.174453851 0.7154871 0
## 11 0.234935337 0.7859651 1
## 12 0.292961171 0.8523926 1
## 13 0.344359378 0.9062510 1
## 14 0.388529610 0.9443065 1
## 15 0.459906172 0.9827485 1
## 16 0.489683624 0.9907637 1
## 17 0.516770898 0.9951275 1
## 18 0.541773065 0.9974554 1
## 19 0.565110751 0.9986804 1
## 20 0.607877767 0.9996500 1
#Classification table
xtabs(~ y + pred_pass, data = dtf)
## pred_pass
## y 0 1
## 0 8 2
## 1 2 8
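From this table, the overall accuracy is the proportion of cases on the diagonal; a quick sketch:
# overall classification accuracy = correct predictions / total
tab <- xtabs(~ y + pred_pass, data = dtf)
sum(diag(tab)) / sum(tab)
## [1] 0.8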
A researcher is interested in how variables such as GRE (Graduate
Record Exam scores), GPA (grade point average) and prestige of the
undergraduate institution affect admission into graduate school. The
response variable, admit/don’t admit, is a binary variable.
# Load the packages
library(aod)
library(MASS)
library(rms)
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data2 = read.csv("Admission.csv",header=TRUE)
head(data2)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
summary(data2)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(data2, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
xtabs(~admit + rank, data = data2)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
#Using the logit model
data2$rank <- factor(data2$rank)
#Perform variable selection
null <- glm(admit ~ 1, data = data2,family="binomial")
full <- glm(admit ~ gre + gpa + rank, data = data2,family="binomial")
forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
## Start: AIC=501.98
## admit ~ 1
##
## Df Deviance AIC
## + rank 3 474.97 482.97
## + gre 1 486.06 490.06
## + gpa 1 486.97 490.97
## <none> 499.98 501.98
##
## Step: AIC=482.97
## admit ~ rank
##
## Df Deviance AIC
## + gpa 1 462.88 472.88
## + gre 1 464.53 474.53
## <none> 474.97 482.97
##
## Step: AIC=472.88
## admit ~ rank + gpa
##
## Df Deviance AIC
## + gre 1 458.52 470.52
## <none> 462.88 472.88
##
## Step: AIC=470.52
## admit ~ rank + gpa + gre
backwards = step(full)
## Start: AIC=470.52
## admit ~ gre + gpa + rank
##
## Df Deviance AIC
## <none> 458.52 470.52
## - gre 1 462.88 472.88
## - gpa 1 464.53 474.53
## - rank 3 480.34 486.34
bothways = step(null, list(lower=null,upper=full),direction="both")
## Start: AIC=501.98
## admit ~ 1
##
## Df Deviance AIC
## + rank 3 474.97 482.97
## + gre 1 486.06 490.06
## + gpa 1 486.97 490.97
## <none> 499.98 501.98
##
## Step: AIC=482.97
## admit ~ rank
##
## Df Deviance AIC
## + gpa 1 462.88 472.88
## + gre 1 464.53 474.53
## <none> 474.97 482.97
## - rank 3 499.98 501.98
##
## Step: AIC=472.88
## admit ~ rank + gpa
##
## Df Deviance AIC
## + gre 1 458.52 470.52
## <none> 462.88 472.88
## - gpa 1 474.97 482.97
## - rank 3 486.97 490.97
##
## Step: AIC=470.52
## admit ~ rank + gpa + gre
##
## Df Deviance AIC
## <none> 458.52 470.52
## - gre 1 462.88 472.88
## - gpa 1 464.53 474.53
## - rank 3 480.34 486.34
mod1b <- lrm(admit ~ gre + gpa + rank, data = data2)
mod1b
## Logistic Regression Model
##
## lrm(formula = admit ~ gre + gpa + rank, data = data2)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 400 LR chi2 41.46 R2 0.138 C 0.693
## 0 273 d.f. 5 R2(5,400)0.087 Dxy 0.386
## 1 127 Pr(> chi2) <0.0001 R2(5,260)0.131 gamma 0.386
## max |deriv| 2e-06 Brier 0.195 tau-a 0.168
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -3.9900 1.1400 -3.50 0.0005
## gre 0.0023 0.0011 2.07 0.0385
## gpa 0.8040 0.3318 2.42 0.0154
## rank=2 -0.6754 0.3165 -2.13 0.0328
## rank=3 -1.3402 0.3453 -3.88 0.0001
## rank=4 -1.5515 0.4178 -3.71 0.0002
#lrtest
fit <- glm(admit ~ gre + gpa + rank, data = data2, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = data2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
confint(fit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
confint.default(fit)
## 2.5 % 97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre 0.0001202298 0.004408622
## gpa 0.1536836760 1.454391423
## rank2 -1.2957512650 -0.055134591
## rank3 -2.0169920597 -0.663415773
## rank4 -2.3703986294 -0.732528724
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
pR2(fit)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -229.25874624 -249.98825878 41.45902508 0.08292194 0.09845702
## r2CU
## 0.13799580
lrm(admit ~ gre + gpa + rank, data = data2)
## Logistic Regression Model
##
## lrm(formula = admit ~ gre + gpa + rank, data = data2)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 400 LR chi2 41.46 R2 0.138 C 0.693
## 0 273 d.f. 5 R2(5,400)0.087 Dxy 0.386
## 1 127 Pr(> chi2) <0.0001 R2(5,260)0.131 gamma 0.386
## max |deriv| 2e-06 Brier 0.195 tau-a 0.168
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -3.9900 1.1400 -3.50 0.0005
## gre 0.0023 0.0011 2.07 0.0385
## gpa 0.8040 0.3318 2.42 0.0154
## rank=2 -0.6754 0.3165 -2.13 0.0328
## rank=3 -1.3402 0.3453 -3.88 0.0001
## rank=4 -1.5515 0.4178 -3.71 0.0002
newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
pred.response <- predict(fit, newdata = newdata1, type = "response")
head(pred.response)
## 1 2 3 4 5 6
## 0.1726265 0.2921750 0.7384082 0.1783846 0.1183539 0.3699699
newdata1$rankP <- predict(fit, newdata = newdata1, type = "response")
head(newdata1)
## gre gpa rank rankP
## 1 380 3.61 3 0.1726265
## 2 660 3.67 3 0.2921750
## 3 800 4.00 1 0.7384082
## 4 640 3.19 4 0.1783846
## 5 520 2.93 4 0.1183539
## 6 760 3.00 2 0.3699699
# Load the packages
library(aod) #for 'Wald test'
library(MASS)
library(rms) #for 'lrm'
library(pscl) #to get pseudo R2
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
data3 = read.csv("BuyingCosmetics.csv",header=TRUE)
head(data3)
## Obs Buying Age Income
## 1 1 1 40 15
## 2 2 1 45 27
## 3 3 0 20 8
## 4 4 1 30 17
## 5 5 1 55 45
## 6 6 0 15 3
summary(data3)
## Obs Buying Age Income
## Min. : 1.00 Min. :0.0000 Min. :15.00 Min. : 3.00
## 1st Qu.:15.75 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:12.00
## Median :30.50 Median :1.0000 Median :30.00 Median :20.00
## Mean :30.50 Mean :0.5333 Mean :32.35 Mean :21.63
## 3rd Qu.:45.25 3rd Qu.:1.0000 3rd Qu.:40.00 3rd Qu.:30.00
## Max. :60.00 Max. :1.0000 Max. :60.00 Max. :51.00
#Using the logit model
#Perform variable selection
null <- glm(Buying ~ 1, data = data3,family="binomial")
full <- glm(Buying ~ Age + Income, data = data3,family="binomial")
summary(full)
##
## Call:
## glm(formula = Buying ~ Age + Income, family = "binomial", data = data3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.56798 1.22985 -3.714 0.000204 ***
## Age 0.16265 0.05200 3.128 0.001761 **
## Income -0.01801 0.04232 -0.426 0.670439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82.911 on 59 degrees of freedom
## Residual deviance: 59.067 on 57 degrees of freedom
## AIC: 65.067
##
## Number of Fisher Scoring iterations: 5
lrtest(null, full)
##
## Model 1: Buying ~ 1
## Model 2: Buying ~ Age + Income
##
## L.R. Chisq d.f. P
## 2.384415e+01 2.000000e+00 6.642161e-06
wald.test(b = coef(full), Sigma = vcov(full), Terms = 1)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.8, df = 1, P(> X2) = 2e-04
wald.test(b = coef(full), Sigma = vcov(full), Terms = 2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 9.8, df = 1, P(> X2) = 0.0018
wald.test(b = coef(full), Sigma = vcov(full), Terms = 3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.18, df = 1, P(> X2) = 0.67
mod = lrm(Buying ~ Age + Income, data = data3)
mod
## Logistic Regression Model
##
## lrm(formula = Buying ~ Age + Income, data = data3)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 60 LR chi2 23.84 R2 0.438 C 0.836
## 0 28 d.f. 2 R2(2,60) 0.305 Dxy 0.672
## 1 32 Pr(> chi2) <0.0001 R2(2,44.8)0.386 gamma 0.673
## max |deriv| 1e-08 Brier 0.167 tau-a 0.340
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.5680 1.2299 -3.71 0.0002
## Age 0.1626 0.0520 3.13 0.0018
## Income -0.0180 0.0423 -0.43 0.6704
-2*logLik(full)
## 'log Lik.' 59.06665 (df=3)
#To easily get a McFadden's pseudo R2 for a fitted model in R, use the "pscl" package by Simon Jackman and use the pR2 command.
pR2(full)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -29.5333253 -41.4553986 23.8441464 0.2875880 0.3279365 0.4378993
#R2 McFadden
1-as.numeric(logLik(full))/as.numeric(logLik(null))
## [1] 0.287588
#r2 Cox and snell
n=dim(data3)[1]
r2_cs = 1-(as.numeric(exp(logLik(null))/exp(logLik(full))))^(2/n)
r2_cs
## [1] 0.3279365
#r2 Nagelkerki
r2_N = r2_cs/(1-as.numeric(exp(logLik(null)))^(2/n))
r2_N
## [1] 0.4378993
#forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
#backwards = step(full)
bothways = step(null, list(lower=null,upper=full),direction="both")
## Start: AIC=84.91
## Buying ~ 1
##
## Df Deviance AIC
## + Age 1 59.250 63.250
## + Income 1 73.593 77.593
## <none> 82.911 84.911
##
## Step: AIC=63.25
## Buying ~ Age
##
## Df Deviance AIC
## <none> 59.250 63.250
## + Income 1 59.067 65.067
## - Age 1 82.911 84.911
summary(bothways)
##
## Call:
## glm(formula = Buying ~ Age, family = "binomial", data = data3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.51528 1.22025 -3.700 0.000215 ***
## Age 0.14847 0.03878 3.828 0.000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82.911 on 59 degrees of freedom
## Residual deviance: 59.250 on 58 degrees of freedom
## AIC: 63.25
##
## Number of Fisher Scoring iterations: 5
confint(bothways)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -7.2076248 -2.3490179
## Age 0.0805232 0.2349225
confint.default(bothways)
## 2.5 % 97.5 %
## (Intercept) -6.9069200 -2.123635
## Age 0.0724545 0.224479
## odds ratios and 95% CI
exp(cbind(OR = coef(bothways), confint(bothways)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.01094057 0.0007409149 0.09546287
## Age 1.16005425 1.0838539957 1.26481079
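Interpretation: the odds ratio for Age is about 1.16, so each additional year of age multiplies the estimated odds of buying by about 1.16 (a 16% increase).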
newdata1 <- with(data3, data.frame(Age=Age, Income=Income))
pred_response <- predict(bothways, newdata = newdata1, type = "response")
head(pred_response)
## 1 2 3 4 5 6
## 0.80586986 0.89712932 0.17567316 0.48468624 0.97467705 0.09209862
mod1b <- lrm(Buying ~ Age, data = data3)
mod1b
## Logistic Regression Model
##
## lrm(formula = Buying ~ Age, data = data3)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 60 LR chi2 23.66 R2 0.435 C 0.827
## 0 28 d.f. 1 R2(1,60) 0.315 Dxy 0.654
## 1 32 Pr(> chi2) <0.0001 R2(1,44.8)0.397 gamma 0.677
## max |deriv| 3e-09 Brier 0.167 tau-a 0.331
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.5153 1.2202 -3.70 0.0002
## Age 0.1485 0.0388 3.83 0.0001
## odds ratios and 95% CI
#exp(cbind(OR = coef(fit_logit_full), confint(fit_logit_full)))
#newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
#pred_response <- predict(fit_logit_full, newdata = newdata1, type = "response")
#head(pred_response)
1. Import data
library(readr)
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
dt <- read_csv("product.csv")
## Rows: 400 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): product, gender
## dbl (5): obs, age, household, position_level, absent
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dt)
## spc_tbl_ [400 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ obs : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
## $ product : chr [1:400] "C" "A" "C" "A" ...
## $ age : num [1:400] 57 21 66 36 23 31 37 37 55 66 ...
## $ household : num [1:400] 2 7 7 4 0 5 3 0 3 2 ...
## $ position_level: num [1:400] 2 2 2 2 2 1 3 3 3 4 ...
## $ gender : chr [1:400] "Male" "Male" "Male" "Female" ...
## $ absent : num [1:400] 10 7 1 6 11 14 12 25 3 18 ...
## - attr(*, "spec")=
## .. cols(
## .. obs = col_double(),
## .. product = col_character(),
## .. age = col_double(),
## .. household = col_double(),
## .. position_level = col_double(),
## .. gender = col_character(),
## .. absent = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dt$product <- as.factor(dt$product)
dt$position_level <- as.factor(dt$position_level)
dt$gender <- as.factor(dt$gender)
str(dt)
## spc_tbl_ [400 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ obs : num [1:400] 1 2 3 4 5 6 7 8 9 10 ...
## $ product : Factor w/ 3 levels "A","B","C": 3 1 3 1 1 1 1 2 3 2 ...
## $ age : num [1:400] 57 21 66 36 23 31 37 37 55 66 ...
## $ household : num [1:400] 2 7 7 4 0 5 3 0 3 2 ...
## $ position_level: Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 1 3 3 3 4 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 1 1 ...
## $ absent : num [1:400] 10 7 1 6 11 14 12 25 3 18 ...
## - attr(*, "spec")=
## .. cols(
## .. obs = col_double(),
## .. product = col_character(),
## .. age = col_double(),
## .. household = col_double(),
## .. position_level = col_double(),
## .. gender = col_character(),
## .. absent = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(dt)
## obs product age household position_level
## Min. : 1.0 A:142 Min. :21.00 Min. :0.000 1: 47
## 1st Qu.:100.8 B:117 1st Qu.:30.00 1st Qu.:2.000 2: 99
## Median :200.5 C:141 Median :37.00 Median :3.000 3:135
## Mean :200.5 Mean :41.05 Mean :3.212 4: 64
## 3rd Qu.:300.2 3rd Qu.:53.00 3rd Qu.:5.000 5: 55
## Max. :400.0 Max. :67.00 Max. :7.000
## gender absent
## Female:232 Min. : 0.00
## Male :168 1st Qu.: 7.00
## Median :14.00
## Mean :14.17
## 3rd Qu.:21.00
## Max. :31.00
boxplot(age~product,data=dt)
table(dt$product,dt$position_level)
##
## 1 2 3 4 5
## A 14 36 44 28 20
## B 20 29 39 13 16
## C 13 34 52 23 19
plot(dt$age,dt$absent)
4. Set the base group for the y variable
dt$product <- relevel(dt$product, ref = "A")
#Important packages
library(MASS) # for multinom
library(rms) # for lrt
library(pscl) #to get pseudo R2
#package "nnet"
library(nnet)
null <- multinom(product~1,data=dt)
## # weights: 6 (2 variable)
## initial value 439.444915
## final value 437.908882
## converged
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
## # weights: 30 (18 variable)
## initial value 439.444915
## iter 10 value 268.455865
## iter 20 value 208.024202
## iter 30 value 207.509074
## iter 30 value 207.509072
## iter 30 value 207.509072
## final value 207.509072
## converged
summary(full)
## Call:
## multinom(formula = product ~ age + household + position_level +
## gender + absent, data = dt)
##
## Coefficients:
## (Intercept) age household position_level2 position_level3
## B -4.727535 0.2368700 -0.9628612 -0.34360355 -1.3305304
## C -10.114127 0.2647383 0.1552594 0.01216742 -0.1824063
## position_level4 position_level5 genderMale absent
## B -1.9177254 -1.9734897 -2.39588762 0.01384059
## C -0.2653796 -0.7251654 0.09274777 -0.01208771
##
## Std. Errors:
## (Intercept) age household position_level2 position_level3
## B 1.027988 0.02864214 0.13731517 0.5999365 0.5954285
## C 1.213453 0.02960232 0.09266699 0.6027884 0.5777555
## position_level4 position_level5 genderMale absent
## B 0.7839738 0.7471631 0.4633714 0.02489143
## C 0.6440092 0.7116233 0.3694506 0.02293563
##
## Residual Deviance: 415.0181
## AIC: 451.0181
b = summary(full)$coefficients
se.b = summary(full)$standard.errors
z <- b/se.b
t(z)#transpose of Z
## B C
## (Intercept) -4.5988214 -8.33500061
## age 8.2699833 8.94315942
## household -7.0120526 1.67545552
## position_level2 -0.5727332 0.02018522
## position_level3 -2.2345763 -0.31571534
## position_level4 -2.4461600 -0.41207432
## position_level5 -2.6413106 -1.01902993
## genderMale -5.1705554 0.25104244
## absent 0.5560382 -0.52702743
# p-value for 2-tailed z test
p <- (1 - pnorm(abs(z),mean=0,sd=1))*2
t(p)#transpose of p
## B C
## (Intercept) 4.248879e-06 0.00000000
## age 2.220446e-16 0.00000000
## household 2.348566e-12 0.09384489
## position_level2 5.668254e-01 0.98389562
## position_level3 2.544518e-02 0.75221859
## position_level4 1.443869e-02 0.68028495
## position_level5 8.258598e-03 0.30818876
## genderMale 2.333993e-07 0.80178130
## absent 5.781847e-01 0.59817454
#likelihood ratio test
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
## # weights: 30 (18 variable)
## initial value 439.444915
## iter 10 value 268.455865
## iter 20 value 208.024202
## iter 30 value 207.509074
## iter 30 value 207.509072
## iter 30 value 207.509072
## final value 207.509072
## converged
fit1 <- multinom(product~age+household+gender+absent,data=dt)
## # weights: 18 (10 variable)
## initial value 439.444915
## iter 10 value 256.830887
## iter 20 value 214.842594
## iter 30 value 214.837090
## final value 214.837084
## converged
anova(fit1,full)
## Likelihood ratio tests of Multinomial Models
##
## Response: product
## Model Resid. df Resid. Dev
## 1 age + household + gender + absent 790 429.6742
## 2 age + household + position_level + gender + absent 782 415.0181
## Test Df LR stat. Pr(Chi)
## 1
## 2 1 vs 2 8 14.65602 0.06618923
From fit1, test if “absent” should be included when the rest are in:
fit2 <- multinom(product~age+household+gender,data=dt)
## # weights: 15 (8 variable)
## initial value 439.444915
## iter 10 value 220.300742
## iter 20 value 215.255121
## final value 215.248449
## converged
anova(fit2,fit1)
## Likelihood ratio tests of Multinomial Models
##
## Response: product
## Model Resid. df Resid. Dev Test Df LR stat.
## 1 age + household + gender 792 430.4969
## 2 age + household + gender + absent 790 429.6742 1 vs 2 2 0.8227305
## Pr(Chi)
## 1
## 2 0.6627448
glmstepAIC(full,direction="both")
## Start: AIC=451.02
## product ~ age + household + position_level + gender + absent
##
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 342.197670
## iter 20 value 334.868396
## final value 334.851268
## converged
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 301.317069
## iter 20 value 276.620201
## final value 276.458307
## converged
## # weights: 18 (10 variable)
## initial value 439.444915
## iter 10 value 256.830887
## iter 20 value 214.842594
## iter 30 value 214.837090
## final value 214.837084
## converged
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 260.954489
## iter 20 value 230.815041
## final value 230.581071
## converged
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 229.652679
## iter 20 value 208.237597
## final value 208.149341
## converged
## Df AIC
## - absent 2 448.30
## - position_level 8 449.67
## <none> 451.02
## - gender 2 493.16
## - household 2 584.92
## - age 2 701.70
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 229.652679
## iter 20 value 208.237597
## final value 208.149341
## converged
##
## Step: AIC=448.3
## product ~ age + household + position_level + gender
##
## # weights: 24 (14 variable)
## initial value 439.444915
## iter 10 value 345.161008
## iter 20 value 335.869815
## final value 335.869805
## converged
## # weights: 24 (14 variable)
## initial value 439.444915
## iter 10 value 286.695124
## iter 20 value 277.917014
## final value 277.916928
## converged
## # weights: 15 (8 variable)
## initial value 439.444915
## iter 10 value 220.300742
## iter 20 value 215.255121
## final value 215.248449
## converged
## # weights: 24 (14 variable)
## initial value 439.444915
## iter 10 value 252.565848
## iter 20 value 231.441304
## final value 231.409442
## converged
## # weights: 30 (18 variable)
## initial value 439.444915
## iter 10 value 268.455865
## iter 20 value 208.024202
## iter 30 value 207.509074
## iter 30 value 207.509072
## iter 30 value 207.509072
## final value 207.509072
## converged
## Df AIC
## - position_level 8 446.50
## <none> 448.30
## + absent 2 451.02
## - gender 2 490.82
## - household 2 583.83
## - age 2 699.74
## # weights: 15 (8 variable)
## initial value 439.444915
## iter 10 value 220.300742
## iter 20 value 215.255121
## final value 215.248449
## converged
##
## Step: AIC=446.5
## product ~ age + household + gender
##
## # weights: 12 (6 variable)
## initial value 439.444915
## iter 10 value 341.114844
## final value 340.866118
## converged
## # weights: 12 (6 variable)
## initial value 439.444915
## iter 10 value 284.455247
## iter 20 value 283.954925
## final value 283.954920
## converged
## # weights: 12 (6 variable)
## initial value 439.444915
## iter 10 value 239.284726
## iter 20 value 238.014588
## final value 238.014521
## converged
## # weights: 27 (16 variable)
## initial value 439.444915
## iter 10 value 229.652679
## iter 20 value 208.237597
## final value 208.149341
## converged
## # weights: 18 (10 variable)
## initial value 439.444915
## iter 10 value 256.830887
## iter 20 value 214.842594
## iter 30 value 214.837090
## final value 214.837084
## converged
## Df AIC
## <none> 446.50
## + position_level 8 448.30
## + absent 2 449.67
## - gender 2 488.03
## - household 2 579.91
## - age 2 693.73
## Call:
## multinom(formula = product ~ age + household + gender, data = dt)
##
## Coefficients:
## (Intercept) age household genderMale
## B -5.229876 0.2247682 -0.9143186 -2.2522987
## C -10.339791 0.2608011 0.1518043 0.1087794
##
## Residual Deviance: 430.4969
## AIC: 446.4969
anova(null,fit2)
## Likelihood ratio tests of Multinomial Models
##
## Response: product
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 1 798 875.8178
## 2 age + household + gender 792 430.4969 1 vs 2 6 445.3209 0
summary(fit2)
## Call:
## multinom(formula = product ~ age + household + gender, data = dt)
##
## Coefficients:
## (Intercept) age household genderMale
## B -5.229876 0.2247682 -0.9143186 -2.2522987
## C -10.339791 0.2608011 0.1518043 0.1087794
##
## Std. Errors:
## (Intercept) age household genderMale
## B 0.8699672 0.02746372 0.12695214 0.4347157
## C 1.0956022 0.02854696 0.09074394 0.3624084
##
## Residual Deviance: 430.4969
## AIC: 446.4969
exp(summary(fit2)$coefficients)
## (Intercept) age household genderMale
## B 5.354189e-03 1.252032 0.4007896 0.1051572
## C 3.232109e-05 1.297969 1.1639324 1.1149163
library(pscl)
pR2(fit2)
## fitting null model for pseudo-r2
## # weights: 6 (2 variable)
## initial value 439.444915
## final value 437.908882
## converged
## llh llhNull G2 McFadden r2ML r2CU
## -215.2484490 -437.9088817 445.3208655 0.5084629 0.6715275 0.7561972
phat <- predict(fit2,newdata=dt,type="probs")
head(phat)#show the first 6 rows
## A B C
## 1 0.0057578835 0.1908607399 0.80338138
## 2 0.9755735475 0.0001023612 0.02432409
## 3 0.0003204954 0.0008306117 0.99884889
## 4 0.4628612448 0.2089157798 0.32822298
## 5 0.8980469293 0.0889180955 0.01303498
## 6 0.7961988323 0.0049228842 0.19887828
yhat <- predict(fit2,newdata=dt)
head(yhat)
## [1] C A C A A A
## Levels: A B C
15. Classification table and % accuracy
table(dt$product,yhat)
## yhat
## A B C
## A 118 15 9
## B 28 79 10
## C 16 20 105
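The % accuracy mentioned above can be computed from the table's diagonal; a quick sketch:
# % accuracy = 100 * correct predictions / total
tab <- table(dt$product, yhat)
100 * sum(diag(tab)) / sum(tab)
## [1] 75.5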
1. Import data
library(readr)
#setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
setwd("G:\\My Drive\\208429\\LabMaterials\\DataFiles_208429")
dp <- read_csv("poverty.csv")
## New names:
## Rows: 5381 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): poverty, religion, degree, country, gender dbl (2): ...1, age
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
str(dp)
## spc_tbl_ [5,381 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:5381] 1 2 3 4 5 6 7 8 9 10 ...
## $ poverty : chr [1:5381] "Too Little" "About Right" "Too Little" "Too Much" ...
## $ religion: chr [1:5381] "yes" "yes" "yes" "yes" ...
## $ degree : chr [1:5381] "no" "no" "no" "yes" ...
## $ country : chr [1:5381] "USA" "USA" "USA" "USA" ...
## $ age : num [1:5381] 44 40 36 25 39 80 48 32 74 30 ...
## $ gender : chr [1:5381] "male" "female" "female" "female" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. poverty = col_character(),
## .. religion = col_character(),
## .. degree = col_character(),
## .. country = col_character(),
## .. age = col_double(),
## .. gender = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
dp$poverty <- as.factor(dp$poverty)
dp$religion <- as.factor(dp$religion)
dp$degree <- as.factor(dp$degree)
dp$country <- as.factor(dp$country)
dp$gender <- as.factor(dp$gender)
dp$poverty <- ordered(dp$poverty,
levels =c("Too Little","About Right","Too Much"))
levels(dp$poverty)
## [1] "Too Little" "About Right" "Too Much"
#3. m0: null model
library(MASS)
m0 <- polr(poverty~1,data=dp)
#4. mf: full model
mf <- polr(poverty~religion+degree+country+age+gender,data=dp)
summary(mf)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = poverty ~ religion + degree + country + age +
## gender, data = dp)
##
## Coefficients:
## Value Std. Error t value
## religionyes 0.17973 0.077346 2.324
## degreeyes 0.14092 0.066193 2.129
## countryNorway -0.32235 0.073766 -4.370
## countrySweden -0.60330 0.079494 -7.589
## countryUSA 0.61777 0.070665 8.742
## age 0.01114 0.001561 7.139
## gendermale 0.17637 0.052972 3.329
##
## Intercepts:
## Value Std. Error t value
## Too Little|About Right 0.7298 0.1041 7.0128
## About Right|Too Much 2.5325 0.1103 22.9496
##
## Residual Deviance: 10402.59
## AIC: 10420.59
# 5. Calculate p-values
## store table
ctable <- coef(summary(mf))
##
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
## Value Std. Error t value p value
## religionyes 0.17973194 0.077346025 2.323738 2.013951e-02
## degreeyes 0.14091745 0.066193093 2.128885 3.326381e-02
## countryNorway -0.32235359 0.073766012 -4.369947 1.242765e-05
## countrySweden -0.60329785 0.079493879 -7.589237 3.217957e-14
## countryUSA 0.61777260 0.070664754 8.742302 2.284084e-18
## age 0.01114091 0.001560585 7.138935 9.405696e-13
## gendermale 0.17636863 0.052972239 3.329454 8.701647e-04
## Too Little|About Right 0.72976353 0.104061619 7.012802 2.335919e-12
## About Right|Too Much 2.53247870 0.110349763 22.949562 1.488394e-116
# 6. answer is age
# 7.Test if "age" should be included in the model when others are in
m1 <- polr(poverty~religion+degree+country+gender,data=dp)
#likelihood ratio test, use anova function
anova(mf,m1)
## Likelihood ratio tests of ordinal regression models
##
## Response: poverty
## Model Resid. df Resid. Dev Test Df
## 1 religion + degree + country + gender 5373 10453.71
## 2 religion + degree + country + age + gender 5372 10402.59 1 vs 2 1
## LR stat. Pr(Chi)
## 1
## 2 51.11977 8.689716e-13
## store table
ctable <- coef(summary(m1))
##
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
## Value Std. Error t value p value
## religionyes 0.23479820 0.07679364 3.057521 2.231758e-03
## degreeyes 0.09304962 0.06569864 1.416310 1.566848e-01
## countryNorway -0.32957463 0.07354477 -4.481279 7.419704e-06
## countrySweden -0.60346856 0.07923965 -7.615740 2.621848e-14
## countryUSA 0.66319466 0.07033120 9.429593 4.116813e-21
## gendermale 0.18646926 0.05284173 3.528826 4.174078e-04
## Too Little|About Right 0.28016816 0.08232821 3.403064 6.663467e-04
## About Right|Too Much 2.06918584 0.08824151 23.449121 1.349496e-121
#8. From m1, test the significance of "degree"
m2 <- polr(poverty~religion+country+gender,data=dp)
anova(m1,m2)
## Likelihood ratio tests of ordinal regression models
##
## Response: poverty
## Model Resid. df Resid. Dev Test Df
## 1 religion + country + gender 5374 10455.71
## 2 religion + degree + country + gender 5373 10453.71 1 vs 2 1
## LR stat. Pr(Chi)
## 1
## 2 2.001472 0.1571465
#Variable selection for logistic
library(MASS)
stepAIC(mf,direction="both")#use stepwise
## Start: AIC=10420.59
## poverty ~ religion + degree + country + age + gender
##
## Df AIC
## <none> 10421
## - degree 1 10423
## - religion 1 10424
## - gender 1 10430
## - age 1 10470
## - country 3 10666
## Call:
## polr(formula = poverty ~ religion + degree + country + age +
## gender, data = dp)
##
## Coefficients:
## religionyes degreeyes countryNorway countrySweden countryUSA
## 0.17973194 0.14091745 -0.32235359 -0.60329785 0.61777260
## age gendermale
## 0.01114091 0.17636863
##
## Intercepts:
## Too Little|About Right About Right|Too Much
## 0.7297635 2.5324787
##
## Residual Deviance: 10402.59
## AIC: 10420.59
#9.
anova(m0,mf)
## Likelihood ratio tests of ordinal regression models
##
## Response: poverty
## Model Resid. df Resid. Dev Test Df
## 1 1 5379 10740.38
## 2 religion + degree + country + age + gender 5372 10402.59 1 vs 2 7
## LR stat. Pr(Chi)
## 1
## 2 337.7841 0
#11. OR and its confidence interval (CI)
ci <- confint(mf) # 95% CI of log odd
## Waiting for profiling to be done...
##
## Re-fitting to get Hessian
exp(coef(mf)) #odds ratios
## religionyes degreeyes countryNorway countrySweden countryUSA
## 1.1968965 1.1513296 0.7244420 0.5470047 1.8547921
## age gendermale
## 1.0112032 1.1928777
exp(cbind(OR = coef(mf), ci))#95% CI of odds
## OR 2.5 % 97.5 %
## religionyes 1.1968965 1.0289511 1.3934593
## degreeyes 1.1513296 1.0110549 1.3106194
## countryNorway 0.7244420 0.6267512 0.8369294
## countrySweden 0.5470047 0.4678450 0.6389299
## countryUSA 1.8547921 1.6151018 2.1306390
## age 1.0112032 1.0081175 1.0143029
## gendermale 1.1928777 1.0752794 1.3234479
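Interpretation: for example, the odds ratio for countryUSA is about 1.85, so USA respondents have roughly 1.85 times the odds of giving a response in a higher category (toward “Too Much”), holding the other variables fixed.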
#12.pseudo R2
library(pscl) #to get pseudo R2
pR2(mf)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -5.201296e+03 -5.370188e+03 3.377841e+02 3.144993e-02 6.084382e-02
## r2CU
## 7.041132e-02
#13. getting phat
phat <- predict(mf,newdata=dp,type="probs")
head(phat)
## Too Little About Right Too Much
## 1 0.3242497 0.4200439 0.2557065
## 2 0.3744021 0.4096330 0.2159649
## 3 0.3848970 0.4065882 0.2085148
## 4 0.3805578 0.4078799 0.2115623
## 5 0.3058650 0.4218762 0.2722588
## 6 0.2770756 0.4221685 0.3007559
#getting yhat
yhat <- predict(mf,newdata=dp)
head(yhat)
## [1] About Right About Right About Right About Right About Right About Right
## Levels: Too Little About Right Too Much
#14. classification table
table(dp$poverty,yhat)
## yhat
## Too Little About Right Too Much
## Too Little 2190 518 0
## About Right 1483 379 0
## Too Much 377 434 0
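As before, overall accuracy can be read off the table's diagonal; note that the fitted model never predicts “Too Much” as the most likely category for any respondent:
# overall classification accuracy
tab <- table(dp$poverty, yhat)
sum(diag(tab)) / sum(tab)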