
Lab 1: Introduction to R and Multiple linear regression analysis

Objectives: Students will be able to use the R language to analyse data with multiple linear regression:
1. perform descriptive statistics
2. transform a qualitative independent variable into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression parameters
5. interpret the results

1. Introduction to R

1. Installing R and RStudio

R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.

To work with R, you need to download the program and install it. I recommend that you also install RStudio, a separate program that makes R easier to use.

Windows

Install R

Download the latest R installer (.exe) for Windows and install the downloaded file as you would any other Windows application.

Install RStudio

Now that R is installed, you need to download and install RStudio. First download the installer for Windows. Run the installer (.exe file) and follow the instructions.

Mac

Install R

First download the latest release (“R-version.pkg”) of R. Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio.

Install RStudio

First download the version for Mac. After downloading, double-click the file to open it, and then drag and drop it into your Applications folder.

2. Basic data types

R works with numerous data types. Some of the most basic types to get started are:

  • Decimal values like 4.5 are called numerics.
  • Whole numbers like 4 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) (e.g. “world”) values are called characters.
Numeric and integer values
# create a variable "a" that is a vector of one number
a <- 7
a = 7   # "=" also works for assignment, but "<-" is the usual convention
a
## [1] 7
# create vectors that consists of multiple numbers.
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
c <- 1:10
c
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq(from, to, by) function to create a sequence of numbers
d <- seq(1,10,0.5)
d
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.

rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Character values
x <- "Monday"
class(x)
## [1] "character"
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)
## [1] "character"
Factors

A factor is a nominal (categorical) variable with a set of known possible values called levels. Factors can be created using the as.factor() function.

yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
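
By default as.factor() sorts the levels alphabetically. If you prefer a specific order, the levels can be passed explicitly to factor(); a sketch using the y vector above:

yy2 <- factor(y, levels = y)
yy2
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Mon Tue Wed Thu Fri Sat Sun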

3. Data structure

Matrix
mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
List

A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.

lst <- list(d,yy,mat2)
lst
## [[1]]
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
## 
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
Data frame

The data.frame is commonly used for statistical data analysis in R. It is a special type of list that requires that all elements (variables) have the same length.

# Load a built-in example dataset
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth, n = 6)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Examine data structure of the dataframe
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

4. Read files

To read a file into R, you need to know the full path (directory) name and the name of the file itself.

  1. Save the .xlsx data as a .csv file
  2. Check the directory of the data file (right click -> Properties -> Location)
  3. Use read.csv() from base R (the slowest method, but it works fine for smaller datasets)
#Your directory will be different from mine! 
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
# Read data file and keep it in dt
dt <- read.csv("bank_data.csv", header = TRUE)
head(dt) # show the first six rows of the data
##   age        job marital education balance housing loan duration campaign pdays
## 1  59     admin. married secondary    2343     yes   no     1042        1    -1
## 2  56     admin. married secondary      45      no   no     1467        1    -1
## 3  41 technician married secondary    1270     yes   no     1389        1    -1
## 4  55   services married secondary    2476     yes   no      579        1    -1
## 5  54     admin. married  tertiary     184      no   no      673        2    -1
## 6  42 management  single  tertiary       0     yes  yes      562        2    -1
str(dt)  # show data structure 
## 'data.frame':    1000 obs. of  10 variables:
##  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
##  $ job      : chr  "admin." "admin." "technician" "services" ...
##  $ marital  : chr  "married" "married" "married" "married" ...
##  $ education: chr  "secondary" "secondary" "secondary" "secondary" ...
##  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
##  $ housing  : chr  "yes" "no" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "no" "no" ...
##  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
##  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...

Set “chr” variables as “factor”

dt$job <- as.factor(dt$job)
dt$marital <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing <- as.factor(dt$housing)
dt$loan <- as.factor(dt$loan)
#check data structure again
str(dt)
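
Alternatively, all character columns can be converted in one step; a sketch using base R only (assuming dt has just been read in):

chr_cols <- sapply(dt, is.character)
dt[chr_cols] <- lapply(dt[chr_cols], as.factor)
str(dt)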

5. Numerical presentation for qualitative data or categorical data

  • summary() function
  • frequency table with table() function
  • contingency table (Two-Way Frequency Tables)
  • percentage or proportion with prop.table() function
# job, marital, education, housing, loan
summary(dt)
#frequency table for one categorical variable
table(dt$job)
table(dt$housing)
prop.table(table(dt$job))
prop.table(table(dt$job))*100
#Two-Way Frequency Tables
table(dt$job,dt$education)
tb <- table(dt$loan,dt$marital)
# Add margin totals to your table. 
addmargins(tb)
#Calculate percentage
prop.table(tb)*100

6. Numerical presentation for quantitative data

  • summary() function
  • max() and min() functions
  • mean() function
  • median() function
  • There is no built-in function to compute the statistical mode in R. To compute the mode, use the getmode() function defined below.
#define function to calculate mode
getmode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}
  • range() function
  • quantile(x, probs = ...) function, where probs can be 0.25, 0.5, and 0.75 for the first, second, and third quartiles
  • IQR() function
  • sd() and var() functions
#age, balance, duration, campaign, pdays
summary(dt$age)
min(dt$age)
range(dt$age)
mean(dt$age)
getmode(dt$age)
IQR(dt$age)
sd(dt$age)
quantile(dt$age,probs = 0.25)
quantile(dt$age,probs = 0.50)
quantile(dt$age,probs = 0.75)
quantile(dt$age,probs = c(0.25,0.50,0.75))
cv_age <- sd(dt$age)/mean(dt$age)
cv_age
#cv of age by loan (yes,no)
cv1 <- sd(dt$age[dt$loan=="yes"])/mean(dt$age[dt$loan=="yes"])
cv1
cv2 <- sd(dt$age[dt$loan=="no"])/mean(dt$age[dt$loan=="no"])
cv2
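The two by-group coefficients of variation above can also be obtained in a single call with tapply(); a sketch:

# CV of age for each level of loan
tapply(dt$age, dt$loan, function(v) sd(v)/mean(v))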
  • aggregate() function is used to get the summary statistics of the data by group.
# aggregate: compute summary statistics of data subsets
aggregate(balance~loan, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=mean)

7. Graphical presentation for qualitative data or categorical data

If you have a factor variable, you can display its distribution with a frequency table or a bar graph.

You can use the good ol’ summary() on a factor variable to display a frequency table.
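
For example, a quick sketch using the job variable converted to a factor earlier:

summary(dt$job)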

Bar graph

freq <- table(dt$job)
freq
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##            96           226            45            25           221 
##       retired self-employed      services       student    technician 
##            44            36            82             8           183 
##    unemployed       unknown 
##            31             3
barplot(freq)

To obtain a list of a factor variable’s levels, use levels().

maritallevel <- levels(dt$marital)
maritallevel
## [1] "divorced" "married"  "single"

Stacked Bar Plot with Colors and Legend

counts <- table(dt$marital, dt$loan)
barplot(counts, col=c("blue","red","green"),
        legend = rownames(counts), xlab="loan")

Grouped Bar Plot

counts <- table(dt$marital, dt$loan)
barplot(counts,col=c("blue","red","green"),
        legend = rownames(counts), beside=TRUE,xlab="loan")

Pie chart

counts <- table(dt$marital)
pie(counts, main = "Marital pie chart")

8. Graphical presentation for quantitative data

Histogram

  • To display a histogram of a vector, use hist()
hist(dt$age)

Sometimes you want to change the title of the plot and the labels of the horizontal and vertical axes. To do that, set the parameters main, xlab and ylab.

hist(dt$age, main = "Histogram of age", xlab = "age", ylab = "counts")

To change the number of bins, set the breaks parameter. Note that R treats this value as a suggestion and will choose a convenient number of bins close to the number that you specify. For example, the following code displays a histogram with the number of bins close to 20:

hist(dt$age, breaks = 20, 
     main = "Histogram of age", 
     xlab = "age", 
     ylab = "counts")

Stem and leaf plot

  • To display a stem and leaf plot of a vector, use stem().
stem(dt$duration)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##    0 | 18890113333334455566666667788899
##    2 | 1333333344555577888901111244456667777788888889999
##    4 | 00000011112222222333333334444445555566666666777777777888888888888899+122
##    6 | 00000000001111111122222222222223333333333333334444444444444444445555+160
##    8 | 00000000001111111111222223333333444444444455555555555556666666666677+103
##   10 | 00000000011111112222222233333333333344444444455555566666666667788888+54
##   12 | 00000111111112233344444444555667778899900011122333344444445556666677
##   14 | 0001122344455556777899900133445556667788
##   16 | 012245667899022234899
##   18 | 1346778885788
##   20 | 2239
##   22 | 7
##   24 | 62
##   26 | 257
##   28 | 
##   30 | 98
##   32 | 5
##   34 | 
##   36 | 
##   38 | 8
#compare with its histogram 
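
A sketch of the comparison suggested in the comment above:

hist(dt$duration, main = "Histogram of duration", xlab = "duration")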

Boxplot

  • To display a boxplot of a vector, use boxplot().
boxplot(dt$age,main="Boxplot of age")

  • To display the boxplot horizontally, set the parameter horizontal = TRUE
boxplot(dt$age, horizontal = TRUE)

You can also display boxplots by a grouping variable.

boxplot(dt$age~dt$loan,xlab="loan",ylab="age")
boxplot(dt$age~dt$job,xlab="job",ylab="age")

Scatterplot

Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.

Suppose that you have two variables: var1 and var2. To display a scatter plot between var1 and var2, use plot(x = var1, y = var2).

plot(x = dt$age, y = dt$balance,xlab="age",ylab="balance")

9. Exercise

Perform the following numerical data presentation for the bank_data.csv dataset:

    1. Create a frequency table of education, with proportions
    2. Create a two-way frequency table of job by loan category, with percentages
    3. Compute summary statistics of balance
    4. Compute Q1, Q2, and Q3 of balance, and also the IQR
    5. Compute the CV of balance by type of housing
    6. Compute summary statistics of balance by housing
    7. Compute the sd of balance by housing and loan

Perform the following graphical data presentation for the bank_data.csv dataset:

    1. How would you present the education data?
    2. If you would like to examine the number of people with different education and marital status, which graph would you use to present such information?
    3. How would you present the balance data?
    4. If you would like to compare the balance between people with different education, which graph would you use to present such information?
    5. If you would like to examine the relationship between age and duration, which graph would you use to present such information?

2. Multiple linear regression analysis

  1. Import the data into R (the file must be in .csv form).
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("SalesPrice.csv",header=TRUE) 
data
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location   #qualitative independent variable with k=3 groups 
  2. Explore the data using descriptive statistics
summary(y)
hist(y,col    = "darkorange",
     border = "dodgerblue")
shapiro.test(y)
summary(x1)
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) #Compute the r between two quantitative variables
boxplot(y~x2,xlab="Location",ylab="Sale price")

NOTE: It is often easier to work with graphs using the ggplot2 package

data_df = data.frame(data) # set data as dataframe type
library(ggplot2) # you might need to install package "ggplot2" first!
## Warning: package 'ggplot2' was built under R version 4.3.1
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
  geom_point(aes(color=Location))

ggplot(data_df,aes(y=SalesPrice,x=Location))+
  geom_boxplot(aes(color=Location))

  3. Since Location is a qualitative independent variable with k=3 groups,
    we need to transform Location into dummy variables.
    We create k-1=3-1=2 dummy variables, named x2.dummy and x3.dummy.
#create x2.dummy
x2.dummy= c()   #create null vector
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Street") x2.dummy[i] = 1
  else x2.dummy[i] = 0
}
x2.dummy
##  [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Mall") x3.dummy[i] = 1
  else x3.dummy[i] = 0
}
x3.dummy
##  [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
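The loops above make the coding explicit, but the same dummy variables can be created more concisely with ifelse(); a sketch assuming the same Location coding:

x2.dummy <- ifelse(data$Location == "Street", 1, 0)
x3.dummy <- ifelse(data$Location == "Mall", 1, 0)

Note also that lm() creates dummy variables automatically when a factor is included in the model formula; coding them by hand here simply makes the dummies explicit.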
  4. Select independent variables using variable selection methods
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)

#The R function vif() [car package] can be used to detect multicollinearity in a regression model:
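# you may need to install the package first: install.packages("car")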
car::vif(full)

# Forward selection
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
fw.fit

# Backward elimination
be.fit = step(full,direction="backward")
be.fit

# Stepwise regression
sw.fit = step(full, direction="both")
sw.fit
  5. Obtain the fitted equation and the CIs of the betas
summary(sw.fit)
anova(sw.fit)
confint.lm(sw.fit)
  6. Plot the fitted values:
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat 

plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))

  7. Analyse the residuals:
par(mfrow=c(2,2))  #set plot layout as 2 row 2 column
plot(sw.fit)

#get the residuals
res = resid(sw.fit)
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

shapiro.test(res)
# Performs the Durbin-Watson test for autocorrelation of disturbances.
lmtest::dwtest(sw.fit)
# Test the constant variance assumption: Breusch-Pagan Test
lmtest::bptest(sw.fit)

Lab 2: Binary logistic regression

Task 1: Study hours and Exam

# Load the packages
library(aod)  #for 'Wald test'
## Warning: package 'aod' was built under R version 4.3.2
library(MASS)
library(rms)  #for 'lrm'
## Warning: package 'rms' was built under R version 4.3.2
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) #to get pseudo R2
## Warning: package 'pscl' was built under R version 4.3.2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("StudyHours.csv",header=TRUE) 
head(data)
y = data$Pass
x = data$Hours
# Generating the frequency table 
table(y)

Perform descriptive statistics on the data set

# TO DO CODE
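
One possible sketch, reusing the functions from Lab 1 (y and x are the vectors defined above):

summary(x)
table(y)
boxplot(x ~ y, xlab = "Pass", ylab = "Hours")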

1. Fit the binary logistic regression

fit.logist = glm(y ~x,family=binomial(link='logit'))
fit.logist
anova(fit.logist)
summary(fit.logist)

2. Perform the likelihood ratio test (LRT) for the overall fit

mod <- lrm(y ~x, data = data)
mod

3. Perform the Wald test:

wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 1)
wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 2)

4. Perform the LRT by comparing the null and full models:

#create null model
null <- glm(y ~ 1, data = data,family="binomial")
summary(null)
#create full model
full <- glm(y ~ x, data = data,family="binomial")
summary(full)
lrtest(null, full)

5. Get the log-likelihood (LL) and pseudo-R2

#To get McFadden's pseudo R2 for a fitted model in R, use the pR2() function from the "pscl" package (by Simon Jackman).
logLik(fit.logist)
pR2(fit.logist)
mod1b <- lrm(y ~ x, data = data)
mod1b

Find Confidence Intervals

confint(fit.logist)
## Waiting for profiling to be done...
#Confidence Intervals using standard errors
confint.default(fit.logist)
fit.logist

Obtain odds ratios

exp(coef(fit.logist))
## odds ratios and 95% CI
exp(cbind(OR = coef(fit.logist), confint(fit.logist)))
## Waiting for profiling to be done...

6. Prediction with standard errors

pred_logit = predict(fit.logist, newdata = list(x=x), type = "link",se = TRUE)
pred_logit

pred = predict(fit.logist, newdata = list(x=x), type = "response",se =  TRUE)
pred
pred_prob = 1/(1+exp(-pred_logit$fit))
pred_prob

LCI_pred = 1/(1+exp(-(pred_logit$fit-1.96*pred_logit$se.fit)))
UCI_pred = 1/(1+exp(-(pred_logit$fit+1.96*pred_logit$se.fit)))

#create plot
plot(x,pred_prob,type="o",ylim=c(-0.1,1.1),xlab="Number of Study Hours",ylab="Prob of passing exam",col="blue")
points(x,y)
lines(x,LCI_pred,col="red",lty=2)
lines(x,UCI_pred,col="red",lty=2)
abline(h=0.5,col="gray")

#create data.frame
dtf = data.frame( y =y ,
                  x =x ,
                  pred_logit = pred_logit,
                  pred_prob  = pred_prob,
                  LCI_pred   = LCI_pred,
                  UCI_pred   = UCI_pred
)

dtf

Create another variable called “pred_pass”

# if pred_prob <= 0.5  , pred_pass = 0
# if pred_prob > 0.5   , pred_pass = 1
#create pred_pass
pred_pass= c()   #create null vector
for(i in 1:length(y)){
  if(pred_prob[i] <= 0.5) pred_pass[i] = 0
  else pred_pass[i] = 1
}
pred_pass
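
As with the dummy variables in Lab 1, the same variable can be created in one line with ifelse(); a sketch:

pred_pass <- ifelse(pred_prob > 0.5, 1, 0)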

#add pred_pass in dtf 
dtf$pred_pass = pred_pass
dtf

Get the classification table

xtabs(~ y + pred_pass, data = dtf)

Task 2: Admission data

A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

# Load the packages
library(aod)
library(MASS)
library(rms)
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data2 = read.csv("Admission.csv",header=TRUE) 
head(data2)
summary(data2)
sapply(data2, sd)
xtabs(~admit + rank, data = data2)

#Using the logit model
data2$rank <- factor(data2$rank)
#Perform variable selection
null <- glm(admit ~ 1, data = data2,family="binomial")
full <- glm(admit ~ gre + gpa + rank, data = data2,family="binomial")

forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
backwards = step(full) 
bothways = step(null, list(lower=null,upper=full),direction="both")

mod1b <- lrm(admit ~ gre + gpa + rank, data = data2)
mod1b

#lrtest
fit <- glm(admit ~ gre + gpa + rank, data = data2, family = "binomial")
summary(fit)
confint(fit)
## Waiting for profiling to be done...
confint.default(fit)
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4:6)

pR2(fit)
lrm(admit ~ gre + gpa + rank, data = data2)

newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
pred.response <- predict(fit, newdata = newdata1, type = "response")
head(pred.response)

newdata1$rankP <- predict(fit, newdata = newdata1, type = "response")
head(newdata1)

Lab 3: Multinomial logistic regression

Task 1: Product Buying

1. Import data

library(readr)

setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
dt <- read_csv("product.csv")
## Rows: 400 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): product, gender
## dbl (5): obs, age, household, position_level, absent
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dt)

2. Set the data structure

dt$product <- as.factor(dt$product)
dt$position_level <- as.factor(dt$position_level)
dt$gender <- as.factor(dt$gender)
str(dt)

3. Perform descriptive statistics

summary(dt)
boxplot(age~product,data=dt)

table(dt$product,dt$position_level)
plot(dt$age,dt$absent)

4. Set the base (reference) group for the y variable

dt$product <- relevel(dt$product, ref = "A")

5. Fit the multinomial logistic regression

#Important packages
library(MASS)  # for stepAIC
library(rms)   # for lrm
library(pscl)  # to get pseudo R2
#package "nnet" provides multinom()
library(nnet)
null <- multinom(product~1,data=dt)
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
summary(full)

6. Calculate the z value = b/se(b), then the p-value

b = summary(full)$coefficients
se.b = summary(full)$standard.errors
z <- b/se.b
t(z)#transpose of Z
# p-value for 2-tailed z test
p <- (1 - pnorm(abs(z),mean=0,sd=1))*2
t(p)#transpose of p

7. Test whether position_level should be included when the other variables are in the model

#likelihood ratio test
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
fit1 <- multinom(product~age+household+gender+absent,data=dt)
anova(fit1,full)

8. From fit1, test whether “absent” should be included when the other variables are in the model

fit2 <- multinom(product~age+household+gender,data=dt)
anova(fit2,fit1)

9. Use stepAIC for variable selection

stepAIC(full,direction="both")

10. Overall fit test

anova(null,fit2)

11. Obtain the fitted equations

summary(fit2)

12. Get the odds ratios

exp(summary(fit2)$coefficients)

13. Get the pseudo-R2

pR2(fit2)

14. Get \(\hat{p}\) and \(\hat{y}\)

phat <- predict(fit2,newdata=dt,type="probs")
head(phat)#show the first 6 rows

yhat <- predict(fit2,newdata=dt)
head(yhat)

15. Classification table and % accuracy

table(dt$product,yhat)
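
The heading above also asks for the % accuracy; a sketch that computes it from the yhat predictions of step 14:

# percentage of observations classified correctly
mean(dt$product == yhat) * 100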

Lab 4: Ordinal logistic regression

Import data

library(readr)
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
dp <- read_csv("poverty.csv")
## New names:
## • `` -> `...1`
## Rows: 5381 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): poverty, religion, degree, country, gender
## dbl (2): ...1, age
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Set the data structure
str(dp)
dp$poverty <- as.factor(dp$poverty)
dp$religion <- as.factor(dp$religion)
dp$degree  <- as.factor(dp$degree)
dp$country <- as.factor(dp$country)
dp$gender  <- as.factor(dp$gender)
  2. Descriptive statistics; set the response as an ordered factor
dp$poverty <- ordered(dp$poverty,
                      levels =c("Too Little","About Right","Too Much"))

levels(dp$poverty)
#3.m0: null model
library(MASS)
m0 <- polr(poverty~1,data=dp)

#4. mf: full model
mf <- polr(poverty~religion+degree+country+age+gender,data=dp)
summary(mf)
## 
## Re-fitting to get Hessian
# 5.calculate p-value
## store table
ctable <- coef(summary(mf))
## 
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2

## combined table
ctable <- cbind(ctable, "p value" = p)
ctable

#
# 7.Test if "age" should be included in the model when others are in
m1 <- polr(poverty~religion+degree+country+gender,data=dp)
#likelihood ratio test, use anova function
anova(mf,m1)

## store table
ctable <- coef(summary(m1))
## 
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable

#8. From m1, test the significance of "degree"
m2 <- polr(poverty~religion+country+gender,data=dp)
anova(m1,m2)



#Variable selection for logistic
library(MASS)
stepAIC(mf,direction="both")#use stepwise

#9. Overall fit test: compare the null model m0 with the full model mf
anova(m0,mf)


#11. OR and its confidence interval (CI)
ci <- confint(mf) # 95% CI of log odds 
## Waiting for profiling to be done...
## 
## Re-fitting to get Hessian
#odds ratios
exp(coef(mf))                 # odds ratios
exp(cbind(OR = coef(mf), ci)) # 95% CI of the odds ratios

#12.pseudo R2
library(pscl) #to get pseudo R2
pR2(mf)


#13. getting phat
phat <- predict(mf,newdata=dp,type="probs")
head(phat)
#getting yhat
yhat <- predict(mf,newdata=dp)
head(yhat)

#14. classification table
table(dp$poverty,yhat)