Objectives: Students are able to use the R language to
analyse data using multiple linear regression:
1. perform descriptive statistics
2. transform a qualitative independent variable into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression parameters
5. interpret the results
R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. Users have created packages to augment the functions of the R language.
To work with R, you need to download the program and install it. I recommend that you also install RStudio. RStudio is a separate program that makes R easier to use.
On Windows, download the latest R installer (.exe) and install the downloaded file as you would any other Windows app. Now that R is installed, you need to download and install RStudio: download the RStudio installer for Windows, run the .exe file, and follow the instructions.
On Mac, first download the latest release ("R-version.pkg") of R. Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, download and install RStudio: download the version for Mac, double-click the downloaded file to open it, and then drag and drop it into your Applications folder.
R works with numerous data types. Some of the most basic types to get started are:
Decimal values like 4.5 are called numerics.
Whole numbers like 4 are called integers. Integers are also numerics.
Boolean values (TRUE or FALSE) are called logical.
# create variable "a" that is a vector of one number
a <- 7
a = 7
a
## [1] 7
# create vectors that consists of multiple numbers.
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# create a regular sequence of whole numbers
c <- 1:10
c
## [1] 1 2 3 4 5 6 7 8 9 10
# seq(from, to, by) function to create a sequence of numbers
d <- seq(1,10,0.5)
d
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
# rep() function to repeat a single number, or a sequence of numbers.
rep(99, times=5)
## [1] 99 99 99 99 99
rep(c(1,2,3), times=5)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),each = 5)
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
x <- "Monday"
class(x)
## [1] "character"
y <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
class(y)
## [1] "character"
A factor is a nominal (categorical) variable with a set of known
possible values called levels. Factors can be created using the
as.factor function.
yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
mat0 <- matrix(NA, ncol=3, nrow=2)
mat0
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
mat1 <- matrix(1:6, ncol=3, nrow=2)
mat1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
mat2 <- matrix(1:6, ncol=3, nrow=2,byrow = TRUE)
mat2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
A list is a very flexible container to store data. Each element of a list can contain any type of R object, e.g. a vector, matrix, data.frame, another list, or more complex data types.
lst <- list(d,yy,mat2)
lst
## [[1]]
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
##
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The data.frame is commonly used for statistical data
analysis in R. It is a special type of list that requires that all
elements (variables) have the same length.
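As a small illustration (not part of the lab data), a data.frame can be built by hand from vectors of equal length; the object df0 below is only for this example.
# build a small data.frame from two vectors of the same length
df0 <- data.frame(day  = c("Mon", "Tue", "Wed"),
                  temp = c(31.5, 30.2, 29.8))
df0
str(df0)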
# Loading
data(ToothGrowth)
# Print the first 6 rows
head(ToothGrowth, n = 6)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# Examine data structure of the dataframe
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
To read a file into R, you need to know the full path (directory) name and the name of the file itself.
#Your directory will be different from mine!
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
# Read data file and keep it in dt
dt <- read.csv("bank_data.csv", header = TRUE)
head(dt) # show the first six rows of the data
## age job marital education balance housing loan duration campaign pdays
## 1 59 admin. married secondary 2343 yes no 1042 1 -1
## 2 56 admin. married secondary 45 no no 1467 1 -1
## 3 41 technician married secondary 1270 yes no 1389 1 -1
## 4 55 services married secondary 2476 yes no 579 1 -1
## 5 54 admin. married tertiary 184 no no 673 2 -1
## 6 42 management single tertiary 0 yes yes 562 2 -1
str(dt) # show data structure
## 'data.frame': 1000 obs. of 10 variables:
## $ age : int 59 56 41 55 54 42 56 60 37 28 ...
## $ job : chr "admin." "admin." "technician" "services" ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education: chr "secondary" "secondary" "secondary" "secondary" ...
## $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
## $ housing : chr "yes" "no" "yes" "yes" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
## $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
Set “chr” variables as “factor”
dt$job <- as.factor(dt$job)
dt$marital <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing <- as.factor(dt$housing)
dt$loan <- as.factor(dt$loan)
#check data structure again
str(dt)
For categorical (factor) variables, use the summary() function, the table() function, and the prop.table() function.
# job, marital, education, housing, loan
summary(dt)
#frequency table for one categorical variable
table(dt$job)
table(dt$housing)
prop.table(table(dt$job))
prop.table(table(dt$job))*100
#Two-Way Frequency Tables
table(dt$job,dt$education)
tb <- table(dt$loan,dt$marital)
# Add margin totals to your table.
addmargins(tb)
#Calculate percentage
prop.table(tb)*100
For numerical variables, use the summary() function, the max() and min() functions, the mean() function, the median() function, and the getmode() function defined below.
#define function to calculate mode
getmode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
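A quick sanity check of getmode() on a toy vector (illustration only; the vector is made up):
# the most frequent value here is 3
getmode(c(1, 2, 2, 3, 3, 3))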
Other useful functions are the range() function, the quantile(, probs = ) function where probs can be 0.25, 0.5, 0.75 for the first, second, and third quartiles, the IQR() function, and the sd() and var() functions.
#age, balance, duration, campaign, pdays
summary(dt$age)
min(dt$age)
range(dt$age)
mean(dt$age)
getmode(dt$age)
IQR(dt$age)
sd(dt$age)
quantile(dt$age,probs = 0.25)
quantile(dt$age,probs = 0.50)
quantile(dt$age,probs = 0.75)
quantile(dt$age,probs = c(0.25,0.50,0.75))
cv_age <- sd(dt$age)/mean(dt$age)
cv_age
#cv of age by loan (yes,no)
cv1 <- sd(dt$age[dt$loan=="yes"])/mean(dt$age[dt$loan=="yes"])
cv1
cv2 <- sd(dt$age[dt$loan=="no"])/mean(dt$age[dt$loan=="no"])
cv2
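The same group-wise coefficient of variation can be obtained in one call with tapply(); this is just a compact alternative to the two lines above.
# apply sd/mean to age within each level of loan
tapply(dt$age, dt$loan, function(v) sd(v) / mean(v))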
The aggregate() function is used to get the summary statistics of the data by group.
# aggregate: compute summary statistics of data subsets
aggregate(balance~loan, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=summary)
aggregate(balance~loan+marital, data = dt, FUN=mean)
In the case that you have a factor variable, you can display a frequency table or a bar graph.
You can use the good ol’ summary() on a factor variable
to display a frequency table.
freq <- table(dt$job)
freq
##
## admin. blue-collar entrepreneur housemaid management
## 96 226 45 25 221
## retired self-employed services student technician
## 44 36 82 8 183
## unemployed unknown
## 31 3
barplot(freq)
To obtain a list of a factor variable’s levels, use
levels().
maritallevel <- levels(dt$marital)
maritallevel
## [1] "divorced" "married" "single"
Stacked Bar Plot with Colors and Legend
counts <- table(dt$marital, dt$loan)
barplot(counts, col=c("blue","red","green"),
legend = rownames(counts), xlab="loan")
Grouped Bar Plot
counts <- table(dt$marital, dt$loan)
barplot(counts,col=c("blue","red","green"),
legend = rownames(counts), beside=TRUE,xlab="loan")
counts <- table(dt$marital)
pie(counts, main = "Marital pie chart")
To display a histogram, use hist().
hist(dt$age)
Sometimes, you want to change the title of the plot, and also the
labels of the horizontal and vertical axes. To do that, set the
parameters main, xlab and ylab.
hist(dt$age, main = "Histogram of age", xlab = "age", ylab = "counts")
To change the number of bins, set the breaks parameter.
However, R will decide the best number of bins close to the
number that you specify. For example, the following code displays a
histogram with the number of bins close to 20:
hist(dt$age, breaks = 20,
main = "Histogram of age",
xlab = "age",
ylab = "counts")
To display a stem-and-leaf plot, use stem().
stem(dt$duration)
##
## The decimal point is 2 digit(s) to the right of the |
##
## 0 | 18890113333334455566666667788899
## 2 | 1333333344555577888901111244456667777788888889999
## 4 | 00000011112222222333333334444445555566666666777777777888888888888899+122
## 6 | 00000000001111111122222222222223333333333333334444444444444444445555+160
## 8 | 00000000001111111111222223333333444444444455555555555556666666666677+103
## 10 | 00000000011111112222222233333333333344444444455555566666666667788888+54
## 12 | 00000111111112233344444444555667778899900011122333344444445556666677
## 14 | 0001122344455556777899900133445556667788
## 16 | 012245667899022234899
## 18 | 1346778885788
## 20 | 2239
## 22 | 7
## 24 | 62
## 26 | 257
## 28 |
## 30 | 98
## 32 | 5
## 34 |
## 36 |
## 38 | 8
#compare with its histogram
hist(dt$duration)
To display a boxplot, use boxplot().
boxplot(dt$age, main="Boxplot of age")
To display the boxplot horizontally, set horizontal = TRUE.
boxplot(dt$age, horizontal = TRUE)
You can display boxplots by a grouping variable.
boxplot(dt$age~dt$loan,xlab="loan",ylab="age")
boxplot(dt$age~dt$job,xlab="job",ylab="age")
Note that you need two vectors to plot a scatter plot: one for the x-axis, and one for the y-axis.
Suppose that you have two variables: var1 and
var2. To display a scatter plot between var1
and var2, use plot(x = var1, y = var2).
plot(x = dt$age, y = dt$balance,xlab="age",ylab="balance")
Perform the following numerical data presentation for the
bank_data.csv dataset:
Perform the following graphical data presentation for the
bank_data.csv dataset:
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("SalesPrice.csv",header=TRUE)
data
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location #qualitative independent variable with k=3 groups
summary(y)
hist(y,col = "darkorange",
border = "dodgerblue")
shapiro.test(y)
summary(x1)
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) #Compute the r between two quantitative variables
boxplot(y~x2,xlab="Location",ylab="Sale price")
NOTE: It is easier to work with graphs using ggplot2.
data_df = data.frame(data) # set data as dataframe type
library(ggplot2) # you might need to install the package "ggplot2" first!
## Warning: package 'ggplot2' was built under R version 4.3.1
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
geom_point(aes(color=Location))
ggplot(data_df,aes(y=SalesPrice,x=Location))+
geom_boxplot(aes(color=Location))
#create x2.dummy
x2.dummy= c() #create null vector
for(i in 1:length(data$Location)){
if(data$Location[i]=="Street") x2.dummy[i] = 1
else x2.dummy[i] = 0
}
x2.dummy
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
if(data$Location[i]=="Mall") x3.dummy[i] = 1
else x3.dummy[i] = 0
}
x3.dummy
## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
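As an aside, the same dummies can be built without loops; the sketch below shows an equivalent vectorized approach (the names x2.dummy2 and x3.dummy2 are only for illustration).
# vectorized versions of the loops above
x2.dummy2 <- ifelse(data$Location == "Street", 1, 0)
x3.dummy2 <- ifelse(data$Location == "Mall", 1, 0)
# model.matrix() creates the full dummy coding from a factor in one step
head(model.matrix(~ Location, data = data))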
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)
#The R function vif() [car package] can be used to detect multicollinearity in a regression model:
car::vif(full)
# Forward selection
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
fw.fit
# Backward elimination
be.fit = step(full,direction="backward")
be.fit
# Stepwise regression
sw.fit = step(full, direction="both")
sw.fit
summary(sw.fit)
anova(sw.fit)
confint.lm(sw.fit)
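As a quick side-by-side check of the models chosen above (a small sketch using the fitted objects fw.fit, be.fit and sw.fit), base R's AIC() lists their AIC values in one table.
# compare the selected models by AIC
AIC(fw.fit, be.fit, sw.fit)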
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat
plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))
par(mfrow=c(2,2)) #set plot layout as 2 row 2 column
plot(sw.fit)
#get the residuals
res = resid(sw.fit)
par(mfrow=c(1,2)) #set plot layout as 1 row 2 column
plot(res)
hist(res)
shapiro.test(res)
# Performs the Durbin-Watson test for autocorrelation of disturbances.
lmtest::dwtest(sw.fit)
# Test the constant variance assumption: Breusch-Pagan Test
lmtest::bptest(sw.fit)
# Load the packages
library(aod) #for 'Wald test'
## Warning: package 'aod' was built under R version 4.3.2
library(MASS)
library(rms) #for 'lrm'
## Warning: package 'rms' was built under R version 4.3.2
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl) #to get pseudo R2
## Warning: package 'pscl' was built under R version 4.3.2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data = read.csv("StudyHours.csv",header=TRUE)
head(data)
y = data$Pass
x = data$Hours
# Generating the frequency table
table(y)
# TO DO CODE
fit.logist = glm(y ~x,family=binomial(link='logit'))
fit.logist
anova(fit.logist)
summary(fit.logist)
mod <- lrm(y ~x, data = data)
mod
wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 1)
wald.test(b = coef(fit.logist), Sigma = vcov(fit.logist), Terms = 2)
#create null model
null <- glm(y ~ 1, data = data,family="binomial")
summary(null)
#create full model
full <- glm(y ~ x, data = data,family="binomial")
summary(full)
lrtest(null, full)
#To easily get a McFadden's pseudo R2 for a fitted model in R, use the "pscl" package by Simon Jackman and use the pR2 command.
logLik(fit.logist)
pR2(fit.logist)
mod1b <- lrm(y ~ x, data = data)
mod1b
confint(fit.logist)
## Waiting for profiling to be done...
#Confidence Intervals using standard errors
confint.default(fit.logist)
fit.logist
exp(coef(fit.logist))
## odds ratios and 95% CI
exp(cbind(OR = coef(fit.logist), confint(fit.logist)))
## Waiting for profiling to be done...
pred_logit = predict(fit.logist, newdata = list(x=x), type = "link",se = TRUE)
pred_logit
pred = predict(fit.logist, newdata = list(x=x), type = "response",se = TRUE)
pred
pred_prob = 1/(1+exp(-pred_logit$fit))
pred_prob
LCI_pred = 1/(1+exp(-(pred_logit$fit-1.96*pred_logit$se.fit)))
UCI_pred = 1/(1+exp(-(pred_logit$fit+1.96*pred_logit$se.fit)))
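Note that the manual inverse-logit above is the same transformation as base R's plogis(); a sketch of the equivalent computation follows (pred_prob2, LCI_pred2, UCI_pred2 are illustrative names).
# plogis(z) = 1/(1 + exp(-z))
pred_prob2 <- plogis(pred_logit$fit)
LCI_pred2 <- plogis(pred_logit$fit - 1.96*pred_logit$se.fit)
UCI_pred2 <- plogis(pred_logit$fit + 1.96*pred_logit$se.fit)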
#create plot
plot(x,pred_prob,type="o",ylim=c(-0.1,1.1),xlab="Number of Study Hours",ylab="Prob of passing exam",col="blue")
points(x,y)
lines(x,LCI_pred,col="red",lty=2)
lines(x,UCI_pred,col="red",lty=2)
abline(h=0.5,col="gray")
#create data.frame
dtf = data.frame( y =y ,
x =x ,
pred_logit = pred_logit,
pred_prob = pred_prob,
LCI_pred = LCI_pred,
UCI_pred = UCI_pred
)
dtf
# if pred_prob <= 0.5 , pred_pass = 0
# if pred_prob > 0.5 , pred_pass = 1
#create pred_pass
pred_pass= c() #create null vector
for(i in 1:length(y)){
if(pred_prob[i] <= 0.5) pred_pass[i] = 0
else pred_pass[i] = 1
}
pred_pass
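The same 0/1 classification can also be produced without a loop; a vectorized sketch (pred_pass2 is an illustrative name):
# TRUE/FALSE is coerced to 1/0
pred_pass2 <- as.numeric(pred_prob > 0.5)
pred_pass2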
#add pred_pass in dtf
dtf$pred_pass = pred_pass
dtf
xtabs(~ y + pred_pass, data = dtf)
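From this classification table, the overall accuracy is simply the proportion of matches; a short sketch:
# proportion of correctly classified observations
mean(dtf$y == dtf$pred_pass)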
A researcher is interested in how variables such as GRE (Graduate
Record Exam scores), GPA (grade point average), and prestige of the
undergraduate institution affect admission into graduate school. The
response variable, admit/don’t admit, is a binary variable.
# Load the packages
library(aod)
library(MASS)
library(rms)
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
data2 = read.csv("Admission.csv",header=TRUE)
head(data2)
summary(data2)
sapply(data2, sd)
xtabs(~admit + rank, data = data2)
#Using the logit model
data2$rank <- factor(data2$rank)
#Perform variable selection
null <- glm(admit ~ 1, data = data2,family="binomial")
full <- glm(admit ~ gre + gpa + rank, data = data2,family="binomial")
forwards = step(null,scope=list(lower=null,upper=full), direction="forward")
backwards = step(full)
bothways = step(null, list(lower=null,upper=full),direction="both")
mod1b <- lrm(admit ~ gre + gpa + rank, data = data2)
mod1b
#lrtest
fit <- glm(admit ~ gre + gpa + rank, data = data2, family = "binomial")
summary(fit)
confint(fit)
## Waiting for profiling to be done...
confint.default(fit)
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 4:6)
pR2(fit)
lrm(admit ~ gre + gpa + rank, data = data2)
newdata1 <- with(data2, data.frame(gre = gre, gpa = gpa, rank = factor(rank)))
pred.response <- predict(fit, newdata = newdata1, type = "response")
head(pred.response)
newdata1$rankP <- predict(fit, newdata = newdata1, type = "response")
head(newdata1)
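To see how well the fitted model classifies the training data, the predicted probabilities can be cut at 0.5 and cross-tabulated with the observed admit values; a sketch, where the 0.5 cutoff and the name pred_admit are assumptions for illustration:
# classify as admitted if the predicted probability exceeds 0.5
pred_admit <- as.numeric(newdata1$rankP > 0.5)
table(observed = data2$admit, predicted = pred_admit)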
library(readr)
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
dt <- read_csv("product.csv")
## Rows: 400 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): product, gender
## dbl (5): obs, age, household, position_level, absent
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dt)
dt$product <- as.factor(dt$product)
dt$position_level <- as.factor(dt$position_level)
dt$gender <- as.factor(dt$gender)
str(dt)
summary(dt)
boxplot(age~product,data=dt)
table(dt$product,dt$position_level)
plot(dt$age,dt$absent)
dt$product <- relevel(dt$product, ref = "A")
#Important packages
library(MASS) # for stepAIC
library(rms) # for lrtest
library(pscl) #to get pseudo R2
#package "nnet" provides multinom()
library(nnet)
null <- multinom(product~1,data=dt)
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
summary(full)
b = summary(full)$coefficients
se.b = summary(full)$standard.errors
z <- b/se.b
t(z)#transpose of Z
# p-value for 2-tailed z test
p <- (1 - pnorm(abs(z),mean=0,sd=1))*2
t(p)#transpose of p
#likelihood ratio test
full <- multinom(product~age+household+position_level+gender+absent,data=dt)
fit1 <- multinom(product~age+household+gender+absent,data=dt)
anova(fit1,full)
# From fit1, test if “absent” should be included when the rest are in
fit2 <- multinom(product~age+household+gender,data=dt)
anova(fit2,fit1)
# stepwise variable selection with stepAIC (MASS)
stepAIC(full,direction="both")
anova(null,fit2)
summary(fit2)
exp(summary(fit2)$coefficients)
pR2(fit2)
phat <- predict(fit2,newdata=dt,type="probs")
head(phat)#show the first 6 rows
yhat <- predict(fit2,newdata=dt)
head(yhat)
table(dt$product,yhat)
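The overall correct-classification rate follows directly from the table above; a one-line sketch:
# proportion of rows where the predicted product matches the observed product
mean(yhat == dt$product)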
# import data
library(readr)
setwd("C:\\Users\\ASUS\\Google Drive\\208429\\LabMaterials\\DataFiles_208429")
dp <- read_csv("poverty.csv")
## New names:
## Rows: 5381 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): poverty, religion, degree, country, gender
## dbl (2): ...1, age
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
str(dp)
dp$poverty <- as.factor(dp$poverty)
dp$religion <- as.factor(dp$religion)
dp$degree <- as.factor(dp$degree)
dp$country <- as.factor(dp$country)
dp$gender <- as.factor(dp$gender)
dp$poverty <- ordered(dp$poverty,
levels =c("Too Little","About Right","Too Much"))
levels(dp$poverty)
#3.m0: null model
library(MASS)
m0 <- polr(poverty~1,data=dp)
#4. mf: full model
mf <- polr(poverty~religion+degree+country+age+gender,data=dp)
summary(mf)
##
## Re-fitting to get Hessian
# 5.calculate p-value
## store table
ctable <- coef(summary(mf))
##
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
#
# 7.Test if "age" should be included in the model when others are in
m1 <- polr(poverty~religion+degree+country+gender,data=dp)
#likelihood ratio test, use anova function
anova(mf,m1)
## store table
ctable <- coef(summary(m1))
##
## Re-fitting to get Hessian
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
## combined table
ctable <- cbind(ctable, "p value" = p)
ctable
#8. From m1, test the significance of "degree"
m2 <- polr(poverty~religion+country+gender,data=dp)
anova(m1,m2)
#Variable selection for logistic
library(MASS)
stepAIC(mf,direction="both")#use stepwise
#9.
anova(m0,mf)
#11. OR and its confidence interval (CI)
ci <- confint(mf) # 95% CI of log odd
## Waiting for profiling to be done...
##
## Re-fitting to get Hessian
#odds ratios
exp(coef(mf))#odds
exp(cbind(OR = coef(mf), ci))#95% CI of odds
#12.pseudo R2
library(pscl) #to get pseudo R2
pR2(mf)
#13. getting phat
phat <- predict(mf,newdata=dp,type="probs")
head(phat)
#getting yhat
yhat <- predict(mf,newdata=dp)
head(yhat)
#14. classification table
table(dp$poverty,yhat)
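As with the multinomial model, the proportion of correct classifications can be read from the fitted classes; a short sketch:
# proportion of correctly classified respondents
mean(yhat == dp$poverty)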