March 18, 2021
Author:
Reza Hosseini
MD, MPH Student (Class of 2022)
School of Population and Public Health, Faculty of Medicine,
University of British Columbia
This report illustrates different topics in R that I am familiar with and have worked on over the past months. In my previous R markdown file published on RPubs, titled "Public health harm and economic damage caused by different storm events based on NOAA data", I presented examples of working with these packages and functions:
- the dplyr package to tidy and manipulate data frames (in the form of tibbles) using its four main functions (select(), filter(), mutate(), and summarise()); I also used the pipe operator (%>%) to simplify the flow of my code,
- the ggplot2 package to produce graphs based on the grammar of graphics concepts; dplyr and ggplot2 are part of the tidyverse, a collection of open-source R packages introduced by Hadley Wickham, the chief data scientist at RStudio,
- the lubridate package to work with dates and times easily in R,
- the kableExtra package to easily present a table in an R markdown file,
- the sapply() function, one of the functions from the apply family, which are basically shorthands for loops in R.

In this report, I present some of my other R knowledge. I have divided this file into several sections, so please feel free to use the navigation menu on the left to quickly look at the part you are most interested in.
There are three main graphical systems in R, and each one has its own pros and cons:
- the base R plotting system,
- the lattice package,
- the ggplot2 package and its extensions (GGally, etc.).
Below, I present examples of each plotting system:
I first use an if statement to download the data file [2 Kb], only if it does not already exist:
if(!dir.exists("DataFiles")) dir.create("DataFiles")  # create the data folder if it does not exist
if(!file.exists("DataFiles/SalaryData.txt")){
fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/SalaryData.txt"
download.file(fileURL, destfile = "DataFiles/SalaryData.txt", method = "curl")
}
salary <- read.table(file="DataFiles/SalaryData.txt", header=T, sep="\t")
I will use this data set in the GGplot section below, as well.
In this part, I examine salary discrimination amongst tenure-track professors in a small midwestern college in the United States. The data consist of information on 52 faculty members and were initially collected for presentation in legal proceedings in which salary discrimination against women was at issue (recorded some time prior to 1980). The data were collected from personnel files and consist of the following variables; they are saved in the file SalaryData.txt.
- sex = recorded as "male" and "female"
- degree = the highest degree obtained, recorded as "doctorate" and "masters"
- yearsdeg = the number of years since the degree was earned
- salary = academic salary in U.S. dollars
- rank = academic rank, recorded as "full", "associate", and "assistant"

The structure of the data set:
str(salary)
## 'data.frame': 52 obs. of 5 variables:
## $ sex : chr "male" "male" "male" "female" ...
## $ degree : chr "doctorate" "doctorate" "doctorate" "doctorate" ...
## $ yearsdeg: int 35 22 23 27 30 21 32 18 30 31 ...
## $ salary : int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
## $ rank : chr "full" "full" "full" "full" ...
After checking for confounding and effect modification, my fitted linear model is:
model1 <- lm(salary ~ sex + degree + yearsdeg, data = salary)
cbind(Coefficients = coef(model1), confint(model1))
## Coefficients 2.5 % 97.5 %
## (Intercept) 15608.2036 12861.6028 18354.8045
## sexmale 2696.2635 203.1353 5189.3917
## degreemasters -4030.2259 -6816.1165 -1244.3353
## yearsdeg 467.7237 338.5600 596.8875
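As an aside, this is roughly how one such check could look (a minimal sketch; model_emm is an illustrative name): fit a model with an interaction term and compare the nested models with a partial F-test.
model_emm <- lm(salary ~ sex * yearsdeg + degree, data = salary)  # adds a sex-by-yearsdeg interaction
anova(model1, model_emm)  # partial F-test; a large p-value suggests no strong effect modification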
Now, I plot this model’s diagnostic plots using base R plotting system:
First, I save the default plotting parameters so that I can reset them later, if needed.
defaultPar <- par(no.readonly = TRUE)
Then, the graph:
par(mfrow=c(2, 2), oma=c(3, 0, 0, 0), mar=c(5, 4.5, 4, 2))
plot(model1, which = 1, caption = "", main="Residuals vs Fitted",
cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19)
plot(model1, which = 2, caption = "", main="Normal Q-Q",
cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19)
plot(model1, which = 3, caption = "", main="Scale-Location",
cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19)
plot(model1, which = 5, caption = "", main="Residuals vs Leverage",
cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19)
mtext("lm(salary ~ sex + yearsdeg + degree)", outer=TRUE, cex=1.1,
line=1, side=1)
Resetting the plot parameters back to default:
par(defaultPar)
For creating a lattice plot, I use the National Emissions Inventory (NEI) database, which is released by the United States’ Environmental Protection Agency (EPA) every three years and reports on emissions of fine particulate matter (PM2.5).
For each year and for each type of PM source, the NEI records how many tons of PM2.5 were emitted from that source over the course of the entire year. The data that I use here are for 1999, 2002, 2005, and 2008.
Disclaimer: the description of the data set is taken from the Coursera website.
I want to compare emissions from motor vehicle sources in Baltimore City, Maryland (fips == "24510") with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"), and see which location has seen greater changes in motor vehicle emissions over time.
Downloading and unzipping the data file [29 Mb]:
if(!file.exists("DataFiles/DataNEI.zip")){
fileURL <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
download.file(fileURL, destfile = "DataFiles/DataNEI.zip", method = "curl")
unzip("DataFiles/DataNEI.zip", exdir = "DataFiles/")
}
The zip file contains two files:
PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of PM2.5 emitted from a specific type of source for the entire year. The file contains these variables: fips (a five-digit number indicating the U.S. county), SCC (the source classification code digit string), Pollutant, Emissions (the amount of PM2.5 emitted, in tons), type (the type of source), and year.
Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source.
Reading in the data:
NEI <- readRDS("DataFiles/summarySCC_PM25.rds")
SCC <- readRDS("DataFiles/Source_Classification_Code.rds")
As we only want "motor vehicle sources", I tried different keywords in the SCC data set to find the most relevant matches. I think the words "highway veh" and "motorcycle" in the Short.Name column, and the word "vehicle" in the EI.Sector column, are the best matches. It turned out that each column has matches that are not present in the other, so I use the union() function to take the unique matches from both columns:
This code chunk checks whether the dplyr package is already installed; if it is not, it installs and then loads it:
if (!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
First, searching the Short.Name column for matches, using the grep() function and regular expressions:
motor_indx1 <- grep("(highway veh)|(motorcycle)", unique(SCC$Short.Name),
value = TRUE, ignore.case = TRUE)
motor1 <- filter(SCC, Short.Name %in% motor_indx1)
Then, searching the “EI.Sector” column for matches:
motor_indx2 <- grep("vehicle", unique(SCC$EI.Sector), value = TRUE,
ignore.case = TRUE)
motor2 <- filter(SCC, EI.Sector %in% motor_indx2)
Finally, I take the union() of unique values from both columns:
SCC_indx <- union(unique(motor1$SCC), unique(motor2$SCC))
Calculating emissions from motor vehicles in Baltimore and LA:
total_motor <- filter(NEI, fips %in% c("24510", "06037") & SCC %in% SCC_indx) %>%
group_by(fips, year) %>%
summarise(totalPM = sum(Emissions))
total_motor$fips <- sub("24510", "Baltimore", total_motor$fips)
total_motor$fips <- sub("06037", "Los Angeles", total_motor$fips)
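As a side note, a dplyr-based alternative to those two sub() calls would be recode() (a sketch, not meant to be run in addition to the code above; I ungroup() first because mutating a grouping variable is not allowed in dplyr):
total_motor <- total_motor %>%
  ungroup() %>%  # drop the grouping left over from summarise()
  mutate(fips = recode(fips, "24510" = "Baltimore", "06037" = "Los Angeles"))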
Loading the lattice package:
if (!require(lattice)) {
install.packages("lattice")
library(lattice)
}
Now, the lattice panel plot:
xyplot(totalPM ~ factor(year) | fips,
data = total_motor,
main = list(expression("Changes of"~PM[2.5]~"emissions from motor vehicle"
~"sources in Baltimore compared to Los Angeles"),
cex = 1),
xlab = list("Year", cex = 1),
ylab = list(expression(PM[2.5]~emitted~(underline("in tons"))),
cex = 1),
scales = list(cex = 1), # Increasing axes sizes
par.strip.text = list(cex = 1), # Increasing strip text size
par.settings = list(strip.background=list(col="thistle1"), # Strip background color
layout.widths = list(ylab.axis.padding = 2)), # ylab distance
panel = function(x, y, ...){
panel.xyplot(x, y,
type = "l",
lwd = 4,
col = "purple")
panel.points(x, y, # Adding points to the line graph
pch = 19,
col = "purple4",
cex = 1)
}
)
First, I load the tidyverse, which includes the ggplot2 package:
library(tidyverse)
I use the same salary data set introduced in the Base R section above.
2.3.1 A boxplot:
ggplot(data = salary, aes(x = sex, y = salary/1000)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot(fill = c("orchid1", "deepskyblue"),
outlier.shape = 21,
outlier.size = 2.5,
width = 0.5) +
theme_classic() +
theme(axis.text = element_text(size=14),
axis.title = element_text(size=14,face="bold")) +
theme(axis.title.y = element_text(
margin = margin(t = 0, r = 20, b = 0, l = 0))) +
theme(axis.title.x = element_text(
margin = margin(t = 10, r = 0, b = 0, l = 0))) +
labs(title = "Salary vs. Sex",
x = "Sex",
y = "Salary (thousand USD)") +
theme(plot.title = element_text(hjust = 0.5, vjust=2, face = "bold"))
2.3.2 Another, more complicated boxplot:
I use the reshape2 package to transform the data from wide to long (tidy) format.
library(reshape2)
factorCols <- c("sex", "degree", "rank")
salary[factorCols] <- lapply(salary[factorCols], factor)
salaryFactor <- salary %>% select(salary, all_of(factorCols))
salaryFactor <- melt(salaryFactor, id.var = "salary")
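For reference, the same wide-to-long step can be done with tidyr's pivot_longer(), which has superseded reshape2::melt(). A minimal sketch (salaryLong is an illustrative name; the factor columns are coerced to character to keep the types compatible):
salaryLong <- salary %>%
  select(salary, all_of(factorCols)) %>%
  pivot_longer(cols = all_of(factorCols),
               names_to = "variable", values_to = "value",
               values_transform = list(value = as.character))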
And the plot itself:
ggplot(salaryFactor, aes(y = salary/1000, x = value, group = value)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot(fill = c("deepskyblue"), outlier.shape = 21,
outlier.size = 2.5, width = 0.5) +
facet_wrap(. ~ variable, scales = "free_x") +
theme_bw() +
theme(axis.title = element_text(size=14,face="bold")) +
theme(axis.title.y = element_text(
margin = margin(t = 0, r = 20, b = 0, l = 0))) +
theme(axis.title.x = element_text(
margin = margin(t = 15, r = 0, b = 0, l = 0))) +
labs(title = "Comparing salary within groups of sex, degree, and rank",
x = "",
y = "Salary (in thousand US$)") +
theme(plot.title = element_text(hjust = 0.5, vjust=2.5, face = "bold")) +
theme(axis.text.y = element_text(size = 11)) +
theme(axis.text.x = element_text(size = 11, angle = 20, vjust = 0.6))
2.3.3 A histogram:
Preparing the data in the correct format:
histSalaryDat1 <- salary %>%
select(sex, salary) %>%
melt(id.var = "salary") %>%
unite(Groups, c("variable", "value"),
sep = ": ",
remove = TRUE)
histSalaryDat2 <- salary %>%
select(salary) %>%
mutate(Groups = "all")
histSalaryDatMerge <- rbind(histSalaryDat1, histSalaryDat2)
Then the plot itself:
ggplot(data=histSalaryDatMerge, aes(x=salary/1000, col = Groups, fill = Groups)) +
geom_histogram(bins = 10, alpha=0.5) +
scale_color_brewer(palette="Dark2", direction = -1) +
scale_fill_brewer(palette="Dark2", direction = -1) +
labs(title="(A) Salary distribution",
x="Salary (in thousand USD)",
y = "Count")+
theme_classic() +
theme(legend.position = c(0.89, 1))
2.3.4 A scatterplot illustrating my fitted linear regression model:
In the Base R section above, I fitted a linear model (model1). Let’s look at its coefficients again:
cbind(Coefficients = coef(model1), confint(model1))
## Coefficients 2.5 % 97.5 %
## (Intercept) 15608.2036 12861.6028 18354.8045
## sexmale 2696.2635 203.1353 5189.3917
## degreemasters -4030.2259 -6816.1165 -1244.3353
## yearsdeg 467.7237 338.5600 596.8875
Preparing the data to be plotted:
scatterData <- salary %>% select(-rank)
scatterData <- scatterData %>%
unite(Group, c("sex", "degree"), sep = " ", remove = FALSE)
And then the plot itself. The slopes and intercepts are calculated from the coefficients table above:
ggplot(data = scatterData, aes(y = salary/1000, x = yearsdeg, col = Group)) +
geom_point() +
geom_abline(slope=467.72/1000, intercept=15608.20/1000,
col="#F8766D", lwd=1.2) +
geom_abline(slope=467.72/1000, intercept=(15608.2036-4030.2259)/1000,
col="#7CAE00", lwd=1.2) +
geom_abline(slope=467.72/1000, intercept=(15608.2036+2696.2635)/1000,
col="#00BFC4", lwd=1.2) +
geom_abline(slope=467.72/1000, intercept=(15608.2036+2696.2635-4030.2259)/1000,
col="#C77CFF", lwd=1.2) +
theme_bw() +
theme(axis.title.y = element_text(
margin = margin(t = 0, r = 10, b = 0, l = 0))) +
theme(axis.title.x = element_text(
margin = margin(t = 10, r = 0, b = 0, l = 0))) +
labs(title = "Salary ~ Sex + Degree + YearsDegree",
x = "Years since highest degree was earned",
y = "Salary (in thousand US$)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size=11))
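Hardcoding the coefficient values works, but it breaks silently if the model changes. A sketch of a more robust alternative (lines_df is a hypothetical helper): derive the four intercepts from coef(model1) and draw all lines with a single geom_abline() layer.
cf <- coef(model1)
lines_df <- data.frame(
  Group = c("female doctorate", "female masters", "male doctorate", "male masters"),
  intercept = c(cf["(Intercept)"],
                cf["(Intercept)"] + cf["degreemasters"],
                cf["(Intercept)"] + cf["sexmale"],
                cf["(Intercept)"] + cf["sexmale"] + cf["degreemasters"]) / 1000,
  slope = cf["yearsdeg"] / 1000)  # same slope for all four groups
# ...then the four geom_abline() calls above could be replaced with one layer:
# geom_abline(data = lines_df, aes(intercept = intercept, slope = slope, col = Group))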
2.3.5 A scatterplot illustrating different x-transformations:
I use another data set here only for presenting the following graph. The dataset GPAdata.csv contains information collected for a simple random sample of recent university graduates. For each individual in the study, we have the initial starting salary (Salary), and the cumulative grade point average (GPA). I would like to examine the relationship between starting salary and grade point average.
Downloading the data file [1 Kb]:
if(!file.exists("DataFiles/GPAdata.csv")){
fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/GPAdata.csv"
download.file(fileURL, destfile = "DataFiles/GPAdata.csv", method = "curl")
}
Reading the data, plus looking at its structure:
gpa <- read_csv("DataFiles/GPAdata.csv")
str(gpa)
## tibble [38 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Salary: num [1:38] 56001 47542 38798 60069 63848 ...
## $ GPA : num [1:38] 3.53 2.63 2.25 3.03 3.46 3.2 3.81 2.97 3.14 3.41 ...
## $ GPAcat: chr [1:38] "D" "B" "A" "C" ...
## - attr(*, "spec")=
## .. cols(
## .. Salary = col_double(),
## .. GPA = col_double(),
## .. GPAcat = col_character()
## .. )
Here, I add the polynomial and log transformations of the x variable to the graph:
ggplot(gpa, aes(x = GPA, y = Salary/1000)) +
geom_point() +
geom_smooth(method = "lm", fill = "blue") +
geom_smooth(method="lm", formula= y ~ x + I(x^2), col = "red", fill = "red") +
geom_smooth(method = "lm", formula= y ~ log(x), col = "orange", fill = "orange") +
labs(title = "GPA vs Salary", y = "Salary (in thousand $)") +
theme_classic() +
annotate(geom = "text", x = 3.23, y = 38, label = "Blue line: Salary ~ GPA") +
annotate(geom = "text", x = 3.32, y = 35, label = "Red line: Salary ~ GPA + GPA^2") +
annotate(geom = "text", x = 3.30, y = 32, label = "Orange line: Salary ~ log(GPA)")
In this section, I write two functions. Their overall purpose is to compute the inverse of a matrix and store it in a cache for further use. The reason for doing this is that some calculations, including finding the inverse of a matrix, can be very resource-intensive in R.
The concept of R's lexical scoping is necessary here to fully understand how things work. The makeCacheMatrix function takes a matrix, x, as its input and returns a list of four functions. These four functions set and get the values of two objects created within the environment of makeCacheMatrix(): x (a formal argument, representing the original matrix) and i (a local variable, representing the inverse matrix). makeCacheMatrix() does not calculate the inverse itself; it only provides a place for caching the result of the cacheSolve function (described below).
We can use the double arrow assignment operator (<<-) to assign a value to an object that lives in a parent environment, higher up in the hierarchy of R environments. Every function call creates its own environment. In the example below, we use the double arrow assignment operator to assign values to objects (such as i) that live in the enclosing environment rather than the local one.
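To make this concrete, here is a tiny standalone sketch (make_counter is a hypothetical example, unrelated to the caching functions below):
make_counter <- function() {
  count <- 0                  # lives in make_counter()'s environment
  function() {
    count <<- count + 1       # <<- modifies the enclosing count, not a local copy
    count
  }
}
counter <- make_counter()
counter()   # returns 1
counter()   # returns 2 -- the state persists between calls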
makeCacheMatrix <- function(x = matrix()) {
  i <- NULL                                        # cached inverse; NULL until computed
  set <- function(y){
    x <<- y                                        # replace the stored matrix...
    i <<- NULL                                     # ...and invalidate the cached inverse
  }
  get <- function() x                              # return the stored matrix
  set_inverse <- function(inverse) i <<- inverse   # cache a computed inverse
  get_inverse <- function() i                      # return the cached inverse (or NULL)
  list(set = set,
       get = get,
       set_inverse = set_inverse,
       get_inverse = get_inverse)
}
The cacheSolve function takes the value returned by makeCacheMatrix as its first formal argument. It then uses the get_inverse function created within makeCacheMatrix() to access the value of i (the inverse matrix). If a previously calculated inverse has been assigned to i, cacheSolve() prints a message indicating that it is retrieving cached data and returns the value of i. However, if i is NULL, cacheSolve() gets the original matrix from makeCacheMatrix(), computes the inverse, stores the result through the set_inverse function within makeCacheMatrix(), and returns the inverse matrix.
Every time we set a new matrix using the set() function within makeCacheMatrix, we need to run cacheSolve() again to recalculate the inverse and store it in the cache.
cacheSolve <- function(x, ...) {
  i <- x$get_inverse()             # look up the cached inverse
  if (!is.null(i)) {
    message("getting cached data")
    return(i)                      # cache hit: skip the computation
  }
  data <- x$get()                  # cache miss: fetch the matrix...
  i <- solve(data, ...)            # ...compute its inverse...
  x$set_inverse(i)                 # ...and store it for next time
  i
}
I make a simple matrix:
myMatrix <- matrix(1:4, ncol = 2)
I give this matrix to makeCacheMatrix() and assign the result to the object test:
test <- makeCacheMatrix(myMatrix)
Then, the first time that I run the cacheSolve function, it simply computes and returns the inverse matrix:
cacheSolve(test)
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
The second time that I run the cacheSolve function, it gets the answer from the cache and does not recalculate it (and prints the message “getting cached data”), hence saving resources.
cacheSolve(test)
## getting cached data
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
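To illustrate the point about set() made above, a quick sketch: storing a new matrix clears the cache, so the next cacheSolve() call recomputes the inverse.
test$set(matrix(c(2, 0, 0, 2), ncol = 2))  # new matrix; the cached inverse is reset to NULL
cacheSolve(test)  # recomputes; no "getting cached data" message this time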
For loops are most commonly used for iterating over the elements of an object:
A simple for loop:
a <- c("a", "b", "c", "d")
for (i in seq_along(a)) {
print(a[i])
}
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
Another:
b <- matrix(1:6, 2, 3)
for (i in seq_len(nrow(b))) {
for (j in seq_len(ncol(b))) {
print(b[i, j])
}
}
## [1] 1
## [1] 3
## [1] 5
## [1] 2
## [1] 4
## [1] 6
In while loops, we define a condition; as long as the condition is true, the loop continues. If not written cautiously, while loops are prone to becoming infinite and freezing the session.
A simple while loop:
count <- 0
while(count < 10){
print(count)
count <- count + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
Another:
z <- 5
while(z >= 3 && z <= 10){
print(z)
coin <- rbinom(1, 1, 0.5)
if (coin == 1) {
z <- z + 1
} else {
z <- z - 1
}
}
## [1] 5
## [1] 6
## [1] 5
## [1] 4
## [1] 5
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 6
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 4
## [1] 3
## [1] 4
## [1] 3
The apply family is a group of loop functions that do not require multi-line code or lots of curly braces. They are easier to use when exploring data. This family includes lapply(), sapply(), apply(), tapply(), and mapply(), which are all very similar to one another.
First, I load the InsectSprays data set, which is built into R:
data("InsectSprays")
head(InsectSprays)
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
Then, I calculate the sum of insect counts for each spray type, using only one function call with tapply():
tapply(InsectSprays$count, INDEX = InsectSprays$spray, FUN = sum)
## A B C D E F
## 174 184 25 59 42 200
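Because the family members are so similar, the same totals can be obtained in other ways; for example, a sketch combining split() with sapply():
sapply(split(InsectSprays$count, InsectSprays$spray), sum)  # same named totals as tapply() above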
Regular expressions (regex) are a concise and powerful pattern language that can be used to find or replace character strings in a large text vector or data file.
To show examples of regular expressions, I will work with the following data set [554 Kb]:
if(!file.exists("DataFiles/homicides.txt")){
fileURL <- "https://raw.githubusercontent.com/ahawker/data-analysis-coursera/master/HW4/homicides.txt"
download.file(fileURL, destfile = "DataFiles/homicides.txt", method = "curl")
}
homicides <- readLines("DataFiles/homicides.txt", warn = FALSE)
Exploring the data:
length(homicides)
## [1] 1250
homicides[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
homicides[1000]
## [1] "39.33626300000, -76.55553990000, icon_homicide_shooting, 'p1200', '<dl><dt><a href=\"http://essentials.baltimoresun.com/micro_sun/homicides/victim/1200/davon-diggs\">Davon Diggs</a></dt><dd class=\"address\">4100 Parkwood Ave<br />Baltimore, MD 21206</dd><dd>Race: Black<br />Gender: male<br />Age: 21 years old</dd><dd>Found on November 5, 2011</dd><dd>Victim died at Johns Hopkins Bayview Medical Center </dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Originally reported in 5000 Belair Road; later determined to be rear alley of 4100 block Parkwood</p></dd></dl>'"
Using the grep() function, we can get the indices of the elements of the homicides data set that contain either the word baltimore or the word November.
g <- grep("baltimore|November", homicides)
length(g)
## [1] 1094
We can also allow the letter b in baltimore to be either lower- or uppercase:
i <- grep("[Bb]altimore", homicides)
length(i)
## [1] 1249
There is another data set built into R called state.name, which contains the names of the 50 US states. Using grep(), we can search for state names that begin with New:
grep("^New", state.name)
## [1] 29 30 31 32
grep("^New", state.name, value = TRUE)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
There are other functions, such as grepl(), regexpr(), sub(), and gsub(), that do similar things with slight differences and extra customization.
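A few quick sketches of those variants, reusing the state.name example:
grepl("^New", state.name)[29:32]               # grepl() returns logicals instead of indices (all TRUE here)
sub("New", "Old", "New York and New Jersey")   # sub() replaces only the first match
gsub("New", "Old", "New York and New Jersey")  # gsub() replaces every match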
For producing random numbers, it is better to specify a seed; this way, we get exactly the same random numbers the next time we run the code:
set.seed(3435)
trainIndicator <- rbinom(4601, size = 1, prob = 0.5)
table(trainIndicator)
## trainIndicator
##    0    1
## 2314 2287
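To demonstrate the reproducibility, a small sketch (firstDraw and secondDraw are illustrative names):
set.seed(3435)
firstDraw <- rbinom(10, size = 1, prob = 0.5)
set.seed(3435)                     # resetting the same seed...
secondDraw <- rbinom(10, size = 1, prob = 0.5)
identical(firstDraw, secondDraw)   # ...reproduces the draws exactly: TRUE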
Again, I work with the salary data set from the Base R section above, to develop a linear regression model investigating the relationship between sex and salary.
First, some exploratory data analysis:
dim(salary)
## [1] 52 5
names(salary)
## [1] "sex" "degree" "yearsdeg" "salary" "rank"
head(salary)
## sex degree yearsdeg salary rank
## 1 male doctorate 35 36350 full
## 2 male doctorate 22 35350 full
## 3 male doctorate 23 28200 full
## 4 female doctorate 27 26775 full
## 5 male masters 30 33696 full
## 6 male doctorate 21 28516 full
tail(salary)
## sex degree yearsdeg salary rank
## 47 female doctorate 1 16686 assistant
## 48 female doctorate 1 15000 assistant
## 49 female doctorate 2 20300 assistant
## 50 male doctorate 11 22050 associate
## 51 male doctorate 14 23070 associate
## 52 female doctorate 15 19140 assistant
summary(salary)
## sex degree yearsdeg salary rank
## female:14 doctorate:36 Min. : 1.00 Min. :15000 assistant:17
## male :38 masters :16 1st Qu.: 7.00 1st Qu.:19169 associate:15
## Median :15.00 Median :23712 full :20
## Mean :16.29 Mean :23957
## 3rd Qu.:23.25 3rd Qu.:27258
## Max. :35.00 Max. :38045
table(salary$sex)
##
## female male
## 14 38
table(salary$degree)
##
## doctorate masters
## 36 16
table(salary$rank)
##
## assistant associate full
## 17 15 20
Some simple exploratory plots:
par(mfrow = c(2, 2))
boxplot(salary$yearsdeg, main = "Years since last degree")
boxplot(salary$salary, main = "Salary")
hist(salary$yearsdeg)
hist(salary$salary)
Plotting a scatterplot and calculating the Pearson correlation:
plot(salary$yearsdeg, salary$salary)
cor(salary$yearsdeg, salary$salary, method = "pearson")
## [1] 0.659165
Calculating mean salary for males and females separately:
mean(salary$salary[salary$sex=="male"])
## [1] 24786.13
mean(salary$salary[salary$sex=="female"])
## [1] 21706.43
Calculating standard deviation (SD) of salary for males and females separately:
sd(salary$salary[salary$sex=="male"])
## [1] 5558.507
sd(salary$salary[salary$sex=="female"])
## [1] 5923.151
Performing an unadjusted bivariate test (a t-test); as neither SD is more than double the other, I assume equal variances:
t.test(salary ~ sex, data = salary, var.equal = TRUE)
##
## Two Sample t-test
##
## data: salary by sex
## t = -1.7418, df = 50, p-value = 0.0877
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6631.1663 471.7603
## sample estimates:
## mean in group female mean in group male
## 21706.43 24786.13
Fitting another linear model (I previously fitted a model called model1 under the 2.1 Base R section):
model2 <- lm(salary ~ factor(sex) + yearsdeg, data = salary)
summary(model2)
##
## Call:
## lm(formula = salary ~ factor(sex) + yearsdeg, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9628.4 -2564.3 -167.6 2237.4 13175.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15937.54 1461.30 10.906 1.04e-14 ***
## factor(sex)male 2678.13 1331.01 2.012 0.0497 *
## yearsdeg 372.19 59.26 6.280 8.64e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4252 on 49 degrees of freedom
## Multiple R-squared: 0.4777, Adjusted R-squared: 0.4563
## F-statistic: 22.4 on 2 and 49 DF, p-value: 1.23e-07
Using a partial F-test, AIC, and BIC, we can see that model1 is the better fit:
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: salary ~ sex + degree + yearsdeg
## Model 2: salary ~ factor(sex) + yearsdeg
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 753282637
## 2 49 886056860 -1 -132774223 8.4605 0.005483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model1, model2)
## df AIC
## model1 5 1014.982
## model2 4 1021.424
BIC(model1, model2)
## df BIC
## model1 5 1024.739
## model2 4 1029.229
I also previously showed the diagnostic plots of model1 to check the linear regression assumptions under the 2.1 Base R section.
For this section, I study the relationship between education and the use of contraceptives in Fiji in the mid-1970s. The data set that I use is ContraceptiveUseIndicators.txt. The data consist of observations on 1607 married and fecund women interviewed in the Fiji Fertility Survey of 1975.
The main research question of interest is to examine the relationship between education and contraceptive use in Fiji in the 1970s, while controlling for other confounding variables. Below is a description of the variables contained in the data set:
- Age = age in years
- AgeCat = age category, recorded as "A" (<25), "B" (25-29), "C" (30-39), and "D" (40-49)
- Education = education level, recorded as 0 (low) and 1 (high)
- WantsMore = whether the woman wants more children, recorded as 0 (no) and 1 (yes)
- UseContraceptive = contraceptive use, recorded as 0 (no) and 1 (yes)
First, downloading the data:
if(!file.exists("DataFiles/ContraceptiveUseIndicators.txt")){
fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/ContraceptiveUseIndicators.txt"
download.file(fileURL, destfile = "DataFiles/ContraceptiveUseIndicators.txt",
method = "curl")
}
Reading the data:
OCP <- read.table("DataFiles/ContraceptiveUseIndicators.txt",
header = TRUE)
dim(OCP)
## [1] 1607 5
head(OCP)
## Age AgeCat Education WantsMore UseContraceptive
## 1 21 A 0 1 0
## 2 18 A 0 1 0
## 3 19 A 0 1 0
## 4 22 A 0 1 0
## 5 17 A 0 1 0
## 6 20 A 0 1 0
str(OCP)
## 'data.frame': 1607 obs. of 5 variables:
## $ Age : int 21 18 19 22 17 20 20 21 24 23 ...
## $ AgeCat : chr "A" "A" "A" "A" ...
## $ Education : int 0 0 0 0 0 0 0 0 0 0 ...
## $ WantsMore : int 1 1 1 1 1 1 1 1 1 1 ...
## $ UseContraceptive: int 0 0 0 0 0 0 0 0 0 0 ...
Renaming the values of several variables:
#### Renaming the values for the AgeCat variable:
OCP$AgeCat[which(OCP$AgeCat=="A")] <- "<25"
OCP$AgeCat[which(OCP$AgeCat=="B")] <- "25-29"
OCP$AgeCat[which(OCP$AgeCat=="C")] <- "30-39"
OCP$AgeCat[which(OCP$AgeCat=="D")] <- "40-49"
#### Renaming the values for the Education variable:
OCP$Education[which(OCP$Education=="0")] <- "Low"
OCP$Education[which(OCP$Education=="1")] <- "High"
#### Renaming the values for the WantsMore variable:
OCP$WantsMore[which(OCP$WantsMore=="0")] <- "No"
OCP$WantsMore[which(OCP$WantsMore=="1")] <- "Yes"
#### Renaming the values for the UseContraceptive variable:
OCP$UseContraceptive[which(OCP$UseContraceptive=="0")] <- "No"
OCP$UseContraceptive[which(OCP$UseContraceptive=="1")] <- "Yes"
Converting these four columns to factors, using the lapply() function:
factorCols <- c("Education", "WantsMore", "AgeCat", "UseContraceptive")
OCP[factorCols] <- lapply(OCP[factorCols], factor)
Changing the reference level of the Education variable:
OCP$Education <- relevel(OCP$Education, "Low")
Comparing UseContraceptive against AgeCat (to save space, I do not show these exploratory analyses for the other variables):
table(OCP$UseContraceptive, OCP$AgeCat)
##
## <25 25-29 30-39 40-49
## No 325 299 375 101
## Yes 72 105 237 93
mosaicplot(table(OCP$UseContraceptive, OCP$AgeCat),
main = "Mosaicplot", xlab = "Contraceptive use", ylab = "Age category")
chisq.test(OCP$UseContraceptive, OCP$AgeCat)
##
## Pearson's Chi-squared test
##
## data: OCP$UseContraceptive and OCP$AgeCat
## X-squared = 77.578, df = 3, p-value < 2.2e-16
Preparing the data as a 2-by-2 table, with the exposed group (high education) and the outcome of interest (contraceptive use) in the first row and column:
OCPtbl <- table(OCP$Education, OCP$UseContraceptive)
OCPtbl <- OCPtbl[, c(2, 1)]
OCPtbl <- OCPtbl[c(2, 1), ]
OCPtbl
##
## Yes No
## High 306 688
## Low 201 412
Calculating the unadjusted odds ratio (OR) using the epiR package:
library(epiR)
epi.2by2(OCPtbl, method = "cross.sectional", conf.level = 0.95)
## Outcome + Outcome - Total Prevalence * Odds
## Exposed + 306 688 994 30.8 0.445
## Exposed - 201 412 613 32.8 0.488
## Total 507 1100 1607 31.5 0.461
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Prevalence ratio 0.94 (0.81, 1.09)
## Odds ratio 0.91 (0.73, 1.13)
## Attrib prevalence * -2.00 (-6.70, 2.69)
## Attrib prevalence in population * -1.24 (-5.60, 3.12)
## Attrib fraction in exposed (%) -6.51 (-23.35, 8.02)
## Attrib fraction in population (%) -3.93 (-13.56, 4.88)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 0.706 Pr>chi2 = 0.40
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
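As a quick arithmetic cross-check (a sketch), the unadjusted OR can also be computed directly from the 2-by-2 table as (a x d) / (b x c):
(306 * 412) / (688 * 201)  # = 126072 / 138288, approximately 0.91, matching epi.2by2() above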
chisq.test(OCPtbl)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: OCPtbl
## X-squared = 0.61593, df = 1, p-value = 0.4326
Calculating adjusted ORs, including interaction terms:
model3 <- glm(UseContraceptive ~ Education + AgeCat + WantsMore +
Education:AgeCat + Education:WantsMore,
data = OCP, family = "binomial")
round(exp(cbind(ORs = coef(model3), confint(model3))), 2)
## Waiting for profiling to be done...
## ORs 2.5 % 97.5 %
## (Intercept) 0.41 0.19 0.81
## EducationHigh 0.89 0.41 2.09
## AgeCat25-29 1.73 0.77 4.12
## AgeCat30-39 2.58 1.29 5.65
## AgeCat40-49 2.31 1.08 5.31
## WantsMoreYes 0.28 0.19 0.40
## EducationHigh:AgeCat25-29 0.84 0.33 2.06
## EducationHigh:AgeCat30-39 0.94 0.40 2.08
## EducationHigh:AgeCat40-49 2.72 0.97 7.33
## EducationHigh:WantsMoreYes 2.11 1.31 3.44
Calculating odds ratios for each stratum of variables WantsMore and AgeCat, based on the coefficients table below:
summary(model3)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.89968338 0.3707028 -2.4269666 1.522566e-02
## EducationHigh -0.11542242 0.4135261 -0.2791176 7.801546e-01
## AgeCat25-29 0.54772364 0.4244194 1.2905246 1.968686e-01
## AgeCat30-39 0.94870626 0.3735493 2.5397086 1.109449e-02
## AgeCat40-49 0.83814518 0.4019206 2.0853502 3.703752e-02
## WantsMoreYes -1.28771510 0.1939207 -6.6404211 3.127883e-11
## EducationHigh:AgeCat25-29 -0.17092490 0.4665755 -0.3663392 7.141120e-01
## EducationHigh:AgeCat30-39 -0.06069185 0.4182164 -0.1451207 8.846156e-01
## EducationHigh:AgeCat40-49 0.99955089 0.5131261 1.9479634 5.141934e-02
## EducationHigh:WantsMoreYes 0.74804861 0.2462960 3.0371937 2.387919e-03
#### OR for WantsMore:No - AgeCat:A
round(exp(-0.11542242), 2)
## [1] 0.89
#### OR for WantsMore:No - AgeCat:B
round(exp(-0.11542242 - 0.17092490), 2)
## [1] 0.75
#### OR for WantsMore:No - AgeCat:C
round(exp(-0.11542242 - 0.06069185), 2)
## [1] 0.84
#### OR for WantsMore:No - AgeCat:D
round(exp(-0.11542242 + 0.99955089), 2)
## [1] 2.42
#### OR for WantsMore:Yes - AgeCat:A
round(exp(-0.11542242 + 0.74804861), 2)
## [1] 1.88
#### OR for WantsMore:Yes - AgeCat:B
round(exp(-0.11542242 + 0.74804861 - 0.17092490), 2)
## [1] 1.59
#### OR for WantsMore:Yes - AgeCat:C
round(exp(-0.11542242 + 0.74804861 - 0.06069185), 2)
## [1] 1.77
#### OR for WantsMore:Yes - AgeCat:D
round(exp(-0.11542242 + 0.74804861 + 0.99955089), 2)
## [1] 5.11
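As a closing note, a sketch of a less error-prone way to obtain the same stratum-specific ORs: pull the estimates from coef(model3) by name instead of retyping them. For example, for the WantsMore:Yes - AgeCat:D stratum:
b <- coef(model3)
round(exp(b["EducationHigh"] + b["EducationHigh:WantsMoreYes"] +
            b["EducationHigh:AgeCat40-49"]), 2)  # 5.11, as in the hand calculation above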