March 18, 2021

Author:

Reza Hosseini
MD, MPH Student (Class of 2022)
School of Population and Public Health, Faculty of Medicine,
University of British Columbia


1. Introduction

This report is an example illustration of different topics in R that I am familiar with and have worked on over the past months. In my previous R markdown file published on RPubs titled “Public health harm and economic damage caused by different storm events based on NOAA data”, I presented examples of me working with these packages and functions:

  • dplyr package to tidy and manipulate data frames (in the form of tibbles) using its four main functions (select(), filter(), mutate(), and summarise() ). I also used the pipe operator (%>%) to simplify the flow of my code,
  • ggplot2 package to produce graphs based on the grammar of graphics concepts. The dplyr and ggplot2 packages are a part of the tidyverse, a collection of open source R packages introduced by Hadley Wickham, the chief data scientist at RStudio.
  • lubridate package to work with date and time easily in R,
  • kableExtra package to easily present a table in an R markdown file.
  • sapply() function one of the functions from the apply family, which are basically shorthands for loops in R.

In this report, I am presenting some of my other R knowledge. I have divided this file into several sections, so please feel free to use the navigation menu on the left to quickly look at the part you are most interested in.


2. Graphical systems

There are three main graphical systems in R, and each one has its own pros and cons:

  • Base R Plotting System
    • starts with a plot function,
    • then annotations (texts, lines, etc.) can be added separately
    • Its problem is it does not automatically manage plot margins, spacing, etc.
  • Lattice Plotting System:
    • very powerful in producing panel plots,
    • all plotting and annotation is done with a single function call
    • automatically manages margins, spacing, etc.
    • Its problem is it is hard to annotate, awkward to specify entire plot in one function call
  • GGplot2 Plotting System
    • implements the Grammar of Graphics by Leland Wilkinson. “In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”.
    • is mostly a combination of advantages of the other two systems
    • step-wise annotation using separate function calls is possible
    • automatically manages margins, spacing, etc.
    • relatively easy to make panel plots
    • very customizable
    • there are many complementary packages for it, such as GGally, etc.


Below, I present examples of each plotting system:

2.1 Base R

I first use an if function to download and unzip the data file [2 Kb], only if the file does not exist before:

if(!file.exists("DataFiles/SalaryData.txt")){
    fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/SalaryData.txt"
    download.file(fileURL, destfile = "DataFiles/SalaryData.txt", method = "curl")
}

salary <- read.table(file="DataFiles/SalaryData.txt", header=T, sep="\t")

I will use this data set in the GGplot section below, as well.

In this part, I examine salary discrimination amongst tenure-track professors in a small Mid-western college in the United States. The data consists of information on 52 faculty members, and were initially collected for presentation in legal proceedings for which discrimination against women in salary was an issue (recorded some time prior to 1980). The data were collected from personnel files, and consist of the following variables. The data is saved in the file SalaryData.txt.

  • sex = recorded as “male” and “female”
  • degree = the highest degree obtained, recorded as “doctorate” and “masters”
  • yearsdeg = the number of years since the degree was earned
  • salary = academic salary in U.S. dollars
  • rank = academic rank, recorded as “full”, “associate” and “assistant”

The structure of the data set:

str(salary)
## 'data.frame':    52 obs. of  5 variables:
##  $ sex     : chr  "male" "male" "male" "female" ...
##  $ degree  : chr  "doctorate" "doctorate" "doctorate" "doctorate" ...
##  $ yearsdeg: int  35 22 23 27 30 21 32 18 30 31 ...
##  $ salary  : int  36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
##  $ rank    : chr  "full" "full" "full" "full" ...

After checking confounding, effect modification, etc., my fitted linear model is:

model1 <- lm(salary ~ sex + degree + yearsdeg, data = salary)

cbind(Coefficients = coef(model1), confint(model1))
##               Coefficients      2.5 %     97.5 %
## (Intercept)     15608.2036 12861.6028 18354.8045
## sexmale          2696.2635   203.1353  5189.3917
## degreemasters   -4030.2259 -6816.1165 -1244.3353
## yearsdeg          467.7237   338.5600   596.8875

Now, I plot this model’s diagnostic plots using base R plotting system:

First, I save the default plotting parameters so that I can reset them later, if needed.

defaultPar <- par()

Then, the graph:

par(mfrow=c(2, 2), oma=c(3, 0, 0, 0), mar=c(5, 4.5, 4, 2))
plot(model1, which = 1, caption = "", main="Residuals vs Fitted",
     cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19, )
plot(model1, which = 2, caption = "", main="Normal Q-Q",
     cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19, )
plot(model1, which = 3, caption = "", main="Scale-Location",
     cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19, )
plot(model1, which = 5, caption = "", main="Residuals vs Leverage",
     cex.main=1.1, cex.sub=1.1, cex.axis=1, cex.lab=1.1, pch=19, )
mtext("lm(salary ~ sex + yearsdeg + degree)", outer=TRUE,  cex=1.1,
      line=1, side=1)

Resetting the plot prameters back to default:

par(defaultPar)


2.2 Lattice

For creating a lattice plot, I use the National Emissions Inventory (NEI) database, which is released by the United States’ Environmental Protection Agency (EPA) every three years and reports on emissions of fine particulate matter (PM2.5).
For each year and for each type of PM source, the NEI records how many tons of PM2.5 were emitted from that source over the course of the entire year. The data that I use here are for 1999, 2002, 2005, and 2008.

Disclaimer: the description of the data set is taken from the Coursera website.

I want to compare emissions from motor vehicle sources in Baltimore City, Maryland (fips == “24510”) with emissions from motor vehicle sources in Los Angeles County, California (fips == “06037”), and see which city has seen greater changes over time in motor vehicle emissions.

Downloading and unzipping the data file [29 Mb]:

if(!file.exists("DataFiles/DataNEI.zip")){
    fileURL <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
    download.file(fileURL, destfile = "DataFiles/DataNEI.zip", method = "curl")
    unzip("DataFiles/DataNEI.zip", exdir = "DataFiles/")
}

The zip file contains two files:

  • PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains number of tons of PM2.5 emitted from a specific type of source for the entire year. The file contains these variables:

    • fips: A five-digit number (represented as a string) indicating the U.S. county
    • SCC: The name of the source as indicated by a digit string (see source code classification table)
    • Pollutant: A string indicating the pollutant
    • Emissions: Amount of PM2.5 emitted, in tons
    • type: The type of source (point, non-point, on-road, or non-road)
    • year: The year of emissions recorded
  • Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source.

Reading in the data:

NEI <- readRDS("DataFiles/summarySCC_PM25.rds")
SCC <- readRDS("DataFiles/Source_Classification_Code.rds")

As we only want “motor vehicle sources”, I tried different keywords in the SCC data set to find the most relevant matches. I think the words “highway veh” and “motorcycle” in the “Short.Name” column, and the word “vehicle” in the “EI.Sector” column are the best matches. It turned out that both columns have matches that are not present in the other one. As a result, I use the union() function to take unique matches from both columns:

This code chunk automatically detects if the dplyr package is not installed. Then it will install and load it:

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

First, searching the “Short.Name” column for matches using grep() function and regular expressions:

motor_indx1 <- grep("(highway veh)|(motorcycle)", unique(SCC$Short.Name),
                    value = TRUE, ignore.case = TRUE)
motor1 <- filter(SCC, Short.Name %in% motor_indx1)

Then, searching the “EI.Sector” column for matches:

motor_indx2 <- grep("vehicle", unique(SCC$EI.Sector), value = TRUE,
                    ignore.case = TRUE)
motor2 <- filter(SCC, EI.Sector %in% motor_indx2)

Finally, I take the union() of unique values from both columns:

SCC_indx <- union(unique(motor1$SCC), unique(motor2$SCC))

Calculating emissions from motor vehicles in Baltimore and LA:

total_motor <- filter(NEI, fips %in% c("24510", "06037") & SCC %in% SCC_indx) %>%
                group_by(fips, year) %>%
                  summarise(totalPM = sum(Emissions))

total_motor$fips <- sub("24510", "Baltimore", total_motor$fips)
total_motor$fips <- sub("06037", "Los Angeles", total_motor$fips)

Loading the lattice package:

if (!require(lattice)) {
  install.packages("lattice")
  library(lattice)
}

Now, the lattice panel plot:

xyplot(totalPM ~ factor(year) | fips,
       data = total_motor,
       main = list(expression("Changes of"~PM[2.5]~"emissions from motor vehicle"
                         ~"sources in Baltimore compared to Los Angeles"),
                   cex = 1),
       xlab = list("Year", cex = 1),
       ylab = list(expression(PM[2.5]~emitted~(underline("in tons"))),
                   cex = 1),
       scales = list(cex = 1),  # Increasing axes sizes
       par.strip.text = list(cex = 1),  # Increasing strip text size
       par.settings = list(strip.background=list(col="thistle1"), # Strip background color
                           layout.widths = list(ylab.axis.padding = 2)), # ylab distance
       panel = function(x, y, ...){
               panel.xyplot(x, y,
                            type = "l",
                            lwd = 4,
                            col = "purple")
               panel.points(x, y,              # Adding points to the line graph
                            pch = 19,
                            col = "purple4",
                            cex = 1)
               }
       )


2.3 GGplot

First, I load the ggplot2 package:

library(tidyverse)

I use the same salary data set introduced in the Base R section above.

2.3.1 A boxplot:

ggplot(data = salary, aes(x = sex, y = salary/1000)) +
    stat_boxplot(geom = "errorbar", width = 0.5) +
    geom_boxplot(fill = c("orchid1", "deepskyblue"),
                 outlier.shape = 21,
                 outlier.size = 2.5,
                 width = 0.5) +
    theme_classic() +
    theme(axis.text = element_text(size=14),
          axis.title = element_text(size=14,face="bold")) +
    theme(axis.title.y = element_text(
          margin = margin(t = 0, r = 20, b = 0, l = 0))) +
    theme(axis.title.x = element_text(
          margin = margin(t = 10, r = 0, b = 0, l = 0))) +
    labs(title = "Salary vs. Sex",
         x = "Sex",
         y = "Salary (thousand USD)") +
    theme(plot.title = element_text(hjust = 0.5, vjust=2, face = "bold"))

2.3.2 Another, more complicated boxplot:
I use the reshape2 package to transform the data from wide to long (tidy) format.

library(reshape2)
factorCols <- c("sex", "degree", "rank")
salary[factorCols] <- lapply(salary[factorCols], factor)

salaryFactor <- salary %>% select(salary, all_of(factorCols))
salaryFactor <- melt(salaryFactor, id.var = "salary")

And the plot itself:

ggplot(salaryFactor, aes(y = salary/1000, x = value, group = value)) +
    stat_boxplot(geom = "errorbar", width = 0.5) +
    geom_boxplot(fill = c("deepskyblue"), outlier.shape = 21,
                 outlier.size = 2.5, width = 0.5) +
    facet_wrap(. ~ variable, scales = "free_x") +
    theme_bw() +
    theme(axis.title = element_text(size=14,face="bold")) +
    theme(axis.title.y = element_text(
      margin = margin(t = 0, r = 20, b = 0, l = 0))) +
    theme(axis.title.x = element_text(
      margin = margin(t = 15, r = 0, b = 0, l = 0))) +
    labs(title = "Comparing salary within groups of sex, degree, and rank",
         x = "",
         y = "Salary (in thousand US$)") +
    theme(plot.title = element_text(hjust = 0.5, vjust=2.5, face = "bold")) +
    theme(axis.text.y = element_text(size = 11)) +
    theme(axis.text.x = element_text(size = 11, angle = 20, vjust = 0.6))

2.3.3 A histogram:

Preparing the data in the correct format:

histSalaryDat1 <- salary %>% 
                    select(sex, salary) %>% 
                      melt(id.var = "salary") %>% 
                        unite(Groups, c("variable", "value"),
                              sep = ": ",
                              remove = TRUE)

histSalaryDat2 <- salary %>% 
                    select(salary) %>% 
                      mutate(Groups = "all")

histSalaryDatMerge <- rbind(histSalaryDat1, histSalaryDat2)

Then the plot itself:

ggplot(data=histSalaryDatMerge, aes(x=salary/1000, col = Groups, fill = Groups)) +
          geom_histogram(bins = 10, alpha=0.5) +
          scale_color_brewer(palette="Dark2", direction = -1) +
          scale_fill_brewer(palette="Dark2", direction = -1) +
          labs(title="(A) Salary distribution",
               x="Salary (in thousand USD)",
               y = "Count")+
          theme_classic() +
          theme(legend.position = c(0.89, 1))

2.3.4 A scatterplot illustrating my fitted linear regression model:

In the Base R section above, I fitted a linear model (model1). Let’s look at its coefficients again:

cbind(Coefficients = coef(model1), confint(model1))
##               Coefficients      2.5 %     97.5 %
## (Intercept)     15608.2036 12861.6028 18354.8045
## sexmale          2696.2635   203.1353  5189.3917
## degreemasters   -4030.2259 -6816.1165 -1244.3353
## yearsdeg          467.7237   338.5600   596.8875

Preparing the data to be plotted:

scatterData <- salary %>% select(-rank)
scatterData <- scatterData %>%
                unite(Group, c("sex", "degree"), sep = " ", remove = FALSE)

And then the plot itself. The slopes and intercepts are obtained and caculated from the coefficients table above:

ggplot(data = scatterData, aes(y = salary/1000, x = yearsdeg, col = Group)) +
          geom_point() +
          geom_abline(slope=467.72/1000, intercept=15608.20/1000,
                      col="#F8766D", lwd=1.2) +
          geom_abline(slope=467.72/1000, intercept=(15608.2036-4030.2259)/1000,
                      col="#7CAE00", lwd=1.2) +
          geom_abline(slope=467.72/1000, intercept=(15608.2036+2696.2635)/1000,
                      col="#00BFC4", lwd=1.2) +
          geom_abline(slope=467.72/1000, intercept=(15608.2036+2696.2635-4030.2259)/1000,
                      col="#C77CFF", lwd=1.2) +
          theme_bw() +
          theme(axis.title.y = element_text(
            margin = margin(t = 0, r = 10, b = 0, l = 0))) +
          theme(axis.title.x = element_text(
            margin = margin(t = 10, r = 0, b = 0, l = 0))) +
          labs(title = "Salary ~ Sex + Degree + YearsDegree",
               x = "Years since highest degree was earned",
               y = "Salary (in thousand US$)") +
          theme(plot.title = element_text(hjust = 0.5, face = "bold", size=11))

2.3.5 A scatterplot illustrating different x-transformations:

I use another data set here only for presenting the following graph. The dataset GPAdata.csv contains information collected for a simple random sample of recent university graduates. For each individual in the study, we have the initial starting salary (Salary), and the cumulative grade point average (GPA). I would like to examine the relationship between starting salary and grade point average.

Downloading and unzipping the data file [1 Kb]:

if(!file.exists("DataFiles/GPAdata.csv")){
    fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/GPAdata.csv"
    download.file(fileURL, destfile = "DataFiles/GPAdata.csv", method = "curl")
    unzip("DataFiles/GPAdata.csv", exdir = "DataFiles/")
}

Reading the data, plus looking at its structure:

gpa <- read_csv("DataFiles/GPAdata.csv")

str(gpa)
## tibble [38 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Salary: num [1:38] 56001 47542 38798 60069 63848 ...
##  $ GPA   : num [1:38] 3.53 2.63 2.25 3.03 3.46 3.2 3.81 2.97 3.14 3.41 ...
##  $ GPAcat: chr [1:38] "D" "B" "A" "C" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Salary = col_double(),
##   ..   GPA = col_double(),
##   ..   GPAcat = col_character()
##   .. )


Here, I add the polynomial and log transformations of the x variable to the graph:

ggplot(gpa, aes(x = GPA, y = Salary/1000)) +
    geom_point() +
    geom_smooth(method = "lm", fill = "blue") +
    geom_smooth(method="lm", formula= y ~ x + I(x^2), col = "red", fill = "red") +
    geom_smooth(method = "lm", formula= y ~ log(x), col = "orange", fill = "orange") +
    labs(title = "GPA vs Salary", y = "Salary (in thousand $)") +
    theme_classic() +
    annotate(geom = "text", x = 3.23, y = 38, label = "Blue line: Salary ~ GPA") +
    annotate(geom = "text", x = 3.32, y = 35, label = "Red line: Salary ~ GPA + GPA^2") +
    annotate(geom = "text", x = 3.30, y = 32, label = "Orange line: Salary ~ log(GPA)")



3. Functions - lexical scoping

In this section, I write two functions. The overall purpose of these two functions is to make the inverse of a matrix and store it in the cache for further use. The reason for doing this is that some calculations, including finding the inverse of a matrix, can be very resource-consuming in R.

The concept of R’s lexical scoping is necessary here to fully understand how things work. The makeCacheMatrix function gets a matrix, x, as its input, and returns a list of four functions. These four functions are used to set and get the values of the other two objects created within the environment of makeCacheMatrix(), namely x (as a formal argument, representing the original matrix) and i (as a local variable, representing the inverse matrix). The makeCacheMatrix() does not calculate the inverse matrix itself, but allocates memory for caching the result of the cacheSolve function (description below).

We can use the double arrow assignment operator (<<-) to assign value to an object that is in an environment higher in the hierarchy of R environments. Every function creates its own environment. In the example below, we use the double arrow assignment operator to assign value to the object i which is in another environment.

makeCacheMatrix <- function(x = matrix()) {
  i <- NULL
  set <- function(y){
    x <<- y
    i <<- NULL
  }
  get <- function() x
  set_inverse <- function(inverse) i <<- inverse
  get_inverse <- function() i
  list(set = set,
       get = get,
       set_inverse = set_inverse,
       get_inverse = get_inverse)
}


The cacheSolve function gets the returned value of the makeCacheMatrix function as its first formal argument. Then uses the get_inverse function created within makeCacheMatrix() to access the value of i (representing the inverse matrix). If a previously calculated inverse matrix has been assigned to i, then the cacheSolve() gives us a message indicating that it gets the cached data and returns the value of i. However, if i is NULL, the cacheSolve() will get the original matrix from makeCacheMatrix(), find the inverse matrix, put the result in the set_inverse function within makeCacheMatrix(), and return the inverse matrix.

Every time we set another matrix using the set() function within makeCacheMatrix, we need to run the cacheSolve again to re-calculate and save the inverse matrix in cache.

cacheSolve <- function(x, ...) {
  i <- x$get_inverse()
  if (!is.null(i)) {
    message("getting cached data")
    return(i)
  }
  data <- x$get()
  i <- solve(data, ...)
  x$set_inverse(i)
  i
}
3.1 Example

I make a simple matrix:

myMatrix <- matrix(1:4, ncol = 2)

I give this matrix to the makeCacheMatrix and assign it to the object test:

test <- makeCacheMatrix(myMatrix)

then, the first time that I run the cacheSolve function, it only gives me the inverse matrix:

cacheSolve(test)
##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5

The second time that I run the cacheSolve function, it gets the answer from the cache and does not recalculate it (and prints the message “getting cached data”), hence saving resources.

cacheSolve(test)
## getting cached data
##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5



4. Loop functions

4.1 For loop

For loops are most commonly used for iterating over the elements of an object:

A simple for loop:

a <- c("a", "b", "c", "d")

for (i in seq_along(a)) {
  print(a[i])
}
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"

Another:

b <- matrix(1:6, 2, 3)

for (i in seq_len(nrow(b))) {
  for (j in seq_len(ncol(b))) {
    print(b[i, j])
  }
}
## [1] 1
## [1] 3
## [1] 5
## [1] 2
## [1] 4
## [1] 6
4.2 while loop

In while loops, we define a condition. As long as the condition is correct, the loop continues. They are prone to becoming infinite and causing the software to crash, if not written cautiously.

A simple while loop:

count <- 0

while(count < 10){
  print(count)
  count <- count + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9

Another:

z <- 5
while(z >= 3 && z <= 10){
  print(z)
  coin <- rbinom(1, 1, 0.5)
  if (coin == 1) {
    z <- z + 1
  } else {
    z <- z - 1
  }
}
## [1] 5
## [1] 6
## [1] 5
## [1] 4
## [1] 5
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 6
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 4
## [1] 3
## [1] 4
## [1] 3
4.3 apply family

The apply family is a group of loop functions that does not require multi-line coding or lots of curly braces. They are easier to use when exploring the data. This family includes these functions: lapply(), sapply(), apply(), tapply(), mapply(). They are very similar to each other.

First, I load the InsectSprays data set, which is built into R:

data("InsectSprays")
head(InsectSprays)
##   count spray
## 1    10     A
## 2     7     A
## 3    20     A
## 4    14     A
## 5    14     A
## 6    12     A

Then, I calculate the sum of insect counts for each spray type only using one function call with tapply:

tapply(InsectSprays$count, INDEX = InsectSprays$spray, FUN = sum)
##   A   B   C   D   E   F 
## 174 184  25  59  42 200



5. Regular expressions

Regular expressions or regex are sophisticated commands that can be used to find or replace character strings in a large text vector or data file.

To show examples of regular expressions, I will work with the following data set [554 Kb]:

if(!file.exists("DataFiles/homicides.txt")){
  fileURL <- "https://raw.githubusercontent.com/ahawker/data-analysis-coursera/master/HW4/homicides.txt"
  download.file(fileURL, destfile = "DataFiles/homicides.txt", method = "curl")
}

homicides <- readLines("DataFiles/homicides.txt", warn = FALSE)

Exploring the data:

length(homicides)
## [1] 1250
homicides[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
homicides[1000]
## [1] "39.33626300000, -76.55553990000, icon_homicide_shooting, 'p1200', '<dl><dt><a href=\"http://essentials.baltimoresun.com/micro_sun/homicides/victim/1200/davon-diggs\">Davon Diggs</a></dt><dd class=\"address\">4100 Parkwood Ave<br />Baltimore, MD 21206</dd><dd>Race: Black<br />Gender: male<br />Age: 21 years old</dd><dd>Found on November  5, 2011</dd><dd>Victim died at Johns Hopkins Bayview Medical Center </dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Originally reported in 5000 Belair Road; later determined to be rear alley of 4100 block Parkwood</p></dd></dl>'"

Using grep function, we can get the index number of those values inside homicide data set that contain either the words baltimore or November.

g <- grep("baltimore|November", homicides)
length(g)
## [1] 1094

We can also ask for the letter b in baltimore to be either lower of uppercase:

i <- grep("[Bb]altimore", homicides)
length(i)
## [1] 1249

There is another data set in R called state.name, which contains the name of the 50 US states. Using grep, we can search for states’ names that begin with New:

grep("^New", state.name)
## [1] 29 30 31 32
grep("^New", state.name, value = TRUE)
## [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

There are other functions, such as grepl(), regexpr(), sub(), and gsub() that do similar things with slight differences and customization.



6. Simulations

For producing random numbers, it is better to specify a seed. This way, we can get the exact same random number next time we run the code:

set.seed(3435)
trainIndicatot <- rbinom(4601, size = 1, prob = 0.5)
table(trainIndicatot)
## trainIndicatot
##    0    1 
## 2314 2287



7. Linear regression

Again, I work with the salary data set from Base R section above, to develop a linear regression model investigating the relationship between sex and salary.

7.1 Exploratory analysis

Fisrt, doing some exploratory data analysis:

dim(salary)
## [1] 52  5
names(salary)
## [1] "sex"      "degree"   "yearsdeg" "salary"   "rank"
head(salary)
##      sex    degree yearsdeg salary rank
## 1   male doctorate       35  36350 full
## 2   male doctorate       22  35350 full
## 3   male doctorate       23  28200 full
## 4 female doctorate       27  26775 full
## 5   male   masters       30  33696 full
## 6   male doctorate       21  28516 full
tail(salary)
##       sex    degree yearsdeg salary      rank
## 47 female doctorate        1  16686 assistant
## 48 female doctorate        1  15000 assistant
## 49 female doctorate        2  20300 assistant
## 50   male doctorate       11  22050 associate
## 51   male doctorate       14  23070 associate
## 52 female doctorate       15  19140 assistant


summary(salary)
##      sex           degree      yearsdeg         salary             rank   
##  female:14   doctorate:36   Min.   : 1.00   Min.   :15000   assistant:17  
##  male  :38   masters  :16   1st Qu.: 7.00   1st Qu.:19169   associate:15  
##                             Median :15.00   Median :23712   full     :20  
##                             Mean   :16.29   Mean   :23957                 
##                             3rd Qu.:23.25   3rd Qu.:27258                 
##                             Max.   :35.00   Max.   :38045
table(salary$sex)
## 
## female   male 
##     14     38
table(salary$degree)
## 
## doctorate   masters 
##        36        16
table(salary$rank)
## 
## assistant associate      full 
##        17        15        20


Some simple exploratory plots:

par(mfrow = c(2, 2))
boxplot(salary$yearsdeg, main = "Years since last degree")
boxplot(salary$salary, main = "Salary")
hist(salary$yearsdeg)
hist(salary$salary)


Plotting a scatterplot and calculating the Pearson correlation:

plot(salary$yearsdeg, salary$salary)

cor(salary$yearsdeg, salary$salary, method = "pearson")
## [1] 0.659165


Calculating mean salary for males and females separately:

mean(salary$salary[salary$sex=="male"])
## [1] 24786.13
mean(salary$salary[salary$sex=="female"])
## [1] 21706.43


Calculating standard deviation (SD) of salary for males and females separately:

sd(salary$salary[salary$sex=="male"])
## [1] 5558.507
sd(salary$salary[salary$sex=="female"])
## [1] 5923.151


7.2 t-test

Doing an unadjusted bivariate test (t-test) (as one SD is not more than double the other, I assume equal variances):

t.test(salary ~ sex, data = salary, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  salary by sex
## t = -1.7418, df = 50, p-value = 0.0877
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6631.1663   471.7603
## sample estimates:
## mean in group female   mean in group male 
##             21706.43             24786.13


7.3 Model fitting

Fitting another linear model (I previously fitted a model called model1 under 2.1 Base R section):

model2 <- lm(salary ~ factor(sex) + yearsdeg, data = salary)
summary(model2)
## 
## Call:
## lm(formula = salary ~ factor(sex) + yearsdeg, data = salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9628.4 -2564.3  -167.6  2237.4 13175.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     15937.54    1461.30  10.906 1.04e-14 ***
## factor(sex)male  2678.13    1331.01   2.012   0.0497 *  
## yearsdeg          372.19      59.26   6.280 8.64e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4252 on 49 degrees of freedom
## Multiple R-squared:  0.4777, Adjusted R-squared:  0.4563 
## F-statistic:  22.4 on 2 and 49 DF,  p-value: 1.23e-07


Using partial F-test, AIC, and BIC, we can see that model 1 is a better fit:

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: salary ~ sex + degree + yearsdeg
## Model 2: salary ~ factor(sex) + yearsdeg
##   Res.Df       RSS Df  Sum of Sq      F   Pr(>F)   
## 1     48 753282637                                 
## 2     49 886056860 -1 -132774223 8.4605 0.005483 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model1, model2)
##        df      AIC
## model1  5 1014.982
## model2  4 1021.424
BIC(model1, model2)
##        df      BIC
## model1  5 1024.739
## model2  4 1029.229


I also previously showed the diagnostic plots of model1 to check linear regression assumptions under 2.1 Base R section.



8. Logistic regression


8.1 Introducing data

For this section, I study the relationship between education and the use of contraceptives in Fiji, in the mid 1970s. The dataset that I use is ContraceptiveUseIndicators.txt. The data consists of observations on 1607 married and fecund women interviewed in the Fiji Fertility Survey of 1975.

The main research question of interest is to examine the relationship between education and contraceptive use in Fiji in the 1970s, while controlling for other confounding variables. Below is a description of the variables contained in the data set.

  • Age = The numeric age
  • AgeCat = Categorized ages: A = “<25”, B=”25-29”, C=”30-39”, D=”40-49”
  • Education = Education categorized as: 0 = “low”, 1 = “high”
  • WantsMore = An indicator if the women wants to have more children: 0 = “no”, 1 = “yes”
  • UseContraceptive = An indicator of contraceptive use: 0 = “no”, 1 = “yes”

8.2 Exploratory analysis

First, downloading the data:

if(!file.exists("DataFiles/ContraceptiveUseIndicators.txt")){
    fileURL <- "https://raw.githubusercontent.com/rezahosseinimed/portfolio/main/ContraceptiveUseIndicators.txt"
    download.file(fileURL, destfile = "DataFiles/ContraceptiveUseIndicators.txt",
                  method = "curl")
}

Reading the data:

OCP <- read.table("DataFiles/ContraceptiveUseIndicators.txt",
                  header = TRUE)
dim(OCP)
## [1] 1607    5
head(OCP)
##   Age AgeCat Education WantsMore UseContraceptive
## 1  21      A         0         1                0
## 2  18      A         0         1                0
## 3  19      A         0         1                0
## 4  22      A         0         1                0
## 5  17      A         0         1                0
## 6  20      A         0         1                0
str(OCP)
## 'data.frame':    1607 obs. of  5 variables:
##  $ Age             : int  21 18 19 22 17 20 20 21 24 23 ...
##  $ AgeCat          : chr  "A" "A" "A" "A" ...
##  $ Education       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WantsMore       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ UseContraceptive: int  0 0 0 0 0 0 0 0 0 0 ...

Renaming the values of several variable:

#### Renaming the values for the AgeCat variable:
OCP$AgeCat[which(OCP$AgeCat=="A")] <- "<25"
OCP$AgeCat[which(OCP$AgeCat=="B")] <- "25-29"
OCP$AgeCat[which(OCP$AgeCat=="C")] <- "30-39"
OCP$AgeCat[which(OCP$AgeCat=="D")] <- "40-49"

#### Renaming the values for the Education variable:
OCP$Education[which(OCP$Education=="0")] <- "Low"
OCP$Education[which(OCP$Education=="1")] <- "High"

#### Renaming the values for the WantsMore variable:
OCP$WantsMore[which(OCP$WantsMore=="0")] <- "No"
OCP$WantsMore[which(OCP$WantsMore=="1")] <- "Yes"

#### Renaming the values for the UseContraceptive variable:
OCP$UseContraceptive[which(OCP$UseContraceptive=="0")] <- "No"
OCP$UseContraceptive[which(OCP$UseContraceptive=="1")] <- "Yes"

Making these four columns as factor using lapply function:

factorCols <- c("Education", "WantsMore", "AgeCat", "UseContraceptive")
OCP[factorCols] <- lapply(OCP[factorCols], factor)

Changing the reference level for Education variable:

OCP$Education <- relevel(OCP$Education, "Low")

Comparing UseContraceptive against AgeCat (I do not show these exploratory analyses for other variables to save space):

table(OCP$UseContraceptive, OCP$AgeCat)
##      
##       <25 25-29 30-39 40-49
##   No  325   299   375   101
##   Yes  72   105   237    93
mosaicplot(table(OCP$UseContraceptive, OCP$AgeCat),
           main = "Mosaicplot", xlab = "Contraceptive use", ylab = "Age category")

chisq.test(OCP$UseContraceptive, OCP$AgeCat)
## 
##  Pearson's Chi-squared test
## 
## data:  OCP$UseContraceptive and OCP$AgeCat
## X-squared = 77.578, df = 3, p-value < 2.2e-16


8.3 Unadjusted OR & χ2 test

Preparing the data into a table:

OCPtbl <- table(OCP$Education, OCP$UseContraceptive)
OCPtbl <- OCPtbl[, c(2, 1)]
OCPtbl <- OCPtbl[c(2, 1), ]
OCPtbl
##       
##        Yes  No
##   High 306 688
##   Low  201 412

Calculating unadjusted odds ratio (OR) using the epiR package:

library(epiR)
epi.2by2(OCPtbl, method = "cross.sectional", conf.level = 0.95)
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +          306          688        994                30.8       0.445
## Exposed -          201          412        613                32.8       0.488
## Total              507         1100       1607                31.5       0.461
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Prevalence ratio                             0.94 (0.81, 1.09)
## Odds ratio                                   0.91 (0.73, 1.13)
## Attrib prevalence *                          -2.00 (-6.70, 2.69)
## Attrib prevalence in population *            -1.24 (-5.60, 3.12)
## Attrib fraction in exposed (%)              -6.51 (-23.35, 8.02)
## Attrib fraction in population (%)           -3.93 (-13.56, 4.88)
## -------------------------------------------------------------------
##  Test that OR = 1: chi2(1) = 0.706 Pr>chi2 = 0.40
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units
chisq.test(OCPtbl)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  OCPtbl
## X-squared = 0.61593, df = 1, p-value = 0.4326


8.4 Adjusted OR

Calculating adjusted ORs, including interaction terms:

model3 <- glm(UseContraceptive ~ Education + AgeCat + WantsMore +
                                 Education:AgeCat + Education:WantsMore,
              data = OCP, family = "binomial")
round(exp(cbind(ORs = coef(model3), confint(model3))), 2)
## Waiting for profiling to be done...
##                             ORs 2.5 % 97.5 %
## (Intercept)                0.41  0.19   0.81
## EducationHigh              0.89  0.41   2.09
## AgeCat25-29                1.73  0.77   4.12
## AgeCat30-39                2.58  1.29   5.65
## AgeCat40-49                2.31  1.08   5.31
## WantsMoreYes               0.28  0.19   0.40
## EducationHigh:AgeCat25-29  0.84  0.33   2.06
## EducationHigh:AgeCat30-39  0.94  0.40   2.08
## EducationHigh:AgeCat40-49  2.72  0.97   7.33
## EducationHigh:WantsMoreYes 2.11  1.31   3.44


Calculating odds ratios for each stratum of variables WantsMore and AgeCat, based on the coefficients table below:

summary(model3)$coefficients
##                               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)                -0.89968338  0.3707028 -2.4269666 1.522566e-02
## EducationHigh              -0.11542242  0.4135261 -0.2791176 7.801546e-01
## AgeCat25-29                 0.54772364  0.4244194  1.2905246 1.968686e-01
## AgeCat30-39                 0.94870626  0.3735493  2.5397086 1.109449e-02
## AgeCat40-49                 0.83814518  0.4019206  2.0853502 3.703752e-02
## WantsMoreYes               -1.28771510  0.1939207 -6.6404211 3.127883e-11
## EducationHigh:AgeCat25-29  -0.17092490  0.4665755 -0.3663392 7.141120e-01
## EducationHigh:AgeCat30-39  -0.06069185  0.4182164 -0.1451207 8.846156e-01
## EducationHigh:AgeCat40-49   0.99955089  0.5131261  1.9479634 5.141934e-02
## EducationHigh:WantsMoreYes  0.74804861  0.2462960  3.0371937 2.387919e-03
#### OR for WantsMore:No - AgeCat:A
round(exp(-0.11542242), 2)
## [1] 0.89
#### OR for WantsMore:No - AgeCat:B
round(exp(-0.11542242 - 0.17092490), 2)
## [1] 0.75
#### OR for WantsMore:No - AgeCat:C
round(exp(-0.11542242 - 0.06069185), 2)
## [1] 0.84
#### OR for WantsMore:No - AgeCat:D
round(exp(-0.11542242 + 0.99955089), 2)
## [1] 2.42
#### OR for WantsMore:Yes - AgeCat:A
round(exp(-0.11542242 + 0.74804861), 2)
## [1] 1.88
#### OR for WantsMore:Yes - AgeCat:B
round(exp(-0.11542242 + 0.74804861 - 0.17092490), 2)
## [1] 1.59
#### OR for WantsMore:Yes - AgeCat:C
round(exp(-0.11542242 + 0.74804861 - 0.06069185), 2)
## [1] 1.77
#### OR for WantsMore:Yes - AgeCat:D
round(exp(-0.11542242 + 0.74804861 + 0.99955089), 2)
## [1] 5.11






Thank you for reading my report.