SESSION 2: R as a Calculator & Object Assignment

6 * 4

## [1] 24

x <- 5
y <- 10
x + y

## [1] 15

z <- 20 / 5
k <- 2^3
z + k

## [1] 12

Exercise 1

Write the equation in R:

\[ ax^2 + bx + 2 \]

# Students should practice writing the formula in R

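Exercise 2: Rewrite the equation given the values

\[ ax^2 + bx + 2 \]

Given the values of the variables a = 2, b = -5, c = 3 and x = 5, rewrite and solve the equation.

# Assign values
a <- 2
b <- -5
c <- 3 # not used in this expression; needed for the discriminant D below
x <- 5 # same value assigned to x earlier in the session
y <- a * x^2 + b * x + 2
y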

## [1] 27

Compute the value of D

# Compute discriminant value of D
D <- b^2 - 4*a*c
D

## [1] 1

Observe the formula below and write an equation in R to solve it:

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

Exercise 4

# Students should write and solve the equation in R

Solution

# break the equation in two
x1 <- (-b + sqrt(D)) / (2*a) # equation 1
x2 <- (-b - sqrt(D)) / (2*a) # equation 2

x1

## [1] 1.5

x2

## [1] 1

Solving with functions for values x1 and x2

\[ f(x) = 2x^2 - 5x + 3 \]

# Write the formula in R and solve with functions
f <- function(x) 2*x^2 - 5*x + 3
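A quick way to check the two roots is to substitute them back into the function: both should return zero. The sketch below repeats the coefficients from this session (a = 2, b = -5, c = 3) so that it runs on its own; the name roots is introduced here only for illustration.

```r
# Coefficients from the session: 2x^2 - 5x + 3
# (assigning to c does not break c(); R still finds the function when it is called)
a <- 2
b <- -5
c <- 3

f <- function(x) a * x^2 + b * x + c

# Discriminant and the two roots from the quadratic formula
D  <- b^2 - 4 * a * c
x1 <- (-b + sqrt(D)) / (2 * a)
x2 <- (-b - sqrt(D)) / (2 * a)

# Substituting the roots back into f should give zero for both
roots <- c(x1, x2)
f(roots)
```

Because f() is vectorised, both roots are evaluated in a single call, in the same way that f(10:15) works later in the session.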

Solving with functions given x = 4

x <- 4
f(x)

## [1] 15

Solving with function given values of x as 10, 11, 12, 13, 14, 15

x <- 10:15
f(x)

## [1] 153 190 231 276 325 378

Exercise

Create objects age and height_cm, and compute the BMI for a 70 kg individual. Assume age = 4 years and height = 170 cm. (BMI is weight in kilograms divided by the square of height in metres.)

# Students should try

Summary of syntax

Selected logical operators in R
Operator1	Meaning1	Operator2	Meaning2
<	less than	!	NOT
>	greater than	&	AND
<=	less than or equal to	|	OR
>=	greater than or equal to	==	equal to
!=	not equal to	xor()	exclusive OR
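The operators in the table can be tried directly at the console. The short sketch below uses two illustrative values; the object name checks is an assumption made only to collect the results in one place.

```r
x <- 5
y <- 10

# Each comparison returns a logical value (TRUE or FALSE)
checks <- c(
  less_than    = x < y,                # TRUE
  equal_to     = x == y,               # FALSE
  not_equal    = x != y,               # TRUE
  and_combined = (x < y) & (y > 20),   # FALSE: both sides must be TRUE
  or_combined  = (x < y) | (y > 20),   # TRUE: one TRUE side is enough
  negated      = !(x < y),             # FALSE
  exclusive_or = xor(x < y, y > 20)    # TRUE: exactly one side is TRUE
)
checks
```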

SESSION 3: Data Types in R

Data types in R are mainly Integer, Double, Complex, Logical, Character, Factor, and Date and Time. To determine the data type of an object, functions such as is.double(), is.integer(), is.logical(), is.character() and is.factor() can be used. class(…) can also be used.

Data types

Integer data types are whole numbers, such as counts. By default R does not store numbers as integers, but situations may arise where numbers have to be converted to integers to facilitate manipulation.

A Double is a number that can take any value, including decimals. This is the default type of numeric variable used by R. Numbers stored as doubles are generally not stored exactly but as approximations to real numbers.
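The difference between the two numeric types can be seen with typeof(). This is a minimal sketch; the object names are illustrative.

```r
n_default   <- 5                 # stored as a double by default
n_integer   <- 5L                # the L suffix asks R for an integer
n_converted <- as.integer(5.9)   # conversion drops the decimal part, giving 5

typeof(n_default)        # "double"
typeof(n_integer)        # "integer"
is.integer(n_converted)  # TRUE
n_converted
```

Note that as.integer() truncates rather than rounds; round() can be applied first if rounding is wanted.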

A Logical is an object stored as TRUE or FALSE. For example, asking whether 5 is less than 8 and storing the answer in Z makes Z a logical with the value TRUE. Logical objects have innate numeric values in R: FALSE is always treated as 0 and TRUE as 1.
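The creation of Z can be written out as follows. The febrile vector is hypothetical data, added only to show how the 0/1 values of logicals make counting and proportions easy.

```r
Z <- 5 < 8        # ask whether 5 is less than 8 and store the answer
Z                 # TRUE
class(Z)          # "logical"
as.numeric(Z)     # 1

# Hypothetical fever status for four patients
febrile <- c(TRUE, FALSE, TRUE, TRUE)
sum(febrile)      # number of TRUE values: 3
mean(febrile)     # proportion of TRUE values: 0.75
```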

A Character is an object enclosed in quotes. These are often names and cannot be used in mathematical calculations. Examples include "red", "Male" and "1". As seen, "1" is a character and so cannot be used for calculations unless converted to another type.
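A brief sketch of the conversion mentioned above, using as.numeric(). The object names are illustrative.

```r
one_chr <- "1"
class(one_chr)                  # "character"

one_num <- as.numeric(one_chr)  # convert the text "1" to the number 1
one_num + 1                     # 2

# Text that does not look like a number becomes NA (R also issues a warning)
not_a_number <- suppressWarnings(as.numeric("red"))
not_a_number
```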

A Factor is a categorical variable such as sex (male and female). Factor variables in R have levels representing the different categories. Sex, for instance, will naturally have two levels, male and female. Factors can be created from numeric and character objects using as.factor().

Examples of data types

sex <- "Female"
age <- 28
febrile <- TRUE
class(sex); class(age); class(febrile) # use the class() function to determine the data type

## [1] "character"

## [1] "numeric"

## [1] "logical"

blood.grp <- c("O", "AB", "B", "A")
class(blood.grp) # Character object

## [1] "character"

blood.grp2 <- as.factor(blood.grp) # Converts to factor variable
class(blood.grp2) # Now factor variable

## [1] "factor"

levels(blood.grp2) # Shows level of the object

## [1] "A"  "AB" "B"  "O"

unclass(blood.grp2) # Shows the numerical codes allocated to each level

## [1] 4 2 3 1
## attr(,"levels")
## [1] "A"  "AB" "B"  "O"

SESSION 4: Data Structures (Vectors, Matrices, Data Frames)

Different types of data structures in R include Vector, Matrix, Array, List, Data frame and Time-series. Data need to be in a specific structure before the appropriate analysis can be performed. These data structures are built from the data types described in Session 3.

Vectors

A vector is the simplest data structure in R. It is a collection of elements that all have the same data type. Many functions in R create vectors in a specific order. Some examples of generating vectors are shown below:

A <- c(3, 2, 4, 5, 6, 2, 3, 1, 7, 9) # Combination of numbers

B <- 1:10 # Sequence of numbers from 1 to 10

C <- seq(from = 0, to = 20, by = 2) # Sequence of numbers from 0 to 20 in steps of 2

D <- rep("B", times = 10) # Replicate "B" 10 times

# Numeric object (patient age)
patient_age <- 45

# Character object (patient ID)
patient_id <- "PT-1023"

Logical object (HIV test result: TRUE = Positive, FALSE = Negative)

# Logical variable
hiv_positive <- TRUE


A <- 7.1
is.double(A) # Determine if A is stored as a double

## [1] TRUE

is.integer(A) # Determine if A is stored as an integer

## [1] FALSE

21.3 == A + A + A # Adding A three times is not exactly equal to 21.3!

## [1] FALSE
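Because doubles are approximations, an exact comparison with == can fail even when the numbers are equal for practical purposes. One common approach, sketched below, is to compare within a tolerance using all.equal(); the tolerance of 1e-9 in the last line is an arbitrary illustrative choice.

```r
A <- 7.1
total <- A + A + A

total == 21.3                    # FALSE: exact comparison sees the rounding error
isTRUE(all.equal(total, 21.3))   # TRUE: equal within a small tolerance
abs(total - 21.3) < 1e-9         # TRUE: the same idea with an explicit tolerance
```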

Matrices

Matrices are vectors arranged in two dimensions (r, c), so that the elements are laid out in rows and columns. There are a number of ways to create a matrix; a few are shown below. First, a vector can be assigned a dimension of r x c with dim(). Names of the rows and columns can then be set with the rownames() and colnames() functions.

Example Matrix 1

X <- 1:12 # create a vector
dim(X) <- c(3, 4) # dim of 3 x 4 converts it into a matrix
colnames(X) <- c("Col1", "Col2", "Col3", "Col4")
rownames(X) <- c("Row1", "Row2", "Row3")

Example Matrix 2

X <- 1:12
matrix(X, nrow = 3, ncol = 4, byrow = FALSE,
       dimnames = list(c("Row1", "Row2", "Row3"), c("Col1", "Col2", "Col3", "Col4")))

##      Col1 Col2 Col3 Col4
## Row1    1    4    7   10
## Row2    2    5    8   11
## Row3    3    6    9   12

Example Matrix 3

Row1 <- c(1, 4, 7, 10)
Row2 <- c(2, 5, 8, 11)
Row3 <- c(3, 6, 9, 12)
X <- rbind(Row1, Row2, Row3)

Example Matrix 4

Col1 <- 1:3
Col2 <- 4:6
Col3 <- 7:9
Col4 <- 10:12
X <- cbind(Col1, Col2, Col3, Col4)
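All four examples fill the matrix column by column. The sketch below contrasts this with filling row by row through the byrow argument of matrix(); the object names by_column and by_row are illustrative.

```r
values <- 1:12

by_column <- matrix(values, nrow = 3, ncol = 4)                 # default: fill column by column
by_row    <- matrix(values, nrow = 3, ncol = 4, byrow = TRUE)   # fill row by row

dim(by_column)      # 3 rows, 4 columns
by_column[1, ]      # first row when filled by column: 1 4 7 10
by_row[1, ]         # first row when filled by row: 1 2 3 4

# Transposing swaps rows and columns
dim(t(by_column))   # 4 rows, 3 columns
```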

SESSION 5: Dataframe structures

A data frame is used for routine statistical manipulations. It is essentially a table in which each column holds one variable and may have its own class (numeric, character, factor or logical), while all columns have the same number of rows. Most standard statistical datasets are manipulated in R as data frames. Example data frames can be created in R as below:

age <- c(22, 25, 30, 28)
sex <- factor(c("M","F","F","M"))
df <- data.frame(sex, age)
str(df) # a function that displays data structures

## 'data.frame':    4 obs. of  2 variables:
##  $ sex: Factor w/ 2 levels "F","M": 2 1 1 2
##  $ age: num  22 25 30 28

sex <- gl(n = 2, k = 5, labels = c("Male", "Female")) # Create factor vector
age <- c(5, 2, 5, 6, 5, 6, 7, 8, 7, 7) # Create numeric vector
color <- rep(c("Red", "Blue"), times = 5) # Create character vector
old <- age > 6 # Create logical vector
df1 <- data.frame(sex, age, color, old) # Create data frame
df1

##       sex age color   old
## 1    Male   5   Red FALSE
## 2    Male   2  Blue FALSE
## 3    Male   5   Red FALSE
## 4    Male   6  Blue FALSE
## 5    Male   5   Red FALSE
## 6  Female   6  Blue FALSE
## 7  Female   7   Red  TRUE
## 8  Female   8  Blue  TRUE
## 9  Female   7   Red  TRUE
## 10 Female   7  Blue  TRUE

List

A list is similar to a data frame but with some differences. The components of a list can be objects other than vectors, including data frames. Also, the components of a list need not have the same length, as they must in a data frame. A list can be created with the list() function as below. The list below is made up of three elements. The first is a data frame called DF, the next is a numeric vector called Vec, and the last is a character object called Color.

list1<-list(DF=df1, Vec=1:5, Color="Red")
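Components of a list are usually extracted by name with $ or with double square brackets. The sketch below uses a small stand-in data frame (df_small, an assumption made so the example runs on its own) in place of df1.

```r
# Small stand-in for the data frame used in the session
df_small <- data.frame(sex = c("Male", "Female"), age = c(5, 7))
list1 <- list(DF = df_small, Vec = 1:5, Color = "Red")

names(list1)        # component names: "DF" "Vec" "Color"
list1$Vec           # extract a component by name with $
list1[["Color"]]    # extract a component by name with [[ ]]
list1[[1]]$age      # a column of the data frame stored in the first component
lengths(list1)      # components need not have the same length
```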

Indexing data types in R

All data types can be indexed. The simplest is indexing vectors. Numbers are placed in square brackets to select the elements or subset of the vector. Some examples are shown below:

Indexing examples

A<-c(1,3,2,4,3,2,6,5,4,8) # Vector of length 10
A[3] # Extracts the 3rd element (2)

## [1] 2

A[4:7] # Extracts 4th to 7th elements

## [1] 4 3 2 6

A[-5] # Leaves out 5th element (3)

## [1] 1 3 2 4 2 6 5 4 8

A[A<5] # Returns elements less than 5

## [1] 1 3 2 4 3 2 4

Indexing Matrices

mat1 <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
mat1

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

mat1[2:3, 1:2] # Rows 2 and 3, columns 1 and 2 selected

##      [,1] [,2]
## [1,]    4    5
## [2,]    7    8
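The same [row, column] notation carries over to data frames, where columns can also be picked by name. A minimal sketch, reusing the sex and age values from the first data frame in this session:

```r
df <- data.frame(sex = c("M", "F", "F", "M"),
                 age = c(22, 25, 30, 28))

df[2, ]                  # second row, all columns
df[, "age"]              # the age column as a vector
df$age[df$sex == "F"]    # ages of the female records: 25 30
df[df$age > 24, ]        # rows where age is greater than 24
```

Leaving the row or column position empty selects everything along that dimension.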

Exercise

Identify variables with missing values.

SESSION 6: Data Cleaning & Transformation

Importing and exporting data in R

R comes with a number of datasets. To view all datasets that come with R, use data(). To load a specific dataset from a specific package, use data(x, package = "y"), where x is the dataset's name and y is the package's name.

Text files come in many forms, so it is best to open and view the file in any text editor before attempting to import the data into R. Common file types are comma-delimited files, imported with read.csv() or read.csv2(), and tab-delimited files, imported with read.delim() or read.delim2(). There are many others, which students should explore.

The first step in reading external data is to tell R where to look for the data on your computer. This is done with the function setwd(). A list of the files in that directory can then be obtained with the function dir().
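A round trip through a comma-delimited file can be sketched as below. The patient records are made up for illustration, and a temporary file is used instead of a path set with setwd() so that the example runs on any computer.

```r
# Hypothetical patient records used only for illustration
patients <- data.frame(id  = c("PT-1", "PT-2", "PT-3"),
                       age = c(34, 51, 28))

# Export: write to a temporary file so nothing is left in the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(patients, csv_path, row.names = FALSE)

# Import: read the file back into a data frame
imported <- read.csv(csv_path)
str(imported)

# Tidy up the temporary file
unlink(csv_path)
```

Setting row.names = FALSE stops write.csv() from adding an extra column of row numbers to the file.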

Data management in R

head(X): Shows the first six rows of a dataset X
tail(X): Shows the last six rows of a dataset X
str(X): Shows the structure of the dataset X
summary(X): Gives a summary of all variables in the dataset X
transform(X): Makes new variables in the dataset X
cut(Y): Converts a continuous vector/variable into a categorical one
set.seed(): Sets a number to make random number generation reproducible
subset(): Selects a part of a data frame
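Several of these functions can be tried on the built-in airquality dataset. This is a sketch only: the temperature bands, the seed value and the object names (aq2, may, sampled_rows) are illustrative choices.

```r
data(airquality)

# tail() mirrors head() but shows the last six rows
tail(airquality)

# transform(): add a temperature column in degrees Celsius
aq2 <- transform(airquality, Temp_C = (Temp - 32) * 5 / 9)

# cut(): group temperature (Fahrenheit) into three bands;
# right = FALSE makes each band include its lower limit, e.g. 85 counts as "Hot"
aq2$Temp_band <- cut(aq2$Temp, breaks = c(50, 70, 85, 100),
                     labels = c("Cool", "Warm", "Hot"), right = FALSE)

# subset(): keep only the May records and a few columns
may <- subset(aq2, Month == 5, select = c(Ozone, Temp, Temp_C, Temp_band))

# set.seed(): make a random selection of rows reproducible
set.seed(123)
sampled_rows <- sample(nrow(aq2), size = 5)
aq2[sampled_rows, ]
```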

Selected Statistical Operations on Vectors
Function	Explanation
length(x)	length of the vector x
sum(x)	Add up all elements of vector x
mean(x)	Mean of all elements of x
sd(x)	Standard deviation of x
var(x)	Variance of x
median(x)	median of the elements of x
mad(x)	Median absolute deviation
quantile(x, probs=)	sample quantiles corresponding to the given probabilities
weighted.mean(x, w)	mean of x with weights w
rank(x)	ranks of the elements of x
density(x)	kernel density estimates of x
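A few of the functions in the table, applied to a small made-up vector. The last lines show the na.rm argument, which matters for datasets with missing values; the vector names are illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative values

length(x)                          # 8
sum(x)                             # 40
mean(x)                            # 5
median(x)                          # 4.5
var(x)                             # sample variance
sd(x)                              # sample standard deviation, the square root of var(x)
quantile(x, probs = c(0.25, 0.75)) # lower and upper quartiles
rank(x)                            # tied values share an average rank

# A single missing value makes the mean NA unless it is removed first
x_missing <- c(x, NA)
mean(x_missing)                    # NA
mean(x_missing, na.rm = TRUE)      # 5
```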

Data manipulation

data(airquality)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

aq <- na.omit(airquality)
aq$Temp_C <- (aq$Temp - 32) * 5/9
aq$HotDay <- ifelse(aq$Temp >= 85, "Hot", "Not Hot")
table(aq$HotDay)

## 
##     Hot Not Hot 
##      28      83

SESSION 7: Manipulation of Maternal & Child Health Dataset

library(MASS)
data(birthwt)
bw <- birthwt
bw$low <- factor(bw$low, labels=c("Normal BW","Low BW"))
summary(bw)

##         low           age             lwt             race      
##  Normal BW:130   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  Low BW   : 59   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##                  Median :23.00   Median :121.0   Median :1.000  
##                  Mean   :23.24   Mean   :129.8   Mean   :1.847  
##                  3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##                  Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990
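In the summary above, race and smoke are summarised as numbers even though they are categories. A possible next step, sketched here, is to convert them to factors in the same way as low. The race labels follow the coding in the MASS help page for birthwt (1 = white, 2 = black, 3 = other); the smoke labels are wording chosen for this example.

```r
library(MASS)   # birthwt ships with the MASS package
data(birthwt)
bw <- birthwt

# Convert coded variables to labelled factors
bw$race  <- factor(bw$race,  levels = 1:3, labels = c("White", "Black", "Other"))
bw$smoke <- factor(bw$smoke, levels = 0:1, labels = c("Non-smoker", "Smoker"))

# summary() now reports counts per category instead of means and quartiles
summary(bw[, c("race", "smoke")])

# Cross-tabulate race against smoking status
table(bw$race, bw$smoke)
```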

Exercise

Calculate proportion of low birth weight babies.

SESSION 8: Data Visualization for Public Health

Visualizing various plots

# Load dataset
data(mtcars)

# View first few rows
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Relationship between horsepower (hp) and fuel efficiency (mpg)
plot(mtcars$hp, mtcars$mpg,
     main = "Scatter Plot: Horsepower vs Fuel Efficiency",
     xlab = "Horsepower",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "blue")

# Fit linear model
model <- lm(mpg ~ hp, data = mtcars)

# Add regression line
abline(model, lwd = 2)

Histogram

# -----------------------------
# 2. Histogram
# Distribution of miles per gallon
# -----------------------------

hist(mtcars$mpg,
     main = "Histogram of Miles per Gallon",
     xlab = "Miles per Gallon",
     col = "lightgreen",
     border = "black")

Compare mpg across number of cylinders

boxplot(mpg ~ cyl,
        data = mtcars,
        main = "Fuel Efficiency by Cylinder Number",
        xlab = "Number of Cylinders",
        ylab = "Miles per Gallon",
        col = "orange")

# Count of cars by cylinder type
# -----------------------------

cyl_counts <- table(mtcars$cyl)

barplot(cyl_counts,
        main = "Number of Cars by Cylinder Type",
        xlab = "Cylinders",
        ylab = "Frequency",
        col = "purple")

Plotting Probabilities

# Set plotting layout (2 rows, 3 columns)
par(mfrow = c(2,3))
# -----------------------------
# 1. Normal Distribution
# -----------------------------
x <- seq(-4,4,length=100)
y <- dnorm(x, mean=0, sd=1)

plot(x,y,type="l",
     main="Normal Distribution",
     xlab="x",
     ylab="Density",
     col="blue",
     lwd=2)

# -----------------------------
# 2. Binomial Distribution
# -----------------------------
x <- 0:10
y <- dbinom(x, size=10, prob=0.5)

barplot(y,
        names.arg=x,
        main="Binomial Distribution",
        xlab="Number of Successes",
        ylab="Probability",
        col="orange")

# -----------------------------
# 3. Poisson Distribution
# -----------------------------
x <- 0:15
y <- dpois(x, lambda=4)

barplot(y,
        names.arg=x,
        main="Poisson Distribution",
        xlab="Number of Events",
        ylab="Probability",
        col="green")

# -----------------------------
# 4. Exponential Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dexp(x, rate=0.5)

plot(x,y,type="l",
     main="Exponential Distribution",
     xlab="x",
     ylab="Density",
     col="red",
     lwd=2)

# -----------------------------
# 5. Uniform Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dunif(x, min=0, max=10)

plot(x,y,type="l",
     main="Uniform Distribution",
     xlab="x",
     ylab="Density",
     col="purple",
     lwd=2)

# -----------------------------
# 6. Chi-Square Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dchisq(x, df=4)

plot(x,y,type="l",
     main="Chi-Square Distribution",
     xlab="x",
     ylab="Density",
     col="darkblue",
     lwd=2)

# Restore the default single-panel layout for later plots
par(mfrow = c(1, 1))

SESSION 9: Descriptive Statistics & Reporting

Common Statistical Analyses in R and Their Interpretation
Category	Statistical_Test	Variable_Type	Typical_Use	R_Function	Interpretation
Descriptive	Mean and standard deviation	Continuous	Summarize average and spread of normally distributed data	mean(), sd()	Mean gives average; SD shows spread around the mean
Descriptive	Median and interquartile range	Continuous	Summarize central tendency of skewed data	median(), IQR(), quantile()	Median gives middle value; IQR shows spread of middle 50%
Descriptive	Frequency and percentage	Categorical	Summarize counts in categories	table(), prop.table()	Shows number and proportion in each category
Descriptive	Minimum and maximum	Continuous	Show data limits	min(), max(), range()	Shows smallest and largest values
One-sample test	One-sample t-test	Continuous	Compare a sample mean with a known value	t.test(x, mu = value)	Small p-value suggests sample mean differs from the hypothesized value
One-sample test	One-sample proportion test	Categorical (binary)	Compare a sample proportion with a known value	prop.test(x, n, p = value)	Small p-value suggests sample proportion differs from the hypothesized value
Two-sample test	Independent samples t-test	Continuous	Compare means of two independent groups	t.test(y ~ group, var.equal = TRUE)	Small p-value suggests the two group means are significantly different
Two-sample test	Welch two-sample t-test	Continuous	Compare means of two independent groups when variances are unequal	t.test(y ~ group)	Small p-value suggests the two group means are significantly different under unequal variances
Paired test	Paired t-test	Continuous	Compare two related measurements from the same subjects	t.test(before, after, paired = TRUE)	Small p-value suggests significant difference between paired measurements
ANOVA	One-way ANOVA	Continuous	Compare means across three or more independent groups	aov(y ~ group)	Small p-value suggests at least one group mean differs
ANOVA	Repeated measures ANOVA	Continuous	Compare repeated measurements on the same subjects	aov(y ~ subject + time) or ezANOVA()	Small p-value suggests at least one time point differs
Categorical test	Chi-square test of independence	Categorical	Assess association between two categorical variables	chisq.test(table(x, y))	Small p-value suggests association between categorical variables
Categorical test	Fisher’s exact test	Categorical	Assess association between two categorical variables when expected counts are small	fisher.test(table(x, y))	Small p-value suggests association between categorical variables with small samples
Categorical test	McNemar test	Categorical	Compare paired categorical responses	mcnemar.test(table(before, after))	Small p-value suggests paired categorical responses differ
Association	Pearson correlation	Continuous	Measure linear association between two continuous variables	cor.test(x, y, method = "pearson")	Correlation coefficient shows strength and direction of linear relationship
Association	Spearman correlation	Ordinal / Continuous	Measure monotonic association between ranked or non-normal variables	cor.test(x, y, method = "spearman")	Correlation coefficient shows strength and direction of monotonic relationship
Regression	Simple linear regression	Continuous outcome	Model continuous outcome using predictor(s)	lm(y ~ x)	Regression coefficient shows expected change in outcome per unit increase in predictor
Regression	Logistic regression	Binary outcome	Model binary outcome using predictor(s)	glm(y ~ x, family = binomial)	Odds ratio or coefficients show effect of predictors on odds of outcome
Non-parametric	Mann-Whitney U test	Continuous / Ordinal	Compare two independent groups when normality assumption is not met	wilcox.test(y ~ group)	Small p-value suggests the two groups differ in distribution
Non-parametric	Wilcoxon signed-rank test	Continuous / Ordinal	Compare paired observations when normality assumption is not met	wilcox.test(before, after, paired = TRUE)	Small p-value suggests paired observations differ in distribution
Non-parametric	Kruskal-Wallis test	Continuous / Ordinal	Compare three or more groups when ANOVA assumptions are not met	kruskal.test(y ~ group)	Small p-value suggests at least one group differs in distribution
Non-parametric	Friedman test	Continuous / Ordinal	Compare repeated measurements when repeated ANOVA assumptions are not met	friedman.test(y ~ group | subject)	Small p-value suggests repeated measurements differ
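To show how a row of the table translates into code, the sketch below runs two of the listed tests on the built-in mtcars data (am is coded 0 = automatic, 1 = manual). The choice of variables is for illustration only and is not part of the course datasets.

```r
data(mtcars)

# Welch two-sample t-test: does mean mpg differ between transmission types?
welch <- t.test(mpg ~ am, data = mtcars)
welch$estimate   # mean mpg in each group
welch$p.value    # a small p-value suggests the group means differ

# Fisher's exact test: is transmission type associated with number of cylinders?
# (used instead of the chi-square test because some cell counts are small)
am_by_cyl <- table(mtcars$am, mtcars$cyl)
fisher <- fisher.test(am_by_cyl)
fisher$p.value
```

Each test returns a list, so individual results such as the p-value or the group estimates can be pulled out with $ for reporting.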

mean(aq$Ozone)

## [1] 42.0991

sd(aq$Ozone)

## [1] 33.27597

median(bw$bwt)

## [1] 2977
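The descriptive rows of the table also cover categorical summaries and the interquartile range. A minimal sketch on the built-in mtcars data, ending with one way of assembling a one-sentence report; the wording of the sentence and the object names are illustrative.

```r
data(mtcars)

# Frequency and percentage for a categorical variable
cyl_counts  <- table(mtcars$cyl)
cyl_percent <- round(100 * prop.table(cyl_counts), 1)
cyl_counts
cyl_percent

# Median and interquartile range for a continuous variable
mpg_median    <- median(mtcars$mpg)
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
IQR(mtcars$mpg)   # width of the interval between the two quartiles

# Assemble a one-sentence summary
report <- sprintf("Median fuel efficiency was %.1f mpg (IQR %.1f to %.1f).",
                  mpg_median, mpg_quartiles[1], mpg_quartiles[2])
report
```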

Exercise

Report median and IQR of birth weight in one sentence.
