Introduction

Files

Download the zip-file Introduction-to-Biostatistics. There, you will find:

A data dictionary which is important to understanding what variables mean, how they are coded, etc. Birthweight_data_kg_description.docx
A data file with all your data in csv format. Note, csv is a “comma-separated values” file which, all intents and purposes, is an excel-type file.
An R Script, this is the document which holds your script (code).
A Project File, this is basically a designated project space in RStudio. It remembers your file locations, any data you imported into RStudio, etc. This is not necessary, but it does keep things organized when you are dealing with many projects with many Rscripts, etc. For example, some people like to have multiple Rscripts for different elements of their project so they avoid super long single script documents.

Note: When you have a designated RProj Project, open that first before opening and running your RScript. This will avoid having to set a working directory (see below).

Reading in your data

CSV

There are so many ways to read in (i.e. import) your data; everyone has a different preference.

Now try reading in the csv file (cmd/ctl+enter):

Note: I named this file in RStudio as “df” but you can name it anything you want (no spaces). Maybe one time you notice a formatting issue with the excel file; you can always go back to your excel, fix the issue, and rerun this bit of code. R will automatically replace the old “data” file with this new one. If you want to reimport but KEEP the old copy, just rename the data in RStudio (i.e. read in “data2”)

Working Directory

ERROR If you get an error saying something along the lines of “no such file or directory”, it may be that R does not know where to look or is looking in the wrong place.

A working directory is a folder path (map), so that R knows what folder to look into. First, try checking to see what working directory it is currently on; run this line.

Tip: RStudio is always trying to anticipate what you are trying to type, if as you are typing a function you see it pop up, hit TAB to autofill! Saves time and spelling headaches.

getwd()

Is this the right path to the folder where your data lives? If not (or if it says you have no working directory), set your working directory. First, go to your library/finder and find the folder you downloaded. Right click on any of the files and press Get Info (Mac) or Properties (Windows). You should see a location path; copy it.

It should look something like this:

setwd("/Users/yourname/Library/Block 3/Introduction-to-Biostatistics")

Try reading in the file again now!

Analyzing your data

Always look at your data first!

Glance over

If you want to view your data in spread-sheet-like form within RStudio use View(). Annoyingly, this must be capitalized while other functions are normally lowercase. Alternatively, you can go to your environment (upper right hand corner section of RStudio) and click on data file.

View(data) #opens up your datafile (data) within RStudio

head(data) #gives you a peak of the first 10 rows of your dataset

##     ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360     56        4.55       34        44      0   20      0     162    57
## 2 1016     53        4.32       36        40      0   19      0     171    62
## 3  462     58        4.10       39        41      0   35      0     172    58
## 4 1187     53        4.07       38        44      0   20      0     174    68
## 5  553     54        3.94       37        42      0   24      0     175    66
## 6 1636     51        3.93       38        38      0   29      0     165    61
##   fage fedyrs fnocig fheight lowbwt mage35
## 1   23     10     35     179      0      0
## 2   19     12      0     183      0      0
## 3   31     16     25     185      0      1
## 4   26     14     25     189      0      0
## 5   30     12      0     184      0      0
## 6   31     16      0     180      0      0

head() (Does not apply to the CSV file we read in)

When working with excel, R reads the cell type in (number vs text vs etc) and this sometimes gives you errors when running analyses. The function you are using is expecting two numbers for example, yet it is getting 1 number and 1 text. First thing you could do is run the head() function and see if the variable in question matches the appropriate format (i.e. if you see <chr> when you want to see the simple number <dbl>).

Alternatively and more commonly, you use str() in order to find out what “string type” (i.e. numeric, factor, etc.) your variable is:

str(data$Length)

##  int [1:42] 56 53 58 53 54 51 52 53 54 50 ...

If it doesn’t match what you want, look up a variation of this code, depending on your situation:

df$var <- as.numeric(df$var)

Notice that, when specifying a variable, you need to use $ after your dataframe.

Identifying variables

Sometimes you need a quick list of all the variables in your dataset for easy copy paste. Use colnames()

colnames(data)

##  [1] "ID"          "Length"      "Birthweight" "Headcirc"    "Gestation"  
##  [6] "smoker"      "mage"        "mnocig"      "mheight"     "mppwt"      
## [11] "fage"        "fedyrs"      "fnocig"      "fheight"     "lowbwt"     
## [16] "mage35"

Summarize variables

summary(data)

##        ID             Length       Birthweight       Headcirc   
##  Min.   :  27.0   Min.   :43.00   Min.   :1.920   Min.   :30.0  
##  1st Qu.: 537.2   1st Qu.:50.00   1st Qu.:2.940   1st Qu.:33.0  
##  Median : 821.0   Median :52.00   Median :3.295   Median :34.0  
##  Mean   : 894.1   Mean   :51.33   Mean   :3.313   Mean   :34.6  
##  3rd Qu.:1269.5   3rd Qu.:53.00   3rd Qu.:3.647   3rd Qu.:36.0  
##  Max.   :1764.0   Max.   :58.00   Max.   :4.570   Max.   :39.0  
##    Gestation         smoker            mage           mnocig      
##  Min.   :33.00   Min.   :0.0000   Min.   :18.00   Min.   : 0.000  
##  1st Qu.:38.00   1st Qu.:0.0000   1st Qu.:20.25   1st Qu.: 0.000  
##  Median :39.50   Median :1.0000   Median :24.00   Median : 4.500  
##  Mean   :39.19   Mean   :0.5238   Mean   :25.55   Mean   : 9.429  
##  3rd Qu.:41.00   3rd Qu.:1.0000   3rd Qu.:29.00   3rd Qu.:15.750  
##  Max.   :45.00   Max.   :1.0000   Max.   :41.00   Max.   :50.000  
##     mheight          mppwt            fage          fedyrs          fnocig     
##  Min.   :149.0   Min.   :45.00   Min.   :19.0   Min.   :10.00   Min.   : 0.00  
##  1st Qu.:161.0   1st Qu.:52.25   1st Qu.:23.0   1st Qu.:12.00   1st Qu.: 0.00  
##  Median :164.5   Median :57.00   Median :29.5   Median :14.00   Median :18.50  
##  Mean   :164.5   Mean   :57.50   Mean   :28.9   Mean   :13.67   Mean   :17.19  
##  3rd Qu.:169.5   3rd Qu.:62.00   3rd Qu.:32.0   3rd Qu.:16.00   3rd Qu.:25.00  
##  Max.   :181.0   Max.   :78.00   Max.   :46.0   Max.   :16.00   Max.   :50.00  
##     fheight          lowbwt           mage35       
##  Min.   :169.0   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:175.2   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :180.5   Median :0.0000   Median :0.00000  
##  Mean   :180.5   Mean   :0.1429   Mean   :0.09524  
##  3rd Qu.:184.8   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :200.0   Max.   :1.0000   Max.   :1.00000

This previous output is not super informative so let’s explore our data more.

Histograms

# distribution of length of baby (cm)
summary(data$Length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   50.00   52.00   51.33   53.00   58.00

hist(data$Length, nclass = 40)

By default if you leave nclass argument out, it will automatically calculate a good binning pattern. Use nclass when you want to manipulate bins from default. Here’s what it looks like without the nclass argument.

hist(data$Length)

Distribution of Birthweight (kg)

summary(data$Birthweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.920   2.940   3.295   3.313   3.647   4.570

hist(data$Birthweight, nclass = 20)

Finding the distribution of maternal number of cigarettes smoked per day:

summary(data$mnocig)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   4.500   9.429  15.750  50.000

hist(data$mnocig,nclass = 10)

Distribution of gestation (in weeks)

hist(data$Gestation)

Hypothesis Testing

Check out this helpful guide for Choosing Statistical Tests

Confidence Intervals

When you are given only the summary stats (i.e. mean, sample size, proportion, etc.) for a variable and are asked to calculate a CI, there are two methods in R to give you the same results.

First, you must determine if this is a continuous distribution or a binomial distribution. Then see if you are dealing with one point or comparing two samples.

Binomial

From Activity Worksheet 1, Question 4 was:

Side effects were prevalent in both the placebo and mRNA-1273 vaccine group. Of those in the treatment group, 84.2% of the patients reported a local adverse event. Compute the 95% confidence interval on this estimate.

You are given one estimate, and the estimate (i.e. 84.2%) is a proportion (people either had a reaction or did not). This is binary so we must use the binomial CI calculation:

$p\pm Z_{1-\alpha /2} \sqrt{\frac{p(1-p)}{n}}$

You can use two methods with R:

Method 1

Assign number of total observations to n and proportion of local adverse events in the mRNA group to p. For a 95% CI, you use $1-0.05/2 = 0.975$ which is what you use in qnorm() argument.

n <- 15762 #number of observations
p <- 0.842 #proportion (%) given

margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
margin

## [1] 0.005694125

lowerinterval <- p - margin
lowerinterval

## [1] 0.8363059

upperinterval <- p + margin
upperinterval

## [1] 0.8476941

Method 2

Here is a different way, but should give you the same result

upper <- 0.842 + 1.96 * (sqrt(0.842 * (1-0.842)/15762))
upper

## [1] 0.8476942

lower <- 0.842 - 1.96 * (sqrt(0.842 * (1-0.842)/15762))
lower

## [1] 0.8363058

Normal Distribution

Let’s say you were given this set of data: > 11, 10, 6, 3, 11, 10, 9, 11

And were asked to calculate a confidence interval. First, you can make a dataset and call it df1. In this case, we call this a vector.

df1 <- c(11, 10, 6, 3, 11, 10, 9, 11)

Now you can summarize the data as you normally would!

summary(df1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   8.250  10.000   8.875  11.000  11.000

sd(df1)

## [1] 2.900123

Since this is a continuous distribution and you are attempting to calculate the 95% CI for this distribution, you would use this equation:

$\bar{x}\pm z \frac{\sigma}{\sqrt{n}}$

Where:

$\bar{x}$ is the mean
$z$ value is determined by your CI level (in our case, 95%)
$\sigma$ is the SD
$n$ is the sample size

R can do all these sub calculations for you with its native functions

CI_pos <- (mean(df1)) + (1.96 * ((sd(df1) / (sqrt(length(df1))))))
CI_pos #Upper level of CI

## [1] 10.88468

CI_neg <- (mean(df1))-(1.96*((sd(df1)/(sqrt(length(df1))))))
CI_neg #Lower level of CI

## [1] 6.865317

One Sample T Test

We hypothesize that the average gestation is less than 40 weeks

H0: the average gestatopm age is $\ge$ 40
HA: the average gestation age is < 40

We need to use a t-test here bc we are looking at one continuous variable for which we have the entire data set (so not only a mean and SD like in a z-test)

Here is how to mathematically come up with an answer (do a t-test by hand):

# by hand
m.gest = mean(data$Gestation)
sd.gest = sd(data$Gestation)
n.gest = length(data$Gestation)

t.obs = (m.gest - 40)/(sd.gest/sqrt(n.gest))
pt(t.obs, df=n.gest-1, lower.tail = T)

## [1] 0.02694659

Base R already has a T-test function called t.test()

# using the function
t.test(data$Gestation, mu = 40, alternative = "less")

## 
##  One Sample t-test
## 
## data:  data$Gestation
## t = -1.9847, df = 41, p-value = 0.02695
## alternative hypothesis: true mean is less than 40
## 95 percent confidence interval:
##      -Inf 39.87688
## sample estimates:
## mean of x 
##  39.19048

data$Gestation

You are telling R to look at the variable Gestation in your dataframe data
mu is the “40 weeks”
alternative = "less" You are telling R your alternative hypothesis is less than mu

TIP remember that capitalization of vars matters!!!

Two Sample T Test

Example 1

We hypothesize that average gestation will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.

H0: difference in average gestation of babies of mothers that did and did not smoke cigarettes = 0
HA: difference in average gestation of babies of mothers that did and did not smoke cigarettes != 0

Note: From here on out, in R, when using ! in front of an argument, this means NOT. So if x = 0 then x != 0 must mean x DOES NOT equal 0.

# distribution of mothers that smoked
table(data$smoker)

## 
##  0  1 
## 20 22

# before we run the analysis, let's look at our data
boxplot(data$Gestation~data$smoker)

We are using a two sample t-test because we are now working with two variables; one variable is numerical/continuous (Gestation) and the other is binary (smoker).

# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Gestation~data$smoker)

## 
##  Welch Two Sample t-test
## 
## data:  data$Gestation by data$smoker
## t = 0.59903, df = 38.517, p-value = 0.5527
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.17817  2.16908
## sample estimates:
## mean in group 0 mean in group 1 
##        39.45000        38.95455

Example 2

We hypothesize that average birthweight will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.

H0: difference in average birthweight of babies of mothers that did and did not smoke cigarettes = 0
HA: difference in average birthweight of babies of mothers that did and did not smoke cigarettes != 0

# before we run the analysis, let's look at our data
boxplot(data$Birthweight~data$smoker)

# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Birthweight~data$smoker) # for reference: .37kg = .83lbs

## 
##  Welch Two Sample t-test
## 
## data:  data$Birthweight by data$smoker
## t = 2.1134, df = 39.618, p-value = 0.04092
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.01628738 0.73453080
## sample estimates:
## mean in group 0 mean in group 1 
##        3.509500        3.134091

Correlation

We hypothesize that average birthweight will be associated with birth length

H0: correlation between birthweight and length = 0
HA: correlation between birthweight and length != 0

# look at the association: is it linear?
plot(data$Length, data$Birthweight, 
     xlab="length",ylab="weight")

Note: by default, xlab and ylab (if you don’t include them) will just use the actual variable names. Since msot variable names are ugly, you can specify what you actually want them to be called (in this case length for x-axis and weight for y-axis). Try ?plot to check out how else you can customize your Base R plots! And if you want them prettier, check out (but don’t be too intimidated by) ggplot2 package.

We are using correlation because we have two continuous variables

# test the correlation
cor.test(data$Length,data$Birthweight)

## 
##  Pearson's product-moment correlation
## 
## data:  data$Length and data$Birthweight
## t = 6.6931, df = 40, p-value = 5.029e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5428137 0.8442612
## sample estimates:
##       cor 
## 0.7268335

Linear Regression

Univariate

We hypothesize that average birth length will predict birthweight

H0: beta_length = 0
HA: beta_length != 0

Univariate means you have one predictor to one outcome

Where:

Length is the predictor
Birthweight is the continuous outcome variable
data = data is you telling R what data it should be looking at
lm() is the linear regression function
fit.length.weight is what you decided to name this model (after running this code, do you see it in your environment view in the upper right hand corner?)

# perform simple (univariate) linear regression
fit.length.weight = lm(Birthweight~Length, data=data)
summary(fit.length.weight)

## 
## Call:
## lm(formula = Birthweight ~ Length, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89446 -0.35492  0.01746  0.28674  0.75794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.36244    1.14858  -3.798 0.000486 ***
## Length       0.14952    0.02234   6.693 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4199 on 40 degrees of freedom
## Multiple R-squared:  0.5283, Adjusted R-squared:  0.5165 
## F-statistic:  44.8 on 1 and 40 DF,  p-value: 5.029e-08

Multivariable

We hypothesize that average birth length and maternal smoking will predict birthweight

H0: beta_length = 0
HA: beta_length != 0
H0: beta_smoking = 0
HA: beta_smoking != 0

Multivariable means you have >1 predictor (or predictor + covariates) and 1 outcome variable. Note, this is how you “adjust for covariates” usually age, sex, bmi, etc.

Where:

Length and smoker are the predictors
Birthweight is the continuous outcome variable
data = data is you telling R what data it should be looking at
lm() is the linear regression function
fit.length.smoker.weight is what you decided to name this model

# perform multiple linear regression
fit.length.smoker.weight = lm(Birthweight~Length + smoker, data=data)
summary(fit.length.smoker.weight)

## 
## Call:
## lm(formula = Birthweight ~ Length + smoker, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79819 -0.29443  0.07428  0.23794  0.86588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.89619    1.13544  -3.431  0.00143 ** 
## Length       0.14297    0.02185   6.543 9.13e-08 ***
## smoker      -0.24804    0.12689  -1.955  0.05781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4058 on 39 degrees of freedom
## Multiple R-squared:  0.5704, Adjusted R-squared:  0.5483 
## F-statistic: 25.89 on 2 and 39 DF,  p-value: 7.001e-08

Note: Multivariable is not the same as Multivariate. Multivariable is very common where you have multiple predictors to one outcome. You might (but rarely) run into Multivariate models, which have multiple outcomes. Common mistake, but use these terms correctly!!

Biostatistics R Code Summaries

Marina

1/17/2023

Introduction

Files

Reading in your data

CSV

Working Directory

Other methods of reading in files (Excel)

Packages

Using readxl to read in your excel file

Analyzing your data

Always look at your data first!

Glance over

Identifying variables

Summarize variables

Histograms

Hypothesis Testing

Confidence Intervals

Binomial

Method 1

Method 2

Normal Distribution

One Sample T Test

Two Sample T Test

Example 1

Example 2

Correlation

Linear Regression

Univariate

Multivariable