Introduction

Files

Download the zip-file Introduction-to-Biostatistics. There, you will find:

  1. A data dictionary which is important to understanding what variables mean, how they are coded, etc. Birthweight_data_kg_description.docx
  2. A data file with all your data in csv format. Note, csv is a “comma-separated values” file which, all intents and purposes, is an excel-type file.
  3. An R Script, this is the document which holds your script (code).
  4. A Project File, this is basically a designated project space in RStudio. It remembers your file locations, any data you imported into RStudio, etc. This is not necessary, but it does keep things organized when you are dealing with many projects with many Rscripts, etc. For example, some people like to have multiple Rscripts for different elements of their project so they avoid super long single script documents.

Note: When you have a designated RProj Project, open that first before opening and running your RScript. This will avoid having to set a working directory (see below).

Reading in your data

CSV

There are so many ways to read in (i.e. import) your data; everyone has a different preference.

Now try reading in the csv file (cmd/ctl+enter):

Note: I named this file in RStudio as “df” but you can name it anything you want (no spaces). Maybe one time you notice a formatting issue with the excel file; you can always go back to your excel, fix the issue, and rerun this bit of code. R will automatically replace the old “data” file with this new one. If you want to reimport but KEEP the old copy, just rename the data in RStudio (i.e. read in “data2”)

Working Directory

ERROR If you get an error saying something along the lines of “no such file or directory”, it may be that R does not know where to look or is looking in the wrong place.

A working directory is a folder path (map), so that R knows what folder to look into. First, try checking to see what working directory it is currently on; run this line.

Tip: RStudio is always trying to anticipate what you are trying to type, if as you are typing a function you see it pop up, hit TAB to autofill! Saves time and spelling headaches.

getwd()

Is this the right path to the folder where your data lives? If not (or if it says you have no working directory), set your working directory. First, go to your library/finder and find the folder you downloaded. Right click on any of the files and press Get Info (Mac) or Properties (Windows). You should see a location path; copy it.

It should look something like this:

setwd("/Users/yourname/Library/Block 3/Introduction-to-Biostatistics")

Try reading in the file again now!

Other methods of reading in files (Excel)

Most of the time you will have your data stored in an excel file. There are probably other ways to do this but I prefer using the readxl package.

Packages

What is a package??

You will hear the term “Base R” and “Packages” a lot. Base R has built in functions (like what we have already used); you did not have to load anything in order to run the code. Most simple statistics like t tests are natively (endogenously lol) in “Base R”.

Some people have found that Base R functions sometimes aren’t the prettiest or aren’t suited for higher level analysis and so people create Packages. There are packages that help you spit out Table Ones, packages that help you summarize your data, packages that do generalized mixed models, and packages that help you make really customizable pretty graphs (ggplot2).

To use packages you must:

  1. Install package (you need to do this only once)

  2. Load packages (you need to do this every time you reopen a project)

#install package
install.packages("readxl") #you must put the package name in quotes

#load package
library(readxl) #don't put quotes here

Using readxl to read in your excel file

df <- read_excel("File_name.xlsx")

Note: by default, R will only read in the first sheet in the excel. If you have multiple sheets and you want to read in sheet 2 or 3 or etc, you need to specify this.

df <- read_excel("File_name.xlsx", sheet="sheet_name")

Note: best practice is to make sure your sheet names, your file names, and variable names don’t have spaces. Use underscores instead.

TIP: If you ever don’t know what the different components are of a function try running ?YOURFUNCTION. This will display the R Documentation for the function in your Help section of RStudio. It will give you the defaults, definitions to different arguments, and even examples at the bottom. Sometimes these are challenging to read, so always google; chances are, someone wrote a medium post about this lol

?read_excel

Analyzing your data

Always look at your data first!

Glance over

If you want to view your data in spread-sheet-like form within RStudio use View(). Annoyingly, this must be capitalized while other functions are normally lowercase. Alternatively, you can go to your environment (upper right hand corner section of RStudio) and click on data file.

View(data) #opens up your datafile (data) within RStudio
head(data) #gives you a peak of the first 10 rows of your dataset
##     ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360     56        4.55       34        44      0   20      0     162    57
## 2 1016     53        4.32       36        40      0   19      0     171    62
## 3  462     58        4.10       39        41      0   35      0     172    58
## 4 1187     53        4.07       38        44      0   20      0     174    68
## 5  553     54        3.94       37        42      0   24      0     175    66
## 6 1636     51        3.93       38        38      0   29      0     165    61
##   fage fedyrs fnocig fheight lowbwt mage35
## 1   23     10     35     179      0      0
## 2   19     12      0     183      0      0
## 3   31     16     25     185      0      1
## 4   26     14     25     189      0      0
## 5   30     12      0     184      0      0
## 6   31     16      0     180      0      0

head() (Does not apply to the CSV file we read in)

When working with excel, R reads the cell type in (number vs text vs etc) and this sometimes gives you errors when running analyses. The function you are using is expecting two numbers for example, yet it is getting 1 number and 1 text. First thing you could do is run the head() function and see if the variable in question matches the appropriate format (i.e. if you see <chr> when you want to see the simple number <dbl>).

Alternatively and more commonly, you use str() in order to find out what “string type” (i.e. numeric, factor, etc.) your variable is:

str(data$Length)
##  int [1:42] 56 53 58 53 54 51 52 53 54 50 ...

If it doesn’t match what you want, look up a variation of this code, depending on your situation:

df$var <- as.numeric(df$var)

Notice that, when specifying a variable, you need to use $ after your dataframe.

Identifying variables

Sometimes you need a quick list of all the variables in your dataset for easy copy paste. Use colnames()

colnames(data)
##  [1] "ID"          "Length"      "Birthweight" "Headcirc"    "Gestation"  
##  [6] "smoker"      "mage"        "mnocig"      "mheight"     "mppwt"      
## [11] "fage"        "fedyrs"      "fnocig"      "fheight"     "lowbwt"     
## [16] "mage35"

Summarize variables

summary(data)
##        ID             Length       Birthweight       Headcirc   
##  Min.   :  27.0   Min.   :43.00   Min.   :1.920   Min.   :30.0  
##  1st Qu.: 537.2   1st Qu.:50.00   1st Qu.:2.940   1st Qu.:33.0  
##  Median : 821.0   Median :52.00   Median :3.295   Median :34.0  
##  Mean   : 894.1   Mean   :51.33   Mean   :3.313   Mean   :34.6  
##  3rd Qu.:1269.5   3rd Qu.:53.00   3rd Qu.:3.647   3rd Qu.:36.0  
##  Max.   :1764.0   Max.   :58.00   Max.   :4.570   Max.   :39.0  
##    Gestation         smoker            mage           mnocig      
##  Min.   :33.00   Min.   :0.0000   Min.   :18.00   Min.   : 0.000  
##  1st Qu.:38.00   1st Qu.:0.0000   1st Qu.:20.25   1st Qu.: 0.000  
##  Median :39.50   Median :1.0000   Median :24.00   Median : 4.500  
##  Mean   :39.19   Mean   :0.5238   Mean   :25.55   Mean   : 9.429  
##  3rd Qu.:41.00   3rd Qu.:1.0000   3rd Qu.:29.00   3rd Qu.:15.750  
##  Max.   :45.00   Max.   :1.0000   Max.   :41.00   Max.   :50.000  
##     mheight          mppwt            fage          fedyrs          fnocig     
##  Min.   :149.0   Min.   :45.00   Min.   :19.0   Min.   :10.00   Min.   : 0.00  
##  1st Qu.:161.0   1st Qu.:52.25   1st Qu.:23.0   1st Qu.:12.00   1st Qu.: 0.00  
##  Median :164.5   Median :57.00   Median :29.5   Median :14.00   Median :18.50  
##  Mean   :164.5   Mean   :57.50   Mean   :28.9   Mean   :13.67   Mean   :17.19  
##  3rd Qu.:169.5   3rd Qu.:62.00   3rd Qu.:32.0   3rd Qu.:16.00   3rd Qu.:25.00  
##  Max.   :181.0   Max.   :78.00   Max.   :46.0   Max.   :16.00   Max.   :50.00  
##     fheight          lowbwt           mage35       
##  Min.   :169.0   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:175.2   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :180.5   Median :0.0000   Median :0.00000  
##  Mean   :180.5   Mean   :0.1429   Mean   :0.09524  
##  3rd Qu.:184.8   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :200.0   Max.   :1.0000   Max.   :1.00000

This previous output is not super informative so let’s explore our data more.

Histograms

# distribution of length of baby (cm)
summary(data$Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   50.00   52.00   51.33   53.00   58.00
hist(data$Length, nclass = 40)

By default if you leave nclass argument out, it will automatically calculate a good binning pattern. Use nclass when you want to manipulate bins from default. Here’s what it looks like without the nclass argument.

hist(data$Length)

Distribution of Birthweight (kg)

summary(data$Birthweight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.920   2.940   3.295   3.313   3.647   4.570
hist(data$Birthweight, nclass = 20)

Finding the distribution of maternal number of cigarettes smoked per day:

summary(data$mnocig)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   4.500   9.429  15.750  50.000
hist(data$mnocig,nclass = 10)

Distribution of gestation (in weeks)

hist(data$Gestation)

Hypothesis Testing

Check out this helpful guide for Choosing Statistical Tests

Confidence Intervals

When you are given only the summary stats (i.e. mean, sample size, proportion, etc.) for a variable and are asked to calculate a CI, there are two methods in R to give you the same results.

First, you must determine if this is a continuous distribution or a binomial distribution. Then see if you are dealing with one point or comparing two samples.

Binomial

From Activity Worksheet 1, Question 4 was:

Side effects were prevalent in both the placebo and mRNA-1273 vaccine group. Of those in the treatment group, 84.2% of the patients reported a local adverse event. Compute the 95% confidence interval on this estimate.

You are given one estimate, and the estimate (i.e. 84.2%) is a proportion (people either had a reaction or did not). This is binary so we must use the binomial CI calculation:

\(p\pm Z_{1-\alpha /2} \sqrt{\frac{p(1-p)}{n}}\)

You can use two methods with R:

Method 1

Assign number of total observations to n and proportion of local adverse events in the mRNA group to p. For a 95% CI, you use \(1-0.05/2 = 0.975\) which is what you use in qnorm() argument.

n <- 15762 #number of observations
p <- 0.842 #proportion (%) given

margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
margin
## [1] 0.005694125
lowerinterval <- p - margin
lowerinterval
## [1] 0.8363059
upperinterval <- p + margin
upperinterval
## [1] 0.8476941

Method 2

Here is a different way, but should give you the same result

upper <- 0.842 + 1.96 * (sqrt(0.842 * (1-0.842)/15762))
upper
## [1] 0.8476942
lower <- 0.842 - 1.96 * (sqrt(0.842 * (1-0.842)/15762))
lower
## [1] 0.8363058

Normal Distribution

Let’s say you were given this set of data: > 11, 10, 6, 3, 11, 10, 9, 11

And were asked to calculate a confidence interval. First, you can make a dataset and call it df1. In this case, we call this a vector.

df1 <- c(11, 10, 6, 3, 11, 10, 9, 11)

Now you can summarize the data as you normally would!

summary(df1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   8.250  10.000   8.875  11.000  11.000
sd(df1)
## [1] 2.900123

Since this is a continuous distribution and you are attempting to calculate the 95% CI for this distribution, you would use this equation:

\(\bar{x}\pm z \frac{\sigma}{\sqrt{n}}\)

Where:

  • \(\bar{x}\) is the mean

  • \(z\) value is determined by your CI level (in our case, 95%)

  • \(\sigma\) is the SD

  • \(n\) is the sample size

R can do all these sub calculations for you with its native functions

CI_pos <- (mean(df1)) + (1.96 * ((sd(df1) / (sqrt(length(df1))))))
CI_pos #Upper level of CI
## [1] 10.88468
CI_neg <- (mean(df1))-(1.96*((sd(df1)/(sqrt(length(df1))))))
CI_neg #Lower level of CI
## [1] 6.865317

One Sample T Test

We hypothesize that the average gestation is less than 40 weeks

  • H0: the average gestatopm age is \(\ge\) 40

  • HA: the average gestation age is < 40

We need to use a t-test here bc we are looking at one continuous variable for which we have the entire data set (so not only a mean and SD like in a z-test)

Here is how to mathematically come up with an answer (do a t-test by hand):

# by hand
m.gest = mean(data$Gestation)
sd.gest = sd(data$Gestation)
n.gest = length(data$Gestation)

t.obs = (m.gest - 40)/(sd.gest/sqrt(n.gest))
pt(t.obs, df=n.gest-1, lower.tail = T)
## [1] 0.02694659

Base R already has a T-test function called t.test()

# using the function
t.test(data$Gestation, mu = 40, alternative = "less")
## 
##  One Sample t-test
## 
## data:  data$Gestation
## t = -1.9847, df = 41, p-value = 0.02695
## alternative hypothesis: true mean is less than 40
## 95 percent confidence interval:
##      -Inf 39.87688
## sample estimates:
## mean of x 
##  39.19048

data$Gestation

  • You are telling R to look at the variable Gestation in your dataframe data

  • mu is the “40 weeks”

  • alternative = "less" You are telling R your alternative hypothesis is less than mu

TIP remember that capitalization of vars matters!!!

Two Sample T Test

Example 1

We hypothesize that average gestation will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.

  • H0: difference in average gestation of babies of mothers that did and did not smoke cigarettes = 0

  • HA: difference in average gestation of babies of mothers that did and did not smoke cigarettes != 0

Note: From here on out, in R, when using ! in front of an argument, this means NOT. So if x = 0 then x != 0 must mean x DOES NOT equal 0.

# distribution of mothers that smoked
table(data$smoker)
## 
##  0  1 
## 20 22
# before we run the analysis, let's look at our data
boxplot(data$Gestation~data$smoker)

We are using a two sample t-test because we are now working with two variables; one variable is numerical/continuous (Gestation) and the other is binary (smoker).

# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Gestation~data$smoker)
## 
##  Welch Two Sample t-test
## 
## data:  data$Gestation by data$smoker
## t = 0.59903, df = 38.517, p-value = 0.5527
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.17817  2.16908
## sample estimates:
## mean in group 0 mean in group 1 
##        39.45000        38.95455

Example 2

We hypothesize that average birthweight will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.

  • H0: difference in average birthweight of babies of mothers that did and did not smoke cigarettes = 0

  • HA: difference in average birthweight of babies of mothers that did and did not smoke cigarettes != 0

# before we run the analysis, let's look at our data
boxplot(data$Birthweight~data$smoker)

# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Birthweight~data$smoker) # for reference: .37kg = .83lbs
## 
##  Welch Two Sample t-test
## 
## data:  data$Birthweight by data$smoker
## t = 2.1134, df = 39.618, p-value = 0.04092
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.01628738 0.73453080
## sample estimates:
## mean in group 0 mean in group 1 
##        3.509500        3.134091

Correlation

We hypothesize that average birthweight will be associated with birth length

  • H0: correlation between birthweight and length = 0

  • HA: correlation between birthweight and length != 0

# look at the association: is it linear?
plot(data$Length, data$Birthweight, 
     xlab="length",ylab="weight")

Note: by default, xlab and ylab (if you don’t include them) will just use the actual variable names. Since msot variable names are ugly, you can specify what you actually want them to be called (in this case length for x-axis and weight for y-axis). Try ?plot to check out how else you can customize your Base R plots! And if you want them prettier, check out (but don’t be too intimidated by) ggplot2 package.

We are using correlation because we have two continuous variables

# test the correlation
cor.test(data$Length,data$Birthweight)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Length and data$Birthweight
## t = 6.6931, df = 40, p-value = 5.029e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5428137 0.8442612
## sample estimates:
##       cor 
## 0.7268335

Linear Regression

Univariate

We hypothesize that average birth length will predict birthweight

  • H0: beta_length = 0

  • HA: beta_length != 0

Univariate means you have one predictor to one outcome

Where:

  • Length is the predictor

  • Birthweight is the continuous outcome variable

  • data = data is you telling R what data it should be looking at

  • lm() is the linear regression function

  • fit.length.weight is what you decided to name this model (after running this code, do you see it in your environment view in the upper right hand corner?)

# perform simple (univariate) linear regression
fit.length.weight = lm(Birthweight~Length, data=data)
summary(fit.length.weight)
## 
## Call:
## lm(formula = Birthweight ~ Length, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89446 -0.35492  0.01746  0.28674  0.75794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.36244    1.14858  -3.798 0.000486 ***
## Length       0.14952    0.02234   6.693 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4199 on 40 degrees of freedom
## Multiple R-squared:  0.5283, Adjusted R-squared:  0.5165 
## F-statistic:  44.8 on 1 and 40 DF,  p-value: 5.029e-08

Multivariable

We hypothesize that average birth length and maternal smoking will predict birthweight

  • H0: beta_length = 0

  • HA: beta_length != 0

  • H0: beta_smoking = 0

  • HA: beta_smoking != 0

Multivariable means you have >1 predictor (or predictor + covariates) and 1 outcome variable. Note, this is how you “adjust for covariates” usually age, sex, bmi, etc.

Where:

  • Length and smoker are the predictors

  • Birthweight is the continuous outcome variable

  • data = data is you telling R what data it should be looking at

  • lm() is the linear regression function

  • fit.length.smoker.weight is what you decided to name this model

# perform multiple linear regression
fit.length.smoker.weight = lm(Birthweight~Length + smoker, data=data)
summary(fit.length.smoker.weight)
## 
## Call:
## lm(formula = Birthweight ~ Length + smoker, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79819 -0.29443  0.07428  0.23794  0.86588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.89619    1.13544  -3.431  0.00143 ** 
## Length       0.14297    0.02185   6.543 9.13e-08 ***
## smoker      -0.24804    0.12689  -1.955  0.05781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4058 on 39 degrees of freedom
## Multiple R-squared:  0.5704, Adjusted R-squared:  0.5483 
## F-statistic: 25.89 on 2 and 39 DF,  p-value: 7.001e-08

Note: Multivariable is not the same as Multivariate. Multivariable is very common where you have multiple predictors to one outcome. You might (but rarely) run into Multivariate models, which have multiple outcomes. Common mistake, but use these terms correctly!!