Download the zip-file Introduction-to-Biostatistics. There, you will find:
Note: When you have a designated RProj Project, open that first before opening and running your RScript. This will avoid having to set a working directory (see below).
There are so many ways to read in (i.e. import) your data; everyone has a different preference.
Now try reading in the csv file (cmd/ctl+enter):
Note: I named this file in RStudio as “df” but you can name it anything you want (no spaces). Maybe one time you notice a formatting issue with the excel file; you can always go back to your excel, fix the issue, and rerun this bit of code. R will automatically replace the old “data” file with this new one. If you want to reimport but KEEP the old copy, just rename the data in RStudio (i.e. read in “data2”)
ERROR If you get an error saying something along the lines of “no such file or directory”, it may be that R does not know where to look or is looking in the wrong place.
A working directory is a folder path (map), so that R knows what folder to look into. First, try checking to see what working directory it is currently on; run this line.
Tip: RStudio is always trying to anticipate what you are trying to type, if as you are typing a function you see it pop up, hit TAB to autofill! Saves time and spelling headaches.
getwd()
Is this the right path to the folder where your data lives? If not (or if it says you have no working directory), set your working directory. First, go to your library/finder and find the folder you downloaded. Right click on any of the files and press Get Info (Mac) or Properties (Windows). You should see a location path; copy it.
It should look something like this:
setwd("/Users/yourname/Library/Block 3/Introduction-to-Biostatistics")
Try reading in the file again now!
Most of the time you will have your data stored in an excel file. There are probably other ways to do this but I prefer using the readxl package.
What is a package??
You will hear the term “Base R” and “Packages” a lot. Base R has built in functions (like what we have already used); you did not have to load anything in order to run the code. Most simple statistics like t tests are natively (endogenously lol) in “Base R”.
Some people have found that Base R functions sometimes aren’t the prettiest or aren’t suited for higher level analysis and so people create Packages. There are packages that help you spit out Table Ones, packages that help you summarize your data, packages that do generalized mixed models, and packages that help you make really customizable pretty graphs (ggplot2).
To use packages you must:
Install package (you need to do this only once)
Load packages (you need to do this every time you reopen a project)
#install package
install.packages("readxl") #you must put the package name in quotes
#load package
library(readxl) #don't put quotes here
df <- read_excel("File_name.xlsx")
Note: by default, R will only read in the first sheet in the excel. If you have multiple sheets and you want to read in sheet 2 or 3 or etc, you need to specify this.
df <- read_excel("File_name.xlsx", sheet="sheet_name")
Note: best practice is to make sure your sheet names, your file names, and variable names don’t have spaces. Use underscores instead.
TIP: If you ever don’t know what the different
components are of a function try running ?YOURFUNCTION.
This will display the R Documentation for the function in your Help
section of RStudio. It will give you the defaults, definitions to
different arguments, and even examples at the bottom. Sometimes these
are challenging to read, so always google; chances are, someone wrote a
medium post about this lol
?read_excel
If you want to view your data in spread-sheet-like form within
RStudio use View(). Annoyingly, this must be capitalized
while other functions are normally lowercase. Alternatively, you can go
to your environment (upper right hand corner section of RStudio) and
click on data file.
View(data) #opens up your datafile (data) within RStudio
head(data) #gives you a peak of the first 10 rows of your dataset
## ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360 56 4.55 34 44 0 20 0 162 57
## 2 1016 53 4.32 36 40 0 19 0 171 62
## 3 462 58 4.10 39 41 0 35 0 172 58
## 4 1187 53 4.07 38 44 0 20 0 174 68
## 5 553 54 3.94 37 42 0 24 0 175 66
## 6 1636 51 3.93 38 38 0 29 0 165 61
## fage fedyrs fnocig fheight lowbwt mage35
## 1 23 10 35 179 0 0
## 2 19 12 0 183 0 0
## 3 31 16 25 185 0 1
## 4 26 14 25 189 0 0
## 5 30 12 0 184 0 0
## 6 31 16 0 180 0 0
head() (Does not apply to the CSV file we read in)
When working with excel, R reads the cell type in (number vs text vs
etc) and this sometimes gives you errors when running analyses. The
function you are using is expecting two numbers for
example, yet it is getting 1 number and 1 text. First thing you
could do is run the head() function and see if the variable
in question matches the appropriate format (i.e. if you see
<chr> when you want to see the simple number
<dbl>).
Alternatively and more commonly, you use str()
in order to find out what “string type” (i.e. numeric, factor, etc.)
your variable is:
str(data$Length)
## int [1:42] 56 53 58 53 54 51 52 53 54 50 ...
If it doesn’t match what you want, look up a variation of this code, depending on your situation:
df$var <- as.numeric(df$var)
Notice that, when specifying a variable, you need to use
$ after your dataframe.
Sometimes you need a quick list of all the variables in your dataset
for easy copy paste. Use colnames()
colnames(data)
## [1] "ID" "Length" "Birthweight" "Headcirc" "Gestation"
## [6] "smoker" "mage" "mnocig" "mheight" "mppwt"
## [11] "fage" "fedyrs" "fnocig" "fheight" "lowbwt"
## [16] "mage35"
summary(data)
## ID Length Birthweight Headcirc
## Min. : 27.0 Min. :43.00 Min. :1.920 Min. :30.0
## 1st Qu.: 537.2 1st Qu.:50.00 1st Qu.:2.940 1st Qu.:33.0
## Median : 821.0 Median :52.00 Median :3.295 Median :34.0
## Mean : 894.1 Mean :51.33 Mean :3.313 Mean :34.6
## 3rd Qu.:1269.5 3rd Qu.:53.00 3rd Qu.:3.647 3rd Qu.:36.0
## Max. :1764.0 Max. :58.00 Max. :4.570 Max. :39.0
## Gestation smoker mage mnocig
## Min. :33.00 Min. :0.0000 Min. :18.00 Min. : 0.000
## 1st Qu.:38.00 1st Qu.:0.0000 1st Qu.:20.25 1st Qu.: 0.000
## Median :39.50 Median :1.0000 Median :24.00 Median : 4.500
## Mean :39.19 Mean :0.5238 Mean :25.55 Mean : 9.429
## 3rd Qu.:41.00 3rd Qu.:1.0000 3rd Qu.:29.00 3rd Qu.:15.750
## Max. :45.00 Max. :1.0000 Max. :41.00 Max. :50.000
## mheight mppwt fage fedyrs fnocig
## Min. :149.0 Min. :45.00 Min. :19.0 Min. :10.00 Min. : 0.00
## 1st Qu.:161.0 1st Qu.:52.25 1st Qu.:23.0 1st Qu.:12.00 1st Qu.: 0.00
## Median :164.5 Median :57.00 Median :29.5 Median :14.00 Median :18.50
## Mean :164.5 Mean :57.50 Mean :28.9 Mean :13.67 Mean :17.19
## 3rd Qu.:169.5 3rd Qu.:62.00 3rd Qu.:32.0 3rd Qu.:16.00 3rd Qu.:25.00
## Max. :181.0 Max. :78.00 Max. :46.0 Max. :16.00 Max. :50.00
## fheight lowbwt mage35
## Min. :169.0 Min. :0.0000 Min. :0.00000
## 1st Qu.:175.2 1st Qu.:0.0000 1st Qu.:0.00000
## Median :180.5 Median :0.0000 Median :0.00000
## Mean :180.5 Mean :0.1429 Mean :0.09524
## 3rd Qu.:184.8 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :200.0 Max. :1.0000 Max. :1.00000
This previous output is not super informative so let’s explore our data more.
# distribution of length of baby (cm)
summary(data$Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 50.00 52.00 51.33 53.00 58.00
hist(data$Length, nclass = 40)
By default if you leave nclass argument out, it will
automatically calculate a good binning pattern. Use nclass
when you want to manipulate bins from default. Here’s what it looks like
without the nclass argument.
hist(data$Length)
Distribution of Birthweight (kg)
summary(data$Birthweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.920 2.940 3.295 3.313 3.647 4.570
hist(data$Birthweight, nclass = 20)
Finding the distribution of maternal number of cigarettes smoked per day:
summary(data$mnocig)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 4.500 9.429 15.750 50.000
hist(data$mnocig,nclass = 10)
Distribution of gestation (in weeks)
hist(data$Gestation)
Check out this helpful guide for Choosing Statistical Tests
When you are given only the summary stats (i.e. mean, sample size, proportion, etc.) for a variable and are asked to calculate a CI, there are two methods in R to give you the same results.
First, you must determine if this is a continuous distribution or a binomial distribution. Then see if you are dealing with one point or comparing two samples.
From Activity Worksheet 1, Question 4 was:
Side effects were prevalent in both the placebo and mRNA-1273 vaccine group. Of those in the treatment group, 84.2% of the patients reported a local adverse event. Compute the 95% confidence interval on this estimate.
You are given one estimate, and the estimate (i.e. 84.2%) is a proportion (people either had a reaction or did not). This is binary so we must use the binomial CI calculation:
\(p\pm Z_{1-\alpha /2} \sqrt{\frac{p(1-p)}{n}}\)
You can use two methods with R:
Assign number of total observations to n and proportion
of local adverse events in the mRNA group to p. For a 95%
CI, you use \(1-0.05/2 = 0.975\) which
is what you use in qnorm() argument.
n <- 15762 #number of observations
p <- 0.842 #proportion (%) given
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
margin
## [1] 0.005694125
lowerinterval <- p - margin
lowerinterval
## [1] 0.8363059
upperinterval <- p + margin
upperinterval
## [1] 0.8476941
Here is a different way, but should give you the same result
upper <- 0.842 + 1.96 * (sqrt(0.842 * (1-0.842)/15762))
upper
## [1] 0.8476942
lower <- 0.842 - 1.96 * (sqrt(0.842 * (1-0.842)/15762))
lower
## [1] 0.8363058
Let’s say you were given this set of data: > 11, 10, 6, 3, 11, 10, 9, 11
And were asked to calculate a confidence interval. First, you can
make a dataset and call it df1. In this case, we
call this a vector.
df1 <- c(11, 10, 6, 3, 11, 10, 9, 11)
Now you can summarize the data as you normally would!
summary(df1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 8.250 10.000 8.875 11.000 11.000
sd(df1)
## [1] 2.900123
Since this is a continuous distribution and you are attempting to calculate the 95% CI for this distribution, you would use this equation:
\(\bar{x}\pm z \frac{\sigma}{\sqrt{n}}\)
Where:
\(\bar{x}\) is the mean
\(z\) value is determined by your CI level (in our case, 95%)
\(\sigma\) is the SD
\(n\) is the sample size
R can do all these sub calculations for you with its native functions
CI_pos <- (mean(df1)) + (1.96 * ((sd(df1) / (sqrt(length(df1))))))
CI_pos #Upper level of CI
## [1] 10.88468
CI_neg <- (mean(df1))-(1.96*((sd(df1)/(sqrt(length(df1))))))
CI_neg #Lower level of CI
## [1] 6.865317
We hypothesize that the average gestation is less than 40 weeks
H0: the average gestatopm age is \(\ge\) 40
HA: the average gestation age is < 40
We need to use a t-test here bc we are looking at one continuous variable for which we have the entire data set (so not only a mean and SD like in a z-test)
Here is how to mathematically come up with an answer (do a t-test by hand):
# by hand
m.gest = mean(data$Gestation)
sd.gest = sd(data$Gestation)
n.gest = length(data$Gestation)
t.obs = (m.gest - 40)/(sd.gest/sqrt(n.gest))
pt(t.obs, df=n.gest-1, lower.tail = T)
## [1] 0.02694659
Base R already has a T-test function called t.test()
# using the function
t.test(data$Gestation, mu = 40, alternative = "less")
##
## One Sample t-test
##
## data: data$Gestation
## t = -1.9847, df = 41, p-value = 0.02695
## alternative hypothesis: true mean is less than 40
## 95 percent confidence interval:
## -Inf 39.87688
## sample estimates:
## mean of x
## 39.19048
data$Gestation
You are telling R to look at the variable Gestation
in your dataframe data
mu is the “40 weeks”
alternative = "less" You are telling R your
alternative hypothesis is less than mu
TIP remember that capitalization of vars matters!!!
We hypothesize that average gestation will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.
H0: difference in average gestation of babies of mothers that did and did not smoke cigarettes = 0
HA: difference in average gestation of babies of mothers that did and did not smoke cigarettes != 0
Note: From here on out, in R, when using
! in front of an argument, this means NOT. So if
x = 0 then x != 0 must mean
x DOES NOT equal 0.
# distribution of mothers that smoked
table(data$smoker)
##
## 0 1
## 20 22
# before we run the analysis, let's look at our data
boxplot(data$Gestation~data$smoker)
We are using a two sample t-test because we are now
working with two variables; one variable is
numerical/continuous (Gestation) and the other is binary
(smoker).
# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Gestation~data$smoker)
##
## Welch Two Sample t-test
##
## data: data$Gestation by data$smoker
## t = 0.59903, df = 38.517, p-value = 0.5527
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.17817 2.16908
## sample estimates:
## mean in group 0 mean in group 1
## 39.45000 38.95455
We hypothesize that average birthweight will be different between babies of mothers that smoked cigarettes and babies of mothers that did not smoke cigarettes.
H0: difference in average birthweight of babies of mothers that did and did not smoke cigarettes = 0
HA: difference in average birthweight of babies of mothers that did and did not smoke cigarettes != 0
# before we run the analysis, let's look at our data
boxplot(data$Birthweight~data$smoker)
# to test our hypothesis, we will perform a two-sample t-test
t.test(data$Birthweight~data$smoker) # for reference: .37kg = .83lbs
##
## Welch Two Sample t-test
##
## data: data$Birthweight by data$smoker
## t = 2.1134, df = 39.618, p-value = 0.04092
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.01628738 0.73453080
## sample estimates:
## mean in group 0 mean in group 1
## 3.509500 3.134091
We hypothesize that average birthweight will be associated with birth length
H0: correlation between birthweight and length = 0
HA: correlation between birthweight and length != 0
# look at the association: is it linear?
plot(data$Length, data$Birthweight,
xlab="length",ylab="weight")
Note: by default, xlab and ylab (if you don’t
include them) will just use the actual variable names. Since msot
variable names are ugly, you can specify what you actually want them to
be called (in this case length for x-axis and
weight for y-axis). Try ?plot to check out how
else you can customize your Base R plots! And if you want them prettier,
check out (but don’t be too intimidated by) ggplot2 package.
We are using correlation because we have two continuous variables
# test the correlation
cor.test(data$Length,data$Birthweight)
##
## Pearson's product-moment correlation
##
## data: data$Length and data$Birthweight
## t = 6.6931, df = 40, p-value = 5.029e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5428137 0.8442612
## sample estimates:
## cor
## 0.7268335
We hypothesize that average birth length will predict birthweight
H0: beta_length = 0
HA: beta_length != 0
Univariate means you have one predictor to one outcome
Where:
Length is the predictor
Birthweight is the continuous outcome
variable
data = data is you telling R what data it should be
looking at
lm() is the linear regression function
fit.length.weight is what you decided to name this
model (after running this code, do you see it in your environment view
in the upper right hand corner?)
# perform simple (univariate) linear regression
fit.length.weight = lm(Birthweight~Length, data=data)
summary(fit.length.weight)
##
## Call:
## lm(formula = Birthweight ~ Length, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89446 -0.35492 0.01746 0.28674 0.75794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.36244 1.14858 -3.798 0.000486 ***
## Length 0.14952 0.02234 6.693 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4199 on 40 degrees of freedom
## Multiple R-squared: 0.5283, Adjusted R-squared: 0.5165
## F-statistic: 44.8 on 1 and 40 DF, p-value: 5.029e-08
We hypothesize that average birth length and maternal smoking will predict birthweight
H0: beta_length = 0
HA: beta_length != 0
H0: beta_smoking = 0
HA: beta_smoking != 0
Multivariable means you have >1 predictor (or predictor + covariates) and 1 outcome variable. Note, this is how you “adjust for covariates” usually age, sex, bmi, etc.
Where:
Length and smoker are the
predictors
Birthweight is the continuous outcome
variable
data = data is you telling R what data it should be
looking at
lm() is the linear regression function
fit.length.smoker.weight is what you decided to name
this model
# perform multiple linear regression
fit.length.smoker.weight = lm(Birthweight~Length + smoker, data=data)
summary(fit.length.smoker.weight)
##
## Call:
## lm(formula = Birthweight ~ Length + smoker, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79819 -0.29443 0.07428 0.23794 0.86588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.89619 1.13544 -3.431 0.00143 **
## Length 0.14297 0.02185 6.543 9.13e-08 ***
## smoker -0.24804 0.12689 -1.955 0.05781 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4058 on 39 degrees of freedom
## Multiple R-squared: 0.5704, Adjusted R-squared: 0.5483
## F-statistic: 25.89 on 2 and 39 DF, p-value: 7.001e-08
Note: Multivariable is not the same as Multivariate. Multivariable is very common where you have multiple predictors to one outcome. You might (but rarely) run into Multivariate models, which have multiple outcomes. Common mistake, but use these terms correctly!!