Brandon Gavett
21/3/2022
(Open the RBasics project we created last time, if not open already)
Restoring our previously loaded packages…
library(pacman) # Makes the p_load function available
p_load(tidyverse, psych) # Loads the specified packages (and installs them if not installed)
Using the rescale
function.
myScores <- c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88) # Put 10 made up scores into an object called myScores
rescale(x = myScores) # runs the rescale command on the data stored in the myScores object
This is equivalent to:
rescale(c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88))
Note, we did not need to say x =
in our function call.
As a reminder, we can find the help file for the rescale
function by running
?rescale
or
help(rescale)
From the help file:
rescale(x, mean = 100, sd = 15,df=TRUE)
x
A matrix or data frame
mean
Desired mean of the rescaled scores- may be a vector
sd
Desired standard deviation of the rescaled scores
df
if TRUE, returns a data frame, otherwise a matrix
rescale(c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88))
rescale(101, 130, 65, 87, 119, 102, 74, 96, 101, 88)
In R, the c()
function is an important tool for combining multiple data points into a single object.
For example, suppose I wanted to calculate the mean of these three numbers:
12, 58, 481
To figure out how to do this, I can look at the help file for the mean
function.
?mean
This tells me that I need to enter a value for x, which represents the numbers that I want to average.
If I didn't use the c()
function, my command would look like this:
mean(12, 58, 481)
[1] 12
🧐
Let's open our help file for mean
?mean
When we run this command…
mean(12, 58, 481)
R interprets it as…
mean(x = 12, trim = 58, na.rm = 481)
By default, R will separate your input based on commas (as shown in its help file).
Unless you specifically label the argument (for example, by typing
trim =
in front of your to-be-averaged values), R will assume that your input matches what is shown in the help file.
Let's try calculating the mean with c()
instead.
mean(c(12, 58, 481))
[1] 183.6667
That looks better. For many R functions, all of your data points have to be passed as single objects. c()
is the simplest way of concatenating multiple data points into a single vector or list.
Most of the time, our data are more complex than 3 data points. And those data sets are usually stored in separate files in various formats (e.g., .csv, .sav). Let's learn to read in some data.
If your data set is in SPSS format, you can use the haven
package to read it in to R. First, we need to load the haven
package. Then, let's read in our SPSS file (.sav extension).
p_load(haven)
mySPSSdata <- read_sav("fakeData.sav")
Note: if your SPSS file has missing data codes that you want to import into R, you should add the argument user_na = TRUE
, like this:
mySPSSdata <- read_sav("fakeData.sav", user_na = TRUE)
If we want to see what our data look like, we have several options. Try each of these and annotate your code to remind yourself of how they differ.
Dump as much as possible to screen
mySPSSdata # or print(mySPSSdata)
View()
View(mySPSSdata)
head()
head(mySPSSdata)
glimpse()
glimpse(mySPSSdata)
When reading in most data files (e.g., SAV, CSV), the default is to make the objects a data.frame
, which is the most common way of storing rectangular data where different columns are allowed to store different types of data (e.g., some numeric, some character, some factors).
Once you have your data frame loaded into R, it is important to learn how to view, manipulate, and summarize its contents.
Every data.frame
has its own internal structure that is essential for you to be aware of. To view the structure of the data, execute the str()
function.
str(mySPSSdata)
Annotate your syntax with a description of what you just observed after running this command. Make sure you take some time to study the output.
One of the most important pieces of information you can get from learning your data's structure is the name of the variables (columns) in your data object.
Note how each column is prefixed with a dollar sign. If we wanted to examine one specific variable (column) of the data set, we could ask for only that variable to be printed to our screen:
mySPSSdata$AGE
This shows the contents of the AGE
variable in my data set. If I only wanted to see the beginning of that column (first 6 rows by default), I could ask R to show me the header with the head()
function. Similarly, the tail()
function shows me the last 6 rows in that column.
head(mySPSSdata$AGE)
[1] 69.36 66.86 87.28 74.38 79.82 69.07
tail(mySPSSdata$AGE)
[1] 75.63 73.77 73.67 84.30 68.04 73.62
The head()
and tail()
functions can be applied to entire data frames, too. Try calling head()
and tail()
on the mySPSSdata
object.
Skills practice: Try asking R
to show you the last 10 rows of the mySPSSdata
object instead of the default of 6 rows.
The easiest way to see the name of every column in your data set is to ask for its names()
.
names(mySPSSdata)
The same thing could be achieved with the colnames
function.
colnames(mySPSSdata)
As a general rule, there are usually multiple ways to achieve the same goal in R.
The psych
package does a great job providing descriptive statistics. Let's look at descriptives for our entire data set using the describe()
function.
Technical detail: sometimes different packages have functions with the same name. For example, the Hmisc
and psych
packages both have a function called describe()
. To explicitly tell R to use the describe()
function from the psych
package, we prepend our function call with psych::
.
psych::describe(mySPSSdata)
Note that we haven't actually loaded the psych
package yet. Prepending our function call with psych::
allowed us to run that function even though the package isn't loaded. Let's load that package now.
p_load(psych)
The psych
package also allows us to view descriptives by group. Let's look at the same descriptives, but broken down by diagnosis (DX
) using the describeBy
function.
describeBy(mySPSSdata, mySPSSdata$DX)
In addition to descriptives, we may also want to visualize our data. One of R's greatest features is its plotting capabilities, which is an entire workshop in and of itself. For now, we'll stick with some simple plots.
hist(mySPSSdata$AGE)
Scatterplots are as simple as specifying what you want on the x and y axes.
plot(x = mySPSSdata$AGE, y = mySPSSdata$MMSE)
plot(x = mySPSSdata$AGE, y = mySPSSdata$MMSE, xlab = "Age (years)", ylab = "MMSE Score",
main = "MMSE Scores by Age")
You'll notice that the describe()
and describeBy()
functions treated categorical variables as numeric, leading to some nonsensical values, like the mean of racial category (PTRACCAT
). Instead, we can summarize categorical data as frequencies using the table()
command.
table(mySPSSdata$PTRACCAT)
1 2 4 5 6 7
1 16 54 912 11 6
If you want to see proportions instead of or in addition to frequencies, you can use the prop.table()
function.
prop.table(mySPSSdata$PTRACCAT)
😕
However, as you can see here, we run into some problems. This weird output gives us a good excuse to delve a little deeper into the nuances of coding in R.
Whereas table()
simply required us to specify the column of data we wanted to use to calculate frequencies, prop.table
doesn't expect the raw data as its input, it expects the table we just created as the input.
So, how do we go about telling prop.table
to work its magic on the results that are generated by table
?
Piping provides a way to daisy-chain commands together; this is especially useful when you only care about the input and the output, not the intermediate steps.
Piping is done using a special symbol (a “pipe”) from the magrittr
package. magrittr
is automatically loaded when you load the tidyverse
package, but we can also load it directly as follows:
p_load(magrittr)
The magrittr
pipe looks like this:
%>%
It can be typed manually, or by pressing control
+ shift
+ M
.
Here's how it could be used to solve our problem:
mySPSSdata$PTRACCAT %>% # Start with the raw data of interest and pipe it forward
table() %>% # Take what was piped, calculate frequencies, and then pipe THOSE forward
prop.table() # Take what was piped, use it to calculate proportions, and then end
The above code would print your data to the screen. If you wanted to save it as a new object, you could write:
prop_race <- mySPSSdata$PTRACCAT %>% # Start with the raw data of interest and pipe it forward
table() %>% # Take what was piped, calculate frequencies, and then pipe THOSE forward
prop.table() # Take what was piped, use it to calculate proportions, and then save as prop_race
Or, if you want to get silly…
mySPSSdata$PTRACCAT %>% # Start with the raw data of interest and pipe it forward
table() %>% # Take what was piped, calculate frequencies, and then pipe THOSE forward
prop.table() -> prop_race # Take what was piped, use it to calculate proportions, and then save as prop_race
wrangled_data <- mySPSSdata %>%
select(ID, DX, AGE, PTGENDER, PTEDUCAT, TAU, ABETA, MOCA, Hippocampus, WholeBrain, TRABSCOR) %>%
filter(between(TRABSCOR, 0, 300))
wrangled_data
# A tibble: 933 × 11
ID DX AGE PTGENDER PTEDUCAT TAU ABETA MOCA Hippocampus
<dbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1002 1 [CN] 66.9 2 [Male] 16.2 322. 426. 20.8 7730.
2 1003 2 [MCI] 87.3 2 [Male] 14.8 275. 514. 22.0 6575.
3 1004 3 [Dementia] 74.4 2 [Male] 11.2 368. 773. 20.0 5192.
4 1005 3 [Dementia] 79.8 2 [Male] 17.3 217. 441. 21.4 6198.
5 1006 2 [MCI] 69.1 2 [Male] 19.6 70.2 597. 24.4 7594.
6 1007 2 [MCI] 65.2 1 [Female] 14 428. 741. 26.7 6589.
7 1008 3 [Dementia] 71.1 1 [Female] 19.0 182. 443. 28.4 5489.
8 1009 3 [Dementia] 79.1 1 [Female] 15.4 103. 1031. 26.3 5757.
9 1010 2 [MCI] 75.6 2 [Male] 15.5 481. 825. 26.6 6066.
10 1011 1 [CN] 71.8 1 [Female] 15.4 405. -46.7 19.2 5823.
# … with 923 more rows, and 2 more variables: WholeBrain <dbl>, TRABSCOR <dbl>
mutated_data <- wrangled_data %>%
mutate(tau_to_abeta = TAU/ABETA,
cog_normal = case_when(DX == 1 ~ 1,
TRUE ~ 0))
mutated_data
# A tibble: 933 × 13
ID DX AGE PTGENDER PTEDUCAT TAU ABETA MOCA Hippocampus
<dbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1002 1 [CN] 66.9 2 [Male] 16.2 322. 426. 20.8 7730.
2 1003 2 [MCI] 87.3 2 [Male] 14.8 275. 514. 22.0 6575.
3 1004 3 [Dementia] 74.4 2 [Male] 11.2 368. 773. 20.0 5192.
4 1005 3 [Dementia] 79.8 2 [Male] 17.3 217. 441. 21.4 6198.
5 1006 2 [MCI] 69.1 2 [Male] 19.6 70.2 597. 24.4 7594.
6 1007 2 [MCI] 65.2 1 [Female] 14 428. 741. 26.7 6589.
7 1008 3 [Dementia] 71.1 1 [Female] 19.0 182. 443. 28.4 5489.
8 1009 3 [Dementia] 79.1 1 [Female] 15.4 103. 1031. 26.3 5757.
9 1010 2 [MCI] 75.6 2 [Male] 15.5 481. 825. 26.6 6066.
10 1011 1 [CN] 71.8 1 [Female] 15.4 405. -46.7 19.2 5823.
# … with 923 more rows, and 4 more variables: WholeBrain <dbl>, TRABSCOR <dbl>,
# tau_to_abeta <dbl>, cog_normal <dbl>
mutated_data %>%
summarise(mean_age = mean(AGE),
sd_age = sd(AGE),
mean_edu = mean(PTEDUCAT),
sd_edu = sd(PTEDUCAT),
mean_tmtb = mean(TRABSCOR),
sd_tmtb = sd(TRABSCOR),
n = n())
# A tibble: 1 × 7
mean_age sd_age mean_edu sd_edu mean_tmtb sd_tmtb n
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 73.3 7.17 15.9 2.69 127. 65.9 933
mutated_data %>%
group_by(cog_normal) %>%
summarise(mean_age = mean(AGE),
sd_age = sd(AGE),
mean_edu = mean(PTEDUCAT),
sd_edu = sd(PTEDUCAT),
mean_tmb = mean(TRABSCOR),
sd_tmtb = sd(TRABSCOR),
n = n()) %>%
ungroup()
# A tibble: 2 × 8
cog_normal mean_age sd_age mean_edu sd_edu mean_tmb sd_tmtb n
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 73.5 7.04 15.9 2.73 128. 65.8 591
2 1 72.9 7.39 15.9 2.62 124. 66.1 342
Create a new object called wf_dat
, derived from mySPSSdata
, that meets the following criteria:
TRABSCOR
) values greater than 0 and less than 300DX
) of 1 (“CN”) or 2 (“MCI”), but not 3 (“Dementia”)AGE
) greater than 60wf_dat <- mySPSSdata %>%
select(ID, DX, AGE, PTEDUCAT, WholeBrain, LDELTOTAL, TRABSCOR) %>%
filter(between(TRABSCOR, 0, 300),
DX %in% c(1, 2),
AGE > 60)
Let's say you want to perform a linear regression using your new data (wf_dat
)
DV:
IVs:
linear_m1 <- lm(LDELTOTAL ~ AGE + PTEDUCAT*WholeBrain, data = wf_dat)
summary(linear_m1)
Call:
lm(formula = LDELTOTAL ~ AGE + PTEDUCAT * WholeBrain, data = wf_dat)
Residuals:
Min 1Q Median 3Q Max
-15.3071 -3.6292 0.1869 3.5434 13.1540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.884e+00 1.037e+01 0.471 0.638
AGE -4.170e-02 3.014e-02 -1.384 0.167
PTEDUCAT -1.117e-01 6.310e-01 -0.177 0.860
WholeBrain -1.401e-06 9.817e-06 -0.143 0.887
PTEDUCAT:WholeBrain 5.154e-07 6.147e-07 0.838 0.402
Residual standard error: 5.201 on 724 degrees of freedom
Multiple R-squared: 0.07893, Adjusted R-squared: 0.07384
F-statistic: 15.51 on 4 and 724 DF, p-value: 3.503e-12
summary(linear_m1)
Call:
lm(formula = LDELTOTAL ~ AGE + PTEDUCAT * WholeBrain, data = wf_dat)
Residuals:
Min 1Q Median 3Q Max
-15.3071 -3.6292 0.1869 3.5434 13.1540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.884e+00 1.037e+01 0.471 0.638
AGE -4.170e-02 3.014e-02 -1.384 0.167
PTEDUCAT -1.117e-01 6.310e-01 -0.177 0.860
WholeBrain -1.401e-06 9.817e-06 -0.143 0.887
PTEDUCAT:WholeBrain 5.154e-07 6.147e-07 0.838 0.402
Residual standard error: 5.201 on 724 degrees of freedom
Multiple R-squared: 0.07893, Adjusted R-squared: 0.07384
F-statistic: 15.51 on 4 and 724 DF, p-value: 3.503e-12
wf_dat <- wf_dat %>%
mutate(pred_lmd = predict(linear_m1),
Diagnosis = factor(DX, levels = 1:2, labels = c("CN", "MCI")))
p_load(ggplot2)
wf_dat %>%
ggplot(aes(x = pred_lmd, y = LDELTOTAL, colour = Diagnosis)) + # Specifies main elements of plot
geom_point() + # Adds data points
geom_smooth(method = "lm") + # Adds a linear (method = "lm") trend line for each group
xlab("Predicted LM-D Score") + # Gives the x-axis a custom label
ylab("LM-D score") + # Gives the y-axis a custom label
ggtitle("No Interaction Effect of Whole Brain Volume x Education on LM-D scores") # Gives the plot a title
Be sure to save your R Script file!
Although there are numerous books available (many through the library) for learning R, the best resources are the R help files and good Google searches.
Stack Exchange and Stack Overflow are probably the best resources on the web for getting R help.
https://www.rstudio.com/resources/cheatsheets/
I have written a package called cheatR
that makes it easy to view these cheat sheets in your RStudio Viewer.
if (!requireNamespace("remotes")) {
install.packages("remotes")
}
remotes::install_github("begavett/cheatR")
https://github.com/begavett/cheatR
Any questions? Shoot me an email.
Files from this workshop will be posted to GitHub: https://github.com/begavett/Rbasics
The video recording of this workshop will be available on Teams.