R Basics 2

Brandon Gavett
21/3/2022

Where we left off

(Open the RBasics project we created last time, if not open already)

Restoring our previously loaded packages…

library(pacman) # Makes the p_load function available
p_load(tidyverse, psych) # Loads the specified packages (and installs them if not installed)

Using the rescale function.

myScores <- c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88) # Put 10 made up scores into an object called myScores
rescale(x = myScores) # runs the rescale command on the data stored in the myScores object

This is equivalent to:

rescale(c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88))

Note, we did not need to say x = in our function call.

Need Help?

As a reminder, we can find the help file for the rescale function by running

?rescale

or

help(rescale)

From the help file:

Usage

rescale(x, mean = 100, sd = 15,df=TRUE)

Arguments

x                 A matrix or data frame

mean     Desired mean of the rescaled scores- may be a vector

sd            Desired standard deviation of the rescaled scores

df             if TRUE, returns a data frame, otherwise a matrix

Spot the Difference

rescale(c(101, 130, 65, 87, 119, 102, 74, 96, 101, 88))
rescale(101, 130, 65, 87, 119, 102, 74, 96, 101, 88)

The c function

In R, the c() function is an important tool for combining multiple data points into a single object.

For example, suppose I wanted to calculate the mean of these three numbers:

12, 58, 481

To figure out how to do this, I can look at the help file for the mean function.

?mean

This tells me that I need to enter a value for x, which represents the numbers that I want to average.

If I didn't use the c() function, my command would look like this:

mean(12, 58, 481)
[1] 12

🧐

The c function

Let's open our help file for mean

?mean

When we run this command…

mean(12, 58, 481)

R interprets it as…

mean(x = 12, trim = 58, na.rm = 481)

By default, R will separate your input based on commas (as shown in its help file).

Unless you specifically label the argument (for example, by typing

trim =

in front of your to-be-averaged values), R will assume that your input matches what is shown in the help file.

The c function

Let's try calculating the mean with c() instead.

mean(c(12, 58, 481))
[1] 183.6667

That looks better. For many R functions, all of your data points have to be passed as single objects. c() is the simplest way of concatenating multiple data points into a single vector or list.

More complex data structures

Most of the time, our data are more complex than 3 data points. And those data sets are usually stored in separate files in various formats (e.g., .csv, .sav). Let's learn to read in some data.

SPSS Files

If your data set is in SPSS format, you can use the haven package to read it in to R. First, we need to load the haven package. Then, let's read in our SPSS file (.sav extension).

p_load(haven)
mySPSSdata <- read_sav("fakeData.sav")

Note: if your SPSS file has missing data codes that you want to import into R, you should add the argument user_na = TRUE, like this:

mySPSSdata <- read_sav("fakeData.sav", user_na = TRUE)

Viewing Your Data

If we want to see what our data look like, we have several options. Try each of these and annotate your code to remind yourself of how they differ.

Dump as much as possible to screen

mySPSSdata # or print(mySPSSdata)

View()

View(mySPSSdata)

head()

head(mySPSSdata)

glimpse()

glimpse(mySPSSdata)

Getting to Know Your Data

When reading in most data files (e.g., SAV, CSV), the default is to make the objects a data.frame, which is the most common way of storing rectangular data where different columns are allowed to store different types of data (e.g., some numeric, some character, some factors).

Once you have your data frame loaded into R, it is important to learn how to view, manipulate, and summarize its contents.

Data structure

Every data.frame has its own internal structure that is essential for you to be aware of. To view the structure of the data, execute the str() function.

str(mySPSSdata)

Annotate your syntax with a description of what you just observed after running this command. Make sure you take some time to study the output.

Data structure

One of the most important pieces of information you can get from learning your data's structure is the name of the variables (columns) in your data object.

Note how each column is prefixed with a dollar sign. If we wanted to examine one specific variable (column) of the data set, we could ask for only that variable to be printed to our screen:

mySPSSdata$AGE

This shows the contents of the AGE variable in my data set. If I only wanted to see the beginning of that column (first 6 rows by default), I could ask R to show me the header with the head() function. Similarly, the tail() function shows me the last 6 rows in that column.

head(mySPSSdata$AGE)
[1] 69.36 66.86 87.28 74.38 79.82 69.07
tail(mySPSSdata$AGE)
[1] 75.63 73.77 73.67 84.30 68.04 73.62

The head() and tail() functions can be applied to entire data frames, too. Try calling head() and tail() on the mySPSSdata object.

Skills practice: Try asking R to show you the last 10 rows of the mySPSSdata object instead of the default of 6 rows.

Column names

The easiest way to see the name of every column in your data set is to ask for its names().

names(mySPSSdata)

The same thing could be achieved with the colnames function.

colnames(mySPSSdata)

As a general rule, there are usually multiple ways to achieve the same goal in R.

Descriptive Statistics

The psych package does a great job providing descriptive statistics. Let's look at descriptives for our entire data set using the describe() function.

Technical detail: sometimes different packages have functions with the same name. For example, the Hmisc and psych packages both have a function called describe(). To explicitly tell R to use the describe() function from the psych package, we prepend our function call with psych::.

psych::describe(mySPSSdata)

Note that we haven't actually loaded the psych package yet. Prepending our function call with psych:: allowed us to run that function even though the package isn't loaded. Let's load that package now.

p_load(psych)

Descriptive Statistics

The psych package also allows us to view descriptives by group. Let's look at the same descriptives, but broken down by diagnosis (DX) using the describeBy function.

describeBy(mySPSSdata, mySPSSdata$DX)

Histograms

In addition to descriptives, we may also want to visualize our data. One of R's greatest features is its plotting capabilities, which is an entire workshop in and of itself. For now, we'll stick with some simple plots.

hist(mySPSSdata$AGE)

plot of chunk hist

Scatterplots

Scatterplots are as simple as specifying what you want on the x and y axes.

plot(x = mySPSSdata$AGE, y = mySPSSdata$MMSE)

plot of chunk scatterplots

Plot customization

plot(x = mySPSSdata$AGE, y = mySPSSdata$MMSE, xlab = "Age (years)", ylab = "MMSE Score", 
     main = "MMSE Scores by Age")

plot of chunk prettyScatterplots

Frequencies

You'll notice that the describe() and describeBy() functions treated categorical variables as numeric, leading to some nonsensical values, like the mean of racial category (PTRACCAT). Instead, we can summarize categorical data as frequencies using the table() command.

table(mySPSSdata$PTRACCAT)

  1   2   4   5   6   7 
  1  16  54 912  11   6 

Proportions

If you want to see proportions instead of or in addition to frequencies, you can use the prop.table() function.

prop.table(mySPSSdata$PTRACCAT)

😕

Problems

However, as you can see here, we run into some problems. This weird output gives us a good excuse to delve a little deeper into the nuances of coding in R.

Whereas table() simply required us to specify the column of data we wanted to use to calculate frequencies, prop.table doesn't expect the raw data as its input, it expects the table we just created as the input.

So, how do we go about telling prop.table to work its magic on the results that are generated by table?

Piping

Piping provides a way to daisy-chain commands together; this is especially useful when you only care about the input and the output, not the intermediate steps.

magrittr

Piping is done using a special symbol (a “pipe”) from the magrittr package. magrittr is automatically loaded when you load the tidyverse package, but we can also load it directly as follows:

p_load(magrittr)

The magrittr pipe looks like this:

%>%

It can be typed manually, or by pressing control + shift + M.

Piping

Here's how it could be used to solve our problem:

mySPSSdata$PTRACCAT %>%  # Start with the raw data of interest and pipe it forward
  table() %>%  # Take what was piped, calculate frequencies, and then pipe THOSE forward
  prop.table() # Take what was piped, use it to calculate proportions, and then end

The above code would print your data to the screen. If you wanted to save it as a new object, you could write:

prop_race <- mySPSSdata$PTRACCAT %>%  # Start with the raw data of interest and pipe it forward
  table() %>%  # Take what was piped, calculate frequencies, and then pipe THOSE forward
  prop.table() # Take what was piped, use it to calculate proportions, and then save as prop_race

Or, if you want to get silly…

mySPSSdata$PTRACCAT %>%  # Start with the raw data of interest and pipe it forward
  table() %>%  # Take what was piped, calculate frequencies, and then pipe THOSE forward
  prop.table() -> prop_race # Take what was piped, use it to calculate proportions, and then save as prop_race

Piped Workflows

wrangled_data <- mySPSSdata %>%
  select(ID, DX, AGE, PTGENDER, PTEDUCAT, TAU, ABETA, MOCA, Hippocampus, WholeBrain, TRABSCOR) %>%
  filter(between(TRABSCOR, 0, 300))
wrangled_data
# A tibble: 933 × 11
      ID           DX   AGE   PTGENDER PTEDUCAT   TAU  ABETA  MOCA Hippocampus
   <dbl>    <dbl+lbl> <dbl>  <dbl+lbl>    <dbl> <dbl>  <dbl> <dbl>       <dbl>
 1  1002 1 [CN]        66.9 2 [Male]       16.2 322.   426.   20.8       7730.
 2  1003 2 [MCI]       87.3 2 [Male]       14.8 275.   514.   22.0       6575.
 3  1004 3 [Dementia]  74.4 2 [Male]       11.2 368.   773.   20.0       5192.
 4  1005 3 [Dementia]  79.8 2 [Male]       17.3 217.   441.   21.4       6198.
 5  1006 2 [MCI]       69.1 2 [Male]       19.6  70.2  597.   24.4       7594.
 6  1007 2 [MCI]       65.2 1 [Female]     14   428.   741.   26.7       6589.
 7  1008 3 [Dementia]  71.1 1 [Female]     19.0 182.   443.   28.4       5489.
 8  1009 3 [Dementia]  79.1 1 [Female]     15.4 103.  1031.   26.3       5757.
 9  1010 2 [MCI]       75.6 2 [Male]       15.5 481.   825.   26.6       6066.
10  1011 1 [CN]        71.8 1 [Female]     15.4 405.   -46.7  19.2       5823.
# … with 923 more rows, and 2 more variables: WholeBrain <dbl>, TRABSCOR <dbl>

Data Mutations

mutated_data <- wrangled_data %>%
  mutate(tau_to_abeta = TAU/ABETA,
         cog_normal = case_when(DX == 1 ~ 1,
                                        TRUE ~ 0))
mutated_data
# A tibble: 933 × 13
      ID           DX   AGE   PTGENDER PTEDUCAT   TAU  ABETA  MOCA Hippocampus
   <dbl>    <dbl+lbl> <dbl>  <dbl+lbl>    <dbl> <dbl>  <dbl> <dbl>       <dbl>
 1  1002 1 [CN]        66.9 2 [Male]       16.2 322.   426.   20.8       7730.
 2  1003 2 [MCI]       87.3 2 [Male]       14.8 275.   514.   22.0       6575.
 3  1004 3 [Dementia]  74.4 2 [Male]       11.2 368.   773.   20.0       5192.
 4  1005 3 [Dementia]  79.8 2 [Male]       17.3 217.   441.   21.4       6198.
 5  1006 2 [MCI]       69.1 2 [Male]       19.6  70.2  597.   24.4       7594.
 6  1007 2 [MCI]       65.2 1 [Female]     14   428.   741.   26.7       6589.
 7  1008 3 [Dementia]  71.1 1 [Female]     19.0 182.   443.   28.4       5489.
 8  1009 3 [Dementia]  79.1 1 [Female]     15.4 103.  1031.   26.3       5757.
 9  1010 2 [MCI]       75.6 2 [Male]       15.5 481.   825.   26.6       6066.
10  1011 1 [CN]        71.8 1 [Female]     15.4 405.   -46.7  19.2       5823.
# … with 923 more rows, and 4 more variables: WholeBrain <dbl>, TRABSCOR <dbl>,
#   tau_to_abeta <dbl>, cog_normal <dbl>

Data Summaries

Ungrouped

mutated_data %>%
  summarise(mean_age = mean(AGE),
            sd_age = sd(AGE),
            mean_edu = mean(PTEDUCAT),
            sd_edu = sd(PTEDUCAT),
            mean_tmtb = mean(TRABSCOR),
            sd_tmtb = sd(TRABSCOR),
            n = n())
# A tibble: 1 × 7
  mean_age sd_age mean_edu sd_edu mean_tmtb sd_tmtb     n
     <dbl>  <dbl>    <dbl>  <dbl>     <dbl>   <dbl> <int>
1     73.3   7.17     15.9   2.69      127.    65.9   933

Grouped

mutated_data %>%
  group_by(cog_normal) %>%
  summarise(mean_age = mean(AGE),
            sd_age = sd(AGE),
            mean_edu = mean(PTEDUCAT),
            sd_edu = sd(PTEDUCAT),
            mean_tmb = mean(TRABSCOR),
            sd_tmtb = sd(TRABSCOR),
            n = n()) %>%
  ungroup()
# A tibble: 2 × 8
  cog_normal mean_age sd_age mean_edu sd_edu mean_tmb sd_tmtb     n
       <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>   <dbl> <int>
1          0     73.5   7.04     15.9   2.73     128.    65.8   591
2          1     72.9   7.39     15.9   2.62     124.    66.1   342

Your Task

Create a new object called wf_dat, derived from mySPSSdata, that meets the following criteria:

  1. Contains the following columns:
    • ID
    • DX
    • AGE
    • PTEDUCAT
    • WholeBrain
    • LDELTOTAL
    • TRABSCOR
  2. Contains the following cases (rows):
    • Trails B (TRABSCOR) values greater than 0 and less than 300
    • Diagnosis (DX) of 1 (“CN”) or 2 (“MCI”), but not 3 (“Dementia”)
    • Age (AGE) greater than 60

Analyzing Your Data

wf_dat <- mySPSSdata %>%
  select(ID, DX, AGE, PTEDUCAT, WholeBrain, LDELTOTAL, TRABSCOR) %>%
  filter(between(TRABSCOR, 0, 300),
         DX %in% c(1, 2),
         AGE > 60)

Let's say you want to perform a linear regression using your new data (wf_dat)

  • DV:

    • LDELTOTAL
  • IVs:

    • AGE
    • PTEDUCAT
    • WholeBrain
    • PTEDUCAT x WholeBrain
linear_m1 <- lm(LDELTOTAL ~ AGE + PTEDUCAT*WholeBrain, data = wf_dat)
summary(linear_m1)

Call:
lm(formula = LDELTOTAL ~ AGE + PTEDUCAT * WholeBrain, data = wf_dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3071  -3.6292   0.1869   3.5434  13.1540 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          4.884e+00  1.037e+01   0.471    0.638
AGE                 -4.170e-02  3.014e-02  -1.384    0.167
PTEDUCAT            -1.117e-01  6.310e-01  -0.177    0.860
WholeBrain          -1.401e-06  9.817e-06  -0.143    0.887
PTEDUCAT:WholeBrain  5.154e-07  6.147e-07   0.838    0.402

Residual standard error: 5.201 on 724 degrees of freedom
Multiple R-squared:  0.07893,   Adjusted R-squared:  0.07384 
F-statistic: 15.51 on 4 and 724 DF,  p-value: 3.503e-12

Summarizing Results

summary(linear_m1)

Call:
lm(formula = LDELTOTAL ~ AGE + PTEDUCAT * WholeBrain, data = wf_dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3071  -3.6292   0.1869   3.5434  13.1540 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          4.884e+00  1.037e+01   0.471    0.638
AGE                 -4.170e-02  3.014e-02  -1.384    0.167
PTEDUCAT            -1.117e-01  6.310e-01  -0.177    0.860
WholeBrain          -1.401e-06  9.817e-06  -0.143    0.887
PTEDUCAT:WholeBrain  5.154e-07  6.147e-07   0.838    0.402

Residual standard error: 5.201 on 724 degrees of freedom
Multiple R-squared:  0.07893,   Adjusted R-squared:  0.07384 
F-statistic: 15.51 on 4 and 724 DF,  p-value: 3.503e-12

Handling Results

Making Predictions

wf_dat <- wf_dat %>%
  mutate(pred_lmd = predict(linear_m1),
         Diagnosis = factor(DX, levels = 1:2, labels = c("CN", "MCI")))

Plotting

p_load(ggplot2)
wf_dat %>%
  ggplot(aes(x = pred_lmd, y = LDELTOTAL, colour = Diagnosis)) + # Specifies main elements of plot
  geom_point() +  # Adds data points
  geom_smooth(method = "lm") + # Adds a linear (method = "lm") trend line for each group
  xlab("Predicted LM-D Score") + # Gives the x-axis a custom label
  ylab("LM-D score") + # Gives the y-axis a custom label
  ggtitle("No Interaction Effect of Whole Brain Volume x Education on LM-D scores") # Gives the plot a title

plot of chunk plot2

Conclusion

Be sure to save your R Script file!

Continued learning and problem-solving

Although there are numerous books available (many through the library) for learning R, the best resources are the R help files and good Google searches.

Stack Exchange and Stack Overflow are probably the best resources on the web for getting R help.

R Cheat Sheets

https://www.rstudio.com/resources/cheatsheets/

I have written a package called cheatR that makes it easy to view these cheat sheets in your RStudio Viewer.

if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("begavett/cheatR")

https://github.com/begavett/cheatR

Wrapping Up

Any questions? Shoot me an email.

brandon.gavett@uwa.edu.au

Files from this workshop will be posted to GitHub: https://github.com/begavett/Rbasics

The video recording of this workshop will be available on Teams.