Data Cleaning is a topic that is often requested at the SCC and has begged the question, “Why is it not covered in our intro research courses?” Though it would be intuitive to integrate into the stats sequence, the actual application is often difficult to implement (In other words, easier said than done).

The reason for this is because data cleaning isn’t simply an operation, but a process; a process that varies widely based on your specific data structure and what you’re trying to accomplish. It’s further complicated that data cleaning often takes multiple steps to get your data ready to implement analysis.

Given all of these challenges, the goal of this page is to provide a sample of common data cleaning steps using quasi-real-world data (i.e. I made it up, but it looks pretty legit) that you may use to help guide your data cleaning adventures. As always, if you have any questions or would like to see more similar content, feel free to reach out to the SCC via email or Request an appointment. So without further ado, let’s start coding.

SPSS Integration

For the majority of people reading this, your data will be in SPSS format. SPSS is a great package for non-data enthusiasts, however it’s data cleaning abilities are very sparse. Additionally, the data cleaning you can do in SPSS, takes much more time and energy than through RStudio.

Fortunately, reading SPSS data in R is very easy! With the few lines below, you should be able to read your data directly into R, clean it, and export it into SPSS format, ready for you to work your statistical magic in the system most of you have come to know and love!

Before we get started, you’ll want to set your working directory. This will tell R where to import files from and export files to. Check this image out on how to set your working direcory:

Once you select the folder you want to use, you should be all set!

Next you’re gonna want to install and load the haven package.

install.packages("haven")
library(haven)

Now that we have that ready to go, we’re gonna want to read in our data using the read_sav() function. But remember, we have to create an object that we can call back while cleaning. To do this is simple, simply give your data a name, like data and use the <- to denote an object.

data <- read_sav(file = "data.sav")

Congratulations!! Your data is now in R, and we can start cleaning! Before we get started, let’s take a look at the first 6 lines to make sure we have the right dataset:

head(data)

Recoding Variables

One of the most used, and probably the easiest, cleaning operation is recoding variables. Whether we want to dummy code or group values together, R can get you where you need to go with a few lines of code.

Dummy Code

Let’s take a look at our Gender variable in column 3. When running any kind of regression, we are gonna want to recode this variable where we have 1 variable as a reference group and another as our “group of interest”. First things first, let’s take a closer look at what categories we have an how many participants in each category

table(data$Gender)

   F    M   NB 
1430  738  240 

Our data shows we have 3 categories: Female = 1430 participants, Male = 738 participants, and Non-binary = 240 participants. In order to effectively dummy code, we want to select a reference group that we will compare the other groups to during our regression analysis. I will arbitrarily select the Female group as our reference group. This means we need to create dummy-coded variables for the Male group and the Non-binary group.

data$male_d <- ifelse(data$Gender=="M",1,0)
data$NB_d <- ifelse(data$Gender=="NB",1,0)

There’s a lot happening in this small chunk of code, so I’ll try to break it down as best as I can. The first thing you’ll notice is the $ in the chunk reading data$male_d. For our purposes, this little symbol specifies a variable within a dataframe. When it’s on the left side of a <-, it is creating a new, or updating an existing, variable in the dataset. If it’s on the right side of the <- it is typically referencing a variable in an equation or function.

The next thing you’ll notice is we used the ifelse function. This is a very common and effective function that reads ifelse(test,if TRUE, if FALSE). Broken down a little further, the test is a logical operation where we say if the statement “data$Gender equals (==) M” is TRUE. The next vales if TRUE and if FALSE simply say what we want those respective values to be after this is run.

To put it all together we’re saying, “create a new variable, male_d in dataframe data, where if data$Gender says”M", recode that into 1, for anything else, recode those into 0.

Recoding variable into distinct groups

Let’s take a look at a situation where you have a continuous value that you want to group into a few distinct categories. We’ll shift our attention to a variable at the very end OQ.

First things first, lets get an idea of what we’re working with:

summary(data$OQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   29.00   34.00   39.39   47.00  105.00 

We’re gonna group this variable into 4 categories using the case_when function from the dplyr package. This package is part of the tidyverse family (that’s a whole other topic). For now lets get into some code.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
-- Attaching packages ---------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.3     v dplyr   1.0.7
v tidyr   1.1.3     v stringr 1.4.0
v readr   2.0.1     v forcats 0.5.1
-- Conflicts ------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

We’re gonna divide the 4 categories using the 4 quartiles as listed above.

data$OQ_cat<- case_when(
  data$OQ<=29 ~"1",
  data$OQ>=30&data$OQ<=34 ~"2",
  data$OQ>=35&data$OQ<=47 ~"3",
  data$OQ>=48 ~"4"
)

data$OQ_cat<-as.numeric(data$OQ_cat)

Now is probably a good time to talk about logical operators. In the above code you see a lot of random symbols that are probably scaring a lot of people off. But if you just bear with me for a few more sentences, I’ll explain everything. We should all be familiar with some of these, but they could still be a little cryptic:

> greater than
>= greater than or equal to
< less than
<= less than or equal to
== is equal to
!= is not equal to

& and
| or

With our newfound knowledge, let’s take a closer look at how we created category 2. data$OQ>=30&data$OQ<=34 ~ "2"
If we take is slow, we can see we specified “if OQ score is greater than or equal to 30(1 above 1st quartile) AND OQ score is less than or equal to 34 (median), then recode into 2”

Then the last line I’m just changing the variable type from character to numeric. If a variable is stored as numeric and we wanted to convert it to character, we could call as.character()

Subsetting Datasets

I’m not actually sure if you could do this in SPSS or not without jumping through multiple hoops. There’s many ways you might want to subset a dataset, so I’ll show the most common.

Keeping specific variables

A lot of times you’ll acquire a gargantuan of a dataset and you only need a few variables. I personally hate having a bunch of vars in my dataset that I’m not even gonna use. We’ll use the subset() function to trim our original dataset data into something more manageable. As always, let’s first get a good look at our current dataset:

pillar::glimpse(data)
Rows: 2,408
Columns: 37
$ Client_ID <chr> "5054", "5055", "2101", "2102", "2103", "2104", "2105"~
$ DOB       <chr> "5/19/1987", "5/21/1970", "3/18/1974", "8/30/1983", "7~
$ Gender    <chr> "F", "F", "M", "F", "M", "NB", "NB", "F", "F", "F", "F~
$ Ethnicity <chr> "African American", "Asian", "Hispanic", "Asian", "Whi~
$ tst1      <dbl> 2, 2, 2, 4, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 1, 4, 3, 2, ~
$ tst2      <dbl> 3, 3, 3, 1, 4, 2, 2, 3, 2, 3, 3, 4, 3, 3, 2, 3, 3, 2, ~
$ tst3      <dbl> 3, 4, 2, 4, 2, 1, 2, 3, 1, 2, 1, 3, 1, 1, 3, 2, 2, 3, ~
$ tst4      <dbl> 2, 3, 3, 3, 2, 1, 3, 1, 1, 2, 2, 1, 3, 4, 3, 1, 3, 3, ~
$ tst5      <dbl> 2, 3, 3, 2, 4, 2, 3, 1, 4, 3, 2, 3, 3, 1, 1, 3, 1, 2, ~
$ tst6      <dbl> 3, 3, 4, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, ~
$ tst7      <dbl> 1, 3, 3, 3, 4, 3, 2, 2, 3, 1, 3, 1, 2, 3, 3, 2, 1, 3, ~
$ tst8      <dbl> 1, 3, 3, 2, 2, 4, 3, 3, 1, 2, 3, 2, 4, 1, 3, 3, 2, 2, ~
$ tst9      <dbl> 3, 4, 4, 3, 3, 3, 3, 1, 3, 4, 3, 2, 3, 4, 3, 2, 4, 3, ~
$ tst10     <dbl> 2, 3, 4, 3, 2, 2, 3, 3, 1, 3, 2, 3, 2, 3, 2, 3, 2, 2, ~
$ AB1       <dbl> 31.34906, 31.06828, 27.65837, 27.18045, 43.28128, 27.6~
$ AB2       <dbl> 39.74240, 29.36426, 38.12954, 31.67078, 31.11556, 20.2~
$ AB3       <dbl> 30.14401, 34.13057, 29.19953, 33.83375, 25.05884, 37.7~
$ AB4       <dbl> 26.72113, 28.84908, 28.33457, 30.57840, 23.98519, 45.6~
$ AB5       <dbl> 27.08153, 30.94320, 28.46713, 27.94193, 28.94680, 36.5~
$ AB6       <dbl> 37.77197, 40.14996, 35.14671, 35.61309, 26.40682, 26.6~
$ AB7       <dbl> 30.65428, 32.48099, 36.51662, 26.46737, 39.82626, 22.4~
$ AB8       <dbl> 34.18883, 31.73529, 31.76150, 20.06505, 31.64857, 36.9~
$ AB9       <dbl> 16.19063, 27.65630, 30.66528, 40.31240, 35.77346, 37.9~
$ AB10      <dbl> 26.41643, 35.01845, 20.57879, 29.70228, 25.66419, 21.0~
$ AB11      <dbl> 34.15993, 28.87851, 17.84969, 31.60336, 20.46157, 27.0~
$ AB12      <dbl> 26.84247, 34.85961, 26.03678, 35.39373, 25.32589, 35.2~
$ AB13      <dbl> 19.04151, 28.30442, 33.32491, 29.72498, 36.34334, 35.5~
$ AB14      <dbl> 31.62268, 25.56272, 34.20726, 25.37035, 23.07195, 30.8~
$ AB15      <dbl> 23.38902, 31.24178, 37.98436, 28.39059, 22.85218, 26.7~
$ AB16      <dbl> 27.53183, 31.25811, 42.46906, 36.16505, 28.68694, 29.7~
$ AB17      <dbl> 28.75115, 33.00335, 26.43749, 27.02463, 30.38442, 26.4~
$ AB18      <dbl> 29.82952, 28.52266, 36.79014, 29.77278, 29.85245, 32.4~
$ AB19      <dbl> 19.53882, 30.67495, 35.59165, 29.36187, 29.76933, 35.7~
$ AB20      <dbl> 35.81505, 25.74468, 24.07690, 22.37588, 27.02414, 33.8~
$ OQ        <dbl> 32, 32, 22, 47, 23, 33, 31, 22, 22, 28, 22, 22, 22, 31~
$ SI        <chr> "N", "N", "N", "N", "Y", "N", "N", "Y", "N", "N", "N",~
$ OQ_cat    <dbl> 2, 2, 1, 3, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, ~

Now that we know what we’re working with, let’s use the subset() function to keep only relevant variables:

newdata<- subset(data, select = c(Client_ID,DOB,Gender,Ethnicity,OQ,tst1:tst10))

pillar::glimpse(newdata)
Rows: 2,408
Columns: 15
$ Client_ID <chr> "5054", "5055", "2101", "2102", "2103", "2104", "2105"~
$ DOB       <chr> "5/19/1987", "5/21/1970", "3/18/1974", "8/30/1983", "7~
$ Gender    <chr> "F", "F", "M", "F", "M", "NB", "NB", "F", "F", "F", "F~
$ Ethnicity <chr> "African American", "Asian", "Hispanic", "Asian", "Whi~
$ OQ        <dbl> 32, 32, 22, 47, 23, 33, 31, 22, 22, 28, 22, 22, 22, 31~
$ tst1      <dbl> 2, 2, 2, 4, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 1, 4, 3, 2, ~
$ tst2      <dbl> 3, 3, 3, 1, 4, 2, 2, 3, 2, 3, 3, 4, 3, 3, 2, 3, 3, 2, ~
$ tst3      <dbl> 3, 4, 2, 4, 2, 1, 2, 3, 1, 2, 1, 3, 1, 1, 3, 2, 2, 3, ~
$ tst4      <dbl> 2, 3, 3, 3, 2, 1, 3, 1, 1, 2, 2, 1, 3, 4, 3, 1, 3, 3, ~
$ tst5      <dbl> 2, 3, 3, 2, 4, 2, 3, 1, 4, 3, 2, 3, 3, 1, 1, 3, 1, 2, ~
$ tst6      <dbl> 3, 3, 4, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, ~
$ tst7      <dbl> 1, 3, 3, 3, 4, 3, 2, 2, 3, 1, 3, 1, 2, 3, 3, 2, 1, 3, ~
$ tst8      <dbl> 1, 3, 3, 2, 2, 4, 3, 3, 1, 2, 3, 2, 4, 1, 3, 3, 2, 2, ~
$ tst9      <dbl> 3, 4, 4, 3, 3, 3, 3, 1, 3, 4, 3, 2, 3, 4, 3, 2, 4, 3, ~
$ tst10     <dbl> 2, 3, 4, 3, 2, 2, 3, 3, 1, 3, 2, 3, 2, 3, 2, 3, 2, 2, ~

The subset() function uses 2 primary arguments, the first is the original dataset; in our case, it was data. The next argument is the criteria for subsetting. In this first case, it was selecting specific variables.

Let’s say we only want to get rid of a couple variables. It would be really time consuming (and mistake prone) to write out 20 different variables that we wanted to keep. We would instead add a - in front of c() and enter the variables we would like to exclude. It would look something like this:
newdata<-subset(data, select = -c(SI, OQ_cat))

Keeping/discarding rows

If we want to keep participants based on their level on a given variable, we can also turn to the subset() function:

### Keeping only female participants
female_ds <- subset(data, data$Gender=="F")

#Notice the difference in the Gender counts between data & female_ds
table(data$Gender)

   F    M   NB 
1430  738  240 
table(female_ds$Gender)

   F 
1430 
## Keeping participants with scores between 20-30 on OQ
OQ_dat <- subset(data, data$OQ>=20 & data$OQ<=30)

# Notice the min. & max. for both
summary(data$OQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   29.00   34.00   39.39   47.00  105.00 
summary(OQ_dat$OQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   23.00   27.00   25.94   28.00   30.00 

Keeping specific rows & randomly sampling participants

This one’s pretty common when you want to keep a holdout validation set (if anyone’s into machine learning).

Let’s say we want to use just the first 500 rows of data. First let’s look at how many rows data has:

nrow(data)
[1] 2408

Next, lets subset using these handy-dandy []:

first500<- data[1:500,]
nrow(first500)
[1] 500

The brackets are predoninantly used in R (specifically when working with dataframes) to reference rows and columns in the dataset. The argument in front of the comma (1:500) is in reference to the rows. Here we wanted rows 1 through 500. We also wanted to keep all the columns, so we didn’t specify anything after the comma. say we wanted to keep the first 500 rows and the first 10 columns. Our code would look something like this:

first500<-data[1:500,1:10]

Split-half Random Sample

Instead of just taking the first 500 or so rows, we can randomly sample our dataset like so:

set.seed(1234)
splt<-sample(nrow(data),nrow(data)*.5)

half_dat<-data[splt,]

nrow(half_dat)

I know, I know, I have some explaining to do…

First off, set.seed(1234) is used when you’re doing something random in R (i.e. random sampling, running a machine learning algo, even using maximum likelihood or Bayesian) and you want to replicate your results.

The next line is simply randomly generating a sample of numbers between 1 and the number of rows in data (2408). This is what the first argument nrow(data) means. The second argument nrow(data)*.5, means make the sample half the number of rows of data, data rows * .5

Now that we have a random sample of digits that is equal to half of data, we are gonna make those digits represent the exact rows we want to keep from data. Remember the brackets []? We used the same principle from above where we kept rows 1:500, except we used our random sample we created, and specified we wanted to keep all the columns.

Next thing we would want to do is recover the other half of the dataset that isn’t represented in half_dat. To do this, you simply:

other_half<- data[-splt,]

Now you have a completely complimentary dataset.

Random Tidbits

As I wind down this article I’ll give a few random tidbits to help you clean your data.

Working with dates

We have a variable DOB that represents Date of Birth. Unfortunately, we don’t have age, but we can calculate it with relative ease.

First, let’s take a look at what kind of variable R thinks it is

class(data$DOB)
[1] "character"

As I suspected R doesn’t think this is a date, which makes it tough to calculate age. We’re gonna have to change the type before we do anything.

R’s date/time repritoire is not quite as sophistocated as SAS, but it can do the job. R’s base package has the function as.Date() that helps us out. Let’s take a look at how we can convert the date in our dataset to a format that R can recognize:

data$DOB<- as.Date(data$DOB, format = "%m/%d/%Y")

class(data$DOB)
[1] "Date"

The format = argument allows you to specify how your date is currently formatted. Since ours was mm/dd/yyyy, we used “%m/%d/%Y”. If our format was “mm-dd-yy” for example, it would read “%m-%d-%y”.

Cool! Now we can calculate age. We’ll use the lubridate package to help us out

data$age<- lubridate::interval(data$DOB,Sys.Date())%/%months(12)

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   34.00   41.00   40.77   48.00   81.00 

The package lubridate has many useful functions that I won’t get into here, but just know they’re there. If you ever want to explore a package or function any further, just run the command help(lubridate).
The interval function allows you to calculate an interval between 2 timepoints. The %/% is what we call a “pipe operator” and is a staple in the tidyverse. It provides a simple way to perform a mathematical operation on the results of a function with just 1 line of code. Here what we’re telling it to do is calculate the interval between each participant’s DOB and divide that interval by 12 months, which will give us the age of each pariticpant. From here we can enter it into our equation like we would any normal variable.

Substring

Substring may come in handy if you want to take a snippet of a variable. To be honest, I primarily use this when I get a date/time variable and I just want to keep the date. To illustrate this point, I’m going to make a variable in data that has the current date/time and subset from there.

data$current <- rep(Sys.time(), nrow(data))

data$current[1]
[1] "2021-12-07 09:39:04 PST"

Now that we’re here, I might as well talk about the rep command. This stands for “repeat” and is a good way to create repetitions in R. Right now, I’m saying make a new variable called current and fill it with a series of the current date and time, for the number of rows in data.

Back to substring. We’re gonna use the command substr() to keep only the date (2021-12-07). The arguments read: 1) what do we want to subset 2) at which character do we start 3) at which character do we stop. Since the date starts at 1 and ends at 10, we’ll go ahead and put those arguments in.

data$current_date<- substr(data$current,1,10)

data$current_date[1]
[1] "2021-12-07"

SPSS Reintegration

Last thing we need to do is rewrite your data back into SPSS format. This is a 1 line command that’ll allow you to take your cleaned data from R and read it back in to SPSS.

haven::write_sav(newdata, "newdata.sav")

You should see a file titled “newdata” in the folder you’ve been working out of. If not, make sure to go back and set your working directory back to the folder of your choice.

Conclusion

That’s gonna do it for this page. Remember, data cleaning is a process so you’ll most likely have to do a few of these steps or slight variations on my provided code. Last bit of advice, Google has been my best friend since I’ve started learning R. You can find a solution to almost any error with a quick search. And as always, if you are still having trouble or want me to cover anything esle, feel free to shoot me an email at scc@paloaltou.edu

By: Joe Razo SCC Student Director December, 2021

---
title: "Data Cleaning"
Author: "Joe Razo"
output: html_notebook
---

Data Cleaning is a topic that is often requested at the SCC and has begged the question, "Why is it not covered in our intro research courses?" Though it would be intuitive to integrate into the stats sequence, the actual application is often difficult to implement (In other words, easier said than done). 

The reason for this is because data cleaning isn't simply an operation, but a process; a process that varies widely based on your specific data structure and what you're trying to accomplish. It's further complicated that data cleaning often takes multiple steps to get your data ready to implement analysis. 

Given all of these challenges, the goal of this page is to provide a sample of common data cleaning steps using quasi-real-world data (i.e. I made it up, but it looks pretty legit) that you may use to help guide your data cleaning adventures. As always, if you have any questions or would like to see more similar content, feel free to reach out to the SCC via [email](mailto:scc@paloaltou.edu) or [Request an appointment](http://scc.paloaltou.edu/). So without further ado, let's start coding. 

## SPSS Integration
For the majority of people reading this, your data will be in SPSS format. SPSS is a great package for non-data enthusiasts, however it's data cleaning abilities are very sparse. Additionally, the data cleaning you can do in SPSS, takes much more time and energy than through RStudio. 

Fortunately, reading SPSS data in R is very easy! With the few lines below, you should be able to read your data directly into R, clean it, and export it into SPSS format, ready for you to work your statistical magic in the system most of you have come to know and love! 

Before we get started, you'll want to set your working directory. This will tell R where to import files from and export files to. Check this image out on how to set your working direcory:

![](C:\Users\josep\Desktop\Grad School\SCC\SCC Stuff\setwd.png) 

Once you select the folder you want to use, you should be all set!  

Next you're gonna want to install and load the ```haven``` package.

```{r}
install.packages("haven")
library(haven)
```

Now that we have that ready to go, we're gonna want to read in our data using the ```read_sav()``` function. But remember, we have to create an object that we can call back while cleaning. To do this is simple, simply give your data a name, like ```data``` and use the ```<-``` to denote an object.

```{r}
data <- read_sav(file = "data.sav")
```

Congratulations!! Your data is now in R, and we can start cleaning! Before we get started, let's take a look at the first 6 lines to make sure we have the right dataset:

```{r}
head(data)
```

## Recoding Variables

One of the most used, and probably the easiest, cleaning operation is recoding variables. Whether we want to dummy code or group values together, R can get you where you need to go with a few lines of code. 

### Dummy Code

Let's take a look at our ```Gender``` variable in column 3. When running any kind of regression, we are gonna want to recode this variable where we have 1 variable as a reference group and another as our "group of interest". First things first, let's take a closer look at what categories we have an how many participants in each category

```{r}
table(data$Gender)
```

Our data shows we have 3 categories: ```Female``` = 1430 participants, ```Male``` = 738 participants, and ```Non-binary``` = 240 participants. 
In order to effectively dummy code, we want to select a reference group that we will compare the other groups to during our regression analysis. I will arbitrarily select the ```Female``` group as our reference group. This means we need to create dummy-coded variables for the ```Male``` group and the ```Non-binary``` group. 
```{r}
data$male_d <- ifelse(data$Gender=="M",1,0)
data$NB_d <- ifelse(data$Gender=="NB",1,0)
```

There's a lot happening in this small chunk of code, so I'll try to break it down as best as I can. The first thing you'll notice is the ```$``` in the chunk reading ```data$male_d```. For our purposes, this little symbol specifies a variable within a dataframe. When it's on the left side of a ```<-```, it is creating a new, or updating an existing, variable in the dataset. If it's on the right side of the ```<-``` it is typically referencing a variable in an equation or function.


The next thing you'll notice is we used the ```ifelse``` function. This is a very common and effective function that reads ```ifelse(test,if TRUE, if FALSE)```. Broken down a little further, the ```test``` is a logical operation where we say if the statement "```data$Gender``` equals (```==```) M" is TRUE. The next vales ```if TRUE``` and ```if FALSE``` simply say what we want those respective values to be after this is run. 

To put it all together we're saying, "create a new variable, ```male_d``` in dataframe ```data```, where if ```data$Gender``` says "M", recode that into ```1```, for anything else, recode those into ```0```.

### Recoding variable into distinct groups

Let's take a look at a situation where you have a continuous value that you want to group into a few distinct categories. We'll shift our attention to a variable at the very end ```OQ```. 

First things first, lets get an idea of what we're working with: 
```{r}
summary(data$OQ)
```
We're gonna group this variable into 4 categories using the ```case_when``` function from the ```dplyr``` package. This package is part of the ```tidyverse``` family (that's a whole other topic). For now lets get into some code. 

```{r}
library(tidyverse)
```

 We're gonna divide the 4 categories using the 4 quartiles as listed above. 

```{r}
data$OQ_cat<- case_when(
  data$OQ<=29 ~"1",
  data$OQ>=30&data$OQ<=34 ~"2",
  data$OQ>=35&data$OQ<=47 ~"3",
  data$OQ>=48 ~"4"
)

data$OQ_cat<-as.numeric(data$OQ_cat)
```

Now is probably a good time to talk about logical operators. In the above code you see a lot of random symbols that are probably scaring a lot of people off. But if you just bear with me for a few more sentences, I'll explain everything. 
We should all be familiar with some of these, but they could still be a little cryptic:

```>``` greater than  
```>=``` greater than or equal to  
```<``` less than  
```<=``` less than or equal to  
```==``` is equal to  
```!=``` is not equal to  

```&``` and  
```|``` or  

With our newfound knowledge, let's take a closer look at how we created category 2. ```data$OQ>=30&data$OQ<=34 ~ "2"```  
If we take is slow, we can see we specified "if OQ score is greater than or equal to 30(1 above 1st quartile) AND OQ score is less than or equal to 34 (median), then recode into 2"

Then the last line I'm just changing the variable type from character to numeric. If a variable is stored as numeric and we wanted to convert it to character, we could call ```as.character()```

## Subsetting Datasets

I'm not actually sure if you could do this in SPSS or not without jumping through multiple hoops. There's many ways you might want to subset a dataset, so I'll show the most common. 

### Keeping specific variables

A lot of times you'll acquire a gargantuan of a dataset and you only need a few variables. I personally hate having a bunch of vars in my dataset that I'm not even gonna use. We'll use the ```subset()``` function to trim our original dataset ```data``` into something more manageable. As always, let's first get a good look at our current dataset:

```{r}
pillar::glimpse(data)
```
Now that we know what we're working with, let's use the ```subset()``` function to keep only relevant variables:

```{r}
newdata<- subset(data, select = c(Client_ID,DOB,Gender,Ethnicity,OQ,tst1:tst10))

pillar::glimpse(newdata)
```
The ```subset()``` function uses 2 primary arguments, the first is the original dataset; in our case, it was ```data```. The next argument is the criteria for subsetting. In this first case, it was selecting specific variables.  

Let's say we only want to get rid of a couple variables. It would be really time consuming (and mistake prone) to write out 20 different variables that we wanted to keep. We would instead add a ```-``` in front of ```c()``` and enter the variables we would like to exclude. It would look something like this:  
``` newdata<-subset(data, select = -c(SI, OQ_cat))```

### Keeping/discarding rows 

If we want to keep participants based on their level on a given variable, we can also turn to the ```subset()``` function:

```{r}
### Keeping only female participants
female_ds <- subset(data, data$Gender=="F")

#Notice the difference in the Gender counts between data & female_ds
table(data$Gender)
table(female_ds$Gender)

## Keeping participants with scores between 20-30 on OQ
OQ_dat <- subset(data, data$OQ>=20 & data$OQ<=30)

# Notice the min. & max. for both
summary(data$OQ)
summary(OQ_dat$OQ)
```
 
#### Keeping specific rows & randomly sampling participants

This one's pretty common when you want to keep a holdout validation set (if anyone's into machine learning).  
  
Let's say we want to use just the first 500 rows of ```data```. First let's look at how many rows data has: 
```{r}
nrow(data)
```
Next, lets subset using these handy-dandy ```[]```:

```{r}
first500<- data[1:500,]
nrow(first500)
```
The brackets are predoninantly used in R (specifically when working with dataframes) to reference rows and columns in the dataset. The argument in front of the comma (1:500) is in reference to the rows. Here we wanted rows 1 through 500. We also wanted to keep all the columns, so we didn't specify anything after the comma. say we wanted to keep the first 500 rows and the first 10 columns. Our code would look something like this:
```{r}
first500<-data[1:500,1:10]
```

#### Split-half Random Sample
Instead of just taking the first 500 or so rows, we can randomly sample our dataset like so:
```{r}
set.seed(1234)
splt<-sample(nrow(data),nrow(data)*.5)

half_dat<-data[splt,]

nrow(half_dat)
```

I know, I know, I have some explaining to do...  

First off, ```set.seed(1234)``` is used when you're doing something random in R (i.e. random sampling, running a machine learning algo, even using maximum likelihood or Bayesian) and you want to replicate your results.  

The next line is simply randomly generating a sample of numbers between 1 and the number of rows in ```data``` (2408). This is what the first argument ```nrow(data)``` means. The second argument ```nrow(data)*.5```, means make the sample half the number of rows of data, data rows * .5  

Now that we have a random sample of digits that is equal to half of ```data```, we are gonna make those digits represent the exact rows we want to keep from ```data```. Remember the brackets ```[]```? We used the same principle from above where we kept rows 1:500, except we used our random sample we created, and specified we wanted to keep all the columns.  

Next thing we would want to do is recover the other half of the dataset that isn't represented in ```half_dat```. To do this, you simply:

```{r}
other_half<- data[-splt,]
```

Now you have a completely complimentary dataset. 

## Random Tidbits

As I wind down this article I'll give a few random tidbits to help you clean your data. 

### Working with dates

We have a variable ```DOB``` that represents Date of Birth. Unfortunately, we don't have age, but we can calculate it with relative ease. 

First, let's take a look at what kind of variable R thinks it is
```{r}
class(data$DOB)
```
As I suspected R doesn't think this is a date, which makes it tough to calculate age. We're gonna have to change the type before we do anything. 

R's date/time repritoire is not quite as sophistocated as SAS, but it can do the job. R's base package has the function ```as.Date()``` that helps us out. Let's take a look at how we can convert the date in our dataset to a format that R can recognize:

```{r}
data$DOB<- as.Date(data$DOB, format = "%m/%d/%Y")

class(data$DOB)
```
The ```format =``` argument allows you to specify how your date is currently formatted. Since ours was mm/dd/yyyy, we used "%m/%d/%Y". If our format was "mm-dd-yy" for example, it would read "%m-%d-%y".  

Cool! Now we can calculate age. We'll use the ```lubridate``` package to help us out

```{r}
data$age<- lubridate::interval(data$DOB,Sys.Date())%/%months(12)

summary(data$age)
```
The package ```lubridate``` has many useful functions that I won't get into here, but just know they're there. If you ever want to explore a package or function any further, just run the command ```help(lubridate)```.  
The interval function allows you to calculate an interval between 2 timepoints. The ```%/%``` is what we call a "pipe operator" and is a staple in the ```tidyverse```. It provides a simple way to perform a mathematical operation on the results of a function with just 1 line of code. Here what we're telling it to do is calculate the interval between each participant's DOB and divide that interval by 12 months, which will give us the age of each pariticpant. From here we can enter it into our equation like we would any normal variable. 

### Substring

Substring may come in handy if you want to take a snippet of a variable. To be honest, I primarily use this when I get a date/time variable and I just want to keep the date. To illustrate this point, I'm going to make a variable in ```data``` that has the current date/time and subset from there.

```{r}
data$current <- rep(Sys.time(), nrow(data))

data$current[1]
```
Now that we're here, I might as well talk about the ```rep``` command. This stands for "repeat" and is a good way to create repetitions in R. Right now, I'm saying make a new variable called ```current``` and fill it with a series of the current date and time, for the number of rows in ```data```.  

Back to substring. We're gonna use the command ```substr()``` to keep only the date (2021-12-07). The arguments read: 1) what do we want to subset 2) at which character do we start 3) at which character do we stop. Since the date starts at 1 and ends at 10, we'll go ahead and put those arguments in. 
```{r}
data$current_date<- substr(data$current,1,10)

data$current_date[1]
```
 
## SPSS Reintegration

Last thing we need to do is rewrite your data back into SPSS format. This is a 1 line command that'll allow you to take your cleaned data from R and read it back in to SPSS.
```{r}
haven::write_sav(newdata, "newdata.sav")
```
You should see a file titled "newdata" in the folder you've been working out of. If not, make sure to go back and set your working directory back to the folder of your choice. 

## Conclusion

That's gonna do it for this page. Remember, data cleaning is a process so you'll most likely have to do a few of these steps or slight variations on my provided code. Last bit of advice, Google has been my best friend since I've started learning R. You can find a solution to almost any error with a quick search. And as always, if you are still having trouble or want me to cover anything esle, feel free to shoot me an [email at scc@paloaltou.edu](mailto: scc@paloaltou.edu) 

By: Joe Razo
    SCC Student Director
    December, 2021
