# Importing Data

## Setting WD

Before we import, it's worth setting our "working directory." This tells R where to look for files and where to save them.

You can always spell out full file paths when you import or save, but setting the working directory makes things faster and helps your code run reliably when you import with code.

Sometimes Rmd files give you a hard time with the working directory, but here is the basic code for it.

```{r}
setwd("~/Documents/UCR Grad/2022 - Fall/PSYC211 (TA)/Week 1") # setwd() sets the working directory
```

That's for Macs. You can always look at the code R prints in the console after you set the directory by point-and-click, and save it for later.
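On Windows, the path looks different. Here's a minimal sketch with a hypothetical folder layout (yours will differ; note that R uses forward slashes even on Windows):

```{r}
getwd() # check where R is currently looking
# setwd("C:/Users/YourName/Documents/PSYC211/Week 1") # hypothetical Windows-style path
```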

You WILL need different code based on your computer's file directory, and that will probably vary by which computer you use (a lab computer vs. your personal laptop, e.g.). You can also point and click to set the working directory. We've talked about this before: Session -> Set Working Directory -> Choose Directory...

Tip: Make sure that both your Rmd file and your dataset(s) are saved in the same folder. It shouldn't matter, but sometimes when I don't do that, I have trouble knitting.

```{r}
# install.packages("psych")
library(psych)
```

## Actually Importing Data

There are two main ways people import data: through code or by point and click. I personally don't like point and click because it sometimes gives me knitting problems, but know that it is an option.

My preferred file format is .csv, but you can also import other types of files. For example, today we'll be working with a .sav file. If you want to use .sav files, you will need the "haven" package. BEFORE you run the following code, be sure to install the haven package. As a reminder, inside a chunk of code, use the function install.packages("haven"); then you can load its library.

For today's problem set, we're going to use publicly available data on politics from the Public Religion Research Institute (PRRI).

```{r}
library(haven)
PRRI_2019_American_Values_Survey_1 <- read_sav("PRRI-2019-American-Values-Survey-1.sav")
# View(PRRI_2019_American_Values_Survey_1) # opens the data in a spreadsheet-style viewer
```

You can also change the name when you import. I recommend doing this, using a name that is easy to remember and not too long to type:

```{r}
prri2019 <- read_sav("PRRI-2019-American-Values-Survey-1.sav")
```

Or change the name later:

```{r}
prri <- PRRI_2019_American_Values_Survey_1
```

## Some other ways to import data

Below are some functions for importing data from different file types.

- read_excel(): for Excel (xls/xlsx) files; requires the "readxl" package (install.packages("readxl")).
- read_sav(): for SPSS (.sav) files; requires the "haven" package.
- read.csv(): for .csv files; part of the "utils" package (already included in base R).
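To make that concrete, here is a sketch with made-up file names (swap in your own; these stay commented out unless you actually have such files in your working directory):

```{r}
# library(readxl)
# mydata <- read_excel("my-data.xlsx") # Excel file
# mydata <- read_sav("my-data.sav")    # SPSS file (haven loaded earlier)
# mydata <- read.csv("my-data.csv")    # csv file, base R
```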

Before moving on, let's load the other libraries we'll need. If you don't have those packages installed, make sure you install them first.

```{r}
# install.packages("dplyr")
library(dplyr) # used for cleaning data
```

# Cleaning Data

Most data are not "clean" when you get them. Some things I clean in Excel when I export from Qualtrics (e.g.), but other times it makes more sense to do it in R.

## Trimming Down Datasets

### Selecting Columns

You may want to select only certain columns in your data, which can be done in a few ways.

This is one way; it requires the dplyr package:

```{r}
df1 <- dplyr::select(prri2019, Q1, Q2A_A, Q2A_F, Q3, Q5, PARTY)
```

A quick note before we unpack this: <- does the same job as = for assignment, but = has other functions in R (like naming arguments inside a function call), so prefer <- for assignment. The other pieces (the dplyr:: prefix, the dataset name, the column names) are explained below.

Let's talk about the code above. df1 is the name I have assigned to the data I'm choosing to isolate. The <- symbol is the symbol we use to assign a value to an object. The dplyr:: portion specifies that we want to use the dplyr package. If you loaded your library, you don't actually need that part. But there are times when you have multiple libraries loaded, and a function in one library 'masks' (overrides) a function from another library. When that happens, you can use that prefix to tell R which package you mean. select() is the function we're actually using. Inside the parentheses go the name of the dataset, in this case prri2019 (or whatever you named your dataset in the steps above), followed by the names of the columns (variables) you want isolated.
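As a concrete (and hedged) illustration of masking: the MASS package also defines a select() function, so loading MASS after dplyr is a classic way for a bare select() call to stop behaving. A sketch, all commented out since we don't need MASS here:

```{r}
# library(MASS)                 # MASS::select() now masks dplyr::select()
# select(prri2019, Q1)          # may now error: R finds MASS's select() first
# dplyr::select(prri2019, Q1)   # the prefix removes the ambiguity
```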

This next code also works, and it doesn't require any package; it's base R. A few quick notes before the full explanation: brackets are a way to grab rows and/or columns from a data frame (left of the comma = rows, right of the comma = columns); the c() function creates a vector, i.e., a list of things; and don't forget the quotation marks around the column names.

```{r}
df2 <- prri2019[, c("Q1","Q2A_A","Q2A_F","Q3","Q5","PARTY")]
```

The code above does not require any particular package; we're essentially just creating a vector. df2 is the name of the object I'm creating. The <- is the symbol we use when we create objects and assign something to them. prri2019 is the name of our dataset. The [ ] brackets indicate that we're going to pull specific rows, columns, or both from our dataset. The comma separates the rows from the columns: anything to the left of the comma indicates rows, and anything to the right of the comma indicates columns. In this particular case, we want all the rows associated with our variables of interest, so we leave the rows side blank. If we only wanted one variable, we could just type the name of that variable on the right side of the comma. However, we want more than one variable, so we have to create a list. The way we create a list is with the function c(); notice that's the same function we use to create a vector. Inside the parentheses, we include the names of the variables we want. Notice that in this case, we have to put quotation marks around each variable name and separate them with commas.

If you don't know the names of your variables by heart, or they're too long to type, you can also select columns by number (again leaving the left side of the comma empty). For example:

```{r}
df3 <- prri2019[, 3:10]
```

This is good if you want a bunch of columns that sit next to each other, but it can be risky: column positions can differ across datasets (or across versions of the same dataset), because you are selecting locations, not the columns themselves.
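One protective habit (just a suggestion) is to print the names at those positions before committing to them:

```{r}
names(prri2019)[3:10] # confirm these are the columns you expect
```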

Or you can use the subset() function (which has other uses we'll talk about when the time comes). No brackets are needed here; everything happens inside the function:

```{r}
df4 <- subset(prri2019, select = c("Q1","Q2A_A","Q2A_F","Q3","Q5","PARTY"))
```

### Dropping Certain Columns

You can also use the reverse of some of these approaches to drop columns. The minus sign in front of c() tells R to keep everything except the column positions listed:

```{r}
df5 <- prri2019[, -c(1:3, 7, 20, 100:199)]
```

But again, be careful as column numbers aren’t always consistent!

So instead, you can do this slightly lengthier thing to drop columns by name:

```{r}
drop <- c("WEIGHT","WEIGHT2","PARTYOE") # name the columns you want to drop
df6 <- prri2019[, !(names(prri2019) %in% drop)] # and then drop them
```

drop is the name of the object we're creating. Remember that you are the one who decides what to name your objects; drop is not a function here. Then we use the c() function to create a list of the variables we want to drop. The next line of code should look familiar, since it uses the bracket notation we've been practicing (the left side of the comma is empty, so we keep all rows). However, there are some new elements. The ! tells R not to include something; it literally means "not." For example, != means "not equal to," and !x means "not x." In this case, we're saying we want our new data frame, df6, to include all the columns in our dataset, prri2019, except the ones whose names appear in the drop object we just created (we know we're working with columns because this code sits to the right of the comma). The reason in is surrounded by percent signs is that the bare word in is reserved in R for loops (as in for (i in 1:10)), so the value-matching operator has to be written %in%. Operators written between percent signs like this are called "infix" operators.
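A tiny standalone illustration of %in% and !, using toy vectors rather than our dataset:

```{r}
x <- c("A", "B", "C")
x %in% c("B", "C")    # FALSE  TRUE  TRUE
!(x %in% c("B", "C")) # TRUE  FALSE FALSE
```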

Or, we can again use the subset function


```{r}
df7 <- subset(prri2019, select = -c(WEIGHT, WEIGHT2, PARTYOE))
```

## Another Subsetting Technique

But what if you only want certain rows?

The most common example is wanting to get rid of participants who responded in a certain way to a prompt. For example, I may only want those who passed an attention check in my study, or, here, I may want to see only those who state they are Democrats. This is the return of the subset() function.


```{r}
# install.packages("tidyverse")
library(tidyverse) # !!! this was missing the whole time
```

```{r}
df1dems <- subset(df1, df1$PARTY == 2)
```

Here, we're saying we want R to take the data frame we called df1, but we only want the rows containing data from participants who self-identified as Democrats (coded as the number 2). The double equal sign, ==, means 'this and only this,' or 'equal to this.' A single equal sign is an assignment operator, similar to <-.

I may also want to drop those who score above a cutoff, for example outliers or, in this case, those who refused to answer certain questions and were given a code of 77, 98, or 99 to mean "didn't know," "skipped," or "refused," respectively.

```{r}
df1Answered <- subset(df1, df1$PARTY < 76)
```

You can also do fancy things with this, removing people who do or don't fit multiple criteria, using two important keys: & and |.

- & means "and": multiple criteria must all be met.
- | means "or": one criterion or the other must be met.

Note: the | is not the letter l or the number 1. It's the vertical bar (aka "pipe") key, usually above the Enter key on your keyboard.

```{r}
df1Party <- subset(df1, df1$PARTY == 1 | df1$PARTY == 2 | df1$PARTY == 3 | df1$PARTY == 4)
```

This also removed the NAs, which we'll talk more about next (right after a quick & example below).
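Here's a hedged sketch of &, using a made-up combination of our columns (adjust to whatever question you actually have):

```{r}
# Democrats (PARTY == 2) who also think the country is headed in the right direction (Q1 == 1)
df1demsRight <- subset(df1, df1$PARTY == 2 & df1$Q1 == 1)
```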

## NAs in Data

Before we jump into NAs, let's take a look at our data in two ways. First, what class is each variable? We'll be using df1 for the rest of this assignment, or any df that selected those columns (df2 and df4 are good too!).

```{r}
lapply(df1, class)
```

In previous code sheets (from stats bootcamp and week 0), I've shown you two other ways to look at variable classes. I do this so that you know you have options. Feel free to refer back to those code sheets and try multiple approaches.

Those are... weird classes. Let's make the opinion variables numeric, and make the party and "how is America doing" variables (PARTY and Q1) factors.

Factors (made with as.factor()) are variables that take discrete values, which may or may not be ordered. Outside of R they're often called categorical variables. For example, North, South, East, and West could be levels of a factor.

Numerics (made with as.numeric()) are numbers with infinitely many other numbers between them. So 5 is a number, as is 6, but so are 5.01, 5.001, 5.0001, etc.

```{r}
df1$Q1 <- as.factor(df1$Q1)
df1$Q2A_A <- as.numeric(df1$Q2A_A)
df1$Q2A_F <- as.numeric(df1$Q2A_F)
df1$Q3 <- as.numeric(df1$Q3)
df1$Q5 <- as.numeric(df1$Q5)
df1$PARTY <- as.factor(df1$PARTY)
```

Note: 'double' stands for double-precision floating-point number. R will automatically move between the double and numeric labels depending on what functions you're running. You don't actually have to convert to numeric, but it is good practice to avoid potential problems.
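You can see the double/numeric relationship for yourself:

```{r}
typeof(5)  # "double": how R stores the value
class(5)   # "numeric": how R labels it
typeof(5L) # "integer": the L suffix forces a whole number
```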

We can also take a quick detour and rename our columns to something meaningful.

```{r}
names(df1)[names(df1) == "Q1"] <- "usaDir"        # Is this country headed in the right direction? 1 = Right, 2 = Wrong
names(df1)[names(df1) == "Q2A_A"] <- "trumpUnfav" # How favorably view Trump? 1-4, V Favorable to V Unfavorable
names(df1)[names(df1) == "Q2A_F"] <- "obamaUnfav" # How favorably view Obama? 1-4, V Favorable to V Unfavorable
names(df1)[names(df1) == "Q3"] <- "trumpDisapp"   # Approve of Trump presidency? 1-4, Strongly Approve to Strongly Disapprove
names(df1)[names(df1) == "Q5"] <- "obamaDisapp"   # Approve of Obama presidency? 1-4, Strongly Approve to Strongly Disapprove
names(df1)[names(df1) == "PARTY"] <- "poliParty"  # 1 = R; 2 = D; 3 = Independent
```

I also like to leave myself notes in these places, as your most common collaborator is you six months ago. Don't rely on your memory. Not only will you most likely not remember, you shouldn't have to. You're a grad student; you have a million things to think about.
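If you have dplyr loaded anyway, its rename() function does the same job in one call. This is just an equivalent sketch, left commented out because our columns are already renamed:

```{r}
# df1 <- dplyr::rename(df1,
#   usaDir = Q1, trumpUnfav = Q2A_A, obamaUnfav = Q2A_F,
#   trumpDisapp = Q3, obamaDisapp = Q5, poliParty = PARTY
# )
```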

Most likely, you will find NAs in your data from participants skipping or not answering questions. You may want to impute responses to deal with missingness (you'll learn about that later), but other times (like now) you may just want to exclude those people. But first, let's visualize. We're going to use the package 'naniar'. Install it if you don't have it (install.packages("naniar")), and load the library.

```{r}
library(naniar)
vis_miss(df1)
```

What are your thoughts on this? What can you infer from this visualization? vis_miss() is a cool function for seeing where missingness occurs, with %NA broken down overall and by column. Here it says everything is present, which is not what we'd expect to see.

So, we noticed that there are no NAs, which seems odd. Most real datasets contain a certain percentage of NAs. When in doubt, go back to your dataset! Looking on PRRI's site, we can see that they used numbers to indicate a lack of answering (such as 98 or 77 for the political party question). We can solve this by recoding those values as NAs. We're going to need the package 'dplyr'. Install it if you don't have it.

```{r}
library(dplyr)
df1$usaDir <- na_if(df1$usaDir,9)
df1$trumpUnfav <- na_if(df1$trumpUnfav,9)
df1$obamaUnfav <- na_if(df1$obamaUnfav,9)
df1$trumpDisapp <- na_if(df1$trumpDisapp,9)
df1$obamaDisapp <- na_if(df1$obamaDisapp,9)
df1$poliParty <- na_if(df1$poliParty, "98")
df1$poliParty <- na_if(df1$poliParty, "99")
df1$poliParty <- droplevels(df1$poliParty) # removes factor "levels" that aren't in use
```

Let's look at our NAs again.

```{r}
vis_miss(df1)
```

That's more like it! So let's get rid of the participants who had NAs.

```{r}
df1 <- na.omit(df1)
```

Notice that we're not creating a new data frame; we're just editing the one we created earlier. The function above got rid of all the NAs in that entire data frame.

We could also drop only the NAs in one column. The code for that would look something like this:

```{r}
# df1 <- df1[!is.na(df1$poliParty), ]
```

In that code, we're saying to update our data frame, df1, keeping every row except (that's why we use the !) the rows with an NA in the poliParty column.
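If you prefer the tidyverse, tidyr's drop_na() does the same thing. An equivalent sketch, commented out since we've already removed the NAs (tidyr loads with the tidyverse from earlier):

```{r}
# df1 <- tidyr::drop_na(df1, poliParty)
```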

This is also not the best practice for real-life NAs (unless certain conditions are met), where you may want to impute data or do other types of deletion. But for now, it's all good.

Now we have a complete dataset. This is the part where I like to save it.

```{r}
save(df1, file = "NoNADataSetPRRI2019.Rda") # an Rda file, which R can read
# but most other programs can't

write.csv(df1, "NoNADataSetPRRI2019.csv", row.names = FALSE) # a csv file, which R
# can read and so can Excel! Usually your best bet for sharing data, but a bit
# bigger than an Rda file (8 KB vs. 41 KB for these particular data)
```


# Analyzing Data
We cleaned our dataset! We've made it this far! Yay! The boring stuff is behind 
us (arguably). Some people might like cleaning data more than I do. It's 
possible; weirder things have happened, lol. 

## Measures of Central Tendency
Let's look at measures of central tendency, aka mean, median, and mode. As we 
saw last week, R has code for mean and median, but not for mode. Let's start 
with the easy ones, mean first.
```{r}
mean(df1$trumpUnfav)
```

```{r}
mean(df1$obamaUnfav)
```

Then the median:

```{r}
median(df1$trumpUnfav)
```

```{r}
median(df1$obamaUnfav)
```


Remember that these variables go from more to less favorable as they go 1-4, which is a little counterintuitive, at least for me! We can change that. It's not the fanciest approach, but simple subtraction works: take 1 + the largest scale value and subtract each score from it. In this particular case, the highest value in our scale is 4, so 4 + 1 = 5. We can code it this way:
```{r}
df1$trumpFav <- 5 - df1$trumpUnfav
df1$obamaFav <- 5 - df1$obamaUnfav # Fav variables go 1-4, V Unfavorable to V Favorable
```

We can do the mean again, then:

```{r}
mean(df1$trumpFav)
```

```{r}
mean(df1$obamaFav)
```

So now the means might be more intuitive to read, and we can see that (at least in terms of raw numbers; NHST comes later) Obama is rated as more favorable than Trump.

Then the median:

```{r}
median(df1$trumpFav)
```

```{r}
median(df1$obamaFav)
```

But what about the mode? People online do this in different ways, some with a lot of code. But one way is to just look at frequency counts, listed below each value:

```{r}
table(df1$trumpFav) # the value with the largest count is the mode
```

```{r}
table(df1$obamaFav)
```

You can then look at the counts to find the most common value, aka the mode.

Below is the more sophisticated mode script, but this is more work when you could just use your eyes to pick the mode out of the frequency table, imo.

```{r}
print(names(table(df1$trumpFav))[which(table(df1$trumpFav) == max(table(df1$trumpFav)))])
```

```{r}
print(names(table(df1$obamaFav))[which(table(df1$obamaFav) == max(table(df1$obamaFav)))])
```

Remember that last week, you each found code to get the mode. The code I used last week looked slightly different from this. Feel free to take a look back at that and compare; choose whichever option you prefer.
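If you'd rather not retype that, you can wrap the same idea in a small helper function. get_mode is just a name I made up; base R has no built-in statistical mode function:

```{r}
get_mode <- function(x) {
  counts <- table(x)
  # which.max() returns only the first maximum, so ties report a single mode
  names(counts)[which.max(counts)]
}
get_mode(df1$trumpFav)
```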

## Measures of Variability

The next part goes over a few of the main measures of variability: variance, standard deviation, and range. All of these have an R function.

First, variance. The var() function comes from the 'stats' package, which ships with R and loads automatically, so there is nothing to install here.

```{r}
var(df1$trumpFav)
```


```{r}
var(df1$obamaFav)
```

And then for standard deviation:

```{r}
sd(df1$trumpFav)
```

```{r}
sd(df1$obamaFav)
```

Range also has a function, which shows the min and max of the data, though this is predictable for our data because we know what the min and max points of our scale are:

```{r}
range(df1$trumpFav)
```

```{r}
range(df1$obamaFav)
```

Interquartile range (IQR) also has a function (you'll learn more about IQR in lecture later on):

```{r}
IQR(df1$trumpFav)
```

```{r}
IQR(df1$obamaFav)
```

## Standardizing and Centering Data

Sometimes, you’ll want to transform your data. The most common way is to standardize (z-score) it. This makes the data have a mean of 0 and a standard deviation of 1. We’ll learn in lecture why this is important.

We could absolutely change the values in the original columns, but that's not a good idea. It's best not to write over the columns and instead to make a new, standardized (z-scored) column.

```{r}
df1$zTrumpFav <- scale(df1$trumpFav, center = TRUE, scale = TRUE)
```

Here we're saying that in the data frame df1, we want a new column that we're going to call zTrumpFav (name it something meaningful so that six months from now you'll know this column holds the z-scores). The new column uses the function scale(). We specify that the values to scale come from the trumpFav column of our original df1 data frame, and then we specify that we want to center our data and scale it.

But just so you know, "center" and "scale" default to TRUE, so you don't actually have to include them in the code; I did it for the sake of thoroughness. If you're not sure what the default settings are for a certain function, you can always type a question mark, ?, followed by the function name (do this in the console portion of R, not your actual file) and read the documentation, e.g. ?scale.

```{r}
df1$zObamaFav <- scale(df1$obamaFav)
```

Centering our data means we 'shift' it so that the mean = 0. Again, by default the scale() function will center your data and scale it (make the sd = 1) at the same time.

However, you may want to center your data, but not adjust the standard deviation. Setting center=TRUE and scale=FALSE in your code will center your data.

You can also adjust the standard deviation, with (center=TRUE) or without (center=FALSE) centering your data, by changing the value of the scale= argument (for example, if scale=2, every value is divided by 2, so the standard deviation is cut in half). ((I don't know why this is so confusing.)) I also don't know in which cases you'd want to adjust the sd to something specific; that will depend on your own research and your own data.
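Here's a minimal sketch of the center-only case (cTrumpFav is a column name I'm making up for illustration):

```{r}
df1$cTrumpFav <- scale(df1$trumpFav, center = TRUE, scale = FALSE) # centered only
mean(df1$cTrumpFav) # essentially 0
sd(df1$cTrumpFav)   # same sd as the original trumpFav column
```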

# Graphing, Reviewing the Basics

Graphing in R can be pretty extensive; I could spend four hours on ggplot alone and wouldn't even begin to scratch the surface (I'm actually taking a course on that right now). So for the purposes of this course, I will stick to some base graphing first and show just a little ggplot after. The hope is that as you get more comfortable, you will begin exploring packages and experimenting on your own.

We'll start with a histogram, which is a solid way to see the distribution of the data, and, bonus, we've done one of these before.

```{r}
hist(df1$trumpFav)
```

This is good, but you couldn't put it into a paper looking like that.

In R, we have a lot of options for making changes to our graphs so they're more appealing for papers and presentations. We can change a lot of things, including axis titles, the graph title, and even colors.

```{r}
hist(df1$trumpFav,                                      # data graphed
     main = "Histogram for Trump Favorability Ratings", # title
     xlab = "Increasing Favorability",                  # x-axis name
     border = "black",                                  # bar border color
     col = "red",                                       # bar color
     xlim = c(1, 4),                                    # x-axis limits
     ylim = c(0, 1400))                                 # y-axis limits
```

Bar charts may be another way to represent data. Recall from lecture: when do we use bar graphs? We tend to use them when we have discrete categories. So, arguably, we technically could use a bar chart to visualize how favorably people view Trump or Obama. But... for better or for worse, we tend to treat Likert scales as continuous (even though they're not really continuous); that's a conversation for MAMA (aka R-squared). For now, we'll just say we can't represent Likert scales on bar charts, so we'll pick a different variable. Let's visualize the frequency of participants who identify with each political party.

```{r}
counts2 <- table(df1$poliParty) # creating an object which will contain a table
# of frequencies; remember we did this earlier
barplot(counts2, # we're using the function barplot(), then we tell R what
        # information to include, in this case the object we just created
        main = "Political Party of Participants", # title
        xlab = "Political Party",                 # label for the x-axis
        ylab = "Frequency",                       # label for the y-axis
        beside = T) # telling R to portray the columns next to each other
```

Is this now presentable enough for a talk or a paper? Sure, but a problem here is that we don't know what the numbers on the x-axis represent. Luckily, we can edit the labels. If we had a codebook, we'd know that 1 = R; 2 = D; 3 = Ind; and 4 = Other (matching the coding we noted when we renamed the columns).

```{r}
levels(df1$poliParty) <- list(Republican = "1", Democrat = "2", Independent = "3", Other = "4")
```

We're internally editing the names so that R knows what the values mean. We use the function levels(), then we tell R which variable we want to create levels for, in this case the poliParty variable from our data frame, df1. Then we use the assignment symbol, <-, and the list() function.

And then we can graph again!

```{r}
counts2 <- table(df1$poliParty)
barplot(counts2,
        main = "Political Party of Participants",
        xlab = "Political Party",
        ylab = "Frequency",
        beside = T)
```

We've talked in lecture about measures of variability; box plots show the IQR and variability. Let's say we want to visualize how favorably participants from different political parties view a particular person, in this case, Trump.

```{r}
boxplot(trumpFav ~ poliParty, data = df1,
        main = "Trump Favorability by Political Party",
        xlab = "Political Party",
        ylab = "Favorability Scores")
```

We're using the function boxplot() and telling R to pull Trump favorability scores separated by political party. We specify which data frame we're using (df1), and then we make it pretty by specifying the title and the axis labels.

The histogram and bar chart just show frequency on the y-axis, but sometimes you want to show averages, confidence intervals, and so on. You may also be interested in favorability based on political party, or some other novel question. While this is less important for this week, let's talk about it a little anyway.

ggplot2 has so many options, and this is just the start! We're going to use the packages ggplot2 and scales. Install them if you haven't before, and load the libraries.

```{r}
library(ggplot2)
library(scales)
```

```{r}
ggplot(df1, aes(poliParty, obamaFav)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean",
           fill = c("red", "blue", "green", "purple")) +
  xlab("Political Party") + ylab("Mean Favorability") +
  scale_y_continuous(limits = c(1, 4), oob = rescale_none)
```

Here we used the ggplot() function. We specify the data frame we're using (df1), and we further specify the aesthetic (aes), which simply means which variables we want and split by what. So the aes says that we want the favorability scores for Obama separated by political party. Then we need to specify what the bars should show; that's what the geom_bar() function does. We specify the position; "dodge" indicates that we want the bars next to each other as opposed to stacked on top of one another. The stat argument specifies that we want to look at summary statistics, and the fun argument further specifies that we want to visualize the mean. Then we just add colors using the fill argument. Note that we have four columns, so we could do fill = "red", and that would make all the bars red; but if we want a different color for each column, we need to create a vector using the c() function, with the colors inside quotation marks and separated by commas. Then we add labels for the x- and y-axes. We also need to specify how our y-axis should be represented; in this case it's continuous, so we use the scale_y_continuous() function and specify the limits (we know our scale goes from 1 to 4). Finally, the oob argument tells R what to do with values that are out of bounds; rescale_none (from the scales package) says not to rescale them. We also know we don't have any, so...

# Time to PRACTICE!

Above, we did a bunch of interesting manipulations to the Trump and Obama favorability rankings. However, there are tons of other data in that dataset, and you may have your own data too!

Practice using these data, or your own data, or an R dataset (the function data() pulls up a bunch of datasets, such as mtcars, diamonds, ChickWeight, etc.).
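For instance, a quick sketch using one of R's built-in datasets (mtcars ships with base R):

```{r}
data()       # browse the built-in datasets
data(mtcars) # load one into your environment
head(mtcars) # peek at the first six rows
```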

And then practice the following things:

- Importing datasets
- Cleaning data (visualizing and removing NAs, e.g.)
- Finding measures of central tendency (mean, median, mode)
- Finding measures of variability (standard deviation, variance, range, IQR)
- Transforming data (standardizing)
- Plotting the data appropriately