Recitation 5: Data Sets & Recoding



Loading RData


*Make sure all data files are saved in the folder you selected as your working directory before you try to bring them into R.

.RData files are R's native data format, so they do not need to be assigned to an object. Wrap the file path in the load() command and the objects stored in the file will be loaded into your environment under the names they were saved with; they should appear in your “Environment” tab.

When first working with a new data file, get an idea of its basic structure. Look at its dimensions and column names, and inspect the overall appearance of the data to make sure it was imported properly and to give you context for any later processing.

setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")

load("Data/Pew_March_19.Rdata")

dim(pewdta19)
## [1] 1503    9
colnames(pewdta19)
## [1] "respid"  "state"   "age"     "educ"    "hisp"    "racecmb" "party"  
## [8] "partyln" "q25"
summary(pewdta19)
##      respid           state            age             educ      
##  Min.   :     2   Min.   : 1.00   Min.   :18.00   Min.   :1.000  
##  1st Qu.:100142   1st Qu.:13.00   1st Qu.:36.00   1st Qu.:3.000  
##  Median :100783   Median :29.00   Median :53.00   Median :5.000  
##  Mean   : 91915   Mean   :28.56   Mean   :52.09   Mean   :5.012  
##  3rd Qu.:101475   3rd Qu.:42.00   3rd Qu.:67.00   3rd Qu.:6.000  
##  Max.   :300009   Max.   :56.00   Max.   :99.00   Max.   :9.000  
##                                                                  
##       hisp          racecmb          party          partyln     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :2.000  
##  Mean   :1.967   Mean   :1.778   Mean   :2.336   Mean   :3.121  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##                                                  NA's   :849    
##                        q25      
##  Just about always       :  43  
##  Most of the time        : 226  
##  Only some of the time   :1046  
##  (VOL) Never             : 165  
##  (VOL) Don't know/Refused:  23  
##                                 
## 
head(pewdta19)
##   respid state age educ hisp racecmb party partyln                   q25
## 1      2    36  69    3    2       1     3       9 Only some of the time
## 2      3    42  75    3    2       2     2      NA           (VOL) Never
## 3      4    12  55    3    2       1     1      NA Only some of the time
## 4      5    37  67    5    2       1     4       1           (VOL) Never
## 5      6    26  53    4    2       1     4       2 Only some of the time
## 6      7    42  46    6    2       1     1      NA Only some of the time
#View(pewdta19)

pewdta19.orig <- pewdta19

*Always save an original copy of your dataset that you can refer back to throughout the recoding process.
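
As a minimal sketch of what load() does, the example below saves a small data frame to a temporary .RData file and loads it back. The data frame `toy` and its values are hypothetical stand-ins, not the Pew data.

```r
# Toy data frame standing in for a survey file (hypothetical values)
toy <- data.frame(respid = 1:3, party = c(1, 2, 9))

# Save it to a temporary .RData file, then remove it from the environment
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() restores the object under its original name and
# invisibly returns the names of everything it loaded
loaded <- load(path)
loaded      # names of the restored objects
dim(toy)    # the data frame is back in the environment

# Keep an untouched copy before recoding
toy.orig <- toy
```

Because load() restores objects under their saved names, it can silently overwrite an object of the same name that is already in your environment.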


Importing Data


All files that are not native to R (not .RData files) need to be imported and assigned to an R object. These include Excel files (.xlsx), CSV files (.csv), Stata files (.dta), and others.

There are individual importing commands for each file type. However, the rio package simplifies the process by offering a single command, import(), that can read most file types.

setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")

library(rio)
pewdta19.2 <- import("Data/Pew.March.19.csv")

dim(pewdta19.2)
## [1] 1503    9
colnames(pewdta19.2)
## [1] "respid"  "state"   "age"     "educ"    "hisp"    "racecmb" "party"  
## [8] "partyln" "q25"
summary(pewdta19.2)
##      respid           state            age             educ      
##  Min.   :     2   Min.   : 1.00   Min.   :18.00   Min.   :1.000  
##  1st Qu.:100142   1st Qu.:13.00   1st Qu.:36.00   1st Qu.:3.000  
##  Median :100783   Median :29.00   Median :53.00   Median :5.000  
##  Mean   : 91915   Mean   :28.56   Mean   :52.09   Mean   :5.012  
##  3rd Qu.:101475   3rd Qu.:42.00   3rd Qu.:67.00   3rd Qu.:6.000  
##  Max.   :300009   Max.   :56.00   Max.   :99.00   Max.   :9.000  
##                                                                  
##       hisp          racecmb          party          partyln     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :2.000  
##  Mean   :1.967   Mean   :1.778   Mean   :2.336   Mean   :3.121  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##                                                  NA's   :849    
##      q25           
##  Length:1503       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
#View(pewdta19.2)

pewdta19.2.orig <- pewdta19.2
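
Note that in the summary above q25 is a character variable after the CSV import, whereas it was a factor in the .RData version. The sketch below reproduces that difference on hypothetical toy data. It uses base R's read.csv() rather than rio so that it runs without extra packages.

```r
# Toy data with a labelled (factor) column (hypothetical values)
toy <- data.frame(
  respid = 1:4,
  q25 = factor(c("Most of the time", "(VOL) Never",
                 "Most of the time", "Just about always"))
)

# Write to a temporary CSV and read it back with base R
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
toy.csv <- read.csv(path, stringsAsFactors = FALSE)

class(toy$q25)      # "factor"
class(toy.csv$q25)  # "character": a CSV file stores plain text only

# Convert back to a factor if you need levels
toy.csv$q25 <- factor(toy.csv$q25)
levels(toy.csv$q25)
```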



Recoding


For the following examples we will use our data frame pewdta19 and the variable party, which includes responses to the question: “In politics TODAY, do you consider yourself a Republican, Democrat, or Independent?”

1. Identify the current values of a variable of interest:

class(pewdta19$party) # class() returns the object type of the variable (numeric, character, or factor)
## [1] "numeric"
table(pewdta19$party) #table() returns all existing values of a variable and their frequency
## 
##   1   2   3   4   5   9 
## 424 425 566  41  12  35
summary(pewdta19$party) # returns descriptive statistics of a numeric variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.336   3.000   9.000


2. Locate rows based on a variable (column) value:

which(pewdta19$party == 1)  #returns the row number of each observation that contains a '1' for the "party" variable
##   [1]    3    6    7   11   12   18   20   23   24   26   31   32   33   38   39
##  [16]   40   46   47   48   49   50   53   54   58   60   63   64   68   72   74
##  [31]   77   81   87   88   96   97   98  108  110  113  114  120  121  124  130
##  [46]  132  134  136  138  144  147  150  154  155  157  158  159  162  163  164
##  [61]  170  171  175  176  177  178  182  186  191  192  195  199  203  207  211
##  [76]  213  215  220  222  231  233  235  241  243  251  257  260  261  263  264
##  [91]  266  274  281  288  294  300  307  308  320  324  325  329  330  332  334
## [106]  342  343  345  347  349  353  361  364  368  375  377  380  384  389  391
## [121]  393  394  399  401  405  408  409  411  416  422  426  434  435  441  449
## [136]  452  455  458  459  460  462  464  467  474  476  495  496  503  506  507
## [151]  509  510  514  523  525  532  538  548  556  557  558  560  561  563  572
## [166]  573  577  578  583  585  590  593  599  604  606  614  622  624  625  627
## [181]  636  638  642  644  646  647  650  653  662  667  669  670  672  673  675
## [196]  676  677  679  683  686  687  688  692  697  708  712  714  716  717  718
## [211]  719  726  727  729  732  734  735  741  742  745  746  750  751  755  757
## [226]  758  761  764  767  768  771  772  775  776  780  782  783  786  788  791
## [241]  792  793  794  795  796  798  801  803  804  807  810  815  816  821  822
## [256]  823  827  829  835  836  839  841  847  851  855  859  860  863  867  873
## [271]  879  886  893  896  904  907  925  926  927  932  936  938  943  945  948
## [286]  950  951  952  962  964  968  969  971  973  975  980  981  982  983  984
## [301]  986  989  990  994  998 1001 1002 1006 1007 1011 1014 1016 1019 1030 1031
## [316] 1033 1038 1040 1046 1047 1050 1059 1062 1064 1065 1069 1083 1086 1089 1095
## [331] 1101 1112 1129 1140 1141 1154 1155 1157 1162 1169 1170 1171 1173 1174 1186
## [346] 1187 1195 1196 1201 1207 1210 1211 1215 1221 1233 1238 1239 1240 1244 1245
## [361] 1246 1254 1255 1260 1265 1273 1276 1280 1286 1297 1301 1302 1307 1309 1311
## [376] 1312 1314 1316 1319 1320 1321 1326 1329 1330 1331 1333 1334 1335 1341 1344
## [391] 1346 1348 1351 1358 1367 1373 1376 1385 1387 1388 1389 1395 1399 1400 1405
## [406] 1407 1410 1413 1414 1420 1422 1439 1442 1446 1453 1455 1458 1478 1486 1487
## [421] 1488 1490 1500 1502
length(which(pewdta19$party == 1))  #returns the number of rows for which the value of "party" is '1'
## [1] 424
which(is.na(pewdta19$party)) #returns the row number of each observation that contains a NA for the "party" variable
## integer(0)
length(which(is.na(pewdta19$party))) #returns the number of rows for which the value of "party" is NA
## [1] 0
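
The counts above can also be obtained without which(). A minimal sketch on a hypothetical toy vector (not the Pew data): summing a logical vector counts its TRUE values.

```r
# Hypothetical toy values, including one missing response
party <- c(1, 2, 1, NA, 3, 9, 1)

sum(party == 1, na.rm = TRUE)   # number of rows equal to 1
sum(is.na(party))               # number of missing values
table(party, useNA = "ifany")   # frequencies, with NA shown when present
```

The na.rm = TRUE argument matters here because comparing NA to 1 yields NA, and a sum that includes NA is itself NA.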


3a. Recode to a meaningful factor variable:

Use the dataset codebook to identify the meaning of each variable value. Below are the value labels of party as they might appear in a codebook:

## $label
## [1] "PARTY. In politics TODAY, do you consider yourself a Republican, Democrat, or independent?"
## 
## $labels
##               Republican                 Democrat              Independent 
##                        1                        2                        3 
##      (VOL) No preference        (VOL) Other party (VOL) Don't know/Refused 
##                        4                        5                        9


Create a new empty variable column within the dataset. Try to give your variables meaningful names. Our new variable is called party.new.

In general, do not replace the original variable in a dataset with your recoded version. It is good practice to never eliminate data in case you need it for later reference.

#create the new variable column in the dataset by setting it to NA
pewdta19$party.new <- NA

# for each recode, index the new variable (pewdta19$party.new) by a condition on the existing variable (pewdta19$party), then assign the label to the rows that match

# The first line below specifies that all observations (rows) that have the value of '1' for "party" will have the label 'Republican' for the new variable "party.new"
pewdta19$party.new[pewdta19$party == 1] <- "Republican"
pewdta19$party.new[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new[pewdta19$party == 3] <- "Independent"
pewdta19$party.new[pewdta19$party == 4] <- "No Preference"
pewdta19$party.new[pewdta19$party == 5] <- "Other Party"
pewdta19$party.new[pewdta19$party == 9] <- "Refused"

table(pewdta19$party.new) # check that all value labels exist in the new variable and that the numbers appear correct for each label 
## 
##      Democrat   Independent No Preference   Other Party       Refused 
##           425           566            41            12            35 
##    Republican 
##           424
# and for NA
length(which(is.na(pewdta19$party.new)))
## [1] 0
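
The step-by-step assignment above produces a character variable. If a true factor with a fixed level order is wanted, one option is factor() with explicit levels and labels. This is a sketch on a hypothetical toy vector; the name `party.f` is made up for illustration.

```r
# Hypothetical toy values using the same codes as "party"
party <- c(1, 2, 3, 4, 5, 9, 1, 3)

# levels = the codes found in the data, labels = their meaning in the codebook
party.f <- factor(party,
                  levels = c(1, 2, 3, 4, 5, 9),
                  labels = c("Republican", "Democrat", "Independent",
                             "No Preference", "Other Party", "Refused"))

class(party.f)
levels(party.f)

# Cross-tabulate old against new to confirm every code received the right label
table(party, party.f, useNA = "ifany")
```

Any code not listed in levels would become NA, so the cross-tabulation doubles as a check that no value was missed.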


Another way to recode a variable is by using the recode() function in the dplyr package:
#install.packages("dplyr")
library("dplyr")

pewdta19$party.new1 <- NA
pewdta19$party.new1 <- recode(pewdta19$party, 
                                '1' = "Republican", 
                                '2' = "Democrat",
                                '3' = "Independent",
                                '4' = "No Preference",
                                '5' = "Other Party",
                                '9' = "Refused" )
#Note: for this command, numeric values must be quoted (single quotes, double quotes, or backticks all work)

table(pewdta19$party.new1)
## 
##      Democrat   Independent No Preference   Other Party       Refused 
##           425           566            41            12            35 
##    Republican 
##           424
length(which(is.na(pewdta19$party.new1)))
## [1] 0
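
If dplyr is not available, a similar recode can be sketched in base R with a named lookup vector. The names `lookup` and `party.lab` and the toy values are hypothetical.

```r
# Hypothetical toy values using the same codes as "party"
party <- c(1, 2, 3, 4, 5, 9, 2)

# Names are the old codes (as text), values are the new labels
lookup <- c("1" = "Republican", "2" = "Democrat", "3" = "Independent",
            "4" = "No Preference", "5" = "Other Party", "9" = "Refused")

# Index the lookup vector by the codes converted to character
party.lab <- unname(lookup[as.character(party)])

party.lab
table(party.lab)
```

A code that is missing from the lookup vector comes back as NA, which makes unmapped values easy to spot.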


3b. Recode to a different factor variable to collapse or add levels:

pewdta19$party.new2 <- NA
pewdta19$party.new2[pewdta19$party == 1] <- "Republican"
pewdta19$party.new2[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new2[pewdta19$party == 3] <- "Independent"
pewdta19$party.new2[pewdta19$party == 4] <- "Independent"
pewdta19$party.new2[pewdta19$party == 5] <- NA
pewdta19$party.new2[pewdta19$party == 9] <- NA

table(pewdta19$party.new2)
## 
##    Democrat Independent  Republican 
##         425         607         424
length(which(is.na(pewdta19$party.new2)))
## [1] 47

The same recode can be written more compactly by combining conditions with the | (or) operator:

pewdta19$party.new2a <- NA
pewdta19$party.new2a[pewdta19$party == 1] <- "Republican"
pewdta19$party.new2a[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new2a[pewdta19$party == 3 | pewdta19$party == 4] <- "Independent"
pewdta19$party.new2a[pewdta19$party == 5 | pewdta19$party == 9] <- NA

table(pewdta19$party.new2a, useNA = "ifany")
## 
##    Democrat Independent  Republican        <NA> 
##         425         607         424          47
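
When several codes collapse into one level, %in% is a shorter alternative to chaining | conditions. A minimal sketch on hypothetical toy values; the name `party.new2b` is made up for illustration.

```r
# Hypothetical toy values using the same codes as "party"
party <- c(1, 2, 3, 4, 5, 9, 3)

party.new2b <- rep(NA_character_, length(party))
party.new2b[party == 1] <- "Republican"
party.new2b[party == 2] <- "Democrat"
party.new2b[party %in% c(3, 4)] <- "Independent"
# codes 5 and 9 are never assigned, so they stay NA

table(party.new2b, useNA = "ifany")
```

Unlike ==, %in% never returns NA, which makes it convenient when the source variable contains missing values.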


Recode different variables together to create a new factor variable:

The variable partyln includes responses to the question: “As of today do you lean more to the Republican Party or more to the Democratic Party?”
* This question was asked only of respondents who did not select Democrat or Republican as their answer for the variable party

Below are the value labels of partyln as they might appear in a codebook:

## $label
## [1] "PARTYLN. As of today do you lean more to the Republican Party or more to the Democratic Party?"
## 
## $labels
##                     Republican                       Democrat 
##                              1                              2 
## (VOL) Other/Don't know/Refused 
##                              9

When combining variables, it is a good idea to first check their values and the frequency of their value combinations.

table(pewdta19$partyln)
## 
##   1   2   9 
## 233 283 138
table(pewdta19$party,pewdta19$partyln) # this is a two-way table that displays the frequency of each combination of values between the variables 
##    
##       1   2   9
##   1   0   0   0
##   2   0   0   0
##   3 199 267 100
##   4  19   9  13
##   5   8   3   1
##   9   7   4  24


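We are going to recode partyln and party together to create a new variable party.new.L. In this new variable, anyone who responded to partyln with “Democrat” or “Republican” will be coded into those categories, and only those who answered “Independent” or “No preference” and reported no party lean will remain in the “Independent” group. Respondents who gave neither a party nor a lean (for example, “Other party” on party and “Other/Don't know/Refused” on partyln) remain NA.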

pewdta19$party.new.L <- NA
pewdta19$party.new.L[pewdta19$party == 1 | pewdta19$partyln == 1] <- "Republican"
pewdta19$party.new.L[pewdta19$party == 2 | pewdta19$partyln == 2] <- "Democrat"
pewdta19$party.new.L[(pewdta19$party == 3 | pewdta19$party == 4) & pewdta19$partyln == 9] <- "Independent"

table(pewdta19$party.new.L, useNA = "ifany")
## 
##    Democrat Independent  Republican        <NA> 
##         708         113         657          25
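
The recode above works even though partyln is NA for everyone who named a party, because of how R combines NA with logical operators. The sketch below illustrates this on hypothetical toy vectors that stand in for the two columns.

```r
# How NA behaves in logical operations
TRUE  | NA   # TRUE: the result is TRUE whatever the missing value is
FALSE | NA   # NA:   the result depends on the missing value
FALSE & NA   # FALSE
TRUE  & NA   # NA

# Hypothetical toy values; partyln is NA for respondents who named a party
party   <- c(1, 2, 3, 3, 4, 9)
partyln <- c(NA, NA, 1, 9, 2, 9)

party.new.L <- rep(NA_character_, length(party))
# Rows where the condition is NA are skipped by the assignment
party.new.L[party == 1 | partyln == 1] <- "Republican"
party.new.L[party == 2 | partyln == 2] <- "Democrat"
party.new.L[(party == 3 | party == 4) & partyln == 9] <- "Independent"

party.new.L
```

Assigning through a logical index that contains NA is allowed here only because the replacement is a single value; with a longer replacement vector R raises an error.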


3c. Recode to a numeric variable in order to perform calculations:

For the following examples we will be using q25:

“Question 25: How much of the time do you think you can trust the government in Washington to do what is right? Just about always, most of the time, or only some of the time?”

table(pewdta19$q25)
## 
##        Just about always         Most of the time    Only some of the time 
##                       43                      226                     1046 
##              (VOL) Never (VOL) Don't know/Refused 
##                      165                       23
pewdta19$trust <- NA
pewdta19$trust[pewdta19$q25 == "(VOL) Never"] <- 0
pewdta19$trust[pewdta19$q25 == "Only some of the time"] <- 1
pewdta19$trust[pewdta19$q25 == "Most of the time"] <- 2
pewdta19$trust[pewdta19$q25 == "Just about always"] <- 3

summary(pewdta19$trust)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   1.000   1.099   1.000   3.000      23
mean(pewdta19$trust, na.rm = TRUE)
## [1] 1.099324
# On average, respondents trust the government only slightly more than "only some of the time" (mean of about 1.1 on the 0-3 scale). Q1, the median, and Q3 all equal 1, so well over half of the valid responses are "only some of the time" (1046 of 1480, about 71%). About 82% (1211 of 1480) trust the government only some of the time or never.
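
A more compact way to build the same kind of 0-3 scale is to look up each response's position in an ordered vector of labels with match(). This is a sketch on hypothetical toy responses; the name `trust.levels` is made up for illustration.

```r
# Hypothetical toy responses using the q25 labels
q25 <- c("Most of the time", "(VOL) Never", "Only some of the time",
         "Just about always", "(VOL) Don't know/Refused")

# Labels in order from least to most trust; anything not listed becomes NA
trust.levels <- c("(VOL) Never", "Only some of the time",
                  "Most of the time", "Just about always")

# match() returns positions 1-4, so subtract 1 to get a 0-3 scale
trust <- match(q25, trust.levels) - 1L

trust
mean(trust, na.rm = TRUE)
```

Leaving "(VOL) Don't know/Refused" out of the lookup keeps it as NA, matching the step-by-step version.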


Now that we have created these new variables in our data frame pewdta19, we can see that its dimensions and column names have changed (compared to our original dataset pewdta19.orig) to include the new columns.

dim(pewdta19.orig)
## [1] 1503    9
dim(pewdta19)
## [1] 1503   15
colnames(pewdta19)
##  [1] "respid"      "state"       "age"         "educ"        "hisp"       
##  [6] "racecmb"     "party"       "partyln"     "q25"         "party.new"  
## [11] "party.new1"  "party.new2"  "party.new2a" "party.new.L" "trust"
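
To list exactly which columns were added, and to confirm that the original columns were left untouched, the two data frames can be compared directly. A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data: an original copy and a recoded working copy
toy.orig <- data.frame(respid = 1:3, party = c(1, 2, 3))
toy <- toy.orig
toy$party.new <- c("Republican", "Democrat", "Independent")

# Columns present in the working copy but not in the original
setdiff(colnames(toy), colnames(toy.orig))

# Check that the original columns are unchanged
all.equal(toy[colnames(toy.orig)], toy.orig)
```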