*Make sure all data files are saved in the folder you selected as your working directory before you try to bring them into R.
.RData files are native to R, so they do not need to be assigned to an object. You can simply pass the file path to the load() command, and the objects stored in the .RData file will be loaded into your environment and should appear in your “Environment” tab.
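As a minimal sketch of how load() behaves, the example below saves a small made-up data frame to a temporary file and loads it back. The object name toy, its values, and the temporary path are illustrative assumptions, not part of the Pew data.

```r
# Toy data frame standing in for a real dataset (hypothetical values)
toy <- data.frame(respid = 1:3, party = c(1, 2, 3))

# Save it to a temporary .RData file, then remove it from the environment
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)
rm(toy)

# load() restores the object under its saved name and returns that name
loaded <- load(path)
loaded          # names of the objects that were restored
exists("toy")   # the data frame is back in the environment
```

Because load() restores objects under the names they were saved with, it can silently overwrite an object of the same name that already exists in your environment.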
When first dealing with a new data file, you want to get an idea of its basic structure. Look at its dimensions and column names, and inspect the overall appearance of the data to make sure it was imported properly and to give yourself context for any later processing.
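A compact complement to these checks is str(), which prints the dimensions, column names, column types, and the first few values in one call. The sketch below uses a small made-up data frame (toy is an assumption for illustration) so that it runs without the Pew file.

```r
# Small stand-in data frame with two numeric columns and one factor
toy <- data.frame(
  respid = c(2, 3, 4),
  age    = c(69, 75, 55),
  q25    = factor(c("Only some of the time", "(VOL) Never",
                    "Only some of the time"))
)

# One-call overview: number of rows/columns, names, types, first values
str(toy)

# The class of every column at once
col.classes <- sapply(toy, class)
col.classes
```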
setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")
load("Data/Pew_March_19.Rdata")
dim(pewdta19)
## [1] 1503 9
colnames(pewdta19)
## [1] "respid" "state" "age" "educ" "hisp" "racecmb" "party"
## [8] "partyln" "q25"
summary(pewdta19)
## respid state age educ
## Min. : 2 Min. : 1.00 Min. :18.00 Min. :1.000
## 1st Qu.:100142 1st Qu.:13.00 1st Qu.:36.00 1st Qu.:3.000
## Median :100783 Median :29.00 Median :53.00 Median :5.000
## Mean : 91915 Mean :28.56 Mean :52.09 Mean :5.012
## 3rd Qu.:101475 3rd Qu.:42.00 3rd Qu.:67.00 3rd Qu.:6.000
## Max. :300009 Max. :56.00 Max. :99.00 Max. :9.000
##
## hisp racecmb party partyln
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000 Median :2.000
## Mean :1.967 Mean :1.778 Mean :2.336 Mean :3.121
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## NA's :849
## q25
## Just about always : 43
## Most of the time : 226
## Only some of the time :1046
## (VOL) Never : 165
## (VOL) Don't know/Refused: 23
##
##
head(pewdta19)
##   respid state age educ hisp racecmb party partyln                   q25
## 1      2    36  69    3    2       1     3       9 Only some of the time
## 2      3    42  75    3    2       2     2      NA           (VOL) Never
## 3      4    12  55    3    2       1     1      NA Only some of the time
## 4      5    37  67    5    2       1     4       1           (VOL) Never
## 5      6    26  53    4    2       1     4       2 Only some of the time
## 6      7    42  46    6    2       1     1      NA Only some of the time
#View(pewdta19)
pewdta19.orig <- pewdta19
*Always save an original copy of your dataset that you can refer back to throughout the recoding process.
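The sketch below illustrates why the copy is useful: changes made to the working data frame do not affect the copy, so the two can be compared at any point. The objects toy and toy.orig are made-up stand-ins for pewdta19 and pewdta19.orig.

```r
# Working data frame and an untouched copy (hypothetical values)
toy <- data.frame(party = c(1, 2, 9))
toy.orig <- toy

# Recode the working version only
toy$party[toy$party == 9] <- NA

# The copy still holds the original values
identical(toy, toy.orig)   # FALSE: the working version has changed
toy.orig$party             # 1 2 9
```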
All files that are not native to R (not .RData files) need to be imported and assigned to an R object. These include Excel files (.xlsx), CSV files (.csv), Stata files (.dta), and others.
There are individual importing commands for each file type. However, the rio package greatly simplifies the process by offering a single command, import(), that can read most file types.
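For comparison, here is a minimal sketch of one format-specific command, base R's read.csv(). It writes a small made-up data frame to a temporary CSV file and reads it back. The object names and values are assumptions for illustration.

```r
# Hypothetical data written to a temporary CSV file
toy <- data.frame(
  respid = c(2, 3, 4),
  q25    = c("Most of the time", "(VOL) Never", "Most of the time")
)
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# read.csv() is the base R importer for CSV files;
# like import(), its result must be assigned to an object
toy.csv <- read.csv(csv_path)
dim(toy.csv)
class(toy.csv$q25)  # character in R >= 4.0.0, factor in older versions
```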
setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")
library(rio)
pewdta19.2 <- import("Data/Pew.March.19.csv")
dim(pewdta19.2)
## [1] 1503 9
colnames(pewdta19.2)
## [1] "respid" "state" "age" "educ" "hisp" "racecmb" "party"
## [8] "partyln" "q25"
summary(pewdta19.2)
## respid state age educ
## Min. : 2 Min. : 1.00 Min. :18.00 Min. :1.000
## 1st Qu.:100142 1st Qu.:13.00 1st Qu.:36.00 1st Qu.:3.000
## Median :100783 Median :29.00 Median :53.00 Median :5.000
## Mean : 91915 Mean :28.56 Mean :52.09 Mean :5.012
## 3rd Qu.:101475 3rd Qu.:42.00 3rd Qu.:67.00 3rd Qu.:6.000
## Max. :300009 Max. :56.00 Max. :99.00 Max. :9.000
##
## hisp racecmb party partyln
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000 Median :2.000
## Mean :1.967 Mean :1.778 Mean :2.336 Mean :3.121
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## NA's :849
## q25
## Length:1503
## Class :character
## Mode :character
##
##
##
##
#View(pewdta19.2)
pewdta19.2.orig <- pewdta19.2
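Note that summary() reports q25 as a character column in the CSV import, whereas it was a factor in the .RData version. If the factor form is needed, one possible approach is to convert it with factor() and an explicit level order. The sketch below uses a short made-up vector, with the level order taken from the first summary() output.

```r
# A few hypothetical responses stored as plain character strings
q25 <- c("Only some of the time", "(VOL) Never",
         "Most of the time", "Only some of the time")

# Convert to a factor, listing the levels in the desired order
q25.f <- factor(q25, levels = c("Just about always",
                                "Most of the time",
                                "Only some of the time",
                                "(VOL) Never",
                                "(VOL) Don't know/Refused"))

# Levels with no observations still appear, with a count of 0
q25.counts <- table(q25.f)
q25.counts
```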
For the following examples we will use our data frame pewdta19 and the variable party which includes responses to the question: “In politics TODAY, do you consider yourself a Republican, Democrat, or Independent?”
class(pewdta19$party) # class() returns the object type of the variable (numeric, character, or factor)
## [1] "numeric"
table(pewdta19$party) #table() returns all existing values of a variable and their frequency
##
## 1 2 3 4 5 9
## 424 425 566 41 12 35
summary(pewdta19$party) # returns descriptive statistics of a numeric variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.336 3.000 9.000
which(pewdta19$party == 1) #returns the row number of each observation that contains a '1' for the "party" variable
## [1] 3 6 7 11 12 18 20 23 24 26 31 32 33 38 39
## [16] 40 46 47 48 49 50 53 54 58 60 63 64 68 72 74
## [31] 77 81 87 88 96 97 98 108 110 113 114 120 121 124 130
## [46] 132 134 136 138 144 147 150 154 155 157 158 159 162 163 164
## [61] 170 171 175 176 177 178 182 186 191 192 195 199 203 207 211
## [76] 213 215 220 222 231 233 235 241 243 251 257 260 261 263 264
## [91] 266 274 281 288 294 300 307 308 320 324 325 329 330 332 334
## [106] 342 343 345 347 349 353 361 364 368 375 377 380 384 389 391
## [121] 393 394 399 401 405 408 409 411 416 422 426 434 435 441 449
## [136] 452 455 458 459 460 462 464 467 474 476 495 496 503 506 507
## [151] 509 510 514 523 525 532 538 548 556 557 558 560 561 563 572
## [166] 573 577 578 583 585 590 593 599 604 606 614 622 624 625 627
## [181] 636 638 642 644 646 647 650 653 662 667 669 670 672 673 675
## [196] 676 677 679 683 686 687 688 692 697 708 712 714 716 717 718
## [211] 719 726 727 729 732 734 735 741 742 745 746 750 751 755 757
## [226] 758 761 764 767 768 771 772 775 776 780 782 783 786 788 791
## [241] 792 793 794 795 796 798 801 803 804 807 810 815 816 821 822
## [256] 823 827 829 835 836 839 841 847 851 855 859 860 863 867 873
## [271] 879 886 893 896 904 907 925 926 927 932 936 938 943 945 948
## [286] 950 951 952 962 964 968 969 971 973 975 980 981 982 983 984
## [301] 986 989 990 994 998 1001 1002 1006 1007 1011 1014 1016 1019 1030 1031
## [316] 1033 1038 1040 1046 1047 1050 1059 1062 1064 1065 1069 1083 1086 1089 1095
## [331] 1101 1112 1129 1140 1141 1154 1155 1157 1162 1169 1170 1171 1173 1174 1186
## [346] 1187 1195 1196 1201 1207 1210 1211 1215 1221 1233 1238 1239 1240 1244 1245
## [361] 1246 1254 1255 1260 1265 1273 1276 1280 1286 1297 1301 1302 1307 1309 1311
## [376] 1312 1314 1316 1319 1320 1321 1326 1329 1330 1331 1333 1334 1335 1341 1344
## [391] 1346 1348 1351 1358 1367 1373 1376 1385 1387 1388 1389 1395 1399 1400 1405
## [406] 1407 1410 1413 1414 1420 1422 1439 1442 1446 1453 1455 1458 1478 1486 1487
## [421] 1488 1490 1500 1502
length(which(pewdta19$party == 1)) #returns the number of rows for which the value of "party" is '1'
## [1] 424
which(is.na(pewdta19$party)) #returns the row number of each observation that contains an NA for the "party" variable
## integer(0)
length(which(is.na(pewdta19$party))) #returns the number of rows for which the value of "party" is NA
## [1] 0
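The counts obtained with length(which(...)) can also be sketched with sum(), because TRUE counts as 1 when a logical vector is summed. The vector party below is a made-up stand-in for pewdta19$party that includes one NA.

```r
# Hypothetical party codes, including one missing value
party <- c(1, 2, 3, 1, NA, 9)

# Two equivalent ways to count rows where party is 1
n.which <- length(which(party == 1))
n.sum   <- sum(party == 1, na.rm = TRUE)  # na.rm is needed because NA == 1 is NA

# Count missing values
n.na <- sum(is.na(party))

# table() drops NA by default; useNA = "ifany" shows it
table(party, useNA = "ifany")
```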
Use the dataset codebook to identify the meaning of each variable value. Below are the value labels of party as they might appear in a codebook:
## $label
## [1] "PARTY. In politics TODAY, do you consider yourself a Republican, Democrat, or independent?"
##
## $labels
##               Republican                 Democrat              Independent
##                        1                        2                        3
##      (VOL) No preference        (VOL) Other party (VOL) Don't know/Refused
##                        4                        5                        9
Create a new empty variable column within the dataset. Try to give your variables meaningful names. Our new variable is called party.new.
In general, do not replace the original variable in a dataset with your recoded version. It is good practice to never eliminate data in case you need it for later reference.
#create the new variable column in the dataset by setting it to NA
pewdta19$party.new <- NA
#specify the new variable name (pewdta19$party.new) and the value of another existing variable (pewdta19$party) for which you want to assign a new label
# The first line below specifies that all observations (rows) that have the value of '1' for "party" will have the label 'Republican' for the new variable "party.new"
pewdta19$party.new[pewdta19$party == 1] <- "Republican"
pewdta19$party.new[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new[pewdta19$party == 3] <- "Independent"
pewdta19$party.new[pewdta19$party == 4] <- "No Preference"
pewdta19$party.new[pewdta19$party == 5] <- "Other Party"
pewdta19$party.new[pewdta19$party == 9] <- "Refused"
table(pewdta19$party.new) # check that all value labels exist in the new variable and that the numbers appear correct for each label
##
## Democrat Independent No Preference Other Party Refused
## 425 566 41 12 35
## Republican
## 424
# and for NA
length(which(is.na(pewdta19$party.new)))
## [1] 0
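A further check on a recode like the one above is to cross-tabulate the original codes against the new labels: each original code should map to exactly one label. The sketch below uses short made-up vectors rather than the Pew columns.

```r
# Hypothetical original codes
party <- c(1, 2, 3, 4, 5, 9, 1, 3)

# Recode into a new character vector, as in the steps above
party.new <- rep(NA_character_, length(party))
party.new[party == 1] <- "Republican"
party.new[party == 2] <- "Democrat"
party.new[party == 3] <- "Independent"
party.new[party == 4] <- "No Preference"
party.new[party == 5] <- "Other Party"
party.new[party == 9] <- "Refused"

# Rows are original codes, columns are new labels;
# a correct recode has exactly one non-zero cell per row
check <- table(party, party.new, useNA = "ifany")
check
```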
#install.packages("dplyr")
library("dplyr")
pewdta19$party.new1 <- NA
pewdta19$party.new1 <- recode(pewdta19$party,
'1' = "Republican",
'2' = "Democrat",
'3' = "Independent",
'4' = "No Preference",
'5' = "Other Party",
'9' = "Refused" )
#Note: for this command, the numeric values being replaced must be wrapped in quotes or backticks
table(pewdta19$party.new1)
##
## Democrat Independent No Preference Other Party Refused
## 425 566 41 12 35
## Republican
## 424
length(which(is.na(pewdta19$party.new1)))
## [1] 0
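If dplyr is not available, a similar relabeling can be sketched in base R with a named lookup vector. This is an alternative technique, not the recode() approach shown above. The names party.labels and party.lab and the example codes are assumptions for illustration.

```r
# Hypothetical party codes
party <- c(1, 2, 3, 4, 5, 9, 3)

# Named vector: names are the old codes (as text), values are the new labels
party.labels <- c("1" = "Republican",
                  "2" = "Democrat",
                  "3" = "Independent",
                  "4" = "No Preference",
                  "5" = "Other Party",
                  "9" = "Refused")

# Index the lookup vector by the codes converted to character
party.lab <- unname(party.labels[as.character(party)])
party.lab
```

Any code that has no entry in the lookup vector comes back as NA, which makes unmapped values easy to spot.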
pewdta19$party.new2 <- NA
pewdta19$party.new2[pewdta19$party == 1] <- "Republican"
pewdta19$party.new2[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new2[pewdta19$party == 3] <- "Independent"
pewdta19$party.new2[pewdta19$party == 4] <- "Independent"
pewdta19$party.new2[pewdta19$party == 5] <- NA
pewdta19$party.new2[pewdta19$party == 9] <- NA
table(pewdta19$party.new2)
##
## Democrat Independent Republican
## 425 607 424
length(which(is.na(pewdta19$party.new2)))
## [1] 47
pewdta19$party.new2a <- NA
pewdta19$party.new2a[pewdta19$party == 1] <- "Republican"
pewdta19$party.new2a[pewdta19$party == 2] <- "Democrat"
pewdta19$party.new2a[pewdta19$party == 3 | pewdta19$party == 4] <- "Independent"
pewdta19$party.new2a[pewdta19$party == 5 | pewdta19$party == 9] <- NA
table(pewdta19$party.new2a, useNA = "ifany")
##
## Democrat Independent Republican <NA>
## 425 607 424 47
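When several codes collapse into one category, the %in% operator is a possible alternative to chaining == comparisons with |. The sketch below repeats the collapsing logic on a short made-up vector. The name party.new2b is hypothetical.

```r
# Hypothetical party codes, one of each value
party <- c(1, 2, 3, 4, 5, 9)

# Start from all NA, so codes 5 and 9 stay missing without extra lines
party.new2b <- rep(NA_character_, length(party))
party.new2b[party == 1] <- "Republican"
party.new2b[party == 2] <- "Democrat"
party.new2b[party %in% c(3, 4)] <- "Independent"  # same as party == 3 | party == 4

table(party.new2b, useNA = "ifany")
```

Unlike ==, %in% never returns NA, so rows with a missing code are simply not matched.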
The variable partyln includes responses to the question: “As of today do you lean more to the Republican Party or more to the Democratic Party?”
* This question was asked only of respondents who did not select Democrat or Republican as their answer for the variable party
Below are the value labels of partyln as they might appear in a codebook:
## $label
## [1] "PARTYLN. As of today do you lean more to the Republican Party or more to the Democratic Party?"
##
## $labels
##                     Republican                       Democrat
##                              1                              2
## (VOL) Other/Don't know/Refused
##                              9
When combining variables, it is a good idea to first check their values and the frequency of their value combinations.
table(pewdta19$partyln)
##
## 1 2 9
## 233 283 138
table(pewdta19$party,pewdta19$partyln) # this is a two-way table that displays the frequency of each combination of values between the variables
##
##       1   2   9
##   1   0   0   0
##   2   0   0   0
##   3 199 267 100
##   4  19   9  13
##   5   8   3   1
##   9   7   4  24
We are going to recode partyln and party together to create a new variable party.new.L. In this new variable, anyone that responded to partyln with “Democrat” or “Republican” will be coded into those categories and only those with no party lean will remain in the “Independent” group.
pewdta19$party.new.L <- NA
pewdta19$party.new.L[pewdta19$party == 1 | pewdta19$partyln == 1] <- "Republican"
pewdta19$party.new.L[pewdta19$party == 2 | pewdta19$partyln == 2] <- "Democrat"
pewdta19$party.new.L[(pewdta19$party == 3 | pewdta19$party == 4) & pewdta19$partyln == 9] <- "Independent"
table(pewdta19$party.new.L, useNA = "ifany")
##
## Democrat Independent Republican <NA>
## 708 113 657 25
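The recode above works even though partyln is NA for everyone who answered Republican or Democrat to party. The sketch below shows how R combines NA with | and &, and then repeats the same recode on short made-up vectors.

```r
# How NA behaves in logical comparisons
TRUE  | NA   # TRUE:  one TRUE is enough for "or"
FALSE | NA   # NA:    the result depends on the unknown value
FALSE & NA   # FALSE: one FALSE is enough for "and"

# Hypothetical codes: partyln is NA where party is 1 or 2
party   <- c(1, 2, 3, 3, 3, 9)
partyln <- c(NA, NA, 1, 2, 9, 9)

party.new.L <- rep(NA_character_, length(party))
# With a single replacement value, rows where the condition is NA are left unchanged
party.new.L[party == 1 | partyln == 1] <- "Republican"
party.new.L[party == 2 | partyln == 2] <- "Democrat"
party.new.L[(party == 3 | party == 4) & partyln == 9] <- "Independent"
party.new.L
```

The last respondent (party 9, partyln 9) matches none of the conditions and stays NA, which is how the NA group in the table above arises.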
For the following examples we will be using q25:
“Question 25: How much of the time do you think you can trust the government in Washington to do what is right? Just about always, most of the time, or only some of the time?”
table(pewdta19$q25)
##
## Just about always Most of the time Only some of the time
## 43 226 1046
## (VOL) Never (VOL) Don't know/Refused
## 165 23
pewdta19$trust <- NA
pewdta19$trust[pewdta19$q25 == "(VOL) Never"] <- 0
pewdta19$trust[pewdta19$q25 == "Only some of the time"] <- 1
pewdta19$trust[pewdta19$q25 == "Most of the time"] <- 2
pewdta19$trust[pewdta19$q25 == "Just about always"] <- 3
summary(pewdta19$trust)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 1.000 1.099 1.000 3.000 23
mean(pewdta19$trust, na.rm = T)
## [1] 1.099324
# On average, respondents trust the government only slightly more than "only some of the time". At least half of the respondents trust the government only some of the time (Q1, the median, and Q3 all equal 1); the frequency table shows this is about 71% of valid responses. About 82% trust the government only some of the time or never.
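To see the share of respondents at each trust level, one option is prop.table() applied to a frequency table. The sketch below rebuilds the trust vector from the q25 counts reported in the table above (165, 1046, 226, 43, and 23 missing), so it runs without the data file.

```r
# Rebuild the trust variable from the reported frequencies
trust <- c(rep(0, 165), rep(1, 1046), rep(2, 226), rep(3, 43), rep(NA, 23))

# Share of valid (non-missing) responses at each level
trust.share <- prop.table(table(trust))
round(trust.share, 3)

# Share answering "only some of the time" or "never"
low.trust <- sum(trust.share[c("0", "1")])
low.trust

# Mean of the valid responses
trust.mean <- mean(trust, na.rm = TRUE)
trust.mean
```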
Now that we have created these new variables in our data frame pewdta19, we can see that the dimensions and names have changed (compared to our original dataset pewdta19.orig) to include the new columns.
dim(pewdta19.orig)
## [1] 1503 9
dim(pewdta19)
## [1] 1503 15
colnames(pewdta19)
## [1] "respid" "state" "age" "educ" "hisp"
## [6] "racecmb" "party" "partyln" "q25" "party.new"
## [11] "party.new1" "party.new2" "party.new2a" "party.new.L" "trust"
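To list only the columns that were added, one possible approach is setdiff() on the two sets of column names. The sketch below types in the names shown above so that it runs on its own. With the data loaded, the same call would take colnames(pewdta19) and colnames(pewdta19.orig).

```r
# Column names of the original and the recoded data frame, as printed above
orig.names <- c("respid", "state", "age", "educ", "hisp",
                "racecmb", "party", "partyln", "q25")
new.names  <- c(orig.names, "party.new", "party.new1", "party.new2",
                "party.new2a", "party.new.L", "trust")

# Names that appear in the recoded data frame but not in the original
added <- setdiff(new.names, orig.names)
added
length(added)   # 6 new columns: 15 - 9
```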