We will be using the dataset state.pop which includes information about US state populations and the dataset museums.dta which contains information about the number of museums in each state. The unit of observation in each data frame is state.
# check the original dimensions
dim(state.pop) # 49 observations of 2 variables
## [1] 49 2
dim(museums.dta) # 51 observations of 2 variables
## [1] 51 2
# Make sure that both unique identifier columns are the same class
class(state.pop$state.abb) == class(museums.dta$State)
## [1] TRUE
# Verify that each unique ID appears only once in each dataset
length(unique(state.pop$state.abb)) == nrow(state.pop)
## [1] TRUE
length(unique(museums.dta$State)) == nrow(museums.dta)
## [1] TRUE
# Identify how many unique IDs appear in one dataset and not the other
state.pop$state.abb[!state.pop$state.abb %in% museums.dta$State] # all states in state.pop appear in museums.dta
## character(0)
museums.dta$State[!museums.dta$State %in% state.pop$state.abb] # AK and HI appear in the museus dataset but not in the state.pop dataset
## [1] "AK" "HI"
We want to preserve as much information as possible for every state. Since all states in state.pop appear in museums.dta and AK and HI appear only in the museums dataset, we will use a right merge to preserve all 52 state from museums.dta even though AK and HI will not have values for population.
state.dta <- merge(state.pop, museums.dta, by.x = "state.abb", by.y = "State", all.y = T)
dim(state.dta)
## [1] 51 3
colnames(state.dta) # 3 columns
## [1] "state.abb" "population" "museum.count"
nrow(state.dta) # 51 observations
## [1] 51
unique(state.dta$state.abb) # 51 unique state labels
## [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VA" "VT" "WA" "WI" "WV" "WY"
# number of rows in our new data = number of rows in X
nrow(state.dta) == nrow(museums.dta)
## [1] TRUE
# number of columns in our new data = sum of columns - ID column
ncol(state.dta) == ncol(state.pop) + ncol(museums.dta) - 1
## [1] TRUE
datatable(state.dta)
barplot((state.dta$museum.count/state.dta$population)*1000,
names.arg = state.dta$state.abb,
cex.names = .4,
las = 2,
main = "Ratio of Museums Residents",
ylab = "population in thousands",
xlab = "")
We will be adding in the dataset ufo.dta which includes information about the number of UFO sightings within various states. The unit of observation in this data frame is state.
dim(state.dta) # 51 observations of 3 variables
## [1] 51 3
dim(ufo.dta) # 79 observations of 3 variables
## [1] 79 3
colnames(state.dta)
## [1] "state.abb" "population" "museum.count"
colnames(ufo.dta)
## [1] "State" "Country" "UFO.sightings"
class(state.dta$state.abb) == class(ufo.dta$State)
## [1] TRUE
state.dta$state.abb[!state.dta$state.abb %in% ufo.dta$State]# all states in state.dta appear in ufo.dta
## character(0)
ufo.dta$State[!ufo.dta$State %in% state.dta$state.abb]# several states in ufo.dta do not appear in state.dta
## [1] "PR" "" "NT" "SA" "YT" "" "AB" "BC" "MB" "NB" "NF" "NS" "NT" "ON" "PE"
## [16] "PQ" "QC" "SA" "SK" "YK" "YT" "" "" "BC" "NS" "NT" "SK" "YT"
unique(ufo.dta$Country) # we can see that this is because ufo.dta includes states outside the US
## [1] US AU CA DE GB
## Levels: AU CA DE GB US
Since our analysis only encompasses US states, we only want to preserve information about the 51 states we included in our analysis previously. We will use a left join to preserve only observations included in state.dta and we will exclude any others.
state.dta2 <- merge(state.dta, ufo.dta, by.x = "state.abb", by.y = "State", all.x = T)
dim(state.dta2) # 51 observations of 5 variables
## [1] 51 5
colnames(state.dta2)
## [1] "state.abb" "population" "museum.count" "Country"
## [5] "UFO.sightings"
# number of rows in our new data = number of rows in Y
nrow(state.dta2) == nrow(state.dta)
## [1] TRUE
# number of columns in our new data = sum of columns - ID column
ncol(state.dta2) == ncol(state.dta) + ncol(ufo.dta) - 1
## [1] TRUE
datatable(state.dta2)
barplot((state.dta2$UFO.sightings/state.dta2$population)*1000,
names.arg = state.dta2$state.abb,
cex.names = .4,
las = 2,
main = "UFO Sightings per 1000 Residents",
ylab = "",
xlab = "")
The ANES 2008-2009 Panel Study tracked and repeatedly interviewed the same group of respondents starting in January 2008 and in September 2009. For this example we will use data on the final wave of respondents called panel.08.dta.
The ANES 2010 Panel Recontact Study reinterviewed a portion of the ANES 2008-2009 Study panelists. For this example we will use data on the reinterviewed respondents called recontact.10.dta.
# check the original dimensions
dim(panel.08.dta) # 2313 observations of 765 variables
## [1] 2313 765
dim(recontact.10.dta) # 2606 observations of 874 variables
## [1] 2606 874
# Identify which variables appear in both datasets
tolower(colnames(panel.08.dta))[tolower(colnames(panel.08.dta)) %in% tolower(colnames(recontact.10.dta))]# only caseid appears in both
## [1] "caseid"
# Make sure that both unique identifier columns are the same class
panel.08.dta <- rename(panel.08.dta,caseid = CASEID) # give ID var same name
class(panel.08.dta$caseid) == class(recontact.10.dta$caseid)
## [1] TRUE
# Verify that each unique ID appears only once in each dataset
length(unique(panel.08.dta$caseid)) == nrow(panel.08.dta)
## [1] TRUE
length(unique(recontact.10.dta$caseid)) == nrow(recontact.10.dta)
## [1] TRUE
# Identify how many observations appear in one dataset and not the other
in.10.not.08 <- length(panel.08.dta$caseid[!panel.08.dta$caseid %in% recontact.10.dta$caseid])# 154 unique IDs appear in the 2008 survey data and not in 2010
in.08.not.10 <- length(recontact.10.dta$caseid[!recontact.10.dta$caseid %in% panel.08.dta$caseid])# 447 unique IDs appear in the 2010 survey data and not in this wave of the 2008 survey
# Identify which observations the datasets have in common - these will be merged together to form one row each
both.08.10 <- length(recontact.10.dta$caseid[recontact.10.dta$caseid %in% panel.08.dta$caseid])
both.08.10 # 2159 unique IDs appear in both the 2010 survey data and the 2008 survey wave
## [1] 2159
If we wanted to create one data frame that included all respondents surveyed - in either the 2008 panel wave or the 2010 reinterview survey - we would use a full merge. This would preserve ALL unique rows of data from either datasource.
all.respondents <- merge(panel.08.dta, recontact.10.dta, by = "caseid", all = T)
dim(all.respondents) # 2760 observations of 1638 variables
## [1] 2760 1638
# check columns
ncol(panel.08.dta) + ncol(recontact.10.dta) - 1 # 1638 total unique columns
## [1] 1638
ncol(panel.08.dta) + ncol(recontact.10.dta) - 1 == ncol(all.respondents)
## [1] TRUE
# check rows
in.10.not.08 + in.08.not.10 + both.08.10
## [1] 2760
nrow(panel.08.dta) + nrow(recontact.10.dta) - both.08.10 == nrow(all.respondents)
## [1] TRUE
If we wanted to compare responses from the same people across surveys, we would create one data frame that included only respondents surveyed in both the 2008 panel wave and the 2010 reinterview survey. To do this, we would use an inner merge. This would preserve ONLY unique rows that exist in both data frames.
both.respondents <- merge(panel.08.dta, recontact.10.dta, by = "caseid", all = F)
dim(both.respondents) # 2159 observations of 1638 variables
## [1] 2159 1638
both.08.10 # 2159 IDs appear in both datasets
## [1] 2159
nrow(both.respondents) == both.08.10 #TRUE
## [1] TRUE