Merging

View Data

We will be using the dataset state.pop which includes information about US state populations and the dataset museums.dta which contains information about the number of museums in each state. The unit of observation in each data frame is state.

Compare Datasets

# check the original dimensions
dim(state.pop) # 49 observations of 2 variables

## [1] 49  2

dim(museums.dta) # 51 observations of 2 variables

## [1] 51  2

# Make sure that both unique identifier columns are the same class  
class(state.pop$state.abb) == class(museums.dta$State)

## [1] TRUE

# Verify that each unique ID appears only once in each dataset 
length(unique(state.pop$state.abb)) == nrow(state.pop)

## [1] TRUE

length(unique(museums.dta$State)) == nrow(museums.dta)

## [1] TRUE

# Identify how many unique IDs appear in one dataset and not the other 
state.pop$state.abb[!state.pop$state.abb %in% museums.dta$State] # all states in state.pop appear in museums.dta

## character(0)

museums.dta$State[!museums.dta$State %in% state.pop$state.abb] # AK and HI appear in the museus dataset but not in the state.pop dataset

## [1] "AK" "HI"

Right Merge

We want to preserve as much information as possible for every state. Since all states in state.pop appear in museums.dta and AK and HI appear only in the museums dataset, we will use a right merge to preserve all 52 state from museums.dta even though AK and HI will not have values for population.

state.dta <- merge(state.pop, museums.dta, by.x = "state.abb", by.y = "State", all.y = T)

dim(state.dta)

## [1] 51  3

colnames(state.dta) # 3 columns

## [1] "state.abb"    "population"   "museum.count"

nrow(state.dta) # 51 observations

## [1] 51

unique(state.dta$state.abb) # 51 unique state labels

##  [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VA" "VT" "WA" "WI" "WV" "WY"

# number of rows in our new data = number of rows in X 
nrow(state.dta) == nrow(museums.dta)

## [1] TRUE

# number of columns in our new data = sum of columns - ID column
ncol(state.dta) == ncol(state.pop) + ncol(museums.dta) - 1

## [1] TRUE

datatable(state.dta)

barplot((state.dta$museum.count/state.dta$population)*1000,
        names.arg = state.dta$state.abb,
        cex.names = .4,
        las = 2,
        main = "Ratio of Museums Residents", 
        ylab = "population in thousands", 
        xlab = "")

Adding New Data

We will be adding in the dataset ufo.dta which includes information about the number of UFO sightings within various states. The unit of observation in this data frame is state.

Compare New Datasets

dim(state.dta) # 51 observations of 3 variables

## [1] 51  3

dim(ufo.dta)  # 79 observations of 3 variables

## [1] 79  3

colnames(state.dta)

## [1] "state.abb"    "population"   "museum.count"

colnames(ufo.dta)

## [1] "State"         "Country"       "UFO.sightings"

class(state.dta$state.abb) == class(ufo.dta$State)

## [1] TRUE

state.dta$state.abb[!state.dta$state.abb %in% ufo.dta$State]# all states in state.dta appear in ufo.dta

## character(0)

ufo.dta$State[!ufo.dta$State %in% state.dta$state.abb]# several states in ufo.dta do not appear in state.dta

##  [1] "PR" ""   "NT" "SA" "YT" ""   "AB" "BC" "MB" "NB" "NF" "NS" "NT" "ON" "PE"
## [16] "PQ" "QC" "SA" "SK" "YK" "YT" ""   ""   "BC" "NS" "NT" "SK" "YT"

unique(ufo.dta$Country) # we can see that this is because ufo.dta includes states outside the US

## [1] US AU CA DE GB
## Levels:  AU CA DE GB US

Left Merge

Since our analysis only encompasses US states, we only want to preserve information about the 51 states we included in our analysis previously. We will use a left join to preserve only observations included in state.dta and we will exclude any others.

state.dta2 <- merge(state.dta, ufo.dta, by.x = "state.abb", by.y = "State", all.x = T)

dim(state.dta2) # 51 observations of 5 variables

## [1] 51  5

colnames(state.dta2)

## [1] "state.abb"     "population"    "museum.count"  "Country"      
## [5] "UFO.sightings"

# number of rows in our new data = number of rows in Y 
nrow(state.dta2) == nrow(state.dta)

## [1] TRUE

# number of columns in our new data = sum of columns - ID column
ncol(state.dta2) == ncol(state.dta) + ncol(ufo.dta) - 1

## [1] TRUE

datatable(state.dta2)

barplot((state.dta2$UFO.sightings/state.dta2$population)*1000,
        names.arg = state.dta2$state.abb,
        cex.names = .4,
        las = 2,
        main = "UFO Sightings per 1000 Residents", 
        ylab = "", 
        xlab = "")

ANES 2008 Panel Study

The ANES 2008-2009 Panel Study tracked and repeatedly interviewed the same group of respondents starting in January 2008 and in September 2009. For this example we will use data on the final wave of respondents called panel.08.dta.

The ANES 2010 Panel Recontact Study reinterviewed a portion of the ANES 2008-2009 Study panelists. For this example we will use data on the reinterviewed respondents called recontact.10.dta.

Compare Data

# check the original dimensions
dim(panel.08.dta) # 2313  observations of 765 variables

## [1] 2313  765

dim(recontact.10.dta) # 2606 observations of 874 variables

## [1] 2606  874

# Identify which variables appear in both datasets
tolower(colnames(panel.08.dta))[tolower(colnames(panel.08.dta)) %in% tolower(colnames(recontact.10.dta))]# only caseid appears in both

## [1] "caseid"

# Make sure that both unique identifier columns are the same class  
panel.08.dta <- rename(panel.08.dta,caseid = CASEID) # give ID var same name

class(panel.08.dta$caseid) == class(recontact.10.dta$caseid)

## [1] TRUE

# Verify that each unique ID appears only once in each dataset 
length(unique(panel.08.dta$caseid)) == nrow(panel.08.dta)

## [1] TRUE

length(unique(recontact.10.dta$caseid)) == nrow(recontact.10.dta)

## [1] TRUE

# Identify how many observations appear in one dataset and not the other 
in.10.not.08 <- length(panel.08.dta$caseid[!panel.08.dta$caseid %in% recontact.10.dta$caseid])# 154 unique IDs appear in the 2008 survey data and not in 2010

in.08.not.10 <- length(recontact.10.dta$caseid[!recontact.10.dta$caseid %in% panel.08.dta$caseid])# 447 unique IDs appear in the 2010 survey data and not in this wave of the 2008 survey

# Identify which observations the datasets have in common - these will be merged together to form one row each 
both.08.10 <- length(recontact.10.dta$caseid[recontact.10.dta$caseid %in% panel.08.dta$caseid])
both.08.10 # 2159 unique IDs appear in both the 2010 survey data and the 2008 survey wave

## [1] 2159

Full Merge

If we wanted to create one data frame that included all respondents surveyed - in either the 2008 panel wave or the 2010 reinterview survey - we would use a full merge. This would preserve ALL unique rows of data from either datasource.

all.respondents <- merge(panel.08.dta, recontact.10.dta, by = "caseid", all = T)
dim(all.respondents) # 2760  observations of 1638 variables

## [1] 2760 1638

# check columns
ncol(panel.08.dta) + ncol(recontact.10.dta) - 1 # 1638 total unique columns

## [1] 1638

ncol(panel.08.dta) + ncol(recontact.10.dta) - 1 == ncol(all.respondents)

## [1] TRUE

# check rows
in.10.not.08 + in.08.not.10 + both.08.10

## [1] 2760

nrow(panel.08.dta) + nrow(recontact.10.dta) - both.08.10 == nrow(all.respondents)

## [1] TRUE

Inner Merge

If we wanted to compare responses from the same people across surveys, we would create one data frame that included only respondents surveyed in both the 2008 panel wave and the 2010 reinterview survey. To do this, we would use an inner merge. This would preserve ONLY unique rows that exist in both data frames.

both.respondents <- merge(panel.08.dta, recontact.10.dta, by = "caseid", all = F)
dim(both.respondents) # 2159  observations of 1638 variables

## [1] 2159 1638

both.08.10 # 2159 IDs appear in both datasets

## [1] 2159

nrow(both.respondents) == both.08.10 #TRUE

## [1] TRUE