Hi everyone,
During our last class on March 15, Sydney asked if there’s a way to
find common columns in two data frames and check if they have both the
same name and the same data structure. I didn’t think there was an easy
way, but after looking into it, I’ve found a solution using the
intersect() (new!) function along with my custom function
that I just wrote below that uses intersect(). I’ll also
introduce you to the sapply() function (since my custom
function uses it), which we haven’t discussed in class.
intersect()intersect() is a base R function that returns the
elements common to two or more vectors. Here’s a quick example:
# Define two vectors
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(4, 5, 6, 7, 8)
# Find the intersection of the two vectors
common_elements <- intersect(vector1, vector2)
# Print the result
print(common_elements)
## [1] 4 5
I use intersect() below in a custom function I made to
check basically on column names of data frames.
Now, let’s create a custom function called find_common_columns() that finds common columns and checks if they have the same data structure:
find_common_columns <- function(df1, df2)
{
# Find common column names
common_columns <- intersect(names(df1), names(df2))
# Check if the common columns have the same data structure
same_data_structure <- sapply(common_columns,
function(column_name) {
identical(class(df1[[column_name]]), class(df2[[column_name]]))
})
# Print columns with different data structures
for (column_name in names(same_data_structure)) {
if (!same_data_structure[column_name]) {
paste0("Different data structures are in: ", column_name, "\n", sep = "")
}
}
# Return the result
return(same_data_structure)
}
You can use this function to compare common columns in two data frames. Just load the data frames you want to compare and call the function like this:
output <- find_common_columns(df1, df2)
print(output)
Let’s load four of the dataframes we spoke about last night and store
them in two different data frames, so that we can then call my function
above on police_2018 with police_2019 and on
weather_1000 with next_weather_1000:
library(tidyverse)
url_2018 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2018.csv"
url_2019 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2019.csv"
police_2018 <- read_csv(url_2018)
police_2019 <- read_csv(url_2019)
weather_1000<-
read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_2016-01-01_2018-09-26.csv')
next_weather_1000<-
read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_1000_addition_days.csv')
The code below also shows there is not differences.
output<-find_common_columns(weather_1000,next_weather_1000)
print('Below we see all columns. Where it is TRUE means that they are the same:')
## [1] "Below we see all columns. Where it is TRUE means that they are the same:"
print(output)
## name datetime tempmax tempmin
## TRUE TRUE TRUE TRUE
## temp feelslikemax feelslikemin feelslike
## TRUE TRUE TRUE TRUE
## dew humidity precip precipprob
## TRUE TRUE TRUE TRUE
## precipcover preciptype snow snowdepth
## TRUE TRUE TRUE TRUE
## windgust windspeed winddir sealevelpressure
## TRUE TRUE TRUE TRUE
## cloudcover visibility solarradiation solarenergy
## TRUE TRUE TRUE TRUE
## uvindex severerisk sunrise sunset
## TRUE TRUE TRUE TRUE
## moonphase conditions description icon
## TRUE TRUE TRUE TRUE
## stations
## TRUE
Above we see that all columns are the same for weather. We saw that in class too.
Below let’s try it for the two police report files below:
output<-find_common_columns(police_2018,police_2019)
print('Below we see all columns. Where it is TRUE means that they are the same:')
## [1] "Below we see all columns. Where it is TRUE means that they are the same:"
print(output)
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
## FALSE TRUE FALSE TRUE
## DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE
## TRUE TRUE FALSE TRUE
## YEAR MONTH DAY_OF_WEEK HOUR
## TRUE TRUE TRUE TRUE
## UCR_PART STREET Lat Long
## FALSE TRUE TRUE TRUE
## Location
## TRUE
Now above you see that my function does pick up on two columns
that we obtained last night: INCIDENT_NUMBER and
SHOOTING but it also gives, UCR_PART.
But to get the data frames to bind on rows we only had to mutate
just theINCIDENT_NUMBER and SHOOTING columns.
(And the fact that UCR_PART is different did not ruin our
bind at all. Hmmm..that’s great but why is that?)
So recall, that we didn’t have to change UCR_PART
though. That column was not even on our radar when we did
bind_rows() of these two data frames since it didn’t throw
an error. So why is that- why no error last night? Here’s why:
bind_rows() from the dplyr package is designed
to be flexible When you use bind_rows() to combine two data
frames with a common column that has different data types, the function
will attempt to automatically coerce the column data types to the most
flexible type that can accommodate both original data types.
In our case, UCR_PART is of type “character” in
police_2018: class(police_2018$UCR_PART), and
in the case of police_2019, UCR_PART is of
type “logical”: class(police_2018$UCR_PART). When using
bind_rows(), it will coerce the “logical” data type to
“character” so that both parts of the combined column can be represented
as character data. This conversion happens automatically, but it’s
important to be aware that the data type of the column will be changed
in the resulting data frame.
But R was not able to coerce the other two columns
INCIDENT_NUMBER and SHOOTING to be the same so
we did the data column transformations for those two before we could
apply bind_rows()
Keep in mind that this automatic coercion might not always lead to the desired result or may cause loss of information. It’s always a good idea to inspect the resulting data frame and ensure the combined data is as expected.
The sapply() function is a member of the “apply”
family of functions in R. It allows you to apply a function to each
element of a list or vector and return a simplified output. In our
custom function, sapply() is used to apply a function to each element of
the common_columns vector to check if the columns in both data frames
with the same name also have the same data structure using the
identical() function. The result is a named logical vector,
where the names are the common column names, and the values are TRUE if
the columns have the same data structure, and FALSE otherwise.