Answering Sydney’s question from March 15 in class

Hi everyone,

During our last class on March 15, Sydney asked if there’s a way to find common columns in two data frames and check if they have both the same name and the same data structure. I didn’t think there was an easy way, but after looking into it, I’ve found a solution using the intersect() (new!) function along with my custom function that I just wrote below that uses intersect(). I’ll also introduce you to the sapply() function (since my custom function uses it), which we haven’t discussed in class.

About `intersect()`

intersect() is a base R function that returns the elements common to two or more vectors. Here’s a quick example:

# Define two vectors
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(4, 5, 6, 7, 8)

# Find the intersection of the two vectors
common_elements <- intersect(vector1, vector2)

# Print the result
print(common_elements)

## [1] 4 5

I use intersect() below in a custom function I made to check basically on column names of data frames.

find_common_columns() - a custom function.

Now, let’s create a custom function called find_common_columns() that finds common columns and checks if they have the same data structure:

find_common_columns <- function(df1, df2) 
{
    # Find common column names
    common_columns <- intersect(names(df1), names(df2))
  
    # Check if the common columns have the same data structure
    same_data_structure <- sapply(common_columns, 
                                  function(column_name) {
                                      identical(class(df1[[column_name]]), class(df2[[column_name]]))
                                  })

    # Print columns with different data structures
    for (column_name in names(same_data_structure)) {
        if (!same_data_structure[column_name]) {
            paste0("Different data structures are in: ", column_name, "\n", sep = "")
        }
    }

    # Return the result
    return(same_data_structure)
}

Usage:

You can use this function to compare common columns in two data frames. Just load the data frames you want to compare and call the function like this:

output <- find_common_columns(df1, df2)

print(output)

So let’s do an example and see it in action.

Let’s load four of the dataframes we spoke about last night and store them in two different data frames, so that we can then call my function above on police_2018 with police_2019 and on weather_1000 with next_weather_1000:

library(tidyverse)
url_2018 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2018.csv" 
url_2019 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2019.csv" 

police_2018 <- read_csv(url_2018) 
police_2019 <- read_csv(url_2019)

weather_1000<-
  read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_2016-01-01_2018-09-26.csv')
next_weather_1000<-
  read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_1000_addition_days.csv')

Trying it on the two weather dataframes first. We know from class last night there were no differences…

The code below also shows there is not differences.

output<-find_common_columns(weather_1000,next_weather_1000) 

print('Below we see all columns. Where it is TRUE means that they are the same:')

## [1] "Below we see all columns. Where it is TRUE means that they are the same:"

print(output)

##             name         datetime          tempmax          tempmin 
##             TRUE             TRUE             TRUE             TRUE 
##             temp     feelslikemax     feelslikemin        feelslike 
##             TRUE             TRUE             TRUE             TRUE 
##              dew         humidity           precip       precipprob 
##             TRUE             TRUE             TRUE             TRUE 
##      precipcover       preciptype             snow        snowdepth 
##             TRUE             TRUE             TRUE             TRUE 
##         windgust        windspeed          winddir sealevelpressure 
##             TRUE             TRUE             TRUE             TRUE 
##       cloudcover       visibility   solarradiation      solarenergy 
##             TRUE             TRUE             TRUE             TRUE 
##          uvindex       severerisk          sunrise           sunset 
##             TRUE             TRUE             TRUE             TRUE 
##        moonphase       conditions      description             icon 
##             TRUE             TRUE             TRUE             TRUE 
##         stations 
##             TRUE

Above we see that all columns are the same for weather. We saw that in class too.

Try on two years of police data. We already know from last class that there were differences…

Below let’s try it for the two police report files below:

output<-find_common_columns(police_2018,police_2019) 

print('Below we see all columns. Where it is TRUE means that they are the same:')

## [1] "Below we see all columns. Where it is TRUE means that they are the same:"

print(output)

##     INCIDENT_NUMBER        OFFENSE_CODE  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION 
##               FALSE                TRUE               FALSE                TRUE 
##            DISTRICT      REPORTING_AREA            SHOOTING    OCCURRED_ON_DATE 
##                TRUE                TRUE               FALSE                TRUE 
##                YEAR               MONTH         DAY_OF_WEEK                HOUR 
##                TRUE                TRUE                TRUE                TRUE 
##            UCR_PART              STREET                 Lat                Long 
##               FALSE                TRUE                TRUE                TRUE 
##            Location 
##                TRUE

Now above you see that my function does pick up on two columns that we obtained last night: INCIDENT_NUMBER and SHOOTING but it also gives, UCR_PART.
But to get the data frames to bind on rows we only had to mutate just theINCIDENT_NUMBER and SHOOTING columns. (And the fact that UCR_PART is different did not ruin our bind at all. Hmmm..that’s great but why is that?)

More Details

So recall, that we didn’t have to change UCR_PART though. That column was not even on our radar when we did bind_rows() of these two data frames since it didn’t throw an error. So why is that- why no error last night? Here’s why: bind_rows() from the dplyr package is designed to be flexible When you use bind_rows() to combine two data frames with a common column that has different data types, the function will attempt to automatically coerce the column data types to the most flexible type that can accommodate both original data types.
In our case, UCR_PART is of type “character” in police_2018: class(police_2018$UCR_PART), and in the case of police_2019, UCR_PART is of type “logical”: class(police_2018$UCR_PART). When using bind_rows(), it will coerce the “logical” data type to “character” so that both parts of the combined column can be represented as character data. This conversion happens automatically, but it’s important to be aware that the data type of the column will be changed in the resulting data frame.
But R was not able to coerce the other two columns INCIDENT_NUMBER and SHOOTING to be the same so we did the data column transformations for those two before we could apply bind_rows()
Keep in mind that this automatic coercion might not always lead to the desired result or may cause loss of information. It’s always a good idea to inspect the resulting data frame and ensure the combined data is as expected.
The sapply() function is a member of the “apply” family of functions in R. It allows you to apply a function to each element of a list or vector and return a simplified output. In our custom function, sapply() is used to apply a function to each element of the common_columns vector to check if the columns in both data frames with the same name also have the same data structure using the identical() function. The result is a named logical vector, where the names are the common column names, and the values are TRUE if the columns have the same data structure, and FALSE otherwise.