Hi everyone,

During our last class on March 15, Sydney asked if there’s a way to find common columns in two data frames and check if they have both the same name and the same data structure. I didn’t think there was an easy way, but after looking into it, I’ve found a solution using the intersect() (new!) function along with my custom function that I just wrote below that uses intersect(). I’ll also introduce you to the sapply() function (since my custom function uses it), which we haven’t discussed in class.

About intersect()

intersect() is a base R function that returns the elements common to two or more vectors. Here’s a quick example:

# Define two vectors
vector1 <- c(1, 2, 3, 4, 5)
vector2 <- c(4, 5, 6, 7, 8)

# Find the intersection of the two vectors
common_elements <- intersect(vector1, vector2)

# Print the result
print(common_elements)
## [1] 4 5

I use intersect() below in a custom function I made to check basically on column names of data frames.

find_common_columns() - a custom function.

Now, let’s create a custom function called find_common_columns() that finds common columns and checks if they have the same data structure:

find_common_columns <- function(df1, df2) 
{
    # Find common column names
    common_columns <- intersect(names(df1), names(df2))
  
    # Check if the common columns have the same data structure
    same_data_structure <- sapply(common_columns, 
                                  function(column_name) {
                                      identical(class(df1[[column_name]]), class(df2[[column_name]]))
                                  })

    # Print columns with different data structures
    for (column_name in names(same_data_structure)) {
        if (!same_data_structure[column_name]) {
            paste0("Different data structures are in: ", column_name, "\n", sep = "")
        }
    }

    # Return the result
    return(same_data_structure)
}

Usage:

You can use this function to compare common columns in two data frames. Just load the data frames you want to compare and call the function like this:

output <- find_common_columns(df1, df2)

print(output)

So let’s do an example and see it in action.

Let’s load four of the dataframes we spoke about last night and store them in two different data frames, so that we can then call my function above on police_2018 with police_2019 and on weather_1000 with next_weather_1000:

library(tidyverse)
url_2018 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2018.csv" 
url_2019 <- "https://raw.githubusercontent.com/amanda-nathan/bpd/main/report_2019.csv" 

police_2018 <- read_csv(url_2018) 
police_2019 <- read_csv(url_2019)

weather_1000<-
  read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_2016-01-01_2018-09-26.csv')
next_weather_1000<-
  read_csv('https://raw.githubusercontent.com/amanda-nathan/bpd/main/weather_1000_addition_days.csv')

Trying it on the two weather dataframes first. We know from class last night there were no differences…

The code below also shows there is not differences.

output<-find_common_columns(weather_1000,next_weather_1000) 

print('Below we see all columns. Where it is TRUE means that they are the same:')
## [1] "Below we see all columns. Where it is TRUE means that they are the same:"
print(output)
##             name         datetime          tempmax          tempmin 
##             TRUE             TRUE             TRUE             TRUE 
##             temp     feelslikemax     feelslikemin        feelslike 
##             TRUE             TRUE             TRUE             TRUE 
##              dew         humidity           precip       precipprob 
##             TRUE             TRUE             TRUE             TRUE 
##      precipcover       preciptype             snow        snowdepth 
##             TRUE             TRUE             TRUE             TRUE 
##         windgust        windspeed          winddir sealevelpressure 
##             TRUE             TRUE             TRUE             TRUE 
##       cloudcover       visibility   solarradiation      solarenergy 
##             TRUE             TRUE             TRUE             TRUE 
##          uvindex       severerisk          sunrise           sunset 
##             TRUE             TRUE             TRUE             TRUE 
##        moonphase       conditions      description             icon 
##             TRUE             TRUE             TRUE             TRUE 
##         stations 
##             TRUE

Above we see that all columns are the same for weather. We saw that in class too.

Try on two years of police data. We already know from last class that there were differences…

Below let’s try it for the two police report files below:

output<-find_common_columns(police_2018,police_2019) 

print('Below we see all columns. Where it is TRUE means that they are the same:')
## [1] "Below we see all columns. Where it is TRUE means that they are the same:"
print(output)
##     INCIDENT_NUMBER        OFFENSE_CODE  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION 
##               FALSE                TRUE               FALSE                TRUE 
##            DISTRICT      REPORTING_AREA            SHOOTING    OCCURRED_ON_DATE 
##                TRUE                TRUE               FALSE                TRUE 
##                YEAR               MONTH         DAY_OF_WEEK                HOUR 
##                TRUE                TRUE                TRUE                TRUE 
##            UCR_PART              STREET                 Lat                Long 
##               FALSE                TRUE                TRUE                TRUE 
##            Location 
##                TRUE

More Details