Understanding the Value of Art

As a former fine art student, the fine art market has always fascinated me as an enigmatic industry. For this project, I’ve sought out to propose a concept for collecting, and analyzing fine art data in order to better understand what drives investment in art, what makes it so valuable, and, if anything, what motivates people to invest in it in the first place. Is it actually a ‘smart’ place to park your money? My aim is to understand the driving factors behind art valuations and to discern the motivations for investing in art. And hopefully unearth some interesting insights along the way.

Data Collection

The project consists of two main components:

  1. Web Scraping: Implemented through RSelenium, the script automates the collection of art auction data, focusing on a broad range of artworks to maximize the diversity and range of the data collected.

  2. Data Analysis: Utilizing a smaller, hand-collected sample from artprice, this phase aims at exploring trends and correlations within the art market.

library(stringr)
library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(wdman)
library(purrr)
library(ggplot2)
library(lubridate)
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(xml2)
library(httr)
library(RSelenium)
library(netstat)
library(xaringan)

Web Scraping with RSelenium

In order to collect a robust dataset on the art market, and with little time to do so manually, I created a script to crawl and scrape data from the artprice website.

For the sake of this demonstration, and code, the script goes through the flow of gathering the first 5 and last 5 artworks auctioned by the artist (the intention of this was to gather a big a range as possible of auction data). However, if time is not a constraint, I would recommend adjusting the code to move through individual pages and scrape all artworks, or at least until a certain threshold is met, in terms of artwork numbers, and auction dates represented.

#example, though this can be changed to include any list of artist names
artists <- c("Ronald Brooks Kitaj")

#web scraping using rselenium

selenium_object <- selenium(retcommand = T, check = F)
binman::list_versions("chromedriver")

rD <- rsDriver(browser = "firefox", port = free_port(), verbose = FALSE, geckover = NULL)
remDr <- rD[["client"]]

login_url <- "https://www.artprice.com/identity"
remDr$navigate(login_url)

# input field selectors
username_field <- remDr$findElement(using = 'css selector', "input[class='e2e-login-input']")
password_field <- remDr$findElement(using = 'css selector', "input[class='e2e-pwd-input']")
login_button <- remDr$findElement(using = 'css selector', "button[class='e2e-login-submit-btn']")

#login credentials
username_field$sendKeysToElement(list("abwrubel@gmail.com"))
password_field$sendKeysToElement(list("Tigger123$%"))

#login to artprice
login_button$clickElement()

Sys.sleep(5) 

safe_extract <- function(artwork, css_selector) {
  tryCatch({
    element <- artwork$findElement(using = 'css selector', css_selector)
    text <- element$getElementText()
    return(text[[1]])
  }, error = function(e) {
    return(NA)
  })
}

waitForElement <- function(remDr, css_selector, timeout = 10) {
  for (i in seq_len(timeout)) {
    if (length(remDr$findElements(using = 'css selector', css_selector)) > 0) {
      return(TRUE)
    } else {
      Sys.sleep(1)
    }
  }
  stop("Timeout waiting for element")
}


artist_data_text <- purrr::imap(all_artists_data, function(artist_data, artist_name) {
  paste("Artist: ", artist_name, "\n", 
        sapply(artist_data, function(texts) {
          paste("Artwork Details: ", paste(texts, collapse = " | "), "\n\n")
        }),
        collapse = "\n")
})

# The folder path - all artist artwork metadata will be saved to individual txt files to this folder on your local machine
folder_path <- file.path(Sys.getenv("HOME"), "Desktop", "ArtistData")

# Create the folder if it doesn't exist
if (!dir.exists(folder_path)) {
  dir.create(folder_path, recursive = TRUE)
}


save_artist_data <- function(artist_data, artist_name, folder_path) {
  # Create a file name
  file_name <- file.path(folder_path, paste0(gsub(" ", "_", artist_name), "_data.txt"))
  
  # Convert artist's data to a readable string format
  artist_data_text <- sapply(artist_data, function(texts) {
    paste("Artwork Details: ", paste(texts, collapse = " | "), "\n\n")
  }, simplify = FALSE)
  
  # Combine all artwork details into one large string
  final_text <- paste(artist_data_text, collapse = "\n")
  
  # Write to a text file named after the artist
  writeLines(final_text, file_name)
}

max_artworks <- 5 
all_artists_data <- list()
for (artist_name in artists) {

  remDr$navigate("https://www.artprice.com/search")
  Sys.sleep(5)
  
  #enter name of artist
  remDr$findElement(using = 'css selector', "input[id='universal-search-input']")
  search_input <- remDr$findElement(using = 'css selector', "input[id='universal-search-input']")
  search_input$clearElement()
  Sys.sleep(5)  # wait for any JS reactions to the clear action
  search_input$sendKeysToElement(list(artist_name, "\uE007")) 
  Sys.sleep(5)
  
  artist_links <- remDr$findElements(using = 'css selector', '.items .item .r1 a')
  print(length(artist_links)) 
  if (length(artist_links) > 0) {
    for (link in artist_links) {
      link_text <- link$getElementText()[[1]]
      cat("Found Link Text:", link_text, "\n")  # Output found text for debugging
      
      # Convert both texts to lower case to ensure case insensitive comparison
      if (tolower(link_text) == tolower(artist_name)) {
        link$clickElement()  # Click on the correct link
        Sys.sleep(10) # Allow artist page to load
        break
      }
    }
  } else {
    cat("No links found. Check the selector or the page for changes.")
    next
  }
  
  print(artist_name)
  
  #look at past auctions
  if (waitForElement(remDr, "a[id='sln_ps']")) {
    past_auctions <- remDr$findElement(using = 'css selector', "a[id='sln_ps']")
    past_auctions$clickElement()
  } else {
    message("Element not found")
  }
  
  open_filter <- remDr$findElement(using ='css selector', '.common-dropdown.right.common-drop-down-to-validate')
  open_filter$clickElement()
  Sys.sleep(3)
  auction_asc <- remDr$findElement(using ='css selector', "label[for='sort-radio-datesale_asc']")
  auction_asc$clickElement()
  Sys.sleep(3)
  apply_button<- remDr$findElement(using = 'css selector', ".btn.btn-primary.pull-right")
  apply_button$clickElement()
  
  artwork_divs <- remDr$findElements(using = 'css selector', '.lots-list.lots-square .lot-container .lot.lot-square')
  print(length(artwork_divs))
  
  all_texts <- list()
  
  for (i in seq_len(min(length(artwork_divs), max_artworks))) {
    artwork <- artwork_divs[[i]]
    # Fetch all lot data blocks
    lot_data_elements <- artwork$findElements(using = 'css selector', '.lot-datas-block')
    texts <- sapply(lot_data_elements, function(element) element$getElementText()[[1]], simplify = TRUE, USE.NAMES = FALSE)
    
    # Remove duplicates
    unique_texts <- unique(texts)
    
    # Debugging output
    print(paste("Iteration:", i))
    print("Texts before removing duplicates:")
    print(texts)
    print("Texts after removing duplicates:")
    print(unique_texts)
    
    # Store texts
    all_texts <- c(all_texts, paste(unique_texts, collapse = " | "))  # Using unique_texts instead of texts
    Sys.sleep(5)
    
  }
  
  Sys.sleep(5)
  
  open_filter <- remDr$findElement(using ='css selector', '.common-dropdown.right.common-drop-down-to-validate')
  open_filter$clickElement()
  Sys.sleep(5)
  auction_desc <- remDr$findElement(using ='css selector', "label[for='sort-radio-datesale_desc']")
  auction_desc$clickElement()
  Sys.sleep(5)
  apply_button<- remDr$findElement(using = 'css selector', ".btn.btn-primary.pull-right")
  apply_button$clickElement()
  
  Sys.sleep(3)
  
  artwork_divs <- remDr$findElements(using = 'css selector', '.lots-list.lots-square .lot-container .lot.lot-square')
  print(length(artwork_divs))
  
  for (i in seq_len(min(length(artwork_divs), max_artworks))) {
    artwork <- artwork_divs[[i]]
    # Fetch all lot data blocks
    lot_data_elements <- artwork$findElements(using = 'css selector', '.lot-datas-block')
    texts <- sapply(lot_data_elements, function(element) element$getElementText()[[1]], simplify = TRUE, USE.NAMES = FALSE)
    
    # Remove duplicates
    unique_texts <- unique(texts)
    
    # Debugging output
    print(paste("Iteration:", i))
    print("Texts before removing duplicates:")
    print(texts)
    print("Texts after removing duplicates:")
    print(unique_texts)
    
    # Store texts
    all_texts <- c(all_texts, paste(unique_texts, collapse = " | "))   # Using unique_texts instead of texts
    
  }
  all_artists_data[[artist_name]] <- all_texts
  save_artist_data(all_texts, artist_name, folder_path)
  Sys.sleep(10)
}

#stop the session
remDr$close()
rD[["server"]]$stop()


#once you have the text files in the folder, you will need to iterate through text files and use regex to process data appropriately and add to the dataframe
#adding to dataframe

Data Methodology

Gathering a robust data for any kind of meaningful analysis is of critical important. With more time, it would be essential to gather larger and more various data on a diverse set of artists. One of the more difficult parts of working with art, and trying to understand the commercial value of art, is to overcome a certain level of bias. Collecting and analyzing data on only well known famous artists will provide only so much insight.

This is why, for future evaluation, it will be essential to combine a curated and more random approach to collecting data. Using some sort of random selection to prevent bias might be helpful in preventing a overly biased data collection process, which would in turn skew the auction data one way or the other.

With all of this said, given the tight time constraints, I opted for a hand selected group of auction results from around 10 artists, spanning contemporary, modern, and 19th century, as well as representing a variety of mediums (painting, sculpture, and photography).

How to Measure Value

In the complex landscape of art valuation, the sale/estimate ratio serves as a pivotal metric. By focusing on this ratio, we can gain insights into how artworks are valued against pre-sale expectations set by experts.

Why Focus on Sale/Estimate Ratio?

  • Demand vs. Expectation: The sale/estimate ratio effectively captures the discrepancy between market expectations and actual outcomes. A high ratio indicates that the artwork sold for much more than its estimated value, suggesting a high demand or a possibly undervalued estimate.

  • Economic Indicators: This ratio can also reflect broader economic conditions (or at least this is my assumption). For instance, high ratios across numerous auctions might suggest a flourishing economic environment where collectors are willing to spend more, reflecting greater disposable income or confidence in art.

  • Artist Popularity and Market Trends: Changes in the sale/estimate ratio over time can indicate shifts in an artist’s market popularity or general trends within different art genres or periods. Analyzing these fluctuations helps identify what types of art or which artists are gaining or losing market traction.

Analytical Value

By analyzing the sale/estimate ratio, we can:

  • Assess the accuracy of market valuations by auction houses.

  • Understand the economic underpinnings that influence art buying behaviors.

  • Highlight periods of significant market activity that could correlate with economic cycles.

This metric thus provides a straightforward yet profound way to analyze the art market, offering insights not just into art valuation but also into economic conditions.

Analyzing What We Have

Now, we’ll move forward with cleaning and analyzing a more curated selection of artist data, along with some economic indicators, which could provide some useful insight into our analysis.

artist_data_raw <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_FinalProject/main/ArtistData.csv")
## Rows: 99 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): artist_name, artwork_title, medium, dimensions, price_estimate, sal...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(artist_data_raw)
## # A tibble: 6 × 8
##   artist_name   artwork_title        medium dimensions price_estimate sale_price
##   <chr>         <chr>                <chr>  <chr>      <chr>          <chr>     
## 1 albert oehlen loves body (1985)    Oil    78 1/2 x … $8,000 - $12,… $14,000   
## 2 albert oehlen Acker ohne Wiederke… Mixed… 55 1/8 x … $11,972        $8,680    
## 3 albert oehlen Loves body (1985)    Oil, … 78 1/2 x … $ 10,000 - $ … $22,000   
## 4 albert oehlen Blick Durch den Sti… Oil/c… 74 3/4 x … $ 7,000 - $ 1… $7,000    
## 5 albert oehlen Untitled (1984)      Mixed… 74 7/8 x … $ 12,000 - $ … $20,000   
## 6 albert oehlen Untitled (1985)      Mixed… 47 1/4 x … $ 190,201 - $… $226,339  
## # ℹ 2 more variables: sale_date <chr>, auction_house <chr>
global_economic_raw <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_FinalProject/main/global_economic.csv")
## Rows: 85 Columns: 68
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (68): Country Name, Country Code, Series Name, Series Code, 1960 [YR1960...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(global_economic_raw)
## # A tibble: 6 × 68
##   `Country Name` `Country Code` `Series Name`      `Series Code` `1960 [YR1960]`
##   <chr>          <chr>          <chr>              <chr>         <chr>          
## 1 Argentina      ARG            Consumer price in… FP.CPI.TOTL   ..             
## 2 Argentina      ARG            Inflation, consum… FP.CPI.TOTL.… ..             
## 3 Argentina      ARG            Real interest rat… FR.INR.RINR   ..             
## 4 Argentina      ARG            GDP per capita (c… NY.GDP.PCAP.… ..             
## 5 Australia      AUS            Consumer price in… FP.CPI.TOTL   7.960457856399…
## 6 Australia      AUS            Inflation, consum… FP.CPI.TOTL.… 3.728813559322…
## # ℹ 63 more variables: `1961 [YR1961]` <chr>, `1962 [YR1962]` <chr>,
## #   `1963 [YR1963]` <chr>, `1964 [YR1964]` <chr>, `1965 [YR1965]` <chr>,
## #   `1966 [YR1966]` <chr>, `1967 [YR1967]` <chr>, `1968 [YR1968]` <chr>,
## #   `1969 [YR1969]` <chr>, `1970 [YR1970]` <chr>, `1971 [YR1971]` <chr>,
## #   `1972 [YR1972]` <chr>, `1973 [YR1973]` <chr>, `1974 [YR1974]` <chr>,
## #   `1975 [YR1975]` <chr>, `1976 [YR1976]` <chr>, `1977 [YR1977]` <chr>,
## #   `1978 [YR1978]` <chr>, `1979 [YR1979]` <chr>, `1980 [YR1980]` <chr>, …

Data Manipulation and Cleaning

The data cleaning process for the art auction data involves several steps to prepare the data for analysis. These steps include extracting relevant information, normalizing data formats, and creating new variables for deeper insights.

Extracting Creation Dates

We start by extracting the year of creation from the artwork_title column. This is presumed to be enclosed in parentheses and represented by four digits.

Cleaning Price Estimates

The price_estimate column contains estimates often given as a range (e.g., “$8,000 - $12,000”). The cleaning involves several steps:

  • Splitting the range to calculate the average of the low and high estimates if a range is detected; otherwise, the single value is converted directly to numeric.

Normalizing Sale Price

The sale_price is also formatted as a string with a dollar sign and commas. These are removed, and the result is converted to a numeric format.

Calculating Sale/Estimate Ratio

A new column sale_estimate_ratio is created to measure how the actual sale price compares to the estimate. This ratio is informative of the market’s valuation of the artwork relative to its estimated value.

#normalize data and values across columns

#extract creation dates, first four digits
artist_data_cleaned <- artist_data_raw %>%
  mutate(
    creation_date = str_extract(artwork_title, "\\d{4}") %>%
      str_replace_all("[()]", "")
  )

#create col for average estimate (middle of low and high) 
artist_data_cleaned <- artist_data_cleaned %>%
  mutate(
    cleaned_estimate = str_replace_all(price_estimate, "\\$", ""), 
    cleaned_estimate = str_trim(cleaned_estimate),
    avg_estimate = if_else(
      str_detect(cleaned_estimate, "-"), # Detect if there's a range
      {
        num1 <- str_extract(cleaned_estimate, "^[\\d,]+") %>%
          str_replace_all(",", "") %>% 
          as.numeric()
        num2 <- str_extract(cleaned_estimate, "[\\d,]+$") %>%
          str_replace_all(",", "") %>% 
          as.numeric()
        # Output for debugging:
        (num1 + num2) / 2
      },
      str_replace_all(cleaned_estimate, ",", "") %>% 
        as.numeric()
    )
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `avg_estimate = if_else(...)`.
## Caused by warning in `str_replace_all(cleaned_estimate, ",", "") %>% as.numeric()`:
## ! NAs introduced by coercion
#normalize sale_price column
artist_data_cleaned <- artist_data_cleaned %>%
  mutate(
    sale_price = str_replace_all(sale_price, "\\$", "") %>%
      str_replace_all(",", "") %>%
      as.numeric()
  )

#create estimate sale price ratio
artist_data_cleaned <- artist_data_cleaned %>%
  mutate(
    sale_estimate_ratio = if_else(avg_estimate == 0 | is.na(avg_estimate), NA_real_, sale_price / avg_estimate)
  )

#convert sale_date column to date format for ease of use
artist_data_new <- artist_data_cleaned %>%
  mutate(
    sale_date = as.Date(sale_date, format = "%d %b %Y") 
  )

#normalize remaining columns
artist_data_new <- artist_data_new %>%
  mutate(
    artist_name = tolower(artist_name), 
    artwork_title = tolower(artwork_title),
    auction_house = tolower(auction_house),
  )

head(artist_data_new)
## # A tibble: 6 × 12
##   artist_name   artwork_title        medium dimensions price_estimate sale_price
##   <chr>         <chr>                <chr>  <chr>      <chr>               <dbl>
## 1 albert oehlen loves body (1985)    Oil    78 1/2 x … $8,000 - $12,…      14000
## 2 albert oehlen acker ohne wiederke… Mixed… 55 1/8 x … $11,972              8680
## 3 albert oehlen loves body (1985)    Oil, … 78 1/2 x … $ 10,000 - $ …      22000
## 4 albert oehlen blick durch den sti… Oil/c… 74 3/4 x … $ 7,000 - $ 1…       7000
## 5 albert oehlen untitled (1984)      Mixed… 74 7/8 x … $ 12,000 - $ …      20000
## 6 albert oehlen untitled (1985)      Mixed… 47 1/4 x … $ 190,201 - $…     226339
## # ℹ 6 more variables: sale_date <date>, auction_house <chr>,
## #   creation_date <chr>, cleaned_estimate <chr>, avg_estimate <dbl>,
## #   sale_estimate_ratio <dbl>

Next, we’ll move on to clean and reformat the global economic indicators dataframe to prepare it for analysis. As you’ll see, the df is in a wide format and will need to be converted to long, removing unneeded columns.

Reshaping Global Economic Data

The global economic data is transformed from a wide format, where years are spread across columns, to a long format where each row represents a single year’s data for a specific economic indicator.

Cleaning Up Global Economic Data

Unnecessary columns like Country Code and Series Code are removed to streamline the dataset.

#convert global economic data to long format from wide, using regex to match year ignoring formatting and making integer
global_economic_clean <- global_economic_raw %>%
  pivot_longer(
    cols = matches("YR\\d{4}"), 
    names_to = "Year",
    names_prefix = ".*YR", 
    values_to = "Value"
  ) %>%
  mutate(
    Year = str_extract(Year, "\\d+") %>%  
      as.integer() 
  )

#get rid of country code and series code columns
global_economic_clean <- global_economic_clean %>%
  select(-all_of(c("Country Code", "Series Code")))

head(global_economic_clean)
## # A tibble: 6 × 4
##   `Country Name` `Series Name`                      Year Value
##   <chr>          <chr>                             <int> <chr>
## 1 Argentina      Consumer price index (2010 = 100)  1960 ..   
## 2 Argentina      Consumer price index (2010 = 100)  1961 ..   
## 3 Argentina      Consumer price index (2010 = 100)  1962 ..   
## 4 Argentina      Consumer price index (2010 = 100)  1963 ..   
## 5 Argentina      Consumer price index (2010 = 100)  1964 ..   
## 6 Argentina      Consumer price index (2010 = 100)  1965 ..
#look at the economic variables here
unique_series_names <- unique(global_economic_clean$`Series Name`)

Exploratory Data Analysis

#auction house sale/estimate mean
# Calculate mean sale/estimate ratio for each auction house
mean_ratios <- artist_data_new %>%
  group_by(auction_house) %>%
  filter(n() >= 5) %>%
  summarize(mean_ratio = base::mean(sale_estimate_ratio, na.rm = TRUE), .groups = 'drop')

# Create the bar chart
ggplot(mean_ratios, aes(x = auction_house, y = mean_ratio, fill = auction_house)) +
  geom_bar(stat = "identity", width = 0.7) +
  labs(title = "Mean Sale/Estimate Ratio by Auction House",
       x = "Auction House",
       y = "Mean Sale/Estimate Ratio") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#visualize avg sales/estimate ratio per artist
artist_avg_ratio <- artist_data_new %>%
  group_by(artist_name) %>%
  filter(n() >= 5) %>%  # Ensure enough data points per artist
  summarize(avg_ratio = base::mean(sale_estimate_ratio, na.rm = TRUE))

# bar chart visualization
ggplot(artist_avg_ratio, aes(x = reorder(artist_name, avg_ratio), y = avg_ratio)) +
  geom_col() +
  coord_flip() +  # flip coordinates for legibility
  labs(title = "Average Sale/Estimate Ratio by Artist",
       x = "Artist",
       y = "Average Ratio") +
  theme_minimal()

artist_data_cleaned <- artist_data_cleaned %>%
  mutate(
    sale_date = dmy(sale_date),  # converts '19 Nov 1992' to Date
    year = year(sale_date)  # extract year from Date object
  )

#scatterplot with legend for artists looking at the SE ratio
artist_yearly_avg_ratio <- artist_data_cleaned %>%
  mutate(year = year(sale_date)) %>%
  group_by(artist_name, year) %>%
  summarize(avg_ratio = base::mean(sale_estimate_ratio, na.rm = TRUE), .groups = 'drop') %>%
  filter(n() >= 5)  # Optionally, filter artists with at least 5 artworks sold per year for stability in averages
#look at SE ratio based on medium
artist_data_cleaned_med <- artist_data_cleaned %>%
  mutate(
    medium_clean = tolower(medium),  # Convert to lower case for uniformity
    medium_clean = str_replace_all(medium_clean, " on ", "/"),  # Standardize 'on' to '/'
    medium_clean = str_replace_all(medium_clean, ", ", "/"),  # Replace commas with slashes for consistency
    medium_clean = str_replace_all(medium_clean, " and ", "/"),  # Replace 'and' with '/'
    medium_clean = str_replace_all(medium_clean, " with ", "/")  # Replace 'with' with '/'
  )

artist_data_cleaned_med <- artist_data_cleaned_med %>%
  mutate(
    medium_clean = str_extract(medium_clean, "^(oil|acrylic|watercolor|gouache|ink|pastel|charcoal|sculpture|bronze|mixed media|photograph|print|drawing|construction|installation|c print|gelatin silver print|chromogenic print)"),
    medium_clean = if_else(str_detect(medium_clean, "canvas|panel|paper|wood|metal|plastic"), "mixed media", medium_clean)
  )

artist_data_cleaned_med <- artist_data_cleaned_med %>%
  mutate(
    medium_clean = if_else(is.na(medium_clean), "other", medium_clean)
  )

# Create the box plot
ggplot(artist_data_cleaned_med, aes(x = medium_clean, y = sale_estimate_ratio)) +
  geom_boxplot() +
  labs(title = "Box Plot of Sale/Estimate Ratio by Medium",
       x = "Medium",
       y = "Sale/Estimate Ratio") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Statistical Analysis and Regressions

The analysis portion digs into the curated dataset to uncover potential correlations between art market behaviors and economic indicators. By cleaning, transforming, and visualizing this data, we aim to understand how external economic factors may influence art valuations. This portion of the project doesn’t yield much information or useful insight. However, I wanted to include it in order to show how, given a larger dataset, one might try to look for interesting trends and correlations.

#creating regression to look at correlation between sales/estimate ratio and economic factors

#calculate yearly mean or median sale/estimate ratios
yearly_ratios <- artist_data_cleaned %>%
  group_by(year) %>%
  summarize(
    Avg_Sale_Estimate_Ratio = base::mean(sale_estimate_ratio, na.rm = TRUE),
    .groups = 'drop'
  )

#North America

# Filter for Real Interest Rate for United States
global_economic_interest <- global_economic_clean %>%
  filter(`Series Name` == "Real interest rate (%)", `Country Name` == "United States") %>%
  select(Year, Value) %>%
  rename(Real_Interest_Rate = Value)

artist_data_cleaned$year <- as.integer(artist_data_cleaned$year)
global_economic_interest$Year <- as.integer(global_economic_interest$Year)

merged_data_interest <- merge(artist_data_cleaned, global_economic_interest, by.x = "year", by.y = "Year", all.x = TRUE)

#filter out the NA values
filtered_data_interest <- merged_data_interest %>%
  filter(!is.na(Real_Interest_Rate))

# linear regression for interest rates and SE ratio
model <- lm(sale_estimate_ratio ~ Real_Interest_Rate, data = filtered_data_interest)
summary(model)
## 
## Call:
## lm(formula = sale_estimate_ratio ~ Real_Interest_Rate, data = filtered_data_interest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1073 -0.4752  0.0000  0.1610  7.2954 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                         1.54308    0.74041   2.084    0.042 *
## Real_Interest_Rate..                1.24154    0.88987   1.395    0.169  
## Real_Interest_Rate1.16289865228358  0.08649    1.28243   0.067    0.946  
## Real_Interest_Rate1.36246374976571  0.18991    1.65561   0.115    0.909  
## Real_Interest_Rate1.48256139059228  1.45692    1.65561   0.880    0.383  
## Real_Interest_Rate1.52221338861923  0.45692    1.65561   0.276    0.784  
## Real_Interest_Rate1.60768046080895 -0.75795    1.28243  -0.591    0.557  
## Real_Interest_Rate2.01028262116756 -0.10652    1.04710  -0.102    0.919  
## Real_Interest_Rate2.10659449044537 -0.44758    1.65561  -0.270    0.788  
## Real_Interest_Rate2.19456642346184 -0.27559    1.65561  -0.166    0.868  
## Real_Interest_Rate2.26529638013973 -1.04308    1.65561  -0.630    0.531  
## Real_Interest_Rate2.31051464059837  0.65741    0.92816   0.708    0.482  
## Real_Interest_Rate2.53723229857104 -0.54308    1.65561  -0.328    0.744  
## Real_Interest_Rate2.55474987510739 -0.13308    1.04710  -0.127    0.899  
## Real_Interest_Rate2.61707765460079 -0.47880    1.65561  -0.289    0.774  
## Real_Interest_Rate2.96087521313044 -0.53828    1.28243  -0.420    0.676  
## Real_Interest_Rate3.0732472423635   0.47464    1.65561   0.287    0.775  
## Real_Interest_Rate3.10104221978606 -0.25678    0.99337  -0.258    0.797  
## Real_Interest_Rate3.54449745673519  0.61191    1.65561   0.370    0.713  
## Real_Interest_Rate3.54563216381481 -0.25132    1.28243  -0.196    0.845  
## Real_Interest_Rate3.88429489766143 -0.33590    1.04710  -0.321    0.750  
## Real_Interest_Rate4.56713894364996  0.06253    1.13100   0.055    0.956  
## Real_Interest_Rate4.72786366990758 -0.70975    1.65561  -0.429    0.670  
## Real_Interest_Rate4.91529596120366 -0.84308    1.65561  -0.509    0.613  
## Real_Interest_Rate5.19986376311978  0.28798    0.99337   0.290    0.773  
## Real_Interest_Rate5.58752028345538  0.65692    1.65561   0.397    0.693  
## Real_Interest_Rate6.03974985349163  1.05099    1.04710   1.004    0.320  
## Real_Interest_Rate6.19386958708484 -0.78133    1.28243  -0.609    0.545  
## Real_Interest_Rate6.32428598750851  0.05692    1.28243   0.044    0.965  
## Real_Interest_Rate6.48726735143027  0.06549    1.13100   0.058    0.954  
## Real_Interest_Rate6.59421797111052 -0.46808    1.28243  -0.365    0.717  
## Real_Interest_Rate6.60334742359249 -0.48508    1.65561  -0.293    0.771  
## Real_Interest_Rate6.8137924585142   0.05332    1.04710   0.051    0.960  
## Real_Interest_Rate8.14090660192676 -0.49679    1.13100  -0.439    0.662  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.481 on 53 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  -0.2691 
## F-statistic: 0.4474 on 33 and 53 DF,  p-value: 0.9923
ggplot() +
  geom_line(data = filtered_data_interest, aes(x = year, y = Real_Interest_Rate, group = 1), color = "blue") +
  geom_line(data = filtered_data_interest, aes(x = year, y = sale_estimate_ratio, group = 1), color = "red", sec.axis = sec_axis(~./max(merged_data$sale_estimate_ratio) * max(merged_data$GDP_per_capita), name = "Sale/Estimate Ratio")) +
  labs(title = "Economic Indicators vs. Art Market Behavior Over Time",
       x = "Year",
       y = "US Interest Rate (Real)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_line(data = filtered_data_interest, aes(x = year, y =
## sale_estimate_ratio, : Ignoring unknown parameters: `sec.axis`

Conclusions

From the analyzed data, it appears that while certain patterns can be discerned, specifically in the exploratory analysis, the art market remains complex and influenced by numerous factors beyond just economics.

Challenges and Solutions

Selecting Relevant Data Metrics for Art Investment Analysis

One of the central challenges of this project was identifying relevant data metrics that could effectively track the value of art as an investment vehicle over time, especially in the context of fluctuating economic conditions like inflation.

Solution:

The breakthrough came with the realization that the sale/estimate ratio could serve as a robust indicator of market demand versus expert valuation expectations. This ratio effectively highlights whether artworks are selling above, below, or at their anticipated market value. High ratios indicate a strong market demand or underestimations by experts, providing a clear, quantifiable measure of investment performance that also accounts for market sentiment.

Technical Challenges with RSelenium

Another significant challenge involved setting up RSelenium to automate data collection from Artprice. The complexity of web scraping stems from the dynamic nature of web pages, coupled with the need to navigate through login screens, search interfaces, and artwork listings.

Solution:

To overcome these challenges, the project involved:

  • Environment Setup: Configuring RSelenium required the correct setup of drivers and ensuring compatibility with the browser version used for scraping.

  • Script Robustness: The RSelenium script was refined to handle various web elements dynamically. Functions like waitForElement and safe_extract were implemented to manage timeouts and element availability.

  • Error Handling: Robust error handling and debugging was incorporated to manage failed login attempts, missing data fields, and navigation errors.

Future Directions

I would like to expand the project by collecting a much more substantial dataset, with a more focused approach in the kind of art collected, to see if any trends do in fact arise. This would require automating the data collection across more artists and a longer timeline. Further research could also integrate more granular economic data to refine the analysis of market influences on art pricing.