Final Project: Famous Paintings and Art Institute of Chicago API

BAIS 462 Final Project: Commonalities Between Famous Artists

Original Data Link: https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQBcwAZ48ugPQrY5i3viNV8wAQ82-adBZ2jl7kNTGbtt1p0?download=1

Interest and Introduction:

Outside of my BAIS and Marketing studies, I am an avid oil painter and pencil artist. Since I was young, I have appreciated visiting museums and seeing both local and famous works while abroad or in another city. For my final BAIS project, I decided to merge my artistic interests with my statistical ones. I found an initial data set on Kaggle titled “Greatest Paintings”. The original data was split across multiple CSV files that I could join on matching IDs (a SQL throwback). To build my data frame, I combined a painting file, a museum file, and an artist description file so that each row of the final data frame represents a painting by a specific artist, along with information about the museum where the piece is held.
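As a rough illustration of the join described above, here is a minimal sketch using made-up miniature tables. The table names (work, artist, museum), their columns, and every value are assumptions for illustration, not the actual Kaggle files. It also shows why the final data frame has columns such as name.x and name.y: when two joined tables share a non-key column name, dplyr appends those suffixes by default.

```r
library(dplyr)

# Hypothetical miniature stand-ins for the three source files
work <- tibble(
  work_id   = c(1, 2),
  name      = c("Painting A", "Painting B"),
  artist_id = c(10, 11),
  museum_id = c(30, 31)
)
artist <- tibble(
  artist_id   = c(10, 11),
  full_name   = c("Artist A", "Artist B"),
  nationality = c("French", "Dutch")
)
museum <- tibble(
  museum_id = c(30, 31),
  name      = c("Museum A", "Museum B"),
  city      = c("City A", "City B")
)

# Join paintings to artists, then to museums, on the shared ID columns.
# Both `work` and `museum` have a `name` column, so the result carries
# name.x (painting title) and name.y (museum name).
artist_museum_work <- work %>%
  inner_join(artist, by = "artist_id") %>%
  inner_join(museum, by = "museum_id")

names(artist_museum_work)
```

Each row of the result is one painting with its artist and museum details attached, which matches the row structure described above.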

Data Dictionary and Summary Statistics

Variables

Quantitative (num)

  • artist_id - The unique artist ID

  • birth - The birth year of the artist

  • death - The death year of the artist

  • work_id - The unique painting ID

  • museum_id - The unique museum ID

Qualitative (chr)

  • full_name - Full name of the artist

  • first_name - First name of the artist

  • middle_names - Middle name(s) of the artist

  • nationality - Where the artist is from

  • style.x - The style and technique of the painting

  • name.x - The title of the painting

  • name.y - The name of the painting’s museum

  • address - The address of the museum

  • city - The city where the museum is located

  • state - The state of the museum

  • country - The country of the museum

  • phone - Phone number of museum

  • url - The museum’s website URL

Inquiry: Using my data frame, “artist_museum_work_df”, I first wanted to find the most popular “style” in the list and see where the artists most commonly came from. I also wanted to know how many museums are listed in the data set and which ones hold the highest concentration of “Greatest” paintings. After gaining a basic understanding of the data, I came up with two more questions about the artists:

1: Based on the artist’s nationality, what was the average birth year/death year?

2: Is there a preference of art styles between nationalities?

DATA LOAD:

#| echo: false
#| message: false
#| warning: false


library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(chromote)
library(tidyverse)  
library(httr)       
library(rvest)      

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(polite)     
library(lubridate)  
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
library(stringr)
library(dplyr)
library(knitr)
library(ggplot2)


#### Best paintings/museums/works (Kaggle)

artist_museum_work_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQBtTTJJK_xaTJUbcEmE4p2RAWfX0kRZHPClBkvQmdHOAfg?download=1")
Rows: 4553 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): full_name, first_name, middle_names, last_name, nationality, style...
dbl  (5): artist_id, birth, death, work_id, museum_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Top 5 displays data frame

top5_artists <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQC_bRiJ31W7SKwt-Smw2IKUAYwt4pvUwv_ZVNSo9qEhBvA?download=1")
Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): artist_display
dbl (1): n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
artist_monet <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQA61ItllO5gTLEpaQk3TG5MAdj_w3T-3q7hcam8lSSKZ_4?download=1")
Rows: 192 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): full_name, first_name, last_name, nationality, style.x, name.x, st...
dbl  (5): artist_id, birth, death, work_id, museum_id
lgl  (1): middle_names

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#### Art Institute of Chicago (Supplementary API)

inst_api_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQAmNOeXiT0tTqIRmUlrjfBcAQ0SGx69Y_ycih4QkD26W0o?download=1")
Rows: 480 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Monet paintings from API

art_inst_monet <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQDzySj5i8k2Qop3kxQ91LOuAaiPRwAFoXY3oX_ZqubrBBM?download=1")
Rows: 6 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Visualizations: (Kaggle DF)

#| echo: true

### Using skimr to summarise the df

library(skimr)

skim(artist_museum_work_df)
Data summary
Name artist_museum_work_df
Number of rows 4553
Number of columns 21
_______________________
Column type frequency:
character 16
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
full_name 0 1.00 6 29 0 343 0
first_name 0 1.00 3 12 0 183 0
middle_names 3049 0.33 1 16 0 106 0
last_name 0 1.00 3 16 0 338 0
nationality 0 1.00 5 9 0 17 0
style.x 0 1.00 4 20 0 34 0
name.x 0 1.00 4 115 0 4296 0
style.y 516 0.89 4 18 0 23 0
name.y 0 1.00 11 50 0 57 0
address 0 1.00 8 31 0 57 0
city 0 1.00 1 15 0 45 0
state 985 0.78 2 15 0 24 0
postal 261 0.94 4 9 0 47 0
country 0 1.00 2 14 0 17 0
phone 0 1.00 12 20 0 57 0
url 0 1.00 18 61 0 57 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
artist_id 0 1 694.99 131.95 500 573 689 812 920 ▇▆▆▆▆
birth 0 1 1768.54 107.67 1395 1712 1824 1841 1901 ▁▁▂▃▇
death 0 1 1835.88 109.66 1441 1779 1890 1919 1989 ▁▁▂▃▇
work_id 0 1 59758.37 75643.22 178 6819 23690 122431 208815 ▇▁▁▁▂
museum_id 0 1 46.83 10.06 30 35 47 51 86 ▆▇▃▂▁
library(ggplot2)

artist_museum_work_df %>%
  ggplot(aes(x = style.x)) +
  geom_bar() +
  labs(title = "Counts of Each Style",
       x = "style",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = nationality)) +
  geom_bar() +
  labs(title = "Paintings by Artist Nationality",
       x = "Nationality",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = name.y)) +
  geom_bar() +
  labs(title = "Paintings per Museum",
       x = "Museum",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 80, hjust = 1))

#| echo: true

hist(artist_museum_work_df$birth,   
     col = "green",
     main = "Histogram for artist birth year",
     xlab = "year",
     ylab = "Frequency") 

#| echo: true

hist(artist_museum_work_df$death,   
     col = "red",
     main = "Histogram for death years",
     xlab = "year",
     ylab = "Frequency") 

FINAL INQUIRIES

  1. Based on the artist’s nationality, what was the average birth year/death year?
#| echo: true

artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_deathyr = mean(death, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_deathyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Death Year",
       x = "Nationality",
       y = "Average Death Year")

#| echo: true

artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_birthyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Birth Year",
       x = "Nationality",
       y = "Average Birth Year")

  2. Is there a preference of art styles between nationalities?
#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = style.x, fill = nationality)) +
  geom_bar(position = "dodge") +
  labs(x = "style", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Initial Findings:

Starting with the skim table, there was not much to see, since most numeric variables are just IDs. However, averaged across all painting rows, the artists’ birth year was about 1769 and their death year about 1836. Within the data frame, the three most common styles were Impressionist paintings (over 1,000), Baroque paintings (about 600), and Post-Impressionist paintings (around 360). The most common artist nationality was French: over 1,700 paintings in the data frame are by French artists (each row is a painting, and the data frame contains 343 unique artists). The vast majority of paintings in this data frame are currently held at The Metropolitan Museum of Art in New York City.

For French artists, the average death year was 1895 and the average birth year was around 1850. The earliest artists were typically Dutch, Flemish, and Italian; their average birth and death years fall in the 1600s and 1700s, predating the other nationalities.
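Because each row of the data frame is a painting, the averages above count an artist once per painting. A minimal sketch of the difference between a painting-level and an artist-level average, using made-up toy rows (the values and the toy data frame are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data: one row per painting, as in artist_museum_work_df.
# Artist 1 has three paintings, artist 2 has one.
paintings <- tibble(
  artist_id   = c(1, 1, 1, 2),
  nationality = c("French", "French", "French", "French"),
  birth       = c(1840, 1840, 1840, 1800)
)

# Painting-level mean: prolific artists are counted once per painting
by_painting <- paintings %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE))

# Artist-level mean: reduce to one row per artist before averaging
by_artist <- paintings %>%
  distinct(artist_id, nationality, birth) %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE))

by_painting$avg_birthyr  # 1830
by_artist$avg_birthyr    # 1820
```

Either version can be reasonable; the choice depends on whether the question is about the paintings in the collection or about the artists themselves.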

With my last visualization, I wanted to learn more about the style preferences of different nationalities. Using a color-coded bar chart, I found that French artists painted the most Impressionist works, and that French was also the most represented nationality in the Post-Impressionist style. Baroque paintings were most often by Dutch artists, and Italian painters had the highest concentration of Early Renaissance and High Renaissance paintings.
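Raw counts favor nationalities with many paintings, so one way to compare preferences is to look at each style’s share within a nationality. The following is a minimal sketch on made-up toy rows; the toy data frame and its values are assumptions, and only the column names nationality and style.x come from my data frame.

```r
library(dplyr)

# Hypothetical toy data: one row per painting
styles <- tibble(
  nationality = c("French", "French", "French", "Dutch", "Dutch"),
  style.x     = c("Impressionism", "Impressionism", "Baroque", "Baroque", "Baroque")
)

# Count paintings per nationality and style, convert counts to
# within-nationality shares, and keep the top style for each nationality
top_style <- styles %>%
  count(nationality, style.x) %>%
  group_by(nationality) %>%
  mutate(share = n / sum(n)) %>%
  slice_max(share, n = 1, with_ties = FALSE) %>%
  ungroup()

top_style
```

In the toy data, the top style for French rows is Impressionism with a share of two thirds, and for Dutch rows it is Baroque with a share of 1.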

Supplementary API Call: Art Institute of Chicago Pieces

Description: This is the same API that I used for our class API assignment; however, I wanted a larger sample of art from the institute, so I looped through 20 pages instead of 7. The data frame I created in my R script from the API call includes 8 variables: an id, the painting title, a binary (T/F) for whether the painting has not been viewed much, an artist display description, the fiscal year, a binary (T/F) for whether the painting is currently on view, the exhibition start date, and the year of display.
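The page loop described above could look roughly like the sketch below. This is not my original script: the helper names (build_page_url, fetch_page, fetch_all_pages) are hypothetical, the field list covers only the fields the API returns directly, and the page size of 24 is an assumption inferred from 480 rows over 20 pages. The endpoint and the page, limit, and fields query parameters follow the public Art Institute of Chicago API.

```r
# Assumed endpoint and illustrative field list for the artworks listing
base_url <- "https://api.artic.edu/api/v1/artworks"
fields <- c("id", "title", "artist_display", "is_on_view",
            "has_not_been_viewed_much", "fiscal_year")

# Build the request URL for one page of results (limit = 24 is an assumption)
build_page_url <- function(page, limit = 24) {
  paste0(base_url,
         "?page=", page,
         "&limit=", limit,
         "&fields=", paste(fields, collapse = ","))
}

# Fetch one page and return its records as a data frame
fetch_page <- function(page) {
  resp <- httr::GET(build_page_url(page))
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8")
  )
  parsed$data
}

# Loop over the requested pages, pausing between calls, then stack the results
fetch_all_pages <- function(pages = 1:20, pause = 1) {
  results <- lapply(pages, function(p) {
    Sys.sleep(pause)
    fetch_page(p)
  })
  dplyr::bind_rows(results)
}

# Usage (requires network access): inst_api_raw <- fetch_all_pages(1:20)
```

The pause between requests is a courtesy to the server; columns such as exhibition_start and year_of_display would still need to be derived afterward.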

Descriptive Visualizations:

#| echo: true

skim(inst_api_df)
Data summary
Name inst_api_df
Number of rows 480
Number of columns 8
_______________________
Column type frequency:
character 2
logical 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 3 211 0 224 0
artist_display 0 1 4 134 0 125 0

Variable type: logical

skim_variable n_missing complete_rate mean count
has_not_been_viewed_much 0 1 0.45 FAL: 266, TRU: 214
is_on_view 0 1 0.35 FAL: 312, TRU: 168

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 95718.75 68783.56 22 44875.75 88686.5 126155.5 279957 ▇▇▆▁▂
fiscal_year 198 0.59 1978.18 31.35 1921 1951.00 1983.0 1998.0 2026 ▆▃▆▇▆
exhibition_start 267 0.44 1851.69 113.94 1299 1836.00 1887.0 1907.0 2018 ▁▁▁▂▇
year_of_display 267 0.44 1851.69 113.94 1299 1836.00 1887.0 1907.0 2018 ▁▁▁▂▇
#| echo: true

# What portion of art from these pages is on display?

inst_api_df %>%
  ggplot(aes(x = is_on_view)) +
  geom_bar(fill = "red") +
  labs(title = "How many paintings are on view?",
       x = "false/true",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

# Most common years for exhibition start

ggplot(inst_api_df, aes(x = exhibition_start)) +
  geom_histogram() +
  labs(title = "Histogram of exhibition starts",
       x = "Year",
       y = "Count")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 267 rows containing non-finite outside the scale range
(`stat_bin()`).

#| echo: true

# from these pages of the api, what were the 5 most common displays?

top5_artists <- inst_api_df %>%
  count(artist_display, sort = TRUE) %>%
  slice_max(n, n = 5)

ggplot(top5_artists, aes(x = reorder(artist_display, n), y = n)) +
  geom_col(fill = "purple") +
  coord_flip() +  
  labs(
    title = "Top 5 Most Common Artist Displays",
    x = "Artist Display",
    y = "Count"
  )

Supplementary Comparison:

In my R script, I created two tables, one from the API and one from the original painting data set, and loaded both into the Quarto document at the beginning. I filtered both data frames so that the artist name or artist display includes the string “Monet”. The API’s Monet data frame is titled “art_inst_monet” and the Kaggle Monet data frame is titled “artist_monet”. Claude Monet is one of my favorite Impressionist painters, and I wanted to see whether the “Greatest Paintings” data frame had any Monet pieces and whether any of them were located in the Art Institute of Chicago.
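The filtering step described above can be sketched as follows. The toy data frames and their values are assumptions for illustration; only the column names (full_name, name.x, artist_display, title) come from my data frames, and the exact filter in my R script may differ.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows standing in for the Kaggle and API data frames
kaggle_toy <- tibble(
  full_name = c("Claude Monet", "Artist B"),
  name.x    = c("Water Lily Pond", "Painting B")
)
api_toy <- tibble(
  artist_display = c("Claude Monet\nFrench, 1840-1926", "Artist C"),
  title          = c("Water Lily Pond", "Painting C")
)

# Keep only rows whose artist text contains the literal string "Monet"
artist_monet_toy <- kaggle_toy %>%
  filter(str_detect(full_name, fixed("Monet")))
art_inst_monet_toy <- api_toy %>%
  filter(str_detect(artist_display, fixed("Monet")))

# Titles present in both filtered tables (exact, case-sensitive match)
shared_titles <- intersect(art_inst_monet_toy$title, artist_monet_toy$name.x)
shared_titles
```

Because intersect() matches strings exactly, titles that differ in spelling, punctuation, or capitalization between the two sources would not be counted as shared.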

matching_names <- intersect(art_inst_monet$title, artist_monet$name.x)
matching_names
[1] "Venice, Palazzo Dario" "Water Lily Pond"      
#| echo: true



#### Some paintings are shared between these two tables. From the greatest artist df,
#### which museums hold the Monet paintings, including the Art Institute of Chicago?

# Count Monet paintings per museum (geom_bar counts rows, so no grouping is needed)
artist_monet %>%
  ggplot(aes(x = name.y)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Monet Paintings by Museum (Kaggle data)",
       x = "Museum",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

API Inquiry Findings:

By filtering each data frame down to just works by Monet, I was able to find 2 paintings in the “Best Paintings/Artists” data frame that are also held at the Art Institute of Chicago: “Venice, Palazzo Dario” and “Water Lily Pond”, both by Claude Monet.

“Venice, Palazzo Dario” - Claude Monet

“Water Lily Pond” - Claude Monet
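A title match on its own does not show which museum the Kaggle data lists for each painting. One possible follow-up check, sketched on made-up toy rows (the museum names and the extra painting are placeholders, and artist_monet_toy is a hypothetical stand-in for artist_monet):

```r
library(dplyr)

# Hypothetical toy version of artist_monet: title (name.x) and museum (name.y)
artist_monet_toy <- tibble(
  name.x = c("Water Lily Pond", "Painting B", "Venice, Palazzo Dario"),
  name.y = c("Museum A", "Museum B", "Museum A")
)

# The two titles returned by intersect() in this report
matching_names <- c("Venice, Palazzo Dario", "Water Lily Pond")

# For the matching titles, count how many fall under each museum
matched_museums <- artist_monet_toy %>%
  filter(name.x %in% matching_names) %>%
  count(name.y, name = "n_matches")

matched_museums
```

Run against the real artist_monet data frame, the same steps would show the museum recorded for each of the two matching titles.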