Final Project: Famous Paintings and Art Institute of Chicago API

BAIS 462 Final Project: Commonalities Between Famous Artists

Original Data Link: https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQBcwAZ48ugPQrY5i3viNV8wAQ82-adBZ2jl7kNTGbtt1p0?download=1

Interest and Introduction:

Outside of my BAIS and Marketing studies, I am an avid oil painter and pencil artist. Since I was young, I have enjoyed visiting museums and seeing both local and famous works while abroad or in another city. For my final BAIS project, I decided to merge my artistic interests with my statistical ones. I found an initial data set on Kaggle titled “Greatest Paintings”. The original data was split across multiple CSV files that could be joined on matching IDs (a SQL throwback). To build my data frame, I combined a painting file, a museum file, and an artist description file so that each row of the final data frame represents one painting by a specific artist, along with information about the museum that holds the piece.
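The join described above can be sketched as follows. This is a minimal illustration on toy tables, not the original script: the table contents, the column names of the three source files, and the join order are assumptions. It does show one plausible origin of the `.x` / `.y` column names that appear later in the data frame, since dplyr adds those suffixes by default when both tables share a non-key column name.

```r
library(dplyr)

# Toy stand-ins for the three Kaggle files (all values are made up)
artist <- tibble(
  artist_id = c(1, 2),
  full_name = c("Artist A", "Artist B"),
  style     = c("Impressionist", "Baroque")
)
work <- tibble(
  work_id   = c(10, 11, 12),
  name      = c("Painting One", "Painting Two", "Painting Three"),
  artist_id = c(1, 1, 2),
  style     = c("Impressionism", "Impressionism", NA),
  museum_id = c(100, 100, 101)
)
museum <- tibble(
  museum_id = c(100, 101),
  name      = c("Example Museum A", "Example Museum B")
)

# One row per painting: artists -> their works -> the museum holding each work.
# `style` exists in both artist and work, `name` in both work and museum,
# so the joins produce style.x / style.y and name.x / name.y.
artist_museum_work <- artist %>%
  inner_join(work, by = "artist_id") %>%
  inner_join(museum, by = "museum_id")

names(artist_museum_work)
```

Passing `suffix = c("_artist", "_work")` to the join would give more descriptive column names than the defaults.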

Data Dictionary and Summary Statistics

Variables

Quantitative (num)

artist_id - unique artist number
birth - The birth year of the artist
death - The death year of the artist
work_id - The unique painting ID
museum_id - The unique museum ID

Qualitative (chr)

full_name - Full name of the artist
first_name - First name of the artist
middle_names - Middle name(s) of the artist
last_name - Last name of the artist
nationality - Where the artist is from
style.x - The style and technique of the painting
name.x - The title of the painting
style.y - A second style column created by the join (missing for 516 rows)
name.y - The name of the painting’s museum
address - The address of the museum
city - The city where the museum is located
state - The state of the museum
postal - The postal code of the museum
country - The country of the museum
phone - Phone number of the museum
url - The museum’s website URL

Inquiry: Using the data frame I created, “artist_museum_work_df”, I first wanted to find the most common “style” in the list and see which nationalities the artists most often came from. I also wanted to know how many museums appear in the data set and which ones hold the highest concentration of “Greatest” paintings. After gathering a basic understanding of the data, I came up with two more questions about the artists:

1: Based on the artist’s nationality, what was the average birth year and death year?

2: Do preferred art styles differ between nationalities?
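The first-pass questions above (most common style, most common nationality, number of museums, and which museums hold the most paintings) can also be answered with frequency tables instead of reading values off a chart. The sketch below runs on a small made-up data frame that only borrows the column names `style.x`, `nationality`, and `name.y`; the values are placeholders.

```r
library(dplyr)

# Toy rows standing in for artist_museum_work_df (values are made up)
toy <- tibble(
  style.x     = c("Impressionism", "Impressionism", "Baroque",
                  "Impressionism", "Baroque", "Cubism"),
  nationality = c("French", "French", "Dutch", "French", "Flemish", "Spanish"),
  name.y      = c("Museum A", "Museum B", "Museum A",
                  "Museum A", "Museum C", "Museum B")
)

# Frequency tables, most common value first
style_counts       <- toy %>% count(style.x, sort = TRUE)
nationality_counts <- toy %>% count(nationality, sort = TRUE)
museum_counts      <- toy %>% count(name.y, sort = TRUE)

# Number of distinct museums in the data
n_museums <- n_distinct(toy$name.y)

style_counts
museum_counts
n_museums
```

Swapping `toy` for the real data frame should give exact counts to quote alongside the bar charts.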

DATA LOAD:

#| echo: false
#| message: false
#| warning: false


library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(chromote)
library(httr)
library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

library(polite)     
library(lubridate)  
library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

library(stringr)
library(dplyr)
library(knitr)
library(ggplot2)


#### Best paintings/museums/works (Kaggle)

artist_museum_work_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQBtTTJJK_xaTJUbcEmE4p2RAWfX0kRZHPClBkvQmdHOAfg?download=1")

Rows: 4553 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): full_name, first_name, middle_names, last_name, nationality, style...
dbl  (5): artist_id, birth, death, work_id, museum_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Top 5 displays data frame

top5_artists <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQC_bRiJ31W7SKwt-Smw2IKUAYwt4pvUwv_ZVNSo9qEhBvA?download=1")

Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): artist_display
dbl (1): n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

artist_monet <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQA61ItllO5gTLEpaQk3TG5MAdj_w3T-3q7hcam8lSSKZ_4?download=1")

Rows: 192 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): full_name, first_name, last_name, nationality, style.x, name.x, st...
dbl  (5): artist_id, birth, death, work_id, museum_id
lgl  (1): middle_names

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#### Art Institute of Chicago (Supplementary API)

inst_api_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQAmNOeXiT0tTqIRmUlrjfBcAQ0SGx69Y_ycih4QkD26W0o?download=1")

Rows: 480 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Monet paintings from API

art_inst_monet <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQDzySj5i8k2Qop3kxQ91LOuAaiPRwAFoXY3oX_ZqubrBBM?download=1")

Rows: 6 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Visualizations: (Kaggle DF)

#| echo: true

### Using skimr to summarise the df

library(skimr)

skim(artist_museum_work_df)

Data summary
Name	artist_museum_work_df
Number of rows	4553
Number of columns	21
_______________________
Column type frequency:
character	16
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
full_name	0	1.00	6	29	343
first_name	0	1.00	3	12	183
middle_names	3049	0.33	1	16	106
last_name	0	1.00	3	16	338
nationality	0	1.00	5	9	17
style.x	0	1.00	4	20	34
name.x	0	1.00	4	115	4296
style.y	516	0.89	4	18	23
name.y	0	1.00	11	50	57
address	0	1.00	8	31	57
city	0	1.00	1	15	45
state	985	0.78	2	15	24
postal	261	0.94	4	9	47
country	0	1.00	2	14	17
phone	0	1.00	12	20	57
url	0	1.00	18	61	57

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
artist_id	1	694.99	131.95	500	573	689	812	920	▇▆▆▆▆
birth	1	1768.54	107.67	1395	1712	1824	1841	1901	▁▁▂▃▇
death	1	1835.88	109.66	1441	1779	1890	1919	1989	▁▁▂▃▇
work_id	1	59758.37	75643.22	178	6819	23690	122431	208815	▇▁▁▁▂
museum_id	1	46.83	10.06	30	35	47	51	86	▆▇▃▂▁

library(ggplot2)

artist_museum_work_df %>%
  ggplot(aes(x = style.x)) +
  geom_bar() +
  labs(title = "Counts of Each Style",
       x = "style",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = nationality)) +
  geom_bar() +
  labs(title = "Counts of Artist Nationalities",
       x = "Nationality",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = name.y)) +
  geom_bar() +
  labs(title = "Number of Paintings per Museum",
       x = "Museum",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 80, hjust = 1))

#| echo: true

hist(artist_museum_work_df$birth,   
     col = "green",
     main = "Histogram for artist birth year",
     xlab = "year",
     ylab = "Frequency")

#| echo: true

hist(artist_museum_work_df$death,   
     col = "red",
     main = "Histogram for death years",
     xlab = "year",
     ylab = "Frequency")

FINAL INQUIRIES

Based on the artist’s nationality, what was the average birth year and death year?

#| echo: true

artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_deathyr = mean(death, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_deathyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Death Year",
       x = "Nationality",
       y = "Average Death Year")

#| echo: true

artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_birthyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Birth Year",
       x = "Nationality",
       y = "Average Birth Year")
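Because each row of the data frame is a painting, `mean(birth)` within a nationality counts a prolific artist once per painting. The sketch below, on made-up rows, computes both averages in a single `summarise()` and compares the painting-weighted result with an artist-weighted one obtained by first reducing the data to one row per artist. The toy values and the choice of `distinct()` keys are assumptions for illustration.

```r
library(dplyr)

# Toy rows: artist 1 has three paintings, artists 2 and 3 have one each
toy <- tibble(
  artist_id   = c(1, 1, 1, 2, 3),
  nationality = c("French", "French", "French", "French", "Dutch"),
  birth       = c(1840, 1840, 1840, 1800, 1606),
  death       = c(1926, 1926, 1926, 1870, 1669)
)

# Painting-weighted: every row (painting) contributes to the mean
by_painting <- toy %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE),
            avg_deathyr = mean(death, na.rm = TRUE),
            .groups = "drop")

# Artist-weighted: keep one row per artist before averaging
by_artist <- toy %>%
  distinct(artist_id, nationality, birth, death) %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE),
            avg_deathyr = mean(death, na.rm = TRUE),
            .groups = "drop")

by_painting
by_artist
```

In the toy data the French average birth year is 1830 when weighted by painting and 1820 when weighted by artist, which shows how the two readings can differ.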

Do preferred art styles differ between nationalities?

#| echo: true

artist_museum_work_df %>%
  ggplot(aes(x = style.x, fill = nationality)) +
  geom_bar(position = "dodge") +
  labs(x = "style", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Initial Findings:

Starting with the skim table, there was not much to see because most of the numeric variables are just IDs. However, the average birth year across all rows was roughly 1769 and the average death year was roughly 1836. Within the data frame, the three most common styles were Impressionist paintings (over 1,000), Baroque paintings (about 600), and Post-Impressionist paintings (around 360). The most common nationality was French, with over 1,700 paintings in the data frame by French artists (each row is a painting, and the data contain 343 unique artist names). The vast majority of paintings from this data frame are currently held at The Metropolitan Museum of Art in New York City.
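Since each row is a painting, a bar chart over `nationality` counts paintings rather than artists (the skim table reports 343 unique `full_name` values across 4,553 rows). A minimal sketch of how the two counts could be reported side by side, using made-up rows:

```r
library(dplyr)

# Toy rows: one row per painting (values are made up)
toy <- tibble(
  artist_id   = c(1, 1, 1, 2, 3, 3),
  nationality = c("French", "French", "French", "French", "Dutch", "Dutch")
)

# Paintings = number of rows; artists = number of distinct artist IDs
nationality_summary <- toy %>%
  group_by(nationality) %>%
  summarise(paintings = n(),
            artists   = n_distinct(artist_id),
            .groups   = "drop") %>%
  arrange(desc(paintings))

nationality_summary
```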

For French artists, the average death year was 1895 and the average birth year was around 1850. The earliest artists were typically Dutch, Flemish, and Italian: their average birth and death years fall in the 1600s and 1700s, earlier than those of the other nationalities.

With my last visualization, I wanted to learn more about the style preferences of different nationalities. Using a color-coded bar chart, I found that the largest group of paintings by French artists was Impressionist. French artists also accounted for the largest share of Post-Impressionist paintings. Baroque paintings were most often painted by Dutch artists, and Italian painters had the highest concentration of Early Renaissance and High Renaissance paintings.
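A dodged bar chart with many nationalities and styles can be hard to read, so the same comparison can be made as a table of within-nationality shares. The sketch below uses made-up rows and only the column names `nationality` and `style.x` from the data frame; it is one possible way to quantify the pattern, not the method used for the chart.

```r
library(dplyr)

# Toy rows: one row per painting (values are made up)
toy <- tibble(
  nationality = c("French", "French", "French", "French",
                  "Dutch", "Dutch", "Italian"),
  style.x     = c("Impressionism", "Impressionism", "Post-Impressionism",
                  "Impressionism", "Baroque", "Baroque", "High Renaissance")
)

# Share of each style within each nationality
style_share <- toy %>%
  count(nationality, style.x) %>%
  group_by(nationality) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# The single most common style per nationality
top_style <- style_share %>%
  group_by(nationality) %>%
  slice_max(share, n = 1, with_ties = FALSE) %>%
  ungroup()

top_style
```

Shares make small nationalities comparable with large ones, which raw counts in a stacked or dodged chart do not.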

Supplementary API Call: Art Institute of Chicago Pieces

Description: This is the same API that I used for our class API assignment. I wanted a larger sample of art from the institute, so I looped through 20 pages instead of 7. The data frame I created in my R script from the API call has 8 variables: an id, the painting title, a binary (T/F) for whether the painting has not been viewed much, an artist display description, the fiscal year, a binary (T/F) for whether the painting is currently on view, the exhibition start year, and the year of display.
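The paging loop described above lives in the R script and is not shown here, so the following is only a sketch of how such a loop might look. The base URL, the `limit` of 24 (chosen only because 480 rows over 20 pages implies 24 per page), the `fields` list, and the helper names `build_page_url`, `fetch_page`, and `fetch_pages` are all assumptions. The `exhibition_start` and `year_of_display` columns are left out because their source fields are not shown in this document.

```r
# Build the URL for one page of the artworks endpoint (parameters are assumed)
build_page_url <- function(page,
                           limit  = 24,
                           fields = c("id", "title", "artist_display",
                                      "fiscal_year", "is_on_view",
                                      "has_not_been_viewed_much"),
                           base   = "https://api.artic.edu/api/v1/artworks") {
  sprintf("%s?page=%d&limit=%d&fields=%s",
          base, as.integer(page), as.integer(limit),
          paste(fields, collapse = ","))
}

# Default fetcher: one GET per page, returning the `data` element.
# Needs the httr and jsonlite packages; it is defined here but not called.
fetch_page <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt)$data
}

# Loop over the pages, pause between requests, and stack the results
fetch_pages <- function(pages, fetch_one = fetch_page, pause = 1) {
  results <- lapply(pages, function(p) {
    Sys.sleep(pause)
    fetch_one(build_page_url(p))
  })
  do.call(rbind, results)
}

# The 20 page URLs that the loop would request
urls <- vapply(1:20, build_page_url, character(1))
head(urls, 2)
```

Calling `fetch_pages(1:20)` would perform the actual requests. Passing the fetcher as an argument keeps the loop easy to try with a stub before hitting the API.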

Descriptive Visualizations:

#| echo: true

skim(inst_api_df)

Data summary
Name	inst_api_df
Number of rows	480
Number of columns	8
_______________________
Column type frequency:
character	2
logical	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
title	0	1	3	211	0	224	0
artist_display	0	1	4	134	0	125	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
has_not_been_viewed_much	0	1	0.45	FAL: 266, TRU: 214
is_on_view	0	1	0.35	FAL: 312, TRU: 168

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	95718.75	68783.56	22	44875.75	88686.5	126155.5	279957	▇▇▆▁▂
fiscal_year	198	0.59	1978.18	31.35	1921	1951.00	1983.0	1998.0	2026	▆▃▆▇▆
exhibition_start	267	0.44	1851.69	113.94	1299	1836.00	1887.0	1907.0	2018	▁▁▁▂▇
year_of_display	267	0.44	1851.69	113.94	1299	1836.00	1887.0	1907.0	2018	▁▁▁▂▇

#| echo: true

# What portion of art from these pages is on display?

inst_api_df %>%
  ggplot(aes(x = is_on_view)) +
  geom_bar(fill = "red") +
  labs(title = "How many paintings are on view?",
       x = "false/true",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#| echo: true

# Most common years for exhibition start

ggplot(inst_api_df, aes(x = exhibition_start)) +
  geom_histogram() +
  labs(title = "Histogram of exhibition starts",
       x = "Year",
       y = "Count")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 267 rows containing non-finite outside the scale range
(`stat_bin()`).
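The two messages above come from the default of 30 bins and from the 267 rows where `exhibition_start` is missing. A minimal sketch of one way to address both, using a made-up column of years: drop the missing values explicitly and set a bin width in years. The 25-year width is an arbitrary choice for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for inst_api_df (values are made up)
toy_api <- tibble(exhibition_start = c(1887, 1890, 1907, NA, 1836, NA, 2018))

# Remove missing years up front so the drop is deliberate, not a warning
plot_data <- toy_api %>%
  filter(!is.na(exhibition_start))

# binwidth is in the units of the x axis, so each bar covers 25 years
p <- ggplot(plot_data, aes(x = exhibition_start)) +
  geom_histogram(binwidth = 25, boundary = 0) +
  labs(title = "Histogram of exhibition starts",
       x = "Year",
       y = "Count")

nrow(toy_api) - nrow(plot_data)  # number of rows dropped for missing years
```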

#| echo: true

# from these pages of the api, what were the 5 most common displays?

top5_artists <- inst_api_df %>%
  count(artist_display, sort = TRUE) %>%
  slice_max(n, n = 5)

ggplot(top5_artists, aes(x = reorder(artist_display, n), y = n)) +
  geom_col(fill = "purple") +
  coord_flip() +  
  labs(
    title = "Top 5 Most Common Artist Displays",
    x = "Artist Display",
    y = "Count"
  )

Supplementary Comparison:

In my R script, I created two tables, one from the API data and one from the original painting data set, and loaded both into the Quarto document at the beginning. I filtered each data frame so that it only includes rows where the artist name or artist display contains the string “Monet”. The API’s Monet data frame is titled “art_inst_monet” and the Kaggle Monet data frame is titled “artist_monet”. Claude Monet is one of my favorite Impressionist painters, and I wanted to see whether the “Greatest Paintings” data frame had any Monet pieces and whether any of them were located in the Art Institute of Chicago.

matching_names <- intersect(art_inst_monet$title, artist_monet$name.x)
matching_names

[1] "Venice, Palazzo Dario" "Water Lily Pond"
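The filtering step itself is in the R script rather than in this document, so the sketch below only illustrates the idea on made-up rows. It filters both tables with `str_detect()` and then compares titles twice: once exactly, as `intersect()` does above, and once after lower-casing and trimming whitespace, since an exact match can miss titles that differ only in capitalisation or spacing. The toy values and the helper `normalise()` are assumptions.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for inst_api_df and artist_museum_work_df (values are made up)
toy_api <- tibble(
  title          = c("Water Lily Pond", "Venice, Palazzo Dario", "Toy Title C"),
  artist_display = c("Claude Monet (French, 1840-1926)",
                     "Claude Monet (French, 1840-1926)",
                     "Some Other Artist")
)
toy_kaggle <- tibble(
  name.x    = c("water lily pond ", "Venice, Palazzo Dario", "Toy Title D"),
  full_name = c("Claude Monet", "Claude Monet", "Claude Monet")
)

# Keep only rows whose artist text contains "Monet"
api_monet    <- toy_api %>% filter(str_detect(artist_display, "Monet"))
kaggle_monet <- toy_kaggle %>% filter(str_detect(full_name, "Monet"))

# Exact comparison: sensitive to case and stray whitespace
exact_match <- intersect(api_monet$title, kaggle_monet$name.x)

# Looser comparison: lower-case and squish whitespace before comparing
normalise   <- function(x) str_to_lower(str_squish(x))
loose_match <- intersect(normalise(api_monet$title),
                         normalise(kaggle_monet$name.x))

exact_match
loose_match
```

In the toy data the exact comparison finds one shared title and the looser one finds two, because one Kaggle title differs only in case and a trailing space.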

#| echo: true



#### Some paintings are shared between these two tables. From the greatest-artists data frame,
#### which museums hold Monet paintings, and is the Art Institute of Chicago among them?

artist_monet %>%
  ggplot(aes(x = name.y)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Monet paintings in the Kaggle data, by museum",
       x = "Museum",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

API Inquiry Findings:

By filtering each data frame down to just works by Monet, I found two titles that appear in both the “Best Paintings/Artists” data frame and the Art Institute of Chicago API data. These are “Venice, Palazzo Dario” and “Water Lily Pond”, both by Claude Monet.

“Venice, Palazzo Dario” - Claude Monet

“Water Lily Pond” - Claude Monet