Outside of my BAIS and Marketing studies, I am an avid oil painter and pencil artist. Since I was young, I have enjoyed visiting museums and seeing both local and famous works whenever I am abroad or in another city. For my final BAIS project, I decided to merge my artistic interests with my statistical ones. I found an initial data set on Kaggle titled “Greatest Paintings”. The original data was split across multiple CSV files that I could join on shared IDs (a SQL throwback). To build my data frame, I combined a painting file, a museum file, and an artist description file so that each row of the final data frame represents a painting by a specific artist, along with information about the museum where the piece is held.
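As a rough illustration of the join described above, here is a minimal sketch using three tiny, made-up tables. The table names (`artist`, `work`, `museum`), their values, and the join order are assumptions, not the actual Kaggle files. It also shows where suffixed columns such as `style.x`, `style.y`, `name.x`, and `name.y` come from: dplyr adds `.x` and `.y` when both sides of a join share a non-key column name.

```r
library(dplyr)

# Hypothetical miniature versions of the three files (values are made up).
artist <- tibble(
  artist_id = c(1, 2),
  full_name = c("Artist One", "Artist Two"),
  style     = c("Impressionist", "Baroque")
)
work <- tibble(
  work_id   = c(10, 11, 12),
  name      = c("Painting A", "Painting B", "Painting C"),
  artist_id = c(1, 1, 2),
  museum_id = c(100, 101, 100),
  style     = c("Impressionism", NA, "Baroque")
)
museum <- tibble(
  museum_id = c(100, 101),
  name      = c("Museum X", "Museum Y"),
  city      = c("City X", "City Y")
)

# Join on the shared ID columns. Both `artist` and `work` have a `style`
# column, and both the intermediate result and `museum` have a `name` column,
# so dplyr disambiguates them with the .x / .y suffixes.
joined <- artist %>%
  inner_join(work, by = "artist_id") %>%
  inner_join(museum, by = "museum_id")

names(joined)
```

Each row of `joined` is one painting, with the artist and museum details repeated alongside it.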
Data Dictionary and Summary Statistics
Variables
Quantitative (num)
artist_id - unique artist number
birth - The birth year of the artist
death - The death year of the artist
work_id - The unique painting ID
museum_id - The unique museum ID
Qualitative (chr)
full_name - Full name of the artist
first_name - First name of the artist
middle_names - Middle name(s) of the artist, if any
last_name - Last name of the artist
nationality - Where the artist is from
style.x - The style and technique of the painting
style.y - A second style column carried over from the join (about 11% missing)
name.x - The title of the painting
name.y - The name of the painting’s museum
address - The address of the museum
city - The city where the museum is located
state - The state of the museum
postal - The postal code of the museum
country - The country of the museum
phone - Phone number of the museum
url - The museum’s website URL
Inquiry: Using the data frame I created, titled “artist_museum_work_df”, I first wanted to discover which “style” was most common and where the artists most commonly came from. I also wanted to get an idea of how many museums are listed in the data set and which ones hold the highest concentration of “Greatest” paintings. After gathering a basic understanding of the data, I came up with two more questions about the artists:
1: Based on the artist’s nationality, what was the average birth year/death year?
2: Do art style preferences differ by nationality?
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
library(readr)   # provides read_csv()
library(stringr)
library(dplyr)
library(knitr)
library(ggplot2)
#### Best paintings/museums/works (Kaggle)
artist_museum_work_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQBtTTJJK_xaTJUbcEmE4p2RAWfX0kRZHPClBkvQmdHOAfg?download=1")
Rows: 4553 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): full_name, first_name, middle_names, last_name, nationality, style...
dbl (5): artist_id, birth, death, work_id, museum_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Top 5 displays data frame
top5_artists <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQC_bRiJ31W7SKwt-Smw2IKUAYwt4pvUwv_ZVNSo9qEhBvA?download=1")
Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): artist_display
dbl (1): n
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 192 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): full_name, first_name, last_name, nationality, style.x, name.x, st...
dbl (5): artist_id, birth, death, work_id, museum_id
lgl (1): middle_names
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#### Art Institute of Chicago (Supplementary API)
inst_api_df <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQAmNOeXiT0tTqIRmUlrjfBcAQ0SGx69Y_ycih4QkD26W0o?download=1")
Rows: 480 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Monet paintings from API
art_inst_monet <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/mccarthyc9_xavier_edu/IQDzySj5i8k2Qop3kxQ91LOuAaiPRwAFoXY3oX_ZqubrBBM?download=1")
Rows: 6 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, artist_display
dbl (4): id, fiscal_year, exhibition_start, year_of_display
lgl (2): has_not_been_viewed_much, is_on_view
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Visualizations: (Kaggle DF)
#| echo: true
### Using skimr to summarise the df
library(skimr)
skim(artist_museum_work_df)
Data summary

| | |
|:---|:---|
| Name | artist_museum_work_df |
| Number of rows | 4553 |
| Number of columns | 21 |
| Column type frequency: | |
| character | 16 |
| numeric | 5 |
| Group variables | None |
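Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:---|---:|---:|---:|---:|---:|---:|---:|
| full_name | 0 | 1.00 | 6 | 29 | 0 | 343 | 0 |
| first_name | 0 | 1.00 | 3 | 12 | 0 | 183 | 0 |
| middle_names | 3049 | 0.33 | 1 | 16 | 0 | 106 | 0 |
| last_name | 0 | 1.00 | 3 | 16 | 0 | 338 | 0 |
| nationality | 0 | 1.00 | 5 | 9 | 0 | 17 | 0 |
| style.x | 0 | 1.00 | 4 | 20 | 0 | 34 | 0 |
| name.x | 0 | 1.00 | 4 | 115 | 0 | 4296 | 0 |
| style.y | 516 | 0.89 | 4 | 18 | 0 | 23 | 0 |
| name.y | 0 | 1.00 | 11 | 50 | 0 | 57 | 0 |
| address | 0 | 1.00 | 8 | 31 | 0 | 57 | 0 |
| city | 0 | 1.00 | 1 | 15 | 0 | 45 | 0 |
| state | 985 | 0.78 | 2 | 15 | 0 | 24 | 0 |
| postal | 261 | 0.94 | 4 | 9 | 0 | 47 | 0 |
| country | 0 | 1.00 | 2 | 14 | 0 | 17 | 0 |
| phone | 0 | 1.00 | 12 | 20 | 0 | 57 | 0 |
| url | 0 | 1.00 | 18 | 61 | 0 | 57 | 0 |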
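Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| artist_id | 0 | 1 | 694.99 | 131.95 | 500 | 573 | 689 | 812 | 920 | ▇▆▆▆▆ |
| birth | 0 | 1 | 1768.54 | 107.67 | 1395 | 1712 | 1824 | 1841 | 1901 | ▁▁▂▃▇ |
| death | 0 | 1 | 1835.88 | 109.66 | 1441 | 1779 | 1890 | 1919 | 1989 | ▁▁▂▃▇ |
| work_id | 0 | 1 | 59758.37 | 75643.22 | 178 | 6819 | 23690 | 122431 | 208815 | ▇▁▁▁▂ |
| museum_id | 0 | 1 | 46.83 | 10.06 | 30 | 35 | 47 | 51 | 86 | ▆▇▃▂▁ |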
library(ggplot2)
artist_museum_work_df %>%
  ggplot(aes(x = style.x)) +
  geom_bar() +
  labs(title = "Counts of Each Style", x = "style", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#| echo: true
artist_museum_work_df %>%
  ggplot(aes(x = nationality)) +
  geom_bar() +
  labs(title = "histogram of artist nationalities", x = "Nationality", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#| echo: true
artist_museum_work_df %>%
  ggplot(aes(x = name.y)) +
  geom_bar() +
  labs(title = "Histogram of Museum Uses", x = "name", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 80, hjust = 1))
#| echo: true
hist(artist_museum_work_df$birth,
     col = "green",
     main = "Histogram for artist birth year",
     xlab = "year",
     ylab = "Frequency")
#| echo: true
hist(artist_museum_work_df$death,
     col = "red",
     main = "Histogram for death years",
     xlab = "year",
     ylab = "Frequency")
FINAL INQUIRIES
Based on the artist’s nationality, what was the average birth year/death year?
#| echo: true
artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_deathyr = mean(death, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_deathyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Death Year", x = "Nationality", y = "Average Death Year")
#| echo: true
artist_museum_work_df %>%
  group_by(nationality) %>%
  summarise(avg_birthyr = mean(birth, na.rm = TRUE)) %>%
  ggplot(aes(x = nationality, y = avg_birthyr)) +
  geom_point(color = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Nationality vs Average Birth Year", x = "Nationality", y = "Average Birth Year")
Do art style preferences differ by nationality?
#| echo: true
artist_museum_work_df %>%
  ggplot(aes(x = style.x, fill = nationality)) +
  geom_bar(position = "dodge") +
  labs(x = "style", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Initial Findings:
Starting with the skim table, there was not much to see because most of the numeric variables are just IDs; however, the average birth year across all rows was about 1769 and the average death year was about 1836. Within the data frame, the three most common styles were Impressionism (over 1,000 paintings), Baroque (about 600), and Post-Impressionism (around 360). The most common artist nationality was French, with over 1,700 paintings in the data frame attributed to French artists (each row is a painting, and there are only 343 unique artists). The vast majority of paintings in this data frame are currently held at The Metropolitan Museum of Art in New York City.
For French artists, the average death year was 1895 and the average birth year was around 1850. The earliest artists were typically Dutch, Flemish, and Italian; their average birth and death years fall in the 1600s and 1700s, well before those of the other nationalities.
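Because each row is a painting, the grouped averages above weight prolific artists more heavily. The sketch below, on made-up rows, contrasts that per-painting average with a per-artist average computed after keeping one row per artist. The data and object names are hypothetical and only meant to show the difference in technique.

```r
library(dplyr)

# Hypothetical rows: one row per painting, so prolific artists repeat.
paintings <- tibble(
  artist_id   = c(1, 1, 1, 2, 3),
  nationality = c("French", "French", "French", "French", "Dutch"),
  birth       = c(1840, 1840, 1840, 1800, 1600),
  death       = c(1926, 1926, 1926, 1870, 1660)
)

# Average over paintings: artist 1 counts three times.
per_painting <- paintings %>%
  group_by(nationality) %>%
  summarise(avg_birth = mean(birth), avg_death = mean(death))

# Average over artists: collapse to one row per artist first.
per_artist <- paintings %>%
  distinct(artist_id, nationality, birth, death) %>%
  group_by(nationality) %>%
  summarise(avg_birth = mean(birth), avg_death = mean(death))

per_painting
per_artist
```

In this toy data the French average birth year is 1830 per painting but 1820 per artist, so which version is appropriate depends on whether the question is about paintings or about people.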
With my last visualization, I wanted to learn more about the style preferences of different nationalities. Using a color-coded bar chart, I found that most French artists worked in the Impressionist style. French artists were also the most heavily represented nationality in the Post-Impressionism style. Baroque paintings were most likely to be painted by Dutch artists, and Italian painters had the highest concentration of Early Renaissance and High Renaissance paintings.
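Raw counts in a dodged bar chart favor nationalities with many paintings, so one way to compare preferences is to look at each style's share within a nationality. This is a minimal sketch on made-up rows; `style_share` and the sample values are hypothetical.

```r
library(dplyr)

# Hypothetical rows: one row per painting.
paintings <- tibble(
  nationality = c("French", "French", "French", "Dutch", "Dutch"),
  style.x     = c("Impressionism", "Impressionism", "Post-Impressionism",
                  "Baroque", "Baroque")
)

# Count paintings per nationality/style pair, then convert the counts to
# shares within each nationality so that nationalities are comparable.
style_share <- paintings %>%
  count(nationality, style.x) %>%
  group_by(nationality) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  arrange(nationality, desc(share))

style_share
```

The same table could feed `geom_col()` with `y = share` if a proportional version of the chart is wanted.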
Supplementary API Call: Art Institute of Chicago Pieces
Description: This is the same API that I used for our class API assignment; however, I wanted a larger sample of art from the institute, so I looped through 20 pages instead of 7. The data frame I created in my R script from the API call has 8 variables: an id, the painting title, a logical (T/F) flag for whether the painting has not been viewed much, an artist display description, the fiscal year, a logical (T/F) flag for whether the painting is currently on view, the exhibition start year, and the year of display.
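The R script itself is not shown here, so the following is only a sketch of how a paged request like the one described could look. It assumes the public Art Institute of Chicago artworks endpoint with `page`, `limit`, and `fields` query parameters; the helper names `page_url()` and `fetch_pages()` are hypothetical, and the page size of 24 is a guess based on 480 rows over 20 pages. The `exhibition_start` and `year_of_display` columns are left out because it is not clear how they were derived.

```r
# Assumption: public Art Institute of Chicago endpoint and query parameters.
base_url <- "https://api.artic.edu/api/v1/artworks"
fields <- c("id", "title", "artist_display", "fiscal_year",
            "is_on_view", "has_not_been_viewed_much")

# Build the request URL for one page of results (hypothetical helper).
page_url <- function(page, limit = 24) {
  paste0(base_url, "?page=", page, "&limit=", limit,
         "&fields=", paste(fields, collapse = ","))
}

# Fetch several pages and stack them into one data frame (hypothetical helper).
# jsonlite::fromJSON() parses the JSON response; the records sit under `data`.
fetch_pages <- function(pages) {
  results <- lapply(pages, function(p) jsonlite::fromJSON(page_url(p))$data)
  do.call(rbind, results)
}

# Example call (needs network access and the jsonlite package):
# inst_api_df <- fetch_pages(1:20)
page_url(1)
```

Looping over page numbers keeps each request small, and requesting only the needed fields keeps the stacked data frame narrow.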
Descriptive Visualizations:
#| echo: true
skim(inst_api_df)
Data summary

| | |
|:---|:---|
| Name | inst_api_df |
| Number of rows | 480 |
| Number of columns | 8 |
| Column type frequency: | |
| character | 2 |
| logical | 2 |
| numeric | 4 |
| Group variables | None |
Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:---|---:|---:|---:|---:|---:|---:|---:|
| title | 0 | 1 | 3 | 211 | 0 | 224 | 0 |
| artist_display | 0 | 1 | 4 | 134 | 0 | 125 | 0 |
Variable type: logical

| skim_variable | n_missing | complete_rate | mean | count |
|:---|---:|---:|---:|:---|
| has_not_been_viewed_much | 0 | 1 | 0.45 | FAL: 266, TRU: 214 |
| is_on_view | 0 | 1 | 0.35 | FAL: 312, TRU: 168 |
Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| id | 0 | 1.00 | 95718.75 | 68783.56 | 22 | 44875.75 | 88686.5 | 126155.5 | 279957 | ▇▇▆▁▂ |
| fiscal_year | 198 | 0.59 | 1978.18 | 31.35 | 1921 | 1951.00 | 1983.0 | 1998.0 | 2026 | ▆▃▆▇▆ |
| exhibition_start | 267 | 0.44 | 1851.69 | 113.94 | 1299 | 1836.00 | 1887.0 | 1907.0 | 2018 | ▁▁▁▂▇ |
| year_of_display | 267 | 0.44 | 1851.69 | 113.94 | 1299 | 1836.00 | 1887.0 | 1907.0 | 2018 | ▁▁▁▂▇ |
#| echo: true
# What portion of art from these pages is on display?
inst_api_df %>%
  ggplot(aes(x = is_on_view)) +
  geom_bar(fill = "red") +
  labs(title = "How many paintings are on view?", x = "false/true", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#| echo: true
# Most common years for exhibition start
ggplot(inst_api_df, aes(x = exhibition_start)) +
  geom_histogram() +
  labs(title = "Histogram of exhibition starts", x = "Year", y = "Count")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 267 rows containing non-finite outside the scale range
(`stat_bin()`).
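The `stat_bin()` message and the warning about removed rows can both be avoided by choosing a bin width explicitly and dropping missing years before plotting. This is a small sketch on a made-up stand-in for `inst_api_df`; the 25-year bin width and the object names are assumptions chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for inst_api_df, including missing years.
api_sample <- tibble(
  exhibition_start = c(1850, 1887, 1907, NA, 1836, NA, 2018)
)

# Drop rows with no exhibition year so ggplot2 has nothing to remove.
plot_data <- api_sample %>%
  filter(!is.na(exhibition_start))

# An explicit binwidth replaces the default of 30 bins.
p <- ggplot(plot_data, aes(x = exhibition_start)) +
  geom_histogram(binwidth = 25, boundary = 0) +
  labs(title = "Histogram of exhibition starts", x = "Year", y = "Count")

p
```

With year data, a bin width that is a round number of years tends to make the bars easier to read than an arbitrary 30-bin split.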
#| echo: true
# from these pages of the api, what were the 5 most common displays?
top5_artists <- inst_api_df %>%
  count(artist_display, sort = TRUE) %>%
  slice_max(n, n = 5)
ggplot(top5_artists, aes(x = reorder(artist_display, n), y = n)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(title = "Top 5 Most Common Artist Displays", x = "Artist Display", y = "Count")
Supplementary Comparison:
In my R script (and loaded into the Quarto document at the beginning), I created two tables, one from the API and one from the original painting data set. I filtered both data frames so that only rows whose artist name or artist display includes the string “Monet” remain. The API’s Monet data frame is titled “art_inst_monet” and the Kaggle Monet data frame is titled “artist_monet”. Claude Monet is one of my favorite Impressionist painters, and I wanted to see if the “Greatest Paintings” data frame had any Monet pieces and if any of them were located in the Art Institute of Chicago.
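The filtering step lives in the R script rather than in this document, so here is a minimal sketch of what a string filter of that kind could look like, assuming `stringr::str_detect()` on the artist name column. The sample rows and the name `monet_sketch` are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical sample of the Kaggle data frame (values are made up).
kaggle_sample <- tibble(
  full_name = c("Claude Monet", "Artist Two", "Claude Monet", "Edouard Manet"),
  name.y    = c("Museum X", "Museum Y", "The Art Institute of Chicago", "Museum X")
)

# Keep only rows whose artist name contains the exact string "Monet".
# "Manet" is not matched because the pattern is compared letter for letter.
monet_sketch <- kaggle_sample %>%
  filter(str_detect(full_name, "Monet"))

# Count the filtered paintings by museum.
monet_sketch %>% count(name.y)
```

The same pattern would apply to the API data by filtering on `artist_display` instead of `full_name`.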
#| echo: true
#### Some paintings are shared between these two tables. From the greatest artist df,
#### which Monet paintings are at the Art Institute of Chicago?
artist_monet %>%
  group_by(name.y) %>%
  ggplot(aes(x = name.y)) +
  geom_bar(fill = "lightblue") +
  labs(title = "How many monet paintings are on view at the art institute of chicago?", x = "museum", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
API Inquiry Findings:
By filtering each data frame down to just works by Monet, I was able to find 2 paintings in the “Greatest Paintings” data frame that are also held at The Art Institute of Chicago. These are “Venice, Palazzo Dario” and “Water Lily Pond”, both by Claude Monet.
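One way to check such an overlap programmatically is to match painting titles across the two Monet tables after normalizing case and whitespace. The sketch below uses tiny illustrative tables rather than the real `artist_monet` and `art_inst_monet` data, and the helper `normalise_title()` is hypothetical.

```r
library(dplyr)
library(stringr)

# Illustrative stand-ins for the two Monet tables (not the real data).
kaggle_monet <- tibble(
  name.x = c("Water lily Pond", "Venice, Palazzo Dario", "Example Painting A")
)
api_monet <- tibble(
  title = c("Water Lily Pond", "Venice, Palazzo Dario", "Example Painting B")
)

# Lower-case and squeeze whitespace so small spelling differences still match.
normalise_title <- function(x) str_to_lower(str_squish(x))

# semi_join() keeps the Kaggle rows whose normalized title also appears in
# the API table, without adding any columns from the API side.
shared <- kaggle_monet %>%
  mutate(key = normalise_title(name.x)) %>%
  semi_join(api_monet %>% mutate(key = normalise_title(title)), by = "key")

shared$name.x
```

Exact title matching will miss translated or reworded titles, so a result like this is best treated as a lower bound on the overlap.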