Final Assignment

Impressionism is a 19th-century art movement characterized by relatively small, thin, yet visible brush strokes, open composition, emphasis on accurate depiction of light in its changing qualities (often accentuating the effects of the passage of time), ordinary subject matter, inclusion of movement as a crucial element of human perception and experience, and unusual visual angles. Impressionism originated with a group of Paris-based artists whose independent exhibitions brought them to prominence during the 1870s and 1880s (Wikipedia).

Impressionist artists often painted landscapes and used light to express themselves and to convey their feelings to the viewer. In the vivid description of Eugène Boudin, a 19th-century impressionist:

Nature is richer than I represent it... Nature is so beautiful that when I am not tortured by poverty I am tortured by her splendor. How fortunate we are to be able to see and admire the glories of the sky and earth; if only I could be content just to admire them. But there is always the torment of struggling to reproduce them, the impossibility of creating anything within the narrow limits of painting.

—Eugène Boudin, Brave New World

In the final assignment you will analyze the Impressionist_Classifier_Data dataset. This dataset consists of paintings by 10 impressionist painters: Camille Pissarro, Childe Hassam, Claude Monet, Edgar Degas, Henri Matisse, John Singer Sargent, Paul Cézanne, Paul Gauguin, Pierre-Auguste Renoir, and Vincent van Gogh.

## [1] "./Cezanne"

## [1] "./Degas"

## [1] "./Gauguin"

## [1] "./Hassam"

## [1] "./Matisse"

## [1] "./Monet"

## [1] "./Pissarro"

## [1] "./Renoir"

## [1] "./Sargent"

## [1] "./VanGogh"

General Information

Please make sure to submit your final assignment in HTML format. Please make sure to describe what you are doing. You are expected to elaborate on your assumptions and decisions. All the tools and methodologies discussed in class are at your disposal to achieve the best classifier you can get. PCA, LDA, SVD, and normalization are all valid, as long as they improve the classifier's performance.
Every graph has to serve a purpose and has to be described as well.
In your analysis, use set.seed(42).
Good luck!

Construct the dataset

The data at hand come as a large set of images. The color of any pixel in a color image can be represented as a combination of three colors: red, green, and blue. This decomposition is called RGB.
In order to analyze our paintings, we first need to convert them into objects R can handle. The imager package allows us to load images and turn them into RGB matrices. In this part, we load the paintings into our environment and encode them, so we can analyze them later.

Load an Image

Load the imager package and use the following chunk in order to load an image.

library(imager)

# Set your working directory
setwd('C:/Users/David/Desktop/David/work/HUJI/Machine Learning for Economists BA/Final Assignment/')

im <- load.image('./Cezanne/215711.jpg')
# A painting!
plot(im)

And now let us look at the object we got. Its shape is

print(dim(im))

## [1] 3176 2606    1    3

The first two dimensions are the image's width and height in pixels; the third is the “depth” of the image. Had we loaded a video instead, the depth would equal the number of frames (it can be treated as the time coordinate of the image). The last dimension is the spectrum of the picture, i.e. the number of color channels. Its value of 3 corresponds to the red, green, and blue channels.

print('head(5):')

## [1] "head(5):"

im %>% as.data.frame() %>% head(5)

##   x y cc     value
## 1 1 1  1 0.6274510
## 2 2 1  1 0.6196078
## 3 3 1  1 0.6235294
## 4 4 1  1 0.6470588
## 5 5 1  1 0.6784314

print('tail(5):')

## [1] "tail(5):"

im %>% as.data.frame() %>% tail(5)

##             x    y cc     value
## 24829964 3172 2606  3 0.6352941
## 24829965 3173 2606  3 0.6156863
## 24829966 3174 2606  3 0.6000000
## 24829967 3175 2606  3 0.5960784
## 24829968 3176 2606  3 0.6000000

Our data consist of different paintings of various sizes. Before we can proceed, we first need to standardize our paintings into a fixed size. The following chunk shows how to do that.

im2 <- resize(im, 500, 500, 1, 3)
plot(im2)

We have many images (about 400 for each painter), but they are large and come in different sizes. In order to reduce the effects of dimensionality problems, as well as to standardize our images, we will resize all of our images to 100*100*3.

Image to Vector

The next step is to turn our images into vectors. After resizing our images, we can flatten the three color matrices into a single vector, as the following image suggests.

Source: Cloistered Monkey

We will do the following

im3 <- as.data.frame(im2)
image_Xs = im3$value
print(image_Xs[1:15])

##  [1] 0.6274510 0.6784314 0.6039216 0.6431373 0.6196078 0.5686275 0.6117647
##  [8] 0.5568627 0.5921569 0.5411765 0.5607843 0.5450980 0.7411765 0.7333333
## [15] 0.5647059

Now we have turned our painting into a vector of numbers. Note that once the images are resized to 100*100 pixels, each image becomes one row of the dataset: the first 100*100 values are the red intensities, the next 100*100 values are the green intensities, and the last 100*100 values are the blue intensities.
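The layout described above can be checked on a small base-R sketch. It does not use imager: a random array (an assumption standing in for a real resized painting) has the same 100 x 100 x 1 x 3 shape, and `as.vector` flattens it in the same x, then y, then channel order that `as.data.frame(im)` shows above. The names `img_array`, `image_vec` and `red_matrix` are illustrative only.

```r
# Synthetic stand-in for a resized painting: 100 x 100 pixels, depth 1, 3 channels.
# (Assumption: random values replace real pixel intensities.)
set.seed(42)
side <- 100
img_array <- array(runif(side * side * 3), dim = c(side, side, 1, 3))

# Flattening runs over x first, then y, then the colour channel.
image_vec <- as.vector(img_array)

# Slice the flat vector back into its three colour blocks.
n_pixels <- side * side
red   <- image_vec[1:n_pixels]
green <- image_vec[(n_pixels + 1):(2 * n_pixels)]
blue  <- image_vec[(2 * n_pixels + 1):(3 * n_pixels)]

# Each block folds back into a 100 x 100 matrix (rows index x, columns index y).
red_matrix <- matrix(red, nrow = side, ncol = side)

length(image_vec)                           # 30000 features per painting
all.equal(red_matrix, img_array[, , 1, 1])  # TRUE
```

Slicing by these index ranges is one way to build per-channel features (for example, the mean red intensity of a painting) from a row of the dataset.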

I used the following code to transform our images into an R dataset. The process took several hours, so do not run it. Instead, download X_train and y_train and use them in the following two sections.

library(imager)

setwd('C:/Users/David/Desktop/David/work/HUJI/Machine Learning for Economists BA/Final Assignment/')

# Resize an image to 100*100*3 and flatten it into a single row.
image_preprocessing <- function(img){
  img <- resize(img, 100, 100, 1, 3)
  img <- as.data.frame(img)
  x <- t(as.data.frame(img$value))
  return(x)
}

# Labels accumulate across all painters.
y <- c()

for (painter in list.dirs()){
  # Skip the working directory itself ('.').
  if (nchar(painter) <= 1){
    next
  }
  # One row per painting and 100*100*3 columns; reset for every painter.
  X <- data.frame(matrix(nrow = 0, ncol = 100*100*3))
  painter_name <- gsub("[./]", "", painter)

  print(paste('starting ', painter, '..', sep = ""))
  print('---------------------------')

  files <- list.files(painter)
  j <- 1
  for (file in files){
    im <- load.image(file.path(painter, file))
    x <- image_preprocessing(im)
    X <- rbind(X, x)
    y <- c(y, painter_name)
    print(paste('Finished file ', j, ' out of ', length(files), '..', sep = ''))
    j <- j + 1
  }
  # Save this painter's features and checkpoint the labels collected so far.
  write.csv(X, file = file.path(painter, paste(painter_name, '.csv', sep = "")))
  write.csv(y, file = './y_train.csv')
}

# Create Combined File: stack the per-painter CSV files into one dataset.
X <- data.frame(matrix(nrow = 0, ncol = 100*100*3))
for (painter in list.dirs()){
  if (nchar(painter) <= 1){
    next
  }
  painter_name <- gsub("[./]", "", painter)
  print(paste('Loaded ', painter_name, ' dataset..', sep = ""))
  x <- read.csv(file.path(painter, paste(painter_name, '.csv', sep = "")))
  X <- rbind(X, x)
  print('Finished rbinding..')
}
write.csv(X, file = './X_train.csv')

Unsupervised Learning for Artworks

Some of the artists in our data are closer than others. Claude Monet, Pierre-Auguste Renoir, Alfred Sisley, and Frédéric Bazille (our data do not include the last two) met while studying under the academic artist Charles Gleyre. They discovered that they shared an interest in painting landscape and contemporary life rather than historical or mythological scenes. Following a practice that had become increasingly popular by mid-century, they often ventured into the countryside together to paint in the open air, but not for the purpose of making sketches to be developed into carefully finished works in the studio, as was the usual custom. By painting in sunlight directly from nature, and making bold use of the vivid synthetic pigments that had become available since the beginning of the century, they began to develop a lighter and brighter manner of painting that extended further the Realism of Gustave Courbet and the Barbizon school. A favourite meeting place for the artists was the Café Guerbois on Avenue de Clichy in Paris, where the discussions were often led by Édouard Manet, whom the younger artists greatly admired. They were soon joined by Camille Pissarro, Paul Cézanne, and Armand Guillaumin (Wikipedia).

In other words, most of the painters in our sample are French, Van Gogh was Dutch, Childe Hassam and John Singer Sargent were American, and some of them painted in slightly different periods.

Use one or more of the clustering algorithms we discussed in class to cluster artists together based on similarities. Usually, we use unsupervised learning in the earlier stages of a project. Discuss the results and support your claims with at least one plot (in addition to the clustering plot). This plot may relate to your predictions or incorporate any information from an outside source (please mention explicitly any source you used).
NOTE: The dataset is very large, so it is natural that some operations on it will take time.
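As a starting point, the clustering task could be sketched as follows. This is only one possible approach, not a prescribed solution: it runs on synthetic data (an assumption, so that the chunk is self-contained), reduces the features with PCA, clusters the painters hierarchically on their mean PC scores, and runs k-means on the individual paintings. The choices of 10 components, Ward linkage and 3 clusters are arbitrary placeholders.

```r
set.seed(42)

# Synthetic stand-in for X_train / y_train (assumption: the real data have
# 30,000 columns per painting; 60 are used here to keep the sketch fast).
painters <- c("Cezanne", "Degas", "Gauguin", "Hassam", "Matisse",
              "Monet", "Pissarro", "Renoir", "Sargent", "VanGogh")
n_per_painter <- 30
n_features <- 60
y <- rep(painters, each = n_per_painter)
painter_shift <- rep(seq_along(painters) / 10, each = n_per_painter)
X <- matrix(rnorm(length(y) * n_features), nrow = length(y)) + painter_shift

# Reduce dimensionality first: keep the first 10 principal components.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:10]

# One row per painter: the average position of that painter's works in PC space.
painter_means <- rowsum(scores, group = y) / n_per_painter

# Hierarchical clustering of the ten painter profiles.
hc <- hclust(dist(painter_means), method = "ward.D2")
plot(hc, main = "Painters clustered by mean PC scores (synthetic data)")

# k-means on the individual paintings, cross-tabulated against the painter.
km <- kmeans(scores, centers = 3, nstart = 20)
table(painter = y, cluster = km$cluster)
```

Working on PC scores rather than on the raw 30,000 columns keeps the distance computations cheap, which matters given the size of the dataset.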

The following table summarizes the nationalities of the painters in our sample:

painters <- c("Cezanne", "Degas", "Gauguin", "Hassam", "Matisse", "Monet", "Pissarro", "Renoir", "Sargent", "VanGogh")
nationalities <- c("French", "French", "French", "US", "French", "French", "French", "French", "US", "Dutch")

df <- as.data.frame(cbind(painters, nationalities))
df

##    painters nationalities
## 1   Cezanne        French
## 2     Degas        French
## 3   Gauguin        French
## 4    Hassam            US
## 5   Matisse        French
## 6     Monet        French
## 7  Pissarro        French
## 8    Renoir        French
## 9   Sargent            US
## 10  VanGogh         Dutch

Artwork Classification

Build two classification models in order to predict the painter from the painting. The goal here is to make a good prediction. Please include explanations of the process of developing your models. Be as clear and descriptive as you can be.
NOTE: This is a challenging problem, so don’t expect to get the accuracy values you are familiar with from the homework. Remember that we have balanced data with 300 examples per class, so random guessing will yield only 10% correct answers. Using the models we discussed in class, you should expect to achieve at least 4 times the accuracy of a random guess, that is, at least 40%. Nevertheless, in this assignment you are expected to train your model as well as you can. You should explore the model’s RDocumentation (or articles such as this) and tune the hyper-parameters in a loop; each time you reach a new peak in the accuracy on the CV set, update your best model.
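The tuning loop described in the note could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are synthetic, the "CV set" is a single 25% hold-out split, the model is a simple nearest-centroid classifier written in base R (`nearest_centroid` is a hypothetical helper, not a library function), and the only hyper-parameter is the number of principal components. The same pattern applies to any model and any hyper-parameter grid.

```r
set.seed(42)

# Synthetic stand-in for the training data (assumption: 10 balanced classes).
n_classes <- 10; n_per_class <- 40; n_features <- 30
y <- rep(paste0("painter", seq_len(n_classes)), each = n_per_class)
X <- matrix(rnorm(length(y) * n_features), nrow = length(y))
X[, 1:5] <- X[, 1:5] + rep(seq_len(n_classes), each = n_per_class) / 2  # some signal

# Hold out 25% of the rows as the CV (validation) set.
cv_idx <- sample(nrow(X), size = round(0.25 * nrow(X)))
X_tr <- X[-cv_idx, ]; y_tr <- y[-cv_idx]
X_cv <- X[cv_idx, ];  y_cv <- y[cv_idx]

# Fit PCA on the training rows only, then project the CV rows.
pca <- prcomp(X_tr, center = TRUE, scale. = TRUE)
Z_tr <- pca$x
Z_cv <- predict(pca, newdata = X_cv)

# Hypothetical helper: assign each new row to the class with the closest centroid.
nearest_centroid <- function(Z_fit, y_fit, Z_new) {
  sums <- rowsum(Z_fit, group = y_fit)
  counts <- table(y_fit)[rownames(sums)]
  centroids <- sums / as.vector(counts)
  # Squared Euclidean distance from every new row to every centroid.
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(Z_new, 2, centroids[j, ])^2)
  })
  rownames(centroids)[apply(d, 1, which.min)]
}

results <- data.frame()
best <- list(cv_acc = -Inf)
for (n_comp in 2:n_features) {
  cols <- seq_len(n_comp)
  pred_tr <- nearest_centroid(Z_tr[, cols], y_tr, Z_tr[, cols])
  pred_cv <- nearest_centroid(Z_tr[, cols], y_tr, Z_cv[, cols])
  row <- data.frame(n_comp = n_comp,
                    train_acc = mean(pred_tr == y_tr),
                    cv_acc = mean(pred_cv == y_cv))
  results <- rbind(results, row)
  # Update the best configuration whenever the CV accuracy reaches a new peak.
  if (row$cv_acc > best$cv_acc) best <- as.list(row)
}

# Best 15 combinations, ordered by CV accuracy in decreasing order.
top15 <- head(results[order(results$cv_acc, decreasing = TRUE), ], 15)
print(top15, row.names = FALSE)
```

Recording the training accuracy next to the CV accuracy makes overfitting visible: a configuration whose training accuracy keeps rising while its CV accuracy falls is a poor candidate for the final model.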

Preprocessing

Discuss the differences between your models and their assumptions, and explain why you chose them.
Consider manipulating your data to see if it helps you achieve better results.
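Two manipulations that might be worth trying are sketched below, under stated assumptions: the data are synthetic 8 x 8 "paintings" with the same red, green, blue column blocks as the real dataset, the grayscale conversion is a plain average of the three channels, and the 90% variance threshold is an arbitrary choice. Whether either step helps is an empirical question to be checked on the CV set.

```r
set.seed(42)

# Synthetic stand-in: 50 paintings of 8 x 8 pixels with 3 channels (192 columns),
# laid out as a red block, then a green block, then a blue block.
n_images <- 50
n_pixels <- 8 * 8
X <- matrix(runif(n_images * n_pixels * 3), nrow = n_images)

# Manipulation 1: grayscale, by averaging the three colour blocks pixel by pixel.
red   <- X[, 1:n_pixels]
green <- X[, (n_pixels + 1):(2 * n_pixels)]
blue  <- X[, (2 * n_pixels + 1):(3 * n_pixels)]
X_gray <- (red + green + blue) / 3

# Manipulation 2: PCA on standardized columns, keeping the smallest number of
# components whose cumulative share of variance reaches 90%.
pca <- prcomp(X_gray, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(var_explained >= 0.90)[1]
X_reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

dim(X_gray)     # one column per pixel instead of three
dim(X_reduced)  # one column per retained component
```

Note that `scale. = TRUE` fails on columns with zero variance, so constant columns in real data would have to be dropped first.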

Post-Estimation

Evaluate your performance using the tools from the class.
Present the tuning process. Alongside your description, add a table with the hyper-parameters and their corresponding accuracies on the training and CV datasets, ordered by CV accuracy in decreasing order. Show only the best 15 combinations; that is, the table should consist of at most 15 rows.
Explore your predictions. Which paintings were misclassified? Why?
Load X_test and y_test and test your model.
How does the model perform? Discuss.
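One way to explore the predictions is a confusion matrix, sketched below in base R. The labels and predictions are simulated (an assumption: a hypothetical model that is right about 40% of the time and otherwise guesses a painter at random); with a real model, `y_true` and `y_pred` would be the test labels and the model's predictions.

```r
set.seed(42)

painters <- c("Cezanne", "Degas", "Gauguin", "Hassam", "Matisse",
              "Monet", "Pissarro", "Renoir", "Sargent", "VanGogh")

# Simulated true labels: 20 paintings per painter.
y_true <- factor(rep(painters, each = 20), levels = painters)

# Simulated predictions: keep the true label about 40% of the time,
# otherwise draw a painter at random.
is_correct <- runif(length(y_true)) < 0.4
y_pred <- y_true
y_pred[!is_correct] <- sample(painters, sum(!is_correct), replace = TRUE)

# Rows are the actual painters, columns are the predicted painters.
conf_mat <- table(actual = y_true, predicted = y_pred)
print(conf_mat)

# Overall accuracy and the share of each painter's works classified correctly.
overall_acc <- sum(diag(conf_mat)) / sum(conf_mat)
per_painter_acc <- diag(conf_mat) / rowSums(conf_mat)
print(round(per_painter_acc, 2))

# Row indices of the misclassified paintings, for a closer look at the images.
misclassified <- which(y_pred != y_true)
head(misclassified)
```

Large off-diagonal cells point to pairs of painters the model confuses, which is a natural place to start when explaining why paintings were misclassified.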

Machine Learning for Economists (Undergraduate)

David Harar

7/8/2021