Required packages
Below is the list of packages used in the project.
#Packages required for my code
library(tidyverse)
library(lubridate)
library(outliers)
library(forecast)
library(knitr)
library(Hmisc)
Executive Summary
Data pre-processing is an important stage of the data analysis lifecycle which is done primarily along with data import and before modeling/machine learning/visual reports. It is not nessesarily carried out in the same order but follows a high level order of data preperation stages.
Below are some of the important stages:
1. Problem definition by data :
- The problem area is defined at this point which will focus on the data required to solve the problem/problem-space. It will also help in understanding the void of certain critical data which can support creation of the same from existing data. It is also important to iron out details about storage and analytical platforms that are necessary to carry out the analysis
- I have carried out this steps first to understand the datasets needed for the analysis. 2 datasets are imported into R and merged for holistic analysis of all the museum’s artworks and artists
2. Data import:
- This is the stage of bringing all the necessary data onto 1 platform to be processed further into a final analytical dataset or multiple datasets
- Data is usually stored into databases, warehouses or local flat files
- I have imported files directly from the file URLs onto R where further processing will be carried out
3. Understand data:
- This stage involves exploring the structure and the current stage of the data to identify possible next steps needed. The dimensions, data types, summary statistics, invalid values, data dictionary, merge keys are some of the things to look at.
- It is recommended that the analyst understand the data completely before proceeding
- I have done this steps next, but have not moved towards manipulating them just yet. Some pre-processing is needed on the whole dataset before moving towards individual variables/observations
4. Tidy data:
- This is an important stage to clean the data and to bring it to the format that is appealing to look at, and to streamline further analysis.
- Involves converting to tidy format which involves converting attributes into columns, rows into observations and values into a unit of observation
- I have carried this steps right after since there was a lot of tidying that was needed on the dataset before moving towards further processing or beautification/formatting
5. Manipulate data:
- This step involves further changing the structure of the data by merging multiple datasets together and creating additional, meaningful variables and remove non impactful rows and variables and structuring data in an appealing format for viewing as well
- I have carried out a lot of operations at this steps which include individual variable manipulations, formatting and then merging clean datasets together (Initial step of understanding data helps out with this step)
6. Scan and treat missing/outlier values:
- This is an important step to either get rid of missing/special/outlier observations or proxy them with likely/most common values from existing data. Statistical tests and summaries are used to carry out this steps.
- Imputing with mean/median/mode is a popular option. I have split this section into Missing values and Outlier values sections and releavant treatment is done in detail for all relevant variables
- The most important step at this point is to go back to the business and request additional/improved data which is the best approach to deal with the data, if available
7. Transforming data:
- This steps involves transforming data into formats favorable to modeling, inter-variables relationships, to reduce skewness/outliers or to better understand the data)
- I have carried out this step last to get the data ready for visualization or further modeling activities. All numeric columns are viewed under the lens of multiple transformations and appropriate transformations are done, wherever applicable
Data Summary
The MoMA collection was established in 1929 and is a digital database of artworks by various artists across the world, and acquired by the museum across the last 150 years. The 2 datasets feature artwork details and artist details respectively and contain a variety of attributes outlined and summarized below. Both are in public domain under a CC0 License.
work_arts (Columns: 29, Rows: 138,185, URL: artworks.csv)
Columns:
* Title - Title of the artwork (Level of data)
* Artist - Artist name (multiple artists of an artwork are concatenated as untidy data)
* ConstituentID - Artist ID (Also the primary join key for Artists dataset)
* ArtistBio - Artist nationality and active years
* Gender - Gender of the artist
* BeginDate, EndDate - active years of artist
* Medium - medium used to create the art (paint, print, photograph etc)
* Dimensions - physical dimensions of the artwork (contains Height, Width and Depth)
* CreditLine - Purchase line by the architect (purchase, gift, partial gift by a person/organization etc)
* Classification - Type of artwork (Architecture, design, illustrated book etc)
* Department - Higher level classification (architecture, drawing, film, painting etc)
* DateAcquired - Date of acquiring by the Museum
* URL, ThumbnailURL - link to artwork and thumbnail
* Physical dimensions (cm) - Circumference, Depth, Diameter, Height, Length, Weight, Width etc)
Artists (Columns: 9, Rows: 15,333, URL: Artists.csv)
Columns:
* ConstituentID - Primary key and ID of the artist
* DisplayName - Artist name
* ArtistBio, Nationality, Gender, BeginDate, EndDate - same as above
* Wiki QID - Unique identifier in Wikidata for the artist (Also known as Wikidata item identifier)
* ULAN - This is the Getty ULAN ID for the artist (can be used to search in Getty databases
Brief Summary of steps taken and a few early findings:
1. I have looked at the head() and str() (str shown in next section) of each dataset (and explored the entire dataset) to understand the variables, data types and identify immediate next steps
2. I have looked at the structure of the data to identify join keys, level of data and the pre-processing steps needed for each of the columns for further analysis (There are a lot of columns that need manipulations to bring to the right format)
3. The work_arts data is untidy with multiple artists (of an artwork) concatenated in a singe cell and the same is applicable for each of the artist’s attribute in work_arts
4. artworks.csv is imported as work_arts for quick access using R shortcut functions (varying from artists.csv). artists.csv is imported as-is
5. imported csv are converted to dataframe using as.data.frame() while importing using read_csv()
6. I have identified a few redundant columns with ‘ArtistBio’ & ‘Dimensions’ but they will come handy to replace missing values wherever applicable
7. There are also few rows/columns with a majority of missing values. They will be manipulated appropriately in subsequent sections
8. work_arts is huge in size and will further increase after tidying, so an unbiased subset will be also created in subsequent sections (because my RStudio couldn’t process operations without this)
9. Merge key is ‘ConstituentID’ which is also the primary key for artists dataframe
- ‘artists’ has unique details about the artist which need to be brought into the work_arts dataset using the merge operation
#Dataset 1:
work_arts <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv") %>% as.data.frame(row.names = NULL)
print(head(work_arts, n = 5))
#Dataset 2:
artists <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv") %>% as.data.frame(row.names = NULL)
print(head(artists, n = 5))
Understand 1 (Explore only)
Summary of steps and required data type conversions:
Steps:
1. I have checked the structure of datasets using str() and type()
2. I have explored individual columns and identified necessary data type conversions required (outlined below). These will be done post tidying data in the next step)
3. The colSums() of NAs show columns with missing values. Columns with a majority of missing values can be removed after judging its significance for further analysis
4. count of unique values is used to identify possible factor columns mentioned below
Conversions needed (to be executed after tidying and unnesting columns):
1. ConstituentID, BeginDate, EndDate - convert from chr to int
2. Gender - convert from chr to factor (apply lower case conversion), (add factor labels)
3. Nationality, Classification, Department, Catalogued - convert from chr to factor
4. Columns with majority NAs (“Circumference (cm)”, “Diameter (cm)”, “Length (cm)”, “Weight (kg)”, “Seat Height (cm)”,“Duration (sec.)”)
#Understand work_arts
str(work_arts)
'data.frame': 138185 obs. of 29 variables:
$ Title : chr "Ferdinandsbrücke Project, Vienna, Austria (Elevation, preliminary version)" "City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard" "Villa near Vienna Project, Outside Vienna, Austria, Elevation" "The Manhattan Transcripts Project, New York, New York, Introductory panel to Episode 1: The Park" ...
$ Artist : chr "Otto Wagner" "Christian de Portzamparc" "Emil Hoppe" "Bernard Tschumi" ...
$ ConstituentID : chr "6210" "7470" "7605" "7056" ...
$ ArtistBio : chr "(Austrian, 1841–1918)" "(French, born 1944)" "(Austrian, 1876–1957)" "(French and Swiss, born Switzerland 1944)" ...
$ Nationality : chr "(Austrian)" "(French)" "(Austrian)" "()" ...
$ BeginDate : chr "(1841)" "(1944)" "(1876)" "(1944)" ...
$ EndDate : chr "(1918)" "(0)" "(1957)" "(0)" ...
$ Gender : chr "(Male)" "(Male)" "(Male)" "(Male)" ...
$ Date : chr "1896" "1987" "1903" "1980" ...
$ Medium : chr "Ink and cut-and-pasted painted pages on paper" "Paint and colored pencil on print" "Graphite, pen, color pencil, ink, and gouache on tracing paper" "Photographic reproduction with colored synthetic laminate" ...
$ Dimensions : chr "19 1/8 x 66 1/2\" (48.6 x 168.9 cm)" "16 x 11 3/4\" (40.6 x 29.8 cm)" "13 1/2 x 12 1/2\" (34.3 x 31.8 cm)" "20 x 20\" (50.8 x 50.8 cm)" ...
$ CreditLine : chr "Fractional and promised gift of Jo Carole and Ronald S. Lauder" "Gift of the architect in honor of Lily Auchincloss" "Gift of Jo Carole and Ronald S. Lauder" "Purchase and partial gift of the architect in honor of Lily Auchincloss" ...
$ AccessionNumber : chr "885.1996" "1.1995" "1.1997" "2.1995" ...
$ Classification : chr "Architecture" "Architecture" "Architecture" "Architecture" ...
$ Department : chr "Architecture & Design" "Architecture & Design" "Architecture & Design" "Architecture & Design" ...
$ DateAcquired : Date, format: "1996-04-09" ...
$ Cataloged : chr "Y" "Y" "Y" "Y" ...
$ ObjectID : num 2 3 4 5 6 7 8 9 10 11 ...
$ URL : chr "http://www.moma.org/collection/works/2" "http://www.moma.org/collection/works/3" "http://www.moma.org/collection/works/4" "http://www.moma.org/collection/works/5" ...
$ ThumbnailURL : chr "http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167" "http://www.moma.org/media/W1siZiIsIjk3Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=55b65fa4368fe00a" "http://www.moma.org/media/W1siZiIsIjk4Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=fdcfca4db3acac1f" "http://www.moma.org/media/W1siZiIsIjEyNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=c89b9071486760a5" ...
$ Circumference (cm): logi NA NA NA NA NA NA ...
$ Depth (cm) : num NA NA NA NA NA NA NA NA NA NA ...
$ Diameter (cm) : num NA NA NA NA NA NA NA NA NA NA ...
$ Height (cm) : num 48.6 40.6 34.3 50.8 38.4 ...
$ Length (cm) : logi NA NA NA NA NA NA ...
$ Weight (kg) : num NA NA NA NA NA NA NA NA NA NA ...
$ Width (cm) : num 168.9 29.8 31.8 50.8 19.1 ...
$ Seat Height (cm) : logi NA NA NA NA NA NA ...
$ Duration (sec.) : logi NA NA NA NA NA NA ...
- attr(*, "problems")= tibble [2,927 x 5] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:2927] 1002 1003 1004 1005 1007 1008 1011 1041 1042 1045 ...
..$ col : chr [1:2927] "Length (cm)" "Length (cm)" "Length (cm)" "Length (cm)" ...
..$ expected: chr [1:2927] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
..$ actual : chr [1:2927] "17.8" "19.0" "18.1" "17.78" ...
..$ file : chr [1:2927] "'https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv'" "'https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv'" "'https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv'" "'https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv'" ...
- attr(*, "spec")=
.. cols(
.. Title = col_character(),
.. Artist = col_character(),
.. ConstituentID = col_character(),
.. ArtistBio = col_character(),
.. Nationality = col_character(),
.. BeginDate = col_character(),
.. EndDate = col_character(),
.. Gender = col_character(),
.. Date = col_character(),
.. Medium = col_character(),
.. Dimensions = col_character(),
.. CreditLine = col_character(),
.. AccessionNumber = col_character(),
.. Classification = col_character(),
.. Department = col_character(),
.. DateAcquired = col_date(format = ""),
.. Cataloged = col_character(),
.. ObjectID = col_double(),
.. URL = col_character(),
.. ThumbnailURL = col_character(),
.. `Circumference (cm)` = col_logical(),
.. `Depth (cm)` = col_double(),
.. `Diameter (cm)` = col_double(),
.. `Height (cm)` = col_double(),
.. `Length (cm)` = col_logical(),
.. `Weight (kg)` = col_double(),
.. `Width (cm)` = col_double(),
.. `Seat Height (cm)` = col_logical(),
.. `Duration (sec.)` = col_logical()
.. )
#Summary of data types. Need data type conversions (see summary after code chuck)
sapply(work_arts, class) %>% as.data.frame() %>% set_names("colType") %>% count(colType)
work_arts %>% {colSums(is.na(.))} #check count NAs per column
Title Artist ConstituentID
39 1309 1309
ArtistBio Nationality BeginDate
6029 1309 1309
EndDate Gender Date
1309 1309 2214
Medium Dimensions CreditLine
9735 9894 2431
AccessionNumber Classification Department
0 0 0
DateAcquired Cataloged ObjectID
7134 0 0
URL ThumbnailURL Circumference (cm)
51265 62089 138185
Depth (cm) Diameter (cm) Height (cm)
124392 136717 17934
Length (cm) Weight (kg) Width (cm)
138185 137896 18856
Seat Height (cm) Duration (sec.)
138185 138185
work_arts %>% sapply(function(x) length(unique(x))) #check unique per each columns
Title Artist ConstituentID
98573 13706 13746
ArtistBio Nationality BeginDate
8168 1054 2216
EndDate Gender Date
1285 410 9374
Medium Dimensions CreditLine
21251 84352 7224
AccessionNumber Classification Department
138185 32 8
DateAcquired Cataloged ObjectID
1756 2 138185
URL ThumbnailURL Circumference (cm)
86921 75496 1
Depth (cm) Diameter (cm) Height (cm)
1369 511 3816
Length (cm) Weight (kg) Width (cm)
1 201 3848
Seat Height (cm) Duration (sec.)
1 1
#Understand artists
str(artists)
'data.frame': 15233 obs. of 9 variables:
$ ConstituentID: num 1 2 3 4 5 6 7 9 10 11 ...
$ DisplayName : chr "Robert Arneson" "Doroteo Arnaiz" "Bill Arnold" "Charles Arnoldi" ...
$ ArtistBio : chr "American, 1930–1992" "Spanish, born 1936" "American, born 1941" "American, born 1946" ...
$ Nationality : chr "American" "Spanish" "American" "American" ...
$ Gender : chr "Male" "Male" "Male" "Male" ...
$ BeginDate : num 1930 1936 1941 1946 1941 ...
$ EndDate : num 1992 0 0 0 0 ...
$ Wiki QID : chr NA NA NA "Q1063584" ...
$ ULAN : num NA NA NA 5e+08 NA ...
- attr(*, "spec")=
.. cols(
.. ConstituentID = col_double(),
.. DisplayName = col_character(),
.. ArtistBio = col_character(),
.. Nationality = col_character(),
.. Gender = col_character(),
.. BeginDate = col_double(),
.. EndDate = col_double(),
.. `Wiki QID` = col_character(),
.. ULAN = col_double()
.. )
#Summary ff data types
sapply(artists, class) %>% as.data.frame() %>% set_names("colType") %>% count(colType)
artists %>% {colSums(is.na(.))} #check count NAs per column
ConstituentID DisplayName ArtistBio Nationality
0 0 2203 2448
Gender BeginDate EndDate Wiki QID
3131 0 0 11975
ULAN
12297
artists %>% sapply(function(x) length(unique(x))) #check unique per each column
ConstituentID DisplayName ArtistBio Nationality
15233 15178 7084 119
Gender BeginDate EndDate Wiki QID
7 236 169 3254
ULAN
2934
Tidy & Manipulate Data I (and II)
The work_arts dataset is untidy because:
1. It has multiple values concatenated into a single cell, so eaech row doesn’t correspond to a single observation
2. Multiple columns are present in a single column, ex: artistBio and Dimensions (while individual columns are present, they will be used for data treatment before being tidied)
3. Few columns like Dimensions colum is difficult to understand due to a lot of garbage values with important information
The below steps are taken (most correspond to tidying data):
1. strFunc is a user defined function used to identify number of concat values in artist attributes like Artist, ConstituentID, Nationality, BeginDate, EndDate and Gender. Nationality is used as an anchor and concat values of 3 or more are excluded (Data is already huge and R was already at the breaking point for executing in my machine)
- The requirement of creating a new column in the assignment is done here 2. A filter is used next to filter out artworks with title longer than 50 characters (using nchar())
3. Text-heavy, unimportant/redundant columns like ArtistBio, CreditLine, URL and ThumbnailURL are removed at this point to make the data lesser in size
4. Concat columns are unnested into individual rows using unnest() function which converts individual item into its own row (This essentially did what separate+gather operations would do)
5. Distinct of the dataset is used to get rid of duplicate observations, if any
6. Columns with >90% NAs were removed to reduce data size (assuming no critical analysis on these dimensions at the moment). COlumns removed are Circumference, Diameter, Length, Wiegth, Seat Height, Duration
7. A large sample of 20000 observations is extracted for further analysis using a random sample (This is done to avoid heavy lag for further executions. Assumption here is that the sample derived exhibits the properties of the original dataset)
8. set.seed(5) is used below for reproducible results and dataset is saved as wa_filt
9. Now the data is in a tidy format, so further operations will focus on the data type, consistent levels and string cleaning activities
10. Additional comments in code chunks
#Tidy up work_arts:
#Filter-in observations with concat values less than 3 by using custom function and creating
#column 'flag' to filter. This is done to reduce the data size
strFunc <- function(x) {str_count(pattern = "[)] [(]", string = x)}
work_arts<-work_arts %>% mutate(flag = strFunc(Nationality)) %>% filter(flag<3) %>% select(-flag)
#Remove columns with unimportant and text-heavy string data like ArtistBio, CreditLine, URL, ThumbnailURL
#Create col 'nchar' to filter out 'Title' with characters <= 50. This is done to reduce the data size (Recommended by TA)
work_arts <- work_arts %>% select(-c(ArtistBio, CreditLine, URL, ThumbnailURL)) %>%
mutate(nchar = nchar(work_arts$Title)) %>% filter(nchar<=50) %>% select(-nchar)
#Unnest and convert multiple columns in the data to long format (Do before changing data types as clean up
#is needed before type change). 'regEx' expressions are used to split and unnest based on earlier analysis
work_arts <- work_arts %>% mutate(Artist = str_split(pattern = ", ", string = Artist)) %>% unnest(Artist) %>%
mutate(ConstituentID = str_split(pattern = ", ", string = ConstituentID)) %>% unnest(ConstituentID) %>%
mutate(Nationality = str_split(pattern = "[)] [(]", string = Nationality)) %>% unnest(Nationality) %>%
mutate(BeginDate = str_split(pattern = "[)] [(]", string = BeginDate)) %>% unnest(BeginDate) %>%
mutate(EndDate = str_split(pattern = "[)] [(]", string = EndDate)) %>% unnest(EndDate) %>%
mutate(Gender = str_split(pattern = "[)] [(]", string = Gender)) %>% unnest(Gender) %>% distinct()
#drop columns which contain mostly NA values. 6 columns have at least 90% NAs.
#Remove these using a select function and -c()
work_arts <- work_arts %>%
select(-c("Circumference (cm)", "Diameter (cm)", "Length (cm)", "Weight (kg)",
"Seat Height (cm)","Duration (sec.)"))
#As the data size is high, a sample data is created from the dataset using a sample_n() function from dplyr.
#Set.seed() is used for reproducible results, and a sample of 20,000 rows is random chosen from the data.
#Assign the sample dataset to a new name wa_filt'
set.seed(5)
wa_filt <- work_arts %>% sample_n(size = 20000, replace = F)
Understand 2 (Execution after tidying data)
This section focuses on data formatting and accuracy of column representations
1. A replacement operation is carried out to get rid of leading, trailing ‘(’ or ‘)’ characters which are replaced with "". This is done for Nationality, BeginDate, EndDate and Gender
2. Upper case in Gender is converted to lower for consistent representation and to aid creation of factor levels
3. Gender is converted to a Factor variable and relevant labels are provided. Since it is an unordered column, ordered is set to FALSE
4. ‘Unknown’ and ‘Nationality unknown’ are converted to ‘unknown’ for consistent representation (assuming a biz requirement for consistency)
5. ‘(not assigned)’ in Classification is converted to ‘Not Assigned’ in line with the other levels in the column
6. ConstituentID, BeginDate, EndDate are converted to Integer data type
7. Nationality, Classification, Department and Cataloged are converted to factor variables (levels are not changed at the moment, but can be converted as per biz requirement. Gender is already converted with labels above)
8. BeginDate, EndDate and Date are converted to date type by converting the existing ‘year’ format to ‘ymd’ by adding the beginning of year attributes (day, month is added as per necessary formatting). This will aid any further analysis based on date operations like difference in duration, comparison to a custom date or help in updating more accurate dates when the museum is able to procure more information later)
9. ‘Date’ is renamed to ‘DateMade’ for readability with other data columns like ‘BeginDate’, ‘EndDate’ and ‘DateAcquired’
10. NAs in merge column are checked before the subsequent merge (The analysis is run already before, and no NAs are found. Code is not run again to avoid maxing page limit)
11. Additional comments in code chunks
#Clean up columns with '(' ')' in column names created by unnest() function in 4 columns
wa_filt <- wa_filt %>% mutate(Nationality = str_replace_all(wa_filt$Nationality, pattern = "[()]", replacement = ""),
BeginDate = str_replace_all(wa_filt$BeginDate, pattern = "[()]", replacement = ""),
EndDate = str_replace_all(wa_filt$EndDate, pattern = "[()]", replacement = ""),
Gender = str_replace_all(wa_filt$Gender, pattern = "[()]", replacement = ""))
#Convert Gender to lower case and convert to factor with relevant labels
wa_filt$Gender <- tolower(wa_filt$Gender)
wa_filt$Gender <- factor(wa_filt$Gender, levels = c("male", "female", ""),
labels = c("Male", "Female", "Unknown"), ordered = FALSE)
#Convert Nationality to valid values using case when statement. Convert to factor next
wa_filt <- wa_filt %>% mutate(Nationality = case_when(as.character(Nationality)==""~"Unknown",
as.character(Nationality)=="Nationality unknown"~"Unknown",
TRUE~as.character(Nationality)))
#Convert Classification values to consistent format
wa_filt$Classification[wa_filt$Classification=="(not assigned)"] <- "Not Assigned"
#Convert relevant columns to Integer and factor type (Gender already converted to Factor above)
wa_filt<-wa_filt %>% mutate_at(c("ConstituentID", "BeginDate", "EndDate"), .funs = as.integer)
wa_filt<-wa_filt %>% mutate_at(c("Nationality","Classification", "Department", "Cataloged"), .funs = as.factor)
#convert years to date
wa_filt$BeginDate <- as.Date(paste(wa_filt$BeginDate, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt$EndDate <- as.Date(paste(wa_filt$EndDate, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt$Date <- as.Date(paste(wa_filt$Date, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt <- plyr::rename(x = wa_filt, replace = c("Date" = "DateMade"))
#check for missing values in merge keys i.e. ConstituentID. Result from before: No missing values. Ready for merge
Merge Data to create Master dataset
- wa_filt is merged with artists based on ‘ConstituentID’ column which is the primary key for artists and foreign key for wa_filt.
- As other columns in artists already present in wa_filt, only ‘wiki QID’ and ‘ULAN’ columns are brought into the ‘mergedArt’ dataset
- Sum of NAs in mergedArt are explored for the subsequent section on missing value identification and treatment. This information guides the next section
#Merge 'wa_filt' and 'artists' on key 'ConstituentID'
mergedArt <- wa_filt %>% left_join(artists[,c("ConstituentID", "Wiki QID", "ULAN")], by = "ConstituentID")
colSums(is.na(mergedArt)) #Identify columns with missing values for treatment in next section
Title Artist ConstituentID Nationality
0 0 0 0
BeginDate EndDate Gender DateMade
0 0 0 4132
Medium Dimensions AccessionNumber Classification
636 657 0 0
Department DateAcquired Cataloged ObjectID
0 580 0 0
Depth (cm) Height (cm) Width (cm) Wiki QID
11395 1646 2167 13422
ULAN
13517
Scan I (Missing values)
- Missing values will be identified and relevant treatment technique is used
- A user fucntion to select the mode of a Categorical column is created as the impute function from Hmisc didn’t provide the mode as verified by the count(, sort = TRUE) %>% head(1) method
- DateMade has missing values which are imputed using the Mode of the column (i.e. ‘1971/01/01’). It has ~4k missing values so removing missing values is not recommended. Also, the mode is highly occurring value for the column
- DateAcquired is also imputed with the mode of the column (i.e. ‘2013-10-24’). Around ~6.5k values correspond to the mode. Since the missing values are just 580 in count, excluding selectively during analysis is also an option, but i decided to use the mode as it is highly common as well.
- Medium has ~636 missing values and is imputed with the mode (i.e. ‘Gelatin silver print’) which occurs ~6.7k times
- Dimensions has around 657 missing values but is also considered a redundant column, and for the time being is imputed with the mode (i.e. ‘6 11/16 × 9 11/16" (17 × 24.6 cm)’) since its required for the next step of extracting individual dimensions to treat dimensions column (where they are missing), and is preferred for the subsequent text-heavy operations. It is likely that this column will be dropped after the extraction is executed. Other alternative is to leave the column as-is due to the low count of missing values, but i went ahead with imputing to avoid issues with subsequent string operations. (Refer 9. below)
- Depth has ~50% missing values and is kept as-is for any critical analysis later on. Also the extraction of dimensions will aid in reducing missing values in the column. (Refer 9. below)
- Wiki QID and ULAN have ~60% missing values, but since they are important attributes of the artists, they are not excluded from the dataset and kept as-is.
- Numeric value treatment ((Depth (cm), Height (cm), Width (cm)). Since ‘dimensions’ has the same attributes in it, extract the data from the string column to replace wherever NA. And after that, imputing with either mean or median where NAs are still present (median opted since there are highly skewed outliers. shown below in last sub-point)
- Dimensions is a very dynamic variable with the required info at different positions of the column. So extracting the dimensions is done from the end of string based on locations of the brackets of the last dimension-set. Ex: extract (1cm x 2cm x 5cm) from the garbage values ‘..2cm..1cm x 2cm…. (1cm x 2cm x 5cm)’.
- I have used a series of steps with multiple user defined functions to avoid using stringi package (This was very difficult to achieve!!!). Steps involved are extracting the start and end locations of the braces (and extract into a column by extracting values from str_locate_all() into a list and extracting only the relevant values which are present towards the end of each nested list and a converting integer(0) to NAs using applyNAToStr() function in the code), then use str_sub() to extract the dimensions, replace garbage values, split based on ‘x’, and trim whitespaces, which provides a list of lists
- The requirement of creating a new column in the assignment is also done here
- Extract the Height, Width, Depth respectively using sapply(“[”) and relevant position in the list (e.g. 1,2,3) and convert to double
- Wherever original columns have missing values, relevant non-missing values are replaced across each of the 3 columns (Height, Width, Depth)
- As seen by hist(), there are large outliers in both Height and Width, hence median is used to impute the missing values instead of mean
- All numeric columns are checked for special values like Inf, -Inf and NAN. There are no special values in the dataset
- Additional comments in code chunks
#Missing value identification and treatment
getMode <- function(v) #function to get mode as impute from Hmisc didn't give the right mode as confirmed by count()
{ uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]}
#Check missing value for DateMade and DateAcquired
mergedArt %>% count(DateMade, sort = TRUE) %>% head(1)#Mode is '1971/01/01' which can be used for missing values in DateMade
mergedArt$DateMade[is.na(mergedArt$DateMade)] <- getMode(mergedArt$DateMade) #mode is 1971-01-01
mergedArt %>% count(DateAcquired, sort = TRUE) %>% head(1) #mode is 2013-10-24. Many items were acquired that day
mergedArt$DateAcquired[is.na(mergedArt$DateAcquired)] <- getMode(mergedArt$DateAcquired)
#Check missing values for categorical columns
mergedArt %>% count(Medium, sort = TRUE) %>% head(1) #mode
mergedArt[is.na(mergedArt$Medium),]$Medium <- getMode(mergedArt$Medium) #'Gelatin silver print' is the mode with ~6.7k obs
mergedArt %>% count(Dimensions, sort= TRUE) %>% head(1) #mode
mergedArt[is.na(mergedArt$Dimensions),]$Dimensions <- getMode(mergedArt$Dimensions)
#As 'Depth (cm)' has more than 50% missing values, the column can actually be excluded,
#but this has to be done before ruling out statistical significance and business rules.
#Since its a numeric dimension column, i am assuming its of importance and retain it. WIll be used with na.rm=TRUE
#Wiki QID & ULAN on the other hand has >60% missing and doesn't have any pattern as its an ID column,
#But since its an important attribute of the artist, the columns are not excluded from the dataset
#Numeric missing value treatment (Depth (cm), Height (cm), Width (cm)). Since 'dimensions' has the same attributes in it, extract the data from the string column to replace wherever NA. And after that impute with either mean or median
#Dimensions is a very dynamic variable with the required info at different positions of the column. So extract the dimensions from the end of string based on locations of the brackets of the last dimension-set. Ex: extract (1cm x 2cm x 5cm) from the garbage values '..2cm..1cm x 2cm.... (1cm x 2cm x 5cm)'
#I have used a series of steps with multiple user defined functions to avoid using stringi package
applyNAToStr<- function(x) #function to apply NA wherever the penultimate function below creates integer(0) values, i.e. blank values, thus allowing conversion to column (DimensionsStart, DimensionsEnd) to add back to the dataset
{if(length(x) == 0)
return (NA)
return (x)}
mergedArt$DimensionsStart <- mergedArt$Dimensions %>%
str_locate_all(., pattern = "[(]") %>% lapply(function(x) x[,1]) %>%
lapply(function(x) x[length(x)]) %>% sapply(applyNAToStr) #Apply a series of user defined functions to extract the start position of the required dimensions
mergedArt$DimensionsEnd <- mergedArt$Dimensions %>%
str_locate_all(., pattern = "[)]") %>% lapply(function(x) x[,1]) %>%
lapply(function(x) x[length(x)]) %>% sapply(applyNAToStr) #End position of the required dimensions
mergedArt$newDim <- mergedArt$Dimensions %>%
str_sub(start = mergedArt$DimensionsStart, end = mergedArt$DimensionsEnd) %>%
str_replace("[(]", "") %>% str_replace("[)]", "") %>% str_replace("cm", "") %>%
str_split("[x×]") %>% sapply(str_trim, side = c("both")) #Extract the dimensions dynamically from the column
mergedArt$newHeight<-sapply(mergedArt$newDim, "[", 1) %>% as.double() #Extract Height into a new temporary Dimension column
NAs introduced by coercion
mergedArt$newWidth<-sapply(mergedArt$newDim, "[", 2) %>% as.double() #Extract Width into a new temporary Dimension column
NAs introduced by coercion
mergedArt$newDepth<-sapply(mergedArt$newDim, "[", 3) %>% as.double() #Extract Depth into a new temporary Dimension column
NAs introduced by coercion
mergedArt <- mergedArt %>% mutate(`Height (cm)` = if_else(is.na(`Height (cm)`), newHeight, `Height (cm)`),
`Width (cm)` = if_else(is.na(`Width (cm)`), newWidth, `Width (cm)`),
`Depth (cm)` = if_else(is.na(`Depth (cm)`), newDepth, `Depth (cm)`)) %>% select(-newHeight, -newWidth, -newDepth, -DimensionsStart, -DimensionsEnd) #replace with values in new temp columns wherever dimensions are NAs and remove the temporary Dimension columns
par(mfrow=c(1, 2))
mergedArt$`Height (cm)` %>% hist() #Has a few outliers with very high values. Use median for imputing
mergedArt$`Width (cm)` %>% hist() #Has a few outliers with very high values. Use median for imputing

mergedArt[is.na(mergedArt$`Height (cm)`),]$`Height (cm)` <- median(mergedArt$`Height (cm)`, na.rm = TRUE)
mergedArt[is.na(mergedArt$`Width (cm)`),]$`Width (cm)` <- median(mergedArt$`Width (cm)`, na.rm = TRUE)
#Check for Special (NAN, Inf, -Inf) values
checkSpecialValues <- function(x)
{
is.infinite(x) | is.nan(x)
}
mergedArt %>% select_if(.predicate = is.numeric) %>% sapply(function(x) sum(checkSpecialValues(x))) #No special values
ConstituentID ObjectID Depth (cm) Height (cm)
0 0 0 0
Width (cm) ULAN
0 0
Scan II (Outliers)
- Instead of starting with the hist() or boxplot() to visualize the outliers, i went ahead with the z-score approach to identify the outliers on all numeric columns using select_if(is.numeric). The head() of the zScores dataset is quickly visualized. Values have been converted to zscores as identified from the abs(zscores)>3
- Histogram is then checked for each column to ensure normal distribution for using z-scores. Height and Width slightly resemble normal distribution while Depth doesn’t. Further analysis follows
- Height seems to have 223 outliers as identified by zScores from 1574.8cm to 215.9cm. However most of them are specific artworks which have larger height than majority of them. SO removing them would create a biased dataset, so capped can be used on Height column.
- Before that, counts of top 10 outliers is also viewed. The boxplot of the column is also viewed. Top outliers from largest to smallest is viewed (only top 10 shown using head() because of the limit)
- Capping function is created to cap to 5% and 95% quantiles. Capping is done for Height column
- Width has 253 outliers from 1705cm to 164.9cm. These all correspond to specific artworks and hence can be safely assumed to be actual data and not a documenting error. Capping is carried out on Width column to avoid creating a biased dataset by removing. Top 10 outliers are also viewed in the process (same as height)
- Depth has ~50% missing values, so it would not be highly recommended to treat the data. Depth column in the data is provided where painting with 3 dimensions need to be explicitly analyzed if needed, and hence are not deliberately captured for many artworks while creating data. Also, z-scores are not applicable since it doesn’t even remotely resemble normal distribution. My assumption is that the column would not feature in any modeling activity and hence is not capped or treated at the moment
- Boxplot is also plotted for each of the columns for visual representation of the distribution and help in treating the outliers
- ConstituentID, ULAN and ObjectID are unique identifiers and do not have a quantitative aspect, hence outlier treatment is not carried out for them.
- Additional comments in code chunks
#outlier detection and treatment:
par(mfrow = (c(1,2)))
mergedArt$`Height (cm)` %>% hist(xlab = "Height", breaks = 25) #check if Height is remotely similar to normal distribution, to use z-scores. It slightly resembles normal distribution, so we will go ahead with z-scores
zScores<-mergedArt %>% select_if(.predicate = is.numeric) %>% scores(type = "z") #Create z scores summary for all numeric values. Individual columns focused on next
zScores %>% head(5)
#'Height' has 223 outliers as shown using z scores from 1574.8cm to 215.9cm.
#'However most are specific artworks so excluding them would create a biased data. so use capping to quartile
mergedArt$`Height (cm)` %>% boxplot(las=2, xlab = "Height") #Explore the outliers using boxplot

mergedArt %>% count(`Height (cm)`, sort = TRUE) %>% arrange(-`Height (cm)`) %>% head(10) #Identify the value count of top 10 outlier values
zHeight <- scores(mergedArt$`Height (cm)`, type="z")
mergedArt$`Height (cm)`[which(abs(zHeight)>3)] %>% as.data.frame() %>% set_names(c("outl_Height")) %>%
arrange(-outl_Height) %>% head(10) #top 10 largest outlier values
cap <- function(x){ #function to cap to nearest non-outlier value
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ), na.rm = TRUE )
x[ x < quantiles[2] - 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[4]
x}
mergedArt$`Height (cm)` <- mergedArt$`Height (cm)` %>% cap()
#outliers in 'Width'. There are 253 outliers from 1705cm to 164.9cm. All specific artworks except for 3.
#Cap the outliers to avoid biased data
mergedArt$`Width (cm)` %>% hist(xlab = "Width") #Check distribution to ensure that it remotely resembles normal distribution to apply z-scores
mergedArt$`Width (cm)` %>% boxplot(las=2, xlab = "Width") #Explore the outliers using boxplot

zWidth <- scores(mergedArt$`Width (cm)`, type="z")
mergedArt$`Width (cm)`[which(abs(zWidth)>3)] %>% as.data.frame() %>% set_names(c("outl_Width")) %>%
arrange(-outl_Width) %>% head(10) #top 10 largest outlier values
mergedArt$`Width (cm)` <- mergedArt$`Width (cm)` %>% cap()
#outliers in 'Depth'. This column has more than 50% NAs and most of the remaining values are 0 (as
#most of the painting have a depth ~0). There is no benefit of treating outliers for this column
#This column is available in case artworks with 3 dimensions need to be analyzed explicitly
mergedArt$`Depth (cm)` %>% hist(xlab = "Depth")#No outliers using z-scores since the data is not normally distributed.
mergedArt$`Depth (cm)` %>% boxplot(xlab = "Depth") #Using Tukey's emthod there are many outliers that are detected. But, since most of the data is either NA or 0, outlier detection wont make sense for this column

---
title: "MATH2349 Data Wrangling"
author: "RAVI MELWYN ARANHA : S3852052"
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
---
## Required packages 

Below is the list of packages used in the project.

```{r Packages, message=FALSE, warning=FALSE}
#Packages required for my code
library(tidyverse)
library(lubridate)
library(outliers)
library(forecast)
library(knitr)
library(Hmisc)
```


## Executive Summary 

Data pre-processing is an important stage of the data analysis lifecycle which is done primarily along with data import and before modeling/machine learning/visual reports. It is not nessesarily carried out in the same order but follows a high level order of data preperation stages.  

**Below are some of the important stages:**  
**1. Problem definition by data :**   
\   \   \   \   \  - The problem area is defined at this point which will focus on the data required to solve the problem/problem-space. It will also help in understanding the void of certain critical data which can support creation of the same from existing data. It is also important to iron out details about storage and analytical platforms that are necessary to carry out the analysis  
\   \   \   \   \  - *I have carried out this steps first to understand the datasets needed for the analysis. 2 datasets are imported into R and merged for holistic analysis of all the museum's artworks and artists*  
**2. Data import:**  
\   \   \   \   \  - This is the stage of bringing all the necessary data onto 1 platform to be processed further into a final analytical dataset or multiple datasets  
\   \   \   \   \  - Data is usually stored into databases, warehouses or local flat files   
    \   \   \   \   \  - *I have imported files directly from the file URLs onto R where further processing will be carried out*  
**3. Understand data:**  
\   \   \   \   \  - This stage involves exploring the structure and the current stage of the data to identify possible next steps needed. The dimensions, data types, summary statistics, invalid values, data dictionary, merge keys are some of the things to look at.  
\   \   \   \   \  - It is recommended that the analyst understand the data completely before proceeding  
\   \   \   \   \  - *I have done this steps next, but have not moved towards manipulating them just yet. Some pre-processing is needed on the whole dataset before moving towards individual variables/observations*  
**4. Tidy data:**  
\   \   \   \   \  - This is an important stage to clean the data and to bring it to the format that is appealing to look at, and to streamline further analysis.   
\   \   \   \   \  - Involves converting to tidy format which involves converting attributes into columns, rows into observations and values into a unit of observation   
\   \   \   \   \  - *I have carried this steps right after since there was a lot of tidying that was needed on the dataset before moving towards further processing or beautification/formatting*   
**5. Manipulate data:**  
\   \   \   \   \  - This step involves further changing the structure of the data by merging multiple datasets together and creating additional, meaningful variables and remove non impactful rows and variables and structuring data in an appealing format for viewing as well  
\   \   \   \   \  - *I have carried out a lot of operations at this steps which include individual variable manipulations, formatting and then merging clean datasets together (Initial step of understanding data helps out with this step)*  
**6. Scan and treat missing/outlier values:**  
\   \   \   \   \  - This is an important step to either get rid of missing/special/outlier observations or proxy them with likely/most common values from existing data. Statistical tests and summaries are used to carry out this steps.   
\   \   \   \   \  - *Imputing with mean/median/mode is a popular option. I have split this section into Missing values and Outlier values sections and releavant treatment is done in detail for all relevant variables*  
\   \   \   \   \  - *The most important step at this point is to go back to the business and request additional/improved data which is the best approach to deal with the data, if available*  
**7. Transforming data:**  
\   \   \   \   \  - This steps involves transforming data into formats favorable to modeling, inter-variables relationships, to reduce skewness/outliers or to better understand the data)   
\   \   \   \   \  - *I have carried out this step last to get the data ready for visualization or further modeling activities. All numeric columns are viewed under the lens of multiple transformations and appropriate transformations are done, wherever applicable*  


## Data Summary
The [MoMA collection](https://www.moma.org/collection/) was established in 1929 and is a digital database of artworks by various artists across the world, and acquired by the museum across the last 150 years. The 2 datasets feature artwork details and artist details respectively and contain a variety of attributes outlined and summarized below. Both are in public domain under a CC0 License.

**work_arts** (Columns: 29, Rows: 138,185, URL: [artworks.csv](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv))  
Columns:   
* Title - *Title of the artwork (Level of data)*    
* Artist - *Artist name (multiple artists of an artwork are concatenated as untidy data)*   
* ConstituentID - *Artist ID (Also the primary join key for Artists dataset)*   
* ArtistBio - *Artist nationality and active years*   
* Gender - *Gender of the artist*   
* BeginDate, *EndDate - active years of artist*   
* Medium - *medium used to create the art (paint, print, photograph etc)*   
* Dimensions - *physical dimensions of the artwork (contains Height, Width and Depth)*   
* CreditLine - *Purchase line by the architect (purchase, gift, partial gift by a person/organization etc)*   
* Classification - *Type of artwork (Architecture, design, illustrated book etc)*   
* Department - *Higher level classification (architecture, drawing, film, painting etc)*   
* DateAcquired - *Date of acquiring by the Museum*   
* URL, ThumbnailURL - *link to artwork and thumbnail*   
* Physical dimensions (cm) - *Circumference, Depth, Diameter, Height, Length, Weight, Width etc)*   

**Artists** (Columns: 9, Rows: 15,333, URL: [Artists.csv](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv))  
Columns:   
* ConstituentID - *Primary key and ID of the artist*   
* DisplayName - *Artist name*   
* ArtistBio, *Nationality, Gender, BeginDate, EndDate - same as above*   
* Wiki QID -  *Unique identifier in Wikidata for the artist (Also known as Wikidata item identifier)*   
* ULAN - *This is the Getty ULAN ID for the artist (can be used to search in Getty databases*   

**Brief Summary of steps taken and a few early findings:**  
1. I have looked at the *head()* and *str()* *(str shown in next section)* of each dataset *(and explored the entire dataset)* to understand the variables, data types and identify immediate next steps   
2. I have looked at the structure of the data to identify join keys, level of data and the pre-processing steps needed for each of the columns for further analysis *(There are a lot of columns that need manipulations to bring to the right format)*  
3. The work_arts data is untidy with multiple artists *(of an artwork)* concatenated in a singe cell and the same is applicable for each of the artist's attribute in work_arts  
4. artworks.csv is imported as **work_arts** for quick access using R shortcut functions *(varying from artists.csv)*. artists.csv is imported as-is  
5. imported csv are converted to dataframe using as.data.frame() while importing using read_csv()  
6. I have identified a few redundant columns with '*ArtistBio*' & '*Dimensions*' but they will come handy to replace missing values wherever applicable  
7. There are also few rows/columns with a majority of missing values. They will be manipulated appropriately in subsequent sections  
8. **work_arts** is huge in size and will further increase after tidying, so an unbiased subset will be also created in subsequent sections *(because my RStudio couldn't process operations without this)*  
9. Merge key is '*ConstituentID*' which is also the primary key for **artists** dataframe  
\   \   \   \   \  - *'**artists**' has unique details about the artist which need to be brought into the **work_arts** dataset using the merge operation*  

```{r Data Import, message=FALSE, warning=FALSE}
#Dataset 1:
work_arts <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv") %>% as.data.frame(row.names = NULL)
print(head(work_arts, n = 5))
#Dataset 2:
artists <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv") %>% as.data.frame(row.names = NULL)
print(head(artists, n = 5))
```

## Understand 1 (Explore only) 

**Summary of steps and required data type conversions:**  
**Steps:**  
1. I have checked the structure of datasets using str() and type()   
2. I have explored individual columns and identified necessary data type conversions required (outlined below). These will be done post tidying data in the next step)   
3. The colSums() of NAs show columns with missing values. Columns with a majority of missing values can be removed after judging its significance for further analysis  
4. count of unique values is used to identify possible factor columns mentioned below  

**Conversions needed (to be executed after tidying and unnesting columns):**   
1. ConstituentID, BeginDate, EndDate - convert from chr to int  
2. Gender - convert from chr to factor (apply lower case conversion), (add factor labels)    
3. Nationality, Classification, Department, Catalogued - convert from chr to factor  
4. Columns with majority NAs ("Circumference (cm)", "Diameter (cm)", "Length (cm)", "Weight (kg)", "Seat Height (cm)","Duration (sec.)")  

```{r Explore Data}
#Understand work_arts
str(work_arts)
#Summary of data types. Need data type conversions (see summary after code chuck)
sapply(work_arts, class) %>% as.data.frame() %>% set_names("colType") %>% count(colType)
work_arts %>% {colSums(is.na(.))} #check count NAs per column
work_arts %>% sapply(function(x) length(unique(x))) #check unique per each columns

#Understand artists
str(artists)
#Summary ff data types
sapply(artists, class) %>% as.data.frame() %>% set_names("colType") %>% count(colType)
artists %>% {colSums(is.na(.))} #check count NAs per column
artists %>% sapply(function(x) length(unique(x))) #check unique per each column

```

##	Tidy & Manipulate Data I (and II)

**The work_arts dataset is untidy because:**  
1. It has multiple values concatenated into a single cell, so eaech row doesn't correspond to a single observation  
2. Multiple columns are present in a single column, ex: artistBio and Dimensions (while individual columns are present, they will be used for data treatment before being tidied)  
3. Few columns like Dimensions colum is difficult to understand due to a lot of garbage values with important information  

**The below steps are taken (most correspond to tidying data):**  
1. strFunc is a user defined function used to identify number of concat values in artist attributes like Artist, ConstituentID, Nationality, BeginDate, EndDate and Gender. Nationality is used as an anchor and concat values of 3 or more are excluded (Data is already huge and R was already at the breaking point for executing in my machine)   
\   \   \   \   \  *- The requirement of creating a new column in the assignment is done here*
2. A filter is used next to filter out artworks with title longer than 50 characters (using nchar())  
3. Text-heavy, unimportant/redundant columns like ArtistBio, CreditLine, URL and ThumbnailURL are removed at this point to make the data lesser in size  
4. Concat columns are unnested into individual rows using unnest() function which converts individual item into its own row (This essentially did what separate+gather operations would do)  
5. Distinct of the dataset is used to get rid of duplicate observations, if any  
6. Columns with >90% NAs were removed to reduce data size (assuming no critical analysis on these dimensions at the moment). COlumns removed are Circumference, Diameter, Length, Wiegth, Seat Height, Duration  
7. A large sample of 20000 observations is extracted for further analysis using a random sample (This is done to avoid heavy lag for further executions. Assumption here is that the sample derived exhibits the properties of the original dataset)  
8. set.seed(5) is used below for reproducible results and dataset is saved as wa_filt  
9. Now the data is in a tidy format, so further operations will focus on the data type, consistent levels and string cleaning activities  
10. Additional comments in code chunks  


```{r Tidy and Manipulate}
#Tidy up work_arts:
#Filter-in observations with concat values less than 3 by using custom function and creating 
#column 'flag' to filter. This is done to reduce the data size
strFunc <- function(x) {str_count(pattern = "[)] [(]", string = x)}
work_arts<-work_arts %>% mutate(flag = strFunc(Nationality)) %>% filter(flag<3) %>% select(-flag)

#Remove columns with unimportant and text-heavy string data like ArtistBio, CreditLine, URL, ThumbnailURL
#Create col 'nchar' to filter out 'Title' with characters <= 50. This is done to reduce the data size (Recommended by TA)
work_arts <- work_arts %>% select(-c(ArtistBio, CreditLine, URL, ThumbnailURL))  %>%  
  mutate(nchar = nchar(work_arts$Title))  %>% filter(nchar<=50) %>% select(-nchar)

#Unnest and convert multiple columns in the data to long format (Do before changing data types as clean up 
#is needed before type change). 'regEx' expressions are used to split and unnest based on earlier analysis
work_arts <- work_arts %>% mutate(Artist = str_split(pattern = ", ", string = Artist)) %>% unnest(Artist) %>%
  mutate(ConstituentID = str_split(pattern = ", ", string = ConstituentID)) %>% unnest(ConstituentID) %>%
  mutate(Nationality = str_split(pattern = "[)] [(]", string = Nationality)) %>% unnest(Nationality) %>%
  mutate(BeginDate = str_split(pattern = "[)] [(]", string = BeginDate)) %>% unnest(BeginDate) %>%
  mutate(EndDate = str_split(pattern = "[)] [(]", string = EndDate)) %>% unnest(EndDate) %>% 
  mutate(Gender = str_split(pattern = "[)] [(]", string = Gender)) %>% unnest(Gender) %>% distinct()

#drop columns which contain mostly NA values. 6 columns have at least 90% NAs. 
#Remove these using a select function and -c()
work_arts <- work_arts %>%
  select(-c("Circumference (cm)", "Diameter (cm)", "Length (cm)", "Weight (kg)", 
            "Seat Height (cm)","Duration (sec.)"))

#As the data size is high, a sample data is created from the dataset using a sample_n() function from dplyr.
#Set.seed() is used for reproducible results, and a sample of 20,000 rows is random chosen from the data.
#Assign the sample dataset to a new name wa_filt'
set.seed(5)
wa_filt <- work_arts %>% sample_n(size = 20000, replace = F)

```

##	Understand 2 (Execution after tidying data)

**This section focuses on data formatting and accuracy of column representations**  
1. A replacement operation is carried out to get rid of leading, trailing '(' or ')' characters which are replaced with "". This is done for Nationality, BeginDate, EndDate and Gender  
2. Upper case in Gender is converted to lower for consistent representation and to aid creation of factor levels  
3. Gender is converted to a Factor variable and relevant labels are provided. Since it is an unordered column, ordered is set to FALSE  
4. 'Unknown' and 'Nationality unknown' are converted to 'unknown' for consistent representation (assuming a biz requirement for consistency)  
5. '(not assigned)' in Classification is converted to 'Not Assigned' in line with the other levels in the column  
6. ConstituentID, BeginDate, EndDate are converted to Integer data type  
7. Nationality, Classification, Department and Cataloged are converted to factor variables (levels are not changed at the moment, but can be converted as per biz requirement. Gender is already converted with labels above)  
8. BeginDate, EndDate and Date are converted to date type by converting the existing 'year' format to 'ymd' by adding the beginning of year attributes (day, month is added as per necessary formatting). This will aid any further analysis based on date operations like difference in duration, comparison to a custom date or help in updating more accurate dates when the museum is able to procure more information later)  
9. 'Date' is renamed to 'DateMade' for readability with other data columns like 'BeginDate', 'EndDate' and 'DateAcquired'  
10. NAs in merge column are checked before the subsequent merge (The analysis is run already before, and no NAs are found. Code is not run again to avoid maxing page limit)  
11. Additional comments in code chunks  


```{r Clean dataset}

#Clean up columns with '(' ')' in column names created by unnest() function in 4 columns 
wa_filt <- wa_filt %>% mutate(Nationality = str_replace_all(wa_filt$Nationality, pattern = "[()]", replacement = ""),
                              BeginDate = str_replace_all(wa_filt$BeginDate, pattern = "[()]", replacement = ""),
                              EndDate = str_replace_all(wa_filt$EndDate, pattern = "[()]", replacement = ""),
                              Gender = str_replace_all(wa_filt$Gender, pattern = "[()]", replacement = ""))

#Convert Gender to lower case and convert to factor with relevant labels
wa_filt$Gender <- tolower(wa_filt$Gender)
wa_filt$Gender <- factor(wa_filt$Gender, levels = c("male", "female", ""), 
                         labels = c("Male", "Female", "Unknown"), ordered = FALSE)

#Convert Nationality to valid values using case when statement. Convert to factor next
wa_filt <- wa_filt %>% mutate(Nationality = case_when(as.character(Nationality)==""~"Unknown",
                                                      as.character(Nationality)=="Nationality unknown"~"Unknown", 
                                                      TRUE~as.character(Nationality)))

#Convert Classification values to consistent format
wa_filt$Classification[wa_filt$Classification=="(not assigned)"] <- "Not Assigned"

#Convert relevant columns to Integer and factor type (Gender already converted to Factor above)
wa_filt<-wa_filt %>% mutate_at(c("ConstituentID", "BeginDate", "EndDate"), .funs = as.integer)
wa_filt<-wa_filt %>% mutate_at(c("Nationality","Classification", "Department", "Cataloged"), .funs = as.factor)

#convert years to date
wa_filt$BeginDate <- as.Date(paste(wa_filt$BeginDate, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt$EndDate <- as.Date(paste(wa_filt$EndDate, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt$Date <- as.Date(paste(wa_filt$Date, "1", "1", sep = "-"), format = "%Y-%m-%d")
wa_filt <- plyr::rename(x = wa_filt, replace = c("Date" = "DateMade"))

#check for missing values in merge keys i.e. ConstituentID. Result from before: No missing values. Ready for merge
```
## Merge Data to create Master dataset 

1. wa_filt is merged with artists based on 'ConstituentID' column which is the primary key for artists and foreign key for wa_filt.   
2. As other columns in artists already present in wa_filt, only 'wiki QID' and 'ULAN' columns are brought into the **'mergedArt'** dataset  
3. Sum of NAs in mergedArt are explored for the subsequent section on missing value identification and treatment. This information guides the next section  

```{r Merge Data}
#Merge 'wa_filt' and 'artists' on key 'ConstituentID'
mergedArt <- wa_filt %>% left_join(artists[,c("ConstituentID", "Wiki QID", "ULAN")], by = "ConstituentID")
colSums(is.na(mergedArt)) #Identify columns with missing values for treatment in next section
```


##	Scan I (Missing values)

1. Missing values will be identified and relevant treatment technique is used  
2. A user fucntion to select the mode of a Categorical column is created as the impute function from Hmisc didn't provide the mode as verified by the count(, sort = TRUE) %>% head(1) method  
3. DateMade has missing values which are imputed using the Mode of the column  (i.e. '1971/01/01'). It has ~4k missing values so removing missing values is not recommended. Also, the mode is highly occurring value for the column  
4. DateAcquired is also imputed with the mode of the column (i.e. '2013-10-24'). Around ~6.5k values correspond to the mode. Since the missing values are just 580 in count, excluding selectively during analysis is also an option, but i decided to use the mode as it is highly common as well.  
5. Medium has ~636 missing values and is imputed with the mode (i.e. 'Gelatin silver print') which occurs ~6.7k times  
6. Dimensions has around 657 missing values but is also considered a redundant column, and for the time being is imputed with the mode (i.e. '6 11/16 × 9 11/16" (17 × 24.6 cm)') since its required for the next step of extracting individual dimensions to treat dimensions column (where they are missing), and is preferred for the subsequent text-heavy operations. It is likely that this column will be dropped after the extraction is executed. Other alternative is to leave the column as-is due to the low count of missing values, but i went ahead with imputing to avoid issues with subsequent string operations. (Refer 9. below)  
7. Depth has ~50% missing values and is kept as-is for any critical analysis later on. Also the extraction of dimensions will aid in reducing missing values in the column. (Refer 9. below)  
8. Wiki QID and ULAN have ~60% missing values, but since they are important attributes of the artists, they are not excluded from the dataset and kept as-is.  
9. Numeric value treatment ((Depth (cm), Height (cm), Width (cm)). Since 'dimensions' has the same attributes in it, extract the data from the string column to replace wherever NA. And after that, imputing with either mean or median where NAs are still present (median opted since there are highly skewed outliers. shown below in last sub-point)    
\   \   \   \   \  *- Dimensions is a very dynamic variable with the required info at different positions of the column. So extracting the dimensions is done from the end of string based on locations of the brackets of the last dimension-set. Ex: extract (1cm x 2cm x 5cm) from the garbage values '..2cm..1cm x 2cm.... (1cm x 2cm x 5cm)'.*    
\   \   \   \   \  *- I have used a series of steps with multiple user defined functions to avoid using stringi package (This was very difficult to achieve!!!). Steps involved are extracting the start and end locations of the braces (and extract into a column by extracting values from str_locate_all() into a list and extracting only the relevant values which are present towards the end of each nested list and a converting integer(0) to NAs using applyNAToStr() function in the code), then use str_sub() to extract the dimensions, replace garbage values, split based on 'x', and trim whitespaces, which provides a list of lists*    
\   \   \   \   \  *- The requirement of creating a new column in the assignment is also done here*  
\   \   \   \   \  *- Extract the Height, Width, Depth respectively using sapply("[") and relevant position in the list (e.g. 1,2,3) and convert to double*    
\   \   \   \   \  *- Wherever original columns have missing values, relevant non-missing values are replaced across each of the 3 columns (Height, Width, Depth)*    
\   \   \   \   \  *- As seen by hist(), there are large outliers in both Height and Width, hence median is used to impute the missing values instead of mean*  
11. All numeric columns are checked for special values like Inf, -Inf and NAN. There are no special values in the dataset
10. Additional comments in code chunks  

```{r Missing Values, fig.height=2, fig.width=8}
#Missing value identification and treatment
getMode <- function(v) #function to get mode as impute from Hmisc didn't give the right mode as confirmed by count()
{ uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]}

#Check missing value for DateMade and DateAcquired
mergedArt %>% count(DateMade, sort = TRUE) %>% head(1)#Mode is '1971/01/01' which can be used for missing values in DateMade
mergedArt$DateMade[is.na(mergedArt$DateMade)] <- getMode(mergedArt$DateMade) #mode is 1971-01-01

mergedArt %>% count(DateAcquired, sort = TRUE) %>% head(1) #mode is 2013-10-24. Many items were acquired that day
mergedArt$DateAcquired[is.na(mergedArt$DateAcquired)] <- getMode(mergedArt$DateAcquired) 

#Check missing values for categorical columns
mergedArt %>% count(Medium, sort = TRUE) %>% head(1) #mode
mergedArt[is.na(mergedArt$Medium),]$Medium  <- getMode(mergedArt$Medium) #'Gelatin silver print' is the mode with ~6.7k obs

mergedArt %>% count(Dimensions, sort= TRUE) %>% head(1) #mode
mergedArt[is.na(mergedArt$Dimensions),]$Dimensions  <- getMode(mergedArt$Dimensions)

#As 'Depth (cm)' has more than 50% missing values, the column can actually be excluded, 
#but this has to be done before ruling out statistical significance and business rules. 
#Since its a numeric dimension column, i am assuming its of importance and retain it. WIll be used with na.rm=TRUE

#Wiki QID & ULAN on the other hand has >60% missing and doesn't have any pattern as its an ID column, 
#But since its an important attribute of the artist, the columns are not excluded from the dataset

#Numeric missing value treatment (Depth (cm), Height (cm), Width (cm)). Since 'dimensions' has the same attributes in it, extract the data from the string column to replace wherever NA. And after that impute with either mean or median
#Dimensions is a very dynamic variable with the required info at different positions of the column. So extract the dimensions from the end of string based on locations of the brackets of the last dimension-set. Ex: extract (1cm x 2cm x 5cm) from the garbage values '..2cm..1cm x 2cm.... (1cm x 2cm x 5cm)'
#I have used a series of steps with multiple user defined functions to avoid using stringi package

applyNAToStr<- function(x) #function to apply NA wherever the penultimate function below creates integer(0) values, i.e. blank values,  thus allowing conversion to column (DimensionsStart, DimensionsEnd) to add back to the dataset

{if(length(x) == 0)
    return (NA)
  return (x)}

mergedArt$DimensionsStart <- mergedArt$Dimensions %>% 
  str_locate_all(., pattern = "[(]") %>% lapply(function(x) x[,1]) %>%
  lapply(function(x) x[length(x)]) %>% sapply(applyNAToStr) #Apply a series of user defined functions to extract the start position of the required dimensions

mergedArt$DimensionsEnd <- mergedArt$Dimensions %>% 
  str_locate_all(., pattern = "[)]") %>% lapply(function(x) x[,1]) %>%
  lapply(function(x) x[length(x)]) %>% sapply(applyNAToStr) #End position of the required dimensions

mergedArt$newDim <- mergedArt$Dimensions %>% 
  str_sub(start = mergedArt$DimensionsStart, end = mergedArt$DimensionsEnd) %>% 
  str_replace("[(]", "") %>% str_replace("[)]", "") %>% str_replace("cm", "") %>% 
  str_split("[x×]") %>% sapply(str_trim, side = c("both")) #Extract the dimensions dynamically from the column


mergedArt$newHeight<-sapply(mergedArt$newDim, "[", 1) %>% as.double() #Extract Height into a new temporary Dimension column
mergedArt$newWidth<-sapply(mergedArt$newDim, "[", 2) %>% as.double() #Extract Width into a new temporary Dimension column
mergedArt$newDepth<-sapply(mergedArt$newDim, "[", 3) %>% as.double() #Extract Depth into a new temporary Dimension column

mergedArt <- mergedArt %>% mutate(`Height (cm)` = if_else(is.na(`Height (cm)`), newHeight, `Height (cm)`),
                      `Width (cm)` = if_else(is.na(`Width (cm)`), newWidth, `Width (cm)`),
                      `Depth (cm)` = if_else(is.na(`Depth (cm)`), newDepth, `Depth (cm)`)) %>% select(-newHeight, -newWidth, -newDepth, -DimensionsStart, -DimensionsEnd) #replace with values in new temp columns wherever dimensions are NAs and remove the temporary Dimension columns

par(mfrow=c(1, 2))
mergedArt$`Height (cm)` %>% hist() #Has a few outliers with very high values. Use median for imputing
mergedArt$`Width (cm)` %>% hist() #Has a few outliers with very high values. Use median for imputing

mergedArt[is.na(mergedArt$`Height (cm)`),]$`Height (cm)` <- median(mergedArt$`Height (cm)`, na.rm = TRUE)
mergedArt[is.na(mergedArt$`Width (cm)`),]$`Width (cm)` <- median(mergedArt$`Width (cm)`, na.rm = TRUE)

#Check for Special (NAN, Inf, -Inf) values
checkSpecialValues <- function(x)
{
is.infinite(x) | is.nan(x)
}

mergedArt %>% select_if(.predicate = is.numeric) %>% sapply(function(x) sum(checkSpecialValues(x))) #No special values

```


##	Scan II (Outliers)

1. Instead of starting with the hist() or boxplot() to visualize the outliers, i went ahead with the z-score approach to identify the outliers on all numeric columns using select_if(is.numeric). The head() of the zScores dataset is quickly visualized. Values have been converted to zscores as identified from the abs(zscores)>3  
\  \   \   \   \  *- Histogram is then checked for each column to ensure normal distribution for using z-scores. Height and Width slightly resemble normal distribution while Depth doesn't. Further analysis follows*  
2. Height seems to have 223 outliers as identified by zScores from 1574.8cm to 215.9cm. However most of them are specific artworks which have larger height than majority of them. SO removing them would create a biased dataset, so capped can be used on Height column.  
3. Before that, counts of top 10 outliers is also viewed. The boxplot of the column is also viewed. Top outliers from largest to smallest is viewed (only top 10 shown using head() because of the limit)  
4. Capping function is created to cap to 5% and 95% quantiles. Capping is done for Height column  
5. Width has 253 outliers from 1705cm to 164.9cm. These all correspond to specific artworks and hence can be safely assumed to be actual data and not a documenting error. Capping is carried out on Width column to avoid creating a biased dataset by removing. Top 10 outliers are also viewed in the process (same as height)  
6. Depth has ~50% missing values, so it would not be highly recommended to treat the data. Depth column in the data is provided where painting with 3 dimensions need to be explicitly analyzed if needed, and hence are not deliberately captured for many artworks while creating data. Also, z-scores are not applicable since it doesn't even remotely resemble normal distribution. My assumption is that the column would not feature in any modeling activity and hence is not capped or treated at the moment  
7. Boxplot is also plotted for each of the columns for visual representation of the distribution and help in treating the outliers  
8. ConstituentID, ULAN and ObjectID are unique identifiers and do not have a quantitative aspect, hence outlier treatment is not carried out for them.  
9. Additional comments in code chunks  

```{r Outliers, fig.height=2.5}
#outlier detection and treatment:
par(mfrow = (c(1,2)))
mergedArt$`Height (cm)` %>% hist(xlab = "Height", breaks = 25) #check if Height is remotely similar to normal distribution, to use z-scores. It slightly resembles normal distribution, so we will go ahead with z-scores
zScores<-mergedArt %>% select_if(.predicate = is.numeric) %>% scores(type = "z") #Create z scores summary for all numeric values. Individual columns focused on next
zScores %>% head(5)

#'Height' has 223 outliers as shown using z scores from 1574.8cm to 215.9cm.
#'However most are specific artworks so excluding them would create a biased data. so use capping to quartile
mergedArt$`Height (cm)` %>% boxplot(las=2, xlab = "Height") #Explore the outliers using boxplot
mergedArt %>% count(`Height (cm)`, sort = TRUE) %>% arrange(-`Height (cm)`) %>% head(10) #Identify the value count of top 10 outlier values
zHeight <- scores(mergedArt$`Height (cm)`, type="z")

mergedArt$`Height (cm)`[which(abs(zHeight)>3)] %>% as.data.frame() %>% set_names(c("outl_Height")) %>%
  arrange(-outl_Height) %>% head(10) #top 10 largest outlier values

cap <- function(x){ #function to cap to nearest non-outlier value
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ), na.rm = TRUE )
  x[ x < quantiles[2] - 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[4]
  x}

mergedArt$`Height (cm)` <- mergedArt$`Height (cm)` %>% cap()

#outliers in 'Width'. There are 253 outliers from 1705cm to 164.9cm. All specific artworks except for 3.
#Cap the outliers to avoid biased data
mergedArt$`Width (cm)` %>% hist(xlab = "Width") #Check distribution to ensure that it remotely resembles normal distribution to apply z-scores
mergedArt$`Width (cm)` %>% boxplot(las=2, xlab = "Width") #Explore the outliers using boxplot

zWidth <- scores(mergedArt$`Width (cm)`, type="z")
mergedArt$`Width (cm)`[which(abs(zWidth)>3)] %>% as.data.frame() %>% set_names(c("outl_Width")) %>%
  arrange(-outl_Width) %>% head(10) #top 10 largest outlier values

mergedArt$`Width (cm)` <- mergedArt$`Width (cm)` %>% cap()

#outliers in 'Depth'. This column has more than 50% NAs and most of the remaining values are 0 (as 
#most of the painting have a depth ~0). There is no benefit of treating outliers for this column
#This column is available in case artworks with 3 dimensions need to be analyzed explicitly

mergedArt$`Depth (cm)` %>% hist(xlab = "Depth")#No outliers using z-scores since the data is not normally distributed. 
mergedArt$`Depth (cm)` %>% boxplot(xlab = "Depth") #Using Tukey's emthod there are many outliers that are detected. But, since most of the data is either NA or 0, outlier detection wont make sense for this column
```


##	Transform 

1. Transformation is done by checking multiple transform operations of all numeric columns which are subset into a separate dataset for checking transformations and explore visuals  
2. numData is mutated to apply multiple transformations on all columns, namely log, log10, sqrt, square, reciprocal, BoxCox. Relevant columns are created by appropriate naming.   
\   \   \   \   \  *The requirement of creating a new column in the assignment is also done here*
3. All variables are viewed under the histogram lens. Due to the page limit, only 3 primary columns are displayed in the report (Height, Width and Depth)  
4. Only log(Depth) seem to have transformed the variable towards a normal distribution and is viewed with a histogram.  
5. Depth is mutated into log(Depth)  
6. Other variables can be transformed based on their involvement in modeling activities  
7. Additional comments in code chunks  


```{r Transformation, fig.keep='last', fig.height=5}

#Transformations: There are 6 numeric columns and are subset to 'numData' dataframe
#for transformation and comparing to normal distribution
numData <- mergedArt %>% select_if(.predicate = is.numeric) #create numData temporarily for checking transformations

#create incremental dataframe attempting 6 important transformation functions on all variables
numData <-  
  numData %>% mutate(across(.cols = c(1:6), .fns = log, .names = "log_{col}")) %>%
  mutate(across(.cols = c(1:6), .fns = log10, .names = "log10_{col}")) %>%
  mutate(across(.cols = c(1:6), .fns = sqrt, .names = "sqrt_{col}")) %>%
  mutate(across(.cols = c(1:6), .fns = function(x) x^2, .names = "sq_{col}")) %>%
  mutate(across(.cols = c(1:6), .fns = function(x) 1/x, .names = "recip_{col}")) %>%
  mutate(across(.cols = c(1:6), .fns = BoxCox, .names = "BoxCox_{col}", lambda="auto"))

#All transformed variables are viewed under a histogram to identify closeness to a normal distribution. Not displayed individually due to page limit on the assignment (few important ones displayed below)
#variables that slightly resemble normal distribution are log_Depth. SO transform Depth to log_Depth and then drop Depth
#Other variables didn't seem to resemble normal distributions after transformations and are not changed (This might be due to the 'not curator approved' status of many artworks since the museum is in the process of filling in missing details currently)
par(mfrow = c(3,1))
numData$`Height (cm)` %>% hist(xlab = "Height") #Check original Height distribution in boxplot
numData$`Width (cm)` %>% hist(xlab = "Width") #Check original Width distribution in boxplot
numData$`Depth (cm)` %>% hist(xlab = "Depth") #Check original Depth distribution in boxplot
par(mfrow = c(1,1))
numData$`log10_Depth (cm)` %>% hist(xlab = "Log Depth", breaks = 25) #Check log(Depth) distribution in boxplot

mergedArt<- mergedArt %>% mutate(`Depth (cm)` = log(`Depth (cm)`)) #Change Depth to log(Depth) as it is much closer to normal distribution (although not perfect). The others can also be transformed based on required of any further modeling activities. ALternatively scaling or centering can be implemented for modeling.
```

References for the project:  
https://github.com/MuseumofModernArt/collection  
http://rare-phoenix-161610.appspot.com/secured/index.html  

<br>
<br>
