Data Project Presentation

Jean-Luc BELLIER
February 25th, 2017

The subject

The European Heritage Days is a cultural annual event, that takes place since 1991. It gives the opportunity for everyone, during one week-end, to discover and visit momuments or part of them that are usually closed to the public.

This event takes place in around 50 countries in Europe and also in extra european countries. It was initiated by French Ministry of Culture in 1984.

During the week-end of September 17th-18th 2016, more than 12.5 millions of visitors on about :

17 000 sites
26 000 sociocultural activities

The elements of the project

The project is made of two elements :

an interactive application, made with shiny
the current presentation, built with R Presenter

The interactive application is made of an input panel with two dropdown lists. The output contains one table with the main information on the events, and a map enabling to localize them.

The application can be found at the following address : European Heritage Days 2016.

The code of the project can be found on github : Project Source

Data Load and Initial Quality

library(tidyr)
library(dplyr)
Journees_Patr_2016 <- read.csv("Journees_europeennes_du_patrimoine_20160914.csv",encoding="UTF-8", sep=";",header=TRUE,stringsAsFactors = FALSE)
dim(Journees_Patr_2016)

[1] 23624   122

nb_obs <- dim(Journees_Patr_2016)[1]
nb_col <- dim(Journees_Patr_2016)[2]

# remove empty columns
check_NA <- apply(Journees_Patr_2016,2,function(x) sum(is.na(x)))
cols_to_remove <- which(check_NA == nb_obs)
clean_Jour_Pat_2016 <- select(Journees_Patr_2016,-cols_to_remove)
dim(clean_Jour_Pat_2016)

[1] 23624    60

clean_Jour_Pat_2016_2 <- clean_Jour_Pat_2016 %>% subset(!is.na(Latitude) & !is.na(Longitude))
dim(clean_Jour_Pat_2016_2)

[1] 10591    60

A few words on the data

The data is coming from the French open data. It lists all the events and nomuments that occured in september 2016.

We can notice that half of the columns (62 columns among 122) are completely empty. In addition, the number of rows can be significantly reduced by filtering the rows having no latitude or no longitude. We need this to ensure that the events can be plotted on the map.This removes about half of the rows.

This gives an idea about the quality that can be found on a true dataset. As shown above, the data volume can be divided by 4. We did not go into further details on the data. There may be mistakes (the fields are mostly free texts). It is not the objective of the project to have 100% clean data.