Introduction
My data set contains information about World Heritage Sites around the globe. I got this data set from Kaggle (https://www.kaggle.com/ujwalkandi/unesco-world-heritage-sites?select=whc-sites-2019+-+Copy.xls). It was created in 2019 by Ujwal Kandi.
For quantitative variables I use latitude, longitude, and date_inscribed; for categorical variables I use category and region_en.
Organizing the data
I had some issues with the quantitative variables. I labeled the graphs using main, xlab, and ylab, and I used # for comments. I give a brief explanation under each heading throughout the project.
knitr::opts_chunk$set(echo = TRUE)
df<- read.csv("whc-sites-2019.csv")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Mean of the latitude values
mean(df$latitude)
## [1] 28.94847
#The standard deviation.
sd(df$latitude)
## [1] 23.69288
#Five number summary
summary(df$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -54.59 17.48 36.10 28.95 45.77 71.19
Below are several graphical displays. A histogram, a box plot, and a QQ plot are presented for latitude, one of the quantitative variables. The main purpose of a QQ plot is to assess normality. Histograms are a second-best option (after normal probability plots) for assessing normality. The main purpose of a box plot is to show the quartiles and any outliers that are present.
hist(df$latitude, main = "Histogram of Latitude values", xlab = "Latitude (degrees)", col = "red")
boxplot(df$latitude, main= "Box plot showing Latitude", xlab= "Quartile", ylab = "latitude")
qqnorm(df$latitude)
qqline(df$latitude, col = "red")
There are some outliers at the low end, far from the rest of the values. The distribution is negatively skewed: the mean (28.95) lies below the median (36.10).
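As a rough check on the two observations above, the sketch below flags box plot outliers with the usual 1.5 × IQR rule and compares the mean with the median as a quick indicator of skew. The vector `lat` is made-up illustration data, not values taken from the data set.

```r
# Hypothetical latitudes (illustration only, not from the data set)
lat <- c(-54, -33, 12, 28, 35, 36, 40, 44, 46, 48, 52, 60)

# boxplot.stats() returns the points lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(lat)$out

# A mean below the median hints at a longer left tail (negative skew)
left_skew_hint <- mean(lat) < median(lat)

print(outliers)
print(left_skew_hint)
```

The same two calls could be applied to `df$latitude`, assuming the column has no missing values.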
plot(df$latitude, df$longitude)
cor(df$latitude, df$longitude)
## [1] -0.01570684
The correlation coefficient measures how closely the points in a scatter plot follow a straight line. Here it is about -0.016, so latitude and longitude show essentially no linear association.
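To make the definition concrete, here is a minimal sketch that computes Pearson's correlation coefficient by hand and compares it with `cor()`. The vectors `x` and `y` are invented for illustration and are not part of the heritage data.

```r
# Hypothetical paired values (illustration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Pearson's r: sum of cross-deviations divided by the square root of
# the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# cor() uses the Pearson method by default, so the two values should agree
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Values near 1 or -1 indicate points close to a straight line, while values near 0 indicate little linear association.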
Below are a frequency table and a relative frequency table. They show each region and the number of World Heritage Sites in it.
table(df$region_en)
##
## Africa
## 96
## Arab States
## 86
## Asia and the Pacific
## 266
## Europe and North America
## 528
## Europe and North America,Asia and the Pacific
## 2
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## 1
## Latin America and the Caribbean
## 142
table(df$region_en)/length(df$region_en)
##
## Africa
## 0.0856378234
## Arab States
## 0.0767172168
## Asia and the Pacific
## 0.2372881356
## Europe and North America
## 0.4710080285
## Europe and North America,Asia and the Pacific
## 0.0017841213
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## 0.0008920607
## Latin America and the Caribbean
## 0.1266726137
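The relative frequencies can also be obtained with `prop.table()`, which divides each count by the total. The sketch below re-enters the counts from the frequency table above by hand; the shortened region labels are my own and do not appear in the data.

```r
# Region counts copied from the frequency table above (labels shortened by hand)
counts <- c(Africa = 96, Arab = 86, AsiaPacific = 266, EuropeNA = 528,
            EuropeNA_AsiaPacific = 2, EuropeNA_AsiaPacific_LAC = 1, LAC = 142)

# prop.table() divides every entry by the overall total
rel_freq <- prop.table(counts)

print(sum(counts))
print(round(rel_freq, 4))
```

On the real data the equivalent call would be `prop.table(table(df$region_en))`.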
A two-way table is made below for two categorical variables, category and region. It shows how many Cultural, Mixed, and Natural heritage sites are in each of the 7 regions.
two_way_table <- table(df$category,df$region_en)
two_way_table
##
## Africa Arab States Asia and the Pacific Europe and North America
## Cultural 53 78 189 452
## Mixed 5 3 12 11
## Natural 38 5 65 65
##
## Europe and North America,Asia and the Pacific
## Cultural 0
## Mixed 0
## Natural 2
##
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## Cultural 1
## Mixed 0
## Natural 0
##
## Latin America and the Caribbean
## Cultural 96
## Mixed 8
## Natural 38
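A two-way table is often easier to read with totals and within-region shares added. The sketch below rebuilds the table above from its printed counts (column labels shortened by hand, so they are assumptions rather than the original level names) and applies `addmargins()` and `prop.table()`.

```r
# Counts copied from the two-way table above; column labels shortened by hand
tab <- as.table(matrix(
  c(53, 78, 189, 452, 0, 1, 96,
     5,  3,  12,  11, 0, 0,  8,
    38,  5,  65,  65, 2, 0, 38),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    category = c("Cultural", "Mixed", "Natural"),
    region = c("Africa", "Arab", "AsiaPac", "EurNA",
               "EurNA+AsiaPac", "EurNA+AsiaPac+LAC", "LAC")
  )
))

# Append row and column totals
with_totals <- addmargins(tab)

# Share of each category within a region (each column sums to 1)
col_share <- prop.table(tab, margin = 2)

print(with_totals)
print(round(col_share, 2))
```

With the real data, `addmargins(two_way_table)` and `prop.table(two_way_table, margin = 2)` should give the same information.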
Here I use a side-by-side boxplot with one quantitative and one categorical variable. The x-axis shows the category of heritage site and the y-axis shows the year each site was inscribed.
boxplot(df$date_inscribed ~ df$category, col = "orange", main = "Year Inscribed by Category", ylab = "Inscribed Date", xlab = "Category")
One quantitative variable, the latitude of the heritage sites, is used for the barplot below.
barplot(df$latitude, main = "Latitude Chart", xlab = "", ylab = "latitude")
scatter.smooth(df$longitude, main = "Scatter Plot")
data <- read.csv("whc-sites-2019.csv", header = TRUE)
data <- data.matrix(data[,-1])
library(RColorBrewer)
heatmap(t(data),
        main = "Heat Map",
        Rowv = NA,
        Colv = NA,
        col = colorRampPalette(brewer.pal(8, "PiYG"))(25),
        scale = "column")
Conclusion
The most interesting feature of my data is the graphical analysis. Plotting graphs lets us see how different variables depend on each other. As this is my first time exploring and analyzing data in RStudio, I learned a lot about how data can be manipulated to suit our needs.
A decision tree uses a tree-like model of decisions and their possible outcomes. Here I made the category variable a factor. Since my data contains a lot of written descriptions, I had to reduce it to a small sample so R wouldn’t crash. I imported two libraries, ‘rpart’ and ‘rpart.plot’, to build the decision tree.
library(rpart)
library(rpart.plot)
smalldf <- sample_n(df,35)
tree <- rpart(category ~ region_en + states_name_en + danger + date_inscribed + category_short , data = smalldf)
tree
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 4 Cultural (0.8857143 0.1142857)
## 2) states_name_en=Andorra,Austria,Austria,Hungary,Belarus,Estonia,Finland,Latvia,Lithuania,Norway,Republic of Moldova,Russian Federation,Sweden,Ukraine,Brazil,Burkina Faso,Cuba,Ethiopia,France,India,Indonesia,Italy,Japan,Kenya,Libya,Mexico,Myanmar,Poland,Portugal,Russian Federation,Senegal,South Africa,Spain,Sweden,Syrian Arab Republic,Turkey 28 0 Cultural (1.0000000 0.0000000) *
## 3) states_name_en=Algeria,Belize,Democratic Republic of the Congo,Iran (Islamic Republic of),Madagascar 7 3 Natural (0.4285714 0.5714286) *
rpart.plot(tree, extra = 2)
To make predictions, I apply the tree I created to the sample data.
pred <- predict(tree, smalldf, type = "class")
head(pred)
## 1 2 3 4 5 6
## Cultural Cultural Natural Cultural Cultural Cultural
## Levels: Cultural Natural
Each site has been assigned a predicted category. The class probabilities behind those predictions are shown below.
predict(tree, smalldf) %>%
head()
## Cultural Natural
## 1 1.0000000 0.0000000
## 2 1.0000000 0.0000000
## 3 0.4285714 0.5714286
## 4 1.0000000 0.0000000
## 5 1.0000000 0.0000000
## 6 1.0000000 0.0000000
The confusion table is presented below:
confusion_table <- with(smalldf, table(category, pred))
confusion_table
## pred
## category Cultural Natural
## Cultural 28 3
## Natural 0 4
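The confusion table can be summarized in a single accuracy figure. The sketch below re-enters the four counts printed above and computes accuracy and per-category recall. Because the predictions were made on the same 35 rows used to fit the tree, these figures are likely to be optimistic.

```r
# Counts copied from the confusion table above
confusion <- matrix(c(28, 3,
                       0, 4),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(category = c("Cultural", "Natural"),
                                    pred = c("Cultural", "Natural")))

# Accuracy: correctly classified sites (the diagonal) over all sites
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall per true category: diagonal divided by the row totals
recall <- diag(confusion) / rowSums(confusion)

print(accuracy)
print(recall)
```

Here 32 of the 35 sites sit on the diagonal, and the three errors are Cultural sites predicted as Natural.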
Splitting the data into one set to train the model and another set to test it is called holdout validation, the simplest form of cross validation.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
inTrain <- createDataPartition(y = smalldf$category, p = .66, list = FALSE)
smalldf_train <- smalldf %>% slice(inTrain)
smalldf_test <- smalldf %>% slice(-inTrain)
dim(smalldf_train)
## [1] 24 22
dim(smalldf_test)
## [1] 11 22
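I used the training set to build my model and removed states_name_en from the tree. The predictions below are made on the training set itself (24 rows) rather than on the 11 held-out test rows, and the tree assigns every site to Cultural.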
tree_from_train <- rpart(category ~.,data = subset(smalldf_train, select=c( -states_name_en)))
pred_test <- predict(tree_from_train, subset(smalldf_train, select=c( -states_name_en)), type = "class")
with(smalldf_train, table(category, pred_test))
## pred_test
## category Cultural Natural
## Cultural 21 0
## Natural 3 0
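For comparison, a holdout evaluation usually predicts on the rows that were kept out of training. The sketch below shows that pattern with base R sampling and `rpart`, using the built-in `iris` data as a stand-in for the heritage sample. The seed, the 66% split, and the object names are assumptions for illustration only.

```r
library(rpart)

set.seed(1)  # arbitrary seed so the split is repeatable

# iris stands in for the heritage data (an assumption for illustration)
n <- nrow(iris)
train_idx <- sample(n, size = round(0.66 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training rows only
fit <- rpart(Species ~ ., data = train)

# Predict on rows the model has not seen
pred <- predict(fit, newdata = test, type = "class")

# Confusion table and accuracy on the held-out rows
held_out_table <- table(actual = test$Species, predicted = pred)
held_out_accuracy <- mean(pred == test$Species)

print(held_out_table)
print(held_out_accuracy)
```

Applied to this report, the analogous call would pass `smalldf_test` rather than `smalldf_train` to `predict()`, although with only 11 test rows the estimate would be noisy.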
I built a full tree below. It uses only about 25 rows because the full data set was too large and I had to cut it down.
smalldf_no_States <- subset(smalldf, select=c( -states_name_en))
tree_full <- sample_n(smalldf_no_States, 25) %>%
  rpart(category ~ ., data = ., control = rpart.control(minsplit = 2, cp = 0))
rpart.plot(tree_full, extra = 2, roundint = FALSE,
           box.palette = list("Gn", "Bu"))
I couldn’t make predictions with this tree because the model produced an error, so I have left the code commented out below:
#pred_full <- predict(tree_full, smalldf_no_States, type = "class")
#with(smalldf, table(region_en, pred_full))
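The error message is not recorded here, so the cause is unknown. One common cause with `rpart` is that the new data contains factor levels the model never saw during training, which can happen when a tree is fitted on a 25-row sample and then asked to predict all 35 rows. The sketch below reproduces that situation on tiny made-up data; it is a guess at the kind of problem involved, not a diagnosis of this report's error.

```r
library(rpart)

# Tiny made-up data: predictor g has levels "u" and "v" in training
train <- data.frame(
  y = factor(c("a", "a", "b", "b", "a", "b")),
  g = factor(c("u", "u", "v", "v", "u", "v"))
)

# New rows contain a level "w" that the model never saw
new_rows <- data.frame(g = factor(c("u", "w")))

fit <- rpart(y ~ g, data = train,
             control = rpart.control(minsplit = 2, cp = 0))

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  as.character(predict(fit, new_rows, type = "class")),
  error = function(e) paste("error:", conditionMessage(e))
)

print(result)
```

If this is the cause, dropping free-text columns such as names and descriptions before fitting would be one way to avoid it.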
imp <- varImp(tree)
head(imp)
## Overall
## date_inscribed 0.3164835
## region_en 0.8634921
## states_name_en 3.6571429
## danger 0.0000000
## category_short 0.0000000
imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
  geom_bar()
barplot(imp$Overall)
library(FSelector)
weights <- smalldf %>% chi.squared(category ~ ., data = .) %>%
  as_tibble(rownames = "feature") %>%
  arrange(desc(attr_importance))
weights
## # A tibble: 21 × 2
## feature attr_importance
## <chr> <dbl>
## 1 name_en 1
## 2 short_description_en 1
## 3 date_end 1
## 4 criteria_txt 1
## 5 category_short 1
## 6 states_name_en 0.901
## 7 iso_code 0.901
## 8 udnp_code 0.901
## 9 danger_list 0.853
## 10 area_hectares 0.717
## # … with 11 more rows
ggplot(weights,
       aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")
Extra tree
tree1 <- rpart(date_inscribed ~ category + danger + region_en,data = smalldf, method = 'class')
tree1
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 32 2001 (0.057 0.029 0.029 0.029 0.057 0.029 0.029 0.057 0.057 0.057 0.086 0.029 0.057 0.057 0.086 0.029 0.029 0.029 0.057 0.029 0.029 0.029 0.029)
## 2) region_en=Africa,Arab States,Europe and North America 24 21 2001 (0.083 0.042 0.042 0.042 0 0.042 0.042 0.042 0 0.083 0.12 0.042 0.083 0.042 0.083 0.042 0.042 0 0.083 0 0 0 0.042)
## 4) region_en=Africa,Arab States 10 8 1980 (0.2 0.1 0 0.1 0 0 0 0.1 0 0 0.1 0 0.1 0 0 0.1 0.1 0 0.1 0 0 0 0) *
## 5) region_en=Europe and North America 14 12 2000 (0 0 0.071 0 0 0.071 0.071 0 0 0.14 0.14 0.071 0.071 0.071 0.14 0 0 0 0.071 0 0 0 0.071) *
## 3) region_en=Asia and the Pacific,Latin America and the Caribbean 11 9 1991 (0 0 0 0 0.18 0 0 0.091 0.18 0 0 0 0 0.091 0.091 0 0 0.091 0 0.091 0.091 0.091 0) *
rpart.plot(tree1, extra = 2)
tree
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 4 Cultural (0.8857143 0.1142857)
## 2) states_name_en=Andorra,Austria,Austria,Hungary,Belarus,Estonia,Finland,Latvia,Lithuania,Norway,Republic of Moldova,Russian Federation,Sweden,Ukraine,Brazil,Burkina Faso,Cuba,Ethiopia,France,India,Indonesia,Italy,Japan,Kenya,Libya,Mexico,Myanmar,Poland,Portugal,Russian Federation,Senegal,South Africa,Spain,Sweden,Syrian Arab Republic,Turkey 28 0 Cultural (1.0000000 0.0000000) *
## 3) states_name_en=Algeria,Belize,Democratic Republic of the Congo,Iran (Islamic Republic of),Madagascar 7 3 Natural (0.4285714 0.5714286) *
tail(smalldf)
## category states_name_en region_en
## 30 Cultural Turkey Europe and North America
## 31 Cultural Austria Europe and North America
## 32 Cultural India Asia and the Pacific
## 33 Natural Democratic Republic of the Congo Africa
## 34 Cultural Italy Europe and North America
## 35 Cultural Algeria Arab States
## unique_number id_no rev_bis
## 30 729 614
## 31 1206 1033
## 32 1947 247 Rev
## 33 849 718
## 34 1196 1024 Rev
## 35 212 191
## name_en
## 30 City of Safranbolu
## 31 Historic Centre of Vienna
## 32 Hill Forts of Rajasthan
## 33 Okapi Wildlife Reserve
## 34 Late Baroque Towns of the Val di Noto (South-Eastern Sicily)
## 35 Djémila
## short_description_en
## 30 From the 13th century to the advent of the railway in the early 20th century, Safranbolu was an important caravan station on the main East–West trade route. The Old Mosque, Old Bath and Süleyman Pasha Medrese were built in 1322. During its apogee in the 17th century, Safranbolu's architecture influenced urban development throughout much of the Ottoman Empire.
## 31 Vienna developed from early Celtic and Roman settlements into a Medieval and Baroque city, the capital of the Austro-Hungarian Empire. It played an essential role as a leading European music centre, from the great age of Viennese Classicism through the early part of the 20th century. The historic centre of Vienna is rich in architectural ensembles, including Baroque castles and gardens, as well as the late-19th-century Ringstrasse lined with grand buildings, monuments and parks.
## 32 The serial site, situated in the state of Rajastahan, includes six majestic forts in Chittorgarh; Kumbhalgarh; Sawai Madhopur; Jhalawar; Jaipur, and Jaisalmer. The ecclectic architecture of the forts, some up to 20 kilometres in circumference, bears testimony to the power of the Rajput princely states that flourished in the region from the 8th to the 18th centuries. Enclosed within defensive walls are major urban centres, palaces, trading centres and other buildings including temples that often predate the fortifications within which developed an elaborate courtly culture that supported learning, music and the arts. Some of the urban centres enclosed in the fortifications have survived, as have many of the site's temples and other sacred buildings. The forts use the natural defenses offered by the landscape: hills, deserts, rivers, and dense forests. They also feature extensive water harvesting structures, largely still in use today.
## 33 The Okapi Wildlife Reserve occupies about one-fifth of the Ituri forest in the north-east of the Democratic Republic of the Congo. The Congo river basin, of which the reserve and forest are a part, is one of the largest drainage systems in Africa. The reserve contains threatened species of primates and birds and about 5,000 of the estimated 30,000 okapi surviving in the wild. It also has some dramatic scenery, including waterfalls on the Ituri and Epulu rivers. The reserve is inhabited by traditional nomadic pygmy Mbuti and Efe hunters.
## 34 The eight towns in south-eastern Sicily: Caltagirone, Militello Val di Catania, Catania, Modica, Noto, Palazzolo, Ragusa and Scicli, were all rebuilt after 1693 on or beside towns existing at the time of the earthquake which took place in that year. They represent a considerable collective undertaking, successfully carried out at a high level of architectural and artistic achievement. Keeping within the late Baroque style of the day, they also depict distinctive innovations in town planning and urban building.
## 35 Situated 900 m above sea-level, Djémila, or Cuicul, with its forum, temples, basilicas, triumphal arches and houses, is an interesting example of Roman town planning adapted to a mountain location.
## justification_en
## 30
## 31 <em>Criterion (ii):</em> The urban and architectural qualities of the Historic Centre of Vienna bear outstanding witness to a continuing interchange of values throughout the second millennium. \n <em>Criterion (iv):</em> Three key periods of European cultural and political development – the Middle Ages, the Baroque period, and the Gründerzeit – are exceptionally well illustrated by the urban and architectural heritage of the Historic Centre of Vienna. \n <em>Criterion (vi):</em> Since the 16th century Vienna has been universally acknowledged to be the musical capital of Europe.
## 32
## 33 The Committee inscribed the property as one of the most important sites for conservation, including the rare Okapi and rich floral diversity, under natural <em>criterion (x)</em>. The Committee expressed its hope that the activities outlined in the new management plan would ensure the integrity of the site. Considering the civil unrest in the country, the question of the long-term security of the site was raised.
## 34 <em>Criterion (i):</em> This group of towns in south-eastern Sicily provides outstanding testimony to the exuberant genius of late Baroque art and architecture. \n <em>Criterion (ii):</em> The towns of the Val di Noto represent the culmination and final flowering of Baroque art in Europe. \n <em>Criterion (iv):</em> The exceptional quality of the late Baroque art and architecture in the Val di Noto lies in its geographical and chronological homogeneity, as well as its quantity, the result of the 1693 earthquake in this region. \n <em>Criterion (v):</em> The eight towns of south-eastern Sicily that make up this nomination, which are characteristic of the settlement pattern and urban form of this region, are permanently at risk from earthquakes and eruptions of Mount Etna.
## 35
## date_inscribed secondary_dates danger date_end danger_list longitude
## 30 1994 0 NA 32.68972
## 31 2001 1 NA Y 2017 16.38333
## 32 2013 0 NA 74.64611
## 33 1996 1 NA Y 1997 28.50000
## 34 2002 0 NA 15.06892
## 35 1982 0 NA 5.73667
## latitude area_hectares criteria_txt category_short iso_code udnp_code
## 30 41.26000 193.00 (ii)(iv)(v) C tr tur
## 31 48.21667 371.00 (ii)(iv)(vi) C at aut
## 32 24.88333 NA (ii)(iii) C in ind
## 33 2.00000 1372625.00 (x) N cd cod
## 34 36.89319 112.79 (i)(ii)(iv)(v) C it ita
## 35 36.32056 30.60 (iii)(iv) C dz dza
## transboundary
## 30 0
## 31 0
## 32 0
## 33 0
## 34 0
## 35 0
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
transactions(smalldf)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
## 18, 19, 20, 21, 22 not logical or factor. Applying default discretization (see
## '? discretizeDF').
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## transactions in sparse format with
## 35 transactions (rows) and
## 248 items (columns)
colnames(smalldf)[c(1,2,3,4,10,12)]
## [1] "category" "states_name_en" "region_en" "unique_number"
## [5] "date_inscribed" "danger"
smalldf <- smalldf %>% mutate(
danger = (danger > 0),
date_inscribed = (date_inscribed >0)
)
trans <- transactions(smalldf)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 14, 15, 16, 17, 18,
## 19, 20, 21, 22 not logical or factor. Applying default discretization (see '?
## discretizeDF').
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.