Introduction
My data set contains information about World Heritage Sites around the globe. I got this data set from Kaggle (https://www.kaggle.com/ujwalkandi/unesco-world-heritage-sites?select=whc-sites-2019+-+Copy.xls). It was created in 2019 by Ujwal Kandi.
As quantitative variables I use latitude, longitude and date_inscribed, and as categorical variables I use category and region_en.
Organizing the data
I had some issues with the quantitative variables. I have labeled the graphs using main, xlab and ylab, and I used # to add comments. Brief descriptions of each step are given throughout the project.
knitr::opts_chunk$set(echo = TRUE)
df<- read.csv("whc-sites-2019.csv")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#The mean value of Latitude's values.
mean(df$latitude)
## [1] 28.94847
#The standard deviation.
sd(df$latitude)
## [1] 23.69288
#Five number summary
summary(df$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -54.59 17.48 36.10 28.95 45.77 71.19
Below are several graphical displays. A histogram, a box plot and a QQ plot are presented for latitude, which is one of the quantitative variables. The main purpose of a QQ plot is to assess normality. Histograms are arguably the second-best option (after normal probability plots) for assessing normality. The main purpose of a box plot is to show the quartiles and any outliers that are present.
hist(df$latitude, main = "Histogram of Latitude values", xlab = "Latitude (degrees)", col = "red")
boxplot(df$latitude, main= "Box plot showing Latitude", xlab= "Quartile", ylab = "latitude")
qqnorm(df$latitude)
qqline(df$latitude, col = "red")
There are some outliers at the low end, far from the rest of the values. The distribution is negatively skewed.
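The outlier and skew remarks can be checked roughly against the five-number summary printed earlier. This is a minimal sketch, assuming the usual 1.5 × IQR rule that boxplot() applies by default; the numbers are copied from the summary(df$latitude) output, and the variable names are only illustrative. Because summary() and boxplot() compute quartiles slightly differently, the fences are approximate.

```r
# Values copied from the summary(df$latitude) output shown earlier
q1         <- 17.48
q3         <- 45.77
lat_min    <- -54.59
lat_max    <- 71.19
lat_mean   <- 28.95
lat_median <- 36.10

iqr         <- q3 - q1              # interquartile range
lower_fence <- q1 - 1.5 * iqr       # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr       # points above this are drawn as outliers

has_low_outliers  <- lat_min < lower_fence
has_high_outliers <- lat_max > upper_fence

# A mean below the median is a common sign of negative (left) skew
mean_below_median <- lat_mean < lat_median

c(lower_fence = lower_fence, upper_fence = upper_fence)
```

With these numbers the minimum falls below the lower fence while the maximum stays inside the upper fence, which matches outliers appearing only at the low end.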
plot(df$latitude, df$longitude)
cor(df$latitude, df$longitude)
## [1] -0.01570684
The correlation coefficient measures how closely the points in a scatter plot follow a straight line. Here it is about -0.016, which is very close to 0, so there is essentially no linear relationship between latitude and longitude.
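To show what cor() computes, here is a minimal sketch of the Pearson correlation written out by hand. The data are simulated (not the heritage data), and pearson_r is a hypothetical helper name used only for this illustration.

```r
set.seed(1)            # arbitrary seed so the simulated data are reproducible
x <- rnorm(200)
y <- rnorm(200)        # generated independently of x

# Pearson correlation: covariance scaled by the spread of both variables
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)           # hand-written formula
r_builtin <- cor(x, y)                 # base R, Pearson by default
r_linear  <- pearson_r(x, 2 * x + 1)   # points exactly on a rising line
```

The hand-written value agrees with cor(), and points lying exactly on an increasing straight line give a coefficient of 1.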
Below are a frequency table and a relative frequency table. They list the regions and the number of World Heritage Sites in each region.
table(df$region_en)
##
## Africa
## 96
## Arab States
## 86
## Asia and the Pacific
## 266
## Europe and North America
## 528
## Europe and North America,Asia and the Pacific
## 2
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## 1
## Latin America and the Caribbean
## 142
table(df$region_en)/length(df$region_en)
##
## Africa
## 0.0856378234
## Arab States
## 0.0767172168
## Asia and the Pacific
## 0.2372881356
## Europe and North America
## 0.4710080285
## Europe and North America,Asia and the Pacific
## 0.0017841213
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## 0.0008920607
## Latin America and the Caribbean
## 0.1266726137
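The relative frequencies can also be obtained with prop.table(), which divides each count by the total. This is a small sketch that re-enters the counts from the frequency table by hand instead of reading the CSV; region_counts and region_share are illustrative names.

```r
# Counts copied from the table(df$region_en) output
region_counts <- c(
  "Africa" = 96,
  "Arab States" = 86,
  "Asia and the Pacific" = 266,
  "Europe and North America" = 528,
  "Europe and North America,Asia and the Pacific" = 2,
  "Europe and North America,Asia and the Pacific,Latin America and the Caribbean" = 1,
  "Latin America and the Caribbean" = 142
)

total_sites  <- sum(region_counts)          # number of sites in the table
region_share <- prop.table(region_counts)   # same as region_counts / total_sites

round(100 * region_share, 1)                # shares as percentages
```

On the actual data frame the equivalent call would be prop.table(table(df$region_en)).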
A two-way table is made below for two categorical variables, category and region. The table shows how many Cultural, Mixed and Natural heritage sites are in each of the 7 region groupings.
two_way_table <- table(df$category,df$region_en)
two_way_table
##
## Africa Arab States Asia and the Pacific Europe and North America
## Cultural 53 78 189 452
## Mixed 5 3 12 11
## Natural 38 5 65 65
##
## Europe and North America,Asia and the Pacific
## Cultural 0
## Mixed 0
## Natural 2
##
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean
## Cultural 1
## Mixed 0
## Natural 0
##
## Latin America and the Caribbean
## Cultural 96
## Mixed 8
## Natural 38
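A two-way table is easier to read with row and column totals, or as shares within each region. The sketch below rebuilds the table from the counts printed above rather than from the CSV; the shortened labels for the two multi-region columns are my own abbreviations, not names from the data.

```r
# Counts copied from the two-way table output, one row per category
counts <- matrix(
  c(53, 78, 189, 452, 0, 1, 96,
     5,  3,  12,  11, 0, 0,  8,
    38,  5,  65,  65, 2, 0, 38),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    category = c("Cultural", "Mixed", "Natural"),
    region   = c("Africa", "Arab States", "Asia and the Pacific",
                 "Europe and North America",
                 "Europe/NA + Asia-Pacific",           # abbreviated label
                 "Europe/NA + Asia-Pacific + LatAm",   # abbreviated label
                 "Latin America and the Caribbean")
  )
)
two_way <- as.table(counts)

with_totals <- addmargins(two_way)              # adds "Sum" row and column
col_share   <- prop.table(two_way, margin = 2)  # share of each category within a region

round(col_share, 2)
```

On the original object the same calls would be addmargins(two_way_table) and prop.table(two_way_table, margin = 2).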
Here I use a side-by-side boxplot with one quantitative and one categorical variable. The X-axis represents the category of heritage site and the Y-axis represents the year in which each site was inscribed.
boxplot(df$date_inscribed ~ df$category, col="orange", main="Date Inscribed by Category", ylab="Inscribed Date", xlab="Category")
One quantitative variable, the latitude of the heritage sites, is used to make the bar plot below.
barplot(df$latitude, main = "Latitude Chart", xlab = "", ylab = "latitude")
scatter.smooth(df$longitude, main = "Scatter Plot")
data <- read.csv("whc-sites-2019.csv", header = TRUE)
data <- data.matrix(data[,-1])
library(RColorBrewer)
heatmap(t(data),
main = "Heat Map",
Rowv = NA,
Colv = NA,
col = colorRampPalette(brewer.pal(8, "PiYG"))(25),
scale = "column")
Conclusion
The most interesting feature of my data is its graphical analysis. By plotting graphs we can see how different variables depend on each other. As this is my first time exploring and analyzing data in RStudio, I got a lot of ideas about how we can manipulate data according to our needs.
A decision tree uses a tree-like model of decisions and their possible outcomes. Here I made the category variable a factor. Since my data has a lot of written descriptions, I had to make my data set smaller so that R won’t crash. I have imported two libraries, ‘rpart’ and ‘rpart.plot’, to make the decision tree.
library(rpart)
library(rpart.plot)
smalldf <- sample_n(df,35)
tree <- rpart(category ~ region_en + states_name_en + danger + date_inscribed + category_short , data = smalldf)
tree
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 6 Cultural (0.82857143 0.02857143 0.14285714)
## 2) states_name_en=Austria,Belgium,Belgium,France,Germany,Switzerland,India,Japan,Argentina,Bulgaria,Cambodia,Canada,Denmark,Germany,Germany,Poland,India,Iran (Islamic Republic of),Italy,Mali,Mauritania,Morocco,Norway,Panama,Portugal,Russian Federation,Sri Lanka,Tunisia,United Kingdom of Great Britain and Northern Ireland 27 0 Cultural (1.00000000 0.00000000 0.00000000) *
## 3) states_name_en=China,Costa Rica,France,Malaysia,Mexico,Spain 8 3 Natural (0.25000000 0.12500000 0.62500000) *
rpart.plot(tree, extra = 2)
To make predictions, I apply the tree I have created to the data.
pred <- predict(tree, smalldf, type = "class")
head(pred)
## 1 2 3 4 5 6
## Cultural Cultural Cultural Cultural Cultural Natural
## Levels: Cultural Mixed Natural
Each site has been assigned to one of the three categories. The class probabilities behind these predictions are shown below:
predict(tree, smalldf) %>%
head()
## Cultural Mixed Natural
## 1 1.00 0.000 0.000
## 2 1.00 0.000 0.000
## 3 1.00 0.000 0.000
## 4 1.00 0.000 0.000
## 5 1.00 0.000 0.000
## 6 0.25 0.125 0.625
Confusion Table is presented below:
confusion_table <- with(smalldf, table(category, pred))
confusion_table
## pred
## category Cultural Mixed Natural
## Cultural 27 0 2
## Mixed 0 0 1
## Natural 0 0 5
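A confusion table can be summarised by its overall accuracy and by the share of each true category that was predicted correctly. The sketch below re-enters the counts from the table above; note that these predictions were made on the same 35 rows used to fit the tree, so the figures describe fit rather than performance on new data.

```r
# Counts copied from the confusion table: rows are true categories,
# columns are predicted categories
confusion <- matrix(
  c(27, 0, 2,
     0, 0, 1,
     0, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    category = c("Cultural", "Mixed", "Natural"),
    pred     = c("Cultural", "Mixed", "Natural")
  )
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of each true category that was predicted correctly
per_class <- diag(confusion) / rowSums(confusion)

accuracy
per_class
```

Here 32 of the 35 sites lie on the diagonal, and the single Mixed site is never predicted as Mixed.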
Training and testing a model by separating the data into a training set and a test set is a simple (holdout) form of cross-validation.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
inTrain <- createDataPartition(y = smalldf$category, p = .66, list = FALSE)
## Warning in createDataPartition(y = smalldf$category, p = 0.66, list = FALSE):
## Some classes have a single record ( Mixed ) and these will be selected for the
## sample
smalldf_train <- smalldf %>% slice(inTrain)
smalldf_test <- smalldf %>% slice(-inTrain)
dim(smalldf_train)
## [1] 25 22
dim(smalldf_test)
## [1] 10 22
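I used the training set to build my model and then tested it. I removed states_name_en from my tree. Note that the predictions below are made on smalldf_train, so the table reports performance on the training rows rather than on smalldf_test.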
tree_from_train <- rpart(category ~.,data = subset(smalldf_train, select=c( -states_name_en)))
pred_test <- predict(tree_from_train, subset(smalldf_train, select=c( -states_name_en)), type = "class")
with(smalldf_train, table(category, pred_test))
## pred_test
## category Cultural Mixed Natural
## Cultural 18 0 2
## Mixed 0 0 1
## Natural 0 0 4
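The predict() call above is given smalldf_train, so its table has 25 rows. For comparison, here is a minimal sketch of scoring a tree on held-out rows. It uses the built-in iris data as a stand-in (an assumption made so the example runs without the CSV) and a plain random split instead of createDataPartition(); all object names are illustrative.

```r
library(rpart)

set.seed(42)                                     # arbitrary seed for reproducibility
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.66 * n))   # simple random 66% split

iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]                 # rows the tree never sees

# Fit on the training rows only
fit <- rpart(Species ~ ., data = iris_train, method = "class")

# Predict on the held-out rows and compare with the true labels
pred_holdout  <- predict(fit, iris_test, type = "class")
holdout_table <- table(actual = iris_test$Species, predicted = pred_holdout)

holdout_accuracy <- sum(diag(holdout_table)) / sum(holdout_table)
holdout_table
```

For the heritage data the analogous change would be to pass smalldf_test (with states_name_en removed) to predict() and tabulate against smalldf_test$category. With so few rows, the test set may contain factor levels the tree never saw, which can cause predict() to fail.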
I have grown a full tree below (minsplit = 2, cp = 0). It uses only about 25 rows because the full data set was too large, so I had to cut it down.
smalldf_no_States <- subset(smalldf, select=c( -states_name_en))
tree_full <- sample_n(smalldf_no_States,25) %>%
rpart(category ~., data = ., control = rpart.control(minsplit = 2, cp = 0))
rpart.plot(tree_full, extra = 2, roundint=FALSE,
box.palette = list( "Gn", "Bu"))
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 3 is greater than length(box.palette) 2.
## To silence this warning use box.palette=0 or trace=-1.
I couldn’t make predictions with this tree because the model raised an error. For that reason I have left the code commented out below:
#pred_full <- predict(tree_full, smalldf_no_States, type = "class")
#with(smalldf, table(region_en, pred_full))
imp <- varImp(tree)
head(imp)
## Overall
## date_inscribed 0.6452381
## region_en 1.1485714
## states_name_en 5.9785714
## danger 0.0000000
## category_short 0.0000000
imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
geom_bar()
barplot(imp$Overall)
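The importance scores are easier to compare when sorted and expressed as shares of the total. This sketch re-enters the varImp(tree) values printed above; imp_values, imp_sorted and imp_share are illustrative names.

```r
# Scores copied from the varImp(tree) output
imp_values <- c(
  date_inscribed = 0.6452381,
  region_en      = 1.1485714,
  states_name_en = 5.9785714,
  danger         = 0,
  category_short = 0
)

imp_sorted <- sort(imp_values, decreasing = TRUE)   # largest score first
imp_share  <- imp_sorted / sum(imp_sorted)          # each score as a share of the total

round(imp_share, 3)
```

Passing names.arg = names(imp_sorted) to barplot() would also label the bars, which the plain barplot(imp$Overall) call does not do.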
library(FSelector)
weights <- smalldf %>% chi.squared(category ~ ., data = .) %>%
as_tibble(rownames = "feature") %>%
arrange(desc(attr_importance))
weights
## # A tibble: 21 × 2
## feature attr_importance
## <chr> <dbl>
## 1 name_en 1
## 2 short_description_en 1
## 3 criteria_txt 1
## 4 category_short 1
## 5 area_hectares 0.835
## 6 states_name_en 0.826
## 7 iso_code 0.826
## 8 udnp_code 0.826
## 9 secondary_dates 0.426
## 10 region_en 0.407
## # … with 11 more rows
ggplot(weights,
aes(x = attr_importance, y = reorder(feature, attr_importance))) +
geom_bar(stat = "identity") +
xlab("Importance score") + ylab("Feature")
Extra tree
tree1 <- rpart(date_inscribed ~ category + danger + region_en,data = smalldf, method = 'class')
tree1
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 31 2000 (0.029 0.029 0.029 0.029 0.057 0.029 0.029 0.029 0.029 0.029 0.029 0.029 0.11 0.086 0.029 0.057 0.11 0.057 0.029 0.029 0.029 0.086)
## 2) region_en=Asia and the Pacific,Europe and North America 26 22 2000 (0 0.038 0.038 0.038 0.077 0.038 0 0.038 0.038 0 0.038 0 0.15 0.077 0.038 0.077 0.077 0.077 0.038 0 0.038 0.077) *
## 3) region_en=Africa,Arab States,Europe and North America,Asia and the Pacific,Latin America and the Caribbean,Latin America and the Caribbean 9 7 2004 (0.11 0 0 0 0 0 0.11 0 0 0.11 0 0.11 0 0.11 0 0 0.22 0 0 0.11 0 0.11) *
rpart.plot(tree1, extra = 2)
tree
## n= 35
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 35 6 Cultural (0.82857143 0.02857143 0.14285714)
## 2) states_name_en=Austria,Belgium,Belgium,France,Germany,Switzerland,India,Japan,Argentina,Bulgaria,Cambodia,Canada,Denmark,Germany,Germany,Poland,India,Iran (Islamic Republic of),Italy,Mali,Mauritania,Morocco,Norway,Panama,Portugal,Russian Federation,Sri Lanka,Tunisia,United Kingdom of Great Britain and Northern Ireland 27 0 Cultural (1.00000000 0.00000000 0.00000000) *
## 3) states_name_en=China,Costa Rica,France,Malaysia,Mexico,Spain 8 3 Natural (0.25000000 0.12500000 0.62500000) *