Introduction
My data set contains information about World Heritage Sites around the globe. I got it from Kaggle (https://www.kaggle.com/ujwalkandi/unesco-world-heritage-sites?select=whc-sites-2019+-+Copy.xls). It was created in 2019 by Ujwal Kandi.
As quantitative variables I use latitude, longitude, and date_inscribed, and as categorical variables I use category and region_en.

Organizing the data
I had some issues with the quantitative variables. I labeled the graphs using main, xlab, and ylab, and I used # to comment the code. I give brief details under each header throughout the project.

knitr::opts_chunk$set(echo = TRUE)
df <- read.csv("whc-sites-2019.csv")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1 Summary statistics

# Mean of the latitude values
mean(df$latitude)

## [1] 28.94847

# Standard deviation of the latitude values
sd(df$latitude)

## [1] 23.69288

# Five-number summary (summary() also reports the mean)
summary(df$latitude)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -54.59   17.48   36.10   28.95   45.77   71.19

Graphical Display

Below are several graphical displays. A histogram, a box plot, and a QQ plot are presented for latitude, which is one of the quantitative variables. The main purpose of a QQ plot (normal probability plot) is to assess normality. A histogram is a second-best option for assessing normality. A box plot's main purpose is to show the quartiles and any outliers.

hist(df$latitude, main = "Histogram of Latitude values", xlab = "Latitude (degrees)", col = "red")

boxplot(df$latitude, main= "Box plot showing Latitude", xlab= "Quartile", ylab = "latitude")

qqnorm(df$latitude)
qqline(df$latitude, col = "red")

There are some outliers at the low end, far from the rest of the values. The distribution is negatively skewed: the mean (28.95) is below the median (36.10).
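The two observations above (low-end outliers and negative skew) can also be checked numerically. The following is a minimal sketch on a small made-up vector, not on the heritage data; the variable names are hypothetical. It uses the moment-based skewness formula and the usual 1.5 * IQR rule. Note that boxplot() itself uses hinges from fivenum(), which can differ slightly from quantile().

```r
# Toy data (hypothetical values, not taken from the heritage data set)
x <- c(-50, -30, 10, 20, 30, 35, 40, 45, 50, 60)

# A mean below the median is a quick hint of negative (left) skew
mean_x <- mean(x)
median_x <- median(x)

# Moment-based skewness: third central moment over the
# second central moment raised to the power 3/2
m2 <- mean((x - mean_x)^2)
m3 <- mean((x - mean_x)^3)
skewness <- m3 / m2^(3 / 2)

# Outlier fences: 1.5 * IQR beyond the first and third quartiles
q <- quantile(x, c(0.25, 0.75))
lower_fence <- q[[1]] - 1.5 * IQR(x)
upper_fence <- q[[2]] + 1.5 * IQR(x)
outliers <- x[x < lower_fence | x > upper_fence]

skewness
outliers
```

A negative skewness value agrees with a long tail toward low values, and any value outside the fences would be drawn as a separate point on a box plot.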

2 Graphical display looking at longitude and latitude and their correlation

plot(df$latitude, df$longitude)

cor(df$latitude, df$longitude)

## [1] -0.01570684

The correlation coefficient measures how closely the points in a scatter plot follow a straight line. It ranges from -1 to 1. Here r = -0.016, which is very close to 0, so there is essentially no linear relationship between latitude and longitude.
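To make the definition concrete, here is a small sketch that computes Pearson's correlation by hand and compares it with cor(). The vectors are made-up toy values, not the latitude and longitude columns.

```r
# Toy vectors (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r: sum of cross-products of deviations, scaled by
# the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in version for comparison
r_builtin <- cor(x, y)

r_manual
r_builtin
```

Both values should agree, which shows that cor() uses the same formula by default (method = "pearson").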

3 Table

Below are a frequency table and a relative frequency table. They list the regions and the number (or proportion) of World Heritage Sites in each region.

Frequency Table

table(df$region_en)

## 
##                                                                        Africa 
##                                                                            96 
##                                                                   Arab States 
##                                                                            86 
##                                                          Asia and the Pacific 
##                                                                           266 
##                                                      Europe and North America 
##                                                                           528 
##                                 Europe and North America,Asia and the Pacific 
##                                                                             2 
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean 
##                                                                             1 
##                                               Latin America and the Caribbean 
##                                                                           142

Relative Frequency Table

table(df$region_en)/length(df$region_en)

## 
##                                                                        Africa 
##                                                                  0.0856378234 
##                                                                   Arab States 
##                                                                  0.0767172168 
##                                                          Asia and the Pacific 
##                                                                  0.2372881356 
##                                                      Europe and North America 
##                                                                  0.4710080285 
##                                 Europe and North America,Asia and the Pacific 
##                                                                  0.0017841213 
## Europe and North America,Asia and the Pacific,Latin America and the Caribbean 
##                                                                  0.0008920607 
##                                               Latin America and the Caribbean 
##                                                                  0.1266726137
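As an alternative to dividing by length(), base R's prop.table() turns a frequency table into relative frequencies directly. The sketch below uses a short made-up vector of region labels rather than the real region_en column.

```r
# Toy region labels (hypothetical values)
region <- c("Africa", "Africa", "Arab States",
            "Asia and the Pacific", "Asia and the Pacific",
            "Asia and the Pacific")

# Frequency table: counts per region
freq <- table(region)

# Relative frequency table: each count divided by the total
rel_freq <- prop.table(freq)

# Rounded for easier reading
round(rel_freq, 3)
```

The relative frequencies always sum to 1, which is a quick sanity check on the table.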

4 Two-way table

A two-way table is made below for two categorical variables, category and region. The table shows how many Cultural, Mixed, and Natural heritage sites are in each of the 7 region groupings (five regions plus two combinations for sites that span several regions).

two_way_table <- table(df$category,df$region_en)
two_way_table

##           
##            Africa Arab States Asia and the Pacific Europe and North America
##   Cultural     53          78                  189                      452
##   Mixed         5           3                   12                       11
##   Natural      38           5                   65                       65
##           
##            Europe and North America,Asia and the Pacific
##   Cultural                                             0
##   Mixed                                                0
##   Natural                                              2
##           
##            Europe and North America,Asia and the Pacific,Latin America and the Caribbean
##   Cultural                                                                             1
##   Mixed                                                                                0
##   Natural                                                                              0
##           
##            Latin America and the Caribbean
##   Cultural                              96
##   Mixed                                  8
##   Natural                               38
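A two-way table is often easier to read with totals or with proportions inside each column. The following sketch uses a tiny made-up data frame (the names toy, tw, and col_share are hypothetical) together with base R's addmargins() and prop.table().

```r
# Toy data (hypothetical values)
toy <- data.frame(
  category = c("Cultural", "Cultural", "Natural",
               "Mixed", "Cultural", "Natural"),
  region   = c("Africa", "Africa", "Africa",
               "Arab States", "Arab States", "Arab States")
)

# Two-way table of counts: categories in rows, regions in columns
tw <- table(toy$category, toy$region)

# Same table with row and column totals added
with_totals <- addmargins(tw)

# Share of each category within a region (each column sums to 1)
col_share <- prop.table(tw, margin = 2)

with_totals
round(col_share, 2)
```

Using margin = 1 instead would give the share of each region within a category.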

5 Side-By-Side Plot

Here I use a box plot with one quantitative and one categorical variable to present a side-by-side plot. The X-axis shows the category of the heritage sites and the Y-axis shows the year each site was inscribed.

boxplot(df$date_inscribed ~ df$category, col = "orange", main = "Inscription Year by Category", ylab = "Inscribed Date", xlab = "Category")
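The side-by-side box plot can be backed up with numbers per group. This is a small sketch using tapply() on a made-up data frame; the column names year and category are only stand-ins for date_inscribed and category.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  year     = c(1980, 1990, 2000, 1985, 2005, 2010),
  category = c("Cultural", "Cultural", "Cultural",
               "Natural", "Natural", "Natural")
)

# Median year for each category (the thick line in each box)
med_by_cat <- tapply(toy$year, toy$category, median)

# Full summary (min, quartiles, mean, max) for each category
summary_by_cat <- tapply(toy$year, toy$category, summary)

med_by_cat
summary_by_cat
```

Comparing the medians and quartiles between groups gives the same information as the boxes, but in table form.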

6 BarPlot

One quantitative variable, the latitude of the heritage sites, is used to make the bar plot below.

barplot(df$latitude,  main = "Latitude Chart", xlab = "", ylab = "latitude")

Scatter Plot of Longitude Value

scatter.smooth(df$longitude, main = "Scatter Plot")

HEAT MAP

data <- read.csv("whc-sites-2019.csv", header = TRUE)
data <- data.matrix(data[,-1])
library(RColorBrewer)
heatmap(t(data),
        main = "Heat Map",
        Rowv = NA,
        Colv = NA,
        col = colorRampPalette(brewer.pal(8, "PiYG"))(25),
        scale = "column")

Conclusion
The most interesting feature of my project was the graphical analysis of the data. By plotting graphs, we can explore how different variables depend on each other. As this is my first time exploring and analyzing data in RStudio, I got a lot of ideas about how we can manipulate data according to our needs.

DECISION TREE

A decision tree uses a tree-like model of decisions and their possible outcomes. Here I made the category variable a factor. Since my data has a lot of written descriptions, I had to work with a small sample so that R won’t crash. I loaded two libraries, ‘rpart’ and ‘rpart.plot’, to build and plot the decision tree.

library(rpart)
library(rpart.plot)
smalldf <- sample_n(df,35)
tree <- rpart(category ~ region_en + states_name_en + danger + date_inscribed + category_short , data = smalldf)
tree

## n= 35 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 35 6 Cultural (0.82857143 0.02857143 0.14285714)  
##   2) states_name_en=Austria,Belgium,Belgium,France,Germany,Switzerland,India,Japan,Argentina,Bulgaria,Cambodia,Canada,Denmark,Germany,Germany,Poland,India,Iran (Islamic Republic of),Italy,Mali,Mauritania,Morocco,Norway,Panama,Portugal,Russian Federation,Sri Lanka,Tunisia,United Kingdom of Great Britain and Northern Ireland 27 0 Cultural (1.00000000 0.00000000 0.00000000) *
##   3) states_name_en=China,Costa Rica,France,Malaysia,Mexico,Spain 8 3 Natural (0.25000000 0.12500000 0.62500000) *

rpart.plot(tree, extra = 2)

To make predictions, I apply the tree that I created to the data.

pred <- predict(tree, smalldf, type = "class")
head(pred)

##        1        2        3        4        5        6 
## Cultural Cultural Cultural Cultural Cultural  Natural 
## Levels: Cultural Mixed Natural

Each site has been assigned to its predicted category. Without type = "class", predict() returns the class probabilities instead:

predict(tree, smalldf) %>%
  head()

##   Cultural Mixed Natural
## 1     1.00 0.000   0.000
## 2     1.00 0.000   0.000
## 3     1.00 0.000   0.000
## 4     1.00 0.000   0.000
## 5     1.00 0.000   0.000
## 6     0.25 0.125   0.625

Confusion Table

The confusion table below compares the actual categories (rows) with the predicted categories (columns):

confusion_table <- with(smalldf, table(category, pred))
confusion_table

##           pred
## category   Cultural Mixed Natural
##   Cultural       27     0       2
##   Mixed           0     0       1
##   Natural         0     0       5
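The confusion table can be summarized as a single accuracy figure and as a per-class recall. The sketch below re-enters the counts printed above by hand as a matrix (the object names conf, accuracy, and recall are hypothetical); with the real objects, the same formulas would apply to confusion_table.

```r
# Counts copied from the printed confusion table (rows = actual, columns = predicted)
conf <- matrix(c(27, 0, 2,
                  0, 0, 1,
                  0, 0, 5),
               nrow = 3, byrow = TRUE,
               dimnames = list(
                 category = c("Cultural", "Mixed", "Natural"),
                 pred     = c("Cultural", "Mixed", "Natural")))

# Accuracy: correctly classified sites (the diagonal) over all sites
accuracy <- sum(diag(conf)) / sum(conf)

# Recall per class: correct predictions over the actual count in each row
recall <- diag(conf) / rowSums(conf)

accuracy
recall
```

Here 32 of the 35 sites lie on the diagonal, and the single Mixed site is never predicted as Mixed, so its recall is 0.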

Cross Validation

Separating the data into one set for training a model and another set for testing it is a simple form of cross validation (a holdout split).

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

inTrain <- createDataPartition(y = smalldf$category, p = .66, list = FALSE)

## Warning in createDataPartition(y = smalldf$category, p = 0.66, list = FALSE):
## Some classes have a single record ( Mixed ) and these will be selected for the
## sample

smalldf_train <- smalldf %>% slice(inTrain)
smalldf_test <- smalldf %>% slice(-inTrain)

dim(smalldf_train)

## [1] 25 22

dim(smalldf_test)

## [1] 10 22

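I used the training set to build my model, and I removed states_name_en from the tree. Note that the predictions below are made on the same training set (25 rows), so the table shows how well the tree fits the training data rather than how it performs on the held-out test set.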

tree_from_train <- rpart(category ~.,data = subset(smalldf_train, select=c( -states_name_en)))
pred_test <- predict(tree_from_train, subset(smalldf_train, select=c( -states_name_en)), type = "class")
with(smalldf_train, table(category, pred_test))

##           pred_test
## category   Cultural Mixed Natural
##   Cultural       18     0       2
##   Mixed           0     0       1
##   Natural         0     0       4
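For comparison, a holdout evaluation usually fits the tree on the training rows and predicts on the test rows. The following is only a sketch of that pattern, assuming the built-in iris data in place of the heritage data, base R's sample() in place of createDataPartition(), and an arbitrary seed; it is not the analysis from this report.

```r
library(rpart)

set.seed(1)  # arbitrary seed so the split is reproducible

# Split the rows: about 66% for training, the rest for testing
n <- nrow(iris)
train_idx <- sample(n, size = round(0.66 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit the classification tree on the training rows only
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict on the held-out test rows, which the tree has not seen
pred <- predict(fit, newdata = test, type = "class")

# Confusion table and accuracy on the test set
conf_test <- table(actual = test$Species, predicted = pred)
test_accuracy <- sum(diag(conf_test)) / sum(conf_test)

conf_test
test_accuracy
```

Because the test rows were not used to build the tree, this accuracy is a fairer estimate of how the model behaves on new data than a table computed on the training rows.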

I built a full tree below. It uses only about 25 rows because the full data set was too large, so I had to cut my data down.

smalldf_no_States <- subset(smalldf, select=c( -states_name_en))
tree_full <- sample_n(smalldf_no_States,25) %>% 
  rpart(category ~., data = ., control = rpart.control(minsplit = 2, cp = 0))
rpart.plot(tree_full, extra = 2, roundint=FALSE,
  box.palette = list( "Gn", "Bu"))

## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 3 is greater than length(box.palette) 2.
## To silence this warning use box.palette=0 or trace=-1.

I couldn’t make predictions with the full tree because the code produced an error. That’s the reason I have left it commented out below:

#pred_full <- predict(tree_full, smalldf_no_States, type = "class")
#with(smalldf, table(region_en, pred_full))

imp <- varImp(tree)
head(imp)

##                  Overall
## date_inscribed 0.6452381
## region_en      1.1485714
## states_name_en 5.9785714
## danger         0.0000000
## category_short 0.0000000

imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
  geom_bar()

barplot(imp$Overall)

Chi-squared statistic

library(FSelector)
weights <- smalldf %>% chi.squared(category ~ ., data = .) %>%
  as_tibble(rownames = "feature") %>%
  arrange(desc(attr_importance))
weights

## # A tibble: 21 × 2
##    feature              attr_importance
##    <chr>                          <dbl>
##  1 name_en                        1    
##  2 short_description_en           1    
##  3 criteria_txt                   1    
##  4 category_short                 1    
##  5 area_hectares                  0.835
##  6 states_name_en                 0.826
##  7 iso_code                       0.826
##  8 udnp_code                      0.826
##  9 secondary_dates                0.426
## 10 region_en                      0.407
## # … with 11 more rows

ggplot(weights,
  aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")

Extra tree

tree1 <- rpart(date_inscribed ~ category + danger + region_en,data = smalldf, method = 'class')
tree1

## n= 35 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 35 31 2000 (0.029 0.029 0.029 0.029 0.057 0.029 0.029 0.029 0.029 0.029 0.029 0.029 0.11 0.086 0.029 0.057 0.11 0.057 0.029 0.029 0.029 0.086)  
##   2) region_en=Asia and the Pacific,Europe and North America 26 22 2000 (0 0.038 0.038 0.038 0.077 0.038 0 0.038 0.038 0 0.038 0 0.15 0.077 0.038 0.077 0.077 0.077 0.038 0 0.038 0.077) *
##   3) region_en=Africa,Arab States,Europe and North America,Asia and the Pacific,Latin America and the Caribbean,Latin America and the Caribbean 9  7 2004 (0.11 0 0 0 0 0 0.11 0 0 0.11 0 0.11 0 0.11 0 0 0.22 0 0 0.11 0 0.11) *

rpart.plot(tree1, extra = 2)

Project 1-DATA MINING

Bidhan Subedi

9/3/2021
