Human-Orangutan Interaction (HOI) Survey: Initial Data Cleaning and Exploratory Analysis

Background

This project consists of an initial review, cleaning, and analysis of data from an interview-based social survey conducted by the Gunung Palung Orangutan Conservation Project (GPOCP) and its community conservation branch, Yayasan Palung (YP), located in West Kalimantan, Indonesia. These organizations have been involved in orangutan research and conservation work with local communities for several decades, including conservation education, sustainable livelihood programs, and legal assistance for ancestral land ownership (customary forests). This specific project consisted of carrying out interview-based surveys with both closed- and open-ended questions to gain insight into local knowledge, attitudes, and behaviors regarding orangutans, land use, conservation activities, and possibly harmful and/or illegal activities (e.g., illegal hunting of orangutans or other protected animals).

The goal of the survey was to assess overall ecological and conservation knowledge, attitudes, and behaviors of local people who live near the borders of Gunung Palung National Park (GPNP), where GPOCP conducts its long-term wild orangutan research, based at Cabang Panti Research Station. Many of these communities are ones in which GPOCP and YP have worked for years, including conservation education presentations and sustainable livelihoods programs.

General Survey Information

Surveys were conducted from October through December 2019. Interviewers were hired and trained specifically to conduct the surveys so that YP staff would not conduct the interviews themselves; this was intended to prevent biases that may arise from past relationships and interactions. Interviewers worked in teams of two (2). A total of 230 individuals were interviewed, with 210 of these being individual interviews and 20 individuals interviewed together in a focus group-style interview. Before being interviewed, individuals were informed of the goals of the study, the anonymity and confidentiality of their answers, that there would be no compensation for participating nor negative consequences for not participating, that they could choose not to answer any question for any reason, and that they could withdraw their participation at any time. All interviews were conducted in Indonesian, with both interviewers and participants fluent in this language.

Surveys consisted of close to 100 questions, though many of these were follow-ups or sub-sections of main questions. Questions included basic demographic information, checked for respondent reliability (based on ability to recognize orangutans and other animals in photos) and recall ability and consistency (by asking questions on a range of related issues, at a range of locations, over a range of time frames), examined perceptions towards orangutans and the forest, and asked about specific behaviors that may support or harm conservation efforts. Questions included a mix of closed- and open-ended questions. Questions were based on past studies, discussed in Meijaard et al. (2011), and modified for this specific study in a series of workshops with YP staff run by members of an Indonesian organization with experience in creating and conducting social surveys. Workshops were conducted in Indonesian, and questions were translated into both English and Indonesian.

Village Overview

GPNP is located within two regencies in West Kalimantan, with approximately 80% located in North Kayong Regency and the remaining approximately 20% located within Ketapang Regency. The villages chosen for this survey are therefore located in these two regencies, and are all within 20 meters of the official borders of GPNP, with several being directly adjacent to or bordering the park.

In total, eight (8) villages were surveyed:

North Kayong Regency:

Penjalaan

Directly adjacent to GPNP
30 participants surveyed
Regular customary forest projects since 2015

Rantau Panjang

Directly adjacent to GPNP
30 participants surveyed
Regular customary forest projects since 2018

Sedahan Jaya

Directly adjacent to GPNP
30 participants surveyed
Regular customary forest projects since 2018

Matan Jaya

Directly borders GPNP
20 participants surveyed, using a focus group approach instead of individual surveys
No regular conservation activities; sporadic education activities

Ketapang Regency:

Desa Sempurna

Directly adjacent to GPNP
30 participants surveyed
No regular conservation activities; sporadic education activities

Teluk Bayur

Adjacent to GPNP
30 participants surveyed
No regular conservation activities; sporadic education activities
Past investigations of possible illegal wildlife activity conducted

Pengkalan Teluk

Directly borders GPNP
30 participants surveyed
No regular conservation activities; sporadic education activities
Past investigations of possible illegal wildlife activity conducted

Laman Satong

Directly adjacent to GPNP
30 participants surveyed
Customary forest activities from 2007 - 2010. No current regular conservation activities; sporadic education activities
Past investigations of possible illegal wildlife activity conducted
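
As a quick consistency check on the village overview, the per-village counts listed above can be tallied in R. This is a minimal sketch that simply re-enters the numbers from the list; the vector name `participants` is my own and is not part of the survey data.

```r
# Participants per village, re-entered from the village overview above
participants <- c(
  "Penjalaan"       = 30,
  "Rantau Panjang"  = 30,
  "Sedahan Jaya"    = 30,
  "Matan Jaya"      = 20,  # focus-group interview
  "Desa Sempurna"   = 30,
  "Teluk Bayur"     = 30,
  "Pengkalan Teluk" = 30,
  "Laman Satong"    = 30
)

# Total number of people interviewed across all eight villages
total_participants <- sum(participants)

# Individual interviews only (everything except the Matan Jaya focus group)
individual_interviews <- sum(participants[names(participants) != "Matan Jaya"])

print(total_participants)
print(individual_interviews)
```

The two totals should agree with the 230 participants and 210 individual interviews reported in the General Survey Information section.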

Data Analysis Considerations

Dealing with Biases

Data taken from interviews, while invaluable for conservation science and policy since they can access people’s beliefs, perceptions, and behaviors, are also challenging to analyze and interpret based on a wide array of factors. There are a number of biases unique to answers being self-reported: social desirability bias, non-response bias, interviewer bias, and differential recall ability, among others.

There are several ways to minimize these biases during all phases of survey-based studies, including survey design, data acquisition, and data analysis. While I will discuss a few of the decisions I made regarding this dataset for this initial exploratory analysis, there are further steps that could be taken for analysis of this interview data, and for follow-up surveys based on this study.

Sample and Target Populations

A successful interview-based survey study should clearly state and justify its target population, and should base its sample population (i.e., the individuals who will actually be interviewed) on this target group. Specifically, it is crucial for the sample population to be both large enough and closely representative of the target population, so that the resulting data are robust and, in turn, allow for meaningful inferences from analyses.

To minimize the biases that could arise from individuals being interviewed in a group as opposed to individually, I decided to remove the participants that were part of the large focus group interview conducted in the village of Matan Jaya. This changed the total number of interviews analyzed from 230 to 210.

For this study, the target population should not be considered the same as the census population, that is, everyone from the specific villages we visited. Instead, our target population is those who are most likely to encounter orangutans in their daily lives, as these are the individuals who are able to give reports on orangutan presence in the village and in the nearby forest, and who would also be the ones responsible for hunting, killing, or otherwise interacting with orangutans. It is this human-orangutan interaction that is the focus of our study, and therefore a focus on this population allows us to more effectively answer these specific questions. One of the characteristics of this limited target population is a clear gender divide, with a disproportionate number of men falling into the target population. Due to cultural practices and gender norms in these villages, men tend to be the ones visiting the forest, working outside the village, and being responsible for hunting when it occurs. In our survey, of the 210 interviews I will be analyzing, the gender makeup is 90.5% (190) individuals identifying as men, and 9.5% (20) identifying as women. Furthermore, the average age of men surveyed was 44.5 years, with the majority of these (111, or 58.4%) between the ages of 30 and 50 years old. These men are at ages most likely to have experience traveling in the forest and hunting, activities in which they would be likely to encounter orangutans. For these reasons, the sample population that was interviewed seems, at least superficially, to accurately represent the target population. Further comparisons with data from published censuses and past (and future) interviews could allow for a more systematic approach to confirming this assertion.
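
The percentages quoted above follow directly from the reported counts. The short sketch below reproduces that arithmetic; the variable names are my own, and the counts are typed in from the text rather than computed from the survey tibble.

```r
# Counts reported in the text (not computed from the data here)
n_total     <- 210  # individual interviews analyzed
n_men       <- 190  # respondents identifying as men
n_women     <- n_total - n_men
n_men_30_50 <- 111  # men between 30 and 50 years old

# Shares of the analyzed sample, rounded to one decimal place
pct_men       <- round(100 * n_men / n_total, 1)
pct_women     <- round(100 * n_women / n_total, 1)

# Share of men (not of the whole sample) who are aged 30 to 50
pct_men_30_50 <- round(100 * n_men_30_50 / n_men, 1)

print(c(men = pct_men, women = pct_women, men_30_50 = pct_men_30_50))
```

Note that the 58.4% figure uses the number of men (190) as its denominator, not the full sample of 210.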

Data Cleaning and Exploratory Analysis in R

Opening and Cleaning Data for Analysis

First, I will read the master spreadsheet with the collated raw data into R. It is in an Excel file on my computer, so I am using the {readxl} package. I decided to use the file.choose function, which means that I must choose the file again each time I have saved changes to it and need those changes reflected in R. However, I feel that this will make it easy for someone else to use this code: if I provide them with the Excel spreadsheet, they can keep it anywhere on their computer, instead of the code pointing to a specific path on my computer. I am still able to knit the R Markdown (.Rmd) file into HTML or another format, as R will ask me for the file I want to use while knitting.

require(readxl) # Package for reading Excel files easily in R. It is part of the tidyverse, so it returns a tibble instead of a plain data frame.

## Loading required package: readxl

f <- read_excel(file.choose(), col_names = TRUE) # The file.choose command means a window will pop up that allows you to choose the file from its location on your computer, even if this location changes. I have called the tibble "f" for now. col_names = TRUE confirms that the first row of the spreadsheet should be treated as column names, and not as additional data.
head(f) # Shows the first six rows of the tibble, just to make sure everything looks good before continuing.

## # A tibble: 6 x 213
##   Interview.ID Date  Start.Time          End.Time            Interviewer Q1.1 
##   <chr>        <chr> <dttm>              <dttm>              <chr>       <chr>
## 1 LS-01        18/1~ 1899-12-31 12:45:00 1899-12-31 13:15:00 Susanto     Lama~
## 2 LS-02        20/1~ 1899-12-31 12:54:00 NA                  Chairil     Lama~
## 3 LS-03        18/1~ 1899-12-31 13:25:00 1899-12-31 14:00:00 Susanto     Lama~
## 4 LS-04        20/1~ 1899-12-31 14:12:00 NA                  Chairil     Lama~
## 5 LS-29        18/1~ 1899-12-31 09:30:00 1899-12-31 10:15:00 Susanto     Lama~
## 6 LS-05        20/1~ 1899-12-31 15:38:00 NA                  Chairil     Lama~
## # ... with 207 more variables: Q1.1a <chr>, Q1.2 <chr>, Q1.3 <dbl>, Q2.1 <chr>,
## #   Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>, Q2.5 <chr>, Q2.6 <chr>, Q2.7 <chr>,
## #   Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>, Q2.10 <chr>, Q2.10.1 <chr>,
## #   Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>, Q2.14 <chr>, Q2.15 <chr>,
## #   Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>, Q3.1b <chr>, Q3.1c <chr>,
## #   Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>, Q3.1g <chr>, Q3.1h <chr>,
## #   Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>, Q3.2 <lgl>, Q.3.2a <chr>,
## #   Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>, Q.3.2e <chr>, Q.3.2f <chr>,
## #   Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>, Q.3.2j <chr>, Q.3.2k <chr>,
## #   Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>, Q3.3c <chr>, Q3.3d <chr>,
## #   Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>, Q3.3h <chr>, Q3.3i <chr>,
## #   Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>, Q3.4a <chr>, Q3.4b <chr>,
## #   Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>, Q3.4f <chr>, Q3.4g <chr>,
## #   Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>, Q3.4k <chr>, Q3.5 <chr>,
## #   Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>, Q3.7 <chr>, Q3.8 <lgl>,
## #   Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>, Q3.8e <dbl>,
## #   Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>, Q3.9English <chr>,
## #   Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>, Q3.12 <chr>, Q3.13 <chr>,
## #   Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>, Q4.2 <chr>, Q4.3 <chr>,
## #   Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>, Q4.5.1Englsih <chr>, ...
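
For readers who would rather not pick the file by hand on every run, one possible alternative to calling file.choose directly is to try a known location first and only prompt when the file is not there. The sketch below is an assumption of mine, not part of the original workflow: the helper name `resolve_survey_path` and the file name in the usage comment are hypothetical.

```r
# Hypothetical helper: return default_path when that file exists;
# otherwise fall back to an interactive file picker.
resolve_survey_path <- function(default_path) {
  if (file.exists(default_path)) {
    return(default_path)
  }
  if (interactive()) {
    # Same pop-up window as in the code above
    return(file.choose())
  }
  # Non-interactive session and no file at the default location
  stop("Survey file not found at: ", default_path)
}

# Usage (the file name is a placeholder, not the real spreadsheet name):
# f <- readxl::read_excel(resolve_survey_path("HOI_Survey.xlsx"), col_names = TRUE)
```

This keeps the code portable, since a collaborator without the file at the default location is still prompted, while avoiding the prompt when the file is where the script expects it.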

To assure the anonymity of respondents, in the Excel spreadsheet, I replaced all the names of respondents in the column titled, “Q2.1” with the letter X, as a placeholder. I did not erase this column so that the column numbering would stay consistent with the master database column numbering, which I use as a reference document to be able to see the full questions linked with the columns in the spreadsheet.

Next, I removed the rows that contained data from the focus group, since I have decided not to use those in order to minimize biases in the data. To do so, I extracted all the rows except 31 - 50, as these are the “focus group interviews.” This puts the remaining rows into one data frame (or tibble in this case), leaving me with all the individual interviews. I then checked that this had worked by calling the entire tibble and looking at it in RStudio.

Note: “g” shows up under “Data” in the Environment tab in the upper right-hand window; clicking on it then brings the entire tibble up in its own tab on the main screen in the left-hand upper window.

g <- f[c(1:30, 51:230), ] # f is the name of the current tibble we are working with; keeping rows 1-30 and 51-230 drops the focus group rows (31-50).
g

## # A tibble: 210 x 213
##    Interview.ID Date  Start.Time          End.Time            Interviewer Q1.1 
##    <chr>        <chr> <dttm>              <dttm>              <chr>       <chr>
##  1 LS-01        18/1~ 1899-12-31 12:45:00 1899-12-31 13:15:00 Susanto     Lama~
##  2 LS-02        20/1~ 1899-12-31 12:54:00 NA                  Chairil     Lama~
##  3 LS-03        18/1~ 1899-12-31 13:25:00 1899-12-31 14:00:00 Susanto     Lama~
##  4 LS-04        20/1~ 1899-12-31 14:12:00 NA                  Chairil     Lama~
##  5 LS-29        18/1~ 1899-12-31 09:30:00 1899-12-31 10:15:00 Susanto     Lama~
##  6 LS-05        20/1~ 1899-12-31 15:38:00 NA                  Chairil     Lama~
##  7 LS-06        18/1~ 1899-12-31 15:45:00 1899-12-31 16:30:00 Susanto     Lama~
##  8 LS-07        18/1~ 1899-12-31 16:45:00 NA                  Susanto     Lama~
##  9 LS-08        20/1~ 1899-12-31 16:33:00 NA                  Chairil     Lama~
## 10 LS-09        20/1~ 1899-12-31 19:10:00 NA                  Chairil     Lama~
## # ... with 200 more rows, and 207 more variables: Q1.1a <chr>, Q1.2 <chr>,
## #   Q1.3 <dbl>, Q2.1 <chr>, Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>, Q2.5 <chr>,
## #   Q2.6 <chr>, Q2.7 <chr>, Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>, Q2.10 <chr>,
## #   Q2.10.1 <chr>, Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>, Q2.14 <chr>,
## #   Q2.15 <chr>, Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>, Q3.1b <chr>,
## #   Q3.1c <chr>, Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>, Q3.1g <chr>,
## #   Q3.1h <chr>, Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>, Q3.2 <lgl>,
## #   Q.3.2a <chr>, Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>, Q.3.2e <chr>,
## #   Q.3.2f <chr>, Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>, Q.3.2j <chr>,
## #   Q.3.2k <chr>, Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>, Q3.3c <chr>,
## #   Q3.3d <chr>, Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>, Q3.3h <chr>,
## #   Q3.3i <chr>, Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>, Q3.4a <chr>,
## #   Q3.4b <chr>, Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>, Q3.4f <chr>,
## #   Q3.4g <chr>, Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>, Q3.4k <chr>,
## #   Q3.5 <chr>, Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>, Q3.7 <chr>,
## #   Q3.8 <lgl>, Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>,
## #   Q3.8e <dbl>, Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>,
## #   Q3.9English <chr>, Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>,
## #   Q3.12 <chr>, Q3.13 <chr>, Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>,
## #   Q4.2 <chr>, Q4.3 <chr>, Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>,
## #   Q4.5.1Englsih <chr>, ...
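
Selecting rows by position works as long as the row order in the spreadsheet never changes. As a hedged alternative, the focus group could be dropped by village name instead. The sketch below uses a small made-up data frame, and it assumes that column Q1.1 holds the village name and that the focus group rows are recorded as "Matan Jaya"; both should be checked against the real data, and the interview IDs shown are purely illustrative.

```r
# Toy stand-in for the survey tibble; IDs and rows are illustrative only
survey <- data.frame(
  Interview.ID = c("A-01", "B-01", "B-02", "C-01"),
  Q1.1 = c("Laman Satong", "Matan Jaya", "Matan Jaya", "Penjalaan"),
  stringsAsFactors = FALSE
)

# Assumption: Q1.1 is the village name and the focus group is "Matan Jaya".
# %in% returns FALSE for NA, so rows with a missing village are kept
# rather than turned into all-NA rows.
individual_only <- survey[!(survey$Q1.1 %in% "Matan Jaya"), ]

print(nrow(individual_only))
```

Filtering on a value rather than on row numbers makes the step robust to rows being re-sorted or added in the spreadsheet later.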

After checking that the correct rows were removed, I viewed the characteristics of the whole tibble to get an idea of what kind of data I am working with for this analysis. To do so, I just use the str function.

str(g) # This will show the characteristics of the data frame/tibble and the data it contains.

## tibble [210 x 213] (S3: tbl_df/tbl/data.frame)
##  $ Interview.ID  : chr [1:210] "LS-01" "LS-02" "LS-03" "LS-04" ...
##  $ Date          : chr [1:210] "18/11/2019" "20/11/2019" "18/11/2019" "20/11/2019" ...
##  $ Start.Time    : POSIXct[1:210], format: "1899-12-31 12:45:00" "1899-12-31 12:54:00" ...
##  $ End.Time      : POSIXct[1:210], format: "1899-12-31 13:15:00" NA ...
##  $ Interviewer   : chr [1:210] "Susanto" "Chairil" "Susanto" "Chairil" ...
##  $ Q1.1          : chr [1:210] "Laman Satong" "Laman Satong" "Laman Satong" "Laman Satong" ...
##  $ Q1.1a         : chr [1:210] "Manjau" "Manjau" "Manjau" "Manjau" ...
##  $ Q1.2          : chr [1:210] "Dusun (village)" "Dusun (village)" "Dusun (village)" "Dusun (village)" ...
##  $ Q1.3          : num [1:210] 5664 5597 5470 3930 NA ...
##  $ Q2.1          : chr [1:210] "X" "X" "X" "X" ...
##  $ Q2.2          : chr [1:210] "28" "61" "31" "51" ...
##  $ Q2.3          : chr [1:210] "Male" "Male" "Male" "Male" ...
##  $ Q2.4          : chr [1:210] "Tani (Farmer)" "Tani (Farmer)" "Swasta" "Tani (Farmer)" ...
##  $ Q2.5          : chr [1:210] "Dayak" "Bugis" "Dayak" "Melayu" ...
##  $ Q2.6          : chr [1:210] "Catholic" "Muslim" "Catholic" "Muslim" ...
##  $ Q2.7          : chr [1:210] "20+" "20+" "20+" "20+" ...
##  $ Q2.8          : chr [1:210] NA "Yes" "Yes" "No" ...
##  $ Q2.9          : chr [1:210] "1 to 2 days per week" "1 to 2 days per week" "1 to 2 days per week" "Every day" ...
##  $ Q2.9.1        : chr [1:210] NA NA NA NA ...
##  $ Q2.10         : chr [1:210] "Fishing" "Agriculture" "Other" "Logging" ...
##  $ Q2.10.1       : chr [1:210] "Fishing" "To the garden/farm" "Farming" "To the garden/farm" ...
##  $ Q2.11         : chr [1:210] "No" "No" "No" "No" ...
##  $ Q2.12         : chr [1:210] "Village Forest" "Bantok" "Jerangau" "Tabat" ...
##  $ Q2.13         : chr [1:210] "No" "No" "No" "Yes" ...
##  $ Q2.14         : chr [1:210] NA NA NA "2" ...
##  $ Q2.15         : chr [1:210] NA NA NA "Yes" ...
##  $ Q2.16         : chr [1:210] NA NA NA "2" ...
##  $ Q3.1          : logi [1:210] NA NA NA NA NA NA ...
##  $ Q3.1a         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1b         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1c         : chr [1:210] "False" "False" "False" "False" ...
##  $ Q3.1d         : chr [1:210] "False" "Correct" "False" "Correct" ...
##  $ Q3.1e         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1f         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1g         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1h         : chr [1:210] "False" "False" "False" "False" ...
##  $ Q3.1i         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1j         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.1k         : chr [1:210] "Correct" "Correct" "Correct" "Correct" ...
##  $ Q3.2          : logi [1:210] NA NA NA NA NA NA ...
##  $ Q.3.2a        : chr [1:210] "Orang Utan" "Orang Utan" "Orang Utan" "Orang Utan" ...
##  $ Q.3.2b        : chr [1:210] "Kelempiau" "Kelempiau" "Kelempiau" "Kelempiau" ...
##  $ Q.3.2c        : chr [1:210] "Bentangan" "Bentangan" "Bentangan" "Bentangan" ...
##  $ Q.3.2d        : chr [1:210] "Penegong" NA "Penegong" "Ruai" ...
##  $ Q.3.2e        : chr [1:210] "Kelasi" "Kelasi" "Kelasi" "Kelasi" ...
##  $ Q.3.2f        : chr [1:210] "Kijang" "Kijang" "Kijang" "Kijang" ...
##  $ Q.3.2g        : chr [1:210] "Beruang" "Beruang" "Beruang" "Beruang" ...
##  $ Q.3.2h        : chr [1:210] "Tingang" "Tingang" "Tingang" "Tingang" ...
##  $ Q.3.2i        : chr [1:210] "Tenggiling" "Tenggiling" "Tenggiling" "Tenggiling" ...
##  $ Q.3.2j        : chr [1:210] "Kungkang" "Kungkang" "Kungkang" "Kungkang" ...
##  $ Q.3.2k        : chr [1:210] "Manjangan" "Rusa" "Manjangan" "Rusa" ...
##  $ Q3.3          : logi [1:210] NA NA NA NA NA NA ...
##  $ Q3.3a         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3b         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3c         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3d         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3e         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3f         : chr [1:210] "Protected" "Don't know" "Protected" "Not Protected" ...
##  $ Q3.3g         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3h         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3i         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3j         : chr [1:210] "Protected" "Protected" "Protected" "Protected" ...
##  $ Q3.3k         : chr [1:210] "Protected" "Don't know" "Protected" "Not Protected" ...
##  $ Q3.4          : logi [1:210] NA NA NA NA NA NA ...
##  $ Q3.4a         : chr [1:210] NA "Pelansi" "Sungai Kuai" "Lubuk Antu" ...
##  $ Q3.4b         : chr [1:210] "Hutan Desa" "Kumpang" "Bukit Bujang" "Tabat" ...
##  $ Q3.4c         : chr [1:210] "Sungai Durian" "S. Ulu Satong" NA "S. Ulu Tolak" ...
##  $ Q3.4d         : chr [1:210] NA NA NA "B.Kumpang" ...
##  $ Q3.4e         : chr [1:210] NA "Kumpang" "Bukit Bujang" "Tabat" ...
##  $ Q3.4f         : chr [1:210] "Hutan Desa" "Kumpang" "Bagan Liman" "B.Kumpang" ...
##  $ Q3.4g         : chr [1:210] NA "Lubuk Antu" NA "B.Kumpang" ...
##  $ Q3.4h         : chr [1:210] "Hutan Desa" "B. Kumpang" "Bukit Bujang" "B.Kumpang" ...
##  $ Q3.4i         : chr [1:210] "Dusun Manjan" NA "Bagan Liman" "B.Kumpang" ...
##  $ Q3.4j         : chr [1:210] NA NA "Bagan Liman" "Tabat" ...
##  $ Q3.4k         : chr [1:210] "Hutan Desa" NA "Bagan Liman" "B.Kumpang" ...
##  $ Q3.5          : chr [1:210] NA NA NA NA ...
##  $ Q3.5a         : chr [1:210] NA "Pelansi" "Sungai Kuai" "Lubuk Antu" ...
##  $ Q3.5b         : chr [1:210] NA NA NA NA ...
##  $ Q3.5c         : chr [1:210] NA NA NA NA ...
##  $ Q3.6          : chr [1:210] NA "5+" "5+" "<1" ...
##  $ Q3.7          : chr [1:210] "Decrease" "Increase" "Increase" "Increase" ...
##  $ Q3.8          : logi [1:210] NA NA NA NA NA NA ...
##  $ Q3.8a         : num [1:210] NA NA 1 NA NA NA 2 NA NA NA ...
##  $ Q3.8b         : num [1:210] NA NA 5 NA NA NA 3 NA NA NA ...
##  $ Q3.8c         : num [1:210] 2 NA 2 NA 2 NA 1 NA NA NA ...
##  $ Q3.8d         : num [1:210] 1 NA 4 NA 3 NA 4 NA NA NA ...
##  $ Q3.8e         : num [1:210] NA NA NA NA NA NA NA NA NA NA ...
##  $ Q3.8f         : num [1:210] NA NA NA NA NA NA NA NA NA NA ...
##  $ Q3.8g         : num [1:210] 3 NA 3 NA 1 NA 5 NA NA NA ...
##  $ Q3.8h         : num [1:210] NA NA NA NA NA NA NA NA NA NA ...
##  $ Q3.9          : chr [1:210] "Pembukaan lahan tambang menyebabkan orang utan jadi menghilang" NA "Karena pernah di buru maka orang utan jadi berkurang" NA ...
##  $ Q3.9English   : chr [1:210] "\r\nMining land clearing causes orangutans to disappear" NA "Because once hunted, the orangutans are reduced" NA ...
##  $ Q3.10         : chr [1:210] NA "Karena sudah tidak di buru dan sudah di lindungi" NA "Karena sudah tidak di ganggu" ...
##  $ Q3.10English  : chr [1:210] NA "\r\nBecause it's not being hunted, and it's been protected" NA "\r\nBecause it is not disturbed" ...
##  $ Q3.11         : chr [1:210] "No" "Yes" "No" "No" ...
##  $ Q3.12         : chr [1:210] NA "Can chase people" NA NA ...
##  $ Q3.13         : chr [1:210] "Yes" "Yes" "Yes" "Yes" ...
##  $ Q3.14         : chr [1:210] "Supaya tidak punah dan habitatnya terjaga" "Karena memang sudah dilarang" "Untuk keseimbangan alam dan penyebar bibit di hutan" "Karena memang sudah di atur negara" ...
##  $ Q3.14English  : chr [1:210] "\r\nSo that they are not extinct and their habitat is maintained" "\r\nBecause it is already prohibited" "For the balance of nature and seed dispersers in the forest" "\r\nBecause it is already set by the state" ...
##   [list output truncated]

Many of the columns are coming up as “chr”, which means they are being read as characters. While this is fine for most of the columns, I will need to adjust some that need to be read as dates, numbers, etc. The output of the str function also shows me that the data is currently in a tibble, which will allow me to clean up the data and manipulate it with functions in the {tidyr} and {dplyr} packages.
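
As an illustration of the kind of adjustment meant here, a character column holding numbers can be converted and checked for entries that do not convert cleanly. The sketch below uses a made-up vector modeled on the Q2.2 values shown in the str output (which I am assuming is the respondent's age); the "unknown" entry is invented to show what a problem value would look like.

```r
# Illustrative values modeled on Q2.2; "unknown" is an invented bad entry
age_chr <- c("28", "61", "31", "51", NA, "unknown")

# Convert to integer; non-numeric text becomes NA (warning suppressed here
# because the failures are inspected explicitly below)
age_int <- suppressWarnings(as.integer(age_chr))

# Positions that had a value before conversion but are NA afterwards
bad <- which(is.na(age_int) & !is.na(age_chr))

print(age_int)
print(age_chr[bad])
```

Checking `bad` separates genuinely missing answers from typos or free-text entries that need to be fixed in the spreadsheet.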

Also, note that the columns for “Start.Time” and “End.Time” are classed as POSIXct. Each value currently carries a placeholder date (1899-12-31, which appears when a time-only Excel cell is read as a date-time) alongside the correct time; I will fix this later (see below). Initially, these time columns were showing up as chr data, like many of the other columns. After exploring these columns, it became clear that one of the cells in each of the time columns in the original Excel spreadsheet was incorrectly formatted, affecting the entire column. Once these cells were fixed and the updated Excel file was opened in RStudio, these time columns automatically showed up as the POSIXct data type, which is what I want them to be for further manipulation.

I am going to rename the tibble from g to HOI_Survey.

HOI_Survey <- g #Just renaming the tibble to make it more relevant.

Next, I used the mutate function and the {lubridate} package to convert the date column in the HOI_Survey tibble from the character class to the date class, parsing it in day-month-year order.

library("lubridate")

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

HOI_Survey_Date_Update <- HOI_Survey %>% #Starting the piping command, also updating the name of the tibble to reflect what I was specifically manipulating.
  mutate(Date = lubridate::dmy(Date)) # dmy stands for day-month-year

## Warning: Problem with `mutate()` input `Date`.
## i  2 failed to parse.
## i Input `Date` is `lubridate::dmy(Date)`.

## Warning: 2 failed to parse.

HOI_Survey_Date_Update

## # A tibble: 210 x 213
##    Interview.ID Date       Start.Time          End.Time            Interviewer
##    <chr>        <date>     <dttm>              <dttm>              <chr>      
##  1 LS-01        2019-11-18 1899-12-31 12:45:00 1899-12-31 13:15:00 Susanto    
##  2 LS-02        2019-11-20 1899-12-31 12:54:00 NA                  Chairil    
##  3 LS-03        2019-11-18 1899-12-31 13:25:00 1899-12-31 14:00:00 Susanto    
##  4 LS-04        2019-11-20 1899-12-31 14:12:00 NA                  Chairil    
##  5 LS-29        2019-11-18 1899-12-31 09:30:00 1899-12-31 10:15:00 Susanto    
##  6 LS-05        2019-11-20 1899-12-31 15:38:00 NA                  Chairil    
##  7 LS-06        2019-11-18 1899-12-31 15:45:00 1899-12-31 16:30:00 Susanto    
##  8 LS-07        2019-11-18 1899-12-31 16:45:00 NA                  Susanto    
##  9 LS-08        2019-11-20 1899-12-31 16:33:00 NA                  Chairil    
## 10 LS-09        2019-11-20 1899-12-31 19:10:00 NA                  Chairil    
## # ... with 200 more rows, and 208 more variables: Q1.1 <chr>, Q1.1a <chr>,
## #   Q1.2 <chr>, Q1.3 <dbl>, Q2.1 <chr>, Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>,
## #   Q2.5 <chr>, Q2.6 <chr>, Q2.7 <chr>, Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>,
## #   Q2.10 <chr>, Q2.10.1 <chr>, Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>,
## #   Q2.14 <chr>, Q2.15 <chr>, Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>,
## #   Q3.1b <chr>, Q3.1c <chr>, Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>,
## #   Q3.1g <chr>, Q3.1h <chr>, Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>,
## #   Q3.2 <lgl>, Q.3.2a <chr>, Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>,
## #   Q.3.2e <chr>, Q.3.2f <chr>, Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>,
## #   Q.3.2j <chr>, Q.3.2k <chr>, Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>,
## #   Q3.3c <chr>, Q3.3d <chr>, Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>,
## #   Q3.3h <chr>, Q3.3i <chr>, Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>,
## #   Q3.4a <chr>, Q3.4b <chr>, Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>,
## #   Q3.4f <chr>, Q3.4g <chr>, Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>,
## #   Q3.4k <chr>, Q3.5 <chr>, Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>,
## #   Q3.7 <chr>, Q3.8 <lgl>, Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>,
## #   Q3.8e <dbl>, Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>,
## #   Q3.9English <chr>, Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>,
## #   Q3.12 <chr>, Q3.13 <chr>, Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>,
## #   Q4.2 <chr>, Q4.3 <chr>, Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>, ...

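The above code successfully changed the class of the “Date” column from chr to date, as seen in the resulting tibble when I call HOI_Survey_Date_Update. Note, however, that the warning reports that 2 values failed to parse; these entries are now NA and should be checked against the original spreadsheet.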
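
The warning in the output above reports that two date values failed to parse. One way to locate such entries is to compare the original text with the parsed result. The sketch below shows the idea on a made-up vector; it uses base R's as.Date with a strict day/month/year format as a stand-in for lubridate::dmy, so the exact set of strings that fail may differ from what dmy rejects.

```r
# Illustrative date strings; the third one is deliberately malformed
dates_chr <- c("18/11/2019", "20/11/2019", "20-11-19", NA)

# Strict parse: anything not matching dd/mm/yyyy becomes NA
dates <- as.Date(dates_chr, format = "%d/%m/%Y")

# Positions where there was text but parsing produced NA
failed <- which(is.na(dates) & !is.na(dates_chr))

print(dates_chr[failed])
```

Applied to the real tibble, the same comparison between the original Date column in HOI_Survey and the parsed column in HOI_Survey_Date_Update would point to the two rows that need correcting.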

Then, I wanted to clean the time columns to show the time only, and remove the inaccurate and unnecessary dates that appear along with the time when time data is stored in the POSIXct data class. This was a bit tricky, but the code below converted the POSIXct data class to the S3:hms (hour-minute-second) data class. The pipe (%>%) allows me to chain all of these manipulations together in one step.

library(hms)

## 
## Attaching package: 'hms'

## The following object is masked from 'package:lubridate':
## 
##     hms

HOI_Survey_Time <- HOI_Survey_Date_Update %>% # Updating the name again to reflect the current action.
  mutate(Start.Time = lubridate::ymd_hms(Start.Time)) %>% # Parsing the "Start.Time" column as a year-month-day hour-minute-second date-time
  mutate(Start.Time = hms::as_hms(Start.Time)) %>% # Converting the "Start.Time" column to hms (time of day only)
  mutate(End.Time = lubridate::ymd_hms(End.Time)) %>% # Parsing the "End.Time" column as a year-month-day hour-minute-second date-time
  mutate(End.Time = hms::as_hms(End.Time)) # Converting the "End.Time" column to hms (time of day only)
HOI_Survey_Time

## # A tibble: 210 x 213
##    Interview.ID Date       Start.Time End.Time Interviewer Q1.1  Q1.1a Q1.2 
##    <chr>        <date>     <time>     <time>   <chr>       <chr> <chr> <chr>
##  1 LS-01        2019-11-18 12:45      13:15    Susanto     Lama~ Manj~ Dusu~
##  2 LS-02        2019-11-20 12:54         NA    Chairil     Lama~ Manj~ Dusu~
##  3 LS-03        2019-11-18 13:25      14:00    Susanto     Lama~ Manj~ Dusu~
##  4 LS-04        2019-11-20 14:12         NA    Chairil     Lama~ Manj~ Dusu~
##  5 LS-29        2019-11-18 09:30      10:15    Susanto     Lama~ Manj~ Dusu~
##  6 LS-05        2019-11-20 15:38         NA    Chairil     Lama~ Manj~ Dusu~
##  7 LS-06        2019-11-18 15:45      16:30    Susanto     Lama~ Manj~ Dusu~
##  8 LS-07        2019-11-18 16:45         NA    Susanto     Lama~ Manj~ Dusu~
##  9 LS-08        2019-11-20 16:33         NA    Chairil     Lama~ Manj~ Dusu~
## 10 LS-09        2019-11-20 19:10         NA    Chairil     Lama~ Manj~ Dusu~
## # ... with 200 more rows, and 205 more variables: Q1.3 <dbl>, Q2.1 <chr>,
## #   Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>, Q2.5 <chr>, Q2.6 <chr>, Q2.7 <chr>,
## #   Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>, Q2.10 <chr>, Q2.10.1 <chr>,
## #   Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>, Q2.14 <chr>, Q2.15 <chr>,
## #   Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>, Q3.1b <chr>, Q3.1c <chr>,
## #   Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>, Q3.1g <chr>, Q3.1h <chr>,
## #   Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>, Q3.2 <lgl>, Q.3.2a <chr>,
## #   Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>, Q.3.2e <chr>, Q.3.2f <chr>,
## #   Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>, Q.3.2j <chr>, Q.3.2k <chr>,
## #   Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>, Q3.3c <chr>, Q3.3d <chr>,
## #   Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>, Q3.3h <chr>, Q3.3i <chr>,
## #   Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>, Q3.4a <chr>, Q3.4b <chr>,
## #   Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>, Q3.4f <chr>, Q3.4g <chr>,
## #   Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>, Q3.4k <chr>, Q3.5 <chr>,
## #   Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>, Q3.7 <chr>, Q3.8 <lgl>,
## #   Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>, Q3.8e <dbl>,
## #   Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>, Q3.9English <chr>,
## #   Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>, Q3.12 <chr>, Q3.13 <chr>,
## #   Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>, Q4.2 <chr>, Q4.3 <chr>,
## #   Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>, Q4.5.1Englsih <chr>, Q4.6 <chr>,
## #   Q4.7 <chr>, ...

With the time columns converted to a more user-friendly data class, S3:hms, I can add a column to the tibble that holds the duration of each interview in minutes, calculated from the start and end times.

Interview.Interval <- HOI_Survey_Time$Start.Time %--% HOI_Survey_Time$End.Time # Building an interval that runs from Start.Time to End.Time for each row, so its length is End.Time minus Start.Time.

## Warning: tz(): Don't know how to compute timezone for object of class hms/
## difftime; returning "UTC". This warning will become an error in the next major
## version of lubridate.

Interview.Duration <- as.duration(Interview.Interval) / dminutes(1) # Converting each interval to a duration and dividing by one minute, so the result is the interview length in minutes (NA where End.Time is missing).
Interview.Duration #Calling Interview.Duration to make sure it looks like what I am expecting

##   [1]  30  NA  35  NA  45  NA  45  NA  NA  NA  70  60  NA  55  NA  NA  NA  70
##  [19]  80  45  75  NA  40  NA  41  NA  NA  50  NA  NA  40  NA  40  NA  31  NA
##  [37]  29  NA  NA  NA  48  NA  40  NA  44  NA  54  NA  52  NA  48  44  NA  NA
##  [55]  37  41  NA  36  43  NA  NA  40  NA  50  55  50  NA  NA  45  40  NA  40
##  [73]  NA  NA  35  40  60  39  NA  NA  NA  46  40  50  40  70  NA  NA  NA  NA
##  [91]  NA  32   0  35  NA  35  NA  31  NA  38  NA  34  37 108  NA  NA  NA  37
## [109]  NA  NA  36  NA  38  NA  45  NA  NA  35  NA  33  NA  35  NA  35  NA  40
## [127]  NA  38  35  NA  NA  40  35  43  NA  43  NA  53  NA  65  NA  NA  35  NA
## [145]  NA  NA  50  50  NA  25  NA  NA  45  NA  46  NA  NA  NA  NA  NA  41  NA
## [163]  45  NA  61  NA  35  NA  39  NA  39  NA  40  NA  48  NA  25  NA  53  NA
## [181]  45  NA  49  NA  35  NA  NA  42  NA  NA  41  55  38  NA  NA  35  NA  32
## [199]  NA  35  NA  27  NA  NA  NA  40  NA  30  45  NA
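
The warning above appears because `%--%` is meant for date-times, while Start.Time and End.Time are hms values with no date or time zone attached. A minimal sketch of an alternative, assuming every interview starts and ends on the same day, is to subtract the two times as seconds since midnight. The vectors `start` and `end` below are made-up stand-ins for the real columns, and the `suspect` check is only an illustration of how a value like the 0 at position 93 above could be flagged for review.

```r
library(hms)

# Toy stand-ins for HOI_Survey_Time$Start.Time and $End.Time (hypothetical values)
start <- as_hms(c("12:45:00", "12:54:00", "09:30:00"))
end   <- as_hms(c("13:15:00", NA, "10:15:00"))

# hms values are stored as seconds since midnight, so the difference
# divided by 60 is the interview length in minutes (NA where End.Time is missing)
duration_min <- (as.numeric(end) - as.numeric(start)) / 60
duration_min

# Positions of durations that deserve a second look (zero or negative lengths)
suspect <- which(!is.na(duration_min) & duration_min <= 0)
suspect
```

This avoids the time zone lookup entirely, at the cost of giving wrong answers for any interview that runs past midnight.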

HOI_Survey_Time <- HOI_Survey_Time %>% 
  mutate(Duration = Interview.Duration)
HOI_Survey_Time

## # A tibble: 210 x 214
##    Interview.ID Date       Start.Time End.Time Interviewer Q1.1  Q1.1a Q1.2 
##    <chr>        <date>     <time>     <time>   <chr>       <chr> <chr> <chr>
##  1 LS-01        2019-11-18 12:45      13:15    Susanto     Lama~ Manj~ Dusu~
##  2 LS-02        2019-11-20 12:54         NA    Chairil     Lama~ Manj~ Dusu~
##  3 LS-03        2019-11-18 13:25      14:00    Susanto     Lama~ Manj~ Dusu~
##  4 LS-04        2019-11-20 14:12         NA    Chairil     Lama~ Manj~ Dusu~
##  5 LS-29        2019-11-18 09:30      10:15    Susanto     Lama~ Manj~ Dusu~
##  6 LS-05        2019-11-20 15:38         NA    Chairil     Lama~ Manj~ Dusu~
##  7 LS-06        2019-11-18 15:45      16:30    Susanto     Lama~ Manj~ Dusu~
##  8 LS-07        2019-11-18 16:45         NA    Susanto     Lama~ Manj~ Dusu~
##  9 LS-08        2019-11-20 16:33         NA    Chairil     Lama~ Manj~ Dusu~
## 10 LS-09        2019-11-20 19:10         NA    Chairil     Lama~ Manj~ Dusu~
## # ... with 200 more rows, and 206 more variables: Q1.3 <dbl>, Q2.1 <chr>,
## #   Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>, Q2.5 <chr>, Q2.6 <chr>, Q2.7 <chr>,
## #   Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>, Q2.10 <chr>, Q2.10.1 <chr>,
## #   Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>, Q2.14 <chr>, Q2.15 <chr>,
## #   Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>, Q3.1b <chr>, Q3.1c <chr>,
## #   Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>, Q3.1g <chr>, Q3.1h <chr>,
## #   Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>, Q3.2 <lgl>, Q.3.2a <chr>,
## #   Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>, Q.3.2e <chr>, Q.3.2f <chr>,
## #   Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>, Q.3.2j <chr>, Q.3.2k <chr>,
## #   Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>, Q3.3c <chr>, Q3.3d <chr>,
## #   Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>, Q3.3h <chr>, Q3.3i <chr>,
## #   Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>, Q3.4a <chr>, Q3.4b <chr>,
## #   Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>, Q3.4f <chr>, Q3.4g <chr>,
## #   Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>, Q3.4k <chr>, Q3.5 <chr>,
## #   Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>, Q3.7 <chr>, Q3.8 <lgl>,
## #   Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>, Q3.8e <dbl>,
## #   Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>, Q3.9English <chr>,
## #   Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>, Q3.12 <chr>, Q3.13 <chr>,
## #   Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>, Q4.2 <chr>, Q4.3 <chr>,
## #   Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>, Q4.5.1Englsih <chr>, Q4.6 <chr>,
## #   Q4.7 <chr>, ...

The new Duration column, which holds the values from Interview.Duration, is now added at the end of the tibble (214 columns instead of 213). Yay for new data!

Here, I am just copying the tibble to a shorter name, “HOI”, to keep it simple as I move on to some analysis. This version is my “cleaned” and updated version of the original Excel spreadsheet data. However, there is much more cleaning that could be done as I continue my analysis. For example, Meijaard et al. (2011) suggest filling in blank or “NA” cells with information where appropriate, to get a dataset with much less missing data. One instance would be replacing a blank with “Don’t know” if I am comfortable assuming that an unanswered yes/no question means the person did not know the answer (versus not wanting to answer). Another is where a blank can clearly be filled in from the respondent’s other answers: if a respondent says that they have never encountered an orangutan, but the later question about ever killing an orangutan is blank, then it would be safe to fill this blank as “no” based on the earlier answer. While it would take some time to go through each interview separately, this could make the data more robust and allow for further analyses that the current sample sizes for some questions do not support.

HOI<-HOI_Survey_Time
HOI

## # A tibble: 210 x 214
##    Interview.ID Date       Start.Time End.Time Interviewer Q1.1  Q1.1a Q1.2 
##    <chr>        <date>     <time>     <time>   <chr>       <chr> <chr> <chr>
##  1 LS-01        2019-11-18 12:45      13:15    Susanto     Lama~ Manj~ Dusu~
##  2 LS-02        2019-11-20 12:54         NA    Chairil     Lama~ Manj~ Dusu~
##  3 LS-03        2019-11-18 13:25      14:00    Susanto     Lama~ Manj~ Dusu~
##  4 LS-04        2019-11-20 14:12         NA    Chairil     Lama~ Manj~ Dusu~
##  5 LS-29        2019-11-18 09:30      10:15    Susanto     Lama~ Manj~ Dusu~
##  6 LS-05        2019-11-20 15:38         NA    Chairil     Lama~ Manj~ Dusu~
##  7 LS-06        2019-11-18 15:45      16:30    Susanto     Lama~ Manj~ Dusu~
##  8 LS-07        2019-11-18 16:45         NA    Susanto     Lama~ Manj~ Dusu~
##  9 LS-08        2019-11-20 16:33         NA    Chairil     Lama~ Manj~ Dusu~
## 10 LS-09        2019-11-20 19:10         NA    Chairil     Lama~ Manj~ Dusu~
## # ... with 200 more rows, and 206 more variables: Q1.3 <dbl>, Q2.1 <chr>,
## #   Q2.2 <chr>, Q2.3 <chr>, Q2.4 <chr>, Q2.5 <chr>, Q2.6 <chr>, Q2.7 <chr>,
## #   Q2.8 <chr>, Q2.9 <chr>, Q2.9.1 <chr>, Q2.10 <chr>, Q2.10.1 <chr>,
## #   Q2.11 <chr>, Q2.12 <chr>, Q2.13 <chr>, Q2.14 <chr>, Q2.15 <chr>,
## #   Q2.16 <chr>, Q3.1 <lgl>, Q3.1a <chr>, Q3.1b <chr>, Q3.1c <chr>,
## #   Q3.1d <chr>, Q3.1e <chr>, Q3.1f <chr>, Q3.1g <chr>, Q3.1h <chr>,
## #   Q3.1i <chr>, Q3.1j <chr>, Q3.1k <chr>, Q3.2 <lgl>, Q.3.2a <chr>,
## #   Q.3.2b <chr>, Q.3.2c <chr>, Q.3.2d <chr>, Q.3.2e <chr>, Q.3.2f <chr>,
## #   Q.3.2g <chr>, Q.3.2h <chr>, Q.3.2i <chr>, Q.3.2j <chr>, Q.3.2k <chr>,
## #   Q3.3 <lgl>, Q3.3a <chr>, Q3.3b <chr>, Q3.3c <chr>, Q3.3d <chr>,
## #   Q3.3e <chr>, Q3.3f <chr>, Q3.3g <chr>, Q3.3h <chr>, Q3.3i <chr>,
## #   Q3.3j <chr>, Q3.3k <chr>, Q3.4 <lgl>, Q3.4a <chr>, Q3.4b <chr>,
## #   Q3.4c <chr>, Q3.4d <chr>, Q3.4e <chr>, Q3.4f <chr>, Q3.4g <chr>,
## #   Q3.4h <chr>, Q3.4i <chr>, Q3.4j <chr>, Q3.4k <chr>, Q3.5 <chr>,
## #   Q3.5a <chr>, Q3.5b <chr>, Q3.5c <chr>, Q3.6 <chr>, Q3.7 <chr>, Q3.8 <lgl>,
## #   Q3.8a <dbl>, Q3.8b <dbl>, Q3.8c <dbl>, Q3.8d <dbl>, Q3.8e <dbl>,
## #   Q3.8f <dbl>, Q3.8g <dbl>, Q3.8h <dbl>, Q3.9 <chr>, Q3.9English <chr>,
## #   Q3.10 <chr>, Q3.10English <chr>, Q3.11 <chr>, Q3.12 <chr>, Q3.13 <chr>,
## #   Q3.14 <chr>, Q3.14English <chr>, Q4.1 <chr>, Q4.2 <chr>, Q4.3 <chr>,
## #   Q4.4 <chr>, Q4.5 <chr>, Q4.5.1 <chr>, Q4.5.1Englsih <chr>, Q4.6 <chr>,
## #   Q4.7 <chr>, ...
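
The context-based fill described above could be sketched roughly as follows. This is only an illustration on toy data: `encountered` and `killed` are hypothetical column names rather than the survey's real question codes, and the rule itself (a blank becomes "No" when the respondent never encountered an orangutan) is the assumption discussed in the text, not a step that has been applied to HOI.

```r
library(dplyr)

# Toy data; `encountered` and `killed` are hypothetical column names,
# not the survey's real question codes
toy <- tibble(
  id          = c("A", "B", "C"),
  encountered = c("No", "Yes", "No"),
  killed      = c(NA, NA, "No")
)

filled <- toy %>%
  mutate(
    # Record which cells were inferred so the change stays auditable
    killed_inferred = is.na(killed) & encountered == "No",
    # Fill only the inferred cells; every other value is left untouched
    killed = if_else(killed_inferred, "No", killed)
  )
filled
```

Keeping a flag column such as `killed_inferred` makes it possible to rerun any analysis with and without the filled values.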

Exploratory Analysis in R

Here I am listing the explanations given by respondents who said hunting has a positive impact, just to get an idea of what kinds of answers are common.

Positive_impact<-filter(HOI, Q4.21 == "Positive impact")
Positive_impact$Q4.21.1English

##  [1] "\r\nPig pests are reduced"                                                          
##  [2] "Primates and pigs are pests, so they are reduced"                                   
##  [3] "Kill pests"                                                                         
##  [4] "\r\nCan reduce pests like pigs and monkeys, and can be consumed"                    
##  [5] "\r\nPig Pests are reduced"                                                          
##  [6] "\r\nCan reduce animals who are plant destroyers"                                    
##  [7] "\r\nPests are reduced"                                                              
##  [8] "\r\nFor consumption and for sale"                                                   
##  [9] "\r\nReducing pig pests"                                                             
## [10] "\r\nCan be eaten and sold"                                                          
## [11] "Can be consumed and sold"                                                           
## [12] "\r\nReducing wild boar pests"                                                       
## [13] "\r\nPlants are not disturbed"                                                       
## [14] "\r\nPest pigs out, garden safe"                                                     
## [15] "\r\nFor killing pests"                                                              
## [16] "Reducing pig and squirrel pests"                                                    
## [17] "Reducing pig pests"                                                                 
## [18] "Can reduce pig and squirrel pests"                                                  
## [19] "Because pig pests are reduced"                                                      
## [20] "pig pests are reduced"                                                              
## [21] "\r\nGet food, because the results are usually divided"                              
## [22] "Sometimes it's just for eating"                                                     
## [23] "\r\nCan be consumed and sold"                                                       
## [24] "Can be consumed, and reduce enemy plants"                                           
## [25] "\r\nBecause looking only to eat"                                                    
## [26] "Taken only to be eaten"                                                             
## [27] "Can be sold and consumed"                                                           
## [28] "Can be consumed and sold"                                                           
## [29] "\r\nTo reduce pig pests"                                                            
## [30] "\r\nCommunity plants are not disturbed"                                             
## [31] "For us, rice fields are safe from pests"                                            
## [32] "Can Be Eaten"                                                                       
## [33] "\r\nCan reduce pig pests"                                                           
## [34] "\r\nKill Pests"                                                                     
## [35] "\r\nPig are pests, So they are Reduced"                                             
## [36] "\r\nIf you can sell it"                                                             
## [37] "Reducing Pests"                                                                     
## [38] "\r\nBecause killing pests"                                                          
## [39] "\r\nReducing Pests"                                                                 
## [40] "\r\nReducing Pests"                                                                 
## [41] "\r\nPig Pests reduced, deer and mouse-deer can be eaten"                            
## [42] "\r\nPig Pests Reduced"                                                              
## [43] "\r\nWith the hunting of pigs, farmers' crops are safe"                              
## [44] "Reducing Pests"                                                                     
## [45] "Reducing Pests"                                                                     
## [46] "\r\nSquirrel pests that disturb plants are reduced"                                 
## [47] "Pest pigs are gone"                                                                 
## [48] "\r\nPig pests are reduced"                                                          
## [49] "\r\nCan be consumed and sold"                                                       
## [50] "\r\nReducing Pests"                                                                 
## [51] "\r\nCan be consumed and sold"                                                       
## [52] "\r\nKilling pig pests"                                                              
## [53] "\r\nSquirrel pests are reduced"                                                     
## [54] "Can be consumed and can be sold"                                                    
## [55] "\r\nHalal animals can be consumed and sold"                                         
## [56] "\r\nPig pests are gone"                                                             
## [57] "\r\nCan reduce the squirrel and pig pests"                                          
## [58] "\r\nFor consumption and sale"                                                       
## [59] "Pests reduced"                                                                      
## [60] "Pests reduced"                                                                      
## [61] "Reducing Pests"                                                                     
## [62] "\r\nNo Pests"                                                                       
## [63] "\r\nAble to be consumed and sold"                                                   
## [64] "Pests gone"                                                                         
## [65] "\r\nKill Pests"                                                                     
## [66] "Reducing Pests"                                                                     
## [67] "\r\nPigs become reduced because pigs disrupt plants"                                
## [68] "\r\nReduced pig pests"                                                              
## [69] "\r\nCan eradicate pests like pigs that destroy plants"                              
## [70] "\r\nPig pests are reduced"                                                          
## [71] "\r\nPlants that are often disturbed by pigs are safe"                               
## [72] "Reducing animals that damage community plants"                                      
## [73] "Because pigs that can damage plants are gone"                                       
## [74] "Because it is a pig pest, it doesn't interfere"                                     
## [75] "\r\nKilling pig pests"                                                              
## [76] "Pigs that normally disturb and damage plants are reduced"                           
## [77] "\r\nPests in the garden can be reduced"                                             
## [78] "Can reduce animals that often damage plants"                                        
## [79] "\r\nFor consumption and for sale"                                                   
## [80] "\r\nHalal animals can be eaten"                                                     
## [81] "Kill pests, like pigs, and gain food"                                               
## [82] "\r\nCan be eaten and the excess is for sale"                                        
## [83] "\r\nCan be consumed and sold"                                                       
## [84] "\r\nCan reduce pigs that like to damage plants, and halal animals that can be eaten"
## [85] "Can be consumed and sold"                                                           
## [86] "\r\nPigs that normally disturb plants are reduced"                                  
## [87] "\r\nCan be consumed by families and can be sold"                                    
## [88] "Animals are not destructive"
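
Many of the answers above begin with "\r\n" (a carriage return and line break carried over from the spreadsheet) and differ only in capitalisation. A small base R sketch of how they could be normalised before counting, using a few responses copied from the list above:

```r
# A few responses copied from the output above
answers <- c("\r\nPig pests are reduced", "pig pests are reduced",
             "\r\nReducing Pests", "Reducing Pests")

# trimws() strips leading/trailing spaces, tabs, carriage returns and newlines;
# tolower() removes differences in capitalisation
clean <- tolower(trimws(answers))

# Count identical answers after cleaning
sort(table(clean), decreasing = TRUE)
```

The same two calls could be applied to the whole Q4.21.1English column if exact-match counts of answers are wanted.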

With text-mining, I can do a more systematic review of word frequency and the relationships between commonly used words.

library(tm)

## Loading required package: NLP

Positive_Corpus <- VCorpus(VectorSource(HOI$Q4.21.1English)) # Creating a corpus out of a vector. Each cell in the column will be a "document" in our corpus. Note that this uses the column from the full HOI tibble (all 210 rows), so the corpus holds the explanations given for every answer to Q4.21, not only "Positive impact"; Positive_impact$Q4.21.1English would restrict it to the positive-impact answers
Positive_Corpus <- tm_map(Positive_Corpus, content_transformer(tolower)) # makes everything lower case
Positive_Corpus <- tm_map(Positive_Corpus, removePunctuation) # removes punctuation
mystopwords <- c(stopwords("english"))
Positive_Corpus <- tm_map(Positive_Corpus, removeWords, mystopwords) # removes common "stop words" that won't provide me with any more insight (e.g., are, the)
Positive_CorpusDTM <- DocumentTermMatrix(Positive_Corpus) # Creating document-term matrices to allow for investigation of the data and to create a data frame
Positive_CorpusDTM

## <<DocumentTermMatrix (documents: 210, terms: 174)>>
## Non-/sparse entries: 765/35775
## Sparsity           : 98%
## Maximal term length: 13
## Weighting          : term frequency (tf)

dim(Positive_CorpusDTM)

## [1] 210 174

inspect(Positive_CorpusDTM[1:20, 1:25])

## <<DocumentTermMatrix (documents: 20, terms: 25)>>
## Non-/sparse entries: 12/488
## Sparsity           : 98%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs able across allowed also animals anything become benefit can causing
##   1     0      0       0    0       0        0      0       0   0       0
##   11    0      0       0    0       1        0      1       0   0       1
##   12    0      0       0    0       0        0      0       0   2       0
##   14    0      0       0    0       1        0      0       0   1       0
##   15    0      0       0    0       1        0      0       0   0       0
##   16    0      0       0    0       1        0      0       0   0       0
##   20    0      0       0    0       0        0      0       0   1       0
##   4     0      0       0    0       0        0      0       1   0       0
##   6     0      0       0    0       1        0      0       0   0       0
##   8     0      0       0    0       1        0      0       0   0       0

Positive_CorpusTDM <- TermDocumentMatrix(Positive_Corpus)
inspect(Positive_CorpusTDM[1:25, 1:20])

## <<TermDocumentMatrix (terms: 25, documents: 20)>>
## Non-/sparse entries: 12/488
## Sparsity           : 98%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##           Docs
## Terms      1 11 12 14 15 16 20 4 6 8
##   able     0  0  0  0  0  0  0 0 0 0
##   across   0  0  0  0  0  0  0 0 0 0
##   allowed  0  0  0  0  0  0  0 0 0 0
##   also     0  0  0  0  0  0  0 0 0 0
##   animals  0  1  0  1  1  1  0 0 1 1
##   anything 0  0  0  0  0  0  0 0 0 0
##   become   0  1  0  0  0  0  0 0 0 0
##   benefit  0  0  0  0  0  0  0 1 0 0
##   can      0  0  2  1  0  0  1 0 0 0
##   causing  0  1  0  0  0  0  0 0 0 0

Positive_CorpusFreq <- colSums(as.matrix(Positive_CorpusDTM)) # Summing each term's counts across all documents, as the first step toward a data frame of the most frequently used words in the corpus
Positive_CorpusFreq <- sort(Positive_CorpusFreq, decreasing = TRUE)
Positive_CorpusDF <- data.frame(word = names(Positive_CorpusFreq), freq = Positive_CorpusFreq)
rownames(Positive_CorpusDF) <- NULL
head(Positive_CorpusDF) #Looking at the first several rows of the data frame

##        word freq
## 1   animals   95
## 2       can   52
## 3     pests   47
## 4    become   29
## 5   extinct   26
## 6 protected   25

With this information, I can also plot the most frequent words and their frequencies.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

p <- ggplot(data = Positive_CorpusDF[1:25, ], aes(x = reorder(word, freq), y = freq)) + 
    xlab("Word") + ylab("Frequency") + geom_bar(stat = "identity") + coord_flip()
p

I can also plot correlations between words with some data visualization.

library(Rgraphviz)

## Loading required package: graph

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following object is masked from 'package:NLP':
## 
##     annotation

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: grid

attrs <- list(node = list(fillcolor = "cadetblue", fontsize = "20"), edge = list(), 
    graph = list())
plot(Positive_CorpusDTM, terms = findFreqTerms(Positive_CorpusDTM, lowfreq = 11), attrs = attrs, 
    corThreshold = 0.1)

We can even create a word cloud with this data, which can be another interesting way to visualize word frequency based on the size of the words compared to others. The larger the word is, the more frequently it appeared among the answers.

library(wordcloud)

## Loading required package: RColorBrewer

set.seed(1)
wordcloud(Positive_CorpusDF$word, Positive_CorpusDF$freq, max.words = 50, random.color = TRUE, rot.per = 0.0, fixed.asp = TRUE, colors = brewer.pal(8, "Set1"))

Moving on, I will now create a few basic bar charts of the categorical answers, as another way to visualize and explore the data and to see if there is something I want to explore further.

library(ggplot2)

class(HOI$Q4.21)

## [1] "character"

HOI$Q4.21<-as.factor(HOI$Q4.21)

# Remove blanks and NAs (this overwrites HOI, so all later steps use only the remaining rows)
HOI<-filter(HOI, Q4.21 !="")
t1<-table(HOI$Q4.21)
t1

## 
##      Don't know Negative impact Positive impact 
##               3             113              88

hunting<-ggplot(data=HOI, aes(x=Q4.21)) + labs(x="Is hunting positive or negative?", y="Number of Responses")+
                    geom_bar(fill= c("darkmagenta", "#2A9D8F", "#E76F51"))+
                    theme_minimal()
hunting
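
Because the number of people answering differs from question to question, percentages can be easier to compare than raw counts. A short sketch using the counts copied from the table t1 printed above:

```r
# Counts copied from the table t1 printed above
t1 <- c("Don't know" = 3, "Negative impact" = 113, "Positive impact" = 88)

# Share of each answer among the respondents who answered, in percent
pct <- round(100 * prop.table(t1), 1)
pct
```

With the real table object, `prop.table(t1)` can be called directly on the output of `table()`.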

Now let’s try a different category but the same type of bar graph. We will look at whether people think orangutans are ‘dangerous.’

class(HOI$Q3.11)

## [1] "character"

HOI$Q3.11<-as.factor(HOI$Q3.11)

#remove blanks
HOI<-filter(HOI, Q3.11 !="")
t2<-table(HOI$Q3.11)
t2

## 
## Don't Know         No        Yes 
##          9        150         45

dangerous<-ggplot(data=HOI, aes(x=Q3.11)) + labs(x="Are orangutans dangerous wildlife?", y="Number of Responses")+
                    geom_bar(fill= c("darkmagenta", "#2A9D8F", "#E76F51"))+
                    theme_minimal()
dangerous

Ok, one more time for the question I was really interested in, regarding the orangutan population being perceived as decreasing or increasing:

class(HOI$Q3.7)

## [1] "character"

HOI$Q3.7<-as.factor(HOI$Q3.7)

#remove blanks
HOI<-filter(HOI, Q3.7 !="")
t3<-table(HOI$Q3.7)
t3

## 
## Decrease Increase 
##       81      110

OH_pop<-ggplot(data=HOI, aes(x=Q3.7)) + labs(x="In the last 10 years has the orangutan population decreased or increased?", y="Number of Responses")+
                    geom_bar(fill= c("darkmagenta", "#2A9D8F"))+
                    theme_minimal()
OH_pop

Next, whether people report illegal wildlife activity to the authorities when they find it happening:

class(HOI$Q4.22)

## [1] "character"

HOI$Q4.22<-as.factor(HOI$Q4.22)

#remove blanks
HOI<-filter(HOI, Q4.22 !="")
t4<-table(HOI$Q4.22)
t4

## 
##  No yes 
##  66 120

report<-ggplot(data=HOI, aes(x=Q4.22)) + labs(x="Do you report illegal wildlife activity to authorities?", y="Number of Responses")+
                    geom_bar(fill= c("darkmagenta", "#2A9D8F"))+
                    theme_minimal()
report
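
Each chart above overwrites HOI with a filtered copy, so the sample shrinks from one question to the next (the tables total 204, 204, 191 and then 186 responses). A possible alternative, sketched here under the assumption that each question should be judged on everyone who answered it, is to filter inside a small helper and leave the original tibble untouched. The function name `plot_responses` and the toy data are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: drops blanks and NAs for one question only,
# without modifying the data passed in
plot_responses <- function(data, question, x_label) {
  answered <- data %>%
    filter(!is.na(.data[[question]]), .data[[question]] != "")
  ggplot(answered, aes(x = .data[[question]])) +
    geom_bar(aes(fill = .data[[question]]), show.legend = FALSE) +
    labs(x = x_label, y = "Number of Responses") +
    theme_minimal()
}

# Toy data standing in for HOI (made-up answers)
toy <- tibble(
  Q3.7  = c("Decrease", "Increase", "", "Increase", NA),
  Q4.22 = c("No", "", "yes", "yes", "No")
)

p <- plot_responses(toy, "Q3.7", "Has the orangutan population decreased or increased?")
p
nrow(toy)  # the original data still has every row
```

Mapping `fill` to the answer column also avoids having to supply exactly one colour per bar by hand.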

Now, I want to look at the association between village and whether people perceive that the orangutan population has increased or decreased over the past 10 years. The null hypothesis is that there is no association. If the test fails to reject it, that might indicate that people’s perceptions of the orangutan population in their area are inaccurate or confounded by other factors. If the null hypothesis is rejected with a p-value below 0.05, indicating an association between village and the perceived increase or decrease of the local orangutan population, this would encourage me to pursue further analyses with between-village comparisons, to see what other variables might give us clues into this relationship.

Furthermore, I would want to look at the orangutan population data we have for these areas to see if these perceptions are accurate. Several studies have shown that local knowledge of animal presence or absence is crucial for filling researchers’ knowledge gaps, especially with rare, endangered or cryptic animals, such as the orangutan. Meijaard et al. (2011) found several orangutan populations that were previously unknown to researchers as a result of interviewing local people.

cont_tableDI <- table(HOI$Q1.1, HOI$Q3.7)
cont_tableDI

##                  
##                   Decrease Increase
##   Laman Satong           9       17
##   Pangkalan Teluk       10       17
##   Penjalaan             18        9
##   Rantau Panjang        15        8
##   Sedahan Jaya           8       19
##   Sempurna              10       18
##   Teluk Bayur            8       20

chisq.test(x = cont_tableDI)

## 
##  Pearson's Chi-squared test
## 
## data:  cont_tableDI
## X-squared = 16.918, df = 6, p-value = 0.009588

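With a p-value of about 0.01 (0.0096), below the 0.05 threshold, this chi-square test shows a statistically significant association between village and the perceived decrease or increase in the orangutan population over the past 10 years! Note that the contingency table holds 186 responses rather than the 191 counted for Q3.7 earlier, because by this point HOI had also been filtered on Q4.22. The test does not say which villages drive the association, so this encourages me to analyze it further to find out what these associations are specifically, and what other variables may be associated with village.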
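
One way to start that follow-up is to look at the expected counts and standardized residuals that `chisq.test()` already returns, together with a simple effect size. The sketch below rebuilds the contingency table from the counts printed above so that it runs on its own; the residual cut-off of about 2 and the use of Cramér's V are common conventions and are offered here as suggestions, not as part of the original analysis.

```r
# Counts copied from the contingency table printed above
cont_tableDI <- matrix(
  c( 9, 17,
    10, 17,
    18,  9,
    15,  8,
     8, 19,
    10, 18,
     8, 20),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Village = c("Laman Satong", "Pangkalan Teluk", "Penjalaan", "Rantau Panjang",
                "Sedahan Jaya", "Sempurna", "Teluk Bayur"),
    Perception = c("Decrease", "Increase")
  )
)

res <- chisq.test(cont_tableDI)

# Expected counts under independence; a common rule of thumb asks for
# at least 5 per cell for the chi-square approximation to be reliable
round(res$expected, 1)

# Standardized residuals: cells beyond roughly +/-2 depart most from
# what independence would predict
round(res$stdres, 2)

# Cramer's V as a rough effect size for an r x c table
n <- sum(cont_tableDI)
cramers_v <- sqrt(unname(res$statistic) / (n * (min(dim(cont_tableDI)) - 1)))
cramers_v
```

In the printed table, Penjalaan and Rantau Panjang are the only villages where "Decrease" outnumbers "Increase", so those rows are the natural place to begin reading the residuals.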