INTRODUCTION

This case study is a requirement from Jala Tech for a data analyst position. JALA is transforming the shrimp industry by offering a vastly improved management system. Helping to realize the sustainability of shrimp farming in Indonesia and the world by being directly involved in the aquaculture process to the supply chain of the shrimp industry.

Therefore, in this job, we’ll concentrate on datasets of Jala Tech and try to find the story behind the data. Hopefully the insight from the datasets can help the organization to improve in certain area.

To finish this process, I’m going to follow these 6 steps: 1.) Ask 2.) Prepare 3.) Process 4.) Analyze 5.) Share

JALA TECH DATASETS

Cycle Actual/budidaya.csv

  1. Age - Umur udang. Di lapangan, petambak biasa menyebutnya dengan DoC / days of culture
  2. ABW (gram) - Average Body Weight, berat rerata udang pada DoC tertentu. Data didapat dari hasil sampling udang yang dilakukan 7 hari sekali setelah melewati DoC 27 atau 30.
  3. Size *prediksi - Ukuran petambak dalam menetukan harga jual udang. Didapatkan dari perhitungan berapa banyak jumlah udang dalam satu kg. Contoh: Size 100 berarti, dalam satu kg udang, terdapat 100 ekor udang.
  4. Total weight (kg) *prediksi - Total bobot udang/biomassa udang yang ada di kolam.
  5. Feed (kg) - Jumlah pakan harian yang diberikan ke kolam.
  6. Feed accumulation (kg) - Jumlah akumulasi pakan yang diberikan ke kolam.
  7. Survival Rate (%) *prediksi - Presentasi udang yang masih bertahan hidup di kolam.
  8. Harvest/panen (ekor) - Jumlah populasi ekor udang yang masih hidup di kolam.
  9. Harvest weight/bobot panen (kg) - Total bobot udang/biomassa udang yang dipanen dan dijual.

Total weight (kg) = (ABW (gram) x Seed density/jumlah tebar x Survival Rate (%))/1000 eg. 2.3 x 200000 x 81.17% / 1000 = 372.6 kg 3.60 x 100000 x 80.66% / 1000 = 2

Daftar Siklus/daftar_siklus.csv

  1. Kode siklus - kode siklus budidaya di database Jala.
  2. Kode kolam - kode kolam tambak di database Jala.
  3. Panjang kolam (m) - panjang kolam tambak.
  4. Lebar kolam (m) - lebar kolam tambak.
  5. Kedalaman kolam (cm) - kedalaman kolam tambak.
  6. Tanggal tebar (Date) - tanggal waktu mulai siklus budidaya. Biasanya dimulai sejak waktu petambak melakukan tebar benur/post-larvae.
  7. Tanggal selesai siklus (Date) - tanggal waktu selesai budidaya. Biasanya ditandai saat kolam melakukan panen total/panen akhir.
  8. Seed density/jumlah benur - Total jumlah benur (benih udang - post larvae) yang ditebar pada saat memulai siklus budidaya.
  9. Daerah - nama daerah lokasi tambak.

Water Measurement/kualitas_air.csv

  1. Temperature/suhu - Suhu kolam (range normal 28°C - 31°C)
  2. Dissolved Oxygen/oksigen terlarut - Jumlah oksigen terlarut di dalam kolam, berpengaruh terhadap kelangsungan hidup kolam. (range normal di atas 4 ppm)
  3. Salinity/salinitas - Kadar garam dalam air kolam. (range normal di atas 10 ppt).
  4. pH - Kadar hidrogen di dalam air, berpengaruh dalam mood udang dan proses moulting saat udang tumbuh. (range normal 7.5 - 8.5, namun perlu diperhatikan fluktuasi antara nilai pagi dan sore, fluktuasi normal adalah di bawah 0.5).

1. ASK

Business Task

Our job is to concentrate on Jala’s datasets and finding the story behind the historical data. Then, based on these findings, we must suggest ways in which these patterns might support the growth of the Jala Tech.

Questions that will guide our Analysis:

  • Komparasi performa tiap kolam dan apa yang membuat performa setiap kolam bisa berbeda
  • Memahami karakteristik pertumbuhan udang di tiap kolam.
  • Memahami karakteristik kualitas air tiap kolam.
  • Pengaruh kualitas air terhadap hasil panen.
  • Model prediksi pertumbuhan udang di tiap kolam.
  • Pengaruh parameter kualitas air terhadap nafsu makan udang/konsumsi pakan udang (suhu terhadap bobot pakan harian).
  • Dan lainnya sesuai keinginan anda.

2. PREPARE

For analysis Jala Tech already sent me 3 datasets and they’re:

budidaya.csv daftar_siklus.csv kualitas_air.csv

3.PROCESS

So, our next step is to “load all the required packages” for analysis.

library(readxl)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ✔ purrr   1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

And, our next step is to load all datasets for analysis. Before that we are going to change the dataset file names from complex to less complex ones.

Budidaya <- read_excel("Budidaya.xlsx")
daftar_siklus <- read_excel("daftar_siklus.xlsx")
kualitas_air <- read_excel("kualitas_air.xlsx")

Now, let’s look deep into the data sets.

  1. Budidaya
glimpse(Budidaya)
## Rows: 438
## Columns: 11
## $ `Kode Siklus`       <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50…
## $ Age                 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ Date                <dttm> 2018-07-16, 2018-07-17, 2018-07-18, 2018-07-19, 2…
## $ ABW                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Size                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `Total Weight`      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Feed                <dbl> 0.0, 2.5, 3.0, 3.0, 3.0, 3.0, 3.0, 3.5, 3.5, 4.0, …
## $ `Feed Accumulation` <dbl> 0.0, 2.5, 5.5, 8.5, 11.5, 14.5, 17.5, 21.0, 24.5, …
## $ `Survival Rate`     <dbl> 100.00, 78.13, 88.24, 83.33, 78.95, 75.00, 71.43, …
## $ Panen               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `Berat Panen`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
summary(Budidaya)
##   Kode Siklus         Age              Date                      
##  Min.   : 50.0   Min.   :  0.00   Min.   :2018-07-16 00:00:00.0  
##  1st Qu.: 50.0   1st Qu.: 21.25   1st Qu.:2018-09-10 06:00:00.0  
##  Median : 66.0   Median : 43.00   Median :2018-11-04 00:00:00.0  
##  Mean   :271.9   Mean   : 47.12   Mean   :2018-12-17 00:49:18.9  
##  3rd Qu.:298.0   3rd Qu.: 69.75   3rd Qu.:2019-02-23 18:00:00.0  
##  Max.   :812.0   Max.   :120.00   Max.   :2019-08-01 00:00:00.0  
##                                                                  
##       ABW              Size         Total Weight         Feed        
##  Min.   : 2.050   Min.   : 50.00   Min.   :   0.0   Min.   :  0.000  
##  1st Qu.: 5.185   1st Qu.: 77.25   1st Qu.: 383.1   1st Qu.:  7.125  
##  Median : 7.165   Median :139.50   Median :1119.9   Median : 22.500  
##  Mean   : 8.921   Mean   :163.85   Mean   :1529.8   Mean   : 40.452  
##  3rd Qu.:12.988   3rd Qu.:192.75   3rd Qu.:2545.7   3rd Qu.: 47.000  
##  Max.   :20.000   Max.   :488.00   Max.   :4748.9   Max.   :200.000  
##  NA's   :404      NA's   :404      NA's   :404                       
##  Feed Accumulation Survival Rate        Panen         Berat Panen    
##  Min.   :   0      Min.   : 28.77   Min.   : 28861   Min.   : 379.6  
##  1st Qu.: 264      1st Qu.: 77.94   1st Qu.: 43470   1st Qu.: 479.3  
##  Median : 559      Median : 81.83   Median : 50353   Median : 532.8  
##  Mean   :1227      Mean   : 83.14   Mean   :122945   Mean   :1213.1  
##  3rd Qu.:1640      3rd Qu.: 94.53   3rd Qu.: 76046   3rd Qu.: 822.3  
##  Max.   :6021      Max.   :140.30   Max.   :542368   Max.   :4975.9  
##                                     NA's   :431      NA's   :431
n_distinct(Budidaya$`Kode Siklus`)
## [1] 5
sum(duplicated(Budidaya))
## [1] 0
  1. daftar_siklus
glimpse(daftar_siklus)
## Rows: 5
## Columns: 9
## $ `Kode Siklus`            <dbl> 50, 66, 295, 298, 812
## $ `Kode Kolam`             <dbl> 2266, 3, 3418, 3421, 4
## $ `Panjang Kolam`          <dbl> 115.47, 36.20, 79.16, 79.16, 49.90
## $ `Lebar Kolam`            <dbl> 86.60, 46.40, 59.37, 59.37, 36.30
## $ `Kedalaman Kolam`        <dbl> 100.0, 1.1, 1.5, 1.5, 0.9
## $ `Tanggal Tebar`          <dttm> 2018-07-16, 2018-07-20, 2019-01-05, 2019-01-1…
## $ `Tanggal Selesai Siklus` <dttm> 2018-11-13, 2018-11-08, 2019-03-03, 2019-03-1…
## $ `Jumlah Benur`           <dbl> 100000, 200000, 583740, 588432, 175000
## $ Daerah                   <chr> "BANTUL", "PURWOREJO", "LAMPUNG SELATAN", "LA…

I found that kedalaman_kolam for code of 50 has significantly different with others, so I assumed that a human error when input the data. I assumed that only code 50 was in cm and the others are in m, thus I changed all to m.

daftar_siklus$`Kedalaman Kolam`[0:1] <- daftar_siklus$`Kedalaman Kolam`[0:1]/100
summary(daftar_siklus)
##   Kode Siklus      Kode Kolam   Panjang Kolam     Lebar Kolam   
##  Min.   : 50.0   Min.   :   3   Min.   : 36.20   Min.   :36.30  
##  1st Qu.: 66.0   1st Qu.:   4   1st Qu.: 49.90   1st Qu.:46.40  
##  Median :295.0   Median :2266   Median : 79.16   Median :59.37  
##  Mean   :304.2   Mean   :1822   Mean   : 71.98   Mean   :57.61  
##  3rd Qu.:298.0   3rd Qu.:3418   3rd Qu.: 79.16   3rd Qu.:59.37  
##  Max.   :812.0   Max.   :3421   Max.   :115.47   Max.   :86.60  
##  Kedalaman Kolam Tanggal Tebar        Tanggal Selesai Siklus       
##  Min.   :0.9     Min.   :2018-07-16   Min.   :2018-11-08 00:00:00  
##  1st Qu.:1.0     1st Qu.:2018-07-20   1st Qu.:2018-11-13 00:00:00  
##  Median :1.1     Median :2019-01-05   Median :2019-03-03 00:00:00  
##  Mean   :1.2     Mean   :2018-11-23   Mean   :2019-02-17 14:24:00  
##  3rd Qu.:1.5     3rd Qu.:2019-01-10   3rd Qu.:2019-03-10 00:00:00  
##  Max.   :1.5     Max.   :2019-05-07   Max.   :2019-08-01 00:00:00  
##   Jumlah Benur       Daerah         
##  Min.   :100000   Length:5          
##  1st Qu.:175000   Class :character  
##  Median :200000   Mode  :character  
##  Mean   :329434                     
##  3rd Qu.:583740                     
##  Max.   :588432
n_distinct(daftar_siklus$`Kode Siklus`)
## [1] 5
sum(duplicated(daftar_siklus))
## [1] 0
  1. kualitas_air
glimpse(kualitas_air)
## Rows: 581
## Columns: 8
## $ `Waktu Pengukuran` <chr> "2018-07-16 07:00:00+07:00", "2018-07-16 17:00:00+0…
## $ Date               <chr> "2018-07-16", "2018-07-16", "2018-07-17", "2018-07-…
## $ Time               <chr> "07:00:00+07:00", "17:00:00+07:00", "07:00:00+07:00…
## $ `Kode Siklus`      <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,…
## $ `Suhu Air`         <dbl> 26.70, 28.90, 25.80, 28.90, 26.00, 29.40, 26.40, 29…
## $ `Oksigen Terlarut` <dbl> 15.90, 15.20, 11.30, 14.20, 8.70, 11.60, 4.80, 5.70…
## $ Salinitas          <dbl> 22.40, 22.70, 22.60, 22.80, 22.60, 23.00, 22.80, 25…
## $ pH                 <dbl> 5.80, 7.62, 7.70, 8.46, 8.30, 8.40, 8.11, 8.15, 8.1…
summary(kualitas_air)
##  Waktu Pengukuran       Date               Time            Kode Siklus   
##  Length:581         Length:581         Length:581         Min.   : 50.0  
##  Class :character   Class :character   Class :character   1st Qu.: 66.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 66.0  
##                                                           Mean   :295.4  
##                                                           3rd Qu.:298.0  
##                                                           Max.   :812.0  
##     Suhu Air      Oksigen Terlarut   Salinitas           pH       
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.: 27.00   1st Qu.: 4.400   1st Qu.:26.87   1st Qu.:7.530  
##  Median : 28.25   Median : 5.280   Median :29.72   Median :7.850  
##  Mean   : 28.52   Mean   : 5.514   Mean   :27.80   Mean   :7.638  
##  3rd Qu.: 29.60   3rd Qu.: 6.340   3rd Qu.:36.81   3rd Qu.:8.230  
##  Max.   :127.00   Max.   :18.260   Max.   :39.74   Max.   :9.770
n_distinct(kualitas_air$`Kode Siklus`)
## [1] 5
sum(duplicated(kualitas_air))
## [1] 0

FINDINGS

  1. Shrimps’ average daily feed is 40.452(kg), their average feed accumulation is 1227(kg), and their average size is 163.85 (means in 1 kg there are about 163 shrimps), with an average total weight is 1529.8(kg).

  2. The average the length of shrimp pond is 71.98(m), while the average width of the shrimp pond is 57.61(m), and the average depth of the shrimp pond is 1.2(m).

  3. The average water temperature is 28.52(C) which is normal, the average dissolved Oxygen is 5.514(ppm) which is normal as long as > 4ppm, the average salinity is 27.80(ppt) that is normal because > 10ppt, and the average of pH is 7.638 which is normal.

The main finding from this process is that all data and variables seem normal, but let’s do further analysis.

DATA CLEANING

So, our next step is to clean our dataset. Here, I use clean_names() function to make the data set unique and to clear the duplicate data.

Budidaya <- clean_names(Budidaya)
daftar_siklus <- clean_names(daftar_siklus)
kualitas_air <- clean_names(kualitas_air)

Actually, I have been doing the data cleaning in Excel and these what I’ve done:

Budidaya.csv

  1. import file -> text to colomn
  2. Adjust colomn
  3. change format (variable type)
  4. replace NA to 0: Feed, Feed Accumulation, Survival Rate.
  5. count the “total weight” using the right formula
  6. No Duplicate data

Daftar_siklus.csv

  1. import file -> text to colomn
  2. Adjust colomn

kualitas_air.csv

  1. import file -> text to colomn
  2. Adjust colomn
  3. Seperate waktu pengukuran; Date = LEFT(A2;10) then drag down to the last observation Time = RIGHT(A2;14) then drag down to the last observation
  4. replace NA to 0: suhu air, oksigen terlarut, salinitas, pH.
  5. No duplicate data

Change csv format into xlxs format.

4. ANALYZE

To find out the total of harvest by cycle code I used sum by group funtion.

Budidaya[is.na(Budidaya)] <- 0 #change NA to 0 thus we can measure the variable
Budidaya$kode_siklus <- as.character(Budidaya$kode_siklus)
Budidaya %>%
  group_by(kode_siklus) %>% 
  summarise(sum(panen))

FINDING

We can see that there is one of cycle code had zero amount of harvest, it is the code of 295.

To find out the volume for each pond we are going to create two column in daftar_siklus dataset by adding volume_kolam and panen column. For this purpose we are going to use mutate() function by using this fuction we can make changes in a column data.

daftar_siklus <- mutate(daftar_siklus, volume_kolam=panjang_kolam*lebar_kolam*kedalaman_kolam)#all still in meter (m)
daftar_siklus <- mutate(daftar_siklus, panen=c(81830, 186064, 0, 542368, 50353))

Here I changed the variable type of kode_siklus into character

daftar_siklus$kode_siklus <- as.character(daftar_siklus$kode_siklus)

Let’s classify several variables from kualitas_air dataset and added the coloumn type:

  1. Temperature or suhu range 28°C - 31°C as normal beyond that range as abnormal
kualitas_air$type_suhu <- ifelse(kualitas_air$suhu_air >= 28 & kualitas_air$suhu_air <= 31, 'normal', 'abnormal')
  1. Dissolved Oxygen/oksigen terlarut

range > 4 ppm as normal beyond that as abnormal

kualitas_air$type_oksigen_terlarut <- ifelse(kualitas_air$oksigen_terlarut > 4, 'normal', 'abnormal')
  1. Salinity/salinitas

range > 10 ppt as normal beyond that as abnormal

kualitas_air$type_salinitas <- ifelse(kualitas_air$salinitas > 10, 'normal', 'abnormal')

4)pH

range 7.5 - 8.5 as normal beyond that range as abnormal

kualitas_air$type_pH <- ifelse(kualitas_air$p_h >= 7.5 & kualitas_air$p_h <= 8.5, 'normal', 'abnormal')

5. SHARE

So, let’s visualize and share some of the findings for the ease of undestanding.

The frequency of shrimp harvest for each code:

## Warning: Removed 431 rows containing missing values (`geom_point()`).

Finding

The results show us that only code 66 had more than one time (four times) of shrimp harvest, while the other code only happened one time, even code 295 had no time of harvest.

find out the harvest time

## Warning: Removed 431 rows containing missing values (`geom_point()`).

finding

  1. Here, we can see that the harvest time of shrimps is around Sept-Dec 2018 and the peak amount of harvest is in march - Apr 2019.
  2. We also can see that within a year the harvest time only happened seven (7) times.

Finding the trend of daily feed:

## Warning: Removed 78 rows containing missing values (`geom_point()`).

Finding

The result shows that the trend among the codes is diverse but codes 295 and 298 had an identical trend.

Comparing the relationship between panen and berat_panen:

## Warning: Removed 431 rows containing missing values (`geom_point()`).

Finding

Here we can find that the more harvest, the more the weight of the harvest gained. It is a positive correlation. Yet, we can see that there is also an outlier but since the number of observations is few so I decided to keep that outlier.

Comparing the relationship of jumlah_benur and volume_kolam:

finding

Here, we can see that the least jumlah_benur had the biggest volume of the pond (kolam). This means jumlah_benur has a negative correlation with volume_kolam.

finding out the type of temperature

Findings

  1. Cycle code 66 had the same amount between both types.
  2. While cycle codes 295 and 298 had a significant gap between both types.
  3. And also, the cycle code of 812 had a significant gap but in a negative way because the amount of abnormal is big than the normal type.

finding out the type of Dissolved Oxygen

Finding

The results show us that all code seems to have the normal type of dissolved oxygen.

finding out the type of Salinity

Finding

Here we can see that the results of the salinity types tend to be normal.

finding out the type of pH

Finding

Here, the result also shows us that all codes had normal type > abnormal type.

finding the relationship of survival_rate and the amount of daily feed:

## Warning: Removed 78 rows containing missing values (`geom_point()`).

Findings

  1. several codes show us that the cycle codes 50, 66, and 812 had high survival_rate but had a low number of feeds.
  2. Different from the first result, the cycle codes 295 and 298 had positive results meaning more high the survival_rate more big the amount of feed.

finding the relationship of jumlah benur and panen:

Findings

  1. The novel condition here is in code 295 which had a huge amount of jumlah_benur but zero panen.
  2. Also, code 298 had the same amount of jumlah_benur with code 295 but had different results in the amount of panen.
  3. Code 298 had the highest amount of jumlah_benur and panen (harvest).

RESULT

Through this analysis I found that;

  1. The harvest time happened seven (7) times within a year.
  2. Based on the data from those five codes of the cycle, code 66 had significant results because that code had four times shrimp harvest within a year. And all the variables from kualitas_air are normal. The result from the daily feed also showed that code 66 had an average trend.
  3. And last, code 295 is so unusual, the data shows that 295 had the normal type of all variables from kualitas_air data, also code 295 had a good trend for daily feed, but how come code 295 had no harvest time? Based on the daftar_siklus data, code 295 and 298 have identic results and trends, also both come from the same city (LAMPUNG SELATAN).

Thanks!