This case study is a requirement from Jala Tech for a data analyst position. JALA is transforming the shrimp industry by offering a vastly improved management system. Helping to realize the sustainability of shrimp farming in Indonesia and the world by being directly involved in the aquaculture process to the supply chain of the shrimp industry.
Therefore, in this job, we’ll concentrate on datasets of Jala Tech and try to find the story behind the data. Hopefully the insight from the datasets can help the organization to improve in certain area.
To finish this process, I’m going to follow these 6 steps: 1.) Ask 2.) Prepare 3.) Process 4.) Analyze 5.) Share
Total weight (kg) = (ABW (gram) x Seed density/jumlah tebar x Survival Rate (%))/1000 eg. 2.3 x 200000 x 81.17% / 1000 = 372.6 kg 3.60 x 100000 x 80.66% / 1000 = 2
Our job is to concentrate on Jala’s datasets and finding the story behind the historical data. Then, based on these findings, we must suggest ways in which these patterns might support the growth of the Jala Tech.
For analysis Jala Tech already sent me 3 datasets and they’re:
budidaya.csv daftar_siklus.csv kualitas_air.csv
So, our next step is to “load all the required packages” for analysis.
library(readxl)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
And, our next step is to load all datasets for analysis. Before that we are going to change the dataset file names from complex to less complex ones.
Budidaya <- read_excel("Budidaya.xlsx")
daftar_siklus <- read_excel("daftar_siklus.xlsx")
kualitas_air <- read_excel("kualitas_air.xlsx")
Now, let’s look deep into the data sets.
glimpse(Budidaya)
## Rows: 438
## Columns: 11
## $ `Kode Siklus` <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50…
## $ Age <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ Date <dttm> 2018-07-16, 2018-07-17, 2018-07-18, 2018-07-19, 2…
## $ ABW <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Size <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `Total Weight` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Feed <dbl> 0.0, 2.5, 3.0, 3.0, 3.0, 3.0, 3.0, 3.5, 3.5, 4.0, …
## $ `Feed Accumulation` <dbl> 0.0, 2.5, 5.5, 8.5, 11.5, 14.5, 17.5, 21.0, 24.5, …
## $ `Survival Rate` <dbl> 100.00, 78.13, 88.24, 83.33, 78.95, 75.00, 71.43, …
## $ Panen <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `Berat Panen` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
summary(Budidaya)
## Kode Siklus Age Date
## Min. : 50.0 Min. : 0.00 Min. :2018-07-16 00:00:00.0
## 1st Qu.: 50.0 1st Qu.: 21.25 1st Qu.:2018-09-10 06:00:00.0
## Median : 66.0 Median : 43.00 Median :2018-11-04 00:00:00.0
## Mean :271.9 Mean : 47.12 Mean :2018-12-17 00:49:18.9
## 3rd Qu.:298.0 3rd Qu.: 69.75 3rd Qu.:2019-02-23 18:00:00.0
## Max. :812.0 Max. :120.00 Max. :2019-08-01 00:00:00.0
##
## ABW Size Total Weight Feed
## Min. : 2.050 Min. : 50.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 5.185 1st Qu.: 77.25 1st Qu.: 383.1 1st Qu.: 7.125
## Median : 7.165 Median :139.50 Median :1119.9 Median : 22.500
## Mean : 8.921 Mean :163.85 Mean :1529.8 Mean : 40.452
## 3rd Qu.:12.988 3rd Qu.:192.75 3rd Qu.:2545.7 3rd Qu.: 47.000
## Max. :20.000 Max. :488.00 Max. :4748.9 Max. :200.000
## NA's :404 NA's :404 NA's :404
## Feed Accumulation Survival Rate Panen Berat Panen
## Min. : 0 Min. : 28.77 Min. : 28861 Min. : 379.6
## 1st Qu.: 264 1st Qu.: 77.94 1st Qu.: 43470 1st Qu.: 479.3
## Median : 559 Median : 81.83 Median : 50353 Median : 532.8
## Mean :1227 Mean : 83.14 Mean :122945 Mean :1213.1
## 3rd Qu.:1640 3rd Qu.: 94.53 3rd Qu.: 76046 3rd Qu.: 822.3
## Max. :6021 Max. :140.30 Max. :542368 Max. :4975.9
## NA's :431 NA's :431
n_distinct(Budidaya$`Kode Siklus`)
## [1] 5
sum(duplicated(Budidaya))
## [1] 0
glimpse(daftar_siklus)
## Rows: 5
## Columns: 9
## $ `Kode Siklus` <dbl> 50, 66, 295, 298, 812
## $ `Kode Kolam` <dbl> 2266, 3, 3418, 3421, 4
## $ `Panjang Kolam` <dbl> 115.47, 36.20, 79.16, 79.16, 49.90
## $ `Lebar Kolam` <dbl> 86.60, 46.40, 59.37, 59.37, 36.30
## $ `Kedalaman Kolam` <dbl> 100.0, 1.1, 1.5, 1.5, 0.9
## $ `Tanggal Tebar` <dttm> 2018-07-16, 2018-07-20, 2019-01-05, 2019-01-1…
## $ `Tanggal Selesai Siklus` <dttm> 2018-11-13, 2018-11-08, 2019-03-03, 2019-03-1…
## $ `Jumlah Benur` <dbl> 100000, 200000, 583740, 588432, 175000
## $ Daerah <chr> "BANTUL", "PURWOREJO", "LAMPUNG SELATAN", "LA…
I found that kedalaman_kolam for code of 50 has significantly different with others, so I assumed that a human error when input the data. I assumed that only code 50 was in cm and the others are in m, thus I changed all to m.
daftar_siklus$`Kedalaman Kolam`[0:1] <- daftar_siklus$`Kedalaman Kolam`[0:1]/100
summary(daftar_siklus)
## Kode Siklus Kode Kolam Panjang Kolam Lebar Kolam
## Min. : 50.0 Min. : 3 Min. : 36.20 Min. :36.30
## 1st Qu.: 66.0 1st Qu.: 4 1st Qu.: 49.90 1st Qu.:46.40
## Median :295.0 Median :2266 Median : 79.16 Median :59.37
## Mean :304.2 Mean :1822 Mean : 71.98 Mean :57.61
## 3rd Qu.:298.0 3rd Qu.:3418 3rd Qu.: 79.16 3rd Qu.:59.37
## Max. :812.0 Max. :3421 Max. :115.47 Max. :86.60
## Kedalaman Kolam Tanggal Tebar Tanggal Selesai Siklus
## Min. :0.9 Min. :2018-07-16 Min. :2018-11-08 00:00:00
## 1st Qu.:1.0 1st Qu.:2018-07-20 1st Qu.:2018-11-13 00:00:00
## Median :1.1 Median :2019-01-05 Median :2019-03-03 00:00:00
## Mean :1.2 Mean :2018-11-23 Mean :2019-02-17 14:24:00
## 3rd Qu.:1.5 3rd Qu.:2019-01-10 3rd Qu.:2019-03-10 00:00:00
## Max. :1.5 Max. :2019-05-07 Max. :2019-08-01 00:00:00
## Jumlah Benur Daerah
## Min. :100000 Length:5
## 1st Qu.:175000 Class :character
## Median :200000 Mode :character
## Mean :329434
## 3rd Qu.:583740
## Max. :588432
n_distinct(daftar_siklus$`Kode Siklus`)
## [1] 5
sum(duplicated(daftar_siklus))
## [1] 0
glimpse(kualitas_air)
## Rows: 581
## Columns: 8
## $ `Waktu Pengukuran` <chr> "2018-07-16 07:00:00+07:00", "2018-07-16 17:00:00+0…
## $ Date <chr> "2018-07-16", "2018-07-16", "2018-07-17", "2018-07-…
## $ Time <chr> "07:00:00+07:00", "17:00:00+07:00", "07:00:00+07:00…
## $ `Kode Siklus` <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,…
## $ `Suhu Air` <dbl> 26.70, 28.90, 25.80, 28.90, 26.00, 29.40, 26.40, 29…
## $ `Oksigen Terlarut` <dbl> 15.90, 15.20, 11.30, 14.20, 8.70, 11.60, 4.80, 5.70…
## $ Salinitas <dbl> 22.40, 22.70, 22.60, 22.80, 22.60, 23.00, 22.80, 25…
## $ pH <dbl> 5.80, 7.62, 7.70, 8.46, 8.30, 8.40, 8.11, 8.15, 8.1…
summary(kualitas_air)
## Waktu Pengukuran Date Time Kode Siklus
## Length:581 Length:581 Length:581 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.: 66.0
## Mode :character Mode :character Mode :character Median : 66.0
## Mean :295.4
## 3rd Qu.:298.0
## Max. :812.0
## Suhu Air Oksigen Terlarut Salinitas pH
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 27.00 1st Qu.: 4.400 1st Qu.:26.87 1st Qu.:7.530
## Median : 28.25 Median : 5.280 Median :29.72 Median :7.850
## Mean : 28.52 Mean : 5.514 Mean :27.80 Mean :7.638
## 3rd Qu.: 29.60 3rd Qu.: 6.340 3rd Qu.:36.81 3rd Qu.:8.230
## Max. :127.00 Max. :18.260 Max. :39.74 Max. :9.770
n_distinct(kualitas_air$`Kode Siklus`)
## [1] 5
sum(duplicated(kualitas_air))
## [1] 0
Shrimps’ average daily feed is 40.452(kg), their average feed accumulation is 1227(kg), and their average size is 163.85 (means in 1 kg there are about 163 shrimps), with an average total weight is 1529.8(kg).
The average the length of shrimp pond is 71.98(m), while the average width of the shrimp pond is 57.61(m), and the average depth of the shrimp pond is 1.2(m).
The average water temperature is 28.52(C) which is normal, the average dissolved Oxygen is 5.514(ppm) which is normal as long as > 4ppm, the average salinity is 27.80(ppt) that is normal because > 10ppt, and the average of pH is 7.638 which is normal.
The main finding from this process is that all data and variables seem normal, but let’s do further analysis.
So, our next step is to clean our dataset. Here, I use clean_names() function to make the data set unique and to clear the duplicate data.
Budidaya <- clean_names(Budidaya)
daftar_siklus <- clean_names(daftar_siklus)
kualitas_air <- clean_names(kualitas_air)
Actually, I have been doing the data cleaning in Excel and these what I’ve done:
Budidaya.csv
Daftar_siklus.csv
kualitas_air.csv
Change csv format into xlxs format.
To find out the total of harvest by cycle code I used sum by group funtion.
Budidaya[is.na(Budidaya)] <- 0 #change NA to 0 thus we can measure the variable
Budidaya$kode_siklus <- as.character(Budidaya$kode_siklus)
Budidaya %>%
group_by(kode_siklus) %>%
summarise(sum(panen))
We can see that there is one of cycle code had zero amount of harvest, it is the code of 295.
To find out the volume for each pond we are going to create two column in daftar_siklus dataset by adding volume_kolam and panen column. For this purpose we are going to use mutate() function by using this fuction we can make changes in a column data.
daftar_siklus <- mutate(daftar_siklus, volume_kolam=panjang_kolam*lebar_kolam*kedalaman_kolam)#all still in meter (m)
daftar_siklus <- mutate(daftar_siklus, panen=c(81830, 186064, 0, 542368, 50353))
Here I changed the variable type of kode_siklus into character
daftar_siklus$kode_siklus <- as.character(daftar_siklus$kode_siklus)
Let’s classify several variables from kualitas_air dataset and added the coloumn type:
kualitas_air$type_suhu <- ifelse(kualitas_air$suhu_air >= 28 & kualitas_air$suhu_air <= 31, 'normal', 'abnormal')
range > 4 ppm as normal beyond that as abnormal
kualitas_air$type_oksigen_terlarut <- ifelse(kualitas_air$oksigen_terlarut > 4, 'normal', 'abnormal')
range > 10 ppt as normal beyond that as abnormal
kualitas_air$type_salinitas <- ifelse(kualitas_air$salinitas > 10, 'normal', 'abnormal')
4)pH
range 7.5 - 8.5 as normal beyond that range as abnormal
kualitas_air$type_pH <- ifelse(kualitas_air$p_h >= 7.5 & kualitas_air$p_h <= 8.5, 'normal', 'abnormal')
Through this analysis I found that;
Thanks!