Introduction

Jumia is an online marketplace for electronics, fashion, and other goods. It targets several African countries but is headquartered and incorporated in Germany. The company also runs a logistics service, which handles the shipment and delivery of packages from sellers to consumers, and a payment service, which facilitates transactions between participants on the Jumia platform in selected markets. It has established partnerships with more than 50,000 local African businesses.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(stringi)
library(stringr)
library(dplyr)

#Jumia websites
url_jumia <- 'https://group.jumia.com/'
url_jumia_site <- url_jumia %>% read_html() %>% html_nodes('.n2mu-single-client > a') %>% html_attr('href')
url_jumia_site[1] <- url_jumia
url_jumia_site
##  [1] "https://group.jumia.com/"  "https://www.jumia.com.eg/"
##  [3] "https://www.jumia.ma"      "https://www.jumia.co.ke"  
##  [5] "https://www.jumia.ci"      "https://www.zando.co.za"  
##  [7] "https://www.jumia.com.tn/" "https://www.jumia.dz/"    
##  [9] "https://www.jumia.com.gh/" "https://www.jumia.sn/"    
## [11] "https://www.jumia.ug/"

List of Jumia websites in Africa.

Web scraping

Web scraping, also called web data extraction, is the process of automatically collecting data from web pages and storing it, for example in a database or a file, for later analysis.

The script below scrapes product titles and prices from a Jumia category page. The same steps apply to any other category or country site in the list above.
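
To illustrate the idea on a small scale before touching the live site, here is a minimal sketch that parses an inline HTML fragment with rvest. The fragment and its product names are hypothetical. The `.name` and `.prc` classes simply mirror the selectors used later in this document.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded product page
html <- '<div>
  <article><h3 class="name">Phone A</h3><div class="prc">GH 100</div></article>
  <article><h3 class="name">Phone B</h3><div class="prc">GH 250</div></article>
</div>'

# Parse the markup, then select nodes by CSS class and keep their text
page   <- read_html(html)
titles <- page %>% html_nodes(".name") %>% html_text()
prices <- page %>% html_nodes(".prc") %>% html_text()

titles
prices
```

Scraping a real page follows the same pattern, with `read_html()` receiving a URL instead of a string.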

Check whether the Jumia website allows scraping

library(robotstxt)
paths_allowed("https://group.jumia.com/")
##  group.jumia.com
## [1] TRUE

Let’s go

Specify the URL of the category to scrape

#urlbas <- "https://www.jumia.co.ke/smartphones/"
#urlbas <- 'https://www.jumia.co.ke/laptops/'
urlbas <- "https://www.jumia.com.gh/smartphones/"

Get the number of products available to scrape

products_found <- urlbas %>% read_html() %>% html_nodes(".-fs14.-gy5.-phs") %>% html_text()
products_found
## [1] "10828 products found"

Split the text to extract the product count

products_found_number <- str_split_fixed(products_found, " ", 2)
products_found_number[1]
## [1] "10828"
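
The value extracted above is still a character string. If the count is needed for arithmetic, such as estimating how many pages to visit, it can be converted to an integer. The helper below is a hypothetical sketch, not part of the original script. It assumes the text starts with the number and that any thousands separator is a comma.

```r
library(stringr)

# Hypothetical helper: pull the leading integer out of "N products found"
parse_product_count <- function(text) {
  digits <- str_extract(text, "^[\\d,]+")        # leading digits, possibly comma-grouped
  as.integer(str_replace_all(digits, ",", ""))   # drop the commas and convert
}

parse_product_count("10828 products found")
```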

Create an empty table to store results

result_table <- tibble()

Start scraping

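# Build the list of page URLs. Assumptions (not confirmed by the site): the
# category paginates with "?page=N" and shows about 40 products per page.
n_pages <- ceiling(as.numeric(products_found_number[1]) / 40)
list_of_pages <- str_c(urlbas, "?page=", seq_len(n_pages))

# Visit each page and append its titles and prices to the result table
for (page in list_of_pages) {
  page_source <- read_html(page)
  title <- html_nodes(page_source, ".name") %>% html_text()
  price <- html_nodes(page_source, ".prc") %>% html_text()
  temp_table <- tibble(title = title, price = price)
  result_table <- bind_rows(result_table, temp_table)
}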
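
The loop above selects titles and prices independently, so `tibble()` stops with an error if a page yields a different number of each. One possible alternative, sketched here under assumptions, is to select each product card first and then read both fields from within it. The helper name `extract_products` and the `article` card selector are hypothetical and would need checking against the actual page markup.

```r
library(rvest)
library(tibble)

# Hypothetical helper: one row per product card, NA where a field is missing.
# The default card selector "article" is an assumption about the markup.
extract_products <- function(page_source, card = "article") {
  cards <- html_nodes(page_source, card)
  tibble(
    # html_node() (singular) returns exactly one match per card, or a
    # missing node whose text is NA, so both columns keep the same length
    title = cards %>% html_node(".name") %>% html_text(),
    price = cards %>% html_node(".prc") %>% html_text()
  )
}

# Hypothetical fragment in which the second product has no price
sample_page <- read_html('<div>
  <article><h3 class="name">Phone A</h3><div class="prc">GH 100</div></article>
  <article><h3 class="name">Phone B</h3></article>
</div>')

products <- extract_products(sample_page)
products
```

Inside the loop, `extract_products(page_source)` could stand in for the separate `title` and `price` lines. Adding a short `Sys.sleep()` between requests is also a common courtesy to the server.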

Remove empty values

result_table <- result_table %>% filter(price != '')
head(result_table, n = 10)
## # A tibble: 10 x 2
##    title                                                                price   
##    <chr>                                                                <chr>   
##  1 "Nokia 5.1 - 16GB HDD - 2GB RAM - Black"                             GH₵ 458 
##  2 "Itel S16 Dual SIM - 16GB HDD - 1GB RAM - Ice Crystal Blue"          GH₵ 400 
##  3 "Samsung Galaxy A01 - 16GB HDD - 2GB RAM - Black"                    GH₵ 569 
##  4 "Apple iPhone 5S 4G LTE - 16GB HDD - 1GB RAM - Gold"                 GH₵ 450 
##  5 "Mione A50 - Dual SIM - 32GB HDD - 3GB RAM"                          GH₵ 499 
##  6 "Infinix Note 7 X690 4G Dual SIM - 64GB HDD - 4GB RAM - Aether Blac… GH₵ 907 
##  7 "Samsung Galaxy A31 Smartphone - 128GB HDD - 4GB RAM - 6.4\" - Pris… GH₵ 1 2…
##  8 "Note10+ 5-Inch Digital Display Screen Hd Camera Quad-Core Smartpho… GH₵ 271 
##  9 "Samsung A3 Core Dual SIM - 16GB HDD - 1GB RAM - Red"                GH₵ 429 
## 10 "Infinix X610B Note 6 Dual SIM 4G LTE - 64GB HDD - 4GB RAM - Mocha … GH₵ 769
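
The `price` column is stored as text, which is inconvenient for sorting or computing statistics. The sketch below shows one way to turn it into a number. `parse_price` is a hypothetical helper, not part of the original script. It assumes that prices have no decimal part, that thousands are grouped with a comma or a space, and that only the first amount matters when a price range is shown.

```r
library(stringr)

# Hypothetical helper: "GH₵ 1,299" -> 1299
parse_price <- function(price) {
  # First run of digits, including comma or space grouping characters
  first <- str_extract(price, "[0-9][0-9 ,]*")
  # Drop the grouping characters and convert to numeric
  as.numeric(str_replace_all(first, "[ ,]", ""))
}

parse_price(c("GH\u20b5 458", "GH\u20b5 1,299", "GH\u20b5 450 - GH\u20b5 500"))
```

It could be applied with `result_table %>% mutate(price_value = parse_price(price))`, keeping the original text column for reference.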

Export data

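# Format the timestamp without spaces or colons so the file name is valid on every OS
currentTime <- Sys.time()
csvFileName <- paste0("resultatdata_", format(currentTime, "%Y%m%d_%H%M%S"), ".csv")
write.csv(result_table, file = csvFileName, fileEncoding = "UTF-16LE")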

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT License: Copyright (c) 2021 HK Corporation Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
