29-4-2021

Learnings

I set out to complete the Exasol tasks and to learn more about using Git integration in DBeaver and RStudio, to work on a shared repository form different tools.

  • Basic Git course would be helpful
  • Need more time to dive into UDF courses and documentation
  • Experience with Exasol data type conversion and Geo data needed
  • Getting more familiar with Exasol R package
  • Definitely prefer procedural (R) vs. declartive programming (SQL)…
  • Most of my time goes to struggling with date formats, data types and some to resources (memory and execution time)

File loading

I chose importing data with Exasol’s IMPORT, although also can see paths with R of further data cleaning is needed. See File loading.sql for details.

  • f1.csv Reading in went ok except for column with date string delimited by ’. Imported as string and altered to additional column D_DT_FIXED. Also GEO data should by typed.
  • f2.csv Encoding derived from identifying German symbols. Date should be typed and strings with | should be cleaned, maybe with a string function.
  • f3.csv Interestingly problems with importing price column due comma instead of dot decimal separator. Fixed with REGEX_REPLACE and TO_NUMBER on string.

Creating synthetic data: 500M rows

See 500M_table.R for details.

  • This was a challenge. There probably is an elegant DDL solution, but I instinctively go to R.
  • Ran into many memory problems, so had to take a batched approach which had to run all night … but it worked.
  • Tried a R UDF to do it in memory, but did not get it to work yet. Probably a batched approach is needed here as well due to VM memory limitations.

Coindesk

See CoinbaseAPI.R for details.

  • Loading of Coindesk BPI data was implemented with R.
  • Nice to see the Badoo PyExasol library, nice to check out when more tome.
  • Loading initial 6 months of data and update function that checks last BPI record data available in the database and updates to current date.

Prescriptions (with some assumptions)

See Queries.sql for details.

3 most prescribed chemicals in the postcode_area ‘East Central London’?
FREQ SK_CHEM_SUB CHEM_SUB CHEMICAL_NAME
9082 2492 2101 Other Appliances
5766 2008 0906040G0 Colecalciferol
7511 3444 2003 Wound Management & Other Dressings
In which year most often?
YEAR(PERIOD_FIRST_DAY_AS_DATE) FREQ_YEAR
2017 2623674

Free play … wondering …

See it in action on ShinyApps.io and code in app.R.

  • How rendering lots of data would work with R Shiny apps using Plotly rendering.
  • Also wondering whether loading from Exasol would work from app published to ShinyApps.io and it does.
  • Just crashed when choosing full sample size, probably due to memory limits of free subscription plan.
  • How can you run queries in R UDF like pquery in Lua?

Wouldn’t it be nice …

  • If we can set per column IMPORT settings or use e.g. string functions in import?
  • If we can query system properties of UDF VMs to uncover memory limits?
  • To make a R UDF that trains and saves model as binary data and performance results in Exasol columns … IN PROGRESS …
  • Together with a R UDF that reads binary model data and applies for dynamic prediction in VIEW … PLANNED …
  • If we had more time …

Facilitating the R world in Exasol adoption