Bundled Products vs Products in Bulk
| Stores like to sell products as a bundle. For example, the store will sell a box with several decorations as one product. The store benefits from this because: – Stacking and scanning products is easier when they are bundled ( effort- ). – The buyer will spend less time in the store choosing decorations ( speed+ ). – We can sell more decorations by bundling less popular decorations with the more popular ones ( sale+ ). | But bundles make production planning harder. – We have to unbundle product data to find out how much of each individual shape/color was sold ( effort+ ). – We usually look at sales over a longer period of time. Because of the quantity of data, unbundling will make our database really slow ( speed- ). – We have to read data about all the shapes/colors, although we only want to analyze the sales of one of them ( analysis- ). |
![]() | ![]() |
Data in Bundle or in Bulk
| We can treat a transaction as one bundled product. Product code, quantity and price form one bundle. We want to be able to save this data bundle quickly. | We can achieve this by saving our transactions as separate rows in a database. That way we can quickly save or retrieve an individual transaction as a bundle. |
![]() | ![]() |
| For production planning we will organize our data differently. We will organize it into columns. If we want to find out how many products we have sold with a price lower than 3 EUR, we just have to scan the column "Price", and then read the matching positions in the column "Qty". Each column can be sorted and compressed for even faster retrieval. We don't even need to read the column "Product". | ![]() |
When we organize data in rows, we use a row-oriented database. When we organize data in columns, we use a columnar database. Data is typically collected in a row-oriented database (Postgres, Oracle, MySQL) and then transferred to a columnar database for analytics ( ClickHouse, DuckDB, Vertica ).
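To make this concrete, here is a minimal sketch of the decoration-store query from above, assuming a hypothetical `sales` table with columns "Product", "Qty" and "Price". A columnar engine answering it only has to scan the "Price" and "Qty" columns; the "Product" column is never touched.

```sql
-- How many items were sold for less than 3 EUR?
-- A columnar database scans only the "Price" and "Qty" columns.
SELECT SUM(Qty) AS items_sold
FROM sales
WHERE Price < 3;
```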
Row Oriented Databases vs Columnar Databases
| Row-oriented databases: 1) They are great for small writes/reads. Perfect for "insert one order", "update this user", "delete this invoice line", etc. 2) Application code often works with whole objects/rows (User, Order, Invoice), which matches the row layout. 3) Great for transaction isolation, replication, ACID compliance, constraints, triggers, foreign keys. 4) We have many users who send queries that touch only one row of a table and expect an immediate response from the database. They are best for cash registers, ERP, CRM, banking systems, ticketing. | Columnar databases: 1) We only read the columns we need, and columns can be heavily compressed. That improves IO. 2) The CPU can process several values of the same type at once. This is called vectorized execution. Queries over a lot of data will run faster. 3) Column scans and aggregations are easy to parallelize across cores and nodes. 4) We have a smaller number of queries, but they are much heavier; they touch a lot of data and involve a lot of joins, yet users expect results in 3-4 seconds. They are best for sales reports, dashboards, time series, data mining. |
![]() | ![]() |
Row-oriented databases capture the current state and provide transactional correctness. Columnar databases are for analysis of historical data. We can also say that row-oriented databases are better at writing data and providing data consistency and integrity, while columnar databases are better at reading huge quantities of data.
OLTP vs OLAP
If we are preparing our database for everyday operations and transactions, then we are building an OLTP database. For analytics and BI, we would prepare an OLAP database. OLTP maps naturally to row-oriented databases, and OLAP to columnar databases.
![]() | In OLTP databases, data is organized into many small tables. There is no data duplication and every value is written only once, so these databases usually don't grow significantly in size, although the many small tables make queries more complex. | ![]() | In OLAP databases, we merge small tables to reduce their number. Columnar technology has no problem with wide tables that have a lot of columns. By reducing the number of joins, we can query our data with simpler (and faster) queries. That is why we organize our tables in a star schema. |
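As a hedged illustration of the star schema idea, here is what a typical OLAP query might look like, assuming a hypothetical fact table `sales` and dimension table `dim_product`:

```sql
-- One wide fact table joined to one small dimension table.
-- Far fewer joins than in a normalized OLTP model, so the query stays simple.
SELECT p.Color,
       SUM(s.Qty) AS total_qty
FROM sales AS s
JOIN dim_product AS p ON p.ProductID = s.ProductID
GROUP BY p.Color;
```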
OLTP tables are filled with data by users and applications in small transactions. From time to time ( usually nightly ), data is transformed and transported into OLAP databases in batches. This makes data management different between OLTP and OLAP databases. OLTP databases receive data continuously, while OLAP databases receive data periodically and only after the data has been properly cleaned and transformed.
OLTP is an acronym for "Online Transaction Processing" and OLAP for "Online Analytical Processing". We saw that the difference comes from:
– Different purpose and usage.
– Different database technology.
– Different data organization inside the database.
– Different data management.
Data Immutability
Columnar ( OLAP ) databases are filled in batches. That doesn't mean it is impossible to UPDATE or DELETE rows in their tables. There are four strategies that OLAP databases use:
| 1) Data is immutable. We can only insert data in batches, and afterward we cannot change individual rows. If data is immutable and presorted, we can achieve the highest levels of compression. An example of this technology is the database used by Power BI. | 2) MonetDB is an example of an OLAP database that is suited for light updates and deletes. When we delete/update a value, MonetDB will label the old record as "dead". That record becomes invisible to queries. After some time, garbage collection will remove the unneeded records. ![]() |
| 3) We can divide our columns into small segments. Whenever we want to change some value, we rewrite the whole segment. This is a good solution when we have a lot of hardware power, like cloud providers do. Snowflake is an example of such a database. | 4) We can have two databases in one. One database is row-oriented, and the other one is columnar. Data is written into the row-oriented database, but when we want to read data we read from both databases together. Historical data is read from the columnar database, and today's data is read from the row-oriented database. SAP HANA uses this approach. | This last strategy creates something called "Hybrid Transactional/Analytical Processing", or HTAP. HTAP is when a single database can do fast real-time transactions and fast analytics on the same data at the same time. ![]() |
| In recent years, MonetDB researchers have been gradually adding or experimenting with features that move it slightly toward the HTAP category: | – Better handling of frequent updates and small transactions – Improved concurrency and snapshot isolation – Faster incremental data loads – Optimizations for mixed workloads |
MonetDB
If you have one server and install MonetDB on it, you will be able to:
– Enjoy blazing-fast query performance, even if you have a lot of data and your queries are complex.
– Avoid thinking about indexing, compression, statistics tuning, partitioning decisions. MonetDB will do it all automatically.
– Apply the full power of SQL to all your tables, whether they are carefully modeled or just collected in one place.
– Spend your money on something else because MonetDB is an open source database under the Mozilla Public License.
MonetDB is a fast, easy-to-maintain, open-source analytical database server with complete SQL support that can process huge amounts of data on a single server machine.
Hardware and Speed Considerations for MonetDB
MonetDB does not require the whole database to be kept in RAM. Parts of the data that have already been processed will be removed from memory. MonetDB can deal with data that is much bigger than the amount of memory on your server machine. Only the data that is currently being processed is kept in memory by the execution pipelines.
MonetDB will benefit from a fast NVMe disk that has sustained performance under write load. Such a disk should be accompanied by a lot of RAM, so that disk usage stays minimal.
| MonetDB is designed to take full advantage of your CPU. – When executing a query or loading data, MonetDB will spread the load across all processor cores, increasing CPU utilization to over 90%. There is a setting that can limit the number of cores used by MonetDB. – Vectorized query execution means that the processor processes many values of the same data type at once, which is a perfect fit for columnar databases. – Data is processed in small chunks using simple instructions. This reduces trips to RAM and makes use of the processor's fast L1/L2 caches. – Late materialization means that we avoid reading data that does not contribute to our query. We only read the data needed for the final dataset and skip everything else. – Partial query results are reused instead of being recomputed. | ![]() |
Zero-Tuning Philosophy
MonetDB will not burden you with indexing strategies, partitioning decisions and vacuuming. MonetDB tries to eliminate complexity. Most workloads run at full speed without manual optimization. Indexes are automatic. Compression is automatic. Memory management is automatic.
Creating indexes manually is problematic because you need to know in advance where to create the indexes. Indexes can slow down updates and deletes because when data is modified, the database must also update the corresponding indexes to reflect these changes. This is something that we especially want to avoid in columnar databases. If users change their behavior, that can make our indexes ineffective.
The technique that MonetDB uses to automatically tune indexes is called database cracking. Cracking means that MonetDB will sort and group data, and create indexes, during SELECT queries (see the sketch after the list below). If most of the statements on OLAP databases are SELECT queries, and such queries touch a lot of rows, then it is best to create indexes and tune data during SELECT query execution. This is optimal because:
– The parts of a database that are heavily used will be tuned the best. We will not spend effort on data that no one reads.
– Instead of having one huge indexing job, we will have many tiny, incremental reorganizations.
– After some time, our data will be optimally indexed and sorted, but MonetDB can change strategy if users change their behavior.
– Cracking is especially powerful in columnar databases like MonetDB, where each column is stored separately.
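Here is a minimal sketch of how cracking behaves, assuming a hypothetical `sales` table with a `Price` column. No index is ever created by hand; each range query reorganizes the column a little more around the requested boundaries, so later queries on the same column get faster.

```sql
-- Each of these ordinary SELECT queries makes MonetDB partially sort
-- (crack) the "Price" column around the requested value ranges.
SELECT COUNT(*) FROM sales WHERE Price BETWEEN 1 AND 3;
SELECT COUNT(*) FROM sales WHERE Price BETWEEN 3 AND 5;
SELECT COUNT(*) FROM sales WHERE Price > 5;
```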
Cracking is powerful but not ideal for everything:
– Heavy transactional workloads (many updates) disrupt adaptive patterns.
– Distributed cracking across many nodes is still complex.
– Highly unpredictable workloads (all queries unique) limit its benefits.
Comparison of MonetDB and Power BI
The Power BI database ( SSAS Tabular ) is an in-memory database that uses compressed columnar tables and is optimized for BI models ( dimensions, measures, hierarchies ). Power BI needs the whole model loaded in memory, but columns are heavily compressed, so Power BI doesn't have a huge memory footprint. For many calculations Power BI doesn't even need to decompress columns, so it can do its work while maintaining really fast IO.
Power BI likes table relationships and measures defined in advance, and it is not optimized for arbitrary queries. It is not intended for complex joins outside the model. If we spend time creating an optimal model, and we wait for the model to process/refresh during the load, we will get a database that is optimized for:
– measure evaluation
– filters
– slice and dice
On the other hand, MonetDB is optimized for raw analytical SQL on large tables, without modeling overhead.
MonetDB Imperfections
As a columnar server, MonetDB suffers from all of the shortcomings that columnar servers have. MonetDB is not good at transactional traffic or at handling a lot of deletes and updates. MonetDB is not perfect for reports that return huge tables with a lot of columns. MonetDB is best for short, aggregated analytical queries.
If you have massive amounts of data, you will need a computer cluster and cloud scalability. This is where MonetDB doesn't shine. MonetDB is best if you have one powerful server machine and you want to run complex analytics on it. Distributed databases are better at scaling, but because their setup is more complex and they are spread over many computers, they usually have less rich SQL capabilities and other limitations.
MonetDB does not support replication, but it does support distributed operation across multiple machines via sharding. Sharding will speed up some queries, but not all. If we want to set up a cluster with MonetDB, partitioning and sharding must be done carefully and manually. If possible, it is better to have a single powerful server than to resort to sharding and expensive network equipment.
MonetDB Market Position
| You should use MonetDB if you want to: – Self-host an analytical database on your own server. – Run complex ad-hoc queries, window functions, joins, and aggregations. – Skip designing cubes or writing DAX; you just want fast SQL. – Avoid vendor lock-in by using fully open-source software. | MonetDB is best for: – Data engineers – BI developers – Researchers – Small/medium businesses – Python users |
MonetDB supports SQL, ODBC, JDBC and many programming languages ( Python, Java, R, Ruby, PHP… ). It can easily be integrated with different software, but support from third-party software, clients and ORMs is limited. This is a consequence of MonetDB's origin, which was research and science oriented. MonetDB was developed at the CWI institute in the Netherlands. Development was driven by innovation and curiosity, not by commercialization and marketing. This is why you can say that MonetDB is the fastest database you've never heard of. But that development has lasted for 30 years, and today MonetDB is a complete and powerful database system.
MonetDB Features
Here, I will list the features of the MonetDB server and direct you toward the blog posts about each specific feature.
1) Blog posts 0010, 0020 and 0030 will teach you how to install MonetDB, how to create a database, and how to fill the database with sample data. Installation of MonetDB through Docker is explained in the last blog post, 0600.
2) MonetDB can be augmented by using the Python language. There are different ways to achieve that. Blog post 0040 will explain how to connect to MonetDB from Python.
| Blog post 0360 will teach us how to use Python to fetch data from different sources into MonetDB. | ![]() | Besides using Python to fetch data, we can use Python to create custom functions in MonetDB. That is explained in blog post 0430. | We can create inline or aggregate Python UDFs. ![]() |
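As a small sketch of what such a UDF can look like (the function, table and VAT rate are hypothetical; the exact syntax is covered in blog post 0430), an inline Python UDF is created with LANGUAGE PYTHON and receives whole column vectors at once:

```sql
-- Hypothetical inline Python UDF: the input arrives as a vector of values,
-- and the whole result vector is returned at once.
CREATE FUNCTION add_vat(price DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
{
    return price * 1.21
};

SELECT Product, add_vat(Price) AS price_with_vat
FROM sales;
```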
3) Blog post 0050 is about identifiers and constants. MonetDB supports different data types.
| Strings | 0060 | Time | 0060, 0070 | JSON | 0090 | URL ( https://www.google.com ) | 0100 |
| Numbers | 0060 | Autonumber | 0080 | UUID ( universally unique identifier ) | 0090 | Network ( IP addresses ) | 0100 |
4) How to create and alter a table, and set table and column constraints is explained in blog posts 0110 and 0320.
CREATE TABLE | CREATE TABLE AS SELECT | ALTER TABLE | Constraints: PK, FK, UNIQUE, NULL, read only, DEFAULT | CHECK constraint is explained in 0520 |
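A minimal sketch of these statements, using hypothetical tables (it assumes an `invoice` table with an `id` primary key already exists):

```sql
-- Table with the usual constraints: PK, FK, NOT NULL, DEFAULT, UNIQUE.
CREATE TABLE invoice_line (
    id          INT            NOT NULL PRIMARY KEY,
    invoice_id  INT            NOT NULL REFERENCES invoice (id),
    product     VARCHAR(100)   NOT NULL,
    qty         INT            DEFAULT 1,
    price       DECIMAL(10,2),
    UNIQUE (invoice_id, product)
);

-- Add a column later with ALTER TABLE.
ALTER TABLE invoice_line ADD COLUMN discount DECIMAL(5,2);
```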
MonetDB also supports some special kinds of tables.
– Temporary tables are tables with a limited lifetime. They are described in 0370.
– If we have several small tables with the same structure, we can connect them into one big virtual table ( partitioning ). This is covered in 0380.
– Unlogged tables are special tables that can be used for fast writes, updates and deletes. We can read about them in 0390.
– Views are explained in 0300.
– MERGE and REMOTE tables are special tables that are used for sharding in MonetDB. Sharding is a way to distribute MonetDB work on a cluster of computers. We can read about them in 0480.
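A hedged sketch of the partitioning idea with merge tables (the table names are hypothetical; the details are in 0380 and 0480):

```sql
-- Two small tables with the same structure, combined into one virtual table.
CREATE TABLE sales_2023 (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));
CREATE TABLE sales_2024 (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));

CREATE MERGE TABLE sales_all (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));
ALTER TABLE sales_all ADD TABLE sales_2023;
ALTER TABLE sales_all ADD TABLE sales_2024;

SELECT SUM(Qty) FROM sales_all;  -- reads both underlying tables
```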
5) SELECT, WHERE, HAVING, LIMIT/OFFSET, GROUP BY and ORDER BY are described in 0120 and 0150.
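For example, a single query that combines most of these clauses (the `sales` table is hypothetical):

```sql
-- Top 5 products by quantity sold, counting only cheap items.
SELECT Product,
       SUM(Qty) AS total_qty
FROM sales
WHERE Price < 3
GROUP BY Product
HAVING SUM(Qty) > 10
ORDER BY total_qty DESC
LIMIT 5 OFFSET 0;
```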
6) Article 0170 will teach us about ANALYZE, SAMPLE and PREPARE (a short sketch follows the list below).
– ANALYZE is used to update MonetDB statistics. Those statistics are used by MonetDB to optimize queries.
– SAMPLE is used to create a sample of rows from some table.
– PREPARE is a way to prefabricate a statement so that when we execute it, it runs faster.
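A rough sketch of all three, assuming a hypothetical `sales` table in the `sys` schema (the statement id used by EXEC is printed when PREPARE runs; the exact details are in article 0170):

```sql
-- Refresh the optimizer statistics for one table.
ANALYZE sys.sales;

-- Return roughly 100 randomly chosen rows.
SELECT * FROM sales SAMPLE 100;

-- Prefabricate a parameterized statement, then run it with EXEC
-- and the statement id that PREPARE reported (0 is just an example).
PREPARE SELECT Product, Price FROM sales WHERE Price < ?;
EXEC 0(3);
```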
7) The MERGE statement, CTEs and GROUPING SETS can be considered composite SQL statements that do several things at once.
| MERGE is used to partially synchronize two tables. | 0180 | A CTE is a construction that breaks a long SQL statement into several smaller ones. Recursive CTEs are also supported ( 0510 ). | 0190 | GROUPING SETS will help us to group our data hierarchically. | 0490 |
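For instance, a small CTE sketch over a hypothetical `sales` table:

```sql
-- The CTE gives a name to the first step, which keeps the final query short.
WITH cheap_sales AS (
    SELECT Product, Qty
    FROM sales
    WHERE Price < 3
)
SELECT Product, SUM(Qty) AS total_qty
FROM cheap_sales
GROUP BY Product;
```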
8) In SQL, running totals and moving averages are calculated by using WINDOW functions. The theory and syntax of window functions are presented in 0200 and 0220. Window functions can be divided into Aggregate 0230, Ranking 0240 and Offset 0250 window functions.
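A short example of a running total with an aggregate window function, over a hypothetical `daily_sales` table:

```sql
-- Running total of sales, ordered by date.
SELECT SaleDate,
       Amount,
       SUM(Amount) OVER (ORDER BY SaleDate) AS running_total
FROM daily_sales;
```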
9) MonetDB is rich in built-in functions. They can be divided into Aggregate 0210, Mathematical 0260, String 0270 and Comparison 0280 functions. Time functions are explained in 0070.
10) Transactions are explained in 0290, Indexes in 0300, and Schemas in 0310.
| Transactions are a way to protect the consistency and integrity of our data. | Indexes are usually created by MonetDB automatically. Still, we can create some special types of indexes ourselves. | Schemas are logical parts of our database. |
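A minimal transaction sketch, with a hypothetical `accounts` table:

```sql
-- Both updates succeed together or not at all.
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- or ROLLBACK; to undo both changes
```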
11) Blog posts 0330, 0340, 0350 are about importing and exporting data to/from MonetDB. We can import/export data to/from CSV files or a binary format. It is also possible to load data into MonetDB from different programming languages ( we have covered Python and SQL ).
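A hedged sketch of CSV import and export (file paths are hypothetical; the exact options, such as ON CLIENT/ON SERVER, are covered in 0330-0350):

```sql
-- Bulk-load a CSV file into an existing table.
COPY INTO sales FROM '/tmp/sales.csv' USING DELIMITERS ',', '\n', '"';

-- Export a query result to a CSV file.
COPY SELECT * FROM sales INTO '/tmp/sales_export.csv' USING DELIMITERS ',', '\n', '"';
```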
12) Procedural SQL is a way to introduce the procedural paradigm into SQL. In the procedural paradigm we explain to the database how to do something, instead of just describing what we want with an SQL query. With procedural SQL we can create:
– Custom functions ( 0400, 0410, 0420 ).
– Procedures ( 0440 ). Procedures are sets of statements that are executed together.
– Triggers ( 0450 ). Triggers are procedures that run automatically after some event.
CREATE FUNCTION | CREATE PROCEDURE | CREATE TRIGGER | RETURN | DECLARE TABLE | DECLARE VARIABLE | CASE | WHILE | IF |
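As a sketch of these building blocks (the names, tables and VAT rate are hypothetical):

```sql
-- A scalar SQL function.
CREATE FUNCTION vat_amount(price DECIMAL(10,2))
RETURNS DECIMAL(10,2)
BEGIN
    RETURN price * 0.21;
END;

-- A procedure: a set of statements executed together.
CREATE PROCEDURE archive_old_sales()
BEGIN
    DELETE FROM daily_sales WHERE SaleDate < DATE '2020-01-01';
END;
```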
13) Creating users and setting their rights is explained in blog posts 0460, 0470.
CREATE USER | ALTER USER | GRANT | REVOKE | CREATE ROLE | SET ROLE |
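A short sketch of a typical setup (user, role and table names are hypothetical; the CREATE USER options follow the form used in the MonetDB documentation):

```sql
-- Create a user, a role with read access, and hand the role to the user.
CREATE USER "report_user" WITH PASSWORD 'secret' NAME 'Report User' SCHEMA "sys";
CREATE ROLE readers;
GRANT SELECT ON TABLE sales TO readers;
GRANT readers TO "report_user";
-- The user then activates the role in their session with: SET ROLE readers;
```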
14) Articles 0500 and 0510 will teach us how to create ODBC and JDBC connections to MonetDB. We will also learn about the file_loader and proto_loader functions. With the file_loader and proto_loader functions, we can treat CSV files and tables in other ODBC databases as if they were local MonetDB tables. Article 0530 will show us how to encrypt the connection to MonetDB using TLS.
15) Every database needs backups. That is explained in 0540 and 0550.
16) For administration we have to learn three console applications:
– monetdbd ( 0560 ) is a Linux daemon. This is the main application used to start and control MonetDB.
– monetdb ( 0570 ) is a console application used to work with a specific database.
– mclient ( 0580 ) is a client application that we can use to send queries to MonetDB.
17) Different system and session procedures and commands will help us to monitor and control user sessions and queries. They are explained in 0590.