Bundled Products vs Products in Bulk
| Stores like to sell products as a bundle. For example, the store will sell a box with several decorations as one product. The store benefits from this because: – Stacking and scanning products is easier when they are bundled ( effort- ). – The buyer will spend less time in the store choosing decorations ( speed+ ). – We can sell more decorations by bundling less popular decorations with the more popular ones ( sale+ ). | But bundles make production planning harder. – We have to unbundle product data to find out how much of each individual shape/color was sold ( effort+ ). – We usually look at sales over a longer period of time. Because of the quantity of data, unbundling will make our database really slow ( speed- ). – We have to read data about all the shapes/colors, although we only want to analyze the sales of one of them ( analysis- ). |
![]() | ![]() |
Data in Bundle or in Bulk
| We can treat a transaction as one bundled product. Product code, quantity and price form one bundle. We want to be able to save this data bundle quickly. | We can achieve this by saving our transactions as separate rows in a database. That way we can quickly save or retrieve an individual transaction as a bundle. |
![]() | ![]() |
| For production planning we will organize our data differently. We will organize it into columns. If we want to find out how many products we have sold with a price lower than 3 EUR, we just have to scan the column "Price", and then read the matching positions in the column "Qty". Each column can be sorted and compressed for even faster retrieval. We don't even need to read the column "Product". | ![]() |
When we organize data in rows, we use a row-oriented database. When we organize data in columns, we use a columnar database. Data is typically collected in a row-oriented database (Postgres, Oracle, MySQL) and then transferred to a columnar database for analytics ( ClickHouse, DuckDB, Vertica ).
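To make this concrete, here is a minimal sketch of the decoration-store query from above, assuming a hypothetical `sales` table with columns "Product", "Qty" and "Price". A columnar engine answering it only has to scan the "Price" and "Qty" columns; the "Product" column is never touched.

```sql
-- How many items were sold for less than 3 EUR?
-- A columnar database scans only the "Price" and "Qty" columns.
SELECT SUM(Qty) AS items_sold
FROM sales
WHERE Price < 3;
```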
Row Oriented Databases vs Columnar Databases
| Row-oriented databases: 1) They are great for small writes/reads. Perfect for "insert one order", "update this user", "delete this invoice line", etc. 2) Application code often works with whole objects/rows (User, Order, Invoice), which matches the row layout. 3) Great for transaction isolation, replication, ACID compliance, constraints, triggers, foreign keys. 4) We have many users who send queries that touch only one row of a table and expect an immediate response from the database. They are best for cash registers, ERP, CRM, banking systems, ticketing. | Columnar databases: 1) We only read the columns we need, and columns can be heavily compressed. That improves IO. 2) The CPU can process several values of the same type at once. This is called vectorized execution. Queries over a lot of data will run faster. 3) Column scans and aggregations are easy to parallelize across cores and nodes. 4) We have a smaller number of queries, but they are much heavier; they touch a lot of data and involve a lot of joins, yet users expect results in 3-4 seconds. They are best for sales reports, dashboards, time series, data mining. |
![]() | ![]() |
Row-oriented databases capture the current state and provide transactional correctness. Columnar databases are for analysis of historical data. We can also say that row-oriented databases are better at writing data and providing data consistency and integrity, while columnar databases are better at reading huge quantities of data.
OLTP vs OLAP
If we are preparing our database for everyday operations and transactions, then we are building an OLTP database. For analytics and BI, we would prepare an OLAP database. OLTP maps naturally to row-oriented databases, and OLAP to columnar databases.
![]() | In OLTP databases, data is organized into many small tables. There is no data duplication and every value is written only once, so these databases usually don't grow significantly in size, although the many small tables make queries more complex. | ![]() | In OLAP databases, we merge small tables to reduce their number. Columnar technology has no problem with wide tables that have a lot of columns. By reducing the number of joins, we can query our data with simpler (and faster) queries. That is why we organize our tables in a star schema. |
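As a hedged illustration of the star schema idea, here is what a typical OLAP query might look like, assuming a hypothetical fact table `sales` and dimension table `dim_product`:

```sql
-- One wide fact table joined to one small dimension table.
-- Far fewer joins than in a normalized OLTP model, so the query stays simple.
SELECT p.Color,
       SUM(s.Qty) AS total_qty
FROM sales AS s
JOIN dim_product AS p ON p.ProductID = s.ProductID
GROUP BY p.Color;
```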
OLTP tables are filled with data by users and applications in small transactions. From time to time ( usually nightly ), data is transformed and transported into OLAP databases in batches. This makes data management different between OLTP and OLAP databases. OLTP databases receive data continuously, while OLAP databases receive data periodically and only after the data has been properly cleaned and transformed.
OLTP is an acronym for "Online Transaction Processing" and OLAP for "Online Analytical Processing". We saw that the difference comes from:
– Different purpose and usage.
– Different database technology.
– Different data organization inside the database.
– Different data management.
Data Immutability
Columnar ( OLAP ) databases are filled in batches. That doesn't mean it is impossible to UPDATE or DELETE rows in their tables. There are four strategies that OLAP databases use:
| 1) Data is immutable. We can only insert data in batches, and afterward we cannot change individual rows. If data is immutable and presorted, we can achieve the highest levels of compression. An example of this technology is the database used by Power BI. | 2) MonetDB is an example of an OLAP database that is suited for light updates and deletes. When we delete/update a value, MonetDB will label the old record as "dead". That record becomes invisible to queries. After some time, garbage collection will remove the unneeded records. ![]() |
| 3) We can divide our columns into small segments. Whenever we want to change some value, we rewrite the whole segment. This is a good solution when we have a lot of hardware power, like cloud providers do. Snowflake is an example of such a database. | 4) We can have two databases in one. One database is row-oriented, and the other one is columnar. Data is written into the row-oriented database, but when we want to read data we read from both databases together. Historical data is read from the columnar database, and today's data is read from the row-oriented database. SAP HANA uses this approach. | This last strategy creates something called "Hybrid Transactional/Analytical Processing", or HTAP. HTAP is when a single database can do fast real-time transactions and fast analytics on the same data at the same time. ![]() |
| In recent years, MonetDB researchers have been gradually adding or experimenting with features that move it slightly toward the HTAP category: | – Better handling of frequent updates and small transactions – Improved concurrency and snapshot isolation – Faster incremental data loads – Optimizations for mixed workloads |
MonetDB
If you have one server and install MonetDB on it, you will be able to:
– Enjoy blazing-fast query performance, even if you have a lot of data and your queries are complex.
– Avoid thinking about indexing, compression, statistics tuning, partitioning decisions. MonetDB will do it all automatically.
– Apply the full power of SQL to all your tables, whether they are carefully modeled or just collected in one place.
– Spend your money on something else because MonetDB is an open source database under the Mozilla Public License.
MonetDB is a fast, easy-to-maintain, open-source analytical database server with complete SQL support that can process huge amounts of data on a single server machine.
Hardware and Speed Considerations for MonetDB
MonetDB does not require the whole database to be kept in RAM. Parts of the data that have already been processed will be removed from memory. MonetDB can deal with data that is much bigger than the amount of memory on your server machine. Only the data that is currently being processed is kept in memory by the execution pipelines.
MonetDB will benefit from a fast NVMe disk that has sustained performance under write load. Such a disk should be accompanied by a lot of RAM, so that disk usage stays minimal.
| MonetDB is designed to take full advantage of your CPU. – When executing a query or loading data, MonetDB will spread the load across all processor cores, increasing CPU utilization to over 90%. There is a setting that can limit the number of cores used by MonetDB. – Vectorized query execution means that the processor processes many values of the same data type at once, which is a perfect fit for columnar databases. – Data is processed in small chunks using simple instructions. This reduces trips to RAM and makes use of the processor's fast L1/L2 caches. – Late materialization means that we avoid reading data that does not contribute to our query. We only read the data needed for the final dataset and skip everything else. – Partial query results are reused instead of being recomputed. | ![]() |
Zero-Tuning Philosophy
MonetDB will not burden you with indexing strategies, partitioning decisions and vacuuming. MonetDB tries to eliminate complexity. Most workloads run at full speed without manual optimization. Indexes are automatic. Compression is automatic. Memory management is automatic.
Creating indexes manually is problematic because you need to know in advance where to create the indexes. Indexes can slow down updates and deletes because when data is modified, the database must also update the corresponding indexes to reflect these changes. This is something that we especially want to avoid in columnar databases. If users change their behavior, that can make our indexes ineffective.
The technique that MonetDB uses to automatically tune indexes is called database cracking. Cracking means that MonetDB will sort and group data, and create indexes, during SELECT queries (see the sketch after the list below). If most of the statements on OLAP databases are SELECT queries, and such queries touch a lot of rows, then it is best to create indexes and tune data during SELECT query execution. This is optimal because:
– The parts of a database that are heavily used will be tuned the best. We will not spend effort on data that no one reads.
– Instead of having one huge indexing job, we will have many tiny, incremental reorganizations.
– After some time, our data will be optimally indexed and sorted, but MonetDB can change strategy if users change their behavior.
– Cracking is especially powerful in columnar databases like MonetDB, where each column is stored separately.
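Here is a minimal sketch of how cracking behaves, assuming a hypothetical `sales` table with a `Price` column. No index is ever created by hand; each range query reorganizes the column a little more around the requested boundaries, so later queries on the same column get faster.

```sql
-- Each of these ordinary SELECT queries makes MonetDB partially sort
-- (crack) the "Price" column around the requested value ranges.
SELECT COUNT(*) FROM sales WHERE Price BETWEEN 1 AND 3;
SELECT COUNT(*) FROM sales WHERE Price BETWEEN 3 AND 5;
SELECT COUNT(*) FROM sales WHERE Price > 5;
```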
Cracking is powerful but not ideal for everything:
– Heavy transactional workloads (many updates) disrupt adaptive patterns.
– Distributed cracking across many nodes is still complex.
– Highly unpredictable workloads (all queries unique) limit its benefits.
Comparison of MonetDB and Power BI
The Power BI database ( SSAS Tabular ) is an in-memory database that uses compressed columnar tables and is optimized for BI models ( dimensions, measures, hierarchies ). Power BI needs the whole model loaded in memory, but columns are heavily compressed, so Power BI doesn't have a huge memory footprint. For many calculations Power BI doesn't even need to decompress columns, so it can do its work while maintaining really fast IO.
Power BI likes table relationships and measures defined in advance, and it is not optimized for arbitrary queries. It is not intended for complex joins outside the model. If we spend time creating an optimal model, and we wait for the model to process/refresh during the load, we will get a database that is optimized for:
– measure evaluation
– filters
– slice and dice
On the other hand, MonetDB is optimized for raw analytical SQL on large tables, without modeling overhead.
MonetDB Imperfections
As a columnar server, MonetDB suffers from all of the shortcomings that columnar servers have. MonetDB is not good at transactional traffic or at handling a lot of deletes and updates. MonetDB is not perfect for reports that return huge tables with a lot of columns. MonetDB is best for short, aggregated analytical queries.
If you have massive amounts of data, you will need a computer cluster and cloud scalability. This is where MonetDB doesn't shine. MonetDB is best if you have one powerful server machine and you want to run complex analytics on it. Distributed databases are better at scaling, but because their setup is more complex and they are spread over many computers, they usually have less rich SQL capabilities and other limitations.
MonetDB does not support replication, but it does support distributed operation across multiple machines via sharding. Sharding will speed up some queries, but not all. If we want to set up a cluster with MonetDB, partitioning and sharding must be done carefully and manually. If possible, it is better to have a single powerful server than to resort to sharding and expensive network equipment.
MonetDB Market Position
| You should use MonetDB if you want to: – Self-host an analytical database on your own server. – Run complex ad-hoc queries, window functions, joins, and aggregations. – Skip designing cubes or writing DAX; you just want fast SQL. – Avoid vendor lock-in by using fully open-source software. | MonetDB is best for: – Data engineers – BI developers – Researchers – Small/medium businesses – Python users |
MonetDB supports SQL, ODBC, JDBC and many programming languages ( Python, Java, R, Ruby, PHP… ). It can easily be integrated with different software, but support from third-party software, clients and ORMs is limited. This is a consequence of MonetDB's origin, which was research and science oriented. MonetDB was developed at the CWI institute in the Netherlands. Development was driven by innovation and curiosity, not by commercialization and marketing. This is why you can say that MonetDB is the fastest database you've never heard of. But that development has lasted for 30 years, and today MonetDB is a complete and powerful database system.
MonetDB Features
Here, I will list the features of the MonetDB server and direct you toward the blog posts about each specific feature.
1) Blog posts 0010, 0020 and 0030 will teach you how to install MonetDB, how to create a database, and how to fill the database with sample data. Installation of MonetDB through Docker is explained in the last blog post, 0600.
2) MonetDB can be augmented by using the Python language. There are different ways to achieve that. Blog post 0040 will explain how to connect to MonetDB from Python.
| Blog post 0360 will teach us how to use Python to fetch data from different sources into MonetDB. | ![]() | Besides using Python to fetch data, we can use Python to create custom functions in MonetDB. That is explained in blog post 0430. | We can create inline or aggregate Python UDFs. ![]() |
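As a small sketch of what such a UDF can look like (the function, table and VAT rate are hypothetical; the exact syntax is covered in blog post 0430), an inline Python UDF is created with LANGUAGE PYTHON and receives whole column vectors at once:

```sql
-- Hypothetical inline Python UDF: the input arrives as a vector of values,
-- and the whole result vector is returned at once.
CREATE FUNCTION add_vat(price DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
{
    return price * 1.21
};

SELECT Product, add_vat(Price) AS price_with_vat
FROM sales;
```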
3) Blog post 0050 is about identifiers and constants. MonetDB supports different data types.
| Strings | 0060 | Time | 0060, 0070 | JSON | 0090 | URL ( https://www.google.com ) | 0100 |
| Numbers | 0060 | Autonumber | 0080 | UUID ( universally unique identifier ) | 0090 | Network ( IP addresses ) | 0100 |
4) How to create and alter a table, and set table and column constraints is explained in blog posts 0110 and 0320.
CREATE TABLE | CREATE TABLE AS SELECT | ALTER TABLE | Constraints: PK, FK, UNIQUE, NULL, read only, DEFAULT | CHECK constraint is explained in 0520 |
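A minimal sketch of these statements, using hypothetical tables (it assumes an `invoice` table with an `id` primary key already exists):

```sql
-- Table with the usual constraints: PK, FK, NOT NULL, DEFAULT, UNIQUE.
CREATE TABLE invoice_line (
    id          INT            NOT NULL PRIMARY KEY,
    invoice_id  INT            NOT NULL REFERENCES invoice (id),
    product     VARCHAR(100)   NOT NULL,
    qty         INT            DEFAULT 1,
    price       DECIMAL(10,2),
    UNIQUE (invoice_id, product)
);

-- Add a column later with ALTER TABLE.
ALTER TABLE invoice_line ADD COLUMN discount DECIMAL(5,2);
```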
MonetDB also supports some special kinds of tables.
– Temporary tables are tables with a limited lifetime. They are described in 0370.
– If we have several small tables with the same structure, we can connect them into one big virtual table ( partitioning ). This is covered in 0380.
– Unlogged tables are special tables that can be used for fast writes, updates and deletes. We can read about them in 0390.
– Views are explained in 0300.
– MERGE and REMOTE tables are special tables that are used for sharding in MonetDB. Sharding is a way to distribute MonetDB work on a cluster of computers. We can read about them in 0480.
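A hedged sketch of the partitioning idea with merge tables (the table names are hypothetical; the details are in 0380 and 0480):

```sql
-- Two small tables with the same structure, combined into one virtual table.
CREATE TABLE sales_2023 (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));
CREATE TABLE sales_2024 (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));

CREATE MERGE TABLE sales_all (Product VARCHAR(100), Qty INT, Price DECIMAL(10,2));
ALTER TABLE sales_all ADD TABLE sales_2023;
ALTER TABLE sales_all ADD TABLE sales_2024;

SELECT SUM(Qty) FROM sales_all;  -- reads both underlying tables
```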
5) SELECT, WHERE, HAVING, LIMIT/OFFSET, GROUP BY and ORDER BY are described in 0120 and 0150.
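For example, a single query that combines most of these clauses (the `sales` table is hypothetical):

```sql
-- Top 5 products by quantity sold, counting only cheap items.
SELECT Product,
       SUM(Qty) AS total_qty
FROM sales
WHERE Price < 3
GROUP BY Product
HAVING SUM(Qty) > 10
ORDER BY total_qty DESC
LIMIT 5 OFFSET 0;
```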
6) Article 0170 will teach us about ANALYZE, SAMPLE and PREPARE (a short sketch follows the list below).
– ANALYZE is used to update MonetDB statistics. Those statistics are used by MonetDB to optimize queries.
– SAMPLE is used to create a sample of rows from some table.
– PREPARE is a way to prefabricate a statement so that when we execute it, it runs faster.
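A rough sketch of all three, assuming a hypothetical `sales` table in the `sys` schema (the statement id used by EXEC is printed when PREPARE runs; the exact details are in article 0170):

```sql
-- Refresh the optimizer statistics for one table.
ANALYZE sys.sales;

-- Return roughly 100 randomly chosen rows.
SELECT * FROM sales SAMPLE 100;

-- Prefabricate a parameterized statement, then run it with EXEC
-- and the statement id that PREPARE reported (0 is just an example).
PREPARE SELECT Product, Price FROM sales WHERE Price < ?;
EXEC 0(3);
```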
7) The MERGE statement, CTEs and GROUPING SETS can be considered composite SQL statements that do several things at once.
| MERGE is used to partially synchronize two tables. | 0180 | A CTE is a construction that breaks a long SQL statement into several smaller ones. Recursive CTEs are also supported ( 0510 ). | 0190 | GROUPING SETS will help us to group our data hierarchically. | 0490 |
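For instance, a small CTE sketch over a hypothetical `sales` table:

```sql
-- The CTE gives a name to the first step, which keeps the final query short.
WITH cheap_sales AS (
    SELECT Product, Qty
    FROM sales
    WHERE Price < 3
)
SELECT Product, SUM(Qty) AS total_qty
FROM cheap_sales
GROUP BY Product;
```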
8) In SQL, running totals and moving averages are calculated by using WINDOW functions. The theory and syntax of window functions are presented in 0200 and 0220. Window functions can be divided into Aggregate 0230, Ranking 0240 and Offset 0250 window functions.
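A short example of a running total with an aggregate window function, over a hypothetical `daily_sales` table:

```sql
-- Running total of sales, ordered by date.
SELECT SaleDate,
       Amount,
       SUM(Amount) OVER (ORDER BY SaleDate) AS running_total
FROM daily_sales;
```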
9) MonetDB is rich in built-in functions. They can be divided into Aggregate 0210, Mathematical 0260, String 0270 and Comparison 0280 functions. Time functions are explained in 0070.
10) Transactions are explained in 0290, Indexes in 0300, and Schemas in 0310.
| Transactions are a way to protect the consistency and integrity of our data. | Indexes are usually created by MonetDB automatically. Still, we can create some special types of indexes ourselves. | Schemas are logical parts of our database. |
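A minimal transaction sketch, with a hypothetical `accounts` table:

```sql
-- Both updates succeed together or not at all.
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- or ROLLBACK; to undo both changes
```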
11) Blog posts 0330, 0340, 0350 are about importing and exporting data to/from MonetDB. We can import/export data to/from CSV files or a binary format. It is also possible to load data into MonetDB from different programming languages ( we have covered Python and SQL ).
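A hedged sketch of CSV import and export (file paths are hypothetical; the exact options, such as ON CLIENT/ON SERVER, are covered in 0330-0350):

```sql
-- Bulk-load a CSV file into an existing table.
COPY INTO sales FROM '/tmp/sales.csv' USING DELIMITERS ',', '\n', '"';

-- Export a query result to a CSV file.
COPY SELECT * FROM sales INTO '/tmp/sales_export.csv' USING DELIMITERS ',', '\n', '"';
```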
12) Procedural SQL is a way to introduce the procedural paradigm into SQL. In the procedural paradigm we explain to the database how to do something, instead of just describing what we want with an SQL query. With procedural SQL we can create:
– Custom functions ( 0400, 0410, 0420 ).
– Procedures ( 0440 ). Procedures are sets of statements that are executed together.
– Triggers ( 0450 ). Triggers are procedures that run automatically after some event.
CREATE FUNCTION | CREATE PROCEDURE | CREATE TRIGGER | RETURN | DECLARE TABLE | DECLARE VARIABLE | CASE | WHILE | IF |
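As a sketch of these building blocks (the names, tables and VAT rate are hypothetical):

```sql
-- A scalar SQL function.
CREATE FUNCTION vat_amount(price DECIMAL(10,2))
RETURNS DECIMAL(10,2)
BEGIN
    RETURN price * 0.21;
END;

-- A procedure: a set of statements executed together.
CREATE PROCEDURE archive_old_sales()
BEGIN
    DELETE FROM daily_sales WHERE SaleDate < DATE '2020-01-01';
END;
```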
13) Creating users and setting their rights is explained in blog posts 0460, 0470.
CREATE USER | ALTER USER | GRANT | REVOKE | CREATE ROLE | SET ROLE |
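A short sketch of a typical setup (user, role and table names are hypothetical; the CREATE USER options follow the form used in the MonetDB documentation):

```sql
-- Create a user, a role with read access, and hand the role to the user.
CREATE USER "report_user" WITH PASSWORD 'secret' NAME 'Report User' SCHEMA "sys";
CREATE ROLE readers;
GRANT SELECT ON TABLE sales TO readers;
GRANT readers TO "report_user";
-- The user then activates the role in their session with: SET ROLE readers;
```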
14) Articles 0500 and 0510 will teach us how to create ODBC and JDBC connections to MonetDB. We will also learn about the file_loader and proto_loader functions. With the file_loader and proto_loader functions, we can treat CSV files and tables in other ODBC databases as if they were local MonetDB tables. Article 0530 will show us how to encrypt the connection to MonetDB using TLS.
15) Every database needs backups. That is explained in 0540 and 0550.
16) For administration we have to learn three console applications:
– monetdbd ( 0560 ) is a Linux daemon. This is the main application used to start and control MonetDB.
– monetdb ( 0570 ) is a console application used to work with a specific database.
– mclient ( 0580 ) is a client application that we can use to send queries to MonetDB.
17) Different system and session procedures and commands will help us to monitor and control user sessions and queries. They are explained in 0590.