0005 MonetDB Benchmark

Benchmark Data

For this benchmark we will use sales data from the fictitious company "Contoso". We can download the database from GitHub. At the address below we will find compressed CSV files with the data. In total there are 8 files of 500 MB and one smaller file of 250 MB ( 4,250 MB in total ).

https://github.com/sql-bi/Contoso-Data-Generator-V2-Data/releases/tag/ready-to-use-data

When we unzip these files, we will find 8 CSV files inside. This is a sales cube with tables that show us sales and orders per product, customer, date and store.

The three big tables are "Orders" ( 88M rows ), "Sales" ( 211M ) and "OrderRows" ( 211M ). The dimension table "Customer" has 2M rows, and all the other dimension tables are small.

Not all of the columns are shown in the image.

In the CSV file "date.csv" I will rename the columns month => month2 and year => year2, because MonetDB will not accept the original names. "Month" and "Year" are reserved words.
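As an illustration, the renamed columns simply appear under their new names in the table definition. This is only a sketch with an assumed, abbreviated column list, not the full Contoso "date" schema:

CREATE TABLE date (
    date   DATE NOT NULL,
    year2  SMALLINT,      -- originally "year" in the CSV header
    month2 VARCHAR(20)    -- originally "month" in the CSV header
    -- remaining columns omitted in this sketch
);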

Machine Hardware

For this benchmark we will use a CPU with 8 cores and 64 GB of RAM.

Our operating system is Zorin 18.

Creating Tables and Loading CSV Files

OrderRows Table

CREATE TABLE orderrows (
    OrderKey   BIGINT   NOT NULL,
    LineNumber SMALLINT NOT NULL,
    ProductKey SMALLINT NOT NULL,
    Quantity   SMALLINT NOT NULL,
    UnitPrice  REAL     NOT NULL,  --DECIMAL(8,4)
    NetPrice   REAL     NOT NULL,  --DECIMAL(10,6)
    UnitCost   REAL     NOT NULL   --DECIMAL(8,4)
);
First, I will create a table for OrderRows in MonetDB. I will not create primary and foreign key constraints this time.

I will fill this table from my CSV file with the "COPY INTO" statement.

COPY OFFSET 2 INTO orderrows
FROM '/home/fff/Desktop/CSVs/orderrows.csv'
USING DELIMITERS ',', E'\n', '"'; 

The import of this 211M-row table lasts just under a minute ( 56.241 sec ). Amazing.

Other Tables

I will leave a file for download at the end of this article. That file will contain the SQL for creating and importing all of the Contoso tables.

Below we can see the import times for the two other large files. The smaller dimension tables are imported almost instantly.

The Sales (211M) table needs almost 5 minutes to be imported.
The Orders (88M) table is imported in 62 seconds.
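These loads follow the same pattern as for OrderRows. A sketch for the "sales" table ( assuming the CSV file is named sales.csv and sits in the same folder as before ):

COPY OFFSET 2 INTO sales
FROM '/home/fff/Desktop/CSVs/sales.csv'
USING DELIMITERS ',', E'\n', '"';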

Query Benchmarking

Cold Start

I will now restart my computer. I want to make sure to run a cold query. The simple query below will run for 10 seconds. The MonetDB database becomes faster as more queries are executed. This ability of MonetDB is called "cracking". MonetDB will automatically sort, group and index columns during SELECT queries. That makes subsequent queries faster.

SELECT * FROM sales LIMIT 1;

If we run this query again, it will be executed in just 1.3 milliseconds. This is not the result of caching. If we make a query that fetches 2 or 3 rows, we will again see these exceptional speeds.

SELECT * FROM sales LIMIT 2;
SELECT * FROM sales LIMIT 3;

Aggregated Queries in MonetDB

If we aggregate two columns in the "sales" table, that query will touch 211M rows. It is fast ( 2.268 sec ), and if we repeat it, it finishes in only 94.915 ms.

SELECT SUM( quantity ), AVG( unitprice ) FROM sales;

I will run the same query, but this time with a filter.
SELECT SUM( quantity ), AVG( unitprice )
FROM sales WHERE orderdate <= '2020-05-25';

                                                                                          
We will get the result even faster. This proves that the result is not cached. MonetDB ran the query again, but this time "cracking" made our query faster.

It's hard to make a benchmark when execution times are constantly changing. So, from now on I will focus on the fastest times.

How Database Reports Execution Time

SELECT * FROM sales LIMIT 1000000;   --13 ms
SELECT * FROM sales LIMIT 2000000;  --24 ms
                                                                                       
MonetDB reports that the second query is slower than the first one. That is what we expect. The problem is that, according to my computer clock, the first query finished after 7 seconds and the second one after 15 seconds.

Databases only report the time spent producing the result in memory. They do not include the time needed to print the result in the shell or any other client. That is why MonetDB reports 13 ms, but I can see the result only after 7 seconds. The MonetDB client has a command to suppress printing of the result in the shell. I will use that command ( command explained here ) next, to test reading whole tables.
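As a sketch of that workflow in the mclient shell ( assuming mclient is the client in use; \f is its meta-command for switching the output formatter, and the "trash" formatter discards the rendered rows while timings are still reported ):

\f trash
SELECT * FROM sales;
\f sql

The last line switches back to the normal output format.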

This is how long it takes MonetDB to read a large number of rows. Columnar databases are better suited for aggregate queries, but we can see that MonetDB is capable of performing OLTP-style queries quite well.

I will disable that command, so we can again see the results of our queries.

Joins

SELECT productkey, SUM( quantity ), AVG( netprice )
FROM sales GROUP BY productkey;
This query executes in 340 ms. If we want to see brands, then we have to make a join between the "product" and "sales" tables.

SELECT brand, SUM( quantity ), AVG( netprice )
FROM sales INNER JOIN product
   ON sales.productkey = product.productkey GROUP BY brand;

                                                                 
The query with the join lasts more than 1 second. We can speed it up if we create a foreign key constraint.

ALTER TABLE product ADD CONSTRAINT product_pk PRIMARY KEY ( productkey );
ALTER TABLE sales ADD CONSTRAINT FKfromProduct FOREIGN KEY ( productkey )
    REFERENCES product ( productkey );

Now that we have the foreign key constraint, the query from before becomes 200 ms faster. That is about 20% faster.

The query from before is a traditional analytical query. If we look at the system monitor, we will see that during the execution of this query the load is distributed equally between the CPU cores. MonetDB is capable of significantly parallelizing query execution. That means that individual queries can be sped up with a CPU that has even more cores.
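One way to peek at this ( a sketch; EXPLAIN prints the MAL plan that MonetDB generates, and the partitioning steps added by its "mitosis" optimizer hint at how the work is split across cores ):

EXPLAIN
SELECT brand, SUM( quantity ), AVG( netprice )
FROM sales INNER JOIN product
   ON sales.productkey = product.productkey GROUP BY brand;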

DISTINCT, LIKE, ROLLUP

From the table "customer" we can list distinct continents and genders in 3.841 ms.  
                                                           
SELECT DISTINCT continent, gender FROM customer;
We can get the distinct combinations of storekey and currencycode from the "sales" table in half a second.

SELECT DISTINCT storekey, currencycode
FROM sales;

If we want to count the unique combinations of orderkey and currencycode, then our query will be slow; it will last almost 11 seconds.

SELECT COUNT( * ) FROM
    ( SELECT DISTINCT orderkey, currencycode
      FROM sales );
A query like the one below will also last 11 seconds.
                                                                                 
SELECT COUNT( * ) FROM
 ( SELECT orderkey, currencycode
   FROM sales
   GROUP BY orderkey, currencycode );

The LIKE operator allows the usage of wildcards. The sign "_" stands for exactly one character.

SELECT currencycode, SUM( quantity ), MAX( unitprice )
FROM sales
WHERE currencycode LIKE '_U_' GROUP BY currencycode;

If we use the sign "%", which matches any number of characters, then the speed will drop; the time almost doubles.

SELECT currencycode, SUM( quantity ), MAX( unitprice )
FROM sales
WHERE currencycode LIKE '%U_' GROUP BY currencycode;

Before we test ROLLUP, I will create a foreign key constraint between the "customer" and "sales" tables.

ALTER TABLE customer ADD CONSTRAINT customer_pk PRIMARY KEY ( customerkey );
ALTER TABLE sales ADD CONSTRAINT FKfromCustomer FOREIGN KEY ( customerkey ) REFERENCES customer ( customerkey );

This time we have an unusually slow query. It takes a full 10 seconds.
SELECT continent, title, SUM( quantity )
FROM customer INNER JOIN sales
   ON customer.customerkey = sales.customerkey
GROUP BY ROLLUP( continent, title );
A query with UNION would be a much better choice for this. It takes only 1.837 seconds.
SELECT continent, title, SUM( quantity )
FROM sales INNER JOIN customer ON sales.customerkey = customer.customerkey
GROUP BY continent, title
UNION
SELECT continent, null, SUM( quantity )
FROM sales INNER JOIN customer ON sales.customerkey = customer.customerkey
GROUP BY continent
UNION
SELECT null, null, SUM( quantity ) FROM sales;

This shows us that there is still room for improvement in the MonetDB optimizer.

Window Functions

I will first add a foreign key constraint between the "sales" and "date" tables.

ALTER TABLE date ADD CONSTRAINT date_pk PRIMARY KEY ( date );
ALTER TABLE sales ADD CONSTRAINT FKfromDate FOREIGN KEY ( orderdate ) REFERENCES date ( date );

We can use the LAG function to get sales for the current and the previous date. This query executes in 700 ms.

SELECT date, SUM( Quantity ),
    LAG( SUM( Quantity ), 1 ) OVER ( ORDER BY date ) AS Yesterday
FROM date INNER JOIN sales ON date.date = sales.orderdate
GROUP BY date;

We can calculate sales for the current date and the average for the previous seven days. This query also executes in 700 ms.
SELECT date, SUM( Quantity ),
    AVG( SUM( Quantity ) ) OVER
        ( ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING )
            AS SevenDaysAVG 
FROM date INNER JOIN sales ON date.date = sales.orderdate
GROUP BY date
ORDER BY date;

Updates

MonetDB should be slow for updates, but it managed to update 44 million rows in just 4.19 seconds.
UPDATE sales SET currencycode = 'EU' WHERE currencycode = 'EUR';

We can confirm that all of the 'EUR' values have been updated to 'EU'.

SELECT DISTINCT currencycode FROM sales;

Problematic Queries

Double Grouping

Double grouping is when we first group our data and then group that result again. For example, we will total the sales quantity per customerkey, and then count customers per total quantity, that is, how many customers have the same total quantity.

This query is problematic because, while the first grouping can be fast, the second one could take much longer. The result of the first grouping will have 2M rows, because we have that many customers. In the second stage we have to group these 2M rows, and that is where I expect the performance to become bad.

SELECT customer.customerkey, SUM( quantity ) AS TotQty
FROM customer INNER JOIN sales ON customer.customerkey = sales.customerkey GROUP BY customer.customerkey;
     
In the first phase I will measure how much time is needed to group by customer. It is 5 seconds, because there are 2 million customers.

Second phase:
SELECT TotQty, COUNT( customerkey ) FROM
( SELECT customer.customerkey, SUM( quantity ) AS TotQty  
   FROM customer INNER JOIN sales ON customer.customerkey = sales.customerkey
   GROUP BY customer.customerkey ) as FirstPhase
GROUP BY TotQty;  


We can see in the image that we have 2,663 customers with a total quantity of 333 items, and only two with 1,130 items. The execution time is again 5 seconds. This is something I didn't expect; I am pleasantly surprised. I can tell you that these kinds of queries are problematic for the Power BI database engine ( SSAS ).

Aggregated Query from Two Fact Tables ( Stitch Query )

This time I will create a foreign key constraint on the "OrderRows" (211M) table. I want to aggregate sales and order rows per product brand.
ALTER TABLE orderrows ADD CONSTRAINT FK_Product FOREIGN KEY ( productkey ) REFERENCES product ( productkey );

"Stitch" query is when we aggregate two fact tables per the same dimension and we get two data sets as a result. Then we join those two data sets in the final result. This is how we aggregate values from two fact tables.

The query below lasts 7.5 seconds. This is longer than I expected. If we ran the subqueries separately, each would take just 900 ms. Because we only have 15 brands, it is surprising that it takes over 5 seconds just to join two small result sets.
SELECT S.brand, Sq, Oq FROM
( SELECT Brand, SUM( quantity ) Sq FROM Product INNER JOIN Sales ON Product.Productkey = Sales.ProductKey GROUP BY Brand ) S
INNER JOIN
( SELECT Brand, SUM( quantity ) Oq FROM Product INNER JOIN Orderrows ON Product.Productkey = OrderRows.ProductKey GROUP BY Brand ) O
ON S.Brand = O.Brand;
If we "UNION ALL" our subqueries, the execution will last 7.5 seconds, too.
SELECT Brand, SUM( quantity ) Sq FROM Product INNER JOIN Sales ON Product.Productkey = Sales.ProductKey GROUP BY Brand
UNION ALL
SELECT Brand, SUM( quantity ) Oq FROM Product INNER JOIN Orderrows ON Product.Productkey = OrderRows.ProductKey GROUP BY Brand;

I have tried reading the two small subqueries into Python and then joining them with pandas. Python reported an execution time of just 1.2 seconds. It is strange that we can get the final result faster by combining MonetDB and pandas than just by using MonetDB.

WITH S AS
( SELECT productkey, SUM( quantity ) AS Sq FROM Sales GROUP BY productkey ),
O AS
( SELECT productkey, SUM( quantity ) AS Oq FROM Orderrows GROUP BY productkey ),
PS AS
( SELECT brand, SUM( Sq ) AS SQty FROM Product INNER JOIN S ON Product.productkey = S.productkey GROUP BY brand ),
PO AS
( SELECT brand, SUM( Oq ) AS OQty FROM Product INNER JOIN O ON Product.productkey = O.productkey GROUP BY brand )
SELECT PS.brand, SQty, OQty FROM PS INNER JOIN PO ON PS.brand = PO.brand;
We can reduce our fact tables by first grouping them by productkey and then following the same logic. This approach speeds up our query to 5.5 seconds.

INSERT INTO SELECT

I will use "INSERT INTO SELECT" to make "Sales" table bigger.  Before doing that, I will remove FK constraints. I will also change optimizer.
ALTER TABLE sales DROP CONSTRAINT fkfromproduct;
ALTER TABLE sales DROP CONSTRAINT fkfromcustomer;
ALTER TABLE sales DROP CONSTRAINT fkfromdate;
SET sys.optimizer = 'minimal_pipe';
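As a side note, once the bulk loading is finished, the optimizer can be switched back the same way; 'default_pipe' is MonetDB's standard pipeline.

SET sys.optimizer = 'default_pipe';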
I will now run this statement to double the size of my sales table.
INSERT INTO sales SELECT * FROM sales;
MonetDB needed 5 minutes to do this.

I will do this one more time. That doubles the number of rows to 844M. It was done in 9 minutes 48 seconds.
Then I will again read the CSV file into this table. That adds another 211M rows, so in total the "sales" table will now have around one billion rows.
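As a quick sanity check ( 844M plus the extra 211M gives roughly 1.05 billion rows ), we can simply count them:

SELECT COUNT( * ) FROM sales;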

I will recreate the foreign key constraint toward the "product" table.
ALTER TABLE sales ADD CONSTRAINT FKfromProduct FOREIGN KEY ( productkey ) REFERENCES product ( productkey );

I will now run this query twice. The first time it finishes after 28 seconds, and the second time after 5.5 seconds.
SELECT brand, SUM( quantity ), AVG( netprice )
FROM sales INNER JOIN product
   ON sales.productkey = product.productkey
GROUP BY brand;

SELECT color, SUM( quantity ), AVG( netprice ) FROM product INNER JOIN sales
   ON sales.productkey = product.productkey
GROUP BY color;
Immediately afterwards, I ran the same query, but grouped by color. The time was again around 5 seconds.

We can see that performance is good, even with 1B rows.

Conclusions

We can draw a few conclusions:
– MonetDB is usually very fast.
– Initially, until the database warms up, queries can be slow.
– We should always set foreign key constraints to get a speed boost.
– Some kinds of queries are better optimized than others.
– MonetDB does not use much memory. During the import of the "sales" table, RAM usage increased from 4 to 13 GB. It was the same during the last, 1B-row, query. At other times the usage was much lower.
