This will not work: we cannot use GROUP BY in the RETURNING clause. The RETURNING clause must remain simple.
Referencing Columns by Their Position
SELECT Color, Size, qty FROM tabProducts ORDER BY 1, 2; --BY Color, Size
SELECT Color, SUM( qty ) FROM tabProducts GROUP BY 1; --BY Color
In MonetDB it is possible to reference columns by their position in the select clause.
Referencing a Set of Columns with the Keyword ALL
SELECT Color, qty, COUNT( Size ) FROM tabProducts GROUP BY ALL; --BY Color, qty
When using GROUP BY, we can only group by columns that are not aggregated. If we want to group by all such columns, we can use "GROUP BY ALL".
SELECT Color, Size, qty FROM tabProducts ORDER BY ALL; --BY Color, Size, qty
We can use ALL in the ORDER BY clause. This is the same as "ORDER BY Color, Size, qty".
MonetDB will accept this syntax, but due to a bug it does nothing. There will be no sorting.
Instead of the keyword ALL, we can use the star "*" sign.
SELECT Color, qty, COUNT( Size ) FROM tabProducts GROUP BY *; --BY Color, qty
SELECT Color, Size, qty FROM tabProducts ORDER BY *; --BY Color, Size, qty
IS [NOT] DISTINCT FROM
Anything compared with NULL will return NULL.
SELECT 'null' = null; SELECT null = null;
IS [NOT] DISTINCT FROM is a null-safe comparison operator. This operator will always return TRUE or FALSE, even if one of the operands is NULL.
SELECT NULL IS DISTINCT FROM NULL;
SELECT 'A' IS DISTINCT FROM NULL;
SELECT NULL IS NOT DISTINCT FROM NULL;
SELECT 'A' IS NOT DISTINCT FROM NULL;
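In order, these four queries return false, true, true, and false. None of them returns null.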
CHECK Constraint
A check constraint is a rule that specifies a condition ( a boolean expression ) that must be met by each row in a database table. This rule limits the acceptable values for data.
To our table, I will add a check constraint saying that the quantity column must have values less than 1. This will fail because the existing data already contains bigger values. ALTER TABLE tabProducts ADD CONSTRAINT "QtyLess1" CHECK (qty < 1);
If we change our condition so that qty must be less than 100, it will succeed. ALTER TABLE tabProducts ADD CONSTRAINT "QtyLess100" CHECK (qty < 100); After that, we will try to write the number 111 into the qty column:
UPDATE tabProducts SET qty = 111 WHERE qty = 30;
This will fail because of the constraint ( 111 is not less than 100 ).
This also means that we cannot update a value in the qty column to a value that violates the constraint. We can only store values smaller than 100. UPDATE tabProducts SET qty = 99 WHERE qty = 30;
SELECT * FROM tabProducts;
How to Add a CHECK Constraint?
We can add several constraints to the same column. We now have two constraints: "qty > 0" and "qty < 100". ALTER TABLE tabProducts ADD CONSTRAINT "QtyGrt0" CHECK (qty > 0);
That is not efficient. It is much better to add both conditions with one statement. We can connect conditions with AND and OR. ALTER TABLE tabProducts ADD CONSTRAINT QtyConstraints CHECK (qty > 0 AND qty < 100);
Constraints can combine several columns in the condition expression: ALTER TABLE tabProducts ADD CONSTRAINT CheckColorSize CHECK ( Color = 'Red' OR Size = 'XL' );
We can add constraint during the creation of a table. CREATE TABLE tabOrders ( Price INT, Qty INT, CONSTRAINT ValidOrders CHECK ( Price > 10 AND Qty < 10 ) );
During table creation, we can also add a constraint at the single-column level. The server will provide a default name for such a constraint.
CREATE TABLE tabOneColumn ( Col1 CHAR, Col2 INT CHECK ( Col2 < 999 ) ); The default name consists of the table name, the column name, and the keyword "check". In this example, that would be "tabonecolumn_col2_check".
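We can verify the generated name by querying the same view that we will use later in this post:
SELECT constraint_name FROM information_schema.check_constraints WHERE table_name = 'tabonecolumn'; --tabonecolumn_col2_check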
Instead of using the default name for a single-column CHECK constraint, it is much better to give the constraint a specific name.
A CHECK constraint can only refer to one row of a table. We cannot use aggregate functions, because that would violate this limitation. ALTER TABLE tabProducts ADD CONSTRAINT OverFlow CHECK ( SUM( Qty ) < 1000 ); This also means that a CHECK constraint cannot refer to tables other than the table on which it is defined.
A simple check on a single column has minimal impact on performance. We should avoid complex check conditions.
CHECK constraints cannot check null values. This CHECK constraint will always return null, because nothing can be compared with a null.
ALTER TABLE tabOneColumn ADD CONSTRAINT ZZZ CHECK ( Col1 <> null );
CHECK will prevent the entry of a row only if the condition returns FALSE, so this constraint is useless. This INSERT statement will always succeed.
INSERT INTO tabOneColumn ( Col1, Col2 ) VALUES ( null, 33 );
SELECT * FROM tabOneColumn;
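If the goal is to actually reject NULLs, the condition must be null-safe. A minimal sketch ( the constraint name "NoNullCol1" is made up ):
ALTER TABLE tabOneColumn ADD CONSTRAINT NoNullCol1 CHECK ( Col1 IS NOT NULL ); --IS NOT NULL returns TRUE or FALSE, never null
Declaring the column as NOT NULL would achieve the same effect.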
How to Delete CHECK Constraints?
We can find our CHECK constraints in the view "information_schema.check_constraints". SELECT * FROM information_schema.check_constraints WHERE table_name = 'tabonecolumn';
We can delete constraints like this:
1) We can delete just the CHECK constraint.
2) We can delete the whole table.
ALTER TABLE tabonecolumn DROP CONSTRAINT zzz;
DROP TABLE tabOneColumnNamed;
Now, our constraints are gone.
SELECT * FROM information_schema.check_constraints WHERE table_name = 'tabOneColumnNamed' or constraint_name = 'zzz';
LIKE and ILIKE Operators
LIKE is a pattern-matching operator. It can help us find patterns in text. LIKE is based on two wildcard characters. The percent sign "%" matches any sequence of characters, including an empty one. The underscore "_" matches exactly one character. Here are some examples:
SELECT 'zzz' LIKE '%'; --true
SELECT 'zzz' LIKE '___'; --true
SELECT 'Azz' LIKE 'A%'; --true
SELECT 'AzzA' LIKE '%zz%'; --true
SELECT 'AzzA' LIKE 'A__A'; --true
SELECT 'Azz' LIKE 'Azz_'; --false
ILIKE is the case-insensitive version.
SELECT 'fff' LIKE 'F_F'; --false
SELECT 'fff' ILIKE 'F_F'; --true
We can negate LIKE with NOT.
SELECT 'M' NOT LIKE 'M'; --false
SELECT 'M' LIKE 'M'; --true
The default escape character is the backslash "\".
SELECT '%_' LIKE '\%\_'; --true
We can declare any other character to be the ESCAPE character.
SELECT '%_' LIKE '#%#_' ESCAPE '#'; --true
Function regexp_replace
For more complex patterns, we can use the regexp_replace function. This function accepts the original string, a search pattern, and a string that will replace the found pattern.
We will replace "m", followed by two letters, with the "z".
SELECT regexp_replace( 'maaam', 'm.{2}', 'z' );
This function is case sensitive, unless we use the fourth, optional argument.
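A small sketch of that fourth argument, assuming the 'i' ( case-insensitive ) flag:
SELECT regexp_replace( 'MAAAM', 'm.{2}', 'z' ); --no match, returns 'MAAAM'
SELECT regexp_replace( 'MAAAM', 'm.{2}', 'z', 'i' ); --matches 'MAA', returns 'zAM'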
We will examine how TRUNCATE can affect the SERIAL data type. CREATE TABLE Students ( ID SERIAL, Name VARCHAR(10) ); INSERT INTO students(name) VALUES ('Wiremu'), ('Hinewai');
SELECT * FROM Students;
We will truncate this table, but we will continue the sequence. TRUNCATE students CONTINUE IDENTITY; INSERT INTO students(name) VALUES ('Wiremu'), ('Hinewai');
SELECT * FROM Students;
The default behavior is to restart the sequence. TRUNCATE students RESTART IDENTITY; INSERT INTO students(name) VALUES ('Wiremu'), ('Hinewai');
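SELECT * FROM Students;
This time the IDs start from 1 again: the first SELECT shows IDs 1 and 2, the one after CONTINUE IDENTITY shows IDs 3 and 4, and this last one shows IDs 1 and 2 once more.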
We will continue using the green and blue servers that we created in this post => link ( or you can watch the video on YouTube => link ). That was the post about distributed query processing.
This is how the dialog will look. When you click on the "Test" button, you will get an error: Invalid JDBC URL: jdbc:monetdb://192.168.100.146:50000/DatabaseB
This error occurs because DBeaver is trying to validate the "jdbc:monetdb://192.168.100.146:50000/DatabaseB" URL, or it is trying to add some hidden parameters or escape characters.
The solution is to switch the dialog to manual input.
DBeaver will now use the raw URL. After we click on the "Test" button, we will get a confirmation dialog.
We can now expand our connection, where we will find all of the schemas and tables, and we can run queries.
Recursive Common Table Expressions
In the blog post about common table expressions ( link ), I unintentionally lied that MonetDB doesn't support recursive CTEs. It does, and I will explain them now.
What is Recursion
When we need to take a shower, it is always difficult to set the right water temperature. To find it, we repeat the same action. We move the lever left and right until we find the perfect spot. With each movement of the lever, we get closer to the desired temperature.
Recursion is an iterative process of finding a solution. We repeat the same logic each time, but each time we are closer to the solution because we can base our logic on the information we have gained during previous iterations.
Let's look at this example. Our friend has imagined a number between 1 and 10. We have to guess that number with the minimal number of questions. The best approach is binary search: an elimination process where each question removes half of the remaining numbers.
Question: Is it bigger than 5?
Answer: Yes, it is.
Question: Is it bigger than 8?
Answer: No, it is not.
Question: Is it smaller than 7?
Answer: Yes, it is.
It has to be 6.
The only argument in our algorithm is the range of the possible numbers.
Our logic has two steps: 1) Does the range of possible numbers contain only one number? 2) If it doesn't, ask a question to eliminate half of the numbers, reducing the range of possible numbers by half.
We can pack the second step into a function named "EliminateHalf". This function will return the range of the remaining possible numbers. We will call this function 3 times.
We can nest these functions: EliminateHalf( EliminateHalf( EliminateHalf( 1,2,3,4,5,6,7,8,9,10 ) ) ) = 6
The problem is that we don't know in advance how many nested calls we need. Here is pseudocode that nests as many calls as needed to get the final result.
Result = EliminateHalf( 1,2,3,4,5,6,7,8,9,10 )   # initial state
Do Until Count( Result ) = 1                     # are we finished?
    Result = EliminateHalf( Result )             # if not, continue
Loop
This is RECURSION. We broke the complex problem into small steps. Each step has the same logic. Each step is using arguments that are the result of the previous step. This is just one iterative process which brings us closer to the solution with each step.
Here is an example from real life. The coach of a football team analyzes data from the previous game. After each analysis, he changes his team's game. He continues this practice until his team starts winning.
The Structure of Recursion
The recursion structure always has four steps. The first step, the "initial state", is the problem that we want to solve. We will solve it by improving our statistics.
1) Set initial values for our arguments.         # initial state
2) Has our goal been achieved?                   # are we there yet (recursive part)
3) Improve our arguments by using some strategy. # continue with effort (recursive part)
4) Repeat steps 2 and 3 until we reach the goal. # be persistent
Linear and Tree Recursion
In linear recursion we only have two possible outcomes: we are either satisfied with the result, or we continue with our effort. For example, the coach can be satisfied with his team, or he can continue introducing improvements.
Tree recursion is when we have several possible strategies to direct our effort. For example, the coach can change the team's game, or he can look for a position with some other team. If we create a diagram of his possible actions, we get something like this:
1) Red path: Be satisfied with Real, then transfer to Arsenal and start improving Arsenal's game. 2) Orange path: Be satisfied with Real, then transfer to Arsenal, then to Barcelona, and start improving Barcelona's game. 3) Purple path: Leave Real, improve Liverpool for a while, and then transfer to Juventus to improve Juventus's game.
We can present this coach's career as folders and subfolders. Each time he improves his team, he goes deeper into the subfolders ( Real → Real+ ). Each time he changes his job, he moves to another folder at the same level ( Real+ → Juventus ).
This is why this kind of recursion is called tree recursion: we have a hierarchy. This hierarchy is a reflection of the strategic decisions we have to make to reach our goal. Each decision moves us to another path, but we always use the same logic to decide.
Structure of The Recursive Common Table Expression
WITH RECURSIVE cte_name AS (
    SELECT ...          --initial state
    UNION ALL
    SELECT ...          --continue with improvements
    FROM cte_name       --get the previous state
    WHERE ...           --are we there yet
)
SELECT * FROM cte_name; --return result
In its simplest form, a recursive CTE has two SELECT statements connected with UNION ALL. The first SELECT statement defines the initial state. The recursive CTE will return all interim results connected with UNION ALL. The second SELECT statement calculates the new state; it references the previous state by the name of the CTE. The WHERE clause in the second SELECT statement tells us when to stop.
WITH RECURSIVE numbers(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 4
)
SELECT * FROM numbers;
The first SELECT statement is called the ANCHOR member. The second one is the RECURSIVE member. The ANCHOR member will be executed only once. In this example, the RECURSIVE member will be executed several times, each time with a different value of "n", until the condition becomes false.
We will apply UNION ALL on all of the interim results.
WITH RECURSIVE factorial(n, fact) AS (
    SELECT 1, 1
    UNION ALL
    SELECT n + 1, fact * (n + 1) FROM factorial WHERE n < 5
)
SELECT n, fact FROM factorial;
This is an example of how to calculate the factorial.
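Each iteration contributes one row, so the query returns:
n | fact
1 | 1
2 | 2
3 | 6
4 | 24
5 | 120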
These two examples show linear recursion.
Tree Recursion
This is our hierarchy.
Our hierarchy can be presented with a self-referencing table.
We'll create such a table in MonetDB.
CREATE TABLE Folders ( Folder TEXT, Parent TEXT );
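The folder data itself is shown as an image in the post; here are hypothetical rows that would produce a three-level hierarchy:
INSERT INTO Folders VALUES ( 'Top', null ), ( 'Mid', 'Top' ), ( 'Leaf', 'Mid' ); --'Top' is a root folder, so its Parent is null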
We'll make our CTE:
WITH RECURSIVE Folder_paths AS (
    SELECT Folder, Folder AS Path
    FROM Folders
    WHERE Parent IS NULL
    UNION ALL
    SELECT f.Folder, fp.Path || '>' || f.Folder AS Path
    FROM Folders f
    JOIN Folder_paths fp ON f.Parent = fp.Folder
)
SELECT * FROM Folder_paths ORDER BY Path;
This is the initial state (1). It represents the folders at the top level. We'll join the original Folders table with this initial state. That will give us our new state (2). We'll do it again, between the Folders table and the new state (3). This is an inner join. We can now make the final result as a union between (1), (2), (3).
Tree recursion occurs when the ANCHOR and RECURSIVE members are SELECT statements that return tables with several rows. Those rows represent folders at the same level. We have (1), (2), (3) for the top (1), middle (2), and bottom (3) folders. First, we get the top folders (initial state), then the middle folders (first recursion), and then the bottom folders (second recursion). Each recursion collects the folders from the level below.
Recursive CTEs Caveats
1) The recursive and anchor members must match in number of columns and data types.
2) If we use UNION ( instead of UNION ALL ), recursion will stop as soon as we produce a duplicate. The statement below ends immediately, although it doesn't have a WHERE clause. This is because the first recursion returns the number 1, which is a duplicate. The UNION operator removes that duplicate, so our result has only one row, containing the original number 1.
WITH RECURSIVE numbers AS (
    SELECT 1
    UNION
    SELECT 1 FROM numbers
)
SELECT * FROM numbers;
3) Don't use OUTER JOINs in the recursive member. The query will never end. Only an INNER join is acceptable.
4) MonetDB will not complain if we use aggregate or window functions in the recursive member. We can also use DISTINCT and GROUP BY in the recursive member. In MonetDB, we can use the CTE's name in the FROM clause, but we can also use it in a subquery. Some other servers don't allow this.
New Functions
DayName and MonthName Functions
SELECT DAYNAME('2025-07-12');
Saturday
This function returns the name of the day of the week, according to the current locale set in the OS.
SELECT MONTHNAME('2025-07-12');
July
This is a similar function that returns the name of the month.
Besides date arguments, we can also use a timestamp ( '1987-09-23 11:40' ) or a timestamp with a time zone ( '15:35:02.002345+01:00' ).
Generate Series Functions
SELECT * FROM generate_series(1,9,2);
1,3,5,7
This function will return numbers from 1 up to 9 ( the upper bound is excluded ), with step 2. The default step is 1.
SELECT * FROM generate_series('2025-01-01','2025-01-10',INTERVAL '5' DAY);
2025-01-01,2025-01-06
All dates from the range, with a step of 5 days.
SELECT * FROM generate_series('2025-01-01','2025-05-10',INTERVAL '2' MONTH);
2025-01-01,2025-03-01
We can also get the months with a step of 2 months.
We can list seconds or days between two timestamps.
SELECT * FROM generate_series('2025-01-01 01:40:00','2025-01-01 1:40:05', INTERVAL '3' SECOND);
2025-01-01 01:40:00, 2025-01-01 01:40:03
SELECT * FROM generate_series('2025-01-01 01:40:00','2025-01-06 1:40:05', INTERVAL '3' DAY);
2025-01-01 01:40:00, 2025-01-04 01:40:00
The generate_series functions return a column with the name "value". SELECT * FROM generate_series( 1, 7, 2 ); SELECT * FROM generate_series( '2025-01-01', '2025-01-10', INTERVAL '5' DAY );
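Because the result column is always named "value", we can alias it like any other column:
SELECT value AS Number FROM generate_series( 1, 7, 2 );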
We will continue using the green and blue databases that we created in this post => link ( or you can watch the video on YouTube => link ). That was the post about distributed query processing.
"CSV" file is using commas. It also has commas at end of the rows with data. This is the only file that doesn't have file format extension.
"TSV.tsv" file is using tab as delimiter, but it also has double quotes around the strings.
"PSV.psv" file is using pipes and has a null in the "Letter" column.
The files with the ".gz" file format extension are just the PSV file compressed.
The files must be placed on the server. The returned value of the "file_loader" function is a virtual table.
We don't specify delimiters, wrappers, or data types for the files. They are deduced automatically.
We can read from CSV, TSV and PSV files, and also from ".gz", ".lz4", ".bz2" and ".xz" files.
file_loader Function Syntax
SELECT * FROM file_loader( '/home/abhishek/sample.csv' );
The only argument of the function is the full path to the file.
SELECT * FROM '/home/abhishek/sample.csv';
The shorter syntax is much better; we don't have to type the function name.
Experiments With the CSV File
We cannot read from files that don't have a file format extension. SELECT * FROM '/home/sima/Desktop/CSV';
I will add a file format extension and then try again. SELECT * FROM '/home/sima/Desktop/CSV.csv';
Our CSV file is now reduced to only one column. Each whole row of data from the CSV file is imported into one column. This is the consequence of the commas at the end of the rows.
I will remove the surplus commas. Now our query works as expected.
Experiments With the TSV File
Import from the TSV file will work fine, but the double-quote wrappers around the strings will be considered part of the string. SELECT * FROM '/home/sima/Desktop/TSV.tsv';
We can try to trim the double quotes, but our column will not be recognized. SELECT TRIM( Letter, '"' ) FROM '/home/sima/Desktop/TSV.tsv';
Column names are case sensitive, so we have to place the column name inside double quotes. Only then will our query work. SELECT TRIM( "Letter", '"' ) FROM '/home/sima/Desktop/TSV.tsv';
Experiments With the PSV File
In the PSV file we have a null. This null will be imported, but we don't know whether it is a null or the string "null". SELECT * FROM '/home/sima/Desktop/PSV.psv';
We will test the null value. We can see that "null" is a string. SELECT ISNULL( "Letter" ) FROM '/home/sima/Desktop/PSV.psv';
Experiments With the GZ files
We can easily read from the compressed PSV.psv file. SELECT * FROM '/home/sima/Desktop/PSV.psv.gz';
If we compress our file in the "tar" tape archive format, the file loader will not work. SELECT * FROM '/home/sima/Desktop/PSV.psv.tar.gz';
Conclusion
We can conclude that the file_loader function is not as versatile as the COPY INTO statement, which is described in this blog post => link.
Read From Remote Database With a proto_loader Function
We have seen that we can log in to a MonetDB server that is on another computer. We also saw how to create remote tables. This time we will see how to read tables on another computer/server ad hoc.
Testing Local MonetDB Server
This syntax also works on the local MonetDB server. We can use localhost to read from the local computer. I can run this statement on the green server. Recall that the local database, schema and table names are "DatabaseG", "schemagb" and "dimg". SELECT ProdID FROM proto_loader( 'monetdb://127.0.0.1:50000/DatabaseG/schemagb/dimg' );
This IP address and port number are the defaults, so we can omit them from the link. Notice that the database name must keep its original capitalization, but the schema and table names must be lowercase. SELECT * FROM proto_loader( 'monetdb:///DatabaseG/schemagb/dimg' );
We can also use the link that has "mapi:" in front of it. In this case, we have to provide the IP and port number.
SELECT * FROM proto_loader( 'mapi:monetdb://127.0.0.1:50000/DatabaseG/schemagb/dimg' );
Testing Remote MonetDB Server
More interesting is the ability to read tables from a remote server. I will read a table from the blue server ( before that, please start the blue server ).
This statement will be executed on the green server and will return the whole table from the blue server. SELECT * FROM proto_loader( 'monetdb://192.168.100.146:50000/DatabaseB/schemagb/factb' );
Syntax with "mapi:" will also work on the remote server.
SELECT * FROM proto_loader( 'mapi:monetdb://192.168.100.146:50000/DatabaseB/schemagb/dimb' );
Creation Of a Remote Table
We can test whether we can create a remote table using the syntax that starts with "monetdb://". On the blue server, I will change the current schema and then create one table.
SET SCHEMA SchemaGB; CREATE TABLE Test( Number INT );
On the green server, I will set the current schema to SchemaGB and try to connect to the Test table.
SET SCHEMA SchemaGB;
CREATE REMOTE TABLE Test( Number INT ) ON 'monetdb://192.168.100.146:50000/DatabaseB';
This will successfully create a remote table.
SELECT * FROM Test;
Connect to Any ODBC Database From MonetDB
ODBC Driver Manager
An ODBC driver manager is a middleware component that acts as an intermediary between applications and ODBC drivers, translating standardized ODBC API calls into driver-specific instructions. It allows applications to connect to different databases without needing to be rewritten for each one.
On the green server, we will install an ODBC driver manager. First, check whether you already have one installed: just type "odbcinst" in the shell.
I don't have it, so I will install the "unixodbc" driver manager. After that, the "odbcinst" command will show that the program is working. sudo apt install unixodbc
This program will add the file "odbc.ini" to the "/etc" folder. This file is empty.
/etc/odbc.ini
MonetDB ODBC Driver
We can connect to any ODBC-capable server, but we will use this opportunity to see how to connect to a MonetDB server. We will use the MonetDB ODBC driver to connect to the blue server.
This is the ODBC driver we need. We install it on the green server.
sudo apt install libmonetdb-client-odbc
This driver will add the "odbcinst.ini" file to the "/etc" folder.
/etc/odbcinst.ini
The driver itself will be in this folder:
/usr/lib/x86_64-linux-gnu/libMonetODBC.so
We can see a problem. Our file is in the "x86_64-linux-gnu" folder, but our "odbcinst.ini" file points to the "/lib" folder. We must fix that. In the shell, we run this command to start a text editor as sudo, and we fix the path: sudo gedit /etc/odbcinst.ini
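After the fix, the driver entry should look roughly like this ( a sketch; your packaged file may contain more keys ):
[MonetDB]
Description = MonetDB ODBC driver
Driver = /usr/lib/x86_64-linux-gnu/libMonetODBC.so
The section name "MonetDB" is what the "Driver = MonetDB" line in the "odbc.ini" below will refer to.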
Testing ODBC Driver
First, we will add our credentials to "/etc/odbc.ini" file.
sudo nano /etc/odbc.ini
Don't use "gedit" graphical text editor program. Use nano. Only then it will work.
[DatabaseB]
Description = Connect to Blue Server
Driver = MonetDB
Server = 192.168.100.146
Port = 50000
Database = DatabaseB
User = monetdb
Password = monetdb
"isql" is console program that comes with "unixODBC". We can use it to test ODBC connection toward the blue server. We can connect to the blue server if we type this into green server shell:
isql -v DatabaseB
After connection, we can set our schema, and we can query our table:
SET SCHEMA SchemaGB; SELECT * FROM factb;
proto_loader Function For ODBC
But our goal is to use the "proto_loader" function to fetch data over ODBC from the blue server directly into the MonetDB server on the green computer. For that, we will install one more package.
sudo apt install monetdb-odbc-loader
Using the ODBC loader is still experimental, so this functionality is NOT turned on by default. We will now turn it on. First, we exit "DatabaseG".
quit
monetdb stop DatabaseG
monetdb get all DatabaseG
We can now read the properties of DatabaseG. The property of interest is the last one: loadmodules.
We will change this property like this:
monetdb set loadmodules=odbc_loader DatabaseG
We can always turn it back like this:
monetdb set loadmodules= DatabaseG
We will now log in to our database. This will automatically start the server, and during startup the "odbc_loader" module will be loaded.
mclient -u monetdb -d DatabaseG    # password is "monetdb"
Using proto_loader Function For ODBC
This is how we run an ODBC query directly from mclient. In this way we can connect to any ODBC-capable server, not just another MonetDB. We just have to prepare the ODBC driver for that other server. SELECT * FROM proto_loader('odbc:DSN=DatabaseB;QUERY=SELECT * FROM schemagb.factb');
All the magic is on the green server. The "proto_loader" function will use the "odbc_loader" module (1) to send a standardized ODBC message to the "unixODBC" manager (2). This manager will use the MonetDB ODBC driver (3) to call the blue server (4).
It is also possible to provide all of the necessary parameters directly inside the ODBC connection string: SELECT * FROM proto_loader('odbc:DRIVER=/usr/lib/x86_64-linux-gnu/libMonetODBC.so;SERVER=192.168.100.146;PORT=50000;DATABASE=DatabaseB;UID=monetdb;PWD=monetdb;QUERY=SELECT * FROM schemagb.factb');
There is also a version that uses a DSN file. This version is for Windows only. odbc:FILEDSN=<data source name>;[<ODBC connection parameters>;]QUERY=<SQL query>
Virtual Tables
Virtual tables are tables whose data is not physically stored in a MonetDB table. Views, merge tables, and remote tables are virtual tables. The tables that we receive through the file_loader and proto_loader functions are also virtual tables. We will now see how to transform file_loader and proto_loader virtual tables into more permanent structures.
CREATE TABLE Based on the Loader Function
We can use CREATE TABLE AS to store the CSV file in a new table: CREATE TABLE permanentCSV ( Number, Letter ) AS ( SELECT * FROM '/home/sima/Desktop/CSV.csv' );
Next, we will read from the permanentCSV table: SELECT * FROM permanentCSV;
CREATE TEMPORARY TABLE Based on the Loader Function
We can also make a temporary table.
CREATE LOCAL TEMPORARY TABLE temporaryFactB ( YearNum, Dates, ProdID, Qty )
AS ( SELECT * FROM proto_loader('odbc:DSN=DatabaseB;QUERY=SELECT * FROM schemagb.factb') )
WITH DATA ON COMMIT PRESERVE ROWS;
We can read from this temporary table.
SELECT * FROM temporaryFactB;
Bulk INSERT Based on the Loader Function
We can pull data from any other ODBC-capable server into our temporaryFactB table. First, we empty it:
TRUNCATE temporaryFactB;
INSERT INTO temporaryFactB ( YearNum, Dates, ProdID, Qty ) SELECT * FROM proto_loader('odbc:DRIVER=/usr/lib/x86_64-linux-gnu/libMonetODBC.so;SERVER=192.168.100.146;PORT=50000;DATABASE=DatabaseB;UID=monetdb;PWD=monetdb;QUERY=SELECT * FROM schemagb.factb');
SELECT * FROM temporaryFactB;
COPY command
We already talked about COPY INTO and COPY FROM statements ( blog1 and blog2; youtube1 and youtube2 ). We will now see some special syntaxes of these commands.
COPY FROM stdin
We will first create one empty table.
CREATE TABLE tabStdin( Number INT, Letter CHAR );
sql> COPY INTO tabStdin FROM STDIN;
more> 1|a
more> 2|b
more>
2 affected rows
sql>
We run this statement to copy data directly from the mclient. After this statement, the prompt becomes "more>". That means that MonetDB is expecting more data. We type the first row "1|a", then the second row "2|b", and for the third row we enter nothing. After we press Enter, we exit and the prompt becomes "sql>" again.
We can now read from our table. The values that we typed after the "more>" prompt are inside the table. The pipe was used as a delimiter between columns. SELECT * FROM tabStdin;
sql> COPY 2 RECORDS INTO tabStdin FROM STDIN;
more> 3|c
more> 4|d
2 affected rows
sql>
If we provide the number of rows that we will COPY, the input ends automatically after that many rows; we don't need to finish with an empty line.
When importing a CSV file, we can specify which symbols are used as the decimal and thousands separators. Let's say that we have a CSV file where the decimal point is the star "*" and the thousands separator is the underscore "_".
I will create one table that accepts DECIMAL numbers. CREATE TABLE tabDecimal ( Number DECIMAL, Letter CHAR );
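The post doesn't show the file itself; here is a hypothetical '/home/sima/Desktop/CSV_file' that would match the statement below ( OFFSET 2 skips the header row, and the pipe is the default column delimiter ):
Number|Letter
1_234*56|a
9_876*54|b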
COPY OFFSET 2 INTO tabDecimal FROM '/home/sima/Desktop/CSV_file'( Number, Letter ) DECIMAL AS '*','_';
With the DECIMAL AS clause, we specify which decimal point and thousands separator our CSV uses.
If we now read from our table, we will find the numbers correctly recognized.
In MonetDB, the default decimal point is a dot. There is no default sign for the thousands separator.
When we create a pivot table from the sample table, we will see all of the detail sales (1,2,4,8,16), but we will also see the totals (3,12,16,31).
The question is what query would return all of these numbers, both detail values and totals, if we use MonetDB.
This is one possible solution:
SELECT Country, State, Sales FROM tabSales
UNION ALL
SELECT Country, null, SUM( Sales ) FROM tabSales GROUP BY Country
UNION ALL
SELECT null, null, SUM( Sales ) FROM tabSales;
On the image, the rows are sorted so that the table looks like the pivot table.
The UNION ALL solution is bad for several reasons: 1) We have three queries to execute, and then we have to combine multiple result sets into one. 2) A long UNION ALL query is hard to read and modify. 3) We have to be careful to properly align the columns.
This is the problem that can be solved by grouping sets.
Grouping Sets
"Grouping Sets" are much better and faster syntax to achieve the same goal.
SELECT Country, State, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( Country, State, ( ) );
Empty parentheses are for the grand total.
We'll get the same result, except that the repetition of country names is reduced; instead of them we have nulls.
Look at what we get if we place parentheses around Country and State.
SELECT Country, State, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( ( Country, State ), () );
Parentheses are there to define each group.
We can see the effect of parentheses better on this example.
SELECT Continent, Subcontinent, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( ( Continent, Subcontinent ), Continent );
Continent can be used by itself, but it can also be used in conjunction with Subcontinent to define a group.
It is now clear that each element inside GROUPING SETS is a separate definition of a group. Each group can be defined by one column, like Continent, or by several columns placed inside parentheses, like ( Continent, Subcontinent ).
These two examples, which return the same result, show the logic and brevity of grouping sets.
SELECT Col1, Col2, SUM( Sales ) FROM Table GROUP BY GROUPING SETS ( ( Col1, Col2 ), Col1 );
SELECT Col1, Col2, Sales FROM Table UNION ALL SELECT Col1, null, SUM( Sales ) FROM Table GROUP BY Col1;
Rollup
SELECT Continent, Subcontinent, Country, SUM( Sales ) FROM tabSales GROUP BY ROLLUP( Continent, Subcontinent, Country );
ROLLUP( a, b, c ) is the same as the grouping sets "( a, b, c ), ( a, b ), ( a ), ( )". This is a way to get a hierarchy of the columns. ROLLUP will give us all of the numbers that we need to create a pivot table.
For ROLLUP, the order of the columns is important.
ROLLUP( a, b, c )    ROLLUP( c, b, a )
( a, b, c )          ( c, b, a )
( a, b )             ( c, b )
( a )                ( c )
( )                  ( )
Similar to GROUPING SETS, ROLLUP can also create combinations of columns by using parentheses.
SELECT Subcontinent, Country, State, SUM( Sales ) FROM tabSales GROUP BY ROLLUP( ( Subcontinent, Country ), State );
CUBE
CUBE works similarly to ROLLUP, but it has different logic. CUBE will give us all of the possible combinations. CUBE( a, b, c ) will give us 2^3 grouping sets: "(a,b,c), (a,b), (a,c), (b,c), (a), (b), (c) and ()".
SELECT Subcontinent, Country, SUM( Sales ) FROM tabSales GROUP BY CUBE( Subcontinent, Country );
Because we have only 2 columns inside CUBE in our example, the number of combinations is 2^2 = 4: "(a,b), (b), (a), ()".
We can also define groups by using parentheses. SELECT Subcontinent, Country, SUM( Sales ) FROM tabSales GROUP BY CUBE( ( Subcontinent, Country ) ); We now have only one element, so we get only two ( 2^1 ) groups: "( Subcontinent, Country )" and "( )".
Addition and Multiplication in Grouping Sets
This is addition: ( a, b ) + ( c ) = ( a, b ) ( c )
This is multiplication. Multiplication is a cross join between the individual grouping sets:

( a )      ( b, c )
 a1         bc1           a1bc1   a1bc2   a1bc3
 a2    *    bc2     =     a2bc1   a2bc2   a2bc3
            bc3
The syntax for addition is like this: everything inside the GROUPING SETS parentheses is added together. In this example, we add ( Subcontinent ) + ( Country ) + ( ), because ROLLUP( Country ) expands to ( Country ), ( ).
SELECT Subcontinent, Country, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( Subcontinent, ROLLUP( Country ) );
So, if we create GROUPING SETS like this, it is addition: GROUPING SETS ( Continent, ROLLUP( Continent, Subcontinent ), CUBE( Country, State ), ( ) )
Addition can easily create duplicates: SELECT Continent, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( CUBE( Continent ), () );
CUBE will create a grand total, but we will also get a grand total from the "( )" element.
This is the syntax for multiplication. This time we have commas between the GROUPING SETS, ROLLUPs, CUBEs, and individual elements: GROUPING SETS ( Continent ), ROLLUP( Continent, Subcontinent ), CUBE( Country, State ), ( ), Country
This example will give us 2 x 2 = 4 rows. ROLLUP will give us "America" and "( )". GROUPING SETS will give us "North" and "Central". Then we combine them: 2 x 2.
SELECT Continent, Subcontinent, SUM( Sales ) FROM tabSales GROUP BY ROLLUP( Continent ), GROUPING SETS ( Subcontinent );
Multiplication can also easily create duplicates: SELECT Continent, SUM( Sales ) FROM tabSales GROUP BY GROUPING SETS ( Continent, () ), GROUPING SETS ( Continent, () );
Indicator Function – GROUPING
The GROUPING function tells us which rows are subtotals or the grand total. In such rows, some columns have nulls because they are consolidated. The GROUPING function takes a column as its argument and returns the result only for that column: 1 if the column is consolidated in that row, and 0 otherwise.
SELECT Continent, GROUPING( Continent ) AS ContiGroup, Subcontinent, GROUPING( Subcontinent ) AS SubcontiGroup , SUM( Sales ) FROM tabSales GROUP BY ROLLUP ( Continent, Subcontinent );
This function is important because it helps us distinguish between subtotal nulls and missing-data nulls.
Formatting with COALESCE and Sort
This is not a good-looking table. Let's fix it.
SELECT Subcontinent, State, SUM( Sales ) FROM tabSales GROUP BY ROLLUP( Subcontinent, State );
COALESCE will help us eliminate the NULLs: SELECT COALESCE( Subcontinent, 'Grand' ) AS Subcontinent , COALESCE( State, 'Total' ) AS State , SUM( Sales ) AS Sales FROM tabSales GROUP BY ROLLUP( Subcontinent, State );
With GROUPING function, we can create columns that will help us to sort the table.
SELECT COALESCE( Subcontinent, 'Grand' ) AS Subcontinent , COALESCE( State, 'Total' ) AS State , SUM( Sales ) AS Sales , GROUPING( Subcontinent ) AS SubcSort , GROUPING( State ) AS StateSort FROM tabSales GROUP BY ROLLUP( Subcontinent, State ) ORDER BY SubcSort, Subcontinent, StateSort;
These auxiliary columns ( SubcSort and StateSort ) can be easily eliminated by wrapping everything in an outer "SELECT Subcontinent, State, Sales" query.
Comments
Sample Table and Function
Let's create two tables and a function.
CREATE TABLE tabComment( Number INTEGER ); CREATE TEMPORARY TABLE tabTemporary( Number INTEGER );
CREATE OR REPLACE FUNCTION funcComment( Arg1 INTEGER ) RETURNS INTEGER BEGIN RETURN 2; END;
Comments on Database Objects
We can create comments that are tied to database objects. Comments convey information about those objects.
COMMENT ON TABLE tabComment IS 'tabComment description';
COMMENT ON COLUMN tabComment.Number IS 'Number column description';
COMMENT ON FUNCTION funcComment IS 'funcComment description';
COMMENT ON SCHEMA sys IS 'sys schema description';
We will then find the IDs of our database objects:
SELECT * FROM sys.tables WHERE name = 'tabcomment'; --15876
SELECT * FROM sys.columns WHERE table_id = 15876; --15875
SELECT * FROM sys.functions WHERE name = 'funccomment'; --15881
SELECT * FROM sys.schemas WHERE name = 'sys'; --2000
All of these IDs can be found in the system table "sys.comments" together with their comments. SELECT * FROM sys.comments WHERE Id IN ( 15876, 15875, 15881, 2000 );
Deleting a Comment
If we delete an object, its comment will be deleted. DROP TABLE tabComment; SELECT * FROM sys.comments WHERE Id = 15876;
We can delete a comment by setting it to NULL or an empty string. COMMENT ON SCHEMA sys IS null; SELECT * FROM sys.comments WHERE Id = 2000;
If a function is overloaded, we have to provide the full signature. COMMENT ON FUNCTION funcComment( INTEGER ) IS ''; SELECT * FROM sys.comments WHERE Id = 15881;
Persistent Database Objects
There are other database objects that we can place a comment on. They are all persistent database objects.
COMMENT ON VIEW view_name IS 'Comment';
COMMENT ON INDEX index_name IS 'Comment';
COMMENT ON SEQUENCE sequence_name IS 'Comment';
COMMENT ON PROCEDURE procedure_name IS 'Comment';
COMMENT ON AGGREGATE aggregate_name IS 'Comment';
COMMENT ON LOADER loader_name IS 'Comment';
We cannot create a comment on a temporary object. This statement will fail: COMMENT ON TABLE tabTemporary IS 'tabTemporary description';
Note: Before reading this blog post, you should read the article about merge tables in MonetDB ( article ), or you can watch it on YouTube ( video ). Advice: I suggest you follow this blog post as strict instructions. Any freestyling could mean problems creating the remote tables.
Visual Presentation
In this case, we will use two MonetDB instances placed on two different computers. Both of them will have a dimension table with products, and that table will have the same data on both computers ( DimG and DimB on the image ). I will use (G)reen and (B)lue colors to differentiate the servers.
The fact table will be divided into two parts. FactG will be on the green server, and FactB will be on the blue server. They will have different data.
The next step is to create remote tables. Remote tables are references on one server to tables on another server. Their purpose is to let each server know about all of the tables in the system and how those tables are connected. This allows each server to create a query execution plan, even when the query uses tables from both servers.
Green server                          Blue server
Table: DimG    Remote table: DimB     Table: DimB    Remote table: DimG
Table: FactG   Remote table: FactB    Table: FactB   Remote table: FactG
As a user, we can now connect to one server (it doesn't matter which one) and we can execute a query that will use tables from both servers.
The computer we connect to will create an execution plan for the query. This plan will usually assume that the intermediate results (1, 2) are calculated on each server. That way we can divide the processing between the servers and increase parallelism.
The intermediate results will then be collected by the master computer (the one that creates the execution plan) and transformed into the final result (3).
The whole purpose of distributed queries is to divide the work between the computers in the cluster and increase performance.
In this part, we will create a MonetDB folder and one database, and we will log in as administrator.
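The post shows this step as screenshots; a minimal sketch of the shell commands, assuming the farm path that appears later in this post:
monetdbd create /home/sima/monetdb/DBfarmG    # create the MonetDB folder (farm)
monetdbd start /home/sima/monetdb/DBfarmG     # start the daemon
monetdb create DatabaseG                      # create the database
monetdb release DatabaseG                     # take it out of maintenance mode
mclient -u monetdb -d DatabaseG               # log in as administrator; the default password is "monetdb"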
As admin, we will create a new role, schema and user. The schema will have that role as authorization. The user will have that schema as his default schema, and the role as his default role. That means the user will be able to use all of the objects in this schema.
CREATE ROLE RoleGB; CREATE SCHEMA SchemaGB AUTHORIZATION RoleGB; CREATE USER UserGB WITH PASSWORD 'gb' NAME 'Distributed User' SCHEMA SchemaGB DEFAULT ROLE RoleGB; quit
Letters "GB" mean that this role, schema and user will exist in both the green and the blue server. In order for one user to access tables from both servers, his account must be present in both servers.
Then we will log in as the new user "usergb". We will create two tables, "DimG" and "FactG", and we will fill them with data.
At the start of this blog post, we already saw the content of these tables.
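Those statements are shown as an image; here is a sketch that mirrors the blue server commands later in this post ( the FactG rows are made up, but DimG must contain exactly the same rows as DimB ):
mclient -u usergb -d DatabaseG    # password is "gb"
CREATE TABLE FactG ( YearNum INT, Dates DATE, ProdID INT, Qty INT );
INSERT INTO FactG VALUES (2025, '2025-01-01', 11, 5), (2025, '2025-01-02', 22, 10); --hypothetical rows
CREATE TABLE DimG ( ProdID INT, ProdName VARCHAR(50) );
INSERT INTO DimG VALUES (11, 'product11'), (22, 'product22'), (33, 'product33'); --must match DimB exactly
quit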
monetdbd stop /home/sima/monetdb/DBfarmG
monetdbd get all /home/sima/monetdb/DBfarmG
monetdbd set listenaddr=0.0.0.0 /home/sima/monetdb/DBfarmG
We are back in the shell. I stopped the MonetDB folder because I want to change one setting. That setting is "listenaddr". It is currently "localhost", which means the server is only reachable from the local computer.
We will change that value to "0.0.0.0", which means that anyone can access the MonetDB server.
We will then start our MonetDB folder again: monetdbd start /home/sima/monetdb/DBfarmG
We can test who is listening on port 50000. ss -tulnp | grep 50000
Below you can see some commands that you can use to configure your firewall so that you can control communication between servers. I will not teach you how to manage the firewall. I will just show you how I configured it. If you have any problems with the firewall, you can reset all the rules with "sudo ufw reset", and then you can disable the firewall with "sudo ufw disable".
I will now set the firewall by running these commands in the shell.
"ufw" means "uncomplicated firewall".
sudo ufw enable                  # the firewall will be permanently enabled; it is disabled by default
sudo ufw default deny incoming   # no one can call us
sudo ufw default allow outgoing  # we can call anyone
# Below is the command that opens incoming traffic on port 50000 for the tcp protocol.
# It also limits the range of IP addresses to the IP addresses of my local network.
sudo ufw allow from 192.168.100.0/24 to any port 50000 proto tcp
# I don't know whether your local network uses the same range of IP addresses, so I will delete this rule.
sudo ufw delete allow from 192.168.100.0/24 to any port 50000 proto tcp
# I will create a simpler rule that only limits the port number and tcp protocol.
sudo ufw allow 50000/tcp
sudo ufw status                  # we can check the status of the firewall
This step is just for people who are following this tutorial with Linux virtual machines inside VirtualBox. Go to Settings > Network > Adapter 1, and choose the "Bridged Adapter" option. This will include the virtual machines in the local network. This step is needed so that the two virtual machines can communicate over the network.
Testing Remote Access to the Green Server
In the green server shell, run this command. This is how you can find the IP address of a server. ip addr | grep inet
Now you can go to the blue server, and from there you can run this command:
mclient -h 192.168.100.145 -p 50000 -u usergb -d DatabaseG    # the password is "gb"
This is how we can log in to the remote green server over the network.
Blue Server Setup
There is nothing different in preparing the blue server. I will just repeat the same commands, but with different identifiers.
CREATE ROLE RoleGB; --all identifiers for the privileges are the same as on the green server
CREATE SCHEMA SchemaGB AUTHORIZATION RoleGB;
CREATE USER UserGB WITH PASSWORD 'gb' NAME 'Distributed User' SCHEMA SchemaGB DEFAULT ROLE RoleGB;
quit;
mclient -u usergb -d DatabaseB    # now we log in as the user
CREATE TABLE FactB ( YearNum INT, Dates DATE, ProdID INT, Qty INT );
INSERT INTO FactB VALUES (2026, '2026-01-01', 11, 105), (2026, '2026-01-02', 11, 110),
    (2026, '2026-01-03', 33, 115), (2026, '2026-01-04', 33, 120);
CREATE TABLE DimB ( ProdID INT, ProdName VARCHAR(50) );
INSERT INTO DimB VALUES (11, 'product11'), (22, 'product22'), (33, 'product33');
quit
monetdbd stop /home/sima/monetdb/DBfarmB
monetdbd get all /home/sima/monetdb/DBfarmB
monetdbd set listenaddr=0.0.0.0 /home/sima/monetdb/DBfarmB    # we make the server available from the network
monetdbd start /home/sima/monetdb/DBfarmB
ss -tulnp | grep 50000
sudo ufw enable                  # if needed, we can set the firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 50000/tcp
sudo ufw status
We can log in to the blue server from the green server ( password "gb" ).
mclient -h 192.168.100.146 -p 50000 -u usergb -d DatabaseB
Preparing REMOTE, REPLICA and MERGE Tables in the Green Server
In the green server, we will create REMOTE tables that are just references to tables on the blue server.
mclient -u monetdb -d DatabaseG    # only the admin can create REMOTE tables (password "monetdb")
SET SCHEMA SchemaGB; --we must be in the same schema as the physical tables on the blue server
CREATE REMOTE TABLE DimB( ProdID INT, ProdName VARCHAR(50) ) ON 'mapi:monetdb://192.168.100.146:50000/DatabaseB'; --IP address of the blue server
CREATE REMOTE TABLE FactB( YearNum INT, Dates DATE, ProdID INT, Qty INT ) ON 'mapi:monetdb://192.168.100.146:50000/DatabaseB';
CREATE REPLICA TABLE Dim( ProdID INT, ProdName VARCHAR(50) );
ALTER TABLE Dim ADD TABLE DimG;
ALTER TABLE Dim ADD TABLE DimB;
Tables "DimG" and "DimB" are totally identical. We must notify MonetDB that they are "replicas". For queries MonetDB can use any of these tables interchangeably.
CREATE MERGE TABLE Fact( YearNum INT, Dates DATE, ProdID INT, Qty INT );
ALTER TABLE Fact ADD TABLE FactG;
ALTER TABLE Fact ADD TABLE FactB;
Tables "FactG" and "FactB" are just partitions of the merge table. This is why you should read/watch video about merge tables.
We can now query these two new tables Dim and Fact.
The Dim table is unchanged, but the Fact table is the UNION of FactG and FactB.
SELECT * FROM Dim; SELECT * FROM Fact;
Preparing REMOTE, REPLICA and MERGE Tables in the Blue Server
I will again repeat all of the steps, but this time for the blue server.
mclient -u monetdb -d DatabaseB    # only the admin can create REMOTE tables (password "monetdb")
SET SCHEMA SchemaGB; --we must be in the same schema as the physical tables on the green server
CREATE REMOTE TABLE DimG( ProdID INT, ProdName VARCHAR(50) ) ON 'mapi:monetdb://192.168.100.145:50000/DatabaseG';
CREATE REMOTE TABLE FactG( YearNum INT, Dates DATE, ProdID INT, Qty INT ) ON 'mapi:monetdb://192.168.100.145:50000/DatabaseG';
CREATE REPLICA TABLE Dim( ProdID INT, ProdName VARCHAR(50) );
ALTER TABLE Dim ADD TABLE DimG;
ALTER TABLE Dim ADD TABLE DimB;
For performance reasons, MonetDB does not check if the replica tables are indeed identical. This responsibility is left to the database users.
CREATE MERGE TABLE Fact( YearNum INT, Dates DATE, ProdID INT, Qty INT );
ALTER TABLE Fact ADD TABLE FactG;
ALTER TABLE Fact ADD TABLE FactB;
SELECT * FROM Dim; SELECT * FROM Fact;
Horizontally dividing one table between computers in a cluster is called "sharding".
System Tables
In "sys.tables" we can find all our tables. Based on the system table "sys.table_types" we can see that we have regular (0), remote (5), merge (3) and replica tables (6). SELECT * FROM sys.tables WHERE name IN ( 'factg', 'factb', 'fact', 'dimg', 'dimb', 'dim' );
SELECT * FROM sys.table_types;
The dependencies inside the "fact" and "dim" tables can be seen in the system view "sys.dependencies_vw". SELECT * FROM sys.dependencies_vw WHERE depend_type = 2;
The Fruit of Our Labor
In the green server, I will now log in as "UserGB". This user has privileges over all of the objects in "SchemaGB", so he can query the "Dim" and "Fact" tables.
mclient -u usergb -d DatabaseG    # password "gb"
SELECT ProdName, SUM( Qty ) AS Qty FROM Dim INNER JOIN Fact ON Dim.ProdID = Fact.ProdID GROUP BY ProdName;
Finally, we can run a query that will use the power of the two computers and two MonetDB servers.
It is the same for the blue server.
mclient -u usergb -d DatabaseB    # password "gb"
SELECT ProdName, SUM( Qty ) AS Qty FROM Dim INNER JOIN Fact ON Dim.ProdID = Fact.ProdID GROUP BY ProdName;
We finally have a working distributed query processing system.
Load Balancer
The question is how to divide users between two (or more) servers to equalize the load. The easiest way is to have all users born from January to June use the green server, and all users born from July to December use the blue server.
A more professional way is to use a load balancer, something like HAProxy ( link ). This program will direct users to a server that is currently not under load.
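For illustration, here is a hypothetical HAProxy TCP configuration fragment that would round-robin client connections between our two servers ( the IP addresses are the ones used in this post ):
frontend monetdb_front
    bind *:50000
    mode tcp
    default_backend monetdb_back
backend monetdb_back
    mode tcp
    balance roundrobin
    server green 192.168.100.145:50000 check
    server blue 192.168.100.146:50000 check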