0220 Window Functions Syntax

MonetDB / By Bizkapish / June 29, 2024 July 11, 2024

Before reading this blog post, please read the Window Functions theory.

Function(Expressions) FILTER ( WHERE Predicates ) OVER ( [PARTITION BY Expressions] [ORDER BY Expressions [NULLS FIRST|LAST]] [Frame Type BETWEEN Frame Start AND Frame End] [EXCLUDE Frame Exclusion] ) This is the complete syntax of a Window function. MonetDB
doesn't support FILTER clause, so we will skip that clause.

For EXCLUDE clause, only "EXCLUDE NO OTHERS" is
implemented. That means that this clause is currently useless, so we will skip it, too.

Sample Table

Simple Aggregation

Usage of Partitions

ORDER BY and Frames

Notice that ORDER BY clause can exist on two places in the statement. One is used to define frame, and the classic one is used to sort final data set.
SELECT Part, Number, SUM( Number )
OVER ( ORDER BY Number ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow ORDER BY Number;

Examples for the Rows, Ranges and Groups

Window Chaining ( WINDOW clause )

Window functions are verbose. If we want to use them several times in our statement, then our statement will become really long.

Default frames

We already saw that we should avoid using default frames. There are two more abbreviations that will assume default frames.

Abbreviated syntax	Full definition of the default frame
`{ UNBOUNDED \| X } PRECEDING`	`BETWEEN { UNBOUNDED \| X } PRECEDING AND CURRENT ROW`
`{ UNBOUNDED \| X } FOLLOWING`	`BETWEEN CURRENT ROW AND { UNBOUNDED \| X } FOLLOWING`

Window Functions and GROUP BY

Bellow we can see correct syntax. Columns in the grouped table are referred as Part and COUNT( Number ). Our Window function is based on those columns. That means that our window function will be SUM( COUNT( Number ) ).

Limitations of Window Functions

Range and ORDER BY

DISTINCT and Window functions

0210 Aggregate Functions and Logical Functions

MonetDB / By Bizkapish / June 26, 2024 July 9, 2024

Aggregation Functions

Sample Table for Aggregation Functions

Sys.Functions System Table

We can get a list of aggregate functions from the system table sys.functions. Aggregate functions are of the type 3.

SELECT DISTINCT name, mod FROM sys.functions WHERE type = 3 ORDER BY name;

We can divide aggregate functions into three groups:

Arithmetic functions	Concatenation functions	Statistic functions
avg, count, count_no_nil, max, min, prod, sum	group_concat, listagg, tojsonarray	corr, covar_pop, covar_samp, median, median_avg, quantile, quantile_avg, stdev_pop, stdev_samp, var_pop, var_samp

Arithmetic Functions

SQL	Result	Calculation	Comment
`SELECT AVG( Number ) FROM aggTable;`	2.75	(2+2+3+4)/4 = 2.75	NULL is ignored.
`SELECT COUNT( * ) FROM aggTable;`	5	Count the rows of the table.
`SELECT COUNT( Word ) FROM aggTable;`	4	Count words, without NULL.
`SELECT COUNT_NO_NIL( Word ) FROM aggTable;`	Error		Doesn't work.
`SELECT MAX( Word ) FROM aggTable;`	'two'	Last value.	Words are ordered alphabetically, A-Z.
`SELECT MIN( Number ) FROM aggTable;`	2	First value.	Numbers are ordered numerically.
`SELECT PROD( Number ) FROM aggTable;`	48	223*4=48
`SELECT SUM( Number ) FROM aggTable;`	11	2+2+3+4=11
`SELECT SUM( intervalMonth ) FROM aggTable;`	13	2+2+3+6=13	Also work with seconds. Result is of interval type.

SQL	Result	Calculation
`SELECT AVG( DISTINCT Number ) FROM aggTable;`	3	(2+ 3 +4)/3 = 3
`SELECT COUNT( DISTINCT Word ) FROM aggTable;`	3	Count word, no NULLs, no duplicates.
`SELECT PROD( DISTINCT Number ) FROM aggTable;`	24	2* 3 *4 = 24
`SELECT SUM( DISTINCT Number ) FROM aggTable;`	9	2+ 3 +4 = 9
`SELECT SUM( DISTINCT intervalMonth ) FROM aggTable;`	9	2+ 3 +4 = 9

Concatenation Functions

SQL	Result	Comment
SELECT LISTAGG( Word, '\|' ) FROM aggTable;	one\|two\|three\|three\|four	Default delimiter is a comma. Returns VARCHAR.
SELECT SYS.GROUP_CONCAT( Word, ';' ) FROM aggTable;	one;two;three;three;four	Default delimiter is a comma. Returns CLOB.
SELECT JSON.ToJsonArray( Word ) FROM aggTable;	[ "one", "two", "three", "three", "four" ]	Result is JSON list.
SELECT JSON.ToJsonArray( Number ) FROM aggTable;	[ "1", "2", "3", "3", "4" ]	Works with numbers, too.

Statistical Functions

Numbers, from Number column, can be divided into smaller and larger numbers. Half of the numbers will be smaller and the other half will be larger numbers. The number on the border between the smaller and larger numbers is the median.

SQL	Result	Result if we add number 5 to our column.
SELECT SYS.MEDIAN( Number ) FROM aggTable;	2₁,2₍₂₎,3₃,4₄=> 2	2₁,2₂,3₍₃₎,4₄,5₅=> 3
SELECT SYS.MEDIAN_AVG( Number ) FROM aggTable;	2₁,2₍₂₎,3₍₃₎,4₄ => (2+3)/2 = 2.5	2₁,2₂,3₍₃₎,4₄,5₅=> 3

Median is a special case of a quantile. Median is a 50% quantile. But we can differently divide our numbers. We can divide them 60% vs 40%, so that 60% numbers are on the smaller side, and 40% is on the bigger side. Number between them would be called 60% quantile. In our example below, "60 % quantile" is 2.8, which means that 60% of numbers is below 2.8. This would be more obvious if we had more numbers in our column.

SQL	Result	Calculation
SELECT SYS.QUANTILE_AVG( Number, 0.6 ) FROM aggTable;	2.8	( 4 – 1 ) * 0.6 + 1, where 4 is the count of our numbers.
SELECT SYS.QUANTILE( Number, 0.6 ) FROM aggTable;	3	This is just value from above, rounded to integer.

Variance and standard deviation are calculated differently depending whether our data represent a population or a sample.

SQL	Result	Calculation
SELECT SYS.VAR_SAMP( Number ) FROM aggTable;	0.917
SELECT SYS.StdDev_SAMP( Number ) FROM aggTable;	0.957	sqrt( variance ) = sqrt( 0.917 ) = 0.957
SELECT SYS.VAR_POP( Number ) FROM aggTable;	0.687
SELECT SYS.StdDev_POP( Number ) FROM aggTable;	0.829	sqrt( variance ) = sqrt( 0.687 ) = 0.829

Covariance in statistics measures the extent to which two variables vary linearly. Correlation is just covariance measured in normalized units. Unfortunately, there is a bug in MonetDB, version 11.49.09, and all of these functions will return wrong results.

SQL	Result in MonetDB	Calculation
SELECT SYS.COVAR_SAMP( Number, 10 – Number * 1.2 ) FROM aggTable;	-1.417 ( not correct, it is -1.1 )
SELECT SYS.COVAR_POP( Number, 10 – Number * 1.2 ) FROM aggTable;	-1.0625 ( not correct, it is -0.825 )
SELECT SYS.CORR( Number, 10 – Number * 1.2 ) FROM aggTable;	-0.986 ( not correct, it is -1 )

All statistic function will ignore NULLs.

Logical Operators

Binary Operators

SQL	Result	Comment
SELECT True AND True;	TRUE	Returns TRUE only if both arguments are TRUE.
SELECT True OR False;	TRUE	Returns TRUE if at least one argument is TRUE.

Unary Operators

SQL	Result	Comment
SELECT NOT True;	FALSE	Will transform TRUE to FALSE, and FALSE to TRUE. Always the opposite.
SELECT Null IS NULL;	TRUE	Checks whether something is NULL.
SELECT Null IS NOT NULL;	FALSE	Checks whether something is NOT NULL.

All other logical operators will return Null if at least one of its arguments is Null.

SQL	Result
SELECT NOT Null;	NULL
SELECT Null AND True;	NULL

Most SQL functions will either return NULL if one of the arguments is NULL, or will ignore rows with NULL values.

Logical Functions

Operators AND, OR and NOT have alternative syntax where they work like functions. XOR can not work like operator, only like a function.

SQL	Result	Comment
SELECT "xor"(True, False);	TRUE	Returns TRUE only when the first argument is the opposite of the second argument (Arg1 = NOT Arg2).
SELECT "and"(True, False);	FALSE	Returns TRUE only if both arguments are TRUE.
SELECT "or"(False, False);	FALSE	Returns TRUE if at least one argument is TRUE.
SELECT "not"(False);	TRUE	Will transform TRUE to FALSE, and FALSE to TRUE. Always the opposite.

0200 MonetDB: Window Functions Theory

MonetDB / By Bizkapish / June 26, 2024 July 5, 2024

Imagine people scattered through some area, looking through binoculars. None of them will see the whole scenery, they will only see part of it. It is like each of them is looking through the different window.

Now, imagine that in each row, of a database table, there is a data scientist looking through binoculars. Each data scientist can only see some of the rows from that table. Each data scientist has its own window.

What would scientist do to represent nature of data he is looking at? He would aggregate them. If each of our scientist decide to calculate average of data he is looking at, we would get a table like this one:

In the real database table, with millions of rows, these average values would not be representatives of anything. Our windows are too random. If we can create a rule by which windows are created, then we would have a scientific view of our data. Let's say that each data scientist can only see its own row, and previous two rows. Then we would have a rule. Check out animated image below (left image).

We can also define these rules in SQL if we use Window Functions. Window Functions are special, because they can define windows and then apply some aggregation to data in those windows. We can apply aggregations like SUM, AVG, MAX, but we can also use some special aggregation functions.

Window functions are also called Analytic functions, because they give us abilities that are beyond traditional SQL statements. With them we can do things which were previously hard to achieve in SQL, and they are really useful for a deep analysis of our data. Windows are like overlapping samples from our tables. They can reveal us how the nature of our data is changing through time and dimensions.

Terminology

In SQL, window, as explained above, is actually called "frame". The term "window" means something else. We will now discern difference between window, partition and a frame.

Frame is group of records that will be aggregated. Frame is presented with the moving red rectangle in the animation bellow.

Tables in animation below, show how many points each country won on some sport competition.

How to Define Window?

SELECT employee_id, salary
FROM employees
WHERE department_id = 101
AND salary > (SELECT AVG(salary) FROM employees); Subquery is not under direct influence of the outer query. In the example, we have a filter
department_id = 101 on the outer query. But subquery is not under the influence of that filter.
Subquery will calculate average salary for all of the employees. Meaning is, that we are looking for employees from department 101, that have bigger salary then the global average.

This is not true for window functions. Window is under influence of the query context. Everything that is used inside FROM, WHERE, GROUP BY and HAVING clauses will define our window. Window functions can only do their magic after the final dataset is defined and unchangeable. That also means that Window functions can only be used in SELECT and ORDER BY clauses.

How to Define Partitions?

Partitions can be defined by the values of one column. All rows that have the same value will be the same partition. On our image, all the rows with letter "A" will create Partition 1. It is also possible to use combination of the values from two or more columns to define partitions. Each unique combination of values will define a partition. On the image below, combination of values A and Q will define Partition 2. We can use expression to calculate values for our column(s). In our example, all the rows, where MOD function returns 1 will belong to Partition 1. Rows that return 0 will belong to Partition 2.

How to Define Frames?

Frames are moving and so, they are always calculated relative to the current row. Two other reference points are the first and the last row in our partition. Position of the frame is always relative to those reference points.

For definition of a frame, we have to define its start row, and its end row. End row has to be after Star row. Below we can see all the ways how to define start and end row.

Notice that for all of this to make sense, records have to be sorted.

What is a Group?

For window functions, start and end of a frame doesn't have to be a row. Start and end can also be defined with ranges and groups. Ranges and Groups are not individual records, they are sets of records.

What is a Range?

Relative positions are important for rows and groups, but not for ranges. With ranges, we are dealing with values in our column.

Let's say that some student took a school test. She scored 85 points on a test and she got a grade "A", because if number of points is between 76-100, then the grade is "A". It is similar with Ranges in window functions. Each frame is defined with a range of values. If a field value belongs to that range, then that record belongs to a frame defined by that range.

So, how we define a range? Really simple. If our current row has a value of X, we will add or subtract some number to that X, and we will get an extreme value of our range.

Demonstration

This is a simple example of a window function. This example shows how to calculate cumulative of the qty column. We are not using PARTITION BY clause, so the whole table is one big partition.

0190 Common Table Expressions (CTE) in MonetDB

MonetDB / By Bizkapish / June 1, 2024 June 11, 2024

Table with billion rows and 100 columns, can be represented with one short name. We can use the same logic and give a name to our SELECT statements. If we want to use those SELECT statements in a subquery, join or union, we can refer to them by their names. This is how we can reduce any complex statement to its building blocks. This will break complex logic to simple steps, where each step is easy to understand and manage.

CTEs are a way to name our SELECT statement, and then to use those names in the final SELECT, INSERT, DELETE , UPDATE or MERGE statement.

Syntax of Common Table Expression

Bellow we can see statement with two CTEs. We can have them as many as we want ( Name1, Name2, Name3, Name4 … ). Each CTE will give a custom name to some SELECT statement. Thanks to this, final statement (which can be SELECT, INSERT, DELETE, UPDATE, MERGE), will be short and simple.

WITH
Name1 AS ( SELECT * FROM Table1 ),            --first CTE
Name2 AS ( SELECT * FROM Table2 )             --second CTE   
SELECT *                                      --final statement 
FROM Name1

JOIN Name2 ON someCondition

JOIN Name2 ON someOtherCondition;

Not only CTE can break our logic into manageable elements, but it can also reduce repetition. We can write "SELECT * FROM Table2" once and then use it twice in the final statement. CTE will only improve readability of our statement and it will make it more concise. It will not improve performance of a statement.

Sample Tables

We will create two sample tables to use them in our CTEs.

Example with SELECT

Example with INSERT

Example with DELETE

Example with UPDATE

Example with MERGE

Aliases

Nested CTEs

ORDER BY

In MonetDB, it is possible to use ORDER BY inside of CTEs definitions. That sorting will propagate to the final result.

Recursive CTEs

Recursive CTEs are not supported in MonetDB database. This is done deliberately because of performance concerns.

0180 SQL Merge in MonetDB

MonetDB / By Bizkapish / May 26, 2024 May 29, 2024

Introduction

Merge is used when we want to use records from one table to decide what records to modify in another table. Let's assume that we want to keep two tables synchronized. In that case, every change on the first table should be reflected on the second table.

MERGE statement cannot synchronize updates, deletes and inserts at the same time. Merge can synchronize ( INSERT and UPDATE ), or ( INSERT and DELETE ). Image bellow give us an explanation for INSERT and UPDATE. MERGE will create left outer join between tables. It will match rows based on our condition. Merge will then add and update records in the table B, so that table B is the same as the table A.

For INSERT and DELETE, we do the similar thing. New rows will be added to the table B, but matched rows will be deleted from the table B.

Sample Tables

We will create two tables. We will try to propagate all the changes in the table A to the table B. For start, we will enter only one row in the table A.

Initial Synchronization

All the rows from the table A that do not exist in table B will be inserted into table B.

MERGE INTO B USING A ON B.Letter = A.Letter WHEN NOT MATCHED THEN INSERT ( Letter, Number ) VALUES ( A.Letter, A.Number ); MERGE INTO targetTable
USING sourceTable
ON matchingCondition
WHEN NOT MATCHED THEN INSERT RECORDS

Now, table B, has one row, the same as table A.

Merge with INSERT

Then we will repeat the same MERGE statement as above. This new row will then appear in the table B.

Merge with UPDATE

First, we will update the table A. We will change A1 to A4.

Then we'll push that change to the table B. Notice that we don't use 'WHEN NOT MATCHED'. Now we use 'WHEN MATCHED'.

Merge with DELETE

We will update one record in the table A.

Now, we will delete all the rows from the table B, that do exist in the table A. Row A4 exists in both tables, so that row will be deleted.

Merge with INSERT and UPDATE

The real reason for existence of MERGE statement is because we can do two things in one statement. This time we will do INSERT and UPDATE at the same time.

Merge with INSERT and DELETE

Currently, tables A and B are equal. We will add a row to table A to show how to use merge with INSERT and DELETE.

Things That Don't Work in Merge

INSERT with Subquery

Several MATCHED, or NOT MATCHED sentences

Multiple matches

I will add row "E6" in the table "A" with a statement "INSERT INTO A VALUES ( 'E', 6 )". Now this table has rows E5 and E6. If we now apply MERGE statement, both rows E5 and E6 will try to update row E5 in the table B. This is not allowed and will fail.

Abbreviated syntax	Full definition of the default frame
`{ UNBOUNDED \| X } PRECEDING`	`BETWEEN { UNBOUNDED \| X } PRECEDING AND CURRENT ROW`
`{ UNBOUNDED \| X } FOLLOWING`	`BETWEEN CURRENT ROW AND { UNBOUNDED \| X } FOLLOWING`