Offset functions are window functions that return a value from a specified row. The row is determined as an offset from the reference point. There are two major reference points. The first is the current row, the second is the frame border. Based on this, we distinguish between "row offset functions" and "frame offset functions".
Row Offset Functions
With row offset functions, we move from the current row several steps up or down. Our movement is dependent only on the current row, so we do not use frames with row offset functions. Frames are not applicable in this case.
LAG function will move us up. LEAD function will move us down.
We will create one sample table so that we can observe how these functions work.
CREATE TABLE offsetTable( "Month" INT, Value INT );
LAG function will allow us to get some previous value and to place it in the current row. That would allow us to compare some old and some new value, so we can analyze the change in time. SELECT "Month", Value, LAG( Value, 2 ) OVER ( ORDER BY "Month" ) FROM offsetTable ORDER BY "Month";
We can see three drawbacks in the image above:
Month April doesn't have a value. LAG function doesn't notice that, and will happily move two rows above, but not two months before. For the fifth month we are getting the value from the February which is three months before, not two. That is the consequence of the missing month April. We should avoid using LAG function when we are missing some time points.
For months January and February, we have nulls in the new column . Previous values don't exist so we have to be satisfied with nulls.
In the third row, the Value column has NULL. Sometimes we want to go over those nulls and not count them as steps we take. There is a way to achieve this in SQL standard, but that part of SQL standard is not implemented in MonetDB, so null values will always be counted as one step. Subclause IGNORE NULLS is not supported.
--not supported SELECT "Month", Value, LAG( Value, 2) IGNORE NULLS OVER (ORDER BY "Month" ) FROM offsetTable ORDER BY "Month";
LAG can work with partitions.
We should always order rows inside partition with ORDER BY. Although LAG function can work without ORDER BY, the results will be unpredictable and meaningless.
LAG Function Arguments
If the second argument is missing, then the default value is 1. In the image, we can see that the rows are shifted by one row.
SELECT "Month", Value, LAG(Value) OVER (ORDER BY "Month") FROM offsetTable ORDER BY "Month";
Only if we provide second argument, we can also provide the third argument. The third argument is the default value. Default value will be used for those rows where previous values don't exist. SELECT "Month", Value, LAG(Value,2,77) OVER (ORDER BY "Month") FROM offsetTable ORDER BY "Month";
Default value has to be of the proper data type that match other values in the column.
Third argument is optional. If we skip it, the default value would be null.
SELECT "Month", Value, LAG(Value,2) OVER (ORDER BY "Month") FROM offsetTable ORDER BY "Month";
LEAD Function
LEAD function is the same as LAG function, just the opposite. Notice that we can achieve the same result with LEAD function, as with LAG function, if we change ORDER BY from ascending order to descending order.
SELECT "Month", Value, LAG(Value,1) OVER (ORDER BY "Month" ASC) FROM offsetTable ORDER BY "Month";
SELECT "Month", Value, LEAD(Value,1) OVER (ORDER BY "Month" DESC) FROM offsetTable ORDER BY "Month";
<=== totally the same ===>
This is similar example. Both statements use ASC, but the result is the same. This is because the offset argument can be negative. This is another way we can make the LAG and LEAD functions do the same thing.
SELECT "Month", Value, LAG(Value,2) OVER (ORDER BY "Month" ASC) FROM offsetTable ORDER BY "Month";
SELECT "Month", Value, LEAD(Value,-2) OVER (ORDER BY "Month" ASC) FROM offsetTable ORDER BY "Month";
<=== totally the same ===>
Frame Offset Functions
For frame offset functions everything is relative to frame borders. We have three functions of this type. – FIRST_VALUE will return the first value in the frame. – NTH_VALUE will return the NTH value in the frame, from the start of a frame. – LAST_VALUE will return the last value in the frame.
For frame offset functions, we always need to have frame. If 6 is the month in the current row, then the frame is between 4th and 8th month. Now that we know the frame size and position, it is easy to see that the first value in this frame is 5.
SELECT "Month", Value, FIRST_VALUE(Value) OVER (ORDER BY "Month" RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING) FROM offsetTable ORDER BY "Month";
If we omit frame definition, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW will be used. Be aware of this and always define frame.
SELECT "Month", Value, LAST_VALUE(Value) OVER (ORDER BY "Month") FROM offsetTable ORDER BY "Month";
Frame offset functions can work with partitions.
We should always order rows inside frame with ORDER BY. Frame offset functions can work without ORDER BY, but the results will be unpredictable and meaningless.
NTH_VALUE Function
NTH_VALUE function has another argument which tells how many steps to move from the start of a frame. In the example bellow, our frame is the whole partition. Forth number in the "value" column is number 5. That number will be the result of this window function. SELECT "Month", Value, NTH_VALUE(Value,4) OVER (ORDER BY "Month" ASC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM offsetTable ORDER BY "Month";
If the second argument is omitted, default value will not be one, but the error will be raised.
The value of that second argument can not be negative.
We can use partitions with NTH_VALUE. You should always provide frame definition, and you should have ORDER BY for your frame.
Difference From SQL Standard
IGNORE NULLS is not supported in MonetDB for the functions FIRST_VALUE, LAST_VALUE and NTH_VALUE.
With IGNORE NULLS we would be able to not count nulls, but this is not supported in MonetDB. SELECT "Month", Value, NTH_VALUE(Value,3) IGNORE NULLS --not supported OVER (ORDER BY "Month" ASC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM offsetTable ORDER BY "Month";
NTH_VALUE can not use subclause FROM LAST. This would allow to count steps from the end of the frame, and not from the start. But this is not supported in MonetDB.
SELECT "Month", Value, NTH_VALUE(Value,2) FROM LAST --not supported OVER (ORDER BY "Month" ASC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM offsetTable ORDER BY "Month";
We actually don't need "FROM LAST" subclause. It is enough to change ORDER BY to descending, and we would get the same result as with "FROM LAST". SELECT "Month", Value, NTH_VALUE(Value,2) OVER (ORDER BY "Month" DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM offsetTable ORDER BY "Month";
For each row in Excel we know its row number. On the image we can see that in Excel, in front of column "A", we have a "column" with row numbers.
We can create the same column, with row numbers, in MonetDB, by using ROW_NUMBER() window function. There are two more similar functions in MonetDB, and those are RANK() and DENSE_RANK().
Sometimes, in Olympic games, two sport players can share gold medal. In that case, there is no silver medal. Next best player will get bronze medal. This is called "olympic ranking". In MonetDB we can create such ranking with RANK() function.
For apples we wouldn't apply Olympic ranking. Instead of that we would place apples of the similar quality in consecutive classes, so we would have apples of first, second and third class. This can be achieved with DENSE_RANK() function. As it's name implies, DENSE_RANK() doesn't have gaps in ranking, so there will be always someone who will get silver medal.
With ROW_NUMBER(), RANK() and DENSE_RANK(), we can not use frames. These functions are not returning one value for each row, and because of that frames are not applicable. These functions always work with the whole partitions.
This is an example for all of the three functions:
SELECT part, number, ROW_NUMBER() OVER(ORDER BY Number) AS rowNum, RANK() OVER (ORDER BY Number) AS Rank, DENSE_RANK() OVER(ORDER BY Number) AS dense_Rank FROM rankFunctions ORDER BY Number;
From this example we can conlude: 1) All null values are treated as the same value. 2) It is important to use "ORDER BY" inside of WINDOW function, without it result would be meaningless. 3) These three functions are not taking any arguments.
This is what would be the result without ORDER BY. RANK and DENSE_RANK functions would return one's, so they are not really working without ORDER BY.
ROW_NUMBER() function can be still be useful without ORDER BY. It can be used as a column that would deduplicate rows that are otherwise indistinguishable.
SELECT part, number, ROW_NUMBER() OVER() AS rowNum, RANK() OVER () AS Rank, DENSE_RANK() OVER() AS dense_Rank FROM rankFunctions ORDER BY Number;
ROW_NUMBER() function is great when we want to read data step by step, by reading N consecutive rows each time. Example on the right would help us read our data by fetching 3 rows each time.
WITH Tab1 AS ( SELECT Part, Number, ROW_NUMBER() OVER (ORDER BY Number) AS RowNum FROM rankFunctions ) SELECT * FROM Tab1 WHERE RowNum BETWEEN 4 AND 6;
Process of reading N by N rows is called pagination. We are reading one page at a time.
Partitioning would just mean that we are restarting the sequence.
SELECT part, number, ROW_NUMBER() OVER(PARTITION BY Part ORDER BY Number) AS rowNum, RANK() OVER (PARTITION BY Part ORDER BY Number) AS Rank, DENSE_RANK() OVER(PARTITION BY Part ORDER BY Number) AS dense_Rank FROM rankFunctions ORDER BY Part, Number;
Relative Ranking Functions
Relative ranking functions will show us rank of a row expressed in normalized way, with numbers between 0 and 1. There are two relative ranking functions. Those are CUME_DIST() and PERCENT_RANK().
CUME_DIST() function
CUME_DIST() is short for Cumulative Distribution. This function will can answer questions like: What percentage of screws (image on the right) is equal or smaller than 9 cm?
The answer would be 4/6, that is 66%.
This function is called cumulative because it shows a cumulative probability.
This would be the result for our sample table. SELECT Number, CUME_DIST() OVER(ORDER BY Number) AS cumeRankNum FROM rankFunctions ORDER BY Number;
PERCENT_RANK() function
For PERCENT_RANK() function, we will start with an example. We can see that RANK() of Number will return numbers 1,1,3,4,5,6,7,7. PERCENT_RANK() function is similar, just the values are normalized, so it will return numbers from 0 to 1.
SELECT Number, RANK() OVER (ORDER BY Number) AS rankNum, PERCENT_RANK() OVER(ORDER BY Number) AS percRankNum FROM rankFunctions ORDER BY Number;
PERCENT_RANK() will calculate it's values with this formula:
PERCENT_RANK() will exclude current row from the calculation, like it doesn't exist.
NTILE(n) Function
We have 6 screws.
We want to divide them into 2 groups. This is the most natural way to divide them:
This is if we want to divide them into 3 groups:
Such separation of elements into groups can be done in SQL with NTILE(n) function. We will use NTILE(2) to divide Number column into 2 groups and NTILE(3) to divide Number columns into 3 groups.
SELECT Number, NTILE(2) OVER(ORDER BY Number) AS groupNum FROM rankFunctions ORDER BY Number;
SELECT Number, NTILE(3) OVER(ORDER BY Number) AS groupNum FROM rankFunctions ORDER BY Number;
We can notice that nulls would be considered as the smallest values. We can also notice that on the first image we have 2 groups with 4 Numbers. That is perfect. But, on the second image we have three groups with 3,3,2 Numbers. Groups are not of equal size. In that case we would start with 2,2,2 groups, of equal size, but the first few groups would get an extra Number. That would make things almost perfect.
Final Conclusion
General rules for ranking functions are this:
1) All null values are treated as the same value. 2) It is important to use "ORDER BY" inside of WINDOW function, without it result would be meaningless. 3) None of the ranking functions is taking an argument, with the exception of NTILE(n) function. 4) None of these functions can work with frames, but they can use partitioning. 5) If values for several rows are the same, then the values returned by ranking functions will be the same for those rows. In our sample table, both of two rows with Number 10 will always have the same rank, no matter what ranking function we are using (except ROW_NUMBER() ).
Aggregate functions support partitioning, ordering and framing. Below we can see example of this.
SELECT Part, Number, SUM( Number) OVER ( PARTITION BY Part ORDER BY Part DESC, Number NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) As Col1 FROM aggWindow ORDER BY Part, Number DESC;
We can use WINDOW definition to show all of the classical aggregation functions. Our window is defined as (), so we will aggregate the whole column.
SELECT Part, Number, SUM( Number ) OVER W AS SumCol, MAX( Number ) OVER W AS MaxCol, MIN( Number ) OVER W AS MinCol, AVG( Number ) OVER W AS AvgCol, PROD( Number ) OVER W AS ProdCol FROM aggWindow WINDOW W AS ( ) ORDER BY Number;
We can use listagg and group_concat functions to concatenate text from our column.
SELECT listagg( Part, ';' ) OVER W AS listAggCol, group_concat( Part, '-' ) OVER W AS groupConcatCol FROM aggWindow WINDOW W AS ( );
Running Aggregations
Aggregate functions are used to calculate cumulative values. In this case we create our frame relative to current row and we use ORDER BY.
SELECT Number, SUM( Number ) OVER ( ORDER BY Number ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) As Col1 FROM aggWindow;
Moving Aggregations
We can also use aggregate functions for calculation of moving aggregations.
SELECT Number, AVG( Number ) OVER ( ORDER BY Number ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) As Col1 FROM aggWindow;
Usage of window functions in ORDER BY
Window functions can be used only after all data from our query is created. That means that Window functions can be used only in SELECT and ORDER BY clauses. Here is the usage of Window function in ORDER BY.
SELECT Part, Number FROM aggWindow ORDER BY COUNT(Number) OVER (PARTITION BY Part);
Combining WINDOW functions and GROUPBY clause
If we combine WINDOW functions and GROUP BY clause, there is a difference between what these functions can see. GROUP BY calculation can only see one group, but it can see all the detail rows inside of that group. WINDOW function can see all the groups, but it can not see detail rows.
SELECT Part, SUM( Number ) AS Total FROM aggWindow GROUP BY Part ORDER BY Part;
GROUP BY will sum detail rows for each group separately. It can not combine values from different groups.
SELECT Part, SUM(SUM( Number )) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Total FROM aggWindow GROUP BY Part ORDER BY Part;
WINDOW function will see only those rows that are returned after GROUP BY finished its calculation. WINDOW function can see only totals for the groups, and not the detail rows. WINDOW function can work with those totals, from all of the groups, at the same time.
It seems that GROUP BY and WINDOW functions are of equal power. But that is not the truth. WINDOW functions are more powerful. We can preserve detail rows in the original columns, and we can create new columns with aggregated values. This allow us to compare detail and aggregated values. This is not possible with GROUP BY.
Look, we now have detail values and aggregated values in the same row. We can now easily calculate share of details in the total.
SELECT Part, Number, SUM( Number ) OVER ( PARTITION BY Part ) AS Total FROM aggWindow ORDER BY Part, Number;
SELECT Part, Number, SUM( Number ) OVER ( PARTITION BY Part ) AS Total, 100 * Number / SUM( Number ) OVER ( PARTITION BY Part ) AS "Share%" FROM aggWindow ORDER BY Part, Number;
The last column will show us share as a percentage.
Statistical Aggregate Functions
Statistical aggregate functions can be used in the same way as the classical aggregate functions.
In one of the earlier posts, we saw how functions CORR, COVAR_POP and COVAR_SAMP return wrong result. Contrary to that, if these functions are used inside of a WINDOW function, their results will be correct.
SELECT Part, Number, CORR( Number, Number - 1 ) OVER W AS CorrCol, COVAR_POP( Number, Number - 1 ) OVER W AS CovarPopCol, COVAR_SAMP( Number, Number - 1 ) OVER W AS CovarSampCol, STDDEV_POP( Number ) OVER W AS StdDevPopCol, STDDEV_SAMP( Number ) OVER W AS StdDevSampCol, VAR_POP( Number ) OVER W AS VarPopCol, VAR_SAMP( Number ) OVER W AS VarSampCol FROM aggWindow WINDOW W AS ( ) ORDER BY Number;
If the parentheses after OVER keyword are empty, then we will aggregate the whole Number column. Aggregate functions, like SUM, AVERAGE, MAX…, will ignore NULLs.
1+1+3+3+6+6+8+8 = 36
SELECT Part, Number, SUM( Number ) OVER ( ) AS Col1 FROM aggWindow;
Usage of Partitions
We can partition our table before aggregation. Each partition will be separately aggregated.
1+1+3+3 = 8 6+6+8+8 = 28
SELECT Part, Number, SUM( Number ) OVER ( PARTITION BY Part ) AS Col1 FROM aggWindow;
ORDER BY and Frames
ORDER BY is always accompanied with the frame definition. If the frame definition is not provided, then the default frame is used. The default frame is:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number ) AS Col1 FROM aggWindow;
We should avoid using default values and we should always provide explicit frame definitions.
If we don't provide ORDER BY, there is no knowing how frames will be created. It is important to provide ORDER BY to avoid randomness. Notice in the image that we have a meaningless running total, because there is no ordering of the Number column.
SELECT Part, Number, SUM( Number ) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow;
If we sort Number column in the final data set, meaningless running total will become more obvious.
SELECT Part, Number, SUM( Number ) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow ORDER BY Number;
Notice that ORDER BY clause can exist on two places in the statement. One is used to define frame, and the classic one is used to sort final data set. SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow ORDER BY Number;
Within the Window, the NULL position can be controlled independently of the ORDER BY clause. We can use NULLS LAST or NULLS FIRST to explicitly define position of NULLs.
SELECT Part, Number, SUM( Number ) OVER ( PARTITION BY Part ORDER BY NumberNULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow ORDER BY Number;
In the image, our Null is appearing in the first row because of the classic ORDER BY clause, but it is paired with the biggest value in the "Col1" for Partition 2 (28). NULL is now part of the last frame in the last Partition.
This is because of the clause NULLS LAST,whichchanged position of the NULL row, from the first, to the last, in the window definition.
Examples for the Rows, Ranges and Groups
In this example, frame is defined by the current and two previous rows.
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS Col1 FROM aggWindow ORDER BY Number;
With Ranges, current row comprise all the rows with the same value. Current row is a set of records, and not only one row.
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number RANGE BETWEEN CURRENT ROW AND 3 FOLLOWING ) AS Col1 FROM aggWindow ORDER BY Number;
This example will use all the rows with the current value X and will calculate range with limits [X,X+3].
For X = 6, limits are [6,6+3]. Numbers 6 and 8 are both inside of this range.
Each frame encompass current group, one previous group, and all the latter groups.
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number GROUPS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING ) AS Col1 FROM aggWindow ORDER BY Number;
Window Chaining ( WINDOW clause )
Window functions are verbose. If we want to use them several times in our statement, then our statement will become really long.
SELECT Part, Number, SUM( Number ) OVER ( PARTITION BY Part ORDER BY Number ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS Col1, SUM( Number ) OVER ( ORDER BY Number DESC GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS Col2 FROM aggWindow ORDER BY Number;
The only workaround is available if we are using the same window for several functions. We can then define our window once, with WINDOW clause, and then reference it several times.
SELECT Part, Number, SUM( Number ) OVER W AS SumCol, MAX( Number ) OVER WAS MaxCol FROM aggWindow WINDOW W AS ( PARTITION BY Part ORDER BY Number ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) ORDER BY Number;
Default frames
We already saw that we should avoid using default frames. There are two more abbreviations that will assume default frames.
Abbreviated syntax
Full definition of the default frame
{ UNBOUNDED | X } PRECEDING
BETWEEN { UNBOUNDED | X } PRECEDING AND CURRENT ROW
{ UNBOUNDED | X } FOLLOWING
BETWEEN CURRENT ROW AND { UNBOUNDED | X } FOLLOWING
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number ROWS1 PRECEDING) AS Col1 FROM aggWindow ORDER BY Number;
–The same is for RANGE or GROUPS.
Default frame is: SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Number ROWSBETWEEN 1 PRECEDING AND CURRENT ROW) AS Col1 FROM aggWindow ORDER BY Number;
–The same is for RANGE or GROUPS.
Window Functions and GROUP BY
We can group our table with GROUP BY. After we do this, window function will only see those grouped rows. On the image to the left, window function will only see 2 rows. Detail rows will not be any more available to window function.
The question is, what syntax to use to create running total in this grouped table.
Bellow we can see correct syntax. Columns in the grouped table are referred as Part and COUNT( Number ). Our Window function is based on those columns. That means that our window function will be SUM(COUNT( Number )).
SELECT Part, COUNT( Number ) AS CountCol, SUM(COUNT( Number )) OVER ( ORDER BY Part ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal FROM aggWindow GROUP BY Part ORDER BY Part;
Limitations of Window Functions
Range and ORDER BY
Range frames can only use one column inside of ORDER BY clause.
SELECT Part, Number, SUM( Number ) OVER ( ORDER BY Part DESC, Number RANGE BETWEEN CURRENT ROW AND 3 FOLLOWING ) AS Col1 FROM aggWindow ORDER BY Number;
DISTINCT and Window functions
We can not use DISTINCT keyword inside of Window function.
SELECT Part, Number COUNT( DISTINCT Number ) OVER ( ) AS Col1 FROM aggWindow;
Also work with seconds. Result is of interval type.
We can use DISTINCT keyword to exclude duplicates. In our sample table, only number 3 is duplicate. Calculations below will be done like that duplicate value doesn't exist
SQL
Result
Calculation
SELECT AVG( DISTINCT Number ) FROM aggTable;
3
(2+ 3 +4)/3 = 3
SELECT COUNT( DISTINCT Word ) FROM aggTable;
3
Count word, no NULLs, no duplicates.
SELECT PROD( DISTINCT Number ) FROM aggTable;
24
2* 3 *4 = 24
SELECT SUM( DISTINCT Number ) FROM aggTable;
9
2+ 3 +4 = 9
SELECT SUM( DISTINCT intervalMonth ) FROM aggTable;
9
2+ 3 +4 = 9
Concatenation Functions
Concatenation is a way to aggregate text. Instead of having text occupying N rows, where on each row we have only one phrase, we can aggregate them all into one row. Result will be comma separated list of those phrases, although we can choose what delimiter will be used instead of the comma.
If we want to remove duplicates, we have to use DISTINCT keyword before the name of a column.
It is not possible to control sort order of the phrases in the result.
SQL
Result
Comment
SELECT LISTAGG( Word, '|' ) FROM aggTable;
one|two|three|three|four
Default delimiter is a comma. Returns VARCHAR.
SELECT SYS.GROUP_CONCAT( Word, ';' ) FROM aggTable;
one;two;three;three;four
Default delimiter is a comma. Returns CLOB.
SELECT JSON.ToJsonArray( Word ) FROM aggTable;
[ "one", "two", "three", "three", "four" ]
Result is JSON list.
SELECT JSON.ToJsonArray( Number ) FROM aggTable;
[ "1", "2", "3", "3", "4" ]
Works with numbers, too.
Statistical Functions
Numbers, from Number column, can be divided into smaller and larger numbers. Half of the numbers will be smaller and the other half will be larger numbers. The number on the border between the smaller and larger numbers is the median.
SQL
Result
Result if we add number 5 to our column.
SELECT SYS.MEDIAN( Number ) FROM aggTable;
21,2(2),33,44=> 2
21,22,3(3),44,55 => 3
SELECT SYS.MEDIAN_AVG( Number ) FROM aggTable;
21,2(2),3(3),44 => (2+3)/2 = 2.5
21,22,3(3),44,55=> 3
Median is a special case of a quantile. Median is a 50% quantile. But we can differently divide our numbers. We can divide them 60% vs 40%, so that 60% numbers are on the smaller side, and 40% is on the bigger side. Number between them would be called 60% quantile. In our example below, "60 % quantile" is 2.8, which means that 60% of numbers is below 2.8. This would be more obvious if we had more numbers in our column.
SQL
Result
Calculation
SELECT SYS.QUANTILE_AVG( Number, 0.6 ) FROM aggTable;
2.8
( 4 – 1 ) * 0.6 + 1, where 4 is the count of our numbers.
SELECT SYS.QUANTILE( Number, 0.6 ) FROM aggTable;
3
This is just value from above, rounded to integer.
Variance and standard deviation are calculated differently depending whether our data represent a population or a sample.
SQL
Result
Calculation
SELECT SYS.VAR_SAMP( Number ) FROM aggTable;
0.917
SELECT SYS.StdDev_SAMP( Number ) FROM aggTable;
0.957
sqrt( variance ) = sqrt( 0.917 ) = 0.957
SELECT SYS.VAR_POP( Number ) FROM aggTable;
0.687
SELECT SYS.StdDev_POP( Number ) FROM aggTable;
0.829
sqrt( variance ) = sqrt( 0.687 ) = 0.829
Covariance in statistics measures the extent to which two variables vary linearly. Correlation is just covariance measured in normalized units. Unfortunately, there is a bug in MonetDB, version 11.49.09, and all of these functions will return wrong results.
SQL
Result in MonetDB
Calculation
SELECT SYS.COVAR_SAMP( Number, 10 – Number * 1.2 ) FROM aggTable;
-1.417 ( not correct, it is -1.1 )
SELECT SYS.COVAR_POP( Number, 10 – Number * 1.2 ) FROM aggTable;
-1.0625 ( not correct, it is -0.825 )
SELECT SYS.CORR( Number, 10 – Number * 1.2 ) FROM aggTable;
-0.986 ( not correct, it is -1 )
All statistic function will ignore NULLs.
Logical Operators
Binary Operators
SQL
Result
Comment
SELECT True AND True;
TRUE
Returns TRUE only if both arguments are TRUE.
SELECT True OR False;
TRUE
Returns TRUE if at least one argument is TRUE.
Unary Operators
SQL
Result
Comment
SELECT NOT True;
FALSE
Will transform TRUE to FALSE, and FALSE to TRUE. Always the opposite.
SELECT Null IS NULL;
TRUE
Checks whether something is NULL.
SELECT Null IS NOT NULL;
FALSE
Checks whether something is NOT NULL.
All other logical operators will return Null if at least one of its arguments is Null.
SQL
Result
SELECT NOT Null;
NULL
SELECT Null AND True;
NULL
Most SQL functions will either return NULL if one of the arguments is NULL, or will ignore rows with NULL values.
Logical Functions
Operators AND, OR and NOT have alternative syntax where they work like functions. XOR can not work like operator, only like a function.
SQL
Result
Comment
SELECT "xor"(True, False);
TRUE
Returns TRUE only when the first argument is the opposite of the second argument (Arg1 = NOT Arg2).
SELECT "and"(True, False);
FALSE
Returns TRUE only if both arguments are TRUE.
SELECT "or"(False, False);
FALSE
Returns TRUE if at least one argument is TRUE.
SELECT "not"(False);
TRUE
Will transform TRUE to FALSE, and FALSE to TRUE. Always the opposite.