0430 Python UDFs in MonetDB

Prerequisites for Python UDFs

In order to create Python UDFs, we need to fulfill two prerequisites:
– We have to install the Monetdb-Python3 package. MonetDB uses this package to communicate with Python.
– We have to enable Python in each database where we want to create Python UDFs (monetdb set embedpy3=true).

How to provide these prerequisites is already explained in one of my previous posts ( 0360 Loader Functions In Monetdb ). Please see that post to learn how to enable Python support. Alternatively, you can watch the video about loader functions on YouTube ( https://youtu.be/2WHb41dzh_A ).

Presence of NumPy and Pandas Packages

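In Ubuntu 22.04+, the global Python environment is "externally managed". That means pip will not install packages into it by default, so we have to install NumPy and pandas from the official Ubuntu repository.

NumPy and pandas are available in Ubuntu's repository. We can confirm that with apt search and then install them with apt install.
apt search python3-numpy; apt search python3-pandas
sudo apt install python3-numpy python3-pandas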

We now have numpy and pandas installed.

Sample Table

Scalar Python UDF

CREATE OR REPLACE FUNCTION pyConcat(val1 CHAR, val2 CHAR)
RETURNS CHAR(2)
LANGUAGE PYTHON {
    return val1 + val2
};

This is a simple function that accepts two arguments and returns their concatenation. We can use this function like this:

SELECT letter, sign, pyConcat( letter, sign ) FROM pyTab;

If a column contains nulls, then instead of a NumPy array, we will get a masked array.

import numpy.ma as ma
b = ma.array([1, -9999, 3, 4], mask=[False, True, False, False])
print("Sum of Masked array:", b.sum())  # Sum of Masked array: 8

A masked array is a combination of a standard NumPy array and a mask. The mask is used to hide invalid or missing values. After we hide the bad values, we can calculate the sum or average of the masked array without the influence of the bad values.

In our example, we used the "isinstance" function to check whether we have a NumPy array or a masked array. For a masked array, we replaced the bad values with an empty string.
if isinstance(val2, numpy.ma.MaskedArray):
    val2 = val2.filled('')
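The null handling described above can be tried outside MonetDB. The following is a minimal sketch, assuming that string columns arrive as object arrays and that a column with nulls arrives as a masked array. The helper name py_concat and the sample values are hypothetical and only serve as an illustration.

```python
import numpy as np
import numpy.ma as ma


def py_concat(val1, val2):
    """Hypothetical stand-in for the body of the pyConcat UDF."""
    # Replace masked (null) entries with an empty string before concatenating.
    if isinstance(val1, ma.MaskedArray):
        val1 = val1.filled('')
    if isinstance(val2, ma.MaskedArray):
        val2 = val2.filled('')
    return val1 + val2


# Assumed inputs: one column without nulls, one column with a null in row 2.
letters = np.array(['A', 'B', 'C'], dtype=object)
signs = ma.array(['+', 'x', '-'], mask=[False, True, False], dtype=object)

result = py_concat(letters, signs)
print(result.tolist())
```

The masked value 'x' never reaches the output, because "filled" overwrites it with the empty string before the concatenation.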

The data type of arguments in Python is directly inferred from the SQL data types, according to this mapping.

SQL type	NumPy type
BOOLEAN	numpy.int8
TINYINT	numpy.int8
SMALLINT	numpy.int16
INTEGER	numpy.int32
BIGINT	numpy.int64
REAL	numpy.float32
FLOAT	numpy.float64
HUGEINT	numpy.float64
STRING	numpy.object
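To get a feeling for this mapping, we can rebuild it as a dictionary in plain Python. This is only a sketch: the names SQL_TO_NUMPY and simulate_column are hypothetical, and STRING is written as the built-in "object" because the "numpy.object" alias no longer exists in recent NumPy releases.

```python
import numpy as np

# Mapping copied from the table above; STRING uses the built-in `object`.
SQL_TO_NUMPY = {
    "BOOLEAN": np.int8,
    "TINYINT": np.int8,
    "SMALLINT": np.int16,
    "INTEGER": np.int32,
    "BIGINT": np.int64,
    "REAL": np.float32,
    "FLOAT": np.float64,
    "HUGEINT": np.float64,
    "STRING": object,
}


def simulate_column(sql_type, values):
    """Hypothetical helper: build an array with the dtype a UDF would receive."""
    return np.array(values, dtype=SQL_TO_NUMPY[sql_type])


small = simulate_column("SMALLINT", [1, 2, 3])
print(small.dtype)

# float64 cannot represent every large integer exactly, so a HUGEINT value
# above 2**53 may change when it is handed over as numpy.float64.
huge = simulate_column("HUGEINT", [2**53 + 1])
precision_lost = int(huge[0]) != 2**53 + 1
print(precision_lost)
```

The last lines point at a consequence of the mapping that is worth keeping in mind: very large HUGEINT values can lose precision once they become floating point numbers.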

Returned Value

Let's try to return a list, a dictionary, or a pandas data frame. We'll call our functions with "SELECT letter, sign, Ret() FROM pyTab;".

Python Traps

CREATE OR REPLACE FUNCTION funcCase(Letter CHAR)
RETURNS CHAR
LANGUAGE PYTHON {
    return( Letter )
};

If we try to call this function, we'll get an error.
SELECT funcCase('A');

Because Python is case sensitive and SQL is not, argument names are converted to lowercase in the Python script. Instead of "return( Letter )", we have to type "return( letter )". Arguments inside the Python script have to be in lowercase. After the change, this function will work.
SELECT funcCase('A');
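The same trap can be reproduced in plain Python. This is a simulation under the assumption that the UDF body sees only the lowercase name; the function names are hypothetical, and the actual error message shown by MonetDB is not reproduced here.

```python
def func_case_broken(letter):
    # The argument arrives as `letter`; the name `Letter` was never defined.
    return Letter  # raises NameError when the function is called


def func_case_fixed(letter):
    # Using the lowercase name matches what the script actually receives.
    return letter


try:
    func_case_broken("A")
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

fixed_result = func_case_fixed("A")
print(error_name, fixed_result)
```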

Python is, of course, sensitive to indentation. Indentation must be consistent. We'll get an error if it is not.
CREATE OR REPLACE FUNCTION funcIndent()
RETURNS CHAR LANGUAGE PYTHON {
    a = 3
  return a
};
SELECT funcIndent();
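We can check a function body for indentation problems before we paste it into a UDF. The sketch below wraps the body into a hypothetical function named udf and asks Python to compile it. How MonetDB itself wraps the body is not shown here, so treat this as a quick local check only.

```python
import textwrap


def compiles(body):
    """Return True if `body` compiles as the body of a Python function."""
    source = "def udf():\n" + body
    try:
        compile(source, "<udf>", "exec")
        return True
    except IndentationError:
        return False


# Two lines with different indentation levels: Python rejects this.
inconsistent = "    a = 3\n  return a\n"

# The same two lines, indented consistently with four spaces.
consistent = textwrap.indent("a = 3\nreturn a\n", "    ")

print(compiles(inconsistent), compiles(consistent))
```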

Creating a Table with Python UDF

Use Python Function to Filter Data

Aggregate UDFs

So, the inputs for our function are "val = np.array([1, 2, 3, 4])" and "aggr_group = np.array([0, 0, 0, 1])". Below, we first have our function, and after it we can see the interim results as pseudocode.

CREATE OR REPLACE AGGREGATE pySUM(val INTEGER) RETURNS INTEGER LANGUAGE PYTHON {
    unique = numpy.unique(aggr_group)
    x = numpy.zeros(shape=(unique.size))
    for i in range(0, unique.size):
        x[i] = numpy.sum(val[aggr_group==unique[i]])
    return(x)
};

val = np.array([1, 2, 3, 4])          #we start with val and aggr_group
aggr_group = np.array([0, 0, 0, 1])
unique = np.array([0, 1])             #we remove duplicates
x = np.array([0, 0])                  #result array, but filled with zeros
for i from 0 to 1                     #for each group
    x[0] = sum([1, 2, 3, 4] where [0, 0, 0, 1] = 0 ) = 6    #for A
    x[1] = sum([1, 2, 3, 4] where [0, 0, 0, 1] = 1 ) = 4    #for B
return np.array([6, 4])
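The walk-through above can be verified in plain Python with the same inputs. In this sketch the function py_sum is a hypothetical copy of the UDF body that takes "aggr_group" as an ordinary argument. The "numpy.bincount" line is an alternative I am adding for comparison, assuming the group ids are small non-negative integers; it is not part of the original function.

```python
import numpy as np


def py_sum(val, aggr_group):
    """Hypothetical copy of the pySUM body, runnable outside MonetDB."""
    unique = np.unique(aggr_group)        # distinct group ids
    x = np.zeros(shape=(unique.size))     # one result slot per group
    for i in range(0, unique.size):
        # Boolean mask selects the values that belong to the current group.
        x[i] = np.sum(val[aggr_group == unique[i]])
    return x


val = np.array([1, 2, 3, 4])
aggr_group = np.array([0, 0, 0, 1])

looped = py_sum(val, aggr_group)

# Same sums without an explicit loop (assumes non-negative integer group ids).
vectorized = np.bincount(aggr_group, weights=val)

print(looped.tolist(), vectorized.tolist())
```

Both variants give one sum per group, in the order of the group ids.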

If we try to apply our function without grouping, we'll get an error.
SELECT pySUM( Number ) FROM pyTab;
If there are no groups, then "aggr_group" is not defined.
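One possible way to guard against this is to catch the NameError and fall back to a sum over all values. The sketch below simulates that idea in plain Python: the body is executed with "exec" in a namespace that may or may not contain "aggr_group". The names UDF_BODY and run_udf are hypothetical, and inside a real UDF we would use "return" instead of assigning to "result".

```python
import numpy

# Simulated UDF body; `aggr_group` exists only when the query has groups.
UDF_BODY = """
try:
    unique = numpy.unique(aggr_group)
    x = numpy.zeros(shape=(unique.size))
    for i in range(0, unique.size):
        x[i] = numpy.sum(val[aggr_group == unique[i]])
    result = x
except NameError:
    # No groups were supplied, so we aggregate over all values.
    result = numpy.sum(val)
"""


def run_udf(**variables):
    """Hypothetical helper: run the body with the given variables defined."""
    namespace = {"numpy": numpy, **variables}
    exec(UDF_BODY, namespace)
    return namespace["result"]


val = numpy.array([1, 2, 3, 4])
ungrouped = run_udf(val=val)
grouped = run_udf(val=val, aggr_group=numpy.array([0, 0, 0, 1]))
print(ungrouped, grouped.tolist())
```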