0340 Loading from CSV files into MonetDB

Importing data from CSV files into MonetDB is much faster than using "INSERT INTO". The reason is, CSV can be loaded as is, no need for parsing like for "INSERT INTO". The other reason is because CSV file can be imported by using several CPU cores at once, while "INSERT INTO" can only use one thread.

Sample table and CSV file

To use COPY INTO statement, we have to log in as an administrator (password is monetdb):
mclient -u monetdb -d voc;

In order to import data, first we need to have a table where we will place data.
CREATE TABLE tblCSV ( Letter VARCHAR(10) DEFAULT 'Z', Number REAL, Calendar DATE );

We'll start with this CSV file. Line ending can be \n (Linux) (1) or \r\n (Windows), it doesn't matter. File has to be encoded as UTF-8 (2).  In our example, separator between columns is comma (3). Strings are wrapped inside double quotes (4).
Letter,Number,Calendar
"A",3.82,2025-01-01
"B",4.83,2025-01-02
"C",5.64,2025-01-03

Reading from CSV File to MonetDB table

We will read our CSV file into tblCSV table by this statement bellow. User executing this statement must have privilege to read the file.

• OFFSET 2 will skip the first row. In the first row we have the header. Header already exists in our MonetDB table, so we do not import it.
• tblCSV( Letter, Number, Calendar ) – this part is the table and the columns in which we are importing data.
• '/home/fffovde/Desktop/tblCSV.csv'( Letter, Number, Calendar ) ON SERVER – location of our file.
',' – this is separator between columns.
• E'\n' – this is delimiter between rows. '\n' is a symbol for LF (line feed).
'"' – strings are wrapped with double quotes.

COPY OFFSET 2 INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Letter, Number, Calendar )
ON SERVER
USING DELIMITERS ',', E'\n', '"';

SELECT * FROM tblCSV;
This is now our table.

Number of Records and Default Options

Before each insert, I will delete all the existing rows. Deletion of all of the rows can be done with TRUNCATE statement:
TRUNCATE tblCSV;

• 2 RECORDS – we can limit our import to only two rows (after the header). If we know in advance how many rows we will import, that will allow MonetDB to reserve space on disk in advance, so it will increase performance.

We can also notice, that we do not have to provide column names. In that case, we have to make sure that columns have the same order in the file, as in the database table.COPY 2 RECORDS OFFSET 2 INTO tblCSV
FROM '/home/fffovde/Desktop/tblCSV.csv' ON SERVER USING DELIMITERS ',';

There is also no need to always define columns separator and string wrapper. MonetDB will assume that columns separator is either \n or \r\n. It will also assume that nothing is used as a string wrapper. On the image above, double quotes are now considered as a part of a string.

Column Names

TRUNCATE tblCSV;

If we are providing column names of the CSV file, we always have to provide all of the column names, but we can give them fake names. In the example bellow we are importing only the column "Number". Other two columns will not be imported. Because "Letter" column has a default value "Z", all the values in that column will be "Z". "Calendar" column will be filled with nulls.

COPY OFFSET 2 INTO tblCSV( Number )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Dummy, Number, Dummy2 )

ON SERVER
USING DELIMITERS ',';

TRUNCATE tblCSV;

For next example I will add one more column to my CSV file ("Color"), and I will place Number column before Letter column. I will change columns separator to semicolon, and record delimiters to CRLF. I will remove string wrapers.

Colors, Number, Letter, Calendar

If we provide column names, we can use that fact to reorder the columns in correct order.
COPY OFFSET 2 INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Color, Number, Letter, Calendar )
ON SERVER
USING DELIMITERS ';', E'\n';

– This time, we are using semicolon as a separator between columns.
– We have not provided string wrappers, so MonetDB will take data as is.
– Column delimiter is defined as LF, it doesn't matter that our file is using CRLF delimiter. Symbol "\n" will work in both cases.
SELECT * FROM tblCSV;

TRUNCATE tblCSV;

If we don't provide column names for the CSV file, MonetDB will assume that CSV file has the same number of columns as the target table. In the example bellow, MonetDB assumes that our CSV file has three columns. MonetDB will raise an error that one column is leftover.

COPY OFFSET 2 INTO tblCSV( Letter, Number )
FROM '/home/fffovde/Desktop/tblCSV.csv' --x3
ON SERVER
USING DELIMITERS ';';

Column Separators and NULL-s

TRUNCATE tblCSV;

This time we will remove header, we will use pipe "|" as columns separator.
In the first row, I will place one surplus pipe at the end of the record (1).

We will take only two records. We don't have to skip header, because there is no header in the CSV file. Our separators are pipes. They are default, so we can omit "USING" clause altogether.
COPY 2 RECORDS INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Color, Number, Letter, Calendar )
ON SERVER;

The lone pipe (1), from the picture above, won't cause us any problems.
SELECT * FROM tblCSV;

TRUNCATE tblCSV;

We will replace column delimiters with dots. Dot is also used as a decimal point. When we use a dot as a decimal point, we have to place an ESCAPE sign before it. ESCAPE SIGN is backslash (1).

COPY 2 RECORDS INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Color, Number, Letter, Calendar )
ON SERVER
USING DELIMITERS '.';

We can see that the escape characters did their job of distinguishing decimal points.
SELECT * FROM tblCSV;

TRUNCATE tblCSV;

It is also possible to have string wrapper sign inside of a string. We can escape such sign by doubling it (1), or by placing an escape sign before it (2).

Notice, also, that I have placed the word NULL in the third record (3). By default this will be considered as a "null" in MonetDB

COPY OFFSET 0 INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Color, Number, Letter, Calendar )
ON SERVER
USING DELIMITERS ',', E'\n', '"';

If we offset zero or 1 row, that means that we are reading data from the start.

TRUNCATE tblCSV;

Sometimes we want to have backslash inside of a string as a literal. In that case we use "NO ESCAPE" subclause (1).

NULL is representing null value by default. It is possible to use any other representation (2).

COPY OFFSET 0 INTO tblCSV( Letter, Number, Calendar )
FROM '/home/fffovde/Desktop/tblCSV.csv'( Color, Number, Letter, Calendar )
ON SERVER
USING DELIMITERS ',', E'\n', '"'
NO ESCAPE       --backslash is considered literal (1)
NULL AS '';     --NULL is represented by empty string (2)

Fixed Width Columns

TRUNCATE tblCSV;

Beside using comma separated format, we can also use fixed width format. In this CSV file we have three columns – Letter, Number, Calendar, but each column has the fixed width. Letter is 4 characters wide, Number is 7 characters wide, Calendar is 10 characters wide.

COPY INTO tblCSV
FROM '/home/fffovde/Desktop/tblCSV.csv' ON SERVER
FWF ( 4, 7, 10 );

We are using FWF subclause to inform MonetDB about columns width.
Spaces in the columns will be stripped and only values will be imported.

Error Handling

TRUNCATE tblCSV;

This CSV file has an error that will prevent import. There is a letter in a number column (W), in the second row.

COPY INTO tblCSV
FROM '/home/fffovde/Desktop/tblCSV.csv'

ON SERVER
FWF ( 4, 7, 10 );
We will get an error, for this statement, and we will not import any rows.

COPY INTO tblCSV
FROM '/home/fffovde/Desktop/tblCSV.csv'

ON SERVER
BEST EFFORT

FWF ( 4, 7, 10 );
Solution is to use subclause "BEST EFFORT".
Only first and third rows are now imported.
Second row is not imported.

We can find the rows that are not imported by reading from the table:
SELECT * FROM sys.rejects;
This table will preserve its content until the end of a session, or until procedure statement "CALL sys.clearrejects();" is executed.

Loading Several CSV Files at Once

TRUNCATE tblCSV;

We will import three CSV files at once. We will offset the first record in each file, and then we will read only one next row. From each file we will import only one record, so in total we will insert three records.

COPY 1 RECORDS OFFSET 2 INTO tblCSV
FROM ( '/home/fffovde/Desktop/tblCSV1.csv'      

          , '/home/fffovde/Desktop/tblCSV2.csv'      
          , '/home/fffovde/Desktop/tblCSV3.csv' ) ON SERVER;

Loading Data on the Client Side

TRUNCATE tblCSV;

It is also possible to upload CSV file from the client. We will see how to do this by using python.

import pymonetdb  
connection = pymonetdb.connect(username="monetdb", password="monetdb", hostname="localhost", database="voc")  

handler = pymonetdb.SafeDirectoryHandler("/home/fffovde/Desktop") –this will designate this directory as safe connection.set_uploader(handler)                                  –we can place our files into this folder to upload them

cursor = connection.cursor()  
cursor.execute("COPY INTO tblCSV FROM '/home/fffovde/Desktop/tblCSV.csv' ON CLIENT BEST EFFORT FWF ( 4, 7, 10 );")
connection.commit()  

cursor.close()
connection.close()

Important thing is to register some folder from the client computer as the safe folder:
handler = pymonetdb.SafeDirectoryHandler("/home/fffovde/Desktop")
After that, we assign that folder as the folder from which we can upload files.
connection.set_uploader(handler)
Then we just execute our statement.
This is the result in our database table:


You can download CSV files from here:

Leave a Comment

Your email address will not be published. Required fields are marked *