OCR a PDF Document With Python and Tesseract

We have a PDF document that is essentially a picture: the pages are scanned images with no text layer. We want to turn it into a searchable document. We will use the Python pytesseract module to do the transformation.

WinPython

For this demonstration I will use the WinPython distribution. The download is about 813 MB, and when decompressed the archive becomes a roughly 4 GB folder. Choose wisely where to place this folder, because later moving or deleting it could take some time.

I am using the WinPython distribution because it comes with many modules preinstalled, which will save us some effort. It will also keep us away from dependency hell.

We can download WinPython from this page: "https://sourceforge.net/projects/winpython/files/". On that page, click the green button (1) and download the EXE file to your computer (2). This file is a self-extracting archive (3). Choose where you want the archive to be extracted. You will get a folder (4). In the subfolder "python-3.10.5.amd64" (5) of that folder, there is the "python.exe" file.

PATH Environment Variable

Windows has a small key-value "database", where each key-value pair is one environment variable. We can open the dialog with those variables by opening the Run dialog (shortcut Win+R) and typing "sysdm.cpl" (1). In the new dialog we go to the Advanced tab (2) and click the "Environment Variables" button (3); we then see a dialog with a list of all the key-value pairs (4,5) for environment variables. Some variables are available only to the logged-in user (6), while system variables are available to all users (7).

For example, there is the key-value pair TEMP = "C:\Windows\TEMP". This allows every program to easily find where the Windows temporary folder resides. If Microsoft changes the location of this folder, all other programs will still work correctly because they will read the new location from the environment variable. The name of a key is stable; only its value changes, which is why environment variables are used for communication between programs.
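As a quick illustration, here is a minimal Python sketch that reads the TEMP variable the same way any other program would (it assumes nothing beyond a working Python installation):

import os

# Read the TEMP environment variable, just like any other program would.
temp_folder = os.environ.get("TEMP")
print("Windows temporary folder:", temp_folder)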

There is an environment variable named PATH. We will change the value of this variable so we can easily use our WinPython. In the user variables (1), find the variable named "Path" (2). Click the Edit button (3), and then, in the new dialog, click the "New" button (4). This creates a place where we can paste our Python folder (5). The Python folder is the folder where the "python.exe" file resides.

If we don't have a Path variable, we click the "New" button (6) and fill the new dialog with the variable name and variable value (7). The Python location is now added to the Path variable.

Tesseract

Tesseract is open-source optical character recognition software; Python will use it for the OCR. First, we go to the address https://digi.bib.uni-mannheim.de/tesseract/ and download the software from there. We will download the latest version (1). After the EXE file is downloaded to our computer (2), we start the installation. During the installation we can choose (3) additional scripts (Latin, Cyrillic…) and languages (Serbian, Romanian…). The selected scripts and languages will be downloaded during installation (4).

The folder with the "tesseract.exe" file ( "C:\Program Files\Tesseract-OCR" ) should be added to our Path user environment variable. Now that both Python and Tesseract are on the path, we can check whether they are correctly installed. Go to the Windows command prompt and type the commands "python --version" and "tesseract --version". We will see which versions are installed on our computer.

There is no need to change the working directory inside the command prompt. Windows will find both Python and Tesseract because they are on the path. This is why we placed them in this environment variable in the first place: it makes these two programs much easier to find.
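The same check can be done from Python. This is a small sketch that only assumes both folders were added to Path as described above; shutil.which searches the Path variable just like the command prompt does:

import shutil
import sys

# shutil.which() searches the PATH environment variable,
# just like the command prompt does.
print("python.exe found at:   ", shutil.which("python"))
print("tesseract.exe found at:", shutil.which("tesseract"))
print("Python version:", sys.version)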

Pytesseract

Pytesseract is a Python module that communicates with the Tesseract software. We will use the "pip" module for its installation. Go to the command prompt and type the command "python -m pip install pytesseract" (1). After a successful installation we can check the installed module with the command "python -m pip show pytesseract" (2).
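To confirm that pytesseract can really talk to Tesseract, we can run a short test. This is only a sketch: it assumes the Pillow package is available (it is included in the full WinPython distribution; otherwise install it with "python -m pip install pillow"), and the image path is just a placeholder for any scanned picture you have at hand:

import pytesseract
from PIL import Image

# Ask Tesseract for its version through pytesseract.
print(pytesseract.get_tesseract_version())

# OCR a single test image (replace the path with your own file).
text = pytesseract.image_to_string(Image.open('C:\\Users\\Sima\\Desktop\\test.png'))
print(text)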

Poppler

Poppler is an open-source library for PDF rendering. Go to the web site https://blog.alivate.com.au/poppler-windows/ and download the latest Poppler build (1). The downloaded file is a 7z archive. Extract the files from that archive into a folder of your choice. Poppler's folder has a subfolder "bin" (2). We should add this "bin" folder to the Path user environment variable.

Pdf2image

Pdf2image is a Python module for interacting with the Poppler program. We will install it in the same way as the pytesseract module, using "python -m pip install pdf2image" (1) and "python -m pip show pdf2image" (2).
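As a quick test that pdf2image and Poppler work together, we can convert a PDF and count the resulting pages. The sketch below is only an illustration; the poppler_path argument is optional and is shown for the case when Poppler's "bin" folder was not added to the Path (the "C:\poppler\bin" location is just an example):

import pdf2image

# Convert a PDF into a list of page images.
# poppler_path can be omitted when Poppler's "bin" folder is on the Path.
pages = pdf2image.convert_from_path(
    'C:\\Users\\Sima\\Desktop\\OCRfile.pdf',
    poppler_path='C:\\poppler\\bin',
)
print('Number of pages converted:', len(pages))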

Python IDE

We finally have all the ingredients for our project. Now we will open our favorite Python IDE or text editor. If you don't have one, you can go to the WinPython folder. Inside it, there is a file "Spyder.exe". Double-click this file and you will get the Spyder IDE, where we can test our code.

Before running the script, we first tell Spyder which python.exe to use. Go to Tools > Preferences (1). In the new dialog, enter the location of our WinPython python.exe file under "Python interpreter > Use the following Python interpreter" (2).

Now we can paste our Python script into the editor pane (3) of Spyder and click the Run button (4). The OCR-ed text will appear in the console (5) and will also be saved as a TXT file on our desktop (6).

Python Script

This is the code we are using for our OCR. First, we transform our PDF into a sequence of images using the "pdf2image" module. For each page in that sequence, we apply the pytesseract image_to_string method. This creates our final result. At the end, we write the result to the console and to a TXT file.

import pdf2image
import pytesseract

# Convert each page of the PDF into an image.
images = pdf2image.convert_from_path('C:\\Users\\Sima\\Desktop\\OCRfile.pdf')
with open('C:\\Users\\Sima\\Desktop\\OCRtext.txt', 'w', encoding='utf-8') as file:
    for page_number, page in enumerate(images):
        # OCR the current page image into plain text.
        detected_text = pytesseract.image_to_string(page)
        print( detected_text )
        file.write( detected_text )
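If you selected additional languages during the Tesseract installation, they can be used through the lang parameter of image_to_string. The short sketch below repeats the conversion but asks Tesseract for its Serbian and Romanian models (language codes "srp" and "ron"):

import pdf2image
import pytesseract

# Same conversion as above, but OCR with the Serbian and Romanian language models.
images = pdf2image.convert_from_path('C:\\Users\\Sima\\Desktop\\OCRfile.pdf')
for page in images:
    print( pytesseract.image_to_string(page, lang='srp+ron') )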

Below, we can see how our PDF document looks as the original (1), in the console (2), and in the TXT file (3).

Here is the sample file. It contains the Python script, a sample PDF file, and the Poppler software.

Basic Type Functions in Power Query

In the "Data Types in Power Query" post we saw which types exist in Power Query. Now let's see which functions Power Query offers for dealing with types. Most of these functions are classified in the "Type" group, but the important ones are placed in the "Value" group.

Basic Functions

In the example below, the first column holds data of different types. All the other columns are created by using Power Query functions, except the "TypeName" column, which is created by our custom function. Each column is named after the function used to create it.

Value.Type(value as any) as type     

The Value.Type function receives a value from the first column and returns its type. In Power Query, types are displayed only as the green word "Type", so the results in the second column don't help us much. To reveal what type each value really is, we will use our custom function "TypeName". You can find that function in the sample file below, or you can see it in this post. This function takes the types from the second column and, for each "Type", reveals the actual type in the third column. Now we can see all the different types (type null, type text, type date…).

Value.Type( [Data] ) 
 
TypeName( [Value.Type] )
 
Type.Is( [Value.Type],type number )
 
Value.As( [Data], type number )

Value.Is( [Data], type number )

Type.Is(type1 as type, type2 as type) as logical

The Type.Is function compares two types and returns TRUE or FALSE. In our example we compared the types from the Value.Type column with type number. Obviously, only the cell in the eighth row returns TRUE, because 999 is a number.

Value.As(value as any, type as type) as any

The Value.As function returns the value from its first argument, but only if the type of that value is compatible with the type given as the second argument. If they are not compatible, the function returns an error. Because we used type number as the second argument, only the cell in the eighth row returns 999; all the other cells return an error.

Value.Is(value as any, type as type) as logical

The Value.Is function compares the type of a value with the type given as the second argument and returns TRUE or FALSE. In our example, we compare the types of the values in the first column with type number. This is why only the cell in the eighth row returns TRUE.

Is and As Operators

If we don't want to hard-code our data types, then we have to use the Value.As and Value.Is functions. If hard-coding the types is acceptable, then we can use the is and as operators. The expressions in the right column below create the same last two columns as in the example above.

Not Hard Coded                       Hard Coded
Value.As( [Data], type number )      [Data] as number
Value.Is( [Data], type number )      [Data] is number

Filtering Reports in Cognos

Levels of Filtering

We have source data (1). For each order we have the products on that order and their quantity. We made a report by grouping our data. We grouped the data by "Product" (2), then by "Product type" and "Product line" (3,3). We can filter the data at any of those three levels (1,2,3,3).

Filtering by Data Type

There are specialized dialogs for filtering dates, strings, and numbers. Dates can be filtered by range (1) or individually (2). Strings can be filtered by selection (3) or by condition (4). Numbers can be filtered by range (5) or individually (6).

Filtering by Using Multiple Columns

Besides filtering with one column, we also have the "Combined" and "Advanced" filters (1). A combined filter is just a combination of simple filters (2). First we filter some dates, strings, and numbers using the Cognos dialogs, and then we combine those filters. The filters can be connected with the logical operators AND, OR, and NOT.

The advanced filter allows the creation of really complex filters. Such filters are expressions made of columns, built-in functions, and parameters (3). The Cognos expression builder helps us create such expressions by listing all the functions and columns that we can use. We just click on the column, function, or parameter that we want to appear in our expression. For each function used, we get an explanation of its usage, arguments, and return value (4).

Filtering by Parameters

It is possible to parameterize a report filter. Parameters can be used in an advanced filter, where we refer to a parameter by its name delimited with question marks (1). The user can provide values for these parameters before the report runs, but doesn't have to; a default value can be used instead.

Parameter values can be provided by using a variety of different prompts (2).

Other Functionalities Pertaining to Filtering

It is possible to show the user which filters are used in the report (1). We can control which filters will be presented in this way and which will not. This helps the user understand what data is shown in the report.

After the report is created in HTML format, the user can further filter the data by using different prompts. It is not necessary to create explicit prompts for this task. We can simply allow the user to change or delete some of the filters, and that ability will appear in the HTML report in a Filters pane (2). This ability is only available for simple and combined filters, not for advanced filters.

Video that Explains How to Use Cognos Filters

Watch the video to see all the different ways to filter data in a Cognos report.

Missing Crosstab Header in Cognos Analytics

By default, crosstabs created in Cognos reports will not have the names of the columns in the dimensions header area (1). This can make a crosstab harder to understand, especially when it has many dimension columns. We want our dimension columns to have their names nicely presented (2).

Let's solve this problem. From the toolbox (1), on the advanced tab, we drag the "Crosstab space" element (2) between the rows header area and the data area (3). This creates an empty column (4). Because we have two columns in the rows header area, we repeat this action once more (5).

Now we have to move the right part of the crosstab (1) next to the left part (2), like Tetris pieces. For that to succeed, we just need to delete the empty cells that we no longer need (3).

First, we select the unneeded cells (1), and then in the Properties pane we find the "Box type" property (2). When we change the value of this property to "None", the selected cells disappear. The header now has the desired shape (3).

The last thing to do is to add the column names into their places. Before that, we unlock our report (1) by using the padlock in the upper right part of the Cognos window. Now it is possible to drag the "Text Item" object (2) from the toolbox into the designated cell. At that moment, a dialog (3) appears, and we fill it with the name of the column. We have created the name for our first column (4), and we then do the same for the second column.

For aesthetic reasons, we move the column names to the bottom of the cells (1), using a command from the toolbar. Finally, we have the header the way we like it (2), and all the columns are properly labeled.

SUMMARIZE Function in DAX

Simplest usage

In its simplest form SUMMARIZE can be used to get distinct values from several columns. We have sample data in the table named SampleTable (1).

By using SUMMARIZE we can get all distinct combinations of values from the columns Region and Rep (2). This is the same as what happens when we place those two columns into a pivot table (3).
This is the formula we used to create table (2):
EVALUATE
SUMMARIZE( SampleTable, [Region], [Rep] )
ORDER BY [Region], [Rep]

SUMMARIZE can work with columns from related tables. We added a table (1) with unique values for representatives, and we made a relation (2) between that table and the original "Rep" column. Now we can still use the SUMMARIZE function, this time with the new RepDetached column. Note that this only works for 1:N relations, where the table mentioned in the SUMMARIZE function is on the "N" side of the relation.

The result will be the same (3):
EVALUATE
SUMMARIZE( SampleTable, [Region], [RepDetached] )
ORDER BY [Region], [RepDetached]

SUMMARIZE With Aggregations

By using aggregations, we can add summary columns to the final table. We can aggregate values from the current table and from related tables, and we can give those new columns names.

EVALUATE
SUMMARIZE( SampleTable, [Region], "SumOfUnits", SUM( SampleTable[Units] )
           , "Representatives", DISTINCTCOUNT( RepDetached[RepDetached] ) )
ORDER BY [Region]

ROLLUP

ROLLUP is a helper function that provides the grand total and all the other subtotals for our hierarchy.

We are grouping the table by Region and Item. We get a subtotal for each Region (1), and we get a grand total (2). They are added as new rows at the bottom of the table.
EVALUATE SUMMARIZE( SampleTable, ROLLUP( [Region], [Item] )
, "SumOfUnits" , SUM( SampleTable[Units] )    )


In this case, the order of the columns is important, because subtotals are calculated following the hierarchy from left to right.

ROLLUPGROUP

ROLLUPGROUP is another helper function, used in combination with the ROLLUP function. Its purpose is to exclude the subtotals for some columns from the final result of SUMMARIZE.

This time we group by three columns. We always get the grand total (2) when we use ROLLUP. We left the "Region" column outside the ROLLUPGROUP function, which means we get subtotals for the "Region" column (1). For the "Rep" and "Item" columns there are no subtotals, because those columns are inside the ROLLUPGROUP function.

EVALUATE
SUMMARIZE( SampleTable, ROLLUP( [Region], ROLLUPGROUP( [Rep], [Item] ) )
, "SumOfUnits", SUM( SampleTable[Units] ) )

ISSUBTOTAL

ISSUBTOTAL is a function that tells us whether the current row is a subtotal row for some column. It receives as an argument one of the columns included in the ROLLUP function. If that column is aggregated into a subtotal in the current row, then ISSUBTOTAL returns TRUE in that row.

The formula below creates a rollup for the "Region" and "Item" columns.

EVALUATE SUMMARIZE( SampleTable, ROLLUP( [Region], [Item] )
, "Region Subtotal", ISSUBTOTAL( [Region] )
, "Item Subtotal", ISSUBTOTAL( [Item] )
, "Grand Total", ISSUBTOTAL( [Region] ) && ISSUBTOTAL( [Item] )     )

For those two columns we can find out whether they are subtotaled in the current row. We can see that ISSUBTOTAL( [Item] ) returns TRUE in the subtotal rows (1). ISSUBTOTAL( [Region] ) returns TRUE only in the grand total row (2). In the grand total row, both columns return TRUE.

Filter and Row Context

SUMMARIZE can also create new columns in row context. Only the columns used for grouping can be referenced (Region and Item).

In the image, we can see that besides creating an aggregate column (1), we can also create a column (2) that doesn't use aggregation. This column is made by using row context. The formula for creating our table is:

EVALUATE
SUMMARIZE( SampleTable, [Region], [Item]
, "Region+Item", [Region] & "+" & [Item]
, "SumOfUnits", SUM( SampleTable[Units] ) )

Nested grouping

We first group our data by the columns "Region" and "Item" (1). Then we want to group table (1) by the "Region" column (2).

We can't simply nest SUMMARIZE functions. SUMMARIZE expects a fully qualified column name for any column used in an aggregation. This means that the outer SUMMARIZE function would not be able to accept the "SumOfUnits" column.

The easiest way to solve this problem is to use another grouping function, GROUPBY, as the outer function. This function accepts the "SumOfUnits" column name without a table qualifier.

EVALUATE
VAR SummarizeFunction = SUMMARIZE( SampleTable, [Region], [Item]
                                 , "SumOfUnits", SUM( SampleTable[Units] )    ) 
VAR GroupByFunction = GROUPBY( SummarizeFunction, [Region]
                             , "SumOfUnits2", SUMX( CURRENTGROUP(), [SumOfUnits] )   )
RETURN GroupByFunction 

If we do not use an aggregated column in the SUMMARIZE function, then nesting is possible:

EVALUATE
SUMMARIZE( 
   SUMMARIZE( SampleTable, [Region], [Item] )
   , [Region]
) 

How does SUMMARIZE function work?

Most of us know how the Advanced Filter in Excel works.

The table (1) and the criteria table (2) are used by the Advanced Filter (3) to produce the filtered table (4). The important thing here is the criteria table. The criteria table says that only rows that have the values (A,1) or (B,1) in the columns (Col1,Col2) should pass the filter.

This "criteria table" is how filter context is defined in Power BI: filter context is a table. What filter context table do we expect to be used when aggregate values are calculated in the SUMMARIZE function? We expect a filter context table that has only the columns used for grouping. We could graphically represent this like:

But this is not how the SUMMARIZE filter works. To our surprise, this is how SUMMARIZE really creates its filter: SUMMARIZE first filters the original SampleTable by region "Central" and item "Pen Set" (1). After this, the whole table segment (2) is used as the filter context.

The reason SUMMARIZE works like this is that it has the full ability to control the filter context of the aggregated values with the CALCULATE function. We can disable or modify some of the filters.

Example

Let's assume that, because of government regulations, we cannot sell the "Desk" item in the "East" region and we cannot sell the "Pen Set" item in the "West" region. We want to know each region's share of all the units sold, but counting only the items that are allowed in that region. We can accomplish this by using only one function: SUMMARIZE.

In the image below, we can see that the "Central" region sells all 5 items (column "B"). That is why we compare its sales with 2,121 units (column "D"), which is the total number of units for all the regions. For "East" and "West" this number is 2,111 and 1,726, because those totals are corrected to exclude the items that the "East" and "West" regions are not allowed to sell.

EVALUATE
SUMMARIZE( SampleTable
, [Region]
, "Distinct items", DISTINCTCOUNT( SampleTable[Item] )
, "Units Total for Region", SUM( SampleTable[Units] )    
, "Total for Distinct Items"
  , CALCULATE( SUM( SampleTable[Units] ), ALLEXCEPT( SampleTable, SampleTable[Item] ) )
, "Ratio", SUM( SampleTable[Units] )    
  / CALCULATE( SUM( SampleTable[Units] ), ALLEXCEPT( SampleTable, SampleTable[Item] ) )
) 
ORDER BY [Region]

In our formula, the highlighted ALLEXCEPT part removes all filters except the filter on items when calculating the total sale. So, we don't filter the "Region" column in our filter context. In the image below we can see how this influences the filter context for the "West" region. For the "West" region, the filter context is the whole SampleTable with the rows that contain the "Pen Set" item removed (1). It is similar for "East" (2). For "Central" we don't filter the SampleTable at all (3). That is why column "D" has the biggest value for the "Central" region and smaller values for the "West" and "East" regions.

Performance Consequence

Because of the complexity of how the SUMMARIZE function creates its filter context, its performance will not be impressive. We should use this function only when we need distinct rows for several columns (the simplest case) or when we want to modify the filter context (the most complex case).

The sample file can be downloaded from here: