Filtering data in SPSS

Our table has data about cable television users. We have ID of the customers (1), their gender (2), age group (3) and education level (4). Column "reside" (5) shows how many people live in the household. Column "service_usage" (6) shows what kind of cable service they use. We will see how to filter this data in different ways in SPSS program.

On the image on the right we can see that filtering dialog is opened by clicking on Data > Select Cases (1). This opens dialog where we can select columns to filter (2), we can choose how to filter (3), and we can decide what will happen with filtered data (4).

5 ways to filter data

All cases

"All cases" option means that there will be no filtering. This is the default.

If condition is satisfied

This option opens dialog where we can define formula that will filter data. We can click on column names (1), math and number symbols (2), and we can choose some of built-in functions (3). What we click, will appear in pane (6). We can also type by hand what we want in pane (6). Whole expression can be typed manually, but it is more easier to select elements of expression. Pane (5) will filter functions presented in the pane (3). This will help us to find function we want. When we select one of the functions in the pane (3), we can see its syntax and description in the pane (4).

Now that we created our filter (7), we will choose to "Filter out unselected cases" (8). This option will not hide or delete filtered data (9), but will only mark it as filtered (10). Although still visible, marked rows will not be included in SPSS calculations. This way we don't loose any data and we can after, apply some other filter on our table.

Random sample of cases

Button "Sample" opens dialog where we can use one of two possible random filters. First filter (1) will choose some percent of all the cases randomly. For this to show, I will use smaller table that has only 10 rows (2). If we choose to filter 10% of rows, only one row will be left (3). Other option (4) is to randomly select a limited number of rows from the specified number of first cases. We choose to select 3 rows from the first 5 rows, so only rows 1, 2 and 5 will be selected (5).

Base on time or case range

We can just choose one continuous range of cases. On image left, we can see (1) that we have chosen all cases between case 3 and 7 inclusively. All other cases will be crossed out (2).

Use filter variable

This option asks from us to select one of the columns (1). This column should have rows without data (2). All the rows where that columns doesn't have data will be filtered (3).

What will happen with filtered data?

Filtered data can be marked as filtered out and such rows will be excluded from further calculations (1). There are two more options. First is to create new Dataset(2). SPSS will open a new window. That window will show all the rows, but unselected rows will be crossed out. Other possibility is to delete cases that don't pass filter condition. Our Dataset will be then reduced by deleting not needed rows (3).
You should be careful when deleting the rows because deleting can not be undone.

Sample data can be downloaded from here:

VBA to connect to Power BI Desktop

We have our PBIX file opened (1). Our Excel is also opened. We click on button in the ribbon (2) and new pivot table is added to active sheet (4). This pivot table is connected to our PBIX file. Now we can easily create pivot table by using data from Power BI Desktop.

After closing Power BI Desktop, our pivot table will be non interactive. We can open PBIX file again, but this time Power BI Desktop will have some other credentials. This mean that our pivot table will still be non interactive. We have to click on button (3) "Reconnect" to establish connection again. Now we can use our pivot again.

This post will not explain how to create buttons (2,3). It will only explain VBA code needed to achieve described functionality. Code is simple, so it will work with only one PBIX file. Name of that file is hardcoded. This code has no error handling.

Structure of code

On the top we have two procedures "subCreateNewPivot" (1) and "subReconnect" (2). Beside them we have two groups of functions. "Red Orange" group is used to provide credentials of our PBIX file. "Green" group is used to create pivot table. "subCreateNewPivot" is using both groups, but "subReconnect" is using only the "red orange" one.

As explained in one of previous posts, each time Power BI Desktop is opened, it has different credentials and it writes them into location:
C:\Users\<Username>\AppData\Local\Microsoft\Power BI Desktop\AnalysisServicesWorkspaces (1)
That location has several subfolders (2), each for any of opened Power BI Desktop files. Sometimes there are subfolders that belong to already closed PBIX files, but this subfolders are incomplete and they will be deleted automatically after some time. Inside those subfolders there are two important files. File (3) has name of database in its name ( 601ae6aa-a3c6-4abb-8833-2253863854e4 ). File (4) is ordinary TXT file and it has port number of Power BI Desktop server as its only content.

There are two problems with this setup:
1) Credentials are changed each time Power BI Desktop is opened.
2) We don't know which credentials belong to which opened PBIX file.

Solution is to read all available credentials with function "funcAllCredentials". This function is using two helper functions "funcPort" and "funcDatabaseName". Those helper functions reads values of port number and database name from computer disk. This credentials are then given to function "funcPowerBIpivotDataCredentials" which have to decide which of credentials are the ones we are looking for. We are looking for credentials of "PowerBIpivotData" file. In order to identify correct credentials we are going to read values from tables that we previously created in all of our Power BI Desktop files. Each our PBIX files has table named "NameValue" that looks like (1,2). We will use function "funcPBIXname" to read from each open PBIX file from such table. If, by using some of credentials, we read the value "PowerBIpivotData" (1) from some of the opened PBIX files, that would indicate that those are credentials we need.

Credentials are all we need for "Reconnect" functionality. Procedure "subReconnect" will look for connection that has name starting with "PowerBIpivotData". This indicate connection that was created before, to create pivot tables. Procedure will then update this connection and its name with new credentials. After this, pivot tables will become interactive again.

Pivot table creation

"New pivot" button also needs information about credentials. This button will use those credentials to create new pivot table. Top procedure "SubCreateNewPivot" will call procedure "subNewPivot", and that procedure will create new pivot table, by using provided credentials. First, procedure subNewPivot will chech whether connection already exists. If we previously created some pivot table, then we already have connection and we don't have to create new one. We are using helper function "funcConnectionMissing" to check existance of such connection. If this is not true, then we will create new connection.

New connection will be used to create new pivot table. New pivot table will be created in the top left cell of user selection in the spreadsheet.

VBA code

VBA code is long enough to not be shown in the text of this blog. Bellow is link for downloading Excel file that has all VBA code and buttons in its ribbon tab. PBIX files are also included.

Python CSV external tool for Power BI desktop

This is explanation how, with python, to make an external Power BI Desktop tool that will export your DAX query to CSV. In Power BI desktop ribbon we have a button (1) "Export to CSV". This button will open window (2) where we have to type our DAX query. We already have template in this dialog to help us with writing. After this, we click on button (3) "Make CSV!". If something is wrong we will get message (4). If everything is OK, new file will appear on our desktop. That file will hold data from our query (5). Let's see how to make such external tool.

First step is to create JSON file which will tell Power BI desktop how to create button and what that button should do. We can see bellow sample of such file. Version is just a version of our tool. Just leave it as "1.0". Name is text that will appear bellow button. We can add some description of our tool. Path is fullpath to python executive file. Use backslashes and double them. Arguments has two parts. First part is fullpath to our python script. Second part are arguments that Power BI desktop will provide for our script. "Server" is address of our server ("localhost:57800"), and "database" is name of our database ("c6fb254f-f146-464d-99cf-b2a6e322eb38"). Our script will use name of a server for its execution. Last element is image in base64 format. Base64 format is textual presentation of an image. This format has two parts. First part is prefix, and prefix is always the same. Second part is base64 code. We can go to web site base64.guru and there we can upload our icon. Result will be long text that is base64 presentation of our image. This text will be really long, but all of it should be placed in JSON file, after the prefix.

{
  "version": "1.0",
  "name": "Export to CSV",
  "description": "Export user's DAX to CSV file.",
  "path": "G:\\WPy64-3950\\python-3.9.5.amd64\\pythonw.exe",
  "arguments": "C:/Users/Sima/Desktop/ExportToCSV.pyw \"%server%\" \"%database%\"",
  "iconData": "data:image/png;base64,iVBORw.......0KGgo"
}

Notice that instead of "python.exe" and "ExportToCSV.py" we are using version with "W" – "pythonw.exe" and "ExportToCSV.pyw". This is because we want to avoid black command prompt window (1) to be opened in background of our window.

Fullpath to our python script doesn't have spaces in it. If we have fullpath with spaces, we have to delimit that fullpath with quotes, but those quotes has to be escaped. In that case "arguments" would loook like:
"\"C:/Users/Sima/Desktop/ExportToCSV.pyw\" \"%server%\" \"%database%\""
We can see that arguments "server" and "database" were already delimited this way because they have "%" sign in them.
This is not needed for "path" element.

This JSON file should be placed on this address. If last two folders (\Power BI Desktop\External Tools) don't exist, create them and place JSON file inside. Name of our JSON file has to end with ".pbitool.json".
C:\Program Files (x86)\Common Files\Microsoft Shared\Power BI Desktop\External Tools\ourtool.pbitool.json

When we click on button in ribbon, python script mentioned in JSON file, will be called. This script has two parts. One part is about using python to create GUI window. The other part is explained bellow. This part is used to connect to Power BI desktop, read data and write them in CSV file. str(sys.argv[1]) is a python way to get first argument sent to script. This is name of server ("localhost:49380"). This address will be used inside connection string. By using connection string we will connect to PBID and we will execute our DAX.

Python will return names of columns as Table[ColumnName]. We will use regex to leave only "ColumnName" part.

Last part is where we read fullpath of desktop to create new file name. After this, all we need is to write data in CSV file by using python CSV modul.

    conn = adodbapi.connect("Provider=MSOLAP;Data Source=" + str(sys.argv[1]) + ";Initial Catalog='';")
    cursor = conn.cursor()
    cursor.execute(DAX)

    #fix names of columns. By default they contain table name and brackets.
    NamesOfColumns = [desc[0] for desc in cursor.description]
    CorrectedNamesOfColumns = [re.findall('\[(.*?)\]', NoC)[0] for NoC in NamesOfColumns]

    #write to CSV
    FileName = os.path.join(os.path.join(os.environ['USERPROFILE']), 'Desktop') + '\CSVfromPBID.csv'
    with open(FileName, 'w', newline='') as f:
        csv.writer(f, delimiter=',').writerow(CorrectedNamesOfColumns)
        csv.writer(f, delimiter=',').writerows(cursor)

Whole python script, JSON file and PBIX sample file are here. Don't forget to change fullpaths inside JSON file.

Local Image Server for Power BI Desktop

If we create table in Power BI desktop that has images URL-s in one column (1), this will allow us to present those images in Report View (4). For this to work, we have to mark this column as "Image URL" column (2). First we select our column (1) and then we go to Column tools > Properties > Data category. There, we choose option "Image URL" (2). Now, that our column is of "Image URL" type, we can add this column as a Field to a table visual (3). Final result is table, in Report View (4), that has image in every cell of table presented.

If we don't have URL-s for our images, but the images are available in some folder on our computer or network, we have to raise our local image server. Here, I will explain two ways how to do that easily.

Python http.server modul

We will open windows CMD environment. This CMD environment has to stay open for the time we are using our images in Power BI Desktop. When we close this environment, images will not be available any more. Then, first we have to change current folder to folder with images (1). Second, we have to open python "http.server" module (2). This module asks for port number. We can type anything, but for this example I used port number 21.000.

(1)
>cd "c:\Folder with images" 
(2)
>pyhton.exe -m http.server 21000 

All images in folder are now available as URL links. On our computer we can approach those links with "localhost" and our port number (1). On our computer, and all other computers on local network, we can open page with links by using IP address of our computer and our port number (2). Access from local network can be limited by firewall, so we need to have rights to access computer where python server works.

http://localhost:21000 (1) 
http://192.168.0.103:21000 (2)

Links to images will be generated as "http://localhost:21000/" + "Name of image.jpg". If we have names of all our images in Excel, we just have to prefix them with "http://localhost:21000/" and then we will have the links to all our images. Those links can be imported in Power BI Desktop and used in Report View.

http://192.168.0.103:21000/asters-gda84d7295_640_result.jpg

Mongoose server

There is a very simple server that can be used for the same purpose. Its name is "Mongoose server" and can be downloaded from https://mongoose.ws/desktop-app/. This is small portable executable file. Just click on it to open it.

New icon will appear in system tray. Right click this icon to access main menu. With option (1) we can change folder with images that are publish online. With (2) we can open browser and jump directly on page where all our images are listed. Link for this page and for each of images will be similar to one that python web server created.

http://192.168.0.103:8000
http://192.168.0.103:8000/flowers-g0eaf923f2_640_result.jpg

We can see bellow how page with image links looks like (1). On the right side (2) we can see that Mongoose server can share our images globally. This option has to be enabled by clicking on (3) on the image above. In this case sharing is done through Mongoose server so the link is something like "https://32cf61d37e7299.share.mongoose.ws". Global sharing of our images is not something that we want in this case and could present security risk because now anyone on world can access our images. It is best to turn this option off.

Here is PBIX sample file and one folder with collection of images.

How to create new variables in SPSS

We have three columns with the scores of a judges (1). We want to create new column that will present an average score (2). For this, we are using option Transform > Compute Variable (3). This will open form where we can define new variable.

In new dialog, first thing is to give name to new variable (1). Button (2) can open small dialog where we can define label (3) for our variable. That label is just description of a variable. We can also define our variable as Numeric or String type (4). In this case, we will choose "Numeric". Pane (6) is where we create expression for new variable. This expression is comprised of column names, mathematical operators, builtin functions and constants. We can type the whole expression from our keyboard, but instead of typing we can just select some building blocks for our expression from other areas in this dialog, and they will appear in pane (6).

Pane (5) has names of all the columns. Pane (9) has names of functions. Virtual keyboard (7) allows us to type different signs and numbers by using mouse. We can use all those elements to create desired expression in pane (6).
By selecting some of functions groups in pane (8), we can limit what functions will be presented in pane (9).
When some of functions in pane (9) is selected, we will be able to see its syntax and description in pane (10).
We created formula (6). The formula "MEAN (judge1, judge2, judge3)" would give us the same result.

When we click on column name, that column name will appear in our expression: By using virtual keyboard we can add different operators and numbers to our expression.
Clicking on some of functions will add syntax for that function in expression pane. Then we'll have to replace question marks with functions arguments. Those arguments can be column names, constants or subexpressions.

In virtual keyboard we have signs "**", "&", "|". Sign "&" is for logical AND. Sign "|" is for logical OR. When we write "2**3", it is the same as "23".

Conditional variables

If some of values, in columns that we used to calculate our new variable, are missing, then the value in the same row will not be calculated for our new variable (only for our original expression, it seems that builtin MEAN function is resistant to this). Missing value will be presented with a dot.
If we have all the values in columns judge1, judge2, judge3, it is still possible to make variable with missing values. Example bellow shows how to do it.

Now we have one column more (1). This column shows, whether scores belong to junior or senior competitor. We want to create new variable that will present maximal value, but only for junior competitors (2). We can achieve this with conditional variables. First we create formula for our variable. We are going to use MAX function (3). Then we have to define condition. Place for that is opened by clicking on "If" button on the bottom of the main dialog (4).

MAX(judge1,judge2,judge3)       (3)

We will be greeted with a dialog where on the top we have to select option "Include If case satisfies condition" (1).

In pane (2) we have to enter condition. In every row where this condition is satisfied, new variable will have a value, and vice versa. Way of creating condition is similar as way of creating expression for new variable. We have columns (3), functions (5) and buttons (4) to help us create this boolean expression.
Our junior competitors are coded with number "0" (Seniors are "1"). This makes our conditional expression as simple as:

Seniority = 0

This is all if we want only "Junior" rows filled with values for new variable.

After this, we can create new variable with the same name (1). This variable will calculate maximal score for judge2 and judge3 (2), but only for Senior competitors (3). For some reason, we will consider scores of judge1 invalid for Senior competitors. SPSS will ask us whether we want to change existing variable (4). We'll click on YES. SPSS will not overwrite old column with new column. It will combine them. We can see in (5) that all rows are populated. Rows 1, 2 and 5 are for Junior competitors and they show maximum for "judge1,judge2,judge3". Rows 3 and 4 are for Senior competitors and they show maximum for "judge2,judge3". In row 3 we can see that the value in "Maximal" column is 8.8, which is maximum for columns "judge2,judge3". Total maximum is 8.9, but that maximum is in column judge1 that does not count.

We can continue to add more conditional variables with the same name. Each new variable will overwrite only results which satisfy its condition. If we add third variable with name "Maximal" and condition "judge2>8", it will only overwrite values in rows where this condition is fullfilled.

Two caveats to consider

  1. When we click on some function, its syntax will be added to expression pane. If we click on MAX function, syntax like this will be added "MAX(?,?)". This doesn't mean that this function can have only two arguments. It can have infinite number of arguments "MAX(a,b,c,d,e…)", but we have to type other arguments by hand. Same for many other functions.
  2. When we create new variable, we can give it the same name as one of original columns. For example, we have original column "judge1", and we create new variable with the same name "judge1". SPSS will ask us whether to change existing variable. If we click on YES, we are going to lose original column. New variable will overwrite original variable.

Here is a sample file: