By default, crosstabs created in Cognos reports do not show column names in the dimensions header area (1). This can make a crosstab harder to understand, especially when it has many dimension columns. We want our dimension columns to have their names nicely presented (2).
Let's solve this problem. From the toolbox (1), on the advanced tab, we will drag the "Crosstab space" element (2) between the rows header area and the data area (3). This will create an empty column (4). Because we have two columns in the rows header area, we will repeat this action once more (5).
Now, we have to move the right part of the crosstab (1) close to the left part (2), like Tetris pieces. For that to succeed, we just need to delete the empty cells that we no longer need (3).
First, we have to select the unneeded cells (1), and then, in the Properties pane, find the property "Box type" (2). When we change the value of this property to "None", the selected cells will disappear, and our header gets its desired shape (3).
The last thing to do is to add the column names into their places. Before that, we will unlock our report (1) by using the padlock in the upper right part of the Cognos window. Now it is possible to drag a "Text Item" object (2) from the toolbox to the designated cell. At that moment, a dialog (3) will appear, which we fill with the name of the column. We have created the name for our first column (4), and we will do the same for the second column.
For aesthetic reasons, we will move the column names to the bottom of the cells (1), using a command from the toolbar. Finally, we have our header the way we like it (2), and all the columns are properly labeled.
In its simplest form, SUMMARIZE can be used to get distinct values from several columns. We have sample data in the table named SampleTable (1).
By using SUMMARIZE, we can get all distinct combinations of values from the columns Region and Rep (2). This is the same as what happens when we place those two columns into a pivot table (3).
This is the formula we used to create table (2):

EVALUATE
SUMMARIZE( SampleTable, [Region], [Rep] )
ORDER BY [Region], [Rep]
SUMMARIZE can work with columns from related tables. We added a table (1) with unique values for representatives, and we made a relation (2) between that table and the original "Rep" column. Now we can still use the SUMMARIZE function, this time with the new RepDetached column. Note that this will only work for 1:N relations, where the table mentioned in the SUMMARIZE function is on the "N" side of the relation.
The result will be the same (3):

EVALUATE
SUMMARIZE( SampleTable, [Region], [RepDetached] )
ORDER BY [Region], [RepDetached]
SUMMARIZE With Aggregations
By using aggregations, we can add summary columns to the final table. We can aggregate values from the current table and from related tables, and we can give new names to those columns:
EVALUATE
SUMMARIZE( SampleTable, [Region], "SumOfUnits", SUM( SampleTable[Units] )
, "Representatives", DISTINCTCOUNT( RepDetached[RepDetached] ) )
ORDER BY [Region]
ROLLUP
ROLLUP is a helper function that provides the Grand Total and all the subtotals for our hierarchy.
We are grouping the table by Region and Item. We will get a subtotal for each Region (1), and we will get the grand total (2). They will be added as new rows at the bottom of the table.

EVALUATE
SUMMARIZE( SampleTable, ROLLUP( [Region], [Item] )
, "SumOfUnits", SUM( SampleTable[Units] ) )
In this case, the order of the columns is important, because subtotals will be calculated following the hierarchy from left to right.
ROLLUPGROUP
ROLLUPGROUP is another helper function, used in combination with the ROLLUP function. The purpose of ROLLUPGROUP is to exclude subtotals for some columns from the final result of SUMMARIZE.
This time we have three columns for grouping. We always have the Grand Total (2) when we use ROLLUP. We left the "Region" column outside of the ROLLUPGROUP function, which means we will get subtotals for the "Region" column (1). For the "Rep" and "Item" columns there will be no subtotals, because those columns are inside the ROLLUPGROUP function. A sketch of such a query is shown below.
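The original shows the query as an image; a minimal sketch of what it could look like (the alias "SumOfUnits" is an assumption carried over from the earlier examples):

EVALUATE
SUMMARIZE( SampleTable
, ROLLUP( [Region], ROLLUPGROUP( [Rep], [Item] ) )
, "SumOfUnits", SUM( SampleTable[Units] ) )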
ISSUBTOTAL is a function that tells us whether the current row is a subtotal row for some column. This function receives as its argument one of the columns included in the ROLLUP function. If that column is aggregated by a subtotal in the current row, ISSUBTOTAL returns TRUE for that row.
The formula below makes a rollup for the "Region" and "Item" columns.
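The original presents the formula as an image; a sketch of what it could look like, where the alias names "IsRegionSubtotal" and "IsItemSubtotal" are assumptions:

EVALUATE
SUMMARIZE( SampleTable
, ROLLUP( [Region], [Item] )
, "SumOfUnits", SUM( SampleTable[Units] )
, "IsRegionSubtotal", ISSUBTOTAL( [Region] )
, "IsItemSubtotal", ISSUBTOTAL( [Item] ) )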
For those two columns, we can find out whether they have a subtotal in the current row. We can see that ISSUBTOTAL( [Item] ) returns TRUE in the subtotal rows (1). ISSUBTOTAL( [Region] ) returns TRUE only in the Grand Total row (2), where both columns return TRUE.
Filter and Row Context
SUMMARIZE can create new columns in row context. Only the columns used for grouping (Region and Item) can be used in such columns.
In the image, we can see that besides creating an aggregate column (1), we can also create a column (2) that doesn't use aggregation. This column is made by using row context. The formula for the creation of our table is shown below.
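The original shows this formula as an image; a minimal sketch, where the concatenated column and its name "RegionItem" are assumptions:

EVALUATE
SUMMARIZE( SampleTable, [Region], [Item]
, "SumOfUnits", SUM( SampleTable[Units] )
, "RegionItem", [Region] & " - " & [Item] )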
We first group our data by the columns "Region" and "Item" (1). Then, we want to group the resulting table (1) again, this time by the "Region" column (2).
We can't nest SUMMARIZE functions. SUMMARIZE expects a fully qualified column as the column used in an aggregation. This means that an outer SUMMARIZE would not be able to accept the "SumOfUnits" column.
The easiest way to solve this problem is to use another grouping function, GROUPBY, as the outer function. This function will accept the "SumOfUnits" column name without a table qualifier, as sketched below.
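A sketch of this two-step grouping (the alias "TotalUnits" and the exact aggregations are assumptions based on the description above; GROUPBY requires its aggregations to iterate over CURRENTGROUP):

EVALUATE
GROUPBY(
    SUMMARIZE( SampleTable, [Region], [Item]
    , "SumOfUnits", SUM( SampleTable[Units] ) )
, [Region]
, "TotalUnits", SUMX( CURRENTGROUP(), [SumOfUnits] ) )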
Most of us know how the Advanced Filter in Excel works.
The Table (1) and the Criteria table (2) are used by the Advanced Filter (3) to produce the filtered table (4). The important thing here is the Criteria table. It says that only rows which have the values (A,1) or (B,1) in the columns (Col1,Col2) should pass the filter.
This "Criteria table" is how filter context is defined in Power BI. Filter context is a table. What filter context table do we expect is used, when aggregate values are calculated in SUMMARIZE function? We expect that filter context table has only columns that are used for grouping. We could graphically represent this like:
But this is not how the SUMMARIZE filter works. To our surprise, this is how SUMMARIZE really creates its filter: SUMMARIZE will first filter the original SampleTable by region "Central" and item "Pen Set" (1). After this, the whole table segment (2) will be used as the filter context.
The reason SUMMARIZE works like this is that it gives us full control over the filter context of the aggregated values through the CALCULATE function. We can disable or modify some of the filters.
Example
Let's assume that, because of government regulation, we cannot sell the "Desk" item in the "East" region, and we cannot sell the "Pen Set" item in the "West" region. We want to know the share of each region in all the units sold, but counting only items that are allowed in that region. We can accomplish this by using only one function: SUMMARIZE.
In the image below, we can see that the "Central" region sells all 5 items (column "B"). That is why we compare its sales with 2.121 units (column "D"), which is the total number of units for all regions. For "East" and "West" this number is 2.111 and 1.726, respectively, because those totals are corrected to exclude the items the "East" and "West" regions are not allowed to sell.
In our formula, the blue part removes all filters except the filter on items when calculating the total sale. So, we don't filter the "Region" column in our filter context. In the image below, we can see how this influences the filter context for the "West" region: its filter context is the whole SampleTable with the rows containing the "Pen Set" item removed (1). It is similar for "East" (2). For "Central", we don't filter SampleTable at all (3). That is why column "D" has the biggest value for the "Central" region and smaller values for the "West" and "East" regions.
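The formula itself is shown as an image in the original; one way to express the described calculation, where ALLEXCEPT plays the role of the blue part and the alias names are assumptions:

EVALUATE
SUMMARIZE( SampleTable
, [Region]
, "RegionUnits", SUM( SampleTable[Units] )
, "TotalUnits", CALCULATE( SUM( SampleTable[Units] )
    , ALLEXCEPT( SampleTable, SampleTable[Item] ) ) )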
Performance Consequence
Because of the complexity of how the SUMMARIZE function creates its filter context, the performance of this function will not be impressive. We should use it only when we need distinct rows for several columns (the simplest case) or when we want to modify the filter context (the most complex case).
Descriptive statistics is a branch of statistics that describes the characteristics of a sample or population. It is based on brief informational coefficients that summarize a given data set. In descriptive statistics, we are mostly interested in the following characteristics:
Frequency Distribution or Dispersion refers to the frequency of each value.
Measures of Central Tendency represent mean values.
Measures of Variability show how spread out the values are.
Frequency Distribution
Frequency distribution shows how many observations belong to the different outcomes in a data set. Each outcome can be described by a group name (for nominal or ordinal data), an interval (for ordinal data), or a range (for quantitative data). Each outcome is a mutually exclusive class. We simply count how many statistical units belong to each class.
Frequency distribution is usually presented in a table or a chart. There are four kinds of frequency distribution tables, each with a convenient chart presentation:
– Ungrouped frequency distribution tables show the number of units per category. Their counterpart is the Bar Chart.
– Grouped frequency distribution tables present the number of units per range. Their companion is the Histogram.
– Relative frequency distribution tables show relative structure. Their supplement is the Pie Chart.
– Cumulative frequency distribution tables present accumulation levels. Their double is the Clustered Column chart.
Measures of Central Tendency
Measures of central tendency represent a data set with a value that is in the center of all other values. This central value is the most similar to all other values and is best suited to describe the whole data set with one number. There are three measures of central tendency: Average, Median, and Mode. In a normal distribution, these three values are the same. If we don't have symmetry, then the Average will be pulled toward the extreme values more than the Median, and the Mode will be at the peak of the distribution.
Average
The average is calculated by summing all the values and then dividing the result by the number of values: x̄ = Σ xi / n.
Median
The median is calculated by sorting a data set and then picking the value in the center of that array. Let's say that all values in the array are indexed from 1 to n: [ x(1), x(2) … x(n-1), x(n) ].
If the number of values in the array is odd, the median is at index (n+1)/2, that is, x̃ = x((n+1)/2). In the array [ 9(1), 8(2), 7(3), 3(4), 1(5), -5(6), -12(7) ], the median index is (7+1)/2 = 4, so the median is 3. There is an equal number of values before and after the median in our array: three values before and three values after.
If the number of values is even, the formula is x̃ = ( x(n/2) + x(n/2+1) ) / 2, as in [ -3(1), 0(2), 1(3), 2(4), 3(5), 4(6), 6(7), 9(8) ]. The middle indexes are 8/2 = 4 and 8/2 + 1 = 5, so we calculate the average of the two middle numbers: (2+3)/2 = 2.5. Again, there is an equal number of values in our array before and after the two central values. As we can see, it does not matter whether the numbers are sorted in ascending or descending order.
Mode
The mode is the most frequent value in a sample or population. One data set can have multiple modes. In the sample ( 1, 3, 2, 3, 3, 5, 4, 4, 8, 4 ) we have two modes: both 3 and 4 appear three times. If we create an Ungrouped Frequency Distribution table, we can easily spot our modes.
Measures of Variability
Measures of variability show how spread out the statistical units are. They give us a sense of how different the individual values are from each other and from their mean. Measures of variability are: Range, Percentile, Quartile Deviation, Mean Absolute Deviation, Standard Deviation, Variance, Coefficient of Variation, Skewness, and Kurtosis.
Range
The range is the difference between the maximal and minimal value. If we have the sample 2, 3, 7, 8, 11, 13, where the maximal and minimal values are 13 and 2, then the range is: range = xmax – xmin = 13 – 2 = 11.
Percentile
Let's say that we have a sample with 100 values ordered in ascending order ( x(1), x(2) … x(99), x(100) ). If some number Z is bigger than M% of the values from that sample, then we can say that Z is the "M percentile". For example, Z could be larger than 32% of the values, that is, larger than x(1) … x(32). In this case, x(33) is the "32 percentile".
For the sample [ -12(1), -5(2), 1(3), 3(4), 7(5), 8(6), 9(7) ], the number 7 is larger than 4 of the values, so seven is the "57 percentile". This is because 4 of the 7 numbers are smaller than 7, and 4/7 = 0,57 = 57%. A percentile shows us how big a part of the sample is smaller than some value.
Be aware that there are several algorithms for calculating percentiles, but they all follow similar logic.
Percentiles "25 percentile", "50 percentile", "75 percentile" are the most used percentiles and they have special names. They are respectively called "first quartile", "second quartile" and "third quartile", and they are labeled with Q1, Q2, Q3. Quartile Q2 is the same as median.
Quartiles, together with the maximal and minimal values, divide our data set into 4 quarters:

xmin [25% of values] Q1 [25% of values] Q2 [25% of values] Q3 [25% of values] xmax
Quartile Deviation
The difference between the third and first quartile is called the "interquartile range": QR = Q3 – Q1. When we divide the interquartile range by two, we get the quartile deviation. The quartile deviation is the average distance of Q3 and Q1 from the Median.
The average of the ranges A and B (in the image) is the quartile deviation, calculated as: QD = ( Q3 – Q1 ) / 2.
Mean Absolute Deviation
Mean absolute deviation (MAD) is the average absolute distance between each data value and the sample mean. Some raw distances from the mean are negative and some are positive, and their sum is always zero. This is a direct consequence of how we calculate the sample mean:
x̄ = Σ xi / n
n * x̄ = Σ xi
n * x̄ – ( n * x̄ ) = Σ xi – Σ x̄
0 = Σ ( xi – x̄ )
We can see above that the formula used for calculating the mean can be transformed to show that the sum of all distances between the values and the mean equals zero. This is why, when calculating the mean absolute deviation (MAD), we use absolute values.
If our sample is [ 10, 12, 14, 24, 32, 33, 36 ], the mean value is 23. The sum of all distances is ( -13 – 11 – 9 + 1 + 9 + 10 + 13 ) = 0. Instead of the original deviations, we now use their absolute values. The sum of all absolute deviations is ( 13 + 11 + 9 + 1 + 9 + 10 + 13 ) = 66, which means that MAD = 66 / 7 = 9,43.
Standard Deviation and Variance
The standard deviation is similar to the mean absolute deviation. SD also calculates the average distance between the individual values and the sample mean, but it uses a different calculation. To eliminate negative distances, SD squares each distance. To compensate for this increase in deviation, the calculation ultimately takes the square root of the average squared distance. The formula used for calculating the standard deviation of a sample is shown below.
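The original shows the formula as an image; written out, the usual sample formula (with n – 1 in the denominator, as SPSS uses for a sample) is:

s = √( Σ ( xi – x̄ )² / ( n – 1 ) )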
Variance is just the standard deviation without the square root; in other words, the variance is the square of the standard deviation, s².
The standard deviation is always equal to or bigger than the mean absolute deviation. If we add some extreme values to our sample, the standard deviation will rise much more than the mean absolute deviation.
Coefficient of Variation
The coefficient of variation is the relative standard deviation. It is the ratio between the standard deviation and the mean: CV = s / x̄. This is basically the standard deviation measured in units of the mean. Because it is a relative measure, it can be expressed in percentages.
Let's imagine that the standard deviation of giraffes' height is 200 cm, while the standard deviation of mice's height is 5 cm. Does that mean that the variability of giraffes' height is much bigger than the variability of mice's height? Of course not. We have to take into account that giraffes are much bigger animals than mice. That is why we use the coefficient of variation.
If we scale the images of mice and giraffes to the same height, we can see that the relative standard deviation of their heights is not as different as the ordinary standard deviation would indicate.
Kurtosis and Skewness
Kurtosis and skewness describe how well our distribution fits the normal distribution. If kurtosis and skewness are zero, or close to zero, then our distribution is close to normal.
Skewness is a measure of the asymmetry of a distribution. Sometimes a distribution tilts more to one side. If skewness is positive, the distribution is tilted to the right side; if it is negative, it is tilted to the left side.
If the absolute value of skewness is higher than 1, we have high asymmetry. If it is between -0,5 and 0,5, we have a fairly symmetrical distribution. All other values mean that the distribution is moderately skewed.
Below we can see how the Skewness statistic is calculated.
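The original shows the calculation as an image; for reference, the adjusted Fisher–Pearson sample skewness (the statistic SPSS and Excel report, with s being the sample standard deviation) is:

skewness = n / ( (n – 1)(n – 2) ) * Σ ( ( xi – x̄ ) / s )³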
Kurtosis measures the flatness of our curve. A distribution is flat when the data is equally distributed. If the data is grouped around one value, the distribution has a peak. Such humped distributions have a positive kurtosis statistic; that is a leptokurtic distribution. Negative values of kurtosis mean that the distribution is platykurtic, i.e. flatter. The critical values for the kurtosis statistic are the same as for skewness.
Below we can see how the Kurtosis statistic is calculated.
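Again, the original shows the calculation as an image; the sample excess kurtosis (as reported by SPSS and Excel) is:

kurtosis = n(n + 1) / ( (n – 1)(n – 2)(n – 3) ) * Σ ( ( xi – x̄ ) / s )⁴ – 3(n – 1)² / ( (n – 2)(n – 3) )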
We have a table like (1). This table shows how many units of product have been sold on each date in each region. We want to get a pivot table (2) that shows on how many days sales were bigger than 75, 53, or 27 units. We want to observe those values per year, so we added a slicer for years (3).
Needed calculations
If our data is (1), we first need to group it by Region and OrderDate and calculate the sum of units for each group (2). That table we then have to filter for values bigger than 75 (3). Then, again, we have to group by Region and count the rows of our filtered table (4). We also have to repeat this for the values 53 and 27.
Slow solution
For this calculation, we could create three measures that look like the one below. If we have a lot of data, after clicking on the slicer we would have to wait a long time for the calculation to finish. Not only do we have a lot of data, but we also have a demanding calculation for each measure, and we have three such measures.
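The original shows these measures as an image; judging from the Basic measure further below, one of them could look roughly like this:

Dates with sale over75 :=
VAR GroupByRegionOrderDate = GROUPBY( SampleTable; [Region]; [OrderDate]; "SumOfUnits"; SUMX( CURRENTGROUP(); [Units] ) )
RETURN COUNTROWS( FILTER( GroupByRegionOrderDate; [SumOfUnits] >= 75 ) )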
All three measures start with the same step ( "GroupByRegionOrderDate" ). This step is very expensive because it does most of the calculation, and it returns a table. If we could do this step just once, instead of repeating it for each measure, that would be great. We could make a basic measure that returns such a table (1), and then three other measures that use that table to calculate the final scalar values (2).
Unfortunately, measures cannot return a table. They can only return a scalar value.
Better solution
The better solution is to create one measure that returns all three values. We can simply concatenate the three numbers into one string ( "25-32-36" ). That would be the result of our Basic measure. Then we can create three measures that extract the separate values. This could be our Basic measure code:
BasicMeasureString:=
VAR GroupByRegionOrderDate = GROUPBY( SampleTable; [Region]; [OrderDate]; "SumOfUnits"; SUMX( CURRENTGROUP(); [Units] ) )
VAR BiggerThen75 = COUNTROWS( FILTER( GroupByRegionOrderDate; [SumOfUnits] >= 75 ) )
VAR BiggerThen53 = COUNTROWS( FILTER( GroupByRegionOrderDate; [SumOfUnits] >= 53 ) )
VAR BiggerThen27 = COUNTROWS( FILTER( GroupByRegionOrderDate; [SumOfUnits] >= 27 ) )
VAR Result = BiggerThen75 & "-" & BiggerThen53 & "-" & BiggerThen27
RETURN Result
It would return the string "25-32-36". Then we can use three additional measures to extract the three separate values, as sketched below.
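The extraction measures are shown as an image in the original; one possible implementation (using PATHITEM after replacing the dashes with pipes) is:

Dates with sale over75 := VALUE( PATHITEM( SUBSTITUTE( [BasicMeasureString]; "-"; "|" ); 1 ) )
Dates with sale over53 := VALUE( PATHITEM( SUBSTITUTE( [BasicMeasureString]; "-"; "|" ); 2 ) )
Dates with sale over27 := VALUE( PATHITEM( SUBSTITUTE( [BasicMeasureString]; "-"; "|" ); 3 ) )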
As we can see, we only do the "GroupByRegionOrderDate" step once. This makes our calculation much faster. If our final results are numbers, we can get results even faster by using mathematical calculations to extract our values, instead of doing string manipulations. In our Basic measure, we could combine the three values into one number like this:
VAR Result = BiggerThen75 * 10000 + BiggerThen53 * 100 + BiggerThen27 //returns 253236
Then, we can extract separate numbers by using INT and MOD functions like this:
Dates with sale over75 := INT( [BasicMeasure] / 10000 ) //returns 25
Dates with sale over53 := INT( MOD( [BasicMeasure]; 10000 ) / 100 ) //returns 32
Dates with sale over27 := MOD( [BasicMeasure]; 100 ) //returns 36
The solution with math calculations would be significantly faster.
Sometimes different users want a pivot table sorted in different ways. If a column has people's full names, it is possible to sort the column either by first names or by last names. An additional requirement is the ability to change the sort order using a slicer.
In the example below, we can see that our pivot table can be sorted by "Points" (1), but two other possible sorting orders are by "Last Name" (2) or by "First Name" (3). We will create a solution only for these three specific sorting methods. Points will be sorted from highest to lowest, and first and last names will be sorted alphabetically.
Helper Tables
First, we have to create a table in Power Pivot that will hold the values for our slicer (1). We will call this table "SlicerValues".
Second, we have to create a table in Power Pivot that has columns with the sort order for every row (1). We need two such columns, one for last names and one for first names. This table will be named "SortOrdersTable".
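Judging by the measure code later in the article, the two helper tables could look like this (the exact layout is an assumption):

SlicerValues: a single column "SortBy" with three rows: First Names, Last Names, Points.
SortOrdersTable: columns "Full Name", "First Name order", and "Last Name order", where the two order columns hold each person's position when sorted alphabetically by first or last name.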
General Idea
We know that it is possible to sort a column in a pivot table by using some other column to define the sort order. First, we right-click the column we want to sort, and then go to Sort > More Sort Options (1). A new dialog opens. In this dialog we select "Descending (Z to A) by:" and then choose which column will define our sort order. In the image below, we decided that the "Sum of Points" column (2) in descending order is used for sorting. Now the "Full Name" column (3) is not sorted alphabetically, but according to the "Sum of Points" column.
We can use the mechanism explained above to control the sorting order of the "Full Name" column. The idea is to create a measure named "SortMeasure" and add it to our pivot table (1). This measure's values will change according to the selection in the slicer (3,4,5). Our "Full Name" column will be set to sort by the order defined by "SortMeasure" (2).
Now, when the user selects "First Names" (3), "Last Names" (4), or "Points" (5) in the slicer, the "SortMeasure" values will change (1), and that will re-sort our "Full Name" column (2). We can see in (3) that the "Full Name" column is sorted by first name, in (4) by last name, and in (5) by points. In the end, the "SortMeasure" column is hidden from the user by hiding the whole spreadsheet column.
Creative process
First, we have to make a relation between the "Full Name" columns in "SortOrdersTable" and the "Data" table (1). "Data" is the name of the table that is the base for our pivot (with the "Full Name" and "Points" columns). The relation has to be "1:N" in the direction "Data" to "SortOrdersTable" (2); in the relation dialog, the "Data" table has to be at the bottom. The "Data" table can then filter "SortOrdersTable".
The next step is to create our measure. This measure has to react to the slicer selection. The ALLSELECTED function returns the values selected in the slicer as a column. With the green part of the code, we check whether specific individual values are part of that column. If they are, those terms are selected in the slicer and we return TRUE(); otherwise we return FALSE(). At the end, we use the SWITCH function to decide what values the measure should return. SWITCH returns the values for the first variable ( FirstNamesSelected, LastNamesSelected, PointsSelected ) that is TRUE(). For example, if FirstNamesSelected is FALSE() and LastNamesSelected is TRUE(), the values from the "Last Name order" column are returned. Those are the values that will be used to sort the "Full Name" column.
SortMeasure:=
VAR FirstNamesSelected = IF( "First Names" IN ALLSELECTED( SlicerValues[SortBy] ); TRUE(); FALSE() )
VAR LastNamesSelected = IF( "Last Names" IN ALLSELECTED( SlicerValues[SortBy] ); TRUE(); FALSE() )
VAR PointsSelected = IF( "Points" IN ALLSELECTED( SlicerValues[SortBy] ); TRUE(); FALSE() )
VAR Result = SWITCH(TRUE(); FirstNamesSelected; SUM( SortOrdersTable[First Name order] )
; LastNamesSelected; SUM( SortOrdersTable[Last Name order] )
; PointsSelected; 10000 - SUM( Data[Points] ))
RETURN Result
All columns with possible results ( "First Name order", "Last Name order", "Points" ) are just wrapped in the SUM function, but the "Points" column has one more detail. In our measure, we transform all "Points" values into "10000 – Points". This is because we want ascending sorting for "First Name order" and "Last Name order", but descending sorting for "Points". Because a pivot table column can only be sorted in one direction at a time, we need this transformation so that ascending ordering of our measure always sorts our pivot table correctly. The values of "Points" and "10000 – Points" are inversely correlated, and this solves our problem.
Descending sorting of "Points" is the same as ascending sorting of a transformation "10000 – Points".
Result
Below we can see our results, and we can observe a few details. Columns "C" and "G" are hidden (1); this is how we hide our "SortMeasure" column from the user. In the green pivot table (2), we can see that while the "SortMeasure" column is in ascending order, the important "Sum of Points" column is actually in descending order, as required.
In the blue and orange pivot tables, there are two values selected in each slicer (3). It is not possible to force single selection in an Excel slicer without resorting to VBA. To handle this, our measure is written so that, among the selected values, the first one in the SWITCH order determines the sorting. That is why both the blue and orange pivot tables are sorted by "First Names", the first of the values selected in their slicers.