Analysis ToolPak and Descriptive Statistics

Analysis ToolPak is an Excel add-in that provides many statistical tools. This add-in comes in two flavors, and each of them has to be enabled in the "Add-ins" dialog (1) before use. Both have the same functionality, but "Analysis ToolPak" is available only through the graphical interface, while "Analysis ToolPak – VBA" is accessible both through the graphical interface and through VBA code.

After we enable "Analysis ToolPak" (1), a button will appear in Data > Analysis (2). Clicking that button opens a dialog with many statistical tools (3). To use any of these tools, we just select it and click OK. A new dialog will appear with options for that specific statistical tool.

Descriptive Statistics

Descriptive statistics is based on brief descriptive coefficients that quantitatively describe or summarize features of a data set. A data set is a set of results of measurements made on a population or a sample. These coefficients comprise measures of central tendency, like mean, median and mode, and measures of variability, like standard error, variance, range and kurtosis.
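As a language-neutral illustration (outside Excel), the most common of these indicators can be computed for a small hypothetical data set with Python's standard statistics module:

```python
import statistics

data = [1, 1, 2, 5, 5]  # hypothetical data set

mean = statistics.mean(data)           # central tendency: 2.8
median = statistics.median(data)       # middle value: 2
variance = statistics.variance(data)   # sample variance: 4.2
stdev = statistics.stdev(data)         # sample standard deviation
value_range = max(data) - min(data)    # range: 4
```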

We will select the "Descriptive Statistics" tool in the Analysis ToolPak. The "Descriptive Statistics" dialog asks us to enter the range that will be used as the source of data (1). In our example this range has two columns, and we will get results for each of them. The program doesn't know whether the data in the selected range is organized in columns or rows, so we choose Columns (2) to resolve that dilemma. Our data has labels in the first row, so we have to check option (3) to exclude the first row from calculations.

The final output can be directed to a range in the same sheet or in some other sheet in the same workbook (A). This range is defined by its top-left cell. Option (B) means that a new sheet will be created with the provided name, and the result will be pasted into cell "A1" of that new sheet. Option (C) will create a new workbook named "Book1", and the result will be pasted into cell "A1" of sheet "Sheet1" in that workbook.

The last options (4,5,6,7) select which results should be calculated. Option (4), "Summary statistics", should always be selected; without it, most of the important indicators (mean, variance…) will not be part of the final result. Option (5) needs a confidence level for calculating the confidence interval for the mean; without it checked, we will not get this result. Options (6,7) give us the same results as the SMALL and LARGE functions in Excel.

Descriptive Statistics Results

This is what the results of descriptive statistics look like. We have results for both Col1 and Col2. For each column there is a list of indicators which describe our data set. We will analyze only the results for the first column.

To the right, we can see all the results for the first column, alongside the same results calculated with Excel formulas. All the results are the same, but there is one potential difference. If there are multiple values in the data set that could be considered the mode ( e.g. 1,1,2,5,5 ), Analysis ToolPak returns the last one ( 5 ), but the MODE function returns the first one ( 1 ).
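Python's statistics module, for comparison, makes this behavior explicit: statistics.mode returns the first mode encountered (Python 3.8+), while statistics.multimode returns all of them:

```python
import statistics

data = [1, 1, 2, 5, 5]  # two candidate modes: 1 and 5

first_mode = statistics.mode(data)      # first mode encountered: 1
all_modes = statistics.multimode(data)  # every mode, in order: [1, 5]
```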

Formulas used in Analysis ToolPak descriptive statistics

Below are the formulas used for calculating some descriptive indicators.

– Standard Deviation, Variance
– Standard Error, Confidence Interval
– Kurtosis, Skewness

Let's calculate the confidence interval by using its formula. Because our sample has fewer than 30 observations, we will use the t distribution. For a 95% confidence level, with 3 degrees of freedom, the t value is 3.1824. That means that t * s = 3.1824 * 0.75 = 2.3868. This is the same result as the Excel CONFIDENCE.T function returns. Now our confidence interval is ( 2.75 – 2.3868; 2.75 + 2.3868 ), which is ( 0.3632; 5.1368 ).
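The same calculation can be sketched in Python. The underlying data set is not shown in the text, so the mean, standard error and t value below are hard-coded from the example rather than computed:

```python
mean = 2.75    # sample mean from the example
se = 0.75      # standard error of the mean
t = 3.1824     # t value for 95% confidence and 3 degrees of freedom

half_width = t * se                          # 2.3868, matches CONFIDENCE.T
ci = (mean - half_width, mean + half_width)  # (0.3632, 5.1368)
```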

Kurtosis and skewness describe how well our distribution fits the normal distribution. If kurtosis and skewness are zero, or close to zero, our distribution is close to normal.

Analysis ToolPak Descriptive Statistics through VBA

Descr Subprocedure

Different functionalities of the Analysis ToolPak can be called through VBA subprocedures. Below we can see how to provide arguments and call the procedure named "Descr" for descriptive statistics. All arguments except the first one are optional, and all arguments are of type Variant.

Sub subDescrFunction()
    Dim inputRng As Range: Set inputRng = Range("Sheet1!A3:B7")
    Dim outputRng As Range: Set outputRng = Range("Sheet1!K3")
    Application.Run "Descr", inputRng, outputRng, "C", True, True, 1, 1, 95
End Sub
inprng – The input range, with our data set.
outrng – The output range, declared by a reference to its top-left cell. If we type some word here, the Analysis ToolPak will export the result into a new worksheet with that name. If this argument is False, blank or omitted, a new workbook will be created.
grouped – Is our data set organized in rows or columns? Depending on that, we type "R" or "C". "C" is the default.
labels – Type True if our data set has a header row. If False or omitted, we claim that there are no labels.
summary – Type True to get all major descriptive statistics. If this argument is False, blank or omitted, those values will not be calculated.
ds_large – Which k-th largest value to get? If this argument is omitted, this value will not be calculated.
ds_small – Which k-th smallest value to get? If this argument is omitted, this value will not be calculated.
confid – The confidence level for calculating the confidence interval for the mean. If this argument is omitted, there will be no result.

The VBA project for the Analysis ToolPak (ATPVBAEN.XLAM) is protected with the password "Wildebeest!!". There you can find the declarations of all VBA procedures.

DescrQ Subprocedure

There are two variants of the subprocedure for descriptive statistics: the first is named Descr, and the second DescrQ. "Descr" creates the result immediately, using the arguments provided in the code. "DescrQ" first opens the GUI dialog filled with the values provided in the code (1). We can accept those values or change them, and then click the OK button. Only then are the final results created (2). "DescrQ" gives us the opportunity to adjust the values provided in the code.

All the code for "Descr" and "DescrQ" is the same; the only difference is the name of the subprocedure.

Application.Run "DescrQ", inputRng, outputRng, "C", False, True, 1, 1, 95
The "DescrQ" version has one bug: ranges in the dialog will be missing the leading apostrophe.
[New Microsoft Excel Worksheet.xlsm]Sheet1'!$D$9:$D$12 'initial formatting
'[New Microsoft Excel Worksheet.xlsm]Sheet1'!$D$9:$D$12 'corrected starting sign

Sample file can be downloaded here:

Power Query and protected sheet in Excel

We made a report, and we'll give that report to a colleague. He will refresh the report periodically and then send it to end users. In the image below we can see that our colleague has to prepare 4 Access files as the data source and open our report (1). He will then refresh it (2). At the end of the refresh, a MsgBox (3) will appear. After this, our colleague can save the file and send it to other colleagues (4). The idea is to make this process as easy as possible for our colleague.

One more requirement is that the spreadsheet with the tables has to be protected, so that users cannot accidentally change or delete some numbers. When we try to make the report as described, we run into a few problems:
1) Tables on a protected sheet will not refresh.
2) Combining Power Query and VBA requires queries to be synchronous, which prolongs query execution.
3) If we cannot use VBA, how do we create the MsgBox (3)?
Let's see how to solve these problems.

Tables on Protected Sheet Will Not Refresh

When we try to "Refresh All" tables located on a protected sheet, this is the message we get. We would have to remove the protection before refreshing our data, and then protect the sheet again afterward. The idea is to avoid such steps and automate things.

We could try to solve this problem with VBA. Usually we would lift the protection, make the changes, and then set the protection back. We can use VBA code similar to the code below:

Sub RefreshTables()
    ThisWorkbook.Worksheets("Queries").Unprotect Password:="fff"
    ThisWorkbook.RefreshAll
    ThisWorkbook.Worksheets("Queries").Protect Password:="fff"
    MsgBox "All Tables Refreshed"
End Sub

The problem is that Power Query queries are executed asynchronously. While the queries are still executing (2), VBA will not wait for them to finish, but will show the final MsgBox immediately (1). This means the sheet protection is already back in place, so after we click the OK button (1), we will again receive message (3).

As explained in this post, all we need to do is make the queries synchronous. This is done by unchecking the option "Enable background refresh" in the query properties.
Unfortunately, this makes refreshing last much longer. Instead of executing all the queries at the same time (1), the queries are executed sequentially, one by one (2). On my computer, this makes refreshing last twice as long.
We could try to protect our cells with a data validation rule that can never be fulfilled. This would prevent users from overwriting numbers in our cells, but it would not stop them from deleting values. Data validation is only triggered when a cell is in Edit mode and the user is trying to enter data into the cell. Edit mode is not needed for deleting cell content; we just select the cell and press the Delete key.

Solution for protected sheet

The best solution is to place the refreshable tables in a sheet that is unprotected and hidden (2). Numbers from that sheet can be mirrored (3), using formulas, into another sheet which is protected (4). Cells in the protected sheet always keep the same, unchanged formulas (5), so they will never trigger sheet protection.

The user can now click the Refresh All button on the Data tab (1). Tables in the unprotected sheet will refresh, and the tables in the protected sheet will just reflect those results.

MsgBox without VBA

We said that synchronized queries take longer to execute, so we will avoid VBA. The question is how to create a MsgBox at the end of refreshing that informs the user that refreshing is over. The only way to raise a message in Power Query is with the "error" statement. We have to make sure that the "error" statement fires at the end of refreshing, when all individual queries are finished.

We can create a query like the one below. This query references all other queries, which means it has to be executed last. At the end of refreshing, this query will show the user the message written in Error.Record. This query has to be loaded into the spreadsheet in order to work correctly.

let   
  Result = if { FactSales1, FactSales2, FactSales3, FactSales4 } = { "A","A","A","A" } 
    then "A" else error Error.Record( "Finished", "Refreshing is over." )
in
    Result

We can see below that the final query depends on all other queries (1). This MsgBox query will never load any data into the spreadsheet, so its footprint will look like (2). After all other queries are refreshed, this query will show us a message that looks like (3).

Sample files can be downloaded from here. The file "ContosoSales1.accdb" should be copied three more times to create the files "ContosoSales2.accdb", "ContosoSales3.accdb" and "ContosoSales4.accdb". In the "Parameters" Excel sheet, change the full paths toward those MS Access files.

Data Types in Power Query

Built-in Types

Power Query has a built-in record with the identifier #shared. This record gives us all the identifiers that exist in Power Query in the current context, together with their types. We will transform this record into a table, and then remove the query itself from the list of identifiers in order to avoid a cyclic reference. We will keep only the identifiers that represent Power Query types, and at the end we will sort by type name.

let
    IdentifiersAndTypes = Record.ToTable( #shared )   
    , RemoveItself = Table.SelectRows(IdentifiersAndTypes, each [Name] <> "Query1" )    
    , FilterOnlyTypes = Table.SelectRows( RemoveItself, each Value.Type( [Value] ) = Type.Type )   
    , SortTypes = Table.Sort( FilterOnlyTypes, { "Name", Order.Ascending })
in
    SortTypes
#shared (1) returns all identifiers in Power Query, together with their types. We delete (2) the name of our query from this list in order to avoid a cyclic reference, and we filter only the identifiers that refer to Power Query types (3).

The result shows that there are 63 built-in types in Power Query. They are presented below.

AccessControlEntry.ConditionContextType, Compression.Type, Guid.Type, MissingField.Type, RelativePosition.Type
AccessControlEntry.Type, CsvStyle.Type, Identity.Type, None.Type, RoundingMode.Type
AccessControlKind.Type, Currency.Type, IdentityProvider.Type, Null.Type, Single.Type
SapHanaDistribution.Type, Date.Type, Int16.Type, Number.Type, Table.Type
SapHanaRangeOperator.Type, DateTime.Type, Int32.Type, ODataOmitValues.Type, Text.Type
SapBusinessWarehouseExecutionMode.Type, DateTimeZone.Type, Int64.Type, Occurrence.Type, TextEncoding.Type
Any.Type, Day.Type, Int8.Type, Order.Type, Time.Type
Binary.Type, Decimal.Type, JoinAlgorithm.Type, Password.Type, TraceLevel.Type
BinaryEncoding.Type, Double.Type, JoinKind.Type, Percentage.Type, Type.Type
BinaryOccurrence.Type, Duration.Type, JoinSide.Type, PercentileMode.Type, Uri.Type
Byte.Type, ExtraValues.Type, LimitClauseKind.Type, Precision.Type, WebMethod.Type
ByteOrder.Type, Function.Type, List.Type, QuoteStyle.Type
Character.Type, GroupKind.Type, Logical.Type, Record.Type

Primitive types

Most of the types above are so-called "ascribed types". This means that we can declare them, and Power Query will remember them, but it will not enforce them. Power Query only enforces primitive types. This is the list of primitive types in Power Query:

type any, type anynonnull, type binary, type date, type datetime, type datetimezone
type duration, type function, type list, type logical, type null, type none
type number, type record, type table, type text, type time, type type
If we try to define a function, arguments can only be declared as primitive types. We can see in the image that only the primitive types any, binary and text are colored green by IntelliSense (1). Even the longer syntax, like "Any.Type", is not acceptable. All other types (2) are incorrect and cannot be used.
It is the same if we want to use the "is" operator. This operator will only work with primitive types ( number is colored green by IntelliSense (1) ). Even the longer syntax "Number.Type" will not work; we have to type exactly "number".
If we select the cell with the "Decimal.Type" type in our #shared table, we will see in the graphical interface that this type is actually of type "number" (1). It is the same for many other types, like "Double.Type", "Int16.Type", "Int64.Type", "Percentage.Type" etc.; "number" is their real type.

We can create a function that returns the primitive type for each of our 63 built-in types in Power Query. This is the function:

( TypeToName as type ) as text =>
let
      ListOfTypes = { type any, type anynonnull, type binary, type date, type datetime
        , type datetimezone, type duration, type function, type list, type logical
        , type null, type number, type record, type table, type text, type time, type type }
    , ListOfTypeNames = { "type any", "type anynonnull", "type binary", "type date"
        , "type datetime", "type datetimezone", "type duration", "type function", "type list"
        , "type logical", "type null", "type number", "type record", "type table", "type text"
        , "type time", "type type" }
    , ZipedTypes = List.Zip( { ListOfTypes, ListOfTypeNames } )
    , NameForType = List.Select( ZipedTypes, each  Type.Is( _{0}, TypeToName ) ){0}{1}
in
    NameForType
By using the function above, we will create the column "Primitive types". We can see that for each built-in type (1) we get one primitive type (2). This is true for 59 of the 63 built-in types. For some built-in types (3), we get an Error. In the graphical interface we can see that such types are either record or function types (4).

The reason for the Error in the image above is the fact that the table, function and record types are abstract types. There is no value whose type is exactly table, function or record. We can create a table, a record and a function and compare their types with these abstract types, but the result will never be TRUE.

let
      Table = #table( { "Column" }, { { "Value" } } ) 
    , Record = [Field="Value"]
    , Function = () => let Result = "Value" in Result    
    , IsTableType = Value.Type( Table ) = type table
    , IsRecordType = Value.Type( Record ) = type record 
    , IsFunctionType = Value.Type( Function ) = type function 
    , IsTableTypeCompatible = Type.Is( Value.Type( Table ), type table )
in
    [ IsTableType=IsTableType, IsRecordType=IsRecordType
    , IsFunctionType=IsFunctionType, IsTableTypeCompatible=IsTableTypeCompatible ]
We can see that equality is never reached (1). But if we use the "Type.Is" function to check compatibility with primitive types, we get TRUE. Compatibility means that any table is compatible with "type table"; the same is true for records and functions.

When we declare a type for a function argument, that argument will accept any value compatible with our declaration. If our argument is of type "number", it can receive values of all the compatible types: "Double.Type", "Int16.Type", "Int64.Type", "Percentage.Type" etc.

How to check ascribed type

We will create three values of types Int64.Type, Decimal.Type and Text.Type. The question is how to read the ascribed types of these values. We can do that from "metadata".

let    
      Int64Type = Value.ReplaceType( 99, Int64.Type )
    , DecimalType = Value.ReplaceType( 99.99, Decimal.Type )
    , TextType = Value.ReplaceType( "hundred", Text.Type )
    , OneTable = #table( { "values" }, { { Int64Type }, { DecimalType }, { TextType } } )
    , AddMetadata = Table.AddColumn( OneTable, "Metadata"
        , each Value.Metadata( Value.Type( [values] ) ) )
    , AddAscribedTypeName = Table.AddColumn( AddMetadata, "AscribedTypeName"
        , each [Metadata][Documentation.Name] )
in
    AddAscribedTypeName 
The Value.Metadata function returns a record which has a Documentation.Name field that contains the ascribed type name. This is where Power Query stores information about ascribed types.

Sample file can be downloaded here:

Today() function is slowing down Excel

The Today() function is one of the volatile functions in Excel. Volatile functions are recalculated each time any cell in the spreadsheet is changed. Some other actions can also cause recalculation, such as renaming sheets, inserting columns or deleting rows. If you have many cells that refer to a cell containing the Today() function, or many cells using Today() directly, everything in your file will slow down. Constant recalculation makes working in such a file a really unpleasant experience.

The Today() function changes its result only when a new day arrives. Because of this, we can find another way to get today's date. There are actually several ways to accomplish this.

Offline Solutions

VBA Solution

In VBA, if we place "Application.Volatile" as the first line of our UDF (User Defined Function), that function becomes volatile. We will not do that. This way we can make a VBA UDF that is not volatile and serves as a replacement for Excel's Today() function. The function will refresh itself each time we open the file.

Function vbaToday() As Date
    vbaToday = Date
End Function

The user now just has to type "=vbaToday()" into the spreadsheet, and he will get today's date inside that cell.

Power Query Solution

We will first make a query that only returns today's date.

let
    Source = #table( type table [ #"pqToday"=date ]
      , { { DateTime.Date(DateTime.LocalNow()) } } )
in
    Source

We will load this query into the spreadsheet so that we have today's date in cell A3.

We will then set the option so that the query is refreshed every time the file is opened.

In the pane with the queries (1), we right-click on our query and choose its Properties (2). In the new dialog we check the option "Refresh data when opening the file" (3).

Possibility of error

There is a small problem with the VBA and Power Query solutions. If we open the file just before midnight, then when midnight passes a new day will arrive, but our date will not change. For the VBA solution, we would need to enter the cell with F2 and press Enter ( F9 and Calculate Now do not work on a UDF ). For the Power Query solution, we would need to refresh our query.

Closing and reopening the file would also give us the new date, or we can just turn off our computer and go to sleep before midnight.

Online Solution

Power Automate Solution

VBA and Power Query will not work in Excel Online. We can use Power Automate to change the date in one cell every day at midnight. We will combine the Power Query solution and the Power Automate solution, so that it doesn't matter whether the file is in the cloud or on the user's computer. First, we will change our query, because Power Automate needs one more column: we are going to add a "Key" column to our Power Query table.

let
    Source = #table( type table [ #"Key"=text, #"pqToday"=date ]
      , { { "paDate", DateTime.Date(DateTime.LocalNow()) } } )
in
    Source

Now we can create a flow that will change the date in cell B3 every day at midnight.

We click the "Create" button (1) for the new flow, then select "Scheduled cloud flow" (2). We are offered a dialog where we give our flow a name (3) and decide that, from now on, the flow will run each day (4,5). Next, we get the diagram view with "Recurrence" as the first step (6). As the next step we add "Update a row" (7). Now that we have both steps (8), we can fill in the necessary data.

In the Recurrence step we just have to show the advanced options (9) and decide what time zone should be used (10). In the "Update a row" step we have to specify where our file is and what its name is (11,12). (13) is the name of the declared table into which we are going to insert today's date. The "Key" column is treated as the primary key column, so we have to provide a primary key value that identifies the row where we want the change to happen (14).


The name of the declared table is "pqToday".

That row is made of fields, which will appear at the bottom (15). We have only one field to fill. In this field we will write:
convertFromUtc(utcNow(),'Central Europe Standard Time','dd.MM.yyyy')
This expression takes the current date in the given time zone and formats it. The date will be written to cell B3 in our Excel file.
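The same conversion can be sketched in Python with the standard zoneinfo module; the IANA zone name "Europe/Belgrade" below is an assumption standing in for the Windows zone "Central Europe Standard Time":

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Rough equivalent of convertFromUtc(utcNow(), 'Central Europe Standard Time', 'dd.MM.yyyy')
utc_now = datetime.now(timezone.utc)
local = utc_now.astimezone(ZoneInfo("Europe/Belgrade"))  # assumed IANA zone name
formatted = local.strftime("%d.%m.%Y")                   # e.g. "05.03.2024"
```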

For some reason, the "Update a row" step wasn't able to find my Excel file when I was using the XLSM extension. It seems that the file has to be XLSX.


Sample file can be downloaded from here:

Sampling in statistics

Definitions

An observational unit is the person or thing on which measurements are taken. Observational units have to be distinct and identifiable. An observational unit can also be called a case, element, experimental unit or statistical unit. Examples of observational units are students, cars, houses, trees etc. An observation is a measured characteristic of an observational unit.

All observational units that have one or more characteristics in common are called a population. For example, if we observe people, all the people in one country can make up a population; they all share the characteristic of being residents of the same country. One observational unit can belong to several populations at the same time, depending on the characteristics used to define those populations.

A sample is a subset of population units.

A population is a set of elements that are the object of our research. Sampling is observing only a subset of the whole population. A sample is always smaller than the population, so it is really important for the sample to be representative: it should have the same characteristics as the population.

A parameter is a function that uses observations of all units in the population to calculate one real number. That real number represents the value of some characteristic of the whole population. For example, if we have measured the height of all people in one population, we can use the function μ = ( Σ Xi ) / N to calculate the average height. For a specific population, the parameter is a constant, the result of the parameter function.

A statistic is the same as a parameter, but it is calculated on a sample. An example would be the function x̄ = ( Σ xi ) / n, where n is the sample size. When a statistic is used as an estimate of a parameter, it is referred to as an estimator.
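The distinction can be shown with a tiny Python sketch (the height values are made up for illustration):

```python
# Hypothetical heights of the whole population
population = [160, 165, 170, 175, 180, 185]
mu = sum(population) / len(population)  # parameter: population mean, a fixed constant

# A sample drawn from the population
sample = [165, 175, 185]
x_bar = sum(sample) / len(sample)       # statistic: sample mean, an estimator of mu
```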

The method (random, stratified, cluster…) used to select the observation units from a population into the sample is known as the sampling procedure. When we decide what sampling procedure and what statistics to use in our research, those two decisions together create our sampling design.

For some sampling procedures we need to create a list of all observation units that comprise the population. Such a list is called a frame or sampling frame.

Benefits and disadvantages of sampling

Benefits of sampling are:
– Research can be conducted faster and at smaller cost. Organizational problems can be avoided.
– Sometimes it is not possible to observe the whole population. Some observational units are not accessible, or there is not enough highly trained personnel or specialized equipment for data collection. Sometimes we don't have enough time to observe the full population.
– When the sample is smaller, personnel can be better trained, which produces more accurate results.
– Personnel can dedicate more time to each observational unit. They can measure many characteristics of a unit, so data can be collected for several science projects at the same time.

Collecting information on every unit in the population for the characteristics of interest is known as complete enumeration or census. A census would give us correct results. If we use sampling, we can make mistakes like:
– Our results could be biased. This is a consequence of a wrong sampling procedure.
– If the phenomenon under study is complex, it is really hard to select a representative sample, so some inaccuracy will occur.
– Sometimes it is impossible to properly collect the data from observation units. Some respondents will not be reachable, will refuse to respond or are not capable of responding. This forces us to find replacements for some observation units.
– The sampling frame could be incorrect or incomplete. This is often the case with voter lists.

Steps in sampling

  1. Define the population. The definition should allow the researcher to immediately decide whether some unit belongs to the population or not.
  2. Make a sampling frame.
  3. Define the observation unit. All observation units together should create the population. Observation units should not overlap; they should be mutually exclusive.
  4. Choose a sampling procedure.
  5. Determine the size of the sample based on the sampling procedure, cost, and precision requirements.
  6. Create a sampling plan. A sampling plan is a detailed plan of which measurements will be taken, on what units, at what time, in what way, and by whom.
  7. Select the sample.

Types of sampling procedures

There are two groups of sampling methods: Probability Sampling and Non-Probability Sampling.
Probability Sampling involves random selection, where every element of the population has an equal chance of being selected. If our sample is big and randomly selected, that guarantees our sample is representative. Unfortunately, this is not so easy to accomplish.
Non-Probability Sampling involves non-random selection, where the chances of selection are not equal. It is also possible that some units have zero chance of being included in the sample. We use this method when it is not possible to use Probability Sampling, or when we want to make sampling more convenient or cost-effective. Such sampling methods are often used in the preliminary stages of research.

Probability Sampling procedures

Simple Random Sampling

Simple random sampling requires that a list of all observation units be made. After this, we select some of the observation units from that sampling frame by using either a lottery technique or a random number generator.


The sampling frame should be enumerated so that we can use five random numbers to select five countries from the frame above.
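A minimal sketch of this procedure in Python, using a made-up frame of country names and the standard random module as the random number generator:

```python
import random

# Hypothetical enumerated sampling frame
frame = ["Austria", "Belgium", "Croatia", "Denmark", "Estonia",
         "France", "Greece", "Hungary", "Ireland", "Japan"]

random.seed(42)                   # fixed seed so the draw is reproducible
sample = random.sample(frame, 5)  # five units drawn without replacement
```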

Advantages of Simple Random Sampling are:
– It is simple to implement; no special skills are needed.
– Because of its randomness, the sample will be highly representative.

Disadvantages of Simple Random Sampling are:
– It is not suitable for large populations, because it requires a lot of time and money for data collection.
– This method offers no control to the researcher, so unrepresentative samples could be selected by chance. This method is best for homogeneous populations, where there is a smaller risk of creating a biased sample. This can be mitigated only by bigger samples.
– It can be difficult to create a sampling frame for some populations.
– This method doesn't take into account the existing knowledge the researcher has about the population.

Systematic Sample

A systematic sample requires the population to be enumerated. If the population has 12 units and the size of the sample is 4, we want to select one observation unit in every three (= 12/4) consecutive units. The first element should be randomly selected from the first three observation units; we will select element 2 in our image below. After this, we select every third unit. In the end, units 2, 5, 8 and 11 create our sample.


In a systematic sample, the first unit is randomly selected and the others are selected at regular intervals.
The two other possible samples would start on the first and on the third element.
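The steps above can be sketched in Python for the same 12-unit example:

```python
import random

population = list(range(1, 13))  # 12 enumerated units
n = 4                            # desired sample size
k = len(population) // n         # sampling interval: 12 / 4 = 3

random.seed(0)
start = random.randrange(k)      # random start among the first k units
sample = [population[i] for i in range(start, len(population), k)]
# a start at index 1 yields units 2, 5, 8, 11, as in the example
```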

Advantages of Systematic Sampling are:
– A systematic sample is simple and linear.
– The chance of randomly selecting units that are close together in the population is eliminated.
– It is harder to manipulate the sample in order to get a favored result. A systematic sample rigidly decides which units become part of the sample and which do not. This is only true if there is some natural order of units; if the researcher can manipulate how the units are ordered, it could actually be easier for him/her to manipulate the results.

Disadvantages of Systematic Sampling are:
– We have to know in advance how big our population is, or at least estimate its size.
– If there is a pattern in the order of the units, the systematic sample will be biased. For example, if we choose every 11th player in some football cup, we could actually select only goalkeepers; no regular player would be selected. We should avoid populations with periodicity.

Cluster Sampling

Cluster sampling can be used when the whole population can be divided into groups where each group has the same characteristics as the whole population. Such groups are called clusters. We randomly select several clusters, and they comprise our sample.

Imagine that we want to investigate the trees in some forest. We don't have to encompass all the trees. We can divide the forest into parcels, then select several parcels, and only the trees in those parcels will be the object of our research.
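The forest example can be sketched in Python (parcel and tree identifiers are made up):

```python
import random

# Hypothetical forest: 10 parcels (clusters), each with 20 tree ids
clusters = {p: [f"tree_{p}_{i}" for i in range(20)] for p in range(10)}

random.seed(1)
chosen_parcels = random.sample(list(clusters), 3)               # pick 3 whole parcels
sample = [tree for p in chosen_parcels for tree in clusters[p]]  # every tree in them
```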

Advantages of Cluster Sampling are:
– Observation units can be spatially closer to each other. In our example, all the trees in one parcel are in proximity to each other. This can significantly reduce the cost of data collection.
– Because the observation units are close to each other, it is easier to create and collect bigger samples.
– If the clusters really represent the population, estimates made with cluster sampling will have lower variance.

Disadvantages of Cluster Sampling are:
– We have to be careful not to include clusters that are different from the general population.
– Units shouldn't belong to several clusters. In our example, one tree can be on the border between parcels, so we could measure it twice.
– It is a statistical requirement that all clusters be of similar size.

Stratified Sampling

Stratified Sampling involves dividing the population into subpopulations that may differ in some important trait. Such subpopulations should not overlap, and together they should comprise the whole population. One such subpopulation is called a stratum; the plural of stratum is strata. After this, we take a simple random or systematic sample from each stratum. The number of units taken from each stratum should be proportional to the size of the stratum.

Here, we divided our population by gender. In our population, the ratio between women and men is 3:2, and the same ratio should hold inside our sample. If our population has 6 women and 4 men, and the sample size is 5 in total, then our sample should have 3 women and 2 men.
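Proportional allocation can be computed mechanically. The sketch below, with an invented `stratified_sample` helper, reproduces the 6-women/4-men example:

```python
import random

def stratified_sample(strata, sample_size):
    """Draw a simple random sample from each stratum, proportional to its size."""
    total = sum(len(units) for units in strata.values())
    sample = []
    for name, units in strata.items():
        n = round(sample_size * len(units) / total)   # proportional allocation
        sample += random.sample(units, n)
    return sample

# Women are units 0-5, men are units 6-9, mirroring the 3:2 ratio above.
population = {"women": list(range(6)), "men": list(range(6, 10))}
sample = stratified_sample(population, 5)   # yields 3 women and 2 men
```

Rounding can make the per-stratum counts sum to slightly more or less than the target in general; for the 3:2 example it works out exactly.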

Advantages of Stratified Sampling are:
– Every important part of the population is included in the sample.
– It is possible to investigate differences between strata.
– Because units within each stratum are similar, the average value of some characteristic of those units will have a smaller variance. As a consequence, the estimator for the whole population will have a smaller variance too.

Disadvantages of Stratified Sampling are:
– We need to build a complete sampling frame. Each observation unit has to be classified into the stratum it belongs to.
– It is often hard to divide the population into subpopulations that are internally homogeneous but heterogeneous between one another.

Multistage Sampling

Multistage Sampling is a method of obtaining a sample by splitting the population into smaller and smaller groups and taking samples of individuals from the smallest resulting groups. It involves stacking multiple sampling methods one after the other. Stratified Sampling can be seen as a special case of Multistage Sampling because it has two stages. Another possibility is to divide the population into clusters and then take a systematic sample from each selected cluster.
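The cluster-then-systematic combination mentioned above can be sketched like this; the function and the toy population are assumptions for illustration:

```python
import random

def multistage_sample(clusters, n_clusters, step):
    """Stage 1: randomly pick clusters. Stage 2: systematic sample inside each."""
    chosen = random.sample(clusters, n_clusters)
    sample = []
    for cluster in chosen:
        start = random.randrange(step)   # random start for the systematic pass
        sample += cluster[start::step]   # every step-th unit within the cluster
    return sample

# Hypothetical population: 10 clusters of 100 units each.
population = [list(range(c * 100, (c + 1) * 100)) for c in range(10)]

# Pick 4 clusters, then every 10th unit within each: 4 * 10 = 40 units.
sample = multistage_sample(population, 4, 10)
```

Each stage can in principle use any of the methods described in this section; the sketch just stacks two of them.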

Non-Probability Sampling Procedures

Purposive or Judgment Sampling

Judgment Sampling is when the researcher selects the units of the sample at their own discretion, using their knowledge. It is used when the researcher wants to gain detailed knowledge about some phenomenon. It is also used when the population is very specific and hard to identify.

Say you want to interview lottery winners about how they spend and invest their money. There are not that many lottery winners, and many of them will refuse to speak with a researcher or keep their identity secret. The examiner will not find many people to talk with. In that case, every winner willing to share their experience becomes part of the purposive sample.

Another example would be when a researcher interviews only people who gave the most representative answers in some previous research, or wants to interview only people knowledgeable enough to predict some future event.

Advantages of Purposive Sampling are:
– It can be used for small and hidden populations.
– The examiner can use all of their knowledge to create a heterogeneous and representative sample.
– This sampling method can combine several qualitative research designs and can be conducted in multiple phases.

Disadvantages of Purposive Sampling are:
– There is a large bias because the sample is not selected by chance. Also, purposive samples are usually small.
– There is no way to properly calculate the optimal sample size or to estimate the accuracy of the research results.
– The examiner can easily manipulate the sample to get artificial results.

Convenience Sampling

In this sampling procedure, units are selected based on their availability and willingness to take part. Studies that rely on volunteers, or that only observe people in convenient places like busy streets, malls, or airports, are examples of Convenience Sampling. This technique is known as one of the easiest and cheapest.

Advantages of Convenience Sampling are:
– We can collect answers from dissatisfied buyers or employees. People are hesitant to express their dissatisfaction openly but are more willing to do it during a survey.
– This method is good for the first stages of research because we don't have to worry about sample quality: all participants are willing to give us answers, we can collect some demographic data about them, and we get immediate feedback.
– It is cheap and fast.

Disadvantages of Convenience Sampling are:
– Potential bias. We only cover people and things near us; everyone else is neglected. Results from convenience sampling cannot be generalized.
– People in a hurry will often give incomplete or false answers to shorten the interaction. This can cause the examiner to start avoiding people who are nervous or in a hurry, further reducing the representativeness of the sample.

Snowball Sampling

For Snowball Sampling we start with only a few participants. Those people are asked to nominate further people known to them, so the sample grows like a rolling snowball. This technique is needed when participants don't want to talk about their situation because they feel vulnerable or in danger. Such populations include the homeless, illegal immigrants, and people with rare diseases.
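The "rolling snowball" growth can be sketched as a walk over a referral network. The network and names below are entirely hypothetical:

```python
def snowball_sample(referrals, seeds, target_size):
    """Grow the sample by following referrals, starting from a few seed participants."""
    sample, queue = [], list(seeds)
    while queue and len(sample) < target_size:
        person = queue.pop(0)
        if person not in sample:
            sample.append(person)                 # recruit this participant
            queue += referrals.get(person, [])    # ask them for further contacts
    return sample

# Hypothetical referral network: each person nominates people they know.
network = {"A": ["B", "C"], "B": ["D"], "C": ["E", "F"], "D": [], "E": ["G"]}
sample = snowball_sample(network, ["A"], 5)
# → ["A", "B", "C", "D", "E"]
```

The sketch also makes the main weakness visible: everyone in the sample is reachable from the seeds, so people outside that social circle can never be selected.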

Advantages of Snowball Sampling are:
– It can be used when the sampling frame is unknown, or when respondents don't want to disclose their status or identify themselves.
– The sampling process is faster and more economical because existing contacts are used to reach other people.

Disadvantages of Snowball Sampling are:
– Because they are connected, all participants share some common traits. This can exclude all other members of the population who don't share those traits, introducing a large bias because the population is not correctly represented.
– People from vulnerable groups can show resistance and doubt. The researcher has to be careful to earn their trust.
– The examiner cannot use their prior knowledge to improve the sample and cannot control the sampling process.

Quota Sampling

Quota Sampling is similar to Stratified Sampling. Here, we also try to split the population into exclusive homogeneous groups, which reduces the variance inside each group. After this, we apply some non-probability sampling method to select units inside our strata.

We don't know how big our population is, nor how big the strata are. Instead, we try to guess what percentage of the general population each stratum makes up, based on some older research or on our expertise. We also have to decide how big our sample should be. Because the sample from each stratum should be proportional to the stratum's size, we now have enough information to decide how many units to pick from each stratum. If our strata are in the proportion 30% : 55% : 15%, and we want a sample of 100 units, then we have to choose 30, 55, and 15 units from the respective strata.
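The quota calculation above is a one-liner; the helper name and stratum labels here are invented for the example:

```python
def quota_allocation(proportions, sample_size):
    """Translate guessed stratum percentages into per-stratum quotas."""
    return {name: round(sample_size * p) for name, p in proportions.items()}

# Guessed stratum shares of 30%, 55%, and 15%, target sample of 100 units.
quotas = quota_allocation({"A": 0.30, "B": 0.55, "C": 0.15}, 100)
# → {"A": 30, "B": 55, "C": 15}
```

Unlike stratified sampling, these quotas are then filled by non-probability means (e.g. convenience or judgment sampling), not by random draws from a frame.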

Advantages of Quota Sampling are:
– Quota Sampling is simpler and less demanding on resources, like other non-probability sampling methods.
– The scientist can increase the precision of the research through proper segmentation of the population, using their knowledge.
– We don't need a sampling frame.

Disadvantages of Quota Sampling are:
– Like other non-probability methods, Quota Sampling introduces bias, but that can be mitigated by a proper partition of the population.