Sampling in statistics

Definitions

An observational unit is the person or thing on which measurements are taken. Observational units must be distinct and identifiable. An observational unit can also be called a case, element, experimental unit or statistical unit. Examples of observational units are students, cars, houses and trees. An observation is a characteristic measured on an observational unit.

All observational units that share one or more characteristics form a population. For example, if we observe people, all the people in one country can form a population, because they share the characteristic of being residents of that country. One observational unit can belong to several populations at the same time, depending on the characteristics used to define those populations.

A population is the set of elements that are the object of our research. Sampling is observing only a subset of the whole population. A sample is smaller than the population, so it is important that the sample is representative: it should have the same characteristics as the population.

A parameter is a function that uses the observations of all units in a population to calculate one real number. That number represents the value of some characteristic of the whole population. For example, if we have measured the height of all N people in a population, we can use the function μ = ( Σ Xi ) / N to calculate the average height. For a given population, the parameter is a constant: the result of applying the parameter function.

A statistic is the same kind of function as a parameter, but it is calculated on a sample. An example is the function x̄ = ( Σ xi ) / n, where n is the number of units in the sample. When a statistic is used as an estimate of a parameter, it is referred to as an estimator.
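The difference between a parameter and a statistic can be illustrated with a small sketch. The heights below are made-up numbers, and the names `average`, `heights` and `sample` are chosen only for this illustration.

```python
import random

def average(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)

# Hypothetical population: heights in cm of N = 10 people (made-up numbers).
heights = [158, 162, 165, 168, 170, 172, 175, 178, 181, 191]

rng = random.Random(0)            # fixed seed so the run is repeatable
sample = rng.sample(heights, 4)   # n = 4 units drawn without replacement

mu = average(heights)     # parameter: uses all N units, constant for this population
x_bar = average(sample)   # statistic: uses only the n sampled units

print("parameter mu =", mu)       # 172.0 for the numbers above
print("statistic x_bar =", x_bar)
```

The same function is applied in both cases. What differs is the input: μ is computed on every unit, while x̄ changes whenever a different sample is drawn.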

The method (random, stratified, cluster…) used to select observation units from a population into a sample is known as the sampling procedure. The choice of a sampling procedure, together with the choice of the statistics to use, makes up our sampling design.

For some sampling procedures we need to create a list of all observation units that make up the population. Such a list is called a frame or sampling frame.

Benefits of sampling are:
– Research can be conducted faster and at lower cost. Organizational problems can be avoided.
– Sometimes it is not possible to observe the whole population. Some observational units are not accessible, or there are not enough highly trained personnel or specialized equipment for data collection. Sometimes we don't have enough time to observe the full population.
– When the sample is smaller, personnel can be better trained to produce more accurate results.
– Personnel can dedicate more time to each observational unit. They can measure many characteristics of a unit, so data can be collected for several research projects at the same time.

Collecting information on every unit in the population for the characteristics of interest is known as complete enumeration or census. A census, if carried out without measurement errors, gives us the exact values for the population. If we use sampling, we can make mistakes such as:
– Our results could be biased. This is a consequence of using the wrong sampling procedure.
– If the phenomenon under study is complex, it is really hard to select a representative sample. Some inaccuracy will occur.
– Sometimes it is impossible to properly collect the data from observation units. Some respondents will not be reachable, will refuse to respond, or will not be capable of responding. This forces us to find replacements for some observation units.
– The sampling frame could be incorrect or incomplete. This is often the case with voter lists.

Steps in sampling

1. Define the population. The definition should allow the researcher to decide immediately whether a unit belongs to the population or not.
2. Make a sampling frame.
3. Define the observation unit. All observation units together should make up the population. Observation units should not overlap; they should be mutually exclusive.
4. Choose a sampling procedure.
5. Determine the size of the sample based on the sampling procedure, cost, and precision requirements.
6. Create a sampling plan. A sampling plan is a detailed plan of which measurements will be taken, on which units, at what time, in what way, and by whom.
7. Select the sample.

Types of sampling procedures

There are two groups of sampling methods: Probability Sampling and Non-Probability Sampling.
Probability Sampling involves random selection in which every element of the population has a known, non-zero chance of being selected; in the simplest designs these chances are equal. A big, randomly selected sample is very likely to be representative, although randomness alone cannot guarantee it. Unfortunately, this is not so easy to accomplish.
Non-Probability Sampling involves non-random selection in which the chances of selection are unknown and generally not equal. It is also possible that some units have zero chance of being included in the sample. We use this method when it is not possible to use Probability Sampling, or when we want to make sampling more convenient or cost effective. Such sampling methods are often used in the preliminary stages of research.

Probability Sampling procedures

Simple Random Sampling

Simple random sampling requires a list of all observation units. We then select some of the observation units from that sampling frame by using either a lottery technique or a random number generator.
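Selection with a random number generator can be sketched as follows. The frame of 20 labelled units and the function name `simple_random_sample` are assumptions made for this example; the sketch relies on `random.sample` from the Python standard library, which draws without replacement.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame without replacement."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of 20 labelled observation units.
frame = [f"unit-{i}" for i in range(1, 21)]

sample = simple_random_sample(frame, 5, seed=42)
print(sample)
```

Fixing the seed is useful when the selection has to be documented or reproduced; leaving it as `None` gives a different sample on each run.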

Advantages of Simple Random Sample are:
– It is simple to implement and requires no special skills.
– Because selection is random, the sample is likely to be representative, especially when it is large.

Disadvantages of Simple Random Sample are:
– It is not suitable for large populations because it requires a lot of time and money for data collection.
– This method gives the researcher no control, so an unrepresentative sample can be selected by chance. This risk can be reduced only by taking bigger samples. The method works best for homogeneous populations, where the risk of an unrepresentative sample is smaller.
– It can be difficult to create a sampling frame for some populations.
– This method doesn't take into account the knowledge that the researcher already has about the population.

Systematic Sample

Systematic sampling requires the population to be enumerated. If the population has 12 units and the size of the sample is 4, we want to select one observation unit in every three (= 12/4) consecutive units. The first element is selected at random from the first three observation units. Suppose unit 2 is selected. After this, we select every third unit, so units 2, 5, 8 and 11 make up our sample.
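The 12-unit example can be reproduced with a short sketch. The function name `systematic_sample` and its parameters are chosen for illustration, and positions are counted from 1 to match the text.

```python
import random

def systematic_sample(units, n, start=None, seed=None):
    """Select n units at a fixed interval k = len(units) // n.

    `start` is the 1-based position of the first selected unit; if it is
    omitted, it is drawn at random from the first k positions.
    """
    if not 0 < n <= len(units):
        raise ValueError("sample size must be between 1 and the population size")
    k = len(units) // n                      # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)
    # Positions start, start + k, start + 2k, ... (n of them in total).
    return [units[pos - 1] for pos in range(start, start + k * n, k)]

population = list(range(1, 13))              # units numbered 1..12
print(systematic_sample(population, 4, start=2))   # -> [2, 5, 8, 11]
```

When the population size is not a multiple of the sample size, this sketch simply ignores the few units at the end of the list; other conventions exist.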

Advantages of Systematic Sample are:
– Systematic Sample is simple and linear.
– The chance of selecting units that are next to each other in the population is eliminated.
– It is harder to manipulate the sample in order to get a favored result, because the procedure rigidly decides which units become part of the sample and which do not. This is only true if the units have some natural order. If the researcher can manipulate how the units are ordered, it could actually become easier to manipulate the results.

Disadvantages of Systematic Sample are:
– We have to know in advance how big the population is, or at least estimate its size.
– If there is a pattern in the order of the units, Systematic Sample will be biased. For example, if we choose every 11th player in a football cup, we could end up selecting only goalkeepers, and no outfield player would be selected. We should avoid populations with periodicity.
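The goalkeeper example can be made concrete with a small sketch. It assumes a hypothetical roster of 4 teams with 11 players each, listed team by team with the goalkeeper first; the helper `every_kth` is also an assumption of this example.

```python
def every_kth(units, k, start):
    """Take the units at 1-based positions start, start + k, start + 2k, ..."""
    return units[start - 1::k]

# Hypothetical cup roster: 4 teams, 11 players each, goalkeeper listed first.
roster = []
for team in range(1, 5):
    roster.append((team, "goalkeeper"))
    roster.extend((team, "outfield") for _ in range(10))

picked = every_kth(roster, 11, start=1)
roles = {role for _, role in picked}
print(roles)   # -> {'goalkeeper'}
```

Because the sampling interval equals the period of the list, the random start alone decides the outcome: a start of 1 yields only goalkeepers, and any other start yields none.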

Cluster Sampling

Cluster Sampling can be used when the whole population can be divided into groups, each of which has the same characteristics as the whole population. Such groups are called clusters. We randomly select several clusters, and all the units in them make up our sample.
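A minimal sketch of this procedure follows. It assumes a hypothetical population of trees grouped into land parcels; the names `cluster_sample` and `parcels` are illustrative only.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Randomly pick m whole clusters and return every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)          # draw cluster labels
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical population: 5 parcels (clusters) with 4 trees (units) each.
parcels = {
    f"parcel-{p}": [f"tree-{p}-{t}" for t in range(1, 5)]
    for p in range(1, 6)
}

chosen, trees = cluster_sample(parcels, 2, seed=1)
print(chosen)
print(trees)
```

Note that randomness is applied to the clusters, not to individual units: once a parcel is chosen, every tree on it is measured.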

Advantages of Cluster Sampling are:
– Observation units can be spatially close to each other. For example, if the clusters are land parcels and the units are trees, all the trees on one parcel are close to each other. This can significantly reduce the cost of data collection.
– Because observation units are close to each other, it is easier to collect bigger samples.
– If the clusters really represent the population, estimates made with cluster sampling will have lower variance.

Disadvantages of Cluster Sampling are:
– We have to be careful not to include clusters that differ from the general population.
– A unit should not belong to several clusters. In the tree example, a tree on the border between two parcels could be measured twice.
– For the estimates to be reliable, all clusters should be of similar size.

Stratified Sampling

Stratified Sampling involves dividing the population into subpopulations that may differ in some important trait. Such subpopulations should not overlap, and together they should make up the whole population. One such subpopulation is called a stratum; the plural of stratum is strata. We then take a simple random or systematic sample from each stratum. The number of units taken from each stratum should be proportional to the size of the stratum.
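Proportional allocation can be sketched as below, taking a simple random sample inside each stratum. The strata names, their sizes and the function `proportional_stratified_sample` are assumptions made for this example. The per-stratum counts are rounded, so for awkward proportions they may not add up exactly to the requested total.

```python
import random

def proportional_stratified_sample(strata, n, seed=None):
    """Take from each stratum a share of n proportional to the stratum's size."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        share = round(n * len(units) / total)   # proportional allocation
        sample[name] = rng.sample(units, share) # simple random sample within stratum
    return sample

# Hypothetical population of 100 units split into three strata.
strata = {
    "urban":    [f"u{i}" for i in range(60)],
    "suburban": [f"s{i}" for i in range(30)],
    "rural":    [f"r{i}" for i in range(10)],
}

sample = proportional_stratified_sample(strata, 10, seed=5)
print({name: len(units) for name, units in sample.items()})
# -> {'urban': 6, 'suburban': 3, 'rural': 1}
```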

Advantages of Stratified Sampling are:
– Every important part of the population is included in the sample.
– It is possible to investigate differences between strata.
– Because the units in each stratum are similar, the average value of a characteristic within a stratum has smaller variance. As a consequence, the estimator for the whole population has smaller variance too.

Disadvantages of Stratified Sampling are:
– We need a complete sampling frame, and each observation unit has to be assigned to the stratum it belongs to.
– It is often hard to divide the population into subpopulations that are internally homogeneous but differ from one another.

Multistage Sampling

Multistage Sampling obtains a sample by splitting the population into smaller and smaller groups and sampling individuals from the smallest resulting groups. It stacks several sampling methods one after the other. Stratified Sampling can be seen as a special case of Multistage Sampling with two stages: first the population is divided into strata, then a sample is taken within each stratum. Another possibility is to divide the population into clusters and then take a systematic sample from each cluster.
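The cluster-then-systematic combination might look like the sketch below. It assumes that the first stage picks clusters at random and that the second stage takes every `step`-th unit from each chosen cluster; the function name, the parameters and the data are illustrative.

```python
import random

def two_stage_sample(clusters, m, step, seed=None):
    """Stage 1: pick m clusters at random. Stage 2: systematic sample in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)      # stage 1: random clusters
    sample = {}
    for name in chosen:
        start = rng.randrange(step)               # random 0-based start in first `step` units
        sample[name] = clusters[name][start::step]  # stage 2: every step-th unit
    return sample

# Hypothetical population: 6 clusters with 12 numbered units each.
clusters = {
    f"cluster-{c}": list(range(c * 100 + 1, c * 100 + 13))
    for c in range(1, 7)
}

sample = two_stage_sample(clusters, m=2, step=3, seed=7)
for name, units in sample.items():
    print(name, units)
```

Each chosen cluster contributes 4 of its 12 units here, so the final sample has 8 units drawn from only 2 of the 6 clusters.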

Non-Probability Sampling procedures

Purposive or Judgment Sampling

Judgment sampling is when the researcher selects the units of the sample at their own discretion, using their knowledge of the population. It is used when the researcher wants to gain detailed knowledge about some phenomenon. It is also used when the population is very specific and hard to identify.

Advantages of Judgment Sampling are:
– It can be used for small and hidden populations.
– The examiner can use all of their knowledge to create a heterogeneous and representative sample.
– This sampling method can combine several qualitative research designs and can be conducted in multiple phases.

Disadvantages of Judgment Sampling are:
– There is a high risk of bias because the sample is not selected by chance. Also, samples obtained by purposive sampling are usually small.
– There is no way to properly calculate the optimal sample size or to estimate the accuracy of the research results.
– The examiner can easily manipulate the sample to get artificial results.

Convenience sampling

Advantages of Convenience Sampling are:
– We can collect answers from dissatisfied buyers or employees. People are hesitant to express their dissatisfaction openly, but they are more willing to do so as part of a research study.
– This method is good for the first stages of research: we don't have to worry about the quality of our sample, all participants are willing to give us answers, we can collect some demographic data about them, and we get immediate feedback.
– It is cheap and fast.

Disadvantages of Convenience Sampling are:
– Potential bias. We cover only the people and things near to us, and all others are neglected. Results from convenience sampling cannot be generalized.
– People who are in a hurry will often give incomplete or false answers to shorten the interaction with us. This can cause the examiner to start avoiding people who are nervous or in a hurry, which further reduces the representativeness of the sample.

Snowball Sampling

Advantages of Snowball Sampling are:
– It can be used when the sampling frame is unknown, or when respondents don't want to disclose their status or identify themselves.
– The sampling process is faster and more economical because existing contacts are used to reach other people.

Disadvantages of Snowball Sampling are:
– Because they are connected, all participants have some traits in common. This can exclude the members of the population who don't share those traits, which introduces a strong bias because the population is not correctly represented.
– People from vulnerable groups can show resistance and doubt. The researcher has to be careful to earn their trust.
– The examiner cannot use previous knowledge to improve the sample and cannot control the sampling process.

Quota sampling

Quota Sampling is similar to Stratified Sampling. Here, too, we try to split the population into mutually exclusive, homogeneous groups, which reduces the variance inside each group. We then apply some non-probability sampling method to select units inside each group.
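One possible way to fill quotas is sketched below. It assumes that the non-probability step is "accept whoever arrives first until the group's quota is full"; that rule, the function `quota_sample`, and the respondents are all assumptions of this example.

```python
def quota_sample(respondents, quotas, group_of):
    """Accept respondents in arrival order until every group's quota is full."""
    remaining = dict(quotas)            # how many more units each group needs
    sample = []
    for person in respondents:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()): # all quotas filled, stop collecting
            break
    return sample

# Hypothetical respondents in the order they were encountered: (name, age group).
arrivals = [
    ("Ana", "young"), ("Ben", "old"), ("Cy", "old"),
    ("Dee", "young"), ("Eli", "young"),
]

sample = quota_sample(arrivals, {"young": 2, "old": 1}, group_of=lambda p: p[1])
print(sample)   # -> [('Ana', 'young'), ('Ben', 'old'), ('Dee', 'young')]
```

No random selection and no sampling frame are involved: who ends up in the sample depends entirely on who happens to be encountered first.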

Advantages of Quota Sampling are:
– Quota Sampling is simpler and less demanding on resources, like other non-probability sampling methods.
– The researcher can increase the precision of the research by using their knowledge to segment the population properly.
– We don't need a sampling frame.