Sample size determination

Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. In complicated studies there may be several different sample sizes: for example, in a stratified survey there would be different sizes for each stratum. In a census, data is sought for an entire population, hence the intended sample size is equal to the population. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.
Sample sizes may be chosen in several ways:

using experience – small samples, though sometimes unavoidable, can result in wide confidence intervals and risk of errors in statistical hypothesis testing.
using a target variance for an estimate to be derived from the sample eventually obtained, i.e. if a high precision is required this translates to a low target variance of the estimator.
using a target for the power of a statistical test to be applied once the sample is collected.
using a confidence level, i.e. the larger the required confidence level, the larger the sample size.
Introduction

Larger sample sizes generally lead to increased precision when estimating unknown parameters. For example, if we wish to know the proportion of a certain species of fish that is infected with a pathogen, we would generally have a more precise estimate of this proportion if we sampled and examined 200 rather than 100 fish. Several fundamental facts of mathematical statistics describe this phenomenon, including the law of large numbers and the central limit theorem.
In some situations, the increase in precision for larger sample sizes is minimal, or even non-existent. This can result from the presence of systematic errors or strong dependence in the data, or if the data follows a heavy-tailed distribution.
Sample sizes may be evaluated by the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test. For example, if we are comparing the support for a certain political candidate among women with the support for that candidate among men, we may wish to have 80% power to detect a difference in the support levels of 0.04 units.

Estimation

Estimation of a proportion

A relatively simple situation is estimation of a proportion. For example, we may wish to estimate the proportion of residents in a community who are at least 65 years old.
The estimator of a proportion is, where X is the number of 'positive' observations . When the observations are independent, this estimator has a binomial distribution. The maximum variance of this distribution is 0.25/n, which occurs when the true parameter is p = 0.5. In practice, since p is unknown, the maximum variance is often used for sample size assessments. If a reasonable estimate for p is known the quantity may be used in place of 0.25.
For sufficiently large n, the distribution of will be closely approximated by a normal distribution. Using this and the Wald method for the binomial distribution, yields a confidence interval of the form
If we wish to have a confidence interval that is W units total in width, we would solve
for n, yielding the sample size
, in the case of using.5 as the most conservative estimate of the proportion.
Otherwise, the formula would be , which yields .
For example, if we are interested in estimating the proportion of the US population who supports a particular presidential candidate, and we want the width of 95% confidence interval to be at most 2 percentage points, then we would need a sample size of / = 9604. It is reasonable to use the 0.5 estimate for p in this case because the presidential races are often close to 50/50, and it is also prudent to use a conservative estimate. The margin of error in this case is 1 percentage point.

Estimation of a mean

A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size n, where each data value has variance σ², the standard error of the sample mean is:
This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields a confidence interval of the form
If we wish to have a confidence interval that is W units total in width, we would solve
for n, yielding the sample size
.
For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a 95% confidence interval that is six units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is, which would be rounded up to 97, because the obtained value is the minimum sample size, and sample sizes must be integers and must lie on or above the calculated minimum.

Required sample sizes for hypothesis tests

A common problem faced by statisticians is calculating the sample size required to yield a certain power for a test, given a predetermined Type I error rate α. As follows, this can be estimated by pre-determined tables for certain values, by Mead's resource equation, or, more generally, by the cumulative distribution function:

Tables

The table shown on the right can be used in a two-sample t-test to estimate the sample sizes of an experimental group and a control group that are of equal size, that is, the total number of individuals in the trial is twice that of the number given, and the desired significance level is 0.05. The parameters used are:

The desired statistical power of the trial, shown in column to the left.
Cohen's d, which is the expected difference between the means of the target values between the experimental group and the control group, divided by the expected standard deviation.
Mead's resource equation

Mead's resource equation is often used for estimating sample sizes of laboratory animals, as well as in many other laboratory experiments. It may not be as accurate as using other methods in estimating sample size, but gives a hint of what is the appropriate sample size where parameters such as expected standard deviations or expected differences in values between groups are unknown or very hard to estimate.
All the parameters in the equation are in fact the degrees of freedom of the number of their concepts, and hence, their numbers are subtracted by 1 before insertion into the equation.
The equation is:
where:

N is the total number of individuals or units in the study
B is the blocking component, representing environmental effects allowed for in the design
T is the treatment component, corresponding to the number of treatment groups being used, or the number of questions being asked
E is the degrees of freedom of the error component, and should be somewhere between 10 and 20.

For example, if a study using laboratory animals is planned with four treatment groups, with eight animals per group, making 32 animals total, without any further stratification, then E would equal 28, which is above the cutoff of 20, indicating that sample size may be a bit too large, and six animals per group might be more appropriate.

Cumulative distribution function

Let X_i, i = 1, 2,..., n be independent observations taken from a normal distribution with unknown mean μ and known variance σ². Consider two hypotheses, a null hypothesis:
and an alternative hypothesis:
for some 'smallest significant difference' μ^* > 0. This is the smallest value for which we care about observing a difference. Now, if we wish to reject H₀ with a probability of at least 1 − β when
H_a is true, and reject H₀ with probability α when H₀ is true, then we need the following:
If z_α is the upper α percentage point of the standard normal distribution, then
and so
is a decision rule which satisfies.
Now we wish for this to happen with a probability at least 1 − β when
H_a is true. In this case, our sample average will come from a Normal distribution with mean μ^*. Therefore, we require
Through careful manipulation, this can be shown to happen when
where is the normal cumulative distribution function.

Stratified sample size

With more complicated sampling techniques, such as stratified sampling, the sample can often be split up into sub-samples. Typically, if there are H such sub-samples then each of them will have a sample size n_h, h = 1, 2,..., H. These n_h must conform to the rule that n₁ + n₂ +... + n_H = n. Selecting these n_h optimally can be done in various ways, using Neyman's optimal allocation.
There are many reasons to use stratified sampling: to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually.
A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.
In general, for H strata, a weighted sample mean is
with
The weights,, frequently, but not always, represent the proportions of the population elements in the strata, and. For a fixed sample size, that is,
which can be made a minimum if the sampling rate within each stratum is made
proportional to the standard deviation within each stratum:, where and is a constant such that.
An "optimum allocation" is reached when the sampling rates within the strata
are made directly proportional to the standard deviations within the strata
and inversely proportional to the square root of the sampling cost per element
within the strata, :
where is a constant such that, or, more generally, when

Qualitative research

Sample size determination in qualitative studies takes a different approach. It is generally a subjective judgment, taken as the research proceeds. One approach is to continue to include further participants or material until saturation is reached. The number needed to reach saturation has been investigated empirically.
There is a paucity of reliable guidance on estimating sample sizes before starting the research, with a range of suggestions given. A tool akin to a quantitative power calculation, based on the negative binomial distribution, has been suggested for thematic analysis.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...