Unseen species problem

The unseen species problem, commonly posed in ecology, concerns estimating the number of species in an ecosystem that are not represented in the samples taken from it. More specifically, it asks how many new species would be discovered if more samples were taken. The study of the problem began in the early 1940s with Alexander Steven Corbet, who spent two years in British Malaya trapping butterflies and was curious how many new species he would discover if he spent another two years trapping. Many estimation methods have since been developed to predict how many new species additional samples would reveal. The problem also applies more broadly, as the estimators can be used to estimate the number of new elements of any set that have not yet appeared in samples. An example is estimating how many words William Shakespeare knew based on all of his written works. The unseen species problem can be stated mathematically as follows:
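If $n$ independent samples are taken, $X^n \triangleq X_1, \ldots, X_n$, and then $m$ more independent samples are taken, the number of unseen species that will be discovered by the additional samples is given by
$$U \triangleq U(X^n, X_{n+1}^{n+m}) \triangleq \left| \{X_{n+1}^{n+m}\} \setminus \{X^n\} \right|,$$
with $X_{n+1}^{n+m} \triangleq X_{n+1}, \ldots, X_{n+m}$ being the second set of $m$ samples.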
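As a minimal sketch of this definition, the count of newly discovered species can be computed as a set difference when both batches of samples are actually available. The function name and the toy species labels are illustrative assumptions, not part of the original formulation.

```python
def count_newly_discovered(first_samples, later_samples):
    """Number of distinct species in later_samples that never occur in first_samples."""
    return len(set(later_samples) - set(first_samples))


# Toy data (hypothetical labels): "d" and "e" appear only in the second batch.
first = ["a", "b", "a", "c"]
later = ["c", "d", "e", "d"]
print(count_newly_discovered(first, later))  # 2
```

In practice the second batch has not been collected yet, which is why this quantity has to be estimated rather than counted.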

History

In the early 1940s Alexander Steven Corbet spent two years in British Malaya trapping butterflies. He kept track of how many species he observed and how many members of each species were captured. For example, there were 74 different species of which he captured only 2 members each. When he returned to the United Kingdom, he approached the statistician Ronald Fisher and asked how many new species of butterflies he could expect to catch if he went trapping for another two years. In essence, Corbet was asking how many species he had observed zero times. Fisher responded with a simple estimate: for an additional two years of trapping, Corbet could expect to capture 75 new species. He obtained this using a simple alternating sum:
$$U = \sum_{i=1}^{n} (-1)^{i+1} \varphi_i = 118 - 74 + 44 - 24 + \cdots - 12 + 6 = 75.$$
Here, $\varphi_i$ is the number of individual species that were observed $i$ times. Fisher's sum was later confirmed by Good and Toulmin.

Estimators

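To estimate the number of unseen species, let $t \triangleq m/n$ be the number of future samples ($m$) divided by the number of past samples ($n$), so that $m = tn$. Let $\varphi_i$ be the number of individual species observed $i$ times (for example, if 74 species were each observed exactly twice, then $\varphi_2 = 74$).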
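The counts of species observed exactly once, exactly twice, and so on can be tabulated directly from a list of samples. The sketch below assumes each sample is simply a species label; the helper name `prevalences` is hypothetical.

```python
from collections import Counter


def prevalences(samples):
    """Map i -> number of distinct species observed exactly i times."""
    counts_per_species = Counter(samples)               # species -> times observed
    return dict(Counter(counts_per_species.values()))   # times observed -> number of species


# Toy data: a is seen twice, b and c once each, d three times.
print(prevalences(["a", "b", "a", "c", "d", "d", "d"]))
```

Only these counts, not the species identities, are needed by the estimators that follow.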

The Good–Toulmin estimator

The Good–Toulmin estimator was developed by I. J. Good and G. H. Toulmin in 1953. The estimate of the unseen species based on the Good–Toulmin estimator is given by
$$U^{GT} \triangleq U^{GT}(X^n, t) \triangleq -\sum_{i=1}^{\infty} (-t)^i \varphi_i.$$
The Good–Toulmin estimator has been shown to be a good estimate for values of $t \leq 1$. Its mean squared error satisfies
$$\mathbb{E}\left(U^{GT} - U\right)^2 \lesssim n t^2,$$
which means that $U^{GT}$ estimates $U$ to within $\sqrt{n} \cdot t$ as long as $t \leq 1$. However, for $t > 1$, the Good–Toulmin estimator fails to give accurate results. This is because, if $t > 1$, every $i$ with $\varphi_i > 0$ contributes a term $-(-t)^i \varphi_i$ of magnitude $t^i \varphi_i$, so $U^{GT}$ grows super-linearly in $t$, whereas $U$ can grow at most linearly in $t$. Therefore, when $t > 1$, $U^{GT}$ grows faster than $U$ and does not approximate the true value.
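To compensate for this, Efron and Thisted showed that a truncated Euler transform can also be a usable estimate:
$$U^{ET} \triangleq \sum_{i=1}^{n} h_i^{ET} \cdot \varphi_i,$$
with
$$h_i^{ET} \triangleq -(-t)^i \cdot \mathbb{P}\left(\mathrm{Bin}\left(k, \frac{1}{1+t}\right) \geq i\right)$$
and
$$\mathbb{P}\left(\mathrm{Bin}\left(k, \frac{1}{1+t}\right) \geq i\right) = \begin{cases} \sum_{j=i}^{k} \binom{k}{j} \frac{t^{k-j}}{(1+t)^k} & i \leq k, \\ 0 & i > k, \end{cases}$$
where $k$ is the location chosen to truncate the Euler transform.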
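The two estimators in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the counts of species seen exactly i times are passed as a dictionary, the infinite Good–Toulmin sum is taken over the observed counts only, and the function names and toy counts are hypothetical.

```python
from math import comb


def good_toulmin(phi, t):
    """Alternating sum -sum_i (-t)^i * phi[i], where phi maps i -> species seen i times."""
    return -sum((-t) ** i * count for i, count in phi.items())


def binomial_tail(k, p, i):
    """P(Bin(k, p) >= i)."""
    if i > k:
        return 0.0
    return sum(comb(k, j) * p ** j * (1 - p) ** (k - j) for j in range(i, k + 1))


def efron_thisted(phi, t, k):
    """Truncated Euler transform: each term is damped by P(Bin(k, 1/(1+t)) >= i)."""
    p = 1.0 / (1.0 + t)
    return sum(-((-t) ** i) * binomial_tail(k, p, i) * count for i, count in phi.items())


toy_phi = {1: 10, 2: 4, 3: 1}            # hypothetical counts
print(good_toulmin(toy_phi, 1))          # 7
print(efron_thisted(toy_phi, 1, k=3))    # 6.875
print(good_toulmin(toy_phi, 2))          # 12: the higher-order terms start to dominate
```

The binomial weights shrink toward zero as i approaches k, which tempers the geometric growth of the raw terms when t exceeds 1.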

The smoothed Good–Toulmin estimator

Similar to the approach by Efron and Thisted, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu developed the smoothed Good–Toulmin estimator. They realized that the Good–Toulmin estimator fails for $t > 1$ because of its exponential growth, not because of its bias. Therefore, they estimated the number of unseen species by truncating the series:
$$U^{\ell} \triangleq -\sum_{i=1}^{\ell} (-t)^i \varphi_i.$$
Orlitsky, Suresh, and Wu also noted that for $t > 1$, the driving term in the truncated sum is the $\ell$-th term, regardless of which value of $\ell$ is chosen. To solve this, they selected a random nonnegative integer $L$, truncated the series at $L$, and then took the average over the distribution of $L$. The resulting estimator is
$$U^{L} = \mathbb{E}_L\left[ -\sum_{i=1}^{L} (-t)^i \varphi_i \right].$$
This method was chosen because the bias of $U^{\ell}$ alternates in sign due to the $(-t)^i$ coefficient, so averaging over a distribution of $L$ reduces the bias. The estimator can therefore be written as a linear combination of the prevalences:
$$U^{L} = \mathbb{E}_L\left[ -\sum_{i \geq 1} (-t)^i \mathbf{1}_{i \leq L} \, \varphi_i \right] = -\sum_{i \geq 1} (-t)^i \, \mathbb{P}(L \geq i) \, \varphi_i.$$
The results vary depending on the distribution chosen for $L$. With this method, estimates can be made for $t \propto \ln n$, and this is the best possible.
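A minimal sketch of the smoothing idea follows: each term of the alternating series is weighted by the probability that a random truncation point L is at least i. The Poisson choice for L and the rate `r = 2.0` are assumptions made purely for illustration; the function names and toy counts are likewise hypothetical.

```python
from math import exp


def poisson_tail(r, i):
    """P(L >= i) for L ~ Poisson(r), computed as 1 - P(L <= i - 1)."""
    term = exp(-r)          # P(L = 0)
    cdf = 0.0
    for j in range(i):
        cdf += term         # add P(L = j)
        term *= r / (j + 1)  # advance to P(L = j + 1)
    return max(0.0, 1.0 - cdf)


def smoothed_good_toulmin(phi, t, tail):
    """-sum_i (-t)^i * P(L >= i) * phi[i]; tail(i) must return P(L >= i)."""
    return -sum((-t) ** i * tail(i) * count for i, count in phi.items())


toy_phi = {1: 10, 2: 4, 3: 1}   # hypothetical counts of species seen 1, 2, 3 times
r = 2.0                          # hypothetical smoothing parameter
print(smoothed_good_toulmin(toy_phi, 2, lambda i: poisson_tail(r, i)))
```

Passing a tail function that always returns 1 recovers the unsmoothed alternating sum, and a step function that drops to 0 after a fixed index recovers plain truncation.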

Species discovery curve

A species discovery curve can also be used. This curve gives the number of species found in an area as a function of time. Such curves can also be created by using the estimators and plotting the number of unseen species at each value of $t$.
A species discovery curve never decreases, as no sample can reduce the number of discovered species. The curve is also decelerating: the more samples taken, the fewer unseen species are expected to be discovered. It never reaches an asymptote, as it is assumed that although the discovery rate may become arbitrarily slow, it never actually stops. Two common models for a species discovery curve are the logarithmic and the exponential function.

Example – Corbet's butterflies

As an example, consider the data Corbet provided Fisher in the 1940s. Using the Good–Toulmin model, the number of unseen species is found using
$$U = -\sum_{i=1}^{\infty} (-t)^i \varphi_i.$$
This can then be used to create a relationship between $t$ and $U$.

Number of observed members, $i$	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Number of species, $\varphi_i$	118	74	44	24	29	22	20	19	20	15	12	14	6	12	6

This relationship is shown in the plot below.
From the plot, it is seen that at $t = 1$, which was the value of $t$ that Corbet brought to Fisher, the resulting estimate of $U$ is 75, matching what Fisher found. This plot also acts as a species discovery curve for this ecosystem, and shows how many new species will be discovered as $t$ increases.
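The estimate can be checked numerically from the tabulated counts. The sketch below assumes the alternating-sum form of the Good–Toulmin estimate and truncates it at the 15 columns listed in the table; the variable and function names are illustrative.

```python
# Species counts from the table: 118 species seen once, 74 seen twice, ..., 6 seen 15 times.
corbet_counts = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]
corbet_phi = dict(zip(range(1, 16), corbet_counts))


def good_toulmin(phi, t):
    """Alternating sum -sum_i (-t)^i * phi[i]."""
    return -sum((-t) ** i * count for i, count in phi.items())


# Evaluate the estimate on a grid of t values to trace out the discovery curve.
for t in (0.25, 0.5, 0.75, 1):
    print(t, good_toulmin(corbet_phi, t))

# At t = 1 (another two years of trapping) the sum is 118 - 74 + 44 - ... + 6 = 75.
```

Plotting these pairs for a finer grid of values between 0 and 1 gives a discovery curve computed from the same fifteen counts.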

Other uses

The estimators have numerous other uses. Because they remain accurate when the number of additional samples does not exceed the number already collected, they allow scientists to extrapolate results, such as those of an opinion poll, to twice the original sample size. The number of new unique answers can be predicted from how many answers were each given by the same number of people. The method can also be used to determine the extent of someone's knowledge. A prime example is determining how many unique words Shakespeare knew based on the written works we have today.

Example – How many words did Shakespeare know?

Based on research done by Thisted and Efron, Shakespeare's known works contain 884,647 total words. Their word counts, which include a tally of the different words that appear more than 100 times, give a total of 31,534 unique words. Applying the Good–Toulmin model with $t = 1$, that is, supposing an equal volume of new works by Shakespeare were discovered, yields an estimate $U^{\text{words}}$ of the number of new unique words that would be found in them. The goal would be to derive $U^{\text{words}}$ for $t \to \infty$. Thisted and Efron estimate that $U^{\text{words}}(t \to \infty) \geq 35{,}000$, meaning that Shakespeare most likely knew over twice as many words as he actually used in all of his writings.
