Statistical model

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process.
A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory".
All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference.

Introduction

Informally, a statistical model can be thought of as a statistical assumption with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption is this: for each of the dice, the probability of each face coming up is. From that assumption, we can calculate the probability of both dice coming up 5: More generally, we can calculate the probability of any event: e.g. or or.
The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is . From that assumption, we can calculate the probability of both dice coming up 5: We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.
The first statistical assumption constitutes a statistical model: because with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: because with the assumption alone, we cannot calculate the probability of every event.
In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical. For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.

Formal definition

In mathematical terms, a statistical model is usually thought of as a pair, where is the set of possible observations, i.e. the sample space, and is a set of probability distributions on.
The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose to represent a set which contains a distribution that adequately approximates the true distribution.
Note that we do not require that contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"—whence the saying "all models are wrong".
The set is almost always parameterized:. The set defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. must hold. A parameterization that meets the requirement is said to be identifiable.

An example

Suppose that we have a population of school children, with the ages of the children distributed uniformly, in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this:
height_i = b₀ + b₁age_i + ε_i, where b₀ is the intercept, b₁ is a parameter that age is multiplied by to obtain a prediction of height, ε_i is the error term, and i identifies the child. This implies that height is predicted by age, with some error.
An admissible model must be consistent with all the data points. Thus, a straight line cannot be the equation for a model of the data—unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, ε_i, must be included in the equation, so that the model is consistent with all the data points.
To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: b₀, b₁, and the variance of the Gaussian distribution.
We can formally specify the model in the form as follows. The sample space,, of our model comprises the set of all possible pairs. Each possible value of = determines a distribution on ; denote that distribution by. If is the set of all possible values of, then.
In this example, the model is determined by specifying and making some assumptions relevant to. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify —as they are required to do.

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.
Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic.
Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".
There are three purposes for a statistical model, according to Konishi & Kitagawa.

Predictions
Extraction of information
Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description. The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.

Dimension of a model

Suppose that we have a statistical model with. The model is said to be parametric if has a finite dimension. In notation, we write that where is a positive integer. Here, is called the dimension of the model.
As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
In this example, the dimension,, equals 2.
As another example, suppose that the data consists of points that we assume are distributed according to a straight line with i.i.d. Gaussian residuals : this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals.
Although formally is a single parameter that has dimension, it is sometimes regarded as comprising separate parameters. For example, with the univariate Gaussian distribution, is formally a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters—the mean and the standard deviation.
A statistical model is nonparametric if the parameter set is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if is the dimension of and is the number of samples, both semiparametric and nonparametric models have as. If as, then the model is semiparametric; otherwise, the model is nonparametric.
Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".

Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
has, nested within it, the linear model
—we constrain the parameter to equal 0.
In both those examples, the first model has a higher dimension than the second model. Such is often, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.

Comparing models

Comparing statistical models is fundamental for much of statistical inference. Indeed, state, "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."
Common criteria for comparing models include the following: R², Bayes factor, and the likelihood-ratio test together with its generalization relative likelihood.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...