Shot transition detection


Shot transition detection also called cut detection is a field of research of video processing. Its subject is the automated detection of transitions between shots in digital video with the purpose of temporal segmentation of videos.

Use

Shot transition detection is used to split up a film into basic temporal units called shots; a shot is a series of interrelated consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.
This operation is of great use in software for post-production of videos. It is also a fundamental step of automated indexing and content-based video retrieval or summarization applications which provide an efficient access to huge video archives, e.g. an application may choose a representative picture from each scene to create a visual overview of the whole film and, by processing such indexes, a search engine can process search items like "show me all films where there's a scene with a lion in it."
Generally speaking, cut detection can do nothing that a human editor couldn't do manually, but it saves a lot of time. Furthermore, due to the increase in the use of digital video and, consequently, in the importance of the aforementioned indexing applications, the automatic cut detection is very important nowadays.

Basic technical terms

In simple terms cut detection is about finding the positions in a video in that one scene is replaced by another one with different visual content. Technically speaking the following terms are used:
A digital video consists of frames that are presented to the viewer's eye in rapid succession to create the impression of movement. "Digital" in this context means both that a single frame consists of pixels and the data is present as binary data, such that it can be processed with a computer. Each frame within a digital video can be uniquely identified by its frame index, a serial number.
A shot is a sequence of frames shot uninterruptedly by one camera. There are several film transitions usually used in film editing to juxtapose adjacent shots; In the context of shot transition detection they are usually group into two types:
"Detecting a cut" means that the position of a cut is gained; more precisely a hard cut is gained as "hard cut between frame i and frame i+1", a soft cut as "soft cut from frame i to frame j".
A transition that is detected correctly is called a hit, a cut that is there but was not detected is called a missed hit and a position in that the software assumes a cut, but where actually no cut is present, is called a false hit.
An introduction to film editing and an exhaustive list of shot transition techniques can be found at film editing.

Vastness of the problem

Although cut detection appears to be a simple task for a human being, it is a non-trivial task for computers. Cut detection would be a trivial problem if each frame of a video was enriched with additional information about when and by which camera it was taken. Possibly no algorithm for cut detection will ever be able to detect all cuts with certainty, unless it is provided with powerful artificial intelligence.
While most algorithms achieve good results with hard cuts, many fail with recognizing soft cuts. Hard cuts usually go together with sudden and extensive changes in the visual content while soft cuts feature slow and gradual changes. A human being can compensate this lack of visual diversity with understanding the meaning of a scene. While a computer assumes a black line wiping a shot away to be "just another regular object moving slowly through the on-going scene", a person understands that the scene ends and is replaced by a black screen.

Methods

Each method for cut detection works on a two-phase-principle:
  1. Scoring – Each pair of consecutive frames of a digital video is given a certain score that represents the similarity/dissimilarity between these two frames.
  2. Decision – All scores calculated previously are evaluated and a cut is detected if the score is considered high.
This principle is error prone. First, because even minor exceedings of the threshold value produce a hit, it must be ensured that phase one scatters values widely to maximize the average difference between the score for "cut" and "no cut". Second, the threshold must be chosen with care; usually useful values can be gained with statistical methods.

Scoring

There are many possible scores used to access the differences in the visual content; some of the most common are:
Finally, a combination of two or more of these scores can improve the performance.

Decision

Also in the decision phase various approaches are usually used:
All of the above algorithms complete in O — that is to say they run in linear time — where n is the number of frames in the input video. The algorithms differ in a constant factor that is determined mostly by the image resolution of the video.

Measures for quality

Usually the following three measures are used to measure the quality of a cut detection algorithm:




The symbols stand for: C, the number of correctly detected cuts, M, the number of not detected cuts and F, the number of falsely detected cuts. All of these measures are mathematical measures, i. e. they deliver values in between 0 and 1. The basic rule is: the higher the value, the better performs the algorithm.