URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent. Search engines employ URI normalization in order to reduce the indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.
Normalization process
There are several types of normalization that may be performed. Some of them always preserve semantics while others may not.
Normalizations that preserve semantics
The following normalizations are described in RFC 3986 to result in equivalent URIs (a short code sketch illustrating them follows this list):
Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet of the URI are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F. Example: http://example.com/foo%2a → http://example.com/foo%2A
Converting the scheme and host to lowercase. The scheme and host components of the URI are case-insensitive and therefore should be normalized to lowercase. Example: HTTP://User@Example.COM/Foo → http://User@example.com/Foo
Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets of the URI in the ranges of ALPHA, DIGIT, hyphen, period, underscore, or tilde do not require percent-encoding and should be decoded to their corresponding unreserved characters. Example: http://example.com/%7Efoo → http://example.com/~foo
Removing dot-segments. The dot-segments "." and ".." in the path component of the URI should be removed by applying the remove_dot_segments algorithm described in RFC 3986 to the path. Example: http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
Converting an empty path to a "/" path. In the presence of an authority component, an empty path component should be normalized to a path component of "/". Example: http://example.com → http://example.com/
Removing the default port. An empty or default port component of the URI (port 80 for the http scheme) with its ":" delimiter should be removed. Example: http://example.com:80/ → http://example.com/
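Taken together, these rules can be composed into a single normalizer. The following is a minimal sketch in Python, assuming a small table of default ports; the helper names (normalize_uri, _fix_percent_encoding, _remove_dot_segments) are illustrative, not any standard API:

    import re
    from urllib.parse import urlsplit, urlunsplit

    UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "0123456789-._~")
    DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed scheme-to-port table

    def _fix_percent_encoding(text):
        # Uppercase the hex digits of each percent-encoded triplet and
        # decode triplets that encode unreserved characters.
        def repl(match):
            char = chr(int(match.group(1), 16))
            return char if char in UNRESERVED else match.group(0).upper()
        return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)

    def _remove_dot_segments(path):
        # The remove_dot_segments algorithm of RFC 3986, section 5.2.4.
        output = []
        while path:
            if path.startswith("../"):
                path = path[3:]
            elif path.startswith("./"):
                path = path[2:]
            elif path.startswith("/./"):
                path = "/" + path[3:]
            elif path == "/.":
                path = "/"
            elif path.startswith("/../") or path == "/..":
                path = "/" + path[4:] if path != "/.." else "/"
                if output:
                    output.pop()
            elif path in (".", ".."):
                path = ""
            else:
                i = path.find("/", 1)  # move the first segment to the output
                output.append(path if i == -1 else path[:i])
                path = "" if i == -1 else path[i:]
        return "".join(output)

    def normalize_uri(uri):
        parts = urlsplit(uri)
        scheme = parts.scheme.lower()            # lowercase the scheme
        netloc = (parts.hostname or "").lower()  # lowercase the host
        if parts.username:                       # reattach any userinfo
            userinfo = parts.username
            if parts.password:
                userinfo += ":" + parts.password
            netloc = userinfo + "@" + netloc
        if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
            netloc += ":" + str(parts.port)      # drop only the default port
        path = _remove_dot_segments(_fix_percent_encoding(parts.path))
        if netloc and not path:
            path = "/"                           # empty path becomes "/"
        return urlunsplit((scheme, netloc, path,
                           _fix_percent_encoding(parts.query), parts.fragment))

    # normalize_uri("HTTP://Example.COM:80/a/./b/../%7efoo%2a")
    # returns "http://example.com/a/~foo%2A"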
Normalizations that usually preserve semantics
For http and https URIs, the following normalization, listed in RFC 3986, may result in an equivalent URI, but equivalence is not guaranteed by the standards:
Adding a trailing "/" to a non-empty path. Directories (folders) are indicated with a trailing slash and should be included in URIs (see the sketch below). Example: http://example.com/foo → http://example.com/foo/
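Whether this rule is safe is site-specific, so a cautious normalizer can try to verify it before applying it. The sketch below is one illustrative way to do so; it assumes the third-party requests library and treats a matching final URL and status code as evidence of equivalence:

    import requests

    def add_trailing_slash_if_equivalent(uri, timeout=5):
        # Only consider path-like URIs that do not already end in "/".
        if uri.endswith("/") or "?" in uri:
            return uri
        candidate = uri + "/"
        try:
            a = requests.head(uri, allow_redirects=True, timeout=timeout)
            b = requests.head(candidate, allow_redirects=True, timeout=timeout)
        except requests.RequestException:
            return uri
        # Prefer the slashed form only when both requests end up at the
        # same final URL with the same status code.
        if (a.status_code, a.url) == (b.status_code, b.url):
            return candidate
        return uri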
Normalizations that change semantics
Applying the following normalizations results in a semantically different URI, although the new URI may still refer to the same resource (a combined sketch follows this list):
Limiting protocols. Different application layer protocols may be limited; for example, the "https" scheme could be replaced with "http". Example: https://example.com/ → http://example.com/
Removing duplicate slashes. Paths that include two adjacent slashes could be converted to one. Example: http://example.com/foo//bar.html → http://example.com/foo/bar.html
Removing or adding "www" as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is "www" and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example: http://www.example.com/ → http://example.com/
Removing unused query variables. A page may expect only certain parameters to appear in the query; unused parameters can be removed. Example: http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
Removing default query parameters. A default value in the query string may render identically whether it is present or not. Example: http://example.com/display?id=&sort=ascending → http://example.com/display
Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example: http://example.com/display? → http://example.com/display
Normalization based on URI lists
Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if a URI appears in a crawl log several times alongside a syntactically different URI that serves the same content, we may assume that the two URIs are equivalent and normalize both to one of the forms. Schonfeld et al. present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.
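A toy sketch of this idea follows. It enumerates candidate substring substitution rules supported by pairs of URIs in a list; this simplifies the DustBuster heuristic considerably, since the real algorithm also filters out coincidental rules and validates survivors by fetching and comparing documents:

    from collections import Counter
    from itertools import combinations

    def candidate_dust_rules(uris, min_support=2):
        # Count substitution rules (alpha -> beta) evidenced by URI pairs.
        rules = Counter()
        for a, b in combinations(sorted(set(uris)), 2):
            # Strip the longest common prefix and suffix; whatever remains
            # in each URI is a candidate rule.
            i = 0
            while i < min(len(a), len(b)) and a[i] == b[i]:
                i += 1
            j = 0
            while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
                j += 1
            alpha, beta = a[i:len(a) - j], b[i:len(b) - j]
            if alpha != beta:
                rules[(alpha, beta)] += 1
        return [rule for rule, n in rules.items() if n >= min_support]

    # candidate_dust_rules([
    #     "http://example.com/story?id=1", "http://example.com/story_1",
    #     "http://example.com/story?id=2", "http://example.com/story_2",
    # ]) surfaces ("?id=", "_") with support 2, alongside noise such as
    # ("1", "2") that a real detector would have to filter out.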