Knowledge extraction

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
The RDB2RDF W3C group is currently standardizing a language for the extraction of Resource Description Framework (RDF) data from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge.

Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding transforming relational databases into RDF, identity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from information extraction and extract, transform, and load, which transform the data from the sources into structured formats.
The following criteria can be used to categorize approaches in this topic:

Source	Which data sources are covered: text, relational databases, XML, CSV
Exposition	How is the extracted knowledge made explicit? How can it be queried?
Synchronization	Is the knowledge extraction process executed once to produce a dump, or is the result synchronized with the source (static or dynamic)? Are changes to the result written back?
Reuse of vocabularies	The tool is able to reuse existing vocabularies in the extraction. For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies.
Automatization	The degree to which the extraction is assisted or automated: manual, GUI, semi-automatic, automatic.
Requires a domain ontology	A pre-existing ontology is needed to map to it. So either a mapping is created or a schema is learned from the source.

Examples

Entity linking

DBpedia Spotlight, OpenCalais, and the Zemanta API analyze free text via named-entity recognition, then disambiguate candidates via name resolution and link the found entities to the DBpedia knowledge repository.

called Wednesday on to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.

Relational databases to RDF

Triplify, D2R Server, and Virtuoso RDF Views are tools that transform relational databases to RDF. They allow existing vocabularies and ontologies to be reused during the conversion. When transforming a typical relational table named users, one column or an aggregation of columns has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity. Properties with formally defined semantics are then used to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation, and a column homepage can be converted to the property foaf:homepage from the FOAF vocabulary, thus qualifying it as an inverse functional property. Each entry of the user table can then be made an instance of the class foaf:Person. Additionally, domain knowledge could be created from the status_id, either by manually created rules or by (semi-)automated methods. Here is an example transformation:

Name	marriedTo	homepage	status_id
Peter	Mary	http://example.org/Peters_page	1
Claus	Eva	http://example.org/Claus_page	2

:Peter :marriedTo :Mary.
:marriedTo a owl:SymmetricProperty.
:Peter foaf:homepage <http://example.org/Peters_page> .
:Peter a foaf:Person.
:Peter a :Student.
:Claus a :Teacher.
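
The transformation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (row_to_triples, to_turtle) are hypothetical, and the STATUS_CLASSES dictionary stands in for a manually created rule that turns status_id into a class; the source does not prescribe how that rule is expressed.

```python
# Rows of the example users table, one dict per row.
ROWS = [
    {"Name": "Peter", "marriedTo": "Mary",
     "homepage": "http://example.org/Peters_page", "status_id": 1},
    {"Name": "Claus", "marriedTo": "Eva",
     "homepage": "http://example.org/Claus_page", "status_id": 2},
]

# Assumed, manually created rule: status_id -> domain class.
STATUS_CLASSES = {1: ":Student", 2: ":Teacher"}


def row_to_triples(row):
    """Turn one table row into (subject, predicate, object) triples."""
    subject = ":" + row["Name"]  # the Name column provides the entity URI
    return [
        (subject, ":marriedTo", ":" + row["marriedTo"]),
        (subject, "foaf:homepage", "<" + row["homepage"] + ">"),
        (subject, "a", "foaf:Person"),
        (subject, "a", STATUS_CLASSES[row["status_id"]]),
    ]


def to_turtle(triples):
    """Serialize triples as simple Turtle-like statements, one per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    # Schema-level statement, emitted once rather than per row.
    print(":marriedTo a owl:SymmetricProperty .")
    for row in ROWS:
        print(to_turtle(row_to_triples(row)))
```

The symmetric-property axiom is a statement about the schema, so the sketch emits it once instead of repeating it for every row.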

Extraction from structured sources to RDF

1:1 Mapping from RDB Tables/Views to RDF Entities/Attributes/Values

When building an RDB representation of a problem domain, the starting point is frequently an entity-relationship diagram. Typically, each entity is represented as a database table, each attribute of the entity becomes a column in that table, and relationships between entities are indicated by foreign keys. Each table typically defines a particular class of entity, each column one of its attributes. Each row in the table describes an entity instance, uniquely identified by a primary key. The table rows collectively describe an entity set. In an equivalent RDF representation of the same entity set:

Each column in the table is an attribute
Each column value is an attribute value
Each row key represents an entity ID
Each row represents an entity instance
Each row is represented in RDF by a collection of triples with a common subject.

So, to render an equivalent view based on RDF semantics, the basic mapping algorithm would be as follows:

create an RDFS class for each table
convert all primary keys and foreign keys into IRIs
assign a predicate IRI to each column
assign an rdf:type predicate for each row, linking it to an RDFS class IRI corresponding to the table
for each column that is part of neither a primary nor a foreign key, construct a triple containing the primary key IRI as the subject, the column IRI as the predicate and the column's value as the object.
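
The steps above can be sketched against an SQLite database using only the Python standard library. The base IRI, the IRI patterns (table/key for rows, table#column for predicates) and the function names are assumptions made for this illustration. The sketch also emits a triple with an IRI object for foreign-key columns, which is one possible reading of the second step rather than something the list spells out. Literal escaping and datatypes are ignored.

```python
import sqlite3

BASE = "http://example.org/"  # assumed base IRI for all generated IRIs


def direct_mapping(conn, table):
    """Apply the basic 1:1 mapping to one table and return a list of triples."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    names = [c[1] for c in cols]
    # c[5] is the 1-based position within the primary key, 0 if not a key column.
    pk_cols = [c[1] for c in sorted(cols, key=lambda c: c[5]) if c[5] > 0]
    # Map each foreign-key column to the table it references.
    fks = {fk[3]: fk[2]
           for fk in conn.execute(f"PRAGMA foreign_key_list({table})")}

    class_iri = f"<{BASE}{table}>"
    triples = [(class_iri, "rdf:type", "rdfs:Class")]  # one class per table

    for row in conn.execute(f"SELECT * FROM {table}"):
        record = dict(zip(names, row))
        key = "-".join(str(record[c]) for c in pk_cols)
        subject = f"<{BASE}{table}/{key}>"  # primary key becomes an IRI
        triples.append((subject, "rdf:type", class_iri))
        for name, value in record.items():
            if name in pk_cols or value is None:
                continue
            predicate = f"<{BASE}{table}#{name}>"  # one predicate per column
            if name in fks:
                obj = f"<{BASE}{fks[name]}/{value}>"  # foreign key becomes an IRI
            else:
                obj = f'"{value}"'  # plain literal, no escaping in this sketch
            triples.append((subject, predicate, obj))
    return triples


def demo_connection():
    """Build a small in-memory database for demonstration purposes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE status (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            name TEXT,
            status_id INTEGER REFERENCES status(id)
        );
        INSERT INTO status VALUES (1, 'Student');
        INSERT INTO users VALUES (7, 'Peter', 1);
    """)
    return conn


if __name__ == "__main__":
    for triple in direct_mapping(demo_connection(), "users"):
        print(" ".join(triple), ".")
```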

An early mention of this basic or direct mapping can be found in Tim Berners-Lee's comparison of the ER model to the RDF model.

Complex mappings of relational databases to RDF

The 1:1 mapping mentioned above exposes the legacy data as RDF in a straightforward way; additional refinements can be employed to improve the usefulness of the RDF output for the given use cases. Normally, information is lost during the transformation of an entity-relationship diagram to relational tables and has to be reverse engineered. From a conceptual view, approaches for extraction can come from two directions. The first direction tries to extract or learn an OWL schema from the given database schema. Early approaches used a fixed number of manually created mapping rules to refine the 1:1 mapping. More elaborate methods employ heuristics or learning algorithms to induce schematic information. While some approaches try to extract the information from the structure inherent in the SQL schema, others analyse the content and the values in the tables to create conceptual hierarchies. The second direction tries to map the schema and its contents to a pre-existing domain ontology. Often, however, a suitable domain ontology does not exist and has to be created first.

XML

As XML is structured as a tree, any data can be easily represented in RDF, which is structured as a graph. One example approach uses RDF blank nodes and transforms XML elements and attributes to RDF properties. The topic, however, is more complex than in the case of relational databases. In a relational table the primary key is an ideal candidate for becoming the subject of the extracted triples. An XML element, however, can be transformed, depending on the context, into a subject, a predicate or an object of a triple. XSLT can be used as a standard transformation language to manually convert XML to RDF.
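
A minimal sketch of the blank-node idea, using Python's standard XML parser, is shown here. It is not the method of any particular tool. The conventions are assumptions chosen for illustration: attributes and leaf elements become properties with literal values, nested elements become properties pointing to fresh blank nodes, and the root tag is used as a class. XML namespaces and mixed content are ignored.

```python
import itertools
import xml.etree.ElementTree as ET


def xml_to_triples(xml_text):
    """Convert an XML document into triples, using blank nodes for nesting."""
    counter = itertools.count()
    triples = []

    def new_blank():
        return f"_:b{next(counter)}"

    def visit(element, node):
        # Attributes become properties with literal values.
        for name, value in element.attrib.items():
            triples.append((node, f":{name}", f'"{value}"'))
        for child in element:
            if len(child) == 0 and not child.attrib:
                # Leaf element: the tag is the predicate, the text is the object.
                text = (child.text or "").strip()
                triples.append((node, f":{child.tag}", f'"{text}"'))
            else:
                # Nested element: the tag is the predicate, and a new blank
                # node becomes the object and the subject of deeper triples.
                child_node = new_blank()
                triples.append((node, f":{child.tag}", child_node))
                visit(child, child_node)

    root = ET.fromstring(xml_text)
    root_node = new_blank()
    triples.append((root_node, "rdf:type", f":{root.tag}"))
    visit(root, root_node)
    return triples


if __name__ == "__main__":
    sample = ('<person id="1"><name>Peter</name>'
              '<address><city>Leipzig</city></address></person>')
    for triple in xml_to_triples(sample):
        print(" ".join(triple), ".")
```

In this sketch the same element name (address) acts as a predicate on the parent node while its blank node acts as a subject for the nested city value, which illustrates why the role of an XML element depends on context.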

Survey of methods / tools

Name	Data Source	Data Exposition	Data Synchronisation	Mapping Language	Vocabulary Reuse	Mapping Automat.	Req. Domain Ontology	Uses GUI
	Relational Data	SPARQL/ETL	dynamic		false	automatic	false	false
	CSV	ETL	static	RDF	true	manual	false	false
	TSV, CoNLL	SPARQL/ RDF stream	static	none	true	automatic	false	false
	Delimited text file	ETL	static	RDF/DAML	true	manual	false	true
	RDB	SPARQL	bi-directional	D2R Map	true	manual	false	false
	RDB	own query language	dynamic	Visual Tool	true	manual	false	true
	RDB	ETL	static	proprietary	true	manual	true	true
	CSV, XML	ETL	static			semi-automatic	false	true
	XML	ETL	static	XSLT	true	manual	true	false
	RDB	ETL	static	proprietary	true	manual	true	false
	RDB	ETL	static	proprietary xml based mapping language	true	manual	false	true
	CSV	ETL	static	MappingMaster	true	GUI	false	true
	RDB	ETL	static	proprietary	true	manual	true	true
	CSV	ETL	static	The RDF Data Cube Vocabulary	true	semi-automatic	false	true
	XML, Text	LinkedData	dynamic	RDF	true	semi-automatic	true	false
	RDB	ETL	static		false	automatic, the user furthermore has the chance to fine-tune results	false	true
	CSV	ETL	static	false	false	manual	false	true
	RDB	ETL	static	SQL	true	manual	true	true
	RDB	ETL	static		false	automatic	false	false
	CSV	ETL	static	false	false	automatic	false	false
	Multidimensional statistical data in spreadsheets			Data Cube Vocabulary	true	manual	false
	CSV	ETL	static	SKOS	false	semi-automatic	false	true
	RDB	LinkedData	dynamic	SQL	true	manual	false	false
	RDB	SPARQL/ETL	dynamic	R2RML	true	semi-automatic	false	true
	RDB	SPARQL	dynamic	Meta Schema Language	true	semi-automatic	false	true
	structured and semi-structured data sources	SPARQL	dynamic	Virtuoso PL & XSLT	true	semi-automatic	false	false
	RDB	RDQL	dynamic	SQL	true	manual	true	true
	CSV	ETL	static	TriG Syntax	true	manual	false	false
	XML	ETL	static	false	false	automatic	false	false

Extraction from natural language sources

The largest portion of information contained in business documents is encoded in natural language and is therefore unstructured. Because unstructured data is a challenge for knowledge extraction, more sophisticated methods are required, which generally yield worse results than those for structured data. The potential for a massive acquisition of extracted knowledge, however, should compensate for the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information where the data is given in an unstructured fashion as plain text. If the given text is additionally embedded in a markup document, the mentioned systems normally remove the markup elements automatically.

Linguistic annotation / natural language processing (NLP)

As a preprocessing step to knowledge extraction, it can be necessary to perform linguistic annotation by one or multiple NLP tools. Individual modules in an NLP workflow normally build on tool-specific formats for input and output, but in the context of knowledge extraction, structured formats for representing linguistic annotations have been applied.
Typical NLP tasks relevant to knowledge extraction include:

part-of-speech tagging
lemmatization or stemming
word sense disambiguation
named entity recognition
syntactic parsing, often adopting syntactic dependencies
shallow syntactic parsing (chunking): if performance is an issue, chunking yields a fast extraction of nominal and other phrases
anaphora resolution
semantic role labelling
discourse parsing

In NLP, such data is typically represented in TSV formats, often referred to as CoNLL formats. For knowledge extraction workflows, RDF views on such data have been created in accordance with the following community standards:

NLP Interchange Format
Web Annotation
CoNLL-RDF
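
To make the idea of an RDF view on TSV annotations concrete, the following sketch turns a small CoNLL-style table into triples. It does not reproduce the vocabulary of any of the standards listed above: the column layout (ID, WORD, POS), the token identifiers and the :nextWord property are assumptions chosen for illustration.

```python
def conll_to_triples(tsv_text, columns=("ID", "WORD", "POS")):
    """Convert CoNLL-style TSV (one token per line) into a list of triples."""
    triples = []
    sentence = 1
    previous = None  # previous token of the current sentence, if any
    for line in tsv_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line marks a sentence boundary.
            if previous is not None:
                sentence += 1
            previous = None
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        row = dict(zip(columns, line.split("\t")))
        token = f":s{sentence}_{row['ID']}"
        # Every annotation column becomes a property of the token.
        for column, value in row.items():
            if column != "ID":
                triples.append((token, f":{column}", f'"{value}"'))
        # Keep the word order explicit, since triples are unordered.
        if previous is not None:
            triples.append((previous, ":nextWord", token))
        previous = token
    return triples


if __name__ == "__main__":
    sample = "1\tPeter\tNNP\n2\tsleeps\tVBZ\n\n1\tHe\tPRP\n"
    for triple in conll_to_triples(sample):
        print(" ".join(triple), ".")
```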

Other, platform-specific formats include:

LAPPS Interchange Format
NLP Annotation Format

Traditional information extraction (IE)

Traditional information extraction is a technology of natural language processing that extracts information from typically natural language texts and structures it in a suitable manner. The kinds of information to be identified must be specified in a model before beginning the process, which is why the whole process of traditional information extraction is domain-dependent. IE is split into the following five subtasks:

Named entity recognition
Coreference resolution
Template element construction
Template relation construction
Template scenario production

The task of named entity recognition (NER) is to recognize and categorize all named entities contained in a text. This is done by applying grammar-based methods or statistical models.
Coreference resolution (CO) identifies equivalent entities, recognized by NER, within a text. There are two relevant kinds of equivalence relationship. The first relates two differently represented entities, and the second relates an entity to its anaphoric references. Both kinds can be recognized by coreference resolution.
During template element construction, the IE system identifies descriptive properties of the entities recognized by NER and CO. These properties correspond to ordinary qualities like red or big.
Template relation construction (TR) identifies relations that exist between the template elements. These relations can be of several kinds, such as works-for or located-in, with the restriction that both domain and range correspond to entities.
In template scenario production, events described in the text are identified and structured with respect to the entities recognized by NER and CO and the relations identified by TR.

Ontology-based information extraction (OBIE)

Ontology-based information extraction is a subfield of information extraction in which at least one ontology is used to guide the process of information extraction from natural language text. The OBIE system uses methods of traditional information extraction to identify concepts, instances and relations of the used ontologies in the text, which are structured into an ontology after the process. Thus, the input ontologies constitute the model of information to be extracted.

Ontology learning (OL)

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms from natural language text. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

Semantic annotation (SA)

During semantic annotation, natural language text is augmented with metadata that should make the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, for example, concepts from ontologies. Thus, knowledge is gained about which meaning of a term was intended in the processed context, and the meaning of the text is therefore grounded in machine-readable data with the ability to draw inferences. Semantic annotation is typically split into the following two subtasks:

Terminology extraction
Entity linking

At the terminology extraction level, lexical terms are extracted from the text. For this purpose, a tokenizer first determines the word boundaries and resolves abbreviations. Afterwards, terms from the text that correspond to a concept are extracted with the help of a domain-specific lexicon so that they can be linked during entity linking.
In entity linking, a link is established between the extracted lexical terms from the source text and the concepts from an ontology or knowledge base such as DBpedia. For this, candidate concepts are detected for the several meanings of a term with the help of a lexicon. Finally, the context of the terms is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
Note that "semantic annotation" in the context of knowledge extraction is not to be confused with semantic parsing as understood in natural language processing: semantic parsing aims at a complete, machine-readable representation of natural language, whereas semantic annotation in the sense of knowledge extraction tackles only a very elementary aspect of that.
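
The two subtasks can be illustrated with a toy sketch: a lexicon lookup produces candidate concepts for each term, and the candidate whose keywords overlap most with the surrounding text is selected. The lexicon contents, the concept identifiers and the overlap heuristic are all assumptions made for this example. Real systems use far richer lexicons and statistical disambiguation.

```python
# Hypothetical lexicon: surface form -> candidate concepts, each with
# a set of context keywords that hint at that meaning.
LEXICON = {
    "jaguar": {
        "dbr:Jaguar": {"animal", "cat", "jungle", "prey"},
        "dbr:Jaguar_Cars": {"car", "engine", "drive", "vehicle"},
    },
}


def tokenize(text):
    """Split on whitespace, strip punctuation, and lowercase each token."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]


def link_entities(text, lexicon=LEXICON):
    """Return a mapping from each known term in the text to its best concept."""
    tokens = tokenize(text)
    context = set(tokens)
    links = {}
    for token in tokens:
        candidates = lexicon.get(token)  # candidate detection via the lexicon
        if not candidates:
            continue
        # Disambiguation: pick the candidate sharing the most keywords
        # with the context (ties go to the first candidate listed).
        links[token] = max(candidates,
                           key=lambda c: len(candidates[c] & context))
    return links


if __name__ == "__main__":
    print(link_entities("The jaguar has a powerful engine and is fun to drive."))
    print(link_entities("A jaguar stalks its prey in the jungle."))
```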

Tools

The following criteria can be used to categorize tools that extract knowledge from natural language text.

Source	Which input formats can be processed by the tool?
Access Paradigm	Can the tool query the data source, or does it require a whole dump for the extraction process?
Data Synchronization	Is the result of the extraction process synchronized with the source?
Uses Output Ontology	Does the tool link the result with an ontology?
Mapping Automation	How automated is the extraction process?
Requires Ontology	Does the tool need an ontology for the extraction?
Uses GUI	Does the tool offer a graphical user interface?
Approach	Which approach is used by the tool?
Extracted Entities	Which types of entities can be extracted by the tool?
Applied Techniques	Which techniques are applied?
Output Model	Which model is used to represent the result of the tool?
Supported Domains	Which domains are supported?
Supported Languages	Which languages can be processed?

The following table characterizes some tools for knowledge extraction from natural language sources.

Name

Source

Access Paradigm

Data Synchronization

Uses Output Ontology

Mapping Automation

Requires Ontology

Uses GUI

Approach

Extracted Entities

Applied Techniques

Output Model

Supported Domains

Supported Languages

plain text, HTML, XML, SGML

dump

yes

automatic

yes

named entities, relationships, events

linguistic rules

proprietary

domain-independent

English, Spanish, Arabic, Chinese, Indonesian

plain text, HTML

automatic

yes

multilingual

plain text

dump

yes

finite state algorithms

multilingual

plain text

dump

semi-automatic

yes

concepts, concept hierarchy

NLP, clustering

automatic

named entities, relationships, events

NLP

plain text, HTML, URL

REST

automatic

yes

named entities, concepts

statistical methods

JSON

domain-independent

multilingual

plain text, HTML

dump, SPARQL

yes

automatic

yes

annotation to each word, annotation to non-stopwords

NLP, statistical methods, machine learning

RDFa

domain-independent

English

plain text, HTML

dump

yes

automatic

yes

IE, OL, SA

annotation to each word, annotation to non-stopwords

rule-based grammar

XML

domain-independent

English, German, Dutch

plain text

dump, REST API

yes

automatic

yes

IE, OL, SA, ontology design patterns, frame semantics

word NIF or EarMark annotation, predicates, instances, compositional semantics, concept taxonomies, frames, semantic roles, periphrastic relations, events, modality, tense, entity linking, event linking, sentiment

NLP, machine learning, heuristic rules

RDF/OWL

domain-independent

English, other languages via translation

HTML, PDF, DOC

SPARQL

yes

OBIE

instances, property values

NLP

personal, business

plain text, HTML, XML, SGML, PDF, MS Office

dump

Yes

Automatic

yes

Yes

named entities, relationships, events

NLP

XML, JSON, RDF-OWL, others

multiple domains

English, Arabic, Chinese, French, Korean, Persian, Russian, Spanish

semi-automatic

yes

concepts, concept hierarchy, non-taxonomic relations, instances

NLP, machine learning, clustering

plain text, HTML

dump

yes

automatic

yes

concepts, concept hierarchy, instances

NLP, statistical methods

proprietary

domain-independent

English

plain text, HTML

dump

yes

automatic

yes

concepts, concept hierarchy, instances

NLP, statistical methods

proprietary

domain-independent

English

HTML, PDF, DOC

dump, search engine queries

yes

automatic

yes

OBIE

concepts, relations, instances

NLP, statistical methods

RDF

domain-independent

English

plain text

dump

yes

semi-automatic

yes

OBIE

instances, datatype property values

heuristic-based methods

proprietary

domain-independent

language-independent

plain text, HTML, XML

dump

yes

automatic

yes

annotation to entities, annotation to events, annotation to facts

NLP, machine learning

RDF

domain-independent

English, French, Spanish

plain text, HTML, DOC, ODT

dump

yes

automatic

yes

OBIE

named entities, concepts, relations, concepts that categorize the text, enrichments

NLP, machine learning, statistical methods

RDF, OWL

domain-independent

English, German, Spanish, French

plain text, HTML, XML, SGML, PDF, MS Office

dump

Yes

Automatic

Yes

named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification

NLP, machine learning

XML, JSON, POJO, RDF

multiple domains

Multilingual 200+ Languages

plain text, HTML

dump

yes

automatic

OBIE

instances, property values, RDFS types

NLP, machine learning

RDF, RDFa

domain-independent

English, German

HTML

dump

yes

automatic

yes

machine learning

database record

domain-independent

language-independent

plain text, HTML, PDF, DOC, e-Mail

dump

yes

automatic

yes

OBIE

named entities

NLP, machine learning

proprietary

domain-independent

English, German, French, Dutch, Polish

plain text, HTML, PDF

dump

yes

semi-automatic

yes

concepts, concept hierarchy, non-taxonomic relations, instances, axioms

NLP, statistical methods, machine learning, rule-based methods

OWL

domain-independent

English, German, Spanish

plain text, HTML, PDF, PostScript

dump

semi-automatic

yes

concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations

NLP, machine learning, clustering, statistical methods

German

Plain Text

dump

automatic

concepts, relations, hierarchy

NLP, proprietary

JSON

multiple domains

English

plain text, HTML, PDF, DOC

dump

yes

automatic

yes

annotation to proper nouns, annotation to common nouns

machine learning

RDFa

domain-independent

English, German, Spanish, French, Portuguese, Italian, Russian

named entities, relationships, events

multilingual

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain and is closely related to it both in terms of methodology and terminology.
The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases. Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further discovery. Often the outcomes of knowledge discovery are not actionable; actionable knowledge discovery, also known as domain-driven data mining, aims to discover and deliver actionable knowledge and insights.
Another promising application of knowledge discovery is in the area of software modernization, weakness discovery and compliance, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity relationship is a frequent format for representing knowledge obtained from existing software. The Object Management Group developed the Knowledge Discovery Metamodel specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery in existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous value for risk management and business value, key for the evaluation and evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as process flows, architecture, database schemas, and business rules, terms and processes.

Input data

Databases
* Relational data
* Database
* Document warehouse
* Data warehouse
Software
* Source code
* Configuration files
* Build scripts
Text
* Concept mining
Graphs
* Molecule mining
Sequences
* Data stream mining
* Learning from time-varying data streams under concept drift
Web

Output formats

Data model
Metadata
Metamodels
Ontology
Knowledge representation
Knowledge tags
Business rule
Knowledge Discovery Metamodel
Business Process Modeling Notation
Intermediate representation
Resource Description Framework
Software metrics
