rdf_litindex.pl -- Search literals

This module finds literals of the RDF database based on words, stemming and phonetic similarity ("sounds like", using metaphone). The normal user-level predicate is rdf_find_literals/2.

rdf_set_literal_index_option(+Options:list)
Set options for the literal package. Currently defined options:
  • verbose(Bool): If true, print progress messages while building the index tables.
  • index_threads(+Count): Number of threads to use for the initial indexing of literals.
  • index(+How): How to deal with indexing new literals. How is one of self (execute in the same thread), thread(N) (execute in N concurrent threads) or default (depends on the number of cores).
  • stopgap_threshold(+Count): Add a token to the dynamic stopgap set if it appears in more than Count literals. The default is 50,000.
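As an illustration, the options could be set once at load time, before the first search builds the index. This is a minimal sketch: it assumes the options are passed as a list of Name(Value) terms to rdf_set_literal_index_option/1 from library(semweb/rdf_litindex), and the values are arbitrary examples rather than recommendations.

```prolog
:- use_module(library(semweb/rdf_litindex)).

% Illustrative values only; tune them for your own data set.
:- rdf_set_literal_index_option(
       [ verbose(true),             % report progress while building the index
         index_threads(2),          % threads used for the initial indexing
         index(thread(1)),          % index new literals in one background thread
         stopgap_threshold(10000)   % stricter than the 50,000 default
       ]).
```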
rdf_find_literal(+Spec, -Literal) is nondet
rdf_find_literals(+Spec, -Literals) is det
Find literals in the RDF database matching Spec. Spec is defined as:
Spec ::= and(Spec,Spec)
Spec ::= or(Spec,Spec)
Spec ::= not(Spec)
Spec ::= sounds(Like)
Spec ::= stem(Like)             % same as stem(Like, en)
Spec ::= stem(Like, Lang)
Spec ::= prefix(Prefix)
Spec ::= between(Low, High)     % Numerical between
Spec ::= ge(Low)                % Numerical greater-equal
Spec ::= le(High)               % Numerical less-equal
Spec ::= Token
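
The grammar above can be combined freely. The following sketch shows two possible queries. The predicate names example_search/1 and example_search_one/1 and the search words are hypothetical, and what is returned depends entirely on the literals loaded in the RDF database.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Literals that contain a word sounding like "amsterdam" and a word
% starting with "mus", but no word sharing an English stem with "closed".
example_search(Literals) :-
    rdf_find_literals(and(and(sounds(amsterdam), prefix(mus)),
                          not(stem(closed))),
                      Literals).

% Enumerate matches one by one on backtracking: literals that contain
% the token paris or a number in the range 1900..1999.
example_search_one(Literal) :-
    rdf_find_literal(or(paris, between(1900, 1999)), Literal).
```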

sounds(Like) and stem(Like) both map to a disjunction. First we compile the spec to normal form: a disjunction of conjunctions on elementary tokens. Then we execute all the conjunctions and generate the union using ordered-set algorithms.
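
The two phases described above can be sketched in plain Prolog. This is not the library's implementation. It is a simplified model that handles only and/2, or/2 and plain tokens, and it uses a hypothetical Index (a list of Token-OrderedSet pairs) in place of the real token map. The names dnf/2, eval_dnf/3, eval_conj/3 and token_set/3 are invented for the example.

```prolog
:- use_module(library(ordsets)).
:- use_module(library(lists)).
:- use_module(library(apply)).

% dnf(+Spec, -DNF): DNF is a list of conjunctions, where each
% conjunction is a non-empty list of elementary tokens.
dnf(or(A, B), DNF) :- !,
    dnf(A, DA),
    dnf(B, DB),
    append(DA, DB, DNF).
dnf(and(A, B), DNF) :- !,
    dnf(A, DA),
    dnf(B, DB),
    % Distribute: pair every conjunction of A with every conjunction of B.
    findall(C,
            ( member(CA, DA), member(CB, DB), append(CA, CB, C) ),
            DNF).
dnf(Token, [[Token]]).

% eval_dnf(+DNF, +Index, -Literals): intersect the sets within each
% conjunction, then take the union over all conjunctions.
eval_dnf(DNF, Index, Literals) :-
    maplist(eval_conj(Index), DNF, Sets),
    ord_union(Sets, Literals).

eval_conj(Index, Tokens, Set) :-
    maplist(token_set(Index), Tokens, [S0|Ss]),
    foldl(ord_intersection, Ss, S0, Set).

% Tokens missing from the hypothetical index map to the empty set.
token_set(Index, Token, Set) :-
    (   memberchk(Token-Set0, Index)
    ->  Set = Set0
    ;   Set = []
    ).
```

For example, dnf(and(a, or(b, c)), DNF) yields DNF = [[a,b],[a,c]]. Evaluating that against the index [a-[l1,l2,l3], b-[l1,l2], c-[l3,l4]] yields [l1,l2,l3].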

Stopgaps are ignored. If the final result is only a stopgap, the predicate fails.

To be done
- Exploit ordering of numbers and allow for > N, < N, etc.
rdf_token_expansions(+Spec, -Extensions)
Determine which extensions of a token contribute to finding literals.
rdf_delete_literal_index(+Type)
Fully delete a literal index.
rdf_tokenize_literal(+Literal, -Tokens) is semidet
Tokenize a literal. We make this hookable as tokenization is generally domain dependent.
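A domain-specific tokenizer might look like the sketch below. It assumes the hook is the multifile predicate rdf_litindex:tokenization(+Literal, -Tokens), which should be checked against the installed library version. The splitting rule (underscores as well as spaces) and the helper string_token/2 are invented for the example.

```prolog
:- use_module(library(semweb/rdf_litindex)).
:- use_module(library(apply)).

% Assumption: rdf_tokenize_literal/2 consults this hook first.
:- multifile rdf_litindex:tokenization/2.

% Split literals such as 'ISO_8601 date' on "_" as well as on spaces.
% Literals without "_" fail here and fall through to the default tokenizer.
rdf_litindex:tokenization(Literal, Tokens) :-
    atom(Literal),
    sub_atom(Literal, _, 1, _, '_'),
    !,
    split_string(Literal, "_ ", "", Parts),
    exclude(==(""), Parts, NonEmpty),     % drop empty substrings
    maplist(string_token, NonEmpty, Tokens).

% Hypothetical helper: represent each token as an atom.
string_token(String, Atom) :-
    atom_string(Atom, String).
```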
rdf_stopgap_token(-Token) is nondet
True when Token is a stopgap token. Currently, this implies one of:
  • exclude_from_index(token, Token) is true
  • default_stopgap(Token) is true
  • Token is an atom of length 1
  • Token was added to the dynamic stopgap token set because it appeared in more than stopgap_threshold literals.
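As a sketch of how these conditions might be used, the example below adds two domain-specific stopgaps through exclude_from_index/2 and collects the resulting set. It assumes that the hook is a multifile predicate in module rdf_litindex. The tokens and the helper stopgaps/1 are hypothetical.

```prolog
:- use_module(library(semweb/rdf_litindex)).

% Assumption: exclude_from_index/2 is a multifile hook in rdf_litindex.
:- multifile rdf_litindex:exclude_from_index/2.

% Hypothetical domain-specific tokens that should never be indexed.
rdf_litindex:exclude_from_index(token, untitled).
rdf_litindex:exclude_from_index(token, unknown).

% Collect every token currently treated as a stopgap.
stopgaps(Tokens) :-
    findall(Token, rdf_stopgap_token(Token), Tokens).
```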
rdf_literal_index(+Type, -Index) is det
True when Index is a literal map containing the index of Type. Type is one of:
  • token: Tokens are basically words of literal values. See rdf_tokenize_literal/2. The token map maps tokens to full literal texts.
  • stem: Index of stemmed tokens. If the language is available, the tokens are stemmed using the matching snowball stemmer. The stem map maps stems to full tokens.
  • metaphone: Phonetic index of tokens. The metaphone map maps phonetic keys to tokens.
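
The returned literal map can be inspected with the literal-map predicates of library(semweb/rdf_db). The sketch below assumes the index type is named token and uses rdf_keys_in_literal_map/3 and rdf_find_literal_map/3. The wrapper names tokens_with_prefix/2 and literals_for_token/2 are invented for the example.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Tokens in the token index that start with Prefix.
tokens_with_prefix(Prefix, Tokens) :-
    rdf_literal_index(token, Map),
    rdf_keys_in_literal_map(Map, prefix(Prefix), Tokens).

% Literal texts that the token index associates with Token.
literals_for_token(Token, Literals) :-
    rdf_literal_index(token, Map),
    rdf_find_literal_map(Map, [Token], Literals).
```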