# BlackLab Corpus Query Language (BCQL)
BlackLab Corpus Query Language or BCQL is a powerful query language for text corpora.
It is a dialect of the CQP Query Language introduced by the IMS Corpus WorkBench (CWB). Several other corpus engines support a similar language, such as the Lexicom Sketch Engine. The various dialects are very similar, but differ in some of the more advanced features.
This page will introduce BCQL and show the features that BlackLab supports.
BlackLab started out as purely a token-based corpus engine. The next section shows BCQL's token-based features. After that, we will look at (syntactic) relations querying. Finally, we compare BCQL to CQP, so users familiar with that dialect can avoid common pitfalls.
# Token-based querying
# Matching a token
With BCQL you can specify a pattern of tokens (i.e. words) you're looking for.
A simple such pattern is:
[word='man']
This simply searches for all occurrences of the word man.
Each corpus has a default annotation, usually word. Because of this, the query can be written even more simply:
'man'
Note that double and single quotes are interchangeable in BCQL (which is not true for all dialects). In this document we will use single quotes.
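For example, these two queries are equivalent:
[word='man']
[word="man"]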
# Multiple annotations
If your corpus includes the per-word annotations lemma (i.e. headword) and pos (part-of-speech, i.e. noun, verb, etc.), you can query those as well.
For example:
[lemma='search' & pos='noun']
This query would match search and searches when used as a noun. (Your data may use different part-of-speech tags, of course.)
# Negation
You can use the "does not equal" operator (!=) to search for all words except nouns:
[pos != 'noun']
# Regular expressions
The strings between quotes can also contain "wildcards", of sorts. To be precise, they are regular expressions, which provide a flexible way of matching strings of text. For example, to find man or woman (in the default annotation word), use:
'(wo)?man'
And to find lemmas starting with under, use:
[lemma='under.*']
Explaining regular expression syntax is beyond the scope of this document, but for a complete overview, see regular-expressions.info.
Escaping and literal strings
To find characters with special meaning in a regular expression, such as the period, you need to escape them with a backslash:
[lemma='etc\.']
Alternatively, you can use a "literal string" by prefixing the string with an `l`:
[lemma=l'etc.']
# Matching any token
Sometimes you want to match any token, regardless of its value.
Of course, this is usually only useful in a larger query, as we will explore next. But we'll introduce the syntax here.
To match any token, use the match-all pattern, which is just a pair of empty square brackets:
[]
# A sequence of tokens
You can search for sequences of words as well (i.e. phrase searches, but with many more possibilities). To search for the phrase the tall man, use this query:
'the' 'tall' 'man'
It might seem a bit clunky to separately quote each word, but this allows us the flexibility to specify exactly what kinds of words we're looking for.
For example, if you want to know all single adjectives used with man (not just tall), use this:
'an?|the' [pos='ADJ'] 'man'
This would also match a wise man, an important man, the foolish man, etc.
If we don't care about the part of speech between the article and man, we can use the match-all pattern we showed before:
'an?|the' [] 'man'
This way we might match something like the cable man as well as a wise man.
# Regular expression operators on tokens
Really powerful token-based queries become possible when you use the regular expression operators on whole tokens as well. If we want to see not just single adjectives applied to man, but multiple as well:
[pos='ADJ']+ 'man'
This query matches little green man, for example. The plus sign after `[pos='ADJ']` says that the preceding part should occur one or more times (similarly, `*` means "zero or more times", and `?` means "zero or one time").
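For example, using these operators:
[pos='ADJ']* 'man'
matches man preceded by any number of adjectives (including none), while
[pos='ADJ']? 'man'
matches man preceded by at most one adjective.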
If you only want matches with exactly two or three adjectives, you can specify that too:
[pos='ADJ']{2,3} 'man'
Or, for two or more adjectives:
[pos='ADJ']{2,} 'man'
You can group sequences of tokens with parentheses and apply operators to the whole group as well. To search for a sequence of nouns, each optionally preceded by an article:
('an?|the'? [pos='NOU'])+
This would, for example, match the well-known palindrome a man, a plan, a canal: Panama! (provided the punctuation marks were not indexed as separate tokens)
# Case- and diacritics sensitivity
BlackLab defaults to (case and diacritics) insensitive search. That is, it ignores differences in upper- and lowercase, as well as diacritical marks (accented characters). So searching for 'panama' will also find Panama.
To match a pattern sensitively, prefix it with `(?-i)`:
'(?-i)Panama'
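The `(?c)` prefix (mentioned in the differences from CWB below) can be used for the same purpose:
'(?c)Panama'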
Compared to other corpus engines
CWB and Sketch Engine both default to sensitive search.
# Matching XML elements
Your input data may contain "spans": marked regions of text, such as paragraphs, sentences, named entities, etc. If your input data is XML, these may be XML elements, but they may also be marked in other ways; non-XML formats may also define spans.
Finding text in relation to these spans is done using an XML-like syntax, regardless of the exact input data format.
# Finding spans
If you want to find all the sentence spans in your data:
<s/>
Note the forward slash before the closing bracket. This way of referring to the span means "the whole span". Compare this to `<s>`, which means "the start of the span", and `</s>`, which means "the end of the span".
So to find only the starts of sentences, use:
<s>
This would find zero-length hits at the position before the first word of each sentence. Similarly, `</s>` finds the ends of sentences. Not very useful on their own, but we can combine these with other queries.
# Words at the start or end of a span
More useful might be to find the first word of each sentence:
<s> []
or sentences ending in that:
'that' </s>
(Note that this assumes the period at the end of the sentence is not indexed as a separate token; if it is, you would use `'that' '.' </s>` instead.)
# Words inside a span
You can also search for words occurring inside a specific span. Say you've run named entity recognition on your data, so all names of people are tagged with the `person` span. To find the word baker as part of a person's name, use:
'baker' within <person/>
The above query will just match the word baker as part of a person's name. But you're likely more interested in the entire name that contains the word baker. So, to find those full names, use:
<person/> containing 'baker'
# Other uses for within and containing
As you might have guessed, you can use `within` and `containing` with any other query as well. For example:
([pos='ADJ']+ containing 'tall') 'man'
will find adjectives applied to man, where one of those adjectives is tall.
# Labeling tokens, capturing groups
Just like in regular expressions, it is possible to "capture" part of the match for your query as a named group. Everything you capture is returned with the hit in a response section called match info.
Example:
'an?|the' A:[pos='ADJ'] 'man'
The adjective part of the match will be captured in a group named A.
You can capture multiple words as well:
'an?|the' adjectives:[pos='ADJ']+ 'man'
This will capture the adjectives found for each match in a captured group named adjectives.
The capture name can also just be a number:
'an?|the' 1:[pos='ADJ']+ 'man'
Compared to other corpus engines
CWB and Sketch Engine offer similar functionality, but instead of capturing part of the match, they label a single token.
BlackLab can capture a span of tokens of any length, capture relations and spans with all their details, and even capture lists of relations, such as all relations in a sentence (relations are described later in this document).
Spans are captured automatically
If your query involves spans like `<s/>`, it will automatically be captured under the span name (`s` in this case). You can override the capture name by specifying it in the query, e.g. `A:<s/>`.
# Capture constraints
If you tag certain tokens with labels, you can also apply "capture constraints" (also known as "global constraints") on these tokens. This is a way of relating different tokens to one another, for example requiring that they correspond to the same word:
A:[] 'by' B:[] :: A.word = B.word
This would match day by day, step by step, etc.
Multiple-value annotations and constraints
Unfortunately, capture constraints can only access the first value indexed for an annotation. If you need this kind of functionality in combination with multiple-value annotations, you'll have to find a way around this limitation.
Some queries can be rewritten so they don't need a capture constraint. For example, `A:[word="some"] B:[word="queries"] :: A.lemma="some" & B.lemma="query"` can also be written as `A:[word="some" & lemma="some"] B:[word="queries" & lemma="query"]`, which does work with multiple annotation values. But such a rewrite is rarely possible.
In other cases, you might be able to add extra annotations or use spans ("inline tags") to get around this limitation.
# Functions
You can also use a few special functions in capture constraints, for example to require that captured words occur in a particular order:
(<s> containing A:'cat') containing B:'fluffy' :: start(B) < start(A)
Here we find sentences containing both cat and fluffy (in some order), but then require that fluffy occurs before cat.
Of course this particular query would be better expressed as `<s/> containing 'fluffy' []* 'cat'`. As a general rule, capture constraints can be a bit slower, so only use them when you need to.
# Local capture constraints
Unlike most other corpus engines, BlackLab allows you to place capture constraints inside a parenthesized expression. Be careful that the constraint only refers to labels that are captured inside the parentheses, though!
This is valid and would match over and over again:
(A:[] 'and' B:[] :: A.word = B.word) 'again'
This is NOT valid (may not produce an error, but the results are undefined):
A:[] ('and' B:[] :: A.word = B.word) 'again' # BAD
# Relations querying
Supported from v4.0
Indexing and searching relations will be supported from BlackLab 4.0 (and current development snapshots).
Relations show how (groups of) words are related to one another. One of the most common types of relations is the dependency relation, which shows grammatical dependency between words.
If your corpus contains relations, you can query those as well. One advantage of this style of querying is that it's much easier to find nonadjacent words related to one another, or two related words regardless of what order they occur in.
How to index relations
Indexing relations is explained here.
Querying relations is essentially done by building a partial tree of relation constraints. BlackLab will try to find this structure in the corpus. Relations queries can be combined with "regular" token-level queries as well.
Treebank systems
BlackLab supports limited relations querying, but is not as powerful as a full treebank system, which is primarily designed for this style of search. Links to some treebank systems can be found here. For BlackLab's relations querying limitations, see below.
# An example dependency tree
Let's use an example to illustrate the various querying options. Here's a simple dependency tree for the phrase I have a fluffy cat:
             |
            have
           /    \
      (subj)    (obj)
        /           \
       I            cat
                   /    \
              (det)    (amod)
                /           \
               a          fluffy
# Finding specific relation types
We might want to find object relations in our data. We can do this as follows:
_ -obj-> _
This will find have a fluffy cat (the span of text covering the two ends of the relation), with a match info group named for the relation type (obj) containing the relation details between have and cat.
The two `_` marks in the query simply mean we only care about the relation type, not the source or target of the relation. If we specifically want to find instances where cat is the object of the sentence, we can use this instead:
_ -obj-> 'cat'
So you can see that the token-based queries described previously are still useful here.
Can I use [] instead of _ ?
As explained above, `_` in a relation expression means "any source or target". You might be tempted to use `[]` instead, especially if you know your relations always have a single-token source and target. This works just fine, but it's a bit slower (it has to double-check that source and target are actually of length 1), so we recommend sticking with `_`.
(The actual equivalent of `_` here is `[]*`, i.e. zero or more tokens with no restrictions, but that makes for less readable queries.)
# A note on terminology
For dependency relations, linguists call the left side the head of the relation and the right side the dependent. However, because dependency relations aren't the only class of relation, and because the term head can be a bit confusing (there's also the "head of an arrow", but that's the other end!), we will use source and target.
When talking about tree structures, we will also use parent and child.
Context | Terms |
---|---|
Dependency relations | head --> dependent |
Relations in BlackLab | source --> target |
Searching tree structures | parent --> child |
# Finding relation types using regular expressions
We can specify the relation type as a regular expression as well. To find both subject and object relations, we could use:
_ -subj|obj-> _
or:
_ -.*bj-> _
If you find it clearer, you can use parentheses around the regular expression:
_ -(subj|obj)-> _
With our example tree, the above queries will find all subject relations and all object relations. Each hit will have one relation in the match info. To find multiple relations per hit, read on.
Relation classes
When indexing relations in BlackLab, you assign them a class: a short string indicating what family of relations they belong to. For example, you could assign the class string `dep` to dependency relations. An `obj` relation would then become `dep::obj`.
To simplify things, `dep` is the default relation class in BlackLab. If you index relations without a class, they will automatically get the `dep` class. Similarly, when searching, if you don't specify a class, `dep::` will be prepended to the relation type. So if you're not indexing different classes of relations, you can just ignore the classes.
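For example, assuming the default `dep` class, these two queries should be equivalent (the second spells out the full relation type, including its class):
_ -obj-> _
_ -dep::obj-> _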
# Root relations
A dependency tree has a single root relation. A root relation is a special relation that only has a target and no source. Its relation type is usually just called root. In our example, the root points to the word have.
You can find root relations with a special unary operator:
^--> _
This will find all root relations. The details for the root relation will be returned in the match info.
Of course you can place constraints on the target of the root relation as well:
^--> 'have'
This will only find root relations pointing to the word have.
# Finding two relations with the same source
What if we want to find the subject and object relations of a sentence, both linked to the same source (the verb in the sentence)? We can do that using a semicolon to separate the two target constraints (or child constraints):
_ -subj-> _ ;
-obj-> _
As you can see, the source or parent is specified only once at the beginning. Then you may specify one or more target constraints (a relation type plus target, e.g. `-subj-> _`), separated by semicolons.
The above query will find hits covering the words involved in both relations, with details for the two relations in the match info of each hit. In our example, it would find the entire sentence I have a fluffy cat.
Target constraint uniqueness
Note that when matching multiple relations with the same source this way, BlackLab will enforce that they are unique. That is, two target constraints will only match two different relations.
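As an illustration of this rule (assuming your data contains `amod` relations, as in the example tree):
_ -amod-> _ ;
-amod-> _
This would only match sources with at least two distinct `amod` relations, e.g. a noun modified by two different adjectives.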
# Negative child constraints
You may want to have negative constraints, such as making sure that dog is not the object of the sentence. This can be done by prefixing the relation operator with `!`:
_ -subj-> _ ;
!-obj-> 'dog'
Note that this is different from:
_ -subj-> _ ;
-obj-> [word != 'dog']
The second query requires an object relation where the target is a word other than dog; that is, the object relation must exist. By contrast, in the first case, we only require that there exists no object relation with the target dog, so this might match sentences without an object as well as sentences with an object that is not dog.
# Searching over multiple levels in the tree
What if we want to query over multiple levels of the tree? For example, we want to find sentences where the target of the `subj` relation is the source of an `amod` relation pointing to fluffy, as in our example tree:
_ -subj-> _ -amod-> 'fluffy'
We can combine the techniques as well, for example if we also want to find the object of the sentence like before:
_ -subj-> (_ -amod-> _) ;
-obj-> _
As you can see, the value of the expression `(_ -amod-> _)` is actually the source of the `amod` relation, so we can easily use it as the target of the `subj` relation.
The `-..->` operator is right-associative (as you can see from the first example), but we do need parentheses here, or the parent of the `-obj->` relation would be ambiguous.
# Limitation: descendant search
One current limitation compared to dedicated treebank systems is the lack of support for finding descendants that are not direct children.
For example, if we want to look for sentences with the verb have and the word fluffy somewhere as an adjectival modifier in that sentence, we can't query something like this:
^--> 'have' -->> -amod-> 'fluffy' # DOES NOT WORK
Instead, we have to know how many nodes are between have and fluffy, e.g. this does work:
^--> 'have' --> _ -amod-> 'fluffy'
Supporting arbitrary descendant search with decent performance is a challenge that we may try to tackle in the future.
For now, you might be able to work around this limitation using a hybrid between token-based and relations querying, e.g.:
(<s/> containing (^--> 'have')) containing (_ -amod-> 'fluffy')
# Advanced relations querying features
Most users won't need these features, but they can come in handy in some cases.
# Controlling the resulting span
As shown in the previous section, relation expressions return the source of the matching relation by default. But what if you want a different part of the relation?
For example, if we want to find targets of the amod relation, we can do this:
rspan(_ -amod-> _, 'target')
If we want the entire span covering both the source and the target (and anything in between):
rspan(_ -amod-> _, 'full')
Note that full is the default value for the second parameter, so this will work too:
rspan(_ -amod-> _)
`rspan` supports another option: 'all', which will return a span covering all of the relations matched by your query:
rspan(_ -subj-> (_ -amod-> _) ; -obj-> _, 'all')
Because this is pretty useful when searching relations, there's an easy way to apply this `rspan` operation: just add the parameter `adjusthits=yes` to your BlackLab Server URL. Note that if your query already starts with a call to `rspan`, `adjusthits=yes` won't do anything.
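For example, a request to BlackLab Server might look something like this (a sketch only: `mycorpus` is a placeholder corpus name, and the `patt` value would need to be URL-encoded in practice):
/blacklab-server/mycorpus/hits?patt=_ -amod-> _&adjusthits=yes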
# Capturing all relations in a sentence
If you want to capture all relations in the sentence containing your match, use:
'elephant' within rcapture(<s/>)
What actually happens here is that all relations in the matched clause are returned in the match info.
You can pass a second parameter with the match info name for the list of captured relations (defaults to captured_rels):
'elephant' within rcapture(<s/>, 'relations')
If you only want to capture certain relations, you specify a third parameter that is a regular expression filter on the relation type. For example, to only capture relations in the `fam` class, use:
'elephant' within rcapture(<s/>, 'relations', 'fam::.*')
# Cross-field relations
It is possible to have a corpus with multiple annotated fields, with relations that point from a position or span in one field to another. For this to work, annotated fields should be named e.g. `contents__original` and `contents__corrected`, and a relation from the first field to the second could then be found like this:
'mistpyed' -->corrected 'mistyped'
As you can see, the target version is appended to the relation operator. The source version is determined by the main annotated field searched (the `field` parameter for BlackLab Server; this defaults to the main annotated field, which is the first one you defined in your indexing configuration).
Cross-field relations are used to enable parallel corpora, which we'll discuss next.
# Parallel corpus querying
Supported from v4.0
Indexing and searching parallel corpora will be supported from BlackLab 4.0 (and current development snapshots).
A parallel corpus is a corpus that contains multiple versions of the corpus content, usually from different languages and/or time periods, and records the alignment between the versions at different levels (e.g. paragraph, sentence, word).
For example, you could have a parallel corpus of EU Parliament discussions in the various European languages, or a parallel corpus of different translations of a classic work such as Homer's Odyssey.
How to index a parallel corpus
Indexing parallel corpora is explained here.
BlackLab's parallel corpus functionality uses cross-field relations to find alignments between the content versions available in your corpus.
The alignment operator `==>` is used specifically to find alignments between versions in your corpus. It essentially means "capture all relations between (part of) the left and right span". It will capture a list of relations in the response.
# Basic parallel querying
For example, if your corpus contains the fields `contents__en` (English version) and `contents__nl` (Dutch version), and English is the default field (the first one defined in your indexing config), you can find the Dutch translation of an English word using:
'cat' ==>nl _
The hit for this query will be cat in the English field, and the match info will contain a group named `rels` with all alignment relations found (just the one in this case, between the word cat and its Dutch equivalent). The hit response structure will also contain an `otherFields` section containing the corresponding Dutch content fragment. The location of the Dutch word aligned with the English word cat can be found from the relation in the `rels` capture, which includes `targetField`, `targetStart` and `targetEnd`.
Assuming your data has both sentence and word alignments, and you want to find all alignments for a sentence containing cat, you could use:
<s/> containing 'cat' ==>nl _
This should find aligning English and Dutch sentences, including any word alignments between words in those sentences. You can also filter by alignment type, as we'll show later.
Required versus optional alignment
The `==>` operator will require that an alignment exists. If you wish to see all hits on the left side of the `==>nl` operator regardless of whether any alignments to the right side can be found, use `==>nl?`.
For example, if you're searching for translations of cat to Dutch, with `==>nl` you will only see instances where cat is aligned to a Dutch word; on the other hand, with `==>nl?` you will see both English cat hits where the translation to Dutch was found, and cat hits where it wasn't.
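So the optional-alignment version of the earlier example would be:
'cat' ==>nl? _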
# Switching the main search field
If you want to search the Dutch version instead, and find alignments with the English version, you would use this query:
'kat' ==>en _
But of course, the main search field shouldn't be `contents__en` in this case; we want to switch it to `contents__nl`. You can specify a main search field other than the default with the BLS parameter `field`. If you specify `field=nl`, BlackLab will automatically recognize that you're specifying a version of the main annotated field and use the correct 'real' field, `contents__nl` in this case.
# Filtering the target span
In the previous example, we used `_` as the target span. This is the default, and means "the best matching span".
But you can also specify a different target span. For example, to find where fluffy was translated to pluizig:
'fluffy' ==>nl 'pluizig'
This will execute the left and right queries on their respective fields and match the hits by their alignment relations.
# Multiple alignment queries
You can also use multiple alignment operators in a single query to match to more than one other version:
'fluffy' ==>nl 'pluizig' ;
==>de 'flauschig'
# Only matching some (alignment) relations
Just like with other relations queries, you can filter by type:
'fluffy' =word=>nl 'pluizig'
This will only find relations of type `word`. The type filter will automatically determine the capture name as well, so any relation(s) found will be captured as `word` in this case instead of `rels` (unless an explicit name is assigned; see below).
# Renaming the relations capture
You can override the default name `rels` for the alignment operator's captures:
<s/> alignments:==>nl _
Now the alignment relations will be captured in a group named `alignments`.
# Capturing in target fields
You can capture parts of the target query like normal, e.g.:
"and" w1:[] ==>nl "en" w2:[]
There will be one match info named `w1` for the primary field searched (English in this case), and one named `w2` for the target field (Dutch).
# Advanced subjects
# Operator precedence
This is the precedence of the different CQL operators in BlackLab, from highest to lowest. The highest precedence operators "bind most tightly". See the examples below.
Inside token brackets `[ ]`:
Operator | Description | Associativity |
---|---|---|
! | logical not | right-to-left |
= != | (not) equals | left-to-right |
& \| | logical and/or | left-to-right |
At the sequence level (i.e. outside token brackets):
Operator | Description | Associativity |
---|---|---|
! | logical not | right-to-left |
[ ] | token brackets | left-to-right |
( ) | function call | left-to-right |
* + ? {n} {n,m} | repetition | left-to-right |
: | capture | right-to-left |
<s/> <s> </s> | span (whole / start / end) | left-to-right |
[] [] | sequence (implied operator) | left-to-right |
\| & | union/intersection | left-to-right |
--> [ ; --> ] ^--> ==> [ ; ==> ] | child relations / root relation / alignment | right-to-left |
within containing | position filter | right-to-left |
:: | capture constraint | left-to-right |
NOTES:
- You can always use grouping parentheses `( )` (at either token or sequence level) to override this precedence.
- Notice that `|` and `&` have the same precedence; don't rely on `&` binding more tightly than `|` or vice versa, which you might be used to from other languages.
A few examples:
Query | Interpreted as |
---|---|
[word = 'can' & pos != 'verb'] | [ (word = 'can') & (pos != 'verb') ] |
[pos = 'verb' \| pos = 'noun' & word = 'can'] | [ (pos = 'verb' \| pos = 'noun') & word = 'can'] |
A:'very'+ | A:('very'+) |
A:_ --> B:_ | (A:_) --> (B:_) |
_ -obj-> _ -amod-> _ | _ -obj-> (_ -amod-> _) |
!'d.*' & '.e.*' | (!'d.*') & '.e.*' , meaning [word != 'd.*' & word = '.e.*'] |
'cow' within <pasture/> containing 'grass' | 'cow' within (<pasture/> containing 'grass') |
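As an illustration of the grouping-parentheses note above, adding parentheses to the second example in the table forces a different reading:
[pos = 'verb' | (pos = 'noun' & word = 'can')]
Here the parentheses make `&` bind before `|`, unlike the default left-to-right interpretation shown in the table.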
# Supported features, differences from CWB
For those who already know CQL, here's a quick overview of the extent of BlackLab's support for this query language. If a feature we don't support yet is important to you, please let us know. If it's quick to add, we may be able to help you out.
# Supported features
BlackLab currently supports (arguably) most of the important features of Corpus Query Language:
- Matching on token annotations, using regular expressions and `=`, `!=`, `!`. Example: `[word='bank']` (or just `'bank'`)
- Case/accent sensitive matching. Note that, unlike in CWB, case-INsensitive matching is currently the default. To explicitly match case-/accent-sensitively, use `'(?-i)...'`. Example: `'(?-i)Mr\.' '(?-i)Banks'`
- Combining criteria using `&`, `|` and `!`. Parentheses can also be used for grouping. Example: `[lemma='bank' & pos='V']`
- Matchall pattern `[]` matches any token. Example: `'a' [] 'day'`
- Regular expression operators `+`, `*`, `?`, `{n}`, `{n,m}` at the token level. Example: `[pos='ADJ']+`
- Sequences of token constraints. Example: `[pos='ADJ'] 'cow'`
- Operators `|`, `&` and parentheses can be used to build complex sequence queries. Example: `'happy' 'dog' | 'sad' 'cat'`
- Querying with tag positions using e.g. `<s>` (start of sentence), `</s>` (end of sentence), `<s/>` (whole sentence) or `<s> ... </s>` (equivalent to `<s/> containing ...`). Example: `<s> 'The'`. XML attribute values may be used as well, e.g. `<ne type='PERS'/>` ("named entities that are persons").
- Using `within` and `containing` operators to find hits inside another set of hits. Example: `'you' 'are' within <s/>`
- Using an anchor to capture a token position. Example: `'big' A:[]`. Captured matches can be used in capture constraints (see next item) or processed separately later (capture information is returned in the match info; see the section on capturing above). Note that BlackLab can actually capture entire groups of tokens as well, similarly to regular expression engines.
- Capture constraints, such as requiring two captures to contain the same word. Example: `'big' A:[] 'or' 'small' B:[] :: A.word = B.word`
See below for features not in this list that may be added soon, and let us know if you want a particular feature to be added.
# Differences from CWB
BlackLab's CQL syntax and behaviour differ in a few ways from CWB's, although the differences mostly concern lesser-used features.
For now, here's what you should know:
- Case-insensitive search is the default in BlackLab, while CWB and Sketch Engine use case-sensitive search as the default. If you want to match a term case-sensitively, use `'(?-i)..'` or `'(?c)..'`.
- If you want to match a string literally, not as a regular expression, use backslash escaping (`'e\.g\.'`) or a literal string (`l'e.g.'`).
- BlackLab supports result set manipulation such as sorting (including on specific context words), grouping/frequency distribution, subsets, sampling, setting context size, etc. However, these are supported through the REST and Java APIs, not through a command interface like in CWB. See the BlackLab Server overview.
- Querying XML elements and attributes looks natural in BlackLab: `<s/>` means "sentences", `<s>` means "starts of sentences", `<s type='A'>` means "sentence tags with a type attribute with value A". This natural syntax differs from CWB's in some places, however, particularly when matching XML attributes.
- In capture constraints (expressions occurring after `::`), only literal matching (no regex matching) is currently supported.
- To return whole sentences as the context of hits, pass `context=s` to BLS.
- The implication operator `->` is currently only supported in capture constraints (expressions after the `::` operator), not in regular token constraints.
- We don't support the `@` anchor and corresponding `target` label; use a named anchor instead.
- Backreferences to anchors only work in capture constraints, so this doesn't work: `A:[] [] [word = A.word]`. Instead, use something like: `A:[] [] B:[] :: A.word = B.word`.
- Instead of CWB's `intersection`, `union` and `difference` operators, BlackLab supports the `&`, `|` and `!` operators at the top level of the query, e.g. `('double' [] & [] 'trouble')` to match the intersection of these queries (i.e. 'double trouble') and `('happy' 'dog' | 'sad' 'cat')` to match the union of 'happy dog' and 'sad cat'. Difference can be achieved by combining `!` and `&`, e.g. `('happy' [] & !([] 'dog'))` to match 'happy' followed by anything except 'dog' (although this is better expressed as `'happy' [word != 'dog']`).
# (Currently) unsupported features
Some features that are not (yet) supported:
- `lbound`, `rbound` functions to get the edge of a region. You can use `<s>` to get all starts-of-sentences or `</s>` to get all ends-of-sentences, however.
- `distance`, `distabs` functions and `match`, `matchend` anchor points (sometimes used in capture constraints).
- Using an XML element name to mean 'token is contained within', like `[(pos = 'N') & !np]` meaning "noun NOT inside an `<np/>` tag".
- A number of less well-known features.
If people ask about missing features, we're happy to work with them to see if they can be added.