BlackLab Server overview

BlackLab Server is a REST web service for accessing BlackLab indices. This makes it easy to use BlackLab from your favourite programming language. It can be used for anything from quick analysis scripts to full-featured corpus search applications.

This page explains how to set up and use BlackLab Server. See the BlackLab homepage for more information on the underlying corpus search engine.


  • Supports XML, JSON and JSONP, making it very easy to use from any programming language.
  • Supports faceted search, giving users an overview of their search results and allowing them to easily refine them.
  • Uses a stateless protocol and persistent identifiers, making it possible to develop web applications that “feel at home” in the browser, meaning they can be bookmarked, opened in multiple tabs and properly support the back button.
  • Makes very responsive (AJAX or ‘regular’) web applications possible because requests return all relevant information at once and both client and server can make use of caching.
  • One BlackLab Server instance can be used for multiple corpora; this means you need less configuration and fewer server resources.
  • BlackLab Server can be tuned in terms of server load and client responsiveness. Possible settings are, for example: strive for X free memory on the server; allow any client at most X running jobs; let clients cache results for X amount of time.
  • Provides debug information (e.g. show all searches in the cache)
  • Has many configurable defaults: page size, context size, default response format, default query language, etc. Of course defaults may be overridden for each request.


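BlackLab Server is a read-only REST web service: searching and retrieval only use GET requests. (The experimental indexing operations described further down also use POST and DELETE.)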

It is stateless: a particular URL will always result in the same response. There’s one exception to this: when the server has the requested set of results, it might indicate that it is still counting the total number of results, and has counted X so far. The client may keep doing additional requests to update the running count for the user. This way, the user gets to see the first results as soon as possible, and she’ll be able to see that the total number of results is still being counted.
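A client could follow the running count along these lines. This is a minimal sketch, not part of BlackLab Server: `fetch` is a hypothetical helper that performs the GET request and returns the parsed JSON, and the response field names `summary`, `stillCounting` and `numberOfHits` are assumptions about the response layout that you should check against your server's actual output.

```python
import time
from typing import Callable, Dict, Iterator


def poll_total(fetch: Callable[[Dict[str, str]], dict],
               patt: str,
               interval: float = 1.0,
               max_polls: int = 100) -> Iterator[int]:
    """Yield the running hit count until the server reports counting is done.

    `fetch` is a hypothetical helper that GETs the hits resource with the
    given parameters and returns the parsed JSON response.
    The field names used below are assumptions about the response layout.
    """
    # number=0: we only want the updated count, not more hits
    params = {"patt": patt, "number": "0"}
    for _ in range(max_polls):
        summary = fetch(params)["summary"]
        yield summary["numberOfHits"]
        if not summary["stillCounting"]:
            return
        time.sleep(interval)
```

Passing `fetch` in as an argument keeps the polling logic independent of the HTTP library you use, and `max_polls` prevents a client from polling forever.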

The web service answers in JSON or XML. The output format is selected through the HTTP Accept header (value “application/json” or “application/xml”), or by passing an extra parameter “outputformat” (value “json” or “xml”). If both are specified, the parameter takes precedence. If neither is specified, the configured default format is used (usually XML).

An extra option is JSONP (“padded JSON”, for when the webservice is running on a different host than the web application). Use the “jsonp” parameter for this (see next section).
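The two ways of selecting an output format can be sketched as follows, using only Python's standard library. The function name `make_request` and the server address in the usage note are made up for illustration; the request is built but not sent.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def make_request(url: str, params: dict,
                 accept: Optional[str] = None,
                 outputformat: Optional[str] = None) -> Request:
    """Build (but do not send) a GET request that selects an output format."""
    query = dict(params)
    if outputformat is not None:
        # If both are given, the server lets this parameter take precedence
        query["outputformat"] = outputformat
    headers = {}
    if accept is not None:
        headers["Accept"] = accept
    return Request(url + "?" + urlencode(query), headers=headers, method="GET")
```

For example, `make_request("http://myserver/blacklab-server/mycorpus/hits", {"patt": '"test"'}, accept="application/json")` asks for JSON through the header alone, while adding `outputformat="xml"` would override that header on the server side.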


A request to BlackLab Server has the following structure:


Here’s what the various parts of this URL mean:

Part Meaning
server the server name, e.g. “”
webservice the web service name, e.g. “blacklab-server”
corpus the corpus (i.e. text collection) to search, e.g. “opensonar”
resource what type of information we’re looking for (hits, docs, docs/pid, docs/pid/contents, …) (see below for the meaning of each resource)
pid persistent identifier for the document. This refers to a metadata field that must be configured per corpus (in the index metadata file; see documentation about indexing with BlackLab). Any field that uniquely identifies the document and won’t change in the future will do. You can retrieve documents with this pid, and result sets will use it to refer to the corresponding documents.

NOTE: BlackLab Server will use the Lucene document id instead of a true persistent identifier if your corpus has no persistent identifier configured (using "pidField" in the index template file - see [Indexing with BlackLab](indexing-with-blacklab.html)), but this is not recommended: Lucene document ids can change if you re-index or compact the index, so bookmarked URLs may not always return to the same information.
parameters search and result parameters, that indicate what pattern you wish to look for, what metadata values you wish to filter on, what to group or sort on, what part of the results to show, what data format to return results in, etc. (see below)
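As an illustration, the parts in the table above could be combined into a request URL as shown below. The layout `http://server/webservice/corpus/resource?parameters` is an assumption inferred from the example URL in the installation notes on this page (`http://myserver/blacklab-server/mycorpus/hits?patt=...`), and `build_url` is a hypothetical helper, not part of BlackLab.

```python
from urllib.parse import quote, urlencode


def build_url(server: str, webservice: str, corpus: str,
              resource: str, **parameters: str) -> str:
    """Assemble a request URL from the parts listed in the table above."""
    # Percent-encode each path part; '/' is kept so that resources such as
    # "docs/DOC0001/contents" stay intact.
    path = "/".join(quote(part, safe="/") for part in (webservice, corpus, resource))
    url = f"http://{server}/{path}"
    if parameters:
        # urlencode takes care of percent-encoding the parameter values
        url += "?" + urlencode(parameters)
    return url
```

With this sketch, `build_url("myserver", "blacklab-server", "opensonar", "hits", patt='"test"')` produces a hits request, and a resource such as `"docs/DOC0001/contents"` addresses a single document by its pid.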

Explanation of the various resources:

Resource Meaning
hits A set of occurrences of a pattern in the corpus (optionally filtered on document properties as well). This resource can also return the result of grouping hits (returning a list of groups), or the contents of one such group (if you wish to view the hits in a group).
docs A set of documents that contain a certain pattern and/or match a certain document filter query. This resource can also return the result of grouping document results, or show the contents of one such group.
docs/pid Metadata for a document
docs/pid/contents Contents of a document. This returns the original input XML. (see note below about the contentViewable setting)
docs/pid/snippet Uses the forward index to retrieve a snippet of the document.
fields/FIELDNAME Shows the settings and (some) field values for a metadata field. For annotated fields (e.g. "contents"), it will show the different annotations (e.g. word, lemma, pos) the field has for each token.
autocomplete/FIELDNAME Shows field values for a metadata field. Requires a parameter "term".
termfreq EXPERIMENTAL. Returns the most frequent terms from an annotation in the main annotated field. Parameters: `annotation` selects the annotation to get term frequencies for; `number` indicates the maximum number to return; `sensitive` (true/false) indicates whether or not to list terms case/diacritics sensitively; `terms` is an optional comma-separated list of terms for which to get the frequencies; `filter` filters documents.

NOTE: by default, users are not allowed to retrieve full document contents. In order to allow this, change the ‘contentViewable’ setting in the indexmetadata file in the index directory. You can also specify this setting in the corpusConfig part of an input format description file, see Influencing index metadata. The contentViewable setting in the indexmetadata file may be overridden for a document by adding a boolean metadata field named “contentViewable”. This can be configured in the input config file like any other metadata field. For example, if your documents contain a “license” element with an attribute status that must be equal to “public” for the content to be viewable, use an XPath query like string(//license[1]/@status='public').

Below is an overview of parameters that can be passed to the various resources. Default values for most parameters can be configured on the server; below are a few suggestions for defaults.

(NOTE: parameters in italics haven’t been implemented yet)

Parameter Meaning
patt Pattern to search for. This normally uses Corpus Query Language. (different query languages are possible, see `pattlang`)
pattlang Query language for the patt parameter. (default: corpusql, Corpus Query Language. Also supported: contextql, Contextual Query Language (only very basic support though), and lucene, Lucene Query Language.)
pattgapdata (Corpus Query Language only) Data (TSV, tab-separated values) to put in gaps in the query. You may leave 'gaps' in the double-quoted strings in your query that can be filled in from tabular data. The gaps should be denoted by @@, e.g. [lemma="@@"] or [word="@@cat"]. For each row in your TSV data, the values in that row are filled into the gaps. The queries resulting from all the rows are combined using OR. For example, if your query is "The" "@@" "@@" and your TSV data is "white\tcat\nblack\tdog", this will execute the query ("The" "white" "cat") | ("The" "black" "dog"). Please note that if you want to pass a large amount of data, you should use a POST request, as the amount of data you can pass in a GET request is limited (with opinions on a safe maximum size varying between 255 and 2048 bytes).
pattfield Content field to search. (default: the main contents field, corpus-specific. Usually “contents”.)
term term used with autocomplete, terms starting with it are returned.
filter Document filter query in Lucene query syntax, e.g. “publicationYear:1976” (default: none)
filterlang Query language for the filter parameter. Supported are lucene (Lucene query syntax, the default) and (limited) contextql (Contextual Query Language). (default: lucene)
docpid Filter on a single document pid, e.g. “DOC0001” (default: none)
wordsaroundhit Number of words of context to retrieve for hits- and docs-results (default: 5)
sort Sorting criteria, comma-separated. ‘-’ reverses sort order. See below. (default: don’t sort)
group Grouping criteria, comma-separated. See below. (default: don’t group)
includegroupcontents Whether to include the hits with each group (default: false).
NOTE: only works for /hits requests for now.
viewgroup Identity of one of the groups to view (identity values are returned with the grouping results). NOTE: you may not get all results in the group because there is a limit to how many results are stored per group! Use hitfiltercrit to get all hits.
hitfiltercrit A criterium to filter hits on. Also needs hitfilterval to work. See below. (default: don't filter)
This is useful if you want to view hits in a group, and then be able to group on those hits again. These two parameters essentially supersede the viewgroup parameter: that parameter also allows you to view the hits in a group, but won't allow you to group that subset of hits again. By specifying multiple criteria and values to hitfiltercrit/hitfilterval, you can keep diving deeper into your result set. NOTE: this may be slow because it finds all hits, then filters them by this criterium.
hitfilterval A value (of the specified hitfiltercrit) to filter hits on. (default: don't filter)
facets Document faceting criteria, comma-separated. See below. (default: don’t do any faceting)
collator What collator to use for sorting and grouping (default: nl)
first First result (0-based) to return (default: 0)
number Number of results to return (if available) (default: 50)
hitstart (snippet operation) First word (0-based) of the hit we want a snippet around (default: 0)
hitend (snippet operation) First word (0-based) after the hit we want a snippet around (default: 1)
wordstart (snippet/contents operations) First word (0-based) of the snippet/part of the document we want. -1 for document start. NOTE: partial contents XML output will be wrapped in <blacklabResponse/> element to ensure a single root element. NOTE: when greater than -1 content before the first word will not be included in the response!
wordend (snippet/contents operations) First word (0-based) after the snippet/part of the document we want. -1 for document end. NOTE when greater than -1 content after the last word will not be included in the response!
block (deprecated) Blocking (“yes”) or nonblocking (“no”) request? (default: yes)
NOTE: nonblocking requests will be removed in a future version.
waitfortotal Whether or not to wait for the total number of results to be known. If no (the default), subsequent requests (with number=0 if you don’t need more hits) can be used to monitor the total count progress. (default: no)
maxretrieve Maximum number of hits to retrieve. -1 means "no limit". Also affects documents-containing-pattern queries and grouped-hits queries. Default configurable in blacklab-server.yaml. Very large values (millions, or unlimited) may cause server problems.
maxcount Maximum number of hits to count. -1 means "no limit". Default configurable in blacklab-server.yaml. Even when BlackLab stops retrieving hits, it still keeps counting. For large results sets this may take a long time.
outputformat “json”, “xml” or “csv”. (Default: check the HTTP Accept header, or use the server default (usually xml) if none was specified.) NOTE: most browsers send a default Accept header including XML.

For "csv", two additional parameters are supported: "csvsummary=yes" will add a summary of the query to the CSV output; "csvsepline=yes" will add "sep=," as the first line, specifically for using the resulting CSV with Excel. Both default to "no".
jsonp Name of the JSONP callback function to use. Automatically forces the outputformat to JSONP. (A JSONP response is a piece of JavaScript consisting of a single function call that gets the JSON response object as its parameter. This is used to circumvent the browser's Same Origin Policy, which blocks AJAX calls to other domains.)
prettyprint yes or no. Determines whether or not the output is on separate lines and indented. Useful while debugging. (default: no (yes in debug mode, see configuration))
includetokencount yes or no. Determines whether or not a document search includes the total number of tokens in the matching documents. Slower, because all document information has to be fetched to calculate this. (default: no)
usecontent fi or orig. fi uses the forward index to reconstruct document content (for snippets and concordances; inline tags are lost in the process), orig uses the original XML from the content store (slower but more accurate).
calc (empty) or colloc. Calculate some information from the result set. Currently only supports calculating collocations (frequency lists of words near hits).
sample Percentage of hits to select. Chooses a random sample of all the hits found.
samplenum Exact number of hits to select. Chooses a random sample of all the hits found.
sampleseed Signed long random seed for sampling. Optional. When given, uses this value to seed the random number generator, ensuring identical sampling results next time. Please note that, without sorting, hit order is undefined (if the same data is re-indexed, hits may be produced in a different order). So if you want true reproducibility, you should always sort hits that you want to sample.

NOTE: using the original content may cause problems with well-formedness; these are fixed automatically, but the fix may result in inline tags in strange places (e.g. a start-sentence tag that is not at the start of the sentence anymore)
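To make the pattgapdata parameter from the table above more concrete, here is a small sketch of the expansion it describes: every TSV row fills the gaps of the query in order, and the resulting queries are combined with OR. It only illustrates the documented behaviour on the client side; it is not the server's implementation, and `fill_gaps` is a hypothetical name.

```python
def fill_gaps(query: str, tsv: str, gap: str = "@@") -> str:
    """Expand a query template into one alternative per TSV row."""
    alternatives = []
    for row in tsv.splitlines():
        if not row:
            continue  # skip blank lines
        filled = query
        for value in row.split("\t"):
            # Replace only the first remaining gap with the next column value
            filled = filled.replace(gap, value, 1)
        alternatives.append(f"({filled})")
    # The queries for all rows are combined using OR
    return " | ".join(alternatives)
```

Calling `fill_gaps('"The" "@@" "@@"', "white\tcat\nblack\tdog")` reproduces the example from the parameter description. The sketch does no escaping, so values containing double quotes or the gap marker itself would need extra care.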

Sorting, grouping, filtering & faceting

NOTE: this is about sorting/grouping and filtering/faceting on groups. For basic filtering on document metadata, see the filter parameter above and the Lucene query syntax.

The sort, group, hitfiltercrit and facets parameters receive one or more criteria (comma-separated) that indicate what to sort, group, filter or facet on.

Criterium Meaning
hit[:prop[:c]] Sort/group/facet on matched text. If prop is omitted, the default annotation (usually word) is used. c can specify case-sensitivity: either s (sensitive) or i (insensitive). prop and c can also be added to left, right, wordleft and wordright. Examples: hit, hit:lemma, hit:lemma:s.
left / right Left/right context words. Used for sorting, not for grouping/faceting (use wordleft/wordright instead). Examples: left, left:pos, left:pos:s.
context More generic context words expression, giving the user more control at the cost of a bit of speed. Example: context:word:s:H1-2 (first two matched words). See below for a complete specification.
wordleft / wordright Single word to the left or right of the matched text. Used for grouping/faceting. Examples: wordleft, wordleft:pos
field:name Metadata field
decade:name Sort/group by the decade of the year given in specified metadata field.
numhits (for sorting per-document results) Sort by number of hits in the document.
identity (for sorting grouping results) Sort by group identity.
size (for sorting grouping results) Sort by group size, descending by default.
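A value for the sort parameter can be assembled from the criteria above as in the following sketch. It assumes that the ‘-’ that reverses the sort order is written directly in front of the criterion it applies to; `sort_param` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def sort_param(criteria: Iterable[Tuple[str, bool]]) -> str:
    """Join (criterion, reverse) pairs into a value for the sort parameter."""
    # Assumption: '-' is placed directly in front of the criterion it reverses
    return ",".join(("-" if reverse else "") + crit for crit, reverse in criteria)
```

For instance, `sort_param([("field:author", False), ("field:date", True)])` sorts by author first and then by date in reverse order.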

Grouping/sorting on context words

Criteria like “context:word:s:H1-2” (first two matched words) allow fine control over what to group or sort on.

Like with criteria such as left, right or hit, you can vary the annotation to group or sort on (e.g. word/lemma/pos, or other options depending on your data set). You may specify whether to sort/group case- and accent-sensitively (s) or insensitively (i).

The final parameter to a “context:” criterium is the specification. This consists of one or more parts separated by a semicolon. Each part consists of an “anchor” and number(s) to indicate a stretch of words. The anchor can be H (hit text), E (hit text, but counted from the end of the hit), L (words to the left of the hit) or R (words to the right of the hit). The number or numbers after the anchor specify what words you want from this part. A single number indicates a single word; 1 is the first word, 2 the second word, etc. So “E2” means “the second-to-last word of the hit”. Two numbers separated by a dash indicate a stretch of words. So “H1-2” means “the first two words of the hit”, and “E2-1” means “the second-to-last word followed by the last word”. A single number followed by a dash means “as much as possible from this part, starting from this word”. So “H2-” means “the entire hit text except the first word”.

A few more examples:

  • context:word:s:H1;E1 (the first and last matched word)
  • context:word:s:R2-3 (second and third word to the right of the match)
  • context:word:s:L1- (left context, starting from the first word to the left of the hit, i.e. the same as “left:word:s”. How many words of context are used depends on the ‘wordsaroundhit’ parameter, which defaults to 5)
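The specification syntax can be illustrated with a small interpreter that picks words from a hit and its context. This is a sketch of the rules as described in this section, not BlackLab's own code. It assumes that L1 is the word nearest to the hit and that positions beyond the available words are silently ignored; `select_context` is a hypothetical name.

```python
from typing import List, Sequence


def select_context(spec: str, hit: Sequence[str],
                   left: Sequence[str] = (), right: Sequence[str] = ()) -> List[str]:
    """Pick words according to a context specification such as "H1-2;E1".

    `left`, `hit` and `right` hold words in document order.
    """
    # For each anchor, position 1 is the first word counted from that anchor
    anchors = {
        "H": list(hit),
        "E": list(reversed(hit)),
        "L": list(reversed(left)),  # assumption: L1 is the word nearest the hit
        "R": list(right),
    }
    selected: List[str] = []
    for part in spec.split(";"):
        words = anchors[part[0]]
        start, dash, end = part[1:].partition("-")
        first = int(start)
        if not dash:            # "H2": a single word
            last = first
        elif end == "":         # "H2-": as much as possible from this word on
            last = max(first, len(words))
        else:                   # "H1-2" or "E2-1": a stretch of words
            last = int(end)
        step = 1 if last >= first else -1
        for pos in range(first, last + step, step):
            if 1 <= pos <= len(words):  # ignore positions that do not exist
                selected.append(words[pos - 1])
    return selected
```

With the hit "the quick brown fox", the specification "H1;E1" selects "the" and "fox", and "E2-1" selects "brown" followed by "fox", matching the descriptions above.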


There are code examples of using BlackLab Server from a number of different programming languages.

Below are examples of individual requests to BlackLab Server. NOTE: for clarity, double quotes have not been URL-encoded.


All occurrences of “test” in the “opensonar” corpus (CorpusQL query): "test"

All documents having “guide” in the title and “test” in the contents, sorted by author and date, results 61-90: "test"&sort=field:author,field:date&first=61&number=30

Occurrences of “test”, grouped by the word left of each hit: "test"&group=wordleft

Documents containing “test”, grouped by author: "test"&group=field:author

Larger snippet around a hit:

Information about a document

Metadata of document with specific PID

The entire original document

The entire document, with occurrences of “test” highlighted (with <hl/> tags): "test"

Part of the document (embedded in a root element; BlackLab makes sure the resulting XML is well-formed)

Information about indices

Information about the webservice; list of available indices (trailing slash optional)

Information about the “opensonar” corpus (structure, fields, (sub)annotations, human-readable names) (trailing slash optional)

Information about the “opensonar” corpus, include all values for “pos” annotation (listvalues is a comma-separated list of annotation names):

Information about the “opensonar” corpus, include all values for “pos” annotation and any subannotations (listvalues may contain regexes):*

Autogenerated XSLT stylesheet for transforming whole documents (only available for configfile-based XML formats):

Indexing via BlackLab Server (EXPERIMENTAL)

BlackLab Server includes experimental support for creating indices and adding documents to them. We are using these features to build an interface where users can quickly index data and search it, without having to set up a BlackLab installation themselves. These features are still pretty volatile, so don’t rely too heavily on them yet, but here’s a very quick overview.

Currently, only private indices can be created and appended to. This means there must be a logged-in user. The setting authSystem in blacklab-server.yaml (or .json) will let you specify what authentication system you’d like to use. If you specify class “AuthDebugFixed” and a userId, you will always be logged in as this user. Note that this debug authentication method only works if you are a debug client (i.e. your IP address is listed in the debug.addresses setting, see Configuration files). Have a look at the other Auth* classes (mostly AuthRequestAttribute) to see how real authentication would work.

Another required setting is userCollectionsDir (in addition to indexCollections which points to the “globally available” indices). In this directory, user-private indices will be created. Obviously, the application needs write permissions on this directory.

When a user is logged in and you have a userCollectionsDir set up, you will see a user section on the BlackLab Server info page (/blacklab-server/) with both loggedIn and canCreateIndex set to true. To see what input formats are supported, look at the /blacklab-server/input-formats/ URL.

To create a private index, POST to /blacklab-server/ with parameters name (index identifier), display (a human-friendly index name) and format (the input format to use for this index, e.g. tei). The userId will be prepended to the index name, so if your userId is myUserId and you create an index named myIndex, the full name will be myUserId:myIndex.

To add a file to a private index, upload it to /blacklab-server/INDEX_NAME/docs with parameter name data.

To remove a private index, send a DELETE request to /blacklab-server/INDEX_NAME/.
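The create and remove operations above might be prepared like this with Python's standard library. This is a hedged sketch: the server address is made up, sending the parameters as a form-encoded POST body is an assumption, and authentication is left out entirely. The requests are only built here, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE = "http://myserver/blacklab-server/"  # hypothetical server address


def create_index_request(name: str, display: str, fmt: str) -> Request:
    """POST request that asks the server to create a private index."""
    # Assumption: the parameters are sent as a form-encoded request body
    body = urlencode({"name": name, "display": display, "format": fmt}).encode("utf-8")
    return Request(BASE, data=body, method="POST")


def delete_index_request(full_name: str) -> Request:
    """DELETE request for a private index such as "myUserId:myIndex"."""
    # Keep the ':' between userId and index name readable in the path
    return Request(BASE + quote(full_name, safe=":") + "/", method="DELETE")
```

A request built this way could be sent with `urllib.request.urlopen(request)` once the server is configured for a logged-in user.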

Adding/removing user formats

To add an input format, upload a .yaml or .json configuration file to the /blacklab-server/input-formats/ URL with parameter name “data”. The file name will become the format name. User formats will be prefixed with the userId, so if your userId is myUserId and you upload a file myFormatName.blf.yaml, a new format myUserId:myFormatName will be created. Only you will see it in the formats list, but in theory, everyone can use it (this is different from indices, which are private).

To view an input format configuration, use /blacklab-server/input-formats/FORMAT_NAME.

To remove an input format, send a DELETE request to the format page, e.g. /blacklab-server/input-formats/FORMAT_NAME.

Share private index with a list of users

To see what users (if any) a private index is currently shared with, use:


To set the list of users to share a private index with, send a POST request to the same URL with the ‘users[]’ parameter for each user to share with (that is, you should specify this parameter multiple times, once for each user). You can leave the parameter empty if you don’t want to share the index anymore.
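Repeating the ‘users[]’ parameter once per user can be done as in the sketch below, which only builds the form-encoded data for the POST request. Sending it as a request body is an assumption, and `share_body` is a hypothetical helper.

```python
from typing import Iterable
from urllib.parse import urlencode


def share_body(users: Iterable[str]) -> bytes:
    """Form-encode the users[] parameter, repeated once per user."""
    pairs = [("users[]", user) for user in users]
    if not pairs:
        # An empty value means: do not share the index anymore
        pairs = [("users[]", "")]
    return urlencode(pairs).encode("utf-8")
```

Note that `urlencode` percent-encodes the square brackets in the parameter name, which is the standard encoding for form data.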

The sharing information is stored in the index directory in a file named .shareWithUsers.


First, you need the BlackLab Server WAR file. You can either download the latest release, or build it yourself by cloning the repository from GitHub and building it with Maven.

BlackLab Server needs to run in a Java application server that supports servlets. We’ll assume Apache Tomcat here, but others should work almost the same.

PLEASE NOTE: BlackLab currently uses Java EE and therefore runs in Tomcat 8 and 9, but not in Tomcat 10 (which migrated to Jakarta EE). If you try to run BlackLab Server on Tomcat 10, you will get a ClassNotFoundException. A future release of BlackLab will migrate to Jakarta EE.

For larger indices, it is important to give Tomcat’s JVM enough heap memory. (If heap memory is low and/or fragmented, the JVM garbage collector might start taking 100% CPU moving objects in order to recover enough free space, slowing things down to a crawl.) If you are indexing unique ids for each word, you may also be able to save memory by disabling the forward index for that ‘unique id’ annotation.

Create a configuration file blacklab-server.yaml in /etc/blacklab/ or, if you prefer, on the application server’s classpath. Make sure the indexLocations setting is correctly specified (it should point to a directory containing one or more BlackLab indices as subdirectories, or to a single index directory). The minimal configuration file looks like this:

configVersion: 2

# Where indexes can be found
# (list directories whose subdirectories are indexes, or directories containing a single index)
indexLocations:
- /data/blacklab/indexes

(for more information about configuring BlackLab and BlackLab Server, see Configuration files)

Place blacklab-server.war in Tomcat’s webapps directory ($TOMCAT/webapps/). Tomcat should automatically discover and deploy it, and you should be able to go to http://servername:8080/blacklab-server/ and see the BlackLab Server information page, which includes a list of available corpora.

To ensure the correct handling of accented characters in (search) URLs, you should make sure that your URLs are URL-encoded UTF-8 (so e.g. searching for “señor” corresponds to a request like http://myserver/blacklab-server/mycorpus/hits?patt=%22se%C3%B1or%22). You should also tell Tomcat to interpret URLs as UTF-8 (by default, it uses ISO-8859-1) by adding an attribute URIEncoding="UTF-8" to the Connector element with the attribute port="8080" in Tomcat’s server.xml file.
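On the client side, the UTF-8 URL-encoding described above can be produced with Python's standard library, as in this short sketch (`encode_patt` is a hypothetical helper name):

```python
from urllib.parse import quote


def encode_patt(patt: str) -> str:
    """Percent-encode a pattern as UTF-8 for use in a query string."""
    # safe="" makes sure characters such as '"' and '/' are encoded as well
    return quote(patt, safe="", encoding="utf-8")


url = "http://myserver/blacklab-server/mycorpus/hits?patt=" + encode_patt('"señor"')
```

The resulting `url` is the same request as the “señor” example above.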

To (significantly!) improve performance of certain operations, including sorting and grouping large result sets, you might want to consider using the vmtouch tool to lock the forward index files in the OS’s disk cache. You could also serve these files (or the entire index) from an SSD.

Error and status responses

BLS can return these error and status codes. The associated human-readable message is informational only and can be shown to the user if you want; note though that the precise wording may change in future versions. The codes in the left column will not change and may be used to show your own custom messages (e.g. translations).

Operations that do not return status or error codes and messages (that is, all successful retrieval operations) will always set the HTTP status to “200 OK”.

HTTP status Error code Error message
200 OK (no code) (no message, just search results)
200 OK SUCCESS Index deleted succesfully.
201 Created SUCCESS Index created succesfully.
202 Accepted SUCCESS Documents uploaded succesfully; indexing started.
400 Bad Request UNKNOWN_OPERATION Unknown operation. Check your URL.
400 Bad Request NO_DOC_ID Specify document pid.
400 Bad Request NO_FILTER_GIVEN Document filter required. Please specify 'filter' parameter.
400 Bad Request FILTER_SYNTAX_ERROR Error parsing FILTERLANG filter query: ERRORMESSAGE (NOTE: see Lucene query syntax)
400 Bad Request UNKNOWN_FILTER_LANG Unknown filter language 'FILTERLANG'. Supported: SUPPORTED_LANGS.
400 Bad Request NO_PATTERN_GIVEN Text search pattern required. Please specify 'patt' parameter.
400 Bad Request PATT_SYNTAX_ERROR Syntax error in PATTLANG pattern: ERRORMESSAGE
400 Bad Request UNKNOWN_PATT_LANG Unknown pattern language 'PATTLANG'. Supported: SUPPORTED_LANGS.
400 Bad Request UNKNOWN_GROUP_PROPERTY Unknown group property 'NAME'.
400 Bad Request UNKNOWN_SORT_PROPERTY Unknown sort property 'NAME'.
400 Bad Request ERROR_IN_GROUP_VALUE Parameter 'viewgroup' has an illegal value: GROUPID /
Parameter 'viewgroup' specified, but required 'group' parameter is missing.
400 Bad Request GROUP_NOT_FOUND Group not found: GROUPID
400 Bad Request REGEXP_TOO_LARGE Regular expression too large. (NOTE: maximum size depends on Java stack size)
400 Bad Request JSONP_ILLEGAL_CALLBACK Illegal JSONP callback function name. Must be a valid Javascript name.
400 Bad Request SNIPPET_TOO_LARGE Snippet too large. Maximum size for a snippet is MAXSIZE words.
400 Bad Request ILLEGAL_BOUNDARIES Illegal word boundaries specified. Please check parameters.
400 Bad Request HIT_NUMBER_OUT_OF_RANGE Non-existent hit number specified.
400 Bad Request CANNOT_CREATE_INDEX Could not create index. REASON
400 Bad Request INDEX_ALREADY_EXISTS Could not create index. Index already exists.
400 Bad Request ILLEGAL_INDEX_NAME Illegal index name (only word characters, underscore and dash allowed): INDEXNAME
400 Bad Request CANNOT_UPLOAD_FILE Cannot upload file. REASON
400 Bad Request INDEX_ERROR An error occurred during indexing: MESSAGE
401 Unauthorized NOT_AUTHORIZED Unauthorized operation. REASON
403 Forbidden FORBIDDEN_REQUEST Forbidden request. REASON
405 Method Not Allowed ILLEGAL_REQUEST Illegal GET/POST/PUT/DELETE request. REASON
404 Not Found CANNOT_OPEN_INDEX Could not open index 'NAME'. Please check the name.
404 Not Found DOC_NOT_FOUND Document with pid 'PID' not found.
409 Conflict INDEX_UNAVAILABLE The index 'INDEXNAME' is not available right now. Status: STATUS
429 Too Many Requests TOO_MANY_JOBS You already have too many running searches. Please wait for some previous searches to complete before starting new ones.
500 Internal Server Error INTERNAL_ERROR An internal error occurred. Please contact the administrator.
503 Service Unavailable SERVER_BUSY The server is under heavy load right now. Please try again later.
503 Service Unavailable SEARCH_TIMED_OUT Search took too long, cancelled.
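Because the error codes are stable, a client can key its own messages on them, as in the sketch below. It assumes that a JSON error response has the shape `{"error": {"code": ..., "message": ...}}`; that layout is an assumption, not something stated on this page, and the custom texts are invented examples.

```python
from typing import Dict, Optional

# Hypothetical custom texts, keyed by the stable error codes from the table
CUSTOM_MESSAGES: Dict[str, str] = {
    "TOO_MANY_JOBS": "You have too many searches running. Please wait a moment.",
    "SERVER_BUSY": "The server is busy. Please try again in a few minutes.",
}


def user_message(response: dict) -> Optional[str]:
    """Return a message to show the user, or None if this is not an error.

    Assumes an error response looks like
    {"error": {"code": "...", "message": "..."}}.
    """
    error = response.get("error")
    if error is None:
        return None
    # Prefer our own text for known codes; fall back to the server's message
    return CUSTOM_MESSAGES.get(error["code"], error["message"])
```

Falling back to the server's message keeps the client usable when it meets a code it has no custom text for.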