# Frequently Asked Questions

## Who is BlackLab for?

BlackLab is a good choice if you want to search a large body of text annotated with extra information per word (e.g. lemma, part of speech, or any number of additional layers). It adds a number of search features to Lucene.

With BlackLab, you can search for complex patterns of words (e.g. “find all nouns preceded by two or three adjectives”). It can accurately highlight matches (not just simple terms) in the original document or show them in a keyword-in-context (KWIC) view. It can quickly sort or group large result sets based on several criteria, including the exact words matched or words surrounding the match. It can also search inside specific XML tags, so you can search for people or places, for example.
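
For example, the "nouns preceded by two or three adjectives" pattern above might look roughly like this in BlackLab Corpus Query Language (BCQL); note that the annotation name (`pos`) and its values are assumptions that depend on how your corpus was indexed:

```
[pos="ADJ"]{2,3} [pos="NOUN"]
```

Similarly, a query like the following would find "Amsterdam" only where it occurs inside a (hypothetical) `<ne/>` named-entity element:

```
"Amsterdam" within <ne/>
```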

BlackLab supports a number of input formats out of the box. Adding support for a new input format is easy.

## Who uses BlackLab?

See Who uses BlackLab?.

## Is BlackLab easy to use?

Yes, ease of use is an important design goal of BlackLab. This goes for both the Java library (BlackLab Core) and the webservice (BlackLab Server).

The simplest program using the Java library is 3 lines long: open a BlackLab index, execute a query, and close the index again. Of course, you might want to iterate over the results and display them, so a real-world program will be longer.
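
A minimal sketch of such a program is shown below. It assumes the BlackLab 2.x-style Java API; exact package and class names (`BlackLab`, `BlackLabIndex`, `CorpusQueryLanguageParser`, `Hits`) may differ between versions, so check the Javadoc of your release.

```java
import java.io.File;

// Package paths as used in BlackLab 2.x; they may differ in your version.
import nl.inl.blacklab.queryParser.corpusql.CorpusQueryLanguageParser;
import nl.inl.blacklab.search.BlackLab;
import nl.inl.blacklab.search.BlackLabIndex;
import nl.inl.blacklab.search.results.Hits;
import nl.inl.blacklab.search.textpattern.TextPattern;

public class SimpleSearch {
    public static void main(String[] args) throws Exception {
        // Open an existing BlackLab index (the path is just an example).
        BlackLabIndex index = BlackLab.open(new File("/data/my-index"));
        try {
            // Parse a BCQL query; the "pos" annotation and its values
            // depend on how the corpus was indexed.
            TextPattern pattern = CorpusQueryLanguageParser.parse("[pos=\"ADJ\"]{2,3} [pos=\"NOUN\"]");

            // Execute the query and report the number of hits found.
            Hits hits = index.find(pattern);
            System.out.println("Hits found: " + hits.size());
        } finally {
            index.close(); // always close the index when done
        }
    }
}
```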

Queries can be supplied in many different forms, depending on what you’re familiar with:

  • BlackLab Corpus Query Language (BCQL), similar to the query languages of the IMS Corpus Workbench and the Sketch Engine;
  • Contextual Query Language (EXPERIMENTAL), used by many online information retrieval systems;
  • or if you prefer, you can programmatically construct queries out of objects, so you can add your own query languages, offer a GUI query builder, etc.
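
As a sketch of that last option, the following builds a simple two-word sequence query out of TextPattern objects instead of parsing a BCQL string. The class names `TextPatternSequence` and `TextPatternTerm` are assumptions based on older BlackLab versions; consult the Javadoc for the exact API.

```java
// Assumes 'index' is an open BlackLabIndex, as in the example above.
// Equivalent to the BCQL query: "black" "lab"
TextPattern pattern = new TextPatternSequence(
        new TextPatternTerm("black"),
        new TextPatternTerm("lab"));
Hits hits = index.find(pattern);
System.out.println("Hits found: " + hits.size());
```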

See Getting started to get your feet wet with BlackLab. See BlackLab Server overview for more information about the webservice.

If you have questions, contact us (see below)!

## Why can't BlackLab find words in my input files?

BlackLab needs tokenized input files. This means the word boundaries have already been determined, so BlackLab can simply index the words as it parses the file.

In addition to this, the default TEI configuration (tei.blf.yaml) may not be suitable for your particular TEI documents. You can derive your own custom configuration from the default one to fix this. See how to configure indexing.

## Why is BlackLab slow / running out of memory / using 100% CPU?

Usually this is related to the Java heap size. Make sure the JVM has enough heap space (e.g. by increasing the maximum heap size with the -Xmx option). If heap memory is low and/or fragmented, the JVM garbage collector might start taking 100% CPU moving objects around in order to recover enough free space, slowing things down to a crawl.

If that doesn't help, you might be trying to index a very large corpus with a unique id for each word, for example. BlackLab was designed to index word, lemma, part of speech and such, which all have a limited number of unique values. If you wish to index a unique id annotation for each word, you could disable the forward index to save memory and disk space.

Certain advanced queries can be slow as well. Try writing the same query in a slightly different way and see if that helps. If not, let us know.

## Why can't I view whole documents?

You may get a "permission denied" error (or similar) when trying to view a whole document in corpus-frontend or via BlackLab Server. This is due to the contentViewable setting.

See the note about contentViewable on the BlackLab Server overview page.

## Can I use BlackLab with Solr/ElasticSearch?

We are working on it! Version 4 of BlackLab should work with Solr.

Distributed search will follow in a later version.

See our future plans.

## Where do I find requirements and compatibility information?

See the downloads page, which also includes information about indexed corpus compatibility ("does an older corpus work with a newer BlackLab, or do I have to reindex?").

## Is BlackLab fast?

(this information is somewhat out of date, but the basic conclusions still hold)

We've done our best to ensure that the features we've added don't compromise Lucene's impressive search speed. Of course, search and index speed vary with machine and disk speed and available memory, but here are a few examples, measured on a reasonably fast machine with 32GB RAM.

### Search speed

Here’s a rough indication of current search performance in a corpus with 450M words:

  • Most queries will yield the first batch of results in under a second, even when there are hundreds of thousands of matches. Some types of wildcard queries take a few seconds longer.
  • Sorting a result set of 100K matches by document title or date takes about 2 seconds.
  • Sorting the same result set by the words occurring to the right of the match takes about 4 seconds.
  • Grouping a result set of 800K matches of a wildcard query on the matched text (to see how often different words were matched) takes about 8 seconds.

Note that search performance is heavily reliant on disk caching (all the forward indices must be in OS cache to achieve maximum search, sort and group performance), so make sure you have plenty of memory.

### Index speed

In addition to the factors mentioned above, indexing speed also depends on the input format used. Here are two examples:

  • The 450M-word data set mentioned above (consisting of OCR'ed text pages) was indexed in around 6 hours (around 20,000 words/sec)
  • A 100M-word data set in a more compact input format (word-per-line), including headword and part-of-speech tagging, was indexed in under an hour (around 30,000 words/sec)

## Why did you develop BlackLab?

Other corpus engines exist, but at the Dutch Language Institute we decided to develop our own. A logical question is why.

We use many different corpora in many different projects, each with its own specific requirements, and we needed a solution flexible enough to handle them all.

We designed BlackLab to offer the flexibility we missed in other corpus engines at the time. BlackLab:

  • allows sorting and grouping hits on various properties, including the context of the match
  • doesn't prescribe a fixed input format, but can work with any data you want to throw at it
  • provides solid support for BlackLab Corpus Query Language (BCQL), but makes it easy to add query languages. We've added basic support for SRU/CQL (Contextual Query Language), for example, and may add other query languages (e.g. related to treebank search) in the future.
  • has a modular design, making it easier to extend and maintain
  • includes the ability to customize and extend fundamental indexing features, such as (future) support for searching tree-like structures like syntactic and semantic relations
  • continually benefits from developments in Lucene, Solr and ElasticSearch, making it more future-proof and lowering development cost
  • is trivial to use from other programming languages when desired, but also integrates seamlessly with the Java ecosystem that many of our products are built on

Of course, other corpus engines have improved over the years and may fit your needs. We encourage you to do your own research, and if you have any questions about support for certain features, just ask.

## Future plans

We will keep improving BlackLab as time goes on. Please see our future plans.

## More questions?

For technical questions about BlackLab, please contact us. We're always happy to hear from you.