BlackLab is a good choice if you want to search a large body of text annotated with extra information per word (e.g. lemma, part of speech, or any number of additional layers). It adds a number of search features to Lucene.
With BlackLab, you can search for complex patterns of words (e.g. “find all nouns preceded by two or three adjectives”). It can accurately highlight matches (not just simple terms) in the original document or show them in a keyword-in-context (KWIC) view. It can quickly sort or group large result sets based on several criteria, including the exact words matched or words surrounding the match. It can also search inside specific XML tags, so you can search for people or places, for example.
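As an illustration, the "nouns preceded by two or three adjectives" example could be written in Corpus Query Language and sent to BlackLab Server roughly as follows. This is a minimal sketch, not a definitive recipe: the annotation name `pos`, the tag values `ADJ` and `NOU`, the server address and the corpus name `mycorpus` are all assumptions that depend on your own corpus and installation, and the `hits` endpoint with its `patt` parameter should be checked against the BlackLab Server documentation for your version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class AdjectiveNounQuery {

    // Builds a CQL pattern: between min and max tokens tagged adjTag,
    // followed by one token tagged nounTag.
    // The annotation name "pos" is an assumption about the corpus.
    public static String buildPattern(String adjTag, String nounTag, int min, int max) {
        return String.format("[pos=\"%s\"]{%d,%d} [pos=\"%s\"]", adjTag, min, max, nounTag);
    }

    // Builds a request URL for a (hypothetical) BlackLab Server instance.
    // The pattern is URL-encoded because it contains quotes, brackets and spaces.
    public static String buildHitsUrl(String baseUrl, String corpus, String pattern) {
        String encoded = URLEncoder.encode(pattern, StandardCharsets.UTF_8);
        return baseUrl + "/" + corpus + "/hits?patt=" + encoded;
    }

    public static void main(String[] args) {
        String pattern = buildPattern("ADJ", "NOU", 2, 3);
        System.out.println(pattern); // [pos="ADJ"]{2,3} [pos="NOU"]
        System.out.println(buildHitsUrl("http://localhost:8080/blacklab-server", "mycorpus", pattern));
    }
}
```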
BlackLab supports a number of input formats out of the box. Adding support for a new input format is easy.
See Who uses BlackLab?
Yes, ease of use is an important design goal of BlackLab. This goes for both the Java library (BlackLab Core) and the webservice (BlackLab Server).
The simplest program using the Java library is 3 lines long: open a BlackLab index, execute a query, and close the index again. Of course, you might want to iterate over the results and display them, so a real-world program will be longer.
Queries can be supplied in many different forms, depending on what you’re familiar with:
See Getting started to get your feet wet with BlackLab. See the reference documentation for a detailed overview of the Java library, or BlackLab Server overview for more information about the webservice.
If you have questions, contact us (see below)!
BlackLab needs tokenized input files. This means the word boundaries have already been determined so BlackLab can just index words as it parses the file.
In addition to this, the default TEI configuration (tei.blf.yaml) may not be suitable for your particular TEI documents. You can derive your own custom configuration from the default one to fix this. See how to configure indexing.
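To make "tokenized" concrete, the sketch below turns plain text into word-level markup using the TEI elements `<w>` (word) and `<pc>` (punctuation). It is only an illustration of the idea: the splitting rule is deliberately naive (for instance, it breaks "don't" into three tokens), the class name is hypothetical, and a real pipeline would use a proper tokenizer and usually add lemma and part-of-speech information as well. Whether the default TEI configuration picks up these elements in your documents is something to verify against your own data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveTeiTokenizer {

    // Group 1: a run of letters/digits (a "word").
    // Group 2: any single other non-whitespace character (treated as punctuation).
    private static final Pattern TOKEN =
            Pattern.compile("([\\p{L}\\p{N}]+)|([^\\s\\p{L}\\p{N}])");

    // Escapes the characters that are not allowed in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps every word in <w> and every punctuation character in <pc>,
    // separating the resulting elements with a single space.
    public static String toTei(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String tag = (m.group(1) != null) ? "w" : "pc";
            sb.append('<').append(tag).append('>')
              .append(escapeXml(m.group()))
              .append("</").append(tag).append('>');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: <w>The</w> <w>quick</w> <w>fox</w> <w>jumps</w> <pc>.</pc>
        System.out.println(toTei("The quick fox jumps."));
    }
}
```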
Usually this is related to the Java heap size. Make sure the JVM has enough heap space. If heap memory is low and/or fragmented, the JVM garbage collector might start taking 100% CPU moving objects in order to recover enough free space, slowing things down to a crawl.
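A quick way to see how much heap the JVM was actually given is to ask the runtime itself. The sketch below uses only the standard `java.lang.Runtime` API; the class name is hypothetical and nothing here is specific to BlackLab.

```java
public class HeapCheck {

    private static final long MIB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in MiB (governed by -Xmx).
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("max heap:              %d MiB%n", rt.maxMemory() / MIB);
        System.out.printf("currently allocated:   %d MiB%n", rt.totalMemory() / MIB);
        System.out.printf("free within allocated: %d MiB%n", rt.freeMemory() / MIB);
    }
}
```

Running it with the same options as your application, for example `java -Xmx4g HeapCheck`, shows whether the heap setting is being applied. The value `4g` is only illustrative; the right size depends on your corpus and machine.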
If that doesn’t help, you might be trying to index a very large corpus with a unique id for each word, for example. BlackLab was designed to index word, lemma, part of speech and such, which all have a limited number of unique values. If you wish to index a unique id annotation for each word, you could disable the forward index to save memory and disk space.
Sometimes, certain advanced queries may be slow as well. You can experiment with writing the same query in a slightly different way and see if that helps. If not, let us know.
You may be getting a “permission denied” or similar error when trying to view a whole document in corpus-frontend or via BlackLab Server. This is due to the contentViewable setting.
Solr/ElasticSearch integration is high on our wishlist, but BlackLab started as a Java library using Lucene directly, and some changes are required to get it to integrate with Solr and/or ElasticSearch. We know the steps required; it’s just a question of finding the time.
If you’re using Solr/ElasticSearch and are interested in taking advantage of the features that BlackLab provides, drop us a line (see below). We’d love to collaborate on this.
We’ve done our best to ensure that the features we’ve added don’t compromise Lucene’s impressive search speed. Of course, search and indexing speed vary with machine and disk speed and with available memory, but here are a few examples from a reasonably fast machine with 32GB RAM.
Here’s a rough indication of current search performance in a corpus with 450M words:
Note that search performance is heavily reliant on disk caching (all the forward indices must be in OS cache to achieve maximum search, sort and group performance), so make sure you have plenty of memory.
In addition to the factors mentioned above, indexing speed also depends on the input format used. Here are two examples:
At the Dutch Language Institute, we use all kinds of corpora in all kinds of projects, and we needed a flexible solution for these corpora and projects and their specific requirements.
We designed BlackLab to offer the flexibility that we missed in other corpus engines at the time:
We intend to keep improving BlackLab. For an overview of our future plans, check the Road map.
For technical questions about BlackLab, contact Jan Niestadt. I’m always happy to hear from you.