There are several examples online of what you can do with BlackLab.
To start searching your own data, you’ll need to:
Let’s go over these one by one.
The web service, BlackLab Server, can be used from any programming language and offers a simple REST interface. The Java library offers the most flexibility, but it does mean you have to use a language that runs on the JVM (e.g. Java, Scala, Kotlin, etc.). Note that you will need a Java SE 8 compatible JVM to use the latest BlackLab versions.
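To give an idea of what the REST interface looks like: once BlackLab Server is running, a hits request looks something like this (the corpus name "mycorpus" and the base URL are just illustrations; they depend on how you deploy the WAR):

http://localhost:8080/blacklab-server/mycorpus/hits?patt="fox"

The patt parameter takes a CorpusQL pattern, and the response is available as XML or JSON.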
For now, this guide will focus on BlackLab Core and Java. For more information on BlackLab Server, see this overview.
First you need to get the BlackLab library. The simplest way is to let Maven download it automatically from the Central Repository, but you can also download a prebuilt binary, and it’s trivial to build it yourself.
Note to macOS users: Dirk Roorda at DANS wrote a detailed guide to installing BlackLab and indexing data on macOS. It's available here.
BlackLab is in the Maven Central Repository, so you should be able to simply add it to your build tool. If you’re not sure what version to use, see the downloads or changelog pages.
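For Maven, that means adding a dependency along these lines to your pom.xml (the coordinates below are an assumption; check the downloads page for the correct groupId, artifactId and version, and replace VERSION accordingly):

<dependency>
  <groupId>nl.inl.blacklab</groupId>
  <artifactId>blacklab</artifactId>
  <version>VERSION</version>
</dependency>

Other build tools (Gradle, sbt, etc.) can resolve the same coordinates from Maven Central.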
BlackLab Core consists of a JAR and a set of required libraries. See the GitHub releases page and choose a jar-with-libs download. The releases page may also offer development versions you can try out.
BlackLab Server is a single WAR file that includes everything. If you need to, you can even unzip this WAR file to obtain the included BlackLab JAR and its libraries.
If you want the very latest version (the “dev” branch) of BlackLab, you can easily build it from source code.
First, you need to download the source code from GitHub. You can download it from there in a .zip file (be sure to select the dev branch before doing so), but a better way to get it is by cloning it using Git. Install a Git client (we’ll give command line examples here, but it should translate easily to GUI clients like TortoiseGit), change to a directory where you keep your projects, and clone BlackLab:
git clone https://github.com/INL/BlackLab.git
Git will download the project and place it in a subdirectory “BlackLab”. Now switch to the dev branch:
git checkout dev
Install a recent JDK (Java Development Kit). If you’re on Linux, you can use your package manager to do this (OpenJDK is fine too). Note that you will need at least JDK version 8 (i.e. openjdk-1.8.0) to use the latest BlackLab versions.
BlackLab is built using Maven, a popular Java build tool. Install Maven (use your package manager if on Linux), change into the BlackLab directory, and build the library:
mvn install
(“install” refers to the fact that the library is “installed” to your private Maven repository after it is built)
After a lot of text output, it should say “BUILD SUCCESS”, and the BlackLab JAR library will be at core/target/blacklab-VERSION.jar (where VERSION is the current BlackLab version, e.g. “1.7-SNAPSHOT”; SNAPSHOT means it’s not an official release, by the way). The BlackLab Server WAR will be in server/target/blacklab-server-VERSION.war.
NOTE: If you want to use BlackLab Server and BlackLab AutoSearch (our search application), you’ll also need an application server such as Apache Tomcat, likewise available via your package manager on Linux. After installation, find the “webapps” directory (e.g. /var/lib/tomcat/webapps/, though the exact location depends on your distribution) and copy the WAR file into it; Tomcat should extract it automatically. For full installation and configuration instructions, see the BlackLab Server overview.
In order to search your data using BlackLab, it needs to be in a supported format, and it needs to be indexed by BlackLab.
BlackLab supports a number of input formats, but the most well-known are TEI (Text Encoding Initiative) and FoLiA (Format for Linguistic Annotation). These are both XML formats. If your data is not already in one of the supported formats, you need to convert it. (See the next section if you'd rather use a test dataset instead of your own data.)
NOTE: BlackLab needs tokenized data files as input. That means the word boundaries have already been determined, so BlackLab can simply index each word as it parses the input file.
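To give an idea, tokenized TEI input might look something like this (a minimal sketch; the exact element and attribute names depend on your TEI flavour and index configuration):

<s>
  <w lemma="the" type="AT">The</w>
  <w lemma="quick" type="JJ">quick</w>
  <w lemma="fox" type="NN">fox</w>
</s>

Each <w> element is one token, so BlackLab never has to guess where a word starts or ends.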
One way to convert your data is using our tool OpenConvert, which can generate TEI or FoLiA from txt, doc(x) or html files, among others. After conversion, you can tag the files using a tool such as Frog.
A web-based user interface for converting and tagging (Dutch) input files is available. You will need a CLARIN.eu account (more information).
If you can’t use your own data yet, we’ve provided a tokenized, annotated TEI version of the Brown corpus for you to test with.
The Brown corpus was compiled in the 1960s by Nelson Francis and Henry Kucera at Brown University. It is small by today’s standards (500 documents, about 1 million words). It was converted to TEI format by Lou Burnard. It is available from archive.org under the CC-BY-NC 3.0 license, but we’ve created our own version which includes lemmata. (Please note that we didn’t check the lemmatization, and it probably contains errors; it is useful for testing purposes only!)
Get the BlackLab JAR and the required libraries (see above). The libraries should be in a directory called “lib” in the same directory as the BlackLab JAR (or elsewhere on the classpath).
Start the IndexTool without parameters for help information:
java -cp "blacklab.jar:lib/*" nl.inl.blacklab.tools.IndexTool
(this assumes blacklab.jar and the lib subdirectory containing required libraries are located in the current directory; if not, prefix it with the correct directory)
(if you’re on Windows, replace the classpath separator colon (:) with a semicolon (;))
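If you're ever unsure which separator your platform uses, the JVM itself can tell you. This standalone sketch (not part of BlackLab) prints the platform's classpath separator:

```java
// PrintSeparator.java: print the character used to separate classpath entries.
// Prints ":" on Linux/macOS and ";" on Windows.
public class PrintSeparator {
    public static void main(String[] args) {
        System.out.println(java.io.File.pathSeparator);
    }
}
```

This is handy when writing launch scripts that must work on both platforms.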
We want to create a new index, so we need to supply an index directory, input file(s) and an input format:
java -cp "blacklab.jar:lib/*" nl.inl.blacklab.tools.IndexTool create INDEX_DIR INPUT_FILES FORMAT
If you specify a directory as the INPUT_FILES, it will be scanned recursively. You can also specify a file glob (such as *.xml) or a single file. If you specify a .zip or .tar.gz file, BlackLab will automatically index its contents.
For example, if you have TEI data in /tmp/my-tei/ and want to create an index as a subdirectory of the current directory called “test-index”, run the following command:
java -cp "blacklab.jar:lib/*" nl.inl.blacklab.tools.IndexTool create test-index /tmp/my-tei/ tei
Your data is indexed and placed in a new BlackLab index in the “test-index” directory.
Please note that if you’re indexing large files, you should give Java more than the default heap memory, using the -Xmx option. For really large files, and if you have the memory, you could use -Xmx6G, for example.
See also:
BlackLab Core includes a very basic command-based query tool useful for testing and debugging. To query the index you just created using this tool, type:
java -cp "blacklab.jar:lib/*" nl.inl.blacklab.tools.QueryTool test-index
The query tool supports several query languages, but it will start in CorpusQL mode. A few hints:
Type “help” to see a list of commands.
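A few example CorpusQL queries to try (the pos and lemma annotation names are assumptions; they depend on how your data was indexed):

"fox"
[lemma="be"]
[pos="adj.*"] "fox"
"the" []{0,2} "fox"

The first finds the word “fox”; the second finds all forms of “to be”; the third finds an adjective followed by “fox”; the last finds “the” and “fox” at most two tokens apart.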
See also:
BlackLab AutoSearch is our corpus search application. It is easy to install; see the GitHub page for instructions.
Finally, let’s look at an example Java application.
Here’s the basic structure of a BlackLab search application, to give you an idea of where to look in the source code and documentation (note that we leave nl.inl.blacklab out of the package names for brevity):
The above in code:
// Open your index
try (BlackLabIndex index = BlackLab.open(new File("/home/zwets/testindex"))) {
    String corpusQlQuery = " \"the\" [pos=\"adj.*\"] \"brown\" \"fox\" ";

    // Parse your query to get a TextPattern
    TextPattern pattern = CorpusQueryLanguageParser.parse(corpusQlQuery);

    // Execute the TextPattern
    Hits hits = index.find(pattern);

    // Sort the hits by the words to the left of the matched text
    HitProperty sortProperty = new HitPropertyLeftContext(index, index.annotation("word"));
    hits = hits.sort(sortProperty);

    // Limit the results to the ones we want to show now (i.e. the first page)
    Hits window = hits.window(0, 20);

    // Iterate over window and display the hits
    Concordances concs = hits.concordances(ContextSize.get(5));
    for (Hit hit: window) {
        Concordance conc = concs.get(hit);

        // Strip out XML tags for display.
        String left = XmlUtil.xmlToPlainText(conc.left);
        String hitText = XmlUtil.xmlToPlainText(conc.hit);
        String right = XmlUtil.xmlToPlainText(conc.right);

        System.out.printf("%45s[%s]%s\n", left, hitText, right);
    }
}
See also: