public interface DocIndexerFactory
A factory for creating DocIndexer instances. Through this
factory it is possible to register new "formats" with BlackLab. A format
is essentially an implementation of DocIndexer that supports indexing a
specific type of file (for example plain text or TEI).
If you have created a custom implementation of DocIndexer, for example to
index a specific dialect of the TEI format, you can make BlackLab aware of
that class by registering, with the DocumentFormats class, a new
DocIndexerFactory capable of creating that DocIndexer. The factory must then
expose the new format through isSupported(String) and getFormats(), and
construct and configure an appropriate DocIndexer when get() is called for
the format's id. BlackLab will then use the factory to create fitting
DocIndexers whenever it is asked to index files of that format (as specified
by the user).
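
The contract described above can be sketched as follows. This is a minimal, self-contained illustration and not BlackLab code: `SketchDocIndexer`, `SketchFactory`, `MyTeiFactory` and the format id `my-tei` are hypothetical stand-ins. The real DocIndexerFactory additionally declares init(), getFormat(String), formatError(String) and get overloads that take a DocWriter.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for DocIndexer; NOT the BlackLab class.
interface SketchDocIndexer {
    String describe();
}

// Hypothetical, reduced version of the factory contract (assumption:
// only the three methods needed for this illustration are modelled).
interface SketchFactory {
    boolean isSupported(String formatIdentifier);

    List<String> getFormats();

    SketchDocIndexer get(String formatIdentifier, String documentName, Reader reader);
}

// Factory exposing one custom TEI dialect under the made-up id "my-tei".
class MyTeiFactory implements SketchFactory {
    static final String FORMAT_ID = "my-tei";

    @Override
    public boolean isSupported(String formatIdentifier) {
        return FORMAT_ID.equals(formatIdentifier);
    }

    @Override
    public List<String> getFormats() {
        return Collections.singletonList(FORMAT_ID);
    }

    @Override
    public SketchDocIndexer get(String formatIdentifier, String documentName, Reader reader) {
        // Reject ids this factory does not expose, as the documented contract requires.
        if (!isSupported(formatIdentifier)) {
            throw new UnsupportedOperationException("Unsupported format: " + formatIdentifier);
        }
        // A real factory would construct and configure its DocIndexer here.
        return () -> "my-tei indexer for " + documentName;
    }
}

class MyTeiFactoryDemo {
    public static void main(String[] args) {
        SketchFactory factory = new MyTeiFactory();
        SketchDocIndexer indexer =
                factory.get("my-tei", "doc1.xml", new StringReader("<TEI/>"));
        System.out.println(indexer.describe());
    }
}
```

Callers are expected to check isSupported before calling get; the sketch throws UnsupportedOperationException otherwise, mirroring the documented behaviour of the get methods.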
How formatIdentifiers map to actual DocIndexer implementations is up to the
factory. It is possible to map multiple formatIdentifiers to the same
DocIndexer, or vice versa; this depends on the implementation of the factory
and its associated DocIndexer(s).
DocIndexerFactoryConfig, for example, makes use of this: only a few actual
DocIndexer classes are used, but each of them can handle an arbitrary number
of external configuration files, and the factory exposes each of those
configuration files with its own unique formatIdentifier.
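
A many-to-one mapping of this kind might look like the sketch below. It is an illustration only and does not reproduce DocIndexerFactoryConfig: `ConfigDrivenIndexer`, `ConfigBackedFactory`, `addConfig` and the format ids and file names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a single indexer class whose behaviour is
// determined entirely by the configuration file it is given.
class ConfigDrivenIndexer {
    final String configFile;

    ConfigDrivenIndexer(String configFile) {
        this.configFile = configFile;
    }
}

// Hypothetical factory: many formatIdentifiers, one indexer class.
class ConfigBackedFactory {
    // formatIdentifier -> configuration file name (made-up values)
    private final Map<String, String> configs = new LinkedHashMap<>();

    void addConfig(String formatIdentifier, String configFile) {
        // Identifiers are stored lowercased, matching how they are looked up.
        configs.put(formatIdentifier.toLowerCase(Locale.ROOT), configFile);
    }

    boolean isSupported(String formatIdentifier) {
        return configs.containsKey(formatIdentifier);
    }

    List<String> getFormats() {
        return new ArrayList<>(configs.keySet());
    }

    ConfigDrivenIndexer get(String formatIdentifier) {
        String configFile = configs.get(formatIdentifier);
        if (configFile == null) {
            throw new UnsupportedOperationException("Unsupported format: " + formatIdentifier);
        }
        // Every format id yields the same class, configured differently.
        return new ConfigDrivenIndexer(configFile);
    }
}

class ConfigBackedFactoryDemo {
    public static void main(String[] args) {
        ConfigBackedFactory factory = new ConfigBackedFactory();
        factory.addConfig("tei-custom", "tei-custom.yaml");
        factory.addConfig("Folia-Custom", "folia-custom.yaml");
        System.out.println(factory.getFormats());
        System.out.println(factory.get("folia-custom").configFile);
    }
}
```

Adding a format here means adding a map entry rather than a new class, which is the property the paragraph above describes.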
Modifier and Type | Interface and Description |
---|---|
static class | DocIndexerFactory.Format: Description of a supported input format |
Modifier and Type | Method and Description |
---|---|
String | formatError(String formatIdentifier): If this format exists but has an error, return the error. |
DocIndexer | get(String formatIdentifier, DocWriter indexer, String documentName, byte[] b, Charset cs): Instantiates a DocIndexer from a byte array. |
DocIndexer | get(String formatIdentifier, DocWriter indexer, String documentName, File f, Charset cs): Instantiates a DocIndexer from a file. |
DocIndexer | get(String formatIdentifier, DocWriter indexer, String documentName, InputStream is, Charset cs): Instantiates a DocIndexer from an input stream. |
DocIndexer | get(String formatIdentifier, DocWriter indexer, String documentName, Reader reader): Instantiates a DocIndexer from a reader. |
DocIndexerFactory.Format | getFormat(String formatIdentifier): Get the full format from its identifier. |
List<DocIndexerFactory.Format> | getFormats(): Return all formats supported by this factory. |
void | init(): Do not call manually; this is called when the factory is added to the DocumentFormats registry (DocumentFormats.registerFactory(DocIndexerFactory)). |
boolean | isSupported(String formatIdentifier): Whether this factory can instantiate a DocIndexer for this format. |
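
The four get variants differ only in how the input is supplied. One plausible way for an implementation to handle this is to funnel byte array, file and stream input into a Reader, as sketched below. The source does not say that BlackLab's factories work this way, and `InputSources` and its methods are hypothetical helper names. Note also that the sketch always applies `cs`, whereas the documentation describes it as a default that is used only when no character set is defined.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts each supported input type to a Reader.
class InputSources {
    static Reader fromStream(InputStream is, Charset cs) {
        return new InputStreamReader(is, cs);
    }

    static Reader fromBytes(byte[] b, Charset cs) {
        return fromStream(new ByteArrayInputStream(b), cs);
    }

    // FileInputStream throws FileNotFoundException if the file cannot be
    // opened, matching the exception declared on the File-based get method.
    static Reader fromFile(File f, Charset cs) throws FileNotFoundException {
        return fromStream(new FileInputStream(f), cs);
    }

    // Reads the whole Reader into a String; used here only for the demo.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        try {
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}

class InputSourcesDemo {
    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        Reader reader = InputSources.fromBytes(data, StandardCharsets.UTF_8);
        System.out.println(InputSources.readAll(reader));
    }
}
```

With a single Reader-based code path, the format check and DocIndexer construction need to be written only once.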
void init()
Do not call manually; this is called when the factory is added to the DocumentFormats registry (DocumentFormats.registerFactory(DocIndexerFactory)).

boolean isSupported(String formatIdentifier)
Parameters:
formatIdentifier - lowercased and never null or empty string

List<DocIndexerFactory.Format> getFormats()

DocIndexerFactory.Format getFormat(String formatIdentifier)
Parameters:
formatIdentifier

DocIndexer get(String formatIdentifier, DocWriter indexer, String documentName, Reader reader) throws UnsupportedOperationException
Parameters:
formatIdentifier - the formatIdentifier for the document
indexer - indexer object
documentName - name of the unit we're indexing
reader - text to index
Throws:
UnsupportedOperationException - if called with an unsupported formatIdentifier (use isSupported(String))

DocIndexer get(String formatIdentifier, DocWriter indexer, String documentName, InputStream is, Charset cs) throws UnsupportedOperationException
Parameters:
formatIdentifier - the formatIdentifier for the document
indexer - indexer object
documentName - name of the unit we're indexing
is - data to index
cs - default character set if not defined
Throws:
UnsupportedOperationException - if called with an unsupported formatIdentifier (use isSupported(String))

DocIndexer get(String formatIdentifier, DocWriter indexer, String documentName, File f, Charset cs) throws UnsupportedOperationException, FileNotFoundException
Parameters:
formatIdentifier - the formatIdentifier for the document
indexer - indexer object
documentName - name of the unit we're indexing
f - file to index
cs - default character set if not defined
Throws:
FileNotFoundException - if file doesn't exist
UnsupportedOperationException - if called with an unsupported formatIdentifier (use isSupported(String))

DocIndexer get(String formatIdentifier, DocWriter indexer, String documentName, byte[] b, Charset cs) throws UnsupportedOperationException
Parameters:
formatIdentifier - the formatIdentifier for the document
indexer - indexer object
documentName - name of the unit we're indexing
b - data to index
cs - default character set if not defined
Throws:
UnsupportedOperationException - if called with an unsupported formatIdentifier (use isSupported(String))

Copyright © 2020 Instituut voor Nederlandse Taal (INT). All rights reserved.