# Example XPaths
When using Saxon you have extensive possibilities using XPath in BlackLab configuration. Some noteworthy examples are shown below.
To learn more about modern XPath, Altova's XPath 3 training (opens new window) is a good resource, and there are many others.
# Capture punctuation between words
To capture text content between <w/>
tags:
punctPath: .//text()[not(ancestor::w)]
This captures any text node that is not a descendant of a <w/>
tag.
Another possible approach:
punctPath: .//text()[.!='' and preceding-sibling::tei:w]|.//tei:pc |.//tei:lb
This captures non-empty text nodes after a <w/>
tag plus (the text contents of) pc
or lb
tags.
# Isolate a part of speech feature
Your data may have part of speech information that includes detailed features. Let's say this information is stored in an attribute with values like UPosTag=PRON|Case=Nom|Person=3|PronType=Prs
. You can isolate the value of the Case
feature like this:
valuePath: replace(./@msd, '.*Case=([A-Za-z0-9]+).*', '$1')
# Use default if value is missing
If some of your words have a lemma attribute, and you want to index the value _UNKNOWN_
if it's missing (perhaps to be able to locate these data problems easily), you can do that as follows:
valuePath: ./(string(@lemma), '_UNKNOWN_')[1]
# Using either an attribute, or a standoff annotation
Again, let's say some of your words have lemma
attributes. But some have the lemma in a separate tei:join
element instead. You might use XPath to look up the appropriate value like this:
valuePath: >-
let $xid := @xml:id
return if (@lemma) then @lemma
else if ($xid) then
following-sibling::tei:join[@lemma][matches(@target,'#'||$xid||'( |$)')]/@lemma
else ()
Note how we can easily split XPath expressions over multiple lines using >-
in YAML. (see YAML multiline strings (opens new window)).
# For loops
You can even use for
loops if necessary, e.g.:
for $w in //tei:w[@xml:id]
return let $xid := $w/@xml:id
return
if ($w/@lemma) then
$w/@lemma else
if ($xid) then
let $join := $w/following-sibling::tei:join[@lemma][matches(@target,concat('#',$xid,'( |$)'))]
return
$join/@lemma else
()
Thanks to @eduarddrenth (opens new window) for the initial Saxon version and some of the examples.