Power to the Filters! GraphDB Introduces Improvements to the Connectors in its 10th Edition.

A feature, even if well developed, supported and used, eventually gets deprecated. It is the circle of software life. Fortunately, the best-loved features get redesigned. This has been the fate of GraphDB’s much-lauded connectors.

Let’s take a look at the connectors of GraphDB and see where they stand now and how much ground they covered in their small leap forward.

The Status Quo

The connectors have long been one of the flagship features of GraphDB. They offer a powerful and performant way of synchronizing RDF data into non-RDF stores. The original trinity of connectors were Lucene, Solr and Elasticsearch. They are suited for full-text search, faceted search and aggregations. More recently, there was a newcomer to the group – the Kafka connector, which serializes incoming triples as JSON messages.

Connectors are specialized plugins that offer an update and query mechanism. They index RDF data in secondary stores. All the connectors are built around the same ingestion mechanism – when a document changes, the change is immediately reflected in the connected secondary stores. A “document” in this context is a logical collection of triples, and the change can be as small as a single triple. The incremental design means low latency for update synchronization and minimal need for third-party synchronization solutions. There are many other GraphDB plugins, including the MongoDB Connector. However, MongoDB in particular only offers a query mechanism, which sets it apart from our other connectors.

Not all RDF data should be stored in secondary indexes, which are best kept for their specialized use cases. So, ever since their inception, the connectors of GraphDB have offered a way to select only the relevant data from all the triples in the database. This is called “filtering”.

In GraphDB 9, there used to be a single entity filter, whose capabilities largely concerned the specific value of a given field. It supported:

Comparisons
Boolean logic
Set membership
Regular expressions

Beyond simple filtering, you could:

Filter by the previous element in the chain. To give an example, if you have a “child” field that contains data about a child, you can go up to the “parent” and check if that parent has a specific value – and not index the “child” if the filter fails.
Access additional elements that are not indexed. To give an example, you can have a “child” field and a non-indexed field, called `example:height`. Then, using the construct `?child -> example:height < 100` you could filter by height. This is limited to one predicate and no property chains.
Filter by graph.
Filter by language tag – for example, keep only the values in English and Bulgarian.

Power! Less Limited Power!

There are a number of improvements within GraphDB 10. They give more flexibility and further filtering capabilities. This does come with a small downside – the filters are slightly more involved to configure. You also have to migrate any connectors built in GraphDB 9. The tradeoffs are more than worth it.

Splitting the Entity Filter

The reason for needing to migrate the connectors in GraphDB 10 is the large change to the entity filter. It has been split into four parts. Part of the reason for this split is that the single entity filter in GraphDB 9 was sometimes unclear in its functionality. Sometimes a filter would remove the whole document. Other times, it would affect only a specific field. There are, essentially, two types of filters, applied at two levels.

The value filters – filtering a specific value. This can be done at a specific field level, or at the top level. If this happens at the top level, the entire document is removed. If the value filter is applied at the field level, only this specific field value will be removed.
Notably, the value filter applied at the top level is applied before any fields are generated. Therefore, all that you can filter against is the root object, denoted by $this. This filter allows you to fail fast – if an object is obviously not interesting to us, we can remove it immediately, before we have spent any computational resources on processing it.
Document filters – filtering the whole document. This can be done at the top level, rejecting the whole document, or per nested document, rejecting nested documents. Those filters are applied last, after all fields have been computed and can access data from all of the field values.
Within the nested document, all fields are considered within the context of the nested document. I.e., the field “parent.child.name” is only addressed as “name” from the context of the nested “child” document.
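As a rough sketch of where these four filters sit in a connector definition, the fragment below uses the valueFilter and documentFilter keys that appear in the full creation request later in this post. The choice of fields and the bound(...) expressions are illustrative assumptions, not a recommended configuration, and the quote escaping is written as it would appear inside the SPARQL string literal.

```json
{
  "types": ["http://xmlns.com/foaf/0.1/Person"],
  "valueFilter": "$this -> <http://example.org/compliance/isCompliant> not in (\\\"true\\\"^^xsd:boolean)",
  "documentFilter": "bound(?name)",
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": ["http://xmlns.com/foaf/0.1/name"]
    },
    {
      "fieldName": "expense",
      "propertyChain": ["http://example.org/compliance/expense"],
      "datatype": "native:nested",
      "valueFilter": "$this -> type != <http://example.org/compliance/GovernmentTax>",
      "documentFilter": "bound(?amount)",
      "objectFields": [
        {
          "fieldName": "amount",
          "propertyChain": ["http://example.org/compliance/amount"]
        }
      ]
    }
  ]
}
```

Reading it in evaluation order: the top-level valueFilter only sees $this and can drop the whole person before any field is computed. The valueFilter on expense drops individual expense values. The nested documentFilter addresses nested fields by their short names (amount, not expense.amount). The top-level documentFilter runs last, with access to all field values.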

Two-Variable Filtering

Very often, you would run into situations where you want to compare two fields. Assume a financial compliance scenario. For tax audit reasons, maybe, you want to filter people whose ?netIncome < ?netExpenditure. This is applied to the top-level document filter. Only people with dubious financial balance would be further evaluated for tax evasion.

Previously, in GraphDB 9, this was technically possible by treating the second variable you want to compare against as an additional element beyond the chain of the root field. However, this capability was bound by the simple path restriction – you can’t follow multiple steps or apply alternative paths.

For example, parent(?netIncome) -> urn:netExpenditure > $this would work. However, parent(?netIncome) -> (urn:expenditure | urn:net) > $this would not! In GraphDB 10, this is as easy as declaring $this < ?netExpenditure at the level of the ?netIncome field.
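A minimal sketch of that GraphDB 10 declaration could look as follows. The urn:netIncome predicate is a hypothetical counterpart to the urn:netExpenditure predicate used above, and the fragment leaves out every other connector option.

```json
{
  "fields": [
    {
      "fieldName": "netIncome",
      "propertyChain": ["urn:netIncome"],
      "valueFilter": "$this < ?netExpenditure"
    },
    {
      "fieldName": "netExpenditure",
      "propertyChain": ["urn:netExpenditure"]
    }
  ]
}
```

Declared at the field level like this, the filter only removes the netIncome value. To keep or drop the whole person instead, the same comparison would go into the top-level document filter, written as ?netIncome < ?netExpenditure.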

Dependencies are resolved smartly by reordering the fields. To continue our financial example, consider a value filter. For example, in the expensivePurchases field, we only want to index purchases greater than monthlyIncome. This means that all filters on monthlyIncome will be applied first.

Circular dependencies are accounted for – if you define a circular dependency, an error will be thrown informing you that you have defined an invalid filter. This also applies to cycles that span multiple steps.
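To make the circular case concrete, here is a deliberately invalid sketch. The field names come from the example above, the predicates are hypothetical, and the second filter is added only to create the cycle: expensivePurchases waits on monthlyIncome while monthlyIncome waits on expensivePurchases, so no valid field order exists.

```json
{
  "fields": [
    {
      "fieldName": "expensivePurchases",
      "propertyChain": ["urn:purchaseAmount"],
      "valueFilter": "$this > ?monthlyIncome"
    },
    {
      "fieldName": "monthlyIncome",
      "propertyChain": ["urn:monthlyIncome"],
      "valueFilter": "$this < ?expensivePurchases"
    }
  ]
}
```

Removing the valueFilter from monthlyIncome breaks the cycle and leaves the valid configuration described earlier, in which monthlyIncome is resolved before expensivePurchases.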

New Filter Capabilities

The new version isn’t only about making filters more flexible and predictable. There are other changes as well.

First, there is now a direct function isExplicit(?field). This is a shorthand for the previous approach to doing this, graph(?field) not in (<http://www.ontotext.com/implicit>). The new construct is shorter and easier to understand.

Second, there is the ALL() quantifier. By default, a document passes a filter if at least one of its values matches the declared condition. For example, we may have ?nationality = . Then, a document may pass if the nationality is both German and British. Using ALL(?nationality) = would make the filter stricter, since every value then has to match. Previously, this couldn’t really be achieved.

Putting it All Together

Those are all new capabilities of the connectors in GraphDB 10. But what more can we do compared to GraphDB 9? To illustrate, we can create a small example.

Fiscal Compliance

Suppose we have a simple RDF database of people and their purchases. We want to present a view of their activities, including their purchases’ time and location. This would require faceted searches. Our experts are skilled in working with Kibana, and it offers ready tooling for time series analysis and mapping. Sounds like a good fit for the Elasticsearch connector.

To start with, we have two basic customers in our database, Dudley and Snidely. They are both instances of the foaf:Person class. This calls for a type configuration in our new connector.

Dudley has already been audited. This has been reflected in the database and, therefore, he has nothing to worry about. Snidely, however, is yet to be checked. Dudley has done right and can be filtered out at step one by our top level value filter.

Note that you can usually do this with != rather than not in. However, as isCompliant is simply not bound for Snidely, we would need to check for either non-equality or an unbound value. The not in operator serves as a convenient shorthand in this case.
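In the creation request listed under Reference Data, the type configuration and this top-level value filter are expressed as the two options below. They are shown in isolation here, so the fragment is not a complete connector definition, and the quote escaping is the one used inside the SPARQL string literal.

```json
{
  "types": [
    "http://xmlns.com/foaf/0.1/Person"
  ],
  "valueFilter": "$this -> <http://example.org/compliance/isCompliant> not in (\\\"true\\\"^^xsd:boolean)"
}
```

Dudley, whose isCompliant value is true, is rejected here before any of his fields are computed. Snidely has no isCompliant value at all and therefore passes.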

Now, we are interested in some basic information for all of the (not yet compliant!) people in our database such as their name and their monthly and yearly income. We would store all those values, without filtering.

Notice how we use the full IRI for the Property chain attribute. The connectors are not namespace aware, so you would have to use the full IRI. We do the same for income and monthly income.
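For reference, these three unfiltered fields correspond to the following part of the creation request listed under Reference Data. It is condensed here to one field per line, and every property chain carries the full IRI.

```json
{
  "fields": [
    { "fieldName": "name", "propertyChain": ["http://xmlns.com/foaf/0.1/name"], "multivalued": false },
    { "fieldName": "income", "propertyChain": ["http://example.org/compliance/income"], "multivalued": false },
    { "fieldName": "monthlyIncome", "propertyChain": ["http://example.org/compliance/monthlyIncome"], "multivalued": false }
  ]
}
```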

Now, for the key part of the exercise, the listing of all suspicious spending. Each person has numerous purchases and other expenses. They are all RDF objects associated with a specific person. Each of these objects contains a price. Proper purchases also contain a location and time. The structure only has two layers, but that’s still quite enough to complicate things. However, nested objects can’t be declared via the UI. So, we would switch to using JSON embedded in a SPARQL connector creation request.

To begin with, nested expense records can be obtained as compliance:expense. This field needs to have the native:nested datatype (which corresponds to Elasticsearch’s nested field type).

We would also apply a top-level value filter. We don’t want to index instances of compliance:GovernmentTax. This value filter applies to the nested purchase documents. We can also handle this check as a document filter. However, the value filter gets evaluated first and rejects the document before computing the fields. Putting this check in such a filter would result in better performance.

When declaring the nested fields, note that the location field is a geo point, since Elasticsearch supports this datatype out of the box.
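Taken together, the nested field, its value filter and its object fields look roughly like this. The content is lifted from the creation request listed under Reference Data and condensed, with the document filter left out.

```json
{
  "fieldName": "suspiciousExpense",
  "propertyChain": ["http://example.org/compliance/expense"],
  "datatype": "native:nested",
  "valueFilter": "$this -> type != <http://example.org/compliance/GovernmentTax>",
  "objectFields": [
    { "fieldName": "id", "propertyChain": ["$self"] },
    { "fieldName": "amount", "propertyChain": ["http://example.org/compliance/amount"] },
    { "fieldName": "date", "propertyChain": ["http://example.org/compliance/date"] },
    {
      "fieldName": "location",
      "propertyChain": ["http://example.org/compliance/location"],
      "datatype": "native:geo_point"
    }
  ]
}
```

The valueFilter is evaluated against each expense object ($this) before its id, amount, date and location are computed, which is where the performance benefit comes from.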

Finally, once we have prepared the whole nested object, we would combine a few of our new features. We want to apply a document filter to the nested purchase object and filter out small purchases and old purchases. If the purchase doesn’t take more than a monthly salary, filter it out. This would involve the root-level field monthlyIncome, using the two-variable filtering capability. Also, filter out purchases which were made in 2021 or earlier. Do not perform the date check if no date is given.

Note that we use the $outer keyword to access the scope of the root level. You can chain $outer keywords if you have deeply nested objects.
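The document filter for the nested object, as it appears in the creation request listed under Reference Data, is shown below on its own. The propertyChain and objectFields entries are omitted for brevity, so this is a fragment rather than a complete field definition.

```json
{
  "fieldName": "suspiciousExpense",
  "datatype": "native:nested",
  "documentFilter": "?amount > $outer.monthlyIncome and (?date >= \\\"2022-01-01\\\"^^xsd:date || !bound(?date))"
}
```

The first condition compares the nested amount with the root-level monthlyIncome. The second keeps expenses dated 2022-01-01 or later, as well as those with no date at all. The \\\" sequences are there because the JSON sits inside a SPARQL string literal: SPARQL turns each one into \", which JSON then reads as a plain quote. With the sample data, where monthlyIncome is 1000, purchase1 (5000, dated 2022-02-22) passes, purchase2 (dated 2021-02-22) fails the date check and purchase3 (amount 20) fails the amount check.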

In the end, we have a clean index containing all of Snidely’s suspicious purchases and, where applicable, their times and locations.

Reference Data

One major feature of experiments is repeatability. If you want to follow along with our example, you can use our sample data and connector creation commands.

If we were to ask Elasticsearch about the contents of the compliance index, we would get the following JSON data:

{
    "income": "12000",
    "name": "Snidely",
    "suspiciousExpense": [
        {
            "date": "2022-02-22",
            "amount": "5000",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase1"
        },
        {
            "amount": "1200",
            "id": "http://example.org/compliance/expense1"
        }
    ],
    "expense": [
        {
            "date": "2022-02-22",
            "amount": "5000",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase1"
        },
        {
            "date": "2021-02-22",
            "amount": "4500",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase2"
        },
        {
            "date": "2022-03-22",
            "amount": "20",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase3"
        },
        {
            "amount": "1200",
            "id": "http://example.org/compliance/expense1"
        },
        {
            "amount": "7600",
            "id": "http://example.org/compliance/expense2"
        }
    ],
    "monthlyIncome": "1000"
}

And the sample connector creation request:

PREFIX :<http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst:<http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
    inst:compliance :createConnector '''
{
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": [
        "http://xmlns.com/foaf/0.1/name"
      ],
      "multivalued": false
    },
    {
      "fieldName": "income",
      "propertyChain": [
        "http://example.org/compliance/income"
      ],
      "multivalued": false
    },
    {
      "fieldName": "monthlyIncome",
      "propertyChain": [
        "http://example.org/compliance/monthlyIncome"
      ],
      "multivalued": false
    },
    {
      "fieldName": "suspiciousExpense",
      "propertyChain": [
        "http://example.org/compliance/expense"
      ],
      "datatype": "native:nested",
      "objectFields": [
        {
          "fieldName": "id",
          "propertyChain": [
            "$self"
          ]
        },
        {
          "fieldName": "amount",
          "propertyChain": [
            "http://example.org/compliance/amount"
          ]
        },
        {
          "fieldName": "date",
          "propertyChain": [
            "http://example.org/compliance/date"
          ]
        },
        {
          "fieldName": "location",
          "propertyChain": [
            "http://example.org/compliance/location"
          ],
          "datatype": "native:geo_point"
        }
      ],
      "documentFilter": "?amount > $outer.monthlyIncome and (?date >= \\\"2022-01-01\\\"^^xsd:date || !bound(?date))",
      "valueFilter": "$this -> type != <http://example.org/compliance/GovernmentTax>"
    },
    {
      "fieldName": "expense",
      "propertyChain": [
        "http://example.org/compliance/expense"
      ],
      "datatype": "native:nested",
      "objectFields": [
        {
          "fieldName": "id",
          "propertyChain": [
            "$self"
          ]
        },
        {
          "fieldName": "amount",
          "propertyChain": [
            "http://example.org/compliance/amount"
          ]
        },
        {
          "fieldName": "date",
          "propertyChain": [
            "http://example.org/compliance/date"
          ]
        },
        {
          "fieldName": "location",
          "propertyChain": [
            "http://example.org/compliance/location"
          ],
          "datatype": "native:geo_point"
        }
      ]
    }
  ],
  "languages": [],
  "types": [
    "http://xmlns.com/foaf/0.1/Person"
  ],
  "valueFilter": "$this -> <http://example.org/compliance/isCompliant> not in (\\\"true\\\"^^xsd:boolean)",
  "readonly": false,
  "detectFields": false,
  "importGraph": false,
  "skipInitialIndexing": false,
  "elasticsearchNode": "http://localhost:9200",
  "elasticsearchClusterSniff": true,
  "manageIndex": true,
  "manageMapping": true,
  "bulkUpdateBatchSize": 5000,
  "bulkUpdateRequestSize": 5242880
}
''' .
}

What’s Next?

Now that we have some data in Elasticsearch, how do we visualize it and integrate it with our other systems? That’s beyond the scope of this blog post, unfortunately. However, we already have blog posts on creating knowledge graphs, including visualizations. Or, perhaps, you would like to view our Elasticsearch-based demonstrator, the Transparency Energy Knowledge Graph? If you are more keen on learning by doing, you can start out with Lucene, which is packaged with each edition of GraphDB, including Free and Standard. If Elasticsearch, Solr or Kafka are your target secondary indexes, get in touch – trial and educational licenses are also available.