@geocat@fosstodon.org

In a «conceptual friday» session of the «PILOD.nl» project we examined options and obstacles for exposing Dutch geospatial data on the Semantic Web. A number of technical and philosophical matters were discussed and tested:

  • Will semantic technology provide an answer to an obvious but difficult geo question: «What data do you store about objects around me?»
  • Mapping of iso19139 to DCAT
  • Load GeoNetwork metadata in Virtuoso and query it with SPARQL
  • Which paths, options and obstacles to get to the ideal LOD situation
  • Any query with a spatial filter should contain a location but also an indication of scale
  • DCAT as facilitator to exchange metadata with Open Data Catalogues

What data do you store about the area around me?

Governments collect a lot of data and expose it as open data. However, for a citizen it’s hard to find relevant data (data about my house, for example). The data is scattered over hundreds of datasets. One would have to query each dataset to find out whether it has data related to my street, house or garden. Semantic technology could facilitate such a search by ingesting all data into one big index that can be queried with a single query. For the Netherlands there are currently 8000 spatial datasets, ranging from kilobytes to terabytes. Having them all in a single triple store and filtering them on relevant geometry will (with current technology) not be as responsive as users expect. A Solr instance on top of the triple store is probably a good technology for this use case.
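To make the Solr idea a bit more concrete, here is a minimal sketch of what an «around me» request could look like. It only builds the query parameters; nothing like this was set up in the session. The field name `geom` and the coordinates are assumptions for illustration. The `geofilt` filter itself is standard Solr spatial syntax, with the distance `d` given in kilometres.

```python
from urllib.parse import urlencode


def solr_nearby_params(lat, lon, radius_km, field="geom", rows=20):
    """Build Solr parameters for documents within radius_km of a point.

    `field` is a hypothetical spatial field holding each object's location.
    """
    return {
        "q": "*:*",
        # geofilt keeps documents whose spatial field lies within d km of pt
        "fq": f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}",
        "rows": rows,
        "wt": "json",
    }


# Everything within 500 m of an (illustrative) location
params = solr_nearby_params(52.3702, 5.2141, 0.5)
query_string = urlencode(params)
```

The resulting `query_string` would be appended to the `/select` handler of whatever Solr core holds the index.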

Mapping of iso19139 to DCAT and vice versa

In recent years we’ve seen quite a few initiatives to map iso19139 metadata to RDF with the DCAT ontology. This mapping is quite a challenge, since some concepts don’t fully map. One challenge, for example, is the required codelist values. In ISO these codelists are stored in XML, whereas DCAT expects them as URIs in SKOS/RDF format. Another aspect is that DCAT-AP (the European profile of DCAT) requires different thesauri (EuroVoc) than the ISO codelists. There is a lot to discuss about such mappings (some might even say you’d have to start your DCAT document from scratch). DCAT certainly doesn’t provide all the attributes that we as spatial people require. However, the semantic web allows us to add new predicates from other ontologies to the dataset description.

Load GeoNetwork metadata in Virtuoso and query it with SPARQL

We used the existing DCAT conversion of GeoNetwork to expose iso19139 metadata as DCAT (a new mapping based on INSPIRE 2 DCAT-AP is being worked on). The way GeoNetwork exposes DCAT data, via a searchable index page referring to each dataset description individually, cannot easily be crawled by Virtuoso. Virtuoso expects a single big file with all datasets concatenated. I wrote a little script that generates such a file from the GeoNetwork output. This file was then imported into Virtuoso using the Virtuoso «Quad store upload». With the (meta)data in Virtuoso we could query it using simple SPARQL queries, like:

select * where
{graph <http://almere.pilod.nl/ngr-test/>
{ ?s ?p ?o }
}

and

select * where
{graph <http://almere.pilod.nl/ngr-test/>
{ ?s ?p "1E62BE97-121D-4BF6-82C7-0E3316441E58".
?s ?p2 ?o2 }
}

Run these queries at http://almere.pilod.nl/sparql to see a list of datasets and all details of a single dataset. Note that a lot of links are non-functional since this is a test setup. Today we were unfortunately not successful in importing (or querying) the spatial property of DCAT, which in our case contains a GeoSPARQL polygon. We’ll keep that for the next session.
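For readers who want to reproduce the concatenation step mentioned earlier in this section, here is a minimal sketch. It is not the script used in the session. It assumes each dataset description is available as an RDF/XML string with an `rdf:RDF` root element, and the sample records are made up. Blank node identifiers and `xml:base` attributes are not reconciled, so a real crawl might need more care.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def merge_rdf_documents(documents):
    """Merge several RDF/XML strings into a single rdf:RDF document."""
    ET.register_namespace("rdf", RDF_NS)
    merged = ET.Element(f"{{{RDF_NS}}}RDF")
    for text in documents:
        root = ET.fromstring(text)
        if root.tag != f"{{{RDF_NS}}}RDF":
            raise ValueError("expected an rdf:RDF root element")
        # Copy every top-level description into the merged document
        merged.extend(list(root))
    return ET.tostring(merged, encoding="unicode")


# Two made-up dataset descriptions standing in for the GeoNetwork output
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://example.org/dataset/{n}"/>
</rdf:RDF>"""

merged_xml = merge_rdf_documents([SAMPLE.format(n=1), SAMPLE.format(n=2)])
```

The merged string can be written to a file and uploaded in one go.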

Paths, options and obstacles to get to an ideal LOD situation

We acknowledge that in an ideal situation current data providers set up SPARQL endpoints on the databases that they now expose as WMS/WFS. However, it’s interesting to imagine some small actions that providers (or facilitators) can take so that end users can profit from semantic technology using existing services. Data providers can use the experience gained to prepare a next step in data publication.

In such a scenario we could imagine a big triple store ingesting data from the existing WFS services. But how can you define an ontology for an existing WFS? Some WFS services, like those of INSPIRE and the basisregistraties, have a harmonised, well-described data model (XML schema). Others that don’t have such a schema might profit from feature-catalogue metadata (iso19110) made available in GeoNetwork and linked to the dataset (metadata). Most WFS services can encode their response as GeoJSON. If a process (Virtuoso Sponger?) can manage to add a JSON-LD header, such a file could easily be imported into a triple store.
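As a rough illustration of that last idea, the sketch below adds a JSON-LD `@context` to a GeoJSON response. It reads the «header» as a JSON-LD context, which is an interpretation. The inline context, the vocabulary namespace and the sample feature are assumptions for illustration. Nested coordinate arrays (lines, polygons) may need extra handling depending on the JSON-LD processor.

```python
import json

# Assumption: vocabulary namespace used for the GeoJSON terms
GEOJSON_VOCAB = "https://purl.org/geojson/vocab#"

# Assumption: a minimal inline context, not an official one
DEFAULT_CONTEXT = {
    "geojson": GEOJSON_VOCAB,
    "type": "@type",
    "id": "@id",
    "Feature": "geojson:Feature",
    "FeatureCollection": "geojson:FeatureCollection",
    "Point": "geojson:Point",
    "features": {"@id": "geojson:features", "@container": "@set"},
    "geometry": "geojson:geometry",
    "properties": "geojson:properties",
    "coordinates": {"@id": "geojson:coordinates", "@container": "@list"},
}


def add_jsonld_context(feature_collection, context=None):
    """Return a copy of a GeoJSON dict with a JSON-LD @context added."""
    doc = {"@context": context or DEFAULT_CONTEXT}
    doc.update(feature_collection)
    return doc


# Made-up WFS response with a single point feature
wfs_response = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "http://example.org/id/pand/1",
            "geometry": {"type": "Point", "coordinates": [5.2141, 52.3702]},
            "properties": {"name": "example"},
        }
    ],
}

jsonld_text = json.dumps(add_jsonld_context(wfs_response), indent=2)
```

The properties of each feature still need their own terms in the context before they become meaningful triples. That is where the XML schema or the iso19110 feature catalogue would come in.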

A challenge in this scenario is the definition of URIs for objects. A record in a WFS typically doesn’t have a persistent identifier in the form of a URI. INSPIRE tried to facilitate this by introducing the concept of an INSPIRE-ID. A temporary solution would be to have the crawler that ingests the WFS data create a URI during the crawl process.
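A crawler could mint such temporary URIs along the following lines. This is only a sketch: the base URI, the feature type name and the feature layout (a GeoJSON-like dict) are assumptions. When a feature carries no identifier at all, the sketch falls back to a hash of its content, which stays stable only as long as the content does not change.

```python
import hashlib
import json
from urllib.parse import quote


def mint_feature_uri(base_uri, type_name, feature):
    """Create a URI for a WFS feature during the crawl.

    Uses the feature's own id when present, otherwise a content hash.
    """
    local_id = feature.get("id")
    if local_id is None:
        # Canonical JSON so the same content always yields the same hash
        canonical = json.dumps(feature, sort_keys=True, separators=(",", ":"))
        local_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    type_part = quote(type_name, safe="")
    id_part = quote(str(local_id), safe="")
    return f"{base_uri.rstrip('/')}/{type_part}/{id_part}"


# Hypothetical base URI and feature type
uri = mint_feature_uri("http://example.org/id/", "bag:pand", {"id": "pand.123"})
```

Because ids handed out by a WFS are not guaranteed to be persistent, URIs minted this way should be treated as provisional until the provider publishes its own.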

GeoCat also worked on a different approach to make existing data more widely available. In that project a Python process converts all datasets (mostly files like ESRI Shape/FGDB and Microsoft Excel/Access) that are attached to metadata records in GeoNetwork to open formats like CSV and GeoJSON. The process places them in an Apache Web Accessible Folder and updates the links in the originating metadata. The folders are linked using RDFa and a sitemap, for optimal spidering. This approach leaves the actual spidering/crawling to third parties, but exposes the data so it can be optimally discovered.

Any spatial filter should have an indication of scale

When one queries a big repository of geometries, an important success factor is choosing a relevant scale for that query. Some examples to clarify:

  • A query by the name of a city can be answered with a point if the user is locating the city on a national scale. However, if the user is in or close to the city, you probably want to return the city border.
  • A representation of a river can consist of 1000 fragments all sharing the name of that river. If a user queries by name, each of these individual segments might not be relevant.
  • If I’m looking for a nearby pizzeria, there is no need to return the country and state that I’m in.
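The city and pizzeria examples can be sketched as a simple rule that compares the size of an object with the size of the area the user is looking at. The function name, the factor of 10 and the sizes in metres are all hypothetical. The point is only that the query carries a scale next to a location.

```python
def representation_for_scale(extent_m, window_m, ratio=10.0):
    """Decide how to return an object, given the scale of the query.

    extent_m: approximate size of the object in metres
    window_m: approximate size of the user's area of interest in metres
    Returns "point", "geometry", or None when the object should be omitted.
    """
    if extent_m > window_m * ratio:
        # Far larger than the area of interest (the country around a pizzeria)
        return None
    if extent_m * ratio < window_m:
        # Tiny compared to the area of interest (a city on a national map)
        return "point"
    # Comparable in size: the full geometry (the city border) is useful
    return "geometry"


city_on_national_map = representation_for_scale(10_000, 300_000)
city_from_nearby = representation_for_scale(10_000, 5_000)
country_for_pizzeria = representation_for_scale(300_000, 1_000)
```

The river example would need an extra step, such as grouping segments by name before applying the rule.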

Exchanging dataset metadata with open data catalogues using DCAT

In the open data community DCAT is a popular format for dataset metadata. By exposing iso19139 as DCAT, catalogue products in that area can easily ingest metadata from the spatial community. Some of the open data catalogue products also provide importers for iso19139 metadata. However, if you follow that approach you have no control over how iso19139 is mapped to DCAT.

Conclusion

We had a fruitful session at Geonovum. Thanks all for participating. Some suggested that this initiative could grow into a subgroup «Linked GeoSpatial Data» of the PILOD project. We will share our experiences with the OGC+W3C working group on linked geospatial data, the EU Geospatial profile for DCAT-AP and the ‘Laan van de Leefomgeving’, who are dealing with similar challenges.