At the last INSPIRE conference one of the hot topics was the discoverability of INSPIRE datasets in search engines. Google had just released its dataset search engine. Some datasets were findable via that engine (via the European open data portal), but certainly not all. On 3-4 July, JRC is hosting a workshop on the discoverability of datasets in search engines to continue the conversation. To prepare for that event, GeoCat software developer Paul van Genuchten gives an update on the discoverability of GeoNetwork records by search engines.
‘These days GeoNetwork records are still hardly harvested by search engines. Some of the problems behind this can be addressed quite easily. The GeoNetwork application relies heavily on AJAX to retrieve partial content updates, and search engines unfortunately cannot index this AJAX content. To improve this, the latest versions (3.0+) have an option to implement an HTML formatter that creates an HTML page for each metadata record. Formatter output is cached on disk, so the response time is low (after initial creation). This matters because search engines use response time in their ranking algorithms.
The main challenge is to make the search engines aware of the fact that these record-pages exist. This is achieved by registering the sitemap (/geonetwork/srv/eng/portal.sitemap) in the search engine admin console (Bing, Yandex, Google and others have this option).
To register your sitemap on these search engines, the engines require you to place a verification HTML file in the root of your domain. This file can also be placed in a subfolder, but in that case the console offers a bit less functionality.
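Before registering the sitemap, it can help to check that it actually lists the record pages you expect. The following is a minimal Python sketch that extracts every location from a sitemap document using the standard sitemaps.org namespace. The host name catalog.example.org and the record URLs in the sample are hypothetical placeholders, not actual GeoNetwork output; in practice you would feed it the response of your own /geonetwork/srv/eng/portal.sitemap endpoint.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_text):
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; the URLs are made up for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://catalog.example.org/records/abc-123</loc></url>
  <url><loc>https://catalog.example.org/records/def-456</loc></url>
</urlset>"""

if __name__ == "__main__":
    for location in sitemap_locations(SAMPLE):
        print(location)
```

If the list comes back empty or contains URLs that do not resolve to the HTML record pages, that is worth fixing before submitting the sitemap to a search engine console.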
In recent versions schema.org annotations have been added to the HTML formatter of GeoNetwork. Search engines are able to extract the annotations as schema.org/Dataset and schema.org/DataCatalog. These annotations make the record discoverable via, for example, the Google dataset search engine. In the upcoming 3.8 release some fixes will be available that optimize this functionality.
Indicate what should be crawled
An important aspect of search engine optimization is to indicate to the crawler which parts of an SDI should not be crawled. Endpoints that require parameters to give valid responses (such as WMS/WFS/CSW), or endpoints that respond in non-HTML formats, are better excluded from crawling. If these endpoints are crawled, they show up as search results of their own, which makes relevant search results less visible (search engines may even hide them as duplicates).
Which endpoints the crawler should skip can be indicated in a robots.txt file. Note that if you want to exclude a WMS endpoint of GeoServer, the robots.txt file should be added to the domain on which GeoServer is hosted. GeoNetwork has a built-in robots.txt service, but it currently only adds the sitemap location. You can add configuration for your environment in the stylesheet behind that service, located at /web/xslt/services/sitemap/robots.xsl. Place the robots.txt file in the root of your website, not in a subfolder.
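To illustrate, here is a small Python sketch that checks a candidate robots.txt against a few URLs with the standard library's urllib.robotparser before you deploy it. The rules, the host name catalog.example.org and the choice to serve GeoServer and GeoNetwork from the same domain are assumptions for the example; adjust the paths to the endpoints of your own environment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers away from service endpoints
# and advertise the GeoNetwork sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /geoserver/
Disallow: /geonetwork/srv/eng/csw
Sitemap: https://catalog.example.org/geonetwork/srv/eng/portal.sitemap
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    base = "https://catalog.example.org"
    candidates = [
        base + "/geoserver/wms?service=WMS&request=GetCapabilities",
        base + "/geonetwork/srv/eng/csw?service=CSW&request=GetCapabilities",
        base + "/geonetwork/srv/eng/catalog.search",
    ]
    for url in candidates:
        verdict = "crawl" if parser.can_fetch("*", url) else "skip"
        print(verdict, url)
    # site_maps() is available from Python 3.8 onwards.
    print(parser.site_maps())
```

Running a check like this over a handful of representative URLs is a cheap way to confirm that the service endpoints are excluded while the catalog pages remain crawlable.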
Search engine tools
The search engines provide nice tools to evaluate the discoverability of the catalog and which links point to missing resources. Use the tools to gain insight into the usage of your catalog and improve its quality.’
Contact: Paul van Genuchten, firstname.lastname@example.org