Search engines currently index very few GeoNetwork records. Some of the causes are fairly easy to address. This post presents findings from the implementation of https://nationaalgeoregister.nl, as presented at Bolsena 2017.

The GeoNetwork application relies heavily on AJAX to retrieve partial content updates. Unfortunately, search engines cannot index this AJAX content. To improve this, recent versions (3.0+) offer the option to implement an HTML formatter that creates an HTML page for each metadata record. Formatter output is cached on disk, so the response time is low after the initial creation. This matters because search engines use response time in their ranking algorithms.

The next challenge is to make search engines aware that these pages exist. This can be done by registering the sitemap (/geonetwork/srv/eng/portal.sitemap) in the search engine's admin console (Bing, Yandex and Google all offer this option).

To register your sitemap, these search engines require you to place a verification HTML file in the root of your domain. The file can also be placed in a subfolder, but in that case the console offers somewhat less functionality.

The sitemap links may need to be updated so that they point to the HTML output instead of a URL containing '#'. To do so, update this file: https://github.com/geonetwork/core-geonetwork/blob/develop/web/src/main/webapp/xslt/services/sitemap/sitemap.xsl. In some situations we noticed that the default pagination size of 25000 is too high for GeoNetwork (it can run into memory problems), so we usually decrease the pagination size to 250 or 500. To do this, open the file https://github.com/geonetwork/core-geonetwork/web/src/main/webapp/WEB-INF/config/config-service-sitemap.xml and add the parameter that controls the pagination size:

<service name="portal.sitemap">
 <class name=".guiservices.metadata.Sitemap">
  <param name="maxItemsPage" value="500"/>
 </class>
</service>
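
To check whether a sitemap still contains links with a '#', the sitemap XML can be scanned with a small script. The following Python sketch is only an illustration: the sample sitemap and its example.org URLs are made up, and in practice the XML would be downloaded from the portal.sitemap endpoint of your own catalogue.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fragment_links(sitemap_xml):
    """Return all <loc> URLs in the sitemap that contain a '#'."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if "#" in url]


# Hypothetical sample: one AJAX-style link and one link to a plain HTML page.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/geonetwork/srv/eng/catalog.search#/metadata/abc-123</loc></url>
  <url><loc>https://example.org/geonetwork/srv/metadata/def-456</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for url in fragment_links(SAMPLE_SITEMAP):
        print("Needs fixing:", url)
```

Any URL reported by this check still points at the AJAX application rather than at the HTML output.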

To improve the user experience, a header and footer similar to those of the main website can be added to the HTML output. It is also possible to add schema.org and/or Open Graph (og) annotations to the HTML output, so that search engines can harvest the metadata content as schema.org/Dataset. Another nice option is to replace the complex formatter URL in the sitemap.xml with a human-friendly URL, /srv/metadata/{uuid}, and use a server-side redirect to send the user to the complex URL. Use this human-readable URL in the share option as well, so that Facebook, LinkedIn and Twitter display a nice title and thumbnail for each metadata link.
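
To give an idea of what a schema.org/Dataset annotation could look like, here is a minimal Python sketch that renders a JSON-LD script element. It only illustrates the shape of the output. In GeoNetwork itself the annotation would be produced by the formatter, and the record fields used here (title, abstract, uuid, url, keywords) are hypothetical names, not GeoNetwork's own.

```python
import json


def dataset_jsonld(record):
    """Render a schema.org/Dataset description as a JSON-LD <script> element.

    `record` is a plain dict with hypothetical keys: title, abstract, uuid,
    url and (optionally) keywords.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["abstract"],
        "identifier": record["uuid"],
        "url": record["url"],
    }
    if record.get("keywords"):
        doc["keywords"] = record["keywords"]
    # Escape "</" so that metadata text cannot close the script element early.
    payload = json.dumps(doc, indent=2, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


if __name__ == "__main__":
    print(dataset_jsonld({
        "title": "Example dataset",
        "abstract": "A made-up record used for illustration.",
        "uuid": "abc-123",
        "url": "https://example.org/geonetwork/srv/metadata/abc-123",
        "keywords": ["example", "sdi"],
    }))
```

The resulting element can be placed in the head of the HTML page generated for each record.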

An important aspect of search engine optimisation is telling the crawler which parts of an SDI should not be crawled. Endpoints that require parameters to give valid responses (such as WMS/WFS/CSW), and endpoints that respond in non-HTML formats, are best excluded from crawling. If these endpoints are crawled, they will show up in the search results, which makes relevant results less visible (search engines may even hide them as duplicates).

You can indicate which endpoints the crawler should skip by introducing a robots.txt file. Note that if you want to exclude a WMS endpoint of GeoServer, the robots.txt file should be added to the domain on which GeoServer is hosted. GeoNetwork has a built-in robots.txt service, but it currently only adds the sitemap location. You can add configuration for your environment there. This file is located here: /web/xslt/services/sitemap/robots.xsl.
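
Before publishing a robots.txt, it can be useful to verify that the rules exclude what you expect. The sketch below uses Python's standard urllib.robotparser module for that. The robots.txt content and the example.org URLs are assumptions for illustration (here GeoServer and GeoNetwork are assumed to share one domain), so adjust the paths to your own environment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude the OGC service endpoints, keep the rest open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /geoserver/
Disallow: /geonetwork/srv/eng/csw
Sitemap: https://example.org/geonetwork/srv/eng/portal.sitemap
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    urls = [
        "https://example.org/geoserver/wms?service=WMS&request=GetCapabilities",
        "https://example.org/geonetwork/srv/eng/csw?service=CSW&request=GetCapabilities",
        "https://example.org/geonetwork/srv/metadata/abc-123",
    ]
    for url in urls:
        verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
        print(verdict, url)
```

Note that this module matches rules by simple path prefix, so rules that rely on wildcards may be evaluated differently by the search engines themselves.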
