Managing data updates
By default, yente
will check for an updated build of the OpenSanctions database published at data.opensanctions.org
every hour. New data releases are published several times a day. If a fresh version is found, an indexing process will be spawned and load the data into the ElasticSearch index.
You can change this behavior using the following configuration options:
Env. variable | Default | Description |
---|---|---|
YENTE_MANIFEST |
manifests/default.yml |
Specifies the path of the manifest that defines the datasets exposed by the service. This is used to add extra datasets to the service or to define custom scopes for entity screening. |
YENTE_CRONTAB |
0 * * * * |
Gives the frequency at which new data will be indexed as a crontab. |
YENTE_AUTO_REINDEX |
true |
Can be set to false to disable automatic data updates. Data will only be refreshed and re-indexed when running yente reindex . |
Re-indexing in deployments with multiple yente instances.
In the single-instance default configuration, yente will set up an automatic refresh of its dataset every hour (governed by YENTE_CRONTAB
).
However, when running more than one yente container, for example deploying to Kubernetes or a managed multi-instance environment (such as Google Cloud Run), you should set YENTE_AUTO_REINDEX
to false
— otherwise the workers will clash, as they will all attempt to re-index in parallel. In this scenario, yente reindex
should be run using an external cron mechanism. For more information, see the documentation on how to deploy yente.
How does yente update its index?
Here is a very quick tour of how yente
works:
- When the application starts, it will download a metadata file from
data.opensanctions.org
which states the latest version of the OpenSanctions data that was been released. - If there is fresh data, it will create an ElasticSearch index with a timestamp that match the latest release of the data (e.g.
yente-entities-default-00220221030xxxx
). - It will then fetch the latest data from
data.opensanctions.org
(a 2GB+ JSON file) and push it into ElasticSearch in small batches. - When all the data is indexed,
yente
will create an ES index alias fromyente-entities-default
to the latest snapshot of the index (e.g.yente-entities-all-00220221030xxxx
) and delete all older snapshots of the index. - Only once this has completed will the
/search
and/match
APIs work correctly. On the plus side, any future updates to the data will be indexed first, and the switch-over to the new data will be instantaneous.