In:
SOIL, Copernicus GmbH, Vol. 9, No. 1 (2023-03-14), p. 155-168
Abstract:
Summarizing information from large bodies of scientific literature is an
essential but work-intensive task. This is especially true in environmental
studies where multiple factors (e.g., soil, climate, vegetation) can
contribute to the effects observed. Meta-analyses, studies that
quantitatively summarize findings of a large body of literature, rely on
manually curated databases built upon primary publications. However, given
the increasing amount of literature, this manual work is likely to require
more and more effort in the future. Natural language processing (NLP)
facilitates this task, but it is not yet clear to what extent the
extraction process is reliable or complete. In this work, we explore three
NLP techniques that can help support this task: topic modeling, tailored
regular expressions and the shortest dependency path method. We apply these
techniques in a practical and reproducible workflow on two corpora of
documents: the Open Tension-disk
Infiltrometer Meta-database (OTIM) and the Meta corpus. The OTIM corpus contains the source
publications of the entries of the OTIM database of near-saturated hydraulic
conductivity from tension-disk infiltrometer measurements
(https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of
all primary studies from 36 selected meta-analyses on the impact of
agricultural practices on sustainable water management in Europe. As a first
step of our practical workflow, we identified different topics from the
individual source publications of the Meta corpus using topic modeling.
This enabled us to distinguish well-researched topics (e.g., conventional
tillage, cover crops), where meta-analysis would be useful, from neglected
topics (e.g., effect of irrigation on soil properties), showing potential
knowledge gaps. Then, we used tailored regular expressions to extract
coordinates, soil texture, soil type, rainfall, disk diameter and tensions
from the OTIM corpus to build a quantitative database. We retrieved
between 56 % and 100 % of all relevant information (recall), with a
precision between 83 % and
100 %. Finally, we extracted relationships between a set of drivers
corresponding to different soil management practices or amendments (e.g.,
“biochar”, “zero tillage”) and target variables (e.g., “soil
aggregate”, “hydraulic conductivity”, “crop yield”) from the
abstracts of the Meta corpus source publications using the shortest
dependency path between them. These relationships were further classified
according to positive, negative or absent correlations between the driver
and the target variable. This quickly provided an overview of the different
driver–variable relationships and their abundance for an entire body of
literature. Overall, we found that all three tested NLP techniques were able
to support evidence synthesis tasks. While human supervision remains
essential, NLP methods have the potential to support automated evidence
synthesis which can be continuously updated as new publications become
available.
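
The regular-expression extraction and the recall and precision figures reported above can be illustrated with a minimal sketch. The pattern, the sample text, the reference values and the function names below are all hypothetical and chosen only for illustration; they are not the expressions or data used by the authors.

```python
import re

# Hypothetical pattern (an assumption, not the authors' expression): a number
# followed by "mm" within 40 non-digit characters of a rainfall keyword.
RAINFALL_RE = re.compile(
    r"(?:rainfall|precipitation)\D{0,40}?(\d+(?:\.\d+)?)\s*mm\b",
    re.IGNORECASE,
)


def extract_rainfall(text):
    """Return the set of values (in mm) that the pattern matches in text."""
    return {float(value) for value in RAINFALL_RE.findall(text)}


def precision_recall(extracted, relevant):
    """Compare extracted values with a manually curated reference set."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up sample text and reference values, for illustration only.
SAMPLE = (
    "Mean annual rainfall was 650 mm at site A. "
    "At site B, annual precipitation averages 820 mm. "
    "Site C receives 540 mm per year. "
    "A rainfall simulator with a 200 mm disk was also used."
)
RELEVANT = {650.0, 820.0, 540.0}

extracted = extract_rainfall(SAMPLE)
precision, recall = precision_recall(extracted, RELEVANT)
print(sorted(extracted), round(precision, 2), round(recall, 2))
```

In this toy case the pattern misses the value for site C, which has no keyword (lowering recall), and wrongly picks up the disk diameter (lowering precision), which is the kind of error that makes human supervision necessary.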
Type of Medium:
Online Resource
ISSN:
2199-398X
DOI:
10.5194/soil-9-155-2023
Language:
English
Publisher:
Copernicus GmbH
Publication Date:
2023
ZDB ID:
2834892-8