Concordance between generative artificial intelligence and human reviewers for screening of publications for a systematic literature review

Bernard Kerr,1 Brian Norman,1 Thierry Deltheil,1 Valerie Moss,1 Mary Coffey,2 Iryna Shnitsar,2 and Doreen Valentine2
1Prime, London, UK; 2Bristol Myers Squibb, Princeton, NJ, USA

Summary

  • This exploratory analysis highlights the potential for genAI-based tools to allow more efficient screening of literature search results than human review alone.
  • Overall agreement between genAI and human review was high (> 75%) for 8 selection criteria, moderate (69%) for 1 criterion, and poor (< 50%) for 3 selection criteria.
    • Performance was unexpectedly high for the majority of selection criteria, given that the dataset and selection criteria were chosen for their high complexity and variability.
    • It would be reasonable to expect improved performance for screening literature and criteria with a lower level of technical complexity.
  • These results provide valuable insight into the strengths and limitations of genAI-based literature screening; however, comparison of this analysis with similar studies performed under different conditions should be done with caution.
  • GenAI screening required < 15 minutes of user time to run, supporting use of the tool to prioritize publications for human review, assist with categorizing publications, and reduce the risk of human error going undetected.
  • Further studies are planned to evaluate use of the genAI screening tool in other subject areas, and to investigate the reproducibility of genAI responses.

Learnings and Considerations for Future Use of GenAI

  • GenAI screening was most effective where information relevant to criteria was clearly stated in publications and closely matched the wording of the prompt.
  • Criteria with poor performance had prompts with a non-specific scope, related to complex concepts or technical information, or screened for information that was reported using highly variable language.
  • Splitting complex criteria into multiple prompts and combining the results with Boolean logic improved evaluability of the results compared with complex prompts that screened for multiple items (see the sketch after this list).
  • Inclusion of follow-up prompts asking the genAI model to explain its decisions assisted with identifying screening errors by allowing the user to monitor the model’s decision-making process.
  • Word count and readability of publications were not associated with screening errors, suggesting that technical terminology and the overall complexity of publications may be key contributors to poor performance.
  • The rate of errors in the genAI outputs was low for most criteria, but the accumulation of errors across criteria led to a meaningful reduction in sensitivity and specificity for the overall selection of publications.
  • Continued evolution of AI models, improved prompting approaches, and use of models with more relevant training data are expected to increase the effectiveness of AI screening.

Introduction

  • Comprehensive literature searches are used to inform the development of systematic literature reviews (SLRs), publication plans, scientific communications platforms, and various other medical communications activities.
  • The time, cost, and resource requirements to evaluate search results are barriers to undertaking literature searches.1,2
  • Generative artificial intelligence (genAI) tools may assist with literature searches, reducing associated costs and timelines; however, the accuracy and reliability of these tools require further investigation.
  • This analysis investigated the agreement between a proprietary GPT-4-powered literature review tool and a validated human review for screening literature search results from a published SLR on analytical concordance of PD-L1 diagnostic assays by Prince EA et al.3
  • The dataset was selected for this analysis for three reasons:
    • The SLR was recently completed and scientifically important.
    • Publications screened for inclusion in the SLR were technically complex and highly heterogeneous, allowing the genAI model to be tested with challenging literature.
    • The inclusion and exclusion criteria for the SLR would allow evaluation of how the genAI model responds to screening criteria with differing requirements, including (1) a minimum count of items from a specified list; (2) information that was explicitly stated or required complex interpretation of the publication text; (3) information reported with highly variable wording; and (4) open-ended criteria where it was not possible to define a specific list of items in the prompt.

Methods

  • In total, 3477 unique publications were identified in literature searches and screened for inclusion in the published SLR by Prince EA et al3 (Figure 1).
    • 160 publications were available for genAI screening in the current analysis.
      • 97 publications that met both key inclusion criteria had screening data for human review of all criteria.
      • 63 open-access publications that did not meet both key inclusion criteria were used as a control dataset.
    • The search methodology and publication selection are described in detail in the published SLR.3
  • Question prompts (Table A) were developed to screen the full text of all publications against 12 selection criteria from the published SLR methodology (6 inclusion and 6 exclusion criteria; Figure 1)3 using a proprietary GPT-4-powered tool (a hypothetical stand-in for the screening call is given in the first sketch after this list).
    • GenAI responses were restricted to “yes”, “no”, or “unclear” (i.e., could not answer) for whether each criterion was met.
    • Three prespecified selection criteria from the original SLR were used to limit the original literature search and were not evaluated in this analysis.
  • Overall agreement, false-positive, and false-negative rates were calculated for genAI screening using human review as a gold standard (a minimal computation sketch is given in the second sketch after this list).
  • Associations of screening errors with word count and Flesch reading ease score of the AI-screened text files were investigated as surrogate measures of complexity.
  • Post-hoc sensitivity analyses were performed to investigate the contributions of selection criteria with high disagreement to the overall screening outcomes by sequentially removing those criteria from the overall decision.
  • To investigate the underlying reasons for incorrect screening results, additional prompts asked the genAI model to explain its decisions for inclusion criterion 1 and exclusion criterion 4.
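
The poster does not describe the proprietary GPT-4-powered tool's interface. As a hypothetical stand-in, the sketch below issues one criterion prompt per publication through the public OpenAI chat API and restricts responses to "yes", "no", or "unclear"; the system message, function name, and prompt wording are illustrative assumptions.

```python
# Hypothetical stand-in for the screening call; the proprietary tool's
# interface is not described in this poster. Uses the public OpenAI chat
# API; prompt wording and names here are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = (
    "You are screening publications for a systematic literature review. "
    "Answer only 'yes', 'no', or 'unclear'."
)

def screen(full_text: str, criterion_prompt: str) -> str:
    """Ask one yes/no/unclear question about a publication's full text."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce run-to-run variability
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": f"{criterion_prompt}\n\n---\n\n{full_text}"},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    # Coerce anything off-menu to "unclear" so it is flagged for human review.
    return answer if answer in {"yes", "no", "unclear"} else "unclear"
```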
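The second sketch computes the agreement statistics as defined in the table footnotes: human review is the gold standard, and the false-positive and false-negative denominators are the human-negative and human-positive counts, respectively. Grouping "unclear" with false negatives here is a simplifying assumption; the poster reports unclear responses separately.

```python
# Minimal sketch of the agreement statistics, with human review as the
# gold standard. Human labels are "yes"/"no"; genAI labels may also be
# "unclear". Grouping "unclear" with false negatives is an assumption
# made here for simplicity; the poster reports unclear responses separately.

def _rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else float("nan")

def agreement_stats(human: list[str], genai: list[str]) -> dict[str, float]:
    pairs = list(zip(human, genai))
    human_pos = [g for h, g in pairs if h == "yes"]  # false-negative denominator
    human_neg = [g for h, g in pairs if h == "no"]   # false-positive denominator
    return {
        "overall_agreement": _rate(sum(h == g for h, g in pairs), len(pairs)),
        "false_positive_rate": _rate(sum(g == "yes" for g in human_neg), len(human_neg)),
        "false_negative_rate": _rate(sum(g != "yes" for g in human_pos), len(human_pos)),
    }

# Example with three publications screened against one criterion:
print(agreement_stats(["yes", "no", "yes"], ["yes", "yes", "unclear"]))
```
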
Figure 1.
(A) Disposition of literature search results screened by Prince EA et al,3 composition of the dataset for genAI screening, selection criteria, and (B) flow of analyses

a LDTs were defined as assays that were not performed with the reagents, methods, and/or equipment specified in the manufacturer’s instructions for FDA-approved diagnostics. b A total of 42 publications were included in the published qualitative synthesis (24 journal publications and 18 congress abstracts); a further 12 publications met all inclusion criteria and no exclusion criteria but were excluded from the qualitative synthesis for other reasons (congress abstracts with a subsequent journal publication that was included in the synthesis and publications that presented insufficient data to include in the synthesis). FNA, fine needle aspiration; genAI, generative artificial intelligence; IHC, immunohistochemistry; LDTs, laboratory-developed tests; PD-L1/2, programmed death ligand-1/2; RUO, research use only.

Table A.
Selection criteria for the Prince EA et al SLR3 and prompt design

Inclusion criteria 1 and 2 were key inclusion criteria used for initial screening of publications. Only publications that met both key criteria were evaluated on all selection criteria. a LDTs were defined as assays that were not performed with the reagents, methods, and/or equipment specified in the manufacturer’s instructions for FDA-approved diagnostics.

Results

  • GenAI screening was automated and required < 15 minutes of total active user time to screen all 160 publications.
  • Agreement statistics for inclusion criteria are shown in Interactive Table 2 and Interactive Figure 2.
  • Overall agreement rates for inclusion criteria 1 and 2 were 88% and 86%, respectively (n = 160).
    • The overall agreement rate was 89% for meeting both key inclusion criteria in the overall dataset, 92% for the full-screening dataset that met both key inclusion criteria (n = 97), and 86% for the control dataset (n = 63) that did not meet both key inclusion criteria by human review (Table B).
  • Overall agreement rates were high (77%–100%) for inclusion criteria 3, 4, and 6 (n = 97); overall agreement for inclusion criterion 5 was 38%, with false-negative or unclear responses for 60% of publications.
    • Agreement statistics for congress abstract and journal publication subgroups were consistent with the overall dataset (Interactive Figure 2).
  • The overall agreement rate for meeting all inclusion criteria was 42%; 65% of publications that met all inclusion criteria on human review had one or more false-negative results on AI screening.
Interactive Table 2.
Agreement between genAI and human review for meeting individual inclusion criteria, key inclusion criteria, and all inclusion criteria

Yellow and orange shading indicate >25% and >50% disagreement with human review, respectively. False-negative results reduce sensitivity and false-positive results reduce specificity. Denominators for false-positive and false-negative rates are the total number of publications that were negative and positive by human review, respectively, for that criterion. a Inclusion criteria 4 and 5 were screened using multiple prompts that were combined using Boolean logic.

Interactive Table B.
Agreement statistics for key inclusion criteria in publications that met both key inclusion criteria and controls

Yellow and orange shading indicate > 25% and > 50% disagreement with human review, respectively. False-negative results reduce sensitivity and false-positive results reduce specificity. Denominators for false-positive and false-negative rates are the total number of publications that were negative and positive by human review, respectively, for that criterion.

Interactive Figure 2.
Agreement between genAI and human review for meeting individual inclusion criteria, key inclusion criteria, and all inclusion criteria

CA, congress abstracts; JM, journal manuscripts.

  • Agreement statistics for exclusion criteria are shown in Table 3 and Interactive Figure 3.
  • Agreement between genAI and human review was variable across individual exclusion criteria (n = 97); overall agreement rates were high for criteria 1, 3, and 6 (87%–100%), moderate for criterion 2 (69%), and low for criteria 4 and 5 (44% and 49%, respectively).
    • The low number of publications excluded by human reviewers on most criteria skewed disagreement toward false-positive results that incorrectly excluded publications (i.e., decreased sensitivity).
    • Agreement statistics for congress abstract and journal publication subgroups were consistent with the overall dataset (Interactive Figure 3).
  • The overall agreement rate for meeting ≥ 1 exclusion criterion was low (46%); 96% of publications that did not meet any exclusion criteria on human review met one or more exclusion criteria on AI screening.
Interactive Table 3.
Agreement between genAI and human review for meeting individual exclusion criteria and ≥ 1 exclusion criteria

Yellow and orange shading indicate > 25% and > 50% disagreement with human review, respectively. False-positive results reduce sensitivity and false-negative results reduce specificity. Denominators for false-positive and false-negative rates are the total number of publications that were negative and positive by human review, respectively, for that criterion. a Exclusion criterion 6 was screened using two prompts that were combined using Boolean logic.

Interactive Figure 3.
Agreement between genAI and human review for meeting individual exclusion criteria and ≥ 1 exclusion criteria
  • Performance of the genAI tool was poor for overall selection of publications based on all criteria (meeting all inclusion criteria AND no exclusion criteria).
    • All publications failed ≥ 1 selection criterion on AI screening (median 3; range, 1–7); 65 (67%) failed ≥ 1 inclusion criterion and 95 (98%) failed ≥ 1 exclusion criterion (Figure A).
    • 3 publications had screening results that agreed with human review on all criteria and were correctly excluded; the remaining 94 publications had a screening result that disagreed with human review on ≥ 1 criterion (median 2; range, 1–7) (Figure B).
    • Word count and readability were not associated with the number of criteria with disagreement (Figure C; a sketch of the Flesch score computation follows the figure).
Figure A.
Number of criteria with screening failure on genAI review

Figure B.
Number of criteria with disagreement between genAI and human review

CA, congress abstracts; JM, journal manuscripts.

Figure C.
Number of screening results with disagreement vs human review by word count and Flesch reading ease score

Lower Flesch reading ease score indicates greater reading difficulty. Black data points indicate publications with no screening results that disagree with human review.
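
For reference, the sketch below implements the standard Flesch reading ease formula used as a surrogate complexity measure in Figure C. The regex-based sentence and syllable counting is a deliberately naive assumption for illustration; the poster does not state how the score was computed.

```python
# Standard Flesch reading ease formula; the naive sentence and syllable
# counting below is an illustrative assumption (production tools use
# dictionaries or trained syllabifiers).
import re

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Lower scores indicate greater reading difficulty."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    # Naive syllable heuristic: count contiguous vowel groups per word.
    n_syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

# Dense technical wording drives the score down (can go negative):
print(round(flesch_reading_ease("PD-L1 immunohistochemistry assays were compared."), 1))
```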

  • The most permissive sensitivity analysis removed inclusion criterion 5 and exclusion criteria 2, 4, and 5 from the overall screening outcome for genAI (Figures D and E; a minimal sketch of the removal logic follows this list).
    • Removal of criteria with poor performance improved overall agreement with human review on matching criteria (73% for inclusion criteria and 89% for exclusion criteria), with the remaining disagreement suggesting an accumulation of screening errors across multiple criteria.
    • 39 (72%) of the 54 publications selected by human review on all criteria were selected by genAI in the sensitivity analysis, as were 31 of the 43 publications that were excluded by human review (31 of the 70 genAI-selected publications; false-positive rate 44%).
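
A minimal sketch of the sequential-removal logic, under the assumptions that per-publication screening results are stored per criterion and that "unclear" is treated as "not met"; the data layout and criterion identifiers (I1–I6, E1–E6) are illustrative.

```python
# Sketch of the sequential-removal sensitivity analysis; the data layout
# and criterion identifiers (I1-I6, E1-E6) are illustrative assumptions.

def selected(inclusion: dict[str, str], exclusion: dict[str, str],
             dropped: frozenset[str] = frozenset()) -> bool:
    """Composite outcome: all remaining inclusion criteria met AND no
    remaining exclusion criterion met ('unclear' treated as not 'yes')."""
    inc_ok = all(v == "yes" for k, v in inclusion.items() if k not in dropped)
    exc_ok = all(v != "yes" for k, v in exclusion.items() if k not in dropped)
    return inc_ok and exc_ok

# Most permissive scenario in this analysis (I1 + E3 in Figures D and E):
# drop inclusion criterion 5 and exclusion criteria 2, 4, and 5.
MOST_PERMISSIVE = frozenset({"I5", "E2", "E4", "E5"})

pub = {
    "inclusion": {f"I{i}": "yes" for i in range(1, 7)} | {"I5": "unclear"},
    "exclusion": {f"E{i}": "no" for i in range(1, 7)} | {"E4": "yes"},
}
print(selected(pub["inclusion"], pub["exclusion"]))                   # False
print(selected(pub["inclusion"], pub["exclusion"], MOST_PERMISSIVE))  # True
```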
Figure D.
Agreement between genAI and human review for matching composite screening outcomes for base case and sensitivity analysis of (A) all inclusion criteria and (B) ≥ 1 exclusion criterion

Sensitivity analyses for removal of selection criteria with poor performance on overall screening outcome compared with the outcome by human review on matching composite criteria. I0 and E0 indicate use of all 6 inclusion and exclusion criteria, respectively. I1 indicates removal of inclusion criterion 5; E1, E2, and E3 indicate removal of exclusion criterion 4 (E1), 4 and 5 (E2), and 2, 4, and 5 (E3).

Figure E.
Number of publications meeting all inclusion criteria and no exclusion criteria on genAI screening in sensitivity analyses vs human review on all criteria

Sensitivity analyses for removal of selection criteria with poor performance on overall screening outcome compared with the overall outcome by human review on all criteria. The blue line indicates the number of publications that met all inclusion criteria AND no exclusion criteria on human review (n = 54). I0 and E0 indicate use of all 6 inclusion and exclusion criteria, respectively. I1 indicates removal of inclusion criterion 5; E1, E2, and E3 indicate removal of exclusion criterion 4 (E1), 4 and 5 (E2), and 2, 4, and 5 (E3).

  • To further explore the reliability of screening results, additional prompts were added asking the genAI model to explain its responses for inclusion criterion 1 and exclusion criterion 4.
    • Inclusion criterion 1 was selected because of the low complexity of the prompt and high agreement between genAI and human review.
    • Exclusion criterion 4 was selected because of the high complexity of the prompt, poor agreement between genAI and human review, and the relatively high proportion of publications excluded on the criterion by human reviewers.
  • Various issues were identified that led to both incorrect screening responses and correct responses with an improper decision process, suggesting that some responses were chance results:
    • Logical errors, such as returning a false-positive result because information relating to the criterion was not present.
    • Inconsistent prompt interpretation, including incorrect or partial application of the prompt, such as generating the response based on the presence/absence of a single specific item from a list.
    • Counting errors leading to false-negative and false-positive results for inclusion criterion 1 (e.g., false-negative responses despite listing two assays in the explanation).
    • Misattribution errors leading to false-positive responses (e.g., assuming that the use of immunohistochemistry or automated laboratory equipment meant that digital imaging and image analysis were used).
    • Out-of-scope/misclassification errors, where a response was based not on the information specified in the prompt but on related information outside the prompt’s scope (e.g., a false-positive response due to detecting a PD-L1 assay that was not one of the four assays listed in the prompt for inclusion criterion 1).

Disclosures
  • Bernard Kerr, Brian Norman, Thierry Deltheil, and Valerie Moss are employees of Prime, London, UK.
  • Iryna Shnitsar, Mary Coffey, and Doreen Valentine are employees of Bristol Myers Squibb, Princeton, NJ, USA.
  • All opinions are the authors’ own and do not necessarily reflect those of our employers.
Acknowledgements
  • The Prime authors thank Bristol Myers Squibb for their support and collaboration, and for permission to use the Prince EA et al SLR in this analysis.
  • We thank the Prime Editor, Digital, and Production teams for their support with development of this poster, and Jack Roweth, Prime, London, UK, for assistance with the analysis.
References
  1. Michelson M and Reuter K. Contemp Clin Trials Commun. 2019;16:100443.
  2. Borah R, et al. BMJ Open. 2017;7:e012545.
  3. Prince EA, et al. JCO Precis Oncol. 2021;5:953-973.