This exploratory analysis highlights the potential for genAI-based tools to allow more efficient screening of literature search results than human review alone.
Overall agreement between genAI and human review was high (> 75%) for 8 selection criteria, moderate (69%) for 1 criterion, and poor (< 50%) for 3 selection criteria.
Performance was unexpectedly high for the majority of selection criteria, given that the dataset and selection criteria were chosen for their high complexity and variability.
It would be reasonable to expect improved performance for screening literature and criteria with a lower level of technical complexity.
These results provide valuable insight into the strengths and limitations of genAI-based literature screening; however, comparison of this analysis with similar studies performed under different conditions should be done with caution.
GenAI screening required < 15 minutes of user time to run, supporting use of the tool to prioritize publications for human review, assist with categorizing publications, and reduce the risk of human error going undetected.
Further studies are planned to evaluate use of the genAI screening tool in other subject areas, and to investigate the reproducibility of genAI responses.
Learnings and Considerations for Future Use of GenAI
GenAI screening was most effective where information relevant to criteria was clearly stated in publications and closely matched the wording of the prompt.
Criteria with poor performance had prompts that were non-specific in scope, related to complex concepts or technical information, or screened for information reported using highly variable language.
Splitting complex criteria into multiple prompts and combining the results with Boolean logic improved evaluability of the results compared with complex prompts that screened for multiple items (see the sketch after this list).
Inclusion of follow-up prompts asking the genAI model to explain its decisions assisted with identifying screening errors by allowing the user to monitor the model’s decision-making process.
Word count and readability of publications were not associated with screening errors, suggesting that technical terminology and the conceptual complexity of publications, rather than their length or readability, may be key contributors to poor performance.
The rate of errors in the genAI outputs was low for most criteria, but the accumulation of errors across criteria led to a meaningful reduction in sensitivity and specificity for the overall selection of publications.
Continued evolution of AI models, improved prompting approaches, and use of models with more relevant training data are expected to increase the effectiveness of AI screening.
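As an illustration of the point above about splitting complex criteria, a minimal sketch of recombining sub-prompt answers with Boolean logic is shown below; the sub-criterion names and the combination rule (all sub-criteria required, "unclear" treated as not met) are assumptions for illustration, not the prompts or logic used in this analysis.

```python
# Minimal sketch: recombining answers from sub-prompts that were split out of
# one complex criterion. Sub-criterion names and the AND-combination rule are
# illustrative assumptions, not the prompts or logic used in this analysis.

def met(response: str) -> bool:
    """Treat only an explicit 'yes' as meeting a sub-criterion ('unclear' counts as not met)."""
    return response.strip().lower() == "yes"

def criterion_met(sub_responses: dict) -> bool:
    """Boolean AND across sub-prompts: the criterion is met only if every sub-criterion is met."""
    return all(met(r) for r in sub_responses.values())

# Hypothetical example: one complex criterion split into two simpler sub-prompts
example = {"reports_assay_comparison": "yes", "reports_patient_samples": "unclear"}
print(criterion_met(example))  # False: the 'unclear' sub-result blocks the criterion
```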
Comprehensive literature searches are used to inform the development of systematic literature reviews (SLRs), publication plans, scientific communications platforms, and various other medical communications activities.
The time, cost, and resource requirements to evaluate search results are barriers to undertaking literature searches.1,2
Generative artificial intelligence (genAI) tools may assist with literature searches, reducing associated costs and timelines; however, the accuracy and reliability of these tools require further investigation.
This analysis investigated the agreement between a proprietary GPT-4-powered literature review tool and a validated human review for screening literature search results from a published SLR on analytical concordance of PD-L1 diagnostic assays by Prince EA et al.3
The dataset was selected for this analysis for three reasons:
The SLR was recently completed and scientifically important.
Publications screened for inclusion in the SLR were technically complex and highly heterogeneous, allowing the genAI model to be tested with challenging literature.
The inclusion and exclusion criteria for the SLR would allow evaluation of how the genAI model responds to screening criteria with differing requirements, including (1) a minimum count of items from a specified list; (2) information that was explicitly stated or required complex interpretation of the publication text; (3) information reported with highly variable wording; and (4) open-ended criteria where it was not possible to define a specific list of items in the prompt.
In total, 3477 unique publications were identified in literature searches and screened for inclusion in the published SLR by Prince EA et al3 (Figure 1).
160 publications were available for genAI screening in the current analysis.
97 publications that met both key inclusion criteria had screening data for human review of all criteria.
63 open access publications that did not meet both key inclusion criteria were used as a control dataset.
The search methodology and publication selection are described in detail in the published SLR.3
Question prompts (Table A) were developed to screen the full text of all publications against 12 selection criteria from the published SLR methodology (6 inclusion and 6 exclusion criteria; Figure 1)3 using a proprietary GPT-4-powered tool.
GenAI responses were restricted to “yes”, “no”, or “unclear” (i.e., could not answer) for whether each criterion was met.
Three prespecified selection criteria from the original SLR were used to limit the original literature search and were not evaluated in this analysis.
Overall agreement, false-positive, and false-negative rates were calculated for genAI screening using human review as a gold standard.
Associations of screening errors with word count and Flesch reading ease score of the AI-screened text files were investigated as surrogate measures of complexity.
Post-hoc sensitivity analyses were performed to investigate the contributions of selection criteria with high disagreement to the overall screening outcomes by sequentially removing those criteria from the overall decision.
To investigate the underlying reasons for incorrect screening results, additional prompts asked the genAI model to explain its decisions for inclusion criterion 1 and exclusion criterion 4.
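As a minimal sketch of the agreement statistics described above, the following assumes per-publication responses of "yes"/"no"/"unclear" from genAI and "yes"/"no" from human review for a single criterion; treating "unclear" responses as disagreement (and as not meeting the criterion) is an illustrative assumption, not necessarily how the analysis handled them.

```python
# Minimal sketch of per-criterion agreement statistics, using human review as
# the gold standard. Treating 'unclear' genAI responses as disagreement (and as
# not meeting the criterion) is an assumption for illustration.

def agreement_stats(genai: list, human: list) -> dict:
    n = len(human)
    agree = sum(g == h for g, h in zip(genai, human))
    # False positive: genAI says the criterion is met when human review says it is not.
    false_pos = sum(g == "yes" and h == "no" for g, h in zip(genai, human))
    # False negative: genAI says not met (or 'unclear') when human review says it is met.
    false_neg = sum(g != "yes" and h == "yes" for g, h in zip(genai, human))
    return {
        "overall_agreement": agree / n,
        "false_positive_rate": false_pos / n,
        "false_negative_rate": false_neg / n,
    }

# Hypothetical example for one criterion across four publications
print(agreement_stats(["yes", "no", "unclear", "yes"], ["yes", "no", "yes", "no"]))
```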
Overall agreement rates for inclusion criteria 1 and 2 were 88% and 86%, respectively (n = 160).
The overall agreement rate was 89% for meeting both key inclusion criteria in the overall dataset, 92% for the full-screening dataset that met both key inclusion criteria (n = 97), and 86% for the control dataset (n = 63) that did not meet both key inclusion criteria by human review (Table B).
Overall agreement rates were high (77%–100%) for inclusion criteria 3, 4, and 6 (n = 97); overall agreement for inclusion criterion 5 was 38%, with false-negative or unclear responses for 60% of publications.
Agreement statistics for congress abstract and journal publication subgroups were consistent with the overall dataset (Interactive Figure 2).
The overall agreement rate for meeting all inclusion criteria was 42%; 65% of publications that met all inclusion criteria on human review had one or more false-negative results on genAI screening.
Agreement between genAI and human review was variable across individual exclusion criteria (n = 97); overall agreement rates were high for criteria 1, 3, and 6 (87%–100%), moderate for criterion 2 (69%), and low for criteria 4 and 5 (44% and 49%, respectively).
The low number of publications excluded by human reviewers on most criteria skewed disagreements toward false-positive results that incorrectly excluded publications (i.e., decreased sensitivity).
Agreement statistics for congress abstract and journal publication subgroups were consistent with the overall dataset (Interactive Figure 3).
The overall agreement rate for meeting ≥ 1 exclusion criterion was low (46%); 96% of publications that did not meet any exclusion criteria on human review met one or more exclusion criteria on genAI screening.
Performance of the genAI tool was poor for overall selection of publications based on all criteria (meeting all inclusion criteria AND no exclusion criteria).
All publications failed ≥ 1 selection criterion on genAI screening (median 3; range, 1–7); 65 (67%) failed ≥ 1 inclusion criterion and 95 (98%) failed ≥ 1 exclusion criterion (Figure A).
3 publications had screening results that agreed with human review on all criteria and were correctly excluded; the remaining 94 publications had a screening result that disagreed with human review on ≥ 1 criterion (median 2; range, 1–7) (Figure B).
Word count and readability were not associated with the number of criteria with disagreement (Figure C).
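A minimal sketch of the surrogate complexity measures and an association check is shown below; it assumes the textstat and scipy packages are available, uses hypothetical values, and the choice of a Spearman correlation is illustrative rather than the method actually used in the analysis.

```python
# Minimal sketch of the surrogate complexity measures (word count and Flesch
# reading ease) and an association check against per-publication disagreement
# counts. The textstat and scipy packages are assumed to be installed; the
# values and the choice of Spearman correlation are illustrative only.
import textstat
from scipy.stats import spearmanr

def complexity_measures(text: str):
    word_count = len(text.split())
    readability = textstat.flesch_reading_ease(text)  # higher scores = easier to read
    return word_count, readability

# Hypothetical values for four AI-screened text files
word_counts = [3200, 5400, 450, 2800]
criteria_with_disagreement = [1, 3, 2, 1]
rho, p_value = spearmanr(word_counts, criteria_with_disagreement)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
```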
The most permissive sensitivity analysis removed inclusion criterion 5 and exclusion criteria 2, 4, and 5 from the overall screening outcome for genAI (Figures D and E).
Removal of criteria with poor performance improved overall agreement with human review on matching criteria (73% for inclusion criteria and 89% for exclusion criteria), with the remaining disagreement suggesting an accumulation of screening errors across multiple criteria.
39 (72%) of the 54 publications selected by human review on all criteria were also selected by genAI in the sensitivity analysis; however, 31 of the 43 publications excluded by human review were also incorrectly selected by genAI (false-positive rate, 44%).
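A minimal sketch of the sensitivity-analysis logic is shown below, recomputing the overall selection decision (all inclusion criteria met AND no exclusion criteria met) after dropping specified criteria; the criterion labels, data structure, and example responses are illustrative assumptions.

```python
# Minimal sketch of the post-hoc sensitivity analysis: criteria with high
# disagreement are removed and the overall selection decision (all inclusion
# criteria met AND no exclusion criteria met) is recomputed from the rest.
# Criterion labels and the example responses are illustrative assumptions.

def overall_selection(responses: dict, dropped=frozenset()) -> bool:
    """responses maps criterion IDs (e.g. 'inc1', 'exc4') to 'yes'/'no'/'unclear'."""
    inclusion = {c: r for c, r in responses.items() if c.startswith("inc") and c not in dropped}
    exclusion = {c: r for c, r in responses.items() if c.startswith("exc") and c not in dropped}
    return all(r == "yes" for r in inclusion.values()) and all(r == "no" for r in exclusion.values())

publication = {"inc1": "yes", "inc5": "unclear", "exc2": "yes", "exc4": "no"}
print(overall_selection(publication))                            # False: inc5 unclear, exc2 met
print(overall_selection(publication, dropped={"inc5", "exc2"}))  # True once the noisy criteria are removed
```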
To further explore the reliability of screening results, additional prompts were added asking the genAI model to explain its responses for inclusion criterion 1 and exclusion criterion 4.
Inclusion criterion 1 was selected because of the low complexity of the prompt and high agreement between genAI and human review.
Exclusion criterion 4 was selected because of the high complexity of the prompt, poor agreement between genAI and human review, and the relatively high proportion of publications excluded on the criterion by human reviewers.
Various issues were identified that led both to incorrect screening responses and to correct responses reached through a flawed decision process, suggesting that some correct responses were chance results:
Logical errors, such as returning a false-positive result when information relating to the criterion was not present in the publication.
Inconsistent prompt interpretation, including incorrect or partial application of the prompt, such as generating the response based on the presence/absence of a single specific item from a list.
Counting errors leading to false-negative and false-positive results for inclusion criterion 1 (e.g., a false-negative response despite the explanation listing two assays).
Misattribution errors leading to false-positive responses (e.g., assuming that the use of immunohistochemistry or automated laboratory equipment meant that digital imaging and image analysis were used).
Out-of-scope/misclassification errors, where a response was based not on the information specified in the prompt but on related information outside its scope (e.g., a false-positive response due to detecting a PD-L1 assay that was not one of the four assays listed in the prompt for inclusion criterion 1).
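To illustrate how a follow-up explanation prompt can be paired with each screening prompt, a minimal sketch is shown below; `ask_model` is a hypothetical stand-in for the proprietary GPT-4-powered tool's interface, and the prompt wording is illustrative rather than the wording used in Table A.

```python
# Minimal sketch of pairing a screening prompt with a follow-up prompt that asks
# the model to explain its decision. `ask_model` is a hypothetical stand-in for
# the proprietary GPT-4-powered tool; the prompt wording is illustrative and
# differs from the actual prompts in Table A.

def screen_with_explanation(ask_model, publication_text: str, criterion_prompt: str) -> dict:
    messages = [
        {"role": "system", "content": "Answer only 'yes', 'no', or 'unclear'."},
        {"role": "user", "content": f"{criterion_prompt}\n\nPublication text:\n{publication_text}"},
    ]
    decision = ask_model(messages)
    # Follow-up: ask the model to justify its decision so a reviewer can check
    # whether the reasoning stayed within the criterion's intended scope.
    messages += [
        {"role": "assistant", "content": decision},
        {"role": "user", "content": "Explain which statements in the publication support your answer."},
    ]
    explanation = ask_model(messages)
    return {"decision": decision, "explanation": explanation}
```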
Bernard Kerr, Brian Norman, Thierry Deltheil, and Valerie Moss are employees of Prime, London, UK.
Iryna Shnitsar, Mary Coffey, and Doreen Valentine are employees of Bristol Myers Squibb, Princeton, NJ, USA.
All opinions are the authors’ own and do not necessarily reflect those of our employers.
Acknowledgements
The Prime authors thank Bristol Myers Squibb for their support and collaboration, and for permission to use the Prince EA et al SLR in this analysis.
We thank the Prime Editor, Digital, and Production teams for their support with development of this poster, and Jack Roweth, Prime, London, UK, for assistance with the analysis.
References
1. Michelson M, Reuter K. Contemp Clin Trials Commun. 2019;16:100443.
2. Borah R, et al. BMJ Open. 2017;7:e012545.
3. Prince EA, et al. JCO Precis Oncol. 2021;5:953-973.