# Aggregated Literature Search Results

**Generated**: 2026-04-10 18:42
**Total results**: 234
**Citation style**: Vancouver

---

### [1] ChatTogoVar: a TogoVar-based retrieval-augmented generation system for precise genomic variant interpretation.

**Citation**: 1. Mitsuhashi N, Fujiwara T, Yamaguchi A. ChatTogoVar: a TogoVar-based retrieval-augmented generation system for precise genomic variant interpretation. Human genome variation. 2026. doi:10.1038/s41439-026-00344-4
**Source database**: pubmed
**DOI**: https://doi.org/10.1038/s41439-026-00344-4

**Abstract**: Large language models (LLMs) have recently been adopted to assist in the interpretation of human genomic variants. However, general-purpose LLMs can produce incorrect outputs (commonly termed 'hallucinations'), particularly on specialized queries, raising concerns about their reliability for variant interpretation. Here, to mitigate this risk, we developed ChatTogoVar, a retrieval-augmented generation system that queries TogoVar, a variant database that integrates information, such as allele fre...

---

### [2] Promises and challenges of applying large language models in the healthcare domain.

**Citation**: 2. Wang Q, Gong Z, Lai Z, et al. Promises and challenges of applying large language models in the healthcare domain. Frontiers in digital health. 2026. doi:10.3389/fdgth.2026.1772274
**Source database**: pubmed
**DOI**: https://doi.org/10.3389/fdgth.2026.1772274

**Abstract**: Large language models are rapidly moving from theoretical concepts to active clinical pilots. Current approaches diverge between general-purpose models, which adapt to healthcare via prompt engineering, and domain-specific models, which prioritize deep alignment with medical knowledge graphs to ensure safety. Despite reported benefits in documentation efficiency and diagnostic reasoning, significant challenges remain regarding hallucination, privacy, and the validity of evaluation metrics. This ...

---

### [3] Harnessing Large Language Models in Neonatal Intraventricular Hemorrhage: Exploring Retrieval Augmented Generation Methodology for Prognostic Variable Discovery.

**Citation**: 3. Arora T, Beam K. Harnessing Large Language Models in Neonatal Intraventricular Hemorrhage: Exploring Retrieval Augmented Generation Methodology for Prognostic Variable Discovery. American journal of perinatology. 2026. doi:10.1055/a-2838-5446
**Source database**: pubmed
**DOI**: https://doi.org/10.1055/a-2838-5446

**Abstract**: The objective of this study is to evaluate whether large language models (LLMs) can autonomously synthesize existing literature and accurately extract prognostic variables for neonatal intraventricular hemorrhage (IVH) and its outcomes while assessing their capability for clinical feature ranking and risk stratification.This pilot study employed a systematic literature review combined with retrieval-augmented generation (RAG) methodology. GPT 4 (OpenAI) and Claude Sonnet (4.0, Anthropic) were pr...

---

### [4] Integrating Fine-Tuning and Retrieval-Augmented Generation for Healthcare AI Systems: A Scoping Review.

**Citation**: 4. Collaco BG, Srinivasagam P, Gomez-Cabello CA, et al. Integrating Fine-Tuning and Retrieval-Augmented Generation for Healthcare AI Systems: A Scoping Review. Bioengineering (Basel, Switzerland). 2026. doi:10.3390/bioengineering13020225
**Source database**: pubmed
**DOI**: https://doi.org/10.3390/bioengineering13020225

**Abstract**: (1) Background: Large language models (LLMs) show promise in healthcare but are constrained by hallucinations, static knowledge, and limited domain specificity. Fine-tuning (FT) and retrieval-augmented generation (RAG) offer complementary solutions, with FT embedding domain reasoning and RAG enabling dynamic, up-to-date knowledge access. Hybrid FT + RAG frameworks have been proposed to improve factual accuracy and clinical reliability. This scoping review synthesizes current evidence on such hyb...

---

### [5] Effective prompt design for large language models in clinical practice.

**Citation**: 5. Callens S. Effective prompt design for large language models in clinical practice. Acta clinica Belgica. 2026. doi:10.1080/17843286.2026.2613903
**Source database**: pubmed
**DOI**: https://doi.org/10.1080/17843286.2026.2613903

**Abstract**: Large language models (LLMs) have emerged as transformative healthcare tools for clinical documentation, diagnostic reasoning, and medical education. However, effective utilization requires understanding prompt engineering principles-the strategic design of inputs to optimize performance while mitigating hallucination, bias, and outdated information.

---

### [6] GraphRAG-Enabled Local Large Language Model for Gestational Diabetes Mellitus: Development of a Proof-of-Concept.

**Citation**: 6. Evangelista E, Ruba F, Bukhari S, et al. GraphRAG-Enabled Local Large Language Model for Gestational Diabetes Mellitus: Development of a Proof-of-Concept. JMIR diabetes. 2026. doi:10.2196/76454
**Source database**: pubmed
**DOI**: https://doi.org/10.2196/76454

**Abstract**: Gestational diabetes mellitus (GDM) is a prevalent chronic condition that affects maternal and fetal health outcomes worldwide, increasingly in underserved populations. While generative artificial intelligence (AI) and large language models (LLMs) have shown promise in health care, their application in GDM management remains underexplored.

---

### [7] Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines.

**Citation**: 7. Li W, Zhang H, Zhang H, et al. Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines. IEEE journal of biomedical and health informatics. 2025. doi:10.1109/JBHI.2025.3641931
**Source database**: pubmed
**DOI**: https://doi.org/10.1109/JBHI.2025.3641931

**Abstract**: Current medical language models, adapted from large language models, typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce ...

---

### [8] Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study.

**Citation**: 8. Wang D, Ye J, Li J, et al. Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study. JMIR medical education. 2025. doi:10.2196/70190
**Source database**: pubmed
**DOI**: https://doi.org/10.2196/70190

**Abstract**: Large language models (LLMs) offer the potential to improve virtual patient-physician communication and reduce health care professionals' workload. However, limitations in accuracy, outdated knowledge, and safety issues restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable health care tool.

---

### [9] The Development and Evaluation of a Retrieval-Augmented Generation Large Language Model Virtual Assistant for Postoperative Instructions.

**Citation**: 9. Haider SA, Prabha S, Gomez Cabello CA, et al. The Development and Evaluation of a Retrieval-Augmented Generation Large Language Model Virtual Assistant for Postoperative Instructions. Bioengineering (Basel, Switzerland). 2025. doi:10.3390/bioengineering12111219
**Source database**: pubmed
**DOI**: https://doi.org/10.3390/bioengineering12111219

**Abstract**: During postoperative recovery, patients and their caregivers often lack crucial information, leading to numerous repetitive inquiries that burden healthcare providers. Traditional discharge materials, including paper handouts and patient portals, are often static, overwhelming, or underutilized, leading to patient overwhelm and contributing to unnecessary ER visits and overall healthcare overutilization. Conversational chatbots offer a solution, but Natural Language Processing (NLP) systems are ...

---

### [10] Retrieval-augmented generation for interpreting clinical laboratory regulations using large language models.

**Citation**: 10. Nanua S, Steward R, Neely B, et al. Retrieval-augmented generation for interpreting clinical laboratory regulations using large language models. Journal of pathology informatics. 2025. doi:10.1016/j.jpi.2025.100520
**Source database**: pubmed
**DOI**: https://doi.org/10.1016/j.jpi.2025.100520

**Abstract**: Large language models (LLMs) have demonstrated strong performance on general knowledge tasks, but they have important limitations as standalone tools for question answering in specialized domains where accuracy and consistency are critical. Retrieval-augmented generation (RAG) is a strategy in which LLM outputs are grounded in dynamically retrieved source documents, offering advantages in accuracy, explainability, and maintainability. We developed and evaluated a custom RAG system called Raven, ...

---

### [11] Performance of Large Language Models on the Acute Coronary Syndrome Guidelines Using Retrieval-Augmented Generation.

**Citation**: 11. Alexandrou M, Kumar S, Mahtani AU, et al. Performance of Large Language Models on the Acute Coronary Syndrome Guidelines Using Retrieval-Augmented Generation. JACC. Cardiovascular interventions. 2025. doi:10.1016/j.jcin.2025.08.019
**Source database**: pubmed
**DOI**: https://doi.org/10.1016/j.jcin.2025.08.019

**Abstract**: Large language models (LLMs) are increasingly applied in interventional cardiology, but hallucinations limit their clinical utility.

---

### [12] Evaluating Web Retrieval-Assisted Large Language Models With and Without Whitelisting for Evidence-Based Neurology: Comparative Study.

**Citation**: 12. Masanneck L, Epping PZ, Meuth SG, et al. Evaluating Web Retrieval-Assisted Large Language Models With and Without Whitelisting for Evidence-Based Neurology: Comparative Study. Journal of medical Internet research. 2025. doi:10.2196/79379
**Source database**: pubmed
**DOI**: https://doi.org/10.2196/79379

**Abstract**: Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach might become the new "gateway" to the internet. However, open-web retrieval exposes models to nonprofessional sources, risking hallucinations and factual errors that might jeopardize evidence-based care.

---

### [13] Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study.

**Citation**: 13. Baur D, Ansorg J, Heyde CE, et al. Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study. JMIR AI. 2025. doi:10.2196/75262
**Source database**: pubmed
**DOI**: https://doi.org/10.2196/75262

**Abstract**: Large language models are increasingly applied in health care for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and a lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, par...

---

### [14] MEGA-RAG: a retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health.

**Citation**: 14. Xu S, Yan Z, Dai C, et al. MEGA-RAG: a retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health. Frontiers in public health. 2025. doi:10.3389/fpubh.2025.1635381
**Source database**: pubmed
**DOI**: https://doi.org/10.3389/fpubh.2025.1635381

**Abstract**: The increasing adoption of large language models (LLMs) in public health has raised significant concerns about hallucinations-factually inaccurate or misleading outputs that can compromise clinical communication and policy decisions.

---

### [15] Exploring Patient Perspectives, Engagement, and Output Quality in Doctor-Supervised Use of Artificial Intelligence During Informed Consent Consultation With ChatGPT and Retrieval Augmented Generation (RAG): Quantitative Exploratory Study.

**Citation**: 15. Donner S, Knauer P, Kienzle A, et al. Exploring Patient Perspectives, Engagement, and Output Quality in Doctor-Supervised Use of Artificial Intelligence During Informed Consent Consultation With ChatGPT and Retrieval Augmented Generation (RAG): Quantitative Exploratory Study. Journal of medical Internet research. 2025. doi:10.2196/73717
**Source database**: pubmed
**DOI**: https://doi.org/10.2196/73717

**Abstract**: Comprehensive preoperative education is essential for optimizing outcomes and ensuring informed consent in patients undergoing total hip arthroplasty (THA). Emerging artificial intelligence (AI) tools, such as ChatGPT, offer scalable support for patient education, but their clinical application requires rigorous evaluation to ensure accuracy, safety, and trust.

---

### [16] Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology.

**Citation**: 16. Gilson A, Ai X, Arunachalam T, et al. Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology. ArXiv. 2024.
**Source database**: pubmed

**Abstract**: Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with ~70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consum...

---

### [17] Large language models in clinical nutrition: an overview of its applications, capabilities, limitations, and potential future prospects.

**Citation**: 17. Belkhouribchia J, Pen JJ. Large language models in clinical nutrition: an overview of its applications, capabilities, limitations, and potential future prospects. Frontiers in nutrition. 2025. doi:10.3389/fnut.2025.1635682
**Source database**: pubmed
**DOI**: https://doi.org/10.3389/fnut.2025.1635682

**Abstract**: The integration of large language models (LLMs) into clinical nutrition marks a transformative advancement, offering promising solutions for enhancing patient care, personalizing dietary recommendations, and supporting evidence-based clinical decision-making. Trained on extensive text corpora and powered by transformer-based architectures, LLMs demonstrate remarkable capabilities in natural language understanding and generation. This review provides an overview of their current and potential app...

---

### [18] Large language models for clinical decision support in gastroenterology and hepatology.

**Citation**: 18. Wiest IC, Bhat M, Clusmann J, et al. Large language models for clinical decision support in gastroenterology and hepatology. Nature reviews. Gastroenterology & hepatology. 2025. doi:10.1038/s41575-025-01108-1
**Source database**: pubmed
**DOI**: https://doi.org/10.1038/s41575-025-01108-1

**Abstract**: Clinical decision making in gastroenterology and hepatology has become increasingly complex and challenging for physicians. This growing complexity can be addressed by computational tools that support clinical decisions. Although numerous clinical decision support systems (CDSS) have emerged, they have faced difficulties with real-world performance and generalizability, resulting in limited clinical adoption. Generative artificial intelligence (AI), particularly large language models (LLMs), are...

---

### [19] Context-Aware Retrieval-Augmented Generation for Artificial Intelligence in Urology.

**Citation**: 19. Sriram A, N M, Sundan B, et al. Context-Aware Retrieval-Augmented Generation for Artificial Intelligence in Urology. Cureus. 2025. doi:10.7759/cureus.88167
**Source database**: pubmed
**DOI**: https://doi.org/10.7759/cureus.88167

**Abstract**: Background Artificial intelligence (AI) is increasingly being used in healthcare, particularly for interpreting complex medical queries. However, conventional AI models often generate inaccurate or irrelevant responses that are commonly termed hallucinations, which may compromise patient safety. To address this, our study introduces a modified retrieval-augmented generation (RAG) framework tailored for the urology domain to enhance contextual relevance and accuracy in AI-generated responses. Met...

---

### [20] Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation.

**Citation**: 20. Wada A, Tanaka Y, Nishizawa M, et al. Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation. NPJ digital medicine. 2025. doi:10.1038/s41746-025-01802-z
**Source database**: pubmed
**DOI**: https://doi.org/10.1038/s41746-025-01802-z

**Abstract**: Large language models (LLMs) demonstrate significant potential in healthcare applications, but clinical deployment is limited by privacy concerns and insufficient medical domain training. This study investigated whether retrieval-augmented generation (RAG) can improve locally deployable LLM for radiology contrast media consultation. In 100 synthetic iodinated contrast media consultations we compared Llama 3.2-11B (baseline and RAG) with three cloud-based models-GPT-4o mini, Gemini 2.0 Flash and ...

---

### [21] Retrieval augmented generation for large language models in healthcare: A systematic review.

**Citation**: 21. Amugongo LM, Mascheroni P, Brooks S, et al. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS digital health. 2025. doi:10.1371/journal.pdig.0000877
**Source database**: pubmed
**DOI**: https://doi.org/10.1371/journal.pdig.0000877

**Abstract**: Large Language Models (LLMs) have demonstrated promising capabilities to solve complex tasks in critical sectors such as healthcare. However, LLMs are limited by their training data which is often outdated, the tendency to generate inaccurate ("hallucinated") content and a lack of transparency in the content they generate. To address these limitations, retrieval augmented generation (RAG) grounds the responses of LLMs by exposing them to external knowledge sources. However, in the healthcare dom...

---

### [22] Enhancing medical AI with retrieval-augmented generation: A mini narrative review.

**Citation**: 22. Gargari OK, Habibi G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digital health. 2025. doi:10.1177/20552076251337177
**Source database**: pubmed
**DOI**: https://doi.org/10.1177/20552076251337177

**Abstract**: Retrieval-augmented generation (RAG) is a powerful technique in artificial intelligence (AI) and machine learning that enhances the capabilities of large language models (LLMs) by integrating external data sources, allowing for more accurate, contextually relevant responses. In medical applications, RAG has the potential to improve diagnostic accuracy, clinical decision support, and patient care. This narrative review explores the application of RAG across various medical domains, including guid...

---

### [23] Leveraging long context in retrieval augmented language models for medical question answering.

**Citation**: 23. Zhang G, Xu Z, Jin Q, et al. Leveraging long context in retrieval augmented language models for medical question answering. NPJ digital medicine. 2025. doi:10.1038/s41746-025-01651-w
**Source database**: pubmed
**DOI**: https://doi.org/10.1038/s41746-025-01651-w

**Abstract**: While holding great promise for improving and facilitating healthcare through applications of medical literature summarization, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be...

---

### [24] A Current Review of Generative AI in Medicine: Core Concepts, Applications, and Current Limitations.

**Citation**: 24. Rouzrokh P, Khosravi B, Faghani S, et al. A Current Review of Generative AI in Medicine: Core Concepts, Applications, and Current Limitations. Current reviews in musculoskeletal medicine. 2025. doi:10.1007/s12178-025-09961-y
**Source database**: pubmed
**DOI**: https://doi.org/10.1007/s12178-025-09961-y

**Abstract**: This review aims to offer a foundational overview of Generative Artificial Intelligence (AI) for healthcare professionals without an engineering background. It seeks to aid their understanding of Generative AI's current capabilities, applications, and limitations within the medical field.

---

### [25] RAGing ahead in rheumatology: new language model architectures to tame artificial intelligence.

**Citation**: 25. Benavent D, Venerito V, Michelena X. RAGing ahead in rheumatology: new language model architectures to tame artificial intelligence. Therapeutic advances in musculoskeletal disease. 2025. doi:10.1177/1759720X251331529
**Source database**: pubmed
**DOI**: https://doi.org/10.1177/1759720X251331529

**Abstract**: Artificial intelligence (AI) is increasingly transforming rheumatology with research on disease detection, monitoring, and outcome prediction through the analysis of large datasets. The advent of generative models and large language models (LLMs) has expanded AI's capabilities, particularly in natural language processing (NLP) tasks such as question-answering and medical literature synthesis. While NLP has shown promise in identifying rheumatic diseases from electronic health records with high a...

---

### [26] Thyro-GenAI: A Chatbot Using Retrieval-Augmented Generative Models for Personalized Thyroid Disease Management.

**Citation**: 26. Shin M, Song J, Kim MG, et al. Thyro-GenAI: A Chatbot Using Retrieval-Augmented Generative Models for Personalized Thyroid Disease Management. Journal of clinical medicine. 2025. doi:10.3390/jcm14072450
**Source database**: pubmed
**DOI**: https://doi.org/10.3390/jcm14072450

---

### [27] Evidence-based artificial intelligence: Implementing retrieval-augmented generation models to enhance clinical decision support in plastic surgery.

**Citation**: 27. Ozmen BB, Mathur P. Evidence-based artificial intelligence: Implementing retrieval-augmented generation models to enhance clinical decision support in plastic surgery. Journal of plastic, reconstructive & aesthetic surgery: JPRAS. 2025. doi:10.1016/j.bjps.2025.03.053
**Source database**: pubmed
**DOI**: https://doi.org/10.1016/j.bjps.2025.03.053

**Abstract**: The rapid advancement of large language models (LLMs) has generated significant enthusiasm within healthcare, especially in supporting clinical decision-making and patient management. However, inherent limitations including hallucinations, outdated clinical context, and unreliable references pose serious concerns for their clinical utility. Retrieval-Augmented Generation (RAG) models address these limitations by integrating validated, curated medical literature directly into AI workflows, signif...

---

### [28] Utilizing large language models for gastroenterology research: a conceptual framework.

**Citation**: 28. Berry P, Dhanakshirur RR, Khanna S. Utilizing large language models for gastroenterology research: a conceptual framework. Therapeutic advances in gastroenterology. 2025. doi:10.1177/17562848251328577
**Source database**: pubmed
**DOI**: https://doi.org/10.1177/17562848251328577

**Abstract**: Large language models (LLMs) transform healthcare by assisting clinicians with decision-making, research, and patient management. In gastroenterology, LLMs have shown potential in clinical decision support, data extraction, and patient education. However, challenges such as bias, hallucinations, integration with clinical workflows, and regulatory compliance must be addressed for safe and effective implementation. This manuscript presents a structured framework for integrating LLMs into gastroent...

---

### [29] A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration.

**Citation**: 29. Vrdoljak J, Boban Z, Vilović M, et al. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare (Basel, Switzerland). 2025. doi:10.3390/healthcare13060603
**Source database**: pubmed
**DOI**: https://doi.org/10.3390/healthcare13060603

---

### [30] Empowering large language models for automated clinical assessment with generation-augmented retrieval and hierarchical chain-of-thought.

**Citation**: 30. Gu Z, Jia W, Piccardi M, et al. Empowering large language models for automated clinical assessment with generation-augmented retrieval and hierarchical chain-of-thought. Artificial intelligence in medicine. 2025. doi:10.1016/j.artmed.2025.103078
**Source database**: pubmed
**DOI**: https://doi.org/10.1016/j.artmed.2025.103078

**Abstract**: Understanding and extracting valuable information from electronic health records (EHRs) is important for improving healthcare delivery and health outcomes. Large language models (LLMs) have demonstrated significant proficiency in natural language understanding and processing, offering promises for automating the typically labor-intensive and time-consuming analytical tasks with EHRs. Despite the active application of LLMs in the healthcare setting, many foundation models lack real-world healthca...

---

### [31] Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records.

**Citation**: 31. Alkhalaf M, Yu P, Yin M, et al. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. Journal of biomedical informatics. 2024. doi:10.1016/j.jbi.2024.104662
**Source database**: pubmed
**DOI**: https://doi.org/10.1016/j.jbi.2024.104662

**Abstract**: Malnutrition is a prevalent issue in aged care facilities (RACFs), leading to adverse health outcomes. The ability to efficiently extract key clinical information from a large volume of data in electronic health records (EHR) can improve understanding about the extent of the problem and developing effective interventions. This research aimed to test the efficacy of zero-shot prompt engineering applied to generative artificial intelligence (AI) models on their own and in combination with retrieva...

---

### [32] Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications.

**Citation**: 32. Miao J, Thongprayoon C, Suppadungsuk S, et al. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina (Kaunas, Lithuania). 2024. doi:10.3390/medicina60030445
**Source database**: pubmed
**DOI**: https://doi.org/10.3390/medicina60030445

**Abstract**: The integration of large language models (LLMs) into healthcare, particularly in nephrology, represents a significant advancement in applying advanced technology to patient care, medical research, and education. These advanced models have progressed from simple text processors to tools capable of deep language understanding, offering innovative ways to handle health-related data, thus improving medical practice efficiency and effectiveness. A significant challenge in medical applications of LLMs...

---

### [33] AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

**Citation**: 33. Jingyuan Qi, Zhiyang Xu, Qifan Wang, et al. AR-RAG: Autoregressive Retrieval Augmentation for Image Generation. arXiv:2506.06962v3. 2025.
**Source database**: arxiv

**Abstract**: We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating knearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level v...

---

### [34] Intelligent Interaction Strategies for Context-Aware Cognitive Augmentation

**Citation**: 34. Xiangrong Zhu, Yuan Xu, et al. Intelligent Interaction Strategies for Context-Aware Cognitive Augmentation. arXiv:2504.13684v1. 2025.
**Source database**: arxiv

**Abstract**: Human cognition is constrained by processing limitations, leading to cognitive overload and inefficiencies in knowledge synthesis and decision-making. Large Language Models (LLMs) present an opportunity for cognitive augmentation, but their current reactive nature limits their real-world applicability. This position paper explores the potential of context-aware cognitive augmentation, where LLMs dynamically adapt to users' cognitive states and task environments to provide appropriate support. Th...

---

### [35] Factually: Exploring Wearable Fact-Checking for Augmented Truth Discernment

**Citation**: 35. Chitralekha Gupta, Hanjun Wu, Praveen Sasikumar, et al. Factually: Exploring Wearable Fact-Checking for Augmented Truth Discernment. arXiv:2504.17204v1. 2025.
**Source database**: arxiv

**Abstract**: Wearable devices are transforming human capabilities by seamlessly augmenting cognitive functions. In this position paper, we propose a voice-based, interactive learning companion designed to amplify and extend cognitive abilities through informal learning. Our vision is threefold: (1) to enable users to discover new knowledge on-the-go through contextual interactive quizzes, fostering critical thinking and mindfulness, (2) to proactively detect misinformation, empowering users to critically ass...

---

### [36] Designing AI Systems that Augment Human Performed vs. Demonstrated Critical Thinking

**Citation**: 36. Katelyn Xiaoying Mei, Nic Weber. Designing AI Systems that Augment Human Performed vs. Demonstrated Critical Thinking. arXiv:2504.14689v1. 2025.
**Source database**: arxiv

**Abstract**: The recent rapid advancement of LLM-based AI systems has accelerated our search and production of information. While the advantages brought by these systems seemingly improve the performance or efficiency of human activities, they do not necessarily enhance human capabilities. Recent research has started to examine the impact of generative AI on individuals' cognitive abilities, especially critical thinking. Based on definitions of critical thinking across psychology and education, this position...

---

### [37] Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

**Citation**: 37. Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, et al. Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation. arXiv:2411.18583v1. 2024.
**Source database**: arxiv

**Abstract**: This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF file...

---

### [38] EVOR: Evolving Retrieval for Code Generation

**Citation**: 38. Hongjin Su, Shuyang Jiang, Yuhang Lai, et al. EVOR: Evolving Retrieval for Code Generation. arXiv:2402.12317v2. 2024.
**Source database**: arxiv

**Abstract**: Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have insufficient knowledge of. In this work, we develop a novel pipeline, EVOR, that employs the synchronous evolution of both queries and diverse knowledge bases. On two realistic settin...

---

### [39] Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation

**Citation**: 39. Ali Naseh, Yuefeng Peng, Anshuman Suri, et al. Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation. arXiv:2502.00306v2. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model's context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which ca...

---

### [40] Ragas: Automated Evaluation of Retrieval Augmented Generation

**Citation**: 40. Shahul Es, Jithin James, Luis Espinosa-Anke, et al. Ragas: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217v2. 2023.
**Source database**: arxiv

**Abstract**: We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are...

---

### [41] FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation

**Citation**: 41. Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli. FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation. arXiv:2510.22344v1. 2025.
**Source database**: arxiv

**Abstract**: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel a...

---

### [42] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

**Citation**: 42. Daeyong Kwon, SeungHeon Doh, Juhan Nam. MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation. arXiv:2507.23334v2. 2025.
**Source database**: arxiv

**Abstract**: Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text...

---

### [43] Open-Source Retrieval Augmented Generation Framework for Retrieving Accurate Medication Insights from Formularies for African Healthcare Workers

**Citation**: 43. Axum AI, J. Owoyemi, et al. Open-Source Retrieval Augmented Generation Framework for Retrieving Accurate Medication Insights from Formularies for African Healthcare Workers. arXiv:2502.15722v1. 2025.
**Source database**: arxiv

**Abstract**: Accessing accurate medication insights is vital for enhancing patient safety, minimizing errors, and supporting clinical decision-making. However, healthcare professionals in Africa often rely on manual and time-consuming processes to retrieve drug information, exacerbated by limited access to pharmacists due to brain drain and healthcare disparities. This paper presents "Drug Insights," an open-source Retrieval-Augmented Generation (RAG) chatbot designed to streamline medication lookup for heal...

---

### [44] Engineering the RAG Stack: A Comprehensive Review of the Architecture and Trust Frameworks for Retrieval-Augmented Generation Systems

**Citation**: 44. Dean Wampler, Dave Nielson, Alireza Seddighi. Engineering the RAG Stack: A Comprehensive Review of the Architecture and Trust Frameworks for Retrieval-Augmented Generation Systems. arXiv:2601.05264v1. 2026.
**Source database**: arxiv

**Abstract**: This article provides a comprehensive systematic literature review of academic studies, industrial applications, and real-world deployments from 2018 to 2025, providing a practical guide and detailed overview of modern Retrieval-Augmented Generation (RAG) architectures. RAG offers a modular approach for integrating external knowledge without increasing the capacity of the model as LLM systems expand. Research and engineering practices have been fragmented as a result of the increasing diversity ...

---

### [45] RAGPart & RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation

**Citation**: 45. Pankayaraj Pathmanathan, Michael-Andrei Panaitescu-Liess, Cho-Yu Jason Chiang, et al. RAGPart & RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation. arXiv:2512.24268v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to enhance large language models (LLMs) with external knowledge, reducing hallucinations and compensating for outdated information. However, recent studies have exposed a critical vulnerability in RAG pipelines: corpus poisoning, where adversaries inject malicious documents into the retrieval corpus to manipulate model outputs. In this work, we propose two complementary retrieval-stage defenses: RAGPart and RAGMask. Our defen...

---

### [46] MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

**Citation**: 46. Wenlong Wu, Haofen Wang, Bohan Li, et al. MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation. arXiv:2508.03553v1. 2025. doi:10.1109/ICDE65448.2025.00230
**Source database**: arxiv
**DOI**: https://doi.org/10.1109/ICDE65448.2025.00230

**Abstract**: Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among ...

---

### [47] Hybrid-Code v2: Zero-Hallucination Clinical ICD-10 Coding via Neuro-Symbolic Verification and Automated Knowledge Base Expansion

**Citation**: 47. Yunguo Yu. Hybrid-Code v2: Zero-Hallucination Clinical ICD-10 Coding via Neuro-Symbolic Verification and Automated Knowledge Base Expansion. arXiv:2512.23743v2. 2025.
**Source database**: arxiv

**Abstract**: Automated clinical ICD-10 coding is a high-impact healthcare task requiring a balance between coverage, precision, and safety. While neural approaches achieve strong performance, they suffer from hallucination, generating invalid or unsupported codes and posing unacceptable risks in safety-critical clinical settings. Rule-based systems eliminate hallucination but lack scalability and coverage due to manual knowledge base (KB) curation. We present Hybrid-Code v2, a neuro-symbolic framework that achi...

---

### [48] SoftTiger: A Clinical Foundation Model for Healthcare Workflows

**Citation**: 48. Ye Chen, Igor Couto, Wei Cai, et al. SoftTiger: A Clinical Foundation Model for Healthcare Workflows. arXiv:2403.00868v3. 2024.
**Source database**: arxiv

**Abstract**: We introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare workflows. The narrative and unstructured nature of clinical notes is a major obstacle for healthcare intelligentization. We address a critical problem of structuring clinical notes into clinical data, according to international interoperability standards. We collect and annotate data for three subtasks, namely, international patient summary, clinical impression and medical encounter. We ...

---

### [49] Utilizing Metadata for Better Retrieval-Augmented Generation

**Citation**: 49. Raquib Bin Yousuf, Shengzhe Xu, Mandar Sharma, et al. Utilizing Metadata for Better Retrieval-Augmented Generation. arXiv:2601.11863v1. 2026.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk similarity alone often fails to distinguish between documents with overlapping language. Practitioners often flatten metadata into input text as a heuristic, but the impact and trade-offs of this practice remain poorly understood. We present a systematic study of m...

---

### [50] CARROT: A Learned Cost-Constrained Retrieval Optimization System for RAG

**Citation**: 50. Ziting Wang, Haitao Yuan, Wei Dong, et al. CARROT: A Learned Cost-Constrained Retrieval Optimization System for RAG. arXiv:2411.00744v2. 2024.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) have demonstrated impressive ability in generation and reasoning tasks but struggle with handling up-to-date knowledge, leading to inaccuracies or hallucinations. Retrieval-Augmented Generation (RAG) mitigates this by retrieving and incorporating external knowledge into input prompts. In particular, due to LLMs' context window limitations and long-context hallucinations, only the most relevant "chunks" are retrieved. However, current RAG systems face three key challe...

---

### [51] Privacy-preserving machine learning for healthcare: open challenges and future perspectives

**Citation**: 51. Alejandro Guerra-Manzanares, L. Julian Lechuga Lopez, Michail Maniatakos, et al. Privacy-preserving machine learning for healthcare: open challenges and future perspectives. arXiv:2303.15563v1. 2023. doi:10.1007/978-3-031-39539-0_3
**Source database**: arxiv
**DOI**: https://doi.org/10.1007/978-3-031-39539-0_3

**Abstract**: Machine Learning (ML) has recently shown tremendous success in modeling various healthcare prediction tasks, ranging from disease diagnosis and prognosis to patient treatment. Due to the sensitive nature of medical data, privacy must be considered along the entire ML pipeline, from model training to inference. In this paper, we conduct a review of recent literature concerning Privacy-Preserving Machine Learning (PPML) for healthcare. We primarily focus on privacy-preserving training and inferenc...

---

### [52] Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges

**Citation**: 52. Madhura Joshi, Ankit Pal, Malaikannan Sankarasubbu. Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges. arXiv:2211.07893v2. 2022. doi:10.1145/3533708
**Source database**: arxiv
**DOI**: https://doi.org/10.1145/3533708

**Abstract**: Federated learning is the process of developing machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices while preventing data leakage. This survey examines previous research and studies on federated learning in the healthcare sector across a range of use cases and applications. Our survey shows what challenges, methods, and applications a practitioner should be aware of in the topic of federated learning. This paper aims...

---

### [53] To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation

**Citation**: 53. Kaustubh D. Dhole. To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation. arXiv:2501.09292v3. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation equips large language models with the capability to retrieve external knowledge, thereby mitigating hallucinations by incorporating information beyond the model's intrinsic abilities. However, most prior works have focused on invoking retrieval deterministically, which makes it unsuitable for tasks such as long-form question answering. Instead, dynamically performing retrieval by invoking it only when the underlying LLM lacks the required knowledge can be more effi...

---

### [54] Reducing hallucination in structured outputs via Retrieval-Augmented Generation

**Citation**: 54. Patrice Béchard, Orlando Marquez Ayala. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. arXiv:2404.08189v1. 2024. doi:10.18653/v1/2024.naacl-industry.19
**Source database**: arxiv
**DOI**: https://doi.org/10.18653/v1/2024.naacl-industry.19

**Abstract**: A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLM) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of ...

---

### [55] Document Understanding for Healthcare Referrals

**Citation**: 55. Jimit Mistry, Natalia M. Arzeno. Document Understanding for Healthcare Referrals. arXiv:2309.13184v1. 2023. doi:10.1109/ICHI57859.2023.00067
**Source database**: arxiv
**DOI**: https://doi.org/10.1109/ICHI57859.2023.00067

**Abstract**: Reliance on scanned documents and fax communication for healthcare referrals leads to high administrative costs and errors that may affect patient care. In this work we propose a hybrid model leveraging LayoutLMv3 along with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents. We explore some of the challenges in applying a document understanding model to referrals, which have formats varying by medical practice, and evaluate model perf...

---

### [56] IGMiRAG: Intuition-Guided Retrieval-Augmented Generation with Adaptive Mining of In-Depth Memory

**Citation**: 56. Xingliang Hou, Yuyan Liu, Qi Sun, et al. IGMiRAG: Intuition-Guided Retrieval-Augmented Generation with Adaptive Mining of In-Depth Memory. arXiv:2602.07525v1. 2026.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) equips large language models (LLMs) with reliable knowledge memory. To strengthen cross-text associations, recent research integrates graphs and hypergraphs into RAG to capture pairwise and multi-entity relations as structured links. However, their misaligned memory organization necessitates costly, disjointed retrieval. To address these limitations, we propose IGMiRAG, a framework inspired by human intuition-guided reasoning. It constructs a hierarchical het...

---

### [57] AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles

**Citation**: 57. Aritra Kumar Lahiri, Qinmin Vivian Hu. AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles. arXiv:2412.16701v3. 2024. doi:10.3390/make7030089
**Source database**: arxiv
**DOI**: https://doi.org/10.3390/make7030089

**Abstract**: Recent advancements in generative AI have fostered the development of highly adept Large Language Models (LLMs) that integrate diverse data types to empower decision-making. Among these, multimodal retrieval-augmented generation (RAG) applications are promising because they combine the strengths of information retrieval and generative models, enhancing their utility across various domains, including clinical use cases. This paper introduces AlzheimerRAG, a Multimodal RAG application for clinical...

---

### [58] Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation

**Citation**: 58. Carlo Merola, Jaspinder Singh. Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation. arXiv:2504.19754v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of external knowledge be managed effectively within the input constraints of LLMs? Traditional methods address this by chunking external documents into smaller, fixed-size segments. While this approach alleviates input limitations, it often fragments context, resulting...
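For orientation, the baseline that advanced chunking strategies like those in this paper are compared against is plain fixed-size chunking with overlap. The sketch below is illustrative only; the `size` and `overlap` defaults are arbitrary choices, not values from the paper:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlap is the usual mitigation for the context fragmentation
    discussed above: a sentence cut at one chunk boundary is normally
    intact in the neighbouring chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Slide a window of `size` characters forward by `step` each time.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Semantic and structure-aware chunkers keep this sliding-window shape but replace the fixed `step` with boundaries derived from sentences, sections, or embedding similarity.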

---

### [59] Lightweight Transformers for Clinical Natural Language Processing

**Citation**: 59. Omid Rohanian, Mohammadmahdi Nouriborji, Hannah Jauncey, et al. Lightweight Transformers for Clinical Natural Language Processing. arXiv:2302.04725v1. 2023. doi:10.1017/S1351324923000542
**Source database**: arxiv
**DOI**: https://doi.org/10.1017/S1351324923000542

**Abstract**: Specialised pre-trained language models are becoming more frequent in NLP since they can potentially outperform models trained on generic texts. BioBERT and BioClinicalBERT are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like Knowledge Distillation (KD), it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifi...

---

### [60] Enhancing Critical Thinking with AI: A Tailored Warning System for RAG Models

**Citation**: 60. Xuyang Zhu, Sejoon Chang, Andrew Kuik. Enhancing Critical Thinking with AI: A Tailored Warning System for RAG Models. arXiv:2504.16883v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) systems offer a powerful approach to enhancing large language model (LLM) outputs by incorporating fact-checked, contextually relevant information. However, fairness and reliability concerns persist, as hallucinations can emerge at both the retrieval and generation stages, affecting users' reasoning and decision-making. Our research explores how tailored warning messages -- whose content depends on the specific context of hallucination -- shape user reasoning...

---

### [61] Towards Smart Healthcare: Challenges and Opportunities in IoT and ML

**Citation**: 61. Munshi Saifuzzaman, Tajkia Nuri Ananna. Towards Smart Healthcare: Challenges and Opportunities in IoT and ML. arXiv:2312.05530v2. 2023. doi:10.1007/978-981-97-5624-7_10
**Source database**: arxiv
**DOI**: https://doi.org/10.1007/978-981-97-5624-7_10

**Abstract**: The COVID-19 pandemic and other ongoing health crises have underscored the need for prompt healthcare services worldwide. The traditional healthcare system, centered around hospitals and clinics, has proven inadequate in the face of such challenges. Intelligent wearable devices, a key part of modern healthcare, leverage Internet of Things technology to collect extensive data related to the environment as well as psychological, behavioral, and physical health. However, managing the substantial da...

---

### [62] Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers

**Citation**: 62. Kunal Sawarkar, Abhilasha Mangal, Shivam Raj Solanki. Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers. arXiv:2404.07220v2. 2024. doi:10.1109/MIPR62202.2024.00031
**Source database**: arxiv
**DOI**: https://doi.org/10.1109/MIPR62202.2024.00031

**Abstract**: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLM) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an outsized role in the overall RAG accuracy by extracting the most relevant document from the corpus to provide context to the LLM. In this paper, we propose the 'Blended RAG' method of...

---

### [63] Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models

**Citation**: 63. Zahra Khalila, Arbi Haza Nasution, Winda Monika, et al. Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models. arXiv:2503.16581v1. 2025. doi:10.14569/IJACSA.2025.01602134
**Source database**: arxiv
**DOI**: https://doi.org/10.14569/IJACSA.2025.01602134

**Abstract**: Accurate and contextually faithful responses are critical when applying large language models (LLMs) to sensitive and domain-specific tasks, such as answering queries related to quranic studies. General-purpose LLMs often struggle with hallucinations, where generated responses deviate from authoritative sources, raising concerns about their reliability in religious contexts. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response acc...

---

### [64] Video Enriched Retrieval Augmented Generation Using Aligned Video Captions

**Citation**: 64. Kevin Dela Rosa. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv:2405.17706v1. 2024.
**Source database**: arxiv

**Abstract**: In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions are able to describe the visual and audio content of videos in a large corpus while having the advantage of being in a textual format that is both easy to reason about & incorporate into large language model (LLM) prompts, but also typically require less multimedia content to be inse...

---

### [65] Expert Mind: A Retrieval-Augmented Architecture for Expert Knowledge Preservation in the Energy Sector

**Citation**: 65. Diego Ezequiel Cervera. Expert Mind: A Retrieval-Augmented Architecture for Expert Knowledge Preservation in the Energy Sector. arXiv:2603.14541v1. 2026.
**Source database**: arxiv

**Abstract**: The departure of subject-matter experts from industrial organizations results in the irreversible loss of tacit knowledge that is rarely captured through conventional documentation practices. This paper proposes Expert Mind, an experimental system that leverages Retrieval-Augmented Generation (RAG), large language models (LLMs), and multimodal capture techniques to preserve, structure, and make queryable the deep expertise of organizational knowledge holders. Drawing on the specific context of t...

---

### [66] Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

**Citation**: 66. Suhas BN, Han-Chin Shing, Lei Xu, et al. Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization. arXiv:2506.00448v1. 2025.
**Source database**: arxiv

**Abstract**: Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical...

---

### [67] Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare

**Citation**: 67. Saeedeh Javadi, Sara Mirabi, Manan Gangar, et al. Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare. arXiv:2511.06668v2. 2025.
**Source database**: arxiv

**Abstract**: In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related quer...

---

### [68] Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering

**Citation**: 68. Li Jiapeng, Liu Runze, Li Yabo, et al. Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering. arXiv:2404.14464v1. 2024.
**Source database**: arxiv

**Abstract**: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works have introduced retrieval-augmentation in the CoT reasoning to solve multi-hop question answering. However, these chain methods have the following problems: 1) Retrieved irrelevant pa...

---

### [69] The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation for Healthcare QA

**Citation**: 69. Eric Yang, Jonathan Amar, Jong Ha Lee, et al. The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation for Healthcare QA. arXiv:2407.18044v2. 2024.
**Source database**: arxiv

**Abstract**: Deploying Large Language Models (LLMs) for healthcare question answering requires robust methods to ensure accuracy and reliability. This work introduces Query-Based Retrieval Augmented Generation (QB-RAG), a framework for enhancing Retrieval-Augmented Generation (RAG) systems in healthcare question-answering by pre-aligning user queries with a database of curated, answerable questions derived from healthcare content. A key component of QB-RAG is an LLM-based filtering mechanism that ensures tha...

---

### [70] Retrieval Augmented Thought Process for Private Data Handling in Healthcare

**Citation**: 70. Thomas Pouplin, Hao Sun, Samuel Holt, et al. Retrieval Augmented Thought Process for Private Data Handling in Healthcare. arXiv:2402.07812v2. 2024.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) have demonstrated the strong potential to assist both clinicians and the general public with their extensive medical knowledge. However, their application in healthcare is constrained due to concerns about the privacy of data used in training, which prevents the integration of private and personal information because of security and ethical issues. Moreover, if their capabilities can be enhanced with information retrieval to access up-to-date knowledge, the current i...

---

### [71] Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling

**Citation**: 71. Sebastian Hofstätter, Jiecao Chen, Karthik Raman, et al. Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling. arXiv:2207.03030v1. 2022.
**Source database**: arxiv

**Abstract**: This paper studies multi-task training of retrieval-augmented generation models for knowledge-intensive tasks. We propose to clean the training set by utilizing a distinct property of knowledge-intensive generation: The connection of query-answer pairs to items in the knowledge base. We filter training examples via a threshold of confidence on the relevance labels, whether a pair is answerable by the knowledge base or not. We train a single Fusion-in-Decoder (FiD) generator on seven combined tas...

---

### [72] Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding

**Citation**: 72. Yuqing Wang, Yun Zhao, Linda Petzold. Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding. arXiv:2304.05368v3. 2023.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of state-of-the-art LLMs, namely GPT-3.5, GPT-4, and Bard, within the realm of clinical language understanding tasks. These tasks span a diverse range, including named entity recognition, relat...

---

### [73] Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval

**Citation**: 73. Chuang Zhao, Hui Tang, Hongke Zhao, et al. Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval. arXiv:2511.13293v1. 2025.
**Source database**: arxiv

**Abstract**: Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants...

---

### [74] DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

**Citation**: 74. Esakkivel Esakkiraja, Denis Akhiyarov, Aditya Shanmugham, et al. DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation. arXiv:2509.25716v1. 2025.
**Source database**: arxiv

**Abstract**: Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that capture the challenge of unclear API usa...

---

### [75] Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis

**Citation**: 75. Chandana Sree Mala, Gizem Gezici, Fosca Giannotti. Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis. arXiv:2504.05324v1. 2025.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic sea...
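As an illustrative aside (not code from the paper), the sparse and dense retrievers compared above are commonly combined into a hybrid ranking via reciprocal rank fusion (RRF); a minimal sketch, assuming each retriever returns a ranked list of document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain
    it; k=60 is the constant commonly used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked outputs from a BM25 retriever and a dense retriever.
sparse = ["d1", "d3", "d2"]
dense = ["d2", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse, dense])
```

RRF rewards documents ranked highly by either retriever without having to normalize BM25 and embedding-similarity scores onto a common scale.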

---

### [76] Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics

**Citation**: 76. Tyler A. Chang, Katrin Tomanek, Jessica Hoffmann, et al. Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics. arXiv:2403.08904v1. 2024.
**Source database**: arxiv

**Abstract**: We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the given perspectives. As a starting point, we use a deterministic retrieval system and then focus on ...

---

### [77] Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation

**Citation**: 77. Linhai Zhang, Ziyang Gao, Deyu Zhou, et al. Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation. arXiv:2503.01315v1. 2025.
**Source database**: arxiv

**Abstract**: Depression is a widespread mental health disorder, and clinical interviews are the gold standard for assessment. However, their reliance on scarce professionals highlights the need for automated detection. Current systems mainly employ black-box neural networks, which lack interpretability, which is crucial in mental health contexts. Some attempts to improve interpretability use post-hoc LLM generation but suffer from hallucination. To address these limitations, we propose RED, a Retrieval-augme...

---

### [78] Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints

**Citation**: 78. Omid Rohanian, Hannah Jauncey, Mohammadmahdi Nouriborji, et al. Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints. arXiv:2210.09440v2. 2022.
**Source database**: arxiv

**Abstract**: Processing information locked within clinical health records is a challenging task that remains an active area of research in biomedical NLP. In this work, we evaluate a broad set of machine learning techniques ranging from simple RNNs to specialised transformers such as BioBERT on a dataset containing clinical notes along with a set of annotations indicating whether a sample is cancer-related or not. Furthermore, we specifically employ efficient fine-tuning methods from NLP, namely, bottlenec...

---

### [79] RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement

**Citation**: 79. Jinhao Jiang, Jiayi Chen, Junyi Li, et al. RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement. arXiv:2412.12881v1. 2024.
**Source database**: arxiv

**Abstract**: Existing large language models (LLMs) show exceptional problem-solving capabilities but might struggle with complex reasoning tasks. Despite the successes of chain-of-thought and tree-based search methods, they mainly depend on the internal knowledge of LLMs to search over intermediate reasoning steps, limited to dealing with simple tasks involving fewer reasoning steps. In this paper, we propose \textbf{RAG-Star}, a novel RAG approach that integrates the retrieved information to guide the tree-...

---

### [80] T-RAG: Lessons from the LLM Trenches

**Citation**: 80. Masoomali Fatehkia, Ji Kim Lucas, Sanjay Chawla. T-RAG: Lessons from the LLM Trenches. arXiv:2402.07483v2. 2024.
**Source database**: arxiv

**Abstract**: Large Language Models (LLM) have shown remarkable language capabilities fueling attempts to integrate them into applications across a wide range of domains. An important application area is question answering over private enterprise documents where the main considerations are data security, which necessitates applications that can be deployed on-prem, limited computational resources and the need for a robust application that correctly responds to queries. Retrieval-Augmented Generation (RAG) has...

---

### [81] Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs

**Citation**: 81. Mohammad Reza Rezaei, Adji Bousso Dieng. Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs. arXiv:2502.11228v2. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. ...

---

### [82] MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

**Citation**: 82. Yixuan Tang, Yi Yang. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv:2401.15391v1. 2024.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) augments large language models (LLM) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating the great adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking data...

---

### [83] Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

**Citation**: 83. Tommaso Mario Buonocore, Claudio Crema, Alberto Redolfi, et al. Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models. arXiv:2212.10422v3. 2022. doi:10.1016/j.jbi.2023.104431
**Source database**: arxiv
**DOI**: https://doi.org/10.1016/j.jbi.2023.104431

**Abstract**: In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit additional training rounds over large-scale in-domain resources. Howe...

---

### [84] A scoping review on multimodal deep learning in biomedical images and texts

**Citation**: 84. Zhaoyi Sun, Mingquan Lin, Qingqing Zhu, et al. A scoping review on multimodal deep learning in biomedical images and texts. arXiv:2307.07362v3. 2023. doi:10.1016/j.jbi.2023.104482
**Source database**: arxiv
**DOI**: https://doi.org/10.1016/j.jbi.2023.104482

**Abstract**: Computer-assisted diagnostic and prognostic systems of the future should be capable of simultaneously processing multimodal data. Multimodal deep learning (MDL), which involves the integration of multiple sources of data, such as images and text, has the potential to revolutionize the analysis and interpretation of biomedical data. However, it only caught researchers' attention recently. To this end, there is a critical need to conduct a systematic review on this topic, identify the limitations ...

---

### [85] Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

**Citation**: 85. Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, et al. Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights. arXiv:2603.16817v1. 2026.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a thresh...
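As an illustrative aside (not the authors' implementation), conformal factuality filtering of the kind described above reduces, in its split-conformal form, to choosing a claim-score cutoff from calibration data; a minimal sketch with hypothetical calibration scores:

```python
import math

def split_conformal_cutoff(calib_scores, alpha=0.1):
    """Distribution-free cutoff via split conformal prediction.

    calib_scores[i] is the nonconformity score of calibration example i
    (e.g. the lowest per-claim confidence at which that example's answer
    is still fully factual). Filtering test-time claims at this cutoff
    preserves the factuality guarantee with probability >= 1 - alpha.
    """
    n = len(calib_scores)
    # Conservative (n+1)-adjusted quantile index, clipped to the sample.
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(calib_scores)[k - 1]

def filter_claims(scored_claims, cutoff):
    """Keep only atomic claims whose confidence clears the cutoff."""
    return [claim for claim, score in scored_claims if score >= cutoff]

# Hypothetical calibration scores and test-time claims.
cutoff = split_conformal_cutoff([0.2, 0.5, 0.4, 0.7, 0.3], alpha=0.2)
kept = filter_claims([("claim A", 0.9), ("claim B", 0.35)], cutoff)
```

A smaller alpha (stronger guarantee) pushes the cutoff higher, so more claims are filtered out; this is the reliability/coverage trade-off the paper's robustness metrics probe.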

---

### [86] RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

**Citation**: 86. Yining She, Daniel W. Peterson, Marianne Menglin Liu, et al. RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts. arXiv:2510.05310v1. 2025.
**Source database**: arxiv

**Abstract**: With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval Augmentation Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional info...

---

### [87] UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

**Citation**: 87. Linshan Wu, Yuxiang Nie, Sunan He, et al. UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation. arXiv:2504.21336v3. 2025.
**Source database**: arxiv

**Abstract**: The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate AI-generated findings with visual evidence (e.g., tiny lesions) in images and interpret the results ...

---

### [88] Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

**Citation**: 88. Ran Xu, Wenqi Shi, Yuchen Zhuang, et al. Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration. arXiv:2504.04915v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a blackbox large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the ...

---

### [89] PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

**Citation**: 89. Zhiwen You, Yue Guo. PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization. arXiv:2503.08890v4. 2025. doi:10.1016/j.jbi.2026.105019
**Source database**: arxiv
**DOI**: https://doi.org/10.1016/j.jbi.2026.105019

**Abstract**: Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA) -based, struggle with plain language summarization (PLS) due to elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To a...

---

### [90] RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation

**Citation**: 90. Guangzhi Xiong, Qiao Jin, Xiao Wang, et al. RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation. arXiv:2502.13957v2. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt eng...

---

### [91] Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science

**Citation**: 91. Deepak R. Unni, Sierra A. T. Moxon, Michael Bada, et al. Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science. arXiv:2203.13906v1. 2022. doi:10.1111/cts.13302
**Source database**: arxiv
**DOI**: https://doi.org/10.1111/cts.13302

**Abstract**: Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness between core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of a...

---

### [92] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

**Citation**: 92. Lei Hsiung, Tianyu Pang, Yung-Chen Tang, et al. Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets. arXiv:2506.05346v1. 2025.
**Source database**: arxiv

**Abstract**: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: t...

---

### [93] Long-form factuality in large language models

**Citation**: 93. Jerry Wei, Chengrun Yang, Xinying Song, et al. Long-form factuality in large language models. arXiv:2403.18802v4. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an L...
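As an illustrative aside (not the SAFE code itself), the evaluator described above decomposes a long-form response into atomic facts and rates each one against search results; the control flow can be sketched with stub components:

```python
def split_into_facts(response):
    """Crude sentence-level splitter standing in for SAFE's LLM-based
    decomposition of a response into self-contained atomic facts."""
    return [s.strip() for s in response.split(".") if s.strip()]

def evaluate_factuality(response, is_supported):
    """Rate each atomic fact; `is_supported` is a stub for the
    search-augmented check that SAFE delegates to an LLM agent."""
    facts = split_into_facts(response)
    labels = {fact: is_supported(fact) for fact in facts}
    return {"labels": labels, "precision": sum(labels.values()) / len(facts)}

# Toy knowledge base standing in for Google Search results.
kb = {"Paris is the capital of France"}
report = evaluate_factuality(
    "Paris is the capital of France. Paris has 40 million residents.",
    lambda fact: fact in kb,
)
```

The per-fact labels yield a precision-style score; SAFE additionally balances this against recall up to a chosen response length.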

---

### [94] GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering

**Citation**: 94. Tianyi Zhang, Andreas Marfurt. GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering. arXiv:2604.04359v1. 2026.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited g...

---

### [95] How Does Response Length Affect Long-Form Factuality

**Citation**: 95. James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, et al. How Does Response Length Affect Long-Form Factuality. arXiv:2505.23295v1. 2025.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Usi...

---

### [96] Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG

**Citation**: 96. Chenhao Fang, Derek Larson, Shitong Zhu, et al. Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG. arXiv:2410.02825v2. 2024.
**Source database**: arxiv

**Abstract**: This paper presents new methods that have the potential to improve privacy process efficiency with LLM and RAG. To reduce hallucination, we continually pre-train the base LLM model with a privacy-specific knowledge base and then augment it with a semantic RAG layer. Our evaluations demonstrate that this approach enhances the model performance (as much as doubled metrics compared to out-of-box LLM) in handling privacy-related queries, by grounding responses with factual information which reduces ...

---

### [97] RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking

**Citation**: 97. Jiaru Zou, Dongqi Fu, Sirui Chen, et al. RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking. arXiv:2504.01346v4. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating them with an external knowledge base to improve the answer relevance and accuracy. In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables, and user questions often require retrieving answers that are distributed across multiple tables. Retrieving knowledge from a table corpora (i.e., various individual tables) for a question remains nascent, at least, for (i) how to...

---

### [98] CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

**Citation**: 98. Reem Abdel-Salam, Mary Adewunmi, Modinat A. Abayomi. CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA. arXiv:2509.00806v1. 2025.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) are increasingly evident for accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervi...

---

### [99] P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA

**Citation**: 99. Xingda Lyu, Gongfu Lyu, Zitai Yan, et al. P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA. arXiv:2602.15874v1. 2026. doi:10.54254/2755-2721/2025.AST28253
**Source database**: arxiv
**DOI**: https://doi.org/10.54254/2755-2721/2025.AST28253

**Abstract**: Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants-Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowl...

---

### [100] RAG based Question-Answering for Contextual Response Prediction System

**Citation**: 100. Sriram Veturi, Saurabh Vaichal, Reshma Lal Jagadheesh, et al. RAG based Question-Answering for Contextual Response Prediction System. arXiv:2409.03708v2. 2024.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-...

---

### [101] RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms

**Citation**: 101. Yuya Ishihara, Atsushi Keyaki, Hiroaki Yamada, et al. RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms. arXiv:2511.22858v1. 2025.
**Source database**: arxiv

**Abstract**: This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to...

---

### [102] Ask-EDA: A Design Assistant Empowered by LLM, Hybrid RAG and Abbreviation De-hallucination

**Citation**: 102. Luyao Shi, Michael Kazda, Bradley Sears, et al. Ask-EDA: A Design Assistant Empowered by LLM, Hybrid RAG and Abbreviation De-hallucination. arXiv:2406.06575v1. 2024.
**Source database**: arxiv

**Abstract**: Electronic design engineers are challenged to find relevant information efficiently for a myriad of tasks within design construction, verification and technology development. Large language models (LLM) have the potential to help improve productivity by serving as conversational agents that effectively function as subject-matter experts. In this paper we demonstrate Ask-EDA, a chat agent designed to serve as a 24x7 expert available to provide guidance to design engineers. Ask-EDA leverages LLM, ...

---

### [103] NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

**Citation**: 103. Traian Rebedea, Razvan Dinu, Makesh Sreedhar, et al. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. arXiv:2310.10501v1. 2023.
**Source database**: arxiv

**Abstract**: NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training, e.g. using mode...

---

### [104] Telco-RAG: Navigating the Challenges of Retrieval-Augmented Language Models for Telecommunications

**Citation**: 104. Andrei-Laurentiu Bornea, Fadhel Ayed, Antonio De Domenico, et al. Telco-RAG: Navigating the Challenges of Retrieval-Augmented Language Models for Telecommunications. arXiv:2404.15939v3. 2024.
**Source database**: arxiv

**Abstract**: The application of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems in the telecommunication domain presents unique challenges, primarily due to the complex nature of telecom standard documents and the rapid evolution of the field. The paper introduces Telco-RAG, an open-source RAG framework designed to handle the specific needs of telecommunications standards, particularly 3rd Generation Partnership Project (3GPP) documents. Telco-RAG addresses the critical challeng...

---

### [105] Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers Evidence Across Models and Ontologies

**Citation**: 105. Daniel B. Hier, Steven Keith Platt, Tayo Obafemi-Ajayi. Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers Evidence Across Models and Ontologies. arXiv:2509.04458v2. 2025.
**Source database**: arxiv

**Abstract**: Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposu...

---

### [106] Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

**Citation**: 106. Jun Zhuang, Haibo Jin, Ye Zhang, et al. Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. arXiv:2505.18556v2. 2025.
**Source database**: arxiv

**Abstract**: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demons...

---

### [107] FIT-RAG: Black-Box RAG with Factual Information and Token Reduction

**Citation**: 107. Yuren Mao, Xuemei Dong, Wenyi Xu, et al. FIT-RAG: Black-Box RAG with Factual Information and Token Reduction. arXiv:2403.14374v1. 2024.
**Source database**: arxiv

**Abstract**: Due to the extraordinarily large number of parameters, fine-tuning Large Language Models (LLMs) to update long-tail or out-of-date knowledge is impractical in lots of applications. To avoid fine-tuning, we can alternatively treat a LLM as a black-box (i.e., freeze the parameters of the LLM) and augment it with a Retrieval-Augmented Generation (RAG) system, namely black-box RAG. Recently, black-box RAG has achieved success in knowledge-intensive tasks and has gained much attention. Existing black...
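As an illustrative aside (not FIT-RAG itself), the black-box setting described above, where the LLM's parameters stay frozen and only its prompt is augmented, can be sketched with a stub retriever and an opaque generator callable:

```python
def retrieve(query, corpus, top_k=2):
    """Toy keyword-overlap retriever standing in for a real search index."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda doc: len(q & set(doc.lower().split())),
                  reverse=True)[:top_k]

def black_box_rag(query, corpus, generate):
    """Augment a frozen LLM with retrieval: `generate` is treated as an
    opaque callable, so no gradients or fine-tuning are involved."""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

corpus = [
    "BM25 is a sparse keyword retrieval function.",
    "Dense retrieval embeds queries and documents.",
    "RAG grounds LLM answers in retrieved text.",
]
# Identity stub in place of a real LLM API call.
answer = black_box_rag("What does RAG ground answers in?", corpus, lambda p: p)
```

Everything FIT-RAG adds (factual-information scoring, token reduction) happens in the retrieval and prompt-construction stages of this loop, leaving the generator untouched.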

---

### [108] SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA

**Citation**: 108. Haozhou Xu, Dongxia Wu, Matteo Chinazzi, et al. SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA. arXiv:2509.25459v1. 2025.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions, which are crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by inco...

---

### [109] Benchmarking LLM Guardrails in Handling Multilingual Toxicity

**Citation**: 109. Yahan Yang, Soham Dan, Dan Roth, et al. Benchmarking LLM Guardrails in Handling Multilingual Toxicity. arXiv:2410.22153v1. 2024.
**Source database**: arxiv

**Abstract**: With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigates the resilience of guard...

---

### [110] Background Knowledge Grounding for Readable, Relevant, and Factual Biomedical Lay Summaries

**Citation**: 110. Domenic Rosati. Background Knowledge Grounding for Readable, Relevant, and Factual Biomedical Lay Summaries. arXiv:2305.02104v1. 2023.
**Source database**: arxiv

**Abstract**: Communication of scientific findings to the public is important for keeping non-experts informed of developments such as life-saving medical treatments. However, generating readable lay summaries from scientific documents is challenging, and currently, these summaries suffer from critical factual errors. One popular intervention for improving factuality is using additional external knowledge to provide factual grounding. However, it is unclear how these grounding sources should be retrieved, sel...

---

### [111] Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation

**Citation**: 111. Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, et al. Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation. arXiv:2510.25518v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acro...

---

### [112] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

**Citation**: 112. Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Rexhina Blloshmi, et al. GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation. arXiv:2506.07671v1. 2025.
**Source database**: arxiv

**Abstract**: We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM'...

---

### [113] Probing Factually Grounded Content Transfer with Factual Ablation

**Citation**: 113. Peter West, Chris Quirk, Michel Galley, et al. Probing Factually Grounded Content Transfer with Factual Ablation. arXiv:2203.10133v2. 2022.
**Source database**: arxiv

**Abstract**: Despite recent success, large neural models often generate factually incorrect text. Compounding this is the lack of a standard automatic evaluation for factuality--it cannot be meaningfully improved if it cannot be measured. Grounded generation promises a path to solving both of these problems: models draw on a reliable external document (grounding) for factual information, simplifying the challenge of factuality. Measuring factuality is also simplified--to factual consistency, testing whether ...

---

### [114] RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition

**Citation**: 114. Kun Ran, Marwah Alaofi, Danula Hettiachchi, et al. RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition. arXiv:2602.20735v1. 2026.
**Source database**: arxiv

**Abstract**: This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS 2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research ta...

---

### [115] PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

**Citation**: 115. Mohammad Kachuee, Teja Gollapudi, Minseok Kim, et al. PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning. arXiv:2507.18857v1. 2025.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions require deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive...

---

### [116] Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

**Citation**: 116. Jiarui Li, Ye Yuan, Zehua Zhang. Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases. arXiv:2403.10446v1. 2024.
**Source database**: arxiv

**Abstract**: We proposed an end-to-end system design towards utilizing Retrieval Augmented Generation (RAG) to improve the factual accuracy of Large Language Models (LLMs) for domain-specific and time-sensitive queries related to private knowledge-bases. Our system integrates RAG pipeline with upstream datasets processing and downstream performance evaluation. Addressing the challenge of LLM hallucinations, we finetune models with a curated dataset which originates from CMU's extensive resources and annotate...

---

### [117] LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

**Citation**: 117. Haoyu Zhao, Wenhang Ge, Ying-cong Chen. LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding. arXiv:2405.17104v2. 2024.
**Source database**: arxiv

**Abstract**: Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objec...

---

### [118] fastbmRAG: A Fast Graph-Based RAG Framework for Efficient Processing of Large-Scale Biomedical Literature

**Citation**: 118. Guofeng Meng, Li Shen, Qiuyan Zhong, et al. fastbmRAG: A Fast Graph-Based RAG Framework for Efficient Processing of Large-Scale Biomedical Literature. arXiv:2511.10014v1. 2025.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) are rapidly transforming various domains, including biomedicine and healthcare, and demonstrate remarkable potential from scientific research to new drug discovery. Graph-based retrieval-augmented generation (RAG) systems, as a useful application of LLMs, can improve contextual reasoning through structured entity and relationship identification from long-context knowledge, e.g. biomedical literature. Even though many advantages over naive RAGs, most of graph-based RA...

---

### [119] Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

**Citation**: 119. Yani Zhang, Dongming Wu, Hao Shi, et al. Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding. arXiv:2506.05199v2. 2025.
**Source database**: arxiv

**Abstract**: Embodied 3D grounding aims to localize target objects described in human instructions from ego-centric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target...

---

### [120] Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

**Citation**: 120. Meng Xiao, Xunxin Cai, Qingqing Long, et al. Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training. arXiv:2504.19565v3. 2025.
**Source database**: arxiv

**Abstract**: Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge...

---

### [121] OLAPH: Improving Factuality in Biomedical Long-form Question Answering

**Citation**: 121. Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, et al. OLAPH: Improving Factuality in Biomedical Long-form Question Answering. arXiv:2405.12701v3. 2024.
**Source database**: arxiv

**Abstract**: In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate a cost-effective autom...

---

### [122] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

**Citation**: 122. Yingpeng Ning, Yuanyuan Sun, Ling Luo, et al. MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering. arXiv:2510.14400v2. 2025.
**Source database**: arxiv

**Abstract**: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTr...

---

### [123] Current state of LLM Risks and AI Guardrails

**Citation**: 123. Suriya Ganesh Ayyamperumal, Limin Ge. Current state of LLM Risks and AI Guardrails. arXiv:2406.12934v1. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. However, LLMs have inherent risks accompanying them, including bias, potential for unsafe actions, dataset poisoning, lack of explainability, hallucinations, and non-reproducibility. These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm. This work explores th...

---

### [124] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

**Citation**: 124. Peter Banyas, Shristi Sharma, Alistair Simmons, et al. ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups. arXiv:2510.13852v2. 2025.
**Source database**: arxiv

**Abstract**: Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we que...

---

### [125] RAG-Guardrails Integration for AI Content Control

**Citation**: 125. R. More. RAG-Guardrails Integration for AI Content Control. Proceedings of the 2025 18th International Conference on Computer Science and Information Technology. 2025. doi:10.1145/3783862.3783896
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1145/3783862.3783896

**Abstract**: Generative AI, particularly large language models (LLMs), has shown remarkable potential across domains such as healthcare, legal services, and finance. However, their adoption is hindered by two persistent challenges: hallucination-where models generate factually incorrect information-and the risk of producing biased or unsafe content. This paper proposes a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with NVIDIA NeMo Guardrails to address these concerns. RAG mitigates ...

---

### [126] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

**Citation**: 126. I. Sorodoc, Leonardo F. R. Ribeiro, Rexhina Blloshmi, et al. GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation. ArXiv. 2025. doi:10.48550/arXiv.2506.07671
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2506.07671

**Abstract**: We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM'...

---

### [127] Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

**Citation**: 127. Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koras, et al. Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging. ArXiv. 2026. doi:10.48550/arXiv.2602.04731
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2602.04731

**Abstract**: Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modula...

---

### [128] KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models

**Citation**: 128. Nicholas Matsumoto, Jay Moran, Hyunjun Choi, et al. KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models. Bioinformatics. 2024. doi:10.1093/bioinformatics/btae353
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1093/bioinformatics/btae353

**Abstract**: Abstract Motivation Answering and solving complex problems using a large language model (LLM) given a certain domain such as biomedicine is a challenging task that requires both factual consistency and logic, and LLMs often suffer from some major limitations, such as hallucinating false or irrelevant information, or being influenced by noisy data. These issues can compromise the trustworthiness, accuracy, and compliance of LLM-generated text and insights. Results Knowledge Retrieval Augmented Ge...

---

### [129] SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA

**Citation**: 129. Haozhou Xu, D. Wu, M. Chinazzi, et al. SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA. ArXiv. 2025. doi:10.48550/arXiv.2509.25459
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2509.25459

**Abstract**: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions, which are crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by inco...

---

### [130] Towards a Multi-Agent System Based on LLM and RAG for Automated and Customizable Urban Diagnostics

**Citation**: 130. Rida Azmi, Ebnou Abdem Seyid Abdellahi, Mariem Bounabi, et al. Towards a Multi-Agent System Based on LLM and RAG for Automated and Customizable Urban Diagnostics. 2025 International Conference on Intelligent Systems: Theories and Applications (SITA). 2025. doi:10.1109/SITA67914.2025.11273206
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/SITA67914.2025.11273206

**Abstract**: The increasing complexity and dynamism of urban environments necessitate advanced tools for comprehensive and timely diagnostics. Traditional methods are often labor-intensive, fragmented, and struggle to synthesize the vast, heterogeneous data streams generated by modern cities. This paper presents a novel theoretical framework for a multi-agent system that synergistically integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to deliver automated and customizable urba...

---

### [131] Conceptual Design of an LLM-Based Tech Product Recommendation System Using LangChain, LangGraph, Firecrawl, and n8n with RAG, Fine-Tuning, Prompt Engineering, and KNN with Cosine Similarity

**Citation**: 131. Mrs. Abha Pathak, Mrs. Tejaswini Mali, Mr. Sanket Rathod, et al. Conceptual Design of an LLM-Based Tech Product Recommendation System Using LangChain, LangGraph, Firecrawl, and n8n with RAG, Fine-Tuning, Prompt Engineering, and KNN with Cosine Similarity. International Journal of Advanced Research in Science, Communication and Technology. 2025. doi:10.48175/ijarsct-29973
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48175/ijarsct-29973

**Abstract**: Choosing the right technology product has become increasingly difficult for consumers due to limited technical knowledge, rapidly evolving specifications, and the overwhelming number of available options. Traditional recommendation systems rely on static filters or keyword-based searches, often producing incomplete or context-insensitive results. This paper proposes a conceptual design for an AI-driven recommendation framework that leverages Large Language Models (LLMs) to deliver accurate, expl...

---

### [132] LLMForum-RAG: A Multilingual, Multi-domain Framework for Factual Reasoning via Weighted Retrieval and LLM Collaboration

**Citation**: 132. Soham Chaudhuri, Dipanjan Saha, Dipankar Das. LLMForum-RAG: A Multilingual, Multi-domain Framework for Factual Reasoning via Weighted Retrieval and LLM Collaboration. 2025. doi:10.18653/v1/2025.findings-ijcnlp.88
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.18653/v1/2025.findings-ijcnlp.88

**Abstract**: LLMs have emerged as a transformative technology, enabling a wide range of tasks such as text generation, summarization, question answering, and more. The use of RAG with LLM is on the rise to provide deeper knowledge bases of various domains. In the present study, we propose a RAG framework that employs weighted Rocchio mechanism for retrieval and LLM collaborative forum with supervision for generation. Our framework is evaluated in two downstream tasks: a biomedical question answering (BioASQ-...

---

### [133] HYPER-RAG: Evaluating Hyperparameter Trade-Offs in Biomedical Retrieval-Augmented Generation

**Citation**: 133. Ankush Sil Sarma, Pawan Kumar Singh. HYPER-RAG: Evaluating Hyperparameter Trade-Offs in Biomedical Retrieval-Augmented Generation. 2025 IEEE Pune Section International Conference (PuneCon). 2025. doi:10.1109/PuneCon67554.2025.11377827
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/PuneCon67554.2025.11377827

**Abstract**: Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language models by combining document retrieval with text generation. In biomedical question answering, where correctness is critical, the effect of key hyperparameters has not been studied in a systematic way. This paper presents an evaluation of RAG on the COVID-QA dataset with a focus on three retrievers (dense, BM25, hybrid), two retrieval depths (top-k = 1, 3), and optional reranking with a cross encoder. W...
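The retriever comparison this abstract describes (dense vs. BM25 vs. hybrid, with a top-k cutoff) can be illustrated with a minimal sketch. The fusion weight `alpha`, the min-max normalization, and all function names are assumptions for illustration, not the paper's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a minimal BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            s += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def hybrid_top_k(query, docs, dense_scores, alpha=0.5, k=3):
    """Fuse min-max-normalized BM25 and dense scores, return top-k doc indices."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    bm25 = norm(bm25_scores(query, docs))
    dense = norm(dense_scores)
    fused = [alpha * b_s + (1 - alpha) * d_s for b_s, d_s in zip(bm25, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)[:k]
```

Varying `k` and `alpha` here corresponds to the retrieval-depth and retriever-choice trade-offs the paper evaluates; a real system would take `dense_scores` from an embedding model rather than as an input list.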

---

### [134] Navi: RAG-Powered LLM Chatbot for Academic Institutions

**Citation**: 134. Amiya P Bovas. Navi: RAG-Powered LLM Chatbot for Academic Institutions. International Journal for Research in Applied Science and Engineering Technology. 2025. doi:10.22214/ijraset.2025.75655
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.22214/ijraset.2025.75655

**Abstract**: With the growing integration of artificial intelligence (AI) across educational ecosystems, there is an increasing demand for intelligent conversational agents that can efficiently deliver reliable, domain-specific information to students, faculty, and visitors. This research introduces Navi, an academic virtual assistant designed using a Large Language Model (LLM) combined with a Retrieval-Augmented Generation (RAG) framework to generate accurate and contextually grounded responses [1], [5]. Th...

---

### [135] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

**Citation**: 135. Yingpeng Ning, Yuanyuan Sun, Ling Luo, et al. MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering. 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2025. doi:10.1109/BIBM66473.2025.11356290
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/BIBM66473.2025.11356290

**Abstract**: Biomedical question answering (QA) requires precise interpretation of complex medical knowledge. Large language models (LLMs) and retrieval-augmented generation (RAG) leverage external medical literature but often produce hallucinations due to noisy retrieval and insufficient verification. We propose MedTrust-Guided Iterative RAG, a framework that improves factual consistency and reduces hallucinations in medical QA. It introduces three innovations. First, citation-aware reasoning grounds genera...

---

### [136] Reshaping Biomedical Scientific Literature in a RAG Pipeline for Question Answering

**Citation**: 136. Maël Lesavourey, Gilles Hubert. Reshaping Biomedical Scientific Literature in a RAG Pipeline for Question Answering. 2025.
**Source database**: semantic_scholar

---

### [137] Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

**Citation**: 137. Zhan Peng Lee, A. Lin, Calvin Tan. Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation. ArXiv. 2025. doi:10.48550/arXiv.2505.10792
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2505.10792

**Abstract**: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructe...

---

### [138] An Agentic Hybrid LLM–RAG Framework for Explainable Clinical Decision Support

**Citation**: 138. Mohammed Kapadia, Mohammed Memon, P. Mishra, et al. An Agentic Hybrid LLM–RAG Framework for Explainable Clinical Decision Support. Proceedings of the 18th International Conference on Agents and Artificial Intelligence. 2026. doi:10.5220/0014459100004052
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.5220/0014459100004052

**Abstract**: The fast evolution of Large Language Models (LLMs) has provided new opportunities to intelligent Clinical Decision Support Systems (CDSS), but such issues as hallucination, absent interpretability, and poor factual foundation still exist. This paper proposes a Hybrid LLM-Retrieval Augmented Generation (RAG) model of evidence-based clinical reasoning, which would combine transformer-based contextual understanding and retrieval-based factual verification. This system is an agentic design and con...

---

### [139] ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever

**Citation**: 139. Eduardo Martínez Rivera, F. Menolascina. ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever. ArXiv. 2025. doi:10.48550/arXiv.2510.04757
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2510.04757

**Abstract**: Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retriever...

---

### [140] DeepSeek-Med-8B: Medical LLM for Chinese Diagnosis and Referral

**Citation**: 140. Chenxing Li, J. Mao, Bin Liu, et al. DeepSeek-Med-8B: Medical LLM for Chinese Diagnosis and Referral. 2025 8th International Conference on Computer Information Science and Application Technology (CISAT). 2025. doi:10.1109/CISAT66811.2025.11181817
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/CISAT66811.2025.11181817

**Abstract**: The uneven distribution of medical resources in China poses significant challenges, especially in rural areas. While large language models (LLMs) offer potential for clinical support, existing systems like GPT-4 and Med-PaLM suffer from hallucinations, English-centric biases, and lack real-time physician integration. We present DeepSeek-Med-8B, a Chinese medical conversational agent based on the DeepSeek-R1-DistillLlama-8B architecture. DeepSeek-Med-8B is trained through: (i) Supervised Fine-Tuni...

---

### [141] Enterprise GenAI: LLM Deployment on AWS

**Citation**: 141. Sufiyan Shaikh. Enterprise GenAI: LLM Deployment on AWS. International Journal for Research in Applied Science and Engineering Technology. 2026. doi:10.22214/ijraset.2026.77762
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.22214/ijraset.2026.77762

**Abstract**: Generative AI and Large Language Models (LLMs) have transitioned from experimental prototypes to critical enterprise assets, requiring robust, scalable, and secure deployment frameworks. This paper presents a comprehensive survey of LLM deployment strategies on Amazon Web Services (AWS), focusing on the shift from consumer-grade to enterprise-ready architectures. We analyze the AWS Generative AI stack, specifically comparing managed serverless approaches via Amazon Bedrock with customizable infr...

---

### [142] “A Midsummer Night’s Dream” quest for truth: From ChatGPT “hallucinations” to RAG reasoning and ACURAI precision — a scoping review on detection, minimizing, and (almost) complete error elimination and enhancing Large Language Models' reliability

**Citation**: 142. A. Anghelescu, Constantin Munteanu, Lucia Ana Maria Anghelescu, et al. “A Midsummer Night’s Dream” quest for truth: From ChatGPT “hallucinations” to RAG reasoning and ACURAI precision — a scoping review on detection, minimizing, and (almost) complete error elimination and enhancing Large Language Models' reliability. Balneo and PRM Research Journal. 2025. doi:10.12680/balneo.2025.847
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.12680/balneo.2025.847

**Abstract**: Like A Midsummer Night’s Dream, large language models (LLMs) exhibit vast imagination, drawing on massive training datasets. However, they may fabricate or mix information, lacking mechanisms to verify real-world sources. Most commercial LLMs, including those used in medicine, remain prone to hallucinations—plausible but false content. Retrieval-Augmented Generation (RAG) aims to address this by grounding LLM outputs in real-time access to verified sources like scientific databases. A 2023–2025 ...

---

### [143] From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

**Citation**: 143. M. Ferrag, N. Tihanyi, M. Debbah. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review. ArXiv. 2025. doi:10.48550/arXiv.2504.19678
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2504.19678

**Abstract**: Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. Driven by the growing need for standardized evaluation and integration, we systematically consolidate these fragmented efforts into a unified framework. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 a...

---

### [144] Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation

**Citation**: 144. Tobias Leemann, Periklis Petridis, Giuseppe Vietri, et al. Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation. ArXiv. 2024. doi:10.48550/arXiv.2410.03461
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2410.03461

**Abstract**: While retrieval-augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. A common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. ...
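The lightweight grounding-verification step this abstract describes (an NLI model judging whether generated text is entailed by retrieved evidence) can be sketched generically. Here `entail_prob` stands in for any NLI scorer (e.g. a small cross-encoder); the function names and the threshold `tau` are illustrative assumptions, not the paper's method:

```python
def verify_grounding(answer_sentences, evidence_passages, entail_prob, tau=0.5):
    """Split answer sentences into grounded vs. flagged.

    A sentence is grounded if its best entailment probability against
    any evidence passage reaches tau; otherwise it is flagged as a
    potential hallucination for review or regeneration.
    """
    grounded, flagged = [], []
    for sent in answer_sentences:
        best = max(entail_prob(premise=ev, hypothesis=sent)
                   for ev in evidence_passages)
        (grounded if best >= tau else flagged).append(sent)
    return grounded, flagged
```

In practice `entail_prob` would wrap a trained NLI model; the paper's contribution concerns adapting such a model to the target domain, which this sketch deliberately leaves abstract.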

---

### [145] UniTor at BioASQ 2025: Modular Biomedical QA with Synthetic Snippets and Multiple Task Answer Generation

**Citation**: 145. Federico Borazio, Andriy Shcherbakov, D. Croce, et al. UniTor at BioASQ 2025: Modular Biomedical QA with Synthetic Snippets and Multiple Task Answer Generation. 2025.
**Source database**: semantic_scholar

---

### [146] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

**Citation**: 146. Liangliang Zhang, Zhuorui Jiang, H. Chi, et al. Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking. ArXiv. 2025. doi:10.48550/arXiv.2505.23495
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2505.23495

**Abstract**: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find...

---

### [147] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

**Citation**: 147. Siddhant Arora, Haidar Khan, Kai Sun, et al. Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage. ArXiv. 2025. doi:10.48550/arXiv.2510.02044
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2510.02044

**Abstract**: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-o...

---

### [148] LLM-Driven Learner Modeling and Personalized Learning Pathways: A Closed-Loop Framework and Engineering Design for Virtual Laboratories

**Citation**: 148. Ruijie Wang, Guangtao Xu. LLM-Driven Learner Modeling and Personalized Learning Pathways: A Closed-Loop Framework and Engineering Design for Virtual Laboratories. 2025 International Conference on Educational Technology Management (ICETM). 2025. doi:10.1109/ICETM67477.2025.11413398
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/ICETM67477.2025.11413398

**Abstract**: Focusing on virtual experiment teaching, this paper proposes a personalized learning closed-loop with LLM as the core. A simulation engine provides a verifiable factual baseline, while the LLM undertakes semantic interpretation, two-phase pathway generation (skeleton-verification-refinement), fact-grounded judgement and feedback, and explanatory summarization. To enhance robustness and compliance, the framework employs retrieval-augmented generation (RAG), structured outputs, and a second-pass ...

---

### [149] Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction

**Citation**: 149. Debashish Chakraborty, Eugene Yang, Daniel Khashabi, et al. Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction. ArXiv. 2025. doi:10.1007/978-3-032-21300-6_45
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1007/978-3-032-21300-6_45

**Abstract**: Retrieval-Augmented Generation (RAG) enhances factual grounding in large language models (LLMs) by incorporating retrieved evidence, but LLM accuracy declines when long or noisy contexts exceed the model's effective attention span. Existing pre-generation filters rely on heuristics or uncalibrated LLM confidence scores, offering no statistical control over retained evidence. We evaluate and demonstrate context engineering through conformal prediction, a coverage-controlled filtering framework th...

---

### [150] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

**Citation**: 150. Joshua Gao, Quoc Huy Pham, Subin Varghese, et al. RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG. ArXiv. 2025. doi:10.48550/arXiv.2511.04502
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2511.04502

**Abstract**: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances and other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned ...

---

### [151] Retrieval-Augmented Multi-LLM Ensemble for Industrial Part Specification Extraction

**Citation**: 151. Muzakkiruddin Ahmed Mohammed, John R. Talburt, Leon Claasssens, et al. Retrieval-Augmented Multi-LLM Ensemble for Industrial Part Specification Extraction. 2025 17th International Conference on Knowledge and System Engineering (KSE). 2025. doi:10.1109/KSE68178.2025.11309590
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/KSE68178.2025.11309590

**Abstract**: Industrial part specification extraction from unstructured text remains a persistent challenge in manufacturing, procurement, and maintenance, where manual processing is both time-consuming and error-prone. This paper introduces RAGsemble, a retrieval-augmented multi-LLM ensemble framework that orchestrates nine state-of-the-art Large Language Models (LLMs) within a structured three-phase pipeline. RAGsemble addresses key limitations of single-model systems by combining the complementary strengt...

---

### [152] Artificial Intelligence for Quantitative Finance: A RAG-Augmented Multi-Agent Framework for Robust Equity Strategy Discovery

**Citation**: 152. Jianfei Wang, Hualin Li. Artificial Intelligence for Quantitative Finance: A RAG-Augmented Multi-Agent Framework for Robust Equity Strategy Discovery. Proceedings of the 2025 9th International Conference on Computer Science and Artificial Intelligence. 2025. doi:10.1145/3788149.3788249
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1145/3788149.3788249

**Abstract**: This paper introduces an AI-driven multi-agent framework for automated quantitative strategy generation and validation, integrating large language model (LLM) agents with Retrieval-Augmented Generation (RAG) to enhance factual accuracy and research reliability. The system orchestrates specialized agents for market analysis, financial knowledge retrieval, feature engineering, strategy construction, backtesting, and performance interpretation, establishing a comprehensive autonomous investment res...

---

### [153] Mitigating Artificial Intelligence Hallucinations in Education: A Comparative Study of Retrieval-Augmented Generation (RAG) and Large Language Models

**Citation**: 153. Pei-hua Chen, Yuen-Min Huang, Ting‐Ting Wu, et al. Mitigating Artificial Intelligence Hallucinations in Education: A Comparative Study of Retrieval-Augmented Generation (RAG) and Large Language Models. 2025 7th International Conference on Modern Educational Technology (ICMET). 2025. doi:10.1109/ICMET67594.2025.11451842
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/ICMET67594.2025.11451842

**Abstract**: The integration of Large Language Models (LLMs) into educational technologies promises to revolutionize personalized learning. However, their propensity for “hallucination,” the generation of factually incorrect or nonsensical information, poses a significant risk to knowledge integrity and student trust. To address this critical challenge, this paper investigates Retrieval-Augmented Generation (RAG), an architectural approach that mitigates hallucinations by grounding the model's responses in f...

---

### [154] Enhancing Truth with AI: Evaluating ML, LLMs, and RAG in Combating Misinformation

**Citation**: 154. Manisha, Manisha Jailia. Enhancing Truth with AI: Evaluating ML, LLMs, and RAG in Combating Misinformation. 2025 2nd International Conference on Advanced Computing and Emerging Technologies (ACET). 2025. doi:10.1109/ACET67282.2025.11430189
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/ACET67282.2025.11430189

**Abstract**: In the digital age, misinformation is a new danger. It impacts how people talk about things, harms democracy, and changes how people think about the government. This study answers the important demand for better ways to find false information by comparing classic ML models with the newest LLMs and then creating a hybrid framework that contains the best parts of both. Unfortunately, classic classifiers like Naive Bayes and the Passive-Aggressive Classifier often miss content that is both hostile ...

---

### [155] LightRAG-Driven Medical QA: Leveraging Domain-Specific for Efficient LLM Reasoning

**Citation**: 155. Rishabh Kushwaha, Reshma Swain, Bal Krishna Saraswat. LightRAG-Driven Medical QA: Leveraging Domain-Specific for Efficient LLM Reasoning. 2025 5th International Conference on Advancement in Electronics & Communication Engineering (AECE). 2025. doi:10.1109/AECE67531.2025.11386653
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/AECE67531.2025.11386653

**Abstract**: Large Language Models (LLMs) have demonstrated great capabilities in medical question answering, but fabricated facts and unsupported projections limit their real-world effectiveness. Past studies have incorporated RAG methods such as Naive RAG and GraphRAG to enhance factual grounding, yet retaining efficiency while optimizing recall and precision remains challenging for these methods. In this paper, we apply the LightRAG framework to the medical domain as a lightweight and domain-specific r...

---

### [156] A Hybrid GNN-LLM Framework for Correlating Cybersecurity Incidents

**Citation**: 156. Lina Baha, Amine Mammasse, Oualid Saci. A Hybrid GNN-LLM Framework for Correlating Cybersecurity Incidents. 2025 Fourth International Conference on Theoretical and Applicative Aspects of Computer Science (ICTAACS). 2025. doi:10.1109/ICTAACS69003.2025.11399321
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/ICTAACS69003.2025.11399321

**Abstract**: High volumes of alerts from Intrusion Detection Systems (IDS) cause significant "alert fatigue" among security analysts, hindering the identification of genuine incidents. Existing automated correlation methods often lack the semantic context and explainability needed for effective response. This paper presents a hybrid framework that integrates Graph Neural Networks (GNNs) and Large Language Models (LLMs) to correlate and explain security alerts. The approach constructs an alert graph using hyb...

---

### [157] A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

**Citation**: 157. Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, et al. A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions. ArXiv. 2025. doi:10.48550/arXiv.2507.18910
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2507.18910

**Abstract**: Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motiv...

---

### [158] GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning

**Citation**: 158. Costas Mavromatis, George Karypis. GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. ArXiv. 2024. doi:10.48550/arXiv.2405.20139
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2405.20139

**Abstract**: Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for K...

---

### [159] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

**Citation**: 159. A. Maharjan, Umesh Yadav. Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering. ArXiv. 2026. doi:10.48550/arXiv.2601.15457
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2601.15457

**Abstract**: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrit...

---

### [160] GRACE-RAG: Graph Retrieval with Adaptive Chunk Extraction for Long-Context Question Answering

**Citation**: 160. Tianwei Huang, Shuai Lei, Askar Hamdulla, et al.. GRACE-RAG: Graph Retrieval with Adaptive Chunk Extraction for Long-Context Question Answering. 2026 International Conference on Communication Networks and Machine Learning (CNML). 2026. doi:10.1109/CNML68938.2026.11452294
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/CNML68938.2026.11452294

**Abstract**: Retrieval-augmented generation (RAG) improves factuality by grounding large language models (LLMs) on external corpora, but it still struggles with multi-hop reasoning and long-context overload. We propose GRACE-RAG, a two-stage framework that (i) builds a fine-grained chunk–sentence–entity graph via dynamic chunking during offline indexing, (ii) performs query-aware entity activation and personalized PageRank for associative retrieval online, and (iii) adaptively compresses retrieved evidence u...

---

### [161] Design and Implementation of a RAG Chatbot System for Scientific Research Institutes

**Citation**: 161. Igor Radulović, Jovana Mitrić, Katarina Kovijanić, et al.. Design and Implementation of a RAG Chatbot System for Scientific Research Institutes. 2026 30th International Conference on Information Technology (IT). 2026. doi:10.1109/IT67293.2026.11435604
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/IT67293.2026.11435604

**Abstract**: This paper presents the design and implementation of a prototype chatbot system based on the Retrieval-Augmented Generation (RAG) architecture, applied in a scientific research institute to improve knowledge access. The system combines semantic search over a vector knowledge base with response generation using large language models, enabling contextually relevant institutional information. A case study was conducted to evaluate the prototype in a real-world environment. Results indicate improved...

---

### [162] SRAG: RAG with Structured Data Improves Vector Retrieval

**Citation**: 162. Shalin Shah, Srikanth Ryali, Ramasubbu Venkatesh. SRAG: RAG with Structured Data Improves Vector Retrieval. 2026.
**Source database**: semantic_scholar

**Abstract**: Retrieval Augmented Generation (RAG) provides the necessary informational grounding to LLMs in the form of chunks retrieved from a vector database or through web search. RAG could also use knowledge graph triples as a means of providing factual information to an LLM. However, the retrieval is only based on representational similarity between a question and the contents. The performance of RAG depends on the numeric vector representations of the query and the chunks. To improve these representati...

---

### [163] Rational Synthesizers or Heuristic Followers? Analyzing LLMs in RAG-based Question-Answering

**Citation**: 163. Atharv Naphade. Rational Synthesizers or Heuristic Followers? Analyzing LLMs in RAG-based Question-Answering. ArXiv. 2026. doi:10.48550/arXiv.2601.06189
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2601.06189

**Abstract**: Retrieval-Augmented Generation (RAG) is the prevailing paradigm for grounding Large Language Models (LLMs), yet the mechanisms governing how models integrate groups of conflicting retrieved evidence remain opaque. Does an LLM answer a certain way because the evidence is factually strong, because of a prior belief, or merely because it is repeated frequently? To answer this, we introduce GroupQA, a curated dataset of 1,635 controversial questions paired with 15,058 diversely-sourced evidence docu...

---

### [164] Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG

**Citation**: 164. Chenhao Fang, Derek Larson, Shitong Zhu, et al.. Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG. ArXiv. 2024. doi:10.48550/arXiv.2410.02825
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2410.02825

**Abstract**: This paper presents new methods that have the potential to improve privacy process efficiency with LLM and RAG. To reduce hallucination, we continually pre-train the base LLM model with a privacy-specific knowledge base and then augment it with a semantic RAG layer. Our evaluations demonstrate that this approach enhances the model performance (as much as doubled metrics compared to out-of-box LLM) in handling privacy-related queries, by grounding responses with factual information which reduces ...

---

### [165] Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

**Citation**: 165. Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, et al. Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation. ArXiv. 2025. doi:10.1007/978-3-032-04200-2_10
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1007/978-3-032-04200-2_10

**Abstract**: Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific ...

---

### [166] Turk-LettuceDetect: A Hallucination Detection Model for Turkish RAG Applications

**Citation**: 166. Selva Tas, Mahmut El Huseyni, Özay Ezerceli, et al. Turk-LettuceDetect: A Hallucination Detection Model for Turkish RAG Applications. 2025 3rd International Conference on Foundation and Large Language Models (FLLM). 2025. doi:10.1109/FLLM67465.2025.11390913
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/FLLM67465.2025.11390913

**Abstract**: The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detecti...

---

### [167] Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)

**Citation**: 167. Sarat Ahmad, Zeinab Nezami, Maryam Hafeez, et al. Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN). 2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). 2025. doi:10.1109/PIMRC62392.2025.11274810
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/PIMRC62392.2025.11274810

**Abstract**: Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative thr...

---

### [168] Medical Graph-RAG: Bilingual Graph-Based Reasoning for Cardiological Intelligence

**Citation**: 168. Leen I. A. Shaqalaih, Omar Belal, Fatma Küçük, et al. Medical Graph-RAG: Bilingual Graph-Based Reasoning for Cardiological Intelligence. 2025 International Conference on INnovations in Intelligent SysTems and Applications (INISTA). 2025. doi:10.1109/INISTA68122.2025.11249583
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/INISTA68122.2025.11249583

**Abstract**: Clinical question answering requires factual accuracy, transparency, and evidence that is traceable. Despite recent progress, Large Language Models (LLMs) still hallucinate and struggle with specialized terminology. Retrieval-Augmented Generation (RAG) mitigates this by grounding answers in external sources, but conventional RAG neglects the rich relational structure of medical knowledge. Building on MedGraphRAG, an evidence-focused, graph-based RAG framework, we present the first bilingual (Eng...

---

### [169] A Survey of Retrieval-Augmented Generation (RAG) for Large Language Models

**Citation**: 169. Yusong Ma, Hongxuan Nie, Chao Chen, et al. A Survey of Retrieval-Augmented Generation (RAG) for Large Language Models. 2025 International Conference on Trustworthy Big Data and Artificial Intelligence (ICTBAI). 2025. doi:10.1109/ICTBAI68361.2025.00008
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1109/ICTBAI68361.2025.00008

**Abstract**: While Large Language Models (LLMs) are revolutionary, their deployment is constrained by inherent limitations such as factual hallucination and static knowledge. This survey systematically reviews Retrieval-Augmented Generation (RAG), a key paradigm for addressing these challenges by grounding LLMs in external, verifiable knowledge. To overcome the flaws of standalone models, RAG integrates LLMs with updatable knowledge bases, a hybrid approach that significantly enhances output accuracy and tru...

---

### [170] LLMs and LVMs for agentic AI: a GPU-accelerated multimodal system architecture for RAG-grounded, explainable, and adaptive intelligence

**Citation**: 170. Kiarash Ahi, Chih-Hung Hsieh, G. Fenger. LLMs and LVMs for agentic AI: a GPU-accelerated multimodal system architecture for RAG-grounded, explainable, and adaptive intelligence. 2025. doi:10.1117/12.3078485
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.1117/12.3078485

**Abstract**: This paper presents an architecture for an Agentic AI System that autonomously operates and manages complex workflows across enterprise and industrial software ecosystems such as Electronic Design Automation (EDA) tools (e.g., Siemens Calibre), Product Lifecycle Management (PLM) and Digital Twin platforms (e.g., Teamcenter Digital Reality Viewer), as well as knowledge-centric domains including HR analytics, financial modeling, healthcare diagnostics, and creative design platforms. This architect...

---

### [171] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

**Citation**: 171. Tasnimul Hassan, Md. Karim, Haziq Jeelani, et al. Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework. ArXiv. 2025. doi:10.48550/arXiv.2512.05863
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.48550/arXiv.2512.05863

**Abstract**: Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA~2 and Falcon) using Low-Ran...

---

### [172] Generate but Verify: Answering with Faithfulness in RAG-based Question Answering

**Citation**: 172. Simone Filice, Elad Haramaty, Guy Horowitz, et al. Generate but Verify: Answering with Faithfulness in RAG-based Question Answering. 2025. doi:10.18653/v1/2025.ijcnlp-long.56
**Source database**: semantic_scholar
**DOI**: https://doi.org/10.18653/v1/2025.ijcnlp-long.56

**Abstract**: Retrieval-Augmented Generation (RAG) enhances LLMs by grounding answers in retrieved passages, which is key in factual Question Answering. However, generated answers may still be unfaithful to the passages, either due to retrieval or generation errors. Many RAG downstream applications rely on assessing answer faithfulness for applying fallback strategies, yet address it implicitly, without a consistent evaluation methodology. We introduce the task of Answering with Faithfulness (AwF), which brin...

---

### [173] Specialized AI and neurosurgeons in niche expertise: a proof-of-concept in neuromodulation with vagus nerve stimulation.

**Citation**: 173. Barrit S, Ranuzzi G, Fetzer S, et al. Specialized AI and neurosurgeons in niche expertise: a proof-of-concept in neuromodulation with vagus nerve stimulation. Acta neurochirurgica. 2025. doi:10.1007/s00701-025-06610-8
**Source database**: pubmed
**DOI**: https://doi.org/10.1007/s00701-025-06610-8

**Abstract**: Applying large language models (LLM) in specialized medical disciplines presents unique challenges requiring precision, reliability, and domain-specific relevance. We evaluated a specialized LLM-driven system against neurosurgeons in vagus nerve stimulation (VNS) for drug-resistant epilepsy knowledge assessment-a complex neuromodulation therapy requiring transdisciplinary expertise in neural anatomy, epileptic disorders, and device technology.

---

### [174] Detecting hallucinations in large language models using semantic entropy.

**Citation**: 174. Farquhar S, Kossen J, Kuhn L, et al. Detecting hallucinations in large language models using semantic entropy. Nature. 2024. doi:10.1038/s41586-024-07421-0
**Source database**: pubmed
**DOI**: https://doi.org/10.1038/s41586-024-07421-0

**Abstract**: Large language model (LLM) systems, such as ChatGPT

---

### [175] PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

**Citation**: 175. Dingkang Yang, Jinjie Wei, Dongling Xiao, et al. PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications. arXiv:2405.19266v4. 2024.
**Source database**: arxiv

**Abstract**: Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instru...

---

### [176] Hallucination Detection with Small Language Models

**Citation**: 176. Ming Cheung. Hallucination Detection with Small Language Models. arXiv:2506.22486v1. 2025.
**Source database**: arxiv

**Abstract**: Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answ...

---

### [177] Reviewing Clinical Knowledge in Medical Large Language Models: Training and Beyond

**Citation**: 177. Qiyuan Li, Haijiang Liu, Caicai Guo, et al. Reviewing Clinical Knowledge in Medical Large Language Models: Training and Beyond. arXiv:2502.20988v2. 2025. doi:10.1016/j.knosys.2025.114215
**Source database**: arxiv
**DOI**: https://doi.org/10.1016/j.knosys.2025.114215

**Abstract**: The large-scale development of large language models (LLMs) in medical contexts, such as diagnostic assistance and treatment recommendations, necessitates that these models possess accurate medical knowledge and deliver traceable decision-making processes. Clinical knowledge, encompassing the insights gained from research on the causes, prognosis, diagnosis, and treatment of diseases, has been extensively examined within real-world medical practices. Recently, there has been a notable increase i...

---

### [178] THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models

**Citation**: 178. Mengfei Liang, Archish Arun, Zekun Wu, et al. THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models. arXiv:2409.11353v3. 2024.
**Source database**: arxiv

**Abstract**: Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring aut...

---

### [179] Instruction-tuned Large Language Models for Machine Translation in the Medical Domain

**Citation**: 179. Miguel Rios. Instruction-tuned Large Language Models for Machine Translation in the Medical Domain. arXiv:2408.16440v2. 2024.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medic...

---

### [180] Med-HALT: Medical Domain Hallucination Test for Large Language Models

**Citation**: 180. Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu. Med-HALT: Medical Domain Hallucination Test for Large Language Models. arXiv:2307.15343v2. 2023.
**Source database**: arxiv

**Abstract**: This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the context of the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dat...

---

### [181] MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

**Citation**: 181. Shrey Pandit, Jiawei Xu, Junyuan Hong, et al. MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models. arXiv:2502.14302v1. 2025.
**Source database**: arxiv

**Abstract**: Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,...

---

### [182] Probabilistic distances-based hallucination detection in LLMs with RAG

**Citation**: 182. Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, et al. Probabilistic distances-based hallucination detection in LLMs with RAG. arXiv:2506.09886v2. 2025.
**Source database**: arxiv

**Abstract**: Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications. Without proper detection, these systems often provide harmful, unreliable answers. In recent years, LLMs have been actively used in retrieval-augmented generation (RAG) settings. However, hallucinations remain even in this setting, and while numerous hallucination detection methods have been proposed, most approaches are not specifically designed for RAG systems. To overcome this limitatio...

---

### [183] Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models

**Citation**: 183. Hongzhan Lin, Ziyang Luo, Jing Ma, et al. Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models. arXiv:2312.05434v1. 2023.
**Source database**: arxiv

**Abstract**: The age of social media is rife with memes. Understanding and detecting harmful memes pose a significant challenge due to their implicit meaning that is not explicitly conveyed through the surface text and image. However, existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner but ignore in-depth cognition of the meme text and image. In this paper, we attempt to detect harmful memes based on advanced reasoning over the ...

---

### [184] How do language models learn facts? Dynamics, curricula and hallucinations

**Citation**: 184. Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, et al. How do language models learn facts? Dynamics, curricula and hallucinations. arXiv:2503.21676v2. 2025.
**Source database**: arxiv

**Abstract**: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall....

---

### [185] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

**Citation**: 185. Jiawei Chen, Dingkang Yang, Tong Wu, et al. Detecting and Evaluating Medical Hallucinations in Large Vision Language Models. arXiv:2406.10185v1. 2024.
**Source database**: arxiv

**Abstract**: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection ...

---

### [186] Unifying Large Language Models and Knowledge Graphs: A Roadmap

**Citation**: 186. Shirui Pan, Linhao Luo, Yufei Wang, et al. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv:2306.08302v3. 2023. doi:10.1109/TKDE.2024.3352100
**Source database**: arxiv
**DOI**: https://doi.org/10.1109/TKDE.2024.3352100

**Abstract**: Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizability. However, LLMs are black-box models, which often fall short of capturing and accessing factual knowledge. In contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external kn...

---

### [187] AutoHall: Automated Factuality Hallucination Dataset Generation for Large Language Models

**Citation**: 187. Zouying Cao, Yifei Yang, XiaoJing Li, et al. AutoHall: Automated Factuality Hallucination Dataset Generation for Large Language Models. arXiv:2310.00259v3. 2023.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have gained broad applications across various domains but still struggle with hallucinations. Currently, hallucinations occur frequently in the generation of factual content and pose a great challenge to trustworthy LLMs. However, hallucination detection is hindered by the laborious and expensive manual annotation of hallucinatory content. Meanwhile, as different LLMs exhibit distinct types and rates of hallucination, the collection of hallucination datasets is inher...

---

### [188] Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

**Citation**: 188. Renxi Wang, Haonan Li, Xudong Han, et al. Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents. arXiv:2402.11651v2. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine...

---

### [189] Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration

**Citation**: 189. Qiyao Sun, Xingming Li, Xixiang He, et al. Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration. arXiv:2603.22812v1. 2026.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propo...

---

### [190] Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications

**Citation**: 190. Zhe Chen, Yusheng Liao, Shuyang Jiang, et al. Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications. arXiv:2501.02460v3. 2025.
**Source database**: arxiv

**Abstract**: Large language models hold promise for addressing medical challenges, such as medical diagnosis reasoning, research knowledge acquisition, clinical decision-making, and consumer health inquiry support. However, they often generate hallucinations due to limited medical knowledge. Incorporating external knowledge is therefore critical, which necessitates multi-source knowledge acquisition. We address this challenge by framing it as a source planning problem, which is to formulate context-appropria...

---

### [191] Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models

**Citation**: 191. Bradley P. Allen, Paul T. Groth. Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models. arXiv:2404.17000v1. 2024.
**Source database**: arxiv

**Abstract**: A backbone of knowledge graphs are their class membership relations, which assign entities to a given class. As part of the knowledge engineering process, we propose a new method for evaluating the quality of these relations by processing descriptions of a given entity and class using a zero-shot chain-of-thought classifier that uses a natural language intensional definition of a class. We evaluate the method using two publicly available knowledge graphs, Wikidata and CaLiGraph, and 7 large lang...

---

### [192] SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation

**Citation**: 192. Jee Seok Yoon, Chenghao Zhang, Heung-Il Suk, et al. SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation. arXiv:2212.08228v2. 2022. doi:10.1007/978-3-031-34048-2_30
**Source database**: arxiv
**DOI**: https://doi.org/10.1007/978-3-031-34048-2_30

**Abstract**: Human organs constantly undergo anatomical changes due to a complex mix of short-term (e.g., heartbeat) and long-term (e.g., aging) factors. Evidently, prior knowledge of these factors will be beneficial when modeling their future state, i.e., via image generation. However, most of the medical image generation tasks only rely on the input from a single image, thus ignoring the sequential dependency even when longitudinal data is available. Sequence-aware deep generative models, where model input...

---

### [193] Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

**Citation**: 193. Yazhou Zhang, Chunwang Zou, Bo Wang, et al. Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models. arXiv:2503.18681v3. 2025.
**Source database**: arxiv

**Abstract**: Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal informati...

---

### [194] HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge

**Citation**: 194. Haochun Wang, Chi Liu, Nuwa Xi, et al. HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. arXiv:2304.06975v1. 2023.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs), such as the LLaMA model, have demonstrated their effectiveness in various general-domain natural language processing (NLP) tasks. Nevertheless, LLMs have not yet performed optimally in biomedical domain tasks due to the need for medical expertise in the responses. In response to this challenge, we propose HuaTuo, a LLaMA-based model that has been supervised-fine-tuned with generated QA (Question-Answer) instances. The experimental results demonstrate that HuaTuo gen...

---

### [195] Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

**Citation**: 195. Yun-Wei Chu, Kai Zhang, Christopher Malon, et al. Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation. arXiv:2502.15040v1. 2025.
**Source database**: arxiv

**Abstract**: Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation d...

---

### [196] Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

**Citation**: 196. Zhihong Chen, Guanbin Li, Xiang Wan. Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. arXiv:2209.07118v1. 2022.
**Source database**: arxiv

**Abstract**: Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods mainly contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge and explicitly exploiting such knowledge to facil...

---

### [197] Demystifying Instruction Mixing for Fine-tuning Large Language Models

**Citation**: 197. Renxi Wang, Haonan Li, Minghao Wu, et al. Demystifying Instruction Mixing for Fine-tuning Large Language Models. arXiv:2312.10793v3. 2023.
**Source database**: arxiv

**Abstract**: Instruction tuning significantly enhances the performance of large language models (LLMs) across various tasks. However, the procedure to optimizing the mixing of instruction datasets for LLM fine-tuning is still poorly understood. This study categorizes instructions into three primary types: NLP downstream tasks, coding, and general chat. We explore the effects of instruction tuning on different combinations of datasets on LLM performance, and find that certain instruction types are more advant...

---

### [198] Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models

**Citation**: 198. Jialiang Wang, Hanmo Liu, Shimin Di, et al. Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models. arXiv:2408.06717v3. 2024. doi:10.1145/3773966.3777982
**Source database**: arxiv
**DOI**: https://doi.org/10.1145/3773966.3777982

**Abstract**: High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descript...

---

### [199] Can ChatGPT be Your Personal Medical Assistant?

**Citation**: 199. Md. Rafiul Biswas, Ashhadul Islam, Zubair Shah, et al. Can ChatGPT be Your Personal Medical Assistant? arXiv:2312.12006v1. 2023.
**Source database**: arxiv

**Abstract**: The advanced large language model (LLM) ChatGPT has shown its potential in different domains and remains unbeaten due to its characteristics compared to other LLMs. This study aims to evaluate the potential of using a fine-tuned ChatGPT model as a personal medical assistant in the Arabic language. To do so, this study uses publicly available online questions and answering datasets in Arabic language. There are almost 430K questions and answers for 20 disease-specific categories. GPT-3.5-turbo mo...

---

### [200] Is Self-knowledge and Action Consistent or Not: Investigating Large Language Model's Personality

**Citation**: 200. Yiming Ai, Zhiwei He, Ziyin Zhang, et al. Is Self-knowledge and Action Consistent or Not: Investigating Large Language Model's Personality. arXiv:2402.14679v2. 2024.
**Source database**: arxiv

**Abstract**: In this study, we delve into the validity of conventional personality questionnaires in capturing the human-like personality traits of Large Language Models (LLMs). Our objective is to assess the congruence between the personality traits LLMs claim to possess and their demonstrated tendencies in real-world scenarios. By conducting an extensive examination of LLM outputs against observed human response patterns, we aim to understand the disjunction between self-knowledge and action in LLMs.

---

### [201] Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models

**Citation**: 201. Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee, et al. Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models. arXiv:2512.08480v1. 2025.
**Source database**: arxiv

**Abstract**: Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought...

---

### [202] Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese

**Citation**: 202. Haochun Wang, Sendong Zhao, Zewen Qiang, et al. Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese. arXiv:2309.04175v1. 2023. doi:10.1145/3686807
**Source database**: arxiv
**DOI**: https://doi.org/10.1145/3686807

**Abstract**: Large Language Models (LLMs) have demonstrated remarkable success in diverse natural language processing (NLP) tasks in general domains. However, LLMs sometimes generate responses with the hallucination about medical facts due to limited domain knowledge. Such shortcomings pose potential risks in the utilization of LLMs within medical contexts. To address this challenge, we propose knowledge-tuning, which leverages structured medical knowledge bases for the LLMs to grasp domain knowledge efficie...

---

### [203] WizardLM: Empowering large pre-trained language models to follow complex instructions

**Citation**: 203. Can Xu, Qingfeng Sun, Kai Zheng, et al. WizardLM: Empowering large pre-trained language models to follow complex instructions. arXiv:2304.12244v3. 2023.
**Source database**: arxiv

**Abstract**: Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rew...

---

### [204] Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

**Citation**: 204. Ranjan Sapkota, Manoj Karkee. Object Detection with Multimodal Large Vision-Language Models: An In-depth Review. arXiv:2508.19294v2. 2025. doi:10.1016/j.inffus.2025.103575
**Source database**: arxiv
**DOI**: https://doi.org/10.1016/j.inffus.2025.103575

**Abstract**: The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how thes...

---

### [205] MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

**Citation**: 205. Kaiwen Zuo, Yirui Jiang. MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models. arXiv:2412.18947v4. 2024.
**Source database**: arxiv

**Abstract**: Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications, yet their propensity for hallucinations -- generating medically implausible or inaccurate information -- presents substantial risks to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to create a robust evaluatio...

---

### [206] Quantifying Hallucinations in Large Language Models on Medical Textbooks

**Citation**: 206. Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman. Quantifying Hallucinations in Large Language Models on Medical Textbooks. arXiv:2603.09986v1. 2026.
**Source database**: arxiv

**Abstract**: Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments:...

---

### [207] Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models

**Citation**: 207. Yuji Zhang, Sha Li, Jiateng Liu, et al. Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models. arXiv:2407.08039v1. 2024.
**Source database**: arxiv

**Abstract**: Hallucination is often regarded as a major impediment for using large language models (LLMs), especially for knowledge-intensive tasks. Even when the training corpus consists solely of true statements, language models still generate hallucinations in the form of amalgamations of multiple facts. We coin this phenomenon as "knowledge overshadowing": when we query knowledge from a language model with multiple conditions, some conditions overshadow others, leading to hallucinated outputs. This phe...

---

### [208] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

**Citation**: 208. Ziyang Luo, Can Xu, Pu Zhao, et al. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568v2. 2023.
**Source database**: arxiv

**Abstract**: Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, H...

---

### [209] (Im)possibility of Automated Hallucination Detection in Large Language Models

**Citation**: 209. Amin Karbasi, Omar Montasser, John Sous, et al. (Im)possibility of Automated Hallucination Detection in Large Language Models. arXiv:2504.17004v2. 2025.
**Source database**: arxiv

**Abstract**: Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collecti...

---

### [210] MedHal: An Evaluation Dataset for Medical Hallucination Detection

**Citation**: 210. Gaya Mehenni, Fabrice Lamarche, Odette Rios-Ibacache, et al. MedHal: An Evaluation Dataset for Medical Hallucination Detection. arXiv:2504.08596v2. 2025.
**Source database**: arxiv

**Abstract**: We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: ...

---

### [211] Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration

**Citation**: 211. Aayush Gupta. Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration. arXiv:2509.25252v2. 2025.
**Source database**: arxiv

**Abstract**: "The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." Large Language Models have conquered natural language but remain prisoners of their own probabilistic nature--confidently hallucinating facts they never truly knew. We present Fact Grounded Attention (FGA), a novel architectural modification that transforms unreliable language models into deterministic truth tellers by injecting verifiable knowledge directly into the attention mechanism. Unlike existing approach...

---

### [212] Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models

**Citation**: 212. Guangzhi Xiong, Eric Xie, Amir Hassan Shariatmadari, et al. Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models. arXiv:2411.02382v1. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel r...

---

### [213] Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models

**Citation**: 213. Derong Xu, Ziheng Zhang, Zhihong Zhu, et al. Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models. arXiv:2402.18099v3. 2024.
**Source database**: arxiv

**Abstract**: Model editing aims to precisely alter the behaviors of large language models (LLMs) in relation to specific knowledge, while leaving unrelated knowledge intact. This approach has proven effective in addressing issues of hallucination and outdated information in LLMs. However, the potential of using model editing to modify knowledge in the medical field remains largely unexplored, even though resolving hallucination is a pressing need in this area. Our observations indicate that current methods f...

---

### [214] Neural Probe-Based Hallucination Detection for Large Language Models

**Citation**: 214. Shize Liang, Hongzhi Wang. Neural Probe-Based Hallucination Detection for Large Language Models. arXiv:2512.20949v1. 2025.
**Source database**: arxiv

**Abstract**: Large language models(LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods th...

---

### [215] MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

**Citation**: 215. Zishan Gu, Changchang Yin, Fenglin Liu, et al. MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context. arXiv:2407.02730v1. 2024.
**Source database**: arxiv

**Abstract**: Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which inspires a large amount of studies for LVLMs fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-sp...

---

### [216] Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap

**Citation**: 216. Xingyu Wu, Sheng-hao Wu, Jibin Wu, et al. Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap. arXiv:2401.10034v3. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) have not only revolutionized natural language processing but also extended their prowess to various domains, marking a significant stride towards artificial general intelligence. The interplay between LLMs and evolutionary algorithms (EAs), despite differing in objectives and methodologies, share a common pursuit of applicability in complex problems. Meanwhile, EA can provide an optimization framework for LLM's further enhancement under black-box settings, empowering...

---

### [217] Unified Hallucination Detection for Multimodal Large Language Models

**Citation**: 217. Xiang Chen, Chenxi Wang, Yida Xue, et al. Unified Hallucination Detection for Multimodal Large Language Models. arXiv:2402.03190v4. 2024.
**Source database**: arxiv

**Abstract**: Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination. The reliable detection of such hallucinations in MLLMs has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. Prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. In r...

---

### [218] ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

**Citation**: 218. Yuzhe Gu, Ziwei Ji, Wenwei Zhang, et al. ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models. arXiv:2407.04693v2. 2024.
**Source database**: arxiv

**Abstract**: Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes, which struggle to scale due to prohibitive labor costs and insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressi...

---

### [219] Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

**Citation**: 219. Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, et al. Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models. arXiv:2308.16149v2. 2023.
**Source database**: arxiv

**Abstract**: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, b...

---

### [220] Aviation Safety Enhancement via NLP & Deep Learning: Classifying Flight Phases in ATSB Safety Reports

**Citation**: 220. Aziida Nanyonga, Hassan Wasswa, Graham Wild. Aviation Safety Enhancement via NLP & Deep Learning: Classifying Flight Phases in ATSB Safety Reports. arXiv:2501.07923v1. 2025.
**Source database**: arxiv

**Abstract**: Aviation safety is paramount, demanding precise analysis of safety occurrences during different flight phases. This study employs Natural Language Processing (NLP) and Deep Learning models, including LSTM, CNN, Bidirectional LSTM (BLSTM), and simple Recurrent Neural Networks (sRNN), to classify flight phases in safety reports from the Australian Transport Safety Bureau (ATSB). The models exhibited high accuracy, precision, recall, and F1 scores, with LSTM achieving the highest performance of 87%...

---

### [221] RAC: Retrieval-Augmented Clarification for Faithful Conversational Search

**Citation**: 221. Ahmed Rayane Kebir, Vincent Guigue, Lynda Said Lhadj, et al. RAC: Retrieval-Augmented Clarification for Faithful Conversational Search. arXiv:2601.11722v1. 2026.
**Source database**: arxiv

**Abstract**: Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpu...

---

### [222] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

**Citation**: 222. Keeman M, Keeman A. Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations. arXiv:2603.09997v1. 2026.
**Source database**: arxiv

**Abstract**: When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-g...

---

### [223] Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

**Citation**: 223. Manna S, Sett N. Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations. arXiv:2409.17774v2. 2024. doi:10.18653/v1/2024.blackboxnlp-1.12
**Source database**: arxiv
**DOI**: https://doi.org/10.18653/v1/2024.blackboxnlp-1.12

**Abstract**: Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to advers...

---

### [224] Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision

**Citation**: 224. Mukherjee M, Hellendoorn VJ. Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision. arXiv:2603.01494v1. 2026.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities or changing security standards without retraining, leading to the repeated generation of unsafe code.   We present a principled approach to trustworthy ...

---

### [225] VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification

**Citation**: 225. Qian H, Fan Y, Guo J, et al. VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification. arXiv:2510.11394v1. 2025. doi:10.1145/3767695.3769505
**Source database**: arxiv
**DOI**: https://doi.org/10.1145/3767695.3769505

**Abstract**: Retrieval-Augmented Generation (RAG) has emerged as a crucial approach for enhancing the responses of large language models (LLMs) with external knowledge sources. Despite the impressive performance in complex question-answering tasks, RAG still struggles with hallucinations. Attributing RAG-generated content through in-line citations has demonstrated potential in reducing hallucinations and facilitating human verification. Existing citation generation methods primarily rely on either fine-tunin...

---

### [226] Relational Action Bases: Formalization, Effective Safety Verification, and Invariants (Extended Version)

**Citation**: 226. Ghilardi S, Gianola A, Montali M, et al. Relational Action Bases: Formalization, Effective Safety Verification, and Invariants (Extended Version). arXiv:2208.06377v2. 2022.
**Source database**: arxiv

**Abstract**: Modeling and verification of dynamic systems operating over a relational representation of states are increasingly investigated problems in AI, Business Process Management, and Database Theory. To make these systems amenable to verification, the amount of information stored in each relational state needs to be bounded, or restrictions are imposed on the preconditions and effects of actions. We introduce the general framework of relational action bases (RABs), which generalizes existing models by...

---

### [227] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

**Citation**: 227. Khan E, Rodriguez L, Queudot M. Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation. arXiv:2603.10143v1. 2026.
**Source database**: arxiv

**Abstract**: Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a ra...

---

### [228] Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation

**Citation**: 228. Wu D, Gu JC, Yin F, et al. Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation. arXiv:2406.13692v2. 2024.
**Source database**: arxiv

**Abstract**: Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks. However, there are significant trustworthiness concerns as RALMs are prone to generating unfaithful outputs, including baseless information or contradictions with the retrieved context. This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics including sequence likelihood, uncertainty quantification, context influence, and semanti...

---

### [229] Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine

**Citation**: 229. Liu F, Yang B, You C, et al. Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine. arXiv:2210.12777v4. 2022.
**Source database**: arxiv

**Abstract**: Language models (LMs), including large language models (such as ChatGPT), have the potential to assist clinicians in generating various clinical notes. However, LMs are prone to produce ``hallucinations'', i.e., generated content that is not aligned with facts and knowledge. In this paper, we propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning to enable LMs to generate faithful clinical texts. We demonstrate the effectiveness of our method in gene...

---

### [230] Case Study: Runtime Safety Verification of Neural Network Controlled System

**Citation**: 230. Yang F, Zhan SS, Wang Y, et al. Case Study: Runtime Safety Verification of Neural Network Controlled System. arXiv:2408.08592v1. 2024.
**Source database**: arxiv

**Abstract**: Neural networks are increasingly used in safety-critical applications such as robotics and autonomous vehicles. However, the deployment of neural-network-controlled systems (NNCSs) raises significant safety concerns. Many recent advances overlook critical aspects of verifying control and ensuring safety in real-time scenarios. This paper presents a case study on using POLAR-Express, a state-of-the-art NNCS reachability analysis tool, for runtime safety verification in a Turtlebot navigation syst...

---

### [231] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

**Citation**: 231. Liu X, He B, Liu X, et al. Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems. arXiv:2603.23508v1. 2026.
**Source database**: arxiv

**Abstract**: Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated ...

---

### [232] Unifying Qualitative and Quantitative Safety Verification of DNN-Controlled Systems

**Citation**: 232. Zhi D, Wang P, Liu S, et al. Unifying Qualitative and Quantitative Safety Verification of DNN-Controlled Systems. arXiv:2404.01769v1. 2024.
**Source database**: arxiv

**Abstract**: The rapid advance of deep reinforcement learning techniques enables the oversight of safety-critical systems through the utilization of Deep Neural Networks (DNNs). This underscores the pressing need to promptly establish certified safety guarantees for such DNN-controlled systems. Most of the existing verification approaches rely on qualitative approaches, predominantly employing reachability analysis. However, qualitative verification proves inadequate for DNN-controlled systems as their behav...

---

### [233] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

**Citation**: 233. Fadeeva E, Rubashevskii A, Piatrashyn D, et al. Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation. arXiv:2505.21072v3. 2025.
**Source database**: arxiv

**Abstract**: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually corre...

---

### [234] Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation

**Citation**: 234. He B, Chen N, He X, et al. Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation. arXiv:2410.05801v1. 2024.
**Source database**: arxiv

**Abstract**: Recent Retrieval Augmented Generation (RAG) aims to enhance Large Language Models (LLMs) by incorporating extensive knowledge retrieved from external sources. However, such approach encounters some challenges: Firstly, the original queries may not be suitable for precise retrieval, resulting in erroneous contextual knowledge; Secondly, the language model can easily generate inconsistent answer with external references due to their knowledge boundary limitation. To address these issues, we propos...

---
