Measuring AI Chatbots: Evaluation Methods

AI chatbots are now an essential part of everyday life, playing key roles in everything from customer service to educational support. As these tools become more widely used, assessing their performance, particularly their accuracy and relevance, has become increasingly important. However, evaluating chatbot effectiveness is no simple task, given the wide range of functions they perform and the subjective nature of what constitutes “accuracy” in different scenarios. This blog entry examines both quantitative and qualitative approaches to measuring chatbot accuracy, highlighting their respective advantages and challenges.

Quantitative Approaches

Quantitative approaches provide numerical measures of chatbot accuracy, offering objectivity and scalability. 

Traditional Natural Language Processing (NLP) Metrics

Metrics such as BLEU, ROUGE, and perplexity are commonly used to evaluate language model outputs. BLEU measures precision-oriented overlap in n-grams (sequences of consecutive tokens) between a response and a reference, ROUGE focuses on recall-oriented overlap, and perplexity measures how uncertain a model is about the tokens it predicts. These metrics are objective and scalable to large datasets, and BLEU and ROUGE in particular are effective for comparing chatbot responses to reference answers. They provide quick, automated assessments and are supported by existing libraries in many frameworks. Additionally, these metrics are well established in natural language processing (NLP) research.
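To make these definitions concrete, here is a minimal sketch of the core calculations in plain Python. It is a simplification: full BLEU also combines several n-gram orders and applies a brevity penalty, and in practice an established library would be used. The function names and the example sentences are illustrative assumptions, not part of any cited paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped precision: share of candidate n-grams found in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams recovered by the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Hypothetical reference answer and chatbot response.
reference = "the store opens at nine".split()
candidate = "the store opens at ten today".split()

print("precision:", ngram_precision(candidate, reference))
print("recall:", ngram_recall(candidate, reference))
print("perplexity:", perplexity([0.25, 0.25, 0.25, 0.25]))
```

Note that the candidate scores well on both overlap measures even though it gives the wrong opening time, which is exactly the kind of semantic error these metrics cannot see.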

However, these metrics have notable limitations. They fail to capture deeper conversational elements such as context and semantic meaning, making them less effective for evaluating open-ended, creative, or contextually nuanced responses. Furthermore, each metric evaluates only a specific aspect of performance while ignoring others. For instance, ROUGE focuses on content overlap but lacks the ability to assess semantic accuracy, while BLEU is effective for measuring translation precision but does not account for context or fluency (Banerjee et al., 2023; Meyer et al., 2024).

End-to-End (E2E) Benchmarking

Banerjee et al. (2023) proposed an End-to-End (E2E) benchmark that compares chatbot responses with “Golden Answers” using cosine similarity. This technique measures the precision and utility of the responses, which is especially helpful for LLM-powered chatbots. The E2E benchmark offers an effective comparison framework through user-centric metrics.

One of this method’s key strengths is that it accounts for semantic meaning and context. Unlike traditional metrics, which depend on exact word matches, cosine similarity between vector representations of two responses measures how close they are in meaning, so it tolerates synonyms and variations in sentence structure. For example, if a user asks a question about apologetics, a response that closely resembles one written by an experienced apologist will be scored as helpful and appropriate.
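The comparison itself can be sketched in a few lines. This is only an illustration under stated assumptions: the `embed` function below is a toy bag-of-words stand-in, whereas a real benchmark would use a sentence-embedding model, which is what allows synonyms and paraphrases to score highly. The function names and example sentences are hypothetical and are not taken from Banerjee et al. (2023).

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption: a real benchmark would call a sentence-embedding model here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dict-like counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_against_golden(response, golden_answer):
    """Score a chatbot response by its similarity to a reference 'Golden Answer'."""
    return cosine_similarity(embed(response), embed(golden_answer))


# Hypothetical Golden Answer and two candidate responses.
golden = "you can reset your password from the account settings page"
good = "reset your password from the account settings page"
poor = "our office is closed on public holidays"

print("good response:", score_against_golden(good, golden))
print("poor response:", score_against_golden(poor, golden))
```

A score near 1 means the response points in almost the same direction as the Golden Answer in the vector space, and a score near 0 means the two share essentially nothing.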

However, the E2E benchmark has its limitations. It relies on a predefined set of “Golden Answers” that typically covers only a limited range of anticipated queries. In practice, users often ask unpredictable, novel, or contextually specific questions that none of these fixed answers match. Moreover, “correctness” is itself subjective: open-ended questions can have multiple valid answers, depending on how the question is understood. Errors in the Golden Answers themselves further undermine the scores. Finally, in fast-changing domains such as news or research, such canned answers quickly become outdated and no longer reflect current knowledge, making the evaluation less robust.

Psychological Metrics

Giorgi et al. (2023) introduced psychological metrics to evaluate chatbots based on factors such as emotional entropy, empathy, and linguistic style matching. These measures focus on human-like traits, offering a distinctive approach to assessing conversational quality. This method provides valuable insights into how effectively a chatbot mimics human behaviours and responses, offering a heuristic for understanding its conversational capabilities.
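As a rough illustration of one such measure, emotional entropy can be thought of as the Shannon entropy of the emotions expressed across a conversation. The sketch below assumes each chatbot turn has already been labelled with an emotion by some classifier, which is not shown. It is not necessarily the exact formulation used by Giorgi et al. (2023), and the labels are hypothetical.

```python
import math
from collections import Counter


def emotion_entropy(labels):
    """Shannon entropy (in bits) of the emotion labels across a conversation's turns."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical per-turn emotion labels for two chatbots.
flat_bot = ["neutral", "neutral", "neutral", "neutral"]
varied_bot = ["joy", "sadness", "anger", "fear"]

print("flat bot:", emotion_entropy(flat_bot))
print("varied bot:", emotion_entropy(varied_bot))
```

Low entropy indicates a chatbot that stays in one emotional register, while high entropy indicates one whose emotional tone shifts from turn to turn.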

Despite its strengths, this approach has some disadvantages. It is computationally intensive, making it less scalable compared to traditional metrics. Furthermore, standardizing these evaluations across diverse conversational contexts proves challenging, as emotional and relational dynamics can vary widely depending on the specific interaction.

Qualitative Approaches

Qualitative approaches provide a more nuanced evaluation of chatbots, enabling the assessment of aspects such as creativity, contextual relevance, and subjective measurements. This flexibility allows evaluators to appreciate how a chatbot responds to open-ended prompts, aligns with user intent, and handles creative or context-specific scenarios.

Human Evaluation

Human evaluators assess chatbot responses by considering factors such as fluency, relevance, and informativeness. While inherently subjective, this approach captures nuances often overlooked by automated metrics. It offers deep insights into real-world performance and user satisfaction, providing the flexibility to evaluate open-ended and creative tasks. Human evaluators can appreciate subtleties like humor, creativity, and style, assessing whether responses align with the intent of the prompt. They are also capable of evaluating the originality, coherence, and emotional resonance of a chatbot’s interactions.

This approach has several drawbacks. It is resource-intensive, both in terms of cost and time, and is challenging to scale for large evaluations. Additionally, it is prone to evaluator bias, which can affect consistency and reliability. For meaningful comparisons, particularly between newer and older chatbots, the same evaluator would ideally need to be involved throughout the process, a scenario that is often impractical. These challenges highlight the trade-offs involved in relying on human evaluations for chatbot assessment.
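One common way to quantify how consistent two evaluators are is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustrative addition rather than a method drawn from the papers cited here, and the ratings are made up for the example.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each rater labelled at random with their own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six chatbot responses by two human evaluators.
rater_1 = ["good", "good", "bad", "good", "bad", "bad"]
rater_2 = ["good", "bad", "bad", "good", "bad", "good"]

print("kappa:", cohens_kappa(rater_1, rater_2))
```

A kappa of 1 means perfect agreement and a value near 0 means the evaluators agree no more often than chance would predict, which is a signal that the rating guidelines need tightening.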

Moral and Ethical Standards

Evaluations based on ethical principles are vital for chatbots that address sensitive topics, ensuring their responses align with societal norms and moral expectations. Aharoni et al. (2024) highlighted this through the modified Moral Turing Test (m-MTT), which measures whether AI-generated moral judgments are comparable to those made by humans. By requiring AI systems to produce ethical reasoning that aligns with established human standards, this approach helps promote inclusivity and safeguards societal norms. Notably, the study found that participants judged AI-generated moral reasoning as superior to human reasoning in certain instances, emphasizing the importance of fostering ethical perceptions in chatbot design.

Ethical standards provide a crucial benchmark for chatbots to emulate expected human behaviour. For example, a chatbot might be evaluated on its ability to avoid promoting stereotypes or prejudice, adhering to principles of inclusivity and fairness. Additionally, such evaluations help identify potential risks, such as harm or misinformation, ensuring chatbots operate within legal and ethical boundaries while safeguarding user rights.

However, these evaluations have their disadvantages. They are highly subjective, often influenced by cultural or personal biases, and complex to design and implement effectively. Furthermore, designers must be cautious not to foster overconfidence in chatbots. As Aharoni et al. (2024) warned, chatbots perceived as more competent than humans might lead users to uncritically accept their moral guidance, creating undue trust in potentially flawed or harmful advice. This highlights the importance of implementing ethical safeguards to mitigate these risks and ensure chatbots are both effective and responsible in addressing moral and ethical concerns.

Mixed Approaches

Quantitative and qualitative methods each bring unique strengths to chatbot evaluation, but both have notable limitations. Quantitative metrics, such as BLEU or E2E, excel in scalability and objectivity, making them ideal for large-scale assessments. However, these metrics often fall short of capturing the subtleties of human communication, such as context, creativity, and emotional depth. On the other hand, qualitative evaluations, including human judgment or moral frameworks, provide richer insights by accounting for nuanced aspects of interaction. These approaches offer a deeper understanding of a chatbot’s performance but are resource-intensive and prone to subjective biases. To address these challenges, a hybrid approach that combines both methods can be highly effective.
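One possible shape for such a hybrid approach is to let an automated metric score every response and send only the borderline cases to human evaluators. The sketch below is an assumption about how this could work, not a method from the cited papers. The thresholds and the `triage` function name are hypothetical and would need tuning for a real system.

```python
def triage(responses, auto_scores, low=0.4, high=0.8):
    """Split responses into auto-pass, auto-fail, and human-review buckets.

    Assumption: auto_scores come from an automated metric scaled to [0, 1],
    and the low/high thresholds are illustrative placeholders.
    """
    buckets = {"pass": [], "fail": [], "review": []}
    for response, score in zip(responses, auto_scores):
        if score >= high:
            buckets["pass"].append(response)
        elif score < low:
            buckets["fail"].append(response)
        else:
            # Borderline scores are where human judgment adds the most value.
            buckets["review"].append(response)
    return buckets


# Hypothetical responses with their automated similarity scores.
responses = ["answer A", "answer B", "answer C", "answer D"]
scores = [0.92, 0.15, 0.55, 0.79]

result = triage(responses, scores)
print(result)
```

This keeps the scalability of the automated metric while reserving costly human attention for the cases the metric is least able to settle.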

References

Aharoni, E., Fernandes, S., Brady, D. J., Alexander, C., Criner, M., Queen, K., Rando, J., Nahmias, E., & Crespo, V. (2024). Attributions toward artificial agents in a modified Moral Turing Test. Scientific Reports, 14(1), 8458. https://doi.org/10.1038/s41598-024-58087-7

Banerjee, D., Singh, P., Avadhanam, A., & Srivastava, S. (2023). Benchmarking LLM powered Chatbots: Methods and Metrics. ArXiv.org. https://arxiv.org/abs/2308.04624

Giorgi, S., Havaldar, S., Ahmed, F., Akhtar, Z., Vaidya, S., Pan, G., Ungar, L. H., Andrew, S. H., & Sedoc, J. (2023). Psychological Metrics for Dialog System Evaluation. ArXiv.org. https://arxiv.org/abs/2305.14757

Meyer, S., Singh, S., Tam, B., Ton, C., & Ren, A. (2024). A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case. ArXiv.org. https://arxiv.org/abs/2408.03562

Should AI Be Entrusted with Christian Roles? Exploring the Case for and Against Christian Chatbots and Religious Robots

Artificial Intelligence (AI) has quickly transitioned from fiction to an integral part of modern life. Among its many applications, the idea of a Christian chatbot or religious robot has ignited significant debate. Can machines support spiritual journeys, aid evangelism, or even participate in church services? This post examines the arguments for and against these innovations and explores how these systems can minimize false statements to uphold their integrity and purpose. These reflections are based on a conversation I had with Jake Carlson, founder of The Apologist Project.

The Case for Christian Chatbots and Religious Robots

The primary argument for Christian chatbots lies in their potential to advance evangelism and make Christian teachings accessible. In our discussion, Jake emphasized their role in fulfilling the Great Commission by answering challenging theological questions with empathy and a foundation in Scripture. His chatbot, apologist.ai, serves two key audiences: nonbelievers seeking answers about Christianity and believers who need support in sharing their faith. Tools like this can become a bridge to deeper biblical engagement.

Religious robots, meanwhile, show promise in supporting religious practices, particularly where human ministers may be unavailable. Robots like BlessU-2, which delivers blessings, and SanTO, designed to aid in prayer and meditation, illustrate how technology can complement traditional ministry. These innovations also provide companionship and spiritual guidance to underserved groups, such as the elderly, fostering a sense of connection and comfort (Puzio, 2023).

AI also offers significant potential in theological education. Fine-tuning AI models on Christian texts and resources allows developers to create tools that help students and scholars explore complex biblical questions. Such systems enhance learning by offering immediate, detailed comparisons of theological perspectives while maintaining fidelity to core doctrines (Graves, 2023; Schuurman, 2019). As Jake explains, models can be tailored to represent specific denominational teachings and traditions, making them versatile tools for faith formation.

The Challenges and Concerns

Despite their potential, these technologies raise valid concerns. One significant theological issue is the risk of idolatry, where reliance on AI might inadvertently replace engagement with Scripture or human-led discipleship. Jake emphasizes that Christian chatbots must clearly position themselves as tools, not authorities, to avoid overstepping their intended role.

Another challenge lies in the inherent limitations of AI. Critics like Luke Plant and FaithGPT warn that chatbots can oversimplify complex theological issues, potentially leading to misunderstandings or shallow faith formation (VanderLeest & Schuurman, 2015). AI’s dependence on pre-trained models also introduces the risk of factual inaccuracies or biased interpretations, undermining credibility and trust. Because of this, these critics argue that pursuing Christian chatbots is irresponsible and that it violates the commandment against creating engraved images.

Additionally, the question of whether robots can genuinely fulfill religious roles remains unresolved. Religious practices are inherently relational and experiential, requiring discernment, empathy, and spiritual depth—qualities AI cannot replicate. As Puzio (2023) notes, while robots like Mindar, a Buddhist priest robot, have conducted rituals, such actions lack the relational and spiritual connection that is central to many faith traditions.

Designing AI to Minimize Falsehoods

Given the theological and ethical stakes, developing Christian chatbots requires careful planning. Jake’s approach offers a valuable framework for minimizing errors while ensuring theological fidelity. Selecting an open-source AI model, for example, provides developers with greater control over the system’s foundational algorithms, reducing the risk of unforeseen biases being introduced later by external entities.

Training these chatbots on a broad range of theological perspectives is essential to ensure they deliver well-rounded, biblically accurate responses. Clear disclaimers about their limitations are also crucial to reinforce their role as supplemental tools rather than authoritative voices. Failure to do so risks misconceptions about an “AI Jesus,” which borders on idolatry by shifting reliance from the Creator to the created. Additionally, programming these systems to prioritize empathy and gentleness reflects Christian values and fosters trust, even in disagreement.

Feedback mechanisms play a critical role in maintaining accuracy. By incorporating user feedback, developers can refine responses iteratively, addressing inaccuracies and improving cultural and theological sensitivity over time (Graves, 2023). Jake also highlights retrieval-augmented generation, a technique that grounds responses in passages retrieved from a curated body of knowledge. This method can significantly reduce hallucinations, enhancing reliability.
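The retrieval step of retrieval-augmented generation can be sketched as follows. This is a simplified illustration and not a description of how apologist.ai is built: it ranks passages by plain word overlap, whereas production systems typically use embedding-based search, and the call to the language model itself is omitted. The corpus, function names, and prompt wording are all hypothetical.

```python
def tokenize(text):
    """Lowercase a text and return its set of words with punctuation stripped."""
    words = {word.strip(".,;:!?\"'()").lower() for word in text.split()}
    return words - {""}


def retrieve(question, corpus, k=2):
    """Return up to k passages that share the most words with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(passage)), passage) for passage in corpus]
    scored = [pair for pair in scored if pair[0] > 0]  # drop passages with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def build_prompt(question, passages):
    """Build a prompt that tells the model to answer only from the retrieved passages."""
    if not passages:
        return None  # nothing relevant found: the caller should decline rather than guess
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical curated corpus.
corpus = [
    "The Gospel of John is the fourth book of the New Testament.",
    "The Book of Jonah is one of the twelve minor prophets.",
    "Paul's letter to the Romans is in the New Testament.",
]

question = "Where is the Book of Jonah found?"
top_passages = retrieve(question, corpus, k=1)
print(build_prompt(question, top_passages))
```

Returning nothing when no passage matches is a deliberate choice in this sketch: a refusal is safer than an answer the curated sources cannot support.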

Striking a Balance

The debate over Christian chatbots and religious robots underscores the tension between embracing innovation and preserving tradition. While these tools offer opportunities to extend ministry, enhance education, and provide comfort, they must be designed and used with humility and discernment. Developers should ground their work in biblical principles, ensuring that technology complements rather than replaces human-led spiritual engagement.

Ultimately, the church must navigate this new paradigm carefully, weighing the benefits of accessibility and evangelism against the risks of misrepresentation. As Jake puts it, by adding empathy to truth, Christians can responsibly harness AI’s potential to advance the kingdom of God.

References

VanderLeest, S., & Schuurman, D. (2015, June). A Christian Perspective on Artificial Intelligence: How Should Christians Think about Thinking Machines. In Proceedings of the 2015 Christian Engineering Conference (CEC), Seattle Pacific University, Seattle, WA (pp. 91-107).

Graves, M. (2023). ChatGPT’s Significance for Theology. Theology and Science, 21(2), 201–204. https://doi.org/10.1080/14746700.2023.2188366

Schuurman, D. C. (2019). Artificial Intelligence: Discerning a Christian Response. Perspectives on Science & Christian Faith, 71(2).

Puzio, A. (2023). Robot, let us pray! Can and should robots have religious functions? An ethical exploration of religious robots. AI & SOCIETY. https://doi.org/10.1007/s00146-023-01812-z

Examining Bias in Large Language Models Towards Christianity and Monotheistic Religions: A Christian Response

The rise of large language models (LLMs) like ChatGPT has transformed the way we interact with technology, enabling advanced language processing and content generation. However, these models have also faced scrutiny for biases, especially regarding religious content related to Christianity, Islam, and other monotheistic faiths. These biases go beyond technical limitations; they reflect deeper societal and ethical issues that demand the attention of Christian computer science (CS) scholars.

Understanding Bias in LLMs

Bias in LLMs often emerges as a result of the data on which they are trained. These models are built on vast datasets drawn from diverse online content—news articles, social media, academic papers, and more. A challenge arises because much of this content reflects societal biases, which the models then internalize and replicate. Oversby and Darr (2024) highlight how Christian CS scholars have a unique opportunity to examine and understand these biases, especially those tied to worldview and theological perspectives.

This issue is evident in FaithGPT’s recent findings (Oversby & Darr, 2024), which suggest that the way religious content is presented in source material significantly impacts an LLM’s responses. Such biases may be subtle, presenting religious doctrines as “superstitious,” or more overt, generating responses that undervalue religious perspectives. Reed’s (2021) exploration of GPT-2 offers further insights into how LLMs engage with religious material, underscoring that these biases stem not merely from technical constraints but from the datasets and frameworks underpinning the models. Reed’s study raises an essential question for Christian CS scholars: How can they address these technical aspects without disregarding the faith-based concerns that arise?

Biases in Islamic Contexts

LLM biases are not exclusive to Christian content; Islamic traditions also face misrepresentations. Bhojani and Schwarting (2023) documented cases where LLMs misquoted or misinterpreted the Quran, a serious issue for Muslims who regard its wording as sacred and inviolable. For instance, when asked about specific Quranic verses, LLMs sometimes fabricate or misinterpret content, causing frustration for users seeking accurate theological insights. Research by Patel, Kane, and Patel (2023) further emphasizes the need for domain-specific LLMs tailored to Islamic values, as generalized datasets often lack the nuance needed to respect Islamic theology.

Testing Theological and Ethical Biases

Elrod’s (2024) research outlines a method to examine theological biases in LLMs by prompting them with religious texts like the Ten Commandments or the Book of Jonah. I replicated this study using a similar prompt, instructing ChatGPT to generate additional commandments (11–15) at different temperature values (0 and 1.2). The findings were consistent with Elrod’s results, showing that LLMs tend to mirror prevailing social and ethical positions, frequently aligning with progressive stances on issues like social justice and inclusivity. While these positions may resonate with certain audiences, they also risk marginalizing traditional or conservative theological viewpoints, potentially alienating faith-based users.
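For readers unfamiliar with the temperature setting, the sketch below shows the standard way temperature reshapes a model's next-token probabilities. It illustrates the general technique only. How ChatGPT applies temperature internally is not documented here, and the scores are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into sampling probabilities at a given temperature."""
    if temperature == 0:
        # Temperature 0 is treated as greedy decoding: all probability on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

print("temperature 0:  ", softmax_with_temperature(logits, 0))
print("temperature 1.2:", softmax_with_temperature(logits, 1.2))
```

At temperature 0 the model always picks its top candidate, so repeated runs give essentially the same output, while at 1.2 the probabilities are flatter and less likely continuations are sampled more often.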

An article by FaithGPT (2023) explored anti-Christian bias in ChatGPT, attributing this bias to the secular or anti-religious tilt found in mainstream media sources used for training data. The article cites instances where figures like Adam and Eve and events like Christ’s resurrection were labeled as mythical or fictitious. I tested these claims in November 2024, noting that while responses had improved since 2023, biases toward progressive themes remained. For example, ChatGPT was open to generating jokes about Jesus but not about Allah or homosexuality. When asked for a Christian evangelical view on homosexuality, it provided a softened response that emphasized Christ’s love for all people, omitting any mention of “sin” or biblical references. However, when asked about adultery, ChatGPT offered a stronger response, complete with biblical citations. These examples suggest that while some biases have been addressed, others persist.

Appropriate Responses for Christian CS Scholars

What actions can Christian CS scholars take? Oversby and Darr (2024) propose several research areas that align with a Christian perspective in the field of computer science.

Firstly, they suggest that AI research provides a unique opportunity for Christians to engage in conversations about human nature, particularly concerning the limitations of artificial general intelligence (AGI). By exploring AI’s inability to achieve true consciousness or self-awareness, Christian scholars can open up discussions on the nature of the soul and human uniqueness. This approach allows for dialogues about faith that can offer depth to the study of technology.

The paper also points to Oklahoma Baptist University’s approach to integrating faith with AI education. Christian CS researchers are encouraged to weave discussions of faith and technology into their curriculum, aiming to equip students with a theistic perspective in computer science. Rather than yielding to non-theistic worldviews in AI, Christian scholars are urged to shape conversations around AI and ethics from a theistic standpoint, fostering a holistic view of technology’s role in society.

Finally, the paper highlights the need for ethical guidelines in AI research that reflect Christian values. This includes assessing AI’s role in society to ensure that AI systems serve humanity’s ethical and moral goals, aligning with values that prioritize human dignity and compassion.

Inspired by Patel et al. (2023), Christian CS scholars might also pursue the development of domain-specific LLMs that reflect Christian values and theology. Such models would require careful selection of datasets, potentially including Christian writings, hymns, theological commentaries, and historical teachings of the Church to create responses that resonate with Christian beliefs. Projects like Apologist.ai have already attempted this approach, though they’ve faced some backlash—highlighting an area ripe for further research and exploration. I plan to expand on this topic in an upcoming blog entry.

References

Bhojani, A., & Schwarting, M. (2023). Truth and regret: Large language models, the Quran, and misinformation. Theology and Science, 21(4), 557–563. https://doi.org/10.1080/14746700.2023.2255944

Elrod, A. G. (2024). Uncovering theological and ethical biases in LLMs: An integrated hermeneutical approach employing texts from the Hebrew Bible. HIPHIL Novum, 9(1). https://doi.org/10.7146/hn.v9i1.143407

Oversby, K. N., & Darr, T. P. (2024). Large language models and worldview – An opportunity for Christian computer scientists. Christian Engineering Conference. https://digitalcommons.cedarville.edu/christian_engineering_conference/2024/proceedings/4

Patel, S., Kane, H., & Patel, R. (2023). Building domain-specific LLMs faithful to the Islamic worldview: Mirage or technical possibility? Neural Information Processing Systems (NeurIPS 2023). https://doi.org/10.48550/arXiv.2312.06652

Reed, R. (2021). The theology of GPT-2: Religion and artificial intelligence. Religion Compass, 15(11), e12422. https://doi.org/10.1111/rec3.12422