Measuring AI Chatbots: Evaluation Methods

AI chatbots are now an essential part of everyday life, playing key roles in everything from customer service to educational support. As these tools become more widely used, assessing their performance, particularly their accuracy and relevance, has become increasingly important. However, evaluating chatbot effectiveness is no simple task, given the wide range of functions they perform and the subjective nature of what constitutes “accuracy” in different scenarios. This blog entry examines both quantitative and qualitative approaches to measuring chatbot accuracy, highlighting their respective advantages and challenges.

Quantitative Approaches

Quantitative approaches provide numerical measures of chatbot accuracy, offering objectivity and scalability. 

Traditional Natural Language Processing (NLP) Metrics

Metrics such as BLEU, ROUGE, and perplexity are commonly used to evaluate language model outputs. BLEU measures overlap in n-grams (sequences of consecutive tokens), ROUGE focuses on recall-based matching, and perplexity assesses how uncertain a model is about its predictions. These metrics are objective, scalable to large datasets, and effective for comparing chatbot responses to reference answers. They provide quick, automated assessments, are supported by pre-existing libraries in many frameworks, and are well established in NLP research.
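To make this concrete, here is a minimal sketch of clipped n-gram precision, the core quantity behind BLEU. It is an illustration rather than a full implementation; real BLEU also applies a brevity penalty, smoothing, and an average over several n-gram orders, so an established library is the better choice in practice.

```typescript
// Clipped n-gram precision, the core of BLEU: the fraction of candidate n-grams
// that also appear in the reference, with counts clipped so repeated words are
// not over-credited.

function ngrams(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

function ngramPrecision(candidate: string, reference: string, n: number): number {
  const cand = ngrams(candidate.toLowerCase().split(/\s+/), n);
  const ref = ngrams(reference.toLowerCase().split(/\s+/), n);
  let matched = 0;
  let total = 0;
  for (const [gram, count] of cand) {
    total += count;
    matched += Math.min(count, ref.get(gram) ?? 0);
  }
  return total === 0 ? 0 : matched / total;
}

// Bigram precision of a chatbot reply against a reference answer.
console.log(ngramPrecision("the cat sat on the mat", "the cat is on the mat", 2)); // 0.6
```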

However, these metrics have notable limitations. They fail to capture deeper conversational elements such as context and semantic meaning, making them less effective for evaluating open-ended, creative, or contextually nuanced responses. Furthermore, each metric evaluates only a specific aspect of performance while ignoring others. For instance, ROUGE focuses on content overlap but lacks the ability to assess semantic accuracy, while BLEU is effective for measuring translation precision but does not account for context or fluency (Banerjee et al., 2023; Meyer et al., 2024).

End-to-End (E2E) Benchmarking

Banerjee et al. (2023) proposed an End-to-End (E2E) benchmark that compares bot responses with “Golden Answers” using cosine similarity. This technique measures the precision and utility of responses and is especially helpful for LLM-powered chatbots. The E2E benchmark offers an effective comparison framework built on user-centric metrics.

One of this method’s key strengths is that it accounts for semantic meaning and context. Unlike traditional metrics that depend on exact word matches, cosine similarity over embeddings captures what responses mean, accommodating synonyms, context, and variations in sentence structure. For example, if a user asks a question about apologetics, an answer that is semantically close to a response written by an experienced apologist will be judged helpful and appropriate.
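A minimal sketch of this comparison, assuming some sentence-embedding model supplies the vectors, looks like the following. The Embedder function is a placeholder for whatever model an evaluator chooses; only the cosine computation itself is spelled out.

```typescript
// Minimal sketch of the E2E-style check: score a bot response against a
// "Golden Answer" by the cosine similarity of their embedding vectors.
// The embedding function is passed in; any sentence-embedding model could back it.

type Embedder = (text: string) => Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Returns a score in [-1, 1]; closer to 1 means the response is semantically
// closer to the curated golden answer.
async function scoreAgainstGolden(
  embed: Embedder,
  botResponse: string,
  goldenAnswer: string,
): Promise<number> {
  const [r, g] = await Promise.all([embed(botResponse), embed(goldenAnswer)]);
  return cosineSimilarity(r, g);
}
```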

However, the E2E benchmark has its limitations. It relies on a predefined set of “Golden Answers” that are typically tuned to a limited range of expected queries. In practice, users often ask unpredictable, novel, or contextually specific questions that do not map onto these fixed answers. Moreover, subjectivity is a problem when it comes to “correctness”: open-ended questions admit multiple valid answers depending on how the question is interpreted, and errors in crafting the Golden Answers themselves make judgments even harder. Finally, in rapidly changing domains like news or research, such canned answers quickly become outdated and stop reflecting current knowledge, making the evaluation less robust.

Psychological Metrics

Giorgi et al. (2023) introduced psychological metrics to evaluate chatbots based on factors such as emotional entropy, empathy, and linguistic style matching. These measures focus on human-like traits, offering a distinctive approach to assessing conversational quality. This method provides valuable insights into how effectively a chatbot mimics human behaviours and responses, offering a heuristic for understanding its conversational capabilities.
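As a rough illustration, one of these measures, emotional entropy, might be operationalized as Shannon entropy over a predicted emotion distribution, as sketched below. This is an assumption about the general idea rather than a reproduction of Giorgi et al.’s exact formulation, and the emotion classifier that produces the distribution is assumed rather than shown.

```typescript
// Sketch of one psychological metric: emotional entropy, computed here as
// Shannon entropy over a predicted emotion distribution for a chatbot's replies.
// The classifier that produces the probabilities is assumed, not shown.

function emotionalEntropy(emotionProbs: Record<string, number>): number {
  let h = 0;
  for (const p of Object.values(emotionProbs)) {
    if (p > 0) h -= p * Math.log2(p);
  }
  return h; // higher entropy = replies spread across many emotional states
}

// Example: a chatbot whose replies are mostly "joy" with a little "surprise".
console.log(emotionalEntropy({ joy: 0.7, surprise: 0.2, neutral: 0.1 })); // about 1.16 bits
```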

Despite its strengths, this approach has some disadvantages. It is computationally intensive, making it less scalable compared to traditional metrics. Furthermore, standardizing these evaluations across diverse conversational contexts proves challenging, as emotional and relational dynamics can vary widely depending on the specific interaction.

Qualitative Approaches

Qualitative approaches provide a more nuanced evaluation of chatbots, enabling the assessment of aspects such as creativity, contextual relevance, and other subjective qualities. This flexibility allows evaluators to appreciate how a chatbot responds to open-ended prompts, aligns with user intent, and handles creative or context-specific scenarios.

Human Evaluation

Human evaluators assess chatbot responses by considering factors such as fluency, relevance, and informativeness. While inherently subjective, this approach captures nuances often overlooked by automated metrics. It offers deep insights into real-world performance and user satisfaction, providing the flexibility to evaluate open-ended and creative tasks. Human evaluators can appreciate subtleties like humor, creativity, and style, assessing whether responses align with the intent of the prompt. They are also capable of evaluating the originality, coherence, and emotional resonance of a chatbot’s interactions.

This approach has several drawbacks. It is resource-intensive, both in terms of cost and time, and is challenging to scale for large evaluations. Additionally, it is prone to evaluator bias, which can affect consistency and reliability. For meaningful comparisons, particularly between newer and older chatbots, the same evaluator would ideally need to be involved throughout the process, a scenario that is often impractical. These challenges highlight the trade-offs involved in relying on human evaluations for chatbot assessment.

Moral and Ethical Standards

Evaluations based on ethical principles are vital for chatbots that address sensitive topics, ensuring their responses align with societal norms and moral expectations. Aharoni et al. (2024) highlighted this through the modified Moral Turing Test (m-MTT), which measures whether AI-generated moral judgments are comparable to those made by humans. By requiring AI systems to produce ethical reasoning that aligns with established human standards, this approach helps promote inclusivity and safeguards societal norms. Notably, the study found that participants judged AI-generated moral reasoning as superior to human reasoning in certain instances, emphasizing the importance of fostering ethical perceptions in chatbot design.

Ethical standards provide a crucial benchmark for chatbots to emulate expected human behaviour. For example, a chatbot might be evaluated on its ability to avoid promoting stereotypes or prejudice, adhering to principles of inclusivity and fairness. Additionally, such evaluations help identify potential risks, such as harm or misinformation, ensuring chatbots operate within legal and ethical boundaries while safeguarding user rights.

However, these evaluations have their disadvantages. They are highly subjective, often influenced by cultural or personal biases, and complex to design and implement effectively. Furthermore, designers must be cautious not to foster overconfidence in chatbots. As Aharoni et al. (2024) warned, chatbots perceived as more competent than humans might lead users to uncritically accept their moral guidance, creating undue trust in potentially flawed or harmful advice. This highlights the importance of implementing ethical safeguards to mitigate these risks and ensure chatbots are both effective and responsible in addressing moral and ethical concerns.

Mixed Approaches

Quantitative and qualitative methods each bring unique strengths to chatbot evaluation, but both have notable limitations. Quantitative metrics, such as BLEU or E2E, excel in scalability and objectivity, making them ideal for large-scale assessments. However, these metrics often fall short of capturing the subtleties of human communication, such as context, creativity, and emotional depth. On the other hand, qualitative evaluations, including human judgment or moral frameworks, provide richer insights by accounting for nuanced aspects of interaction. These approaches offer a deeper understanding of a chatbot’s performance but are resource-intensive and prone to subjective biases. To address these challenges, a hybrid approach that combines both methods can be highly effective.

References

Aharoni, E., Fernandes, S., Brady, D. J., Alexander, C., Criner, M., Queen, K., Rando, J., Nahmias, E., & Crespo, V. (2024). Attributions toward artificial agents in a modified Moral Turing Test. Scientific Reports, 14(1), 8458. https://doi.org/10.1038/s41598-024-58087-7

Banerjee, D., Singh, P., Avadhanam, A., & Srivastava, S. (2023). Benchmarking LLM powered Chatbots: Methods and Metrics. arXiv. https://arxiv.org/abs/2308.04624

Giorgi, S., Havaldar, S., Ahmed, F., Akhtar, Z., Vaidya, S., Pan, G., Ungar, L. H., Andrew, S. H., & Sedoc, J. (2023). Psychological Metrics for Dialog System Evaluation. arXiv. https://arxiv.org/abs/2305.14757

Meyer, S., Singh, S., Tam, B., Ton, C., & Ren, A. (2024). A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case. arXiv. https://arxiv.org/abs/2408.03562

Should AI Be Entrusted with Christian Roles? Exploring the Case for and Against Christian Chatbots and Religious Robots

Artificial Intelligence (AI) has quickly transitioned from fiction to an integral part of modern life. Among its many applications, the idea of a Christian chatbot or religious robot has ignited significant debate. Can machines support spiritual journeys, aid evangelism, or even participate in church services? This post examines the arguments for and against these innovations and explores how such systems can minimize false statements to uphold their integrity and purpose. These reflections are based on a conversation I had with Jake Carlson, founder of The Apologist Project.

The Case for Christian Chatbots and Religious Robots

The primary argument for Christian chatbots lies in their potential to advance evangelism and make Christian teachings accessible. In our discussion, Jake emphasized their role in fulfilling the Great Commission by answering challenging theological questions with empathy and a foundation in Scripture. His chatbot, apologist.ai, serves two key audiences: nonbelievers seeking answers about Christianity and believers who need support in sharing their faith. Tools like this can become a bridge to deeper biblical engagement.

Religious robots, meanwhile, show promise in supporting religious practices, particularly where human ministers may be unavailable. Robots like BlessU-2, which delivers blessings, and SanTO, designed to aid in prayer and meditation, illustrate how technology can complement traditional ministry. These innovations also provide companionship and spiritual guidance to underserved groups, such as the elderly, fostering a sense of connection and comfort (Puzio, 2023).

AI also offers significant potential in theological education. Fine-tuning AI models on Christian texts and resources allows developers to create tools that help students and scholars explore complex biblical questions. Such systems enhance learning by offering immediate, detailed comparisons of theological perspectives while maintaining fidelity to core doctrines (Graves, 2023; Schuurman, 2019). As Jake explains, models can be tailored to represent specific denominational teachings and traditions, making them versatile tools for faith formation.

The Challenges and Concerns

Despite their potential, these technologies raise valid concerns. One significant theological issue is the risk of idolatry, where reliance on AI might inadvertently replace engagement with Scripture or human-led discipleship. Jake emphasizes that Christian chatbots must clearly position themselves as tools, not authorities, to avoid overstepping their intended role.

Another challenge lies in the inherent limitations of AI. Critics like Luke Plant and FaithGPT warn that chatbots can oversimplify complex theological issues, potentially leading to misunderstandings or shallow faith formation (VanderLeest & Schuurman, 2015). AI’s dependence on pre-trained models also introduces the risk of factual inaccuracies or biased interpretations, undermining credibility and trust. Because of this, they argue that pursuing Christian chatbots is irresponsible and violates the commandment against creating graven images.

Additionally, the question of whether robots can genuinely fulfill religious roles remains unresolved. Religious practices are inherently relational and experiential, requiring discernment, empathy, and spiritual depth—qualities AI cannot replicate. As Puzio (2023) notes, while robots like Mindar, a Buddhist priest robot, have conducted rituals, such actions lack the relational and spiritual connection that is central to many faith traditions.

Designing AI to Minimize Falsehoods

Given the theological and ethical stakes, developing Christian chatbots requires careful planning. Jake’s approach offers a valuable framework for minimizing errors while ensuring theological fidelity. Selecting an open-source AI model, for example, provides developers with greater control over the system’s foundational algorithms, reducing the risk of unforeseen biases being introduced later by external entities.

Training these chatbots on a broad range of theological perspectives is essential to ensure they deliver well-rounded, biblically accurate responses. Clear disclaimers about their limitations are also crucial to reinforce their role as supplemental tools rather than authoritative voices. Failure to do so risks misconceptions about an “AI Jesus,” which borders on idolatry by shifting reliance from the Creator to the created. Additionally, programming these systems to prioritize empathy and gentleness reflects Christian values and fosters trust, even in disagreement.

Feedback mechanisms play a critical role in maintaining accuracy. By incorporating user feedback, developers can refine responses iteratively, addressing inaccuracies and improving cultural and theological sensitivity over time (Graves, 2023). Jake also highlights retrieval-augmented generation, a technique that restricts responses to a curated body of knowledge. This method significantly reduces hallucinations, enhancing reliability.
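The sketch below shows the general shape of retrieval-augmented generation under simple assumptions; it is not a description of apologist.ai’s implementation. The embed, similarity, and generate functions stand in for whichever embedding model, vector search, and LLM a developer chooses.

```typescript
// Generic shape of retrieval-augmented generation: retrieve the most relevant
// passages from a curated corpus and instruct the model to answer only from them.
// The embedding, similarity, and generation functions are supplied by the caller.

interface Passage { source: string; text: string; embedding: number[]; }

interface RagDeps {
  embed: (text: string) => Promise<number[]>;
  similarity: (a: number[], b: number[]) => number; // e.g., cosine similarity
  generate: (prompt: string) => Promise<string>;    // call to the underlying LLM
}

async function answerFromCorpus(
  deps: RagDeps,
  question: string,
  corpus: Passage[],
  k = 3,
): Promise<string> {
  const q = await deps.embed(question);

  // Keep only the k passages most similar to the question.
  const context = corpus
    .map(p => ({ p, score: deps.similarity(q, p.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ p }) => `[${p.source}] ${p.text}`)
    .join("\n");

  // Constrain the model to the retrieved passages to reduce hallucinations.
  const prompt =
    "Answer using only the passages below. If they do not contain the answer, say so.\n\n" +
    context + "\n\nQuestion: " + question;

  return deps.generate(prompt);
}
```

Constraining generation to a curated, citable corpus is what lets this kind of system point users back to its sources rather than presenting itself as an authority.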

Striking a Balance

The debate over Christian chatbots and religious robots underscores the tension between embracing innovation and remaining faithful to tradition. While these tools offer opportunities to extend ministry, enhance education, and provide comfort, they must be designed and used with humility and discernment. Developers should ground their work in biblical principles, ensuring that technology complements rather than replaces human-led spiritual engagement.

Ultimately, the church must navigate this new paradigm carefully, weighing the benefits of accessibility and evangelism against the risks of misrepresentation. As Jake puts it, by adding empathy to truth, Christians can responsibly harness AI’s potential to advance the kingdom of God.

References

VanderLeest, S., & Schuurman, D. (2015, June). A Christian Perspective on Artificial Intelligence: How Should Christians Think about Thinking Machines. In Proceedings of the 2015 Christian Engineering Conference (CEC), Seattle Pacific University, Seattle, WA (pp. 91-107).

Graves, M. (2023). ChatGPT’s Significance for Theology. Theology and Science, 21(2), 201–204. https://doi.org/10.1080/14746700.2023.2188366

Schuurman, D. C. (2019). Artificial Intelligence: Discerning a Christian Response. Perspectives on Science & Christian Faith, 71(2).

Puzio, A. (2023). Robot, let us pray! Can and should robots have religious functions? An ethical exploration of religious robots. AI & SOCIETY. https://doi.org/10.1007/s00146-023-01812-z

Security Risks of Public Package Managers and Developer Responsibilities

Introduction

Open-source development ecosystems rely heavily on package managers such as Node Package Manager (NPM), RubyGems, and Pip. These tools provide developers with easy access to a vast library of reusable software packages, accelerating development timelines and reducing costs. However, the convenience of public repositories comes with significant security risks. Because these packages are developed by the public, often anonymously, they may contain vulnerabilities or malicious code, or introduce indirect threats through their dependencies. This post explores the most common security risks developers face when using packages from public repositories and how to identify these threats. We will also examine developers’ ethical responsibilities when using package managers and discuss how developers can help mitigate some of these issues.

Security Risks in Public Package Managers

One of the most prominent risks associated with public repositories is the presence of malicious or vulnerable packages. For example, the NPM ecosystem has been found to contain numerous security vulnerabilities, many of which arise from the extensive use of transitive dependencies: dependencies of dependencies that are installed automatically when a developer installs a package. These transitive dependencies significantly increase the attack surface, as a vulnerability in even one of them can cascade to affect the entire project (Decan et al., 2018; Kabir et al., 2022; Latendresse et al., 2022).

Several incidents have highlighted the dangers of these vulnerabilities. In November 2018, event-stream, a popular utility library for working with data streams in Node.js, unknowingly incorporated a malicious dependency, leading to over two million downloads of malware (Zerouali et al., 2022). Similarly, the removal of left-pad, a small but widely used NPM package, caused widespread disruption, impacting thousands of projects (Zimmermann et al., 2019). These incidents demonstrate how software dependencies in public repositories can lead to emergent security problems.

Identifying Security Risks in Dependencies

To identify security risks, developers need to consider two kinds of dependencies: direct and transitive. Direct dependencies are those explicitly declared in the package manifest (e.g., package.json for NPM), whereas transitive dependencies are included automatically through other installed packages (Decan et al., 2018; Zerouali et al., 2022).
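As a rough illustration, the sketch below reads a project’s package.json and package-lock.json and labels each installed package as direct or transitive. It assumes an npm lockfile of lockfileVersion 2 or later, whose packages map uses keys such as node_modules/lodash, and it skips nested duplicate installs for simplicity.

```typescript
// Rough sketch: label each installed package as a direct dependency (declared in
// package.json) or a transitive one (pulled in by another package). Assumes an
// npm package-lock.json of lockfileVersion 2 or later, whose "packages" map uses
// keys such as "node_modules/lodash"; nested duplicate installs are skipped.

import { readFileSync } from "node:fs";

const manifest = JSON.parse(readFileSync("package.json", "utf8"));
const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));

const direct = new Set(Object.keys({
  ...(manifest.dependencies ?? {}),
  ...(manifest.devDependencies ?? {}),
}));

for (const installPath of Object.keys(lock.packages ?? {})) {
  if (!installPath.startsWith("node_modules/")) continue; // skip the root project entry
  const name = installPath.slice("node_modules/".length);
  if (name.includes("node_modules/")) continue;           // nested copy of another package
  console.log(`${name}: ${direct.has(name) ? "direct" : "transitive"}`);
}
```

Even a simple listing like this usually makes the point vividly: the transitive entries far outnumber the handful of packages a developer consciously chose.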

Transitive dependencies are one of the most critical sources of risk. Research shows that roughly 40% of NPM packages rely on code with known vulnerabilities, many of which stem from transitive dependencies (Zimmermann et al., 2019). As projects scale up, the number of indirect dependencies grows, making tracking and assessing vulnerabilities difficult.

Developers can use tools such as npm audit, which checks dependencies against NPM’s database of known vulnerabilities, or Snyk, a tool that provides real-time monitoring. These tools analyze the entire dependency tree and alert developers to vulnerable packages, including those pulled in as transitive dependencies (Kabir et al., 2022). However, a challenge with such tools is the frequent occurrence of false positives, particularly for vulnerabilities in development dependencies that are never deployed to production. For example, npm audit may flag vulnerabilities in packages that are only part of the development environment and are never included in the final production build. While these vulnerabilities are technically present, they do not threaten the production application because the flagged dependencies are not part of the final product (Latendresse et al., 2022).
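For example, a small script along these lines can fold an audit into a build or CI step. It shells out to npm audit --json and prints the severity summary; the metadata.vulnerabilities field used here is an assumption based on recent npm releases, and the JSON shape varies between versions, so verify it against the npm you actually use.

```typescript
// Sketch: run `npm audit --json` and print the severity summary it reports.

import { execSync } from "node:child_process";

function auditSummary(projectDir: string): Record<string, number> {
  let output: string;
  try {
    output = execSync("npm audit --json", { cwd: projectDir, encoding: "utf8" });
  } catch (err: any) {
    // npm exits with a non-zero code when vulnerabilities are found,
    // but the JSON report is still written to stdout.
    output = err.stdout?.toString() ?? "{}";
  }
  const report = JSON.parse(output);
  return report.metadata?.vulnerabilities ?? {};
}

// Example output shape: { info: 0, low: 2, moderate: 1, high: 0, critical: 0, total: 3 }
console.log(auditSummary("."));
```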

To mitigate these risks, developers should:

  • Regularly audit their dependencies with tools like npm audit and manually ensure required fixes are applied promptly (Kabir et al., 2022).
  • Lock down dependency versions using lockfiles such as package-lock.json to avoid inadvertently updating to a vulnerable version (Zimmermann et al., 2019).
  • Remove unused or redundant dependencies. Kabir et al. (2022) found that 90% of projects sampled (n=841) had unused dependencies, and 83% had duplicated dependencies, unnecessarily increasing the attack surface.
  • Incorporate Software Composition Analysis (SCA) tools such as Snyk into the development workflow to detect vulnerabilities deep within the dependency tree (Latendresse et al., 2022).
  • Apply “tree shaking” techniques to remove unused transitive dependencies from production builds (Latendresse et al., 2022); a simplified example follows this list.
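As a minimal illustration of that last point (the file and function names here are invented), tree shaking relies on static ES module imports: a bundler run in production mode can drop exports that nothing imports, along with anything only those exports depend on.

```typescript
// utils.ts: escapeHtml is the only export the application imports, so a bundler
// running in production mode (e.g., webpack, Rollup, or esbuild) can drop
// legacyParser, and any packages only it relies on, from the final bundle.
export function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

export function legacyParser(input: string): string[] {
  // imagine this depending on a large or vulnerable transitive package
  return input.split(",");
}

// app.ts: the static import is what makes the unused-code analysis possible.
// import { escapeHtml } from "./utils";
// console.log(escapeHtml("<b>hi</b>"));
```

The same reasoning explains why dynamic require calls and modules with side effects limit how much a bundler can safely remove.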

Ethical Responsibilities of Developers and Educators

Developers have an ethical responsibility to safeguard the software they create and the users who depend on it. When using packages from public repositories, developers must ensure they are not exposing users to security risks. This responsibility ties into ISTE standard 4.7d, which emphasizes empowering individuals to make informed decisions to protect personal data and curate a secure digital profile. Developers must prioritize software security for components that handle sensitive data.

One crucial aspect of this responsibility is ensuring the safety of third-party packages and educating others on best practices. For computer science educators, this involves teaching students how to assess package security and encouraging them to use secure alternatives. Educators should also model responsible practices, such as regularly updating dependencies and employing security audits in their projects. Strategies for this were outlined in an earlier post on CRAP detection in NPM.

From an educational standpoint, understanding the security risks associated with public package managers can be incorporated into the SAMR model of educational technology integration. At the Substitution level, students might learn how to install dependencies using package managers. At the Augmentation level, they could explore tools like npm audit or Snyk to discover package vulnerabilities. At the Modification level, students would modify code to replace insecure dependencies, while at the Redefinition level they would design more secure workflows for integrating third-party libraries into their applications.

References

Decan, A., Mens, T., & Constantinou, E. (2018). On the impact of security vulnerabilities in the npm package dependency network. Proceedings of the 15th International Conference on Mining Software Repositories. https://doi.org/10.1145/3196398.3196401

Latendresse, J., Mujahid, S., Costa, D. E., & Shihab, E. (2022). Not All Dependencies are Equal: An Empirical Study on Production Dependencies in NPM. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. https://doi.org/10.1145/3551349.3556896

Kabir, M. M. A., Wang, Y., Yao, D., & Meng, N. (2022). How Do Developers Follow Security-Relevant Best Practices When Using NPM Packages? 2022 IEEE Secure Development Conference (SecDev). https://doi.org/10.1109/secdev53368.2022.00027

Zerouali, A., Mens, T., Decan, A., & De Roover, C. (2022). On the impact of security vulnerabilities in the npm and RubyGems dependency networks. Empirical Software Engineering, 27(5). https://doi.org/10.1007/s10664-022-10154-1

Zimmermann, M., Staicu, C.-A., Tenny, C., & Pradel, M. (2019). Small World with High Risks: A Study of Security Threats in the npm Ecosystem. USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity19/presentation/zimmerman

Are Limitations to Screen Time Necessary?

The American Academy of Pediatrics (2013) recommends that parents limit children’s and teens’ entertainment screen time to no more than two hours daily, noting that increased screen time has been linked with eye problems, violence, cyberbullying, obesity, lack of sleep, and academic decline. They are quick to add that screen time is not a significant cause of these problems on its own and that the guidance should be balanced with educating your kids about these risks. Even so, I find the recommendation to be a gross generalization. Studies have shown that culture and class affect both the amount and the type of technology children are exposed to (some of it beneficial, some not), and that the benefit children gain from technology is greatly affected by family context (Konca, 2021).

Fraser Health, the leading health authority in B.C., Canada, makes a similar recommendation of two hours a day (Screen Time for Children, n.d.), stating that parents should instead “Choose activities such as playing outdoors, reading or crafting over screens.” However, e-readers and tablets can store thousands of books, and web-enabled devices put extra information about those books at your fingertips. Or is reading not considered entertainment? What if I enjoy reading? Also, who does crafts without the use of a tablet? Do you have craft ideas off the top of your head?

Personally, I do not limit screen time for my six-year-old son, Jacob. There are days when he spends seven hours in front of a screen, and there are days when he spends less than an hour. The important thing is that he is learning something from the experience. Screen time can be used to develop digital, creative, problem-solving, communication, social, and goal-setting skills (Using Screen Time and Digital Technology for Learning: Children and Pre-Teens, n.d.). Currently, Jacob primarily plays two games: Geometry Dash and Mario Maker. On the surface, neither game appears to offer any educational value, and the former even contains elements, such as flashing visuals and loud music, that have been known to trigger seizures in players (Millichap, 1994). Yet I have found that these games provide benefits in every category listed above, teaching children to collaborate on building levels and to give feedback to peers. Besides learning online etiquette, reading, and typing skills, Jacob has gained extensive practical knowledge of game mechanics like angle rotations, alpha transparencies, z-index, collisions, conditional structures, and counters. Some of these are concepts I teach to my university students.

So, to all health authorities: I agree with your other recommendations, but please consider deleting or altering the gross generalization about time limits (or at least clarifying it further). Otherwise, each time Jacob plays a game, I will need to set a timer on my phone, which will cut into my own two hours of screen time.

References

Konca, A. S. (2021). Digital Technology Usage of Young Children: Screen Time and Families. Early Childhood Education Journal. https://doi.org/10.1007/s10643-021-01245-7

American Academy of Pediatrics. (2013). Children, Adolescents, and the Media. Pediatrics, 132(5), 958–961. https://doi.org/10.1542/peds.2013-2656

Screen time for children. (n.d.). Fraser Health. Retrieved January 22, 2023, from https://www.fraserhealth.ca/health-topics-a-to-z/children-and-youth/physical-activity-for-children/screen-time-for-children#.Y8zGYezMJqs

Using screen time and digital technology for learning: children and pre-teens. (n.d.). Raising Children Network. Retrieved January 22, 2023, from https://raisingchildren.net.au/school-age/school-learning/learning-ideas/screen-time-helps-children-learn#:~:text=Screen%20time%20can%20help%20children

Millichap, J. G. (1994). Video Game-Induced Seizures. Pediatric Neurology Briefs, 8(9), 68. https://doi.org/10.15844/pedneurbriefs-8-9-5