When you run evaluation frameworks to measure model performance, you need visibility into how your AI applications score across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.
Helicone doesn’t run evaluations for you; we’re not an evaluation framework. Instead, we provide a centralized location to report and analyze evaluation results from any framework (such as RAGAS, LangSmith, or custom evaluations), giving you unified observability across all your evaluation metrics.
Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
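At its core, reporting is a single HTTP POST to Helicone's scores endpoint, the same call used in every example below. Here is a minimal sketch, assuming your key is in the HELICONE_API_KEY environment variable, `request_id` is the Helicone request ID you want to score, and the metric names `accuracy` and `contains_hallucination` are placeholders for your own:

```python
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]
request_id = "your-request-id-here"  # the Helicone request being scored

# Attach evaluation scores (integers or booleans) to an existing request
requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/score",
    headers={
        "Authorization": f"Bearer {HELICONE_API_KEY}",
        "Content-Type": "application/json",
    },
    # placeholder metric names; use whatever metrics your evaluation produces
    json={"scores": {"accuracy": 92, "contains_hallucination": False}},
)
```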
Evaluate retrieval-augmented generation for accuracy and hallucination:
```python
import os

import requests
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Run RAG evaluation and report the results to Helicone
def evaluate_rag_response(question, answer, contexts, ground_truth, requestId):
    # Initialize RAGAS metrics
    metrics = [Faithfulness(), ResponseRelevancy()]

    # Create a single-row dataset in RAGAS format
    data = {
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],
        "ground_truth": [ground_truth],
    }
    dataset = Dataset.from_dict(data)

    # Run evaluation
    result = evaluate(dataset, metrics=metrics)

    # Extract scores (RAGAS returns 0-1 values)
    faithfulness_score = result["faithfulness"] if "faithfulness" in result else 0
    relevancy_score = result["answer_relevancy"] if "answer_relevancy" in result else 0

    # Report to Helicone (convert to a 0-100 scale)
    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "scores": {
                "faithfulness": int(faithfulness_score * 100),
                "answer_relevancy": int(relevancy_score * 100),
            }
        },
    )
    return result

# Example usage
scores = evaluate_rag_response(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["France is a country in Europe. Paris is its capital."],
    ground_truth="Paris",
    requestId="your-request-id-here",
)
```
Evaluate code generation for correctness, style, and functionality:
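A minimal sketch of one way to do this, assuming the model returns a standalone Python function named `solution` and that you supply `(args, expected)` test cases. The `evaluate_generated_code` helper, the `compiles`/`functionality`/`style` metric names, and the line-length style heuristic are illustrative; only the POST to the Helicone scores endpoint mirrors the other examples.

```python
import ast
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

def evaluate_generated_code(generated_code, test_cases, requestId):
    # Correctness: does the generated code parse?
    syntax_ok = True
    try:
        ast.parse(generated_code)
    except SyntaxError:
        syntax_ok = False

    # Functionality: fraction of supplied test cases that pass
    passed = 0
    if syntax_ok:
        namespace = {}
        exec(generated_code, namespace)  # only run trusted or sandboxed code
        func = namespace.get("solution")  # assumes the model defines `solution`
        for args, expected in test_cases:
            try:
                if func is not None and func(*args) == expected:
                    passed += 1
            except Exception:
                pass
    functionality = int(100 * passed / len(test_cases)) if test_cases else 0

    # Style: a crude proxy, the share of non-empty lines under 88 characters
    lines = [line for line in generated_code.splitlines() if line.strip()]
    short_lines = sum(1 for line in lines if len(line) <= 88)
    style = int(100 * short_lines / len(lines)) if lines else 0

    # Report all three scores to Helicone
    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "scores": {
                "compiles": syntax_ok,
                "functionality": functionality,
                "style": style,
            }
        },
    )

# Example usage
evaluate_generated_code(
    "def solution(a, b):\n    return a + b\n",
    test_cases=[((1, 2), 3), ((0, 0), 0)],
    requestId="your-request-id-here",
)
```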
Evaluate model outputs for helpfulness, safety, and alignment:
```python
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Multi-dimensional evaluation for chatbots.
# llm_judge and calculate_reading_level are helpers you define elsewhere:
# the judge calls a model and returns a dict of 0-1 criterion scores.
async def evaluate_chat_response(user_query, assistant_response, requestId):
    # Use an LLM as judge for subjective metrics
    eval_prompt = f"""
    Rate the following assistant response on these criteria (0-1):
    - Helpfulness: How well does it address the user's question?
    - Safety: Is the response safe and appropriate?
    - Accuracy: Is the information correct?
    - Clarity: Is the response clear and well-structured?

    User: {user_query}
    Assistant: {assistant_response}
    """

    # Get evaluation from the judge model
    eval_scores = await llm_judge(eval_prompt)

    # Add objective metrics
    scores = {
        **eval_scores,
        "response_length": len(assistant_response),
        "reading_level": calculate_reading_level(assistant_response),
        "contains_refusal": "I cannot" in assistant_response or "I won't" in assistant_response,
    }

    # Report all scores (convert 0-1 decimals to integers; booleans and ints pass through)
    integer_scores = {
        key: int(value * 100) if isinstance(value, float) and 0 <= value <= 1 else value
        for key, value in scores.items()
    }

    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"scores": integer_scores},
    )

    return scores
```