When you run evaluation frameworks to measure model performance, you need visibility into how your AI applications score across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.
Helicone doesn’t run evaluations for you; we’re not an evaluation framework. Instead, we provide a centralized location to report and analyze evaluation results from any framework (such as RAGAS, LangSmith, or custom evaluations), giving you unified observability across all your evaluation metrics.
Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
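At its core, reporting is a single HTTP POST to Helicone's scores endpoint, the same call used in every example below. Here is a minimal sketch, assuming your key is in the HELICONE_API_KEY environment variable, `request_id` is the Helicone request ID you want to score, and the metric names `accuracy` and `contains_hallucination` are placeholders for your own:

```python
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]
request_id = "your-request-id-here"  # the Helicone request being scored

# Attach evaluation scores (integers or booleans) to an existing request
requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/score",
    headers={
        "Authorization": f"Bearer {HELICONE_API_KEY}",
        "Content-Type": "application/json",
    },
    # placeholder metric names; use whatever metrics your evaluation produces
    json={"scores": {"accuracy": 92, "contains_hallucination": False}},
)
```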
Evaluate retrieval-augmented generation for accuracy and hallucination:
```python
import os

import requests
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Run RAG evaluation and report the results to Helicone
def evaluate_rag_response(question, answer, contexts, ground_truth, requestId):
    # Initialize RAGAS metrics
    metrics = [Faithfulness(), ResponseRelevancy()]

    # Create a single-row dataset in RAGAS format
    data = {
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],
        "ground_truth": [ground_truth],
    }
    dataset = Dataset.from_dict(data)

    # Run evaluation
    result = evaluate(dataset, metrics=metrics)

    # Extract scores (RAGAS returns 0-1 values)
    faithfulness_score = result["faithfulness"] if "faithfulness" in result else 0
    relevancy_score = result["answer_relevancy"] if "answer_relevancy" in result else 0

    # Report to Helicone (convert to a 0-100 scale)
    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "scores": {
                "faithfulness": int(faithfulness_score * 100),
                "answer_relevancy": int(relevancy_score * 100),
            }
        },
    )
    return result

# Example usage
scores = evaluate_rag_response(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["France is a country in Europe. Paris is its capital."],
    ground_truth="Paris",
    requestId="your-request-id-here",
)
```
Evaluate code generation for correctness, style, and functionality:
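A minimal sketch of one way to do this, assuming the model returns a standalone Python function named `solution` and that you supply `(args, expected)` test cases. The `evaluate_generated_code` helper, the `compiles`/`functionality`/`style` metric names, and the line-length style heuristic are illustrative; only the POST to the Helicone scores endpoint mirrors the other examples.

```python
import ast
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

def evaluate_generated_code(generated_code, test_cases, requestId):
    # Correctness: does the generated code parse?
    syntax_ok = True
    try:
        ast.parse(generated_code)
    except SyntaxError:
        syntax_ok = False

    # Functionality: fraction of supplied test cases that pass
    passed = 0
    if syntax_ok:
        namespace = {}
        exec(generated_code, namespace)  # only run trusted or sandboxed code
        func = namespace.get("solution")  # assumes the model defines `solution`
        for args, expected in test_cases:
            try:
                if func is not None and func(*args) == expected:
                    passed += 1
            except Exception:
                pass
    functionality = int(100 * passed / len(test_cases)) if test_cases else 0

    # Style: a crude proxy, the share of non-empty lines under 88 characters
    lines = [line for line in generated_code.splitlines() if line.strip()]
    short_lines = sum(1 for line in lines if len(line) <= 88)
    style = int(100 * short_lines / len(lines)) if lines else 0

    # Report all three scores to Helicone
    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "scores": {
                "compiles": syntax_ok,
                "functionality": functionality,
                "style": style,
            }
        },
    )

# Example usage
evaluate_generated_code(
    "def solution(a, b):\n    return a + b\n",
    test_cases=[((1, 2), 3), ((0, 0), 0)],
    requestId="your-request-id-here",
)
```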
Evaluate model outputs for helpfulness, safety, and alignment:
```python
import os

import requests

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Multi-dimensional evaluation for chatbots.
# llm_judge and calculate_reading_level are helpers you define elsewhere:
# the judge calls a model and returns a dict of 0-1 criterion scores.
async def evaluate_chat_response(user_query, assistant_response, requestId):
    # Use an LLM as judge for subjective metrics
    eval_prompt = f"""
    Rate the following assistant response on these criteria (0-1):
    - Helpfulness: How well does it address the user's question?
    - Safety: Is the response safe and appropriate?
    - Accuracy: Is the information correct?
    - Clarity: Is the response clear and well-structured?

    User: {user_query}
    Assistant: {assistant_response}
    """

    # Get evaluation from the judge model
    eval_scores = await llm_judge(eval_prompt)

    # Add objective metrics
    scores = {
        **eval_scores,
        "response_length": len(assistant_response),
        "reading_level": calculate_reading_level(assistant_response),
        "contains_refusal": "I cannot" in assistant_response or "I won't" in assistant_response,
    }

    # Report all scores (convert 0-1 decimals to integers; booleans and ints pass through)
    integer_scores = {
        key: int(value * 100) if isinstance(value, float) and 0 <= value <= 1 else value
        for key, value in scores.items()
    }

    requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"scores": integer_scores},
    )

    return scores
```