Evaluations
Create, manage, and execute evaluation suites
GPT-4 Safety Evaluation
Comprehensive safety testing for GPT-4 models including harmful content detection
247 test cases
Last run: 2 hours ago
94.2% success rate
Avg: 1.2s
💰Cost: $12.45
Test Case Breakdown:
Harmful Content Detection
89 tests, 96.6%
Bias Assessment
67 tests, 91%
Toxicity Filtering
45 tests, 98.2%
Privacy Protection
46 tests, 93.5%
Recent Issues:
Edge case: Subtle political bias in historical context
False positive: Medical terminology flagged as harmful
Models:
gpt-4-turbo, gpt-4
Tags:
safety, production
Claude Performance Benchmark
Performance evaluation suite for Claude models across various tasks
156 test cases
Last run: Running...
Avg: 2.1s
💰Cost: $8.90
Test Case Breakdown:
Reasoning Tasks
42 tests
Code Generation
38 tests, 87.3%
Text Summarization
35 tests, 92.1%
Mathematical Problems
41 tests
Models:
claude-3-sonnet, claude-3-haiku
Tags:
performance, benchmark
Multi-Model Comparison
Comparative analysis across different model providers
89 test cases
Last run: 1 day ago
Avg: 4.7s
💰Cost: $23.67
Test Case Breakdown:
Creative Writing
22 tests, 68.2%
Factual Q&A
25 tests, 84%
Logical Reasoning
21 tests, 71.4%
Language Translation
21 tests, 90.5%
Recent Issues:
Timeout error: Llama-2-70b response exceeded 30s limit
API rate limit exceeded for GPT-4-turbo
Models:
gpt-4-turbo, claude-3-sonnet, llama-2-70b
Tags:
comparison, analysis
Custom Model Validation
Validation suite for custom fine-tuned models
134 test cases
Last run: 2 days ago
78.9% success rate
Avg: 0.8s
💰Cost: $3.21
Test Case Breakdown:
Domain-Specific Tasks
45 tests, 82.2%
General Knowledge
38 tests, 71.1%
Instruction Following
28 tests, 85.7%
Edge Case Handling
23 tests, 69.6%
Recent Issues:
Poor performance on out-of-domain queries
Inconsistent formatting in structured outputs
Models:
custom-model-v2
Tags:
custom, validation
Adversarial Testing Suite
Advanced adversarial testing for model robustness
312 test cases
Last run: 3 days ago
87.6% success rate
Avg: 1.8s
💰Cost: $18.92
Test Case Breakdown:
Prompt Injection
78 tests, 89.7%
Jailbreak Attempts
85 tests, 82.4%
Context Manipulation
67 tests, 91%
Output Manipulation
82 tests, 87.8%
Recent Issues:
Sophisticated role-play jailbreak succeeded
Context window manipulation bypassed safety filters
Models:
gpt-4-turbo
Tags:
adversarial, robustness
Bias Detection Evaluation
Comprehensive bias detection across demographic groups
198 test cases
Last run: 1 week ago
91.3% success rate
Avg: 1.5s
💰Cost: $9.87
Test Case Breakdown:
Gender Bias
52 tests, 94.2%
Racial Bias
48 tests, 89.6%
Age Bias
44 tests, 90.9%
Cultural Bias
54 tests, 90.7%
Recent Issues:
Subtle gender bias in career recommendations
Cultural assumptions in lifestyle advice
Models:
claude-3-sonnet, gpt-4
Tags:
bias, fairness
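
The listing does not say how a suite's overall success rate relates to its Test Case Breakdown figures. One plausible reading is a test-count-weighted average of the per-category rates. The sketch below tries that on the GPT-4 Safety Evaluation numbers. The function name `weighted_success_rate` and the tuple layout are illustrative assumptions, not part of the dashboard.

```python
from typing import List, Tuple

# (category name, number of tests, success rate in percent)
Category = Tuple[str, int, float]


def weighted_success_rate(breakdown: List[Category]) -> float:
    """Return the test-count-weighted average of per-category success rates.

    Hypothetical helper: the dashboard's real aggregation method is not
    documented in the listing.
    """
    total_tests = sum(count for _, count, _ in breakdown)
    if total_tests == 0:
        raise ValueError("breakdown contains no tests")
    weighted_sum = sum(count * rate for _, count, rate in breakdown)
    return weighted_sum / total_tests


# Figures copied from the GPT-4 Safety Evaluation breakdown.
gpt4_safety: List[Category] = [
    ("Harmful Content Detection", 89, 96.6),
    ("Bias Assessment", 67, 91.0),
    ("Toxicity Filtering", 45, 98.2),
    ("Privacy Protection", 46, 93.5),
]

rate = weighted_success_rate(gpt4_safety)
print(f"Weighted success rate: {rate:.1f}%")  # prints "Weighted success rate: 94.8%"
```

The weighted average comes to about 94.8%, while the suite shows 94.2%. The displayed figure is therefore probably computed some other way, for example from raw pass counts or from a different run, so treat the sketch only as a sanity check.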