Evaluations

Create, manage, and execute evaluation suites

Showing 6 of 6 evaluations

GPT-4 Safety Evaluation

Comprehensive safety testing for GPT-4 models including harmful content detection

completed

247 test cases

Last run: 2 hours ago

94.2% success rate

Avg: 1.2s

Cost: $12.45

Test Case Breakdown:

Harmful Content Detection

89 tests, 96.6%

Bias Assessment

67 tests, 91%

Toxicity Filtering

45 tests, 98.2%

Privacy Protection

46 tests, 93.5%

Recent Issues:

Edge case: Subtle political bias in historical context

False positive: Medical terminology flagged as harmful

Models:

gpt-4-turbo, gpt-4

Tags:

safety, production

Claude Performance Benchmark

Performance evaluation suite for Claude models across various tasks

running

156 test cases

Last run: Running...

Avg: 2.1s

Cost: $8.90

Test Case Breakdown:

Reasoning Tasks

42 tests

Code Generation

38 tests, 87.3%

Text Summarization

35 tests, 92.1%

Mathematical Problems

41 tests

Models:

claude-3-sonnet, claude-3-haiku

Tags:

performance, benchmark

Multi-Model Comparison

Comparative analysis across different model providers

failed

89 test cases

Last run: 1 day ago

Avg: 4.7s

Cost: $23.67

Test Case Breakdown:

Creative Writing

22 tests, 68.2%

Factual Q&A

25 tests, 84%

Logical Reasoning

21 tests, 71.4%

Language Translation

21 tests, 90.5%

Recent Issues:

Timeout error: Llama-2-70b response exceeded 30s limit

API rate limit exceeded for GPT-4-turbo

Models:

gpt-4-turbo, claude-3-sonnet, llama-2-70b

Tags:

comparison, analysis

Custom Model Validation

Validation suite for custom fine-tuned models

completed

134 test cases

Last run: 2 days ago

78.9% success rate

Avg: 0.8s

Cost: $3.21

Test Case Breakdown:

Domain-Specific Tasks

45 tests, 82.2%

General Knowledge

38 tests, 71.1%

Instruction Following

28 tests, 85.7%

Edge Case Handling

23 tests, 69.6%

Recent Issues:

Poor performance on out-of-domain queries

Inconsistent formatting in structured outputs

Models:

custom-model-v2

Tags:

custom, validation

Adversarial Testing Suite

Advanced adversarial testing for model robustness

completed

312 test cases

Last run: 3 days ago

87.6% success rate

Avg: 1.8s

Cost: $18.92

Test Case Breakdown:

Prompt Injection

78 tests, 89.7%

Jailbreak Attempts

85 tests, 82.4%

Context Manipulation

67 tests, 91%

Output Manipulation

82 tests, 87.8%

Recent Issues:

Sophisticated role-play jailbreak succeeded

Context window manipulation bypassed safety filters

Models:

gpt-4-turbo

Tags:

adversarial, robustness

Bias Detection Evaluation

Comprehensive bias detection across demographic groups

completed

198 test cases

Last run: 1 week ago

91.3% success rate

Avg: 1.5s

Cost: $9.87

Test Case Breakdown:

Gender Bias

52 tests, 94.2%

Racial Bias

48 tests, 89.6%

Age Bias

44 tests, 90.9%

Cultural Bias

54 tests, 90.7%

Recent Issues:

Subtle gender bias in career recommendations

Cultural assumptions in lifestyle advice

Models:

claude-3-sonnet, gpt-4

Tags:

bias, fairness