# Benchmark Analysis Methodology
This document describes the statistical methods and data processing pipeline used in the serializer benchmark analysis.
## Overview
The benchmark analysis tool processes raw CSV logs from C# and Python benchmark runs, normalizes time units, filters outliers, and generates comparative reports. The goal is to provide accurate, comparable performance metrics across different serializers and languages.
## Data Pipeline

### 1. Raw Data Ingestion

Benchmark logs are CSV files with the following columns:

- `StringOrStream`: Mode of operation (Stream/string/bytes)
- `TestDataName`: Type of test data (Integer, Person, SimpleObject, etc.)
- `Repetitions`: Number of repetitions in the batch
- `RepetitionIndex`: Index within the repetition batch
- `SerializerName`: Name of the serializer being tested
- `TimeSer`: Time for serialization (ticks for C#, nanoseconds for Python)
- `TimeDeser`: Time for deserialization (ticks for C#, nanoseconds for Python)
- `Size`: Size of serialized output in bytes
- `TimeSerAndDeser`: Combined time for serialization + deserialization
- `OpPerSecSer`, `OpPerSecDeser`, `OpPerSecSerAndDeser`: Operations per second (as reported by the benchmark)
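As a minimal loading sketch (the file path and the validation step are illustrative, not the tool's actual code), the logs can be ingested with pandas:

```python
import pandas as pd

# Hypothetical path; real log locations depend on the benchmark runner setup.
df = pd.read_csv("results/benchmark_log.csv")

# Columns the analysis relies on downstream.
expected = {
    "StringOrStream", "TestDataName", "Repetitions", "RepetitionIndex",
    "SerializerName", "TimeSer", "TimeDeser", "Size", "TimeSerAndDeser",
}
missing = expected - set(df.columns)
if missing:
    raise ValueError(f"Log is missing expected columns: {sorted(missing)}")
```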
### 2. Time Unit Normalization
C# and Python benchmarks use different time units:
- C#: Uses ticks (100 nanoseconds per tick)
- Python: Uses nanoseconds directly
The `_detect_time_unit()` function auto-detects the unit from the magnitude of the values:
- Values > 1,000,000 are assumed to be C# ticks → multiplied by 100 to get nanoseconds
- Values ≤ 1,000,000 are assumed to be Python nanoseconds → used as-is
This ensures all timing comparisons are done in consistent nanosecond units.
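A minimal sketch of this heuristic (the real `_detect_time_unit()` may differ in detail; applying it to the column median rather than individual values is an assumption here):

```python
import pandas as pd

def detect_time_unit(values: pd.Series) -> str:
    # Heuristic from above: large magnitudes are assumed to be C# ticks.
    return "ticks" if values.median() > 1_000_000 else "ns"

def to_nanoseconds(values: pd.Series) -> pd.Series:
    # One C# tick is 100 ns, so tick values are scaled up by 100;
    # Python values are already in nanoseconds and pass through unchanged.
    return values * 100 if detect_time_unit(values) == "ticks" else values
```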
### 3. Outlier Filtering
Raw benchmark data often contains extreme outliers due to:
- GC pauses
- Thread scheduling delays
- JIT compilation overhead (first-run effects)
- System load spikes
These outliers can severely skew mean calculations. For example, a single 90-second measurement among 99 sub-millisecond measurements drags the mean to roughly 0.9 seconds, nearly a thousand times the typical value, making it useless.
#### Tukey's IQR Method
We use Tukey's Interquartile Range (IQR) fences for outlier detection:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Lower fence = Q1 - 1.5 × IQR
- Upper fence = Q3 + 1.5 × IQR
Values outside [lower, upper] are considered outliers and removed.
#### Filtering Rules

- Applied per group: (`SerializerName`, `TestDataName`, `StringOrStream`)
- Only applied when group has ≥ 10 measurements
- If IQR is 0 (all identical values), no filtering is done
- If filtering would remove all values, original data is preserved
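A sketch of the grouped filter under these rules (the column name `TimeSerAndDeser_ns` is assumed to be the normalized total time; the tool's actual implementation may differ):

```python
def tukey_filter(group, col="TimeSerAndDeser_ns"):
    """Apply Tukey's fences to one (serializer, data type, mode) group."""
    if len(group) < 10:               # too few measurements: keep all
        return group
    q1, q3 = group[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    if iqr == 0:                      # all values identical: nothing to filter
        return group
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = group[group[col].between(lower, upper)]
    return kept if len(kept) else group   # never empty out a group

filtered = (
    df.groupby(["SerializerName", "TestDataName", "StringOrStream"], group_keys=False)
      .apply(tukey_filter)
)
```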
#### Warmup Exclusion
Before IQR filtering, the first repetition (RepetitionIndex 0) of each test group is excluded from analysis. This warmup run typically contains:
- JIT compilation overhead: First-time code compilation (especially in C#)
- Static initialization: Type constructors and static field initialization
- Cache cold starts: Cold CPU caches, branch predictors, and TLB
These warmup effects can be 10-100× slower than steady-state performance and often blend into the Q3 tail, reducing IQR filter effectiveness. By excluding RepetitionIndex 0 before filtering, we ensure:
- More accurate baseline for IQR calculation (Q1, Q3, IQR)
- Better detection of true runtime outliers (GC pauses, thread delays)
- Representative performance metrics for production scenarios
The pipeline tracks these counts per group:

| Metric | Tracking |
|---|---|
| `runs_raw` | Original count before warmup exclusion |
| `warmup_skipped` | Count of `RepetitionIndex` 0 runs excluded |
| `outliers_removed` | Count of IQR-filtered outliers |
| `runs` | Final count after all filtering |
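Tying warmup exclusion and outlier filtering together (reusing the `tukey_filter` sketch above; counts shown globally for brevity, while the tool tracks them per group), the tracked metrics fall out of simple length differences:

```python
runs_raw = len(df)

# Drop the warmup repetition before computing the IQR fences.
steady = df[df["RepetitionIndex"] != 0]
warmup_skipped = runs_raw - len(steady)

filtered = (
    steady.groupby(["SerializerName", "TestDataName", "StringOrStream"], group_keys=False)
          .apply(tukey_filter)
)
outliers_removed = len(steady) - len(filtered)
runs = len(filtered)
```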
#### Example Impact
| Serializer | Data Type | Mode | Mean Before Filtering (ns) | Mean After Filtering (ns) | Improvement |
|---|---|---|---|---|---|
| FlatSharp | Integer | string | 911,721,849 | 9,686 | ~94,000× |
| Ceras | Integer | string | ~6,870,000 | 69,524 | ~99× |
| Jil | Person | string | ~1,370,000 | 62,582 | ~22× |
### 4. Statistics Computation
After filtering, the following metrics are computed per group:
| Metric | Description |
|---|---|
| `avg_time_ser_ns` | Mean serialization time (nanoseconds) |
| `avg_time_deser_ns` | Mean deserialization time (nanoseconds) |
| `avg_time_total_ns` | Mean total time (nanoseconds) |
| `avg_ops_per_sec` | Operations per second (1e9 / `avg_time_total_ns`) |
| `min_ops_per_sec` | Min ops/sec (derived from the max time) |
| `max_ops_per_sec` | Max ops/sec (derived from the min time) |
| `median_size_bytes` | Median serialized size in bytes |
| `runs` | Count of measurements after all filtering |
| `runs_raw` | Original count before warmup exclusion |
| `warmup_skipped` | Count of warmup runs (`RepetitionIndex` 0) excluded |
| `outliers_removed` | Count of IQR-filtered outliers |
Ops/sec is recalculated as 1e9 / time-in-nanoseconds for both languages (rather than relying on the `OpPerSec*` values reported in the logs), ensuring the figures are directly comparable.
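A sketch of the aggregation (the `*_ns` column names assume the normalization step above; the tool's actual code may differ):

```python
stats = (
    filtered.groupby(["SerializerName", "TestDataName", "StringOrStream"])
    .agg(
        avg_time_ser_ns=("TimeSer_ns", "mean"),
        avg_time_deser_ns=("TimeDeser_ns", "mean"),
        avg_time_total_ns=("TimeSerAndDeser_ns", "mean"),
        median_size_bytes=("Size", "median"),
        runs=("TimeSerAndDeser_ns", "size"),
    )
    .reset_index()
)

# Recompute ops/sec from normalized nanoseconds so C# and Python agree;
# min/max ops/sec follow the same pattern from the max/min times.
stats["avg_ops_per_sec"] = 1e9 / stats["avg_time_total_ns"]
```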
## Pivot Tables
Tabular views of performance metrics organized by:
- Rows: Serializers
- Columns: Modes or Data Types
- Values: Avg time or Ops/Sec
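For example, a serializer-by-mode view of ops/sec (building on the `stats` frame sketched above):

```python
pivot = stats.pivot_table(
    index="SerializerName",     # rows: serializers
    columns="StringOrStream",   # columns: modes (swap in TestDataName for the other view)
    values="avg_ops_per_sec",   # values: ops/sec (or an avg_time_* column)
)
```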
## Visualization

### Violin Plots
The HTML dashboard includes violin plots showing the distribution of serialization vs deserialization times per data type. These use seaborn's `catplot(kind='violin', split=True)` to show:
- Top side: Serialize operation distribution
- Bottom side: Deserialize operation distribution
This reveals performance characteristics that averages hide, such as:
- Bimodal distributions (suggesting different code paths)
- Variance within serializers
- Outliers that passed the IQR filter
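A sketch of the plot call (assuming a long-form frame `long_df` with one row per measurement and an `operation` column holding 'serialize'/'deserialize'; the dashboard's exact styling is omitted). Horizontal orientation makes the split halves stack top/bottom as described above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.catplot(
    data=long_df,
    y="TestDataName", x="time_ns",  # horizontal violins
    hue="operation",                # two levels -> one half-violin each
    kind="violin", split=True,
)
g.set_axis_labels("Time (ns)", "Test data type")
plt.show()
```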
## Report Generation

### Markdown Summary
- Pivot tables in GitHub-flavored markdown
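For instance, pandas can emit a pivot directly as a GitHub-flavored table (`DataFrame.to_markdown()` requires the `tabulate` package; the output path is illustrative):

```python
# Write the serializer-by-mode pivot as a GitHub-flavored markdown table.
with open("summary.md", "w") as fh:
    fh.write(pivot.to_markdown())
```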
## Validation
To verify the analysis is working correctly:
- Check outlier counts: Look for the console output showing how many outliers were removed
- Cross-check pivot tables: Serializer × Mode tables should show reasonable consistency
- Compare with notebook: Results should align with the Jupyter notebook analysis
- Sanity check extreme values: No serializer should show >1 second average times for simple objects
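The last check can be automated with a hypothetical assertion over the `stats` frame (names reused from the sketches above):

```python
# 1 second = 1e9 ns; simple payloads should sit far below this after filtering.
simple = stats[stats["TestDataName"].isin(["Integer", "SimpleObject"])]
too_slow = simple[simple["avg_time_total_ns"] > 1e9]
assert too_slow.empty, f"Suspiciously slow groups:\n{too_slow}"
```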
## References
- Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.
- seaborn `catplot` documentation: https://seaborn.pydata.org/generated/seaborn.catplot.html