Benchmark Analysis Methodology

This document describes the statistical methods and data processing pipeline used in the serializer benchmark analysis.

Overview

The benchmark analysis tool processes raw CSV logs from C# and Python benchmark runs, normalizes time units, filters outliers, and generates comparative reports. The goal is to provide accurate, comparable performance metrics across different serializers and languages.

Data Pipeline

1. Raw Data Ingestion

Benchmark logs are CSV files with the following columns:

StringOrStream: Mode of operation (Stream/string/bytes)
TestDataName: Type of test data (Integer, Person, SimpleObject, etc.)
Repetitions: Number of repetitions in the batch
RepetitionIndex: Index within the repetition batch
SerializerName: Name of the serializer being tested
TimeSer: Time for serialization (ticks for C#, nanoseconds for Python)
TimeDeser: Time for deserialization (ticks for C#, nanoseconds for Python)
Size: Size of serialized output in bytes
TimeSerAndDeser: Combined time for serialization + deserialization
OpPerSecSer, OpPerSecDeser, OpPerSecSerAndDeser: Operations per second (as reported by benchmark)
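
As a rough illustration of the ingestion step, the sketch below reads one log with Python's standard csv module. It assumes a comma-delimited file whose header row matches the column names above; the helper name load_benchmark_rows, the NUMERIC_COLUMNS tuple, and the sample values are hypothetical and not taken from the tool.

```python
import csv
import io

# Columns from the list above that are assumed to hold numeric values.
NUMERIC_COLUMNS = (
    "Repetitions", "RepetitionIndex", "TimeSer", "TimeDeser", "Size",
    "TimeSerAndDeser", "OpPerSecSer", "OpPerSecDeser", "OpPerSecSerAndDeser",
)

def load_benchmark_rows(handle):
    """Read a benchmark CSV (hypothetical helper) into a list of dicts."""
    rows = []
    for record in csv.DictReader(handle):
        for column in NUMERIC_COLUMNS:
            record[column] = float(record[column])
        rows.append(record)
    return rows

# Made-up one-row log, used only to show the shape of the data.
SAMPLE_LOG = (
    "StringOrStream,TestDataName,Repetitions,RepetitionIndex,SerializerName,"
    "TimeSer,TimeDeser,Size,TimeSerAndDeser,"
    "OpPerSecSer,OpPerSecDeser,OpPerSecSerAndDeser\n"
    "string,Integer,100,0,ExampleSerializer,1200,800,4,2000,8333,12500,5000\n"
)

sample_rows = load_benchmark_rows(io.StringIO(SAMPLE_LOG))
```

Text columns such as SerializerName stay as strings, so they can be used directly as grouping keys later in the pipeline.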

2. Time Unit Normalization

C# and Python benchmarks use different time units:

C#: Uses ticks (100 nanoseconds per tick)
Python: Uses nanoseconds directly

The _detect_time_unit() function auto-detects the unit based on magnitude:

Values > 1,000,000 are assumed to be C# ticks → multiplied by 100 to get nanoseconds
Values ≤ 1,000,000 are assumed to be Python nanoseconds → used as-is

This ensures all timing comparisons are done in consistent nanosecond units.
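
The magnitude rule can be sketched as follows. This is not the tool's _detect_time_unit() implementation: the source does not say whether detection runs per value or per file, so this sketch assumes per value, and the names to_nanoseconds, TICK_NS, and TICK_THRESHOLD are hypothetical.

```python
TICK_NS = 100                # one C# tick is 100 nanoseconds
TICK_THRESHOLD = 1_000_000   # magnitude cut-off described above

def to_nanoseconds(raw_time):
    """Apply the magnitude rule to one raw timing value (hypothetical helper)."""
    if raw_time > TICK_THRESHOLD:
        return raw_time * TICK_NS   # treated as C# ticks
    return raw_time                 # treated as Python nanoseconds

print(to_nanoseconds(2_500_000))  # 250000000
print(to_nanoseconds(40_000))     # 40000
```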

3. Outlier Filtering

Raw benchmark data often contains extreme outliers due to:

GC pauses
Thread scheduling delays
JIT compilation overhead (first-run effects)
System load spikes

These outliers can severely skew mean calculations. For example, a single 90-second measurement among 99 sub-millisecond measurements pulls the mean up to about 0.9 seconds, roughly three orders of magnitude above any typical run.
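
The arithmetic behind that example can be checked directly. The values below are made up for illustration: 99 runs at 0.5 ms and one 90-second stall.

```python
from statistics import mean, median

# All values in nanoseconds: 99 runs at 0.5 ms plus one 90-second stall.
timings_ns = [500_000] * 99 + [90_000_000_000]

mean_ns = mean(timings_ns)      # about 0.9 s, dominated by the single stall
median_ns = median(timings_ns)  # 0.5 ms, unaffected by the stall

print(mean_ns, median_ns)
```

The mean lands near 900 ms even though no individual run took that long, which is why the pipeline filters before averaging.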

Tukey's IQR Method

We use Tukey's Interquartile Range (IQR) fences for outlier detection:

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

Values outside [lower, upper] are considered outliers and removed.
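
A minimal sketch of the fence computation, using only the standard library. The quartile interpolation method is an assumption (the source does not state which one the tool uses), and tukey_fences and the sample data are hypothetical.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # Assumption: "inclusive" interpolation; the tool may use another method.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obvious outlier.
timings_ns = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
lower, upper = tukey_fences(timings_ns)
kept = [t for t in timings_ns if lower <= t <= upper]

print(lower, upper)  # 94.5 106.5
print(kept)          # the 5000 ns measurement is dropped
```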

Filtering Rules

Applied per group: (SerializerName, TestDataName, StringOrStream)
Only applied when group has ≥ 10 measurements
If IQR is 0 (for example, when all values are identical), no filtering is done
If filtering would remove all values, original data is preserved
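
The four rules could be combined roughly as shown below. This is a sketch under assumptions, not the tool's code: the row key time_ns, the helper names, and the inclusive quartile method are all hypothetical choices.

```python
from collections import defaultdict
from statistics import quantiles

MIN_GROUP_SIZE = 10  # groups smaller than this are left untouched

def filter_group(values, k=1.5):
    """Apply the rules above to one group's timings; return the kept values."""
    values = list(values)
    if len(values) < MIN_GROUP_SIZE:
        return values
    # Assumption: inclusive quartile interpolation; the source does not specify.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return values
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    return kept or values  # fall back to the original data if nothing survives

def filter_rows(rows, time_key="time_ns"):
    """Group rows by (SerializerName, TestDataName, StringOrStream), then filter."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["SerializerName"], row["TestDataName"], row["StringOrStream"])
        groups[key].append(row)
    kept_rows = []
    for members in groups.values():
        kept_values = set(filter_group(m[time_key] for m in members))
        kept_rows.extend(m for m in members if m[time_key] in kept_values)
    return kept_rows
```

Grouping first matters because a timing that is normal for a large Person payload could look like an outlier next to Integer measurements.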

Warmup Exclusion

Before IQR filtering, the first repetition (RepetitionIndex 0) of each test group is excluded from analysis. This warmup run typically contains:

JIT compilation overhead: First-time code compilation (especially in C#)
Static initialization: Type constructors and static field initialization
Cache cold starts: Cold CPU caches, branch predictors, and TLB

These warmup effects can be 10-100× slower than steady-state performance. When they are not extreme enough to fall outside the fences, they sit in the upper tail, inflate Q3 and the IQR, and make the filter less effective. Excluding RepetitionIndex 0 before filtering is intended to give:

More accurate baseline for IQR calculation (Q1, Q3, IQR)
Better detection of true runtime outliers (GC pauses, thread delays)
Representative performance metrics for production scenarios

The following counts are tracked for each group:

Metric	Description
`runs_raw`	Original count before warmup exclusion
`warmup_skipped`	Count of RepetitionIndex 0 excluded
`outliers_removed`	Count of IQR-filtered outliers
`runs`	Final count after all filtering
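
One way to drop the warmup run and keep these four counts in step is sketched below. The function name filter_with_tracking, the time_ns key, and the stand-in outlier filter are hypothetical; in the real pipeline the IQR filter would take the stand-in's place.

```python
def filter_with_tracking(group_rows, outlier_filter):
    """Drop RepetitionIndex 0, apply an outlier filter, and report the counts."""
    steady = [row for row in group_rows if row["RepetitionIndex"] != 0]
    kept = outlier_filter(steady)
    counts = {
        "runs_raw": len(group_rows),
        "warmup_skipped": len(group_rows) - len(steady),
        "outliers_removed": len(steady) - len(kept),
        "runs": len(kept),
    }
    return kept, counts

# Made-up group: a slow warmup run, three steady runs, and one stalled run.
group = [
    {"RepetitionIndex": 0, "time_ns": 4_000_000},
    {"RepetitionIndex": 1, "time_ns": 60_000},
    {"RepetitionIndex": 2, "time_ns": 62_000},
    {"RepetitionIndex": 3, "time_ns": 61_000},
    {"RepetitionIndex": 4, "time_ns": 9_000_000},
]

# Stand-in filter for illustration only (NOT the IQR rule): drop runs over 1 ms.
def drop_over_1ms(rows):
    return [row for row in rows if row["time_ns"] <= 1_000_000]

kept, counts = filter_with_tracking(group, drop_over_1ms)
print(counts)
```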

Example Impact

Serializer	Data Type	Mode	Before Filtering (ns)	After Filtering (ns)	Improvement
FlatSharp	Integer	string	911,721,849	9,686	~94,000×
Ceras	Integer	string	~6,870,000	69,524	~99×
Jil	Person	string	~1,370,000	62,582	~22×

4. Statistics Computation

After filtering, the following metrics are computed per group:

Metric	Description
`avg_time_ser_ns`	Mean serialization time (nanoseconds)
`avg_time_deser_ns`	Mean deserialization time (nanoseconds)
`avg_time_total_ns`	Mean total time (nanoseconds)
`avg_ops_per_sec`	Operations per second (1e9 / avg_time_total_ns)
`min_ops_per_sec`	Min ops/sec (from max time)
`max_ops_per_sec`	Max ops/sec (from min time)
`median_size_bytes`	Median serialized size
`runs`	Count of measurements after all filtering
`runs_raw`	Original count before warmup exclusion
`warmup_skipped`	Count of warmup (RepetitionIndex 0) excluded
`outliers_removed`	Count of IQR-filtered outliers

Ops/Sec is recalculated consistently using 1e9 / nanoseconds for both languages, ensuring comparability.
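
The per-group metrics might be computed along these lines. The row keys (time_ser_ns, time_deser_ns, time_total_ns, size_bytes) are hypothetical, and the sketch assumes that the minimum and maximum ops/sec are derived from the total time, which the table above does not state explicitly.

```python
from statistics import mean, median

NS_PER_SECOND = 1e9

def summarize_group(rows):
    """Compute per-group metrics for rows already normalized to nanoseconds."""
    totals = [row["time_total_ns"] for row in rows]
    avg_total = mean(totals)
    return {
        "avg_time_ser_ns": mean(row["time_ser_ns"] for row in rows),
        "avg_time_deser_ns": mean(row["time_deser_ns"] for row in rows),
        "avg_time_total_ns": avg_total,
        "avg_ops_per_sec": NS_PER_SECOND / avg_total,
        "min_ops_per_sec": NS_PER_SECOND / max(totals),  # slowest run
        "max_ops_per_sec": NS_PER_SECOND / min(totals),  # fastest run
        "median_size_bytes": median(row["size_bytes"] for row in rows),
        "runs": len(rows),
    }

# Made-up measurements for a single group.
rows = [
    {"time_ser_ns": 400, "time_deser_ns": 600, "time_total_ns": 1000, "size_bytes": 12},
    {"time_ser_ns": 600, "time_deser_ns": 1400, "time_total_ns": 2000, "size_bytes": 12},
]
stats = summarize_group(rows)
```

Because every input is already in nanoseconds, the same 1e9 divisor applies to C# and Python rows alike.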

Pivot Tables

Tabular views of performance metrics organized by:

Rows: Serializers
Columns: Modes or Data Types
Values: Avg time or Ops/Sec
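
A pivot of this shape can be sketched with plain dictionaries. The helper build_pivot and the serializer names are hypothetical, and averaging the groups that fall into the same cell is an assumption; the source does not say how such groups are combined.

```python
from collections import defaultdict
from statistics import mean

def build_pivot(stats_rows, column_key, value_key):
    """Pivot per-group stats into {serializer: {column: value}}."""
    cells = defaultdict(lambda: defaultdict(list))
    for row in stats_rows:
        cells[row["SerializerName"]][row[column_key]].append(row[value_key])
    # Assumption: groups sharing a cell are combined by taking their mean.
    return {
        serializer: {column: mean(values) for column, values in columns.items()}
        for serializer, columns in cells.items()
    }

# Made-up per-group results.
stats_rows = [
    {"SerializerName": "SerializerA", "StringOrStream": "string",
     "TestDataName": "Integer", "avg_time_total_ns": 1000},
    {"SerializerName": "SerializerA", "StringOrStream": "string",
     "TestDataName": "Person", "avg_time_total_ns": 3000},
    {"SerializerName": "SerializerA", "StringOrStream": "Stream",
     "TestDataName": "Integer", "avg_time_total_ns": 1500},
    {"SerializerName": "SerializerB", "StringOrStream": "string",
     "TestDataName": "Integer", "avg_time_total_ns": 800},
]

by_mode = build_pivot(stats_rows, "StringOrStream", "avg_time_total_ns")
by_data_type = build_pivot(stats_rows, "TestDataName", "avg_time_total_ns")
```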

Visualization

Violin Plots

The HTML dashboard includes violin plots showing the distribution of serialization vs deserialization times per data type. These use seaborn's catplot(kind='violin', split=True) to show:

Top side: Serialize operation distribution
Bottom side: Deserialize operation distribution

This reveals performance characteristics that averages hide, such as:

Bimodal distributions (suggesting different code paths)
Variance within serializers
Outliers that passed the IQR filter
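
A split violin needs the data in long form, with one row per measured operation and a two-level column to split on. The sketch below shows one way to reshape the rows and pass them to seaborn's catplot. The column names data_type, operation, and time_ns and both helper names are hypothetical, and plot_violins requires pandas and seaborn to be installed.

```python
def to_long_format(rows):
    """Reshape one row per run into one row per (run, operation)."""
    long_rows = []
    for row in rows:
        for operation, key in (("Serialize", "time_ser_ns"),
                               ("Deserialize", "time_deser_ns")):
            long_rows.append({
                "data_type": row["TestDataName"],
                "operation": operation,
                "time_ns": row[key],
            })
    return long_rows

def plot_violins(long_rows):
    """Draw split violins per data type (third-party imports kept local)."""
    import pandas as pd
    import seaborn as sns

    frame = pd.DataFrame(long_rows)
    # A numeric x and categorical y give horizontal violins; hue has exactly
    # two levels, so split=True draws one half of the violin per operation.
    return sns.catplot(
        data=frame, x="time_ns", y="data_type", hue="operation",
        kind="violin", split=True,
    )

# Made-up runs for illustration.
runs = [
    {"TestDataName": "Integer", "time_ser_ns": 400, "time_deser_ns": 600},
    {"TestDataName": "Person", "time_ser_ns": 900, "time_deser_ns": 1400},
]
long_rows = to_long_format(runs)
```

Calling plot_violins(long_rows) returns a seaborn FacetGrid, which can be saved with its savefig method.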

Report Generation

Markdown Summary

The Markdown summary contains the pivot tables rendered as GitHub-flavored Markdown.
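
Rendering a pivot as a GitHub-flavored Markdown table takes only a few lines. The sketch assumes the pivot is a nested dictionary of the form {serializer: {column: value}}; pivot_to_markdown, the number format, and the sample values are hypothetical.

```python
def pivot_to_markdown(pivot, row_label="Serializer"):
    """Render {row: {column: value}} as a GitHub-flavored Markdown table."""
    columns = sorted({column for cells in pivot.values() for column in cells})
    lines = [
        "| " + " | ".join([row_label, *columns]) + " |",
        "| " + " | ".join(["---"] * (len(columns) + 1)) + " |",
    ]
    for name in sorted(pivot):
        # Missing combinations are rendered as empty cells.
        cells = [
            f"{pivot[name][column]:,.0f}" if column in pivot[name] else ""
            for column in columns
        ]
        lines.append("| " + " | ".join([name, *cells]) + " |")
    return "\n".join(lines)

# Made-up pivot of average total time in nanoseconds.
pivot = {
    "SerializerA": {"Stream": 1200.0, "string": 1500.0},
    "SerializerB": {"string": 900.0},
}
table = pivot_to_markdown(pivot)
print(table)
```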

Validation

To verify the analysis is working correctly:

Check outlier counts: Look for the console output showing how many outliers were removed
Cross-check pivot tables: Serializer × Mode tables should show reasonable consistency
Compare with notebook: Results should align with the Jupyter notebook analysis
Sanity check extreme values: No serializer should show >1 second average times for simple objects
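
Two of these checks lend themselves to automation: the tracked counts should add up, and no group should average more than one second. The sketch below assumes per-group result rows carrying the metric names from the statistics table; validate_stats and the sample rows are hypothetical.

```python
ONE_SECOND_NS = 1_000_000_000

def validate_stats(stats_rows, limit_ns=ONE_SECOND_NS):
    """Return a list of human-readable problems found in per-group stats."""
    problems = []
    for row in stats_rows:
        label = "/".join(
            (row["SerializerName"], row["TestDataName"], row["StringOrStream"])
        )
        expected_runs = (
            row["runs_raw"] - row["warmup_skipped"] - row["outliers_removed"]
        )
        if row["runs"] != expected_runs:
            problems.append(f"{label}: run counts do not add up")
        if row["avg_time_total_ns"] > limit_ns:
            problems.append(f"{label}: average time exceeds {limit_ns} ns")
    return problems

# Made-up results: the second row still carries an unfiltered stall.
stats_rows = [
    {"SerializerName": "SerializerA", "TestDataName": "Integer",
     "StringOrStream": "string", "runs_raw": 100, "warmup_skipped": 1,
     "outliers_removed": 4, "runs": 95, "avg_time_total_ns": 9_700},
    {"SerializerName": "SerializerB", "TestDataName": "Integer",
     "StringOrStream": "string", "runs_raw": 100, "warmup_skipped": 1,
     "outliers_removed": 0, "runs": 99, "avg_time_total_ns": 2_500_000_000},
]
problems = validate_stats(stats_rows)
```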

References

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
seaborn.catplot, seaborn API reference.