Test Data Configuration
To ensure consistent benchmarking across different languages (C# and Python), we use a centralized configuration file located at schemas/test_data_config.json. This allows us to control the structure and size of the test objects, making comparisons more accurate.
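As a rough illustration, a Python benchmark might read the shared file along the following lines. The JSON layout (top-level StringOptions, CollectionOptions, and RandomSeed keys), the sample values, and the helper name load_test_data_config are assumptions inferred from the parameter names described in this document; they are not copied from the actual file.

```python
import json

# Hypothetical stand-in for the contents of schemas/test_data_config.json.
# Key names follow this document; the nesting and all values are assumed.
SAMPLE_CONFIG = """
{
  "StringOptions": {
    "MinWordLength": 3,
    "MaxWordLength": 10,
    "MinPhraseLength": 2,
    "MaxPhraseLength": 6,
    "MinIdLength": 8,
    "MaxIdLength": 16,
    "DuplicationFactor": 0.3
  },
  "CollectionOptions": {
    "PersonPoliceRecordsCount": 3,
    "TelemetryMeasurementsCount": 100,
    "StringArrayCount": 50,
    "EdiClaimsCount": 5,
    "EdiLinesPerClaimCount": 4
  },
  "RandomSeed": 42
}
"""


def load_test_data_config(text: str) -> dict:
    """Parse the config text and check the sections this document describes."""
    config = json.loads(text)
    for section in ("StringOptions", "CollectionOptions", "RandomSeed"):
        if section not in config:
            raise KeyError(f"missing config section: {section}")
    factor = config["StringOptions"]["DuplicationFactor"]
    if not 0.0 <= factor <= 1.0:
        raise ValueError("DuplicationFactor must be between 0 and 1")
    return config


# In a real benchmark the text would be read from schemas/test_data_config.json.
config = load_test_data_config(SAMPLE_CONFIG)
print(config["CollectionOptions"]["TelemetryMeasurementsCount"])
```

Validating the file once at load time, rather than in each generator, keeps a malformed shared config from silently producing different payloads in the two languages.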
Configuration Parameters
StringOptions
- MinWordLength / MaxWordLength: Controls the size of individual words generated for names and descriptions.
- MinPhraseLength / MaxPhraseLength: Controls the number of words in a phrase (e.g., for authorities or descriptions).
- MinIdLength / MaxIdLength: Controls the length of generated ID strings.
- DuplicationFactor: A value between 0 and 1 representing the probability that a previously generated string will be reused instead of generating a new one. This is crucial for testing serialization algorithms that support object referencing or string deduplication (e.g., CBOR with sharing, or custom deduplication in some formats).
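The reuse behavior behind DuplicationFactor can be sketched as follows. This is a minimal illustration, not the project's generator: the function name make_string_source, the lowercase-only alphabet, and the use of Python's built-in random module are all assumptions.

```python
import random
import string


def make_string_source(rng, min_len, max_len, duplication_factor):
    """Return a function that yields either a new word or a previously generated one."""
    pool = []  # every distinct word generated so far

    def next_string():
        # With probability duplication_factor, reuse an earlier word (if one exists).
        if pool and rng.random() < duplication_factor:
            return rng.choice(pool)
        # Otherwise build a fresh word whose length lies in [min_len, max_len].
        length = rng.randint(min_len, max_len)
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        pool.append(word)
        return word

    return next_string


rng = random.Random(42)
next_string = make_string_source(rng, min_len=3, max_len=10, duplication_factor=0.8)
words = [next_string() for _ in range(1000)]
unique = len(set(words))
print(f"{unique} distinct strings out of {len(words)}")
```

With a factor of 0.8, roughly one draw in five produces a new word, so the output contains far fewer distinct strings than total strings.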
CollectionOptions
- PersonPoliceRecordsCount: The number of records in the PoliceRecords array for the Person object.
- TelemetryMeasurementsCount: The number of double-precision floating-point numbers in the TelemetryData object.
- StringArrayCount: The number of strings in the StringArrayObject.
- EdiClaimsCount / EdiLinesPerClaimCount: Controls the complexity of the EDI 835 document.
RandomSeed
- A fixed seed that makes the "random" data identical across runs. The data is identical across languages only if both implementations use a compatible PRNG or equivalent generation logic.
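The compatibility caveat matters because the standard generators differ: Python's random module is based on the Mersenne Twister, while .NET's System.Random uses a different algorithm, so the same seed does not produce the same sequence in both. One possible way around this, sketched below, is a small generator that can be ported line by line. The class name PortableRng and the choice of a 32-bit linear congruential generator are assumptions made for illustration; the project may use a different approach.

```python
class PortableRng:
    """32-bit linear congruential generator, simple enough to port line by line."""

    MASK = 0xFFFFFFFF  # keep the state within 32 bits, as a C# uint would

    def __init__(self, seed: int):
        self.state = seed & self.MASK

    def next_u32(self) -> int:
        # Multiplier and increment from the widely used Numerical Recipes LCG.
        self.state = (1664525 * self.state + 1013904223) & self.MASK
        return self.state

    def next_int(self, low: int, high: int) -> int:
        """Integer in [low, high]; uses the high 16 bits, so the span must be <= 65536."""
        span = high - low + 1
        return low + ((self.next_u32() >> 16) % span)

    def next_float(self) -> float:
        """Float in [0, 1)."""
        return self.next_u32() / 4294967296.0


rng = PortableRng(42)
word_lengths = [rng.next_int(3, 10) for _ in range(5)]
print(word_lengths)
```

A C# port of this sketch would keep the state in a uint, whose arithmetic wraps on overflow unless checked arithmetic is enabled. Both languages must also consume random values in the same order for the generated objects to match.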
Design Reasons
- Reproducibility: With a fixed seed and a shared configuration, the C# and Python benchmarks serialize the same data payload, subject to the PRNG compatibility caveat noted under RandomSeed.
- Real-life Resemblance: Default values are chosen to reflect typical data sizes in enterprise applications. For example, a telemetry packet often contains around 100 measurements, and a person's record typically doesn't have hundreds of police records.
- Serialization Optimization Testing: The DuplicationFactor allows us to stress-test how different serializers handle redundant data. Serializers that use dictionary-based compression or object tracking should show significantly better performance and smaller payloads when this factor is high.
- Cross-Language Consistency: Hardcoding these values in each language's source code was prone to desynchronization. Moving them to a shared schema ensures they are always in sync.
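The expected effect of DuplicationFactor on payload size can be checked with a small experiment. In the sketch below, zlib stands in for a dictionary-based compressor and JSON for the wire format; neither is claimed to be one of the benchmarked serializers, and the helper names, string count, and word lengths are illustrative assumptions.

```python
import json
import random
import string
import zlib


def generate_strings(count, duplication_factor, seed):
    """Generate words, reusing an earlier word with probability duplication_factor."""
    rng = random.Random(seed)
    pool, out = [], []
    for _ in range(count):
        if pool and rng.random() < duplication_factor:
            out.append(rng.choice(pool))
        else:
            length = rng.randint(3, 10)
            word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            pool.append(word)
            out.append(word)
    return out


def payload_sizes(duplication_factor):
    """Return (raw_bytes, compressed_bytes) for a JSON array of generated strings."""
    raw = json.dumps(generate_strings(2000, duplication_factor, seed=42)).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


low_raw, low_packed = payload_sizes(0.1)
high_raw, high_packed = payload_sizes(0.9)
print(f"factor 0.1: {low_raw} bytes raw, {low_packed} bytes compressed")
print(f"factor 0.9: {high_raw} bytes raw, {high_packed} bytes compressed")
```

The raw sizes stay in the same range because the string count is fixed, while the compressed size drops noticeably at the higher factor. A serializer with built-in string deduplication would be expected to show a similar gap without a separate compression step.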