Cleaning Tick Data and Writing Parquet

If your research starts with raw ticks, your edge depends on the boring part: making the data trustworthy. Here’s a practical pipeline that catches the common failure modes and produces Parquet that’s fast to scan, filter, and join.

January 30, 2026

1) Define a strict schema first

Decide what a “tick” means for your venue/data source. At minimum:

  • Timestamp (UTC, nanoseconds if available)
  • Price (decimal or integer ticks)
  • Size (integer)
  • Side (buy/sell/unknown if you have it)
  • Flags (auction, implied, etc. if provided)
  • Symbol + venue + session identifiers

A strict schema prevents silent type drift (strings in numeric columns, “NaN” sentinels, mixed timestamp formats).
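
As a concrete starting point, here is a minimal sketch of such a schema in pyarrow. The field names (utc_ns, price_ticks, and so on) are illustrative assumptions, not a standard; map them to whatever your feed actually provides.

import pyarrow as pa

# Illustrative strict tick schema; adapt field names to your venue/feed.
# Declaring types up front turns silent type drift into a loud cast error at ingest.
TICK_SCHEMA = pa.schema([
    pa.field("utc_ns", pa.int64(), nullable=False),       # UTC epoch nanoseconds
    pa.field("symbol", pa.string(), nullable=False),
    pa.field("venue", pa.string(), nullable=False),
    pa.field("session", pa.string()),
    pa.field("price_ticks", pa.int64(), nullable=False),  # price as integer ticks
    pa.field("size", pa.int64(), nullable=False),
    pa.field("side", pa.string()),                        # "buy" / "sell" / "unknown"
    pa.field("flags", pa.int32()),                        # bitmask: auction, implied, ...
])

def enforce_schema(table: pa.Table) -> pa.Table:
    """Cast an incoming Arrow table to the strict schema; raises if types don't fit."""
    # Note: cast expects the incoming table to have these columns, in this order.
    return table.cast(TICK_SCHEMA, safe=True)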

2) Normalize timestamps (and prove ordering)

Convert everything to UTC and a single unit (prefer nanoseconds). Then validate monotonicity within symbol/venue partitions. If ordering isn’t guaranteed by your source, sort explicitly and track the cost; a pandas sketch follows the conceptual checks below.

# Example checks (conceptual)
- parse raw timestamp → utc_ns
- assert utc_ns not null
- per symbol-day: check out-of-order ratio
- if ratio > threshold: sort + flag partition
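
A minimal pandas sketch of these checks, assuming one symbol-day per DataFrame, a raw timestamp column named raw_ts holding ISO-8601 strings, and an arbitrary 0.1% out-of-order threshold that you should tune to your source.

import pandas as pd

OUT_OF_ORDER_THRESHOLD = 0.001  # illustrative: tolerate up to 0.1% out-of-order ticks

def normalize_and_check(df: pd.DataFrame, ts_col: str = "raw_ts") -> pd.DataFrame:
    # Parse to timezone-aware UTC (assumes ISO-8601 strings), then take epoch nanoseconds.
    ts = pd.to_datetime(df[ts_col], utc=True)
    if ts.isna().any():
        raise ValueError("unparseable timestamps present")
    df = df.assign(utc_ns=ts.astype("int64"))

    # Fraction of ticks whose timestamp goes backwards relative to the previous tick.
    out_of_order = float((df["utc_ns"].diff() < 0).mean())
    if out_of_order > OUT_OF_ORDER_THRESHOLD:
        # Stable sort keeps arrival order for ticks with identical timestamps.
        df = df.sort_values("utc_ns", kind="stable").reset_index(drop=True)
        df.attrs["sorted"] = True          # flag the partition for the audit report
    df.attrs["out_of_order_ratio"] = out_of_order
    return df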

3) Remove bad prints, duplicates, and impossible values

Common problems: zero/negative prices, absurd sizes, duplicate ticks, repeated timestamps with different prices, and out-of-session prints. Handle these with explicit rules and keep counters so you can audit what changed; a sketch of the rules follows the list below.

  • Drop ticks with non-positive price or size
  • De-duplicate by (symbol, utc_ns, price, size, side)
  • Bound-check using dynamic filters (e.g., median ± k·MAD per window)
  • Remove or flag out-of-session prints (your trading hours definition)
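
A hedged pandas sketch of these rules, reusing the illustrative column names from earlier; the MAD multiplier k and window size are placeholders, and out-of-session filtering is omitted because it depends on your session calendar.

import pandas as pd

def clean_ticks(df: pd.DataFrame, k: float = 8.0, window: int = 500) -> tuple[pd.DataFrame, dict]:
    counters = {"rows_in": len(df)}

    # Rule 1: drop non-positive prices and sizes.
    valid = (df["price_ticks"] > 0) & (df["size"] > 0)
    counters["nonpositive_dropped"] = int((~valid).sum())
    df = df[valid]

    # Rule 2: exact duplicates on the identifying tuple.
    before = len(df)
    df = df.drop_duplicates(subset=["symbol", "utc_ns", "price_ticks", "size", "side"])
    counters["duplicates_dropped"] = before - len(df)

    # Rule 3: rolling median ± k·MAD bound check on price.
    med = df["price_ticks"].rolling(window, min_periods=50, center=True).median()
    mad = (df["price_ticks"] - med).abs().rolling(window, min_periods=50, center=True).median()
    in_bounds = (df["price_ticks"] - med).abs() <= k * mad.clip(lower=1)  # at least 1-tick tolerance
    in_bounds |= med.isna() | mad.isna()  # keep ticks where the window is too short to judge
    counters["outliers_dropped"] = int((~in_bounds).sum())
    df = df[in_bounds]

    counters["rows_out"] = len(df)
    return df.reset_index(drop=True), counters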

4) Create derived views (optional but recommended)

Many strategies don’t need raw ticks. Generate derived tables suited to your use case (a dollar-bar sketch follows the list below):

  • Trade bars (time/volume/dollar)
  • Top-of-book or NBBO snapshots (if you have quotes)
  • Spread, microprice, imbalance features (quotes required)
  • Session-level summary stats for quality monitoring
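
For example, dollar bars can be built from cleaned ticks with one cumulative sum and a groupby. The sketch below assumes a float price column and an integer size column; the $1M bar notional is purely illustrative.

import pandas as pd

def dollar_bars(ticks: pd.DataFrame, bar_notional: float = 1_000_000.0) -> pd.DataFrame:
    # Notional traded per tick; a new bar starts each time the running total
    # crosses another multiple of bar_notional.
    ticks = ticks.assign(notional=ticks["price"] * ticks["size"])
    bar_id = (ticks["notional"].cumsum() // bar_notional).astype("int64")
    return ticks.groupby(bar_id).agg(
        open_ns=("utc_ns", "first"),
        close_ns=("utc_ns", "last"),
        open=("price", "first"),
        high=("price", "max"),
        low=("price", "min"),
        close=("price", "last"),
        volume=("size", "sum"),
        notional=("notional", "sum"),
    )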

5) Write Parquet correctly (partition + sort + types)

Parquet wins when you align storage with your scan patterns. Typical partitions: symbol / venue / date. Sort within partitions by timestamp. Use dictionary encoding for low-cardinality columns (symbol/venue/side). A pyarrow sketch of the write step follows the layout notes below.

Recommended layout:
dataset/
  venue=.../
    symbol=.../
      date=YYYY-MM-DD/
        part-....parquet

Inside each Parquet file:
- sorted by utc_ns
- explicit dtypes (int64 timestamps, int64 sizes, int64 price ticks or a decimal type)
- statistics enabled (min/max) for predicate pushdown
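
A pyarrow sketch of that write step under the layout above, assuming the cleaned table carries venue, symbol, and date columns to partition on; the compression and dictionary choices are illustrative defaults rather than requirements.

import pyarrow as pa
import pyarrow.parquet as pq

def write_partition(table: pa.Table, root: str = "dataset") -> None:
    # Sort within the partition so row-group min/max statistics on utc_ns are tight,
    # which lets query engines skip row groups on timestamp predicates.
    table = table.sort_by([("utc_ns", "ascending")])

    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=["venue", "symbol", "date"],  # -> venue=.../symbol=.../date=.../
        use_dictionary=["side", "session"],          # dictionary-encode low-cardinality columns
        compression="zstd",
        write_statistics=True,                       # min/max stats for predicate pushdown
    )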

6) Add a quality report (so you can trust it)

Every cleaning pipeline should output a small audit artifact per partition: counts in/out, duplicates removed, out-of-order ratio, invalid prints, session coverage, and basic distribution checks. This is what keeps a research pipeline honest.
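
A minimal sketch of such an artifact: a JSON file written next to the partition’s Parquet files, combining the cleaning counters collected earlier with a few distribution checks. The metric names are illustrative.

import json
from pathlib import Path
import pandas as pd

def write_quality_report(partition_dir: str, counters: dict, df: pd.DataFrame) -> None:
    report = {
        **counters,  # rows_in/out, duplicates_dropped, outliers_dropped, out_of_order_ratio, ...
        "first_ns": int(df["utc_ns"].min()),
        "last_ns": int(df["utc_ns"].max()),
        "median_price_ticks": float(df["price_ticks"].median()),
        "median_size": float(df["size"].median()),
    }
    Path(partition_dir, "_quality.json").write_text(json.dumps(report, indent=2))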

If you want this as a production pipeline

We build reproducible pipelines with audit logs, guardrails, and outputs that plug directly into validation/backtesting. If you want us to implement your exact venue + schema, use the institutional request page.