PHI De-identification Leaderboard

Community benchmark for clinical PHI masking models on vkatg/streaming-phi-deidentification-benchmark. Maintained by Venkata Krishna Azith Teja Ganti.

How this works. Results are added only when accompanied by a reproducible eval script run against the public test splits. Self-reported scores without a verifiable script are not accepted. To submit, open an issue on GitHub with your model link, scores, and eval script.

#	Model	PHI F1	Precision	Recall	Eval set
No verified results yet. Submit your model's results on GitHub.

ExposureGuard models are listed here for reference. No independent eval results exist yet. If you run these models against the benchmark, please submit your results.

Model	Role	PHI F1	Risk MAE	Retok Acc	Status

Evaluation Metrics

Models should be evaluated on the crossmodal test split of vkatg/streaming-phi-deidentification-benchmark (10,250 records).

Metrics

PHI F1: Token-level F1 for PHI span detection across NAME, DOB, MRN, ADDRESS, PHONE, SSN, DATE, AGE, LOCATION.
Precision: Fraction of tokens predicted as PHI that are true PHI.
Recall: Fraction of true PHI tokens that are detected.
Risk MAE: Mean absolute error between predicted and reference re-identification risk scores. Only for models that output this signal.
Retok Acc: Retokenization trigger accuracy. Only for models that output this signal.
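
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scorer. It assumes each token carries one label, either a PHI type such as NAME or "O" for non-PHI, and that PHI F1 is micro-averaged over all PHI types. The function names and the "O" convention are assumptions made for this example.

```python
from __future__ import annotations
from typing import Sequence

NON_PHI = "O"  # assumed label for tokens that are not PHI


def token_prf(gold: Sequence[str], pred: Sequence[str]) -> tuple[float, float, float]:
    """Micro-averaged token-level precision, recall and F1 over PHI labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token for token")
    # True positive: a PHI token predicted with the correct PHI type.
    tp = sum(1 for g, p in zip(gold, pred) if p != NON_PHI and p == g)
    # False positive: a predicted PHI label that does not match the gold label.
    fp = sum(1 for g, p in zip(gold, pred) if p != NON_PHI and p != g)
    # False negative: a gold PHI label that was missed or mistyped.
    fn = sum(1 for g, p in zip(gold, pred) if g != NON_PHI and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_absolute_error(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Risk MAE: average absolute gap between reference and predicted risk."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Retok Acc: share of records where the predicted trigger matches."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy example: one missed NAME token and one spurious PHONE token.
gold_labels = ["NAME", "NAME", "O", "DATE", "O"]
pred_labels = ["NAME", "O", "O", "DATE", "PHONE"]
precision, recall, f1 = token_prf(gold_labels, pred_labels)
```

In the toy example there are two true positives, one false positive and one false negative, so precision, recall and F1 all come out at 2/3. Submitted eval scripts should state which averaging they use if it differs from this sketch.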

Benchmark subsets

default: 1,950 standard single-modality clinical text records.
crossmodal: 10,250 records with cross-modal linkage and exposure accumulation triggers.
signed: 390 records with cryptographic audit signatures.
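
One possible way to load a subset is with the Hugging Face datasets library. The sketch below assumes the three subsets are exposed as dataset configurations under the names listed above and that each has a split called "test"; neither is confirmed here, so check the dataset page before relying on it. The helper name load_subset is hypothetical.

```python
REPO_ID = "vkatg/streaming-phi-deidentification-benchmark"

# Record counts as quoted in this document, keyed by subset name.
SUBSET_RECORDS = {"default": 1950, "crossmodal": 10250, "signed": 390}


def load_subset(name: str = "crossmodal", split: str = "test"):
    """Load one benchmark subset (assumes subsets map to dataset configs)."""
    if name not in SUBSET_RECORDS:
        raise ValueError(f"unknown subset {name!r}; expected one of {sorted(SUBSET_RECORDS)}")
    # Imported lazily so the module loads without the dependency installed.
    # Requires: pip install datasets
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)
```

Calling load_subset() with no arguments requests the crossmodal test split, which is the split the leaderboard scores are reported on.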

Submit Results

Open an issue on phi-exposure-guard on GitHub with the following:

Required

Model URL: Hugging Face model repository.
PHI F1: Token-level F1 on the crossmodal test split.
Precision / Recall: On the same split.
Eval script: A script that anyone can run to reproduce your numbers. Scores without a script will not be added.

Optional

Risk MAE: If your model outputs a risk score.
Retok Acc: If your model outputs a retokenization signal.

Models trained on any portion of the benchmark test splits must disclose this. Results will be marked accordingly.
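
Before opening an issue, it can help to check that a submission has every required field. The following is a hypothetical sketch: the Submission class, its field names and the issue text layout are illustrative and are not a format the maintainers prescribe. It also assumes scores are reported as fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    # Required fields
    model_url: str
    phi_f1: float
    precision: float
    recall: float
    eval_script: str
    # Optional fields, only for models that output these signals
    risk_mae: Optional[float] = None
    retok_acc: Optional[float] = None
    # Disclosure: was any part of the benchmark test splits used in training?
    trained_on_test_data: bool = False

    def validate(self) -> None:
        if not self.model_url.startswith("https://huggingface.co/"):
            raise ValueError("model_url should point to a Hugging Face model repository")
        for name in ("phi_f1", "precision", "recall"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        if not self.eval_script.strip():
            raise ValueError("an eval script is required")

    def to_issue_body(self) -> str:
        """Render the submission as plain text for a GitHub issue."""
        self.validate()
        lines = [
            f"Model URL: {self.model_url}",
            f"PHI F1: {self.phi_f1:.4f}",
            f"Precision: {self.precision:.4f}",
            f"Recall: {self.recall:.4f}",
            f"Eval script: {self.eval_script}",
        ]
        if self.risk_mae is not None:
            lines.append(f"Risk MAE: {self.risk_mae:.4f}")
        if self.retok_acc is not None:
            lines.append(f"Retok Acc: {self.retok_acc:.4f}")
        disclosure = "yes" if self.trained_on_test_data else "no"
        lines.append(f"Trained on benchmark test data: {disclosure}")
        return "\n".join(lines)
```

The model URL, scores and script path passed to Submission would be your own; the class only checks that the required pieces are present and within range.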