PHI De-identification Leaderboard

Community benchmark for clinical PHI masking models on vkatg/streaming-phi-deidentification-benchmark. Maintained by Venkata Krishna Azith Teja Ganti.

How this works. Results are added only when accompanied by a reproducible eval script run against the public test splits. Self-reported scores without a verifiable script are not accepted. To submit, open an issue on GitHub with your model link, scores, and eval script.
# | Model | PHI F1 | Precision | Recall | Eval set

No verified results yet.

ExposureGuard models are listed here for reference. No independent eval results exist yet. If you run these models against the benchmark, please submit your results.
Model | Role | PHI F1 | Risk MAE | Retok Acc | Status

Evaluation Metrics

Models should be evaluated on the crossmodal test split of vkatg/streaming-phi-deidentification-benchmark (10,250 records).

Metrics

PHI F1
Token-level F1 for PHI span detection across NAME, DOB, MRN, ADDRESS, PHONE, SSN, DATE, AGE, LOCATION.
Precision
Fraction of predicted PHI spans that are true PHI.
Recall
Fraction of true PHI spans detected.
Risk MAE
Mean absolute error of predicted re-identification risk score. Only for models that output this signal.
Retok Acc
Retokenization trigger accuracy. Only for models that output this signal.
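
As a rough illustration of the metrics above, the sketch below computes them from plain Python lists. It is not the benchmark's official scorer. It assumes one label per token, that the label "O" marks non-PHI tokens, and that a predicted PHI token counts as correct only when its label matches the gold label exactly. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def token_prf(gold: Sequence[str], pred: Sequence[str],
              outside: str = "O") -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall, and F1.

    Assumption: `outside` marks non-PHI tokens, and a predicted PHI token
    is a true positive only if its label equals the gold label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and p == g)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Risk MAE: mean of |true risk - predicted risk| over records."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Retok Acc: fraction of records whose trigger prediction is correct."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy inputs for illustration only; these are not benchmark data.
gold = ["O", "NAME", "NAME", "O", "MRN", "O"]
pred = ["O", "NAME", "O", "DATE", "MRN", "O"]
precision, recall, f1 = token_prf(gold, pred)
```

In the toy example, two of the three predicted PHI tokens are correct and two of the three gold PHI tokens are found, so precision, recall, and F1 all come out to 2/3.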

Benchmark subsets

default
1,950 standard single-modality clinical text records.
crossmodal
10,250 records with cross-modal linkage and exposure accumulation triggers.
signed
390 records with cryptographic audit signatures.
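
A minimal sketch of loading one of these subsets with the Hugging Face `datasets` library follows. It assumes the subset names above are also the dataset's configuration names and that a split named "test" exists; neither is confirmed here. The helper `load_subset` is hypothetical.

```python
# Record counts as listed in the subset descriptions above.
SUBSET_SIZES = {"default": 1950, "crossmodal": 10250, "signed": 390}

REPO_ID = "vkatg/streaming-phi-deidentification-benchmark"


def load_subset(name: str = "crossmodal", split: str = "test"):
    """Load one benchmark subset.

    Assumption: `name` is a valid configuration name on the Hub and
    `split` exists for it.
    """
    if name not in SUBSET_SIZES:
        raise ValueError(f"unknown subset {name!r}; expected one of {sorted(SUBSET_SIZES)}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split=split)
```

Calling `load_subset("crossmodal")` would then fetch the split used for leaderboard scores, provided the assumptions about configuration and split names hold.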

Submit Results

Open an issue on phi-exposure-guard on GitHub with the following:

Required

Model URL
HuggingFace model repository.
PHI F1
Token-level F1 on the crossmodal test split.
Precision / Recall
On the same split.
Eval script
A script that anyone can run to reproduce your numbers. This is required; scores without a script will not be added.

Optional

Risk MAE
If your model outputs a risk score.
Retok Acc
If your model outputs a retokenization signal.

Models trained on any portion of the benchmark test splits must disclose this. Results will be marked accordingly.
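
One way to make sure a submission carries every field listed above is to assemble the issue text programmatically. The sketch below is only a suggestion: the `Submission` class, its field names, and the output layout are hypothetical, and the maintainers do not prescribe any particular issue format here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    # Required fields
    model_url: str
    phi_f1: float
    precision: float
    recall: float
    eval_script: str  # link to, or path of, the reproducible eval script
    # Optional fields
    risk_mae: Optional[float] = None
    retok_acc: Optional[float] = None
    # Disclosure: was any portion of the test splits used in training?
    trained_on_test: bool = False

    def to_issue_body(self) -> str:
        """Render the submission as plain text for a GitHub issue."""
        if not self.model_url.strip():
            raise ValueError("model URL is required")
        if not self.eval_script.strip():
            raise ValueError("eval script is required; scores without one are not added")
        lines = [
            f"Model URL: {self.model_url}",
            f"PHI F1: {self.phi_f1:.4f}",
            f"Precision: {self.precision:.4f}",
            f"Recall: {self.recall:.4f}",
            f"Eval script: {self.eval_script}",
        ]
        if self.risk_mae is not None:
            lines.append(f"Risk MAE: {self.risk_mae:.4f}")
        if self.retok_acc is not None:
            lines.append(f"Retok Acc: {self.retok_acc:.4f}")
        disclosed = "yes" if self.trained_on_test else "no"
        lines.append(f"Trained on test split data: {disclosed}")
        return "\n".join(lines)


# Placeholder values for illustration only; these are not real results.
example = Submission(
    model_url="https://huggingface.co/your-org/your-model",
    phi_f1=0.5,
    precision=0.5,
    recall=0.5,
    eval_script="eval.py",
)
body = example.to_issue_body()
```

The optional lines appear only when the model outputs the corresponding signal, and the disclosure line is always included so that results can be marked accordingly.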