How this works. Results are added only when accompanied by a reproducible eval script run against the public test splits.
Self-reported scores without a verifiable script are not accepted.
To submit, open an issue on GitHub with your model link, scores, and eval script.
Leaderboard
# | Model | PHI F1 | Precision | Recall | Eval set
ExposureGuard Models
ExposureGuard models are listed here for reference. No independent eval results exist yet. If you run these models against the benchmark, please submit your results.
Model | Role | PHI F1 | Risk MAE | Retok Acc | Status
Evaluation Metrics
Models should be evaluated on the crossmodal test split of vkatg/streaming-phi-deidentification-benchmark (10,250 records).
PHI F1: Token-level F1 for PHI span detection across NAME, DOB, MRN, ADDRESS, PHONE, SSN, DATE, AGE, LOCATION.
Precision: Fraction of predicted PHI spans that are true PHI.
Recall: Fraction of true PHI spans that are detected.
Risk MAE: Mean absolute error of the predicted re-identification risk score; reported only for models that output this signal.
Retok Acc: Retokenization trigger accuracy; reported only for models that output this signal.
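The metric definitions above can be sketched in code. The snippet below assumes each PHI span is a (record_id, start_token, end_token, type) tuple with a half-open token range; this schema is illustrative only, not the benchmark's actual format.

```python
def span_tokens(spans):
    """Expand (record_id, start, end, type) spans into a set of labeled tokens."""
    return {(rid, tok, typ)
            for rid, start, end, typ in spans
            for tok in range(start, end)}

def phi_prf(gold_spans, pred_spans):
    """Token-level precision, recall, and F1 over PHI annotations."""
    gold, pred = span_tokens(gold_spans), span_tokens(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def risk_mae(pred_scores, true_scores):
    """Mean absolute error of re-identification risk predictions."""
    return sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / len(true_scores)

def retok_accuracy(pred_flags, true_flags):
    """Fraction of records where the retokenization trigger matches."""
    return sum(p == t for p, t in zip(pred_flags, true_flags)) / len(true_flags)
```

With gold spans `[(0, 0, 2, "NAME")]` and identical predictions, all three span metrics come out to 1.0; a prediction that labels the right tokens with the wrong PHI type counts as a miss, since the type is part of the token label.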
Benchmark subsets
default: 1,950 standard single-modality clinical text records.
crossmodal: 10,250 records with cross-modal linkage and exposure accumulation triggers.
signed: 390 records with cryptographic audit signatures.
Submit Results
Open an issue on the phi-exposure-guard GitHub repository with the following:
Required
Model URL: link to the HuggingFace model repository.
PHI F1: token-level F1 on the crossmodal test split.
Precision / Recall: on the same split.
Eval script: a script that anyone can run to reproduce your numbers. This is required; scores without a script will not be added.
Optional
Risk MAE: if your model outputs a risk score.
Retok Acc: if your model outputs a retokenization signal.
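A reproducible eval script of the kind required above could be structured as the following skeleton: it reads gold and predicted spans from two JSONL files and prints token-level scores as JSON. The field names (record_id, spans, start, end, type) are assumptions for illustration and would need to be adapted to the benchmark's actual schema.

```python
import json
import sys

def load_spans(path):
    """Collect (record_id, token_index, type) tuples from a JSONL file."""
    spans = set()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            for s in rec["spans"]:
                for tok in range(s["start"], s["end"]):
                    spans.add((rec["record_id"], tok, s["type"]))
    return spans

def evaluate(gold, pred):
    """Token-level PHI F1, precision, and recall over labeled-token sets."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"phi_f1": round(f1, 4),
            "precision": round(precision, 4),
            "recall": round(recall, 4)}

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python eval.py gold.jsonl predictions.jsonl
    print(json.dumps(evaluate(load_spans(sys.argv[1]), load_spans(sys.argv[2]))))
```

Keeping the whole pipeline in one file with no hidden state is what makes the numbers reproducible: reviewers can rerun it against the public test split and compare the printed JSON directly to the submitted scores.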
Models trained on any portion of the benchmark test splits must disclose this. Results will be marked accordingly.