Recently, we embarked on an ambitious project to create an internal repository of images to be used for a multitude of future research projects, such as evaluation of AI models. The plan was to take all chest X-rays done from 2016 through 2020, translating to roughly 480,000 exams. We would capture the following data points:
- Patient demographics: birthdate, gender, class (inpatient, outpatient or emergency).
- Exam metadata: body part, procedure code and description, reason for exam.
- Images in DICOM format.
- Reports in plain-text format.
Each field containing PHI would be anonymized. This article focuses on the effort to anonymize the DICOM images; future posts will explore other anonymization challenges (e.g., the reports). On the face of it, anonymization of images appears rather straightforward, but once you dig below the surface, you find plenty of hidden challenges. Some of the prominent ones we faced were:
- The sheer number of images. Those 480k studies translated to nearly 1.4 million individual images, so automation and reliability of the anonymization pipeline were of the utmost importance. The images were to be anonymized in about 10 batches; kicking off each batch would be manual, but the rest would run autonomously, requiring no human intervention and minimal supervision. As we later found out, each of those batches took nearly 2 days (about 48 hours) of end-to-end processing time (retrieval, anonymization, storage of the anonymized instance), so about 3 weeks for all 480k studies (nearly 5 TB worth of data).
- Preserving the longitudinal relationships between studies belonging to the same patient, i.e., the patient jacket. Some of the analysis slated to take place later on was sensitive to patients having undergone more than one imaging exam in the period of those five years, and to the chronological order of those studies.
- Little room for surprises and no chance of a “redo” – there was not a lot of disk space or CPU power to spare, not to mention time. We were able to experiment with small batches of a thousand or so images, but once we started in earnest, there was no room to turn back and make additional changes.
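The longitudinal requirement above is worth making concrete: every study from the same patient must resolve to the same anonymized ID. One common approach (a minimal sketch, assuming a secret salt kept outside the data; `anonymized_patient_id` is a hypothetical helper, not part of any tool discussed below) is a salted one-way hash of the MRN:

```python
import hashlib

def anonymized_patient_id(original_mrn: str, secret_salt: str) -> str:
    """Derive a stable, non-reversible patient ID so that all studies
    belonging to one patient share the same anonymized identity."""
    digest = hashlib.sha256((secret_salt + original_mrn).encode("utf-8")).hexdigest()
    # Keep a short, obviously-synthetic prefix plus part of the digest.
    return "ANON" + digest[:12].upper()

# The same MRN always yields the same anonymized ID, so the jacket survives:
assert anonymized_patient_id("12345", "s3cret") == anonymized_patient_id("12345", "s3cret")
```

A real pipeline would also need to shift dates consistently per patient so that the chronological order of a patient's studies survives anonymization.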
Our next step was to find the right tool for anonymizing the DICOM images. Luckily, there are plenty of such tools, many of them FOSS. Beyond the requirements above, there were also features we knew we did NOT need, such as:
- Blacking out portions of images: none of our images had PHI burned in (something you typically see with ultrasound).
- No explicit requirements as to what the anonymized data should look like (e.g., formatting of patient IDs).
- No strong dependency on, or preference for, a specific programming language or platform.
One last remark before we get into the details of each tool. Please see the DICOM standard’s supplement on anonymization here: https://www.dicomstandard.org/News-dir/ftsup/docs/sups/sup142.pdf – it is an excellent starting point and a reference to keep handy for a wide range of anonymization projects.
CTP:
- Can anonymize longitudinally (keep the patient jacket).
- Very powerful tool with lots of configuration options and capabilities.
- Can provide ID mapping very conveniently.
- Can run as a service on a headless server.
- True pipeline: execution continues while there are jobs, and new jobs can arrive mid-execution, too.
- Ships with an extremely comprehensive DICOM anonymization script that includes many more tags than we could think of, including private attributes (proprietary to different vendors). A huge plus!
- Supports DICOM C-STORE as an export option.
- The UI is not great, a little quirky to use.
- Lacks a query/retrieve option to load studies, but does support acting as a DICOM SCP to receive C-STORE requests from other systems.
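To give a flavor of that shipped script: CTP's anonymizer is configured through an XML script whose elements name a DICOM tag and the function to apply to it. The fragment below is illustrative of the style only (treat the exact tags, function calls, and parameter values as an approximation, not a verbatim excerpt of the stock script):

```xml
<script>
  <!-- Parameters referenced by the rules below; values here are made up. -->
  <p t="UIDROOT">1.2.840.99999</p>
  <p t="SITEID">SITE01</p>
  <!-- Replace the patient ID with a salted hash, preserving the jacket. -->
  <e en="T" t="00100020" n="PatientID">@hashptid(@SITEID,this)</e>
  <!-- Blank the patient name entirely. -->
  <e en="T" t="00100010" n="PatientName">@empty()</e>
  <!-- Re-map UIDs consistently via a hash of the original UID. -->
  <e en="T" t="0020000D" n="StudyInstanceUID">@hashuid(@UIDROOT,this)</e>
</script>
```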
Pydicom:
- 100% control over the anonymization process.
- Lots of grunt work, i.e., code must be written to handle the anonymization of every individual tag.
- We would have to do a LOT of testing with image samples from different scanner vendors to cover the wide variety of tags, especially private tags.
Note: Other libraries (e.g. dcm4che) could have been an equivalent here, but python was preferred because this project had a goal from day one to share the outcome with the community, and most people tend to think of python as more approachable than Java. NullPointerException, anyone?!?
- Good blend between control and ease of use (conciseness of code); similar to CTP in some regards.
- Still requires snippets of code to be written to:
- Handle our own logic (e.g. anonymize longitudinally to keep the patient jacket)
- Create a pipeline to process batches of images
- Lacks comprehensive anonymization mapping/script out of the box (vs. CTP)
- Very polished and capable tool
- Seamless end-to-end anonymization workflow, even includes query/retrieve and C-STORE support.
- Not flexible enough for our needs here. It might have been workable to use it as a library (vs. using the GUI), but we could not find the source code, despite it being open source.
- Geared towards smaller batches of easily discoverable studies; our constraints here were quite a bit more complex.
Note: This tool would probably be my top recommendation for an end user who is not very DICOM-savvy and needs to anonymize a smaller research project.
- Comprehensive platform for imaging informatics research.
- Overkill for what we need here; XNAT is meant more as a self-serve tool for end users at institutions where multiple projects run in parallel.
Note: Institutions that are very active in research would probably benefit from having a central XNAT instance to manage all their research and anonymization needs.
- Effortless to run/install in a Linux environment thanks to the Docker image.
- Does not allow any configuration of the anonymization process via the UI. The REST API allows customization, but it is not ideal (too much work) and still “underpowered” compared to the other solutions here.
Note: Anonymization aside, Orthanc is my personal favourite mini PACS/VNA implementation.
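For completeness, here is roughly what driving Orthanc's anonymization over REST looks like: the `/studies/{id}/anonymize` endpoint accepts a JSON body with `Replace`, `Keep`, and `KeepPrivateTags` keys. The helper below only builds the request (the server URL and study ID are placeholders; actually sending it requires a running Orthanc instance):

```python
import json
import urllib.request

def build_anonymize_request(orthanc_url, study_id, anon_patient_id):
    """Build the POST request for Orthanc's /studies/{id}/anonymize endpoint."""
    body = {
        "Replace": {"PatientID": anon_patient_id, "PatientName": anon_patient_id},
        "Keep": ["StudyDescription", "SeriesDescription"],
        "KeepPrivateTags": False,
    }
    return urllib.request.Request(
        f"{orthanc_url}/studies/{study_id}/anonymize",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending the request (requires a running Orthanc, hence commented out):
# with urllib.request.urlopen(build_anonymize_request("http://localhost:8042", study_id, "ANON001")) as r:
#     print(json.load(r))  # Orthanc responds with the ID of the new anonymized study
```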
CTP checked pretty much every box for us. Pydicom/pied was a distant runner-up.