RTF is a relatively popular format for storing formatted reports natively in a system such as a RIS or PACS. However, RTF scores very poorly on interoperability. Although Microsoft created the format, support for RTF’s individual features varies even between Microsoft’s own products, let alone third-party ones (e.g. OpenOffice/LibreOffice). For example, when comparing Microsoft Word 2013 with WordPad (the default RTF editor that ships with Windows), we noticed the following discrepancies right away:
- WordPad cannot add tables to an RTF document. However, if the tables are added in Word, WordPad can display them.
- Word allows you to edit the headers and footers of a document, but WordPad fails completely to display any of that information.
Additionally, text analysis (e.g. NLP) cannot run directly on RTF markup, so reports must be converted to plain text beforehand. No library or tool performs this conversion with 100% accuracy (taking Microsoft Word as the “gold” standard). For the purpose of our project, we evaluated:
- Striprtf (Python library)
- UnRTF (Linux command-line tool)
Both produced decent results initially. UnRTF slightly edged out striprtf on elements such as lists and tables, but our reports were pure text and contained no such elements, so we went with striprtf: it is a native Python library, and we were already using Python scripts for other anonymization-related tasks (more on that in a future post).
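To give a feel for what the two options look like from a Python script, here is a minimal sketch rather than our actual pipeline code. It assumes striprtf is installed (`pip install striprtf`) and that the `unrtf` binary is on the PATH. The function names, the `tidy` clean-up step, and the choice of `cp1252` for reading the file are illustrative assumptions.

```python
import re
import subprocess
from pathlib import Path


def striprtf_to_text(rtf_path):
    """Convert an RTF file to plain text with the striprtf library."""
    # Imported inside the function so the rest of this module can be used
    # on machines where striprtf is not installed.
    from striprtf.striprtf import rtf_to_text

    # Assumption: reports are readable as cp1252; adjust for your data.
    raw = Path(rtf_path).read_text(encoding="cp1252", errors="replace")
    return rtf_to_text(raw)


def unrtf_to_text(rtf_path):
    """Convert an RTF file to plain text by calling the unrtf CLI."""
    result = subprocess.run(
        ["unrtf", "--text", str(rtf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if unrtf exits with an error
    )
    return result.stdout


def tidy(text):
    """Strip trailing whitespace and collapse runs of blank lines.

    Hypothetical normalization step so that outputs from different
    converters can be compared more fairly.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()
```

Calling `tidy(striprtf_to_text("report.rtf"))` and `tidy(unrtf_to_text("report.rtf"))` on the same file (the file name is a placeholder) and diffing the results is one simple way to spot where the converters disagree.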
Another alternative, which we did not test, is to use OpenOffice/LibreOffice to do the conversion. It is generally considered the best option for an automated pipeline. However, we did not feel the need for it in the context of this project, since the RTF samples we were processing did not contain any complex formatting.
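For reference, a LibreOffice-based conversion is usually driven through its headless command-line mode. The sketch below was not part of our evaluation. It assumes the `soffice` binary is on the PATH, and the helper names and the 120-second timeout are illustrative choices.

```python
import subprocess
from pathlib import Path


def build_convert_command(rtf_path, out_dir, binary="soffice"):
    """Build the headless LibreOffice command that converts RTF to text."""
    return [
        binary,
        "--headless",                # run without a GUI
        "--convert-to", "txt:Text",  # target extension and export filter
        "--outdir", str(out_dir),
        str(rtf_path),
    ]


def convert_with_libreoffice(rtf_path, out_dir):
    """Run the conversion and return the path of the resulting .txt file."""
    subprocess.run(
        build_convert_command(rtf_path, out_dir),
        check=True,   # raise if LibreOffice exits with an error
        timeout=120,  # assumption: generous upper bound for one report
    )
    # LibreOffice writes <original stem>.txt into the output directory.
    return Path(out_dir) / (Path(rtf_path).stem + ".txt")
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact invocation when a conversion fails in a batch run.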
Below is an example PDF printout of the RTF document we used to evaluate the different tools mentioned above.