Report Anonymization: Date formats and whitespaces

Just when you think you’ve seen every date format out there, random sampling of reports reveals even more, weirder and more confusing formats! Some example formats to consider when writing regular expressions for anonymizing reports:

Full date (year, month & day)

  • May 21, 2020 
  • 21 May 2020 
  • 21-May-2020  
  • 21/May/2020 
  • 21 May 2020 
  • 5/11/16 
  • 5.11.16 
  • 5-11-2016 
  • 2016-05-11 
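
As a rough sketch, the full-date formats above could be covered by a single pattern along these lines. The names MONTH, FULL_DATE and redact_full_dates are made up for illustration and are not taken from the actual anonymization code; the "[DATE]" placeholder is likewise an assumption.

```python
import re

# Month names and common abbreviations (illustrative, English only).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

FULL_DATE = re.compile(
    r"\b(?:"
    + MONTH + r"\s+\d{1,2},?\s+\d{4}"                  # May 21, 2020
    + r"|\d{1,2}[\s/-]" + MONTH + r"[\s/-]\d{4}"       # 21 May 2020, 21-May-2020, 21/May/2020
    + r"|\d{4}-\d{1,2}-\d{1,2}"                        # 2016-05-11
    # Numeric dates: the same separator must be used twice (5/11/16, 5.11.16, 5-11-2016).
    + r"|\d{1,2}(?P<sep>[/.-])\d{1,2}(?P=sep)\d{2}(?:\d{2})?"
    + r")\b",
    re.IGNORECASE,
)


def redact_full_dates(text, placeholder="[DATE]"):
    """Replace every full date found in text with a placeholder."""
    return FULL_DATE.sub(placeholder, text)


print(redact_full_dates("Seen 21-May-2020 and 5/11/16."))
# Seen [DATE] and [DATE].
```

For redaction it does not matter whether 5/11/16 means May 11 or November 5, so the sketch makes no attempt to interpret the numbers.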

Partial date

  • May 2020  
  • May, 2020 
  • May (just a month) 
  • 2020 (just a year) 
  • May 21
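
Partial dates are harder because a bare "May" or "2020" is ambiguous on its own. The sketch below is one possible approach, not the one used in the actual code: the names MONTH and PARTIAL_DATE are hypothetical, matching is case-sensitive so that the verb "may" is left alone, and bare years are limited to 19xx and 20xx as an assumption.

```python
import re

# Month names and common abbreviations (illustrative, English only).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

PARTIAL_DATE = re.compile(
    r"\b(?:"
    # A month, optionally followed by a year (May 2020 / May, 2020) or a day (May 21).
    + MONTH + r"(?:,?\s+\d{4}|\s+\d{1,2})?"
    # A bare four-digit year (assumed to start with 19 or 20).
    + r"|(?:19|20)\d{2}"
    + r")\b"
)

print(PARTIAL_DATE.sub("[DATE]", "Admitted May 21, discharged June 2020"))
# Admitted [DATE], discharged [DATE]
```

A pattern like this should run only after the full-date patterns, otherwise it would consume just the "May 21" part of "May 21, 2020".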

Time

  • 7:21pm 
  • 7:21 pm 
  • 7:21 p.m. 
  • 7:21 PM 
  • 19:21 hrs 
  • 1921 hrs
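
The time formats above can be sketched in the same way. TIME is a hypothetical name, and the pattern only covers the six variants listed here; a bare 24-hour time without "hrs" is deliberately left out.

```python
import re

TIME = re.compile(
    # 12-hour clock: 7:21pm, 7:21 pm, 7:21 PM, 7:21 p.m.
    r"\b\d{1,2}:\d{2}\s?(?:[ap]m\b|[ap]\.m\.)"
    # 24-hour clock with an "hrs" suffix: 19:21 hrs, 1921 hrs
    r"|\b\d{1,2}:?\d{2}\s?hrs\b",
    re.IGNORECASE,
)

print(TIME.sub("[TIME]", "Seen at 7:21 p.m. and 1921 hrs"))
# Seen at [TIME] and [TIME]
```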

But wait, there’s more! Sometimes the spaces you see are not really spaces – read the next section for more details. 

When a whitespace is not really a whitespace! 

While writing regular expression patterns to match all the mixes of date formats above, I noticed that sometimes a date looked like it should match, but it didn’t. After a lot of head scratching, I discovered that one of the reporting systems was using the 0xa0 character (a non-breaking space) for white space, instead of the usual 0x20. There’s a subtle difference – 0x20 indicates the line can be broken there for wrapping purposes, whereas 0xa0 means the line must not be broken at that space.

For example, a date such as February 4, 2021 might be split like so if it uses regular spaces and the line needs to wrap:

February 4, 
2021 

But the “special” white space, 0xa0, prevents that from happening and ensures the date always stays on the same line, i.e.

February 4, 2021 
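
Because the two kinds of space look identical on screen, it helps to make them visible. The following is a minimal sketch of one way to do that; describe_whitespace is a hypothetical helper, not part of the anonymization code.

```python
import unicodedata


def describe_whitespace(text):
    """List (index, hex code point, Unicode name) for every whitespace
    character in text that is not a plain 0x20 space."""
    return [
        (i, hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch != " "
    ]


sample = "February\u00a04,\u00a02021"  # looks like "February 4, 2021"
print(describe_whitespace(sample))
# [(8, '0xa0', 'NO-BREAK SPACE'), (11, '0xa0', 'NO-BREAK SPACE')]
```

Printing repr(sample) is an even quicker check, since it shows the escaped character rather than something that looks like a space.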

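Unfortunately, \s (which detects whitespace) does not always match the 0xa0 space in Python 3’s regular expression implementation: with a bytes pattern, or with the re.ASCII flag, it matches only the ASCII whitespace characters, although it does match 0xa0 in an ordinary str pattern. In other words, the non-breaking space broke my regular expressions – get it? Ha ha! That is why you’ll notice my report anonymization code uses \W (i.e. non-word character) to match both types of spaces, the normal 0x20 space and the 0xa0 non-breaking space. Keep in mind that \W also matches punctuation, so it is more permissive than \s.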
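
The difference is easy to check with a short experiment. This is only a sketch: the variable names are illustrative, and the latin-1 encoding is an assumption about how a report with a raw 0xa0 byte might arrive.

```python
import re

nbsp_date = "February\u00a04,\u00a02021"   # both gaps are 0xa0, not 0x20
raw_bytes = nbsp_date.encode("latin-1")     # assumed encoding of the report

# str pattern: \s is Unicode-aware in Python 3 and matches 0xa0.
unicode_hit = re.search(r"February\s4", nbsp_date)

# re.ASCII restricts \s to [ \t\n\r\f\v], so 0xa0 is no longer matched.
ascii_hit = re.search(r"February\s4", nbsp_date, re.ASCII)

# bytes pattern: \s is ASCII-only, so the raw 0xa0 byte is not matched either.
bytes_hit = re.search(rb"February\s4", raw_bytes)

# \W (non-word character) matches 0xa0 in both cases ...
nonword_hit = re.search(rb"February\W4", raw_bytes)
# ... but it also accepts punctuation such as a hyphen.
hyphen_hit = re.search(r"February\W4", "February-4")

# A stricter alternative: list exactly the two spaces that are expected.
explicit_hit = re.search(r"February[ \xa0]4", nbsp_date)

print(bool(unicode_hit), bool(ascii_hit), bool(bytes_hit))
# True False False
print(bool(nonword_hit), bool(hyphen_hit), bool(explicit_hit))
# True True True
```

The explicit character class is the narrowest option, while \W trades precision for tolerance of whatever separator a reporting system decides to use.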
