Tuesday, 19 February 2013

Dirty data

We're having to process a lot of dirty data at work. We're spinning through records from a legacy system and parsing address data that has been stored in all sorts of different formats.

An awful lot of it doesn't have spaces where you'd expect them, so I'm left looking for solutions... I thought about abbreviating the strings using Intelligent String Abbreviation from the lovely JSPRO.COM:

function abbreviate(str, max, suffix){
  if((str = str.replace(/^\s+|\s+$/g, '').replace(/[\r\n]*\s*[\r\n]+/g, ' ').replace(/[ \t]+/g, ' ')).length <= max){
    return str;
  }
  var abbr = '',
  str = str.split(' '),
  suffix = (typeof suffix !== 'undefined' ? suffix : ' ...'),
  max = (max - suffix.length);
  for(var len = str.length, i = 0; i < len; i ++){
    if((abbr + str[i]).length < max){
      abbr += str[i] + ' ';
    }else { 
      break; 
    }
  }
  return abbr.replace(/[ ]$/g, '') + suffix;
}

But after thinking about it I clocked that it was simply a question of replacing "," with ", ", but only when the comma wasn't already followed by a space. I guess that HTML would be fine with more than one space, but it offended my sensibilities to indiscriminately replace commas with commas and spaces...

This, however, does the trick very nicely (Thanks Lars):

var str = str.replace(/[\,^\s]/g,", ");

UPDATE 1

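I've been using this for a while now and I clocked a problem: inside square brackets the ^ only negates when it comes first, so the pattern matches every comma, caret and whitespace character, and existing spaces get replaced with ", " too. I now use this instead: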

var str = str.replace(/,/g, ', ').replace(/ {2,}/g, ' ');

I think the second replace is redundant when the result ends up in HTML, as browsers collapse consecutive whitespace into a single space, but it's better to be safe than sorry ;-).
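To see the difference on a concrete string, here's a rough sketch comparing the first attempt with the two-step version. The sample address and the function names are made up for illustration, and the collapsing step is written here as / {2,}/g:

```javascript
// Hypothetical sample: one comma with a space after it, one without.
var sample = '1 High Street,Leeds, LS1 1AA';

// First attempt: the character class matches commas, carets AND whitespace,
// so every existing space is replaced with ', ' as well.
function firstAttempt(str) {
  return str.replace(/[\,^\s]/g, ', ');
}

// Two-step version: pad every comma, then collapse runs of spaces.
function padThenCollapse(str) {
  return str.replace(/,/g, ', ').replace(/ {2,}/g, ' ');
}

console.log(firstAttempt(sample));    // '1, High, Street, Leeds, , LS1, 1AA'
console.log(padThenCollapse(sample)); // '1 High Street, Leeds, LS1 1AA'
```

Note that the collapsing step also squashes runs of spaces that have nothing to do with commas, which may or may not be what you want.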

UPDATE 2

While the above works, it's not as elegant as I would like, so after a bit more research I've now got this:

var str = str.replace(/,\s*/g, ", ");

Much cleaner!
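
For anyone who'd rather leave already-spaced commas completely alone, which is how I first described the problem, a negative lookahead should do it. This is only a sketch: the function names and the sample string are made up for illustration.

```javascript
// A comma plus any whitespace after it becomes ', '.
function normaliseCommas(str) {
  return str.replace(/,\s*/g, ', ');
}

// Alternative: only touch commas NOT already followed by whitespace.
function padBareCommas(str) {
  return str.replace(/,(?!\s)/g, ', ');
}

var messy = '1 High Street,Leeds,   LS1 1AA';

console.log(normaliseCommas(messy)); // '1 High Street, Leeds, LS1 1AA'
console.log(padBareCommas(messy));   // '1 High Street, Leeds,   LS1 1AA'
```

The first version tidies up extra whitespace after a comma (including newlines, since \s matches those too), while the lookahead version leaves it as it found it. Both add a space after a comma at the very end of the string, so a trim() afterwards might be worth considering.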