Eccentric Flower:201012/Unfortunate Data


Unfortunate Data

This morning I sat down to do some easy data-conversion work which was the last major responsibility I had on the Project From Hell until the beginning of the new year. It was a formality, really, a strict "make a query, output as delimited text, drop that into Excel so people who think Oracle is intimidating can use it." Minutes of work. No problem.

Now, as I begin to write this, it's 4:30 and I'm just finishing up. This was messy enough that I've forgotten what I was vaguely thinking about ranting about at 10:30 this morning - it's completely vanished from my head. So I'm going to rant for a minute or two about data instead. But don't worry! I'll keep it non-technical. For it is a problem that concerns us all.

You see, the data problem I'm about to gripe about is not a technological issue. It is not a programming thing or a database thing or any other kind of nerd thing. It is a human-behavior thing. And basically, what it comes down to is, you all suck.

(I'm exempting myself from the generalization - saying "you" there instead of "we" - not because of any sense of moral superiority, but because of taught habits. This will become clear in a moment.)

Humans are extremely bad at structuring their data input consistently. What that means in English is, if you give someone a web form (or a survey or what-have-you) which has a place for a phone number, and you give no particular instructions about how that phone number is to be provided, you are going to get all sorts of divergent formats.

Some people will write a phone number with area code in the format 000-000-0000. Others will write (000) 000-0000. Some will leave out the space (and yes, that matters): (000)000-0000. Some people (mostly Europeans) like dots: 000.000.0000. Some don't want to put any punctuation in at all: 0000000000. I myself, if I had my preference, would just leave spaces: 000 000 0000. Some folks won't put on an area code unless specifically asked to; and when you're dealing with overseas calls there is a whole host of complications having to do with country and city prefixes and whether you should put the 011 international dialout code in front for the Americans who don't know how that works because they've never had to call another nation.

To a computer all of these things are different. To you they all look like phone numbers, and while you can't supply outright missing information, you can look at 0000000000 in a phone number field and have no trouble interpreting what it means. This is because you can do intrinsic pattern recognition. Computers cannot. They have to be told how to do everything. So, yeah, you can train a computer to process all those different ways of writing a phone number - but for every variant form you add, you also add a few more lines of code, and eventually you end up with a whole routine just for parsing out phone numbers, another for email addresses, another for paper addresses, et cetera.
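For the curious, here is roughly what the start of such a phone-number routine looks like. This is a minimal Python sketch, not anyone's production code: the names `normalize_phone` and `format_phone` are made up for illustration, and it assumes ten-digit North American numbers with an optional leading 1. Overseas numbers are deliberately out of scope.

```python
import re

def normalize_phone(raw):
    """Reduce a North American phone number to ten bare digits, or return None.

    Hypothetical helper: assumes ten-digit numbers with an optional
    leading 1. It makes no attempt at international formats.
    """
    digits = re.sub(r"\D", "", raw)   # drop spaces, dots, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]           # strip the leading country code
    if len(digits) != 10:
        return None                   # missing area code, or not a phone number
    return digits

def format_phone(digits):
    """Render ten bare digits in one consistent display format."""
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

for sample in ["000-000-0000", "(000)000-0000", "000.000.0000", "000 000 0000"]:
    print(sample, "->", normalize_phone(sample))
```

Every variant in the loop collapses to the same ten digits. A seven-digit number with no area code comes back as None, because no amount of parsing can supply information that was never entered.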

Now, there are those who say, "well, too bad for the computers, but I Am Not A Number I Am A Free Man and I'll not be hounded around by these standardized forms." Oh yes, there are. In the seventies there was a backlash against zip codes. Go look it up. People complained that they should not be required to memorize an arbitrary number just to improve delivery of their mail, that they were remembering too many such identifiers in their lives already by then and it was dehumanizing. Of course, today, as we all struggle to remember ten thousand site passwords and logons, we look back at their petty problems and laugh. They had no idea!

Others deliberately rebel - they try to put data in formats that will give people fidgets just so it makes their paperwork hard to process. I understand what motivates these people - personally, I feel that as a general rule, the fewer electronic records that are kept on me anywhere, the better - but to a large extent that ship has sailed, and let me tell you a key piece of insider information from one of the gatekeepers: Making your form hard to process does not get you personal, customized attention. It puts your record in the "to be attended to" queue and it waits until the first day the data specialist has spare time to deal with it - which could be quite a while. And meanwhile, you do not exist. It's no skin off our noses if you can't get your termbill or a password for our plotter queues or a university ID. Your own damned fault.

Basically I'm on the fence. I'm lazy too, and I'm always short on time too, and I don't really like being pushed around by forms either. But I'm also a data maintainer and a data converter, sometimes (I wear many hats), and from that side of the fence, you people make us fucking insane.

Some things are beyond your control, of course. Who made a system where addresses sometimes have one line and sometimes two and sometimes even three, which there's never any space for in the table design, so you have to figure out whether to cram the third line into the other two somehow or just omit it? Who designed addresses to have weird characters that sometimes break poorly-designed data parsers? Who decided that humans would often have non-ASCII characters in their names?

But many things are human problems. Putting extra carriage returns at the ends of things, for example. I spent an hour today finding and correcting some thirty or forty rows of data (in a 70,000 row file, just finding can take a while) where someone, when originally entering the data, had casually pressed the Enter key after typing whatever went in that field - an email address, say. Well, the problem with that is, when you export data as text, you know what the most common way to indicate "this is the end of this record and the beginning of the next one" is? A carriage return. So if one crops up in the middle of a record somewhere in a place you haven't anticipated, what the computer sees is not one record with a carriage return in the middle, but two short malformed records.
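Here is a small Python sketch of that carriage-return problem, for anyone who wants to see it. It is not what I actually ran today; the function names and the sample rows are invented for illustration, and it assumes the rows are already in memory as lists of strings. It finds the rows with a stray line break in a field, then scrubs the breaks before writing tab-delimited text.

```python
import csv
import io
import re

def rows_with_embedded_newlines(rows):
    """Return the indexes of rows where any field contains a CR or LF."""
    return [i for i, row in enumerate(rows)
            if any("\r" in field or "\n" in field for field in row)]

def clean_field(value):
    """Replace stray line breaks inside a single-line field with a space."""
    return re.sub(r"[\r\n]+", " ", value).strip()

def export_rows(rows):
    """Write rows as tab-delimited text, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([clean_field(field) for field in row])
    return buf.getvalue()

# Invented sample data: the first email address ends with a stray Enter.
rows = [["Ann", "ann@example.com\n"], ["Bob", "bob@example.com"]]
print(rows_with_embedded_newlines(rows))
print(export_rows(rows), end="")
```

Without the cleanup step, the first record would either be split across two lines or wrapped in quotes that the receiving program may or may not understand, depending on the export tool.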

I've had people paste in long blocks of text from Word. Now, I can't blame them too hard - if you are forced to use Word, and your document is in Word, you don't want to have to retype the whole thing again just to get it into the database or web page. The problem is, as all programmers know (but virtually no one else in the world does), Word is Evil Incarnate. All those pretty "smart quotes" and real em-dashes and other things that Word automatically does to your document unless you know the fifty things you have to unset to forcibly turn them off? Well, databases and other structured-data things hate those characters and don't know what to do with them. So would web pages, in a just universe, except a few years ago most web browser makers threw up their hands in disgust and quietly taught their browsers how to display funky Word characters properly - which was the wrong answer.

(The correct answer would have been to teach their browsers to detect them and put up a message that says, "I'm sorry, this is in Word, which does not play by the rules; we cannot and will not show this page and you should complain to the moron who thought it was suitable to paste Word material into an HTML page in the first place." This would have reinforced the correct set of behaviors - don't use Word for HTML or plain-text content! - instead of the wrong ones - "oh, we can be sloppy whenever we like and so can Microsoft." Ahem. Sorry.)
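On the database side, the usual defensive move is to swap the Word-isms for plain equivalents before the text gets anywhere near a table. A minimal Python sketch, assuming the pasted text has already been decoded to Unicode; the table `WORD_CHARS` and the function `plain_text` are names I made up, and the list of characters is a starting point rather than a complete inventory.

```python
# Typographic characters Word likes to insert, mapped to plain ASCII.
# Illustrative, not exhaustive.
WORD_CHARS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

_TABLE = str.maketrans(WORD_CHARS)

def plain_text(text):
    """Replace Word-style typographic characters with ASCII equivalents."""
    return text.translate(_TABLE)

print(plain_text("\u201cSmart\u201d quotes \u2014 aren\u2019t they fun\u2026"))
```

Note that this only touches punctuation. Accented letters in people's names pass through untouched, which is what you want: those are real data, not Word decoration.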

With some of our data-mangling habits, being very strict can help. If you put (NNN) NNN-NNNN over your phone number field on your web form, you do increase the chances that everyone will put it in the same way. With dates, it is absolutely inexcusable not to show exactly how many digits you want and in what order (e.g. MM/DD/YYYY) - or better yet, make it three fixed-selection fields (pick a day from a set list of choices, pick a month from a set list, pick a year from a set list) whenever possible.

But even these are not perfect solutions, because (get this): People do not read directions. I don't know why this is. But they don't. I have put things on forms in GIGANTIC RED TYPE and had them be constantly ignored.

Rejecting data that isn't in the format you want is also a good trick (because the rule is that it's easier to get it right at input time than correct the data later). The problem is, if you throw the form back ("Sorry, you didn't do this part right, do it again") and you get the kind of person who doesn't read directions, the same kind of person will get unreasonably annoyed that they're being fussed at because they didn't bother to do it right in the first place. You may think that's a trivial issue, but businesses in particular are understandably loath to annoy customers. I've had someone tell me to remove code that bounced improper data input in this way; he was more willing to have bad, possibly unusable data than take even the slightest risk of annoying a potential customer.
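If you do get permission to bounce bad input, the least you can do is say exactly what you wanted. A minimal Python sketch of a strict MM/DD/YYYY check; the function name `check_date` and the wording of the messages are my own invention, and it assumes the form hands you the date as a single text string.

```python
import re
from datetime import datetime

# Exactly two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

def check_date(text):
    """Return (date, None) for valid input, or (None, message) otherwise."""
    text = text.strip()
    if not DATE_PATTERN.fullmatch(text):
        return None, "Please enter the date as MM/DD/YYYY, for example 12/09/2010."
    try:
        return datetime.strptime(text, "%m/%d/%Y").date(), None
    except ValueError:
        # Right shape, impossible date (such as 02/30/2010).
        return None, f"{text} is not a real calendar date; please use MM/DD/YYYY."

for attempt in ["12/09/2010", "9 Dec 2010", "02/30/2010"]:
    print(attempt, "->", check_date(attempt))
```

The pattern check comes first because `strptime` on its own is more forgiving than the label on the form: it will happily accept 1/9/2010 without the leading zeroes.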

The good news is, data cleanup is one of the many things I do, and (knock wood) it means there will always be a job for me somewhere. Because all large data sets contain a lot of garbage. This is another great truth from an insider; take it to the bank. No matter what they may say, no matter what confidence they exude, all big sets of records contain some records that do not conform, some that are missing big hunks of data, some that are mangled in some way so as to be unreadable, et cetera. The best you can do is hope none of them are yours.

Unless, of course, you're the sort of person who sees that as desirable.

I was going to leave it at that (it's now 5:13 and I want to go home), but seeing Joy's comment two entries back about terms for gender reminded me of something I posted over at Dan Lyke's place a week or two ago. And yes, this does have to do with structured data.

See, I think that some data shouldn't be structured. Data really only needs to be structured when you need to do searches and sorts on it. And one of the pieces of data that I don't think should be structured is "gender." Now, putting aside the discussion - and believe me, there's lots of room for it - of whether the term "gender" is even meaningful at all, I really don't think someone is being unreasonable when they throw a fit because a form offers them only M or F as choices. (I personally favor U for "undecided.")

The perfect example would be not to ask for gender at all, but I suppose then all those godawful dating/mating/fornicating sites would have a fit when no one knew what token to search on to find one another. Really, though, even there "gender" is not a good fit for the data because, in that particular example, it doesn't go nearly far enough. What a dating site really needs is something like:

  1. I prefer to wear [suits/dresses/warmup clothes/jeans and t-shirts/spandex bodysuits/fill in your own examples]
  2. I wear facial makeup [always/often/occasionally/only under duress/never]
  3. I like to wear my hair [very long/long/medium/short/very short/removed]
  4. I find people attractive who look like [Harrison Ford/Veronica Lake/Gwyneth Paltrow/Tilda Swinton/Johnny Depp/Brucilla the Muscle/add your own/pick all that apply]


And so forth. It strikes me that situational examples are far more valuable there than some canned term which a substantial number of us would dispute the meaning of anyway.

(And let's not go into the plumbing. We can always add a question for that. I mean, sure, it's completely legitimate to say, "I can be totally attracted to the looks and personality of someone but their not having a penis is still a deal-breaker for me." I don't have a problem with that. But the point is, specifying some arbitrary gender code is not going to necessarily answer that question, so spell out what you want.)

But none of that is easily searchable or sortable, although I suppose you could come up with some kind of point scale ... anyway, that's the worst case. Institutions which are not dating/mating/hating/berating sites have even less need to keep gender in a structured way than that example. I mean, think about it. Why does your employer hold information on gender at all? The only legitimate reason I can think of for tracking employee gender at, say, a bank is for anti-discrimination purposes, and frankly that's a red herring.

(No, no, I'm not saying there's no such thing as gender discrimination, good god, no, have you known me long? But in places where there is gender discrimination, it is no secret from anyone, least of all the people denying fervently that it's there, nor is it hard to spot. You don't need to go run a count of M vs F records in the database to see where gender discrimination is happening; you need to take a walk around the building and see who is in which offices. Or talk to the employees, particularly the female ones. It'll be faster and more useful, too.)

Um, anyway, so, with gender, Joy, my answer is, call yourself whatever you want. (Actually, that's my answer with "womyn" too. If you want to use it, that's your business. It annoys me but that's strictly my own problem. Contrived neutral terms for gender annoy me too, and that is also my own problem.)

Personally my solution to the collective-pronoun issue is to use "she" and "her" as the default pronoun in neutral situations. (I don't always remember to, but I try to.) When someone objects - and someone has only dared object once; I even got away with writing an entire software manual like that, years ago - you then have a piece of valuable data about that person.


<< older | © 2010 columbina | newer >>


What drives me insane is when websites refuse to let me move forward with what I'm buying/registering for/etc because they don't like how I've entered my phone number, but decline to state the format they DO want it in.

-- 23:23, 9 December 2010 (GMT)


And one of the pieces of data that I don't think should be structured is "gender."

(can't think of a snappy segue so I'll just say DeviantArt fiasco:)

Maybe it's because of where I live, but I have the sense of evolving gender codes that implicitly signal information like what you describe in your survey sample. Call me a starry-eyed optimist, but the path from hit-and-miss signals to rough consensus to having-words-for-it to, eventually, putting it in a drop-down menu (and supplementing that drop-down with a textbox) on ye newe dating sites seems slow-but-inexorable.

-- 23:41, 9 December 2010 (GMT)


I am unduly pleased to note that I looked at this page on an iPad and saw that all of the phone number examples made from zeroes were highlighted as "add to address book" items. Apparently some developer spent some time teaching the device to recognize a phone number properly!

-- 23:56, 9 December 2010 (GMT)


By the by, while I don't say that DeviantArt made a GOOD decision (actually, it baffles me why they'd bother making that change in the first place), their stance is definite and unequivocal - pick M or F or leave. Dumb, but within their rights (just as their consumers have a right to vote with their feet.)

(I didn't notice this when it happened because my profile at DA is listed as "female," but I had heard about it before this; there was some understandable consternation in transgender circles.)

-- 00:24, 10 December 2010 (GMT)


Some website I filled out a form at recently specified the date as you suggest (MM/DD/YYYY) but then when you went to fill it in they didn't have the corresponding number of digits.

-- 04:55, 10 December 2010 (GMT)


i maintain our academic program's online application and deal with the collected data. the form is fairly straightforward, so i don't have too much of a problem with formats; phone numbers can be a jumble but you can figure it out later if you need to call the applicant.

getting apartment numbers in address line two is only slightly annoying, but putting 'N/A' in every single field you could have just left blank is really asking for it.

-- 16:11, 10 December 2010 (GMT)


I realize that the phone number thing is just a convenient example that comes to hand, and that other cases can be trickier, but as long as you bring it up... I do get annoyed at websites that reject a form if it doesn't abide by their version of the One True Format for phone numbers, because it properly ought to be a nonissue; processing the form ought to include stripping all non-numeric characters from the input -- brackets, dashes, dots, whatever -- and formatting the results according to the desired standard. (This won't help for people who leave the area code out, but that's actually a legitimate ground for rejecting the form for missing information.)

-- 17:52, 10 December 2010 (GMT)


Browsers being strict about rejecting Microsoft Word would be committing suicide, given that the browser with a 90% market penetration was made by Microsoft, who aren't very keen on standards they don't control.

I do feel sympathy for data cleaning. I must have spent more time dealing with data scrubbing than anything else when writing my own syslogd (and the spec for syslog, RFC 3164, basically says, "anything goes, but here's a format that is often followed"---gah!)

And speaking of phone numbers---apparently, in the telephony backend of things (where I'm now working on actual phone switches), they only accept pure digit strings, and any punctuation (like spaces, parentheses, dashes, dots, anything) will muck up the software like you wouldn't believe. It surprised me the first time I entered a phone number like (555) 555-1212, had it fail, and was told "of course it failed! You *have* to submit it as 5555551212. D'uh!"

Oh, and for web forms, it's still worse than you talked about:

-- 00:26, 11 December 2010 (GMT)

