Data is full of bias

I stumbled on an article about how the job market will demand data skills because of A.I. A big part of the argument is that A.I. relies on good, quality data, which will require people who can ensure that the systems are fed it. Specifically, this caught my eye:

“Data underpins all of the large language models. But data is full of bias,” said Andy Cotgreave, senior data evangelist at Tableau. “Understanding how the quality of the data going into the large language model will impact the quality of the output is absolutely vital. The social aspect of data collection, labelling, querying and presentation are fundamental things we talk about at Tableau as skills people will need.”

Equipping students with data skills to navigate an AI-driven world. Times Higher Education.

In my experience, we often have “good enough” data, which is messy. Messy as in incomplete, duplicative, inconsistent, and aged. A record may be a snapshot from years ago, and the person’s context may have changed since then. It might not get matched up to the right person. So much of it is entered by hand by a person who may take shortcuts. We only clean it up long after the fact, and changing that requires cultural shifts, which are harder to achieve the larger the enterprise. Underground shadow systems pop up to maintain the old way.

For instance, for as long as I have worked at my current place, up until last year, my “time of service” was wrong. I moved from a child organization to the main one before it was all one HR system, so my HR record was passed along and someone keyed it into the main office HR system. They botched it so badly I couldn’t figure out what they were using to calculate how long I’d been working. Once my HR person looked into it, she found the issue and got it corrected. But how many people in my situation would bother?

I suspect it depends on the perceived impact. We did “time of service” plaques once in all my time here. (Mine was wrong, so I wasn’t too keen on being honored and displaying a wrong one.) Many data mistakes are “good enough” for the original purpose. Accuracy matters less when a single individual is making small decisions. The higher the level of the decision maker, the better the data needs to be.

Around 2011, the really hot buzzword was data analytics, and then Big Data. College kids needed to learn it because all the tech and finance companies were snapping up data scientists for big pay. Now, with AI, the message is that everyone needs to be proficient. The need to know it will permeate many jobs.

I dunno. We keep hearing that the same stuff is the future, and each new thing points back to the same data skills. However, it doesn’t feel like we have reached the predicted amount of this work. If we don’t get there, the results will be bad. But I lack faith we will.

