One reason your machine learning project will fail before it starts

Have you ever shown up to a meeting where everyone has a different answer to the same question? Imagine a marketing manager, let’s call her Mary, who arrives convinced that her new e-marketing campaign will generate a 10% improvement in the conversion rate, while her counterpart Suzie from the sales team sees little to no improvement in the same data. How could this be?

All too often, the key metrics used to make essential business decisions rest on shaky foundations. At the heart of Mary and Suzie’s debate are different data sources. They each had the data team run the numbers but came up with different results: one team calculated conversions from Google Analytics data, while the other used a third-party vendor such as Kapost. What are they to do?

Similarly, a Data Scientist could spend weeks or months developing an amazing black-box system to predict the company’s future sales, yet all to no avail if the underlying data was inaccurate in the first place. If erroneous data is fed into a machine learning algorithm, even the best Artificial Intelligence becomes artificial stupidity. Without a consistent dataset and a single source of truth, your project is sure to fail before it even starts. As is commonly said, “garbage in equals garbage out.”

The solution is a Data Governance Policy to ensure a single source of truth for every data point from customer information to business processes to inventory management.

What is a Data Governance Policy?

Data quality can mean the difference between trash and gold. A data governance policy is a set of processes and rules that governs how information is entered, modified and accessed within an organization. On a recent software project, a multibillion-dollar company reviewing the results remarked that one of the key differentiators was the quality of the dataset. The goal of data governance is to improve the quality of your data and build trust in what is being reported. Pixentia lists 6 reasons why you need a governance policy.

A Data Governance Policy helps you manage your data

In their article on why Master Data Management needs data governance, Infogix highlights how having all your data in one place is not the same as a true data governance policy.

A good data governance policy answers the following questions.

  1. When data enters the organization, where is it coming from and what quality checks are performed?
  2. What checks verify the accuracy of transformations? What if some of the aggregations or concatenations are being done incorrectly?
  3. Once saved, which systems and users have access to the data?
  4. How long is data kept? What happens when it is “deleted”?
  5. Who is responsible for each data set? Who knows the meaning of each data point?
  6. How does each data point relate to every other data point? Is the “same” data being captured in multiple places?

Data experts often cover most of these components; however, the last point is commonly overlooked. Once all this data is stored in a single place, how do you make sense of it all?
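The first two questions lend themselves to automated checks at ingestion time. Here is a minimal sketch in Python, assuming incoming records are plain dicts; the field names and rules are hypothetical, not a prescribed schema:

```python
# Minimal sketch of ingestion-time quality checks. Field names
# ("source", "total", "line_items") are illustrative assumptions.

def check_record(record):
    """Return a list of quality issues found in one incoming record."""
    issues = []
    # Question 1: does the record identify where it came from?
    if not record.get("source"):
        issues.append("missing source system")
    # Question 2: verify a simple aggregation -- the stored total
    # should equal the sum of its line items.
    items = record.get("line_items", [])
    if record.get("total") != sum(i["amount"] for i in items):
        issues.append("total does not match sum of line items")
    return issues

record = {"source": "google_analytics",
          "total": 30,
          "line_items": [{"amount": 10}, {"amount": 20}]}
print(check_record(record))  # -> []
```

Rejecting or flagging records at the door, rather than after they have polluted downstream reports, is what keeps Mary and Suzie arguing about strategy instead of arithmetic.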

Data Dictionary

Many people are involved in the creation, implementation and management of software which collects and processes data. For example, in an inventory management system, the warehouse operator knows the information needed to keep track of the inventory, but likely doesn’t know how those values are stored in the database. The name stored in the database is usually different from the name presented to the end user, because software usually doesn’t like spaces or capitalization but people do. Therefore, “First Name” often becomes “first_name” or the more obscure “fname.”
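One way to stop this knowledge from living only in people’s heads is to capture the label-to-column mapping explicitly. A tiny sketch, with made-up names:

```python
# Hypothetical mapping from user-facing labels to stored column names.
# Real systems vary; these names are illustrative only.
DISPLAY_TO_COLUMN = {
    "First Name": "first_name",   # or the more obscure "fname"
    "Last Name": "last_name",
    "Email Address": "email",
}

def column_for(label):
    """Look up the database column behind a user-facing label."""
    return DISPLAY_TO_COLUMN[label]

print(column_for("First Name"))  # -> first_name
```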

When a Manager, Data Analyst or Data Scientist goes to review reports, they need to chase down a chain of people to discover what each value actually represents. Oftentimes this process is fraught with mistakes and assumptions, because the people who originally created the data left the company long ago and its meaning is simply passed down from one person to the next. This is another reason Data Science initiatives often fail: a large portion of the time is spent just trying to make sense of the data and finding the appropriate domain experts.

To alleviate this problem, a concise and explicit data dictionary is needed along with the appropriate owners for each item.

A data dictionary is a concise, detailed definition of each data point stored in every database, along with any relevant relationships between the data. It should also include how and why a value is calculated in a specific way (if applicable) and, most importantly, who the owner is for that data. One person should be responsible for data quality and another should be the domain expert.
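One possible shape for a dictionary entry, sketched as a Python dataclass; the field names are assumptions for illustration, not a standard schema:

```python
# Sketch of one data dictionary entry. Field names and the example
# values are illustrative assumptions, not a standard.
from dataclasses import dataclass
from typing import List

@dataclass
class DictionaryEntry:
    name: str             # stored column name, e.g. "conversion_rate"
    definition: str       # precise business meaning
    calculation: str      # how and why it is computed this way
    related_to: List[str] # relationships to other data points
    quality_owner: str    # person responsible for data quality
    domain_expert: str    # person who knows the business meaning

entry = DictionaryEntry(
    name="conversion_rate",
    definition="Share of sessions that end in a purchase",
    calculation="purchases / sessions, from the sales database only",
    related_to=["sessions", "purchases"],
    quality_owner="data-eng team",
    domain_expert="Mary (marketing)",
)
print(entry.name)  # -> conversion_rate
```

Because entries like this are plain text, they fit naturally into the version-controlled workflow described next.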

Ideally, all of this should be stored somewhere version controlled so that a record is maintained as updates are made. Most importantly, a process needs to be defined for any changes to systems which write data, to ensure the data dictionary is updated within the current workflow. It can’t be an afterthought or delegated to another team, otherwise it will be forgotten.

Data Ownership

People are the end consumers of data; therefore, a person needs to be assigned ownership of each data point and must understand how it is generated and maintained. AIM Consulting recommends the following roles when it comes to data management:

These can of course vary according to the size of your company, but each of these functions will need to be handled by someone within the team.

Pluralsight has an excellent process for dividing data (and software) ownership into entirely self-sufficient teams. Each team contains a Product Manager, Developers, UX/UI and any other needed roles. Each team is responsible for a set of data, and all other teams must go through that team’s services (usually via an API) to access it. This prevents unnecessary copies of the data from roaming around and ensures all data is eventually consistent.
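This ownership model can be sketched in miniature. The in-memory class below is a hypothetical stand-in for a real team-owned HTTP API; the point is that other teams read through the owning team’s interface rather than copying its tables:

```python
# Minimal sketch of team-owned data access. The "service" and its
# method are hypothetical stand-ins for a real API.

class InventoryService:
    """Owned by the inventory team; the only entry point to its data."""

    def __init__(self):
        self._stock = {"widget": 42}  # private to the owning team

    def get_stock(self, sku):
        # Other teams call this method (in practice, an API endpoint),
        # so there is one source of truth and no stray copies.
        return self._stock.get(sku, 0)

inventory = InventoryService()
print(inventory.get_stock("widget"))  # -> 42
```

Because consumers never touch `_stock` directly, the inventory team can change storage, add quality checks or rename columns without breaking anyone else.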

What’s the goal?

In a perfect world, a consumer of the data with the appropriate access levels has the ability to explore all data points at a high level and quickly find the information they are looking for. The Data Analyst and Business Intelligence (BI) Analyst could easily explore and look for trends and patterns within the data. For the Data Scientist, a single view can be easily created which contains all the necessary features to describe, prescribe or predict the target metrics. Anyone from the CEO, Director or Manager could create a clear dashboard with the Key Performance Indicators (KPIs) they need to drive their business, department or team. In a perfect world, this could all be done within hours or minutes, instead of weeks or days.

What’s the state of data within your company? Perhaps it’s time for a checkup to see how hard it is to find the data you need.