One reason your machine learning project will fail before it starts
Have you ever shown up to a meeting where everyone has a different answer to the same question? Imagine a marketing manager, let’s call her Mary, who arrives convinced that her new e-marketing campaign has generated a 10% improvement in the conversion rate, while her counterpart Suzie from the sales team sees little to no improvement in the same data. How could this be?
All too often, the key metrics used to make essential business decisions rest on shaky foundations. At the heart of Mary and Suzie’s debate are different data sources. They each had the data team run the numbers, but the results disagreed: one team calculated conversion rates from Google Analytics data while the other used a third-party vendor such as Kapost. What are they to do?
Similarly, a Data Scientist could spend weeks or months developing an amazing black-box system to predict future sales for the company, all to no avail if the underlying data was not accurate in the first place. If erroneous data is fed into a machine learning algorithm, even the best Artificial Intelligence becomes artificial stupidity. Without a consistent dataset and a single source of truth, your project is sure to fail before it even starts. As is commonly said, “garbage in equals garbage out.”
The solution is a Data Governance Policy to ensure a single source of truth for every data point from customer information to business processes to inventory management.
What is a Data Governance Policy?
Data quality can mean the difference between trash and gold. A data governance policy is a set of processes and rules to manage how information is entered, modified and accessed within an organization. On a recent software project, a multibillion-dollar company reviewed the results and remarked that one of the key differentiators was the quality of the dataset. The goal of data governance is to improve the quality of your data and build trust in what is being reported. Pixentia lists six reasons why you need a governance policy.
In their article on why Master Data Management needs data governance, Infogix highlights how having all your data in one place is not the same as a true data governance policy.
A good data governance policy answers the following questions.
When data enters the organization, where is it coming from and what quality checks are performed?
Is user-entered information such as zip code, address or phone number being validated? Values may match the correct format yet still be wrong, for example when a customer wants to hide their real identity.
After being captured, how is the data transformed? Where is it stored?
What checks are being done to verify the accuracy of the transformations? What if some of the aggregations or concatenations are being done incorrectly?
Is the encryption working correctly?
Once saved, what systems and users have access to the data?
What Extract, Transform, Load (ETL) jobs are running on the data? Is it being stored in a new “temporary” location? Does that location have the same access levels?
How often are the appropriate access levels verified? How?
How long is data kept? What happens when it is “deleted”?
Sometimes data is only marked as “deleted” without actually being deleted.
Who is responsible for each data set? Who knows what the meaning is of each data point?
Who knows what that random field, “recommended_service”, represents anyway?
How does each data point relate to every other data point? Is the “same” data being captured in multiple places?
For example, do customer support and sales separately store customer account information? If they are using different third-party software with no integration, the answer is likely yes.
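A minimal sketch of the entry-point validation the questions above describe, assuming US-style zip codes and phone numbers; the field names and regular expressions are illustrative, not a complete validation scheme:

```python
import re

# Hypothetical format checks for user-entered fields (US-centric examples).
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_RE = re.compile(r"^\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def validate_record(record):
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    if not ZIP_RE.match(record.get("zip", "")):
        problems.append("zip: bad format")
    if not PHONE_RE.match(record.get("phone", "")):
        problems.append("phone: bad format")
    # Format checks alone cannot catch plausible-but-fake values;
    # cross-checking the zip code against the stated city and state
    # catches some of those.
    return problems
```

Checks like these belong at the point where data enters the organization, so downstream systems never have to guess whether a field is trustworthy.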
Data experts often cover many of these components; however, the last point is commonly overlooked. Once all this data is stored in a single place, how do you make sense of it all?
Many people are involved in the creation, implementation and management of the software that collects and processes data. For example, in an inventory management system, the warehouse operator knows what information is needed to keep track of the inventory, but likely doesn’t know how those values are stored in the database. The name stored is usually different from the name presented to the end user, because software usually doesn’t like spaces or capitalization but people do. Therefore, “First Name” often becomes “first_name” or the more obscure “fname.”
When a Manager, Data Analyst or Data Scientist goes to review reports, they need to chase down a chain of people to discover what each value actually represents. Oftentimes this is fraught with mistakes and assumptions because the people who originally created the data left the company long ago, and the knowledge is simply passed down from one person to the next. This is another reason Data Science initiatives often fail: a large portion of the time is spent just trying to make sense of all the data and find the appropriate domain experts.
To alleviate this problem, a concise and explicit data dictionary is needed along with the appropriate owners for each item.
A data dictionary is a short but detailed definition of each data point stored in every database, along with any relevant relationships between the data. It should also include how and why a value is calculated in a specific way (if applicable) and, most importantly, who owns it. There should be one person responsible for data quality and another who is the domain expert.
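One lightweight way to capture such a dictionary is as structured records kept under version control. The sketch below uses a Python dataclass; the field layout and example values are assumptions, apart from the “recommended_service” field mentioned earlier:

```python
from dataclasses import dataclass, field

@dataclass
class DataDictionaryEntry:
    """One data point's definition, lineage and ownership."""
    name: str                 # column name as stored, e.g. "fname"
    display_name: str         # name the end user sees, e.g. "First Name"
    definition: str           # short, explicit meaning
    source: str               # system or process that produces it
    calculation: str = ""     # how/why it is derived, if applicable
    quality_owner: str = ""   # person responsible for data quality
    domain_expert: str = ""   # person who knows the business meaning
    related: list = field(default_factory=list)  # names of related entries

# Hypothetical entry for the mystery field from the questions above.
entry = DataDictionaryEntry(
    name="recommended_service",
    display_name="Recommended Service",
    definition="Service suggested to the customer at the last support contact",
    source="support CRM export",
    quality_owner="data engineering team",
    domain_expert="customer support lead",
)
```

Storing entries like this in the same repository as the code that writes the data makes it natural to update the dictionary in the same change, rather than as an afterthought.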
Ideally all of this should be stored somewhere which is version controlled so as updates are made a record is maintained. Most importantly, a process needs to be defined for any changes to systems which write data to ensure this data dictionary is updated within the current workflow. It can’t be an afterthought or done by another team otherwise it will be forgotten.
People are the end consumers of data and therefore a person needs to be assigned ownership for each data point and needs to understand how it is generated and maintained. AIM Consulting recommends the following roles when it comes to data management:
Chief Data Officer – Provides overall guidance and makes overarching decisions across the entire organization; sets the vision and ensures executive leadership sees value.
Data Governance Leader – Navigates politics, briefs executives and guides the overall data governance program; the chief lieutenant for the CDO.
Data Governance Council (Data Owners) – Typically a cross-functional representation of the data owners aligned by their respective business areas. These leaders ensure their own program tracks are progressing and elevate corporate issues as they arise. The council must empower data stewards.
Data Stewards – Business-aligned resources that set the business process rules, data definitions and help define the standards and policies; actively monitor data quality.
Data Custodians – The technical resources (often IT-aligned) that ensure data follows the standards and policies defined by the governance council and data stewards.
Data Consumers and Producers – The main users of the data and the first to be impacted by changes in policies and standards; one of the main stakeholders to consider.
These can of course vary according to the size of your company, but each of these functions will need to be handled by someone within the team.
Pluralsight has an excellent process for dividing data (and software) ownership into entirely self-sufficient teams. Each team contains a Product Manager, Developers, UX/UI and any other needed roles. Each team is responsible for a set of data, and all other teams must go through that team’s services (usually via an API) to access it. This prevents unnecessary copies of the data from roaming around and ensures all data is eventually consistent.
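The ownership model above can be sketched as each team exposing its data through a service that other teams call instead of copying tables. This is a simplified in-process stand-in for what would be an HTTP API in practice; the team and field names are hypothetical:

```python
class CustomerService:
    """Owned by the customer team: the single source of truth for accounts.

    In production this interface would sit behind an HTTP API; a plain
    class keeps the sketch self-contained.
    """
    def __init__(self):
        self._accounts = {}  # internal storage; other teams never read it directly

    def upsert(self, customer_id, record):
        self._accounts[customer_id] = record

    def get(self, customer_id):
        # Sales, support and analytics read through this interface rather
        # than keeping their own copies, so there is exactly one
        # authoritative record per customer.
        return self._accounts.get(customer_id)

# Another team asks the owning service instead of storing the data itself.
customers = CustomerService()
customers.upsert("c42", {"first_name": "Mary", "team": "marketing"})
```

The design choice is that the owning team can change its storage however it likes, as long as the service interface stays stable.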
What’s the goal?
In a perfect world, a consumer of the data with the appropriate access levels has the ability to explore all data points at a high level and quickly find the information they are looking for. The Data Analyst and Business Intelligence (BI) Analyst could easily explore and look for trends and patterns within the data. For the Data Scientist, a single view can be easily created which contains all the necessary features to describe, prescribe or predict the target metrics. Anyone from the CEO, Director or Manager could create a clear dashboard with the Key Performance Indicators (KPIs) they need to drive their business, department or team. In a perfect world, this could all be done within hours or minutes, instead of weeks or days.
What’s the state of data within your company? Perhaps it’s time for a checkup to see how hard it is to find the data you need.