Creating Value from Data
By: Ewald Geschwinde
What value does your data hold?
Every team generates data in its own format for its own use. However, the data that is generated may not be in the same language that another team speaks. The exact same data from two different teams may come out differently. What's called X in one system may be called Y in another system. And in fact, it gets even more complicated because one might need to further subdivide X in order for it to match Y.
Take a hospital, for example. It wants to match patient data with the departments each patient visited and with the procedures, tests, and treatments given. Each department generates data that gets put into a centralized system. A best practice is to match the data sources together, which creates a centralized place for information to be stored. The doctor can then see the data from each department and get a complete picture of the patient’s medical history.
The public relations department in the hospital wants to publish statistics about surgical outcomes. A Single Point of Truth (SPOT) system classifies the data as confidential or public, granting access only to the data fields permitted for external use. Basically, anything someone could potentially want to know about the hospital is described in the metadata.
A large database is a valuable asset for a large company. To be used effectively, the data must be collected and stored in one safe and accessible place, a SPOT system, so everyone in the company knows where to look. Different systems can hold the same data but interpret it differently; with a SPOT, everyone knows where to go for the final word.
What Is The Real Value Of Having This Data?
We’ve found that data becomes valuable when it is sorted in a way that allows it to be easily turned into accessible insights.
This starts with knowing what you’re trying to measure. At Clevertech, we dive in and create a data strategy that best suits the company’s needs.
In this article, we share some ideas, strategies, and software tips to help you drill for the data you need.
Before diving into sorting data, start with knowing what kind of data the company owns. Research and identify all the different data sources the company has. Once the data has been identified, save the metadata in a central system to create a Single Point of Truth (SPOT).
What Is Metadata?
Metadata, simply stated, is “data about data.” For example, if you have a sales number, you want to know what this sales number is about, what is included in it, and how it is calculated. You describe it with more data. From a technical perspective, you also describe database fields to identify their semantic context. Metadata is what lets you understand what the data means.
In general, metadata systems are not used on their own; they are often combined with data governance systems. Understanding the difference is important. Data governance is about the processes around the data, while metadata describes the data itself, as explained above.
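As a minimal sketch of the sales-number example, a metadata entry could be modeled as a small record. The class name, field names, and all values below are assumptions chosen for illustration, not the schema of any particular metadata product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Describes one data field: the 'data about data'."""
    name: str            # technical field name in the source system
    description: str     # what the number is about
    includes: tuple      # what is counted in the number
    calculation: str     # how the value is derived
    unit: str = ""       # optional unit, e.g. a currency code


# Hypothetical entry for a sales figure; all values are illustrative.
net_sales = FieldMetadata(
    name="net_sales",
    description="Revenue from completed orders in the reporting month",
    includes=("online orders", "in-store orders"),
    calculation="gross_sales - returns - discounts",
    unit="EUR",
)


def describe(meta: FieldMetadata) -> str:
    """Render a one-line, human-readable summary of a field."""
    return f"{meta.name} [{meta.unit}]: {meta.description} (= {meta.calculation})"


print(describe(net_sales))
```

Keeping the calculation next to the description means a reader of the number never has to guess how it was produced.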
Decide What Data Is Important
Data should be labeled and categorized upon saving. Taking a company with multiple sources of revenue as an example, we would ask the following questions to help categorize the data:
- Who will be using the data? Does it need to be in specific formats?
- Are the different parties receiving the data using the same fields that were created by the original source? Does their Y equal the source’s X? We focus on teasing out that information so that everything can be calculated accurately.
- Who is responsible for this data?
- Does the company have all the relevant information available?
Answering these questions as data is being saved is the key to keeping things organized.
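The "does their Y equal the source’s X" question can be made explicit with a field mapping. The following is a rough sketch; the source names (`billing`, `webshop`) and field names are hypothetical.

```python
# Hypothetical mapping from each source system's field names to shared names.
FIELD_MAP = {
    "billing": {"cust_no": "customer_id", "amt": "revenue"},
    "webshop": {"customerId": "customer_id", "total": "revenue"},
}


def to_shared_vocabulary(source: str, record: dict) -> dict:
    """Rename a record's fields to the shared names; fail loudly on unknown fields."""
    mapping = FIELD_MAP[source]
    unknown = set(record) - set(mapping)
    if unknown:
        raise KeyError(f"No mapping for {sorted(unknown)} in source '{source}'")
    return {mapping[key]: value for key, value in record.items()}


billing_row = to_shared_vocabulary("billing", {"cust_no": 17, "amt": 120.0})
webshop_row = to_shared_vocabulary("webshop", {"customerId": 17, "total": 80.0})
print(billing_row)
print(webshop_row)
```

Raising an error on unmapped fields is a design choice in this sketch: it forces someone to decide what a new field means before it enters the shared data.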
Storage Software Recommendations
Below are three good software candidates for storing metadata:
- Apache Atlas (https://atlas.apache.org/): an open-source Apache project that is widely used in many companies and battle-tested.
- Alation (https://www.alation.com/): a commercial vendor in the data sector. I have had a very good experience with this product.
- DataHub (https://datahubproject.io/): developed by LinkedIn and released as open source. This is a very promising newcomer.
**Note:** When it comes to storing metadata about machine learning models, you should go with Kubeflow or TensorFlow Extended to store not only metadata about data but also data about models, metrics, and context (programming languages and versions, CPU and OS information, environment variables).
Example of an architecture for the data:
The metadata is saved in one place to ensure that it can be searched.
Once a SPOT is established in a company, it provides an overview of what data there is, what it means, and who owns it. Viewing permissions can be controlled in the central system.
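To illustrate what "searchable in one place" can look like at its simplest, here is a sketch of a keyword search over a small in-memory catalog. The entries, owners, and the `search` helper are assumptions for illustration; a real metadata system would expose its own search interface.

```python
# Hypothetical catalog entries; names, owners, and descriptions are illustrative.
CATALOG = [
    {"name": "net_sales", "owner": "finance", "description": "Monthly net sales per region"},
    {"name": "patient_visits", "owner": "admissions", "description": "Visits per department"},
    {"name": "surgery_outcomes", "owner": "surgery", "description": "Outcome per procedure"},
]


def search(catalog, term):
    """Return entries whose name or description contains the term (case-insensitive)."""
    needle = term.lower()
    return [
        entry for entry in catalog
        if needle in entry["name"].lower() or needle in entry["description"].lower()
    ]


hits = search(CATALOG, "sales")
print([(hit["name"], hit["owner"]) for hit in hits])
```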
1. Check Data Privacy Levels
Data privacy is a hot topic and must be addressed correctly according to the regulations of the country the company operates in. A tool we use for this is ARX (https://arx.deidentifier.org/). It supports many privacy models (including the k-anonymity model) and includes data quality models.
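For readers unfamiliar with k-anonymity: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below only measures that k for a handful of made-up rows. It does not use ARX, and the column names and records are assumptions for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    quasi-identifier values; the data is k-anonymous for that k."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Illustrative, already-generalized records (age bands, truncated ZIP codes).
patients = [
    {"age": "30-39", "zip": "101**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "101**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "102**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "102**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "102**", "diagnosis": "flu"},
]

k = k_anonymity(patients, ["age", "zip"])
print(k)
```

A check like this only reports the current k. Tools such as ARX go further by transforming the data until a chosen privacy model is satisfied.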
When we think about data privacy, we need to consider the custom fields that we are using. Things like social security numbers, email addresses, birth dates, and mailing addresses all have different requirements for data privacy.
Information classification categorizes the data’s level of secrecy. Some information is not allowed to be published, stored in the cloud, or even touched. It can be crucial to add this meta-information to the data, so that users of the data know what they are not allowed to do with it. If secret data is accidentally published, it can cause a huge negative public impact and also a loss of market share if the competition can use the data.
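One way classification metadata can be put to work is to filter records by clearance before they leave the system. This is a sketch under assumed classification levels and field names, using the hospital example from earlier in the article; real classification schemes are usually defined by company policy.

```python
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


# Hypothetical classification metadata, keyed by field name.
FIELD_CLASSIFICATION = {
    "department": Classification.PUBLIC,
    "surgery_outcome": Classification.PUBLIC,
    "surgeon_notes": Classification.INTERNAL,
    "patient_name": Classification.CONFIDENTIAL,
}


def filter_for_clearance(record: dict, clearance: Classification) -> dict:
    """Keep only fields at or below the clearance level.
    Fields without classification metadata are treated as confidential."""
    return {
        key: value
        for key, value in record.items()
        if FIELD_CLASSIFICATION.get(key, Classification.CONFIDENTIAL) <= clearance
    }


record = {
    "department": "Cardiology",
    "surgery_outcome": "successful",
    "surgeon_notes": "routine procedure",
    "patient_name": "Jane Doe",
    "room": "B12",  # not classified yet, so it is withheld by default
}

public_view = filter_for_clearance(record, Classification.PUBLIC)
print(public_view)
```

Treating unclassified fields as confidential is a deliberately cautious default: a missing label then leads to withheld data, not to an accidental publication.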
2. Create Data Pipelines
Metadata can be used in data pipelines. Imagine that a pipeline can easily look up all the meta-information it needs in the metadata system.
For data pipelines, we recommend using Python, especially Airflow, as it has many connectors and integrates easily with the three main cloud providers and many other data sources. Custom connectors can also be created if that better suits the company’s needs.
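The sketch below shows the idea of a metadata-driven pipeline in plain Python so that it runs without extra dependencies. It is not Airflow code; in an Airflow setup, each of the three functions would typically become its own task. The metadata layout and sample rows are assumptions for illustration.

```python
# Hypothetical metadata as it might be fetched from the central system.
METADATA = {
    "revenue": {"type": float, "required": True},
    "customer_id": {"type": int, "required": True},
    "note": {"type": str, "required": False},
}


def extract():
    """Stand-in for reading rows from a source system."""
    return [
        {"customer_id": 17, "revenue": 120.0, "note": "first order"},
        {"customer_id": 18, "revenue": "n/a"},
        {"customer_id": 19, "revenue": 75.5},
    ]


def validate(rows, metadata):
    """Split rows into valid and rejected ones using the metadata rules."""
    valid, rejected = [], []
    for row in rows:
        ok = all(
            (name in row and isinstance(row[name], spec["type"]))
            or (name not in row and not spec["required"])
            for name, spec in metadata.items()
        )
        (valid if ok else rejected).append(row)
    return valid, rejected


def load(rows):
    """Stand-in for writing to a target; here it just totals revenue."""
    return sum(row["revenue"] for row in rows)


valid_rows, rejected_rows = validate(extract(), METADATA)
total = load(valid_rows)
print(total, len(rejected_rows))
```

Because the validation rules come from the metadata and not from the pipeline code, changing a field definition in the central system changes the pipeline’s behavior without a code change.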
3. Enrich Data Pipelines
Enriching the information in the “SPOT” system can make all the difference. Say the company needs to merge data sources and transform them to create new figures and KPIs. All this information and transformation can be very complex. Think of it as giving data back: the pipeline uses the meta-information and creates new meta-information, which can be written back to the “SPOT” system. All new data is classified and categorized, and the calculation algorithm can be stored so the calculation can be verified. The benefits include:
- Transparency to management
- Better data quality
- Automatic data lineage
- Faster processes, because meta-information can be reused
- Faster implementation of data pipelines and therefore lower costs
From a developer’s perspective, things get easier. All the questions about the data are already answered in the central system, so the work becomes a technical implementation that uses the information given and creates higher-quality data pipelines in a shorter time frame. It is a win-win situation.
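Writing meta-information back can be sketched as follows. A dictionary stands in for the central system, and the rule that a derived figure inherits the strictest classification of its inputs is an assumption made for this example, not a rule stated by any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedFieldMetadata:
    name: str
    inputs: tuple         # upstream fields this figure is computed from (lineage)
    calculation: str      # stored so the result can be verified later
    classification: str


# Stand-in for the central metadata system; a real one would be a service.
spot = {
    "revenue": {"classification": "internal"},
    "order_count": {"classification": "internal"},
}


def register_kpi(store: dict, name: str, inputs: tuple, calculation: str) -> DerivedFieldMetadata:
    """Create metadata for a derived KPI and write it back to the store."""
    missing = [field for field in inputs if field not in store]
    if missing:
        raise KeyError(f"Inputs not described in the metadata store: {missing}")
    # Assumption: a derived figure inherits the strictest input classification.
    order = ["public", "internal", "confidential"]
    strictest = max((store[f]["classification"] for f in inputs), key=order.index)
    meta = DerivedFieldMetadata(name, inputs, calculation, strictest)
    store[name] = {"classification": strictest, "inputs": inputs, "calculation": calculation}
    return meta


kpi = register_kpi(spot, "avg_order_value", ("revenue", "order_count"), "revenue / order_count")
print(kpi)
```

Recording the inputs for every derived figure is what makes data lineage available automatically: following the `inputs` entries leads back to the original sources.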
4. Using Data
To gain better insight into company data, the metadata can be incorporated into machine learning algorithms, knowledge graph analysis, and reporting. The most valuable benefit is access to all the information, which gives transparency over all the data inside the organization.
This leads to new insights and therefore to more accurate strategies for the future.
The data flow is documented, and data quality can be rechecked based on the information given. What about analysis and reports? With all the data available, it is easier to build reports with various reporting tools. Fetch the metadata and the data and present them to the consumer as needed. All information about classification, privacy, and calculation methods has already been gathered.
In this article, we discussed metadata as the central “SPOT” system that gives companies the meta-information they need to manage their data. Remember to choose a strategy that best meets the needs of the company. A pre-analysis should be done to evaluate whether you are better off with data governance, master data management, or metadata management systems, or even a combination of solutions.
With the demand for more reporting from your company’s users and the need to deliver valuable information to your customers, it is important that you have a system in place. Treat data as the new oil, but don’t forget that you also have to drill for it and maintain it for data projects to succeed.