Data management: why it's so essential
Unpolluted data is core to a successful business — particularly one that relies on analytics to survive.
But how difficult is it to manage unfiltered data and get it ready for analytics? Most data scientists spend 50–80% of their model development time on data preparation. They often know ahead of time which data they want to profile or visualise before preparing and modelling it. What they don’t know is which variables have the highest predictive value for the type of model being implemented and the variable being modelled.
Identifying and accessing the right data are crucial first steps. Before you can build an effective model, you’ll need consistent, reliable data that’s ready for analytics. For data scientists and business analysts who prepare data for analytics, data management technology can act like a data filter — providing a single platform that lets them access, cleanse, transform and structure data for any analytical purpose. As it removes the drudgery of routine data preparation, it reveals clean data and adds value along the way.
Here are five best practices that can help with data management for analytics.
1. Simplify access to traditional and emerging data
Accessing data is challenging — different data sources, formats and structures make it hard to bring the data together. And statistical analysis essentially only cares about two data types — character and numeric. Yet some data sources, like relational databases, have 10–20 different numeric data types.
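As a rough illustration of the type mismatch described above, the sketch below collapses declared database column types into the two types that statistical analysis distinguishes. The list of numeric types, the `analytic_type` helper and the sample schema are assumptions made for this example, not part of any particular product.

```python
# Hypothetical set of SQL type names treated as numeric for analysis.
NUMERIC_SQL_TYPES = {
    "SMALLINT", "INTEGER", "BIGINT", "DECIMAL", "NUMERIC",
    "REAL", "FLOAT", "DOUBLE PRECISION",
}


def analytic_type(sql_type: str) -> str:
    """Return 'numeric' or 'character' for a declared SQL column type."""
    # Strip any precision or scale suffix, e.g. DECIMAL(10,2) -> DECIMAL.
    base = sql_type.split("(")[0].strip().upper()
    return "numeric" if base in NUMERIC_SQL_TYPES else "character"


# Made-up schema for illustration.
schema = {
    "customer_id": "BIGINT",
    "balance": "DECIMAL(10,2)",
    "postcode": "VARCHAR(8)",
}
analytic_schema = {col: analytic_type(t) for col, t in schema.items()}
```

Anything not recognised as numeric falls back to character here, which is a deliberately conservative choice for a sketch; a real mapping would also need to handle dates and binary types.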
A capable data management package should offer a wide range of native data access capabilities. Look for one that:
- Simplifies access to multiple data sources.
- Minimises data movement and improves governance by pushing data processing down to the data source.
- Provides self-service data preparation capabilities with intuitive user interfaces that make data accessible to more users, with less training. This, in turn, frees IT personnel from iterative data provisioning tasks so they can be more productive.
- Enables agile, secure techniques for managing data.
2. Strengthen the data scientist’s arsenal with advanced analytics techniques
Look for data management software that provides sophisticated statistical analysis capabilities:
- Frequency analysis that goes beyond simple counts to help identify outliers and missing values, which can skew measures of central tendency such as the mean and median and can affect analyses such as forecasting.
- Summary statistics that describe the data by providing several measures, including central tendency, variability, percentiles and cardinality. Cardinality shows how many unique values exist for a given variable.
- Correlation that can be used during the analytical model building process, when business analysts try to understand the data to determine which variables or combinations of variables will be most useful based on their predictive strength.
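The three techniques listed above can be sketched with the Python standard library. The sample values and variable names are invented for illustration, and the Pearson coefficient is only one possible choice of correlation measure.

```python
import statistics
from collections import Counter

# Made-up sample values; None marks a missing value.
ages = [34, 41, 29, None, 41, 38, 95]
spend = [120.0, 150.0, 90.0, 110.0, 155.0, 140.0, 300.0]

# Frequency analysis: counts per value, including missing values.
age_counts = Counter(ages)
missing_ages = age_counts[None]

# Summary statistics on the non-missing values.
known_ages = [a for a in ages if a is not None]
mean_age = statistics.mean(known_ages)
median_age = statistics.median(known_ages)
stdev_age = statistics.stdev(known_ages)
cardinality = len(set(known_ages))  # number of unique values


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Correlate age with spend using only the rows where age is known.
pairs = [(a, s) for a, s in zip(ages, spend) if a is not None]
r = pearson([a for a, _ in pairs], [s for _, s in pairs])
```

In this sample the single large age value pulls the mean above the median, which is the kind of skew that frequency analysis is meant to surface before modelling.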
3. Scrub data to build quality into existing processes
Up to 40% of all strategic processes fail because of poor data. Data cleansing begins with understanding the data through profiling. From there, it involves correcting data values (like typos and misspellings), adding missing data values (such as postcode), finding and dealing with duplicate data or customer records, and standardising data formats (dates, monetary values, units of measure). It can also include automated selection of best records and cleansing data in multiple languages.
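A minimal sketch of two of these cleansing steps, standardising formats and removing duplicates, might look like the following. The records, the two assumed input date formats and the rule of keeping the first record seen are all illustrative choices, not a prescribed method.

```python
from datetime import datetime

# Made-up customer records with inconsistent formatting and a duplicate.
records = [
    {"name": " Jane Smith ", "postcode": "ab1 2cd", "joined": "2018-03-05"},
    {"name": "jane smith", "postcode": "AB1 2CD", "joined": "05/03/2018"},
    {"name": "Raj Patel", "postcode": "ef3 4gh", "joined": "12/11/2017"},
]

# Assumed input date formats: ISO and day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardise_date(text):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(record):
    """Normalise whitespace and case, and standardise the date format."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "postcode": record["postcode"].strip().upper(),
        "joined": standardise_date(record["joined"]),
    }


# Deduplicate on the standardised name and postcode, keeping the first record seen.
unique = {}
for rec in map(cleanse, records):
    unique.setdefault((rec["name"], rec["postcode"]), rec)
clean_records = list(unique.values())
```

Deduplication only works here because it runs after standardisation; the two Jane Smith records differ in their raw form and match only once case, whitespace and date format are aligned.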
Look for a data quality platform that:
- Incorporates the cleansing capability into your data integration flow to make IT resources more productive.
- Puts data quality in the database by pushing this processing down to the database to improve performance.
- Removes invalid data, such as outliers, missing data, redundant data or irrelevant data, from the dataset based on the analytical method you’re using.
- Enriches data via a process called binning — which simply means grouping together data that was originally in smaller intervals. For example, the individual value of age alone may not have much relevance, but age groups could, such as “between 35 and 45”.
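Binning, as described in the last item, can be sketched in a few lines. The bin edges and labels below are assumptions chosen for the example; each bin includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right

# Assumed bin edges and matching labels (one more label than edges).
EDGES = [25, 35, 45, 55]
LABELS = ["under 25", "25-34", "35-44", "45-54", "55 and over"]


def age_group(age):
    """Map an individual age to its age-group label."""
    return LABELS[bisect_right(EDGES, age)]


ages = [22, 35, 44, 45, 61]
groups = [age_group(a) for a in ages]
```

The grouped variable has five categories instead of one value per year of age, which is often easier for a model to use.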
4. Shape data using flexible manipulation techniques
Without flexible methods of manipulating data, it can be difficult to structure the final dataset. Typical analytical methods expect a “flattened” dataset, often called “one row per subject”. This can be problematic because database systems are not designed with a single-row-per-customer data structure in mind, and many limit the number of columns a single table can have. Transaction systems record every transaction as it happens, resulting in a high volume of records for each customer. These transaction records need to be consolidated and transposed before they can be joined with the customer records pulled from the data warehouse.
Look for a solution that simplifies data transposition with intuitive, graphical interfaces for transformations. Other desired reshaping transformations include frequency analysis to reduce the number of categories of variables, appending data, partitioning and combining data, and a variety of summarisation techniques.
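To make the consolidate, transpose and join steps concrete, here is a small sketch that turns several transaction rows per customer into one row per customer. The data, the column names and the `spend_` prefix are hypothetical.

```python
from collections import defaultdict

# Made-up transactions: several rows per customer.
transactions = [
    {"customer_id": 1, "category": "grocery", "amount": 40.0},
    {"customer_id": 1, "category": "fuel", "amount": 25.0},
    {"customer_id": 1, "category": "grocery", "amount": 10.0},
    {"customer_id": 2, "category": "fuel", "amount": 60.0},
]
# Made-up customer records, as they might come from a data warehouse.
customers = {1: {"segment": "retail"}, 2: {"segment": "business"}}

categories = sorted({t["category"] for t in transactions})

# Consolidate: total amount per customer and category.
totals = defaultdict(lambda: defaultdict(float))
for t in transactions:
    totals[t["customer_id"]][t["category"]] += t["amount"]

# Transpose and join: one row per customer, one column per category.
flat = []
for cid, attrs in sorted(customers.items()):
    row = {"customer_id": cid, **attrs}
    for cat in categories:
        # Customers with no transactions in a category get 0.0.
        row[f"spend_{cat}"] = totals[cid][cat]
    flat.append(row)
```

Each category becomes its own column, so the width of the flattened table grows with the number of categories. That is one reason to reduce the number of categories before transposing.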
5. Share metadata across data management and analytics domains
Look for a solution that provides a common metadata layer that allows data preparation processes to be consistently repeated. This promotes more efficient collaboration between those who initially prepare data and the business analysts and data scientists who ultimately complete the data preparation process and analytical model development.
A common metadata layer makes it easier to deploy models. As each model is registered in metadata and made available along with its data requirements, it becomes less of a challenge to adopt.
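One way to picture a model registered along with its data requirements is a simple lookup that can be checked against a prepared dataset before deployment. This is only a sketch: the registry structure, the function names and the model name `churn_v1` are all invented for illustration and do not describe any specific metadata product.

```python
# Hypothetical registry: model name -> required input columns and their types.
model_registry = {}


def register_model(name, required_columns):
    """Record a model together with the columns and types it expects."""
    model_registry[name] = dict(required_columns)


def missing_requirements(name, dataset_columns):
    """Return the required columns that are absent or have a different type."""
    required = model_registry[name]
    return sorted(
        col for col, typ in required.items()
        if dataset_columns.get(col) != typ
    )


register_model("churn_v1", {"age_group": "character", "spend_fuel": "numeric"})

# Column metadata for a prepared dataset (made up for this example).
prepared = {"age_group": "character", "spend_fuel": "numeric", "segment": "character"}
gaps = missing_requirements("churn_v1", prepared)
```

An empty result means the prepared dataset satisfies the model's stated requirements; any names returned show what still needs to be prepared.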
Applying metadata across the analytics life cycle delivers savings on multiple levels. When a common metadata layer serves as the foundation for the model development process, it eases the intensely iterative nature of data preparation, the burden of the model creation process and the challenge of deployment. Advantages include:
- Faster testing and increased productivity due to automated model development and scoring.
- Creation of more models with greater accuracy because of automated model management.
- Faster cycle times that increase profitability and result in more relevant and timely models.
- Less time spent on mundane data work and more focus on model development and evaluation.
- Knowledge that can be re-used across the enterprise after it’s obtained during the data preparation process.
- Increased flexibility to accommodate changes because of better manageability and governance over the analytics life cycle.
- Auditable, transparent data that meets regulatory requirements.