Data management: why it's so essential
Friday, 01 June, 2018
Unpolluted data is core to a successful business — particularly one that relies on analytics to survive.
But how difficult is it to manage unfiltered data and get it ready for analytics? Most data scientists spend 50–80% of their model development time on data preparation — they often know ahead of time what data they want to profile or visualise prior to preparing and modelling it. But what they don’t know is which variables are best suited — with the highest predictive value — for the type of model being implemented and the variable being modelled.
Identifying and accessing the right data are crucial first steps. Before you can build an effective model, you’ll need consistent, reliable data that’s ready for analytics. For data scientists and business analysts who prepare data for analytics, data management technology from SAS acts like a data filter — providing a single platform that lets them access, cleanse, transform and structure data for any analytical purpose. As it removes the drudgery of routine data preparation, it reveals clean data and adds value along the way.
SAS also has five data management for analytics best practices that can help.
1. Simplify access to traditional and emerging data
Accessing data is challenging — different data sources, formats and structures make it hard to bring the data together. And statistical analysis essentially only cares about two data types — character and numeric. Yet some data sources, like relational databases, have 10–20 different numeric data types.
SAS has a plethora of native data access capabilities that make working with a wide variety of data sources easy:
- Simplifies access to multiple data sources.
- Minimises data movement and improves governance by pushing data processing down to the data source via SQL passthrough and the SAS Embedded Process — a portable, lightweight SAS execution engine.
- Provides self-service data preparation capabilities with intuitive user interfaces that make data accessible to more users, with less training. This, in turn, frees IT personnel from iterative data provisioning tasks so they can be more productive.
- Enables agile, secure techniques for managing data.
2. Strengthen the data scientist’s arsenal with advanced analytics techniques
SAS provides sophisticated statistical analysis capabilities inside the ETL flow:
- Frequency analysis goes beyond simple counts to help identify outliers and missing values that can skew measures of central tendency like the mean and median, as well as affecting analyses like forecasting.
- Summary statistics describe the data through several measures, including central tendency, variability, percentiles and cardinality. Cardinality is the number of unique values that exist for a given variable.
- Correlation is used during the analytical model building process, when business analysts try to understand the data to determine which variables, or combinations of variables, will be most useful based on the strength of their predictive capability.
3. Scrub data to build quality into existing processes
Up to 40% of all strategic processes fail because of poor data. Data cleansing begins with understanding the data through profiling. It then involves correcting data values (like typos and misspellings), filling in missing values (like a ZIP code), finding and resolving duplicate data or customer records, and standardising data formats (dates, monetary values, units of measure). Cleaning data can also include automated selection of best records and cleaning data in multiple languages.
SAS has an industry-leading data quality platform that:
- Incorporates the cleansing capability into your data integration flow to make IT resources more productive.
- Puts data quality in-database — that is, pushes this processing down to the database to improve performance.
- Removes invalid data from the dataset based on the analytical method you’re using — such as outliers, missing data, redundant data or irrelevant data.
- Enriches data via a process called binning — which simply means grouping together data that was originally in smaller intervals. For example, the individual value of age alone may not have much relevance, but age groups could, such as “between 35 and 45”.
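A minimal sketch of three of the cleansing steps described above — standardising formats, de-duplicating records, and binning — using a few invented customer records (this is illustrative Python, not the SAS platform's own mechanism):

```python
from datetime import datetime

# Hypothetical raw records: inconsistent casing, whitespace and date formats,
# plus a duplicate customer
records = [
    {"name": "Ann Lee ", "age": 37, "joined": "01/06/2018"},
    {"name": "ann lee",  "age": 37, "joined": "2018-06-01"},
    {"name": "Bob Roy",  "age": 52, "joined": "15/03/2018"},
]

def standardise(rec):
    rec = dict(rec)
    rec["name"] = rec["name"].strip().title()
    # Normalise both date formats seen above to ISO 8601
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            rec["joined"] = datetime.strptime(rec["joined"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    return rec

clean = [standardise(r) for r in records]

# De-duplicate on the standardised (name, joined) key, keeping the first record
seen, deduped = set(), []
for rec in clean:
    key = (rec["name"], rec["joined"])
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# Binning: replace individual ages with broader, more predictive groups
def age_group(age):
    return "under 35" if age < 35 else "35-45" if age <= 45 else "over 45"

for rec in deduped:
    rec["age_group"] = age_group(rec["age"])
```

Note that de-duplication only becomes reliable after standardisation: the two Ann Lee records only match once name casing and date formats agree.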
4. Shape data using flexible manipulation techniques
Without flexible methods of manipulating data, it can be difficult to structure the final dataset. Typical analytical methods expect a “flattened” dataset, often called “one row per subject”, which can be problematic because database systems are not designed with a single-row-per-customer data structure in mind. As a result, many database systems limit the number of columns a single table can have. Transaction systems record every transaction as it happens, resulting in a high volume of records for each customer. These transaction records need to be consolidated and transposed to be joined with the customer records pulled from the data warehouse.
SAS simplifies data transposition with intuitive, graphical interfaces for transformations. Plus, you can use other reshaping transformations. Those include frequency analysis to reduce the number of categories of variables, appending data, partitioning and combining data, and a variety of summarisation techniques.
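The consolidate-and-transpose step described above can be sketched in plain Python (a hypothetical illustration with invented transaction data, not the SAS transformation itself): many transaction rows per customer become one flat row per customer, with a column per month.

```python
from collections import defaultdict

# Hypothetical transaction records: multiple rows per customer
transactions = [
    {"customer": "C1", "month": "jan", "amount": 120.0},
    {"customer": "C1", "month": "feb", "amount": 80.0},
    {"customer": "C2", "month": "jan", "amount": 200.0},
    {"customer": "C1", "month": "jan", "amount": 30.0},
]

# Consolidate: sum amounts per (customer, month)
totals = defaultdict(float)
for t in transactions:
    totals[(t["customer"], t["month"])] += t["amount"]

# Transpose: one row per customer, one column per month ("one row per subject")
months = ["jan", "feb"]
flat = {}
for (cust, month), amount in totals.items():
    row = flat.setdefault(cust, {m: 0.0 for m in months})
    row[month] = amount
```

The resulting `flat` structure is what a typical analytical method expects, and is ready to be joined back onto customer records pulled from the data warehouse.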
5. Share metadata across data management and analytics domains
SAS has a common metadata layer that allows data preparation processes to be consistently repeated. This promotes more efficient collaboration between those who initially prepare data and the business analysts and data scientists who ultimately complete the data preparation process and analytical model development.
Because of the common metadata layer, SAS makes it easier to deploy models. As each model is registered in metadata and made available along with its data requirements, it becomes less of a challenge to adopt.
Applying metadata across the analytics life cycle delivers savings on multiple levels. When a common metadata layer serves as the foundation for the model development process, it eases the intensely iterative nature of data preparation, the burden of the model creation process and the challenge of deployment. Advantages include:
- Faster testing and increased productivity due to automated model development and scoring.
- Creation of more models with greater accuracy because of automated model management.
- Faster cycle times that increase profitability and result in more relevant and timely models.
- Less time spent on mundane data work and more focus on model development and evaluation.
- Knowledge gained during data preparation that can be reused across the enterprise.
- Increased flexibility to accommodate changes because of better manageability and governance over the analytics life cycle.
- Auditable, transparent data that meets regulatory requirements.