
Data Profiling & Outlier Identification for Software Design

Outliers do not necessarily apply to a single dimension; outliers in high-dimensional data sets are also important. One special case matters particularly to IT departments. Under a broader definition, outlier data is data that does not conform to the constraints of the enterprise data model. Such outlier data can cause several problems and consume the majority of a team's time and effort.

Understanding the data is crucial for writing good code, especially when the code performs data conversions or interfaces data from one system to another. Programs written for data conversions and data interfaces are more susceptible to outlier data, because each conversion has built-in assumptions about the data structures.

For example, in large data conversions involving multiple entities such as customers, addresses, items, orders, and invoices, there are implicit relationships between these entities: every customer needs to have an address, every order needs to have at least one bill-to address, and so on.
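The two example constraints above can be checked directly against the source data before any conversion code is written. The following is a minimal sketch, assuming the entities are available as lists of dictionaries; the field names (`id`, `customer_id`, `order_id`, `role`) and the `order_addresses` link table are hypothetical and would need to be mapped to the actual source schema.

```python
def find_relationship_outliers(customers, addresses, orders, order_addresses):
    """Return the ids of records that violate the two example constraints.

    All field names here are illustrative assumptions, not a real schema.
    """
    # Ids of customers that own at least one address.
    customers_with_address = {a["customer_id"] for a in addresses}

    # Ids of orders linked to at least one bill-to address.
    orders_with_bill_to = {
        link["order_id"]
        for link in order_addresses
        if link["role"] == "bill_to"
    }

    return {
        "customers_without_address": [
            c["id"] for c in customers if c["id"] not in customers_with_address
        ],
        "orders_without_bill_to": [
            o["id"] for o in orders if o["id"] not in orders_with_bill_to
        ],
    }
```

Any id returned by this check is a record the conversion program has to handle explicitly, for instance by rejecting it, defaulting it, or routing it to manual review.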

Before even writing the conversion programs and identifying the transformation requirements, one needs to profile the source data and identify the outliers, so that the software can handle this outlier data.

By profiling the data, one can easily identify outlier data by looking for:

  • Blanks, spaces, sparsity, and NULLs in various columns
  • Least and most frequent values
  • Patterns – SSNs, telephone numbers, dates, ZIP codes
  • Cross-table and inter-table relationships, including orphaned records and cardinality
  • Business relationships across databases
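The column-level checks in this list (NULLs, blanks, value frequencies, and patterns) can be sketched in a few lines. This is an illustrative example only, assuming a column has already been loaded as a Python list; `profile_column` and `SSN_PATTERN` are hypothetical names, and the SSN format shown is just one possible pattern.

```python
import re
from collections import Counter

# Assumed format for illustration: NNN-NN-NNNN. Real data may use others.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def is_blank(value):
    """True for strings that are empty or contain only whitespace."""
    return isinstance(value, str) and value.strip() == ""


def profile_column(values, pattern=None):
    """Summarize NULLs, blanks, value frequencies, and pattern mismatches."""
    nulls = sum(1 for v in values if v is None)
    blanks = sum(1 for v in values if is_blank(v))
    present = [v for v in values if v is not None and not is_blank(v)]

    # most_common() sorts by descending count, so the first entry is the
    # most frequent value and the last entry is the least frequent one.
    ranked = Counter(present).most_common()

    profile = {
        "total": len(values),
        "nulls": nulls,
        "blanks": blanks,
        "most_frequent": ranked[0] if ranked else None,
        "least_frequent": ranked[-1] if ranked else None,
    }
    if pattern is not None:
        # Values that are present but do not match the expected pattern.
        profile["pattern_mismatches"] = [
            v for v in present if not pattern.fullmatch(str(v))
        ]
    return profile
```

Calling `profile_column(ssn_values, SSN_PATTERN)` on each candidate column gives a quick picture of how sparse the column is and which values break the expected format, which are the cases the conversion code is most likely to mishandle.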

A few questions for the audience:

  1. How does your company deal with data conversion and interface requirements?
  2. What tools do you use to analyze the data to ensure that the code being built can handle it?
