Executive Summary
Many organizations inadvertently expose information when they routinely copy sensitive, regulated, or industry-specific production data into non-production/test environments. As a result, data in non-production/test environments has increasingly become a target for cyber criminals and can be lost or stolen. Data breaches in non-production environments can cost millions of dollars to remediate and cause irreparable harm to reputation and brand.
In this paper, we share an approach to generating fake data for testing cycles and other non-production purposes by employing Artificial Intelligence and Machine Learning techniques, in order to help organizations prevent the misuse of production data. Many NLP/ML packages are freely available online, notably spaCy, NLTK and CoreNLP. spaCy supports Named Entity Recognition (NER) very efficiently, which makes it well suited to identifying specific sensitive attributes.
Data Faking
Data faking is the process of hiding original data by replacing it with changed content (special characters or similar realistic values) so that it can be used in a test environment to perform testing activities.
The foremost reason for applying faking to a data field is to protect data classified as Personally Identifiable Information (PII), sensitive data, or commercially sensitive data, while ensuring that the data remains usable for valid test cycles.
Why Fake Data?
Organizations share data with other users for a variety of business needs:
- Copying production data into test/development environments allows system administrators to test upgrades, patches and fixes
- Competitive pressures demand new and improved functionality in existing production applications. As a result, application developers need an environment that closely mimics production in which to build and test new functionality, ensuring that existing functionality does not break
- Retail organizations share customers’ point-of-sale data with market researchers to analyze customer buying patterns
- Pharmaceutical and healthcare organizations share patients’ data with medical researchers to assess the efficacy of clinical trials and medical treatments
For the reasons cited above, organizations copy millions of sensitive customer and consumer records to non-production environments; however, only a handful of organizations actually plan and work towards protecting this data when sharing it with outsourcers and third parties.
spaCy (https://spacy.io/) is an open-source package for advanced Natural Language Processing, written in Python and Cython. The package is published under the MIT license and offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER (Named Entity Recognition), as well as tokenization for various other languages.
Advantages of spaCy
- Faster than comparable packages such as NLTK (Natural Language Toolkit)
- Features convolutional neural network models for part-of-speech tagging, dependency parsing and named entity recognition
- Easy to use; you need not be an NLP expert to get started with spaCy
- Helps identify domain-specific data for faking
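To illustrate the ease of use claimed above, here is a minimal sketch of named entity recognition with spaCy. To keep it self-contained it uses a rule-based `EntityRuler` on a blank pipeline, so no pretrained model download is required; in practice one would load a statistical model (e.g. `en_core_web_sm`) so that unseen names are also recognized. The sample names and patterns are illustrative:

```python
import spacy

# Build a blank English pipeline and attach a rule-based entity ruler,
# so this sketch runs without downloading a pretrained model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "John Smith"},   # illustrative pattern
    {"label": "ORG", "pattern": "Acme Bank"},       # illustrative pattern
])

doc = nlp("John Smith opened an account at Acme Bank.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Each recognized entity carries a label (PERSON, ORG, etc.), which is exactly the hook needed to decide whether a value is sensitive and should be faked.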
Proposed Approach for Data Faking
We define a more generic approach to data faking that can not only generate fake data but also generate consistent data (the same original value always maps to the same fake value). This approach does not need access to any database such as SQL Server or Oracle; a sample dataset in a file format such as Excel, CSV or JSON will suffice.
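The file-based workflow above can be sketched end to end: read a CSV sample, detect sensitive values, and write out a faked copy. For a self-contained sketch, a regular expression stands in for the NER-based detection step (in the actual approach, spaCy entity labels would drive the decision), and the replacement value is a fixed placeholder:

```python
import csv
import io
import re

# Simple email detector standing in for NER-based sensitive-data detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def fake_csv(text: str) -> str:
    """Read CSV text and replace email-like cells with a placeholder value."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out)
    for row in reader:
        writer.writerow(
            ["user@example.com" if EMAIL_RE.fullmatch(cell) else cell
             for cell in row]
        )
    return out.getvalue()

sample = "name,email\nJohn,john@corp.com\nMary,mary@corp.com\n"
print(fake_csv(sample))
```

Because the input is just a file, the same function works for extracts from any source system, with no database connectivity required.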
Technical Approach to Create Models to Fake Data
Advantages of the Defined Approach
- Models are reusable, as spaCy supports multiple languages with minimal rework
- Identifies sensitive data automatically for data faking
- Models can be reused across multiple domains, e.g. Retail and Banking
- Easy to integrate with TDM (Test Data Management) and test automation solutions
- Saves up to 70% of the effort in data faking
- Minimizes dependency on DBAs/business analysts to provide data for testing
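The consistency property mentioned earlier (the same original value always yielding the same fake value, so joins across tables still work) can be sketched with deterministic pseudonymization. The salted-hash scheme and `CUST-` prefix below are illustrative assumptions, not part of a specific tool:

```python
import hashlib

def consistent_fake(value: str, salt: str = "project-salt") -> str:
    """Deterministically map a value to a pseudonym: the same input always
    yields the same fake value, preserving referential integrity across
    tables and test cycles, while the original cannot be read back."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
    return f"CUST-{digest}"

# The same customer identifier produces the same pseudonym in every dataset.
a = consistent_fake("john.smith@corp.com")
b = consistent_fake("john.smith@corp.com")
print(a, a == b)
```

Keeping the salt secret and per-project prevents trivially rebuilding the mapping from a dictionary of known values.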
Practice Head, Digital Assurance Services