Data privacy regulations—coupled with the desire to protect sensitive data—impose requirements on organizations to protect production data. Since many organizations rely on production data as a source for test data, techniques are needed to mask sensitive data elements from unauthorized viewing.
A popular technique for protecting sensitive data is data masking, a method that creates data that is structurally similar to production data but that is not the same as the actual data. After data is masked, it can be used by application systems the same way as the actual, production data. But protected, sensitive data values are not exposed for all to see.
The ability to mask data is important for complying with regulations and standards such as GDPR and PCI-DSS, which place restrictions on how personally identifiable information (PII) and payment card data can be used. Sensitive data includes personal information such as names, addresses, Social Security numbers, and payment card details; financial data such as account numbers, revenue, salary, and transactions; and confidential company information such as blueprints, product road maps, and acquisition plans.
How Is Data Masked?
The general idea is to create reasonable test data that can be used as the production data would be, but without using, and therefore exposing, the sensitive information. Data masking protects the actual data but provides a functional substitute for tasks that do not require actual data values.
Data masking is an important component of building any test bed of data—especially when data is copied from production. To comply with pertinent regulations, all PII must be masked or changed, and, if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:
- Referential constraints must be maintained. If primary or foreign keys change (and they may have to if the key itself could be used to work out the original data), the data must be changed the same way in both the parent and child tables.
- Unique constraints must be enforced. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.
- The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check. The same is true of the Social Insurance number in Canada and the Social Security number in the U.S. (although each has its own rules).
- And do not forget about related data. For example, city, state, and ZIP code values are correlated, meaning that a specific ZIP code aligns with a specific city and state. As such, the masked values should conform to the rules.
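To make the validity-check point concrete, here is a minimal Python sketch of masking a payment card number so that it still passes the Luhn checksum used for card numbers. The function names are hypothetical, and keeping the first six digits (the issuer prefix) is an assumption made for illustration, not a requirement from any regulation.

```python
import random

def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def is_luhn_valid(number: str) -> bool:
    """True if the last digit is the correct Luhn check digit for the rest."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == int(number[-1])

def mask_card_number(original: str, rng: random.Random) -> str:
    """Keep the 6-digit prefix (assumption), randomize the rest, recompute the check digit."""
    prefix = original[:6]
    body = "".join(str(rng.randrange(10)) for _ in range(len(original) - 7))
    payload = prefix + body
    return payload + str(luhn_check_digit(payload))
```

A purely random 16-digit string would fail `is_luhn_valid` about nine times out of ten, which is why the check digit is recomputed rather than randomized.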
A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed—and this typically requires a tool to implement properly.
Novices sometimes look at the problem and think it should be easy to mask or obfuscate data, but doing it properly is a tall order. For example, a good data masking tool should always generate the same result value for a given input value. In other words, if my name (Craig Mullins) is masked to (Ted Jacoby) in one table, it should be masked to that same name everywhere it appears in the database.
Many data masking tools use hashing functions and lookup tables to accomplish this. The hashing function must be non-invertible so the process cannot be easily reversed, and the lookup tables need to be thorough, protected, and available for any language to be used.
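As a rough illustration of the hashing-plus-lookup approach, the Python sketch below uses a keyed hash (HMAC-SHA-256) to pick a replacement from a lookup table, so the same input always maps to the same output. The names `REPLACEMENT_NAMES`, `mask_name`, and `mask_key` are hypothetical, and the five-entry table is an assumption for brevity; this is not how any particular masking tool works.

```python
import hashlib
import hmac

# Hypothetical lookup table; a real one would be far larger and kept protected.
REPLACEMENT_NAMES = ["Ted Jacoby", "Ann Reyes", "Sam Okoro", "Lena Fischer", "Raj Mehta"]

def mask_name(value: str, secret_key: bytes) -> str:
    """Deterministically map a name to an entry in the lookup table."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]

def mask_key(value: int, secret_key: bytes) -> str:
    """Pseudonymize a key value; the same input yields the same token in every table."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]
```

Because the hash is keyed, someone without the secret cannot rebuild the mapping by hashing guessed inputs. Applying `mask_key` to a primary key and to the foreign keys that reference it keeps parent and child rows joined. Note that a small lookup table maps many inputs to the same replacement, so this sketch alone does not preserve uniqueness.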
Some database management systems offer basic data masking capabilities, so you should always investigate the native masking functionality before embarking on using a tool. But native DBMS functionality is usually limited; often it is just a way of displaying a different value based on a rule for a specific column.
In-depth data masking is applied using a set of rules that indicate which columns of which tables should be masked. It should be possible to build rules based on wildcards (such as all tables beginning with FIN or ending with XX).
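Wildcard-based rules of this kind might be sketched in Python with the standard `fnmatch` module. The rule layout, the column names, and the method labels below are all hypothetical; they only show how a table pattern such as `FIN*` or `*XX` could select a masking method.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (table pattern, column pattern, masking method label).
# The first matching rule wins.
RULES = [
    ("FIN*", "*_SSN", "ssn"),      # tables beginning with FIN
    ("*XX", "CARD_NO", "card"),    # tables ending with XX
    ("*", "EMAIL", "email"),       # any table
]

def find_masking_rule(table: str, column: str):
    """Return the method label of the first rule matching table and column, else None."""
    for table_pattern, column_pattern, method in RULES:
        if fnmatchcase(table, table_pattern) and fnmatchcase(column, column_pattern):
            return method
    return None
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.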
Masking while copying data is generally most useful when copying data from a production environment into a test or QA system, but sometimes you may need to mask data in place for an existing set of tables without making another copy.
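The two modes can be contrasted in a small Python sketch using the standard `sqlite3` module. The `customer` table, the `mask_email` function, and the hard-coded key are assumptions for illustration only; a real job would cover many tables and keep the key out of the source.

```python
import hashlib
import hmac
import sqlite3

SECRET_KEY = b"demo-key-not-for-real-use"  # assumption: a real key lives in a secret store

def mask_email(email: str) -> str:
    """Deterministically replace an email address with a plausible-looking one."""
    digest = hmac.new(SECRET_KEY, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"

def copy_with_masking(prod: sqlite3.Connection, test: sqlite3.Connection) -> None:
    """Mask while copying: rows are masked on their way from production to test."""
    test.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    rows = prod.execute("SELECT id, email FROM customer")
    test.executemany(
        "INSERT INTO customer (id, email) VALUES (?, ?)",
        ((row_id, mask_email(email)) for row_id, email in rows),
    )
    test.commit()

def mask_in_place(conn: sqlite3.Connection) -> None:
    """Mask in place: overwrite the values in the existing table."""
    conn.create_function("mask_email", 1, mask_email)
    conn.execute("UPDATE customer SET email = mask_email(email)")
    conn.commit()
```

With the copy approach the production rows are never modified; the in-place approach destroys the original values, so it suits a table that is already a disposable copy.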
The Goal
The goal should be to mask your data such that it works just as the actual production data does but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data). Masked data is protected data, and with the continuing growth of data breaches, protecting your data should be a top priority.