Product release: redact PII in PDFs and generate fake data from scratch

By
Elise Devaux

Our team worked hard over the summer to bring new capabilities to the Statice Platform and SDK. This month, we are releasing new features, including:

  • the ability to generate realistic-looking data from scratch for testing scenarios,
  • support for PII (Personally Identifiable Information) detection and protection in PDF files,
  • new data storage connectors and a Docker Compose deployment option, 
  • multi-language support for all sanitization operations,
  • performance and privacy improvements in the SDK.

Quickly generate fake data from scratch with our new data simulation feature

The new data simulation functionality lets you generate new tabular datasets from scratch. To obtain realistic-looking data in minutes, you only need to specify the column types (emails, names, IBANs, etc.) and the size of the dataset you want to create. This feature is ideal when you lack real or test data, or when your original dataset is too small.

Create or augment datasets with fake data using our new data simulation feature.


You can generate, among other things, columns of fake addresses, bank information, credit cards, dates, identities, and contact details. You can choose several regions and languages to obtain localized datasets, simulate as many or as few tables and rows as you need, and define distribution rules so that your new data matches specific statistical properties.
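To make the idea of "column types plus dataset size" concrete, here is a minimal sketch in plain Python. It is not the Statice Platform's API: the `simulate_table` function, the `GENERATORS` mapping, and the sample name lists are all hypothetical names invented for this illustration, and the generated IBANs are only IBAN-shaped strings, not valid account numbers.

```python
import random
import string

# Hypothetical sample values; a real tool would ship localized dictionaries.
FIRST_NAMES = ["Anna", "Lukas", "Marie", "Jonas", "Claire"]
LAST_NAMES = ["Schmidt", "Dubois", "Meyer", "Martin", "Weber"]


def fake_name(rng):
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def fake_email(rng):
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"


def fake_iban(rng):
    # IBAN-shaped placeholder: "DE" plus 20 random digits (22 characters).
    # The check digits are random, so this is NOT a valid IBAN.
    return "DE" + "".join(rng.choices(string.digits, k=20))


# Maps a column type to the function that produces one fake value.
GENERATORS = {"name": fake_name, "email": fake_email, "iban": fake_iban}


def simulate_table(column_types, n_rows, seed=None):
    """Build n_rows rows with one fake value per requested column."""
    rng = random.Random(seed)
    return [
        {column: GENERATORS[kind](rng) for column, kind in column_types.items()}
        for _ in range(n_rows)
    ]


rows = simulate_table(
    {"customer": "name", "contact": "email", "account": "iban"},
    n_rows=3,
    seed=42,
)
for row in rows:
    print(row)
```

Passing a seed makes the output reproducible, which is convenient for test fixtures. Distribution rules and localization are left out of this sketch.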

Redact sensitive information in PDF documents

We added PDF support to our sanitization feature to better reflect the use cases and data realities of many organizations.

You can now process PDF files and apply the sanitization operators (masking, redacting, replacing) to these documents, which makes detecting and protecting PII in PDFs much simpler. This functionality also supports multiple languages.

Mask, redact, or replace specific PII in PDF files.


Protect your data in any language: the sanitization operators support multiple languages

If you work with German or French data, chances are that Personally Identifiable Information (PII) is not easily detected automatically.

So we added support for multiple languages to our data sanitization capability. You can now automatically detect and protect names, locations, or any PII in your datasets regardless of the language: mask, redact, or hash sensitive information in German, French, or the language of your choice.

Data masking, redacting, hashing, or replacement is available in multiple languages, including German and French.
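As a rough illustration of what the mask, redact, and hash operators do to a detected value, here is a small plain-Python sketch. It is an assumption-laden toy, not the platform's implementation: it only detects email addresses with a simplified regular expression, whereas detecting names or locations in German or French text requires language-aware models. The names `sanitize`, `mask`, `redact`, and `hash_value` are hypothetical.

```python
import hashlib
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(match):
    # Keep the first character and replace the rest with asterisks.
    value = match.group(0)
    return value[0] + "*" * (len(value) - 1)


def redact(match):
    # Remove the value entirely and leave a fixed marker.
    return "[REDACTED]"


def hash_value(match):
    # Truncated, unsalted SHA-256 digest; a demonstration, not a
    # recommendation for production pseudonymization.
    return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]


def sanitize(text, operator):
    """Apply one operator to every email address found in text."""
    return EMAIL.sub(operator, text)


text = "Kontakt: anna.schmidt@example.com"
print(sanitize(text, mask))
print(sanitize(text, redact))
print(sanitize(text, hash_value))
```

Masking preserves the length of the original value, redacting hides it completely, and hashing maps equal inputs to equal outputs, so records can still be linked without exposing the raw value.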

Integrate flexibly in your environment thanks to new data storage connections and deployment options 

You can now configure a connection to PostgreSQL, MySQL, MariaDB, or Microsoft SQL Server in a few clicks. Connect the data storage of your choice to easily load data into the platform and save anonymized or sanitized datasets.

It takes a few clicks to configure a connection to PostgreSQL, MySQL, MariaDB, or Microsoft SQL Server within our synthetic data software.

We’ve also made deployment more flexible: you can now deploy our solution using Docker Compose.

You can install our solution on virtual machines or bare metal using this deployment method. This option complements the Kubernetes and self-deployment methods already available.

Experience better performance and a new privacy feature in the SDK

We shipped a new major version of our SDK, which improves performance and adds more privacy features. It comes with revised APIs shaped by months of customer feedback and collaboration. This version offers more transparency and control over the synthetic data generation process. Once you obtain synthetic data, you can more easily audit how the synthesizer generated it.

On the privacy side, the SDK now allows the suppression of rare categories in categorical columns to improve the protection of outliers during synthesis. This reinforces the protection against singling-out and membership inference attacks, improving the overall privacy of the synthetic data.

Comparison of value distribution in a synthetic version of the Texas Hospital Discharge dataset with a privacy mechanism to prevent the reproduction of unique outliers.
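The idea behind rare-category suppression can be sketched in a few lines of plain Python. This is only an illustration of the general technique, applied here as a preprocessing step on one column. How the SDK applies it during synthesis is not described above, and the function name, the `min_count` threshold, and the `"OTHER"` placeholder are assumptions made for this example.

```python
from collections import Counter


def suppress_rare_categories(values, min_count, placeholder="OTHER"):
    """Replace categories seen fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


diagnoses = ["flu", "flu", "flu", "fracture", "fracture", "rare_condition"]
print(suppress_rare_categories(diagnoses, min_count=2))
# ['flu', 'flu', 'flu', 'fracture', 'fracture', 'OTHER']
```

Because a category that appears only once can identify a single individual, folding it into a shared bucket means the synthesizer never sees, and so cannot reproduce, that unique value.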
