Product release: redact PII in PDFs and generate fake data from scratch

Our team worked hard over the summer to bring new capabilities to the Statice Platform and SDK. This month, we are releasing new features, including:

- the ability to generate realistic-looking data from scratch for testing scenarios,
- support for PII (Personally Identifiable Information) detection and protection in PDFs,
- new data storage connectors and a Docker Compose deployment option,
- multi-language support for all sanitization operations,
- performance and privacy improvements in the SDK.

Quickly generate fake data from scratch with our new data simulation feature


The new data simulation functionality lets you generate new tabular datasets from scratch. To obtain realistic-looking data in minutes, you only need to specify the column types (emails, names, IBANs, etc.) and the size of the dataset you want to create. This feature is ideal when you have no test or real data, or when your original dataset is too small.

*Create or augment datasets with fake data using our new data simulation feature.*

You can generate, among other things, columns of fake addresses, bank information, credit cards, dates, identities, and contact details. You can choose several regions and languages to obtain localized datasets. You can simulate as many or as few tables and rows as you need, and you can define distribution rules so that your new data matches specific statistical properties.
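To make the idea concrete, here is a minimal Python sketch of what "column types plus dataset size" can look like. It does not use the Statice Platform or SDK: the generator functions, the `simulate_table` helper, the sample names, and the 70/25/5 weighting are all hypothetical and only illustrate the principle of combining column types with a simple distribution rule.

```python
import random
import string

# Hypothetical sample values; not taken from the Statice Platform or SDK.
FIRST_NAMES = ["Anna", "Jonas", "Marie", "Lukas", "Sophie"]
LAST_NAMES = ["Schmidt", "Dubois", "Meyer", "Martin", "Weber"]


def fake_name(rng):
    # Combine a random first and last name.
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def fake_email(rng):
    # Random 8-letter user name on a reserved example domain.
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.com"


def fake_plan(rng):
    # A simple "distribution rule": values are drawn with weights 70/25/5.
    return rng.choices(["basic", "pro", "enterprise"], weights=[70, 25, 5])[0]


# Maps a column type to the function that produces one value of that type.
GENERATORS = {"name": fake_name, "email": fake_email, "plan": fake_plan}


def simulate_table(columns, n_rows, seed=None):
    """Build n_rows rows; `columns` maps column name -> column type."""
    rng = random.Random(seed)
    return [
        {col: GENERATORS[kind](rng) for col, kind in columns.items()}
        for _ in range(n_rows)
    ]


rows = simulate_table(
    {"customer": "name", "contact": "email", "plan": "plan"}, n_rows=5, seed=42
)
for row in rows:
    print(row)
```

Passing a seed makes the output reproducible, which is handy when the simulated table feeds automated tests.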


Redact sensitive information in PDF documents


We added support for PDF documents in our sanitization feature to better match the use cases and the kinds of data many organizations actually work with.

You can now process PDF files and apply the sanitization operators (masking, redacting, replacing) to these documents. Detecting and protecting your PII has never been so easy! This functionality also comes with multi-language support.
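For readers unfamiliar with the three operators, the following Python sketch shows one common interpretation of each, applied to a line of text that is assumed to have already been extracted from a PDF. It is not the Statice implementation: the function names, the simplified email pattern, and the exact masking and replacement behavior are assumptions made for illustration.

```python
import re

# Simplified email pattern for illustration; real PII detection is much broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(text):
    # Masking (assumed behavior): keep the first character, hide the rest.
    return EMAIL.sub(lambda m: m.group()[0] + "*" * (len(m.group()) - 1), text)


def redact(text):
    # Redacting (assumed behavior): remove the value, leave a fixed marker.
    return EMAIL.sub("[REDACTED]", text)


def replace(text, substitute="user@example.com"):
    # Replacing (assumed behavior): swap the value for a plausible fake one.
    return EMAIL.sub(substitute, text)


line = "Contact: jane.doe@corp.de"
print(mask(line))
print(redact(line))
print(replace(line))
```

Masking keeps a hint of the original value, redacting removes it entirely, and replacing keeps the document readable while hiding the real value.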


*Mask, redact, or replace specific PII in PDF files.*

Protect your data in any language: the sanitization operators support multiple languages


If you work with German or French data, chances are that Personally Identifiable Information (PII) is harder to detect automatically.

So we added support for multiple languages to our data sanitization capability. You can now automatically detect and protect names, locations, or any other PII in your datasets regardless of the language: mask, redact, or hash sensitive information in German, French, or the language of your choice.
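As an illustration of the hashing operator, here is a small Python sketch that turns an already-detected value into a stable token using a keyed hash from the standard library. How the Statice Platform hashes values is not described here, so the `hash_pii` helper, the choice of HMAC-SHA-256, and the key are assumptions. The sketch only shows that hashing works the same way whatever the language of the input.

```python
import hashlib
import hmac


def hash_pii(value, key):
    """Return a keyed SHA-256 digest so equal inputs map to equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"replace-with-a-secret-key"  # hypothetical key, for illustration only
print(hash_pii("Jürgen Müller", key))
print(hash_pii("Françoise Lefèvre", key))
```

Because the text is encoded as UTF-8 before hashing, names with umlauts or accents are handled like any other string.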

*Data masking, redacting, hashing, or replacement is available in multiple languages, including German and French.*

Integrate flexibly in your environment thanks to new data storage connections and deployment options


You can now configure a connection to PostgreSQL, MySQL, MariaDB, or Microsoft SQL Server in a few clicks. Connect the data storage of your choice to easily load data into the platform and save anonymized or sanitized datasets.

*It takes a few clicks to configure a connection to PostgreSQL, MySQL, MariaDB, or Microsoft SQL Server within our synthetic data software.*

We’ve also made deployment more flexible: you can now deploy our solution using Docker Compose.

With this method, you can install our solution on virtual machines or bare metal. It adds to the Kubernetes and self-deployment methods already available.

Experience better performance and a new privacy feature in the SDK

We shipped a new major version of our SDK that improves performance and adds more privacy features. It comes with revised APIs shaped by months of customer feedback and collaboration. This version offers more transparency and control over the synthetic data generation process: once you obtain synthetic data, you can more easily audit how the synthesizer generated it.

On the privacy side, the SDK can now suppress rare categories in categorical columns to better protect outliers during synthesization. This reinforces the protection against singling-out and membership inference attacks, improving the overall privacy of the synthetic data.
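The general idea behind rare-category suppression can be sketched in a few lines of Python. This is not the SDK's API or its actual mechanism: the `suppress_rare_categories` function, the `min_count` threshold, and the `"OTHER"` placeholder are assumptions chosen to illustrate the principle that values seen only a handful of times are not carried over as-is.

```python
from collections import Counter


def suppress_rare_categories(values, min_count=5, placeholder="OTHER"):
    """Replace categories seen fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


# Hypothetical categorical column with one value that appears only once.
diagnoses = ["flu"] * 6 + ["fracture"] * 5 + ["rare_condition"]
print(Counter(suppress_rare_categories(diagnoses)))
```

A category that appears only once can point to a single individual, so folding it into a generic bucket removes an easy handle for singling someone out.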


*Comparison of value distribution in a synthetic version of the Texas Discharge Hospital dataset with a privacy mechanism to prevent the reproduction of unique outliers.*
