Fully automatic removal of private information, which enables data sharing and aggregation while protecting users' privacy, is essential in today's privacy-conscious world.
To be effective, AI training requires access to large datasets. Thus, one of the biggest challenges is the seemingly unavoidable trade-off between user privacy and the ability to share and aggregate data.
At Data Summit Connect 2021, Julia Komissarchik, CEO and founder of Glendor, discussed how it is possible to do both, fully automatically, during her presentation, “Ensuring Data Privacy.”
The annual Data Summit event is being held virtually again this year—May 10–May 12—due to the ongoing COVID-19 pandemic.
“Once we have large amounts of data, there are a lot of interesting things that can be done,” Komissarchik said. “It’s getting at that data which can be troublesome.”
AI is data thirsty, she explained. It requires large amounts of data for training and testing.
It is easy to overfit models if the training data does not represent the variability of the world. Non-private data can be procured quite easily, but private information cannot.
Because of regulations like HIPAA and GDPR, developers can have difficulties procuring information to train models.
There are a few ways to procure data for training, such as federated AI, which allows multiple parties to work together on training a model without pooling their raw data.
The ultimate answer is to create systems that remove PII (personally identifiable information) from the data so it can be shared and aggregated.
Failure to follow regulations leads to financial, legal, and reputational damage. In healthcare, data that may identify someone can be handled by redacting or removing the sensitive information. Such data includes medical texts, images, engraved part numbers, and more.
In a healthcare setting, workers can manually block out the information, use a template to remove the information, or fully automate this function.
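As a rough illustration of the template-style option applied to free text, the sketch below replaces a few kinds of identifiers with placeholder labels. It is not the method described in the presentation: the pattern set, the labels, and the `redact` function are assumptions made for this example, and fixed patterns like these miss names, addresses, and anything burned into an image, which is part of why fully automated approaches are of interest.

```python
import re

# Hypothetical patterns for illustration only. Real de-identification needs
# far broader coverage (names, addresses, record numbers, text in images).
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Patient seen 03/14/2021, SSN 123-45-6789, call 555-867-5309 or jdoe@example.com."
print(redact(note))
# Patient seen [DATE], SSN [SSN], call [PHONE] or [EMAIL].
```

Replacing a match with a label rather than deleting it keeps the sentence readable and records what kind of information was removed, which can matter when the redacted text is later used for training.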
There is a hybrid modeling approach that can help, she said. Organizations should divide the space into overlapping regions that each demonstrate homogeneous behavior, then build an individual oracle for each region. When available, use existing models. If data exists in similar domains, build models that can be used for transfer learning. Finally, combine the results of the individual oracles where their regions overlap.
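The idea of per-region oracles can be sketched in a few lines. In the toy example below, the input space is a single number, each region is an interval, and each oracle is a simple function. The article does not say how results should be combined, so the averaging rule used in overlaps is an assumption for this sketch, as are the names `RegionOracle` and `combined_predict`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RegionOracle:
    """A model that is trusted only inside the interval [low, high]."""
    low: float
    high: float
    predict: Callable[[float], float]

    def covers(self, x: float) -> bool:
        return self.low <= x <= self.high


def combined_predict(oracles: List[RegionOracle], x: float) -> float:
    """Average the predictions of every oracle whose region contains x.

    Averaging is an assumed combination rule chosen for simplicity.
    """
    votes = [o.predict(x) for o in oracles if o.covers(x)]
    if not votes:
        raise ValueError(f"no oracle covers x={x}")
    return sum(votes) / len(votes)


# Two toy oracles whose regions overlap on [4, 6].
oracles = [
    RegionOracle(0.0, 6.0, lambda x: 2.0 * x),
    RegionOracle(4.0, 10.0, lambda x: x + 4.0),
]

print(combined_predict(oracles, 2.0))  # 4.0  (first oracle only)
print(combined_predict(oracles, 5.0))  # 9.5  (average of 10.0 and 9.0)
print(combined_predict(oracles, 8.0))  # 12.0 (second oracle only)
```

An existing or transfer-learned model could be dropped in as the `predict` function for any region without changing the combination step.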
Register here now for Data Summit Connect 2021, which continues through Wednesday, May 12.