Introducing Large Language Models (LLMs) for PII Masking
PII refers to any information from which the identity of an individual can reasonably be inferred, either directly or indirectly. Recent studies indicate that PII constitutes a significant portion of organisational data stores. As a result, Chief Information Officers (CIOs) and other C-suite executives are investing substantial time and resources to manage PII, including masking or redacting sensitive information to make data accessible for business purposes without compromising privacy.
PII Masking is a fundamental practice in data privacy. It involves techniques that protect PII by rendering it unreadable or unusable by unauthorized parties. Traditionally, PII masking is performed manually or with specialized tools. However, manual masking is both costly and time-consuming, while conventional tools may lack the adaptability to handle evolving types of PII.
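To illustrate why conventional tools struggle to adapt, here is a minimal sketch of rule-based masking. The patterns and the `mask_with_rules` helper are illustrative assumptions, not a production rule set: each new PII format requires a hand-written rule, and anything outside the rules slips through.

```python
import re

# Illustrative patterns only; real tools ship much larger rule sets,
# but the limitation is the same: rules must be written in advance.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_with_rules(text: str) -> str:
    """Replace every match of a known pattern with the [REDACTED] token."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("[REDACTED]", text)
    return text

print(mask_with_rules("Reach Jane at jane.doe@example.com or 555-123-4567."))
# The email and phone number are masked, but the name "Jane" survives:
# rule-based tools only catch the PII types they were told to anticipate.
```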
In the past decade, Machine Learning (ML) models have gained traction for PII masking. Techniques like Named Entity Recognition (NER)—a natural language processing method that identifies and categorises entities in unstructured text—have been used to automate the classification and removal of PII. Despite their utility, these models have notable drawbacks: they require large volumes of labelled training data, must be retrained whenever new or evolving types of PII emerge, and can struggle with context-dependent identifiers such as incomplete names.
These challenges underscore the need for more flexible and efficient solutions.
Large Language Models (LLMs), such as OpenAI’s GPT-4, offer a transformative approach to PII masking. LLMs excel in Zero-Shot Learning, enabling them to perform tasks without explicit prior training on specific datasets. This capability makes them highly adaptable to new and evolving types of PII without the need for retraining.
By combining Prompt Engineering and Function Calling, organizations can leverage LLMs to ingest text containing PII and output a redacted version. Prompt engineering involves crafting specific instructions that guide the LLM to perform the desired task effectively. Function calling adds structure to the output, simplifying integration with existing systems and reducing the need for extensive exception handling.
Note: While the GPT-4 model from OpenAI is used here for illustration, it’s crucial to choose an LLM that aligns with your organization’s security policies.
# Load Dataset
import datasets

dataset_name = 'ai4privacy/pii-masking-65k'
# Load the first 10% of the training split for experimentation
dataset = datasets.load_dataset(dataset_name, split="train[:10%]")
dataset
# Function Calling
from pydantic import BaseModel, Field

class RedactTextParams(BaseModel):
    redacted_text: str = Field(description="Returned text after masking all PII data with [REDACTED] token")

tool_definitions = [
    {
        "type": "function",
        "function": {
            "name": "redact_pii_from_text",
            "description": "Generates masked text after removing all PII information related to persons "
                           "and companies to be uploaded into the datalake",
            # model_json_schema() already marks redacted_text as required
            "parameters": RedactTextParams.model_json_schema()
        }
    }
]
# Prompt Engineering
import json
from openai import OpenAI

client = OpenAI()

def redact_pii_fc(text):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an AI assistant, skilled in masking "
                "personally identifiable information including incomplete names or other information "
                "related to persons, organizations, including hash keys, crypto addresses, API keys "
                "in text blocks. Do not return as a code block. Mask all PII related words in the following "
                "text with [REDACTED] token.'''" + text + "'''"},
        ],
        seed=42,
        tools=tool_definitions,
        # Force the model to call the redaction function so the output is structured
        tool_choice={"type": "function", "function": {"name": "redact_pii_from_text"}},
    )
    response = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)
    return response["redacted_text"]

Complete code for reference is available here.
To assess the effectiveness of the LLM in PII masking, organizations can compare the model's redacted output against the ground-truth masked text from a labelled dataset and measure the similarity between the two. Below is the similarity distribution between expected and actual output for the LLM-based PII model on a sample.
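One simple way to quantify that comparison is a character-level similarity ratio; this is a sketch using Python's standard-library difflib, and the expected/actual pairs below are hypothetical stand-ins for the dataset's ground-truth masked text and the LLM's output. Production evaluations might use embedding-based similarity instead.

```python
from difflib import SequenceMatcher

def similarity(expected: str, actual: str) -> float:
    """Character-level similarity ratio between expected and actual masked text (0.0 to 1.0)."""
    return SequenceMatcher(None, expected, actual).ratio()

# Hypothetical pairs; in practice, iterate over the evaluation split and
# pair each ground-truth masked text with redact_pii_fc(unmasked_text).
pairs = [
    ("Call [REDACTED] today.", "Call [REDACTED] today."),
    ("Send it to [REDACTED].", "Send it to [REDACTED] at [REDACTED]."),
]
scores = [similarity(e, a) for e, a in pairs]
print(scores)  # the first pair is identical, so its score is 1.0
```

Plotting these scores as a histogram over the sample yields the similarity distribution shown above.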
Data privacy is non-negotiable, especially when dealing with sensitive information. We strongly recommend performing PII masking using local LLMs to ensure data does not leave your secure environment. Guardgen.AI offers a solution that allows you to run LLMs securely within your infrastructure. Our platform facilitates PII masking and other generative AI solutions with ease and confidence.
Implementing PII masking with Large Language Models presents a significant opportunity for organizations to enhance data privacy while reducing operational costs and effort. By embracing this innovative approach, CXOs can ensure their organizations remain compliant with evolving privacy regulations and protect sensitive information effectively.