Advances in Natural Language Processing

Where it might take the world of enterprise data security

  • Type: Blog
  • Date: 20/03/2023
  • Author: Kyle DuPont
  • Tags: Unstructured Data, NLP, ML, Natural Language Processing, Data X-Ray

Natural Language Processing (NLP) is a subfield of machine learning (ML) that has made remarkable advances in recent years. While ML in general has facilitated the development of sophisticated security tools that can help organizations mitigate cyber risks and safeguard their sensitive data, NLP focuses on enabling machines to understand and interpret human language, which has implications for various fields, including cybersecurity and data security in the enterprise.

One of the most significant NLP developments for cybersecurity and data security is that models can now be run on enterprise infrastructure or consumed through machine-learning-as-a-service APIs. Essentially, what had lived in academia for the greater part of 60 years is now in the enterprise. However, the enterprise is also about scale, and NLP models still need work when they have to process tens or hundreds of millions of documents. We are now seeing the rise of large language models and are only beginning to understand their implications for enterprise data security.

NLP in the enterprise today

So what works well at the moment? Quite a lot, and it works at scale. NLP techniques for entity recognition and sentiment analysis in particular are already deployed at scale in the enterprise and are showing value in a number of data security use cases.

Entity recognition involves identifying and extracting specific pieces of information from text, such as names, addresses, and other identifying details. In the context of data security, entity recognition can be used to automatically identify and classify sensitive data, such as credit card numbers or social security numbers, and take appropriate action to protect that data. When used in combination with more traditional techniques like regular expressions and dictionary searches, a highly accurate view of your enterprise data security landscape can emerge.
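To make the "traditional techniques" half of that combination concrete, here is a minimal sketch of a regular expression and dictionary pass in Python. The patterns, the term list, and the function names are all assumptions made up for illustration, not the rules of any particular product, and a real deployment would pair a pass like this with a trained entity recognition model.

```python
import re

# Hypothetical patterns and terms, chosen only for illustration.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators
SENSITIVE_TERMS = {"confidential", "salary", "passport"}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs found by regex and dictionary lookup."""
    findings = []
    for m in SSN_RE.finditer(text):
        findings.append(("ssn", m.group()))
    for m in CARD_RE.finditer(text):
        # The checksum filters out digit runs that merely look like card numbers.
        if luhn_valid(m.group()):
            findings.append(("credit_card", m.group()))
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in SENSITIVE_TERMS:
            findings.append(("dictionary_term", word))
    return findings


sample = "Confidential: SSN 123-45-6789, card 4111 1111 1111 1111."
findings = find_sensitive(sample)
```

The checksum step shows why layering techniques helps: a pattern match alone would flag any 16-digit run, while the extra validation cuts false positives before a human or a downstream model ever sees them.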

One of the most significant benefits of entity recognition in enterprise data security is that it can help organizations comply with regulatory requirements such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). These regulations require organizations to protect the privacy and security of sensitive data, and entity recognition can be used to identify and classify that data automatically. This can help organizations ensure that they are in compliance with these regulations and avoid costly penalties and legal action.

Another benefit of entity recognition in enterprise data security is that it can help organizations minimize or respond to data breaches more quickly and efficiently. By automatically identifying and classifying sensitive data, entity recognition can alert security teams to potentially sensitive data and help them take action to prevent data loss or theft in case of a breach. This can help organizations avoid the reputational damage and financial costs associated with data breaches, which can be significant.

Sentiment analysis is another key technique in natural language processing that can be highly useful in enterprise data security. This technique involves automatically identifying the emotional tone or sentiment expressed in a piece of text, such as an employee chat. In the context of data security, sentiment analysis can be used to identify potential security threats, such as negative sentiment expressed towards the company or a coworker.
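As a rough illustration of the idea, the sketch below scores messages against a tiny word list and flags the most negative ones for human review. The word lists, the threshold, and the function names are hypothetical, and production systems generally use trained models rather than a hand-written lexicon like this one.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"great", "thanks", "happy", "love", "helpful"}
NEGATIVE = {"hate", "angry", "unfair", "quit", "terrible"}


def sentiment_score(message: str) -> float:
    """Score in [-1, 1]: (positive words - negative words) / total words."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def flag_for_review(messages: list[str], threshold: float = -0.1) -> list[str]:
    """Return the messages whose score is at or below the (assumed) threshold."""
    return [m for m in messages if sentiment_score(m) <= threshold]


messages = [
    "Thanks, that was really helpful",
    "I hate this unfair process",
    "Meeting moved to 3pm",
]
flagged = flag_for_review(messages)
```

Here only the second message is flagged. The output is a queue for a person to look at, not a verdict about the sender.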

Beyond negative security implications, sentiment analysis can also be used to improve employee engagement and productivity in the enterprise. By analyzing the sentiment expressed in employee feedback, for example, organizations can identify potential issues with employee morale or job satisfaction and take action to address those issues. This can help organizations improve employee retention, productivity, and overall performance, which can have a significant impact on the organization's success.

NLP in the enterprise tomorrow

However, by far the biggest recent news in the NLP space is the rise of large language models (LLMs) like GPT-3 that allow for highly accurate text generation. It is not yet clear exactly how these models will integrate into the enterprise data security stack. Indeed, even with the large server infrastructure behind ChatGPT (which is essentially a chat layer over the GPT-3 family of models), the models can only analyze a few hundred to a few thousand words per second. Compare that to other NLP techniques that reach hundreds of thousands or millions of words per second, and scale becomes a big concern in the enterprise.
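A quick back-of-the-envelope calculation shows how large that gap is. The throughput figures below are taken from the ranges quoted above; the corpus size of 100 million documents and the average of 500 words per document are assumptions picked purely for illustration.

```python
SECONDS_PER_DAY = 86_400


def days_to_process(num_docs: int, words_per_doc: int, words_per_second: float) -> float:
    """Days needed to read every word once on a single processing stream."""
    total_words = num_docs * words_per_doc
    return total_words / words_per_second / SECONDS_PER_DAY


NUM_DOCS = 100_000_000   # assumed corpus size
WORDS_PER_DOC = 500      # assumed average document length

llm_days = days_to_process(NUM_DOCS, WORDS_PER_DOC, 1_000)        # LLM-style throughput
fast_days = days_to_process(NUM_DOCS, WORDS_PER_DOC, 1_000_000)   # lightweight NLP throughput

print(f"LLM-style throughput: {llm_days:,.0f} days")
print(f"Lightweight NLP:      {fast_days * 24:,.1f} hours")
```

Under these assumptions a single stream at 1,000 words per second needs about 579 days, while one at 1,000,000 words per second finishes in roughly 14 hours. Parallel hardware narrows the gap in wall-clock time, but the thousandfold difference in compute per word remains.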

Some possible areas where LLMs may play a role in enterprise data security include more accurate categorization schemes than are possible today, suggestions of newly sensitive items that are adjacent to currently sensitive items, and a new UX for updating security rules and pipelines. All of these rest on the assumption that LLMs will eventually be able to operate at enterprise scale.

More accurate document categorization

Unlike word-by-word classification, document categorization means finding full-text records that are similar to or different from other full-text records based on the “tone” of a document, ideally at a much higher level of accuracy than current techniques achieve.

For instance, an LLM may be able to differentiate an internal document written for a board of directors from a document that is written for general internal consumption by analyzing various linguistic features, such as the level of formality, technicality, and specificity of language used in the document. It is likely that internal documents written for a board of directors are more formal, technical, and specific in nature, as they are intended to provide detailed information and analysis to support decision-making at the highest level of the organization. These documents often include complex financial or legal terms, as well as detailed analysis and recommendations based on data and research. An LLM can analyze the language used in such documents and identify the specific terminology and phrasing that is commonly used in this context, allowing it to distinguish them from documents written for general internal consumption.

On the other hand, documents written for general internal consumption are often less formal, less technical, and less specific in nature, as they are intended to communicate information and updates to a broader audience within the organization. These documents may include more conversational language, simpler terminology, and less detailed analysis, as they are intended to be accessible to a wider range of readers. An LLM may be able to analyze the language used in these documents and identify the use of simpler and more conversational language, as well as a greater use of colloquial terms, abbreviations, and acronyms.

Inheriting sensitivity across documents

Beyond the “tone” of the document, there may be underlying patterns in a document that are not easily matched by entity recognition or more traditional dictionary and regular expression techniques. This is what we call Inherited Sensitivity. The main advantage would be the ability to suggest (but, importantly, not rely on!) possible new security classifications, or to surface inherited features of a document that point to high sensitivity.

In the enterprise where supervised training for these types of use cases is difficult from a resourcing and people perspective, this would necessarily rely on unsupervised machine learning techniques, such as clustering, to group documents that have similar language patterns or themes together. This approach does not require labeled data as current models do, but instead relies on the model to identify patterns in the language used in the documents and group them together based on those patterns. Once the documents have been grouped, a human expert can review them to determine which ones are actually sensitive and should be classified as such.

It's important to have a human expert review the results of the model's analysis to ensure that the documents flagged as potentially sensitive are actually sensitive and require additional protection.
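The group-then-review workflow can be sketched in a few lines of Python. To keep the example self-contained it uses plain word counts and a greedy single-pass clustering on cosine similarity, which is a deliberately simple stand-in for the language-model-based grouping discussed here. The sample documents, the threshold, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedy single-pass clustering; no labeled data is needed.

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold`, otherwise it starts a new cluster.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # centroid is the summed word counts
            best["members"].append(i)
    return [c["members"] for c in clusters]


DOCS = [
    "merger term sheet valuation board approval",
    "board approval of merger valuation and term sheet",
    "office party friday pizza lunch",
    "pizza lunch friday office party reminder",
]
groups = cluster(DOCS)  # lists of document indices, one list per group
```

With these sample documents the result is two groups, [0, 1] and [2, 3]. The reviewer then decides that the first group is sensitive and the second is not, which means they label two clusters instead of four documents. That saving is what makes the approach attractive when labeling resources are scarce.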

A new UX to crowdsource data security

It might be that the biggest advantage of new machine learning models in enterprise data security is actually a new UX, something like a chatbot for data security. In today’s data security products it is very cumbersome for non-expert users to update security rules and classifications, or even to understand why a document is classified as more or less sensitive. If users could interact with a human-like chat interface, they might trust the sensitivity classifications more and also feed into the enterprise-wide classification scheme. When many employees in an organization can easily collaborate on data security in a human-centered way, perhaps a greater level of compliance with security rules and a more secure data environment overall will be achieved.


The world of NLP is innovating at a pace unseen in recent years. It will continue to gather pace as this technology penetrates all corners of the enterprise.

We at Ohalo are already starting to use some of these techniques in deployments for our customers, to create order out of their data chaos and make their data more secure.

Contact us to find out more.
