August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
Humans produce so much text data that we rarely stop to consider the value it holds for businesses and society today. We overlook its importance because it is part of our day-to-day lives and easy for us to understand, but feed that same text data into a computer and it becomes a real challenge to work out what is being said or what is happening.
This is where NLP (Natural Language Processing) comes into play — the process used to help computers understand text data. However, this is not an easy task. Learning a language is already hard for us humans, so you can imagine how difficult it is to teach a computer to understand text data.
Although NLP has been growing and works hand-in-hand with NLU (Natural Language Understanding) to help computers understand and respond to human language, the major challenge they face is how fluid and inconsistent language can be.
So let’s look into some of these challenges and a few solutions.
Context carries far more of a message's meaning than the words alone. Yes, words make up text data, but words and phrases take on different meanings depending on the context of a sentence. As humans, we learn from birth to pick up on context. NLP models, on the other hand, are fed many words and definitions, yet one thing they struggle with is differentiating context.
Let's take Mandarin, for example. The language has four main tones, and the same syllable spoken with a different tone can mean something entirely different. On top of that, Mandarin has many homophones: words that share the same pronunciation but have different meanings. This can make tasks such as speech recognition difficult, because the audio has to be disambiguated before it ever becomes text data.
Embedding is the representation of words for text analysis. It helps a machine better understand human language through a distributed representation of the text in an n-dimensional space. The technique is widely used to tackle NLP challenges, one of them being understanding the context of words.
There are two types of embedding techniques I will talk about here: word embeddings and contextual embeddings.
Both embedding techniques aim to learn a representation of each word in the form of a vector.
Word embedding builds a single global vocabulary, assigning one vector to each unique word without taking context into consideration. From this, the model can learn which words frequently appear close to one another in a document. However, the limitation of word embedding is exactly the challenge we have been talking about: context.
The most popular word embedding technique is word2vec, which uses a shallow neural network to learn word associations from a large corpus of text. Its major limitation, however, is context: because each word gets a single vector, polysemous words (words with more than one meaning) are not handled well.
Some examples of polysemous words: "bank" can refer to a financial institution or to the side of a river, and "bat" can refer to an animal or to a piece of sports equipment.
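To make this concrete, here is a minimal sketch of training a word2vec model with the gensim library; the toy corpus and parameter values below are illustrative assumptions, not part of the original article.

```python
# Minimal word2vec sketch using gensim (illustrative corpus and parameters).
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of already-tokenized, lowercased words.
corpus = [
    ["the", "bank", "approved", "the", "loan"],
    ["she", "deposited", "money", "at", "the", "bank"],
    ["they", "walked", "along", "the", "river", "bank"],
    ["the", "river", "flooded", "the", "bank", "last", "spring"],
]

# Train a small model; vector_size, window, and min_count are illustrative.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

# Every occurrence of "bank" maps to the SAME vector, whether it meant
# a financial institution or a riverbank: this is the polysemy limitation.
print(model.wv["bank"][:5])          # one static vector per word
print(model.wv.most_similar("bank")) # neighbours reflect co-occurrence only,
                                     # with no way to separate the two senses
```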
This is where contextual embedding comes into play: it learns sequence-level semantics by taking the entire sequence of words in a document into consideration. This technique can help overcome challenges within NLP and gives the model a much better understanding of polysemous words.
Contextual word embedding builds a separate vector for every occurrence of a word, so each token's representation reflects the entire input sentence.
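As a rough sketch of how contextual embeddings behave, the snippet below uses the Hugging Face transformers library with a BERT model; the model name and sentences are assumptions chosen for illustration. The same word "bank" receives a different vector in each sentence.

```python
# Contextual embedding sketch using Hugging Face transformers (illustrative).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "She deposited money at the bank.",
    "They walked along the river bank.",
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state holds one vector per token, conditioned on the whole sentence.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_index = tokens.index("bank")
        bank_vector = outputs.last_hidden_state[0, bank_index]
        print(text, bank_vector[:5])  # different vectors for the two "bank"s
```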
Everybody makes spelling mistakes, and most of us can still gauge what a misspelt word was meant to be. For computers, however, this is a major challenge: they don't have the same ability to infer the intended word and take the text literally, which makes NLP very sensitive to spelling mistakes.
Cosine similarity is one method that can be used to resolve spelling mistakes in NLP tasks. It measures the cosine of the angle between two vectors in a multi-dimensional space, so it compares direction rather than magnitude; this matters because, as documents grow, the raw count of shared words naturally increases regardless of whether the topics match.
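For reference, the standard definition of cosine similarity between two vectors A and B with n components is:

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```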
Applied to spelling correction, we can represent each word as a vector of its letter counts and take the cosine between the misspelt word and each dictionary word. By setting a threshold, we can scan through words whose spelling is similar to the misspelt word and treat those that score above the threshold as potential replacements.
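Here is a minimal sketch of that idea, assuming each word is represented by its character counts and compared against a small illustrative dictionary; the dictionary, threshold value, and function names are assumptions, not a prescribed implementation.

```python
# Spelling-correction sketch: cosine similarity over character-count vectors.
import math
from collections import Counter

def cosine_similarity(word_a: str, word_b: str) -> float:
    """Cosine of the angle between the character-count vectors of two words."""
    a, b = Counter(word_a), Counter(word_b)
    dot = sum(a[ch] * b[ch] for ch in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)

dictionary = ["language", "luggage", "learning", "lineage"]  # illustrative dictionary
misspelt = "langauge"
threshold = 0.9  # illustrative threshold

scores = {word: cosine_similarity(misspelt, word) for word in dictionary}
candidates = [(word, score) for word, score in scores.items() if score >= threshold]
print(sorted(candidates, key=lambda pair: pair[1], reverse=True))
# -> [('language', 1.0)]  "langauge" contains exactly the same letters as
# "language", so their count vectors match and the similarity is 1.0.
```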
Most of the data used for NLP tasks comes from conversations, emails, tweets, etc. This type of data is highly unstructured, which creates many challenges when trying to produce useful information.
Before you start cooking, preparing your ingredients makes your life 10x easier. You don’t want to be in the middle of cooking your dish and realize you have three missing ingredients.
The same applies when working with data: you want to ensure the data you feed into your NLP model is of high quality. Solutions include text standardization, lemmatization, stemming, and tokenization.
I will briefly expand on a few of them below.
Text standardization is the process of expanding contractions into their complete words. Contractions are words or combinations of words that are shortened by dropping a letter or letters and replacing them with an apostrophe.
Although contractions simplify our text and speech and we understand them easily, machines handle text data better when it uses full words. For example, "can't" will be standardized to "can not."
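A simple way to do this is with a look-up dictionary; the mapping and function below are a small illustrative sample rather than an exhaustive list, and ready-made contraction-expansion libraries exist as well.

```python
# Contraction-expansion sketch using a small, illustrative look-up dictionary.
import re

CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)

print(expand_contractions("I can't believe it's working"))
# -> "I can not believe it is working"
```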
In linguistics, lemmatization means grouping together different forms of the same word and reducing them to their base form. For example, the words "tried," "tries," and "trying" will all be converted and grouped to "try."
Stemming has a similar goal, but it does not apply any context during its process: it simply removes the last few characters of a word to find the "stem." For example, stemming the word "caring" can reduce it to "car," whereas lemmatization applies context and brings "caring" down to its base form, "care."
Lemmatization is more computationally expensive than stemming because it needs to scan through look-up tables such as a dictionary of word forms.
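The sketch below contrasts the two using NLTK's WordNet lemmatizer and Porter stemmer; note that exact stemmer output varies by algorithm, and the WordNet look-up data has to be downloaded first.

```python
# Lemmatization vs. stemming sketch using NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # look-up data required by the lemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for word in ["tried", "tries", "trying"]:
    # pos="v" tells the lemmatizer to treat the word as a verb.
    lemma = lemmatizer.lemmatize(word, pos="v")
    stem = stemmer.stem(word)
    print(f"{word} -> lemma: {lemma} | stem: {stem}")

# The lemmatizer maps all three forms to the dictionary word "try";
# NLTK's Porter stemmer just chops suffixes, so all three become the
# non-word "tri".
```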
Tokenization is the process of splitting paragraphs and sentences into smaller units; in other words, splitting a string of text into a list of tokens. For example, the sentence "this article is about text data" will be split into the individual tokens "this," "article," "is," "about," "text," and "data."
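A quick sketch of word-level tokenization, shown with a plain whitespace split and with a simple regex tokenizer; the sentence is the article's own example.

```python
# Word-level tokenization sketch: a naive split vs. a simple regex tokenizer.
import re

sentence = "This article is about text data."

# Naive approach: split on whitespace (punctuation stays attached to words).
print(sentence.split())
# -> ['This', 'article', 'is', 'about', 'text', 'data.']

# Regex approach: keep only runs of word characters, dropping punctuation.
print(re.findall(r"\w+", sentence.lower()))
# -> ['this', 'article', 'is', 'about', 'text', 'data']
```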
These are some of the most common challenges faced in NLP, and they can be resolved with relatively simple techniques. The main problem with many models and the output they produce comes down to the data they are fed. If you focus on improving the quality of your data with a Data-Centric AI mindset, you will start to see the accuracy of your models' output increase.