Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources.

In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Information extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources; the goal is to identify phrases in language that refer to specific types of entities and the relations among them.

Named entity recognition is the task of identifying names of people, organizations, and other entities in text.

Relation extraction identifies specific relations between entities; for example, the sentence “A man works for IBM” expresses the relation PERSON works for ORGANIZATION. Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects.

A difficult task is to extract instances of addresses from Web pages, with database fields for street, city, state, and zip code.

Finite state automata for information extraction:


The simplest type of information extraction system is an attribute-based extraction system that assumes that the entire text refers to a single object and the task is to extract attributes of that object.

For example, consider the problem of extracting from the text “IBM ThinkBook 970. Our price: Rs.399.00” the set of attributes {Manufacturer = IBM, Model = ThinkBook 970, Price = Rs.399.00}.

We can address this problem by defining a template (also known as a pattern) for each attribute we would like to extract. The template is defined by a finite state automaton, the simplest example of which is the regular expression, or regex.

Regular expressions are used in UNIX commands such as grep and in word processors such as Microsoft Word.

Templates are usually defined in three parts: a prefix regex, a target regex, and a postfix regex. For prices, the target regex matches the price value itself, the prefix looks for strings such as “price:”, and the postfix could be empty. The idea is that some clues about an attribute come from the attribute value itself and some come from the surrounding text.
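As a minimal sketch of such a template (the regexes and the product text below are assumptions made up for illustration, not part of any standard system), the three parts for a price might be combined like this in Python:

```python
import re

# A three-part template for the Price attribute: prefix, target, postfix.
# The regexes below are illustrative assumptions, not a standard.
PRICE_TEMPLATE = re.compile(
    r"(?:our\s+price|price)\s*:\s*"   # prefix: clues from the surrounding text
    r"(Rs\.?\s?\d+[.,]\d{2})"         # target: the price value itself
    r"",                              # postfix: empty in this example
    re.IGNORECASE)

text = "IBM ThinkBook 970. Our price: Rs.399.00"
match = PRICE_TEMPLATE.search(text)
if match:
    print("Price =", match.group(1))  # -> Price = Rs.399.00
```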

If a regular expression for an attribute matches the text exactly once, then we can pull out the portion of the text that is the value of the attribute. If there is no match, all we can do is give a default value or leave the attribute missing; but if there are several matches, we need a process to choose among them.

One strategy is to have several templates for each attribute, ordered by priority.

For example, the top-priority template for price might look for the prefix “our price:”; if that is not found, we look for the prefix “price:”; and if that is not found, the empty prefix. Another strategy is to take all the matches and find some way to choose among them.
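A hypothetical sketch of that priority ordering (the template list and regexes are assumed for illustration):

```python
import re

# Templates for the Price attribute, ordered from highest to lowest priority.
PRICE_TEMPLATES = [
    re.compile(r"our\s+price\s*:\s*(Rs\.?\s?\d+[.,]\d{2})", re.IGNORECASE),
    re.compile(r"price\s*:\s*(Rs\.?\s?\d+[.,]\d{2})", re.IGNORECASE),
    re.compile(r"(Rs\.?\s?\d+[.,]\d{2})", re.IGNORECASE),   # empty prefix
]

def extract_price(text):
    # Return the first match found by the highest-priority template.
    for template in PRICE_TEMPLATES:
        m = template.search(text)
        if m:
            return m.group(1)
    return None   # no match: use a default value or leave the attribute missing

print(extract_price("IBM ThinkBook 970. Our price: Rs.399.00"))
```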

One step up from attribute-based extraction systems are relational extraction systems, which deal with multiple objects and the relations among them. Thus, when these systems see the text “Rs.249.99,” they need to determine not just that it is a price, but also which object has that price.

A typical relational extraction system is FASTUS, which handles news stories about corporate mergers and acquisitions. It can read the story:

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

and extract the relations:

∃e  e ∈ JointVentures ∧ Product(e, “golf clubs”) ∧ Date(e, “Friday”)
∧ Member(e, “Bridgestone Sports Co.”) ∧ Member(e, “a local concern”)
∧ Member(e, “a Japanese trading house”).

A relational extraction system can be built as a series of cascaded finite-state transducers.

So the system consists of a series of small, efficient finite-state automata (FSAs), where each automaton receives text as input, transduces the text into a different format, and passes it along to the next automaton.

FASTUS consists of five stages:

  • Tokenization
  • Complex-word handling
  • Basic-group handling
  • Complex-phrase handling
  • Structure merging
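Schematically, such a cascade is just a chain of stage functions, each feeding the next (the stubs below only show the data flow, not FASTUS’s actual rules):

```python
# Purely schematic sketch of a cascade of transducers: each stage consumes
# the previous stage's output and transduces it into a new representation.
# The stage bodies are empty stubs, not FASTUS's actual rules.
def tokenize(text):             return text.split()
def complex_words(tokens):      return tokens    # merge "set up", company names, ...
def basic_groups(tokens):       return tokens    # tag noun groups and verb groups
def complex_phrases(groups):    return groups    # combine groups into event phrases
def merge_structures(events):   return events    # merge co-referring templates

def run_cascade(text):
    data = text
    for stage in (tokenize, complex_words, basic_groups,
                  complex_phrases, merge_structures):
        data = stage(data)      # output of one automaton feeds the next
    return data

print(run_cascade("Bridgestone Sports Co. said Friday it has set up a joint venture."))
```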

FASTUS’s first stage is tokenization, which segments the stream of characters into tokens (words, numbers, and punctuation).

For English, tokenization can be fairly simple; just separating characters at white space or punctuation does a fairly good job. Some tokenizers also deal with markup languages such as HTML, SGML, and XML.
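A minimal whitespace-and-punctuation tokenizer along these lines might look like this (a sketch, not FASTUS’s tokenizer):

```python
import re

# A minimal tokenizer (an assumption-laden sketch): words and numbers are
# runs of word characters, everything else non-space becomes its own token.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Bridgestone Sports Co. said Friday it has set up a joint venture."))
# -> ['Bridgestone', 'Sports', 'Co', '.', 'said', 'Friday', 'it', 'has',
#     'set', 'up', 'a', 'joint', 'venture', '.']
```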

The second stage handles complex words, including collocations such as “set up” and “joint venture,” as well as proper names such as “Bridgestone Sports Co.” These are recognized by a combination of lexical entries and finite-state grammar rules.

For example, a company name might be recognized by the rule

CapitalizedWord+ (“Company” | “Co” | “Inc” | “Ltd”).
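Such a rule could be approximated by a regular expression (a rough sketch; the set of company designators is only an assumption):

```python
import re

# Rough regex version of the rule above; the list of company designators
# ("Company", "Co", "Inc", "Ltd") is an assumption, not a fixed standard.
COMPANY = re.compile(r"(?:[A-Z][a-z]+\s+)+(?:Company|Co|Inc|Ltd)\.?")

print(COMPANY.search("Bridgestone Sports Co. said Friday ...").group())
# -> Bridgestone Sports Co.
```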

The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk these into units that will be managed by the later stages.

Here we have simple rules that only approximate the complexity of English, but they have the advantage of being representable by finite-state automata.

The example sentence would emerge from this stage as the following sequence of tagged groups:

  • 1 NG: Bridgestone Sports Co.
  • 2 VG: said
  • 3 NG: Friday
  • 4 NG: it
  • 5 VG: had set up
  • 6 NG: a joint venture
  • 7 PR: in
  • 8 NG: Taiwan
  • 9 PR: with
  • 10 NG: a local concern
  • 11 CJ: and
  • 12 NG: a Japanese trading house
  • 13 VG: to produce
  • 14 NG: golf clubs
  • 15 VG: to be shipped
  • 16 PR: to
  • 17 NG: Japan

Here NG stands for noun group, VG stands for verb group, PR stands for preposition, and CJ stands for conjunction.
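As a toy illustration of this kind of finite-state chunking (not the FASTUS grammar; it assumes the words have already been tagged with parts of speech), a noun group could be matched as an optional determiner, any number of adjectives, and one or more nouns:

```python
import re

# Toy finite-state rule for noun groups, written over a string of
# part-of-speech tags (the tags and the example are assumptions for
# illustration): optional determiner, any adjectives, one or more nouns.
NOUN_GROUP = re.compile(r"(?:DT )?(?:JJ )*(?:NN )+")

tags = "DT JJ NN VB NN NN "    # e.g. "a joint venture produce golf clubs"
for m in NOUN_GROUP.finditer(tags):
    print("NG over tags:", m.group().strip())
# -> NG over tags: DT JJ NN
# -> NG over tags: NN NN
```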

The fourth stage combines the basic groups into complex phrases.

Again, the aim is to have rules that are finite-state and can therefore be processed quickly, and that will result in unambiguous (or nearly unambiguous) output phrases. One type of combination rule deals with domain-specific events. For example, the rule Company+ SetUp JointVenture (“with” Company+)?

captures one way to describe the formation of a joint venture. This stage is the first one in the cascade where the output is placed into a database template as well as being placed in the output stream.
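To make the fourth-stage rule concrete, here is a hypothetical sketch in which each tagged group is encoded as a “LABEL:text|” segment and the joint-venture rule becomes a pattern over that encoding (both the encoding and the regex are invented for illustration):

```python
import re

# Hypothetical encoding of the tagged groups as "LABEL:text|" segments,
# and the rule  Company+ SetUp JointVenture ("with" Company+)?  written
# as a pattern over that encoding. Both are invented for illustration.
groups = ("CO:Bridgestone Sports Co.|VG:had set up|NG:a joint venture|"
          "PR:with|CO:a local concern|")

RULE = re.compile(
    r"(?:CO:[^|]+\|)+"                # Company+
    r"VG:[^|]*set up\|"               # SetUp
    r"NG:[^|]*joint venture\|"        # JointVenture
    r"(?:PR:with\|(?:CO:[^|]+\|)+)?"  # optional ("with" Company+)
)

m = RULE.search(groups)
if m:
    print("Joint-venture event:", m.group())
```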

The final stage merges structures that were built up in the previous step. If the next sentence says “The joint venture will start production in January,” then this step will notice that there are two references to a joint venture and that they should be merged into one. This is an instance of the identity uncertainty problem.

In general, finite-state template-based information extraction works well for a restricted domain in which it is possible to predetermine what subjects will be discussed and how they will be mentioned.

The cascaded transducer model helps modularize the necessary knowledge, easing construction of the system. These systems work especially well when they are reverse engineering text that has been generated by a program.

For example, a shopping site on the Web is generated by a program that takes database entries and formats them into Webpages; a template-based extractor then recovers the original database.

Finite-state information extraction is less successful at recovering information in highly variable format, such as text written by humans on a variety of subjects.

Probabilistic model for information extraction:

Probabilistic language models based on n-grams recover a surprising amount of information about a language. They can perform well on such diverse tasks as language identification, spelling correction, genre classification, and named-entity recognition.

These language models have millions of features, so feature selection and preprocessing of the data to reduce noise are important.
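As a toy illustration of one such task, a character-trigram model can already do rough language identification (the two one-sentence “corpora” below are invented stand-ins for real training text):

```python
from collections import Counter

# Toy character-trigram language identifier; the two one-sentence "corpora"
# below are invented stand-ins for real training text.
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german":  "der schnelle braune fuchs springt ueber den faulen hund",
}

def trigrams(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

profiles = {lang: trigrams(text) for lang, text in samples.items()}

def identify(text):
    grams = trigrams(text)
    # Score each language by trigram overlap (crude and unsmoothed).
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))

print(identify("the brown dog"))   # -> english
```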

When information extraction must be attempted from noisy or varied input, simple finite state approaches fare poorly. It is too hard to get all the rules and their priorities right; it is better to use a probabilistic model rather than a rule-based model. The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or HMM.

To apply HMMs to information extraction, we can either build one big HMM for all the attributes or build a separate HMM for each attribute. We’ll do the second.

The observations are the words of the text, and the hidden states are whether we are within the target, prefix, or postfix part of the attribute template, or in the background (not part of a template).
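A minimal hand-rolled Viterbi decoder over these four hidden states might look like the sketch below; every probability in it is an invented toy value, not a parameter learned from real announcements:

```python
# A minimal Viterbi decoder for one attribute HMM (here, the "speaker"
# attribute). All probabilities are invented toy numbers.
STATES = ["BG", "PRE", "TARGET", "POST"]     # background, prefix, target, postfix
START = {"BG": 0.9, "PRE": 0.1, "TARGET": 0.0, "POST": 0.0}
TRANS = {                                    # P(next state | current state)
    "BG":     {"BG": 0.8, "PRE": 0.2, "TARGET": 0.0, "POST": 0.0},
    "PRE":    {"BG": 0.0, "PRE": 0.4, "TARGET": 0.6, "POST": 0.0},
    "TARGET": {"BG": 0.1, "PRE": 0.0, "TARGET": 0.6, "POST": 0.3},
    "POST":   {"BG": 1.0, "PRE": 0.0, "TARGET": 0.0, "POST": 0.0},
}
EMIT = {                                     # P(word | state), toy values
    "PRE":    {"seminar": 0.2, "by": 0.2},
    "TARGET": {"dr.": 0.2, "andrew": 0.1, "mccallum": 0.1},
    "POST":   {"on": 0.3},
}

def emit(state, word):
    return EMIT.get(state, {}).get(word, 0.001)   # small prob. for unseen words

def viterbi(words):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * emit(s, words[0]), [s]) for s in STATES}
    for word in words[1:]:
        new = {}
        for s in STATES:
            p, prev = max(((best[r][0] * TRANS[r][s], r) for r in STATES),
                          key=lambda pair: pair[0])
            new[s] = (p * emit(s, word), best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda pair: pair[0])[1]

words = "there will be a seminar by dr. andrew mccallum on friday".split()
print(list(zip(words, viterbi(words))))
```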

For example, here is a brief text and the most probable (Viterbi) path for that text for two HMMs, one trained to recognize the speaker in a talk announcement and one trained to recognize dates. The “–” indicates a background state:

Text:    There  will  be  a  seminar  by   Dr.     Andrew  McCallum  on    Friday
Speaker: –      –     –   –  PRE      PRE  TARGET  TARGET  TARGET    POST  –
Date:    –      –     –   –  –        –    –       –       –         PRE   TARGET

HMMs have two big advantages over FSAs for extraction:

1. HMMs are probabilistic, and thus tolerant of noise. In a regular expression, if a single expected character is missing, the regex fails to match; with HMMs there is graceful degradation with missing characters/words, and we get a probability indicating the degree of match, not just a Boolean match/fail.

Extraction figure:

Hidden Markov model for the speaker of a talk announcement. The two square states are the target, the four circles to the left are the prefix, and the one on the right is the postfix. For each state, only a few of the high-probability words are shown. From Freitag and McCallum (2000).

2. HMMs can be trained from data: they don’t require laborious engineering of templates, and thus they can more easily be kept up to date as text changes over time.

Note that we have assumed a certain level of structure in our HMM templates: they all consist of one or more target states, any prefix states must precede the targets, postfix states must follow the targets, and all other states must be background. This structure makes it easier to learn HMMs from examples.

For example, the word “Friday” would have high probability in one or more of the target states of the date HMM, and lower probability elsewhere. With sufficient training data, the HMM automatically learns a structure of dates that we find intuitive: the date HMM might have one target state in which the high-probability words are “Monday,” “Tuesday,” etc., and which has a high-probability transition to a target state with the words “Jan,” “January,” “Feb,” etc.
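In the simplest case, training from labeled data amounts to counting transitions and emissions and normalizing (a maximum-likelihood sketch with tiny made-up training sequences; a real system would use far more data and smoothing):

```python
from collections import Counter, defaultdict

# Simplified maximum-likelihood training by counting, using two tiny
# made-up labeled sequences of (word, state) pairs for a date attribute.
training = [
    [("on", "PRE"), ("friday", "TARGET")],
    [("on", "PRE"), ("monday", "TARGET"), ("jan", "TARGET")],
]

trans_counts = defaultdict(Counter)
emit_counts = defaultdict(Counter)
for seq in training:
    for (_, s), (_, s_next) in zip(seq, seq[1:]):
        trans_counts[s][s_next] += 1               # count state transitions
    for word, s in seq:
        emit_counts[s][word] += 1                  # count word emissions

def normalize(counts):
    # Turn raw counts into conditional probabilities (no smoothing, for brevity).
    return {s: {k: n / sum(c.values()) for k, n in c.items()}
            for s, c in counts.items()}

print(normalize(trans_counts))   # e.g. P(TARGET | PRE) = 1.0
print(normalize(emit_counts))    # e.g. P("friday" | TARGET) = 1/3
```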

Figure shows the HMM for the speaker of a talk announcement, as learned from data.

The prefix covers expressions such as “Speaker: ” and “seminar by,” and the target has one state that covers titles and first names and another state that covers initials and last names.

Once the HMMs have been learned, we can apply them to a text, using the Viterbi algorithm to find the most likely path through the HMM states. One approach is to apply each attribute HMM separately; in this case you would expect most of the HMMs to spend most of their time in background states.

This is appropriate when the extraction is sparse, that is, when the number of extracted words is small compared to the length of the text.

The other approach is to combine all the individual attributes into one big HMM, which would then find a path that wanders through different target attributes, first finding a speaker target, then a date target, and so on.

Separate HMMs are better when we expect just one of each attribute in a text, and one big HMM is better when the texts are more free-form and dense with attributes. With either approach, in the end we have a collection of attribute observations and have to decide what to do with them.

If every expected attribute has one target filler, then the decision is easy: we have an instance of the desired relation.

If there are multiple fillers, we need to decide which to choose, as we discussed for template-based systems. HMMs have the advantage of supplying probability numbers that can help make the choice.

If some targets are missing, we need to decide if this is an instance of the desired relation at all, or if the targets found are false positives. A machine learning algorithm can be trained to make this choice.
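For instance, if each candidate filler carries the probability of the Viterbi path that produced it (a hypothetical fragment), choosing among multiple fillers can be as simple as:

```python
# Hypothetical candidate fillers for the "speaker" attribute, each paired
# with the probability of the Viterbi path that produced it.
candidates = [("Dr. Andrew McCallum", 5.7e-23), ("Andrew", 1.2e-25)]

# Choose the filler whose path had the highest probability.
speaker, prob = max(candidates, key=lambda c: c[1])
print(speaker)   # -> Dr. Andrew McCallum
```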

Examples: Applications of Natural Language Processing

A few examples of NLP that people use every day are:

  • Spell check
  • Autocomplete
  • Voice text messaging
  • Spam filters
  • Search results
  • Related keywords on search engines
  • Siri, Alexa, or Google Assistant

In any case, the computer is able to identify the appropriate word, phrase, or response by using context clues, the same way that any human would. Conceptually, it’s a fairly straightforward technology.

The best-known examples of NLP, smart assistants such as Siri, Alexa, and Cortana, have become increasingly integrated into our lives.

Automated support:

Chatbots are nothing new, but advancements in NLP have increased their usefulness to the point that live agents no longer need to be the first point of communication for some customers.

Some features of chatbots include being able to help users navigate support articles and knowledge bases, order products or services, and manage accounts.

Chatbots:

To provide better customer support, companies have started using chatbots for 24/7 service. Chatbots help resolve the basic queries of customers.

If a chatbot is not able to resolve a query, it forwards it to the support team while still engaging the customer. This helps customers feel that the support team is attending to them quickly. With the help of chatbots, companies have become capable of building cordial relations with customers.

This is only possible with the help of Natural Language Processing.
