NLP is the ability of machines to understand and process human languages. It is important because it fills the gap in communication between humans and machines.
Natural Language Processing (NLP), by definition, is a method that enables humans to communicate with computers, or rather with computer programs, using human languages, referred to as natural languages, such as English.
This includes both text and speech input. It helps computers understand and interpret these languages and reply in a valid manner.
It is a field at the intersection of Computer Science, Human Language, and Artificial Intelligence. It is the technology that machines use to understand, analyze, manipulate, and interpret human languages.
It helps developers to organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
It is used to create automated software that understands spoken human language and extracts useful information from audio input. Techniques in NLP allow computer systems to process and interpret data in the form of natural languages.
Natural language processing (NLP) problems are often divided into two tasks:
1. Processing written text, using lexical, syntactic, and semantic knowledge of the language as well as the required real-world information.
2. Processing spoken language, using all the knowledge required above plus additional knowledge about phonology, as well as enough added information to handle the further ambiguities that arise in speech.
NLP is one of the major and most addressed parts of Artificial Intelligence.
It is very commonly used in our day-to-day lives, in applications like Google Assistant, Siri, Google Translate, Alexa, etc. NLP involves dividing the input into smaller pieces, performing tasks to understand the connections between them, and then producing meaningful output.
Language Models Of NLP:
Formal languages, such as the programming languages Java or Python, have precisely defined language models.
A language can be defined as a collection of strings: "print(2 + 2)" is a legal program in the language Python, whereas "2) + (2 print" is not. Since there are an infinite number of legal programs, they cannot be enumerated; instead they are specified by a collection of rules called a grammar. Formal languages also have rules that define the meaning or semantics of a program; as an example, the rules say that the "meaning" of "2 + 2" is 4, and the meaning of "1/0" is that an error is signaled.
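As an illustration, here is a minimal sketch in Python: the interpreter's own parser embodies the grammar of the language, so we can test whether a string belongs to the formal language "Python".

```python
# A minimal sketch: Python's grammar decides which strings belong to the
# formal language "Python"; ast.parse applies that grammar for us.
import ast

def is_legal_python(source: str) -> bool:
    """Return True if `source` is a syntactically legal Python program."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_legal_python("print(2 + 2)"))   # True  -- a legal program
print(is_legal_python("2) + (2 print"))  # False -- not in the language
```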
Natural languages are also ambiguous. We cannot speak of a single meaning for a sentence, but rather of a probability distribution over possible meanings.
N-gram character models of NLP:
An n-gram model is a technique for counting sequences of characters or words, which supports rich pattern discovery in text.
In other words, it tries to capture patterns of sequences (characters or words next to each other) while being sensitive to contextual relations (characters or words near each other).
Ultimately, a written text is composed of characters: letters, digits, punctuation, and spaces in English (and more exotic characters in some other languages). Thus, one of the simplest language models is a probability distribution over sequences of characters. We write P(c1:N) for the probability of a sequence of N characters, c1 through cN.
A sequence of written symbols of length n is called an n-gram, with the special cases "unigram" for 1-gram, "bigram" for 2-gram, and "trigram" for 3-gram. A model of the probability distribution of n-letter sequences is thus called an n-gram model. An n-gram model is defined as a Markov chain of order n − 1.
We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then applying the Markov assumption:

P(c1:N) = ∏ (i = 1 to N) P(ci | c1:i−1) ≈ ∏ (i = 1 to N) P(ci | ci−2:i−1)
For a trigram character model in a language with 100 characters, P(ci | ci−2:i−1) has a million entries, and can be accurately estimated by counting character sequences in a body of text of 10 million characters or more. We call a body of text a corpus.
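A minimal sketch of such a model, assuming a 100-character vocabulary and simple add-one smoothing (real systems use larger corpora and more sophisticated smoothing; the corpus below is invented for illustration):

```python
# A character trigram model (a Markov chain of order 2), estimated by
# counting three-character sequences in a corpus.
from collections import Counter
import math

def char_trigram_counts(corpus: str) -> Counter:
    """Count every three-character sequence in the corpus."""
    return Counter(corpus[i:i+3] for i in range(len(corpus) - 2))

def sequence_log_prob(text: str, trigram_counts: Counter,
                      bigram_counts: Counter, vocab_size: int = 100) -> float:
    """Log-probability of `text` under the trigram model, add-one smoothed."""
    log_p = 0.0
    for i in range(2, len(text)):
        trigram, context = text[i-2:i+1], text[i-2:i]
        # P(c_i | c_{i-2} c_{i-1}) = count(trigram) / count(context), smoothed
        p = (trigram_counts[trigram] + 1) / (bigram_counts[context] + vocab_size)
        log_p += math.log(p)
    return log_p

corpus = "the rain in spain stays mainly in the plain " * 100
trigram_counts = char_trigram_counts(corpus)
bigram_counts = Counter(corpus[i:i+2] for i in range(len(corpus) - 1))

print(sequence_log_prob("the rain", trigram_counts, bigram_counts))
```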
What can we do with n-gram character models? One task for which they are well suited is language identification: given a text, determine what natural language it is written in. This is a relatively easy task: even with short texts such as "Hello world" and "Wie geht es dir," it is easy to identify the first as English and the second as German. Computer systems identify languages with greater than 99% accuracy; occasionally, closely related languages, such as Swedish and Norwegian, are confused. Other tasks for character models include spelling correction, genre classification, and named-entity recognition.
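A toy sketch of language identification along these lines; for brevity it scores smoothed trigram frequencies rather than a full order-2 Markov chain, and the one-sentence "corpora" are invented stand-ins for real training text:

```python
# Train one character-trigram distribution per language on a tiny sample,
# then pick the language whose model gives the new text the highest score.
from collections import Counter
import math

def trigrams(text: str) -> Counter:
    return Counter(text[i:i+3] for i in range(len(text) - 2))

def log_prob(text: str, model: Counter, total: int, vocab: int = 10000) -> float:
    # Add-one smoothing over an assumed trigram vocabulary size.
    return sum(math.log((model[text[i:i+3]] + 1) / (total + vocab))
               for i in range(len(text) - 2))

samples = {
    "English": "hello world how are you doing today my friend",
    "German":  "hallo welt wie geht es dir heute mein freund",
}
models = {lang: trigrams(txt) for lang, txt in samples.items()}

text = "wie geht es dir"
best = max(models, key=lambda lang: log_prob(text, models[lang],
                                             sum(models[lang].values())))
print(best)  # German
```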
Genre classification means deciding whether a text is a news story, a legal document, a scientific article, etc.
While many features help make this classification, counts of punctuation and other character n-gram features go a long way. Named-entity recognition is the task of finding names of things in a document and deciding what class they belong to.
For example, in the text "Mr. Sopersteen was prescribed aciphex,"
we should recognize that "Mr. Sopersteen" is the name of a person and "aciphex" is the name of a drug. Character-level models are good for this task because they can associate the character sequence "ex " ("ex" followed by a space) with a drug name and "steen " with a person name, and thereby identify words that they have never seen before.
N-gram word models of NLP:
When we analyze a sentence one word at a time, it is referred to as a unigram. A sentence parsed two words at a time is a bigram.
When the sentence is parsed three words at a time, it is a trigram. Similarly, an n-gram refers to the parsing of n words at a time.
Example: To understand unigrams, bigrams, and trigrams, you can refer to the sketch after this paragraph.
Therefore, parsing allows machines to understand the individual meaning of a word in a sentence. Also, this type of parsing helps predict the next word and correct spelling errors.
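A small sketch of the idea, sliding a window of n words across a sentence:

```python
# Extract word-level unigrams, bigrams, and trigrams from a sentence.
def word_ngrams(sentence: str, n: int) -> list[str]:
    words = sentence.split()
    return [" ".join(words[i:i+n]) for i in range(len(words) - n + 1)]

sentence = "There was heavy rain"
print(word_ngrams(sentence, 1))  # ['There', 'was', 'heavy', 'rain']
print(word_ngrams(sentence, 2))  # ['There was', 'was heavy', 'heavy rain']
print(word_ngrams(sentence, 3))  # ['There was heavy', 'was heavy rain']
```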
Consider two sentences: "There was heavy rain" vs. "There was heavy flood".
From experience, we all know that the first sentence sounds better. An n-gram model will tell us that "heavy rain" occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable and will be preferred by the model.
A model that simply depends on how often a word occurs, without looking at previous words, is called a unigram model. If a model considers only the previous word to predict the current word, it is called a bigram model. If the two previous words are considered, it is a trigram model.
An n-gram model for the above example would calculate the following probability:
P("There was heavy rain") = P("There", "was", "heavy", "rain") = P("There") × P("was" | "There") × P("heavy" | "There was") × P("rain" | "There was heavy")
Since it is impractical to calculate these conditional probabilities, we use the Markov assumption to approximate this with a bigram model:
P("There was heavy rain") ≈ P("There") × P("was" | "There") × P("heavy" | "was") × P("rain" | "heavy")
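A sketch of this bigram approximation, with an invented toy corpus in which "heavy rain" appears more often than "heavy flood":

```python
# Estimate P(w_i | w_{i-1}) by counting adjacent word pairs in a toy
# corpus, then multiply the conditional probabilities along the sentence.
from collections import Counter

corpus = ("there was heavy rain . there was heavy rain . "
          "there was heavy flood .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence: str) -> float:
    words = sentence.lower().split()
    p = unigrams[words[0]] / len(corpus)            # P(first word)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(cur | prev)
    return p

print(bigram_prob("there was heavy rain"))   # higher probability
print(bigram_prob("there was heavy flood"))  # lower probability
```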
Now we turn to n-gram models over words instead of characters. All the same mechanisms apply equally to word and character models. The main difference is that the vocabulary (the set of symbols that make up the corpus and the model) is larger.
There are only about 100 characters in most languages, and sometimes we build character models that are even more restrictive, for example by treating "A" and "a" as the same symbol or by treating all punctuation as the same symbol. But with word models we have at least tens of thousands of symbols, and sometimes millions.
The wide range is because it is not clear what constitutes a word. In English, a sequence of letters surrounded by spaces is a word, but in some languages, like Chinese, words are not separated by spaces, and even in English many decisions must be made to have a clear policy on word boundaries: how many words are in "ne'er-do-well"? Or in "(Tel:1-800-960-5660 x123)"?
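A small sketch showing how the tokenization policy changes the answer; neither regular expression below is the single correct definition of a word:

```python
# Two plausible word-boundary policies disagree on "ne'er-do-well".
import re

text = "He was a ne'er-do-well"

# Policy 1: a word is a maximal run of non-space characters -> 4 words.
print(re.findall(r"\S+", text))        # ['He', 'was', 'a', "ne'er-do-well"]

# Policy 2: a word is a maximal run of letters -> 7 words.
print(re.findall(r"[A-Za-z]+", text))  # ['He', 'was', 'a', 'ne', 'er', 'do', 'well']
```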
Text classification:
Text classification is the process of classifying documents into predefined categories based on their content: the automated assignment of natural language texts to predefined classes (or categories).
Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions, or extracting data.
We now consider in depth the task of text classification, also known as categorization: given a text of some kind, decide which of a predefined set of categories it belongs to.
Language identification and genre classification are examples of text classification. So are sentiment analysis (classifying a movie or product review as positive or negative) and spam detection (classifying an email message as spam or not-spam).
Since "not-spam" is awkward, researchers have coined the term ham for not-spam.
We can treat spam detection as a problem in supervised learning. A training set is readily available: the positive (spam) examples are in my spam folder, the negative (ham) examples are in my inbox.
Here is an excerpt:
Spam: Wholesale Fashion Watches -57% today. Designer watches for cheap …
Spam: You can buy ViagraFrS1.85 All Medications at unbeatable prices! ..
Spam: WE CAN TREAT ANYTHING YOU SUFFER FROM JUST TRUST US.
Spam: Sta.rt earn*ing the salary yo,u d-eserve by o’btaining the prope,rcrede’ntials!
Ham: The practical significance of hyper tree width in identifying more..
Ham: Abstract: We will motivate the problem of social identity clustering:..
Ham: Good to see you my friend. Hey Peter, It was good to hear from you. ..
Ham: PDS implies convexity of the resulting optimization problem (Kernel Ridge…

From this excerpt we can start to get an idea of what might be good features to include in the supervised learning model.
Word n-grams such as "for cheap" and "You can buy" seem to be indicators of spam (although they would have a nonzero probability in ham as well).
Character-level features also seem important: spam is more likely to be all uppercase and to have punctuation embedded in words. Apparently the spammers thought that the word bigram "you deserve" would be too indicative of spam, and thus wrote "you d-eserve" instead.
A character model should detect this. We could either create a full character n-gram model of spam and ham, or we could handcraft features such as "number of punctuation marks embedded in words." Note that we have two complementary ways of talking about classification.
In the language-modeling approach, we define one n-gram language model for P(Message | spam) by training on the spam folder, and one model for P(Message | ham) by training on the inbox. Then we can classify a new message by applying Bayes' rule to see which class gives it the higher posterior probability.
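A toy sketch of the language-modeling approach; for simplicity it uses unigram word models with add-one smoothing (the text describes n-gram models generally), and the training "messages" below are invented stand-ins for a spam folder and an inbox:

```python
# Train one smoothed unigram word model on spam and one on ham, then
# classify a message by which model assigns it the higher probability.
from collections import Counter
import math

spam_words = "wholesale fashion watches designer watches for cheap buy now".split()
ham_words = "good to see you my friend it was good to hear from you".split()

spam_model, ham_model = Counter(spam_words), Counter(ham_words)

def log_prob(message: str, model: Counter, total: int, vocab: int = 1000) -> float:
    # Add-one smoothing so unseen words get a small nonzero probability.
    return sum(math.log((model[w] + 1) / (total + vocab))
               for w in message.lower().split())

def classify(message: str) -> str:
    p_spam = log_prob(message, spam_model, len(spam_words))
    p_ham = log_prob(message, ham_model, len(ham_words))
    return "spam" if p_spam > p_ham else "ham"

print(classify("designer watches for cheap"))  # spam
print(classify("good to hear from you"))       # ham
```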