Easy To Use Classification by Data Compression: PageRank vs HITS

Another way to think about classification is as a problem in data compression.

A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and writes a description of the sequence that is more compact than the original.

For example, the text “0.142857142857142857” might be compressed to “0.[142857]*3.”

Compression algorithms work by building dictionaries of subsequences of the text, and then referring to entries in the dictionary. The example here had only one dictionary entry, “142857.”

In effect, compression algorithms are creating a language model. To do classification by compression, first collect all the spam training messages and compress them as a unit.

We do the same for the ham. Then when given a new message to classify, we append it to the spam messages and compress the result. We also append it to the ham and compress that.

Whichever class compresses better (that is, adds fewer additional bytes for the new message) is the predicted class.

The idea is that a spam message will tend to share dictionary entries with other spam messages and will therefore compress better when appended to a collection that already contains the spam dictionary.
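The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming zlib as the off-the-shelf compressor; the helper names (`compressed_size`, `extra_bytes`, `classify`) and the tiny training corpora are hypothetical and invented for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Return the number of bytes zlib needs for the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Additional compressed bytes caused by appending message to corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Predict the class whose corpus grows least when the message is added."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training data, invented for illustration only.
spam_corpus = "\n".join([
    "Congratulations, you have won a free prize! Click here to claim your prize now.",
    "Limited time offer: buy cheap watches online and win a free prize today.",
    "Click here now to claim your free gift card, no purchase required.",
])
ham_corpus = "\n".join([
    "Hi Anna, the meeting is moved to Thursday at ten. Please bring the quarterly report.",
    "Thanks for the draft, I will send my comments on the report by Friday.",
    "Can we reschedule lunch to next week? I have a meeting on Thursday.",
])

new_message = "You have won a free gift card! Click here to claim your prize now."
print(classify(new_message, spam_corpus, ham_corpus))
```

With corpora this small the byte counts are noisy; the approach is only expected to become reliable when each class has a substantial body of training text.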

Experiments with compression-based classification on some of the standard corpora for text classification (the 20-Newsgroups data set, the Reuters-10 Corpora, and the Industry Sector corpora) indicate that although running off-the-shelf compression algorithms like gzip and RAR can be quite slow, their accuracy is comparable to that of traditional classification algorithms.

This is interesting in its own right, and also serves to point out that there’s promise for algorithms that use character n-grams directly with no preprocessing of the text or feature selection: they appear to be capturing some real patterns.

Information retrieval:

Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

Information retrieval systems use a very simple language model based on bags of words, yet still manage to perform well in terms of recall and precision on very large corpora of text.

On internet corpora, link-analysis algorithms improve performance. Information retrieval is the task of finding documents that are relevant to a user’s need for information.

The best-known examples of information retrieval systems are search engines on the World Wide Web. A Web user can type a query such as [AI book] into a search engine and see a list of relevant pages. In this section, we will see how such systems are built. An information retrieval system can be characterized by:

  1. A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph, a page, or a multipage text.
  2. Queries posed in a query language. A query specifies what the user wants to know. The query language can be just a list of words, such as [AI book]; it can specify a phrase of words that must be adjacent; it can contain Boolean operators; or it can include non-Boolean operators.
  3. A result set. This is the subset of documents that the IR system judges to be relevant to the query. By relevant, we mean likely to be of use to the person who posed the query, for the particular information need expressed in the query.
  4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.

The PageRank algorithm:

PageRank is a scoring measure based only on the link structure of web pages. A web page is important if it is pointed to by other important web pages. This first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.

Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.

PageRank was one of the two original ideas that set Google’s search apart from other Web search engines when it was introduced in 1997. (The other innovation was the use of anchor text, the underlined text in a hyperlink, to index a page, even though the anchor text was on a different page than the one being indexed.)

[Figure: The HITS algorithm for computing hubs and authorities with respect to a query. RELEVANT-PAGES fetches the pages that match the query, and EXPAND-PAGES adds in every page that links to or is linked from one of the relevant pages. NORMALIZE divides each page’s score by the sum of the squares of all pages’ scores (separately for both the authority and hub scores).]

PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM], how do we make sure that IBM’s home page, ibm.com, is the first result, even if another page mentions the term “IBM” more frequently?

The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher: every in-link is a vote for the quality of the linked-to page. But if we only counted in-links, then it would be possible for a Web spammer to create a network of pages and have all of them point to a page of his choosing, increasing the score of that page.

Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily. What is a high-quality site? One that is linked to by other high-quality sites. The definition is recursive, but we will see that the recursion bottoms out properly.

The PageRank for a page p is defined as:

$$PR(p) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(in_i)}{C(in_i)}$$

where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, in_i are the pages that link in to p, and C(in_i) is the count of the total number of out-links on page in_i.

The constant d is a damping factor. It can be understood through the random surfer model: imagine a Web surfer who starts at some random page and begins exploring.

With probability d (we’ll assume d = 0.85) the surfer clicks on one of the links on the page (choosing uniformly among them), and with probability 1 - d she gets bored with the page and restarts on a random page anywhere on the Web.

The PageRank of page p is then the probability that the random surfer will be at page p at any point in time. PageRank can be computed by an iterative procedure: start with all pages having the same initial value of PR(p) and iterate the algorithm, updating ranks until they converge.
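The iterative procedure can be sketched as follows. This is a minimal illustration rather than a production implementation: the function name `pagerank`, the uniform starting value 1/N, the convergence tolerance, and the four-page graph are all assumptions made for the example. The sketch also assumes that every page has at least one out-link and that every link target appears as a key in the graph.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to. Assumes every
    page has at least one out-link and every target is a key of links.
    """
    pages = list(links)
    n = len(pages)

    # in_links[p] lists the pages that link in to p.
    in_links = {p: [] for p in pages}
    for src, outs in links.items():
        for dst in outs:
            in_links[dst].append(src)

    # Assumed starting point: the same value for every page.
    pr = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new_pr = {
            p: (1 - d) / n + d * sum(pr[q] / len(links[q]) for q in in_links[p])
            for p in pages
        }
        change = sum(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:  # ranks have stopped moving
            break
    return pr


# Hypothetical four-page web graph.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page D has no in-links, so its score settles at the baseline (1 - d)/N, while page C, which collects links from three pages, ends up ranked highest.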

Thus PageRank is a global ranking of all web pages based on their locations in the web graph structure. PageRank uses information that is external to the web pages themselves: backlinks. Backlinks from important pages are more significant than backlinks from average pages.

Advantages of PageRank:

  • Fighting spam: A page is important if the pages pointing to it are important. Since it is hard for a Web page owner to add in-links to his or her page from other important pages, it is not easy to influence PageRank.
  • PageRank is a global measure and is query independent. The PageRank values of all pages are computed and saved offline rather than at query time.

The HITS algorithm:

Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg in 1999. It uses the link structure of the Web to discover and rank the pages relevant to a particular search. HITS uses hubs and authorities to define a recursive relationship between web pages.

To understand the HITS algorithm, we first need to understand hubs and authorities. Given a query to a search engine, the set of highly relevant web pages is called the Root set. These pages are potential Authorities.

Pages that are not very relevant themselves but point to pages in the Root set are called Hubs. Thus, an Authority is a page that many Hubs link to, and a Hub is a page that links to many Authorities.

Algorithm

Let the number of iterations be k.

  • Each node is assigned a Hub score = 1 and an Authority score = 1.
  • Repeat k times:
      • Hub update: each node’s Hub score = the sum of the Authority scores of the nodes it points to.
      • Authority update: each node’s Authority score = the sum of the Hub scores of the nodes pointing to it.
      • Normalize the scores by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores.
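The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `hits` and the small link graph are hypothetical, and both updates in a round read the scores from the previous round, which is one reasonable reading of the description.

```python
import math


def hits(links, k=20):
    """Run k rounds of hub/authority updates.

    links maps each node to the list of nodes it points to; every target
    is assumed to appear as a key of links.
    """
    nodes = list(links)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(k):
        # Hub update: sum of the authority scores of the nodes it points to.
        new_hub = {n: sum(auth[m] for m in links[n]) for n in nodes}

        # Authority update: sum of the hub scores of the nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for n in nodes:
            for m in links[n]:
                new_auth[m] += hub[n]

        # Normalize each score vector by its Euclidean length.
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub = {n: v / hub_norm for n, v in new_hub.items()}
        auth = {n: v / auth_norm for n, v in new_auth.items()}

    return hub, auth


# Hypothetical graph: three link lists pointing at two content pages.
graph = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
    "auth1": [],
    "auth2": [],
}
hub_scores, auth_scores = hits(graph)
print(max(auth_scores, key=auth_scores.get))
```

Normalizing after every round keeps the scores from growing without bound; only the relative sizes of the scores matter for ranking.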

Two sets of inter-related pages:

  • Hub pages: good lists of links on a subject.
  • Authority pages: pages that occur recurrently on good hubs for the subject.

The HITS update rules, where x → y means that page x links to page y:

$$h(x) = \sum_{x \to y} a(y), \qquad a(x) = \sum_{y \to x} h(y)$$

The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities,” differs from PageRank in several ways.

First, it is a query-dependent measure: it rates pages with respect to a query. That means that it must be computed anew for each query, a computational burden that most search engines have chosen not to take on. Given a query, HITS first finds a set of pages that are relevant to the query.

It does that by intersecting hit lists of query words, and then adding pages in the link neighborhood of these pages: pages that link to or are linked from one of the pages in the original relevant set.
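The construction of this page set can be sketched as below. The function names `relevant_pages` and `expand_pages` echo the RELEVANT-PAGES and EXPAND-PAGES steps, but the data structures are assumptions made for illustration: a hypothetical inverted index mapping each word to the set of pages containing it, and a hypothetical link table mapping each page to the pages it links to.

```python
def relevant_pages(query, index):
    """Intersect the hit lists (sets of page ids) of all query words."""
    hit_lists = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hit_lists) if hit_lists else set()


def expand_pages(pages, links):
    """Add every page that links to, or is linked from, a page in pages."""
    expanded = set(pages)
    for src, outs in links.items():
        for dst in outs:
            if src in pages or dst in pages:
                # One endpoint is already relevant; pull in the other one.
                expanded.update((src, dst))
    return expanded


# Hypothetical inverted index and link table.
index = {
    "ai": {"p1", "p2", "p3"},
    "book": {"p2", "p3", "p4"},
}
links = {
    "p1": ["p2"],
    "p2": ["p5"],
    "p3": [],
    "p4": ["p1"],
    "p5": [],
    "p6": ["p3"],
    "p7": ["p4"],
}

root = relevant_pages("AI book", index)
base = expand_pages(root, links)
print(sorted(root), sorted(base))
```

Only pages one link away from the relevant set are added; pages such as p4 and p7 in this toy data, which are two or more links away, stay out.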

Each page in this set is considered an authority on the query to the degree that other pages in the relevant set point to it. A page is considered a hub to the degree that it points to other authoritative pages in the relevant set.

Just as with PageRank, we don’t want to merely count the number of links: we want to give more value to the high-quality hubs and authorities. Thus, as with PageRank, we iterate a process that updates the authority score of a page to be the sum of the hub scores of the pages that point to it, and the hub score to be the sum of the authority scores of the pages it points to. If we then normalize the scores and repeat k times, the process will converge.

Both PageRank and HITS played important roles in developing our understanding of Web information retrieval. These algorithms and their extensions are used in ranking billions of queries daily as search engines steadily develop better ways of extracting yet finer signals of search relevance.

Information Retrieval vs Information Extraction:

  • Information Retrieval: Given a set of query terms and a set of documents, select only the most relevant documents (precision), and preferably all the relevant ones (recall).

  • Information Extraction: Extract from the text what the document means.
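The precision and recall criteria mentioned for information retrieval can be made concrete with a short sketch. The function name `precision_recall` and the document ids are hypothetical, and returning 0.0 when a set is empty is a convention assumed here, not something the text prescribes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against the true relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, giving a precision of 0.5, and two of the three relevant documents were retrieved, giving a recall of 2/3.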
