Wednesday, March 3, 2010

GOOGLE PAGE RANK

-Prathmesh Jadhav & Abhilash Kumar

Introduction

Ever wondered how it is that whenever you type a search string into Google, the pages appear in descending order of relevance? Who does the job of ordering millions of pages on the net in such a sophisticated manner? If you own a website, how do you get its link into the first 10-20 search results shown by Google? If these questions have ever crossed your mind, this article aims to answer most of them. The answer to all of them, and one of the significant reasons behind the success of Google, is PageRank.

PageRank is a link analysis algorithm, named after Larry Page, used by the Google Internet search engine. Basically, PageRank is Google’s way of deciding a page’s importance. It assigns a rank, or weightage, to each element of a hyperlinked set of documents, such as the World Wide Web, in order to measure its relative importance within the set. The numerical weightage that it assigns to any element E is called the PageRank of E and is denoted by PR(E). Google describes PageRank as:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.

A weightage from 0 to 10 is assigned to each page on the net, denoting the page’s importance. There are many tricks to increase the ranking of your site, which we’ll cover later.

HISTORY

PageRank was developed at Stanford University by Larry Page, after whom it is named, and later Sergey Brin. The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was published in 1998. Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.

ALGORITHM

PageRank is a probability distribution representing the likelihood that a person randomly clicking on links will arrive at a particular page. To calculate an accurate PageRank for a set of documents, several passes, called “iterations”, are made over the set. PageRank is expressed as a probability from 0 to 1, e.g. a PageRank of 0.4 means that there is a 40% chance that a person clicking on random links will be directed to that document. The 0-10 value mentioned earlier is derived from this theoretical probability on a logarithmic scale.

In order to understand the algorithm better, let’s get familiar with some terminology.

1. Internal Linking:

Internal linking refers to the links between the pages of a website.
The maximum amount of PageRank available to a website increases with the number of pages it has. One way to increase PageRank is to have good internal linking.

2. Inbound Links:

Inbound links are the links to your website from pages of another website. They are one way of increasing the PageRank of your website: the more inbound links your website has, the higher its PageRank will be. The increase in PageRank also depends on which page links to your website. A link from an important page is more beneficial, as it causes a greater increase in PageRank. EG: Site A → Site B and Site A → Site C are inbound links for B and C from A.
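The relationship between outbound and inbound links can be sketched in a few lines of Python. The `links` dictionary below is a made-up graph that mirrors the example (Site A links to Sites B and C), and `inbound_links` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

# Hypothetical link graph: each site maps to the list of sites it links to.
links = {"A": ["B", "C"], "B": [], "C": []}


def inbound_links(links):
    """Invert an outbound-link graph into site -> list of sites linking to it."""
    inbound = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)
    return dict(inbound)


print(inbound_links(links))  # {'B': ['A'], 'C': ['A']}
```

Sites B and C each have one inbound link, from A, while A has none.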

3. Outbound Links:

Outbound links are links from the pages of your website to some other site. You need to take care when choosing where to exchange links, because outbound links drain your site’s total PageRank. The reason for this is that, according to Page and Brin’s paper, “the sum of all PageRanks is one”. Thus, the drain must be reciprocated. It is, however, possible to link to other sites without losing PageRank. EG: In the previous figure, Site A → Site B is an outbound link from A to B.
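The drain can be illustrated with a rough Python sketch that uses the PageRank formula given later in this article, with d = 0.85. Both graphs are invented for illustration: A and B are the pages of your site, while X and Y are external pages that link only to each other. The helper name `site_total` is an assumption of this sketch.

```python
D = 0.85  # damping factor, as used later in the article


def site_total(links, site_pages, rounds=100):
    """Iterate PR(p) = (1 - D) + D * sum(PR(t) / C(t)), then sum the site's pages."""
    ranks = {page: 1.0 for page in links}
    for _ in range(rounds):
        ranks = {
            page: (1 - D) + D * sum(
                ranks[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return sum(ranks[page] for page in site_pages)


# Site with no outbound links: A and B link only to each other.
closed = {"A": ["B"], "B": ["A"]}
# Same site, but A also links out to the external page X.
leaky = {"A": ["B", "X"], "B": ["A"], "X": ["Y"], "Y": ["X"]}

print(round(site_total(closed, ["A", "B"]), 3))
print(round(site_total(leaky, ["A", "B"]), 3))
```

Under these assumptions the closed site keeps a total of 2.0, while the site with the unreciprocated outbound link ends up with noticeably less.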

There are mainly 3 ways to link to other sites without losing PageRank:

i. Form Actions:

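A form’s ‘action’ attribute can be the URL of any HTML page on any site, rather than the URL of a form-parsing script.

EG: <form name="myform" action="http://www.domain.com/somepage.html">

followed by a link that submits the form, for example <a href="javascript:document.myform.submit()">Click here</a>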

We can also put the action attribute in JavaScript code rather than in the form tag, and the code can be loaded from a ‘js’ file stored in a directory that is barred to Google’s spider by the robots.txt file.
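To check that a directory really is barred to crawlers, Python's standard `urllib.robotparser` module can evaluate robots.txt rules. The sketch below is only an illustration: the `/scripts/` directory, the `form.js` file name and the rules themselves are assumptions, and the URLs reuse the article's example domain.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bars every crawler from the /scripts/ directory.
robots_txt = """\
User-agent: *
Disallow: /scripts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Placeholder URLs on the article's example domain.
script_url = "http://www.domain.com/scripts/form.js"
page_url = "http://www.domain.com/somepage.html"

print(parser.can_fetch("Googlebot", script_url))  # False: the directory is barred
print(parser.can_fetch("Googlebot", page_url))    # True: ordinary pages stay crawlable
```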

ii. Javascript:

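EG: <a href="javascript:goto('wherever')">Click here</a>, where goto() is a placeholder for a script function that loads the target page.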

iii. The “rel” attribute:

Setting this attribute to “nofollow” tells Google to ignore the link completely.

EG: <a href="http://www.domain.com/somepage.html" rel="nofollow">Link text</a>
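How a crawler might honour this attribute can be sketched with Python's standard `html.parser` module. This is an illustration only, not Google's actual crawler logic; the class name `FollowedLinkParser` and the second URL (`otherpage.html`) are made up for the example.

```python
from html.parser import HTMLParser


class FollowedLinkParser(HTMLParser):
    """Collect href values of <a> tags, skipping links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel_values = (attributes.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_values:
            self.followed.append(href)


page = (
    '<a href="http://www.domain.com/somepage.html" rel="nofollow">Link text</a>'
    '<a href="http://www.domain.com/otherpage.html">Other link</a>'
)
link_parser = FollowedLinkParser()
link_parser.feed(page)
print(link_parser.followed)  # ['http://www.domain.com/otherpage.html']
```

Only the second link is collected; the first is skipped because of its rel value.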

4. Dangling Links:

A dangling link is a link to a page that has no outgoing links, or a link to a page that Google hasn’t indexed. In both cases, Google removes these links before calculating PageRank and puts them back after the calculations are finished. Thus they do not affect the PageRank.
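The removal step can be sketched as follows. This is a guess at the general idea rather than Google's implementation: the function name `prune_dangling` is hypothetical, a page missing from the dictionary is treated as "not indexed", and repeating the pruning until nothing changes is an assumption of this sketch. Putting the links back afterwards is not shown.

```python
def prune_dangling(links):
    """Drop links to unindexed pages and to pages without outgoing links."""
    links = {page: list(targets) for page, targets in links.items()}
    while True:
        # Pages that currently have no outgoing links at all.
        dangling = {page for page, targets in links.items() if not targets}
        kept = {
            page: [t for t in targets if t in links and t not in dangling]
            for page, targets in links.items()
            if page not in dangling
        }
        if kept == links:  # nothing left to remove
            return kept
        links = kept


# Hypothetical graph: C has no outgoing links, D is not indexed at all.
graph = {"A": ["B", "C", "D"], "B": ["A"], "C": []}
print(prune_dangling(graph))  # {'A': ['B'], 'B': ['A']}
```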

5. Link Farms:

On the World Wide Web, a link farm is any group of websites that all link to every other site in the group. Most are created through automated programs and services. A link farm is a form of spamming the index of a search engine. Such websites have PR=0 and are penalised by Google. We cannot control which sites link to ours, but we should take care that none of our outbound links point to a link farm.
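The definition above translates directly into a small check. This is only a literal test of "every site links to every other site in the group" on made-up data; it says nothing about how Google actually detects link farms, and the function name is hypothetical.

```python
def links_to_every_other(group, links):
    """True if each site in `group` links to all the other sites in the group."""
    group = set(group)
    return all(group - {site} <= set(links.get(site, ())) for site in group)


# Hypothetical graphs: a fully interlinked group and an ordinary chain of links.
farm = {"s1": ["s2", "s3"], "s2": ["s1", "s3"], "s3": ["s1", "s2"]}
chain = {"s1": ["s2"], "s2": ["s3"], "s3": []}

print(links_to_every_other(["s1", "s2", "s3"], farm))   # True
print(links_to_every_other(["s1", "s2", "s3"], chain))  # False
```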

HOW IS THE PAGERANK CALCULATED?

In order to calculate the PageRank of a page, we need to consider the inbound links to that page and the number of outbound links on each of the pages that link to it. The equation to calculate a page’s PageRank is:

PR(A)=(1-d)+d(PR(t1)/C(t1)+...+PR(tn)/C(tn))

In this equation,

⇒ ‘t1’ to ‘tn’ are the pages that link to page A.

⇒ ‘C(t)’ is the number of outbound links that page t has.

⇒ ‘d’ is the damping factor.

Let us assume that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is called the damping factor, denoted by d. The damping factor is generally set to around 0.85.
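The random surfer can be simulated directly. The sketch below makes two assumptions that the text does not spell out: when the surfer stops clicking, they start again on a page chosen at random, and the three-page graph is invented for illustration. Multiplying each visit frequency by the number of pages gives values on the same scale as the PR equation above.

```python
import random


def simulate_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer lands on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])  # keep clicking (probability d)
        else:
            page = rng.choice(pages)        # stop, then restart on a random page
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
frequencies = simulate_surfer(graph)
print({p: round(f * len(graph), 2) for p, f in frequencies.items()})
```

For this graph the scaled frequencies should come out close to 1.16 for A, 0.64 for B and 1.19 for C, which are the values the PR equation gives for the same links.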

Whenever a page has an outbound link to another page, it casts a vote worth a little less than its own PageRank, i.e. 0.85 × its own PageRank, shared equally among all of its outbound links. For a site to have a good PageRank, it needs inbound links from important pages, and the number of inbound links should be high.
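A single application of the equation can be written as a short Python function. The function name `page_rank_step` and the example graph are assumptions made for this sketch; every page starts with PR = 1.

```python
def page_rank_step(page, ranks, links, d=0.85):
    """Apply PR(A) = (1 - d) + d * (PR(t1)/C(t1) + ... + PR(tn)/C(tn)) once."""
    votes = sum(
        ranks[t] / len(links[t])  # page t splits its PageRank over its C(t) links
        for t in links
        if page in links[t]
    )
    return (1 - d) + d * votes


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}

for page in links:
    print(page, round(page_rank_step(page, ranks, links), 4))
```

Page C collects half of A's vote plus all of B's, so after one step it gets 0.15 + 0.85 * 1.5 = 1.425, while B, with only half of A's vote, gets 0.575.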

Since PageRank is reported on a logarithmic scale, a site requires a lot more PageRank to move up to the next level than it required to reach its current level from the previous one. According to Google:

“PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.”

The statement means that we can calculate a page’s PR without knowing the final value of the PageRank of the other pages. That seems strange but, basically, each time we perform an iteration we get a closer estimate of the final value. So we keep iterating until two consecutive iterations give almost the same values.
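The "iterate until two consecutive passes agree" idea can be sketched as below. The function name, the tolerance of 1e-6 and the example graph are assumptions of this sketch. Unlike the hand calculation in the example that follows, this version updates all pages at once from the previous pass, which reaches the same final values.

```python
def pagerank(links, d=0.85, tolerance=1e-6, start=1.0, max_iterations=1000):
    """Repeat the PR equation until no page changes by more than `tolerance`."""
    ranks = {page: start for page in links}
    for iteration in range(1, max_iterations + 1):
        new_ranks = {
            page: (1 - d) + d * sum(
                ranks[t] / len(links[t]) for t in links if page in links[t]
            )
            for page in links
        }
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tolerance:
            return ranks, iteration
    return ranks, max_iterations


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
final_ranks, passes = pagerank(graph)
print({p: round(r, 3) for p, r in final_ranks.items()}, "after", passes, "passes")
```

Changing `start` to 0 or 40 changes the number of passes needed but not the values the loop settles on.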

Example: Consider 2 pages, A and B. Each has 1 outbound link to the other, and therefore 1 inbound link from the other.

Case 1: Iteration 1

Let’s assume that both the pages at the beginning have PR=1:

d = 0.85
PR(A) = (1 – d) + d(PR(B)/1)
PR(B) = (1 – d) + d(PR(A)/1)

i.e.

PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1

The values of PR(A) and PR(B) remain the same. Let’s start with some other values of PR.

Iteration 2: Consider PR=0 and calculate:

PR(A) = 0.15 + 0.85 * 0 = 0.15
PR(B) = 0.15 + 0.85 * 0.15 = 0.2775

Again:

PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375

And again

PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375

and so on. The numbers just keep going up. But will the numbers stop increasing when they get to 1.0? What if a calculation over-shoots and goes above 1.0?
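The hand calculation above can be continued in Python to see where the numbers are heading. This sketch follows the same order as the worked example: PR(A) is computed from the current PR(B), and PR(B) then uses the value of PR(A) just computed. The choice of 60 rounds is arbitrary.

```python
d = 0.85
pr_a = pr_b = 0.0  # starting guess from the worked example
history = []

for _ in range(60):
    pr_a = (1 - d) + d * pr_b  # A has one inbound link, from B
    pr_b = (1 - d) + d * pr_a  # B uses the value of A just computed
    history.append((pr_a, pr_b))

# First three rounds, then the last one.
for a, b in history[:3]:
    print(round(a, 6), round(b, 6))
print(round(history[-1][0], 6), round(history[-1][1], 6))
```

The first three rows reproduce the values calculated above, and the later rounds creep ever closer to 1.0 from below without going past it.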

Case 2: Iteration 1

Let’s calculate for PR=40:

PR(A) = 40
PR(B) = 40

First calculation

PR(A) = 0.15 + 0.85 * 40 = 34.15
PR(B) = 0.15 + 0.85 * 34.15 = 29.1775

And again

PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375

The numbers are decreasing and, on further calculation, will converge to 1.
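Why both starting points head for 1 can be seen by asking which value x the two-page equation leaves unchanged. The short derivation below assumes the two-page example with d = 0.85; the symbol x for the common PageRank of the two pages is introduced here only for illustration.

```latex
\begin{align*}
x &= (1 - d) + d\,x \\
(1 - d)\,x &= 1 - d \\
x &= 1
\end{align*}
% Each update also shrinks the distance from 1 by the factor d:
\begin{align*}
x_{\text{new}} - 1 &= (1 - d) + d\,x_{\text{old}} - 1 = d\,(x_{\text{old}} - 1)
\end{align*}
```

For instance, 24.950875 - 1 = 23.950875 is exactly 0.85 × (29.1775 - 1), so a starting value above 1 falls towards 1 and a starting value below 1 rises towards it.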

Case 3:

Now let’s try internal linking. Let’s link every page to all the other pages of the website. Suppose that three pages A, B and C are all linked to each other. In that case all the pages settle at PR=1. The same occurs when the pages are linked in a loop.

EG: A → B, B → C, C → A
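Both linking patterns can be checked with a short Python sketch: if every page starts at PR = 1 and one application of the equation returns 1 again for every page, the values have settled. The function name `apply_formula` and the dictionary layout are assumptions of this sketch.

```python
def apply_formula(links, ranks, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(t) / C(t)) once to every page."""
    return {
        page: (1 - d) + d * sum(
            ranks[t] / len(links[t]) for t in links if page in links[t]
        )
        for page in links
    }


full_mesh = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}  # all link to all
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}                      # A -> B -> C -> A

results = {}
for name, links in (("full mesh", full_mesh), ("loop", loop)):
    ones = {page: 1.0 for page in links}
    results[name] = apply_formula(links, ones)
    print(name, {p: round(r, 6) for p, r in results[name].items()})
```

In both cases each page receives votes adding up to exactly one full PageRank, so every page stays at 1.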

Special cases:

1. While adding new pages, link them to important pages, which improves the PageRank of the site as a whole. The new pages, however, do not add PageRank until they are present in the Google index, and to get there they need at least 1 inbound link. So if page A is the important one, put a link to the new page on A. As a site will usually have several important pages, spread the links to new pages across them. When new pages are added, the PageRank of the important pages might decrease. So that the important pages don’t suffer, take care to add a sufficient number of new pages or to obtain more inbound links.

2. When a page links to itself, is the link counted?

Whenever a page links to another page, it casts a vote for that page. However, a page cannot cast a vote for itself, as that would open the way to manipulation of its rank.

TIPS TO INCREASE PAGERANK

The different techniques used to increase PageRank, or to get your page into the top results of search engines, are collectively called Search Engine Optimization.

1. Linking is the fastest way to increase PageRank.

Try to get as many links as possible from important pages of different sites.

2. Include at least one link to Google.

The more traffic you send to them, the greater the chances of having a higher PageRank.

3. Use lots of meta keyword tags.

Google indexes only the first 101k of your document. Separate the meta keywords with commas. But don’t abuse it (i.e. don’t repeat the same word more than three times in a row).

4. Build a sitemap for your website.

Submit the sitemap to search engines like Google and Yahoo. Also, using a Google webmaster account you can check whether your site is indexed (i.e. whether the site’s link will appear as a search result in Google). You can also find out the PageRank of your site.
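A minimal sitemap can be generated with Python's standard library. The sketch below assumes the XML format defined by the sitemaps.org protocol (a `urlset` element containing one `url`/`loc` entry per page); the page URLs are placeholders on the article's example domain, and `build_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages; replace with the real URLs of your site.
pages = ["http://www.domain.com/", "http://www.domain.com/somepage.html"]
print(build_sitemap(pages))
```

The resulting text would typically be saved as sitemap.xml in the root of the site before being submitted.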

5. Update your website often with more useful Content.

The probability of a surfer coming back to your site once he finds it useful is high.

6. Use a barter system to trade links.

The benefit of this is that you do not need to shell out money from your pocket.

7. Use H1 Tags for the entire page except the header and title.

The trick is to use CSS style sheets to reduce the font size. Google will give you extra points for the heading tags, even though everything appears small on the page.

8. Use internal linking for your site tactfully.

Suppose there are 3 pages in your site.

EG: pg1 → pg2, pg2 → pg3, pg3 → pg1.

This will surely get you some extra PageRank but just don’t overdo it.

9. Make use of lots of dots at the bottom of your document.

……….
Link each one of them to your pages and also to search engines.

10. Use short URLs.

Google hates long URLs. It’s beneficial to use shorter ones.

REFERENCES

1. http://en.wikipedia.org/wiki/PageRank

2. http://www.webworkshop.net/pagerank.html