Talk:PageRank

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Can someone check the MATLAB code, I believe it should be Norm(v,1)[edit]

Hi folks,

I was playing with this code, and couldnt get the PR results to sum to 1.0 (per the discussion at start of document). Looking at it, I believe that the problem is that the normalization doesn't cause it to sum to 1 - setting it to norm(v,1) does do this.

This might just have been random good luck on my part - but it meets the initial condition - that the random distribition of pagerank sums to 1.0, not so sure about why it needs to renormalized everytime though, that seems flaky. — Preceding unsigned comment added by 67.160.239.133 (talk) 05:26, 26 December 2012 (UTC)Reply[reply]


Follow up - the example itself doesn't sum to one, running on Octave.

The reason I *believe* is at least partly due to the columns not all summing to one. I fixed this in my example code, by ensuring that Node has a SELF-LOOP (ie the diagonal is non zero), and by removing the L2 norm in the loop. Would love someone more knowledgeable than me to explain this though.

ans =

  0.54033
  0.30295
  0.30295
  0.45012
  0.56735  — Preceding unsigned comment added by 67.160.239.133 (talk) 19:58, 26 December 2012 (UTC)Reply[reply] 

Update -- in looking at the paper Topic Sensitive Pagerank, (Taher H. Haveliwala) the author reports that if a Node has out degree == 0, then instead of just having a self-loop, assign an equal probability to going EVERYWHERE in the graph. — Preceding unsigned comment added by 63.194.72.222 (talk) 18:25, 4 January 2013 (UTC)Reply[reply]

Update 2 -- I've successfully resolved various issues in the Matlab code.

1) There is no reason why the pagerank vector should be constantly renormalized by the L2 norm. I haven't found any justification for it's use. It makes sense to normalize the initial vale by the L1 norm, to assure that the page rank distribution sums to 1.0. However, after this it will remain a PDF and not need renormalizing.

2) The way of dealing with SINKs is ill-defined at best. In order to achieve the PageRank for the original graphic at the top of the page, it is necessary to treat a sink's out links as being equally connected to the whole graph, including itself. However, one does not include self-loops for nodes with existing outlinks. The reason for this is that making the transition be equally likely across all nodes is to make sure that a sink node is treated as a 100% AUTOMATIC restart

Wdavies (talk) 16:31, 22 April 2013 (UTC)WDaviesReply[reply]

Hi, thanks for pointing this out, you are indeed right. Sink pages have to be represented as if they would link to all pages; and the older Matlab example was wrong in using the L2 norm. I have changed the (broken) Matlab example to a simple Python example and illustrated how to run it (in the external pages section of the page). I think this is much clearer now. I have also removed the "proof that the Matlab algorithm is wrong" discussion in the middle of the page. Others, please feel free to improve on this.

129.169.150.206 (talk) 21:39, 23 August 2014 (UTC)Reply[reply]

Removed[edit]

Google uses an automated web spider called Googlebot to actually count links and gather other information on web pages.

It is by no means clear that the counting can be said to be done by Googlebot, and it is not intuitively a spidering operation, more likely a feature of the database to which the spidering software stores its flies. Therefore this needs a citation to be in the article. Clearly what parts of the Google infrastructure are called "Googlebot" is up to Google, however if it extends too far, the description needs to be changed. All the best: Rich Farmbrough13:25, 11 August 2014 (UTC).

Unnecessary assumption in derivation of power method?[edit]

In the power method section, the first step of the derivation is :

If the matrix \mathcal{M} is a transition probability, i.e., column-stochastic with no columns consisting of just zeros and \mathbf{R} is a probability distribution (i.e., |\mathbf{R}|=1, \mathbf{E}\mathbf{R}=\mathbf{1} where \mathbf{E} is matrix of all ones), Eq. (**) is equivalent to

   \mathbf{R} = \left( d \mathcal{M} + \frac{1-d}{N} \mathbf{E} \right)\mathbf{R} =: \widehat{ \mathcal{M}} \mathbf{R}.       (***) 

And so on...

My Question: Is it necessary that \mathcal{M} has this particular property for this step of the derivation to hold? It seems that only the property that R is a probability distribution is required. — Preceding unsigned comment added by Mtjoul (talkcontribs) 22:41, 21 December 2014 (UTC)Reply[reply]

External links modified[edit]

Hello fellow Wikipedians,

I have just added archive links to one external link on PageRank. Please take a moment to review my edit. You may add {{cbignore}} after the link to keep me from modifying it, if I keep adding bad data, but formatting bugs should be reported instead. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether, but should be used as a last resort. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 05:05, 31 March 2016 (UTC)Reply[reply]

  • What does the link: Our products and services by Google have to do with PageRank? I followed the link but could not find PageRank. If it is somewhere at that URL, then the link should be refined. 5.34.22.174 (talk) 04:41, 5 January 2018 (UTC)Reply[reply]

Minor error in "Damping Factor"[edit]

In "Damping Factor", after the two formulas, it states, "The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N." However (unless I'm blind), from the first equation to the second, only the first part has been multiplied by N. (Otherwise you'd have 1 - d + Nd(stuff).) With the second equation given as-is, the sum of page ranks is rather trickier than "N". Given that the equation is already acknowledged to be wrong, it's probably not urgent, but hey. — Preceding unsigned comment added by 98.102.161.228 (talk) 17:37, 22 August 2016 (UTC) The parameter of this code is good to be used in many problems..hoursguru.comReply[reply]

Damping factor for biological data[edit]

The article mentions 0.31 as the optimal value for d, however nowhere in the cited paper can I directly find 0.31. I can find a reference to an epsilon of 0.3 in Appendix 1.1, but I am not yet convinced this is equivalent to the damping factor d. Can someone clarify? --José Devezas (talk) 08:19, 19 December 2018 (UTC)Reply[reply]

I also wonder whether the statement even makes sense. How can there be a single damping factor for all "biological data"? What application are we talking about? What is being modelled? I will remove the statement. If someone feels it adds value or explanatory power to the article, perhaps they can clarify it. GBMorris (talk) 13:25, 9 March 2020 (UTC)Reply[reply]

I agree with the points raised by both of you and consider the removal of the statement a good decision. BernardoSulzbach (talk) 20:50, 9 March 2020 (UTC)Reply[reply]

Examples are Wrong[edit]

  • It's not clear what does the value 0.5 represents
  • The sum(i, M_i,j) is never 1
  • I think columns/rows have been messed up. Comments mention that M_i,j represents link from j to i, but if you run a clear example, e.g, the following:

everything links to A which means A is an important page, and A links to C, thus C is important as well.

   M = np.array([
       [1, 1, 1],  # * -> A
       [0, 0, 0], 
       [1, 0, 0]   # A -> C
   ])
  print(pagerank(M, 0.001, 0.85)) 
  array([[0.61536926],  
         [0.28131799],
         [0.10331275]])

B should be last but its second. Am I missing something?

I'm not sure which version of the algorithm this edit was referring to, but I've been working with it and it works; in this case, [1, 1, 1] should be impossible because A cannot link to itself. A links to C would be [0, 0, 1] on the A row if [A, B, C], and A links to both B and C would be [0, 1, 1]. I'm also not sure what the call to `pagerank(..., 0.001, ...)` is meant to do, since `0.001` doesn't make sense as the number of iterations (I'm guessing it would iterate once).
--D.g.lab. (talk) 16:31, 10 April 2021 (UTC)Reply[reply]
I'm also not sure what was meant by the sum of (i, M_i, j) is 1 since half those variable names don't exist in the code. It could be said that the sum of every column in the matrix is 1?
The sum of the result vector, called v in the code, should also be approximately 1, I believe, since it contains percentages.
--D.g.lab. (talk) 16:39, 10 April 2021 (UTC)Reply[reply]

Variables lack clear names[edit]

It's already difficult enough to understand as it is, why not clarify the variable names and add comments, as in any good code? For example:

   def pagerank(matrix, num_iterations: int = 100, damping: float = 0.85):
       size = matrix.shape[1]
       res_vector = [1/size] * size
       M_hat = (damping * matrix + (1 - damping) / size)
       for i in range(num_iterations):
           res_vector = M_hat @ res_vector # matrix multiplication
       return res_vector

Also, notice the lines in the article:

   v = np.random.rand(N, 1)
   v = v / np.linalg.norm(v, 1)

these insert random numbers in the multiplication vector. However, as far as I understand, this vector is supposed to start initially as `1/number_of_pages` (1/size) for each of its elements, which could be re-written

   v = [1/N] * N

or, using the new varible names,

   res_vector = [1/size] * size

According to this video on YouTube: Linear Algebra – Introduction to PageRank (4:21 timestamp).

I'm unsure why this was like this, so I'm not editing the page; however, if someone can confirm, the initial method seems "way too complicated" and there's no explanation for it. --D.g.lab. (talk) 16:31, 10 April 2021 (UTC)Reply[reply]

site:example.com[edit]

😐site:example.com 2A00:F41:4828:36B6:0:4B:5D7E:4801 (talk) 23:52, 24 February 2022 (UTC)Reply[reply]