Statistically Improbable Phrase

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus.[1][2][3] Amazon.com uses this concept to determine keywords for a given book or chapter, since the keywords of a book or chapter are likely to appear disproportionately often within that section.[4][5] In his book Dataclysm, Christian Rudder has also applied this concept to data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender.

https://en.m.wikipedia.org/wiki/Statistically_improbable_phrase

https://stackoverflow.com/questions/2009498/how-does-amazons-statistically-improbable-phrases-work

https://www.wired.com/2009/05/web-semantics-statistically-impossible-phrases-a-literary-view/winamp/
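
The definition above can be illustrated with a minimal sketch that compares how often each phrase occurs in a document with how often it occurs in a reference corpus. This is not Amazon's actual method, which the sources here do not specify. The function names, the choice of bigrams, and the add-one smoothing are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count lowercase word n-grams in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(zip(*(words[i:] for i in range(n))))


def improbable_phrases(document, corpus, n=2, top=5):
    """Rank the document's n-grams by how over-represented they are.

    Score = relative frequency in the document divided by the smoothed
    relative frequency in the corpus. Add-one smoothing is an assumption
    made here so that phrases absent from the corpus do not cause a
    division by zero.
    """
    doc = ngram_counts(document, n)
    ref = ngram_counts(corpus, n)
    doc_total = sum(doc.values())
    ref_total = sum(ref.values())
    vocab = len(set(doc) | set(ref))
    scores = {}
    for gram, count in doc.items():
        p_doc = count / doc_total
        p_ref = (ref[gram] + 1) / (ref_total + vocab)
        scores[" ".join(gram)] = p_doc / p_ref
    # Highest ratio first: these are the candidate improbable phrases.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

A phrase that recurs in the document but is rare or absent in the corpus gets the highest score. In practice the corpus would need to be far larger than the document for the ratios to be meaningful.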

Using NLTK
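
A rough sketch of how the same idea might look with NLTK, assuming the library is installed. It uses `RegexpTokenizer`, `ngrams`, and `FreqDist`, none of which require downloading extra NLTK data. The helper names `bigram_freqs` and `sip_candidates` and the frequency floor are hypothetical choices for this example, not part of NLTK.

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

# Regex-based tokenizer: splits on runs of word characters.
tokenizer = RegexpTokenizer(r"\w+")


def bigram_freqs(text):
    """Return a FreqDist over the lowercase word bigrams of a text."""
    tokens = tokenizer.tokenize(text.lower())
    return FreqDist(ngrams(tokens, 2))


def sip_candidates(document, corpus, top=5):
    """Rank document bigrams by document frequency over corpus frequency."""
    doc, ref = bigram_freqs(document), bigram_freqs(corpus)
    # Assumed floor so bigrams missing from the corpus do not divide by zero.
    floor = 1 / (ref.N() + 1)
    scored = [(" ".join(g), doc.freq(g) / max(ref.freq(g), floor)) for g in doc]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top]
```

Calling `sip_candidates(chapter_text, corpus_text)` returns up to `top` pairs of phrase and score, where `chapter_text` and `corpus_text` are plain strings supplied by the caller.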

