combig.pl man page on DragonFly

Man page or keyword search:  
man Server   44335 pages
apropos Keyword Search (all sections)
Output format
DragonFly logo
[printable version]

COMBIG(1)	      User Contributed Perl Documentation	     COMBIG(1)

NAME
       combig.pl - Combine frequency counts to determine co-occurrence

SYNOPSIS
       Combines (sums) the frequency counts of bigrams made up of the same
       pair of words in either possible order.	It will	  count the number of
       time two words occur together in a bigram  regardless of which one
       comes first.

DESCRIPTION
   USAGE
	combig.pl [OPTIONS] BIGRAM

       INPUT PARAMETERS

       ·   BIGRAM

	   Specify a file of bigram counts created by NSP programs count.pl.
	   The entries in BIGRAM will be formatted as follows:

		   word1<>word2<>n11 n1p np1

	   Here, word1 is followed by word2 n11 times. word1 occurs as the 1st
	   word in total n1p bigrams and word2 occurs as the 2nd word in np1
	   bigrams.

       ·   OPTIONS

	    --help

	   Displays this message.

	    --version

	   Displays the version information.

       OUTPUT

       combig.pl produces a count of the number of times two words make up a
       bigram in either order, whereas count.pl produces counts for a single
       fixed ordering. In other words, combig.pl combines the counts of
       bigrams that are composed of the same words but in reverse order. While
       the BIGRAM shows pairs of words forming bigrams, output of combig will
       show the pairs of words that are co-occurrences or that co-occur
       irrespective of their order.

       e.g. if bigrams

	       word1<>word2<>n11 n1p np1
       and
	       word2<>word1<>m11 m1p mp1

       are found in BIGRAM file, then combig.pl treats these as a single
       unordered bigram

	       word1<>word2<>n11+m11 n1p+mp1 np1+m1p

       where the new bigram will show a combined contingency table in which
       the order of words doesn't matter.

			       word2	       ~word2
	       ___________________________________________________________
		word1  |       n11+m11	       n12+m21	   | n1p+mp1
		       |				   |
	       ~word1  |       n21+m12	       n22+m22-n   | n2p+mp2-n
		       ___________________________________________________
			       np1+m1p	       np2+m2p-n   |  n
       here the entry

       ·   (word1,word2)=n11+m11

	   shows the number of bigrams having both word1 and word2 in either
	   order

	   i.e. word1<>word2 + word2<>word1

       ·   (word1,~word2)=n12+m21

	   shows the number of bigrams having word1 but not word2 at either
	   position

	   i.e. word1<>~word2 + ~word2<>word1

       ·   (~word1,word2)=n21+m12

	   shows the number of bigrams having word2 but not word1 at either
	   position

	   i.e. ~word1<>word2 + word2<>~word1

       ·   (~word1,~word2)=n22+m22

	   shows the number of bigrams not having word1 nor word2 at either
	   position

	   i.e. ~word1<>~word2 + ~word2<>~word1 - n

	   where n=total number of bigrams

	   The mathematical proof of how the cell counts in the above
	   contingency table are counted is explained in section Proof.

       When a bigram appears in only one order i.e.

       word1<>word2<>n11 n1p np1

       appears but

       word2<>word1<>m11 m1p mp1

       does not, then the combined bigram will be same as the original bigram
       that appears. Or in other words,

       word1<>word2<>n11 n1p np1

       is displayed as it is.

   PROOF OF CORRECTNESS
       A bigram word1<>word2<>n11 n1p np1 represents a contingency table

			 word2	       ~word2
		       --------------------------------------
	       word1   n11     |       n12     |       n1p
			       |	       |
	       ~word1  n21     |       n22     |       n2p
		       --------------------------------------
		       np1     |       np2     |       n

       while a bigram word2<>word1<>m11 m1p mp1 represents a contingency table

			 word1	       ~word1
		       --------------------------------------
	       word2   m11     |       m12     |       m1p
			       |	       |
	       ~word2  m21     |       m22     |       m2p
		       --------------------------------------
		       mp1     |       mp2     |       n

       Here,

	n11+n12+n21+n22 = n

       Also,

	m11+m12+m21+m22 = n

       combig.pl combines bigram counts into a single order independant word
       pair

	word1<>word2<>n11+m11 n12+m21 n21+m12

       And the corresponding contingency table will be shown as

			       word2	       ~word2
		       -----------------------------------------
	       word1   n11+m11	 |     n12+m21	 |     n1p+mp1
				 |		 |
	       ~word1  n21+m12	 |     n22+m22-n |     n2p+mp2
		       -----------------------------------------
		       np1+m1p	 |     np2+m2p	 |     n

       The first cell (n11+m11) shows the #bigrams having word1 and word2
       (irrespective of their positions) i.e. word1<>word2 or word2<>word1
       which is n11+m11.

       The second cell (n12+m21) shows the #bigrams having word1 but not word2
       at any position i.e. word1<>~word2 or ~word2<>word1 which is n12+m21.

       The third cell (n21+m12) shows the #bigrams having word2 but not word1
       at any position i.e. ~word1<>word2 or word2<>~word1 which is n21+m12.

       The fourth cell (m22+n22-n) shows the #bigrams not having word1 nor
       word2 at any position which

	= n - (n11+m11) - (n12+m21) - (n21+m12)

	= n - (n11+n12+n21) - (m11+m12+m21)

	= n - (n-n22) - (n-m22)

	= n22 + m22 - n

       Alternative proof -

	n22 = m11 + m12 + m21 + X      (a)

	m22 = n11 + n12 + n21 + X      (b)

       where X = #bigrams not having either word1 or word2.

       as both n22 and m22 have some terms in common which show the bigrams
       not having either word1 or word2. But,

	m11+m12+m21 = n - m22

       Substituting this in eqn (a)

	n22 = n - m22 + X

       Or

	X = n22 + m22 - n

       Or add (a) and (b) to get

	n22+m22 = (n11+m11) + (n12+m21) + (n21+m12) + 2X

       rearranging terms,

	n22+m22 = (n11+n12+n21) + (m11+m12+m21) + 2X

       but

	n11+n12+n21 = n - n22 and

	m11+m12+m21 = n - m22

       Hence,

	n22+m22 = (n-n22) + (n-m22) + 2X

	2(n22+m22-n) = 2X

       Or

	(n22+m22-n) = X

       which is the fourth cell count.

   Viewing Bigrams as Graphs
       In bigrams, the order of words is important. Bigram word1<>word2 shows
       that word2 follows word1. Bigrams can be viewed as a directed graph
       where a bigram word1<>word2 will represent a directed edge e from
       initial vertex word1 to terminal vertex word2(word1->word2).

       In this case,

       n11, which is the number of times bigram word1<>word2 occurs, becomes
       the weight of the directed edge word1->word2.

       n1p, which is the number of bigrams having word1 at 1st position,
       becomes the out degree of vertex word1

       and

       np1, which is the number of bigrams having word2 at 2nd position,
       becomes the in degree of vertex word2

       combig.pl creates a new list of word pairs from these bigrams such that
       the order of words can be ignored. Viewed another way, it converts the
       directed graph of given bigrams to an undirected graph showing new word
       pairs.

       A pair say

	       word1<>word2<>n11 n1p np1

       shown in the output of combig can be viewed as an undirected edge
       joining word1 and word2 having weight n11. If we count the degree of
       vertex word1 it will be n1p and degree of vertex word2 will be np1.

AUTHORS
	Amruta Purandare, pura0010@d.umn.edu
	Ted Pedersen, tpederse@d.umn.edu

	Last update 03/22/04 by ADP

       This work has been partially supported by a National Science Foundation
       Faculty Early CAREER Development award (#0092784).

BUGS
SEE ALSO
       http://www.d.umn.edu/~tpederse/nsp.html

COPYRIGHT
       Copyright (c) 2004, Amruta Purandare and Ted Pedersen

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program; if not, write to

       The Free Software Foundation, Inc., 59 Temple Place - Suite 330,
       Boston, MA  02111-1307, USA.

perl v5.20.2			  2008-03-24			     COMBIG(1)
[top]

List of man pages available for DragonFly

Copyright (c) for man pages and the logo by the respective OS vendor.

For those who want to learn more, the polarhome community provides shell access and support.

[legal] [privacy] [GNU] [policy] [cookies] [netiquette] [sponsors] [FAQ]
Tweet
Polarhome, production since 1999.
Member of Polarhome portal.
Based on Fawad Halim's script.
....................................................................
Vote for polarhome
Free Shell Accounts :: the biggest list on the net