NAME

Bcc - This module contains all the functions needed for BCC.


METHODS

source_reader
Reads the source.txt file and processes each citation into an [id, information] entry of the returned array reference.

Data Structure:
reference => array
        0 => array      #index
                0 'id'
                1 'information'
        1 ...

Usage:
my $report_ref = bcc->source_reader;
print $report_ref->[0]->[0];#print first citation's ID
print $report_ref->[0]->[1];#print first citation's text

assertion_reader
Reader for the assertion.mdf file, a manual digest of the information extracted from the raw.txt file, which contains the original abstracts. It processes assertion.mdf into a hash reference. The 'id' field is the ID of the citation.

Data Structure:
reference => hash
        'dat'   => array
                0 => array      #index
                        0 'id'
                        1 'information'

        'abb'   => array
                0 => array      #index
                        0 'id'
                        1 'information'

Usage:
my $ref = bcc->assertion_reader;
$ref->{dat}->[0]->[0];#ID
$ref->{dat}->[0]->[1];#information

review_filereader
Reader for review.mdf, which contains 110 paragraphs of biomedical reviews with citation tags and table and figure references removed. Returns a reference to an array of paragraphs.

Usage:
my $ref = bcc->review_filereader;
print join("\n", @{$ref});#print all of the paragraphs

medpost_filereader
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical literature, stored in the medpost.mdf file (L. Smith, T. Rindflesch and W. J. Wilbur. 2004. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320-2321). Returns a reference to an array of tagged sentences.

Usage:
my $ref = bcc->medpost_filereader;
print join("\n", @{$ref});#print all of the tagged sentences

medpost_penn_filereader
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical literature, stored in the medpost_penn.mdf file. This file is tagged with the Penn Treebank tag set (Marcus, MP, Santorini, B and Marcinkiewicz, MA. 1994. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, 313-330) rather than the MedPost tag set used in medpost.mdf (L. Smith, T. Rindflesch and W. J. Wilbur. 2004. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320-2321). Returns a reference to an array of tagged sentences.

Usage:
my $ref = bcc->medpost_penn_filereader;
print join("\n", @{$ref});#print all of the tagged sentences

medpost_spec_filereader
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical literature, stored in the medpost_spec.mdf file. This file is tagged with the SPECIALIST tag set (National Library of Medicine, 2003. UMLS Knowledge Sources, 14th Edn) rather than the MedPost tag set used in medpost.mdf (L. Smith, T. Rindflesch and W. J. Wilbur. 2004. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320-2321). Returns a reference to an array of tagged sentences.

Usage:
my $ref = bcc->medpost_spec_filereader;
print join("\n", @{$ref});#print all of the tagged sentences

medpost_tagreader
Line reader for the MedPost corpus, to be used after the medpost_filereader() method. It takes a tagged sentence from the MedPost corpus and splits it into an ARRAY of ARRAYs, [[word, tag], ...]. The [word, tag] pairs keep the order in which they appear in the tagged sentence.

Data Structure:
reference => array
        0 => array      #index
                0 => array
                        0 'word'
                        1 'tag'

Usage:
my $ref = bcc->medpost_filereader;
my @medp_tagged;
push @medp_tagged, bcc->medpost_tagreader($_) foreach(@{$ref});
print $medp_tagged[index of tagged sentence]->[index of words]->[0:word, 1:tag];
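The splitting step described above can be sketched in a few lines of Perl. This is only an illustration, not the module's code: the helper name split_tagged_sentence is hypothetical, the sample sentence is made up, and the sketch assumes that tokens are separated by whitespace and that the last underscore in each token joins the word to its tag (word_TAG).

```perl
use strict;
use warnings;

# Hypothetical helper: turn "word_TAG word_TAG ..." into [[word, tag], ...].
# Assumption: the LAST underscore in each token separates word from tag.
sub split_tagged_sentence {
    my ($sentence) = @_;
    my @pairs;
    foreach my $token (split ' ', $sentence) {
        # The greedy (.*) keeps underscores that belong to the word itself.
        if ($token =~ /^(.*)_([^_]+)$/) {
            push @pairs, [$1, $2];
        }
    }
    return \@pairs;
}

# Made-up sample sentence, used only to exercise the helper.
my $pairs = split_tagged_sentence('The_DD cells_NNS were_VBB lysed_VVN ._.');
print "$_->[0]\t$_->[1]\n" foreach @{$pairs};
```

Tokens that do not contain an underscore are silently skipped in this sketch; a real reader may prefer to report them.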

medpost_untagged
Removes tags from a sentence from MedPost corpus and returns the untagged (original) sentence. To be used after medpost_filereader() method.

Usage:
my $ref = bcc->medpost_filereader;
my @medp_untagged;
push @medp_untagged,bcc->medpost_untagged($_) foreach(@{$ref});
print $medp_untagged[index of sentence];

medpost_tagsplitter
Combines the work of the medpost_tagreader() and medpost_untagged() methods. It processes a sentence from the MedPost corpus into its untagged form (as returned by medpost_untagged()) and splits the ARRAY of [word, tag] pairs from medpost_tagreader() into separate lists of words and tags, all returned in one hash REFERENCE.

Data Structure:
reference => hash
        'source' => sentence

        'word_list' => array
                0 word
                1 ...

        'tag_list' => array
                0 tag
                1 ...

Usage:
my $ref = bcc->medpost_filereader;
my @medp_tagsplitter;
push @medp_tagsplitter, bcc->medpost_tagsplitter($_) foreach(@{$ref});
print ${$medp_tagsplitter[index of sentence]->{source}};
print @{$medp_tagsplitter[index of sentence]->{word_list}};
print @{$medp_tagsplitter[index of sentence]->{tag_list}};

yapex_filereader
Reads the Yapex corpus and returns a list of abstracts (abs). (Franzen, K., Eriksson, G., Olsson, F., Asker, L., Liden, P., and Coster, J. 2002. Protein names and how to find them. International Journal of Medical Informatics 67(1-3):49-61.)

lllc05
Reads the training data set for the LLL Challenge 2005 (Learning Language in Logic) and returns a hash reference holding the source sentences, words, agents, targets and interactions (under 'f05') and the synonymous gene names (under 'f05dd'). (Cussens, J. (ed). 2005. Proceedings of the Learning Language in Logic Workshop 2005 (LLL05). Claire Nedellec, MIG-INRA, France.)

Data Structure:
reference => hash
        'f05' => hash
                'agent' => array
                        0 'agent'
                        1 ...

                'interaction' => array
                        0 array
                                0 'scalar'
                                1 ...
                        1 ...

                'sentence' => array
                        0 'sentence 1'
                        1 'sentence 2'
                        2 ...

                'targets' => array
                        0 'target'
                        1 ...

                'words' => array
                        0 => array      #sentence 1
                                0 'word 1'
                                1 'word 2'
                                2 ...
                        1 ...   #sentence 2

        'f05dd' => array
                0 'synonymous gene name 1'
                1 'synonymous gene name 2'
                2 ...

Usage:
my $lllc = bcc->lllc05;
print @{$lllc->{f05dd}};
print $lllc->{f05}->{sentence}->[index of sentence];
print $lllc->{f05}->{words}->[index of sentence]->[index of word];
print $lllc->{f05}->{agent}->[index of sentence]->[index of agent];
print $lllc->{f05}->{targets}->[index of sentence]->[index of targets];
print $lllc->{f05}->{interaction}->[index of sentence]->[index of interactions];

medstract_alias_dev_filereader
Reads the Medstract Acronym/Alias Identification (development) corpus and returns a hash reference with two entries: 'abs', an ARRAY of abstracts/titles, and 'alias', an ARRAY of alias-pairs, where each alias-pair is a 2-element ARRAY, [alias1, alias2].

http://medstract.org/gold-standards.html

Data Structure:
reference => hash
        'abs' => array
                0 'abstract/title 1'
                1 ...

        'alias' => array
                0 => array      #alias-pairs
                        0 'alias 1'
                        1 'alias 2'
                1 ...

Usage:
my $alias_dev = bcc->medstract_alias_dev_filereader;
print join("\n", @{$alias_dev->{abs}});
print join("\n", map { join(' ', @{$_}) } @{$alias_dev->{alias}});#one alias-pair per line
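Walking the documented 'abs'/'alias' structure can be sketched as follows. The data here is a mock in the documented shape with invented placeholder strings, and the helper name format_alias_pairs is hypothetical; it is not part of the module.

```perl
use strict;
use warnings;

# Mock data in the documented shape; the strings are invented placeholders.
my $alias_dev = {
    abs   => ['abstract/title 1', 'abstract/title 2'],
    alias => [['alias 1', 'alias 2'], ['alias 3', 'alias 4']],
};

# Hypothetical helper: format each alias-pair as "first<TAB>second".
sub format_alias_pairs {
    my ($data) = @_;
    return [ map { join("\t", @{$_}) } @{ $data->{alias} } ];
}

print "$_\n" foreach @{ format_alias_pairs($alias_dev) };
```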

medstract_alias_evl_filereader
Reads the Medstract Acronym/Alias Identification (evaluation) corpus and returns a hash reference with two entries: 'abs', an ARRAY of abstracts/titles, and 'alias', an ARRAY of alias-pairs, where each alias-pair is a 2-element ARRAY, [alias1, alias2].

http://medstract.org/gold-standards.html

Data Structure:
reference => hash
        'abs' => array
                0 'abstract/title 1'
                1 ...

        'alias' => array
                0 => array      #alias-pairs
                        0 'alias 1'
                        1 'alias 2'
                1 ...

Usage:
my $alias_evl = bcc->medstract_alias_evl_filereader;
print join("\n", @{$alias_evl->{abs}});

genia302pos_filereader
Reads the GENIA version 3.02 POS-tagged corpus and returns a reference to a list containing the tagged sentences. (Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics. 19(suppl. 1). pp. i180-i182.)

Usage:
my $genia302 = bcc->genia302pos_filereader;
print join(' ', @{$genia302});

genia302pos_tagreader
Line reader for the GENIA 3.02 POS corpus, to be used after the genia302pos_filereader() method. It takes a tagged sentence from the GENIA 3.02 POS corpus and splits it into an ARRAY of ARRAYs, [[word, tag], ...]. The [word, tag] pairs keep the order in which they appear in the tagged sentence.

Data Structure:
reference => array
        0 => array      #pair of word and tag
                0 word
                1 tag
        1 ...

Usage:
my $genia302 = bcc->genia302pos_filereader;
my $report = bcc->genia302pos_tagreader($genia302->[0]);
print join(' ', map { "$_->[0]/$_->[1]" } @{$report});#print each pair as word/tag

genia302pos_untagged
Removes tags from a sentence from GENIA 3.02 POS corpus and returns the untagged (original) sentence. To be used after genia302pos_filereader() method.

Usage:
my $genia302 = bcc->genia302pos_filereader;
my $report = bcc->genia302pos_untagged($genia302->[0]);
print $report;

genia302pos_tagsplitter
Combines the work of the genia302pos_tagreader() and genia302pos_untagged() methods. It processes a sentence from the GENIA 3.02 POS corpus into its untagged form (as returned by genia302pos_untagged()) and splits the ARRAY of [word, tag] pairs from genia302pos_tagreader() into an ARRAY of words and an ARRAY of tags.

Data Structure:
reference => hash
        'source' => 'source sentence'

        'tag_list' => array
                0 'tag 1'
                1 'tag 2'
                2 ...

        'word_list' => array
                0 'word 1'
                1 'word 2'
                2 ...

Usage:
my $genia302 = bcc->genia302pos_filereader;
my $report = bcc->genia302pos_tagsplitter($genia302->[0]);
print join(' ', @{$report->{tag_list}});
print join(' ', @{$report->{word_list}});

commonwords
The 1000 most common words in the English language.

commonwords
Common verbs in the English language.

penntag_lineSplitter
Processes a Penn Treebank tagged sentence, or any tagged text of the form <word>/<tag>, into a list of words and a list of tags.

Usage:
my $genia = bcc->genia302pos_filereader;
my $report = bcc->penntag_lineSplitter($genia->[0]);
print @{$report->{word_list}};
print @{$report->{tag_list}};
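A minimal sketch of splitting a <word>/<tag> line into the two lists is shown below. It is not the module's implementation: the helper name split_word_tag_line is hypothetical, the sample line is made up, and the sketch assumes that the last slash in each token separates the word from its tag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a <word>/<tag> line splitter.
# Assumption: the LAST slash separates word from tag, so "1/2/CD" gives word "1/2".
sub split_word_tag_line {
    my ($line) = @_;
    my (@words, @tags);
    foreach my $token (split ' ', $line) {
        next unless $token =~ m{^(.+)/([^/]+)$};
        push @words, $1;
        push @tags,  $2;
    }
    return { word_list => \@words, tag_list => \@tags };
}

# Made-up sample line, used only to exercise the helper.
my $split = split_word_tag_line('IL-2/NN gene/NN expression/NN requires/VBZ 1/2/CD');
print join(' ', @{ $split->{word_list} }), "\n";
print join(' ', @{ $split->{tag_list} }), "\n";
```

Keeping the two lists index-aligned means that word i always carries tag i, which is what a later 1-to-1 tag comparison relies on.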

penntag_corpusSplitter
Processes a list of Penn Treebank tagged sentences, or of any tagged text of the form <word>/<tag>, into a list of words and a list of tags.

Usage:
my $genia = bcc->genia302pos_filereader;
my $report = bcc->penntag_corpusSplitter($genia);
print join("\n", @{$report->{word_list}});
print join("\n", @{$report->{tag_list}});

tagging_accuracy
Given 2 lists of tags, reports the accuracy of tagging using a 1-to-1 (position by position) tag comparison.

param $ref: reference to an array of tags (reference set)
param $test: reference to an array of tags (testing set)

return: $self, where '@{$self->{index_of_wrong}}' is an array of the indices of wrong tags, '$self->{length_of_ref}' is the length of $ref, '$self->{length_of_test}' is the length of $test, and '$self->{correct_count}' is the number of correct tags.

Usage:
my $ref = ['aaa', 'bbb', 'ccc'];
my $test= ['aaa', 'bcb', 'cbc'];
my $report = bcc->tagging_accuracy($ref, $test);
print $report->{length_of_ref}; print $report->{length_of_test};
print $report->{correct_count};
print join(' ', @{$report->{index_of_wrong}});
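The 1-to-1 comparison can be sketched as below. This is a hypothetical re-implementation for illustration (the name compare_tags is not part of the module), and it assumes that positions beyond the end of the shorter list are simply not compared; the module may treat unequal lengths differently.

```perl
use strict;
use warnings;

# Hypothetical re-implementation of a 1-to-1 tag comparison.
# Assumption: positions beyond the shorter list are not compared.
sub compare_tags {
    my ($ref, $test) = @_;
    my %report = (
        length_of_ref  => scalar @{$ref},
        length_of_test => scalar @{$test},
        correct_count  => 0,
        index_of_wrong => [],
    );
    my $limit = $report{length_of_ref} < $report{length_of_test}
              ? $report{length_of_ref}
              : $report{length_of_test};
    foreach my $i (0 .. $limit - 1) {
        if ($ref->[$i] eq $test->[$i]) {
            $report{correct_count}++;
        } else {
            push @{ $report{index_of_wrong} }, $i;
        }
    }
    return \%report;
}

my $accuracy = compare_tags(['aaa', 'bbb', 'ccc'], ['aaa', 'bcb', 'cbc']);
print "correct: $accuracy->{correct_count}\n";                      # correct: 1
print "wrong at: ", join(' ', @{ $accuracy->{index_of_wrong} }), "\n"; # wrong at: 1 2
```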

ArithmeticMean
Returns the arithmetic mean of the values in the passed list. Assumes a '1D' list, but will function on the 1st dimension of a multi-dimensional array (!).

Usage:
my $list = [1,2,3,4,5,6];
my $report = bcc->ArithmeticMean($list);

StandardDeviation
Calculates the standard deviation from a 1-dimensional list (inlist) and its arithmetic mean (mean).
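The two calculations can be sketched together. The helper names mean_of and sd_of are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1) and a list with at least two values; the text above does not say which variant the module uses.

```perl
use strict;
use warnings;

# Hypothetical helpers, not the module's code.
sub mean_of {
    my ($list) = @_;
    my $sum = 0;
    $sum += $_ foreach @{$list};
    return $sum / scalar @{$list};
}

# Assumption: sample standard deviation (divides by n - 1).
sub sd_of {
    my ($list, $mean) = @_;
    my $squares = 0;
    $squares += ($_ - $mean) ** 2 foreach @{$list};
    return sqrt($squares / (scalar(@{$list}) - 1));
}

my $values = [1, 2, 3, 4, 5, 6];
my $mean   = mean_of($values);
printf "mean = %.2f, sd = %.4f\n", $mean, sd_of($values, $mean);
```

Passing the mean in as a second argument mirrors the (inlist, mean) signature described above and avoids computing it twice.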

BootstrapRandomization
Basic statistical bootstrap estimation of the mean and standard deviation from a large sample of percentage-correct values, each calculated as #correct / (#correct + #wrong).

param $correct: the number of correct answers
param $wrong: the number of wrong answers
param $runs: the number of randomization runs (Efron suggests 200 or more)

Return $self ('@{$self->{result}}' = resultList, '$self->{mean}' = Mean, '$self->{sd}' = StandardDeviation), where resultList contains the randomized percentage correct.

Usage:
my $boot = bcc->BootstrapRandomization($correct, $wrong, $runs);
@{$boot->{result}};#List of results
$boot->{mean};#Mean
$boot->{sd};#StandardDeviation
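One way to picture the resampling is the sketch below. It is an assumption about the general technique, not the module's code: the name bootstrap_accuracy is hypothetical, each run draws (correct + wrong) outcomes with replacement from the observed outcomes, and the standard deviation is the sample form (n - 1).

```perl
use strict;
use warnings;

# Hypothetical sketch: resample (correct + wrong) outcomes with replacement.
sub bootstrap_accuracy {
    my ($correct, $wrong, $runs) = @_;
    my $n        = $correct + $wrong;
    my @outcomes = ((1) x $correct, (0) x $wrong);   # 1 = correct, 0 = wrong
    my @result;
    foreach my $run (1 .. $runs) {
        my $hits = 0;
        $hits += $outcomes[ int(rand($n)) ] foreach 1 .. $n;
        push @result, $hits / $n;                    # fraction correct in this run
    }
    my $mean = 0;
    $mean += $_ foreach @result;
    $mean /= $runs;
    my $squares = 0;
    $squares += ($_ - $mean) ** 2 foreach @result;
    my $sd = $runs > 1 ? sqrt($squares / ($runs - 1)) : 0;
    return { result => \@result, mean => $mean, sd => $sd };
}

my $estimate = bootstrap_accuracy(80, 20, 200);
printf "mean = %.3f, sd = %.3f over %d runs\n",
    $estimate->{mean}, $estimate->{sd}, scalar @{ $estimate->{result} };
```

Because the draws are random, the printed values differ from run to run; the mean should stay close to 80 / (80 + 20) = 0.8.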

Jaccard
Given 2 space-delimited strings (original and test), calculates the Jaccard Distance based on the formula,

1 - [(number of regions where both species are present)/ (number of regions where at least one species is present)]
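Applied to two space-delimited strings, the formula can be sketched as follows. This is an interpretation, not the module's code: the name jaccard_distance is hypothetical, and the sketch assumes that each distinct token plays the role of a "region", so "both present" means the token occurs in both strings and "at least one" means it occurs in either.

```perl
use strict;
use warnings;

# Hypothetical sketch: Jaccard distance over the sets of distinct tokens.
sub jaccard_distance {
    my ($original, $test) = @_;
    my %in_original = map { ($_ => 1) } split ' ', $original;
    my %in_test     = map { ($_ => 1) } split ' ', $test;
    my %union       = (%in_original, %in_test);
    my $both   = grep { $in_test{$_} } keys %in_original;   # tokens in both strings
    my $either = scalar keys %union;                        # tokens in at least one
    return 0 unless $either;    # two empty strings: treated as identical here
    return 1 - $both / $either;
}

printf "%.2f\n", jaccard_distance('a b c d', 'c d e');   # prints 0.60
```

In the example, 2 tokens (c, d) are shared out of 5 distinct tokens, giving 1 - 2/5 = 0.6.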

Nei_Li
Given 2 space-delimited strings (original and test), calculates the Nei and Li Distance based on the formula,

1 - [2 x (number of regions where both species are present)/ [(2 x (number of regions where both species are present)) + (number of regions where only one species is present)]]
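The same token-set reading gives a short sketch of the Nei and Li formula. As before, this is an illustration under assumptions: the name nei_li_distance is hypothetical, and each distinct token is treated as a "region" that is present in one or both strings.

```perl
use strict;
use warnings;

# Hypothetical sketch: Nei and Li distance over the sets of distinct tokens.
sub nei_li_distance {
    my ($original, $test) = @_;
    my %in_original = map { ($_ => 1) } split ' ', $original;
    my %in_test     = map { ($_ => 1) } split ' ', $test;
    my $both     = grep { $in_test{$_} } keys %in_original;  # tokens in both strings
    my $only_one = scalar(keys %in_original) + scalar(keys %in_test) - 2 * $both;
    my $denominator = 2 * $both + $only_one;
    return 0 unless $denominator;   # two empty strings: treated as identical here
    return 1 - (2 * $both) / $denominator;
}

printf "%.4f\n", nei_li_distance('a b c d', 'c d e');   # prints 0.4286
```

In the example, 2 tokens are shared and 3 occur in only one string, giving 1 - 4/7.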

Levenshtein
Calculates the Levenshtein distance between a and b. This routine is implemented by Eli Bendersky (http://www.merriampark.com/ldperl.htm)
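For readers unfamiliar with the measure, the standard dynamic-programming formulation is sketched below. It is not the routine referenced above: the name levenshtein_distance is hypothetical, and the sketch keeps only two rows of the distance table at a time.

```perl
use strict;
use warnings;

# Hypothetical sketch of the classic two-row dynamic-programming algorithm.
sub levenshtein_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);          # distances from the empty prefix of $s
    foreach my $i (1 .. length $s) {
        my @curr = ($i);                  # deleting the first $i characters of $s
        foreach my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                        # substitution
            $best = $prev[$j] + 1     if $prev[$j] + 1     < $best;  # deletion
            $best = $curr[$j - 1] + 1 if $curr[$j - 1] + 1 < $best;  # insertion
            push @curr, $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein_distance('kitten', 'sitting'), "\n";   # prints 3
```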
