Bcc - This module contains all the functions needed for BCC.
- source_reader
-
Reads the source.txt file and processes each citation into a reference.
-
Data Structure:
reference => array
0 => array #index
0 'id'
1 'information'
1 ...
-
Usage:
|
| my $report_ref = bcc->source_reader; |
|
| print $report_ref->[0]->[0] | #print first citation's ID
|
| print $report_ref->[0]->[1] | #print first citation's text |
- assertion_reader
-
Reader for the assertion.mdf file, a manual digest of information extracted from the
raw.txt file, which contains the original abstracts. It processes assertion.mdf
into a hash reference. Each 'id' is the ID of the citation.
-
Data Structure:
reference => hash
'dat' => array
0 => array #index
0 'id'
1 'information'
-
'abb' => array
0 => array #index
0 'id'
1 'information'
-
Usage:
|
| my $ref = bcc->assertion_reader; |
|
| $ref->{dat}->[0]->[0]; | #ID
|
| $ref->{dat}->[0]->[1]; | #information |
- review_filereader
-
Reader for review.mdf, which contains 110 paragraphs of biomedical reviews with citation
tags, table and figure references removed.
-
Usage:
|
| my $ref = bcc->review_filereader;
|
| print join("\n", @{$ref}); | #print all of the review paragraphs |
- medpost_filereader
-
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical
literature, in the medpost.mdf file (L. Smith, T. Rindflesch and W. J. Wilbur. 2004.
MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320-2321).
Returns a reference to an array of tagged sentences.
-
Usage:
|
| my $ref = bcc->medpost_filereader;
|
| print join("\n", @{$ref}); | #print all of the tagged sentences |
- medpost_penn_filereader
-
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical
literature, in the medpost_penn.mdf file, which is tagged with the Penn Treebank tag set (Marcus,
MP, Santorini, B and Marcinkiewicz, MA. 1994. Building a large annotated corpus of English:
the Penn Treebank. Computational Linguistics 19, 313-330) rather than the MedPost tag set used
in medpost.mdf (L. Smith, T. Rindflesch and W. J. Wilbur. 2004. MedPost: a part-of-speech
tagger for bioMedical text. Bioinformatics 20(14):2320-2321).
Returns a reference to an array of tagged sentences.
-
Usage:
|
| my $ref = bcc->medpost_penn_filereader;
|
| print join("\n", @{$ref}); | #print all of the tagged sentences |
- medpost_spec_filereader
-
Reader for the MedPost corpus, a set of 6700 manually tagged sentences from biomedical
literature, in the medpost_spec.mdf file, which is tagged with the SPECIALIST tag set (National
Library of Medicine, 2003. UMLS Knowledge Sources, 14th Edn) rather than the MedPost tag
set used in medpost.mdf (L. Smith, T. Rindflesch and W. J. Wilbur. 2004. MedPost: a
part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320-2321).
Returns a reference to an array of tagged sentences.
-
Usage:
|
| my $ref = bcc->medpost_spec_filereader;
|
| print join("\n", @{$ref}); |
- medpost_tagreader
-
Line reader for the MedPost corpus. To be used after the medpost_filereader() method. It takes
a tagged sentence from the MedPost corpus and splits it into an ARRAY of ARRAYs, [[word, tag], ...].
The [word, tag] pairs keep the same order as in the tagged sentence.
-
Data structure:
reference => array
0 => array #index
0 => array
0 'word'
1 'tag'
-
Usage:
|
| my $ref = bcc->medpost_filereader;
|
| my @medp_tagged;
|
| push @medp_tagged, bcc->medpost_tagreader($_) foreach(@{$ref});
|
| print $medp_tagged[index of tagged sentence]->[index of words]->[0:word, 1:tag]; |
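The splitting step can be sketched as below. This is a minimal illustration, not the module's code: the sub name `split_tagged_sentence` is hypothetical, and it assumes that tokens are separated by whitespace and written as word_TAG. The sample sentence and its tags are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for medpost_tagreader().
# Assumption: tokens are whitespace-separated and look like word_TAG.
sub split_tagged_sentence {
    my ($sentence) = @_;
    my @pairs;
    foreach my $token (split ' ', $sentence) {
        # Split on the LAST underscore so words containing '_' stay intact.
        my $pos = rindex($token, '_');
        next if $pos < 0;    # skip tokens that carry no tag
        push @pairs, [ substr($token, 0, $pos), substr($token, $pos + 1) ];
    }
    return \@pairs;          # [[word, tag], ...] in sentence order
}

my $tagged_pairs = split_tagged_sentence('The_DD cells_NNS were_VBB lysed_VVN ._.');
print "$_->[0]\t$_->[1]\n" foreach @{$tagged_pairs};
```

Splitting on the last underscore rather than the first is a design choice for this sketch, so that a word that itself contains an underscore is not cut short.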
- medpost_untagged
-
Removes the tags from a MedPost corpus sentence and returns the untagged (original)
sentence. To be used after the medpost_filereader() method.
-
Usage:
|
| my $ref = bcc->medpost_filereader;
|
| my @medp_untagged;
|
| push @medp_untagged,bcc->medpost_untagged($_) foreach(@{$ref});
|
| print $medp_untagged[index of sentence]; |
- medpost_tagsplitter
-
Combines the efforts of the medpost_tagreader() and medpost_untagged() methods: it processes a
sentence from the MedPost corpus into untagged form (as returned by medpost_untagged())
and splits the ARRAY of [word, tag] pairs from medpost_tagreader() into a list of words
and a list of tags, returned together in a hash REFERENCE.
-
Data structure:
reference => hash
'source' => sentence
'word_list' => array
0 word
1 ...
'tag_list' => array
0 tag
1 ...
-
Usage:
|
| my $ref = bcc->medpost_filereader;
|
| my @medp_tagsplitter;
|
| push @medp_tagsplitter, bcc->medpost_tagsplitter($_) foreach(@{$ref});
|
| print $medp_tagsplitter[index of sentence]->{source};
|
| print @{$medp_tagsplitter[index of sentence]->{word_list}};
|
| print @{$medp_tagsplitter[index of sentence]->{tag_list}}; |
- yapex_filereader
-
Reads the Yapex corpus and returns a list of abstracts (abs). Franzén, K., Eriksson, G., Olsson, F.,
Asker, L., Lidén, P., and Cöster, J. 2002. Protein names and how to find them. International
Journal of Medical Informatics 67(1-3):49-61.
- lllc05
-
Reads the data set (training set) for the LLL Challenge 2005 (Learning Language in Logic) and returns
a hash reference holding the source sentences, words, agents, targets and interactions (under 'f05')
and the synonymous gene names (under 'f05dd'). Cussens, J. (ed). 2005. Proceedings of the Learning
Language in Logic Workshop 2005 (LLL05). Claire Nedellec, MIG-INRA, France.
-
Data Structure:
reference => hash
'f05' => hash
'agent' => array
0 'agent'
1 ...
-
'interaction' => array
0 array
0 'scalar'
1 ...
1 ...
-
'sentence' => array
0 'sentence 1'
1 'sentence 2'
2 ...
-
'targets' => array
0 'target'
1 ...
-
'words' => array
0 => array #sentence 1
0 'word 1'
1 'word 2'
2 ...
1 ... #sentence 2
-
'f05dd' => array
0 'synonymous gene name 1'
1 'synonymous gene name 2'
2 ...
-
Usage:
|
| my $lllc = bcc->lllc05;
|
| print @{$lllc->{f05dd}};
|
| print $lllc->{f05}->{sentence}->[index of sentence];
|
| print $lllc->{f05}->{words}->[index of sentence]->[index of word];
|
| print $lllc->{f05}->{agents}->[index of sentence]->[index of agent];
|
| print $lllc->{f05}->{targets}->[index of sentence]->[index of targets];
|
| print $lllc->{f05}->{interactions}->[index of sentence]->[index of interactions]; |
- medstract_alias_dev_filereader
-
Reads the Medstract Acronym/Alias Identification (development) corpus and returns a hash reference with
two entries: an ARRAY of abstracts/titles and an ARRAY of alias-pairs, where each alias-pair is a
2-element ARRAY, [alias1, alias2].
-
http://medstract.org/gold-standards.html
-
Data Structure:
reference => hash
'abs' => array
0 'abstract/title 1'
1 ...
-
'alias' => array
0 => array #alias-pairs
0 'alias 1'
1 'alias 2'
1 ...
-
Usage:
|
| my $alias_dev = bcc->medstract_alias_dev_filereader;
|
| print join("\n", @{$alias_dev->{abs}});
|
| print join("\n", @{$alias_dev->{alias}->[index of alias-pair]}); |
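Walking the alias-pairs can be sketched as follows. The hash here is built by hand to mirror the data structure described above; its contents are placeholders rather than corpus data, and `format_alias_pairs` is a hypothetical helper that is not part of the module.

```perl
use strict;
use warnings;

# Hand-built stand-in for the returned structure; the values are
# placeholders for illustration and do not come from the corpus.
my $alias_example = {
    abs   => [ 'abstract/title 1', 'abstract/title 2' ],
    alias => [ [ 'alias 1a', 'alias 1b' ], [ 'alias 2a', 'alias 2b' ] ],
};

# Hypothetical helper: turn each [alias1, alias2] pair into one
# tab-separated line.
sub format_alias_pairs {
    my ($data) = @_;
    return map { join("\t", @{$_}) } @{ $data->{alias} };
}

print "$_\n" foreach format_alias_pairs($alias_example);
```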
- medstract_alias_evl_filereader
-
Reads the Medstract Acronym/Alias Identification (evaluation) corpus and returns a hash reference with
two entries: a list of abstracts/titles and a list of alias-pairs, where each alias-pair is a
2-element list, [alias1, alias2].
-
http://medstract.org/gold-standards.html
-
Data Structure:
reference => hash
'abs' => array
0 'abstract/title 1'
1 ...
-
'alias' => array
0 => array #alias-pairs
0 'alias 1'
1 'alias 2'
1 ...
-
Usage:
|
| my $alias_evl = bcc->medstract_alias_evl_filereader;
|
| print join("\n", @{$alias_evl->{abs}}); |
- genia302pos_filereader
-
Reads the GENIA version 3.02 POS-tagged corpus and returns a list containing the tagged sentences.
(Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi and Jun'ichi Tsujii. (2003). GENIA corpus - a semantically
annotated corpus for bio-textmining. Bioinformatics. 19(suppl. 1). pp. i180-i182.)
-
Usage:
|
| my $genia302 = bcc->genia302pos_filereader;
|
| print join(' ', @{$genia302}); |
- genia302pos_tagreader
-
Line reader for the GENIA 3.02 POS corpus. To be used after the genia302pos_filereader() method. It takes
a tagged sentence from the GENIA 3.02 POS corpus and splits it into an ARRAY of ARRAYs, [[word, tag], ...].
The [word, tag] pairs keep the same order as in the tagged sentence.
-
Data Structure:
reference => array
0 => array #pair of word and tag
0 word
1 tag
1 ...
-
Usage:
|
| my $genia302 = bcc->genia302pos_filereader;
|
| my $report = bcc->genia302pos_tagreader($genia302->[0]);
|
| print join(' ', map { "$_->[0]/$_->[1]" } @{$report}); |
- genia302pos_untagged
-
Removes the tags from a GENIA 3.02 POS corpus sentence and returns the untagged (original) sentence.
To be used after the genia302pos_filereader() method.
-
Usage:
|
| my $genia302 = bcc->genia302pos_filereader;
|
| my $report = bcc->genia302pos_untagged($genia302->[0]);
|
| print $report; |
- genia302pos_tagsplitter
-
Combines the efforts of the genia302pos_tagreader() and genia302pos_untagged() methods: it processes a
sentence from the GENIA 3.02 POS corpus into untagged form (as returned by genia302pos_untagged())
and splits the ARRAY of [word, tag] pairs from genia302pos_tagreader() into an ARRAY of
words and an ARRAY of tags.
-
Data Structure:
reference => hash
'source' => 'source sentence'
-
'tag_list' => array
0 'tag 1'
1 'tag 2'
2 ...
-
'word_list' => array
0 'word 1'
1 'word 2'
2 ...
-
Usage:
|
| my $genia302 = bcc->genia302pos_filereader;
|
| my $report = bcc->genia302pos_tagsplitter($genia302->[0]);
|
| print join(' ', @{$report->{tag_list}});
|
| print join(' ', @{$report->{word_list}}); |
- commonwords
-
The 1000 most common words in the English language.
- commonwords
-
Common verbs in the English language.
- penntag_lineSplitter
-
Processes a Penn Treebank tagged sentence, or any sentence tagged as <word>/<tag>, into a list
of words and a list of tags.
-
Usage:
|
| my $genia = bcc->genia302pos_filereader;
|
| my $report = bcc->penntag_lineSplitter($genia->[0]);
|
| print @{$report->{word_list}};
|
| print @{$report->{tag_list}}; |
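A minimal sketch of this kind of split is shown below. It is not the module's implementation: the sub name `split_slash_tagged` is hypothetical, and it assumes whitespace-separated tokens written as word/tag, with the tag taken after the last slash. The sample sentence is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for penntag_lineSplitter().
# Assumption: tokens are whitespace-separated and look like word/tag.
sub split_slash_tagged {
    my ($line) = @_;
    my (@words, @tags);
    foreach my $token (split ' ', $line) {
        # The tag follows the LAST slash, so 'and/or/CC' keeps 'and/or'.
        my $pos = rindex($token, '/');
        next if $pos < 0;    # skip tokens that carry no tag
        push @words, substr($token, 0, $pos);
        push @tags,  substr($token, $pos + 1);
    }
    return { word_list => \@words, tag_list => \@tags };
}

my $slash_split = split_slash_tagged('IL-2/NN gene/NN expression/NN requires/VBZ activation/NN ./.');
print join(' ', @{ $slash_split->{word_list} }), "\n";
print join(' ', @{ $slash_split->{tag_list} }), "\n";
```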
- penntag_corpusSplitter
-
Processes a list of Penn Treebank tagged sentences, or any sentences tagged as <word>/<tag>, into a list
of words and a list of tags.
-
Usage:
|
| my $genia = bcc->genia302pos_filereader;
|
| my $report = bcc->penntag_corpusSplitter($genia);
|
| print join("\n", @{$report->{word_list}});
|
| print join("\n", @{$report->{tag_list}}); |
- tagging_accuracy
-
Given 2 lists of tags, reports the accuracy of tagging using 1-to-1 tag comparison.
-
param $ref: reference to an array of tags (reference set)
param $test: reference to an array of tags (testing set)
-
return: $self where '@{$self->{index_of_wrong}}' is an array of the indices of
wrong tags, '$self->{length_of_ref}' is the length of $ref,
'$self->{length_of_test}' is the length of $test and
'$self->{correct_count}' is the number of correct tags.
-
Usage:
|
| my $ref = ['aaa', 'bbb', 'ccc'];
|
| my $test= ['aaa', 'bcb', 'cbc'];
|
| my $report = bcc->tagging_accuracy($ref, $test);
|
| print $report->{length_of_ref};
| print $report->{length_of_test};
|
| print $report->{correct_count};
|
| print join(' ', @{$report->{index_of_wrong}}); |
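The 1-to-1 comparison can be sketched as below. This is a hypothetical re-implementation for illustration: only the four report keys come from the description above, the sub name `compare_tags` is invented, and ignoring positions beyond the shorter list is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for tagging_accuracy().
# Assumption: positions beyond the shorter list are not compared.
sub compare_tags {
    my ($ref, $test) = @_;
    my %report = (
        length_of_ref  => scalar @{$ref},
        length_of_test => scalar @{$test},
        correct_count  => 0,
        index_of_wrong => [],
    );
    my $n = scalar(@{$ref}) < scalar(@{$test}) ? scalar(@{$ref}) : scalar(@{$test});
    for my $i (0 .. $n - 1) {
        if ($ref->[$i] eq $test->[$i]) {
            $report{correct_count}++;
        }
        else {
            push @{ $report{index_of_wrong} }, $i;
        }
    }
    return \%report;
}

my $tag_report = compare_tags([ 'aaa', 'bbb', 'ccc' ], [ 'aaa', 'bcb', 'cbc' ]);
print "$tag_report->{correct_count} of $tag_report->{length_of_ref} correct; ",
      "wrong at: @{ $tag_report->{index_of_wrong} }\n";
# prints "1 of 3 correct; wrong at: 1 2"
```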
- ArithmeticMean
-
Returns the arithmetic mean of the values in the passed list.
Assumes a '1D' list, but will function on the 1st dim of an array(!).
-
Usage:
|
| my $list = [1,2,3,4,5,6];
|
| my $report = bcc->ArithmeticMean($list); |
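For reference, the calculation amounts to the sum of the values divided by their count. A minimal sketch with a hypothetical sub name (`arithmetic_mean`); returning nothing for an empty list is an assumption, not documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ArithmeticMean(); takes an array reference.
sub arithmetic_mean {
    my ($list) = @_;
    return unless @{$list};    # assumption: no mean for an empty list
    my $sum = 0;
    $sum += $_ foreach @{$list};
    return $sum / scalar @{$list};
}

print arithmetic_mean([ 1, 2, 3, 4, 5, 6 ]), "\n";    # prints 3.5
```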
- StandardDeviation
-
Calculates the standard deviation from a 1-dimensional list (inlist) and its arithmetic mean (mean).
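A minimal sketch of the calculation follows. The description does not say whether the sample (n - 1) or population (n) form is used; this sketch assumes the sample form, and the sub name `standard_deviation` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for StandardDeviation().
# Assumption: sample standard deviation (divides by n - 1).
sub standard_deviation {
    my ($list, $mean) = @_;
    my $n = scalar @{$list};
    return 0 if $n < 2;        # not enough values to estimate spread
    my $sum_sq = 0;
    $sum_sq += ($_ - $mean) ** 2 foreach @{$list};
    return sqrt($sum_sq / ($n - 1));
}

printf "%.4f\n", standard_deviation([ 1, 2, 3, 4, 5, 6 ], 3.5);
```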
- BootstrapRandomization
-
Basic statistical bootstrap estimation of mean and standard deviation
from a large sample of percentage correct, calculated as #correct / (#correct + #wrong).
-
param $correct: the number of correct answers
param $wrong: the number of wrong answers
param $runs: the number of randomization runs (Efron suggests 200 or more)
-
Returns $self, where '@{$self->{result}}' is the result list, '$self->{mean}' is the mean
and '$self->{sd}' is the standard deviation. The result list contains the randomized
percentages correct.
-
Usage:
|
| my $boot = bcc->BootstrapRandomization($correct, $wrong, $runs);
|
| @{$boot->{result}} | #List of result
|
| $boot->{mean} | #Mean
|
| $boot->{sd} | #StandardDeviation |
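One way such a bootstrap could work is sketched below. This is an assumption-laden illustration, not the module's code: the sub name `bootstrap_accuracy` is hypothetical, each run redraws #correct + #wrong answers with replacement, and the standard deviation over runs uses the sample (n - 1) form.

```perl
use strict;
use warnings;

# Hypothetical stand-in for BootstrapRandomization().
sub bootstrap_accuracy {
    my ($correct, $wrong, $runs) = @_;
    my $n = $correct + $wrong;
    die "need at least one answer and one run\n" if $n < 1 || $runs < 1;

    my @results;
    for my $run (1 .. $runs) {
        my $hits = 0;
        # Redraw $n answers with replacement; each draw is correct with
        # probability $correct / $n.
        for my $draw (1 .. $n) {
            $hits++ if int(rand($n)) < $correct;
        }
        push @results, $hits / $n;
    }

    my $mean = 0;
    $mean += $_ foreach @results;
    $mean /= $runs;

    my $sum_sq = 0;
    $sum_sq += ($_ - $mean) ** 2 foreach @results;
    my $sd = $runs > 1 ? sqrt($sum_sq / ($runs - 1)) : 0;

    return { result => \@results, mean => $mean, sd => $sd };
}

my $boot_example = bootstrap_accuracy(90, 10, 200);
printf "mean=%.3f sd=%.3f\n", $boot_example->{mean}, $boot_example->{sd};
```

Because the draws are random, the printed values differ slightly from run to run.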
- Jaccard
-
Given 2 space-delimited strings (original and test), calculates the
Jaccard Distance based on the formula,
-
1 - [(number of regions where both species are present)/
(number of regions where at least one species is present)]
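Applied to strings, the formula can be read with distinct tokens in place of regions: a token found in both strings counts as "both present", and a token found in either string counts as "at least one present". The sketch below follows that reading; the sub name `jaccard_distance` is hypothetical, and returning 0 for two empty strings is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Jaccard(); works on distinct tokens.
sub jaccard_distance {
    my ($original, $test) = @_;
    my %in_original = map { ($_ => 1) } split ' ', $original;
    my %in_test     = map { ($_ => 1) } split ' ', $test;

    my %in_either = (%in_original, %in_test);
    my $both   = grep { $in_test{$_} } keys %in_original;   # in both strings
    my $either = keys %in_either;                           # in at least one

    return 0 unless $either;    # assumption: two empty strings are identical
    return 1 - $both / $either;
}

# Shared tokens: b c d (3); all tokens: a b c d e (5); 1 - 3/5 = 0.4
print jaccard_distance('a b c d', 'b c d e'), "\n";
```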
- Nei_Li
-
Given 2 space-delimited strings (original and test), calculates the
Nei and Li Distance based on the formula,
-
1 - [2 x (number of regions where both species are present)/
[(2 x (number of regions where both species are present)) +
(number of regions where only one species is present)]]
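The same token-based reading gives a short sketch for this formula, with distinct tokens standing in for regions. The sub name `nei_li_distance` is hypothetical, and returning 0 for two empty strings is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Nei_Li(); works on distinct tokens.
sub nei_li_distance {
    my ($original, $test) = @_;
    my %in_original = map { ($_ => 1) } split ' ', $original;
    my %in_test     = map { ($_ => 1) } split ' ', $test;

    my $both     = grep { $in_test{$_} } keys %in_original;
    my $only_one = (scalar(keys %in_original) - $both)
                 + (scalar(keys %in_test) - $both);

    my $denominator = 2 * $both + $only_one;
    return 0 unless $denominator;    # assumption: two empty strings are identical
    return 1 - (2 * $both) / $denominator;
}

# Shared: b c d (3); in only one string: a e (2); 1 - 6/8 = 0.25
print nei_li_distance('a b c d', 'b c d e'), "\n";    # prints 0.25
```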
- Levenshtein
-
Calculates the Levenshtein distance between a and b.
This routine is implemented by Eli Bendersky
(http://www.merriampark.com/ldperl.htm)
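For readers unfamiliar with the measure, the sketch below is a standard two-row dynamic-programming version written for illustration. It is not the routine credited above, and the sub name `levenshtein_distance` is hypothetical.

```perl
use strict;
use warnings;

# Illustrative edit distance: minimum number of single-character
# insertions, deletions and substitutions turning $s into $t.
sub levenshtein_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # @prev holds distances from the first $i - 1 characters of $s
    # to every prefix of $t.
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @curr = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            my $best = $prev[ $j - 1 ] + $cost;                       # substitution
            $best = $prev[$j] + 1       if $prev[$j] + 1 < $best;       # deletion
            $best = $curr[ $j - 1 ] + 1 if $curr[ $j - 1 ] + 1 < $best; # insertion
            push @curr, $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein_distance('kitten', 'sitting'), "\n";    # prints 3
```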