Tagging Tools

Useful set of tools for part of speech (POS) tagging

Tagging Tools: a useful set of tools for part-of-speech (POS) tagging

Tagging Tools lets you perform common tasks such as generating visual confusion matrices and splitting files in complex ways.

Confusion matrix generator

Specify two files to compare (the gold standard and the tagged file) and build the confusion matrix. The output can be LaTeX or plain text.

Below is an example of a PDF confusion matrix output. By default, the cells with the biggest differences are shown.

PDF output example:

pdf confusion matrix example

The next example shows a plain text confusion matrix output. Note that for each cell, the file lists the words whose tags differ.

Plain text output example:

plain text confusion matrix example
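As a rough illustration of what such a matrix contains, the following Python sketch counts every (gold tag, assigned tag) pair and collects the words for cells where the tags differ. It is not the tool's implementation: the function name `build_confusion` and the in-memory list-of-pairs input are assumptions made for this example, and the tags are Penn Treebank tags chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_confusion(gold, tagged):
    """Return (counts, words) for two aligned lists of (token, tag) pairs.

    counts maps (gold_tag, assigned_tag) to the number of tokens in that cell.
    words maps only the off-diagonal cells to the tokens whose tags differ.
    """
    if len(gold) != len(tagged):
        raise ValueError("both inputs must contain the same number of tokens")
    counts = Counter()
    words = defaultdict(list)
    for (token, gold_tag), (_, assigned_tag) in zip(gold, tagged):
        counts[(gold_tag, assigned_tag)] += 1
        if gold_tag != assigned_tag:
            words[(gold_tag, assigned_tag)].append(token)
    return counts, dict(words)


# Hypothetical sample data, not taken from the project.
gold = [("The", "DT"), ("run", "NN"), ("ended", "VBD")]
tagged = [("The", "DT"), ("run", "VB"), ("ended", "VBN")]

counts, words = build_confusion(gold, tagged)
for (gold_tag, assigned_tag), tokens in sorted(words.items()):
    print(f"{gold_tag} -> {assigned_tag}: {', '.join(tokens)}")
```

Keeping the words alongside the counts mirrors the plain text output described above, where each cell lists the tokens that were tagged differently.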

Tag translation:

You can also use a tag translation file to convert tags from one tagset into another before the comparison takes place. A Penn Treebank to C5 translation is included as an example.
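The idea of translating tags before comparing can be sketched in Python as a simple lookup applied to every token. The format of the translation file is not described here, so this example uses an in-memory dictionary instead. The name `translate_tags` and the three sample pairs are assumptions for illustration only and are not the contents of the included translation.

```python
# Illustrative Penn Treebank -> C5 pairs (an assumption, not the shipped file).
PTB_TO_C5_SAMPLE = {
    "NN": "NN1",   # singular common noun
    "NNS": "NN2",  # plural common noun
    "JJ": "AJ0",   # adjective
}


def translate_tags(tokens, mapping):
    """Return a copy of tokens with each tag replaced via mapping.

    Tags that have no entry in mapping are left unchanged.
    """
    return [(word, mapping.get(tag, tag)) for word, tag in tokens]


tokens = [("big", "JJ"), ("dogs", "NNS"), ("bark", "VBP")]
translated = translate_tags(tokens, PTB_TO_C5_SAMPLE)
print(translated)
```

Leaving unmapped tags untouched is a design choice of this sketch; a stricter variant could raise an error so that gaps in the mapping are noticed before the comparison.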

Command:

tt -compare <goldStandard> <fileToCompare> <output> [options]

Compares the goldStandard file against fileToCompare, generating a confusion matrix as output.

Where [options] is one or more of the following:

File splitter:

This tool splits a file into several parts, preserving sentences. It optionally generates the complementary file for each extracted part.
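A minimal Python sketch of sentence-preserving splitting is shown below. How the tool actually distributes sentences among parts is not stated, so the contiguous, near-equal chunks used here are an assumption, as are the names `split_sentences` and `complement`. Each sentence is represented as a list of tokens.

```python
def split_sentences(sentences, parts):
    """Split sentences into `parts` contiguous chunks without cutting a sentence.

    Chunk sizes differ by at most one sentence; earlier chunks get the extras.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(sentences), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(sentences[start:end])
        start = end
    return chunks


def complement(chunks, index):
    """Return every sentence that is NOT in chunks[index], in original order."""
    return [s for i, chunk in enumerate(chunks) if i != index for s in chunk]


# Hypothetical input: five one-token sentences.
sentences = [["a"], ["b"], ["c"], ["d"], ["e"]]
chunks = split_sentences(sentences, 2)
rest = complement(chunks, 0)
print(chunks)
print(rest)
```

A part together with its complement covers the whole file, which is the layout typically needed when one part is held out for testing and the remainder is used for training.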

Command:

tt -split <file> <parts> [options]

Where [options] is one or more of the following:

File format:

Sentences are composed of tokens (words and symbols).

Each line should contain one token. Empty lines denote sentence breaks.

For comparison operations, each line must contain the token followed by a tab and the POS tag.

Example: file format example
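The format described above can be read with a few lines of Python. This is only a sketch under the stated rules (one token per line, an optional tab-separated POS tag, empty lines as sentence breaks); the function name `read_sentences` and the sample text are assumptions made for this example.

```python
def read_sentences(lines):
    """Parse lines into a list of sentences of (token, tag) pairs.

    tag is None when a line holds only a token (no tab-separated tag).
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.partition("\t")
        current.append((token, tag or None))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the tagged format: token, tab, POS tag.
sample = "The\tDT\ndog\tNN\n\nBarks\tVBZ\n".splitlines()
parsed = read_sentences(sample)
print(parsed)
```

To read from disk, the same function can be given an open file object, since iterating over a file yields its lines.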

Binaries:

Linux and Windows binaries are available here

License

Tagging Tools is released under the MIT License.