Prepare Data Pool

This tutorial shows how to prepare the data pool.

Load CoNLL Format

We can use the ColumnDataset class to load data in CoNLL format.

from seqal.datasets import ColumnDataset

columns = {0: "text", 1: "ner"}  # column 0 holds the token text, column 1 the NER tag
pool_file = "./data/sample_bio/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences

We can get the sentences from data_pool through its sentences property.

print(unlabeled_sentences[0])

This prints:

Sentence: "this is New York"   [− Tokens: 4  − Token-Labels: "this is New <B-LOC> York <I-LOC>"]

Load Plain Text

We can use load_plain_text to read an unlabeled dataset. It returns a list of Sentence objects.

from seqal.utils import load_plain_text

file_path = "./data/sample_bio/unlabeled_data_pool.txt"
unlabeled_sentences = load_plain_text(file_path)
print(unlabeled_sentences[0])

This prints:

Sentence: "this is New York"   [− Tokens: 4]

Non-spaced Language

As mentioned in TUTORIAL_2_Prepare_Corpus, we have to provide tokenized data for non-spaced languages.

An example in CoNLL format:

東京 B-LOC
は O
都市 O
です O

If the input is plain text, such as 東京は都市です, we should tokenize the sentence first.

We mainly use a spaCy model for tokenization.

import spacy
from seqal.transformer import Transformer

nlp = spacy.load("ja_core_news_sm")
tokenizer = Transformer(nlp)
sentences = ["東京は都市です"]  # plain-text sentences to tokenize
unlabeled_sentences = [tokenizer.to_subword(sentence) for sentence in sentences]

We can also use the spaCy tokenizer directly.

from flair.data import Sentence
from flair.tokenization import SpacyTokenizer

tokenizer = SpacyTokenizer("ja_core_news_sm")
sentences = ["東京は都市です"]  # plain-text sentences to tokenize
unlabeled_sentences = [Sentence(sentence, use_tokenizer=tokenizer) for sentence in sentences]
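
As a quick check, we can print the first tokenized sentence. Assuming the tokenizer splits the text as in the CoNLL example above, the tokens should come out space-separated:

print(unlabeled_sentences[0])

This should print something like:

Sentence: "東京 は 都市 です"   [− Tokens: 4]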

We should download the spaCy model beforehand; models for different languages are listed in the spaCy models directory (https://spacy.io/models).
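
For example, the Japanese model used above can be downloaded from the command line:

python -m spacy download ja_core_news_sm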