Prepare Data Pool
This tutorial shows how to prepare a data pool.
Load CoNLL Format
We can use the ColumnDataset class to load data in CoNLL format.
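In a two-column CoNLL file, each line holds a token and its tag separated by whitespace, and sentences are separated by blank lines. A short illustrative snippet (not the actual contents of the sample file):
Tokyo B-LOC
is O
a O
city O
. O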
from seqal.datasets import ColumnDataset
# column 0 holds the token text, column 1 the NER tag
columns = {0: "text", 1: "ner"}
pool_file = "./data/sample_bio/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences
We can get the sentences from data_pool through its sentences property and print a few of them to check the result.
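As a quick check, we can print the size of the pool and the first sentence (a minimal sketch reusing unlabeled_sentences from above; the exact output depends on the data file):
print(len(unlabeled_sentences))  # number of sentences in the pool
print(unlabeled_sentences[0])    # the first Sentence object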
Load Plain Text
We can use load_plain_text to read an unlabeled dataset. It creates a list of Sentence objects.
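Assuming the unlabeled file contains one sentence of raw text per line, it might look like this (illustrative contents, not the actual sample file):
Tokyo is a city.
London is the capital of the UK.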
from seqal.utils import load_plain_text
file_path = "./data/sample_bio/unlabeled_data_pool.txt"
unlabeled_sentences = load_plain_text(file_path)
We can print unlabeled_sentences to check the created Sentence objects.
Non-spaced Language
As mentioned in TUTORIAL_2_Prepare_Corpus, we have to provide tokenized data for non-spaced languages such as Japanese.
An example of tokenized data in CoNLL format (the NER tags below are illustrative):
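東京 B-LOC
は O
都市 O
です O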
If the input is plain text, such as 東京は都市です, we have to tokenize the sentence first. We mainly use a spaCy model for tokenization.
import spacy
from seqal.transformer import Transformer
nlp = spacy.load("ja_core_news_sm")
tokenizer = Transformer(nlp)
# "sentences" is a list of raw (untokenized) text strings
unlabeled_sentences = [tokenizer.to_subword(sentence) for sentence in sentences]
We can also use the spaCy tokenizer directly through flair.
from flair.data import Sentence
from flair.tokenization import SpacyTokenizer
tokenizer = SpacyTokenizer("ja_core_news_sm")
# "sentences" is again a list of raw (untokenized) text strings
unlabeled_sentences = [Sentence(sentence, use_tokenizer=tokenizer) for sentence in sentences]
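To confirm the tokenization from the flair-based approach, we can inspect the tokens of the first Sentence (a minimal check; the exact tokens depend on the spaCy model):
print([token.text for token in unlabeled_sentences[0]])  # tokens produced by the spaCy tokenizer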
The spaCy model should be downloaded beforehand. Models for other languages can be found on the spaCy models page (https://spacy.io/models).
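For example, the Japanese model used above can be installed with:
python -m spacy download ja_core_news_sm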