Multiple Language Support
This tutorial shows how to use SeqAL with different languages.
The Introduction shows an example in English. If you want to use SeqAL with another language, you need to meet the following requirements.
- Prepare the data in BIO or BIOES format for the language you use. More detail can be found in Prepare Corpus.
- Prepare embeddings for the same language. We can use `TransformerWordEmbeddings` to load embeddings for different languages from HuggingFace.
- If the dataset's language is a non-spaced language, we have to use a spaCy model to tokenize the dataset.
We introduce the processing workflows for spaced and non-spaced languages below.
Spaced Language
Below is a simple example of an active learning cycle on English data.
```python
from seqal.datasets import ColumnCorpus, ColumnDataset
from flair.embeddings import WordEmbeddings
from seqal.active_learner import ActiveLearner
from seqal.samplers import LeastConfidenceSampler

# 1 Load data
columns = {0: "text", 1: "ner"}
data_folder = "./data/sample_bio"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="valid.txt",
    test_file="test.txt",
)  # Change to a dataset in the language you use

# Load the unlabeled data pool (see "Prepare Data Pool" below)
pool_file = "./data/sample_bio/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences

# 2 Initialize Active Learner
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")  # Prepare embeddings for the same language
tagger_params["embeddings"] = embeddings

trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1

sampler = LeastConfidenceSampler()
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)

# 3 Active Learning Cycle
query_number = 200
for i in range(5):
    # Query the most informative samples and remove them from the pool
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=False
    )
    # Retrain the tagger with the newly queried samples
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
```
If you want to change the language, you only need to change a few lines. Below is an example for German.
First, change the dataset to the language you use.
```python
data_folder = "./data/conll_deu"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train.txt",
    test_file="test.txt",
    dev_file="dev.txt",
)
```
Next, change the embeddings to the same language.
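For example, a minimal sketch using Flair's pre-trained German word embeddings; the `"de"` identifier is Flair's standard German fastText model:

```python
from flair.embeddings import WordEmbeddings

# German fastText word embeddings shipped with Flair
embeddings = WordEmbeddings("de")
tagger_params["embeddings"] = embeddings
```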
Check out the full list of word embedding models, along with more explanations, in the `WordEmbeddings` class documentation.
We can also use contextualized embeddings.
More explanations are available in the `TransformerWordEmbeddings` class documentation. We can find embeddings for more languages on HuggingFace.
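For instance, a sketch that loads a German BERT model through `TransformerWordEmbeddings`; the `bert-base-german-cased` model name is just one example of a HuggingFace model, not the only choice:

```python
from flair.embeddings import TransformerWordEmbeddings

# Any suitable HuggingFace model name works here
embeddings = TransformerWordEmbeddings("bert-base-german-cased")
tagger_params["embeddings"] = embeddings
```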
Non-spaced Language
Prepare Corpus
First, we provide a corpus in CoNLL format, like the example below:
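A minimal sample has one token per line with its tag, and a blank line between sentences. The sample below reuses the 東京は都市です sentence from later in this tutorial; the `B-LOC` tag for 東京 is an illustrative assumption:

```
東京 B-LOC
は O
都市 O
です O
```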
Then we load the corpus just as we do for a spaced language.
```python
columns = {0: "text", 1: "ner"}
data_folder = "./data/sample_jp"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="valid.txt",
    test_file="test.txt",
)
```
You can see the examples in the `./data/sample_jp` directory.
Prepare Embeddings
We provide proper embeddings for the specified language. Below is a Japanese example.
```python
from flair.embeddings import StackedEmbeddings, FlairEmbeddings

# Stack the forward and backward Japanese Flair embeddings
embedding_types = [
    FlairEmbeddings("ja-forward"),
    FlairEmbeddings("ja-backward"),
]
embeddings = StackedEmbeddings(embeddings=embedding_types)
```
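As in the spaced-language example, the stacked embeddings are then passed to the tagger parameters:

```python
tagger_params["embeddings"] = embeddings
```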
Prepare Data Pool
Research Mode
If we just want to simulate the active learning cycle (research mode), we should load the labeled data pool with `ColumnDataset` and set `research_mode` to `True`.
```python
from seqal.datasets import ColumnDataset

columns = {0: "text", 1: "ner"}
pool_file = "./data/sample_jp/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences

# ...

for i in range(iterations):
    # research_mode=True simulates annotation by reusing the existing gold labels
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=True
    )
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
```
Annotation Mode
If we are dealing with a real annotation project, we usually load the data from plain text.
```python
file_path = "./data/sample_jp/unlabeled_data_pool.txt"
unlabeled_sentences = []
with open(file_path, mode="r", encoding="utf-8") as f:
    for line in f:
        # Strip the trailing newline from each sentence
        unlabeled_sentences.append(line.strip())
```
Because the loaded sentences are non-spaced, we have to tokenize them. For example, we tokenize 東京は都市です to `["東京", "は", "都市", "です"]`.
There are two ways to tokenize the sentences. The first method is using `Transformer.to_subword()`.
```python
import spacy

from seqal.transformer import Transformer

# Use the spaCy Japanese model to split non-spaced text into tokens
nlp = spacy.load("ja_core_news_sm")
tokenizer = Transformer(nlp)
unlabeled_sentences = [tokenizer.to_subword(sentence) for sentence in unlabeled_sentences]
```
The second method is using the spaCy tokenizer directly.
```python
from flair.data import Sentence
from flair.tokenization import SpacyTokenizer

# Build Flair Sentence objects, tokenized by the spaCy Japanese model
tokenizer = SpacyTokenizer("ja_core_news_sm")
unlabeled_sentences = [
    Sentence(sentence, use_tokenizer=tokenizer) for sentence in unlabeled_sentences
]
```
And do not forget to set `research_mode` to `False`.
```python
for i in range(iterations):
    # research_mode=False returns the queried samples for human annotation
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=False
    )
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
```
Run Active Learning Cycle
Executing the example script runs the active learning cycle on a non-spaced language. See the script for more detail.