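Text Classification
Text classification assigns a sentence or paragraph to one of a set of predefined categories. It is one of the most fundamental tasks in natural language processing and underlies applications such as chatbots, search and recommendation, sentiment analysis, content understanding, enterprise risk control, and quality inspection.
Below is one record from the AG News dataset: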
"3","Fears for T N pension after talks","Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
Here, "3" is the category label of the text, "Fears for T N pension after talks" is the news title, and "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul." is the article body.
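As a minimal sketch of how such a record breaks down into its three fields, the snippet below parses the sample line with Python's standard `csv` module. The helper name `parse_record` is an assumption made for illustration and is not part of mindnlp. In the commonly distributed CSV release of AG News, labels 1 to 4 stand for World, Sports, Business, and Sci/Tech.

```python
import csv
import io

# The sample record quoted above, exactly as it appears in the CSV file.
RAW = '''"3","Fears for T N pension after talks","Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."'''


def parse_record(line):
    """Split one AG News CSV line into (label, title, body)."""
    label, title, body = next(csv.reader(io.StringIO(line)))
    return int(label), title, body


label, title, body = parse_record(RAW)
print(label)  # category id
print(title)  # news title
print(body)   # article body
```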
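Note
AGNews: a collection of news articles gathered from more than 2,000 news sources by ComeToMyHead, an academic news search engine.
In this tutorial, we train a fastText model for text classification using GloVe pretrained word embeddings.
Define the Model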
import numpy as np
from mindspore import nn, Tensor
from mindspore.common import dtype as mstype
from mindspore.common.initializer import XavierUniform
from mindspore.dataset.text.utils import Vocab
from mindnlp.modules.embeddings import Glove


class FasttextModel(nn.Cell):
    """
    FastText model
    """

    def __init__(self, vocab_size, embedding_dims, num_class):
        super().__init__()
        self.vocab_size = vocab_size
        self.embeding_dims = embedding_dims
        self.num_class = num_class
        # Embedding table initialized to zeros with shape [vocab_size, embedding_dims]
        self.embeding_func = Glove(vocab=Vocab.from_list(['default']),
                                   init_embed=Tensor(np.zeros([self.vocab_size, self.embeding_dims]), mstype.float32))
        # Single dense layer that maps the sentence vector to class scores
        self.fc = nn.Dense(self.embeding_dims, out_channels=self.num_class,
                           weight_init=XavierUniform(1)).to_float(mstype.float16)

    def construct(self, text):
        """
        construct network
        """
        # text has shape [batch, tokens]; average over the token axis.
        # len(text) would return the batch size, so use shape[1] instead.
        # Padding positions are still counted in this average.
        src_token_length = text.shape[1]
        text = self.embeding_func(text)
        embeding = text.sum(axis=1)
        embeding = Tensor.div(embeding, src_token_length)
        embeding = embeding.astype(mstype.float32)
        classifier = self.fc(embeding)
        classifier = classifier.astype(mstype.float32)
        return classifier
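The core of a fastText-style classifier is to average the token embeddings of a sentence and pass the result through a single dense layer. The following pure-Python sketch illustrates that computation on toy values. The function names and numbers are assumptions chosen for illustration and do not come from MindSpore or mindnlp.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dims)]


def linear(x, weight, bias):
    """Dense layer: one score per class; weight has shape [num_class][dims]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy inputs (hypothetical): 3 tokens, 2-dimensional embeddings, 2 classes.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -0.5]

sentence = mean_pool(tokens)             # sentence vector
scores = linear(sentence, weight, bias)  # one raw score per class
print(sentence, scores)
```

Because the sentence vector is an average, word order is discarded; the model trades that information for a very small and fast classifier.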
Define Hyperparameters
The following hyperparameters are used during model training.
vocab_size = 1383812
embedding_dims = 16
num_class = 4
lr = 0.001
bucket_boundaries = [64, 128, 467]
max_len = 467
drop = 0.0
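To give a feel for what these values imply for model size, the sketch below does the arithmetic for the embedding table and the dense layer. It assumes float32 storage (4 bytes per value), matching the `mstype.float32` embedding initializer in the model definition, and a dense layer with a bias term. The helper names are hypothetical.

```python
def embedding_table_bytes(vocab_size, embedding_dims, bytes_per_value=4):
    """Memory for a float32 embedding table of shape [vocab_size, embedding_dims]."""
    return vocab_size * embedding_dims * bytes_per_value


def dense_param_count(in_dims, num_class):
    """Weights plus biases of a single dense layer."""
    return in_dims * num_class + num_class


table_bytes = embedding_table_bytes(1383812, 16)
print(f"embedding table: {table_bytes / 2**20:.1f} MiB")
print(f"dense layer parameters: {dense_param_count(16, 4)}")
```

Almost all parameters sit in the embedding table, so `vocab_size` and `embedding_dims` dominate the memory footprint.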
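Data Preprocessing
This tutorial uses the AG News dataset, which is downloaded automatically through the mindnlp API. During preprocessing, the text is cleaned, tokens are mapped to vocabulary indices with a lookup operation, and the samples are then bucketed by length.
Load the dataset: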
from mindnlp.dataset import load
ag_news_train, ag_news_test = load('ag_news', shuffle=True)
Initialize the vocab and tokenizer used for data preprocessing:
from mindnlp.modules import Glove
from mindnlp.dataset.transforms import BasicTokenizer
tokenizer = BasicTokenizer(True)
embedding, vocab = Glove.from_pretrained('6B', 100)
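Conceptually, the tokenizer splits text into tokens and the vocab maps each token to an integer index. The sketch below imitates those two steps in plain Python. It is not how `BasicTokenizer` or the GloVe vocab are implemented: the whitespace tokenizer, the toy vocabulary, and the use of index 0 for unknown tokens are all assumptions for illustration.

```python
def simple_tokenize(text, lower_case=True):
    """Hypothetical stand-in for a tokenizer: lowercase, then split on whitespace."""
    if lower_case:
        text = text.lower()
    return text.split()


def lookup(tokens, vocab, unk_id=0):
    """Map tokens to vocabulary indices; unknown tokens fall back to unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]


# Toy vocabulary (hypothetical); a pretrained GloVe vocab is far larger.
toy_vocab = {"<unk>": 0, "fears": 1, "for": 2, "pension": 3, "after": 4, "talks": 5}

tokens = simple_tokenize("Fears for T N pension after talks")
ids = lookup(tokens, toy_vocab)
print(tokens)
print(ids)
```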
After preprocessing, the loaded dataset is split into a training set and a validation set.
from mindnlp.dataset import process
ag_news_train = process('ag_news', ag_news_train, tokenizer=tokenizer, vocab=vocab,
                        bucket_boundaries=bucket_boundaries, max_len=max_len, drop_remainder=True)
ag_news_train, ag_news_valid = ag_news_train.split([0.7, 0.3])
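Bucketing groups samples of similar length so that each batch needs little padding. The sketch below shows one simplified reading of the idea: truncate to `max_len`, then pad each sequence up to the smallest boundary that fits it. How mindnlp's `process` handles boundaries internally is not shown in this tutorial, so the helper names, the padding index 0, and the boundary rule are assumptions.

```python
import bisect


def bucket_length(n_tokens, boundaries):
    """Return the smallest boundary >= n_tokens (the last boundary if none fits)."""
    idx = bisect.bisect_left(boundaries, n_tokens)
    return boundaries[min(idx, len(boundaries) - 1)]


def pad_to_bucket(ids, boundaries, max_len, pad_id=0):
    """Truncate to max_len, then pad with pad_id up to the bucket length."""
    ids = ids[:max_len]
    target = bucket_length(len(ids), boundaries)
    return ids + [pad_id] * (target - len(ids))


boundaries = [64, 128, 467]
for n in (10, 65, 500):
    padded = pad_to_bucket([1] * n, boundaries, max_len=467)
    print(n, "->", len(padded))
```

With these boundaries, a 10-token sample is padded to 64 rather than 467, which is where the savings come from.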
Instantiate the Model
# net
net = FasttextModel(vocab_size, embedding_dims, num_class)
Training
Set the loss function, optimizer, and metric.
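from mindnlp.engine.metrics import Accuracy  # import path may vary across mindnlp versions

# The network returns raw logits, so use a cross-entropy loss, which applies
# log-softmax internally; nn.NLLLoss expects log-probabilities as input.
loss = nn.CrossEntropyLoss(reduction='mean')
optimizer = nn.Adam(net.trainable_params(), learning_rate=lr)
metric = Accuracy()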
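To make the loss and metric concrete, here is a minimal pure-Python sketch (not the MindSpore implementation) of how a negative log-likelihood loss is computed from raw class scores via log-softmax, and how accuracy follows from the arg-max class. It also shows why a network that outputs raw logits needs a log-softmax step before a negative log-likelihood loss. The function names and toy values are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs_batch, labels):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(lp[y] for lp, y in zip(log_probs_batch, labels)) / len(labels)


def accuracy(logits_batch, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(1 for logits, y in zip(logits_batch, labels)
                  if logits.index(max(logits)) == y)
    return correct / len(labels)


# Toy batch (hypothetical): 2 samples, 4 classes, labels indexed from 0.
logits_batch = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
labels = [0, 1]

loss = nll_loss([log_softmax(l) for l in logits_batch], labels)
acc = accuracy(logits_batch, labels)
print(loss, acc)
```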
Start training with the mindnlp Trainer.
from mindnlp.engine.trainer import Trainer
# define trainer
trainer = Trainer(network=net, train_dataset=ag_news_train, eval_dataset=ag_news_valid, metrics=metric,
                  epochs=5, loss_fn=loss, optimizer=optimizer)
print("start train")
trainer.run(tgt_columns="label", jit=False)
# trainer.run()
print("end train")