with open(dev_query_dir, 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        query_list = line.strip().split('\t', 1)
        query_id, query = query_list[0], query_list[1]
        dev_query.loc[i, :] = [query_id, query]

dict_dev_query = dict(zip(dev_query['query_id'], dev_query['query']))
with open(dict_dev_query_dir, 'w', encoding='utf-8') as f:
    f.write(str(dict_dev_query))

dev_query = dev_query.set_index('query_id')
dev_query.head()
Results:

   query_id  query
0  200001    甲黄酸阿怕替尼片
1  200002    索泰zbox
2  200003    kfc游戏机
3  200004    bunny成兔粮
4  200005    铁线威灵仙
dev_query.shape
Results:
(1000, 2)
3 Text-mining
3.1 Text preprocessing
3.1.1 Word segmentation
import jieba
" ".join(jieba.cut("甲黄酸阿怕替尼片"))
Results:
'甲 黄酸 阿怕 替尼片'
def title_cut(x):
    return list(jieba.cut(x))
from joblib import Parallel, delayed
corpus_title = Parallel(n_jobs=-1)(
    delayed(title_cut)(title) for title in corpus["title"]
)
train_title = Parallel(n_jobs=-1)(
    delayed(title_cut)(title) for title in train_query["query"]
)
dev_title = Parallel(n_jobs=-1)(
    delayed(title_cut)(title) for title in dev_query["query"]
)
3.1.2 Word2Vec
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
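The Word2Vec training call itself is missing from the excerpt. A minimal sketch, assuming 128-dimensional vectors (the dimension checked by the submission validator below) and training on the segmented titles and queries from 3.1.1; every parameter here is an assumption:

# Hypothetical training step: vector_size, window, min_count, sg and workers are assumed values.
model = Word2Vec(
    sentences=corpus_title + train_title + dev_title,
    vector_size=128,
    window=5,
    min_count=1,
    sg=1,
    workers=4,
)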
train_w2v_ids = [[model.wv.key_to_index[xx] for xx in x] for x in train_title]
corpus_w2v_ids = [[model.wv.key_to_index[xx] for xx in x] for x in corpus_title]
dev_w2v_ids = [[model.wv.key_to_index[xx] for xx in x] for x in dev_title]
3.1.3 IDF
from sklearn.feature_extraction.text import TfidfVectorizer
drop_token_ids = [model.wv.key_to_index[x] for x in drop_token]
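drop_token itself is never constructed in the excerpt. One plausible construction from the IDF values, reusing the jieba-segmented titles; the identity tokenizer/preprocessor trick, the cutoff of 20 tokens, and the idf dictionary name are all assumptions:

import numpy as np

# Fit TF-IDF on the pre-tokenized titles to get per-token IDF values.
tfidf = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x,
                        token_pattern=None, lowercase=False)
tfidf.fit(corpus_title)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

# Assumed rule: treat the 20 lowest-IDF (most frequent) tokens as stop words,
# keeping only those that exist in the Word2Vec vocabulary.
drop_token = [w for w, _ in sorted(idf.items(), key=lambda kv: kv[1])[:20]
              if w in model.wv.key_to_index]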
3.2 Unsupervised Word2Vec
Construct the embeddings directly from the word vectors.
import numpy as np

def unsuper_w2c_encoding(s, pooling="max"):
    # Drop stop-word ids, then pool the word vectors of the remaining tokens.
    corpus_query_word = [x for x in s if x not in drop_token_ids]
    if len(corpus_query_word) == 0:
        return np.zeros(128)
    feat = model.wv[corpus_query_word]
    if pooling == "max":
        return np.array(feat).max(0)
    if pooling == "avg":
        return np.array(feat).mean(0)
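How the pooled features written out below might be produced; mean pooling is assumed here only because the variables are named dev_mean_feat / corpus_mean_feat:

# Assumed pooling step over the id sequences built in 3.1.2.
dev_mean_feat = np.vstack([unsuper_w2c_encoding(s, pooling="avg") for s in dev_w2v_ids])
corpus_mean_feat = np.vstack([unsuper_w2c_encoding(s, pooling="avg") for s in corpus_w2v_ids])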
with open(dir_query_embedding, 'w') as up:
    for id, feat in zip(dev_query.index, dev_mean_feat):
        up.write('{0}\t{1}\n'.format(id, ','.join([str(x)[:6] for x in feat])))
with open(dir_doc_embedding, 'w') as up:
    for id, feat in zip(corpus.index, corpus_mean_feat):
        up.write('{0}\t{1}\n'.format(id, ','.join([str(x)[:6] for x in feat])))
if os.path.exists(dir_train_neg_piar):
    with open(dir_train_neg_piar, 'r', encoding='utf-8') as f:
        train_neg_piar = eval(f.read())
else:
    from tqdm import tqdm_notebook
    train_neg_piar = []
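The mining fragments below rely on inverse_keyword_map, idx_top1_word and MAX_NEG_SAMPLES, none of which are defined in the excerpt. A guess at the elided setup (only those three names come from the source; everything else, including the value 10, is an assumption). idx_top1_word is presumably a salient token of the current query, e.g. its highest-IDF word:

MAX_NEG_SAMPLES = 10  # assumed value

# Inverted index from token to 0-based corpus positions, used to mine hard
# negatives that share a keyword with the query.
inverse_keyword_map = {}
for pos, words in enumerate(corpus_title):
    for w in set(words):
        inverse_keyword_map.setdefault(w, []).append(pos)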
if idx_top1_word in inverse_keyword_map:
    negative_idx = inverse_keyword_map[idx_top1_word][:MAX_NEG_SAMPLES]
else:
    negative_idx = np.random.randint(corpus.shape[0], size=MAX_NEG_SAMPLES)
""" idx_keyword = [] if len(idx_top1_word) >= 2 and idx_top1_word in inverse_keyword_map: idx_keyword += inverse_keyword_map[idx_top1_word] if len(idx_start_word) >= 2 and idx_start_word in inverse_keyword_map: idx_keyword += inverse_keyword_map[idx_start_word] if len(idx_end_word) >= 2 and idx_end_word in inverse_keyword_map: idx_keyword += inverse_keyword_map[idx_end_word] negative_idx = sum(negative_idx, []) """
# negative_idx = list(set(negative_idx))
negative_idx = [x + 1 for x in negative_idx]
positive_idx = qrels_train.loc[idx].ravel()[0]
if positive_idx in negative_idx:
    negative_idx.remove(positive_idx)
train_neg_piar.append(negative_idx)

with open(dir_train_neg_piar, 'w', encoding='utf-8') as f:
    f.write(str(train_neg_piar))
eval_s1, eval_s2, eval_socre = [], [], []
for idx in tqdm_notebook(range(train_query.shape[0] - 1000, train_query.shape[0] + 1)):
    eval_s1.append(train_query.loc[idx]["query"])
    eval_s2.append(corpus.loc[qrels_train.loc[idx].ravel()[0]]["title"])
    eval_socre += [1]

    # Hard negatives mined above, if available for this query.
    if idx - 1 < len(train_neg_piar):
        for neg_idx in train_neg_piar[idx - 1]:
            eval_s1.append(train_query.loc[idx]["query"])
            eval_s2.append(corpus.loc[neg_idx]["title"])
            eval_socre += [0]
    # Plus a handful of random negatives.
    rand_idx = np.random.randint(corpus.shape[0], size=10)
    for neg_idx in rand_idx:
        eval_s1 += [train_query.loc[idx]["query"]]
        eval_s2 += [corpus.loc[neg_idx]["title"]]
        eval_socre += [0]
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
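The model, the training pairs and the evaluator used below are not constructed in the excerpt. One plausible setup, all of it an assumption: a Chinese BERT backbone projected to 128 dimensions to match the submission format, positive query/title pairs from qrels_train (the last 1000 queries are held out, consistent with the evaluation loop above), and a similarity evaluator over eval_s1 / eval_s2 / eval_socre:

from sentence_transformers import models, evaluation

# Assumed backbone: pooled BERT output projected to 128 dims.
word_emb = models.Transformer('bert-base-chinese')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(), out_features=128)
model = SentenceTransformer(modules=[word_emb, pooling, dense])

# Assumed positive (query, clicked-title) pairs; mined negatives could be
# appended in the same way with label=0.0.
train_examples = [
    InputExample(texts=[train_query.loc[idx]["query"],
                        corpus.loc[qrels_train.loc[idx].ravel()[0]]["title"]], label=1.0)
    for idx in range(1, train_query.shape[0] - 1000)
]
evaluator = evaluation.EmbeddingSimilarityEvaluator(eval_s1, eval_s2, eval_socre)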
# Define your train dataset, the dataloader and the train loss
train_size = len(train_examples)  # 10000
train_dataloader = DataLoader(train_examples[:train_size], shuffle=True, batch_size=10)
train_loss = losses.CosineSimilarityLoss(model)
# Tune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=100,
    evaluator=evaluator,
    evaluation_steps=1000,
    show_progress_bar=True,
    output_path="./sentence-bert/",
    checkpoint_save_steps=10000,
    save_best_model=True,
    checkpoint_path='./sentence-bert/'
)
query_sentences = list(dev_query["query"])[:query_len]
corpus_sentences = list(corpus["title"].iloc[:])[:corpus_len]
# corpus_sentences = [x for x in corpus_sentences if len(x) > 10]
from sklearn.preprocessing import normalize

test_size = len(corpus_sentences)
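The encoding step that produces query_embeddings and corpus_embeddings is not shown. A sketch; the batch size and the L2 normalisation are assumptions:

query_embeddings = model.encode(query_sentences, batch_size=64, show_progress_bar=True)
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True)

# Assumed: L2-normalise so that dot product equals cosine similarity downstream.
query_embeddings = normalize(query_embeddings)
corpus_embeddings = normalize(corpus_embeddings)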
with open(dir_query_embedding, 'w') as up:
    for id, feat in zip(dev_query.index, query_embeddings):
        up.write('{0}\t{1}\n'.format(id, ','.join([str(x)[:4] for x in feat])))
with open(dir_doc_embedding, 'w') as up:
    for id, feat in zip(corpus.index, corpus_embeddings):
        up.write('{0}\t{1}\n'.format(id, ','.join([str(x)[:4] for x in feat])))
import math

def is_number(s):
    if s != s.strip():
        return False
    try:
        f = float(s)
        if math.isnan(f) or math.isinf(f):
            return False
        return True
    except ValueError:
        return False
# def data_check(file, file_type="doc"):
"""
Check that each file is UTF-8 without BOM, that doc_embedding indices start at 1,
that query_embedding indices start at 200001, and that every embedding has 128 dimensions.
"""
erro_count = []
error_embeding = []
single_error_embedding = []
for file, file_type in zip(['query_embedding', 'doc_embedding'], ['query', 'doc']):
    # file, file_type = "query_embedding", "query"
    # file, file_type = "doc_embedding", "doc"
    count = 1
    id_set = set()
    with open(file) as f:
        for line in f:
            sp_line = line.strip('\n').split("\t")
            if len(sp_line) != 2:
                print("[Error] Please check your line. The line should have two parts, i.e. index \t embedding")
                print("line number: ", count)
            index, embedding = sp_line

            if not is_number(index):
                print("[Error] Please check your id. The id should be an int without other chars")
                print("line number: ", count)
            id_set.add(int(index))

            embedding_list = embedding.split(',')
            if len(embedding_list) != 128:
                print("[Error] Please check the dimension of the embedding. The dimension is not 128")
                print("line number: ", count)

            for i, emb in enumerate(embedding_list):
                if not is_number(emb):
                    print("[Error] Please check your embedding. Each value should be a float without other chars")
                    print("line number: ", count)
                    erro_count.append([index, i])
                    error_embeding.append(embedding_list)
                    single_error_embedding.append(emb)

            count += 1

    if file_type == "doc":
        # 1001501
        for i in range(1, test_size + 1):
            if i not in id_set:
                print("[Error] The index[{}] of doc_embedding is not found. Please check it.".format(i))
    elif file_type == "query":
        for i in range(200001, 201001):
            if i not in id_set:
                print("[Error] The index[{}] of query_embedding is not found. Please check it.".format(i))