如何用TensorFlow训练词向量

前言

前面在《谈谈谷歌word2vec的原理》文章中已经把word2vec的来龙去脉说得很清楚了，接下去这篇文章将尝试根据word2vec的原理并使用TensorFlow来训练词向量，这里选择使用skip-gram模型。

语料库的准备

这里仅仅收集了网上关于房产新闻的文章，并且将全部文章拼凑到一起形成一个语料库。

skip-gram简要说明

skip-gram核心思想可以通过下图来看，假设我们的窗口大小为2，则对于文本"The quick brown fox jumps over the lazy dog."，随着窗口的滑动将产生训练样本。比如刚开始是(the,quick)(the,brown)两个样本，右移一步后训练样本为(quick,the)(quick,brown)(quick,fox)，继续右移后训练样本为(brown,the)(brown,quick)(brown,fox)(brown,jumps)，接着不断右移产生训练样本。skip-gram模型的核心思想即是上面所说。

预料加载&分词

def read_data(filename):
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        data = f.read()
        seg_list = jieba.cut(data, cut_all=False)
        text = tf.compat.as_str("/".join(seg_list)).split('/')
    return text

filename = "D:\\data6\\house_train\\result.txt"

vocabulary = read_data(filename)

实现对语料库文件的加载并且对其进行分词。filename指定语料库文件，而分词使用jieba来实现，最后返回一个包含语料库所有词的list。

构建词典

vocabulary_size = 50000

def build_dataset(words, n_words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary

这里我们是要建立一个大小为50000的词汇，vocabulary是从语料集中获取的所有单词，统计vocabulary每个单词出现的次数，而且是取出现频率最多的前49999个词，count词典方便后面查询某个单词对应出现的次数。接着我们建立dictionary词典，它是单词与索引的词典，方便后面查询某个单词对应的索引位置。接着我们将vocabulary所有单词转换成索引的形式保存到data中，凡是不在频率最高的49999个词当中的我们都当成是unknown词汇并且将其索引置为0，此过程顺便统计vocabulary包含了多少个unknown的词汇。另外还要建立一个反向索引词典reversed_dictionary，可以通过位置索引得到单词。

获取批数据

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        target = skip_window
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        if data_index == len(data):
            buffer[:] = data[:span]
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels

提供一个生成训练批数据的函数，batch_size是我们一次取得一批样本的数量，num_skip则可以看成是我们要去某个词窗口内的词的数量，比如前面我们说到的窗口大小为2，则某个词附近一共有4个词最终最多可以组成4个训练样本，但如果你只需要组成2个样本的话则通过num_skip来设置。skip_window则用于设定窗口的大小。取样本时是在整个vocabulary通过滑动窗口进行的，得到的batch和labels都是单词对应的词典索引，这对后面运算提供了方便。

构建图

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    with tf.device('/cpu:0'):
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    init = tf.global_variables_initializer()

train_inputs是一个[batch_size]形状的输入占位符，它表示一批输入数据的索引。train_labels是一个[batch_size, 1]形状的正确的分类标签，它表示一批输入对应的正确的分类标签。

embeddings变量用来表示词典中所有单词的128维词向量，这些向量是会在训练过程中不断被更新的，它是一个[vocabulary_size, embedding_size]形状的矩阵，这里其实是[50000,128]，因为我们设定词汇一共有50000个单词，且它的元素的值都在-1到1之间。

然后通过embedding_lookup函数根据索引train_inputs获取到一批128维的输入embed。

接着使用NCE作为损失函数，根据词汇量数量vocabulary_size以及词向量维度embedding_size构建损失函数即可，NCE是负采样损失函数，也可以试试用其他的损失函数。nce_weights和nce_biases是NCE过程的权重和偏置，取平均后用梯度下降法优化损失函数。

最后对embeddings进行标准化，得到标准的词向量，再计算所有词向量与我们选来校验的词的相似性（距离）。

创建会话

with tf.Session(graph=graph) as session:
    init.run()
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

创建会话开始训练，设置需要训练多少轮，由num_steps指定。然后通过generate_batch获取到一批输入及对应标签，指定优化器对象和损失函数对象开始训练，每训练2000轮输出看下具体损失，每10000轮则使用校验数据看看他们最近距离的8个是什么词。

降维画图

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')

    plt.savefig(filename)

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
plot_only = 300
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [reverse_dictionary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)

选取300个词汇并使用TSNE对其进行降维然后画图。

github

github.com/sea-boat/De…

========广告时间========

鄙人的新书《Tomcat内核设计剖析》已经在京东销售了，有需要的朋友可以到 item.jd.com/12185360.ht… 进行预定。感谢各位朋友。

为什么写《Tomcat内核设计剖析》

=========================

欢迎关注：