
Original article: medium.com

How to get the same word2vec/doc2vec/paragraph vectors in every training run

OK, welcome to our Word Embedding Series. This post is the first story of the series. It is probably best suited for intermediate readers and above who have trained, or at least tried to train, word2vec or doc2vec/paragraph vectors. But no worries, I will introduce the background, the prerequisites, and how the packages implement it over the following weeks.

I will try my best not to redirect you to other links that ask you to read tedious tutorials and end with you giving up (trust me, I am a victim of the endless online tutorials :) ). I want you to understand word vectors at the code level, together with me.


If you have ever trained word vectors yourself, you may have noticed that the model and the vector representations differ from one training run to the next, even when you feed in the same training data. This is caused by the randomness introduced at training time. Code speaks for itself, so let's take a look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4j's implementation of paragraph vectors to show the code. If you want to look at another package, check gensim's doc2vec, which implements it the same way.

Where the randomness comes from

The initialization of weights and matrices

We know that before training, the model weights and vector representations are initialized randomly, and the randomness is controlled by a seed. Hence, if we set the seed to 0, we will get exactly the same initialization every time. Here is the place where the seed takes effect.

syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
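As a quick sanity check of this idea, here is a minimal standalone sketch (the class and method names are mine, not DL4j's) that mimics the initialization above with a fixed seed and verifies that two runs produce identical tables:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class SeededInitDemo {

    // Mimics the syn0 initialization above, but seeds ND4J's RNG explicitly.
    static INDArray initTable(long seed, int numWords, int vectorLength) {
        Nd4j.getRandom().setSeed(seed);
        return Nd4j.rand(numWords, vectorLength).subi(0.5).divi(vectorLength);
    }

    public static void main(String[] args) {
        INDArray first = initTable(0L, 1000, 100);
        INDArray second = initTable(0L, 1000, 100);
        // Same seed, same initialization -> prints true
        System.out.println(first.equals(second));
    }
}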

PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain its details in the following posts) to train paragraph vectors, during the iterations it randomly subsamples words from the text window to calculate and update the weights. But this randomness is not real randomness. Let's take a look at the code.

// nextRandom is an AtomicLong initialized with the thread id
this.nextRandom = new AtomicLong(this.threadId);

And nextRandom is used in

trainSequence(sequence, nextRandom, alpha);

Inside trainSequence, it does

nextRandom.set(nextRandom.get() * 25214903917L + 11);

If we dig deeper into this, we will find that nextRandom is always generated in the same way, so its value depends only on the thread id, and the thread ids are simply 0, 1, 2, 3, …. Hence, it is no longer random.
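To see that this is fully deterministic, here is a tiny standalone sketch (class name mine) that replays the same update rule starting from thread id 0; every run prints the same sequence of numbers:

import java.util.concurrent.atomic.AtomicLong;

public class NextRandomDemo {
    public static void main(String[] args) {
        // Same linear congruential update as in trainSequence, seeded with the thread id.
        AtomicLong nextRandom = new AtomicLong(0); // thread id 0
        for (int i = 0; i < 5; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            System.out.println(nextRandom.get());
        }
        // The values depend only on the starting thread id, so the word
        // subsampling decisions repeat exactly across trainings.
    }
}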

Parallel tokenization

This tokenizes the text in parallel. Since processing complicated text can be time-consuming, tokenizing in parallel helps performance, but consistency across training runs is not guaranteed: the sequences produced by the tokenizer can reach the training threads in a different order each time. As you can see from the code, if we set allowParallelBuilder to false, the runnable that does the tokenization is awaited until it finishes, so the order is maintained.

if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
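To illustrate why parallel tokenization breaks ordering, here is a small self-contained sketch (not DL4j code) in which tokenized documents are collected in completion order when a thread pool is used, but in corpus order when done sequentially:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TokenizationOrderDemo {
    public static void main(String[] args) throws Exception {
        List<String> corpus = Arrays.asList("first doc", "second doc", "third doc", "fourth doc");

        // Parallel tokenization: results arrive in completion order,
        // which depends on thread scheduling and can differ across runs.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<List<String>> completion = new ExecutorCompletionService<>(pool);
        for (String doc : corpus) {
            completion.submit(() -> Arrays.asList(doc.split(" ")));
        }
        for (int i = 0; i < corpus.size(); i++) {
            System.out.println("parallel: " + completion.take().get());
        }
        pool.shutdown();

        // Sequential tokenization (the allowParallelBuilder == false path):
        // the order always matches the corpus order.
        for (String doc : corpus) {
            System.out.println("sequential: " + Arrays.asList(doc.split(" ")));
        }
    }
}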

The queue that provides sequences to every training thread

This LinkedBlockingQueue gets sequences from the iterator over the training text and provides them to each thread. Since the threads can arrive in any order, each thread can receive a different set of sequences to train on in every run. Let's look at the implementation.

// initialize a sequencer to provide data to the threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);

// every thread points to the same sequencer
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}

// the sequencer initializes a LinkedBlockingQueue buffer
// and keeps its size between the following limits
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;

// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every training run. But note that a single thread will slow down the training tremendously.
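Here is a short self-contained sketch (not the DL4j class itself) of the same pattern: several workers polling one shared queue. With more than one worker, which thread gets which sequence varies from run to run; with a single worker, the consumption order is always the same:

import java.util.concurrent.LinkedBlockingQueue;

public class SharedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Integer> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) {
            buffer.put(i); // pretend these are tokenized sequences
        }

        int workers = 4; // set to 1 and every run consumes 0..19 in order
        Thread[] threads = new Thread[workers];
        for (int x = 0; x < workers; x++) {
            final int id = x;
            threads[x] = new Thread(() -> {
                Integer sequence;
                while ((sequence = buffer.poll()) != null) {
                    System.out.println("worker " + id + " got sequence " + sequence);
                }
            });
            threads[x].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}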

Summary

To summarize, here is what we need to do to exclude randomness thoroughly:
1. Set seed to 0;
2. Set allowParallelTokenization to false;
3. Set the number of workers (threads) to 1.

Then we will get exactly the same word vectors and paragraph vectors every time we feed in the same data.
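Putting the three settings together, here is a hedged configuration sketch based on DL4j's ParagraphVectors.Builder (the iterator, tokenizer, and file name below are illustrative, and builder method names may vary slightly between DL4j versions):

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class DeterministicParagraphVectors {
    public static void main(String[] args) throws Exception {
        SentenceIterator iterator = new BasicLineIterator("corpus.txt"); // one document per line
        TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .seed(0)                           // 1. fixed seed for weight initialization
                .allowParallelTokenization(false)  // 2. keep the sequence order stable
                .workers(1)                        // 3. single training thread (much slower!)
                .iterate(iterator)
                .tokenizerFactory(tokenizerFactory)
                .build();

        vectors.fit();
    }
}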


If you are feeling excited, please follow the next stories about word embeddings and language models. I have prepared a feast for you.
