背景
本文我们要介绍SortedNumericDocValues,它也是用来存储数值类型的,它和NumericDocValues的区别是:每个文档不能有同名的NumericDocValues字段,但是可以有多个同名的SortedNumericDocValues字段,最终存储的时候同一个doc中的所有同名的SortedNumericDocValues会排序后再存储。
public class DocValueDemo {
public static void main(String[] args) throws IOException {
Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
indexWriterConfig.setUseCompoundFile(false);
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
Document document = new Document();
// 一个doc中同名的NumericDocValuesField最多只能一个,存储数值类型
document.add(new NumericDocValuesField("age", 20));
// 一个doc中同名的SortedNumericDocValuesField可以有多个,存储数值类型,存储的时候会排序:0,3,4
document.add(new SortedNumericDocValuesField("level", 4));
document.add(new SortedNumericDocValuesField("level", 3));
document.add(new SortedNumericDocValuesField("level", 0));
indexWriter.addDocument(document);
indexWriter.flush();
indexWriter.commit();
indexWriter.close();
}
}
前置知识
本文涉及到的一些知识在之前的文章中都做了详细的介绍,后续碰到不会重复介绍。
- DirectMonotonicWriter:用来压缩存储单调递增的long集合,详见《多值编码压缩算法》
- NumericDocValues:单个数值类型的DocValues,详见《DocValues-NumericDocValues》
存储方案
SortedNumericDocValues对于DocValues值的存储和NumericDocValues是一样的,虽然每个doc可能有多个相同field的SortedNumericDocValues,但是这些值是排序之后存储在一起的,在存储value的时候并没有区分是哪个doc的,也是把所有doc的value进行存储,因此NumericDocValues的存储方案同样也适用于SortedNumericDocValues。
区别是当有doc拥有不止一个同名field的SortedNumericDocValues时,需要额外记录每个doc的SortedNumericDocValues个数。
文件格式
dvm
整体结构
字段详解
- Numeric:和NumericDocValues的所有的dvm的字段含义一致。
- NumerDocsWithField:包含此字段的doc总数
- DocValueCountDataOffset:如果存在doc拥有不止一个同filed的SortedNumericDocValues,则需要存储每个doc的SortedNumericDocValues个数,使用DirectMonotonicWriter,第i个位置存储的是:doc0+doc1+... +doci包含的SortedNumericDocValues总数。DocValueCountDataOffset就是DirectMonotonicWriter在dvd文件中的起始位置。
- BlockShift:DirectMonotonicWriter中的参数。
- DocValueCountDataBlockMetas:DirectMonotonicWriter元信息。
- DocValueCountDatalength:DirectMonotonicWriter在dvd中的总长度。
dvd
整体结构
字段详解
- Numeric:数据存储和NumericDocValues一模一样。
- DirectMonotonicWriter:增量存储每个doc包含的SortedNumericDocValues的个数。
源码解析
构建
数据收集
SortedNumericDocValues临时存储的工具是PackedLongValues,每处理256个value使用的是Packed64进行压缩存储。
在持久化的时候,会把SortedNumericDocValuesWriter的值和包含该field的docID集合会根据是否需要对doc进行排序都封装到BufferedSortedNumericDocValues或者SortingSortedNumericDocValues中。
SortedNumericDocValuesWriter
class SortedNumericDocValuesWriter extends DocValuesWriter<SortedNumericDocValues> {
// 存储docValues值
private final PackedLongValues.Builder pending;
// 存储每个doc中有多少个docValues,延迟初始化,只有在某个 doc 的SortedNumericDocValues总数大于1的时候才初始化
private PackedLongValues.Builder pendingCounts;
private final DocsWithFieldSet docsWithField;
private final Counter iwBytesUsed;
private long bytesUsed;
private final FieldInfo fieldInfo;
// 当前处理的docId
private int currentDoc = -1;
// 存储当前处理的doc中的DocValues
private long[] currentValues = new long[8];
private int currentUpto = 0;
private PackedLongValues finalValues;
private PackedLongValues finalValuesCount;
SortedNumericDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
this.fieldInfo = fieldInfo;
this.iwBytesUsed = iwBytesUsed;
pending = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
docsWithField = new DocsWithFieldSet();
bytesUsed =
pending.ramBytesUsed()
+ docsWithField.ramBytesUsed()
+ RamUsageEstimator.sizeOf(currentValues);
iwBytesUsed.addAndGet(bytesUsed);
}
// 为docID新增一个SortedNumericDocValues
public void addValue(int docID, long value) {
assert docID >= currentDoc;
if (docID != currentDoc) { // 当前的docID已经处理结束
finishCurrentDoc();
currentDoc = docID;
}
// 临时存储
addOneValue(value);
updateBytesUsed();
}
private void finishCurrentDoc() {
if (currentDoc == -1) {
return;
}
// 同一个doc中的SortedNumericDocValues需要排序
Arrays.sort(currentValues, 0, currentUpto);
for (int i = 0; i < currentUpto; i++) { // 把排序好的SortedNumericDocValues添加到pending中
pending.add(currentValues[i]);
}
if (pendingCounts != null) { // pendingCounts初始化过了,直接记录当前doc的SortedNumericDocValues个数
pendingCounts.add(currentUpto);
} else if (currentUpto != 1) { // 碰到第一个doc包含多个SortedNumericDocValues
pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
// 这个doc之前的所有doc都是1
for (int i = 0; i < docsWithField.cardinality(); ++i) {
pendingCounts.add(1);
}
// 记录当前doc的SortedNumericDocValues个数
pendingCounts.add(currentUpto);
}
currentUpto = 0;
docsWithField.add(currentDoc);
}
// 如果 currentValues空间不足了就扩容
// 把值加入currentValues中
private void addOneValue(long value) {
if (currentUpto == currentValues.length) {
currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
}
currentValues[currentUpto] = value;
currentUpto++;
}
@Override
SortedNumericDocValues getDocValues() {
if (finalValues == null) {
assert finalValuesCount == null;
finishCurrentDoc();
finalValues = pending.build();
finalValuesCount = pendingCounts == null ? null : pendingCounts.build();
}
return getValues(finalValues, finalValuesCount, docsWithField);
}
private SortedNumericDocValues getValues(
PackedLongValues values, PackedLongValues valueCounts, DocsWithFieldSet docsWithField) {
if (valueCounts == null) { // 如果每个doc最多只有一个SortedNumericDocValues,那就跟NumericDocValues一样
return DocValues.singleton(new BufferedNumericDocValues(values, docsWithField.iterator()));
} else {
return new BufferedSortedNumericDocValues(values, valueCounts, docsWithField.iterator());
}
}
@Override
public void flush(SegmentWriteState state, Sorter.DocMap sortMap, DocValuesConsumer dvConsumer)
throws IOException {
final PackedLongValues values;
final PackedLongValues valueCounts;
if (finalValues == null) { // 结束PackedLongValues的builder
finishCurrentDoc(); // 处理最后一个doc
values = pending.build();
valueCounts = pendingCounts == null ? null : pendingCounts.build();
} else {
values = finalValues;
valueCounts = finalValuesCount;
}
final LongValues sorted;
if (sortMap != null) { // 对doc进行排序
sorted =
new LongValues(
state.segmentInfo.maxDoc(),
sortMap,
getValues(values, valueCounts, docsWithField),
PackedInts.FASTEST);
} else {
sorted = null;
}
dvConsumer.addSortedNumericField(
fieldInfo,
new EmptyDocValuesProducer() {
@Override
public SortedNumericDocValues getSortedNumeric(FieldInfo fieldInfoIn) {
if (fieldInfoIn != fieldInfo) {
throw new IllegalArgumentException("wrong fieldInfo");
}
final SortedNumericDocValues buf = getValues(values, valueCounts, docsWithField);
if (sorted == null) {
return buf;
} else {
return new SortingSortedNumericDocValues(buf, sorted);
}
}
});
}
}
BufferedSortedNumericDocValues
private static class BufferedSortedNumericDocValues extends SortedNumericDocValues {
// value的迭代器
final PackedLongValues.Iterator valuesIter;
// 每个doc包含的SortedNumericDocValues个数的迭代器
final PackedLongValues.Iterator valueCountsIter;
// 包含SortedNumericDocValues的doc的迭代器
final DocIdSetIterator docsWithField;
private int valueCount;
private int valueUpto;
BufferedSortedNumericDocValues(
PackedLongValues values, PackedLongValues valueCounts, DocIdSetIterator docsWithField) {
valuesIter = values.iterator();
valueCountsIter = valueCounts.iterator();
this.docsWithField = docsWithField;
}
@Override
public int docID() {
return docsWithField.docID();
}
@Override
public int nextDoc() throws IOException {
for (int i = valueUpto; i < valueCount; ++i) {
valuesIter.next();
}
int docID = docsWithField.nextDoc();
if (docID != NO_MORE_DOCS) {
valueCount = Math.toIntExact(valueCountsIter.next());
valueUpto = 0;
}
return docID;
}
@Override
public int advance(int target) {
throw new UnsupportedOperationException();
}
@Override
public boolean advanceExact(int target) throws IOException {
throw new UnsupportedOperationException();
}
@Override
public int docValueCount() {
return valueCount;
}
@Override
public long nextValue() {
if (valueUpto == valueCount) {
throw new IllegalStateException();
}
valueUpto++;
return valuesIter.next();
}
@Override
public long cost() {
return docsWithField.cost();
}
}
SortingSortedNumericDocValues
static class SortingSortedNumericDocValues extends SortedNumericDocValues {
private final SortedNumericDocValues in;
// 排序之后的doc和对应的DocValues
private final LongValues values;
private int docID = -1;
private long upto;
private int numValues = -1;
private long limit;
SortingSortedNumericDocValues(SortedNumericDocValues in, LongValues values) {
this.in = in;
this.values = values;
}
@Override
public int docID() {
return docID;
}
@Override
public int nextDoc() {
do {
docID++;
if (docID >= values.offsets.length) {
return docID = NO_MORE_DOCS;
}
} while (values.offsets[docID] <= 0);
upto = values.offsets[docID];
numValues = Math.toIntExact(values.values.get(upto - 1));
limit = upto + numValues;
return docID;
}
@Override
public int advance(int target) {
throw new UnsupportedOperationException("use nextDoc instead");
}
@Override
public boolean advanceExact(int target) throws IOException {
docID = target;
upto = values.offsets[docID];
if (values.offsets[docID] > 0) {
numValues = Math.toIntExact(values.values.get(upto - 1));
limit = upto + numValues;
return true;
} else {
limit = upto;
}
return false;
}
@Override
public long nextValue() {
if (upto == limit) {
throw new AssertionError();
} else {
return values.values.get(upto++);
}
}
@Override
public long cost() {
return in.cost();
}
@Override
public int docValueCount() {
return numValues;
}
}
持久化
NumericDocValues持久化的核心逻辑在writeValues方法中,主要做了以下几件事:
- 同NumericDocValues的逻辑(writeValues),存储DocValues的数据
- 如果某个doc包含多个同名的SortedNumericDocValues,则使用DirectMonotonicWriter存储每个doc的同名的SortedNumericDocValues个数。
public void addSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer)
throws IOException {
meta.writeInt(field.number);
meta.writeByte(Lucene90DocValuesFormat.SORTED_NUMERIC);
doAddSortedNumericField(field, valuesProducer);
}
private void doAddSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer)
throws IOException {
long[] stats = writeValues(field, valuesProducer, false);
int numDocsWithField = Math.toIntExact(stats[0]);
long numValues = stats[1];
assert numValues >= numDocsWithField;
meta.writeInt(numDocsWithField);
if (numValues > numDocsWithField) { // 说明至少有一个doc包含了多个SortedNumericDocValues
long start = data.getFilePointer(); // 当前dvd文件中的位置
meta.writeLong(start);
meta.writeVInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
final DirectMonotonicWriter addressesWriter =
DirectMonotonicWriter.getInstance(
meta, data, numDocsWithField + 1L, DIRECT_MONOTONIC_BLOCK_SHIFT);
long addr = 0;
addressesWriter.add(addr);
SortedNumericDocValues values = valuesProducer.getSortedNumeric(field);
for (int doc = values.nextDoc();
doc != DocIdSetIterator.NO_MORE_DOCS;
doc = values.nextDoc()) {
// doc包含的SortedNumericDocValues的叠加
addr += values.docValueCount();
addressesWriter.add(addr);
}
addressesWriter.finish();
meta.writeLong(data.getFilePointer() - start);
}
}
读取
读取逻辑比较简单:
- 从dvm文件中解析相关的元信息
- 如果所有文档都包含本字段或者都不包含本字段,则可以用maxDoc来判断doc遍历的结束。
- 如果部分文档包含本字段,则使用IndexDISI来遍历doc。
private SortedNumericEntry readSortedNumeric(IndexInput meta) throws IOException {
SortedNumericEntry entry = new SortedNumericEntry();
readSortedNumeric(meta, entry);
return entry;
}
private SortedNumericEntry readSortedNumeric(IndexInput meta, SortedNumericEntry entry)
throws IOException {
readNumeric(meta, entry);
entry.numDocsWithField = meta.readInt();
// 需要读取每个doc包含的docValues数
if (entry.numDocsWithField != entry.numValues) {
entry.addressesOffset = meta.readLong();
final int blockShift = meta.readVInt();
entry.addressesMeta =
DirectMonotonicReader.loadMeta(meta, entry.numDocsWithField + 1, blockShift);
entry.addressesLength = meta.readLong();
}
return entry;
}
public SortedNumericDocValues getSortedNumeric(FieldInfo field) throws IOException {
SortedNumericEntry entry = sortedNumerics.get(field.name);
return getSortedNumeric(entry);
}
private SortedNumericDocValues getSortedNumeric(SortedNumericEntry entry) throws IOException {
if (entry.numValues == entry.numDocsWithField) {
return DocValues.singleton(getNumeric(entry));
}
final RandomAccessInput addressesInput =
data.randomAccessSlice(entry.addressesOffset, entry.addressesLength);
final LongValues addresses =
DirectMonotonicReader.getInstance(entry.addressesMeta, addressesInput, merging);
final LongValues values = getNumericValues(entry);
if (entry.docsWithFieldOffset == -1) { // 如果所有的doc都包含本字段的SortedNumericDocValues
return new SortedNumericDocValues() {
int doc = -1;
long start, end;
int count;
@Override
public int nextDoc() throws IOException {
return advance(doc + 1);
}
@Override
public int docID() {
return doc;
}
@Override
public long cost() {
return maxDoc;
}
@Override
public int advance(int target) throws IOException {
if (target >= maxDoc) {
return doc = NO_MORE_DOCS;
}
start = addresses.get(target);
end = addresses.get(target + 1L);
count = (int) (end - start);
return doc = target;
}
@Override
public boolean advanceExact(int target) throws IOException {
start = addresses.get(target);
end = addresses.get(target + 1L);
count = (int) (end - start);
doc = target;
return true;
}
// nextValue没有做越界防护,所以需要业务根据docValueCount方法做判断
public long nextValue() throws IOException {
return values.get(start++);
}
@Override
public int docValueCount() {
return count;
}
};
} else { // 部分doc包含此字段的SortedNumericDocValues
final IndexedDISI disi =
new IndexedDISI(
data,
entry.docsWithFieldOffset,
entry.docsWithFieldLength,
entry.jumpTableEntryCount,
entry.denseRankPower,
entry.numDocsWithField);
return new SortedNumericDocValues() {
// 记录下面三个参数是否是正确可用的
boolean set;
long start, end;
// doc拥有多少个DocValues
int count;
@Override
public int nextDoc() throws IOException {
set = false;
return disi.nextDoc();
}
@Override
public int docID() {
return disi.docID();
}
@Override
public long cost() {
return disi.cost();
}
@Override
public int advance(int target) throws IOException {
set = false;
return disi.advance(target);
}
@Override
public boolean advanceExact(int target) throws IOException {
set = false;
return disi.advanceExact(target);
}
// nextValue没有做越界防护,所以需要业务根据docValueCount方法做判断
public long nextValue() throws IOException {
set();
return values.get(start++);
}
@Override
public int docValueCount() {
set();
return count;
}
private void set() {
if (set == false) {
// 第几个doc
final int index = disi.index();
start = addresses.get(index);
end = addresses.get(index + 1L);
// 当前doc的SortedNumericDocValues个数
count = (int) (end - start);
set = true;
}
}
};
}
}
总结
SortedNumericDocValues是在NumericDocValues的基础上设计的,而大部分复杂的设计都在NumericDocValues中,所以如果看明白了NumericDocValues,会发现SortedNumericDocValues很简单。