author | Jiang Yutang <yutang2.jiang@hxt-semitech.com> | 2018-09-07 05:09:24 +0200 |
---|---|---|
committer | Jiang Yutang <yutang2.jiang@hxt-semitech.com> | 2018-09-07 05:19:45 +0200 |
commit | cc59da9785730a4247a24f2af17401124c506293 (patch) | |
tree | 49fae8e10c10d479cbcb93d5fdbf01decefb67b8 /src | |
parent | Merge pull request #23900 from libingyang-zte/master (diff) | |
download | ceph-cc59da9785730a4247a24f2af17401124c506293.tar.xz ceph-cc59da9785730a4247a24f2af17401124c506293.zip |
common/buffer.cc: add create_small_page_aligned to avoid wasting memory on small allocations when the OS page size is large (e.g. 64k)
On my arm64 dev board (CentOS 7.4, default OS page size 64k, one SSD disk, ceph
version 13.2.1), a fio randread test with bs=4k makes the ceph-osd process use a
large amount of memory (more than 20G), while with bs=64k it uses just over 2G.
Tracing the memory allocation path showed that this is related to page size
alignment: requesting a small buffer (4k) but aligning it to the large page size
(64k) wastes memory.
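To make the waste concrete, here is a rough model in C++. It assumes the allocator rounds each aligned request up to a multiple of the alignment; that is an assumption about allocator behaviour, not something measured in this commit, and `footprint` and `wasted_bytes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: bytes occupied by one aligned allocation, ASSUMING the
// allocator rounds the request up to a multiple of the alignment.
inline std::uint64_t footprint(std::uint64_t len, std::uint64_t align) {
  // Alignment must be a non-zero power of two for the model to make sense.
  assert(align != 0 && (align & (align - 1)) == 0);
  return (len + align - 1) / align * align;
}

// Bytes lost per allocation under the same assumption.
inline std::uint64_t wasted_bytes(std::uint64_t len, std::uint64_t align) {
  return footprint(len, align) - len;
}
```

Under this model a 4k request aligned to 64k occupies 64k, so 60k of every allocation is lost, while the same request aligned to 4k loses nothing. The measured ratio in the data below is smaller than 16x, which is expected since not all ceph-osd memory comes from these buffers.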
Modeled on the existing create_page_aligned, this patch adds a new interface,
create_small_page_aligned, which uses 4k alignment when the requested length is
smaller than the page size. I went through all the callers of
create_page_aligned and routed each one to the big or small page alignment
according to how the requested size relates to the current page size. A few
callers that have their own context-specific logic were left unchanged.
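The selection rule can be sketched outside of Ceph as follows. This is only an illustration: `kPageSize` and `kAllocUnit` are hard-coded stand-ins for CEPH_PAGE_SIZE and CEPH_BUFFER_ALLOC_UNIT, and `pick_alignment` and `alloc_small_page_aligned` are hypothetical names that use `posix_memalign` instead of Ceph's `create_aligned`.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in values (assumptions): a 64k page size OS and a 4k allocation unit.
constexpr std::size_t kPageSize = 64 * 1024;
constexpr std::size_t kAllocUnit = 4 * 1024;

// Small requests get the small alignment, everything else stays page aligned.
inline std::size_t pick_alignment(std::size_t len) {
  return len < kPageSize ? kAllocUnit : kPageSize;
}

// Allocate len bytes with the alignment chosen above.
// Returns nullptr on failure; the caller releases the memory with free().
inline void* alloc_small_page_aligned(std::size_t len) {
  void* p = nullptr;
  if (posix_memalign(&p, pick_alignment(len), len) != 0) {
    return nullptr;
  }
  return p;
}
```

With these stand-in values a 4k or 16k request is aligned to 4k, while a 64k request keeps the 64k alignment, which matches the bs=64k rows in the data being essentially unchanged by the patch.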
With the patch applied, the same fio randread test (bs=4k) on a 64k page size OS
shows the memory used by the ceph-osd process dropping from more than 20G to
about 3G. For the bs=16k case the memory used is also significantly reduced,
and read performance is not reduced.
I also ported the patch to the latest ceph tree (version 14.0.0-xxx) and ran a
comparative verification. For the fio (bs=4k) test, the current 14.0.0-x
version uses less memory than 13.2.1, but the patch still reduces its memory
usage significantly.
The following is a partial comparison of the validation data. Different software
and hardware environments may give different values; the better the SSD
performs, the more memory is used.
ceph version    bs   VIRT     RES
13.2.1          64k  3600896  2.7g
13.2.1          64k  3610112  2.7g
13.2.1          64k  3614208  2.7g
13.2.1          16k  7485184  6.4g
13.2.1          16k  7486208  6.4g
13.2.1          16k  7486208  6.4g
13.2.1          4k   23.7g    22.9g  <-- A lot of waste
13.2.1          4k   23.7g    22.9g
13.2.1          4k   23.7g    22.9g
13.2.1+patch    64k  3632384  2.7g
13.2.1+patch    64k  3636480  2.7g
13.2.1+patch    64k  3640576  2.7g
13.2.1+patch    16k  3175296  2.2g
13.2.1+patch    16k  3175296  2.2g
13.2.1+patch    16k  3176320  2.2g
13.2.1+patch    4k   4265920  3.3g   <-- Reasonable usage quantity
13.2.1+patch    4k   4265920  3.3g
13.2.1+patch    4k   4265920  3.3g
14.0.0-x        64k  6230784  4.4g
14.0.0-x        64k  5731840  4.1g
14.0.0-x        64k  4547072  3.5g
14.0.0-x        64k  4544000  3.6g
14.0.0-x        16k  6272192  5.2g
14.0.0-x        16k  6343168  5.3g
14.0.0-x        16k  6357696  5.3g
14.0.0-x        4k   10.1g    9.3g   <-- A lot of waste
14.0.0-x        4k   10.3g    9.6g
14.0.0-x        4k   10.3g    9.4g
14.0.0-x+patch  64k  5974464  4.6g
14.0.0-x+patch  64k  4547008  3.5g
14.0.0-x+patch  64k  4556288  3.6g
14.0.0-x+patch  16k  4058560  3.1g
14.0.0-x+patch  16k  4053504  3.1g
14.0.0-x+patch  16k  4062720  3.1g
14.0.0-x+patch  4k   5283264  4.3g   <-- Reasonable usage quantity
14.0.0-x+patch  4k   5324224  4.3g
14.0.0-x+patch  4k   5297600  4.3g
Signed-off-by: Jiang Yutang <yutang2.jiang@hxt-semitech.com>
Diffstat (limited to 'src')
-rw-r--r-- | src/common/buffer.cc | 6 |
-rw-r--r-- | src/compressor/QatAccel.cc | 4 |
-rw-r--r-- | src/compressor/brotli/BrotliCompressor.cc | 2 |
-rw-r--r-- | src/compressor/lz4/LZ4Compressor.h | 2 |
-rw-r--r-- | src/compressor/snappy/SnappyCompressor.h | 2 |
-rw-r--r-- | src/compressor/zstd/ZstdCompressor.h | 2 |
-rw-r--r-- | src/include/buffer.h | 1 |
-rw-r--r-- | src/msg/async/AsyncConnection.cc | 2 |
-rw-r--r-- | src/msg/simple/Pipe.cc | 2 |
-rw-r--r-- | src/os/bluestore/BlueStore.cc | 2 |
-rw-r--r-- | src/os/bluestore/KernelDevice.cc | 4 |
-rw-r--r-- | src/os/bluestore/NVMEDevice.cc | 4 |
-rw-r--r-- | src/os/bluestore/PMEMDevice.cc | 2 |
-rw-r--r-- | src/os/bluestore/aio.h | 2 |
-rw-r--r-- | src/os/filestore/FileJournal.cc | 4 |
15 files changed, 24 insertions, 17 deletions
diff --git a/src/common/buffer.cc b/src/common/buffer.cc
index ebd3aae511e..fc3b74b8dad 100644
--- a/src/common/buffer.cc
+++ b/src/common/buffer.cc
@@ -718,6 +718,12 @@ using namespace ceph;
   buffer::raw* buffer::create_page_aligned(unsigned len) {
     return create_aligned(len, CEPH_PAGE_SIZE);
   }
+  buffer::raw* buffer::create_small_page_aligned(unsigned len) {
+    if (len < CEPH_PAGE_SIZE) {
+      return create_aligned(len, CEPH_BUFFER_ALLOC_UNIT);
+    } else
+      return create_aligned(len, CEPH_PAGE_SIZE);
+  }

   buffer::raw* buffer::create_zero_copy(unsigned len, int fd, int64_t *offset) {
 #ifdef CEPH_HAVE_SPLICE
diff --git a/src/compressor/QatAccel.cc b/src/compressor/QatAccel.cc
index 7701678c45b..7836243b8a3 100644
--- a/src/compressor/QatAccel.cc
+++ b/src/compressor/QatAccel.cc
@@ -65,7 +65,7 @@ int QatAccel::compress(const bufferlist &in, bufferlist &out) {
     unsigned int len = i.length();
     unsigned int out_len = qzMaxCompressedLength(len);

-    bufferptr ptr = buffer::create_page_aligned(out_len);
+    bufferptr ptr = buffer::create_small_page_aligned(out_len);
     int rc = qzCompress(&session, c_in, &len, (unsigned char *)ptr.c_str(), &out_len, 1);
     if (rc != QZ_OK)
       return -1;
@@ -103,7 +103,7 @@ int QatAccel::decompress(bufferlist::const_iterator &p,
       len = tmp.length();
     }
     unsigned int out_len = len * expansion_ratio[ratio_idx];
-    bufferptr ptr = buffer::create_page_aligned(out_len);
+    bufferptr ptr = buffer::create_small_page_aligned(out_len);
     if (joint)
       rc = qzDecompress(&session, (const unsigned char*)tmp.c_str(),
           &len, (unsigned char*)ptr.c_str(), &out_len);
diff --git a/src/compressor/brotli/BrotliCompressor.cc b/src/compressor/brotli/BrotliCompressor.cc
index 4c473cda427..b0785c07e55 100644
--- a/src/compressor/brotli/BrotliCompressor.cc
+++ b/src/compressor/brotli/BrotliCompressor.cc
@@ -20,7 +20,7 @@ int BrotliCompressor::compress(const bufferlist &in, bufferlist &out)
     size_t available_in = i->length();
     size_t max_comp_size = BrotliEncoderMaxCompressedSize(available_in);
     size_t available_out = max_comp_size;
-    bufferptr ptr = buffer::create_page_aligned(max_comp_size);
+    bufferptr ptr = buffer::create_small_page_aligned(max_comp_size);
     uint8_t* next_out = (uint8_t*)ptr.c_str();
     const uint8_t* next_in = (uint8_t*)i->c_str();
     ++i;
diff --git a/src/compressor/lz4/LZ4Compressor.h b/src/compressor/lz4/LZ4Compressor.h
index d2248cb0a98..8189f18f4c3 100644
--- a/src/compressor/lz4/LZ4Compressor.h
+++ b/src/compressor/lz4/LZ4Compressor.h
@@ -40,7 +40,7 @@ class LZ4Compressor : public Compressor {
     if (qat_enabled)
       return qat_accel.compress(src, dst);
 #endif
-    bufferptr outptr = buffer::create_page_aligned(
+    bufferptr outptr = buffer::create_small_page_aligned(
       LZ4_compressBound(src.length()));
     LZ4_stream_t lz4_stream;
     LZ4_resetStream(&lz4_stream);
diff --git a/src/compressor/snappy/SnappyCompressor.h b/src/compressor/snappy/SnappyCompressor.h
index 67664de42d6..0291a923112 100644
--- a/src/compressor/snappy/SnappyCompressor.h
+++ b/src/compressor/snappy/SnappyCompressor.h
@@ -72,7 +72,7 @@ class SnappyCompressor : public Compressor {
       return qat_accel.compress(src, dst);
 #endif
     BufferlistSource source(const_cast<bufferlist&>(src).begin(), src.length());
-    bufferptr ptr = buffer::create_page_aligned(
+    bufferptr ptr = buffer::create_small_page_aligned(
       snappy::MaxCompressedLength(src.length()));
     snappy::UncheckedByteArraySink sink(ptr.c_str());
     snappy::Compress(&source, &sink);
diff --git a/src/compressor/zstd/ZstdCompressor.h b/src/compressor/zstd/ZstdCompressor.h
index eba32c6e6b3..0b17c99ad13 100644
--- a/src/compressor/zstd/ZstdCompressor.h
+++ b/src/compressor/zstd/ZstdCompressor.h
@@ -35,7 +35,7 @@ class ZstdCompressor : public Compressor {
     size_t left = src.length();

     size_t const out_max = ZSTD_compressBound(left);
-    bufferptr outptr = buffer::create_page_aligned(out_max);
+    bufferptr outptr = buffer::create_small_page_aligned(out_max);
     ZSTD_outBuffer_s outbuf;
     outbuf.dst = outptr.c_str();
     outbuf.size = outptr.length();
diff --git a/src/include/buffer.h b/src/include/buffer.h
index 15a062e21be..e88f65f7491 100644
--- a/src/include/buffer.h
+++ b/src/include/buffer.h
@@ -171,6 +171,7 @@ namespace buffer CEPH_BUFFER_API {
   raw* create_aligned(unsigned len, unsigned align);
   raw* create_aligned_in_mempool(unsigned len, unsigned align, int mempool);
   raw* create_page_aligned(unsigned len);
+  raw* create_small_page_aligned(unsigned len);
   raw* create_zero_copy(unsigned len, int fd, int64_t *offset);
   raw* create_unshareable(unsigned len);
   raw* create_static(unsigned len, char *buf);
diff --git a/src/msg/async/AsyncConnection.cc b/src/msg/async/AsyncConnection.cc
index 15027e5b54d..5ccb385aa6c 100644
--- a/src/msg/async/AsyncConnection.cc
+++ b/src/msg/async/AsyncConnection.cc
@@ -113,7 +113,7 @@ static void alloc_aligned_buffer(bufferlist& data, unsigned len, unsigned off)
     left -= head;
   }
   alloc_len += left;
-  bufferptr ptr(buffer::create_page_aligned(alloc_len));
+  bufferptr ptr(buffer::create_small_page_aligned(alloc_len));
   if (head)
     ptr.set_offset(CEPH_PAGE_SIZE - head);
   data.push_back(std::move(ptr));
diff --git a/src/msg/simple/Pipe.cc b/src/msg/simple/Pipe.cc
index 8005cf8947d..b828ca35937 100644
--- a/src/msg/simple/Pipe.cc
+++ b/src/msg/simple/Pipe.cc
@@ -2043,7 +2043,7 @@ static void alloc_aligned_buffer(bufferlist& data, unsigned len, unsigned off)
   }
   unsigned middle = left & CEPH_PAGE_MASK;
   if (middle > 0) {
-    data.push_back(buffer::create_page_aligned(middle));
+    data.push_back(buffer::create_small_page_aligned(middle));
     left -= middle;
   }
   if (left) {
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index 25d09c8648f..0154c74e680 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -10264,7 +10264,7 @@ void BlueStore::_pad_zeros(
   size_t pad_count = 0;
   if (front_pad) {
     size_t front_copy = std::min<uint64_t>(chunk_size - front_pad, length);
-    bufferptr z = buffer::create_page_aligned(chunk_size);
+    bufferptr z = buffer::create_small_page_aligned(chunk_size);
     z.zero(0, front_pad, false);
     pad_count += front_pad;
     bl->copy(0, front_copy, z.c_str() + front_pad);
diff --git a/src/os/bluestore/KernelDevice.cc b/src/os/bluestore/KernelDevice.cc
index 1d6148bdcad..97f3bd074ca 100644
--- a/src/os/bluestore/KernelDevice.cc
+++ b/src/os/bluestore/KernelDevice.cc
@@ -807,7 +807,7 @@ int KernelDevice::read(uint64_t off, uint64_t len, bufferlist *pbl,

   _aio_log_start(ioc, off, len);

-  bufferptr p = buffer::create_page_aligned(len);
+  bufferptr p = buffer::create_small_page_aligned(len);
   int r = ::pread(buffered ? fd_buffered : fd_direct,
                   p.c_str(), len, off);
   if (r < 0) {
@@ -861,7 +861,7 @@ int KernelDevice::direct_read_unaligned(uint64_t off, uint64_t len, char *buf)
 {
   uint64_t aligned_off = align_down(off, block_size);
   uint64_t aligned_len = align_up(off+len, block_size) - aligned_off;
-  bufferptr p = buffer::create_page_aligned(aligned_len);
+  bufferptr p = buffer::create_small_page_aligned(aligned_len);
   int r = 0;

   r = ::pread(fd_direct, p.c_str(), aligned_len, aligned_off);
diff --git a/src/os/bluestore/NVMEDevice.cc b/src/os/bluestore/NVMEDevice.cc
index 633700d5ab6..563d3c7586c 100644
--- a/src/os/bluestore/NVMEDevice.cc
+++ b/src/os/bluestore/NVMEDevice.cc
@@ -915,7 +915,7 @@ int NVMEDevice::read(uint64_t off, uint64_t len, bufferlist *pbl,
   ceph_assert(is_valid_io(off, len));

   Task *t = new Task(this, IOCommand::READ_COMMAND, off, len, 1);
-  bufferptr p = buffer::create_page_aligned(len);
+  bufferptr p = buffer::create_small_page_aligned(len);
   int r = 0;
   t->ctx = ioc;
   char *buf = p.c_str();
@@ -945,7 +945,7 @@ int NVMEDevice::aio_read(

   Task *t = new Task(this, IOCommand::READ_COMMAND, off, len);

-  bufferptr p = buffer::create_page_aligned(len);
+  bufferptr p = buffer::create_small_page_aligned(len);
   pbl->append(p);
   t->ctx = ioc;
   char* buf = p.c_str();
diff --git a/src/os/bluestore/PMEMDevice.cc b/src/os/bluestore/PMEMDevice.cc
index 68859d01686..81d880b85d8 100644
--- a/src/os/bluestore/PMEMDevice.cc
+++ b/src/os/bluestore/PMEMDevice.cc
@@ -256,7 +256,7 @@ int PMEMDevice::read(uint64_t off, uint64_t len, bufferlist *pbl,
   dout(5) << __func__ << " " << off << "~" << len << dendl;
   ceph_assert(is_valid_io(off, len));

-  bufferptr p = buffer::create_page_aligned(len);
+  bufferptr p = buffer::create_small_page_aligned(len);
   memcpy(p.c_str(), addr + off, len);

   pbl->clear();
diff --git a/src/os/bluestore/aio.h b/src/os/bluestore/aio.h
index 324b13e6940..bc6acb7ec5f 100644
--- a/src/os/bluestore/aio.h
+++ b/src/os/bluestore/aio.h
@@ -32,7 +32,7 @@ struct aio_t {
   void pread(uint64_t _offset, uint64_t len) {
     offset = _offset;
     length = len;
-    bufferptr p = buffer::create_page_aligned(length);
+    bufferptr p = buffer::create_small_page_aligned(length);
     io_prep_pread(&iocb, fd, p.c_str(), length, offset);
     bl.append(std::move(p));
   }
diff --git a/src/os/filestore/FileJournal.cc b/src/os/filestore/FileJournal.cc
index cfb1692cf35..98bed0dc298 100644
--- a/src/os/filestore/FileJournal.cc
+++ b/src/os/filestore/FileJournal.cc
@@ -672,7 +672,7 @@ int FileJournal::read_header(header_t *hdr) const
   dout(10) << "read_header" << dendl;
   bufferlist bl;

-  buffer::ptr bp = buffer::create_page_aligned(block_size);
+  buffer::ptr bp = buffer::create_small_page_aligned(block_size);
   char* bpdata = bp.c_str();
   int r = ::pread(fd, bpdata, bp.length(), 0);

@@ -727,7 +727,7 @@ bufferptr FileJournal::prepare_header()
     header.committed_up_to = journaled_seq;
   }
   encode(header, bl);
-  bufferptr bp = buffer::create_page_aligned(get_top());
+  bufferptr bp = buffer::create_small_page_aligned(get_top());
   // don't use bp.zero() here, because it also invalidates
   // crc cache (which is not yet populated anyway)
   char* data = bp.c_str();
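Several of the read paths touched above allocate an aligned buffer and then call pread into it. The sketch below shows that pattern with plain POSIX calls only. It is not Ceph code: `read_aligned`, `AlignedBuf`, and `FreeDeleter` are hypothetical names, and a raw `posix_memalign` buffer stands in for Ceph's bufferptr.

```cpp
#include <stdlib.h>     // posix_memalign, free (POSIX)
#include <sys/types.h>  // off_t, ssize_t
#include <unistd.h>     // pread
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Releases memory obtained from posix_memalign.
struct FreeDeleter {
  void operator()(char* p) const { free(p); }
};
using AlignedBuf = std::unique_ptr<char, FreeDeleter>;

// Hypothetical helper: allocate len bytes aligned to `align`, then read len
// bytes from fd at offset off into the buffer. The pread result (or -1 if the
// allocation failed) is stored in *nread; the buffer is returned either way.
AlignedBuf read_aligned(int fd, off_t off, std::size_t len,
                        std::size_t align, ssize_t* nread) {
  assert(nread != nullptr);
  void* raw = nullptr;
  if (posix_memalign(&raw, align, len) != 0) {
    *nread = -1;
    return AlignedBuf{};
  }
  AlignedBuf buf(static_cast<char*>(raw));
  *nread = ::pread(fd, buf.get(), len, off);
  return buf;
}
```

A caller on a 64k page size system could pass 4096 as `align` for a 4k read, which is the case the patch targets, and keep the page size for reads of a page or more.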