path: root/drivers
Commit message | Author | Age | Files | Lines
* nvme: fix kernel paging oops | Sagi Grimberg | 2018-12-13 | 1 | -1/+1
    Free the controller discard_page correctly.

    Fixes: cb5b7262b011 ("nvme: provide fallback for discard alloc failure")
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block | Jens Axboe | 2018-12-13 | 26 | -110/+4490
|\
    Pull NVMe updates from Christoph:

    "Here is the second large chunk of nvme updates for 4.21:

     - host and target support for NVMe over TCP (Sagi Grimberg, Roy Shterman, Solganik Alexander)
     - error log page support in target (Chaitanya Kulkarni)

     plus small fixes and improvements from Jens Axboe and Chengguang Xu."

    * 'nvme-4.21' of git://git.infradead.org/nvme: (33 commits)
      nvme-rdma: support separate queue maps for read and write
      nvme-tcp: support separate queue maps for read and write
      nvme-fabrics: allow user to set nr_write_queues for separate queue maps
      nvme-fabrics: add missing nvmf_ctrl_options documentation
      blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
      nvmet: update smart log with num err log entries
      nvmet: add error log page cmd handler
      nvmet: add error log support for file backend
      nvmet: add error log support for bdev backend
      nvmet: add error log support for admin-cmd
      nvmet: add error log support for rdma backend
      nvmet: add error log support for fabrics-cmd
      nvmet: add error log support in the core
      nvmet: add interface to update error-log page
      nvmet: add error-log definitions
      nvme: add error log page slot definition
      nvme: remove nvme_common command cdw10 array
      nvmet: remove unused variable
      nvme: provide fallback for discard alloc failure
      nvme: add __exit annotation
      ...
| * nvme-rdma: support separate queue maps for read and write | Sagi Grimberg | 2018-12-13 | 1 | -3/+25
    Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues. In addition, implement .map_queues that will apply 2 queue maps for read and write queue sets.

    Note that with the separate queue map, HCTX_TYPE_READ will always use nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-tcp: support separate queue maps for read and write | Sagi Grimberg | 2018-12-13 | 1 | -6/+41
    Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues. In addition, implement .map_queues that will apply 2 queue maps for read and write queue sets.

    Note that with the separate queue map, HCTX_TYPE_READ will always use nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-fabrics: allow user to set nr_write_queues for separate queue maps | Sagi Grimberg | 2018-12-13 | 2 | -0/+16
    This argument specifies how many I/O queues will be connected in create_ctrl in addition to nr_io_queues. With this configuration, I/O that carries payload from the host to the target will use the default hctx queue map, and I/O that involves target-to-host transfers will use the read hctx queue map.

    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
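Illustrative note (not part of the commit): the queue split described in the three queue-map commits can be modelled in a few lines of userspace C. This is a minimal sketch under stated assumptions: `struct fabrics_opts`, `total_io_queues()`, `map_for_request()` and `queues_in_map()` are hypothetical names, the enum only mirrors the HCTX_TYPE_* names from the log, and the behaviour when nr_write_queues is 0 (everything on the default map) is an assumption rather than something the log states.

```c
#include <stdbool.h>

/* Stand-ins for the blk-mq queue map types named in the log. */
enum hctx_type { HCTX_TYPE_DEFAULT, HCTX_TYPE_READ };

/* Hypothetical options structure; field names follow the commit text. */
struct fabrics_opts {
    unsigned int nr_io_queues;    /* backs the read map once queues are split */
    unsigned int nr_write_queues; /* additional queues; 0 means no split (assumed) */
};

/* nr_write_queues are connected in addition to nr_io_queues. */
static unsigned int total_io_queues(const struct fabrics_opts *o)
{
    return o->nr_io_queues + o->nr_write_queues;
}

/* Choose a queue map from the direction of the data transfer. */
static enum hctx_type map_for_request(const struct fabrics_opts *o,
                                      bool host_to_target)
{
    if (o->nr_write_queues == 0)
        return HCTX_TYPE_DEFAULT; /* assumed: a single map serves all I/O */
    return host_to_target ? HCTX_TYPE_DEFAULT : HCTX_TYPE_READ;
}

/* Number of queues behind a given map. */
static unsigned int queues_in_map(const struct fabrics_opts *o,
                                  enum hctx_type type)
{
    if (o->nr_write_queues == 0)
        return type == HCTX_TYPE_DEFAULT ? o->nr_io_queues : 0;
    return type == HCTX_TYPE_READ ? o->nr_io_queues : o->nr_write_queues;
}
```

With nr_io_queues = 4 and nr_write_queues = 2, this model connects six queues, sends host-to-target payloads through the two default-map queues and target-to-host transfers through the four read-map queues.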
| * nvme-fabrics: add missing nvmf_ctrl_options documentation | Sagi Grimberg | 2018-12-13 | 1 | -0/+3
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues | Sagi Grimberg | 2018-12-13 | 1 | -1/+1
    Will be used by nvme-rdma for queue map separation support.

    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: update smart log with num err log entries | Chaitanya Kulkarni | 2018-12-13 | 1 | -0/+6
    Now that we have an error log page implementation, update the smart log command handler to fill in the "number of error log entries over the lifetime of the controller" field.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log page cmd handler | Chaitanya Kulkarni | 2018-12-13 | 1 | -7/+29
    Now that we have support for all the major parts of the target, add an NVMe error log page handler so that the host can read the log page.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support for file backend | Chaitanya Kulkarni | 2018-12-13 | 3 | -15/+60
    This patch adds support for the file backend to populate the error log entries. Here we map the errno to the NVMe status codes.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support for bdev backend | Chaitanya Kulkarni | 2018-12-13 | 1 | -12/+72
    This patch adds support for the block device backend to populate the error log entries. Here we map the blk_status_t to the NVMe status.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support for admin-cmd | Chaitanya Kulkarni | 2018-12-13 | 1 | -4/+18
    This patch adds support to maintain the error log page for admin commands.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support for rdma backend | Chaitanya Kulkarni | 2018-12-13 | 1 | -1/+9
    This patch adds support to maintain the error log page for the rdma transport; we mainly focus here on the NVME_INVALID_FIELD errors.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support for fabrics-cmd | Chaitanya Kulkarni | 2018-12-13 | 2 | -13/+45
    This patch adds support to maintain the error log page for the fabrics prop get, prop set, and admin connect commands. It also updates discovery.c, including the set/get features and parse functions, to support the error log page.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add error log support in the core | Chaitanya Kulkarni | 2018-12-13 | 1 | -8/+23
    This patch adds support to maintain the error log page for the nvmet-core.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: add interface to update error-log page | Chaitanya Kulkarni | 2018-12-13 | 2 | -6/+31
    This patch adds an nvmet_req based interface to the nvmet-core so that we can update the error log page. We update the error log page in the request completion path when the status is not NVME_SC_SUCCESS.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
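Illustrative note (not part of the commit): updating a fixed number of error log slots from the completion path can be pictured as a small ring buffer. This is a sketch under assumptions, not the nvmet code: `ERROR_LOG_SLOTS` is a hypothetical small value standing in for NVMET_ERROR_LOG_SLOTS, `struct err_entry` holds only a few illustrative fields, and the spinlock that the error-log definitions commit introduces is omitted because the model is single-threaded.

```c
#include <stdint.h>

#define ERROR_LOG_SLOTS 4 /* hypothetical; stands in for NVMET_ERROR_LOG_SLOTS */
#define SC_SUCCESS 0      /* stands in for NVME_SC_SUCCESS */

/* Reduced, illustrative error log entry. */
struct err_entry {
    uint64_t error_count; /* running count at the time of the error */
    uint16_t status;      /* failing status code */
    uint16_t error_loc;   /* location of the error within the command */
};

/* Hypothetical per-controller state. */
struct ctrl_model {
    struct err_entry slots[ERROR_LOG_SLOTS];
    uint64_t err_counter; /* errors seen over the controller lifetime */
};

/* Completion hook: only failed requests touch the log. */
static void complete_request(struct ctrl_model *c, uint16_t status,
                             uint16_t error_loc)
{
    struct err_entry *e;

    if (status == SC_SUCCESS)
        return;

    /* The oldest slot is overwritten once the ring is full. */
    e = &c->slots[c->err_counter % ERROR_LOG_SLOTS];
    c->err_counter++;
    e->error_count = c->err_counter;
    e->status = status;
    e->error_loc = error_loc;
}
```

The lifetime counter kept here is the kind of value the smart log commit reports as the number of error log entries.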
| * nvmet: add error-log definitions | Chaitanya Kulkarni | 2018-12-13 | 2 | -0/+11
    This patch adds the necessary fields to the target data structures to support the error log page.

    For a target controller, we add a new error log field to maintain the error log. At any given point we maintain NVMET_ERROR_LOG_SLOTS error entries for each controller. In the following patch we also update the error log page entry in the I/O completion path, so we introduce a spinlock to synchronize access to the log.

    For nvmet_req, we add a new field error_loc to hold the location of the error in the command when the actual error occurs for each request, and a starting LBA if applicable.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: remove nvme_common command cdw10 array | Chaitanya Kulkarni | 2018-12-13 | 6 | -23/+23
    This is a preparation patch which removes the nvme common command cdw10 array and replaces it with individual fields. This is needed for the nvmet error log page implementation, where it makes the error log page entry offset assignment easier.

    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: remove unused variable | Sagi Grimberg | 2018-12-13 | 1 | -2/+1
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: provide fallback for discard alloc failure | Jens Axboe | 2018-12-13 | 2 | -6/+38
    When boxes are run near (or to) OOM, we have a problem with the discard page allocation in nvme. If we fail allocating the special page, we return busy, and it'll get retried. But since ordering is honored for dispatch requests, we can keep retrying this same IO and failing. Behind that IO could be requests that want to free memory, but they never get the chance.

    Allocate a fixed discard page per controller for a safe fallback, and use that if the initial allocation fails.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Reviewed-by: Keith Busch <keith.busch@intel.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
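Illustrative note (not part of the commit): the fallback pattern described above, try a fresh allocation first and fall back to one preallocated per-controller page, can be sketched in userspace C as follows. All names (`struct ctrl_model`, `get_discard_buffer()`, `put_discard_buffer()`, `failing_alloc()`) are hypothetical, the busy flag is a plain bool rather than whatever atomic primitive the driver uses, and `malloc`/`free` stand in for kernel allocators.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_BYTES 4096 /* assumed page size for the sketch */

/* Hypothetical per-controller state. */
struct ctrl_model {
    void *discard_page;     /* allocated once at controller setup */
    bool discard_page_busy; /* true while the fallback page is in use */
};

/* Allocator hook so that allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/* Simulates an out-of-memory condition. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Prefer a fresh buffer; use the fixed page only when allocation fails. */
static void *get_discard_buffer(struct ctrl_model *c, alloc_fn alloc)
{
    void *p = alloc(PAGE_BYTES);

    if (p)
        return p;
    if (c->discard_page_busy)
        return NULL; /* caller would report busy and retry later */
    c->discard_page_busy = true;
    return c->discard_page;
}

/* Release: the shared fallback page is marked idle, never freed. */
static void put_discard_buffer(struct ctrl_model *c, void *p)
{
    if (p == c->discard_page)
        c->discard_page_busy = false;
    else
        free(p);
}
```

The release path has to tell the shared page apart from an ordinary allocation; mixing the two up is the kind of mistake the later "fix kernel paging oops" commit addresses by freeing the controller discard_page correctly.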
| * nvme: add __exit annotation | Chengguang Xu | 2018-12-13 | 2 | -2/+2
    Add __exit annotation to cleanup helper which is only called once in the module.

    Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-tcp: add NVMe over TCP host driver | Sagi Grimberg | 2018-12-13 | 3 | -0/+2260
    This patch implements the NVMe over TCP host driver. It can be used to connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

    The driver implements TP 8000, which specifies how nvme over fabrics capsules and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP byte stream. nvme-tcp header and data digest are supported as well.

    To connect to all NVMe over Fabrics controllers reachable on a given target port over TCP, use the following command:

        nvme connect-all -t tcp -a $IPADDR

    This requires the latest version of nvme-cli with TCP support.

    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
    Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: allow configfs tcp trtype configuration | Sagi Grimberg | 2018-12-13 | 1 | -0/+1
    Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet-tcp: add NVMe over TCP target driver | Sagi Grimberg | 2018-12-13 | 3 | -0/+1749
    This patch implements the TCP transport driver for the NVMe over Fabrics target stack. This allows exporting NVMe over Fabrics functionality over good old TCP/IP.

    The driver implements TP 8000, which specifies how nvme over fabrics capsules and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP byte stream. nvme-tcp header and data digest are supported as well.

    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
    Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-fabrics: allow user passing data digest | Sagi Grimberg | 2018-12-13 | 2 | -0/+7
    Data digest is an nvme-tcp specific feature, but nothing prevents other transports from reusing the concept, so do not associate it solely with the tcp transport.

    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-fabrics: allow user passing header digest | Sagi Grimberg | 2018-12-13 | 2 | -0/+7
    Header digest is an nvme-tcp specific feature, but nothing prevents other transports from reusing the concept, so do not associate it solely with the tcp transport.

    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvmet: Add install_queue callout | Sagi Grimberg | 2018-12-13 | 2 | -0/+11
    nvmet-tcp will implement it to allocate queue commands which are only known at nvmf connect time (sq size).

    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * ath6kl: add ath6kl_ prefix to crypto_type | Sagi Grimberg | 2018-12-13 | 4 | -8/+8
    Prevent a namespace conflict, as in following patches skbuff.h will include the crypto API.

    Acked-by: David S. Miller <davem@davemloft.net>
    Cc: Kalle Valo <kvalo@codeaurora.org>
    Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
* | bcache: print number of keys in trace_bcache_journal_write | Guoju Fang | 2018-12-13 | 1 | -1/+1
    Journal flushes may sometimes be very frequent, so it is useful to dump the number of keys on every journal write.

    Signed-off-by: Guoju Fang <fangguoju@gmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: set writeback_percent in a flexible range | Coly Li | 2018-12-13 | 1 | -1/+2
    Because CUTOFF_WRITEBACK is defined as 40, before the dynamic cutoff writeback changes writeback_percent was limited to [0, CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK was fixed up to 40.

    Now the cutoff writeback limit is a dynamic value, bch_cutoff_writeback, so writeback_percent can take the more flexible range [0, bch_cutoff_writeback]. The flexibility is that it can be expanded to a larger or smaller range than [0, 40], depending on how bch_cutoff_writeback is specified.

    The default value is still strongly recommended for most users and most workloads. But people who want to do research on bcache writeback performance tuning may now specify a more flexible writeback_percent in the range [0, 70].

    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: make cutoff_writeback and cutoff_writeback_sync tunable | Coly Li | 2018-12-13 | 3 | -2/+55
    Currently the cutoff writeback and cutoff writeback sync thresholds are defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as static values. Most of the time they work fine, but when people want to do research on bcache writeback mode performance tuning, there is no way to modify the soft and hard cutoff writeback values.

    This patch introduces two module parameters, bch_cutoff_writeback_sync and bch_cutoff_writeback, which permit people to tune the values when loading bcache.ko. If they are not specified at module load, the current values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK are used as defaults and nothing changes.

    When people want to tune these two values,
    - cutoff_writeback can be set in range [1, 70]
    - cutoff_writeback_sync can be set in range [1, 90]
    - cutoff_writeback is always <= cutoff_writeback_sync

    The default values are strongly recommended for most users and most workloads. However, people who want to take their own risk and research new writeback cutoff tuning for their own workload can now do so.

    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
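Illustrative note (not part of the commit): the three constraints listed above can be expressed as a small resolver. This is a sketch under assumptions: `resolve_cutoffs()` and `struct cutoffs` are hypothetical names, 0 is taken to mean "parameter not specified", and out-of-range values are clamped to the nearest limit. The log gives the ranges but does not say how out-of-range input is handled.

```c
#define CUTOFF_WRITEBACK          40 /* default soft cutoff, from the log */
#define CUTOFF_WRITEBACK_SYNC     70 /* default hard cutoff, from the log */
#define CUTOFF_WRITEBACK_MAX      70 /* upper bound of [1, 70] */
#define CUTOFF_WRITEBACK_SYNC_MAX 90 /* upper bound of [1, 90] */

/* Resolved thresholds, in percent. */
struct cutoffs {
    unsigned int writeback;
    unsigned int writeback_sync;
};

/*
 * Resolve module parameters into effective thresholds.
 * Assumptions: 0 means "not specified"; too-large values are clamped.
 */
static struct cutoffs resolve_cutoffs(unsigned int wb, unsigned int wb_sync)
{
    struct cutoffs c;

    c.writeback_sync = wb_sync ? wb_sync : CUTOFF_WRITEBACK_SYNC;
    if (c.writeback_sync > CUTOFF_WRITEBACK_SYNC_MAX)
        c.writeback_sync = CUTOFF_WRITEBACK_SYNC_MAX;

    c.writeback = wb ? wb : CUTOFF_WRITEBACK;
    if (c.writeback > CUTOFF_WRITEBACK_MAX)
        c.writeback = CUTOFF_WRITEBACK_MAX;

    /* Keep cutoff_writeback <= cutoff_writeback_sync. */
    if (c.writeback > c.writeback_sync)
        c.writeback = c.writeback_sync;

    return c;
}
```

With both parameters unspecified the sketch yields the defaults 40 and 70, so nothing changes for users who do not tune.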
* | bcache: add MODULE_DESCRIPTION information | Coly Li | 2018-12-13 | 1 | -3/+4
    This patch moves MODULE_AUTHOR and MODULE_LICENSE to the end of super.c, and adds MODULE_DESCRIPTION("Bcache: a Linux block layer cache").

    This is preparation for adding module parameters.

    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: option to automatically run gc thread after writeback | Coly Li | 2018-12-13 | 4 | -0/+52
    The option gc_after_writeback is disabled by default, because garbage collection will discard SSD data, which drops cached data.

    Echoing 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback enables this option, which wakes up the gc thread when writeback is complete and all cached data is clean.

    This option is helpful for people who care more about write performance. In a heavy write workload, all cached data can be clean only when the writeback thread cleans all cached data during I/O idle time. In such a situation a subsequent gc run may help to shrink the bcache B+ tree and discard more clean data, which may be helpful for future write requests.

    If you are not sure whether this is helpful for your own workload, please leave it disabled, which is the default.

    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: introduce force_wake_up_gc() | Coly Li | 2018-12-13 | 2 | -15/+20
    The garbage collection thread starts to work when c->sectors_to_gc is a negative value; otherwise nothing happens even if the gc thread is woken up by wake_up_gc().

    force_wake_up_gc() sets c->sectors_to_gc to -1 before calling wake_up_gc(), so the gc thread has a chance to run if no one else sets c->sectors_to_gc to a positive value before gc_should_run().

    This routine can be called wherever the gc thread must be woken up and forced to run.

    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
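Illustrative note (not part of the commit): the difference between a plain wake-up and a forced wake-up can be modelled in single-threaded C. This is a sketch under assumptions: `struct cache_set_model` and `gc_thread_iteration()` are hypothetical, the counter is a plain `long` rather than an atomic, and `gc_should_run()` here checks only the sign of sectors_to_gc, which is the one condition the log describes.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the cache set state. */
struct cache_set_model {
    long sectors_to_gc; /* gc is due once this goes negative */
    bool gc_woken;      /* stands in for the thread wake-up */
};

/* Only the condition named in the log is modelled. */
static bool gc_should_run(const struct cache_set_model *c)
{
    return c->sectors_to_gc < 0;
}

/* Plain wake-up: the thread still checks gc_should_run() itself. */
static void wake_up_gc(struct cache_set_model *c)
{
    c->gc_woken = true;
}

/* Forced wake-up: make the run condition true first, then wake. */
static void force_wake_up_gc(struct cache_set_model *c)
{
    c->sectors_to_gc = -1;
    wake_up_gc(c);
}

/* One pass of the modelled gc thread; returns true if gc actually ran. */
static bool gc_thread_iteration(struct cache_set_model *c)
{
    if (!c->gc_woken)
        return false;
    c->gc_woken = false;
    return gc_should_run(c);
}
```

In this model a plain wake_up_gc() with a positive counter wakes the thread without any gc work being done, while force_wake_up_gc() guarantees the next iteration runs, provided nothing resets the counter in between.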
* | bcache: cannot set writeback_running via sysfs if no writeback kthread created | Shenghui Wang | 2018-12-13 | 1 | -2/+19
    "echo 1 > writeback_running" marks writeback_running even if no writeback kthread has been created, as "d_strtoul(writeback_running)" simply sets dc->writeback_running without checking the existence of dc->writeback_thread.

    Add a check for setting writeback_running via sysfs: if no writeback kthread is available, reject setting it to 1.

    v2 -> v3:
    * Make message on wrong assignment more clear.
    * Print name of bcache device instead of name of backing device.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
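Illustrative note (not part of the commit): the added check amounts to validating the store against the existence of the writeback thread. This is a sketch under assumptions: `struct cached_dev_model` and `store_writeback_running()` are hypothetical, and returning -EINVAL is an assumed way of rejecting the write. The log only says that setting the value to 1 is rejected and that a message is printed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of the backing device state. */
struct cached_dev_model {
    bool writeback_running;
    void *writeback_thread; /* NULL until attached to a cache set */
};

/*
 * Modelled sysfs store for writeback_running.
 * Assumption: rejection is reported as -EINVAL.
 */
static int store_writeback_running(struct cached_dev_model *dc,
                                   unsigned long value)
{
    if (value != 0 && dc->writeback_thread == NULL)
        return -EINVAL; /* no kthread: refuse to claim it is running */

    dc->writeback_running = (value != 0);
    return 0;
}
```

Writing 0 is always accepted in this model, since clearing the flag cannot misreport a thread that does not exist.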
* | bcache: do not mark writeback_running too early | Shenghui Wang | 2018-12-13 | 1 | -1/+2
    A fresh backing device is not attached to any cache_set, and has no writeback kthread created until first attached to some cache_set.

    But bch_cached_dev_writeback_init() runs
    "
        dc->writeback_running = true;
        WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
    "
    for any newly formatted backing device.

    For a fresh standalone backing device, we can get something like the following even if no writeback kthread has been created:
    ------------------------
    /sys/block/bcache0/bcache# cat writeback_running
    1
    /sys/block/bcache0/bcache# cat writeback_rate_debug
    rate:           512.0k/sec
    dirty:          0.0k
    target:         0.0k
    proportional:   0.0k
    integral:       0.0k
    change:         0.0k/sec
    next io:        -15427384ms

    The non-zero fields are misleading, as there is no live writeback kthread yet.

    Set dc->writeback_running to false in bch_cached_dev_writeback_init(), as no writeback thread has been created at that point.

    The writeback thread is created and woken up in bch_cached_dev_writeback_start(). Set dc->writeback_running to true before bch_writeback_queue() is called, as a writeback thread checks whether dc->writeback_running is true before writing back dirty data, and hangs if it is false.

    After the change, we get the following output for a fresh standalone backing device:
    -----------------------
    /sys/block/bcache0/bcache$ cat writeback_running
    0
    /sys/block/bcache0/bcache# cat writeback_rate_debug
    rate:           0.0k/sec
    dirty:          0.0k
    target:         0.0k
    proportional:   0.0k
    integral:       0.0k
    change:         0.0k/sec
    next io:        0ms

    v1 -> v2:
    Set dc->writeback_running before bch_writeback_queue() is called.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: update comment in sysfs.c | Shenghui Wang | 2018-12-13 | 1 | -2/+2
    We have struct cached_dev allocated by kzalloc in register_bcache(), which initializes all the fields of cached_dev with 0s. And commit ce4c3e19e520 ("bcache: Replace bch_read_string_list() by __sysfs_match_string()") has removed the string "default".

    Update the comment.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: update comment for bch_data_insert | Shenghui Wang | 2018-12-13 | 1 | -3/+3
    Commit 220bb38c21b8 ("bcache: Break up struct search") introduced changes to struct search and s->iop. bypass/bio are fields of struct data_insert_op now. Update the comment.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: do not check if debug dentry is ERR or NULL explicitly on remove | Shenghui Wang | 2018-12-13 | 2 | -4/+2
    debugfs_remove and debugfs_remove_recursive will check if the dentry pointer is NULL or ERR, and will do nothing in that case.

    Remove the check in cache_set_free and bch_debug_init.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | bcache: add comment for cache_set->fill_iter | Shenghui Wang | 2018-12-13 | 2 | -1/+10
|/
    We have the following definition for the btree iterator:

        struct btree_iter {
                size_t size, used;
        #ifdef CONFIG_BCACHE_DEBUG
                struct btree_keys *b;
        #endif
                struct btree_iter_set {
                        struct bkey *k, *end;
                } data[MAX_BSETS];
        };

    We can see that the length of the data[] field is the static MAX_BSETS, which is currently defined as 4.

    But a btree node on disk could have too many bsets for an iterator to fit on the stack, maybe far more than MAX_BSETS. Space has to be allocated dynamically to host more btree_iter_sets.

    bch_cache_set_alloc() makes sure the pool cache_set->fill_iter can allocate an iterator with enough room to host (sb.bucket_size / sb.block_size) btree_iter_sets, which is more than the static MAX_BSETS.

    bch_btree_node_read_done() uses that pool to allocate one iterator, to host the many bsets in one btree node.

    Add more comments around cache_set->fill_iter to make the code less confusing.

    Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
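Illustrative note (not part of the commit): the sizing rule above, one btree_iter_set per block in a bucket, can be sketched with a heap-allocated iterator that uses a flexible array member. This is a userspace sketch under assumptions: `struct big_iter`, `iter_sets_needed()` and `alloc_fill_iter()` are hypothetical names, `calloc` stands in for the pool the commit message mentions, and the sizes used in the example are made-up values.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_BSETS 4 /* static on-stack limit quoted in the log */

struct bkey; /* opaque for this sketch */

struct btree_iter_set {
    struct bkey *k, *end;
};

/* Hypothetical heap-allocated iterator with room for many sets. */
struct big_iter {
    size_t size, used;
    struct btree_iter_set data[]; /* flexible array member */
};

/* Worst case from the log: bucket_size / block_size sets per node. */
static size_t iter_sets_needed(size_t bucket_size, size_t block_size)
{
    if (block_size == 0)
        return 0; /* guard against division by zero in the sketch */
    return bucket_size / block_size;
}

/* Allocate a zeroed iterator sized for the worst case; NULL on failure. */
static struct big_iter *alloc_fill_iter(size_t bucket_size, size_t block_size)
{
    size_t n = iter_sets_needed(bucket_size, block_size);
    struct big_iter *it = calloc(1, sizeof(*it) + n * sizeof(it->data[0]));

    if (it)
        it->size = n;
    return it;
}
```

For example, with a made-up bucket of 1024 sectors and a block of 8 sectors, the iterator needs room for 128 sets, far more than the four that fit in the on-stack variant.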
* lightnvm: pblk: do not overwrite ppa list with meta list | Igor Konopko | 2018-12-11 | 1 | -2/+5
    When using pblk with 0 sized metadata, both the ppa list and the meta list point to the same memory, since pblk_dma_meta_size() returns 0 in that case.

    This patch fixes that issue by ensuring that pblk_dma_meta_size() always returns space equal to sizeof(struct pblk_sec_meta), so that the ppa list and the meta list point to different memory addresses.

    Even though the drive does not really care about the meta_list pointer in that case, this is the easiest way to fix the issue without introducing changes in many places in the code just for the 0 sized metadata case.

    The same approach also needs to be applied to pblk_get_sec_meta(), since we also cannot point to the same memory address in the meta buffer when we are using it for the pblk recovery process.

    Reported-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
    Tested-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: support packed metadata | Igor Konopko | 2018-12-11 | 9 | -20/+122
    pblk performs recovery of open lines by storing the LBA in the per LBA metadata field. Recovery therefore only works for drives that have this field.

    This patch adds support for packed metadata, which stores the l2p mapping for open lines in the last sector of every write unit and enables drives without per IO metadata to recover open lines.

    After this patch, drives with OOB size <16B will use packed metadata, and drives with metadata size larger than 16B will continue to use the device per IO metadata.

    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: disable interleaved metadata | Igor Konopko | 2018-12-11 | 2 | -0/+7
    Currently pblk only checks the size of I/O metadata and does not take into account whether this metadata is in a separate buffer or interleaved in a single metadata buffer.

    In reality only the first scenario is supported; the second mode will break pblk functionality during any IO operation.

    This patch prevents pblk from being instantiated when the device only supports interleaved metadata.

    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: dynamic DMA pool entry size | Igor Konopko | 2018-12-11 | 6 | -12/+22
    Currently lightnvm and pblk use a single DMA pool, for which the entry size is always equal to PAGE_SIZE. The contents of each entry allocated from the DMA pool consist of a PPA list (8 bytes * 64), leaving 56 bytes * 64 of space for metadata. Since the metadata field can be bigger, such as 128 bytes, the static size does not cover this use case.

    This patch adds support for I/O metadata above 56 bytes by changing the DMA pool size based on the device meta size, and allows pblk to use OOB metadata >=16B.

    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
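Illustrative note (not part of the commit): the arithmetic behind the 56-byte limit, and one way to size a pool entry for larger metadata, can be written out as follows. This is a sketch under assumptions: the page size is taken to be 4096 bytes (which is what makes 8 * 64 + 56 * 64 fill exactly one page), `dma_pool_entry_size()` and the other names are hypothetical, and rounding the entry up to a whole number of pages is an assumed policy that the log does not spell out.

```c
#include <stddef.h>

#define PAGE_BYTES 4096 /* assumed page size */
#define MAX_VLBA   64   /* sectors per entry, from "8bytes * 64" */
#define PPA_BYTES  8    /* one PPA list slot per sector */

/* Metadata bytes per sector left over in a single-page entry. */
static size_t static_meta_room_per_sector(void)
{
    return (PAGE_BYTES - PPA_BYTES * MAX_VLBA) / MAX_VLBA;
}

/*
 * Entry size needed for a given per-sector metadata size.
 * Assumption: never smaller than one page, rounded up to whole pages.
 */
static size_t dma_pool_entry_size(size_t meta_bytes)
{
    size_t need = (PPA_BYTES + meta_bytes) * MAX_VLBA;

    if (need < PAGE_BYTES)
        need = PAGE_BYTES;
    return ((need + PAGE_BYTES - 1) / PAGE_BYTES) * PAGE_BYTES;
}
```

With these assumptions, 16 and 56 bytes of metadata still fit in one 4096-byte entry, while the 128-byte case from the commit message needs (8 + 128) * 64 = 8704 bytes, which rounds up to three pages.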
* lightnvm: pblk: add helpers for OOB metadata | Igor Konopko | 2018-12-11 | 6 | -32/+69
    pblk currently assumes that the size of OOB metadata on the drive is always equal to the size of the pblk_sec_meta struct. This commit adds helpers which will allow handling different sizes of OOB metadata on the drive in the future.

    After this patch only OOB metadata equal to 16 bytes is supported.

    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: move lba list to partial read context | Igor Konopko | 2018-12-11 | 2 | -15/+7
    Currently DMA allocated memory is reused on partial read for the lba_list_mem and lba_list_media arrays. In preparation for dynamic DMA pool sizes we need to move these arrays into the pblk_pr_ctx structure.

    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: avoid ref warning on cache creation | Javier González | 2018-12-11 | 1 | -9/+5
    The current kref implementation around pblk global caches triggers a false positive on refcount_inc_checked() (when called) as the kref is initialized to 0. Instead of using kref_inc() on a 0 reference, which is in principle correct, use kref_init() to avoid the check. This is also more explicit about what actually happens on cache creation.

    In the process, do a small refactoring to use kref helpers.

    Fixes: 1864de94ec9d6 "lightnvm: pblk: stop recreating global caches"
    Signed-off-by: Javier González <javier@cnexlabs.com>
    Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: simplify geometry enumeration | Matias Bjørling | 2018-12-11 | 4 | -30/+20
    Currently the geometry of an OCSSD is enumerated using a two step approach: first, nvm_register is called and the OCSSD identify command is issued; second, the geometry sos and csecs values are read either from the OCSSD identify data if it is a 1.2 drive, or from the NVMe namespace data structure if it is a 2.0 device.

    This patch recombines it into a single step, such that nvm_register can use the csecs and sos fields independent of which version is used. This enables one to dynamically size the lightnvm subsystem dma pool.

    Reviewed-by: Igor Konopko <igor.j.konopko@intel.com>
    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add comments wrt locking in recovery path | Javier González | 2018-12-11 | 2 | -0/+4
    pblk's recovery path is single threaded and therefore a number of assumptions regarding concurrency can be made. To avoid confusion, make this explicit with a couple of comments in the code.

    Signed-off-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add lock protection to list operations | Hua Su | 2018-12-11 | 1 | -3/+10
    Protect the list_add on the pblk_line_init_bb() error path in case this code is used for some other purpose in the future.

    Signed-off-by: Hua Su <suhua.tanke@gmail.com>
    Reviewed-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: Matias Bjørling <mb@lightnvm.io>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>