summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* kallsyms: Add self-test facilityZhen Lei2022-11-156-1/+514
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added test cases for basic functions and performance of functions kallsyms_lookup_name(), kallsyms_on_each_symbol() and kallsyms_on_each_match_symbol(). It also calculates the compression rate of the kallsyms compression algorithm for the current symbol set. The basic functions test begins by testing a set of symbols whose address values are known. Then, traverse all symbol addresses and find the corresponding symbol name based on the address. It's impossible to determine whether these addresses are correct, but we can use the above three functions along with the addresses to test each other. Due to the traversal operation of kallsyms_on_each_symbol() is too slow, only 60 symbols can be tested in one second, so let it test on average once every 128 symbols. The other two functions validate all symbols. If the basic functions test is passed, print only performance test results. If the test fails, print error information, but do not perform subsequent performance tests. Start self-test automatically after system startup if CONFIG_KALLSYMS_SELFTEST=y. Example of output content: (prefix 'kallsyms_selftest:' is omitted start --------------------------------------------------------- | nr_symbols | compressed size | original size | ratio(%) | |---------------------------------------------------------| | 107543 | 1357912 | 2407433 | 56.40 | --------------------------------------------------------- kallsyms_lookup_name() looked up 107543 symbols The time spent on each symbol is (ns): min=630, max=35295, avg=7353 kallsyms_on_each_symbol() traverse all: 11782628 ns kallsyms_on_each_match_symbol() traverse all: 9261 ns finish Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* livepatch: Use kallsyms_on_each_match_symbol() to improve performanceZhen Lei2022-11-131-1/+19
| | | | | | | | | Based on the test results of kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol(), the average performance can be improved by more than 1500 times. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* kallsyms: Add helper kallsyms_on_each_match_symbol()Zhen Lei2022-11-132-0/+26
| | | | | | | | | | | | | | | | | | | | | | Function kallsyms_on_each_symbol() traverses all symbols and submits each symbol to the hook 'fn' for judgment and processing. For some cases, the hook actually only handles the matched symbol, such as livepatch. Because all symbols are currently sorted by name, all the symbols with the same name are clustered together. Function kallsyms_lookup_names() gets the start and end positions of the set corresponding to the specified name. So we can easily and quickly traverse all the matches. The test results are as follows (twice): (x86) kallsyms_on_each_match_symbol: 7454, 7984 kallsyms_on_each_symbol : 11733809, 11785803 kallsyms_on_each_match_symbol() consumes only 0.066% of kallsyms_on_each_symbol()'s time. In other words, 1523x better performance. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[]Zhen Lei2022-11-133-6/+19
| | | | | | | | | | kallsyms_seqs_of_names[] records the symbol index sorted by address, the maximum value in kallsyms_seqs_of_names[] is the number of symbols. And 2^24 = 16777216, which means that three bytes are enough to store the index. This can help us save (1 * kallsyms_num_syms) bytes of memory. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* kallsyms: Correctly sequence symbols when CONFIG_LTO_CLANG=yZhen Lei2022-11-132-2/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LLVM appends various suffixes for local functions and variables, suffixes observed: - foo.llvm.[0-9a-f]+ - foo.[0-9a-f]+ Therefore, when CONFIG_LTO_CLANG=y, kallsyms_lookup_name() needs to truncate the suffix of the symbol name before comparing the local function or variable name. Old implementation code: - if (strcmp(namebuf, name) == 0) - return kallsyms_sym_address(i); - if (cleanup_symbol_name(namebuf) && strcmp(namebuf, name) == 0) - return kallsyms_sym_address(i); The preceding process is traversed by address from low to high. That is, for those with the same name after the suffix is removed, the one with the smallest address is returned first. Therefore, when sorting in the tool, if the raw names are the same, they should be sorted by address in ascending order. ASCII[.] = 2e ASCII[0-9] = 30,39 ASCII[A-Z] = 41,5a ASCII[_] = 5f ASCII[a-z] = 61,7a According to the preceding ASCII code values, the following sorting result is strictly followed. --------------------------------- | main-key | sub-key | |---------------------------------| | | addr_lowest | | <name> | ... | | <name>.<suffix> | ... | | | addr_highest | |---------------------------------| | <name>?<others> | | //? is [_A-Za-z0-9] --------------------------------- Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* kallsyms: Improve the performance of kallsyms_lookup_name()Zhen Lei2022-11-133-11/+113
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, to search for a symbol, we need to expand the symbols in 'kallsyms_names' one by one, and then use the expanded string for comparison. It's O(n). If we sort names in ascending order like addresses, we can also use binary search. It's O(log(n)). In order not to change the implementation of "/proc/kallsyms", the table kallsyms_names[] is still stored in a one-to-one correspondence with the address in ascending order. Add array kallsyms_seqs_of_names[], it's indexed by the sequence number of the sorted names, and the corresponding content is the sequence number of the sorted addresses. For example: Assume that the index of NameX in array kallsyms_seqs_of_names[] is 'i', the content of kallsyms_seqs_of_names[i] is 'k', then the corresponding address of NameX is kallsyms_addresses[k]. The offset in kallsyms_names[] is get_symbol_offset(k). Note that the memory usage will increase by (4 * kallsyms_num_syms) bytes, the next two patches will reduce (1 * kallsyms_num_syms) bytes and properly handle the case CONFIG_LTO_CLANG=y. Performance test results: (x86) Before: min=234, max=10364402, avg=5206926 min=267, max=11168517, avg=5207587 After: min=1016, max=90894, avg=7272 min=1014, max=93470, avg=7293 The average lookup performance of kallsyms_lookup_name() improved 715x. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* scripts/kallsyms: rename build_initial_tok_table()Zhen Lei2022-11-131-2/+2
| | | | | | | | | | | | | | | | Except for the function build_initial_tok_table(), no token abbreviation is used elsewhere. $ cat scripts/kallsyms.c | grep tok | wc -l 33 $ cat scripts/kallsyms.c | grep token | wc -l 31 Here, it would be clearer to use the full name. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* module: Fix NULL vs IS_ERR checking for module_get_next_pageMiaoqian Lin2022-11-111-4/+4
| | | | | | | | | | | The module_get_next_page() function return error pointers on error instead of NULL. Use IS_ERR() to check the return value to fix this. Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* kernel/params.c: defer most of param_sysfs_init() to late_initcall timeRasmus Villemoes2022-11-111-2/+19
| | | | | | | | | | | | | | | | | | | | param_sysfs_init(), and in particular param_sysfs_builtin() is rather time-consuming; for my board, it currently takes about 30ms. That amounts to about 3% of the time budget I have from U-Boot hands over control to linux and linux must assume responsibility for keeping the external watchdog happy. We must still continue to initialize module_kset at subsys_initcall time, since otherwise any request_module() would fail in mod_sysfs_init(). However, the bulk of the work in param_sysfs_builtin(), namely populating /sys/module/*/version and/or /sys/module/*/parameters/ for builtin modules, can be deferred to late_initcall time - there's no userspace yet anyway to observe contents of /sys or the lack thereof. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* module: Remove unused macros module_addr_min/maxChen Zhongjin2022-11-111-3/+0
| | | | | | | | | | | | | | | | | | | Unused macros reported by [-Wunused-macros]. These macros are introduced to record the bound address of modules. Commit 80b8bf436990 ("module: Always have struct mod_tree_root") made "struct mod_tree_root" always present and its members addr_min and addr_max can be directly accessed. Macros module_addr_min and module_addr_min are not used anymore, so remove them. Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mcgrof: massaged the commit messsage as suggested by Miroslav] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* module: remove redundant module_sysfs_initialized variableRasmus Villemoes2022-11-113-4/+1
| | | | | | | | | | | | | | The variable module_sysfs_initialized is used for checking whether module_kset has been initialized. Checking module_kset itself works just fine for that. This is a leftover from commit 7405c1e15edf ("kset: convert /sys/module to use kset_create"). Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Miroslav Benes <mbenes@suse.cz> [mcgrof: adjusted commit log as suggested by Christophe Leroy] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* Merge tag 'perf-tools-fixes-for-v6.1-2-2022-11-10' of ↵Linus Torvalds2022-11-114-4/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix 'perf stat' crash with --per-node --metric-only in CSV mode, due to the AGGR_NODE slot in the 'aggr_header_csv' array not being set. - Fix printing prefix in CSV output of 'perf stat' metrics in interval mode (-I), where an extra separator was being added to the start of some lines. - Fix skipping branch stack sampling 'perf test' entry, that was using both --branch-any and --branch-filter, which can't be used together. * tag 'perf-tools-fixes-for-v6.1-2-2022-11-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf tools: Add the include/perf/ directory to .gitignore perf test: Fix skipping branch stack sampling test perf stat: Fix printing os->prefix in CSV metrics output perf stat: Fix crash with --per-node --metric-only in CSV mode
| * perf tools: Add the include/perf/ directory to .gitignoreDonglin Peng2022-11-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3af1dfdd51e06697 ("perf build: Move perf_dlfilters.h in the source tree") moved perf_dlfilters.h to the include/perf/ directory while include/perf is ignored because it has 'perf' in the name. Newly created files in the include/perf/ directory will be ignored. Testing: Before: $ touch tools/perf/include/perf/junk $ git status | grep junk $ git check-ignore -v tools/perf/include/perf/junk tools/perf/.gitignore:6:perf tools/perf/include/perf/junk After: $ git status | grep junk tools/perf/include/perf/junk $ git check-ignore -v tools/perf/include/perf/junk Add !include/perf/ to perf's .gitignore file. Fixes: 3af1dfdd51e06697 ("perf build: Move perf_dlfilters.h in the source tree") Signed-off-by: Donglin Peng <dolinux.peng@gmail.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20221103092704.173391-1-dolinux.peng@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * perf test: Fix skipping branch stack sampling testJames Clark2022-11-082-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f4a2aade6809c657 ("perf tests powerpc: Fix branch stack sampling test to include sanity check for branch filter") added a skip if certain branch options aren't available. But the change added both -b (--branch-any) and --branch-filter options at the same time, which will always result in a failure on any platform because the arguments can't be used together. Fix this by removing -b (--branch-any) and leaving --branch-filter which already specifies 'any'. Also add warning messages to the test and perf tool. Output on x86 before this fix: $ sudo ./perf test branch 108: Check branch stack sampling : Skip After: $ sudo ./perf test branch 108: Check branch stack sampling : Ok Fixes: f4a2aade6809c657 ("perf tests powerpc: Fix branch stack sampling test to include sanity check for branch filter") Signed-off-by: James Clark <james.clark@arm.com> Tested-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Anshuman.Khandual@arm.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20221028121913.745307-1-james.clark@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * perf stat: Fix printing os->prefix in CSV metrics outputAthira Rajeev2022-11-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'perf stat' with CSV output option prints an extra empty string as first field in metrics output line. Sample output below: # ./perf stat -x, --per-socket -a -C 1 ls S0,1,1.78,msec,cpu-clock,1785146,100.00,0.973,CPUs utilized S0,1,26,,context-switches,1781750,100.00,0.015,M/sec S0,1,1,,cpu-migrations,1780526,100.00,0.561,K/sec S0,1,1,,page-faults,1779060,100.00,0.561,K/sec S0,1,875807,,cycles,1769826,100.00,0.491,GHz S0,1,85281,,stalled-cycles-frontend,1767512,100.00,9.74,frontend cycles idle S0,1,576839,,stalled-cycles-backend,1766260,100.00,65.86,backend cycles idle S0,1,288430,,instructions,1762246,100.00,0.33,insn per cycle ====> ,S0,1,,,,,,,2.00,stalled cycles per insn The above command line uses field separator as "," via "-x," option and per-socket option displays socket value as first field. But here the last line for "stalled cycles per insn" has "," in the beginning. Sample output using interval mode: # ./perf stat -I 1000 -x, --per-socket -a -C 1 ls 0.001813453,S0,1,1.87,msec,cpu-clock,1872052,100.00,0.002,CPUs utilized 0.001813453,S0,1,2,,context-switches,1868028,100.00,1.070,K/sec ------ 0.001813453,S0,1,85379,,instructions,1856754,100.00,0.32,insn per cycle ====> 0.001813453,,S0,1,,,,,,,1.34,stalled cycles per insn Above result also has an extra CSV separator after the timestamp. Patch addresses extra field separator in the beginning of the metric output line. The counter stats are displayed by function "perf_stat__print_shadow_stats" in code "util/stat-shadow.c". While printing the stats info for "stalled cycles per insn", function "new_line_csv" is used as new_line callback. The new_line_csv function has check for "os->prefix" and if prefix is not null, it will be printed along with cvs separator. Snippet from "new_line_csv": if (os->prefix) fprintf(os->fh, "%s%s", os->prefix, config->csv_sep); Here os->prefix gets printed followed by "," which is the cvs separator. The os->prefix is used in interval mode option ( -I ), to print time stamp on every new line. But prefix is already set to contain CSV separator when used in interval mode for CSV option. Reference: Function "static void print_interval" Snippet: sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep); Also if prefix is not assigned (if not used with -I option), it gets set to empty string. Reference: function printout() in util/stat-display.c Snippet: .prefix = prefix ? prefix : "", Since prefix already set to contain cvs_sep in interval option, patch removes printing config->csv_sep in new_line_csv function to avoid printing extra field. After the patch: # ./perf stat -x, --per-socket -a -C 1 ls S0,1,2.04,msec,cpu-clock,2045202,100.00,1.013,CPUs utilized S0,1,2,,context-switches,2041444,100.00,979.289,/sec S0,1,0,,cpu-migrations,2040820,100.00,0.000,/sec S0,1,2,,page-faults,2040288,100.00,979.289,/sec S0,1,254589,,cycles,2036066,100.00,0.125,GHz S0,1,82481,,stalled-cycles-frontend,2032420,100.00,32.40,frontend cycles idle S0,1,113170,,stalled-cycles-backend,2031722,100.00,44.45,backend cycles idle S0,1,88766,,instructions,2030942,100.00,0.35,insn per cycle S0,1,,,,,,,1.27,stalled cycles per insn Fixes: 92a61f6412d3a09d ("perf stat: Implement CSV metrics output") Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com> Reviewed-By: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nageswara R Sastry <rnsastry@linux.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20221018085605.63834-1-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * perf stat: Fix crash with --per-node --metric-only in CSV modeNamhyung Kim2022-11-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following command will get segfault due to missing aggr_header_csv for AGGR_NODE: $ sudo perf stat -a --per-node -x, --metric-only true Committer testing: Before this patch: # perf stat -a --per-node -x, --metric-only true Segmentation fault (core dumped) # After: # gdb perf -bash: gdb: command not found # perf stat -a --per-node -x, --metric-only true node,Ghz,frontend cycles idle,backend cycles idle,insn per cycle,branch-misses of all branches, N0,32,0.335,2.10,0.65,0.69,0.03,1.92, # Fixes: 86895b480a2f10c7 ("perf stat: Add --per-node agregation support") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Link: http://lore.kernel.org/lkml/20221107213314.3239159-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
* | Merge tag 'riscv-for-linus-6.1-rc5' of ↵Linus Torvalds2022-11-116-2/+47
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A fix to add the missing PWM LEDs into the SiFive HiFive Unleashed device tree. - A fix to fully clear a task's registers on creation, as they end up in userspace and thus leak kernel memory. - A pair of VDSO-related build fixes that manifest on recent LLVM-based toolchains. - A fix to our early init to ensure the DT is adequately processed before reserved memory nodes are processed. * tag 'riscv-for-linus-6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: vdso: Do not add missing symbols to version section in linker script riscv: fix reserved memory setup riscv: vdso: fix build with llvm riscv: process: fix kernel info leakage riscv: dts: sifive unleashed: Add PWM controlled LEDs
| * | RISC-V: vdso: Do not add missing symbols to version section in linker scriptNathan Chancellor2022-11-112-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently, ld.lld moved from '--undefined-version' to '--no-undefined-version' as the default, which breaks the compat vDSO build: ld.lld: error: version script assignment of 'LINUX_4.15' to symbol '__vdso_gettimeofday' failed: symbol not defined ld.lld: error: version script assignment of 'LINUX_4.15' to symbol '__vdso_clock_gettime' failed: symbol not defined ld.lld: error: version script assignment of 'LINUX_4.15' to symbol '__vdso_clock_getres' failed: symbol not defined These symbols are not present in the compat vDSO or the regular vDSO for 32-bit but they are unconditionally included in the version section of the linker script, which is prohibited with '--no-undefined-version'. Fix this issue by only including the symbols that are actually exported in the version section of the linker script. Link: https://github.com/ClangBuiltLinux/linux/issues/1756 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20221108171324.3377226-1-nathan@kernel.org/ Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | riscv: fix reserved memory setupConor Dooley2022-11-102-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, RISC-V sets up reserved memory using the "early" copy of the device tree. As a result, when trying to get a reserved memory region using of_reserved_mem_lookup(), the pointer to reserved memory regions is using the early, pre-virtual-memory address which causes a kernel panic when trying to use the buffer's name: Unable to handle kernel paging request at virtual address 00000000401c31ac Oops [#1] Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 6.0.0-rc1-00001-g0d9d6953d834 #1 Hardware name: Microchip PolarFire-SoC Icicle Kit (DT) epc : string+0x4a/0xea ra : vsnprintf+0x1e4/0x336 epc : ffffffff80335ea0 ra : ffffffff80338936 sp : ffffffff81203be0 gp : ffffffff812e0a98 tp : ffffffff8120de40 t0 : 0000000000000000 t1 : ffffffff81203e28 t2 : 7265736572203a46 s0 : ffffffff81203c20 s1 : ffffffff81203e28 a0 : ffffffff81203d22 a1 : 0000000000000000 a2 : ffffffff81203d08 a3 : 0000000081203d21 a4 : ffffffffffffffff a5 : 00000000401c31ac a6 : ffff0a00ffffff04 a7 : ffffffffffffffff s2 : ffffffff81203d08 s3 : ffffffff81203d00 s4 : 0000000000000008 s5 : ffffffff000000ff s6 : 0000000000ffffff s7 : 00000000ffffff00 s8 : ffffffff80d9821a s9 : ffffffff81203d22 s10: 0000000000000002 s11: ffffffff80d9821c t3 : ffffffff812f3617 t4 : ffffffff812f3617 t5 : ffffffff812f3618 t6 : ffffffff81203d08 status: 0000000200000100 badaddr: 00000000401c31ac cause: 000000000000000d [<ffffffff80338936>] vsnprintf+0x1e4/0x336 [<ffffffff80055ae2>] vprintk_store+0xf6/0x344 [<ffffffff80055d86>] vprintk_emit+0x56/0x192 [<ffffffff80055ed8>] vprintk_default+0x16/0x1e [<ffffffff800563d2>] vprintk+0x72/0x80 [<ffffffff806813b2>] _printk+0x36/0x50 [<ffffffff8068af48>] print_reserved_mem+0x1c/0x24 [<ffffffff808057ec>] paging_init+0x528/0x5bc [<ffffffff808031ae>] setup_arch+0xd0/0x592 [<ffffffff8080070e>] start_kernel+0x82/0x73c early_init_fdt_scan_reserved_mem() takes no arguments as it operates on initial_boot_params, which is populated by early_init_dt_verify(). On RISC-V, early_init_dt_verify() is called twice. Once, directly, in setup_arch() if CONFIG_BUILTIN_DTB is not enabled and once indirectly, very early in the boot process, by parse_dtb() when it calls early_init_dt_scan_nodes(). This first call uses dtb_early_va to set initial_boot_params, which is not usable later in the boot process when early_init_fdt_scan_reserved_mem() is called. On arm64 for example, the corresponding call to early_init_dt_scan_nodes() uses fixmap addresses and doesn't suffer the same fate. Move early_init_fdt_scan_reserved_mem() further along the boot sequence, after the direct call to early_init_dt_verify() in setup_arch() so that the names use the correct virtual memory addresses. The above supposed that CONFIG_BUILTIN_DTB was not set, but should work equally in the case where it is - unflatted_and_copy_device_tree() also updates initial_boot_params. Reported-by: Valentina Fernandez <valentina.fernandezalanis@microchip.com> Reported-by: Evgenii Shatokhin <e.shatokhin@yadro.com> Link: https://lore.kernel.org/linux-riscv/f8e67f82-103d-156c-deb0-d6d6e2756f5e@microchip.com/ Fixes: 922b0375fc93 ("riscv: Fix memblock reservation for device tree blob") Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Evgenii Shatokhin <e.shatokhin@yadro.com> Link: https://lore.kernel.org/r/20221107151524.3941467-1-conor.dooley@microchip.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | riscv: vdso: fix build with llvmJisheng Zhang2022-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even after commit 89fd4a1df829 ("riscv: jump_label: mark arguments as const to satisfy asm constraints"), building with CC_OPTIMIZE_FOR_SIZE + LLVM=1 can reproduce below build error: CC arch/riscv/kernel/vdso/vgettimeofday.o In file included from <built-in>:4: In file included from lib/vdso/gettimeofday.c:5: In file included from include/vdso/datapage.h:17: In file included from include/vdso/processor.h:10: In file included from arch/riscv/include/asm/vdso/processor.h:7: In file included from include/linux/jump_label.h:112: arch/riscv/include/asm/jump_label.h:42:3: error: invalid operand for inline asm constraint 'i' " .option push \n\t" ^ 1 error generated. I think the problem is when "-Os" is passed as CFLAGS, it's removed by "CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os" which is introduced in commit e05d57dcb8c7 ("riscv: Fixup __vdso_gettimeofday broke dynamic ftrace"), thus no optimization at all for vgettimeofday.c arm64 does remove "-Os" as well, but it forces "-O2" after removing "-Os". I compared the generated vgettimeofday.o with "-O2" and "-Os", I think no big performance difference. So let's tell the kbuild not to remove "-Os" rather than follow arm64 style. vdso related performance can be improved a lot when building kernel with CC_OPTIMIZE_FOR_SIZE after this commit, ("-Os" VS no optimization) Fixes: e05d57dcb8c7 ("riscv: Fixup __vdso_gettimeofday broke dynamic ftrace") Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20221031182943.2453-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | riscv: process: fix kernel info leakageJisheng Zhang2022-11-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | thread_struct's s[12] may contain random kernel memory content, which may be finally leaked to userspace. This is a security hole. Fix it by clearing the s[12] array in thread_struct when fork. As for kthread case, it's better to clear the s[12] array as well. Fixes: 7db91e57a0ac ("RISC-V: Task implementation") Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Tested-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20221029113450.4027-1-jszhang@kernel.org Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/CAJF2gTSdVyAaM12T%2B7kXAdRPGS4VyuO08X1c7paE-n4Fr8OtRA@mail.gmail.com/ Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | riscv: dts: sifive unleashed: Add PWM controlled LEDsEmil Renner Berthing2022-10-291-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the 4 PWM controlled green LEDs to the HiFive Unleashed device tree. The schematic doesn't specify any special function for the LEDs, so they're added here without any default triggers and named d1, d2, d3 and d4 just like in the schematic. Signed-off-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20221012110928.352910-1-emil.renner.berthing@canonical.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2022-11-1123-207/+435
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm "This is a pretty large diffstat for this time of the release. The main culprit is a reorganization of the AMD assembly trampoline, allowing percpu variables to be accessed early. This is needed for the return stack depth tracking retbleed mitigation that will be in 6.2, but it also makes it possible to tighten the IBRS restore on vmexit. The latter change is a long tail of the spectrev2/retbleed patches (the corresponding Intel change was simpler and went in already last June), which is why I am including it right now instead of sharing a topic branch with tip. Being assembly and being rich in comments makes the line count balloon a bit, but I am pretty confident in the change (famous last words) because the reorganization actually makes everything simpler and more understandable than before. It has also had external review and has been tested on the aforementioned 6.2 changes, which explode quite brutally without the fix. Apart from this, things are pretty normal. s390: - PCI fix - PV clock fix x86: - Fix clash between PMU MSRs and other MSRs - Prepare SVM assembly trampoline for 6.2 retbleed mitigation and for... - ... tightening IBRS restore on vmexit, moving it before the first RET or indirect branch - Fix log level for VMSA dump - Block all page faults during kvm_zap_gfn_range() Tools: - kvm_stat: fix incorrect detection of debugfs - kvm_stat: update vmexit definitions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range() KVM: x86/pmu: Limit the maximum number of supported AMD GP counters KVM: x86/pmu: Limit the maximum number of supported Intel GP counters KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet KVM: SVM: Only dump VMSA to klog at KERN_DEBUG level tools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspace tools/kvm_stat: fix incorrect detection of debugfs x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callers KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assembly KVM: SVM: restore host save area from assembly KVM: SVM: move guest vmsave/vmload back to assembly KVM: SVM: do not allocate struct svm_cpu_data dynamically KVM: SVM: remove dead field from struct svm_cpu_data KVM: SVM: remove unused field from struct vcpu_svm KVM: SVM: retrieve VMCB from assembly KVM: SVM: adjust register allocation for __svm_vcpu_run() KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svm KVM: x86: use a separate asm-offsets.c file KVM: s390: pci: Fix allocation size of aift kzdev elements KVM: s390: pv: don't allow userspace to set the clock under PV
| * | | KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range()Sean Christopherson2022-11-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated range to effectively block all page faults while the zap is in-progress. The invalidation helpers take a host virtual address, whereas zapping a GFN obviously provides a guest physical address and with the wrong unit of measurement (frame vs. byte). Alternatively, KVM could walk all memslots to get the associated HVAs, but thanks to SMM, that would require multiple lookups. And practically speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path, e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0 during boot, and APICv inhibits are similarly infrequent operations. Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Reported-by: Chao Peng <chao.p.peng@linux.intel.com> Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221111001841.2412598-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | Merge tag 'kvm-s390-master-6.1-1' of ↵Paolo Bonzini2022-11-09251-1718/+3474
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD A PCI allocation fix and a PV clock fix.
| | * | | KVM: s390: pci: Fix allocation size of aift kzdev elementsRafael Mendonca2022-11-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'kzdev' field of struct 'zpci_aift' is an array of pointers to 'kvm_zdev' structs. Allocate the proper size accordingly. Reported by Coccinelle: WARNING: Use correct pointer type argument for sizeof Fixes: 98b1d33dac5f ("KVM: s390: pci: do initial setup for AEN interpretation") Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Link: https://lore.kernel.org/r/20221026013234.960859-1-rafaelmendsr@gmail.com Message-Id: <20221026013234.960859-1-rafaelmendsr@gmail.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
| | * | | KVM: s390: pv: don't allow userspace to set the clock under PVNico Boehr2022-11-073-10/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running under PV, the guest's TOD clock is under control of the ultravisor and the hypervisor isn't allowed to change it. Hence, don't allow userspace to change the guest's TOD clock by returning -EOPNOTSUPP. When userspace changes the guest's TOD clock, KVM updates its kvm.arch.epoch field and, in addition, the epoch field in all state descriptions of all VCPUs. But, under PV, the ultravisor will ignore the epoch field in the state description and simply overwrite it on next SIE exit with the actual guest epoch. This leads to KVM having an incorrect view of the guest's TOD clock: it has updated its internal kvm.arch.epoch field, but the ultravisor ignores the field in the state description. Whenever a guest is now waiting for a clock comparator, KVM will incorrectly calculate the time when the guest should wake up, possibly causing the guest to sleep for much longer than expected. With this change, kvm_s390_set_tod() will now take the kvm->lock to be able to call kvm_s390_pv_is_protected(). Since kvm_s390_set_tod_clock() also takes kvm->lock, use __kvm_s390_set_tod_clock() instead. The function kvm_s390_set_tod_clock is now unused, hence remove it. Update the documentation to indicate the TOD clock attr calls can now return -EOPNOTSUPP. Fixes: 0f3035047140 ("KVM: s390: protvirt: Do only reset registers that are accessible") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20221011160712.928239-2-nrb@linux.ibm.com Message-Id: <20221011160712.928239-2-nrb@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
| * | | | KVM: x86/pmu: Limit the maximum number of supported AMD GP countersLike Xu2022-11-093-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The AMD PerfMonV2 specification allows for a maximum of 16 GP counters, but currently only 6 pairs of MSRs are accepted by KVM. While AMD64_NUM_COUNTERS_CORE is already equal to 6, increasing without adjusting msrs_to_save_all[] could result in out-of-bounds accesses. Therefore introduce a macro (named KVM_AMD_PMC_MAX_GENERIC) to refer to the number of counters supported by KVM. Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-3-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: x86/pmu: Limit the maximum number of supported Intel GP countersLike Xu2022-11-094-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Intel Architectural IA32_PMCx MSRs addresses range allows for a maximum of 8 GP counters, and KVM cannot address any more. Introduce a local macro (named KVM_INTEL_PMC_MAX_GENERIC) and use it consistently to refer to the number of counters supported by KVM, thus avoiding possible out-of-bound accesses. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yetLike Xu2022-11-091-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SDM lists an architectural MSR IA32_CORE_CAPABILITIES (0xCF) that limits the theoretical maximum value of the Intel GP PMC MSRs allocated at 0xC1 to 14; likewise the Intel April 2022 SDM adds IA32_OVERCLOCKING_STATUS at 0x195 which limits the number of event selection MSRs to 15 (0x186-0x194). Limiting the maximum number of counters to 14 or 18 based on the currently allocated MSRs is clearly fragile, and it seems likely that Intel will even place PMCs 8-15 at a completely different range of MSR indices. So stop at the maximum number of GP PMCs supported today on Intel processors. There are some machines, like Intel P4 with non Architectural PMU, that may indeed have 18 counters, but those counters are in a completely different MSR address range and are not supported by KVM. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Fixes: cf05a67b68b8 ("KVM: x86: omit "impossible" pmu MSRs from MSR list") Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: Only dump VMSA to klog at KERN_DEBUG levelPeter Gonda2022-11-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Explicitly print the VMSA dump at KERN_DEBUG log level, KERN_CONT uses KERNEL_DEFAULT if the previous log line has a newline, i.e. if there's nothing to continuing, and as a result the VMSA gets dumped when it shouldn't. The KERN_CONT documentation says it defaults back to KERNL_DEFAULT if the previous log line has a newline. So switch from KERN_CONT to print_hex_dump_debug(). Jarkko pointed this out in reference to the original patch. See: https://lore.kernel.org/all/YuPMeWX4uuR1Tz3M@kernel.org/ print_hex_dump(KERN_DEBUG, ...) was pointed out there, but print_hex_dump_debug() should similar. Fixes: 6fac42f127b8 ("KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog") Signed-off-by: Peter Gonda <pgonda@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Harald Hoyer <harald@profian.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Message-Id: <20221104142220.469452-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | tools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspaceRong Tao2022-11-091-14/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update EXIT_REASONS from source, including VMX_EXIT_REASONS, SVM_EXIT_REASONS, AARCH64_EXIT_REASONS, USERSPACE_EXIT_REASONS. Signed-off-by: Rong Tao <rongtao@cestc.cn> Message-Id: <tencent_00082C8BFA925A65E11570F417F1CD404505@qq.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | tools/kvm_stat: fix incorrect detection of debugfsMatthias Gerstner2022-11-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first field in /proc/mounts can be influenced by unprivileged users through the widespread `fusermount` setuid-root program. Example: ``` user$ mkdir ~/mydebugfs user$ export _FUSE_COMMFD=0 user$ fusermount ~/mydebugfs -ononempty,fsname=debugfs user$ grep debugfs /proc/mounts debugfs /home/user/mydebugfs fuse rw,nosuid,nodev,relatime,user_id=1000,group_id=100 0 0 ``` If there is no debugfs already mounted in the system then this can be used by unprivileged users to trick kvm_stat into using a user controlled file system location for obtaining KVM statistics. Even though the root user is not allowed to access non-root FUSE mounts for security reasons, the unprivileged user can unmount the FUSE mount before kvm_stat uses the mounted path. If it wins the race, kvm_stat will read from the location where the FUSE mount resided. Note that the files in debugfs are only opened for reading, so the attacker can cause very large data to be read in by kvm_stat, or fake data to be processed, but there should be no viable way to turn this into a privilege escalation. The fix is simply to use the file system type field instead. Whitespace in the mount path is escaped in /proc/mounts thus no further safety measures in the parsing should be necessary to make this correct. Message-Id: <20221103135927.13656-1-matthias.gerstner@suse.de> Signed-off-by: Matthias Gerstner <matthias.gerstner@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callersPaolo Bonzini2022-11-093-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | x86_virt_spec_ctrl only deals with the paravirtualized MSR_IA32_VIRT_SPEC_CTRL now and does not handle MSR_IA32_SPEC_CTRL anymore; remove the corresponding, unused argument. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assemblyPaolo Bonzini2022-11-095-38/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Restoration of the host IA32_SPEC_CTRL value is probably too late with respect to the return thunk training sequence. With respect to the user/kernel boundary, AMD says, "If software chooses to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel exit), software should set STIBP to 1 before executing the return thunk training sequence." I assume the same requirements apply to the guest/host boundary. The return thunk training sequence is in vmenter.S, quite close to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's IA32_SPEC_CTRL value is not restored until much later. To avoid this, move the restoration of host SPEC_CTRL to assembly and, for consistency, move the restoration of the guest SPEC_CTRL as well. This is not particularly difficult, apart from some care to cover both 32- and 64-bit, and to share code between SEV-ES and normal vmentry. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: restore host save area from assemblyPaolo Bonzini2022-11-095-13/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow access to the percpu area via the GS segment base, which is needed in order to access the saved host spec_ctrl value. In linux-next FILL_RETURN_BUFFER also needs to access percpu data. For simplicity, the physical address of the save area is added to struct svm_cpu_data. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reported-by: Nathan Chancellor <nathan@kernel.org> Analyzed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: move guest vmsave/vmload back to assemblyPaolo Bonzini2022-11-093-20/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is error-prone that code after vmexit cannot access percpu data because GSBASE has not been restored yet. It forces MSR_IA32_SPEC_CTRL save/restore to happen very late, after the predictor untraining sequence, and it gets in the way of return stack depth tracking (a retbleed mitigation that is in linux-next as of 2022-11-09). As a first step towards fixing that, move the VMCB VMSAVE/VMLOAD to assembly, essentially undoing commit fb0c4a4fee5a ("KVM: SVM: move VMLOAD/VMSAVE to C code", 2021-03-15). The reason for that commit was that it made it simpler to use a different VMCB for VMLOAD/VMSAVE versus VMRUN; but that is not a big hassle anymore thanks to the kvm-asm-offsets machinery and other related cleanups. The idea on how to number the exception tables is stolen from a prototype patch by Peter Zijlstra. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Link: <https://lore.kernel.org/all/f571e404-e625-bae1-10e9-449b2eb4cbd8@citrix.com/> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: do not allocate struct svm_cpu_data dynamicallyPaolo Bonzini2022-11-093-29/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The svm_data percpu variable is a pointer, but it is allocated via svm_hardware_setup() when KVM is loaded. Unlike hardware_enable() this means that it is never NULL for the whole lifetime of KVM, and static allocation does not waste any memory compared to the status quo. It is also more efficient and more easily handled from assembly code, so do it and don't look back. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: remove dead field from struct svm_cpu_dataPaolo Bonzini2022-11-092-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "cpu" field of struct svm_cpu_data has been write-only since commit 4b656b120249 ("KVM: SVM: force new asid on vcpu migration", 2009-08-05). Remove it. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: remove unused field from struct vcpu_svmPaolo Bonzini2022-11-091-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pointer to svm_cpu_data in struct vcpu_svm looks interesting from the point of view of accessing it after vmexit, when the GSBASE is still containing the guest value. However, despite existing since the very first commit of drivers/kvm/svm.c (commit 6aa8b732ca01, "[PATCH] kvm: userspace interface", 2006-12-10), it was never set to anything. Ignore the opportunity to fix a 16 year old "bug" and delete it; doing things the "harder" way makes it possible to remove more old cruft. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: retrieve VMCB from assemblyPaolo Bonzini2022-11-094-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Continue moving accesses to struct vcpu_svm to vmenter.S. Reducing the number of arguments limits the chance of mistakes due to different registers used for argument passing in 32- and 64-bit ABIs; pushing the VMCB argument and almost immediately popping it into a different register looks pretty weird. 32-bit ABI is not a concern for __svm_sev_es_vcpu_run() which is 64-bit only; however, it will soon need @svm to save/restore SPEC_CTRL so stay consistent with __svm_vcpu_run() and let them share the same prototype. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: adjust register allocation for __svm_vcpu_run()Paolo Bonzini2022-11-091-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 32-bit ABI uses RAX/RCX/RDX as its argument registers, so they are in the way of instructions that hardcode their operands such as RDMSR/WRMSR or VMLOAD/VMRUN/VMSAVE. In preparation for moving vmload/vmsave to __svm_vcpu_run(), keep the pointer to the struct vcpu_svm in %rdi. In particular, it is now possible to load svm->vmcb01.pa in %rax without clobbering the struct vcpu_svm pointer. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svmPaolo Bonzini2022-11-095-20/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since registers are reachable through vcpu_svm, and we will need to access more fields of that struct, pass it instead of the regs[] array. No functional change intended. Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | KVM: x86: use a separate asm-offsets.c filePaolo Bonzini2022-11-095-7/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This already removes an ugly #include "" from asm-offsets.c, but especially it avoids a future error when trying to define asm-offsets for KVM's svm/svm.h header. This would not work for kernel/asm-offsets.c, because svm/svm.h includes kvm_cache_regs.h which is not in the include path when compiling asm-offsets.c. The problem is not there if the .c file is in arch/x86/kvm. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | | | Merge tag 'hyperv-fixes-signed-20221110' of ↵Linus Torvalds2022-11-116-31/+51
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix TSC MSR write for root partition (Anirudh Rayabharam) - Fix definition of vector in pci-hyperv driver (Dexuan Cui) - A few other misc patches * tag 'hyperv-fixes-signed-20221110' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: PCI: hv: Fix the definition of vector in hv_compose_msi_msg() MAINTAINERS: remove sthemmin x86/hyperv: fix invalid writes to MSRs during root partition kexec clocksource/drivers/hyperv: add data structure for reference TSC MSR Drivers: hv: fix repeated words in comments x86/hyperv: Remove BUG_ON() for kmap_local_page()
| * | | | | PCI: hv: Fix the definition of vector in hv_compose_msi_msg()Dexuan Cui2022-11-031-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The local variable 'vector' must be u32 rather than u8: see the struct hv_msi_desc3. 'vector_count' should be u16 rather than u8: see struct hv_msi_desc, hv_msi_desc2 and hv_msi_desc3. Fixes: a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") Signed-off-by: Dexuan Cui <decui@microsoft.com> Cc: Jeffrey Hugo <quic_jhugo@quicinc.com> Cc: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Link: https://lore.kernel.org/r/20221027205256.17678-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
| * | | | | MAINTAINERS: remove sthemminStephen Hemminger2022-11-031-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Leaving Microsoft, the Hyper-V drivers have lots of other maintainers. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20221028153741.25470-1-stephen@networkplumber.org Signed-off-by: Wei Liu <wei.liu@kernel.org>
| * | | | | x86/hyperv: fix invalid writes to MSRs during root partition kexecAnirudh Rayabharam2022-11-031-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hyperv_cleanup resets the hypercall page by setting the MSR to 0. However, the root partition is not allowed to write to the GPA bits of the MSR. Instead, it uses the hypercall page provided by the MSR. Similar is the case with the reference TSC MSR. Clear only the enable bit instead of zeroing the entire MSR to make the code valid for root partition too. Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20221027095729.1676394-3-anrayabh@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
| * | | | | clocksource/drivers/hyperv: add data structure for reference TSC MSRAnirudh Rayabharam2022-11-032-14/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a data structure to represent the reference TSC MSR similar to other MSRs. This simplifies the code for updating the MSR. Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20221027095729.1676394-2-anrayabh@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
| * | | | | Drivers: hv: fix repeated words in commentsJilin Yuan2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delete the redundant word 'of'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20221019125604.52999-1-yuanjilin@cdjrlc.com Signed-off-by: Wei Liu <wei.liu@kernel.org>