Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).
The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64-bit registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.
For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.
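For illustration, the userspace side boils down to one read callback per
MSR plus a registration call; the sketch below mirrors (in simplified
form, error handling and the other MSRs omitted) the
MSR_PKG_ENERGY_STATUS handler added in target/i386/kvm/kvm.c later in
this patch:

    /*
     * Simplified sketch: a guest RDMSR on MSR_PKG_ENERGY_STATUS is routed
     * by the KVM MSR filter to this QEMU callback, which answers with the
     * per-vCPU value maintained by the energy thread described below.
     */
    static bool rdmsr_pkg_energy_status(X86CPU *cpu, uint32_t msr,
                                        uint64_t *val)
    {
        CPUState *cs = CPU(cpu);

        *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
        return true;
    }

    /* Registered once at kvm_arch_init() time when rapl=true:
     *   kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
     *                  rdmsr_pkg_energy_status, NULL);
     */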
To limit the number of system calls per MSR read, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.
Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates each vMSR value from the energy consumed by the whole
physical CPU package the vCPU thread runs on, apportioned according to
the thread's utime and stime values.
All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPU threads running on
the same physical CPU package.
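In simplified C, the apportioning implemented by the worker thread (see
vmsr_get_ratio() and kvm_msr_energy_thread() below) amounts to the
following sketch; names are illustrative, maxticks being the number of
scheduler ticks the whole package can deliver during one sampling period
(ticks/s * cores in the package * period):

    /*
     * Energy attributed to one thread over the sampling period:
     * the package energy delta scaled by the share of the package's
     * scheduling capacity the thread actually used.
     */
    static uint64_t thread_energy_share(uint64_t pkg_energy_delta,
                                        unsigned long long delta_ticks,
                                        unsigned int pkg_maxticks)
    {
        return (uint64_t)((double)pkg_energy_delta *
                          delta_ticks / pkg_maxticks);
    }

The shares of the non-vCPU threads of a package are then summed, divided
by the number of vCPU threads running on that package, and added to each
vCPU's own share before the per-vCPU virtual MSR is updated.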
To overcome the problem that reading the RAPL MSRs requires privileged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. The socket path is specified with the rapl-helper-socket
parameter.
This feature is activated with:
-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock
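Inside the guest, the virtualized counter can then be read the usual way,
for instance through the Linux "msr" driver; the small program below is
purely illustrative (it assumes the guest has the msr module loaded and
enough privilege to read /dev/cpu/0/msr):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_PKG_ENERGY_STATUS 0x611

    int main(void)
    {
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0) {
            perror("open /dev/cpu/0/msr");
            return 1;
        }
        /* The msr driver uses the MSR address as the file offset. */
        if (pread(fd, &val, sizeof(val), MSR_PKG_ENERGY_STATUS) != sizeof(val)) {
            perror("pread");
            close(fd);
            return 1;
        }
        printf("MSR_PKG_ENERGY_STATUS raw counter: %llu\n",
               (unsigned long long)val);
        close(fd);
        return 0;
    }

Reading it twice a few seconds apart should show the counter increasing
as long as vCPUs are busy.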
Current limitations:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.
- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
the moment.
Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
accel/kvm/kvm-all.c | 27 +++
docs/specs/index.rst | 1 +
docs/specs/rapl-msr.rst | 155 ++++++++++++
include/sysemu/kvm.h | 2 +
include/sysemu/kvm_int.h | 32 +++
target/i386/cpu.h | 8 +
target/i386/kvm/kvm-cpu.c | 9 +
target/i386/kvm/kvm.c | 428 ++++++++++++++++++++++++++++++++++
target/i386/kvm/meson.build | 1 +
target/i386/kvm/vmsr_energy.c | 335 ++++++++++++++++++++++++++
target/i386/kvm/vmsr_energy.h | 99 ++++++++
11 files changed, 1097 insertions(+)
create mode 100644 docs/specs/rapl-msr.rst
create mode 100644 target/i386/kvm/vmsr_energy.c
create mode 100644 target/i386/kvm/vmsr_energy.h
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8cecd040ebc..7649f226767a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3613,6 +3613,21 @@ static void kvm_set_device(Object *obj,
s->device = g_strdup(value);
}
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+ KVMState *s = KVM_STATE(obj);
+ s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+ KVMState *s = KVM_STATE(obj);
+ g_free(s->msr_energy.socket_path);
+ s->msr_energy.socket_path = g_strdup(str);
+}
+
static void kvm_accel_instance_init(Object *obj)
{
KVMState *s = KVM_STATE(obj);
@@ -3632,6 +3647,7 @@ static void kvm_accel_instance_init(Object *obj)
s->xen_gnttab_max_frames = 64;
s->xen_evtchn_max_pirq = 256;
s->device = NULL;
+ s->msr_energy.enable = false;
}
/**
@@ -3676,6 +3692,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
object_class_property_set_description(oc, "device",
"Path to the device node to use (default: /dev/kvm)");
+ object_class_property_add_bool(oc, "rapl",
+ NULL,
+ kvm_set_kvm_rapl);
+ object_class_property_set_description(oc, "rapl",
+ "Allow energy related MSRs for RAPL interface in Guest");
+
+ object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+ kvm_set_kvm_rapl_socket_path);
+ object_class_property_set_description(oc, "rapl-helper-socket",
+ "Socket Path for comminucating with the Virtual MSR helper daemon");
+
kvm_arch_accel_class_init(oc);
}
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 1484e3e76077..e738ea7d102f 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
virt-ctlr
vmcoreinfo
vmgenid
+ rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index 000000000000..1202ee89bee0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,155 @@
+================
+RAPL MSR support
+================
+
+The RAPL interface (Running Average Power Limit) advertises the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
+registers that represent the accumulated energy consumption in micro Joules.
+
+Thanks to the MSR filtering patch [#a]_, not all MSRs are handled by KVM any
+more: some of them can now be handled by userspace (QEMU). The mechanism,
+called "MSR filtering", lets QEMU hand KVM a list of MSRs at VM init time so
+that a userspace callback is invoked when the guest accesses them. This patch
+relies only on this mechanism for handling these MSRs between guest and host.
+
+At the moment the following MSRs are involved:
+
+.. code:: C
+
+ #define MSR_RAPL_POWER_UNIT 0x00000606
+ #define MSR_PKG_POWER_LIMIT 0x00000610
+ #define MSR_PKG_ENERGY_STATUS 0x00000611
+ #define MSR_PKG_POWER_INFO 0x00000614
+
+The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
+the RAPL spec: they specify the power limit of the package, provide the range
+of parameters (min power, max power, ...) and give the multiplier that turns
+the raw energy counter into a physical value. Those MSRs are populated once at
+start-up by reading the host CPU MSRs and are returned to the guest 1:1 when
+requested.
+
+MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of energy
+consumed since the last time the register was cleared. Multiplying it by the
+unit provided above gives the energy in micro-joules. This counter is
+monotonically increasing, and it increases more or less quickly depending on
+the consumption of the package. The counter is expected to overflow at some
+point.
+
+Each core belonging to the same package reading MSR_PKG_ENERGY_STATUS (i.e.
+"rdmsr 0x611") will retrieve the same value. The value represents the energy
+for the whole package: whichever core reads it gets the same value, and a core
+that belongs to PKG-0 cannot read the value of PKG-1 and vice-versa.
+
+High level implementation
+-------------------------
+
+In order to update the value of the virtual MSRs, a QEMU thread is created.
+The thread is basically just an infinite loop that does:
+
+1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
+ Userspace and System)
+
+2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages on
+ which the QEMU threads are running.
+
+3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
+ will do what they have to do and so the energy counter will increase.
+
+4. Repeat 2. and 3. and calculate the delta of every metric representing the
+ time spent scheduled for each QEMU thread *and* the energy spent by the
+ packages during the pause.
+
+5. Filter the vcpu threads and the non-vcpu threads.
+
+6. Retrieve the topology of the Virtual Machine. This helps identify which
+ vCPU is running on which virtual package.
+
+7. The total energy spent by the non-vcpu threads is divided by the number
+ of vcpu threads so that each vcpu thread will get an equal part of the
+ energy spent by the QEMU workers.
+
+8. Calculate the ratio of energy spent per vcpu thread.
+
+9. Calculate the energy for each virtual package.
+
+10. The virtual MSRs are updated for each virtual package. Each vCPU that
+ belongs to the same package will return the same value when accessing the
+ MSR.
+
+11. Loop back to 1.
+
+Ratio calculation
+-----------------
+
+In Linux, a process has an execution time associated with it. The scheduler
+divides the time into clock ticks. The number of clock ticks per second can be
+found with the sysconf system call. A typical value of clock ticks per second
+is 100, so a core can run a process for at most 100 ticks per second. If a
+package has 4 cores, at most 400 ticks can be scheduled on all the cores of
+the package during a period of 1 second.
+
+/proc/[pid]/stat [#b]_ is a procfs file that gives the executed time of the
+process with process ID [pid]. It reports the number of ticks the process has
+been scheduled in userspace (utime) and kernel space (stime).
+
+By reading those metrics for a thread, one can calculate the ratio of time the
+package has spent executing the thread.
+
+Example:
+
+A 4-core package can schedule a maximum of 400 ticks per second, i.e. 100
+ticks per second per core. If a thread was scheduled for 100 ticks during one
+second on this package, it means the thread has been scheduled on 1/4 of the
+whole package. The energy spent by the thread on this package during that
+second is therefore estimated as 1/4 of the total energy spent by the package.
+
+Usage
+-----
+
+Currently this feature only works on an Intel CPU that has the RAPL driver
+mounted and available in the sysfs. If not, QEMU fails at start-up.
+
+This feature is activated with
+``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``.
+
+It is important that the socket path is the same as the one
+:program:`qemu-vmsr-helper` is listening on.
+
+qemu-vmsr-helper
+----------------
+
+The qemu-vmsr-helper daemon works very much like qemu-pr-helper. Instead of
+handling persistent reservations, qemu-vmsr-helper exists to work around
+CVE-2020-8694, which removed unprivileged user access to the RAPL MSR
+attributes.
+
+A socket communication is established between the QEMU processes that have
+the RAPL MSR support activated and the qemu-vmsr-helper. A systemd service
+and socket activation unit are provided in
+contrib/systemd/qemu-vmsr-helper.(service/socket).
+
+The systemd socket uses mode 0600, like contrib/systemd/qemu-pr-helper.socket.
+The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
+changed (e.g. 0660 and root:kvm on a Debian system). Libvirt could also start
+a separate helper if needed. All in all, the policy is left to the user.
+
+See the qemu-pr-helper documentation or manpage for further details.
+
+Current Limitations
+-------------------
+
+- Works only on Intel host CPUs because AMD CPUs use different MSR
+ addresses.
+
+- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
+ moment.
+
+References
+----------
+
+.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
+.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index fad9a7e8ff30..37f68c496807 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -544,4 +544,6 @@ uint32_t kvm_dirty_ring_size(void);
* reported for the VM.
*/
bool kvm_hwpoisoned_mem(void);
+
+bool kvm_is_rapl_feat_enable(CPUState *cs);
#endif
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 882e37e12c5b..8dbeda473c6c 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -14,6 +14,9 @@
#include "qemu/accel.h"
#include "qemu/queue.h"
#include "sysemu/kvm.h"
+#include "hw/boards.h"
+#include "hw/i386/topology.h"
+#include "io/channel-socket.h"
typedef struct KVMSlot
{
@@ -48,6 +51,34 @@ typedef struct KVMMemoryListener {
#define KVM_MSI_HASHTAB_SIZE 256
+typedef struct KVMHostTopoInfo {
+ /* Number of package on the Host */
+ unsigned int maxpkgs;
+ /* Number of cpus on the Host */
+ unsigned int maxcpus;
+ /* Number of cpus on each different package */
+ unsigned int *pkg_cpu_count;
+ /* Each package can have different maxticks */
+ unsigned int *maxticks;
+} KVMHostTopoInfo;
+
+struct KVMMsrEnergy {
+ pid_t pid;
+ bool enable;
+ char *socket_path;
+ QIOChannelSocket *sioc;
+ QemuThread msr_thr;
+ unsigned int vcpus;
+ unsigned int vsockets;
+ X86CPUTopoInfo topo_info;
+ KVMHostTopoInfo host_topo;
+ const CPUArchIdList *cpu_list;
+ uint64_t *msr_value;
+ uint64_t msr_unit;
+ uint64_t msr_limit;
+ uint64_t msr_info;
+};
+
enum KVMDirtyRingReaperState {
KVM_DIRTY_RING_REAPER_NONE = 0,
/* The reaper is sleeping */
@@ -114,6 +145,7 @@ struct KVMState
bool kvm_dirty_ring_with_bitmap;
uint64_t kvm_eager_split_size; /* Eager Page Splitting chunk size */
struct KVMDirtyRingReaper reaper;
+ struct KVMMsrEnergy msr_energy;
NotifyVmexitOption notify_vmexit;
uint32_t notify_window;
uint32_t xen_version;
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..135bda8a5db9 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -400,6 +400,10 @@ typedef enum X86Seg {
#define MSR_IA32_TSX_CTRL 0x122
#define MSR_IA32_TSCDEADLINE 0x6e0
#define MSR_IA32_PKRS 0x6e1
+#define MSR_RAPL_POWER_UNIT 0x00000606
+#define MSR_PKG_POWER_LIMIT 0x00000610
+#define MSR_PKG_ENERGY_STATUS 0x00000611
+#define MSR_PKG_POWER_INFO 0x00000614
#define MSR_ARCH_LBR_CTL 0x000014ce
#define MSR_ARCH_LBR_DEPTH 0x000014cf
#define MSR_ARCH_LBR_FROM_0 0x00001500
@@ -1795,6 +1799,10 @@ typedef struct CPUArchState {
uintptr_t retaddr;
+ /* RAPL MSR */
+ uint64_t msr_rapl_power_unit;
+ uint64_t msr_pkg_energy_status;
+
/* Fields up to this point are cleared by a CPU reset */
struct {} end_reset_fields;
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..eafb625839b8 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -50,6 +50,15 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
MSR_IA32_UCODE_REV);
}
}
+
+ if (kvm_is_rapl_feat_enable(cs)) {
+ if (!IS_INTEL_CPU(env)) {
+ error_setg(errp, "RAPL feature can only be"
+ " enabled with Intel CPU models");
+ return false;
+ }
+ }
+
return host_cpu_realizefn(cs, errp);
}
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e68cbe929302..3de69caa229e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -16,9 +16,12 @@
#include "qapi/qapi-events-run-state.h"
#include "qapi/error.h"
#include "qapi/visitor.h"
+#include <math.h>
#include <sys/ioctl.h>
#include <sys/utsname.h>
#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/time.h>
#include <linux/kvm.h>
#include "standard-headers/asm-x86/kvm_para.h"
@@ -26,6 +29,7 @@
#include "cpu.h"
#include "host-cpu.h"
+#include "vmsr_energy.h"
#include "sysemu/sysemu.h"
#include "sysemu/hw_accel.h"
#include "sysemu/kvm_int.h"
@@ -2479,6 +2483,54 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
return true;
}
+static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu, uint32_t msr,
+ uint64_t *val)
+{
+
+ CPUState *cs = CPU(cpu);
+
+ *val = cs->kvm_state->msr_energy.msr_unit;
+
+ return true;
+}
+
+static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu, uint32_t msr,
+ uint64_t *val)
+{
+
+ CPUState *cs = CPU(cpu);
+
+ *val = cs->kvm_state->msr_energy.msr_limit;
+
+ return true;
+}
+
+static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu, uint32_t msr,
+ uint64_t *val)
+{
+
+ CPUState *cs = CPU(cpu);
+
+ *val = cs->kvm_state->msr_energy.msr_info;
+
+ return true;
+}
+
+static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu, uint32_t msr,
+ uint64_t *val)
+{
+
+ CPUState *cs = CPU(cpu);
+ *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
+
+ return true;
+}
+
+bool kvm_is_rapl_feat_enable(CPUState *cs)
+{
+ return cs->kvm_state->msr_energy.enable;
+}
+
static Notifier smram_machine_done;
static KVMMemoryListener smram_listener;
static AddressSpace smram_address_space;
@@ -2513,6 +2565,339 @@ static void register_smram_listener(Notifier *n, void *unused)
&smram_address_space, 1, "kvm-smram");
}
+static void *kvm_msr_energy_thread(void *data)
+{
+ KVMState *s = data;
+ struct KVMMsrEnergy *vmsr = &s->msr_energy;
+
+ g_autofree package_energy_stat *pkg_stat = NULL;
+ g_autofree thread_stat *thd_stat = NULL;
+ g_autofree pid_t *thread_ids = NULL;
+ CPUState *cpu = NULL;
+ g_autofree unsigned int *vpkgs_energy_stat = NULL;
+ unsigned int num_threads = 0;
+ unsigned int tmp_num_threads = 0;
+
+ X86CPUTopoIDs topo_ids;
+
+ rcu_register_thread();
+
+ /* Allocate memory for each package energy status */
+ pkg_stat = g_new0(package_energy_stat, vmsr->host_topo.maxpkgs);
+
+ /* Allocate memory for thread stats */
+ thd_stat = g_new0(thread_stat, 1);
+
+ /* Allocate memory for holding virtual package energy counter */
+ vpkgs_energy_stat = g_new0(unsigned int, vmsr->vsockets);
+
+ /* Populate the max tick of each packages */
+ for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+ /*
+ * Max numbers of ticks per package
+ * Time in second * Number of ticks/second * Number of cores/package
+ * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
+ */
+ vmsr->host_topo.maxticks[i] = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
+ * sysconf(_SC_CLK_TCK)
+ * vmsr->host_topo.pkg_cpu_count[i];
+ }
+
+ while (true) {
+ /* Get all qemu threads id */
+ thread_ids = vmsr_get_thread_ids(vmsr->pid, &num_threads);
+
+ if (thread_ids == NULL) {
+ goto clean;
+ }
+
+ if (tmp_num_threads < num_threads) {
+ thd_stat = g_renew(thread_stat, thd_stat, num_threads);
+ }
+
+ tmp_num_threads = num_threads;
+
+ /* Populate all the thread stats */
+ for (int i = 0; i < num_threads; i++) {
+ thd_stat[i].utime = g_new0(unsigned long long, 2);
+ thd_stat[i].stime = g_new0(unsigned long long, 2);
+ thd_stat[i].thread_id = thread_ids[i];
+ vmsr_read_thread_stat(vmsr->pid,
+ thd_stat[i].thread_id,
+ thd_stat[i].utime,
+ thd_stat[i].stime,
+ &thd_stat[i].cpu_id);
+ thd_stat[i].pkg_id =
+ vmsr_get_physical_package_id(thd_stat[i].cpu_id);
+ }
+
+ /* Retrieve all packages power plane energy counter */
+ for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+ for (int j = 0; j < num_threads; j++) {
+ /*
+ * Use the first thread we found that ran on the CPU
+ * of the package to read the packages energy counter
+ */
+ if (thd_stat[j].pkg_id == i) {
+ pkg_stat[i].e_start =
+ vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+ thd_stat[j].cpu_id,
+ thd_stat[j].thread_id,
+ s->msr_energy.sioc);
+ break;
+ }
+ }
+ }
+
+ /* Sleep a short period while the other threads are working */
+ usleep(MSR_ENERGY_THREAD_SLEEP_US);
+
+ /*
+ * Retrieve all packages power plane energy counter
+ * Calculate the delta of all packages
+ */
+ for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+ for (int j = 0; j < num_threads; j++) {
+ /*
+ * Use the first thread we found that ran on the CPU
+ * of the package to read the packages energy counter
+ */
+ if (thd_stat[j].pkg_id == i) {
+ pkg_stat[i].e_end =
+ vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+ thd_stat[j].cpu_id,
+ thd_stat[j].thread_id,
+ s->msr_energy.sioc);
+ /*
+ * Prevent the case where we have migrated the VM
+ * during the sleep period, or any other case where
+ * the energy counter might be lower after the
+ * sleep period.
+ */
+ if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
+ pkg_stat[i].e_delta =
+ pkg_stat[i].e_end - pkg_stat[i].e_start;
+ } else {
+ pkg_stat[i].e_delta = 0;
+ }
+ break;
+ }
+ }
+ }
+
+ /* Delta of ticks spent by each thread between the samples */
+ for (int i = 0; i < num_threads; i++) {
+ vmsr_read_thread_stat(vmsr->pid,
+ thd_stat[i].thread_id,
+ thd_stat[i].utime,
+ thd_stat[i].stime,
+ &thd_stat[i].cpu_id);
+
+ if (vmsr->pid < 0) {
+ /*
+ * We don't count dead threads,
+ * i.e. threads that existed before the sleep
+ * and no longer exist afterwards
+ */
+ thd_stat[i].delta_ticks = 0;
+ } else {
+ vmsr_delta_ticks(thd_stat, i);
+ }
+ }
+
+ /*
+ * Identify the vcpu threads
+ * Calculate the number of vcpu per package
+ */
+ CPU_FOREACH(cpu) {
+ for (int i = 0; i < num_threads; i++) {
+ if (cpu->thread_id == thd_stat[i].thread_id) {
+ thd_stat[i].is_vcpu = true;
+ thd_stat[i].vcpu_id = cpu->cpu_index;
+ pkg_stat[thd_stat[i].pkg_id].nb_vcpu++;
+ thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
+ break;
+ }
+ }
+ }
+
+ /* Retrieve the virtual package number of each vCPU */
+ for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
+ for (int j = 0; j < num_threads; j++) {
+ if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
+ && (thd_stat[j].is_vcpu == true)) {
+ x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
+ &vmsr->topo_info, &topo_ids);
+ thd_stat[j].vpkg_id = topo_ids.pkg_id;
+ }
+ }
+ }
+
+ /* Calculate the total energy of all non-vCPU threads */
+ for (int i = 0; i < num_threads; i++) {
+ double temp;
+ if ((thd_stat[i].is_vcpu != true) &&
+ (thd_stat[i].delta_ticks > 0)) {
+ temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
+ thd_stat[i].delta_ticks,
+ vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
+ pkg_stat[thd_stat[i].pkg_id].e_ratio
+ += (uint64_t)lround(temp);
+ }
+ }
+
+ /* Calculate the ratio per non-vCPU thread of each package */
+ for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+ if (pkg_stat[i].nb_vcpu > 0) {
+ pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
+ }
+ }
+
+ /*
+ * Calculate the energy for each Package:
+ * Energy Package = sum of each vCPU energy that belongs to the package
+ */
+ for (int i = 0; i < num_threads; i++) {
+ double temp;
+
+ if ((thd_stat[i].is_vcpu == true) && \
+ (thd_stat[i].delta_ticks > 0)) {
+ temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
+ thd_stat[i].delta_ticks,
+ vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
+ vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
+ (uint64_t)lround(temp);
+ vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
+ pkg_stat[thd_stat[i].pkg_id].e_ratio;
+ }
+ }
+
+ /*
+ * Finally populate the vmsr register of each vCPU with the total
+ * package value to emulate the real hardware, where each CPU returns the
+ * value of the package it belongs to.
+ */
+ for (int i = 0; i < num_threads; i++) {
+ if ((thd_stat[i].is_vcpu == true) && \
+ (thd_stat[i].delta_ticks > 0)) {
+ vmsr->msr_value[thd_stat[i].vcpu_id] = \
+ vpkgs_energy_stat[thd_stat[i].vpkg_id];
+ }
+ }
+
+ /* Freeing memory before zeroing the pointer */
+ for (int i = 0; i < num_threads; i++) {
+ g_free(thd_stat[i].utime);
+ g_free(thd_stat[i].stime);
+ }
+ /* Zeroing the reused memory with g_renew */
+ memset(thd_stat, 0, num_threads * sizeof(thread_stat));
+ memset(thread_ids, 0, sizeof(pid_t));
+ }
+
+clean:
+ rcu_unregister_thread();
+ return NULL;
+}
+
+static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
+{
+ struct KVMMsrEnergy *r = &s->msr_energy;
+ int ret = 0;
+
+ /*
+ * Sanity check
+ * 1. Host cpu must be Intel cpu
+ * 2. RAPL must be enabled on the Host
+ */
+ if (!is_host_cpu_intel()) {
+ error_report("The RAPL feature can only be enabled on hosts "
+ "with Intel CPU models");
+ ret = 1;
+ goto out;
+ }
+
+ if (!is_rapl_enabled()) {
+ ret = 1;
+ goto out;
+ }
+
+ /* Retrieve the virtual topology */
+ vmsr_init_topo_info(&r->topo_info, ms);
+
+ /* Retrieve the number of vcpu */
+ r->vcpus = ms->smp.cpus;
+
+ /* Retrieve the number of virtual sockets */
+ r->vsockets = ms->smp.sockets;
+
+ /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */
+ r->msr_value = g_new0(uint64_t, r->vcpus);
+
+ /* Retrieve the CPUArchIDlist */
+ r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
+
+ /* Max number of cpus on the Host */
+ r->host_topo.maxcpus = vmsr_get_maxcpus();
+ if (r->host_topo.maxcpus == 0) {
+ error_report("host max cpus = 0");
+ ret = 1;
+ goto out;
+ }
+
+ /* Max number of packages on the host */
+ r->host_topo.maxpkgs = vmsr_get_max_physical_package(r->host_topo.maxcpus);
+ if (r->host_topo.maxpkgs == 0) {
+ error_report("host max pkgs = 0");
+ ret = 1;
+ goto out;
+ }
+
+ /* Allocate memory for each package on the host */
+ r->host_topo.pkg_cpu_count = g_new0(unsigned int, r->host_topo.maxpkgs);
+ r->host_topo.maxticks = g_new0(unsigned int, r->host_topo.maxpkgs);
+
+ vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count,
+ r->host_topo.maxpkgs);
+ for (int i = 0; i < r->host_topo.maxpkgs; i++) {
+ if (r->host_topo.pkg_cpu_count[i] == 0) {
+ error_report("cpu per packages = 0 on package_%d", i);
+ ret = 1;
+ goto out;
+ }
+ }
+
+ /* Get QEMU PID */
+ r->pid = getpid();
+
+ /* Compute the socket path if necessary */
+ if (s->msr_energy.socket_path == NULL) {
+ s->msr_energy.socket_path = vmsr_compute_default_paths();
+ }
+
+ /* Open socket with vmsr helper */
+ s->msr_energy.sioc = vmsr_open_socket(s->msr_energy.socket_path);
+
+ /* Those MSR values should not change */
+ r->msr_unit = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid,
+ s->msr_energy.sioc);
+ r->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid,
+ s->msr_energy.sioc);
+ r->msr_info = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid,
+ s->msr_energy.sioc);
+ if (r->msr_unit == 0 || r->msr_limit == 0 || r->msr_info == 0) {
+ error_report("can't read any virtual msr");
+ ret = 1;
+ goto out;
+ }
+
+ qemu_thread_create(&r->msr_thr, "kvm-msr",
+ kvm_msr_energy_thread,
+ s, QEMU_THREAD_JOINABLE);
+out:
+ return ret;
+}
+
int kvm_arch_get_default_type(MachineState *ms)
{
return 0;
@@ -2715,6 +3100,49 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
strerror(-ret));
exit(1);
}
+
+ if (s->msr_energy.enable == true) {
+ r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
+ kvm_rdmsr_rapl_power_unit, NULL);
+ if (!r) {
+ error_report("Could not install MSR_RAPL_POWER_UNIT \
+ handler: %s",
+ strerror(-ret));
+ exit(1);
+ }
+
+ r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
+ kvm_rdmsr_pkg_power_limit, NULL);
+ if (!r) {
+ error_report("Could not install MSR_PKG_POWER_LIMIT \
+ handler: %s",
+ strerror(-ret));
+ exit(1);
+ }
+
+ r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
+ kvm_rdmsr_pkg_power_info, NULL);
+ if (!r) {
+ error_report("Could not install MSR_PKG_POWER_INFO \
+ handler: %s",
+ strerror(-ret));
+ exit(1);
+ }
+ r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
+ kvm_rdmsr_pkg_energy_status, NULL);
+ if (!r) {
+ error_report("Could not install MSR_PKG_ENERGY_STATUS \
+ handler: %s",
+ strerror(-ret));
+ exit(1);
+ }
+ r = kvm_msr_energy_thread_init(s, ms);
+ if (r) {
+ error_report("kvm : error RAPL feature requirement not meet");
+ exit(1);
+ }
+
+ }
}
return 0;
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 84d9143e6029..16010638df69 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
i386_kvm_ss.add(files(
'kvm.c',
'kvm-cpu.c',
+ 'vmsr_energy.c',
))
i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
new file mode 100644
index 000000000000..934d14c4305f
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.c
@@ -0,0 +1,335 @@
+/*
+ * QEMU KVM support -- x86 virtual RAPL msr
+ *
+ * Copyright 2024 Red Hat, Inc.
+ *
+ * Author:
+ * Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "vmsr_energy.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "hw/boards.h"
+#include "cpu.h"
+#include "host-cpu.h"
+
+char *vmsr_compute_default_paths(void)
+{
+ g_autofree char *state = qemu_get_local_state_dir();
+
+ return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+}
+
+bool is_host_cpu_intel(void)
+{
+ int family, model, stepping;
+ char vendor[CPUID_VENDOR_SZ + 1];
+
+ host_cpu_vendor_fms(vendor, &family, &model, &stepping);
+
+ return strcmp(vendor, CPUID_VENDOR_INTEL) == 0;
+}
+
+int is_rapl_enabled(void)
+{
+ const char *path = "/sys/class/powercap/intel-rapl/enabled";
+ FILE *file = fopen(path, "r");
+ int value = 0;
+
+ if (file != NULL) {
+ if (fscanf(file, "%d", &value) != 1) {
+ error_report("INTEL RAPL not enabled");
+ }
+ fclose(file);
+ } else {
+ error_report("Error opening %s", path);
+ }
+
+ return value;
+}
+
+QIOChannelSocket *vmsr_open_socket(const char *path)
+{
+ g_autofree char *socket_path = NULL;
+
+ socket_path = g_strdup(path);
+
+ SocketAddress saddr = {
+ .type = SOCKET_ADDRESS_TYPE_UNIX,
+ .u.q_unix.path = socket_path
+ };
+
+ QIOChannelSocket *sioc = qio_channel_socket_new();
+ Error *local_err = NULL;
+
+ qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
+ qio_channel_socket_connect_sync(sioc,
+ &saddr,
+ &local_err);
+ if (local_err) {
+ /* Close socket. */
+ qio_channel_close(QIO_CHANNEL(sioc), NULL);
+ object_unref(OBJECT(sioc));
+ }
+
+ return sioc;
+}
+
+uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid,
+ QIOChannelSocket *sioc)
+{
+ uint64_t data = 0;
+ int r = 0;
+ Error *local_err = NULL;
+ uint32_t buffer[3];
+ /*
+ * Send the required arguments:
+ * 1. RAPL MSR register to read
+ * 2. On which CPU ID
+ * 3. From which vCPU (Thread ID)
+ */
+ buffer[0] = reg;
+ buffer[1] = cpu_id;
+ buffer[2] = tid;
+
+ r = qio_channel_write_all(QIO_CHANNEL(sioc),
+ (char *)buffer, sizeof(buffer),
+ &local_err);
+ if (r < 0) {
+ goto out_close;
+ }
+
+ r = qio_channel_read_all(QIO_CHANNEL(sioc),
+ (char *)&data, sizeof(data),
+ &local_err);
+ if (r < 0) {
+ data = 0;
+ goto out_close;
+ }
+
+out_close:
+ return data;
+}
+
+/* Retrieve the max number of physical package */
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
+{
+ const char *dir = "/sys/devices/system/cpu/";
+ const char *topo_path = "topology/physical_package_id";
+ g_autofree int *uniquePackages = g_new0(int, max_cpus);
+ unsigned int packageCount = 0;
+ FILE *file = NULL;
+
+ for (int i = 0; i < max_cpus; i++) {
+ g_autofree char *filePath = NULL;
+ g_autofree char *cpuid = g_strdup_printf("cpu%d", i);
+
+ filePath = g_build_filename(dir, cpuid, topo_path, NULL);
+
+ file = fopen(filePath, "r");
+
+ if (file == NULL) {
+ error_report("Error opening physical_package_id file");
+ return 0;
+ }
+
+ char packageId[10];
+ if (fgets(packageId, sizeof(packageId), file) == NULL) {
+ packageCount = 0;
+ }
+
+ fclose(file);
+
+ int currentPackageId = atoi(packageId);
+
+ bool isUnique = true;
+ for (int j = 0; j < packageCount; j++) {
+ if (uniquePackages[j] == currentPackageId) {
+ isUnique = false;
+ break;
+ }
+ }
+
+ if (isUnique) {
+ uniquePackages[packageCount] = currentPackageId;
+ packageCount++;
+
+ if (packageCount >= max_cpus) {
+ break;
+ }
+ }
+ }
+
+ return (packageCount == 0) ? 1 : packageCount;
+}
+
+/* Retrieve the max number of physical cpu on the host */
+unsigned int vmsr_get_maxcpus(void)
+{
+ GDir *dir;
+ const gchar *entry_name;
+ unsigned int cpu_count = 0;
+ const char *path = "/sys/devices/system/cpu/";
+
+ dir = g_dir_open(path, 0, NULL);
+ if (dir == NULL) {
+ error_report("Unable to open cpu directory");
+ return -1;
+ }
+
+ while ((entry_name = g_dir_read_name(dir)) != NULL) {
+ if (g_ascii_strncasecmp(entry_name, "cpu", 3) == 0 &&
+ isdigit(entry_name[3])) {
+ cpu_count++;
+ }
+ }
+
+ g_dir_close(dir);
+
+ return cpu_count;
+}
+
+/* Count the number of physical cpu on each packages */
+unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
+ unsigned int max_pkgs)
+{
+ g_autofree char *file_contents = NULL;
+ g_autofree char *path = NULL;
+ gsize length;
+
+ /* Iterate over cpus and count cpus in each package */
+ for (int cpu_id = 0; ; cpu_id++) {
+ path = g_build_filename(
+ g_strdup_printf("/sys/devices/system/cpu/cpu%d/"
+ "topology/physical_package_id", cpu_id), NULL);
+
+ if (!g_file_get_contents(path, &file_contents, &length, NULL)) {
+ break; /* No more cpus */
+ }
+
+ /* Get the physical package ID for this CPU */
+ int package_id = atoi(file_contents);
+
+ /* Check if the package ID is within the known number of packages */
+ if (package_id >= 0 && package_id < max_pkgs) {
+ /* If yes, count the cpu for this package*/
+ package_count[package_id]++;
+ }
+ }
+
+ return 0;
+}
+
+/* Get the physical package id from a given cpu id */
+int vmsr_get_physical_package_id(int cpu_id)
+{
+ g_autofree char *file_contents = NULL;
+ g_autofree char *file_path = NULL;
+ int package_id = -1;
+ gsize length;
+
+ file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d/"
+ "topology/physical_package_id", cpu_id);
+
+ if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
+ goto out;
+ }
+
+ package_id = atoi(file_contents);
+
+out:
+ return package_id;
+}
+
+/* Read the scheduled time for a given thread of a given pid */
+void vmsr_read_thread_stat(pid_t pid,
+ unsigned int thread_id,
+ unsigned long long *utime,
+ unsigned long long *stime,
+ unsigned int *cpu_id)
+{
+ g_autofree char *path = NULL;
+
+ path = g_build_filename(g_strdup_printf("/proc/%u/task/%d/stat", pid, \
+ thread_id), NULL);
+
+ FILE *file = fopen(path, "r");
+ if (file == NULL) {
+ pid = -1;
+ return;
+ }
+
+ if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
+ " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
+ " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
+ utime, stime, cpu_id) != 3)
+ {
+ pid = -1;
+ return;
+ }
+
+ fclose(file);
+ return;
+}
+
+/* Read QEMU stat task folder to retrieve all QEMU threads ID */
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
+{
+ g_autofree char *path = g_build_filename("/proc",
+ g_strdup_printf("%d/task", pid), NULL);
+
+ DIR *dir = opendir(path);
+ if (dir == NULL) {
+ error_report("Error opening /proc/qemu/task");
+ return NULL;
+ }
+
+ pid_t *thread_ids = NULL;
+ unsigned int thread_count = 0;
+
+ struct dirent *ent = NULL;
+ while ((ent = readdir(dir)) != NULL) {
+ if (ent->d_name[0] == '.') {
+ continue;
+ }
+ pid_t tid = atoi(ent->d_name);
+ if (pid != tid) {
+ thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
+ thread_ids[thread_count] = tid;
+ thread_count++;
+ }
+ }
+
+ closedir(dir);
+
+ *num_threads = thread_count;
+ return thread_ids;
+}
+
+void vmsr_delta_ticks(thread_stat *thd_stat, int i)
+{
+ thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
+ - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
+}
+
+double vmsr_get_ratio(uint64_t e_delta,
+ unsigned long long delta_ticks,
+ unsigned int maxticks)
+{
+ return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
+}
+
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info,
+ const MachineState *ms)
+{
+ topo_info->dies_per_pkg = ms->smp.dies;
+ topo_info->cores_per_die = ms->smp.cores;
+ topo_info->threads_per_core = ms->smp.threads;
+}
diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
new file mode 100644
index 000000000000..a04114179346
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.h
@@ -0,0 +1,99 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2024 Red Hat, Inc.
+ *
+ * Author:
+ * Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VMSR_ENERGY_H
+#define VMSR_ENERGY_H
+
+#include <stdint.h>
+#include "qemu/osdep.h"
+#include "io/channel-socket.h"
+#include "hw/i386/topology.h"
+
+/*
+ * Define the interval time in micro seconds between 2 samples of
+ * energy related MSRs
+ */
+#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
+
+/*
+ * Thread statistic
+ * @ thread_id: TID (thread ID)
+ * @ is_vcpu: true if TID is vCPU thread
+ * @ cpu_id: CPU number last executed on
+ * @ pkg_id: package number of the CPU
+ * @ vcpu_id: vCPU ID
+ * @ vpkg_id: virtual package number
+ * @ acpi_id: APIC id of the vCPU
+ * @ utime: amount of clock ticks the thread
+ * has been scheduled in User mode
+ * @ stime: amount of clock ticks the thread
+ * has been scheduled in System mode
+ * @ delta_ticks: delta of utime+stime between
+ * the two samples (before/after sleep)
+ */
+struct thread_stat {
+ unsigned int thread_id;
+ bool is_vcpu;
+ unsigned int cpu_id;
+ unsigned int pkg_id;
+ unsigned int vpkg_id;
+ unsigned int vcpu_id;
+ unsigned long acpi_id;
+ unsigned long long *utime;
+ unsigned long long *stime;
+ unsigned long long delta_ticks;
+};
+
+/*
+ * Package statistic
+ * @ e_start: package energy counter before the sleep
+ * @ e_end: package energy counter after the sleep
+ * @ e_delta: delta of package energy counter
+ * @ e_ratio: store the energy ratio of non-vCPU thread
+ * @ nb_vcpu: number of vCPU running on this package
+ */
+struct package_energy_stat {
+ uint64_t e_start;
+ uint64_t e_end;
+ uint64_t e_delta;
+ uint64_t e_ratio;
+ unsigned int nb_vcpu;
+};
+
+typedef struct thread_stat thread_stat;
+typedef struct package_energy_stat package_energy_stat;
+
+char *vmsr_compute_default_paths(void);
+void vmsr_read_thread_stat(pid_t pid,
+ unsigned int thread_id,
+ unsigned long long *utime,
+ unsigned long long *stime,
+ unsigned int *cpu_id);
+
+QIOChannelSocket *vmsr_open_socket(const char *path);
+uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id,
+ uint32_t tid, QIOChannelSocket *sioc);
+void vmsr_delta_ticks(thread_stat *thd_stat, int i);
+unsigned int vmsr_get_maxcpus(void);
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
+unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
+ unsigned int max_pkgs);
+int vmsr_get_physical_package_id(int cpu_id);
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads);
+double vmsr_get_ratio(uint64_t e_delta,
+ unsigned long long delta_ticks,
+ unsigned int maxticks);
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms);
+bool is_host_cpu_intel(void);
+int is_rapl_enabled(void);
+#endif /* VMSR_ENERGY_H */
--
2.44.0
On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote:
[...]
> +    if (kvm_is_rapl_feat_enable(cs)) {
> +        if (!IS_INTEL_CPU(env)) {
> +            error_setg(errp, "RAPL feature can only be\
> +                        enabled with Intel CPU models");
> +            return false;
> +        }
> +    }

I see a crash in kvm_is_rapl_feat_enable() from this caller, when I run
with this kind of command line:

$ qemu-system-x86_64 \
    -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \
    -initrd tiny-initrd.img -m 2000 -serial stdio -nodefaults \
    -display none -accel kvm -append "console=ttyS0 quiet"

#0  0x0000555555bc14b7 in kvm_is_rapl_feat_enable (cs=cs@entry=0x555557b83470) at ../target/i386/kvm/kvm.c:2531
#1  0x0000555555bc7534 in kvm_cpu_realizefn (cs=0x555557b83470, errp=0x7fffffffd2a0) at ../target/i386/kvm/kvm-cpu.c:54
#2  0x0000555555d2432a in accel_cpu_common_realize (cpu=0x555557b83470, errp=0x7fffffffd2a0) at ../accel/accel-target.c:130
#3  0x0000555555cdd955 in cpu_exec_realizefn (cpu=cpu@entry=0x555557b83470, errp=errp@entry=0x7fffffffd2a0) at ../cpu-target.c:137
#4  0x0000555555c14b89 in x86_cpu_realizefn (dev=0x555557b83470, errp=0x7fffffffd310) at ../target/i386/cpu.c:7320
#5  0x0000555555d58f4b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7fffffffd390) at ../hw/core/qdev.c:510
#6  0x0000555555d5d78d in property_set_bool (obj=0x555557b83470, v=<optimized out>, name=<optimized out>, opaque=0x5555578558e0, errp=0x7fffffffd390) at ../qom/object.c:2358
#7  0x0000555555d60b0b in object_property_set (obj=obj@entry=0x555557b83470, name=name@entry=0x55555607c799 "realized", v=v@entry=0x555557b8ccb0, errp=0x7fffffffd390, errp@entry=0x555556e210d8 <error_fatal>) at ../qom/object.c:1472
#8  0x0000555555d6444f in object_property_set_qobject (obj=obj@entry=0x555557b83470, name=name@entry=0x55555607c799 "realized", value=value@entry=0x555557854800, errp=errp@entry=0x555556e210d8 <error_fatal>) at ../qom/qom-qobject.c:28
#9  0x0000555555d61174 in object_property_set_bool (obj=0x555557b83470, name=name@entry=0x55555607c799 "realized", value=value@entry=true, errp=errp@entry=0x555556e210d8 <error_fatal>) at ../qom/object.c:1541
#10 0x0000555555d59a3c in qdev_realize (dev=<optimized out>, bus=bus@entry=0x0, errp=errp@entry=0x555556e210d8 <error_fatal>) at ../hw/core/qdev.c:292
#11 0x0000555555bd51e0 in x86_cpu_new (x86ms=<optimized out>, apic_id=0, errp=0x555556e210d8 <error_fatal>) at ../hw/i386/x86.c:105
#12 0x0000555555bd52ce in x86_cpus_init (x86ms=x86ms@entry=0x555557aaed30, default_cpu_version=<optimized out>) at ../hw/i386/x86.c:156
#13 0x0000555555bdc1a7 in pc_init1 (machine=0x555557aaed30, pci_type=0x55555604aa61 "i440FX") at ../hw/i386/pc_piix.c:185
#14 0x0000555555947a11 in machine_run_board_init (machine=0x555557aaed30, mem_path=<optimized out>, errp=<optimized out>, errp@entry=0x555556e210d8 <error_fatal>) at ../hw/core/machine.c:1547
#15 0x0000555555b020ed in qemu_init_board () at ../system/vl.c:2613
#16 qmp_x_exit_preconfig (errp=0x555556e210d8 <error_fatal>) at ../system/vl.c:2705
#17 0x0000555555b0611e in qemu_init (argc=<optimized out>, argv=<optimized out>) at ../system/vl.c:3739
#18 0x0000555555897ca9 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:47

The problem is that 'cs->kvm_state' is NULL here.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Hi Daniel,

Daniel P. Berrangé, Apr 18, 2024 at 18:42:
> I see a crash in kvm_is_rapl_feat_enable() from this caller, when I run
> with this kind of command line:
[...]
> The problem is that 'cs->kvm_state' is NULL here

After some investigation it seems that kvm_state is not yet committed at
this point. A shame, because GDB showed me that we have already passed
kvm_accel_instance_init() in accel/kvm/kvm-all.c, which sets the
"msr_energy.enable" value in kvm_state...

So should I dig more to still do the sanity check in kvm_cpu_realizefn(),
or should I already move the RAPL feature from -kvm to -cpu, as suggested
by Zhao from Intel, and then access it from the CPUState? The latter would
require more work, but if it lets me skip a new iteration (since I would
need to do it anyway), that would save me time in the end.

Thanks

Regards,
Anthony
On Thu, Apr 25, 2024 at 05:34:52PM +0200, Anthony Harivel wrote:
> After some investigation it seems that kvm_state is not yet committed at
> this point.
[...]
> So should I dig more to still do the sanity check in kvm_cpu_realizefn(),
> or should I already move the RAPL feature from -kvm to -cpu, as suggested
> by Zhao from Intel, and then access it from the CPUState?

I'm not so sure about that question. I think Paolo is best placed to
suggest which is better, as the KVM maintainer.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Hi Paolo,

Daniel P. Berrangé, Apr 25, 2024 at 17:42:
> I'm not so sure about that question. I think Paolo is best placed to
> suggest which is better, as the KVM maintainer.

I'm facing an issue that requires either a simple change or a more complex
one, depending on the decision.

TL;DR: should I move the rapl feature from -kvm to -cpu?

Zhao from Intel suggested that enabling the rapl feature via -cpu looks
more natural than through a -kvm feature, because it is indeed a CPU
feature. From the point of view of the QEMU architecture, it is easier to
enable the feature with -kvm because it is KVM dependent; but from the
point of view of the user, it may be more natural to enable this at the
-cpu level.

The issue I'm facing above comes from a suggestion from Daniel to do the
sanity check at kvm_cpu_realizefn() level. However, GDB showed me that
kvm_state is not yet populated there, even though we have already passed
the instance init, kvm_accel_instance_init().

So either I move from -kvm to -cpu and change the location of the sanity
check (checking whether rapl is activated on the host / whether it is an
Intel CPU), or I keep the feature on -kvm and either find a solution to
the above issue or change the location of the sanity check.

Thanks in advance for your guidance.

Regards,
Anthony
Anthony Harivel, Apr 26, 2024 at 10:36: > Hi Paolo, > > [...] > > TL;DR: should I move the rapl feature from -kvm to -cpu? > > [...] > > Thanks in advance for your guidance. > > Regards, > Anthony Hi Paolo, Just a gentle ping for the above question that would unlock me for the next iteration. Thanks a lot. Regards, Anthony
Hi Daniel, Daniel P. Berrangé, Apr 18, 2024 at 18:42: > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > > interface (Running Average Power Limit) for advertising the accumulated > > energy consumption of various power domains (e.g. CPU packages, DRAM, > > etc.). > > > > The consumption is reported via MSRs (model specific registers) like > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > > 64 bits registers that represent the accumulated energy consumption in > > micro Joules. They are updated by microcode every ~1ms. > > > > For now, KVM always returns 0 when the guest requests the value of > > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle > > these MSRs dynamically in userspace. > > > > To limit the amount of system calls for every MSR call, create a new > > thread in QEMU that updates the "virtual" MSR values asynchronously. > > > > Each vCPU has its own vMSR to reflect the independence of vCPUs. The > > thread updates the vMSR values with the ratio of energy consumed of > > the whole physical CPU package the vCPU thread runs on and the > > thread's utime and stime values. > > > > All other non-vCPU threads are also taken into account. Their energy > > consumption is evenly distributed among all vCPUs threads running on > > the same physical CPU package. > > > > To overcome the problem that reading the RAPL MSR requires priviliged > > access, a socket communication between QEMU and the qemu-vmsr-helper is > > mandatory. You can specified the socket path in the parameter. > > > > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock > > > > Actual limitation: > > - Works only on Intel host CPU because AMD CPUs are using different MSR > > adresses. > > > > - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at > > the moment. 
> > > > Signed-off-by: Anthony Harivel <aharivel@redhat.com> > > --- > > accel/kvm/kvm-all.c | 27 +++ > > docs/specs/index.rst | 1 + > > docs/specs/rapl-msr.rst | 155 ++++++++++++ > > include/sysemu/kvm.h | 2 + > > include/sysemu/kvm_int.h | 32 +++ > > target/i386/cpu.h | 8 + > > target/i386/kvm/kvm-cpu.c | 9 + > > target/i386/kvm/kvm.c | 428 ++++++++++++++++++++++++++++++++++ > > target/i386/kvm/meson.build | 1 + > > target/i386/kvm/vmsr_energy.c | 335 ++++++++++++++++++++++++++ > > target/i386/kvm/vmsr_energy.h | 99 ++++++++ > > 11 files changed, 1097 insertions(+) > > create mode 100644 docs/specs/rapl-msr.rst > > create mode 100644 target/i386/kvm/vmsr_energy.c > > create mode 100644 target/i386/kvm/vmsr_energy.h > > > > > diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c > > index 9c791b7b0520..eafb625839b8 100644 > > --- a/target/i386/kvm/kvm-cpu.c > > +++ b/target/i386/kvm/kvm-cpu.c > > @@ -50,6 +50,15 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp) > > MSR_IA32_UCODE_REV); > > } > > } > > + > > + if (kvm_is_rapl_feat_enable(cs)) { > > + if (!IS_INTEL_CPU(env)) { > > + error_setg(errp, "RAPL feature can only be\ > > + enabled with Intel CPU models"); > > + return false; > > + } > > + } > > I see a crash in kvm_is_rapl_feat_enable() from this caller, > when I run with this kind of command line: > > $ qemu-system-x86_64 \ > -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \ > -initrd tiny-initrd.img -m 2000 -serial stdio -nodefaults \ > -display none -accel kvm -append "console=ttyS0 quiet" > > > #0 0x0000555555bc14b7 in kvm_is_rapl_feat_enable (cs=cs@entry=0x555557b83470) at ../target/i386/kvm/kvm.c:2531 > #1 0x0000555555bc7534 in kvm_cpu_realizefn (cs=0x555557b83470, errp=0x7fffffffd2a0) at ../target/i386/kvm/kvm-cpu.c:54 > #2 0x0000555555d2432a in accel_cpu_common_realize (cpu=0x555557b83470, errp=0x7fffffffd2a0) at ../accel/accel-target.c:130 > #3 0x0000555555cdd955 in cpu_exec_realizefn (cpu=cpu@entry=0x555557b83470, errp=errp@entry=0x7fffffffd2a0) at ../cpu-target.c:137 > #4 0x0000555555c14b89 in x86_cpu_realizefn (dev=0x555557b83470, errp=0x7fffffffd310) at ../target/i386/cpu.c:7320 > #5 0x0000555555d58f4b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7fffffffd390) at ../hw/core/qdev.c:510 > #6 0x0000555555d5d78d in property_set_bool (obj=0x555557b83470, v=<optimized out>, name=<optimized out>, opaque=0x5555578558e0, errp=0x7fffffffd390) > at ../qom/object.c:2358 > #7 0x0000555555d60b0b in object_property_set (obj=obj@entry=0x555557b83470, name=name@entry=0x55555607c799 "realized", v=v@entry=0x555557b8ccb0, errp=0x7fffffffd390, > errp@entry=0x555556e210d8 <error_fatal>) at ../qom/object.c:1472 > #8 0x0000555555d6444f in object_property_set_qobject > (obj=obj@entry=0x555557b83470, name=name@entry=0x55555607c799 "realized", value=value@entry=0x555557854800, errp=errp@entry=0x555556e210d8 <error_fatal>) > at ../qom/qom-qobject.c:28 > #9 0x0000555555d61174 in object_property_set_bool > (obj=0x555557b83470, name=name@entry=0x55555607c799 "realized", value=value@entry=true, errp=errp@entry=0x555556e210d8 <error_fatal>) at ../qom/object.c:1541 > #10 0x0000555555d59a3c in qdev_realize (dev=<optimized out>, bus=bus@entry=0x0, errp=errp@entry=0x555556e210d8 <error_fatal>) at ../hw/core/qdev.c:292 > #11 0x0000555555bd51e0 in x86_cpu_new (x86ms=<optimized out>, apic_id=0, errp=0x555556e210d8 <error_fatal>) at ../hw/i386/x86.c:105 > #12 0x0000555555bd52ce in x86_cpus_init (x86ms=x86ms@entry=0x555557aaed30, 
default_cpu_version=<optimized out>) at ../hw/i386/x86.c:156 > #13 0x0000555555bdc1a7 in pc_init1 (machine=0x555557aaed30, pci_type=0x55555604aa61 "i440FX") at ../hw/i386/pc_piix.c:185 > #14 0x0000555555947a11 in machine_run_board_init (machine=0x555557aaed30, mem_path=<optimized out>, errp=<optimized out>, errp@entry=0x555556e210d8 <error_fatal>) > at ../hw/core/machine.c:1547 > #15 0x0000555555b020ed in qemu_init_board () at ../system/vl.c:2613 > #16 qmp_x_exit_preconfig (errp=0x555556e210d8 <error_fatal>) at ../system/vl.c:2705 > #17 0x0000555555b0611e in qemu_init (argc=<optimized out>, argv=<optimized out>) at ../system/vl.c:3739 > #18 0x0000555555897ca9 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:47 > > The problem is that 'cs->kvm_state' is NULL here > Oh boy... Thanks for the feedback! I'll put that on my priority list for tomorrow. Regards, Anthony > With regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h > index fad9a7e8ff30..37f68c496807 100644 > --- a/include/sysemu/kvm.h > +++ b/include/sysemu/kvm.h > @@ -544,4 +544,6 @@ uint32_t kvm_dirty_ring_size(void); > * reported for the VM. > */ > bool kvm_hwpoisoned_mem(void); > + > +bool kvm_is_rapl_feat_enable(CPUState *cs); > #endif > diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h > index 882e37e12c5b..8dbeda473c6c 100644 > --- a/include/sysemu/kvm_int.h > +++ b/include/sysemu/kvm_int.h > @@ -14,6 +14,9 @@ > #include "qemu/accel.h" > #include "qemu/queue.h" > #include "sysemu/kvm.h" > +#include "hw/boards.h" > +#include "hw/i386/topology.h" > +#include "io/channel-socket.h" > > typedef struct KVMSlot > { > @@ -48,6 +51,34 @@ typedef struct KVMMemoryListener { > > #define KVM_MSI_HASHTAB_SIZE 256 > > +typedef struct KVMHostTopoInfo { > + /* Number of package on the Host */ > + unsigned int maxpkgs; > + /* Number of cpus on the Host */ > + unsigned int maxcpus; > + /* Number of cpus on each different package */ > + unsigned int *pkg_cpu_count; > + /* Each package can have different maxticks */ > + unsigned int *maxticks; > +} KVMHostTopoInfo; > + > +struct KVMMsrEnergy { > + pid_t pid; > + bool enable; > + char *socket_path; > + QIOChannelSocket *sioc; > + QemuThread msr_thr; > + unsigned int vcpus; > + unsigned int vsockets; Add 'guest_' prefix on to these two, to make it clearer > + X86CPUTopoInfo topo_info; Lets call this 'guest_topo' too, to make it more explicitly distinguished from the next 'host_topo' field. > + KVMHostTopoInfo host_topo; > + const CPUArchIdList *cpu_list; This name choice has an unfortunate side effect. cpu.h has a '#define cpu_list x86_cpu_list' for the purposes of selecting the target specific cpu_list function. Unfortunately that macro affects your field name here, which is why the code is wierdly able to set an 'x86_cpu_list' field despite you having declared it 'cpu_list'. I'd suggest perhaps calling this field 'guest_cpus' to avoid this confusing side-effect, and again also making clear that this is tracking guest, not host, CPUs. 
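A minimal, self-contained illustration of that preprocessor clash (demo code, not from the patch):

    /*
     * With cpu.h's "#define cpu_list x86_cpu_list" in scope, a field
     * named cpu_list is silently renamed by the preprocessor.
     */
    #define cpu_list x86_cpu_list

    struct demo {
        void *cpu_list;    /* after preprocessing: void *x86_cpu_list; */
    };

    /*
     * The cpu_list and x86_cpu_list members of a struct demo are now one
     * and the same, which is why the patch can assign to an x86_cpu_list
     * field it never declared.
     */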
> + uint64_t *msr_value; > + uint64_t msr_unit; > + uint64_t msr_limit; > + uint64_t msr_info; > +}; > + > enum KVMDirtyRingReaperState { > KVM_DIRTY_RING_REAPER_NONE = 0, > /* The reaper is sleeping */ > @@ -114,6 +145,7 @@ struct KVMState > bool kvm_dirty_ring_with_bitmap; > uint64_t kvm_eager_split_size; /* Eager Page Splitting chunk size */ > struct KVMDirtyRingReaper reaper; > + struct KVMMsrEnergy msr_energy; > NotifyVmexitOption notify_vmexit; > uint32_t notify_window; > uint32_t xen_version; > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index e68cbe929302..3de69caa229e 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -2513,6 +2565,339 @@ static void register_smram_listener(Notifier *n, void *unused) > &smram_address_space, 1, "kvm-smram"); > } > > +static void *kvm_msr_energy_thread(void *data) > +{ > + KVMState *s = data; > + struct KVMMsrEnergy *vmsr = &s->msr_energy; > + > + g_autofree package_energy_stat *pkg_stat = NULL; > + g_autofree thread_stat *thd_stat = NULL; > + g_autofree pid_t *thread_ids = NULL; > + g_autofree CPUState *cpu = NULL; > + g_autofree unsigned int *vpkgs_energy_stat = NULL; > + unsigned int num_threads = 0; > + unsigned int tmp_num_threads = 0; > + > + X86CPUTopoIDs topo_ids; > + > + rcu_register_thread(); > + > + /* Allocate memory for each package energy status */ > + pkg_stat = g_new0(package_energy_stat, vmsr->host_topo.maxpkgs); > + > + /* Allocate memory for thread stats */ > + thd_stat = g_new0(thread_stat, 1); > + > + /* Allocate memory for holding virtual package energy counter */ > + vpkgs_energy_stat = g_new0(unsigned int, vmsr->vsockets); > + > + /* Populate the max tick of each packages */ > + for (int i = 0; i <= vmsr->host_topo.maxpkgs; i++) { The 'maxticks' array is allocated as 'maxpkgs' in size, so this limit condition needs to be '<' not '<=', to avoid an out of bounds write. > + /* > + * Max numbers of ticks per package > + * Time in second * Number of ticks/second * Number of cores/package > + * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max > + */ > + vmsr->host_topo.maxticks[i] = (MSR_ENERGY_THREAD_SLEEP_US / 1000000) > + * sysconf(_SC_CLK_TCK) > + * vmsr->host_topo.pkg_cpu_count[i]; > + } > + > + while (true) { > + /* Get all qemu threads id */ > + thread_ids = vmsr_get_thread_ids(vmsr->pid, &num_threads); Since you're re-allocating thread_ids on each loop iteration, you need to free it, not overwrite the pointer. This is easily achieved by moving the earlier declaration inside the loop body. ie g_autofree pid_t *thread_ids = thread_ids = vmsr_get_thread_ids(vmsr->pid, &num_threads); > + > + if (thread_ids == NULL) { > + goto clean; > + } > + > + if (tmp_num_threads < num_threads) { > + thd_stat = g_renew(thread_stat, thd_stat, num_threads); > + } > + > + tmp_num_threads = num_threads; You could eliminate 'tmp_num_threads' and just unconditionally call g_renew on each loop iteration. It'll effectively be a no-op if the size hasn't changed since the last iteration, and won't hurt performance in any measurable way. 
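Taken together with the memset point raised further down, the top of the loop could then look roughly like this (a sketch only, reusing the names from the patch):

    while (true) {
        /* Re-read the QEMU thread list; freed automatically each iteration */
        g_autofree pid_t *thread_ids =
            vmsr_get_thread_ids(vmsr->pid, &num_threads);

        if (thread_ids == NULL) {
            goto clean;
        }

        /*
         * Unconditional g_renew is cheap; zero the array here because
         * g_renew does not initialise newly added elements.
         */
        thd_stat = g_renew(thread_stat, thd_stat, num_threads);
        memset(thd_stat, 0, num_threads * sizeof(thread_stat));

        /* ... populate thd_stat and do the sampling as before ... */
    }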
> + > + /* Populate all the thread stats */ > + for (int i = 0; i < num_threads; i++) { > + thd_stat[i].utime = g_new0(unsigned long long, 2); > + thd_stat[i].stime = g_new0(unsigned long long, 2); > + thd_stat[i].thread_id = thread_ids[i]; > + vmsr_read_thread_stat(vmsr->pid, > + thd_stat[i].thread_id, > + thd_stat[i].utime, > + thd_stat[i].stime, > + &thd_stat[i].cpu_id); > + thd_stat[i].pkg_id = > + vmsr_get_physical_package_id(thd_stat[i].cpu_id); > + } > + > + /* Retrieve all packages power plane energy counter */ > + for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) { > + for (int j = 0; j < num_threads; j++) { > + /* > + * Use the first thread we found that ran on the CPU > + * of the package to read the packages energy counter > + */ > + if (thd_stat[j].pkg_id == i) { > + pkg_stat[i].e_start = > + vmsr_read_msr(MSR_PKG_ENERGY_STATUS, > + thd_stat[j].cpu_id, > + thd_stat[j].thread_id, > + s->msr_energy.sioc); > + break; > + } > + } > + } > + > + /* Sleep a short period while the other threads are working */ > + usleep(MSR_ENERGY_THREAD_SLEEP_US); > + > + /* > + * Retrieve all packages power plane energy counter > + * Calculate the delta of all packages > + */ > + for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) { > + for (int j = 0; j < num_threads; j++) { > + /* > + * Use the first thread we found that ran on the CPU > + * of the package to read the packages energy counter > + */ > + if (thd_stat[j].pkg_id == i) { > + pkg_stat[i].e_end = > + vmsr_read_msr(MSR_PKG_ENERGY_STATUS, > + thd_stat[j].cpu_id, > + thd_stat[j].thread_id, > + s->msr_energy.sioc); > + /* > + * Prevent the case we have migrate the VM > + * during the sleep period or any other cases > + * were energy counter might be lower after > + * the sleep period. > + */ > + if (pkg_stat[i].e_end > pkg_stat[i].e_start) { > + pkg_stat[i].e_delta = > + pkg_stat[i].e_end - pkg_stat[i].e_start; > + } else { > + pkg_stat[i].e_delta = 0; > + } > + break; > + } > + } > + } > + > + /* Delta of ticks spend by each thread between the sample */ > + for (int i = 0; i < num_threads; i++) { > + vmsr_read_thread_stat(vmsr->pid, > + thd_stat[i].thread_id, > + thd_stat[i].utime, > + thd_stat[i].stime, > + &thd_stat[i].cpu_id); > + > + if (vmsr->pid < 0) { > + /* > + * We don't count the dead thread > + * i.e threads that existed before the sleep > + * and not anymore > + */ > + thd_stat[i].delta_ticks = 0; > + } else { > + vmsr_delta_ticks(thd_stat, i); > + } > + } > + > + /* > + * Identify the vcpu threads > + * Calculate the number of vcpu per package > + */ > + CPU_FOREACH(cpu) { > + for (int i = 0; i < num_threads; i++) { > + if (cpu->thread_id == thd_stat[i].thread_id) { > + thd_stat[i].is_vcpu = true; > + thd_stat[i].vcpu_id = cpu->cpu_index; > + pkg_stat[thd_stat[i].pkg_id].nb_vcpu++; > + thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu); > + break; > + } > + } > + } > + > + /* Retrieve the virtual package number of each vCPU */ > + for (int i = 0; i < vmsr->x86_cpu_list->len; i++) { > + for (int j = 0; j < num_threads; j++) { > + if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id) > + && (thd_stat[j].is_vcpu == true)) { > + x86_topo_ids_from_apicid(thd_stat[j].acpi_id, > + &vmsr->topo_info, &topo_ids); > + thd_stat[j].vpkg_id = topo_ids.pkg_id; > + } > + } > + } > + > + /* Calculate the total energy of all non-vCPU thread */ > + for (int i = 0; i < num_threads; i++) { > + double temp; Move this inside the 'if', and assign to it at time of declaration. 
> + if ((thd_stat[i].is_vcpu != true) && > + (thd_stat[i].delta_ticks > 0)) { > + temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta, > + thd_stat[i].delta_ticks, > + vmsr->host_topo.maxticks[thd_stat[i].pkg_id]); > + pkg_stat[thd_stat[i].pkg_id].e_ratio > + += (uint64_t)lround(temp); > + } > + } > + > + /* Calculate the ratio per non-vCPU thread of each package */ > + for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) { > + if (pkg_stat[i].nb_vcpu > 0) { > + pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu; > + } > + } > + > + /* > + * Calculate the energy for each Package: > + * Energy Package = sum of each vCPU energy that belongs to the package > + */ > + for (int i = 0; i < num_threads; i++) { > + double temp; Aain can move inside the 'if' > + > + if ((thd_stat[i].is_vcpu == true) && \ > + (thd_stat[i].delta_ticks > 0)) { > + temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta, > + thd_stat[i].delta_ticks, > + vmsr->host_topo.maxticks[thd_stat[i].pkg_id]); > + vpkgs_energy_stat[thd_stat[i].vpkg_id] += > + (uint64_t)lround(temp); > + vpkgs_energy_stat[thd_stat[i].vpkg_id] += > + pkg_stat[thd_stat[i].pkg_id].e_ratio; > + } > + } > + > + /* > + * Finally populate the vmsr register of each vCPU with the total > + * package value to emulate the real hardware where each CPU return the > + * value of the package it belongs. > + */ > + for (int i = 0; i < num_threads; i++) { > + if ((thd_stat[i].is_vcpu == true) && \ > + (thd_stat[i].delta_ticks > 0)) { > + vmsr->msr_value[thd_stat[i].vcpu_id] = \ > + vpkgs_energy_stat[thd_stat[i].vpkg_id]; > + } > + } > + > + /* Freeing memory before zeroing the pointer */ > + for (int i = 0; i < num_threads; i++) { > + g_free(thd_stat[i].utime); > + g_free(thd_stat[i].stime); > + } > + /* Zeroing the reused memory with g_renew */ > + memset(thd_stat, 0, num_threads * sizeof(thread_stat)); This memset() should be done immediately after the 'g_renew' call, since if g_renew enlarged the array, it won't have zero'd the newly added elements. > + memset(thread_ids, 0, sizeof(pid_t)); We need to be free'ing thread_ids, not zero'ing it. This memset can just be removed entirely if you follow my earlier suggestion to rely on g_autofree and declare thread_ids inside the loop body. > + } > + > +clean: > + rcu_unregister_thread(); > + return NULL; > +} > + > +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms) > +{ > + struct KVMMsrEnergy *r = &s->msr_energy; > + int ret = 0; > + > + /* > + * Sanity check > + * 1. Host cpu must be Intel cpu > + * 2. 
RAPL must be enabled on the Host > + */ > + if (is_host_cpu_intel()) { > + error_report("The RAPL feature can only be enabled on hosts\ > + with Intel CPU models"); > + ret = 1; > + goto out; > + } > + > + if (!is_rapl_enabled()) { > + ret = 1; > + goto out; > + } > + > + /* Retrieve the virtual topology */ > + vmsr_init_topo_info(&r->topo_info, ms); > + > + /* Retrieve the number of vcpu */ > + r->vcpus = ms->smp.cpus; > + > + /* Retrieve the number of virtual sockets */ > + r->vsockets = ms->smp.sockets; > + > + /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */ > + r->msr_value = g_new0(uint64_t, r->vcpus); > + > + /* Retrieve the CPUArchIDlist */ > + r->x86_cpu_list = x86_possible_cpu_arch_ids(ms); > + > + /* Max number of cpus on the Host */ > + r->host_topo.maxcpus = vmsr_get_maxcpus(); > + if (r->host_topo.maxcpus == 0) { > + error_report("host max cpus = 0"); > + ret = 1; > + goto out; > + } > + > + /* Max number of packages on the host */ > + r->host_topo.maxpkgs = vmsr_get_max_physical_package(r->host_topo.maxcpus); > + if (r->host_topo.maxpkgs == 0) { > + error_report("host max pkgs = 0"); > + ret = 1; > + goto out; > + } > + > + /* Allocate memory for each package on the host */ > + r->host_topo.pkg_cpu_count = g_new0(unsigned int, r->host_topo.maxpkgs); > + r->host_topo.maxticks = g_new0(unsigned int, r->host_topo.maxpkgs); > + > + vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count, > + r->host_topo.maxpkgs); > + for (int i = 0; i < r->host_topo.maxpkgs; i++) { > + if (r->host_topo.pkg_cpu_count[i] == 0) { > + error_report("cpu per packages = 0 on package_%d", i); > + ret = 1; > + goto out; > + } > + } > + > + /* Get QEMU PID*/ > + r->pid = getpid(); > + > + /* Compute the socket path if necessary */ > + if (s->msr_energy.socket_path == NULL) { > + s->msr_energy.socket_path = vmsr_compute_default_paths(); > + } > + > + /* Open socket with vmsr helper */ > + s->msr_energy.sioc = vmsr_open_socket(s->msr_energy.socket_path); Need to detect & report error if opening the socket failed. > + > + /* Those MSR values should not change */ > + r->msr_unit = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid, > + s->msr_energy.sioc); > + r->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid, > + s->msr_energy.sioc); > + r->msr_info = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid, > + s->msr_energy.sioc); > + if (r->msr_unit == 0 || r->msr_limit == 0 || r->msr_info == 0) { > + error_report("can't read any virtual msr"); > + ret = 1; > + goto out; > + } > + > + qemu_thread_create(&r->msr_thr, "kvm-msr", > + kvm_msr_energy_thread, > + s, QEMU_THREAD_JOINABLE); > +out: > + return ret; > +} > + > int kvm_arch_get_default_type(MachineState *ms) > { > return 0; > diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c > new file mode 100644 > index 000000000000..934d14c4305f > --- /dev/null > +++ b/target/i386/kvm/vmsr_energy.c > @@ -0,0 +1,335 @@ > +/* > + * QEMU KVM support -- x86 virtual RAPL msr > + * > + * Copyright 2024 Red Hat, Inc. 2024 > + * > + * Author: > + * Anthony Harivel <aharivel@redhat.com> > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. 
> + * > + */ > + > +#include "qemu/osdep.h" > +#include "qemu/error-report.h" > +#include "vmsr_energy.h" > +#include "io/channel.h" > +#include "io/channel-socket.h" > +#include "hw/boards.h" > +#include "cpu.h" > +#include "host-cpu.h" > + > +QIOChannelSocket *vmsr_open_socket(const char *path) > +{ > + g_autofree char *socket_path = NULL; > + > + socket_path = g_strdup(path); Assign this at time of declaring socket_path > + > + SocketAddress saddr = { > + .type = SOCKET_ADDRESS_TYPE_UNIX, > + .u.q_unix.path = socket_path > + }; > + > + QIOChannelSocket *sioc = qio_channel_socket_new(); > + Error *local_err = NULL; > + > + qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper"); > + qio_channel_socket_connect_sync(sioc, > + &saddr, > + &local_err); > + if (local_err) { > + /* Close socket. */ > + qio_channel_close(QIO_CHANNEL(sioc), NULL); > + object_unref(OBJECT(sioc)); You need a 'return NULL' here, otherwise we're returning a free'd pointer to the caller. Aso we need to report the error. > + } > + > + return sioc; Indent misaligned. > +} > + > +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid, > + QIOChannelSocket *sioc) > +{ > + uint64_t data = 0; > + int r = 0; > + Error *local_err = NULL; > + g_autofree uint32_t buffer[3]; g_autofree is wrong for a stack allocated array, just remove it. > + /* > + * Send the required arguments: > + * 1. RAPL MSR register to read > + * 2. On which CPU ID > + * 3. From which vCPU (Thread ID) > + */ > + buffer[0] = reg; > + buffer[1] = cpu_id; > + buffer[2] = tid; > + > + r = qio_channel_write_all(QIO_CHANNEL(sioc), > + (char *)buffer, sizeof(buffer), > + &local_err); > + if (r < 0) { > + goto out_close; > + } > + > + r = qio_channel_read_all(QIO_CHANNEL(sioc), > + (char *)data, sizeof(data), > + &local_err); > + if (r < 0) { > + data = 0; > + goto out_close; > + } > + > +out_close: > + return data; > +} > +/* Retrieve the max number of physical cpu on the host */ > +unsigned int vmsr_get_maxcpus(void) > +{ > + GDir *dir; > + const gchar *entry_name; > + unsigned int cpu_count = 0; > + const char *path = "/sys/devices/system/cpu/"; > + > + dir = g_dir_open(path, 0, NULL); > + if (dir == NULL) { > + error_report("Unable to open cpu directory"); Slightly more friendly to include the path in the error message, and also the error that glib reports.eg g_autoptr(GError) gerr = NULL; GDir *dir = g_dir_open(path, 0, &gerr); if (dir == NULL) { error_report("Unable to open cpu directory %s: %s", path, err->message); > + return -1; > + } > + > + while ((entry_name = g_dir_read_name(dir)) != NULL) { > + if (g_ascii_strncasecmp(entry_name, "cpu", 3) == 0 && > + isdigit(entry_name[3])) { > + cpu_count++; > + } > + } > + > + g_dir_close(dir); > + > + return cpu_count; > +} > + > +/* Count the number of physical cpu on each packages */ > +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count, > + unsigned int max_pkgs) > +{ > + g_autofree char *file_contents = NULL; > + g_autofree char *path = NULL; > + gsize length; > + > + /* Iterate over cpus and count cpus in each package */ > + for (int cpu_id = 0; ; cpu_id++) { > + path = g_build_filename( > + g_strdup_printf("/sys/devices/system/cpu/cpu%d/" > + "topology/physical_package_id", cpu_id), NULL); The result of 'g_strdup_printf' needs assigning to a variable marked with 'g_autofree', before being passed to g_build_filename to avoid a mem leak. 
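In other words, roughly (a sketch of the suggested fix; the same pattern would apply to the other g_strdup_printf() call sites flagged below):

        g_autofree char *pkg_path = g_strdup_printf(
            "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
            cpu_id);

        if (!g_file_get_contents(pkg_path, &file_contents, &length, NULL)) {
            break;    /* No more cpus */
        }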
> + > + if (!g_file_get_contents(path, &file_contents, &length, NULL)) { > + break; /* No more cpus */ > + } > + > + /* Get the physical package ID for this CPU */ > + int package_id = atoi(file_contents); > + > + /* Check if the package ID is within the known number of packages */ > + if (package_id >= 0 && package_id < max_pkgs) { > + /* If yes, count the cpu for this package*/ > + package_count[package_id]++; > + } > + } > + > + return 0; > +} > +/* Read the scheduled time for a given thread of a give pid */ > +void vmsr_read_thread_stat(pid_t pid, > + unsigned int thread_id, > + unsigned long long *utime, > + unsigned long long *stime, > + unsigned int *cpu_id) > +{ > + g_autofree char *path = NULL; > + > + path = g_build_filename(g_strdup_printf("/proc/%u/task/%d/stat", pid, \ > + thread_id), NULL); Again need to assign g_strdup_printf result to a g_autofree marked variable to avoid memleak. > + > + FILE *file = fopen(path, "r"); > + if (file == NULL) { > + pid = -1; > + return; > + } > + > + if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u" > + " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u" > + " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u", > + utime, stime, cpu_id) != 3) > + { > + pid = -1; > + return; > + } > + > + fclose(file); > + return; > +} > + > +/* Read QEMU stat task folder to retrieve all QEMU threads ID */ > +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads) > +{ > + g_autofree char *path = g_build_filename("/proc", > + g_strdup_printf("%d/task", pid), NULL); Another memleak of g_strdup_printf()result > + > + DIR *dir = opendir(path); > + if (dir == NULL) { > + error_report("Error opening /proc/qemu/task"); > + return NULL; > + } > + > + pid_t *thread_ids = NULL; > + unsigned int thread_count = 0; > + > + g_autofree struct dirent *ent = NULL; > + while ((ent = readdir(dir)) != NULL) { > + if (ent->d_name[0] == '.') { > + continue; > + } > + pid_t tid = atoi(ent->d_name); > + if (pid != tid) { > + thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1)); > + thread_ids[thread_count] = tid; > + thread_count++; > + } > + } > + > + closedir(dir); > + > + *num_threads = thread_count; > + return thread_ids; > +} > diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h > new file mode 100644 > index 000000000000..a04114179346 > --- /dev/null > +++ b/target/i386/kvm/vmsr_energy.h > @@ -0,0 +1,99 @@ > +/* > + * QEMU KVM support -- x86 virtual energy-related MSR. > + * > + * Copyright 2024 Red Hat, Inc. 2024 > + * > + * Author: > + * Anthony Harivel <aharivel@redhat.com> > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. 
> + * > + */ > + > +#ifndef VMSR_ENERGY_H > +#define VMSR_ENERGY_H > + > +#include <stdint.h> > +#include "qemu/osdep.h" > +#include "io/channel-socket.h" > +#include "hw/i386/topology.h" > + > +/* > + * Define the interval time in micro seconds between 2 samples of > + * energy related MSRs > + */ > +#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0 > + > +/* > + * Thread statistic > + * @ thread_id: TID (thread ID) > + * @ is_vcpu: true if TID is vCPU thread > + * @ cpu_id: CPU number last executed on > + * @ pkg_id: package number of the CPU > + * @ vcpu_id: vCPU ID > + * @ vpkg: virtual package number > + * @ acpi_id: APIC id of the vCPU > + * @ utime: amount of clock ticks the thread > + * has been scheduled in User mode > + * @ stime: amount of clock ticks the thread > + * has been scheduled in System mode > + * @ delta_ticks: delta of utime+stime between > + * the two samples (before/after sleep) > + */ > +struct thread_stat { Suggest 'vmsr_thread_stat' to better namespace > + unsigned int thread_id; > + bool is_vcpu; > + unsigned int cpu_id; > + unsigned int pkg_id; > + unsigned int vpkg_id; > + unsigned int vcpu_id; > + unsigned long acpi_id; > + unsigned long long *utime; > + unsigned long long *stime; > + unsigned long long delta_ticks; > +}; > + > +/* > + * Package statistic > + * @ e_start: package energy counter before the sleep > + * @ e_end: package energy counter after the sleep > + * @ e_delta: delta of package energy counter > + * @ e_ratio: store the energy ratio of non-vCPU thread > + * @ nb_vcpu: number of vCPU running on this package > + */ > +struct package_energy_stat { And 'vmsr_package_energy_stat' too > + uint64_t e_start; > + uint64_t e_end; > + uint64_t e_delta; > + uint64_t e_ratio; > + unsigned int nb_vcpu; > +}; > + > +typedef struct thread_stat thread_stat; > +typedef struct package_energy_stat package_energy_stat; > + > +char *vmsr_compute_default_paths(void); > +void vmsr_read_thread_stat(pid_t pid, > + unsigned int thread_id, > + unsigned long long *utime, > + unsigned long long *stime, > + unsigned int *cpu_id); > + > +QIOChannelSocket *vmsr_open_socket(const char *path); > +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, > + uint32_t tid, QIOChannelSocket *sioc); > +void vmsr_delta_ticks(thread_stat *thd_stat, int i); > +unsigned int vmsr_get_maxcpus(void); > +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus); > +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count, > + unsigned int max_pkgs); > +int vmsr_get_physical_package_id(int cpu_id); > +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads); > +double vmsr_get_ratio(uint64_t e_delta, > + unsigned long long delta_ticks, > + unsigned int maxticks); > +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms); > +bool is_host_cpu_intel(void); > +int is_rapl_enabled(void); > +#endif /* VMSR_ENERGY_H */ Bearing in mind the other feedback from Intel about future CPUs not using MSRs for this interface, I wonder if a rename is worth doing. eg call the file 'rapl.h', and the APIs / structs all have a 'rapl_' name prefix, instead of 'vmsr'. Then use 'rapl_vmsr_' as a name prefix on the ones that are specifically tied to the MSR based impl. In future we can have 'rapl_tpmi' as the prefix for the new interface. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
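For illustration, the split suggested above might end up looking like this (names purely illustrative, not in the patch):

    /* Generic RAPL state and helpers */
    typedef struct rapl_thread_stat rapl_thread_stat;
    typedef struct rapl_package_energy_stat rapl_package_energy_stat;

    /* MSR-backed backend */
    uint64_t rapl_vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid,
                                QIOChannelSocket *sioc);

    /* A future TPMI-backed backend would then add rapl_tpmi_* helpers */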
Hi Anthony, May I ask what your usage scenario is? Is it to measure Guest's energy consumption and to charged per watt consumed? ;-) On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > Date: Thu, 11 Apr 2024 14:14:34 +0200 > From: Anthony Harivel <aharivel@redhat.com> > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > interface (Running Average Power Limit) for advertising the accumulated > energy consumption of various power domains (e.g. CPU packages, DRAM, > etc.). > > The consumption is reported via MSRs (model specific registers) like > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > 64 bits registers that represent the accumulated energy consumption in > micro Joules. They are updated by microcode every ~1ms. What is your current target platform? On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO interface) and the TPMI based RAPL will be preferred in the future as well: * TPMI doc: https://github.com/intel/tpmi_power_management * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c So do you have the plan to support RAPL-TPMI? > For now, KVM always returns 0 when the guest requests the value of > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle > these MSRs dynamically in userspace. > > To limit the amount of system calls for every MSR call, create a new > thread in QEMU that updates the "virtual" MSR values asynchronously. > > Each vCPU has its own vMSR to reflect the independence of vCPUs. The > thread updates the vMSR values with the ratio of energy consumed of > the whole physical CPU package the vCPU thread runs on and the > thread's utime and stime values. > > All other non-vCPU threads are also taken into account. Their energy > consumption is evenly distributed among all vCPUs threads running on > the same physical CPU package. The package energy consumption includes core part and uncore part, where uncore part consumption may not be able to be scaled based on vCPU runtime ratio. When the uncore part consumption is small, the error in this part is small, but if it is large, then the error generated by scaling by vCPU runtime will be large. May I ask what your usage scenario is? Is there significant uncore consumption (e.g. GPU)? Also, I think of a generic question is whether the error in this calculation is measurable? Like comparing the RAPL status of the same workload on Guest and bare metal to check the error. IIUC, this calculation is highly affected by native/sibling Guests, especially in cloud scenarios where there are multiple Guests, the accuracy of this algorithm needs to be checked. > To overcome the problem that reading the RAPL MSR requires priviliged > access, a socket communication between QEMU and the qemu-vmsr-helper is > mandatory. You can specified the socket path in the parameter. > > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock Based on the above comment, I suggest to rename this option as "rapl-msr" to distinguish it from rapl-tpmi. In addition, RAPL is basically a CPU feature, I think it would be more appropriate to make it as a x86 CPU's property. Your RAPL support actually provides a framework for assisting KVM emulation in userspace, so this informs other feature support (maybe model specific, or architectural) in the future. Enabling/disabling CPU features via -cpu looks more natural. 
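As a rough illustration of the -cpu variant suggested here (hypothetical sketch; neither the "rapl-msr" property nor the enable_rapl field exists in the patch):

    /*
     * Hypothetical: expose the knob as an X86CPU property instead of an
     * accelerator property; "rapl-msr" and enable_rapl are invented names.
     */
    static Property x86_cpu_rapl_props[] = {
        DEFINE_PROP_BOOL("rapl-msr", X86CPU, enable_rapl, false),
        DEFINE_PROP_END_OF_LIST(),
    };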
[snip] > +High level implementation > +------------------------- > + > +In order to update the value of the virtual MSR, a QEMU thread is created. > +The thread is basically just an infinity loop that does: > + > +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in > + Userspace and System) > > +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where > + the QEMU threads are running on. > + > +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads > + will do what they have to do and so the energy counter will increase. > + > +4. Repeat 2. and 3. and calculate the delta of every metrics representing the > + time spent scheduled for each QEMU thread *and* the energy spent by the > + packages during the pause. > + > +5. Filter the vcpu threads and the non-vcpu threads. > + > +6. Retrieve the topology of the Virtual Machine. This helps identify which > + vCPU is running on which virtual package. > + > +7. The total energy spent by the non-vcpu threads is divided by the number > + of vcpu threads so that each vcpu thread will get an equal part of the > + energy spent by the QEMU workers. > + > +8. Calculate the ratio of energy spent per vcpu threads. > + > +9. Calculate the energy for each virtual package. > + > +10. The virtual MSRs are updated for each virtual package. Each vCPU that > + belongs to the same package will return the same value when accessing the > + the MSR. > + > +11. Loop back to 1. > + > +Ratio calculation > +----------------- > + > +In Linux, a process has an execution time associated with it. The scheduler is > +dividing the time in clock ticks. The number of clock ticks per second can be > +found by the sysconf system call. A typical value of clock ticks per second is > +100. So a core can run a process at the maximum of 100 ticks per second. If a > +package has 4 cores, 400 ticks maximum can be scheduled on all the cores > +of the package for a period of 1 second. > + > +The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed time of a > +process with the [pid] as the process ID. It gives the amount of ticks the > +process has been scheduled in userspace (utime) and kernel space (stime). > + I understand tick would ignore frequency changes, e.g., HWP's auto-pilot or turbo boost. All these CPU frequency change would impact on core energy consumption. I think the better way would be to use APERF counter, but unfortunately it lacks virtualization support (for Intel). Due to such considerations, it may be more worthwhile to evaluate the accuracy of this tick-based algorithm. [snip] > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index e68cbe929302..3de69caa229e 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c [snip] > +static void *kvm_msr_energy_thread(void *data) > +{ > + KVMState *s = data; > + struct KVMMsrEnergy *vmsr = &s->msr_energy; > + > + g_autofree package_energy_stat *pkg_stat = NULL; > + g_autofree thread_stat *thd_stat = NULL; > + g_autofree pid_t *thread_ids = NULL; > + g_autofree CPUState *cpu = NULL; > + g_autofree unsigned int *vpkgs_energy_stat = NULL; > + unsigned int num_threads = 0; > + unsigned int tmp_num_threads = 0; [snip] > + /* Sleep a short period while the other threads are working */ > + usleep(MSR_ENERGY_THREAD_SLEEP_US); Is it possible to passively read the energy status? i.e. access the Host MSR and calculate the energy consumption for the Guest when the Guest triggers the relevant exit. 
I think this might make the error larger, but not sure the error would be so large as to be unacceptable. [snip] > + /* Retrieve the virtual package number of each vCPU */ > + for (int i = 0; i < vmsr->x86_cpu_list->len; i++) { > + for (int j = 0; j < num_threads; j++) { > + if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id) > + && (thd_stat[j].is_vcpu == true)) { > + x86_topo_ids_from_apicid(thd_stat[j].acpi_id, > + &vmsr->topo_info, &topo_ids); > + thd_stat[j].vpkg_id = topo_ids.pkg_id; > + } > + } > + } We lack package instances as well as MSR topology, so we have to deal with this pain... Similarly, my attempt at other package/die level MSRs [1] is to only allow Guest to set a package/die. But I think that in the future with QOM-topo [2] (i.e. creating package/ die instances), MSR topology can have an easier implementation, though, looks there's a long way to go! [1]: https://lore.kernel.org/qemu-devel/20240203093054.412135-4-zhao1.liu@linux.intel.com/ [2]: https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@linux.intel.com/ > + /* Calculate the total energy of all non-vCPU thread */ > + for (int i = 0; i < num_threads; i++) { > + double temp; > + if ((thd_stat[i].is_vcpu != true) && > + (thd_stat[i].delta_ticks > 0)) { > + temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta, > + thd_stat[i].delta_ticks, > + vmsr->host_topo.maxticks[thd_stat[i].pkg_id]); > + pkg_stat[thd_stat[i].pkg_id].e_ratio > + += (uint64_t)lround(temp); > + } > + } > + [snip] > +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms) > +{ > + struct KVMMsrEnergy *r = &s->msr_energy; > + int ret = 0; > + [snip] > + /* Retrieve the number of virtual sockets */ > + r->vsockets = ms->smp.sockets; RAPL's package domain is special: * When there's no die, it's package-scope. * When there's die level, it's die-scope. In SDM vol.3, char 15.10 PLATFORM SPECIFIC POWER MANAGEMENT SUPPORT, it said: Package domain is the processor die. So if a Guest has die level, thers MSRs should be shared in a die. Thus I guess the energy consumption status should also be distributed based on the die number. > + /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */ > + r->msr_value = g_new0(uint64_t, r->vcpus); > + > + /* Retrieve the CPUArchIDlist */ > + r->x86_cpu_list = x86_possible_cpu_arch_ids(ms); > + [snip] > +int is_rapl_enabled(void) > +{ > + const char *path = "/sys/class/powercap/intel-rapl/enabled"; This field does not ensure the existence of RAPL MSRs since TPMI would also enable this field. (See powercap_register_control_type() in drivers/powercap/intel_rapl_{msr,tpmi}.c) We can read RAPL MSRs directly. If we get 0 or failure, then there's no RAPL MSRs on Host. > + FILE *file = fopen(path, "r"); > + int value = 0; > + > + if (file != NULL) { > + if (fscanf(file, "%d", &value) != 1) { > + error_report("INTEL RAPL not enabled"); > + } > + fclose(file); > + } else { > + error_report("Error opening %s", path); > + } > + > + return value; > +} [snip] > +/* Get the physical package id from a given cpu id */ > +int vmsr_get_physical_package_id(int cpu_id) > +{ > + g_autofree char *file_contents = NULL; > + g_autofree char *file_path = NULL; > + int package_id = -1; > + gsize length; > + > + file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d\ > + /topology/physical_package_id", cpu_id); Similarly, based on the topological properties of RAPL, it is more accurate to also consider the die_id. Currently, only CLX-AP has multiple dies. 
So, we either have to put a limit on the Host's CPU model or consider the topology of the die here as well. I think the latter is a more general approach. > + > + if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) { > + goto out; > + } > + > + package_id = atoi(file_contents); > + > +out: > + return package_id; > +} > + Regards, Zhao
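A sketch of the die-aware lookup suggested above (the helper name is invented; the sysfs topology files are the standard Linux ones):

    static int vmsr_read_topo_field(int cpu_id, const char *name)
    {
        g_autofree char *path = g_strdup_printf(
            "/sys/devices/system/cpu/cpu%d/topology/%s", cpu_id, name);
        g_autofree char *contents = NULL;

        if (!g_file_get_contents(path, &contents, NULL, NULL)) {
            return -1;
        }
        return atoi(contents);
    }

    /*
     * e.g. pkg_id = vmsr_read_topo_field(cpu_id, "physical_package_id");
     *      die_id = vmsr_read_topo_field(cpu_id, "die_id");
     */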
Hi Zhao, Zhao Liu, Apr 17, 2024 at 12:07: > Hi Anthony, > > May I ask what your usage scenario is? Is it to measure Guest's energy > consumption and to charged per watt consumed? ;-) See previous email from Daniel. > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > > Date: Thu, 11 Apr 2024 14:14:34 +0200 > > From: Anthony Harivel <aharivel@redhat.com> > > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > > interface (Running Average Power Limit) for advertising the accumulated > > energy consumption of various power domains (e.g. CPU packages, DRAM, > > etc.). > > > > The consumption is reported via MSRs (model specific registers) like > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > > 64 bits registers that represent the accumulated energy consumption in > > micro Joules. They are updated by microcode every ~1ms. > > What is your current target platform? > > On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO > interface) and the TPMI based RAPL will be preferred in the future as > well: > * TPMI doc: https://github.com/intel/tpmi_power_management > * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c > > So do you have the plan to support RAPL-TPMI? Yes, I guess this would be inevitable in the future. But right now the lack of HW with this TPMI make hard to integrate it on day 1. > > For now, KVM always returns 0 when the guest requests the value of > > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle > > these MSRs dynamically in userspace. > > > > To limit the amount of system calls for every MSR call, create a new > > thread in QEMU that updates the "virtual" MSR values asynchronously. > > > > Each vCPU has its own vMSR to reflect the independence of vCPUs. The > > thread updates the vMSR values with the ratio of energy consumed of > > the whole physical CPU package the vCPU thread runs on and the > > thread's utime and stime values. > > > > All other non-vCPU threads are also taken into account. Their energy > > consumption is evenly distributed among all vCPUs threads running on > > the same physical CPU package. > > The package energy consumption includes core part and uncore part, where > uncore part consumption may not be able to be scaled based on vCPU > runtime ratio. > > When the uncore part consumption is small, the error in this part is > small, but if it is large, then the error generated by scaling by vCPU > runtime will be large. > So far we can only work with what Intel is giving us i.e Package power plane and DRAM power plane on server, which is the main target of this feature. Maybe in the future, Intel will expand the core power plane and the uncore power plane to server class CPU ? > May I ask what your usage scenario is? Is there significant uncore > consumption (e.g. GPU)? > Same answer as above: uncore/graphics power plane is only available on client class CPU. > Also, I think of a generic question is whether the error in this > calculation is measurable? Like comparing the RAPL status of the same > workload on Guest and bare metal to check the error. > > IIUC, this calculation is highly affected by native/sibling Guests, > especially in cloud scenarios where there are multiple Guests, the > accuracy of this algorithm needs to be checked. 
> Indeed, depending on where your vCPUs are running within the package (on the native or sibling CPU), you might observe different power consumption levels. However, I don't consider this to be a problem, as the ratio calculation takes into account the vCPU's location. We also need to approach the measurement differently. Due to the complexity of factors influencing power consumption, we must compare what is comparable. If you require precise power consumption data, use a power meter on the PSU of the server.It will provide the ultimate judgment. However, if you need an estimation to optimize software workloads in a guest, then this feature could be useful. All my tests have consistently shown reproducible output in terms of power consumption, which has convinced me that we can effectively work with it. > > To overcome the problem that reading the RAPL MSR requires priviliged > > access, a socket communication between QEMU and the qemu-vmsr-helper is > > mandatory. You can specified the socket path in the parameter. > > > > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock > > Based on the above comment, I suggest to rename this option as "rapl-msr" > to distinguish it from rapl-tpmi. Fair enough, I can rename this like that. > > In addition, RAPL is basically a CPU feature, I think it would be more > appropriate to make it as a x86 CPU's property. > > Your RAPL support actually provides a framework for assisting KVM > emulation in userspace, so this informs other feature support (maybe model > specific, or architectural) in the future. Enabling/disabling CPU features > via -cpu looks more natural. This is totally dependant of KVM because it used the KVM MSR Filtering to access userspace when a specific MSR is required. I can try to find a way to use -cpu for this feature and check if KVM is activated or not. > > [snip] > > > +High level implementation > > +------------------------- > > + > > +In order to update the value of the virtual MSR, a QEMU thread is created. > > +The thread is basically just an infinity loop that does: > > + > > +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in > > + Userspace and System) > > > > +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where > > + the QEMU threads are running on. > > + > > +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads > > + will do what they have to do and so the energy counter will increase. > > + > > +4. Repeat 2. and 3. and calculate the delta of every metrics representing the > > + time spent scheduled for each QEMU thread *and* the energy spent by the > > + packages during the pause. > > + > > +5. Filter the vcpu threads and the non-vcpu threads. > > + > > +6. Retrieve the topology of the Virtual Machine. This helps identify which > > + vCPU is running on which virtual package. > > + > > +7. The total energy spent by the non-vcpu threads is divided by the number > > + of vcpu threads so that each vcpu thread will get an equal part of the > > + energy spent by the QEMU workers. > > + > > +8. Calculate the ratio of energy spent per vcpu threads. > > + > > +9. Calculate the energy for each virtual package. > > + > > +10. The virtual MSRs are updated for each virtual package. Each vCPU that > > + belongs to the same package will return the same value when accessing the > > + the MSR. > > + > > +11. Loop back to 1. 
> > + > > +Ratio calculation > > +----------------- > > + > > +In Linux, a process has an execution time associated with it. The scheduler is > > +dividing the time in clock ticks. The number of clock ticks per second can be > > +found by the sysconf system call. A typical value of clock ticks per second is > > +100. So a core can run a process at the maximum of 100 ticks per second. If a > > +package has 4 cores, 400 ticks maximum can be scheduled on all the cores > > +of the package for a period of 1 second. > > + > > +The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed time of a > > +process with the [pid] as the process ID. It gives the amount of ticks the > > +process has been scheduled in userspace (utime) and kernel space (stime). > > + > > I understand tick would ignore frequency changes, e.g., HWP's auto-pilot > or turbo boost. All these CPU frequency change would impact on core energy > consumption. > > I think the better way would be to use APERF counter, but unfortunately it > lacks virtualization support (for Intel). > > Due to such considerations, it may be more worthwhile to evaluate the > accuracy of this tick-based algorithm. > I've evaluated such things with another tool called Kepler [1]. This tool calculate the power ratio with metrics from RAPL and uses either eBPF or the tick based systems for time metrics. The eBPF part [2] is triggered on each 'finish_task_switch' of Thread and calculate the delta of cpu cycle, cache miss, cpu time, etc. Very complex. My tests showed that the difference between using eBPF and tick based ratio is really not that important. Maybe on some special cases, using eBPF would show a way better accuracy but I'm not aware of that. [1]: https://github.com/sustainable-computing-io/kepler [2]: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c > [snip] > > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > > index e68cbe929302..3de69caa229e 100644 > > --- a/target/i386/kvm/kvm.c > > +++ b/target/i386/kvm/kvm.c > > [snip] > > > +static void *kvm_msr_energy_thread(void *data) > > +{ > > + KVMState *s = data; > > + struct KVMMsrEnergy *vmsr = &s->msr_energy; > > + > > + g_autofree package_energy_stat *pkg_stat = NULL; > > + g_autofree thread_stat *thd_stat = NULL; > > + g_autofree pid_t *thread_ids = NULL; > > + g_autofree CPUState *cpu = NULL; > > + g_autofree unsigned int *vpkgs_energy_stat = NULL; > > + unsigned int num_threads = 0; > > + unsigned int tmp_num_threads = 0; > > [snip] > > > + /* Sleep a short period while the other threads are working */ > > + usleep(MSR_ENERGY_THREAD_SLEEP_US); > > Is it possible to passively read the energy status? i.e. access the Host > MSR and calculate the energy consumption for the Guest when the Guest > triggers the relevant exit. > Yes it could be possible. But what I wanted to avoid with my approach is the overhead it could take when accessing a RAPL MSR in the Guest. The value is always available and return very quickly. I'm not sure about the overhead if we have to have to access the MSR, then do the calculation and so on. > I think this might make the error larger, but not sure the error would > be so large as to be unacceptable. > I'm a bit concerned about the potential overflow in calculation if the time in between is too big. 
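For concreteness, the tick-based attribution discussed here boils down to something like the following; the semantics of vmsr_get_ratio() are assumed from the documentation above, not copied from the patch:

    /*
     * Assumed semantics: scale the package energy delta by the thread's
     * share of the package's schedulable ticks.
     *
     *   energy(thread) = e_delta * delta_ticks / maxticks
     *
     * Made-up example using the documentation's 4-core package
     * (maxticks = 400 for a 1 second sample): if the package energy
     * delta is e_delta = 20 units and a vCPU thread was scheduled for
     * delta_ticks = 50, it is attributed 20 * 50 / 400 = 2.5 units.
     */
    static double ratio_sketch(uint64_t e_delta,
                               unsigned long long delta_ticks,
                               unsigned int maxticks)
    {
        return (double)e_delta * (double)delta_ticks / (double)maxticks;
    }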
> [snip]
>
> > +    /* Retrieve the virtual package number of each vCPU */
> > +    for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
> > +        for (int j = 0; j < num_threads; j++) {
> > +            if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
> > +                && (thd_stat[j].is_vcpu == true)) {
> > +                x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
> > +                    &vmsr->topo_info, &topo_ids);
> > +                thd_stat[j].vpkg_id = topo_ids.pkg_id;
> > +            }
> > +        }
> > +    }
>
> We lack package instances as well as MSR topology, so we have to deal
> with this pain...
>
> Similarly, my attempt at other package/die level MSRs [1] is to only
> allow Guest to set a package/die.
>
> But I think that in the future with QOM-topo [2] (i.e. creating package/
> die instances), MSR topology can have an easier implementation, though,
> looks there's a long way to go!
>
> [1]: https://lore.kernel.org/qemu-devel/20240203093054.412135-4-zhao1.liu@linux.intel.com/
> [2]: https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@linux.intel.com/
>

Oh, great! Good to know! Topology is indeed a nightmare, so good luck
with that!!

> > +    /* Calculate the total energy of all non-vCPU thread */
> > +    for (int i = 0; i < num_threads; i++) {
> > +        double temp;
> > +        if ((thd_stat[i].is_vcpu != true) &&
> > +            (thd_stat[i].delta_ticks > 0)) {
> > +            temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
> > +                thd_stat[i].delta_ticks,
> > +                vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
> > +            pkg_stat[thd_stat[i].pkg_id].e_ratio
> > +                += (uint64_t)lround(temp);
> > +        }
> > +    }
> > +
>
> [snip]
>
> > +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> > +{
> > +    struct KVMMsrEnergy *r = &s->msr_energy;
> > +    int ret = 0;
> > +
>
> [snip]
>
> > +    /* Retrieve the number of virtual sockets */
> > +    r->vsockets = ms->smp.sockets;
>
> RAPL's package domain is special:
> * When there's no die, it's package-scope.
> * When there's die level, it's die-scope.
>
> In SDM vol.3, char 15.10 PLATFORM SPECIFIC POWER MANAGEMENT SUPPORT, it
> said: Package domain is the processor die.
>
> So if a Guest has die level, thers MSRs should be shared in a die. Thus
> I guess the energy consumption status should also be distributed based on
> the die number.
>

Right, I've missed that totally. Thanks for letting me know.
I'm going to have a deeper look at this.

> > +    /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */
> > +    r->msr_value = g_new0(uint64_t, r->vcpus);
> > +
> > +    /* Retrieve the CPUArchIDlist */
> > +    r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
> > +
>
> [snip]
>
> > +int is_rapl_enabled(void)
> > +{
> > +    const char *path = "/sys/class/powercap/intel-rapl/enabled";
>
> This field does not ensure the existence of RAPL MSRs since TPMI would
> also enable this field. (See powercap_register_control_type() in
> drivers/powercap/intel_rapl_{msr,tpmi}.c)
>

But this is exactly what it is intended to do. Whether it is MSR or
TPMI, checking this tells me whether RAPL is activated or not. If it is
activated...

> We can read RAPL MSRs directly. If we get 0 or failure, then there's no
> RAPL MSRs on Host.
>

... then it is safe to access the RAPL MSR. I would not like to access
this on an arbitrary CPU without knowing that I can!
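Stepping back to the vmsr_get_ratio() call quoted earlier in this message:
its implementation is not shown in the excerpt, but based on the
documentation above (package energy delta scaled by a thread's share of the
package's maximum schedulable ticks) it plausibly boils down to the toy
calculation below. Treat the helper name and the numbers as illustrative
only, not as the patch's actual code.

    #include <stdint.h>
    #include <stdio.h>

    /* illustrative stand-in for the ratio helper; not the patch's implementation */
    static double ratio_sketch(uint64_t e_delta, uint64_t delta_ticks,
                               uint64_t maxticks)
    {
        return (double)e_delta * delta_ticks / maxticks;
    }

    int main(void)
    {
        uint64_t maxticks = 400;             /* 4-core package, 100 ticks/s, 1 s */
        uint64_t e_delta = 20 * 1000 * 1000; /* package consumed 20 J in the sample */

        /* a thread that ran 40 ticks (10% of capacity) is charged 2 J */
        printf("thread share: %.0f uJ\n", ratio_sketch(e_delta, 40, maxticks));
        return 0;
    }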
> > +    FILE *file = fopen(path, "r");
> > +    int value = 0;
> > +
> > +    if (file != NULL) {
> > +        if (fscanf(file, "%d", &value) != 1) {
> > +            error_report("INTEL RAPL not enabled");
> > +        }
> > +        fclose(file);
> > +    } else {
> > +        error_report("Error opening %s", path);
> > +    }
> > +
> > +    return value;
> > +}
>
> [snip]
>
> > +/* Get the physical package id from a given cpu id */
> > +int vmsr_get_physical_package_id(int cpu_id)
> > +{
> > +    g_autofree char *file_contents = NULL;
> > +    g_autofree char *file_path = NULL;
> > +    int package_id = -1;
> > +    gsize length;
> > +
> > +    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d\
> > +                /topology/physical_package_id", cpu_id);
>
> Similarly, based on the topological properties of RAPL, it is more
> accurate to also consider the die_id.
>
> Currently, only CLX-AP has multiple dies. So, we either have to put a
> limit on the Host's CPU model or consider the topology of the die here
> as well. I think the latter is a more general approach.
>

Alright. I take note of that and I will check the die approach.

> > +
> > +    if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
> > +        goto out;
> > +    }
> > +
> > +    package_id = atoi(file_contents);
> > +
> > +out:
> > +    return package_id;
> > +}
> > +
>
> Regards,
> Zhao

Thanks a lot for all your feedback, Zhao!

Regards,
Anthony
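A small sketch of the sysfs lookup being discussed, extended with die_id as
suggested in the review. This is not the patch's code; it simply shows that
the same read works for both attributes, returning -1 when an attribute is
absent (e.g. die_id on hosts without die enumeration).

    #include <stdio.h>

    /* read one integer topology attribute of a host CPU, -1 if unavailable */
    static int read_topology_id(int cpu_id, const char *name)
    {
        char path[128];
        int id = -1;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu_id, name);

        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &id) != 1) {
                id = -1;
            }
            fclose(f);
        }
        return id;
    }

    int main(void)
    {
        int cpu = 0;
        printf("cpu%d: package %d, die %d\n", cpu,
               read_topology_id(cpu, "physical_package_id"),
               read_topology_id(cpu, "die_id"));
        return 0;
    }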
Hi Anthony, On Thu, Apr 18, 2024 at 12:52:14PM +0200, Anthony Harivel wrote: > Date: Thu, 18 Apr 2024 12:52:14 +0200 > From: Anthony Harivel <aharivel@redhat.com> > Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > The package energy consumption includes core part and uncore part, where > > uncore part consumption may not be able to be scaled based on vCPU > > runtime ratio. > > > > When the uncore part consumption is small, the error in this part is > > small, but if it is large, then the error generated by scaling by vCPU > > runtime will be large. > > > > So far we can only work with what Intel is giving us i.e Package power > plane and DRAM power plane on server, which is the main target of > this feature. Maybe in the future, Intel will expand the core power > plane and the uncore power plane to server class CPU ? Not future features, I'd like to illustrate the impact of the uncore part (iGPU/NPU on Client or various accelerators on Server) on this algorithm. Because the consumption of the uncore part is complex and not necessarily linearly related to the vCPU task running time. It might be worth to state potential impact on accuracy of uncore parts to doc (I doubt that heavy uncore consumption will even affect the consistency of the energy trend as you said). Anyway, clearer scenarios help this feature get used. > > May I ask what your usage scenario is? Is there significant uncore > > consumption (e.g. GPU)? > > > > Same answer as above: uncore/graphics power plane is only available on > client class CPU. Yes, iGPU is, but server may have other accelerators, e.g., DSA/IAA/QAT on SPR. > > Also, I think of a generic question is whether the error in this > > calculation is measurable? Like comparing the RAPL status of the same > > workload on Guest and bare metal to check the error. > > > > IIUC, this calculation is highly affected by native/sibling Guests, > > especially in cloud scenarios where there are multiple Guests, the > > accuracy of this algorithm needs to be checked. > > > > Indeed, depending on where your vCPUs are running within the package (on > the native or sibling CPU), you might observe different power > consumption levels. However, I don't consider this to be a problem, as > the ratio calculation takes into account the vCPU's location. > > We also need to approach the measurement differently. Due to the > complexity of factors influencing power consumption, we must compare > what is comparable. If you require precise power consumption data, > use a power meter on the PSU of the server.It will provide the > ultimate judgment. However, if you need an estimation to optimize > software workloads in a guest, then this feature could be useful. All my > tests have consistently shown reproducible output in terms of power > consumption, which has convinced me that we can effectively work with > it. Thanks, another mail in which you illustrated that the trend is consistent. [snip] > > > > In addition, RAPL is basically a CPU feature, I think it would be more > > appropriate to make it as a x86 CPU's property. > > > > Your RAPL support actually provides a framework for assisting KVM > > emulation in userspace, so this informs other feature support (maybe model > > specific, or architectural) in the future. Enabling/disabling CPU features > > via -cpu looks more natural. > > This is totally dependant of KVM because it used the KVM MSR Filtering > to access userspace when a specific MSR is required. 
Yes, but in other words, other KVM based features (completely hardware virtualization) are also configured by -cpu. This RAPL is still a CPU feature and just need KVM's help. > I can try to find a way to use -cpu for this feature and check if KVM is > activated or not. > [snip] > > > > I understand tick would ignore frequency changes, e.g., HWP's auto-pilot > > or turbo boost. All these CPU frequency change would impact on core energy > > consumption. > > > > I think the better way would be to use APERF counter, but unfortunately it > > lacks virtualization support (for Intel). > > > > Due to such considerations, it may be more worthwhile to evaluate the > > accuracy of this tick-based algorithm. > > > > I've evaluated such things with another tool called Kepler [1]. This > tool calculate the power ratio with metrics from RAPL and uses either > eBPF or the tick based systems for time metrics. Thanks for this information! I understand current tick based algorithm is a common approximation in the industry (like Kepler), right? > The eBPF part [2] is > triggered on each 'finish_task_switch' of Thread and calculate the delta > of cpu cycle, cache miss, cpu time, etc. Very complex. My tests showed > that the difference between using eBPF and tick based ratio is really > not that important. Maybe on some special cases, using eBPF would show > a way better accuracy but I'm not aware of that. Good to know! Just curious, so for using Kepler in Guest to optimize the task (your use case), is it only necessary that the trends are similar and there is no requirement for accuracy of the values? > [1]: https://github.com/sustainable-computing-io/kepler > [2]: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c > [snip] > > > > > + /* Sleep a short period while the other threads are working */ > > > + usleep(MSR_ENERGY_THREAD_SLEEP_US); > > > > Is it possible to passively read the energy status? i.e. access the Host > > MSR and calculate the energy consumption for the Guest when the Guest > > triggers the relevant exit. > > > > Yes it could be possible. But what I wanted to avoid with my approach is > the overhead it could take when accessing a RAPL MSR in the Guest. > The value is always available and return very quickly. > I'm not sure about the overhead if we have to have to access the MSR, > then do the calculation and so on. > > > I think this might make the error larger, but not sure the error would > > be so large as to be unacceptable. > > > > I'm a bit concerned about the potential overflow in calculation if the > time in between is too big. Okay. [snip] > > > +int is_rapl_enabled(void) > > > +{ > > > + const char *path = "/sys/class/powercap/intel-rapl/enabled"; > > > > This field does not ensure the existence of RAPL MSRs since TPMI would > > also enable this field. (See powercap_register_control_type() in > > drivers/powercap/intel_rapl_{msr,tpmi}.c) > > > > But this is exactly what it is intended to do. Wether it is MSR of TPMI, > checking this, tell me that RAPL is activated or not. If it is > activated... > > > We can read RAPL MSRs directly. If we get 0 or failure, then there's no > > RAPL MSRs on Host. > > > > ... then it is safe to access the RAPL MSR. I would not like to access > this on a XYZ cpu without knowing I can ! RAPL will disappear and this field can't ease your concern. ;-) Even if there are no commercially available machines yet, the logic of the linux driver has shown that relying on this item is unreliable. 
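One possible shape of the "read the MSR directly" probe suggested in this
exchange, sketched as a standalone program rather than patch code: read
MSR_PKG_ENERGY_STATUS (0x611) through the Linux msr driver and treat a read
failure or an all-zero counter as "no usable RAPL MSR". It assumes root
privileges (or going through the privileged helper) and the msr kernel
module being loaded.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_PKG_ENERGY_STATUS 0x611

    /* probe: can we actually read a plausible package energy counter? */
    static int host_has_rapl_msr(void)
    {
        uint64_t value = 0;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0) {
            return 0;                 /* no msr driver loaded or no permission */
        }
        ssize_t r = pread(fd, &value, sizeof(value), MSR_PKG_ENERGY_STATUS);
        close(fd);

        /* the energy counter lives in bits 31:0; zero suggests it is not implemented */
        return r == (ssize_t)sizeof(value) && (uint32_t)value != 0;
    }

    int main(void)
    {
        printf("RAPL package energy MSR usable: %s\n",
               host_has_rapl_msr() ? "yes" : "no");
        return 0;
    }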
For model specific features, either use the CPU model ID to know if it
is supported, or access the relevant MSR directly.

Even if the feature is not supported, the register address should not be
used for other purposes, so I don't think there is an information leak
problem here.

[snip]

>
> Thanks a lot for all your feedback Zhao !
>

You're welcome! I've also been thinking about how to support the
emulation of other thermal features in user space in QEMU, as you did
with RAPL.

Regards,
Zhao
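For the "CPU model ID" route mentioned above, a minimal sketch of reading
DisplayFamily/DisplayModel via CPUID is shown below. The allow-list is
deliberately reduced to the two Sandy Bridge models as an illustration; a
real check would mirror the much longer model tables of the Linux
intel_rapl_msr driver.

    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    /* return 1 and fill DisplayFamily/DisplayModel if this is an Intel CPU */
    static int intel_family_model(unsigned *family, unsigned *model)
    {
        unsigned eax, ebx, ecx, edx;
        char vendor[13] = { 0 };

        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
            return 0;
        }
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        if (strcmp(vendor, "GenuineIntel") != 0) {
            return 0;
        }

        __get_cpuid(1, &eax, &ebx, &ecx, &edx);
        *family = (eax >> 8) & 0xf;
        *model = (eax >> 4) & 0xf;
        if (*family == 0x6 || *family == 0xf) {
            *model |= ((eax >> 16) & 0xf) << 4;    /* extended model bits */
        }
        if (*family == 0xf) {
            *family += (eax >> 20) & 0xff;         /* extended family bits */
        }
        return 1;
    }

    int main(void)
    {
        unsigned family, model;

        if (!intel_family_model(&family, &model)) {
            printf("not an Intel CPU\n");
            return 1;
        }
        /* 0x2a/0x2d are the Sandy Bridge client/server models; real tables are longer */
        int match = (family == 6) && (model == 0x2a || model == 0x2d);
        printf("family 0x%x model 0x%x, Sandy Bridge match: %s\n",
               family, model, match ? "yes" : "no");
        return 0;
    }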
On Wed, Apr 17, 2024 at 06:07:02PM +0800, Zhao Liu wrote: > Hi Anthony, > > May I ask what your usage scenario is? Is it to measure Guest's energy > consumption and to charged per watt consumed? ;-) > > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > > Date: Thu, 11 Apr 2024 14:14:34 +0200 > > From: Anthony Harivel <aharivel@redhat.com> > > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > > interface (Running Average Power Limit) for advertising the accumulated > > energy consumption of various power domains (e.g. CPU packages, DRAM, > > etc.). > > > > The consumption is reported via MSRs (model specific registers) like > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > > 64 bits registers that represent the accumulated energy consumption in > > micro Joules. They are updated by microcode every ~1ms. > > What is your current target platform? I think we can assume /all/ future CPUs are conceptially in scope for this. The use case is to allow guest owners to monitor the power consumption of their workloads, so they can take steps to optimize their guest VM workloads to reduce power consumed. > On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO > interface) and the TPMI based RAPL will be preferred in the future as > well: Is the MSR based interface likely to be removed in future silicon, or it will be remain for back compat ? > * TPMI doc: https://github.com/intel/tpmi_power_management > * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c > With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Hi Daniel, On Wed, Apr 17, 2024 at 01:27:03PM +0100, Daniel P. Berrangé wrote: > Date: Wed, 17 Apr 2024 13:27:03 +0100 > From: "Daniel P. Berrangé" <berrange@redhat.com> > Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > On Wed, Apr 17, 2024 at 06:07:02PM +0800, Zhao Liu wrote: > > Hi Anthony, > > > > May I ask what your usage scenario is? Is it to measure Guest's energy > > consumption and to charged per watt consumed? ;-) > > > > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > > > Date: Thu, 11 Apr 2024 14:14:34 +0200 > > > From: Anthony Harivel <aharivel@redhat.com> > > > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > > > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > > > interface (Running Average Power Limit) for advertising the accumulated > > > energy consumption of various power domains (e.g. CPU packages, DRAM, > > > etc.). > > > > > > The consumption is reported via MSRs (model specific registers) like > > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > > > 64 bits registers that represent the accumulated energy consumption in > > > micro Joules. They are updated by microcode every ~1ms. > > > > What is your current target platform? > > I think we can assume /all/ future CPUs are conceptially in scope > for this. > > The use case is to allow guest owners to monitor the power consumption > of their workloads, so they can take steps to optimize their guest VM > workloads to reduce power consumed. Thanks for the explanation! > > On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO > > interface) and the TPMI based RAPL will be preferred in the future as > > well: > > Is the MSR based interface likely to be removed in future silicon, > or it will be remain for back compat ? For Xeon, GNR will have both TMPI & MSR RAPL, but eventually MSR RAPL will be removed. Therefore, if RAPL support is desired for all future Xeons, then it's necessary to consider TMPI as the next plan. Alternatively, the whole RAPL scope can be split into rapl-msr and rapl-tpmi features. > > * TPMI doc: https://github.com/intel/tpmi_power_management > > * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c > > Regards, Zhao
Hi Zhao, Daniel, Zhao Liu, Apr 17, 2024 at 17:13: > Hi Daniel, > > On Wed, Apr 17, 2024 at 01:27:03PM +0100, Daniel P. Berrangé wrote: > > Date: Wed, 17 Apr 2024 13:27:03 +0100 > > From: "Daniel P. Berrangé" <berrange@redhat.com> > > Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > > On Wed, Apr 17, 2024 at 06:07:02PM +0800, Zhao Liu wrote: > > > Hi Anthony, > > > > > > May I ask what your usage scenario is? Is it to measure Guest's energy > > > consumption and to charged per watt consumed? ;-) > > > > > > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote: > > > > Date: Thu, 11 Apr 2024 14:14:34 +0200 > > > > From: Anthony Harivel <aharivel@redhat.com> > > > > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu > > > > > > > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL > > > > interface (Running Average Power Limit) for advertising the accumulated > > > > energy consumption of various power domains (e.g. CPU packages, DRAM, > > > > etc.). > > > > > > > > The consumption is reported via MSRs (model specific registers) like > > > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are > > > > 64 bits registers that represent the accumulated energy consumption in > > > > micro Joules. They are updated by microcode every ~1ms. > > > > > > What is your current target platform? > > > > I think we can assume /all/ future CPUs are conceptially in scope > > for this. > > > > The use case is to allow guest owners to monitor the power consumption > > of their workloads, so they can take steps to optimize their guest VM > > workloads to reduce power consumed. > > Thanks for the explanation! > Thanks Daniel for stepping in on the explanation. > > > On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO > > > interface) and the TPMI based RAPL will be preferred in the future as > > > well: > > > > Is the MSR based interface likely to be removed in future silicon, > > or it will be remain for back compat ? > > For Xeon, GNR will have both TMPI & MSR RAPL, but eventually MSR RAPL > will be removed. Therefore, if RAPL support is desired for all future > Xeons, then it's necessary to consider TMPI as the next plan. > > Alternatively, the whole RAPL scope can be split into rapl-msr and > rapl-tpmi features. > I'm aware of the MSR/TPMI RAPL that will appear in the future, and I would like to share my perspective on that. Firstly, we can safely assume that it will take years before all server hardware is transitioned to the new GNR (or future XEON without RAPL MSR). It may be around 2024 when these features could be integrated into QEMU. While the adoption of this feature might take some time, I'm optimistic that once it's implemented, people will finally have the tools to optimize workloads inside VMs and start reducing power consumption. Secondly, the second-hand server market is substantial. This means that with the Virtual RAPL MSR, all XEON processors starting from Sandy Bridge (2012!) will have the potential for software optimization. Making the most of existing resources is essential for sustainability. Lastly, when the TPMI becomes available in hardware in the future, the RAPL interface and ratio calculation will remain the same, with only the method of obtaining host values changing. This transition should be manageable. Regards, Anthony
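As a purely hypothetical illustration of the point above (none of these
names exist in the patch): if the host-side energy read sits behind a tiny
backend interface, the sampling and ratio code does not care whether the
numbers come from the RAPL MSRs today or from a TPMI-based source later.

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical backend interface; these names are not from the patch */
    typedef struct EnergyBackend {
        const char *name;
        /* accumulated package energy in microjoules, 0 on failure */
        uint64_t (*read_pkg_energy_uj)(unsigned package);
    } EnergyBackend;

    static uint64_t msr_read_pkg_energy_uj(unsigned package)
    {
        (void)package;
        return 0;   /* would go through qemu-vmsr-helper and MSR_PKG_ENERGY_STATUS */
    }

    static uint64_t tpmi_read_pkg_energy_uj(unsigned package)
    {
        (void)package;
        return 0;   /* would read the MMIO-based TPMI RAPL registers instead */
    }

    static const EnergyBackend backends[] = {
        { "rapl-msr",  msr_read_pkg_energy_uj  },
        { "rapl-tpmi", tpmi_read_pkg_energy_uj },
    };

    int main(void)
    {
        /* the sampling/ratio code would only ever see the backend pointer */
        for (unsigned i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
            printf("%s: pkg0 = %llu uJ\n", backends[i].name,
                   (unsigned long long)backends[i].read_pkg_energy_uj(0));
        }
        return 0;
    }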