In my previous role working on NVMe drivers, I had to ramp up on PCIe devices and used QEMU to get some hands-on experience.

I thought it would be helpful to build a simple example PCIe device, and I documented it while reorganizing my Obsidian notes.

Host Setup

The example uses the following setup:

| Software              | Version |
|-----------------------|---------|
| CMake                 | 3.31.7  |
| GCC                   | 12.3    |
| Debian (Guest OS)     | 12      |
| QEMU (x86_64 Machine) | 10.0    |

You can find the source code and scripts in the repository. Refer to the README.md for more detailed setup instructions.

QEMU PCIe Device

This section walks through how a simple PCIe device is implemented in QEMU.

Requirements

The device is designed with these goals:

  1. Discoverable by Linux.
  2. Simple MMIO-accessible registers, plus a few extra registers for testability.
  3. Perform a basic DMA operation.
  4. Support MSI-X interrupts.

This post focuses on requirements 1 and 2. DMA and MSI-X will be covered in a follow-up post.

Register Layout

The device implements the following registers:

| Address (Hex) | Register          | Bit Field | Description | Type | Default Value |
|---------------|-------------------|-----------|-------------|------|---------------|
| 00            | Control           | [0]       | start       | RW   | 0x0           |
|               |                   | [2:1]     | type        | RW   | 0x0           |
|               |                   | [31]      | reset       | RW   | 0x0           |
| 04            | Status            | [0]       | busy        | RO   | 0x0           |
| 08            | Interrupt Mask    | [0]       | mask0       | RW   | 0x0           |
| 0C            | Interrupt Status  | [0]       | int0        | RW   | 0x0           |
| 10            | Interrupt Trigger | [0]       | trigger0    | RW   | 0x0           |
| 14            | -                 | -         | -           | -    | -             |
| 18            | Scratch           | [31:0]    | scratch     | RW   | 0x0           |
| 1C            | Version           | [7:0]     | minor       | RO   | 0x1           |
|               |                   | [15:8]    | major       | RO   | 0x1           |

The Interrupt Trigger, Scratch, and Version registers are for testing purposes only.

The DMA descriptor definition will be covered in a follow-up post.

Adding the Device

The new device is added to hw/misc/pcie-testdevice.c.

To include it in the QEMU build, the following files are updated:

hw/misc/Kconfig:

     default y if TEST_DEVICES
     depends on PCI
+
+config PCIE_TESTDEVICE
+    bool
+    default y if TEST_DEVICES
+    depends on PCI

hw/misc/meson.build:

 # HPPA devices
 system_ss.add(when: 'CONFIG_LASI', if_true: files('lasi.c'))
+
+# PCIE test device
+system_ss.add(when: 'CONFIG_PCIE_TESTDEVICE', if_true: files('pcie-testdevice.c'))

QEMU Object Model (QOM)

Devices are implemented using the QEMU Object Model (QOM), which requires defining a TypeInfo structure to describe the device type and registering it with QEMU. This structure tells QEMU how to define, initialize, and register your device’s type and its class.

#define TYPE_PCIE_TEST_DEVICE        "pcie-test-device"
 
static const TypeInfo pcie_test_device_info = {
    .name = TYPE_PCIE_TEST_DEVICE,
    .parent = TYPE_PCI_DEVICE,
    .instance_size = sizeof(PcieTestDevice),
    .instance_init = pcie_test_device_init,
    .instance_finalize = pcie_test_device_finalize,
    .class_init = pcie_test_device_class_init,
    .interfaces =
        (InterfaceInfo[]){
            {INTERFACE_PCIE_DEVICE},
            {},
        },
};
 
static void pcie_test_device_register_types(void) { type_register_static(&pcie_test_device_info); }
 
type_init(pcie_test_device_register_types)

TypeInfo specifies:

  • The type name, defined by the macro TYPE_PCIE_TEST_DEVICE
  • The parent type, TYPE_PCI_DEVICE
  • The class initializer, pcie_test_device_class_init
  • The instance initializer, pcie_test_device_init

Note that in this example, the device inherits all standard PCI functionality (including DMA and MSI support) from PCI_DEVICE.

The device state is captured in a PcieTestDevice struct:

typedef struct PcieTestDevice {
    PCIDevice parentPci;
 
    /* PCIe BARs*/
    MemoryRegion bar0; /* BAR0 MMIO device registers */
    uint32_t regs[PCIE_TEST_DEVICE_MIMO_MAX_SIZE_DWORDS];
 
    MemoryRegion mem; /* BAR1 */
 
    PCIExpLinkSpeed speed;
    PCIExpLinkWidth width;
} PcieTestDevice;

The bar0 memory region and the regs array together model the device’s MMIO registers.

To handle register reads and writes, we need to define read/write callbacks and store them in a MemoryRegionOps structure.

static const MemoryRegionOps bar_ops = {
    .read = mmio_read,
    .write = mmio_write,
...
};

The callbacks are registered with a MemoryRegion representing BAR0 (bar0), which models the device’s MMIO register space. This memory region is then linked to BAR0 of the PCI device using pci_register_bar().

memory_region_init_io(&d->bar0, OBJECT(d), &bar_ops, d, "pcie-test-device-bar0", PCIE_TEST_DEVICE_MIMO_MAX_SIZE_BYTES);
pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->bar0);

This sets up a working MMIO region and fulfills requirements 1 & 2 mentioned above.

To build the example device, run the provided qemu-build.sh script. You can verify the device instantiation (i.e., confirm that both class_init and instance_init run correctly) by querying its properties:

 ./external/qemu/build/qemu-system-x86_64 -device pcie-test-device,help
pcie-test-device options:
  acpi-index=<uint32>    -  (default: 0)
  addr=<str>             - Slot and optional function number, example: 06.0 or 06 (default: -1)
  busnr=<busnr>
  failover_pair_id=<str>
  multifunction=<bool>   - on/off (default: off)
  rombar=<int32>         -  (default: -1)
  romfile=<str>
  romsize=<uint32>       -  (default: 4294967295)
  x-max-bounce-buffer-size=<size> - Maximum buffer size allocated for bounce buffers used for mapped access to indirect DMA memory (default: 4096)
  x-pcie-ari-nextfn-1=<bool> - on/off (default: off)
  x-pcie-err-unc-mask=<bool> - on/off (default: on)
  x-pcie-ext-tag=<bool>  - on/off (default: on)
  x-pcie-extcap-init=<bool> - on/off (default: on)
  x-pcie-lnksta-dllla=<bool> - on/off (default: on)
  x-speed=<PCIELinkSpeed> - 2_5/5/8/16/32/64 (default: 2_5)
  x-width=<PCIELinkWidth> - 1/2/4/8/12/16/32 (default: 16)
 

Simple Checks

This section uses Linux to verify that the PCIe device “works”.

Linux Utilities

To verify that the device is properly discovered, use lspci. Since the PCIe Device ID was set to 0xABBA in the pcie_test_device_class_init function, you can check for its presence using lspci | grep abba.

In the output below, the device 1234:abba shows up with a BDF of 00:04.0.

root@debian:/home/andre# lspci | grep abba
00:04.0 RAM memory: Device 1234:abba (rev 01)

To see detailed information about the device, including its capabilities and resources, use the -vvv option:

root@debian:/home/andre# lspci -s 04.0 -vvv
00:04.0 RAM memory: Device 1234:abba (rev 01)
	Subsystem: Red Hat, Inc. Device 1100
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 10
	Region 0: Memory at febe6000 (32-bit, non-prefetchable) [size=256]
	Region 1: Memory at febd0000 (32-bit, non-prefetchable) [size=64K]
	Region 3: Memory at febe7000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [4c] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 4
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [40] MSI-X: Enable- Count=1 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00000800
 

This shows that:

  • I/O+: I/O space access is enabled (legacy access)
  • Mem+: The device responds to MMIO accesses
  • BusMaster-: Bus mastering is currently disabled (it must be enabled later to perform DMA)

Regions:

  • Region 0: MMIO region
  • Region 1: Device memory
  • Region 3: MSI-X vector table and PBA

Capabilities:

  • Capability @ 0x40: MSI-X (present but not yet enabled, one vector)
  • Capability @ 0x4c: Express (v2) Root Complex Integrated Endpoint

You can also see the raw PCI header using the -x option.

root@debian:/home/andre# lspci -s 04.0 -x
00:04.0 RAM memory: Device 1234:abba (rev 01)
00: 34 12 ba ab 03 01 10 00 01 00 00 05 00 00 00 00
10: 00 60 be fe 00 00 bd fe 00 00 00 00 00 70 be fe
20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 00 00 4c 00 00 00 00 00 00 00 0a 01 00 00

Examining the PCI configuration space, we can inspect the Command Register at offset 0x04, which has a value of 0x0103. The table below provides the bit definitions for this register, allowing us to decode and interpret its value.

| Bit | Name | Description |
|-----|------|-------------|
| 0 | I/O Space Enable | When set (1), allows the device to respond to I/O space accesses/transactions. |
| 1 | Memory Space Enable | When set (1), allows the device to respond to memory space accesses/transactions. |
| 2 | Bus Master Enable | When set (1), allows the device to initiate bus transactions (act as a bus master for DMA in PCI, or a Requester in PCIe). |
| 3 | Special Cycle Enable | PCI: When set (1), allows the device to monitor Special Cycle operations. PCIe: Hardwired to 0 (Special Cycles are not used). |
| 4 | Memory Write and Invalidate Enable | PCI: When set (1), allows the device to use the Memory Write and Invalidate command for cache coherency. PCIe: Hardwired to 0 (PCI-specific cache mechanism not used). |
| 5 | VGA Palette Snoop | PCI: When set (1), enables a non-VGA device to snoop VGA palette register writes. PCIe: Hardwired to 0 (legacy feature not relevant). |
| 6 | Parity Error Response | When set (1), the device takes its normal action upon detecting a parity error (e.g., sets Status Register bit, may assert PERR#). If clear (0), it may ignore some parity errors. |
| 7 | Wait Cycle Control | PCI: Historically “IDSEL Stepping” or “Wait Cycle Control.” When set (1), enabled address/data stepping or specific wait state behavior. PCIe: Hardwired to 0 (PCI bus timing not applicable). |
| 8 | SERR# Enable | When set (1), enables the device to drive the SERR# (System Error) signal (PCI) or report system errors (PCIe) when it detects a serious error. |
| 9 | Fast Back-to-Back Enable | PCI: When set (1), allows capable devices to perform fast back-to-back transactions to the same agent. PCIe: Hardwired to 0 (not applicable to PCIe’s protocol). |
| 10 | Interrupt Disable | When set (1), disables the device’s ability to generate legacy INTx interrupts (INTA#, INTB#, etc.). Does not affect MSI or MSI-X. |

In this example, 0x0103 has bits 0, 1, and 8 set: I/O space, memory space, and SERR# are enabled, while bus mastering (bit 2) is clear. This matches the Control line in the earlier lspci -vvv output:

  • I/O Space → Enabled
  • Memory Space → Enabled
  • Bus Master → Disabled
  • SERR# → Enabled

Bus Mastering is initially disabled, meaning the device cannot initiate DMA or other bus transactions. To enable DMA support, we need to explicitly set the Bus Master Enable bit in the Command Register. This can be done by using the setpci utility to write to the Command Register.

root@debian:/home/andre# setpci -s 04.0 COMMAND=0107
root@debian:/home/andre# lspci -s 04.0 -x
00:04.0 RAM memory: Device 1234:abba (rev 01)
00: 34 12 ba ab 07 01 10 00 01 00 00 05 00 00 00 00
10: 00 60 be fe 00 00 bd fe 00 00 00 00 00 70 be fe
...

Re-running lspci shows that BusMaster is now marked with a +, indicating that it is enabled.

root@debian:/home/andre# lspci -s 04.0 -vvv
00:04.0 RAM memory: Device 1234:abba (rev 01)
	Subsystem: Red Hat, Inc. Device 1100
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
....

Checking Device Registers

Linux exposes PCIe device resources via sysfs. Each device has a directory under /sys/bus/pci/devices/, named after its domain:bus:device.function address.

root@debian:/home/andre# tree -L 1 /sys/bus/pci/devices/0000\:00\:04.0
/sys/bus/pci/devices/0000:00:04.0
├── ari_enabled
...
├── rescan
├── resource
├── resource0
├── resource1
├── resource3
├── revision
..
└── waiting_for_supplier

The resource0 file maps to the MMIO region (BAR0) of the PCIe device. It can be memory-mapped from userspace using mmap, allowing direct read/write access to the device registers from a user-mode application. Once resource0 is mapped, you can validate the MMIO read/write paths.

To test the mmio_read path, you can read from the VERSION register and verify that the returned value matches the expected version. Below is a code snippet from the test application:

const uint32_t EXPECTED_VERSION = 0x0101;
 
printf("Checking Version Register\n");
version = *(regPtr + PCIE_TEST_DEVICE_MMIO_VER_OFFSET / sizeof(uint32_t));
 
if (version != EXPECTED_VERSION) {
	fprintf(stderr, "ERROR: Mismatch version! 0x%x != 0x%x\n", version, EXPECTED_VERSION);
} else {
	printf("Version check passed \xE2\x9C\x93!\n");
}

Next, we can confirm that the mmio_write path is functional by performing a simple write-read test using the SCRATCH register. The code first zeroes out the register, then writes 0x5555_5555. It then reads the value back to ensure the write was successful.

Below is a code snippet:

printf("Testing read modify write\n");
*(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t)) = 0;
value = *(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t));
if (value != 0) {
	fprintf(stderr, "ERROR: Scratch register should be 0x0!\n");
}
value ^= 0x55555555;
*(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t)) = value;
 
value = *(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t));
if (value != 0x55555555) {
	fprintf(stderr, "ERROR: Scratch register should be 0x55555555!\n");
} else {
	printf("Scratch register write passed \xE2\x9C\x93!\n");
}

Running the application on the guest OS shows the following:

root@debian:/home/andre# ./sanity-check 0000:00:04.0
Checking Version Register
Version check passed ✓!
Testing read modify write
Scratch register write passed ✓!

NOTE

I used 9pfs to share files between the host and guest.

Summary

This post demonstrates how to create and integrate a basic PCIe device in QEMU. In Part 2, we’ll cover DMA and kernel module integration.

Resources