In my previous role working on NVMe drivers, I had to ramp up on PCIe devices and used QEMU to get some hands-on experience.
I thought it might be helpful to build a simple example PCIe device and documented it as part of reorganizing my Obsidian notes.
Host Setup
The example uses the following setup:
Software | Version |
---|---|
CMake | 3.31.7 |
GCC | 12.3 |
Debian (Guest OS) | 12 |
QEMU (x86_64 Machine) | 10.0 |
You can find the source code and scripts in the repository. Refer to the README.md for more detailed setup instructions.
QEMU PCIe Device
This section walks through how a simple PCIe device is implemented in QEMU.
Requirements
The device is designed with these goals:
- Discoverable by Linux.
- Simple MMIO-accessible registers (Add additional registers for testability).
- Perform a basic DMA operation.
- Support MSI-X interrupt.
This post focuses on requirements 1 and 2. DMA and MSI-X will be covered in a follow-up post.
Register Layout
The device implements the following registers:
Address (Hex) | Description |
---|---|
00 | Control |
04 | Status |
08 | Interrupt Mask |
0C | Interrupt Status |
10 | Interrupt Trigger |
14 | - |
18 | Scratch |
1C | Version |
The Interrupt Trigger, Scratch, and Version registers are for testing purposes only.
The DMA descriptor definition will be covered in a follow-up post.
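As a rough illustration, the register table can be written down as byte offsets in a C header. This is a sketch, not the repository's actual header: only the `PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET` and `PCIE_TEST_DEVICE_MMIO_VER_OFFSET` names appear in the test application shown later in this post. The other macro names and the `reg_index` helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets into BAR0, taken from the register table above.
 * Names marked "hypothetical" are made up for this sketch. */
#define PCIE_TEST_DEVICE_MMIO_CTRL_OFFSET      0x00u /* hypothetical name */
#define PCIE_TEST_DEVICE_MMIO_STATUS_OFFSET    0x04u /* hypothetical name */
#define PCIE_TEST_DEVICE_MMIO_INT_MASK_OFFSET  0x08u /* hypothetical name */
#define PCIE_TEST_DEVICE_MMIO_INT_STAT_OFFSET  0x0Cu /* hypothetical name */
#define PCIE_TEST_DEVICE_MMIO_INT_TRIG_OFFSET  0x10u /* hypothetical name */
/* 0x14 is unused in the table */
#define PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET   0x18u
#define PCIE_TEST_DEVICE_MMIO_VER_OFFSET       0x1Cu

/* Every register is 32 bits wide, so offsets must be dword-aligned. */
static_assert(PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET % sizeof(uint32_t) == 0,
              "Scratch register must be dword-aligned");
static_assert(PCIE_TEST_DEVICE_MMIO_VER_OFFSET % sizeof(uint32_t) == 0,
              "Version register must be dword-aligned");

/* Convert a byte offset into an index into a uint32_t register array
 * (hypothetical helper). */
static inline uint32_t reg_index(uint32_t byte_offset)
{
    return byte_offset / (uint32_t)sizeof(uint32_t);
}
```

With this layout, the Scratch register is dword 6 and the Version register is dword 7 of the register array.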
Adding the Device
The new device is added to hw/misc/pcie-testdevice.c.
To include it in the QEMU build, the Kconfig and meson.build files in hw/misc are updated:
default y if TEST_DEVICES
depends on PCI
+
+config PCIE_TESTDEVICE
+ bool
+ default y if TEST_DEVICES
+ depends on PCI
# HPPA devices
system_ss.add(when: 'CONFIG_LASI', if_true: files('lasi.c'))
+
+# PCIE test device
+system_ss.add(when: 'CONFIG_PCIE_TESTDEVICE', if_true: files('pcie-testdevice.c'))
QEMU Object Model (QOM)
Devices are implemented using the QEMU Object Model (QOM), which requires defining a TypeInfo structure to describe the device type and registering it with QEMU.
This structure tells QEMU how to define, initialize, and register your device’s type and its class.
#define TYPE_PCIE_TEST_DEVICE "pcie-test-device"
static const TypeInfo pcie_test_device_info = {
.name = TYPE_PCIE_TEST_DEVICE,
.parent = TYPE_PCI_DEVICE,
.instance_size = sizeof(PcieTestDevice),
.instance_init = pcie_test_device_init,
.instance_finalize = pcie_test_device_finalize,
.class_init = pcie_test_device_class_init,
.interfaces =
(InterfaceInfo[]){
{INTERFACE_PCIE_DEVICE},
{},
},
};
static void pcie_test_device_register_types(void) { type_register_static(&pcie_test_device_info); }
type_init(pcie_test_device_register_types)
TypeInfo specifies:
- The device name, defined by the macro TYPE_PCIE_TEST_DEVICE
- The parent type, TYPE_PCI_DEVICE
- The class initialization function, pcie_test_device_class_init
- The instance initialization function, pcie_test_device_init
Note that in this example, the device inherits all standard PCI functionality (including DMA and MSI support) from TYPE_PCI_DEVICE.
The device state is captured in a PcieTestDevice struct:
typedef struct PcieTestDevice {
PCIDevice parentPci;
/* PCIe BARs*/
MemoryRegion bar0; /* BAR0 MMIO device registers */
uint32_t regs[PCIE_TEST_DEVICE_MIMO_MAX_SIZE_DWORDS];
MemoryRegion mem; /* BAR1 */
PCIExpLinkSpeed speed;
PCIExpLinkWidth width;
} PcieTestDevice;
The bar0 and regs fields together model the device’s MMIO registers.
To handle register reads and writes, we need to define read/write callbacks and store them in a MemoryRegionOps structure.
static const MemoryRegionOps bar_ops = {
.read = mmio_read,
.write = mmio_write,
...
};
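The bodies of the two callbacks are not shown in this post, so here is a standalone model of what they might look like. This is only a sketch under assumptions: `hwaddr` and `ModelDevice` are stand-ins so that it compiles without the QEMU headers, the 64-register size is derived from the 256-byte BAR0 that lspci reports later, and treating the Version register as read-only is a guess about the real device. The callback signatures do mirror the `read` and `write` members of QEMU's `MemoryRegionOps`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t hwaddr;           /* stand-in for QEMU's hwaddr type */

#define REG_COUNT      64u         /* 256-byte BAR0 / 4 bytes per register */
#define REG_SCRATCH    (0x18u / 4u)
#define REG_VERSION    (0x1Cu / 4u)
#define VERSION_VALUE  0x0101u     /* value the test application expects */

typedef struct ModelDevice {       /* hypothetical stand-in for PcieTestDevice */
    uint32_t regs[REG_COUNT];
} ModelDevice;

/* Put the model into its power-on state. */
static void model_reset(ModelDevice *d)
{
    assert(d != NULL);
    memset(d->regs, 0, sizeof(d->regs));
    d->regs[REG_VERSION] = VERSION_VALUE;
}

/* Only aligned 32-bit accesses inside the register file are accepted. */
static int access_ok(hwaddr addr, unsigned size)
{
    return size == 4 && (addr & 3u) == 0 && (addr / 4u) < REG_COUNT;
}

/* Same shape as MemoryRegionOps.read: return the register value. */
static uint64_t mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    ModelDevice *d = opaque;
    if (!access_ok(addr, size)) {
        return 0;                  /* unsupported accesses read as zero */
    }
    return d->regs[addr / 4u];
}

/* Same shape as MemoryRegionOps.write: store the value. */
static void mmio_write(void *opaque, hwaddr addr, uint64_t data, unsigned size)
{
    ModelDevice *d = opaque;
    if (!access_ok(addr, size)) {
        return;                    /* silently drop unsupported accesses */
    }
    if (addr / 4u == REG_VERSION) {
        return;                    /* assumption: Version is read-only */
    }
    d->regs[addr / 4u] = (uint32_t)data;
}
```

In this model, a write to offset 0x18 followed by a read from the same offset returns the written value, while writes to offset 0x1C are ignored.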
The callbacks are registered with a MemoryRegion representing BAR0 (bar0), which models the device’s MMIO register space.
This memory region is then linked to BAR0 of the PCI device using pci_register_bar().
memory_region_init_io(&d->bar0, OBJECT(d), &bar_ops, d, "pcie-test-device-bar0", PCIE_TEST_DEVICE_MIMO_MAX_SIZE_BYTES);
pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->bar0);
This sets up a working MMIO region and fulfills requirements 1 & 2 mentioned above.
To build the example device, run the provided qemu-build.sh script.
You can verify the device instantiation (i.e., confirm that both class_init and instance_init run correctly) by querying its properties:
./external/qemu/build/qemu-system-x86_64 -device pcie-test-device,help
pcie-test-device options:
acpi-index=<uint32> - (default: 0)
addr=<str> - Slot and optional function number, example: 06.0 or 06 (default: -1)
busnr=<busnr>
failover_pair_id=<str>
multifunction=<bool> - on/off (default: off)
rombar=<int32> - (default: -1)
romfile=<str>
romsize=<uint32> - (default: 4294967295)
x-max-bounce-buffer-size=<size> - Maximum buffer size allocated for bounce buffers used for mapped access to indirect DMA memory (default: 4096)
x-pcie-ari-nextfn-1=<bool> - on/off (default: off)
x-pcie-err-unc-mask=<bool> - on/off (default: on)
x-pcie-ext-tag=<bool> - on/off (default: on)
x-pcie-extcap-init=<bool> - on/off (default: on)
x-pcie-lnksta-dllla=<bool> - on/off (default: on)
x-speed=<PCIELinkSpeed> - 2_5/5/8/16/32/64 (default: 2_5)
x-width=<PCIELinkWidth> - 1/2/4/8/12/16/32 (default: 16)
Simple Checks
This section uses Linux to verify that the PCIe device “works”.
Linux Utilities
To verify that the device is properly discovered, use lspci. Since the PCIe Device ID was set to ABBA in the pcie_test_device_class_init function, you can verify its presence using lspci | grep abba.
In the output below, the device 1234:abba shows up with a BDF of 00:04.0.
root@debian:/home/andre# lspci | grep abba
00:04.0 RAM memory: Device 1234:abba (rev 01)
To see detailed information about the device, including its capabilities and resources, use the -vvv option:
root@debian:/home/andre# lspci -s 04.0 -vvv
00:04.0 RAM memory: Device 1234:abba (rev 01)
Subsystem: Red Hat, Inc. Device 1100
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 10
Region 0: Memory at febe6000 (32-bit, non-prefetchable) [size=256]
Region 1: Memory at febd0000 (32-bit, non-prefetchable) [size=64K]
Region 3: Memory at febe7000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [4c] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag+ RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 4
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
Capabilities: [40] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00000800
This shows that:
- I/O+: I/O space is enabled (legacy access)
- Mem+: The device responds to MMIO accesses
- BusMaster-: Bus mastering is currently disabled (it must be enabled later to perform DMA)
Regions:
- Region 0: MMIO region
- Region 1: Device memory
- Region 3: MSI-X support
Capabilities:
- Capability @ 0x40: MSI-X (present but not yet enabled, shown as Enable-)
- Capability @ 0x4c: Express (v2) Root Complex Integrated Endpoint
You can also see the raw PCI header using the -x option.
root@debian:/home/andre# lspci -s 04.0 -x
00:04.0 RAM memory: Device 1234:abba (rev 01)
00: 34 12 ba ab 03 01 10 00 01 00 00 05 00 00 00 00
10: 00 60 be fe 00 00 bd fe 00 00 00 00 00 70 be fe
20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 00 00 4c 00 00 00 00 00 00 00 0a 01 00 00
Examining the PCI configuration space, we can inspect the Command Register at offset 0x04, which has a value of 0x0103 (the bytes 03 01 in the dump, read as a little-endian 16-bit value).
The table below provides the bit definitions for this register, allowing us to decode and interpret its value.
Bit | Name | Description |
---|---|---|
0 | I/O Space Enable | When set (1), allows the device to respond to I/O space accesses/transactions. |
1 | Memory Space Enable | When set (1), allows the device to respond to memory space accesses/transactions. |
2 | Bus Master Enable | When set (1), allows the device to initiate bus transactions (act as a bus master for DMA in PCI, or a Requester in PCIe). |
3 | Special Cycle Enable | PCI: When set (1), allows the device to monitor Special Cycle operations. PCIe: Hardwired to 0 (Special Cycles are not used). |
4 | Memory Write and Invalidate Enable | PCI: When set (1), allows the device to use the Memory Write and Invalidate command for cache coherency. PCIe: Hardwired to 0 (PCI-specific cache mechanism not used). |
5 | VGA Palette Snoop | PCI: When set (1), enables a non-VGA device to snoop VGA palette register writes. PCIe: Hardwired to 0 (legacy feature not relevant). |
6 | Parity Error Response | When set (1), the device takes its normal action upon detecting a parity error (e.g., sets Status Register bit, may assert PERR#). If clear (0), it may ignore some parity errors. |
7 | Wait Cycle Control | PCI: Historically “IDSEL Stepping” or “Wait Cycle Control.” When set (1), enabled address/data stepping or specific wait state behavior. PCIe: Hardwired to 0 (PCI bus timing not applicable). |
8 | SERR# Enable | When set (1), enables the device to drive the SERR# (System Error) signal (PCI) or report system errors (PCIe) when it detects a serious error. |
9 | Fast Back-to-Back Enable | PCI: When set (1), allows capable devices to perform fast back-to-back transactions to the same agent. PCIe: Hardwired to 0 (not applicable to PCIe’s protocol). |
10 | Interrupt Disable | When set (1), disables the device’s ability to generate legacy INTx interrupts (INTA#, INTB#, etc.). Does not affect MSI or MSI-X. |
In this example, the bits for I/O space, memory space, and SERR# are set, while bus mastering is initially disabled. This matches the earlier lspci -vvv output:
- I/O Space → Enabled
- Memory Space → Enabled
- Bus Master → Disabled
- SERR# → Enabled
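The same decoding can be done programmatically. The sketch below reads the 16-bit Command Register out of a raw configuration-space dump and tests individual bits according to the table above. The macro and function names are illustrative only and are not taken from the repository or from the Linux headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Command Register bit masks, following the bit table above
 * (illustrative names). */
#define PCI_CMD_IO_SPACE       (1u << 0)
#define PCI_CMD_MEMORY_SPACE   (1u << 1)
#define PCI_CMD_BUS_MASTER     (1u << 2)
#define PCI_CMD_SERR_ENABLE    (1u << 8)
#define PCI_CMD_INTX_DISABLE   (1u << 10)

#define PCI_CFG_COMMAND_OFFSET 0x04u

/* Read a little-endian 16-bit field from a raw config-space dump. */
static uint16_t cfg_read16(const uint8_t *cfg, size_t offset)
{
    assert(cfg != NULL);
    return (uint16_t)(cfg[offset] | ((uint16_t)cfg[offset + 1] << 8));
}

/* Return true if every bit in mask is set in the command value. */
static bool cmd_bit_set(uint16_t command, uint16_t mask)
{
    return (command & mask) == mask;
}
```

Feeding it the first row of the lspci -x dump (34 12 ba ab 03 01 10 00 ...) gives a Command Register value of 0x0103, with the I/O Space, Memory Space, and SERR# bits set and the Bus Master bit clear.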
Bus Mastering is initially disabled, meaning the device cannot initiate DMA or other bus transactions.
To enable DMA support, we need to explicitly set the Bus Master Enable bit (bit 2) in the Command Register, which changes its value from 0x0103 to 0x0107.
This can be done by using the setpci utility to write to the Command Register.
root@debian:/home/andre# setpci -s 04.0 COMMAND=0107
root@debian:/home/andre# lspci -s 04.0 -x
00:04.0 RAM memory: Device 1234:abba (rev 01)
00: 34 12 ba ab 07 01 10 00 01 00 00 05 00 00 00 00
10: 00 60 be fe 00 00 bd fe 00 00 00 00 00 70 be fe
...
Re-running lspci shows that BusMaster now has a +, indicating that bus mastering is enabled.
root@debian:/home/andre# lspci -s 04.0 -vvv
00:04.0 RAM memory: Device 1234:abba (rev 01)
Subsystem: Red Hat, Inc. Device 1100
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
....
Checking Device Registers
Linux exposes PCIe device resources via sysfs. Each device has a directory under /sys/bus/pci/devices/, named after its full address (domain:bus:device.function).
root@debian:/home/andre# tree -L 1 /sys/bus/pci/devices/0000\:00\:04.0
/sys/bus/pci/devices/0000:00:04.0
├── ari_enabled
...
├── rescan
├── resource
├── resource0
├── resource1
├── resource3
├── revision
..
└── waiting_for_supplier
The resource0 file maps to the MMIO region (BAR0) of the PCIe device. It can be memory-mapped from userspace using mmap, allowing direct read/write access to the device registers from a user-mode application. Once resource0 is mapped, you can validate the MMIO read/write paths.
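The mapping step itself might look like the sketch below. It is not the repository's code: `bar_resource_path` and `map_bar0` are hypothetical helpers, and `BAR0_SIZE` is taken from the 256-byte Region 0 size reported by lspci. The pointer is declared `volatile` so the compiler does not cache or reorder register accesses.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_SIZE 256u  /* Region 0 size reported by lspci */

/* Build "/sys/bus/pci/devices/<bdf>/resource<bar>" into out.
 * Returns 0 on success, -1 if bdf is empty or out is too small. */
static int bar_resource_path(const char *bdf, unsigned bar,
                             char *out, size_t out_len)
{
    assert(bdf != NULL && out != NULL);
    if (strlen(bdf) == 0) {
        return -1;
    }
    int n = snprintf(out, out_len, "/sys/bus/pci/devices/%s/resource%u",
                     bdf, bar);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Map BAR0 of the given device; returns NULL on any failure. */
static volatile uint32_t *map_bar0(const char *bdf)
{
    char path[128];
    if (bar_resource_path(bdf, 0, path, sizeof(path)) != 0) {
        return NULL;
    }
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) {
        return NULL;               /* device missing or insufficient rights */
    }
    void *p = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}
```

A caller would pass the full device address, for example `map_bar0("0000:00:04.0")`, and check the result for NULL before touching any register. Opening resource0 for writing generally requires root privileges.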
To test the mmio_read path, you can read from the VERSION register and verify that the returned value matches the expected version.
Below is a code snippet from the application:
const uint32_t EXPECTED_VERSION = 0x0101;
printf("Checking Version Register\n");
version = *(regPtr + PCIE_TEST_DEVICE_MMIO_VER_OFFSET / sizeof(uint32_t));
if (version != EXPECTED_VERSION) {
fprintf(stderr, "ERROR: Mismatch version! 0x%x != 0x%x\n", version, EXPECTED_VERSION);
} else {
printf("Version check passed \xE2\x9C\x93!\n");
}
Next, we can confirm that the mmio_write path is functional by performing a simple write-read test using the SCRATCH register.
The code first zeroes out the register and reads it back to confirm it is 0, then writes 0x5555_5555.
Finally, it reads the value back again to ensure the write was successful.
Below is a code snippet:
printf("Testing read modify write\n");
*(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t)) = 0;
value = *(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t));
if (value != 0) {
fprintf(stderr, "ERROR: Scratch register should be 0x0!\n");
}
value ^= 0x55555555;
*(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t)) = value;
value = *(regPtr + PCIE_TEST_DEVICE_MMIO_SCRATCH_OFFSET / sizeof(uint32_t));
if (value != 0x55555555) {
fprintf(stderr, "ERROR: Scratch register should be 0x55555555!\n");
} else {
printf("Scratch register write passed \xE2\x9C\x93!\n");
}
Running the application on the guest OS shows the following:
root@debian:/home/andre# ./sanity-check 0000:00:04.0
Checking Version Register
Version check passed ✓!
Testing read modify write
Scratch register write passed ✓!
NOTE
I used 9pfs to share files between the host and guest.
Summary
This post demonstrates how to create and integrate a basic PCIe device in QEMU. In Part 2, we’ll cover DMA and kernel module integration.