1 - INTRODUCTION
================

A Host's PBDMA unit fetches pushbuffer data from memory, generates commands, called "methods", from the fetched data, executes some of the generated methods itself, and sends the remainder of the methods to engines. This manual describes the Host PBDMA register space and all Host methods.

The NV_PPBDMA space defines registers that are contained within each of Host's PBDMA units. Each PBDMA unit is allocated an 8KB address space for its registers.

The NV_UDMA space defines the Host methods. A method consists of an address doubleword and a data doubleword. The address specifies the operation to be performed. The data is an operand. The NV_UDMA address space contains the addresses of the methods that are executed by a PBDMA unit.

GP_ENTRY0 and GP_ENTRY1 - GP-Entry Memory Format

A pushbuffer contains the specifications of the operations that a GPU context is to perform for a particular client. Pushbuffers are stored in memory. A doubleword-sized (4-byte) unit of pushbuffer data is known as a pushbuffer entry. GP entries indicate the location of the pushbuffer data in memory. GP entries themselves are also stored in memory. A GP entry specifies the location and size of a pushbuffer segment (a contiguous block of PB entries) in memory. See "FIFO_DMA" in dev_ram.ref for details about pushbuffer segments and the format of pushbuffer data.

The NV_PPBDMA_GP_ENTRY0_GET and NV_PPBDMA_GP_ENTRY1_GET_HI fields of a GP entry specify the 38-bit dword-address (equivalently, a 40-bit byte-address) of the first pushbuffer entry of the GP entry's pushbuffer segment. Because each pushbuffer entry (and by extension each pushbuffer segment) is doubleword aligned (4-byte aligned), the least significant 2 bits of the 40-bit byte-address are not stored. The byte-address of the first pushbuffer entry in a GP entry's pushbuffer segment is:

  (GP_ENTRY1_GET_HI << 32) + (GP_ENTRY0_GET << 2)

The NV_PPBDMA_GP_ENTRY1_LENGTH field, when non-zero, indicates the number of pushbuffer entries contained within the GP entry's pushbuffer segment. The byte-address of the first pushbuffer entry beyond the pushbuffer segment is:

  (GP_ENTRY1_GET_HI << 32) + (GP_ENTRY0_GET << 2) + (GP_ENTRY1_LENGTH * 4)

If NV_PPBDMA_GP_ENTRY1_LENGTH is CONTROL (0), then the GP entry is a "control" entry, meaning this GP entry will not cause any PB data to be fetched or executed. In this case, the NV_PPBDMA_GP_ENTRY1_OPCODE field specifies an operation to perform, and the NV_PPBDMA_GP_ENTRY0_OPERAND field contains the operand. The available operations are as follows:

* NV_PPBDMA_GP_ENTRY1_OPCODE_NOP: no operation is performed, but note that the SYNC field is still respected--see below.

* NV_PPBDMA_GP_ENTRY1_OPCODE_GP_CRC: the ENTRY0_OPERAND field is compared with the cyclic redundancy check value that was calculated over previous GP entries (NV_PPBDMA_GP_CRC). After each comparison, NV_PPBDMA_GP_CRC is cleared, whether the values match or differ. If they differ, then Host initiates an interrupt (NV_PPBDMA_INTR_0_GPCRC). For recovery, clearing the interrupt will cause the PBDMA to continue as if the control entry were OPCODE_NOP.

* NV_PPBDMA_GP_ENTRY1_OPCODE_PB_CRC: the ENTRY0_OPERAND field is compared with the CRC value that was calculated over the previous pushbuffer segment (NV_PPBDMA_PB_CRC). The PB CRC resets to 0 with each pushbuffer segment. If the two CRCs differ, Host will raise the NV_PPBDMA_INTR_0_PBCRC interrupt. For recovery, clearing the interrupt will continue as if the control entry were OPCODE_NOP.
Note that the PB_CRC is indeterminate if an END_PB_SEGMENT PB control entry was used in the prior segment, or if SSDM disabled the device and the segment had conditional fetching enabled.

Host supports two privilege levels for channels: privileged and non-privileged. The privilege level is determined by the NV_PPBDMA_CONFIG_AUTH_LEVEL field, which is set from the corresponding NV_RAMFC_CONFIG dword in the RAMFC. Non-privileged channels cannot execute privileged methods, but privileged channels can. Any attempt to run a privileged operation from a non-privileged channel will result in the PBDMA raising NV_PPBDMA_INTR_0_METHOD.

The NV_PPBDMA_GP_ENTRY1_SYNC field specifies whether a pushbuffer may be fetched before Host has finished processing the preceding PB segment. If this field is SYNC_PROCEED, then Host does not wait for the preceding PB segment to be processed. If this field is SYNC_WAIT, then Host waits until the preceding PB segment has been processed by Host before beginning to fetch the current PB segment. Host's processing of a PB segment consists of parsing PB entries into PB instructions, decoding those instructions into control entries or method headers, generating methods from method headers, determining whether methods are to be executed by Host or by an engine, executing Host methods, and sending non-Host methods and SetObject methods to engines.

Note that in the case where the final PB entry of the preceding PB segment is a method header representing a PB compressed method sequence of nonzero length--that is, the compressed method sequence is split across PB segments with all of its method data entries in the PB segment for which SYNC_WAIT is set--then Host is considered to have finished processing the preceding PB segment once that method header is read. However, splitting a PB compressed method sequence for software methods is not supported, because Host will issue the DEVICE interrupt indicating the SW method as soon as it processes the method header, which happens prior to fetching the method data entries for that compressed method sequence. Thus SW cannot actually execute any of the methods in the sequence because the method data is not yet available, leaving the PBDMA wedged.

When SYNC_WAIT is set, Host does not wait for any engine methods generated from the preceding PB segment to complete. Host does not automatically wait until an engine is done processing all methods generated from that PB segment. If software desires that the engine finish processing all methods generated from one PB segment before a second PB segment is fetched, then software may place Host methods that wait until the engine is idle in the first PB segment (like WFI, SET_REF, or SEM_EXECUTE with RELEASE_WFI_EN set). Alternatively, software might put a semaphore acquire at the end of the first PB segment, and have an engine release the semaphore. In both cases, SYNC_WAIT must be set on the second PB segment.

This field applies even if the NV_PPBDMA_GP_ENTRY1_LENGTH field is zero; if SYNC_WAIT is specified in this case, no further GP entries will be processed until the wait finishes.

Some parts of a pushbuffer may not be executed, depending on the values of NV_PPBDMA_SUBDEVICE_ID and SUBDEVICE_MASK. If an entire PB segment will not be executed due to conditional execution, Host need not even bother fetching the PB segment. The NV_PPBDMA_GP_ENTRY0_FETCH field indicates whether the PB segment specified by the GP entry should be fetched unconditionally or fetched conditionally.
If this field is FETCH_UNCONDITIONAL, then the PB segment is fetched unconditionally. If this field is FETCH_CONDITIONAL, then the PB segment is only fetched if the NV_PPBDMA_SUBDEVICE_STATUS field is STATUS_ACTIVE.

********************************************************************************
Warning: When using subdevice masking, one must take care to synchronize
properly with any later GP entries marked FETCH_CONDITIONAL. If GP fetching
gets too far ahead of PB processing, it is possible for a later conditional PB
segment to be discarded prior to reaching an SSDM command that sets
SUBDEVICE_STATUS to ACTIVE. This would cause Host to execute garbage data. One
way to avoid this would be to set the SYNC_WAIT flag on any FETCH_CONDITIONAL
segments following a subdevice reenable.
********************************************************************************

If the PB segment is not fetched, then the GP entry behaves as an OPCODE_NOP control entry.

If a PB segment contains a SET_SUBDEVICE_MASK PB instruction that Host must see, then the GP entry for that PB segment must specify FETCH_UNCONDITIONAL. If the PB segment specifies FETCH_CONDITIONAL and the subdevice mask shows STATUS_ACTIVE, but the PB segment contains a SET_SUBDEVICE_MASK PB instruction that will disable the mask, the rest of the PB segment will be discarded. In that case, an arbitrary number of entries past the SSDM may have already updated the PB CRC, rendering the PB CRC indeterminate.

If Host must wait for a previous PB segment's Host processing to be completed before examining NV_PPBDMA_SUBDEVICE_STATUS, then the GP entry should also have its SYNC_WAIT field set.

A PB segment marked FETCH_CONDITIONAL must not have a PB compressed method sequence that crosses a PB segment boundary (with its header in a previous non-conditional PB segment and its final valid data in a conditional PB segment)--doing so will cause a NV_PPBDMA_INTR_0_PBSEG interrupt.

Software may monitor Host's progress through the pushbuffer by reading the channel's NV_RAMUSERD_TOP_LEVEL_GET entry from USERD, which is backed by Host's NV_PPBDMA_TOP_LEVEL_GET register. See "NV_PFIFO_USERD_WRITEBACK" in dev_fifo.ref for information about how frequently this information is written back into USERD. If a PB segment occurs multiple times within a pushbuffer (like a commonly used subroutine), then progress through that segment may be less useful for monitoring, because software will not know which occurrence of the segment is being processed.

The NV_PPBDMA_GP_ENTRY1_LEVEL field specifies whether progress through the GP entry's PB segment should be indicated in NV_RAMUSERD_TOP_LEVEL_GET. If this field is LEVEL_MAIN, then progress through the PB segment will be reported--NV_RAMUSERD_TOP_LEVEL_GET will equal NV_RAMUSERD_GET. If this field is LEVEL_SUBROUTINE, then progress through this PB segment is not reported--Host will not alter NV_RAMUSERD_TOP_LEVEL_GET, and reads of NV_RAMUSERD_TOP_LEVEL_GET will return the last value of NV_RAMUSERD_GET from a PB segment at LEVEL_MAIN.

If the GP entry's opcode is OPCODE_ILLEGAL or an invalid opcode, Host will initiate an interrupt (NV_PPBDMA_INTR_0_GPENTRY). If a GP entry specifies a PB segment that crosses the end of the virtual address space (0xFFFFFFFFFF), then Host will also initiate an interrupt (NV_PPBDMA_INTR_0_GPENTRY). Invalid GP entries are treated like traps: they set the interrupt and freeze the PBDMA, but the invalid GP entry is discarded.
Once the interrupt is cleared, the PBDMA unit will simply continue with the next GP entry.

Note that a corner case exists where the PB segment described by a GP entry is at the end of the virtual address space; in other words, the last PB entry in the described PB segment is the last dword in the virtual address space. This type of GP entry is not valid and will generate a GPENTRY interrupt. The PBDMA's PUT pointer describes the address of the first dword beyond the PB segment, thus making the last dword in the virtual address space unusable for storing a PB entry.

#define NV_PPBDMA_GP_ENTRY__SIZE                          8 /*       */
#define NV_PPBDMA_GP_ENTRY0                      0x10000000 /* RW-4R */
#define NV_PPBDMA_GP_ENTRY0_OPERAND                    31:0 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY0_FETCH                       0:0 /*       */
#define NV_PPBDMA_GP_ENTRY0_FETCH_UNCONDITIONAL  0x00000000 /*       */
#define NV_PPBDMA_GP_ENTRY0_FETCH_CONDITIONAL    0x00000001 /*       */
#define NV_PPBDMA_GP_ENTRY0_GET                        31:2 /*       */
#define NV_PPBDMA_GP_ENTRY1                      0x10000004 /* RW-4R */
#define NV_PPBDMA_GP_ENTRY1_GET_HI                      7:0 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY1_LEVEL                       9:9 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY1_LEVEL_MAIN           0x00000000 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_LEVEL_SUBROUTINE     0x00000001 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_LENGTH                    30:10 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY1_LENGTH_CONTROL       0x00000000 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_SYNC                      31:31 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY1_SYNC_PROCEED         0x00000000 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_SYNC_WAIT            0x00000001 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_OPCODE                      7:0 /* RWXUF */
#define NV_PPBDMA_GP_ENTRY1_OPCODE_NOP           0x00000000 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_OPCODE_ILLEGAL       0x00000001 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_OPCODE_GP_CRC        0x00000002 /* RW--V */
#define NV_PPBDMA_GP_ENTRY1_OPCODE_PB_CRC        0x00000003 /* RW--V */

Number of NOPs for self-modifying gpfifo

This is a formula for SW to estimate the number of NOPs needed to pad the gpfifo so that the modification of a GP entry by the engine or by the CPU can take effect. Here, NV_PFIFO_LB_GPBUF_CONTROL_SIZE(eng) refers to the SIZE field in the NV_PFIFO_LB_GPBUF_CONTROL(eng) register. (More information about this register is in dev_fifo.ref.)

  NUM_GP_NOPS(eng) = ((NV_PFIFO_LB_GPBUF_CONTROL_SIZE(eng)+1) * NV_PFIFO_LB_ENTRY_SIZE)
                     / NV_PPBDMA_GP_ENTRY__SIZE

GP_BASE - Base and Limit of the Circular Buffer of GP Entries

GP entries are stored in a buffer in memory. The NV_PPBDMA_GP_BASE_OFFSET and NV_PPBDMA_GP_BASE_HI_OFFSET fields specify the 37-bit address, in 8-byte granularity, of the start of a circular buffer that contains GP entries (the GPFIFO). This address is a virtual (not a physical) address. GP entries are always GP_ENTRY__SIZE-byte aligned, so the least significant three bits of the byte address are not stored. The byte address of the GPFIFO base pointer is thus:

  gpfifo_base_ptr = GP_BASE + (GP_BASE_HI_OFFSET << 32)

The number of GP entries in the circular buffer is always a power of 2. The NV_PPBDMA_GP_BASE_HI_LIMIT2 field specifies the number of bits used to count the memory allocated to the GP FIFO. The LIMIT2 value specified in these registers is log base 2 of the number of entries in the GP FIFO. For example, if the number of entries is 2^16--indicating a memory area of (2^16)*GP_ENTRY__SIZE bytes--then the value written in LIMIT2 is 16.

The circular buffer containing GP entries cannot cross the maximum address. If gpfifo_base_ptr + (1<<LIMIT2)*NV_PPBDMA_GP_ENTRY__SIZE > 0xFFFFFFFFFF, then Host will initiate a CPU interrupt (NV_PPBDMA_INTR_0_GPFIFO).
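As a worked illustration only (the helper and names below are hypothetical, not part of this manual), a C sketch of computing the GPFIFO base byte address from the two registers and checking the limit against the top of the 40-bit virtual address space, under the field layout defined above:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sketch: gp_base is the NV_PPBDMA_GP_BASE value (OFFSET in
     * bits 31:3, RSVD bits 2:0 zero); gp_base_hi is NV_PPBDMA_GP_BASE_HI
     * (OFFSET in bits 7:0, LIMIT2 in bits 20:16). */
    #define GP_ENTRY_SIZE 8ULL
    #define MAX_VA        0xFFFFFFFFFFULL  /* last byte of the 40-bit VA space */

    static bool gpfifo_base_ok(uint32_t gp_base, uint32_t gp_base_hi,
                               uint64_t *base_out)
    {
        uint64_t base   = ((uint64_t)(gp_base_hi & 0xFF) << 32) | gp_base;
        uint32_t limit2 = (gp_base_hi >> 16) & 0x1F;   /* log2(entry count) */
        uint64_t bytes  = (1ULL << limit2) * GP_ENTRY_SIZE;

        *base_out = base;
        /* The circular buffer must not cross the maximum address. */
        return base + bytes - 1 <= MAX_VA;
    }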
The NV_PPBDMA_GP_PUT, NV_PPBDMA_GP_GET, and NV_PPBDMA_GP_FETCH registers (and their associated NV_RAMFC and NV_RAMUSERD entries) are relative to the value of this register.

These registers are part of a GPU context's state. On a switch, the values of these registers are saved to, and restored from, the NV_RAMFC_GP_BASE and NV_RAMFC_GP_BASE_HI entries in the RAMFC part of the GPU context's GPU-instance block. Typically, software initializes the information in NV_RAMFC_GP_BASE and NV_RAMFC_GP_BASE_HI when the GPU context's GPU-instance block is first created.

These registers are available to software only for debug. Software should use them only if the GPU context is assigned to a PBDMA unit and that PBDMA unit is stalled. While a GPU context's Host context is not contained within a PBDMA unit, software should use the RAMFC entries to access this information.

A pair of these registers exists for each of Host's PBDMA units. These registers run on Host's internal bus clock.

#define NV_PPBDMA_GP_BASE(i)            (0x00040048+(i)*8192) /* RW-4A */
#define NV_PPBDMA_GP_BASE__SIZE_1                          14 /*       */
#define NV_PPBDMA_GP_BASE_OFFSET                         31:3 /* RW-UF */
#define NV_PPBDMA_GP_BASE_OFFSET_ZERO              0x00000000 /* RW--V */
#define NV_PPBDMA_GP_BASE_RSVD                            2:0 /* RW-UF */
#define NV_PPBDMA_GP_BASE_RSVD_ZERO                0x00000000 /* RW--V */
#define NV_PPBDMA_GP_BASE_HI(i)         (0x0004004c+(i)*8192) /* RW-4A */
#define NV_PPBDMA_GP_BASE_HI__SIZE_1                       14 /*       */
#define NV_PPBDMA_GP_BASE_HI_OFFSET                       7:0 /* RW-UF */
#define NV_PPBDMA_GP_BASE_HI_OFFSET_ZERO           0x00000000 /* RW--V */
#define NV_PPBDMA_GP_BASE_HI_LIMIT2                     20:16 /* RW-UF */
#define NV_PPBDMA_GP_BASE_HI_LIMIT2_ZERO           0x00000000 /* RW--V */
#define NV_PPBDMA_GP_BASE_HI_RSVDA                       15:8 /* RW-UF */
#define NV_PPBDMA_GP_BASE_HI_RSVDA_ZERO            0x00000000 /* RW--V */
#define NV_PPBDMA_GP_BASE_HI_RSVDB                      31:21 /* RW-UF */
#define NV_PPBDMA_GP_BASE_HI_RSVDB_ZERO            0x00000000 /* RW--V */

GP_FETCH - Pointer to the next GP-Entry to be Fetched

Host does not fetch all GP entries with a single request to the memory subsystem; Host fetches GP entries in batches. The NV_PPBDMA_GP_FETCH register indicates the index of the next GP entry to be fetched by Host. The actual 40-bit virtual address of the specified GP entry is computed as follows:

  fetch address = GP_FETCH_ENTRY * NV_PPBDMA_GP_ENTRY__SIZE + GP_BASE

If NV_PPBDMA_GP_PUT == NV_PPBDMA_GP_FETCH, then requests to fetch the entire GP circular buffer have been issued, and Host cannot make more requests until NV_PPBDMA_GP_PUT is changed. Host may finish fetching GP entries long before it has finished processing the PB segments specified by those entries. Software should not use NV_PPBDMA_GP_FETCH (it should use NV_PPBDMA_GP_GET) to determine whether the GP circular buffer is full. NV_PPBDMA_GP_FETCH represents the current extent of prefetching of GP entries; prefetched entries may be discarded and refetched later.

This register is part of a GPU context's state. On a switch, the value of this register is saved to, and restored from, the NV_RAMFC_GP_FETCH entry of the RAMFC part of the GPU context's GPU-instance block. A PBDMA unit maintains this register. Typically, software does not need to access this register. This register is available to software only for debug. Because Host may fetch GP entries long before it is ready to process the entries, and because Host may discard GP entries that it has fetched, software should not use NV_PPBDMA_GP_FETCH to monitor Host's progress (software should use NV_PPBDMA_GP_GET for monitoring).
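For illustration, a minimal C sketch of the fetch-address computation above (the function name is hypothetical; gpfifo_base is the byte address derived from NV_PPBDMA_GP_BASE/_HI):

    #include <stdint.h>

    /* fetch address = GP_FETCH_ENTRY * NV_PPBDMA_GP_ENTRY__SIZE + GP_BASE */
    static uint64_t gp_fetch_va(uint32_t gp_fetch_entry, uint64_t gpfifo_base)
    {
        return gpfifo_base + (uint64_t)gp_fetch_entry * 8 /* GP_ENTRY__SIZE */;
    }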
Software should use this register only if the GPU context is assigned to a PBDMA unit and that PBDMA unit is stalled. While a GPU context's Host context is not contained within a PBDMA unit, software should use NV_RAMFC_GP_FETCH to access this information. Whether written via a PRI write or restored from RAMFC memory, the value of this register must remain less than the size of the circular buffer that stores GP entries (1<<LIMIT2).

SEM_EXECUTE [method] - Semaphore Execute Method

The SEM_EXECUTE method triggers the semaphore operation specified by its OPERATION field, using the semaphore address and payload set up by the SEM_ADDR_LO/_HI and SEM_PAYLOAD_LO/_HI methods.

If OPERATION is ACQUIRE, the acquire succeeds when the semaphore value in memory equals the payload value.

If OPERATION is ACQ_STRICT_GEQ, the acquire succeeds when (SV >= PV), where SV is the semaphore value in memory, PV is the payload value, and >= is an unsigned greater-than-or-equal-to comparison.

If OPERATION is ACQ_CIRC_GEQ, the acquire succeeds when the two's complement signed representation of the semaphore value minus the payload value is non-negative; that is, when the semaphore value is within the half-range greater than or equal to the payload value, modulo the full range. The PAYLOAD_SIZE field determines whether Host performs a 32-bit or a 64-bit comparison. In other words, the condition is met when PAYLOAD_SIZE is 32BIT and the semaphore value is within the range [payload, payload + 2^31 - 1], modulo 2^32, or when PAYLOAD_SIZE is 64BIT and the semaphore value is within the range [payload, payload + 2^63 - 1], modulo 2^64.

If OPERATION is ACQ_AND, the acquire succeeds when the bitwise AND of the semaphore value and the payload value is not zero. The PAYLOAD_SIZE field determines whether a 32-bit or a 64-bit value is read from memory and compared.

If OPERATION is ACQ_NOR, the acquire succeeds when the bitwise NOR of the semaphore value and the payload value is not zero. PAYLOAD_SIZE determines whether a 32-bit or a 64-bit value is read from memory and compared.

If OPERATION is RELEASE, then Host simply writes the payload value to the semaphore structure in memory at the SEM_ADDR_LO/_HI address. The exact value written depends on PAYLOAD_SIZE: if PAYLOAD_SIZE is 32BIT, a 32-bit payload value from PAYLOAD_LO is used; if PAYLOAD_SIZE is 64BIT, a 64-bit payload specified by PAYLOAD_LO/_HI is used.

If OPERATION is REDUCTION, then Host sends the memory system an instruction to perform the atomic reduction operation specified in the REDUCTION field on the memory value, using the PAYLOAD_LO/_HI payload value as the operand. The PAYLOAD_SIZE field determines whether a 32-bit or a 64-bit reduction is performed. Note that if the semaphore address refers to a page whose PTE has ATOMIC_DISABLE set, the operation will result in an ATOMIC_VIOLATION fault.

Note that if PAYLOAD_SIZE is 64BIT, the semaphore address is required to be 8-byte aligned. If RELEASE_TIMESTAMP is EN while the operation is a RELEASE or REDUCTION operation, the semaphore address is required to be 16-byte aligned. The semaphore address is not required to be 16-byte aligned during an acquire operation. If the semaphore address is not aligned according to the field values, Host will raise the NV_PPBDMA_INTR_0 interrupt.

For iGPU cases where a semaphore release can be mapped to an onchip syncpoint, the payload SIZE must be 4 bytes to avoid double-incrementing the target syncpoint. Timestamping should also be disabled to avoid unwanted behavior.

Semaphore switch option:

The NV_UDMA_SEM_EXECUTE_ACQUIRE_SWITCH_TSG field specifies whether or not Host should switch to processing another TSG if the acquire fails. If every channel within the same TSG has no work (is waiting on a semaphore acquire, is idle, is unbound, or is disabled), the TSG can make no further progress until one of the relevant semaphores is released.
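(Before turning to the switch behavior in detail, here is a small illustrative C model of the conditional acquire comparisons defined above, for the 32-bit case; the helper names are hypothetical, and the 64-bit forms are identical with 64-bit types.)

    #include <stdbool.h>
    #include <stdint.h>

    /* sv = semaphore value read from memory, pv = payload value. */
    static bool acq_succeeds(uint32_t sv, uint32_t pv)   { return sv == pv; }

    static bool acq_strict_geq(uint32_t sv, uint32_t pv) { return sv >= pv; }

    static bool acq_circ_geq(uint32_t sv, uint32_t pv)
    {
        /* True when sv is within [pv, pv + 2^31 - 1] modulo 2^32, i.e. when
         * the two's complement difference sv - pv is non-negative. */
        return (int32_t)(sv - pv) >= 0;
    }

    static bool acq_and(uint32_t sv, uint32_t pv) { return (sv & pv) != 0; }

    static bool acq_nor(uint32_t sv, uint32_t pv) { return ~(sv | pv) != 0; }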
Because it may be a long time before the release, it may be more efficient for the PBDMA unit to switch off the blocked TSG prior to the runqueue timeslice expiring, so that it can serve a different TSG that is not waiting, or so that it can poll other semaphores on other TSGs whose channels are waiting on acquires.

When a semaphore acquire fails, the PBDMA unit will always switch to another channel within the same TSG, provided that it has not completed a traversal through all the TSG's channels. If every pending channel in the TSG is waiting on a semaphore acquire, the Host scheduler is able to identify a lack of progress for the entire TSG by the time it has completed a traversal through all those channels. In this case the value of ACQUIRE_SWITCH_TSG for each of these channels determines whether the PBDMA will switch to another TSG or start another traversal through the same TSG. If ACQUIRE_SWITCH_TSG is DIS for any of the channels in the TSG, the Host scheduler will ignore any lack of progress and continue processing the TSG, until either every channel in the TSG runs out of work or the timeslice expires. If ACQUIRE_SWITCH_TSG is EN for every pending channel in the TSG, the Host scheduler will recognize a lack of progress for the whole TSG, and will switch to the next serviceable TSG on the runqueue, if possible.

In the case described above, if there isn't a different serviceable TSG on the runlist, then the current channel's TSG will continue to be scheduled, and the acquire retry will be naturally delayed by the time it takes for Host's runlist processing to return to the same channel. This retry delay may be too short, in which case the runlist search can be throttled to increase the delay by configuring NV_PFIFO_ACQ_PRETEST; see dev_fifo.ref.

Note that if the channel remains switched in, the prefetched pushbuffer data is not discarded, so setting ACQUIRE_SWITCH_TSG_EN cannot deterministically be depended on to cause the discarding of prefetched pushbuffer data. Also note that when switching between channels within a TSG, Host does not wait on any timer (such as NV_PFIFO_ACQ_PRETEST or NV_PPBDMA_ACQUIRE_RETRY), but is instead throttled by the time it takes to switch channels. Host will honor the ACQUIRE_RETRY time, but only if the same channel is rescheduled without a channel switch.

Semaphore wait-for-idle option:

The NV_UDMA_SEM_EXECUTE_RELEASE_WFI field applies only to releases and reductions. It specifies whether Host should wait until the engine to which the channel last sent methods is idle (in other words, until all previous methods in the channel have been completed) before writing to memory as part of the release or reduction operation. If this field is RELEASE_WFI_EN, then Host waits for the engine to be idle, inserts a system memory barrier, and then updates the value in memory. If this field is RELEASE_WFI_DIS, Host performs the semaphore operation on the memory without waiting for the engine to be idle, and without using a system memory barrier.

Semaphore timestamp option:

The NV_UDMA_SEM_EXECUTE_RELEASE_TIMESTAMP field specifies whether a timestamp should be written by a release in addition to the payload. If RELEASE_TIMESTAMP is DIS, then only the semaphore payload will be written. If the field is EN, then both the semaphore payload and a nanosecond timestamp will be written. In this case, the semaphore address must be 16-byte aligned; see the related note at NV_UDMA_SEM_ADDR_LO.
If RELEASE_TIMESTAMP is EN and SEM_ADDR_LO is not 16-byte aligned, then Host will initiate an interrupt (NV_PPBDMA_INTR_0_SEMAPHORE).

When a 16-byte semaphore is written, the semaphore timestamp will be written before the semaphore payload so that when an acquire succeeds, the timestamp write will have completed. This ensures SW will not get an out-of-date timestamp on platforms which guarantee ordering within a 16-byte aligned region. The timestamp value is snapped from the NV_PTIMER_TIME_1/0 registers; see dev_timer.ref.

For iGPU cases where a semaphore release can be mapped to an onchip syncpoint, the payload SIZE must be 4 bytes to avoid double-incrementing the target syncpoint. Timestamping should also be disabled for a syncpoint-backed release to avoid unexpected behavior.

Below is the little-endian format of 16-byte semaphores in memory:

  ---- ------------------- -------------------
  byte Data (little endian) Data (little endian)
       PAYLOAD_SIZE=32BIT   PAYLOAD_SIZE=64BIT
  ---- ------------------- -------------------
  0    Payload[ 7: 0]       Payload[ 7: 0]
  1    Payload[15: 8]       Payload[15: 8]
  2    Payload[23:16]       Payload[23:16]
  3    Payload[31:24]       Payload[31:24]
  4    0                    Payload[39:32]
  5    0                    Payload[47:40]
  6    0                    Payload[55:48]
  7    0                    Payload[63:56]
  8    timer[ 7: 0]         timer[ 7: 0]
  9    timer[15: 8]         timer[15: 8]
  10   timer[23:16]         timer[23:16]
  11   timer[31:24]         timer[31:24]
  12   timer[39:32]         timer[39:32]
  13   timer[47:40]         timer[47:40]
  14   timer[55:48]         timer[55:48]
  15   timer[63:56]         timer[63:56]
  ---- ------------------- -------------------

Semaphore reduction operations:

The NV_UDMA_SEM_EXECUTE_REDUCTION field specifies the reduction operation to perform on the semaphore memory value, using the semaphore payload from SEM_PAYLOAD_LO/HI as an operand, when the OPERATION field is OPERATION_REDUCTION. Based on the PAYLOAD_SIZE field, the semaphore value and the payload are interpreted as 32-bit or 64-bit integers, and the reduction operation is performed according to the signedness specified via the REDUCTION_FORMAT field described below. The reduction operation leaves the modified value in the semaphore memory according to the operation, as follows:

  REDUCTION_IMIN - the minimum of the value and payload
  REDUCTION_IMAX - the maximum of the value and payload
  REDUCTION_IXOR - the bitwise exclusive or (XOR) of the value and payload
  REDUCTION_IAND - the bitwise AND of the value and payload
  REDUCTION_IOR  - the bitwise OR of the value and payload
  REDUCTION_IADD - the sum of the value and payload
  REDUCTION_INC  - the value incremented by 1, or reset to 0 if the
                   incremented value would exceed the payload
  REDUCTION_DEC  - the value decremented by 1, or reset back to the payload
                   if the original value is already 0 or exceeds the payload

Note that INC and DEC are somewhat surprising: they can be used to repeatedly loop the semaphore value when performed successively with the same payload p. INC repeatedly iterates from 0 to p inclusive, resetting to 0 once exceeding p. DEC repeatedly iterates down from p to 0 inclusive, resetting back to p once the value would otherwise underflow. Therefore, an INC or DEC reduction with payload 0 effectively releases a semaphore by setting its value to 0.

The reduction opcode assignment matches the enumeration in the XBAR translator (to avoid extra remapping of hardware), but this does not match the graphics FE reduction opcodes used by graphics backend semaphores. The reduction operation itself is performed by L2.
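The INC and DEC looping described above is compact enough to model directly. The following is an illustrative 32-bit unsigned C model of the reduction operations (not hardware source), matching the function column of the table in the next section:

    #include <stdint.h>

    /* v = semaphore value in memory, p = payload. */
    static uint32_t red_imin(uint32_t v, uint32_t p) { return v < p ? v : p; }
    static uint32_t red_imax(uint32_t v, uint32_t p) { return v > p ? v : p; }
    static uint32_t red_ixor(uint32_t v, uint32_t p) { return v ^ p; }
    static uint32_t red_iand(uint32_t v, uint32_t p) { return v & p; }
    static uint32_t red_ior (uint32_t v, uint32_t p) { return v | p; }
    static uint32_t red_iadd(uint32_t v, uint32_t p) { return v + p; }

    /* INC loops 0..p inclusive: reset to 0 once the value reaches p. */
    static uint32_t red_inc(uint32_t v, uint32_t p)
    {
        return (v >= p) ? 0 : v + 1;
    }

    /* DEC loops p..0 inclusive: reset to p on underflow or out-of-range value. */
    static uint32_t red_dec(uint32_t v, uint32_t p)
    {
        return (v == 0 || v > p) ? p : v - 1;
    }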
Semaphore signedness option:

The NV_UDMA_SEM_EXECUTE_REDUCTION_FORMAT field specifies whether the values involved in a reduction operation will be interpreted as signed or unsigned. The following table summarizes each reduction operation, and the signedness and payload size supported for each operation:

         signedness
  r op   32b    64b    function (v = memory value, p = semaphore payload)
  -----  -----  -----  ---------------------------------------------------
  IMIN   U,S    U,S    v = (v < p) ? v : p
  IMAX   U,S    U,S    v = (v > p) ? v : p
  IXOR   N/A    N/A    v = v ^ p
  IAND   N/A    N/A    v = v & p
  IOR    N/A    N/A    v = v | p
  IADD   U,S    U      v = v + p
  INC    U      inv    v = (v >= p) ? 0 : v + 1
  DEC    U      inv    v = (v == 0 || v > p) ? p : v - 1
  (from L2 IAS)

An operation with signedness "N/A" will ignore the value of REDUCTION_FORMAT when executing, and either value of REDUCTION_FORMAT is valid. If an operation is marked "U" only, a signed version of that operation is not supported; if it is marked "inv", it is unsupported for any signedness. If Host sees an unsupported reduction op (in other words, is expected to run a reduction op while PAYLOAD_SIZE and REDUCTION_FORMAT are set to unsupported values for that op), Host will raise the NV_PPBDMA_INTR_0_SEMAPHORE interrupt.

Example: a signed 32-bit IADD reduction operation is valid. A signed 64-bit IADD reduction operation is unsupported and will trigger an interrupt if sent to Host. A 64-bit INC (or DEC) operation is not supported and will trigger an interrupt if sent to Host.

Legal semaphore operation combinations:

The following table diagrams the types of semaphore operations that are possible. In the columns, "x" matches any field value. ACQ refers to any of the ACQUIRE, ACQ_STRICT_GEQ, ACQ_CIRC_GEQ, ACQ_AND, and ACQ_NOR operations. REL refers to either a RELEASE or a REDUCTION operation.

  OP  SWITCH WFI PAYLOAD_SIZE TIMESTAMP Description
  --- ------ --- ------------ --------- -------------------------------------------------------
  ACQ 0      x   0            x         acquire; 4B (32-bit comparison); retry on fail
  ACQ 0      x   1            x         acquire; 8B (64-bit comparison); retry on fail
  ACQ 1      x   0            x         acquire; 4B (32-bit comparison); switch on fail
  ACQ 1      x   1            x         acquire; 8B (64-bit comparison); switch on fail
  REL x      0   0            1         WFI & release 4B payload + timestamp semaphore
  REL x      0   1            1         WFI & release 8B payload + timestamp semaphore
  REL x      1   0            1         do not WFI & release 4B payload + timestamp semaphore
  REL x      1   1            1         do not WFI & release 8B payload + timestamp semaphore
  REL x      0   0            0         WFI & release doubleword (4B) semaphore payload
  REL x      0   1            0         WFI & release quadword (8B) semaphore payload
  REL x      1   0            0         do not WFI & release doubleword (4B) semaphore payload
  REL x      1   1            0         do not WFI & release quadword (8B) semaphore payload
  --- ------ --- ------------ --------- -------------------------------------------------------

While the channel is loaded on a PBDMA unit, information from this method is stored in the NV_PPBDMA_SEM_EXECUTE register. Otherwise, this information is stored in the NV_RAMFC_SEM_EXECUTE field of the RAMFC part of the channel's instance block.

Undefined bits: Bits in the NV_UDMA_SEM_EXECUTE method data that are not used by the specified OPERATION should be set to 0. When non-zero, their behavior is undefined.
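As a usage sketch (the helper is hypothetical; field positions are taken from the defines below), software might assemble the SEM_EXECUTE method data for a 64-bit, timestamped release like this:

    #include <stdint.h>

    /* Pack a SEM_EXECUTE data dword: RELEASE, no WFI, 64-bit payload,
     * timestamp enabled; all other bits left 0 per the note above. */
    static uint32_t sem_execute_release64_ts(void)
    {
        uint32_t data = 0;
        data |= 0x1u << 0;    /* OPERATION         = RELEASE  (bits 2:0) */
        data |= 0x0u << 20;   /* RELEASE_WFI       = DIS      (bit 20)   */
        data |= 0x1u << 24;   /* PAYLOAD_SIZE      = 64BIT    (bit 24)   */
        data |= 0x1u << 25;   /* RELEASE_TIMESTAMP = EN       (bit 25)   */
        return data;
    }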
#define NV_UDMA_SEM_EXECUTE                            0x0000006C /* -W-4R */
#define NV_UDMA_SEM_EXECUTE_OPERATION                         2:0 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_OPERATION_ACQUIRE          0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_RELEASE          0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_ACQ_STRICT_GEQ   0x00000002 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_ACQ_CIRC_GEQ     0x00000003 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_ACQ_AND          0x00000004 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_ACQ_NOR          0x00000005 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_OPERATION_REDUCTION        0x00000006 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_ACQUIRE_SWITCH_TSG              12:12 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_ACQUIRE_SWITCH_TSG_DIS     0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_ACQUIRE_SWITCH_TSG_EN      0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_RELEASE_WFI                     20:20 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_RELEASE_WFI_DIS            0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_RELEASE_WFI_EN             0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_PAYLOAD_SIZE                    24:24 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_PAYLOAD_SIZE_32BIT         0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_PAYLOAD_SIZE_64BIT         0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_RELEASE_TIMESTAMP               25:25 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_RELEASE_TIMESTAMP_DIS      0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_RELEASE_TIMESTAMP_EN       0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION                       30:27 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IMIN             0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IMAX             0x00000001 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IXOR             0x00000002 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IAND             0x00000003 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IOR              0x00000004 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_IADD             0x00000005 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_INC              0x00000006 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_DEC              0x00000007 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_FORMAT                31:31 /* -W-VF */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_FORMAT_SIGNED    0x00000000 /* -W--V */
#define NV_UDMA_SEM_EXECUTE_REDUCTION_FORMAT_UNSIGNED  0x00000001 /* -W--V */

NON_STALL_INT [method] - Non-Stalling Interrupt Method

The NON_STALL_INT method causes the NV_PFIFO_INTR_0_CHANNEL_INTR field to be set to PENDING in the channel's interrupt register, as well as in the NV_PFIFO_INTR_HIER_* registers. This will cause an interrupt if it is enabled. Host does not stall the execution of the GPU context's methods, does not switch out the GPU context, and does not disable switching the GPU context. A NON_STALL_INT method's data (NV_UDMA_NON_STALL_INT_HANDLE) is ignored.

Software should handle all of a channel's non-stalling interrupts before it unbinds the channel from the GPU context.

#define NV_UDMA_NON_STALL_INT                          0x00000020 /* -W-4R */
#define NV_UDMA_NON_STALL_INT_HANDLE                         31:0 /* -W-VF */

MEM_OP methods: membars, and cache and TLB management.

MEM_OP_A, MEM_OP_B, and MEM_OP_C set up state for performing a memory operation. MEM_OP_D sets additional state, specifies the type of memory operation to perform, and triggers sending the mem op to HUB. To avoid unexpected behavior for future revisions of the MEM_OP methods, all 4 methods should be sent for each requested mem op, with irrelevant fields set to 0. Note that hardware does not enforce the requirement that unrelated fields be set to 0, but ignoring this advice could break forward compatibility.
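For example, a hedged C sketch of issuing a SYS_MEMBAR mem op, sending all four methods with every irrelevant field zeroed as recommended above; push_method() is a hypothetical stand-in for however methods reach the pushbuffer:

    #include <stdint.h>

    /* Hypothetical helper: emit one method (address dword, data dword). */
    extern void push_method(uint32_t mthd_addr, uint32_t data);

    static void mem_op_sys_membar(void)
    {
        push_method(0x28, 0);           /* NV_UDMA_MEM_OP_A: unused fields, 0 */
        push_method(0x2c, 0);           /* NV_UDMA_MEM_OP_B: unused fields, 0 */
        push_method(0x30, 0x0u << 0);   /* MEM_OP_C: MEMBAR_TYPE = SYS_MEMBAR */
        push_method(0x34, 0x5u << 27);  /* MEM_OP_D: OPERATION = MEMBAR       */
    }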
Host does not wait until an engine is idle before beginning to execute this method. While a GPU context is bound to a channel and assigned to a PBDMA unit, the NV_UDMA_MEM_OP_A-C values are stored in the NV_PPBDMA_MEM_OP_A-C registers respectively. While the GPU context is not assigned to a PBDMA unit, these values are stored in the respective NV_RAMFC_MEM_OP_A-C fields of the RAMFC part of the GPU context's instance block in memory.

Usage, operations, and configuration:

MEM_OP_D_OPERATION specifies the type of memory operation to perform. This field determines the value of the opcode on the Host/FB interface. When Host encounters the MEM_OP_D method, Host sends the specified request to the FB and waits for an indication that the request has completed before beginning to process the next method. To issue a memory operation, first issue the 3 MEM_OP_A-C methods to configure the operation as documented below. Then send MEM_OP_D to complete the configuration and trigger the operation.

The operations available for MEM_OP_D_OPERATION are as follows:

  MEMBAR - perform a memory barrier; see below.

  MMU_TLB_INVALIDATE - invalidate page translation and attribute data from the
    given page directory that are cached in the Memory-Management Unit TLBs.

  MMU_TLB_INVALIDATE_TARGETED - invalidate page translation and attribute data
    corresponding to a specific page in a given page directory.

  L2_SYSMEM_INVALIDATE - invalidate data from system memory cached in L2.

  L2_PEERMEM_INVALIDATE - invalidate peer-to-peer data in the L2 cache.

  L2_CLEAN_COMPTAGS - clean the L2 compression tag cache.

  L2_FLUSH_DIRTY - flush dirty lines from L2.

  L2_WAIT_FOR_SYS_PENDING_READS - ensure all sysmem reads are past the point of
    being modified by a write through a reflected mapping. To do this, L2
    drains all sysmem reads to the point where they cannot be modified by
    future non-blocking writes to reflected sysmem. L2 will block any new
    sysmem read requests and drain out all read responses. Note that VCs with
    sysmem read requests at the head will stall any request until the flush is
    complete. The niso-nb VC does not have sysmem read requests, so it
    continues to flow. L2 will ack that the sys flush is complete and unblock
    all VCs. Note this operation is a NOP on Tegra chips.

  ACCESS_COUNTER_CLR - clear page access counters.

Depending on the operation given in MEM_OP_D_OPERATION, the other fields of all four MEM_OP methods are interpreted differently:

MMU_TLB_INVALIDATE*
-------------------

When the operation is MMU_TLB_INVALIDATE or MMU_TLB_INVALIDATE_TARGETED, Host will initiate a TLB invalidate as described above. The MEM_OP configuration fields specify what to invalidate, where to perform the invalidate, and optionally trigger a replay or cancel event for replayable faults buffered within the TLBs as part of UVM page management. When the operation is MMU_TLB_INVALIDATE_TARGETED, MEM_OP_C_TLB_INVALIDATE_PDB must be ONE, and the TLB_INVALIDATE_TARGET_ADDR_LO and HI fields must be filled in to specify the target page.

These operations are privileged and can only be executed from channels with NV_PPBDMA_CONFIG_AUTH_LEVEL set to PRIVILEGED. This is configured via the NV_RAMFC_CONFIG dword in the channel's RAMFC during channel setup.

MEM_OP_A_TLB_INVALIDATE_CANCEL_TARGET_GPC_ID and MEM_OP_A_TLB_INVALIDATE_CANCEL_TARGET_CLIENT_UNIT_ID identify, respectively, the GPC and the uTLB within that GPC that should perform the cancel operation when MEM_OP_C_TLB_INVALIDATE_REPLAY is CANCEL_TARGETED.
These field values should be copied from the GPC_ID and CLIENT fields of the associated NV_UVM_FAULT_BUF_ENTRY packet or NV_PFIFO_INTR_MMU_FAULT_INFO(i) entry. The CLIENT_UNIT_ID corresponds to the values specified by NV_PFAULT_CLIENT_GPC_* in dev_fault.ref. These fields are used with the CANCEL_TARGETED operation. The fields also overlap with CANCEL_MMU_ENGINE_ID, and are interpreted as CANCEL_MMU_ENGINE_ID during a replay of type REPLAY_CANCEL_VA_GLOBAL. For other replay operations, these fields must be 0.

MEM_OP_A_TLB_INVALIDATE_CANCEL_MMU_ENGINE_ID specifies the associated MMU_ENGINE_ID of the requests targeted by a REPLAY_CANCEL_VA_GLOBAL operation. The field is ignored if the replay operation is not REPLAY_CANCEL_VA_GLOBAL. This field overlaps with the CANCEL_TARGET_GPC_ID and CANCEL_TARGET_CLIENT_UNIT_ID fields.

MEM_OP_A_TLB_INVALIDATE_INVALIDATION_SIZE is aliased/repurposed with the MEM_OP_A_TLB_INVALIDATE_CANCEL_TARGET_CLIENT_UNIT_ID field when MEM_OP_C_TLB_INVALIDATE_REPLAY (below) is anything other than CANCEL_TARGETED, CANCEL_VA_GLOBAL, or CANCEL_VA_TARGETED. In the replay types that enable an invalidation size, the actual region to be invalidated is calculated as 4K*(2^INVALIDATION_SIZE), i.e., 4K*(2^CANCEL_TARGET_CLIENT_UNIT_ID); the client unit id and gpc id are not applicable.

MEM_OP_A_TLB_INVALIDATE_SYSMEMBAR controls whether a Hub SYSMEMBAR operation is performed after waiting for all outstanding acks to complete, after the TLB is invalidated. Note that if ACK_TYPE is ACK_TYPE_NONE, then this field is ignored and no MEMBAR will be performed. This is provided as a SW optimization so that SW does not need to perform a NV_UDMA_MEM_OP_D_OPERATION_MEMBAR op with MEMBAR_TYPE SYS_MEMBAR after the TLB_INVALIDATE. This field must be 0 if TLB_INVALIDATE_GPC is DISABLE.

MEM_OP_B_TLB_INVALIDATE_TARGET_ADDR_HI:MEM_OP_A_TLB_INVALIDATE_TARGET_ADDR_LO specifies the 4k-aligned virtual address of the page whose translation to invalidate within the TLBs. These fields are valid only when OPERATION is MMU_TLB_INVALIDATE_TARGETED; otherwise, they must be set to 0.

MEM_OP_C_TLB_INVALIDATE_PDB controls whether a TLB invalidate should apply to a particular page directory or to all of them. If PDB is ALL, then all page directories are invalidated. If PDB is ONE, then the PDB address and aperture are specified in the PDB_ADDR_LO:PDB_ADDR_HI and PDB_APERTURE fields. Note that ALL does not make sense when OPERATION is MMU_TLB_INVALIDATE_TARGETED; the behavior in that case is undefined.

MEM_OP_C_TLB_INVALIDATE_GPC controls whether the GPC-MMU and uTLB entries should be invalidated in addition to the Hub-MMU TLB (note: the Hub TLB is always invalidated). Set it to INVALIDATE_GPC_ENABLE to invalidate the GPC TLBs. The REPLAY, ACK_TYPE, and SYSMEMBAR fields are only used by the GPC TLB and so are ignored if INVALIDATE_GPC is DISABLE.

MEM_OP_C_TLB_INVALIDATE_REPLAY specifies the type of replay to perform in addition to the invalidate. A replay causes all replayable faults outstanding in the TLB to attempt their translations again. Once a TLB acks a replay, that TLB may start accepting new translations again. The replay flavors are as follows:

  NONE - do not replay any replayable faults on invalidate.

  START - initiate a replay across all TLBs, but don't wait for completion.
    The replay will be acked as soon as the invalidate is processed, but the
    replays themselves are in flight and not necessarily translated.

  START_ACK_ALL - initiate the replay and wait until it completes.
    The replay will be acked after all pending transactions in the replay fifo
    have been translated. New requests will remain stalled in the gpcmmu until
    all transactions in the replay fifo have completed and there are no pending
    faults left in the replay fifo.

  CANCEL_TARGETED - initiate a cancel-replay on a targeted uTLB, causing any
    replayable translations buffered in that uTLB to become non-replayable if
    they fault again. In this case, the first faulting translation will be
    reported in the NV_PFIFO_INTR_MMU_FAULT registers and will raise
    PFIFO_INTR_0_MMU_FAULT. The specific TLB to target for the cancel is
    specified in the CANCEL_TARGET fields. Note the TLB invalidate still
    applies globally to all TLBs.

  CANCEL_GLOBAL - like CANCEL_TARGETED, but all TLBs will cancel-replay.

  CANCEL_VA_GLOBAL - initiates a cancel operation that cancels all requests
    with the matching mmu_engine_id and access_type that land in the specified
    4KB-aligned virtual address within the scope of the specified PDB. All
    other requests are replayed. If the specified engine is not bound, or if
    the PDB of the specified engine does not match the specified PDB, all
    requests will be replayed and none will be canceled.

MEM_OP_C_TLB_INVALIDATE_ACK_TYPE controls which sort of ack the uTLBs wait for after having issued a membar to L2. ACK_TYPE_NONE does not perform any sort of membar. ACK_TYPE_INTRANODE waits for an ack from the XBAR. ACK_TYPE_GLOBALLY waits for an L2 ack. ACK_TYPE_GLOBALLY is equivalent to a MEMBAR operation from the engine, or a SYS_MEMBAR if MEM_OP_A_TLB_INVALIDATE_SYSMEMBAR is EN.

MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL specifies which levels of the TLB's cached page directory hierarchy to invalidate. The levels are numbered from the bottom up, with the PTE at the bottom as level 1. The specified level and all those below it in the hierarchy--that is, all those with a lower-numbered level--are invalidated. ALL (the 0 default) is special-cased to indicate the top level; this causes the invalidate to apply to the entire page mapping structure. The field is ignored if the replay operation is REPLAY_CANCEL_VA_GLOBAL.

MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE specifies the associated ACCESS_TYPE of the requests targeted by a REPLAY_CANCEL_VA_GLOBAL operation. This field overlaps with the INVALIDATE_PAGE_TABLE_LEVEL field, and is ignored if the replay operation is not REPLAY_CANCEL_VA_GLOBAL. The ACCESS_TYPE field can take one of the following values:

  READ - the cancel_va_global should be performed on all pending read requests.
  WRITE - the cancel_va_global should be performed on all pending write requests.
  ATOMIC_STRONG - the cancel_va_global should be performed on all pending strong atomic requests.
  ATOMIC_WEAK - the cancel_va_global should be performed on all pending weak atomic requests.
  ATOMIC_ALL - the cancel_va_global should be performed on all pending atomic requests.
  WRITE_AND_ATOMIC - the cancel_va_global should be performed on all pending write and atomic requests.
  ALL - the cancel_va_global should be performed on all pending requests.

MEM_OP_C_TLB_INVALIDATE_PDB_APERTURE specifies the target aperture of the page directory for which TLB entries should be invalidated. This field must be 0 when TLB_INVALIDATE_PDB is ALL.

MEM_OP_C_TLB_INVALIDATE_PDB_ADDR_LO specifies the low 20 bits of the 4k-block-aligned PDB (base address of the page directory) when TLB_INVALIDATE_PDB is ONE; otherwise this field must be 0.
The PDB byte address should be 4k aligned and right-shifted by 12 before being split and packed into the ADDR fields. Note that the PDB_ADDR_LO field starts at bit 12, so it is possible to set MEM_OP_C to the low 32 bits of the byte address, mask off the low 12 bits, and then OR in the rest of the configuration fields.

MEM_OP_D_TLB_INVALIDATE_PDB_ADDR_HI contains the high bits of the PDB when TLB_INVALIDATE_PDB is ONE. Otherwise this field must be 0.

UVM handling of replayable faults:

The following example illustrates how a TLB invalidate may be used by the UVM driver:

1. When the TLB invalidate completes, all memory accesses using the old TLB entries prior to the invalidate will finish translation (but not completion), and any new virtual accesses will trigger new translations. The outstanding in-flight translations are allowed to fault but will not indefinitely stall the invalidate.

2. When the TLB invalidate completes, in-flight memory accesses using the old physical translations may not yet be visible to other GPU clients (such as CopyEngine) or to the CPU. Accesses coming from clients that support recoverable faults (such as TEX and GCC) can be made visible by requesting the MMU to perform a membar using the ACK_TYPE and SYSMEMBAR fields.

   a. If ACK_TYPE is NONE, the SYSMEMBAR field is ignored and no membar is performed.

   b. If ACK_TYPE is INTRANODE, the invalidate will wait until all in-flight physical accesses using the old translations are visible to XBAR clients on the blocking VC.

   c. If ACK_TYPE is GLOBALLY, the invalidate will wait until all in-flight physical accesses using the old translations are at the point of coherence in L2, meaning writes will be visible to all other GPU clients and reads will not be mutable by them.

   d. If the SYSMEMBAR field is set to EN, then a Hub SYSMEMBAR will also be performed following the ACK_TYPE membar. This is the equivalent of performing a NV_UDMA_MEM_OP_C_MEMBAR_TYPE_SYS_MEMBAR.

3. If fault replay was requested, then all pending recoverable faults in the TLB replay list will be retranslated. This includes all faults discovered while the invalidate was pending. This replay may generate more recoverable faults.

4. If fault replay cancel was requested, then another replay is attempted of all pending replayable faults on the targeted TLB(s). If any of these re-fault, they are discarded (sticky NACK, or ACK/TRAP sent back to the client, depending on the setting of NV_PGPC_PRI_MMU_DEBUG_CTRL).

MEMBAR
------

When the operation is MEMBAR, Host will perform a memory barrier operation. All other fields must be set to 0 except for MEM_OP_C_MEMBAR_TYPE. When MEMBAR_TYPE is MEMBAR, then a memory barrier will be performed with respect to other clients on the GPU. When it is SYS_MEMBAR, the memory barrier will also be performed with respect to the CPU and peer GPUs.

MEMBAR - This issues a MEMBAR operation following all reads, writes, and atomics currently in flight from the PBDMA. The MEMBAR operation will push all such accesses already in flight on the same VC as the PBDMA to a point of GPU coherence before proceeding. After this operation is complete, reads from any GPU client will see prior writes from this PBDMA, and writes from any GPU client cannot modify the return data of earlier reads from this PBDMA. This is true regardless of whether those accesses target vidmem, sysmem, or peer mem. WARNING: This only guarantees that accesses from the same VC as the PBDMA that are already in flight are coherent.
Accesses from clients such as an SM or a non-PBDMA engine must already be at some point of coherence before this operation in order to be made coherent by it.

SYS_MEMBAR - This implies the MEMBAR type above, but in addition to having accesses reach coherence with all GPU clients, it further waits for accesses to be coherent with respect to the CPU and peer GPUs as well. After this operation is complete, reads from the CPU or peer GPUs will see prior writes from this PBDMA, and writes from the CPU or peer GPUs cannot modify the return data of earlier reads from this PBDMA (with the exception of CPU reflected writes, which can modify earlier reads). Note SYS_MEMBAR is really only needed to guarantee ordering with off-chip clients. For on-chip clients such as the graphics engine or copy engine, accesses to sysmem will be coherent with just a MEMBAR operation. SYS_MEMBAR provides the same function as OPERATION_SYSMEMBAR_FLUSH on previous architectures. WARNING: As described above, SYS_MEMBAR will not prevent CPU reflected writes issued after the SYS_MEMBAR from clobbering the return data of reads issued before the SYS_MEMBAR. To handle this case, the invalidate must be followed with a separate L2_WAIT_FOR_SYS_PENDING_READS mem op.

L2*
---

These values initiate a cache management operation--see above. All other fields must be 0; there are no configuration options.

The ACCESS_COUNTER_CLR operation
--------------------------------

When MEM_OP_D_OPERATION is ACCESS_COUNTER_CLR, Host will request to clear the page access counters. There are two types of access counters: MIMC and MOMC. This operation can be issued to clear all counters of all types, all counters of a specified type (MIMC or MOMC), or a specific counter indicated by counter type, bank, and notify tag. This operation is privileged and can only be executed from channels with NV_PPBDMA_CONFIG_AUTH_LEVEL set to PRIVILEGED. This is configured via the NV_RAMFC_CONFIG dword in the channel's RAMFC during channel setup.

The operation uses the following fields in the MEM_OP_* methods:

  ACCESS_COUNTER_CLR_TYPE (TY)           : type of the access counter clear operation
  ACCESS_COUNTER_CLR_TARGETED_TYPE (T)   : type of the access counter for a targeted operation
  ACCESS_COUNTER_CLR_TARGETED_NOTIFY_TAG : 20-bit notify tag of the access counter for a targeted operation
  ACCESS_COUNTER_CLR_TARGETED_BANK       : 4-bit bank number of the access counter for a targeted operation

MEM_OP method field defines:

MEM_OP_A [method] - Memory Operation Method 1/4 - see above for documentation

#define NV_UDMA_MEM_OP_A                                              0x00000028 /* -W-4R */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_CANCEL_TARGET_CLIENT_UNIT_ID        5:0 /* -W-VF */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_INVALIDATION_SIZE                   5:0 /* -W-VF */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_CANCEL_TARGET_GPC_ID               10:6 /* -W-VF */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_CANCEL_MMU_ENGINE_ID                6:0 /* -W-VF */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_SYSMEMBAR                         11:11 /* -W-VF */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_SYSMEMBAR_EN                  0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_SYSMEMBAR_DIS                 0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_A_TLB_INVALIDATE_TARGET_ADDR_LO                     31:12 /* -W-VF */

MEM_OP_B [method] - Memory Operation Method 2/4 - see above for documentation

#define NV_UDMA_MEM_OP_B                                              0x0000002c /* -W-4R */
#define NV_UDMA_MEM_OP_B_TLB_INVALIDATE_TARGET_ADDR_HI                      31:0 /* -W-VF */

MEM_OP_C [method] - Memory Operation Method 3/4 - see above for documentation

#define NV_UDMA_MEM_OP_C                                              0x00000030 /* -W-4R */

Membar configuration field.
Note: overlaps the MMU_TLB_INVALIDATE* config fields.

#define NV_UDMA_MEM_OP_C_MEMBAR_TYPE                                         2:0 /* -W-VF */
#define NV_UDMA_MEM_OP_C_MEMBAR_TYPE_SYS_MEMBAR                       0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_MEMBAR_TYPE_MEMBAR                           0x00000001 /* -W--V */

Invalidate TLB entries for ONE page directory base, or for ALL of them.

#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB                                  0:0 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_ONE                       0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_ALL                       0x00000001 /* -W--V */

Invalidate GPC MMU TLB entries or not (Hub-MMU entries are always invalidated).

#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_GPC                                  1:1 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_GPC_ENABLE                    0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_GPC_DISABLE                   0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY                               4:2 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_NONE                   0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_START                  0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_START_ACK_ALL          0x00000002 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_CANCEL_TARGETED        0x00000003 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_CANCEL_GLOBAL          0x00000004 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_REPLAY_CANCEL_VA_GLOBAL       0x00000005 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACK_TYPE                             6:5 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACK_TYPE_NONE                 0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACK_TYPE_GLOBALLY             0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACK_TYPE_INTRANODE            0x00000002 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE                          9:7 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_READ                  0 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_WRITE                 1 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_ATOMIC_STRONG         2 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_RSVRVD                3 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_ATOMIC_WEAK           4 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_ATOMIC_ALL            5 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_WRITE_AND_ATOMIC      6 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_ACCESS_TYPE_VIRT_ALL                   7 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL                     9:7 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_ALL          0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_PTE_ONLY     0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE0   0x00000002 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE1   0x00000003 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE2   0x00000004 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE3   0x00000005 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE4   0x00000006 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PAGE_TABLE_LEVEL_UP_TO_PDE5   0x00000007 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_APERTURE                       11:10 /* -W-VF */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_APERTURE_VID_MEM          0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_APERTURE_SYS_MEM_COHERENT 0x00000002 /* -W--V */
#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_APERTURE_SYS_MEM_NONCOHERENT 0x00000003 /* -W--V */

Address[31:12] of page directory for which
TLB entries should be invalidated.

#define NV_UDMA_MEM_OP_C_TLB_INVALIDATE_PDB_ADDR_LO                        31:12 /* -W-VF */
#define NV_UDMA_MEM_OP_C_ACCESS_COUNTER_CLR_TARGETED_NOTIFY_TAG             19:0 /* -W-VF */

MEM_OP_D [method] - Memory Operation Method 4/4 - see above for documentation (Must be preceded by MEM_OP_A-C.)

#define NV_UDMA_MEM_OP_D                                              0x00000034 /* -W-4R */

Address[58:32] of page directory for which TLB entries should be invalidated.

#define NV_UDMA_MEM_OP_D_TLB_INVALIDATE_PDB_ADDR_HI                         26:0 /* -W-VF */
#define NV_UDMA_MEM_OP_D_OPERATION                                         31:27 /* -W-VF */
#define NV_UDMA_MEM_OP_D_OPERATION_MEMBAR                             0x00000005 /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_MMU_TLB_INVALIDATE                 0x00000009 /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_MMU_TLB_INVALIDATE_TARGETED        0x0000000a /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_L2_PEERMEM_INVALIDATE              0x0000000d /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_L2_SYSMEM_INVALIDATE               0x0000000e /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_L2_CLEAN_COMPTAGS                  0x0000000f /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_L2_FLUSH_DIRTY                     0x00000010 /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_L2_WAIT_FOR_SYS_PENDING_READS      0x00000015 /* -W--V */
#define NV_UDMA_MEM_OP_D_OPERATION_ACCESS_COUNTER_CLR                 0x00000016 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TYPE                             1:0 /* -W-VF */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TYPE_MIMC                 0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TYPE_MOMC                 0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TYPE_ALL                  0x00000002 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TYPE_TARGETED             0x00000003 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TARGETED_TYPE                    2:2 /* -W-VF */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TARGETED_TYPE_MIMC        0x00000000 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TARGETED_TYPE_MOMC        0x00000001 /* -W--V */
#define NV_UDMA_MEM_OP_D_ACCESS_COUNTER_CLR_TARGETED_BANK                    6:3 /* -W-VF */

SET_REF [method] - Set Reference Count Method

The SET_REF method allows the user to set the reference count (NV_PPBDMA_REF_CNT) to a value. The reference count may be monitored to track Host's progress through the pushbuffer. Instead of monitoring NV_RAMUSERD_TOP_LEVEL_GET, software may put into the method stream SET_REF methods that set the reference count to ever-increasing values, and then read NV_RAMUSERD_REF to determine how far in the stream Host has gone.

Before the reference count value is altered, Host waits for the engine to be idle (to have completed executing all earlier methods), issues a SysMemBar flush, and waits for the flush to complete.

While the GPU context is bound to a channel and assigned to a PBDMA unit, the reference count value is stored in the NV_PPBDMA_REF register. While the GPU context is not assigned to a PBDMA unit, the reference count value is stored in the NV_RAMFC_REF field of the RAMFC portion of the GPU context's GPU-instance block.

#define NV_UDMA_SET_REF                                0x00000050 /* -W-4R */
#define NV_UDMA_SET_REF_CNT                                  31:0 /* -W-VF */

CRC_CHECK [method] - Method-CRC Check Method

When debugging a problem in a real chip, it may be useful to determine whether a PBDMA unit has sent the proper methods toward the engine. The CRC_CHECK method checks whether the cyclic redundancy check value calculated over previous methods has an expected value. If the value in the NV_PPBDMA_METHOD_CRC register is not equal to NV_UDMA_CRC_CHECK_VALUE, then Host initiates an interrupt (NV_PPBDMA_INTR_0_METHODCRC) and stalls.
CRC_CHECK [method] - Method-CRC Check Method

When debugging a problem in a real chip, it may be useful to determine whether a PBDMA unit has sent the proper methods toward the engine. The CRC_CHECK method checks whether the cyclic redundancy check value calculated over previous methods has the expected value. If the value in the NV_PPBDMA_METHOD_CRC register is not equal to NV_UDMA_CRC_CHECK_VALUE, then Host initiates an interrupt (NV_PPBDMA_INTR_0_METHODCRC) and stalls. After each comparison, the NV_PPBDMA_METHOD_CRC register is cleared.

The IEEE 802.3 CRC-32 polynomial (0x04c11db7) is used to calculate CRC values. The CRC is calculated over the method subchannel, method address, and method data of methods sent to an engine. Host can send both single and dual methods to engines; the CRC is calculated as if dual methods were sent as two single methods. The CRC is calculated on the byte stream in little-endian order. Pseudocode for the CRC calculation:

    typedef unsigned int NVR_U32;   /* 32-bit unsigned */

    static NVR_U32 table[256];

    /* Create the CRC value for every possible byte. */
    void init(void)
    {
        for (NVR_U32 i = 0; i < 256; i++) {
            NVR_U32 crc = i << 24;
            for (int j = 0; j < 8; j++) {        /* for every bit in the byte */
                if (crc & 0x80000000)
                    crc = (crc << 1) ^ 0x04c11db7;
                else
                    crc = (crc << 1);
            }
            table[i] = crc;
        }
    }

    /* Fold one byte into the running CRC. */
    NVR_U32 new_crc(unsigned char byte, NVR_U32 old_crc)
    {
        NVR_U32 crc_top_byte = old_crc >> 24;
        crc_top_byte ^= byte;
        return (old_crc << 8) ^ table[crc_top_byte];
    }

This method is used for debug. This method was added in Fermi.

#define NV_UDMA_CRC_CHECK                                       0x0000007c /* -W-4R */
#define NV_UDMA_CRC_CHECK_VALUE                                       31:0 /* -W-VF */
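To make the byte order concrete, the sketch below folds one 32-bit value into the running CRC least-significant byte first, which is one plausible reading of "little-endian order" above. Exactly how Host packs the subchannel, method address, and method data into the byte stream is not spelled out here, so treat crc_dword() as illustrative only.

    typedef unsigned int NVR_U32;

    NVR_U32 new_crc(unsigned char byte, NVR_U32 old_crc);  /* from the pseudocode above */

    /* Feed one dword into the CRC, least-significant byte first. */
    NVR_U32 crc_dword(NVR_U32 dword, NVR_U32 crc)
    {
        for (int i = 0; i < 4; i++)
            crc = new_crc((unsigned char)(dword >> (8 * i)), crc);
        return crc;
    }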
YIELD [method] - Yield Method

The YIELD method causes a channel to yield the remainder of its timeslice. The method's OP field specifies whether the channel's PBDMA timeslice, the channel's runlist timeslice, or no timeslice is yielded.

If the OP is YIELD_OP_RUNLIST_TIMESLICE, then Host acts as if the channel's runlist or TSG timeslice had expired: Host exits the TSG and switches to the next channel after the TSG on the runlist. If there is no such channel to switch to, then YIELD_OP_RUNLIST_TIMESLICE does not cause a switch. When the PBDMA executes a YIELD_OP_RUNLIST_TIMESLICE method, it guarantees that it will not execute further methods from the same channel or TSG until the channel is restarted by the scheduler. However, note that this does not yield the engine timeslice; if the engine is preemptable, the context will continue to run on the engine for the remainder of its timeslice before Host attempts to preempt it. Also, if there is an outstanding context load, either due to ctx_reload or from the other PBDMA in the SCG case, the yield will not take place until the outstanding context load finishes or aborts due to a preempt. When the context load does complete on the other PBDMA, it is possible for that PBDMA to execute some small number of additional methods before the runlist yield takes effect and that PBDMA halts work for its channel.

If the OP is NV_UDMA_YIELD_OP_TSG and the channel is part of a TSG, then Host switches to the next channel in the same TSG; if the channel is not part of a TSG, the method is treated like YIELD_OP_NOP. If there is only one channel with work in the TSG, Host simply reschedules the same channel in the TSG. YIELD_OP_TSG does not cause the scheduler to leave the TSG. The TSG timeslice counter (for TSGs, the TSG timeslice is equivalent to the runlist timeslice) continues to increment through the channel switch and does not restart after the yield method executes. When the PBDMA executes a Yield method, it guarantees that it will not execute the method following that Yield until the channel is restarted by the scheduler.

YIELD_OP_NOP is simply a NOP; neither timeslice is yielded. It is kept for compatibility with existing tests. NV_UDMA_NOP is the preferred NOP, but also see the universal NOP PB instruction, described under NV_FIFO_DMA_NOP in the "FIFO_DMA" section of dev_ram.ref.

If an unknown OP is specified, Host raises an NV_PPBDMA_INTR_*_METHOD interrupt.

#define NV_UDMA_YIELD                                           0x00000080 /* -W-4R */
#define NV_UDMA_YIELD_OP                                               1:0 /* -W-VF */
#define NV_UDMA_YIELD_OP_NOP                                    0x00000000 /* -W--V */
#define NV_UDMA_YIELD_OP_RUNLIST_TIMESLICE                      0x00000002 /* -W--V */
#define NV_UDMA_YIELD_OP_TSG                                    0x00000003 /* -W--V */

WFI [method] - Wait-for-Idle Method

The WFI (Wait-For-Idle) method stalls Host from processing any more methods on the channel until the engine to which the channel last sent methods is idle. Note that the subchannel encoded in the method header is ignored (as it is for all Host-only methods) and does NOT specify which engine to idle. In Kepler, this is only relevant on runlists that serve multiple engines (specifically, the graphics runlist, which also serves GR COPY).

The WFI method has a single field, SCOPE, which specifies the level of WFI the Host method performs. ALL waits for all work in the engine from the same context to be idle, across all classes and subchannels. CURRENT_VEID causes the WFI to apply only to work from the same VEID as the current channel. For engines that do not support VEIDs, CURRENT_VEID works identically to ALL.

Note that Host methods ignore the subchannel field in the method; a Host WFI method always applies to the engine the channel last sent methods to. If a WFI with ALL is specified and the channel last sent work to the GRCE, this guarantees only that GRCE has no work in progress. It is possible that the GR context will have work in progress from other VEIDs, or even from the current VEID if the current channel targets GRCE and has never sent FE methods before. This means that if SW wants to idle the graphics pipe for all VEIDs, SW must send a method to GR immediately before the WFI method. A GR_NOP is sufficient (see the sketch following the defines below).

Note also that even if the current NV_PPBDMA_TARGET is GRAPHICS and not GRCE, there are cases where Host can trivially complete a WFI without sending the NV_PMETHOD_HOST_WFI internal method to FE. This can happen when:
 1. the runlist timeslices to a different TSG just before the WFI method,
 2. the other TSG makes a ctxsw request due to methods for FE, and
 3. FECS reports non-preempted in the ctx ack, so CTX_RELOAD does not get set.
In that case, when the channel switches back onto the PBDMA, the PBDMA rightly concludes that there is no way the context could be non-idle for that channel, and therefore filters out the WFI, even if the other PBDMA is sending work to other VEIDs. As in the subchannel case, a GR_NOP preceding the WFI is sufficient to ensure that a SCOPE_ALL_VEID WFI will be sent to FE regardless of timeslicing, as long as the NOP and the WFI are submitted as part of the same GP_PUT update. This is ensured by the semantics of the channel state SHOULD_SEND_HOST_TSG_EVENT, which behaves like CTX_RELOAD: the GR_NOP causes the PBDMA to set the SHOULD_SEND_HOST_TSG_EVENT state, so even a channel or context switch will still result in the PBDMA having the engine context loaded. Thus the WFI will cause the HOST_WFI internal method to be sent to FE.

#define NV_UDMA_WFI                                             0x00000078 /* -W-4R */
#define NV_UDMA_WFI_SCOPE                                              0:0 /* -W-VF */
#define NV_UDMA_WFI_SCOPE_CURRENT_VEID                          0x00000000 /* -W--V */
#define NV_UDMA_WFI_SCOPE_ALL                                   0x00000001 /* -W--V */
#define NV_UDMA_WFI_SCOPE_ALL_VEID                              0x00000001 /* */
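As an illustration of the GR_NOP-before-WFI recommendation above, consider the following sketch. The emit_method() helper, the GR_SUBCH binding, and the GR_NOP_MTHD address are hypothetical placeholders (the real GR class and its NOP method are defined elsewhere); only NV_UDMA_WFI and its SCOPE field come from the defines above.

    #include <stdint.h>

    #define NV_UDMA_WFI                0x00000078
    #define NV_UDMA_WFI_SCOPE_ALL_VEID 0x00000001

    /* Hypothetical: append one (subchannel, method address, data) method. */
    void emit_method(unsigned subch, uint32_t mthd_addr, uint32_t data);

    #define GR_SUBCH     0        /* hypothetical: subchannel bound to GR */
    #define GR_NOP_MTHD  0x0100   /* hypothetical: GR class NOP method */

    /* Idle the graphics pipe for all VEIDs. */
    void idle_graphics_all_veids(void)
    {
        emit_method(GR_SUBCH, GR_NOP_MTHD, 0);   /* ensures the WFI reaches FE */
        emit_method(GR_SUBCH, NV_UDMA_WFI, NV_UDMA_WFI_SCOPE_ALL_VEID);
        /* Both methods must be covered by the same GP_PUT update for the
         * SHOULD_SEND_HOST_TSG_EVENT guarantee described above to hold. */
    }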
CLEAR_FAULTED [method] - Clear Faulted Method

The CLEAR_FAULTED method clears a channel's PCCSR PBDMA_FAULTED or ENG_FAULTED bit. These bits are set by Host in response to a PBDMA fault or an engine fault, respectively, on the specified channel; see dev_fifo.ref. The CHID field specifies the ID of the channel whose FAULTED bit is to be cleared. The TYPE field specifies which FAULTED bit is to be cleared: either PBDMA_FAULTED or ENG_FAULTED. (A packing sketch for the CLEAR_FAULTED data word appears at the end of this section.)

When Host receives a CLEAR_FAULTED method for a channel, the corresponding PCCSR FAULTED bit for the channel should already be set. However, because SW may see the fault message from MMU, handle the fault, and send the CLEAR_FAULTED method before Host sees the fault from CE or MMU and sets the FAULTED bit, it is possible for the CLEAR_FAULTED method to arrive before the FAULTED bit is set. Host handles a CLEAR_FAULTED method according to the following cases:

a. The FAULTED bit specified by TYPE is set. Host clears the bit and retires the CLEAR_FAULTED method.

b. If the bit is not set, the PBDMA retries the CLEAR_FAULTED method on every PTIMER microsecond tick (approximately every microsecond) by rechecking the FAULTED bit of the target channel. Once the bit is set, the PBDMA clears the bit and retires the method. Execution on the fault-handling channel stalls on the CLEAR_FAULTED method until the FAULTED bit for the target channel is set.

c. If the fault-handling channel's timeslice expires while stalled on a CLEAR_FAULTED method, the channel switches out. Once rescheduled, the channel resumes retrying the CLEAR_FAULTED method.

d. To avoid waiting indefinitely for the CLEAR_FAULTED method to retire (likely due to a CLEAR_FAULTED method wrongly injected by a SW bug), Host has a timeout mechanism to inform SW of a potential bug. This timeout is controlled by NV_PFIFO_CLEAR_FAULTED_TIMEOUT; see dev_fifo.ref for details.

e. When a CLEAR_FAULTED timeout is detected, Host raises a stalling interrupt by setting the NV_PPBDMA_INTR_0_CLEAR_FAULTED_ERROR field. The address of the invalid CLEAR_FAULTED method will be in NV_PPBDMA_METHOD0, and its payload will be in NV_PPBDMA_DATA0.

Note: setting the timeout value too low could result in false stalling interrupts to SW. The timeout should be set equal to NV_PFIFO_FB_TIMEOUT_PERIOD.

Note that the CLEAR_FAULTED timeout mechanism uses the same PBDMA registers and RAMFC fields as the semaphore-acquire timeout mechanism: NV_PPBDMA_SEM_EXECUTE_ACQUIRE_FAIL is set TRUE when the first attempt fails, and NV_PPBDMA_ACQUIRE_DEADLINE is loaded with the sum of the current PTIMER value and NV_PFIFO_CLEAR_FAULTED_TIMEOUT. The ACQUIRE_FAIL bit is reset to FALSE when the CLEAR_FAULTED method times out or succeeds.

#define NV_UDMA_CLEAR_FAULTED                                   0x00000084 /* -W-4R */
#define NV_UDMA_CLEAR_FAULTED_CHID                                    11:0 /* -W-VF */
#define NV_UDMA_CLEAR_FAULTED_TYPE                                   31:31 /* -W-VF */
#define NV_UDMA_CLEAR_FAULTED_TYPE_PBDMA_FAULTED                0x00000000 /* -W--V */
#define NV_UDMA_CLEAR_FAULTED_TYPE_ENG_FAULTED                  0x00000001 /* -W--V */

Addresses that are not defined in this device are reserved. Addresses below 0x100 are reserved for future Host methods. Addresses 0x100 and beyond are reserved for the engines served by Host.
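Finally, as an illustration of the CLEAR_FAULTED encoding above, a minimal sketch of packing its data word; pack_clear_faulted() is a hypothetical helper, while the field layout comes from the defines in the CLEAR_FAULTED section.

    #include <stdint.h>

    #define NV_UDMA_CLEAR_FAULTED_TYPE_PBDMA_FAULTED 0u
    #define NV_UDMA_CLEAR_FAULTED_TYPE_ENG_FAULTED   1u

    /* CHID in bits 11:0, TYPE in bit 31. */
    uint32_t pack_clear_faulted(uint32_t chid, uint32_t type)
    {
        return (chid & 0xFFFu) | ((type & 1u) << 31);
    }

    /* Example: clear channel 42's ENG_FAULTED bit.
     * uint32_t data = pack_clear_faulted(42, NV_UDMA_CLEAR_FAULTED_TYPE_ENG_FAULTED);
     */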