Leo Driver Specification

Kishore Kumar

1.0 Introduction

Leo is supported on SVR4 only as a loadable driver. Mars is the OS release platform. The same driver supports both the Leo104 and Leo48 accelerators. C2/Viking and Sunergy are the main hardware platforms for Leo104 and Leo48 respectively. The driver package consists of a device driver, a device-specific segment driver and an MMU driver. Section 2 of this document describes the role of these drivers during the read DMA operation. Section 3 explains the role of the Leo segment driver during transparent graphic device context switching. Sections 4, 5 and 6 explain the routines in the three drivers. Since Leo uses the segment driver and the HAT layer, it is not DDI/DKI compliant.

2.0 DMA

Leo supports two DMA mechanisms on the Accelerator port, one for reading and one for writing. The read DMA is used for fetching the vertex data from the user virtual address space to the Accelerator port. It bypasses the IOMMU and uses physical addresses on sun4m architecture based machines. The user sets a minimum threshold for the data transfer amount and incurs the overhead of setting up the DMA only for transfers above this minimum. SS2 and older platforms do not support physical DMA. On Sunergy, CPU reads and writes may be faster than DMA. C2 and Galaxy with the Ross module are currently not cache consistent for SBus master accesses. The user has to feed the vertex data directly to Leo on these machines. The write DMA is used for sending the pick data from the Accelerator port to the user's pick buffer, which is mapped to the DVMA space.
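
The threshold rule can be pictured with a small user-side helper. This is only a sketch: the structure, its field names and the function leo_choose_path() are hypothetical, and treating a transfer exactly at the threshold as a DMA transfer is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process transfer policy; these names are illustrative
 * and do not come from the Leo driver. */
struct leo_xfer_policy {
    bool   dma_usable;    /* false where read DMA is unsupported or not recommended */
    size_t min_dma_words; /* user-chosen threshold, in 32-bit words */
};

enum leo_xfer_path { LEO_XFER_PIO, LEO_XFER_DMA };

/* Pick the transfer path for a vertex list of nwords words. */
enum leo_xfer_path leo_choose_path(const struct leo_xfer_policy *p, size_t nwords)
{
    assert(p != NULL);
    if (!p->dma_usable)
        return LEO_XFER_PIO;          /* feed the vertex data directly */
    /* Assumption: a transfer exactly at the threshold already uses DMA. */
    return (nwords >= p->min_dma_words) ? LEO_XFER_DMA : LEO_XFER_PIO;
}
```

A library could fill in dma_usable once per machine, for example from the FB_SYS_INFO ioctl described in 4.7.28, and then call the helper for each vertex list.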

2.1 Physical DMA

The vertex data may range from a few words to several megabytes. The memory where this data is stored can be of any form (stack, mmap or malloc) and can be anywhere in the user's address space. Requiring the user to inform the driver in advance about the memory that will be used for DMA may not be attractive. To improve performance, it is also desirable not to call the kernel for every DMA request. Leo has an MMU to translate the 32 bit user virtual address to a physical address. This MMU supports a fixed 4K byte page size, a 3 level page table walk in the host memory, and Page Table Entry (PTE), Page Table Descriptor (PTD) and page table formats compatible with the Sparc Reference MMU (SRMMU). There is no Translation Lookaside Buffer. The MMU does the page table walk for every user initiated DMA request and also when the address crosses a page boundary. It does not modify the page tables, i.e. it does not update the reference or modify bits in the PTE. Leo will either share the SRMMU's page tables, for performance and design simplicity, or it will have its own page tables. Both of these cases are discussed below.
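
As an illustration of the walk, the following sketch models an SRMMU style 3 level lookup in software. The field layout (8, 6 and 6 index bits, a 12 bit page offset, the entry type in the low two bits, the page table pointer in bits 31:2 of a PTD and the physical page number in bits 31:8 of a PTE) follows the SPARC Reference MMU definition. The function names and the word array that stands in for host physical memory are hypothetical, and treating a PTE found above level 3 as a fault is an assumption based on the fixed 4K page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry type field (low two bits) of an SRMMU table entry. */
#define ET_MASK    0x3u
#define ET_INVALID 0x0u
#define ET_PTD     0x1u
#define ET_PTE     0x2u

/* Index fields of a 32-bit virtual address with 4K pages. */
static uint32_t va_l1(uint32_t va)  { return va >> 24; }           /* 256 entries */
static uint32_t va_l2(uint32_t va)  { return (va >> 18) & 0x3fu; } /* 64 entries */
static uint32_t va_l3(uint32_t va)  { return (va >> 12) & 0x3fu; } /* 64 entries */
static uint32_t va_off(uint32_t va) { return va & 0xfffu; }

/* A PTD holds bits 35:6 of the next table's physical address in bits 31:2. */
static uint64_t ptd_table(uint32_t ptd) { return (uint64_t)(ptd & ~ET_MASK) << 4; }

/* Read one 32-bit entry from the modelled physical memory (word array from address 0). */
static uint32_t rd(const uint32_t *phys, uint64_t paddr) { return phys[paddr / 4]; }

/* Walk the tables rooted at `root`. Returns 0 and stores the physical
 * address in *pa, or -1 where the hardware would raise the MMU interrupt. */
int leo_walk(const uint32_t *phys, uint64_t root, uint32_t va, uint64_t *pa)
{
    uint32_t e;

    assert(phys != NULL && pa != NULL);
    e = rd(phys, root + 4u * va_l1(va));
    if ((e & ET_MASK) != ET_PTD)
        return -1;
    e = rd(phys, ptd_table(e) + 4u * va_l2(va));
    if ((e & ET_MASK) != ET_PTD)
        return -1;
    e = rd(phys, ptd_table(e) + 4u * va_l3(va));
    if ((e & ET_MASK) != ET_PTE)
        return -1;
    *pa = ((uint64_t)(e >> 8) << 12) | va_off(va);
    return 0;
}
```

Because there is no TLB, a model like this would run once per DMA request and once more each time the address crosses a page boundary.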

2.1.1 Sharing the page table with SRMMU

When the user program mmaps the Accelerator port, the segment driver replaces the srmmu_hatops vector in that address space with the leo_hatops vector. Most of the routines in leo_hatops will be straight redirections to srmmu_hatops, except for the leo_pagesync() and leo_pageunload() routines, which do some device checking before calling the corresponding srmmu routines. After creating the vertex list, the user tries to access the Accelerator port registers to set up the relevant registers. Since real memory is not allocated when mmap() is called, the segleo_fault() routine gets called. If some other process has valid mappings to the Accelerator port, the fault handler switches the graphic context as explained in Section 3. If the fault handler is called for the first time for a particular address space, it finds the address of the level-1 page table for that address space and writes it into Leo's Table walk root pointer register. Since the kernel currently does not have routines to lock the srmmu context, the address of the current level-1 page table has to be found and loaded into the Leo register during every context switch. Then the fault handler allows the user to access the Leo registers. The user writes the address of the vertex buffer into the DMA virtual address register and the size of the vertex list into the DMA word count register. Writing to the word count register causes the Leo MMU to do the page table walk on the page table specified in the Table walk root pointer register. In most cases it finds a valid PTE, reads the physical page frame number and then starts the DMA read operation. For every DMA transfer, the device increments the read virtual address register and decrements the word count register. When the address register crosses a page boundary, the MMU does the page table walk again. If it finds an invalid PTE or PTD, it issues an interrupt.
The interrupt handler tries to locate the page in memory; if it is not in memory, the handler wakes a thread to fetch the page from the backing store, then clears the interrupt and returns. If the pageout daemon needs to invalidate a few pages to accommodate a different process, it tries to find the best candidates by checking the reference and modify bits of each page. Since the srmmu_hatops vector is replaced with the leo_hatops vector, it calls leo_pagesync() to synchronize the hardware and software reference and modify bits. This routine checks the DMA status register, and if the read DMA is ON and the page passed to it is in the requested DMA transfer address range, it sets the reference bit in the page structure. This saves the page from being whisked away. If the DMA is OFF, the routine does not touch any bits in the page and calls srmmu_pagesync(). If the SRMMU has not touched this page for a while, the pageout routine decides to free that page. Before freeing it, it calls leo_pageunload() to unload all the translations that map the page. This routine also checks the DMA status register to see whether the user initiated a DMA operation after leo_pagesync() returned without setting the reference bit. It calls srmmu_pageunload(), which invalidates the PTE; if the DMA is ON, it stops the DMA and then starts it again. This forces the hardware to do the page table walk again. As an optimization, it could also get that page back and only then start the DMA, to avoid the overhead of one interrupt. It is conceivable that the user requests a DMA operation just before srmmu_pageunload() marks the PTE as invalid. The Leo MMU then finds the valid PTE and starts the DMA operation while srmmu_pageunload() invalidates the PTE. This could result in reading the wrong vertex data. leo_pageunload() can avoid this by loading a bogus level-1 page table pointer before calling srmmu_pageunload() and loading the real pointer when srmmu_pageunload() returns.
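
The check that leo_pagesync() makes can be sketched as a range overlap test. The register snapshot structure, the page model and both function names below are hypothetical. Passing the page's user virtual address directly and using the current register values as the remaining transfer range are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEO_PAGESIZE 4096u

/* Hypothetical snapshot of the read DMA registers. */
struct leo_dma_state {
    bool     read_dma_on; /* DMA status register reports read DMA active */
    uint32_t vaddr;       /* DMA virtual address register */
    uint32_t words_left;  /* DMA word count register (32-bit words) */
};

/* Stand-in for the reference bit kept in the kernel page structure. */
struct page_model { bool ref; };

/* True if the page containing page_va overlaps the remaining DMA range. */
bool leo_dma_touches_page(const struct leo_dma_state *s, uint32_t page_va)
{
    uint64_t start, end, pstart, pend;

    assert(s != NULL);
    if (!s->read_dma_on || s->words_left == 0)
        return false;
    start  = s->vaddr;
    end    = start + 4ull * s->words_left;        /* exclusive, 64-bit to avoid wrap */
    pstart = page_va & ~(uint64_t)(LEO_PAGESIZE - 1);
    pend   = pstart + LEO_PAGESIZE;
    return start < pend && pstart < end;
}

/* Model of the pagesync decision: mark the page referenced while DMA needs it. */
void leo_pagesync_model(const struct leo_dma_state *s, uint32_t page_va,
                        struct page_model *pp)
{
    assert(pp != NULL);
    if (leo_dma_touches_page(s, page_va))
        pp->ref = true;  /* keeps the pageout daemon away from this page */
    /* Otherwise the bits are left alone; the real routine would go on
     * to call srmmu_pagesync() at this point. */
}
```
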
If the user requests DMA on the same page just before that page is marked invalid, the hardware raises the interrupt due to the invalid PTE/PTD and the interrupt handler fetches the requested page. The race condition can also be avoided by always setting the reference bit in leo_pagesync(). This prevents any page in this user's address space from being paged out, but it is against the UNIX design principles. The Table walk root pointer and the word count registers are part of the Accelerator port context. They are saved and restored during the graphics context switch. As an optimization, the user code could also reference all the pages where the data is kept during the DMA operation so that they do not become good candidates for the pageout decisions. There is one more way to share the page table, without replacing srmmu_hatops. This may require small modifications to make the srmmu hat layer aware of Leo, like adding a "leo active bit" in the context structure for the srmmu_ptesync() routine. This needs some more investigation.

2.1.2 Separate page tables

If replacing srmmu_hatops is not acceptable, or if there are hidden problems in sharing the page tables, then Leo will have its own page tables and make full use of the multi-hat layer feature. When the user mmaps the Accelerator port, the segment driver calls hat_alloc(as, leo_hatops). hat_alloc() allocates a hat structure for Leo, links it into the hat list for the address space of that user and calls leo_hatalloc(), which initializes the hat private data area and allocates a level-1 page table for that address space. Then the user creates the vertex list and accesses the Accelerator port registers. The first reference causes a segment fault. If some other process has valid mappings to the Accelerator port, the fault handler switches the graphic context and writes the level-1 page table address into Leo's Table walk root pointer register. Then the user writes the address of the vertex buffer into the DMA virtual address register and the size of the vertex list into the DMA word count register. Writing to the word count register causes the Leo MMU to do the page table walk. The MMU does not find a valid PTE, since the Leo driver has not yet created the mapping for this address in its page table, and issues an interrupt. The interrupt handler tries to locate the page in memory and, if it is not found, wakes up a thread to fetch the page from the backing store. When the page is fetched, it calls hat_getpfnum(as, addr) to find the page frame number. Then it calls leo_devload() to create the mapping for this address in Leo's page table for that address space. leo_devload() also adds the Leo hat mapping to the mapping list for that page by calling hme_add(). This makes sure that the Leo hat gets notified if the pageout daemon looks at this page in order to free it. By looking at the word count, the MMU interrupt handler can create the mappings for the remaining pages in advance. This reduces the interrupt overhead to one interrupt per DMA request.
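
The prefaulting step depends on knowing which pages the rest of the transfer will touch. A minimal sketch of that arithmetic follows, assuming the word count is in 32-bit words and ignoring transfers that would wrap past the top of the 32 bit address space. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEO_PAGESHIFT 12u  /* fixed 4K byte pages */

/* Number of pages touched by a DMA of `words` 32-bit words starting at va. */
uint32_t leo_pages_spanned(uint32_t va, uint32_t words)
{
    uint64_t first, last;

    if (words == 0)
        return 0;
    first = (uint64_t)va >> LEO_PAGESHIFT;
    last  = ((uint64_t)va + 4ull * words - 1) >> LEO_PAGESHIFT;
    return (uint32_t)(last - first + 1);
}

/* Fill pages[] with the base addresses of the pages to prefault, writing
 * at most `max` entries. Returns the number of entries written. */
size_t leo_prefault_list(uint32_t va, uint32_t words, uint32_t *pages, size_t max)
{
    uint32_t n = leo_pages_spanned(va, words);
    uint32_t base = va & ~((1u << LEO_PAGESHIFT) - 1u);
    size_t i;

    assert(pages != NULL || max == 0);
    for (i = 0; i < n && i < max; i++)
        pages[i] = base + (uint32_t)(i << LEO_PAGESHIFT);
    return i;
}
```

An interrupt handler built along these lines would create a mapping for each returned page address before restarting the table walk.
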
After creating the mappings, it clears the interrupt and asks the MMU to do the page table walk again. If the pageout daemon decides to take any of the pages marked by Leo, it calls leo_pagesync() and leo_pageunload(). leo_pagesync() checks whether Leo is doing DMA from that page. If so, it sets the reference bit and returns. leo_pageunload() marks the PTE for that address invalid, stops the DMA if it is ON and then starts it again so that the hardware does the page table walk. Although this scheme results in the overhead of one interrupt for every DMA request, if the same pages are used for subsequent DMA requests, the MMU finds valid mappings and no interrupts are generated. Hence this scheme is better than using the write() system call to request the driver to do the DMA, since write() involves the kernel for every DMA request. But there could be several user applications which do not use the same buffer for subsequent DMA transfers. In these cases the overhead will be significantly reduced if Leo shares the page table with the SRMMU.

2.2 Write DMA

The user mmaps the Accelerator port with the offset PICKOFFSET and a size of 2 pages to inform the driver that it expects the pick data in the memory returned by mmap(). The segment driver gets the DVMA space from the kernel, allocates 2 pages of memory for the pick buffer and maps this buffer to the allocated DVMA space. Then the driver loads the kernel virtual address of this buffer into the DMA write buffer start addr register and the size of the buffer into the DMA write current buffer size register, which enables the write DMA engine. The DMA write buffer start addr register is part of the graphics context and is saved and restored during graphic context switching. If the user calls mmap() with PICKOFFSET more than once, it will return an error.

3.0 Transparent graphic device context switching

The Leo segment driver provides graphic device context switching on both the Accelerator and Direct ports. This is based on the Lego model. It gives the user the illusion of having exclusive access to Leo. Only one process has valid mappings to the Accelerator port at a time. When some other process tries to access this port, the segment fault handler is called, since that process does not have a valid mapping. The fault handler flushes the accelerator pipe, saves the context of the current process, invalidates its mappings, validates the mappings of the other process and restores the context of the other process. Similarly, only one process can have valid mappings to the Direct port at a time. Two processes can independently access the Direct port and the Accelerator port without going through the overhead of context switching.

3.1 MP and context switching:

When two processes on two different processors try to access the same port simultaneously, the segment driver could end up doing the context switching forever. This thrashing is not Leo specific. Lego, Egret, Hawk, SPAM and any other graphic accelerator using the segment driver for graphic context switching have this problem. To avoid this thrashing, either some kind of protection has to be provided or the context switching has to be done in user land. Jim Putnam and Jerry Evans are developing a prototype to implement the context switching in user land using mutex locks. If the performance of this model is not acceptable, then we have to use some kind of hysteresis with the current model: the process having the valid mapping of Leo's port will run only for a fixed time quantum. If some other process wants to access the device during this time, it will be put on the sleep queue. When the time slot expires, the next process in the sleep queue gets exclusive access to Leo. The problem with this scheme is that a process may not really be accessing Leo during the entire time slot. Also, the time quantum chosen may not be optimum for all types of applications. Leo and SPAM will both use this model in the beginning, and if the performance of the user level model looks comparable with the segment driver model, then the user level model will be used.

3.2 Accelerator port context switching:

The following steps are executed in the segleo_fault() handler.

- Check the DMA status. If the user has requested DMA but it is stopped due to an invalid PTE/PTD, put that process to sleep. When the interrupt arrives, handle the page fault and then wake up that process.
- If the DMA is on, remember it, write to the DMA read ON/OFF register to stop it and wait until it has really stopped.
- Start the context switch mode.
- Check the LeoDraw control status register. If the stall bit is set, wait until the block copy/fill operation is completed. Then clear the stall bit.
- Check the accelerator port status register. Wait until the bucket buffer and vertex buffer controls are idle.
- Save the context of the LeoCommand.
- Put the device driver context in the accelerator port by programming the Pass through mode control, Vertex mode control, Pass Through Header and state machine registers.
- Exit the context mode.
- Set the pass through mode and send the SET_LD semaphore through a single LeoFloat to the LeoDraw chips. Use unicast to the next LeoFloat and multicast to the LeoDraws.
- Wait until the LeoDraw semaphore bits in all five LeoDraw control status registers are set. The driver waits for a maximum of 300 usecs and then enables the interrupt if the bits are not set. At this stage it can be assumed that no pick hits can occur. This can also be confirmed by looking at the DMA status register.
- Send a pass through packet to LeoFloat to send the sram context.
- Save the LeoDraw context.
- Invalidate the mappings of the current process.
- Validate the mappings of the next process.
- Restore the LeoDraw context of the next process.
- Send a pass through packet to LeoFloat to restore the sram context for the next process.
- Start the context mode.
- Restore the accelerator port context.
- Restore the LeoCommand context of the next process and exit the context mode. This starts the DMA if the segment driver had stopped it when that process was previously context switched.

3.3 Direct port context switching:

The fault handler checks whether Leo is doing a block copy/fill operation. If so, it spin loops for 300 usec, and if the block operation has not finished by then, it enables the blit done interrupt bit. When this interrupt happens, the state set 0 context registers are saved, the direct port mappings of the current process are invalidated, the mappings of the next process, which caused the fault, are validated and the context of that process is restored.

4.0 Device driver

The device driver implements the following routines in addition to the routines for loadable driver support.

leoidentify()
leoattach()
leoopen()
leoclose()
leoget_dev_info()
leointr()
leoioctl()

4.1 leoidentify(dev_info_t *dip)

It is called by the kernel during the autoconfiguration process to find out whether the driver drives the device specified by the devinfo pointer.

4.2 leoattach(dev_info_t *dip)

It is also called by the kernel during the autoconfiguration process to initialize the device, set up the interrupt handler for the device, map the device registers, initialize the driver data structures, etc. It queries the PROM for properties such as whether the device is Leo104 or Leo48, width, height and cputype, and stores these properties in its data structure. Device initialization includes setting up the LeoFloat enable mask, the DMA configuration registers, the SBus Slot configuration register, etc.

4.3 leoopen(dev_t *devp, int flag, int otyp, cred_t *credp)

It is called when a user process issues an open(2) system call to open either the direct or the accelerator port of Leo.

4.4 leoclose(dev_t dev, int flag, int otyp, cred_t *credp)

It is called by the system either explicitly, when the last process having the device open issues a close(2) system call, or implicitly, when the last process exits. This restores the pre-opened state.

4.5 leoget_dev_info(dev_t dev, int otyp)

It is called to return a pointer to the dev_info node for the unit encoded in the minor device number dev.

4.6 leointr(caddr_t arg)

It is called by the kernel when a device interrupts. The argument passed to leointr() is a pointer to the unit structure for the unit that may have issued the interrupt. This pointer is passed earlier by leoattach() to ddi_add_intr() as the int_handler_arg argument. leointr() reads the Leo interrupt status registers and determines whether the interrupt was generated by Leo. Leo can generate an interrupt for the following conditions.

- Leo MMU page fault (Invalid PTE/PTD)
- Read DMA done
- Write DMA done
- DMA error acknowledge
- Slave illegal address
- Slave rerun timeout
- Blit done
- Scoreboard empty
- LD semaphore set ?
- WID update
- Cursor update
- Colormap update
- Fast clear done
- Vertical retrace active

The MMU page fault indicates that the MMU did not find a valid translation for the desired virtual address during the table walk. If Leo shares the page table with the SRMMU, this fault would not happen frequently. If Leo has its own page table, the fault handler gets the page either from physical memory or from the swap area, finds the physical page frame number and makes an entry in Leo's page table. It also prefaults the remaining desired virtual address range and updates the page table. Then it clears the interrupt and starts the DMA.

It is yet to be decided whether the Read DMA done, Write DMA done and Scoreboard empty interrupts will be enabled and used. For the DMA error acknowledge, Slave illegal address and Slave rerun timeout conditions, the handler will send a SIGKILL or SIGTERM signal to the user process using Leo. Sending a signal to a process from the driver is not DDI compliant, but asking the application to poll for these error conditions may not be acceptable. The Blit done interrupt is used in Direct port context switching and the LD semaphore set interrupt is used in Accelerator port context switching. The WID update, Colormap update, Fast clear done and Vertical retrace active interrupts are used in the ioctls explained below.
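
The way leointr() sorts the interrupt sources into actions can be sketched as below. The bit positions, the action names and the function leo_intr_actions() are all hypothetical, since the layout of the interrupt status registers is not given in this document. The Read DMA done, Write DMA done, Scoreboard empty and Cursor update sources are left out because their use is either undecided or not described.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments for the interrupt status register. */
enum {
    LEO_INT_MMU_FAULT   = 1u << 0,
    LEO_INT_DMA_ERR     = 1u << 1,
    LEO_INT_SLAVE_ADDR  = 1u << 2,
    LEO_INT_SLAVE_RERUN = 1u << 3,
    LEO_INT_BLIT_DONE   = 1u << 4,
    LEO_INT_LD_SEMA     = 1u << 5,
    LEO_INT_WID_UPDATE  = 1u << 6,
    LEO_INT_CMAP_UPDATE = 1u << 7,
    LEO_INT_FCS_DONE    = 1u << 8,
    LEO_INT_VRT         = 1u << 9,
    LEO_INT_ALL         = 0x3ff
};

/* Hypothetical actions the handler can take. */
enum {
    LEO_ACT_PAGEFAULT   = 1u << 0, /* resolve the fault, then restart the DMA */
    LEO_ACT_SIGNAL      = 1u << 1, /* error condition: signal the user process */
    LEO_ACT_WAKE_DIRECT = 1u << 2, /* continue a Direct port context switch */
    LEO_ACT_WAKE_ACCEL  = 1u << 3, /* continue an Accelerator port context switch */
    LEO_ACT_WAKE_IOCTL  = 1u << 4  /* wake a process sleeping in an ioctl */
};

/* Returns 0 when no enabled source is pending (the interrupt is not from
 * Leo), otherwise a mask of the actions to perform. */
unsigned leo_intr_actions(uint32_t status, uint32_t enabled)
{
    uint32_t pending = status & enabled;
    unsigned act = 0;

    assert((enabled & ~(uint32_t)LEO_INT_ALL) == 0); /* only known sources */
    if (pending & LEO_INT_MMU_FAULT)
        act |= LEO_ACT_PAGEFAULT;
    if (pending & (LEO_INT_DMA_ERR | LEO_INT_SLAVE_ADDR | LEO_INT_SLAVE_RERUN))
        act |= LEO_ACT_SIGNAL;
    if (pending & LEO_INT_BLIT_DONE)
        act |= LEO_ACT_WAKE_DIRECT;
    if (pending & LEO_INT_LD_SEMA)
        act |= LEO_ACT_WAKE_ACCEL;
    if (pending & (LEO_INT_WID_UPDATE | LEO_INT_CMAP_UPDATE |
                   LEO_INT_FCS_DONE | LEO_INT_VRT))
        act |= LEO_ACT_WAKE_IOCTL;
    return act;
}
```
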
4.7 leoioctl(dev_t dev, int cmd, int arg, int mode, cred_t *credp, int *rvalp)

The leo driver supports the following ioctl calls.

FBIOGTYPE
FBIOSVIDEO
FBIOGVIDEO
FBIO_U_RST
FBIOVERTICAL
FBIOVRTOFFSET
FBIOPUTCMAP
FBIOGETCMAP
FBIO_WID_ALLOC
FBIO_WID_FREE
FBIO_WID_PUT
FBIO_WID_GET
FB_CLUTALLOC
FB_CLUTFREE
FB_CLUTPOST
FB_CLUTREAD
FB_FCSALLOC
FB_FCSFREE
FB_SETSERVER
FB_SETDIAGMODE
FB_SETMONITOR
FB_GETMONITOR
FB_GRABHW
FB_UNGRABHW
FB_FLUSHPIPE
FB_SYS_INFO

The support for the following ioctls is not decided yet.

FBIOPIXRECT
FBIO_DEVID

4.7.1 FBIOGTYPE

It returns the frame buffer descriptor, which indicates the height, width, depth and type of the accelerator. The following two frame buffer types will be added to fbio.h.

FBIO_SUNLEO104
FBIO_SUNLEO48

4.7.2 FBIOSATTR

Sets the bwtwo emulation mode.

4.7.3 FBIOGATTR

Returns the current emulation mode.

4.7.4 FBIOSVIDEO

Enables the video output.

4.7.5 FBIOGVIDEO

Returns the status of the video output.

4.7.6 FBIO_U_RST

Resets the board.

4.7.7 FBIOVERTICAL

Enables the vertical retrace interrupt and puts the calling process to sleep. When this interrupt happens, it disables the interrupt and wakes up the process.

4.7.8 FBIOVRTOFFSET

Returns the virtual offset for the frame counter.

4.7.9 FBIOPUTCMAP

Updates a color lookup table (CLUT) as requested in the fbcmap structure.

4.7.10 FBIOGETCMAP

Returns the requested CLUT. FBIOPUTCMAP and FBIOGETCMAP are supported mainly for compatibility reasons. The window system will use FB_CLUTPOST and FB_CLUTREAD instead of these ioctls.

4.7.11 FBIO_WID_ALLOC

Allocates WID lookup tables (WLUT) as requested in the structure fb_wid_alloc. If the request is for more than one WLUT, the WIDs are allocated in a contiguous block aligned on a boundary of 2 to the power of k, where 2 to the power of k is the smallest power of two greater than or equal to the number of WIDs requested. Both FB_WID_DBL_8 and FB_WID_DBL_24 types are supported.

4.7.12 FBIO_WID_FREE

Deallocates the requested WLUTs previously allocated by FBIO_WID_ALLOC. If the process exits without deallocating the WLUTs it owned, they are deallocated by the segment driver.

4.7.13 FBIO_WID_PUT

Updates the WLUT entries that have been allocated, as requested in the fb_wid_list and fb_wid_item structures. If the FB_BLOCK flag is set in fb_wid_list, it puts the requesting process to sleep until the device reports that the update is finished. If the flag is not set, the ioctl returns immediately after updating the shadow lookup table and requesting the device to update.

4.7.14 FBIO_WID_GET

Gets the WLUT entries from the device and passes them to the requesting process.

4.7.15 FB_CLUTALLOC

Allocates the three CLUTs as requested in the fb_clutalloc structure.

4.7.16 FB_CLUTFREE

Deallocates the CLUTs allocated by FB_CLUTALLOC. If the process exits without deallocating these tables, they are deallocated by the segment driver.
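
The alignment rule given for FBIO_WID_ALLOC in 4.7.11 can be sketched as a first-fit search over a table of WID entries. This reads the rule as "align the block on the smallest power of two that is at least the number of WIDs requested". The in-use table, the function names and the first-fit policy are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two greater than or equal to n (1 <= n <= 2^31). */
static uint32_t wid_round_pow2(uint32_t n)
{
    uint32_t p = 1;

    assert(n >= 1 && n <= 0x80000000u);
    while (p < n)
        p <<= 1;
    return p;
}

/* Find n free contiguous entries starting on an aligned boundary, mark
 * them used and return the base index, or -1 if no such block exists. */
int wid_alloc(bool *used, uint32_t total, uint32_t n)
{
    uint32_t align, base, i;

    assert(used != NULL && total <= 0x40000000u);
    if (n == 0 || n > total)
        return -1;
    align = wid_round_pow2(n);
    for (base = 0; base + n <= total; base += align) {
        for (i = 0; i < n && !used[base + i]; i++)
            ;                       /* count free entries from base */
        if (i == n) {
            for (i = 0; i < n; i++)
                used[base + i] = true;
            return (int)base;
        }
    }
    return -1;
}
```

With a 16 entry table, a request for 3 WIDs lands at index 0 (alignment 4) and a later request for 5 WIDs lands at index 8 (alignment 8).
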

4.7.17 FB_CLUTPOST

Updates a CLUT as requested in the fb_clut structure. If the FB_BLOCK flag is set, the process is blocked until the device raises an interrupt to report that the update is completed.

4.7.18 FB_CLUTREAD

Reads a CLUT from the device as requested in the fb_clut structure and passes it to the user.

4.7.19 FB_FCSALLOC

Allocates a fast clear plane as requested in the fb_fcsalloc structure.

4.7.20 FB_FCSFREE

Deallocates the FCS allocated by FB_FCSALLOC. If the process exits without deallocating these tables, they are deallocated by the segment driver.

4.7.21 FB_SETDIAGMODE

Enables the requesting process to map any register on Leo.

4.7.22 FB_SETMONITOR

Initializes the Leo cross registers with the monitor specific parameters.

4.7.23 FB_GETMONITOR

Returns the monitor type.

4.7.24 FB_SETSERVER

Marks the calling process as the window server. Only one process can be the window server at a time.

4.7.25 FB_GRABHW

If the calling process is the window server, strobes the stall LD accelerator register and flushes the accelerator pipeline.

4.7.26 FB_UNGRABHW

If the calling process is the window server, writes to the clear stall LD accelerator register.

4.7.27 FB_FLUSHPIPE

Flushes the accelerator pipeline.

4.7.28 FB_SYS_INFO

Reports whether DMA is supported/recommended on the current machine.

4.7.29 FBIOPIXRECT

Currently OpenWindows version 3.0 uses the win() driver, which calls this ioctl. Without this ioctl OpenWindows may not run. So until the win driver is replaced with the evq driver, support for FBIOPIXRECT may be required. This is TBD.

4.7.30 FBIO_DEVID

This can be used to inform the calling process whether the device is Leo104 or Leo48, instead of using FBTYPE. This is TBD.

5.0 Segment driver

The Leo segment driver provides the following.

- mappings to the registers and frame buffer of the direct and accelerator ports
- transparent graphic device context switching for the direct and accelerator ports
- releasing of the resources owned by a process when it exits
- hooks for the DMA operation

The offsets passed to the mmap() call distinguish the Accelerator port from the direct port. The Leo segment driver contains the following routines.

leosegmap()
segleo_create()
segleo_dup()
segleo_unmap()
segleo_free()
segleo_fault()
segleo_checkprot()
segleo_incore()

5.1 leosegmap()

This is called when the user calls mmap(2) to map a segment. This routine checks the offsets, and if they are Accelerator port offsets and the page tables are shared, it replaces srmmu_hatops with leo_hatops for the address space of that user. Then it passes the address of segleo_create.

5.2 segleo_create()

This routine creates the device segment for Leo.

5.3 segleo_dup()

Duplicates the mappings of an existing segment in a new segment.

5.4 segleo_unmap()

Removes the mappings at a given address for the specified length.

5.5 segleo_free()

Frees the segment private data allocated by segleo_create. Also releases the resources owned by that process.

5.6 segleo_fault()

Handles a fault on a device segment. The first fault on a segment loads the address of the level-1 page table into Leo. If some other process has valid mappings to Leo, it switches the context as previously explained. Finally it calls hat_devload() to set up a mapping in the host.

6.0 MMU driver

The MMU driver supports the Leo MMU for the read DMA operation. It consists of the following routines if Leo shares the page table with the SRMMU. The rest of the routines in the hat layer interface will call the corresponding SRMMU routines.

leo_pagesync()
leo_pageunload()

6.1 leo_pagesync()

Checks whether Leo is doing DMA from the given page. If so, sets the reference bit of that page. If the page tables are shared, calls srmmu_pagesync().

6.2 leo_pageunload()

It also checks whether Leo is doing DMA on the given page. If so, it stops the DMA. If the page tables are shared, it removes the user mappings to the DMA read register, starts a thread and calls srmmu_pageunload(). If the page tables are not shared, it removes the mappings for that page in Leo's page tables.

If Leo has its own page tables, then the following routines will be added.

leo_alloc()
leo_free()
leo_dup()
leo_memload()
leo_devload()
leo_fault()
leo_chgprot()
leo_unload()
leo_map()

7.0 Acknowledgements:

Thanks to Eric Sultan for his constant willingness to listen to all the ideas and for his feedback, Dock Williams for providing the multi-hat internals, Ashvin Kamaraju for the SPAM internals and Chidam Jambu for reviewing this document and giving his feedback.