Chapter 7. Frequently Asked Questions

This section provides answers to frequently asked questions associated with the NVIDIA Linux x86 Driver and its installation. Common problem diagnoses can be found in Chapter 8, Common Problems, and tips for new users can be found in Appendix I, Tips for New Linux Users. Detailed information for specific setups is provided in the Appendices.

7.1. NVIDIA-INSTALLER

How do I extract the contents of the .run without actually installing the driver?

Run the installer as follows:

    # sh NVIDIA-Linux-x86-343.13.run --extract-only

This will create the directory NVIDIA-Linux-x86-343.13, containing the uncompressed contents of the .run file.

How can I see the source code to the kernel interface layer?

The source files to the kernel interface layer are in the kernel directory of the extracted .run file. To get to these sources, run:

    # sh NVIDIA-Linux-x86-343.13.run --extract-only
    # cd NVIDIA-Linux-x86-343.13/kernel/

How and when are the NVIDIA device files created?

When a user-space NVIDIA driver component needs to communicate with the NVIDIA kernel module, and the NVIDIA character device files do not yet exist, the user-space component will first attempt to load the kernel module and create the device files itself.

Device file creation and kernel module loading generally require root privileges. The X driver, running within a setuid root X server, will have these privileges, but not, e.g., the CUDA driver within the environment of a normal user.

If the user-space NVIDIA driver component cannot load the kernel module or create the device files itself, it will attempt to invoke the setuid root nvidia-modprobe utility, which will perform these operations on behalf of the non-privileged driver.

See the nvidia-modprobe(1) man page, or its source code, available here: ftp://download.nvidia.com/XFree86/nvidia-modprobe/

When possible, it is recommended to use your Linux distribution's native mechanisms for managing kernel module loading and device file creation. nvidia-modprobe is provided as a fallback to work out-of-the-box in a distribution-independent way.

Whether a user-space NVIDIA driver component does so itself, or invokes nvidia-modprobe, it will default to creating the device files with the following attributes:

      UID:  0     - 'root'
      GID:  0     - 'root'
      Mode: 0666  - 'rw-rw-rw-'

Existing device files are changed if their attributes don't match these defaults. If you want the NVIDIA driver to create the device files with different attributes, you can specify them with the "NVreg_DeviceFileUID" (user), "NVreg_DeviceFileGID" (group) and "NVreg_DeviceFileMode" NVIDIA Linux kernel module parameters.

For example, the NVIDIA driver can be instructed to create device files with UID=0 (root), GID=44 (video) and Mode=0660 by passing the following module parameters to the NVIDIA Linux kernel module:

      NVreg_DeviceFileUID=0
      NVreg_DeviceFileGID=44
      NVreg_DeviceFileMode=0660

The "NVreg_ModifyDeviceFiles" NVIDIA kernel module parameter will disable dynamic device file management, if set to 0.
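As a minimal sketch of how the parameters above might be made persistent, the helper below prints a modprobe-style "options" line carrying the three values. The function name and the file name in the usage comment are illustrative assumptions, not part of the driver; where module options belong depends on your distribution.

```shell
# Hypothetical helper: print an "options" line that passes the device file
# parameters described above to the NVIDIA kernel module.
# Arguments: UID, GID, and mode for the device files.
nvidia_devfile_options() {
    printf 'options nvidia NVreg_DeviceFileUID=%s NVreg_DeviceFileGID=%s NVreg_DeviceFileMode=%s\n' \
        "$1" "$2" "$3"
}

# Example (as root; the file name is an assumption):
#   nvidia_devfile_options 0 44 0660 > /etc/modprobe.d/nvidia-devfiles.conf
```

The values are passed through unchanged, so the caller is responsible for choosing a UID, GID, and mode that exist and make sense on the system.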

Why does NVIDIA not provide RPMs?

Not every Linux distribution uses RPM, and NVIDIA provides a single solution that works across all Linux distributions. NVIDIA encourages Linux distributions to repackage and redistribute the NVIDIA Linux driver in their native package management formats. These repackaged NVIDIA drivers are likely to inter-operate best with the Linux distribution's package management technology. For this reason, NVIDIA encourages users to use their distribution's repackaged NVIDIA driver, where available.

Can the nvidia-installer use a proxy server?

Yes. Because the FTP support in nvidia-installer is based on snarf, it honors the FTP_PROXY, SNARF_PROXY, and PROXY environment variables.

What is the significance of the -no-compat32 suffix on Linux-x86_64 .run files?

To distinguish between Linux-x86_64 driver package files that do or do not also contain 32-bit compatibility libraries, "-no-compat32" is appended to the latter. NVIDIA-Linux-x86-343.13.run contains both 64-bit and 32-bit driver binaries, but NVIDIA-Linux-x86-343.13-no-compat32.run omits the 32-bit compatibility libraries.

Can I add my own precompiled kernel interfaces to a .run file?

Yes, the --add-this-kernel .run file option will unpack the .run file, build a precompiled kernel interface for the currently running kernel, and repackage the .run file, appending -custom to the filename. This may be useful, for example, if you administer multiple Linux computers, each running the same kernel.

Where can I find the source code for the nvidia-installer utility?

The nvidia-installer utility is released under the GPL. The source code for the version of nvidia-installer built with driver 343.13 is in nvidia-installer-343.13.tar.bz2 available here: ftp://download.nvidia.com/XFree86/nvidia-installer/

How can I minimize software overhead when driving many GPUs in a single system?

Coordinating access to many GPUs within a system can introduce software overhead, hurting performance. When separate GPUs are intended to process independent workloads (e.g., CUDA running on separate GPUs, or a virtualized environment where different GPUs are dedicated to different virtual machines), the NVIDIA GPU driver can be configured to use multiple NVIDIA kernel modules, where each GPU is assigned to a specific instance of the NVIDIA kernel module. This can reduce the software overhead needed to coordinate across GPUs.

You can build and install multiple kernel modules with a command such as:

    # sh NVIDIA-Linux-x86-343.13.run \
      --multiple-kernel-modules=MULTIPLE-KERNEL-MODULES 

Replace 'MULTIPLE-KERNEL-MODULES' with the number of NVIDIA kernel modules to build. The maximum number of modules that can be built is 8.

This will install nvidia-frontend.ko and nvidia[0-7].ko modules, instead of nvidia.ko, in /lib/modules/`uname -r`/kernel/drivers/video.

By default, all of the NVIDIA GPUs will be assigned to the same module. The 'NVreg_AssignGpus' NVIDIA Linux kernel module option can be used to assign GPUs to specific module instances. For example, to assign NVIDIA GPUs 0:01:00.0 and 0:02:00.0 to the nvidia0 module:

    # modprobe nvidia0 NVreg_AssignGpus="0:01:00.0,0:02:00.0"

The nvidia-frontend module is responsible for dispatching work to the individual nvidia[0-7] modules, and does not directly drive GPUs on its own. Therefore, GPUs cannot be assigned to the nvidia-frontend module.

When multiple NVIDIA kernel modules are used, any applications which use the NVIDIA driver will need to be directed to a particular module instance. The module instances are numbered sequentially, beginning with 0, and the instance number is reflected in the name of the module. The environment variable '__NVIDIA_KERNEL_MODULE_INSTANCE' controls the module instance used by each application. E.g., if an application uses a device that is associated with module nvidia2, the module instance would be 2, and __NVIDIA_KERNEL_MODULE_INSTANCE should be set to 2 in that application's environment. This can be done by exporting the value in a shell before launching the application. If the variable '__NVIDIA_KERNEL_MODULE_INSTANCE' is not set, a failure results with the following message:

    FATAL: Module nvidia not found.
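The relationship between a module name and its instance number can be sketched as below. The helper name and the application name in the usage comment are hypothetical; only the nvidia[0-7] naming and the environment variable come from the text above.

```shell
# Hypothetical helper: derive the instance number from a module name such as
# "nvidia2". Prints the number, or returns non-zero for any other name
# (including nvidia-frontend, which does not drive GPUs).
nvidia_module_instance() {
    case "$1" in
        nvidia[0-7]) printf '%s\n' "${1#nvidia}" ;;
        *) return 1 ;;
    esac
}

# Example: run an application (placeholder name) against module nvidia2.
#   __NVIDIA_KERNEL_MODULE_INSTANCE=$(nvidia_module_instance nvidia2) ./my-app
```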

7.2. NVIDIA Driver

Where should I start when diagnosing display problems?

One of the most useful tools for diagnosing problems is the X log file in /var/log. Lines that begin with (II) are information, (WW) are warnings, and (EE) are errors. You should make sure that the correct config file (i.e. the config file you are editing) is being used; look for the line that begins with:

    (==) Using config file:

Also make sure that the NVIDIA driver is being used, rather than another driver. Search for:

    (II) LoadModule: "nvidia"

Lines from the driver should begin with:

    (II) NVIDIA(0)
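The checks above can be collected into a small script. This is a sketch only: the function name is hypothetical, and the log file name in the usage comment is an assumption (the text above says only that the log lives in /var/log). The warning and error counts match markers at the start of a line, optionally after a bracketed timestamp, so that a legend line listing all markers is not counted.

```shell
# Hypothetical helper: summarize an X log file.
# Usage: xlog_summary /path/to/logfile
xlog_summary() {
    log=$1
    # Which config file did the X server actually read?
    grep -F '(==) Using config file:' "$log" || echo 'config file line: not found'
    # Was the NVIDIA X driver module loaded?
    if grep -qF '(II) LoadModule: "nvidia"' "$log"; then
        echo 'nvidia module: loaded'
    else
        echo 'nvidia module: not loaded'
    fi
    # Count lines that begin with a warning or error marker.
    echo "warnings: $(grep -cE '^(\[[^]]*\] )?\(WW\)' "$log")"
    echo "errors: $(grep -cE '^(\[[^]]*\] )?\(EE\)' "$log")"
}

# Example (log file name is an assumption):
#   xlog_summary /var/log/Xorg.0.log
```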

How can I increase the amount of data printed in the X log file?

By default, the NVIDIA X driver prints relatively few messages to stderr and the X log file. If you need to troubleshoot, it may be helpful to enable more verbose output by using the X command line options -verbose and -logverbose, which set the verbosity level for the stderr and log file messages, respectively. The NVIDIA X driver will output more messages when the verbosity level is at or above 5 (X defaults to verbosity level 1 for stderr and level 3 for the log file). So, to enable verbose messaging from the NVIDIA X driver to both the log file and stderr, you could start X with the verbosity level set to 5, as follows:

    % startx -- -verbose 5 -logverbose 5

What is NVIDIA's policy towards development series Linux kernels?

NVIDIA does not officially support development series kernels. However, all the kernel module source code that interfaces with the Linux kernel is available in the kernel/ directory of the .run file. NVIDIA encourages members of the Linux community to develop patches to these source files to support development series kernels. A web search will most likely yield several community-supported patches.

Where can I find the tarballs?

Plain tarballs are not available. The .run file is a tarball with a shell script prepended. You can execute the .run file with the --extract-only option to unpack the tarball.

How do I tell if I have my kernel sources installed?

If you are running on a distro that uses RPM (Red Hat, Mandriva, SuSE, etc), then you can use rpm to tell you. At a shell prompt, type:

    % rpm -qa | grep kernel

and look at the output. You should see a package that corresponds to your kernel (often named something like kernel-2.6.15-7) and a kernel source package with the same version (often named something like kernel-devel-2.6.15-7). If none of the lines seem to correspond to a source package, then you will probably need to install it. If the versions listed mismatch (e.g., kernel-2.6.15-7 vs. kernel-devel-2.6.15-10), then you will need to update the kernel-devel package to match the installed kernel. If you have multiple kernels installed, you need to install the kernel-devel package that corresponds to your running kernel (or make sure your installed source package matches the running kernel). You can do this by looking at the output of uname -r and matching versions.
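The version comparison described above can be automated along these lines. This sketch assumes the source package is named kernel-devel-VERSION, where VERSION is exactly the string printed by uname -r; package naming differs between distributions, so treat the pattern as an assumption. The function name is hypothetical.

```shell
# Hypothetical helper: read package names on stdin (e.g. from "rpm -qa") and
# succeed if a kernel-devel package exactly matches the given kernel version.
has_matching_kernel_devel() {
    grep -qxF "kernel-devel-$1"
}

# Example:
#   rpm -qa | has_matching_kernel_devel "$(uname -r)" \
#       || echo "no kernel-devel package matches the running kernel"
```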

What is SELinux and how does it interact with the NVIDIA driver?

Security-Enhanced Linux (SELinux) is a set of modifications applied to the Linux kernel and utilities that implement a security policy architecture. When in use it requires that the security type on all shared libraries be set to 'shlib_t'. The installer detects when to set the security type, and sets it on all shared libraries it installs. The option --force-selinux passed to the .run file overrides the detection of when to set the security type.

How can I build multiple NVIDIA kernel modules?

Multiple NVIDIA kernel modules can be built by using the standard Linux kernel module building technique:

    # sh ./NVIDIA-Linux-x86-343.13.run --extract-only
    # cd NVIDIA-Linux-x86-343.13/kernel
    # make module

and passing the NV_BUILD_MODULE_INSTANCES variable to make. Set this variable to the desired number of NVIDIA kernel modules, not greater than eight. This is equivalent to the --multiple-kernel-modules option passed to the .run file. E.g.,

    # make module NV_BUILD_MODULE_INSTANCES=2

This will generate two NVIDIA kernel modules: nvidia0.ko and nvidia1.ko. An additional nvidia-frontend.ko module will also be generated, which handles the redirection of system calls corresponding to GPUs to their respective NVIDIA kernel modules.
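To illustrate the naming scheme, the sketch below lists the module files one would expect for a given instance count. The function name is hypothetical, and the lower bound of 1 is an assumption; the text above states only that the count must not exceed eight.

```shell
# Hypothetical helper: list the module files expected for N instances,
# based on the naming described above. Assumes N is between 1 and 8.
nvidia_expected_modules() {
    n=$1
    case "$n" in
        [1-8]) ;;
        *) echo "instance count must be between 1 and 8" >&2; return 1 ;;
    esac
    echo nvidia-frontend.ko
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "nvidia$i.ko"
        i=$((i + 1))
    done
}

# Example:
#   nvidia_expected_modules 2
```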

Why does X use so much memory?

When measuring any application's memory usage, you must be careful to distinguish between physical system RAM used and virtual mappings of shared resources. For example, most shared libraries exist only once in physical memory but are mapped into multiple processes. This memory should only be counted once when computing total memory usage. In the same way, the video memory on a graphics card or register memory on any device can be mapped into multiple processes. These mappings do not consume normal system RAM.

This has been a frequently discussed topic on XFree86 mailing lists; see, for example:

http://marc.theaimsgroup.com/?l=xfree-xpert&m=96835767116567&w=2

The pmap utility described in the above thread is available in the "procps" package shipped with most recent Linux distributions, and is a useful tool in distinguishing between types of memory mappings. For example, while top may indicate that X is using several hundred MB of memory, the last line of output from pmap (note that pmap may need to be run as root):

    # pmap -d `pidof X` | tail -n 1
    mapped: 161404K    writeable/private: 7260K    shared: 118056K

reveals that X is really only using roughly 7MB of system RAM (the "writeable/private" value).
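If you want only the "writeable/private" figure, it can be extracted from the summary line as sketched below. The function name is hypothetical, and the pattern assumes the summary line has the layout shown in the sample output above.

```shell
# Hypothetical helper: read the summary line of "pmap -d" on stdin and print
# the writeable/private figure in kilobytes.
pmap_private_kb() {
    sed -n 's/.*writeable\/private: *\([0-9][0-9]*\)K.*/\1/p'
}

# Example (may need to be run as root):
#   pmap -d "$(pidof X)" | tail -n 1 | pmap_private_kb
```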

Note, also, that X must allocate resources on behalf of X clients (the window manager, your web browser, etc); the X server's memory usage will increase as more clients request resources such as pixmaps, and decrease as you close X applications.

The IndirectMemoryAccess X configuration option may cause additional virtual address space to be reserved.

Why do applications that use DGA graphics fail?

The NVIDIA driver does not support the graphics component of the XFree86-DGA (Direct Graphics Access) extension. Applications can use the XDGASelectInput() function to acquire relative pointer motion, but graphics-related functions such as XDGASetMode() and XDGAOpenFramebuffer() will fail.

The graphics component of XFree86-DGA is not supported because it requires a CPU mapping of framebuffer memory. As graphics cards ship with increasing quantities of video memory, the NVIDIA X driver has had to switch to a more dynamic memory mapping scheme that is incompatible with DGA. Furthermore, DGA does not cooperate with other graphics rendering libraries such as Xlib and OpenGL because it accesses GPU resources directly.

NVIDIA recommends that applications use OpenGL or Xlib, rather than DGA, for graphics rendering. Using rendering libraries other than DGA will yield better performance and improve interoperability with other X applications.

My kernel log contains messages that are prefixed with "Xid"; what do these messages mean?

"Xid" messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU. These messages provide diagnostic information that can be used by NVIDIA to aid in debugging reported problems.

I use the Coolbits overclocking interface to adjust my graphics card's clock frequencies, but the defaults are reset whenever X is restarted. How do I make my changes persistent?

Clock frequency settings are not saved/restored automatically by default to avoid potential stability and other problems that may be encountered if the chosen frequency settings differ from the defaults qualified by the manufacturer. You can add an nvidia-settings command to ~/.xinitrc to automatically apply custom clock frequency settings when the X server is started. See the nvidia-settings(1) manual page for more information on specifying clock frequency settings on the command line.

Why is the refresh rate not reported correctly by utilities that use the XF86VidMode X extension and/or RandR X extension versions prior to 1.2 (e.g., `xrandr --q1`)?

These extensions are not aware of multiple display devices on a single X screen; they only see the MetaMode bounding box, which may contain one or more actual modes. This means that if multiple MetaModes have the same bounding box, these extensions will not be able to distinguish between them. In order to support dynamic display configuration, the NVIDIA X driver must make each MetaMode appear to be unique and accomplishes this by using the refresh rate as a unique identifier.

You can use `nvidia-settings -q RefreshRate` to query the actual refresh rate on each display device.

Why does starting certain applications result in Xlib error messages indicating extensions like "XFree86-VidModeExtension" or "SHAPE" are missing?

If your X config file has a Module section that does not list the "extmod" module, some X server extensions may be missing, resulting in error messages of the form:

    Xlib: extension "SHAPE" missing on display ":0.0"
    Xlib: extension "XFree86-VidModeExtension" missing on display ":0.0"
    Xlib: extension "XFree86-DGA" missing on display ":0.0"

You can solve this problem by adding the line below to your X config file's Module section:

    Load "extmod"
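A quick way to see whether a config file already loads the module is sketched below. The function name is hypothetical, and the config file path in the usage comment is an assumption; use the path reported in your X log. The check looks only for a Load line and does not verify that the line sits inside the Module section.

```shell
# Hypothetical helper: succeed if the given X config file contains a
# Load "extmod" line.
has_extmod() {
    grep -Eq '^[[:space:]]*Load[[:space:]]+"extmod"' "$1"
}

# Example (path is an assumption):
#   has_extmod /etc/X11/xorg.conf || echo 'extmod is not loaded explicitly'
```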

Where can I find older driver versions?

Please visit ftp://download.nvidia.com/XFree86/Linux-x86/

What is the format of a PCI Bus ID?

Different tools have different formats for the PCI Bus ID of a PCI device.

The X server's "BusID" X configuration file option interprets the BusID string in the format "bus@domain:device:function" (the "@domain" portion is only needed if the PCI domain is non-zero), in decimal. More specifically,

"%d@%d:%d:%d", bus, domain, device, function

in printf(3) syntax. NVIDIA X driver logging, nvidia-xconfig, and nvidia-settings match the X configuration file BusID convention.

The lspci(8) utility, in contrast, reports the PCI BusID of a PCI device in the format "domain:bus:device.function", printing the values in hexadecimal. More specifically,

"%04x:%02x:%02x.%x", domain, bus, device, function

in printf(3) syntax. The "Bus Location" reported in the information file matches the lspci format. Also, the name of the per-GPU directory in /proc/driver/nvidia/gpus is the same as the corresponding GPU's PCI BusID in lspci format.
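Converting between the two conventions amounts to reordering the fields and switching from hexadecimal to decimal. The sketch below does this for an lspci-style ID, omitting the "@domain" portion when the domain is zero, as described above. The function name is hypothetical, and any bus-type prefix your X server expects in the BusID string (such as "PCI:") is not added here.

```shell
# Hypothetical helper: convert an lspci-style ID ("domain:bus:device.function",
# hexadecimal) to the X configuration convention ("bus@domain:device:function",
# decimal, with "@domain" omitted when the domain is zero).
lspci_to_x_busid() {
    case "$1" in
        *:*:*.*) ;;
        *) echo "expected domain:bus:device.function" >&2; return 1 ;;
    esac
    domain=${1%%:*}
    rest=${1#*:}
    bus=${rest%%:*}
    devfn=${rest#*:}
    dev=${devfn%%.*}
    fn=${devfn#*.}
    if [ "$((0x$domain))" -eq 0 ]; then
        printf '%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
    else
        printf '%d@%d:%d:%d\n' "$((0x$bus))" "$((0x$domain))" "$((0x$dev))" "$((0x$fn))"
    fi
}

# Example:
#   lspci_to_x_busid 0000:0a:00.0
```

Note that bus 0a in lspci output becomes 10 in the X convention, which is a common source of mismatched BusID settings.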

On systems where both an integrated GPU and a PCI slot are present, setting the "BusID" option to "AXI" selects the integrated GPU. By default, not specifying this option or setting it to an empty string selects a discrete GPU if available, the integrated GPU otherwise.

How do I interpret X server version numbers?

X server version numbers can be difficult to interpret because, depending on the release, an X.Org X server may report either the version of the overall X.Org release or the version of the X server module itself.

In 2003, X.Org created a fork of the XFree86 project's code base, which used a monolithic build system to build the X server, libraries, and applications together in one source code repository. It resumed the release version numbering where it left off in 2001, continuing with 6.7, 6.8, etc., for the releases of this large bundle of code. These version numbers are sometimes written X11R6.7, X11R6.8, etc. to include the version of the X protocol.

In 2005, an effort was made to split the monolithic code base into separate modules with their own version numbers to make them easier to maintain and so that they could be released independently. X.Org still occasionally releases these modules together, with a single version number. These releases are simply referred to as “X.Org releases”, or sometimes “katamari” releases. For example, X.Org 7.6 was released on December 20, 2010 and contains version 1.9.3 of the xorg-server package, which contains the core X server itself.

The release management changes from XFree86, to X.Org monolithic releases, to X.Org modular releases impacted the behavior of the X server's -version command line option. For example, XFree86 X servers always report the version of the XFree86 monolithic package:

    XFree86 Version 4.3.0 (Red Hat Linux release: 4.3.0-2)
    Release Date: 27 February 2003
    X Protocol Version 11, Revision 0, Release 6.6

X servers in X.Org monolithic and early “katamari” releases did something similar:

    X Window System Version 7.1.1
    Release Date: 12 May 2006
    X Protocol Version 11, Revision 0, Release 7.1.1

However, X.Org later modified the X server to start printing its individual module version number instead:

    X.Org X Server 1.9.3
    Release Date: 2010-12-13
    X Protocol Version 11, Revision 0

Please keep this in mind when comparing X server versions: what looks like “version 7.x” is older than version 1.x.

Why doesn't the NVIDIA X driver make more display resolutions and refresh rates available via RandR?

Prior to the 302.* driver series, the list of modes reported to applications by the NVIDIA X driver was not limited to the list of modes natively supported by a display device. In order to expose the largest possible set of modes on digital flat panel displays, which typically do not accept arbitrary mode timings, the driver maintained separate sets of "front-end" and "back-end" mode timings, and scaled between them to simulate the availability of more modes than would otherwise be supported.

Front-end timings were the values reported to applications, and back-end timings were what was actually sent to the display. Both sets of timings went through the full mode validation process, with the back-end timings having the additional constraint that they must be provided by the display's EDID, as only EDID-provided modes can be safely assumed to be supported by the display hardware. Applications could request any available front-end timings, which the driver would implicitly scale to either the "best fit" or "native" mode timings. For example, an application might request an 800x600 @ 60 Hz mode and the driver would provide it, but the real mode sent to the display would be 1920x1080 @ 30 Hz. While the availability of modes beyond those natively supported by a display was convenient for some uses, it created several problems. For example:

  • The complete front-end timings were reported to applications, but only the width and height were actually used. This could cause confusion because in many cases, changing the front-end timings did not change the back-end timings. This was especially confusing when trying to change the refresh rate, because the refresh rate in the front-end timings was ignored, but was still reported to applications.

  • The front-end timings reported to the user could be different from the back-end timings reported in the display device's on-screen display, leading to user confusion. Finding out the back-end timings (e.g. to find the real refresh rate) required using the NVIDIA-specific NV-CONTROL X extension.

  • The process by which back-end timings were selected for use with any given front-end timings was not transparent to users, and this process could only be explicitly configured with NVIDIA-specific xorg.conf options or the NV-CONTROL X extension. Confusion over how changing front-end timings could affect the back-end timings was especially problematic in use cases that were sensitive to the timings the display device receives, such as NVIDIA 3D Vision.

  • User-specified modes underwent normal mode validation, even though the timings in those modes were not used. For example, a 1920x1080 @ 100 Hz mode might fail the VertRefresh check, even though the back-end timings might actually be 1920x1080 @ 30 Hz.

Version 1.2 of the X Resize and Rotate extension (henceforth referred to as "RandR 1.2") allows configuration of display scaling in a much more flexible and standardized way. The protocol allows applications to choose exactly which (back-end) mode timing is used, and exactly how the screen is scaled to fill that mode. It also allows explicit control over which displays are enabled, and which portions of the screen they display. This also provides much-needed transparency: the mode timings reported by RandR 1.2 are the actual mode timings being sent to the display. However, this means that only modes actually supported by the display are reported in the RandR 1.2 mode list. Scaling configurations, such as the 800x600 to 1920x1080 example above, need to be configured via the RandR 1.2 transform feature. Adding implicitly scaled modes to the mode list would conflict with the transform configuration options and reintroduce the same problems that the previous front-end/back-end timing system had.

With the introduction of RandR 1.2 support to the 302.* driver series, the front-end/back-end timing system was abandoned, and the list of mode timings exposed by the NVIDIA X driver was simplified to include only those modes which would actually be driven by the hardware. Although it remained possible to manually configure all of the scaling configurations that were previously possible, and many scaling configurations which were previously impossible, this change resulted in some inconvenient losses of functionality:

  • Applications which used RandR 1.1 or earlier or XF86VidMode to set modes no longer had the implicitly scaled front-end timings available to them. Many displays have EDIDs which advertise only the display's native resolution, or a list of resolutions that is otherwise small, compared to the list that would previously have been exposed as front-end timings, preventing these applications from setting modes that were possible with previous versions of the NVIDIA driver.

  • The nvidia-settings control panel, which formerly listed all available front-end modes for displays in its X Server Display Configuration page, only listed the actual back-end modes.

Subsequent driver releases restored some of this functionality without reverting to the front-end/back-end system:

  • The NVIDIA X driver now builds a list of "Implicit MetaModes", which implicitly scale many common resolutions to a mode that is supported by the display. These modes are exposed to applications which use RandR 1.1 and XF86VidMode, as neither supports the scaling or other transform capabilities of RandR 1.2.

  • The resolution list in the nvidia-settings X Server Display Configuration page now includes explicitly scaled modes for many common resolutions which are not directly supported by the display. To reduce confusion, the scaled modes are identified as being scaled, and it is not possible to set a refresh rate for any of the scaled modes.

As mentioned previously, the RandR 1.2 mode list contains only modes which are supported by the display. Modern applications that wish to set modes other than those available in the RandR 1.2 mode list are encouraged to use RandR 1.2 transformations to program any required scaling operations. For example, the xrandr utility can program RandR scaling transformations, and the following command can scale a 1280x720 mode to a display connected to output DVI-I-0 that does not support the desired mode, but does support 1920x1080:

    % xrandr --output DVI-I-0 --mode 1920x1080 --scale-from 1280x720
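To make the arithmetic behind such a transform concrete, the sketch below computes per-axis scale factors as the desired size divided by the mode size. It assumes that --scale-from is shorthand for a scaling transform with these factors; the function name is hypothetical, and the xrandr(1) manual page is the authority on the exact semantics.

```shell
# Hypothetical helper: print the per-axis scale factors implied by showing a
# desired WIDTHxHEIGHT area on a mode of a given WIDTHxHEIGHT.
# Usage: scale_factors DESIRED MODE
scale_factors() {
    awk -v from="$1" -v mode="$2" 'BEGIN {
        split(from, f, "x"); split(mode, m, "x")
        printf "%.4fx%.4f\n", f[1] / m[1], f[2] / m[2]
    }'
}

# Example:
#   scale_factors 1280x720 1920x1080
```

Factors below 1 correspond to enlarging a smaller desktop area to fill the mode, which is the case in the 1280x720 on 1920x1080 example.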