Powerful and reliable programming model and computing toolkit

NVIDIA CUDA Toolkit

Download NVIDIA CUDA Toolkit 12.4.0 (for Windows 11)

NVIDIA CUDA Toolkit

  -  3 GB  -  Freeware
  • Latest Version

    NVIDIA CUDA Toolkit 12.4.0 (for Windows 11) LATEST

  • Review by

    Daniel Leblanc

  • Operating System

    Windows 11

  • Author / Product

    NVIDIA Corporation

  • Filename

    cuda_12.4.0_551.61_windows.exe

NVIDIA CUDA Toolkit provides a development environment for creating high-performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains such as linear algebra, image and video processing, deep learning, and graph analytics. For developing custom algorithms, you can use available integrations with commonly used languages and numerical packages, as well as well-documented development APIs.

Your CUDA applications can be deployed across all NVIDIA GPU families on premises and on GPU instances in the cloud. Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single-GPU workstations to cloud installations with thousands of GPUs.

The toolkit also provides an IDE with graphical and command-line tools for debugging, identifying performance bottlenecks on the GPU and CPU, and offering context-sensitive optimization guidance. You can develop applications in a programming language you already know, including C, C++, Fortran, and Python.

To get started, browse the online getting-started resources, optimization guides, and illustrative examples, and collaborate with the rapidly growing developer community. Download NVIDIA CUDA Toolkit for PC today!
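To make this concrete, here is a minimal, illustrative CUDA C++ program of the kind the toolkit builds. It is a sketch rather than NVIDIA sample code, and assumes nvcc is on your PATH (build with: nvcc -o vector_add vector_add.cu).

    // vector_add.cu -- minimal illustrative CUDA C++ program
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        // Managed (unified) memory keeps the sketch short; explicit
        // cudaMalloc/cudaMemcpy is the more common production pattern.
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        const int block = 256;
        const int grid = (n + block - 1) / block;
        vectorAdd<<<grid, block>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }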

Features and Highlights
  • GPU Timestamp: Start timestamp
  • Method: GPU method name. This is either "memcpy*" for memory copies or the name of a GPU kernel. Memory copies have a suffix that describes the type of memory transfer; for example, "memcpyDtoHasync" means an asynchronous transfer from device memory to host memory
  • GPU Time: Execution time for the method on the GPU
  • CPU Time: The sum of the GPU time and the CPU overhead to launch the method. At the driver-generated data level, CPU time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of the GPU time and the CPU overhead. All kernel launches are non-blocking by default, but if any profiler counters are enabled, kernel launches become blocking. Asynchronous memory copy requests in different streams are non-blocking
  • Stream Id: Identification number for the stream
  • Columns only for kernel methods
  • Occupancy: Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of active warps
  • Profiler counters: Refer to the profiler counters section for the list of supported counters
  • grid size: Number of blocks in the grid along the X, Y, and Z dimensions, shown as [num_blocks_X num_blocks_Y num_blocks_Z] in a single column
  • block size: Number of threads in a block along the X, Y, and Z dimensions, shown as [num_threads_X num_threads_Y num_threads_Z] in a single column (these values are set at kernel launch; see the sketch after this list)
  • dyn smem per block: Dynamic shared memory size per block in bytes
  • sta smem per block: Static shared memory size per block in bytes
  • reg per thread: Number of registers per thread
  • Columns only for memcpy methods
  • mem transfer size: Memory transfer size in bytes
  • host mem transfer type: Specifies whether a memory transfer uses "Pageable" or "Page-locked" memory
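As a hedged illustration of where the kernel columns come from, the launch configuration of a CUDA kernel sets the grid size, block size, and dynamic shared memory per block directly (values below are illustrative):

    #include <cuda_runtime.h>

    // Kernel using dynamically sized shared memory; the profiler's kernel
    // columns describe resources of launches like this one.
    __global__ void fill(float *out) {
        extern __shared__ float tile[];  // "dyn smem per block" covers this
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (float)i;
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }

    int main() {
        dim3 grid(64, 1, 1);    // profiler "grid size"  -> [64 1 1]
        dim3 block(128, 1, 1);  // profiler "block size" -> [128 1 1]
        size_t dynSmem = block.x * sizeof(float);  // dynamic smem in bytes
        float *out;
        cudaMalloc(&out, (size_t)grid.x * block.x * sizeof(float));
        fill<<<grid, block, dynSmem>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }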
Also Available: Download NVIDIA CUDA Toolkit for Mac

What's new in this version:

CUDA Components:
- Starting with CUDA 11, the various components in the toolkit are versioned independently.

CUDA Driver:
Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. For more information on which GPU products are CUDA capable, visit https://developer.nvidia.com/cuda-gpus.
- Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.

General CUDA:
Access-counter-based memory migration for Grace Hopper systems is now enabled by default. As this is the first release with the capability enabled, applications that were optimized for the earlier memory migration algorithms may see a performance regression. Should this occur, we provide a supported but temporary flag to opt out of the new behavior: the feature can be controlled by unloading and reloading the NVIDIA UVM kernel module (nvidia-uvm) with the corresponding module parameter; see the CUDA 12.4 release notes for the exact commands.

This release introduces support for the following new features in CUDA graphs:
- Graph conditional nodes (enhanced from 12.3)
- Device-side node parameter update for device graphs
- Updatable graph node priorities without recompilation
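The exact conditional-node and device-side update APIs are documented in the CUDA programming guide; as a point of reference, here is a minimal sketch of the basic graph workflow those features build on: capture work from a stream, instantiate once, and relaunch the whole graph cheaply.

    #include <cuda_runtime.h>

    __global__ void step(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 16;
        float *x;
        cudaMalloc(&x, n * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Record the work into a graph instead of executing it immediately.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        step<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        step<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        cudaStreamEndCapture(stream, &graph);

        // Instantiate once, then launch many times with low CPU overhead.
        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);
        for (int iter = 0; iter < 10; ++iter)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(x);
        return 0;
    }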

Enhanced monitoring capabilities through NVML and nvidia-smi (see the query sketch after this list):
- NVJPG and NVOFA utilization percentage
- PCIe class and subclass reporting
- dmon reports are now available in CSV format
- More descriptive error codes returned from NVML
- dmon now reports gpm-metrics for MIG (that is, nvidia-smi dmon --gpm-metrics runs in MIG mode)
- NVML running against older drivers will report FUNCTION_NOT_FOUND in some cases, failing gracefully if NVML is newer than the driver
- NVML APIs to query protected memory information for Hopper Confidential Computing
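For orientation on the NVML side, the sketch below uses only long-standing NVML calls; the new NVJPG/NVOFA utilization, PCIe class, and confidential-computing queries have dedicated entry points whose names this article does not give, so consult the NVML documentation for them (link against libnvidia-ml).

    #include <nvml.h>
    #include <cstdio>

    int main() {
        if (nvmlInit_v2() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex_v2(0, &dev);

        nvmlUtilization_t util;  // GPU and memory utilization percentages
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("gpu %u%%, mem %u%%\n", util.gpu, util.memory);

        nvmlShutdown();
        return 0;
    }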

This release also introduces nvFatbin, a new library for creating CUDA fat binary files at runtime.

Confidential Computing General Access:
- Starting in 12.4 with the R550.54.14 driver, Hopper Confidential Computing moves to General Access for discrete GPU usage.
- All EA RIM certificates prior to this release will be revoked with status PrivilegeWithdrawn 30 days after posting.

CUDA Compilers:
For changes to PTX, refer to the PTX ISA documentation.
- Added the __maxnreg__ kernel function qualifier, which lets users directly specify the maximum number of registers to be allocated to a single thread in a thread block in CUDA C++ (see the sketch after this list).
- Added a new flag, -fdevice-syntax-only, that ends device compilation after front-end syntax checking. This option can provide rapid feedback (warnings and errors) on source code changes, as it does not invoke the optimizer. Note: this option will not generate valid object code.
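A minimal sketch of the new __maxnreg__ qualifier, assuming its placement mirrors __launch_bounds__ (the authoritative syntax is in the CUDA C++ Programming Guide):

    #include <cuda_runtime.h>

    // Sketch: cap this kernel at 32 registers per thread.
    __global__ void __maxnreg__(32) scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        float *x;
        cudaMalloc(&x, 1024 * sizeof(float));
        scale<<<4, 256>>>(x, 2.0f, 1024);
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }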

Added a new flag, -minimal, for NVRTC compilation (see the sketch after this list). The -minimal flag omits certain language features to reduce compile time for small programs. In particular, the following are omitted:
- Texture and surface functions and associated types (for example, cudaTextureObject_t).
- CUDA Runtime functions that are provided by the cudadevrt device code library, typically named with the prefix "cuda", for example, cudaMalloc.
- Kernel launch from device code.
- Types and macros associated with the CUDA Runtime and Driver APIs, provided by cuda/tools/cudart/driver_types.h, typically named with the prefix "cuda", for example, cudaError_t.
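Here is a sketch of passing the new flag through NVRTC. nvrtcCreateProgram, nvrtcCompileProgram, and the log/PTX getters are long-standing NVRTC APIs; only the -minimal option is new (link against libnvrtc).

    #include <nvrtc.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const char *src =
            "__global__ void axpy(float a, float *x, float *y, int n) {\n"
            "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "  if (i < n) y[i] += a * x[i];\n"
            "}\n";
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, src, "axpy.cu", 0, nullptr, nullptr);

        // -minimal (new in 12.4) omits textures, device-side launch, etc.,
        // to shorten compile time for small programs like this one.
        const char *opts[] = { "-minimal" };
        nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

        size_t logSize;
        nvrtcGetProgramLogSize(prog, &logSize);
        if (logSize > 1) {
            std::vector<char> log(logSize);
            nvrtcGetProgramLog(prog, log.data());
            printf("%s\n", log.data());
        }
        if (rc == NVRTC_SUCCESS) {
            size_t ptxSize;
            nvrtcGetPTXSize(prog, &ptxSize);  // PTX ready for the driver API
            printf("generated %zu bytes of PTX\n", ptxSize);
        }
        nvrtcDestroyProgram(&prog);
        return 0;
    }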

Starting in CUDA 12.4, PTXAS enables position-independent code (-pic) by default when the compilation mode is whole-program compilation. Users can opt out by specifying the -pic=false option to PTXAS. Debug compilation and separate compilation continue to have position-independent code disabled by default. In the future, position-independent code will allow the CUDA Driver to share a single copy of the text section across contexts and reduce resident memory usage.

CUDA Developer Tools:
- For changes to nvprof and Visual Profiler, see the changelog.
- For new features, improvements, and bug fixes in Nsight Systems, see the changelog.
- For new features, improvements, and bug fixes in Nsight Visual Studio Edition, see the changelog.
- For new features, improvements, and bug fixes in CUPTI, see the changelog.
- For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
- For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog.
- For new features, improvements, and bug fixes in CUDA-GDB, see the changelog.

Resolved Issues:
General CUDA:
- Fixed a compiler crash that could occur when inputs to MMA instructions were used before being initialized.

CUDA Compilers:
- In certain cases, dp4a or dp2a instructions would be generated in PTX and cause incorrect behavior due to integer overflow. This has been fixed in CUDA 12.4.

Deprecated or Dropped Features:
- Features deprecated in the current release of the CUDA software still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.

Deprecated Architectures:
- CUDA Toolkit 12.4 deprecates NVIDIA CUDA support for the PowerPC architecture; support will be removed in an upcoming release.

Deprecated Operating Systems:
- CUDA Toolkit 12.4 deprecates support for Red Hat Enterprise Linux 7 and CentOS 7. Support for these operating systems will be removed in an upcoming release.

Deprecated Toolchains:
CUDA Toolkit 12.4 deprecates support for the following host compilers:
- Microsoft Visual C/C++ (MSVC) 2017
- All GCC versions prior to GCC 7.3

CUDA Libraries:
- This section covers CUDA Libraries release notes for 12.x releases.
- CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.

Support for the following compute capabilities is removed for all libraries:
- sm_35 (Kepler)
- sm_37 (Kepler)

cuBLAS: Release 12.4:
New Features:
cuBLAS adds experimental APIs to support grouped batched GEMM for single and double precision. Single precision also supports the math mode CUBLAS_TF32_TENSOR_OP_MATH. Grouped batch mode allows you to concurrently solve GEMMs of different dimensions (m, n, k), leading dimensions (lda, ldb, ldc), transpositions (transa, transb), and scaling factors (alpha, beta). Please see cublas<t>gemmGroupedBatched in the cuBLAS documentation for more details.
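The grouped API's exact signature is in the cuBLAS documentation. For contrast, here is a sketch of the long-standing uniform batched API, cublasSgemmBatched, which requires a single (m, n, k) and a single set of leading dimensions for the entire batch, exactly the restriction grouped batch mode lifts (link against libcublas; sizes are illustrative).

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int m = 32, n = 32, k = 32, batch = 4;
        cublasHandle_t handle;
        cublasCreate(&handle);

        // Allocate the batch of matrices on the device.
        std::vector<float*> hA(batch), hB(batch), hC(batch);
        for (int b = 0; b < batch; ++b) {
            cudaMalloc(&hA[b], m * k * sizeof(float));
            cudaMalloc(&hB[b], k * n * sizeof(float));
            cudaMalloc(&hC[b], m * n * sizeof(float));
        }
        // The batched API takes device arrays of pointers to the matrices.
        float **dA, **dB, **dC;
        cudaMalloc(&dA, batch * sizeof(float*));
        cudaMalloc(&dB, batch * sizeof(float*));
        cudaMalloc(&dC, batch * sizeof(float*));
        cudaMemcpy(dA, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

        // One (m, n, k), transposition, and alpha/beta for the whole batch.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, dA, m, dB, k, &beta, dC, m, batch);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        // (device frees omitted for brevity)
        return 0;
    }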

cuFFT: Release 12.4:
New Features:
- Added Just-In-Time Link-Time Optimized (JIT LTO) kernels for improved performance in FFTs with 64-bit indexing
- Added per-plan properties to the cuFFT API. These new routines can be leveraged to give users more control over the behavior of cuFFT. Currently they can be used to enable JIT LTO kernels for 64-bit FFTs.
- Improved accuracy for certain single-precision (fp32) FFT cases, especially involving FFTs for larger sizes

Resolved Issues:
- Fixed an issue that could cause overwriting of user data when performing out-of-place real-to-complex (R2C) transforms with user-specified output strides (i.e., using the ostride component of the Advanced Data Layout API; see the sketch after this list).
- Fixed inconsistent behavior between libcufftw and FFTW when both inembed and onembed are nullptr / NULL. From now on, as in FFTW, passing nullptr / NULL as inembed/onembed parameter is equivalent to passing n, that is, the logical size for that dimension.
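For reference, here is a sketch of the configuration the R2C fix concerns: an out-of-place real-to-complex plan with a user-specified output stride via cufftPlanMany's Advanced Data Layout parameters (sizes and strides are illustrative; link against libcufft).

    #include <cufft.h>

    int main() {
        const int N = 1024;
        cufftHandle plan;
        int n[1]       = {N};          // logical transform size
        int inembed[1] = {N};          // real input layout
        int onembed[1] = {N / 2 + 1};  // R2C output holds N/2+1 complex values
        cufftPlanMany(&plan, 1, n,
                      inembed, /*istride=*/1, /*idist=*/N,
                      onembed, /*ostride=*/2, /*odist=*/2 * (N / 2 + 1),
                      CUFFT_R2C, /*batch=*/1);
        cufftDestroy(plan);
        return 0;
    }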

cuSOLVER: Release 12.4:
New Features:
- cusolverDnXlarft and cusolverDnXlarft_bufferSize APIs were introduced. cusolverDnXlarft forms the triangular factor of a real block reflector, while cusolverDnXlarft_bufferSize returns its required workspace sizes in bytes.
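For context, and assuming cusolverDnXlarft follows the semantics of LAPACK's ?larft (an assumption based on the shared naming convention), the triangular factor T it forms relates a block reflector H, built from k elementary reflectors whose vectors are stored in V, as:

    H = H(1) H(2) ... H(k) = I - V * T * V**T

with T upper triangular in the forward, columnwise case. The exact argument list is in the cuSOLVER documentation.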

cuSPARSE: Release 12.4:
New Features:
- Added cusparseSpMV_preprocess(), a preprocessing step for sparse matrix-vector multiplication (see the sketch after this list)
- Added support for mixed real and complex types for cusparseSpMM()
- Added a new API cusparseSpSM_updateMatrix() to update the sparse matrix between the analysis and solving phase of cusparseSpSM()
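For orientation, below is a sketch of the cusparseSpMV call flow that the new preprocessing step slots into. Only long-standing cuSPARSE generic-API calls are shown; a comment marks where cusparseSpMV_preprocess() would go once, before repeated SpMV executions (its exact parameter list is in the cuSPARSE documentation). The matrix is a 2x2 CSR identity, purely illustrative.

    #include <cusparse.h>
    #include <cuda_runtime.h>

    int main() {
        cusparseHandle_t handle;
        cusparseCreate(&handle);

        int   hRowPtr[3] = {0, 1, 2};
        int   hColInd[2] = {0, 1};
        float hVal[2]    = {1.0f, 1.0f};
        float hX[2] = {3.0f, 4.0f}, hY[2] = {0.0f, 0.0f};

        int *dRowPtr, *dColInd; float *dVal, *dX, *dY;
        cudaMalloc(&dRowPtr, sizeof(hRowPtr));
        cudaMalloc(&dColInd, sizeof(hColInd));
        cudaMalloc(&dVal, sizeof(hVal));
        cudaMalloc(&dX, sizeof(hX));
        cudaMalloc(&dY, sizeof(hY));
        cudaMemcpy(dRowPtr, hRowPtr, sizeof(hRowPtr), cudaMemcpyHostToDevice);
        cudaMemcpy(dColInd, hColInd, sizeof(hColInd), cudaMemcpyHostToDevice);
        cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);
        cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);
        cudaMemcpy(dY, hY, sizeof(hY), cudaMemcpyHostToDevice);

        cusparseSpMatDescr_t A; cusparseDnVecDescr_t x, y;
        cusparseCreateCsr(&A, 2, 2, 2, dRowPtr, dColInd, dVal,
                          CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                          CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
        cusparseCreateDnVec(&x, 2, dX, CUDA_R_32F);
        cusparseCreateDnVec(&y, 2, dY, CUDA_R_32F);

        float alpha = 1.0f, beta = 0.0f;
        size_t bufSize = 0; void *dBuf = nullptr;
        cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                &alpha, A, x, &beta, y, CUDA_R_32F,
                                CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
        cudaMalloc(&dBuf, bufSize);
        // cusparseSpMV_preprocess() would be called once here (new in 12.4).
        cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                     A, x, &beta, y, CUDA_R_32F,
                     CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

        cusparseDestroySpMat(A);
        cusparseDestroyDnVec(x);
        cusparseDestroyDnVec(y);
        cusparseDestroy(handle);
        return 0;
    }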

Resolved Issues:
- cusparseSpVV() provided incorrect results when the sparse vector has many non-zeros

CUDA Math: Release 12.4:
Resolved Issues:
- Host-side code in the cuda_fp16/bf16 headers is now free from type punning and works correctly in the presence of optimizations based on strict-aliasing rules
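The fix concerns the host-side code paths of these headers. A small host-only sketch using the conversion helpers that cuda_fp16.h provides (compile with nvcc):

    #include <cuda_fp16.h>
    #include <cstdio>

    int main() {
        // Round-trip a float through __half on the host, exercising the
        // header's host code paths (the ones affected by the fix).
        __half h = __float2half(3.140625f);  // exactly representable in fp16
        float  f = __half2float(h);
        printf("%f\n", f);
        return 0;
    }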

NPP: Release 12.4:
New Features:
- Enhanced large file support with size_t
