Add PersistentProgramCache (sqlite + filestream backends) #1912
cpcloud wants to merge 3 commits into NVIDIA:main
Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager and pickle-safety warning. Path-backed
ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction against an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the size cap bounds real on-disk usage. __contains__
is read-only -- it does not bump LRU. __len__ counts only entries
that survive validation and prunes corrupt rows. Schema-version
mismatch on open drops the tables and rebuilds; corrupt /
non-SQLite files are detected and the cache reinitialises empty.
Transient OperationalError ("database is locked") propagates
without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. On-disk
filenames are blake2b(32) hashes of the key so arbitrary-length
keys never overflow filesystem name limits. Reader pruning is
stat-guarded: only delete a corrupt-looking file if its inode/
size/mtime have not changed since the read, so a concurrent
os.replace by a writer is preserved. clear() and _enforce_size_cap
use the same stat guard. Stale temp files (older than 1 hour) are
swept on open and during eviction; live temp files count toward
the size cap. Windows ERROR_SHARING_VIOLATION (32) and
ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
backoff (~185ms) before being treated as a non-fatal cache miss;
other PermissionErrors and all POSIX failures propagate. __len__
matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code,
code_type, ProgramOptions, target_type, name expressions, cuda
core/NVRTC versions, NVVM lib+IR version, linker backend+version
for PTX inputs (driver version included only on the cuLink path).
Backend-specific gates mirror Program/Linker:
* code_type lower-cased to match Program_init.
* code_type/target_type combination validated against Program's
SUPPORTED_TARGETS matrix.
* NVRTC side-effect options (create_pch, time, fdevice_time_trace)
and external-content options (include_path, pre_include, pch,
use_pch, pch_dir) require an extra_digest from the caller. The
per-field set/unset predicate (_option_is_set) mirrors the
compiler's emission gates; collections.abc.Sequence is the
is_sequence check, matching _prepare_nvrtc_options_impl.
* NVVM use_libdevice=True requires extra_digest because libdevice
bitcode comes from the active toolkit. extra_sources is
rejected for non-NVVM. Bytes-like ``code`` is rejected for
non-NVVM (Program() requires str there).
* PTX (Linker) input options are normalised through per-field
gates that match _prepare_nvjitlink_options /
_prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
to a sentinel under the driver linker (it ignores them).
ptxas_options canonicalises across str/list/tuple/empty shapes.
The driver linker's hard rejections (time, ptxas_options,
split_compile) raise at key time.
* name_expressions are gated on backend == "nvrtc"; PTX/NVVM
ignore them, matching Program.compile.
* Failed environment probes mix the exception class name into a
*_probe_failed label so broken environments never collide with
working ones, while staying stable across processes and across
repeated calls within a process.
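To make the key derivation above concrete, here is a minimal sketch of hashing a handful of inputs into a 32-byte blake2b digest. The name `sketch_cache_key`, its parameters, and the length-prefixed field encoding are assumptions for illustration only; the real make_program_cache_key covers many more fields and the backend gates listed above.

```python
import hashlib


def sketch_cache_key(code: str, code_type: str, options: dict,
                     versions: dict, extra_digest: bytes = b"") -> bytes:
    """Hypothetical stand-in for make_program_cache_key (not the real API)."""
    h = hashlib.blake2b(digest_size=32)

    def feed(label: str, value: bytes) -> None:
        # Length-prefix every part so ("ab", "c") and ("a", "bc") hash differently.
        for part in (label.encode(), value):
            h.update(len(part).to_bytes(8, "little"))
            h.update(part)

    feed("code", code.encode())
    # Lower-case code_type, mirroring the normalisation described above.
    feed("code_type", code_type.lower().encode())
    # Sort names so dict insertion order never changes the key.
    for name in sorted(options):
        feed("opt:" + name, repr(options[name]).encode())
    for name in sorted(versions):
        feed("ver:" + name, repr(versions[name]).encode())
    feed("extra_digest", extra_digest)
    return h.digest()
```

Mixing version strings into the digest means a toolkit upgrade naturally produces a different key, so stale entries are simply never looked up again.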
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
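The lazy-export mechanism can be sketched with a module-level __getattr__ (PEP 562). The module name `demo_utils` and the `_LAZY_ATTRS` mapping are hypothetical, and standard-library modules stand in for the cache backends; this is not the PR's actual implementation.

```python
import importlib
import types

# Hypothetical mapping: exported name -> module that really defines it.
_LAZY_ATTRS = {"JSONDecoder": "json.decoder", "Fraction": "fractions"}

# Build a module object in place so the sketch is self-contained; in a real
# package the function below would simply live in __init__.py.
demo = types.ModuleType("demo_utils")


def _module_getattr(name: str):
    # Called only when normal attribute lookup on the module fails.
    try:
        target = _LAZY_ATTRS[name]
    except KeyError:
        raise AttributeError(
            f"module 'demo_utils' has no attribute {name!r}"
        ) from None
    value = getattr(importlib.import_module(target), name)
    setattr(demo, name, value)  # cache so later lookups bypass __getattr__
    return value


demo.__getattr__ = _module_getattr
```

Accessing `demo.Fraction` triggers the import on first use; names not listed in the mapping still raise AttributeError as usual.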
Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
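The stat-guarded prune exercised by the race tests can be sketched as follows. The helper names (`entry_path`, `atomic_write`, `snapshot`, `guarded_unlink`) are hypothetical and the sketch omits the schema checks, temp-file sweeping, and Windows retry logic described in the commit message.

```python
import contextlib
import hashlib
import os
import tempfile
from pathlib import Path


def entry_path(root: Path, key: bytes) -> Path:
    # Fixed-length hex name, so long keys never exceed filename limits.
    return root / hashlib.blake2b(key, digest_size=32).hexdigest()


def atomic_write(root: Path, key: bytes, payload: bytes) -> Path:
    final = entry_path(root, key)
    fd, tmp = tempfile.mkstemp(dir=root, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, final)  # readers see the old file or the new one
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)  # best-effort cleanup of the temp file
        raise
    return final


def snapshot(path: Path) -> tuple:
    st = path.stat()
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def guarded_unlink(path: Path, seen: tuple) -> bool:
    """Delete path only if it still matches the snapshot taken at read time."""
    try:
        if snapshot(path) != seen:
            return False  # a writer replaced the file; keep the new entry
        path.unlink()
        return True
    except FileNotFoundError:
        return False
```

A small window remains between the stat and the unlink, so the guard narrows the race rather than eliminating it.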
…ent pickle compat; add usage example
Generated with the help of Cursor GPT-5.4 Extra High Fast High:
__all__ = [
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    "make_program_cache_key",
]

# Lazily expose the program-cache APIs so ``from cuda.core.utils import
# StridedMemoryView`` stays lightweight -- the cache backends pull in driver,
# NVRTC, and module-load machinery that memoryview-only consumers do not need.
_LAZY_CACHE_ATTRS = frozenset(
    {
        "FileStreamProgramCache",
        "ProgramCacheResource",
        "SQLiteProgramCache",
        "make_program_cache_key",
    }
)
Small readability/maintenance cleanup suggestion:
__all__ and _LAZY_CACHE_ATTRS currently duplicate the same cache-export names, so defining the ordered lazy-export list once and reusing it in __all__ seems a bit easier to scan and reduces the chance that the two drift apart later.
Something along these lines:
_LAZY_CACHE_ATTRS = (
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "make_program_cache_key",
)
__all__ = [
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    *_LAZY_CACHE_ATTRS,
]

Mostly just a readability nit, but I think this makes the relationship between "lazy exports" and "public exports" a little clearer.
Summary
* Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so ``from cuda.core.utils import StridedMemoryView`` stays lightweight.
* ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
* make_program_cache_key() -- blake2b(32) digest with backend-specific gates that mirror Program/Linker:
  * Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like code for non-NVVM and extra_sources for non-NVVM.
  * NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
  * PTX (Linker) options are normalised through per-field gates matching _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options is canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under the driver linker.
  * Failed environment probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
* SQLiteProgramCache -- single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- multi-process safe via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. Stale temp files are swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionErrors and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.
* Program.compile(cache=...) integration is out of scope (tracked by #176/#179).

Test plan
* Cache tests covering single-process CRUD; LRU/size-cap (logical and on-disk); corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
* End-to-end test that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens the cache, and calls get_kernel on the deserialised ObjectCode, parametrized over both backends.

Closes #178
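For readers skimming the SQLite backend description in the summary, the LRU-with-size-cap behaviour might look roughly like the sketch below. The class `TinyLRUCache`, its table layout, and the integer "used" counter are assumptions for illustration; validation, schema versioning, and corruption handling from the real SQLiteProgramCache are left out.

```python
import sqlite3
import threading


class TinyLRUCache:
    """Hypothetical bytes -> bytes cache; not the PR's SQLiteProgramCache."""

    def __init__(self, path: str, max_bytes: int):
        self._lock = threading.RLock()  # serialise all connection use
        self._max_bytes = max_bytes
        # isolation_level=None puts the sqlite3 module in autocommit mode.
        self._db = sqlite3.connect(path, isolation_level=None,
                                   check_same_thread=False)
        self._db.execute("PRAGMA journal_mode=WAL")
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "key BLOB PRIMARY KEY, value BLOB NOT NULL, used INTEGER NOT NULL)"
        )

    def _next_tick(self) -> int:
        # Monotonic counter standing in for an access timestamp.
        row = self._db.execute(
            "SELECT COALESCE(MAX(used), 0) + 1 FROM entries").fetchone()
        return row[0]

    def __contains__(self, key: bytes) -> bool:
        with self._lock:  # read-only: does not bump the LRU position
            row = self._db.execute(
                "SELECT 1 FROM entries WHERE key = ?", (key,)).fetchone()
        return row is not None

    def get(self, key: bytes):
        with self._lock:
            row = self._db.execute(
                "SELECT value FROM entries WHERE key = ?", (key,)).fetchone()
            if row is None:
                return None
            self._db.execute("UPDATE entries SET used = ? WHERE key = ?",
                             (self._next_tick(), key))
            return row[0]

    def put(self, key: bytes, value: bytes) -> None:
        with self._lock:
            self._db.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
                             (key, value, self._next_tick()))
            self._evict()

    def _evict(self) -> None:
        evicted = False
        while True:
            (total,) = self._db.execute(
                "SELECT COALESCE(SUM(LENGTH(value)), 0) FROM entries"
            ).fetchone()
            if total <= self._max_bytes:
                break
            # Drop the least recently used entry and re-check the total.
            self._db.execute(
                "DELETE FROM entries WHERE key = "
                "(SELECT key FROM entries ORDER BY used LIMIT 1)")
            evicted = True
        if evicted:
            # Shrink the files so the cap reflects on-disk size, as described.
            self._db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
            self._db.execute("VACUUM")
```

The cap here counts payload bytes only; bounding the actual file size, as the PR describes, needs the checkpoint and VACUUM step plus a check against the real file size.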