How generic type specification permits powerful static analysis and runtime validation
As tools for Python type annotations (or hints) have evolved, more complex data structures can be typed, improving maintainability and static analysis. Arrays and DataFrames, as complex containers, have only recently supported complete type annotations in Python. NumPy 1.22 introduced generic specification of arrays and dtypes. Building on NumPy's foundation, StaticFrame 2.0 introduced complete type specification of DataFrames, employing NumPy primitives and variadic generics. This article demonstrates practical approaches to fully type-hinting arrays and DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
StaticFrame is an open-source DataFrame library of which I am an author.
Type hints (see PEP 484) improve code quality in a number of ways. Instead of using variable names or comments to communicate types, Python-object-based type annotations provide maintainable and expressive tools for type specification. These type annotations can be tested with type checkers such as mypy or pyright, quickly discovering potential bugs without executing code.
The same annotations can be used for runtime validation. While reliance on duck-typing over runtime validation is common in Python, runtime validation is more often needed with complex data structures such as arrays and DataFrames. For example, an interface expecting a DataFrame argument, if given a Series, might not need explicit validation, as usage of the wrong type will likely raise. However, an interface expecting a 2D array of floats, if given an array of Booleans, might benefit from validation, as usage of the wrong type may not raise.
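The Boolean-array case can be made concrete with a small sketch. The function names scale and scale_checked are hypothetical and used only for illustration; the point is that NumPy silently promotes Booleans to floats, so nothing raises unless a check is written explicitly.

```python
import numpy as np

def scale(values):
    """Halve each value; intended for a 2D array of floats."""
    return values * 0.5

floats = np.array([[1.0, 2.0], [3.0, 4.0]])
flags = np.array([[True, False], [False, True]])

halved = scale(floats)  # the intended use
silent = scale(flags)   # Booleans are promoted to float64; no exception is raised

def scale_checked(values):
    """Same operation, with an explicit dimensionality and dtype check."""
    if values.ndim != 2 or not np.issubdtype(values.dtype, np.floating):
        raise TypeError(
            f"expected a 2D float array, provided {values.ndim}D {values.dtype}"
        )
    return values * 0.5
```

Calling scale_checked(flags) raises a TypeError, which is the kind of failure that hand-written validation, or annotation-driven runtime validation, is meant to surface.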
Many important typing utilities are only available with the most-recent versions of Python. Fortunately, the typing-extensions package back-ports standard library utilities for older versions of Python. A related challenge is that type checkers can take time to implement full support for new features: many of the examples shown here require at least mypy 1.9.0.
Without type annotations, a Python function signature gives no indication of the expected types. For example, the function below might take and return any types:
def process0(v, q): ...  # no type information
By adding type annotations, the signature informs readers of the expected types. With modern Python, user-defined and built-in classes can be used to specify types, with additional resources (such as Any, Iterator, cast(), and Annotated) found in the standard library typing module. For example, the interface below improves the one above by making expected types explicit:
def process0(v: int, q: bool) -> list[float]: ...
When used with a type checker like mypy, code that violates the specifications of the type annotations will raise an error during static analysis (shown as comments, below). For example, providing an integer when a Boolean is required is an error:
x = process0(v=5, q=20)
# tp.py: error: Argument "q" to "process0"
# has incompatible type "int"; expected "bool" [arg-type]
Static analysis can only validate statically defined types. The full range of runtime inputs and outputs is often more diverse, suggesting some form of runtime validation. The best of both worlds is possible by reusing type annotations for runtime validation. While there are libraries that do this (e.g., typeguard and beartype), StaticFrame offers CallGuard, a tool specialized for comprehensive array and DataFrame type-annotation validation.
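As a small illustration of the limits of static analysis, consider arguments that arrive from parsed JSON. This is only a sketch: the payload string is invented for the example, and process0 is given a body so the snippet runs on its own. Because json.loads() returns Any, a static checker has nothing to compare against the signature, and an integer reaches the Boolean parameter unnoticed.

```python
import json

def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]

# json.loads() is typed as returning Any, so the call below is accepted by a
# static checker even though "q" holds an int at runtime.
payload = json.loads('{"v": 3, "q": 20}')
result = process0(**payload)
```

The call succeeds because any non-zero integer is truthy, so the mismatch is never reported unless the annotations are also checked at runtime.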
A Python decorator is ideal for leveraging annotations for runtime validation. CallGuard offers two decorators: @CallGuard.check, which raises an informative Exception on error, and @CallGuard.warn, which issues a warning.
Further extending the process0 function above with @CallGuard.check, the same type annotations can be used to raise an Exception (shown again as comments) when runtime objects violate the requirements of the type annotations:
import static_frame as sf
@sf.CallGuard.check
def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]

z = process0(v=5, q=20)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: int, q: bool) -> list[float]
# └── Expected bool, provided int invalid
While type annotations must be valid Python, they are irrelevant at runtime and can be wrong: it is possible to have correctly verified types that do not reflect runtime reality. As shown above, reusing type annotations for runtime checks ensures annotations are valid.
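To show the general mechanism by which a decorator can reuse annotations, here is a minimal standard-library sketch. It is not how CallGuard is implemented: check_simple is a hypothetical name, and it only handles parameters annotated with plain classes, skipping generics and the return value entirely.

```python
import functools
import inspect
from typing import get_origin, get_type_hints

def check_simple(func):
    """Hypothetical decorator: validate arguments annotated with plain classes."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Generic aliases such as list[float] are deliberately skipped.
            is_plain_class = isinstance(expected, type) and get_origin(expected) is None
            if is_plain_class and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: expected {expected.__name__} for {name!r}, "
                    f"provided {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper

@check_simple
def process0(v: int, q: bool) -> list[float]:
    return [x * (0.5 if q else 0.25) for x in range(v)]
```

With this sketch, process0(v=5, q=20) raises a TypeError while process0(v=2, q=True) returns normally. Note that isinstance() treats bool as a subclass of int, so a Boolean passed for v would be accepted; a specialized tool is needed for stricter and deeper checks.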
Python classes that permit component type specification are "generic". Component types are specified with positional "type variables". A list of integers, for example, is annotated with list[int]; a dictionary of floats keyed by tuples of integers and strings is annotated dict[tuple[int, str], float].
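Generic annotations are ordinary Python objects, so their component types can be inspected at runtime. The sketch below uses the standard library's get_origin() and get_args(); the alias name TPrices is an assumption made for the example.

```python
from typing import get_args, get_origin

# Hypothetical alias for a dictionary of floats keyed by (int, str) tuples.
TPrices = dict[tuple[int, str], float]

origin = get_origin(TPrices)              # the unparameterized container class
key_type, value_type = get_args(TPrices)  # the positional component types
key_components = get_args(key_type)       # the components of the tuple key
```

This kind of introspection is what allows a runtime validator to walk an annotation and compare each component with the object it receives.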
With NumPy 1.20, ndarray and dtype become generic. The generic ndarray requires two arguments, a shape and a dtype. As the usage of the first argument is still under development, Any is commonly used. The second argument, dtype, is itself a generic that requires a type variable for a NumPy type such as np.int64. NumPy also offers more general generic types such as np.integer[Any].
For example, an array of Booleans is annotated np.ndarray[Any, np.dtype[np.bool_]]; an array of any type of integer is annotated np.ndarray[Any, np.dtype[np.integer[Any]]].
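The nesting of these annotations can be unpacked with the same standard-library helpers used for built-in generics. This sketch assumes a NumPy version in which ndarray and dtype can be subscripted at runtime.

```python
from typing import Any, get_args, get_origin

import numpy as np

TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]

# The outer generic carries two arguments: a shape placeholder and a dtype.
shape_type, dtype_type = get_args(TNDArrayBool)
# The dtype generic carries the NumPy scalar type.
(scalar_type,) = get_args(dtype_type)

# A concrete array exposes the same scalar type through its dtype.
flags = np.array([True, False])
matches = flags.dtype.type is scalar_type
```

Comparing the scalar type recovered from the annotation with the scalar type of an array's dtype is the basic step a runtime validator needs to perform.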
As generic annotations with component type specifications can become verbose, it is practical to store them as type aliases (here prefixed with "T"). The following example specifies such aliases and then uses them in a function.
from typing import Any
import numpy as np

TNDArrayInt8 = np.ndarray[Any, np.dtype[np.int8]]
TNDArrayBool = np.ndarray[Any, np.dtype[np.bool_]]
TNDArrayFloat64 = np.ndarray[Any, np.dtype[np.float64]]

def process1(
    v: TNDArrayInt8,
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s
As before, when used with mypy, code that violates the type annotations will raise an error during static analysis. For example, providing an integer when a Boolean is required is an error:
v1: TNDArrayInt8 = np.arange(20, dtype=np.int8)
x = process1(v1, v1)
# tp.py: error: Argument 2 to "process1" has incompatible type
# "ndarray[Any, dtype[floating[_64Bit]]]"; expected "ndarray[Any, dtype[bool_]]" [arg-type]
The interface requires 8-bit signed integers (np.int8); attempting to use a different sized integer is also an error:
TNDArrayInt64 = np.ndarray[Any, np.dtype[np.int64]]
v2: TNDArrayInt64 = np.arange(20, dtype=np.int64)
q: TNDArrayBool = np.arange(20) % 3 == 0
x = process1(v2, q)
# tp.py: error: Argument 1 to "process1" has incompatible type
# "ndarray[Any, dtype[signedinteger[_64Bit]]]"; expected "ndarray[Any, dtype[signedinteger[_8Bit]]]" [arg-type]
While some interfaces might benefit from such narrow numeric type specifications, broader specification is possible with NumPy's generic types such as np.integer[Any], np.signedinteger[Any], np.floating[Any], etc. For example, we can define a new function that accepts any size signed integer. Static analysis now passes with both TNDArrayInt8 and TNDArrayInt64 arrays.
TNDArrayIntAny = np.ndarray[Any, np.dtype[np.signedinteger[Any]]]

def process2(
    v: TNDArrayIntAny,  # a more flexible interface
    q: TNDArrayBool,
) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process2(v1, q)  # no mypy error
x = process2(v2, q)  # no mypy error
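The broader generic types correspond to NumPy's scalar type hierarchy, which can be explored at runtime with np.issubdtype(). The following sketch lists which concrete types fall under np.signedinteger and np.integer; the dictionary names are chosen only for this example.

```python
import numpy as np

candidates = [np.int8, np.int64, np.uint8, np.float64]

# Which concrete scalar types are signed integers of any size?
is_signed = {t.__name__: bool(np.issubdtype(t, np.signedinteger)) for t in candidates}

# Which are integers of any size, signed or unsigned?
is_integer = {t.__name__: bool(np.issubdtype(t, np.integer)) for t in candidates}
```

Both np.int8 and np.int64 qualify as signed integers, which mirrors why the TNDArrayIntAny annotation accepts arrays of either dtype.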
Just as shown above with elements, generically specified NumPy arrays can be validated at runtime if decorated with CallGuard.check:
@sf.CallGuard.check
def process3(v: TNDArrayIntAny, q: TNDArrayBool) -> TNDArrayFloat64:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process3(v1, q)  # no error, same as mypy
x = process3(v2, q)  # no error, same as mypy
v3: TNDArrayFloat64 = np.arange(20, dtype=np.float64) * 0.5
x = process3(v3, q)  # error
# static_frame.core.type_clinic.ClinicError:
# In args of (v: ndarray[Any, dtype[signedinteger[Any]]],
# q: ndarray[Any, dtype[bool_]]) -> ndarray[Any, dtype[float64]]
# └── ndarray[Any, dtype[signedinteger[Any]]]
#     └── dtype[signedinteger[Any]]
#         └── Expected signedinteger, provided float64 invalid
StaticFrame provides utilities to extend runtime validation beyond type checking. Using the typing module's Annotated class (see PEP 593), we can extend the type specification with one or more StaticFrame Require objects. For example, to validate that an array has a 1D shape of (24,), we can replace TNDArrayIntAny with Annotated[TNDArrayIntAny, sf.Require.Shape(24)]. To validate that a float array has no NaNs, we can replace TNDArrayFloat64 with Annotated[TNDArrayFloat64, sf.Require.Apply(lambda a: ~np.isnan(a).any())].
Implementing a new function, we can require that all input and output arrays have the shape (24,). Calling this function with the previously created arrays raises an error:
from typing import Annotated
@sf.CallGuard.check
def process4(
    v: Annotated[TNDArrayIntAny, sf.Require.Shape(24)],
    q: Annotated[TNDArrayBool, sf.Require.Shape(24)],
) -> Annotated[TNDArrayFloat64, sf.Require.Shape(24)]:
    s: TNDArrayFloat64 = np.where(q, 0.5, 0.25)
    return v * s

x = process4(v1, q)  # types pass, but Require.Shape fails
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Annotated[ndarray[Any, dtype[int8]], Shape((24,))], q: Annotated[ndarray[Any, dtype[bool_]], Shape((24,))]) -> Annotated[ndarray[Any, dtype[float64]], Shape((24,))]
# └── Annotated[ndarray[Any, dtype[int8]], Shape((24,))]
#     └── Shape((24,))
#         └── Expected shape ((24,)), provided shape (20,)
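The way Annotated carries extra validators alongside a type can be sketched with the standard library alone. The RequireLen class and check_value function below are hypothetical stand-ins, not StaticFrame's Require objects; they only show how metadata can be recovered from an annotation and applied to a value.

```python
from typing import Annotated, get_args, get_type_hints

class RequireLen:
    """Hypothetical validator: require an exact length."""

    def __init__(self, size: int) -> None:
        self.size = size

    def validate(self, value) -> None:
        if len(value) != self.size:
            raise ValueError(
                f"Expected length {self.size}, provided length {len(value)}"
            )

def total(values: Annotated[list[int], RequireLen(3)]) -> int:
    return sum(values)

# include_extras=True keeps the Annotated metadata; the default strips it.
hints = get_type_hints(total, include_extras=True)
base_type, *validators = get_args(hints["values"])

def check_value(annotation, value) -> None:
    """Run every validator attached to an Annotated annotation."""
    _, *attached = get_args(annotation)
    for validator in attached:
        validator.validate(value)
```

Here check_value(hints["values"], [1, 2, 3]) returns without error, while a two-element list raises a ValueError. Static checkers ignore the metadata and see only the base type, which is why the same annotation can serve both kinds of checking.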
Just like a dictionary, a DataFrame is a complex data structure composed of many component types: the index labels, column labels, and the column values are all distinct types.
A challenge of generically specifying a DataFrame is that a DataFrame has a variable number of columns, where each column might be a different type. The Python TypeVarTuple variadic generic specifier (see PEP 646), first released in Python 3.11, permits defining a variable number of column type variables.
With StaticFrame 2.0, Frame, Series, Index and related containers become generic. Support for variable column type definitions is provided by TypeVarTuple, back-ported with the implementation in typing-extensions for compatibility down to Python 3.9.
A generic Frame requires two or more type variables: the type of the index, the type of the columns, and zero or more specifications of columnar value types specified with NumPy types. A generic Series requires two type variables: the type of the index and a NumPy type for the values. The Index is itself generic, also requiring a NumPy type as a type variable.
With generic specification, a Series of floats, indexed by dates, can be annotated with sf.Series[sf.IndexDate, np.float64]. A Frame with dates as index labels, strings as column labels, and column values of integers and floats can be annotated with sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64].
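The idea of a generic with fixed leading type variables followed by a variable number of column types can be sketched with TypeVarTuple directly. The Table class here is a hypothetical toy, not StaticFrame's Frame, and the fallback import assumes typing-extensions is installed on interpreters older than Python 3.11.

```python
import sys
from typing import Generic, TypeVar, get_args

if sys.version_info >= (3, 11):
    from typing import TypeVarTuple, Unpack
else:  # back-port for older interpreters (assumes typing-extensions is installed)
    from typing_extensions import TypeVarTuple, Unpack

TIndex = TypeVar("TIndex")
TColumns = TypeVarTuple("TColumns")

class Table(Generic[TIndex, Unpack[TColumns]]):
    """Hypothetical container: one index type, then any number of column types."""

# The variadic portion can hold several column types, or none at all.
TTwoColumns = Table[str, int, float]
TNoColumns = Table[str]

two_column_args = get_args(TTwoColumns)
no_column_args = get_args(TNoColumns)
```

A single class definition thus supports annotations with any number of column types, which is the property a generic DataFrame needs.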
Given a complex Frame, deriving the annotation might be difficult. StaticFrame offers the via_type_clinic interface to provide a complete generic specification for any component at runtime:
>>> v4 = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
...     columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))
>>> v4
<Frame>
<Index>         a       b         <<U1>
<IndexDate>
2021-12-30      0       1.5
2021-12-31      1       2.0
2022-01-01      2       2.5
2022-01-02      3       3.0
2022-01-03      4       3.5
<datetime64[D]> <int64> <float64>

# get a string representation of the annotation
>>> v4.via_type_clinic
Frame[IndexDate, Index[str_], int64, float64]
As shown with arrays, storing annotations as type aliases permits reuse and more concise function signatures. Below, a new function is defined with generic Frame and Series arguments fully annotated. A cast is required as not all operations can statically resolve their return type.
from typing import cast

TFrameDateInts = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64]
TSeriesYMBool = sf.Series[sf.IndexYearMonth, np.bool_]
TSeriesDFloat = sf.Series[sf.IndexDate, np.float64]

def process5(v: TFrameDateInts, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')])  # type: ignore
    s = np.where(t, 0.5, 0.25)
    return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))
These more complex annotated interfaces can also be validated with mypy. Below, a Frame without the expected column value types is passed, causing mypy to error (shown as comments, below).
TFrameDateIntFloat = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.float64]
v5: TFrameDateIntFloat = sf.Frame.from_fields([range(5), np.arange(3, 8) * 0.5],
    columns=('a', 'b'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))

q: TSeriesYMBool = sf.Series([True, False],
    index=sf.IndexYearMonth.from_date_range('2021-12', '2022-01'))

x = process5(v5, q)
# tp.py: error: Argument 1 to "process5" has incompatible type
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], floating[_64Bit]]"; expected
# "Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]" [arg-type]
To use the same type hints for runtime validation, the sf.CallGuard.check decorator can be applied to process5. Below, a Frame of three integer columns is provided where a Frame of two columns is expected.
# a Frame of three columns of integers
TFrameDateIntIntInt = sf.Frame[sf.IndexDate, sf.Index[np.str_], np.int64, np.int64, np.int64]
v6: TFrameDateIntIntInt = sf.Frame.from_fields([range(5), range(3, 8), range(1, 6)],
    columns=('a', 'b', 'c'), index=sf.IndexDate.from_date_range('2021-12-30', '2022-01-03'))

x = process5(v6, q)
# static_frame.core.type_clinic.ClinicError:
# In args of (v: Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]],
# q: Series[IndexYearMonth, bool_]) -> Series[IndexDate, float64]
# └── Frame[IndexDate, Index[str_], signedinteger[_64Bit], signedinteger[_64Bit]]
#     └── Expected Frame has 2 dtype, provided Frame has 3 dtype
It might not be practical to annotate every column of every Frame: it is common for interfaces to work with Frames that have a variable number of columns. TypeVarTuple supports this through the usage of *tuple[] expressions (introduced in Python 3.11, back-ported with the Unpack annotation). For example, the function above could be defined to take any number of integer columns with the annotation Frame[IndexDate, Index[np.str_], *tuple[np.int64, ...]], where *tuple[np.int64, ...] means zero or more integer columns.
The same implementation can be annotated with a far more general specification of columnar types. Below, the column values are annotated with np.number[Any] (permitting any type of numeric NumPy type) and a *tuple[] expression (permitting any number of columns): *tuple[np.number[Any], ...]. Now neither mypy nor CallGuard errors with either previously created Frame.
TFrameDateNums = sf.Frame[sf.IndexDate, sf.Index[np.str_], *tuple[np.number[Any], ...]]

@sf.CallGuard.check
def process6(v: TFrameDateNums, q: TSeriesYMBool) -> TSeriesDFloat:
    t = v.index.iter_label().apply(lambda l: q[l.astype('datetime64[M]')])  # type: ignore
    s = np.where(t, 0.5, 0.25)
    return cast(TSeriesDFloat, (v.via_T * s).mean(axis=1))

x = process6(v5, q)  # a Frame with integer, float columns passes
x = process6(v6, q)  # a Frame with three integer columns passes
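The zero-or-more meaning of a tuple[X, ...] expression, as opposed to a fixed-length tuple[X, Y], can be illustrated with a small standard-library sketch. The columns_match helper is hypothetical and greatly simplified relative to what a real validator must do; numbers.Number stands in for np.number so that the example has no third-party dependencies.

```python
import numbers
from typing import get_args

def columns_match(spec, column_types) -> bool:
    """Hypothetical helper: compare column types to a tuple[...] specification."""
    args = get_args(spec)
    if len(args) == 2 and args[1] is Ellipsis:
        # tuple[X, ...]: zero or more columns, each a subclass of X
        return all(issubclass(t, args[0]) for t in column_types)
    # tuple[X, Y]: exact count, matched position by position
    return len(args) == len(column_types) and all(
        issubclass(t, a) for t, a in zip(column_types, args)
    )

spec_any_numbers = tuple[numbers.Number, ...]
spec_two_ints = tuple[int, int]

mixed_ok = columns_match(spec_any_numbers, (int, float))
three_ok = columns_match(spec_any_numbers, (int, int, int))
empty_ok = columns_match(spec_any_numbers, ())
text_rejected = not columns_match(spec_any_numbers, (int, str))
count_rejected = not columns_match(spec_two_ints, (int, int, int))
```

The variable-length specification accepts two, three, or zero numeric columns, while the fixed-length specification rejects a mismatched column count, paralleling the difference between process5 and process6.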
As with NumPy arrays, Frame annotations can wrap Require specifications in Annotated generics, permitting the definition of additional run-time validations.
While StaticFrame might be the first DataFrame library to offer complete generic specification and a unified solution for both static type analysis and run-time type validation, other array and DataFrame libraries offer related utilities.
Neither the Tensor class in PyTorch (2.4.0), nor the Tensor class in TensorFlow (2.17.0) support generic type or shape specification. While both libraries offer a TensorSpec object that can be used to perform run-time type and shape validation, static type checking with tools like mypy is not supported.
As of Pandas 2.2.2, neither the Pandas Series nor DataFrame support generic type specifications. A number of third-party packages have offered partial solutions. The pandas-stubs library, for example, provides type annotations for the Pandas API, but does not make the Series or DataFrame classes generic. The Pandera library permits defining DataFrameSchema classes that can be used for run-time validation of Pandas DataFrames. For static analysis with mypy, Pandera offers alternative DataFrame and Series subclasses that permit generic specification with the same DataFrameSchema classes. This approach does not permit the expressive opportunities of using generic NumPy types or the unpack operator for supplying variadic generic expressions.
Python type annotations can make static analysis of types a valuable check of code quality, discovering errors before code is even executed. Up until recently, an interface might take an array or a DataFrame, but no specification of the types contained in those containers was possible. Now, complete specification of component types is possible in NumPy and StaticFrame, permitting more powerful static analysis of types.
Providing correct type annotations is an investment. Reusing those annotations for runtime checks provides the best of both worlds. StaticFrame's CallGuard runtime type checker is specialized to correctly evaluate fully specified generic NumPy types, as well as all generic StaticFrame containers.