nncf#

Neural Network Compression Framework (NNCF) for enhanced OpenVINO™ inference.

Classes#

NNCFConfig

Contains the configuration parameters required for NNCF to apply the selected algorithms.

Dataset

Wrapper for passing custom user datasets into NNCF algorithms.

CompressWeightsMode

Defines a mode for weight compression.

DropType

Describes the accuracy drop type, which determines how the accuracy drop between the original model and the compressed model is calculated.

ModelType

Describes the model type whose specific characteristics will be taken into account during compression.

QuantizationMode

Defines special modes.

SensitivityMetric

Defines a sensitivity metric for assigning quantization precision to layers.

TargetDevice

Target device architecture for compression.

QuantizationPreset

An enum with values corresponding to the available quantization presets.

OverflowFix

This option controls whether to apply the overflow issue fix for the 8-bit quantization.

IgnoredScope

Provides an option to specify portions of model to be excluded from compression.

Subgraph

Defines the ignored subgraph as follows: a subgraph comprises all nodes along all simple paths in the graph from input to output nodes.

Functions#

strip(model[, do_copy])

Returns the model object with as many custom NNCF additions as possible removed, while still preserving the functioning of the model object as a compressed model.

compress_weights(model[, mode, ratio, group_size, ...])

Compress model weights.

quantize(model, calibration_dataset[, mode, preset, ...])

Applies post-training quantization to the provided model.

quantize_with_accuracy_control(model, ...[, max_drop, ...])

Applies the post-training quantization algorithm with accuracy control to the provided model.

nncf.strip(model, do_copy=True)[source]#

Returns the model object with as many custom NNCF additions as possible removed, while still preserving the functioning of the model object as a compressed model.

Parameters:
  • model (TModel) – The compressed model.

  • do_copy (bool) – If True (default), will return a copy of the currently associated model object. If False, will return the currently associated model object “stripped” in-place.

Returns:

The stripped model.

Return type:

TModel
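
Example (a minimal, hedged sketch; compressed_model is a hypothetical variable standing for a model previously returned by an NNCF compression call such as nncf.quantize):

import nncf

# Remove custom NNCF additions in-place; the result still works as a
# compressed model, but without NNCF-specific wrappers.
stripped_model = nncf.strip(compressed_model, do_copy=False)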

class nncf.NNCFConfig(*args, **kwargs)[source]#

Bases: dict

Contains the configuration parameters required for NNCF to apply the selected algorithms.

This is a regular dictionary object extended with some utility functions, such as the ability to attach well-defined structures to pass non-serializable objects as parameters. It is primarily built from a .json file or from a Python JSON-like dictionary; both data types are checked against a JSONSchema. See the definition of the schema at https://openvinotoolkit.github.io/nncf/schema/, or by calling NNCFConfig.schema().

classmethod from_dict(nncf_dict)[source]#

Load NNCF config from a Python dictionary. The dict must contain only JSON-supported primitives.

Parameters:

nncf_dict (Dict) – A Python dict with the JSON-style configuration for NNCF.

Return type:

NNCFConfig
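
Example (a minimal, hedged sketch; the input_info key shown here is an assumption for illustration, and the authoritative set of keys is given by NNCFConfig.schema()):

import nncf

# Build the config from a JSON-style dictionary; the dict is validated
# against the NNCF JSONSchema.
config = nncf.NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
})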

classmethod from_json(path)[source]#

Load NNCF config from a JSON file at path.

Parameters:

path (str) – Path to the .json file containing the NNCF configuration.

Return type:

NNCFConfig

register_extra_structs(struct_list)[source]#

Attach the supplied list of extra configuration structures to this configuration object.

Parameters:

struct_list (List[nncf.config.structures.NNCFExtraConfigStruct]) – List of extra configuration structures.

get_redefinable_global_param_value_for_algo(param_name, algo_name)[source]#

Some parameters can be specified both at the global level of the NNCF config .json (so that they apply to all algorithms) and, at the same time, be overridden in the algorithm-specific section of the .json. This function returns the value that should apply for a given algorithm name, considering the exact format of this config.

Parameters:
  • param_name (str) – The name of a parameter in the .json specification of the NNCFConfig, that may be present either at the top-most level of the .json, or at the top level of the algorithm-specific subdict.

  • algo_name (str) – The name of the algorithm (among the allowed algorithm names in the .json) for which the resolution of the redefinable parameter should occur.

Returns:

The value of the parameter that should be applied for the algo specified by algo_name.

Return type:

Optional[str]
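
Example (a hedged illustration only; whether "target_device" and "quantization" are valid names here is an assumption, since the actual redefinable parameters and algorithm names are defined by the NNCF schema):

# If the parameter appears both at the top level of the config and inside
# the algorithm-specific section, the algorithm-specific value wins.
value = config.get_redefinable_global_param_value_for_algo(
    param_name="target_device", algo_name="quantization"
)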

static schema()[source]#

Returns the JSONSchema against which the input data formats (.json or Python dict) are validated.

Return type:

Dict

class nncf.Dataset(data_source, transform_func=None)[source]#

Bases: Generic[DataItem, ModelInput]

Wrapper for passing custom user datasets into NNCF algorithms.

This class defines the interface by which compression algorithms retrieve data items from the passed data source object. These data items are used for different purposes, for example, model inference and model validation, based on the choice of the exact compression algorithm.

If the data item returned by the data source per iteration cannot be used directly as input for model inference, the transformation function is used to extract the model’s input from this data item. For example, in supervised learning, a data item usually contains both an example and a label, so the transformation function should extract the example from the data item (see the sketch after the parameter list).

Parameters:
  • data_source (Iterable[DataItem]) – The iterable object serving as the source of data items.

  • transform_func (Optional[Callable[[DataItem], ModelInput]]) – The function that is used to extract the model’s input from the data item. The data item here is the data item that is returned from the data source per iteration. This function should be passed when the data item cannot be directly used as model’s input. If this is not specified, then the data item will be passed into the model as-is.
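
Example (a minimal sketch with toy in-memory data; any iterable, e.g. a framework data loader, can serve as the data source):

import nncf

# A toy data source where each item is an (example, label) pair.
data_source = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]

# The transformation function extracts the example, which serves as the
# model input; the label is dropped.
calibration_dataset = nncf.Dataset(data_source, transform_func=lambda item: item[0])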

get_data(indices=None)[source]#

Returns the iterable object that contains selected data items from the data source as-is.

Parameters:

indices (Optional[List[int]]) – The zero-based indices of data items that should be selected from the data source. The indices should be sorted in ascending order. If indices are not passed, all data items are selected from the data source.

Returns:

The iterable object that contains selected data items from the data source as-is.

Return type:

Iterable[DataItem]

get_inference_data(indices=None)[source]#

Returns the iterable object that contains selected data items from the data source, with the transformation function applied to each. The items returned per iteration from this iterable can be used as the model’s input for model inference.

Parameters:

indices (Optional[List[int]]) – The zero-based indices of data items that should be selected from the data source. The indices should be sorted in ascending order. If indices are not passed, all data items are selected from the data source.

Returns:

The iterable object that contains selected data items from the data source, with the transformation function applied.

Return type:

Iterable[ModelInput]
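
Continuing the sketch above, the transformed items can be retrieved for selected indices:

# Yields the model inputs [1.0, 2.0] and [3.0, 4.0] with labels stripped.
for model_input in calibration_dataset.get_inference_data(indices=[0, 1]):
    print(model_input)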

get_length()[source]#

Tries to fetch the length of the underlying dataset.

Returns:

The length of the data_source if __len__() is implemented for it, and None otherwise.

Return type:

Optional[int]

get_batch_size()[source]#

Tries to fetch the batch size of the underlying dataset.

Returns:

The value of the batch_size or _batch_size attribute of the data_source if it exists, and None otherwise.

Return type:

Optional[int]

class nncf.CompressWeightsMode[source]#

Bases: StrEnum

Defines a mode for weight compression.

Parameters:
  • INT8_SYM – Stands for 8-bit integer symmetric quantization of all weights.

  • INT8_ASYM – The same as INT8_SYM mode, but weights are quantized asymmetrically with a typical non-fixed zero point.

  • INT4_SYM – Stands for mixed-precision weight quantization with 4-bit integer as the primary precision; weights are quantized symmetrically with a fixed zero point equal to 8.

  • INT4_ASYM – The same as INT4_SYM mode, but weights are quantized asymmetrically with a typical non-fixed zero point.

  • NF4 – The same as INT4_SYM mode, but the primary precision is the NF4 data type without a zero point.

class nncf.DropType[source]#

Bases: StrEnum

Describes the accuracy drop type, which determines how the accuracy drop between the original model and the compressed model is calculated.

Parameters:
  • ABSOLUTE – The accuracy drop is calculated as the absolute drop with respect to the results of the original model.

  • RELATIVE – The accuracy drop is calculated relative to the results of the original model.

class nncf.ModelType[source]#

Bases: StrEnum

Describes the model type whose specific characteristics will be taken into account during compression.

Parameters:

TRANSFORMER – Transformer-based models (https://arxiv.org/pdf/1706.03762.pdf)

class nncf.QuantizationMode[source]#

Bases: StrEnum

Defines special modes. Currently contains only FP8-related modes (https://arxiv.org/pdf/2209.05433.pdf).

Parameters:
  • FP8_E4M3 – Mode with 4-bit exponent and 3-bit mantissa.

  • FP8_E5M2 – Mode with 5-bit exponent and 2-bit mantissa.

class nncf.SensitivityMetric[source]#

Bases: StrEnum

Defines a sensitivity metric for assigning quantization precision to layers. To preserve the accuracy of the model, the more sensitive layers receive a higher precision.

Parameters:
  • WEIGHT_QUANTIZATION_ERROR – The inverted 8-bit quantization noise. Weights with the highest value of this metric can be accurately quantized channel-wise to 8 bits. The idea is to keep these weights in 8-bit and quantize the rest of the layers to 4-bit group-wise. Since group-wise quantization is more accurate than per-channel, accuracy should not degrade.

  • HESSIAN_INPUT_ACTIVATION – The average Hessian trace of weights with respect to the layer-wise quantization error multiplied by L2 norm of 8-bit quantization noise.

  • MEAN_ACTIVATION_VARIANCE – The mean variance of the layers’ inputs multiplied by inverted 8-bit quantization noise.

  • MAX_ACTIVATION_VARIANCE – The maximum variance of the layers’ inputs multiplied by inverted 8-bit quantization noise.

  • MEAN_ACTIVATION_MAGNITUDE – The mean magnitude of the layers’ inputs multiplied by inverted 8-bit quantization noise.

class nncf.TargetDevice[source]#

Bases: StrEnum

Target device architecture for compression.

Compression will take into account the value of this parameter in order to obtain the best performance for this type of device.

class nncf.QuantizationPreset[source]#

Bases: nncf.parameters.StrEnum

An enum with values corresponding to the available quantization presets.

nncf.compress_weights(model, mode=CompressWeightsMode.INT8_ASYM, ratio=None, group_size=None, ignored_scope=None, all_layers=None, dataset=None, sensitivity_metric=None, *, subset_size=128, awq=None)[source]#

Compress model weights.

Parameters:
  • model (nncf.api.compression.TModel) – A model to be compressed.

  • mode – Defines a mode for weight compression. INT8_SYM stands for 8-bit integer symmetric quantization of all weights. INT8_ASYM is the same as INT8_SYM mode, but weights are quantized to the primary precision asymmetrically with a typical non-fixed zero point. INT4_SYM stands for mixed-precision weight quantization with 4-bit integer as the primary precision; weights are quantized to the primary precision symmetrically with a fixed zero point equal to 8. All embeddings and the last layer are always compressed to a backup precision, which is INT8_ASYM by default; all other layers are quantized either to 4-bit integer or to the backup precision, depending on the criteria and the given ratio. INT4_ASYM is the same as INT4_SYM mode, but weights are quantized to the primary precision asymmetrically with a typical non-fixed zero point. NF4 is the same as INT4_SYM mode, but the primary precision is the NF4 data type without a zero point.

  • ratio (Optional[float]) – The ratio between baseline and backup precisions (e.g., 0.9 means 90% of layers are quantized to NF4 and the rest to INT8_ASYM).

  • group_size (Optional[int]) – Number of weights (e.g., 128) in the channel dimension that share quantization parameters (scale). The value -1 means no grouping.

  • ignored_scope (Optional[nncf.scopes.IgnoredScope]) – An ignored scope that defines the list of model control flow graph nodes to be ignored during quantization.

  • all_layers (Optional[bool]) – Indicates whether embeddings and last MatMul layers should be compressed to the primary precision. By default, the backup precision is assigned to the embeddings and last MatMul layers.

  • dataset (Optional[nncf.data.Dataset]) – Dataset used for assigning different quantization precision by finding outliers in activations.

  • sensitivity_metric (Optional[nncf.parameters.SensitivityMetric]) – The sensitivity metric for assigning quantization precision to layers. To preserve the accuracy of the model, the more sensitive layers receive a higher precision.

  • subset_size (Optional[int]) – Number of data samples to calculate activation statistics used for assigning different quantization precision. Defaults to 128.

  • awq (Optional[bool]) – Indicates whether to use AWQ weight correction.

Returns:

The non-trainable model with compressed weights.

Return type:

nncf.api.compression.TModel
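
Example (a minimal, hedged sketch for an OpenVINO model; the "model.xml" path is a hypothetical placeholder):

import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")  # hypothetical IR path

# Mixed-precision 4-bit compression: about 90% of the layers go to INT4_SYM
# in groups of 128 weights; the rest, plus embeddings and the last layer,
# stay in the INT8_ASYM backup precision.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.9,
    group_size=128,
)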

nncf.quantize(model, calibration_dataset, mode=None, preset=None, target_device=TargetDevice.ANY, subset_size=300, fast_bias_correction=True, model_type=None, ignored_scope=None, advanced_parameters=None)[source]#

Applies post-training quantization to the provided model.

Parameters:
  • model (TModel) – A model to be quantized.

  • calibration_dataset (nncf.Dataset) – A representative dataset for the calibration process.

  • mode (Optional[nncf.QuantizationMode]) – A special quantization mode that specifies a different way of optimization.

  • preset (nncf.QuantizationPreset) – A preset that controls the quantization mode (symmetric and asymmetric). It can take the following values: performance (symmetric quantization of weights and activations) and mixed (symmetric quantization of weights and asymmetric quantization of activations). The default value is None; in this case, the mixed preset is used for the transformer model type, and performance otherwise.

  • target_device (nncf.TargetDevice) – A target device whose characteristics will be taken into account while compressing, in order to obtain the best performance for this type of device.

  • subset_size (int) – Size of a subset used to calculate activation statistics for quantization. Must be positive.

  • fast_bias_correction (bool) – Setting this option to False enables a different bias correction method which is more accurate, in general, and takes more time but requires less memory.

  • model_type (Optional[nncf.ModelType]) – Model type, needed to specify additional patterns in the model. Currently, only transformer is supported.

  • ignored_scope (Optional[nncf.IgnoredScope]) – An ignored scope that defines the list of model control flow graph nodes to be ignored during quantization.

  • advanced_parameters (Optional[nncf.quantization.advanced_parameters.AdvancedQuantizationParameters]) – Advanced quantization parameters for fine-tuning the quantization algorithm.

Returns:

The quantized model.

Return type:

TModel
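
Example (a minimal, hedged sketch; the "model.xml" path is a hypothetical placeholder, and the random samples stand in for representative calibration data):

import numpy as np
import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")  # hypothetical IR path

# 300 random NCHW samples as stand-in calibration data; use real data in practice.
data_source = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)]
calibration_dataset = nncf.Dataset(data_source)

quantized_model = nncf.quantize(model, calibration_dataset, subset_size=300)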

nncf.quantize_with_accuracy_control(model, calibration_dataset, validation_dataset, validation_fn, max_drop=0.01, drop_type=DropType.ABSOLUTE, preset=None, target_device=TargetDevice.ANY, subset_size=300, fast_bias_correction=True, model_type=None, ignored_scope=None, advanced_quantization_parameters=None, advanced_accuracy_restorer_parameters=None)[source]#

Applies the post-training quantization algorithm with accuracy control to the provided model.

Parameters:
  • model (TModel) – A model to be quantized.

  • calibration_dataset (nncf.Dataset) – A representative dataset for the calibration process.

  • validation_dataset (nncf.Dataset) – A dataset for the validation process.

  • validation_fn (Callable[[Any, Iterable[Any]], float]) –

    A validation function to validate the model. It should take two arguments:

    - model: the model to be validated.

    - validation_dataset: the dataset that provides data items to validate the provided model.

    The function should return the value of the metric with the following meaning: a higher value corresponds to better performance of the model.

  • max_drop (float) – The maximum accuracy drop allowed after quantization.

  • drop_type (nncf.parameters.DropType) – The accuracy drop type, which determines how the maximum accuracy drop between the original model and the compressed model is calculated.

  • preset (nncf.QuantizationPreset) – A preset that controls the quantization mode (symmetric and asymmetric). It can take the following values: performance (symmetric quantization of weights and activations) and mixed (symmetric quantization of weights and asymmetric quantization of activations). The default value is None; in this case, the mixed preset is used for the transformer model type, and performance otherwise.

  • target_device (nncf.TargetDevice) – A target device whose characteristics will be taken into account while compressing, in order to obtain the best performance for this type of device.

  • subset_size (int) – Size of a subset used to calculate activation statistics for quantization.

  • fast_bias_correction (bool) – Setting this option to False enables a different bias correction method which is more accurate, in general, and takes more time but requires less memory.

  • model_type (nncf.ModelType) – Model type, needed to specify additional patterns in the model. Currently, only transformer is supported.

  • ignored_scope (nncf.IgnoredScope) – An ignored scope that defines the list of model control flow graph nodes to be ignored during quantization.

  • advanced_quantization_parameters (Optional[nncf.quantization.advanced_parameters.AdvancedQuantizationParameters]) – Advanced quantization parameters for fine-tuning the quantization algorithm.

  • advanced_accuracy_restorer_parameters (Optional[AdvancedAccuracyRestorerParameters]) – Advanced parameters for fine-tuning the accuracy restorer algorithm.

Returns:

The quantized model.

Return type:

TModel
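
Example (a minimal, hedged sketch; the model path and data are hypothetical placeholders, and validate_fn returns a dummy metric that a real setup would replace with, e.g., top-1 accuracy):

import numpy as np
import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")  # hypothetical IR path

calibration_data = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)]
validation_data = list(calibration_data)  # placeholder; use a held-out validation set

def validate_fn(model, validation_dataset):
    # Placeholder metric: a higher value must mean better model performance.
    return 0.0

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=nncf.Dataset(calibration_data),
    validation_dataset=nncf.Dataset(validation_data),
    validation_fn=validate_fn,
    max_drop=0.01,
    drop_type=nncf.DropType.ABSOLUTE,
)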

class nncf.OverflowFix[source]#

Bases: nncf.parameters.StrEnum

This option controls whether to apply the overflow issue fix for the 8-bit quantization.

8-bit instructions of older Intel CPU generations (based on the SSE, AVX-2, and AVX-512 instruction sets) suffer from the so-called saturation (overflow) issue: in some configurations, the output does not fit into an intermediate buffer and has to be clamped. This can lead to an accuracy drop on the aforementioned architectures. The fix restricts specific operations to only half of the quantization range to avoid overflow.

If you are going to infer the quantized model on architectures with the AVX-2 and AVX-512 instruction sets, we recommend using the FIRST_LAYER option as a less aggressive fix for the overflow issue. If you still face a significant accuracy drop, try using ENABLE, but this may worsen the accuracy.

Parameters:
  • ENABLE – All weights of all types of Convolution and MatMul operations are quantized using half of the 8-bit quantization range.

  • FIRST_LAYER – Weights of the first Convolution of each model input are quantized using half of the 8-bit quantization range.

  • DISABLE – All weights are quantized using the full 8-bit quantization range.
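
Example (a hedged sketch; that AdvancedQuantizationParameters is importable as nncf.AdvancedQuantizationParameters is an assumption based on the quantize signature above, and model and calibration_dataset are hypothetical placeholders):

import nncf

advanced_parameters = nncf.AdvancedQuantizationParameters(
    overflow_fix=nncf.OverflowFix.FIRST_LAYER,  # less aggressive variant of the fix
)
quantized_model = nncf.quantize(
    model,                # hypothetical model object
    calibration_dataset,  # hypothetical nncf.Dataset
    advanced_parameters=advanced_parameters,
)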

class nncf.IgnoredScope[source]#

Provides an option to specify portions of model to be excluded from compression.

The ignored scope defines model sub-graphs that should be excluded from compression processes such as quantization, pruning, etc.

Example:

import nncf

# Exclude by node name:
node_names = ['node_1', 'node_2', 'node_3']
ignored_scope = nncf.IgnoredScope(names=node_names)

# Exclude using regular expressions:
patterns = [r'node_\d']
ignored_scope = nncf.IgnoredScope(patterns=patterns)

# Exclude by operation type:

# OpenVINO opset https://docs.openvino.ai/latest/openvino_docs_ops_opset.html
operation_types = ['Multiply', 'GroupConvolution', 'Interpolate']
ignored_scope = nncf.IgnoredScope(types=operation_types)

# ONNX opset https://github.com/onnx/onnx/blob/main/docs/Operators.md
operation_types = ['Mul', 'Conv', 'Resize']
ignored_scope = nncf.IgnoredScope(types=operation_types)

Note: Operation types must be specified according to the model framework.

Parameters:
  • names (List[str]) – List of ignored node names.

  • patterns (List[str]) – List of regular expressions that define patterns for names of ignored nodes.

  • types (List[str]) – List of ignored operation types.

  • subgraphs (List[Subgraph]) – List of ignored subgraphs.

  • validate (bool) – If set to True, a RuntimeError will be raised if any ignored scope does not match in the model graph.

class nncf.Subgraph[source]#

Defines the ignored subgraph as follows: A subgraph comprises all nodes along all simple paths in the graph from input to output nodes.

Parameters:
  • inputs (List[str]) – Input node names.

  • outputs (List[str]) – Output node names.
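
Example (a minimal sketch; the node names are hypothetical):

import nncf

# Every node lying on a simple path from 'node_in' to 'node_out' is ignored.
subgraph = nncf.Subgraph(inputs=['node_in'], outputs=['node_out'])
ignored_scope = nncf.IgnoredScope(subgraphs=[subgraph])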