skmultiflow.transform.MissingValuesCleaner

class skmultiflow.transform.MissingValuesCleaner(missing_value=nan, strategy='zero', window_size=200, new_value=1)

Fills missing values with some defined value.

Provides a simple way to replace missing values in data samples. The replacement value is determined by the chosen imputation strategy.

Parameters
  • missing_value (int, float or list (Default: numpy.nan)) – Missing value to replace

  • strategy (string (Default: 'zero')) – The strategy adopted to find the missing value replacement. It can be one of the following: ‘zero’, ‘mean’, ‘median’, ‘mode’, ‘custom’.

  • window_size (int (Default: 200)) – Defines the window size for the ‘mean’, ‘median’ and ‘mode’ strategies.

  • new_value (int (Default: 1)) – This is the replacement value in case the chosen strategy is ‘custom’.
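The five strategies can be illustrated with a short standard-library sketch. The helper `pick_replacement` and its arguments are hypothetical names introduced here for illustration. This is not the library's implementation, only one plausible reading of the parameter descriptions above, assuming the window holds the most recent values of a single feature.

```python
from collections import deque
from statistics import mean, median, mode


def pick_replacement(strategy, window, new_value=1):
    """Return a replacement value for one feature (illustrative only)."""
    if strategy == 'zero':
        return 0
    if strategy == 'custom':
        return new_value
    if strategy == 'mean':
        return mean(window)
    if strategy == 'median':
        return median(window)
    if strategy == 'mode':
        return mode(window)
    raise ValueError("unknown strategy: " + repr(strategy))


# A window of size 5 keeps only the 5 most recent observations,
# so the oldest value (9) is dropped.
window = deque([9, 1, 2, 2, 3, 10], maxlen=5)

replacements = {s: pick_replacement(s, window, new_value=-1)
                for s in ('zero', 'mean', 'median', 'mode', 'custom')}
```

In this sketch the outlier 10 pulls the mean well above the median and mode, which is the kind of trade-off to weigh when picking a strategy.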

Examples

>>> # Imports
>>> import numpy as np
>>> from skmultiflow.data.file_stream import FileStream
>>> from skmultiflow.transform.missing_values_cleaner import MissingValuesCleaner
>>> # Setting up a stream
>>> stream = FileStream('skmultiflow/data/datasets/covtype.csv', -1, 1)
>>> stream.prepare_for_use()
>>> # Setting up the filter to replace the value -47 with the median of the
>>> # last 10 samples
>>> cleaner = MissingValuesCleaner(-47, 'median', 10)
>>> X, y = stream.next_sample(10)
>>> X[9, 0] = -47
>>> # We will use this list to keep track of values
>>> data = []
>>> # Iterate over the first 9 samples, to build a sample window
>>> for i in range(9):
...     X_transf = cleaner.partial_fit_transform([X[i].tolist()])
...     data.append(X_transf[0][0])
...
>>> # Transform last sample. The first feature should be replaced by the list's
>>> # median value
>>> X_transf = cleaner.partial_fit_transform([X[9].tolist()])
>>> np.median(data)

Notes

A missing value in a sample can be coded in many different ways. The most common is numpy’s NaN, which is why it is the default for the missing_value parameter.
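One practical consequence of coding missing values as NaN is that NaN never compares equal to itself, so a plain `==` check does not find it. The standard-library sketch below shows the difference. `is_missing` is a hypothetical helper written for this illustration, not part of the library.

```python
import math

nan = float('nan')

# Equality never matches NaN, so it needs a dedicated test.
assert nan != nan
assert math.isnan(nan)


def is_missing(value, missing_value):
    """True if `value` matches the missing-value marker (NaN-aware)."""
    if isinstance(missing_value, float) and math.isnan(missing_value):
        return isinstance(value, float) and math.isnan(value)
    return value == missing_value


row = [1.0, nan, -47.0]
nan_mask = [is_missing(v, nan) for v in row]        # marker is NaN
sentinel_mask = [is_missing(v, -47) for v in row]   # marker is -47
```

The second mask shows the same check working for an ordinary sentinel such as -47, where plain equality is enough.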

Users should choose the substitution strategy that fits their use case, as each strategy has its pros and cons. The available strategies are: ‘zero’, ‘mean’, ‘median’, ‘mode’, ‘custom’.

Notice that MissingValuesCleaner can actually be used to replace arbitrary values.

__init__(missing_value=nan, strategy='zero', window_size=200, new_value=1)

Initialize self. See help(type(self)) for accurate signature.

Methods

  • __init__([missing_value, strategy, …]) – Initialize self.

  • get_info() – Collects and returns information about the configuration of the estimator.

  • get_params([deep]) – Get parameters for this estimator.

  • partial_fit(X[, y]) – Partially fits the model.

  • partial_fit_transform(X[, y]) – Partially fits the model and then applies the transform to the data.

  • reset() – Resets the estimator to its initial state.

  • set_params(**params) – Set the parameters of this estimator.

  • transform(X) – Transforms the samples in X.

get_info()

Collects and returns information about the configuration of the estimator.

Returns

Configuration of the estimator.

Return type

string

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

partial_fit(X, y=None)

Partially fits the model.

Parameters
  • X (numpy.ndarray of shape (n_samples, n_features)) – The sample or set of samples that should be transformed.

  • y (Array-like) – The true labels.

Returns

self

Return type

MissingValuesCleaner

partial_fit_transform(X, y=None)

Partially fits the model and then applies the transform to the data.

Parameters
  • X (numpy.ndarray of shape (n_samples, n_features)) – The sample or set of samples that should be transformed.

  • y (Array-like) – The true labels.

Returns

The transformed data.

Return type

numpy.ndarray of shape (n_samples, n_features)
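The fit-then-transform pattern on a stream can be sketched with the standard library alone. `SketchMedianCleaner` below is a hypothetical stand-in, not the skmultiflow class. It assumes a median strategy, assumes that every incoming row is stored in the window as it arrives, and assumes that missing entries are skipped when the median is computed. The library may differ on any of these points.

```python
from collections import deque
from statistics import median


class SketchMedianCleaner:
    """Hypothetical stand-in, not the skmultiflow class."""

    def __init__(self, missing_value, window_size):
        self.missing_value = missing_value
        self.window = deque(maxlen=window_size)  # most recent rows

    def partial_fit(self, X):
        # Assumption: remember every incoming row as-is.
        for row in X:
            self.window.append(list(row))
        return self

    def transform(self, X):
        cleaned = []
        for row in X:
            new_row = list(row)
            for j, value in enumerate(new_row):
                if value == self.missing_value:
                    # Median of feature j over the rows in the window,
                    # skipping entries that are themselves missing.
                    seen = [r[j] for r in self.window
                            if r[j] != self.missing_value]
                    # Assumption: fall back to 0 on an empty window.
                    new_row[j] = median(seen) if seen else 0
            cleaned.append(new_row)
        return cleaned

    def partial_fit_transform(self, X):
        return self.partial_fit(X).transform(X)


cleaner = SketchMedianCleaner(missing_value=-47, window_size=10)
for first_feature in [5, 1, 9, 3, 7]:
    cleaner.partial_fit_transform([[first_feature, 0]])

# The sixth sample has a missing first feature.
result = cleaner.partial_fit_transform([[-47, 0]])
```

Here the missing entry is replaced by the median of the five values seen so far for that feature, which mirrors the idea of the example in the Examples section.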

reset()

Resets the estimator to its initial state.

Returns

self

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
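The `<component>__<parameter>` naming can be illustrated with a small sketch of how such a name splits into its parts. `split_param` is a hypothetical helper and `cleaner` is a made-up component name. The sketch only shows the naming convention, not how the estimator applies the values.

```python
def split_param(name):
    """Split '<component>__<parameter>' into its two parts."""
    component, sep, parameter = name.partition('__')
    if not sep:
        # No double underscore: the parameter belongs to the object itself.
        return None, name
    return component, parameter


own = split_param('window_size')
nested = split_param('cleaner__window_size')
```

A name without a double underscore addresses the estimator directly, while a prefixed name is routed to the named component.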

Returns

self

transform(X)

Transforms the samples in X.

Parameters

X (numpy.ndarray of shape (n_samples, n_features)) – The sample or set of samples that should be transformed.