interpret_community.dataset.dataset_wrapper module

Defines a dataset wrapper that supports operations such as summarizing data, taking a subset, and sampling.

class interpret_community.dataset.dataset_wrapper.CustomTimestampFeaturizer(features)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

An estimator for featurizing timestamp columns to numeric data.

Parameters:features (list[str]) – Feature column names.
fit(X)

Fits the CustomTimestampFeaturizer.

Parameters:X (numpy.array or pandas.DataFrame or iml.datatypes.DenseData or scipy.sparse.csr_matrix) – The dataset containing timestamp columns to featurize.
transform(X)

Transforms the timestamp columns to numeric type in the given dataset.

Specifically, extracts the year, month, day, hour, minute, and second, plus the time elapsed since the minimum timestamp in the training dataset.

Parameters:X (numpy.array or pandas.DataFrame or iml.datatypes.DenseData or scipy.sparse.csr_matrix) – The dataset containing timestamp columns to featurize.
Returns:The transformed dataset.
Return type:numpy.array or iml.datatypes.DenseData or scipy.sparse.csr_matrix
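As a rough illustration of the features that transform is described as extracting, the following standalone sketch derives them with the standard library. It does not call CustomTimestampFeaturizer. The helper name featurize_timestamps and the use of seconds as the unit for the time since the minimum timestamp are assumptions made for this example.

```python
from datetime import datetime


def featurize_timestamps(train_timestamps, timestamps):
    """Hypothetical helper: expand each timestamp into numeric features.

    The unit of the last feature (seconds) is an assumption of this sketch.
    """
    # "Fit" step: remember the earliest timestamp seen in the training data.
    min_timestamp = min(train_timestamps)
    rows = []
    for ts in timestamps:
        rows.append([
            ts.year, ts.month, ts.day,
            ts.hour, ts.minute, ts.second,
            # Time elapsed since the minimum training timestamp.
            (ts - min_timestamp).total_seconds(),
        ])
    return rows


train = [datetime(2020, 1, 1, 0, 0, 0), datetime(2020, 1, 2, 12, 30, 15)]
features = featurize_timestamps(train, train)
```

Keeping the minimum from the training data, rather than recomputing it per call, means the elapsed-time feature stays comparable between training and evaluation datasets.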
class interpret_community.dataset.dataset_wrapper.DatasetWrapper(dataset)

Bases: object

A wrapper around a dataset to make dataset operations more uniform across explainers.

Parameters:dataset (numpy.array or pandas.DataFrame or iml.datatypes.DenseData or scipy.sparse.csr_matrix) – A matrix of feature vector examples (# examples x # features) for initializing the explainer.
apply_indexer(column_indexer, bucket_unknown=False)

Indexes categorical string features on the dataset.

Parameters:
  • column_indexer (ColumnTransformer) – The transformation steps to index the given dataset.
  • bucket_unknown (bool) – If True, buckets unknown values into a separate categorical level.
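The idea of indexing string categories, with an optional bucket for unknown values, can be sketched in plain Python. This is only an illustration of the concept and not the ColumnTransformer that apply_indexer expects. The helper names are hypothetical, and raising an error when bucket_unknown is False is an assumption of this sketch rather than documented library behavior.

```python
def fit_string_index(values):
    """Hypothetical helper: map each distinct string to an integer level."""
    return {value: level for level, value in enumerate(sorted(set(values)))}


def apply_string_index(mapping, values, bucket_unknown=False):
    """Replace strings with their integer levels.

    With bucket_unknown=True, unseen values share one extra level.
    Otherwise an unseen value raises KeyError (an assumption of this sketch).
    """
    unknown_level = len(mapping)
    if bucket_unknown:
        return [mapping.get(value, unknown_level) for value in values]
    return [mapping[value] for value in values]


mapping = fit_string_index(["red", "green", "blue", "green"])
indexed = apply_string_index(mapping, ["green", "purple"], bucket_unknown=True)
```

Here "purple" was never seen during fitting, so it falls into the extra level placed after the three known categories.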
apply_one_hot_encoder(one_hot_encoder)

One-hot-encode categorical string features on the dataset.

Parameters:one_hot_encoder (OneHotEncoder) – The transformation steps to one-hot-encode the given dataset.
apply_timestamp_featurizer(timestamp_featurizer)

Apply timestamp featurization on the dataset.

Parameters:timestamp_featurizer (CustomTimestampFeaturizer) – The transformation steps to featurize timestamps in the given dataset.
augment_data(max_num_of_augmentations=inf)

Augment the current dataset.

Parameters:max_num_of_augmentations (int) – Maximum number of times a permuted copy of the dataset is stacked to augment it.
compute_summary(nclusters=10, **kwargs)

Summarizes the dataset if it hasn’t been summarized yet.

dataset

Get the dataset.

Returns:The underlying dataset.
Return type:numpy.array or iml.datatypes.DenseData or scipy.sparse.csr_matrix
get_column_indexes(features, categorical_features)

Get the column indexes for the given column names.

Parameters:
  • features (list[str]) – The full list of existing column names.
  • categorical_features (list[str]) – The list of categorical feature names to get indexes for.
Returns:The list of column indexes.
Return type:list[int]
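The mapping from column names to column indexes can be pictured with a small standalone function. The name column_indexes is a hypothetical stand-in, and returning the indexes in the order of categorical_features is an assumption of this sketch.

```python
def column_indexes(features, categorical_features):
    """Hypothetical stand-in: positions of the categorical names within features."""
    position = {name: index for index, name in enumerate(features)}
    return [position[name] for name in categorical_features]


features = ["age", "city", "income", "country"]
indexes = column_indexes(features, ["country", "city"])
```

In this sketch a name that is missing from features raises KeyError; how the library handles that case is not stated here.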

get_features(features=None, explain_subset=None, **kwargs)

Get the features of the dataset if features is None in the current keyword arguments.

Returns:The features of the dataset if features is currently None in the keyword arguments.
Return type:list
num_features

Get the number of features (columns) on the dataset.

Returns:The number of features (columns) in the dataset.
Return type:int
one_hot_encode(columns)

One-hot-encodes categorical string features on the dataset.

Parameters:columns (list[int]) – Parameter specifying the subset of column indexes that may need to be one-hot-encoded.
Returns:The transformation steps to one-hot-encode the given dataset.
Return type:OneHotEncoder
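For readers unfamiliar with the technique, one-hot encoding of selected columns can be sketched without any library. The helpers below are hypothetical and do not reproduce the OneHotEncoder returned by this method. In particular, expanding each encoded column in place is an assumption of the sketch, and real encoders may order the output columns differently.

```python
def fit_one_hot(rows, columns):
    """Hypothetical helper: collect the sorted categories of each listed column."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def apply_one_hot(rows, categories):
    """Replace each listed column with one 0/1 indicator per category."""
    encoded = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if col in categories:
                # One indicator per known category of this column.
                new_row.extend(1 if value == cat else 0 for cat in categories[col])
            else:
                # Columns that were not listed pass through unchanged.
                new_row.append(value)
        encoded.append(new_row)
    return encoded


rows = [[1.5, "a"], [2.0, "b"], [0.5, "a"]]
categories = fit_one_hot(rows, columns=[1])
encoded = apply_one_hot(rows, categories)
```

The columns argument plays the same role as in the signature above: only the listed column indexes are considered for encoding.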
original_dataset

Get the original dataset prior to performing any operations.

Note: if the original dataset was a pandas dataframe, this will return the numpy version.

Returns:The original dataset.
Return type:numpy.array or iml.datatypes.DenseData or scipy.sparse matrix
original_dataset_with_type

Get the original dataset in its original type, which could be a numpy array, pandas DataFrame, or pandas Series.

Returns:The original dataset.
Return type:numpy.array or pandas.DataFrame or pandas.Series or iml.datatypes.DenseData or scipy.sparse matrix
reset_index()

Reset index to be part of the features on the dataset.

sample(max_dim_clustering=50, sampling_method='hdbscan')

Sample the examples.

First randomly downsamples the data to upper_bound rows, then tries to find the optimal downsample size based on how many clusters can be constructed from the data. If sampling_method is hdbscan, uses hdbscan to cluster the data and then downsamples to that number of clusters. If sampling_method is kmeans, tries different values of k, halving k each time, and uses the k with the highest silhouette score to determine how much to downsample the data. Random downsampling alone risks downsampling too much or too little, so the clustering approach serves as a heuristic for estimating how far to downsample.

Parameters:
  • max_dim_clustering (int) – Dimensionality threshold for performing reduction.
  • sampling_method (str) – Method to use for sampling, can be ‘hdbscan’ or ‘kmeans’.
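The kmeans strategy described above can be sketched as a search over halved values of k. In this standalone example, the names halving_candidates and choose_k, the lower bound min_k, and the score_fn callback (standing in for the silhouette score of a clustering with k clusters) are all assumptions and not the library's internals.

```python
def halving_candidates(start_k, min_k=2):
    """Hypothetical helper: list k values obtained by repeatedly halving start_k."""
    candidates = []
    k = start_k
    while k >= min_k:
        candidates.append(k)
        k //= 2
    return candidates


def choose_k(start_k, score_fn, min_k=2):
    """Return the candidate k with the highest score.

    score_fn is a placeholder for fitting a clustering with k clusters
    and computing its silhouette score.
    """
    return max(halving_candidates(start_k, min_k), key=score_fn)


def toy_score(k):
    # Toy stand-in for a silhouette score that peaks at k = 12.
    return -abs(k - 12)


best_k = choose_k(100, toy_score)
```

Halving keeps the number of clusterings that must be fitted logarithmic in the starting k, at the cost of only examining a coarse grid of candidate values.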
set_index()

Undo reset_index: convert the index, currently stored as a feature on the internal dataset, back into an index.

string_index(columns=None)

Indexes categorical string features on the dataset.

Parameters:columns (list) – Optional parameter specifying the subset of columns that may need to be string indexed.
Returns:The transformation steps to index the given dataset.
Return type:ColumnTransformer
summary_dataset

Get the summary dataset without any subsetting.

Returns:The summary dataset, or None if the summary was not computed.
Return type:numpy.array or iml.datatypes.DenseData or scipy.sparse.csr_matrix
take_subset(explain_subset)

Take a subset of the dataset if not done before.

Parameters:explain_subset (list) – A list of column indexes to take from the original dataset.
timestamp_featurizer()

Featurizes the timestamp columns.

Returns:The transformation steps to featurize the timestamp columns.
Return type:CustomTimestampFeaturizer
typed_dataset

Get the dataset in the original type, pandas DataFrame or Series.

Returns:The underlying dataset.
Return type:numpy.array or pandas.DataFrame or pandas.Series or iml.datatypes.DenseData or scipy.sparse matrix
typed_wrapper_func(dataset, keep_index_as_feature=False)

Get a wrapper function to convert the dataset to the original type, pandas DataFrame or Series.

Parameters:
  • dataset (numpy.array or scipy.sparse.csr_matrix) – The dataset to convert to original type.
  • keep_index_as_feature (bool) – Whether to keep the index as a feature when converting back. Defaults to False, which converts it back to an index.
Returns:A wrapper function for a given dataset to convert to original type.
Return type:function that outputs the original type