trustyai.explainers.BackgroundGenerator

class trustyai.explainers.BackgroundGenerator(datapoints: ndarray | DataFrame | List[PredictionInput], feature_domains=None, seed=0)

Generate a background for the SHAP explainer

Generate a background for the SHAP explainer via one of three algorithms:

  • sample: Randomly sample a set of provided points

  • kmeans: Summarize a set of provided points into k centroids

  • counterfactual: Generate a set of background points that meet certain criteria

__init__(datapoints: ndarray | DataFrame | List[PredictionInput], feature_domains=None, seed=0)

Initialize the BackgroundGenerator.

Parameters:
datapoints: numpy.ndarray, pandas.DataFrame, or List[PredictionInput]

The set of datapoints to be used to sample/generate the background, as a:

  • Numpy array of shape [n_rows, n_features]

  • Pandas DataFrame with n_rows rows and n_features columns

  • A list of TrustyAI PredictionInput

seed: int

The random seed to use in the sampling/generation method

Methods

__init__(datapoints[, feature_domains, seed])

Initialize the BackgroundGenerator.

counterfactual(goals, model[, k_per_goal])

Generate a background via the CounterfactualExplainer.

kmeans([k])

Use k-means clustering over datapoints and return k centroids as the background data set.

sample([k])

Randomly sample datapoints.

counterfactual(goals: ndarray | DataFrame | List[PredictionOutput], model: PredictionProvider, k_per_goal=100, **kwargs)

Generate a background via the CounterfactualExplainer. This lets you specify the exact output values that the background dataset should conform to, and thus set the reference point against which all SHAP values are compared. For example, if your model is a regression model, choosing a counterfactual goal of 0 will create a background dataset where :math:`f(x) \approx 0 \; \forall x \in \text{background}`. The SHAP values will then compare against zero, which is a useful baseline for regression.

Parameters:
goals: numpy.ndarray, pandas.DataFrame, or List[PredictionOutput]

The set of goal outputs that the generated background datapoints should produce, as a:

  • Numpy array of shape [n_rows, n_outputs]

  • Pandas DataFrame with n_rows rows and n_outputs columns

  • A list of TrustyAI PredictionOutput

model: PredictionProvider

The TrustyAI PredictionProvider, as generated by Model

k_per_goal: int

The number of background datapoints to generate per goal.

Keyword Arguments:
  • k_seeds: int

    (default= 5) For each goal, a number of points from datapoints are used as starting seeds for the search. These are the k_seeds points within datapoints whose corresponding outputs are closest to the goal output. Choose a larger number to get a more diverse background dataset, but note that the search might then require larger max_attempt_count, step_count, and timeout_seconds to get good results.

  • goal_threshold: float

    (default= .01) The distance (percentage) threshold defining whether a particular output satisfies the goal. Set to 0 to require an exact match, but this will likely require larger max_attempt_count, step_count, and timeout_seconds to get good results.

  • chain: boolean

    (default= False) If chain is set to True, found counterfactual datapoints will be added to the search seeds for subsequent searches. This is useful when a range of counterfactual outputs is desired. For example, if the desired goals are [0, 1, 2, 3], the goal that lies closest to the output of any point within datapoints will be searched for first. The counterfactuals found by that search are then included in the search for the second-closest goal, and so on. This is especially helpful if the extremes of the goal range are far outside the range produced by the datapoints. If only

  • max_attempt_count: int

    (default= 5) If no valid counterfactual can be found for a starting seed in the search, the point is slightly perturbed and the search is retried. This parameter sets the maximum number of perturbation-retry cycles allowed during generation.

  • step_count: int

    (default= 5_000) The number of datapoints to evaluate during the search

  • timeout_seconds: int

    (default= 3) The maximum number of seconds allowed for each counterfactual search. This will set the maximum runtime of the search to roughly timeout_seconds * max_attempt_count * k_per_goal * len(goals)
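The runtime bound quoted for timeout_seconds can be worked through with a few lines of arithmetic. The helper name below is hypothetical and not part of TrustyAI; it only multiplies out the formula given above, using the documented defaults.

```python
def max_search_runtime_seconds(n_goals, k_per_goal=100, max_attempt_count=5,
                               timeout_seconds=3):
    """Rough upper bound on total search time (hypothetical helper, not a TrustyAI API).

    Implements the formula from the timeout_seconds description:
    timeout_seconds * max_attempt_count * k_per_goal * len(goals)
    """
    return timeout_seconds * max_attempt_count * k_per_goal * n_goals


goals = [0, 1, 2, 3]

# With every default left in place, four goals allow up to 6000 seconds.
default_budget = max_search_runtime_seconds(len(goals))

# Lowering k_per_goal shrinks the bound proportionally.
small_budget = max_search_runtime_seconds(len(goals), k_per_goal=10)

print(default_budget, small_budget)
```

This is a worst-case figure: a search that finds valid counterfactuals quickly finishes well before the bound.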

Returns:
List[PredictionInput]

The background dataset to pass to the SHAPExplainer
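The seed selection behind k_seeds and the goal ordering behind chain can be pictured with a small pure-Python sketch. It is not TrustyAI's implementation: the function names are hypothetical, Euclidean distance between outputs is an assumed metric, and the step where found counterfactuals are added back into the seed pool is left out.

```python
import math


def output_distance(output, goal):
    """Euclidean distance between two output vectors (assumed metric)."""
    return math.dist(output, goal)


def select_seeds(datapoints, outputs, goal, k_seeds=5):
    """Pick the k_seeds datapoints whose outputs lie closest to the goal."""
    ranked = sorted(range(len(datapoints)),
                    key=lambda i: output_distance(outputs[i], goal))
    return [datapoints[i] for i in ranked[:k_seeds]]


def order_goals_for_chaining(outputs, goals):
    """Order goals so the one nearest to any existing output comes first."""
    return sorted(goals,
                  key=lambda g: min(output_distance(o, g) for o in outputs))


# Toy stand-in for a model: f(x) = x0 + x1, evaluated on five datapoints.
datapoints = [[0.0, 1.0], [2.0, 2.0], [1.0, 0.5], [3.0, 3.0], [0.5, 0.5]]
outputs = [[sum(x)] for x in datapoints]

seeds = select_seeds(datapoints, outputs, goal=[0.0], k_seeds=2)
search_order = order_goals_for_chaining(outputs, goals=[[10.0], [0.5], [5.0]])

print(seeds)
print(search_order)
```

In the toy data no datapoint produces an output near 10, so that goal is searched last. With chaining, it would benefit from the counterfactuals already found for the nearer goals.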

kmeans(k=100)

Use k-means clustering over datapoints and return k centroids as the background data set.

Parameters:
k: int

The number of centroids to find

Returns:
List[PredictionInput]

The background dataset to pass to the SHAPExplainer
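As a rough illustration of summarizing datapoints into k centroids, the sketch below runs a minimal Lloyd-style k-means over plain Python lists. It is an assumption-laden stand-in: the function name is hypothetical, and the clustering routine TrustyAI uses internally may differ in initialization and convergence criteria.

```python
import random


def kmeans_centroids(points, k, seed=0, iterations=50):
    """Minimal k-means (hypothetical helper, not the TrustyAI implementation)."""
    rng = random.Random(seed)
    # Initialize the centroids from k distinct datapoints.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable, so the centroids have converged
        centroids = updated
    return centroids


points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
          [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]
background = kmeans_centroids(points, k=2)
print(sorted(background))
```

The centroids are cluster means, so unlike sampled points they need not coincide with any row of the original datapoints.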

sample(k=100)

Randomly sample datapoints.

Parameters:
k: int

The number of datapoints to select

Returns:
List[PredictionInput]

The background dataset to pass to the SHAPExplainer
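The effect of the seed argument on sampling can be sketched with the standard library alone. The helper below is hypothetical and rests on two assumptions that this reference does not confirm: rows are drawn without replacement, and k is capped at the number of available rows.

```python
import random


def sample_background(datapoints, k=100, seed=0):
    """Seeded random sample of rows (hypothetical helper, not a TrustyAI API)."""
    rng = random.Random(seed)
    # Assumption: never request more rows than exist.
    k = min(k, len(datapoints))
    # Assumption: sample without replacement.
    return rng.sample(datapoints, k)


datapoints = [[float(i), float(i) * 2.0] for i in range(10)]

first = sample_background(datapoints, k=3, seed=0)
second = sample_background(datapoints, k=3, seed=0)
capped = sample_background(datapoints, k=100, seed=0)

print(first == second)  # the same seed reproduces the same background
print(len(capped))      # capped at the 10 available rows
```

Fixing the seed keeps the background, and therefore the resulting SHAP baseline, reproducible across runs.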