1 Introduction
Model interpretability is essential to both the development and deployment of safe and robust models. In the development phase, interpretability methods function as debugging tools, verifying whether machine learning models consider the input features that matter most to humans [Ross2017eo; Holzinger2017re; Weller2017ap]. In the deployment phase, interpretability methods check model performance on the fly by showing the user which part of the image the model focuses on [Murdoch2019qv; Selvaraju2017ne].
The most widely used interpretability methods, saliency maps, explain models at the image level by attributing model predictions to input features, usually pixels [Kim2017yi; Zhang2019wq; Adel2018ds]. In contrast, semantic interpretability methods explain models by quantifying concept sensitivity, i.e. the effect of altering the input with respect to one concept – e.g. color, age, gender, or texture – while holding constant other aspects of the image. For example, in high-stakes applications such as medical imaging, it is essential to understand whether the model uses reliable semantic concepts (e.g. object shape) or instead relies on the presence of confounding noise (e.g. an image corruption) [Zech2018ad; Zech2018qg]. Semantic interpretability methods can uncover bias at the dataset level during model development as well as at the image level during deployment.
In this paper, we propose a state-of-the-art method for quantifying concept sensitivity, and introduce the first realistic benchmarks for semantic interpretability. We consider two use cases of semantic interpretability: (1) image-level quantification of concept sensitivity, e.g. the effect of increasing red levels in an image, and (2) dataset-level quantification of model concept bias, i.e. the aggregation of image-level concept sensitivity across a dataset. We also identify a gap in the literature: interpretability benchmarks are too artificial and disconnected from these use cases.
To be useful in practice, a semantic interpretability method must work with existing convolutional neural network (CNN) architectures, minimize added computational and data overhead, and yield quantitative results. To this end, Kim et al. introduced concept activation vectors (CAVs), which linearly approximate a model's latent space concept encoding [Kim2017yi]. Their method, Testing with CAVs (TCAV), quantifies dataset-level concept sensitivity by taking the inner product of a CAV with model output gradients. The proposed Robust Concept Activation Vector (RCAV) method builds on TCAV: whereas TCAV was restricted to linear interactions between concept and model, we generalize to allow for nonlinear effects of concepts on model predictions.

Our main contributions are as follows:

RCAV accurately quantifies image-level concept sensitivity both for semantically meaningful concepts and for confounding, artefactual concepts, whereas TCAV was restricted to the dataset level.

We highlight the false positive identification of irrelevant concepts as a shortcoming of TCAV. RCAV uses improved hypothesis testing to significantly reduce the false positive rate.

We introduce two benchmark datasets for measuring the accuracy of semantic interpretability methods in identifying concept sensitivity.
2 Related Work
Motivating Semantic Interpretability
It is often important to ensure that models have not inadvertently learned to predict using protected attributes (e.g. race or gender) or artefactual noise (e.g. camera blur or JPEG artefacts). Research on fairness in machine learning seeks to identify and avoid the learning of protected attributes [Jiang2019xl; Donini2018mi; Kusner2017db]. Previous work has relied on measuring correlation between these attributes and model predictions, but correlation is a limited proxy for the causal effect of protected attributes on prediction [Chiappa2018ua]. We suggest that semantic interpretability may prove a better metric for fairness research. Similarly, methods proposed for debiasing model predictions may evaluate success using semantic interpretability methods [Li2019ce; Kim2019pe]. In the field of medical imaging, device-specific confounders compromise model generalization [Zech2018ad; Zech2018qg]; these confounders may be treated as concepts and quantified using semantic interpretability methods.
Semantic Interpretability Methods
Our paper builds on Testing with Concept Activation Vectors (TCAV) [Kim2017yi; Zhou2018eu]. TCAV pioneered concept activation vectors for dataset-level quantification of concept sensitivity, but it does not support image-level concept sensitivity quantification. Adel et al. apply normalizing flows to semantic interpretability, which offers high-accuracy interpretations at the cost of requiring large training sets and expensive computational overhead [Adel2018ds]. Another approach to semantic interpretability necessitates redefining the model architecture to support interpretability, e.g. by replacing certain layers with random forests [Zhang2019wq; Agarwal2020jt]. Architecture modification is incompatible with popular CNN architectures and so does not see widespread adoption.

Benchmarking Interpretability Methods
To the best of our knowledge, no benchmarks have been published for the comparison of semantic interpretability methods. Given the lack of consensus around benchmarking interpretability, recent methods have proposed retraining with feature ablation, among other protocols [Hooker2019qg; Yang2019rd]. Recent research has also tested the robustness of interpretability methods to adversarial attacks, model randomization, and input perturbation [Heo2019nw; Adebayo2018fm; Samek2017dw]. These existing benchmarks are not directly relevant to semantic interpretability use cases.
[Figure: Textured Fashion MNIST provides a ground truth for semantic interpretability methods.]
3 Methods
Overview
Semantic interpretability faces two problems: first, find a function that translates from the model's neural activations to semantic concepts; second, once we know how the model encodes a semantic concept, quantify how this concept affects model predictions – i.e. concept sensitivity. Concept activation vectors (CAVs) address the first problem by treating a trained model as a feature extractor. Using the activations of an intermediate model layer, we train a logistic regression to identify concept samples. For example, we train a vector for each of the concepts in the set of colors red, green, blue, and yellow; the textures stripe, dot, and zigzag; or the image properties high-contrast and low-contrast. The weight vector of the logistic regression is the model's representation of the concept. To evaluate how the concept affects prediction, we perturb the model's representation of inputs in the direction of the concept; no updates to the model weights are needed.
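As a concrete illustration of the first step, the following NumPy sketch fits a small logistic regression on intermediate-layer activations and takes its unit-normalized weight vector as the CAV. The `train_cav` helper, the toy data, and the shapes are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def train_cav(acts_concept, acts_other, lr=0.1, steps=500):
    """Fit a tiny logistic regression (plain NumPy, gradient descent)
    separating concept activations from non-concept activations.
    The resulting weight vector, unit-normalized, plays the role of a CAV.
    acts_* are (n_samples, n_features) arrays of flattened layer activations."""
    X = np.vstack([acts_concept, acts_other])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_other))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)        # gradient step on weights
        b -= lr * np.mean(p - y)                  # gradient step on bias
    return w / np.linalg.norm(w)                  # unit-norm CAV

# Toy example: concept activations are shifted along a known direction.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])
acts_other = rng.normal(size=(100, 4))
acts_concept = rng.normal(size=(100, 4)) + 3 * direction
cav = train_cav(acts_concept, acts_other)
print(abs(cav @ direction) > 0.9)  # the CAV recovers the planted direction
```

In practice the activations would come from a forward hook on the chosen layer of the trained CNN rather than synthetic Gaussians.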
Notation
We denote the trained model by $f$, the user-defined concept by $C$, and the validation set data by $X$. We denote the classes, i.e. labels, of the model's original dataset by $k \in K$. We denote the forward pass of the model up to and including layer $l$ by $f_{:l}$, the forward pass from the output of layer $l$ to the final layer by $f_{l:}$, and the softmax score for class $k$ by $f^k$. We denote a concept activation vector (CAV) by $v_C$.
3.1 Defining Concept Activation Vectors
As in TCAV, a concept activation vector is generated by training a logistic regression on an intermediate layer $l$ to classify a given concept, $C$, relative to the union of the other concepts. We use 100–300 samples per concept, which may be drawn from the validation set or an auxiliary dataset, as described in the appendix. The CAV, $v_C$, is then the weight vector of the trained logistic regression. In effect, the CAV is a linear approximation of $f$'s encoding of the concept $C$.

3.2 Quantifying Concept Sensitivity
As shown in Figure 1, we propose quantifying the sensitivity of model $f$ to the concept $C$ by perturbing the latent representation:

$$f'_{:l}(x) = f_{:l}(x) + \alpha v_C \qquad (1)$$

where $\alpha$, the step size, is a hyperparameter of RCAV shown to be insensitive within the range of 1–10 (see appendix). Then the image-level concept sensitivity score, $S_{C,k}(x)$, of model $f$ on input $x$ to concept $C$ with respect to class $k$ is:

$$S_{C,k}(x) = f^k_{l:}\big(f_{:l}(x) + \alpha v_C\big) - f^k(x) \qquad (2)$$
The proposed RCAV sensitivity score accounts for nonlinearity in $f_{l:}$, i.e. the later layers of the model, but does not account for nonlinearity in the initial layers (a limitation of any CAV-based method). The score proposed in TCAV relies on the gradient's linearization of $f_{l:}$, meaning TCAV linearly approximates both the initial and later layers. In subsection 5.1 we observe that this use of nonlinearity enables RCAV to quantify the image-level concept sensitivity of $f$, whereas TCAV cannot.
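The perturb-and-compare computation above can be sketched on a toy model. Here `f_head` and `f_tail` stand in for the two halves of the network split at layer $l$; the names, the identity head, and the linear tail are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def concept_sensitivity(f_head, f_tail, x, cav, alpha, k):
    """Image-level RCAV-style sensitivity sketch: perturb the layer-l
    representation along the CAV and take the softmax difference for
    class k (Eqs. 1-2 in spirit)."""
    h = f_head(x)
    base = softmax(f_tail(h))[k]
    perturbed = softmax(f_tail(h + alpha * cav))[k]
    return perturbed - base

# Toy model: identity head, linear tail with two classes.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
f_head = lambda x: x
f_tail = lambda h: W @ h
x = np.array([0.5, -0.5])
cav = np.array([1.0, 0.0])
s = concept_sensitivity(f_head, f_tail, x, cav, alpha=1.0, k=0)
print(s > 0)  # pushing along class 0's weight direction raises its score
```

With a real CNN, `f_head`/`f_tail` would be obtained by splitting the module list at the chosen layer; the score itself is just a difference of softmax outputs.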
At the dataset level, we are also interested in determining whether the model systematically uses concept $C$ for prediction across all inputs of a fixed class. To that end, we compute a dataset-wide concept sensitivity score as:

$$\mathrm{RCAV}_{C,k} = \frac{1}{|X|} \sum_{x \in X} \mathbb{1}\big[S_{C,k}(x) > 0\big] - \frac{1}{2} \qquad (3)$$

Note that the $-\frac{1}{2}$ term centers values, so that a positive score corresponds to a positive contribution of the concept to model predictions and $\mathrm{RCAV}_{C,k} = 0$ corresponds to concept irrelevance.
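The dataset-level aggregation can be sketched in a few lines, assuming the score is the centered fraction of positive image-level sensitivities described above:

```python
import numpy as np

def dataset_score(sensitivities):
    """Dataset-level score sketch: fraction of inputs with positive
    image-level sensitivity, centered by 1/2 so that 0 indicates
    concept irrelevance (assumed form of the aggregation)."""
    s = np.asarray(sensitivities)
    return (s > 0).mean() - 0.5

print(dataset_score([0.2, 0.1, 0.3, -0.1]))  # 0.25: mostly positive
print(dataset_score([0.2, -0.2]))            # 0.0: balanced, irrelevant
```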
3.3 Hypothesis Testing Robustness
$\mathrm{RCAV}_{C,k}$ quantifies concept sensitivity, but we empirically observe that even random vectors drawn from the target space of $f_{:l}$ yield nonzero concept sensitivity scores. We use hypothesis testing to determine whether the CAVs found by RCAV correspond to meaningful variation within the model's latent representations rather than noise. Whereas TCAV used a t-test to determine CAV significance, we find that t-testing is not robust. We instead propose a permutation test which generates a null distribution, $\{v_i\}_{i=1}^{N}$, of noise vectors by permuting the correspondence between samples and labels for the concept set $C$. Then we compute a p-value:

$$p = \frac{1 + \sum_{i=1}^{N} \mathbb{1}\big[\,|\mathrm{RCAV}_{C,k}(v_i)| \geq |\mathrm{RCAV}_{C,k}(v_C)|\,\big]}{N + 1} \qquad (4)$$

We apply the Bonferroni correction for multiple testing to all resulting p-values. The proposed permutation test achieves a 0% false positive rate on the concepts considered (Table 2), in contrast to the 100% false positive rate observed for the t-test used in TCAV. In subsection 5.3, we explain the need for a permutation test: permutation-generated null vectors are not normally distributed, and the t-test used in TCAV underestimates the variance of the null distribution. To improve the time complexity of the proposed hypothesis test, we stop computing the p-value as soon as a null result is guaranteed. With its significance threshold and number of permutations, RCAV achieves a best-case 250x speedup and a 4x average-case improvement compared to TCAV.

RCAV leaves layer, concept set, and step size as hyperparameters to be determined by the user; these choices may be tuned to minimize the variance of the permutation null set and maximize CAV accuracy. To comprehensively assess RCAV independently of manual layer choice, we select five layers uniformly spaced across model depth. We test on a subset of layers instead of all layers because the worst-case runtime of RCAV scales quadratically with the number of layers tested: since the permutation test is an approximate method, the number of permutations must increase linearly with the number of layers to avoid losing power after multiple testing adjustment. Concept set images are taken from the validation set. For example, the concept red is represented by the 100 validation set images with the highest intensity in the red channel. We discuss the choice of concept set and step size in detail in the appendix. In the experiments below, we use the same hyperparameter settings throughout.
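The permutation p-value and its early-stopping shortcut can be sketched as follows. This is a minimal NumPy sketch: the paper states only that computation halts once a null result is guaranteed, so the exact stopping rule in `early_stop_pvalue` is our assumption:

```python
import numpy as np

def permutation_pvalue(observed, null_scores):
    """Permutation p-value: fraction of null scores at least as extreme
    as the observed score, with the standard +1 correction."""
    null_scores = np.asarray(null_scores)
    hits = np.sum(np.abs(null_scores) >= abs(observed))
    return (hits + 1) / (len(null_scores) + 1)

def early_stop_pvalue(observed, null_iter, n_total, alpha=0.05):
    """Stop drawing permutations as soon as enough null hits accumulate
    that the final p-value must exceed alpha (null result guaranteed)."""
    max_hits = alpha * (n_total + 1) - 1  # beyond this, p > alpha is certain
    hits = 0
    for i, s in enumerate(null_iter, start=1):
        if abs(s) >= abs(observed):
            hits += 1
            if hits > max_hits:
                return None, i  # significance impossible; stop early
        if i >= n_total:
            break
    return (hits + 1) / (n_total + 1), n_total

p = permutation_pvalue(0.4, [0.1, -0.05, 0.2, 0.05, -0.1])
print(p)  # 1/6 ≈ 0.167: no null score is as extreme as 0.4
```

The early exit is what yields the large best-case speedup: clearly insignificant concepts are rejected after only a handful of permutations.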
4 Benchmarking Semantic Interpretability Methods
To evaluate RCAV's performance, we introduce two new datasets that test the accuracy of concept sensitivity measurements. We designed these datasets to come as close as possible to the real-world use of semantic interpretability methods. Existing benchmarks evaluate interpretability methods without taking into account the intended use case. For example, Lage et al. quantify interpretability by measuring how accurately users can reconstruct model predictions from interpretability explanations [Lage2018px]. Although this reconstruction metric is important, it does not tell us how reliably the interpretability method will work in practice.
Interpretability methods are used to determine whether a model relies on robust input features, or instead relies on spurious noise. To determine the ground truth about the relative importance of signal and noise, we need a dataset in which we have two copies of each image: one with the concept and one without. For example, for the image contrast concept, we would run the model on an image with normal contrast levels, and then we would run the model on the same image with increased contrast levels. The difference in prediction after augmentation gives us a ground truth regarding how sensitive the model is to contrast in that image.
4.1 Benchmarking Datasets
Textured Fashion MNIST (TFMNIST)
To evaluate semantic interpretability, we need a dataset in which we can manipulate the presence of concepts as realistically as possible. Previous work superimposed foreground images on backgrounds, but this process results in unrealistic images, such as a backpack floating in a bamboo forest [Yang2019rd]. We build on the Fashion MNIST (FMNIST) dataset [Xiao2017ue] by replacing the surface of clothing items with our concepts. We use textures drawn from a Google Image search to replace the original surfaces, as shown in Figure 2. Importantly, our dataset is fully extensible: users may easily replace these textured concepts with any concept of their choosing – colors, animals, etc.

Using TFMNIST, we may continuously interpolate between two arbitrary textures, creating, for example, shirts with mixtures of striped and spiral patterns. We construct a training set in which all T-shirts are given spiral textures and all non-T-shirts are given zigzag textures; all other classes are given dotted and striped textures. Then, on the validation set, we interpolate between textures to compute the ground truth concept sensitivity. For example, by interpolating spiral T-shirts 10% in the direction of zigzag textures, we can quantify the effect of zigzags on T-shirt predictions. See the top row of panel (b).
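The texture interpolation itself is a simple convex combination of images; a minimal sketch (the helper name and toy arrays are ours):

```python
import numpy as np

def interpolate_textures(tex_a, tex_b, lam):
    """Linear blend of two texture images; lam=0.1 moves tex_a 10%
    toward tex_b, as in the interpolation experiment described above."""
    return (1.0 - lam) * tex_a + lam * tex_b

# Toy textures: an all-zeros "spiral" and an all-ones "zigzag" patch.
spiral = np.zeros((8, 8))
zigzag = np.ones((8, 8))
mixed = interpolate_textures(spiral, zigzag, 0.1)
print(mixed.mean())  # 0.1: 10% of the way toward the zigzag texture
```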
Biased CAMELYON16
For application to medical imaging, we need a dataset in which we can manipulate the presence of confounding artefacts. To simulate device differences, we build on the CAMELYON16 lymph node section histology dataset [Bejnordi2016mn] by augmenting the contrast level of images. Importantly, this dataset is fully extensible: contrast augmentation may be replaced by any other form of corruption – JPEG artefacting, camera tilt, out-of-focus blur, etc.

For our experiments, we increased the contrast of cancerous tissue while leaving non-cancerous tissue at baseline (panel (b)). Then, to provide a ground truth for the model's sensitivity to the contrast concept, we compare model predictions before and after changing the contrast level by 3% – a change nearly imperceptible to the human eye.
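One plausible implementation of the contrast augmentation, scaling pixel deviations from the per-image mean (this is a common definition of contrast adjustment, not necessarily the paper's exact one):

```python
import numpy as np

def adjust_contrast(img, factor):
    """Scale pixel deviations from the per-image mean; factor=1.03
    corresponds to a ~3% contrast increase. Values are clipped to [0, 1]."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

img = np.array([[0.2, 0.8], [0.4, 0.6]])
aug = adjust_contrast(img, 1.03)
print(aug.std() > img.std())  # slightly higher contrast, same mean
```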
ImageNet
We can only rely on the results of a semantic interpretability method if we know that the method will not report high concept sensitivity where there is none. To this end, we use ImageNet to quantify the robustness of semantic interpretability methods to false positives [Deng2009dm]. We build on the experiments proposed in [Kim2017yi] for texture and color concepts. We select negative control classes which appear intuitively unrelated to the concepts: great white shark for texture, and apron for color (in each case, the first unrelated class by ImageNet label order). This protocol for identifying negative control classes may be applied to any multi-class classification task.
Dataset: | Textured FMNIST       | Biased CAMELYON16
Metric:  | P_order | AUROC | AUPRC | P_order | AUROC | AUPRC
RCAV     | 64%     | 0.77  | 0.56  | 92%     | 0.94  | 0.89
TCAV     | 53%     | 0.51  | 0.24  | 45%     | 0.55  | 0.27

Table 1: Interpretability performance metrics. AUROC and AUPRC after applying a 75th-percentile threshold to label-binarize counterfactual augmentation differences.
4.2 Measuring Concept Sensitivity
When evaluating a semantic interpretability method on the proposed datasets, we highlight the need for metrics which align with the intended use case. Interpretability methods are used at test time to decide whether the model is trustworthy: the user decides whether to accept the model's prediction as right for the right reasons, or to reject the model's prediction as unjustified. Given a ground truth for the model's counterfactual sensitivity to a concept, we compute the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The proposed datasets provide a ground truth for concept sensitivity as the difference between the model's predictions before and after input augmentation, denoted $\Delta$. For example, in CAMELYON16, we augment the input images by reducing the contrast level, and then, depending on whether the model prediction delta exceeds a certain threshold, these samples are labelled "trustworthy" or "untrustworthy."

AUROC and AUPRC are typically used to quantify performance for tasks with fixed decision thresholds. However, in some cases the interpretability method will be used on the fly without a fixed decision threshold. For example, a user may be interested in identifying the ImageNet samples in a batch which are most sensitive to the stripes concept. To evaluate accuracy for this use case, we compute the probability, $P_{\text{order}}$, of RCAV correctly ordering the image-level concept sensitivity for each pair of samples. We denote the ground truth concept sensitivity of sample $x_i$ by $\Delta_i$, and the validation set size by $N$:

$$P_{\text{order}} = \frac{1}{\binom{N}{2}} \sum_{i < j} \mathbb{1}\big[\,(S_{C,k}(x_i) - S_{C,k}(x_j))(\Delta_i - \Delta_j) > 0\,\big] \qquad (5)$$

When reporting $P_{\text{order}}$ and Kendall's $\tau$ values, we suggest generating p-values by permutation testing, comparing the observed $P_{\text{order}}$ against permutation-generated values. The above metrics are applicable to other datasets, other semantic interpretability methods, and even saliency maps.
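The pairwise ordering probability is the concordant-pair fraction, a close relative of Kendall's tau; a minimal sketch:

```python
import numpy as np
from itertools import combinations

def ordering_probability(pred, truth):
    """Probability that the method orders each pair of samples the same
    way as the ground-truth sensitivities (concordant-pair fraction)."""
    pairs = list(combinations(range(len(pred)), 2))
    concordant = sum(
        (pred[i] - pred[j]) * (truth[i] - truth[j]) > 0 for i, j in pairs
    )
    return concordant / len(pairs)

pred = [0.1, 0.4, 0.2, 0.9]   # method's predicted sensitivities
truth = [1.0, 2.0, 3.0, 4.0]  # ground-truth sensitivities
print(ordering_probability(pred, truth))  # 5 of 6 pairs concordant
```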
5 Results
The proposed datasets define ground truth concept sensitivity, which we now use to evaluate the accuracy of RCAV. For the FMNIST and CAMELYON16 datasets, we use Inception-v3 [Szegedy2015xz], and for FMNIST in particular, we apply input mixup during training [Zhang2017oh]; input mixup has been shown to locally linearly regularize out-of-distribution predictions [Guo2019of]. For the ImageNet experiments we use GoogLeNet to allow direct comparison against TCAV results [Szegedy2014wg].
5.1 Image-level Concept Sensitivity
By comparing RCAV's predicted concept sensitivity, $S_{C,k}$, to the ground truth concept sensitivity, $\Delta$, Figure 2 and Figure 3 show that RCAV accurately predicts the concept sensitivity of individual model predictions. The correlation between $S_{C,k}$ and $\Delta$ is significant under permutation testing. The RCAV linear fit has an intercept at or near 0, indicating that RCAV concept sensitivity predictions are calibrated: a null sensitivity result in RCAV indicates a ground truth null and vice versa.

Unlike RCAV, TCAV's concept sensitivity correlation with the ground truth is indistinguishable from the null (Figure 3), meaning TCAV explanations perform no better than random. The failure of TCAV to predict observed concept sensitivity suggests that even for the small perturbations considered, the first-order gradient approximation used by TCAV fails.
Table 1 shows that RCAV performs robustly across the choice of metric. For CAMELYON16, RCAV's image-level sensitivity performance is near optimal, but for FMNIST we see a drop. In section 6, we explain the poor FMNIST performance relative to Biased CAMELYON16 by performing a singular value decomposition (SVD) component analysis to place a bound on RCAV performance.
5.2 Dataset-level False Positive and False Negative Robustness
Concept and Class:           | Color for Apron | Texture for Shark
Multiple Testing Adjustment: | Before | After  | Before | After
RCAV                         | 5%     | 0%     | 13%    | 0%
TCAV                         | 100%   | 100%   | 100%   | 100%
To quantify RCAV's robustness to false positives, we select negative control ImageNet classes which appear unrelated to the texture and color concepts. We compute the p-values of $\mathrm{RCAV}_{C,k}$ for these classes; when a p-value passes the significance threshold, we call this a false positive. As would be the case in real-world applications of RCAV, we do not have access to the ground truth regarding model sensitivity to color on aprons and to texture on great white sharks, but we have selected these pairs to maximize the apparent likelihood of non-interaction.²

² Alternatively, we may quantify the false positive rate by applying RCAV to an untrained model, in which case the ground truth is known. However, an untrained model often yields lower CAV accuracy, and so it is unclear whether a low false positive rate for untrained models generalizes to the trained-model setting.

In Table 2, we see that RCAV has a 0% false positive rate for both concepts considered, whereas TCAV predicts a false positive on every layer tested. In subsection 5.3, we clarify which aspects of the RCAV method contribute to this decrease in false positives.
Since RCAV uses more stringent hypothesis testing, we reconstruct two experiments proposed in the TCAV paper to demonstrate that RCAV does not suffer increased false negatives; see Figure 4. Significance testing in TCAV is not meaningful because all results are found to be significant (dark blue bars), whereas RCAV distinguishes the layers in which concepts are most meaningfully encoded from those in which they are not (light gray-blue bars). There is no known ground truth for Figure 4, so it remains open whether, for instance, GoogLeNet uses the concept zigzag for making zebra predictions.
5.3 Ablation Study
In order to clarify the extent to which each aspect of RCAV is responsible for the observed reduction in false positives, we run an ablation study in which we hold constant the sensitivity scoring functions used in RCAV and TCAV – softmax difference and cosine similarity, respectively – while varying the hypothesis test. RCAV improves over TCAV by replacing cosine similarity with softmax difference, and by replacing the t-test with a permutation test. As an intermediate alternative to the t-test and permutation test, we also include a null hypothesis defined by vectors drawn from a unit-norm uniform distribution. The uniform null p-value is calculated using the same formula as the permutation test p-value (Equation 4). We report the raw p-value false positive rate before multiple testing adjustment, because raw p-values give a worst-case upper bound on the observed FPR independent of layer choice.
Table 3 shows that permutation testing is necessary to reduce the false positive rate, but the sensitivity scoring function appears to also help. We suggest that future use of hypothesis testing in interpretability favor permutation testing over parametric tests where possible.

                    | RCAV | TCAV
Permutation test    | 13%  | 60%
Uniform random null | 40%  | 80%
t-test              | 93%  | 100%
6 Conclusion
In this paper, we propose a novel use of CAVs as well as novel benchmarks for the evaluation of semantic interpretability methods. Our RCAV method is designed to be as user-friendly as saliency maps by supporting on-the-fly explanations for individual model predictions. At the dataset level, RCAV quantifies model bias as a single scalar value. We also address a gap in the interpretability benchmarking literature by proposing datasets and metrics for the realistic evaluation of semantic interpretability methods. On these benchmarks, we demonstrate that RCAV improves upon previous work in terms of accuracy, robustness, and runtime.
Limitations
Although we observe results to be robust across the choice of hyperparameters in most cases, on the FMNIST dataset, image-level concept sensitivity is best captured on layers 6a and 6b, with null results on other layers. An SVD analysis of latent space pairwise differences before and after input augmentation showed that these are the layers with the highest explained variance in the first component (see appendix). This SVD analysis implies that for the remaining layers, texture concepts are not encoded uniformly across the validation set, so CAV-based methods are ineffective.
Future Work
Recent research has used CAVs to automatically discover prediction-relevant concepts [Ghorbani2019ki]. RCAV may enhance such methods, serving as a drop-in replacement for TCAV. This paper shows that RCAV improves over TCAV across benchmarks, but in the unlimited-data setting, we expect nonlinear methods to outperform CAV methods. A complete comparison of RCAV to other semantic interpretability methods remains an outstanding direction for future research.
Broader Impact
RCAV is broadly applicable to the real-world deployment of CNNs. We hope that RCAV will prove of use to researchers and engineers as a technique to assist in guaranteeing the robustness and fairness of models. We also see potential applications to scientific discovery: scientific properties can be used to define concept sets, and RCAV can be applied to assess the scientific significance of those properties. For example, it may be possible to train a model on molecular similarity and then assess the importance of molecule properties using RCAV. We hope that the metrics and benchmarks proposed above will encourage greater reproducibility and transparency in the development of semantic interpretability methods. In particular, the proposed false positive metric will ideally help avoid the mistaken application of semantic interpretability methods in high-stakes fields such as medical imaging.
We would like to thank Laura Gunsalus and Garret Gaskins for extensive feedback on a draft of this paper. We would also like to thank Wren Saylor, Will Connell and Kangway Chuang for feedback on an early draft. This work was in part supported by the Helen Diller Family Comprehensive Cancer Center Impact Award and the Melanoma Research Alliance.
Appendix A Choice of Concept Set
For RCAV, a set of representative images defines a concept. For any semantic concept there will be many possible choices of concept set: e.g. we may choose between color patches or colored objects to define a particular color. In practice, it is usually easiest to draw concept set images from the validation set. For instance, to define the concept red, we could select the 100 validation set images with the highest intensity in the red channel. When sampling images for the concept set, it is necessary to maintain class balance – i.e. the number of samples per class for each subconcept must be identical. For the TFMNIST and CAMELYON16 experiments we use validation set images to define the concept sets. In contrast, for the ImageNet experiments, we follow the precedent set in TCAV, using Broden images for textures and Gaussian-noised color patches for colors [Kim2017yi; Bau2017yo].
If there are multiple possible choices of concept set, the optimal choice minimizes the variance of the set of null concept sensitivity scores, $\{\mathrm{RCAV}_{C,k}(v_i)\}$. This optimal choice of concept set minimizes false negatives, i.e. the misidentification of concepts that are meaningful to model prediction as statistically insignificant.
Appendix B Hyperparameter Sensitivity Analysis
              | RCAV                    | TCAV
Layer         | P_order | AUROC | AUPRC | P_order | AUROC | AUPRC
Conv2d 3b 1x1 | 92%     | 0.94  | 0.89  | 45%     | 0.55  | 0.27
Mixed 5c      | 89%     | 0.94  | 0.86  | 68%     | 0.65  | 0.31
Mixed 6d      | 83%     | 0.84  | 0.77  | 69%     | 0.70  | 0.34
Mixed 7b      | 91%     | 0.94  | 0.83  | 70%     | 0.70  | 0.37
Layer
Interpretability methods seek to explain model predictions. We propose using layerspecific RCAV scores for interpretability, but we may only extrapolate from layerspecific results if RCAV performance is layer invariant. At the image level, Table 4 shows that RCAV predicts concept sensitivity for all layers considered.
At the dataset level, we observe that the absolute value of $\mathrm{RCAV}_{C,k}$ increases monotonically as the layer $l$ approaches the softmax. We explain this increase by observing that the head of the model, $f_{l:}$, converges to linearity as $l$ approaches the final layer. In the linear case, any fixed CAV yields the same perturbation effect for every input, because the effect of a perturbation in a fixed direction is invariant over the choice of input for a linear classifier.

To ensure that any conclusion based on a specific layer's RCAV scores is representative of the whole model, we need a weak form of layer invariance: consistency of the sign of $\mathrm{RCAV}_{C,k}$ across layers. Figure 4 shows that for six of the seven concepts considered, RCAV consistently predicts dataset-level concept sensitivity. It is possible that layer inconsistency occurs when a concept plays a non-binary role in the classifier's decision function. For this reason, we recommend testing multiple layers when using RCAV for dataset-level concept sensitivity quantification.
Step Size |      | P_order | AUROC | AUPRC
0.1       | 2e-5 | 81%     | 0.93  | 0.88
1         | 2e-4 | 88%     | 0.94  | 0.86
10        | 2e-3 | 94%     | 0.94  | 0.89
100       | 0.03 | 91%     | 0.94  | 0.86
Step size
In real-world use of RCAV, ground truth concept sensitivity is not known, so it is impossible to tune the step size for optimal performance. Instead, we suggest choosing the step size such that the observed concept sensitivity scores, $S_{C,k}$, range from 0.001 to 0.1. This is the observed range of softmax differences in the benchmark experiments shown in Figure 2 and Figure 3. Empirically, RCAV performance is robust across the choice of step size, as shown in Table 5.
          | RCAV          | TCAV
Threshold | AUROC | AUPRC | AUROC | AUPRC
5%        | 0.94  | 0.99  | 0.35  | 0.84
25%       | 0.97  | 0.99  | 0.38  | 0.68
75%       | 0.94  | 0.89  | 0.55  | 0.27
95%       | 0.98  | 0.90  | 0.63  | 0.07
Label binarization threshold
In practice, we often use RCAV to make a binary decision: either the input is sensitive to the concept or it is not. In subsection 4.2, we used AUROC and AUPRC to quantify the accuracy of RCAV on this binary task. The ground truth for this task is whether the model prediction delta exceeds a certain threshold under input augmentation; formally, the ground truth labels are defined by $\mathbb{1}\big[\,|f^k(x') - f^k(x)| > t\,\big]$ for some fixed threshold $t$, augmented input $x'$, and class $k$. In Table 6, we choose the threshold as a percentile of the ground truth sensitivity values, and show that RCAV performs robustly across all thresholds.
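The percentile-based label binarization can be sketched as follows (the helper name and toy deltas are ours; the paper's exact preprocessing may differ):

```python
import numpy as np

def binarize_by_percentile(deltas, pct):
    """Label a sample 'sensitive' (1) when its ground-truth prediction
    delta exceeds the pct-th percentile of all deltas; these labels feed
    the AUROC/AUPRC computation at different thresholds."""
    t = np.percentile(np.abs(deltas), pct)
    return (np.abs(deltas) > t).astype(int)

deltas = np.array([0.01, 0.02, 0.03, 0.50])
print(binarize_by_percentile(deltas, 75))  # only the largest delta passes
```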
Appendix C Measuring Concept Encoding Linearity
Layer         | TFMNIST | CAMELYON16
Conv2d 3b 1x1 | 24%     | 50%
Mixed 5c      | 40%     | 37%
Mixed 6d      | 40%     | 80%
Mixed 7b      | 44%     | 84%
RCAV relies on the CAV's linear approximation of the model's concept encoding. By performing an SVD on the ground truth concept sensitivity differences, we can measure the extent to which this linearity constraint bottlenecks RCAV performance. The CAV can reliably estimate the ground truth effect only if the difference vector between the encodings of the input, $f_{:l}(x)$, and the augmented input, $f_{:l}(x')$, is similar to the CAV – i.e. $f_{:l}(x') - f_{:l}(x) \approx \alpha v_C$. If, on the other hand, the difference vector has high variance across points of the validation set, then the effect of the concept cannot be encoded as a CAV. We can measure the extent to which the concept is consistently encoded by examining the matrix of pairwise encoding differences:

$$D = \big[\, f_{:l}(x_i') - f_{:l}(x_i) \,\big]_{i=1}^{N} \qquad (6)$$

The optimal CAV³ is the first singular vector of $D$, because this vector best approximates $D$. Using the SVD, we can upper bound the performance of RCAV on layer $l$ by calculating the reconstruction accuracy, $R$, of the best rank-one approximation to $D$. Matrix dimensions vary across layers, so we normalize the reconstruction accuracy to $R = 1 - \|D - D_1\|_F / \|D\|_F$, where $D_1$ is the rank-one approximation and $\|\cdot\|_F$ is the Frobenius norm. Table 7 shows that reconstruction accuracy is higher for CAMELYON16 than for TFMNIST. We infer that the CAMELYON16 model's encoding of the contrast concept is more linear than the TFMNIST model's encoding of the texture concepts. These results explain the difference in performance between the two datasets seen in Table 1.

³ In practice, the optimal CAV cannot be calculated in this way, because it is not feasible to counterfactually augment the input – i.e. we do not have $x'$.
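The rank-one reconstruction accuracy can be computed directly from the SVD; a minimal sketch (the normalization follows the Frobenius-norm form described above, which is our reading of the paper's definition):

```python
import numpy as np

def rank_one_accuracy(D):
    """Reconstruction accuracy of the best rank-one approximation to the
    pairwise encoding-difference matrix D:
    R = 1 - ||D - D1||_F / ||D||_F, with D1 from the top singular triplet."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    D1 = s[0] * np.outer(U[:, 0], Vt[0])  # best rank-one approximation
    return 1.0 - np.linalg.norm(D - D1) / np.linalg.norm(D)

# A matrix whose rows all point the same way is exactly rank one:
D = np.outer(np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.5]))
print(rank_one_accuracy(D))  # 1.0: perfectly consistent concept encoding
```

High values of this accuracy indicate that the concept's effect on the latent space is well captured by a single direction, i.e. by a CAV.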