1 Introduction
Fairness has become one of the most active topics in machine learning in recent years, and the research community is investing a large amount of effort in this area. The main motivation is the increasing impact on human lives of machine learning systems, which are now deployed in a wide variety of fields. Originally designed to improve recommendation systems in the internet industry, they are becoming an inseparable part of our daily lives as more and more companies integrate Artificial Intelligence (AI) into their existing practices or products. While some of these everyday uses involve leisure, with trivial consequences (Amazon or Netflix use recommender systems to present a customized page that offers their products according to the order of preference of each user), others entail particularly sensitive decisions: in Medicine, where patient suitability for treatment is considered; in Human Resources, where candidates are screened on the basis of an algorithmic decision; in the Automotive industry, with the release of self-driving cars; in the Banking and Insurance industry, which characterizes customers according to a risk index; in Criminal justice, where the COMPAS algorithm is used in the United States for recidivism prediction. For a more detailed background on these facts see for instance [33], [3], [30] or [15], and references therein.
The technologies that AI offers certainly make life easier. It is however a common misconception that they are absolutely objective. In particular, machine learning algorithms, which are meant to automatically take accurate and efficient decisions that mimic and sometimes even outmatch human expertise, rely heavily on potentially biased data. It is interesting to remark that this bias is often due to an inherent social bias existing in the population that is used to generate the training dataset of the machine learning models. A list of potential causes for the discriminatory behaviours that machine learning algorithms may exhibit, in the sense that groups of the population are treated differently, is given in [2]. Several real and striking cases can be found in the literature. In [1], it was found that the COMPAS algorithm used for recidivism prediction produces a much higher rate of false positive predictions for black people than for white people. Later, in [24], a job platform similar to LinkedIn called XING was found to rank less qualified male candidates higher than more qualified female candidates. Publicly available commercial face recognition online services provided by Microsoft, Face++, and IBM were also recently found in [6] to achieve much lower accuracy on females with darker skin color. Although a discrimination may appear naturally and could be thought of as acceptable, as in [20] for instance, quantifying the effect of a machine learning predictor with respect to a given situation is of high importance. Therefore, the notion of fairness in machine learning algorithms has received growing interest over the last years. We believe this is crucial in order to guarantee a fair treatment for every subgroup of the population, which will contribute to reducing the growing distrust of machine learning systems in society.
Yet providing a definition of fairness or equity in machine learning is a complicated task and several propositions have been formulated. First described in terms of law [37], fairness is now quantified in order to detect biased decisions from automatic algorithms. We will focus on the issue of biased training data, which is one of the several possible causes of such discriminatory outcomes in machine learning mentioned above. In the fair learning literature, fairness is often defined with respect to selected variables, which are commonly called protected or sensitive attributes. Throughout the paper we will use both terms interchangeably. These variables encode potentially discriminatory information about the population that should not be used by the algorithm. In this framework, two main streams of understanding fairness in machine learning have been considered. The probabilistic notion underlying this division is the independence between distributions. The first one gives rise to the concept of Statistical Parity, which means the independence between the protected attribute and the outcome of the decision rule. This concept is quantified using the Disparate Impact index, which is described for instance in [13]. This notion was first considered as a tool for quantifying discrimination as the so-called 4/5th rule by the State of California Fair Employment Practice Commission (FEPC) in 1971.
For more details on the origin and first applications of this index we refer to [4]. The second one proposes the Equality of Odds, which considers the independence between the protected attribute and the output prediction, conditionally on the true output value. In other words, it quantifies the independence between the error of the algorithm and the protected variable. Hence, in practice, it compares the error rates of the algorithmic decisions between the different groups of the population. This second point of view was originally proposed for recidivism of defendants in [14]. Many other criteria (see for instance [3] for a review) have been proposed, sometimes leading to incompatible formulations as stated in [7]. Note finally that the notion of fairness is closely related to the notion of privacy, as pointed out in [11].
In this paper, our goal is to present some comprehensive statistical results on fairness in machine learning by studying the statistical parity criterion through the analysis of the example given by the Adult Income dataset. This public dataset is available on the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult)
and it consists in forecasting a binary variable (low or high income) which corresponds to an income lower or higher than 50k dollars a year. This decision could potentially be used to evaluate the credit risk of loan applicants, making this dataset particularly popular in the machine learning community. It is considered here as potentially sensitive to discrimination with respect to the Gender and Ethnic origin variables. The covariates used in the prediction as well as the true outcome are available in the dataset, hence supervised machine learning algorithms will be used.
Section 2 describes this dataset. It specifically highlights the existing imbalance between the income prediction and the Gender and Ethnic origin sensitive variables. A preprocessing step is needed in order to prepare the data for further analyses, and the modifications performed are detailed in Appendix A.1.1. In Section 3, we then explain the statistical framework for the fairness problem, focusing in particular on the binary classification setting. We follow the approach of Statistical Parity to quantify fairness and we thus present the Disparate Impact as our preferred index for measuring the bias. Note that the bias is present in this dataset, so the machine learning decision rules learned in this paper will be trained on a biased dataset. Although many criteria have been described in the fair learning literature, they are often used as a score without statistical control. In the cases where test procedures or confidence bounds are provided, they are obtained using a resampling scheme to get standardized Gaussian confidence intervals under a Gaussian assumption which does not correspond to the distribution of the observations. In this work, we promote the use of confidence intervals to control the risk of false discriminatory assessment. We then show in Appendix A.2 the exact asymptotic distribution of the estimates of different fairness criteria, obtained through the classical approach of the Delta method described in [36]. Then, Section 4 is devoted to presenting some naive approaches that try to correct the discriminatory behaviour of machine learning algorithms or to test for possible discrimination. Finally, Section 5 is devoted to studying the efficiency of two easy ways to incorporate fairness in machine learning algorithms: building a different algorithm for each class of the population, or adapting the decision of a single algorithm in a different way for each subpopulation. We then present in Section 6 some conclusions for this work, and thus provide a concrete pedagogical example for a better understanding of bias issues and fairness treatment in machine learning. Proofs and more technical details are presented in the Appendix.
2 Machine learning algorithms for the attribution of bank loans
One of the applications for which machine learning algorithms have already become firmly established is credit scoring. In order to minimize its risks, the banking industry uses machine learning models to assess which clients are likely to repay a loan. The FICO score in the US or the SCHUFA score in Germany are examples of these algorithmically determined credit rating scores, as are those used by a number of Fintech startups, which also base their loan decisions entirely on algorithmic models [19] (see, e.g., https://www.kreditech.com/). Yet credit rating systems have been criticized as opaque and biased in [29], [34] or [19].
In this paper, we use the Adult Income dataset as realistic material to reproduce this kind of analysis for credit risk assessment. This dataset was built from a database containing the results of a census carried out in the United States in 1994. It has been widely used in the fair learning community as a suitable benchmark to compare the performance of different machine learning methods. It contains information on about 48 thousand individuals, each of them being described by 14 variables as detailed in Table 2. This dataset is often used to predict the binary variable Annual Income higher or not than 50k dollars. Such a forecast does not convey any discrimination in itself, but it illustrates what can be done in the banking or insurance industry, since the machine learning procedures are similar to those used by banks to evaluate the credit risk of their clients. The fact that the true value of the target variable is known, in contrast to the majority of the datasets available in the literature (e.g. the German Credit Data), as well as the value of potential protected attributes such as the ethnic origin or the gender, makes this dataset one of the most widely used to compare the properties of fair learning algorithms. In this paper, we will therefore compare supervised machine learning methods on this dataset. A graphic representation of the distribution of each feature can be found at https://www.valentinmihov.com/2015/04/17/adultincomedataset/. This representation gives a good overview of what this dataset contains. It also makes clear that it has to be preprocessed before its analysis using black-box machine learning algorithms. In this work, we have deleted missing data, errors and inconsistencies. We have also merged highly dispersed categories and eliminated strong redundancies between certain variables (see details in Supplementary material A.1.1). In Figure 1, we represent the dataset after our preprocessing, and show the number of occurrences for each categorical variable as well as the histograms for each continuous variable.
2.1 Unbalanced Learning Sample
After preprocessing the dataset, standard exploratory analyses first show that the dataset clearly suffers from an unbalanced distribution of low and high incomes with respect to two variables: Gender (male or female) and Ethnic origin (caucasian or non-caucasian). These variables therefore appear to be potentially sensitive variables in our data. Figure 2 shows this unbalanced distribution of incomes with respect to these variables. It is important to be aware of such imbalances in reference datasets, since a bank willing to use an automatic algorithm to predict which clients should have successful loan applications could be tempted to train the decision rules on such unbalanced data. This fact is at the heart of our work and we question its effect on further predictions on other data. What information will be learnt from such unbalanced data: a fair relationship between the variables and the true income that will enable socially reasonable forecasts, or biased relations in the distribution of the income with respect to the sensitive variables? We explore this question in the following section.
2.2 Machine Learning Algorithms to forecast income
We now study the performance of four categories of supervised learning models: logistic regression, decision trees, gradient boosting (see [8], [27], [35]), and neural networks. We used the Scikit-learn implementations of Logistic Regression (LR) and Decision Trees (DT), and the lightGBM implementation of the Gradient Boosting (GB) algorithm. The Neural Network (NN) was coded using PyTorch and contains four fully connected layers with Rectified Linear Unit (ReLU) activation functions.
In order to analyze categorical features using these models, the binary categorical variables were encoded using zeros and ones. The categorical variables with more than two classes were transformed into one-hot vectors, i.e. into vectors where only one element is nonzero (or hot). We specifically encoded the target variable by the values $Y=0$ for an income below 50k dollars, and $Y=1$ for an income above 50k dollars. We used a 10-fold cross-validation approach in order to assess the robustness of our results. The average accuracy as well as the true positive (TP) and true negative (TN) rates were finally measured for each trained model. Figure 3 summarizes these results.
We can observe in Fig. 3 that the best average results are obtained by using Gradient Boosting. More interestingly, we can also remark that the predictions obtained using all models for $Y=0$ (represented by the true negative rates) are clearly more accurate than those obtained for $Y=1$ (represented by the true positive rates), which accounts for only a minority of the observations. All tested models therefore make more mistakes on average for the observations which should have a successful prediction than for those which should have a negative one. Note that the tested neural network is outperformed by the other methods in these tests in terms of prediction accuracy. Although we used default parametrizations for the Logistic Regression and Gradient Boosting models, and simply tuned the decision tree to have a maximum depth of 5, we tested different parametrizations of the Neural Network model (number of epochs, mini-batch sizes, optimization strategies) and kept the best performing one. It therefore appears that the neural network model we tested was clearly not adapted to the Adult Income dataset.
Hence we have built and compared several algorithms, ranging from completely interpretable models to black-box models involving the optimization of several parameters. Note that we could have used the popular Random Forest algorithm, which could lead to equivalent results, but we favored boosting models, whose implementation is easier using Python.
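The encoding and evaluation steps described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used for the experiments: the helper names one_hot, k_fold_indices and rates are hypothetical, and a real pipeline would rely on the Scikit-learn, lightGBM and PyTorch tools mentioned above.

```python
import random


def one_hot(values):
    """Encode categories as 0/1 vectors with exactly one nonzero entry."""
    categories = sorted(set(values))
    position = {c: i for i, c in enumerate(categories)}
    vectors = [[int(position[v] == i) for i in range(len(categories))]
               for v in values]
    return vectors, categories


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the row indices 0..n-1 and split them into k disjoint test folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rates(y_true, y_pred):
    """Return (accuracy, true positive rate, true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return (tp + tn) / len(pairs), tp / positives, tn / negatives
```

Each fold returned by k_fold_indices serves once as the test set while the remaining rows are used for training, and rates is then averaged over the folds.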
3 Measuring the Bias with Disparate Impact
3.1 Notations
Among the criteria proposed in the literature to reveal the presence of a bias in a dataset or in automatic decisions (see e.g. [17] for a recent review), we focus in this paper on the so-called Statistical Parity. This criterion deals with the differences in reference decisions, or in the outcome of decision rules, with respect to a sensitive attribute. Note that we only consider the binary classification problem with a single sensitive attribute for the sake of simplicity, although we could consider other tasks (e.g. regression) or multiple sensitive attributes (see [18] or [21]). Here is a summary of the notations we use:

$Y$ is the variable to be predicted. We consider here binary variables where $Y=1$ is a positive decision (here a high income) while $Y=0$ is a negative decision (here a low income);

$\hat{Y} = g(X)$ is the prediction given by the algorithm. As for $Y$, this is a binary variable interpreted such that $\hat{Y}=0$ or $\hat{Y}=1$ means a negative or a positive decision, respectively. Note that most machine learning algorithms output continuous scores or probabilities. We consider in this case that this output is already thresholded.

$S$ is the variable which splits the observations into groups for which the decision rules may lead to discriminatory outputs. From a legal or a moral point of view, $S$ is a sensitive variable that should not influence the decisions, but could lead to discriminatory decisions. We consider hereafter that $S=0$ represents the minority that could be discriminated against, while $S=1$ represents the majority. We specifically focus here on estimating the disproportionate effect with respect to two sensitive variables: the gender (male vs. female) and the ethnic origin (caucasian vs. non-caucasian).
Statistical Parity is often quantified in the fair learning literature using the so-called Disparate Impact (DI). The notion of DI was introduced in US legislation in 1971 (https://www.govinfo.gov/content/pkg/CFR2017title29vol4/xml/CFR2017title29vol4part1607.xml). It measures the existing bias in a dataset as
\[ DI(Y, S) = \frac{\mathbb{P}(Y = 1 \mid S = 0)}{\mathbb{P}(Y = 1 \mid S = 1)} \tag{3.1} \]
and can be empirically estimated as
\[ \widehat{DI}(Y, S) = \frac{n_{10} / (n_{00} + n_{10})}{n_{11} / (n_{01} + n_{11})} \tag{3.2} \]
where $n_{ij}$ is the number of observations such that $Y = i$ and $S = j$. The smaller this index, the stronger the discrimination over the minority group. Note first that this index supposes that $\mathbb{P}(Y = 1 \mid S = 0) \le \mathbb{P}(Y = 1 \mid S = 1)$, since $S = 0$ is defined as the group which can be discriminated against with respect to the output $Y$. It is also important to remark that this estimation may be unstable due to the unbalanced amount of observations in the groups $S = 0$ and $S = 1$ and the inherent noise existing in all data. We therefore propose to estimate a confidence interval around the Disparate Impact in order to provide statistical guarantees for this score, as detailed in the Supplementary material A.2. These confidence intervals will be used later in this section to quantify how reliable two disparate impacts computed on our dataset are. This fairness criterion can be extended to the outcome of an algorithm by replacing in Eq. (3.1) the true variable $Y$ by $g(X)$, that is
\[ DI(g, X, S) = \frac{\mathbb{P}(g(X) = 1 \mid S = 0)}{\mathbb{P}(g(X) = 1 \mid S = 1)} \tag{3.3} \]
This measures the risk of discrimination when using the decision rules encoded in $g$ on data following the same distribution as in the test set. Hence, in [16], a classifier $g$ is said not to have a Disparate Impact at level $\tau$ when $DI(g, X, S) > \tau$. Note that the notion of DI defined in Eq. (3.1) was first introduced as the 4/5th rule by the State of California Fair Employment Practice Commission (FEPC) in 1971. Since then, the threshold $\tau_0 = 0.8$ has been chosen in different trials as a legal score to judge whether the discrimination committed by an algorithm is acceptable or not (see e.g. [13], [38], or [26]).
3.2 Measures of disparate impacts
The disparate impact $DI(g, X, S)$ should obviously be close to $1$ to claim that $g$ makes fair decisions. A more subtle, though critical, remark is that it should at least not be smaller than the general disparate impact $DI(Y, S)$. This would indeed mean that the decision rules reinforce the discrimination compared with the reference data on which they were trained. We will therefore measure hereafter the disparate impacts $DI(Y, S)$ and $DI(g, X, S)$ obtained on our dataset.
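As an illustration, the empirical disparate impact (the rate of positive outcomes in the group S = 0 divided by the rate in the group S = 1) and the comparison between reference data and predictions can be sketched as follows. The function name disparate_impact and the toy lists are assumptions made for this example only; they are not taken from the Adult Income dataset.

```python
def disparate_impact(outcomes, groups):
    """Empirical DI: positive rate in the group S=0 over positive rate in S=1."""
    # n[(i, j)] counts the observations with outcome i and sensitive value j.
    n = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for y, s in zip(outcomes, groups):
        n[(y, s)] += 1
    rate_minority = n[(1, 0)] / (n[(0, 0)] + n[(1, 0)])
    rate_majority = n[(1, 1)] / (n[(0, 1)] + n[(1, 1)])
    return rate_minority / rate_majority


# Toy data: 10 individuals with S=0 followed by 10 individuals with S=1.
s = [0] * 10 + [1] * 10
y_true = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5   # reference decisions
y_pred = [1] + [0] * 9 + [1] * 5 + [0] * 5      # decisions of some rule g

di_ref = disparate_impact(y_true, s)    # bias already present in the data
di_pred = disparate_impact(y_pred, s)   # bias of the predictions
reinforced = di_pred < di_ref           # True when the rule worsens the bias
```

In this toy example the predictions halve the positive rate of the minority group while leaving the majority untouched, so the rule reinforces the bias of the reference data.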
In Table 1, we report confidence intervals for the bias already present in the original dataset, using Eq. (3.1) with the sensitive attributes Gender and Ethnic origin. They were computed using the method of Appendix A.2 and represent the range of values that the computed disparate impacts can take with 95% confidence (subject to standard and reasonable hypotheses on the data). The DI computed on the Gender variable appears very robust, and the one computed on the Ethnic origin variable relatively robust. It is clear from this table that both sensitive attributes generate discrimination, and that this discrimination is more severe for the Gender variable than for the Ethnic origin variable.
Protected attribute  DI  CI 

Gender  
Ethnic origin 
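To give an idea of how such an interval can be computed, here is a minimal sketch based on the textbook Delta-method interval for the logarithm of a ratio of two proportions, treating the two groups as independent binomial samples. This is a swapped-in standard approximation, not the asymptotic result of Appendix A.2, and the function name and the counts in the usage note are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def di_confidence_interval(pos0, n0, pos1, n1, level=0.95):
    """Return (DI, lower, upper) from the positive counts pos0, pos1 observed
    among n0 individuals with S=0 and n1 individuals with S=1.

    Assumption: independent binomial groups and a Gaussian approximation of
    log(DI), whose variance is obtained with the Delta method.
    """
    p0, p1 = pos0 / n0, pos1 / n1
    di = p0 / p1
    # Delta method: Var(log p_hat) is approximately (1 - p) / (n * p).
    se_log = sqrt((1 - p0) / (n0 * p0) + (1 - p1) / (n1 * p1))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return di, di * exp(-z * se_log), di * exp(z * se_log)
```

For instance, di_confidence_interval(200, 1000, 500, 1000) gives a point estimate of 0.4 with an interval of roughly 0.35 to 0.46, and the interval narrows as the group sizes grow.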
We have then measured the disparate impacts of Eq. (3.3) obtained using the predictions made by the four models in the 10-fold cross-validation of Section 2.2. These disparate impacts are presented in Fig. 4. We can see that, except for the decision tree with the Ethnic origin variable, the algorithms have a smaller disparate impact than the true variable. The impact is additionally clearly worsened with the Gender variable using all trained predictors. These predictors therefore reinforced the discrimination in all cases by enhancing the bias present in the training sample. Observing the true positive and true negative rates of Fig. 4, which distinguish the groups $S=0$ and $S=1$, is particularly interesting here to understand this effect more deeply. As already mentioned in Section 2.2, the true negative (TN) rates are generally higher than the true positive (TP) rates. It can be seen in Fig. 4 that this phenomenon is clearly stronger in the subplot representing the TP and TN rates for $S=0$ than in the one representing them for $S=1$, so false predictions are more favorable to the group $S=1$ than to the group $S=0$. This explains why the discrimination measured on the predictions is stronger (i.e. the disparate impact is lower) than the one measured on the original data (boxplots Ref in Fig. 4). Note that these measures are directly related to the notions of equality of odds and opportunity as discussed in [17]. The machine learning models we used in our experiments were thus shown to be unfair on this dataset, in the sense that discrimination is reinforced.
As pointed out in [15], there may be a strong variability when computing the disparate impact of different subsamples of the data. Hence, we additionally propose in this paper an exact Central Limit Theorem to overcome this effect. The confidence intervals we obtain prove stable when compared with bootstrap replications, and we therefore cross-validated our results using 10 replications of different learning and test samples on the three algorithms. The construction of these confidence intervals is postponed to Section A.2, while the comparison with bootstrap procedures is detailed in Section A.3 of the Appendix. In order to conveniently compare the bias in the predictions with the one in the original data, we show on the left the bias measured in the data. We can see that these boxplots are coherent with the results of Table 1 and Figure 4, and again show that the discrimination was reinforced by the machine learning models in this test.
In all generality, we conclude here that one has to be careful when training decision rules. They can indeed worsen the discrimination existing in the original database. We also remark that the majority of works using the Disparate Impact as a measure of fairness rely only on this score as a numerical value, with no estimation of how reliable it is. This motivated the definition of our confidence interval strategy in Appendix A.2, which was shown to be realistic in our experiments when comparing the Ref boxplots of Figure 4 with the confidence intervals of Table 1.
Note that we will only focus in the rest of the paper on the protected variable Gender since it was shown in Section 3 to be clearly the variable leading to discrimination for all tested machine learning models. We will also only test the Logistic Regression (LR) and Decision Tree (DT) as they are highly interpretable, plus the Gradient Boosting (GB) model which was shown to be the best performing one on the Adult Census dataset.
4 A quantitative evaluation of GDPR recommendations against algorithm discrimination
Once the presence of bias is detected, the goal of machine learning becomes to reduce its impact without hampering the efficiency of the algorithm. Indeed, the predictions made by the algorithm should remain sufficiently accurate for the machine learning model to be relevant in Artificial Intelligence applications. For instance, the decisions made by a well balanced coin when playing heads or tails are absolutely fair, as they are independent of any possible sensitive variable $S$. However, they also do not take into account any other input information $X$, making them pointless in practice. Reducing the bias of a machine learning model therefore ideally consists in getting rid of the influence of $S$ in all input data $X$ while preserving the relevant information to predict the true outputs $Y$. We will see below that this is not that obvious, even in our simple example.
It is first interesting to remark that the problem cannot be solved by simply having a balanced amount of observations with $S=0$ and $S=1$. We indeed reproduced the experimental protocol of Section 3.2 with 16,192 randomly chosen observations representing males (instead of 32,650), so that the decision rules were trained on average with as many males as females. As shown in Fig. 5, the trends of the results turned out to be very similar to those obtained in Fig. 4 (Gender).
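The balancing step of this experiment can be sketched as follows, assuming the sensitive variable is stored as a list of 0/1 values. The helper name balanced_indices is hypothetical and the snippet is not the code used for Fig. 5.

```python
import random


def balanced_indices(groups, seed=0):
    """Keep every row of the smaller group and an equally sized random
    subset of the larger group; return the selected row indices, sorted."""
    rng = random.Random(seed)
    idx0 = [i for i, s in enumerate(groups) if s == 0]
    idx1 = [i for i, s in enumerate(groups) if s == 1]
    small, large = sorted((idx0, idx1), key=len)
    return sorted(small + rng.sample(large, len(small)))


groups = [0] * 3 + [1] * 7           # toy sample: 3 minority, 7 majority rows
kept = balanced_indices(groups)      # 3 rows from each group
```

Because the majority rows are dropped uniformly at random, the proportion of positive outcomes inside each group is unchanged in expectation, which is consistent with the observation that balancing the group sizes alone does not change the trends.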
We specifically study in this section the effect of complying with the European regulations. From a legal point of view, the GDPR’s recommendation indeed consists in not using the sensitive variable in machine learning algorithms. Hence, we simply remove $S$ from the database in subsection 4.1, and we consider in subsection 4.2 one of the most common legal proofs of discrimination, called the testing method. It consists in considering the response for the same individual but with a different sensitive variable. We will study whether this procedure makes it possible to detect the group discrimination coming from the decisions of an algorithm.
4.1 What if the sensitive variable is removed?
The most obvious idea to remove the influence of a sensitive variable is to remove it from the data, so that it cannot be used when training the decision rules, and hence not when making new decisions either. Note that this solution is recommended by the GDPR regulations. To test the pertinence of this solution, we considered the algorithms analyzed in Sections 2 and 3 and trained them without the Gender variable. As in Section 3, a 10-fold cross-validation approach was used to assess the robustness of our results.
As shown in Figure 6(top), the disparate impacts as well as the model accuracies remained almost unchanged when removing the Gender variable from the input data. Anonymizing the database by removing a variable therefore had very little effect on the discrimination induced by the use of an automated decision algorithm. This is very likely explained by the fact that a machine learning algorithm uses all possible information conveyed by the variables. In particular, if the sensitive variable (here the Gender variable) is strongly correlated with other variables, then the algorithm learns and automatically reconstructs the sensitive variable from the other variables. Hence we can deduce that social determinism is stronger than the presence of the sensitive variable here, so the classification algorithms were not impacted by the removal of this variable.
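The reconstruction effect can be illustrated on synthetic data. In the following sketch, which is purely hypothetical and unrelated to the Adult Income variables, a proxy feature agrees with the sensitive variable for 90% of the individuals, and the decision rule only reads the proxy.

```python
def disparate_impact(decisions, groups):
    """Positive decision rate in the group S=0 divided by the rate in S=1."""
    rate = {}
    for g in (0, 1):
        members = [d for d, s in zip(decisions, groups) if s == g]
        rate[g] = sum(members) / len(members)
    return rate[0] / rate[1]


groups, proxy = [], []
for i in range(2000):
    s = i % 2                          # alternate the two groups
    flipped = (i // 2) % 10 == 0       # the proxy disagrees with S for 10% of rows
    groups.append(s)
    proxy.append(1 - s if flipped else s)

decisions = list(proxy)                # the rule never reads S, only the proxy
agreement = sum(p == s for p, s in zip(proxy, groups)) / len(groups)
di = disparate_impact(decisions, groups)
```

Although the sensitive variable is absent from the inputs of the rule, the resulting disparate impact is about 0.11, far below 1.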
4.2 From Testing for bias detection to unfair prediction
Testing procedures are often used as a legal proof of discrimination. For an individual prediction, such procedures consist in first creating an artificial individual who shares the same characteristics as a chosen individual who suspects disparate treatment and discrimination, but has a different protected variable. The procedure then amounts to testing whether this artificial individual receives the same prediction as the original one. If the predictions differ, then this conclusion can serve as a legal proof of discrimination.
These procedures have existed for a long time (since their introduction in 1939, see https://fr.wikipedia.org/wiki/Test_de_discrimination), and since 2006 the French justice system has accepted them as a proof of biased treatment, although the testing process itself has been qualified as unfair (https://www.juritravail.com/discriminationphysique/embauche/phalternativeA1.html). Furthermore, this technique has been generalized by sociologists and economists (see for instance [32] for a description of such a method) to statistically measure group discrimination in the housing and labour markets by conducting carefully controlled field experiments.
This testing procedure, considered as a discrimination test, is nowadays commonly used in France to assess fairness in sociological studies by the Observatoire des discriminations (https://www.observatoiredesdiscriminations.fr/testing) and the TEPP laboratory, as pointed out in [25], or in governmental studies by DARES (https://dares.travailemploi.gouv.fr/daresetudesetstatistiques/etudesetsyntheses/daresanalysesdaresindicateursdaresresultats/testing) of the French Ministry of Labour and ISM Corum (http://www.ismcorum.org/). Some companies are certified using such tests. Audits of the quality of recruiting methods are also offered, while Novethic (https://www.novethic.fr/lexique/detail/testing.html) offers ethics training.
Testing is efficient at detecting human discrimination, especially in the labour market, but hiring tech is producing more and more software or web platforms performing predictive recruitment, as in [31]. Does testing remain valid when facing machine learning algorithms? This strategy is evaluated using the same experimental protocol as in the previous sections. The results of these experiments are shown in Figure 6(bottom). Testing does not detect any discrimination when the sensitive variable is captured by the other variables.
An algorithmic solution to bypass this testing procedure is given by the following trick. Train a classifier $g$ as usual using all available information, and then build a testing-compliant version of it as follows: for an individual, the predicted outcome is the best decision obtained on the actual individual and on a virtual individual with exactly the same characteristics as the original one, except for the protected variable, which has the opposite label (e.g. the Gender variable is Male instead of Female), namely $\max\{g(x, s),\, g(x, 1 - s)\}$ for a binary protected variable $s$. Note that in the case of multiclass labels, the outcome should be the most favourable decision over all possible labels. This classifier is fair by design, in the sense that the testing procedure cannot detect a change in the individual prediction, whatever the gender of the individual.
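This trick amounts to wrapping an existing decision rule, as in the following minimal sketch. The wrapper name make_testing_compliant, the toy biased_rule and its thresholds are assumptions made for the illustration.

```python
def make_testing_compliant(predict, protected="gender", labels=(0, 1)):
    """Wrap a decision rule so that it returns the most favourable decision
    over all possible values of the protected attribute."""
    def compliant(individual):
        return max(predict({**individual, protected: label}) for label in labels)
    return compliant


def biased_rule(person):
    # Toy rule with a stricter score threshold for the group gender == 0.
    threshold = 5 if person["gender"] == 1 else 7
    return int(person["score"] >= threshold)


fair_by_design = make_testing_compliant(biased_rule)
candidate = {"gender": 0, "score": 6}
twin = {"gender": 1, "score": 6}

before = (biased_rule(candidate), biased_rule(twin))        # decisions differ
after = (fair_by_design(candidate), fair_by_design(twin))   # decisions agree
```

By construction the wrapped rule gives the same answer to an individual and to the artificial twin, so an individual test cannot flag it, whatever the group-level behaviour of the underlying rule.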
Nevertheless, this trick against testing cannot fool the usual evaluation of discrimination based on a disparate impact measure, which is common in the USA and measures the impact on real rather than fictitious recruitment. This is the reason why hiring tech companies add some facilities ([31]) to mitigate the ethnic bias of algorithmic hiring, so that client companies avoid legal complications. This strategy is evaluated using the same experimental protocol as in the previous sections and the results are shown in Figure 6(bottom).
As expected from the previous results, this method has little impact on the classification errors and the disparate impacts. This reinforces the conclusion of Section 4.1 that the Gender variable is captured by other variables. Removing the effect of a sensitive variable can therefore require more advanced treatments than those described above.
5 Differential treatment for fair decision rules
5.1 Strategies
As we have seen previously, bias in the data may lead an automatic decision rule to discriminate. Although many complex methods have been developed to tackle this problem, we investigate in this section the effects of two easy and perhaps naive modifications of machine learning algorithms that aim at building fair classifiers. They have in common the idea of considering a different treatment for each group defined by $S$. These strategies are the following:

Building a different classifier for each class of the sensitive variable: This strategy consists in training the same prediction model with different parameters for each class of the sensitive variable. We denote separate treatment this strategy.

Using a specific threshold for each class of the sensitive variable: Here, a single classifier is trained on all data to produce a score. The binary prediction is however obtained using a specific threshold for each subgroup. Note that when the score is an estimate of the conditional probability of a positive outcome, the threshold used is often 0.5. Here this threshold is made group dependent and is adapted to avoid any possible discrimination. In practice, we keep a threshold of 0.5 for the observations in the majority group but adapt the threshold for the observations in the minority group. In our tests, we automatically set this threshold on the training set so that the disparate impact is close to 0.8 in the cases where it was originally lower than this socially accepted threshold. The classifier and the potentially adapted threshold are then used for further predictions. In a certain way, this amounts to favouring the minority class by replacing equality with equity. We call this strategy positive discrimination since the procedure serves this purpose.
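The second strategy can be sketched as follows. This is only an illustration under stated assumptions: the groups are coded 0 (minority) and 1 (majority), the disparate impact is taken as the ratio of positive prediction rates, and the search simply lowers the minority threshold step by step. The function names, the search procedure and the toy scores are hypothetical and are not the tuning procedure used in the experiments.

```python
from typing import Dict, Sequence


def positive_rate(scores: Sequence[float], threshold: float) -> float:
    """Share of scores classified as positive (score >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)


def disparate_impact(scores, groups, thresholds: Dict[int, float]) -> float:
    """Positive rate in group 0 divided by positive rate in group 1.

    Assumes both groups are non-empty and group 1 has at least one positive.
    """
    rate = {}
    for g in (0, 1):
        g_scores = [s for s, grp in zip(scores, groups) if grp == g]
        rate[g] = positive_rate(g_scores, thresholds[g])
    return rate[0] / rate[1]


def tune_minority_threshold(scores, groups, target=0.8, base=0.5):
    """Keep `base` for group 1 and lower the group 0 threshold until DI >= target."""
    thresholds = {0: base, 1: base}
    if disparate_impact(scores, groups, thresholds) >= target:
        return thresholds  # already above the target, nothing to adapt
    # Candidate thresholds: minority scores below `base`, from highest to lowest.
    candidates = sorted(
        {s for s, g in zip(scores, groups) if g == 0 and s < base}, reverse=True
    )
    for t in candidates:
        thresholds[0] = t
        if disparate_impact(scores, groups, thresholds) >= target:
            break  # smallest change that reaches the target
    return thresholds


# Toy training scores (hypothetical): group 0 is the minority.
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.7, 0.6, 0.1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
tuned = tune_minority_threshold(scores, groups)
```

The thresholds are fitted on training scores only and then reused unchanged for new predictions. If no candidate reaches the target, the sketch returns the lowest candidate tried.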
5.2 Results obtained using the Separate Treatment strategy
Splitting the model parameters into parameters adapted to each group reduces the bias of the predictions when compared to the initial model, but it does not remove it. As we can see in Figure 7(top), where the notations are analogous to those in the above figures, it improves the disparate impact in all cases for relatively stable prediction accuracies. Note that the improvements are more spectacular for the basic Logistic Regression and Decision Tree models than for the Gradient Boosting model. This last model is indeed particularly efficient at capturing fine high-order relations between the variables, which gives less influence to the strong nonlinearity generated when splitting the machine learning model into two class-specific models. Hence building different models reduces but does not solve the problem, the level of discrimination in the decisions being only slightly closer to the level of bias in the initial dataset.
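For reference, the separate treatment strategy evaluated here can be sketched in a few lines. The sketch assumes a generic `fit(X, y)` routine that returns a prediction function; `fit_majority` is a deliberately trivial stand-in learner, and all names are hypothetical. Any real learner (logistic regression, decision tree, gradient boosting) would take its place.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence

Model = Callable[[Sequence[float]], int]


def fit_separate(X, y, s, fit: Callable[[List, List], Model]) -> Dict[Hashable, Model]:
    """Train one model per value of the sensitive variable `s`."""
    models = {}
    for group in set(s):
        idx = [i for i, g in enumerate(s) if g == group]
        models[group] = fit([X[i] for i in idx], [y[i] for i in idx])
    return models


def predict_separate(models: Dict[Hashable, Model], X, s) -> List[int]:
    """Route each observation to the model trained on its own group."""
    return [models[g](x) for x, g in zip(X, s)]


# Stand-in learner (hypothetical): always predicts the majority training label.
def fit_majority(X, y) -> Model:
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 1, 1, 1, 0]
s = ["F", "F", "F", "M", "M", "M"]
models = fit_separate(X, y, s, fit_majority)
predictions = predict_separate(models, [[7], [8]], ["F", "M"])
```

Each group-specific model only sees the observations of its own group, so the sensitive variable acts through the choice of model rather than as an input feature.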
5.3 Results obtained using the Positive Discrimination strategy
Results obtained using the positive discrimination strategy are shown in Figure 7(bottom). They clearly show the spectacular effect of this strategy on the disparate impacts, which can be controlled by the data scientist. By adjusting the threshold, it is possible to adjust the level of discrimination in the predictions, as in this example where the socially acceptable level of 0.8 can be reached. In this case we see a decrease in the performance of the classifier, although it remains reasonable.
These results should however be tempered for one main reason. Although the average error changes little, the number of false positive cases among women clearly increases when introducing positive discrimination. In our tests, more than half of the predictions that should have been negative in the female group are actually positive. These false positive decisions have a limited impact on the average prediction accuracy because they were obtained in the female group, which has fewer observations than the male group, and because there are clearly fewer positive cases among women than among men. Yet false positive errors are considered as the most important error type, and this increase may thus be very harmful for the decision maker. From a legal point of view, this procedure may be judged as unfair or raise political issues that are far beyond the scope of this paper.
6 Conclusions
In this paper, we provided a case study of the use of machine learning techniques for prediction on the well-known Adult Income dataset. We focused on a specific fairness criterion, statistical parity, which is measured through the Disparate Impact. This metric quantifies the difference in the behaviour of a classification rule applied to two subgroups of the population, the minority and the majority. Fairness is achieved when the algorithm behaves in the same way for both groups, hence when the sensitive variable does not play a significant role in the prediction. The main results are summarized in Figure 8.
In particular, we convey the following take-home messages: (1) Bias in the training data may lead to machine learning algorithms taking unfair decisions, but not always. While there is a clear increase of bias using the tested machine learning algorithms with respect to the Gender variable, the Ethnic Origin variable does not lead to a severe bias. (2) As always in Statistics, computing a mere measure is not enough: confidence intervals are needed to determine the variability of such indexes. Hence, we proposed an ad hoc construction of confidence intervals for the Disparate Impact. (3) Standard regulations that promote either the removal of the sensitive variable or the use of testing techniques turned out to be ineffective when dealing with the fairness of machine learning algorithms.
Note also that different notions of fairness (local and global) are at stake here. We first point out that testing methods focus on individual fairness, while statistical methods such as Disparate Impact analysis tackle the issue of group fairness. These two notions, although both related to discrimination with respect to an algorithmic decision, are nevertheless different. In this work, we showed that an algorithm can be designed to be individually fair while still presenting a strong discrimination with respect to the minority group. This is mainly due to the fact that testing methods are unable to detect the discrimination hidden in algorithmic decisions that results from training on an unbalanced sample. Testing methods detect discrimination if individuals with the same characteristics but different values of the sensitive variable are treated differently. This corresponds to looking for a counterfactual explanation of an individual with a different sensitive variable. This notion of counterfactual explanations to detect unfairness has been developed in [23]. Yet the testing method fails to find a counterfactual individual, since it is not enough to change only the sensitive variable: a good candidate should be the closest individual with a different sensitive variable and with the other variables changed according to their dependence on it. For this, following some recent work on fairness with optimal transport theory as in [16], developing an idea from [13], some authors propose a new way of testing discrimination by computing such counterfactual models in [5]. Finally, we tested two a priori naive solutions consisting either in building different models for each group or in choosing different rules for each group. Only the latter, which can be considered as positive discrimination, proves helpful in obtaining a fair classification. Note that although some errors are increased (false positive rate), this method has a good generalization error. Yet in other cases, the loss of efficiency could be greater and this method may lead to unfair treatment.
This data set has been extensively studied in the literature on fairness in machine learning and we are well aware of the numerous solutions that have been proposed to solve this issue. Even with standard methods, it is possible for a data scientist, when confronted with fairness in machine learning, to design algorithms that have very different behaviors and yet achieve a good classification error rate. Some algorithms amplify discrimination in society while others just maintain its level, and some others correct this discrimination and provide gender equity. It is worth noting that the most explainable algorithms, such as logistic regression, do not protect from discrimination. On the contrary, their simplicity makes the capture of gender bias immediate, while more complex algorithms might be better protected from this spurious correlation or, since the variable is discrete, more precisely spurious dependency.
The choice of a model should not be driven only by its performance with respect to a generalization error: the model should also be explainable in terms of bias propagation. For this, measures of fairness should be included in the evaluation of the model. In this work, we only considered statistical parity type fairness, but many other definitions are available, without any consensus on the best choice for such a definition from either a mathematical or a legal point of view. A strong research effort in data science is hence the key to a better use of Artificial Intelligence type algorithms. This will allow data scientists to describe precisely the design process of algorithms, as well as their behaviour, in terms of precision and propagation of bias.
In closing, note that biases are what enable machine learning algorithms to work, and the usefulness of complex algorithms is due to their ability to find hidden biases and correlations in very large data sets. Hence bias removal should be handled with care because one part of this information is crucial, while the other is harmful. Therefore, explainability should not be understood as explainability of the whole algorithm; rather, one line of future research in machine learning could focus on the explainability of the inner bias of an algorithm, or on its explainability with respect to legal regulations.
References
 [1] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks. ProPublica, 2016.
 [2] S. Barocas and A. D. Selbst. Big data’s disparate impact. Calif. L. Rev., 104:671, 2016.
 [3] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018.
 [4] D. Biddle. Adverse impact and test validation: A practitioner’s guide to valid and defensible employment testing. Gower Publishing, Ltd., 2006.
 [5] E. Black, S. Yeom, and M. Fredrikson. Fliptest: Fairness testing via optimal transport. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pages 111–121, New York, NY, USA, 2020. Association for Computing Machinery.
 [6] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91, 2018.
 [7] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
 [8] J. S. Cramer. The origins of logistic regression. 2002.
 [9] E. Del Barrio, P. Gordaliza, and J.-M. Loubes. A central limit theorem for Lp transportation cost on the real line with application to fairness assessment in machine learning. Information and Inference: A Journal of the IMA, 8(4):817–849, 2019.
 [10] W. Dieterich, C. Mendoza, and T. Brennan. Compas risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc, 2016.
 [11] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
 [12] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
 [13] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015.
 [14] A. W. Flores, K. Bechtel, and C.T. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks. Fed. Probation, 80:38, 2016.
 [15] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. ArXiv e-prints, February 2018.
 [16] P. Gordaliza, E. Del Barrio, F. Gamboa, and J.-M. Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365, 2019.
 [17] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.
 [18] U. Hébert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum. Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948, 2018.
 [19] M. Hurley and J. Adebayo. Credit scoring in the era of big data. Yale JL & Tech., 18:148, 2016.
 [20] F. Kamiran, T. Calders, and M. Pechenizkiy. Discrimination aware decision tree learning. In 2010 IEEE International Conference on Data Mining, pages 869–874, Dec 2010.
 [21] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2564–2572, 2018.
 [22] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent tradeoffs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
 [23] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
 [24] P. Lahoti, K. P. Gummadi, and G. Weikum. ifair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334–1345. IEEE, 2019.
 [25] Y. L’Horty, M. Bunel, S. Mbaye, P. Petit, and L. du Parquet. Discriminations dans l’accès à la banque et à l’assurance.
 [26] M. Mercat-Bruns. Discrimination at Work. University of California Press, 2016.
 [27] T. M. Mitchell et al. Machine learning, 1997.
 [28] S. B. Morris and R. E. Lobsenz. Significance tests and confidence intervals for the adverse impact ratio. Personnel Psychology, 53(1):89–111, 2000.
 [29] F. Pasquale. The black box society. Harvard University Press, 2015.
 [30] D. Pedreschi, S. Ruggieri, and F. Turini. A study of top-k measures for discrimination discovery. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, pages 126–131. ACM, 2012.
 [31] M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pages 469–481, New York, NY, USA, 2020. Association for Computing Machinery.
 [32] P. A Riach and J. Rich. Field experiments of discrimination in the market place. The economic journal, 112(483):F480–F518, 2002.

 [33] A. Romei and S. Ruggieri. A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review, 29(5):582–638, 2014.
 [34] R. Rothmann, J. Krieger-Lamina, and W. Peissl. Credit scoring in Österreich, 07 2014.
 [35] C. D. Sutton. Classification and regression trees, bagging, and boosting. Handbook of statistics, 24:303–329, 2005.
 [36] A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 1998.
 [37] B.P. Winrow and C. Schieber. The disparity between disparate treatment and disparate impact: An analysis of the ricci case. Academy of Legal, Ethical and Regulatory Issues, page 27, 2009.
 [38] M B Zafar, I Valera, M Gomez Rodriguez, and K P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.
Appendix A
A.1 The Adult Income dataset
Nº  Label  Possible values

1  Age  Real
2  workClass  Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
3  fnlwgt  Real
4  education  Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
5  educNum  Integer
6  mariStat  Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
7  occup  Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
8  relationship  Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
9  origEthn  White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
10  gender  Female, Male
11  capitalGain  Real
12  capitalLoss  Real
13  hoursWeek  Real
14  nativCountry  United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad and Tobago, Peru, Hong, Holand-Netherlands
15  income  >50K, <=50K
A.1.1 Data preparation
As discussed in the introduction of Section 2, the study started with a detailed preprocessing of the raw data to give a clearer interpretation to further analyses. First, we noticed that the variable fnlwgt (final sampling weight) does not have a very clear meaning, so it has been removed. For a complete description of this variable, see http://web.cs.wpi.edu/~cs4341/C00/Projects/fnlwgt. We have also performed a basic and multidimensional exploration (MFCA) in order to represent the possible sources of bias in the data in https://github.com/wikistat/FairML4EthicalAI/blob/master/AdultCensus/AdultCensusRbiasDetection.ipynb.
This exploration led to a deep cleaning of the data set and highlighted difficulties present in certain variables, raising the need to transform some of them before fitting any statistical model. In particular, we have deleted missing data, errors and inconsistencies; grouped together certain highly dispersed categories; and eliminated strong redundancies between certain variables. This phase differs notably from the strategy followed by [15], who analyze the raw data directly. Some of the main changes are listed below:

Variable 3 fnlwgt is removed since it has little significance for this analysis.

The binary variable child is created to indicate the presence or absence of children.

Variable 8 relationship is removed since it is redundant with gender and mariStat.

Variable 14 nativCountry is removed since it is redundant with variable origEthn.

Variable 9 origEthn is transformed into a binary variable: CaucYes vs. CaucNo.

Variable 4 education is removed as redundant with variable educNum.

Additionally cleanup the , , and in variable “Target”
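A few of the listed changes can be sketched on a single record stored as a dictionary keyed by the variable labels of the table above. This is a hedged illustration only: the helper `prepare` is hypothetical, the mapping of White to CaucYes is an assumption, and the construction of the child variable is left out because its definition is not given here.

```python
# Variables removed by the preparation step (labels from the table above).
DROPPED = ("fnlwgt", "relationship", "nativCountry", "education")


def prepare(row: dict) -> dict:
    """Apply the removals and the binarisation of origEthn to one record."""
    clean = {k: v for k, v in row.items() if k not in DROPPED}
    # Assumption: "White" is mapped to CaucYes, every other value to CaucNo.
    clean["origEthn"] = "CaucYes" if row["origEthn"] == "White" else "CaucNo"
    return clean


# Toy record (hypothetical values).
raw = {
    "age": 39,
    "fnlwgt": 77516,
    "education": "Bachelors",
    "educNum": 13,
    "relationship": "Not-in-family",
    "origEthn": "White",
    "gender": "Male",
    "nativCountry": "United-States",
}
prepared = prepare(raw)
```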
A.2 Testing lack of fairness and confidence intervals
Consider a random sample of independent and identically distributed variables. The previous criteria can be consistently estimated by their empirical versions. Yet the value of the criterion may depend on the data sample. Given the importance of obtaining reliable evidence of unfairness in a decision rule, confidence intervals are needed in order to control the error made when detecting unfairness. In the literature, this is often achieved by computing the mean over several samplings of the data. In the following, we provide the exact asymptotic behavior of the estimates in order to build confidence intervals.
Theorem A.1 (Asymptotic behavior of the Disparate Impact estimator)
Set the empirical estimator of DI(g) as
Then the asymptotic distribution of this quantity is given by
(A.1) 
where and
where we have denoted and .
Proof:
Consider for the random vectors
where and . Thus, has expectation
The elements of the covariance matrix of are computed as follows:
and finally,
From the Central Limit Theorem in dimension 4, we have that
Now consider the function
Applying the Delta method (see [36]) for the function , we conclude that
where .
Hence, we can provide a confidence interval when estimating the disparate impact over a data set. Actually is a confidence interval for the parameter asymptotically of level .
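As a reading aid, the quantities behind Theorem A.1 can be written out explicitly. The notation below is an assumption made for this sketch: S = 0 denotes the protected group, S = 1 the reference group, \pi_s = P(S = s) and p_s = P(g(X) = 1, S = s). Under these conventions, a plug-in estimator, the four-dimensional vectors to which the Central Limit Theorem is applied, the function used in the Delta method and the resulting asymptotic interval take the following form.

```latex
% Assumed notation (introduced for this sketch):
%   S = 0: protected group, S = 1: reference group,
%   \pi_s = P(S = s), \quad p_s = P(g(X) = 1, S = s), \quad s = 0, 1.
\[
T_n = \frac{\sum_{i=1}^{n} \mathbf{1}_{g(X_i)=1}\mathbf{1}_{S_i=0}\;\sum_{i=1}^{n} \mathbf{1}_{S_i=1}}
           {\sum_{i=1}^{n} \mathbf{1}_{g(X_i)=1}\mathbf{1}_{S_i=1}\;\sum_{i=1}^{n} \mathbf{1}_{S_i=0}},
\qquad
DI(g) = \frac{p_0\,\pi_1}{p_1\,\pi_0}.
\]
\[
Z_i = \bigl(\mathbf{1}_{g(X_i)=1}\mathbf{1}_{S_i=0},\ \mathbf{1}_{g(X_i)=1}\mathbf{1}_{S_i=1},\
            \mathbf{1}_{S_i=0},\ \mathbf{1}_{S_i=1}\bigr)^{T},
\qquad
E(Z_i) = (p_0,\ p_1,\ \pi_0,\ \pi_1)^{T}.
\]
\[
\Sigma_4 =
\begin{pmatrix}
p_0(1-p_0) & -p_0 p_1   & \pi_1 p_0   & -\pi_1 p_0 \\
-p_0 p_1   & p_1(1-p_1) & -\pi_0 p_1  & \pi_0 p_1  \\
\pi_1 p_0  & -\pi_0 p_1 & \pi_0\pi_1  & -\pi_0\pi_1 \\
-\pi_1 p_0 & \pi_0 p_1  & -\pi_0\pi_1 & \pi_0\pi_1
\end{pmatrix}.
\]
\[
\varphi(x_1, x_2, x_3, x_4) = \frac{x_1 x_4}{x_2 x_3},
\qquad
T_n = \varphi(\bar{Z}_n),
\qquad
\nabla\varphi\bigl(E(Z_1)\bigr) =
\Bigl(\frac{\pi_1}{p_1\pi_0},\ -\frac{p_0\pi_1}{p_1^{2}\pi_0},\
      -\frac{p_0\pi_1}{p_1\pi_0^{2}},\ \frac{p_0}{p_1\pi_0}\Bigr)^{T}.
\]
\[
\sigma^{2} = \nabla\varphi\bigl(E(Z_1)\bigr)^{T}\,\Sigma_4\,\nabla\varphi\bigl(E(Z_1)\bigr),
\qquad
\frac{\sqrt{n}}{\sigma}\bigl(T_n - DI(g)\bigr) \xrightarrow{d} N(0,1),
\qquad
T_n \pm \frac{\sigma}{\sqrt{n}}\, z_{1-\alpha/2}.
\]
```

Here z_{1-\alpha/2} is the standard normal quantile, and in practice \sigma is replaced by its plug-in estimate obtained from the empirical frequencies.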
The previous theorem can be used to test the presence of disparate impact at a given level.
(A.2) 
aims at checking if has Disparate Impact at level . We want to check whether . Under , the inequality holds, and so
Finally, from the inequality above and Eq. (A.1), we have that
as and, equivalently,
as , where is the quantile of . In conclusion, the test rejects at level when
When dealing with Equality of Odds, we want to study the asymptotic behavior of the estimators of the True Positive and True Negative rates across both groups. The reasoning is similar for the two rates, so we will only show the convergence of the True Positive rate estimator, denoted in the following by .
Theorem A.2
Set the following estimate of the True Positive rate of a classifier :
Then, the asymptotic distribution of this quantity is given by
(A.3) 
where and
where we have denoted and for
Proof of Theorem A.2: The proof follows the same lines as the previous proof. We set here
where and . From the Central Limit Theorem, we have that
with
(A.4) 
Now consider the function
Applying the Delta method for the function , we conclude that
where and .
A.3 Bootstrapping vs. direct calculation of the confidence interval
The estimation of the Disparate Impact is unstable. In this paper we promote the use of the theoretical confidence interval based on the well-known Delta method to control its variability. Contrary to [28], it does not rely on Gaussian approximation. We compare the stability of this confidence interval to bootstrap simulations; see for instance [12] for more details on bootstrap methods.
For this we build 1000 bootstrap replicates and estimate the disparate impact on each of them. Figure 9 presents the simulations. We can see that the bootstrap simulations remain in the confidence interval. Moreover, if we build a confidence interval for the bootstrap estimator, the two confidence intervals are the same. We obtain by the theoretical confidence interval while the bootstrap’s confidence interval is . Hence the theoretical confidence interval is a reliable measure of fairness for the data set and should be preferred due to its small computation time compared to the 1000 bootstrap replications.
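The comparison can be reproduced in spirit with a short, standard-library-only sketch. It assumes binary predictions and a binary sensitive variable coded 0 (protected) and 1 (reference), uses the plug-in Delta-method variance of the ratio of positive rates, and takes a percentile bootstrap interval. The function names and the toy sample are hypothetical, and the sketch is not the code used for Figure 9.

```python
import math
import random
from statistics import NormalDist


def di_estimate(pred, s):
    """Empirical disparate impact: positive rate in group 0 over group 1."""
    n0 = sum(1 for g in s if g == 0)
    n1 = len(s) - n0
    a0 = sum(1 for p, g in zip(pred, s) if p == 1 and g == 0)
    a1 = sum(1 for p, g in zip(pred, s) if p == 1 and g == 1)
    return (a0 / n0) / (a1 / n1)


def delta_ci(pred, s, alpha=0.05):
    """Asymptotic interval from the Delta method (needs p1 > 0 and both groups non-empty)."""
    n = len(s)
    pi0 = sum(1 for g in s if g == 0) / n
    pi1 = 1 - pi0
    p0 = sum(1 for p, g in zip(pred, s) if p == 1 and g == 0) / n
    p1 = sum(1 for p, g in zip(pred, s) if p == 1 and g == 1) / n
    # Gradient of phi(x1, x2, x3, x4) = x1 * x4 / (x2 * x3) at (p0, p1, pi0, pi1).
    grad = [pi1 / (p1 * pi0), -p0 * pi1 / (p1 ** 2 * pi0),
            -p0 * pi1 / (p1 * pi0 ** 2), p0 / (p1 * pi0)]
    # Covariance matrix of the four indicator variables.
    cov = [
        [p0 * (1 - p0), -p0 * p1, p0 * pi1, -p0 * pi1],
        [-p0 * p1, p1 * (1 - p1), -p1 * pi0, p1 * pi0],
        [p0 * pi1, -p1 * pi0, pi0 * pi1, -pi0 * pi1],
        [-p0 * pi1, p1 * pi0, -pi0 * pi1, pi0 * pi1],
    ]
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(4) for j in range(4))
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var / n)
    t = p0 * pi1 / (p1 * pi0)
    return t - half, t + half


def bootstrap_ci(pred, s, alpha=0.05, replicates=1000, seed=0):
    """Percentile bootstrap interval over resampled (prediction, group) pairs."""
    rng = random.Random(seed)
    n = len(s)
    stats = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(di_estimate([pred[i] for i in idx], [s[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * replicates)], stats[int((1 - alpha / 2) * replicates) - 1]


# Toy sample (hypothetical): 20% positives in group 0, 50% in group 1.
pred = [1] * 20 + [0] * 80 + [1] * 50 + [0] * 50
s = [0] * 100 + [1] * 100
delta = delta_ci(pred, s)
boot = bootstrap_ci(pred, s)
```

The Delta-method interval needs a single pass over the data, whereas the bootstrap interval requires one pass per replicate, which is the computational argument made above.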
Note that in this paper, for the sake of clarity, we have chosen to focus only on the disparate impact criterion. Yet all other fairness criteria should be reported together with a confidence interval. For instance, in [9] we propose confidence intervals for the Wasserstein distance, which is used in many methods in fair learning.
A.4 Application to other real datasets
To illustrate these tests we have also considered two other well-known real data sets.

German Credit data. This data set is often claimed to exhibit some origin discrimination in the granting of credit by the German bank. Hence we compute the disparate impact w.r.t. Origin. We obtain
Hence confidence intervals play an important role here. Actually the disparate impact is not statistically significantly lower than 0.8, which entails that discrimination in the decision rule of the German bank cannot be shown and supports the use of a proper confidence interval.

COMPAS Recidivism data. A third data set is composed of the data of the controversial COMPAS score detailed in [10]. The data consist of 7214 offenders with personal variables observed over two years. A score predicts their level of dangerousness, which determines whether they can be released, while a variable indicates whether there has been recidivism. Hence the recidivism of offenders is predicted using a score and examined for possible racial discrimination, race being the protected attribute. The protected variable separates the population into Caucasian and non-Caucasian. To evaluate the level of discrimination, we first compute the disparate impact with respect to the true variable and to the COMPAS score seen as a predictor.
In both cases, the data are biased but the level of discrimination is low. Yet, as mentioned in all the studies on this data set, the level of prediction errors differs significantly according to the ethnic origin of the defendant. Actually the conditional accuracy scores and their corresponding confidence intervals clearly show the unbalanced treatment received by both populations.
This unbalanced treatment is clearly assessed with the confidence interval.