Covariate Shift Phenomenon and Its Solution
Published: 2019-04-28

When recently rereading the "Batch Normalization" paper, I noticed that it repeatedly mentions the concept of "covariate shift"; batch normalization was proposed precisely to address the covariate shift phenomenon inside neural networks (especially deeper ones). I was very interested in this concept, so I took some time to look up related material and summarize what I learned here.

First of all, let me explain what the covariate shift phenomenon is. It means that the data distribution of the training set is inconsistent with the data distribution of the prediction set. In this case, a classifier trained on the training set will generally not perform well on the prediction set. This problem of inconsistent sample distributions between the training set and the prediction set is called the "covariate shift" phenomenon. For example, suppose I want to train a model that judges from a blood sample whether a person has a blood disease. The negative samples are naturally collected from blood-disease patients, but the positive samples must be drawn sensibly so that they match the distribution of the whole population. If blood is collected only from people in one specific group (for example, school students) as positive samples, the final model can hardly perform well across all population groups, because that group covers only a small part of the normal population the model will actually be asked to predict on. (This phenomenon is also very common in transfer learning.)
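Stated a little more formally (this formalization is the standard one, which I am adding for clarity, using the same notation as the reweighting section below, with $q$ the training distribution and $p$ the prediction distribution): covariate shift means the input distribution changes while the labeling rule stays the same,

$$q(x) \neq p(x), \qquad P_{\mathrm{train}}(y \mid x) = P_{\mathrm{test}}(y \mid x).$$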

To solve the "covariate shift" problem, we give each data point in the training set a new weight, i.e., a reweighting operation. For a sample $x_i$, let its density under the training-set distribution be $q(x_i)$ and its density under the true prediction-set distribution be $p(x_i)$; its new weight is then $\frac{p(x_i)}{q(x_i)}$. The problem now becomes how to determine the distributions of sample $x_i$ in the training set and the prediction set. The method used here is particularly clever, and it is itself a machine learning method: Logistic Regression. Randomly draw samples from the training set and the prediction set, and label each sample according to its source: samples from the prediction set are labeled $z=1$, and samples from the training set are labeled $z=-1$. Split this data into a new training set and test set, train the model on the new training set, and then look at its performance on the new test set. If the performance is good, the model can distinguish data from the original training set and prediction set well, which means their distributions are inconsistent, and vice versa. The specific calculation is as follows:
$$P(z=1 \mid x_i) = \frac{p(x_i)}{p(x_i) + q(x_i)}$$

Here $z=1$ indicates that the sample comes from the prediction-set distribution $p$, and $z=-1$ indicates that it comes from the training-set distribution $q$. After the Logistic Regression classifier is trained, we have

$$P(z=1 \mid x_i) = \frac{1}{1 + e^{-f(x_i)}},$$

from which it is easy to derive the reweighting factor for sample $x_i$:

$$\frac{p(x_i)}{q(x_i)} = \frac{P(z=1 \mid x_i)}{P(z=-1 \mid x_i)} = e^{f(x_i)},$$

where $f(x_i)$ is the logit of the classifier we trained.
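As a minimal sketch of this trick in Python with scikit-learn (the data here is synthetic, and I assume equal numbers of samples from the two sets so that the class prior does not distort the ratio; with unequal counts the weights would need an extra factor of $n_{\mathrm{train}}/n_{\mathrm{test}}$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: q(x) is the training distribution, p(x) the (shifted)
# prediction distribution.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # samples from q(x)
X_test = rng.normal(loc=0.5, scale=1.0, size=(1000, 2))   # samples from p(x)

# Label each sample by its source: z=1 for the prediction set, z=0 for the
# training set (sklearn expects labels in {0, 1} rather than {-1, 1}).
X = np.vstack([X_test, X_train])
z = np.concatenate([np.ones(len(X_test)), np.zeros(len(X_train))])

clf = LogisticRegression().fit(X, z)

# f(x) is the classifier's logit, so the importance weight of each training
# sample is p(x)/q(x) = P(z=1|x)/P(z=0|x) = e^{f(x)}.
weights = np.exp(clf.decision_function(X_train))
```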

It might seem that the solution to the covariate shift problem is now complete. In fact, there is a big prerequisite: we need an index to judge whether the covariate shift phenomenon has occurred in the first place (sample reweighting is only needed after covariate shift has been detected; otherwise it is unnecessary). The index used here is called MCC (Matthews Correlation Coefficient). It is essentially a correlation coefficient between the true and predicted classes (here it is applied to the source classifier above, which distinguishes training-set data from prediction-set data), with a value in $[-1, 1]$: 1 means strong positive correlation, 0 means no correlation, and -1 means strong negative correlation. Its calculation is based on the confusion matrix. Here are the relevant concepts of the confusion matrix:
TP (True Positive): actual 1, predicted 1
FN (False Negative): actual 1, predicted 0
FP (False Positive): actual 0, predicted 1
TN (True Negative): actual 0, predicted 0
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
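As a quick sanity check of this formula (a small sketch with made-up labels; scikit-learn's built-in `matthews_corrcoef` is used to cross-check the hand computation):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, sklearn's confusion matrix flattens to (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# The built-in implementation should agree (both give 0.5 here).
assert np.isclose(mcc, matthews_corrcoef(y_true, y_pred))
```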
(PS: there are several metrics for measuring binary-classification performance: ACC (accuracy), Rec (recall), F-score, AUC, and MCC, each suited to its own application scenarios.)
By computing the MCC, the usual rule of thumb is: if the value is greater than 0.2, the source classifier separates the training set from the prediction set fairly well, which means the two distributions are inconsistent and the covariate shift phenomenon has occurred; if it is less than 0.2, there is no covariate shift.
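Putting the pieces together, here is a hedged end-to-end sketch (the function name, the held-out split, and the threshold handling are my own choices; as before it assumes roughly equal sample counts from the two sets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

def covariate_shift_weights(X_train, X_test, threshold=0.2):
    """Detect covariate shift with a source classifier; if its held-out MCC
    exceeds `threshold` (the 0.2 rule of thumb from the text), return
    importance weights for X_train, otherwise uniform weights."""
    X = np.vstack([X_test, X_train])
    z = np.concatenate([np.ones(len(X_test)), np.zeros(len(X_train))])

    # Hold out part of the data to measure how separable the two sets are.
    X_fit, X_val, z_fit, z_val = train_test_split(
        X, z, test_size=0.25, stratify=z, random_state=0
    )
    clf = LogisticRegression().fit(X_fit, z_fit)
    mcc = matthews_corrcoef(z_val, clf.predict(X_val))

    if mcc > threshold:  # the distributions differ: covariate shift detected
        return np.exp(clf.decision_function(X_train)), mcc
    return np.ones(len(X_train)), mcc
```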