Multiple Instance Learning
I have been preparing for the seminar today. It is a report that my tutor specifically asked me to do. It is a paper on "weakly supervised Discriminatory Location and Classification: A Joint Learning Process".After reading it for the first time, I felt that the so-called weakly supervised seems to be just another way of saying it.Before looking at more papers, this conclusion has yet to be verified.In any case, multi-instance learning is a big hole with far-reaching influence. Many impressive and important work directly or indirectly apply the idea of multi-instance.Among them, I think the most dazzling should be Felzenszwalb Daniel's Object
Detection with Discriminatively trained part based models》。In that article, inside called this model latent SVM, which is actually a variant of MILSVM.
"Learning is the process of acquiring knowledge, technology, attitude or value through teaching or experience, thus leading to measurable and stable behavioral changes. More precisely, it is the process of establishing new spiritual structures or examining past spiritual structures."
inside, I think the most appropriate place to say the above passage is "learning needs to look at the past".For computers, the so-called past is training samples, and machine learning is a process of obtaining classifiers or fitting functions according to training samples.And supervised learning means that each training sample has its own mark.When this mark represents the classification category, what you learn is the classification function.When this mark represents a certain continuous value, the learning result is the fitting function.More complicated, each tag can be a set, and what we learn at this time is a structured prediction function.
In many cases, not every training sample is marked, but only a part is marked, while the other part is only the data itself, and the mark is missing.At this time, it is necessary to combine unmarked data with marked data for machine learning, which is semi-supervised learning.In other cases, even all the samples are not marked. This is unsupervised learing.But in fact, unsupervised learing can't carry out the classification and fitting tasks with clear objectives, and can only do some analytical tasks, such as clustering, PCA, dictionary learning, etc.
Multiple Instance Problem
We consider a kind of training data, this data is marked, marking only two categories, positive and negative.However, this time the target of marking is not a sample, but a packet (bag).One or several data together are called a bag, each bag has its own tag.When the mark of a bag is negative, the marks of all samples in the bag are negative.When the mark of a bag is positive, the mark of at least one sample in this bag is positive.Our goal is to learn to get a classifier, so that the new input samples can be given its positive and negative marks.Such a kind of problem is a multi-example problem.
This problem is very common in practical applications. For example, when President Fang built the Great Wall, he had to list some prohibited words to prevent everyone from searching. He felt that it was too troublesome to input one by one, so he could find some yellow or reactionary websites and directly use them as positive sample packages: one of the words in the websites is always prohibited.Then take the healthy People's Daily website as a negative sample package: none of the words in it are illegal.For example, when marking training picture samples, a rectangular box is needed to indicate the position of the target, which may not be accurate enough to cause misalignment between different samples. At this time, some local perturbations can be made to the marked rectangular box to obtain some new rectangular boxes, which can be regarded as a bag together. One of them is always the best positive sample, that is, the mark is positive.And take a picture without a target as a negative sample package: no matter how the picture is captured inside, it is a negative sample.
As for how to solve the multi-sample problem, if all the sample markers are known, it is a supervised learning problem that can be done with SVM, adaboost, etc.The difficulty now is that we do not know the marks of many samples.It doesn't matter for the negative sample package. Each sample in it is marked with a negative mark. This is clear.The problem lies in the positive sample package. inside of each positive sample package can only guarantee that one sample is positive, while the other is negative. The key is which sample is positive?This is also unclear.
The solution to this problem is actually quite straightforward: alternative optimization.That is to say, we first assume that we already know the marks of all samples, then we can obtain a classification model by some supervised learning method. Through this model we can predict each training sample, and then update their marks. We can use the newly obtained marks to retrain the classification model again.Therefore, the whole optimization process is divided into two parts: supervised learning and mark updating.
There are still some points to note here:
First, when training the supervised learning model, only the predicted "most like correct" (that is, the highest classification score) one is selected from the positive sample package inside. Other samples in the positive sample package are not needed, regardless of whether the prediction is positive or negative.This is because, in fact, the multi-sample problem can also be described as that the "most correct" sample mark in the positive sample package is positive and has nothing to do with other samples.Therefore, this choice strategy is exactly in line with the definition of the problem.
Second, if there are enough negative samples, only the one predicted to be "most like correct" in each negative sample package can be selected as the negative sample for training. Such negative samples are also called hard sample or most violated sample.In practice, they are most effective for fast convergence of the model.
Then a simple flowchart is given below:
input: data matrix x = [x1, x2, xn],
Package mark y = [y1, y2, ym], relationship between package and sample I = [i1, I2, im]
output: classification function f
initialize the sample i∈Ij in each tag packet j to the tag yi=Yj of the packet, the initialization set u is empty, and all samples are added to the sample set u.
Repeat the following procedure
taking data xi of samples of i∈U and marking yi to train to obtain a classification function f
F is used to predict the marker yi of all samples
For each positive marker packet, select the sample with the highest F prediction score to join the set U.
For each negative marker packet, select the sample with the highest F prediction score to join the set U (or take some samples with higher scores, or take all samples to join U, depending on whether the negative samples are sufficient)
until the end condition is satisfied