Kaggle Last User Classification Problem
Published: 2019-05-15


Reference Model

The key point of this project is that it has a large number of discrete features. The usual way to handle a discrete dimension is to turn each level of that dimension into its own column, much like a row-to-column pivot in SQL, so that the value in each new column is only 0 or 1. This inevitably leads to a dimension explosion. This project is a typical case: after joining the user table and the activity table with the merge function, there are a large number of discrete dimensions. One method for dealing with too many dimensions is the "Hash Trick" (feature hashing).
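To make the 0/1 expansion described above concrete, here is a minimal sketch in base R. The data frame and its level names are made up for illustration and do not come from the competition data.

```r
# Toy data: one discrete column with six levels (hypothetical values).
edu <- data.frame(
  education = factor(c("graduate", "undergraduate", "junior_college",
                       "senior_high", "junior_high", "primary"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column.
onehot <- model.matrix(~ education - 1, data = edu)

dim(onehot)       # one row per observation, one column per level
rowSums(onehot)   # each row contains exactly one 1
```

With six levels this is harmless, but a column with thousands of levels produces thousands of columns, which is the explosion the text refers to.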

Suppose the discrete dimension is the user's educational background. After each level is split into its own dimension, there are the following dimensions:

Graduate or above, undergraduate, junior college, senior high school, junior high school, primary school

There is a hash function whose size is 5:


Because the size of the hash function is 5, every result falls in the range 0 to 4; no result can be greater than or equal to 5.

After dimension reduction, the original 6 features (graduate or above, undergraduate, junior college, senior high school, junior high school, primary school) become 5 features, and the value of each feature is the number of times that hash result appears:


0 appears twice, 1 appears 0 times, 2 appears twice, and so on.
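The idea can be sketched in base R as follows. The function toy_hash is a hypothetical stand-in (sum of character codes modulo the hash size), not the hash used by any real library, so the bucket counts it produces will generally differ from the ones quoted above.

```r
levels_edu <- c("graduate or above", "undergraduate", "junior college",
                "senior high school", "junior high school", "primary school")

# Hypothetical toy hash: sum of the character codes, modulo the hash size.
# Real implementations use a proper hash function instead.
toy_hash <- function(x, size) {
  vapply(x, function(s) sum(utf8ToInt(s)) %% size, numeric(1), USE.NAMES = FALSE)
}

hash_size <- 5
bucket <- toy_hash(levels_edu, hash_size)   # every value lies in 0..4

# Count how many of the six levels land in each of the five buckets.
counts <- tabulate(bucket + 1, nbins = hash_size)
names(counts) <- 0:(hash_size - 1)
counts
```

Six levels are squeezed into five buckets, so at least two levels must share a bucket. Such collisions are the price paid for the fixed, smaller number of columns.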

The reference model contains the following formula:

f <- ~ . - people_id - activity_id - date.x - date.y - 1

A word of explanation: because the function hashed.model.matrix is only used to build the dimension-reduced feature matrix, it does not need a dependent variable, so there is nothing to the left of the tilde. The dot stands for all columns, and each minus sign removes one of them. The final "- 1" removes the intercept term; without it, hashed.model.matrix generates an extra first column, which appears to be this constant intercept.
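The formula syntax can be tried out without the hashing step. The sketch below uses base R's model.matrix as a stand-in for hashed.model.matrix, on a made-up data frame whose column names (char_1, char_2) are only assumptions for illustration.

```r
# Toy stand-in for the merged user/activity table (all values hypothetical).
df <- data.frame(
  people_id   = c("ppl_1", "ppl_2", "ppl_3", "ppl_4"),
  activity_id = c("act_1", "act_2", "act_3", "act_4"),
  char_1      = factor(c("a", "b", "a", "b")),
  char_2      = factor(c("x", "y", "z", "x"))
)

# One-sided formula: nothing on the left of "~".
# "." means every column, "- name" removes a column, "- 1" removes the intercept.
f <- ~ . - people_id - activity_id - 1

with_intercept    <- model.matrix(~ . - people_id - activity_id, data = df)
without_intercept <- model.matrix(f, data = df)

colnames(with_intercept)[1]   # "(Intercept)"
colnames(without_intercept)   # only columns derived from char_1 and char_2
```

The id columns never reach the feature matrix, and the constant first column disappears once "- 1" is added.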

param <- list(objective = "binary:logistic",
              eval_metric = "auc",
              booster = "gblinear",
              eta = 0.03)

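The param list above holds the parameters that are finally passed to xgboost. The objective "binary:logistic" is the familiar logistic regression for binary classification, and eval_metric = "auc" scores the model by the area under the ROC curve. eta is the step size that scales each boosting update to the weights.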

The booster parameter can be gblinear or gbtree; these are to be introduced later.
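As a rough sketch of how such a parameter list is used, the snippet below trains on random toy data rather than on the hashed competition matrix. The data, the number of rounds (20), and the requireNamespace guard are all assumptions added for illustration; the reference model's actual training call is not shown in this post.

```r
param <- list(objective = "binary:logistic",
              eval_metric = "auc",
              booster = "gblinear",
              eta = 0.03)

pred <- NULL
# Run only if the xgboost package is installed.
if (requireNamespace("xgboost", quietly = TRUE)) {
  set.seed(1)
  # Hypothetical toy data: 200 rows, 5 numeric features, binary label.
  X <- matrix(rnorm(200 * 5), nrow = 200)
  y <- as.numeric(X[, 1] + rnorm(200) > 0)

  dtrain <- xgboost::xgb.DMatrix(data = X, label = y)
  model  <- xgboost::xgb.train(params = param, data = dtrain, nrounds = 20)

  # With objective "binary:logistic" the predictions are probabilities.
  pred <- predict(model, dtrain)
}
```

A smaller eta makes each round's update more cautious, so it usually needs more rounds to reach the same fit.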