Unsupervised feature learning really works

Feature selection

Feature selection is an important topic in feature engineering (another important topic is feature extraction). It is often said that data and features determine the upper bound of machine learning performance, and that models and algorithms merely approach that upper bound. Feature engineering, and feature selection in particular, therefore occupies a very important position in machine learning. In general, feature selection means choosing the set of features that yields the best performance for a particular model and algorithm. The following methods are commonly used in practice:
1. Compute the correlation between each feature and the response variable: The most common approach in practice is to compute the Pearson correlation coefficient and the mutual information coefficient. The Pearson coefficient can only measure linear correlation, while the mutual information coefficient can capture many kinds of correlation but is more complicated to compute. Fortunately, many toolkits provide it (e.g. MINE). Once the correlations are obtained, the features can be ranked and selected.
2. Build a single-feature model and rank features by each model's accuracy: Note that a JMLR'03 paper introduced a feature selection method based on decision trees that is essentially equivalent. Once the target features are selected, they are used to train the final model.
3. Select features via L1 regularization: L1 regularization yields sparse solutions, so it performs feature selection naturally. Note, however, that a feature not selected by L1 is not necessarily unimportant, since of two highly correlated features only one may be retained. To determine whether a feature really matters, cross-check with L2 regularization.
4. Train a preselected model that can score features: Both RandomForest and LogisticRegression can assign scores to features; after selecting the relevant features by score, train the final model on them.
5. Select features after feature combination: For example, combine the user ID with other user features to obtain a much larger feature set, then select from it. This is common in recommendation and advertising systems, and is the main source of features at the billion scale. The reason is that user data is relatively sparse, and combined features can capture both a global model and a personalized model. This topic deserves a discussion of its own.
6. Feature selection through deep learning: This method is becoming common with the popularity of deep learning, especially in image processing, because deep learning automatically learns features; this is why deep learning is also called unsupervised feature learning. After selecting the features of a particular neural layer from a deep learning model, the final target model can be trained on them.
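As a minimal sketch of three of the methods above (correlation, L1 regularization, and model-based scoring), here is how they might look with scikit-learn on synthetic data; the dataset, thresholds, and hyperparameters are illustrative assumptions, not from the original text.

```python
# Sketch of three feature-selection methods on synthetic data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

# Method 1: correlation with the response (Pearson, linear only).
pearson_scores = [abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])]

# Method 3: L1 regularization -- sparse weights act as a selector.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_selector = SelectFromModel(l1_model).fit(X, y)

# Method 4: a model that scores features -- random-forest importances.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print("top features by Pearson:", np.argsort(pearson_scores)[::-1][:4])
print("kept by L1:", np.flatnonzero(l1_selector.get_support()))
print("top by RF importance:", np.argsort(rf.feature_importances_)[::-1][:4])
```

In practice the three rankings often disagree on weak features, which is exactly why the text recommends cross-checking (e.g. against L2) before discarding anything.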

General methods for processing features: discretization and feature combination.


In discretization, a numerical feature is mapped into a number of fixed intervals. For example, a score from 0 to 100 can be discretized into four grades A, B, C, and D, which are then one-hot encoded into four 0/1 features:
A is [1, 0, 0, 0]
B is [0, 1, 0, 0]
C is [0, 0, 1, 0]
D is [0, 0, 0, 1]
The first digit indicates whether the grade is A, the second whether it is B, and so on.
The role this plays is to reduce overfitting. After all, the abilities of two students scoring 95 and 96 need not differ, whereas an A student is clearly different from a D student. In effect, a linear function is converted into a step function.
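The grade discretization above can be sketched in a few lines; the cut-offs (A >= 90, B >= 80, C >= 60, else D) are an assumption for illustration, since the text does not specify the bin edges.

```python
# Bucket a 0-100 score into grades A-D and one-hot encode the grade.
def grade_one_hot(score):
    # Assumed cut-offs: A >= 90, B >= 80, C >= 60, else D.
    grades = ["A", "B", "C", "D"]
    if score >= 90:
        g = 0
    elif score >= 80:
        g = 1
    elif score >= 60:
        g = 2
    else:
        g = 3
    vec = [0, 0, 0, 0]
    vec[g] = 1
    return grades[g], vec

print(grade_one_hot(95))  # ('A', [1, 0, 0, 0])
print(grade_one_hot(96))  # same bucket as 95: ('A', [1, 0, 0, 0])
```

Note that 95 and 96 land in the same bucket and thus get identical feature vectors, which is exactly the overfitting-reduction effect described above.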

Another example: split the vehicle speed into multiple bands, say one per 10 km/h, like this:

Suppose the target we want to learn is fuel consumption.

Here are measurements from a popular domestic model:
At a constant speed of 120 km/h, fuel consumption is 7.81 liters/100 km
At a constant speed of 90 km/h, fuel consumption is 5.86 liters/100 km
At a constant speed of 60 km/h, fuel consumption is 4.12 liters/100 km
At a constant speed of 30 km/h, fuel consumption is 4.10 liters/100 km

Obviously, fuel consumption is not linear in speed, so a linear model will not work without discretization. Think about it carefully: with such a discretization you can actually approximate any function.

Feature combination is also known as feature crossing.

Synthetic features and feature crosses are not identical: feature crosses are a subset of synthetic features.

Synthetic features

A synthetic feature is not present in the input features but is derived from one or more of them. Features created solely through normalization or scaling do not count as synthetic features. Synthetic features include the following types:

  1. Multiplying a feature by itself or by other features (called a feature cross).
  2. Dividing one feature by another.
  3. Bucketing a continuous feature, i.e. splitting it into multiple intervals (bins).
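The three types of synthetic features listed above can be sketched on toy numpy arrays; the values and bin edges are made up for illustration.

```python
# The three synthetic-feature types on toy data.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([10.0, 20.0, 30.0])

crossed = x1 * x2                           # 1. multiply features (a cross)
ratio = x1 / x2                             # 2. divide two features
buckets = np.digitize(x1, bins=[1.5, 2.5])  # 3. bucketize into intervals

print(crossed, ratio, buckets)
```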

Feature crosses: encoding nonlinear laws

Consider the following nonlinear problem: no single line drawn by a linear learner can predict the health of a tree well.

To solve this nonlinear problem, you can create a feature cross. A feature cross is a synthetic feature that encodes nonlinear laws in the feature space by multiplying two or more input features. The term "cross" comes from the cross product. We create a feature cross named x3 by crossing x1 and x2:
x3 = x1x2

We'll treat this newly created x3 feature combination like any other feature. The linear formula is:
y = b + w1x1 + w2x2 + w3x3

Although w3 encodes nonlinear information, you do not need to change the training procedure of the linear model to determine its value.
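To make this concrete, here is a sketch on XOR-like toy data (the four points and labels are assumed for illustration): no line in the (x1, x2) plane separates the classes, but adding the crossed feature x3 = x1 * x2 lets an unmodified linear model do it.

```python
# A plain linear model separates XOR-like data once the crossed
# feature x3 = x1 * x2 is added as an ordinary input column.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Label is positive exactly when x1 and x2 have the same sign.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, 0, 0])

x3 = (X[:, 0] * X[:, 1]).reshape(-1, 1)  # the crossed feature
X_crossed = np.hstack([X, x3])

model = LogisticRegression().fit(X_crossed, y)
print(model.predict(X_crossed))  # classifies all four points correctly
```

The training algorithm is the standard one; only the input representation changed, which is the whole point of the paragraph above.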

Types of feature crosses
Linear models can be trained efficiently with stochastic gradient descent, so augmenting scaled linear models with feature crosses has long been an effective way to train on large data sets. We can create many different kinds of feature crosses, for example:

[A x B]: a feature cross formed by multiplying the values of two features.
[A x B x C x D x E]: a feature cross formed by multiplying the values of five features.
[A x A]: a feature cross formed by squaring the value of a single feature.
Feature crosses: combine one-hot vectors
In practice, machine learning models rarely cross continuous features. However, they often cross one-hot feature vectors, treating such crosses as logical conjunctions. For example, suppose we have two features: country and language. One-hot encoding each feature produces vectors of binary features that can be interpreted as country=USA, country=France, or language=English, language=Spanish. When you cross these one-hot encodings, you get binary features that can be interpreted as logical conjunctions, such as:

country: usa AND language: spanish
As another example, suppose you bin latitude and longitude, producing five-element one-hot feature vectors. For instance, a given latitude and longitude can be represented as follows:

binned_latitude = [0, 0, 0, 1, 0]
binned_longitude = [0, 1, 0, 0, 0]
Suppose you create a feature combination of these two feature vectors:

binned_latitude X binned_longitude
This feature cross is a 25-element one-hot vector (24 zeros and a single 1). The single 1 identifies a particular conjunction of latitude and longitude, and your model can then learn the specific relevance of that conjunction.
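Mechanically, crossing two one-hot vectors is just their outer product, flattened; a quick sketch using the two example vectors above:

```python
# Crossing two one-hot vectors = outer product, flattened.
import numpy as np

binned_latitude = np.array([0, 0, 0, 1, 0])
binned_longitude = np.array([0, 1, 0, 0, 0])

cross = np.outer(binned_latitude, binned_longitude).ravel()
print(cross.sum(), cross.size)  # a single 1 among 25 elements
```

The position of the 1 (here index 3*5 + 1 = 16) uniquely encodes the (latitude bin, longitude bin) pair.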

Suppose we bin latitude and longitude more coarsely, as shown below:

binned_latitude(lat) = [
  0  < lat <= 10
  10 < lat <= 20
  20 < lat <= 30
]

binned_longitude(lon) = [
  0  < lon <= 15
  15 < lon <= 30
]
When you create feature crosses of these coarse bins, the resulting synthetic features have the following meanings:

binned_latitude_X_longitude(lat, lon) = [
  0  < lat <= 10 AND 0  < lon <= 15
  0  < lat <= 10 AND 15 < lon <= 30
  10 < lat <= 20 AND 0  < lon <= 15
  10 < lat <= 20 AND 15 < lon <= 30
  20 < lat <= 30 AND 0  < lon <= 15
  20 < lat <= 30 AND 15 < lon <= 30
]
For example, suppose our model needs to predict how satisfied dog owners are with their dogs based on the following two features:

Behavior type (barking, crying, cuddling, etc.)
Time of day
If we create the following feature cross from these two features:

[behavior type X time of day]
then its predictive power will far exceed that of either feature on its own. For example, if the dog cries (happily) when its owner returns from work at 5 p.m., this is likely a strong positive predictor of owner satisfaction. If the dog whines (perhaps miserably) when its owner is sleeping at 3 a.m., this is likely a strong negative predictor.

Linear learners scale well to large amounts of data, and using feature crosses on large data sets is an effective strategy for learning highly complex models. Neural networks offer another such strategy.