0%

Ensemble-Blending

1. Introduction

Blending 方法可以看是一个简化版的 Stacking 方法,blend 的中文意思是 “混合”,这就很好地诠释了 Blending 方法的思想 —— 不同模型的混合,也就是集成的本意。该方法仅在验证集中进行预测,如下为 Blending 的步骤:

  1. 将数据划分为 train set 和 test set,其中,train set 需要进一步被细分为 Training set 和 validation set (见 Fig. 1 (i),图片来源:Analytics vidhya);
  2. 构建多个模型 (第一层),并使用细分后的 Training set 作为输入数据来训练这些模型,并用训练好的模型来预测 validation set 和 test set,得到 val_pred 和 test_pred;
  3. 再次构建模型(第二层),并将 val_pred 视为训练集来训练该层的模型(见 Fig. 1 (ii);
  4. 用训练好的模型来对测试集 (test_pred) 进行预测,得到预测结果 test prediction 记为整个 Blending 集成学习的预测结果。
Fig.1 Bleding procedutre

2. Sample

2.1 Steps

为了对 Blending 有进一步的了解,参考 Analytics vidhya, 构建了一个两层的 Blending 模型,其中第一层由决策树和 KNN 组成,对这两个模型用 Training set 进行预测,并用训练好的模型预测 Valitation set 和 test set。之后构建一个逻辑回归利用上述两个预测结果进行预测训练。具体步骤如下:

  • Import Module

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    import numpy as np
    import pandas as pd

    from sklearn import tree
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import train_test_split

    from sklearn import datasets
    from sklearn.datasets import make_blobs
  • Creating Datasets

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    data, target = make_blobs(n_samples = 10000, centers = 2,
    random_state = 1, cluster_std = 1.0 )

    # Creatint test set and temporary train set, test_percent = 20%
    x_train1,x_test,y_train1,y_test = train_test_split(data,
    target, test_size = 0.2, random_state = 1)
    # Spliting temporary train set into train set and validation set, validation_percent = 80%*30%
    x_train,x_val,y_train,y_val = train_test_split(X_train1,
    y_train1, test_size = 0.3, random_state = 1)

    print("The shape of training X:", x_train.shape)
    print("The shape of training y:", y_train.shape)
    print("The shape of test set:", x_test.shape)
    print("The shape of test y:", y_test.shape)
    print("The shape of validation X:", x_val.shape)
    print("The shape of validation y:", y_val.shape)

    Result:

    1
    2
    3
    4
    5
    6
    The shape of training X: (5600, 2)
    The shape of training y: (5600,)
    The shape of test set: (2000, 2)
    The shape of test y: (2000,)
    The shape of validation X: (2400, 2)
    The shape of validation y: (2400,)

  • First Layer

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    model1 = tree.DecisionTreeClassifier()
    model1.fit(x_train, y_train)
    val_pred1=model1.predict(x_val)
    test_pred1=model1.predict(x_test)

    val_pred1=pd.DataFrame(val_pred1)
    test_pred1=pd.DataFrame(test_pred1)


    model2 = KNeighborsClassifier()
    model2.fit(x_train,y_train)
    val_pred2=model2.predict(x_val)
    test_pred2=model2.predict(x_test)

    val_pred2=pd.DataFrame(val_pred2)
    test_pred2=pd.DataFrame(test_pred2)
  • Second Layer

    1
    2
    3
    4
    df_val=pd.concat([pd.DataFrame(x_val), val_pred1, val_pred2],axis=1)
    df_test=pd.concat([pd.DataFrame(x_test),test_pred1,test_pred2],axis=1)
    model = LogisticRegression()
    model.fit(df_val,y_val)
  • Forecasting Results

    1
    2
    model.score(df_test,y_test)
    cross_val_score(model,df_test,y_test,cv=10)

    Result:

    1
    2
    1.0
    array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

2.2 Complete codes

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# Inporting modules
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn import datasets
from sklearn.datasets import make_blobs

# Creating datasets
data, target = make_blobs(n_samples = 10000, centers = 2,
random_state = 1, cluster_std = 1.0 )

# Creatint test set and temporary train set, test_percent = 20%
x_train1,x_test,y_train1,y_test = train_test_split(data,
target, test_size = 0.2, random_state = 1)
# Spliting temporary train set into train set and validation set, validation_percent = 80%*30%
x_train,x_val,y_train,y_val = train_test_split(X_train1,
y_train1, test_size = 0.3, random_state = 1)

# Constructing model
## First layer
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1=model1.predict(x_val)
test_pred1=model1.predict(x_test)

val_pred1=pd.DataFrame(val_pred1)
test_pred1=pd.DataFrame(test_pred1)

model2 = KNeighborsClassifier()
model2.fit(x_train,y_train)
val_pred2=model2.predict(x_val)
test_pred2=model2.predict(x_test)

val_pred2=pd.DataFrame(val_pred2)
test_pred2=pd.DataFrame(test_pred2)

## Second layer
df_val=pd.concat([pd.DataFrame(x_val), val_pred1, val_pred2],axis=1)
df_test=pd.concat([pd.DataFrame(x_test),test_pred1,test_pred2],axis=1)
model = LogisticRegression()
model.fit(df_val,y_val)

# Print forecasting results
model.score(df_test,y_test)
cross_val_score(model,df_test,y_test,cv=10)
-------------This blog is over! Thanks for your reading-------------