Extremely Randomized Trees (Extra-Trees)

Latest 07-03

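Last time we covered random forests. This time, let's look at a less widely known ensemble algorithm: extremely randomized trees (Extra-Trees).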

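As discussed, a random forest is a classifier made up of many decision trees, each built and evaluated independently of the others. During training, each tree draws a bootstrap sample (sampling with replacement) from the original training data to form its own training set. At every decision node, the split test is picked from a randomly generated set of candidate tests: the candidate that scores best under some quantitative criterion, such as information entropy, becomes the node's split. None of the trees in a random forest needs to be pruned.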
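The two ingredients mentioned above, bootstrap sampling and an entropy criterion, can be sketched in a few lines of plain Python. This is only an illustration on toy data; the helper names `bootstrap_sample` and `entropy` are made up for this example and are not part of any library.

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


rng = random.Random(222)
rows = list(range(10))

# Some rows will typically appear more than once and others not at all.
sample = bootstrap_sample(rows, rng)
print(sample)

# A 50/50 node is maximally impure; a pure node has zero entropy.
print(entropy([0, 0, 1, 1]))
print(entropy([1, 1, 1, 1]))
```

A split is considered good when the child nodes it produces have lower entropy than the parent node.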

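Extra-Trees is likewise a classifier built from an ensemble of decision trees. Compared with a random forest classifier, it differs in two main ways: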

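First, it does not use bootstrap sampling with replacement; each tree is trained on the original training sample directly, with the aim of reducing bias.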

Second, at each decision node of each tree, the threshold of the split test is chosen at random rather than optimized.
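To make the second difference concrete, here is a rough sketch for a single feature. It contrasts an exhaustive threshold search (random-forest style) with a threshold drawn uniformly at random (Extra-Trees style). It uses Gini impurity as the criterion, and every function name is hypothetical. In a real Extra-Trees implementation one random threshold is drawn for each of several candidate features and the best of those random splits is kept; that step is omitted here.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Random-forest style: try every midpoint and keep the purest split."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
rng = random.Random(222)

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print("searched threshold:", t_best, "impurity:", split_impurity(values, labels, t_best))
print("random threshold:", t_rand, "impurity:", split_impurity(values, labels, t_rand))
```

The random threshold is cheaper to obtain because nothing is searched, at the cost of a split that is usually less pure for any single tree.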

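In general, an Extra-Trees classifier is faster to train than a random forest classifier, because it skips the search for the optimal threshold, and its classification accuracy is often comparable or better, although this depends on the data.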
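One way to check this claim on a given machine is to time both models on the same data. The sketch below uses scikit-learn's `ExtraTreesClassifier` and `RandomForestClassifier` on synthetic data from `make_classification`; the data set, the sample size, and `n_estimators=100` are all assumptions chosen to keep the run short, so the numbers it prints say nothing about real data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 222

# Synthetic binary data standing in for a real data set (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)

models = [
    ("extra-trees", ExtraTreesClassifier(n_estimators=100, random_state=SEED)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=SEED)),
]

results = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)
    # model.bootstrap shows the default sampling strategy of each estimator.
    results[name] = (model.bootstrap, accuracy, elapsed)
    print(f"{name}: bootstrap={model.bootstrap}, "
          f"accuracy={accuracy:.3f}, fit time={elapsed:.2f}s")
```

The printed `bootstrap` flag reflects the first difference described above: by default scikit-learn's Extra-Trees trains each tree on the whole training set, while the random forest resamples it.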

Import the basic packages

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

Load the data and set the random seed

df = pd.read_csv("modell.csv")

SEED = 222

Define the dependent (response) variable and the independent variables, then split the data into training and test sets

y=df["y"]

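# col_names was never defined; assumption: the first column of the file is y and the rest are features

col_names=df.columns.tolist()

x_col=col_names[1:]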

x=df[x_col]

xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.3,random_state=SEED)
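As a quick sanity check of what the split does, the following sketch runs `train_test_split` twice on a small made-up array. It shows the 70/30 proportion that `test_size=0.3` produces and that fixing `random_state` makes the split repeatable. The toy arrays are assumptions and have nothing to do with the article's CSV file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 222

# 100 toy rows with 2 features each, plus alternating 0/1 labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)

print(a_train.shape, a_test.shape)

# The same seed selects the same rows both times.
print(np.array_equal(a_test, b_test))
```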

Import the Extra-Trees estimator from sklearn, along with the plotting package

from sklearn.ensemble import ExtraTreesRegressor

import matplotlib.pyplot as plt

Train with Extra-Trees

etr=ExtraTreesRegressor(n_estimators=1000)

etr=etr.fit(xtrain, ytrain)

Inspect and print the feature importances

importances = etr.feature_importances_

indices = np.argsort(-importances)

# indices holds the feature indices sorted by importance in descending order

for f in range(x.shape[1]):

    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
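The ranking step can be tried in isolation on synthetic data. The sketch below fits an `ExtraTreesRegressor` on data from `make_regression`, pairs each importance with a column name, and prints them in descending order. The column names `f0` to `f4` and the data are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

SEED = 222

# Synthetic regression data; only 2 of the 5 features carry signal.
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, random_state=SEED
)
names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical column names

model = ExtraTreesRegressor(n_estimators=100, random_state=SEED).fit(X, y)
importances = model.feature_importances_

# argsort on the negated array gives indices in descending order of importance.
order = np.argsort(-importances)
ranked = [(names[i], float(importances[i])) for i in order]

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name} ({score:.4f})")

# The importances are normalized, so they add up to 1.
print("sum of importances:", importances.sum())
```

Printing names instead of bare indices makes the ranking easier to read when the data has many columns.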

Plot the feature importances

s=[x_col[i] for i in indices]

plt.figure(figsize=(20,6))

# figsize is optional; set it to whatever figure size you like

plt.title("Feature importances")

plt.bar(range(x.shape[1]), importances[indices],color="b", align="center")

plt.xticks(range(x.shape[1]), s)

# s holds the feature names sorted by descending importance, as built above

plt.xlim([-1, x.shape[1]])

plt.show()

Predict on the test set and plot the confusion matrix

from sklearn.metrics import confusion_matrix

predictions=etr.predict(xtest)

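predictions=(predictions>=0.5).astype(int)

# the predictions come back as floats, so threshold them at 0.5 to get 0/1 labels (assumes y is coded 0/1); testing for ==1 would only count rows where every tree predicted 1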

cm=confusion_matrix(ytest, predictions)

# stored under a new name so the confusion_matrix function is not overwritten; rows are true labels, columns are predicted labels

plt.matshow(cm)

plt.title("predict")

plt.colorbar()

plt.ylabel("real")

for i in range(cm.shape[0]):  # data labels

    for j in range(cm.shape[1]):

        # matshow draws row i on the vertical axis and column j on the horizontal axis

        plt.annotate(str(cm[i, j]), xy=(j, i), horizontalalignment="center", verticalalignment="center", color="red")

plt.show()
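The layout of the matrix is easy to get backwards, so here is a tiny worked example with made-up labels and scores. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. The 0.5 cut-off used to turn float scores into labels is an assumption that fits a 0/1 target.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and float scores, as a regressor might return them.
y_true = np.array([0, 0, 1, 1])
raw_scores = np.array([0.1, 0.7, 0.9, 0.5])

# Threshold the scores to get hard 0/1 predictions: [0, 1, 1, 1].
y_pred = (raw_scores >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here `cm[0, 1]` counts the one sample whose true label is 0 but which was predicted as 1, so an annotation for that cell belongs at row 0, column 1 of the plotted image.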

Add a random forest for comparison

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000)

rf=rf.fit(xtrain, ytrain)

predictions2=rf.predict(xtest)

predictions2=(predictions2>=0.5).astype(int)

from sklearn.metrics import confusion_matrix

cm2=confusion_matrix(ytest, predictions2)

plt.matshow(cm2)

plt.title("predict")

plt.colorbar()

plt.ylabel("real")

for i in range(cm2.shape[0]):  # data labels

    for j in range(cm2.shape[1]):

        # row i is the true label (vertical axis), column j is the predicted label (horizontal axis)

        plt.annotate(str(cm2[i, j]), xy=(j, i), horizontalalignment="center", verticalalignment="center", color="red")

plt.show()

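On this data, Extra-Trees and the random forest give broadly similar predictions. That also depends on the sample, so try both on your own data and compare.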
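For a comparison on your own data that is less sensitive to one particular train/test split, cross-validation is an option. The helper below, `compare_forests`, is a hypothetical function written for this example. It uses scikit-learn's `cross_val_score` with the classifier variants of both models and assumes a classification target; the synthetic data is only there to make the snippet runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

SEED = 222


def compare_forests(X, y, n_estimators=100, cv=5):
    """Return the mean cross-validated accuracy of both models."""
    models = {
        "extra_trees": ExtraTreesClassifier(
            n_estimators=n_estimators, random_state=SEED
        ),
        "random_forest": RandomForestClassifier(
            n_estimators=n_estimators, random_state=SEED
        ),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv).mean())
        for name, model in models.items()
    }


# Synthetic stand-in data; replace X and y with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)
scores = compare_forests(X, y)
print(scores)
```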
