Want to train neural networks on Google's resources for free? A detailed Colab tutorial

Knowledge 04-27

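1 Introduction

Colab is Google's in-house interactive Python environment, similar to Jupyter Notebook. It needs no installation, lets you switch quickly between Python 2 and Python 3 runtimes, works with the Google suite (TensorFlow, BigQuery, Google Drive, and so on), and lets you pip install any custom library.

URL: https://colab.research.google.com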

2 Installing and using libraries

Colab ships with common deep learning libraries such as TensorFlow, Matplotlib, NumPy, and Pandas. If you need another dependency, such as Keras, create a new code cell and enter:

# Install the latest version of Keras
# https://keras.io/
!pip install keras

# Install a specific version
!pip install keras==2.0.9

# Install OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python

# Install PyTorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision

# Install XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost

# Install 7Zip support (libarchive)
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive

# Install GraphViz and PyDot
!apt-get -qq install -y graphviz && pip install -q pydot
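Before installing anything, it can help to check what the runtime already provides. The following is a minimal sketch, assuming a Python 3.8+ runtime (the standard-library importlib.metadata module is not available in Python 2); the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report which of the libraries from this section still need a pip install.
for name in ["keras", "xgboost", "pydot"]:
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed, run `!pip install {name}` in a cell")
    else:
        print(f"{name}: {version} already installed")
```

The output depends on the runtime, so no expected output is shown here.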

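3 Google Drive file operations

Authorizing and logging in

For a given notebook, you only need to log in once; after that you can read and write files.

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authorize and log in; authentication is only requested the first time
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Running this code prints an authorization prompt. Click the link to authorize and log in, paste the token you receive into the input box, and press Enter to complete the login.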

Listing a directory

# List all files in the root directory
# For the "q" query syntax, see https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({"q": "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print("title: %s, id: %s, mimeType: %s" % (file1["title"], file1["id"], file1["mimeType"]))

The console prints:

title: Colab 測試, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LSkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document
title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder

The id is the unique identifier used to fetch a file in the rest of this tutorial. The mimeType shows that the first entry is a Google Docs document, while Colab Notebooks is a folder (the root directory where Colab stores its notebooks). To list the files inside the Colab Notebooks folder, put the folder's id in the query:

# "<folder id>" in parents
file_list = drive.ListFile({"q": "'1U9363A12345TP2nSeh2K8FzDKSsKj5Jj' in parents and trashed=false"}).GetList()
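Both queries above follow the same pattern, so the string can be built by a small helper. This is only a sketch: children_query is a hypothetical name, not part of PyDrive, and it covers only the two conditions used in this tutorial. The returned string is what you would pass as the "q" value to drive.ListFile.

```python
def children_query(folder_id="root", include_trashed=False):
    """Build a Drive "q" string that matches the direct children of a folder."""
    # Single quotes inside a quoted value have to be escaped with a backslash.
    safe_id = folder_id.replace("'", "\\'")
    query = "'{}' in parents".format(safe_id)
    if not include_trashed:
        query += " and trashed=false"
    return query


# The root directory, then the Colab Notebooks folder from the listing above.
print(children_query())
print(children_query("1U9363A12345TP2nSeh2K8FzDKSsKj5Jj"))
```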

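Reading file contents

So far, the format I have tested that can be read directly is .txt (mimeType: text/plain). The code:

file = drive.CreateFile({"id": "replace with the id of your .txt file"})
file.GetContentString()

For a .csv file, GetContentString() only printed the first row of data, so use GetContentFile() instead:

file = drive.CreateFile({"id": "replace with the id of your .csv file"})
# This download only caches the file in the working environment; it does not add another copy to your Google Drive
file.GetContentFile("iris.csv", "text/csv")

# Print the file contents directly
with open("iris.csv") as f:
    print(f.readlines())

# Read it with pandas
import pandas as pd
pd.read_csv("iris.csv", index_col=[0, 1], skipinitialspace=True)

Colab renders the result directly as a table (the original figure showed the first few rows of the iris dataset). The iris dataset is available at http://aima.cs.berkeley.edu/data/iris.csv ; readers following along can download it and upload it to their own Google Drive.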
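Once the file is cached locally, any ordinary file API can read it, not only pandas. Here is a minimal Python 3 sketch using the standard-library csv module. So that it runs outside Colab, it first writes a two-row stand-in file; those two rows are the ones printed in the Google Sheets section of this tutorial, and the absence of a header row is an assumption.

```python
import csv
import os
import tempfile

# Two sample rows in the iris layout: four measurements, then the species.
sample = "5.1,3.5,1.4,0.2,setosa\n4.9,3,1.4,0.2,setosa\n"

# Stand-in for the file that GetContentFile("iris.csv", ...) would cache.
path = os.path.join(tempfile.mkdtemp(), "iris.csv")
with open(path, "w", newline="") as f:
    f.write(sample)


def read_csv_rows(csv_path):
    """Return every non-empty row of a CSV file as a list of string fields."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f) if row]


rows = read_csv_rows(path)
print(len(rows), "rows; first row:", rows[0])
```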

Writing a file

# Create a text file
uploaded = drive.CreateFile({"title": "example.txt"})
uploaded.SetContentString("Test content")
uploaded.Upload()
print("The id of the created file is {}".format(uploaded.get("id")))

For more operations, see http://pythonhosted.org/PyDrive/filemanagement.html

4 Google Sheets operations

Authorizing and logging in

For a given notebook, you only need to log in once; after that you can read and write sheets.

!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

Reading

For this demo, import the iris.csv data into a new Google Sheet named iris. It can live in any directory of your Google Drive.

worksheet = gc.open("iris").sheet1

# Get a list of rows:
# [[row 1 col 1, row 1 col 2, ..., row 1 col n], ..., [row n col 1, row n col 2, ..., row n col n]]
rows = worksheet.get_all_values()
print(rows)

# Read it with pandas
import pandas as pd
pd.DataFrame.from_records(rows)

The print call outputs:

[["5.1", "3.5", "1.4", "0.2", "setosa"], ["4.9", "3", "1.4", "0.2", "setosa"], ...
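As the printed output shows, every cell comes back as a string, so numeric columns need converting before any computation. This is a small Python 3 sketch, assuming each row holds four measurements followed by a species name as in the output above; parse_iris_rows is a hypothetical helper name.

```python
def parse_iris_rows(rows):
    """Convert rows of strings into (features, species) pairs."""
    parsed = []
    for row in rows:
        # Everything except the last field is a measurement.
        *measurements, species = row
        parsed.append(([float(value) for value in measurements], species))
    return parsed


rows = [["5.1", "3.5", "1.4", "0.2", "setosa"], ["4.9", "3", "1.4", "0.2", "setosa"]]
for features, species in parse_iris_rows(rows):
    print(features, species)
```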

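Writing

sh = gc.create("Google Sheet")

# Open the spreadsheet and its first worksheet
worksheet = gc.open("Google Sheet").sheet1
cell_list = worksheet.range("A1:C2")

# Fill every cell in the range with a random integer from 1 to 10
import random
for cell in cell_list:
    cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)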
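To see which cells a range such as "A1:C2" covers, it may help to expand the A1 notation by hand. The sketch below is standalone Python 3 and is not gspread's own code; the function names are invented for illustration, and the row-by-row ordering is an assumption about how the cell list is laid out.

```python
import re


def column_to_index(letters):
    """Convert a column label such as "A" or "AB" to a 1-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def expand_a1_range(a1_range):
    """List the (row, column) pairs covered by a range like "A1:C2", row by row."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", a1_range)
    if match is None:
        raise ValueError("expected a range such as 'A1:C2'")
    first_col, first_row, last_col, last_row = match.groups()
    cols = range(column_to_index(first_col), column_to_index(last_col) + 1)
    rows = range(int(first_row), int(last_row) + 1)
    return [(row, col) for row in rows for col in cols]


# "A1:C2" spans 2 rows and 3 columns, so the loop above fills 6 cells.
print(expand_a1_range("A1:C2"))
```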

5 Downloading files to your local machine

from google.colab import files

with open("example.txt", "w") as f:
    f.write("Test content")
files.download("example.txt")

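6 Hands-on example

This section uses my open-source LSTM text classification project on GitHub as an example: https://github.com/Jinkeycode/keras_lstm_chinese_document_classification . Store the three files from the project directory in Google Drive. The example classifies titles into three categories: health, technology, and design.

Creating the notebook

Create a new Python 2 notebook in Colab.

Installing dependencies

!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder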

Loading the data

Authorizing and logging in

# Install the PyDrive library; this only needs to run once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
    # Authorize and log in; authentication is only requested the first time
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    return drive

Listing all files in Google Drive

def list_file(drive):
    file_list = drive.ListFile({"q": "'root' in parents and trashed=false"}).GetList()
    for file1 in file_list:
        print("title: %s, id: %s, mimeType: %s" % (file1["title"], file1["id"], file1["mimeType"]))

drive = login_google_drive()
list_file(drive)

Caching the data in the working environment

def cache_data():
    # Replace each id with the matching file id listed in the previous step
    health_txt = drive.CreateFile({"id": "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"})
    tech_txt = drive.CreateFile({"id": "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
    design_txt = drive.CreateFile({"id": "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
    # This download only caches the files in the working environment; it does not add extra copies to your Google Drive
    health_txt.GetContentFile("health.txt", "text/plain")
    tech_txt.GetContentFile("tech.txt", "text/plain")
    design_txt.GetContentFile("design.txt", "text/plain")
    print("Cached successfully")

cache_data()

Reading the data from the working environment

def load_data():
    titles = []
    print("Loading health titles...")
    with open("health.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loading technology titles...")
    with open("tech.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loading design titles...")
    with open("design.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())
    print("Loaded %s titles in total" % len(titles))
    return titles

titles = load_data()
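The three read blocks in load_data differ only in the file name, so the same result can be produced with one loop. Below is a Python 3 sketch of that idea (the tutorial notebook itself is Python 2); load_titles is a hypothetical name, and the tiny stand-in files exist only so that the sketch runs without the real data.

```python
import os
import tempfile


def load_titles(paths):
    """Read every line from each file, in order, into one list of stripped titles."""
    titles = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            titles.extend(line.strip() for line in f)
    return titles


# Tiny stand-in files (made-up contents) in place of the cached data set.
workdir = tempfile.mkdtemp()
paths = []
for name, lines in [("health.txt", ["h1", "h2"]), ("tech.txt", ["t1"]), ("design.txt", ["d1"])]:
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    paths.append(path)

titles = load_titles(paths)
print("Loaded %d titles" % len(titles))
```

The order of the paths matters: the labels built in the next step assume health titles come first, then technology, then design.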

Loading the labels

def load_label():
    # 12000 health titles (class 0), 12000 technology titles (class 1), 7318 design titles (class 2)
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("Loaded %s labels in total" % target.shape)
    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    # One-hot encode the class ids
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)
    return dummy_target

target = load_label()
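To make the shape of target concrete, here is a dependency-free Python sketch of the same idea: one class id per title, then one-hot encoding. It illustrates what the encoding produces and is not how Keras implements to_categorical; build_labels and one_hot are names chosen for this example.

```python
def build_labels(class_counts):
    """Return one class id per sample, with the classes laid out back to back."""
    labels = []
    for class_id, count in enumerate(class_counts):
        labels.extend([class_id] * count)
    return labels


def one_hot(labels, num_classes):
    """Turn each class id into a list with a single 1 at that index."""
    return [[1 if index == label else 0 for index in range(num_classes)] for label in labels]


# Same class sizes as load_label: 12000 health, 12000 technology, 7318 design.
labels = build_labels([12000, 12000, 7318])
encoded = one_hot(labels, num_classes=3)
print(len(encoded), encoded[0], encoded[12000], encoded[-1])
```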

Text preprocessing

max_sequence_length = 30
embedding_size = 50

# Segment each title into words
titles = [".".join(jb.cut(t, cut_all=True)) for t in titles]

# Map each word to an integer id, giving one fixed-length sequence of max_sequence_length ids per title
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))

# Read the word-to-id mapping
dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(dict.items(), key=lambda x: x[1])
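The vocabulary step turns each segmented title into a fixed-length list of integer ids. The following is a simplified stand-in written in plain Python, not the tf.contrib implementation. Reserving id 0 for padding and unknown words, and keeping words seen at least min_frequency times, are assumptions of this sketch; the real class may treat the threshold differently.

```python
from collections import Counter


def fit_vocabulary(token_lists, min_frequency=1):
    """Assign ids starting at 1 to tokens seen at least min_frequency times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if counts[token] >= min_frequency and token not in vocab:
                vocab[token] = len(vocab) + 1  # id 0 is kept for padding/unknown
    return vocab


def to_ids(tokens, vocab, max_length):
    """Map tokens to ids, truncate to max_length, and right-pad with 0."""
    ids = [vocab.get(token, 0) for token in tokens[:max_length]]
    return ids + [0] * (max_length - len(ids))


# Made-up, already segmented titles.
docs = [["deep", "learning", "tips"], ["design", "tips"]]
vocab = fit_vocabulary(docs)
print(vocab)
print([to_ids(doc, vocab, max_length=5) for doc in docs])
```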

Building the neural network

The first two layers are an Embedding layer and an LSTM layer, and a softmax activation produces the output.

# Configure the network architecture
def build_network(num_vocabs):
    model = krs.Sequential()
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

num_vocabs = len(dict.items())
model = build_network(num_vocabs=num_vocabs)

import time
start = time.time()
# Train the model
model.fit(text_processed, target, batch_size=512, epochs=10)
finish = time.time()
print("Training took %f seconds" % (finish - start))
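For a sense of how large this network is, the trainable parameters of each layer can be counted by hand. This sketch uses the standard formulas for an embedding table, an LSTM layer with biases, and a dense layer. The vocabulary size of 10000 is a hypothetical value chosen only for the arithmetic; the real value depends on the data.

```python
def embedding_params(num_vocabs, embedding_size):
    """One embedding vector per vocabulary entry."""
    return num_vocabs * embedding_size


def lstm_params(input_size, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_size * units + units * units + units)


def dense_params(input_size, units):
    """A weight matrix plus one bias per output unit."""
    return input_size * units + units


num_vocabs = 10000  # hypothetical vocabulary size, not taken from the tutorial
total = embedding_params(num_vocabs, 50) + lstm_params(50, 32) + dense_params(32, 3)
print("embedding:", embedding_params(num_vocabs, 50))
print("lstm:", lstm_params(50, 32))
print("dense:", dense_params(32, 3))
print("total:", total)
```

With embedding_size=50 and 32 LSTM units, the LSTM layer has 4 * (50 * 32 + 32 * 32 + 32) = 10624 parameters and the dense layer has 32 * 3 + 3 = 99, so the embedding table dominates for any realistic vocabulary.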

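Predicting a sample

You can replace sen with your own sentence. The prediction is [probability of health, probability of technology, probability of design]. The class with the highest probability is the predicted category, but when the highest probability is below 0.8 the article is judged unclassifiable.

# Sample title, roughly "Tips you need to learn to do commercial design well"
sen = "做好商業設計需要學習的小技巧"
sen_processed = " ".join(jb.cut(sen, cut_all=True))
sen_processed = vocab_processor.transform([sen_processed])
sen_processed = np.array(list(sen_processed))
result = model.predict(sen_processed)

catalogue = list(result[0]).index(max(result[0]))
threshold = 0.8
if max(result[0]) > threshold:
    if catalogue == 0:
        print("This is an article about health")
    elif catalogue == 1:
        print("This is an article about technology")
    elif catalogue == 2:
        print("This is an article about design")
else:
    print("This article has no reliable classification")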
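The decision rule above can be pulled out into a small function, which makes it easy to try on different probability vectors without a trained model. This is a sketch in plain Python; classify and the English label names are choices made for this example, and the probabilities below are invented.

```python
def classify(probabilities, labels=("health", "technology", "design"), threshold=0.8):
    """Return the most probable label, or None when no class exceeds the threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] > threshold:
        return labels[best_index]
    return None


# Invented probability vectors standing in for model.predict(...)[0].
print(classify([0.05, 0.10, 0.85]))
print(classify([0.40, 0.35, 0.25]))
```

Returning None for the low-confidence case keeps the "no reliable classification" branch explicit for the caller.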

That is the end of the tutorial. You can now start using Google's resources to build your own neural networks.
