On How AI Talks Nonsense with a Straight Face

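Leiphone AI Technology Review note: In our previous article we briefly introduced GPT-2, OpenAI's new language model. It brings no real technical breakthrough, but thanks to its very large parameter count and training set, the text it generates or continues after unsupervised training is remarkably good, almost indistinguishable from human writing.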

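The technical side may be unremarkable, but the text the model generates is a lot of fun. At first glance it is fluent, organized into paragraphs, and full of specific detail; read closely, though, and plenty of it makes no sense. This habit of talking nonsense with a straight face kept us laughing. Take a look with Leiphone AI Technology Review.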





Example 1

Human-written prompt

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

(AI Technology Review: an absurd opening for a story. How would a human continue it?)

Model continuation (result of the 10th attempt)

The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.


Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

(The University of La Paz, Dr. Pérez, the Andes Mountains... hmm, the model fills in more and more detail, as if all of this had really happened.)

Pérez and the others then ventured further into the valley. "By the time we reached the top of one peak, the water looked blue, with some crystals on top," said Pérez.

(This setup has a hint of a mystery story about it.)

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

(A wonderful close-up!)

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, "We can see, for example, that they have a common 'language,' something like a dialect or dialectic."

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, "In South America, such incidents seem to be quite common."

(An obvious contradiction with the earlier context shows up here, but South America truly is a land full of mysteries.)

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. "But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization," said the scientist.

(Even the model knows to reach for DNA techniques. The twenty-first century really is the age of biology.)



Example 2

Human-written prompt

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.


Model continuation (result of the 1st attempt)

The incident occurred on the downtown train line, which runs from Covington and Ashland stations.


In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief.


"The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation," said Tom Hicks, the U.S. Energy Secretary, in a statement. "Our top priority is to secure the theft and ensure it doesn't happen again."

(Apart from some odd word choices, this reads exactly like a statement from a government department.)

The stolen material was taken from the University of Cincinnati's Research Triangle Park nuclear research site, according to a news release from Department officials.


The Nuclear Regulatory Commission did not immediately release any information.


According to the release, the U.S. Department of Energy's Office of Nuclear Material Safety and Security is leading that team's investigation.


"The safety of people, the environment and the nation's nuclear stockpile is our highest priority," Hicks said. "We will get to the bottom of this and make no excuses.

(A forceful ending. Writing at this level would have no trouble passing a school composition assignment.)


Example 3

Human-written prompt (this is OpenAI's own introduction to the GPT-2 model, a piece of technical writing)

We've trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.

A typical approach to language modeling is to learn the following task: predict the next word, given all of the previous words within some text. Last year, OpenAI's Generative Pre-trained Transformer (GPT) showed that language models trained on large amounts of data can be fine-tuned to specific tasks to achieve high performance. GPT-2 shows that much larger language models trained on a more diverse dataset derived from the internet begin to learn these NLP tasks without needing task-specific training data, instead learning from examples the system derives from the raw text. These systems also display a substantial qualitative jump in the realism and coherence of generated text.

Model continuation (result of the 1st attempt)

Here you can see the most recent progress with Generative Pre-trained Transformer:

Figure 1: Generative Pre-trained Transformer training on several texts.

(Standard equipment for technical writing: a figure!)

We are now preparing a collection of datasets for translation and machine translation in our language model. We will be using one of the large number of text samples provided by The New York Times.


We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text.

(There really is nothing much wrong with this paragraph.)

We hope for future collaborations between computer scientists, linguists, and machine learning researchers.




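At the address where the model is open-sourced, OpenAI provides many more texts generated by the full model, including many unconditional samples (that is, text generated from scratch, not as the continuation of a prompt). Their styles cover how-to guides, news reports, running logs of events, novels and short stories, translation lessons, product reviews, (complete) product and membership purchase pages, emails, travel guides, and more. Because the model's training data comes from all kinds of web pages, it is only natural that the generated texts reproduce these styles and structures. These samples were also not cherry-picked, so their actual content is often jumbled.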
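To make the difference between continuing a prompt and unconditional generation concrete, here is a toy sketch. It is not GPT-2: it assumes a tiny bigram model that looks only at the previous word, whereas GPT-2 conditions on all previous words. The corpus, the `START` marker, and the function names `train_bigrams` and `generate` are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

# Toy corpus; purely illustrative, not GPT-2's training data.
CORPUS = (
    "the scientists found a herd of unicorns . "
    "the unicorns spoke perfect english . "
    "the scientists spoke to the unicorns ."
)

START = "<s>"  # hypothetical start-of-text marker


def train_bigrams(text):
    """Map each word to the list of words observed right after it."""
    words = [START] + text.split()
    followers = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return followers


def generate(followers, prompt=None, max_words=12, seed=0):
    """Sample up to max_words new words, one at a time.

    With a prompt, sampling continues from the prompt's last word
    (continuation). Without one, it starts from START (unconditional).
    """
    rng = random.Random(seed)
    words = prompt.split() if prompt else []
    prev = words[-1] if words else START
    for _ in range(max_words):
        candidates = followers.get(prev)
        if not candidates:  # no observed follower: stop early
            break
        prev = rng.choice(candidates)
        words.append(prev)
    return " ".join(words)


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    print("continuation :", generate(model, prompt="the unicorns", seed=1))
    print("unconditional:", generate(model, seed=2))
```

Changing `seed` plays the role of a fresh attempt, loosely like the 1st or 10th attempts labeled in the examples above: the same prompt can yield a different continuation each time.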


Example 4 (a piece of IT news)

More than 95 percent of iPhone 6 preorders being held by Apple's INPEX, Shanghai

Apple now joins current Apple partner Pegatron(s tan), which already employed the same iPhone generation strategy to ensure ample supply for iPhones. (Subheading: a summary of the content.)

TOP MOBILE NEWS: Apple TM already used generation-from-graphics approach to keep iPhone supply robust (The main headline appears: news about Apple's supply chain.)

Pegatron, the world's largest contract electronics manufacturer for Apple, can leverage its experience in supplying parts and components for major electronics companies like Nokia and Sony. In addition, it can rely on the redundancies provided by a single contract. (Pegatron appears again, tying back to the earlier mention.)


Pegatron already secured 90 percent of iPhone 6 preorders from INPEX at the beginning of October, an executive from INPEX told Reuters. "Whole Julination has to cover one iPhone 6 unit," he said. "This week, I don't have the capacity for 4 people to sell a unit."

Of course, it's unbelievably tough to find new qualities when rain doesn't fall. So when it comes to the iPhone's next generation, basically every chipmaker will be utilizing "higher-end characteristics" in part of the production process in order to deliver up to 30 percent more processing power for more impressive performance and battery life. And it's not just a matter of using new packaging materials; Apple will be using a second generation LCD touchpanels as well.



