The importance of human-curated data enrichment for big data analysis

Matt Toussant, Ph.D.

Senior VP, Product & Content Operations, CAS



Big data analytics is playing an increasingly important role in the advancement of chemical science. With the world's scientific data increasingly stored in digital format, and data collection accelerating, big data is only going to get bigger. According to IBM Marketing Cloud, 90% of the world's data was created in the last two years alone.

That's good news for business and research, particularly in the chemical sciences, where there's already a well-established framework for publishing and sharing scientific data. With more data comes further real-world intelligence with which to make better decisions, improve outcomes and enrich lives. Of course, to turn this raw data into information, and this information into insight, it's essential that scientific data is organized, refined and enriched in the right way.


What is big data enrichment?

Data enrichment is all about associating, enhancing and refining raw data to improve its quality and utility. Effective enrichment goes beyond simply minimizing errors and improving data accuracy. It involves organizing, curating, associating and extrapolating highly complex information repositories, turning vast 'data lakes' into organized reservoirs composed of 'pipelines' and associated knowledge graphs with underlying ontologies ready for sampling. Ultimately, the goal of enrichment is to drive the discovery of associated clusters, relationships and optimal semantic ontologies within these collections, revealing the new insights needed to draw conclusions, make truly strategic decisions and, potentially, make informed predictions.

Enriched big data analytics delivers new insights (and can even predict the future)

The analysis of enriched big data and associated knowledge graphs is helping researchers, entrepreneurs and business leaders make sense of the vast collection of published chemical science data to generate new insights and achieve better outcomes. From papers and patents to chemical structures and competitor strategies, big data analysis enables users to connect the dots, revealing what's trending and where the next opportunity might lie.

These tools aren't simply helping to deliver insights faster; they're helping to predict the future. The analysis of enriched big data allows entrepreneurs and business innovators to get the inside track on the competitive landscape, evaluate a company's strengths and weaknesses and inform business strategy. Big data could also let you find the paths to successful commercialization of research earlier than ever before. Equally, it may be possible to identify, based on what's known today, when the commercial opportunities associated with a particular innovation will peak.

The biotech sector is a thriving technology transfer space where the analysis of enriched big data is set to play a key role. In this rapidly expanding field, big data analysis is helping to cluster patent and publication data around categories of biologics, targets, therapeutic indications and manufacturers to understand the competitor landscape and link treatments to opportunities. In turn, this helps track where the field is moving, uncover novel research opportunities and guide researchers toward the best route to success.


Reliable big data enrichment in science still requires human intelligence

Enrichment is essential for getting the most out of big data. However, given the sheer volume of scientific data now available, ensuring that the resulting insight is reliable and high-quality has become a challenge.

Scientific data is unique in its complexity. Chemical structures and names, ranged values, and graphs and charts are just a few of the elements of scientific information that make algorithmic structuring and extraction difficult. The quality of the relationships derived from big data repositories ultimately comes down to the robustness of the analytical models used to create them. Today, computational algorithms and statistical analyses are widely used to enrich big data. But despite their importance for data enrichment, neural networks, deep learning models and machine learning tools can only take us so far. The analytical models necessary to obtain useful insight from scientific data are complex and nuanced – and they must be supported by expert insight.

When it comes to interpreting complex studies and finding innovative connections between disparate chemical data, human intellect is still a critical component. Experienced chemists, biochemists and data scientists can analyze data and offer insights that no artificial intelligence system can.

At CAS, hundreds of experts across the chemical science fields carefully curate and enrich scientific information by identifying and collecting key ideas, substances, reactions, properties and much more in published data. These 'scientists serving science' read the literature daily, amassing a wealth of knowledge that assists them in uncovering insights and trends not found by technology alone. The resulting high-quality, enriched 'data lake', when combined with advanced data analytics tools, plays an increasingly important role in driving business strategy and the commercialization of scientific innovation.


Copyright © 2018 Chemical Abstracts Service, a division of the American Chemical Society.
