Anubhav Jain

Lawrence Berkeley National Laboratory, USA

Capturing and Leveraging Materials Science Knowledge from Millions of Journal Articles using Natural Language Processing Techniques

Abstract: Historically, both data and knowledge (connections and conclusions based on data) in the materials domain have been recorded mainly as text, figures, or tables in journal articles. Unfortunately, this vast treasure trove of knowledge is difficult to use because it requires reading articles manually, which is impossible on a large scale. In this talk, I will describe some of our efforts to extract information from the research literature automatically using natural language processing techniques. The talk will summarize our most recent progress towards extracting both individual data items and "knowledge" (e.g., proposed applications of a chemical composition) in various areas. I will also touch upon our efforts to extract data from figures automatically. Finally, I will describe how this work can feed into machine learning efforts as well as into other materials databases.


Anubhav Jain is a Staff Scientist/Chemist at Lawrence Berkeley National Laboratory focusing on new materials discovery using high-throughput computations and machine learning. His current projects include helping develop the Materials Project database of calculated materials properties, applying data mining to solar photovoltaics research through the DuraMat consortium, screening for novel materials for applications such as thermoelectrics and catalysis, and applying natural language processing techniques to capture knowledge from the research literature. He received his B.E. in Applied & Engineering Physics from Cornell University and his Ph.D. in Materials Science & Engineering from the Massachusetts Institute of Technology.