Extração de informação de artigos científicos: uma abordagem baseada em indução de regras de etiquetagem

Álvarez, Alberto Cáceres

doi:10.11606/D.55.2007.tde-21062007-144352

DOI

https://doi.org/10.11606/D.55.2007.tde-21062007-144352

Document

Master's Dissertation

Author

Álvarez, Alberto Cáceres (Catálogo USP)

Full name

Alberto Cáceres Álvarez

Institute/School/College

Instituto de Ciências Matemáticas e de Computação

Knowledge Area

Computer Science and Computational Mathematics

Date of Defense

2007-05-08

Published

São Carlos, 2007

Supervisor

Lopes, Alneu de Andrade (Catálogo USP)

Committee

Lopes, Alneu de Andrade (President)
Nunes, Maria das Graças Volpe
Vieira, Renata

Title in Portuguese

Extração de informação de artigos científicos: uma abordagem baseada em indução de regras de etiquetagem

Keywords in Portuguese

Aprendizagem de máquina
Extração de informação
Processamento de língua natural

Abstract in Portuguese

Este trabalho faz parte do projeto de uma ferramenta denominada FIP (Ferramenta Inteligente de Apoio à Pesquisa) para recuperação, organização e mineração de grandes coleções de documentos. No contexto da ferramenta FIP, diversas técnicas de Recuperação de Informação, Mineração de Dados, Visualização de Informações e, em particular, técnicas de Extração de Informações, foco deste trabalho, são usadas. Sistemas de Extração de Informação atuam sobre um conjunto de dados não estruturados e objetivam localizar informações específicas em um documento ou coleção de documentos, extraí-las e estruturá-las com o intuito de facilitar o uso dessas informações. O objetivo específico desenvolvido nesta dissertação é induzir, de forma automática, um conjunto de regras para a extração de informações de artigos científicos. O sistema de extração proposto, inicialmente, analisa e extrai informações presentes no corpo dos artigos (título, autores, filiação, resumo, palavras-chave) e, posteriormente, foca na extração das informações de suas referências bibliográficas. A proposta para extração automática das informações das referências é uma abordagem nova, baseada no mapeamento do problema de part-of-speech tagging ao problema de extração de informação. Como produto final do processo de extração, tem-se uma base de dados com as informações extraídas e estruturadas no formato XML, disponível à ferramenta FIP ou a qualquer outra aplicação. Os resultados obtidos foram avaliados em termos das métricas precisão, cobertura e F-measure, alcançando bons resultados comparados com sistemas similares.

Title in English

Information extraction from scientific articles: an approach based on induction of tagging rules

Keywords in English

Information extraction
Machine learning
Natural language processing

Abstract in English

This dissertation is part of the project for a tool named FIP (an Intelligent Tool for Research Supporting), designed for the retrieval, organization, and mining of large document collections. FIP draws on diverse techniques from Information Retrieval, Data Mining, Information Visualization, and, in particular, Information Extraction, the focus of this work. Information Extraction systems operate on unstructured data: they locate specific information in a document or document collection, then extract and structure it to make it easier to use. The specific objective of this dissertation is to automatically induce a set of rules for extracting information from scientific articles. The proposed system first analyzes the body of each article and extracts its title, authors, affiliation, abstract, and keywords, and then extracts information from each of its bibliographical references. The approach proposed for the references is a new technique that maps the information extraction problem onto part-of-speech tagging. The outcome of the extraction process is a database of extracted information, structured in XML format and available to FIP or to any other application. The system was evaluated in terms of Precision, Recall, and F-measure, and achieved good results compared with similar systems.
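
As an illustration of the tagging view of reference extraction described in the abstract, the sketch below assumes each token of a reference has already been assigned a field tag, in the way a part-of-speech tagger labels words. It then groups the tokens into fields, serializes them as XML, and scores one field with precision, recall, and F-measure. The tag names (`author`, `title`, `year`), the function names, and the sample tokens are hypothetical and are not taken from the dissertation; the induced tagging rules themselves are not reproduced here.

```python
import xml.etree.ElementTree as ET


def group_tagged_tokens(tagged):
    """Merge consecutive tokens that share a field tag into (tag, text) spans."""
    spans = []
    for token, tag in tagged:
        if spans and spans[-1][0] == tag:
            spans[-1][1].append(token)
        else:
            spans.append((tag, [token]))
    return [(tag, " ".join(tokens)) for tag, tokens in spans]


def to_xml(spans):
    """Serialize the spans as one <reference> element (hypothetical schema)."""
    ref = ET.Element("reference")
    for tag, text in spans:
        ET.SubElement(ref, tag).text = text
    return ET.tostring(ref, encoding="unicode")


def precision_recall_f(predicted, gold, field):
    """Token-level precision, recall, and F-measure for a single field tag."""
    tp = sum(p == field and g == field for p, g in zip(predicted, gold))
    n_pred = sum(p == field for p in predicted)
    n_gold = sum(g == field for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical tagger output for a short reference string.
tagged = [
    ("Lopes,", "author"), ("A.", "author"),
    ("Tagging", "title"), ("References.", "title"),
    ("2009.", "year"),
]
spans = group_tagged_tokens(tagged)
xml_record = to_xml(spans)

# Hypothetical gold and predicted tag sequences for five tokens.
gold = ["author", "author", "title", "title", "year"]
pred = ["author", "title", "title", "title", "year"]
p, r, f = precision_recall_f(pred, gold, "title")
```

Scoring per token is only one possible choice; a field could instead be counted as correct only when its whole span matches, which is stricter.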

WARNING - Viewing this document is conditioned on your acceptance of the following terms of use:
This document is only for private use in research and teaching activities. Reproduction for commercial use is forbidden. These rights cover all data about this document as well as its contents. Any use or copy of this document, in whole or in part, must include the author's name.

dissertacao_alberto.pdf (2.10 Mbytes)

Publishing Date

2007-06-21

Derived works

WARNING: The material described below relates to works resulting from this thesis or dissertation. The contents of these works are the author's responsibility.

Álvarez, A. C., and Lopes, A. A. Information Extraction from Tagged Bibliographical References. In: 2nd International Workshop on Web and Text Intelligence, São Carlos, 2009. Proceedings of the 2nd International Workshop on Web and Text Intelligence, 2009.