PDF Articles Metadata Harvester

Assoc. Prof. Leon Abdillah

Leon Andretti Abdillah

Information Systems, Computer Science Faculty, Bina Darma University, Jl. A. Yani No.12, Palembang 30264, Indonesia

E-mail: leon dot abdillah at yahoo dotcom


Scientific journals are very important for recording the findings of researchers around the world. PDF is now the prevailing medium for disseminating scientific journal articles. One way to find scientific journal articles on the Internet is through metadata. Metadata stores summary information about an article. Embedding the metadata in the PDF of a scientific article ensures that the metadata can be read consistently. Harvesting metadata from scientific journal articles is currently an active field. This paper discusses metadata harvesters for scientific journal articles that use XMP.

Keywords: Scientific journal article, metadata, harvester, XMP.



L. A. Abdillah, “PDF articles metadata harvester,” Jurnal Komputer dan Informatika (JKI), vol. 10, pp. 1-7, April 2012. 



Metadata are very useful for enriching a scientific journal article. They record elements of the article such as author, title, and year. Metadata can be stored in several file formats, such as: (1) RIS; (2) plain text; (3) ENW; or (4) BibTeX. Another scheme is to store the metadata using XMP technology when the article is in PDF format. This information is embedded in the PDF article as hidden information, or document properties. The hidden information contains valuable details that summarize the contents of the article. PDF has become the standard format for disseminating scientific findings.
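As an illustration of how embedded XMP can be read back out of a PDF, the following is a minimal Python sketch, not the harvester described in this paper. It assumes the XMP packet is stored uncompressed, so that it can be found by scanning the raw bytes for the `x:xmpmeta` element, and it reads only three Dublin Core fields. The function names and the `SAMPLE_PDF` bytes are hypothetical; the sample is a synthetic stand-in built from this article's own citation, not a valid PDF.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs used by RDF and by Dublin Core properties inside XMP.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumption: the packet is not compressed, so a byte-level scan can find it.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(pdf_bytes):
    """Return the parsed XMP root element, or None if no packet is found."""
    match = XMP_PACKET.search(pdf_bytes)
    if match is None:
        return None
    return ET.fromstring(match.group(0))


def dublin_core_fields(xmp_root, names=("title", "creator", "date")):
    """Collect the rdf:li values stored under each requested dc:* property."""
    fields = {}
    for name in names:
        values = []
        for prop in xmp_root.iter(f"{{{DC_NS}}}{name}"):
            for item in prop.iter(f"{{{RDF_NS}}}li"):
                if item.text:
                    values.append(item.text.strip())
        if values:
            fields[name] = values
    return fields


# Synthetic bytes for illustration only (hypothetical, not a real PDF file).
SAMPLE_PDF = (
    b"%PDF-1.4\nstream\n"
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">'
    b"PDF articles metadata harvester</rdf:li></rdf:Alt></dc:title>"
    b"<dc:creator><rdf:Seq><rdf:li>L. A. Abdillah</rdf:li></rdf:Seq></dc:creator>"
    b"<dc:date><rdf:Seq><rdf:li>2012-04</rdf:li></rdf:Seq></dc:date>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b"\nendstream\n%%EOF"
)

if __name__ == "__main__":
    root = extract_xmp(SAMPLE_PDF)
    print(dublin_core_fields(root) if root is not None else "no XMP found")
```

A PDF whose metadata stream is compressed or encrypted would need a real PDF parser before this scan could succeed, and a PDF without any XMP packet simply yields `None`.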
 This harvester is able to retrieve all XMP fields from PDF files.
 The author enriches the harvester with useful additional fields beyond XMP, such as recency.
 The added recency field can be used to compute the age of an article.
 XMP in PDF is becoming the new standard for storing the metadata of a scientific article.
 At the moment, not all articles published in PDF format are supplied with XMP metadata by their author(s) or publisher. This remains a challenge for future research.
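The recency field mentioned above could be derived along the following lines. This is a sketch under stated assumptions: the source does not give the exact formula, so recency is taken here to be the number of whole calendar years between the publication year and a reference date, and the publication date is assumed to be a string that starts with a four-digit year (for example `2012`, `2012-04`, or `2012-04-15`). The function name `recency_years` is hypothetical.

```python
from datetime import date


def recency_years(published, today=None):
    """Age of an article in calendar years (assumed definition of recency).

    `published` is a date string beginning with a four-digit year.
    `today` defaults to the current date and can be fixed for repeatable runs.
    """
    if today is None:
        today = date.today()
    year = int(published[:4])  # raises ValueError if the year is not numeric
    return today.year - year


if __name__ == "__main__":
    # This article was published in April 2012.
    print(recency_years("2012-04"))
```

Passing an explicit `today` makes the result reproducible, which is useful when a harvested collection is compared across runs made on different dates.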


