Information retrieval and storing for the contents of scientific journals

Leon Andretti Abdillah


Classification of scientific documents is becoming increasingly important to research communities [1] because of the growing volume of scientific literature published both in manuscript format and electronically. Electronic scientific documents may include journals, technical reports, program documentation, laboratory notebooks, etc. [2]. Figure 2 shows an example of a journal article and the typical components it contains.

Information Retrieval

Information retrieval is the computerized process of producing a list of documents that are relevant to an inquirer’s request by comparing the user’s request to an automatically produced index of the textual content of documents in the system [4]. The user expresses the request as one or more keywords submitted to a search engine, of which Google is the most popular.
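The comparison of a user's request against an automatically produced index can be illustrated with a small inverted index. The following is only a minimal sketch under stated assumptions: the function names (tokenize, build_index, search), the sample documents, and the rule that a document matches only when it contains every query term are all illustrative choices, not a method taken from the cited work.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: each term maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical sample collection, used only for illustration.
documents = {
    "doc1": "Classification of scientific documents",
    "doc2": "Search engines for scientific journals",
    "doc3": "Tourist information on the web",
}
index = build_index(documents)
print(search(index, "scientific documents"))  # → ['doc1']
```

A real search engine would also rank the matching documents by relevance; this sketch only shows the lookup step.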

The field of information retrieval evolved to provide principled approaches to searching various forms of content on the Internet using search engines.

Search Engine

Searching for documents on the Internet has become one of the most common user activities, covering everything from tourist information and social activities to the review of scientific documents. My proposed research will focus on scholarly or scientific documents. To search, the user enters one or more keywords as a query into an interface, and the system performs the search on the user's behalf.

Such a system is generally known as a search engine (SE). A variety of SEs are currently available on the Internet.

Search engines have become the most important medium for Internet users to find pages on the web [5]. Many studies and surveys show that Google is the most widely used search engine, followed by Yahoo. For scholarly searching, Google launched Google Scholar (GS) as a beta version in 2004.


This proposal therefore aims to find the best model for bridging the two sides: how users can cross the bridge that connects publishers and the crawler, and how the crawler can best recognize the metadata that publishers provide with their published documents.
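One way a crawler might recognize publisher-provided metadata is to read bibliographic meta tags embedded in an article's HTML page. The sketch below assumes that the publisher uses tags whose names begin with citation_ (such as citation_title and citation_author), a convention that Google Scholar describes for indexing. The class name, function name, and sample page are hypothetical, and the proposal itself does not prescribe this approach.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        # Keep only bibliographic tags; a tag may repeat (for example, one per author).
        if name.startswith("citation_") and content is not None:
            self.metadata.setdefault(name, []).append(content)


def extract_citation_metadata(html):
    """Return a dict mapping each citation_* tag name to its list of values."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.metadata


# Hypothetical article page, used only for illustration.
sample_page = """
<html><head>
<meta name="citation_title" content="An Example Article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
<meta name="citation_publication_date" content="2010/01/15">
<meta name="description" content="Not a citation tag">
</head><body></body></html>
"""

metadata = extract_citation_metadata(sample_page)
print(metadata["citation_author"])  # → ['Doe, Jane', 'Roe, Richard']
```

Values are stored as lists because a single article can carry several tags of the same name, as with multiple authors.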


