Glossary
This section is a hybrid between a glossary and a FAQ. It explains some of the terms used in building this ranking and the meanings we give them.
Database size. The number of records in a search engine's database that is publicly accessible from external sources. Not all robots crawl the Web at the same time or with identical procedures; in addition, post-crawling processing and commercial requirements result in markedly different databases. The current size, composition and evolution of these figures are relevant to webometric analysis.
Delimited search. A key feature of search engines that makes cybermetric analysis possible. A delimiter operator has a specific syntax and meaning that can differ among engines. It returns the number of records (web pages) that satisfy a certain condition, filtering the results according to strings in the address (URL) or other characteristics of the page (language, format). Especially relevant is the link delimiter, which can be combined with site or similar delimiters to calculate inlinks.
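As an illustration of how delimiters combine, the following minimal Python sketch assembles query strings from operator and value pairs. The function name delimited_query and the domain example.edu are hypothetical, and the operator names used (site, filetype, linkdomain) are assumptions: each engine supports its own set with its own syntax, so they should be checked against the engine's documentation.

```python
def delimited_query(terms="", include=None, exclude=None):
    """Build a query string from free-text terms and delimiter operators.

    include and exclude are lists of (operator, value) pairs, for example
    ("site", "example.edu"). Operator names are not validated here, because
    the supported set differs among engines.
    """
    parts = [terms] if terms else []
    for operator, value in include or []:
        parts.append(f"{operator}:{value}")
    for operator, value in exclude or []:
        # A leading minus sign negates the condition.
        parts.append(f"-{operator}:{value}")
    return " ".join(parts)


# Pages of a hypothetical domain restricted to one file format.
print(delimited_query(include=[("site", "example.edu"), ("filetype", "pdf")]))

# Links pointing to the domain, excluding those that come from the domain itself.
print(delimited_query(include=[("linkdomain", "example.edu")],
                      exclude=[("site", "example.edu")]))
```

The sketch only builds the text of the request; submitting it and reading the number of results depends on each engine's interface and terms of use.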


Discipline differences. The ranking does not assign any thematic classification to the units, so a formal thematic analysis is not possible at the moment. However, there are important differences in academic focus within our universities database that should be taken into account. Research-focused universities are mixed with teaching-oriented institutions, and a group of discipline-oriented organizations (mainly pedagogy, medicine and theology) is also present.
Formal characteristics. As there is neither universal document control nor formal guidelines for building web pages, there is a huge diversity of formal aspects in the Webspace, including obvious malpractices. Some authors have focused on these to provide new indicators such as link density; link quality, expressed as the ratio of non-working links; missing tags, including ones as relevant as title or metadata; or updating frequency. None of these characteristics is taken into account in our rankings, but they should be considered for micro-analysis.
Geographical biases. We use several search engines in our ranking because of the geographical bias observed in some of them. We do not know whether this is due to topological or traffic problems in the network (some East Asian countries are usually poorly covered) or to the crawlers' behaviour, nor whether the biases remain constant over time. Biases in Alexa prevent us from adding popularity data to our rankings.
Institutional domains. The basic unit of our analysis is the common URL domain shared by all the websites of an institution. Unfortunately, some organizations maintain two or more equivalent domains without marking one as preferred. Also of concern is the fact that some second-level departments maintain completely different domains. We usually keep two entries for institutions with two equivalent top-level domains. We intend to merge the results of smaller domains with those of the main one in the near future, but it is a difficult task.
Invocation. The presence of the name of an institution or a researcher in a web page. The global presence is the number of times the name appears on the Web, and it can be calculated easily by enclosing the name in quotation marks in the search engines. This figure is sometimes referred to as the number of times the name is cited on the Web. Some authors refer to this as Web visibility, although we prefer to reserve this word for link visibility. This indicator usually favours large, well-known, old institutions regardless of their real effort to achieve a relevant Web presence.
No invocation measure was used in our ranking, mainly because it is not possible to assign a unique, unambiguous, universal name to every institution.
Invisible Web. Traditionally refers to the information available through gateways or search interfaces that is not accessible to the search engines’ robots. It is a huge part of Internet content, including library catalogues, bibliographic and alphanumeric databases, and even some document repositories. In recent years some engines, especially Google, have made a great effort to index these records, and several databases are now more or less covered in their systems (e.g. PubMed is partially indexed by Google). Our ranking does not consider the Invisible or Deep Web, and we encourage transforming it into crawler-friendly information.
Language. English is the “lingua franca” of scientific communication and it is also the language of a significant fraction of Internet users. Institutions from non-English-speaking countries that publish only in their mother tongue achieve lower visibility than those with multilingual websites.
Link motivation. A major concern in link analysis is the motivation behind the creation of a link. Previous studies suggest that “sitations”, the hypertextual equivalent of bibliographic citations, are still rare. We think this situation will improve when more papers become available on the Web, but we consider other reasons for linking very useful for describing scholarly communication. Informal linking is a powerful source of information about the intellectual, economic and political connections of academic and scientific activities.
| CATEGORY | CASE | COMMENTS |
| --- | --- | --- |
| Sitation | Link to paper or document | Generally in pdf/ps/doc format |
| Teaching/learning | Link to course materials | Mainly html pages but also pdf, doc or ppt |
| Research oriented | Resources index | Portal type |
| | Software repository | |
| | Research projects sites | |
| | Conferences, seminars or meetings pages | |
| | Raw data | Including media files if applicable |
| Personal | Self archive | Pre or post prints, but also unpublished material |
| | Team or colleagues pages | |
| | Blog | |
| | Third parties (non-research) | |
| Institutional | Parent institution | And related ones |
| | Funding organization | |
Link popularity. Another term for link visibility that has been used extensively. We prefer to reserve popularity for the number of visits. Although it is not yet implemented in the Ranking, we intend to consider the number of visits, or popularity, as a relevant factor for our rankings in the future.
Open access. The movement to distribute openly the scientific production of, at least, publicly funded researchers is facing tougher opposition than expected. A strong commitment to open access initiatives will be clearly reflected in our rankings.
Personal pages. A frequently heard statement about the quality of web contents concerns the information provided by the personal pages of students or staff members. A lot of free space hosted on university web servers is used for personal purposes, and it is generally thought to hold low-quality information or information unrelated to academic activity. Data suggest that a large number of small websites crowd the institutional domains, but most of them are interesting enough to merit consideration. Some “personal” pages are in fact research group sites, while others are institutional (scientific societies, electronic bulletins, conference sites). True personal pages cover both extremes of the content range, from people offering only CVs to others providing very large collections of information on their academic or research topics, with links to personal repositories of documents. A striking pattern is the absence of links to colleagues’ websites or to other institutions.
Quality. We advise against using the rankings as a global or partial indicator of quality. Impact or visibility better describes our aims, but only in the particular context of promoting open and universal access to scientific activities and results through the Web.
Ranking. As their main objective is purely commercial, current search engines do not offer stable, reliable or trustworthy results for webometric purposes. The situation has improved in recent years, but there are still important biases and a worrisome instability. This is the reason we use relative positions rather than absolute values in our analysis.
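A minimal Python sketch of how raw result counts can be turned into relative positions, which are less sensitive to fluctuations in the absolute figures. The function name, the domains and the counts are hypothetical, and the handling of ties is an assumption; this is not necessarily the procedure used to build the ranking.

```python
def rank_positions(counts):
    """Map {name: count} to {name: position}, where 1 is the largest count.

    Ties share the same position and the next position is skipped
    (so two units tied at 2 are followed by position 4).
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    positions = {}
    previous_count = None
    previous_position = 0
    for index, (name, count) in enumerate(ordered, start=1):
        if count != previous_count:
            previous_position = index
            previous_count = count
        positions[name] = previous_position
    return positions


# Two hypothetical snapshots: the absolute values differ, the ordering does not.
snapshot_a = {"univ-a.example": 120000, "univ-b.example": 45000,
              "univ-c.example": 45000, "univ-d.example": 900}
snapshot_b = {"univ-a.example": 131000, "univ-b.example": 47000,
              "univ-c.example": 47000, "univ-d.example": 1000}

print(rank_positions(snapshot_a))
print(rank_positions(snapshot_a) == rank_positions(snapshot_b))
```

In this example both snapshots give the same positions even though every count changed, which is the property that makes positions attractive when the engines' figures are unstable.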
Rich files. A general term comprising a rather heterogeneous group of file types, mainly those devoted to representing self-contained enriched documents, such as MS Word (doc), Adobe Acrobat (pdf) or PostScript (ps). In our analysis we also included MS PowerPoint (ppt) and excluded xls, latex and tex. Rich files are relevant because they are used for scholarly communication, as authors usually distribute their papers and presentations in these formats. Certainly some of these types are used extensively for bureaucratic purposes (forms, administrative documents, internal reports), but these can explain only a small percentage of the large numbers observed in domains with extensive repositories.
Several other file types can be considered rich files, and even raw formats like txt are used for distributing academic content, but their individual contribution is too low to be considered.
Rounding. Google and Yahoo offer rounded results, ending in ,000, which means an error rate in the order of 2 to 5%. Moreover, the number provided by Yahoo on the first page is about another 4-5% higher than the one shown on the following pages, which trend towards the “correct” number.
Search Engine. The software that searches an index and returns matches. Search engine is often used synonymously with spider and index, although these are separate components that work with the engine. Only four engines are useful for quantitative analysis, as they have a large, independent, self-crawled database and their retrieval systems allow results to be filtered with URL-related delimiters:
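To give a feel for how rounding alone affects a count, here is a small Python sketch. It assumes the engine rounds to the nearest multiple of 1,000; the engines' actual rounding rule is not stated here, so both that rule and the function name are assumptions, and the sketch does not try to reproduce the percentages quoted above.

```python
def max_relative_error(reported, step=1000):
    """Worst-case relative error of a count rounded to the nearest multiple of step.

    Under this assumption the true value lies within half a step of the
    reported one, and the relative error is largest at the lower bound.
    """
    half_step = step / 2
    if reported <= half_step:
        raise ValueError("reported count is too small for this bound")
    return half_step / (reported - half_step)


# The bound shrinks as the reported count grows (hypothetical counts).
for reported in (11000, 21000, 101000):
    print(reported, f"{max_relative_error(reported):.2%}")
```

Under this assumption the rounding error matters mostly for small domains, where a few hundred pages are a sizeable share of the total.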
Google www.google.com
Yahoo Search search.yahoo.com
Bing www.bing.com
Exalead www.exalead.com/search
Self archiving. Self-archiving involves depositing a free copy of a digital document on the World Wide Web in order to provide open access to it. The term usually refers to the self-archiving of peer-reviewed journal and conference articles, as well as theses, deposited in the author’s own institutional repository or open archive for the purpose of maximizing their accessibility, usage and citation impact. This practice is common among the most prolific authors and in certain disciplines. Globally, however, only a minority of authors support this option. As many of these papers are published as rich files (pdf, ps or doc), this practice notably increases the performance of an institution in our rankings.
Size. The size of an institutional domain is the combined number of pages of all the websites within that domain, including html and non-html formats that can be treated as pages. From a practical point of view, size is the number returned by a search engine for a search like site:domain. This indicator is central to our rankings, and other authors also use it as the denominator in Web Impact Factor calculations. However, pages vary widely according to different criteria, including content size measured in bytes. For example, one page may contain a pdf document that is a monograph of several hundred pages totalling several MB of text and images, while another page consists only of the phrase “page under construction”. Global size could be an interesting indicator and we expect to provide it for selected websites.
Stability. From the early days, the instability of search results in general, and of the reported number of results in particular, has been a subject of special concern. Certainly the Web is a highly dynamic system, growing at an incredible pace, but the crawlers also change their specifications and schedules unexpectedly. A full crawling round of the Web can last from 15 to 45 days.
Visibility. In the context of this ranking, the term refers to link visibility: the number of external inlinks received by an institutional domain. The most commonly used syntax for this request in search engines is:
linkdomain:webometrics.info -site:webometrics.info
Web cost. Maintaining a very large presence on the Web can be quite costly, requiring specific funding and human resources, but the total cost is far below that of any other publication method and the potential audience is truly global. One way to undertake large projects is to distribute the effort, so that individual graduate students, professors or researchers, scientific teams and other administrative units each have an autonomous web presence. A rich content page should include a large diversity of objects, including images and other media files, a certain number of navigational links and a selected group of external outlinks. That can require a huge effort, which can only be faced if these tasks are subject to evaluation like other academic and scientific activities.
Web Impact Factor. The most cited cybermetric indicator, although its usage is not universal due to several shortcomings. It is defined as the ratio between the external inlinks received by a website and the number of web pages comprising that website. Some authors have suggested modifying the denominator, using alternative measures of the size of the institution based on non-Internet data such as the number of potential authors (staff, professors, graduate students), economic wealth (funding, projects) or bibliometric data (papers in journals).
Our ranking is derived from the WIF, with a 1:1 ratio established between visibility and size.
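The WIF definition can be written as a short Python sketch. The function name and the counts are hypothetical; in practice the numerator would come from a request such as linkdomain:domain -site:domain and the denominator from site:domain, each subject to the engine's rounding and instability. The sketch covers only the plain ratio, not the way the ranking combines visibility and size.

```python
def web_impact_factor(external_inlinks, pages):
    """Ratio of external inlinks received by a website to its number of pages."""
    if pages <= 0:
        raise ValueError("the number of pages must be positive")
    if external_inlinks < 0:
        raise ValueError("the number of inlinks cannot be negative")
    return external_inlinks / pages


# Hypothetical domain: 50,000 external inlinks and 200,000 pages.
print(web_impact_factor(50000, 200000))
```

Because the denominator is a page count, a small site with a few well-linked pages can obtain a higher WIF than a large site with many more inlinks, which is one of the shortcomings usually raised against the indicator.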