Provenance for Internet Documents
This was originally written towards the end of my tenure with the
2006-05-08 — Posted as found in a file dated 19 August, 2001.
PROVENANCE: problems
The long-term value of archived network documents (ie: web pages) is greatly affected by our current attention to the
Our Ideal Meta-data
To illustrate the issues, let us first consider an ideal record of
provenance and authenticity for an HTTP ("web") document.
Date of acquisition
URI (the textual address) and IP address
Method of acquisition
Crawler "identity" as presented to the server
Server response dependent upon the client identity
Scripted behavior that controls what the client accesses
Response code from server, including information about changes in the URI
Ownership of the domain name at time of acquisition
Assignment of the IP address at time of acquisition
Owner requests for removal from the archive
At first glance, all this meta-data is collected by The Internet Archive in its crawls, except domain ownership and IP assignment. Those two items are held in databases belonging to different registries. The importance of the registry data can be illustrated with a few examples of usage.
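As a rough illustration, the ideal record listed above could be held in a structure like the following. This is a minimal sketch, not the Archive's actual format: the class name, every field name, the crawler identity string, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """One capture of one network document (all names are illustrative)."""
    acquired_at: datetime                      # date of acquisition
    uri: str                                   # the textual address
    ip_address: str                            # address the URI resolved to
    method: str                                # method of acquisition
    crawler_identity: str                      # identity presented to the server
    response_code: int                         # response code from the server
    redirected_to: Optional[str] = None        # change in URI reported by server
    varies_by_client: Optional[bool] = None    # None means "not tested"
    scripts: List[str] = field(default_factory=list)  # scripts steering the client
    domain_owner: Optional[str] = None         # must come from a domain registry
    ip_assignee: Optional[str] = None          # must come from an IP registry
    removal_requested: bool = False            # owner asked for removal

    def missing_registry_data(self) -> List[str]:
        """Name the fields that a crawl alone cannot fill in."""
        gaps = {"domain_owner": self.domain_owner, "ip_assignee": self.ip_assignee}
        return [name for name, value in gaps.items() if value is None]


record = ProvenanceRecord(
    acquired_at=datetime(2001, 8, 19, tzinfo=timezone.utc),
    uri="http://www.site.com/",
    ip_address="192.0.2.10",            # documentation-range address, not a real host
    method="HTTP GET",
    crawler_identity="example-crawler/1.0",  # hypothetical identity string
    response_code=200,
)
print(record.missing_registry_data())
```

The two fields reported as missing are exactly the ones that have to be joined in from the registries rather than gathered during the crawl.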
Example: A researcher is examining the evolution of design in small business web sites. In her first pass through her collection, she correlates changes in the HTML structure with changes in IP address. Some web pages in her sample undergo changes more dramatic than the average. She modifies her data sorting to check for changes in the domain name registration record and is able to subtract trends due to change in ownership of the domain name. When she chooses a few web pages as case studies, she selects a page with captures from before and after a change in the IP address. She identifies the two different web hosting companies from their IP address assignments and, from retrieving their web sites at the time of the change, is able to demonstrate the change in price point and features that may have motivated the owner to switch web hosts.
Example: A researcher is trying to determine the origin of a collection of USENET messages which seeded a now-prominent grassroots organization. He uses the machine and domain names in the headers and correlates them with the domain name owners, including their geographic location. He also correlates the machine names to IP addresses and the owners of those addresses, generally ISPs. A common thread points to several individuals spread across a wide geographic area as the source of the earliest messages; only later did the ideas spread to the city considered the home of the movement.
Our Ideal Context
Our consideration of context is motivated by one simple question that we
expect users of an internet archive to have: "What did
www.site.com look like at a certain point in time?"
Provenance is defined as a record of the ownership of an object. In
the world of paintings and statuary, in which an object was unique and
exchange of ownership was a physical act with no modification to the
object, the details of the actual exchange method were rarely of
importance. For HTTP transactions, however, the equivalent of noting
whether a transfer occurs in a wooden crate or padded box does have
historical implications.
Servers can identify the type of client making an HTTP request and can change
their response based on that information. Web designers with a
concern that their site appear at its best on a multitude of
platforms use the User-Agent data to determine the capabilities of
the client software and native operating system and then deliver the
best-tailored results. Our crawler may either be identified by the server as a
robot, and fed what a savvy web-master wants reported to the search
engines, or lumped in with dozens of miscellaneous browsers of
generally reduced feature sets.
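The server-side behavior described above can be sketched as a simple dispatch on the User-Agent header. The matching rules, variant names, and sample User-Agent strings below are hypothetical; real sites use far more elaborate detection.

```python
def choose_variant(user_agent: str) -> str:
    """Pick a page variant from the User-Agent header (hypothetical rules)."""
    ua = user_agent.lower()
    if "bot" in ua or "crawler" in ua or "spider" in ua:
        return "robot"   # what the webmaster wants search engines to see
    if "msie" in ua or "mozilla" in ua:
        return "full"    # layout tuned for a recognized browser
    return "basic"       # fallback for unrecognized clients


for agent in (
    "example-crawler/1.0",                             # hypothetical crawler
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",  # browser-style string
    "SomeClient/0.1",                                  # hypothetical unknown client
):
    print(agent, "->", choose_variant(agent))
```

Under these rules a crawler lands in either the "robot" or the "basic" branch depending only on how it names itself, which is why the identity string belongs in the provenance record.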
In our ideal provenance we do note the method of
acquisition, a simple task. The challenge is first to determine
whether transfer by a robot crawler instead of the branded latest
browser affects the delivered content. That cannot be determined by
recording the method of acquisition alone. Furthermore, if we intend
to answer the question, "What did www.site.com look like in
2001?," we will not be satisfied with an acquisition method that
may be denied content.
There are other crawling-dependent issues as well. Probably of the most
concern in replicating the web experience is the use of JavaScript
(and other scripting languages) to add web content dynamically, and of
cascading style sheets to provide layout. In HTML, a designer may
specify the source code location in the head of the document. In the
body, a simple call can then generate the entire web page, none of
which is included in the HTML code. Extensions of this concern apply
to the use of different browser plug-ins.
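One modest step toward this problem is to record which external scripts and style sheets a captured page depends on, so that a crawl can at least note what else had to be fetched. The sketch below uses Python's standard html.parser module; the class name and the sample page are invented for illustration, and it only finds references that are written statically in the markup.

```python
from html.parser import HTMLParser


class DependencyFinder(HTMLParser):
    """Collect external scripts and style sheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attribute names arrive lowercased
        if tag == "script" and attributes.get("src"):
            self.scripts.append(attributes["src"])
        elif tag == "link" and (attributes.get("rel") or "").lower() == "stylesheet":
            if attributes.get("href"):
                self.stylesheets.append(attributes["href"])


# Hypothetical page whose body is produced entirely by a script call.
page = """<html><head>
<script src="/js/build.js"></script>
<link rel="stylesheet" href="/css/site.css">
</head><body><script>buildPage();</script></body></html>"""

finder = DependencyFinder()
finder.feed(page)
print(finder.scripts)      # external scripts the crawler must also fetch
print(finder.stylesheets)  # style sheets needed to reproduce the layout
```

Content that the script itself generates at run time is invisible to this kind of static scan; capturing it would require actually executing the script.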
Our
ideal context, then, requires us to obtain the information required
to indicate what a branded browser, with all features enabled, would
have received from the server. Ideally we would be able to deliver
that information directly. The final challenge is emulating the
browser’s response to that content, clearly delineating where data
has been lost. With all these considerations addressed, we can
preserve an authentic image of the Web.
PROVENANCE: solutions
Domain Name Ownership
The Internet Corporation for Assigned Names and Numbers (ICANN)[1] is the
initial authority that empowers the agents for registering domain
names. For the most common top level domains (TLDs), .com, .net, and
.org, there is a competitive environment in which ICANN accredits
companies to become part of the Shared Registration System. This
system is managed by Network Solutions[2].
The three TLDs, .com, .org and .net, are the most popular for web
destinations, especially within the American sphere of influence. The
.edu domain is managed by Network Solutions and is included in the
Shared Registration System. This leaves the TLDs .int, .gov, .mil and
the two letter country code names[3]. Each of these has its own
registry. In some cases these domain names are being used for content
within the country that owns the TLD.[4] In other cases, the
countries are exploiting this new resource by marketing their TLD to
the global market as an alternative to the crowded .com market. [5]
For example the Heard and McDonald Islands offer their TLD .hm as
appropriate for your online "home." Armenia (.am) and the
Federated States of Micronesia (.fm) are aggressively going after the
radio market. While the United States' country code (.us) is best
known as the home of many K-12 schools and state and local
governments, all entities for which domain name ownership seems
obvious, there are no such restrictions. Already, the .us domains are
being marketed as an alternative to the .com TLD [6].
In the long term, an archival home must be found for the different registries. The
individual databases could then be merged and linked to the different
collections so that research queries for archived content may include
criteria based on the domain name ownership data.
IP Address Allocation
ICANN is also the initial authority that distributes IP addresses to the
three Regional Internet Registries (RIRs). Like the commercial radio
spectrum, the IP addresses are finite (only 32 bits, in fact). The
RIRs attempt to ensure the IP addresses are distributed in a fair and
efficient manner. For the RIR that serves the Americas and part of
Africa, the usual practice is that large chunks of addresses are
assigned to the major ISPs. These are sub-allocated by those ISPs to
their customers, and finally an IP address is assigned to an
individual host. This chain of ownership establishes geographic
information separate from the ownership of the domain name and also
reflects the economic structure of the Internet. Admittedly, only
some of the sub-allocation registries are maintained in the RIRs'
databases and there is no requirement that an organization with a
class of IP addresses deploy all those addresses in the same
area. Nonetheless, this is the best available data for narrowing down
the geographic location of a server.
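The chain of allocation described above can be pictured as nested address blocks. The sketch below uses Python's standard ipaddress module; the blocks come from a private range and the holder descriptions are made up, so it shows only the shape of a lookup, not real registry data.

```python
import ipaddress

# Hypothetical allocation chain, built from a private address range.
allocations = [
    (ipaddress.ip_network("10.0.0.0/8"), "RIR block assigned to a large ISP"),
    (ipaddress.ip_network("10.20.0.0/16"), "ISP sub-allocation to a customer"),
    (ipaddress.ip_network("10.20.30.0/24"), "customer's hosting subnet"),
]


def chain_of_ownership(address: str):
    """Return every allocation containing the address, widest block first."""
    host = ipaddress.ip_address(address)
    matches = [(net, holder) for net, holder in allocations if host in net]
    return sorted(matches, key=lambda item: item[0].prefixlen)


for network, holder in chain_of_ownership("10.20.30.40"):
    print(network, "-", holder)

# A 32-bit address space holds 2**32 addresses in total.
print(ipaddress.ip_network("0.0.0.0/0").num_addresses)
```

An address that matches only the widest block illustrates the gap noted above: the sub-allocation exists, but the registry consulted does not record it.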
The three global RIRs:
ARIN, serving North and South America, the Caribbean, and sub-Saharan Africa. http://www.arin.net/whois/
RIPE NCC (Réseaux IP Européens Network Coordination Centre), serving Europe, the Middle East, and parts of Africa. http://www.ripe.net/
APNIC (Asia Pacific Network Information Centre), serving the Asia Pacific region. http://www.apnic.net/
Here
again, the same archival home should be found for the three RIRs. The
individual databases will then be merged and linked to the different
collections so that research queries for archived content may include
criteria based on the IP address allocation.
Domain Name Resolution
The correspondence between domain names and the IP addresses is recorded
in the files of domain name servers throughout the world. The
correspondence is also recorded by each arc file record of an HTTP
document in which the URL (and domain name) and the IP address are
both recorded in the header. There may be some use in capturing this
information as well, to the extent that it is not burdensome. Network
Solutions maintains Top-Level Domain Zone Files for active domain
names in the .com, .net, and .org TLDs and the associated IP address
for the host server. These are updated twice daily and are available
by ftp through agreement with Network Solutions. This is one method
by which this data may be collected by The Internet Archive.
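Because each arc record header already pairs a URL with an IP address, a history of name-to-address changes can be recovered from the archive itself. The sketch below assumes a space-separated header line that begins with the URL, the IP address, and a capture date, in that order; the function names and the sample lines (which use documentation-range addresses) are invented for the example.

```python
from urllib.parse import urlsplit


def name_to_address(header_line: str):
    """Extract (domain name, IP address, capture date) from one header line."""
    url, ip_address, archive_date = header_line.split(" ")[:3]
    return urlsplit(url).hostname, ip_address, archive_date


def address_changes(header_lines):
    """Yield (name, date, old_ip, new_ip) whenever a name's address changes."""
    last_seen = {}
    # Sort by capture date so changes are reported in chronological order.
    for line in sorted(header_lines, key=lambda l: l.split(" ")[2]):
        name, ip, date = name_to_address(line)
        if name in last_seen and last_seen[name] != ip:
            yield name, date, last_seen[name], ip
        last_seen[name] = ip


# Hypothetical header lines: URL, IP, date, content type, length.
headers = [
    "http://www.site.com/ 192.0.2.10 20010301000000 text/html 2048",
    "http://www.site.com/ 198.51.100.7 20010819000000 text/html 4096",
    "http://www.site.com/about.html 192.0.2.10 20010201000000 text/html 1024",
]
changes = list(address_changes(headers))
print(changes)
```

A change reported this way is the kind of event the first researcher in the earlier example would want to line up against changes in page design.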
Crawler User-Agent Self-Identification
The
capability and the motivation for servers to distribute different
documents depending upon the requesting user agent is well
established. The fraction implementing such behavior is unknown. The
possible responses to this issue range from running the crawler
with different User-Agent lines against all HTML crawls to doing
nothing at all.
In
preparation for addressing this issue, the arc file format should be
extended to identify the User-Agent string that the crawler
presented upon acquisition. Then a periodic trial should be run on a
cross-section sample of html URLs, documenting the fraction of
servers and sites implementing a variable response based on the
user-agent declaration. This fraction should be tracked and recorded
so that the probability that a given network document came from a
site with variable response is available whenever an Archive user
asks the question, "What did www.site.com look like at a certain
point in time?"
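The periodic trial could be approximated along the following lines: fetch each sampled URL under several User-Agent strings, compare digests of the bodies, and report the fraction that differ. This is a sketch under stated assumptions. The function names, the sample URLs, and the stand-in fetcher are hypothetical; only the urllib and hashlib calls are standard library.

```python
import hashlib
import urllib.request


def http_fetch(url: str, user_agent: str) -> bytes:
    """Fetch a URL while presenting the given User-Agent (needs network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def varies_by_user_agent(url, user_agents, fetch) -> bool:
    """True if the server returns different bodies to different User-Agents."""
    digests = {hashlib.sha256(fetch(url, ua)).hexdigest() for ua in user_agents}
    return len(digests) > 1


def variable_fraction(urls, user_agents, fetch) -> float:
    """Fraction of sampled URLs whose response depends on the User-Agent."""
    if not urls:
        return 0.0
    varying = sum(varies_by_user_agent(u, user_agents, fetch) for u in urls)
    return varying / len(urls)


def fake_fetch(url: str, user_agent: str) -> bytes:
    """Stand-in server for the demonstration: one site treats crawlers specially."""
    if url == "http://a.example/" and "crawler" in user_agent:
        return b"robot page"
    return b"page for " + url.encode()


sample = ["http://a.example/", "http://b.example/", "http://c.example/", "http://d.example/"]
agents = ["example-crawler/1.0", "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"]
fraction = variable_fraction(sample, agents, fake_fetch)
print(fraction)
```

Passing http_fetch in place of fake_fetch would run the trial against live servers. A real trial would also need a control, such as fetching twice with the same User-Agent, because pages with rotating content differ between requests for reasons unrelated to client identity.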
Re-presentation
In
preparation for a future beyond the current operating systems,
internet archivists should support the designers of browser and platform
emulators. Special collections of architecture documentation,
compilers, and software programs, should be established. While this
does not mean that a site can be completely reproduced in the distant
future, there will be enough documentation to establish "What
did www.site.com look like at a certain point in time?"
References
1. http://www.icann.org/general/fact-sheet.htm
2. http://www.nsiregistry.com/
3. http://www.iana.org/cctld/cctld-whois.htm
4. http://www.nic.yu/index-e.html , http://www.psg.com/dns/tz/
5. http://www.tonga.to/tonga/ , http://www.ccnames.cc/ , http://www.nunames.nu/ , http://dot.fm/ , http://www.dot.am/ , http://www.home.hm/
6. http://www.domainregistry.net/ , http://beltane.com/