Thursday, October 29, 2009

Internet Archive WayBack Machine: Valuable Technical Reference

Have you ever bookmarked a really good technical resource, only to be disappointed when you tried to access that page later and it was gone? Have you ever seen what looks like the perfect linked resource in a blog post, article, or book, but then found that the referenced URL no longer works? In short-term cases, such as an intermittent server issue or network problem, Google Cache can be an indispensable tool for seeing a cached version of the page. When a page has been gone for a long time or removed entirely, a web page archiving site is more helpful.

A particularly convenient site archive tool is the Internet Archive's WayBack Machine. This free online tool is very easy to use if you know the URL of the page you care about (which you would if the URL was part of a bookmark or if you clicked on a link to a site that was no longer present). You enter the URL in the form field on the WayBack Machine page and click on the button labeled "Take Me Back."



After doing the above, you see a detailed history of changes to the page at that location. This is demonstrated in the following screen snapshot.



The above screen snapshot demonstrates use of the Internet Archive WayBack Machine to see the history of my page on Common Struts Errors and Their Causes, which was originally posted on GeoCities but is no longer available there due to the demise of GeoCities.

Even though the page is no longer available at the GeoCities URL, its archived versions can still be viewed via the Internet Archive WayBack Machine. The following two images show what happens when one tries to access the page at its former GeoCities address and then what happens when using one of the Internet Archive WayBack Machine's archived versions.





As the screen snapshots above indicate, the page that is no longer available on GeoCities is still available via Internet Archive WayBack Machine. In my case, I have hosted this page at Google Sites and there are many other mirrored copies of it on the web, but the Internet Archive can be useful for finding pages that don't get copied or hosted on alternative sites. Because Internet Archive WayBack Machine also provides various versions of the same page, a history of a particular page can be seen.

The recent loss of GeoCities-hosted pages is only one of many examples of original content being removed from the web. Another relatively recent and well-known example was the sudden and somewhat surprising online disappearance of why the lucky stiff. The next two screen snapshots show a page from the popular Why's (poignant) Guide to Ruby. The first snapshot indicates that the original page is no longer available at its original URL (http://www.poignantguide.net/ruby/chapter-1.html), but the second snapshot indicates that it is archived and accessible with WayBack Machine (at http://web.archive.org/web/20080526095452/www.poignantguide.net/ruby/chapter-1.html).
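The archived address above appears to carry two pieces of information: a 14-digit run that reads like a capture timestamp, and the original address of the page. The following is a minimal Java sketch that splits such an address apart. It assumes the layout seen in this post (`http://web.archive.org/web/<14 digits>/<original address>`) and assumes the digits are in `yyyyMMddHHmmss` order. The class name `ArchivedUrl` and its members are hypothetical and are not part of any Internet Archive library.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper only; the class and member names are hypothetical. */
class ArchivedUrl {
    // Assumed layout: http://web.archive.org/web/<14 digits>/<original address>
    private static final Pattern LAYOUT =
            Pattern.compile("^https?://web\\.archive\\.org/web/(\\d{14})/(.+)$");

    // Assumption: the 14 digits read as year, month, day, hour, minute, second.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    final LocalDateTime capturedAt;
    final String originalAddress;

    private ArchivedUrl(LocalDateTime capturedAt, String originalAddress) {
        this.capturedAt = capturedAt;
        this.originalAddress = originalAddress;
    }

    /** Splits an archived address into its timestamp and original address. */
    static ArchivedUrl parse(String url) {
        Matcher m = LAYOUT.matcher(url);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Not in the assumed archived-URL layout: " + url);
        }
        return new ArchivedUrl(LocalDateTime.parse(m.group(1), STAMP), m.group(2));
    }

    public static void main(String[] args) {
        ArchivedUrl archived = parse(
                "http://web.archive.org/web/20080526095452/"
                        + "www.poignantguide.net/ruby/chapter-1.html");
        System.out.println("Captured at: " + archived.capturedAt);
        System.out.println("Original:    " + archived.originalAddress);
    }
}
```

Under these assumptions, the archived copy of the poignant guide page would date from late May 2008.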





So far, I've only discussed the simple search performed by providing the URL and clicking on the "Take Me Back" button. There are many advanced search options as well. If you have an e-mail or bookmark with a URL that no longer works, you'll often have a date on the e-mail, or a creation date in the bookmark's properties. You can use that date with WayBack Machine to find the page as it was on that date. This also makes for an easy way to revisit history.

What was big in the world of Java in late 2004? To find this out, I could look at archives for some major players in the Java world. For example, the java.sun.com page on the last day of 2004 looked like that shown in the next screen snapshot:



We can see from the above screen snapshot (I had entered http://web.archive.org/web/20041231/java.sun.com/ with a day but no time, but it resolved to http://web.archive.org/web/20041231092116/java.sun.com/ with a day and appropriate time) that big topics at the end of 2004 included early access to binaries and source of Java SE 6 (then still called J2SE 6 or Mustang) and NetBeans 4.0.
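The date-only address used above can be assembled from a date (for example, the date on an old e-mail or bookmark) and the original URL. Here is a minimal Java sketch, assuming the `http://web.archive.org/web/<yyyyMMdd>/<address>` layout shown in this post. The class `WaybackUrlBuilder` and its `forDate` method are hypothetical names chosen for illustration, and stripping the `http://` prefix is done only to match the addresses shown here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Illustrative helper only; the class and method names are hypothetical. */
class WaybackUrlBuilder {
    // Prefix taken from the archived addresses shown in this post.
    private static final String PREFIX = "http://web.archive.org/web/";

    /** Builds a date-only archive address for the given page. */
    static String forDate(LocalDate date, String originalUrl) {
        // Drop the scheme so the result matches the form used in the post.
        String address = originalUrl.replaceFirst("^https?://", "");
        // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd (e.g. 20041231).
        return PREFIX + date.format(DateTimeFormatter.BASIC_ISO_DATE) + "/" + address;
    }

    public static void main(String[] args) {
        // Last day of 2004, as in the java.sun.com example.
        System.out.println(forDate(LocalDate.of(2004, 12, 31), "http://java.sun.com/"));
    }
}
```

As described above, the WayBack Machine then resolves such a date-only address to a specific capture with its own day and time, so the address that ends up in the browser will typically differ from the one built here.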

As a final example, I'll look at something from the very early days of Java. The URL http://web.archive.org/web/19970211220056/www.sun.com/sunworldonline/swol-10-1995/swol-10-javadigest.html provides an interesting article from SunWorld Online called Java users reveal their habit. This makes for a very interesting historical perspective for those of us in the Java community. For those of us who have been around since the public inception of Java, it is a reminder that applets were once the "big thing" in the Java world. For those who are not familiar with the early days of Java, it can provide an interesting perspective on how far things have come (and how much some things are still the same).

Assuming that the deal for Oracle to acquire Sun Microsystems goes through, there is some question about what will happen to the wealth of Java resources available online at domains such as http://java.sun.com/ (such as the JDK 6 Documentation). These articles and other documents will almost assuredly continue to exist (even if at a different URL that includes "oracle" in its name), but it is still comforting to know that the archived versions should also be available via the WayBack Machine.

There are many valuable benefits to having old web pages still available after their deletion and having the ability to see older versions of a particular web page. When one runs into the all-too-common frustration of finding a bookmark or link that is no longer valid, the Internet Archive WayBack Machine is a very welcome tool. However, such power can also be abused. Some articles that talk about people who are not too happy with the power of the WayBack Machine include Internet Archive Sued Over WayBack Machine and Internet Archive Settles Suit Against WayBack Machine.

As an interesting side note, there is a Java-based, open source implementation of the WayBack Machine called wayback.


Conclusion

There is no question that the ability to link documents via hyperlinks is one of the main reasons for the success of HTTP and the World Wide Web. It is very convenient to click on links in bookmarks, e-mail messages, articles, and blogs, or even to type in URLs from books, and go directly to the referenced site. On the other hand, one of the greatest disappointments when using the web is finding that a hyperlink is no longer valid. Although a search engine might be able to tell you a page's new location (or the location of a mirror or copy), or Google Cache might be used to see a cached version of a page that is temporarily down, a tool like the Internet Archive WayBack Machine is sometimes what is needed to see a page that is no longer otherwise available. In cases where the page still exists, but information that was previously available has been removed, WayBack Machine's ability to show historical versions of that particular page is very useful. In many ways, using WayBack Machine feels like time travel.
