This document is intended to describe Uniform Resource Locators, widely used on the World Wide Web and other media for referencing documents. This document was written to be an understandable, comprehensive, and accurate resource on URLs. However, some information may become obsolete, as this document will not be updated to keep pace with developments beyond 1996.
URLs make it possible to direct both people and software applications to a variety of information, available from a number of different Internet protocols. Most commonly, you will run into URLs when using a World Wide Web (WWW) client, as that medium uses URLs to link WWW pages together. In your WWW browser's "location" box, the item that generally starts with "http:" is a URL. Files available over protocols besides HTTP, such as FTP and Gopher, can also be referenced by URLs. Even Telnet sessions to remote hosts on the Internet and someone's Internet e-mail address can be referred to by a URL.
A URL is like your complete mailing address: it specifies all the information necessary for someone to address an envelope to you. However, URLs are much more than that, since they can refer to a variety of very different types of resources. A more fitting analogy would be a system for specifying your mailing address, your phone number, or the location of the book you just read from the public library, all in the same format.
In short, a URL is a very convenient and succinct way to direct people and applications to a file or other electronic resource. Learning how to interpret, use, and construct URLs will greatly assist your exploration of the Internet.
<scheme>:<scheme-dependent-information>
Examples of various schemes are "http", "gopher", "ftp", and "news". These schemes and others are explained below. The scheme tells you or the application using the URL what type of resource we are trying to reach and/or what mechanism to use to obtain that resource.
The scheme dependent information is detailed below with each separate scheme. However, most schemes include two different types of information: the Internet machine making the file available and the "path" to that file. With these types of schemes, we generally see the scheme separated from the Internet address of the machine with two slashes (//), and then the Internet address separated from the full path to the file with one slash (/). FTP, HTTP, and Gopher URLs generally appear in this fashion:
scheme://machine.domain/full-path-of-file
As an exercise, let's look at this file's URL:
http://www.netspace.org/users/dwb/url-guide.html
The scheme for this URL is "http" for the HyperText Transfer Protocol. The Internet address of the machine is "www.netspace.org", and the path to the file is "users/dwb/url-guide.html". When working with the WWW, most URLs will appear very similar to this one in overall structure.
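The same breakdown can be sketched in code. The following is a minimal illustration using Python's standard urllib.parse module; the variable names are chosen here for illustration and are not part of any URL specification.

```python
from urllib.parse import urlsplit

# Split this document's URL into the pieces described above.
parts = urlsplit("http://www.netspace.org/users/dwb/url-guide.html")

scheme = parts.scheme      # the part before the first colon
machine = parts.hostname   # the Internet address after the two slashes
# urlsplit keeps the separating slash on the path, so drop it here.
path = parts.path[1:] if parts.path.startswith("/") else parts.path

print(scheme, machine, path)
```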
Note that when using FTP, HTTP, and Gopher URLs, the "full-path-of-file" will sometimes end in a slash. This indicates that the URL is pointing not to a specific file, but a directory. In this case, the server generally returns the "default index" of that directory. This might be just a listing of the files available within that directory, or a default file that the server automatically looks for in the directory. With HTTP servers, this default index file is generally called "index.html", but is frequently seen as "homepage.html", "home.html", "welcome.html", or "default.html".
However, sometimes your clients might not support a certain URL scheme, and you will have to manually decode it for yourself. In these instances, first start with the scheme, and look at the descriptions provided below. For instance, if I come across a URL that starts with "mailto:", I look below and find that this indicates an Internet email address. Next, figure out what application you should use to utilize this URL. In this example, I use Eudora to email people, so I launch that application. Then, use the scheme description to determine what information the scheme-dependent portion of the URL is providing. In this case, the Mailto URL lists an email address of "dwb@netspace.org". Lastly, figure out how to give this information to your client. For Eudora, I know that I should create a new message, and then fill in "dwb@netspace.org" within the To: mail field.
For example, say that I want to tell someone how to obtain the Mac WWW browser created by the Netscape Communications Corporation. I've obtained the browser for myself by using my favorite FTP client. I used that client to contact the site "ftp.mcom.com". I then changed to the "netscape" directory, and then went into the "mac" directory. Finally, I got the file called "netscape.sea.hqx". Looking at the description of the FTP scheme, I construct the following URL:
ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx
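The construction steps above (site, then each directory, then the file name) could be sketched as follows. The helper name build_ftp_url is hypothetical, and the sketch assumes none of the pieces contain characters that need escaping.

```python
from urllib.parse import urlunsplit

def build_ftp_url(host, directories, filename):
    """Join an FTP site, a list of directories, and a file name into a URL.

    Hypothetical helper: assumes no piece needs an escape sequence.
    """
    path = "/" + "/".join(list(directories) + [filename])
    # urlunsplit takes (scheme, host, path, query, fragment).
    return urlunsplit(("ftp", host, path, "", ""))

url = build_ftp_url("ftp.mcom.com", ["netscape", "mac"], "netscape.sea.hqx")
print(url)
```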
Now, when presenting this URL to people, there is a general syntax that ought to be used to avoid confusion. Many people place the URL on its own line, separated from text below and above by whitespace. Most people consider this sufficient. However, a more precise syntax is recommended by RFC 1738, and distinguishes URLs from other Uniform Resource Identifiers (URIs). (URLs are a subset of URIs.) This syntax is to preface the URL by "<URL:" and terminate it with ">". Thus, when sending the above URL to obtain Netscape for Mac via email, I use the syntax:
<URL:ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx>
Other recommendations exist, but this is the format used within this document and the RFC which defines URLs.
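One practical benefit of the "<URL:" wrapper is that software can find URLs in running text without guessing where they end. A rough sketch, assuming the URL contains no whitespace and is not broken across lines (the function names are illustrative only):

```python
import re

# Matches "<URL:" followed by anything up to the closing ">".
URL_WRAPPER = re.compile(r"<URL:([^>\s]+)>")

def wrap_url(url):
    """Present a URL in the "<URL:...>" form."""
    return f"<URL:{url}>"

def extract_urls(text):
    """Return every URL found inside a "<URL:...>" wrapper in text."""
    return URL_WRAPPER.findall(text)

message = "Get it from " + wrap_url(
    "ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx") + " today."
print(extract_urls(message))
```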
Also note, when constructing URLs, that certain characters are reserved or unsafe. To use these characters, you will need to encode them with "escape sequences." These sequences are mentioned in the section entitled Appendix A: Escape Sequences.
http://<host>:<port>/<path>?<searchpart>
The host is the Internet address of the WWW server, and the port is the port number to connect to. In most cases, the port can be omitted (along with the preceding colon), and it defaults to the standard "80". The path tells the WWW server which file you want, and if omitted, indicates that you want the "home page" for the system. The searchpart may be used to pass information to the server, often to an executable CGI script, but for most WWW documents is not used. Generally, this part of the URL is omitted, along with the preceding question-mark.
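As an illustration of these defaults, the sketch below fills in port 80 and a root path when they are omitted. The function name and the www.example.com URL are hypothetical, chosen only to show a port and a searchpart.

```python
from urllib.parse import urlsplit

def describe_http_url(url):
    """Return the host, port, path, and searchpart of an HTTP URL."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # An omitted port defaults to the standard 80.
        "port": parts.port if parts.port is not None else 80,
        # An omitted path means the "home page" of the system.
        "path": parts.path or "/",
        # None when there is no question mark or nothing after it.
        "searchpart": parts.query or None,
    }

print(describe_http_url("http://www.netspace.org/users/dwb/url-guide.html"))
print(describe_http_url("http://www.example.com:8080/cgi-bin/search?term=url"))
```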
Another character that may be frequently encountered when browsing the WWW is the pound sign (#), which can be used to point to a named anchor. An author of an HTML document can allow browsers to point to a specific section of a document by creating a named anchor within that document. Then, a URL with a pound sign and the anchor's name appended will reference that specific section. Named anchors are used throughout this document, and as an example, the following URL points directly to the section "What are URLs?":
http://www.netspace.org/users/dwb/url-guide.html#what
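A client separates the anchor name from the rest of the URL before contacting the server. In Python this can be sketched with the standard urldefrag function:

```python
from urllib.parse import urldefrag

# Split the URL at the pound sign into the document and the anchor name.
document_url, anchor = urldefrag(
    "http://www.netspace.org/users/dwb/url-guide.html#what")

print(document_url)
print(anchor)
```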
ftp://<user>:<password>@<host>:<port>/<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
If contacting a site which provides general FTP access, the user and password can be omitted, including the colon between them and the at-symbol afterwards. The host is the Internet address of the FTP site. The port and its preceding colon can be omitted as well. The portion of "<cwd1>/<cwd2>/.../<cwdN>" refers to the series of "change directory" commands a client must use to move to the directory in which the desired file resides. The name is the filename of the desired file. The construction ";type=<typecode>" allows for a transmission method (e.g. ascii vs. binary) to be specified, but I haven't found any clients which support this syntax, and in fact, most incorrectly assume that it is part of the filename. For now, avoid using the typecode.
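To make the pieces concrete, here is a rough sketch of how a client might take an FTP URL apart. The function name is hypothetical, and the fallback to port 21 is an assumption (it is the standard FTP port, which the text above does not state).

```python
from urllib.parse import urlsplit, unquote

def parse_ftp_url(url):
    """Split an FTP URL into login details, directories, name, and typecode."""
    parts = urlsplit(url)
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    # Anything after ";" is the optional ";type=<typecode>" construction.
    path, _, params = path.partition(";")
    typecode = params[len("type="):] if params.startswith("type=") else None
    segments = [unquote(segment) for segment in path.split("/")]
    return {
        "user": parts.username,        # None when omitted
        "password": parts.password,    # None when omitted
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else 21,  # assumed default
        "cwds": segments[:-1],         # one "change directory" per entry
        "name": segments[-1],
        "typecode": typecode,
    }

print(parse_ftp_url("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"))
```

Keeping the directories as a list, rather than one path string, mirrors the description above: each entry corresponds to one "change directory" command.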
gopher://<host>:<port>/<gopher-path>
The host indicates the Internet address of the Gopher server, while the port, as in the previous cases, can generally be omitted along with its preceding colon. The gopher-path specifies the type of Gopher resource, a selector string, and perhaps other information. A detailed discussion of Gopher queries is not within the scope of this document, but generally you can determine a document's gopher-path from information provided by your browser.
mailto:<account@site>
The account@site is the Internet email address of the person you wish to contact, as defined by RFC 822. Note that when encoded in WWW documents, some WWW browsers may not understand the Mailto scheme. Support for Mailto is increasing, but for now, one can switch to a different browser or interpret the Mailto URL manually.
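Interpreting a Mailto URL manually amounts to taking everything after the colon. A minimal sketch (the function name is illustrative):

```python
from urllib.parse import urlsplit, unquote

def mailto_address(url):
    """Return the email address carried by a Mailto URL."""
    parts = urlsplit(url)
    if parts.scheme != "mailto":
        raise ValueError("not a Mailto URL: " + url)
    # Decode any escape sequences in the address.
    return unquote(parts.path)

print(mailto_address("mailto:dwb@netspace.org"))
```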
news:<newsgroup-name>
news:<message-id>
The newsgroup-name is the Usenet newsgroup name (e.g. comp.infosystems.www.providers) and generally will tell the browser to retrieve the titles of all the available articles within that newsgroup. If the newsgroup-name is "*", the URL refers to "all available newsgroups." The message-id corresponds to the Message-ID of the specific article to obtain, and can be found within the article's header information.
Note that the News URL does not specify how a client is to obtain this information. A client must be properly configured to know where to obtain Usenet newsgroups and articles, generally from a specific NNTP server.
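Because both News forms share one syntax, a client has to tell them apart by content. The sketch below relies on an assumption not stated above: that a Message-ID always contains an at-symbol while a newsgroup name never does. The message-id in the example is made up.

```python
def classify_news_url(url):
    """Return (kind, value) for a News URL.

    Assumes a Message-ID contains "@" and a newsgroup name does not.
    """
    scheme, colon, rest = url.partition(":")
    if not colon or scheme.lower() != "news":
        raise ValueError("not a News URL: " + url)
    if rest == "*":
        return ("all-newsgroups", rest)
    if "@" in rest:
        return ("message-id", rest)
    return ("newsgroup", rest)

print(classify_news_url("news:comp.infosystems.www.providers"))
print(classify_news_url("news:12345@host.example.com"))  # hypothetical id
```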
telnet://<user>:<password>@<host>:<port>/
The user and password tokens can be omitted, and are included only for advisory purposes. The host refers to the site to connect to, and port can be omitted, defaulting to the standard "23".
tn3270://<user>:<password>@<host>:<port>/
wais://<host>:<port>/<database>
wais://<host>:<port>/<database>?<search>
wais://<host>:<port>/<database>/<wtype>/<wpath>
The host and port (which can be omitted) describe the same constructs in previously described schemes. The first syntax indicates a specific WAIS database, the second a particular search, and the third a specific document.
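The three WAIS forms can be distinguished by their shape alone. A rough sketch, in which the function name, the host wais.example.com, and the database, wtype, and wpath values are all hypothetical:

```python
from urllib.parse import urlsplit

def classify_wais_url(url):
    """Report which of the three WAIS syntaxes a URL follows."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if parts.query:
        return "search"        # <database>?<search>
    if len(segments) == 1:
        return "database"      # <database>
    if len(segments) >= 3:
        return "document"      # <database>/<wtype>/<wpath>
    raise ValueError("unrecognized WAIS URL: " + url)

print(classify_wais_url("wais://wais.example.com/docs"))
print(classify_wais_url("wais://wais.example.com/docs?urls"))
print(classify_wais_url("wais://wais.example.com/docs/TEXT/0=1234"))
```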
file://<host>/<path>
The host is the fully qualified domain name of the system, and the path is the hierarchical directory path of the form "directory/directory/.../filename". The host can be left as an empty string or "localhost" to refer to local files on the client on which the URL is being interpreted.
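A client can use the host portion to decide whether a File URL refers to its own disk. A minimal sketch, with hypothetical paths and host name:

```python
from urllib.parse import urlsplit

def is_local_file_url(url):
    """True when a File URL has an empty host or the host "localhost"."""
    parts = urlsplit(url)
    # hostname is None when the host is left as an empty string.
    return parts.scheme == "file" and parts.hostname in (None, "localhost")

print(is_local_file_url("file:///docs/readme.txt"))
print(is_local_file_url("file://localhost/docs/readme.txt"))
print(is_local_file_url("file://files.example.com/docs/readme.txt"))
```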
nntp://<host>:<port>/<newsgroup-name>/<article-number>
The items within this syntax are all as described in previous schemes. Generally, it is better to use the News scheme and trust that the client knows how to obtain Usenet items. The NNTP scheme specifies that the NNTP protocol is used, and also specifies a specific NNTP server, designated by the host, to be used; most NNTP servers do not provide universal access. Thus, use News whenever possible.
prospero://<host>:<port>/<hsoname>;<field>=<value>
See Neuman, B., and S. Augart, "The Prospero Protocol", USC/Information Sciences Institute, June 1993, <URL:ftp://prospero.isi.edu/pub/prospero/doc/prospero-protocol.PS.Z> for information on Prospero.
SPACE  %20
<      %3C
>      %3E
#      %23
%      %25
{      %7B
}      %7D
|      %7C
\      %5C
^      %5E
~      %7E
[      %5B
]      %5D
`      %60
Reserved characters are characters which have special meaning within specific schemes, and must be encoded when used in such schemes if they are to be used for a purpose other than that meaning. The escape sequences are listed below:
;      %3B
/      %2F
?      %3F
:      %3A
@      %40
=      %3D
&      %26
Thus, the tilde (~) which designates the directory within which this document resides should be encoded to produce the URL:
http://www.netspace.org/users/dwb/url-guide.html
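Each escape sequence is simply a percent sign followed by the character's two-digit hexadecimal code. The sketch below encodes only the unsafe characters listed in this appendix; the function name and sample strings are illustrative. Python's built-in urllib.parse.quote follows later rules that leave the tilde unescaped, which is why a small custom function is used here, while the standard unquote handles decoding.

```python
from urllib.parse import unquote

# The unsafe characters listed in this appendix, including SPACE.
UNSAFE = " <>#%{}|\\^~[]`"

def escape_unsafe(text):
    """Replace each unsafe character with "%" plus its hex code.

    Do not apply this twice: "%" is itself unsafe and would be re-encoded.
    """
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSAFE else ch for ch in text)

encoded = escape_unsafe("~dwb/url guide.html")
print(encoded)
print(unquote(encoded))
```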
For some FTP servers, such as many VM systems, this is an incorrect assumption and these clients may be unable to retrieve such files.