Lab Project #2, Part B: Proxy Cache
In this lab, you will develop a small web proxy server which is
also able to cache web pages. This is a very simple proxy server
which only understands simple GET-requests, but is able to handle all
kinds of objects, not just HTML pages, but also images.
This will give you a chance to get to know one of the most popular
application protocols on the Internet, the Hypertext Transfer Protocol
(HTTP). When you're done with the assignment, you should be able to
configure your web browser to use your personal proxy server as a web
proxy.
Overview: HTTP Proxies
Ordinarily, HTTP is a client-server protocol. The client (usually your
web browser) communicates directly with the server (the web server
software). However, in some circumstances it may be useful to
introduce an intermediate entity called a proxy. Conceptually, the
proxy sits between the client and the server. In the simplest case,
instead of sending requests directly to the server the client sends
all its requests to the proxy. The proxy then opens a connection to
the server, and passes on the client's request. The proxy receives the
reply from the server, and then sends that reply back to the
client. Notice that the proxy is essentially acting as both an HTTP
client (to the remote server) and an HTTP server (to the initial
client).
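To make the forwarding step concrete, here is a minimal sketch of how a proxy might pull the destination out of a request. It assumes the browser sends the full URL in the request line (for example "GET http://example.com/index.html HTTP/1.0"), which is how browsers normally address a configured proxy. The class name RequestLineSketch and its parse method are hypothetical and are not part of the lab's reference code.

```java
import java.net.URI;

// Hypothetical helper, not part of the lab's reference code.
class RequestLineSketch {
    /**
     * Splits a proxy-style request line into {host, port, path}.
     * Only simple GET requests with an absolute URL are accepted.
     */
    static String[] parse(String requestLine) {
        String[] parts = requestLine.trim().split("\\s+");
        if (parts.length != 3 || !parts[0].equals("GET")) {
            throw new IllegalArgumentException("Not a simple GET request: " + requestLine);
        }
        URI uri = URI.create(parts[1]);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("Request URL has no host: " + parts[1]);
        }
        // getPort() returns -1 when the URL has no explicit port; assume 80 for HTTP.
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        return new String[] { host, String.valueOf(port), path };
    }
}
```

With the host and port in hand, the proxy can open its own connection to the server and pass the request on.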
Why use a proxy? There are a few possible reasons:
- Performance: By saving a copy of the pages that it fetches,
a proxy can reduce the need to create connections to remote
servers. This can reduce the overall delay involved in retrieving a
page, particularly if a server is remote or under heavy load.
- Content Filtering and Transformation: While in the simplest
case the proxy merely fetches a resource without inspecting it, there
is nothing that says that a proxy is limited to blindly fetching and
serving files. The proxy can inspect the requested URL and selectively
block access to certain domains, reformat web pages (for instance, by
stripping out images to make a page easier to display on a handheld or
other limited-resource client), or perform other transformations and
filtering.
- Privacy: Normally, web servers log all incoming requests
for resources. This information typically includes at least the IP
address of the client, the browser or other client program that they
are using (called the User-Agent), the date and time, and the
requested file. If a client does not wish to have this personally
identifiable information recorded, routing HTTP requests through a
proxy is one solution. All requests coming from clients using the same
proxy appear to come from the IP address and User-Agent of the proxy
itself, rather than the individual clients. If a number of clients use
the same proxy (say, an entire business or university), it becomes
much harder to link a particular HTTP transaction to a single computer
or individual.
Assignment Details
Reference Code
The code is divided into three classes as follows:
- ProxyCache holds the start-up code for the proxy and code for
handling the requests.
- HttpRequest contains the routines for parsing and processing the
incoming requests from clients.
- HttpResponse takes care of reading the replies from servers and
processing them.
Your work will be to complete the proxy so that it is able to
receive requests, forward them, read replies, and return those to the
clients. You will need to complete the classes ProxyCache,
HttpRequest, and HttpResponse. The places where you need to fill in
code are marked with /* Fill in */. Each place may require one or
more lines of code.
NOTE: As explained below, the proxy uses
DataInputStreams for processing the replies from servers. This is
because the replies are a mixture of textual and binary data and the
only input streams in Java which allow treating both at the same time
are DataInputStreams. Because DataInputStream.readLine() is
deprecated, the compiler will print a deprecation warning. To see
exactly which calls are flagged, use the -deprecation argument for the
compiler as follows:
javac -deprecation *.java
The warning does not stop compilation. Without the -deprecation flag
the compiler only prints a short note that deprecated APIs are in use.
Running the Proxy
Run the proxy as follows:
java ProxyCache port
where port is the port number on which you want the proxy to
listen for incoming connections from clients.
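As a small illustration of the start-up step, the following sketch validates the single command-line argument before the proxy tries to listen on it. The class name PortArgumentSketch and the error messages are assumptions made for this example; the reference ProxyCache may handle its arguments differently.

```java
// Hypothetical helper, not part of the lab's reference code.
class PortArgumentSketch {
    /** Returns the port given on the command line, or throws if it is unusable. */
    static int parsePort(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("Usage: java ProxyCache port");
        }
        int port;
        try {
            port = Integer.parseInt(args[0]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Port must be a number: " + args[0]);
        }
        // Valid TCP ports are 1..65535.
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("Port out of range: " + port);
        }
        return port;
    }
}
```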
Configuring Your Browser
You will also need to configure your web browser to use your
proxy. This depends on your browser. In Internet Explorer, you can set
the proxy in "Internet Options" in the Connections tab under LAN
Settings. In Netscape (and derived browsers, such as Mozilla), you can
set the proxy in Edit->Preferences and then select Advanced and
Proxies.
In both cases you need to give the address of the proxy and the
port number which you gave when you started the proxy. You can run the
proxy and browser on the same computer without any problems.
Proxy Functionality
The proxy works as follows.
- The proxy listens for requests from clients.
- When there is a request, the proxy spawns a new thread for
handling the request and creates an HttpRequest-object which
contains the request.
- The new thread sends the request to the server and reads the
server's reply into an HttpResponse-object.
- The thread sends the response back to the requesting client.
Your task is to complete the code which handles the above
process. Most of the error handling in the proxy is very simple and it
does not inform the client about errors. When there are errors, the
proxy will simply stop processing the request and the client will
eventually get a timeout.
Some browsers also send their requests one at a time, without using
parallel connections. Especially on pages with a lot of inlined
images, this may cause the page to load very slowly.
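The listen-and-spawn steps listed above can be sketched as follows. This is only an outline under stated assumptions: the Handler interface is a hypothetical stand-in for the lab's request-handling code (building the HttpRequest, contacting the server, and returning the HttpResponse), and the class name AcceptLoopSketch is not part of the reference code.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical outline, not part of the lab's reference code.
class AcceptLoopSketch {
    /** Stand-in for the code that processes one client request. */
    interface Handler {
        void handle(Socket client) throws IOException;
    }

    /** Accepts clients until the server socket is closed, one thread per client. */
    static void serve(ServerSocket server, Handler handler) {
        while (!server.isClosed()) {
            final Socket client;
            try {
                client = server.accept();
            } catch (IOException e) {
                break; // the socket was closed or failed, so stop accepting
            }
            new Thread(() -> {
                // try-with-resources closes the client socket when the handler returns.
                try (Socket c = client) {
                    handler.handle(c);
                } catch (IOException e) {
                    // Same simple policy as the lab: give up on this request.
                }
            }).start();
        }
    }
}
```

One thread per connection keeps a slow server from blocking other clients, at the cost of creating a thread for every request.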
Programming Hints
Most of the code you need to write relates to processing HTTP
requests and responses as well as handling Java sockets.
One point worth noting is the processing of replies from the
server. In an HTTP response, the headers are sent as ASCII lines,
separated by CRLF character sequences. The headers are followed by an
empty line and the response body, which can be binary data in the case
of images, for example.
Java separates the input streams according to whether they are
text-based or binary, which presents a small problem in this
case. Only DataInputStreams are able to handle both text and binary
data simultaneously; all other streams are either pure text (e.g.,
BufferedReader), or pure binary (e.g., BufferedInputStream), and
mixing them on the same socket does not generally work.
The DataInputStream has a small gotcha: its readLine() method is not
able to guarantee that the data it reads can be correctly converted to
the correct characters on every platform. In the case of this lab, the
conversion usually works, but the compiler will flag the
DataInputStream.readLine() method as deprecated and print a warning.
The code still compiles; the -deprecation flag makes the compiler list
each flagged call.
It is highly recommended that you use the DataInputStream for
reading the response.
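The following sketch shows one way to read a reply in the manner described above: the status line and headers as text lines, then the body as raw bytes, all from the same DataInputStream. It assumes the server closes the connection when the body is complete, so the body is simply everything up to end of stream. The class name ResponseReaderSketch and its fields are hypothetical and do not match the reference HttpResponse.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader, not the lab's HttpResponse class.
class ResponseReaderSketch {
    final String statusLine;
    final List<String> headers = new ArrayList<>();
    final byte[] body;

    @SuppressWarnings("deprecation") // readLine() is deprecated but still usable
    ResponseReaderSketch(InputStream fromServer) throws IOException {
        DataInputStream in = new DataInputStream(fromServer);
        statusLine = in.readLine();
        if (statusLine == null) {
            throw new IOException("Server closed the connection without replying");
        }
        // Header lines end at the first empty line.
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            headers.add(line);
        }
        // Assumption: the body runs until the server closes the connection.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            collected.write(buffer, 0, n);
        }
        body = collected.toByteArray();
    }
}
```

Because the body is copied byte for byte, image data passes through unchanged even when it contains bytes that look like line endings.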
(Bonus points) Possible Extensions
While it may not be obvious at first, proxies are very flexible tools
that can serve a number of different purposes on the web. Common uses
for proxies include giving performance boosts to dial-up
users (through caching and pre-fetching), privacy protection (through
anonymous proxies), content filtering and blocking (used in many
"NetNanny"-type applications), and content transformation.
Sample Proxy Applications:
When you have finished the basic assignment, you can try the
following extensions for bonus points.
- Better error handling. Currently the proxy does almost no error
handling. This can be a problem especially when the client
requests an object which is not available, since a "404 Not
Found" response may have no response body, while the proxy
assumes there is a body and tries to read it.
- Support for the POST method. The simple proxy supports only the
GET method. Add support for POST by including the request body
sent in the POST request.
- Content Transformation.
Content transformation is the process of a proxy inserting, removing,
or changing the contents of a resource requested from a remote
server. After the resource has been retrieved from the server, the
proxy is free to do whatever it would like to the content. Since the
data returned from a web server is usually just text, this means that
we can change the page almost any way we want: add or remove dirty
words, change the text to Pig-Latin, rotate the images on the page 90
degrees, etc.
- Caching. Caching is one of the most common performance
enhancements that web proxies implement. Caching takes advantage of
the fact that most pages on the web don't change that often, and that
any page that you visit once you (or someone else using the same
proxy) are likely to visit again. A caching proxy server saves a copy
of the files that it retrieves from remote servers. When another
request comes in for the same resource, it returns the saved (or
cached) copy instead of creating a new connection to a remote
server. This saves a modest amount of time and CPU if the remote
server is nearby and lightly trafficked, but can create more
significant savings in the case of a more distant server or a remote
server that is overloaded (it can also help reduce the load on heavily
trafficked servers).
Caching introduces a few new complexities as well. First of all, a
great deal of web content is dynamically generated, and as such
shouldn't really be cached. Second, we need to decide how long to keep
pages around in our cache. If the timeout is set too short, we negate
most of the advantages of having a caching proxy. If the timeout is
set too long, the client may end up looking at pages that are outdated
or irrelevant.
The basic functionality of caching goes as follows.
- When the proxy gets a request, it checks if the requested
object is cached, and if yes, then returns the object from the
cache, without contacting the server.
- If the object is not cached, the proxy retrieves the object
from the server, returns it to the client, and caches a copy for
future requests.
Add the simple caching functionality described
above. You do not need to implement any replacement or
validation policies. Your implementation will need to be able to
write responses to the disk (i.e., the cache) and fetch them
from disk when you get a cache hit. For this you need to
implement some internal data structure in the proxy to keep
track of which objects are cached and where they are on
disk. You can keep this data structure in main memory; there is
no need to make it persist across shutdowns.
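As a starting point, the simple cache described above might look like the following sketch: an in-memory map from the requested URL to a file on disk that holds the stored response. The class name DiskCacheSketch, its method names, and the use of temporary file names are all assumptions for illustration; nothing here comes from the reference code, and there is no replacement or validation policy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache, not part of the lab's reference code.
class DiskCacheSketch {
    private final Path directory;
    // In-memory index: requested URL -> file holding the cached response.
    // It is not persisted, so the cache starts empty after a restart.
    private final Map<String, Path> index = new ConcurrentHashMap<>();

    DiskCacheSketch(Path directory) throws IOException {
        this.directory = Files.createDirectories(directory);
    }

    /** Returns the cached response bytes for the URL, or null on a cache miss. */
    byte[] get(String url) throws IOException {
        Path file = index.get(url);
        return file == null ? null : Files.readAllBytes(file);
    }

    /** Writes the response to a new file on disk and records where it is. */
    void put(String url, byte[] response) throws IOException {
        Path file = Files.createTempFile(directory, "object", ".cache");
        Files.write(file, response);
        index.put(url, file);
    }
}
```

A request handler would call get first and contact the server only when it returns null, then call put with the response it fetched. ConcurrentHashMap is used because several request threads may touch the index at once; if two threads cache the same URL, the later entry wins and the earlier file is simply left unused.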