
Proxies and Performance

Originally published December 2000
by Carlo Kopp
© 2000, 2005 Carlo Kopp

The web proxy server has become a central part of the computing infrastructure of many organisations. As such, getting the configuration, host sizing and network topology right is vital to achieving good performance in a proxy.

In this month's feature we will explore some of the basic ideas in proxies and elaborate on some of the key issues in sizing and performance.

Why Proxies?

The primary aim of web proxies has always been performance enhancement by caching. If a requested object can be found in the proxy's cache, it can be returned much faster than fetching it from a remote site. More recently, proxies have also been used to censor access through access control lists (ACLs), which block access to unwanted URLs.

Given the current political obsession with web censorship, we can expect the use of proxies for access control to become an increasingly important function. This discussion will however focus on the performance issues.

A well configured proxy on a well sized and tuned host will enhance web access performance in two ways. The first is direct, by recovering frequently accessed web objects from the local cache. The second is indirect, by the reduction of traffic over the main link in and out of the site. Frequently accessed objects which sit in the cache are objects which need not use bandwidth upstream of the proxy.

From the user's perspective, the direct performance improvement by effective caching is the more visible. From the organisation's perspective, the reduction in Megabytes downloaded across the external link reflects in the dollars expended on link traffic charges.

Achieving good proxy performance is not a trivial chore, as those readers with prior exposure will appreciate. Indeed, a poorly performing proxy will reduce web access performance and achieve very little in terms of traffic reduction. Therefore the importance of getting proxy performance right cannot be overstated.

Understanding Proxy Performance

Proxies exhibit behaviour which is determined by basic caching theory, discussed in some detail in a recent article in this series. Unlike caches embedded in operating systems or machine hardware, a proxy is a system-level cache whose performance is closely tied to external factors such as aggregate machine sizing and network performance. The process of tuning the cache design to achieve the intended performance aims is therefore much messier and frequently much more expensive.

To understand these interactions it is helpful to discuss the operation of a proxy in more detail.

The fundamental functional task of a proxy is to receive outbound HTTP (and frequently also FTP and Gopher) requests for object access, compare these against a stored cache of previously accessed objects, and if the object is in the cache, return it to the requesting host as quickly as possible. If the object is not in the proxy cache, the proxy must fetch it from the web, return it to the requesting host, and store a copy in the proxy cache.
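This hit/miss flow can be sketched in a few lines of Python. This is a minimal illustrative sketch, not any real proxy's code: the in-memory index and the fetch_from_origin() helper are hypothetical stand-ins.

```python
# Minimal sketch of the proxy hit/miss flow. The index and the
# fetch_from_origin() helper are illustrative stand-ins only.

def fetch_from_origin(url):
    # Stand-in for a real HTTP fetch across the external link.
    return "body of " + url

class ProxyCache:
    def __init__(self):
        self.index = {}                  # URL -> cached object

    def request(self, url):
        if url in self.index:            # hit: serve from the local cache
            return self.index[url], "HIT"
        body = fetch_from_origin(url)    # miss: fetch from the external net
        self.index[url] = body           # store a copy for later requests
        return body, "MISS"
```

A first request for a URL comes back as a miss and populates the cache; a repeated request for the same URL is then served locally as a hit.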

A typical proxy runs as one or more processes on a dedicated host, using the host's network protocol stack and inter-process communications mechanism to transfer objects between itself, requesting hosts and target sites on the external network. It will typically store the cached objects on disk, usually multiple spindles, and maintain a memory resident table with the necessary indexing information to locate disk resident objects.

Therefore the performance of the proxy depends not only on the central system level performance parameter of a cache, the hit ratio, but also upon the performance of the proxy host hardware, and the performance of the operating system interprocess communications mechanism, file system and disk block device drivers. Needless to say the performance of the host operating system memory management and disk block caching mechanisms can have an important impact as well.

Hit ratio, to reiterate the point, is the ratio of accesses which hit a cached object and can be retrieved quickly to the total number of accesses. A typical figure for a proxy cache is between 40% and 60%, although it depends quite critically upon the statistical behaviour of the user population's web browsing.
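The arithmetic is simple; the sketch below, with a made-up access trace and cache contents, just makes the definition concrete.

```python
# Hit ratio = hits / total accesses. The trace and cache contents
# here are synthetic, purely to illustrate the definition.

def hit_ratio(trace, cache):
    hits = sum(1 for url in trace if url in cache)
    return hits / len(trace)

cache = {"/index.html", "/logo.gif", "/news.html"}
trace = ["/index.html", "/logo.gif", "/big.iso", "/news.html", "/index.html"]
# 4 of the 5 accesses find their object in the cache -> ratio 0.8
```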

It is helpful to trace the progress of a request which results in either a hit or a miss through a proxy server, to identify the individual sources of time delay, since trimming these down is the key to good proxy performance.

When a user host makes an HTTP request to the proxy, this request will be wrapped in an IP packet and fired across the LAN to the network interface on the proxy host. There the packet is received by the hardware, buffered and processed first by the lower and upper halves of the device driver, and then by the TCP/IP protocol stack. It is then transferred via the interprocess communications mechanism, usually a socket scheme, to the proxy process.

From a performance perspective the first potential bottleneck is the LAN and the interfaces on the user host and the proxy host. The fatter the pipe, the quicker the request can be transferred from the user's process, typically a browser like Netscape or Mosaic, to the proxy process on the proxy host. The path between these two processes fits the classical network interprocess communications model and its performance can be directly modelled using queueing theory, and measured using tools like ttcp. In essence this is the same performance problem seen with large client-server systems, web servers or hosts driving Xterminals or NCs.

The choices for maximising performance are straightforward. In hardware, it is necessary to put the fastest possible LAN connection on the proxy host. This means not only a high speed connection, like a 100 Mbit/s Ethernet or 155 Mbit/s ATM over a switch, but also a carefully chosen adaptor board with good performance. In terms of the operating system, the best choices are those which have excellent interprocess communications and network protocol stack performance. In practice, this limits choices to various Unix variants, especially BSD subtypes.

Once the request has been received by the proxy process, it is executed and the memory resident indexing structure is searched to determine whether the object is sitting in the disk cache or not. Should a cache miss occur, the proxy process will then send the request out to the net to fetch the object, either transparently or otherwise (many proxies can mask the identity of the requesting host by substituting the source IP address with their own). Therefore, the same performance caveats for the incoming proxy request also apply, with hardware, operating system and LAN interface introducing time delays.

The time for the proxy to process the miss is therefore determined by the time to receive the request, the time to process the request, and the time to forward it to the external network. Thereafter, the request must propagate across the net to the target web, ftp or gopher host, be processed at that end, and a response be sent back, propagating across the same chain. While we have some control over the time to handle a miss at the proxy end, we have no control over the time the request will take in the big bad external world. Therefore the miss time is highly variable statistically, depending upon the performance and the load across every network connection and router between the proxy and the external target host, and the performance of the target host. Hence the importance of minimising misses.
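As a back-of-envelope illustration of why the external term dominates miss time, consider the decomposition below; all figures are invented for illustration, not measurements of any particular system.

```python
# Decomposition of miss service time into its components.
# All figures are illustrative (milliseconds), not measurements.

local_receive = 2.0    # LAN + protocol stack + IPC into the proxy process
local_lookup  = 0.5    # search of the memory resident index
local_forward = 2.0    # protocol stack + LAN out toward the external link
external_time = 350.0  # routers, links and origin server: outside our control

miss_time = local_receive + local_lookup + local_forward + external_time
# The external term swamps the local ones, hence the priority on
# minimising the number of misses rather than shaving the local path.
```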

What we have some opportunity to manipulate in a miss situation is the time it takes the proxy to perform the lookup in its cache indexing tables. The first caveat is that the whole table must be memory resident: if any part of it is swapped out to disk, we incur delays of up to seconds while it is swapped back into memory for an access. Therefore, inadequate memory in the proxy host is a recipe for disaster, and sizing must always allow the table to remain wholly memory resident.

Assuming the table is memory resident, then we can improve speed by using a faster processor in the proxy host. This is the classical memory resident compute bound performance problem, where memory bandwidth, CPU clock speed and architecture, and CPU cache size and performance do matter. Suffice to say that horsepower in this area is seldom wasted in a proxy system. It is worth noting that extra performance directly translates into CPU cycles available to crunch through the network protocol stack and inter-process communications code, which does not hurt overall system performance in the slightest.

Dealing with a proxy cache hit is a little more complex, since it will result in disk I/Os. If we assume a hit situation occurs, then the proxy process will have to perform a disk access to read the object into memory. The disk I/O request must therefore propagate via a system call to the file system code, which must in turn translate it into a series of disk block read requests which are processed by the block I/O disk device driver and then sent via the storage bus, SCSI, IDE or other, to the disk hardware. The disk hardware in turn must search its multi-Megabyte internal cache for these blocks, and if it misses, it must go through repeated seek/rotate cycles to recover the blocks off the disk platters.

The key performance drivers here are the throughput of the host's internal I/O bus to the disk adapter, the speed with which the disk driver upper and lower halves can be executed, the speed of the storage adapter hardware and storage bus, and the speed of the disk drive itself. Not to forget, the speed of the filesystem on the disk is also a key issue here.

This is, yet again, a classical computer system performance problem, driven by queueing theory and amenable to mathematical modelling, and with some difficulty, also measurement. Unlike some large supercomputing and similar applications, the proxy problem has the nice property of involving large numbers of disk I/O requests to relatively small objects, frequently very widely scattered around. This means in turn, that the I/O traffic can be readily spread across a large number of spindles, thereby minimising queueing delays on the disk hardware.

Indeed, the ideal disk hardware arrangement is one in which the native block size is a little larger than the most frequently occurring cache object sizes, and the traffic is spread across as large a number of modestly sized disk drives as possible. In effect, the more parallel queues we have for disk I/Os, the better the potential performance, all other things being equal.
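The benefit of parallel spindles can be illustrated with the simplest queueing model. The sketch below assumes M/M/1 behaviour (Poisson arrivals, exponential service) and invented rates; real disk workloads are messier, but the trend it shows is the point.

```python
# Mean response time of an M/M/1 queue: T = 1 / (mu - lambda).
# Rates below are invented, in I/O requests per second.

def mm1_response_time(lam, mu):
    assert lam < mu, "queue must be stable"
    return 1.0 / (mu - lam)

total_load = 90.0   # aggregate disk I/O arrival rate
mu_disk    = 100.0  # service rate of a single spindle

one_spindle   = mm1_response_time(total_load, mu_disk)        # one hot disk
four_spindles = mm1_response_time(total_load / 4.0, mu_disk)  # load split 4 ways
# Splitting the same traffic across four spindles cuts the mean
# response time per request by almost an order of magnitude.
```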

As noted previously, filesystem performance itself is an issue, and in this respect the best choices remain the BSD FFS/UFS filesystem family, which perform well against most alternatives under a scattered workload. With any filesystem, larger objects are handled better when the filesystem is not heavily fragmented and a good statistical proportion of large objects can be stored in runs of consecutive disk blocks.

A RAID array can be a good choice for such an application, providing that it is properly tuned to the workload, and does not exhibit any pathological interactions with the filesystem block optimisation algorithms. One nice property of many RAID designs is the opportunity to set the RAID block size to something which is a good fit for the statistically most frequent object sizes stored in the cache.

Whether to opt for RAID or a large farm of disks is an open question, insofar as it requires some insight into the statistics of the user site which may prove to be difficult to guess right.

Up to this point, the discussion has centred on maximising the speed of hit and miss handling in the proxy, making no assumptions about the hit ratio on the proxy. The latter is an issue within itself.

Maximising Proxy Hit Ratios

If the hit ratio we see on our proxy is low, no amount of performance improvement in hit and miss processing will significantly improve performance, for the simple reason that the statistically dominant performance issue will be external web access time to target websites.

Therefore the hit ratio determines to what degree any other performance improvements on the proxy host can be exploited.

Proxy software, e.g. the public domain Squid, or the various commercial offerings, uses a wide range of cache replacement algorithms and other techniques aimed at improving the cache hit ratio, or even adaptively tuning it over time. Most of the algorithms in use have been borrowed from established operating system schemes used for disk I/O or other caching applications.

For a web object cache, in which the user access patterns can be very widely scattered, the biggest single driver of cache hit ratio performance is size. This is because the proxy cache is, in classical terms, closest in behaviour to the directly mapped cache architecture, and clever tricks such as set associativity cannot be exploited. The ground rule for directly mapped caches is bigger is better.

It follows therefore that the larger the disk farm on the proxy, the better the potential hit ratio. Money saved on disk capacity cannot be recovered by increasing proxy host performance, and given the low cost per Gigabyte these days, skimping on disk makes little sense.

Indeed, the best strategy for performance growth in a proxy is to increase disk capacity until a peak hit ratio is achieved, upon which it makes sense to start tuning hit and miss processing times.
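The effect of cache size on hit ratio can be illustrated by replaying a synthetic access trace through LRU caches of different sizes. The trace and the LRU policy here are illustrative choices only; real proxies use a variety of replacement algorithms, and real access traces are far less regular.

```python
# Replay a synthetic access trace through an LRU cache of a given
# capacity and report the resulting hit ratio. Illustrative only.
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    cache, hits = OrderedDict(), 0
    for url in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)         # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

# Ten hot objects accessed cyclically, one hundred accesses in all:
trace = ["u%d" % (i % 10) for i in range(100)]
# A cache holding the whole working set hits on every repeat access,
# while a cache even one slot too small thrashes on this cyclic trace.
```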

The proxy farm, comprising multiple sibling proxies which communicate using the ICP protocol (RFC 2186 and RFC 2187), is now a very popular strategy for maximising proxy hit ratios. In a proxy farm, potentially huge storage capacities can be achieved. Each proxy will, on a miss, interrogate its siblings to determine whether the object is held somewhere in the proxy farm. If yes, the proxy holding it will recover it from its own disk.
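A toy sketch of the sibling interrogation logic, with each sibling's cache modelled as a plain dict; real ICP (RFC 2186) is a small UDP query/response protocol between proxy processes, so this captures only the control flow, not the messaging.

```python
# Toy model of proxy farm lookup: check the local cache first, then
# interrogate siblings ICP-style. Siblings are modelled as dicts here;
# real ICP is a UDP query/response protocol between proxy processes.

def farm_lookup(url, own_cache, siblings):
    if url in own_cache:
        return own_cache[url], "LOCAL_HIT"
    for sibling in siblings:             # fan out queries to the siblings
        if url in sibling:
            return sibling[url], "SIBLING_HIT"
    return None, "FARM_MISS"             # nobody holds it: go to the origin
```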

Maximising the performance of a proxy farm is a no less interesting problem for a system architect. The key issue is to minimise the time consumed in the ICP query and response cycles. Since these are bound primarily by the time it takes for ICP messages to propagate between proxy processes on each of the proxy siblings, and the time to process these in each sibling proxy system, the performance drivers are the same as those in miss processing on an individual proxy system. Therefore, network performance and CPU speed do matter, as does memory residency of the cache indexing tables.

A very useful strategy for a large proxy farm is to dedicate a high speed switch, and individual LAN adaptors on each proxy host in the farm, solely to the handling of the ICP traffic. In this manner, ICP requests need not compete with other network traffic for adaptors and switch time.

While this may represent a costly addition, its importance should not be underestimated on a larger proxy farm. Each user request may generate a large number of ICP requests, each of which incurs the overheads of an IP message. Since small packet IP traffic behaviour is dominated by queueing delays, a dedicated ICP LAN can produce a disproportionate improvement in performance.

Putting It All Together

Architecting and tuning a proxy or a proxy farm is not a trivial task by any measure. Indeed, in many respects it can be much more demanding than the classical problems of tuning large multiuser NFS servers, web servers and database servers. This discussion has focussed on caching performance alone; the addition of further proxy functions such as censorship of unwanted URLs, or transparent forwarding of SSL traffic, will reflect in further demands for CPU speed, network speed and memory size, should such specialised traffic become statistically significant against more general traffic.

It is worth noting that this is not a chore for beginners to attempt, and will be time consuming by any measure since iterative cycles of measurement and tuning will be required to get the desired balance between performance and cost. In an environment where the user base is growing at a steady rate, this is likely to become a continuous process, rather than an infrequent one-off task.

Are there any generalisations we can make? The only one which makes any sense is the proverbial size matters. Since proxy performance is so sensitive to hit ratio performance, it follows that disk farm or proxy farm size will be the decisive parameter, all else being equal.

In this day and age of commodified 1 GHz CPUs, 128 MB DRAM modules and 20 GB 7200 RPM disks, the cost overheads in achieving good proxy performance are thankfully modest, in comparison with large databases or other applications which tend to bottleneck more easily. Nevertheless, it can be easy to produce disaster if care is not taken.

Given the distributed nature of the problem, the long term trend for proxies and proxy farms is favourable, since they scale well with a commodified small system hardware environment. The ultimate determinant of success in this game will be the knowledge and understanding of the implementor.

Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp
