Prof. Vojin G. Oklobdzija

Sample of possible projects taken from UC Berkeley CS252 site:


CS 252 - Project Suggestions Page

Here is a list of suggested CS 252 projects for Fall 1996. If you use an item that was suggested by someone, you should at least tell that person that you are using it, and ideally tell them the result of your investigation.

Note that you are not limited to the projects listed here. You may work on anything that you find interesting that is deemed worthy of being a CS 252 project.

If you want more of an idea of just what a CS 252 project entails, check out the projects from last year, Spring 1996 and Fall 1995.

Please try to find a project partner and have a rough idea of what you want to work on by Monday, September 23. Send mail to and indicating who you are working with and what you are interested in working on. Please, just one message per group. Feel free to contact us if you have any questions or are looking for suggestions and/or clarifications.

A more detailed survey concerning your project choice will be coming shortly. It is important to get started thinking about this early. Trust me, you don't want to let this all go until the last minute.



SPEC95 across different architectures

Since you have a variety of architectures available, take the gcc compiler and measure SPECint95 on some public version of Unix such as Linux or FreeBSD.
  1. You already know about ATOM (alpha) and EEL (sparc). Now there is also ETCH for x86. Look at:
    Having similar tools for 3 architectures might allow you to have 3 groups of students look at similar stuff on 3 architectures.
  2. Compare path lengths for the various architectures
  3. Do static code size comparisons
  4. See impact of optimizations

Suggested by Dileep Bhandarkar (



SPEC95 cache analysis

Repeat Mark Hill's SPEC92 cache analysis for SPEC95.
  1. Gee, J.D.; Hill, M.D.; Pnevmatikatos, D.N.; Smith, A.J. "Cache performance of the SPEC92 benchmark suite." IEEE Micro, Aug. 1993, vol.13, (no.4):17-27. Abstract: The authors consider whether SPECmarks, the figures of merit obtained from running the SPEC benchmarks under certain specified conditions, accurately indicate the performance to be expected from real, live work loads. Miss ratios for the entire set of SPEC92 benchmarks are measured.
  2. This may need multiple teams, N benchmarks per team.
  3. See if it varies for x86 vs. RISC? Use NOW, PC clusters.
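
As a starting point for this kind of measurement, here is a minimal sketch of a trace-driven miss-ratio calculation. It assumes a set-associative cache with LRU replacement; the function name, the default cache parameters, and the synthetic trace are illustrative assumptions, not taken from the SPEC92 study. A real project would feed in address traces collected from the SPEC95 programs.

```python
from collections import OrderedDict

def miss_ratio(addresses, cache_bytes=8192, line_bytes=32, ways=2):
    """Miss ratio of a set-associative LRU cache over a list of byte addresses.

    All parameters are illustrative defaults, not values from the cited paper.
    """
    if not addresses:
        return 0.0
    num_sets = cache_bytes // (line_bytes * ways)
    # One OrderedDict per set; insertion order doubles as LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        lines_in_set = sets[line % num_sets]
        if line in lines_in_set:
            lines_in_set.move_to_end(line)        # hit: mark most recently used
        else:
            misses += 1
            if len(lines_in_set) >= ways:
                lines_in_set.popitem(last=False)  # evict least recently used
            lines_in_set[line] = True
    return misses / len(addresses)

# Hypothetical trace: two sequential sweeps over 64 KB, one 4-byte word at a time.
trace = [a for _ in range(2) for a in range(0, 65536, 4)]
print(miss_ratio(trace))  # 0.125: one miss per 32-byte line, nothing survives to the second sweep
```

Sweeping `cache_bytes`, `line_bytes`, and `ways` over the same trace gives the kind of miss-ratio tables the project would compare across benchmarks and architectures.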

Suggested by Dileep Bhandarkar (


A "voting" data-prefetch engine

This H/W device has the following characteristics:
  1. For any data reference stream, TWO independent prefetch devices make predictions: one is a standard load address stride predictor (which predicts strided accesses), and the other is a stream buffer, which basically reacts to cache miss history.
  2. The challenge is to design a voting function that dynamically selects one or the other of the prefetch addresses to issue to the higher levels of the memory hierarchy.
  3. This selection should be made conditional on whichever of the two predictors is currently generating the more accurate future address stream.
  4. Accuracy is defined as the ability to reduce future data cache misses.
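
One possible shape for the voting function is sketched below. It is only an illustration under strong simplifying assumptions: the stream buffer is replaced by a crude next-line predictor, and "accuracy" is approximated by whether a predictor's most recent proposal matched the cache line of the next reference, rather than by measured miss reduction. All class names and parameters are hypothetical.

```python
class StridePredictor:
    """Proposes addr + (addr - previous addr); no proposal for a zero or unknown stride."""
    def __init__(self):
        self.last = None

    def predict(self, addr):
        stride = 0 if self.last is None else addr - self.last
        self.last = addr
        return addr + stride if stride else None


class NextLinePredictor:
    """Crude stand-in for a stream buffer: always proposes the next cache line."""
    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes

    def predict(self, addr):
        return (addr // self.line_bytes + 1) * self.line_bytes


class Voter:
    """Keeps one saturating counter per predictor and issues the leader's proposal."""
    def __init__(self, predictors, line_bytes=32, max_score=15):
        self.predictors = predictors
        self.line_bytes = line_bytes
        self.max_score = max_score
        self.last_line = [None] * len(predictors)  # line each predictor proposed last time
        self.score = [0] * len(predictors)

    def access(self, addr):
        line = addr // self.line_bytes
        # Reward a predictor whose previous proposal covered this reference.
        for i, predicted in enumerate(self.last_line):
            if predicted == line:
                self.score[i] = min(self.score[i] + 1, self.max_score)
            else:
                self.score[i] = max(self.score[i] - 1, 0)
        proposals = [p.predict(addr) for p in self.predictors]
        self.last_line = [None if a is None else a // self.line_bytes for a in proposals]
        # Ties go to the first predictor in the list.
        winner = max(range(len(self.score)), key=self.score.__getitem__)
        return winner, proposals[winner]


voter = Voter([StridePredictor(), NextLinePredictor()])
for addr in range(0, 256 * 20, 256):      # constant 256-byte stride
    winner, prefetch = voter.access(addr)
print(winner, prefetch)                   # the stride predictor (index 0) wins here
```

A project would replace the scoring rule with one tied to actual data cache misses avoided, which is the definition of accuracy given above.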

Suggested by Sharad Mehrotra ( Sharad.Mehrotra@Eng.Sun.COM)



Architecture Archeology/Endangered Species Act

A project documenting architectural history might either collect or construct emulators for machines that are disappearing.
  1. The real wonder of the ARPAnet for me in 1973 was the diversity of architecture. I started on an IBM 360/75, and I believed at that time that the world revolved around EBCDIC. Over the next couple of years I encountered my first DEC-10, ILLIAC-IV, CDC-6600, ...
  2. The value of emulation history is going to take on interesting significance in the future. The challenge will be to preserve this software history as the base emulation machines themselves pass into history.
  3. Write emulators in Java so they can run anywhere? Write a simple assembler so people can write programs?

Suggested by Eugene Miya (



Evaluate usefulness of VIS/MMX instructions

Both Intel (MMX, for x86) and Sun (VIS, for SPARC) have recently extended their instruction sets with mini-vector-like operations that operate simultaneously on groups of values (i.e. vectors). One area specifically targeted by these extensions is multimedia. Both companies claim significant speedups from the use of these instructions. How valid are these claims? Evaluate the usefulness of these additional instructions for some application(s). Are the instructions worthwhile? Can you suggest any different new instructions which would be better? What really matters is complete, end-to-end performance of an entire application, not just an impressive speedup on a small micro-kernel. Some claims and/or previous evaluations may have been published in past issues of Microprocessor Report.


The following several suggestions deal with the IRAM project. See the IRAM web page and/or contact Dave Patterson or Rich Fromm for more details about the project.



Investigation of logic circuits in memory processes

One problem with building a processor in a DRAM manufacturing process is that the logic and memory processes are optimized for different factors, and logic in a DRAM process is likely to run slower. How much slower is a very important question. It would be very helpful to simulate (with SPICE) various logic circuits in a memory process to try to quantify these differences. Other related topics (besides speed differences) include area and power considerations. Studying the performance of an SRAM cache built in a DRAM process would also be useful, since an IRAM with a large DRAM main memory is still likely to have an L1 cache, which is likely to be built from SRAM. This project would investigate such physical phenomena and apply this knowledge to some simple architectural evaluations.

Stelios Perisakis ( has already investigated some of these issues, so it would probably be helpful to consult with him before undertaking this project.


On-the-fly compression/decompression

An IRAM will have a tremendous bandwidth to the on-chip memory, but there is still the problem of accesses that must go off-chip. The off-chip memory bandwidth is likely to be orders of magnitude less than the on-chip bandwidth. Perhaps the most sensible solution is to store programs and data in compressed form, decompress them as they are read into the chip, and compress data as it is written off of the chip. Evaluate the usefulness of this idea (with both standard and novel compression schemes) with real programs and data, paying careful attention to the architectural feasibility of any implementation, not just how well some algorithm might perform theoretically.
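
A first-order experiment could measure how well real data compresses when each block is compressed on its own, since an on-the-fly scheme cannot see the whole file. The sketch below uses zlib (DEFLATE) purely as a stand-in for a "standard" scheme; the 4 KB block size, the compression level, and the function name are assumptions, and the sketch says nothing about hardware feasibility or latency.

```python
import zlib

def block_compression_ratio(data: bytes, block_bytes: int = 4096, level: int = 6) -> float:
    """Compressed size divided by original size, compressing each block independently."""
    if not data:
        return 1.0
    compressed = 0
    for offset in range(0, len(data), block_bytes):
        block = data[offset:offset + block_bytes]
        # Assumption: a block that would expand is stored uncompressed instead.
        compressed += min(len(zlib.compress(block, level)), len(block))
    return compressed / len(data)

if __name__ == "__main__":
    import sys
    # Usage: pass any program binary or data file on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, round(block_compression_ratio(f.read()), 3))
```

Ignoring decompression latency, a ratio of r would stretch the effective off-chip bandwidth by roughly 1/r, which is the quantity to weigh against the cost of the compression hardware.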


Treat registers like a cache

Have a (fully?) associative array of readout registers, and only do the writeback once the readout register is replaced. This means that a logical read turns into a physical read. It also implies that the physical DRAM copy is wrong. All the standard cache questions apply: stream buffers, speculation, etc. In addition, there is the question of pre-writeback of dirty lines, as in the virtual memory subpiece of operating systems.

Suggested by Eric Anderson (



Programmable prefetch unit

Many people are building more and more complicated prefetch units that try to learn indirect references, etc. (See Sharad Mehrotra's work at UIUC: "Examination of a memory access classification scheme for pointer-intensive and numeric programs" and "Quantifying the performance potential of a data prefetch mechanism for pointer-intensive and numeric programs".) A potentially better idea is to let the compiler/user program the prefetch unit. The unit would be a second little state machine that can inspect the state of the primary CPU and issue memory prefetches based on that state.

Suggested by Eric Anderson (



Cache replacement policies

Is farthest future use an optimal cache replacement policy? Given a memory reference stream and an optimal replacement policy, if the cache does not do well on that stream, then no cache can do well. However, if that cache can do well, and in addition conventional caches do poorly, then there are still questions about improving caching. There is also the question of how much buffering is needed to do perfect prefetching with the optimal replacement policy. Define optimal as least total misses.
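
The comparison described above can be sketched for a fully associative cache as follows. Farthest-future-use replacement is usually known as Belady's MIN algorithm; the function names and the example reference stream are illustrative assumptions, and the sketch counts misses only (no prefetching or buffering).

```python
from collections import OrderedDict

def opt_misses(trace, capacity):
    """Misses under farthest-future-use replacement, fully associative cache."""
    # next_use[i] = index of the next reference to trace[i], or infinity if none.
    next_use = [0] * len(trace)
    upcoming = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = upcoming.get(trace[i], float("inf"))
        upcoming[trace[i]] = i
    cache = {}  # block -> index of its next reference
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # referenced farthest in the future
                del cache[victim]
        cache[block] = next_use[i]
    return misses

def lru_misses(trace, capacity):
    """Misses under least-recently-used replacement, fully associative cache."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[block] = True
    return misses

# Hypothetical reference stream of block numbers, cache of 3 blocks.
stream = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(opt_misses(stream, 3), lru_misses(stream, 3))  # 7 misses vs. 10 misses
```

The gap between the two counts on real traces is the headroom the project asks about: a large gap means a better practical policy might exist, while a poor result for the optimal policy means no replacement policy can help.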

Suggested by Eric Anderson (


Multiprocessor Simulator

We need to understand how different interconnects for large multiprocessors (order 1K-10K nodes, 100 TFLOPS) impact their performance on scientific codes which solve coupled-physics problems with multiple domain decompositions. In these codes, different decompositions are used for different physics, and all physics must be done at every time step. Note that these codes are more complicated than many others due to the multiple domain decompositions.

We wish to determine the dependence of computational performance on interconnect details such as topology, latency, bandwidth, dynamic reconfigurability, and cut-through vs. store&forward routing. Because the codes to be simulated are large and complex, the instruction-level simulator we're currently using is much too slow. We'd like to switch to a trace-based simulator.
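
To make the routing distinction concrete, here is a minimal sketch of the textbook unloaded-latency models for the two schemes. It ignores contention, wire delay, and packetization, all of which the proposed simulator would need to model; the function and parameter names, and the example numbers, are assumptions.

```python
def store_and_forward_latency(msg_bytes, hops, bytes_per_s, router_delay_s):
    """Each hop receives the whole message before forwarding it."""
    return hops * (msg_bytes / bytes_per_s + router_delay_s)

def cut_through_latency(msg_bytes, hops, bytes_per_s, router_delay_s):
    """The header is forwarded as soon as it is routed, so the message body
    is serialized onto a link only once."""
    return hops * router_delay_s + msg_bytes / bytes_per_s

# Hypothetical link: 1 GB/s, 100 ns per router, 1000-byte message over 4 hops.
args = (1000, 4, 1e9, 1e-7)
print(store_and_forward_latency(*args))  # about 4.4 microseconds
print(cut_through_latency(*args))        # about 1.4 microseconds
```

The two models coincide for a single hop and diverge linearly with hop count, which is one reason topology and routing have to be studied together.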

The immediate job of the student would be to build a simulator code which can accept a code trace, a simplified description of the compute nodes, and a detailed interconnect model, and provide output which includes total time-to-solution plus details of traffic flow in the interconnect. These details include throughput statistics, latency statistics, and some contention details (where in the code, where in the interconnect, how much added delay).

This project requires a creative soul to define a generic interconnect input description and handling of the interconnect within the simulation, plus the programming to define the simulator. The student would be expected to talk with us about the relevant interconnect features which must be included, and then to (rather independently) build a simulator for this purpose. Obviously, the work needs to be documented well enough so that others can use the simulator.

Once the simulator is completed, the student can either go off to other research, or continue on this project-- becoming involved with simulating different interconnects and understanding what (detailed) attributes are needed for the codes of interest. Either would be acceptable to me.

I'd like to get a simulation code in place by January of 97 at the latest.

Suggested by Bob Deri (



Practical Implementations of ILP

Determine the degree of scalarity that would be practical under realistic physical design constraints for a commercial microarchitecture such as x86, PowerPC, SPARC, Alpha or MIPS.

Use traces or write a simulator that can execute binary code in the target architecture.

Compare results for both in-order and out-of-order execution in hardware. Out-of-order would require the assumption of a certain size reorder window for load/stores and a certain size reorder window for non-load/stores, or a single reorder window for both (HP 8200 has separate windows.)

Assume a certain pipeline (number of stages, etc) and a certain number of execution units of each type (branch, compare, integer, load/store, floating-point)

Consider constraints such as:

  1. 1 vs. 2 load/store units
  2. maximum number of unresolved branches allowed
  3. whether more than one branch could be resolved per cycle
  4. ability to compress two dependent adds into a single cycle (or to compress an add plus the address calculation for a dependent load/store into a single cycle)
  5. 1 vs. 2 instructions per cycle that can set condition codes
  6. availability or lack of rename registers in the fp unit

Suggested by George Taylor (


The following suggestions were from David Culler for CS 262. Some may be applicable as well for CS 252, some (i.e. many) may not. Use your own discretion.



Stack vs. Register Machines

A re-evaluation of stack vs. register machines à la Amdahl, Blaauw, and Brooks, in the presence of multiple issue and serious renaming.

Suggested by David Culler (


Network Design in Clusters of SMPs

Investigate how this degree of freedom at the nodes impacts the design of the network itself. With multiple interfaces, each node attaches to the network fabric at several points. This dramatically changes the fault tolerance properties of the network as well as its behavior under contention. It appears likely that many of the benefits obtained by using randomness within the network can be realized in practice by simply arbitrating among multiple ports into the network at each node.

Suggested by David Culler (



Load balancing in Clusters of SMPs

Statistical analysis shows that systems on the scale of a thousand processors will lose substantial efficiency from even very slight variations in workload, such as might arise from cache effects, floating-point unit effects, pipeline stalls and the like. In practice, numerous factors make it difficult to achieve a near perfect workload distribution. Balancing work within an SMP is considerably easier and cheaper than balancing work between nodes. Our initial results suggest that the ability to dynamically balance work within an SMP node significantly improves the efficiency of the overall system. In addition to balancing work between processors within an SMP, there is an opportunity to dedicate processors to communication tasks. This potentially reduces synchronization costs while improving attentiveness to the network. To date, there is little systematic exploration of the potential alternatives.
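
The statistical effect can be illustrated with a small Monte Carlo sketch. It assumes a bulk-synchronous program in which every step ends when the slowest node finishes, uniform random per-processor jitter, and perfect rebalancing inside each SMP node; none of these assumptions, nor the function name and default parameters, come from the results mentioned above.

```python
import random

def efficiency(num_procs, smp_size=1, jitter=0.05, steps=200, seed=0):
    """Average of (mean processor time) / (slowest node time) over many steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        # Each processor's work takes 1.0 plus up to `jitter` of random variation.
        times = [1.0 + rng.uniform(0.0, jitter) for _ in range(num_procs)]
        # Assumption: work is rebalanced perfectly within a node, so a node
        # finishes at the average time of its processors.
        node_times = []
        for start in range(0, num_procs, smp_size):
            chunk = times[start:start + smp_size]
            node_times.append(sum(chunk) / len(chunk))
        total += (sum(times) / num_procs) / max(node_times)
    return total / steps

for procs in (4, 64, 1024):
    print(procs, round(efficiency(procs), 4), round(efficiency(procs, smp_size=8), 4))
```

Under these assumptions efficiency falls as the processor count grows, because the maximum of many samples drifts toward the worst case, and it recovers when processors are grouped into nodes that balance internally.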

Suggested by David Culler (



Floating-point registers: caller vs. callee save

One interesting project would be the division of floating point registers into caller saved and callee saved registers.

My survey on comp.compilers indicated that all other system vendors have at least some callee-saved registers, and Sun is alone in having all registers saved by the caller. This has a performance impact. It would be interesting to find out how large it is.

It doesn't have to be a Sparc-centric study, but a general evaluation based on a generic RISC processor like DLX perhaps.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)



Prepare-to-branch vs. branch-prediction (static vs. dynamic; fairness)

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)



Out-of-order vs. in-order machines: Tradeoffs in hardware/software complexity

Complex machines such as the HP PA8000 use a large window of available instructions and hardware scheduling with register renaming in an attempt to issue more instructions per clock. Is the extra complexity in such processors justified? Can compilers close the gap between such machines and simpler in-order machines through profile feedback and static scheduling?

If they can run their traces through appropriate in-order/out-of-order models (simulators), it would be interesting to see what they come up with.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)



Do out-of-order machines need small L1 caches that are very close (low L1 load latency) as much as do in-order machines?

What is the (perf) sensitivity of an out-of-order machine to L1 load latency? If a latency of say up to 5 clocks can be hidden well, then should we just build the largest cache we can at the 5 clock latency that the machine can tolerate?

If the machine is capable of hiding the extra latency, why have a smaller cache that is closer?

This problem can also be studied with the traces and simulators they have.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)

