GLORIA — GEOMAR Library Ocean Research Information Access

1

Online Resource

An efficient LISP-execution architecture with a new representation for list structures

Sohi, Gurindar S. ; Davidson, Edward S. ; Patel, Janak H.

Association for Computing Machinery (ACM) ; 1985

In: ACM SIGARCH Computer Architecture News Vol. 13, No. 3 ( 1985-06), p. 91-98

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 13, No. 3 ( 1985-06), p. 91-98

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/327070.327136

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1985

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

2

Online Resource

Parallelism in the front-end

Oberoi, Paramjit S. ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 2003

In: ACM SIGARCH Computer Architecture News Vol. 31, No. 2 ( 2003-05), p. 230-240

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 31, No. 2 ( 2003-05), p. 230-240

Abstract: As processor back-ends get more aggressive, front-ends will have to scale as well. Although the back-ends of superscalar processors have continued to become more parallel, the front-ends remain sequential. This paper describes techniques for fetching and renaming multiple non-contiguous portions of the dynamic instruction stream in parallel using multiple fetch and rename units. It demonstrates that parallel front-ends are a viable alternative to high-performance sequential front-ends.Compared with an equivalently-sized trace cache, our technique increases cache bandwidth utilization by 17%, front-end throughput by 20%, and performance by 5%. Parallelism also enhances latency tolerance: a parallel front-end loses only 6% performance as the cache size is decreased from 128 KB to 8 KB, compared with a 50--65% performance loss for sequential fetch mechanisms.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/871656.859645

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 2003

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

3

Online Resource

Multiscalar processors

Sohi, Gurindar S. ; Breach, Scott E. ; Vijaykumar, T. N.

Association for Computing Machinery (ACM) ; 1995

In: ACM SIGARCH Computer Architecture News Vol. 23, No. 2 ( 1995-05), p. 414-425

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 23, No. 2 ( 1995-05), p. 414-425

Abstract: Multiscalar processors use a new, aggressive implementation paradigm for extracting large quantities of instruction level parallelism from ordinary high level language programs. A single program is divided into a collection of tasks by a combination of software and hardware. The tasks are distributed to a number of parallel processing units which reside within a processor complex. Each of these units fetches and executes instructions belonging to its assigned task. The appearance of a single logical register file is maintained with a copy in each parallel processing unit. Register results are dynamically routed among the many parallel processing units with the help of compiler-generated masks. Memory accesses may occur speculatively without knowledge of preceding loads or stores. Addresses are disambiguated dynamically, many in parallel, and processing waits only for true data dependences.This paper presents the philosophy of the multiscalar paradigm, the structure of multiscalar programs, and the hardware architecture of a multiscalar processor. The paper also discusses performance issues in the multiscalar model, and compares the multiscalar paradigm with other paradigms. Experimental results evaluating the performance of a sample of multiscalar organizations are also presented.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/225830.224451

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1995

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

4

Online Resource

Effective jump-pointer prefetching for linked data structures

Roth, Amir ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 1999

In: ACM SIGARCH Computer Architecture News Vol. 27, No. 2 ( 1999-05), p. 111-121

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 27, No. 2 ( 1999-05), p. 111-121

Abstract: Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency. Jump pointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports four prefetching idioms: queue, full, chain, and root jumping and three implementations: software-only, hardware-only, and a cooperative software/hardware technique. On a suite of pointer intensive programs, jump pointer prefetching reduces memory stall time by 72% for software, 83% for cooperative and 55% for hardware, producing speedups of 15%, 20% and 22% respectively.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/307338.300989

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1999

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

5

Online Resource

Coherence decoupling : making use of incoherence

Huh, Jaehyuk ; Chang, Jichuan ; Burger, Doug ; [et al.]

Association for Computing Machinery (ACM) ; 2004

In: ACM SIGARCH Computer Architecture News Vol. 32, No. 5 ( 2004-12), p. 97-106

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 32, No. 5 ( 2004-12), p. 97-106

Abstract: This paper explores a new technique called coherence decoupling , which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce --- if not eliminate --- the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/1037947.1024406

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 2004

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

6

Online Resource

High-bandwidth address translation for multiple-issue processors

Austin, Todd M. ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 1996

In: ACM SIGARCH Computer Architecture News Vol. 24, No. 2 ( 1996-05), p. 158-167

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 24, No. 2 ( 1996-05), p. 158-167

Abstract: In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency.We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multiported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times.We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single-ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physically indexed caches.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/232974.232990

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1996

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

7

Online Resource

Dynamic speculation and synchronization of data dependences

Moshovos, Andreas ; Breach, Scott E. ; Vijaykumar, T. N. ; [et al.]

Association for Computing Machinery (ACM) ; 1997

In: ACM SIGARCH Computer Architecture News Vol. 25, No. 2 ( 1997-05), p. 181-193

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 25, No. 2 ( 1997-05), p. 181-193

Abstract: Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-specalation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/384286.264189

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1997

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

8

Online Resource

Cooperative Caching for Chip Multiprocessors

Chang, Jichuan ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 2006

In: ACM SIGARCH Computer Architecture News Vol. 34, No. 2 ( 2006-05), p. 264-276

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 34, No. 2 ( 2006-05), p. 264-276

Abstract: This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/1150019.1136509

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 2006

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

9

Online Resource

Mixed-mode multicore reliability

Wells, Philip M. ; Chakraborty, Koushik ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 2009

In: ACM SIGARCH Computer Architecture News Vol. 37, No. 1 ( 2009-03), p. 169-180

add to mindlist on the mindlist

Details

In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 37, No. 1 ( 2009-03), p. 169-180

Abstract: Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage from many difference sources of faults. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput. We make the observation that a user may want to run both applications requiring high reliability, such as financial software, and more fault tolerant applications requiring high performance, such as media or web software, on the same machine at the same time. Yet a traditional DMR system must fully operate in redundant mode whenever any application requires high reliability. This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring to applications running in high performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing.

Type of Medium: Online Resource

ISSN: 0163-5964

URL: Article

DOI: 10.1145/2528521.1508265

RVK:

SS 1985

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 2009

detail.hit.zdb_id: 2088489-8

detail.hit.zdb_id: 186012-4

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher

10

Online Resource

Dependence based prefetching for linked data structures

Roth, Amir ; Moshovos, Andreas ; Sohi, Gurindar S.

Association for Computing Machinery (ACM) ; 1998

In: ACM SIGPLAN Notices Vol. 33, No. 11 ( 1998-11), p. 115-126

add to mindlist on the mindlist

Details

In: ACM SIGPLAN Notices, Association for Computing Machinery (ACM), Vol. 33, No. 11 ( 1998-11), p. 115-126

Abstract: We introduce a dynamic scheme that captures the accesspat-terns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identzj+ing producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching eflect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.

Type of Medium: Online Resource

ISSN: 0362-1340 , 1558-1160

URL: Article

DOI: 10.1145/291006.291034

Language: English

Publisher: Association for Computing Machinery (ACM)

Publication Date: 1998

detail.hit.zdb_id: 2079194-X

detail.hit.zdb_id: 282422-X

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

Online Resource

Link to publisher