GLORIA

GEOMAR Library Ocean Research Information Access


  • 1
    In: Cluster Computing, Springer Science and Business Media LLC, Vol. 18, No. 1 (2015-03), p. 1-14
    Type of Medium: Online Resource
    ISSN: 1386-7857, 1573-7543
    Language: English
    Publisher: Springer Science and Business Media LLC
    Publication Date: 2015
    ZDB ID: 2012757-1
  • 2
    In: ACM Transactions on Architecture and Code Optimization, Association for Computing Machinery (ACM), Vol. 11, No. 4 (2015-01-09), p. 1-26
    Abstract: This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value system-wide depends on whether users effectively employ SMT at the application level. However, predicting SMT’s benefit for HPC applications is challenging; by doubling the number of threads, the application’s characteristics may change. This work proposes statistical modeling techniques to predict the speedup SMT confers to HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.
    Type of Medium: Online Resource
    ISSN: 1544-3566, 1544-3973
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2142607-7
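A note on the methodology above: the paper predicts the speedup SMT confers from lightweight performance monitors collected in a single run. As a rough, hypothetical illustration of that general idea (not the authors' actual model, counters, or data), a linear model can be fit from per-application counter readings to measured SMT speedups:

```python
# Illustrative sketch: predict an application's SMT speedup from hardware
# performance counters gathered during one non-SMT run, via a fitted
# linear model. All counter features and numbers below are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one row of counter readings per application
# (IPC, L2 miss rate, branch miss rate) with the measured SMT speedup.
counters = np.array([
    [1.8, 0.02, 0.004],   # compute-bound: high IPC, few misses
    [0.9, 0.15, 0.010],   # memory-bound: low IPC, many L2 misses
    [1.2, 0.08, 0.020],
    [0.7, 0.20, 0.015],
])
smt_speedup = np.array([1.05, 1.35, 1.20, 1.40])  # measured SMT gains

model = LinearRegression().fit(counters, smt_speedup)

# Predict for a new application from a single monitored run.
new_app = np.array([[1.0, 0.12, 0.012]])
print(f"predicted SMT speedup: {model.predict(new_app)[0]:.2f}x")
```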
  • 3
    In: ACM Transactions on Computer Systems, Association for Computing Machinery (ACM), Vol. 34, No. 1 (2016-04-06), p. 1-32
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.
    Type of Medium: Online Resource
    ISSN: 0734-2071, 1557-7333
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 602353-8
    ZDB ID: 2006326-X
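To make the TCO reasoning above concrete: for a fixed query throughput, a faster (but pricier and hotter) accelerated server reduces the number of machines a datacenter needs, which can lower total cost of ownership. The sketch below is a back-of-the-envelope calculation with entirely assumed prices, power draws, and per-server throughputs, not the paper's cost model:

```python
# Back-of-the-envelope TCO comparison across server design points.
# Every figure here (capex, watts, qps, electricity price) is hypothetical.
import math

def servers_needed(target_qps: float, qps_per_server: float) -> int:
    return math.ceil(target_qps / qps_per_server)

def tco_usd(n_servers: int, capex_per_server: float, watts_per_server: float,
            usd_per_kwh: float = 0.10, years: int = 3) -> float:
    hours = years * 365 * 24
    opex = n_servers * watts_per_server / 1000 * hours * usd_per_kwh
    return n_servers * capex_per_server + opex

TARGET_QPS = 10_000
designs = {  # (capex $, watts, queries/s per server) -- all assumed figures
    "CPU baseline": (4_000, 300, 50),
    "CPU + GPU":    (8_000, 600, 425),
    "CPU + FPGA":  (12_000, 400, 750),
}
for name, (capex, watts, qps) in designs.items():
    n = servers_needed(TARGET_QPS, qps)
    print(f"{name:12s}: {n:4d} servers, 3-year TCO ${tco_usd(n, capex, watts):,.0f}")
```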
  • 4
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 43, No. 3S (2016-01-04), p. 27-40
    Abstract: As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120x for all but one application (40x for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000x throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20x, depending on the composition of the workload.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
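The DNN-as-a-service design above gets its GPU throughput largely from batching many queries per kernel launch. As an illustrative toy (the batch size, timings, and the simulated "GPU" are invented, not DjiNN's implementation), the core batching pattern looks like:

```python
# Toy sketch of the batching pattern behind a DNN-as-a-service design:
# queries are queued and dispatched in batches, since accelerator
# throughput comes from amortizing each kernel launch over many inputs.
import queue
import threading
import time

requests: "queue.Queue[float]" = queue.Queue()
BATCH_SIZE = 8

def gpu_worker() -> None:
    while True:
        batch = [requests.get()]                 # block for the first item
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(requests.get(timeout=0.005))
            except queue.Empty:
                break                            # dispatch a partial batch
        time.sleep(0.010)                        # one simulated "inference"
        print(f"served batch of {len(batch)}")

threading.Thread(target=gpu_worker, daemon=True).start()
for _ in range(20):                              # simulated arriving queries
    requests.put(time.time())
    time.sleep(0.001)
time.sleep(0.1)                                  # let the worker drain
```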
  • 5
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 44, No. 3 (2016-10-12), p. 140-152
    Abstract: On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict, and take advantage of varying unit criticality across application phases by power gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
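As a hypothetical sketch of the phase-level gating idea described above (not PowerChop's actual HW/SW mechanism), one can track the benefit a unit provided in each program phase and power-gate that unit in phases where it historically bought little speedup:

```python
# Minimal sketch of phase-level unit gating: remember, per program phase,
# how much speedup a unit (say, the vector unit) provided, and gate it in
# phases where it did not help. Phase IDs and benefit numbers are synthetic.
phase_benefit: dict[int, float] = {}   # phase id -> observed speedup w/ unit
THRESHOLD = 1.05                        # gate the unit if it buys <5% speedup

def should_power_on(phase_id: int) -> bool:
    # Unknown phases default to powered-on (safe for performance).
    return phase_benefit.get(phase_id, THRESHOLD) >= THRESHOLD

# Simulated trace: (phase id, measured speedup the unit provided there).
trace = [(1, 1.30), (2, 1.01), (1, 1.28), (3, 1.02), (2, 1.00)]
for phase, speedup in trace:
    state = "ON " if should_power_on(phase) else "OFF"
    print(f"phase {phase}: vector unit {state}")
    phase_benefit[phase] = speedup      # update the prediction for next visit
```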
  • 6
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 43, No. 1 (2015-05-29), p. 223-238
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
  • 7
    In: ACM SIGPLAN Notices, Association for Computing Machinery (ACM), Vol. 50, No. 4 (2015-05-12), p. 223-238
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
    Type of Medium: Online Resource
    ISSN: 0362-1340, 1558-1160
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2079194-X
    ZDB ID: 282422-X
  • 8
    In: IEEE Micro, Institute of Electrical and Electronics Engineers (IEEE), Vol. 36, No. 3 (2016-05), p. 42-53
    Type of Medium: Online Resource
    ISSN: 0272-1732, 1937-4143
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016
    ZDB ID: 2027750-7
  • 9
    In: ACM SIGPLAN Notices, Association for Computing Machinery (ACM), Vol. 51, No. 6 (2016-08), p. 161-176
    Abstract: This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs.
    Type of Medium: Online Resource
    ISSN: 0362-1340, 1558-1160
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2079194-X
    ZDB ID: 282422-X
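The canary-input mechanism above can be illustrated with a toy example (the workload, accuracy metric, and approximation knob below are invented, not the authors' benchmarks): shrink the input to a small canary, sweep approximation levels on the canary, and apply the cheapest level that meets the accuracy target to the full input:

```python
# Illustrative sketch of the canary-input idea: test a spectrum of
# approximation levels on a tiny slice of the input, then apply the
# cheapest acceptable level to the full input.
import random
random.seed(0)

full_input = [random.gauss(100, 15) for _ in range(1_000_000)]
canary = full_input[::1000]                      # small, representative slice

def approx_mean(data, sample_rate):
    sample = data[::sample_rate]                 # coarser rate = cheaper
    return sum(sample) / len(sample)

exact_on_canary = sum(canary) / len(canary)
TARGET_ERROR = 0.01                              # 1% accuracy constraint

chosen = 1
for rate in (64, 32, 16, 8, 4, 2, 1):            # cheapest first
    err = abs(approx_mean(canary, rate) - exact_on_canary) / exact_on_canary
    if err <= TARGET_ERROR:
        chosen = rate
        break

print(f"chosen sample rate {chosen}, "
      f"approx mean of full input: {approx_mean(full_input, chosen):.2f}")
```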
  • 10
    In: ACM Transactions on Computer Systems, Association for Computing Machinery (ACM), Vol. 35, No. 1 (2017-02-28), p. 1-33
    Abstract: Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor's voltage and frequency during a coarse-grained sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach to leverage finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50× tail latency improvement for Memcached and up to a 3.03× improvement for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS) given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to a 1.99× improvement for Web Search over coarse-grained DVFS. By using carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS.
    Type of Medium: Online Resource
    ISSN: 0734-2071, 1557-7333
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2017
    ZDB ID: 602353-8
    ZDB ID: 2006326-X
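The query-level boosting idea above can be sketched as follows (the indicator, timings, and the 1.3x boost factor are all invented for illustration, not Adrenaline's actual mechanism): a cheap per-query feature flags likely tail queries, and only those run in the boosted voltage/frequency state:

```python
# Toy sketch of query-level boosting: spend the boost power budget only on
# queries an indicator predicts will land in the latency tail, then compare
# the 99th-percentile latency against an unboosted baseline.
import random
random.seed(1)

def service_time(work: float, boosted: bool) -> float:
    return work / (1.3 if boosted else 1.0)     # boost = 1.3x frequency

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

queries = [random.expovariate(1.0) for _ in range(10_000)]  # work per query

baseline = [service_time(w, boosted=False) for w in queries]
# Indicator: boost only queries whose work exceeds a threshold, so the
# power budget is spent precisely on likely tail queries.
boosted = [service_time(w, boosted=(w > 2.0)) for w in queries]

print(f"p99 baseline: {p99(baseline):.2f}  p99 with boosting: {p99(boosted):.2f}")
print(f"fraction boosted: {sum(w > 2.0 for w in queries) / len(queries):.1%}")
```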