GLORIA

GEOMAR Library Ocean Research Information Access


  • 1
    In: Cluster Computing, Springer Science and Business Media LLC, Vol. 18, No. 1 (2015-03), p. 1-14
    Type of Medium: Online Resource
    ISSN: 1386-7857, 1573-7543
    Language: English
    Publisher: Springer Science and Business Media LLC
    Publication Date: 2015
    ZDB ID: 2012757-1
  • 2
    In: ACM Transactions on Architecture and Code Optimization, Association for Computing Machinery (ACM), Vol. 11, No. 4 (2015-01-09), p. 1-26
    Abstract: This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value system-wide depends on whether users effectively employ SMT at the application level. However, predicting SMT’s benefit for HPC applications is challenging; by doubling the number of threads, the application’s characteristics may change. This work proposes statistical modeling techniques to predict the speedup SMT confers to HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.
    Type of Medium: Online Resource
    ISSN: 1544-3566, 1544-3973
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2142607-7
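A note on the methodology above: the paper predicts the speedup SMT confers from lightweight performance monitors collected in a single run. As a rough, hypothetical illustration of that general idea (not the authors' actual model, counters, or data), a linear model can be fit from per-application counter readings to measured SMT speedups:

```python
# Illustrative sketch: predict an application's SMT speedup from hardware
# performance counters gathered during one non-SMT run, via a fitted
# linear model. All counter features and numbers below are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one row of counter readings per application
# (IPC, L2 miss rate, branch miss rate) with the measured SMT speedup.
counters = np.array([
    [1.8, 0.02, 0.004],   # compute-bound: high IPC, few misses
    [0.9, 0.15, 0.010],   # memory-bound: low IPC, many L2 misses
    [1.2, 0.08, 0.020],
    [0.7, 0.20, 0.015],
])
smt_speedup = np.array([1.05, 1.35, 1.20, 1.40])  # measured SMT gains

model = LinearRegression().fit(counters, smt_speedup)

# Predict for a new application from a single monitored run.
new_app = np.array([[1.0, 0.12, 0.012]])
print(f"predicted SMT speedup: {model.predict(new_app)[0]:.2f}x")
```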
  • 3
    In: ACM Transactions on Computer Systems, Association for Computing Machinery (ACM), Vol. 34, No. 1 (2016-04-06), p. 1-32
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.
    Type of Medium: Online Resource
    ISSN: 0734-2071, 1557-7333
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 602353-8
    ZDB ID: 2006326-X
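To make the TCO reasoning above concrete: for a fixed query throughput, a faster (but pricier and hotter) accelerated server reduces the number of machines a datacenter needs, which can lower total cost of ownership. The sketch below is a back-of-the-envelope calculation with entirely assumed prices, power draws, and per-server throughputs, not the paper's cost model:

```python
# Back-of-the-envelope TCO comparison across server design points.
# Every figure here (capex, watts, qps, electricity price) is hypothetical.
import math

def servers_needed(target_qps: float, qps_per_server: float) -> int:
    return math.ceil(target_qps / qps_per_server)

def tco_usd(n_servers: int, capex_per_server: float, watts_per_server: float,
            usd_per_kwh: float = 0.10, years: int = 3) -> float:
    hours = years * 365 * 24
    opex = n_servers * watts_per_server / 1000 * hours * usd_per_kwh
    return n_servers * capex_per_server + opex

TARGET_QPS = 10_000
designs = {  # (capex $, watts, queries/s per server) -- all assumed figures
    "CPU baseline": (4_000, 300, 50),
    "CPU + GPU":    (8_000, 600, 425),
    "CPU + FPGA":  (12_000, 400, 750),
}
for name, (capex, watts, qps) in designs.items():
    n = servers_needed(TARGET_QPS, qps)
    print(f"{name:12s}: {n:4d} servers, 3-year TCO ${tco_usd(n, capex, watts):,.0f}")
```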
  • 4
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 43, No. 3S (2016-01-04), p. 27-40
    Abstract: As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120x for all but one application (40x for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000x throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20x, depending on the composition of the workload.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
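The DNN-as-a-service design above gets its GPU throughput largely from batching many queries per kernel launch. As an illustrative toy (the batch size, timings, and the simulated "GPU" are invented, not DjiNN's implementation), the core batching pattern looks like:

```python
# Toy sketch of the batching pattern behind a DNN-as-a-service design:
# queries are queued and dispatched in batches, since accelerator
# throughput comes from amortizing each kernel launch over many inputs.
import queue
import threading
import time

requests: "queue.Queue[float]" = queue.Queue()
BATCH_SIZE = 8

def gpu_worker() -> None:
    while True:
        batch = [requests.get()]                 # block for the first item
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(requests.get(timeout=0.005))
            except queue.Empty:
                break                            # dispatch a partial batch
        time.sleep(0.010)                        # one simulated "inference"
        print(f"served batch of {len(batch)}")

threading.Thread(target=gpu_worker, daemon=True).start()
for _ in range(20):                              # simulated arriving queries
    requests.put(time.time())
    time.sleep(0.001)
time.sleep(0.1)                                  # let the worker drain
```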
  • 5
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 44, No. 3 (2016-10-12), p. 140-152
    Abstract: On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict, and take advantage of varying unit criticality across application phases by power gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
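As a hypothetical sketch of the phase-level gating idea described above (not PowerChop's actual HW/SW mechanism), one can track the benefit a unit provided in each program phase and power-gate that unit in phases where it historically bought little speedup:

```python
# Minimal sketch of phase-level unit gating: remember, per program phase,
# how much speedup a unit (say, the vector unit) provided, and gate it in
# phases where it did not help. Phase IDs and benefit numbers are synthetic.
phase_benefit: dict[int, float] = {}   # phase id -> observed speedup w/ unit
THRESHOLD = 1.05                        # gate the unit if it buys <5% speedup

def should_power_on(phase_id: int) -> bool:
    # Unknown phases default to powered-on (safe for performance).
    return phase_benefit.get(phase_id, THRESHOLD) >= THRESHOLD

# Simulated trace: (phase id, measured speedup the unit provided there).
trace = [(1, 1.30), (2, 1.01), (1, 1.28), (3, 1.02), (2, 1.00)]
for phase, speedup in trace:
    state = "ON " if should_power_on(phase) else "OFF"
    print(f"phase {phase}: vector unit {state}")
    phase_benefit[phase] = speedup      # update the prediction for next visit
```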
  • 6
    In: ACM SIGARCH Computer Architecture News, Association for Computing Machinery (ACM), Vol. 43, No. 1 (2015-05-29), p. 223-238
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
    Type of Medium: Online Resource
    ISSN: 0163-5964
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2088489-8
    ZDB ID: 186012-4
  • 7
    In: ACM SIGPLAN Notices, Association for Computing Machinery (ACM), Vol. 50, No. 4 (2015-05-12), p. 223-238
    Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
    Type of Medium: Online Resource
    ISSN: 0362-1340, 1558-1160
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2015
    ZDB ID: 2079194-X
    ZDB ID: 282422-X
  • 8
    In: IEEE Micro, Institute of Electrical and Electronics Engineers (IEEE), Vol. 36, No. 3 (2016-05), p. 42-53
    Type of Medium: Online Resource
    ISSN: 0272-1732, 1937-4143
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016
    ZDB ID: 2027750-7
  • 9
    In: ACM SIGPLAN Notices, Association for Computing Machinery (ACM), Vol. 51, No. 6 (2016-08), p. 161-176
    Abstract: This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs.
    Type of Medium: Online Resource
    ISSN: 0362-1340, 1558-1160
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2016
    ZDB ID: 2079194-X
    ZDB ID: 282422-X
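The canary-input mechanism above can be illustrated with a toy example (the workload, accuracy metric, and approximation knob below are invented, not the authors' benchmarks): shrink the input to a small canary, sweep approximation levels on the canary, and apply the cheapest level that meets the accuracy target to the full input:

```python
# Illustrative sketch of the canary-input idea: test a spectrum of
# approximation levels on a tiny slice of the input, then apply the
# cheapest acceptable level to the full input.
import random
random.seed(0)

full_input = [random.gauss(100, 15) for _ in range(1_000_000)]
canary = full_input[::1000]                      # small, representative slice

def approx_mean(data, sample_rate):
    sample = data[::sample_rate]                 # coarser rate = cheaper
    return sum(sample) / len(sample)

exact_on_canary = sum(canary) / len(canary)
TARGET_ERROR = 0.01                              # 1% accuracy constraint

chosen = 1
for rate in (64, 32, 16, 8, 4, 2, 1):            # cheapest first
    err = abs(approx_mean(canary, rate) - exact_on_canary) / exact_on_canary
    if err <= TARGET_ERROR:
        chosen = rate
        break

print(f"chosen sample rate {chosen}, "
      f"approx mean of full input: {approx_mean(full_input, chosen):.2f}")
```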
  • 10
    In: ACM Transactions on Computer Systems, Association for Computing Machinery (ACM), Vol. 35, No. 1 (2017-02-28), p. 1-33
    Abstract: Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor's voltage and frequency during a coarse-grained sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach to leverage finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50× tail latency improvement for Memcached and up to a 3.03× improvement for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS) given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to a 1.99× improvement for Web Search over coarse-grained DVFS. By using carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS.
    Type of Medium: Online Resource
    ISSN: 0734-2071, 1557-7333
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2017
    ZDB ID: 602353-8
    ZDB ID: 2006326-X
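The query-level boosting idea above can be sketched as follows (the indicator, timings, and the 1.3x boost factor are all invented for illustration, not Adrenaline's actual mechanism): a cheap per-query feature flags likely tail queries, and only those run in the boosted voltage/frequency state:

```python
# Toy sketch of query-level boosting: spend the boost power budget only on
# queries an indicator predicts will land in the latency tail, then compare
# the 99th-percentile latency against an unboosted baseline.
import random
random.seed(1)

def service_time(work: float, boosted: bool) -> float:
    return work / (1.3 if boosted else 1.0)     # boost = 1.3x frequency

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

queries = [random.expovariate(1.0) for _ in range(10_000)]  # work per query

baseline = [service_time(w, boosted=False) for w in queries]
# Indicator: boost only queries whose work exceeds a threshold, so the
# power budget is spent precisely on likely tail queries.
boosted = [service_time(w, boosted=(w > 2.0)) for w in queries]

print(f"p99 baseline: {p99(baseline):.2f}  p99 with boosting: {p99(boosted):.2f}")
print(f"fraction boosted: {sum(w > 2.0 for w in queries) / len(queries):.1%}")
```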