In:
Вычислительные технологии, Federal Research Center for Information and Computational Technologies, No. 3 (2022-07-21), p. 46-65
Abstract:
Efficient data storage is one of the most important tasks in designing any information system. The growing need to process large volumes of data has led to the emergence of many storage tools, which makes it necessary to choose storage formats at the design stage. The choice of format affects the parameters of the computing environment (data volume, processing time) as well as hardware resources. The article is devoted to developing a methodology for evaluating the efficiency of big data processing depending on the choice of a relational or columnar format. It presents a study of two popular ways of storing and processing big data: the relational database PostgreSQL, and storage in files of the columnar Apache Parquet format with processing by the Apache Hive framework.
Purpose. When developing information and analytical systems, choosing the most effective data storage tool is important. The purpose of the presented study is to compare the data processing features of various data storage tools. How these features change as the data volume grows is an important issue.
Methodology. Test stands were prepared for the experimental evaluation of the two alternatives. The evaluation criteria were the data volume, the processing time, the use of RAM and processor resources, and the dynamics of these characteristics as the data volume changes. Two data queries with different requirements for obtaining results were prepared: filtering and data aggregation. For the evaluation, both single queries and several simultaneously running queries were launched.
Findings. Numerical characteristics of the examined criteria were obtained. The processing speed when using a relational database was several times higher than that obtained with a big data processing system. As the volume of data grows, big data processing systems perform better. For characteristics such as the data volume, columnar formats are more efficient for any amount of data.
Value. The results showed the feasibility of using a relational database with small amounts of data. As the volume of data grows, alternative ways of storing and processing data become necessary, which suggests that designing a system requires analysis not only of the data structure but also of the estimated data volume.
Type of Medium:
Online Resource
ISSN:
1560-7534, 2313-691X
Uniform Title:
Оценка эффективности обработки больших объемов данных в реляционных и колоночных форматах [Evaluating the efficiency of processing large volumes of data in relational and columnar formats]
DOI:
10.25743/ICT.2022.27.3.001
DOI:
10.25743/ICT.2022.27.3.005
Language:
Russian
Publisher:
Federal Research Center for Information and Computational Technologies
Publication Date:
2022