The high-level conceptual architecture of the BigDataOcean platform, together with the definition and design of the platform’s components and their respective APIs, was based on the analysis of the technical requirements elicited during the specification phase of the BigDataOcean requirements engineering methodology (adapted from the OSMOSE requirements methodology, which is used in the FP7 OSMOSE project – 610905 (OSMOSE Project, 2014) and in the IEEE Standard). To ensure that all stakeholders’ needs were covered, the analysis verified that every technical requirement was addressed by at least one component of the platform architecture.
During the development of the first integrated version of the BigDataOcean platform, additional technical requirements were identified. In addition, the consortium received the initial evaluation and feedback from the project’s pilots, which called for a re-evaluation of the high-level architecture of the BigDataOcean platform and of the design of its components. The re-evaluation produced several refinements and enhancements to both the high-level architecture and the components of the platform, in order to address the new requirements and to provide the additional features and functionalities envisioned by the consortium. To this end, the high-level architecture received the necessary updates and is illustrated in the following figure.
One of the most important updates in the revised high-level architecture is the redesign of the Big Data Storage. In the updated design, the Big Data Storage is composed of two distinct storage solutions, the Hadoop Distributed File System (HDFS) and PostgreSQL, each utilised for a different purpose. HDFS is a distributed file system designed to be highly fault-tolerant, to provide high-throughput access and to handle and store large data sets with high reliability and high availability. Within the BigDataOcean platform, HDFS is used to store, in a fast, reliable and efficient way, large amounts of incoming raw data in their original format (for example netCDF, CSV or Excel), as well as the semantically enriched data provided by the Annotator. PostgreSQL, on the other hand, is a powerful open source object-relational database management system (ORDBMS) that ensures data integrity and correctness with high reliability and high availability. Within the BigDataOcean platform, PostgreSQL is used to store the normalised information parsed and extracted by the BDO Data Parser component.
The BDO Data Parser is the component responsible for parsing and normalising the data sources stored in raw format in HDFS. Its design includes the definition of the BigDataOcean Context model, which is based on the Climate and Forecast (CF) Metadata Convention standard. This model contains all the variables included in the datasets; it maps the variable names to CF-compliant standard names and resolves unit conflicts in conformance with the CF conventions. The collected variables are mapped to the CF categories, and the resulting context model serves as input for the schema definition of the PostgreSQL storage solution. The BDO Data Parser is designed to extract information from the datasets available in HDFS in structured file formats, such as netCDF, and in semi-structured file formats, such as CSV, Excel and KML. This information is normalised and transformed in accordance with the BigDataOcean Context model. The results are stored in PostgreSQL and are available to the BigDataOcean platform for query execution and analysis. The BDO Data Parser offers a variety of API methods to upload a dataset to HDFS or download it, find datasets with various filters, find variables with various parameters, parse files and set metadata repository identifiers.
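The mapping of variable names to CF standard names and the resolution of unit conflicts can be illustrated with a minimal Python sketch. It is not the BDO Data Parser implementation: the source variable names, the `CONTEXT_MODEL` and `UNIT_CONVERSIONS` tables and the `normalise` function are hypothetical, while the three CF standard names and their canonical units are taken from the CF standard name table.

```python
from dataclasses import dataclass

# Hypothetical excerpt of a context model:
# source variable name -> (CF standard name, canonical unit).
# The source names on the left are invented for illustration.
CONTEXT_MODEL = {
    "temp": ("sea_water_temperature", "K"),
    "sst": ("sea_surface_temperature", "K"),
    "wave_height": ("sea_surface_wave_significant_height", "m"),
}

# Conversions to the canonical unit, keyed by (source unit, canonical unit).
UNIT_CONVERSIONS = {
    ("K", "K"): lambda v: v,
    ("degC", "K"): lambda v: v + 273.15,
    ("m", "m"): lambda v: v,
    ("cm", "m"): lambda v: v / 100.0,
}


@dataclass(frozen=True)
class NormalisedValue:
    standard_name: str
    unit: str
    value: float


def normalise(source_name: str, unit: str, value: float) -> NormalisedValue:
    """Map a source variable to its CF standard name and canonical unit."""
    try:
        standard_name, canonical_unit = CONTEXT_MODEL[source_name]
    except KeyError:
        raise ValueError(f"no CF mapping for variable {source_name!r}") from None
    try:
        convert = UNIT_CONVERSIONS[(unit, canonical_unit)]
    except KeyError:
        raise ValueError(f"cannot convert {unit!r} to {canonical_unit!r}") from None
    return NormalisedValue(standard_name, canonical_unit, convert(value))
```

For example, `normalise("wave_height", "cm", 150.0)` yields a value of 1.5 m under the name `sea_surface_wave_significant_height`. A variable or unit that is missing from the tables raises an error instead of being passed through silently, which is one possible way of surfacing unresolved conflicts before the data reach the relational schema.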
The data extraction process is supplemented by two optional components, the Anonymiser and the Cleanser, for which no major updates or refinements were introduced. The Anonymiser provides the anonymisation processes required to deal with privacy issues, to protect personal and sensitive data and to eliminate the risk of unintended disclosure of information. The Cleanser is an optional offline tool offering cleansing functionalities for the detection and correction of corrupted, incomplete or incorrect records in a dataset. It provides the necessary mechanisms to replace, modify or eliminate the identified erroneous data.
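The kind of rule the Cleanser applies can be sketched in a few lines of Python. This is an illustrative assumption rather than the actual tool: the record layout (dictionaries with a `timestamp` key), the `cleanse` function and the range check are all hypothetical. Incomplete records are eliminated, and implausible values are replaced by `None`.

```python
import math


def cleanse(records, field, low, high):
    """Return (cleaned_records, report) for a list of dict records.

    Records without a timestamp are dropped. Values of `field` that are
    missing, non-numeric, NaN or outside [low, high] are replaced by None.
    The input list and its records are left unmodified.
    """
    cleaned = []
    report = {"dropped": 0, "invalidated": 0}
    for record in records:
        if not record.get("timestamp"):
            report["dropped"] += 1
            continue
        value = record.get(field)
        valid = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and not math.isnan(value)
            and low <= value <= high
        )
        if not valid:
            # Copy the record so the caller's data is not mutated.
            record = {**record, field: None}
            report["invalidated"] += 1
        cleaned.append(record)
    return cleaned, report
```

Returning a report next to the cleaned data keeps the correction auditable, which matters for an offline tool whose output is later annotated and stored.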
The Annotator, also called the Harmonisation tool, is responsible for the semantic annotation of the extracted datasets, utilising the ontologies and vocabularies provided by the Metadata Repository. The Annotator receives the raw data as input, ideally from the Anonymiser and the Cleanser tools, in order to associate a meaning with (or annotate) each attribute of the dataset. The ontologies and vocabularies are obtained from the Metadata Repository in order to introduce semantic relationships that provide information about the dataset and its relationship with other datasets. The Annotator requires the input of domain experts to annotate the data attributes that are being imported into the BigDataOcean platform. The domain expert can use the user-friendly GUI to specify the annotations. The annotations are received by the Big Data Framework and Storage, which translates these metadata descriptions of the datasets and stores them in the Big Data Storage.
Moreover, the Annotator offers an extended list of API methods to list or search datasets and variables with various parameters.
The Metadata Repository allows users to import, manage, query and interlink vocabularies related to the maritime domain. The managed information is made available online to maritime domain experts and, via exposed API methods, to the Annotator. Through these API methods, the Annotator can retrieve related ontologies and vocabularies for use in the mapping process of the incoming datasets. The ontologies and vocabularies are stored internally in a triple store, represented as RDF data according to the Semantic Web standards, and are also accessible using the SPARQL query language. The Metadata Repository offers a list of API methods to search, autocomplete or suggest vocabulary terms, vocabularies, agents and pilots.
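To make the triple-based representation concrete, the following Python sketch uses a plain list of (subject, predicate, object) tuples as a stand-in for a triple store and implements a prefix-based term autocompletion on top of it. The `example.org` IRIs, the labels and both function names are invented for illustration; only the `rdfs:label` IRI is a real vocabulary term. The actual repository and its API methods may work quite differently.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Tiny in-memory stand-in for a triple store: (subject, predicate, object).
# The example.org IRIs and the labels are hypothetical.
TRIPLES = [
    ("http://example.org/vocab/Vessel", RDFS_LABEL, "Vessel"),
    ("http://example.org/vocab/VesselType", RDFS_LABEL, "Vessel type"),
    ("http://example.org/vocab/WaveHeight", RDFS_LABEL, "Wave height"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def autocomplete(triples, prefix):
    """Return sorted (label, term IRI) pairs whose label starts with prefix."""
    prefix = prefix.casefold()
    hits = [
        (label, term)
        for term, _, label in match(triples, p=RDFS_LABEL)
        if label.casefold().startswith(prefix)
    ]
    return sorted(hits)
```

Against a real triple store, a comparable lookup could be expressed in SPARQL by selecting `?term rdfs:label ?label` and filtering with `STRSTARTS(LCASE(STR(?label)), "ves")`.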
On top of the data storage of the BigDataOcean platform lies the Query Builder. Through an innovative graphical user interface, the Query Builder enables the exploration, combination and expression of simple or complex queries over the available datasets. The ease of use of the graphical user interface ensures that end users unfamiliar with formal query languages or with the structure of the underlying data are still able to describe complex queries over multiple datasets. The results of the produced queries are used by other components of the platform, such as the Visualiser and the Algorithms Controller, to specify the input data for the visualisation or analysis process. The Query Builder offers a variety of API methods enabling the listing of available datasets and variables, row counting, query execution and retrieval of all the variables and dimensions of a query.
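How a graphical selection might be turned into a query can be sketched as follows. This is a minimal illustration under stated assumptions, not the Query Builder's actual logic: the `build_query` function, the table and column names in the usage example and the `%s` placeholder style (as accepted by common PostgreSQL drivers for Python) are all hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_OPERATORS = {"=", "!=", "<", "<=", ">", ">="}


def _ident(name):
    """Validate and double-quote a table or column name."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'


def build_query(table, columns, filters=(), limit=None):
    """Translate a selection into (sql, params) with %s placeholders.

    filters is an iterable of (column, operator, value) triples that are
    combined with AND. Values are never interpolated into the SQL text.
    """
    if not columns:
        raise ValueError("at least one column is required")
    sql = f"SELECT {', '.join(_ident(c) for c in columns)} FROM {_ident(table)}"
    params = []
    clauses = []
    for column, operator, value in filters:
        if operator not in _OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        clauses.append(f"{_ident(column)} {operator} %s")
        params.append(value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if limit is not None:
        sql += " LIMIT %s"
        params.append(int(limit))
    return sql, params
```

Calling `build_query("wave_data", ["time", "wave_height"], [("wave_height", ">", 2.0)], limit=100)` returns the SQL text and the parameter list separately, so that user-supplied values are passed to the database driver rather than concatenated into the statement.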
The next component in the foreseen logical flow is the Algorithms Controller, which is responsible for initiating and monitoring the execution of the data analysis by implementing the interconnection with the underlying cluster-computing framework, Apache Spark. The Algorithms Controller is capable of performing advanced analytics over multiple datasets available on the Big Data Storage with a wide range of algorithms. The results are provided in a timely and efficient manner to the Visualiser component. The Algorithms Controller is supplemented by the Algorithm Customiser and the Algorithm Combiner. The Algorithm Customiser is responsible for customising the algorithm execution via a set of parameter values, while the Algorithm Combiner extends the analysis capabilities by enabling the execution of multiple algorithms in a chain. The list of available algorithms and their respective parameters can be retrieved via the available API methods, and the execution of an algorithm can also be triggered via a dedicated API method.
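The interplay of customisation and chaining described above can be sketched in plain Python. The three toy algorithms, the `ALGORITHMS` registry and the `run_chain` function are hypothetical and run locally; in the platform the analysis is executed on Apache Spark, which this sketch does not attempt to reproduce.

```python
def drop_missing(values):
    """Remove None entries from a series."""
    return [v for v in values if v is not None]


def moving_average(values, window=3):
    """Average each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def threshold(values, minimum=0.0):
    """Keep only the values greater than or equal to `minimum`."""
    return [v for v in values if v >= minimum]


# Hypothetical registry of algorithms that can be listed and looked up by name.
ALGORITHMS = {
    "drop_missing": drop_missing,
    "moving_average": moving_average,
    "threshold": threshold,
}


def run_chain(data, steps):
    """Run algorithms in order, feeding each output into the next step.

    steps is a list of (algorithm_name, parameters) pairs. The parameters
    dictionary plays the role of the customisation, the ordered list the
    role of the combination.
    """
    result = data
    for name, params in steps:
        if name not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {name!r}")
        result = ALGORITHMS[name](result, **params)
    return result
```

A chain such as `[("drop_missing", {}), ("moving_average", {"window": 2}), ("threshold", {"minimum": 2.0})]` first cleans the series, then smooths it and finally filters the smoothed values.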
The Visualiser is the component enabling the visualisation capabilities of the BigDataOcean platform. The Visualiser offers a variety of visual representations, including bar and scatter plots, pie charts and heat maps, based on the execution output of queries composed with the Query Builder and on the results of the analysis executed by the Algorithms Controller. The interactive data visualisation can be utilised in various contexts within the BigDataOcean platform, such as the visualisation of the explored data within the Query Builder, the visualisation of the results of an executed analysis, or customisable and flexible dynamic dashboards. The Visualiser offers a list of API methods to list the available visualisation types and their respective parameters, as well as to generate the desired visualisation.
For the purposes of BigDataOcean, Apache Spark has been selected as the cluster-computing framework. Spark offers a processing engine suitable for analysis, combining speed, performance, reliability and efficiency. Spark also offers cluster management with data parallelism and fault tolerance suitable for the needs of the BigDataOcean platform. Apache Hadoop YARN supplements the cluster management offered by Spark. YARN provides a centralised platform responsible for resource management and scheduling. It supports multiple data processing engines and extends Spark with functionalities such as enhanced resource management, job scheduling and monitoring tools over the cluster nodes, as well as the necessary security mechanisms.