BigDataOcean Platform and Design

In the context of requirements engineering, several methodologies are available, and the benefit of each depends on the nature of the project to which it is applied. These methodologies cover all aspects of the development lifecycle and provide the necessary direction for the sequential steps to be executed. Within the context of BigDataOcean, the requirements engineering methodology that was followed (adapted from the OSMOSE requirements methodology used in the FP7 OSMOSE project, grant no. 610905 (OSMOSE Project, 2014), and from the relevant IEEE standard) consists of five main phases, namely: Preparation, Elicitation, Analysis, Specification and Validation.

In the first phase, the Preparation phase, stakeholders are characterised and data sources are identified. In addition, the data value chain is defined and the gaps between the current and the expected business strategy are identified. In the second phase, the Elicitation phase, the raw requirements are extracted as a result of the elicitation activities carried out with the pilot partners through workshops and questionnaires. The third phase, the Analysis phase, covers the refinement and categorisation of the raw requirements and their consolidation into user stories, which provide high-level descriptions of the expected functionalities of the BigDataOcean platform. In the fourth phase, the Specification phase, the user stories are further refined and prioritised in order to define the technical requirements of the BigDataOcean platform. These technical requirements serve as the basis for the design of the conceptual architecture of the platform, as well as for the definition and design of the platform components. The final phase, the Validation phase, covers the proof-of-concept implementation, based on the outcomes of the Specification phase, and an in-depth evaluation of the proof-of-concept on both a technical and a socio-business level. The following figure depicts the BigDataOcean requirements engineering methodology.

Figure 1: BigDataOcean Requirements Engineering

In the Elicitation phase, the consortium concluded that the most suitable and efficient methods for raw requirement elicitation are questionnaires and workshops. Consequently, the relevant questionnaires were carefully designed and defined by the technical partners. In addition to the questionnaires, the technical partners organised several workshops in collaboration with the pilot partners of the project to enable requirement identification and extraction. The analysis and evaluation of the results led to the definition of the raw requirements that served as input for the next phase, the Analysis phase.

During the Analysis phase the raw requirements were refined and adjusted in order to be analysed and translated into user stories. A user story can be considered a very high-level definition of a desired requirement from the stakeholders’ perspective. In general, user stories are short descriptions of the stakeholders’ needs, in which a desired functionality is presented in a conceptual way. These user stories were evaluated by the corresponding development teams in order to obtain a deeper understanding of the stakeholders’ expectations, but also in order to plan the future work and provide a rough estimation of the effort required for implementation. The user stories collected were based on the template “As a <role>, I want <goal/desire>, so that <benefit>”. In order to enable proper analysis in the next phase, these user stories were classified into two categories: those referring to the core platform and those that are demonstrator-specific. During this phase, 56 user stories were collected and documented in two separate tables that present the evolution from raw requirements to user stories. This facilitated the definition of the services of the BigDataOcean platform that cater to the requirements of the stakeholders. Table 1 presents an extract of the core-platform-related user stories, and Table 2 presents an extract of the demonstrator-specific user stories.

| ID | BigDataOcean To-Be | Related BigDataOcean Raw Requirements | User Story (As a <role>, I want <goal/desire>, so that <benefit>) |
| --- | --- | --- | --- |
| US-1 | The BigDataOcean platform should be able to correlate data from multiple data sources | DC-04 | As a data analyst, I want the BigDataOcean platform to have semantic services that allow access to correlated data, so that new insights can potentially be discovered. |
| US-3 | The BigDataOcean platform should allow browsing and searching for datasets | DA-01, DA-02, DA-03, DA-04, DA-05 | As a data consumer, I want to be able to search for specific datasets, based on various criteria, or browse through a list of existing ones, so that I can easily and efficiently find relevant data to use. |
| US-7 | The BigDataOcean platform should offer a set of analytics and services on one or multiple datasets | DA-08, DA-09, DC-07 | As a data analyst, I want analytics tools and services that can run on one or multiple datasets, so that I can identify any abnormalities and extract new knowledge. |
Table 1: BigDataOcean User Stories
| ID | BigDataOcean To-Be | User Story (As a <role>, I want <goal/desire>, so that <benefit>) | Pilot |
| --- | --- | --- | --- |
| US-34 | The BigDataOcean platform should provide tools for analytics about marine ecosystems | As a user (ship owner), I would like to be able to use a proactive maintenance and fault prediction service, so that I can better predict the required maintenance and faults of the ship machinery and equipment. | 1 |
| US-39 | The BigDataOcean platform should allow browsing and searching for datasets | As a user (national entity), I would like to have analytics about marine ecosystems, so that I can increase the protection and conservation of National and European biotopes, the protection and clean-up of the industrial and marine environment, and decrease mobilisation costs. | 2 |
| US-43 | The BigDataOcean platform should provide a tool for descriptive, predictive, and prescriptive analytics related to the movement tracking of vessels | As a user (ocean observatory), I would like to have descriptive, predictive, and prescriptive analytics related to the movement tracking of vessels, so that I can monitor and manage environmental risk and the collision risk with ocean gliders, perform fishing identification, monitor underwater noise, and support marine spatial planning. | 3 |
| US-52 | The BigDataOcean platform should provide a tool to visualise the coast with the marine data, monitor the positions of buoys and see the current plant location | As a user (offshore renewable service provider), I would like to use a tool to visualise the coast with marine data, monitor positions of buoys and see the current plant location, so that I can improve offshore renewable energy site selection and resource assessment. | 4 |
Table 2: User Stories specific to BigDataOcean pilots

Following the Analysis phase, the Specification phase receives as input the collected user stories, on which further analysis is conducted in order to prioritise and refine them for the elicitation of the technical requirements of the platform with respect to the BigDataOcean Data Value Chain. These technical requirements are classified according to the steps of the BigDataOcean Data Value Chain, which are presented in the following list (a minimal pipeline sketch follows the list):

  • Data Acquisition: This step covers the technical aspects of the diverse methods required to collect raw maritime data from different sources and instruments.
  • Data Analysis: This step covers the technical aspects of the data analysis methods performed to assess data accuracy and to prepare for the upcoming data integration.
  • Data Curation: This step covers the technical aspects of assessing data quality issues and data integration problems.
  • Data Storage: This step covers the technical aspects of data compression, data indexing and access control.
  • Data Usage: This step covers the technical aspects of data processing and data analytics.
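
Read as a whole, the value chain is an ordered pipeline in which each step consumes the output of the previous one. The sketch below models this flow in Python; the stage names mirror the list above, while the function bodies are illustrative placeholders rather than actual BigDataOcean code.

```python
# Minimal sketch of the BigDataOcean Data Value Chain as an ordered pipeline.
# Stage functions are illustrative placeholders, not platform code.
def acquire(data: dict) -> dict:   # Data Acquisition: collect raw maritime data
    return data

def analyse(data: dict) -> dict:   # Data Analysis: assess accuracy before integration
    return data

def curate(data: dict) -> dict:    # Data Curation: resolve quality and integration issues
    return data

def store(data: dict) -> dict:     # Data Storage: compress, index, apply access control
    return data

def use(data: dict) -> dict:       # Data Usage: processing and analytics
    return data

PIPELINE = [acquire, analyse, curate, store, use]

def run_value_chain(raw: dict) -> dict:
    for stage in PIPELINE:
        raw = stage(raw)           # each step consumes the previous step's output
    return raw

result = run_value_chain({"source": "sensor-feed", "payload": []})
```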

Figure 2: BigDataOcean Data Value Chain

A concrete set of 28 technical requirements was defined and documented. An extract of these requirements is presented in the table below.

| Step in BigDataOcean Data Value Chain | Technology Requirements |
| --- | --- |
| Data Acquisition | DAC-TR-4: The system should be able to upload big data datasets to the BigDataOcean infrastructure. DAC-TR-5: The system should be able to link big data sources to the BigDataOcean infrastructure. |
| Data Analysis | DAN-TR-1: The system should be able to execute big data analytics algorithms on the data stored in / linked to the BigDataOcean platform, including descriptive, predictive and prescriptive analytics. DAN-TR-3: The data analysis should be performed on a scalable cluster-computing framework, programming entire clusters with implicit data parallelism and fault tolerance. |
| Data Curation | DC-TR-2: The system should be able to curate large-scale (big data) datasets in a timely and efficient manner. DC-TR-4: The system should allow for efficient transformation of data: converting values to other formats, normalising and de-normalising. |
| Data Storage | DS-TR-1: The system should be able to store and manage large datasets (big data). DS-TR-4: The system should allow for replication and high availability with no single point of failure. |
| Data Usage | DU-TR-1: The system should be able to perform simple and advanced queries over the stored big data. DU-TR-2: The system should be able to visualise stored and real-time data using common chart types (bar chart, line chart, pie chart, etc.). |
Table 3: BigDataOcean Technology Requirements

The design of the high-level conceptual architecture of the BigDataOcean platform (illustrated in Figure 3), as well as the definition and design of the components it consists of, was based on the analysis of these technical requirements. During this analysis it was ensured that all technical requirements were prioritised and addressed by at least one component of the architecture, in order to cover every stakeholder’s needs.

Figure 3: BigDataOcean High-Level Architecture
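
One way to make the requirement-to-component mapping mentioned above explicit is a simple traceability check. The sketch below is purely illustrative: the requirement identifiers follow the naming scheme of Table 3, but the component assignments are hypothetical examples, not the project's actual traceability matrix.

```python
# Hypothetical traceability check: verify that every technical requirement is
# addressed by at least one architecture component. The mapping below is an
# illustrative example, not the actual BigDataOcean traceability matrix.
requirements = ["DAC-TR-4", "DAC-TR-5", "DAN-TR-1", "DAN-TR-3",
                "DC-TR-2", "DC-TR-4", "DS-TR-1", "DS-TR-4",
                "DU-TR-1", "DU-TR-2"]

component_coverage = {
    "Big Data Storage": ["DAC-TR-4", "DS-TR-1", "DS-TR-4"],
    "Ontario": ["DAC-TR-5", "DU-TR-1"],
    "Cleanser": ["DC-TR-2", "DC-TR-4"],
    "Algorithm Controller": ["DAN-TR-1", "DAN-TR-3"],
    "Visualiser": ["DU-TR-2"],
}

covered = {req for reqs in component_coverage.values() for req in reqs}
missing = [req for req in requirements if req not in covered]
print("All requirements covered" if not missing else f"Uncovered: {missing}")
```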

The data extraction process of the platform follows a three-step approach, with each step implemented by an optional tool, in order to break the process down into manageable stages. At first, the Anonymiser, an offline tool, undertakes the responsibility of filtering out sensitive, personal or corporate information that should not be disclosed to third parties. The second step involves the Cleanser, also an offline tool, which is responsible for the cleansing operations offered by the platform. These include the removal of erroneous, incomplete or inaccurate data, the transformation of data types, and the substitution or exclusion of data based on a set of conformance rules against constraints configured in the tool. Additionally, the Cleanser can handle missing data via various interpolation or extrapolation techniques. The final step of the data extraction process is the Tupler tool, which is mainly responsible for extracting tuples from the incoming dataset, providing a tabular view of the data with the option to remove specific columns according to the stakeholders’ needs. The Tupler is able to handle incoming datasets in various formats, such as XML, JSON and CSV files, data originating from databases (relational and NoSQL), and datasets originating from streaming data sources with a predefined schema. It should be noted that the direct upload of datasets to the Big Data Storage is also supported.
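
The three steps can be thought of as a small chain of transformations applied to an incoming dataset. The sketch below illustrates that flow with pandas; the column names, masking rule and conformance check are hypothetical placeholders, not the behaviour of the actual Anonymiser, Cleanser or Tupler tools.

```python
# Illustrative three-step extraction flow (Anonymiser -> Cleanser -> Tupler).
# Column names, masking rules and conformance checks are hypothetical.
import pandas as pd

def anonymise(df: pd.DataFrame) -> pd.DataFrame:
    # Remove columns that could disclose personal or corporate information.
    return df.drop(columns=["vessel_owner"], errors="ignore")

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    # Enforce numeric types, interpolate missing readings and drop implausible ones.
    df = df.assign(speed_knots=pd.to_numeric(df["speed_knots"], errors="coerce"))
    df["speed_knots"] = df["speed_knots"].interpolate()   # missing-data handling
    return df[df["speed_knots"].between(0, 60)]           # conformance rule

def tupler(df: pd.DataFrame, keep) -> pd.DataFrame:
    # Provide a tabular (tuple) view restricted to the columns the stakeholder needs.
    return df[keep]

raw = pd.DataFrame({
    "vessel_owner": ["ACME Shipping"] * 3,
    "speed_knots": [12.5, None, 13.1],
    "timestamp": ["2017-01-01T00:00Z", "2017-01-01T01:00Z", "2017-01-01T02:00Z"],
})
tuples = tupler(cleanse(anonymise(raw)), keep=["timestamp", "speed_knots"])
```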

Following the data extraction process, the next component is the Annotator (Mapper). The Annotator semantically annotates the incoming datasets, making use of the ontologies and vocabularies maintained in the Metadata Repository. The Metadata Repository contains the upper-level ontologies and vocabularies for the domain-specific needs of BigDataOcean, as defined by the consortium. The Annotator performs the semantic enrichment of the incoming datasets with the corresponding semantics of the domain model, so that the information can be provided to the platform and the stored datasets can be linked. In addition to the annotation process, the semantically enriched dataset can optionally be passed to the RDF Transformation Engine in order to be transformed into the RDF format. RDF is the W3C-recommended standard for representing data and metadata in the form of triples, which constitute the basic data model of RDF. Each triple has three components: 1) the Subject (S), 2) the Predicate (P), and 3) the Object (O).
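
As a brief illustration of the triple model, the snippet below builds a single Subject-Predicate-Object statement with the rdflib library; the namespace, dataset URI and property are hypothetical examples, not terms from the actual BigDataOcean vocabularies.

```python
# Minimal RDF triple example (Subject, Predicate, Object) using rdflib.
# The namespace and resource names are illustrative, not BigDataOcean vocabulary terms.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

BDO = Namespace("http://example.org/bigdataocean/")   # hypothetical vocabulary namespace

g = Graph()
dataset = URIRef("http://example.org/datasets/wave-height-2017")   # Subject
g.add((dataset, RDF.type, BDO.MaritimeDataset))                    # Predicate and Object
g.add((dataset, BDO.measuredVariable, Literal("significant wave height")))

print(g.serialize(format="turtle"))
```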

One of the core components of every big data ecosystem is the storage solution. In the context of BigDataOcean, the Big Data Storage component implements the storage solution of the platform. It is capable of storing large amounts of data, as well as the RDF triples and the semantically annotated metadata produced by the RDF Transformation Engine and the Annotator respectively. The Big Data Storage offers a set of characteristics mandatory for the BigDataOcean platform, such as horizontal scalability, high availability and flexibility, content accessibility, and advanced data protection. The component is designed to handle the uploading of large datasets in a timely and efficient manner. In addition to the Big Data Storage component, the extracted and semantically enriched datasets are provided to the Data Indexer. The Data Indexer is responsible for the platform’s near real-time indexing and advanced querying capabilities, based on a predefined indexing schema that allows indexing to take place at processing time.
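
To give a flavour of schema-driven indexing, the sketch below maintains a simple in-memory inverted index over a predefined set of metadata fields; the field names and documents are hypothetical, and the real Data Indexer is naturally backed by a dedicated indexing engine rather than Python dictionaries.

```python
# Toy schema-driven index: datasets are indexed on a predefined set of metadata
# fields as they arrive, so lookups stay fast. Field names are hypothetical.
from collections import defaultdict

INDEX_SCHEMA = ["variable", "region", "format"]            # predefined indexing schema
index = {field: defaultdict(set) for field in INDEX_SCHEMA}
documents = {}

def index_dataset(doc_id, metadata):
    # Called at processing time, right after extraction and annotation.
    documents[doc_id] = metadata
    for field in INDEX_SCHEMA:
        if field in metadata:
            index[field][metadata[field]].add(doc_id)

def search(field, value):
    return [documents[i] for i in index[field].get(value, set())]

index_dataset("ds-001", {"variable": "wave height", "region": "Aegean", "format": "netCDF"})
index_dataset("ds-002", {"variable": "vessel position", "region": "North Sea", "format": "CSV"})
print(search("region", "Aegean"))
```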

On top of the Big Data Storage, the platform offers the Ontario tool. Ontario facilitates the harmonisation of maritime data sources, offering a layer of abstraction over data coming from heterogeneous sources in different formats, with semantification performed on the fly based on the ontologies and vocabularies stored in the Metadata Repository. The Ontario tool thus enables querying over datasets originating from different data sources.
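
Since Ontario exposes the harmonised data through semantic queries, a client can interact with it much like any SPARQL endpoint. The snippet below sketches such a query with SPARQLWrapper; the endpoint URL and the bdo: vocabulary terms are hypothetical placeholders, not the platform's actual deployment details.

```python
# Hedged sketch: querying semantically harmonised maritime data via SPARQL.
# The endpoint URL and the bdo: terms are illustrative placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/bigdataocean/sparql")  # hypothetical endpoint
endpoint.setQuery("""
    PREFIX bdo: <http://example.org/bigdataocean/>
    SELECT ?dataset ?variable
    WHERE {
        ?dataset a bdo:MaritimeDataset ;
                 bdo:measuredVariable ?variable .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["variable"]["value"])
```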

The Query Builder is the graphical tool that enables the processing of simple or complex queries in a user-friendly, easy-to-use and efficient way. Through a list of predefined filters and multiple dataset selection, users are able to customise their queries according to their needs without effort. The results of the query processing are presented for additional filtering or are provided to the Visualiser component for advanced visualisations.
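
Conceptually, the Query Builder turns a dataset selection plus a set of predefined filters into an executable query. The sketch below mimics that composition in plain Python; the filter vocabulary and dataset identifiers are made-up examples, not the tool's real interface.

```python
# Illustrative composition of a query from dataset selections and predefined filters.
# Filter names and dataset IDs are hypothetical.
PREDEFINED_FILTERS = {
    "variable_equals": lambda row, value: row.get("variable") == value,
    "value_greater_than": lambda row, value: row.get("value", 0) > value,
}

def build_query(dataset_ids, filters):
    # A query is simply the selected datasets plus the filter predicates to apply.
    return {"datasets": dataset_ids,
            "predicates": [(PREDEFINED_FILTERS[name], arg) for name, arg in filters]}

def run_query(query, rows):
    return [row for row in rows
            if row["dataset"] in query["datasets"]
            and all(pred(row, arg) for pred, arg in query["predicates"])]

rows = [{"dataset": "ds-001", "variable": "wave height", "value": 2.4},
        {"dataset": "ds-002", "variable": "wave height", "value": 0.9}]
query = build_query(["ds-001", "ds-002"], [("variable_equals", "wave height"),
                                           ("value_greater_than", 1.0)])
print(run_query(query, rows))   # only the ds-001 row satisfies both filters
```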

The Algorithm Controller is responsible for the extended analysis functionalities of the platform. More specifically, it handles the execution of algorithms selected from a list of predefined algorithms, manages the interaction with the underlying Spark cluster-computing framework, and provides the results to the Visualiser component. The customisation of an algorithm is enabled by the Algorithm Customiser, where the parameters of the algorithm can be set by the user, tailoring the execution to the user’s needs. Additionally, the Algorithm Combiner facilitates the execution of multiple algorithms in a chained fashion.
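
The division of labour between the three components can be illustrated with a small PySpark sketch: a catalogue of predefined algorithms (Controller), user-supplied parameters (Customiser) and sequential chaining (Combiner). The algorithm names, parameters and data are hypothetical examples, not the platform's actual catalogue.

```python
# Sketch of predefined, parameterisable Spark algorithms executed in a chain.
# Algorithm names, parameters and sample data are illustrative only.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("algorithm-controller-sketch").getOrCreate()

def threshold_filter(df: DataFrame, column: str, minimum: float) -> DataFrame:
    return df.filter(F.col(column) >= minimum)

def average_by(df: DataFrame, key: str, column: str) -> DataFrame:
    return df.groupBy(key).agg(F.avg(column).alias(f"avg_{column}"))

ALGORITHMS = {"threshold_filter": threshold_filter, "average_by": average_by}

def run_chain(df: DataFrame, chain) -> DataFrame:
    # The Combiner feeds each algorithm's output into the next one;
    # the parameters for each step come from the Customiser.
    for name, params in chain:
        df = ALGORITHMS[name](df, **params)
    return df

data = spark.createDataFrame(
    [("buoy-1", 2.4), ("buoy-1", 0.3), ("buoy-2", 1.8)], ["station", "wave_height"])
result = run_chain(data, [("threshold_filter", {"column": "wave_height", "minimum": 1.0}),
                          ("average_by", {"key": "station", "column": "wave_height"})])
result.show()
```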

The visualisation capabilities of the platform are enabled by the Visualiser component. The Visualiser receives as input the results of the query executed by the Query Builder or the results of the executed analysis from the Algorithm Controller. It offers a large variety of data visualisations, such as: histograms, line graphs, pie charts, heat maps, bar and scatter plots, tables and maps, which are presented in the form of customisable and flexible dynamic dashboards.
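
As a simple example of what such a dashboard panel might contain, the snippet below renders two of the chart types listed above with matplotlib; the data and labels are hypothetical, and the real Visualiser builds its dashboards from query or analysis results.

```python
# Illustrative charts of the kind the Visualiser could display; data are made up.
import matplotlib.pyplot as plt

hours = list(range(24))
wave_height = [1.0 + 0.05 * h for h in hours]   # hypothetical hourly measurements

fig, (line_ax, hist_ax) = plt.subplots(1, 2, figsize=(10, 4))
line_ax.plot(hours, wave_height)                # line graph of the time series
line_ax.set(xlabel="Hour", ylabel="Wave height (m)", title="Hourly wave height")
hist_ax.hist(wave_height, bins=6)               # histogram of the same values
hist_ax.set(xlabel="Wave height (m)", ylabel="Frequency", title="Distribution")
fig.tight_layout()
plt.show()
```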

One of the core components is the open-source cluster-computing framework Apache Spark, a powerful processing engine focused on performance, speed, reliability and ease of use. Built around a data structure called the resilient distributed dataset (RDD), Spark offers data parallelism and fault tolerance via a programmable interface, and supports the implementation of iterative algorithms, which form the basis of many machine learning techniques. In addition to Spark, Apache Hadoop YARN is utilised as the cluster management technology, providing a centralised platform responsible for resource management, the delivery of consistent operations, data governance across Hadoop clusters, and overall security.
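
To make the RDD concept concrete, the short PySpark example below distributes a small collection across partitions, runs a parallel map/reduce and caches the RDD for iterative reuse; the application name, sample values and statistics are illustrative, and on a YARN-managed cluster the session would simply be configured with the yarn master.

```python
# Generic PySpark RDD example: partitioned data, parallel map/reduce, and caching
# for iterative reuse. Values and names are illustrative, not BigDataOcean code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdd-sketch")          # add .master("yarn") on a YARN-managed cluster
         .getOrCreate())
sc = spark.sparkContext

readings = sc.parallelize([1.2, 3.4, 2.2, 5.1, 0.8], numSlices=4)  # RDD split into 4 partitions
sum_of_squares = readings.map(lambda x: x * x).reduce(lambda a, b: a + b)  # parallel map + reduce
print(sum_of_squares)

# Caching keeps the RDD in memory across reuses, which is the pattern behind
# Spark's suitability for iterative machine learning algorithms.
cached = readings.cache()
mean = cached.sum() / cached.count()
variance = cached.map(lambda x: (x - mean) ** 2).mean()
print(mean, variance)

spark.stop()
```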
