BigDataOcean Platform and Design

In the context of requirements engineering, several methodologies are available, and the benefit of each depends on the nature of the project to which it is applied. These methodologies cover all aspects of the development lifecycle and provide the necessary direction for the sequential steps to be executed. Within the context of BigDataOcean, the requirements engineering methodology followed (adapted from the OSMOSE requirements methodology used in the FP7 OSMOSE project – 610905 (OSMOSE Project, 2014) and from the IEEE standard) consists of five main phases, namely the Preparation, Elicitation, Analysis, Specification and Validation phases.

In the first phase, the Preparation phase, the stakeholders are characterised and the data sources are identified. In addition, the data value chain is defined and the gaps between the current and the expected business strategy are identified. In the second phase, the Elicitation phase, the raw requirements are extracted as a result of the elicitation activities conducted with the pilot partners through workshops and questionnaires. The third phase, the Analysis phase, includes the refinement and categorisation of the raw requirements towards the consolidation of the user stories, which provide high-level descriptions of the expected functionalities of the BigDataOcean platform. In the fourth phase, the Specification phase, the user stories are further refined and prioritised so as to define the technical requirements of the BigDataOcean platform. These technical requirements serve as the basis for the design of the conceptual architecture of the platform, as well as the definition and design of the platform components. The final phase, the Validation phase, covers the proof-of-concept implementation, based on the outcomes of the Specification phase, and its in-depth evaluation at both a technical and a socio-business level. The following figure depicts the BigDataOcean requirements engineering methodology.

Figure 1: BigDataOcean Requirements Engineering

In the Elicitation phase, the consortium concluded that the most suitable and most efficient methods for raw requirement elicitation are questionnaires and workshops. As a consequence, appropriate and relevant questionnaires were carefully designed and defined by the technical partners. In addition to the questionnaires, the technical partners organised several workshops in collaboration with the pilot partners of the project in order to enable the identification and extraction of requirements. The analysis and evaluation of the results have led to the definition of the raw requirements that served as input for the next phase, the Analysis phase.

During the Analysis phase the raw requirements were refined and adjusted in order to be analysed and translated into user stories. A user story can be considered a very high-level definition of a desired requirement from the stakeholders’ perspective. In general, user stories are short descriptions of the stakeholders’ needs, where a desired functionality is presented in a conceptual way. These user stories were evaluated by the corresponding development teams in order to obtain a deeper understanding of the stakeholders’ expectations, but also in order to plan the future work and provide a rough estimation of the effort required for implementation. The user stories that were collected followed the template “As a <role>, I want <goal/desire>, so that <benefit>”. In order to enable proper analysis in the next phase, these user stories were classified into two categories: those referring to the core platform and those that are demonstrator specific. During this phase 56 user stories were collected and documented in two separate tables that presented the evolution from raw requirements to user stories, facilitating the definition of the services of the BigDataOcean platform so as to cater for the requirements of the stakeholders. Table 1 presents an extract of the core-platform-related user stories, while Table 2 presents an extract of the demonstrator-specific user stories.

| ID | BigDataOcean To-Be | Related BigDataOcean Raw Requirements | User Story (As a <role>, I want <goal/desire>, so that <benefit>) |
|----|--------------------|---------------------------------------|-------------------------------------------------------------------|
| US-1 | BigDataOcean platform should be able to correlate data from multiple data sources | DC-04 | As a data analyst, I want the BigDataOcean platform to have semantic services that allow to access correlated data, so that potentially new insights can be discovered. |
| US-3 | BigDataOcean platform should allow browsing and search for datasets | DA-01, DA-02, DA-03, DA-04, DA-05 | As a data consumer, I want to be able to search for specific datasets, based on various criteria, or browse through a list of existing ones, so that I could easily and efficiently find relevant data to use. |
| US-7 | BigDataOcean platform should offer a set of analytics and services on one or multiple datasets | DA-08, DA-09, DC-07 | As a data analyst, I would need analytics tools and services that can run on one or multiple datasets, so that I would be able to identify any abnormalities and also extract new knowledge. |
Table 1: BigDataOcean User Stories
| ID | BigDataOcean To-Be | User Story (As a <role>, I want <goal/desire>, so that <benefit>) | Pilot |
|----|--------------------|-------------------------------------------------------------------|-------|
| US-34 | BigDataOcean platform should provide tool to provide analytics regarding the marine ecosystems | As a user (ship owner), I would like to be able to use a proactive maintenance and fault prediction service, so that I can have better maintenance and fault prediction of the ship machinery and equipment. | 1 |
| US-39 | BigDataOcean platform should allow browsing and search for datasets | As a user (national entity), I would like to have analytics over marine ecosystems, so that I can increase protection and conservation of National and European biotopes, protection and clean-up of the industrial and marine environment, and decrease mobilisation costs. | 2 |
| US-43 | BigDataOcean platform should provide a tool for descriptive, predictive, and prescriptive analytics related to vessels’ movement tracking | As a user (ocean observatory), I would like to have descriptive, predictive, and prescriptive analytics related to the vessels’ movement tracking, so that I can monitor and manage the collision risk with ocean gliders, environmental risk, fishing identification, underwater noise, and marine spatial planning. | 3 |
| US-52 | BigDataOcean platform should provide a tool to visualise the coast with the marine data, positions of the buoys and current plant location | As a user (offshore renewable service provider), I would like to use a tool to visualise the coast with the marine data, positions of the buoys and current plant location, so that I can improve offshore renewable energy site selection and resource assessment. | 4 |
Table 2: User Stories specific to BigDataOcean pilots

Following the Analysis phase, the Specification phase received the collected user stories as input; further analysis was conducted in order to prioritise and refine them towards the elicitation of the technical requirements of the platform with respect to the BigDataOcean Data Value Chain. These technical requirements were classified according to the steps of the BigDataOcean Data Value Chain, presented in the following list:

  • Data Acquisition: This step covers the technical aspects of the diverse methods required to collect raw maritime data from different sources and instruments.
  • Data Analysis: This step covers the technical aspects of the data analytics methods performed to assess data accuracy and prepare for subsequent data integration.
  • Data Curation: This step covers the technical aspects of assessing data quality issues and data integration problems.
  • Data Storage: This step covers the technical aspects of data compression, data indexing and access control.
  • Data Usage: This step covers the technical aspects of data processing and data analytics.

Figure 2: BigDataOcean Data Value Chain

As a consequence, a concrete set of 28 technical requirements was defined and documented. An extract of these requirements is presented in the table below.

| Step in BigDataOcean Data Value Chain | Technology Requirements |
|---------------------------------------|-------------------------|
| Data Acquisition | DAC-TR-4: The system should be able to upload big data datasets to the BigDataOcean infrastructure. |
| Data Acquisition | DAC-TR-5: The system should be able to link big data sources to the BigDataOcean infrastructure. |
| Data Analysis | DAN-TR-1: The system should be able to perform big data analytics algorithms on the data stored in / linked to the BigDataOcean platform, including descriptive, predictive, and prescriptive analytics. |
| Data Analysis | DAN-TR-3: The data analysis framework should run on a scalable cluster-computing framework, programming entire clusters with implicit data parallelism and fault tolerance. |
| Data Curation | DC-TR-2: The system should be able to curate large-scale (big data) datasets in a timely and efficient manner. |
| Data Curation | DC-TR-4: The system should allow for efficient transformation of data: converting values to other formats, normalising and de-normalising. |
| Data Storage | DS-TR-1: The system should be able to store and manage large datasets (big data). |
| Data Storage | DS-TR-4: The system should allow for replication and high availability with no single point of failure. |
| Data Usage | DU-TR-1: The system should be able to perform simple and advanced queries over the stored big data. |
| Data Usage | DU-TR-2: The system should be able to visualise stored and real-time data using common chart types (bar chart, line chart, pie chart, etc.). |
Table 3: BigDataOcean Technology Requirements

The design of the high-level conceptual architecture of the BigDataOcean platform, as well as the definition and design of its constituent components, was based on the analysis of these technical requirements; the resulting architecture is illustrated in Figure 3. The analysis ensured that all technical requirements were prioritised and addressed by at least one component of the architecture, safeguarding that all stakeholders’ needs were successfully covered.

Figure 3: BigDataOcean High-Level Architecture

The data extraction process of the platform follows a three-step approach with optional tools in order to break down and allocate the process more efficiently. At first, the Anonymiser, an offline tool, undertakes the responsibility of filtering out sensitive, personal or corporate information that should not be disclosed to third parties. The second step involves the Cleanser, also an offline tool, responsible for the cleansing operations offered by the platform. This includes the removal of erroneous, incomplete or inaccurate data, but also the transformation of data types and the substitution or exclusion of data based on a set of conformance rules against constraints configured in the tool. Additionally, the Cleanser can perform missing-data handling via various interpolation or extrapolation techniques. The final step of the data extraction process is the Tupler tool, which is mainly responsible for the extraction of tuples from the incoming dataset, providing a tabular view of the data with the option to remove specific columns according to the stakeholders’ needs. The Tupler is able to handle incoming datasets in various formats, such as semi-structured data (e.g. XML or JSON), structured data (e.g. CSV files), data originating from databases (relational and NoSQL) and datasets originating from streaming data sources with a predefined schema. It should be noted that the direct upload of datasets to the Big Data Storage is also supported.
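The three-step extraction process can be sketched as a chain of simple functions. This is a minimal illustration only: the real Anonymiser, Cleanser and Tupler are standalone tools with far richer configuration, and the column names (`vessel_owner`, `speed_knots`, etc.) are assumptions invented for the example.

```python
def anonymise(rows, sensitive=("vessel_owner",)):
    """Anonymiser step: drop columns carrying sensitive or corporate information."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]

def cleanse(rows, required=("timestamp", "speed_knots")):
    """Cleanser step: exclude incomplete records and coerce numeric fields."""
    cleaned = []
    for row in rows:
        if any(row.get(k) in (None, "") for k in required):
            continue  # conformance rule: mandatory fields must be present
        cleaned.append(dict(row, speed_knots=float(row["speed_knots"])))
    return cleaned

def tupler(rows, keep=("timestamp", "speed_knots")):
    """Tupler step: extract tuples, i.e. a tabular view restricted to chosen columns."""
    return [tuple(row[k] for k in keep) for row in rows]

raw = [
    {"timestamp": "2017-06-01T10:00", "speed_knots": "12.4", "vessel_owner": "ACME"},
    {"timestamp": "2017-06-01T10:05", "speed_knots": "", "vessel_owner": "ACME"},
]
tuples = tupler(cleanse(anonymise(raw)))  # second record dropped: missing speed
```

Each tool remains optional, mirroring the description above: a dataset may skip any step and be uploaded directly to the Big Data Storage.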

Following the data extraction process, the next component is the Annotator (Mapper). The Annotator is capable of semantically annotating the incoming datasets using the ontologies and vocabularies maintained in the Metadata Repository. The Metadata Repository contains the upper-level ontologies and vocabularies for the domain-specific needs of BigDataOcean as defined by the consortium. The Annotator performs the semantic enrichment of the incoming datasets with the corresponding semantics of the domain model, providing information about the content of each dataset and enabling the linking of stored datasets. In addition to the annotation process, the semantically enriched dataset can optionally be passed to the RDF Transformation Engine in order to be transformed into RDF. RDF is the W3C-recommended standard for representing data and metadata, enabling business intelligence in the form of triples; the triple, the basic data model of RDF, has three components: 1) the Subject (S), 2) the Predicate (P), and 3) the Object (O).
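The triple model can be illustrated with a short sketch that maps a record's fields onto (Subject, Predicate, Object) statements. The namespace URI and property names below are hypothetical; the actual BigDataOcean vocabularies live in the Metadata Repository.

```python
# Hypothetical namespace, standing in for the project's real vocabularies.
BDO = "http://example.org/bigdataocean/"

def to_triples(dataset_id, record):
    """Annotate a record as RDF-style triples: one (S, P, O) per field."""
    subject = f"{BDO}dataset/{dataset_id}"          # Subject: the dataset resource
    return [
        (subject, f"{BDO}hasProperty/{key}", value)  # Predicate: property; Object: value
        for key, value in record.items()
    ]

triples = to_triples("buoy-42", {"waveHeight": 1.7, "seaTemperature": 18.2})
```

In a real pipeline the objects would themselves often be typed literals or URIs of other resources, which is precisely what allows stored datasets to be linked to one another.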

One of the core components of any Big Data ecosystem is the storage solution. In the context of BigDataOcean, the Big Data Storage implements the storage solution of the platform. It is capable of storing large amounts of data, as well as the RDF triples and the semantically annotated metadata produced by the RDF Transformation Engine and the Annotator respectively. The Big Data Storage offers a set of characteristics mandatory for the BigDataOcean platform, such as horizontal scalability, high availability and flexibility, content accessibility and advanced data protection. It is designed to handle the uploading of large datasets in a timely and efficient manner. In addition to the Big Data Storage, the extracted and semantically enriched datasets are provided to the Data Indexer. The Data Indexer offers the near real-time indexing and advanced querying capabilities of the platform, based on a predefined indexing schema that facilitates indexing at processing time.
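A toy inverted index conveys the idea behind the Data Indexer: datasets are indexed by their annotated metadata at processing time, so they can later be located by keyword in near real time. The keyword-per-dataset schema here is an assumption for illustration, not the platform's actual indexing schema.

```python
from collections import defaultdict

class MetadataIndex:
    """Minimal inverted index: keyword -> set of dataset identifiers."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, dataset_id, keywords):
        # Indexing happens at ingestion (processing) time.
        for kw in keywords:
            self._index[kw.lower()].add(dataset_id)

    def query(self, *keywords):
        # Return datasets matching ALL given keywords (set intersection).
        sets = [self._index[kw.lower()] for kw in keywords]
        return set.intersection(*sets) if sets else set()

idx = MetadataIndex()
idx.add("ds-1", ["wave", "height", "Atlantic"])
idx.add("ds-2", ["wave", "period", "Baltic"])
matches = idx.query("wave", "atlantic")  # only ds-1 carries both keywords
```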

On top of the Big Data Storage, the platform offers the Ontario tool. Ontario facilitates the harmonisation of maritime data sources, offering a layer of abstraction over data coming from heterogeneous sources in different formats, with semantification performed on the fly based on the ontologies and vocabularies stored in the Metadata Repository. The purpose of the Ontario tool is to enable querying over datasets originating from different data sources.
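The abstraction Ontario provides can be approximated as source-specific wrappers that normalise heterogeneous records into one common schema on the fly, so a single query runs over all of them. The field names and source formats below are assumptions chosen for the sketch.

```python
import csv, io, json

def from_csv(text):
    """Wrapper for a CSV source: normalise each row to the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"lat": float(row["latitude"]), "temp": float(row["sst"])}

def from_json(text):
    """Wrapper for a JSON source: same schema, different native layout."""
    for obj in json.loads(text):
        yield {"lat": obj["position"]["lat"], "temp": obj["seaSurfaceTemp"]}

def query(sources, predicate):
    """Evaluate one predicate uniformly over every harmonised source."""
    return [rec for src in sources for rec in src if predicate(rec)]

csv_data = "latitude,sst\n48.1,15.2\n60.3,9.8\n"
json_data = '[{"position": {"lat": 35.0}, "seaSurfaceTemp": 21.4}]'
warm = query([from_csv(csv_data), from_json(json_data)], lambda r: r["temp"] > 12)
```

In Ontario the "common schema" is not hand-written per source as here, but derived from the ontologies and vocabularies in the Metadata Repository.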

The Query Builder is the graphical tool enabling the processing of simple or complex queries in a user-friendly, self-explanatory and efficient way. Through a list of predefined filters and multiple-dataset selection, users are able to customise their queries according to their needs without effort. The results of the query processing are presented for additional filtering or are provided to the Visualiser component for advanced visualisations.
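Conceptually, a query builder of this kind composes the user's selected predefined filters into a single predicate. The filter vocabulary (`after`, `equals`) and the record fields below are invented for illustration only.

```python
# Hypothetical predefined filters, each returning a record predicate.
FILTERS = {
    "after":  lambda field, value: (lambda rec: rec[field] > value),
    "equals": lambda field, value: (lambda rec: rec[field] == value),
}

def build_query(selected):
    """Combine the user's selected filters into one predicate (AND semantics)."""
    predicates = [FILTERS[op](field, value) for op, field, value in selected]
    return lambda rec: all(p(rec) for p in predicates)

records = [
    {"year": 2015, "basin": "Atlantic"},
    {"year": 2017, "basin": "Atlantic"},
    {"year": 2017, "basin": "Baltic"},
]
q = build_query([("after", "year", 2016), ("equals", "basin", "Atlantic")])
results = [r for r in records if q(r)]  # only the 2017 Atlantic record matches
```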

The Algorithm Controller is responsible for the extended analysis functionalities of the platform. More specifically, it handles algorithm execution from a list of predefined algorithms, manages the interaction with the underlying Spark cluster-computing framework, and provides the results to the Visualiser component. The customisation of an algorithm is enabled by the Algorithm Customiser, where the parameters of the algorithm can be set by the user, tailoring the execution to the user's needs. Additionally, the chained execution of multiple algorithms is materialised by the Algorithm Combiner.
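The Customiser/Combiner interplay can be sketched as a chain of parameterised steps, each receiving the previous step's output. The two toy algorithms below are stand-ins, not platform code, and in BigDataOcean the actual execution would be delegated to Spark.

```python
def moving_average(values, window=3):
    """Toy analytics algorithm: sliding-window mean."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def threshold(values, limit=10.0):
    """Toy filter: keep values at or below the limit (e.g. drop outliers)."""
    return [v for v in values if v <= limit]

def run_chain(data, chain):
    """Algorithm Combiner sketch: execute algorithms sequentially,
    each customised with its own parameter set (Algorithm Customiser)."""
    for algorithm, params in chain:
        data = algorithm(data, **params)
    return data

readings = [9.0, 11.0, 10.0, 30.0, 10.0, 9.5]
out = run_chain(readings, [(moving_average, {"window": 2}),
                           (threshold, {"limit": 12.0})])
```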

The visualisation capabilities of the platform are enabled by the Visualiser component. The Visualiser receives as input the results of the query processing execution of the Query Builder or the results of the executed analysis from the Algorithm Controller. It offers a large variety of data visualisations such as histograms, line graphs, pie charts, heat maps, bar and scatter plots, tables and maps which are presented in the form of customisable and flexible dynamic dashboards.

One of the core components is the open-source cluster-computing framework Apache Spark, a powerful processing engine focused on performance, speed, reliability and ease of use. Built around the concept of a data structure called the resilient distributed dataset (RDD), Spark offers data parallelism and fault tolerance via a programmable interface, as well as the implementation of iterative algorithms, which are the basis for machine learning techniques. In addition to Spark, Apache Hadoop YARN is utilised as the cluster management technology, implementing a centralised platform responsible for resource management, the delivery of consistent operations, data governance tools across Hadoop clusters and overall security.
