Motivation: Why Prosto?

Why functions and column-orientation?

In traditional approaches to data processing, we frequently need to produce a new table even though we need to define a new attribute. For example, in SQL, a new relation has to be produced even if we want to define a new calculated attribute. We also need to produce a new relation (using join) if we want to add a column with data from another table. Data aggregation by means of groupby operation produces a new relation too, although the goal is to compute a new attribute.

In many important cases, processing data using only set operations is counter-intuitive, and this is why map-reduce, join-groupby (including SQL) and similar set-oriented approaches require high expertise and are error-prone

The main unique novel feature of Prosto is that it relies on a different formal basis:

Prosto adds mathematical functions (implemented as columns) to its model by significantly simplifying data processing and analysis

Now, if we want to define a new attribute then we can do it directly without defining new unnecessary tables, collections or relations. New columns are defined in terms of other columns (in different tables) and this makes this model similar to how spreadsheets work but instead of cells we use columns. For comparison, if in spreadsheets we could define a new cell as A1=B2+C3, then in Prosto we could define a new column as Column1=Column2+Column3. The main theoretical challange is in introducing a set of column operations between columns in multiple tables in such a way that these operations effectively replace relational operations (join and groupby) and cover most important use cases. How it is done is described in [2]. Prosto with Column-SQL is one possible implementation of this model.

Below we describe three use cases where applying set operations is an unnecessary and counter-intuitive step. The example data model shown in this diagram is used for demonstration purposes. (Note however that it does not exactly corresponds to the use cases.)

../_images/fig_11.pngData processing workflow

Calculating data

One of the simplest operations in data processing is computing a new attribute using already existing attributes. For example, if we have a table Items containing orders characterized by quantity and price then we could compute a new attribute amount as their arithmetic product:

SELECT *, quantity * price AS amount FROM Items

This wide spread data processing pattern may seem very natural and almost trivial but it actually has one significant conceptual flaw:

the task was to compute a new attribute while the query produces a new table

Although the result table does contain the required attribute, the question is why not to do exactly what has been requested? Why is it necessary to produce a new table if we actually want to compute only an attribute?

The same problem exists in map-reduce. If our goal is to compute a new field then we apply the map operation which will emit completely new collection of objects having this new field. Here again the same problem: our intention was not to create a new collection with new objects – we wanted to add a new computed property to already existing objects. However, the data processing framework forces us to describe this task in terms of operations with collections. We simply do not have any choice because such data models provide only sets and set operations, and the only way to add a new attribute is to produce a new set with this attribute.

An alternative approach consists in using column operations for data transformations so that we can do exactly what is requested: adding (calculated) attributes to existing tables.

Linking data

Another wide spread task consists in computing links or references between different tables: given an element of one table, how can I access attributes in a related table? For example, assume that price is not an attribute of the Items table as in the above example but rather it is an attribute of a Products table. Here we have two tables, Items and Products, each having name attribute which relates their records. If now we want to compute the amount for each item then the price needs to be retrieved from the related Products table. The standard solution is to copy the necessary attributes into a new table by using the relational (left) join operation for matching the records:

SELECT item.*, product.price FROM Items item
JOIN Products product ON item.name = product.name

This new result table can be now used for computing the amount precisely as we described earlier because it has the necessary attributes copied from the two source tables. Let us again compare this solution with the problem formulation. Do we really need a new table? No. Our goal was to have a possibility to access attributes from the second Products table (while computing a new attribute in the first table). Hence this again can be viewed as a workaround rather than a solution:

a new set is not needed for the solution and it is produced just because there is no possibility not to produce it

A principled solution to this problem would be a data model which uses column operations for data processing so that a link can be defined as a new column in an existing table.

Aggregating data

Assume that for each product, we want to compute the total number of items ordered. This task can be solved using group-by operation:

SELECT name, COUNT(name) AS totalQuantity
FROM Items GROUP BY name

Here again we see the same problem:

a new unnecessary table is produced although the goal is to produce a new (aggregated) attribute in an existing table

Indeed, what we really want is to add a new attribute to the Products table which would be equivalent to all other attributes (like product price used in the previous example). This totalQuantity could be then used to compute some other properties of products. Of course, this also can be done using set operations in SQL but then we will have to again use join to combine the group-by result with the original Products table followed by producing yet another table with new calculated attributes. It is apparently not how it should work in a good data model because the task formulation does not mention and does not actually require any new tables - only attributes. Thus we see that the use of set operations in this and above cases is a problem-solution mismatch.

A solution to this problem again would be a column oriented data model where aggregated columns could be defined without adding new tables.

Benefits of Prosto

Easily processing data in multiple tables

New derived columns are added directly to tables without creating multiple intermediate tables.

We can easily implement calculate columns using apply method of pandas. However, we cannot use this technique in the case of multiple tables. Prosto is intended for and makes it easy to process data stored in many tables by relying on link columns which are also evaluated from the data.

Getting rid of join and group-by

Column definitions such as link columns and aggregate columns are used instead of join and groupby set operations.

Data in multiple tables can be processed using the relational join operation. However, it is tedious, error prone and requires high expertise especially in the case of many tables. Prosto does not use joins. Instead, it relies on link columns which also have definitions and are evaluated during workflow execution.

Data aggregation is typically performed using some kind of group-by operation. Prosto does not use this relational operation by providing column operations for that purpose which are simpler and more natural especially in describing complex analytical workflows.

Flexibility and modularization via user-defined functions

UDFs describe what needs to be done with the data only in this operation using arbitrary Python code. If UDF for an operation changes then it is not necessary to update other operations.

Prosto is very flexible in defining how data will be processed because it relies on user-defined functions which are its minimal units of data processing. They provide the logic of processing at the level of individual values while the logic of looping through the sets is implemented by the system according to the type of operation applied. User-defined functions can be as simple as format conversion and as complex as as a machine learning algorithm.

Parameterization of operations by a model object

A model can be as simple as one value and as complex as a trained deep neural network. This feature leads to a novel view of how data analysis should be organized by combining feature engineering and machine learning so that both model training and model use (predictions or transformations) are part of one data processing workflow. Currently models are supported only as static parameters but in future there will be a possibility to train model within the same workflow

Future directions

  • In future, Prosto will implement such features as incremental evaluation for processing only what has changed, model training for training models as part of the workflow, data/model persistence and other data processing and analysis operations.
  • Data Dictionary (DD) for declaring schema, tables and columns, and Feature Store (FS) for defining operations over these data objects

References

[1]: A.Savinov. On the importance of functions in data modeling, Eprint: arXiv:2012.15570 [cs.DB], 2019. https://www.researchgate.net/publication/348079767_On_the_importance_of_functions_in_data_modeling

[2]: A.Savinov. Concept-oriented model: Modeling and processing data using functions, Eprint: arXiv:1606.02237 [cs.DB], 2019. https://www.researchgate.net/publication/337336089_Concept-oriented_model_Modeling_and_processing_data_using_functions

[3]: A.Savinov. From Group-By to Accumulation: Data Aggregation Revisited, Proc. IoTBDS 2017, 370-379. https://www.researchgate.net/publication/316551218_From_Group-by_to_Accumulation_Data_Aggregation_Revisited

[4]: A.Savinov. Concept-oriented model: the Functional View, Eprint: arXiv:1606.02237 [cs.DB], 2016. https://www.researchgate.net/publication/303840097_Concept-Oriented_Model_the_Functional_View

[5]: A.Savinov. Joins vs. Links or Relational Join Considered Harmful, Proc. IoTBD 2016, 362-368. https://www.researchgate.net/publication/301764816_Joins_vs_Links_or_Relational_Join_Considered_Harmful