Concepts behind Prosto

Matrixes vs. sets

It is important to understand the following crucial difference between matrixes and sets expressed in terms of multidimensional spaces:

A cell of a matrix is a point in the multidimensional space defined by the matrix axes - the space has as many dimensions as the matrix has axes. Values are defined for all points of the space. A tuple of a set is a point in the space defined by the table columns - the space has as many dimensions as the table has column. Values are defined only for a subset of all points of the space.

It is summarized in the table:

| Property | Matrix | Set | | — | — | — | | Dimension | Axis | Attribute | | Point coordinates | Cell axes values | Tuple attribute values | | Dimensionality | Number of axes | Number of attributes | | Represents | Distribution | Predicate | | Point | Value of distribution | True of false |

The both structures can represent some distribution over a multidimensional space but do it in different ways. Obviously, these differences make it extremely difficult to combine these two semantics in one framework.

Prosto is an implementation of the set-oriented approach where a table represents a set and its rows represent tuples. Note however that Prosto supports an extended version of the set-oriented approach which includes also functions as first-class elements of the model.

Sets vs. functions

Tuples are a formal representation of data values. A tuple has structure declared by its attributes.

A set is a formal representation of a collection of tuples representing data values. Tuples (data values) can be only added to or removed from a set. In Prosto, sets are implemented via table objects.

A function is a mapping from an input set to an output set. Given an input value, the output value can be read from the function or set for the function. In Prosto, functions are implemented via column objects.

Attributes vs. columns

Attributes define the structure of tuples and they are not evaluated. Attribute values are set by the table population procedure.

Columns implement functions (mappings between sets) and their values are computed by the column evaluation procedure.

Pandas vs. Prosto

Pandas is a very powerful toolkit which relies on the notion of matrix for data representation. In other words, matrix is the main unit of data representation in pandas. Yet, pandas supports not only matrix operations (in this case, having numpy would be enough) but also set operations, relational operations, map-reduce, multidimensional and OLAP as well as some other conceptions. In this sense, pandas is a quite eclectic toolkit.

In contrast, Prosto is based on a solid theoretical basis: the concept-oriented model of data. For simplicity, it can be viewed as a purely set-oriented model (not the relational model) along with a function-oriented model. Yet, Prosto relies on pandas in its implementation just because pandas provides a powerful set of highly optimized operations with data.