Column operations¶

Compute column¶

A compute column is intended for computing a new column based the values in other columns in the same row. It is defined via a Python user-defined function which gets several input columns and returns one output column which is then added to the table.

Calculate column (instead of map operation)¶

Probably the simplest and most frequent operation in Prosto is computing a new column of the table which is done by defining a calculate column. The main computational part of the definition is a (Python) function which returns a single value computed from one or more input values in its arguments.

This function will be evaluated for each row of the table and its outputs will be stored as a new column.

It is similar to how apply works in pandas (and actually it relies on it in its implementation) but it is different from how map operation works because a calculated column does not add any new table while map computes a new collection (which makes computations less efficient).

The Prosto approach is somewhat similar to spreadsheets with the difference that new columns depend on only one coordinate - other columns - while cells in spreadsheets depend on two coordinates - row and column addresses. The both however are equally simple and natural.

Check out the calculate.ipynb notebook for a working example of the calculate operaiton.

Link column (instead of join)¶

We can define and evaluate new columns only in individual tables but we cannot define a new column which depends on the data in another table. Link columns solve this problem. A link column stores values which uniquely represent rows of a target (linked) table. In this sense, it is a normal column with some values which are computed using some definition. The difference is how these values are computed and their semantics. They do not have a domain-specific semantics but rather they are understood only by the system. More specifically, each value of a link column is a reference to a row in the linked table or None in the case it does not reference anything.

The main part of the definition is a criterion for finding a target row which matching this row. The most wide spread criterion is based on equality of some values in two rows and the definition includes lists of the columns which have to be equal in order for this row to reference the target row.

Link columns have several major uses:

Data in other (linked) tables can be accessed when doing something in this table, say, when defining its calculate columns
Data can be grouped using linked rows interpreted as groups, that is, all rows of this table referencing the same row of the target table are interpreted as one group
Link columns are used when defining aggregate columns

There could be other criteria for matching rows and defining link columns which will be implemented in future versions.

Check out the link.ipynb notebook for a working example of the link operaiton.

Merge column (instead of join)¶

Once we have defined link columns and interlinked our (initially isolated) set of tables, the question is how we can use these links? There are two major conceptual alternatives:

move the whole linked columns to the source tables as one dedicated operation performed before the data in this column is used
do not move the whole column but rather use this link to access individual data values in the linked column from within each operation which needs this data.

The second approach requires less memory because the (linked) data is used where it resides but it is less efficient because each value is accessed via the link. The first approach require more memory because we duplicate the linked column by moving it to the table where it will be used. However, access to this data will be as fast as to all other columns within this source table.

Currently, the first approach is implemented via the dedicated merge column operation. This operation specified a sequence of link columns from the source table to a target column in another table. Its result is a new column in this source table which contains the same data as in the target column and it can be used precisely as any other column in this table. The merged (copied) column can be then used in other operations like calculate columns or aggregate columns.

It is important that it is not necessary to use this operation explicitly. In other words, if we want to use a linked column in some operation, then we could merge it first but it is an explict and optional way. A simpler approach is to specify a column path as our column name in the operation. A column path is a sequence of simple column names separated by some symbol, ‘::’ (two colons) by default. The translator will find such column paths and automatically insert the necessary merge operation.

Rolling aggregation (instead of over-partition)¶

This column will aggregate data located in “neighbor” rows of this same table. These rows to be aggregated are selected using criteria in the window object. For example, we can specify how many previous rows to select.

Currently, its logic is equivalent to that of the rolling aggregation in pandas with the difference that the result column is immediately added to the table and this operation is part of the whole workflow.

The roll operation can distinguish different groups of rows and process them separately as if they were stored in different tables. We refer to this mode as rolling aggregation with grouping. If the link parameter is not empty then its value specifies a column or attribute used for grouping.

Check out the roll.ipynb notebook for a working example of rolling aggregation.

Aggregate column (instead of groupby)¶

This column aggregates data in groups of rows selected from another table. The selection is performed by specifying an existing link column which links the fact table with this (group) table. The new column is added to this (group) table.

Currently, its logic is equivalent to that of the groupby in pandas with the difference that the result column is added to the existing table and the two tables must be linked beforehand.

Check out the aggregate.ipynb notebook for a working example of aggregation.

Discretize column¶

Let us assume that we have a numeric column but we want to partition it into a finite number of intervals and then use these intervals intead of numeric values. The discretize coumn produces a new column with a finite number of values where each such value represents a group the input value belongs to.

How the groups are identified and how the input space is partitioned is defined in the model. In the simplest case, there is one numeric column and the model defines intervals with equal length. These intervals are identified by their border value (left or right). The output columm will contain border values for the intervals input values belong to. For example, if we have temperature values in the input column like 21.1, 23.3, 22.2 etc. but we want to use discrete values like 21, 23, 22, then we need to define a discretize column. In this case, it is similar to rounding (which can be implemented using a calculate column) but the logic of discretization can be more complicated.

Links:

https://numpy.org/doc/stable/reference/generated/numpy.digitize.html