Quick start (Column-SQL)

Creating a workflow

All data elements (tables and columns) as well as operations for data generation are defined in a workflow object (interpreted as a context):

import prosto as pr
prosto = pr.Prosto("My Prosto Workflow")

More info: Workflow and operations

Populating a source table

Each table has some structure which is defined by its attributes. Table data is defined by the tuples it consists of and each tuple is a combination of some attribute values.

The simplest way to populate a source table is to create or load a pandas data frame and then pass it to a Column-SQL statement:

sales_data = {
    "product_name": ["beer", "chips", "chips", "beer", "chips"],
    "quantity": [1, 2, 3, 2, 1],
    "price": [10.0, 5.0, 6.0, 15.0, 4.0]
}
sales_df = pd.DataFrame(sales_data)

prosto.column_sql("TABLE Sales", sales_df)

The Column-SQL statement TABLE Sales will create a definition of a source table with data from the sales_df data frame.

In more complex cases, we could pass a user-defined function (UDF) instead of the data frame. This function is supposed to “know” where to load data from by returning a pandas data. For example, it could load data from a csv file.

More info: Table operations

Defining a calculate column

A column is formally interpreted as a mathematical function which maps tuples (defined by table attributes) of this table to output values. The simplest column operation is a calculate column which computes output values using the values of the specified input columns of the same table:

prosto.column_sql(
    "CALCULATE  Sales(quantity, price) -> amount",
    lambda x: x["quantity"] * x["price"]
)

This new amount column will store the amount computed for each record as a product of quantity and price. The CALCULATE statement consists of two parts separated by arrow:

  • First, we define the source table and its columns that we want to process as input: Sales(quantity, price)
  • Second, we define a column to be created: amount

This use of arrows is an important syntactic convention of Column-SQL which informally represent a flow of data within one table or between tables.

More info: Column operations

Executing a workflow

A workflow object stores only operation definitions. In order to really process data, the workflow has to be executed:

prosto.run()

Prosto translates a workflow into a graph of operations (topology) taking into account their dependencies and then executes each operation.

Now we can explore the result by reading data form the table along with the calculate column:

df = prosto.get_table("Sales").get_df()
print(df)
   product_name quantity price amount
0  beer         1        10.0  10.0
1  chips        2        5.0   10.0
2  chips        3        6.0   18.0
3  beer         2        15.0  30.0
4  chips        1        4.0   4.0

The amount column was derived from the data in other columns. If we change input data, then we can again run this workflow and the derived column will contain updated results.

The full power of Prosto comes from the ability to process data in multiple tables by definining derived links (instead of joins) and then aggregating data based on these links (without groupby). Note that both linking and aggregation do not require and will not produce new tables: only columns are defined and evaluated. For example, we might use column paths like my_derived_link::my_column in operations in order to access data in other tables.

More info: Column-SQL