Hacker News | asavinov's comments

Looks interesting, although I miss a short introduction or how-to guide. I found that one can "Create and train machine learning models to predict market values." In this context, a related project is the Intelligent Trading Bot: https://github.com/asavinov/intelligent-trading-bot which is intended for generating trading signals based on ML and feature engineering.


Having Python expressions within a declarative language is a really good idea because we can combine the low-level logic of computing values with the high-level logic of set processing.

A similar approach is implemented in the Prosto data processing toolkit:

https://github.com/asavinov/prosto

Although Prosto is viewed as an alternative to Map-Reduce by relying on functions, it also supports Python User-Defined Functions in its Column-SQL:

  prosto.column_sql(
    "CALCULATE  Sales(quantity, price) -> amount",
    lambda x: x["quantity"] * x["price"]
  )
In this Column-SQL statement we define a new calculated column whose values are computed in Python (as a product of two columns). One advantage is that we can process data in multiple tables without joins or groupbys, which is much easier than in the existing set-oriented approaches. Another advantage is that we can combine many statements by defining a workflow in an Excel-like manner.
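As a rough illustration (plain pandas, not the Prosto API itself; the `Sales` table contents are made up), the same calculated column can be expressed like this:

```python
import pandas as pd

# Hypothetical Sales table with the columns referenced in the statement above
sales = pd.DataFrame({"quantity": [2, 5, 1], "price": [10.0, 3.0, 7.5]})

# The same lambda as in the Column-SQL statement, applied row-wise
calc = lambda x: x["quantity"] * x["price"]
sales["amount"] = sales.apply(calc, axis=1)

print(sales["amount"].tolist())  # [20.0, 15.0, 7.5]
```

The point of the declarative wrapper is that such per-row logic stays in Python while the table-level wiring is described separately.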


Interesting! I will take a look.



Not exactly. Timing is the issue when it comes to mixed types of data sources: try to load data via a URL (e.g., the results of an API invocation or an RSS feed), then merge it with data from a PostgreSQL table or a CSV file. It becomes challenging with the listed tools, whereas TABLUM.IO solves the issue and makes it organic and fast.


Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI etc.) were thought of as a replacement for Excel: working with tables should be as easy as working with spreadsheets. This problem can be solved when defining columns within one table:

  ColumnA = ColumnB + ColumnC, ColumnD = ColumnA * ColumnE 
Here we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
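As a minimal sketch (my own illustration, not any of the tools above), such column formulas can be evaluated in dependency order, just as a spreadsheet recomputes dependent cells:

```python
# Each derived column is a function of already existing columns.
# Evaluating formulas in dependency order mimics spreadsheet recalculation.
table = {"ColumnB": [1, 2], "ColumnC": [10, 20], "ColumnE": [3, 3]}

formulas = {
    "ColumnA": lambda t, i: t["ColumnB"][i] + t["ColumnC"][i],
    "ColumnD": lambda t, i: t["ColumnA"][i] * t["ColumnE"][i],
}

# Dependency order: ColumnA must be computed before ColumnD uses it
for name in ["ColumnA", "ColumnD"]:
    f = formulas[name]
    table[name] = [f(table, i) for i in range(2)]

print(table["ColumnD"])  # [33, 66]
```

A real tool would derive the evaluation order from the dependency graph instead of hard-coding it.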

Yet, the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example:

  Table1::ColumnA = FUNCTION(Table2::ColumnB, Table3::ColumnC)
Different systems provide different answers to this question but all of them are highly specific and rather limited.

Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach: the relational model works with sets (rows of tables), not with columns.

One generic approach to working with columns in multiple tables is provided in the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the Prosto data processing toolkit, which is an alternative to map-reduce and SQL:

https://github.com/asavinov/prosto

It defines data transformations as operations with columns in multiple tables. Since we use mathematical functions, no joins and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.

Moreover, now it provides Column-SQL which makes it even easier to define new columns in terms of other columns:

https://github.com/asavinov/prosto/blob/master/notebooks/col...


Hamilton is more similar to the Prosto data processing toolkit which also relies on column operations defined via Python functions:

https://github.com/asavinov/prosto

However, Prosto allows for data processing via column operations in many tables (implemented as pandas data frames) by providing column-oriented equivalents for joins and groupby (hence it has no joins and no groupbys, which are known to be quite difficult and to require high expertise).
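As a hedged sketch in plain pandas (not the Prosto API; the `orders` and `customers` tables are made up), the two column-oriented equivalents can be imitated like this: a link column makes another table's attributes directly accessible per row, and an aggregate column attaches grouped totals as a new column, so neither operation produces a join result set:

```python
import pandas as pd

orders = pd.DataFrame({"customer": ["a", "b", "a"], "amount": [10, 20, 5]})
customers = pd.DataFrame({"name": ["a", "b"],
                          "country": ["DE", "FR"]}).set_index("name")

# Link column: each order row "points" to a customer row, so customer
# attributes become accessible without materializing a joined table
orders["country"] = orders["customer"].map(customers["country"])

# Aggregate column: per-customer totals attached back to the customers table
customers["total"] = orders.groupby("customer")["amount"].sum()

print(customers["total"].tolist())  # [15, 20]
```

pandas still groups internally here; the conceptual difference is that the result of each step is a new column on an existing table, not a new table.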

Prosto also provides Column-SQL which might be simpler and more natural in many use cases.

The whole approach is based on the concept-oriented model of data which makes functions first-class elements of the model as opposed to having only sets in the relational model.


> I think SQL is irritatingly non-composable, many operations require gymnastics to express

One approach to radically simplifying operations with data is to use mathematical functions (in addition to mathematical sets), which is implemented in the Prosto data processing toolkit [0] and the (new) Column-SQL [1].

[0] https://github.com/asavinov/prosto Prosto is a data processing toolkit - an alternative to map-reduce and join-groupby

[1] https://prosto.readthedocs.io/en/latest/text/column-sql.html Column-SQL (work in progress)


One alternative to SQL (as a type of thinking) is Column-SQL [1], which is based on a new data model. This model relies on two equal constructs: sets (tables) and functions (columns). It is opposed to the relational algebra, which is based on only sets and set operations. One benefit of Column-SQL is that it does not use joins and group-by for connectivity and aggregation, respectively, which are known to be quite difficult to understand and error-prone in use. Instead, many typical data processing patterns are implemented by defining new columns: link columns instead of join, and aggregate columns instead of group-by.

More details about "Why functions and column-orientation" (as opposed to sets) can be found in [2]. In short, the problems with set-orientation and SQL arise because producing sets is not what we frequently need: we need new columns, not new tables. Applying set operations is then a kind of workaround for the absence of column operations.

This approach is implemented in the Prosto data processing toolkit [0], and Column-SQL [1] is a syntactic way to define its operations.

[0] https://github.com/asavinov/prosto Prosto is a data processing toolkit - an alternative to map-reduce and join-groupby

[1] https://prosto.readthedocs.io/en/latest/text/column-sql.html Column-SQL (work in progress)

[2] https://prosto.readthedocs.io/en/latest/text/why.html Why functions and column-orientation?


Here is another project based on the same idea of processing data using functions:

https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!

Yet, here the focus is on feature engineering and rethinking how it can be combined with traditional ML. Essentially, the point is that there are no big differences, and it is more natural and simpler to think of them as special cases of the same concept: features can be learned, and ML models are frequently used for producing intermediate results.
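A minimal sketch of that idea (my own illustration, not the lambdo API; the data and both "fit" functions are made up): a learned feature and a predictive model share the same train-then-apply shape, so a pipeline is just a chain of such steps:

```python
# Both steps below are "train on data, then apply to data": a learned
# feature (standardization with fitted mean/std) and a trivial "model"
# (a fitted threshold) are special cases of the same concept.

def fit_standardize(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return lambda v: (v - mean) / std  # the learned feature

def fit_threshold(features, labels):
    # "Model": cut halfway between the largest negative and
    # smallest positive feature value seen in training
    lo = max(f for f, y in zip(features, labels) if y == 0)
    hi = min(f for f, y in zip(features, labels) if y == 1)
    cut = (lo + hi) / 2
    return lambda f: int(f >= cut)

raw = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]

feature = fit_standardize(raw)        # step 1: learned feature
x = [feature(v) for v in raw]         # intermediate column
model = fit_threshold(x, labels)      # step 2: learned model
preds = [model(v) for v in x]
print(preds)  # [0, 0, 1, 1]
```

Seen this way, "feature engineering" and "modeling" are just stages of one workflow that each produce a derived column.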


The main motivation is that the conventional approaches to data processing are based on manipulating mathematical sets for all kinds of use cases: we produce a new set if we want to calculate a new attribute, we produce a new set if we want to match data from different tables, and we get a new set if we aggregate data. Yet, we actually do not need to produce new sets (tables, collections etc.) in many cases: it is enough to add a new column to an existing set. Here are more details about the motivation:

https://prosto.readthedocs.io/en/latest/text/why.html

A column is an implementation of a function (similarly to how a table is an implementation of a set). Theoretically, this approach leads to a data model based on two core elements: mathematical functions (new) and mathematical sets (old).
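As a toy sketch of that model (my own illustration, with made-up table and column names): a table is a set of row identifiers, and a column is literally a function (here a dict) from those identifiers to values. Deriving data then means defining new functions rather than producing new sets:

```python
# Tables are sets of row ids; columns are functions (dicts) on those ids.
products = {0, 1}
orders = {0, 1, 2}

price = {0: 10.0, 1: 7.5}        # column on products
product_of = {0: 0, 1: 1, 2: 0}  # link column: orders -> products
quantity = {0: 2, 1: 1, 2: 3}    # column on orders

# Calculate column: composed from existing functions; no new set is produced
amount = {o: quantity[o] * price[product_of[o]] for o in orders}

# Aggregate column on products: sum of amounts of the linked orders
revenue = {p: sum(amount[o] for o in orders if product_of[o] == p)
           for p in products}

print(revenue)  # {0: 50.0, 1: 7.5}
```

Note that both tables keep their original row sets throughout; every operation only adds a function defined on one of them.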

This approach was implemented in Prosto, which is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions: an alternative to map-reduce and join-groupby.


> I always felt like there was some super deep & fundamental link between these mathematical concepts and relational modeling ideas.

The relational model relies on set theory (more specifically, relational algebra). An alternative view of data and data modeling is based on 1) sets and 2) functions, and is called the concept-oriented model [1, 2]. It is actually quite similar to category theory and maybe could even be described in terms of category theory. It is also quite useful for data processing, and there is one possible implementation which is an alternative to the map-reduce and join-groupby approaches [3].

[1] A. Savinov, On the importance of functions in data modeling https://www.researchgate.net/publication/348079767_On_the_im...

[2] A. Savinov, Concept-oriented model: Modeling and processing data using functions https://www.researchgate.net/publication/337336089_Concept-o...

[3] https://github.com/asavinov/prosto Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

