An electronic spreadsheet has a grid of cells, named by column/row coordinates, and each cells can be set to just a plain value or to a formula which derives the cell value from other cell values. Because cell values can depend on other cell values in turn, the spreadsheet forms a kind of visual dataflow language and thus spreadsheets are turing complete.
Microsoft Excel is ubiquitous in business, adding formatting and charting and multiple sheets per file and integration with the rest of the office suite. Despite its limitations, probably more useful business work has been done in Excel and its imitators than in all other languages combined.
It’s hard to deny the convenience of using a spreadsheet to manipulate tabular data, as opposed to editing whatever kind of markup. Standard unix tools like sort and awk are powerful but not easy for beginners especially those less used to command line programming.
However, the standard spreadsheet has some pretty severe problems:
Aggregate functions like
SUM() can easily be incorrectly set up to address a subset of
rows in a table area, losing important data.
See this botched economic analysis
[The authors] had not selected the entire row when averaging growth figures: they omitted data from Australia, Austria, Belgium, Canada and Denmark […] When that error was corrected, the “0.1% decline” data became a 2.2% average increase in economic growth.
Anything that looks like a number, and anything that looks like a date, gets autoconverted.
See this paper on scrambled gene names
Back in about 2010 or so I got interested in methods for extracting formulae and data from
spreadsheets into regular code. A spreadsheet would get imported and each cell treated as
a simple assignment statement like
a10 = a7 + b9; and the resulting mess could be sorted
into an appropriate order and then executed quickly.
There’s a lot of business logic wrapped up in spreadsheets no-one fully understands any more, and this would be a first step towards converting that into readable logic. However, the process of extracting the patterns from thousands of formulae scattered among cells is pretty fraught, and there’s no way to differentiate or give names to variables, constants and labels,so the resulting code is no less of a mess than the original spreadsheet, really. Here we are again, Naming Things.
One possibility is, we could simply dump the collection of formulae out in the form of a more convential programming language, and then compile and run it. Just about any language would work as the target:
double a10 = 1.01; double a11 = 1.02; double b10 = 12; double b11 = 12; double c10 = pow(a10, b10); double c11 = pow(a11, 11); double c12 = c11 - c10;
Obviously there’s a lot of repetition, but once the ‘transpile’ step is done, conventional refactoring tools can be used to give constants and variables more sensible names. It’s still horribly ugly of course, but at least it is a start, and there’s even a change someone might notice the typo in the formulae above:
double interest_rate_1 = 1.01; double interest_rate_2 = 1.02; double months_per_year = 12; double compound_rate_1 = pow(interest_rate_1, months_per_year); double compound_rate_2 = pow(interest_rate_2, months_per_year); double profit = compound_rate_2 - compound_rate_1;
… and at least it’ll run quick!
 (There’s actually some facilities for naming things built in to Excel, but because they’re pretty well hidden they’re not used a lot.)
However, this code transformation ignores the fundamentally tabular nature of spreadsheets. A single sheet might have many tables within it, and they’re not usually formally declared, but they still exist. The problem is, the information within the file is generally too patchy to somehow magically extract types and form classes etc. You could use some heuristic or another, like the Decompilers of old but fixing up the wrong guesses is always going to be a bear. Header rows aren’t always marked, columns aren’t always consistent, formulae may be different or missing in places.
However, an interactive analysis tool could help with this, allowing an iterative process to conversion. It’s not going to be trivial: I have fond memories of playing with the Sourcer disassembler back in the early days of PC viruses, and it was an adventure.
A typical spreadsheet is one “workbook” split into multiple “worksheets”, and each “worksheet” may contain multiple tabular regions with different types of data in them.
The example above contains three separate regions for example:
Each of these have vagaries of formatting and type.
There’s an old joke about “sure you can walk there, but I wouldn’t start from here”.
Perhaps it’d make more sense to think about this as “how would you make a new spreadsheet” rather than “how would you convert an existing spreadsheet”. We’ve thought about what makes conversion hard, so let’s add some information to make it easy.
In the example,
B1 is a column label,
B3:B5 are the elements of labelled column,
B8:B11 is a sequence of labels and
B13 is a row label.
So what useful thing can we say about Column B as a whole? Not much.
There’s now three separate tables, each of which has a title, column and/or row labels and a clear separation of purpose.
Because tables are separate, each can have a different ‘source’, whether that be a database, local file storage or a ‘temporary’ table held only in RAM.
When developing software, labels make programs readable to humans. Table, column and row labels allow us to write much more natural formulae. An individual cell or a group of cells can be referred to by label, so we can write:
Profit.Total = SUM(Profit by Quarter.Profit)
… which is a lot more readable than our earlier
C13 = SUM(C8:C11).
A whole row or column can be addressed, or a cell can be addressed by specifying
row and column in either order. For example, the cell
is the same cell as
In this way, tables can be extended, re-ordered or even “flipped” diagonally with no effect on the rest of the spreadsheet!
In the original example, there’s no consistent indication of what is a heading or a
string or a number.
It happens that the cell
C13 is a formula, taking it’s value from
Theres’s no particular visual indication of this either.
Table headings, row/column headings and formulae should all be visually distinct, to avoid confusion. The flow of data between tables should be also presented clearly.
The visual layout of the spreadsheet would also then become detatched from the logical layout of the spreadsheet, so that a user with different requirements (eg: mobile devices, bigger fonts, clearer colours) could display the spreadsheet differently without the issues that this causes in Excel.
Labels are separate from the row or column they’re labelling. As well as a label, each row/column can be given a type … integer, float, all the usuals. A cell’s type is limited by both its row and its column. It’d probably also be appropriate to do some Type Inference.
It might also be useful to use row/column predicates to limit the valid values of cells, bringing in a bit of Design by Contract as well.
(Borrowed from PEP20)
In a traditional spreadsheet, it’s possible to select an arbitrary range of cells and apply a formula to that. It’s pretty easy to get that range wrong and add new rows outside the range. Instead of allowing a formula to target a subset of rows accidentally, this would have to be made explicit, so you could still write a formula like
SUM(Country.GDP IF Country.Name != 'Australia' ELSE 0)
… if that is, for some weird reason, what you actually meant. It’s still possible to write a misleading spreadsheet, of course, but it makes it a lot harder to do subtly, sneakily or by accident.
Much like Flobot, every graphical program in this system would be “dual” with a text representation. This would mean that programs could be compared and maintained using familiar source control systems.
We’ve considered the strengths and weaknesses of conventional spreadsheets and proposed some interesting principles for designing something better.
It sounds a bit radical, but sometimes, it really is time to reinvent the wheel!
I posted this on twitter and got some good feedback already.
I haven’t really talked about it so far because I’ve been thinking mostly about the programming experience but a “table” in this sense absolutely could be a view into an external database or into an API, and a “table” could also be exposed to the world as an microservices-like API.
If rows and columns are notationally equivalent, how do I express that some are orthoganal to others? Can this generalize to more dimensions?
It’s easy enough to write “this cell is the total of those cells”, but there’s a notational issue around how to express “the cells in table X are the totals of each of the columns of table Y”.
I desparately need a name for a “row or column”. Vector, in math terms, would be a
1xM matrix, so would be a good name for mathematicians, but I’m not sure
it gives the right idea for anyone else.
If anyone does know a product and/or service which covers these bases already, let me know!