Excel, Python and the future of data science
- 14 June, 2021 20:15
The world of data science is awash in open source: PyTorch, TensorFlow, Python, R, and much more. But the most widely used tool in data science isn’t open source, and it’s usually not even considered a data science tool at all.
It’s Excel, and it’s running on your laptop.
Excel is “the most successful programming system in the history of homo sapiens,” says Anaconda CEO Peter Wang in an interview, “because regular ‘muggles’ can take this tool...put their data in it...ask their questions…[and] model things.” In short, it’s easy to be productive with Excel.
Superior ease and productivity: This is the future Wang envisions for the popular Python programming language. Although Excel has succeeded without open source, Wang believes Python will succeed precisely because of open source.
It’s about builders
For years we’ve treated software as a product that some company delivers to you for a fee. At least in the enterprise world, this has never reflected reality. Why? Because no matter how good the product, it never fully satisfies the needs of customers.
In addition to whatever customers pay for the software, they’re also going to pay additional fees for integration, customisation, etc. Software, in short, is always a process and not really a product.
Open source was early to clue into this fact. Wang says, “What open source does is it opens the doors. It’s like the right to tinker, the right to repair, the right to extend.” In other words, open source embraces the idea of software as a service—as a process.
More important, this means that open source encourages more people to participate in its creation and success. With most software, Wang estimates that 90 per cent to 95 per cent of users are left out of the creation process. They might see the demos but they’re trusting others to deliver software value on their behalf.
By contrast, “open source for data science has become so successful because a whole new category of users got turned into makers and builders,” Wang says.
To be clear, most people aren’t writing Python scripts. But Python has made it much easier for average people to do data science, which is one of the biggest reasons for its success in the field.
For Wang, the holy grail isn’t for Python to beat Ruby or Perl or some other programming language—it’s to supplant Excel as the data science tool of choice for average, mainstream users. “I’m pushing Python and PyData to be the conceptual successor to Excel,” he says.
Remixing the future
How do we get there? The open source community is essential, Wang argues, and not merely the community of those capable of committing code. Python, he says, has a “remix culture and a learning culture as well as a teaching culture.”
Of course, code matters in Python land. Those who commit code, Wang suggests, lay the foundation for much of what others build on top: “By maintaining a certain user layer and a user-facing API and providing some stability around that, they are allowing a whole higher level of contribution to emerge and to thrive.” This isn’t enough, however.
Nor is it the only valuable contribution. He notes that “all the people answering usage questions on Stack Overflow and all the people writing a blog post about their first Scikit-learn model” may be only two or three years into doing any kind of data analysis work themselves, but they’re paving the way for others to participate.
Is this better than the Excel model of innovation, with one company pushing a particular product? For Wang, the answer is a clear yes.
“When we have slowed down and worked with other people, generally the end result is better than if we just hunkered down and did our own thing,” he says. The end result, Wang hopes, is a community-developed “Excel” that will change data science forever, making it even more approachable and broadly applicable than Excel.