From Stanford to Sisu: Making the Leap

People often ask me why I started working on Sisu. I didn’t get here by chance: even though we’re only two years old as a company, I’ve been working on new interfaces to analytics for over half a decade. And the more time I spend in this area, the more I believe that ranking and relevance for cloud data is the highest-impact opportunity in data today. So much so that I’m all in.

Who needs a faster database?

In 2015, I was a twenty-five-year-old who’d just signed up for a seven-year run on the tenure track at Stanford, eager to make my name as a newly minted assistant professor of computer science.

My PhD thesis, in one slide (from my job talk). By exposing just a bit more semantics than “read” and “write” to the database, we showed how to avoid massively expensive delays due to coordination. All good ideas in data come down to expressing (just enough) context at the system level.
In retrospect, this summary slide from a talk at the SIGMOD 2017 New Researcher Symposium panel on “What’s Hyped?” (talk title: “Databases Are Overhyped”) foretold the entire sequence of events.
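
For a flavor of what “just a bit more semantics” buys you, here’s a toy sketch in Python. This is my illustration of the general idea, not the thesis’s actual mechanism; it uses a grow-only counter, a standard CRDT. Because the system knows an update is a commutative increment rather than an opaque read-then-write, replicas can apply updates independently and merge later, with no coordination on the hot path.

    class GCounter:
        """Grow-only counter: each replica tracks its own partial count."""
        def __init__(self):
            self.counts = {}

        def increment(self, replica_id, n=1):
            # Increments commute, so replicas apply them without locking.
            self.counts[replica_id] = self.counts.get(replica_id, 0) + n

        def merge(self, other):
            # Element-wise max is safe because each per-replica count only
            # ever grows; merging in any order yields the same state.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

        def value(self):
            return sum(self.counts.values())

    # Two replicas accept writes concurrently, then reconcile:
    a, b = GCounter(), GCounter()
    a.increment("replica-a", 3)
    b.increment("replica-b", 2)
    a.merge(b)
    assert a.value() == 5  # correct total, no locks or consensus on the write path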

Even the database experts can’t understand queries fast enough

While pondering these questions and writing my dissertation, I spent a few months in Cambridge at MIT working with Sam Madden, a renowned database professor who’d also co-founded a startup called Cambridge Mobile Telematics (CMT) based on his research on mobile sensing of driving behavior.

A clean problem statement: Prioritizing attention

I found inspiration for Sam’s problem in a paper from 1971 called “Designing Organizations for an Information-Rich World” by Turing Award winner Herb Simon. As Simon wrote:

“In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

Talk slide from 2016. In retrospect, it should have been obvious that Andreessen Horowitz would be our first investors.
A figure from an early grant proposal and eventual paper describing what we’d attempt to build in the years to come.

From prototyping, we learned: There’s no silver bullet, but there are pipelines

Now, you can’t get tenure with just a vague idea. You have to execute.

State-of-the-art in scalable human-in-the-loop analytics, as captured in an early 2016 slide deck. Inspired by original content from @jrecursive.
  • There was no silver bullet or single model to help people like Sam monitor and diagnose their metrics. The kind of ranking and relevance functionality we needed consisted of entire pipelines of operators: some to perform feature engineering and transform the data, some to classify it, and others to automatically aggregate the results and visualize them (see the sketch just after this list).
  • Most of the literature looked at these problems in isolation. For example, it turns out “anomaly detection” and “outlier detection” are poorly defined terms: what makes something an “anomaly” is remarkably subjective, and nearly every paper defines each term differently. You could easily substitute the phrase “foobar detection” for either one and the results would make the same amount of sense.
  • Almost nothing ran fast off the shelf. Most of the seminal work on high-volume, high-dimensional statistical inference (especially in sparse regimes) ran dog slow on data at scale, if you were lucky enough to find a working implementation at all. Even many streaming algorithms were optimized for low memory usage, not for keeping up with the insane volumes of data found in people’s data warehouses.
  • Despite the fact that the idea of “prioritizing attention” was so nebulous and poorly defined, the advances in search over unstructured data from the mid-1990s through the 2010s were pretty good evidence that we could make progress. Concepts like the inverted index, now standard within every unstructured search engine, date to the 1950s. “All” it took was some smart people putting them to work in an end-to-end system, plus a lot of data.
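
To make the “pipelines, not models” point concrete, here’s a deliberately tiny Python sketch of that operator shape. Every name here is illustrative, not from any real system; it’s the composition that matters, not any single stage.

    import statistics

    def featurize(rows):
        # Feature engineering: pull out a numeric metric and a grouping dimension.
        return [(r["region"], float(r["revenue"])) for r in rows]

    def score(features):
        # A deliberately simple scoring model: flag values more than
        # three standard deviations from the mean.
        values = [v for _, v in features]
        mu, sigma = statistics.mean(values), statistics.pstdev(values)
        return [(dim, v, abs(v - mu) > 3 * sigma) for dim, v in features]

    def aggregate(scored):
        # Roll up flagged points by dimension into a ranked summary.
        counts = {}
        for dim, _, flagged in scored:
            if flagged:
                counts[dim] = counts.get(dim, 0) + 1
        return sorted(counts.items(), key=lambda kv: -kv[1])

    # The "pipeline" is just composition: transform -> classify -> aggregate.
    def diagnose(rows):
        return aggregate(score(featurize(rows)))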
Early slideware from 2016 on extensible runtime support for user-defined function (UDF)-based statistical scoring models. This architecture didn’t really work, but we learned a lot about reservoir sampling.
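
For the unfamiliar: reservoir sampling maintains a uniform random sample of fixed size over a stream of unknown length, using memory proportional only to the sample size. A minimal sketch of the classic Algorithm R in Python (the names here are mine, not from our slideware):

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of
        unknown length, in O(k) memory (classic Algorithm R)."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                # Fill the reservoir with the first k items.
                reservoir.append(item)
            else:
                # Replace a random slot with probability k / (i + 1),
                # which keeps every item equally likely to be sampled.
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Example: sample 5 values from a stream we only get to see once.
    print(reservoir_sample(range(1_000_000), 5))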

Use cases lead to users lead to optimizations lead to use cases

From all of this building, two things happened.

With awesome partners at Microsoft, we started to see real usage of our prototypes deployed there. That team is amazing.
In density estimation, the left plot containing density estimates is much more expensive to render than the right plot containing a density-based classification of high-probability regions. In the right plot, as soon as you realize you’re in a high-probability, dense (blue) area, you can stop computing. This optimized density-based classification is not only theoretically faster than standalone density estimation, it translates to speedups of over 1000x on real data. This sounds simple, but it’s only obvious when you look at the end-to-end density-based classification task. From 2017.
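
To make the early-stopping intuition concrete, here’s a toy Python sketch of threshold-based pruning for kernel density classification. This is my illustration of the principle, not the system’s actual implementation: kernel contributions are nonnegative, so the running sum is a lower bound on the final density, and a symmetric upper bound can prove sparsity just as early.

    import math

    def gaussian_kernel(x, xi, h):
        # Standard Gaussian kernel with bandwidth h.
        u = (x - xi) / h
        return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))

    def is_dense(x, data, h, threshold):
        """Return True iff the kernel density estimate at x is >= threshold,
        stopping as soon as either answer is provable."""
        n = len(data)
        kmax = 1.0 / (h * math.sqrt(2.0 * math.pi))  # max per-point contribution
        partial = 0.0
        for i, xi in enumerate(data):
            partial += gaussian_kernel(x, xi, h)
            # Contributions are nonnegative, so partial / n is a lower bound:
            # once it clears the threshold, x is provably dense.
            if partial / n >= threshold:
                return True
            # Even if every remaining point contributed its maximum, we could
            # not reach the threshold: x is provably sparse.
            if (partial + (n - i - 1) * kmax) / n < threshold:
                return False
        return partial / n >= threshold

In dense regions the lower-bound check fires after only a handful of points, which is exactly where the end-to-end speedups on the classification task come from.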

What about everyone else?

Despite this success, something kept me up at night. We hadn’t improved our production UX since I wrote our first prototypes. We saw companies like Microsoft incorporate our backend into their production systems, but we hadn’t really closed the loop with users on all of their analyses since the earliest days. I tried hard, but I couldn’t hire production engineers on campus to harden our codebase for people who couldn’t afford SRE and DevOps support for our research prototypes. (I personally wrote our Oracle ODBC connector.) So, outside of the bleeding-edge tech companies, we had a hard time supporting sustained use. What about the rest of the world?

About to sign our first real lease. We outgrew this space late last year, and also got air conditioning. The latter was a real cause for celebration.
Some of the forty-plus members of Team Sisu as of October 2020, in our posh virtual office.

Time’s up: Making the leap

Earlier this year, Stanford reminded me that my two years of leave were up. It was time to come back to campus: use it or lose it, including any C-level title at the company.

Founder and CEO, Sisu Data (www.sisudata.com), former Assistant Professor, Stanford CS