Data Science is Terrible
Welcome to another edition of “In the Minds of Our Analysts.”
At System2, we foster a culture that encourages our team to explore topics, write down their thoughts, and share their perspectives. This series provides a space for our analysts to showcase their insights.
All opinions expressed by System2 employees and their guests are solely their own and do not reflect the opinions of System2. This post is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of System2 may maintain positions in the securities discussed in this post.
Today’s post was written by David Cheng.
How to do “Data Science”
Thanks to the kind folks over at FirstMark (VC) for putting together the chart below of everything you should know, and the vendors that provide tooling, if you’re working with data. They’ve been doing this for a couple of years now, and every year it gets bigger and denser. Cue drumroll for a 6MB PNG file…
To be fair, a lot of vendors are repeated across categories (congrats to Databricks for appearing the most). Bonus points to anyone who spots System2 (hint: bottom right, smallest logo).
As Matt Turck from FirstMark succinctly put it…
Doing data science, in general, is terrible. Does it have to be? Let’s dig into how the data sausage is made. Read on if you want to dissuade yourself from a career in data science. Skip to the end if you want to know why you should hire System2.
Data Science is Terrible Because it Requires Too Many Skills To Execute
So a great data scientist should…
Know a fair amount of the stuff Matt Turck mentioned above - keeping up-to-date with the latest techniques and tooling.
Have reasonably deep domain knowledge - at System2 we understand buy-side investing
Be able to translate problems to math and code - someone with a stats background
Gather the relevant data or figure out what needs to be measured and measure it - software developer
Be able to work with potentially large datasets and process it efficiently - comp sci and engineering are handy
Be able to integrate and enrich the data in order to explore or prepare it for AI/ML - comp sci and AI/ML experience
Do all of this in a repeatable fashion while handling changes in the data, modeling assumptions, and software updates - software engineering experience
Be comfortable negotiating with vendors or other departments to source data - experience with the marketplace
Have good communication skills and project management to manage all of the above - business analyst
Data science requires a lot of skills to execute. If you don’t have good coverage:
Things may be done inefficiently or poorly, which can turn your Snowflake costs from 50K → 500K just because your quant doesn’t really do data engineering and databases. Maybe those tables should have been normalized?
You might not get insights because your data engineer doesn’t have enough experience working with statistics, time series, and modeling. Is your correlation on raw levels or on changes? (See the sketch after this list.)
You end up with a lot of interesting but useless insights because your team doesn’t understand the domain and what’s actionable. Did someone forget to panelize and normalize before analysis?
Or, more simply, you might just get things wrong.
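On that levels-versus-changes point, here’s a minimal sketch (made-up series, not real data) of how two unrelated but trending series can look highly correlated on raw levels while their day-over-day changes are essentially uncorrelated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2023-01-02", periods=250, freq="B")

# Two independent random walks that both happen to drift upward
a = pd.Series(np.cumsum(rng.normal(0.5, 1.0, len(days))), index=days)
b = pd.Series(np.cumsum(rng.normal(0.5, 1.0, len(days))), index=days)

print("correlation of raw levels:  ", a.corr(b))                  # typically very high
print("correlation of daily changes:", a.diff().corr(b.diff()))   # close to zero
```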
At System2, we’ve quickly come to the conclusion that if you don’t have a well-functioning and balanced team covering your required areas, then you miss out, spend a lot, and have low morale. Different domains require different mixes of these skills, which may require different teams. For example, e-commerce has a lot of pre-built models and consolidated vendors, so teams there need fewer quants and less engineering but more integration and data modeling.
Data science is terrible because most people don’t realize the really wide (and domain specific) skillset it takes to execute. At System2 we believe that there are no good data scientists, there are only good data science teams.
Data Science is Terrible Because the Tooling is Terrible
From the previous section we’ve established that doing data science requires a wide variety of applied skills. Unfortunately, there are almost too many tools (most of which I dislike). Most of the tools are terrible because:
they only address a small set of skills so you need many tools to cover everything
Tableau is nice but a whole bunch of other tools are necessary to get the data in the right shape for it to be useful.
most tools don’t work well with other tools
Like your Spark data pipelines? Have fun mixing them with dbt and scheduling them with Prefect (see the sketch after this list).
a lot of the tools are proprietary and lock you in with the vendor and its features (or lack thereof)
everyone keeps getting acquired, which can sometimes leave you stuck or stall development
Redash was purchased by Databricks, Streamlit by Snowflake, Looker by Google
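To give a flavor of the glue work, here’s a rough sketch (not a recommendation, and the script and project paths are hypothetical) of what wiring a Spark job and a dbt project together under a Prefect schedule tends to look like:

```python
import subprocess

from prefect import flow, task


@task(retries=2)
def run_spark_job():
    # Hypothetical PySpark script that lands raw vendor data in the warehouse
    subprocess.run(["spark-submit", "pipelines/ingest_raw.py"], check=True)


@task
def run_dbt_models():
    # Hypothetical dbt project that transforms the raw tables
    subprocess.run(["dbt", "run", "--project-dir", "dbt_project"], check=True)


@flow(name="nightly-pipeline")
def nightly_pipeline():
    run_spark_job()
    run_dbt_models()


if __name__ == "__main__":
    nightly_pipeline()  # in practice this gets deployed and scheduled, not run by hand
```

That’s three tools, three sets of docs, and three sets of failure modes for one nightly job.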
With few dominant players and a high degree of vendor lock-in, it’s no wonder the venture capitalists and private equity firms are throwing so much money into this space. Let’s walk through a couple of use cases, and the popular choices in each, for data science in the finance space.
For data visualization:
Tableau is great for business analyst types who may be weak in coding and know some SQL. It’s easy to explore, distill insights, and publish, but it’s too limiting for statistical types, who find it terrible for manipulating data. You also can’t build new kinds of visualizations beyond what it ships with, and interactivity with a database or APIs is tough.
Hex or Observable are better choices for more technical folks because they can write whatever code they want to do whatever they want. But for the same reason they’re great for technical folks, they’re terrible for most business analysts.
Different tools, different audiences, different capabilities, different skills to use. Very little transferability or compatibility between the two. You’ll probably end up with both and build duplicate data pipelines, since Tableau users need better-prepared data. Terrible. It’s generally agreed that you wouldn’t want to build a data pipeline within your data visualization tools. Do you want portfolio risk models calculated inside a Tableau workbook? Is it much better in a Hex notebook? Should they really be pulling in interest rate curves?
So let’s add more tools for a data pipeline:
Talend or Alteryx provide nice visual tools to help build workflows, which is a great way to get started, but we’ve struggled to scale them up cost-effectively and we’re wary of vendor lock-in.
Databricks scales up well, provides visualization tools, and lets you integrate anything into it. It’s what I run into the most frequently with our vendors and partners.
Snowflake, besides being a popular cloud database, has become more or less feature-equivalent to Databricks with the recent additions of Snowpark and Streamlit.
I like dbt but it requires a lot of SQL and knowledge of databases. You also still need a database. It’s a terrible place for folks who are building their first data pipeline.
What do you choose? What fits your organization? Can you live with it for the rest of time? What tool will you use to orchestrate and manage your pipelines? How about a data quality monitoring tool?
Choosing a tool is a heavy decision, and it depends largely on your team and their collective skills. Tooling in data science is a lot like constructing the Tower of Babel. It would be nice if one environment could work with big or small data, manage it, orchestrate it, and communicate it, but it’s hard to see that in the current state.
Data Science is Terrible Because the Data Itself is Terrible
Before any of the analysis or insights, everyone has to spend time getting the data and making it accessible. No one enjoys this part of the job. Data is terrible because of the synergistically bad combination of issues from data formats, data delivery, and data ingestion.
Let’s start with data formats. There are so many. However, the most popular format is also the worst. Comma-separated values (CSVs) and their inbred cousins (tab- or pipe-separated variants) have the following issues (a short illustration follows the list):
What happens when I have a comma that isn’t separating a value?
Surround your value with double quotes.
If your values are surrounded with double quotes…
What happens when I have a double quote within my value?
Use two double quotes.
Some bad people put a backslash in front (\”) instead
Some very bad people use a mix of both
Why can’t headers be required?
What happens when there’s a new line in a value? Like two paragraphs?
As long as the value is enclosed in double quotes it should be fine
Unfortunately a lot of code and systems don’t handle this
Some bad people like to put a backslash and n (\n) to signify a new line
When will I find out the data type of this empty column?
Is 90210 a number or text?
If it’s followed by 07030 it’s probably a text zip code
In a better world it would have been “90210”
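A minimal illustration of a few of these quirks, using Python’s standard csv module and pandas (the values below are invented):

```python
import csv
import io

import pandas as pd

raw = 'id,comment,zip\n1,"Says ""hi"", twice","90210"\n2,"Line one\nLine two","07030"\n'

# The csv module handles escaped quotes and embedded newlines -- provided the
# writer actually followed the rules. Plenty of hand-rolled writers don't.
for row in csv.reader(io.StringIO(raw)):
    print(row)

# pandas will happily guess that zip codes are integers and drop the leading zero
# unless you spell out the dtype yourself.
print(pd.read_csv(io.StringIO(raw))["zip"].tolist())                      # [90210, 7030]
print(pd.read_csv(io.StringIO(raw), dtype={"zip": str})["zip"].tolist())  # ['90210', '07030']
```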
Then comes data delivery. Anything is better than a CD-ROM through the mail. However, the most common delivery mechanism we see at System2 is S3. S3 is great because it’s cheap, can be super fast, and has pretty rich options for controlling access. Unfortunately, it can be complicated to use. To begin with, there’s usually a back-and-forth:
My AWS account user or yours?
Are both our user IAM profiles set up correctly? <write and troubleshoot your JSON>
Are your bucket permissions correct? <write and troubleshoot more JSON>
Somehow writing JSON policies to manage this is simpler than using the console user interface. This is not something the average data scientist enjoys doing but it is required.
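For the curious, the JSON in question usually looks something like the bucket policy below, attached here with boto3. The account ID, role, and bucket names are placeholders, and a real policy will likely need more statements than this:

```python
import json

import boto3

bucket = "vendor-data-drop"  # hypothetical bucket holding the deliveries
consumer = "arn:aws:iam::111122223333:role/data-ingest"  # hypothetical consumer role

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": consumer},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # ListBucket applies to the bucket
                f"arn:aws:s3:::{bucket}/*",  # GetObject applies to the objects
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```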
After delivery, you want it in a queryable form. Usually this entails:
Defining a table for the data, with column names and types that the CSV doesn’t actually specify.
Do you guess the types?
Will all the date types be consistent? What timezone are they in?
What’s the type of the column full of empties?
Waiting for a load to complete
Your load will fail at 95% because of invalid data and you will have to start over. Either adjust your schema or do some pre-processing to get the bad stuff out.
Using a data lake can get you around the long load but every query will be slower instead. Haha.
I have to give a shout-out to Google BigQuery for making loads relatively friendly. You can usually point it at a bucket, provide a table name, and the table appears, columns and all.
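For reference, a load with the BigQuery Python client is roughly this short. The bucket path, dataset, and table names below are made up, and autodetect will still occasionally guess a type wrong:

```python
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.vendor_data.transactions"  # hypothetical destination table
uri = "gs://vendor-data-drop/transactions/*.csv"  # hypothetical bucket path

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer column names and types
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # waits for the load to finish (or raises on bad rows)

print(client.get_table(table_id).schema)
```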
At System2, we often joke that if any of us ruled the world in the future we’d make data formats and delivery standardized. I’d probably make data that doesn’t meet its specification a felony.
Shameless Plug
System2 is how you figure stuff out from a bunch of data and make it part of your business process. We have teams of people with all the necessary skills, tooling, infrastructure, and experience to deliver actionable insights from raw data. In a future content marketing post, I’ll dive into our actual stack if people on the team don’t force me to write about LLMs first.
What I Hope The Future Will Look Like
On a brighter note, I do see a better future for all of us in the industry. On my Santa list are:
Parquet (or something like it) will become the new CSV (see the sketch at the end of this list)
Having column names and data types is great!
Column structure is lake friendly!
Compresses nicely (gzip or snappy)!
We can all publish databases to 3rd parties like Snowflake
Granting access is just a SQL statement away!
Private sharing and marketplace is easy!
We can skip data format, data delivery, and data ingest!
Lets-Plot becomes the de facto visualization library
That should bring over the R ggplot crowd?
Someone will hopefully add easy Python data binding to the d3/React stuff?
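And to show why Parquet tops that list, a quick sketch of the zip code problem from earlier disappearing because the file carries its own schema (invented data, pandas with pyarrow under the hood):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "zip": pd.Series(["90210", "07030"], dtype="string"),
        "observed_at": pd.to_datetime(["2024-01-02", "2024-01-03"], utc=True),
        "spend": [123.45, 67.89],
    }
)

# Round-trip through CSV: the leading zero and the datetime dtype are lost
df.to_csv("sample.csv", index=False)
print(pd.read_csv("sample.csv").dtypes)

# Round-trip through Parquet: column names, dtypes, and compression come along for free
df.to_parquet("sample.parquet", compression="snappy")
print(pd.read_parquet("sample.parquet").dtypes)
```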