5 Secrets Billionaires Don’t Want You To Know About Making Money Using Data Science -or- What Is Data Science

Welcome to another edition of "In the Minds of Our Analysts."

At System2, we foster a culture that encourages our team to investigate topics, write down their thoughts, and share their perspectives. This series gives our analysts a space to present their insights.

All opinions expressed by System2 employees and their guests are solely their own and do not reflect the opinions of System2. This post is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of System2 may maintain positions in the securities discussed in this post.

Today’s post was written by David Cheng.

What is Data Science? 🙋‍♂️🙋‍♂️

Let’s start by asking a language model, like ChatGPT. ChatGPT is not a general purpose AI, but it is a fantastic summarizer that has ingested the internet. So let’s use it to save us from Googling, reading, thinking, and repeating. It’s also exciting because it can paraphrase incorrectly, such as by calling the King of Twitter the CEO.

“ChatGPT, what is data science in a sentence or two?”

Data Science is the process of applying scientific methods, techniques, and algorithms to extract meaningful insights from data. It is used to identify patterns, trends, and correlations in data and can be used to make predictions, provide solutions, and inform decisions.

Given the description above, it looks like everyone has been doing data science since Excel was created. So why now does it seem like a new-ish thing 37 years after Excel first came out (Sorry Lotus 1-2-3 folks of ‘82)? Having spoken to recruiters, it seems like every company is doing data science now and everyone is now potentially a data scientist.

Does data science require big data (vague term)? Does it require the usage of artificial intelligence/machine learning (AI/ML) (also, vague)? Can I use a database or does it require Hadoop/Spark? Does Snowflake qualify me? Will the real data scientists please stand up? And put one of those fingers on each hand up?

Data Science Where I Work 🐜🔬

At System2, our service is marketed as a ready-to-go data science team for firms. Marketing aside, I think we’re doing data science because:

  • We work with data that isn’t practical to put into Excel. A lot of data sets are just a little bit too big for Excel to handle. A few are ridiculous and span terabytes (transaction data, cell phone location).

  • We write code to make the data suitable for our purposes. We normalize (the database meaning of it, as well as statistical), build panels, fill gaps, find outliers, detect issues, etc.

  • We use math, mostly stats, and occasionally some AI/ML.

  • Since we’re already coding, we have the freedom to make any kind of visualization but are limited by libraries and effort.

  • Ultimately, we have some domain expertise and do all of the above to answer important and/or hard questions, monitor the work, and generate alerts.

  • From doing all this a lot, we’ve become familiar with the data landscape and very good at knowing the capabilities and shortfalls of each data set and how to handle its individual “quirks.”
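To make the cleanup bullet above a little more concrete, here is a minimal sketch of three of those steps (filling gaps, finding outliers, and normalizing in the statistical sense) using only the Python standard library. The series, the function names, and the 3.5 threshold are all made up for illustration; this is not System2's actual code, and real work would more likely lean on a library like pandas.

```python
from statistics import mean, median, stdev

# Hypothetical daily sales series with gaps (None) and one suspicious spike.
raw = [100.0, 102.0, None, 98.0, 101.0, 500.0, 99.0, None, 103.0]


def fill_gaps(values):
    """Forward-fill missing points with the last observed value."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if the series starts with a gap
        filled.append(v)
        last = v
    return filled


def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (based on median and MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing to flag
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def zscore(values):
    """Statistical normalization: rescale to mean 0 and sample standard deviation 1."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


filled = fill_gaps(raw)
flags = flag_outliers(filled)
clean = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
normalized = zscore(clean)

print("filled:", filled)
print("outlier positions:", [i for i, f in enumerate(flags) if f])
```

The median-based outlier check is a deliberate choice here: a single huge spike inflates the ordinary standard deviation enough that it can hide itself, while the median barely moves.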

For me, “data science” is about being flexible when working with data and code to achieve desired outcomes. Unfortunately (or fortunately for me now), the tools to work with data and code are usually a lot more complicated than Excel, and interpreting things correctly with the right math can also be hard.

I remember how Excel seemed like magic in the ‘90s and a lot of firms were hiring college grads to do stuff in Excel for them. I don’t recall if that wave had a catchy marketing term like “data science.” I vaguely recall a lot of talk of “digitizing the workplace” and “paperless office,” but I was more the grunt than the management. Maybe in a decade, using Python + Snowflake will be as ubiquitous as Excel, and doing “data science” will become “doing work.” Fortunately, for me and System2, I don’t really see a single tool as nice as Excel to do everything that we do.

How to Fail at Data Science 💸🔥

So what does it take to actually do data science? It kills me when I hear a firm say they’ve hired a data scientist straight out of graduate school and then check off the “doing data science” box. Do they really expect this single data scientist to:

  • Run and use a database? (Snowflake makes it easier.)

  • Run and use a compute platform like Spark? (Pay Databricks.)

  • Scrape data from the internet? (Contract out to Zyte.)

  • Create and continually run a data pipeline? (Use dbt cloud.)

  • Understand your business? (stock price <> revenue)

  • Take their data, code, and charts and translate them into English so they can play with scenarios?

  • Manage the cloud (AWS, GCP, Azure)?

  • Do time series forecasting? (Maybe they know Prophet.)

  • Do NLP sentiment analysis? (Scikit-learn.)

  • Work with data vendors and handle compliance? (No easy way out of this one.)

Most cases that I’ve seen for these folks are set up for failure. If they make it, it’s usually because they’re in a big enough firm where they can borrow and repurpose people in adjacent departments with adjacent roles. It’s too much for one person.

Each of these example expectations has gotten easier over time, with new services and products hitting the market all the time. Will the core competency for future data scientists just be figuring out which service and vendor to use instead of actually doing the work? Just skip the rhetorical questions and hire System2 or get your struggling resource(s) to hire us.

How System2 Does Data Science 💎🚀

To “do” data science, we’ve settled on roughly three roles:

  • Data Analyst - By understanding data and its implications, analysts use statistics and data visualization tools to identify patterns, trends, and correlations. Data analysts provide insights that can be used to drive business strategy, optimize outcomes, and improve decisions.

  • Data Scientist (the role, not the function) - Employs mathematics, statistics, computer science, and programming to analyze large datasets and uncover hidden patterns and relationships. By doing so, data scientists are able to answer complex questions and drive business decisions.

  • Data Engineer - Builds and maintains the infrastructure that stores and processes data. Data engineers specialize in extracting data from various sources to transform it into a format suitable for analysis.


I like to think of the roles as axes. At the origin are a lot of shared skills and overlap. Near the origin, everyone should be writing code, writing SQL, doing exploratory data analysis, scraping, and statistical analysis. At System2, everyone is expected to be comfortable with Python, SQL, notebooks, Plotly, basic stats, requests, and Streamlit. The further you go along an axis, the more specialized and complicated things get. For an analyst, it might be an understanding of finance. For an engineer, it might be making pipelines in Prefect. For a scientist, it might be using an expectation-maximization algorithm to join two different time series.

[Figure: hand-drawn charts labeled “FAIL” and “NOT FAIL”]

In the terrible, hand-drawn chart above, a fail is when there is little or no overlap. Having an analyst dependent on an engineer to write SQL or having the scientist work opaquely in R is slow (in handing off), expensive (minimum of 3 people), and fault-ridden (no one can check the work). To make things work, there needs to be overlap and common ground.

At System2, though, we push folks to stretch into other roles. A great data analyst should also aspire to be a limited engineer and an amateur scientist. If they can be great at two or more roles, even better. It reduces dependencies, creates more synergies, and often helps us discover insights. There are plenty of times when an analyst working on a scrape finds some metadata that solves their problem better. Had they “thrown it over the wall” to the engineer, the context might have been lost.

[Figure: hand-drawn charts labeled “GOOD” and “BETTER (AKA SYSTEM2)”]

The Story of Data Sausage 🌭

Internally, we joke we’re making data sausage for our clients. Data sausage is usually made with a bunch of different data. Some of it is great. Some of it is terrible. We mix it all up and write code to help apply stats until it makes sense. Then we package it up into apps and reports for our clients to consume. On a fun note, I once got our CEO, Matei Zatreanu, to say “data sausage” on CNBC.

Matei Zatreanu on CNBC’s “Closing Bell”

Shameless Plug for System2 📣🐕

System2 is a data science firm in NYC that primarily serves hedge funds. We have staff with experience from a variety of industries and can get you from zero to one hundred without the hassle. What do you need System2 for? Well…

  • Want to segment your customers by behavior and track how your site redesign or marketing emails impact predicted customer lifetime length and value (you’re calculating those, right)?

  • Need help parsing mountains of legal claims, cross-referencing them with government maps of disaster areas for possible fraud, estimating total claims, and identifying which firms are handling the most claims?

  • In the last earnings call, management said the new store format they’re rolling out is going to increase revenues. You might want to source transaction data and take a peek at same-store sales. While you’re at it, we might as well forecast revenues and help you better understand your customers.
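For the curious, the “peek at same-store sales” in the last example boils down to a small piece of arithmetic: compare revenue across periods using only the stores that were open in both. Below is a rough sketch in plain Python. The store names, years, and numbers are invented for illustration, and real transaction data needs a lot more cleanup before it gets this tidy.

```python
# Hypothetical transaction totals per store and year; store "C" opened in 2023.
sales = {
    "A": {2022: 1000.0, 2023: 1100.0},
    "B": {2022: 800.0, 2023: 760.0},
    "C": {2023: 500.0},
}


def total_growth(sales, prior, current):
    """Year-over-year growth across all stores, including newly opened ones."""
    prior_total = sum(by_year.get(prior, 0.0) for by_year in sales.values())
    current_total = sum(by_year.get(current, 0.0) for by_year in sales.values())
    return (current_total - prior_total) / prior_total


def same_store_growth(sales, prior, current):
    """Year-over-year growth using only stores with sales in both periods."""
    comparable = [
        store for store, by_year in sales.items()
        if prior in by_year and current in by_year
    ]
    if not comparable:
        raise ValueError("no stores have sales in both periods")
    prior_total = sum(sales[store][prior] for store in comparable)
    current_total = sum(sales[store][current] for store in comparable)
    return (current_total - prior_total) / prior_total


print(f"total growth:      {total_growth(sales, 2022, 2023):.1%}")
print(f"same-store growth: {same_store_growth(sales, 2022, 2023):.1%}")
```

In this toy data, total revenue growth looks great only because a new store opened. The same-store number strips that out, which is why it is the one to check against management's claim.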


Disclaimer: The author and a growing number of individuals have a financial interest in System2.
