The Paradox of Models: Embracing Imperfection for Practical Insight

Welcome to another edition of “In the Minds of Our Analysts.”

At System2, we foster a culture that encourages our team to explore ideas, investigate them, write them up, and share their perspectives on a range of topics. This series gives our analysts a space to showcase their insights.

All opinions expressed by System2 employees and their guests are solely their own and do not reflect the opinions of System2. This post is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of System2 may maintain positions in the securities discussed in this post.

Today’s post was written by Young Se Choi.


Introduction

Models are powerful tools that enable analysts to extract valuable insights from complex data. Building them also helps analysts uncover biases, identify patterns, and separate real growth from seasonality. However, the famous quote by statistician George Box, "All models are wrong, but some are useful," challenges us to confront the paradoxes inherent in modeling. In this post, we delve into the essence of this quote and explore its implications for data science.

“All models are wrong…

Models, by their very nature, are simplifications of reality. They are constructed to represent and understand complex phenomena. Yet, it is crucial to acknowledge that models can only partially capture the intricacies of the real world. They require assumptions, oversimplifications, and generalizations, leading to a gap between the model and reality. Box’s quote encapsulates this unavoidable discrepancy, reminding us that all models are fundamentally flawed.

In fact, applying a model to data means confronting the influence of subjective decisions at every step. The choice of variables, the selection of a specific model, and even the interpretation of assumptions or results are all matters of personal judgment, and every working analyst grapples with them. These decisions introduce subjectivity into the statistical process, making it clear that objectivity is an elusive ideal.

For example, linear regression rests on four assumptions, and judging whether they hold is often a subjective exercise:

  1. Linearity: The assumption of linearity states that the relationship between independent variables and the dependent variable is linear.

  2. Independence: The assumption of independence assumes that the observations in the dataset are independent of each other.

  3. Homoscedasticity: Homoscedasticity assumes that the variance of the residuals (the errors) is constant across all levels of the independent variables.

  4. Normality: The residuals of the model should follow a normal distribution.

Adhering to the assumptions of a model is important to ensure reliable inference and valid interpretation of model outputs.
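
In practice, an analyst might probe these assumptions with a few standard diagnostics. The sketch below is one hedged illustration of what that could look like, assuming Python with numpy, scipy, and statsmodels available (none of which this post specifies); the data is synthetic, and the particular tests (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk) are common choices rather than the only ones.

```python
# A minimal, illustrative sketch: fit a linear regression on synthetic data
# and run one common diagnostic for each assumption listed above.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)   # synthetic, well-behaved linear data

X = sm.add_constant(x)                       # adds the intercept column
fit = sm.OLS(y, X).fit()
resid = fit.resid

# 1. Linearity: usually judged by eye from a residuals-vs-fitted plot.
# 2. Independence: a Durbin-Watson statistic near 2 suggests no serial correlation.
print("Durbin-Watson:", durbin_watson(resid))

# 3. Homoscedasticity: Breusch-Pagan test; a small p-value hints at heteroscedasticity.
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", bp_pvalue)

# 4. Normality of residuals: Shapiro-Wilk test; a small p-value hints at non-normality.
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
```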

How does one judge homoscedasticity? Typically, it is assessed with a residual plot, in which the residuals (the differences between the observed and predicted values) are plotted against the predicted values or the independent variables. If the spread of the residuals appears roughly constant across all levels of the predictors, it indicates homoscedasticity (top image). If the spread narrows or widens systematically as the predicted values change, it suggests heteroscedasticity (lower image). The graphs below show textbook, cookie-cutter examples of homoscedasticity versus heteroscedasticity.

Homoscedasticity

Heteroscedasticity
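
The following is a minimal sketch of how plots like the two above might be produced, assuming Python with numpy, matplotlib, and statsmodels (none of which this post specifies); the two synthetic series, one with constant error variance and one whose variance grows with the predictor, are illustrative stand-ins rather than data from this post.

```python
# Illustrative residual plots: one homoscedastic series, one heteroscedastic.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
X = sm.add_constant(x)

y_homo = 1.0 + 0.8 * x + rng.normal(0, 1, 300)          # constant error variance
y_hetero = 1.0 + 0.8 * x + rng.normal(0, 0.3 * x, 300)  # variance widens with x

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
for ax, y, title in [(axes[0], y_homo, "Homoscedasticity"),
                     (axes[1], y_hetero, "Heteroscedasticity")]:
    fit = sm.OLS(y, X).fit()
    ax.scatter(fit.fittedvalues, fit.resid, s=10)       # residuals vs. fitted values
    ax.axhline(0, color="grey", linewidth=1)
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```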

But model diagnostic plots do not usually look as clear-cut as the ones above. How, then, would one judge the homoscedasticity of the plot below?

Some may argue that the residuals above are sufficiently constant across the fitted values, while others may argue the exact opposite: as the fitted values increase, most of the residuals dip downward and then curve back upward, almost mimicking a parabolic pattern. Assessing the validity of model assumptions is always a case-by-case judgment, and the answer can vary depending on how it is articulated. While it is essential to strive for adherence to the assumptions, violations are common in practice. In such cases, techniques such as robust regression, nonparametric models, or transformations of variables can mitigate the impact of violations and still yield valid results.
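
As a hedged sketch of two of those mitigations, the code below applies a log transform of the response and a robust (Huber) regression to synthetic data whose error spread grows with the mean; these are illustrative options under stated assumptions, not a prescription.

```python
# Two common mitigations for assumption violations, on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 300)
X = sm.add_constant(x)
y = (2.0 + 0.5 * x) * np.exp(rng.normal(0, 0.2, 300))  # multiplicative noise: spread grows with the mean

# Transformation of variables: a log transform often stabilizes a variance
# that grows with the mean, at the cost of modeling log(y) rather than y.
log_fit = sm.OLS(np.log(y), X).fit()
print(log_fit.summary())

# Robust regression: an M-estimator (Huber loss by default) down-weights
# extreme residuals, softening the influence of assumption violations.
robust_fit = sm.RLM(y, X).fit()
print(robust_fit.summary())
```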

…but some are useful.”

The challenge of statistical model building lies in striking a delicate balance between recognizing the limitations of models and harnessing their utility; producing reliable results is something of an art. Acknowledging a model's limitations through its diagnostics allows us to approach it with healthy skepticism rather than blind faith in its accuracy. Transparency and clear communication about model uncertainty are crucial before making informed decisions. By discussing limitations openly, the insights gained from models are appropriately contextualized and guide us toward stronger conclusions. In this way, data analysts develop a deeper understanding of the complexities beyond the model's reach.

A useful model might provide accurate predictions within a specific range, or capture the general trends and patterns in the data. For example, statistical modeling was applied during the COVID-19 pandemic to strengthen awareness and preparedness for future challenges. The World Health Organization (WHO) helped monitor the daily number of new infections, which is itself a complex task requiring statistical tools and methodologies. These efforts helped mitigate the crisis by evaluating resource availability, maintaining comprehensive records of infected and recovered patients for future reference, and identifying the segments of the population most affected by the virus.

Conclusion

Statistical modeling requires data analysts to think critically and justify their choices. And despite their inherent imperfections, models remain incredibly valuable for the practical insights they offer. They capture essential patterns, relationships, and trends, enabling us to make informed choices. While they may not reflect reality in its entirety, useful models deepen our understanding of systems and processes and open doors to new possibilities.
