Dark Matter

Technical Introduction to the Core IP

Fundamental Grounding and Core IP

At Ensemble, we are proposing a new area of research, “Feature Enhancement,” to the field of Machine Learning. We believe this area will be fundamental to how machine learning is practiced, driving new advances in embeddings research and curating more machine-interpretable data for any ML model. Our IP constitutes a new ML algorithm that learns to generate new statistical properties approximating unobserved confounder variables, significantly reducing the complexity of non-linear relationships in training and inference data before that data is passed into a model. This can positively impact any statistical model, from a simple linear model to a transformer. The theoretical grounding, code, and experimental results have so far been demonstrated primarily on time series and tabular data, but the approach is extensible to image and text data and, more broadly, to other data types. Today, this IP is unpublished and held as a trade secret.
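Because the algorithm itself is unpublished, the following is only an illustrative toy sketch of the underlying statistical idea, not the proprietary method: a hidden confounder `h` drives the target, only noisy proxies of `h` are observed, and a latent variable pooled from those proxies correlates more strongly with the target than any raw column. The data-generating process and the pooling step are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Toy setup (hypothetical, not the proprietary algorithm): an
# unobserved confounder h drives the target, but only noisy
# proxies of h appear in the observed data.
h = rng.normal(size=n)                           # hidden confounder
X = h[:, None] + 0.8 * rng.normal(size=(n, 3))   # observed proxy columns
y = h + 0.1 * rng.normal(size=n)                 # target driven by h

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

# Any single raw column predicts y more weakly than a latent
# variable that pools the proxies to approximate h.
raw_best = max(abs_corr(X[:, j], y) for j in range(X.shape[1]))
h_hat = X.mean(axis=1)                           # crude latent estimate of h
latent = abs_corr(h_hat, y)
print(f"best raw column |corr| with y: {raw_best:.3f}")
print(f"latent estimate |corr| with y: {latent:.3f}")
```

Even this crude averaging step produces a variable closer to the unobserved confounder than anything in the raw data, which is the statistical behavior the paragraph above describes.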

Current Solutions tackling “data quality” in Machine Learning

Within Machine Learning, the primary strategies for addressing “data quality” issues fall into three main solution sets: data augmentation, synthetic data, and pre-defined, domain-specific feature engineering techniques. Data augmentation typically employs a heuristic-guided approach, generating new data that mirrors the statistical properties of the original dataset, and tends to yield only marginal, incremental improvements. Synthetic data, on the other hand, involves a learned approach aimed at producing data whose statistical properties are identical to those of the original dataset. The efficacy of this method is constrained by the quality of the original data, however, so models trained on synthetic data perform similarly to, or in some cases worse than, models trained on the original data. Domain-specific feature engineering techniques, while valuable in specific, isolated settings, are limited in their generalizability across different environments and problem scenarios.

Introduction to Feature Enhancement & Technical Differentiation

Feature Enhancement represents a significant shift in addressing data quality issues at a statistical level, diverging from traditional synthetic data methodologies. Unlike existing methods, this approach enhances data quality beyond its current limits by learning to generate data with completely new statistical properties. It does so by approximating unobserved confounder variables (critical variables absent from the original data but influential in predictions), creating latent variables optimized in relation to the target variable. Rather than simply replicating existing distributional behavior, it generates meaningful, novel distributional behavior, enhancing the data for a given prediction task. This advancement is achieved with a new machine learning algorithm whose objective function is not present in the current literature. Feature Enhancement entails deriving an optimized embedding within a constrained search space, producing variables that approximate unobserved confounders. The resulting embeddings contain independent variables that exhibit a stronger correlation with the target variable than the original input data, a statistical property that improves the performance of any statistical model. Moreover, it greatly simplifies the complex, often hard-to-model non-linear relationships between input data and target variables before that data is ever ingested by a statistical model. In many ways, this approach “inverts” the complexity of the ML pipeline, injecting complexity normally contained within a trained model into a transformed dataset in advance of training.
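To make the embedding idea concrete without disclosing or guessing the proprietary objective function, the sketch below uses ordinary random Fourier features scored against the target as a hypothetical stand-in for the enhancement step: candidate nonlinear features are generated from the inputs, and the ones most correlated with the target are kept. The result is a set of variables more linearly predictive of the target than the raw columns, which is the statistical behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2_000, 2, 200

# Toy data: y depends nonlinearly on X, so each raw column is
# nearly uncorrelated with the target.
X = rng.normal(size=(n, d))
y = np.sin(2 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

# Stand-in "enhancement" step: project the inputs through random
# nonlinear features and keep the candidates most correlated with
# the target. (Random Fourier features are purely illustrative,
# not the proprietary objective.)
W = rng.normal(size=(d, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
Z = np.cos(X @ W + b)                        # candidate latent features
scores = np.array([abs_corr(Z[:, j], y) for j in range(m)])
top = np.argsort(scores)[-5:]                # top-5 "enhanced" features

raw_best = max(abs_corr(X[:, j], y) for j in range(d))
enhanced_best = scores[top].max()
print(f"best raw |corr| with y:      {raw_best:.3f}")
print(f"best enhanced |corr| with y: {enhanced_best:.3f}")
```

The selected features linearize part of the non-linear relationship between inputs and target, so even a simple linear model trained on them outperforms one trained on the raw columns.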

Impact on the ML pipeline

The performance of any statistical model in machine learning is inherently capped by the quality of its training data relative to a given target variable. This upper limit constrains a model's predictive capability regardless of the sophistication of the modeling techniques employed. However, by giving models access to richer data representations directly linked to their target variable(s), derived from any given dataset without the need for external data, the performance ceiling for all statistical models predicting a specific target can be effectively raised. This improvement in representation quality enables models to surpass their traditional performance thresholds, unlocking new levels of predictive accuracy, efficiency, and model lifespan.

Ensemble Product Extensions & Roadmap

Because the core IP of Feature Enhancement yields an understanding of the optimal statistical properties for predicting a given target variable(s), Ensemble will be able to automatically suggest the optimal model for the task in question. Beyond improving data quality, the “new encoder” approach Feature Enhancement uses to generate optimized embeddings allows completely novel modeling approaches to be developed that would be unique to Ensemble. Together with automatic model suggestion, this gives Ensemble a foundational product and piece of research for improving the performance capabilities of existing ML technology in research and industry.
