Data Science and Optimization: Doing the Best with Noise

Author: Prof. Kumar Muthuraman
(The McCombs School of Business at The University of Texas at Austin)
Edited by: Dr. Abhinanda Sarkar
(Director of Academics, Great Learning)

1. Introduction

A common problem in data science is the estimation of unknown parameters from a sample of data. These parameters can reflect underlying conditions, such as the extent of the relationship between price and sales, or the association between two stocks in a financial portfolio. The estimation of these parameters from data usually relies on some sort of optimization, i.e., finding the value of the parameter that is closest to the data in some precisely defined sense.

The data can contain information that is irrelevant to the parameter estimation problem. We can think of this as noise, as opposed to the relevant information, which is the signal in the data. A common approach is to try and separate out the noise and remove it. The signal is then used to obtain an optimal estimate. A well-known example of this is the regression problem, where the parameter can be the slope of a straight line through data in two dimensions. The noise in this case is captured by how far the points are from the straight line, i.e. the size of the errors. Estimation, then, ‘minimizes the errors’. 

But it is quite possible that the noise is actually useful in the estimation problem. This is particularly true if there are many parameters and the signal is sparse in the information it contains for each parameter. In large data sets, the noise may represent the majority of the data and must be accounted for appropriately. This article discusses ways to do this for specific problems. The details are in the referenced papers [1] and [2]; the purpose here is to synthesize some lessons that a data scientist will find useful when doing optimal estimation.

2. Covariance Matrices

Consider a dataset of observations. There are $n$ observations, and each observation is a vector of $p$ components. For example, a portfolio of $p$ stocks $(p=5)$ can be observed for 30 days $(n=30)$. For each component, analysts often standardize the data, i.e., subtract the mean and divide by the standard deviation. For such a standardized version of the data, we can define a covariance matrix as follows:

$$V_n = \frac{1}{n} \sum_{i=1}^n X_i X_i^T$$

Here $X_i$ is the vector of the $p$ standardized components for observation $i$; in our example, the standardized prices of the $p$ stocks on day $i$. $V_n$ (written simply as $V$ below) is a $p \times p$ matrix whose $(i, j)^{th}$ element represents the covariance between the $i^{th}$ and $j^{th}$ components. It is a measure of how two components vary with respect to each other. A large positive covariance corresponds to, in our example, two stocks rising and falling together. A large negative covariance corresponds to them moving in opposite directions. In fact, the standardization of the data implies that the covariance is actually a correlation and lies between -1 and +1.
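
To make this concrete, here is a minimal Python sketch of the computation, using NumPy and randomly generated stand-in data (the synthetic returns and variable names are illustrative, not from the referenced papers):

```python
import numpy as np

# Hypothetical data: n = 30 daily observations of p = 5 stocks.
rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))  # stand-in for raw stock prices

# Standardize each component: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# V = (1/n) * sum_i X_i X_i^T, written compactly as a matrix product.
V = (Z.T @ Z) / n

# On standardized data V is the correlation matrix: its diagonal is 1
# and every off-diagonal entry lies between -1 and +1.
print(V.round(2))
```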

3. Eigenvalues, signal, and noise

The covariance matrix captures both the signal and the noise in the data. We are interested in the signal and, often, we want to remove the noise. One way to capture the signal is to find eigenvalues and eigenvectors. In mathematical terms, this means that we can write the covariance matrix as the following decomposition:

$$V = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T + … + \lambda_p e_p e_p^T$$

Here $\lambda_1 \geq \lambda_2 \geq … \geq \lambda_p$ are eigenvalues arranged in decreasing order. The vectors $e_1, e_2, …, e_p$ are corresponding eigenvectors. The notation $T$ indicates transposition from a column to a row, so the matrix orders match. In addition, $e_i^T e_i = 1$ and $e_i^T e_j = 0$ for $i \neq j$. Together the eigenvalues and the corresponding eigenvectors are called an eigendecomposition.

The eigenvalues capture the information in the matrix $V$ in the sense that they capture the variability in the data, through the covariance matrix. Because the eigenvalues are arranged in decreasing order, $(\lambda_1, e_1)$ captures the most variability and $(\lambda_p, e_p)$ captures the least. The additional condition, $e_i^T e_j = 0$ for $i \neq j$, ensures that the information in each eigenpair is distinct (technically, orthogonal) from that of the other eigenpairs.
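
Continuing the sketch above (with $V$ as computed there), the eigendecomposition and the stated properties can be checked directly; `np.linalg.eigh` is a standard routine for symmetric matrices:

```python
import numpy as np  # assumes V from the sketch above

# Eigendecomposition of the symmetric matrix V; eigh returns eigenvalues
# in ascending order, so reorder to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(V)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
E = eigvecs[:, order]  # column j holds the eigenvector e_{j+1}

# Verify V = sum_j lambda_j * e_j e_j^T.
V_rebuilt = sum(lam[j] * np.outer(E[:, j], E[:, j]) for j in range(len(lam)))
assert np.allclose(V, V_rebuilt)

# Verify orthonormality: e_i^T e_i = 1 and e_i^T e_j = 0 for i != j.
assert np.allclose(E.T @ E, np.eye(len(lam)))
```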

Simply put (see [1] and [2] for details), we can think of the leading eigenpairs as capturing the signal in the data and the trailing eigenpairs as contributing noise. For example, we can think of $e_1$ as representing the stock weights in the portfolio with the most variability, which reflects most clearly what is going on in the broader market. Conversely, $e_p$ represents the portfolio with the least variability, which captures the non-systematic randomness in the market. Clearly both aspects are important: the signal to capture risk and the noise to manage it as much as possible. There is also a more technical sense, which we will allude to later, in which the leading eigenpairs contribute to a more statistically stable estimate of the covariance matrix.

4. Optimization

The natural question to ask, then, is how to set up a procedure that best optimizes the mix between the eigenpairs. The referenced papers illustrate two ways of distinguishing between signal and noise in this eigenpair sense.

In data science, optimization can be used in several ways. In [1], optimization is used to find the best financial portfolio, with “best” defined as low risk with high returns (a “risk-return” curve). In [2], optimization is used to find the best fit to data, where the data include a response variable $(Y)$ that is a function of the components $(X)$ described above. In both cases, the problem reduces to finding the optimal fit over the eigenpairs of the covariance matrix.

For example, consider a version of the financial portfolio optimization problem [1]. Here a possible task is to find a good mix of stocks that maximizes returns while keeping risk, measured by variance, low:

  • The first step is to divide the eigenvalues of the covariance matrix into two sets, $\lambda_1 \geq \lambda_2 \geq … \geq \lambda_k$ and $\lambda_{k+1} \geq \lambda_{k+2} \geq … \geq \lambda_p$, where the first set of eigenvalues (larger values) corresponds to signal and the second set (smaller values) corresponds to noise.
  • We then need to create two $p$-dimensional weight vectors $w_s$ and $w_n$, each representing the $p$ weights making up a portfolio. Recall that $p$ is the number of stocks in the portfolio.
  • We now need to connect the two weight vectors to the corresponding $p$-dimensional eigenvectors. One way to do this is to force each weight vector to be orthogonal to the other set’s eigenvectors, in slightly cumbersome notation $w_s^T e_{k+1} = … = w_s^T e_p = 0$ (Condition 1) and $w_n^T e_1 = … = w_n^T e_k = 0$ (Condition 2), thus ensuring that the signal and noise weights fall within the corresponding signal and noise subspaces (i.e., each is a linear combination of the corresponding eigenvectors); a code sketch follows this list.
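
As a rough sketch of these constructions (continuing the NumPy example, with an arbitrary illustrative choice of $k$): any weight vector built as a linear combination of one set of eigenvectors is automatically orthogonal to the other set, which is exactly what Conditions 1 and 2 require.

```python
import numpy as np  # assumes E and p from the sketch above

# Split the eigenpairs into signal (first k) and noise (last p - k) sets.
k = 2                                   # illustrative choice; see Section 5
E_signal, E_noise = E[:, :k], E[:, k:]

# Any w_s built from the signal eigenvectors is automatically orthogonal
# to e_{k+1}, ..., e_p, i.e., it satisfies Condition 1.
w_s = E_signal @ np.ones(k)
assert np.allclose(E_noise.T @ w_s, 0.0)

# Symmetrically, any w_n built from the noise eigenvectors satisfies Condition 2.
w_n = E_noise @ np.ones(p - k)
assert np.allclose(E_signal.T @ w_n, 0.0)
```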

We are now ready to optimize. In matrix notation, with transpose denoted by $T$, we can find the optimal weights by solving:

$$w_s^* = \underset{w_s}{\arg\min}\; w_s^T V w_s \quad \text{subject to} \quad w_s^T u = 1, \;\; w_s \geq 0, \;\; \text{and Condition 1}$$

$$w_n^* = \underset{w_n}{\arg\min}\; w_n^T V w_n \quad \text{subject to} \quad w_n^T u = 1, \;\; w_n \geq 0, \;\; \text{and Condition 2}$$

Here $\arg\min$ is shorthand for ‘that which minimizes’, and $u$ is a vector of ones; the constraint $w^T u = 1$, together with the non-negativity condition, makes the weights add up to 1 while disallowing negative weights (in finance, sometimes called the “no short selling” condition).

Such optimization problems are called quadratic programming problems. In this case, the above optimization problems minimize the variance, our measure of risk, subject to the constraints.
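
Below is a minimal sketch of how one might solve these quadratic programs numerically, continuing the example above. It uses the general-purpose SLSQP method from SciPy purely for illustration; a dedicated quadratic-programming solver would be the more robust choice in practice, and feasibility of the non-negativity constraint within each subspace is assumed here.

```python
import numpy as np
from scipy.optimize import minimize  # general-purpose solver, for illustration

def min_variance(V, E_other):
    """Minimize w^T V w subject to w^T u = 1, w >= 0, and w being
    orthogonal to every column of E_other (the excluded eigenvectors)."""
    p = V.shape[0]
    constraints = [
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},  # w^T u = 1
        {"type": "eq", "fun": lambda w: E_other.T @ w},  # subspace condition
    ]
    w0 = np.full(p, 1.0 / p)  # equal-weight starting point
    result = minimize(lambda w: w @ V @ w, w0, method="SLSQP",
                      bounds=[(0.0, None)] * p, constraints=constraints)
    return result.x  # result.success should be checked in practice

w_s_star = min_variance(V, E_noise)   # signal portfolio (Condition 1)
w_n_star = min_variance(V, E_signal)  # noise portfolio (Condition 2)
```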

5. Putting things together

In the portfolio optimization problem, we have ended up with two optimal portfolios. They are optimal in different ways. The signal-optimizing portfolio explicitly minimizes risk. The noise-optimizing portfolio does something a bit more subtle: it achieves robustness, in the sense that the portfolio is stable to random, choppy fluctuations. Both of these characteristics are important, which suggests that they should both be criteria in selecting the best portfolio.

More generally, two optimal solutions can be combined. A straightforward way is to take a linear combination $w^* = \alpha w_s^* + (1-\alpha)w_n^*$ that is itself optimal. Such combinations, together with appropriate criteria for choosing them, are described in [1]. The choice of $\alpha$ can be considered together with the choice of $k$, the number of eigenpairs to treat as signal, a question we had left unanswered.
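
Continuing the sketch, the blend itself is a one-liner. Note that a convex combination of two feasible portfolios is itself feasible, since the weights still add up to 1 and remain non-negative; the value of $\alpha$ below is purely illustrative.

```python
import numpy as np  # assumes w_s_star and w_n_star from the sketch above

alpha = 0.7  # illustrative value; choosing alpha (and k) is discussed in [1]
w_star = alpha * w_s_star + (1 - alpha) * w_n_star

# A convex combination of two feasible portfolios remains feasible:
# the weights still sum to 1 and stay non-negative (up to solver tolerance).
print("sum of weights:", w_star.sum().round(6), "min weight:", w_star.min().round(6))
```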

[1] Long Zhao, Deepayan Chakrabarti, and Kumar Muthuraman, Portfolio Construction by Mitigating Error Amplification: The Bounded-Noise Portfolio

[2] Long Zhao, Deepayan Chakrabarti, and Kumar Muthuraman, Unified Classical and Robust Optimization for Least Squares

