Current statistical learning for EO data analysis is limited
Despite many successful results and developments, there are still strong limitations to the general adoption of machine learning algorithms for predicting and understanding EO data. Machine learning and signal processing have advanced enormously in the last decade, at both the theoretical and applied levels, but have not yet advanced the field of EO data analysis to its full potential.
The current statistical treatment of biophysical parameters is strongly limited by the quantity and quality of EO data, and by the misuse of standard off-the-shelf methods, which in general are not well adapted to the particular characteristics of EO data. Specifically, current regression models used for EO applications remain deficient because they rely on a limited amount of meteorological and remote sensing data, do not account for the particular data characteristics, and often make strong assumptions of linearity, homoscedasticity or Gaussianity. These limitations translate into a risk of overfitting and unreasonably large uncertainties in the predictions, suggesting a lack of explanatory variables and deficiencies in model specification. Graphical models have seldom been used in EO data analysis. The few existing works are restricted to local studies, use limited amounts of data and explanatory variables, consider remote sensing input features only, apply standard structure learning algorithms driven by univariate (often unconditioned) dependence estimates, and do not extract causal relations or identify new drivers of the problem.
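To illustrate why unconditioned dependence estimates can mislead structure learning, the following minimal sketch compares a marginal correlation with a linear partial correlation. The data are synthetic, and the variable names and the helper `partial_corr` are hypothetical, chosen for illustration only. Linear partial correlation is just one simple conditional dependence measure; it does not capture nonlinear dependence.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones_like(z), z])            # design matrix: intercept + z
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]    # residual of x given z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # residual of y given z
    return float(np.corrcoef(rx, ry)[0, 1])

# Synthetic example (assumption): z is a common driver of x and y,
# and there is no direct link between x and y.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

marginal = float(np.corrcoef(x, y)[0, 1])   # unconditioned dependence estimate
conditional = partial_corr(x, y, z)         # dependence after conditioning on z

print(f"marginal correlation:    {marginal:.3f}")
print(f"partial correlation | z: {conditional:.3f}")
```

In this toy setting the marginal estimate suggests a strong link between `x` and `y`, while conditioning on the common driver `z` removes almost all of it.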
We argue that machine learning algorithms for EO applications need to be guided both by data and by prior physical knowledge. This combination restricts the family of possible solutions and thus yields flexible nonparametric models that respect the physical rules governing the Earth's climate system. We are equally concerned about the ‘black-box’ criticism of statistical learning algorithms, and therefore aim to design self-explanatory models and take a leap towards causal inference from empirical EO data.
The main goal of the SEDAL project is to develop new machine learning models for the efficient treatment of biophysical land parameters and related covariates at local and global scales. This main scientific goal translates into the following objectives:
- Improve prediction models by adaptation to Earth Observation data characteristics. We will rely on the framework of kernel learning, which has emerged in the last decade as the most appropriate framework for remote sensing data analysis. The new retrieval models will be adapted to the particular signal characteristics, such as unevenly sampled time series and missing data, non-Gaussianity, the presence of heteroscedastic and non-stationary processes, and non-i.i.d. (spatial and temporal) relations. Models based on kernels and Gaussian processes (GPs) will allow us to advance uncertainty quantification using predictive variances under biophysical constraints. Advances in sparse, reduced-rank and divide-and-conquer schemes will address the computational cost. The proposed kernel framework aims to improve results in terms of accuracy, reduced uncertainty, consistency of the estimations and computational efficiency.
- Discover knowledge and causal relations in Earth Observation data. We will investigate graphical causal models and regression-based causal schemes applied to large heterogeneous EO data streams. This will require improved measures of (conditional) independence, designing experiments in controlled situations and using high-quality data. Learning the hierarchy of the relations between variables and related covariates, as well as their causal relations, may in turn allow the discovery of hidden essential variables, drivers and confounders. Moving from correlation to dependence and then to causation is fundamental to advancing the field of Earth Observation and the science of climate change.
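As a minimal illustration of the first objective, the sketch below fits a standard GP regression model to an unevenly sampled synthetic time series and returns a predictive variance alongside the mean. It is not the project's model: the RBF kernel, the fixed hyperparameters, the homoscedastic noise term and the function names `rbf_kernel` and `gp_predict` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel between two 1-D arrays of inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(t_train, y_train, t_test, noise_var=0.01, lengthscale=1.0, variance=1.0):
    """Posterior mean and variance of the latent function at t_test."""
    K = rbf_kernel(t_train, t_train, lengthscale, variance)
    K += noise_var * np.eye(len(t_train))             # homoscedastic noise (assumption)
    Ks = rbf_kernel(t_train, t_test, lengthscale, variance)
    L = np.linalg.cholesky(K)                         # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = variance - np.sum(v**2, axis=0)             # prior variance minus explained part
    return mean, var

# Unevenly sampled synthetic series with a data gap between t = 3 and t = 8.
rng = np.random.default_rng(1)
t_train = np.array([0.0, 0.3, 0.9, 1.4, 2.1, 2.6, 3.0, 8.0, 8.3, 8.9, 9.6, 10.0])
y_train = np.sin(t_train) + 0.1 * rng.normal(size=t_train.size)

t_test = np.array([2.0, 5.5])   # one point among the samples, one inside the gap
mean, var = gp_predict(t_train, y_train, t_test)

for t, m, s2 in zip(t_test, mean, var):
    print(f"t = {t:.1f}: mean = {m:.3f}, variance = {s2:.3f}")
```

The predictive variance is small where observations are dense and grows inside the gap, which is the behaviour that makes GP predictive variances useful for uncertainty quantification. Handling heteroscedastic noise or non-stationarity would require extending this basic model.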