denoise module

Module containing functions to remove external signal from geomagnetic data.

Part of the MagPySV package for geomagnetic data analysis. This module provides various functions to denoise geomagnetic data by performing principal component analysis and identifying and removing outliers. Also contains an outlier detection function based on median absolute deviation from the median (MAD).

magpysv.denoise.detect_outliers(*, dates, signal, obs_name, window_length, threshold, signal_type='SV', plot_fig=False, save_fig=False, write_path=None, fig_size=(8, 6), font_size=12, label_size=16)

Detect outliers in a time series and remove them.

Use the median absolute deviation from the median (MAD) to identify outliers. The time series are long and highly variable, so a single median value cannot represent the whole series. The function instead uses a running median to better characterise the series. The window length and the threshold (how many MADs from the median a point must lie before it is classed as an outlier) are user-specified.

Parameters:
  • dates (datetime.datetime) – dates of the time series measurements.
  • signal (array) – array (or column from a pandas.DataFrame) containing the time series of interest.
  • obs_name (str) – states the component of interest and the three-character IAGA observatory code.
  • window_length (int) – number of months over which to take the running median.
  • threshold (float) – the minimum number of median absolute deviations a point must be away from the median in order to be considered an outlier.
  • signal_type (str) – specify whether magnetic field (‘MF’) or secular variation (‘SV’) is plotted. Defaults to SV.
  • plot_fig (bool) – option to plot figure of the time series and identified outliers. Defaults to False.
  • save_fig (bool) – option to save figure if plotted. Defaults to False.
  • write_path (str) – output path for figure if saved.
  • fig_size (array) – figure size in inches. Defaults to 8 inches by 6 inches.
  • font_size (int) – font size for axes. Defaults to 12 pt.
  • label_size (int) – font size for axis labels. Defaults to 16 pt.
Returns:

the input signal with identified outliers removed (set to NaN).

Return type:

signal (array)
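
The running-median MAD test described above can be sketched in a few lines of NumPy. This is an illustration only, not the MagPySV implementation: the function name running_mad_outliers is hypothetical, the window is assumed to be centred on each sample and to shrink at the ends of the series, and window_length is counted in samples (which equals months only for monthly data).

```python
import numpy as np


def running_mad_outliers(signal, window_length, threshold):
    """Return a copy of signal with running-MAD outliers set to NaN.

    Hypothetical helper for illustration; not part of MagPySV.
    """
    values = np.asarray(signal, dtype=float)
    cleaned = values.copy()
    half = window_length // 2
    for i in range(values.size):
        # Centred window of up to 2 * half + 1 samples, truncated at the ends.
        window = values[max(0, i - half): i + half + 1]
        # Local median and local MAD, ignoring any missing values.
        median = np.nanmedian(window)
        mad = np.nanmedian(np.abs(window - median))
        # Flag the point if it lies more than threshold MADs from the median.
        if np.abs(values[i] - median) > threshold * mad:
            cleaned[i] = np.nan
    return cleaned
```

The statistics are always computed from the original values rather than the partially cleaned copy, so one outlier does not change the test applied to its neighbours.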

magpysv.denoise.eigenvalue_analysis(*, dates, obs_data, model_data, residuals, proxy_number=1)

Remove external signal from SV data using Principal Component Analysis.

Perform principal component analysis (PCA) on secular variation residuals (the difference between the observed SV and that predicted by a geomagnetic field model) calculated from annual differences of monthly means at several observatories. Uses masked arrays to discount missing data points and calculates the eigenvalues and eigenvectors of the (3n × 3n) covariance matrix for n observatories. The residuals are rotated into the eigendirections and denoised using the method detailed in Wardinski & Holme (2011, GJI, https://doi.org/10.1111/j.1365-246X.2011.04988.x). The SV residuals of the noisy component for all observatories combined are used as a proxy for the unmodelled external signal. The denoised data are then rotated back into geographic coordinates. The PCA algorithm outputs the eigenvalues sorted from largest to smallest, so the corresponding eigenvector matrix has the ‘noisy’ direction in the first column and the ‘clean’ direction in the final column.

This algorithm masks missing data so that they are not taken into account during the PCA. Missing values are not infilled or estimated, so NaN values in the input dataframe are given as NaN values in the output.

Smallest eigenvalue: ‘quiet’ direction

Largest eigenvalue: ‘noisy’ direction

Parameters:
  • dates (datetime.datetime) – dates of the time series measurements.
  • obs_data (pandas.DataFrame) – dataframe containing columns for monthly/annual means of the X, Y and Z components of the secular variation at the observatories of interest.
  • model_data (pandas.DataFrame) – dataframe containing columns for field model prediction of the X, Y and Z components of the secular variation at the same observatories as in obs_data.
  • residuals (pandas.DataFrame) – dataframe containing the SV residuals (difference between the observed data and model prediction).
  • proxy_number (int) – the number of ‘noisy’ directions used to create the proxy for the external signal removal. Default value is 1 (only the residual in the direction of the largest eigenvalue is used). Using n directions means that the proxy is the sum of the SV residuals in the n noisiest eigendirections.
Returns:

tuple containing:

  • denoised_sv (pandas.DataFrame):
    dataframe with datetime objects in the first column and columns for the denoised X, Y and Z SV components at each of the observatories for which data were provided.
  • proxy (array):
    the signal that was used as a proxy for unmodelled external magnetic field in the denoising stage.
  • eig_values (array):
    the eigenvalues of the obs_data matrix.
  • eig_vectors (array):
    the eigenvectors associated with the n largest eigenvalues of the data matrix. For example, if the residuals in the two ‘noisiest’ directions are used as the proxy for external signal, then these two eigenvectors are returned.
  • projected_residuals (array):
    SV residuals rotated into the eigendirections.
  • corrected_residuals (array):
    SV residuals after the denoising process.
  • covariance_matrix (array):
    residuals covariance matrix.

Return type:

(tuple)
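
The masked-array PCA steps described for eigenvalue_analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the MagPySV code: the function name pca_denoise_sketch is hypothetical, the input is a plain 2-D array of residuals (rows are epochs, columns are the X, Y and Z components at each observatory), missing entries are set to zero before projection, and the denoising step is simplified to discarding the noisiest components rather than reproducing the full Wardinski & Holme (2011) procedure.

```python
import numpy as np


def pca_denoise_sketch(residuals, proxy_number=1):
    """Illustrative PCA denoising of SV residuals; not part of MagPySV."""
    residuals = np.asarray(residuals, dtype=float)
    masked = np.ma.masked_invalid(residuals)

    # Covariance between columns; masked (missing) entries are skipped pairwise.
    covariance = np.ma.getdata(
        np.ma.cov(masked, rowvar=False, allow_masked=True))

    # eigh suits symmetric matrices; reorder so the largest eigenvalue is first.
    eig_values, eig_vectors = np.linalg.eigh(covariance)
    order = np.argsort(eig_values)[::-1]
    eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]

    # Rotate into the eigendirections. Missing entries are set to zero here
    # (an assumption of this sketch) so that they contribute nothing.
    projected = masked.filled(0.0) @ eig_vectors

    # Proxy: sum of the residuals in the proxy_number noisiest directions.
    proxy = projected[:, :proxy_number].sum(axis=1)

    # Simplified denoising: discard the noisy components, then rotate back.
    cleaned = projected.copy()
    cleaned[:, :proxy_number] = 0.0
    corrected = cleaned @ eig_vectors.T

    # Missing inputs stay missing in the output.
    corrected[np.isnan(residuals)] = np.nan
    return corrected, proxy, eig_values, eig_vectors
```

Because the eigenvalues are sorted in decreasing order, the first proxy_number columns of eig_vectors are the ‘noisy’ directions, matching the ordering described above.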

magpysv.denoise.eigenvalue_analysis_impute(*, dates, obs_data, model_data, residuals, proxy_number=1)

Remove external signal from SV data using Principal Component Analysis.

Perform principal component analysis (PCA) on secular variation residuals (the difference between the observed SV and that predicted by a geomagnetic field model) calculated from annual differences of monthly means at several observatories. Uses the imputer from sklearn.preprocessing to fill in missing data points and calculates the singular values of the data matrix for n observatories (uses Singular Value Decomposition, SVD). The residuals are rotated into the eigendirections and denoised using the method detailed in Wardinski & Holme (2011, GJI, https://doi.org/10.1111/j.1365-246X.2011.04988.x). The SV residuals of the noisy component for all observatories combined are used as a proxy for the unmodelled external signal. The denoised data are then rotated back into geographic coordinates. The PCA algorithm outputs the singular values (for mean-centred data, these are proportional to the square roots of the eigenvalues of the covariance matrix) sorted from largest to smallest, so the corresponding eigenvector matrix has the ‘noisy’ direction in the first column and the ‘clean’ direction in the final column.

Note that the SVD algorithm cannot be used if any data are missing, which is why imputation is needed with this method. The function denoise.eigenvalue_analysis permits missing values and does not infill them, which makes it the more robust of the two functions.

Smallest eigenvalue: ‘quiet’ direction

Largest eigenvalue: ‘noisy’ direction

Parameters:
  • dates (datetime.datetime) – dates of the time series measurements.
  • obs_data (pandas.DataFrame) – dataframe containing columns for monthly/annual means of the X, Y and Z components of the secular variation at the observatories of interest.
  • model_data (pandas.DataFrame) – dataframe containing columns for field model prediction of the X, Y and Z components of the secular variation at the same observatories as in obs_data.
  • residuals (pandas.DataFrame) – dataframe containing the SV residuals (difference between the observed data and model prediction).
  • proxy_number (int) – the number of ‘noisy’ directions used to create the proxy for the external signal removal. Default value is 1 (only the residual in the direction of the largest eigenvalue is used). Using n directions means that the proxy is the sum of the SV residuals in the n noisiest eigendirections.
Returns:

tuple containing:

  • denoised_sv (pandas.DataFrame):
    dataframe with dates in the first column and columns for the denoised X, Y and Z secular variation components at each of the observatories for which data were provided.
  • proxy (array):
    the signal that was used as a proxy for unmodelled external magnetic field in the denoising stage.
  • eig_values (array):
    the singular values of the obs_data matrix.
  • eig_vectors (array):
    the eigenvectors associated with the n largest singular values of the data matrix. For example, if the residuals in the two ‘noisiest’ directions are used as the proxy for external signal, then these two eigenvectors are returned.
  • projected_residuals (array):
    SV residuals rotated into the eigendirections.
  • corrected_residuals (array):
    SV residuals after the denoising process.

Return type:

(tuple)
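
The imputation and SVD route can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the MagPySV implementation: the function name svd_impute_sketch is hypothetical, missing values are filled with column means (standing in for the scikit-learn imputer named above), and the matrix is mean-centred before the decomposition.

```python
import numpy as np


def svd_impute_sketch(residuals):
    """Illustrative mean imputation followed by SVD; not part of MagPySV."""
    data = np.asarray(residuals, dtype=float)

    # Fill each missing entry with the mean of its column (assumed strategy).
    col_means = np.nanmean(data, axis=0)
    filled = np.where(np.isnan(data), col_means, data)

    # Centre the columns so the SVD relates directly to the covariance matrix.
    centred = filled - filled.mean(axis=0)

    # Singular values are returned in decreasing order; the rows of vt are
    # the eigendirections, so its transpose has the 'noisy' direction first.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    eig_vectors = vt.T

    # Rotate the residuals into the eigendirections.
    projected = centred @ eig_vectors
    return centred, singular_values, eig_vectors, projected
```

For a centred matrix with m rows, singular_values ** 2 / (m - 1) matches the eigenvalues of np.cov(centred, rowvar=False), which is the link between the singular values returned here and the eigenvalues returned by the masked-array approach.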