Data Drift
Data drift is a significant change in the distribution of incoming data compared to the data used to train the model.
Data drift can be caused by the evolution of business processes or by industry events that create discontinuities in the underlying phenomena. It does not affect the formatting of the data; it affects what the data fundamentally represent.
Why is Data Drift important?
Data drift degrades a model’s performance on new data. When the data seen in production differ significantly from the data used to train the model, predictions become less accurate.
The model does not know how to correctly predict on these new data, since its training did not include comparable observations.
How does the Fairly platform evaluate Data Drift?
Data drift can be evaluated in several ways.
One approach is to apply statistical tests to individual columns: we can compute certain statistical quantities (median, mode, quantiles) to verify whether individual features have shifted materially.
For instance, if the average age of the user demographic the model is targeting has changed drastically since the training data was recorded, the model’s performance may be impacted.
For a categorical column, a percentage change is considered significant when it exceeds a threshold of 30% by default.
To determine whether the change in the average of a continuous column is significant, Fairly’s custom agent performs a t-test at a significance level of 0.05; that is, the change is flagged when the test’s p-value falls below 0.05.
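The two checks above can be sketched in Python as follows. This is a minimal illustration, not Fairly’s implementation: the function names are hypothetical, "percentage change" is assumed to mean the largest shift in any category’s share of the rows, and Welch’s variant of the t-test is assumed because the text only says "T-test".

```python
import pandas as pd
from scipy import stats


def categorical_drift(reference: pd.Series, current: pd.Series,
                      threshold: float = 0.30) -> bool:
    """Flag drift when any category's share of rows moves by more than `threshold`.

    Assumption: "percentage change" is read as the absolute difference in
    category share between the reference and current columns.
    """
    ref_share = reference.value_counts(normalize=True)
    cur_share = current.value_counts(normalize=True)
    # Align on the union of categories; a category absent on one side has share 0.
    ref_share, cur_share = ref_share.align(cur_share, fill_value=0.0)
    return bool((cur_share - ref_share).abs().max() > threshold)


def mean_drift(reference, current, alpha: float = 0.05) -> bool:
    """Flag drift when a two-sample t-test rejects equal means at level `alpha`."""
    # equal_var=False selects Welch's t-test (an assumption, see the lead-in).
    result = stats.ttest_ind(reference, current, equal_var=False)
    return bool(result.pvalue < alpha)
```

Each function returns True when the column would be flagged as drifted, so the results for all columns can be collected into a simple per-column report.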
Length: Fairly’s Data Drift also checks the length of each column in the reference and current datasets, to ensure that it has not changed by more than 25%.
Variance: To determine whether a change in variance between the reference and current datasets is significant, Fairly’s custom agent performs an F-test at a significance level of 0.05.
KS: Fairly performs a Kolmogorov-Smirnov (KS) test at a significance level of 0.05 to determine whether any continuous column has a statistically significant change in its distribution.
Median: Fairly checks the median of each continuous column to ensure that it hasn’t changed by more than 25%.
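As a rough sketch of how these four checks might look in Python, under stated assumptions: the function names are hypothetical, "length" is taken to be the number of entries in a column, the 25% limits are read as relative change against the reference value, and the F-test is computed two-sided from the ratio of sample variances. None of this is confirmed as Fairly’s actual implementation.

```python
import numpy as np
from scipy import stats


def relative_change(reference_value: float, current_value: float) -> float:
    """Absolute change of the current value relative to the reference value."""
    if reference_value == 0:
        return 0.0 if current_value == 0 else float("inf")
    return abs(current_value - reference_value) / abs(reference_value)


def length_drift(reference, current, threshold: float = 0.25) -> bool:
    """Flag drift when the column length changes by more than `threshold`."""
    return relative_change(len(reference), len(current)) > threshold


def median_drift(reference, current, threshold: float = 0.25) -> bool:
    """Flag drift when the median changes by more than `threshold`."""
    ref_median = float(np.median(reference))
    cur_median = float(np.median(current))
    return relative_change(ref_median, cur_median) > threshold


def variance_drift(reference, current, alpha: float = 0.05) -> bool:
    """Flag drift when a two-sided F-test rejects equal variances at level `alpha`."""
    ref = np.asarray(reference, dtype=float)
    cur = np.asarray(current, dtype=float)
    # F statistic: ratio of the unbiased sample variances.
    f_stat = np.var(ref, ddof=1) / np.var(cur, ddof=1)
    lower_tail = stats.f.cdf(f_stat, ref.size - 1, cur.size - 1)
    # Two-sided p-value: double the smaller of the two tail probabilities.
    p_value = 2 * min(lower_tail, 1 - lower_tail)
    return bool(p_value < alpha)


def distribution_drift(reference, current, alpha: float = 0.05) -> bool:
    """Flag drift when the two-sample KS test rejects equal distributions."""
    return bool(stats.ks_2samp(reference, current).pvalue < alpha)
```

Note that the F-test assumes roughly normal data, so on heavily skewed columns a flagged variance change is worth confirming with the KS result.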