When it comes to sensor data, the most common method of aligning datasets is interpolation. The gaps between data points are rarely large, so there are few values to fill in, and knowledge of the trends helps fill them. However, linear interpolation becomes less precise as the gaps widen, in which case a polynomial or spline interpolant might be the better approach.
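As a minimal sketch of that idea in Python, the snippet below aligns two sensors sampled at different, irregular times onto a common grid, using linear interpolation for the densely sampled signal and a spline for the sparser one. The timestamps and readings are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical readings as (seconds since start, value) for two sensors
# sampled at different, irregular times.
temp_t = np.array([0.0, 3.0, 7.0, 10.0])
temp_v = np.array([20.1, 20.4, 21.0, 21.3])
pres_t = np.array([1.0, 4.0, 7.0, 9.0])
pres_v = np.array([101.2, 101.4, 101.7, 101.9])

# Common one-second grid to align both sensors onto.
grid = np.arange(1.0, 9.1, 1.0)

# Linear interpolation: adequate when the gaps between points are small.
temp_on_grid = np.interp(grid, temp_t, temp_v)

# Cubic spline: often tracks the trend better across wider gaps.
pres_on_grid = CubicSpline(pres_t, pres_v)(grid)

for t, temp, pres in zip(grid, temp_on_grid, pres_on_grid):
    print(f"t={t:4.1f} s  temp={temp:5.2f}  pressure={pres:6.2f}")
```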
This may sound like a challenge, but fortunately these tasks are common enough to be built into the APIs and modules of many data science platforms. Users can often integrate apps for exploring different resampling methods into those platforms, which helps with experimentation and decision making.
A few other data preparation steps should be considered before building models with sensor data. For example, it's common to apply additional smoothing and downsampling, and to explore the frequency domain. Once the datasets match, further analysis can be performed more easily.
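Here is a hedged sketch of those preparation steps with pandas and NumPy; the signal, the 50 ms smoothing window, and the 100 Hz target rate are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical 1 kHz sensor signal with additive noise.
idx = pd.date_range("2024-01-01", periods=10_000, freq="1ms")
signal = pd.Series(
    np.sin(np.linspace(0, 20, 10_000))
    + np.random.default_rng(0).normal(0, 0.1, 10_000),
    index=idx,
)

# Smooth with a 50 ms rolling mean, then downsample to 100 Hz.
smoothed = signal.rolling("50ms").mean()
downsampled = smoothed.resample("10ms").mean()

# Explore the frequency domain to sanity-check smoothing/downsampling choices.
spectrum = np.abs(np.fft.rfft(signal.to_numpy()))
freqs = np.fft.rfftfreq(len(signal), d=0.001)  # 1 ms sample spacing

print(downsampled.head())
print("Dominant frequency (Hz):", freqs[spectrum[1:].argmax() + 1])  # skip DC bin
```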
What's the next step?
HG: The next step is choosing deep learning and machine learning models for streaming high-frequency data, and then completing the application by integrating the data preparation and modeling stages into a streaming architecture.
So, what is “streaming”? If the first thought is streaming music or movies, that’s on the right track. The type of streaming discussed here also involves continuously incoming data, but instead of just listening or watching, the goal is to develop an AI model that performs actions based on the streamed information. The model must therefore be capable of continuously processing the data it receives and reporting the results.
Consider the example of multi-class fault detection using simulated data, with sensors used to predict equipment failure based on pressure, temperature, and current readings. The flow will look similar to this:
The raw sensor-generated data is fed to a messaging service for initial processing. The processed data is then analyzed by a model, which generates predictions. Those predictions are used to update the model, and the updated results are sent to a dashboard; the cycle then repeats.
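As a rough sketch of that cycle in Python, the loop below reads from a message queue, scores each window, updates the model, and reports the result. The `model` and `dashboard` objects and their methods are hypothetical placeholders, not a specific platform's API.

```python
import json
import queue

def stream_loop(messages: queue.Queue, model, dashboard):
    """Hypothetical consume -> predict -> update -> report cycle."""
    while True:
        raw = messages.get()                 # 1. read the next message
        if raw is None:                      #    (None as a shutdown signal)
            break
        window = json.loads(raw)             # 2. decode/prepare the sensor window
        prediction = model.predict(window)   # 3. score the window
        model.update(window, prediction)     # 4. incrementally update the model
        dashboard.publish(prediction)        # 5. push results to the dashboard
```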
The first step in implementing such a system is to plan it with a team. Before developing any models, an engineering team should establish parameters and requirements for the data. The full streaming prototype should then be built as early in the process as possible, so the team has time to update the algorithms. The engineers should also know exactly how they plan to manage buffering, out-of-order data, and other factors frequently associated with high-frequency data.
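One common way to handle out-of-order arrivals is a small reordering buffer that holds samples briefly and releases them sorted by timestamp. The sketch below is a generic illustration of that idea, not taken from any specific framework; the half-second delay is an arbitrary assumption.

```python
import heapq

class ReorderBuffer:
    """Holds late-arriving samples and releases them in timestamp order."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay  # how long to wait for stragglers (seconds)
        self._heap = []             # min-heap keyed by timestamp

    def push(self, timestamp: float, sample) -> None:
        heapq.heappush(self._heap, (timestamp, sample))

    def pop_ready(self, now: float):
        """Yield samples older than `now - max_delay`, in time order."""
        while self._heap and self._heap[0][0] <= now - self.max_delay:
            yield heapq.heappop(self._heap)

# Usage: samples arriving out of order come back sorted.
buf = ReorderBuffer(max_delay=0.5)
buf.push(2.0, "b"); buf.push(1.0, "a"); buf.push(3.0, "c")
print(list(buf.pop_ready(now=3.0)))  # [(1.0, 'a'), (2.0, 'b')]
```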
What else should an engineering team consider?
HG: Another important parameter for an engineering team to consider is the time window, which controls how much data enters the system for each processing step. In the earlier example, a one-second window was chosen because of its importance to the model updates and the underlying mathematical assumptions.
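A minimal sketch of one-second windowing, assuming timestamped samples arrive as (time, value) pairs; the one-second length mirrors the example above but is otherwise an arbitrary choice.

```python
def one_second_windows(samples):
    """Group (timestamp_seconds, value) pairs into consecutive 1 s windows."""
    window, window_end = [], None
    for t, value in samples:
        if window_end is None:
            window_end = int(t) + 1    # first window ends at the next whole second
        while t >= window_end:         # close every window the sample has passed
            yield window
            window, window_end = [], window_end + 1
        window.append(value)
    if window:
        yield window

# Usage: samples at 0.2, 0.9, 1.1, 1.8, and 2.3 s fall into three windows.
stream = [(0.2, 10), (0.9, 11), (1.1, 12), (1.8, 13), (2.3, 14)]
print(list(one_second_windows(stream)))  # [[10, 11], [12, 13], [14]]
```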
The focus so far has been on sensor data synchronization and alignment; the next step is preparing data to train a model. First, engineers need failure data, which, despite the name, can be obtained without repeatedly breaking equipment, thanks to modeling software. In the multi-class fault detection example, both experimental and simulated data were used to train the model.
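As a hedged sketch of that training step with scikit-learn, the snippet below pools measured and simulated examples into one training set for a multi-class classifier. The feature layout (pressure, temperature, current), the random-forest choice, and the synthetic arrays are all illustrative assumptions, not the setup from the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature rows: [pressure, temperature, current].
experimental_X = rng.normal([100, 60, 5], [5, 3, 0.5], size=(200, 3))
experimental_y = rng.integers(0, 3, size=200)   # fault classes 0..2
simulated_X = rng.normal([100, 60, 5], [5, 3, 0.5], size=(1000, 3))
simulated_y = rng.integers(0, 3, size=1000)     # simulated failures fill the gaps

# Pool the measured and simulated examples into one training set.
X = np.vstack([experimental_X, simulated_X])
y = np.concatenate([experimental_y, simulated_y])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Labels here are random placeholders, so accuracy lands near chance (~0.33);
# with real fault data this score is the quantity to optimize.
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```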