Use Parallel Computing to Optimize Big Data Set for Analysis

Since R2024a

This example uses:

This example shows how to optimize data preprocessing for analysis using parallel computing.

By optimizing the organization, storage of time series data, you can simplify and accelerate any downstream applications like predictive maintenance, digital twins, signal-based AI, and fleet analytics.

In this example, you transform large raw data into a state ready for future analysis and save it to Parquet files using parallel workers. Parquet files offer efficient access to data because they efficiently store column-oriented heterogeneous data, allowing you to conditionally filter the files by row and only load the data you need. Next, you can use the out-of-memory data to train a simple AI model.

If you have data stored on a cluster, you can use this code for similar data optimization without downloading the data. To see an example that performs analysis on data stored in the cloud, see Process Big Data in the Cloud.

Start a parallel pool of process workers.

pool = parpool("Processes");

Starting parallel pool (parpool) using the 'Processes' profile ...
15-Jan-2024 12:03:49: Job Running. Waiting for parallel pool workers to connect ...
Connected to parallel pool with 6 workers.

Download Flight Data

This example uses sample aircraft sensor data provided by NASA [1].

If you want to try these data preprocessing techniques yourself, you must download the aircraft sensor data. NASA provides data for approximately 180,000 flights, with one MAT file representing each flight. For more information, see Sample Flight Data.

This code creates a folder in your current folder and downloads data for the first year of aircraft tail number 652, which occupies approximately 1.6 GB of disk space. Downloading the data can take several minutes. To confirm that you want to download the data, select "true" from the drop-down list before you run the example.

downloadIfTrue = false;
if downloadIfTrue
    downloadNASAFlightData(pwd,"small");
    dataRoot = fullfile(pwd,"data");
else
    disp("Confirm and download flight data to proceed.");
    return
end

Organizing MAT files into folders...
MAT files organized into folders.

The downloadNASAFlightData function downloads and organizes the files for tail 652 into subfolders for each month.

Convert Data to Nested Tables

Examine a sample of the flight data. Each MAT file comprises of 186 structure arrays, with each structure array representing a sensor. Each structure array stores the metadata associated with the sensor, along with the sensor readings in a nested array. Additionally, the file name contains important metadata such as the flight ID, tail number, and start time.

sampleData = matfile(fullfile(dataRoot,"mat","Tail_652\200101\652200101092009.mat"));
sampleData.ABRK

ans = struct with fields:
           data: [1972×1 double]
           Rate: 1
          Units: 'DEG'
    Description: 'AIRBRAKE POSITION'
          Alpha: 'ABRK'

It is not efficient to store the data for each sensor as a separate structure variable or to embed the metadata within the filename. Instead, you can organize the data into a nested schema. This approach enables you to easily search the metadata and reduce the number of rows in the table by nesting the sensor values. Use the struct2table function to organize the sample structure array.

struct2table(sampleData.ABRK,AsArray=true)

ans=1×5 table
         data          Rate     Units          Description          Alpha  
    _______________    ____    _______    _____________________    ________

    {1972×1 double}     1      {'DEG'}    {'AIRBRAKE POSITION'}    {'ABRK'}

The returnNestedTable helper function applies the struct2table function to each sensor data in the sample MAT file and vertically concatenates the results.

head(returnNestedTable(fullfile(dataRoot,"mat","Tail_652\200101\652200101092009.mat")))

           StartTime           TailNumber       FlightId        Rate     Alpha              Description                Units            data      
    _______________________    __________    _______________    ____    ________    ____________________________    ___________    _______________

    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'1107'}    {'SYNC WORD FOR SUBFRAME 1'}    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'2670'}    {'SYNC WORD FOR SUBFRAME 2'}    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'5107'}    {'SYNC WORD FOR SUBFRAME 3'}    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'6670'}    {'SYNC WORD FOR SUBFRAME 4'}    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'A/T' }    {'THRUST AUTOMATIC ON'     }    {0×0 char }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'ABRK'}    {'AIRBRAKE POSITION'       }    {'DEG'    }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'ACID'}    {'AIRCRAFT NUMBER'         }    {0×0 char }    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'ACMT'}    {'ACMS TIMING USED T1HZ'   }    {0×0 char }    {1972×1 double}

Create File Datastore

A datastore is a repository for collections of data that are too large to fit in memory. You can read and process data stored in multiple files as a single entity. To learn more, see Getting Started with Datastore.

Create a FileDatastore object with the data files from the first year of tail 652. You must use the returnNestedTable custom read function to read the data in the MAT files.

dsFlight = fileDatastore(fullfile(dataRoot,"mat","Tail_652"), ...
    ReadFcn=@returnNestedTable,IncludeSubfolders=true, ...
    FileExtensions=".mat",UniformRead=true);

Preview the datastore. The table output is the same as the table output when you call the returnNestedTable read function without the datastore.

preview(dsFlight)

ans=186×8 table
           StartTime           TailNumber       FlightId        Rate      Alpha               Description                Units            data      
    _______________________    __________    _______________    ____    _________    _____________________________    ___________    _______________

    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'1107' }    {'SYNC WORD FOR SUBFRAME 1' }    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'2670' }    {'SYNC WORD FOR SUBFRAME 2' }    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'5107' }    {'SYNC WORD FOR SUBFRAME 3' }    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'6670' }    {'SYNC WORD FOR SUBFRAME 4' }    {'<units>'}    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'A/T'  }    {'THRUST AUTOMATIC ON'      }    {0×0 char }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'ABRK' }    {'AIRBRAKE POSITION'        }    {'DEG'    }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009    0.25    {'ACID' }    {'AIRCRAFT NUMBER'          }    {0×0 char }    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'ACMT' }    {'ACMS TIMING USED T1HZ'    }    {0×0 char }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'AIL.1'}    {'AILERON POSITION LH'      }    {'DEG'    }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'AIL.2'}    {'AILERON POSITION RH'      }    {'DEG'    }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       4    {'ALT'  }    {'PRESSURE ALTITUDE LSP'    }    {'FEET'   }    {7888×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       4    {'ALTR' }    {'ALTITUDE RATE'            }    {'FT/MIN' }    {7888×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       1    {'ALTS' }    {'SELECTED ALTITUDE LSP'    }    {'FEET'   }    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       4    {'AOA1' }    {'ANGLE OF ATTACK 1'        }    {'DEG'    }    {7888×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       4    {'AOA2' }    {'ANGLE OF ATTACK 2'        }    {'DEG'    }    {7888×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       4    {'AOAC' }    {'CORRECTED ANGLE OF ATTACK'}    {'DEG'    }    {7888×1 double}
      ⋮

Clean Data

Next, prepare the data for future analysis by cleaning it. Use the transform function to perform some table manipulation and change the data types of table variables. The datastore defers performing the transformation until you read or write from it.

Rename the Alpha and data variables to SensorName and Data.

tdsFlight1 = transform(dsFlight,@(t) renamevars(t,["Rate","Alpha","data"], ...
    ["SampleRate","SensorName","Data"]));

Convert all variables that are cell arrays of character vectors into string arrays. To categorize the data later, convert the Units variable to a categorical array, and SampleRate variable into a single array. Preview a sample of the results of the transformed datastore.

tdsFlight2 = transform(tdsFlight1,@(t) convertvars(t,vartype("cellstr"),"string"));
tdsFlight3 = transform(tdsFlight2,@(t) convertvars(t,"Units","categorical"));
tdsFlight4 = transform(tdsFlight3,@(t) convertvars(t,"SampleRate","single"));
preview(tdsFlight4)

ans=8×8 table
           StartTime           TailNumber       FlightId        SampleRate    SensorName           Description               Units            Data      
    _______________________    __________    _______________    __________    __________    __________________________    ___________    _______________

    2001-01-09 20:09:00.000       652        652200101092009       0.25         "1107"      "SYNC WORD FOR SUBFRAME 1"    <units>        { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "2670"      "SYNC WORD FOR SUBFRAME 2"    <units>        { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "5107"      "SYNC WORD FOR SUBFRAME 3"    <units>        { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "6670"      "SYNC WORD FOR SUBFRAME 4"    <units>        { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "A/T"       "THRUST AUTOMATIC ON"         <undefined>    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "ABRK"      "AIRBRAKE POSITION"           DEG            {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "ACID"      "AIRCRAFT NUMBER"             <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "ACMT"      "ACMS TIMING USED T1HZ"       <undefined>    {1972×1 double}

Standardize the missing values for the units variable. MATLAB uses the <undefined> token to mark missing data in categorical values, but some rows of the Unit variable also show <units> or UNITS, which you can treat as missing in this data set. Use a transformation function to standardize this so every missing value in the units variable uses a uniform missing token. Preview a sample of the results from the transformed datastore.

tdsFlight5 = transform(tdsFlight4,@(t) standardizeMissing(t,["<units>","UNITS"], ...
    DataVariables="Units"));
preview(tdsFlight5)

ans=8×8 table
           StartTime           TailNumber       FlightId        SampleRate    SensorName           Description               Units            Data      
    _______________________    __________    _______________    __________    __________    __________________________    ___________    _______________

    2001-01-09 20:09:00.000       652        652200101092009       0.25         "1107"      "SYNC WORD FOR SUBFRAME 1"    <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "2670"      "SYNC WORD FOR SUBFRAME 2"    <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "5107"      "SYNC WORD FOR SUBFRAME 3"    <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "6670"      "SYNC WORD FOR SUBFRAME 4"    <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "A/T"       "THRUST AUTOMATIC ON"         <undefined>    {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "ABRK"      "AIRBRAKE POSITION"           DEG            {1972×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "ACID"      "AIRCRAFT NUMBER"             <undefined>    { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009          1         "ACMT"      "ACMS TIMING USED T1HZ"       <undefined>    {1972×1 double}

Write to Parquet Files in Parallel

Now that the data is optimized and ready for future analysis, save the data in the final transformed datastore as Parquet files using the writeall function. The Parquet file format supports the efficient compression, encoding and extraction of column-oriented heterogeneous data. When you set UseParallel to true, the writeall function automatically uses the workers of the open parallel pool to apply the transformations functions and write the contents of the transformed datastore to files.

This code creates one Parquet file for each MAT file in the datastore and saves the Parquet files in the parquet_sample folder, preserving the folder structure of the original MAT files. This process writes 2.6 GB of data to disk. To confirm that you want to save the data, select "true" from the drop-down list before you run the example.

saveIfTrue = false;
if saveIfTrue
    outdir = fullfile(dataRoot,"parquet_sample");
    if isfolder(outdir)
        rmdir(outdir,"s");
    end
    writeall(tdsFlight5,outdir,FolderLayout="duplicate", ...
        OutputFormat="parquet",UseParallel=true)
    disp("Parquet files saved to the parquet_sample folder.")
else
    disp("Confirm and save modified flight data to proceed.")
end

Parquet files saved to the parquet_sample folder.

The writeall function saves the Parquet files into subfolders for each month.

Extract Data to Train Neural Network

Deep learning leverages neural networks to extract useful patterns and features directly from data. To learn more, see Deep Learning in MATLAB (Deep Learning Toolbox).

Now that the data is clean and organized, you can use a subset of the sensor data to train a neural network. You can predict a flight's future true airspeed at cruising altitude, using time series sensor data such as altitude pressure, fuel flow, fan speed, current true airspeed, wind direction, and wind speed. To train a neural network to predict the true airspeed, you can use a long short-term memory (LSTM) network. LSTM networks excel in sequence-to-sequence predictions, which makes them ideal for time series data. They recognize patterns in the sequence of data and predict future values at each time step.

For this example, use each flight's sensor time series as a sequence of training data. To predict the future true airspeed, train the network with targets that are the true airspeed values offset by one time step into the future. This approach means that at every time step of the input sequence, the LSTM predicts the next time step's true airspeed value.

Extract Data

Start by extracting the flight data from January through to April 2001 as well as June 2001. The data from May 2001 is not available.

Create a parquetDatastore and use the SelectedVariableNames property to import only the StartTime, SampleRate, SensorName and Data variables. Each file contains all the data for a single flight, so set the ReadSize property to file to read and process one file at time. The ReadSize property also determines how the software partitions the data before sending it to each parallel worker.

pdsFlight = parquetDatastore(fullfile(dataRoot,"parquet_sample","Tail_652"), ...
    IncludeSubfolders=true,ReadSize="file", ...
    SelectedVariableNames=["StartTime","SampleRate","SensorName","Data"]);

To import data that corresponds to January through April 2001 as well as June 2001, create a row filter using the ParquetDatastore object. Then, use the row filter to select rows with StartTime values greater than or equal to January 2001 and StartTime values less than July 2001. Preview the filtered datastore.

rf = rowfilter(pdsFlight);
pdsFlight.RowFilter = rf.StartTime >= datetime("2001-01-01")...
    & rf.StartTime < datetime("2001-07-01");
preview(pdsFlight)

ans=8×4 table
         StartTime          SampleRate    SensorName         Data      
    ____________________    __________    __________    _______________

    09-Jan-2001 20:09:00       0.25         "1107"      { 493×1 double}
    09-Jan-2001 20:09:00       0.25         "2670"      { 493×1 double}
    09-Jan-2001 20:09:00       0.25         "5107"      { 493×1 double}
    09-Jan-2001 20:09:00       0.25         "6670"      { 493×1 double}
    09-Jan-2001 20:09:00          1         "A/T"       {1972×1 double}
    09-Jan-2001 20:09:00          1         "ABRK"      {1972×1 double}
    09-Jan-2001 20:09:00       0.25         "ACID"      { 493×1 double}
    09-Jan-2001 20:09:00          1         "ACMT"      {1972×1 double}

Extract the cruise phase for each flight. Pivot the nested sensor data into individual variables using the nested2wide helper function and extract the altitude pressure (ALT), fuel flow (FF.1), fan speed (N1.1 and N2.1) true airspeed (TAS), wind direction (WD), wind speed (WS) and weight on wheel (WOW) variables. To resolve the different sample rates, the nested2wide function fills in any missing values with the previous nonmissing value.

The extractCruise helper function identifies the cruise phase of a flight and returns only the section that contains the cruise portion as a timetable. If the function does not detect a cruise phase, or if it detects multiple separate cruises, the function returns an empty timetable.

Use one transform to pivot the nested sensor data into variables, identify and return timetables from the files that contain cruise portions.

tpdsFlight = transform(pdsFlight,@(t) extractCruise ...
    (nested2wide(t,["ALT","FF.1","N1.1","N2.1","TAS","WD","WS","WOW"])));

Not all of the files return data, as only about 60% of flights contain a cruise phase. The rest are short flights or flights that never left the ground. To find the indices of the files with cruise portions, use a transformation to identify the files in the tpdsFlight datastore that did not return empty tables and the readall function to collect the indices in parallel. To return a datastore that includes only files that have a cruise portion, use the subset function with the tpdsFlight datastore and the indices corresponding to these files.

dsHasCruisePhase = transform(tpdsFlight,@(t) ~isempty(t));
indices = readall(dsHasCruisePhase,UseParallel=true);
tpdsFlightCruise = subset(tpdsFlight,indices);
preview(tpdsFlightCruise)

ans=3043×8 timetable
           timestamp            ALT     FF.1     N1.1      N2.1      TAS        WD      WOW      WS  
    _______________________    _____    ____    ______    ______    ______    ______    ___    ______

    2001-01-13 08:20:41.500    21757    1656    93.906    92.938    393.38    65.391     1     11.953
    2001-01-13 08:20:41.750    21760    1656    93.906    92.938    393.38    66.094     1     11.953
    2001-01-13 08:20:42.000    21763    1656    93.906    92.938    393.44    66.094     1     12.949
    2001-01-13 08:20:42.250    21766    1656    93.906    92.938    393.56    66.797     1     12.949
    2001-01-13 08:20:42.500    21768    1656    93.906    92.938    393.06    68.203     1     12.949
    2001-01-13 08:20:42.750    21771    1656    93.906    92.938    393.06    69.609     1     12.949
    2001-01-13 08:20:43.000    21775    1656    93.906    92.938    393.06    68.906     1     12.949
    2001-01-13 08:20:43.250    21779    1656    93.906    92.938    393.06    68.203     1     11.953
    2001-01-13 08:20:43.500    21781    1656    93.906    92.906       393    68.203     1     11.953
    2001-01-13 08:20:43.750    21786    1656    93.875    92.906    393.12      67.5     1     11.953
    2001-01-13 08:20:44.000    21788    1656    93.875    92.938    393.12      67.5     1     11.953
    2001-01-13 08:20:44.250    21790    1656    93.906    92.938    393.06      67.5     1     11.953
    2001-01-13 08:20:44.500    21795    1656    93.906    92.938    393.19      67.5     1     11.953
    2001-01-13 08:20:44.750    21799    1656    93.906        93    393.62    66.797     1     11.953
    2001-01-13 08:20:45.000    21800    1656    93.938        93    393.38    66.797     1     11.953
    2001-01-13 08:20:45.250    21804    1656    93.906    92.938    393.12    68.203     1     11.953
      ⋮

Prepare Training Data

Randomly partition the tpdsFlightCruise datastore into a training set containing 70% of the files, a validation set containing 15% of the files and a test set containing the remaining 15% of the files using the trainingPartitions helper function.

numCruiseFlights = sum(indices);
[indexTrain,indexValidation,indexTest] = trainingPartitions(numCruiseFlights,[0.70 0.15 0.15]);

Subset the tpdsFlightCruise datastore to obtain the pdsTrain training datastore.

pdsTrain = subset(tpdsFlightCruise,indexTrain);

For a better fit and to prevent the training from diverging, you must normalize the training, validation and test data. In this example, you rescale the sensor variables in the training, validation and test data to the range [0,1] using the minimum and maximum values from all the data in the training datastore.

To easily calculate the minimum and maximum of all training data, create a tall timetable with the data from the pdsTrain datastore. Extract the variables required for training from the tall timetable, and compute the minimum and maximum. When you use the gather function to gather the results into memory, MATLAB automatically executes the computations in parallel on the workers of the open parallel pool.

testDataTT = tall(pdsTrain);
tallTrainDs = testDataTT{:,["ALT","FF.1","N1.1","N2.1","TAS","WD","WS"]};
[stats.min,stats.max] = gather(min(tallTrainDs),max(tallTrainDs));

Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 1: Completed in 27 sec
Evaluation completed in 28 sec

Create a transformed datastore that removes the timestamp and WOW variables and returns the data ready for training using the prepareMLData helper function.

For a single input layer, the neural network expects the datastore to output a table with rows corresponding to the number of sequences in the training data and two variables specifying the predictors and targets, respectively. Each sequence of predictors or targets must be represented as a t-by-c numeric array, where t and c represent the number of time steps and features in each sequence, respectively. In this example, each file is one sequence of training data, so you must transform the training datastore to output a table with two variables for the predictor and target data, and one row per file.

The prepareMLData helper function first normalizes the data based on the minimum and maximum values in the stats structure. To create the predictor data for each file, the prepareMLData helper function combines all seven sensor variables from the sequence into a single t-by-7 array, excluding the sensor data for the final time step. Additionally, the function creates the target true airspeed values by removing the first time step's data and combining the values from the remaining time steps into a single t-by-1 array.

tpdsTrain = transform(pdsTrain,@(t) prepareMLData(t,"TAS","training",stats));

Preview the transformed training datastore.

preview(tpdsTrain)

ans=1×2 table
      Predictors           Targets    
    _______________    _______________
    {1379×7 double}    {1379×1 double}

Prepare Validation and Test Data

Partition 15% of the data from the tpdsFlightCruise datastore for use as the validation data pdsValid datastore.

pdsValid = subset(tpdsFlightCruise,indexValidation);

Use the prepareMLData helper function in a transform to remove the timestamp and WOW variables and return the validation data ready for training. The prepareMLData function transforms the validation data into the same format as the training data.

tpdsValid = transform(pdsValid,@(t) prepareMLData(t,"TAS","training",stats));
preview(tpdsValid)

ans=1×2 table
      Predictors           Targets    
    _______________    _______________

    {2187×7 double}    {2187×1 double}

Partition 15% of the data from the tpdsFlightCruise datastore for use as the test datastore.

tpdsTest = subset(tpdsFlightCruise,indexTest);

Use the prepareMLData helper function to remove the timestamp and WOW variables and return all seven sensor variables, excluding the sensor data for the final time step in the sequence, combined into a single array.

tpdsTestPredictors = transform(tpdsTest,@(t) prepareMLData(t,"TAS","testPredictors",stats));
preview(tpdsTestPredictors)

Use the prepareMLData helper function to remove the timestamp and WOW variables and return the test target true airspeed values shifted by one time step. The prepareMLData function does not normalize the test target data .

tpdsTestTargets = transform(tpdsTest,@(t) prepareMLData(t,"TAS","testTargets",stats));
preview(tpdsTestTargets)

Train and Test Neural Network

Define Network Architecture and Train Network

To create an LSTM network, specify this sequence of layers:

Sequence input layer with size matching the number of input features, in this example, seven (ALT, FF.1, N1.1, N2.1, TAS, WD, and WS).
An LSTM layer with 64 hidden units that outputs the full sequence.
Fully connected layer with output size set to 136 corresponding to 136 hidden neurons.
Dropout layer with dropout probability 0.60211.
Fully connected layer with output size matching the number of outputs, in this example, one (TAS).

numFeatures = 7;
numHiddenUnits = 64;
numResponses = 1;

layers = [ ...
    sequenceInputLayer(numFeatures,Normalization="none")
    lstmLayer(numHiddenUnits,OutputMode="sequence")
    fullyConnectedLayer(136)
    dropoutLayer(0.60211)
    fullyConnectedLayer(numResponses)];

Define the training options.

Train using the "ADAM" optimizer and for 20 epochs with mini-batches of size 1.
Specify the learning rate to 0.01 and the gradient threshold to 1.
To validate the network at regular intervals during training, specify the validation datastore as tpdsValid and set the validation frequency to 30.
Set ExecutionEnvironment to "parallel-auto" to use the parallel pool for training. If there is no current parallel pool, the software starts one using the default cluster profile. If the pool has access to GPUs, then only workers with a unique GPU perform training computation and excess workers become idle.
Monitor the mean squared error (MSE) metric and display the training progress in a plot.

maxEpochs = 20;
miniBatchSize = 1;
options = trainingOptions("adam", ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=0.01, ...
    GradientThreshold=1, ...
    ValidationData=tpdsValid, ...
    ValidationFrequency=30, ...
    ExecutionEnvironment="parallel-auto", ...
    Metrics="rmse", ...
    Plots="training-progress", ...
    Verbose=0);

Train the neural network. Because this is a regression task, use mean squared error loss.

net = trainnet(tpdsTrain,layers,"mse",options);

Test Network

You can use the trained neural network to predict the true airspeed of each flight sequence in the test datastore. After the prediction is complete, convert the normalized predicted values to the real values.

yPred = minibatchpredict(net,tpdsTestPredictors,MiniBatchSize=1,UniformOutput=false);
minTAS = stats.min(5);
maxTAS = stats.max(5);
predictedTrueAirspeeds  = cellfun(@(x) (maxTAS-minTAS)*(x+minTAS),yPred,UniformOutput=false);

Extract the target true airspeed values for comparison. Set UseParallel to true to use the workers of the open parallel pool.

targetY = readall(tpdsTestTargets,UseParallel=true);
targetTrueAirspeeds = targetY.("Test Targets");

Compare the target and predicted true airspeed in a plot.

idx = randperm(length(predictedTrueAirspeeds),4);
figure
tiledlayout(2,2)
for i = 1:numel(idx)
    nexttile
    plot(targetTrueAirspeeds{idx(i)},"--")
    hold on
    plot(predictedTrueAirspeeds{idx(i)},".-")
    hold off
    title("Test Flight " + idx(i))
    xlabel("Time Step")
    ylabel("TAS (knots)")
    legend(["Test Data","Predicted"],Location="best")
end

Calculate the mean of the maximum absolute error, the maximum relative error (as a fraction of the target value) and the mean RMSE between the target and predicted true airspeed values.

absErrors = cellfun(@(x1,x2) max(abs(x1-x2)),targetTrueAirspeeds,predictedTrueAirspeeds);
maxRelativeError = cellfun(@(x1,x2) max((abs(x1-x2)./x1)),targetTrueAirspeeds,predictedTrueAirspeeds);
rootMeanSE = cellfun(@(x1,x2) rmse(x1,x2),targetTrueAirspeeds,predictedTrueAirspeeds);
meanAbsErrors = mean(absErrors);
fprintf("Mean maximum absolute error = %5.4f knots",meanAbsErrors)

Mean maximum absolute error = 79.2402 knots

meanRMSE = mean(rootMeanSE);
fprintf("Mean RMSE = %5.4f knots",meanRMSE)

Mean RMSE = 11.3948 knots

Plot the maximum absolute error for the test flight data.

figure
nexttile
histogram(rootMeanSE)
xlabel("RMSE (knots)")
ylabel("Frequency")
title("RMSE")
nexttile
histogram(maxRelativeError*100)
xlabel("Absolute Error (%)")
ylabel("Frequency")
title("Max Absolute Errors as Percentage of Target")

Clean Up

Remove the flight data files and delete the parallel pool.

rmdir(dataRoot,"s");
delete(pool);

References

[1] “Flight Data For Tail 652 | NASA Open Data Portal.” Accessed October 6, 2023. https://data.nasa.gov/dataset/Flight-Data-For-Tail-652/fxpu-g6k3.

Use Parallel Computing to Optimize Big Data Set for Analysis

Download Flight Data

Convert Data to Nested Tables

Create File Datastore

Clean Data

Write to Parquet Files in Parallel

Extract Data to Train Neural Network

Extract Data

Prepare Training Data

Prepare Validation and Test Data

Train and Test Neural Network

Define Network Architecture and Train Network

Test Network

Clean Up

References

See Also

Related Topics

Use Parallel Computing to Optimize Big Data Set for Analysis

Download Flight Data

Convert Data to Nested Tables

Create File Datastore

Clean Data

Write to Parquet Files in Parallel

Extract Data to Train Neural Network

Extract Data

Prepare Training Data

Prepare Validation and Test Data

Train and Test Neural Network

Define Network Architecture and Train Network

Test Network

Clean Up

References

See Also

Related Topics

WeChat