MLOps Transformation: Moving from Stage 0 to Stage 3 (Part II)

A maturity roadmap and a cultural shift.

Find out how we improved our Machine Learning Operations (MLOps) maturity from stage 0 to 3 with our in-house Skills Extraction Model.

In our previous post, we introduced the Skills Extraction Algorithm (SEA) developed by the Research and Development Pillar of SkillsFuture Singapore’s (SSG) National Jobs-Skills Data Office (NJSDO) and shared our 2-step journey advancing MLOps maturity from stage 0 to 3. In this post, we delve deeper into the technical aspects of our system, built to support faster iterations, seamless updates, and robust, adaptive model deployments.

Before diving into MLOps, here’s a brief overview of the SEA, which leverages NLP and semantic similarity to extract standardised skills from text inputs such as job descriptions. SEA consists of three key components:

  1. Span Extraction: Identifies relevant skill and knowledge phrases from the input text.
  2. Skill Embedding Generation: Encodes all SSG-recognised skills from the SkillsFuture Framework (SFw) using MPNet.
  3. Matching & Ranking: Embeds extracted spans, matches them against SFw skill embeddings, and re-ranks them based on semantic relevance.

For example, consider the following job description input:

“We are hiring a Data Scientist with a focus on computer vision to develop models that extract insights from image and video data, enabling smarter, automated decision-making. The ideal candidate has strong expertise in Python and AI-focused deep learning frameworks (e.g., PyTorch, TensorFlow), and excels in applying cutting-edge artificial intelligence techniques in a fast-paced, innovation-driven environment.”

From this input, relevant skills are semantically extracted and mapped to the terminology used in the SkillsFuture Framework (SFw), rather than returned as generic skill names like “computer vision” or “artificial intelligence”. For instance, the extracted and re-ranked skills may include Computer Vision Technology and Artificial Intelligence Application, which align with the SkillsFuture skill taxonomy. The objective is to match the language and structure of recognised skills within the framework using semantic similarity, ensuring compatibility with national standards.
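The matching and ranking step can be sketched as cosine similarity over embedding matrices. This is a minimal illustration assuming generic pre-computed embeddings (the real system encodes spans and skills with MPNet); `match_spans` and its inputs are hypothetical names, not our production API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two matrices of row embeddings."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def match_spans(span_embeddings, skill_embeddings, skill_names, top_k=2):
    """Match each extracted span to its closest skills in the taxonomy,
    ranked by semantic similarity."""
    sims = cosine_similarity(span_embeddings, skill_embeddings)
    results = []
    for row in sims:
        ranked = np.argsort(row)[::-1][:top_k]  # indices, highest first
        results.append([(skill_names[i], float(row[i])) for i in ranked])
    return results
```

In practice, the span and skill embeddings would both come from the same fine-tuned encoder, so that distances in the shared embedding space are meaningful.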

Note: The primary focus of retraining within our MLOps is the MPNet-based matching model. This is because there is no labelled feedback data for the span extractor component: our human reviewers do not annotate whether individual spans are correctly extracted. Instead, they only assess the final output, i.e., whether the extracted skills mapped to the SFw are relevant.

Overall MLOps Design and Infrastructure Setup

To dive deeper into our solution, below is an overview of our MLOps system design that delivers the process highlighted above. As mentioned in our previous post, we primarily use Amazon Web Services (AWS) offerings such as Redshift, S3, and SageMaker, alongside GitLab and Tableau.

  • AWS S3: Storage for our objects, such as model artefacts.
  • AWS SageMaker: For building, training, and deploying our models.
  • AWS Redshift: Serves as our central data repository, storing training data, SSG’s list of recognised skills (and new skills to be added), and human-corrected feedback data for continuous model improvement.
  • GitLab: Stores code and enables testing before merging into the codebase. A successful merge triggers SageMaker in MAESTRO to rebuild the model.
  • Tableau: Used for performance monitoring within SSG’s Data Analytics Environment (DAE), providing dashboards and sending alerts when model performance drops.
Note that MAESTRO is a platform that provides MLOps services (such as AWS S3 and SageMaker), the necessary resources, and ready-to-use code templates for model building and deployment.
Figure 1: MLOps Setup

MLOps Process Walkthrough

Putting everything together, the MLOps process for updating the model is triggered by one of three events: code changes, new skill updates (where newly recognised skills from SSG are ingested into Redshift), or detected performance drops (e.g., the Jaccard or F1 score falls below a certain threshold). When any of these occurs, the model artefacts are either updated to obtain new skill embeddings (for skill updates) or retrained (for code changes or performance drops) automatically. Code changes must first pass all required tests (e.g., unit tests) before being successfully merged.

Once updated, the model undergoes functional and performance testing. If any test fails, the process loops back for corrections — this may involve engaging a data scientist or engineer to address either the model issues or code issues to make necessary refinements.

If all tests pass, the model is registered and auto-deployed to DEV. Manual approval is required for promotion to UAT and PROD. In production, performance is monitored using vendor-reviewed feedback as ground truth, stored in Redshift for retraining. If performance drops, retraining is triggered to maintain accuracy.

In our MLOps setup, several components support the end-to-end workflow, including CI/CD pipelines for automated testing, version management, model building, and deployment, as well as tracking and monitoring of model performance in production.

Continuous Integration and Continuous Deployment (CI/CD)

Before examining the five components shown in Figure 1 above, let’s discuss how CI/CD brings everything together to ensure continuous integration and delivery. Each component can be viewed as a stage in the CI/CD pipeline, and setting up CI/CD ensures each step automatically follows the previous one, reducing manual effort, minimising errors, and drastically shortening the time to deployment.

Note: Many tools enable CI/CD; well-known examples include GitHub Actions, GitLab CI/CD, and Jenkins. These platforms typically come with runners, agents that execute tests on your code by processing the commands defined in the CI/CD configuration file.
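As a hypothetical illustration of such a configuration file, a minimal GitLab CI/CD pipeline that runs unit tests on every merge request might look like the following (the stage name, image, and test paths are placeholders, not our actual setup):

```yaml
stages:
  - test

unit-tests:
  stage: test
  image: python:3.11              # placeholder runner image
  script:
    - pip install -r requirements.txt
    - pytest tests/unit           # placeholder path to unit tests
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```

The runner picks up this file, executes the `script` commands, and reports pass/fail status back to the merge request, blocking the merge if tests fail.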

Automated Testing

Our testing strategy spans multiple levels, from unit tests for individual functions to integration tests for the entire pipeline.

  1. Unit Tests (Code-Level Testing)

Purpose: Unit tests validate the smallest building blocks of our system, such as the individual functions or components. The focus is on ensuring they behave as expected.

Scope: Focuses on testing a single unit of code (e.g., a function or method) and the required class definitions.

In our case, we design unit tests with four key scenarios in mind: Happy Path, Negative Path, Edge Cases, and Boundary Cases.

A) Happy Path: Verifying Expected Behavior
The Happy Path tests how a function behaves under normal conditions with valid inputs. This ensures the function produces correct outputs when used as intended.

Example: A function that adds two numbers should return 5 when given 2 and 3.

SEA Test Case: If the current model version is 1.2.9, a minor change should correctly upgrade it to 1.3.0.

B) Negative Path: Handling Invalid Inputs
The Negative Path focuses on how a function reacts to incorrect or unexpected inputs. A robust function should either gracefully handle errors or throw clear exceptions for easy debugging.

Example: If a function expects a number but receives a string or None, it should raise an error instead of failing silently.

SEA Test Case: If a function expects a list of dictionaries, sending a nested JSON should result in an appropriate validation error.

C) Edge Cases: Testing Unusual or Extreme Inputs
Edge case testing ensures that functions work correctly even when they receive uncommon or extreme values. This helps prevent unexpected failures in real-world use.

Example: A function processing lists should be tested with an empty list, a single-element list, and a list with 1,000,000 elements.

SEA Test Case: Validate the model version upgrade process when the current version is 999.999.999, to ensure our logic correctly handles unusually large version numbers.

D) Boundary Cases: Validating Input Limits
Boundary tests examine function behavior at input limits or thresholds to ensure correct handling of values at the upper and lower boundaries.

Example: If a function accepts numbers between 1 and 10, we should test it with values 1, 10, 0, and 11 to ensure it correctly enforces constraints.

SEA Test Case: Ensure that SpanExtractor returns the correct output when ChunkSize = 3 and the input length is 4.
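The four scenarios above can be made concrete with a few assertions against hypothetical helpers. `bump_version` and `chunk` below are simplified stand-ins for the real SEA utilities, written only to illustrate each scenario:

```python
def bump_version(version: str, change: str) -> str:
    """Bump an X.Y.Z version string for a 'major', 'minor', or 'patch' change."""
    if not isinstance(version, str):
        raise TypeError("version must be a string like 'X.Y.Z'")
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

def chunk(items: list, chunk_size: int) -> list:
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# Happy path: a minor change bumps 1.2.9 to 1.3.0.
assert bump_version("1.2.9", "minor") == "1.3.0"

# Negative path: an invalid input raises a clear error instead of failing silently.
try:
    bump_version(None, "minor")
except TypeError:
    pass
else:
    raise AssertionError("expected a TypeError for non-string input")

# Edge case: very large version numbers are still handled correctly.
assert bump_version("999.999.999", "minor") == "999.1000.0"

# Boundary case: chunk size 3 with input length 4 yields chunks of 3 and 1.
assert chunk([1, 2, 3, 4], 3) == [[1, 2, 3], [4]]
```

In practice, these assertions live in a test framework such as pytest, so that the CI runner can collect and report them automatically.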

2. Functional and Model Tests

Purpose: The tests here ensure that our models and modules work together correctly, and continue to perform well after updates or new skills ingestion.

Scope: Focus on multiple function calls and invocations, assessing the model’s performance and validating that new data ingestion (e.g., new skills) does not degrade accuracy or cause unintended side effects.

In our case, the main emphasis is on testing whether different functions in our system interact correctly to perform a single feature. This is distinct from integration testing, which focuses on the interaction between multiple components or systems. Specifically, we also test whether our model continues to perform well after code/logic updates and the ingestion of new skills.

For instance, we test if new or updated skills can still be categorised and mapped within the internal skills taxonomy.

Another example is testing different functions (the span extractors and the semantic mapping to our taxonomy) to verify that they work seamlessly together to achieve the objective of extracting skills relevant to the taxonomy.
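A functional test along these lines can be sketched with stubbed components. Both `extract_spans` and `map_to_taxonomy` below are illustrative stubs, not our production code (which uses JobBERT-based extractors and MPNet matching):

```python
def extract_spans(text: str) -> list[str]:
    """Stub extractor: return known skill phrases found in the text."""
    known_phrases = ["computer vision", "artificial intelligence"]
    return [p for p in known_phrases if p in text.lower()]

def map_to_taxonomy(spans: list[str]) -> list[str]:
    """Stub mapper: map raw phrases to SFw-style skill titles."""
    taxonomy = {
        "computer vision": "Computer Vision Technology",
        "artificial intelligence": "Artificial Intelligence Application",
    }
    return [taxonomy[s] for s in spans if s in taxonomy]

def test_extract_and_map_together():
    """Functional test: the two stages cooperate to produce taxonomy skills."""
    text = "Expertise in computer vision and artificial intelligence required."
    skills = map_to_taxonomy(extract_spans(text))
    assert skills == [
        "Computer Vision Technology",
        "Artificial Intelligence Application",
    ]

test_extract_and_map_together()
```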

3. Integration Testing

Purpose: Ensuring that the entire solution and system work together end to end.

Scope: Tests everything, from start to end, in a deployed environment (test environment).

For integration tests, we mainly focus on ensuring that the different systems and functions can work together. We ensure that data can flow smoothly across sources like Redshift to our pipelines, S3 buckets, and other integrated systems, verifying that each part of the workflow interacts correctly within the broader environment. These tests help ensure that data flows as expected between systems, without any disruption or loss.

For instance, we test whether the end-to-end model inference works by verifying the seamless integration and interaction of all components in the system. This includes ensuring that data flows correctly through the pipeline, from data ingestion to preprocessing, model inference, and output handling, confirming that each stage functions as expected within the integrated environment.

Model Build

There are three main scenarios in which the model build pipeline is triggered, namely a Code Update, a Skill Update, or a drop in Model Performance (based on the ground truths from feedback data).

  1. Code Update
    Whenever a data scientist or AI engineer makes changes to the codebase and commits to GitLab, unit tests are executed. This can be done via many approaches, such as GitLab Runners. If tests fail, the commit is rejected. Otherwise, a successful commit and push into the codebase triggers the model build process.

    As part of our code management practice, we recommend that all new features or changes be developed in separate feature branches. Once development is complete and the code has passed review, it can be merged into the main branch; this code update then triggers the model build process.
  2. Model Performance in Production Falls
    When the model in production falls below a certain threshold on a predefined metric (e.g., a Jaccard score of 0.40), model retraining is required, and thus the model build process starts.
  3. Skills Update
    Skills updates occur from time to time, when newly recognised skills from SSG are ingested into Redshift. Whenever this happens, the model build process is started (to update the embeddings, but not the model weights).

What’s happening in Model Build
When triggered by a code update or poor production performance, the MPNet model is retrained using feedback data from Redshift. We apply a champion-contender approach, where only models that outperform previous versions are added to the registry for automated deployment. All experiments are tracked in S3. When a model surpasses the champion’s performance, we also generate embeddings.npy, which is our embedded skills taxonomy data used for mapping extracted skills to SSG’s skills taxonomy.
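The champion-contender gate reduces to a simple comparison on the chosen metric. The sketch below assumes a single scalar metric decides promotion; the function and metric names are illustrative, not our production schema:

```python
def should_promote(contender_metrics: dict, champion_metrics: dict,
                   key: str = "cosine_f1") -> bool:
    """Promote the contender only if it strictly beats the current champion."""
    return contender_metrics[key] > champion_metrics[key]

champion = {"cosine_f1": 0.917}
contender = {"cosine_f1": 0.987}

if should_promote(contender, champion):
    # Register the contender in the model registry and regenerate embeddings.npy.
    print("Contender promoted")
else:
    # Keep the champion; retain only the experiment logs for the contender.
    print("Champion retained")
```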

When the model rebuild is triggered by a skills update, the process becomes much simpler. Our pipeline will recompute the embeddings for all current skills to generate the embeddings.npy file. Afterwards, functional testing is performed to ensure that the model and mapping still work correctly after the new skills ingestion and update.
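The skills-update path can be sketched as recomputing and persisting the embedding matrix. Here `encode` stands in for the model’s encoding call (e.g., a fine-tuned MPNet model via sentence-transformers), and the dummy encoder exists only to make the example runnable:

```python
import os
import tempfile

import numpy as np

def rebuild_skill_embeddings(skills, encode, out_path):
    """Encode every recognised skill and save the matrix for later matching."""
    embeddings = np.asarray(encode(skills), dtype=np.float32)
    np.save(out_path, embeddings)
    return embeddings

# Dummy deterministic encoder, purely for illustration; the real pipeline
# would call the fine-tuned MPNet model here.
def dummy_encode(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]

out_file = os.path.join(tempfile.gettempdir(), "embeddings.npy")
emb = rebuild_skill_embeddings(
    ["Computer Vision Technology", "Artificial Intelligence Application"],
    dummy_encode,
    out_path=out_file,
)
```

Because the saved matrix only depends on the skill list and the model weights, a skills update can safely regenerate it without any retraining.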

As shown below, the embeddings.npy file will be stored under the respective versions. This embeddings.npy file for each model version is crucial because each version of the model produces different embeddings due to changes in model weights during training or fine-tuning.

Bucket root/
├── finetuned_models/
│   └── mlops/
│       └── versions/
│           ├── X.Y.Z/
│           │   └── embedding_model/
│           │       └── embeddings.npy
│           └── X.Y.Z/
│               └── embedding_model/
│                   └── embeddings.npy
├── base_models/
│   ├── all-mpnet-base-v2/
│   ├── jobbert_knowledge_extraction/
│   └── jobbert_skill_extraction/
├── prod/
└── logs.json

Note: We have a total of three different models (under base_models/), so in our inference.py script later, we need to load all three models in model_fn (the model-loading function). For more information, you may refer to the documentation provided by AWS.

As part of model build, we place strong emphasis on experiment tracking and version management to ensure reproducibility, traceability, and consistency as models evolve.

We store our experiment data in structured JSON files, capturing key details such as data version, date, time, test results, training and test datasets, etc. This ensures a complete history of model performance and transparency into the training details over time.

For versioning, we follow semantic versioning best practices, where version numbers are assigned in the format X.Y.Z: the major version (X), minor version (Y), and patch version (Z). Whenever a new update is pushed out, the version increases accordingly. Specific to our MLOps, the definitions are:

X → Major changes (encoder changes, architecture changes, change of loss functions, etc.)
Y → Model retraining with new data (new model weights)
Z → Data updates/ingestion, or bug fixes

Below is a snippet illustrating the structure we use to maintain logs of each experiment and the model artefacts for each version. In our logs, we record the source of data for each model version, where the retrained model is stored, the metrics, the version, and the timestamp. Please note that the values provided are just examples. In the future, we aim to also include all the hyperparameters used in each version.

{
  "prod": {
    "version": "1.18.0"
  },
  "models": {
    "1.17.0": {
      "bucket": "sst-s3-gvt-agml-prodizna-d-ndnhgyawarxn-bucket",
      "model_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/finetuned_temp/mlops/1.17.0/all-mpnet-base-v2",
      "train_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/train_set.csv",
      "val_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/val_set.csv",
      "test_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/test_set.csv",
      "metrics": {
        "cosine_accuracy": 0.855,
        "cosine_accuracy_threshold": 0.5394068956375122,
        "cosine_f1": 0.9173789173789173
      },
      "data_version": "1.0.0",
      "data_checksum": "2hajk1456hkjb2hlawd",
      "timestamp": "07/11/2024/1337"
    },
    "1.18.0": {
      "bucket": "sst-s3-gvt-agml-prodizna-d-ndnhgyawarxn-bucket",
      "model_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/finetuned_temp/mlops/1.18.0/all-mpnet-base-v2",
      "train_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/train_set.csv",
      "val_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/val_set.csv",
      "test_data_path": "SEA-MLOPS-CICD-p-sxne7pxhlgtj/retrainingtest/test_set.csv",
      "metrics": {
        "cosine_accuracy": 0.955,
        "cosine_accuracy_threshold": 0.5394068956375122,
        "cosine_f1": 0.9873789173789173
      },
      "data_version": "1.0.0",
      "data_checksum": "2hajk1456hkjab2hlawd",
      "timestamp": "08/11/2024/1437"
    }
  }
}

Model Deploy

Model deployment is much more straightforward with the help of the Project Template available on MAESTRO.

In the model deployment process, the model artifact generated during the build phase is deployed to a target environment, which in our case is the DEV environment. You can configure your setup in Model Deploy to specify which model to deploy. In our case, we deploy the latest and best-performing model. If the newly trained model performs poorly, it will not be deployed or registered in the model registry. Only the experiment and logs are retained.

This process automates tasks like resource provisioning, configuring endpoints, and updating deployments for real-time or batch inference. Currently, minimal changes are required for model deployment. Most changes are one-time adjustments, such as specifying the GPU type. For example, in our prod-config.json file below, we only modified the EndpointInstanceType to indicate the required resource.

{
  "Parameters": {
    "StageName": "prod",
    "EndpointInstanceCount": "1",
    "EndpointInstanceType": "ml.g4dn.xlarge",
    "SamplingPercentage": "80",
    "EnableDataCapture": "true"
  },
  ...etc
}

In model deploy, there are mainly three stages: DEV, UAT, and PROD.

Deployment to DEV is automated and primarily used by data scientists and engineers for internal testing: verifying functionality, catching errors early, and debugging in a controlled environment. The focus here is on technical validation. If no issues arise, the model is then promoted to UAT.

We treat UAT as a staging environment after DEV, where business users or stakeholders validate that the model meets the intended business requirements and performs as expected in real-life scenarios. Only upon their sign-off do we proceed with promotion to PROD.

For PROD, manual approval is required after UAT to ensure robustness and stability. We strongly recommend against automating PROD deployment, to allow for careful evaluation before going live; accordingly, we do not plan to automate this step in future work either.

Feedback Loop & Performance Monitoring

In our MLOps setup, we also ensure that our model inference captures logs. We store a record of all model predictions on S3, including details such as the model version, timestamp, and the model’s output. An example of this can be seen in the logs.json below. These logs help verify the consistency of end-user feedback, forming a feedback loop that supplies ground truth data for retraining. The data is then stored in Redshift. This feedback data not only enables us to retrain our model, but also allows us to monitor model performance metrics, such as Jaccard score, F1, precision, recall, or any other chosen metric, on a live dashboard connected to the same Redshift data source.

Note that feedback data is obtained when end users provide responses by confirming if the outputs are correct or making corrections if incorrect.

{
  "modelVersion": "15.0.0",
  "timestamp": "2024-11-13T04:25:32.052676",
  "requestId": "677183436a6d660d8a18568c209103b6fa15583cfcf739ccd78c425cde8cdbf5",
  "original_input": "\nThis is a dummy input. This course teaches data science for business use.\n",
  "skillExtractOutput": [
    {
      "skill_id": "c9b3ec8895f8ce25e14e48e14a454131dcf16f32224965ecd99879c2242b20db",
      "skillTitle": "Data Design",
      "skillType": "tsc",
      "proficiencyLevel": "",
      "confidenceScore": 0.643510103225708,
      "skillTags": [
        "Emerging TSC"
      ],
      "phrase": "full text",
      "skillOccurrence": 1
    }
  ]
}

An example of our dashboard is provided below, where we focus on the Jaccard score (the overlap between the skills we predict and the actual skills based on the feedback data). Of course, the metrics you track may differ, depending on your needs and business requirements.

Example of a live dashboard for model performance

While our model performance serves an important purpose, there are many additional performance metrics in production that we can explore. For instance, we can consider system-level monitoring, which helps us understand whether users are trying to push the system beyond its capabilities, providing insights on usage patterns and system limitations.
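For reference, the Jaccard score mentioned above is simply the overlap between predicted and reviewer-confirmed skill sets:

```python
def jaccard_score(predicted: set, actual: set) -> float:
    """|intersection| / |union|; 1.0 when both sets are empty, by convention."""
    if not predicted and not actual:
        return 1.0
    return len(predicted & actual) / len(predicted | actual)

predicted = {"Computer Vision Technology", "Data Design"}
actual = {"Computer Vision Technology", "Artificial Intelligence Application"}
score = jaccard_score(predicted, actual)  # 1 shared skill / 3 in union ≈ 0.333
```

Averaging this score over a rolling window of feedback records gives the single number tracked on the dashboard and compared against the retraining threshold.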

Overall impact made

Reaching MLOps maturity stage 3 has significantly improved our workflow and efficiency. Automated pipelines, performance monitoring, and retraining triggers have streamlined model development, reduced manual effort, and enabled faster, more consistent deployments. Furthermore, real-time monitoring has allowed us to proactively address performance issues, ensuring our models stay effective in production. This has not only strengthened the technical reliability of our models but has also fostered a culture of continuous improvement, driving alignment with organisational goals and enhancing overall operational effectiveness.

Potential Future Works

In the future, we can consider incorporating multiple embedding models and continuously compare their performance across different model versions or iterations, rather than relying solely on MPNet for all experiments in our MLOps setup. Initially, we selected MPNet after evaluating it against five other embedding models. However, as data drift occurs, there is always a chance that another model may perform better.

From a governance perspective, we can also present more detailed and concise information that goes beyond a general model card. This can include insights into why a particular model was chosen for each experiment, and potential Responsible AI (RAI) considerations.

Additionally, we can integrate more types of monitoring metrics, such as system-level monitoring and infrastructure/resource usage, to gain a better understanding of what’s happening in the production environment.

Reflection & Conclusion

Going from 0 to 1 is no easy job, let alone going from Stage 0 to Stage 3 in MLOps. With no in-house expertise, the task can sound daunting. However, the platform and tools provided by MAESTRO, together with AIP’s MLOps playbook, made the barriers to entry much lower.

Special thanks to Leo Li and Raymond Harris from AI Practice for driving this MLOps adoption in SSG.

Special thanks to Victor Ong from AI Practice for his guidance and feedback.