Solving for infrastructure through the cloud: An Ontario Hospital’s Journey Towards The Deployment of Artificial Intelligence

In the past few years, remarkable progress has been made in demonstrating the potential of artificial intelligence (AI) technologies in healthcare.

Despite the potential of AI in healthcare, very few of these algorithms make it to the bedside. This so-called “implementation gap” prevents AI algorithms from having a real-world impact on patient care. Failure to deploy an algorithm, meaning it is never used in clinical practice, can be caused by a number of factors, including addressing the wrong problem, poor design, lack of access to the right data, lack of appropriate infrastructure, and degradation of the algorithm over time. Legal, privacy, and ethical barriers also exist; these are challenging for data scientists but critical for the real-world applicability of AI in healthcare. Another key barrier to deployment is the lack of access to modern infrastructure that can be used on demand without much upfront cost. This is where the cloud comes in.

The piece that follows shares our experience at Trillium Health Partners’ (THP) Institute for Better Health in leveraging the University of British Columbia Cloud Innovation Center, powered by AWS (UBC CIC), for the end-to-end deployment of an analytics algorithm, specifically a census prediction algorithm.

Census is the daily number of patients present in each hospital ward. Predicting census in the near term helps ensure that beds, staff, and relevant supplies are allocated in proportion to demand. Hospitals regularly operate at over 100% capacity, with patients occupying hallways and other non-conventional spaces. To provide better care to the population of Mississauga, better census predictions are required to inform a variety of capacity management decisions. At THP, an existing census prediction algorithm had been developed and deployed on-premises. We used this use case to see whether we could replicate the pre-existing work on the cloud while taking advantage of cloud-specific benefits.


We have included our final architecture below for reference. The rest of the piece focuses on our specific implementation, and the technical lessons learned for:

  1. Security and Identity Access Management
  2. Data Ingestion
  3. Data Transformation
  4. Data Science
  5. Model Deployment

1. Security and Identity Access Management

Implementation:

To ensure that the security configuration in AWS aligned with enterprise governance requirements, the principle of least privilege guided our security and identity management setup. AWS Organizations was leveraged so that member accounts inherit security boundaries from a parent account. This includes guardrails for:

  1. Preventing access to AWS resources outside Canada,
  2. Enforcing Multi-Factor-Authentication (MFA) for all users, and
  3. Assigning need-based permissions to IAM users

In addition to the above, all management and data events in AWS were logged to CloudTrail for audit-tracking purposes. Event alarms with custom notification thresholds were set up in CloudWatch to alert users of possible problems so that preventative measures could be applied.
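To illustrate how the first guardrail can be expressed, the sketch below creates and attaches a service control policy (SCP) that denies API calls outside the Canadian region. This is a minimal sketch rather than our exact policy: the policy name, the handling of global-service exemptions, and the target organizational unit ID are placeholders.

# Minimal sketch (not our exact policy): deny API calls outside ca-central-1
# using an AWS Organizations service control policy (SCP).
import json
import boto3

org = boto3.client('organizations')

region_guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyRequestsOutsideCanada",
        "Effect": "Deny",
        # exemptions for global services (IAM, STS, etc.) would normally
        # be added here; omitted for brevity
        "Action": "*",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["ca-central-1"]}
        }
    }]
}

policy = org.create_policy(
    Name='restrict-to-ca-central-1',  # placeholder name
    Description='Deny access to AWS resources outside Canada',
    Type='SERVICE_CONTROL_POLICY',
    Content=json.dumps(region_guardrail)
)

# attach the guardrail to the member account's organizational unit
org.attach_policy(PolicyId=policy['Policy']['PolicySummary']['Id'],
                  TargetId='[organizational-unit-id]')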

Lessons Learned:

  • It is much easier to implement data governance policies in AWS Lake Formation than in each individual service, since Lake Formation delegates the necessary permissions to S3, Glue, and Athena. The alternative of creating permissions and policies manually for each service is very cumbersome (a short sketch of granting table-level permissions follows this list).
  • Identity Management is a complex module in AWS and runs many layers deep. It is important to map enterprise governance needs to user policies early on.
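As a sketch of what this looks like in practice, read access to a catalog table can be granted through a single Lake Formation API call rather than hand-written IAM and bucket policies. The role ARN, database, and table names below are placeholders, not our actual resources.

# Minimal sketch: grant read access to one catalog table via Lake Formation.
# The role ARN, database name, and table name are placeholders.
import boto3

lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::[account-id]:role/[analyst-role]'},
    Resource={'Table': {'DatabaseName': '[datalake-database]', 'Name': '[census-table]'}},
    Permissions=['SELECT', 'DESCRIBE']
)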

2. Data Ingestion

Implementation:

First, we built a prototype census prediction algorithm on-premises to scope all necessary data sources, including the servers, databases, tables, and columns. After identifying the data sources and data elements, we worked with internal IT, Business Intelligence, and Decision Support teams to do a one-time extract of all necessary historical data.

To store the data, we set up a data lake in S3. We created four buckets to zone data. The buckets were:

  1. Stage (where source data first gets loaded)
  2. Raw (source data that has undergone some basic transformation)
  3. Trusted (exploration zone)
  4. Refined (production zone)

The historical one-time extract of the data was loaded into the staging zone. In parallel, we worked with our teams to build an automated pipeline so that daily delta uploads also land in the staging zone.

We used Glue crawlers to catalog all of our data assets. The Glue Data Catalog will later serve as a key part of our data tagging and discoverability capabilities, where data dictionaries can be created and sensitive data tagged in an easily searchable manner.

As an additional benefit, this also made the data queryable, so analysts could write SQL against it in Athena.
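For illustration, the sketch below shows the shape of this setup; the crawler name, Glue database, bucket paths, and query are placeholders rather than our production configuration.

# Minimal sketch: crawl a data lake zone into the Glue Data Catalog,
# then query the resulting table with Athena. Names and paths are placeholders.
import boto3

glue = boto3.client('glue')
athena = boto3.client('athena')

# catalog everything under the raw zone into a Glue database
glue.create_crawler(
    Name='raw-zone-crawler',
    Role='service-role/AWSGlueServiceRole-S3-Crawler',
    DatabaseName='datalake_raw',
    Targets={'S3Targets': [{'Path': 's3://[raw-bucket]/'}]}
)
glue.start_crawler(Name='raw-zone-crawler')

# once the crawler has run, the tables are queryable from Athena
athena.start_query_execution(
    QueryString='SELECT * FROM [census_table] LIMIT 10',
    QueryExecutionContext={'Database': 'datalake_raw'},
    ResultConfiguration={'OutputLocation': 's3://[athena-results-bucket]/'}
)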

Lessons Learned:

  • It is certainly easier to develop all AWS Glue pipeline jobs under the AWS Lake Formation umbrella, since permissions are simplified.
  • AWS Glue transformations leverage Apache Spark under the hood, making it easy to extend the code generated by the Glue engine.
  • Transforming many small CSV files to Parquet is not recommended, since processing too many small files can lead to memory errors.

3. Data Transformation

Implementation:

We used Glue ETL jobs to transform data between the various zones of the data lake. Specifically, Glue ETL jobs moved data from the staging zone to the raw zone, ingesting the input CSV files and writing out Parquet files.

To create the Glue ETL script, we first used Glue Studio to generate boilerplate code, which was saved to S3. We then modified the boilerplate so it could accept the input and output data sources as parameters. Since we had many different tables to transform, a boto3 script was used to create a separate Glue processing job for each data source, specifying the input and output locations at job-creation time.
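To give a sense of what the parameterized script looks like, the FROM_ADDRESS and TO_ADDRESS arguments passed at job-creation time can be read inside the Glue job as shown below. This is an illustrative sketch, not our exact boilerplate; the CSV options are simplified.

# Illustrative sketch of a parameterized Glue ETL script (not our exact boilerplate):
# read CSVs from FROM_ADDRESS and write Parquet to TO_ADDRESS.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# the '--FROM_ADDRESS' / '--TO_ADDRESS' job arguments arrive without the dashes
args = getResolvedOptions(sys.argv, ['FROM_ADDRESS', 'TO_ADDRESS'])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

df = spark.read.option('header', 'true').csv(args['FROM_ADDRESS'])
df.write.mode('overwrite').parquet(args['TO_ADDRESS'])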

Lessons Learned:

  • It’s very hard to read non-UTF-8 encoded files in Glue. Some of our data sources had to be transformed manually; we used Python’s Dask library to do so (a short sketch appears after the code snippet below).
  • While Parquet is a great format for efficient storage of data, it is not as seamlessly compatible with a typical data science workflow as plain CSV.

Code Snippet

# used to create 28 Glue ETL jobs, one per table, by varying the
# FROM_ADDRESS / TO_ADDRESS arguments at job-creation time
import boto3

# credentials are supplied via variables defined earlier in the notebook
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        aws_session_token=aws_session_token)
glue = session.client('glue')
name = 'GlueJob'

job = glue.create_job(
    Name=name,
    Role='service-role/AWSGlueServiceRole-S3-Crawler',
    Command={
        'Name': 'glueetl',  # Spark ETL job type
        'ScriptLocation': 's3://[location_of_glue_boiler_plate]',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-disable',
        '--job-language': 'python',
        '--TempDir': 's3://[location]',
        '--FROM_ADDRESS': 's3://[location_read_from]',
        '--TO_ADDRESS': 's3://[folder_to_write_to]'
    },
    Timeout=2880,
    MaxCapacity=10
)
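As noted in the lessons learned, a few non-UTF-8 sources were converted outside of Glue with Dask. A minimal sketch of that kind of conversion is below; the bucket paths and the assumed latin-1 encoding are placeholders, and s3fs plus a Parquet engine such as pyarrow need to be installed.

# Minimal sketch: convert non-UTF-8 CSVs to Parquet with Dask.
# Paths and encoding are placeholders; requires s3fs and pyarrow.
import dask.dataframe as dd

ddf = dd.read_csv('s3://[stage-bucket]/[table]/*.csv',
                  encoding='latin-1',   # assumed source encoding
                  dtype=str,
                  blocksize='64MB')

ddf.to_parquet('s3://[raw-bucket]/[table]/', write_index=False)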

4. Data Science

Implementation:

We used SageMaker Studio for our regular data science workflow because it offers a hosted Jupyter Notebook / JupyterLab environment. Pandas can connect directly to S3 and read data from it as if the work were being done locally.

While building our initial prototype on-premises, we tried a number of different modelling approaches, including moving averages, exponential smoothing, SARIMA, SARIMAX, and Facebook Prophet. We found SARIMA to be the best-performing model and decided to use it as our approach.
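For context, fitting a SARIMA model for a daily census series looks roughly like the sketch below. The column names, the (1, 1, 1) orders, and the weekly seasonal period of 7 are illustrative assumptions, not the tuned parameters of our production model.

# Illustrative SARIMA fit for a daily census series using statsmodels.
# Column names, orders, and the weekly seasonality (s=7) are assumptions.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

census = pd.read_csv(f"s3://{bucket}/{file}",
                     parse_dates=['date'],
                     index_col='date')['census']

model = SARIMAX(census, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
fitted = model.fit(disp=False)

# forecast the next seven days of census
forecast = fitted.get_forecast(steps=7).predicted_mean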

Because the code and the data were both hosted in the same secure cloud environment, we were able to easily collaborate with each other, look at code and figures as needed, and replicate each other’s work in seconds.

Lessons Learned:

  • For forecasting, simple models typically work best for short-term predictions. No machine learning model outperformed the statistical approaches on our data sets.
  • Use a rolling-horizon train/test evaluation with a persistence-model benchmark to determine the relative benefit of more complex models.
  • Hospitals don’t typically use version control for model development, but working with AWS CodeCommit was essential for us to establish a collaborative environment.

Code Snippet:

# reading data from S3 with pandas is as easy as 1, 2, 3
import pandas as pd

df = pd.read_csv(filepath_or_buffer=f"s3://{bucket}/{file}")

5. Model Deployment

Implementation:

Once we were happy with our model, it was time to deploy it. Because our use case needed only one set of predictions per day, we decided to use SageMaker Processing Jobs to run the model. We converted the .ipynb Jupyter Notebook into a .py Python script and containerized it with a simple Dockerfile. The resulting image was pushed to Amazon Elastic Container Registry (ECR).

The SageMaker Processing Job we set up pulls the specified container image from the registry and moves the required data from S3 into a specific directory in the container (‘/opt/ml/processing/input_dir’), where the Python script ingests it. The script runs, writing its predictions to an output directory (‘/opt/ml/processing/output_dir’), after which the SageMaker Processing Job uploads them back to S3.

The last part of the whole process was figuring out how to trigger the processing jobs. We used a simple Lambda function to do this. The Lambda function listens for events in our data lake; once new data is available for prediction, it creates a processing job. The processing job loads the data from S3 and saves the predictions back to S3.

To deliver the final predictions to users, we decided to create a PDF report and email it. We plotted the predictions using Plotly and Matplotlib, automatically compiled the figures into a PDF report, and configured Amazon SES to deliver the PDF to users.
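A minimal sketch of the emailing step is below; the sender and recipient addresses, region, and report filename are placeholders, and the sender identity must be verified in SES.

# Minimal sketch: email the generated PDF report via Amazon SES.
# Addresses, region, and filename are placeholders; sender must be SES-verified.
import boto3
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

ses = boto3.client('ses', region_name='ca-central-1')

msg = MIMEMultipart()
msg['Subject'] = 'Daily census forecast'
msg['From'] = '[sender@hospital.example]'
msg['To'] = '[recipient@hospital.example]'
msg.attach(MIMEText("Please find today's census forecast attached."))

with open('census_forecast.pdf', 'rb') as f:
    attachment = MIMEApplication(f.read())
    attachment.add_header('Content-Disposition', 'attachment',
                          filename='census_forecast.pdf')
    msg.attach(attachment)

ses.send_raw_email(Source=msg['From'],
                   Destinations=[msg['To']],
                   RawMessage={'Data': msg.as_string()})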

Lessons Learned:

  • Deploying batch model predictions has different technical requirements than serving a model endpoint.
  • There are many AWS offerings for serving model results (e.g., QuickSight, emailed reports, a static website, etc.).
  • Creating a containerized model with a Dockerfile, run via SageMaker Processing or AWS Batch, is suitable for delivering most models that operationally require only batch access.

Code Snippet:

We executed the following code using SageMaker notebooks to build the docker image in AWS.

## move into the container directory and build the docker image
%cd ./docker
! docker build -t image-name .

The code below was used to test the processing job using the created image.

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

# run the processing job in local mode first to validate the container
processor = Processor(image_uri='image-name',
                      role=role,
                      instance_count=1,
                      instance_type='local')

processor.run(inputs=[ProcessingInput(
                          source='s3://[input-data-location]',
                          destination='/opt/ml/processing/input_dir')],
              outputs=[ProcessingOutput(
                          source='/opt/ml/processing/output_dir',
                          destination='s3://[save-output-files-here]')])

Once we were happy with the code, the image was pushed to Amazon ECR.

import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
region = 'ca-central-1'
ecr_repository = 'census'
tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

# Create the ECR repository and push the docker image
!docker build -t $ecr_repository .
!aws ecr create-repository --repository-name $ecr_repository
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

A simple lambda function was used to trigger the creation of a new processing job based on S3 changes.

# lambda_function.py — triggered by S3 events from the data lake
import boto3
from datetime import datetime

sm = boto3.client('sagemaker')


def lambda_handler(event, context):
    # name each job with the run date, e.g. census-2022-01-31
    dt_string = datetime.now().strftime("%Y-%m-%d")
    job_name = f'census-{dt_string}'

    sm.create_processing_job(
        ProcessingJobName=job_name,
        ProcessingInputs=[{
            'InputName': job_name,
            'S3Input': {
                'S3Uri': 's3://[input-data-location]',
                'LocalPath': '/opt/ml/processing/input_dir',
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File'
            }
        }],
        ProcessingOutputConfig={
            'Outputs': [{
                'OutputName': 'predictions',
                'S3Output': {
                    'S3Uri': 's3://[save-output-files-here]',
                    'LocalPath': '/opt/ml/processing/output_dir',
                    'S3UploadMode': 'EndOfJob'
                }
            }]
        },
        ProcessingResources={
            'ClusterConfig': {
                'InstanceCount': 1,
                'InstanceType': 'ml.t3.medium',
                'VolumeSizeInGB': 1
            }
        },
        AppSpecification={
            'ImageUri': '[image-path-from-container-registry]'
        },
        # a full ARN is required here
        RoleArn='arn:aws:iam::[account-id]:role/AWSSageMakerServiceRole'
    )

Next

While this was our first use case, there is a lot more work to be done. For example, we want to make sure new data scientists can contribute to the project, and we want to build a mechanism for monitoring model performance. As we continue building products for operational benefit, we are also continuing to build out our data lake.

We hope this piece helps other hospitals on their journey of deploying AI in practice. Please don’t hesitate to contact us to chat about our experience.