Using AWS for EHDS: A technical guide to building a secure health data platform

by PressRex

In an earlier post, Build secure and scalable data platforms for the European Health Data Space (EHDS) with AWS, we discussed a reference architecture for building secure and scalable data platforms for secondary usage of health data in alignment with the European Health Data Space (EHDS) using Amazon Web Services (AWS). These data platforms require comprehensive data governance that addresses legal compliance, security, and scalability while keeping management overhead to a minimum.

This follow-up post walks you through the technical implementation details for building such federated data governance and analysis platforms using AWS. Whether you are a healthcare organization, technology provider, or systems integrator, this post aims to equip you with the technical knowledge to build a secure data platform for secondary usage of health data in alignment with the EHDS Regulation. At the core of this implementation is Amazon DataZone, an out-of-the-box data management service that offers a broad set of fine-grained access controls and data governance configurations. As illustrated in Figure 1, this enables secure data sharing and collaboration across multiple AWS accounts, teams, and departments.

Figure 1. High-level architecture for secondary usage of health data in alignment with EHDS. The major components are an Amazon Simple Storage Service (Amazon S3) bucket, AWS Glue, Amazon Athena, and Amazon DataZone.

Considerations

The EHDS Regulation aims to provide a technology-agnostic framework to enable data sharing across systems. Although connections to external and third-party sources are supported through AWS Glue, the integration across different technological stacks requires more customized approaches and is out of the scope of this post. Hence, this post outlines how to build a secure data platform for secondary usage of health data across AWS environments.

AWS services mapped to the needs deriving from EHDS  

The EHDS envisions a structured approach to health data management across EU Member States, involving various stakeholders such as ministries of health, regional healthcare agencies, research institutions, healthcare providers, and pharmaceutical companies. The implementation specifics may vary by Member State.

AWS services can support each data journey stage for implementing the EHDS. Initial data handling (extraction, collection, streaming) uses Amazon S3, AWS Glue, and Amazon EventBridge for GDPR-compliant storage and event-driven processing. Data aggregation and transformation rely on AWS Glue and Amazon SageMaker for standardization and pseudonymization, while Amazon DataZone and AWS Identity and Access Management (IAM) enable secure cross-environment trust association and access management.

For secondary use, Amazon Athena and Amazon DataZone facilitate efficient data discovery and querying. DataZone features support provider interactions, governance, and metadata management, while AWS CloudFormation ensures scalable and consistent GDPR-compliant environments through infrastructure as code (IaC). AWS CloudTrail provides comprehensive access monitoring and audit capabilities, with DataZone maintaining data quality tracking and lineage.
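
As a rough illustration of the querying step, the following Python sketch starts an Athena query and polls until it finishes. It assumes a boto3 Athena client is passed in; the function name, database name, and result location are hypothetical and not part of the original solution.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and wait until it reaches a terminal state.

    `athena` is expected to be a boto3 Athena client (or any object that
    offers the same two methods). Returns the query execution ID on success.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        # QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)
```

With real credentials, the client would typically come from `boto3.client("athena")`, and the database would be the one backing the subscribed assets. Polling keeps the sketch short; a production setup would likely add a timeout.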

Solution architecture

The previous blog post showed a high-level architecture for health data sharing between a data holder and a data user in a Health Data Access Body (HDAB) governed account. In this blog post, we provide two scenarios for implementing this architecture, which we are making open source so that you can freely build upon it:

Single account deployment (Scenario 1)

This represents the primary governance account setup where an Amazon DataZone domain serves as the foundation for centralized data management. This scenario is better for development and testing of the Amazon DataZone capabilities. It creates a comprehensive data management infrastructure.

Multi-account deployment (Scenario 2)

This represents an extension of the main setup into a secondary AWS account. It maintains similar infrastructure components but operates in a separate account. This enables cross-account data sharing and allows for distributed data management while maintaining centralized discovery and access control.

We will refer to the primary account configured in the single account deployment option as the governance account and the second account configured in the multi-account deployment as the producer account.

Note: The following architecture and technical implementation are for demonstration purposes.

Figure 2. Reference architecture for secondary usage of health data in alignment with EHDS. The major components are AWS CloudFormation, Amazon SageMaker, Amazon DataZone, AWS Glue, Amazon S3, and IAM.

The solution enhances the Amazon DataZone domain’s centralized data governance with automated workflows for data transformation and publishing. When data is uploaded to an Amazon S3 bucket, it triggers an automation flow that crawls the data with an AWS Glue Crawler, adds it to the AWS Glue Data Catalog, and automatically triggers a data source run. This allows for the data to be made available for publishing as soon as it is added to an S3 bucket.
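
The post does not spell out how this automation is wired, so the following Python sketch is only one plausible shape for it. It assumes two event handlers: one reacting to the S3 upload notification and one reacting to the AWS Glue crawler state change event delivered through Amazon EventBridge. The function names, crawler name, and identifiers are hypothetical; the clients are boto3 Glue and DataZone clients passed in by the caller.

```python
from urllib.parse import unquote_plus


def uploaded_objects(s3_event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    return [
        # Object keys arrive URL-encoded in S3 notifications.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in s3_event.get("Records", [])
        if "s3" in r
    ]


def on_upload(s3_event, glue, crawler_name):
    """Start the crawler if the event reports at least one uploaded object."""
    objects = uploaded_objects(s3_event)
    if objects:
        glue.start_crawler(Name=crawler_name)
    return objects


def on_crawler_state_change(glue_event, datazone, domain_id, data_source_id):
    """Start a DataZone data source run once the crawler reports success."""
    if glue_event.get("detail", {}).get("state") != "Succeeded":
        return None
    return datazone.start_data_source_run(
        domainIdentifier=domain_id,
        dataSourceIdentifier=data_source_id,
    )
```

Splitting the flow into two handlers reflects that a crawler run is asynchronous: the data source run should start only after the crawler has updated the AWS Glue Data Catalog. Starting a crawler that is already running raises an error, so a real handler would need to handle that case.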

Solution deployment

We have prepared two CloudFormation templates that provision and configure the necessary AWS services. These templates provide two deployment options, one for each scenario.

The instructions are split into two implementations, depending on your preference and on whether you can deploy to multiple accounts.

Prerequisites

The following prerequisites are necessary to deploy this solution:

  • Access to at least one AWS account
  • Administrative IAM permissions for Amazon DataZone, Amazon S3, AWS Glue, IAM, and AWS Lake Formation

Note: While the second deployment option uses two AWS accounts, neither deployment option requires an AWS Organizations setup. By default, all data at rest is encrypted using AWS-managed keys through AWS Key Management Service (KMS). You have the flexibility to use customer-managed keys if you prefer more control over your encryption settings.
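
For readers who want a customer-managed key on the data bucket, the following Python sketch shows one way to set default bucket encryption with boto3. It covers Amazon S3 only and is an assumption about how you might apply the setting; the post does not say whether the templates expose a parameter for this. The function names, bucket name, and key ARN are hypothetical.

```python
def sse_kms_configuration(kms_key_arn):
    """Build a default-encryption configuration that uses a specific KMS key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # S3 Bucket Keys reduce the number of requests made to AWS KMS.
                "BucketKeyEnabled": True,
            }
        ]
    }


def use_customer_managed_key(s3, bucket, kms_key_arn):
    """Apply the configuration to a bucket using a boto3 S3 client."""
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=sse_kms_configuration(kms_key_arn),
    )
```

The key policy must also allow the services that read the data, such as AWS Glue and Amazon Athena, to use the key; that part is not shown here.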

Single account deployment

This section will walk you through the steps to configure the governance account.

  1. Clone the repository on GitHub using git. You can also manually download the CloudFormation templates from GitHub.

git clone https://github.com/aws-samples/a-technical-guide-to-building-a-data-platform-for-secure-health-data-use.git

  2. In the governance account, go to the CloudFormation console within your chosen Region. Under the Stacks section, choose the Create Stack dropdown and choose With new resources (standard).
  3. Leave the prerequisites section with the default settings. Under the Specify template section, choose Upload a template file. Upload the mainaccount.yml file.
  4. Enter a Stack name of your preference.
  5. Choose Next. Leave the default settings and select the I acknowledge that AWS CloudFormation might create IAM resources checkbox.
  6. Review the settings and choose Submit.

After a few minutes, the deployment is complete and you should see an Amazon DataZone domain created. You can also get the data portal’s URL in the outputs of the CloudFormation template.
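
The console steps above can also be approximated in code. The following Python sketch assumes a boto3 CloudFormation client and a local copy of mainaccount.yml. The function names are hypothetical, and passing both IAM capabilities is an assumption made so that the call works whether or not the template names its IAM resources.

```python
def stack_outputs(description):
    """Turn a describe_stacks response into a {key: value} dictionary."""
    outputs = description["Stacks"][0].get("Outputs", [])
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}


def create_stack_arguments(stack_name, template_body):
    """Build the arguments for create_stack, acknowledging IAM resource creation."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        # Programmatic equivalent of the console acknowledgement checkbox.
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }


def deploy_stack(cloudformation, stack_name, template_path):
    """Create the stack, wait for completion, and return its outputs."""
    with open(template_path, encoding="utf-8") as handle:
        template_body = handle.read()
    cloudformation.create_stack(**create_stack_arguments(stack_name, template_body))
    cloudformation.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_outputs(cloudformation.describe_stacks(StackName=stack_name))
```

Calling `deploy_stack(boto3.client("cloudformation"), "ehds-governance", "mainaccount.yml")` would return the stack outputs, which include the data portal URL mentioned above. The stack name here is only an example, and inline template bodies have a size limit, so a large template would need to be uploaded to Amazon S3 first.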

The single account deployment is now complete. You can test the solution by moving to the section: Upload and publish data. 

If you would like to deploy a multi-account environment, follow this next section for setting up and configuring the multi-account deployment.

Multi-account deployment

This section will walk you through the steps to configure the producer account. You need to have completed the single account deployment steps to follow these instructions.

  1. In the governance account, navigate to the Amazon SageMaker platform. On the dashboard, choose View existing domains for Amazon DataZone and choose the domain with the prefix: DataZoneDomain-<stack-name>
  2. On the domain page, scroll down and on the first tab named “Account associations” choose Request association, as shown in the following figure.
    Figure 3. Amazon DataZone account association.
  3. A new page named “Associate account” opens. Enter the AWS Account ID of the data producer account that you want to associate.
  4. Under RAM Policy, make sure to choose AWSRAMPermissionDataZonePortalReadWrite. Choose Request Association, as shown in the following figure.
    Figure 4. Attach permissions to account association.
    You have now requested to associate the producer account with the domain. This allows members of the associated account to access the domain’s data portal and start publishing data.
  5. Still in the governance account within the Amazon DataZone domain page, go to the User management tab, as shown in the following figure.
  6. Choose IAM Users from the dropdown. Copy the ARN of the role that contains DataZoneDomainExecutionRole. You need it in a later step, so paste it into a text editor for now.
    Figure 5. Add DataZoneDomainExecutionRole.
    Next, you create the necessary resources and permissions to run the CloudFormation stack in the producer account.
  7. Log in to your producer account, ensuring you are in the AWS Region where you deployed the Amazon DataZone domain in the governance account.
  8. On the Amazon SageMaker Platform console, choose View requests. You will see the association request from the governance account. Choose the request and select Review Request.
  9. A new page opens named Accept & configure AWS association. Choose Accept new permissions.
  10. Once the association has been created, select the associated domain. Copy the IAM role shown under the Data portal URL into your text editor, as shown in Figure 6.
    Figure 6. Copy Data portal User ARN.
  11. Scroll down to the Default Blueprints section. Select the Default Data Lake option and choose “Enable”. On the next page enable the option for hybrid mode in the Data location registration section. Leave everything as default and choose “Enable Blueprint.”
    Figure 7. Enable Default Data Lake Blueprint.
    Next, you create an IAM role for a Lambda function, allowing it to perform actions on the Amazon DataZone domain.
  12. Still in the producer account, visit the IAM console. Choose Roles, then Create role.
  13. Make sure AWS service is chosen. In the Service or use case dropdown, choose Lambda.
  14. Choose Next. Don’t add any permissions, and choose Next again.
  15. Give a name to this role, such as Lambda-role. Choose Create role.
  16. After creation, search for your role and select it. It should not have any permissions for now. Choose Add permissions and Create inline policy.
  17. Give the policy a name and choose Create policy. Lastly, copy the ARN of this newly created IAM role for your Lambda function and paste it into your text editor.
    You must make sure the producer account can access the domain, as the portal’s URL is still greyed out.
  18. Log back in to your governance account and visit the Amazon DataZone console. Go to the Amazon DataZone domain and, under the User management tab, choose

For the inline policy that you create in step 16, choose the JSON editor view and paste in the following policy. Replace the placeholder ARN with the ARN of the Amazon DataZone domain execution role that you copied in step 6.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "<DataZone-domain-execution-role-arn>"
        }
    ]
}
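
The Lambda role from steps 12 to 17, together with the inline policy above, could also be created programmatically. The following Python sketch assumes a boto3 IAM client; the function names, role name, and policy name are hypothetical. The trust policy for lambda.amazonaws.com mirrors what the console sets up when you choose Lambda as the use case.

```python
import json


def lambda_trust_policy():
    """Trust policy that lets the Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def assume_domain_role_policy(domain_execution_role_arn):
    """Inline policy allowing the role to assume the domain execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": domain_execution_role_arn,
            }
        ],
    }


def create_lambda_role(iam, role_name, policy_name, domain_execution_role_arn):
    """Create the role, attach the inline policy, and return the role ARN."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(assume_domain_role_policy(domain_execution_role_arn)),
    )
    return role["Role"]["Arn"]
```

The returned ARN is the value that step 17 asks you to keep in your text editor.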