Integration architecture focusing on security and access management

1. Introduction
Microsoft Fabric and Azure Databricks are both powerhouses in the data analytics field. Both platforms can be used end-to-end in a medallion architecture, from data ingestion to creating data products for end users. Azure Databricks excels in the initial stages due to its strength in processing large datasets and populating the different zones of the lakehouse. Microsoft Fabric performs well in the later stages, when data is consumed. Building on Power BI, its SaaS setup is easy to use and provides self-service capabilities to end users.
Given the different strengths of these products, and given that many customers do not have a greenfield situation, a strategic decision can be to integrate the products. It is then important to find a logical integration point where both products “meet”. This must be done with security in mind, as this is a top priority for all enterprises.
This blog post first explores three different integration options: lakehouse split, virtualization with shortcuts, and exposing via SQL API. The SQL API is a common integration point between back end and front end, and the security architecture of this integration is discussed in more detail in chapter 3. See the architecture diagram below.
2. Azure Databricks - Microsoft Fabric integration overview
Before diving into the details of securing the SQL API architecture, it is helpful to briefly discuss the different options for integrating Azure Databricks and Microsoft Fabric. This chapter outlines three options, highlighting their advantages and disadvantages. For a more extensive overview, refer to this blog.
2.1 Lakehouse split: Bronze, silver zone in Databricks | gold zone in Fabric
In this architecture, data is processed by Databricks up to the silver zone. Fabric copies the data and processes it to the gold zone in Fabric using V-Ordering. Gold zone data is exposed via a Fabric lakehouse so that data products can be created for end users, see the image below.
The advantage of this architecture is that data is optimized for consumption in Fabric. The disadvantage is that the lakehouse is split over two tools, which adds complexity and can create challenges in data governance (Unity Catalog for bronze/silver, but not for gold).
This architecture is most applicable to companies that place a strong emphasis on data analytics in Microsoft Fabric and may even want to eventually migrate the entire lakehouse to Microsoft Fabric.
2.2 Virtualization: Lakehouse in Databricks | shortcuts to Fabric
In this architecture, all data in the lakehouse is processed by Databricks. Data is virtualized to a Microsoft Fabric lakehouse using ADLSgen2 shortcuts or even a mirrored Azure Databricks Unity Catalog in Fabric, see also the image below.
The advantage of this architecture is that the lakehouse is owned by a single tool, which gives fewer challenges in integration and governance. The disadvantage is that data is not optimized for Fabric consumption. You may therefore need additional copies in Fabric to apply V-Ordering and so optimize for Fabric consumption.
This architecture is most applicable to companies that want to keep the lakehouse Databricks-owned and want to enable end users to do analytics in Fabric, in cases where the lack of V-Ordering is not much of a concern. The latter can be true if the data sizes are not too big and/or end users need a data copy anyway.
2.3 Exposing SQL API: Lakehouse in Databricks | SQL API to Fabric
In this architecture, all data in the lakehouse is again processed by Databricks. However, here the data is exposed to Fabric using the SQL API. For this, you can decide to use a dedicated Databricks SQL Warehouse or serverless SQL. The main difference with the shortcut architecture in section 2.2 is that queries are processed in Databricks rather than in Fabric. This can be compared to a web app firing a SQL query at a database; the query is executed in the database.
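To make the web-app comparison concrete, the sketch below builds (but does not send) an HTTP request that asks a Databricks SQL warehouse to run a query. It assumes the Databricks SQL Statement Execution REST API (POST /api/2.0/sql/statements); the workspace URL, warehouse ID, token and table name are placeholders that do not come from this blog. In the integration described here, Fabric would normally connect through its Databricks connector rather than through hand-written requests, so treat this purely as an illustration of where the query runs.

```python
import json
import urllib.request


def build_statement_request(workspace_url, warehouse_id, token, statement):
    """Build a request that asks a Databricks SQL warehouse to run `statement`.

    The request is only constructed here, not sent. The SQL itself is executed
    by the warehouse in Databricks; the caller only receives the result.
    """
    payload = {
        "warehouse_id": warehouse_id,  # which SQL warehouse executes the query
        "statement": statement,        # the SQL text, evaluated in Databricks
        "wait_timeout": "30s",         # wait up to 30 seconds for a result
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/sql/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values, for illustration only.
request = build_statement_request(
    workspace_url="https://adb-0000000000000000.0.azuredatabricks.net",
    warehouse_id="hypothetical-warehouse-id",
    token="hypothetical-token",
    statement="SELECT region, SUM(amount) FROM gold.sales.orders GROUP BY region",
)
print(request.get_method(), request.full_url)
```

Only the SQL text and the result set cross the boundary between the two products, which is what keeps the coupling low compared to sharing storage.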
The advantage of this architecture is that the lakehouse is owned by a single tool, which gives fewer challenges in integration and governance. Also, the SQL API provides a clean interface between Azure Databricks and Microsoft Fabric with less coupling compared to shortcuts. The disadvantage is that end users in Fabric are limited to Databricks SQL and Fabric is merely used as a reporting tool rather than an analytics tool.
This architecture is most applicable to companies that want to keep the lakehouse Databricks-owned and are looking to enhance Azure Databricks with the Power BI capabilities that Microsoft Fabric offers.
In the next chapter, a security architecture is discussed for this SQL API integration.
3. Exposing SQL API: security architecture
In this chapter, a security architecture is discussed for the SQL API integration. The rationale is that the SQL API is a common touch point where back end and front end meet. Furthermore, most security recommendations are also applicable to the other architectures discussed earlier.
3.1 Advanced SQL API architecture
To achieve defense in depth, network isolation and identity-based access control are the two most important steps. You can find this in the diagram below, which was already provided in the introduction of this blog.
In this diagram, three key connectivities that need to be secured are highlighted: ADLSgen2 - Azure Databricks connectivity, Azure Databricks - Microsoft Fabric connectivity and Microsoft Fabric - end user connectivity. In the remainder of this chapter, the connectivity between the resources is discussed, focusing on networking and access control.
It is not in scope to discuss how ADLSgen2, Databricks or Microsoft Fabric can be secured as products themselves. The rationale is that all three are major Azure products with extensive documentation on how to achieve this. This blog focuses on the integration points.
3.2 ADLSgen2 - Azure Databricks connectivity
Azure Databricks needs to fetch data from ADLSgen2 with Hierarchical Namespace (HNS) enabled. ADLSgen2 is used as storage since it provides the best disaster recovery capabilities. This includes point-in-time recovery integration with Azure Backup coming in 2025, which gives better protection against malware attacks and accidental deletions. The following networking and access control practices apply.
Networking: Azure storage public access is disabled. To make sure that Databricks can access the storage account, private endpoints are created in the Databricks VNET. This ensures that the storage account cannot be accessed from outside the company network and that data stays on the Azure backbone.
Identity-based access control: The storage account can only be accessed via identities, and access keys are disabled. To allow Databricks Unity Catalog access to the data, the Databricks access connector identity needs to be granted access using an external location. Depending on the data architecture, this can be an RBAC role on the entire container or a fine-grained ACL/POSIX access rule on the data folder.
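As a rough illustration of the fine-grained option, the sketch below composes the ACL entries that would give one identity read access to a single data folder. It assumes the short-form ADLS Gen2 ACL syntax (user:<object-id>:<permissions>, with a default: prefix for entries inherited by new children) and the POSIX rule that reaching a folder needs execute permission on every parent directory, including the container root. The object ID and folder path are hypothetical, and the function only produces strings; applying them (for example with the Azure CLI or an SDK) is left out.

```python
from pathlib import PurePosixPath


def read_acl_plan(object_id, folder):
    """Return {directory: acl_string} granting `object_id` read access to `folder`.

    Parent directories only get execute (--x), so the identity can traverse
    them without listing their contents. The target folder gets r-x plus a
    matching default entry, so items created later inherit the permission.
    """
    target = PurePosixPath(folder.strip("/"))
    plan = {}
    # For "a/b/c", .parents yields "a/b", "a" and "."; "." is the container root.
    for parent in target.parents:
        key = "/" if str(parent) == "." else str(parent)
        plan[key] = f"user:{object_id}:--x"
    plan[str(target)] = f"user:{object_id}:r-x,default:user:{object_id}:r-x"
    return plan


# Hypothetical object ID of the Databricks access connector's managed identity.
plan = read_acl_plan("00000000-0000-0000-0000-000000000000", "silver/sales/orders")
for directory, acl in sorted(plan.items()):
    print(directory, "->", acl)
```

Default entries only affect items created afterwards, so files that already exist in the folder would need the entry applied recursively. An RBAC role on the container avoids this bookkeeping at the cost of granting broader access.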
3.3 Azure Databricks - Microsoft Fabric connectivity
Microsoft Fabric needs to fetch data from Azure Databricks. This data is used by Fabric to serve end users. In this architecture, the SQL API is used. The networking and identity access control points are also mostly applicable to the shortcut architecture discussed in section 2.2.
Networking: Azure Databricks public access is disabled. This is true for both the front end and the back end, so that clusters are deployed without a public IP address. To make sure that Microsoft Fabric can access data exposed via the SQL API from a network perspective, a data gateway needs to be deployed. One option is to deploy a virtual machine in the Databricks VNET; however, that is an IaaS component that needs to be maintained, which adds security challenges of its own. A better option is to use a managed virtual network data gateway, which is Microsoft-managed and provides the connectivity.
Identity-based access control: Data in Azure Databricks is exposed via Unity Catalog. Data in Unity Catalog shall only be exposed to identities, using fine-grained access control on tables and row-level security. It is not yet possible to use Microsoft Fabric workspace identities to access the Databricks SQL API. Instead, a service principal is granted access to the data in Unity Catalog, and a personal access token based on this service principal is used in the Microsoft Databricks Connector.
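The Unity Catalog side of this can be sketched as a handful of SQL statements. The helper below only generates them as strings; the catalog, schema, table, column, group and service principal application ID are all hypothetical, and the row filter (members of an admin group see everything, everyone else sees one region) is an invented example rule rather than anything prescribed here. The statement forms are meant to follow Databricks SQL syntax for Unity Catalog grants and row filters and should be checked against the current documentation before use.

```python
def unity_catalog_statements(principal, catalog, schema, table,
                             filter_column, allowed_value, admin_group):
    """Generate SQL giving `principal` filtered, read-only access to one table.

    Values are interpolated without escaping, so this is only suitable for
    trusted, hard-coded inputs such as the illustrative ones used here.
    """
    full_schema = f"{catalog}.{schema}"
    full_table = f"{full_schema}.{table}"
    filter_fn = f"{full_schema}.{table}_row_filter"
    return [
        # USE CATALOG and USE SCHEMA are needed before SELECT can be exercised.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {full_schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {full_table} TO `{principal}`",
        # Row filter: a boolean SQL function evaluated for every row.
        f"CREATE OR REPLACE FUNCTION {filter_fn}({filter_column} STRING) "
        f"RETURN IS_ACCOUNT_GROUP_MEMBER('{admin_group}') "
        f"OR {filter_column} = '{allowed_value}'",
        f"ALTER TABLE {full_table} SET ROW FILTER {filter_fn} ON ({filter_column})",
    ]


# Hypothetical names; the principal is the service principal's application ID.
statements = unity_catalog_statements(
    principal="11111111-1111-1111-1111-111111111111",
    catalog="gold",
    schema="sales",
    table="orders",
    filter_column="region",
    allowed_value="EU",
    admin_group="data-admins",
)
for statement in statements:
    print(statement + ";")
```

Because the service principal only receives SELECT on specific tables, the token used by Fabric cannot read more than the reports need, even if it leaks.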
3.4 Microsoft Fabric - end user connectivity
In this architecture, end users connect to Microsoft Fabric to access reports and to do self-service BI. Within Microsoft Fabric, different types of reports can be created based on Power BI. You can apply the following networking and identity-based access controls.
Networking: Microsoft Fabric public access is disabled. Currently, this can only be done at tenant level; more granular workspace private access will become available in 2025, which will allow a company to differentiate between private and public workspaces. To make sure that end users can access Fabric, private endpoints for Fabric are created in the workspace VNET. This VNET can be peered to the corporate on-premises network using VPN or ExpressRoute. The separation of different networks ensures isolation between the different resources.
Identity-based access control: End users should get access to reports on a need-to-know basis. This can be done by creating a separate workspace where reports are stored and to which users get access. Also, users shall only be allowed to log in to Microsoft Fabric with conditional access policies applied. This way, it can be ensured that users can only log in from hardened devices, which helps prevent data exfiltration.
3.5 Final remarks
In the previous sections, an architecture is described where everything is made private and multiple VNETs and jumphosts are used. To get your hands dirty and test this architecture faster, you can decide to test with the simplified architecture below.
In this architecture, Fabric is configured with public access enabled. The rationale is that the Fabric public access setting is currently a tenant-wide setting. This implies that you have to make all workspaces in a company either private or public. More granular workspace private access will become available in 2025. Also, a single subnet is used to deploy all resources, to avoid peering between VNETs and/or deploying multiple jumphosts for connectivity.
4. Conclusion
Microsoft Fabric and Azure Databricks are both powerhouses in the data analytics field. Both tools can cover all parts of the lakehouse architecture, but both tools also have their own strengths. A strategic decision can be to integrate the tools, especially if there is a non-greenfield situation and both tools are used in a company.
Three different architectures to integrate are discussed: lakehouse split, virtualization with shortcuts and exposing via SQL API. The first two architectures are more relevant if you want to put more emphasis on the Fabric analytics capabilities, whereas the SQL API architecture is more relevant if you want to focus on the Fabric Power BI reporting capabilities.
In the remainder of the blog, a security architecture is provided for the SQL API architecture, with a focus on network isolation, private endpoints and identity. Although this architecture focuses on exposing data from Databricks SQL, the security principles are also applicable to the other architectures.
In short: there are numerous things to take into account when deciding if and where to integrate Azure Databricks with Microsoft Fabric. However, this shall always be done with security in mind. This blog aimed to give you an in-depth overview using the SQL API as a practical example.