Product: TIBCO Spotfire®
Multi-Tenant Hadoop clusters and YARN queues
TIBCO Spotfire Data Science supports limited-access workspaces and data sources, which provide an effective way to support multitenancy. In addition to controlling data and workspace access, administrators typically also want to control the Hadoop cluster resources that individual tenants can consume. This can be achieved with YARN queues, and TIBCO Spotfire Data Science can be configured to work with them, as discussed below.
TIBCO Spotfire Data Science Limited data sources
TIBCO Spotfire Data Science data sources can be created with global (‘public’) or ‘limited’ visibility. Limited data sources let you control how different tenants interact with your clusters; a Data admin associates each one with the appropriate workspaces. When a data source does not have Kerberos enabled, users interact with the cluster as the “Data Source User” specified in the data source configuration. For example, one limited data source could be created for the Marketing group, which would access the cluster as “user1”, and a second for the Engineering group, which would access the cluster as “user2”. HDFS and Sentry permissions on the cluster can then be configured so that access to each group’s data is appropriately limited.
For YARN queue utilization, the interaction between Chorus and the Hadoop cluster depends on whether queue mapping has been established for the YARN queues. When the user associated with the Chorus data source has been mapped to a YARN queue, no further configuration within Chorus is required.
The configuration and the rules that determine which application goes to which queue differ depending on the YARN scheduler in use (FIFO, Capacity Scheduler, or Fair Scheduler; the Fair Scheduler is recommended).
By default, the Fair Scheduler uses a single default queue. Creating additional queues and setting the appropriate properties allows finer-grained control over how applications run on the cluster. Some examples of queue customization are discussed in the following Cloudera blog:
However, if queue mapping has not been configured on the Hadoop cluster, the Chorus data source can be configured so that all jobs are assigned to the desired queue. This queue must exist on the cluster, or yarn.scheduler.fair.allow-undeclared-pools must be set to true. This can be achieved by adding the following key-value pairs to the data source:
To submit MapReduce jobs to a specific queue:
To submit Spark jobs to a specific queue:
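As a sketch of what these two pairs might look like: the fragment below assumes the standard Hadoop property mapreduce.job.queuename for MapReduce jobs and the standard Spark property spark.yarn.queue for Spark jobs. The queue name "marketing" is a hypothetical placeholder; replace it with a queue that exists on your cluster.

```properties
# Assumption: standard Hadoop property that selects the YARN queue for MapReduce jobs
mapreduce.job.queuename=marketing

# Assumption: standard Spark-on-YARN property that selects the YARN queue for Spark jobs
spark.yarn.queue=marketing
```

With a setup like this, a separate limited data source per tenant (for example one for Marketing and one for Engineering) could each carry its own queue name, so that jobs from each tenant land in a different queue.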
The Spark queue parameter can be set in three different places, with the following order of precedence (see the attachment for screenshots):
- Spark operator level
- In the flow, using the Workflow variables panel (with prefix @alpine.spark.conf.)
- Data source level
TIBCO Spotfire Data Science Kerberos impersonation
When Kerberos is enabled for a Hadoop data source, Chorus impersonates either the Chorus login user running the workflow or the named data source user, depending on whether the “Impersonate Data Source User” option is selected. For instance, when the login user is impersonated and Chorus user “user1” runs the workflow, the job is submitted to the Hadoop cluster as “user1”. Alternatively, if the “Impersonate Data Source User” option is enabled for a limited data source created for Engineering, the jobs submitted by “eng_user1”, “eng_user2”, and “eng_user3” are all submitted as the specified data source user.
As above, if user mapping has been established for the queues (e.g. by creating a queue per user or by setting yarn.scheduler.fair.allow-undeclared-pools = true), no further action is required. If no mapping has been created for the login users or data source users, the data source can be configured to use a specified queue, as discussed above.