Product: TIBCO Spotfire®
Can all implemented Hadoop methods run on multiple nodes in a distributed manner?
In Team Studio, Hadoop operators can implement several algorithms, depending on the conditions and configuration of the cluster. Can all of the implemented methods used in Team Studio run on different configurations of a Hadoop cluster?
1) Can these operators run on single nodes vs. multiple nodes?
2) Can you force an operator to run in a specific way, for example by copying all the data to one node and computing there, without distributing the work or using the compute power of the other nodes in the cluster?
Hadoop operators in Team Studio do not take into account how the Hadoop cluster is configured. For a workflow, Team Studio builds the code and submits it to the cluster. The cluster then determines the best way to process the job at run time. This is decided entirely by the Resource Manager, unless the user who submits the job specifies otherwise.
A workflow or job can run on a single node if:
- the cluster has only a single node
- resource usage is restricted, either through config files or from the source code
- the code is written for sequential execution, without using a parallel/distributed framework (MapReduce or Spark)
Users can also write operators that run on the server where Team Studio is installed (in-memory PCA is an example), but this is not recommended because the Team Studio application already consumes a lot of that server's resources.
In Spark, you can force all the computation onto a single node by using a single partition for your DataFrame (df.repartition(1)).
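The single-partition approach can be sketched in PySpark as follows. This is only an illustration, not code from Team Studio: the helper name to_single_partition, the demo function, and the toy DataFrame are assumptions made for the example. Note that only the stages that consume the repartitioned DataFrame run as a single task; reading the input may still be distributed.

```python
def to_single_partition(df):
    """Return a DataFrame whose rows all live in one partition.

    Stages that consume the result run as a single task, so that part of
    the job executes on one executor instead of being spread out.
    """
    return df.repartition(1)


def demo():
    """Hypothetical usage; requires pyspark and a working Spark setup."""
    # Imported lazily so the helper above does not need pyspark to be read.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("single-partition-demo").getOrCreate()
    df = spark.range(0, 1000)  # toy DataFrame with a single "id" column
    single = to_single_partition(df)
    # Report how many partitions the repartitioned DataFrame has.
    print(single.rdd.getNumPartitions())
    spark.stop()
```

Calling demo() on a machine with Spark available prints the partition count of the repartitioned DataFrame.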
In MR, you can force all the computation onto a single node by using a single mapper and a single reducer.
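One possible way to request a single mapper and a single reducer is through standard Hadoop job properties passed as -D generic options. The sketch below only builds the command line. It assumes the job's driver uses ToolRunner (so that -D options are honored) and that the input is a single file read through FileInputFormat. The function names, jar path, class name, and input/output paths are hypothetical placeholders.

```python
def single_task_mr_options(min_split_bytes=1 << 40):
    """Build -D options that ask for one reducer and very large input splits.

    mapreduce.job.reduces=1 requests a single reduce task. Raising the
    minimum split size above the input file size keeps the file in one
    split, so a single map task is created for it.
    """
    return [
        "-D", "mapreduce.job.reduces=1",
        "-D", f"mapreduce.input.fileinputformat.split.minsize={min_split_bytes}",
    ]


def build_command(jar, main_class, input_path, output_path):
    """Assemble a 'hadoop jar' command; all arguments are placeholders."""
    # Generic options go after the main class and before the job's own arguments.
    return ["hadoop", "jar", jar, main_class,
            *single_task_mr_options(), input_path, output_path]


# Hypothetical job: prints the command instead of running it.
print(" ".join(build_command("my-job.jar", "com.example.MyJob", "/data/in", "/data/out")))
```

With these settings each phase runs as one task. Where those tasks are placed is still up to the Resource Manager.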
If you are asking about the operators that ship with the product: no, we do NOT artificially restrict anything to run on a single node, but the code is structured so that it will run on a single node if it has to.