Apache Kyuubi Tutorial and Hands on Operation

Mondo Finance Updated on 2024-01-30

Apache Kyuubi is a distributed, multi-tenant gateway that provides serverless SQL on the Lakehouse. It is an open-source distributed SQL engine built on Apache Spark that gives users a unified SQL query interface, letting them query and analyze a variety of data sources with standard SQL syntax. Here is a detailed explanation of Kyuubi:

Background and purpose: Kyuubi's goal is to provide a highly concurrent, scalable, multi-tenant SQL engine for big data analytics. It is built on top of Apache Spark and leverages Spark's distributed computing power to handle large-scale datasets.

Core features:

- SQL compatibility: supports standard SQL queries, so users can work with familiar SQL syntax.
- Distributed query engine: leverages Apache Spark for distributed query execution and computation.
- Connection pooling: provides a connection pooling mechanism to manage and reuse connections efficiently and improve performance.
- Multi-tenancy: allows multiple users or applications to share the same Kyuubi server while ensuring isolation.
- Authentication and authorization: integrates authentication and authorization mechanisms to secure data access.
- Extensible data source support: connects to a variety of data sources, including Hive, HBase, and other Spark-compatible data stores.

Architecture and how it works

The Kyuubi architecture consists of a client, a Kyuubi server, and a Spark cluster. The client connects to the Kyuubi server via a JDBC or ODBC driver, and Kyuubi distributes the query to the underlying Spark cluster for processing.

Deployment and configuration: Kyuubi can be configured with various parameters, including connection pooling, authentication methods, the configuration of Spark applications, and more. For detailed configuration information, refer to Kyuubi's official documentation.

Usage scenarios: Kyuubi suits scenarios that require big data analysis, especially environments that must support concurrent queries from many users. The SQL query interface lets users easily retrieve and analyze data from a variety of data sources.

Community and maintenance: Kyuubi is an active open-source project with a growing community. Users can participate in discussions, report issues, and contribute through channels such as GitHub.

Overall, Kyuubi is a powerful distributed SQL engine that provides a high-performance, multi-tenant SQL query service by consolidating the compute power of Apache Spark. In the field of big data analytics, it offers a flexible, scalable solution.

Official documentation: the software can be used as a Spark gateway just like Livy, but Livy only supports Spark 2.x. If you are interested in Livy, you can refer to my following articles:

- Spark Open Source REST Service - Apache Livy (Spark Client)
- [Cloud Native] Apache Livy on K8S Explanation and Practical Operation

The basic technical architecture of the Kyuubi system is shown in the following figure.

The middle part of the diagram is the main part of the Kyuubi server, which handles connection and execution requests from the clients shown on the left side of the image. In Kyuubi, connection requests are maintained as Kyuubi Sessions, and execution requests are maintained as Kyuubi Operations bound to the corresponding sessions.

The creation of a Kyuubi session falls into two cases: lightweight and heavyweight. Most sessions are created lightweight and are imperceptible to the user. The only heavyweight case is when the SparkContext is not yet instantiated or cached in the user's shared domain, which usually happens when a user connects for the first time or after a long idle period. This one-time-cost session maintenance model can meet most ad-hoc, rapid-response needs.

Kyuubi maintains connections to SparkContexts in a loosely coupled manner. These SparkContexts can be Spark programs created locally by the service instance in client deployment mode, or created in a YARN or Kubernetes cluster in cluster deployment mode. In High Availability mode, these SparkContexts can also be created by a Kyuubi instance on another machine and then shared by that instance. These SparkContext instances are essentially remote query execution engines hosted by the Kyuubi service. They are implemented on Spark SQL and maximize its power by compiling, tuning, and executing SQL statements end to end, and by interacting with metadata (e.g., Hive Metastore) and storage (e.g., HDFS) services as needed. They manage their own lifecycles and caching, and are unaffected by failover on the Kyuubi server.

You can choose physical deployment, or container deployment if you are just testing; here, container deployment is chosen for quick testing. The physical deployment and container deployment tutorials are as follows:
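The engine (SparkContext) caching and sharing behavior described above is configurable. A hedged sketch of the relevant kyuubi-defaults.conf settings, with key names taken from the Kyuubi documentation and illustrative values:

```properties
# Scope in which engines (SparkContexts) are cached and shared across sessions:
# CONNECTION, USER (the default), GROUP, or SERVER.
kyuubi.engine.share.level=USER

# How long an idle engine is kept alive before shutting itself down (ISO-8601 duration).
kyuubi.session.engine.idle.timeout=PT30M
```

With USER-level sharing, a second connection from the same user reuses the already-running engine, which is what makes most session creations "lightweight".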

- Introduction to the principle of big data Hadoop + installation + practical operation (HDFS + YARN + MapReduce)
- Big data Hadoop - data warehouse Hive
- Big data Hadoop - computing engine Spark
- Rapid deployment of Hive through docker-compose detailed tutorial

Deploy the package (the repository URL is omitted in the original):

git clone <docker-compose-hadoop repository URL>
docker network create hadoop-network
Deploy MySQL:

cd docker-compose-hadoop/mysql
docker-compose -f mysql-compose.yaml up -d
docker-compose -f mysql-compose.yaml ps

The root password is 123456, and the login command follows. Note that in a company you generally should not type the password in plaintext on the command line, otherwise it is easy to be flagged by security, remember!

docker exec -it mysql mysql -uroot -p123456
Deploy Hadoop and Hive (the subdirectory name here is reconstructed from the garbled original):

cd docker-compose-hadoop/hadoop-hive
docker-compose -f docker-compose.yaml up -d
docker-compose -f docker-compose.yaml ps

# Test Hive
docker exec -it hive-hiveserver2 hive -e "show databases;"
# Test HiveServer2
docker exec -it hive-hiveserver2 beeline -u jdbc:hive2://hive-hiveserver2:10000 -n hadoop -e "show databases;"
Download and unpack Kyuubi (the download URL is omitted in the original):

wget --no-check-certificate <apache-kyuubi-1.8.0-bin.tgz download URL>
tar zxf apache-kyuubi-1.8.0-bin.tgz
For more information about Spark, please refer to my previous article: Big Data Hadoop - Computing Engine Spark

Download and unpack Spark (the download URL is omitted in the original):

wget --no-check-certificate <spark-3.3.2-bin-hadoop3.tgz download URL>
tar -xf spark-3.3.2-bin-hadoop3.tgz
Modify the configuration file:

Go to the Spark configuration directory and copy a template configuration:

cd spark-3.3.2-bin-hadoop3/conf
cp spark-env.sh.template spark-env.sh
Add the following configuration to spark-env.sh:

# Hadoop configuration file directory
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# YARN configuration file directory
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Spark directory
export SPARK_HOME=/opt/apache/spark-3.3.2-bin-hadoop3
# Spark executable directory
export PATH=$SPARK_HOME/bin:$PATH
Append the following to the /etc/profile file:

export SPARK_HOME=/opt/apache/spark-3.3.2-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
Testing:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --num-executors 3 \
  --executor-memory 1g \
  --executor-cores 1 \
  /opt/apache/spark-3.3.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.3.2.jar 100
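As a side note, the SparkPi example job above estimates π by Monte Carlo sampling. A minimal non-distributed Python sketch of the same idea (SparkPi simply spreads these samples across executors; the function name here is my own):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Estimate pi by sampling random points in the unit square and
    counting how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter-circle area fraction is pi/4, so scale the hit rate by 4.
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

The `100` argument passed to the SparkPi jar plays a similar role: it controls how many partitions of samples the cluster computes.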

cp conf/kyuubi-env.sh.template conf/kyuubi-env.sh
echo 'export JAVA_HOME=/opt/apache/jdk1.8.0_212' >> conf/kyuubi-env.sh
echo 'export SPARK_HOME=/opt/apache/spark-3.3.2-bin-hadoop3' >> conf/kyuubi-env.sh
cp conf/kyuubi-defaults.conf.template conf/kyuubi-defaults.conf

Set the Kyuubi bind address to localhost. If you do not uncomment this setting, you cannot connect with localhost and must use the host's IP address instead:

vi conf/kyuubi-defaults.conf
kyuubi.frontend.bind.host localhost
bin/kyuubi start

Use Kyuubi's built-in beeline client tool.

bin/beeline -u 'jdbc:hive2://localhost:10009/' -n hadoop

Apache Kyuubi supports not only Spark 3 but also Presto (Trino), Flink, and other engines.
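Per the Kyuubi documentation, engine selection is controlled by the kyuubi.engine.type property. A sketch of switching the session engine in kyuubi-defaults.conf (the value list follows the docs; availability depends on your Kyuubi build and the engine binaries installed):

```properties
# Which engine Kyuubi launches for a session:
# SPARK_SQL (the default), FLINK_SQL, TRINO, HIVE_SQL, JDBC, ...
kyuubi.engine.type=FLINK_SQL
```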

Currently, Kyuubi supports load balancing to make the entire system highly available. Load balancing is designed to optimize the use of all Kyuubi service units, maximize throughput, minimize response times, and avoid overloading individual units. Using multiple Kyuubi service units with load balancing instead of a single unit can improve reliability and availability through redundancy.

Kyuubi leverages redundant service instances in a group by using Apache ZooKeeper to provide continuous service in the event of a failure of one or more components.

For more information about the introduction and deployment of ZooKeeper, please refer to the following articles:

- Distributed Open Source Coordination Service - ZooKeeper
- [Cloud Native] ZooKeeper + Kafka on K8s Environment Deployment
- [Middleware] Rapid Deployment of ZooKeeper through docker-compose - Nanny tutorial

Here, for rapid testing, docker-compose is chosen to deploy ZooKeeper.

vim /opt/kyuubi/conf/kyuubi-defaults.conf

kyuubi.ha.addresses 192.168.182.110:31181,192.168.182.110:32181,192.168.182.110:33181
kyuubi.ha.namespace kyuubi
bin/kyuubi restart
bin/beeline -u 'jdbc:hive2://192.168.182.110:31181,192.168.182.110:32181,192.168.182.110:33181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kyuubi' -n hadoop
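In HA mode, the beeline JDBC URL points at the ZooKeeper ensemble rather than a single Kyuubi host, so the driver can discover whichever instance is alive. A hypothetical helper (the function name is my own, not part of Kyuubi) that assembles such a URL from the addresses and namespace configured above:

```python
def ha_jdbc_url(zk_addresses, namespace):
    """Build a HiveServer2 JDBC URL that uses ZooKeeper service discovery.

    serviceDiscoveryMode=zooKeeper tells the Hive JDBC driver to pick a
    live Kyuubi instance registered under the given ZooKeeper namespace
    instead of connecting to a fixed host:port.
    """
    hosts = ",".join(zk_addresses)
    return (f"jdbc:hive2://{hosts}/;"
            f"serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}")

url = ha_jdbc_url(
    ["192.168.182.110:31181", "192.168.182.110:32181", "192.168.182.110:33181"],
    "kyuubi",
)
print(url)
```

If one Kyuubi instance goes down, reconnecting with the same URL transparently lands on another registered instance.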

For more on the API interfaces and their use, please refer to the official documentation: Clients - Apache Kyuubi. If you have any questions, you can also follow my big data and cloud native technology sharing for technical exchanges; if this article helped you, please like and favorite it.
