How To Tame Apache Impala Users With Admission Control

A common problem encountered with Apache Impala is resource management. Everyone wants to use as much memory as they can to increase speed or hide query inefficiency, but that isn't fair to other users and can be detrimental to queries supporting important business processes. What we see at many clients is that resources are plentiful when a cluster is freshly built and the initial use cases are being onboarded. Resources aren't a concern until more use cases, data scientists, and business units running ad-hoc queries are added, consuming enough resources to prevent those original use cases from completing on time. The result is query failures, which are frustrating for users and problematic for existing use cases.

The first challenge with Admission Control is manually gathering metrics about individual users and the queries they have run in order to define the memory settings for resource pools. You could use the Apache Impala queries pane and chart builder in Cloudera Manager to go through each user's queries and gather stats, but that is time consuming, tedious, and painful to re-evaluate at a later date. To make informed, accurate decisions about how to allocate resources to various users and applications, we need detailed metrics. We've written a Python script to streamline this process.

The script generates a csv report and does not make any changes. Please review the readme and run the script in your environment.
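To give a feel for what such a collector looks like, here is a minimal sketch, not the actual script. It pages through the Cloudera Manager REST API's impalaQueries endpoint; the specific attribute names (memory_per_node_peak, query_duration) are assumptions and may differ across CM versions, so treat this purely as an illustration.

```python
import csv
import requests

# Illustrative values; point these at your own CM deployment.
CM_URL = "https://cm-host:7183/api/v19"
CLUSTER, SERVICE = "Cluster 1", "impala"
AUTH = ("admin", "admin")

def fetch_queries(from_ts, to_ts, page_size=1000):
    """Page through Impala queries recorded by Cloudera Manager."""
    offset, queries = 0, []
    while True:
        resp = requests.get(
            f"{CM_URL}/clusters/{CLUSTER}/services/{SERVICE}/impalaQueries",
            params={"from": from_ts, "to": to_ts,
                    "limit": page_size, "offset": offset},
            auth=AUTH, verify=False)
        resp.raise_for_status()
        batch = resp.json().get("queries", [])
        queries.extend(batch)
        if len(batch) < page_size:
            return queries
        offset += page_size

def write_report(queries, path="impala_query_report.csv"):
    """Flatten the per-query attributes we care about into a csv."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "query_id", "duration", "mem_per_node_peak"])
        for q in queries:
            attrs = q.get("attributes", {})
            writer.writerow([q.get("user"), q.get("queryId"),
                             attrs.get("query_duration"),
                             attrs.get("memory_per_node_peak")])
```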

The csv report includes overall and per-user stats for query memory usage, including the maximum and 99th-percentile values discussed below.

Every workload on every cluster is different, with a wide range of requirements. As you go through the report, there are a few high-priority items to look for.

In particular, compare the max columns to the 99th-percentile columns. The 99th-percentile columns account for the vast majority (99%) of a user's queries while filtering out bad or errant ones. If any of the max columns are more than 10–20% higher than the corresponding 99th percentile, investigate that user's largest queries to determine whether they were bad queries or whether those few queries could be tuned to use resources more efficiently.
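As an illustration, that check is a few lines of pandas against the generated csv. The column names here are hypothetical; adjust them to match the actual report.

```python
import pandas as pd

# Hypothetical per-query report; column names may differ in the real csv.
df = pd.read_csv("impala_query_report.csv")

# Per-user max and 99th-percentile peak memory.
stats = df.groupby("user")["mem_per_node_peak"].agg(
    mem_p99=lambda s: s.quantile(0.99), mem_max="max")

# Flag users whose max is more than 20% above their 99th percentile: a hint
# that a handful of bad or errant queries are inflating the max.
print(stats[stats["mem_max"] > 1.2 * stats["mem_p99"]])
```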

The settings we are going to define based on this report are:

- Max Running Queries (and Max Queued Queries)
- Max Memory
- Queue Timeout

We'll walk you through how to determine each of these settings for the necessary resource pools. Once those are determined, we'll use the "Create Resource Pool" wizard in CM to create each pool, as shown in the image below.

Max Running Queries: To really gauge this, we would need a separate report that used query start times and durations to track the average, 99th-percentile, and max concurrency for each user (a sketch of that calculation follows below). We suggest keeping this setting as low as the use case allows, because it ultimately drives the Max Memory you want this user or group of users to be able to consume. To keep things simple, we would set Max Queued Queries to the same number as Max Running Queries.
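Such a concurrency report can be derived from the same query data with a simple sweep over start and end events. A sketch, assuming each record carries a start timestamp and a duration in milliseconds:

```python
from collections import defaultdict

def max_concurrency(queries):
    """queries: iterable of (user, start_ms, duration_ms) tuples.
    Returns the peak number of simultaneously running queries per user."""
    events = defaultdict(list)
    for user, start, duration in queries:
        events[user].append((start, 1))               # query starts
        events[user].append((start + duration, -1))   # query ends
    peaks = {}
    for user, evs in events.items():
        running = peak = 0
        # Sorting (timestamp, delta) processes ends (-1) before starts (+1)
        # at the same instant, so back-to-back queries don't overcount.
        for _, delta in sorted(evs):
            running += delta
            peak = max(peak, running)
        peaks[user] = peak
    return peaks
```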

Max Memory: This is calculated as Default Query Memory Limit × number of Impala hosts × Max Running Queries. For example, if we want a resource pool to have a 4 GiB per-node query memory limit and to run 5 queries at a time on a 20-host cluster, Max Memory comes out to 400 GiB.
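The arithmetic is simple enough to sanity-check in a couple of lines:

```python
def pool_max_memory_gib(per_node_limit_gib, num_impala_hosts, max_running_queries):
    """Max Memory = Default Query Memory Limit * hosts * Max Running Queries."""
    return per_node_limit_gib * num_impala_hosts * max_running_queries

# Example from the text: 4 GiB per node, 20 hosts, 5 concurrent queries.
assert pool_max_memory_gib(4, 20, 5) == 400
```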

Queue Timeout: This setting is determined by the concurrency, duration, and SLA of your queries. If a query must complete within 30 seconds and is tuned to run in 20 seconds, then sitting in the queue for more than 10 seconds will violate the SLA. Third-party applications running against Apache Impala may also have their own query timeouts; rather than interfere with those, we would prefer to return an immediate error. For long-running ETL workloads, where data skew can increase query duration, you can extend the timeout to ensure that all queries are queued and run.
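In other words, the queue timeout budget is whatever slack remains between the SLA and the tuned runtime. With the hypothetical numbers above:

```python
def queue_timeout_ms(sla_s, tuned_runtime_s):
    """Time a query can wait in the queue without violating its SLA."""
    return max(0, sla_s - tuned_runtime_s) * 1000

# Example from the text: 30 s SLA, 20 s tuned runtime -> 10 s of queue budget.
assert queue_timeout_ms(30, 20) == 10_000
```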

Default resource pool for users: This is our general pool for anyone on the platform who does not have a justified use case for additional resources. We're setting aside 25% of the cluster resources.

Default resource pool for service accounts: This is a general resource pool for standard workloads being generated by applications or scheduled processes.

Power Users resource pool: This is the resource pool for users who require more resources. user3 may be the only one that qualifies for the Power Users resource pool.

We recommend creating dedicated resource pools for each service account to ensure that resources are protected and not consumed by standard users.
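Putting the example pools together, the settings you would enter into the wizard might look something like the following. All names and numbers are illustrative, sized with the Max Memory formula above for a hypothetical 20-host cluster with 640 GiB reserved for Impala:

```python
# Illustrative pool definitions for the Create Resource Pool wizard.
# Max Memory = per-query limit * 20 hosts * Max Running Queries.
POOLS = {
    "root.default_users": {            # ~25% of the 640 GiB cluster
        "max_memory_gib": 160,         # 2 * 20 * 4
        "default_query_memory_limit_gib": 2,
        "max_running_queries": 4,
        "max_queued_queries": 4,       # mirror Max Running Queries
        "queue_timeout_s": 60,
    },
    "root.service_accounts": {
        "max_memory_gib": 240,         # 4 * 20 * 3
        "default_query_memory_limit_gib": 4,
        "max_running_queries": 3,
        "max_queued_queries": 3,
        "queue_timeout_s": 300,        # longer timeout for scheduled ETL
    },
    "root.power_users": {              # e.g. user3
        "max_memory_gib": 160,         # 4 * 20 * 2
        "default_query_memory_limit_gib": 4,
        "max_running_queries": 2,
        "max_queued_queries": 2,
        "queue_timeout_s": 60,
    },
}
```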
