In Part 1 of this series, I covered tips that can help you diagnose a slow application. You should start by learning how the application works. What protocols does it use? What are the packet flow characteristics? Is QoS required? What is the processing environment? And finally, what are the network’s characteristics?
Also in Part 1, I identified two main groups of problems:
- Client-side processing -- things that happen on the client endpoint
- Network transport -- factors that impact applications on the network
In this article, I’ll look at two more main problem groups:
- Server-side architecture -- application architecture and implementation factors
- Multifunction interactions -- interactions between multiple groups that degrade applications
Server-Side Architecture
The server-side architecture of an application can have a significant impact on how well the application performs. I’ve seen examples where an application server had to make queries against a database on the other side of the country (i.e., many milliseconds away). This type of system design impacts application resilience as well as performance.
1. Chatty applications -- Applications are sometimes designed for high interactivity between components (e.g., between client and server or between an app server and a database server). Application performance suffers as the latency between the components increases. Think about an application that performs hundreds of data exchanges internally before sending a response to the client. The response to the client can be significantly delayed.
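To put rough numbers on the chatty-application effect, here is a minimal sketch that multiplies the number of sequential exchanges by the round-trip time. The function name, the count of 300 exchanges, and the three round-trip times are illustrative assumptions, not measurements from a real system, and processing time is ignored.

```python
def added_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Delay contributed by sequential round trips alone, ignoring processing time."""
    return round_trips * rtt_ms


# Hypothetical figures: 300 sequential exchanges between an app server and its database.
EXCHANGES = 300
for label, rtt_ms in [("same rack", 0.5), ("same metro area", 5.0), ("cross-country", 70.0)]:
    total_ms = added_latency_ms(EXCHANGES, rtt_ms)
    print(f"{label:>16}: {total_ms / 1000:.2f} s spent waiting on the network")
```

The same code path that looks instantaneous when the servers sit side by side can spend many seconds waiting once the round trip grows, which is why reducing the number of exchanges usually helps more than adding bandwidth.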
2. Database-related problems -- A similar problem exists when an application is designed to return a significant volume of data for the client to filter. This is tempting to do with some of the user interface JavaScript libraries for Web-based applications. Re-architecting the application to return smaller amounts of pre-filtered data reduces the CPU required to process the data as well as network bandwidth required to send it. My storybook example is about an application that returned 15MB of data to a client. It took many seconds for four workstations to download the data over a shared 802.11b Wi-Fi link, which also impacted other applications on Wi-Fi. The vendor improved the server architecture, eliminating the problem.
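The arithmetic behind the 15MB example can be sketched as follows. The 11 Mbps figure is the nominal 802.11b data rate. The 5.5 Mbps "usable" figure and the 0.5 MB pre-filtered payload are assumptions chosen only for illustration, and the model assumes the four workstations download at the same time and share the link evenly.

```python
def transfer_seconds(payload_mb: float, clients: int, link_mbps: float) -> float:
    """Time for all clients to finish downloading when they share one link evenly."""
    total_megabits = payload_mb * 8 * clients
    return total_megabits / link_mbps


NOMINAL_80211B_MBPS = 11.0  # nominal 802.11b data rate
ASSUMED_USABLE_MBPS = 5.5   # assumption: roughly half of nominal is usable throughput

for label, payload_mb in [("unfiltered (15 MB)", 15.0), ("pre-filtered (0.5 MB, hypothetical)", 0.5)]:
    best = transfer_seconds(payload_mb, clients=4, link_mbps=NOMINAL_80211B_MBPS)
    likely = transfer_seconds(payload_mb, clients=4, link_mbps=ASSUMED_USABLE_MBPS)
    print(f"{label}: {best:.1f} s at nominal rate, {likely:.1f} s at assumed usable rate")
```

Even at the nominal rate, the unfiltered payload keeps the shared link busy for most of a minute, during which every other Wi-Fi user competes for the same airtime.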
Another database-related problem is application scalability as the number of clients increases. I’ve seen an application run fine in development with one or two clients, but collapse when the client count increased as it went into production. Database locking is often the reason for poor application performance as it scales up.
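A deliberately simplified model shows why locking can go unnoticed in development. It assumes every client requests the same exclusive lock at the same moment and that each transaction holds the lock for a fixed time. The 20 ms hold time and the client counts are hypothetical.

```python
def worst_case_wait_ms(clients: int, lock_hold_ms: float) -> float:
    """Wait experienced by the last client when all clients queue for one exclusive lock."""
    return (clients - 1) * lock_hold_ms


LOCK_HOLD_MS = 20.0  # hypothetical time each transaction holds the lock
for clients in (2, 50, 500):
    wait_ms = worst_case_wait_ms(clients, LOCK_HOLD_MS)
    print(f"{clients:>3} clients: last one waits {wait_ms / 1000:.2f} s")
```

With two clients the wait is barely measurable, while the same hold time with hundreds of clients produces multi-second stalls. Real databases behave in more complicated ways, but the serialized section sets a similar ceiling.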
3. Servers that are far apart -- An application may be distributed between a main and a backup data center, which adds latency to every exchange that crosses between them. A leaf-spine data center design works well for making sure that the latency between servers in the same fabric doesn’t fluctuate, but that benefit does not extend to traffic between data centers. I’ve seen several cases where part of an application was hosted in one data center while another part of the same application was hosted in another data center. The ability to easily migrate parts of an application between data centers is not necessarily a good thing when taken to the extreme. This is a good example of the old axiom, “Just because you can, doesn’t mean you should.”
Multifunction Interactions
The interaction among multiple functions can cause poor application performance, at least from the perspective of the application’s customer. I’m not talking about interactions between multiple applications, which I covered in Part 1. Instead, supporting functions that the application depends on may not be performing correctly, causing overall application performance problems.
4. Domain Name System (DNS) problems -- A misconfigured client could be sending a DNS request to a former, now decommissioned, DNS server. The client may have a long DNS request timeout, making the application startup seem very slow. But once the client switches to another DNS server, the requests are quickly resolved. The main symptom is that the application is slow to start, but then runs fine for some time.
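The slow-start symptom can be approximated with a small model: the first lookup waits out one or more timeouts against the dead server before a working server answers. Resolver retry behavior differs between operating systems, so the 5-second timeout, the single timeout before failover, and the 30 ms answer time below are all assumptions for illustration.

```python
def first_lookup_delay_s(timeouts_before_failover: int, timeout_s: float, answer_s: float) -> float:
    """Delay of the first DNS lookup when dead servers are tried before a working one."""
    return timeouts_before_failover * timeout_s + answer_s


ASSUMED_TIMEOUT_S = 5.0  # assumption: per-query resolver timeout
ASSUMED_ANSWER_S = 0.03  # assumption: response time of a healthy DNS server

healthy = first_lookup_delay_s(0, ASSUMED_TIMEOUT_S, ASSUMED_ANSWER_S)
misconfigured = first_lookup_delay_s(1, ASSUMED_TIMEOUT_S, ASSUMED_ANSWER_S)
print(f"healthy client:       {healthy:.2f} s")
print(f"misconfigured client: {misconfigured:.2f} s")
```

A delay that closely matches a multiple of the resolver timeout at application startup is a useful hint that name resolution, not the application, is the place to look first.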
5. Incorrect codec selection -- A client configured to use a high-bandwidth G.711 voice codec over a low-bandwidth connection (e.g., a slow WAN link or a connection over a congested Wi-Fi link) may suffer intermittent periods of high packet loss and high jitter. You should configure the voice/video systems to report calls with poor quality characteristics, along with the codec each call was using.
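To see why codec choice matters on a constrained link, this sketch estimates per-call bandwidth at the IP layer. It assumes 20 ms packetization and 40 bytes of IPv4, UDP, and RTP headers per packet, and it ignores layer 2 overhead. G.729 is included only as a low-bandwidth comparison; it is not mentioned in the scenario above.

```python
IP_UDP_RTP_HEADER_BYTES = 40  # 20 (IPv4) + 8 (UDP) + 12 (RTP)


def ip_bandwidth_kbps(codec_kbps: float, packet_ms: float = 20.0) -> float:
    """One-way IP-layer bandwidth of a voice stream for a given codec bit rate."""
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8 * packets_per_second / 1000


for name, codec_kbps in [("G.711", 64.0), ("G.729", 8.0)]:
    print(f"{name}: {ip_bandwidth_kbps(codec_kbps):.0f} kbps per direction")
```

Under these assumptions, a G.711 call needs roughly three times the bandwidth of a G.729 call in each direction, so a handful of simultaneous calls can saturate a slow WAN link.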
6. Competition for bandwidth -- The network path may be sufficient for the desired business applications, but additional applications are starting to demand network bandwidth. A good example is entertainment traffic that competes with business traffic. One customer of ours had a remote site that had plenty of bandwidth for the business application, but over time the employees started using audio streaming applications (music applications) and movie downloads. These created problems for the business applications. As a solution, we applied QoS and adjusted buffering to prioritize the business application traffic above the non-business applications.
Helpful Tools
Application performance management (APM) tools, available from vendors such as Cisco, New Relic, and Riverbed, are helpful. A good APM tool can tell the difference between a client system processing a transaction and human “think time” on that same client. With the right data feeds, the APM tool can identify back-end application database servers that are creating processing bottlenecks. These tools are great for identifying whether “it’s a network problem” or “it’s an application problem,” and allow IT staff to focus on applying the correct fix. You can achieve the equivalent functionality with packet-capture analysis tools like the open-source Wireshark, but expect it to take much longer.
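As a rough illustration of the manual approach, the sketch below splits one transaction into network delay and server delay using four timestamps read from a capture taken at the client. The class, the field names, and the timestamp values are hypothetical, and the method is a common approximation rather than what any particular APM product does: the TCP handshake gives the network round trip, and whatever remains of the request-to-response gap is attributed to the server.

```python
from dataclasses import dataclass


@dataclass
class TransactionTimestamps:
    """Capture timestamps in seconds, as seen at the client (hypothetical fields)."""
    syn_sent: float
    syn_ack_received: float
    request_sent: float
    first_response_byte: float


def split_delay(t: TransactionTimestamps) -> tuple[float, float]:
    """Return (network_rtt, estimated_server_time) for one transaction."""
    network_rtt = t.syn_ack_received - t.syn_sent
    response_gap = t.first_response_byte - t.request_sent
    server_time = max(0.0, response_gap - network_rtt)
    return network_rtt, server_time


# Made-up values for illustration only.
sample = TransactionTimestamps(0.000, 0.040, 0.041, 1.281)
rtt, server = split_delay(sample)
print(f"network round trip: {rtt * 1000:.0f} ms, server time: {server * 1000:.0f} ms")
```

In this made-up transaction the network accounts for tens of milliseconds and the server for more than a second, which would point the investigation toward the application rather than the network.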
Summary
You may have to assemble a multifunctional team to understand an application and why it’s performing slowly. Ideally, you would employ a so-called full-stack application developer to help with the analysis. In the worst case, you may need to re-architect the application. The challenge is doing this in a production environment.
If you have a lot of applications (some organizations have hundreds), pick the most important or those that trigger the biggest support effort. You may find fixing common causes of slow applications (like a network link that’s not performing correctly) helps many applications. In any case, you’ll learn what to look for and how to streamline the analysis.