IDERA SQL Diagnostic Manager Virtual Tour




Overview of SQL Diagnostic Manager

Hello, everyone. Welcome to today’s webcast. We’re going to go over SQL Diagnostic Manager and look at key benefits, core features, alerting, reporting, drill down, all that kind of good stuff. Hopefully you guys can hear me. There’s a question and answer section within GoToWebinar where you can type questions to me. If one of you don’t mind maybe confirming the audio and the video are working well, that helps me out. I’m the only one from Idera on the call today. I don’t have anybody checking on that for me, so I would appreciate it. There’s quite a bit to cover. Oh, thank you very much. I appreciate that. There’s quite a bit to cover, so we’re going to go ahead and get started. SQL Diagnostic Manager is a very mature product. It’s been in the marketplace for over 15 years now. We average right around three releases a year.


Again, very mature, built from the ground up to monitor Microsoft SQL Server environments. It’s really able to cover just about any SQL Server out there, right? If it’s on-prem, cloud-based through Azure or AWS, running on Linux, Express, Developer, Enterprise, Standard, SQL 2000 to SQL 2019, frankly, to us, it doesn’t really matter. We have functionality for all those different environments. The goal is to give you guys a comprehensive understanding of how your SQL Server is behaving. We’re doing comprehensive monitoring, looking at everything from operating system metrics into the SQL Server, pulling back big picture information like job and backup information, as well as getting into the really detailed views of individual wait statistics, poor performing queries, deadlocks, blocks, those sorts of things. Again, it’s really trying to give you a really deep insight into how your SQL Servers are behaving. All that information that we gather is stored in a repository.


That repository gives us not only live data, obviously we have that ability, but we can also look at the past, and the past can be done a couple of different ways, right? If I’m looking at history, reporting obviously comes into play, but I also have a feature in the tool called the history browser that lets me roll back the clock, almost like a DVR, where I’m able to drill in and understand what the server was doing at 3:32 in the morning, three weeks ago, as if it was live data. In reality, I’m looking at the past. That’s a big part of what we’ll look at today, how that works, but there’s kind of a third way that I can leverage history, and that is for alerting. We have customers that are monitoring many hundreds and hundreds of SQL Servers. Think about having to go in and tune your alert thresholds on a couple of hundred SQL Servers, because every SQL environment is relatively unique, right?


The alert thresholds aren’t going to be perfect out of the box. You’re going to have to tune some of that stuff. We know how these servers have behaved. We know how the individual metrics related to those individual servers have behaved historically. We can use those historical data sets to better understand where to set alert thresholds as well. There’s quite a bit of history in this product, and it really does help give you a robust visualization as well as drill down into past performance. Now, alerting is another thing that we’ll get into, not only the configuration of alerts, why is this thing red, why is it yellow, but once those things do change into a critical or warning state, what am I going to do about it? There are a number of different ways that alerts can be triggered. There’s notification alerting through email, event log, or an SNMP trap.


There’s even SCOM integration for those of you who are running SCOM, but there’s also the ability to run scripts. As an example, I can run a T-SQL script in response to an alert. I can run a PowerShell command. I can run a SQL Agent job. I can run a command line script. The goal there is to not only notify you, hey, this is a problem right now, but potentially remediate some issues around that automatically without you having to jump into the product and deal with it. So, we’ll get into alerting, and we’ll finish up with reporting, but the goal of the call today is really to help you get a better understanding of what this tool is all about. Now, we’re looking at version 11, which is not that old. I think it’s been out for about a month now. There is a new release coming shortly, hopefully in the next three or four weeks, that has some new functionality in it, but also has some new features around old subjects like AWS.
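The T-SQL response scripts mentioned above can be anything you like. As a hedged illustration only (the database name and backup path are placeholders, not anything SQL Diagnostic Manager ships with), a response to a transaction log alert might simply back up the log so space can be reused:

```sql
-- Hypothetical T-SQL response script for a "transaction log full" alert.
-- [SalesDB] and the backup path are illustrative placeholders.
BACKUP LOG [SalesDB]
TO DISK = N'T:\Backups\SalesDB_log.trn'
WITH COMPRESSION;
```

The point is that remediation can start the moment the alert fires, before anyone opens the console.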


Now, I’m not going to spend a lot of time in PowerPoint. Let me switch over to one more slide. We’ll look at architecture. Then we’ll actually jump into the product.

Architecture of SQL Diagnostic Manager

The architecture of this tool is really simple. We don’t install anything on the SQL Servers that are being monitored. Everything is done remotely. Getting it up and running is pretty easy, right? If you want to do a trial, or if you’re going to purchase the product, et cetera, you can reach out to the Idera website and download the installer from us directly. If you’re doing a trial, by the way, the initial download will come with a predefined trial key that’s good for 15 instances for two weeks. Out of the box, you get two weeks to do whatever you want with the tool, and it’s fully functional. It’s a pretty simple install. Whatever machine you run the installer on, I would consider the DM server, right?


It’s the Diagnostic Manager server. What’s going to happen is, on that box, we’ll install a set of Windows services.

SQL DM Collection Service

There are two main services that make up the back end of Diagnostic Manager: the SQL DM collection service and the SQL DM management service. Now, the collection service really does what it says, right? It’s the piece that’s reaching out, pulling the WMI metrics, pulling the SQL metrics from the DMVs, extended events, et cetera.

SQL DM Management Service

Once that data has been picked up, it hands it to the management service, which parses through all that information. It’s making determinations about alerting. If it sees something it doesn’t like, it sends the email, runs the script or the job for you. Ultimately, the management service is kind of the broker of communication, storing the data and feeding the data out to the end user interfaces as well.

SQL Diagnostic Manager Repository

The repository: when you’re walking through the installer, it’s going to ask you which SQL Server you want to install the repository on.


You can certainly have it on the same machine as the DM services. Generally speaking, if you’re going to monitor more than a hundred instances at a time, we do start recommending around that number that you have a separate server hosting the DM repository from the services, so they don’t compete for resources. It keeps things quick and nimble, but if you’re under that hundred instance number, frankly, you can put it anywhere you want. Generally speaking, the repository is going to grow right around two to three gigs per instance, per year. If you’re going to monitor 10 SQL Servers, you should probably plan for 20 to 30 gigs worth of data. Now, we do groom that data every night. There’s an agent job that you can schedule within the console. You don’t have to go out and edit your jobs, but there’s a job that will kick off at 3:00 AM nightly by default.


It will clear data out of the history of the repository based on different date ranges. The really detailed stuff, looking at things like individual sessions, queries, deadlocks, wait statistics, blocking events, and our history browser data, gets groomed at 31 days. That’s most of the information that we’re collecting, so you have a really detailed month’s worth of data to play with. Above that would be resource utilization, transactions per second, log fullness, and things like that, which we keep for a year, and then growth rates, that is, database and table growth rates, are kept for three years. The reason for that three years is so that you can get better forecasting around data growth in the future. The more data you have to work with, the more accurate those forecasts are going to be. Again, if you leave the defaults in place, and I’ll show you this stuff when we get to the tool, you’re probably looking at two to three gigs per instance, per year.


The repository is supported on 2015 and above, Standard, Developer, and Enterprise. We do not support, and you may have kind of caught this already, but we don’t support Express for the repository, because we need jobs to do things for us. Express doesn’t have jobs, right? Also, you’ll probably run out of space eventually as far as what Express supports. Standard, Developer, or Enterprise, 2015 or above, can be the repository location.

Support for Azure Cloud and AWS RDS Cloud

Again, from a monitoring standpoint, we have support for Azure, AWS, on-prem, and virtualization, which I haven’t mentioned yet. We do have monitoring available, if you want it, included with the product: VMware monitoring as well as Hyper-V monitoring. Now, this is not a VMware or Hyper-V monitoring tool. The idea is that we can pull in some metrics around that technology on the back end to help you understand if there’s anything critical happening from a VM perspective, and then Express, Standard, Developer, Enterprise, SQL 2000 through 2019.

Client, Web, and Mobile Consoles


Now, as far as UI goes, there are a couple of different ways that you can get into the data that we’ve collected. The general idea is that this is a desktop client, or fat client, based application, meaning that there’s a .NET console that you would install on your workstation. It would remotely connect over to wherever your DM application is installed and feed information to your laptop or your desktop. There are web consoles that are also available. That being said, they’re not as feature rich as what you can do in the desktop client. I think of them more as a secondary user interface, but certainly there are web consoles available. You also have the ability to run reporting through SSRS. You don’t have to run all the reports in the desktop client. You have the option, if you would like, to leverage SSRS.


And that gives you a couple of benefits. A, if you just need to build a report, you can hit a URL and go build it. B, because there are subscriptions and extra features in SSRS, like automation through subscriptions and being able to go in there and edit and rebrand, et cetera, it does add in some neat functionality that can save DBAs a lot of time, right? I say DBAs, but it’s not always a DBA that’s going to be your main user, right? Sometimes you don’t have a DBA. One of the really nice things about this tool is, if you’re not a database administrator, it can help you become a better one, right? There’s a lot of monitoring that we’re doing. There are a lot of descriptions around the individual alert metrics that we’re pulling back. There are about 105 unique metrics that we monitor per instance, meaning that I may not even know what an alert is, but there are help articles around those metrics.


Let me show you what I mean, right? If I’m looking at a whole bunch of SQL Servers and I see some alerts, these aren’t the most exciting, but I can say Show Alert Help, for example, on a PLE alert. It’s going to tell me about what PLE is, and if I need more, I can google it and get other resources around it. The good news is that we kind of know what to look for. Therefore, if we see something that doesn’t look right, we can help you understand, hey, there’s a problem here that you may want to address. If you are a seasoned DBA, there’s a ton of really good, deep DBA-type information in here that you can work with to identify root cause and really drill down and understand, both in real time and historically, what’s happening on the back end of the SQL Server.



My environment is not huge. I’m monitoring ten instances of SQL. The view that we’re in right now is just an all servers view, but be aware that you can control and make your own kind of organization within the tool. We use a model called tagging, where we can organize instances together into logical groups. What I mean by that is, if I look at my Austin group, that’s where I’m located, right? There’s a handful of SQL Servers in Austin. Some of these are dev servers. Some of these are prod servers. I have a dev group, and guess what? Those dev servers are there. I have a prod group, and guess what? Those prod servers are there, meaning that you can organize these servers together into different views, and those views are available throughout the tool and leveraged against different features. What I mean by that is, if I take an alert threshold and change what my PLE alert levels may be,


I can change it on server A and then apply it to a group. I don’t have to change it 10 more times, right? Or if I’m setting up a feature called application security, where I’m controlling who can view what, I can do that towards a group, or alert responses, alert configuration, obviously custom counters, maintenance mode. There are lots of things you can do directed towards groups. Maintenance mode, for example: this server here, that’s all grayed out, is in maintenance mode, meaning I’m not monitoring it right now. You can schedule that individually, instance by instance, or you can set it up per group. That’s a nice function. The individual instances themselves are really where you start getting the data back, right? There are different views you can go to, but once we kind of understand this is the server we want to deal with, you select that box from the list and you start getting sessions, queries, resources, databases, services, logs, analyze, right?


SQL Server Performance Monitoring

There’s a bunch of subcategories of data, drilling into the individual SPIDs, maybe going to look at wait statistics, or getting into the resources, right? There’s a lot of information now available for me around this particular SQL Server instance. How does this data get into the tool? What type of data is it? How’s it being monitored? If you break down and look at the properties of an individual instance, most of the monitoring depth, or most of the data, let me rephrase that, most of the information that we’re looking at within this tool is broken down by minute. All right, for the prod one server, I’m doing my data collection once a minute. If we’re looking at these charts behind here, these are basically one minute charts, right? It’s matching up with this polling interval. Another little side note is telling us what type of SQL Server it is.


This is an on-prem Windows server. We can put it in different groups, and then there’s our main polling interval. The idea is that we’re doing a snapshot. We’re collecting a bunch of performance data, basically through WMI calls and a batch of a couple of different queries against the SQL Server, every minute. That information is then stored in the repository, so it’s a slice of time, minute by minute. Now, one minute is a long time in SQL Server. There are some things that are happening between those polling intervals that I’m going to be concerned about. For example, the activity monitor is going to help backfill around deadlocks, autogrowth, and blocking. We generally use extended events for this type of data collection, meaning that it’s persistent. It doesn’t matter if I have a 10 millisecond deadlock and my polling interval is 10 minutes; we’ll still know about every deadlock because we’re collecting it differently.
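As a rough sketch of what that persistent collection looks like under the hood (the session name here is illustrative, not the one SQL DM actually creates), an extended events session that catches every deadlock report could be:

```sql
-- Illustrative extended events session for persistent deadlock capture.
-- The session name is a placeholder, not SQL DM's actual session.
CREATE EVENT SESSION [capture_deadlocks] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION [capture_deadlocks] ON SERVER STATE = START;
```

Because the event session is always on, a deadlock is recorded the moment it happens, regardless of how long the polling interval is.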


Monitoring SQL Queries

Right. Similarly, query monitor. Now, query monitor actually has a couple of different ways that we can gather poor performing queries. I personally use Query Store. I have it enabled on the SQL Server instance, meaning that the diagnostic tool is just going to query the Query Store information and bring that into Diagnostic Manager. DM isn’t actually collecting any poor performing queries in my case; it’s just gathering them from known information in the SQL Server instance. If you don’t use Query Store, and by the way, side note, Diagnostic Manager never enables Query Store. Diagnostic Manager can leverage it, but it doesn’t turn it on. If it’s not enabled, or you’re not on newer versions of SQL, 2016 and above, you can use extended events for that data collection. That goes all the way up to 2019, right? Anything from 2008 to 2019, you can run your query gathering through extended events. The point, I guess, of this is to collect, understand, and view your poor performing SQL statements.
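To make the Query Store path concrete, here is a hedged sketch: enabling Query Store yourself (SQL Server 2016 and above), then reading back the slowest captured queries. The database name is a placeholder, and this is an illustration of the Query Store views, not SQL DM's actual collection query:

```sql
-- Query Store must be enabled by you; the monitoring tool only reads it.
ALTER DATABASE [SalesDB] SET QUERY_STORE = ON;

-- The slowest queries can then be read from the Query Store views
-- (avg_duration is reported in microseconds).
SELECT TOP (10)
       qt.query_sql_text,
       rs.avg_duration / 1000.0 AS avg_duration_ms
FROM sys.query_store_query         AS q
JOIN sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan          AS p  ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
ORDER BY rs.avg_duration DESC;
```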


Monitoring SQL Statements

Now, when I say statements, that’s batches, stored procedures, and triggers as well. What does that mean to you, though? Every environment and your workloads are different, et cetera. Maybe on server A it’s a high CPU count and a high time count; on this server, I’m just looking at time, right? You can put in the values that you think represent where your poor performers are, so that we’re showing you exactly what you need to see. I’ve just said anything over two seconds, I want to know about. This, again, will help kind of backfill that one minute polling interval.
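For comparison, a rough DMV equivalent of that "anything over two seconds" threshold (a sketch, not how SQL DM implements its query monitor) is to pull cached statements whose average elapsed time exceeds two seconds:

```sql
-- total_elapsed_time is in microseconds, so 2 seconds = 2,000,000.
SELECT TOP (20)
       SUBSTRING(st.text, 1, 100)                             AS statement_start,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count / 1000000.0 AS avg_seconds
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.total_elapsed_time / qs.execution_count > 2000000
ORDER BY avg_seconds DESC;
```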

Monitoring Wait Data

There’s a third advanced configuration around monitoring. This is really important data to have at your fingertips, which is wait data, right? Detailed wait data: being able to break down a wait statistic by session, user, application, host, statements, et cetera. This is not done by default 24/7. It’s really done in real time, on demand.


Part of the reason why we do that is to preserve space in the DM repository. It’s not very expensive to collect, but if you’re collecting really detailed waits on a bunch of servers all the time, it’s going to be, I wouldn’t say expensive, but there’s going to be more data growth there. So you have the option. If you want to do what I’m doing, which is collecting indefinitely on some servers: production servers, I always collect indefinitely; dev servers, it kind of depends on what they’re doing, right? Sometimes, if I’m doing a lot of testing in there, A/B testing, things like that, I do want to have that enabled. Otherwise, maybe I don’t turn it on. It’s really up to you, I guess, is the takeaway. One thing that I do point out for new users: generally speaking, I don’t use extended events, and generally speaking, I do not use Query Store, to gather wait data.


If you notice, these two boxes are unchecked at the bottom. The idea is that I’m really relying on dynamic management views to fill in that wait data, and DMVs do a great job. There’s really not a need for extended events or Query Store monitoring to understand wait information, so that’s the way we generally recommend that you use it. Not to get too deep in the weeds, but the idea is that every instance can be configured differently, and that configuration is essentially what we’re going to see here, right? The information on this screen right now is live. If you notice up here, it says real time and then has the timeframe, right? The day. So, live information about the SQL Server. It doesn’t really matter which view I’m in, assuming I’m on live. I drop over to resources, and this is live resources.
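The DMV route those unchecked boxes imply can be sketched roughly like this: who is waiting right now, on what, and for which statement. This is an illustration of the kind of per-session wait detail involved, not SQL DM's actual collection query:

```sql
-- Current waits broken down by session, login, host, and statement.
SELECT r.session_id,
       r.wait_type,
       r.wait_time  AS wait_ms,
       s.login_name,
       s.host_name,
       t.text       AS statement_text
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type IS NOT NULL
  AND s.is_user_process = 1;
```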


Historical Data Analysis

You can see I’ve rebooted right before the call. Sessions, right? So what sessions are out there? I want to get into the session details. These are active sessions, maybe not active per se, but these are the sessions. If you’ve ever run sp_who2, this is basically what you would see, right? There’s lots of live data in here. The problem is, as a DBA, looking at the live view of the SQL Server isn’t always exactly what I need to see. I might need to figure out what happened last week, or what happened 30 minutes ago. One of the nice things you can do in here, and there are a couple of ways you can do this, let me drop over to a chart view. If I’m looking at this chart, it’s basically one hour. Why is it one hour? Well, because that’s how I have it defined.


Maybe I want to look at six hours, right? So I can drop back and look at more time. I can go back pretty far. I can go out five days, 24 hours. The reason why I might be doing this is, I don’t know what I’m looking for, right? Someone might call and complain about a performance problem that happened yesterday. I don’t know anything other than they’re mad at me because something was running slow yesterday. If I back these charts out and look at 24 hours, I can kind of see some interesting things in here, right? This cluster of waits right here, for example, is very different from everything else we saw on the day. There’s one right there, and there’s a couple, right? By stepping back, it can give me an idea of where to go into the detail. One of the cool things about this is I can just highlight over the section of time that I’m interested in, and guess what? It drove me into that narrow window of time, but gave me more detail.


The idea of extending out the time (I generally, by the way, look at one hour’s worth of data), extending out the time and drilling into the time period that you’re interested in, is one way to approach history. Another way, generally when you kind of know what time you’re looking at, is directly through the history browser. If someone picked up the phone and said, hey, at 3:30 yesterday we had an issue, or 3:30 on Monday or Friday or whatever, we had an issue, you can then pick the day. So let’s go back to the 12th. Down below here, maybe I want to go look at 6:32 AM. Well, guess what? Now I’m on the 12th instead of today. To go back to the live view, you can just click right there. Today we’re actually at 10 o’clock versus 6:00 AM. We see similar workloads, right? There’s CXPACKET and blocking, basically, happening on both those days. Maybe I want to get into what statement it was.


Database Alerting

All right. If I click on the statement view, I can pull the statement out: this select statement, which application, which database, which client, which session, which user, right? So there are different ways you can look at that, but again, I’m not looking at live. I’m looking at 5/12, 6:32 AM. The history browser is directly accessible in that way. The third way, which frankly I use probably the most, is through alerts. What I mean by that is, often I’ll get an email or something like that saying there’s a problem, but I don’t have the luxury of really getting into it at that exact period of time. If I know what I’m looking for, right now we’re looking at all active alerts, every alert that’s kicked off right now in real time. If I want to look at something in the past, often I’ll filter it down.


I got an email earlier, may have been yesterday afternoon, about tempdb running low on space. Let’s go back and see. Actually, that was earlier today, about 7:27. It started at 7:21, it was still there at 7:22, then it ended. There was an email that was talking about this procedure that was consuming about three gigs of tempdb space, coming from this job, this user, this session, right? The actual email I get is pretty informative, right? There’s good information to play with, but because we’re recording all this stuff in our repository, I can really drill down and understand this from different perspectives as well. If I right-click on the alert, real time isn’t going to give me any value. I want the historical view, so I select that. It’s going to take me to the instance, prod one. It’s going to take me to the database, the tempdb summary, because that’s the problem.


It’s going to show me charts and then give me tables of the data that was collected at 7:21 this morning, when we saw that problem. We saw that it had come in earlier, right? There were two different alerts that came through. You can see this double spike right there. Session 102 is the one that’s consuming the tempdb resources right now, right? I can then drill into the details of 102, which happens to be this one on the top, right? We can see the command, the tempdb workout, executing the stored procedure, the workout, coming from this job. But now I have even more data, right? The resource utilization, CPU. If you look at the CPU milliseconds column, this is the total CPU for the life of the session. If the session logged in three weeks ago, it might have a really huge CPU number.


If you look at the delta, like this example here, right? So 332,000 milliseconds of CPU total, but right now it’s only using 171 milliseconds of CPU, right? Yes, this one has used a lot of CPU over time, but this other one is using all of the CPU right now, right? The same thing with the IO: physical IO total versus the delta. That delta on 102 is very high for CPU and IO. What that basically means is, maybe I didn’t know to look at tempdb. Maybe what I saw was this increase, a double spike of CPU here. Well, we had that double spike in tempdb; those two things are related. Or if I look at the disk, guess what’s on my T drive? Tempdb, right? So the T drive is getting really busy. It may be that I saw those two symptoms: I got alerted the CPU was high,


I got alerted the disk activity was high. From that, I could kind of glean, hey, maybe we are running into a tempdb issue, or maybe this is affecting tempdb, because we can see that information here, right? There are different ways to look at that. Another little side note: I don’t always do this, but I will often go look at the server’s active alerts at the timeframe suggested, right? So we got the alert. Which one are you? Where are you? We got this alert, right? This is the one we filtered on and used to go back in time. We saw that kind of in a vacuum. If I look at the whole server at that same time frame, there were quite a few other alerts kicking off: the T drive, probably CPU in here as well, right? There were a number of different indicators that we were having an issue.


It wasn’t just that one single metric; they are all kind of tied in together. The only thing I really don’t know by looking at this is what it is about this stored procedure that’s causing me the problem, right? What is it in that stored procedure that caused all this growth? One way to evaluate that, and I think the best way, the way that I like to do it, is by looking at waits, right? If I’m looking at the query waits for that timeframe, guess what? There’s a double spike in here, similar to what we saw in the tempdb growth, right? We can see the wait types: CXPACKET, page IO latch, some IO waits, right? There are a number of different waits that are kind of pointing me in that parallelism, tempdb area. Well, what sessions? We knew 102 was the culprit, right? Guess what? 102 is the pink, right?


All this pink stuff is 102, meaning, if I look at this from a statement perspective, I know that this is what was actually running in that procedure that was causing me that pain. Now, this is a timeline view of waits. There’s a dropdown here that will change me to a duration view. Another way to have looked at that would have been, oh, well, we’re getting CXPACKET waits; which statements are those coming from? Or, we know that it was session 102; what statements are we getting from 102? There are a couple of others, but the one that’s actually painful is this guy, right? Out of the box, it’s always going to match whatever timescale you have set here. Right now we’re only looking at an hour. If I change that to six hours, well, the insert that we just looked at is still there, but it’s number two on the list.


The most expensive wait query was the select above it, right? You can easily kind of get a high level view and then drill down from a wait. The other way, though, let me switch back over to the over-time view, another way that I could do this is by looking at just queries. Again, we have a feature in the tool called query monitor, and my definition was anything over two seconds in duration, meaning that when I look at this signature mode view, which is a little complicated, you’ll not see anything in here below two seconds, right? Everything is above that two second threshold. Now, I say it’s complicated because of what signature mode really is. It’s looking at all the statements. If I look at this view, which is just show me everything that exceeded two seconds, right?


There’s no real organization here. I can go in and say, well, show me the longest, show me the shortest, the most reads, the most writes, whatever, right? I can look at each query collected over whatever timeframe you’ve inputted, right? If I’m looking at this in signature mode, what we’re doing is basically ordering and organizing that together for you based on the SQL statements that were collected. A signature, in this case, for that select statement we saw earlier that had that spike in waits over the last six hours, is this guy. Now, there are some things missing. It’s not a select top 500, it’s a select top hash; there are some hashes, there’s a hash here. What we’re doing is saying, oh, well, this statement looks like a signature. There may be some slight variations of that statement that have run throughout the day, but basically it’s the same query.


The cool thing about doing that is it helps me understand that this statement was captured 19 different times over this timeframe, with an average duration of 64 seconds, right? If I’m looking to tune something, or trying to find some low hanging fruit to improve, this is probably the one that I’m after, because it has the longest average duration and it has the most frequent collection, right? I can then, from there, drill into the query history and really see how it’s been behaving historically. Now, interestingly, if you look at the duration here, sometimes it runs very long. Most of the other times, it’s kind of in that four to seven second range. You can see that reflected here in the chart: CPU increase, duration increase. Those two things go together. When that CPU is increasing, it slows down quite a bit, because there are other workloads on this system. It’s complicated, but the idea is that maybe I need to fix the CPU on this guy.


What’s going on here? What’s conflicting with it? There are different approaches that I could take, right? One thing I might want to confirm: the reads are flat, right? It’s not taking longer because it’s returning more data; it’s taking longer because something else is happening. I’m going to cheat here just for a second. If you look at that statement, you can kind of see, based on the wait type, a shared lock wait, right, we can kind of tell that this was blocking. I bet you, if I go look at the blocking view, guess what’s being blocked, right? That statement’s being blocked. When blocking is happening directed towards the statement, it takes a lot longer to run, which is pretty obvious. The cool thing, again, is I’m not looking at live data.


We’re looking at information that’s been collected historically. I’m working with it as if it was live, but I had the luxury of doing it later. There are a lot of cool ways to get in and visualize and understand how your SQL Servers have behaved historically.

Alert Thresholds

I think one takeaway, though (excuse me, I have a bit of allergies today, so I have some sneezing; I had to mute there for a second), one of the realities is, if you’re monitoring lots and lots of SQL Servers, you’re not going to have the luxury of drilling into each one and trying to find these problems. The tool is designed around alerting. What we did there: we identified a problem, we drilled into it, we looked at it from a bunch of different directions. The idea is that the tool is really here to help you understand, hey, you’re having a problem on this server.


Go look at that. You’re having problems over here on this server, look at that. Or, enterprise-wide, everything has slowed down from a response-time perspective, or this whole network segment is down and we can’t monitor it anymore. There’s a lot of detailed alerting we can set up in the tool to help you understand where to go. The issue with detailed alerting (and this is not an Idera issue; this is an issue with all kinds of performance monitoring tools) is that if your alert thresholds are not defined correctly, you can miss or ignore things that you really need to see from an alert perspective. Now, I mentioned very early on that we have some features in the tool that can help you define your alert thresholds quickly. If I right-click on a server and choose Configure Alerts, you can see them.


Let me make these bigger, and really what I’m talking about is this: if I hover over this little bar at the top, it turns yellow. That’s what I’m talking about. Before we go in there and look at that bar, let me group these together and show you one of these individually. I see this procedure cache hit ratio alert, and we’re getting a warning on it right now. So why are we getting a warning? Well, look at my warning threshold, 98 to 91: if it’s less than 98 and greater than 91, you get a warning alert. If it’s below 91, you get a critical alert. This is an inverse metric, for people who don’t know: the lower the procedure cache hit ratio, the worse off you are; the higher, the better. Now, what I’m really getting at is this little green box here.


Multi Performance Baselines

There’s a feature called baselining. We’re measuring historically how this metric has behaved. At the bottom there you’ll see 6,790: we have 6,790 data collections telling us that my average has been 94.4, with a standard deviation of 1.7. So my average is 94.4 and my warning threshold is 98. I would think this server is going to be yellow 24/7, because that threshold isn’t realistic for how the server has behaved historically. Right? Visually, it’s giving me an idea that maybe I want to drag that down. If it gets down to 93, which is a deviation from the norm, that’s when I want the alert to trigger. There are a hundred-plus metrics in here to tweak and tune and play with. I don’t want to have to do that visually on every single metric, much less do it on 500 SQL Servers.


Right? That’s what the bar is doing for me. If I click on the bar, it says: hey, listen, I’ve gone through your metrics, and it looks like page life expectancy, OS memory, and logins per second need attention. You’ll see in here, OS memory, for example: the average has been 82.7, and my warning threshold is 75. That’s not going to work, so it wants me to change that to 85/87. The one we just looked at, procedure cache hit ratio, once we check that box, it reconfigures to 91/88. Instead of worrying about 105 metrics, I’m worrying about these six. I can go in here, tweak and tune those six, and clean up that server. It’s going to turn green, assuming nothing bad is happening, and we can move forward. One little caveat, though: this is math. There’s not an AI machine-learning engine on the backend.


It’s literally math going in there and tweaking and tuning these metrics for me: we’re looking at how your metric has behaved historically, and then applying a filter of 2% past the baseline for warning and 5% past the baseline for critical. If it says, oh, your average SQL Server CPU is 99%, we think you should set your threshold to 100%, that’s probably not the best recommendation for you to take. The takeaway there is that you need to go add more CPU, right? It’s math, but again, it gives you a short list to play with; you don’t have to deal with everything. As far as the metrics themselves go, they’re categorized. Backup alerting is at the very top, then custom. We’ll talk about this in a minute, but the idea of custom is that there may be some things I want to measure that nobody else cares about, that aren’t built into the tool.
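The baseline filter described here can be sketched in a few lines. This is a hedged illustration, not SQL Diagnostic Manager's actual code: it assumes the recommendation is simply the historical mean nudged 2% for warning and 5% for critical, flipped for inverse metrics like cache hit ratio.

```python
import statistics

def suggest_thresholds(samples, inverse=False, warn_pct=0.02, crit_pct=0.05):
    """Suggest (warning, critical) thresholds from historical samples.

    inverse=True means lower values are worse (e.g. procedure cache
    hit ratio). warn_pct/crit_pct mirror the 2%/5% offsets past the
    baseline mean mentioned in the talk; the product's real math may
    differ.
    """
    mean = statistics.mean(samples)
    sign = -1 if inverse else 1
    warning = mean * (1 + sign * warn_pct)
    critical = mean * (1 + sign * crit_pct)
    return round(warning, 1), round(critical, 1)

# A metric averaging 94.4 (the cache hit ratio example) gets its
# thresholds pulled below the mean rather than an unreachable 98.
print(suggest_thresholds([94.4, 94.3, 94.5], inverse=True))
```

The point of computing from collected history rather than a fixed default is exactly the one made above: a threshold the server can never satisfy is a threshold you learn to ignore.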


I can add them to what’s called a custom counter. Below that is database: the Always On availability group has about 13 different alerts, and there’s data file, filegroup, log file, mirroring, tempdb, fragmentation. Quite a few database-specific alerts. Database as a service covers AWS, like RDS, and managed instance; database platform as a service covers Azure SQL Database, with some specific metrics in here. One little thing is low and high: we’re actually alerting if you’re not using as much resource as you have allocated to that Azure database, for example. It might give you some idea of, hey, we can spend less with a lower service tier on that guy. There are some interesting alerts around platform as a service. Error log and agent log alerting primarily look at severity levels, but every metric has this Advanced button. An example of Advanced for the error log would be regular expressions or text that you want to match outside of the severity level.


Right? Even though it might be severity level five, I still might want to be alerted if it has XYZ text in it. Let me back up a little bit. Every metric has some little Advanced feature, and the idea is to help you tune this stuff to be even more accurate. A different example would be something like this for disk drive space or log full percentages; any kind of space-type alert works the same way. The default setting is 75/90, meaning that if any transaction log is 75% full, that’s a warning; any transaction log 90% full, that’s critical. But there may be this little one-off T-log on some database where I want a different threshold. The reason I’m using Advanced is to fine-tune to my exact situation, and pretty much everything has some Advanced feature: error log stuff, operational states (which is kind of high-level configuration of the server), resources (CPU, memory, disk), and services.
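The Advanced screen's per-database override can be modeled as a lookup that falls back to the 75/90 default. A minimal sketch; the function name and threshold shape are my own, not the product's API:

```python
def log_space_status(db, pct_full, overrides=None, default=(75, 90)):
    """Classify a transaction-log fill percentage.

    overrides maps database name -> (warning, critical), the way the
    Advanced dialog lets a one-off T-log carry its own thresholds.
    """
    warn, crit = (overrides or {}).get(db, default)
    if pct_full >= crit:
        return "critical"
    if pct_full >= warn:
        return "warning"
    return "ok"

# The same 80%-full log is a warning under the default thresholds but
# fine for a database we've deliberately loosened.
print(log_space_status("Sales", 80))
print(log_space_status("Sales", 80, overrides={"Sales": (85, 95)}))
```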


Services is a broad basket, but it covers things like cluster failover, replication, and service availability. Is the service running or not? That’s not really fair; a SQL Server service can be in a lot of states, but running is how we prefer it to be. So running is green, and everything else is warning or critical. That’s service availability, but we’re also looking at things like jobs. Are your jobs running long? Just as with disk, you could say, well, that’s generally how we think a job should run, but this job is a little different; maybe we have a different threshold for that backup job. You can do jobs-running-long based on percentage, meaning we look at past performance, take the average runtime, and if the job exceeds that by a certain percentage, that’s when we want to be alerted. You also have the option to alert on minutes.


This job needs to be done in six minutes; if it’s not done in six minutes, the world’s going to fall apart. You can go in and set that threshold based on a minute value as well. I already mentioned that replication is in there too. Further down, you get into sessions: deadlocking, blocking, session CPU, and then virtualization at the bottom.
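Both job-duration rules (percent over the historical average, or a hard minute cap) reduce to a simple check. A sketch under my own naming, not the tool's internals:

```python
def job_running_long(elapsed_min, history_min=None, pct_over=None, max_min=None):
    """True if a job should alert for running long.

    Two rules, matching the two options described above: exceed the
    average historical runtime by pct_over percent, or exceed a flat
    max_min cap (the 'must finish in six minutes' case). Either fires.
    """
    if max_min is not None and elapsed_min > max_min:
        return True
    if pct_over is not None and history_min:
        avg = sum(history_min) / len(history_min)
        return elapsed_min > avg * (1 + pct_over / 100)
    return False

# A job averaging 10 minutes alerts at 13 under a 25% rule (13 > 12.5),
# and a 7-minute run trips a hard 6-minute cap.
print(job_running_long(13, history_min=[10, 10, 10], pct_over=25))
print(job_running_long(7, max_min=6))
```

The percentage rule adapts as the job's normal runtime drifts, which is why it is the better default for most jobs; the minute cap suits jobs with an external deadline.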

Cloud Virtual Machine Monitoring

Virtualization monitoring is something we can do. It’s included in the product; there’s no additional charge for virtualization monitoring. It is an extra step to set up. We don’t need any kind of administrative VMware rights; for Hyper-V specifically, you need OS-level permissions to gather the Hyper-V metrics from the host. For VMware, we’re just talking to vCenter. As you can see, it’s not alerting on everything; there are, I think, 14 alerts.


Four of them are at the host level; the other nine are at the virtual machine level. But there are some good ones, for example CPU ready wait time. This tells me how many milliseconds the virtual CPU is waiting for the physical CPU, which can be a strong indication of oversubscription: you have five VMs on one host, and one of them is stealing all the resources from the others. This would indicate that. If you want to do the monitoring, when I shut this window down I’ll show you how to set it up, but it can alert you in certain scenarios where VMware may be affecting the SQL Server’s performance. Now, one other little note: if you look, there’s this little icon here; we come up here, there’s another icon there; and further up, there’s an icon there. Those icons are there because we actually modified those metrics.
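CPU ready is usually judged as a percentage of the sampling interval rather than as raw milliseconds. As a rough sketch (vCenter real-time statistics cover 20-second samples; the percentage rule of thumb is common VMware guidance, not a SQL Diagnostic Manager threshold):

```python
def cpu_ready_pct(ready_ms, interval_s=20):
    """Convert summed CPU-ready milliseconds into a percentage of the
    sampling interval (vCenter real-time samples cover 20 seconds)."""
    return ready_ms / (interval_s * 1000) * 100

# The 380 ms reading seen later in the demo is under 2% of a 20-second
# sample, so mild; commonly cited guidance treats ~5% as worth watching.
print(cpu_ready_pct(380))
```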



Why that matters is that if I hit the Apply button, I want to know what I’m applying. We changed three different thresholds. Do I want to apply those three thresholds to other SQL Servers, to tags (groups of SQL Servers), or to templates? Yes, please apply that to every server; or no, let’s apply it just to the production boxes; or let’s do the production boxes and my test servers. You can apply those three changes globally. You also have the ability to apply those changes to templates. Out of the box there’s critical-only, performance, Azure, AWS: a bunch of templates that you may want to modify, and you can create your own. I have my own little homegrown dev template and my own little homegrown prod template. Once a template has been created, I can always apply it to other servers. Again, I’m taking those three changes and applying them to an existing template.


By the way, if you want to create your own little homegrown template, or multiple homegrown templates, you can take an existing server and just say Create Template to save those settings off and use them later. Also, if you go under the Tools menu, there is the alert configuration template screen where you can manage your existing templates or create your own from scratch. Let me show you real fast. This is not going to work, because for me to hit vCenter I have to be on a different machine, actually the machine that’s running my DM collection service, so this is going to break. But the idea is: where is vCenter? You need a read account, not an admin account. Once you’ve done that, you check the boxes that are associated with that vCenter environment. For the sake of your sanity and my own:


Let me show you it for real. What we’re pulling into right now is the server that’s hosting diagnostics; it’s called SQL Dev 1. This is where the SQL DM collection service and SQL DM management service run. This is where the monitoring is coming from. My laptop, which we’ve been looking at all day, is connecting over to the DM repository on Dev 1, so it’s pulling from that virtual machine. Now, this virtual machine has VPN access to Houston, where that vCenter environment is. If I go into VM configuration here, this will actually show up: I have five virtual machines, and these are two of those five, linked. Once I’ve done that, I can click into one of these virtual machines and then go back to my actual workstation, because my workstation doesn’t have the access.


The Houston VM is accessing Houston. Unfortunately there’s not a lot of load on this, but you can see if we switch over to the VM view: virtual machine memory, host memory. The virtual machine has very low memory utilization, and the host is only at 30% as well. For CPU, here we go. I’d say there’s not a lot, but you can see the colors here: the green is the SQL Server, the blue is the total OS, the gold is the virtual machine CPU, and the pink is the host CPU. If I saw the pink way up here at the top, that’s an indication, hey, we have a problem. Let me clear out this filter and see if I have any of those alerts around VMware. Oh, there it is: that one metric I was just talking about, virtual CPU


ready wait time. My virtual CPU is waiting 380 milliseconds to access the physical CPU on ESX50. That tie-in can help out in a lot of ways if you’re running into virtualization issues causing performance problems. That’s kind of the first step in the alert process. Let me look at questions, one second. I think there were a few questions. Okay, yeah, one of the questions is about recording. Yes, this is all recorded; it should auto-record, and it is. As soon as this is over, it takes some time to process, but we’ll send out a link to all the attendees and you can download it for sure. What we’ve talked about so far was really looking at the past: how we drill down, how we adjust time, how the data is being gathered. Frankly, the way I use the tool quite a bit is letting the alerts drive me to where I need to go.


We wanted to figure out where to set these alert thresholds. Once we’ve set the alert thresholds, what am I going to do with them? Let me pull up an example real fast. One second; sorry, my server fans just turned on real loud, and I had to adjust for that. Let me give you the real example before I show you the alert responses. Let’s look at this one: blocked session wait time, a couple of days’ worth of data. So we’re getting those blocks. We saw that earlier, right? That statement runs really long, and we know that blocking is causing that issue. If we look at this guy here, that’s not the exact example, but that’ll work. Okay.


That’s fine.


Responding to Alerts with SQL Diagnostic Manager

Okay. What I’m getting at is alert actions and responses. Once an alert is raised, you’ll see these alerts in the tool: raised, lowered, raised, lowered. Raised to critical means it’s a brand-new incident; it was previously fine, now it’s critical. Raised to critical, then lowered: it was a problem, now it’s not. There’s also remained; let me come back in here. Generally I don’t have a lot of those. The blocking may last for, say, five minutes, and every time we go monitor, it’s still blocking: remained critical, remained critical, et cetera. Interestingly, I don’t have that in my environment, because I’m cheating quite a bit on the alert response side. What I mean by that is there are different ways we can respond once an alert is raised.


I see something as critical from a blocking perspective, meaning that blocking is lasting longer than 90 seconds. What am I going to do about it? Well, in my case, I’m doing a couple of different things. I’m using the email provider to send out an email. I’m also writing to the event log, because I want to know via a different application; we use a product called Uptime that does event log scraping, so it picks it up there. I’m also running a T-SQL script that kills the lead blocker, and a PowerShell command to pop up an alert box on my screen so I know that happened. You can do a lot of different things with these alert responses to help fix problems as well as notify. The way these rules are set up is pretty much like a rule in Outlook.


So I call this one Kill Blocking Prod. Most people would never do this, but it’s a kind of crazy example of what is possible. At the top you define your conditions, the middle is your actions, and the bottom is the rule. Essentially what I’ve said is: if any of my production servers, of which I have five, have the metric blocked session wait time showing up as critical, which again is 90 seconds or more, do four things: send David an email, write to the Windows event log, kill the lead blocker, and pop up an alert box through PowerShell. You add essentially whatever response you want, and then inside the response, or the action, you’ll notice variables: SQL DM alert variables like summary, severity, and instance. We’re grabbing variables from the alert text and plugging them in where you want them to be. The event log entry is straightforward; the T-SQL that I’m running, again, relies on variables.


I’m using the instance name variable to figure out which of the five SQL Servers to run this script on; basically I’m looking at which instance is in the alert. I’m also pulling in the alert summary string, which is something like “session 75 blocking for…”, and then I’m parsing that to get the SPID and having it killed. There’s also a PowerShell script that runs after the fact to tell me, as an alert box on my screen, that that just happened. The point being that these alerts can be very simple (if something is critical, send me an email), but you can also get really descriptive and laser-focused on certain areas, meaning you can do a lot of remediation with them if you need to. We’re running low on time; if you guys have questions, please start typing those.
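Parsing the SPID out of the alert summary, as the demo's T-SQL response does, is just string work. A hedged sketch in Python: the summary wording and the idea of issuing KILL come from the talk, but the exact variable text is an assumption.

```python
import re

def lead_blocker_kill_statement(summary):
    """Extract the session id from an alert summary such as
    'Session 75 blocking for 120 seconds' and build the KILL command
    a response script would run. Returns None if no id is found."""
    match = re.search(r"[Ss]ession\s+(\d+)", summary)
    if not match:
        return None
    return f"KILL {int(match.group(1))};"

print(lead_blocker_kill_statement("Session 75 blocking for 120 seconds"))
```

As the speaker says, most people would stop at notification; automatically killing sessions is an aggressive remediation, which is why the parse returns None rather than guessing when the summary doesn't match.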


Reporting with SQL Diagnostic Manager

In the meantime, I’m going to talk about the last core area of the tool, which is reporting. Before we go there, though, let me talk about custom counters, because that comes into play. The idea of a custom counter is, again, if there’s something that you want to measure that we don’t do out of the box, you can generally add it. The reason this ties into reporting is that if there’s some unique metric we don’t know about, obviously there’s not going to be a report for it, so we can build a custom report around a custom metric if that’s what you want to do. There are different categories of custom counter. Windows counters are literally any WMI or PerfMon counter. These are wizard-driven: I say new counter, and I pick a server.


This is just a sample server; I can build this metric using Dev 2 as my sample server and then apply it to any number of other instances, so you don’t have to do it for each server. These are basically all the different WMI counters: some of them are obviously SQL-related, and some of them are related to my printer. The point is, anything you really want to look at, you can build from WMI. For the SQL Server counter type, you can use any T-SQL script that returns a numeric value. And there are, I don’t know, 185 or so VMware counters that I can look at: ESX disk, for example; I want to look at the usage, or the write rate on this LUN. You can get really detailed. The point being, again, if it’s something we don’t measure that you want to measure, we can measure it.


You can then link it to groups, or to individual servers. Once that’s done, you can alert, report, and view it just like any other metric. The problem from a reports perspective is that if you build that custom metric, we don’t know about it. You’ll have to go under Reports, Custom, New and pick the metrics you want. The idea might be that you want to pick some of these custom counters, maybe mix in some virtualization counters, and then mix in some SQL Server and operating system counters, whatever. There’s that ability to design your own report. It doesn’t have to be because you have a custom counter; it can be for any purpose, really, and it’s pretty easy to do. Let me check on questions real fast. Okay. All right. The other little thing to think about from a report perspective is that there are three different types of reports.


Data Forecasting

Generally speaking, everything can be deployed to SSRS with the exception of forecast reports. Disk, database, and table forecasting use linear and/or exponential algorithms to predict the future. Let me give you a real-world example. If I go look at the SQL DM repository (I can multi-select here, but I’m going to do just one database, the DM repository database), we’ll go back to a custom timeframe of November 1st, 2020. We’ll go from November of last year until today, and then use that to predict 90 days into the future. We want to see free space, and we’re going to do a linear forecast. What this is really doing is looking at the daily growth rate of this database from November until today. That’s what we’re seeing in this view: the green starts November 1st, and the end of the green here is today.


Now I’m using a linear algorithm to predict, based on the green, the blue, which is 90 days into the future. By August 1st we’re going to be at about 16 or 17 gigs, and the dotted line up here is the free space available. I’m going to be at 16 gigs in August, and I’m out of space at 20 gigs; based on this growth, I’m probably looking at November to be out of space, right? This is really informative data. I can go back as far as I have data, up to three years. So if you’ve been monitoring your servers for three years, you can use three years’ worth of history as your prediction model, meaning it’s much more accurate than maybe two weeks. The key takeaway is that you can do that for disk, database, and table, but those three reports, because of the forecasting algorithms, have to be run from the desktop client.
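The linear forecast boils down to fitting a daily growth rate and extrapolating. This least-squares sketch is my own illustration of the idea, not the product's algorithm:

```python
def fit_daily_growth(sizes_gb):
    """Least-squares slope of size vs. day index: GB grown per day."""
    n = len(sizes_gb)
    mean_x = (n - 1) / 2
    mean_y = sum(sizes_gb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(sizes_gb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def linear_forecast(sizes_gb, days_ahead):
    """Project the last observed size days_ahead into the future."""
    return sizes_gb[-1] + fit_daily_growth(sizes_gb) * days_ahead

def days_until_full(sizes_gb, capacity_gb):
    """Days until the fitted trend hits capacity (None if shrinking)."""
    slope = fit_daily_growth(sizes_gb)
    return (capacity_gb - sizes_gb[-1]) / slope if slope > 0 else None

# A database growing 1 GB/day from 10 GB sits at 17 GB five days out,
# and a 20 GB ceiling is reached in 8 days.
history = [10, 11, 12]
print(linear_forecast(history, 5))
print(days_until_full(history, 20))
```

The speaker's point about history length falls out of the fit: the more daily samples feed the slope, the less any single noisy day distorts the projection.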


Everything else, though, can be deployed through the deployment wizard at the bottom out to SSRS, including any of the custom reports that you built. You say, oh, take these reports, push them out to this report server, and essentially you’ll have the reports in SSRS that you can subscribe to, edit, et cetera. So forecast reports are one type.

Top Reports

You also see a lot of top reports: top server, top query, top database, top application, top table growth. With a top server report, for example: show me all the servers today. Which servers had the most alerts, the worst response time, the highest CPU, memory, disk, et cetera? I could say, oh, let’s look at this server, drill into that one, and look at charts of response time. This is just from eight o’clock till now; you can make that bigger, but there’s CPU, memory, disk, and then a database overview, right?


I’m using drill-down within the top reports to get the more detailed data. The other thing about top reports, and I’ll show you another one: top database on that server, seven days, top five. Notice I said top five; the number of databases in this report is only going to be five. Now, there are different buckets of databases being shown. Sorry, there are a lot of databases on this server; maybe I should have picked a different one. The top bucket is which databases are the biggest: these are the top five biggest databases. These are the top five fastest-growing databases. These are the top five databases based on reads; these are the top five based on writes; these are the top five based on transactions; and these are the top databases based on wait type.


Table Chart Reports

So that’s what a top report is: top applications, top queries, top servers, top databases. Forecasting looks at the past to predict the future around space; top gives me these breakdowns, top 10 or top 20 lists of different categories. The third style of report is just your straight table/chart report, as I call it. Say I want to look at CPU: let’s look at Prod 1 for today. Charts, and I have minutes in here, so that’s a lot of data. Maybe I’d change that to charts at the top, table at the bottom. You have that kind of detail you can play with, but it’s a pretty simple tool.

Try SQL Diagnostic Manager

If you guys are interested in playing with the tool and want to download a trial, it’s pretty simple. Just go out to the Idera website; at the very top you’ll see Products, then SQL Diagnostic Manager for SQL Server, and you hit Start for Free. That gives you a download link, and once you download, it’ll get you into our system.


We’ll get you an account manager that you have direct contact with. If you need help with the install, or tech support on the trial side, my team handles all that; just ask your account manager and they can make that stuff happen for you. I’m going to check one more time on questions. If there are no questions, we’ll go ahead and end it. Okay, guys, hopefully this was a benefit to you, and we look forward to speaking with you all later. Y’all have a good day.