Transcript

Expand

Welcome to the SQL Diagnostics Manager metrics threshold and baseline tutorial. In SQL Diagnostic Manager whenever a metric it exceeds a configured value or threshold the state of the metric is changed. And an alert or action can be executed. In SQL Diagnostic Manager thresholds come pre-configured with industry standard levels. This does not mean this levels are one fit solution for everyone. There are several attributes regarding thresholds in SQL Diagnostic Manager. First of all when we select the instance configure alerts we can see the option for configuring the thresholds. Here we can see several attributes of the alert configuration enabled. […] You can see basically the metrics that we can enable or disable meaning that we can collect information about those metrics. So for example here we have some that are enabled like database status and file group space full. And some for example availability group which is not in my environment. And I have chosen to disable those. So I am not collecting that data that I do not need or I do not want. The next one is ranking. And this is new in SQL Diagnostic Manager 10.5. It is basically an alert priority condition to only notify you of the most important alerts above a certain track. So by default they come with certain ranks. And then when you want to create an alert let us say you say only alert me on those metrics that are ranked seventy or above. So basically the most important was like this data file auto growth. The index row hits or the log file ought to grow. Those are above seventy percent. But maybe something like Lock Full Percentage which is set at thirty percent. This time you might not go into the most critical ones that you will be alerted. So this is a way of letting [you know] one of the most important metrics that are super critical for your server. And you want to be alerted right away either via email. Or maybe you want to run a specific script on that automatically. Another attribute of the alert configuration it is that you get the metric the usual metric when you can just change the thresholds. So for example this one is SQL Server memory seventy five to ninety. I can change those values whenever I want to anything that I want to represent to be a warning or to be a critical alert. But you also have another one. And this is new also in 10.5 where we now separated that metric. And another one that is called the baseline. So one is going to be static. So meaning it crosses a threshold in this case eighty percent. And that is going to change the state into a warning if it crosses a threshold to ninety percent. During that collection then it is going to change that into a critical state. Now on the baseline is going to be more dynamic. And this is based on your normal use which is something that we are going to talk about in a second. Here the concept of baselining. But if your normal use it is at a certain percentage or a certain number if you go above that fifty percent above that one hundred percent or one hundred and twenty percent. That deviation then you want to be alerted. So obviously that base I can change based on time. That baseline sometimes is based on usually on seven days. So it is dynamic. It can change on a daily basis. So if I move from my normal even though that normal might change every certain number of days or weeks then I want to be notified of that change. Because it means that my deviation from my normal is greater than the percentage that I am comfortable with. And these are some of the new things that are in 10.5. We had this before. But it was within the metric itself that you can enable here. Now it is separated. Many of these metrics also they have an advanced. But on the bottom basically you can go there. And he might change per metric. But if you choose that Advanced button basically allows you to do a couple of things. In this one is an alert suppression. It says I want to suppress that alert for any number of minutes before the alert is raised. That means that for example if we have in this case is something like CPU [central processing unit]. Right if it goes above ninety percent well I do not want to alert right away maybe I want to wait two three minutes. And if the condition is still there after three minutes then I want to be alerted. If that condition has cleared after three minutes then I wo not be alerted. So that is one way of [you know] getting all those false positives that you get all the time especially on CPU [central processing unit] which spikes maybe for ten seconds it goes to ninety-eight percent. And then it goes down. But it is not something that you need to be alerted on. So that way you can suppress it that way. In this case also in this OS [operating system] processor time percent you have filters. And the alert filters. Basically it is another way of minimizing those false positives. For example here only create when critical in this case in the red. When that threshold has reached four times okay critical level within if five snapshot time period. Let us assume that we collect data every five minutes or I am sorry every minute. And we are doing it here for five minutes. In those five minutes four out of those. So basically for those snapshots cross the threshold of critical I can say five out of five okay. And by the way you do this is basically you are suppressing some of those false positive. So maybe it is not something that important if it is one minute or two minutes that it is at a high level maybe above ninety percent. But if he is three minutes [you know] four minutes or in this case snapshots which it can be measured by every minute every two minutes every five minutes collection every ten minutes depending how you do their collection. Then it would count that as a critical or something that might not be that important. Now make sure to experiment with those advanced because it might be different for every metric. For example here when I have single server agent job failure. So if I click the advanced on that it is going to give me different options for duration or job filters since we are talking about jobs that are failing. For example I can go here and say I want to filter any job. I want to include or exclude any jobs that are of a certain category for example. So I say only include here jobs that have a certain category that is in my category of demo. And it maybe starts with demo. And jobs that start with "sys” those are all the only jobs that I want to monitor for failure. If I want to exclude those then I could have excluded those. And there is got a monitor. And then everything else except everything that comes from the category demo that starts with the word or the letters […]. Another concept that we use here metrics is the recommendations. The recommendations work as follows. As you can see here I got OS [operating system] memory usage. I got from eighty-five to ninety percent. Those are my thresholds for warning and critical. And then I have another line here that we are going to talk a little bit more in depth in a second. But that it is called a baseline. The baseline basically allows me to see my minimum, my maximum, my range. In this case from sixty nine point seven to ninety eight or ninety four point eight I am sorry with an average of eighty two point three percent. So that is giving me my minimum, maximum, my range. And then my average in the case here. Average is eighty-two percent. Now why is that helpful. We are going to see in a second. And that is basically what we call the normal use. How do we configure that. We are going to see that also in a second. What is important here if this is my normal use and these are my thresholds. That means that I am always going to be in a warning state most likely. And also most likely on a critical state. So how do we avoid that. One way to do it is that visually we can see it here. And we can adjust those values to whatever we want. That is one way of doing it. But the other best way of doing it is by applying the recommendations that we see here on the top. All the recommendations all available for the single server. Click here to view the recommendation. Now when we click on that we can see here OS [operating system] memory and paging. So these are the two recommendations out of one hundred different metrics that we have here that we are recommending values for you. Why. Because as you can see here the normal use conflicts with my thresholds. So now it is recommending here ninety nine two hundred percent. That recommendation is based on an option five percent above my warning or ten percent above my critical baseline. And that by default is a little bit higher twenty thirty. But I used those values of five. And ten that is going to give me a recommendation based on this ninety-nine. And two hundred which I think is a little bit high. So because it is already very high usage of memory I am just going to go here and adjust it to ninety-five or ninety-eight. […]. I am going to do the same thing for my baseline is recommending instead of three hundred and ten, and three hundred, and twenty five. It is recommending those values because it is a little bit above the baseline. So we are going to take that recommendation for us. […]. Click OK. You can see here how my new thresholds have change 95 in 98. And as you can see here I have these icons that mean that the configuration of that metric change you can see always […]. And now have the new values. So imagine the value of this is that if you go here to server you do not have to look at all hundred metrics that you might have. Basically click on that and it shows you those metrics that could get a recommendation based on your baseline. Now going more deeply into what the baseline is. A baseline can be defined as the normal use of the server or the application that is running on that SQL Server. Usually when it is the most busiest or is the normal range of working hours. So from eight to five. Or maybe that application runs normally from three to seven o’clock at nine every night or maybe only on Fridays. So you define what is the baseline for that. Now you can say well my application will run 24/7. And most applications do run 24/7. But you can define a period where they are the busiest. So maybe even if it runs 24/7 maybe the busiest time might be from five to eleven o’clock at night. And that is when they are running the busiest they got the most users. And the more the most activity there in that time to configure the baselines we can go to the bottom here configure baseline. And that is going to take us to the options of doing that. You can either do dynamic or custom. A dynamic baseline basically takes the last seven days out of the collected data from Monday through Friday. Only in this case I do not want to include Sunday and Saturday where my servers are not doing anything. They have no users and they are not doing a lot of activity during those days. The other thing is also I wanted to do it from seven to five. Those are more my normal hours that this server working. And they are the busiest. So that is where I want to do my statistics. So I can get that information from that specific time now you can choose whatever you want here. So let us save these application only runs on a Friday or is cranking all these numbers on a Friday afternoon. So I can say [you know] maybe from I do not know three o’clock until seven o’clock at night that is one that they are going to be the busiest. So I select the time that I want to do it. And now it is calculating based on that. So my minimum, maximum, and my average are going to be based on only Fridays and only from three to seven p.m. Everything else will be excluded from that calculation. Another thing you can do here is add more baselines. So for example if I want to have a baseline only for the week right Monday through Friday. But then I want to have another one for the weekend. So basically I can see my standard deviation of my averages in my min max for the week. And how do they behave. And it is going to be different from the weekend when we have less users. And less activity. So that is why we can create different baselines based on that. So if I want to create another baseline here and I want to do it based on Saturday and Sunday only information. Then I can do that. And I can select the times for that that will give me two different time frames of the baseline. And also if I want to have recommendations based on Saturday and Sunday or maybe Monday through Friday. I can do that as well. Another feature of the baseline is that you have what we call the baseline visualizer. And the baseline utilizer allows you to take that metric let us say whatever metric you want let us do something like OS [operating system] processor time percent. Let us see. So let us choose a host processor time four weeks. And this will allow you to see everything that you had within those four weeks. So the values these are all the snapshots that you have where you can see where [you know] you might be running more higher on Wednesday for whatever reason maybe Saturday is Sunday [you know] depending on that week. But some of the weeks are not. So you can take a look at here where are your normal. My normals pretty much are going to be here for every day of the week […] depending on what kind of jobs I am running. And things like that. But you are going to see where my normal is going to fall into. And this is going to help you to set up your thresholds. It is also going to set up where your baseline should be set up [you know] what times of the day. And the days of the week that you might want to set up your threshold. So again a good way to help you get started with that baseline configuration. Now baselines can also be used from an alert action and response meaning that you can set the alert to either send an email or run an action based on a baseline. So for example here I am going to create a new one. And I am going to call it any name. That a one I am going to say baseline example. And I am going to choose here when a metric or when this specific metric on this list. You see they are in warning or critical. I am just going to say critical. I am going to choose the metric. And now this is the advantage that we have now. Because we can go here to OS [operating system] processor ten percent which means that let us say if we had it at eighty percent warning ninety percent critical. That is going to just based on that static number of ninety in this case which is too critical if it crosses ninety percent. Then he can send me an email or run a corrective action. But in this case I also have the person baseline meaning if there is a deviation if he goes fifty percent above that the normal use or fifty percent or one hundred percent more okay or whatever it is. So basically goes into that deviation. And I want to make sure that I can send an email if it deviates more. And a certain percentage of that baseline. And this is another way of using baselines for alerts. So that was a quick tutorial on how to manage your thresholds. And your baselines for more information visit us at “www.idera.com”. Thank you very much.

Configuring Metric Thresholds and Baselines with SQL Diagnostic Manager for SQL Server

With SQL Diagnostic Manager , better alert to abnormalities while not alerting to normal situations for SQL Servers by quickly and easily configuring metric thresholds and baselines.

Start a FREE Trial
Share This
Contact IDERA:
+1 (713) 523-4433