Technology sometimes goes wrong on inconvenient days and times. In this particular example, it decided that Friday at about 5 PM would be a good time to malfunction. It does happen, and sometimes more often than it should.
🖥️ Why is the server slow and laggy?
In this particular instance, we will be focusing on a single process that was causing the problem. Before we get into that, though, it was very obvious that something was wrong: when accessing the server, everything was sluggish and very laggy.
🖥️ Degraded Performance - pay attention!
If you have ever remotely accessed a server and it has been sluggish and laggy, that usually indicates a performance problem. These problems usually require further investigation, and that was very much the case in this particular example.
🫵 Check Task Manager
When you have a slow and non-responsive server it’s always best to start with Task Manager. While this is not definitive, it will give you a very quick summary of where the problem could be when you check the Performance tab:
👀 Find the perpetrator process
Yes, this would explain the sluggishness: you have a maxed-out processor that is causing all your other operations to be slow and unresponsive. That on its own is not particularly helpful, so let’s navigate to the Details tab to see what’s causing the problem:
This is now starting to make sense. It would appear the cause of the CPU drain is Java, the world’s most installable virus by any stretch of the imagination, especially on Windows.
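The same check can be done from a PowerShell prompt when Task Manager itself is too sluggish to use. This is only a minimal sketch: the function name Get-TopCpuProcess is made up for illustration, and the CPU property returned by Get-Process is cumulative processor seconds since the process started, not a live percentage, so it ranks long-running heavy hitters rather than showing the current load.

```powershell
# Hypothetical helper: list the processes that have consumed the most CPU time.
function Get-TopCpuProcess {
    param (
        [int]$Count = 5
    )
    Get-Process |
        Where-Object { $null -ne $_.CPU } |          # some system processes expose no CPU value
        Sort-Object -Property CPU -Descending |
        Select-Object -First $Count -Property Id, ProcessName, CPU
}

Get-TopCpuProcess -Count 5
```

If "java" sits at the top with a CPU figure that keeps climbing between runs, that points at the same culprit as the Details tab.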
👀 Analyze that Wait Chain…
On the Details tab, locate the process called “Java”, right-click it and choose the option labeled “Analyze Wait Chain”. This will analyze the process and tell you which process the degraded process is waiting on:
Right, Java being Java, it appears to be waiting on itself due to some random bug in this horrible product that should really never have been installed on a server. So the cause of the slowness is that Java has deadlocked on itself, and it has quite a few deadlocks on itself.
🗣️ Network I/O but not as you think
The deadlock appears to be network related, but that does not mean there’s a problem with the network. It just means Java is trying to communicate over the network and is not getting to where it wants to go. If you take a look at the logs for the application that is running Java, they may give you some clue as to where the problem is.
In this particular example, the Java process was unable to talk to an endpoint on the network because it was being blocked by another Java process, and if you’ve guessed that the other process was in turn waiting on the first process then you’ve guessed right. So Java has completely lost its bottle and will require a reset.
When Java runs on Windows it is spawned by the process that calls it, so if you have a service that depends on Java, the service starts and then calls Java. That means that if you just terminate the service, there’s a chance you won’t get it to start up correctly again.
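To see which process actually spawned a given Java instance, the parent chain can be walked from PowerShell. The sketch below is an illustration rather than part of the original troubleshooting: Get-ProcessAncestry is a hypothetical helper, and it assumes a process table with ProcessId, ParentProcessId and Name properties, which is what the Win32_Process CIM class provides on Windows.

```powershell
# Hypothetical helper: walk from a process up through its parents.
# $ProcessTable is any list of objects with ProcessId, ParentProcessId and Name.
function Get-ProcessAncestry {
    param (
        [object[]]$ProcessTable,
        [int]$ProcessId
    )
    $byId = @{}
    foreach ($p in $ProcessTable) { $byId[[int]$p.ProcessId] = $p }

    $seen = @{}
    $current = $ProcessId
    while ($byId.ContainsKey($current) -and -not $seen.ContainsKey($current)) {
        $seen[$current] = $true                      # guard against reused PIDs forming a loop
        $byId[$current]                              # emit this process, then move to its parent
        $current = [int]$byId[$current].ParentProcessId
    }
}

# On Windows the table can come from CIM; elsewhere this part is skipped.
if ($env:OS -eq 'Windows_NT') {
    $table = Get-CimInstance -ClassName Win32_Process
    $table | Where-Object { $_.Name -eq 'java.exe' } | ForEach-Object {
        Get-ProcessAncestry -ProcessTable $table -ProcessId $_.ProcessId |
            Select-Object ProcessId, ParentProcessId, Name
    }
}
```

The first row for each Java process is Java itself and the rows after it are the callers, which tells you what would need restarting alongside it.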
🛑 Killing Process Warning
This means that while you do have the option to right-click Java and choose End process (I would probably recommend End process tree), and that will successfully kill Java, it may or may not report back to the application that Java has been terminated.
Note : Before taking action, check the logs to confirm that the high CPU is actually causing an issue with the application. If the application is working despite the high CPU, you may not need to take action.
✅ Reboot may be safer
Note : I am not usually a fan of rebooting servers to fix problems, but where Java is concerned, unless you’re going to do a full stack trace and fix the problem on the fly, this is the quickest way to get the service back online.
If you get into this tricky situation and don’t fully understand the application, how it calls and interacts with Java, and how it handles errors and crashes, then the quickest option will be to reboot the server and get it back to a healthy state.
👾 Confirm CPU is “normal”
Once the server is rebooted, the activity should go back to normal levels, which you can see below:
If you try the application that was crippled by Java malfunctioning, you should now find that it works again. You can also confirm this in the application or Java logs to make sure everything is actually back to normal. I am not one for taking user feedback as proof that a problem is fixed; I get my answers from logs and diagnostic reports. If the user happens to confirm the same results as what the server is telling me, then even better.
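Checking the logs after the reboot can be as simple as scanning the tail of the file for error markers. The sketch below makes some assumptions: Get-RecentLogError is a hypothetical helper, the default pattern is a guess at typical Java error markers, and the example path is a placeholder, since the post does not say where the application writes its logs.

```powershell
# Hypothetical helper: return lines matching an error pattern from the tail of a log file.
function Get-RecentLogError {
    param (
        [string]$Path,
        [string]$Pattern = 'ERROR|Exception|SEVERE',   # assumed markers, adjust for your application
        [int]$Tail = 500
    )
    Get-Content -Path $Path -Tail $Tail |
        Select-String -Pattern $Pattern |
        ForEach-Object { $_.Line }
}

# Example call; the path is a placeholder, not a real location:
# Get-RecentLogError -Path 'D:\App\logs\application.log'
```

No output from a freshly restarted application is a good sign, although it is only as reliable as the pattern you give it.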
✋ Work still pending
Your work is not done. If you’ve got into this position, there is a chance you have limited or very basic monitoring on that process, or even on the server. I would never recommend a CPU alert based on the percentage of CPU being used, because that is very 1985.
What you need is a baseline alert that monitors the server over a period of time and tells you when the CPU is outside normal operating parameters. If your mindset is still in “how busy is the CPU” mode, you will need to beat that out of yourself with a stick 🌲 or possibly a blunt rock 🪨, or alternatively relearn the fact that monitoring has moved on since those dangerous days of individual spikes.
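One common way to express "outside normal operating parameters" is to compare a new reading against the mean and standard deviation of earlier readings. To be clear, this is a swapped-in illustration and not what the script later in this post does (that one uses a fixed threshold held for a sustained period). Test-AboveBaseline and its Sigma parameter are hypothetical names.

```powershell
# Hypothetical helper: flag a reading that sits well above the recorded baseline.
function Test-AboveBaseline {
    param (
        [double[]]$History,      # earlier CPU readings for this process
        [double]$Current,        # the latest reading
        [double]$Sigma = 3       # how many standard deviations count as abnormal
    )
    if ($History.Count -lt 2) { return $false }      # not enough data for a baseline

    $mean = ($History | Measure-Object -Average).Average
    $sumSq = 0.0
    foreach ($value in $History) { $sumSq += [math]::Pow($value - $mean, 2) }
    $stdDev = [math]::Sqrt($sumSq / ($History.Count - 1))   # sample standard deviation

    return $Current -gt ($mean + $Sigma * $stdDev)
}

Test-AboveBaseline -History 0.01, 0.12, 0.05, 0.03, 0.14 -Current 91.3
```

A process that normally idles below 1% gets flagged at 91%, while a process that normally runs hot would not be flagged by the same reading, which is the point of a baseline over a raw percentage.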
📧 Notification via email
If you wish to be notified about excess usage outside the baseline, that could look like the example below. Email is only one method, but it is effective if you know how to react to that email and it’s not simply deleted.
🪵 Log File from Script
In this instance the script (that we will cover a little later on) creates a log file that stores the historical data used for the baseline values. This is a sample of that output, captured with a customised interval in the script (the timestamps show roughly 30 seconds between each scan and its readings):
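The script further down builds the email body from an HTML template file (message.html) containing {{SUBJECT}} and {{BODY}} placeholders. The template itself is not shown in this post, so the following is only a guess at a minimal one, written to a temporary folder, together with a literal placeholder substitution like the one the script performs. The server name and process ID in the example values are placeholders.

```powershell
# Assumed minimal template; the real message.html can be styled however you like.
$template = @'
<html>
  <body>
    <h2>{{SUBJECT}}</h2>
    <p>{{BODY}}</p>
  </body>
</html>
'@
$templatePath = Join-Path ([System.IO.Path]::GetTempPath()) 'message.html'
Set-Content -Path $templatePath -Value $template

# Fill in the placeholders with example values
$html = Get-Content -Path $templatePath -Raw
$html = $html.Replace('{{SUBJECT}}', 'Alert: High CPU usage by java on SERVER01')
$html = $html.Replace('{{BODY}}', 'Process ID: 11400')
$html
```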
2024-08-03 19:47:59 - Monitoring 4 java processes.
2024-08-03 19:48:30 - Process: java (ID: 4368) CPU Usage: 0.01299476%
2024-08-03 19:48:30 - Process: java (ID: 4460) CPU Usage: 0.1169529%
2024-08-03 19:48:30 - Process: java (ID: 8780) CPU Usage: 0.05197904%
2024-08-03 19:48:30 - Process: java (ID: 11400) CPU Usage: 91.34017%
2024-08-03 19:49:00 - Monitoring 4 java processes.
2024-08-03 19:49:31 - Process: java (ID: 4368) CPU Usage: 0.02594205%
2024-08-03 19:49:31 - Process: java (ID: 4460) CPU Usage: 0.1426813%
2024-08-03 19:49:31 - Process: java (ID: 8780) CPU Usage: 0.06485512%
2024-08-03 19:49:31 - Process: java (ID: 11400) CPU Usage: 94.26043%
The script starts a scan for the process, then once it has worked out the CPU usage of each process it writes it to the log file, which is the output you can see above for two samples. Interestingly, you can see that PID 11400 is taking up nearly all the CPU and the other processes are nowhere near as intensive.
Note : If you wanted to dump the memory to see what was going on here, PID 11400 is the process you would need to target for that process dump.
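With more than a handful of processes it is easier to let PowerShell pick the offending PID out of the log than to read it by eye. This is a sketch that assumes the exact line format shown in the sample above. ConvertFrom-CpuLogLine is a hypothetical helper name, and the two sample lines are copied from that log.

```powershell
# Hypothetical helper: turn log lines of the form shown above into objects.
function ConvertFrom-CpuLogLine {
    param (
        [string[]]$Line
    )
    $pattern = '^(?<When>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - Process: (?<Name>\S+) \(ID: (?<Id>\d+)\) CPU Usage: (?<Cpu>[\d.]+)%$'
    foreach ($entry in $Line) {
        if ($entry -match $pattern) {                # "Monitoring N java processes." lines are skipped
            [PSCustomObject]@{
                When = $Matches['When']
                Name = $Matches['Name']
                Id   = [int]$Matches['Id']
                Cpu  = [double]::Parse($Matches['Cpu'], [cultureinfo]::InvariantCulture)
            }
        }
    }
}

$sample = @(
    '2024-08-03 19:48:30 - Process: java (ID: 4368) CPU Usage: 0.01299476%',
    '2024-08-03 19:48:30 - Process: java (ID: 11400) CPU Usage: 91.34017%'
)
$busiest = ConvertFrom-CpuLogLine -Line $sample |
    Sort-Object -Property Cpu -Descending |
    Select-Object -First 1
$busiest
```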
🚀 Mission Control : Script that "does the magic"
Right, now we get on to the script that does all the magic and monitors the "java" process. You can substitute any other process, but remember to update all the variable names and messages or it will still report as "java".
🔑 Script : Key variables
With the variables set as below, the check occurs every 60 seconds and looks for processes called "java". Any of those processes needs to be at 95% CPU or more for 90 minutes, as reflected in the logs, before an alert is raised, and the last value controls how frequently you get the emails.
$javaProcessName = "java"
$cpuThreshold = 95 # CPU usage threshold in percentage
$monitoringInterval = 60 # Monitoring interval in seconds
$sustainedPeriod = 90 # Sustained period in minutes
$emailCooldownPeriod = 30 # Cooldown period in minutes
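The interval and the sustained period together decide how many consecutive readings have to be over the threshold before an alert fires. The helper below is hypothetical, but the arithmetic is the same expression the full script uses ($sustainedPeriod * 60 / $monitoringInterval).

```powershell
# Hypothetical helper: readings that must all be at or above the threshold.
function Get-RequiredSampleCount {
    param (
        [int]$SustainedMinutes,
        [int]$IntervalSeconds
    )
    return $SustainedMinutes * 60 / $IntervalSeconds
}

Get-RequiredSampleCount -SustainedMinutes 90 -IntervalSeconds 60   # 90 readings
Get-RequiredSampleCount -SustainedMinutes 30 -IntervalSeconds 20   # also 90 readings
```

Choosing values that divide cleanly keeps the count a whole number, which makes the behaviour easier to reason about.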
Script : JavaChecker.ps1
Update the values in the "Define parameters" block for your requirements and then run the script. Note that this listing uses a 20-second interval and a 30-minute sustained period rather than the values shown above. Remember that this script will continue to run, as it is monitoring a process, so the best way to use it is as a Scheduled Task so it can run without a user being logged in. For testing purposes manual runs are fine.
# Define parameters
$javaProcessName = "java"
$cpuThreshold = 95 # CPU usage threshold in percentage
$smtpServer = "<smtp-server>"
$alertRecipient = "lee@croucher.cloud"
$alertSender = "java.process@croucher.cloud"
$monitoringInterval = 20 # Monitoring interval in seconds
$sustainedPeriod = 30 # Sustained period in minutes
$emailCooldownPeriod = 30 # Cooldown period in minutes
$csvFilePath = "LogFile.csv" # Path to the CSV log file
$htmlTemplatePath = "message.html" # Path to the HTML template
# Get the server name
$serverName = [Environment]::MachineName
# Initialize a hashtable to store CPU usage history
$cpuUsageHistory = @{}
$lastEmailSent = $null
# Function to send an email alert (at most one per cooldown period)
function Send-Alert {
    param (
        [string]$subject,
        [string]$body
    )
    $currentTime = Get-Date
    # Read and write the same script-scoped variable so the cooldown is actually honoured
    if ($null -eq $script:lastEmailSent -or ($currentTime - $script:lastEmailSent).TotalMinutes -ge $emailCooldownPeriod) {
        $htmlContent = Get-Content -Path $htmlTemplatePath -Raw
        # Literal replacement, so nothing in the subject or body is treated as regex syntax
        $htmlContent = $htmlContent.Replace("{{SUBJECT}}", $subject)
        $htmlContent = $htmlContent.Replace("{{BODY}}", $body)
        Send-MailMessage -From $alertSender -To $alertRecipient -Subject $subject -Body $htmlContent -BodyAsHtml -SmtpServer $smtpServer
        # Update the last email sent time
        $script:lastEmailSent = $currentTime
    }
}
# Function to log CPU usage
function Log-ProcessData {
    param (
        [int]$processId,
        [string]$processName,
        [float]$cpuUsage
    )
    $logEntry = [PSCustomObject]@{
        DateTime = (Get-Date).ToString("yyyy-MM-dd HH:mm:ss")
        Process  = "$processName (ID: $processId)"
        CPUTime  = "$([math]::round($cpuUsage, 2))%"
    }
    $logEntry | Export-Csv -Path $csvFilePath -Append -NoTypeInformation
}
# Function to check sustained high CPU usage
function Check-SustainedHighCPU {
    param (
        [array]$cpuHistory,
        [int]$threshold,
        [int]$duration
    )
    # Ensure the history has enough data points
    if ($cpuHistory.Length -lt ($duration * 60 / $monitoringInterval)) {
        return $false
    }
    # Check if all values in the period exceed the threshold
    foreach ($cpu in $cpuHistory[-($duration * 60 / $monitoringInterval)..-1]) {
        if ($cpu -lt $threshold) {
            return $false
        }
    }
    return $true
}
# Initialize CSV file with headers if it doesn't exist
if (-not (Test-Path -Path $csvFilePath)) {
    Add-Content -Path $csvFilePath -Value "DateTime,Process,CPUTime"
}
# Monitor java processes
while ($true) {
    # Get initial CPU time for all java processes
    $javaProcesses = Get-Process -Name $javaProcessName -ErrorAction SilentlyContinue
    $processCount = @($javaProcesses).Count
    $initialCpuTimes = @{}
    $initialTimes = Get-Date

    if ($processCount -eq 0) {
        Write-Output "No java process found."
        $logEntry = [PSCustomObject]@{
            DateTime = (Get-Date).ToString("yyyy-MM-dd HH:mm:ss")
            Process  = "No java process found"
            CPUTime  = "N/A"
        }
        $logEntry | Export-Csv -Path $csvFilePath -Append -NoTypeInformation

        # Nothing to measure, so wait one interval before looking again
        Start-Sleep -Seconds $monitoringInterval
    } else {
        # Log the count of java processes
        $logEntry = [PSCustomObject]@{
            DateTime = (Get-Date).ToString("yyyy-MM-dd HH:mm:ss")
            Process  = "Monitoring $processCount java processes"
            CPUTime  = "N/A"
        }
        $logEntry | Export-Csv -Path $csvFilePath -Append -NoTypeInformation

        foreach ($process in $javaProcesses) {
            $processId = $process.Id
            $initialCpuTimes[$processId] = $process.CPU
        }

        # Wait one interval; the CPU time consumed during it gives the usage figure
        Start-Sleep -Seconds $monitoringInterval

        # Get new CPU time for all java processes
        $newCpuTimes = @{}
        $newTimes = Get-Date
        $javaProcesses = Get-Process -Name $javaProcessName -ErrorAction SilentlyContinue
        foreach ($process in $javaProcesses) {
            $processId = $process.Id
            $newCpuTimes[$processId] = $process.CPU
        }

        foreach ($process in $javaProcesses) {
            $processId = $process.Id
            $processName = $process.Name

            # Calculate CPU usage (only for processes present in both readings)
            if ($initialCpuTimes.ContainsKey($processId) -and $newCpuTimes.ContainsKey($processId)) {
                $initialCpuTime = $initialCpuTimes[$processId]
                $newCpuTime = $newCpuTimes[$processId]
                # CPU seconds used / elapsed seconds / core count, as a percentage of the whole machine
                $cpuUsage = ($newCpuTime - $initialCpuTime) / ($newTimes - $initialTimes).TotalSeconds / [Environment]::ProcessorCount * 100

                # Update CPU usage history, keeping only the samples that cover the sustained period
                if (-not $cpuUsageHistory.ContainsKey($processId)) {
                    $cpuUsageHistory[$processId] = @()
                }
                $cpuUsageHistory[$processId] += $cpuUsage
                if ($cpuUsageHistory[$processId].Length -gt ($sustainedPeriod * 60 / $monitoringInterval)) {
                    $cpuUsageHistory[$processId] = $cpuUsageHistory[$processId][-($sustainedPeriod * 60 / $monitoringInterval)..-1]
                }

                # Log CPU usage
                Log-ProcessData -processId $processId -processName $processName -cpuUsage $cpuUsage

                # Check for sustained high CPU usage
                if (Check-SustainedHighCPU -cpuHistory $cpuUsageHistory[$processId] -threshold $cpuThreshold -duration $sustainedPeriod) {
                    $subject = "Alert: High CPU usage by $processName on $serverName"
                    $body = "Process ID: $processId`nCPU Usage: $([math]::round($cpuUsage, 2))% sustained for $sustainedPeriod minutes"
                    Send-Alert -subject $subject -body $body
                }
            }
        }
    }
    # Each cycle sleeps exactly once, so one sample is recorded per monitoring interval.
    # A second sleep here would halve the sampling rate and double the time needed to
    # fill the sustained-period history.
}
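Registering the script as a Scheduled Task that starts with the server could look something like the sketch below. The folder C:\Scripts and the task name JavaChecker are assumptions for illustration. The working directory is set explicitly because the script refers to LogFile.csv and message.html by relative path.

```powershell
# Assumed location of the script; adjust to wherever JavaChecker.ps1 is saved.
$scriptFolder = 'C:\Scripts'
$scriptPath   = Join-Path $scriptFolder 'JavaChecker.ps1'

# Run PowerShell with the script, starting in the folder that holds the CSV and template
$action = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File `"$scriptPath`"" `
    -WorkingDirectory $scriptFolder

# Start at boot and run as SYSTEM so no user needs to be logged in
$trigger   = New-ScheduledTaskTrigger -AtStartup
$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName 'JavaChecker' -Action $action -Trigger $trigger -Principal $principal
```

Task Scheduler applies a default execution time limit to new tasks, so it is worth checking the task's settings if the script is meant to run indefinitely.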
CSV Output
This will then output a CSV with the data as below, so you can easily chart it in Excel if you wish to track spikes and other issues. This also lets you track when the issue started once you get the notification email.
DateTime,Process,CPUTime
"2024-08-05 07:09:48","Monitoring 4 java processes","N/A"
"2024-08-05 07:10:08","java (ID: 4368)","0.02%"
"2024-08-05 07:10:08","java (ID: 4460)","0.08%"
"2024-08-05 07:10:08","java (ID: 8780)","0.06%"
"2024-08-05 07:10:08","java (ID: 13292)","25.16%"