Some user calls the admin and says they can't get their email. "Is the server down?"
It might be, but with only one complaint, it's probably not.
I propose a system where a user can register a complaint about a service, such as "email is down" or "network drive not accessible" or "website down". If a quorum of users (say, 60% or whatever) say that the service is down, it gets restarted.
Otherwise, each complaint is logged and IT handles them individually. -- lawpoop, Jul 31 2003
Network monitoring tool http://www.ipswitch.../WhatsUp/index.html (blatant self promotion) [krelnik, Oct 04 2004]
Software to solve problems like this is what I currently do for a living. Most IT departments use monitoring tools of some kind that continually poll the servers (see link for one). If they have set it up right, they know the server is down long before the phone rings.
The problem with a knee-jerk reaction like restarting a server is that these sorts of problems often have multiple causes. A cut or disconnected cable between buildings could cause a server to "appear" to be down for one group, while another group is happily working away. Your reboot would not solve the problem; it would make it (temporarily) worse. -- krelnik, Jul 31 2003
[krelnik] It's not a knee-jerk reaction - you have to have the quorum say the system is down.
Your network monitoring tools aren't any smarter than the quorum - neither can distinguish between a down server and a severed patch cable.
Reboots (and again, I'm not talking about reboot, but a service restart) are often the first response. I mean, don't you try that first, before you go digging into the patch cable closet?
Anyway, restarting a service is not a huge deal. I'm not talking about a machine reboot, but 'apachectl restart'.
As far as service down for one group and not the other, you put the other group *in the quorum*. -- lawpoop, Jul 31 2003
I noticed that I had "server reboot" in the title and then "service restart" in the body. I changed that. -- lawpoop, Jul 31 2003
//neither can distinguish between a down server and a severed patch cable// Not so! When I lose my connection to the internet at work, I can look on a map and tell you exactly which component has failed (the local router, the T1 to the ISP, the router at my ISP, DNS, etc). This is precisely why people buy these products, so they can stop managing by guesswork, and go right to the source of the problem. -- krelnik, Jul 31 2003
Wait a minute -- are *you* running nmap, or is the monitoring software?
If the monitoring software is running nmap, then the quorum software could do the same thing.
I guess you can look at this as adding an extra dimension to your monitoring software. What do *users* say is down?
A quorum might be able to give you different information than software whose whole AI system is a series of scripts. -- lawpoop, Jul 31 2003
The monitoring software does something equivalent to what nmap does, and a whole lot more, internally. Then it displays it graphically on a map of the network, and optionally sends notifications to your pager, email, etc, etc.
//adding an extra dimension to your monitoring software// Well what you are now talking about is having a distributed monitoring system, to check connectivity on your network from different perspectives. We highly recommend that, as it is true that things can appear to be working from one perspective, but down from another.
But you need to rely on detailed probes, not the complaints of some bunch of users who have little knowledge of how your network is connected.
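A rough sketch of the "different perspectives" point, for illustration only. It assumes each vantage point has already run its own probe and reported a plain up/down result; the function name, the vantage-point names and the three labels are all made up here and are not how any particular monitoring product works.

```python
def classify(results):
    """Summarise probe results gathered from several vantage points.

    `results` maps a (hypothetical) vantage-point name to True if the
    service answered a probe from there, False otherwise.
    Returns (label, list of vantage points where the probe failed).
    """
    if not results:
        return ("unknown", [])
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        return ("up", [])
    if len(failing) == len(results):
        # Nobody can reach it: the service or its server is the likely suspect.
        return ("down", failing)
    # Reachable from some places only: a restart is unlikely to help.
    return ("partial", failing)
```

Under this sketch, a "partial" result is the severed-cable case described earlier in the thread, where one group sees an outage and another does not.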
Our product has a built-in web server that lets you view its information remotely. So you could put several of them around your network, and use a web browser to see the current network status. -- krelnik, Jul 31 2003
One thing I like about this is that if the voting system itself is down, you'd never hear about it.
I don't get it. A service is either down or it's not, regardless of whether a bunch of people think it is. Net monitoring stuff will tell you that. Relying on a bunch of people who freak if their email attachment takes more than ten seconds will not. You'd get a lot of 'false positives' - people voting that something was down when in fact it's not. -- waugsqueke, Jul 31 2003
[waugs] That's what the quorum is for. Sure, people freak out all the time -
"I didn't get an email with an important attachment!"
"Did you hit send and receive?"
"Oh, there it is."
But you set up your quorum system to restart the mail service when, say, 75% of users claim that it's down, or, say, at least 50 people from two different networks say that it is down.
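A minimal sketch of that rule, for illustration only. The `Complaint` record, the `quorum_reached` function and its parameter names are assumptions invented here; the default thresholds are just the example figures above (75% of users, or 50 people across two networks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complaint:
    user: str     # who filed the complaint
    network: str  # network the user sits on (hypothetical label)
    service: str  # admin-defined category, e.g. "email"


def quorum_reached(complaints, service, total_users,
                   min_fraction=0.75, min_count=50, min_networks=2):
    """Return True if either example quorum rule is met for `service`."""
    # Count each user once, no matter how many times they complain.
    voters = {c.user: c.network for c in complaints if c.service == service}
    if not voters or total_users <= 0:
        return False
    # Rule 1: a large enough share of all users say it is down.
    fraction_rule = len(voters) / total_users >= min_fraction
    # Rule 2: enough people, spread over enough different networks.
    networks = set(voters.values())
    count_rule = len(voters) >= min_count and len(networks) >= min_networks
    return fraction_rule or count_rule
```

Counting each user once keeps a single panicking person from reaching the quorum alone, and the network requirement is one way to put "the other group" in the quorum.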
As far as the system itself being down, that's a problem with *every system*, even krelnik's monitoring system. -- lawpoop, Jul 31 2003
Actually we have a way of dealing with that too, watch our website for details next week. -- krelnik, Jul 31 2003
[krelnik]
Is your monitoring software typically on one machine? Or can you install it on every client?
If not, I would think that my system is more robust, because of its redundancy.
If you have the network monitor on one system, and it goes down, then you have no network monitoring. If you have it on 5 systems, and then their switch goes down, no more monitoring. If you have 100 network monitors, and the admin's terminal goes down, no network monitoring (insofar as it's only beneficial when a person can do something about an outage).
With my system, you might have a central location where people upload their votes. Or, you could have a tallier that goes through the network, polling each workstation.
It's a swarm intelligence. -- lawpoop, Jul 31 2003
Well, your central location is going to be a failure point. What if that goes down? -- krelnik, Jul 31 2003
I vote we restart [lawpoop] .. joke.. sorry.. if you can pick definable services that can be re-started this could be a really useful idea, I'm just not sure you can. -- neilp, Jul 31 2003
[neilp]
try:
apachectl restart
named restart
smb restart
etc.
Worst-case scenario:
shutdown -r now.
I think NT has similar command line features for its services. -- lawpoop, Jul 31 2003
[krelnik]
With the swarm intelligence model, if the central system is down, the admin could conduct polling from any one of the swarm members. -- lawpoop, Jul 31 2003
Small world... I use krelnik's product in government installations mainly because it can be set up quickly and is easily configured for paging, therefore providing me with a statistical presentation of the reduction in phone calls placed to help desks.
Quorum of the users? Oh no, the support team needs to know before the Colonel. This is a step backwards for Network Management Systems.
What krelnik said. -- Shz, Jul 31 2003
OK, dammit [sigh]. I'll read your darn website... -- lawpoop, Jul 31 2003
[lawpoop] I was just adding my thoughts to the school of thought today that it's not obvious (esp. to users) which 'service' is defective. I think the idea falls down a bit there. -- neilp, Jul 31 2003
[neilp] Well, you're right in the sense that they don't know "The DNS server is down and it's preventing us from receiving email", but users do have a basic clue such as "The shared drive on the accounting server is down" or "the web database is down".
I mean, obviously, it's not 100% perfect, and it can't fix everything. But, it does let you know that most users think something isn't working right, and automatically does the very first thing that most sysadmins would do in the first place.
This system would have admin-defined problem reports, such as "Email is down", "Internet is down", "Accounting drive is down", "website is down". If there are enough votes for a particular category, then it launches a restart script.
So, say the internet is down (like your T1 or DSL or whatever). You might get a few votes for "Email is down" or "Company Intranet Website is down", but you would get an awful lot of "Internet is down".
You might also cross-list problem complaints, so that "Email is down" and "Intranet website is down" count also as "Internet is down". -- lawpoop, Jul 31 2003
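A small sketch of how the cross-listed tally could work, for illustration only. The `CROSS_LISTED` table, the function names and the threshold are assumptions invented here; only the category labels come from the posts above.

```python
from collections import defaultdict

# Hypothetical cross-listing table: a complaint filed under the key
# also counts toward every category listed in the value.
CROSS_LISTED = {
    "Email is down": ["Internet is down"],
    "Intranet website is down": ["Internet is down"],
}


def tally(votes, cross_listed=CROSS_LISTED):
    """Count distinct complaining users per category.

    `votes` is an iterable of (user, category) pairs.
    """
    voters = defaultdict(set)
    for user, category in votes:
        voters[category].add(user)
        for implied in cross_listed.get(category, []):
            # Sets mean a user is counted once per category, even if
            # several of their complaints imply the same category.
            voters[implied].add(user)
    return {category: len(users) for category, users in voters.items()}


def categories_over_threshold(votes, threshold, cross_listed=CROSS_LISTED):
    """Return the categories with at least `threshold` distinct voters."""
    counts = tally(votes, cross_listed)
    return sorted(cat for cat, n in counts.items() if n >= threshold)
```

Each category returned by `categories_over_threshold` would then be handed to whatever restart script the admin has defined for it; that dispatch step is left out here.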