How to Backup/Restore a Windows 2003 Domain Controller
Posted by General Zod in Microsoft, Tech.
A couple years back, I was working for a rather large company with
hundreds of sites in about 50 different countries that were all linked
by a single global network… except for 4 or 5 data center sites that
were called “solution centers”. I worked at one of these special
sites. The purpose of the solution centers was to house whatever
services a customer company required of us while keeping those services
separate from our company’s global network. As we were not part of the global
network, we were considered the black sheep of the company… and I was
the lone systems engineer responsible for keeping the servers at my site
running. No bother… I do my best work when I’m left to my own devices.
However, this did present many additional complications that others
in my company did not have to contend with. The largest challenge to
overcome was our site’s disaster recovery plan. We could not just
assume to relocate to a new site because we would need to recover our
own environment, which included our own domain.
Yes, I know… I could have just housed one of our domain controllers
at another location and established a special VPN just for the
communications between the DCs. That would be a valid solution, but
just not good enough. During a DR event, it would leave me heavily
dependent upon the IT staff at that other location… and call me crazy,
but I want to be able to ensure that I would be able to perform the
recovery 100% without the assistance of anyone else.
I spent a lot of time reading over Microsoft white papers and
procedures written by various individuals, throwing ideas around with
colleagues, and plucking away at ideas in an attempt to develop a
procedure that would fulfill our needs. Eventually, I developed the
procedure that you’ll read below… and tested it successfully on several
occasions. Knowing that someone else out there is probably looking for
the same thing, I figured it would be grand to share it with you.
How to Backup the Domain Controller(s)
Obviously, before you can restore your domain, you have to back it up first. :)
Mainly what we’re interested in backing up is the System State of a Domain Controller. So what is the System State?
The System State of your server includes the Registry,
the Boot files, some System files, the Active Directory service, and
other components.
You cannot pick and choose which components are backed up
during a System State backup. It’s an all or nothing situation.
Since this includes the whole of your Registry, you have to
understand that this includes the information about the original
System’s installed hardware. This may complicate the restore process
somewhat. If you backed up the System State from a DC on an HP ProLiant
DL380 G5 series server… and attempt to restore it to a Dell PowerEdge
T100… you will most likely have issues with booting up the OS afterwards
because the hardware set is significantly different.
As part of your DR plan, I recommend making a point of documenting
the hostname, IP address, Operating System, Service Pack level, and the
hardware make/model of each of your domain controllers. You may find
this information useful when the time comes.
These instructions are going to use the hostname "DC123" as the name of
the domain controller, and assume that you want to run your System State
backup every day at 3:00am.
Login to your domain controller, and perform the following steps:
- Create a C:\Backup\ folder.
- Click Start — All Programs — Accessories — System Tools — Backup.
- Click [Next] — Select Backup Files and Settings — [Next].
- Select Let me choose what to back up — [Next].
- Expand My Computer — Check System State — [Next].
- Set the location of the backup file to C:\Backup\ folder.
Set the Name of the Backup to “DC123 System State”.
- Click [Next] — [Advanced] — Select Normal — [Next].
- Check the Verify Data after Backup box — [Next].
- Select Replace the existing backups — [Next].
- Select Later — Set the Job Name to “DC123 System State”.
- Click [Set Schedule] — Schedule the job to run Daily at 3:00am.
- Click [OK] — Enter a set of user credentials — [OK].
- Click [Next] — Enter the same user credentials again — [OK] — [OK] — [Finish].
The actual backup job itself will probably take somewhere between 15 –
30 minutes to run. Then, you can backup the C:\Backup\ folder to
tape. Personally, I preferred to schedule another task that would
launch at 4:00am to “robocopy” (which can be found as part of the Windows Server 2003 Resource Kit Tools download) each of the backup files to another server, where they were all dumped to tape a few hours later.
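If you would rather drive the same System State backup from a script instead of the wizard, ntbackup also accepts the job as command-line switches. A minimal sketch that just builds the command (the job name, path, and switch set mirror the wizard steps above; verify them against your ntbackup version before relying on this):

```python
def ntbackup_systemstate_cmd(job_name: str, backup_dir: str) -> list[str]:
    """Build an ntbackup command line for a normal (full), verified
    System State backup written to a .bkf file under backup_dir."""
    bkf_path = f"{backup_dir}\\{job_name}.bkf"
    return [
        "ntbackup", "backup", "systemstate",
        "/j", job_name,   # job name that appears in the backup logs
        "/f", bkf_path,   # destination .bkf file
        "/v:yes",         # verify data after backup
        "/m", "normal",   # normal (full) backup type
    ]

# The command you would schedule to run daily at 3:00am on DC123
cmd = ntbackup_systemstate_cmd("DC123 System State", "C:\\Backup")
print(" ".join(cmd))
```

You could drop the printed command into the same Scheduled Task setup as the 4:00am robocopy job, so both steps live in one place.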
You only really need to back up one domain controller for this to work,
but then you’re pretty much locked into a single hardware set when it
comes time to do the restore. Since I was never sure what kind of
hardware I would have available to me when it came time to do the
restores, I tried to make a practice of housing each domain controller
on a different model of server… and backing each of them up
individually. Each backup ran me somewhere between 600 – 800 MB of disk
space (a small pittance by today’s standards).
Yes, this was probably a significant amount of
overkill on my part. However, I find that the more paranoid you are,
the better prepared you tend to find yourself. And I tend to be rather
paranoid about things like DR.
How to Restore the Domain Controller(s)
Now let’s pretend that a disaster has struck!
You’ve retrieved your tapes from off-site storage and acquired your
target hardware, so let’s get to work! (Remember that matching the
hardware to the DC restore would be best, but you can make
substitutions. It’s not an exact science, so some experimentation may
be required.)
Note: These instructions are written with a few assumptions in mind.
- We assume that your entire domain has been leveled by some catastrophic event.
- We assume that your domain controllers are running a Windows 2003 operating system.
- We assume that whoever is doing the work knows the login
credentials (from the original domain) to the domain’s Administrator
account or a user account that is a member of both the domain’s "Domain
Admins" and "Schema Admins" groups.
- Build a stand-alone Windows 2003 server, and bring it up to the same Service Pack level as the original DC.
- Name the server with the same hostname as your original DC.
- Restore your System State backup files from tape, and copy them to the new server’s local hard disk.
- Reboot the server.
- After POST, hit [F8] and select to boot into “Directory Services Restore Mode (Windows domain controllers only)”.
- Click Start — All Programs — Accessories — System Tools — Backup.
- Click [Next] — Select Restore files and settings — [Next] — Browse to the location of the backup file — [Next].
- Expand File – System State Backup — Check the System State box — [Next].
- Click [Advanced] — Select Original Location — [Next] — [OK] — Select Leave existing files (Recommended) — [Next].
- Check the boxes for:
* Restore Security Settings
* Restore junction points, but not the folder and file data
* Preserve existing volume mount points
* When restoring replicated data sets, mark the restored data as the primary data for all replicas
- Click [Next] — [Finish].
- After the restore is completed, click [Close] — [Yes] to reboot the system.
If your server hardware is significantly different from
the original DC, then you may experience difficulty with the boot to the
GUI. If this is the case, then you might be able to still recover the
OS by booting into Safe Mode or by booting to an original Windows 2003
OS CD to perform a Repair.
Once you get into the GUI, you will need to login using the local Administrator password from the original DC.
Now you will be able to seize the FSMO roles. (Note: After each
"seize" command, click [Yes] and allow 3-5 minutes for the task to
complete.)
- Click Start — Run — NTDSUTIL — [OK].
- Type the following commands into NTDSUTIL.
roles
connections
connect to server DC123
q
seize domain naming master
seize infrastructure master
seize PDC
seize RID master
seize schema master
q
q
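NTDSUTIL will also accept its commands as quoted arguments on a single command line, so the interactive seize sequence above can be scripted. A sketch that builds the equivalent one-shot invocation (DC123 is the placeholder hostname from these instructions; rehearse this in a lab before a real DR event):

```python
def ntdsutil_seize_cmd(server: str) -> list[str]:
    """Build a one-shot ntdsutil invocation that seizes all five
    FSMO roles onto the given server, mirroring the interactive steps."""
    return [
        "ntdsutil",
        "roles",
        "connections",
        f"connect to server {server}",
        "q",
        "seize domain naming master",
        "seize infrastructure master",
        "seize PDC",
        "seize RID master",
        "seize schema master",
        "q", "q",
    ]

cmd = ntdsutil_seize_cmd("DC123")
# Each multi-word command must stay quoted when typed into cmd.exe
print(" ".join(f'"{arg}"' if " " in arg else arg for arg in cmd))
```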
Next, confirm that your DC is a Global Catalog server.
- Launch AD Sites and Services
(C:\Windows\System32\dssite.msc)
- Expand Sites – Default-First-Site-Name – Servers – DC123.
- Right-click NTDS Settings and select Properties — On the General tab, verify that the Global Catalog box is checked.
- Perform a clean reboot of the system.
Now we’ll clean the old domain controllers out of the AD database.
- Click Start — Run — NTDSUTIL — [OK].
- Type the following commands into NTDSUTIL.
metadata cleanup
connections
connect to server DC123
quit
select operation target
list domains
select domain <#>
list sites
select site <#>
list servers in site
select server <# of bad DC>
quit
remove selected server
quit
- Launch Active Directory Sites and Services (C:\Windows\System32\dssite.msc).
- Expand Sites – Default-First-Site-Name – Servers.
- Right-click on the old DC’s server object — Select Delete.
- Launch Active Directory Users and Computers (C:\Windows\System32\dsa.msc).
- Expand the domain — Open the Domain Controllers container.
- Right-click on the old DC’s computer object — Select Delete.
- Select The domain controller is permanently offline and can no
longer be demoted using Active Directory Installation Wizard (DCPROMO).
- Click [Delete] — [Yes] to confirm.
Your domain should now be successfully restored, but don’t consider
yourself finished at this point. This restored server should be
considered hinky at best, and should not be kept as a long-term
solution.
Before doing anything else, I recommend that you build a 2nd “clean”
domain controller alongside this restored 1st DC. Then, transfer the
FSMO roles to the 2nd DC. Finally, demote the 1st DC to a member server
and retire it from the domain. That will hopefully ensure that your
domain is running on a clean and stable DC that you can rely upon.
Then, build a new 2nd DC to ensure some redundancy.
Congratulations! Your domain is restored. Now get to work on restoring everything else. :)
Windows Server 2003 Disaster Recovery Planning (Part 1)
In this article, we will discuss what every Microsoft Windows
Administrator and Engineer should think about when trying to manage
their environments in the scope of planning for Disaster Recovery and
Business Continuity. This is Part I in a 4-part article series where we
will cover many of the details administrators and engineers need to know
about planning Disaster Recovery for Windows Systems, as well as for
their networks in general. In part I, we will look at Windows 2000 &
Windows Server 2003 Clustering & Load Balancing for high
availability, as well as general planning information.
Planning for High Availability
Windows Server Disaster Recovery Planning can be a chore, but if you
have the details and a plan, setup can go smoothly, and the plan will be a
life saver when your systems start to smoke and your VPs are knocking
on your office door asking what the heck is going on! In this section we
will look at how to plan for High Availability.
Taking the time to plan and design is the key to your success, and
it’s not only the design, but also the study efforts you put in. I
always joke with my administrators and tell them they’re doctors of
technology. I say, “When you become a doctor, you’re expected to be a
professional and maintain that professionalism by educational growth
through constant learning and updating of your skills.” Many IT staff
technicians think their job is 9 to 5, with no studying done after
hours. I have one word for them: Wrong! You need to treat your
profession as if you’re a highly trained surgeon except, instead of
working on human life, you’re working on technology. And that’s how
planning for High Availability solutions needs to be addressed. You
can’t simply wing it and you can’t guess at it. You must be precise,
otherwise, your investment goes down the drain – and all the work you
put in will be not only useless, but also wasteful.
Plan Your Downtime
You need to achieve as close to 100 percent uptime as possible. You
know 100 percent uptime isn’t realistic, though, and it can never be
guaranteed. Breakdowns occur because of disk crashes, power or UPS
failure, application problems resulting in system crashes, or any other
hardware or software malfunction. So, the next best thing is 99.999
percent, which is still somewhat reasonable with today’s technology. You
can also define in a Service Level Agreement (SLA) what 99.999 percent
means to both parties. If you promised 99.999 percent uptime to someone
for a single year, that translates to only about five minutes of
downtime across the entire year. I would strive for a larger number, one that’s more
realistic to scheduled outages and possible disaster-recovery testing
performed by your staff. Go for 99.9 percent uptime, which allows for
about nine hours of downtime per year. This is more practical and
feasible to obtain. Whether providing or receiving such a service, both
sides should test planned outages to see if delivery schedules can be
met. You can figure this formula by taking the amount of hours in a day
(24) and multiplying it by the number of days in the year (365). This
equals 8,760 hours in a year. Use the following equation: percent of
uptime per year = 100 × (8,760 - number of total hours down per year) / 8,760.
If you schedule eight hours of downtime per month for maintenance and
outages (96 hours total), then you can say the percentage of uptime per
year is 8,760 minus 96 divided by 8,760. You can see you’d wind up with
about 98.9 percent uptime for your systems. This should be an easy way
for you to provide an accurate accounting of your downtime. Remember,
you must account for downtime accurately when you plan for high
availability. Downtime can be planned or, worse, unexpected. Sources of
unexpected downtime include the following:
- Disk crash or failure
- Power or UPS failure
- Application problems resulting in system crashes
- Any other hardware or software malfunction
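The uptime arithmetic above is easy to wrap in a couple of helpers; a quick sketch using the figures from the text:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def uptime_percent(hours_down_per_year: float) -> float:
    """Percent uptime per year for a given total downtime in hours."""
    return 100.0 * (HOURS_PER_YEAR - hours_down_per_year) / HOURS_PER_YEAR

def allowed_downtime_hours(uptime_pct: float) -> float:
    """Invert the formula: hours of downtime a given SLA permits per year."""
    return HOURS_PER_YEAR * (1.0 - uptime_pct / 100.0)

# 8 hours/month of scheduled maintenance = 96 hours/year
print(round(uptime_percent(96), 1))                    # 98.9
print(round(allowed_downtime_hours(99.9), 2))          # 8.76 hours
print(round(allowed_downtime_hours(99.999) * 60, 1))   # 5.3 minutes
```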
Building the Highly Available Solutions’ Plan
Let’s look at the plan to use a Highly Available design in your
organization and review the many questions you need to ask before
implementing it ‘live’. Remember, if the server is down, people can’t
work, and millions of dollars can be lost within hours. The following is
a list of what could happen in sequence:
- A company uses a server to access an application that accepts orders and does transactions.
- The application, when it runs, serves not only the sales staff,
but also three other companies who do business-to-business (B2B)
transactions. The estimate is that, during a peak hour, the money
made exceeded 2.5 million dollars.
- The server crashes and you don’t have a Highly Available
solution in place. This means no failover, redundancy, or load balancing
exists at all. It simply fails.
- It takes you (the systems engineer) 5 minutes to be paged, but
about 15 minutes to get onsite. You then take 40 minutes to troubleshoot
and resolve the problem.
- The company’s server is brought back online and connections are reestablished.
Everything appears functional again. The problem was simple this
time—a simple application glitch that caused a service to stop and, once
restarted, everything was okay. Now, the problem with this whole
scenario is this: although it was a true disaster, it was also a simple
one. The systems engineer happened to be nearby and was able to diagnose
the problem quite quickly. Even better, the problem was a simple fix.
This easy problem still took the companies’ shared application down for
at least one hour and, if this had been a peak-time period, over 2
million dollars could have been lost. The companies involved now want
assurance that 2 million in sales evaporating never happens again. Worse
still, the companies you connect to and your own clientele start to lose
faith in your ability to serve them. This could also cost you revenue
and the possibility of acquiring new clients moving forward. People talk
and the uneducated could take this small glitch as a major problem with
your company’s people, instead of the technology. Let’s look at this
scenario again, except with a Highly Available solution in place:
- A company uses a server to access an application that accepts orders and does transactions.
- The application, when it runs, serves not only the sales staff,
but also three other companies who do business-to-business (B2B)
transactions. The estimate is that, during a peak hour, the money
made exceeded 2.5 million dollars.
- The server crashes, but you do have a Highly Available solution
in place. (Note, at this point, it doesn’t matter what the solution is.
What matters is that you added redundancy into the service.)
- Server and application are redundant, so when a glitch takes place, the redundancy spares the application from failing.
- Customers are unaffected. Business resumes as normal. Nothing is lost and no downtime is accumulated.
- The ‘one hour’ you saved your business in downtime just paid for the entire Highly Available solution you implemented.
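The break-even logic in that last point can be made concrete in a couple of lines. A sketch using the illustrative figures from the scenario (the $1M solution cost below is a made-up placeholder, not real pricing):

```python
def downtime_cost(hours_down: float, revenue_per_hour: float) -> float:
    """Revenue lost while the application is unavailable."""
    return hours_down * revenue_per_hour

def pays_for_itself(solution_cost: float, hours_down: float,
                    revenue_per_hour: float) -> bool:
    """True if a single avoided outage covers the HA solution's cost."""
    return downtime_cost(hours_down, revenue_per_hour) >= solution_cost

# One avoided one-hour outage at the scenario's $2.5M/hour peak revenue
print(downtime_cost(1, 2_500_000))               # 2500000
print(pays_for_itself(1_000_000, 1, 2_500_000))  # True
```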
Human Resources and Highly Available Solutions
Human Resources (people) need to be trained and work on site to deal
with a disaster. They also need to know how to work under fire. As a
former United States Marine, I know about the “fog of war,” where you
find yourself tired, disoriented, and probably unfocused on the job.
These characteristics don’t help your response time with management. In
any organization, especially with a system as complex as one that’s
highly available, you need the right people to run it.
Managing Your Services
In this section, you see all the factors to consider while designing a
Highly Available solution. The following is a list of the main services
to remember:
- Service Management is the management of the true components of Highly
Available solutions: the people, the process in place, and the
technology needed to create the solution. Keeping this balance to have a
truly viable solution is important. Service Management includes the
design and deployment phases.
- Change Management is crucial to the ongoing success of the solution
during the production phase. This type of management is used to monitor
and log changes on the system.
- Problem Management addresses the process for Help Desks and Server monitoring.
- Security Management, as discussed in Chapter 7, is tasked to prevent unauthorized penetrations of the system.
- Performance Management is discussed in greater detail in this
chapter. This type of management addresses the overall performance of
the service, availability, and reliability.
Other main services also exist, but the most important ones are
highlighted here. Service management is crucial to the development of
your Highly Available solution. You must cater to your customer’s
demands for uptime. If you promise it, you better deliver it.
Highly Available System Assessment Ideas
The following is a list of items for you to use during the
postproduction-planning phase. Make sure you covered all your bases with
this list:
- Now that you have your solution configured, document it! A lack of
documentation will surely spell disaster for you. Documentation isn’t
difficult to do, it’s simply tedious, but all that work will pay off in
the end if you need it.
- Train your staff. Make sure your staff has access to a test
lab, books to read, and advanced training classes. Go to free seminars
to learn more about High Availability. If you can ignore the sales
pitch, they’re quite informative.
- Test your staff with incident response drills and disaster
scenarios. Written procedures are important, but live drills are even
better to see how your staff responds. Remember, if you have a failure
on a system, it could failover to another system, but you must quickly
resolve the problem on the first system that failed. You could have the
same issue on the other nodes in your cluster and, if that’s the case,
you’re on borrowed time. Set up a scenario and test it.
- Assess your current business climate, so you know what’s
expected of your systems at all times. Plan for future capacity
especially as you add new applications, and as hardware and traffic
increase.
- Revisit your overall business goals and objectives. Make sure
what you intend to do with your high-availability solution is being
provided. If you want faster access to the systems, is it, in fact,
faster? When you have a problem, is the failover seamless? Are customers
affected? You don’t want to implement a high-availability solution and
have performance that gets worse. This won’t look good for you!
- Do a data-flow analysis on the connections the high-availability
solution uses. You’d be surprised at the effect that damaged NICs, the
wrong drivers, excessive protocols, bottlenecks, and mismatched port
speeds and duplex settings, to name a few problems, have on the system.
I’ve made significant speed differences in networks by simply running
an analysis on the data flow on the wire. A good
example could be if you had old ISA-based NIC cards that only ran at 10
Mbps. If you plugged your system into a port that uses 100 Mbps, then
you will only run at 10, because that’s as fast as the NIC will go. What
would happen if the switch port was set to 100 Mbps and not to
autonegotiate? This would create a problem because the NIC wouldn’t
communicate on the network because of a mismatch in speeds. Issues like
this are common on networks and could quite possibly be the reason for
poor or no data flow on your network.
- Monitor the services you consider essential to operation and make
sure they’re always up and operational. Never assume a system will run
flawlessly just because no change has been implemented… at times, systems
choke up on their own, either from a hung thread or process. You can use
network-monitoring tools like GFI, Tivoli, NetIQ, or Argent’s software
solutions to monitor such services.
- Assess your total cost of ownership (TCO) and see if it was all worth it.
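At their core, the monitoring tools mentioned above start from a simple reachability probe. A minimal sketch of such a check (the hostname and port are hypothetical placeholders; real products layer alerting and service-level checks on top of this):

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Basic availability probe: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. probe a DC's LDAP port (389) from a monitoring host
if port_is_open("dc123.example.local", 389):
    print("LDAP reachable")
```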
Cost Analysis
Do a final cost analysis to check if you made the right decision.
Because, for the most part, all business models are different, the best
way to determine TCO is to go online and run a TCO calculator program,
answering its questions based on your own unique business model. Many
such calculators are available online - just run a search in a search
engine (like Google.com) on ROI/TCO calculators, and you will find them.
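The arithmetic those calculators perform boils down to up-front plus recurring costs. A toy sketch (every figure below is a made-up placeholder, not real pricing):

```python
def simple_tco(hardware: float, software: float, staff_per_year: float,
               support_per_year: float, years: int) -> float:
    """A very simplified TCO: up-front costs plus recurring annual
    costs over the solution's expected lifetime."""
    return hardware + software + years * (staff_per_year + support_per_year)

# Hypothetical 3-year figures for a small two-node cluster
total = simple_tco(hardware=60_000, software=20_000,
                   staff_per_year=90_000, support_per_year=10_000, years=3)
print(total)  # 380000
```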
Testing a High Availability System
Now that you have the planning and design fundamentals down, let’s
discuss the process of testing your high-availability systems. You need
to ensure the test runs for a long enough time, so you can get a solid
sampling of how the system operates normally without stress (or
activity) and how it runs with activity. Then, run a test long enough to
obtain a solid baseline, so you know how your systems operate normally
on a daily basis. Use that for a comparison during times of activity.
In Sum
This should give you a good running start on advanced planning for
high availability, and it gives you many things to check and think
about, especially when you’re done with your implementation.