31 Days of Disaster Recovery




Summary


These tests showed that you can indeed run the DBCC CHECKTABLE command in parallel jobs to fit more checks into your maintenance window. Creating the database snapshot ahead of time and running all of the checks directly against the snapshot gives the best results speed-wise. As for the best degree of parallelism, your mileage may vary, so I recommend testing different levels of MAXDOP to determine the optimal settings.

As Paul notes in the comments below, you should never run DBCC commands in parallel jobs unless you are creating the snapshot ahead of time and running it directly against the snapshot. And it’s important to remember that it’s not enough just to create the snapshot. You have to run the DBCC command in the context of the snapshot. It won’t automatically use the snapshot just because one exists.
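To make that concrete, here is a minimal sketch of the pattern (the database name, snapshot name, and file path are placeholders, and the logical file name must match your source database's):

-- Create the snapshot ahead of time (logical file name must match the source database)
Create Database MyDB_Snap On
    (Name = N'MyDB_Data',
     FileName = N'D:\Snapshots\MyDB_Data.ss')
As Snapshot Of MyDB;
Go
-- Run the check in the context of the snapshot, not the source database
Use MyDB_Snap;
Go
DBCC CHECKTABLE('dbo.FactInternetSalesBig') With No_InfoMsgs, All_ErrorMsgs;
Go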



Edit: A couple of people have asked about the script I used to create the dbo.FactInternetSalesBig table. The script I used was for testing out columnstore indexes and is from Kalen Delaney’s (blog|@SQLQueen) blog: Geek City: Build a Big Table with a Columnstore Index.

Day 16: Disaster Recovery Gems From Around the Net


It’s day 16 of my series 31 Days of Disaster Recovery. I’ve seen a lot of great DR related posts recently. I want to do kind of a round-up of some of the other must-read disaster recovery posts and articles out there.

  • Last Time CHECKDB was Run by Jason Brimhall (blog|@SQLRNNR). This post is important because knowledge is power. Especially if you may not be the one directly responsible for the integrity checks at your job. This is a quick way to check the last good DBCC integrity check for the database.

  • Automated Backup Tuning by Nic Cain (blog|@SirSQL). You can get some good performance boost by tweaking the backup buffers, but you can also send backup performance in the toilet by tweaking it the wrong way. And it depends heavily on many factors so there is only one reliable way to determine the best settings: trial and error. Nic has automated that process for you and it runs the backup through various settings and graphs the performance differences for you so you can easily see which settings give you the best performance.

  • Centralizing and Analyzing SQL Backup Pro Backup and Restore Data by Rodney Landrum (blog|@SQLBeat). It’s absolutely imperative that a DBA know the state of the backups on his servers. If you have a large number of servers, that can be a very daunting task. This process shows one way to automate this process and easily keep on top of a large number of servers.

  • The SQL Server Instance That Will not Start by Gail Shaw (blog|@SQLInTheWild). This is a great article about the common problems you may face when rebooting SQL Server and it won’t start. If you haven’t already been there, you will be soon enough. Gail walks you through fixing several different common problems (which believe me is much better than reinstalling production in the heat of the moment).

  • Everything by Paul Randal (blog|@PaulRandal). Okay, so I’m not referencing a particular blog post here, but if a week goes by without checking Paul’s blog, chances are pretty good that you missed something important. There’s no better source for learning deep internals on disaster recovery, DBCC, etc. Read it all. And then … read it again.

Day 17: When are Checksums Written to a Page


Today is day 17 of 31 Days of Disaster Recovery. The series has skipped a couple of days due to real life imposing itself, but we’re getting back on track by digging into the Checksum page verification option and offering up some proof that the checksum value doesn’t get written until the page is written to disk. You may also learn some cool tricks for looking at metadata information. Let’s explore.

Checksum Page Verification


I’m going to start out by creating a new database named TestPageVerify, set page verification to NONE, and add a table with some data in it.

-- Create database for testing
Create Database TestPageVerify;
Go
-- Set page verification to none
Alter Database TestPageVerify Set Page_Verify None;
Go
-- Switch to the database
Use TestPageVerify;
Go
-- Create a table with some data in it
Select * Into dbo.AllDBs
From sys.databases;
Go
-- Add a primary key
Alter Table dbo.AllDBs
Add Constraint PK_AllDBs_DBID primary key (database_id);
Go

Next, I will choose a record at random and use the undocumented function sys.fn_PhysLocFormatter(%%physloc%%) to return the file ID and page ID where the record is located. %%physloc%% is a binary representation of where the row is located, and fn_PhysLocFormatter breaks that down and returns the data as file ID, page ID, and slot ID formatted as (File:Page:Slot). After getting the results, I made a note of the location and the key value of the record that I'm going to be messing with.



-- Find a random data page in the table
-- (File:Page:Slot): (1:277:3)
-- database_id: 4
Select Top(1) sys.fn_PhysLocFormatter(%%physloc%%),
    database_id
From dbo.AllDBs
Order By NewID();
Go

If I use DBCC PAGE to dump the page in its current state, we'll see that it is not protected. The key data flags to note in the header output are m_flagBits and m_tornBits. You'll notice in this case that the flag bits are set to 0x0, meaning that a checksum has not been enabled and written to the page, and the torn bits are 0, as no value has been written yet.



-- Dump the page (any dump style)
DBCC TraceOn(3604);
DBCC Page(TestPageVerify, 1, 277, 1);
-- From the header output:
-- m_flagBits = 0x0
-- m_tornBits = 0
Go

Excerpt From DBCC PAGE Output

Now, I’ll enable checksum page verification and recheck the flag bits and torn bits of the page. If you follow it up by running CHECKPOINT manually and rechecking the page, you’ll see that the values don’t change.

Alter Database TestPageVerify Set Page_Verify CheckSum;
Go
-- Dump the page again
-- Still no change
DBCC Page(TestPageVerify, 1, 277, 1);
-- From the header output:
-- m_flagBits = 0x0
-- m_tornBits = 0
Go

The next step is to update our sample record so that the page is dirtied in memory. At this point, dumping the page shows that the checksum value still has not been written and the torn bits are still 0. Running a CHECKPOINT will write the page to disk and cause these values to be updated. The second dump of the page shows that the flag bits have been set to 0x200 (checksum page verification is enabled and populated) and the torn bits are set to a large integer value.



Begin Tran
Update dbo.AllDBs
Set name = name + '_Test'
Where database_id = 4;
Commit
Go
-- Dump the page again
-- Still no change
DBCC Page(TestPageVerify, 1, 277, 1);
-- From the header output:
-- m_flagBits = 0x0
-- m_tornBits = 0
Go
CHECKPOINT;
Go
-- Dump the page again
-- Checksum has now been written
DBCC Page(TestPageVerify, 1, 277, 1);
-- From the header output:
-- m_flagBits = 0x200
-- m_tornBits = 655784296
Go

The next thing I want to test is to add a large, fixed-length column to the table so I can update the record and force it to be moved to a new page. I’ll then use the fn_PhysLocFormatter function again to identify the new page where the record is located.



-- Add a 7600 fixed length column so we can force a page split
Alter Table dbo.AllDBs Add TestVal nchar(3800) null;
Go
Begin Tran
Update dbo.AllDBs
Set TestVal = N'Test'
Where database_id = 4;
Commit
Go
-- Find the same record in the table
-- (File:Page:Slot): (1:282:0)
Select Top(1) sys.fn_PhysLocFormatter(%%physloc%%),
    database_id
From dbo.AllDBs
Where database_id = 4;
Go

Dumping the page header shows that the checksum value is not yet set, as both the flag bits and the torn bits show 0. Running a manual CHECKPOINT and re-dumping the new page shows that the checksum value has now been written to the page.



-- Dump the new page
-- No checksum info
DBCC Page(TestPageVerify, 1, 282, 1);
-- From the header output:
-- m_flagBits = 0x0
-- m_tornBits = 0
Go
CHECKPOINT;
Go
-- Dump the page again
-- Checksum has now been written
DBCC Page(TestPageVerify, 1, 282, 1);
-- From the header output:
-- m_flagBits = 0x200
-- m_tornBits = 309696659
Go

Nothing left now but to disable the trace flag we enabled and drop the test database.



-- Disable DBCC Trace
DBCC TraceOff(3604);
-- Cleanup database
Use master;
If DB_ID('TestPageVerify') Is Not Null
Drop Database TestPageVerify;
Go

Summary


Run through the demo above (also attached below) and you will see that page checksums are not written to the page until the page is written to disk. Simply setting page verification to checksum is not sufficient. My recommendation is to plan to rebuild all indexes and heaps at your next index maintenance window to ensure all data gets rewritten. Beginning with SQL Server 2008, you can rebuild a heap with the ALTER TABLE ... REBUILD command.
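If you want a starting point, a minimal sketch of that rebuild looks like this (dbo.AllDBs is the demo table from above; the heap name is a placeholder):

-- Rebuild all indexes on a table; every index page gets rewritten with a checksum
Alter Index All On dbo.AllDBs Rebuild;
Go
-- Rebuild a heap directly (SQL Server 2008 and later)
Alter Table dbo.MyHeapTable Rebuild;
Go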

Demo script: WhenChecksumsAreWritten.zip (1 KB)


Day 18: How to CHECKDB like a Boss


Day 18 of my 31 Days of Disaster Recovery series is drawing to a close. It’s 11:22 PM here, and I’ve been working feverishly to finish today’s post before the calendar flips over to tomorrow. This started out as sharing a simple script I use for running DBCC CHECKDB against all databases on a server, and like I tend to do, I thought of lots of things I wanted to add to it. I spent several hours customizing my “this will be a quick blog post because the script is already written” script.

CHECKDB Like a Boss


As I said, this script started out as a simple script to run DBCC CHECKDB against every database on a server. As I was making it an “official” script, meaning one I feel is suitably evolved for sharing, I came up with several ideas for improving it. The scripts attached to this blog post consist of the T-SQL to create a table that tracks the results of the DBCC CHECKDB runs (so you can refer back to it later when you see that an error occurred) and a stored procedure that you can schedule via a SQL Agent job to run regularly.

I added a couple of parameters to the procedure as follows:



  • @DBName: Allows you to specify a specific database to run it on. The default is NULL which will run it against all online databases.

  • @UseSnapshotIfExists: If a database snapshot already exists, this allows you to specify whether DBCC CHECKDB should be run against the snapshot instead of the live database. If multiple snapshots exist, it will run against the most recently created snapshot. A value of 1 means it will run against the existing snapshot, and 0 (default) will ignore database snapshots.

Example usage:

Exec dbo.dba_CHECKDBLikeABoss @UseSnapshotIfExists = 1;

The procedure captures the error output into the table and raises a severity 16 error at the end if any DBCC CHECKDB checks failed, telling you which databases failed and directing you to check the logging table. The error is also written to the Windows event log in case you use monitoring software that scans the event log for errors. This gives you several ways to catch a failure (you know, just in case): the job failing, the error raised in SQL, and the error raised in the Windows event log.
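The full script is attached below, but to give a feel for the core approach, here is a heavily simplified sketch (dbo.CheckDBResults is a made-up logging table whose columns would need to match the TABLERESULTS output for your version of SQL Server, and the snapshot handling and error raising described above are left out):

-- Simplified sketch: run DBCC CHECKDB against every online database and log the results
Declare @DBName sysname,
    @SQL nvarchar(max);

Declare DBList Cursor Local Fast_Forward For
    Select name
    From sys.databases
    Where state_desc = 'ONLINE'
      And database_id <> 2; -- skip tempdb

Open DBList;
Fetch Next From DBList Into @DBName;

While @@Fetch_Status = 0
Begin
    Set @SQL = N'DBCC CHECKDB(' + QuoteName(@DBName) +
        N') With No_InfoMsgs, All_ErrorMsgs, TableResults;';

    -- dbo.CheckDBResults is a hypothetical logging table matching the TABLERESULTS columns
    Insert Into dbo.CheckDBResults
    Exec sp_executesql @SQL;

    Fetch Next From DBList Into @DBName;
End

Close DBList;
Deallocate DBList;
Go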

I considered writing a script to purge old data from the logging table, but if your server needs regular purging of this table, you’ve got major problems. You should be fine to simply truncate this table any time you’ve been error free for a while and know you no longer need the historical data. This table should not need frequent purging.

Download the scripts as a zip file: CHECKDBLikeABoss.zip (2 KB)

Day 19: How Much Log Can a Backup Log


It’s day 19 of my 31 Days of Disaster Recovery series, and today I want to talk about how much log is in a backup file. A common misconception is that when you restore a backup, you get an exact copy of the database as it was when it was backed up. That’s mostly true, but there are exceptions. For example, if you restore a database to a new server, the TRUSTWORTHY property gets reset, as do replication and CDC configurations and objects, unless you specify that they be kept as part of the restore. Another common misconception is that the entire transaction log is in the backup. In fact, the backup only contains as much log as is required to bring the database to a consistent state upon recovery. I’m going to demonstrate just how much log a backup can log (or rather, contains).

Examine the Log


For this demonstration, I’m going to be using undocumented function fn_dump_dblog() and undocumented trace flag 2537. The function is similar to other commands with which you may be familiar, DBCC LOG() and fn_dblog(). You can find some unofficial documentation on fn_dump_dblog() on the blog of fellow Certified Master Dimitri Furman (blog) here: Reading database transaction log with fn_dump_dblog(). Also, Paul Randal (blog|@PaulRandal) has blogged about using this function. I generally prefer using DBCC LOG() or fn_dblog() because the parameters to pass in are more manageable. The really cool thing about fn_dump_dblog() though is that it can be used to view the transaction log inside of a backup file.

Additionally, trace flag 2537 can be used in conjunction with any of the three log reader functions mentioned above to include the inactive portion of the log file when you view it. I have been told that this trace flag used to be 2536. I’m not sure exactly when it changed from 2536 to 2537, so if you try this on a version older than SQL Server 2008, you may need to use trace flag 2536 instead.

For this demo, I’m going to start out by creating a new database and switching to it:

Use master;
-- Create Database
Create Database TestBackups;
Go
-- Switch to database
Use TestBackups;
Go

The next step is to run a manual CHECKPOINT to ensure that the log is clear. At this point, we should only see 2 or 3 entries (depending on your version of SQL Server) for the checkpoint operation. I’ll use the fn_dump_dblog() function to look at the active log and verify. On a few occasions, the checkpoint will run before the database has finished logging everything it needs to do for the initial creation. If you see a lot of transactions in the log, just run this step again.



-- Clear the tran log
Checkpoint;
Go
-- Log entries should be only the checkpoint entries, 3 records
Select *
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null);
Go

Next I create a table and fill it with some data inside of a marked transaction. When you name a transaction, you can find the start of the transaction in the transaction log by looking for the name in the Transaction Name column of the output. After running the named transaction, I will query the log file for the count of records in the log file (lots) and also for the entry for the named transaction (to show that it’s there).



-- Insert some data into a new table in a named transaction
Begin Tran Tran1 With MARK 'Tran 1'
Select *
INTO dbo.MasterFiles
From sys.master_files;
Commit
Go
-- Lots of log entries now
Select count(*)
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null);
-- Including our named transaction
Select *
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null)
Where [Transaction Name] = 'Tran1';
Go

Next, I will CHECKPOINT the database manually again to clear the log. Then a dump of the log will show that we are back to only 2 or 3 entries for the CHECKPOINT command. Next, I will enable trace flag 2537 and requery the log for the count of records and for the named transaction. Now that we can also see the inactive portion of the log, we see that there is still a large number of transactions in the log including the named transaction.



-- Clear the tran log again
Checkpoint;
Go
-- Log entries should be only the checkpoint entries, 3 records
Select *
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null);
-- Enable trace flag 2537 to see all log entries (active and inactive)
DBCC TraceOn(2537);
-- Lots of inactive log entries
Select count(*)
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null);
-- Including our named transaction
Select *
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null)
Where [Transaction Name] = 'Tran1';

DBCC TraceOff(2537);
Go

Next I switch to master database, back up the database, and then use fn_dump_dblog() to query the log file contained inside of the backup file we just created. We see that there is only a small number of records in the backup file and our named transaction is not among them.



-- Switch to master
Use master;
-- Backup the database
Backup Database TestBackups
To Disk = 'C:\bak\TestBackups.bak'
With Init;
Go
-- Check log entries in backup
Select count(*)
From fn_dump_dblog(null, null, N'Disk', 1, N'C:\bak\TestBackups.bak',
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null);
-- What about our named transaction?
Select *
From fn_dump_dblog(null, null, N'Disk', 1, N'C:\bak\TestBackups.bak',
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null)
Where [Transaction Name] = 'Tran1';
Go

And then finally, I drop the database because I’m done.



Use TestBackups;

Alter Database TestBackups Set Single_User With Rollback Immediate;

Use master;

Drop Database TestBackups;

Go

Summary


Earlier today, I was lurking around #sqlhelp on Twitter, and someone pointed out the blog post by Dimitri that I linked to earlier. It struck me that fn_dump_dblog() would be a good way to demonstrate that the log file in the backup file did not include the entire log file, just the part required to bring the database to a consistent state. Be sure to work through the demo code above or download the script below:

Demo script in zip format: TranLogInBackup.zip (1 KB)


Day 20: The Case of the Backups That Wouldn't Restore


I have decided to spend day 20 of my 31 Days of Disaster Recovery series by relating a true tale from my harried past days of Production DBA Operations. This is a cautionary tale with an important moral. This is the case of the backups that wouldn’t restore.

The Case of the Backups That Wouldn’t Restore


When I worked in operations at Microsoft, I was on a team of about 40 operations engineers managing about 80 applications. Only a small portion of the engineers were DBAs, so the core set of DBAs were often called in to consult for the other engineers when they needed in-depth SQL Server knowledge for a specific problem. I was contacted one evening by one of the engineers for a critical application. It was a SharePoint application that had been hit by a newly discovered bug that caused SharePoint to corrupt all of its content data. To be clear, the content database itself wasn’t corrupted, but the data it contained had been sort of mangled. It wasn’t the kind of thing that could be fixed; they had to restore.

After more than 3 hours of trying to restore the database themselves, they finally called me at home to ask for my help. They told me that most of the backups were corrupted, and they would lose way too much data if they restored the newest backup that worked. They were hoping that I would have some trick up my sleeve to prevent them from losing 4 days of data.

I asked them to break down for me what they had been attempting to do and where it was failing. It turns out that they were using my Standard Backup Scripts with the default settings. This meant they were doing weekly full backups, daily differentials, and log backups every half hour. They said the full backup restored successfully, but the 3 most recent differential backups would not restore. They could restore the 4th differential backup, but that would mean losing too much data. The solution was simple: restore the most recent differential backup that could be restored and then restore all of the log backups from that point forward to the last known good point ... right?

I told them my plan, and they didn’t seem very enthused. Turns out that when they were trying to restore the database themselves, they decided to simplify the task in front of them by deleting a bunch of the older log backups. The log backups they had not deleted only went back two days. If there was any hope of using the log backups for the restore, we had to somehow get one of the two most recent differentials to restore. They didn’t recall the exact error they got when they tried to restore the newer differential, so I tried it myself. I tested the differentials with RESTORE VERIFYONLY with no problems. They did not appear to be corrupted, so I tried to restore one of them.

This was the error I received when I tried to restore the differential backup:

Msg 3136, Level 16, State 1, Line 6
This differential backup cannot be restored because the database has not been restored to the correct earlier state.
Msg 3013, Level 16, State 1, Line 6
RESTORE DATABASE is terminating abnormally.

It was apparent that someone had created an out-of-band full backup of the database. The operations engineer for the application was insistent that no one that works on the application would do that, much less create the backup and then delete it. Sure enough, a full backup had been created during that time frame, and it was NOT in the backup folder. The database had been backed up to a share on the SharePoint server. The admin had done a full site backup through SharePoint, and he was not aware that a full SharePoint backup included a full backup of the database. I checked the share on the SharePoint server, and the full backup was still there.
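As an aside, if you ever need to hunt for an out-of-band backup like this, the backup history in msdb is the quickest place to look. A minimal sketch (the database name is a placeholder for your content database):

-- Find all full backups of a database and where they were written
Select BS.database_name,
    BS.backup_start_date,
    BS.backup_finish_date,
    BS.type,                      -- D = full, I = differential, L = log
    BMF.physical_device_name
From msdb.dbo.backupset BS
Inner Join msdb.dbo.backupmediafamily BMF
    On BMF.media_set_id = BS.media_set_id
Where BS.database_name = N'WSS_Content'
    And BS.type = 'D'
Order By BS.backup_start_date Desc;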



I restored to a recent known good point in standby mode so the engineer could query the data and verify that it was good. They wanted to get closer to the point where the data corruption occurred, so I used the technique I outlined in my post SQLU DBA Week – Recovering Lost Data and restored the log backups progressively in standby mode, repeatedly querying the data to make sure we got to the most recent good point. Once we found the point we felt was the best restore point, I recovered the database, and they were able to begin repopulating all of their SharePoint catalogs.

Moral of the Story


There are several things that could be learned from this experience. After recovery had completed, and everyone had gotten a good night’s sleep, we worked on improving their processes to prevent this kind of problem again. Hopefully, this tale will help you avoid making the same mistakes. Get your recovery plans in order and take heed of the below takeaways from this experience:

  • Test your backups – Backups are critical, but they are useless if they can’t be restored. If they had implemented some process to test restores of their backups, they would have learned several days earlier that the most recent differential backups could not be restored. They could have addressed this problem a day or two prior to the corruption occurring and been ready to restore when the corruption had been found.

  • Practice your recovery process – One of the reasons you practice your recovery process is so that when something goes wrong, you know what to do because you’ve already practiced that scenario. If they had practiced different scenarios, they might have realized that they could have simply used the log backups to complete the restore process.

  • Don’t be afraid to ask for help – If things aren’t going well and you need help, don’t spend more than 3 hours trying to figure it out on your own. Ask for help. It would have saved them 3 hours of time, and it would have been early evening instead of almost bed time when they finally called me. It would have been better for them and for me.

  • Know what’s going on in your environment – Part of their problem was that they were taking weekly full site backups through SharePoint with no idea of what that actually meant. It wasn’t a fluke that they had the problem that week. They would have had the same problem no matter which week it had occurred. If you are not the SharePoint admin (or the “whatever admin”), it is especially important that you communicate and plan your disaster recovery together. It doesn’t work well, as this experience showed, when the pieces of the same application each do their own thing in terms of disaster recovery.

Day 21: Who Deleted That Data?


Welcome back for day 21 of my 31 Days of Disaster Recovery series. Today I want to talk about trying to track down who deleted data from a table. This little investigation started out as a question on the #sqlhelp hash tag on Twitter from Wayne Sheffield (blog|@DBAWayne) whom I first met on SQLCruise Alaska 2012.

The question was asking how to convert the page ID from fn_dump_dblog() to match the integer format for page ID in DBCC IND(). Where we ended up was not even close to where we started.

If you want more info on the undocumented function fn_dump_dblog(), check out the following blog post: Day 19 of 31 Days of Disaster Recovery: How Much Log Can a Backup Log.

Converting Page ID


I explained that the Page ID column in fn_dump_dblog() output contains the file ID and page ID in hexadecimal format as FileID:PageID. It was as simple as splitting the value on the colon and then converting both pieces to integers. I recommended using the same expression to convert the Page ID as I did for converting the LSN in the blog post Day 11 of 31 Days of Disaster: Converting LSN Formats. Wayne’s plan was to search the log backups from the time frame in which they believed the deletion had occurred for entries where the page ID matched one of the page IDs output by DBCC IND() for the table in question, in hopes of finding a clue as to what performed the delete, a person or the system.
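For illustration, a rough sketch of that conversion (the example value is made up; the Page ID column comes back as a hex string in the form FileID:PageID):

-- Convert a hexadecimal Page ID such as '0001:0000012a' to integer file and page IDs
Declare @PageID varchar(50) = '0001:0000012a';

Select Convert(int, Convert(varbinary(4),
        Left(@PageID, CharIndex(':', @PageID) - 1), 2)) As FileID,
    Convert(int, Convert(varbinary(4),
        SubString(@PageID, CharIndex(':', @PageID) + 1, 50), 2)) As PageID;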

Wayne quickly came up with an idea better than matching the page ID against the entire list of page IDs from DBCC IND(): he could use the AllocUnitId column to correlate the log entries to the object ID of the table. The trick was to get the AllocUnitId to match up to the table’s object ID. We did this by joining fn_dump_dblog() to sys.allocation_units and then joining that to sys.partitions.

The process for finding these records was Wayne’s brain child. I just helped work out some details of the query. I’ll leave the demoing of his process up to him if he chooses to share it. I will share a query for finding the log entries from fn_dump_dblog() that correlate to a specific object.

-- Define object from which data was deleted
Declare @ObjectID int;
Set @ObjectID = OBJECT_ID('dbo.AllDatabases');

-- Query for log file entries
Select DD.*
From fn_dump_dblog(null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null,
    null, null, null, null, null, null, null, null, null, null) DD
Inner Join sys.allocation_units AU
    On AU.allocation_unit_id = DD.AllocUnitId
Inner Join sys.partitions P
    On P.partition_id = AU.container_id
Where DD.AllocUnitId Is Not Null
    And P.object_id = @ObjectID;
Go

Summary


In the end, Wayne was successful in tracking down the entries in the log backup file for the data deletion from the table. Sadly, the log records do not tell you who performed the action. He was able to get the session IDs (SPIDs) for the commands which indicated it was a user thread that ran the delete and not a system thread. Although you can’t find everything you want in the log, you can still find a lot of information and correlate that to certain conclusions. And I especially like the fact that you can use fn_dump_dblog() on a log backup rather than a live database because you can muck around in the log data without worry of affecting a live database.

Day 22: Which DBCC CHECK Commands Update Last Known Good DBCC


The end of the day is quickly approaching as I finish this blog post. This is day 22 in my series 31 Days of Disaster Recovery, and I want to examine which DBCC CHECK commands update the last known good DBCC check that is tracked in the header of the database. To check this value, I could either dump the header page using DBCC PAGE() or just output the header info using DBCC DBINFO(). Both of these commands are officially undocumented, but you can find them documented unofficially all over the web. They are known to be safe commands to use; however, I still recommend that you don’t use them to muck around in production.

So the question is which DBCC CHECK commands update the value. Well, let’s find out. Trial and error is our tool of choice here.


The Test


I executed all of the DBCC CHECK commands, except DBCC CHECKIDENT, of course, because that command isn’t used for consistency checks. After each execution, I would check the value of dbi_dbccLastKnownGood in the database header. If the value had been updated, I dropped and recreated the database before the next test execution.

Normally, you need to enable trace flag 3604 to see the output of commands like DBCC PAGE() and DBCC DBINFO() in the query window message pane. This trace flag redirects output from the SQL log to the console. However, in this case, I’m using the WITH TABLERESULTS option. This option is documented for some DBCC commands, but it works with almost any DBCC command. If I use this option, the output automatically goes to the query window results pane, so there’s no need to use trace flag 3604. This option also makes it easy to insert into a table to query.

For the testing, I created a new database named TestDBCC, added a filegroup and file to it and then created a table with some data in it.

Use master;
-- Drop database if it exists
If DB_ID('TestDBCC') Is Not Null
Drop Database TestDBCC;
Go
-- Create new database
Create Database TestDBCC;
Go
-- Add a filegroup for testing DBCC CHECKFILEGROUP
Alter Database TestDBCC Add Filegroup TestFG;
Go
-- Add a file to the filegroup
Alter Database TestDBCC Add File (
    Name = N'TestFile',
    FileName = N'C:\bak\TestFile.ndf')
To Filegroup TestFG;
Go
-- Switch to the database
Use TestDBCC;
Go
-- Create a table for testing DBCC CHECKTABLE
Create Table dbo.AllDBs(
    DBID int not null,
    DBName sysname not null,
    -- Create a check constraint for testing DBCC CHECKCONSTRAINTS
    Constraint ckPK Check(DBID > 0))
On TestFG;
Go
-- Add some data to the table
Insert Into dbo.AllDBs (DBID, DBName)
Select database_id, name
From sys.databases;
Go

This is the query I used repeatedly to check the last know good value of the database I created for the test:



-- Table for DBCC results

Declare @DBInfo Table (

ParentObject varchar(255),

Object varchar(255),

Field varchar(255),

Value varchar(255))


-- Insert DBCC DBINFO into table

Insert Into @DBInfo

Exec sp_executesql N'DBCC DBInfo(''TestDBCC'') With TableResults;';
-- Query for last known good DBCC

Select Value As dbccLastKnownGood

From @DBInfo

Where Field = 'dbi_dbccLastKnownGood';

The results of the testing were:

  DBCC Command             Updates Last Known Good DBCC?
  DBCC CHECKTABLE          No
  DBCC CHECKCONSTRAINTS    No
  DBCC CHECKFILEGROUP      Yes
  DBCC CHECKALLOC          No
  DBCC CHECKCATALOG        No
  DBCC CHECKDB             Yes
I also tested several different options to see if any particular option affected whether the last good DBCC got updated. None of the optional settings made a difference as to whether last good DBCC got updated or not except, of course, ESTIMATEONLY because it specifically does not run any consistency checks.

Day 23: Restoring Differential Backups With New Files


It’s day 23 of my 31 Days of Disaster Recovery series, and today’s blog post is inspired by an email I received in response to day 20’s blog post The Case of the Backups That Wouldn’t Restore.

A friend shared a story with me via email about a partial restore scenario that was very complex. These are the complexities he described:



  • Weekly full backups and daily differential backups

  • New data files are added daily for their partitioned tables

  • 1508 total files in the database

  • Restore server had a different drive layout, and all files had to be moved to different locations

Additionally, he was performing a partial restore of only specific filegroups, so his restore process was even more difficult than what I am going to cover today. Needless to say, he had a lot of MOVE commands to write. The GUI in SQL Server 2012 has an option to move all files of a specific type to the same location, but we’re not going to cover the restore GUI in this blog (ever). This brought back memories of past consultations where people needed to restore a differential backup in which a new file had been added and the original file location wasn’t available on the restore server. So today’s post is about how to restore a differential backup when files have been added.

Restore Demo


I worked up a demo that you can walk through to see exactly how you would handle new files being added to a database between differential backups. In this scenario, we’re going to simulate a week of differential backups and add a new file each day before the differential backup. Then we are going to restore the full backup and the most recent differential backup to a new location using MOVE arguments.

Download the scripts in a zip file: DifferentialRestoresWithMove.zip (2 KB)

-- Create unique locations for database files
Exec xp_create_subdir 'd:\TestDiffData';
Exec xp_create_subdir 'd:\TestDiffLogs';
Go
-- Create new database in unique location
Create Database TestDiff
On (Name = N'TestDiff',
    FileName = N'd:\TestDiffData\TestDiff.mdf')
Log On (Name = N'TestDiff_log',
    FileName = N'd:\TestDiffLogs\TestDiff_log.ldf');
Go
-- Take Full Backup
Backup Database TestDiff
To Disk = 'd:\Backup\TestDiffDay1.bak'
With Init;
Go
-- Add file for 6 (simulated) days and perform a differential backup
Declare @Cntr int = 1,
    @SQL nvarchar(200);

While @Cntr <= 6
Begin
    Set @SQL = N'Alter Database TestDiff
        Add File (Name = ''TestDiff' +
        Cast(@Cntr as nvarchar(200)) + ''',
        FileName = ''d:\TestDiffData\TestDiff' +
        Cast(@Cntr as nvarchar(200)) + '.ndf'');';
    Exec sp_executesql @SQL;

    Set @SQL = N'Backup Database TestDiff
        To Disk = ''d:\Backup\TestDiffDay' +
        Cast(@Cntr as nvarchar(200)) + '_diff.bak''
        With Differential, Init;';
    Exec sp_executesql @SQL;

    Set @Cntr = @Cntr + 1;
End
Go
-- Restore the full backup as TestDiff2 to new location
-- Create unique locations for database files
Exec xp_create_subdir 'd:\TestDiff2Data';
Exec xp_create_subdir 'd:\TestDiff2Logs';
Go
-- Restore backup moving files to new location
Restore Database TestDiff2
From Disk = 'd:\Backup\TestDiffDay1.bak'
With Move 'TestDiff' To 'd:\TestDiff2Data\TestDiff.mdf',
    Move 'TestDiff_log' To 'd:\TestDiff2Logs\TestDiff_log.ldf',
    NoRecovery;
Go
-- Restore file list of most recent differential backup
Restore FileListOnly
From Disk = 'd:\Backup\TestDiffDay6_Diff.bak';
Go
-- 6 new database files that need to be accounted for:
-- TestDiff1    d:\TestDiffData\TestDiff1.ndf
-- TestDiff2    d:\TestDiffData\TestDiff2.ndf
-- TestDiff3    d:\TestDiffData\TestDiff3.ndf
-- TestDiff4    d:\TestDiffData\TestDiff4.ndf
-- TestDiff5    d:\TestDiffData\TestDiff5.ndf
-- TestDiff6    d:\TestDiffData\TestDiff6.ndf
-- Restore differential backup moving files
Restore Database TestDiff2
From Disk = 'd:\Backup\TestDiffDay6_Diff.bak'
With Move 'TestDiff1' To 'd:\TestDiff2Data\TestDiff1.ndf',
    Move 'TestDiff2' To 'd:\TestDiff2Data\TestDiff2.ndf',
    Move 'TestDiff3' To 'd:\TestDiff2Data\TestDiff3.ndf',
    Move 'TestDiff4' To 'd:\TestDiff2Data\TestDiff4.ndf',
    Move 'TestDiff5' To 'd:\TestDiff2Data\TestDiff5.ndf',
    Move 'TestDiff6' To 'd:\TestDiff2Data\TestDiff6.ndf',
    Recovery;
Go

I also worked up a quick script to generate the MOVE commands. This could come in handy if you ever need to generate the MOVE commands for a backup with a large number of files that need to be moved. This is just a quick script, not a well-evolved one, so you will need to take the output and remove the comma from the final line if you use it.



Declare @BackupFile nvarchar(500),
    @FileNumberInBackup int,
    @MoveDataFilesTo nvarchar(500),
    @MoveLogFilesTo nvarchar(500),
    @MoveFilestreamTo nvarchar(500),
    @MoveFTCatalogTo nvarchar(500),
    @RestoreCmd nvarchar(max)

Declare @FileList Table (LogicalName nvarchar(128),
    PhysicalName nvarchar(260),
    Type char(1),
    FileGroupName nvarchar(128),
    Size numeric(20,0),
    MaxSize numeric(20,0),
    FileID bigint,
    CreateLSN numeric(25,0),
    DropLSN numeric(25,0) NULL,
    UniqueID uniqueidentifier,
    ReadOnlyLSN numeric(25,0) NULL,
    ReadWriteLSN numeric(25,0) NULL,
    BackupSizeInBytes bigint,
    SourceBlockSize int,
    FileGroupID int,
    LogGroupGUID uniqueidentifier NULL,
    DifferentialBaseLSN numeric(25,0) NULL,
    DifferentialBaseGUID uniqueidentifier,
    IsReadOnly bit,
    IsPresent bit,
    TDEThumbprint varbinary(32))

-- Define backup file path/name
Set @BackupFile = N'd:\Backup\TestDiffDay6_Diff.bak';
-- Define file number of file in backup (default 1)
Set @FileNumberInBackup = 1;
-- Define destination path (not name) for all file types
Set @MoveDataFilesTo = N'd:\TestDiff2Data';
Set @MoveLogFilesTo = N'd:\TestDiff2Logs';
Set @MoveFilestreamTo = N'd:\TestDiff2FS';
Set @MoveFTCatalogTo = N'd:\TestDiff2FT';

-- Add trailing slash if not exists
If Right(@MoveDataFilesTo, 1) <> '\'
    Set @MoveDataFilesTo = @MoveDataFilesTo + '\';
If Right(@MoveLogFilesTo, 1) <> '\'
    Set @MoveLogFilesTo = @MoveLogFilesTo + '\';

-- Restore file list of most recent differential backup
Insert Into @FileList
Exec sp_executesql N'Restore FileListOnly
    From Disk = @BackupFile
    With File = @FileNumberInBackup;',
    N'@BackupFile nvarchar(500), @FileNumberInBackup int',
    @BackupFile = @BackupFile,
    @FileNumberInBackup = @FileNumberInBackup;

Select MoveCmd = 'Move ''' + LogicalName + ''' To ''' +
    Case Type When 'D' Then @MoveDataFilesTo +
            Right(PhysicalName,
                CharIndex('\', Reverse(PhysicalName)) - 1)
        When 'L' Then @MoveLogFilesTo +
            Right(PhysicalName,
                CharIndex('\', Reverse(PhysicalName)) - 1)
        When 'F' Then @MoveFilestreamTo
        When 'S' Then @MoveFTCatalogTo
        Else PhysicalName
    End + ''','
From @FileList
Where IsPresent = 1
Order By FileID;
Go

Summary


It’s actually not very difficult to deal with files being added in a differential (or log) backup when you need to move the locations. You simply need to identify the files using RESTORE FILELISTONLY and then use the MOVE argument to define the new location. But it can be tedious when there are a large number of files involved. That’s where scripts like these come in handy. Enjoy.

Download the scripts in a zip file: DifferentialRestoresWithMove.zip (2 KB)


Day 24: Handling Corruption in a Clustered Index


Welcome to day 24 of my 31 Days of Disaster Recovery series. Previously, I’ve talked about several different forms of corruption: Nonclustered Index, Allocation Pages, and Tempdb. These were all fairly simple to fix. Today I’m going to dive into a scenario that is a little more complex, clustered indexes.

Clustered indexes are the base data. This means we have to go to a restore scenario. Hopefully, the corruption isn’t widespread. If it’s just one or a few pages, we can do page-level restores. If it’s a lot of pages, it may be faster just to do a full restore. It may come down to a judgement call as to which you think is faster.


Identify the Corruption


I’ve created a sample database for us to use for this scenario. It has a table in it with a corrupt clustered index. It also has clean backups from before the corruption occurred so we can perform a page or a full restore. In order to facilitate recovery, I had to do something we tell you to NEVER do with a corrupt database. Instead of providing you with a backup of the corrupt database, I detached the database and copied the files. To run through this demo on your own, you’re going to need to attach the files as a new database.

The scenario here is that the database was offline (let’s say the server rebooted), and corruption occurred during that time. The database is online now and seems fine. We insert some data into our table, and everything works fine. Then we query the data, and we discover the corruption.

-- Take database offline and poof, corruption occurs
-- We don't know that yet though
-- Add more data
Insert Into dbo.FactInternetSales
Select Top(500) *
From AdventureWorksDW2012.dbo.FactInternetSales;
Go
-- Everything seems to be fine. Let's query the data
Select *
From dbo.FactInternetSales;
Go
-- Error occurs
Msg 824, Level 24, State 2, Line 2
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:298; actual 0:0). It occurred during a read of page (1:298) in database ID 8 at offset 0x00000000c14000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL12\MSSQL\DATA\CorruptDB.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

This is a fatal error, so our query window gets disconnected when the error pops up. We know this is a serious error. We need to determine the extent of the corruption. You may be tempted to run DBCC CHECKTABLE at this point, but the corruption could be more than just a single object. I highly recommend running the DBCC CHECKDB in almost every case.

-- Let's run DBCC CHECKDB (there may be other corrupt tables)
DBCC CHECKDB(CorruptDB) With No_InfoMsgs, All_ErrorMsgs, TableResults;
Go
-- Errors returned for:
-- Database ID 8 (CorruptDB)
-- Object ID 245575913 (dbo.FactInternetSales)
-- Index ID 1 (clustered index)
Okay, CHECKDB shows lots of error messages, but we need to look at the ones with a Severity of 16. It appears that just a single table and a single index is corrupted. At first look, it appears that there are 2 pages that are corrupt. Taking a closer look at the error message for page #299 shows that it is reporting that the pointer to it from page #298 is missing. Page #298 is the only one that is corrupted.

If we take a look at the corrupt page using DBCC PAGE, we can see that about half of the page was overwritten with repeating 0’s. This is a sure sign of disk corruption. You need to follow this up by having the disks checked. Disk corruption may be a sign of a failing disk so it is important to investigate right away.
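For reference, that inspection is just a matter of enabling trace flag 3604 and dumping the page with print option 2, the full hex dump (the database name and page number here match the demo database):

-- Send DBCC output to the query window
DBCC TraceOn(3604);
Go
-- Dump the corrupt page: database, file ID, page ID, print option (2 = hex dump)
DBCC Page(CorruptDB, 1, 298, 2);
Go
DBCC TraceOff(3604);
Go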

Decide How to Deal With It


Since it is only a single page that is corrupted, I prefer to do a single page restore here. In order to do that, I have to have certain things already in place:

  • Good backup with a non-corrupted copy of the page in it. This is where having tested your backups previously really pays off. I can’t tell you how many times tables have sat corrupted for months on end before someone finds it and by then they no longer have sufficient backups to support a proper restore or they have to wait for many hours or days for another team to recover backups from tape that is in storage. TEST YOUR BACKUPS.

  • Log backups to bring the page current with the rest of the database. This means that if we’re in simple recovery mode or if we don’t have all the log backups, we’re dead in the water. Our only choice in that case would be to do a full restore and save as much of the data as possible.

In our case, we see that we have a full backup and a log backup. We will need to do a tail-log backup as well to really bring it current. For this restore, I’m going to put the database into restricted-user mode, take the tail-log backup, and then proceed with the restore process.

Page-level Restore


To process this restore, I’m going to put the database in restricted-user mode to kick the non-admin users out of the database. Then I will do the page-level restore, including a tail-log backup. I’ll restore from the full backup, specifying the page I want to restore. Then I will restore the existing log backup, take a backup of the tail of the log, and restore that tail-log backup as well. Then I can recover the database and run DBCC CHECKDB again to ensure that the corruption is fixed. If the database is clean, we can let the users back into the database.

-- Only the one page is corrupt, so let's do a page restore
-- Switch to master to restore the damaged page
USE master;
Go
-- Set the database in restricted user mode to keep average users out
Alter Database CorruptDB Set Restricted_User With Rollback Immediate;
Go
-- Restore the corrupt page from the good full backup
Restore Database CorruptDB
Page = '1:298'
From Disk = 'C:\Users\SQLSoldier\Documents\BlogFiles\CorruptDBBackups\CorruptDB.bak';
Go
-- Restore the 1st pre-existing log backup to bring the page current
-- SQL knows which transactions to apply, no need to specify any special commands
Restore Log CorruptDB
From Disk = 'C:\Users\SQLSoldier\Documents\BlogFiles\CorruptDBBackups\CorruptDB.trn'
With NoRecovery;
Go
-- If there were more pre-existing log backups, we would restore them in order

-- Now backup the tail of the log...
Backup Log CorruptDB
To Disk = 'C:\bak\CorruptDB_LOG_TAIL.trn'
With Init;
Go
-- Restore the tail of the log bringing the page current
Restore Log CorruptDB
From Disk = 'C:\bak\CorruptDB_LOG_TAIL.trn'
With NoRecovery;
Go
-- Finally, recover the database to bring it online
Restore Database CorruptDB With Recovery;
Go
-- Recheck the database for corruption again
DBCC CHECKDB(CorruptDB) With All_ErrorMsgs, No_InfoMsgs, TableResults;
Go
-- Allow users back in
Alter Database CorruptDB Set Multi_User;
Go
-- Run the original query again
Use CorruptDB;
Go
Select *
From dbo.FactInternetSales;
Go

Summary


At first glance, a page-level restore may seem a little tricky, but once you’ve done it several times, it starts to become old hat. Just remember the steps I recommend taking, and follow them one at a time. It’s not as difficult as it seems.

Download the demo scripts and sample corrupt database in zip format: Demo_ClusteredIndexCorruption.zip (5.03 MB)

Special thanks to Paul Randal (blog|@PaulRandal) for his guidance when I saw something weird happening. Something weird was happening, but not what I thought was happening. Talking to him about it cleared it up for me and allowed me to see my error that I was overlooking.

Day 25: Improving Performance of Backups and Restores


My series 31 Days of Disaster Recovery has been on hiatus due mostly to illness. I’ve been battling a chest cold that became bronchitis. I’m still fighting a cough, but even that has improved to the point that I’m now sleeping through most of the night. It wasn’t so much that I was too sick to write a blog post as it was that I was too sick to think up good ideas and put them into words. Rather than deliver poorly thought out and poorly articulated blog posts, I opted to wait until I was better in control of my cognitive abilities.

Today we resume with Day 25 of the series and I want to discuss improving performance of backups and restores. The good news is that what works for one generally works for the other. Most actions you take to speed up backups will also speed up restores. Double bonus.


Adjust the Backup Buffers


I mentioned this first tip earlier in the series when I posted some DR gems from around the net. Nic Cain (blog|@SirSQL) has written a process (Automated Backup Tuning) to automate determining the best buffer settings for backups. This is great because the only real way to figure out the optimal setting for any given server is through trial and error. Nic’s process automates the trial part of it and lays out the results so you can easily pick out the optimal settings.
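To show what those knobs look like (the values below are purely illustrative, not recommendations, and the database name and path are placeholders), the buffer settings are specified right on the BACKUP command:

-- Illustrative only: the right values depend entirely on your hardware
Backup Database MyBigDB
To Disk = 'D:\Backup\MyBigDB.bak'
With Init,
    BufferCount = 50,          -- number of I/O buffers used by the backup
    MaxTransferSize = 4194304; -- bytes per transfer (4 MB here)
Go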

Skip Creating Database Files


When you run a restore, the first step of the restore process is to create the database files to the same sizes they were in the database when it was backed up. There are a couple of ways to skip the creation of some or all of the database files. For a very large database, this can save you a great deal of time in the restore process.

The first way is to make sure you have Instant File Initialization (IFI) enabled. Normally, when SQL Server creates a database file, it has to fill it with zeroes (i.e., zero it out) to mark the limits of the file. IFI is a feature that allows SQL Server to mark the limits of the files without filling them. This means that to create a 100 GB file, for example, it does not have to write 100 GBs of zeroes. The file creation process is almost instantaneous. Unfortunately, the transaction log files cannot be instantly initialized. This only allows us to skip the creation of the data files. This feature is enabled at the OS level by granting Perform Volume Maintenance Tasks rights to the SQL Server service account via Local Security Policy Editor or Group Policy Editor.
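One way to verify that IFI is actually working, a trick I believe Paul has blogged about, is to enable trace flags 3004 and 3605, create a throwaway database, and check the error log; if you see zeroing messages for the data file, IFI is not enabled. A quick sketch:

-- Log file-zeroing operations (3004) to the error log (3605)
DBCC TraceOn(3004, 3605, -1);
Go
Create Database IFITest;  -- throwaway database
Go
DBCC TraceOff(3004, 3605, -1);
Go
-- Check the error log: zeroing messages for the .mdf mean IFI is NOT enabled;
-- zeroing messages only for the .ldf mean IFI is working
Exec sp_readerrorlog;
Go
Drop Database IFITest;
Go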

The other option works for data and log files. If the database files for the database you are restoring are already present, the restore will reuse the files that already exist. If the files exist and are the right size, then you are basically skipping the file creation process. I’ve seen people delete the existing database before starting the restore instead of restoring over the top of it, without realizing that they are prolonging the process. This is particularly helpful when you need to restore a backup of a very large database. If I have to restore a large backup on a new server, the copy process is going to take a long time. In the meantime, while it is copying, I will create an empty version of the database with the exact file specifications of the one I’m going to restore. Once the copy process finishes, the creation of the database is probably finished as well, and I can skip the file creation step by restoring over the database I just created.
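A minimal sketch of that trick (all names, paths, and sizes are placeholders): create an empty shell database with the same logical file layout while the backup file copies, then restore over the top of it.

-- Create an empty shell with the same logical names and file sizes as the source
Create Database MyBigDB
On (Name = N'MyBigDB_Data',
    FileName = N'D:\Data\MyBigDB_Data.mdf',
    Size = 100GB)
Log On (Name = N'MyBigDB_Log',
    FileName = N'L:\Logs\MyBigDB_Log.ldf',
    Size = 20GB);
Go
-- Once the backup file has finished copying, restore over the shell
Restore Database MyBigDB
From Disk = 'D:\Backup\MyBigDB.bak'
With Replace, Recovery;
Go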

Multiple Files on Multiple Dedicated Drives


I can’t stress enough that this recommendation has 3 parts. 1) Multiple files on 2) multiple 3) dedicated drives. You won’t see much if any improvement by simply having multiple files if they are on the same drive. You get 1 backup thread per LUN or mountpoint, not per file. If you write the backup to multiple files on the same drive or even different drives on the same LUN, you only get a single backup thread. No performance boost. The only benefit to this is manageability of moving around and storing smaller files or if you will be able to use multiple dedicated drives for the restore.

A common mistake I see people make is that rather than getting multiple LUNs for backups, they will put one of the backup files on the backup drive and one on a drive that holds database files. If you are trying to write a backup file to a drive that has an active database on it, it will affect performance for the activity of the database as well as the backup. I have seen many cases where this was even slower than writing to a single backup file. It is very important that the LUNs/drives be dedicated to backups only. I have seen directly proportionate improvements in backup time by adding more dedicated backup drives. I used to manage a VLDB that took 6 hours to back up onto a single drive. When we added a 2nd dedicated backup drive, back up time dropped to half, 3 hours. And when we added a 3rd dedicated backup drive, the time dropped to 2 hours.
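For completeness, striping a backup across multiple files on separate dedicated backup drives is just a matter of listing multiple Disk targets (the drive letters and names below are placeholders):

-- Stripe the backup across three files, each on its own dedicated backup LUN
Backup Database MyBigDB
To Disk = 'G:\Backup\MyBigDB_1.bak',
    Disk = 'H:\Backup\MyBigDB_2.bak',
    Disk = 'I:\Backup\MyBigDB_3.bak'
With Init, Compression;
Go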


Other Tips


  • Use Compression: Unless you are using Transparent Data Encryption (TDE) for the database, you should be using backup compression if it is available. Either use SQL native compression or a 3rd party compression tool. Either way, this is a real must have.

  • Crank up the SAN Throughput: People love to talk about the speed of the drives in their SAN or the RAID type, but more often than not, I see SANs bottlenecking on throughput long before they reach the limits of the drives. Work with your SAN admin to increase the queue depth and the number of paths to the SAN. Rule of thumb: more SAN paths (multi-pathing) = more throughput.

  • Use Differential/Partial backups: not all backups have to be full backups. Figure out a mixture of Full and Differential/Partial backups and log backups that allow you to meet your SLA for recovery time and recovery point.

Day 26: The Mysterious Case of the Long Backup


Welcome back for day 26 of my series 31 Days of Disaster Recovery. Today I want to share a tale of a mysterious backup that was running too long, and as the SAN admin reported, nothing had changed in terms of configuration of the SAN or our LUNs. We eventually tracked down the issue, and it was something none of us had even considered. Likewise, it was something we never even thought to look for at the time we were investigating.

Backup Performance


We had two databases on the server in question: the small one was 500 GB, and the large one was 1.75 TB. The smaller database was basically used for authentication and only had a few hundred updates per day. As such, we rarely focused on this database very much. The main transactional database, the big one, was extremely busy 24 hours a day, 7 days a week. There was no maintenance period, as it handled transactions from users everywhere. Our busiest times were …. weekends. Followed next by business hours in the United States and business hours in Japan. The system was used by 30,000+ support agents around the globe. Okay, you get the picture. There was no time when it wasn’t busy.

We had highly tuned our backups. We would back up the smaller database at midnight (it took less than half an hour) and the large database at 1 AM (US Pacific Time). The large database took 2 hours. We published performance metrics reports daily, and you could see that there was a small performance drop in the main database while the backup was running. From 1 AM to 3 AM, it wasn’t an issue as we had performance to spare. Over time, the backups kept taking longer and longer. This became an issue when the backup started taking longer than 4 hours. That put the backup completion time after 7 AM Eastern US Time, which meant we were getting close to when business started picking up again. We were still fine performance-wise in the application, but we were approaching the time that it would become a problem. Furthermore, the smaller database was now taking more than an hour to back up, so it was still running when the bigger one started.

In order to maintain the size of this database, we aggressively purged data from it 4 times a day, deleting support cases that were closed and had no action on them for at least 90 days. We were deleting millions of rows daily. We tracked and plotted the amount of data that we purged as well as the size of the database in our daily performance reports. There were no significant changes in either of those metrics.

We investigated the amount of activity, also in our performance reports, during the backup time frame, and no big changes there either. All performance metrics throughout the day looked completely normal. No slowness during the day, only during the backup window. We escalated it to the SAN team, and they confirmed that none of our settings on the SAN had changed and that everything looked healthy on the SAN. All SAN metrics looked good. We were on a shared SAN with many other applications, and he said that none of the others were complaining, only us.

Digging deeper, we discovered that while the backups were running, our throughput to the SAN went way down and then sometime in the 3 AM hour, throughput would return to normal. The bulk of the backup was being performed after this time. We had a theory and we needed to confirm it. We asked the SAN admin to validate the same findings on his side of the SAN.

Sure enough, the SAN was being flooded between midnight and 3 AM and bottlenecking on throughput because everyone else on the SAN was running their backups at midnight as well. We changed our backup schedule to work around this. We would back up the smaller database at 11 PM and then start the large database at 3 AM. Backup times returned to normal, and we were good again.


Summary


When you run into performance problems with your backups, it is important to look at the usual suspects first such as disk performance, activity on the server, etc. Our investigation was made much easier by having baselines of the activity that we could compare to the current levels to determine if anything had truly changed. Ultimately though, we had to trust our findings and look outside the box. We had to look outside our system at external factors that were affecting us.

Day 27: Restoring Part of a Database


Today is day 27 of my series 31 Days of Disaster Recovery, and I want to talk about restoring a partial database to a server. If you have a very large database with many filegroups and you need to restore just part of the database, you can perform a partial restore (Enterprise Edition required) to restore only the minimum number of filegroups needed to bring the part you need online. A good use case for this would be if you need to restore data from backup to recover data that was accidentally deleted.
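As a taste of what that looks like, here is a minimal sketch of a piecemeal restore that brings only the primary filegroup and one secondary filegroup online (the database, filegroup, and backup names are all placeholders):

-- Restore only the filegroups we need, leaving the rest offline for now
Restore Database MyBigDB
    FileGroup = 'PRIMARY',
    FileGroup = 'FG_Sales'
From Disk = 'D:\Backup\MyBigDB.bak'
With Partial, NoRecovery;
Go
-- Restore any subsequent log backups, then recover to bring the restored filegroups online
Restore Log MyBigDB
From Disk = 'D:\Backup\MyBigDB_Log.trn'
With Recovery;
Go
-- The remaining filegroups stay offline unless/until they are restored later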
