31 Days of Disaster Recovery


Day 14: Fixing a Corrupt Tempdb




Welcome to day 14 of my 31 Days of Disaster Recovery series. I’ve previously discussed handling corruption in nonclustered indexes and allocation pages. Today, I’m going to talk about a specific corruption that is very simple to fix; however, it does mean taking the server offline briefly. How brief depends on a few things, such as whether the corruption is in the log file or the data file, and if it’s in the data file, whether Instant File Initialization (IFI) is enabled. If it’s not, you should enable IFI now unless you have compliance restrictions against it. Go do it now; this blog post can wait. I am, of course, talking about corruption in the tempdb database.
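If you’re not sure whether IFI is enabled, here is one way to check it. This is a sketch that assumes SQL Server 2016 SP1 or later, where sys.dm_server_services exposes this column; on older builds, check whether the service account holds the “Perform volume maintenance tasks” privilege instead.

```sql
-- Check whether Instant File Initialization is enabled for the engine
-- service account. The instant_file_initialization_enabled column is
-- only available on SQL Server 2016 SP1 and later.
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server (%';
```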

Tempdb Corruption


In the early days of SQL Server 2008 RTM, I saw a lot of tempdb corruption when people started implementing Transparent Data Encryption (TDE). There was a SQL Server bug in RTM that could cause this. Yes, folks, I am admitting that corruption is sometimes caused by SQL Server … like .01% of the time. The fix for tempdb corruption is simple: restart SQL Server. The files will be reinitialized, and the corruption should be gone. Many people have come to believe that restarting SQL Server recreates the tempdb files completely anew. This belief is the fault of Books Online, which states that the tempdb files are recreated each time the SQL service starts. It’s a myth that was proven false by Jonathan Kehayias (blog|@SQLPoolboy) here: does-the-tempdb-log-file-get-zero-initialized-at-startup.aspx

On some rare occasions, simply restarting SQL Server did not resolve the tempdb corruption. It’s been a long time since I encountered this scenario, and my attempts to reproduce the issue were fruitless. You are not likely to encounter an issue where simply restarting SQL Server doesn’t fix the corruption. If you encounter this issue, delete the corrupted file or files while SQL Server is shut down. As I mentioned before, if it is the log file or if IFI is not enabled, then the restart time will likely be longer than usual. Log files must always be zero initialized when created. Just be aware of how big the file is configured to be and be prepared for the startup to be delayed.
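Before you restart, it’s worth knowing how much zeroing you may be in for. A quick sketch of checking the configured tempdb file sizes (size in sys.master_files is stored in 8 KB pages):

```sql
-- How big are the tempdb files configured to be? A large log file (or a
-- large data file without IFI) means a slower startup while it zeroes out.
SELECT name, type_desc, physical_name,
       size * 8 / 1024 AS size_mb   -- size is stored in 8 KB pages
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');
```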

Fixing corruption doesn’t get easier than this, but you shouldn’t stop there. Fixing corruption is only part of the job. You need to investigate the root cause of the corruption. If it’s a SQL bug (probably not), then you need to make sure you have applied the fix or contact PSS to get one. You also need to investigate the disk subsystem to see if it is external corruption. Whatever the cause of corruption, you need to address it to prevent it from occurring again. A DBA’s primary job is to protect the data, and that means prevention is just as important as fixing what’s broken.

Thanks to Paul Randal (blog|@PaulRandal) for pointing out that deleting the tempdb files is usually not needed.


Day 15: Running DBCC CHECKTABLE in Parallel Jobs


Welcome back to my 31 Days of Disaster Recovery series. Today is day 15, and I want to answer a question I was asked a while back. Paul Randal (blog|@PaulRandal) wrote a blog post explaining alternative options for checking integrity of a very large database if you are not able to run the full CHECKDB process, and the question was borne out of one of the recommendations by Paul. One of the tactics Paul recommends is breaking the process up over multiple nights.

The person I was talking to was planning to run DBCC CHECKALLOC and DBCC CHECKCATALOG the first night of the week and then spread out DBCC CHECKTABLE executions for all of the tables across the other nights. His database has a mixture of some very big tables and lots of small tables. His question to me was whether he could save time by running the DBCC CHECKTABLE commands in parallel threads and processing multiple tables at the same time. His thought was that while one job was running DBCC CHECKTABLE against a very large table, another job could be running it against the smaller tables.
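One way to build the nightly lists is to generate a DBCC CHECKTABLE command per table along with a rough row count, and divide the tables up by size. This is a sketch of my own, not the asker’s actual script:

```sql
-- Generate a DBCC CHECKTABLE command per user table, with an approximate
-- row count to help balance the tables across the nightly jobs.
SELECT 'DBCC CHECKTABLE (''' + SCHEMA_NAME(t.schema_id) + '.'
           + t.name + ''') WITH NO_INFOMSGS;' AS checktable_cmd,
       SUM(p.rows) AS approx_rows
FROM sys.tables AS t
JOIN sys.partitions AS p
    ON p.object_id = t.object_id
   AND p.index_id IN (0, 1)          -- heap or clustered index only
GROUP BY t.schema_id, t.name
ORDER BY approx_rows DESC;
```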

At the time, I wasn’t sure how well multiple jobs would co-exist, but I said I would run some tests and see. The results of my findings are below.

Holy Double DBCC, Batman!


I decided to use my AdventureWorksDW2012 database for these tests. I have a very large table named dbo.FactInternetSalesBig that I added to the database and use for testing quite often. dbo.FactInternetSalesBig has 30,923,776 rows in it and is basically just a copy of the dbo.FactInternetSales table with the data re-inserted repeatedly until it was sufficiently large. The original dbo.FactInternetSales table has 60,398 rows in it, so it is a fraction of the size of the big version. Another difference between the two tables is indexes. The small table has 9 total indexes: a clustered index, 7 nonclustered indexes, and a columnstore index (a remnant of a different test), whereas the big table is just a heap.

I already knew from experience that the big table holds about 7 GB of data, which takes a fair amount of time to load into memory. The first thing I did was run DBCC CHECKTABLE on both tables to make sure they both had their data in memory. Next, I gathered run times for the DBCC CHECKTABLE runs individually for my baseline values. Then I captured the run times for both commands run at the same time in separate query windows. I toyed around with several other combinations of factors, and the ones that made the most compelling story were limiting the maximum degree of parallelism (max DOP) and creating the database snapshot ahead of time. I had expected that running the DBCC commands directly against the database would not be ideal, as each job would create its own database snapshot (see day 1 of this series: Does DBCC Automatically Use Existing Snapshot?).

The code for the DBCC CHECKTABLE commands is shown below. The commands were the same for running against the database snapshot (named AWSnap) except that the USE statement pointed to AWSnap.

Use AdventureWorksDW2012;

Declare @StartTime datetime2 = getdate(),
        @TimeLapse int;

DBCC CHECKTABLE('dbo.FactInternetSalesBig') with No_InfoMsgs;
Set @TimeLapse = DATEDIFF(ms, @StartTime, getdate());
Select @TimeLapse;

Use AdventureWorksDW2012;

Declare @StartTime datetime2 = getdate(),
        @TimeLapse int;

DBCC CHECKTABLE('dbo.FactInternetSales') with No_InfoMsgs;
Set @TimeLapse = DATEDIFF(ms, @StartTime, getdate());
Select @TimeLapse;
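For the snapshot tests, the snapshot has to exist before the DBCC runs start. A minimal sketch of creating it (the logical file name and path below are placeholders; use the actual logical name from sys.master_files for your copy of the database):

```sql
-- Create a database snapshot named AWSnap so that DBCC runs against it
-- directly instead of each run creating its own hidden internal snapshot.
CREATE DATABASE AWSnap
ON ( NAME = AdventureWorksDW2012_Data,      -- placeholder logical file name
     FILENAME = 'C:\SQLData\AWSnap.ss' )    -- placeholder sparse-file path
AS SNAPSHOT OF AdventureWorksDW2012;
```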

One of the interesting things I noticed was that running the jobs in parallel caused a relatively small increase in run time for the large table and a large increase in run time for the small table. Initially, I noticed a lot of IO_COMPLETION and PAGEIOLATCH_SH (shared page I/O latch) waits on the hidden snapshot that gets created (e.g., 13:1:1444304) and the occasional PREEMPTIVE_OS_FILEOPS (calling out to the OS for file operations) wait. This quickly gave way to CXPACKET (parallelism exchange event) waits, with one thread waiting on WRITE_COMPLETION. Clearly there was contention between the parallel jobs.
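A simple way to watch what the running DBCC sessions are waiting on is to poll sys.dm_exec_requests while the parallel jobs run (a sketch):

```sql
-- Show the current and most recent wait types for any running DBCC commands.
SELECT r.session_id, r.command, r.wait_type, r.wait_time, r.last_wait_type
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'DBCC%';
```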



Altogether, I ended up with 8 different sets of numbers. One thing to note is that although I show a single set of numbers for each category, each test was run multiple times and the run times were averaged. Additionally, the cache was pre-warmed for both the database and the pre-existing database snapshot. Some generalizations can be made about my results based on these numbers.

  • Running the jobs in parallel tends to take slightly longer than running individually

  • Running the jobs with limited max DOP was faster on the database directly, but slower against the snapshot

  • The best results were achieved with higher max DOP on the pre-existing snapshot

Though not shown in the numbers, I did try various levels of max DOP. The test machine has 8 logical CPUs, and the max DOP values used for the reported numbers were 0 (the number of logical CPUs, or 8) and 1 (parallelism disabled). 1 was chosen as the test value because it demonstrated the best results for running the jobs in parallel against the database, and results degraded incrementally as max DOP was increased. Running the jobs against the pre-existing snapshot yielded the worst results with max DOP = 1 and improved incrementally as max DOP was increased.
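At the time these tests were run, DBCC CHECKTABLE had no MAXDOP option of its own (that arrived later, in SQL Server 2014 SP2), so limiting parallelism for a test meant temporarily changing the instance-wide setting:

```sql
-- Temporarily disable parallelism instance-wide for the limited-DOP tests.
-- Remember to restore the original value afterward.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```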

 

Run Time (ms)                           | FactInternetSalesBig | FactInternetSales | Total Run Time
----------------------------------------|----------------------|-------------------|---------------
Baseline (individually on database)     | 234274               | 3614              | 237888
Parallel on database                    | 239547               | 237027            | 239547
Individually on database w/limited DOP  | 147240               | 2467              | 149707
Parallel on database w/limited DOP      | 153836               | 114123            | 153836
Individually on snapshot                | 12853                | 263               | 13116
Parallel on snapshot                    | 14617                | 1647              | 14617
Individually on snapshot w/limited DOP  | 49710                | 660               | 50370
Parallel on snapshot w/limited DOP      | 50520                | 950               | 50520
