We often see databases that are simply larger than they need to be. This is usually attributable to technical debt and the gradual growth of data over time. So, what does it look like to put SQL Server on a diet and improve the health of your database?
The good news is that much of this data is often not needed at all, or only very infrequently. That means we can use a few approaches to make your database perform better and use fewer resources. Wouldn’t it be nice if your queries ran faster and processes like your backups had less work to do because your database is simply smaller?
We can get there! In this post, we’ll look at a few different things you can implement in order to improve your database’s health.
Unused and Redundant Data
The first and most obvious way to slim down your database is to find unused data and simply delete it from the database if it’s not needed any more. Do you have a table where you have a large amount of data that’s old and just not used? Is there a reason that we’re keeping it around?
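If you do decide to purge old data, deleting it in small batches keeps transaction log growth and blocking under control. A minimal sketch, assuming a hypothetical dbo.AuditLog table and a five-year retention period:

DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    -- Delete in small chunks so each transaction stays short
    DELETE TOP (5000)
    FROM dbo.AuditLog
    WHERE CreatedDate < DATEADD(YEAR, -5, GETDATE());

    SET @rows = @@ROWCOUNT;
END;

The loop exits once a pass deletes no rows, and each batch commits separately so other sessions aren’t blocked for long.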
There are some obvious blockers to this. Perhaps you deal with data that you have to keep for a certain amount of time due to regulatory reasons (financial, medical, etc.). If that’s your scenario, you’ll want to look at some of the other options on this list.
Do remember, though, that the requirement to store data does not necessarily mean that it needs to be attached to a live server and accessible at all times. In some cases, backing up your database and storing the backup is completely acceptable to meet the requirement. Considering how cheap blob storage is nowadays, this could be a very cost-effective way of being compliant with data retention rules.
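As a sketch of that approach, SQL Server can back up directly to blob storage with BACKUP TO URL. The database name, storage account, and container below are hypothetical, and the example assumes a credential for the container already exists:

-- Back up the database to blob storage before removing it from the live server
BACKUP DATABASE SalesHistory
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/SalesHistory.bak'
WITH COMPRESSION, CHECKSUM;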
Is the case above true, but you sometimes do need to query the historical data? You could consider an archive table. These usually share the live table’s name with an _Archive or _History suffix and hold data that still needs to be “queryable” but isn’t touched by most queries.
This will require code changes if you’re accessing the data via an application. If you need the older data, you’ll use a UNION ALL to combine the two tables. You’ll also want to watch out for any schema changes to either table, as you’ll want to keep them identical for performance reasons.
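One way to hide that UNION ALL from most of your code is a view over both tables. This is just a sketch; dbo.Orders and dbo.Orders_Archive are hypothetical names, and the two tables are assumed to have identical schemas:

-- A view that combines live and archived rows for the queries that need history
CREATE VIEW dbo.Orders_All
AS
SELECT OrderID, CustomerID, OrderDate, TotalDue
FROM dbo.Orders
UNION ALL
SELECT OrderID, CustomerID, OrderDate, TotalDue
FROM dbo.Orders_Archive;

Queries that only need recent data keep hitting dbo.Orders directly; only the reports that need history use the view.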
The advantage of this approach is that your most commonly executed queries that deal with the latest data simply have less work to do. You have less data to read from disk, less data to store in memory, and less data to join to other data.
Another trick you could apply is to use different filegroups for the archive tables. When you do that, you can put the archive data on a drive that isn’t your nice, expensive, fast disks. It’ll be slower to retrieve, but with this data, does that matter?
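A sketch of that setup, with hypothetical database, table, and file names:

-- Create a filegroup backed by a file on cheaper, slower storage
ALTER DATABASE Sales ADD FILEGROUP Archive;

ALTER DATABASE Sales
ADD FILE (
    NAME = 'Sales_Archive',
    FILENAME = 'E:\SlowDisk\Sales_Archive.ndf'
) TO FILEGROUP Archive;

-- Rebuilding the archive table's clustered index on the new filegroup
-- moves its data onto that storage
CREATE CLUSTERED INDEX CIX_Orders_Archive
ON dbo.Orders_Archive (OrderID)
WITH (DROP_EXISTING = ON)
ON Archive;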
Table Partitioning
If you don’t want to delete this data or move it to another table, then you could consider partitioning your tables. This process can be tricky to get right, so definitely research this option, but when it does work well it can be an excellent solution.
Partitioning is the process of separating your data into segments between given values, with the idea being that you only query the range that contains the data you need. Dates are often used for this. For example, people might partition the last year’s worth of data by month (so 12 partitions), and then have anything older sit in a large partition dedicated to historical data. Queries against this month’s data then only have to deal with a small fraction of the data in the table, which is great for performance.
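As a sketch, a monthly scheme like that might look as follows (the table and boundary dates are hypothetical; RANGE RIGHT means everything before the first boundary lands in the historical partition):

CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES (
    '2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01',
    '2023-06-01', '2023-07-01', '2023-08-01', '2023-09-01',
    '2023-10-01', '2023-11-01', '2023-12-01', '2024-01-01'
);

-- Map all partitions to PRIMARY for simplicity; in practice you might
-- spread them across filegroups
CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders (
    OrderID int IDENTITY NOT NULL,
    OrderDate date NOT NULL,
    CustomerID int NOT NULL
) ON psOrderDate (OrderDate);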
Important to note: there is usually some maintenance required here. You probably don’t want to create 100+ partitions up front, in other words a partition for every month from the start of your data through the next 10 or 20 years.
The better approach is to have a monthly SQL Agent job that merges your oldest partition into the ‘historical’ partition and then creates a new one for the next month. This also applies if you’re partitioning on a surrogate key (e.g. splitting it into ranges of a certain number of IDs).
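That monthly job boils down to a merge and a split. This sketch assumes a hypothetical partition function pfOrderDate and scheme psOrderDate with date boundaries; your names and dates will differ:

-- Fold the oldest monthly partition into the historical partition
ALTER PARTITION FUNCTION pfOrderDate ()
MERGE RANGE ('2023-02-01');

-- Tell the scheme which filegroup the next partition will use
ALTER PARTITION SCHEME psOrderDate
NEXT USED [PRIMARY];

-- Create an empty partition for the upcoming month
ALTER PARTITION FUNCTION pfOrderDate ()
SPLIT RANGE ('2024-02-01');

Splitting the newest (empty) partition and merging the oldest are both cheap; splitting a partition that already contains data is expensive, which is why the job should run before the new month’s data arrives.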
The advantage of partitioning is that if your queries always search on a field that uses the partitioning key then you don’t need to make any code changes in the app to take advantage of this feature!
Filtered Indexes
Filtered indexes are a godsend when it comes to dealing with certain types of data. One example we see a lot is soft deletes, where a table contains a bit column called IsActive or IsDeleted (or similar). Queries written against these tables will almost always filter on IsActive = 1 or IsDeleted = 0 so they only look at live data, which makes them an ideal target for filtered indexes.
Filtered indexes act exactly like a normal nonclustered index, but with the addition of a filter. It will look something like this:
CREATE INDEX IX_IndexName
ON TableName (TableIDField, IsActive)
WHERE IsActive = 1
It’s important to realize that you want the filtered column in the indexed fields for best performance and to ensure the optimizer chooses to use this index.
Reduced Index Width
Another issue that we see more than we’d like is the use of very wide indexes. Every field in your index costs extra storage space and increases your memory requirements. It’s very easy to find a list of indexes that have a large amount of columns. Just use a query like this:
SELECT t.name AS TableName, i.name AS IndexName, COUNT(col.name) AS ColumnCount
FROM sys.indexes i
INNER JOIN sys.index_columns ic ON i.OBJECT_ID = ic.OBJECT_ID and i.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.OBJECT_ID = col.OBJECT_ID and ic.column_id = col.column_id
INNER JOIN sys.tables t ON i.OBJECT_ID = t.OBJECT_ID
WHERE i.is_primary_key = 0
AND i.is_unique_constraint = 0
AND t.is_ms_shipped = 0
GROUP BY t.name, i.name, i.index_id
ORDER BY COUNT(col.name) DESC
That will return a list of all of your indexes and the number of columns they include, sorted descending by column count. You’ll then have to look at these indexes, look at the queries that hit these tables (Query Store can be excellent for that), and decide whether you can reduce the number of columns. The biggest wins will be the largest data types, so look out for things like (max) fields and ntext.
Index Fragmentation
If you don’t have index maintenance scheduled, you could well have some very fragmented tables. Fragmentation is caused by data deletions and updates, and it can happen a lot on heap tables too. It’s easy to check your fragmentation using a query like the one below:
SELECT s.name AS SchemaName, t.name AS TableName, i.name AS IndexName,
    ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL) AS ps
INNER JOIN sys.tables t ON ps.OBJECT_ID = t.OBJECT_ID
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
INNER JOIN sys.indexes i ON i.OBJECT_ID = ps.OBJECT_ID AND ps.index_id = i.index_id
WHERE i.name IS NOT NULL
AND ps.avg_fragmentation_in_percent > 0
AND ps.page_count > 1000
ORDER BY ps.avg_fragmentation_in_percent DESC
This query returns all fragmented indexes, as long as they’re over 1000 data pages (8MB) in size. If you have some large and fragmented indexes, then you’ll want to do some reorganizing or rebuilding. You’ll then want to set up regular maintenance. To do this, we recommend using Ola Hallengren’s index maintenance solution.
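The one-off fix looks something like this (the index and table names are hypothetical, and the 5%/30% thresholds are common rules of thumb rather than hard rules):

-- Lightly fragmented (roughly 5-30%): reorganize
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REORGANIZE;

-- Heavily fragmented (roughly over 30%): rebuild
-- ONLINE = ON requires Enterprise edition (or Azure SQL)
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (ONLINE = ON);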
Data Compression
There are some excellent options for compressing your data, depending on exactly what your data is, because some data is more compressible than other data.
Row and page compression can be an effective way to reduce the size of your tables. There are some considerations here, including SQL Server edition and version (e.g. all editions of SQL Server 2022 support data compression, but on SQL Server 2016 you need to be on Service Pack 1 or later for it to be available outside Enterprise edition).
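A sketch of trying page compression on a hypothetical dbo.Orders table; sp_estimate_data_compression_savings lets you check the likely benefit before committing:

-- Estimate how much space page compression would save
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'Orders',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If the estimate looks good, rebuild the table with page compression
ALTER TABLE dbo.Orders
REBUILD WITH (DATA_COMPRESSION = PAGE);

Nonclustered indexes are not compressed by the table rebuild; each one needs its own ALTER INDEX ... REBUILD WITH (DATA_COMPRESSION = PAGE) if you want them compressed too.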
Columnstore compression is a completely different way to store and compress data. As the name suggests, it doesn’t compress data row by row; rather, it compresses a column’s worth of data at a time. This is excellent if you have a lot of duplicated values within the columns of your table, which makes it particularly useful in data warehouse solutions. The downside is that in an OLTP workload, any inserts into or updates to the table have to do more work because of the compression. How well it works very much depends on your data and your workloads.
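For a warehouse-style table, creating a clustered columnstore index replaces the row-based storage entirely. A sketch with a hypothetical fact table:

-- Convert the table to columnstore storage and compression
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;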
This is not an exhaustive list of all the options, but you should now have a great start and some ideas of what you can do in your own environment. Smaller data is faster data, after all, so let’s try and get your database running like greased lightning. If you’re reading before it happens, you can also join me for a webinar on this topic scheduled for Wed., Jan. 24, 2024. We’ll cover many of the same topics — just more in depth and with demos and additional useful insights.
If you want a hand with any of this, then please get in touch and we’d be more than willing to help.