ReFS vs NTFS – Introduction (Part 1)



Many companies use disk-based storage for backups, backup copies, replicas etc. to adhere to or, better still, exceed the 3-2-1 rule.  There are a number of organic ways in which this can be achieved and for many, infused with the DNA of their preferred storage vendor, this is deemed preferable or more affordable when compared to dedicated deduplicating backup appliances such as Dell EMC Data Domain, ExaGrid and HPE StoreOnce.  I will not get into the weeds regarding these solutions in this blog; that is an entirely different conversation.

That being said, if we abstract the underlying hardware from the repository, for most the repository will be formatted using either ReFS or NTFS.  Granted, we can use Linux repositories, and there are the obvious CIFS and NFS backend options for *nix file systems. However, it is widely accepted and evidenced that local storage is a better option than a filer-based share (CIFS or NFS) for a Windows-based repository, and there is a multitude of information online confirming this.

Lastly, not every company has enough in-house *nix skills to rely on a *nix-based backup repository or indeed maintain it effectively.  Offloading support and maintenance to a service provider or storage vendor can be an attractive option when it comes to deduplicating appliances. Additionally, MSPs are more than capable of providing the same via BaaS (Backup as a Service) or as an extension of an MSA (Managed Service Agreement).

Calm Seas v Stormy Waters?



Many of you will be looking at the featured image of this post thinking “ReFS has hardly been plain sailing over the past year!” You would be right. Internally (at Novosco) we found a solution last year that worked for us and let us provide stable and reliable backups and restores.  I wanted to look at the data on the basis that, once all the blockclone issues are resolved (which for most appears to be the case following Microsoft’s most recent patch), ReFS will at some point be stable and reliable.

My goal was to analyse the data with regards to the capacity savings that can be realised, along with some other factors. To this end, I focused on blockclone and GFS synthetic full backups to make a comparison against NTFS and GFS full backups employing deduplication when they are given the same information.

As it stands, many people are using ReFS in production right now and many more will be considering the move. If you are considering the move, I think there has never been a better time to start experimenting with ReFS.  Kudos to Microsoft and Veeam for their collaborative efforts to address the issues early adopters have faced.  Let’s be honest, a well-designed Veeam backup platform can push any disk-based storage to its limits; Veeam is very efficient at moving large amounts of data quickly.

Let’s not forget, backups are by no means a “light” operation. We have platforms in production that can pretty much saturate uplinks to repositories relentlessly (TBs per hour) until backup completion. As a result this is not a light workload for the underlying file system either, as evidenced by some teething pains throughout 2017 and thus far in 2018.


Why are you running this test?


Part of my role within Novosco is to contribute to the technical validation of projects and solutions within our team.  Looking at the data lets you plan and ensure that the solution being considered for a customer is not only technically valid now, but will remain so throughout its planned lifetime.  Adding capacity or performance to a backup solution post deployment means having to spend more money; good planning avoids having to “move the goalposts”.

What were the conditions of the test?


I wanted to look at real data. Synthetic tests are nice but, as we all know, nothing beats running something in production to get accurate results.  To this end, it had to be real data with real daily change rates and it had to be the same data targeting all repositories.

Having access to an ever-increasing number of highly capable platforms, a decision was made to perform a daily copy of 8 virtual servers to 8 different repositories over the course of 8 weeks or from another view, “further protect” some data, on a platform that would not feel any impact from the tests.  To that end, a test plan was created.

2TB LUNs were created from a SAN and presented as drives to a server. All repositories were configured with “Per VM Files” to, amongst other things, facilitate data analysis at the VM level.  Repositories were created as follows:

ReFS vs NTFS - Veeam Repository Configuration

Repository               | File System | Block Size | Veeam Compression
ReFS Optimal 4K          | ReFS        | 4K         | Optimal
ReFS Optimal 64K         | ReFS        | 64K        | Optimal
NTFS Optimal 4K          | NTFS        | 4K         | Optimal
NTFS Uncompressed 4K     | NTFS        | 4K         | Uncompressed
NTFS Dedupe Friendly 4K  | NTFS        | 4K         | Dedupe Friendly
NTFS Optimal 64K         | NTFS        | 64K        | Optimal
NTFS Uncompressed 64K    | NTFS        | 64K        | Uncompressed
NTFS Dedupe Friendly 64K | NTFS        | 64K        | Dedupe Friendly

What was the flow of data for the test?


The data flow from hosts to test repositories was as follows:

  1. VMs are backed up to a capable BaaS node
  2. VMs are copied from the BaaS node to another location via backup copy jobs
  3. VMs are copied from the BaaS node to the test repositories outside of the normal backup window
ReFS vs NTFS - Veeam Test Data Flow


What types of servers did you use for testing?


So now that I had 8 repositories I could target, I needed real-world data to populate them with.  I wanted a cross sample of VMs with the varied roles and change rates that most companies would have.  I selected the following based on their daily change rate and provisioned capacity, bearing in mind that I would need to store 8 weeks’ worth of full backups plus daily incremental backups within a 2TB footprint for up to 600GB of VMs.
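As a rough sanity check on that sizing, the sketch below works out the footprint before any compression, deduplication or block cloning is applied. It is only an illustration: the function names are hypothetical, the 7 retained daily incrementals are passed in as a parameter, and the 5% daily change rate is an assumption of mine rather than a figure measured in this test.

```python
def raw_footprint_gb(source_gb, weekly_fulls, daily_incrementals, daily_change_rate):
    """Unreduced space needed for the fulls plus the retained daily incrementals."""
    fulls = source_gb * weekly_fulls
    incrementals = source_gb * daily_change_rate * daily_incrementals
    return fulls + incrementals


def required_reduction(raw_gb, repository_gb):
    """Overall data reduction ratio needed for the raw footprint to fit the repository."""
    return raw_gb / repository_gb


# 600GB of VMs, 8 weekly fulls, 7 daily incrementals, assumed 5% daily change.
raw = raw_footprint_gb(600, 8, 7, 0.05)
ratio = required_reduction(raw, 2048)  # 2TB LUN expressed in GB
print(f"raw footprint: {raw:.0f} GB, reduction needed: {ratio:.2f}:1")
```

The point is simply that eight unreduced fulls of 600GB cannot fit in 2TB, so every repository in the test relies on some form of data reduction to stay within its LUN.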

As such, one of each of the following types of server was selected:


  • Application
  • Web Application
  • Database
  • Domain Controller
  • Exchange Hybrid
  • Web Server
  • Light Application
  • Network Services


What exactly were you looking to compare?


I wanted to be able to make a direct comparison between the various block sizes, file systems, compression and deduplication settings that, from what I have seen over the years, are often used in backup copy jobs.  Having the same 8 VMs copied to 8 different repositories every day for 8 weeks was a great way to see exactly how the different settings compared.

Granted, I knew this was not going to be 100% accurate for everyone. Due to the limits of finite resources it is a relatively small sample size, but the mixture of roles on the test VMs covered a nice cross section of the types of servers in everyday use and would provide a good indication.  I wanted a blend of structured, unstructured and binary data to report against, and these servers fulfilled that requirement within the constraints of the repository space available for testing.


When can we see results?


Results are analysed in Part 2:

ReFS vs NTFS – Initial Analysis (Part 2)

-Craig Rodgers




ReFS vs NTFS – Initial Analysis (Part 2)

This is Part 2 of a series. If you have not read Part 1 you can do so here:

ReFS vs NTFS – Introduction (Part 1)

How was Veeam configured?


Lastly, I required a configuration for the backup copy jobs to test the file systems as best I could.  To this end I decided to create backup copy jobs that targeted the same 8 servers, with 7 incremental and 8 weekly backup copies configured via GFS.  The final jobs looked something like this:

ReFS vs NTFS - Veeam Test Copy Jobs

I want to see results!


Okay, following on from my previous post, I wanted to share information garnered from the early results.  I wanted to see how the initial ingestion of data looked, to make a comparison between the different storage techniques.  With that in mind, before any post processing was applied, I looked at the file systems and compared the sizes of the per-VM .VBK files (these are full backups in Veeam).

Interestingly, 64K ReFS formatted drives have an additional file system overhead once formatted when compared to 4K.  Feel free to test this yourself: create two new, large, identically sized thin provisioned disks on a VM running Windows 10 1703 or above or Server 2016, format the drives as ReFS with 4K and 64K block sizes and look at the used and free space.
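If you would rather script the check than read the numbers off Explorer, something along these lines should do it. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the drive letters in the comment are hypothetical, so substitute whatever your two freshly formatted test volumes are mounted as.

```python
import shutil


def used_fraction(mount_point):
    """Return (used_bytes, total_bytes, used/total) for a mounted volume."""
    usage = shutil.disk_usage(mount_point)
    return usage.used, usage.total, usage.used / usage.total


def compare_overhead(volume_a, volume_b):
    """Used bytes on volume_b minus used bytes on volume_a."""
    used_a, _, _ = used_fraction(volume_a)
    used_b, _, _ = used_fraction(volume_b)
    return used_b - used_a


# Example with hypothetical drive letters for the 4K and 64K ReFS test disks:
# print(compare_overhead("E:\\", "F:\\"), "bytes of extra overhead on the 64K volume")
```

Run it straight after formatting, before anything is written to either disk, so that the difference reflects file system metadata only.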

I will have to find out why this is. Nothing turned up after a quick Google so I assume it’s something quite low down in the ReFS filesystem.  I have publicly shared the graphs below, and all other graphs used, via Power BI. I would encourage people to expand the content below and have a closer look.


So, what can we learn from the above?


Right away we can confirm that Veeam default job settings give us a varying amount of data reduction before the data even lands on the repository.  As expected, the DB server with its structured data achieves the best reduction in space, while servers with binary data see the least savings.  Check the two graphs below for a comparison of 4K and 64K block sizes:

How well did they dedupe?


Similarly, as expected, raw uncompressed data overwhelmingly achieved the best levels of deduplication. I have included the ReFS repository data for comparison; obviously there was no post-process operation on the ReFS repositories.  The chart below shows the results of enabling deduplication on the NTFS repositories and compares the capacity of the repository before and after deduplication had completed:

Initial observations look good…


Over the course of the next 8 weeks the copy jobs did their thing, and during the tests I also made an unexpected discovery regarding the behaviour of dedupe-friendly vs optimal and uncompressed that caught me off guard.

The main backups are stored using optimal compression. When a copy job for the optimal and uncompressed repositories runs, the Veeam data mover service sends the data in its deduplicated, compressed form over the network to the mount server for the target repository and the blocks get written.

My observations for dedupe-friendly seemed to show that when the source optimal compression blocks were read from the backup, the data mover service inflated the data again, reprocessed it into dedupe-friendly and sent the relatively inflated dedupe-friendly data over the network.  You can observe this behaviour in the slides below:

If you are unsure as to what the job name means, please refer to the jobs screenshot above. In short, jobs assume a default of optimal compression; DF = Dedupe-Friendly and UC = Uncompressed.



I will ask some of my Veeam friends to confirm if this behaviour is expected. I imagine it is to increase the retention capabilities on deduplicating appliances such as Dell EMC Data Domain and HPE StoreOnce.  To be fair, I’m not sure whether the data mover service on an ExaGrid decompresses this before it commits to disk either. If so, it may be better to send optimal blocks over the network to increase throughput if that is a constraint, especially over a WAN link!

What was the usage on repositories over the course of the 8 weeks of testing?


The server that hosted the LUNs for the repositories was monitored by an agent that logged drive usage every couple of minutes to our monitoring platform. This would have created a silly amount of data points, so in the end I settled for 478 data points per repository for file system usage over the course of the 8 weeks.  This provided capacity reporting every 2 hours 51 minutes, and I was able to export this to a CSV and analyse the data.
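For anyone wanting to thin out their own monitoring export in a similar way, here is a small sketch of the arithmetic. The `covered_span` and `downsample` helpers are hypothetical names of mine, and the even-spacing approach is an assumption for illustration; it is not necessarily how our monitoring platform produced its export.

```python
from datetime import timedelta


def covered_span(sample_count, sample_interval):
    """Total time covered by evenly spaced samples."""
    return sample_count * sample_interval


def downsample(readings, target_count):
    """Pick target_count evenly spaced readings from a longer series."""
    if len(readings) <= target_count:
        return list(readings)
    return [readings[i * len(readings) // target_count] for i in range(target_count)]


# 478 samples at one reading every 2 hours 51 minutes.
span = covered_span(478, timedelta(hours=2, minutes=51))
print(f"{span / timedelta(days=1):.1f} days covered")
```

The span works out at a little under 57 days, which lines up with the 8 weeks of testing.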

Below you can see how each respective compression setting per file system compared for 4K and 64K block sizes:


I think this is the first time we start to see the true nature of ReFS vs NTFS: the ReFS gradient is smooth and predictable, whereas the graphs for NTFS look choppy and come in waves. Additionally, 4K blocks and 64K blocks appear to be very similar in results:


What if you only stored full backups and ignored the dailies?


Due to the repositories containing ReFS and NTFS filesystems, to make a fair comparison I had to chop off the first week and last week and use the 6 weeks in the middle for the next graph.  I did not want to report on potentially skewed results. Once I had all the other reporting data I needed, I removed the first and last weeks from the repositories and ran scrubbing and garbage collection on all NTFS volumes; the ReFS volumes had the same backup copies removed.

The following graph is the middle 6 weeks of the backup test:

This is the only graph looking at 6 weeks of data; all others report on 8 weeks.


There is a lot of information in this graph.  Initially, the capacity savings of processed data in the NTFS uncompressed repositories are impossible to ignore; however, you cannot ignore the additional space required to ingest the data either.  If a long-term retention repository is your goal then, within the constraints of NTFS deduplication (1TB officially, 4TB seen restored without issue in testing), uncompressed offers huge gains in terms of data reduction: 20:1 in this case, for free, with Windows.

10:1 can be achieved using dedupe-friendly albeit using additional network bandwidth. Almost 4:1 can be attained using optimal compression which works around the 1TB officially supported file system limits nicely depending on data type. With ReFS and Optimal compression, we can achieve an approximate 2.5:1 ratio using the data in this test, obviously your real-world mileage may vary. On some deployments I have seen it as high as 3.5:1.
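If ratios are not your thing, they convert to percentage savings easily enough. The helper names below are hypothetical and the sketch is plain arithmetic, nothing specific to Veeam or Windows deduplication.

```python
def savings_percent(ratio):
    """Space saved, as a percentage, for a data reduction ratio of ratio:1."""
    return (1 - 1 / ratio) * 100


def stored_size(logical_tb, ratio):
    """Capacity consumed on disk for a given amount of logical backup data."""
    return logical_tb / ratio


for label, ratio in [
    ("NTFS uncompressed", 20),
    ("NTFS dedupe-friendly", 10),
    ("NTFS optimal", 4),
    ("ReFS optimal", 2.5),
]:
    print(f"{label}: {ratio}:1 saves {savings_percent(ratio):.0f}% of capacity")
```

So 20:1 is a 95% saving, 10:1 is 90%, 4:1 is 75% and 2.5:1 is 60%. Note how the step from 10:1 to 20:1 only buys a further 5% of capacity, which is worth weighing against the extra ingest space uncompressed needs.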


What are your thoughts on 4K block sizes?


4K blocks offer no benefits. If anything, they will be a hindrance long term, if for nothing else than increased volume fragmentation.



Part 3 is available here:

ReFS vs NTFS – Conclusion (Part 3)


-Craig Rodgers


ReFS vs NTFS – Conclusion (Part 3)

This is Part 3 of a series. If you have not read Parts 1 or 2 you can do so here:

ReFS vs NTFS – Introduction (Part 1)

ReFS vs NTFS – Initial Analysis (Part 2)

4K blocks aside then, what else have we learnt?


If we look at the 64K repositories over the 8 weeks of testing, we can make a direct comparison between the various compression settings that make sense to deploy in the real world:

Why does the Lego man at the beginning of the NTFS graphs transform into a somewhat offensive gesture?


Initially, Windows deduplication was set to deduplicate files older than 5 days; I changed it to 1 day (where the 3rd Lego man appears chopped in half).  That aside, it is clear at this point that uncompressed requires additional overhead before you can avail of the benefits realised through deduplication.

Realistically, to see any benefit you would need 2.5 to 3 times the amount of used capacity on your platform to be able to ingest data.  Bear in mind we also need to allow for unexpected increases in data: a file server migration, DB cluster rebuild, Exchange DAG rebuild or upgrade all create relatively huge amounts of changed data “over the norm”.  However, such a repository would yield huge retention capabilities.
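To put numbers on that, a minimal sketch follows. The 2.5 and 3 multipliers come from the observation above; the `burst_allowance` parameter is a hypothetical extra of mine for scoping breathing room, and the value you give it is entirely your call.

```python
def ingest_capacity_range(used_tb, low_factor=2.5, high_factor=3.0, burst_allowance=0.0):
    """Return (low, high) ingest capacity in TB for a platform with used_tb in use.

    burst_allowance is an optional fraction added on top for unexpected
    bursts of changed data, e.g. 0.2 for an extra 20%.
    """
    uplift = 1 + burst_allowance
    return used_tb * low_factor * uplift, used_tb * high_factor * uplift


low, high = ingest_capacity_range(10)
print(f"10TB used needs roughly {low:.0f}TB to {high:.0f}TB of ingest capacity")
```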

Don’t forget, Veeam, like every other backup system on the planet, does not know what causes huge amounts of changed data if present in CBT, which is short for Changed Block Tracking.  This is the mechanism used by backup software to detect changes to a physical or virtual server hard drive at the block level.

Additionally, CBT does not explain that new VMs are a new SQL cluster, an Exchange DAG node or upgrade, a malware infection etc.  You really need to scope in some breathing room for unexpected bursts in changed data.  Furthermore, it is now impossible to ignore not only the performance of ReFS but the predictable trend line it provides.

Either way it is a good idea to try and scope enough flash storage in the caching layer, to allow for a burst of speed when ingesting data and performing transforms.  If you can handle your daily change rate plus a comfortable overhead, you should be in a good position.


At what point commercially, does it become more effective to simply add more space and enjoy faster backup copies and restores using ReFS vs the additional capacity and hardware support requirements of NTFS?


Great question, and to quote every single person who has provided me with IT training over the years, “it depends”.  If you remember back to Part 1, I referred to the organic nature of NTFS and ReFS repository configurations. It is impossible to calculate this on a general basis; you must take everything in a platform into consideration to give an accurate result.  That being said, you also need to weigh the speed of ReFS against the capacity benefits of NTFS.


I know some people prefer line graphs so here is one comparing all 64K repositories:

From this we immediately learn that, in our scenario, for up to 3 weeks of retention ReFS is better, hands down; performance gains aside, there is little difference with regards to processed capacity. At 5 weeks ReFS is holding its own on capacity vs NTFS, but after this point the benefits of deduplication really start to kick in.

Once more, we can confirm the predictable nature of ReFS vs the varying logarithmic curves of NTFS with deduplication.  In essence, so far it boils down to performance vs retention.

What about drive wear and tear & the churn rate, how will that affect any decision?


If using NTFS and dedupe-friendly or uncompressed, there are requirements to read and write significantly more data than with ReFS or NTFS using optimal compression. That will indeed translate into additional drive wear and tear: more IOPS and more data reads and writes, which equates to more head movements etc.  If we look at the amount of changed data over time for 64K repositories we observe the following:


This is a massively underestimated graph and one that is often ignored. Supportability is a keystone in infrastructure design yet is often overlooked outside of cross-patching network uplinks, SAN controllers and host failover capacity.  What options do we have for disk arrays?

  • RAID 5
  • RAID 6
  • RAID 2.0+ (Huawei Storage rebuild times are insane vs RAID 5/6: hours and minutes vs days and hours)
  • JBOD
    • Storage Spaces is great if you are Microsoft; however, its handling of failed disks in the real world frankly leaves a lot to be desired from a supportability standpoint. If you have ever experienced a failed caching drive you will know what I am referring to.


So that leaves RAID 6 and Huawei; using RAID 5 with the IOPS and drive capacities required for backup jobs is practically insane. Thankfully Microsoft have declared ReFS is not supported on any hardware virtualised RAID system. How is that deduplicating appliance looking now?

Wait, what?


Microsoft do not support RAID 5, 6 or any other type of RAID for ReFS. Official support* for ReFS means you use Storage Spaces, Storage Spaces Direct or JBOD.



Excerpt from above:

Resilient File System (ReFS) overview

Does that mean every instance of ReFS worldwide using hardware virtualised RAID is technically unsupported by Microsoft?


Technically, yes.  That being said, and as previously mentioned, Microsoft have been working very closely with Veeam to resolve issues in ReFS with regards to Blockclone.  Veeam have at best “spearheaded” and at worst been “early pioneers” of Blockclone technology. When you tie that in with their close working relationship, both parties need ReFS to be stable and surely have common motives for wanting increased adoption of ReFS.

The obvious Hyper-V benefits are a clear indicator here as to their motivation, so this has in many ways been the saving grace for backups, however, you must bear in mind that technically it is not a supported implementation if you are using hardware virtualised RAID.

At some point this will come into play as their get-out-of-jail card.

So, what are my options?


Microsoft seem to be working to resolve issues that by proxy contribute towards supporting ReFS functionality on RAID volumes thus far, however at some point this could change.

For now, if you have ReFS in play, it can work and will continue to receive the collaborative efforts of Microsoft and Veeam.  If you are looking at a new solution, it is impossible to ignore the fact that you must use an alternative to hardware virtualised RAID if you want to use ReFS in a future deployment.

ReFS works in these circumstances and there are a plethora of reports as such.  I have yet to see a failure in an ReFS restore, that being said, if you follow the Veeam forums, there are some who have seen otherwise. Without platform access it is impossible to tell if there were other factors involved.

As it stands, ReFS for the most part is working well now and is probably your best bet for a primary or indeed secondary backup repository, taking all these considerations into account.  As previously discussed, for me this means a 64K block size, ReFS formatted, reverse incremental backup target will be your best all-round primary backup storage.  With regards to a second copy, ReFS is great for fast transforms; however, you may be happy trading performance for retention, in which case backup copies can target an NTFS volume.

Plan ahead folks and look at your data before you commit to a solution!

If you have any comments or thoughts on this series, my opinions or anything else, please let me know via comments below or if you prefer, via Twitter or LinkedIn.  I was genuinely interested in these results myself and I hope some of you were as well.

If you would like to view all the Power BI Graphs in one place you can do so here:

Additionally they can be viewed online in a browser here:

*** Update 20/02/2019 – Microsoft ReFS Support on Hardware RAID ***


Following some sterling work by Anton Gostev, Andrew Hansen & their respective teams, Microsoft changed their support stance on ReFS running on Hardware RAID.

Excerpt from Gostev’s Weekly Digest:

“Huge news for all ReFS users! Together with many of you, we’ve spent countless hours discussing that strange ReFS support policy update from last year, which essentially limited ReFS to Storage Spaces and standalone disks only. So no RAID controllers, no FC or iSCSI LUNs, no nothing – just plain vanilla disks, period. As you know, I’ve been keeping in touch with Microsoft ReFS team on this issue all the time, translating the official WHYs they were giving me and being devil’s advocate, so to speak (true MVP eh). Secretly though, I was not giving up and kept the firm push on them – just because this limitation did not make any sense to me. Still, I can never take all the credit because I know I’d still be banging my head against the wall today if one awesome guy – Andrew Hansen – did not join that Microsoft team as the new PM. He took the issue very seriously and worked diligently to get to the bottom of this, eventually clearing up what in the end appeared to be one big internal confusion that started from a single bad documentation edit.

Bottom line: ReFS is in fact fully supported on ANY storage hardware that is listed on Microsoft HCL. This includes general purpose servers with certified RAID controllers, such as Cisco S3260 (see statement under Basic Disks), as well as FC and iSCSI LUNs on SAN such as HPE Nimble (under Backup Target). What about those flush concerns we’ve talked about so much? These concerns are in fact 100% valid, but guess what – apparently, Microsoft storage certification process has always included the dedicated flush test tool designed to ensure this command is respected by the RAID controller, with data being protected in all scenarios – including from power loss during write – using technologies like battery-backed write cache (for example, S3260 uses supercapacitor for this purpose). Anyway – I’m super excited to see this resolved, as this was obviously a huge roadblock to ReFS proliferation.”


-Craig Rodgers


Forward or Reverse Incremental?



Petrol vs Diesel? Apples & Oranges? Is it a matter of preference? Moving forward seems logical? Is moving backwards better in some cases?  I can think of a few real-world examples that fit both cases but in the sense of Veeam backups, which one is better?

Better is a very general term…


I suppose it is, so let’s weigh things up at a more granular level.  Veeam have a mountain of reference data on the functional differences, performance and best practices for each type of backup.  A lot of the time it comes back to stun times on VM snapshot removal and the backup window, given the additional I/O overhead in creating a reverse incremental on the target backup repository.  These are both common denominators that can be easily compared, as in VM X shows Y and Z using forward vs reverse.  There are other factors, which I seldom see referenced, that need to be considered to compare the two properly.




What is your personal preference?


Reverse incremental with a weekly active full, for longer than most.


Why then, if most others use forward incremental?


These days times have changed. Most platforms and backup targets have a cache layer that takes the initial I/O hit, from SSD cache drives in a NAS or SAN to multiple enterprise grade NVMe drives in a physical backup server or hardware deduplication appliance. RAM is awesome for this as well.

Why more RAM?


More RAM makes things smoother, faster and more stable because caching reduces stress on numerous systems such as disk, network and storage, 100% fact.  It is a pity DDR4 prices went up in 2017. Thankfully Samsung have decided to increase output significantly, along with others such as SK Hynix and Micron, which may bode well for better prices, especially from 2019 onwards. Last year’s oversupply vs this year’s shortage may fare better for those without the buying power of a worldwide cloud service provider, all of whom have been paying a lot more this year to ensure their continued hardware landscape growth.

Stun times are the same using forward or reverse incremental now?


No.  I am not going to misrepresent things here: reverse incremental takes longer to back up because more IOPS must take place on the target repository. As a result the virtualisation layer snapshots will grow larger and will take longer to remove, and that’s a fact.  However, if you are looking at a new backup system, look at your current snapshot removal times. I am almost 100% certain that any new backup system you deploy will have significantly reduced snapshot removal times using reverse incremental backups vs your old forward incremental backups; if it does not, let me know, because something is drastically wrong with the solution you are deploying.  Feel free to create some test VMs with equal change data via a robocopy script or the like from file servers, or whatever other mechanism you are comfortable with. Just do not create a second backup of a production VM: back up once, copy many, which avoids CBT issues in the real world. Daily, I see snapshot removal times in the low single-digit seconds using reverse incremental.
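If you do not have a robocopy script to hand, one alternative way to generate equal change data on each test VM is sketched below. This is my own illustration rather than anything from Veeam: the function name, file naming and sizes are hypothetical. Bear in mind random data is incompressible and will not deduplicate, so it represents a worst case compared with copying real files from a file server.

```python
import os
from pathlib import Path


def write_change_data(target_dir, total_mb, file_mb=16):
    """Write total_mb of random data into target_dir as files of up to file_mb each."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = 0
    index = 0
    while written < total_mb:
        size_mb = min(file_mb, total_mb - written)
        path = target / f"change_{index:05d}.bin"
        with open(path, "wb") as handle:
            # Write 1MB at a time so memory use stays small.
            for _ in range(size_mb):
                handle.write(os.urandom(1024 * 1024))
        written += size_mb
        index += 1
    return written


# Example with a hypothetical path: generate 512MB of changed blocks on a test VM.
# write_change_data(r"D:\changedata", 512)
```

Run the same call with the same size on each test VM before each backup so that forward and reverse incremental jobs are compared against the same amount of change.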

So why bother to take longer to back them up?


90% of the time, which backup are you restoring? The most recent one? Probably. Especially in a disaster scenario, you will be restoring the last backup, and having that backup available as a full will significantly improve restore speed. Need a lower RTO?  Having to restore through incremental backup chains has an overhead; the same can be said for older reverse incremental chains.

You mentioned manageability?


I did. So you are running a backup system and something happens that causes an abnormal amount of changed data. Worst case, you run all-flash systems and you get a ransomware infection that does not get caught.  So now, as far as any backup system knows from CBT data, you have huge amounts of changed data that need to be backed up.  If your repository fills up for this or any other reason, what are your options in a forward incremental job?  Maybe you are insane and the repository free space is not monitored, or the email warnings from Veeam telling you that space was running low were ignored for whatever reason, and you are now at near zero bytes free on your backup repository.  What do you do with a forward incremental backup target repository to get backups working again quickly?

Add more space?


For free?  I’d love a few of those storage systems.  What if you used some of the un-provisioned space last time and you don’t have enough?  Using reverse incremental you can remove older parts of the backup chain and retain the ability to perform a full restore and not interrupt the backup chain, because it grows backwards.


The process is simple:

  1. Ensure no jobs are active
  2. Put the repository into maintenance mode
  3. Remove older retention point files to achieve the desired amount of free space
  4. Re-scan the repository
  5. Check the backups for the job and forget all unavailable restore points
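To help decide which files to remove, here is a dry-run sketch that only lists candidates and deletes nothing. It assumes a reverse incremental chain where the .vbk is the most recent full and the .vrb files are the older rollback points, and it assumes file modification time reflects restore point age. The function name is hypothetical, and with “Per VM Files” enabled you would want to apply it per VM chain rather than blindly across the folder.

```python
from pathlib import Path


def pick_rollbacks_to_remove(job_dir, bytes_to_free):
    """Return (files, freed_bytes): the oldest .vrb files needed to free bytes_to_free.

    Dry run only: nothing is deleted and the .vbk full is never selected.
    """
    rollbacks = sorted(Path(job_dir).glob("*.vrb"), key=lambda p: p.stat().st_mtime)
    selected = []
    freed = 0
    for path in rollbacks:
        if freed >= bytes_to_free:
            break
        selected.append(path)
        freed += path.stat().st_size
    return selected, freed


# Example with a hypothetical job folder: list what would free roughly 200GB.
# files, freed = pick_rollbacks_to_remove(r"R:\Backups\JobName", 200 * 1024**3)
# for f in files:
#     print(f.name)
```

Review the list by hand before removing anything, then carry on with the re-scan and forget the unavailable restore points in the Veeam console.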

Veeam help link?