Sunday, November 22, 2015

Deploy Windows Offloaded Data Transfers

Applies To: Windows Server 2012 R2, Windows Server 2012

Prerequisites

To use ODX, your environment must meet the following hardware and software requirements.

Hardware requirements

To use ODX, your storage arrays must meet the following requirements:
  • Be certified as compatible with Windows Offloaded Data Transfer (ODX) on Windows Server 2012
  • Support cross-storage array ODX. To support ODX between storage arrays, the copy manager for the storage arrays must support cross-storage array ODX, and the storage arrays must be from the same vendor
  • Be connected by using one of the following protocols:
    • iSCSI
    • Fibre Channel
    • Fibre Channel over Ethernet
    • Serial Attached SCSI (SAS)
  • Use one of the following configurations:
    • One server with one storage array
    • One server with two storage arrays
    • Two servers with one storage array
    • Two servers with two storage arrays

Software requirements

To use ODX, your environment must support the following:
  • The computer initiating the data transfer must be running Windows Server 2012 R2, Windows Server 2012, Windows 8.1, or Windows 8.
  • File system filter drivers such as antivirus and encryption programs need to opt-in to ODX. ODX is not supported by the following file system filter drivers:
    • Data Deduplication
    • BitLocker Drive Encryption
  • Files must be on an unencrypted basic partition. Storage Spaces and dynamic volumes are not supported.
  • Files must be on a volume that is formatted by using NTFS. ReFS and FAT are not supported. Files can be directly transferred to or from this volume, or from one of the following containers:
    • A virtual hard disk (VHD) that uses the .vhd or .vhdx format
    • A file share that uses the SMB protocol
  • The files must be 256 KB or larger. Smaller files are transferred by using a traditional (non-ODX) file transfer operation.
  • The application that performs the data transfer must be written to support ODX. The following currently support ODX:
    • Hyper-V management operations that transfer large amounts of data at one time, such as creating a fixed size VHD, merging a snapshot, or converting VHDs.
    • File Explorer
    • Copy commands in Windows PowerShell
    • Copy commands in the Windows command prompt (including Robocopy)
  • Files should not be highly fragmented. Transfers of highly fragmented files will have reduced performance.

Hyper-V Requirements

To use ODX with virtual machines on a server running Hyper-V, the virtual machines need to access storage from an ODX-capable storage array. You can achieve this by using any of the following approaches.
  • Store the VHD on an ODX-capable iSCSI LUN.
  • Assign ODX-capable iSCSI LUNs to the virtual machine's iSCSI initiator.
  • Assign ODX-capable Fibre Channel LUNs to the virtual machine's virtual Fibre Channel adapter.
  • Connect the host or virtual machine to an SMB file share on another computer that is hosted on an ODX-capable storage array.
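As a simple illustration of the first approach above, the following hedged sketch attaches a VHDX that resides on a volume backed by an ODX-capable iSCSI LUN to a virtual machine. The VM name, controller type, and path are placeholders for this example, not values from this topic.
    # Attach a VHDX stored on a volume backed by an ODX-capable iSCSI LUN
    # (E:\ is assumed to be the mount point of that LUN on the Hyper-V host).
    Add-VMHardDiskDrive -VMName "VM01" -ControllerType SCSI -Path "E:\VMs\Data01.vhdx"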

Step 1: Gather storage array information

Before deploying ODX, gather the following information about the copy manager (operating system) of the storage array:
  • What is the name and version of the copy manager?
  • Does the copy manager support ODX?
  • Does the copy manager support an ODX operation across multiple storage arrays from the same vendor?
  • What is the default inactive timer value? This specifies how long the copy manager keeps an idle token before invalidating it.
  • What is the maximum token capacity of the copy manager?
  • What is the optimal transfer size? This tells Windows how to send Read and Write commands that are optimally sized for the storage array.

Step 2: Validate file system filter drivers

To use ODX, validate that all the file system filter drivers on every server that hosts the storage support ODX.
To validate the opt-in status of file system filter drivers, use the following procedure:

To validate file system filter driver opt-in status

  1. On each server on which you want to use ODX, list all of the file system filter drivers that are attached to the volume on which you want to enable ODX. To do so, open a Windows PowerShell session as an administrator, and then type the following command, where <driveletter> is the drive letter of the volume:
    Fltmc instances -v <driveletter>
    
  2. For each filter driver listed, query the registry to determine whether the filter driver has opted in to ODX support. To do so, type the following command for each filter previously listed, replacing <FilterName> with the name of the filter:
    Get-ItemProperty hklm:\system\currentcontrolset\services\<FilterName> -Name "SupportedFeatures"
    
  3. If the SupportedFeatures registry value equals 3, the filter driver supports ODX. If the value is not 3, contact the file system filter driver vendor for an ODX-compatible version.
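The following PowerShell sketch combines steps 2 and 3 for convenience. It assumes you have already collected the filter names from the Fltmc output; the names in the $filters array are placeholders and should be replaced with the filters listed in your environment.
    # Check the ODX opt-in status of each filter driver that Fltmc listed.
    # Replace the example names below with the filters attached to your volume.
    $filters = @("FileInfo", "ExampleAvFilter")   # placeholder names

    foreach ($filter in $filters) {
        $key = "HKLM:\SYSTEM\CurrentControlSet\Services\$filter"
        $features = (Get-ItemProperty -Path $key -Name SupportedFeatures -ErrorAction SilentlyContinue).SupportedFeatures
        if ($features -eq 3) {
            "$filter supports ODX (SupportedFeatures = $features)"
        } else {
            "$filter does not appear to support ODX (SupportedFeatures = $features)"
        }
    }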

Step 3: Establish a performance baseline

To establish a performance baseline, use the following procedures to disable ODX on the server and create a System Performance Report during a representative data transfer.

Disable ODX

To establish a baseline of non-offloaded data transfer performance, first disable ODX on the server by following these steps:

To disable ODX on the server

  1. Open a Windows PowerShell session as an administrator.
  2. Check whether ODX is currently enabled (it is by default) by verifying that the FilterSupportedFeaturesMode value in the registry equals 0. To do so, type the following command:
    Get-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode"
    
  3. Disable ODX support. To do so, type the following command:
    Set-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode" -Value 1
    

Create a System Performance Report during a data transfer

To record the baseline performance of data transfers, use Performance Monitor to record system performance during a representative data transfer. To do so, follow these steps:

To create a System Performance Report

  1. In Server Manager, on the Tools menu, click Performance Monitor.
  2. Initiate a large data transfer that is representative of the workload you want to accelerate and that is within or between the storage arrays that support ODX.
  3. Start the System Performance data collector set. To do so, expand Data Collector Sets, expand System, right-click System Performance, and then click Start. Performance Monitor will collect data for 60 seconds.
  4. Expand Reports, expand System, expand System Performance, and then click the most recent report.
  5. Review the System Performance Report, and take note of the following counters:
    • CPU Utilization (in the Resource Overview section)
    • Network Utilization (in the Resource Overview section)
    • Disk Bytes/sec (in the Disk section, under Physical Disk)
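If you prefer to capture the same signals from the command line, the following sketch samples roughly equivalent counters once per second for 60 seconds while the representative transfer runs. The output path is only an example; the resulting .blg file can be opened in Performance Monitor and compared with later runs.
    # Sample CPU, network, and disk throughput for 60 seconds during the transfer
    # and save the data to a .blg file (example path).
    Get-Counter -Counter "\Processor(_Total)\% Processor Time",
                         "\Network Interface(*)\Bytes Total/sec",
                         "\PhysicalDisk(_Total)\Disk Bytes/sec" `
        -SampleInterval 1 -MaxSamples 60 |
        Export-Counter -Path "$env:TEMP\transfer-baseline.blg" -FileFormat BLG -Force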

Step 4: Test ODX performance

After you establish a baseline of system performance during traditional data transfers, use the following procedures to enable ODX on the server and test offloaded data transfers:

Enable ODX

To enable ODX on the server, follow these steps:

To enable ODX

  1. Open a Windows PowerShell session as an administrator.
  2. Type the following command:
    Set-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode" -Value 0
    

Verify ODX performance

After ODX is enabled, create a System Performance Report during a large offloaded data transfer (see the Create a System Performance Report during a data transfer section earlier in this topic for the procedure).
When you evaluate the performance of offloaded data transfers, you should see the following differences from the baseline that you created when ODX was disabled:
  • CPU utilization should be much lower (only slightly higher than prior to the data transfer). This shows that the server did not need to manage the data transfer.
  • Network utilization should be much lower (only slightly higher than prior to the data transfer). This shows that the data transfer bypassed the server.
  • Disk Bytes/sec should be much higher. This reflects increased performance from direct transfers within an array or within the SAN.
After you verify ODX performance, periodically create another System Performance Report during offloaded data transfers to confirm that ODX is still operating as expected. If any performance degradation is detected, contact Microsoft Customer Support and the storage array vendor.
Tip
You can use the following command in a Windows PowerShell session to display a list of storage subsystems that support ODX and use a storage management provider. This command does not display storage subsystems that use the Storage Management Initiative Specification (SMI-S) protocol.
Get-OffloadDataTransferSetting | Get-StorageSubSystem

Appendix: Deployment checklist

Use the following checklist to confirm that you completed all the steps for the deployment.
Deploying Windows Offloaded Data Transfers Checklist
Check the Windows Offloaded Data Transfers prerequisites.
Gather storage array information.
Validate the file system filter drivers.
Establish a performance baseline.
Test the ODX performance.

SQL : The transaction manager has disabled its support for remote/network transactions

Solution:

Make sure that the "Distributed Transaction Coordinator" service is running on both the database server and the client. Also make sure that "Network DTC Access", "Allow Remote Clients", "Allow Inbound/Outbound", and "Enable TIP" are checked.
To enable Network DTC Access for MS DTC transactions
  1. Open the Component Services snap-in.
    To open Component Services, click Start. In the search box, type dcomcnfg, and then press ENTER.
  2. Expand the console tree to locate the DTC (for example, Local DTC) for which you want to enable Network MS DTC Access.
  3. On the Action menu, click Properties.
  4. Click the Security tab and make the following changes: In Security Settings, select the Network DTC Access check box.
    In Transaction Manager Communication, select the Allow Inbound and Allow Outbound check boxes.
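On Windows Server 2012 and later, the MsDtc PowerShell module can apply equivalent settings without the Component Services UI. The following is only a sketch; the DTC name "Local", the authentication level, and the explicit service restart are assumptions you should adjust to your environment.
    # Enable network DTC access with inbound and outbound transactions on the local DTC.
    Import-Module MsDtc
    Set-DtcNetworkSetting -DtcName "Local" -AuthenticationLevel "Mutual" `
        -RemoteClientAccessEnabled $true -InboundTransactionsEnabled $true -OutboundTransactionsEnabled $true
    # Restart the MSDTC service so that the new settings take effect.
    Restart-Service MSDTC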

Installing and Configuring MPIO

This section explains how to install and configure Microsoft Multipath I/O (MPIO) on Windows Server 2008 R2.

Install MPIO on Windows Server 2008 R2

MPIO is an optional feature in Windows Server 2008 R2, and is not installed by default. To install MPIO on your server running Windows Server 2008 R2, perform the following steps.

To add MPIO on a server running Windows Server 2008 R2

  1. Open Server Manager. To open Server Manager, click Start, point to Administrative Tools, and then click Server Manager.
  2. In the Server Manager tree, click Features.
  3. In the Features area, click Add Features.
  4. In the Add Features Wizard, on the Select Features page, select the Multipath I/O check box, and then click Next.
  5. On the Confirm Installation Selections page, click Install.
  6. When the installation has completed, on the Installation Results page, click Close. When prompted to restart the computer, click Yes.
  7. After restarting the computer, the computer finalizes the MPIO installation.
  8. Click Close.
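If you prefer PowerShell, the same feature can be added on Windows Server 2008 R2 from an elevated session by using the ServerManager module; a minimal sketch (a restart may still be required to complete the installation):
    # Install the Multipath I/O feature.
    Import-Module ServerManager
    Add-WindowsFeature Multipath-IO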

MPIO configuration and DSM installation

When MPIO is installed, the Microsoft device-specific module (DSM) is also installed, as well as an MPIO control panel. The control panel can be used to do the following:
  • Configure MPIO functionality
  • Install additional storage DSMs
  • Create MPIO configuration reports

Open the MPIO control panel

Open the MPIO control panel either by using the Windows Server 2008 R2 Control Panel or by using Administrative Tools.

To open the MPIO control panel by using the Windows Server 2008 R2 Control Panel

  1. On the Windows Server 2008 R2 desktop, click Start, click Control Panel, and then in the Views list, click Large Icons.
  2. Click MPIO.
  3. On the User Account Control page, click Continue.

To open the MPIO control panel by using Administrative Tools

  1. On the Windows Server 2008 R2 desktop, click Start, point to Administrative Tools, and then click MPIO.
  2. On the User Account Control page, click Continue.
The MPIO control panel opens to the Properties dialog box.
Note
To access the MPIO control panel on Server Core installations, open a command prompt and type MPIOCPL.EXE.

MPIO Properties dialog box

The MPIO Properties dialog box has four tabs:
  • MPIO Devices   By default, this tab is selected. This tab displays the hardware IDs of the devices that are managed by MPIO whenever they are present. It is based on a hardware ID (for example, a vendor plus product string) that matches an ID that is maintained by MPIO in the MPIOSupportedDeviceList, which every DSM specifies in its Information File (INF) at the time of installation.

    To specify another MPIO device, on the MPIO Devices tab, click Add.

    Note
    In the Add MPIO Support dialog box, the vendor ID (VID) and product ID (PID) that are needed are provided by the storage provider, and are specific to each type of hardware. You can list the VID and PID for storage that is already connected to the server by using the mpclaim tool at the command prompt. The hardware ID is an 8-character VID plus a 16-character PID. This combination is sometimes referred to as a VID/PID. For more information about the mpclaim tool, see Referencing MPCLAIM Examples.
  • Discover Multi-Paths   Use this tab to run an algorithm for every device instance that is present on the system and determine if multiple instances actually represent the same Logical Unit Number (LUN) through different paths. For such devices found, their hardware IDs are presented to the administrator for use with MPIO (which includes Microsoft DSM support). You can also use this tab to add Device IDs for Fibre Channel devices that use the Microsoft DSM.

    Note
    Devices that are connected by using Microsoft Internet SCSI (iSCSI) are not displayed on the Discover Multi-Paths tab.
  • DSM Install   This tab can be used for installing DSMs that are provided by the storage independent hardware vendor (IHV).

    Many storage arrays that are SPC-3 compliant will work by using the MPIO Microsoft DSM. Some storage array partners also provide their own DSMs to use with the MPIO architecture.

    Note
    We recommend using vendor installation software to install the vendor’s DSM. If the vendor does not have a DSM setup tool, you can alternatively install the vendor’s DSM by using the DSM Install tab on the MPIO control panel.
  • Configuration Snapshot   This tab allows you to save the current Microsoft Multipath I/O (MPIO) configuration to a text file that you can review for troubleshooting or comparison purposes at a later time.

    The report includes information on the device-specific module (DSM) that is being used, the number of paths, and the path state.

    You can also save this configuration at a command prompt by using the mpclaim command. For information about how to use mpclaim, at an elevated command prompt, type mpclaim /?. For more information, see Referencing MPCLAIM Examples.

Claim iSCSI-attached devices for use with MPIO

Note
This process causes the Microsoft DSM to claim all iSCSI-attached devices regardless of their vendor ID and product ID settings. For information about how to control this behavior on an individual VID/PID basis, see Referencing MPCLAIM Examples.

To claim an iSCSI-attached device for use with MPIO

  1. Open the MPIO control panel, and then click the Discover Multi-Paths tab.
  2. Select the Add support for iSCSI devices check box, and then click Add. When prompted to restart the computer, click Yes.
  3. When the computer restarts, the MPIO Devices tab lists the additional hardware ID “MSFT2005iSCSIBusType_0x9.” When this hardware ID is listed, all iSCSI bus attached devices will be claimed by the Microsoft DSM.
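If you are working without the control panel (for example, on a Server Core installation), the same claim can likely be made with the mpclaim tool. The following single command is a sketch based on the hardware ID shown above and on common mpclaim usage; -r requests an automatic restart, so verify the syntax with mpclaim /? before relying on it.
    mpclaim -r -i -d "MSFT2005iSCSIBusType_0x9"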

Configure the load-balancing policy setting for a Logical Unit Number (LUN)

MPIO LUN load balancing is integrated with Disk Management. To configure MPIO LUN load balancing, open the Disk Management graphical user interface.
Note
Before you can configure the load-balancing policy setting by using Disk Management, the device must first be claimed by MPIO. If you need to preselect a policy setting for disks that are not yet present, see Referencing MPCLAIM Examples.

To configure the load-balancing policy setting for a LUN

  1. Open Disk Management. To open Disk Management, on the Windows desktop, click Start; in the Start Search field, type diskmgmt.msc; and then, in the Programs list, click diskmgmt.
  2. Right-click the disk for which you want to change the policy setting, and then click Properties.
  3. On the MPIO tab, in the Select the MPIO policy list, click the load-balancing policy setting that you want.
  4. If desired, click Details to view additional information about the currently configured DSM.
    Note
    When using a DSM other than a Microsoft DSM, the DSM vendor may use a separate interface to manage these policies.

    Note
    For information about DSM timer counters, see Configuring MPIO Timers.

Configure the MPIO Failback policy setting

If you use the Failover Only load-balancing policy setting, MPIO failback allows the configuration of a preferred I/O path to the storage, and allows automatic failback to be the preferred path if desired.
Consider the following scenario:
  • The computer that is running Windows Server 2008 R2 is configured by using MPIO and has two connections to storage, Path A and Path B.
  • Path A is configured as the active/optimized path, and is set as the preferred path.
  • Path B is configured as a standby path.
If Path A fails, Path B is used. If Path A recovers thereafter, Path A becomes the active path again, and Path B is set to standby again.

To configure the preferred path setting

  1. Open Disk Management. To open Disk Management, on the Windows desktop, click Start; in the Start Search field, type diskmgmt.msc; and then, in the Programs list, click diskmgmt.
  2. Right-click the disk for which you want to change the policy setting, and then click Properties.
  3. On the MPIO tab, double-click the path that you want to designate as a preferred path.
    Note
    The setting only works with the Failover Only MPIO policy setting.

  4. Select the Preferred check box, and then click OK.

Install MPIO on a Server Core installation

Because the Server Core installation of Windows Server 2008 R2 does not include the Server Manager interface, you must install MPIO by using a command prompt. Until you enable MPIO by using the DISM tool, you cannot use the mpclaim command.
Open a command prompt to run the following commands. After typing a command, press ENTER.

 

  • Determine which features are currently installed:
    Dism /online /get-features
  • Enable MPIO:
    Dism /online /enable-feature:MultipathIo
  • Disable MPIO:
    Dism /online /disable-feature:MultipathIo
  • Manage the MPIO configuration after MPIO is enabled:
    mpclaim
  • Access the MPIO control panel (new in Windows Server 2008 R2):
    MPIOCPL.exe
Important
When using DISM to enable or disable features, the feature name is case-sensitive.

To enable Windows PowerShell™ on a Server Core installation, you must enable the following features by using these commands at an administrator command prompt:
dism /online /enable-feature:NetFx2-ServerCore
dism /online /enable-feature:NetFx2-ServerCore-WOW64
dism /online /enable-feature:MicrosoftWindowsPowerShell
dism /online /enable-feature:MicrosoftWindowsPowerShell-WOW64

Is Offloaded Data Transfers (ODX) working?

Offloaded Data Transfers (ODX) is a data transfer strategy that offloads file copy and move operations to the storage array instead of moving the data through the servers.  Only storage devices that comply with the SPC-4 and SBC-3 specifications will work with this feature.  With this feature, copying files from one server to another is much quicker.  This feature is only available when both the source and the destination are running Windows 8/Windows Server 2012 or later.
This is how it works at very high level:
  • A user copies or moves a file by using Windows Explorer, a command line interface, or as part of a virtual machine migration.
  • Windows Server 2012/2012 R2 automatically translates this transfer request into an ODX operation (if supported by the storage array) and receives a token that represents the data.
  • The token is copied between the source server and destination server.
  • The token is delivered to the storage array.
  • The storage array internally performs the copy or move and provides status information to the user.
Below is a pictorial representation of what this looks like.  The top box shows the traditional approach: if you copy a file from one machine to another, the entire file is copied over the network.  In the bottom box, you see that only the token is passed between the machines and the data is transferred on the storage.  This makes copying files tremendously faster, especially if these files are in the gigabytes.
[Figure: traditional copy over the network compared with an ODX token-based copy]
For more information on Offloaded Data Transfers, please refer to
Many Windows installations have additional filter drivers loaded on the Storage stack.  This could be antivirus, backup agents, encryption agents, etc.  So you will need to determine if the installed filter drivers support ODX.  As a quick note, if the filter driver supports ODX, but the storage does not (or vice versa), then ODX will not be used.
Filter Manager exposes a supported features (SprtFtrs) value that tells you whether a filter driver supports ODX.  We can use the FLTMC command, as shown below, to list filter drivers and their supported features.  For example:
X:\> fltmc instances
Filter      Volume Name    Altitude   Instance Name  Frame  SprtFtrs
----------  -------------  ---------  -------------  -----  --------
FileInfo    C:             45000      FileInfo       0      00000003
FileInfo    I:             45000      FileInfo       0      00000003
FileInfo    D:             45000      FileInfo       0      00000003  <- supports both offload read and write
FileInfo    K:             45000      FileInfo       0      00000003
FileInfo    \Device\Mup    45000      FileInfo       0      00000003

You can also see the Supported Features available for a filter driver in the registry:
HKLM\system\CurrentControlSet\services\<FilterName>
The SupportedFeatures registry value holds the same information. If it is 3, as in the FLTMC output above, the filter supports ODX.
Now that we have determined that ODX is supported by the required components, is it actually working?  When ODX is in use, you can see the FSCTL_OFFLOAD_READ and FSCTL_OFFLOAD_WRITE operations captured in a Process Monitor trace, as shown below.
[Figure: Process Monitor trace showing FSCTL_OFFLOAD_READ and FSCTL_OFFLOAD_WRITE operations]
If a target fails the offload because it does not support ODX or does not recognize the token, it can return STATUS_INVALID_TOKEN and/or STATUS_INVALID_DEVICE_REQUEST as the result.

Other reasons why it might not work:
1)    A filter above the storage stack, such as an encryption or other file system filter driver, can cause it to fail.
2)    Even though two disks/volumes might both support offload, they might be incompatible with each other. This has to be established by involving the storage vendors.
It is not a recommendation, but for informational purposes, you can disable ODX functionality in the registry if so desired.  You can do this  with a PowerShell command:
Set-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode" -Value 1
Or, you can edit the registry directly.  A value of 1 (false) means that ODX is disabled, while a value of 0 (true) means it is enabled.  After making this change, reboot the system for it to take effect.
One last thing to mention is that you should always keep current on hotfixes.  This is especially true if you are running Failover Clustering.  Below are the recommended hotfixes you should be running on Clusters, which include fixes for ODX.

Recommended hotfixes and updates for Windows Server 2012-based failover clusters
Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters

Saturday, November 21, 2015

Cluster Shared Volume (CSV) Inside Out

Components

Cluster Shared Volumes in Windows Server 2012 is a completely re-architected solution compared to the Cluster Shared Volumes you knew in Windows Server 2008 R2. Although it may look similar from the user experience – just a bunch of volumes mapped under C:\ClusterStorage\, accessed through the regular Windows file system interface – under the hood these are two completely different architectures. One of the main goals in Windows Server 2012 was to expand CSV beyond the Hyper-V workload, for example to Scale-Out File Server, and in Windows Server 2012 R2 CSV is also supported with SQL Server 2014.
First, let us look under the hood of CsvFs at the components that constitute the solution.

Figure 1 CSV Components and Data Flow Diagram
The diagram above shows a 3-node cluster. There is one shared disk that is visible to Node 1 and Node 2. Node 3 in this diagram has no direct connectivity to the storage. The disk was first clustered and then added to the Cluster Shared Volume. From the user’s perspective, everything will look the same as in Windows Server 2008 R2. On every cluster node you will find a mount point to the volume: C:\ClusterStorage\Volume1. The “VolumeX” naming can be changed; just use Windows Explorer and rename it like you would any other directory.  CSV will then take care of synchronizing the updated name around the cluster to ensure all nodes are consistent.  Now let’s look at the components that are backing these mount points.
Terminology
The node where NTFS for the clustered CSV disk is mounted is called the Coordinator Node. In this context, any other node that does not have the clustered disk mounted is called a Data Server (DS). Note that the coordinator node is always a data server node at the same time; in other words, the coordinator is simply the data server node on which NTFS is mounted.
If you have multiple disks in CSV, you can place them on different cluster nodes. The node that hosts a disk will be a Coordinator Node only for the volumes that are located on that disk. Since each node might be hosting a disk, each of them might be a Coordinator Node, but for different disks. So technically, to avoid ambiguity, we should always qualify “Coordinator Node” with the volume name. For instance we should say: “Node 2 is a Coordinator Node for the Volume1”. Most of the examples we will go through in this blog post for simplicity will have only one CSV disk in the cluster so we will drop the qualification part and will just say Coordinator Node to refer to the node that has this disk online.
Sometimes we will use terms “disk” and “volume” interchangeably because in the samples we will be going through one disk will have only one NTFS volume, which is the most common deployment configuration. In practice, you can create multiple volumes on a disk and CSV fully supports that as well. When you move a disk ownership from one cluster node to another, all the volumes will travel along with the disk and any given node will be the coordinator for all volumes on a given disk. Storage Spaces would be one exception from that model, but we will ignore that possibility for now.
This diagram is complicated, so let’s try to break it up into pieces, discuss each piece separately, and then hopefully the whole picture together will make more sense.
On Node 2, you can see the following stack that represents the mounted NTFS. The cluster guarantees that only one node has NTFS in a state where it can write to the disk; this is important because NTFS is not a clustered file system.  CSV provides a layer of orchestration that enables NTFS or ReFS (with Windows Server 2012 R2) to be accessed concurrently by multiple servers. The following blog post explains how the cluster leverages SCSI-3 Persistent Reservation commands with disks to implement that guarantee: http://blogs.msdn.com/b/clustering/archive/2009/03/02/9453288.aspx

Figure 2 CSV NTFS stack
Cluster makes this volume hidden so that Volume Manager (Volume in the diagram above) does not assign a volume GUID to this volume and there will be no drive letter assigned. You also would not see this volume using mountvol.exe or using FindFirstVolume() and FindNextVolume() WIN32 APIs.
On the NTFS stack the cluster will attach an instance of a file system mini-filter driver called CsvFlt.sys at the altitude 404800. You can see that filter attached to the NTFS volume used by CSV if you run the following command:
>fltmc.exe instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
CsvFlt                \Device\HarddiskVolume7                   404800     CsvFlt Instance

 Applications are not expected to access the NTFS stack, and we even go the extra mile to block access to this volume from user mode applications. CsvFlt will check all create requests coming from user mode against the security descriptor that is kept in the cluster public property SharedVolumeSecurityDescriptor. You can use the PowerShell cmdlet “Get-Cluster | fl SharedVolumeSecurityDescriptor” to get to that property. The output of this PowerShell cmdlet shows the value of the security descriptor in self-relative binary format (http://msdn.microsoft.com/en-us/library/windows/desktop/aa374807(v=vs.85).aspx):
PS D:\Windows\system32> Get-Cluster | fl SharedVolumeSecurityDescriptor
SharedVolumeSecurityDescriptor : {1, 0, 4, 128...}
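As a side note, if you want to read that descriptor in a friendlier form, the following sketch converts it to SDDL text. It assumes the property comes back as the byte array shown above; this is only an illustration, not part of the CSV tooling.
# Convert the self-relative binary security descriptor to readable SDDL.
[byte[]]$bytes = (Get-Cluster).SharedVolumeSecurityDescriptor
$sd = New-Object System.Security.AccessControl.RawSecurityDescriptor($bytes, 0)
$sd.GetSddlForm([System.Security.AccessControl.AccessControlSections]::All)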
 CsvFlt plays several roles:
  • Provides an extra level of protection for the hidden NTFS volume used for CSV
  • Helps provide a local volume experience (after all, CsvFs does look like a local volume). For instance, you cannot open the volume over SMB or read the USN journal. To enable these kinds of scenarios, CsvFs often marshals the operation that needs to be performed to CsvFlt, disguising it behind a tunneling file system control. CsvFlt is responsible for converting the tunneled information back to the original request before forwarding it down the stack to NTFS.
  • It implements several mechanisms to help coordinate certain states across multiple nodes. We will touch on them in the future posts. File Revision Number is one of them for example.
The next stack we will look at is the system volume stack. On the diagram above you see this stack only on the coordinator node which has NTFS mounted. In practice exactly the same stack exists on all nodes.
 
Figure 3 System Volume Stack
The CSV Namespace Filter (CsvNsFlt.sys) is a file system mini-filter driver at an altitude of 404900:
D:\Windows\system32>fltmc instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
CsvNSFlt              C:                                        404900     CsvNSFlt Instance

CsvNsFlt plays the following roles:
  • It protects C:\ClusterStorage by blocking unauthorized attempts that are not coming from the cluster service to delete or create any files or subfolders in this folder, or to change any attributes on the files. Other than opening these folders, about the only operation that is not blocked is renaming them. You can use the command prompt or Explorer to rename C:\ClusterStorage\Volume1 to something like C:\ClusterStorage\Accounting.  The directory name will be synchronized and updated on all nodes in the cluster.
  • It helps us to dispatch the block level redirected IO. We will cover this in more details when we talk about the block level redirected IO later on in this post.
The last stack we will look at is the stack of the CSV file system. Here you will see two modules: CSV Volume Manager (csvvbus.sys) and CSV File System (CsvFs.sys). CsvFs is a file system driver, and it mounts exclusively to the volumes surfaced by CsvVbus.
 
Figure 5 CsvFs stack

Data Flow

Now that we are familiar with the components and how they are related to each other, let’s look at the data flow.
First let’s look at how metadata flows. Below you can see the same diagram as in Figure 1; I’ve kept only the arrows and blocks that are relevant to the metadata flow and removed the rest from the diagram.
 
Figure 6 Metadata Flow
Our definition of a metadata operation is everything except read and write. Examples of metadata operations would be create file, close file, rename file, change file attributes, delete file, change file size, any file system control, and so on. Some writes may also, as a side effect, cause a metadata change. For instance, an extending write will cause CsvFs to extend all or some of the following: file allocation size, file size, and valid data length. A read might cause CsvFs to query some information from NTFS.
On the diagram above you can see that metadata from any node goes to the NTFS stack on Node 2. Data server nodes (Node 1 and Node 3) are using Server Message Block (SMB) as a protocol to forward metadata over.
Metadata are always forwarded to NTFS. On the coordinator node CsvFs will forward metadata IO directly to the NTFS volume while other nodes will use SMB to forward the metadata over the network.
Next, let’s look at the data flow for Direct IO. The following diagram is produced from the diagram in Figure 1 by removing any blocks and lines that are not relevant to Direct IO. By definition, Direct IO refers to the reads and writes that never go over the network, but go from CsvFs through CsvVbus straight to the disk stack. To make sure there is no ambiguity, I’ll repeat it again: Direct IO bypasses the volume stack and goes directly to the disk.
 Figure 7 Direct IO Flow
Both Node 1 and Node 2 can see the shared disk – they can send reads and writes directly to the disk, completely avoiding sending data over the network. Node 3 is not in the diagram in Figure 7 since it cannot perform Direct IO, but it is still part of the cluster and it will use block level redirected IO for reads and writes.
The next diagram shows how File System Redirected IO requests flow. The diagram and data flow for the redirected IO are very similar to those for the metadata in Figure 6:

Figure 8 File System Redirected IO Flow
Later we will discuss when CsvFs uses the file system redirected IO to handle reads and writes and how it compares to what we see on the next diagram – Block Level Redirected IO:

Figure 9 Block Level Redirected IO Flow
Note that on this diagram I have completely removed CsvFs stack and CSV NTFS stack from the Coordinator Node leaving only the system volume NTFS stack. The CSV NTFS stack is removed because Block Level Redirected IO completely bypasses it and goes to the disk (yes, like Direct IO it bypasses the volume stack and goes straight to the disk) below the NTFS stack. The CsvFs stack is removed because on the coordinating node CsvFs would never use Block Level Redirected IO, and would always talk to the disk. The reason why Node 3 would use Redirected IO, is because Node 3 does not have physical connectivity to the disk. A curious reader might wonder why Node 1 that can see the disk would ever use Block Level Redirected IO. There are at least two cases when this might be happening. Although the disk might be visible on the node it is possible that IO requests will fail because the adapter or storage network switch is misbehaving. In this case, CsvVbus will first attempt to send IO to the disk and on failure will forward the IO to the Coordinator Node using the Block Level Redirected IO. The other example is Storage Spaces - if the disk is a Mirrored Storage Space, then CsvFs will never use Direct IO on a data server node, but instead it will send the block level IO to the Coordinating Node using Block Level Redirected IO.  In Windows Server 2012 R2 you can use the Get-ClusterSharedVolumeState cmdlet http://technet.microsoft.com/en-us/library/dn456528.aspx to query the CSV state (direct / file level redirected / block level redirected) and if redirected it will state why.
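For example, a quick way to check this from PowerShell on Windows Server 2012 R2 is shown below; the node name is only a placeholder for one of your cluster nodes.
# Show, per CSV, whether IO is direct, file system redirected, or block level
# redirected on the specified node, and the reason if it is redirected.
Get-ClusterSharedVolumeState -Node "Node3" |
    Format-List Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason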
Note that CsvFs sends the Block Level Redirected IO to the CsvNsFlt filter attached to the system volume stack on the Coordinating Node. This filter dispatches this IO directly to the disk bypassing NTFS and volume stack so no other filters below the CsvNsFlt on the system volume will see that IO. Since CsvNsFlt sits at a very high altitude, in practice no one besides this filter will see these IO requests. This IO is also completely invisible to the CSV NTFS stack. You can think about Block Level Redirected IO as a Direct IO that CsvVbus is shipping to the Coordinating Node and then with the help of the CsvNsFlt it is dispatched directly to the disk as a Direct IO is dispatched directly to the disk by CsvVbus.

What are these SMB shares?

CSV uses the Server Message Block (SMB) protocol to communicate with the Coordinator Node. As you know, SMB3 requires certain configuration to work; for instance, it requires file shares. Let’s take a look at how the cluster configures SMB to enable CSV.
If you dump the list of SMB file shares on a cluster node with CSV volumes, you will see the following:
> Get-SmbShare

Name                ScopeName     Path                Description
----                ---------     ----                -----------
ADMIN$              *             C:\Windows          Remote Admin
C$                  *             C:\                 Default share
ClusterStorage$     CLUS030512    C:\ClusterStorage   Cluster Shared Volumes Def...
IPC$                *                                 Remote IPC
There is a hidden admin share that is created for CSV, shared as ClusterStorage$. This share is created by the cluster to facilitate remote administration. You should use it in the scenarios where you would normally use an admin share on any other volume (such as D$). This share is scoped to the Cluster Name. Cluster Name is a special kind of Network Name that is designed to be used to manage a cluster. You can learn more about Network Name in the following blog post http://blogs.msdn.com/b/clustering/archive/2009/07/17/9836756.aspx.  You can access this share using the Cluster Name: \\<ClusterName>\ClusterStorage$
Since this is an admin share, it is ACL’d so only members of the Administrators group have full access to this share. In the output the access control list is defined using Security Descriptor Definition Language (SDDL). You can learn more about SDDL here http://msdn.microsoft.com/en-us/library/windows/desktop/aa379567(v=vs.85).aspx
ShareState            : Online
ClusterType           : ScaleOut
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           : Cluster Shared Volumes Default Share
EncryptData           : False
Name                  : ClusterStorage$
Path                  : C:\ClusterStorage
Scoped                : True
ScopeName             : CLUS030512
SecurityDescriptor    : D:(A;;FA;;;BA)
There are also a couple of hidden shares that are used by CSV. You can see them if you add the IncludeHidden parameter to the Get-SmbShare cmdlet. These shares are used only on the Coordinator Node; other nodes either do not have these shares or do not use them:
> Get-SmbShare -IncludeHidden

Name                          ScopeName      Path                           Description
----                          ---------      ----                           -----------
17f81c5c-b533-43f0-a024-dc... *              \\?\GLOBALROOT\Device\Hard...
ADMIN$                        *              C:\Windows                     Remote Admin
C$                            *              C:\                            Default share
ClusterStorage$               VPCLUS030512   C:\ClusterStorage              Cluster Shared Volumes Def...
CSV$                          *              C:\ClusterStorage
IPC$                          *                                             Remote IPC
For each Cluster Shared Volume, the cluster creates a share on the coordinating node with a name that looks like a GUID. This share is used by CsvFs to communicate with the hidden CSV NTFS stack on the coordinating node. The share points to the hidden NTFS volume used by CSV. Metadata and the File System Redirected IO flow to the Coordinating Node using this share.
ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : 17f81c5c-b533-43f0-a024-dc431b8a7ee9-1048576$
Path                  : \\?\GLOBALROOT\Device\Harddisk2\ClusterPartition1\
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True
On the Coordinating Node you also will see a share with the name CSV$. This share is used to forward Block Level Redirected IO to the Coordinating Node. There is only one CSV$ share on every Coordinating Node:
ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : CSV$
Path                  : C:\ClusterStorage
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True
Users are not expected to use these shares - they are ACL’d so only Local System and Failover Cluster Identity user (CLIUSR) have access to the share.
All of these shares are temporary – information about these shares is not stored persistently, and when a node reboots they will be removed from the Server service. The cluster takes care of creating the shares every time during CSV startup.