Zettabyte File System - ZFS
1. INTRODUCTION
Anyone
who has ever lost important files, run out of space on a partition, spent
weekends adding new storage to servers, tried to grow or shrink a file system,
or experienced data corruption knows that there is room for improvement in file
systems and volume managers. Solaris ZFS is designed from the ground up to meet
the emerging needs of a general purpose local file system that spans the
desktop to the data center. Solaris ZFS offers a dramatic advance in data
management with an innovative approach to data integrity, near zero
administration, and a welcome integration of file system and volume management
capabilities. The centerpiece of this new architecture is the concept of a
virtual storage pool which decouples the file system from physical storage in
the same way that virtual memory abstracts the address space from physical
memory, allowing for much more efficient use of storage devices. In Solaris
ZFS, space is shared dynamically between multiple file systems from a single
storage pool, and is parceled out of the pool as file systems request it.
Physical storage can be added to or removed from storage pools dynamically,
without interrupting services, providing new levels of flexibility, availability,
and performance. In terms of scalability, Solaris ZFS is a 128-bit file
system. Its theoretical limits are truly mind-boggling: 2^128 bytes of storage,
and 2^64 for everything else, such as file systems, snapshots, directory entries,
devices, and more. ZFS also implements RAID-Z, an improvement on RAID-5 that
uses parity, striping, and atomic operations to ensure reconstruction of
corrupted data. It is ideally suited for managing industry-standard storage
servers like the Sun Fire X4500.
2. FEATURES
ZFS is more than just a file system. In addition to the
traditional role of data storage, ZFS also includes advanced volume management
that provides pooled storage through a collection of one or more devices. These
pooled storage areas may be used for ZFS file systems or exported through a ZFS
Emulated Volume (ZVOL) device to support traditional file systems such as UFS.
ZFS uses the pooled storage concept, which completely
eliminates the antiquated notion of volumes. According to Sun, this feature does
for storage what virtual memory did for the memory subsystem. In ZFS everything
is transactional, which keeps the data consistent on disk at all times,
removes almost all constraints on I/O order, and allows for huge
performance gains. The main features of ZFS are given in this chapter.
2.1 STORAGE POOLS
Unlike traditional file systems, which reside on single devices and thus
require a volume manager to use more than one device, ZFS file systems are
built on top of virtual storage pools called zpools. A zpool is
constructed of virtual devices (vdevs), which are themselves constructed
of block devices: files, hard drive partitions, or entire drives, with the last
being the recommended usage. Block devices within a vdev may be configured in
different ways, depending on needs and space available: non-redundantly
(similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z
group of three or more devices, or as a RAID-Z2 group of four or more devices.
Besides standard storage, devices can be designated as volatile read cache
(ARC), nonvolatile write cache, or as a
spare disk for use only in the case of a failure. Finally,
when mirroring, block devices can be grouped according to physical chassis, so
that the file system can continue in the face of the failure of an entire
chassis.
Storage pool composition is not limited to similar devices but can consist of
ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools
together, subsequently doling out space to diverse file systems as needed.
Arbitrary storage device types can be added to existing pools to expand their
size at any time. If high-speed solid-state drives (SSDs) are included in a
pool, ZFS will transparently utilize the SSDs as cache within the pool,
directing frequently used data to the fast SSDs and less-frequently used data
to slower, less expensive mechanical disks. The storage capacity of all vdevs is
available to all of the file system instances in the zpool. A quota can be set
to limit the amount of space a file system instance can occupy, and a
reservation can be set to guarantee that space will be available to a file
system instance.
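To make the quota and reservation bookkeeping concrete, the following Python sketch models pooled allocation as described above (the class and method names are invented for illustration and do not correspond to ZFS internals):

class Pool:
    def __init__(self, capacity):
        self.capacity = capacity           # total bytes across all vdevs
        self.allocated = 0                 # bytes handed out so far
        self.reserved = 0                  # unused reservations set aside

class Filesystem:
    def __init__(self, pool, quota=None):
        self.pool = pool
        self.quota = quota                 # optional upper bound on usage
        self.reservation = 0               # space guaranteed to this file system
        self.used = 0

    def reserve(self, amount):
        # Guarantee 'amount' bytes by setting them aside in the pool now.
        free = self.pool.capacity - self.pool.allocated - self.pool.reserved
        if amount > free:
            raise RuntimeError("not enough free space to reserve")
        self.reservation += amount
        self.pool.reserved += amount

    def allocate(self, amount):
        if self.quota is not None and self.used + amount > self.quota:
            raise RuntimeError("quota exceeded")
        # Draw on this file system's own reservation first, then on free pool space.
        from_reservation = min(amount, self.reservation)
        from_pool = amount - from_reservation
        free = self.pool.capacity - self.pool.allocated - self.pool.reserved
        if from_pool > free:
            raise RuntimeError("pool out of space")
        self.reservation -= from_reservation
        self.pool.reserved -= from_reservation
        self.pool.allocated += amount
        self.used += amount

pool = Pool(capacity=100 * 2**30)           # a 100 GiB pool
home = Filesystem(pool, quota=40 * 2**30)   # may occupy at most 40 GiB
home.reserve(10 * 2**30)                    # and is guaranteed 10 GiB
home.allocate(5 * 2**30)                    # space is parceled out on demand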
This arrangement of the pool eliminates bottlenecks and increases the speed of
reads and writes: Solaris ZFS stripes data across all available storage devices,
balancing I/O and maximizing throughput. As disks are added to the storage pool,
Solaris ZFS immediately begins to allocate blocks from those devices, increasing
effective bandwidth as each device is added. This means system administrators no
longer need to monitor storage devices to see if they are causing I/O bottlenecks.
2.2 COPY-ON-WRITE TRANSACTIONAL MODEL
Blocks containing active data are never overwritten in place; instead, a
new block is allocated, modified data is written to it, and then any metadata
blocks referencing it are similarly read, reallocated, and written. To reduce
the overhead of this process, multiple updates are grouped into transaction
groups, and an intent log is used when synchronous write semantics are
required.
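As a rough illustration of this copy-on-write behavior, the Python sketch below (illustrative only, not ZFS code) shows an update allocating new blocks and rewriting every ancestor up to a new root while the old tree remains intact:

class Block:
    """An immutable block: once written it is never modified in place."""
    def __init__(self, data=None, children=None):
        self.data = data
        self.children = children or []

def cow_update(root, path, new_data):
    """Return a new root reflecting the update; 'root' is left untouched.
    'path' is a list of child indexes leading to the leaf being replaced."""
    if not path:
        return Block(data=new_data)                  # freshly allocated leaf
    index, rest = path[0], path[1:]
    new_children = list(root.children)               # copy the pointer list
    new_children[index] = cow_update(root.children[index], rest, new_data)
    return Block(data=root.data, children=new_children)  # new interior block

old_root = Block(children=[Block(data=b"A"), Block(data=b"B")])
new_root = cow_update(old_root, [1], b"B2")
# old_root still describes the pre-update state; keeping it around is,
# in effect, a snapshot.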
2.3 SNAPSHOTS AND CLONES
An advantage of copy-on-write is that when ZFS writes new data, the blocks
containing the old data can be retained, allowing a snapshot version of the
file system to
be maintained. ZFS snapshots are created very quickly, since
all the data composing the snapshot is already stored; they are also space
efficient, since any unchanged data is shared among the file system and its
snapshots.
Writeable snapshots ("clones") can also be created, resulting in two
independent file systems that share a set of blocks. As changes are made to any
of the clone file systems, new data blocks are created to reflect those
changes, but any unchanged blocks continue to be shared, no matter how many
clones exist.
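Continuing the sketch above, a clone is simply another copy-on-write update rooted at a retained snapshot; blocks are duplicated only when one side actually changes them:

clone_root = cow_update(old_root, [0], b"A-edited-in-clone")
# clone_root still shares the unchanged leaf holding b"B" with the snapshot
# (old_root), just as new_root shares the leaf holding b"A" with it.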
2.4 END-TO-END CHECKSUMMING
The job of any file system boils down to this: when asked to read a block, it
should return the same data that was previously written to that block. If it
can't do that -- because the disk is offline or the data has been damaged or
tampered with -- it should detect this and return an error. Incredibly, most file
systems fail this test. They depend on the underlying hardware to detect and
report errors. If a disk simply returns bad data, the average file system won't
even detect it. Even if we could assume that all disks were perfect, the data
would still be vulnerable to damage in transit: controller bugs, DMA parity
errors, and so on. All we'd really know is that the data was intact when it
left the platter. If you think of your data as a package, this would be like UPS
saying, "We guarantee that your package wasn't damaged when we picked it
up." Not quite the guarantee you were looking for. In-flight damage is not
a mere academic concern: even something as mundane as a bad power supply can
cause silent data corruption. Arbitrarily expensive storage arrays can't solve
the problem. The I/O path remains just as vulnerable, but becomes even longer:
after leaving the platter, the data has to survive whatever hardware and
firmware bugs the array has to offer.
One option is to store a checksum with every disk block. Most modern disk
drives can be formatted with sectors that are slightly larger than the usual
512 bytes -- typically 520 or 528. These extra bytes can be used to hold a
block checksum. But making good use of this checksum is harder than it sounds:
the effectiveness of a checksum depends tremendously on where it's
stored and when it's evaluated.
In many storage arrays, the data is compared to its checksum inside the
array. Unfortunately this doesn't help much. It doesn't detect common
firmware bugs such as phantom writes (the previous write never made it to disk)
because the data and checksum are stored as a unit -- so they're
self-consistent even when the disk returns stale data. And
the rest of the I/O path from the array to the host remains
unprotected. In short, this type of block checksum provides a good way to
ensure that an array product is not any less reliable than the disks it
contains, but that's about all.
To avoid accidental data corruption, ZFS provides memory-based end-to-end
checksumming. Most checksumming file systems only protect against bit rot, as they
use self-consistent blocks where the checksum is stored with the block itself. In
this case, no external checking is done to verify validity. This style of
checksumming will not prevent issues such as phantom writes, where the write is
dropped; misdirected read or write operations, where the disk accesses the wrong
block; DMA parity errors between the array and server memory (or from the device
driver); driver errors, where the data is stored in the wrong buffer (in the
kernel); or accidental overwrites, such as swapping to a live file system.
With ZFS, the checksum is not stored in the block but next to the pointer to
the block (all the way up to the uberblock). Only the uberblock contains a
self-validating SHA-256 checksum. All block checksums are done in memory, hence
any error that may occur up the tree is caught. Not only is ZFS capable of
identifying these problems, but in a mirrored or RAID-Z configuration, the data
is self-healing.
2.4.1 ZFS Data Authentication
End-to-end data integrity requires that each data block be
verified against an independent checksum, after the data has arrived in the
host's memory. It's not enough to know that each block is merely consistent
with itself, or that it was correct at some earlier point in the I/O path. Our
goal is to detect every possible form of damage, including human mistakes like
swapping on a filesystem disk or mistyping the arguments to dd(1).
A ZFS storage pool is really just a tree of blocks. ZFS provides fault
isolation between data and checksum by storing the checksum of each block in
its parent block pointer -- not in the block itself. Every block in the tree
contains the checksums for all its children, so the entire pool is
self-validating. The uberblock (the root of the tree) is a special case
because it has no parent; as noted above, it carries a self-validating
SHA-256 checksum.
When the data and checksum disagree, ZFS knows that the checksum can be trusted
because the checksum itself is part of some other block that's one level higher
in the tree, and that block has already been validated.
ZFS uses its end-to-end checksums to detect and
correct silent data corruption. If a disk returns bad data transiently, ZFS
will detect it and retry the read. If the disk is part of a mirror or RAID-Z
group, ZFS will both detect and correct the error: it will use the checksum to
determine which copy is correct, provide good data to the application, and
repair the damaged copy.
Note that ZFS end-to-end data integrity doesn't require any special
hardware. You don't need pricey disks or arrays, you don't need to reformat drives
with 520-byte sectors, and you don't have to modify applications to benefit from
it. It's entirely automatic, and it works with cheap disks. The blocks of a ZFS
storage pool form a Merkle tree in which each block validates all of its
children. Merkle trees have been proven to provide cryptographically strong
authentication for any component of the tree, and for the tree as a whole. ZFS
employs 256-bit checksums for every block, and offers checksum functions
ranging from the simple-and-fast fletcher2 (the default) to the
slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the
uberblock checksum provides a constantly up-to-date digital signature for the
entire storage pool.
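The parent-pointer arrangement can be sketched in a few lines of Python, with hashlib's SHA-256 standing in for the configurable checksum (the classes are illustrative and do not reflect ZFS's on-disk format):

import hashlib

def sha256(data):
    return hashlib.sha256(data).digest()

class BlockPtr:
    """Pointer to a child block, carrying the child's checksum."""
    def __init__(self, block):
        self.block = block
        self.checksum = sha256(block.payload())

class Block:
    def __init__(self, data=b"", children=()):
        self.data = data
        self.ptrs = [BlockPtr(c) for c in children]

    def payload(self):
        # A block covers its own data plus its children's checksums, so
        # validating the root transitively validates the entire tree.
        return self.data + b"".join(p.checksum for p in self.ptrs)

def read_child(parent, i):
    child = parent.ptrs[i].block   # stands in for a disk read, which may return bad data
    if sha256(child.payload()) != parent.ptrs[i].checksum:
        raise IOError("checksum mismatch: stale, phantom, or misdirected block")
    return child

leaf = Block(data=b"hello")
root = Block(children=[leaf])
assert read_child(root, 0) is leaf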
2.4.2 Self-Healing for Mirrors
When ZFS detects a bad checksum, the data is "healed" from the mirrored copy.
This property is called self-healing. In fig 4(a), the application issues a
read; the ZFS mirror tries the first disk, and the checksum reveals that the
block is corrupt on that disk. In fig 4(b), ZFS tries the second disk, and the
checksum indicates that the block is good. In fig 4(c), ZFS returns good data
to the application and repairs the damaged block.
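A minimal sketch of this read path, assuming hypothetical disk objects with read() and write() methods (not a real API), might look as follows:

import hashlib

def read_mirrored_block(disks, offset, length, expected_checksum):
    good, bad = None, []
    for disk in disks:
        data = disk.read(offset, length)              # try each side of the mirror
        if hashlib.sha256(data).digest() == expected_checksum:
            good = data                               # matches the checksum from the parent pointer
        else:
            bad.append(disk)
    if good is None:
        raise IOError("all mirror copies failed checksum verification")
    for disk in bad:
        disk.write(offset, good)                      # heal the damaged copy
    return good                                       # return good data to the application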
2.5 RAID-Z
The original promise of RAID (Redundant Arrays of Inexpensive Disks) was that
it would provide fast, reliable storage using cheap disks. The key point was
cheap, yet the industry ended up with expensive storage arrays. Why? RAID-5 (and other data/parity schemes
such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite
delivered on the RAID promise -- and can't -- due to a fatal flaw known as the
RAID-5 write hole. Partial-stripe writes pose an additional problem for a
transactional file system like ZFS. A partial-stripe write necessarily modifies
live data, which violates one of the rules that ensures transactional
semantics. (It doesn't matter if you lose power during a full-stripe write, for
the same reason that it doesn't matter if you lose power during any other write
in ZFS: none of the blocks you're writing to are live yet.)
RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width.
Every block is its own RAID-Z stripe, regardless of block size. This means that
every RAID-Z write is a full-stripe write. This, when combined with the
copy-on-write transactional semantics of ZFS, completely eliminates the RAID
write hole. RAID-Z is also faster than traditional RAID because it never has to
do read-modify-write. The tricky bit here is RAID-Z reconstruction. Because the
stripes are all different sizes, there's no simple formula like "all the
disks XOR to zero." You have to traverse the file system metadata to
determine the RAID-Z geometry. Note that this would be impossible if the file
system and the RAID array were separate products, which is why there's nothing
like RAID-Z in the storage market today. You really need an integrated view of
the logical and physical structure of the data to pull it off.
Isn't it expensive to traverse all the metadata? Actually, it's a trade-off. If
your storage pool is very close to full, then yes, it's slower. But if it's not
too close to full, then metadata-driven reconstruction is actually faster
because it only copies live data; it doesn't waste time copying unallocated
disk space.
But far more important, going through the metadata means that ZFS can validate
every block against its 256-bit checksum as it goes. Traditional RAID products
can't do this; they simply XOR the data together blindly. Which brings us to
the coolest thing about RAID-Z: self-healing data. In addition to handling
whole-disk failure, RAID-Z can also detect and correct silent data corruption.
Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the
data disks didn't return the right answer, ZFS reads the parity and then does
combinatorial reconstruction to figure out which disk returned bad data. It
then repairs the damaged disk and returns good data to the application. ZFS
also reports the incident through Solaris FMA so that the system administrator
knows that one of the disks is silently failing.
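The combinatorial step can be sketched for a single-parity stripe as follows (Python, purely illustrative; RAID-Z itself uses variable-width stripes and RAID-Z2 adds a second parity device):

import hashlib
from functools import reduce

def xor(*chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def reconstruct(data_chunks, parity, expected_checksum):
    # Fast path: the stripe may already verify against the block checksum.
    if hashlib.sha256(b"".join(data_chunks)).digest() == expected_checksum:
        return data_chunks
    # Otherwise, assume each data disk in turn returned bad data, rebuild it
    # from parity, and keep the combination whose checksum matches.
    for i in range(len(data_chunks)):
        others = data_chunks[:i] + data_chunks[i + 1:]
        candidate = data_chunks[:i] + [xor(parity, *others)] + data_chunks[i + 1:]
        if hashlib.sha256(b"".join(candidate)).digest() == expected_checksum:
            return candidate                          # disk i was the silent offender
    raise IOError("unable to reconstruct the stripe from single parity")

stripe = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = xor(*stripe)
checksum = hashlib.sha256(b"".join(stripe)).digest()
stripe[1] = b"\xff\xff"                               # silent corruption on disk 1
assert reconstruct(stripe, parity, checksum)[1] == b"\x03\x04"   # recovered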
The challenge faced by RAID-Z thus revolves around the reconstruction process.
As the stripes are all of different sizes, an "all the disks XOR to zero"
approach (as with RAID-5) is not feasible. In a RAID-Z environment,
it is necessary to traverse the file system metadata to determine the RAID-Z
geometry. It has to be pointed out that this technique would not be feasible if
the file system and the actual RAID array were separate products. Traversing all
the metadata to determine the geometry may be slower than the traditional
approach, though (especially if the storage pool is close to capacity).
Nevertheless, traversing the metadata implies that ZFS can validate every block
against its 256-bit checksum (in memory). Traditional RAID products are not
capable of doing this; they simply XOR the data together. Based on this
approach, RAID-Z supports a self-healing data feature. In addition to
whole-disk failures, RAID-Z can also detect and correct silent data corruption.
Whenever a RAID-Z block is read, ZFS compares it against its checksum. If the
data disks do not provide the expected data, ZFS (1) reads the parity, and
(2) performs the necessary combinatorial reconstruction to determine which
disk returned the bad data. In a third step, ZFS repairs the damaged disk and
returns good data to the application.
2.5.1 The RAID-5 Write-Hole Problem
RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row
Diagonal Parity) never quite delivered on the RAID promise – and can’t – due to
a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID
stripe you must also update the parity, so that all disks XOR to zero; it's
that equation that allows you to reconstruct data when a disk fails. The problem
is that there's no way to update two or more disks atomically, so RAID stripes
can become damaged during a crash or power outage.
To see this, suppose you lose power after writing a data block but before
writing the corresponding parity block. Now the data and parity for that stripe
are inconsistent, and they'll remain inconsistent forever (unless you happen to
overwrite the old data with a full-stripe write at some point). Therefore, if a
disk fails, the RAID reconstruction process will generate garbage the next time
you read any block on that stripe. What's worse, it will do so silently; it has
no idea that it's giving you corrupt data.
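A few lines of arithmetic (illustrative Python) show the broken invariant:

d = [0b1010, 0b0110]            # two data blocks in one stripe
p = d[0] ^ d[1]                 # parity, so that d[0] ^ d[1] ^ p == 0
assert d[0] ^ d[1] ^ p == 0

d[0] = 0b1111                   # new data hits the disk...
# ...and power is lost before the matching parity write: invariant broken.
assert d[0] ^ d[1] ^ p != 0
# Reconstructing d[1] from the stale parity now yields garbage, silently:
print(bin(d[0] ^ p))            # 0b11, not the real d[1] (0b110)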
2.6 DYNAMIC STRIPING
Dynamic striping across all devices to maximize throughput means that as
additional devices are added to the zpool, the stripe width automatically
expands to include them; thus all disks in a pool are used, which balances the
write load across them.
2.7 VARIABLE BLOCK SIZES
ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available
code allows the administrator to tune the maximum block size used, as certain
workloads do not perform well with large blocks. Automatic tuning to match
workload characteristics is contemplated.
If data compression (LZJB) is enabled, variable block sizes are used: if a
block can be compressed to fit into a smaller block size, the smaller size is
used on disk to save storage and improve I/O throughput (though at the
cost of increased CPU use for the compression and decompression operations).
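A sketch of the "store compressed only when it is smaller" decision, with zlib standing in for LZJB (illustrative only):

import zlib

def encode_block(data):
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return True, compressed         # smaller on disk, less I/O
    return False, data                  # incompressible: store as-is

def decode_block(is_compressed, stored):
    return zlib.decompress(stored) if is_compressed else stored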
2.8 RESILVERING
Resilvering -- also known as resyncing, rebuilding, or
reconstructing -- is the process of repairing a damaged device using the
contents of healthy devices. This is what every volume manager or RAID array
must do when one of its disks dies, gets replaced, or suffers a transient
outage.
For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5
it's only slightly more complicated: instead of copying one disk to another,
all of the other disks in the RAID-5 stripe must be XORed together. But the
basic idea is the same. In a traditional storage system, resilvering happens
either in the volume manager or in RAID hardware. Either way, it happens well
below the file system. But this is ZFS, so of course we just had to be
different. In effect, ZFS does a 'cp -r' of the storage pool's block tree
from one disk to another. It sounds less efficient than a straight whole-disk
copy, and traversing a live pool safely is definitely tricky. But it turns out
that there are so many advantages to
metadata-driven resilvering that we've chosen to use it even for simple
mirrors.
The most compelling reason is data integrity. With a simple disk copy, there's
no way to know whether the source disk is returning good data. End-to-end data
integrity requires that each data block be verified against an independent
checksum -- it's not enough to know that each block is merely consistent with
itself, because that doesn't catch common hardware and firmware bugs like
misdirected reads and phantom writes.
By traversing the metadata, ZFS can use its end-to-end checksums to detect and
correct silent data corruption, just like it does during normal reads. If a
disk returns bad data transiently, ZFS will detect it and retry the read. If
it's a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will
use the checksum to determine which one is correct, copy the data to the new
disk, and repair the damaged disk.
A simple whole-disk copy would bypass all of this data protection. For this
reason alone, metadata-driven resilvering would be desirable even if it came at
a significant cost in performance. Fortunately, in most cases, it doesn't. In
fact, there are several advantages to metadata-driven resilvering:
2.8.1 Live blocks only.
ZFS doesn't waste time and I/O bandwidth copying free disk blocks, because they're
not part of the storage pool's block tree. If your pool is only 10-20% full,
that's a big win.
2.8.2 Transactional pruning.
If a disk suffers a transient outage, it's not necessary to resilver the entire
disk -- only the parts that have changed. ZFS uses the birth time of each block
to determine whether there's anything lower in the tree that needs resilvering.
This allows it to skip over huge branches of the tree and quickly discover the
data that has actually changed since the outage began.
What this means in practice is that if a disk has a
five-second outage, it will only take about five seconds to resilver it. And you
don't pay extra for it in either dollars or performance like you do with Veritas
change objects. Transactional pruning is an intrinsic architectural capability
of ZFS.
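Birth-time pruning can be sketched as follows (Python; the Node structure and txg numbering are invented for illustration). Because copy-on-write rewrites every ancestor of a modified block, a subtree whose root predates the outage cannot contain anything newer, so the whole branch can be skipped:

class Node:
    def __init__(self, birth_txg, children=(), data=None):
        self.birth_txg = birth_txg      # transaction group in which this block was born
        self.children = list(children)
        self.data = data

def resilver(node, outage_start_txg, copy):
    if node.birth_txg < outage_start_txg:
        return                          # nothing beneath this block has changed: skip it
    copy(node)                          # re-copy this block to the recovering disk
    for child in node.children:
        resilver(child, outage_start_txg, copy)

# Visiting a block before its children also gives the top-down order
# described in 2.8.3: no block is copied before its ancestors.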
2.8.3 Top-down resilvering.
A storage pool is a tree of blocks. The higher up the tree you go, the more
disastrous it is to lose a block there, because you lose access to everything
beneath it. Going through the metadata allows ZFS to do top-down resilvering.
That is, the very first thing ZFS resilvers is the uberblock and the disk
labels. Then it resilvers the pool-wide metadata; then each file system's
metadata; and so on down the tree. Throughout the process ZFS obeys this rule:
no block is resilvered until all of its ancestors have been resilvered.
It's hard to overstate how important this is. With a whole-disk copy, even when
it's 99% done there's a good chance that one of the top 100 blocks in the tree
hasn't been copied yet. This means that from an MTTR perspective, you haven't
actually made any progress: a second disk failure at this point would still be
catastrophic. With top-down resilvering, every single block copied increases
the amount of discoverable data. If there were a second disk failure, everything
that had been resilvered up to that point would be available.
2.8.4 Priority-based resilvering.
ZFS doesn't do this one yet, but it's in the pipeline. ZFS resilvering follows
the logical structure of the data, so it would be pretty easy to tag individual
file systems or files with a specific resilver priority. For example, on a file
server you might want to resilver calendars first (they're important yet very
small), then /var/mail, then home directories, and so on.
2.9 LIGHTWEIGHT FILE SYSTEM CREATION
In ZFS, file system manipulation within a storage pool is easier than volume
manipulation within a traditional file system; the time and effort required to
create or resize a ZFS file system is closer to that of making a new directory
than it is to volume manipulation in some other systems.
2.10 CACHE MANAGEMENT
ZFS also uses the ARC, a new method for cache management, instead of the
traditional Solaris virtual memory page cache.
2.11 ADAPTIVE ENDIANNESS
Pools and their associated ZFS file systems can be moved between different
platform architectures, including systems implementing different byte orders.
The ZFS block pointer format stores file system metadata in an endian-adaptive
way; individual metadata blocks are written with the native byte order of the system
writing the block. When reading, if the stored endianness doesn't match the
endianness of the system, the metadata is byte-swapped in memory. This does not
affect the stored data itself; as is usual in POSIX systems, files appear to
applications as simple arrays of bytes, so applications creating and reading
data remain responsible for doing so in a way independent of the underlying
system's endianness.
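The adaptive-endian idea can be sketched as follows (Python; the one-byte flag and eight-byte field are invented for illustration and are not the ZFS block pointer format):

import sys

NATIVE_BIG = (sys.byteorder == "big")

def write_metadata_field(value):
    # Written in the writer's native order, tagged with that order.
    flag = b"B" if NATIVE_BIG else b"L"
    return flag + value.to_bytes(8, "big" if NATIVE_BIG else "little")

def read_metadata_field(buf):
    # Interpreted using the stored order; a reader with the opposite
    # byte order effectively byte-swaps the field in memory.
    stored_big = (buf[0:1] == b"B")
    return int.from_bytes(buf[1:9], "big" if stored_big else "little")

assert read_metadata_field(write_metadata_field(123456789)) == 123456789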
2.12 SIMPLIFIED ADMINISTRATION
Most file system administration tasks are painful, slow operations that are
relatively uncommon. Because these types of tasks are so infrequently
performed, they are more prone to errors that can destroy a great amount of
data very quickly. Solaris ZFS helps alleviate this problem by automating both
common and less frequent administrative tasks. Solaris 10 allows users to
designate ZFS as the default file system and easily boot Solaris from a ZFS
root file system. Installing Solaris to a ZFS root file system is also
possible. Likewise, users can easily migrate UFS root file systems to ZFS root
file systems with the Live Upgrade feature. Administering storage is extremely
easy, because the design allows administrators to state the intent of their
storage policies rather than all of the details needed to implement them.
Creating a file system or performing other administrative activities is very
fast (less than one second), regardless of size. There is no need to configure
(or worse, reconfigure) underlying storage devices or volumes, as this is
handled automatically when they are added to a pool. Solaris ZFS also enables
administrators to guarantee a minimum capacity for file systems, or set quotas
to limit maximum sizes. Administrators can delegate fine-grained permissions to
perform ZFS administration tasks to non-privileged users, making it easy to
deploy ZFS quickly.
2.13 HIGH PERFORMANCE
This radical new architecture optimizes and simplifies code
paths from the application to the hardware, producing sustained throughput at
exceptional speeds. New block allocation algorithms accelerate write
operations, consolidating what would traditionally be many small random writes
into a single, more efficient sequential operation. Additionally, Solaris ZFS
implements intelligent pre-fetch, performing read ahead for sequential data
streaming, and can adapt its read behavior on the fly for more complex access
patterns. To eliminate bottlenecks and increase the speed of both reads and
writes, Solaris ZFS stripes data across all available storage devices,
balancing I/O and maximizing throughput. And, as disks are added to the storage
pool, Solaris ZFS immediately begins to allocate blocks from those devices,
increasing effective bandwidth as each device is added. This means system
administrators no longer need to monitor storage devices to see if they are
causing I/O bottlenecks.
2.14 ADDITIONAL CAPABILITIES
· Explicit I/O priority with deadline scheduling.
· Claimed globally optimal I/O sorting and aggregation.
· Multiple independent prefetch streams with automatic length and stride detection.
· Parallel, constant-time directory operations.
· End-to-end checksumming, using a kind of "Data Integrity Field", allowing data corruption detection (and recovery if there is redundancy in the pool).
· Transparent file system compression. Supports LZJB and gzip.
· Intelligent scrubbing and resilvering.
· Load and space usage sharing between disks in the pool.
· Ditto blocks: metadata is replicated inside the pool, two or three times (according to metadata importance). If the pool has several devices, ZFS tries to replicate over different devices. So a pool without redundancy can lose data to bad sectors, but its metadata should be fairly safe even in this scenario.
· The ZFS design (copy-on-write + superblocks) is safe when using disks with write cache enabled, if they support the cache flush commands issued by ZFS. This feature provides safety and a performance boost compared with some other file systems.
· When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know whether other slices are managed by file systems that are not write-cache safe, like UFS.
· File system encryption is supported, though it is currently in a beta stage.
3. SCALABILITY AND PERFORMANCE
While data security and integrity are paramount, a file system also has to perform
well. The ZFS designers either removed or greatly increased the limits imposed
by modern file systems by using a 128-bit architecture, and by making all
metadata dynamic. ZFS further supports data pipelining, dynamic block sizing,
intelligent prefetch, dynamic striping, and built-in compression to improve the
performance behavior.
3.1 THE 128-BIT ARCHITECTURE
Current trends in the industry reveal that disk drive capacity roughly
doubles every nine months. If this trend continues, file systems will require
64-bit addressability in about 10 to 15 years. Instead of stopping at 64 bits,
the ZFS designers implemented a 128-bit file system, which provides 2^64
(roughly 18 billion billion) times more capacity than current 64-bit file
systems. According to Jeff Bonwick (ZFS chief architect), "Populating
128-bit file systems would exceed the quantum limits of earth-based storage. You
couldn't fill a 128-bit storage pool without boiling the oceans."
3.2 DYNAMIC METADATA
In addition to being a 128-bit based solution, the ZFS metadata is 100 percent
dynamic. Hence, the creation of new storage pools and file systems is extremely
efficient. Only 1 to 2 percent of the write operations to disk are metadata
related, which results in large (initial) overhead savings. To illustrate,
there are no static inodes; therefore, the only restriction on the number of
inodes that can (theoretically) be used is the size of the storage pool.
4. CAPACITY LIMITS
ZFS is a 128-bit file system, so it can address 18 billion billion (1.84 × 10^19)
times more data than current 64-bit systems. The limitations of ZFS are
designed to be so large that they would never be encountered, given the known
limits of physics. Some theoretical limits in ZFS are:
· 2^64 — Number of snapshots of any file system
· 2^48 — Number of entries in any individual directory
· 16 EiB (2^64 bytes) — Maximum size of a file system
· 16 EiB — Maximum size of a single file
· 16 EiB — Maximum size of any attribute
· 256 ZiB (2^78 bytes) — Maximum size of any zpool
· 2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
· 2^64 — Number of devices in any zpool
· 2^64 — Number of zpools in a system
· 2^64 — Number of file systems in a zpool
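The unit conversions behind the figures above are simple powers of two, for example:

EiB = 2 ** 60                   # exbibyte
ZiB = 2 ** 70                   # zebibyte
print(2 ** 64 // EiB)           # 16   -> 2^64 bytes is 16 EiB (file and file system limit)
print(2 ** 78 // ZiB)           # 256  -> 2^78 bytes is 256 ZiB (zpool limit)
print(2 ** 128 // 2 ** 64)      # about 1.8e19: the 128-bit vs 64-bit address ratio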
5. PLATFORMS
ZFS is part of Sun's own Solaris operating system and is thus available on both
SPARC and x86-based systems. Since the code for ZFS is open source, a port to
other operating systems and platforms can be produced without Sun's
involvement.
OpenSolaris 2008.05 and 2009.06 use ZFS as their default file system. There
are a half-dozen third-party distributions. Nexenta OS, a complete GNU-based open
source operating system built on top of the OpenSolaris kernel and runtime,
includes a ZFS implementation, added in version alpha1. More recently, Nexenta
Systems announced NexentaStor, their ZFS storage appliance providing
NAS/SAN/iSCSI capabilities and based on Nexenta OS. NexentaStor includes a GUI
that simplifies the process of utilizing ZFS.
Pawel Jakub Dawidek has ported ZFS to FreeBSD. It is part of FreeBSD 7.x as an
experimental feature. Both the 7-stable and the current development branches
use ZFS version 13. Moreover, zfsboot has been implemented in both branches. As
a part of the 2007 Google Summer of Code a ZFS port was started for NetBSD.
An April 2006 post on the opensolaris.org zfs-discuss mailing list was the
first indication of Apple Inc.'s interest in ZFS; in it, an Apple employee was
mentioned as being interested in porting ZFS to the Mac OS X operating
system.
In the release version of Mac OS X 10.5, ZFS is available in read-only mode
from the command line, which lacks the possibility to create zpools or write to
them. Before the 10.5 release, Apple released the "ZFS Beta Seed
v1.1", which allowed read-write access and the creation of zpools; however
the installer for the "ZFS Beta Seed v1.1" has been reported to only
work on version 10.5.0, and has not been updated for version 10.5.1 and above.
In August 2007, Apple opened a ZFS project on their Mac OS Forge site. On that
site, Apple provides the source code and binaries of their port of ZFS which
includes read-write access, but does not provide an installer. An installer has
been made available by a third-party developer.
The current Mac OS Forge release of the Mac OS X ZFS project is version 119,
synchronized with the OpenSolaris ZFS SNV version 72. Complete ZFS support was
one of the advertised features of Apple's upcoming 10.6 version of Mac OS X
Server (Snow Leopard Server). However, all references to this feature have since
been silently removed; it is no longer listed on the Snow Leopard Server features
page.
Porting ZFS to Linux is complicated by the fact that the GNU General Public
License, which governs the Linux kernel, prohibits linking with code under
certain licenses, such as CDDL, the license ZFS is released under. One solution
to this problem is to port ZFS to Linux's FUSE system so the file system runs
in user space instead. A project to do this, ZFS on FUSE, was sponsored by
Google's Summer of Code program in 2006, and is in a bug-fix-only state as of
March 2009. Running a file system outside the kernel on traditional Unix-like
systems can have a significant performance impact.
However, NTFS-3G (another file system driver built on FUSE) performs well when
compared to other traditional file system drivers. This shows that reasonable
performance is possible with ZFS on Linux after proper optimization. Sun
Microsystems has stated that a Linux port is being investigated. It is also
possible to emulate Linux in a Solaris Zone and thus the underlying file system
would be ZFS (though ZFS commands would not be available inside the Linux
zone). It is also possible to run the GNU userland on top of an OpenSolaris
kernel, as done by Nexenta.
It would also be possible to reimplement ZFS under GPL as has
been done to support other file systems (e.g. HFS and FAT) in Linux. The Btrfs
project, which aims to implement a file system with a similar feature set to
ZFS, was merged into Linux kernel 2.6.29 in January 2009.
6. LIMITATIONS
Capacity expansion is normally achieved by adding groups of disks as a vdev
(stripe, RAID-Z, RAID-Z2, or mirrored). Newly written data will dynamically
start to use all available vdevs. It is also possible to expand the array by
iteratively swapping each drive in the array with a bigger drive and waiting
for ZFS to heal itself — the heal time will depend on the amount of stored
information, not the disk size. The new free space will not be available until
all the disks have been swapped.
It is currently not possible to reduce the number of vdevs in a pool or
otherwise reduce pool capacity. However, this functionality is under
development by the ZFS team.
It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev. This feature
appears very difficult to implement. It is, however, possible to create a new
RAID-Z vdev and add it to the zpool.
Vdev types cannot be mixed in a zpool. For example, if a striped ZFS pool
consists of disks on a SAN, local disks cannot be added as a mirrored vdev.
Reconfiguring storage requires copying data offline, destroying the pool, and
recreating the pool with the new policy.
ZFS is not a native cluster, distributed, or parallel file system and cannot
provide concurrent access from multiple hosts as ZFS is a local file system.
Sun's Lustre distributed file system will adapt ZFS as back-end storage for
both data and metadata in version 3.0, which is scheduled to be released in
2010.
7. CONCLUSION
ZFS is very simple, in the sense that it concisely expresses the user's intent.
It is very powerful, as it introduces pooled storage, snapshots, clones,
compression, scrubbing, and RAID-Z. It is safe, as it detects and corrects
silent data corruption. It is very fast, thanks to dynamic striping,
intelligent prefetch, and pipelined I/O. By offering data security and
integrity, virtually unlimited scalability, and easy, automated manageability,
Solaris ZFS simplifies storage and data management for demanding applications
today, and well into the future.