All about Operating Systems
A / B / C / D / E / F / G / H / I / J / K / L / M / N / O / P / Q / R / S / T / U / V / W / X / Y / Z
Visit also: www.sunorbit.net
|F| File system
Once an integral part - often even the main part - of an OS, the file system is today more and more considered an exchangeable component of the OS. More and more OSs come to market with more than one file system, so it is up to the user which one he uses. Nevertheless, the file system remains essential for the response times of the system - even in times of huge system memories and very fast hard disks. Although the file system has undergone considerable evolution, the perfect file system is yet to be found.
Before Shugart brought its 8-inch floppy disk drives to market in the mid-70s, only high-priced mainframes had disk systems. Shugart's floppy drives were at first single-sided (128 KByte), later double-sided (256 KByte); then in the early 80s the density doubled with the MFM technique to 512 KByte (later 1024 and even 2048 KByte). Mid-range to low-range systems like the PDPs at that time most often still had tapes (DECtape). Even many VAXes had only tapes.
Although we can't go into much detail here, let's look at some stages. One early system was the Intel ISIS system (a mid-70s development system for the 8008 and 8080). The ISIS monitor - not really an OS - used a variant of the linked list for accessing data on disk: before every block on disk was a header reserved solely for the link data. This technique, which is excellent for accessing data in memory, is the most awkward technique for accessing data on a disk: with this sort of layout you have to traverse the whole chain of blocks to arrive at the position where you want to read or write. Random access was excluded by design. And since the processor was too slow to read a whole track in one go, it took multiple revolutions to get all blocks read. No wonder the system - and with it the processor - had a reputation for being extremely slow.
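The cost of such a linked-list layout is easy to demonstrate. The following is a sketch in the spirit of the ISIS scheme, not the real on-disk format: each block stores the number of the file's next block, so reaching the n-th block forces n+1 disk reads.

```python
# Hypothetical layout (illustrative, not the real ISIS encoding):
# each block holds (next_block, payload); None marks the last block.
disk = {
    10: (37, b"AA"),    # block 10: next block of the file is 37
    37: (5,  b"BB"),
    5:  (None, b"CC"),  # last block of the file
}

def read_nth_block(first: int, n: int) -> bytes:
    """Reading the n-th block means following the chain from the start:
    n + 1 full disk reads, and random access is impossible by design."""
    block = first
    for _ in range(n):
        block = disk[block][0]  # one disk revolution just for the link
    return disk[block][1]

assert read_nth_block(10, 2) == b"CC"   # three reads to reach block 2
```

An index- or table-based layout, by contrast, can locate any block of a file with a constant number of reads, which is exactly what the later systems discussed below provide.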
CP/M was the first mass-market system, with an estimated more than 4 million installations. CP/M had a very simple directory structure: at the beginning of the disk there was a fixed entry in the block table for every file, reserving a certain number of blocks; if this wasn't enough, extension entries were needed. With this design CP/M shared the design flaw of the Unix inode file system: once the file system was dimensioned, you had to radically over-dimension the block table or the number of creatable files was unnecessarily restricted. CP/M knew no subdirectories; the fixed directory space could only be separated into several user areas.
QDOS, the predecessor of MS-DOS, was a re-invention of CP/M (this is often stated in the literature, although it is not quite correct: the CP/M file system had much more in common with the Unix inode system than with MS-DOS) - with some fortunate misunderstandings. Tim Paterson - the one who wrote QDOS - laid out the whole block allocation area as one piece at the beginning of the disk, with the pointers to the files following, which later opened the opportunity to have as many subdirectories as one liked. Because directories were now just normal files, no implicit limitation was given.
Today the world of file systems is more or less divided into three big parts: in the professional world there is the Unix file system - which is in its essential parts still the inode system of the very first Unix - second there is the NTFS file system of Windows NT, and in the semi-professional and amateur league you find the DOS/Windows FAT system, today mostly in its 32-bit variant.
File systems today
Many file systems are under discussion right now: transaction-oriented (journaling) file systems, which note every step they take and can therefore reverse or repeat their actions if necessary (e.g. NTFS); file systems which behave much like database systems or like the version control systems known from programming environments; and extensions or modifications of other known systems. Two topics have dominated and will continue to dominate the discussion: speed and data integrity. With some minor modifications, even systems with a bad reputation in the latter can be turned into well-behaved systems, as we will see below. (The terms security/safety are used in the following in the sense of data integrity or 'integrity of the file system', not in the sense of security against intruders, hackers etc.)
Fault tolerance is one of the main issues of modern file system design, with effects on all levels of the design. Redundant copies of allocation information, provisions for graceful recovery, and even multiple storage of the same file and/or of versions of the same file are part of this concept. One such file system is e.g. the Elephant file system. Systems which do not overwrite older versions of a file but keep them as difference files, much like the version control systems known from programming environments, go even one step further.
Despite the fact that today's hard disks are much more reliable than earlier ones - to mention just one reason: the defect management of today's hard disks makes bad spots nearly invisible - redundancy in allocation and directory information is still an issue. It is not only power failures, system crashes and program development that demand redundant information.
But the ideal file system has yet to be invented. The journaling file system, for example, has inherent problems and brought new ones (a journal itself can be affected by errors; integrity of a JFS in the millisecond to microsecond domain; journaling to RAM or to NVRAM, and what happens if NVRAM fills up) and suffers from old problems (fragmentation; half-written files if power fails during a write; etc.). Since journaling and database systems are normally built on top of normal file systems (e.g. the Unix inode system), they inherit all the problems of the underlying system. Journaling makes a slow file system even slower: synchronous I/O when many files are created, renamed or changed causes additional write actions.
A general preface may be in order: today's OSs are much more dependent on a fast file system than earlier ones. The normal user associates with a file system only the opening and closing of user programs and the reading and writing done by these programs. But a lot of hidden disk accesses are today associated with this seemingly simple task: reading and writing a sometimes huge registry, loading DLLs, reading and writing (multiple) configuration files, macro files, password files and graphic symbol files, establishing links, writing log files, loading fonts, textures and drivers, reading/writing the cache file... Just to give you an impression: simply starting a program like the Microsoft Developer UI causes 11,645 different OS file system calls (open/read/write/close). And there are even more complex programs. Sequential reading and writing as well as random access are equally important. Algorithms which only partially reduce fragmentation are therefore only a halfway solution.
How slow today's file systems really are you get to know when you write programs which must reference every file on disk: indexing programs, virus checkers, cataloging programs... It takes half an eternity to scan a 300 GByte hard disk. Compact directory structures as in NTFS show the biggest advantages here. (Please don't mix this up with fsck or scandisk: they can and do circumvent performance-critical parts! Nevertheless, these too need half an eternity on a 300 GByte disk.) Among the questions I get most often on my pages: "Where can I get a fast grep?" Or: "Why is grep so slow?"
File systems become obsolete easily. I don't mean by this the kind of systems which could be found until the beginning of the 80s, which were based on linked lists and were therefore incapable of random access (e.g. ISIS).
The High Performance File System HPFS (OS/2) was obsolete right at the time of its introduction to the market. It organized its directories so that the physical paths of the heads on the disk were optimized for directory operations. Directories were placed in bands on the disk so that the mean distance between data area and directory area was minimized. The three main time-consuming factors which slow down disk access were to be optimized this way: step time, head settle time and disk latency (the time the disk needs to rotate until the desired sector is under the head). Although it is easily forgotten that every cache has to be filled at least once before it can function as a cache (which means, on the other hand, that operations which reference an item only once gain nothing from caching), it was the arrival of noteworthy caches, but even more a new type of disk, that made this system old-fashioned: disks which show to the outside a geometry that is (very) different from their internal real geometry. This started when hard disks became intelligent and the disk manufacturers could implement a variable-sector scheme to improve the capacity of the disk. The translation operations (LBA/CHS/...) used today can even turn the effect into its contrary. It is clear that a system which tries to optimize physical paths gets fooled by a disk which behaves in such a way. Its main design flaw was to overemphasize the administration on disk, while at that time it was already possible to administer the disk in RAM - with the right administration algorithms, naturally. (The sentence on caching is not to be taken verbatim: caches are a complex subject. With read-ahead caches the statement may be false!)
Since file systems were treated in the course of the microkernel discussion as merely an exchangeable part of the system, file systems seem to be neglected as far as their evolution is concerned: 'If you don't like type x, take type y!' In reality this often means exchanging one poorly engineered file system for another. The end user, who knows nothing about the internals of file systems, is left alone with a decision he can't make. This leads to some monstrous scientific discoveries, e.g. in some Linux books: that Linux is not affected by such earthly things as fragmentation - which is reserved for the bad and ugly file systems like the DOS FAT - or is by nature much safer, etc. etc.
As an example let's take the file system of Plan 9 (see also INFERNO), the successor of Unix at Bell Labs. It is essentially the file system of Unix with some minor modifications. (You may ask: which is the file system of Unix? Since Unix behaves like a chameleon in this subject too, the question is not far-fetched. But for our discussion here they have enough in common - up to Ext2fs and Ext3fs if we neglect journaling, in a restricted sense even XFS, JFS and ReiserFS, since they are all inode-based - to generalize the topic. If you want to read further, read e.g. M. J. Bach, "The Design of the Unix Operating System", Prentice-Hall, or take a look at the description of Ext2fs at sunsite.unc.edu or any other description which also covers the essentials.)
The Unix file system had an educative attitude: since it was believed at the time Unix was built that executables over 5 KByte could never be debugged, every attempt to build executables greater than 5-10 KByte was punished. (Am I the only one who remembers this statement?) This was one of the intentions behind the poor direct addressability of 10 blocks (0-9) in the inode (with Linux Ext2fs 12, which makes no difference). Blocks were 512 bytes big at that time, which gives a file size of 5 KByte. If the file was bigger than 5 KByte, indirect addressing took place, which means that address 10 addresses a block somewhere on the disk which contains the next 128* addresses. If the file was even bigger than the roughly 70 KByte addressable this way, double indirect addressing took place: address 11 addresses a block somewhere on the disk which itself addresses 128* blocks somewhere on the disk which contain the real addresses of the blocks of the file. Guess what happens with address 12? Triple indirect addressing means that the first block points to 128* blocks somewhere on the disk which themselves point to 128* blocks somewhere on the disk which at long last contain the real addresses. If you wondered about the many 'somewhere's: the word 'somewhere' is the important point. Since there is no dedicated part of the disk for the address blocks or for the directory blocks, these blocks are truly totally scattered over the disk. This means a real punishment took (and takes) place: since the big 14" hard disks (and the now forgotten drum disks) of those times were nearly as slow as floppies are nowadays, the stepping time plus the head settle time plus the disk latency get multiplied by the indirect, double indirect or even triple indirect address finding process. Now remember: we haven't read a single byte of our file yet; that also has to happen in between. What's more, in the Unix file system even the file name is separated from the other two parts of the directory information, in extra blocks.
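A quick back-of-the-envelope check of these numbers, assuming the 512-byte blocks and 128 addresses per indirect block mentioned above:

```python
BLOCK  = 512   # bytes per block in the early Unix file system
PTRS   = 128   # addresses held by one indirect block (per the text)
DIRECT = 10    # directly addressed blocks in the inode

direct_bytes   = DIRECT * BLOCK     # the infamous 5 KByte limit
indirect_bytes = PTRS * BLOCK       # reachable via address 10
double_bytes   = PTRS**2 * BLOCK    # reachable via address 11
triple_bytes   = PTRS**3 * BLOCK    # reachable via address 12

assert direct_bytes == 5 * 1024                  # 5 KByte
assert direct_bytes + indirect_bytes == 70_656   # the ~70 KByte mark

# Address-block reads needed before the data block itself can be read;
# each one is a full seek + settle + latency cycle on a scattered disk:
extra_reads = {"direct": 0, "indirect": 1, "double": 2, "triple": 3}
```

The point of the punishment is the last line: every level of indirection adds one more full mechanical disk access before a single byte of the file itself has been read.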
And last but not least, if all of this isn't catastrophic enough, the directory information as well as the address blocks are totally normal file blocks on the disk, meaning they are subject to fragmentation like all normal files on the disk. A good principle - non-dedication - is turned here into its parody.
(* Today's Unix systems changed from 128 to 256, which makes no difference.)
Caching can't cope with this scattering. Absolutely no benefit can be expected from e.g. read-ahead caching. One could - compared to the HPFS described above - describe this way of getting the needed information as: "Is there any longer way we could take?" or "Is there any detour we have omitted?", and the whole system as LPFS, the "least performing file system". This whole system lacks any feeling for the subtleties of a mechanical device such as the hard disk (to mention just one fact: large swings of the head are much slower than short swings or single steps, or no step at all, etc. etc.). Hard disk manufacturers could mark the hard disks which survive the Unix test: "Unix approved".
The file name was separated from the other two parts to have an elegant way of implementing aliases in the form of hard links, which could reference the same inode entry. The latter is the only change in the Plan 9 file system: file names are now part of the inode. This seems to be partly due to the late recognition that hard links are essentially worthless: they can only reference files in the same file system (file-system-crossing references must be implemented with soft links, which can easily be implemented in every file system). Many later systems have shown that links can be implemented at much lower cost, even in systems which originally know nothing about links (even a link counter).
A short repetition: the inode area is placed first on the disk (along with some other information: boot block, superblock, sometimes bitmaps for the inode and data areas). It is followed by the data area comprising data, address blocks and directory information (the latter being part of the inode in Plan 9). This means that once the inode area is dimensioned to, say, 40,000 entries, this is fixed eternally. If someone has many small files, he won't be able to fill the disk because there are no more inode entries. If someone else has only really big files, much of the disk space is wasted because the over-dimensioned inode area eats valuable, no longer available disk space.
What's wrong with this file system? Anything and everything. It stands on its head; it is at sixes and sevens. Since a disk (or a partition) has a fixed number of sectors or blocks (or, more generally, a fixed capacity), this has to be the fixed part of the directory information - the place which the inode takes in Unix. Since the files - especially the number of files on a disk - are the variable part, which depends on the user and his applications and may change with time, this has to be the extensible part of the directory information - which in Unix is fixed in the inode. To fix this variable part means that you unnecessarily have to over-dimension the number of permitted entries, or the user at a certain point gets the message 'Out of inodes' - despite the fact that the disk still has plenty of room. What's more, the Unix file system has absolutely no advantage in being quick - quite the contrary, as we have seen - or in being safe - there is no built-in redundancy whatsoever - although some file systems such as Ext2fs try to implement some rudimentary redundancies. With today's block sizes of 4-8 KByte and more - otherwise the system would show its slowness too soon - every little file of 200 bytes consumes 8 KByte (think of the many icons or the many scripts...), and every, yes every, file on the disk wastes on average at least 4 KByte (with 8 KByte blocks): the last block of every file is statistically only half filled. This means that if you have a 50 GByte disk with 800,000 medium-sized files on it, you waste 800,000 x 4,000 = 3.2 GByte, or nearly 10%. A short while ago that was a whole hard disk. Not included: the space wasted by small files (and an over-dimensioned inode area). The BSD file system uses disk space much better, at the price of even lower execution speed. XFS, JFS and ReiserFS have realized that there is something wrong with the whole scheme and implement extensions to the inode and use B-trees for addressing.
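The slack-space arithmetic above can be checked in two lines. (The text rounds the per-file waste to 4,000 bytes; with the exact half block of 4,096 bytes the figure comes out slightly larger.)

```python
def slack_waste(n_files: int, block_size: int) -> int:
    """Expected slack: the last block of each file is statistically
    only half filled, wasting block_size/2 bytes per file."""
    return n_files * block_size // 2

# The text's example: 800,000 files on a 50 GByte disk with 8 KByte
# blocks -> roughly 3.2-3.3 GByte lost to half-filled last blocks.
assert slack_waste(800_000, 8192) == 3_276_800_000
```

Note that this counts only last-block slack; the extra waste from files smaller than one block (the 200-byte icon in an 8 KByte block) comes on top.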
If you want to read further, start with e.g. Linux Gazette No. 55, "Journal File Systems".
So - after all - if we put the Unix file system back on its feet, where do we arrive? Essentially at the web-wide (and in all books and magazines) much-blamed DOS or FAT file system. It has, in a fixed part at the beginning of the disk, exactly one entry for every block on the disk, which tells to which file (if any) this block belongs. The file name/attributes and a pointer into this block table are the variable part, which finds a place somewhere on the disk (except the root directory in FAT16). So if Microsoft hadn't done such an ugly job of patching long file names into the FAT system, and if there were only a little bit more redundancy (although DOS FAT, with its second FAT, has more redundancy than many other file systems) and/or at least error checking, the FAT system could - in a true 32-bit* or even 64-bit implementation and with some minor modifications and improvements - outlive many of its much-praised competitors. *(Only 28 bits are used in 32-bit FAT.)
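The fixed/variable split just described can be sketched in a toy FAT; the values are illustrative, not the real on-disk encoding:

```python
# Toy FAT: one entry per disk block (the fixed part); each entry holds
# the number of the file's next block, or EOF. A directory entry (the
# variable part, itself stored in ordinary blocks) holds the file name
# and the number of the first block.
EOF = -1
fat = [EOF] * 16                       # an empty 16-block "disk"
fat[3], fat[7], fat[8] = 7, 8, EOF     # a file occupying blocks 3, 7, 8

def file_blocks(first: int) -> list:
    """Follow the chain in the table; since the whole table is compact,
    it can be cached in RAM and no extra disk seeks are needed."""
    chain, b = [], first
    while b != EOF:
        chain.append(b)
        b = fat[b]
    return chain

assert file_blocks(3) == [3, 7, 8]
```

The contrast with the inode scheme is that the chain lives in one compact table sized by disk capacity, not in address blocks scattered over the data area.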
Back to Plan 9: the Unix file system had a certain logic - all this catastrophic layout was a result of the wish to implement hard links. Now that this prerequisite is given up in Plan 9, the whole layout of the Plan 9 file system makes absolutely no sense.
By the way: the Ext2fs block groups are a step in the right direction, but they can't eliminate fragmentation altogether. I don't know if they go back to my proposal, but I made a very similar proposal (going even further, see below) in the book referenced below. This principle is easily adaptable to every file system.
A last word on the Unix file system: an investigation in the 80s showed that the Unix file system delivers 5-7 percent of the maximum possible throughput. The results of this investigation were published everywhere. It seems nobody has read them. And one thing should be clear: file systems have nothing in common with good old wine, which gets better the older it is.
And ah well, the FAT file system has many problems as it is by now, but none that couldn't be solved. This industry works with the foresight of a mole, one of the latest examples being the layout of the 32-bit FAT system. Because it was absolutely not foreseeable that hard disks would ever surmount the 1-10 GByte barrier, 28 bits out of the 32 were considered enough. (EIDE's support of only 28 bits* is a totally different problem - although not in the sense of foresight... And there is also SCSI.) Now there are the same problems as 4 years ago. And nobody would say a word if such slips happened to a one-man company... Or could this all be intentional??
So after our little tour through the world of file systems we can finally say that all the highly appraised systems to date have the same design flaw: they interchange the fixed and the variable part of the administration information. What did the Unix developers hope to gain by this scattering of information which belongs together? It seems to be a psychological question: the amount of information looks smaller if you write it bit by bit. But in the end your disk doesn't get smaller this way! You have only scattered information which belongs together all over the disk! And this means you have produced an additional - unnecessary - cause of fragmentation!
The argumentation you can read concerning the 28-bit FAT is of similar naivety. First: a 28-bit number consumes not a single bit less space than a 32-bit number. Second: how stupid do you think we programmers are, that you suppose we would put the whole FAT 1:1 into memory? There are a thousand and one different possibilities to compress or otherwise intelligently handle this information. In the times of CP/M we had to manage 32 KByte of directory information in 64 KByte of memory - and still do some caching etc. (and the OS and the programs too had to reside in these 64 KByte). Compared to this, the relationship between administration information and memory is today totally relaxed: an 8 MB FAT compared to the 256 or 512 MB found in a normal PC today is a ratio of 32:1 or 64:1!
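One of those thousand and one possibilities, as a sketch: keep only a free/used bitmap in RAM for allocation decisions (one bit per block) and page in FAT sectors on demand for chain following. The sizes below assume the 8 MB FAT from the text with 32-bit entries.

```python
# An 8 MB FAT with 4-byte entries covers 2M blocks. For allocation we
# only need to know free vs. used, so a 1-bit-per-block bitmap suffices.
entries      = 2 * 1024 * 1024   # blocks covered by the FAT
fat_bytes    = entries * 4       # 4 bytes (32 bits) per entry on disk
bitmap_bytes = entries // 8      # 1 bit per block in RAM

assert fat_bytes == 8 * 1024 * 1024    # the 8 MB FAT of the text
assert bitmap_bytes == 256 * 1024      # 256 KB in RAM: a 32:1 reduction
```

Whether an implementation uses a bitmap, run-length coding or something else is a design choice; the point is only that nothing forces the full table into memory.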
But in other regards too, it seems that when it comes to file systems, normally very intelligent people get into trouble and escape into mythological explanations. To cite just one of these often-heard allegations: the FAT system is unreliable because it has all information in one place (= the FAT). (You can read this allegation in many places, so please take this one only as an example. It is only a variant of the allegation that the FAT system is unreliable because it has all information on the first track.) Normally it is not necessary to argue against this, but since you hear it so often: please, what about the MFT or the inode? Are they protected by some mythological gods? Is there any proof that the first track(s) get overwritten accidentally more often than any other track?
For at least twenty years it has been clear how a modern file system should look, no matter how it is organized. The first demand is: a file system has to be fast. This means that the whole administration has to be done as far as possible in RAM. A modified caching algorithm which keeps the interesting parts of the allocation map and directory information permanently in RAM could ensure this. There was and there is no problem in keeping this information in RAM - normal caching is too slow and should be reserved for the data area. (Since hard disk capacity has grown continuously and RAM prices have fallen continuously, the ratio of allocation information to RAM capacity has been almost constant over the last 20 years.) (Just to prevent any misunderstanding: keeping this information in RAM has no effect on the updating of disk data! Allocation data on disk always has to be kept up to date by write-through.) Further, this means in the case of block-oriented systems that the allocation information and directory information have to be compact; besides minimizing head movement, this way full benefit can be taken from read-ahead caching. Although stepping times, head settle times and disk latency have all been reduced in the last years, they have been reduced only in proportion to the grown capacities of today's hard disks. And finally this means a decision which has been overdue for years and renders irrelevant many of the above-mentioned pseudo-new concepts: a buffered power supply is needed, which in mass production costs a mere ten or twenty dollars more than today's power units, since buffering is needed only in the millisecond range (at least in countries with a weak energy supply).
All other cases of file system inconsistency, e.g. in program development, are easy to handle in a well-designed, fault-tolerant - "forgiving" - file system. A good file system should be invisible, transparent to the average user, like RAM or the CPU.
I was asked to give the fragmentation prevention algorithm, so here it is: a good fragmentation prevention algorithm tends by itself to produce bigger and bigger holes of free allocation blocks. At the same time it has to be fast, not consuming too much time for calculations. Now, if you always simply take the biggest free hole in the allocation map and preallocate in this free hole a system-specific, medium-sized area of allocation blocks for every file to be written, you will observe that this algorithm by itself tends to produce bigger and bigger continuous free areas. Why is easy to see: since small free holes are avoided (at least until the disk is nearly full), they accumulate and finally melt into bigger free holes. This algorithm is thus quite the contrary of the 'dumb' allocation algorithm implemented in different variants in most OSs today, such as DOS, Unix etc., which simply take the next free block (or the next 10, 20, 50... free blocks). (Preallocation is necessary because, not only in a multitask/multithread environment, many files can be open which concurrently demand free blocks. After closing a file, the unused space is given back to the pool of free blocks.) This is a proven algorithm that works (since the 80s!). You simply need no more defragmenters if your system defragments while writing.
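The scheme just described can be sketched as follows. The hole representation and the preallocation size are illustrative choices, not any particular OS's implementation:

```python
def allocate(holes, prealloc):
    """Carve `prealloc` blocks out of the biggest free hole.
    `holes` is a list of (start, length); returns (start, new_holes)."""
    start, length = max(holes, key=lambda h: h[1])   # biggest hole wins
    holes = [h for h in holes if h != (start, length)]
    if length > prealloc:
        holes.append((start + prealloc, length - prealloc))
    return start, holes

def free(holes, start, length):
    """On file close, return unused blocks and merge adjacent holes -
    this is where small holes 'melt' into bigger ones."""
    holes = sorted(holes + [(start, length)])
    merged = [holes[0]]
    for s, l in holes[1:]:
        ps, pl = merged[-1]
        if ps + pl == s:
            merged[-1] = (ps, pl + l)   # adjacent: melt together
        else:
            merged.append((s, l))
    return merged

holes = [(0, 4), (10, 50), (70, 8)]
start, holes = allocate(holes, 16)   # biggest hole starts at block 10
assert start == 10 and (26, 34) in holes
holes = free(holes, 10, 16)          # file closed, run given back
assert (10, 50) in holes             # the big hole re-emerges
```

Small holes are never picked while bigger ones exist, so they sit untouched until a neighbouring release merges them away, which is exactly the self-defragmenting behaviour claimed above.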
Although fragmentation is no longer as big an issue on the normal user system as it once was - due to today's big hard disks - no fragmentation is better than a little fragmentation. And in systems in which even the administration information is subject to fragmentation, a really good fragmentation prevention counts double and triple!
So if you're going to construct a better file system, always remember the topmost rule of every good engineer: KISS - keep it simple! A file system is no end in itself; its academic beauty or correctness doesn't count. It has to deliver two services: persistence, and extension of the always too small system RAM. Nothing more. But that, fast.
(Slightly updated excerpt from the book "Multitask, Multiuser, Multiprocessor".)
In abnormal system situations, provisions must be made for a graceful recovery of the system. While early systems used tools like scandisk or fsck to guarantee the safe state of the file system, most modern systems use some sort of journaling or logging of metadata. While the process of scanning a disk is an undirected operation - the whole disk is scanned - the process of finding the fault in a journal is a directed operation and thus very much faster: we immediately find the place of the error. Time - in this case the uptime of the system - is the resource gained.
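The directed nature of journal recovery can be illustrated with a toy metadata journal; the record format here is invented for the sketch:

```python
# Each transaction logs its metadata updates, then a commit record.
# After a crash, recovery replays committed transactions and discards
# the rest - it reads only the journal, never scans the whole disk.
journal = [
    ("begin", 1), ("update", 1, "inode 7 -> block 42"), ("commit", 1),
    ("begin", 2), ("update", 2, "inode 9 -> block 13"),  # crash here!
]

def recover(log):
    """Replay the updates of committed transactions only."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    return [rec[2] for rec in log
            if rec[0] == "update" and rec[1] in committed]

assert recover(journal) == ["inode 7 -> block 42"]   # txn 2 discarded
```

Note what this guarantees and what it doesn't: the metadata is consistent afterwards, but the data of the uncommitted transaction is simply gone, which is exactly the limitation discussed below.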
But we pay a high price for this gained time: journaling - if not done to (battery-backed) RAM (or NVRAM or...) - can slow down a file system considerably, and thus the response time of the whole system in its everyday usage. If the underlying file system is already slow by design, this can be crucial. (For newbies it may be necessary to mention that a journaling file system does not make tools like scandisk or fsck superfluous altogether! The journal catches only file system inconsistency errors. Hardware errors, for example, must be detected as usual with fsck.)
To get a better insight into the problems of journaling we should first clearly distinguish the different reasons for recovery: system crash, power failure, and data consistency/integrity failure. Second, we must distinguish the underlying systems: professional installations, which most often have a power backup (uninterruptible power supply), and "normal" systems with no provisions for power failure - lacking even a buffered power supply, which could prevent the worst in case of power failure.
Most systems today log only metadata changes. File system consistency after a crash is guaranteed by these systems because all of the metadata updates are in the journal and can be applied to the file system for committed transactions. On the other hand, this means that for unfinished transactions the actual data is lost - e.g. the file being written. Writes should therefore only be acknowledged after the data has been written to its permanent position on the disk. But this means a further degradation of the response time of the system.
After a power failure or a system crash the actual data is lost - no metadata journal in the world will bring it back. In a multitask/multiuser environment this can be quite a lot of data. This is the reason why systems are by now emerging which write even the data itself into the journal (to write it afterwards to disk by normal file system actions). One reason for this strange double writing is the slow file system: a special area on disk is reserved for the journal, which means nothing less than a circumvention of the slow Unix file system (the justification for this procedure is unmasking: the perceived!! write time should be lowered). One silly question: what happens if the power fails or the system crashes while we write our data to the journal?
Journaling to NVRAM does not suffer from the described problems - better said, it minimizes them - but has others. Especially if we want to use the just-described intermediate saving method, our NVRAM will soon show that it is not endless. Another major drawback of writing the journal to NVRAM lies in the complications when sharing the state of the file system with another system is required...
More on file systems see above.