I'm thinking that I can easily do multiple FS ops in one transaction with some careful structuring of the journal.
The only problem with this is the old problem of journal size. Unlike a normal journaled FS, with user transactions, we may be dealing with a lot more data in many transactions and a static log size may be a bad thing (and severely limit the size of transactions that are possible).
I'm thinking that having an expandable log (perhaps a linked-list kind of approach) could be quite useful. In memory, we'll probably need another data structure to have things remotely efficient (esp if we want to know where to write additional log entries).
This could also be nice performance wise as we only have to lock one of these when doing operations on the log.
I've also been thinking about weird things like delayed index updates. That is, you update a file (or attributes) and it doesn't update the index pointing to it on disk until it is conveinient for it to do so.
I'm trying to write up what has been discussed regarding my hons project.
Got a meeting penciled in for monday morning.
Should be good.
But I have to get all these ideas that have been in my head down on paper, and all the reasoning behind them.
Finally got LaTeX on my laptop. Not from DarwinPorts though :( Got it from some i-installer program that I guy did. It works, I'm happy.
Otherwise titled: A day of reading about Filesystems.
I'm managing to continually find stuff I haven't already read. I think this is a good thing. I'm clearly now going to have to do some more research into how RDMBS do things.
The idea of Snapshots sound interesting, and may prove an interesting way to help avoid having to do journalling. I've been thinking that journalling could be a real issue if we allow large transations, so maybe a snapshot like idea could be good. Although, this could limit the number of transactions we work on concurrently if it's implemented the WAFL way.
So looking at how RDBMS do concurrent transactions would be useful. I guess we do have the advantage of having PostgreSQL and MySQL code to peer at but I do hope there are some nice papers out there to read :)
I've collected a large array of useful papers to read. I'm going to try and wade through them sometime soon, but will have to do more coursework soon.
I'm thinking tomorrow might contain a bit of coursework - need to do more POVRAY/OpenGL stuff as well as study more of the Pattern Recognition stuff. Oh, and look at the Info Security assignment & exercise for this week.
hmmm... this is a casually tricky one.... i've got some ideas and thoughts down, going to mail off the PDF to Ron and Carlo and see what they think.
1/2 actual information and aims for the project
1/2 just ramblings on topics of data storage.
hmmm.....
Linux:
- transactions on UNIX filesystems
- versioning in UNIX VFS environment
- extending UNIX FS model to accommodate advanced DB like features
Walnut:
- transactions
Multivolume disks?
- uid object migration in distributed environment
- have a distributed volume (e.g. lab scenario, all lab machines have same 'volume' mounted and sync of primary machine)
Snapshots for all versions?
- each version would require > 1 block
i.e. 10 mods of 10 bytes to 1 block would require >= 10 extra blocks
hmm... has to be better way in this scenario....
B+Tree of unique ids (seperate from walnut object IDs).
B+Trees sorted by size and location (a-la XFS) provides:
- ability to allocate large/small objects efficiently (size)
- ability to allocate blocks near existing objects (e.g. for object expansion) by using the location B+Tree
B+Trees are good, therefor use them.
Split up into allocation groups (a-la XFS and BFS). Allows more parallel operation as different threads can work on different allocation groups (during updating).
Idea for fine grained B+Tree locking:
- each node gets an ID (which should be pretty unique... but not essential to correct operation).
- when process needs to update node, places a lock on it's ID
- when a process accesses a node, before reading, it checks to see if a lock has been placed on that ID. IF so, we spin, waiting for the lock to be released
- when a process has finished updating a node, it releases the lock on it's ID.
- the process waiting to read it will stop spinning and be able to read the new copy.
IF IDs overlap, there is no problem, we'll just be spinning when a different node is being updated.
However, if we are updating a tree of nodes, then some careful locking wil have to be employed, from the BOTTOM UP! that is, update from the bottom up, locking each individual one as we go as if we only lock the root of the tree being updated, another process could have already passed it and screw us up royally.
some studies have shown that for multimedia applications, a larger block size improves throughput (e.g. 256kb blocks). For large media files, the waste of an average 128kb per file is insignificant (over several megabytes to many hundred mb or indeed GB). But, for smaller files (typically occupied by configuration files or small system binaries) a smaller block size saves more space (4k block for 100byte file).
ReiserFS goes for something slightly different allowing several files to share a single block. I am unsure if the extra effort involved in implementing this is worth it, or the slower access times that reiser reports for allowing this.
Maybe different allocation groups could have different block sizes? maybe there could be some kind of block-size migration system? or would the overhead not be worth it? Could it be one of those "maintenance" tasks that you run every month/year? How often does the average usage of a disk change that we'd need something like this?
Been re-reading a lot of the XFS papers that are on the SGI website (http://oss.sgi.com/projects/xfs/) and thinking more about what I want out of an object store. There are a lot of similar design goals (I think) yet some very different ways of implementing things.
Having a large B+Tree full of every object could be quite nice, kinda like inodes on conventional UNIX filesystems. On the object-store layer, we'd be able to store small objects inside the inodes. Above this, the namespace layer could find out if we can pack something into an inode, and if so, optimize for that (e.g. linear list of files for directories under 1k).
The idea of having a very layered system is increasingly appealing to me.
This means we could have some very nice optimizations for some applications. Some systems would only ever care about the object-store itself (a squid like caching system for example) and others could care a lot about a namespace system (or even be one). An example of the latter could be a database.
Expandability for the future is a given, it has to be. 64bits seems like a lot now, but no doubt somebody will be pushing it in 10 years or so.
snapshot like system for in-progress transactions, when committing, search for common substrings in blocks and the offsets of these (possibly indicating an insert operation) and on a per-block basis create a "shift N spaces" entry in the delta fork of the object
got a site up (really basic) covering my honors stuff.
http://www.flamingspork.com/honors/
well, i'm working on it.
bloody thing. all this other stuff we have to do instead of the project. It really does annoy me. I'd love to be able to get rid of coursework, assignments and these 'intermediate' things we have to do (which reminds me, i've got a heap of stuff to catch up on still) and just get on with the research. i.e. stuff that actually interests me/i care about.
i'm getting there on it - trying to work out what actually to talk about, in what order and all that. Not that we really get any help from this "subject" on these things. hmmm.....
Was thinking, I'm getting a pretty simple way of doing sparse objects (start from block 0 in allocation-group 0, i.e. a normally impossible location) with my slightly (possibly) better block_run structure (over BeFS's) I have been thinking about how to do the lookup from capability to onode.
/* fcfs_block_run
--------------
We have a max of ~4billion alloction groups,
within these, we can have up to 4billion blocks.
and we can address these with *one* block_run structure.
This means we don't have a lot of block address lookup on large files
or contiguous files. Should lead to fast IO...
*/
struct fcfs_block_run {
u32 allocation_group; /* Allocation group */
u32 start; /* Start Block */
u32 len; /* Length (blocks) */
};
Well.... I was writing about how sparse objects can be really useful in storing large, sparsely populated arrays and then (twacking my head yelling "obviously!") realised that isn't this precisely what the whole Walnut password capability address space is?
So.... why not use a sparse object for the object catalog? There's no reason not too... The only thing that would approach it is that sparse objects are implemented on a block granularity. Which means, as soon as you enter one byte data into the object, you occupy a full block. We could probably get around this with some hashing to make the array a bit less sparse and a few other tricks. But hey, easy initial implementation - i like it!
Honors Site update with more recent for my project - added functionality and organisation.
also updated documents.
so much to write about, so little time. Got literature review due on wednesday. lots more to do, hopefully something that i've done makes remote sense.
hell, i haven't even run it through ispell yet....
maybe i should print and read it. do my typical edit of every 2nd sentence to make it (hopefully) better. :)
pity just creating two lists: crappy and non-crappy and inserting file system names in them isn't enough.
I wish I had more time to implement my data store - it would be so much better.
be able to mark blocks as "in transaction" and only have this info recorded in memory, not on disk. allows less writes to disk, as any uncommitted transaction we don't care about on restart.
but, when things get to the "we're ready to commit this" stage, we're going to have to write back to disk....
hrrrmmm.....
the magical 100 objects have been put onto a volume, and they're in the index!
i'm even going to now hack on a little store utility that will take stdin and put it into an object.
but before that, i think i'm going to make something that prints an object to stdout.
just so we can start having some fun :)
a lot less is broken now.
must focus on it.
common message, and i guess it's what i have to do.
write lots and lots and lots and make it look all very impressive.
i.e. illustrate all the thoughts. too many of them :)
that would make some sense, and form an update - but gah - the screen on the machine with the SCSI card and the scanner has carked it, and i really cannot be bothered doing remote X and walking down stairs to change the paper.
live with not knowing what i'm doing.
or, just wait for the CVS logs or some strange thing.
gah, darn too much to do.
trying desperately to balance coursework and thesis.
it's insane and i don't know if i'm giving things reasonable priorities.
i now know how a process scheduler feels
got some feedback regarding the practice presentation i gave today - although it was about 8 mins over time (meant to be 15), i know where i can cut out some stuff.
comments and notes below:
clarity of Walnut better than unix, problems with walnut
explicitly state BeFS, maybe before 'trends'
relate explicitly file systems -> object store
explicitly state that word processing is an example - perhaps state others
positioning of problem pages, bring together, remove duplicates
tie solutions to problems
summary slide of what works
use meta-data stuff as extensibility example
if trouble with time, drop out some extra things.
work a bit more at very beginning.
explain walnut,
what it does
how my project relates to it
explicit Aims slide
presentation on thursday, 2pm in 135(26) - the seminar room. old building, it's very bare atm.....
at least video out works on my laptop - YAY!
waiting for feedback on the latest slides.
have done rearranging on the thesis.
Trying to polish off the thesis, add bits taht need adding, and extra discussion where needed.
I'm hoping to actually be able to cover everything I've thought about - but I do begin to doubt it sometimes - i've thought too much.
Currently 173 pages.
talk went okay - not as good as i really wanted, but there was a problem with balancing time to explain the background information needed (lots) vs what i'd done.
getting closer and closer, i think i've managed to cover everything.
constructing the log book which also needs to be handed in, collating things from here and there. This entry is basically the last one that's going to appear as i'm simply too lazy to print this off tomorrow.
180 odd pages, hoping that i can catch ron or carlo today to go over parts of it.
still have to write intro/conclusion. urghhhhhhhhhh