src/TODO



v0.6
/- fold observer into cmonctl/ceph?
/- osd scrub
/- async metadata

v0.7
/- smart osd sync
/- osd bug fixes
/- fast truncate
/- updated debian package
/- improved start/stop scripts
/- proc/sysfs cleanup

v0.8
/- O_DIRECT
- kill fill_trace

- ENOSPC
- flock

- fully async file creation
- cas?

big items
- finish client failure recovery (reconnect after long eviction; and slow delayed reconnect)
- ENOSPC
  - space reservation in ObjectStore, redeemed by Transactions?
  - reserved as PG goes active; reservation canceled when pg goes inactive
  - something similar during recovery
  - ?
- repair
- enforceable quotas?
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
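
A rough sketch of the ENOSPC reservation idea above (reserve when a PG goes
active, redeem from transactions, cancel when the PG goes inactive). The class
and method names are hypothetical and not part of ObjectStore; this only
illustrates the accounting, assuming reservations are tracked per PG in bytes.

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-PG space accounting; not an existing ObjectStore interface.
class SpaceReserver {
public:
  explicit SpaceReserver(uint64_t capacity) : free_(capacity) {}

  // Called as a PG goes active: set aside bytes, or fail if not available.
  bool reserve(int pg, uint64_t bytes) {
    if (bytes > free_)
      return false;
    free_ -= bytes;
    reserved_[pg] += bytes;
    return true;
  }

  // Called by a transaction: consume part of the PG's reservation.
  // A false return is where the caller would report ENOSPC.
  bool redeem(int pg, uint64_t bytes) {
    auto it = reserved_.find(pg);
    if (it == reserved_.end() || it->second < bytes)
      return false;
    it->second -= bytes;
    return true;
  }

  // Called when a PG goes inactive: return the unused part to the free pool.
  void cancel(int pg) {
    auto it = reserved_.find(pg);
    if (it == reserved_.end())
      return;
    free_ += it->second;
    reserved_.erase(it);
  }

  uint64_t free_bytes() const { return free_; }

private:
  uint64_t free_;                     // bytes neither reserved nor consumed
  std::map<int, uint64_t> reserved_;  // pg id -> unredeemed reserved bytes
};
```

Redeemed bytes are treated as consumed, so cancel only gives back what the PG
never used. What recovery should do is left open, as in the list above.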


repair
- are we concerned about
  - scrubbing
  - reconstruction after loss of subset of cdirs
  - reconstruction after loss of md log
- data object 
  - path backpointers?
  - parent dir pointer?
- cdir objects
  - parent dir pointer
    - update on rename?  or on cdir store?
      on cdir store is sufficient if mdlog survives...
  - or what the hell, full trace?
- mds scrubbing


kernel client
- should O_DIRECT invalidate the page cache?
- inotify for updates from other clients?
- optional or no fill_trace?
- flock, fcntl locks
- async xattrs
- avoid pinning inodes with expirable caps?
- avoid flushing tcp socket when sending client_lease release messages (when the request is about to follow)
- make osd retry writes if failure after ack..
- ACLs
- make writepages maybe skip pages with errors?
  - EIO, or ENOSPC?
  - ... writeback vs ENOSPC vs flush vs close()... hrm...
- set mapping bits for ENOSPC, EIO?
- flush caps on sync, fsync, etc.
  - do we need to block?  how do we track that?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- reconnect after being disconnected from the mds

vfs issues
- real_lookup() race:
  1- hash lookup finds no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate.  if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()
- a getattr mask would be really nice

filestore
- make min sync interval self-tuning (ala xfs, ext3?)
- get file csum?

btrfs
- clone compressed inline extents
- ioctl to pull out data csum?


userspace client
- handle session STALE
- time out caps, wake up waiters on renewal
  - link caps with mds session
- validate dn leases
- fix lease validation to check session ttl
- clean up ll_ interface, now that we have leases!
- clean up client mds session vs mdsmap behavior?
- stop using mds's inode_t?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it

mds
- on replay, put dirty scatter replicas on lists so that they get flushed?  or does rejoin handle that?
- take some care with replayed client requests vs new requests
- linkage vs cdentry replicas and remote rename....
- move root inode into stray dir
- make recovery work with early replies
  - purge each session's unused preallocated inodes
- dftlock is missing from rejoin phase
- file size recovery gives (wrong) 4MB-increment results?
- hard link backpointers
  - anchor source dir
  - build snaprealm for any hardlinked file
  - include snaps for all (primary+remote) parents
- how do we properly clean up inodes when doing a snap purge?
  - when they are mid-recover?  see 136470cf7ca876febf68a2b0610fa3bb77ad3532
- what if a recovery is queued, or in progress, and the inode is then cowed?  can that happen?  
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- add an up:shadow mode?
  - tail the mds log as it is written
  - periodically check head so that we trim, too
- rename: importing inode... also journal imported client map?
- rerun destro trace against latest, with various journal lengths
- cap/lease length heuristics
  - mds lock last_change stamp?
- handle slow client reconnect (i.e. after mds has gone active)
- fix reconnect/rejoin open file weirdness
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- FIXME how to journal/store root and stray inode content? 
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?
- add FILE_CAP_EXTEND capability bit


journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes
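
A minimal sketch of the zero-padding idea above, assuming entries are framed as
a 4-byte little-endian length followed by the payload, and that "splitting"
means crossing a fixed period boundary. The framing, the period parameter and
both function names are assumptions for illustration, not the Journaler's
actual format. The reader skips to the next boundary in the two cases noted
above: fewer than 4 bytes left for a size, or a zeroed size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append [len][payload]; zero-fill to the period boundary first if the entry
// would otherwise straddle it. Assumes a non-empty entry that fits in a period.
void append_entry(std::vector<uint8_t>& buf, const std::string& payload,
                  size_t period)
{
  assert(!payload.empty() && payload.size() + 4 <= period);
  size_t left = period - buf.size() % period;
  if (payload.size() + 4 > left)
    buf.insert(buf.end(), left, uint8_t(0));
  uint32_t len = static_cast<uint32_t>(payload.size());
  for (int i = 0; i < 4; i++)
    buf.push_back(static_cast<uint8_t>((len >> (8 * i)) & 0xff));
  buf.insert(buf.end(), payload.begin(), payload.end());
}

// Read entries back, skipping padding at the end of each period.
std::vector<std::string> read_entries(const std::vector<uint8_t>& buf,
                                      size_t period)
{
  std::vector<std::string> out;
  size_t pos = 0;
  while (pos < buf.size()) {
    size_t boundary = (pos / period + 1) * period;
    size_t left = std::min(boundary, buf.size()) - pos;
    if (left < 4) {          // not enough room for a size: padding
      pos = boundary;
      continue;
    }
    uint32_t len = 0;
    for (int i = 0; i < 4; i++)
      len |= static_cast<uint32_t>(buf[pos + i]) << (8 * i);
    if (len == 0) {          // zeroed size: padding
      pos = boundary;
      continue;
    }
    if (len > left - 4)      // partial trailing write; stop here
      break;
    out.emplace_back(buf.begin() + pos + 4, buf.begin() + pos + 4 + len);
    pos += 4 + len;
  }
  return out;
}
```

The position where the reader stops is also a candidate for the truncation
point in the last item above, since everything after it is a partial write.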


mon
- paxos needs to clean up old states.
  - default: simple max of (state count, min age), so that we have at least N hours of history, say?
  - osd map: trim only old maps < oldest "in" osd up_from
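
One possible reading of the default trim rule above, as a sketch: keep whichever
is larger, the newest min_count states or every state younger than min_age, and
trim the rest. StateInfo and first_version_to_keep are hypothetical names, and
the interpretation of "max of (state count, min age)" is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of one committed state.
struct StateInfo {
  uint64_t version;
  uint64_t commit_time;  // seconds
};

// Returns the lowest version to keep; anything below it may be trimmed.
// Expects states sorted by ascending version and commit_time.
uint64_t first_version_to_keep(const std::vector<StateInfo>& states,
                               size_t min_count, uint64_t min_age_secs,
                               uint64_t now)
{
  if (states.empty())
    return 0;
  // Index of the oldest state kept by the count bound.
  size_t by_count = states.size() > min_count ? states.size() - min_count : 0;
  // Index of the oldest state kept by the age bound.
  size_t by_age = 0;
  while (by_age < states.size() &&
         states[by_age].commit_time + min_age_secs < now)
    ++by_age;
  // Keep the larger of the two sets, i.e. start at the smaller index.
  size_t keep_from = std::min(by_count, by_age);
  if (keep_from == states.size())
    return states.back().version + 1;  // nothing qualifies; all trimmable
  return states[keep_from].version;
}
```

The osd map rule would replace the age bound with the oldest "in" osd's
up_from, which this sketch does not model.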

osdmon
- monitor needs to monitor some osds...

pgmon
/- include osd vector with pg state
  - check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

crush
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set<int> >)

osd
- pg split should be a work queue
- pg split needs to fix up pg stats.  this is tricky with the clone overlap business...
- generalize ack semantics?  or just change ack from memory to journal?  memory/journal/disk...
- rdlocks
- optimize remove wrt recovery pushes

simplemessenger
- close idle connections?

objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

cas
- chunking.  see TTTD in
   ESHGHI, K.
   A framework for analyzing and improving content-based chunking algorithms.
   Tech. Rep. HPL-2005-30(R.1), Hewlett Packard Laboratories, Palo Alto, 2005.