Cache File Format (Documentation)

Discuss building things with or for the Mozilla Platform.
Locked
TreyDX
Guest

Cache File Format (Documentation)

Post by TreyDX »

I am trying to write a forensics tool (similar to Mozilla Cache View) that will allow me to extract files (.jpg, .html, .css, etc.) out of the Firefox disk cache. FF 2.0 is what I'm really concentrating on for now--I know the history.dat file format changed to SQLite in the 3.0 beta, but I assume the cache files have stayed the same. The problem is that I can't figure out exactly how to parse these files (_CACHE_MAP_, _CACHE_001_/002/003, and all the ones named after the hashes). I have found two resources that almost give me the information I need to pull out the files, but I need a little more detail about what all the bits and bytes in each of these files mean. I would really just like to read documentation on how the cache entries are built, but the closest I can come to the information I need is the source code itself.

If anyone knows where the documentation for how the cache entries are built lives, that would help me out tremendously! Other than that, these are the two resources I have come up with after three or four days of Google searching.

1) http://people.mozilla.com/~chofmann/l10 ... CacheMap.h -- the comments in the source code point me in roughly the right direction.

2) http://www.securityfocus.com/infocus/1832 -- this seems close to what I need to be able to do, but it doesn't quite have enough detail for me to implement it in code.

I would appreciate any additional resources you can provide on this matter.

Thanks!
steviex
Moderator
Posts: 28902
Joined: August 12th, 2006, 8:27 am
Location: Middle England

Post by steviex »

Moving to Mozilla Development
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. -Albert Einstein

Please DO NOT PM me for support... Let's keep it on the board, so we can all learn.
Anonymosity
Posts: 8779
Joined: May 7th, 2007, 12:07 pm

Post by Anonymosity »

Why do you need to write an extension to do that when you can access files in the cache using "about:cache?device=disk"? Click on the item of interest, then click it again when the new window opens. You can drag an image out of the window and drop it into a directory to save it.
treydx
Posts: 4
Joined: March 16th, 2008, 4:11 pm

Post by treydx »

It is not an extension that I want to write. It is a completely external tool. I basically want to be able to implement about:cache?device=disk in my program without ever opening up Firefox...and I want to move ALL of these files to a directory where they can be saved long-term. So if someone clears the cache later, I still have access to these files and would be able to present them in court as evidence. Does that clarify the question a little?

Thank you
Torisugari
Posts: 1634
Joined: November 4th, 2002, 8:34 pm
Location: Kyoto, Nippon (GMT +9)
Contact:

Post by Torisugari »

I wonder why you need info other than the source. Those files aren't difficult to read, in comparison with the rest of the files in the tree.
treydx
Posts: 4
Joined: March 16th, 2008, 4:11 pm

Post by treydx »

I was trying to avoid using the source if other documentation was available. I didn't have the source downloaded and I have never done any Firefox development, so just reading documentation, instead of downloading the source and reading through it, would have been magnitudes easier for me. It doesn't appear that there is documentation anywhere else, so I will have to figure it out from the source.

Thanks for all the help anyway.

Oh, one last question: is mozilla/netwerk/cache/src where I should be looking for this code? That's the only real cache info I can find in the source tree.
dickvl
Posts: 54145
Joined: July 18th, 2005, 3:25 am

Post by dickvl »

You can access the source online via http://lxr.mozilla.org/
(e.g. http://lxr.mozilla.org/mozilla1.8/sourc ... cheMap.cpp)

So there is no need to download the full source.
murilo123
Posts: 2
Joined: June 11th, 2008, 1:33 pm

Post by murilo123 »

Hi Treydx,
I was searching for exactly the same thing as you....
Unfortunately, I didn't find any documentation. Did you have any luck with anything other than the source code?
Murilo
treydx
Posts: 4
Joined: March 16th, 2008, 4:11 pm

Post by treydx »

Hey murilo123,

I did have some luck. For the most part, I was able to follow along with the SecurityFocus article to get all the info I needed. The hardest part was that the article has some errors in it. The first one was pretty obvious: it says to left-shift where you need to right-shift the bits (or vice versa). The next problem I found is that the bucket size is not static like the article says; it is actually based on the number of records in the map. And I think the article also gives the wrong block size for each of the _CACHE_00x_ files, but that is well commented in the source via the link someone else posted in this thread.

I still have most of my source code in Python if you need some more help. I finished this project a few months ago, so I don't really remember everything else off the top of my head. I do remember having to skip a header in each file--I could never figure out what was stored in the parts I skipped. I think I ended up skipping 276 bytes in the map and 4096 bytes at the start of each cache block file.

Let me know if you need some more help. If you know Python, I can just send you my code (I'm no Python pro, but it works with most of my test cases). Oh, one more thing: I never took the time to merge the metadata and data from the cache files (or to do extensive extension detection--just jpg, gif, and png).
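For anyone following along, the corrections described in this post can be condensed into a short sketch. This is a hedged Python 3 reconstruction, not code from the thread: the mask values and block sizes are assumptions based on the comments in Mozilla's nsDiskCacheMap.h, and are what the SecurityFocus article gets wrong.

```python
# Assumed field layout of one 32-bit data/metadata location dword,
# per the comments in nsDiskCacheMap.h (not per the article):
FILE_SELECTOR_MASK = 0x30000000  # which _CACHE_00x_ file (0 = separate file)
EXTRA_BLOCKS_MASK  = 0x03000000  # blocks spanned, minus one
START_BLOCK_MASK   = 0x00FFFFFF  # first block within the block file
BLOCK_SIZES = (0, 256, 1024, 4096)  # indexed by file selector

def decode_location(loc):
    """Decode one 32-bit location dword from a cache map record."""
    selector = (loc & FILE_SELECTOR_MASK) >> 28  # right shift, not left
    return {
        "file": selector,
        "start": loc & START_BLOCK_MASK,
        "blocks": ((loc & EXTRA_BLOCKS_MASK) >> 24) + 1,
        "blocksize": BLOCK_SIZES[selector],
    }

def bucket_size(record_count):
    """Records per bucket: derived from the map header, not a fixed 256."""
    return record_count // 32
```

For example, a location dword of 0x11000005 decodes to block file 1 (_CACHE_001_), starting at block 5 and spanning two 256-byte blocks.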
murilo123
Posts: 2
Joined: June 11th, 2008, 1:33 pm

Post by murilo123 »

So, did you analyse the source code to work it out? No documentation at all?
I would appreciate it if you could send the code...
You can use murilotito -at- gmail dot com

Thanks!
treydx
Posts: 4
Joined: March 16th, 2008, 4:11 pm

Post by treydx »

I think this is the bulk of the code that will help you. No, I never really found any documentation. There are two or three helpful comments spread across a few files in the source if you need more info. But like I said, I'm no Python pro :)


import os
import struct
import types

#FILE SELECTORS (top bits of a location dword)
# - 0 = separate file on disk
# - 1 = 256 byte block file (_CACHE_001_)
# - 2 = 1k block file (_CACHE_002_)
# - 3 = 4k block file (_CACHE_003_)
CACHE_MAP_FILENAME = "_CACHE_MAP_"
CACHE_BLOCKS = ("", "_CACHE_001_", "_CACHE_002_", "_CACHE_003_")
FILESELECTORMASK = 0x30000000

#INPATH and OUTFOLDER (used below) are configuration set elsewhere in the
#script: the cache directory to read from and the export directory.

def readmap(path):
    """readmap(path)

    Takes a directory or cache map file as input and parses it in
    order to read all of the cached files that are stored in a Mozilla
    or Netscape format (including Firefox).

    Exports all files to a directory located in XXX.
    """

    verbose = False

    #If given a directory, append the default map file name
    if os.path.isdir(path):
        path = os.path.join(path, CACHE_MAP_FILENAME)

    #Open the file pointed to by path
    if os.path.isfile(path):
        if verbose:
            print "Reading Cache Map from: %s" % path
        try:
            #Open the file in binary mode
            mapfile = open(path, 'rb')
        except IOError:
            print "Cache Map could not be opened! Exiting..."
            return False
    else:
        print "Cache Map not found! Exiting..."
        return False

    #Begin parsing the Cache Map file
    try:
        #read, seek, open, etc. all raise IOError
        header = mapfile.read(20)
    except IOError:
        print "Error reading header\n"
        return False

    #Header: cache version, size of cache in bytes, number of entries
    #stored in the cache, dirty flag, number of records
    ver, datasize, entrycount, isdirty, recordcount = struct.unpack(">5I", header)

    if verbose:
        print "Version: %d" % ver
        print "Datasize: %d" % datasize
        print "EntryCount: %d" % entrycount
        print "IsDirty: %d" % isdirty
        print "RecordCount: %d" % recordcount

    #Read the eviction rank array (32 buckets)
    try:
        erank = mapfile.read(4 * 32)
    except IOError:
        print "Error reading Eviction Ranks!\n"
        return False

    #Highest eviction rank of each bucket
    eranks = struct.unpack(">32I", erank)

    if verbose:
        print "Eviction ranks: ", eranks

    #Read the BucketUsage array (32 buckets)
    try:
        bu = mapfile.read(4 * 32)
    except IOError:
        print "Error reading the Bucket Usage!\n"
        return False

    #Number of used entries in each bucket
    bucketusage = struct.unpack(">32I", bu)

    if verbose:
        print "Entries Used in each Bucket: ", bucketusage

    #Sanity check... we should be 0x114 (276) bytes into the file
    whereami = mapfile.tell()
    assert whereami == 276, "Where is end of header? %d\n" % whereami

    #Now read the buckets.
    #32 buckets in the file, recordcount/32 records per bucket,
    #4 (32-bit) ints per record.
    #Record is: Hash Number, Eviction Rank, Data Location, Metadata Location

    #Number of records in a bucket (NOT a static 256 as the article implies)
    bucketsize = recordcount / 32

    recordlist = []

    for i in range(32):
        #Read one bucket: bucketsize records, 4 values/rec, 4 bytes/value
        try:
            start = mapfile.tell()
            bucket = mapfile.read(bucketsize * 4 * 4)
            next = mapfile.tell()
        except IOError:
            print "Error Reading Bucket!\n"
            return False
        if start == next:
            print "Could not read the bucket!"
            return False

        fmt_string = ">" + str(bucketsize * 4) + "I"
        b = struct.unpack(fmt_string, bucket)
        if verbose:
            print b
        numentries = bucketusage[i]
        while numentries > 0:
            a = readbucket(b, numentries)
            recordlist.append(a)
            numentries = numentries - 1

    if verbose:
        print recordlist

    try:
        whereami = mapfile.tell()
        mapfile.seek(0, 2)  #move to EOF
        #Sanity check: we had better already be at EOF
        assert whereami == mapfile.tell(), "Where is the EOF? %d,%d\n" % (whereami, mapfile.tell())
    except IOError:
        print "ERROR! Cache map file read error.\n"
        return False

    #Close the file
    if not mapfile.closed:
        mapfile.close()

    return recordlist

def readbucket(bucket, entry):
    """Given a bucket of records, return the record with index 'entry'"""
    index = 4 * (entry - 1)
    return bucket[index:index + 4]

def noL(a):
    """Remove the L suffix from stringified long integers"""
    if type(a) != types.StringType:
        return a
    if a[-1:] == "L" or a[-1:] == "l":
        return a[0:-1]
    return a

def noX(a):
    """Remove the 0x prefix from hex strings"""
    if type(a) != types.StringType:
        return a
    if a[0:2] == "0x" or a[0:2] == "0X":
        return a[2:]
    return a

def readrecord(record):
    """readrecord(record) -> record_dictionary

    Takes a 4-word record entry and parses the data out of it.

    Returns a dictionary of the record data.
    Keys:
        hash -> hex identifier
        erank -> eviction rank
        dataloc -> dword that encodes the other data properties
        metadataloc -> same as above, except for the metadata
        datablock -> which block file the data is in (0 = separate file)
        metablock -> same ... except for the metadata
        datafile -> the name of the data block file (e.g. _CACHE_001_)
        metafile -> the name of the metadata block file
        datastartblock -> block the data starts on in the data file
        metastartblock -> same ... metadata
        datanumblocks -> number of blocks the data spans in the data file
        metanumblocks -> same
        datablocksize -> size of a block in the file (1: 256, 2: 1024, 3: 4096)
        metablocksize -> same
    """

    rec = {}
    #File names on disk use zero-padded uppercase hex
    hashnumber = "%08X" % record[0]
    evictionrank = int(record[1])
    datalocation = int(record[2])
    metadatalocation = int(record[3])

    rec["hash"] = hashnumber
    rec["erank"] = evictionrank
    rec["dataloc"] = datalocation
    rec["metadataloc"] = metadatalocation

    #Calculate data/metadata locations/files
    #Reference: http://www.securityfocus.com/infocus/1832 is wrong here.
    #A left shift does not make sense -- shift RIGHT to pull the file
    #selector out of the top bits.
    whichdatablockfile = (datalocation & FILESELECTORMASK) >> 28
    whichmetablockfile = (metadatalocation & FILESELECTORMASK) >> 28
    rec["datablock"] = int(whichdatablockfile)
    rec["metablock"] = int(whichmetablockfile)

    datafile = CACHE_BLOCKS[whichdatablockfile]
    metafile = CACHE_BLOCKS[whichmetablockfile]

    #If the selector is 0, the data is stored in a separate file named
    #<hashnumber><type><generationnumber>, where the generation number is
    #the low byte of the location dword as two hex digits
    if datafile == "":
        gen = "%02X" % (datalocation & 0xFF)
        datafile = hashnumber + "d" + gen
    if metafile == "":
        gen = "%02X" % (metadatalocation & 0xFF)
        metafile = hashnumber + "m" + gen

    rec["datafile"] = datafile
    rec["metafile"] = metafile

    #Starting block within the block file (low 24 bits)
    datastartblock = int(datalocation & 0xFFFFFF)
    metastartblock = int(metadatalocation & 0xFFFFFF)
    rec["datastartblock"] = datastartblock
    rec["metastartblock"] = metastartblock

    #Extra-block count (bits 24-25); total blocks = this value + 1
    datanumblocks = int((datalocation & 0x03000000) >> 24)
    metanumblocks = int((metadatalocation & 0x03000000) >> 24)
    rec["datanumblocks"] = datanumblocks
    rec["metanumblocks"] = metanumblocks

    #BLOCKSIZE = (0, 256, 512, 1024)  #<- another error in the reference
    BLOCKSIZE = (0, 256, 1024, 4096)
    datablocksize = BLOCKSIZE[whichdatablockfile]
    metablocksize = BLOCKSIZE[whichmetablockfile]
    rec["datablocksize"] = datablocksize
    rec["metablocksize"] = metablocksize

    return rec


def getdata(rec):
    """Given a record, pull the binary data out of the cache files"""
    verbose = False

    #Open the data file
    path = os.path.join(INPATH, rec["datafile"])
    if verbose:
        print "Path to Data file: %s" % path
    try:
        df = open(path, "rb")
    except IOError:
        if verbose:
            #Too many errors: fail silently unless verbose
            print "Error opening data cache file: %s" % rec["datafile"]
            print " - ", rec
        return False

    #Open the metadata file
    path = os.path.join(INPATH, rec["metafile"])
    if verbose:
        print "Path to Meta Data File: %s" % path
    try:
        mf = open(path, "rb")
    except IOError:
        if verbose:
            print "Error opening meta cache file: %s" % rec["metafile"]
            print " - ", rec
        return False

    #Read the data
    if rec["datablock"] != 0:
        bsize = int(rec["datablocksize"])
        start = int(rec["datastartblock"])
        numbl = int(rec["datanumblocks"]) + 1
        try:
            #Skip past the 4096-byte block file header
            df.seek(4096, 0)

            #Seek to the starting block (1 = relative seek mode)
            df.seek(bsize * start, 1)

            #Read the data (number of blocks * block size)
            data = df.read(bsize * numbl)
        except IOError:
            print "Error reading in data file!"
            print " - ", rec
            return False
    else:
        #Standalone file (i.e. not _CACHE_00X_)
        try:
            #Read the entire file... I am not sure if there are headers
            data = df.read()
        except IOError:
            print "Error reading standalone data file!"
            print " - ", rec
            return False
    df.close()

    #Read the metadata
    if rec["metablock"] != 0:
        bsize = int(rec["metablocksize"])
        start = int(rec["metastartblock"])
        numbl = int(rec["metanumblocks"]) + 1
        try:
            #Skip past the 4096-byte block file header
            mf.seek(4096, 0)

            #Seek to the starting block (1 = relative seek mode)
            mf.seek(bsize * start, 1)

            #Read the metadata
            meta = mf.read(bsize * numbl)
        except IOError:
            print "Error reading in meta data file!"
            print " - ", rec
            return False
    else:
        #Standalone metadata file
        try:
            meta = mf.read()
        except IOError:
            if verbose:
                print "Error reading standalone metadata file!"
                print " - ", rec
            return False
    mf.close()

    #Guess the extension from common signatures (Could use foremost.config???)
    extension = ""  #<- default
    try:
        if issig(data[:6], "474946383761") or issig(data[:6], "474946383961"):
            extension = ".gif"
        if issig(data[:11], "FFD8FFE0XXXX4A46494600"):
            extension = ".jpg"
        if issig(data[:8], "89504E470D0A1A0A"):
            extension = ".png"
        if issig(data[:4], "00000100"):
            extension = ".ico"
        if issig(data[:3], "1F8B08"):
            extension = ".gz"
        if issig(data[:3], "435753") or issig(data[:3], "465753"):
            extension = ".swf"  #CWS / FWS
        if issig(data[:3], "464C56"):
            extension = ".flv"  #FLV is Flash video, not .swf
        if issig(data[:3], "494433"):
            extension = ".mp3"
        if issig(data[:4], "504B0304"):
            extension = ".zip"
        #if issig(data[:8], "D0CF11E0A1B11AE1"):
        #    extension = ".doc"  #<- really any Microsoft Office document!
    except ValueError:
        pass

    #Write the data file out
    OUTFILENAME = "cf" + str(rec["hash"]) + extension
    outplace = os.path.join(OUTFOLDER, OUTFILENAME)
    outfile = open(outplace, "wb")  #write in binary
    outfile.write(data)
    outfile.close()

    #Write the metadata file out
    #Todo: merge this into the data files??
    OUTFILENAME = "cf" + str(rec["hash"]) + ".meta"
    outplace = os.path.join(OUTFOLDER, OUTFILENAME)
    outfile2 = open(outplace, "wb")
    outfile2.write(meta)
    outfile2.close()

    return True
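The paste above calls an issig() helper that wasn't included. A minimal sketch of what it presumably does -- comparing the leading bytes of the data against a hex signature string, with "XX" pairs acting as wildcard bytes (which the JPEG signature needs) -- written here in Python 3:

```python
def issig(data, sig):
    """Return True if the leading bytes of data match the hex signature.
    'XX' pairs in the signature are wildcards that match any byte."""
    if len(data) * 2 < len(sig):
        return False  # not enough data to compare against
    for i in range(0, len(sig), 2):
        pair = sig[i:i + 2]
        if pair.upper() == "XX":
            continue  # wildcard byte: anything matches
        if data[i // 2] != int(pair, 16):
            return False
    return True
```

For instance, issig(data[:11], "FFD8FFE0XXXX4A46494600") matches a JFIF JPEG regardless of the two length bytes after the FFE0 marker.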