Introduction
Metadata collections
Configuring collections
Installation
Administration
Starting and stopping the daemon
Sending update requests
Logs
Database housekeeping
The update daemon keeps the ACIS database tables up-to-date with the metadata collections that an ACIS service uses. I also sometimes call it the “RI daemon”. RI stands for RePEc-Index, which is a historical name for the modules behind the daemon.
The daemon reads data about documents, software components, series and institutions from the collections’ data files and inserts it into the corresponding tables. If a record disappears from the underlying metadata collection, the RI daemon deletes the corresponding data from the tables.
It is a daemon because it does nothing on its own initiative. It sits there and waits for somebody to send it a request. Update requests are, basically, orders for work. The only way the daemon communicates with the outside world is by writing a log.
Sometimes ACIS itself sends an update request to the daemon. Sometimes you might want to send an update request to the daemon.
Collections are how ACIS thinks about your metadata. You may have zero, one or more separate collections. (If you have zero, ACIS won’t be of much use.) Each collection has an identifier and a type, and is stored in data files somewhere in the local filesystem. These things are specified in the collections’ configuration (see below).
Collection identifiers must be unique for any particular
ACIS service installation. “ACIS” is a reserved
collection identifier.
Collection type must be “AMF” if your collection
will consist of AMF files. (For RePEc’s ReDIF it is
“RePEcRec”.) Other collection types may be defined
in the future.
A collection consists of any number of data files with stable names, grouped into any directory structure that fits. The directory structure cannot contain circular symbolic links. The update daemon must have file-system access to the files and directories of the collection.
Each file of a collection may contain zero, one or many data records. Each data record must have a globally unique identifier. Identifiers are treated in a case-insensitive manner: they must still be unique after being lowercased.
If two or more records in a collection have the same identifier, the update daemon excludes them all from the ACIS database.
All other records are processed and their data is saved to the ACIS database.
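To see which identifiers would collide under this rule, one can lowercase a list of identifiers and look for repeats. The sketch below assumes the identifiers have already been extracted into a plain list, one per line; the helper name duplicate_ids and the sample identifiers are made up for illustration and are not part of ACIS.

```sh
# duplicate_ids: read identifiers from stdin, one per line, and print
# each identifier (lowercased) that occurs more than once.
# Hypothetical helper, not part of ACIS.
duplicate_ids() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Sample identifiers, invented for this example. The first two differ
# only in case, so they count as the same identifier.
printf '%s\n' 'repec:abc:wpaper:001' 'RePEc:ABC:wpaper:001' 'repec:abc:wpaper:002' | duplicate_ids
```

Every identifier this prints belongs to records that the daemon would leave out of the database.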
For AMF collections only files with .amf.xml name
extension (case-insensitive) are treated as data files;
other files are ignored.
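A quick way to preview which files of an AMF collection match this naming rule is a case-insensitive find. This only approximates what the daemon does: it does not follow symbolic links, and the helper name list_amf_files is an assumption of this example, not an ACIS tool. The -iname test is available in GNU and BSD find.

```sh
# list_amf_files DIR: print regular files under DIR whose name ends in
# .amf.xml, ignoring case. Hypothetical helper, not part of ACIS.
list_amf_files() {
    find "$1" -type f -iname '*.amf.xml' | sort
}

# Example call (substitute the home directory of your collection):
#   list_amf_files /path/to/collection
```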
If the data of a collection has changed, you want ACIS to check and process it to reflect these changes in the database. The bin/updareq utility is for this. If your collection changes often, you will want to run this utility at regular intervals.
You configure the collections with the
metadata-collections parameter of the
main.conf file. For each collection you put the
collection identifier into this parameter and specify
its type and the path to it with the metadata-X-type
and metadata-X-home parameters, where
X is the collection identifier.
If you have several collections, separate their identifiers in metadata-collections with a space.
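The naming scheme is easy to get wrong when there are several collections, so here is a small sketch that prints the parameter names expected for a given metadata-collections value. It says nothing about the syntax of main.conf itself; “papers” is an invented identifier and expected_params is a hypothetical helper.

```sh
# Identifiers as they would appear in metadata-collections, separated
# by spaces. "bliss" and "papers" are sample identifiers.
collections="bliss papers"

# expected_params IDS: for each identifier in the space-separated list
# IDS, print the names of its type and home parameters.
expected_params() {
    for id in $1; do
        echo "metadata-${id}-type"
        echo "metadata-${id}-home"
    done
}

expected_params "$collections"
```

For the two sample identifiers this prints four names: metadata-bliss-type, metadata-bliss-home, metadata-papers-type and metadata-papers-home.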
“ACIS” is a reserved collection identifier; the system
uses it for its own generated data.
You use collection identifiers when you request a data update of a collection with the bin/updareq utility.
Right now ACIS understands data in two metadata formats: RePEc’s ReDIF and AMF. Support for other metadata formats and collection structures can be added. This is not very difficult to develop, provided that they are file-based or have a fairly simple way to map records to files.
The parsed collections configuration is written into the
{HOME}/RI/collections file.
Normally, the update daemon is installed when you install ACIS and no special installation is needed.
If you need a separate installation of the update daemon, or if you want to upgrade it manually, you may follow these instructions. Grab the latest RePEc-Index package from http://acis.openlib.org/code/. Unpack the package and then run:
$ RePEc-Index-0.XX/install.sh HOME
where HOME is the path to your ACIS home dir (or the directory you want to install the daemon to).
The actual daemon script will be installed to
bin/control_daemon.pl. But you do not normally need
to call it directly. ACIS includes everything you need to
work with the daemon, such as scripts to start and stop it
and to send update requests to it.
I suggest that for a serious ACIS installation you make a private copy of Berkeley DB. This will protect you from system-wide software updates. Read why and how to do that.
Use bin/rid start and bin/rid
stop to start and stop the update daemon, respectively.
You had better run the daemon all the time while ACIS is
working. If for some reason it was not running for some
time, it is not a catastrophe. Some update requests may be
lost, but generally you can recover by running
“bin/updareq ACIS /”.
bin/updareq COLLECTION PATH [TOO_OLD]
This sends a request to update PATH in collection
COLLECTION. TOO_OLD is a time in seconds.
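Since TOO_OLD is given in seconds, a value thought of in days has to be multiplied by 86400 first. The sketch below does that conversion and prints, rather than runs, the resulting command line. The collection name “bliss” and the two-day threshold are just sample values, days_to_seconds is a hypothetical helper, and the path / is assumed to address the whole collection, as in “bin/updareq ACIS /”.

```sh
# days_to_seconds DAYS: convert whole days to seconds
# (86400 seconds per day). Hypothetical helper, not part of ACIS.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Print the command for an update request on collection "bliss" that
# also reprocesses files last processed more than 2 days ago.
echo "bin/updareq bliss / $(days_to_seconds 2)"
```

This prints “bin/updareq bliss / 172800”; run the printed command from your ACIS home directory to actually send the request.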
If a file was last processed more than TOO_OLD
seconds ago, the daemon will process it again (even if it
has not changed since). By default, TOO_OLD is
86400*12 seconds, which means 12 days.
The main log the daemon keeps is in RI/daemon.log. It is
the general log of incoming requests and of which processing
channel took each of them for processing.
All details of processing particular requests go into logs
RI/update_ch0.log, RI/update_ch1.log, …
RI/update_ch5.log. Each of these corresponds to a
processing channel.
These logs record what is being done: which files are read, what records are found in them and so on. If there were any problems with processing data, they will be logged there.
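When something looks wrong, it helps to glance at the tail of all of these logs at once. A minimal sketch follows, assuming the log file names given above; show_recent_logs is a hypothetical helper, not something ACIS ships.

```sh
# show_recent_logs HOME [N]: print the last N lines (default 20) of
# daemon.log and of each update_ch*.log found under HOME/RI.
# Hypothetical helper, not part of ACIS.
show_recent_logs() {
    home=$1
    lines=${2:-20}
    for log in "$home"/RI/daemon.log "$home"/RI/update_ch*.log; do
        # Skip names that do not exist (including an unmatched pattern).
        [ -f "$log" ] || continue
        echo "== $log"
        tail -n "$lines" "$log"
    done
}

# Example call (substitute your ACIS home directory):
#   show_recent_logs /opt/ACIS 50
```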
Recent versions of the update daemon use the Berkeley DB Transactional Data Store for their database. This makes the daemon work a little slower (compared to the plain Data Store), but it gives us parallel processing and great fault protection.
The database files are stored in {HOME}/RI/data.
General files (the so-called “database environment”) are stored
in this dir, and files for specific collections are stored
in subdirs of it. For example, data for a collection called
“bliss” would live in {HOME}/RI/data/bliss.
When data is added or modified in the database, the Berkeley
DB library creates sequentially numbered log files, e.g.
RI/data/log.0000000001. If you have lots of data
going in and getting modified in the database, the number of
these files can grow fast and they will occupy a really huge
amount of disk space.
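To keep an eye on this growth, one can simply count the log files in the database directory. This is a sketch that assumes the log.NNNNNNNNNN naming shown above; count_bdb_logs is a hypothetical helper, not part of ACIS or Berkeley DB.

```sh
# count_bdb_logs DIR: print how many Berkeley DB log files
# (names starting with "log." followed by a digit) are in DIR.
# Hypothetical helper, not part of ACIS or Berkeley DB.
count_bdb_logs() {
    n=0
    for f in "$1"/log.[0-9]*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example call (substitute your own database directory):
#   count_bdb_logs /opt/ACIS/RI/data
```

A count that keeps climbing between runs suggests that the log files are not being cleaned up.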
To solve this you need to do two things:
initiate checkpoints regularly;
delete unnecessary log files.
You can read about it in the Berkeley DB documentation, but basically, you can do it this way:
db_checkpoint -1 -h {HOME}/RI/data
db_archive -d -h {HOME}/RI/data
The db_checkpoint and db_archive utilities are
from the Berkeley DB library package. Some care is needed
here, because you may have these utilities installed
system-wide, and those may belong to a different version of
BDB than the one you actually use. If you have installed a
private copy of Berkeley DB, then
you’ll have to refer to these utilities by full path.
For instance, in my case, I had to use
/home/ivan/lib/bdb/bin/db_checkpoint and
/home/ivan/lib/bdb/bin/db_archive. So, I have a
cron entry which runs a little script like this:
#!/bin/sh
/home/ivan/lib/bdb/bin/db_checkpoint -1 -h /opt/ACIS/RI/data
/home/ivan/lib/bdb/bin/db_archive -d -h /opt/ACIS/RI/data
Here are some documentation links, just in case you need to know more:
If you want to know more about the daemon, you may look into the internals doc.
Generated: Fri Jul 20 20:35:11 2007
ACIS project, acis@openlib.org