Harvesters

If you are a site administrator, you can create entry harvesters that either scan the server file system or fetch web-based resources and automatically ingest them into the repository.

Harvesting files

The File Harvester facility scans the local file system on the server and adds entries to the database for the files it finds. For example, say you have a large directory of data (and/or other) files (e.g., /project/data) that you want RAMADDA to provide access to. The File Harvester can walk the directory tree and add entries to the repository for the files it finds.

The first thing you need to do is to specify in the Site Administration area that it is OK to serve up files from that directory tree. Go to the Admin->Access page in RAMADDA and enter the directory path, e.g., /project/data:

Image 1: File System Access

Next, create a File Harvester by going to the Harvesters tab of the Admin pages and pressing the "New Harvester" button. Specify a name (e.g., "Test"), leave the type as Local Files and hit create:

Image 2: Create a new File Harvester

This will take you to the Harvester Edit form. The minimal configuration you need to do here is to enter the file system directory to scan and the RAMADDA folder to add the harvested entries to. In this example we are harvesting the directory tree /project/data and we are adding the entries into the RAMADDA folder RAMADDA/Case Studies/Test:

Image 3: Harvester Edit Form

When you have entered this information, press "Change", which applies the changes and takes you back to the Harvester Edit Form. You can start this Harvester with the "Start" link in the upper right. Alternatively, you can go to the main Harvesters page (the top Harvesters tab), which lists your Harvester. From there you can start the Harvester, monitor its progress and stop it. When it is finished the Stop link will change to Start.

With the above settings the directory tree of your file system will be used to create the folder hierarchy in RAMADDA. Every file the Harvester finds will result in a file entry. You can run a Harvester any number of times and it will only add the new files that it has not seen before.
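
The incremental behavior described above can be illustrated with a minimal sketch. This is not RAMADDA's implementation: the function name harvest_new_files and the in-memory seen set are assumptions made for illustration, standing in for whatever record the harvester keeps of files it has already ingested.

```python
import os

def harvest_new_files(root, seen):
    """Walk `root` and return the files not already recorded in `seen`.

    `seen` is a set of file paths; it is updated in place so that a
    later run only reports files added since the previous run.
    """
    new_files = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                new_files.append(path)
    return new_files
```

Calling the function twice with the same set returns an empty list the second time unless new files have appeared in the tree.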

7.3.0 Run Settings

The form has settings for running the harvester. When you are first creating a harvester it may take some time to figure out just what you are harvesting and the name and folder settings for the repository entry. So, it's good to turn on test mode. In test mode, entries are not added to the repository when you run the harvester. Rather, up to "Count" files will be found and the results will be listed in the "Information" section of the harvester page.

The "Active on startup" flag, when set, results in the harvester being started when the repository starts up. The "Run continually" flag makes the harvester run continually. It uses the "Every" setting to determine the pauses between runs. You can choose "Absolute" to pause N minutes between runs. Or, you can choose "Minutes" or "Hourly" to have it run relative to the hour or the day, respectively, e.g. "3 hourly" will run at 0Z, 3Z, 6Z, 9Z, etc.

For example, if you know you are getting data files in real-time that are coming in every 30 minutes you could set your harvester to run in "Absolute" mode every 15 minutes. If you had a Web harvester that is fetching images you might want to use an "Hourly" setting to get the image at some fixed interval (e.g., 0Z, 6Z, 12Z, 18Z, etc).
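
The difference between the absolute and the relative settings can be sketched as follows. This is only an illustration of the scheduling idea, not the harvester's code: the function next_run and the mode strings are hypothetical, and the sketch assumes that the interval divides evenly into 60 minutes ("minutes" mode) or 24 hours ("hourly" mode).

```python
from datetime import datetime, timedelta, timezone

def next_run(now, mode, every):
    """Return the time of the next run after `now`.

    mode "absolute": simply pause `every` minutes.
    mode "minutes":  run on multiples of `every` minutes past the hour.
    mode "hourly":   run on multiples of `every` hours past midnight.
    """
    if mode == "absolute":
        return now + timedelta(minutes=every)
    if mode == "minutes":
        base = now.replace(minute=0, second=0, microsecond=0)
        step = timedelta(minutes=every)
    elif mode == "hourly":
        base = now.replace(hour=0, minute=0, second=0, microsecond=0)
        step = timedelta(hours=every)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Number of whole steps already elapsed, plus one for the next slot.
    steps = (now - base) // step + 1
    return base + steps * step

now = datetime(2009, 1, 1, 4, 10, tzinfo=timezone.utc)
print(next_run(now, "absolute", 15))
print(next_run(now, "hourly", 3))
```

With a start time of 04:10Z, the absolute setting gives a next run 15 minutes later, while the "3 hourly" setting lands on the next three-hour boundary of the day.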

7.3.1 Files Settings

Under "Look for files" you specify a directory on the server file system to scan and a regular expression to match on the file name. The repository will recursively scan the directory tree and add any files that match the pattern to the repository.

The regular expressions used are somewhat extended in that you can name sub-expressions of the regular expression and use the matched text for metadata and other information when creating the entry in the repository. For example, a very common case is to have a date/time embedded in the filename. So, you could have in your regular expression something of the form:

.*data_(fromdate:\d\d\d\d\d\d\d\d_\d\d\d\d)\.nc

This would match any files of the form:

data_yyyymmdd_hhmm.nc

The "(" and ")" define the sub-expression (just as in a normal regular expression). The "fromdate:" prefix is the special extension that tells the harvester to use the text matched by that sub-expression to set the fromdate field of the repository entry.

The date format that is used is defined in the Date Format field and follows the Java date format conventions.
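
To experiment with such a pattern outside RAMADDA, the following sketch mimics the idea in Python. It is an approximation under stated assumptions, not the harvester's parser: the helper names are hypothetical, the "(name:" prefix is rewritten to Python's named-group syntax, the whole path is required to match, and the Python format "%Y%m%d_%H%M" stands in for a Java date format such as yyyyMMdd_HHmm.

```python
import re
from datetime import datetime

# Matches the opening of an extended group such as "(fromdate:".
EXTENDED_GROUP = re.compile(r"\((\w+):")

def to_named_groups(pattern):
    """Rewrite "(name:...)" groups as Python "(?P<name>...)" groups."""
    return EXTENDED_GROUP.sub(r"(?P<\1>", pattern)

def extract_fromdate(pattern, path, date_format):
    """Return the parsed fromdate for a matching path, or None if no match."""
    match = re.fullmatch(to_named_groups(pattern), path)
    if match is None:
        return None
    return datetime.strptime(match.group("fromdate"), date_format)

# A pattern for files named like data_yyyymmdd_hhmm.nc (hypothetical path).
pattern = r".*data_(fromdate:\d\d\d\d\d\d\d\d_\d\d\d\d)\.nc"
when = extract_fromdate(pattern, "/project/data/data_20090115_0630.nc",
                        "%Y%m%d_%H%M")
print(when)
```

Paths that do not match the pattern yield None, which corresponds to files the harvester would skip.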

If you are creating entries of a certain type that has a number of attributes you can extract the attribute values using this extended regular expression technique. For example, if you had an entry with two attributes attr1 and attr2 and your files were of the format:

<attr1>_<attr2>.csv

Your regular expression would be:

(attr1:[^/]+)_(attr2:[^/]*)\.csv

This says that attr1 is any number of characters except the slash ("/"). The slash exclusion is used to exclude the file path as the full file path is used when matching patterns. The value for attr2 follows the "_" and is any number of characters except a slash.
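
One consequence of excluding only the slash is worth seeing in action: when a file name contains more than one underscore, greedy matching puts everything up to the last underscore into attr1. The sketch below uses Python named groups as a stand-in for the harvester's "attr1:" and "attr2:" prefixes; the function name and the sample paths are hypothetical.

```python
import re

# Python equivalent of the attribute pattern, anchored at the end of the path.
ATTR_PATTERN = re.compile(r"(?P<attr1>[^/]+)_(?P<attr2>[^/]*)\.csv$")

def extract_attributes(path):
    """Return {"attr1": ..., "attr2": ...} for a matching path, else None."""
    match = ATTR_PATTERN.search(path)
    return match.groupdict() if match else None

print(extract_attributes("/project/data/site1_temp.csv"))
print(extract_attributes("/project/data/station_a_temp.csv"))
```

If attr1 must never contain an underscore, excluding it from the first character class (for example [^/_]+) would split at the first underscore instead.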

7.3.2 Entry Creation

When creating an entry the harvester needs to know the folder to put it under, its name and its description. You specify templates for these that can contain a set of macros (see below). Note: this is where the test mode described above is useful. Sometimes it takes a while to figure out just what you want in terms of folder structure and entry names.

To define the folder you need to select an existing base folder and then optionally specify a folder template. The folder template is used to automatically create a new folder if needed. So for example, if your base folder was: Top/Data and your Folder Template was: Ingested/Satellite then the result folder would be:

Top/Data/Ingested/Satellite

The Harvester would create the Ingested and the Satellite folders as needed.

The name, description and folder templates can all contain the following macros. Note: The different date fields (e.g., create_, from_ and to_) refer to the create date/time, the from data time (which defaults to the create date unless specified in the pattern) and the to data time.

${filename}	The file name (not the full path)
${fileextension}	The file extension
${dirgroup}	See below
${create_date}, ${from_date}, ${to_date}	The full formatted date string
${create_day}, ${from_day}, ${to_day}	The numeric day of the month
${create_week}, ${from_week}, ${to_week}	The numeric week of the month
${create_weekofyear}, ${from_weekofyear}, ${to_weekofyear}	The numeric week of the year
${create_month}, ${from_month}, ${to_month}	The numeric month of the year
${create_monthname}, ${from_monthname}, ${to_monthname}	The month name
${create_year}, ${from_year}, ${to_year}	The numeric year

The dirgroup macro expands to the parent directories of the data file, up to but not including the main directory path being searched. For example, if you are searching under a directory called "/data/idd" and that directory holds the files:

/data/idd/dir1/data1.nc
/data/idd/dir1/dir2/data2.nc

Then when ingesting the data1.nc file its dirgroup value would be:

dir1

When ingesting the data2.nc file its dirgroup value would be:

dir1/dir2
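
The dirgroup value is simply the file's parent directory expressed relative to the directory being searched. A minimal sketch, assuming POSIX-style paths; the function name is hypothetical, and returning an empty string for a file that sits directly in the searched directory is an assumption, since that case is not described here.

```python
import posixpath

def dirgroup(root, file_path):
    """Return the parent directories of `file_path` relative to `root`."""
    rel_dir = posixpath.relpath(posixpath.dirname(file_path), root)
    # A file directly under `root` has no intermediate directories.
    return "" if rel_dir == "." else rel_dir

print(dirgroup("/data/idd", "/data/idd/dir1/data1.nc"))
print(dirgroup("/data/idd", "/data/idd/dir1/dir2/data2.nc"))
```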

Another common way of defining the folder is to use the date macros. For example a folder template of the form:

${from_year}/${from_monthname}/Week ${from_week}

Would result in folders like:

2009/January/Week 1
2009/January/Week 2
...
2009/March/Week 1
2009/March/Week 2

You can also name the entries using the macros. So, using the above date-based folder template you could then have a Name template that incorporates the formatted date:

Gridded data - ${from_date}
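
To preview what a template will expand to, the macro substitution can be approximated with Python's string.Template, which happens to use the same ${name} syntax. This is a sketch only: the helper date_macros is hypothetical, the week of the month is assumed to be counted in blocks of seven days from the first of the month, and the "%Y-%m-%d %H:%M" date string is an assumed format (the harvester uses the configured Date Format). Month names come from the current locale, which is English by default.

```python
from datetime import datetime
from string import Template

def date_macros(prefix, when):
    """Build the date macros (e.g. from_year) for one date field."""
    return {
        f"{prefix}_date": when.strftime("%Y-%m-%d %H:%M"),  # assumed format
        f"{prefix}_day": str(when.day),
        f"{prefix}_week": str((when.day - 1) // 7 + 1),  # assumed definition
        f"{prefix}_month": str(when.month),
        f"{prefix}_monthname": when.strftime("%B"),
        f"{prefix}_year": str(when.year),
    }

macros = date_macros("from", datetime(2009, 1, 10, 6, 30))
folder = Template("${from_year}/${from_monthname}/Week ${from_week}").substitute(macros)
name = Template("Gridded data - ${from_date}").substitute(macros)
print(folder)
print(name)
```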

The "Move file to storage" checkbox determines whether the file is moved from its initial location to the RAMADDA storage area.
Note: If the file is not moved to the storage area then one of the data directories the file lies under needs to be added to the list of file system directories in the Admin->Access area.

7.3.3 Web Harvesters

The Web Harvesters work the same way as the File Harvesters but they fetch a URL (e.g., an image) every time they run. You can also define more than one URL to fetch. The basic Run settings, Folder and entry creation mechanisms are the same as described above.