Vessel and contract tracker

Posted: Wed Oct 13, 2010 8:15 am
by Daniel Wee
Ports info now in MySQL.

I'll have to see about a method to get the MMSI number, if available, instead of the IMO. Vessel dimensions may be available in the e-mail, so there may be no need to maintain a full vessels database.

Fetchmail is pulling the mail off the aggregator and the parser seems to be extracting the headers and body correctly for the moment. May need to see what the actual mail encoding looks like for further parsing. I also need to configure fetchmail to delete downloaded mail, and the parser to delete the processed admin file.


Re: Vessel and contract tracker

Posted: Fri Oct 15, 2010 9:55 pm
by Daniel Wee
I think to parse these mails we will need a multi-state tracker, in that we can track multiple subjects and know when the subject has changed.

It also needs to be able to recognize different keywords or formats, such as port names, ship names, date formats, units and so on. There are a variety of delimiting styles so we need to provide for that as well. There are also a lot of abbreviations in use which need to be identified.

For countries, we will need a country list with a number of abbreviations.

For ships, we may need to check against the vessel tracker to see if such a ship exists and, if so, extract the pertinent information, the IMO and MMSI in particular.

Context identification can be triggered in a number of ways:-
1. keywords such as M/V, MV, M.V., DWT, DWAT, BLT, SPEED, FLAG, OPEN
2. units such as MT, MTS, K which follow numbers
3. special types such as JAN to DEC
4. numerical formats such as dates, or date ranges such as 22-25 OCT
5. countries, ports, vessel names
6. paragraphing
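
A rough sketch of how the first few trigger types above might be recognized token by token. The table contents and the names classify and classify_line are assumptions for illustration only; the real lists would come from the dictionary, and this sketch ignores position (e.g. that units follow numbers).

```python
import re

# Illustrative trigger tables (assumed); real ones would live in the dictionary.
KEYWORDS = {"M/V", "MV", "M.V.", "DWT", "DWAT", "BLT", "SPEED", "FLAG", "OPEN"}
UNITS = {"MT", "MTS", "K"}
MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
DAY_RANGE = re.compile(r"^\d{1,2}-\d{1,2}$")   # e.g. 22-25
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")      # e.g. 24880, 24,880, 24.880


def classify(token):
    """Return the trigger type of a single token, or 'unknown'."""
    t = token.upper()
    if t in KEYWORDS:
        return "keyword"
    if t in MONTHS:
        return "month"
    if t in UNITS:
        return "unit"
    if DAY_RANGE.match(t):
        return "day_range"
    if NUMBER.match(t):
        return "number"
    return "unknown"


def classify_line(line):
    """Split on whitespace and tag every token."""
    return [(tok, classify(tok)) for tok in line.split()]
```

Countries, ports and vessel names would need table lookups rather than fixed sets, so they are left as "unknown" here.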

There should also be contextual prediction to expect certain types of keywords within a given context. The completeness of the context or sub-context must be trackable as well so if the item has been filled, it should be duly noted. The order should have some priority. This means that a proper dictionary needs to be built and words assigned meanings. Furthermore, we need some loose syntax rules to govern the parsing process. We may need some fuzzy matching routines to allow for misspellings or unknown but guessable keywords.
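
For the fuzzy matching part, one possible starting point is the standard library's difflib. This is only a sketch: the KNOWN list, the guess_keyword name and the 0.7 cutoff are assumptions, and a real version would draw its candidates from the dictionary.

```python
import difflib

# Assumed sample of dictionary keywords.
KNOWN = ["DWT", "DWAT", "SPEED", "FLAG", "OPEN", "DRAFT"]


def guess_keyword(word, cutoff=0.7):
    """Return the closest known keyword, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.upper(), KNOWN, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff would need tuning against real mail, since short abbreviations are easy to confuse with one another.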

Maybe the parser can build up translation templates/rules from successful parsings and this can help with dealing with future updates. This way we can know that a certain set of rules has higher priority within a known format that has been seen previously. These templates can be identified with senders of the e-mails.

Vessels can be looked up on:- ... 47918.html

Then with the MMSI number, they can be located and more info gained from:- ... =538003369

We need to find a place where we can directly post the search information.


Re: Vessel and contract tracker

Posted: Fri Oct 15, 2010 10:29 pm
by Daniel Wee
Actually the marine tracker can search for vessels as well with the following query:- ... &B1=Search

You need at least the FLAG to verify the result though. And then with the MMSI you can get the other information and cross-check against the parsed data.


Re: Vessel and contract tracker

Posted: Sun Oct 17, 2010 9:36 pm
by Daniel Wee
The question here is whether to implement the dictionary in MySQL or if it should be hard-coded into the program. The former allows greater flexibility and results in a smaller main program but the latter will run faster (though taking up more memory which might limit the dictionary size).

The dictionary would consist of, roughly:-

1. word
2. word type - key parameter, units
3. context

So, DWAT might be a word, with a word type of DWT. The context, in this case, is also DWT. And the unit MT is associated with the DWT context? Hmm... perhaps the context needs a separate table which associates units to itself. In this way you can have different contexts share the same units. So a contextual category would be:-

1. context
2. associated units

Thus given a unit, you can check for all possible contexts using that unit - assuming that the context has not been ascertained.
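
As a sketch of the two tables, here they are as in-memory Python structures rather than MySQL. All entries, and the CARGO, LENGTH and BEAM contexts, are made up for illustration.

```python
# word -> (word type, context); None means the context is not fixed by the word.
DICTIONARY = {
    "DWAT": ("key", "DWT"),
    "DWT": ("key", "DWT"),
    "MT": ("unit", None),
    "MTS": ("unit", None),
}

# context -> associated units; several contexts may share a unit.
CONTEXT_UNITS = {
    "DWT": {"MT", "MTS", "K"},
    "CARGO": {"MT", "MTS"},
    "LENGTH": {"M"},
    "BEAM": {"M"},
}


def contexts_for_unit(unit):
    """List every context that accepts the given unit."""
    u = unit.upper()
    return sorted(c for c, units in CONTEXT_UNITS.items() if u in units)
```

In MySQL the second structure would probably be a join table of (context, unit) rows, so the same lookup becomes a single SELECT on the unit column.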

In terms of parsing, there are several formats that need to be dealt with and it might be possible to capture the syntax in a set of rules perhaps. This may require some rather flexible rules though. One might, for example, have:-

context value unit

which means:-

context=value (units)

which might further be generalized as:-

A nnn B

which, if nnn qualifies as a numerical value, then fits the translation of:-

A=nnn (using B units)

For this to hold, nnn must be a value and B must be a unit. It may be that B is a non-unit, in which case it is excluded from the sub-sentence.
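
A minimal sketch of the "A nnn B" rule, assuming the input has already been split into tokens. The UNITS set and the function name are placeholders.

```python
import re

UNITS = {"MT", "MTS", "K", "M", "KTS"}          # assumed sample of known units
NUMBER = re.compile(r"^\d+(?:[.,]\d+)?$")


def match_context_value_unit(tokens):
    """Match 'A nnn [B]' at the start of tokens.

    Returns (context, value, unit), with unit None when B is missing or is
    not a known unit, or None when nnn is not numeric.
    """
    if len(tokens) < 2 or not NUMBER.match(tokens[1]):
        return None
    unit = None
    if len(tokens) > 2 and tokens[2].upper() in UNITS:
        unit = tokens[2].upper()
    return (tokens[0].upper(), tokens[1], unit)
```

The value is kept as a string here because its numeric interpretation depends on the context.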

But then you also have an open ended type of phrasing such as:-

A/B/C [*] x/y/z

which translates to A=x, B=y and C=z without actually giving any direct contextual information. The units would have to be drawn out of identifying the contexts of A, B and C. If for some reason these cannot be identified, perhaps a fuzzy match function will make a best guess.
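
The slash-paired form could be sketched like this, assuming exactly two slash-separated groups on the line with anything (or nothing) between them. pair_slashed is a hypothetical helper name.

```python
import re


def pair_slashed(line):
    """Pair 'A/B/C ... x/y/z' into {'A': 'x', 'B': 'y', 'C': 'z'}.

    Returns None unless the line holds exactly two slash groups of equal size.
    """
    groups = re.findall(r"\S*/\S*", line)
    if len(groups) != 2:
        return None
    keys = groups[0].split("/")
    values = groups[1].split("/")
    if len(keys) != len(values):
        return None
    return dict(zip(keys, values))
```

Identifying what A, B and C mean, and hence their units, would still be a dictionary lookup on the returned keys.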


Re: Vessel and contract tracker

Posted: Sun Oct 17, 2010 10:33 pm
by Daniel Wee
The actual parsing of numerical values is also contextually constrained. There would have to be at least two levels of parsing. The first level simply identifies the token's boundaries and whether it is a number of some sort. If it is, the actual value depends on the context. For example, for DWT, 24.880 and 24,880 and 24880 would all translate to 24,880 MT. For draft, you can't really have 24880, so the moment you see this number, you should know that it cannot be the draft context. Simply by looking at 24880, you know it has to be something else.
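
One way to sketch the second level: generate every plausible reading of the raw number, then keep only the readings that fit the context. The plausibility limits below are invented for illustration and would need to be replaced with real ones.

```python
# Hypothetical plausibility limits per context: DWT in tonnes, draft in metres.
PLAUSIBLE = {"DWT": (500.0, 500000.0), "DRAFT": (1.0, 30.0)}


def candidates(raw):
    """All readings of raw: separators as thousands marks, or as a decimal point."""
    readings = set()
    stripped = raw.replace(",", "").replace(".", "")
    if stripped.isdigit():
        readings.add(float(stripped))
    try:
        readings.add(float(raw.replace(",", ".")))
    except ValueError:
        pass  # e.g. more than one separator, so no decimal reading
    return readings


def interpret(raw, context):
    """Return the single reading that fits the context, else None."""
    lo, hi = PLAUSIBLE[context]
    fits = [v for v in candidates(raw) if lo <= v <= hi]
    return fits[0] if len(fits) == 1 else None


def possible_contexts(raw):
    """Contexts for which raw has exactly one plausible reading."""
    return sorted(c for c in PLAUSIBLE if interpret(raw, c) is not None)
```

Under these limits all three spellings resolve to the same tonnage for DWT, and 24880 is rejected outright for draft. A value with two plausible readings comes back as None and would need other clues.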

We need a way of keeping track of the various possible contexts and then have a way of recursing through the tree to narrow down the possibilities. Along the way, unlikely contexts would be eliminated. Given that we have a finite number of contexts, we could use an array for this, or even a bit-mapped flag system (which could be rather limiting though.) We could define a struct to represent the various contexts with a flag.
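
The bit-mapped flag idea can be sketched with Python's enum.Flag (a C version would use a struct or an unsigned int in the same way). The four contexts listed are just examples.

```python
from enum import Flag, auto


class Ctx(Flag):
    """One bit per contextual category (example set only)."""
    DWT = auto()
    DRAFT = auto()
    SPEED = auto()
    LENGTH = auto()


ALL = Ctx.DWT | Ctx.DRAFT | Ctx.SPEED | Ctx.LENGTH


def eliminate(possible, ruled_out):
    """Clear the ruled-out bits from the set of still-possible contexts."""
    return possible & ~ruled_out
```

Starting from ALL and calling eliminate as evidence arrives leaves the surviving candidates; a single remaining bit means the context is settled.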


Re: Vessel and contract tracker

Posted: Sun Oct 17, 2010 11:26 pm
by Daniel Wee
The major query target would be a group of contexts that the parser is looking to fill. This might be:-


and so on, with each of these representing a contextual category.

There will also be some conjunctions and prepositions that need to be addressed. Some keywords will switch contexts as well, such as OPEN or M/V or RANGE. Dates could require special handling.


Re: Vessel and contract tracker

Posted: Mon Oct 18, 2010 9:48 am
by Daniel Wee
There are also wider contexts, as opposed to the local context, which is confined to words in the immediate vicinity (neighbouring or one removed). The wider context looks at the entire sentence to see what kind of sentence it is. In this way it can determine the meaning of nouns such as vessel names at the start of the sentence. For example, OPEN triggers a wider context of vessel names near the beginning.

Special section denoters at the beginning of each line could have a contextual function as well - such as "1)" or "1." or "A". This can be rather tricky though.

Context delimiters are not always the EOL, as some information spills over to the next line. In most cases, though, EOL does signify a new context. Where you might get "RANGE 22-26" and OCT on the next line, there needs to be the ability to carry through the EOL. This may be signalled by the fact that the month is mandatory in a date field triggered by RANGE.

Date ranges are typically two date fields, the start and the end. Unless specified, the assumed year is the current year. We could do that for the month as well.
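
A sketch of a RANGE parser that carries through the EOL and falls back to the current month and year. The regular expression and the parse_range name are assumptions; today is passed in so the defaults are explicit.

```python
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# \s+ also matches a newline, so the month may sit on the following line.
RANGE_RE = re.compile(r"RANGE\s+(\d{1,2})-(\d{1,2})(?:\s+([A-Z]{3}))?")


def parse_range(text, today):
    """Return (start, end) dates for 'RANGE dd-dd [MON]', or None."""
    m = RANGE_RE.search(text.upper())
    if not m:
        return None
    start_day, end_day = int(m.group(1)), int(m.group(2))
    # Fall back to the current month if none (or no valid one) was given.
    month = MONTHS.get(m.group(3), today.month)
    return (date(today.year, month, start_day),
            date(today.year, month, end_day))
```

Ranges that cross a month or year boundary are not handled here and would need the two date fields parsed separately.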


Re: Vessel and contract tracker

Posted: Mon Oct 18, 2010 10:10 am
by Daniel Wee
In word parsing, we can't always assume space or EOL as the delimiter. Some words include non-alphanumeric characters such as "Wini Port (Nusa Tengarra Timur)". Correct parsing is crucial to the efficiency of subsequent analysis. In these cases, the limit or boundary can be indicated if a new context is triggered, or a recognized word/unit/symbol is encountered.

In the above case - it would appear that "Wini Port" is the root name to be used for searches (including a partial search using CONTAINS) but if the search should fail, the fallback would be to take the entire field up to the next known trigger.

Multiple spaces can also act as a delimiter.

Variations are also found in nouns, for example, HONG KONG, HONGKONG, HKG, HK, which may require a space-less compare. We could create a new field in the table where all the spaces have been stripped for this purpose.
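
The space-less compare might look like the following, with the ALIASES dict standing in for the extra stripped column in the table. The entries are examples only.

```python
# Stripped, upper-cased spelling -> canonical name (stand-in for the DB column).
ALIASES = {
    "HONGKONG": "HONG KONG",
    "HKG": "HONG KONG",
    "HK": "HONG KONG",
}


def squash(name):
    """Upper-case the name and remove all whitespace."""
    return "".join(name.upper().split())


def canonical_place(name):
    """Look the name up by its space-less form; None if unknown."""
    return ALIASES.get(squash(name))
```

Storing the squashed form alongside each name means the comparison is a plain equality test at lookup time.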

The role of directing symbols such as ":" should also be noted as it potentially reverses the syntax. For example, where you might have HKG FLAG, FLAG: HKG is also valid. As such, the colon should become part of the syntax rules.
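
A small sketch of the colon reversing the order, assuming two-word phrases and a hypothetical KEYS set of known keywords.

```python
KEYS = {"FLAG", "DWT"}   # assumed sample of known keywords


def key_value(phrase):
    """Read 'KEY: value' or 'value KEY' and return (KEY, VALUE), else None."""
    if ":" in phrase:
        key, value = (part.strip() for part in phrase.split(":", 1))
    else:
        parts = phrase.split()
        if len(parts) != 2:
            return None
        value, key = parts
    key = key.upper()
    return (key, value.upper()) if key in KEYS else None
```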

These syntax rules would look like a string, normally constructed by the program, as follows:-


Actually, we might go for even more specific interpretations so that you don't have to abstract it into a context. So you might have:-


So a preprocessor might work to standardize the words as much as possible. Where the first pass fails to recognize parts, it will need to be abstracted - for example:-

L/B 180,80M/30,50M

will become translated into:-

L 180.80 M
B 30.50 M

This could actually take place before the first-pass in a pre-processor stage.
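
A sketch of that pre-processor step for the paired form above. It assumes the comma is a decimal mark in dimension fields and that the unit is glued to each value; the pattern and the expand_pair name are illustrative.

```python
import re

# KEY1/KEY2 <value><unit>/<value><unit>, e.g. "L/B 180,80M/30,50M"
PAIR = re.compile(
    r"^(\w+)/(\w+)\s+([\d.,]+)\s*([A-Z]+)/([\d.,]+)\s*([A-Z]+)$",
    re.IGNORECASE,
)


def expand_pair(line):
    """Split a paired field into one 'KEY value UNIT' line per key.

    Lines that do not match the pattern are returned unchanged.
    """
    m = PAIR.match(line.strip())
    if not m:
        return [line]
    k1, k2, v1, u1, v2, u2 = m.groups()
    return [
        f"{k1} {v1.replace(',', '.')} {u1}",
        f"{k2} {v2.replace(',', '.')} {u2}",
    ]
```

The output lines are then in the plain "context value unit" shape that the first pass already understands.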


Re: Vessel and contract tracker

Posted: Mon Oct 18, 2010 10:18 am
by Daniel Wee
There should be some prioritizing in the search terms. For example, vessel name and flag would be critical information. Port and dates are also critical. Vessel information is slightly less so as these can be culled from other databases.