PASVAG: A Routine to Relate Different Geographical Names | OMICS International
ISSN: 2151-6200
Arts and Social Sciences Journal

PASVAG: A Routine to Relate Different Geographical Names

Matos V* and Coelho V

IME, Instituto Militar de Engenharia SE-6, Rio de Janeiro-BR, Brazil

*Corresponding Author:
Matos V
IME, Instituto Militar de Engenharia SE-6
Rio de Janeiro-BR, Brazil
Tel: +5521997031054
E-mail: [email protected]

Received: June 15, 2015 Accepted: August 13, 2015 Published: August 21, 2015

Citation: Matos V, Coelho V (2015) PASVAG: A Routine to Relate Different Geographical Names. Arts Social Sci J 6:120. doi:10.4172/2151-6200.1000120

Copyright: © 2015 Matos V, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



Abstract

This research may contribute to studies that investigate, geographically or ethno-historically, the origins of and changes in geographical names. According to Rostaing, "toponymy intends to seek the origin of place names and also to study their transformations." Quantifying these changes can reveal how close two words are, which is what the proposed indicator measures. For example, a neighborhood of the city of Rio de Janeiro, Brazil, formerly known as "REAL ENG°", became the "Realengo" neighborhood. Assigning a distance value to this change can reveal associations never made before. The research therefore sought solutions that apply computational linguistics and textual similarity indices to geographical names, so that these attributes can serve as linking keys when processing queries across different databases. The goal is to enable the management and retrieval of a large body of data. The similarity indicator proposed in this paper was tested in experiments with simulated data. The main results were the recognition of patterns in the tests, the importance of the position of the noise within the string, and the limits of use of the similarity component in the integration of databases.


Keywords: Geographical names; String; Hypothesis; Realengo


Introduction

When working with information originating from various sciences, strategies are needed for retrieving data and for exploring it visually [1]. From a technological standpoint, using space as the reference for data exploration adds understanding and insight when building information [2]. For this reason, preparing information for a geographic information system is a multidisciplinary and interdisciplinary task [3-5]. Such a heterogeneous data set depends on the integration of several sciences, and it requires storing different types of data which, grouped into logically formatted records, constitute a Geographic Database (BDG).

This work uses the notion of similarity between geographical names (NG), which admits a measure of how similar two strings are. The premise is that the quantification of similarity is based on a metric space, which in turn provides the notions of distance and relative proximity, in line with the first law of geography: "everything is related to everything else, but near things are more related than distant things" [6].

The growing demand for information means that the search for technological innovation starts from the need to generate and store large quantities of records. Thus, in the context of data replication, the Geographic Database (BDG) becomes the central issue of this work. The problem is treated from the perspective of the structural universe [7,8], which covers the concepts of reality to be stored in the computer, i.e., the BDG. It uses variables that give meaning to the spatiality of formal models of geographic entities: a geometry field, which stores the geometry of the record (point, line or polygon) as one or more pairs of coordinates; and a text field, which stores its geographical name.

The same reasoning is applied to Geographic Names: proximity is adopted as the response of the similarity indicator. Therefore, if the distance between two strings is close to zero, the NG are similar. Using the text field of the BDG called "geographical name", we examine the limits of efficiency as the similarity index varies, in order to propose an acceptance threshold for the similarity of Geographical Names.

Materials and Methods

To test the proposed indicator (PASVAG), a database was built in Postgres with strings ranging from 1 to 20 characters, which sets the maximum string size used to create the database. Using algorithms written in the Java programming language, we populated the database by simulating sequences of characters with noise (a chain of dissimilar characters) in every possible position in the string. In this way, the performance of the similarity indicator is evaluated across character sets while varying the cardinality of the strings, the position(s) of the noise and the noise size. This routine generated the comparison material by introducing noise into the string to be compared, so that the distance could be calculated under controlled noise.


Read string1;
Read similarity;
Read size;
Assign zero to counter1;
Assign zero to counter2;
Assign empty to matrix;
Read string1 into the matrix;
While counter1 < size do
    Matrix receives "@" at position counter1;
    Increment counter2;
    While counter2 < size do
        Matrix receives "@" at position counter2;
        Calculate the similarity of the matrix;
        Store the matrix and the similarity;
        Increment counter2;
    End while;
    Increment counter1;
End while;


The pseudocode begins with data entry, where the variable is loaded with the sequence of characters. In the example, s1 receives "ABCD"; the next step starts the counter and checks the size of s1. Immediately below, the decision node (the diamond) tests the counter condition and repeats the subsequent expressions as many times as the cardinality of the string. At the next level, position 1 of the array is replaced by noise (@) and the second counter is incremented. The condition on the second counter is then checked; it triggers the calculation of similarity and the storage of its results in the database.

Figure 1 shows how the records were stored in the database. This process was repeated for strings up to cardinality 20. Summed over all string sizes, the permutations yield a universe of 2,097,130 records. This small sample of the database illustrates how the noise is distributed over all possible positions.
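The generation step can be sketched in Python as follows. This is a minimal sketch, not the authors' Java routine: it assumes that every non-empty combination of positions is replaced by "@", an assumption that reproduces the reported total of 2,097,130 records (the sum of 2^n - 1 for n from 1 to 20). The names `noisy_variants` and `universe_size` are hypothetical.

```python
from itertools import combinations

NOISE = "@"  # noise symbol used in the simulated strings


def noisy_variants(base, noise=NOISE):
    """Yield every copy of `base` with one or more positions replaced by noise.

    Assumption: all non-empty subsets of positions are simulated.
    """
    n = len(base)
    for k in range(1, n + 1):                        # number of noisy positions
        for positions in combinations(range(n), k):  # where the noise goes
            chars = list(base)
            for p in positions:
                chars[p] = noise
            yield "".join(chars)


def universe_size(max_len=20):
    """Count the records for string sizes 1..max_len: sum of (2**n - 1)."""
    return sum(2 ** n - 1 for n in range(1, max_len + 1))


variants = list(noisy_variants("ABCD"))
print(len(variants))     # 15 variants for a four-character string
print(variants[:4])      # ['@BCD', 'A@CD', 'AB@D', 'ABC@']
print(universe_size())   # 2097130
```

Each variant would then be compared with the clean string and stored together with its similarity value.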


Figure 1: Database storage screen.

Table 1 shows a section of the records illustrating how the dissimilar variables were allocated according to the methodology described, covering every permutation of noise placement for a string of four characters. Note that, in this approach, the noise symbolized by "@" is expressed as a percentage of the size of the string being compared, as given by equation 1:

Table 1: Percentage of noise by character size.

$$R\% = \frac{r}{t} \times 100 \qquad (1)$$

where:

R% - Noise percentage.

r - Number of dissimilar characters.

t - Number of string characters.

This concept makes noise comparable across cardinalities in the simulated scenario. The formulation also shows that the terms "r" and "t" can only take integer values no smaller than one, since the atomic unit of a string is always one character. This restricts the possible noise levels to a finite set of known values. Another observation on equation 1 is the effect of the term "t" on the noise percentage. Table 2 shows that, for the same noise percentage (25%), the number of dissimilar characters grows with the size of the string. Therefore, the higher the cardinality of the string, the more noise characters correspond to the same percentage. To calculate this new similarity indicator within the scope of Geographic Names (NG), the following assumptions must be met:

s2            @BCD    @[email protected]    @[email protected]@
Cardinality   4       8                  12
Noise         25%

Table 2: Demonstration of noise variation across different cardinalities.
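As a small worked check of equation 1, the sketch below computes the noise percentage for the three cardinalities of Table 2 (one, two and three dissimilar characters in strings of 4, 8 and 12 characters). The function name `noise_percentage` is hypothetical; the input validation reflects the statement that "r" and "t" are integers no smaller than one.

```python
def noise_percentage(r, t):
    """Equation 1: percentage of dissimilar characters (r) in a string of t characters."""
    if not (isinstance(r, int) and isinstance(t, int)):
        raise TypeError("r and t must be integers (whole characters)")
    if not 1 <= r <= t:
        raise ValueError("expected 1 <= r <= t")
    return 100.0 * r / t


for r, t in [(1, 4), (2, 8), (3, 12)]:
    # Same percentage, but more noise characters as the cardinality grows.
    print(t, r, noise_percentage(r, t))   # each line ends with 25.0
```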


Let $C$ be any chain of characters, $C = c_1 c_2 \ldots c_n$, assigned to a geographical name (NG), and let "L" be the set formed by the list of its characters, $L = \{c_1, c_2, \ldots, c_n\}$. At this stage it is assumed that the characters belonging to the NG are the elements of the set "L", kept as an ordered list; for "BANANA", for example, $L = \{B, A, N, A, N, A\}$.

Next, the elements of the set "L" are rewritten as bigrams, composing the set "B". The formation rule pairs each element with its successor according to their positions in $L$, so that $B = \{c_1 c_2, c_2 c_3, \ldots, c_{n-1} c_n\}$. For "BANANA", the set "B" is $\{BA, AN, NA, AN, NA\}$.

Next, the cardinality of sets is used, i.e., the number of elements in any set "W", denoted by $|W|$. The cardinality of "W" is simply the number of objects in the set. For the set $L = \{B, A, N, A, N, A\}$, the cardinality of "L" is $|L| = 6$.

The following proposition considers any two sets "W" and "Z". Let the difference $W - Z$ be the set of elements that are in "W" and are not in "Z". Thus, $W - Z = \{x : x \in W \text{ and } x \notin Z\}$. This is the basis for the subtraction operation of set theory. As a practical example with NG, Figure 2 uses a Venn diagram to illustrate the difference between two sets: subtracting in one order leaves no unhatched portion, so the result is an empty set, whereas changing the order of subtraction results in a unit set. The result therefore depends on the order in which the sets are subtracted.


Figure 2: Example using Venn diagrams.
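The bigram mapping and the set difference described above can be sketched in Python. This is a sketch under one stated assumption: repeated characters and bigrams are counted (a multiset difference), which is what reproduces the counts in the worked tables of this paper. `collections.Counter` is used for that purpose; the function names `bigrams` and `difference_size` are hypothetical.

```python
from collections import Counter


def bigrams(name):
    """Pair each character with its successor: 'BANANA' -> BA, AN, NA, AN, NA."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def difference_size(w, z):
    """|W - Z|: number of elements of W not matched in Z (repeats counted)."""
    return sum((Counter(w) - Counter(z)).values())


print(bigrams("BANANA"))                                        # ['BA', 'AN', 'NA', 'AN', 'NA']
print(difference_size("BANANA", "ANANAIS"))                     # 1  (only B is unmatched)
print(difference_size("ANANAIS", "BANANA"))                     # 2  (I and S are unmatched)
print(difference_size(bigrams("BANANA"), bigrams("ANANAIS")))   # 1  (only BA is unmatched)
```

As with the Venn diagram example, the result depends on the order of subtraction.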

Finally, I is a similarity index that meets the requirements of a metric space.

Proposal of a Similarity Indicator for Geographic Names

In calculating this new indicator, the string is mapped into two units: characters and bigrams. For characters, the noise is the set of characters that are present in L2 and not in L1. Having counted these differences, we then take the set of characters that are present in L1 and not in L2; these differences are summed and termed the noise in the characters (Rc):

$$R_c = |L_1 - L_2| + |L_2 - L_1| \qquad (2)$$

The noise ratio (Tc) is then the ratio of (2) to the sum of the cardinalities, as given in (3):

$$T_c = \frac{R_c}{|L_1| + |L_2|} \qquad (3)$$

In this step the process is repeated with the string mapped into bigrams. The sum of the differences gives the amount of noise in the bigrams (RB):

$$R_B = |B_1 - B_2| + |B_2 - B_1| \qquad (4)$$

Next, the noise ratio in bigrams (TB) is calculated by applying equation 4 in equation 5:

$$T_B = \frac{R_B}{|B_1| + |B_2|} \qquad (5)$$

Finally, the indicator is obtained by applying equations 3 and 5 in equation 6:

$$I = (1 - T_c) \times (1 - T_B) \qquad (6)$$

To satisfy the conditions of a metric space, the proposed indicator is applied in its inverse form. The distance between two strings is therefore $d = 1 - I$: when the indicator (I) reaches maximum similarity it equals 1, and the distance function then gives zero distance between the compared strings.

As seen previously, the formulation of the similarity indicator for NG must meet the following conditions.
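The whole calculation can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, differences are counted with repetition (an assumption consistent with the worked BANANA and ANANAIS example), and a pair with no bigrams at all (single-character strings) is assumed to contribute a bigram noise ratio of zero.

```python
from collections import Counter


def _bigrams(name):
    """Map a string into its list of consecutive character pairs."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def _noise(x, y):
    """R = |X - Y| + |Y - X|, counting repeated elements."""
    cx, cy = Counter(x), Counter(y)
    return sum((cx - cy).values()) + sum((cy - cx).values())


def _noise_ratio(x, y):
    """T = R / (|X| + |Y|); assumed to be 0 when both inputs are empty."""
    total = len(x) + len(y)
    return _noise(x, y) / total if total else 0.0


def similarity(name1, name2):
    """I = (1 - Tc) * (1 - TB), with characters and bigrams as mapping units."""
    tc = _noise_ratio(name1, name2)
    tb = _noise_ratio(_bigrams(name1), _bigrams(name2))
    return (1 - tc) * (1 - tb)


def distance(name1, name2):
    """d = 1 - I, the form used to test the metric-space conditions."""
    return 1 - similarity(name1, name2)


print(round(similarity("BANANA", "ANANAIS"), 3))   # 0.559
print(round(distance("BANANA", "ANANAIS"), 3))     # 0.441
print(similarity("BANANA", "BANANA"))              # 1.0
```

The sketch is case-sensitive and applies no normalization; whether names should be upper-cased or stripped of accents before comparison is left open here.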


After testing whether the formulation of the indicator meets the expected properties, the maximum and minimum limits of similarity were determined. When the comparison of two strings yields no noise, all the elements match, and we have the case of maximum similarity, with $R_c = R_B = 0$. Applying equations 3 and 5:

$$T_c = \frac{0}{|L_1| + |L_2|} = 0, \qquad T_B = \frac{0}{|B_1| + |B_2|} = 0$$

Applying these null noise ratios in equation 6:

$$I = (1 - 0) \times (1 - 0) = 1$$

This result of maximum similarity shows that the proposed indicator fits the requirements of a metric space, since it obeys the positivity condition and gives zero distance between the strings.

The other limit is the minimum similarity, or maximum distance, when the geographical names are entirely dissimilar, that is, there is no similarity between the pair of strings. In this case the amount of noise is greater than zero and equals the sum of the cardinalities of the two strings:

$$R_c = |L_1| + |L_2| = a + b$$

where a ≠ b ≠ 0. Applying equations 3 and 6:

$$T_c = \frac{a + b}{a + b} = 1, \qquad I = (1 - 1) \times (1 - T_B) = 0$$

This gives the value of the indicator even if the value of TB is not known, because the factor (1 - Tc) is zero, and zero multiplied by any amount results in zero.

This final result shows that the compared strings are at the maximum distance from one another, which means no similarity.

Calculation Methods

Given the conditions of use of similarity for the specific case of NG, we need a way to measure how different two NG are. For this, we use the notion of distance to measure similarity: the shorter the distance, the more similar the NG are, and the greater the distance, the lower the similarity between them. These properties are attributed to the indicator by applying the requirements of a metric space.

Let us check whether the proposed indicator meets the conditions of a metric space. The first property is the positivity condition: $d(C_1, C_2) \geq 0$ for all C1, C2 in X. As an example of its application, let I("BANANA", "ANANAIS") be the similarity of two strings C1 and C2 (Table 3).

        C1                     C2                     =
L       {B,A,N,A,N,A}          {A,N,A,N,A,I,S}
Rc      |L1-L2| = 1            |L2-L1| = 2            1 + 2 = 3
RB      |B1-B2| = 1            |B2-B1| = 2            1 + 2 = 3
Tc      Rc / (|L1| + |L2|)                            3 / (6 + 7)
TB      RB / (|B1| + |B2|)                            3 / (5 + 6)
I       (1 - Tc) × (1 - TB)                           0.559
f(d)    1 - I                                         0.441

Table 3: Example of calculating the proposed indicator to verify the positivity condition.

The result in the last row of the table satisfies the first condition of a metric space: its value is positive. The next test shows that the indicator finds maximum similarity, that is, zero distance, when the strings are equal. Let us check whether C1 = C2 gives d(C1, C2) = 0, taking I("BANANA", "BANANA") (Table 4).

        C1                     C2                     =
L       {B,A,N,A,N,A}          {B,A,N,A,N,A}
Rc      |L1-L2| = 0            |L2-L1| = 0            0 + 0 = 0
RB      |B1-B2| = 0            |B2-B1| = 0            0 + 0 = 0
Tc      Rc / (|L1| + |L2|)                            0 / (6 + 6)
TB      RB / (|B1| + |B2|)                            0 / (5 + 5)
I       (1 - Tc) × (1 - TB)                           1
f(d)    1 - I                                         0.0

Table 4: Example of calculating the proposed indicator to verify the null distance condition.

The zero result shows that the distance between the words is null, which proves this property as well. To validate the third property, symmetry, f(C1, C2) = f(C2, C1), we reuse the example of the first property with the inputs swapped: let I("ANANAIS", "BANANA") be the similarity of two strings C1 and C2 (Table 5).

        C1                     C2                     =
L       {A,N,A,N,A,I,S}        {B,A,N,A,N,A}
Rc      |L1-L2| = 2            |L2-L1| = 1            2 + 1 = 3
RB      |B1-B2| = 2            |B2-B1| = 1            2 + 1 = 3
Tc      Rc / (|L1| + |L2|)                            3 / (7 + 6)
TB      RB / (|B1| + |B2|)                            3 / (6 + 5)
I       (1 - Tc) × (1 - TB)                           0.559
f(d)    1 - I                                         0.441

Table 5: Example of calculating the proposed indicator to verify symmetry.

The result is the same as in the first example, meaning that the order of the input variables of the similarity function does not matter, because the result is the same. The indicator therefore meets the symmetry condition of a metric space. The last condition to be verified is the triangle inequality. Let C1, C2 and C3 be the strings "ANANAIS", "BANANA" and "ABACAXI", whose distances we want to calculate (Tables 5 and 6).

        C1                     C3                     =
L       {A,N,A,N,A,I,S}        {A,B,A,C,A,X,I}
B       {AN,NA,AN,NA,AI,IS}    {AB,BA,AC,CA,AX,XI}
Rc      |L1-L3| = 3            |L3-L1| = 3            3 + 3 = 6
RB      |B1-B3| = 6            |B3-B1| = 6            6 + 6 = 12
Tc      Rc / (|L1| + |L3|)                            6 / (7 + 7)
TB      RB / (|B1| + |B3|)                            12 / (6 + 6)
I       (1 - Tc) × (1 - TB)                           0.0
f(d)    1 - I                                         1

Table 6: Example of calculating the proposed indicator to verify the triangle inequality condition.

Next we find the distance between C2 and C3 (Table 7).

        C2                     C3                     =
L       {B,A,N,A,N,A}          {A,B,A,C,A,X,I}
B       {BA,AN,NA,AN,NA}       {AB,BA,AC,CA,AX,XI}
Rc      |L2-L3| = 5            |L3-L2| = 4            5 + 4 = 9
RB      |B2-B3| = 5            |B3-B2| = 4            5 + 4 = 9
Tc      Rc / (|L2| + |L3|)                            9 / (6 + 7)
TB      RB / (|B2| + |B3|)                            9 / (5 + 6)
I       (1 - Tc) × (1 - TB)                           0.05
f(d)    1 - I                                         0.95

Table 7: Example of calculating the proposed indicator to verify the triangle inequality condition.

These results validate the triangle inequality property, d(C1, C2) ≤ d(C1, C3) + d(C2, C3). As Figure 3 shows, the sum of the distances f(ABACAXI, BANANA) + f(ABACAXI, ANANAIS) = 0.95 + 1 = 1.95 is greater than the distance f(ANANAIS, BANANA) = 0.441.


Figure 3: Graph representing the distances C1, C2, C3.
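The four conditions can also be checked programmatically for the three example strings. The sketch below is only a spot check on these inputs, not a general proof; the compact `distance` function is a hypothetical reimplementation that counts differences with repetition (an assumption), and a small tolerance is used for floating-point comparisons.

```python
from collections import Counter
from itertools import permutations


def distance(a, b):
    """d = 1 - (1 - Tc) * (1 - TB), with differences counted with repetition."""
    def ratio(x, y):
        cx, cy = Counter(x), Counter(y)
        noise = sum((cx - cy).values()) + sum((cy - cx).values())
        return noise / (len(x) + len(y)) if (x or y) else 0.0

    def grams(s):
        return [s[i:i + 2] for i in range(len(s) - 1)]

    return 1 - (1 - ratio(a, b)) * (1 - ratio(grams(a), grams(b)))


names = ["BANANA", "ANANAIS", "ABACAXI"]
EPS = 1e-12  # tolerance for floating-point rounding

for x, y, z in permutations(names, 3):
    assert distance(x, y) >= 0                                   # positivity
    assert distance(x, x) == 0                                   # null distance
    assert abs(distance(x, y) - distance(y, x)) < EPS            # symmetry
    assert distance(x, y) <= distance(x, z) + distance(z, y) + EPS   # triangle inequality

print("all four conditions hold for", names)
```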


For similarity tests on strings with simulated data, an important point is the definition of the sequence of known characters that serves as the basis for the models studied. This representation reproduces, on a smaller scale, the universe of possibilities that an alphanumeric character can occupy in the composition of an NG. The tests with the simulated database were designed both to discover other parameters of the problem under study and to test the proposed similarity indicator and identify its advantages and limitations. In this simulation model, capturing the effect of noise on strings was of great value for formulating a theoretical model and refining knowledge of similarity indicators for NG.

Simulated data results

Figure 4 shows the effect of 40% noise (two dissimilar characters) on a cardinality of six characters. As the position of the noise varies, different response patterns are observed:


Figure 4: Range of the results due to the positioning of the noise.

1) In the first pattern, the noise occupies the ends of the string: the noise symbol "@" is grouped at the beginning, at the end, or occupies both ends. This pattern gives the maximum cut for this range of cardinality and noise. This feature is useful for the treatment of addresses, because the indicator tolerates a change of address type, such as "Street" or "Avenue", located at the beginning of the string.

2) In the next pattern, the noise occupies one end or is grouped inside the string. This pattern gives an intermediate cut relative to the other noise positions.

3) In the hard cut pattern, the noise appears inside the string, occupying spaced positions.

Given the above, for any cardinality there is a response pattern that distinguishes the known noise positions. This makes the approach applicable to any geographical name.

Next, the extreme values of the similarity indicator were checked. Figure 5 shows the behavior of the similarity function for a cardinality of twenty characters with the minimum cut pattern. By premise, "one" is the maximum similarity, in the absence of noise. According to the behavior of the function shown, however, the similarity indicator can only measure a number of dissimilar characters up to half the cardinality of the string; it does not distinguish values above 50% noise.


Figure 5: Location of the minimum points for a chain of twenty characters.

The range of the similarity indicator as the noise varies was investigated and is shown in Figure 6. As the noise increases and the similarity values decrease, the range between the maximum and minimum similarity values of each class widens. For example, at the position that admits only one noise character (Rc = 1), the indicator gives a similarity in the range I = [0.85, 0.90].


Figure 6: Behavior of similarity under noise variation with constant cardinality.

For a second sample point, at position three on the abscissa (three noise characters), the values fall in the range I = [0.58, 0.72].

From the behavior of the tested similarity function, a rule can be stated for quantifying the range of responses as a function of the cardinality and the amount of noise: the number of distinct similarity values is always "noise + 1". That is, a string of twenty characters with two noise characters yields three valid similarity results. This rule holds up to the limit of the similarity indicator, that is, half of its cardinality.
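The "noise + 1" rule can be reproduced with a short simulation. The sketch below is built on assumptions: the base string is made of distinct letters (the first letters of the alphabet, so the cardinality is limited to 26), the noise symbol is "@", differences are counted with repetition, and similarity values are rounded to six decimals before counting. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import string


def similarity(a, b):
    """I = (1 - Tc) * (1 - TB), counting differences with repetition."""
    def ratio(x, y):
        cx, cy = Counter(x), Counter(y)
        noise = sum((cx - cy).values()) + sum((cy - cx).values())
        return noise / (len(x) + len(y)) if (x or y) else 0.0

    def grams(s):
        return [s[i:i + 2] for i in range(len(s) - 1)]

    return (1 - ratio(a, b)) * (1 - ratio(grams(a), grams(b)))


def similarity_values(cardinality, noise_count, noise="@"):
    """Distinct similarity values over every placement of the noise."""
    base = string.ascii_uppercase[:cardinality]   # distinct characters, <= 26
    values = set()
    for positions in combinations(range(cardinality), noise_count):
        chars = list(base)
        for p in positions:
            chars[p] = noise
        values.add(round(similarity(base, "".join(chars)), 6))
    return sorted(values)


print(similarity_values(20, 1))        # [0.85, 0.9]
print(len(similarity_values(20, 2)))   # 3
print(len(similarity_values(20, 5)))   # 6
```

Under these assumptions the character noise ratio does not depend on where the noise sits, so the spread of values comes only from how many bigrams the noisy positions touch: fewest when the noise is grouped at an end, most when it is spaced inside the string.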

Another important observation is shown in Figure 7. With a five-character noise, the rule for the number of similarity responses is the same. As highlighted by the red rectangle, the pattern of six responses persists no matter how much the cardinality increases. With this, the values in the triangular table for cardinalities up to twenty characters can be used to estimate the values for any cardinality.


Figure 7: Graph illustrating the spread of five-character noise across different cardinalities.

Taxonomy of noise in geographical names

From the maximum and minimum intervals listed in the triangular matrix available in the appendix, some noise patterns studied with simulated data are listed below. The triangular matrix lets the user infer what to expect from a similarity query, i.e., quantify how flexible the response pattern is in a query by NG.

Figure 8 summarizes the strategy for relating records by similarity. For an NG of any fixed cardinality, the greater the amount of noise, the lower the similarity index and the lower the confidence in a relationship; the smaller the amount of noise, the higher the similarity index and the greater the confidence in the link between records. We start with the similarity range I = [0.85, 0.90], with noise below 5% (relative to the expression in equation 5).


Figure 8: Synthesis of the relationships between the variables cardinality, noise and similarity.

This first pattern is marked by subtle differences in how the terms are written (Figure 9). At this level of similarity, with a small amount of noise and high cardinality, the similarity variable alone is sufficient to trust the relationships between data.


Figure 9: Relationship between Geographical Names at 0.9 similarity.

In response patterns with 0.81 similarity and 15 to 20% noise (relative to the expression in equation 5), a new pattern appears in which the similarity indicator relates incomplete NG. As exemplified in Figure 10, the range in question contains the name "PARQUE" or "VILA"; even if the term is present in only one of the records, relating the two is possible. However, this range requires one more variable in addition to similarity, because similarity alone is not enough to trust the relationship between the NG.

For the range I = [0.70, 0.76], it is essential to consider the size of the string and the noise. Compared with the examples of Figure 10, Figure 11 shows how much the trust in the relationships diminishes at this level of similarity with smaller cardinalities. Here the cardinality and similarity variables also reveal another kind of relationship, one to many: a record of the IPP database table becomes related to more than one record of place names in the CNEFE table. At this limit a relationship decision is not feasible, because confidence is low even when the similarity, cardinality and noise variables are combined.


Figure 10: Relationship between Geographical Names at 0.8 similarity.


Figure 11: Relationship between Geographical Names at 0.7 similarity.

For values below 0.65 down to the limit of the similarity metric, 0.5, we find the following: low cardinality and low similarity with high noise, which rules out any relationship.

For high cardinality and a high noise ratio, with low similarity, relating the records sought is possible as long as an external variable, their geographical position, is also analyzed (Figure 12).


Figure 12: Relationship between Geographical Names at less than 0.7 similarity.


In short, for similarity values below 0.6, a balanced retrieval of records can be performed provided the noise is greater than eight characters. Even with a low sensitivity, that is, a low rate of correct classification (43%), the low rate of incorrect classifications (6%) minimizes the risk of erroneous classification.

Before accepting the hypothesis that there is a similarity threshold between Geographic Names, the experiments show that the position occupied by the noise in the string must be considered, as well as whether the noise is clustered or dispersed. These factors are highly relevant to the acceptance value of similarity in the tested databases. To set a lower limit of similarity based on the experiments, two parameters can be considered. The first is observed in Figures 4 and 11: at the point of minimum similarity of a chain of twenty characters, the maximum amount of noise that the indicator measures is equivalent to 50% of the cardinality of the string. The second is the analysis of the different cardinalities of the Geographic Names: a string cannot be twice the size of its relational pair, and besides this difference in cardinality, attention must be paid to the noise between them. Example:

I("Avenida Guilherme Maxwell, Rio de Janeiro", "AvG Macwell, Rio de Janeiro") = 0.507. Therefore, the maximum difference of 62% covers changes in cardinality and/or noise.

For strings of the same cardinality, the number of possible results, considering the variation of the noise position, is the amount of noise plus one. This property is important for future mappings of similarity values.


Conclusion

The applicability of the proposed indicator to retrieving records across different databases was verified, and it proved quite effective for the correct classification of NG pairs. The results indicate that, for similarity values below 0.9 and above 0.8, using an additional variable to refine the retrieval of records increases confidence in the information. For cardinalities greater than ten characters, this has a positive effect on the rate of correct classification of genuine pairs. For similarity values below 0.6 and above 0.5, the additional noise variable increases the efficiency of correct classification. When noise values greater than eight characters are considered, the rate of incorrect classifications drops to 6%. This shows an asymmetry, with high efficiency at rejecting impostor pairs. The expected goal was reached: the similarity coefficients improved queries that require textual comparisons, as in the georeferencing of addresses.

