Tutorial

QuotationFinder_0.6

What is QuotationFinder, and for which purposes can it be used?

QuotationFinder is a tool for the automatic comparison of fully digitized texts. It can detect quotations, allusions, plagiarism or other intertextual links. The program is individually adjustable in terms of language, encodings or the segmentation of the compared texts and can be useful in many contexts.

There are three modes: a single text can be compared with a collection of texts (Find Quotations), all texts in a collection can be compared with each other (Compare Texts), or two texts can be aligned sentence by sentence (Align Sentences).

  • Find Quotations: This mode of comparison can be used to find all quotations from a certain author in a single text (e.g. a newspaper article about that author); to analyze language changes (e.g. “Base” vs. “Cousine” in German); to compare the usage of key phrases over time; or to analyze the degree to which terms and phrases coined by a certain author influenced later publications in a field of study. For displaying the results, hit lists as well as different types of charts are available.
  • Compare Texts: In this mode every text in a collection is compared with every other text, using up to 24 statistical measures capturing various levels and types of similarity. The results, allowing both a tabular and a graphical view of the data, can be used for tasks like plagiarism detection or the analysis of different text editions.
  • Align Sentences: In this mode two texts are aligned sentence by sentence. The results, displayed in three different formats, can be used for a close reading of original and translation.  

Getting Started

Possible operating systems:

  • Microsoft Windows
  • Mac OS (required version: 10.5 on 64-bit Intel-based Macs)
  • Linux
  • other operating systems able to use the Java Virtual Machine (1.6+)

System requirements:

  • Java 1.6 (or a higher version): JRE or JDK (available from Sun)
  • 512 MB RAM  

Automatically download and start QuotationFinder

  • Click here and follow the instructions. A window will open up saying that the digital signature cannot be verified:

In this case you can click on "Run" without worries. As only members of the HRA can change the content of our homepage, you are guaranteed that the program has not been manipulated since we published it.

Manually download, unpack and start QuotationFinder:

  1. First, download QuotationFinder_0.5.zip.
  2. Then, unpack the zip archive. The main folder QuotationFinder_0.5 contains two executable files (QuotationFinder_0.5.bat, QuotationFinder_0.5.jar), an AppleScript application (QuotationFinder_0.5.app - only for Mac users), a tutorial (Tutorial_QuotationFinder_0.5.pdf) and a readme (README.txt) as well as a folder with test data for your convenience (TestData) and another folder with Java libraries used by the program (lib).
  3. Start the program:
  • Windows: Double-click on the batch file "QuotationFinder_0.5.bat".
  • Mac OS: Double-click on the AppleScript application "QuotationFinder_0.5.app".
  • Linux: Make "QuotationFinder_0.5.bat" executable and start it from the command line (type "sh QuotationFinder_0.5.bat").

And now?

After starting the program, choose what you would like to do first: Find Quotations, Compare Texts or create a New Collection in Your Library. You can always switch between the first two by clicking on the Go menu. The last option is available from the Library menu. To help you get to know QuotationFinder's functions, we provide test data here. (If you downloaded the program manually, please find the data in the folder QuotationFinder_0.5/TestData.)  

Find Quotations

1. The user interface  

If you decided to find quotations, the user interface will look like image 1. It consists of two panels: The left panel (Your File) displays a single text file and the right panel (Your Library) displays the text collections that the single text can be compared to.

Below the menu you will find a toolbar containing the most important commands.

Image 1: Find Quotations GUI

2. Opening text for comparison

You have two options for opening the desired text:

1. Click on File -> Open File(s) (see image 2) to browse your computer for the text. The following types of files are supported:

  • Hypertext Markup Language (.html/.htm)
  • Microsoft Word documents (.doc)
  • Portable Document Format (.pdf)
  • Plain Text (.txt)
  • Rich Text Format (.rtf)

The chosen text will be displayed in the left panel (Your File). Then a small window will pop up asking you to select the language (and encoding) of the text (see also Settings used for comparison). 

2. Insert text from the clipboard into the left panel using Copy and Paste. The left panel is fully editable, so you can always write something or change the displayed text.

Image 2: File menu

Your Library:
To compare Your File to a collection of texts, these texts have to be added as a collection to Your Library. To do this, click on Library -> New Collection (see image 3). A file chooser dialog allows you to select one file or directory, or several files or directories to be added as a collection to Your Library. Again, only the following file formats are accepted: txt, rtf, html, pdf, doc. Currently, the encoding of Plain Text documents cannot be extracted automatically. The program will ask you to select the correct encoding from a drop-down list. After adding a new collection, it will be displayed in the right panel. The data needed for the actual comparison (the index) is automatically generated and saved to the subfolder “Library” of the folder containing QuotationFinder. Your Library can hold as many collections as you need, but you can only compare Your File to one collection at a time. By clicking on Library -> Add Files to Collection you can always enlarge a collection. To delete a collection, select the collection in the Your Library panel and click on Library -> Delete Collection. Note that when you restart the software, it will automatically display all previously added collections (that have not been deleted).  

Image 3: Library menu

Changing or adding metadata:
The name of each collection is generated automatically. You may change this name later by first selecting it from the list and then clicking on Library -> Rename Collection (see image 3).
Some files contain metadata like author, title or date. This metadata is saved automatically with the collection. You may view and change it by clicking on Library -> Edit Metadata (see image 3). All automatically generated or manually entered metadata will be displayed in a new window (see image 4). This metadata is important for two reasons: First, it is used in the display of comparison results (see image 7). Second, it is used for generating statistical output.

Image 4: Changing or adding metadata

Importing a collection:

On our download page, high quality collections are available for import into your QuotationFinder Library. To do this, first select a collection of interest to you, then download the zip archive and unpack it to a location of your choice. Each archive contains a readme with general information on the given collection as well as a QuotationFinder Collection file (.qfc) containing the collection. In QuotationFinder, click on Library -> Import Collection. A file chooser dialog will open allowing you to select this .qfc file. Confirm your choice by clicking on Open, and the collection will be imported into your Library. Its name will then be displayed in the Your Library panel.

Note that the metadata of imported collections may be viewed but not changed. Furthermore, the hits resulting from a comparison of your file with an imported collection might link to files on the Internet. Therefore, you have to be connected to the Internet to open these files within QuotationFinder.

3. Settings used for comparison

The success and results of your comparison depend on correctly specifying the settings. Therefore, before starting a comparison, click Settings and choose between the following options (see image 5) according to your requirements: 

Image 5: Settings menu

Language: If you chose the wrong language when opening Your File, you can change your selection again. Currently, the following languages are supported:

  • English
  • Chinese
  • French
  • German
  • Japanese
  • Latin

Other languages may be added on demand. Just contact us. Currently, the language setting is only important for deciding whether a given token is shorter than the Minimum Length of Token (see below). For western languages, the number of words in a token is counted, whereas for Chinese and Japanese the number of characters is counted. If Your File contains, for example, both English and Chinese sentences, you should choose Chinese as Language. Both Chinese and English tokens will then be searched, although the Minimum Length of Token restriction will not work properly for the English tokens.


Encoding: Choosing the correct encoding of a file is only important for Plain Text documents. By default, UTF-8 will be chosen. If Your File is not displayed correctly, you may switch to another encoding. Currently the following encodings are supported:

  • Chinese simplified (GB2312)
  • Chinese traditional (Big5)
  • English (US-ASCII)
  • Japanese (Shift_JIS)
  • Unicode (UTF-8)
  • Unicode (UTF-16 LE)
  • Unicode (UTF-16 BE)
  • Western (ISO-8859-1)
  • Western (ISO-8859-15)

Other encodings may be added on demand. Just contact us.

Advanced: When you click on this menu item, a new window will open (see image 6). Use the available settings to adjust your comparison, taking into consideration the characteristics of Your File:

Tokenizer: Before comparison, Your File is always segmented into a series of phrases, also called tokens, that are then searched in the text collection. Here you can choose the length of these tokens:  

  • Clause: The text is segmented after every .?!;:。?!;: or Enter.
  • Subclause: The text is segmented after every .?!;:,。?!;:, or Enter. (Note that a new token is generated after a comma (,) only in subclause mode.)
  • n-Gram: Here you can choose a number between 3 and 20 that sets the length of the tokens. If you choose 3, for example, the text in Your File is segmented into trigrams, meaning that every sequence of three consecutive words or characters is compared to the collection. In a document containing the text “Today is Sunday not Monday”, the tokens are: “Today is Sunday”, “is Sunday not”, “Sunday not Monday”. This works for Chinese texts as well, except that characters rather than words are counted. You can see the result of the tokenization in the left panel by clicking on the Segmented tab. (Note that the following two settings are only available in clause and subclause mode, since in n-gram mode the chosen number already fixes the length of the tokens.)
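The three tokenizer modes can be illustrated with a minimal Python sketch. It is not QuotationFinder's own code: the function names are hypothetical, the full-width punctuation marks are an assumption about what the lists above are meant to contain, and dropping the punctuation from the tokens is likewise an assumption.

```python
import re

# Segmentation marks from the tutorial; the full-width forms and the
# treatment of Enter as a newline character are assumptions.
CLAUSE_MARKS = ".?!;:。?!;:\n"
SUBCLAUSE_MARKS = CLAUSE_MARKS + ",,"


def split_tokens(text, marks):
    """Split text after every character in marks and drop empty pieces."""
    pattern = "[" + re.escape(marks) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


def ngrams(text, n, by_character=False):
    """Return every run of n consecutive words (or characters)."""
    if by_character:
        units = [c for c in text if not c.isspace()]
        joiner = ""
    else:
        units = text.split()
        joiner = " "
    return [joiner.join(units[i:i + n]) for i in range(len(units) - n + 1)]


print(ngrams("Today is Sunday not Monday", 3))
# ['Today is Sunday', 'is Sunday not', 'Sunday not Monday']
print(split_tokens("Today is Sunday, not Monday. Really?", SUBCLAUSE_MARKS))
# ['Today is Sunday', 'not Monday', 'Really']
```

Passing by_character=True mirrors the character-based counting described for Chinese texts.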

Minimum Length of Token: Every token that is shorter than this number will be excluded from the comparison.
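As a rough illustration of how the Language setting interacts with this filter, the following Python sketch counts words for western languages and characters for Chinese and Japanese. The function names are hypothetical, and details such as whether punctuation counts towards the length are assumptions.

```python
# Languages for which the tutorial says characters are counted instead of words.
CHARACTER_BASED = {"Chinese", "Japanese"}


def token_length(token, language):
    """Length of a token in words, or in characters for Chinese and Japanese."""
    if language in CHARACTER_BASED:
        # Assumption: whitespace is ignored when counting characters.
        return sum(1 for c in token if not c.isspace())
    return len(token.split())


def is_searched(token, language, minimum_length):
    """A token shorter than minimum_length is excluded from the comparison."""
    return token_length(token, language) >= minimum_length


print(token_length("Today is Sunday", "English"))  # 3
print(token_length("Today is Sunday", "Chinese"))  # 13
```

The second call shows why the restriction does not work properly for English tokens when Chinese is selected: the English token is measured in characters, so it passes almost any minimum length.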

Proximity: A scale from 0 to 10 lets you set the degree of fuzziness. The selected number is the number of additional words or characters that may occur within a search phrase while the phrase still counts as a match. A higher value therefore makes matching fuzzier, while selecting "0" accepts only exact word-for-word matches.  
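One simplified way to picture the proximity setting is the Python sketch below. It is only an interpretation of the description above, not the matching algorithm QuotationFinder actually uses: the hypothetical function matches_with_gaps accepts a phrase if its words occur in order in the text with at most a given number of extra words in between.

```python
def matches_with_gaps(phrase, text, proximity):
    """True if the words of phrase occur in order in text with at most
    `proximity` extra words between the first and the last phrase word."""
    wanted, words = phrase.split(), text.split()
    if not wanted:
        return False
    for start, word in enumerate(words):
        if word != wanted[0]:
            continue
        matched, extra, pos = 1, 0, start + 1
        # Walk forward, counting every word that is not the next phrase word.
        while matched < len(wanted) and pos < len(words) and extra <= proximity:
            if words[pos] == wanted[matched]:
                matched += 1
            else:
                extra += 1
            pos += 1
        if matched == len(wanted) and extra <= proximity:
            return True
    return False


print(matches_with_gaps("Today is Sunday", "Today it is Sunday", 0))  # False
print(matches_with_gaps("Today is Sunday", "Today it is Sunday", 1))  # True
```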

Image 6: Advanced settings

4. Running the comparison

After choosing Your File and adjusting the settings accordingly, select a collection from Your Library. Then click on the Compare button at the bottom of the interface, and the single text will be compared to the collection. When the comparison is finished, a small window will pop up displaying the results. 

5. Viewing the results

The results of a comparison are displayed in several windows: the Result report, the Analyzed tab, the Hit List tab and the Hit(s) tab.

The Result report contains five different types of information under the following headings:

  • Number of Tokens: The complete number of tokens generated from Your File by the tokenizer and used during comparison.
  • Multiple hits: Tokens with at least two matches in the chosen collection.
  • One hit: Tokens with exactly one match in the collection.
  • No hit: Tokens with no match in the collection.
  • Not searched for: Tokens that are shorter than the Minimum Length of Token as selected in the Advanced Settings.
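The five headings can be thought of as a simple classification of tokens, sketched here in Python. The function result_report and its inputs are hypothetical; the sketch assumes word-based tokens, a ready-made mapping from each token to its number of matches, and that Number of Tokens includes the tokens that were not searched for.

```python
def result_report(tokens, hit_counts, minimum_length):
    """Sort tokens into the categories of the Result report.

    tokens: list of word-based tokens generated from Your File.
    hit_counts: dict mapping a token to its number of matches in the collection.
    """
    report = {
        "Number of Tokens": len(tokens),
        "Multiple hits": 0,
        "One hit": 0,
        "No hit": 0,
        "Not searched for": 0,
    }
    for token in tokens:
        if len(token.split()) < minimum_length:
            report["Not searched for"] += 1
        elif hit_counts.get(token, 0) >= 2:
            report["Multiple hits"] += 1
        elif hit_counts.get(token, 0) == 1:
            report["One hit"] += 1
        else:
            report["No hit"] += 1
    return report


tokens = ["to be or not to be", "that is the question", "indeed", "whether tis nobler"]
hits = {"to be or not to be": 3, "that is the question": 1}
print(result_report(tokens, hits, minimum_length=2))
```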


In the Analyzed tab (see image 7), Your File is segmented into tokens (one token per line), and the tokens that were found in the collection are highlighted in blue. When clicking on such a token, the Hit List scrolls to the occurrences of the token in the collection. This list displays all hits per token. Underneath each token, a link indicates the source text of the match. (This link might also contain the name of the author, title, etc., depending on how much metadata you added before.) Clicking on a link will open up the corresponding text in the Hit(s) tab where you can see the token (highlighted in green) in context.

Image 7: Viewing the results

6. Results as a chart

To see the results of the comparison as a chart, click on View -> Chart View in the menu bar. (Note that Table View and Text View can only be selected in Compare Texts mode.)

Chart: The chart (see image 8) displays the results of the comparison and can be manually labeled on the right side of the window. You can choose between certain Modes of Display (Hits per Author, per Date, per Path, per Source, etc.), depending on how many texts with varying information are part of your collection, but you have to make sure that you have already added the relevant metadata (see Your Library). You can also change the type of chart from Bar Chart to Line Chart or Pie Chart.

Legend: The analyzed text is displayed at the bottom. As part of the Advanced Options on the right, you can switch between the display of all tokens or only tokens for which a hit was retrieved.

Changed settings are only displayed after clicking Refresh.

To return to Your File or the Hit List, click on View -> File View.

Image 8: Chart view

7. Saving the results

After completing your comparison, you can save the results in different formats:

  • To save the chart graphically depicting the results, click Save As… on the right of the Chart View window, pick a certain quality of display and choose between two formats of picture compression (.jpg/jpeg or .png).
  • By clicking on Export, the data represented in the chart can be exported as two tab delimited text files saved as a zip archive. The tab delimited values can easily be copied into a Microsoft Excel sheet and further processed.
  • By clicking on File -> Save Results, the files displayed in the Analyzed and Hit List tabs are saved as html files.   
  • The entire comparison project can be saved by clicking on File -> Save Project. 

Compare Texts

In the Compare Texts mode you can calculate up to 24 different similarity scores for each text pair in a collection of texts. This is useful for plagiarism detection as well as for the comparison of different text versions.

1. Opening the texts

Click on File -> Open File(s) to open the texts you want to analyze. The following types of files are supported:

  • Hypertext Markup Language (.html/.htm)
  • Microsoft Word documents (.doc)
  • Portable Document Format (.pdf)
  • Plain Text (.txt)
  • Rich Text Format (.rtf)

Choose the correct language and, if you are asked for it, the correct encoding. You only need to specify the encoding for Plain Text and Rich Text documents; for all other file types the encoding will be detected automatically.

The filepaths of the chosen texts will then be displayed in the left panel. You can always change the encoding of one or more documents by selecting them and setting the new encoding in Settings -> Encoding. If you want to remove one or more documents, select them and click on File -> Remove. 

Image 9: Compare Texts GUI

2. Selecting the features

Up to 24 different scores can be calculated, ranging between 0% (no similarity at all) and 100% (highest possible similarity). These scores are divided into the following categories:

  • Word: Every word or character will be considered.
  • Trigram: Every possible combination of three succeeding words or characters will be considered. The trigrams of a document containing the text “Today is Sunday not Monday”, for example, are: “Today is Sunday”, “is Sunday not”, “Sunday not Monday”.
  • Subclause: Every subclause (ending with .?!;:,。?!;:, or Enter) will be considered.
  • Clause: Every clause (ending with .?!;:。?!;: or Enter) will be considered.

For each of these categories, five different scores can be calculated:

Similarity: Calculates to what degree the two documents contain the same tokens. This score is similar to resemblance, but produces higher values.    
Similarity is calculated as two times the number of intersecting tokens (i.e. tokens that occur in both documents) divided by the number of all tokens in both documents. Several occurrences of the same token are counted separately. Multiplying by 2 is necessary so that comparing two identical documents yields 100%.
Example: Document1 = "Today is Sunday", Document2 = "Today is Sunday Today is Sunday". Then the similarity between the documents is 2 * 3 / 9 *100 = 67.  
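The similarity calculation can be reproduced with a short Python sketch. The function name is hypothetical, and splitting on whitespace stands in for the program's word tokenizer.

```python
from collections import Counter


def similarity(tokens_a, tokens_b):
    """2 * shared tokens / all tokens in both documents, as a percentage."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Multiset intersection: a token repeated in both documents is shared
    # as many times as it occurs in the document where it is rarer.
    shared = sum((counts_a & counts_b).values())
    total = len(tokens_a) + len(tokens_b)
    return 100 * 2 * shared / total if total else 0.0


doc1 = "Today is Sunday".split()
doc2 = "Today is Sunday Today is Sunday".split()
print(round(similarity(doc1, doc2)))  # 67
```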

Resemblance: Calculates to what degree the two documents contain the same tokens. This score is similar to similarity, but produces lower values. 
Resemblance is calculated as the number of intersecting tokens divided by the number of all distinct tokens in both documents taken together. Several occurrences of the same token are labeled with their occurrence numbers, so each occurrence counts as a token of its own.
Example: Document1 = "Today is Sunday", Document2 = "Today is Sunday Today is Sunday". First we label all tokens with their occurrence numbers: Document1 = "Today1 is1 Sunday1", Document2 = "Today1 is1 Sunday1 Today2 is2 Sunday2". Then the resemblance between the documents is 3 / 6 * 100 = 50.  
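A minimal Python sketch of this calculation follows. The helper names are hypothetical; labeled tokens are represented as (token, occurrence number) pairs.

```python
from collections import Counter


def label(tokens):
    """Attach occurrence numbers: the second 'Today' becomes ('Today', 2)."""
    seen = Counter()
    labeled = set()
    for token in tokens:
        seen[token] += 1
        labeled.add((token, seen[token]))
    return labeled


def resemblance(tokens_a, tokens_b):
    """Shared labeled tokens / all labeled tokens in either document, in percent."""
    a, b = label(tokens_a), label(tokens_b)
    union = a | b
    return 100 * len(a & b) / len(union) if union else 0.0


doc1 = "Today is Sunday".split()
doc2 = "Today is Sunday Today is Sunday".split()
print(resemblance(doc1, doc2))  # 50.0
```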

Index Resemblance: Calculates to what degree the two documents contain the same tokens. Unlike resemblance, repeated occurrences of the same token are ignored (i.e. it does not matter if a token occurs seventy-four times or only once).
The score is calculated as the number of distinct intersecting tokens divided by the number of distinct tokens in both documents taken together.
Example: Document1 = "Today is Sunday", Document2 = "Today is Sunday Today is Sunday". Then the index resemblance between the documents is 3 / 3 *100 = 100.  
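Because repeated occurrences are ignored, this score reduces to a comparison of plain sets, as the following Python sketch with a hypothetical function name shows.

```python
def index_resemblance(tokens_a, tokens_b):
    """Shared distinct tokens / all distinct tokens in either document, in percent."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return 100 * len(a & b) / len(union) if union else 0.0


doc1 = "Today is Sunday".split()
doc2 = "Today is Sunday Today is Sunday".split()
print(index_resemblance(doc1, doc2))  # 100.0
```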

Containment: Calculates to what degree the smaller document is contained in the bigger document.
The score is calculated as the number of intersecting tokens divided by the number of tokens in the smaller document. Several occurrences of the same token are labeled with their occurrence numbers.
Example: Document1 = "Today is Sunday Sunday", Document2 = "Today is Monday or Tuesday". First we label all tokens with their occurrence numbers: Document1 = "Today1 is1 Sunday1 Sunday2", Document2 = "Today1 is1 Monday1 or1 Tuesday1". Then the containment between the documents is 2 / 4 *100 = 50.  
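The containment example can be checked with this Python sketch. The function names are hypothetical; as in the example, each occurrence of a token is labeled with its occurrence number before the documents are compared.

```python
from collections import Counter


def label(tokens):
    """Attach occurrence numbers: the second 'Sunday' becomes ('Sunday', 2)."""
    seen = Counter()
    labeled = set()
    for token in tokens:
        seen[token] += 1
        labeled.add((token, seen[token]))
    return labeled


def containment(tokens_a, tokens_b):
    """Shared labeled tokens / number of tokens in the smaller document, in percent."""
    a, b = label(tokens_a), label(tokens_b)
    smaller = min(len(a), len(b))
    return 100 * len(a & b) / smaller if smaller else 0.0


doc1 = "Today is Sunday Sunday".split()
doc2 = "Today is Monday or Tuesday".split()
print(containment(doc1, doc2))  # 50.0
```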

Index Containment: Calculates to what degree the smaller document is contained in the bigger document. Unlike containment, repeated occurrences of the same token are ignored (i.e. it does not matter if a token occurs seventy-four times or only once).
The score is calculated as the number of distinct intersecting tokens divided by the number of distinct tokens in the smaller document.
Example: Document1 = "Today is Sunday Sunday", Document2 = "Today is Monday or Tuesday". Then the index containment between the documents is 2 / 3 *100 = 67.  
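A Python sketch of the index containment calculation is given below. The function name is hypothetical, and the sketch assumes that the "smaller" document is the one with fewer distinct tokens.

```python
def index_containment(tokens_a, tokens_b):
    """Shared distinct tokens / distinct tokens of the smaller document, in percent."""
    a, b = set(tokens_a), set(tokens_b)
    # Assumption: document size is measured in distinct tokens.
    smaller = min(len(a), len(b))
    return 100 * len(a & b) / smaller if smaller else 0.0


doc1 = "Today is Sunday Sunday".split()
doc2 = "Today is Monday or Tuesday".split()
print(round(index_containment(doc1, doc2)))  # 67
```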
    
Additionally, a few features are included for the evaluation of  formal similarities:

Number of Words/Sentences/Paragraphs: Calculates how similar the two documents are concerning their size.
For both documents the number of words, sentences or paragraphs is counted. Then the smaller number is divided by the bigger number.
Example: Document1 = "It is snowing", Document2 = "I would love to go for a walk". Then the number of words ratio between the documents is 3 / 8 * 100 = 37.5.  
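The size ratio is a plain smaller-over-bigger division, sketched here in Python for the word count. The function names are hypothetical, and counting words by splitting on whitespace is an assumption; the same ratio applies to sentence and paragraph counts.

```python
def size_ratio(count_a, count_b):
    """Smaller count divided by bigger count, as a percentage."""
    bigger = max(count_a, count_b)
    return 100 * min(count_a, count_b) / bigger if bigger else 0.0


def word_count_ratio(text_a, text_b):
    """Size ratio of two texts measured in whitespace-separated words."""
    return size_ratio(len(text_a.split()), len(text_b.split()))


print(word_count_ratio("It is snowing", "I would love to go for a walk"))  # 37.5
```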

Vocabulary Size: Calculates how similar the two documents are concerning the size of the vocabulary used.
For both documents the number of different words is calculated. Then the smaller number is divided by the bigger number.
Example: Document1 = "It is snowing", Document2 = "Jingle Bells Jingle Bells". Then the vocabulary size ratio between the documents is 2 / 3 *100 =  67.
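The vocabulary size ratio works the same way on the number of different words, as this Python sketch with a hypothetical function name illustrates. It treats words as case-sensitive, whitespace-separated strings, which is an assumption.

```python
def vocabulary_size_ratio(text_a, text_b):
    """Smaller vocabulary size divided by bigger vocabulary size, in percent."""
    size_a = len(set(text_a.split()))
    size_b = len(set(text_b.split()))
    bigger = max(size_a, size_b)
    return 100 * min(size_a, size_b) / bigger if bigger else 0.0


print(round(vocabulary_size_ratio("It is snowing", "Jingle Bells Jingle Bells")))  # 67
```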

Note that you can always click on the help button (the blue question mark) right next to every feature for further explanations and examples.

3. Viewing the results

After you have selected the features, click on the Compare button, and the results will be displayed in a graphical way (see image 10). The intersecting tokens are highlighted in grey in the documents and are also displayed in the Hits / Details list on the right. When you select a token from the list, it is highlighted in blue in the documents, and you can jump from one occurrence to the next by pressing Return. You can see the feature explanations by selecting the respective feature in the results list and choosing Feature Explanation from the Help Menu.     

Note that only the occurrences that are taken into account for the calculation are highlighted. For index resemblance and index containment all occurrences are taken into account, but for similarity, resemblance and containment that is not the case. So you might see passages where a token is highlighted in one document, but not in the other because the highlighted token has already been matched to an earlier occurrence of the token in the other document.  

By clicking on View -> Table View, you will get a representation in the form of a table. Similarity scores range between 0% and 100%. The higher the score, the more similar the two documents are. Again you can get an explanation for each feature by selecting the respective column and clicking on Help -> Feature Explanation.  

By either double-clicking on a table cell or choosing View -> Text View, you can go back to the graphical view.

Image 10: Text View

4. Saving the results

By clicking on File -> Save Results, the result table is saved as a tab delimited text file. The contents of this file can easily be copied into a Microsoft Excel sheet and further processed.
The entire comparison project can be saved by clicking on File -> Save Project.  
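As an alternative to a spreadsheet, a tab delimited results file can also be read with a few lines of Python. The sketch below is only an illustration: the column names in the sample are hypothetical, since the exact layout of the saved file is not described here.

```python
import csv
import io


def read_scores(handle):
    """Read a tab delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical sample standing in for a saved results file. For a real file,
# pass open(path, newline="", encoding=...) with the file's actual encoding.
sample = io.StringIO(
    "Document 1\tDocument 2\tWord Similarity\n"
    "a.txt\tb.txt\t67\n"
)
rows = read_scores(sample)
print(rows[0]["Word Similarity"])  # 67
```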

Align Sentences

1. Opening the texts

Image 11: Sentence Aligner Settings

Click on File -> Open File(s) to open the texts you want to align. The following file types are supported:

  • Hypertext Markup Language (.html/.htm)
  • Microsoft Word documents (.doc)
  • Portable Document Format (.pdf)
  • Plain Text (.txt)
  • Rich Text Format (.rtf)

Choose the correct language and, if you are asked for it, the correct encoding. You only need to specify the encoding for Plain Text and Rich Text documents; for all other file types the encoding is detected automatically. The language setting is needed to divide the texts into sentences.

You can also change both language and encoding afterwards in the settings box on the right hand side of the program (see image 11). Changes to the opened files are possible, but cannot be saved to the original file again.

2. Adjusting the result settings

You can adjust how your results will be displayed before running the aligner, but you may also change these settings afterwards.

In the First View drop-down box (see image 11) you can choose in which way your results will be displayed at first: Text View, Table View or Split View. Each view has its own merits (for a discussion, see 3. Viewing the results), so best try them out for yourself and pick your favorite.

If coloring is enabled, the likelihood that each sentence is aligned correctly is shown in three different shades of blue. A brighter shade indicates a more reliable alignment.

Forced alignments are sentences that you align yourself by hand. This can be done in the Split Sentences pane and the list of forced alignments is shown in the Forced Alignments pane. Also, in the result views forced alignments can be highlighted, if the box “Show Forced Alignments” is checked.  

3. Viewing the results

After you have opened your texts and if necessary made further changes to the settings, click on Align and the results will be displayed in the chosen view: Text View, Table View, or Split View. You can change between the views through clicking on the menu item View and choosing another view, or else by clicking on one of the view buttons in the toolbar.

Text View: The Text View gives priority to the source text by printing its sentences in black and bold. The translation of each sentence is given below in grey and enclosed in square brackets. If you select the checkbox Show Translation as Tooltip, the translation of a sentence is shown only as a tooltip when the mouse hovers over the sentence for a moment. If you detect errors in the alignment, you can correct them by clicking on Change Alignment, which takes you back to File View -> Split Sentences. See section 4, Force alignments, below for further details.

Table View: The Table View is best used to analyze how the program works, i.e. which sentences are aligned to each other. Each sentence is surrounded by a light grey box; bold grey boxes surround sentence pairs or n-to-m sentence groups that have been aligned as source and translation equivalents.

Split View: The Split View lets you read each text on its own. To find the translation of a sentence, click on the sentence; the view then scrolls to the corresponding sentence(s) and highlights them.

The usual editing functions (find, find again and select all) are available for each view.  

Image 12: Text View

4. Force alignments

If you want to force the alignment of a sentence pair (source – translation; 1-1, 1-n, n-1, n-m), you can do so in File View -> Split Sentences. Click on the sentence number(s) of the source and translation text that you want to align, then press Return on your keyboard. A short note will confirm that the forced alignment has been saved. It can be viewed in File View -> Forced Alignments. In this pane, forced alignments can also be deleted from the list by clicking on a sentence and pressing Delete on your keyboard.

It should be noted, however, that forcing alignments to correct errors in one part of the text can result in wrong alignments at another part of the text that has been correctly aligned before. It is so far not possible to correct alignments without realigning the whole text.  

Image 13: Split Sentences

Troubleshooting

Creating the library directory failed

QuotationFinder creates a directory in your home directory named "QuotationFinder" containing another directory named "Library".  All your collections are stored there. Creating either directory may fail if QuotationFinder does not have write permission in your home directory. Please adjust the permissions accordingly and restart QuotationFinder. If creating the directory still fails, create the directories manually and restart the program.

A window is smaller than its contents

Screens with small resolution sometimes have problems showing the whole content of QuotationFinder windows. This problem might be solved by enlarging the window. If this does not help, please send us a screenshot and we will try to fix the problem.  

An error occurred while comparing text pairs

When you work with very large files in the Compare Texts mode, the program may run out of memory, and an error message will advise you to divide your files into smaller sections. Another way of solving this problem is to increase the memory (RAM) given to the program: open QuotationFinder_0.5.bat in an editor and replace "-Xmx256m" (i.e. 256 MB RAM) with, for example, "-Xmx1024m" (i.e. 1 GB RAM). Be careful not to exceed the RAM of your computer system. Save the file and restart the program. (This will only work if you downloaded QuotationFinder manually instead of using Java Web Start.) 
