Whenever we read newspaper articles, novels or journal articles, we come across several words
beyond our current vocabulary, and looking up these words on the internet or in a dictionary
takes additional effort and time. Such an interruption can disrupt both the continuity essential for
natural comprehension and the reader's mood. The project attempts to overcome this problem by
proposing a system that uses Optical Character Recognition (OCR) to scan and
process text and to overlay, in real time, the meanings of words chosen either for their rarity or because the user intends to
look them up. The system will also have provisions for applying the recognition process to an
image file in the device's storage, such as a screenshot. By making use of various OCR
techniques, the words can be recognised with minimal errors, and their meanings can be
looked up in a local or online dictionary.
Another use case of this general character recognition framework is extending it to
Text Summarisation. The user can summarise the paragraphs of interest from the e-books
available for various subjects and create his own summary based on extraction.
As students and academicians, we desire reliable sources of explanation for technical topics
that we come across in technical papers, magazines, articles or other sources of technical
information that are brief and need elaboration. The application that is the subject of this
project aims to capture a word or group of technical words through the Android
smartphone's camera and search for occurrences of the topic in reference books.
References to the pages of electronic books in which the topic is mentioned are fetched and displayed on
the screen in the form of cards. The user can go to that particular page of the electronic book
to learn more about that topic.
The user can summarise text about the topics from reference books (which in most cases are
verbose) and prepare his own summaries using the automatic summarization tool
for future reference.
Technology Stack Used
3. NetBeans IDE
4. Apache Web Server
5. Mosquitto MQTT server (for notifications)
6. PDFBox (Java library for reading and manipulating .pdf files)
7. Tesseract OCR (open-source library for optical character recognition)
9. Glose Word API
10. Volley HTTP Request Pooling Library
Objectives of Referize:
➔ To provide a real-time text processing framework that processes the text in the
current frame and fetches relevant references to electronic books based on the technical
words selected by the user.
➔ To provide easy and intuitive touch gestures such as taps to pick the words.
➔ To provide a structured navigation tree which follows the hierarchy of courses,
modules and submodules (aligned with the syllabus of the university) to
allow the student to make quick references to content in electronic books about topics.
➔ To allow users to search for topics in the electronic books using keywords,
which are provided as input to a natural language search based on a full-text index
of the topics in the electronic books.
➔ To allow the user to enter his course details so that relevant electronic books are
recommended to him for study.
➔ To allow the user to create a concise summary of the paragraphs on various topics of
interest from those electronic books.
➔ To provide an administrative panel for system administrators to allow them to add
universities, fields, branches, courses, modules and submodules (the academic hierarchy).
➔ To automate the process of extracting topics from electronic books by parsing the
index provided in the electronic book and creating an online index to match search
➔ To allow the user to summarize a topic that is found in multiple books, so
that it is beneficial for him in future.
1. Using the Android Application and the User Dashboard
2. Using the Administrator Panel
If we consider the bare minimum set of components to describe the application functionally, we will have the following:
A. Android Application
Under the mobile device we have the following main components:
A.1 Camera: The application-layer camera software, which provides the video frames.
A.2 Local OCR Engine: The local OCR engine is capable of detecting, processing and
identifying words, each delimited by a bounding box.
A.3 HTTP web request: Represents the HTTP request object in Java, which
encapsulates the data to be sent to the server.
B. Summarization Module
C. Server side scripts for data retrieval, parsing of electronic books and their storage
Extractive Summarization Algorithm
This algorithm is based on the PageRank algorithm, which the search engine Google uses to rank web pages.
The vertices in this case are the sentences in the paragraph, and each word in a sentence is assigned a weight according to the vector space model and the TF-IDF weighting scheme.
An index of words is formed from the document. If w_i is the ith word in the index, then
the weight of w_i in the jth sentence is the product of its term frequency and its inverse sentence frequency:
weight_{i,j} = tf_{i,j} × isf_i
The inverse sentence frequency of w_i is computed as follows:
isf_i = log(N / n_i)
where N is the total number of sentences in the document and n_i is the number of sentences in which the word w_i occurs.
The term frequency of the ith index term w_i is calculated as follows:
tf_{i,j} = freq_{i,j} / max_l(freq_{l,j})
where freq_{i,j} is the frequency of occurrence of the ith index term in the jth sentence,
and max_l(freq_{l,j}) is the maximum frequency of occurrence of any index term in the jth sentence; it is used to normalize the term frequency of each word. Here l ranges over all of the index terms.
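The weighting described above can be sketched in Java as follows. This is a minimal illustration and not the project's actual code: it assumes the standard forms tf_{i,j} = freq_{i,j} / max_l(freq_{l,j}) and isf_i = ln(N / n_i), with the natural logarithm chosen arbitrarily, and it assumes the sentences have already been split into lower-case tokens. The class name TfIsfSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating TF-ISF weighting; not part of the original project.
final class TfIsfSketch {

    /** Counts how often each token occurs in one tokenized sentence. */
    static Map<String, Integer> termCounts(List<String> sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : sentence) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns one word-to-weight map per sentence, where weight = tf * isf. */
    static List<Map<String, Double>> weights(List<List<String>> sentences) {
        int n = sentences.size(); // N: total number of sentences
        Map<String, Integer> sentenceFreq = new HashMap<>(); // n_i per word
        List<Map<String, Integer>> perSentence = new ArrayList<>();
        for (List<String> sentence : sentences) {
            Map<String, Integer> counts = termCounts(sentence);
            perSentence.add(counts);
            for (String word : counts.keySet()) {
                sentenceFreq.merge(word, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> counts : perSentence) {
            // Highest raw frequency in this sentence, used to normalize tf.
            int maxFreq = 0;
            for (int c : counts.values()) {
                maxFreq = Math.max(maxFreq, c);
            }
            Map<String, Double> weightsForSentence = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / maxFreq;
                double isf = Math.log((double) n / sentenceFreq.get(e.getKey()));
                weightsForSentence.put(e.getKey(), tf * isf);
            }
            result.add(weightsForSentence);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> doc = List.of(
                List.of("mars", "has", "oases"),
                List.of("mars", "rovers", "sample", "mars"));
        System.out.println(weights(doc));
    }
}
```

Note that a word occurring in every sentence ("mars" in the example) receives weight 0 under this scheme, because ln(N / N) = 0; such words do not help to distinguish one sentence from another.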
1. Read the input document as a string using org.apache.commons.io.FileUtils.
2. Declare a Set of words to store the index.
3. Break the input string into sentences using java.text.BreakIterator.
4. Store all sentences in a list.
5. Declare a list of Nodes with length equal to the number of sentences.
6. Initialize each Node in the list with the corresponding sentence from the list obtained in step 4.
7. Find inverseTermCounts for each word in each sentence.
8. Populate the index of words declared in step 2.
9. Declare a map of nodes and their associated weight matrices. (A weight matrix for a node is computed using the product of the term frequency and the inverse sentence frequency of each word in a sentence.)
10. Declare a graph to store all node-to-node associations. (In this case every node will have one association with every other node.)
11. Initialize the graph with all the nodes.
12. For each node (node_1) in the graph:
    12.1 For every other node (node_2) in the graph which is not equal to node_1:
        12.1.1 Edge weight of the edge between these nodes = getDotProduct(Weight matrix of node_1, Weight matrix of node_2) / (getRootOfSumOfSquares(Weight matrix of node_1) * getRootOfSumOfSquares(Weight matrix of node_2))
13. Run the Page Rank algorithm with the graph obtained, the damping factor and the number of top-ranked sentences needed as parameters.
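Steps 3 and 12 of the procedure can be sketched as follows. The edge weight in step 12.1.1 is the cosine similarity of two sentence weight vectors. This sketch is an illustration under stated assumptions: it stores each weight vector as a sparse word-to-weight map instead of a matrix, uses the US locale for sentence splitting, and returns 0 when either vector is all zeros. The class name SentenceGraphSketch and its method names are hypothetical; only java.text.BreakIterator comes from the text above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the original project's class layout is not shown in the text.
final class SentenceGraphSketch {

    /** Splits text into trimmed sentences using java.text.BreakIterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Square root of the sum of squared weights (the vector's Euclidean norm). */
    static double rootOfSumOfSquares(Map<String, Double> vector) {
        double sum = 0.0;
        for (double x : vector.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two sparse weight vectors; 0 if either has zero norm. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm = rootOfSumOfSquares(a) * rootOfSumOfSquares(b);
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    /** Builds the symmetric edge-weight matrix; the diagonal stays 0 (no self edges). */
    static double[][] edgeWeights(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    w[i][j] = cosine(vectors.get(i), vectors.get(j));
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(splitSentences("Mars is dry. It was once wet."));
        System.out.println(cosine(Map.of("a", 1.0, "b", 1.0), Map.of("a", 1.0)));
    }
}
```

Storing the graph as a dense matrix is a simplification that suits this case, since every sentence is compared with every other sentence.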
Page Rank Algorithm
1. Declare a rank map<Node, Rank (double)>
2. Initialize the rank of each node to 1.
3. Set converged to false
4. While (!converged)
  4.1 Create a temporary rank map
  4.2 for (node_i in nodes in graph)
    4.2.1 Initialize sum = 0.0
    4.2.2 for (node_j in nodes in graph)
      4.2.2.1 if (!node_j.equals(node_i))
        4.2.2.1.1 If an edge exists between node_i and node_j
          4.2.2.1.1.1 weightIJ = get the edge weight of that edge
          4.2.2.1.1.2 Page rank of node_j = get the rank of node_j from the rank map
          4.2.2.1.1.3 sum_of_denominator = 0.0
          4.2.2.1.1.4 for (node_k in nodes in graph)
            4.2.2.1.1.4.1 if (!node_k.equals(node_j))
              4.2.2.1.1.4.1.1 If an edge exists between node_j and node_k
                4.2.2.1.1.4.1.1.1 weightJK = get the edge weight of that edge
                4.2.2.1.1.4.1.1.2 sum_of_denominator += weightJK
              4.2.2.1.1.4.1.2 End if
            4.2.2.1.1.4.2 End if
          4.2.2.1.1.5 End for
          4.2.2.1.1.6 sum += (weightIJ * Page rank of node_j) / sum_of_denominator
        4.2.2.1.2 End if
      4.2.2.2 End if
    4.2.3 End for
    4.2.4 rank of node_i = (1 - damping factor) + damping factor * sum
    4.2.5 Put in temporary rank map (node_i, rank of node_i)
  4.3 End for
  4.4 Set converged to true if no rank differs from its previous value by more than a small tolerance; replace the rank map with the temporary rank map
5. End while
6. Return the top 'n' ranked sentences, arranged in chronological order, using the rank map.
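The iteration above can be sketched compactly in Java. This is an illustrative version, not the project's code: it assumes the graph is given as a symmetric matrix of edge weights in which a weight of 0 means "no edge", it precomputes each node's sum of edge weights (the sum_of_denominator of the pseudocode) once instead of inside the inner loop, and it stops when no rank changes by more than a tolerance or after a maximum number of iterations. The names PageRankSketch, tolerance and maxIterations are hypothetical, and the damping factor of 0.85 used in the demo is the conventional choice, not a value stated in the text.

```java
import java.util.Arrays;

// Hypothetical weighted PageRank over a sentence-similarity matrix.
final class PageRankSketch {

    static double[] rank(double[][] w, double damping, double tolerance, int maxIterations) {
        int n = w.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0); // every node starts with rank 1

        // Total edge weight attached to each node (self edges are ignored).
        double[] weightSum = new double[n];
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                if (k != j) {
                    weightSum[j] += w[j][k];
                }
            }
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[n]; // the temporary rank map
            double maxDelta = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    // A positive weight is treated as "an edge exists".
                    if (j != i && w[i][j] > 0.0 && weightSum[j] > 0.0) {
                        sum += w[i][j] * rank[j] / weightSum[j];
                    }
                }
                next[i] = (1.0 - damping) + damping * sum;
                maxDelta = Math.max(maxDelta, Math.abs(next[i] - rank[i]));
            }
            rank = next;
            if (maxDelta < tolerance) {
                break; // converged
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Star graph: node 0 is linked to nodes 1 and 2, which are not linked to each other.
        double[][] w = {
            {0.0, 1.0, 1.0},
            {1.0, 0.0, 0.0},
            {1.0, 0.0, 0.0}
        };
        System.out.println(Arrays.toString(rank(w, 0.85, 1e-9, 1000)));
    }
}
```

In the demo the central node ends up with the highest rank, which mirrors how a sentence that is similar to many other sentences rises to the top of the summary.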
1. Class App
The main class that implements the summarizer algorithm
Another way to determine what to look for is to find the most Martian-like places on Earth. The hyper-arid, high-altitude Atacama Desert, which, gets just 0.6 inches (15 millimeters) of rainfall a year but used to be much wetter, is exposed to punishing ultraviolet radiation and has active geothermal features such as hot springs.
"If you want to find the microbe, you have to become the microbe. Very early on, you need to shelter — you need to adapt and you need to survive," Cabrol said. Microbes would also have to "organize around oases and organize a lot faster."
These Martian oases could be similar, in some ways, to the evaporating lakes, salt flats and hot springs of the Atacama, Cabron said.
Ancient creatures in these Martian environments would likely be extremophiles or superbugs that are highly adaptive, and are possibly very quick to form symbiotic communities, Cabrol said.
While structures that could provide microbial habitats might be found on Mars, researchers will have to know where to look in the first place, Cabrol said. They won't get many opportunities to sample in many places, she said. Finding the tools with the resolution to identify those habitats will also be challenging, Cabrol added.
However, drones that can fly up and down to image the area at different scales could reveal some of the fine detail that provides clues for ancient life, she said.
And some tools already heading out on the Mars 2020 mission could reveal evidence of potential habitats. For instance, Cabrol showed images of Gusev Crater. Pictures of that feature initially lacked the resolution to reveal any evidence of habitat. But after looking at the light spectra reflected, "The spectra are telling us this is something that could be related to hydrothermal activity and constructs," Cabrol said. "There's only one way of knowing — it's to go back."
Rank: 0.43783834671158367 Sentence: "There's only one way of knowing — it's to go back."
Rank: 0.5481579720600451 Sentence: They won't get many opportunities to sample in many places, she said.
Rank: 0.619597050307139 Sentence: For instance, Cabrol showed images of Gusev Crater.
Rank: 0.6988569438953812 Sentence: Microbes would also have to "organize around oases and organize a lot faster."
Rank: 0.7582177019994549 Sentence: The hyper-arid, high-altitude Atacama Desert, which, gets just 0.6 inches (15 millimeters) of rainfall a year but used to be much wetter, is exposed to punishing ultraviolet radiation and has active geothermal features such as hot springs.
Rank: 1.0039943770837287 Sentence: Very early on, you need to shelter — you need to adapt and you need to survive," Cabrol said.
Rank: 1.0072314462422134 Sentence: Pictures of that feature initially lacked the resolution to reveal any evidence of habitat.
Rank: 1.0179725658895435 Sentence: "If you want to find the microbe, you have to become the microbe.
Rank: 1.0301468179131923 Sentence: But after looking at the light spectra reflected, "The spectra are telling us this is something that could be related to hydrothermal activity and constructs," Cabrol said.
Rank: 1.096728114416393 Sentence: Ancient creatures in these Martian environments would likely be extremophiles or superbugs that are highly adaptive, and are possibly very quick to form symbiotic communities, Cabrol said.
Rank: 1.1014702277489823 Sentence: Finding the tools with the resolution to identify those habitats will also be challenging, Cabrol added.
Rank: 1.154489628552796 Sentence: Another way to determine what to look for is to find the most Martian-like places on Earth.
Rank: 1.2741403231012418 Sentence: However, drones that can fly up and down to image the area at different scales could reveal some of the fine detail that provides clues for ancient life, she said.
Rank: 1.3989112787721734 Sentence: And some tools already heading out on the Mars 2020 mission could reveal evidence of potential habitats.
Rank: 1.419985023981618 Sentence: These Martian oases could be similar, in some ways, to the evaporating lakes, salt flats and hot springs of the Atacama, Cabron said.
Rank: 1.4322621813245138 Sentence: While structures that could provide microbial habitats might be found on Mars, researchers will have to know where to look in the first place, Cabrol said.
Final Summary (6 sentences):
--------------------------------------------
Another way to determine what to look for is to find the most Martian-like places on Earth. These Martian oases could be similar, in some ways, to the evaporating lakes, salt flats and hot springs of the Atacama, Cabron said. While structures that could provide microbial habitats might be found on Mars, researchers will have to know where to look in the first place, Cabrol said. Finding the tools with the resolution to identify those habitats will also be challenging, Cabrol added. However, drones that can fly up and down to image the area at different scales could reveal some of the fine detail that provides clues for ancient life, she said. And some tools already heading out on the Mars 2020 mission could reveal evidence of potential habitats.
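The final summary keeps the six highest-ranked sentences but presents them in the order in which they appear in the input, not in rank order. A minimal sketch of that selection step is shown below. It assumes the ranks are held in an array indexed by sentence position; the class name TopSentencesSketch and the method name topInDocumentOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: picks the n best-ranked sentences and restores document order.
final class TopSentencesSketch {

    /** Returns the positions of the n highest-ranked sentences, sorted by position. */
    static List<Integer> topInDocumentOrder(double[] ranks, int n) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < ranks.length; i++) {
            indices.add(i);
        }
        // Highest rank first.
        indices.sort((a, b) -> Double.compare(ranks[b], ranks[a]));
        int keep = Math.max(0, Math.min(n, indices.size()));
        List<Integer> top = new ArrayList<>(indices.subList(0, keep));
        // Back to the order in which the sentences occur in the document.
        Collections.sort(top);
        return top;
    }

    public static void main(String[] args) {
        double[] ranks = {0.2, 0.9, 0.5, 0.7};
        System.out.println(topInDocumentOrder(ranks, 2)); // prints [1, 3]
    }
}
```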