A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction
Contract ID: XQa2fKfO0WZyzBw-3T6Xz5rfXTc8XfMUtgKSGrgFGzo
File Type: PDF
Entry Count: 82
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Abstract

People with blindness and low vision (pBLV) face substantial challenges in comprehensive scene recognition and precise object identification in unfamiliar environments. Because of their vision loss, pBLV also have difficulty identifying potential tripping hazards independently. Previous assistive technologies for the visually impaired often struggle in real-world scenarios: their need for constant training and their lack of robustness limit their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore frame the research question of this paper as: how can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses these challenges. Motivated by the growing adoption of large pretrained foundation models, particularly in assistive robotics, where extensive pretraining yields accurate perception and robust contextual understanding in real-world scenarios, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method first applies a large image-tagging model (the Recognize Anything Model, RAM) to identify all common objects present in the captured image. The recognition results and the user query are then integrated, via prompt engineering, into a prompt tailored specifically for pBLV. Given this prompt and the input image, a vision-language foundation model (InstructBLIP) generates a detailed and comprehensive description of the environment and identifies potential risks by analyzing the environmental objects and scene landmarks relevant to the prompt. We evaluate our approach through experiments on both indoor and outdoor datasets. The results demonstrate that our method recognizes objects accurately and provides insightful descriptions and analyses of the environment for pBLV.
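The pipeline the abstract describes (RAM tagging, prompt assembly, InstructBLIP generation) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: tag_image is a hypothetical placeholder for the Recognize Anything Model (whose exact interface depends on the recognize-anything package), and the InstructBLIP calls assume the Hugging Face transformers API with Salesforce/instructblip-vicuna-7b as the checkpoint.

# Sketch of the RAM -> prompt -> InstructBLIP pipeline described in the abstract.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

def tag_image(image):
    # Hypothetical stand-in for the Recognize Anything Model (RAM): should
    # return a list of tag strings for the objects found in the image.
    raise NotImplementedError("plug in RAM inference here")

def describe_scene(image_path, user_query):
    image = Image.open(image_path).convert("RGB")
    tags = tag_image(image)
    # Prompt engineering: fold the recognized objects and the user's question
    # into one instruction tailored for pBLV.
    prompt = (
        f"The image contains: {', '.join(tags)}. "
        f"{user_query} "
        "Describe the surroundings in detail for a person with blindness or "
        "low vision and point out any potential tripping hazards."
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

# Example query, mirroring the paper's risk-assessment prompt:
# describe_scene("street.jpg", "Is there a risk for me to continue moving forward?")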

The first picture depicts a pedestrian crossing with a red light; the model can warn the user that crossing the street is risky while the traffic signal is red. In the second scene, a train is approaching, which can be extremely dangerous if proper precautions are not taken; the model can alert the user that it is risky to cross the railway at the current time. This demonstrates that our model can effectively analyze risks and provide the necessary alerts for pBLV.
id: 9587fe29c24c52d4e1574f60b9241c7a - page: 8

Risk assessment (outdoor scenes). Q: Is there a risk for me to continue moving forward?
A: It is risky to cross this street because there is a red light.
A: It is risky to cross the train tracks while the train is coming.
A: It is risky to walk on a wet sidewalk because it can be slippery.
A: It is risky to walk on the sidewalk because there is a yellow fire hydrant in the middle of the sidewalk.

Scene understanding (indoor and outdoor scenes). Q: Can you describe the environment around?
A: This place is a park, and the environment around it is surrounded by bushes and hedges. The couple is sitting on a bench in the park while having a conversation.
A: This place is a clothing store, and it has a horse cart in front of it. The environment around the store is a busy city street with people walking on the sidewalk.
A: This is a college classroom, where students are sitting at tables and working on their laptops.
A: This is an office desk with a computer, monitor, keyboard, mouse, and speakers on it.
id: b50e7a38fb329fede1108194f50ebff5 - page: 9

Object localization. Q: Where is the {giraffe, sheep, bookshelf, rubbish bin} in front of me?
A: There are several giraffes standing under a large tree in a grassy field. The giraffes are spread out across the scene, with some closer to the tree and others further away. The giraffes appear to be enjoying the shade provided by the tree and the lush green environment around them.
A: A herd of sheep is grazing on a rocky hillside. The sheep are spread out across the hill, with some closer to the bottom and others higher up. There are at least three sheep visible in the scene, with two of them closer to the bottom of the hill and one higher up. The rocky hillside provides a natural habitat for the sheep, and they seem to be comfortable and content in their surroundings.
A: The bookshelf is located in a living room with a couch, a television, and a cat. The bookshelf is positioned in the corner of the room, near the couch and the television.
A: There is a rubbish bin located in a bathroom. The bin is positioned near the toilet and sink, making it easily accessible for cleaning and d
id: 14fb0e3890caabc0fcd5a9b10480950e - page: 9

Figure 4. Examples of scene understanding (top), object localization (middle), and risk assessment (bottom) on the Visual7W dataset.
4.2.2. Quantitative Analysis of Inference Time and Helpfulness Scoring for pBLV
id: 104297823e8eeed2755cd8d0eff022da - page: 9
How to Retrieve?
# Search: retrieve by text query, with optional reranking of the top results

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "XQa2fKfO0WZyzBw-3T6Xz5rfXTc8XfMUtgKSGrgFGzo", "query": "What is alexanDRIA library?"}'
        
# Query: retrieve by raw embedding vector (searched against the HNSW index)

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "XQa2fKfO0WZyzBw-3T6Xz5rfXTc8XfMUtgKSGrgFGzo", "level": 2}'