GPT-V and OCR for Screen Control
Hey there everyone, now that AI can "see" very well with GPT-V, I was wondering if it can interact with a computer like we do, just by looking at it. One of the shortcomings of GPT-V is that it cannot really pinpoint the x,y coordinates of something on the screen very well, but I worked around that by combining it with simple OCR and annotating the screenshot so GPT-V can tell me where it wants to click.

It turns out that with very few lines of code the results are already impressive: GPT-V can control my computer surprisingly well, and I can ask it to do tasks by itself. It clicks around, types text, and presses buttons to navigate.

Would love to hear your thoughts on it!
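To make the idea concrete, here is a rough sketch of what such an OCR-plus-annotation loop could look like (simplified, not the exact code): pytesseract extracts text boxes from a pyautogui screenshot, each box gets a numeric label, the model is asked which label it wants to click, and pyautogui clicks the center of that box. The `ask_gptv` helper is a hypothetical placeholder for whatever vision-model call you use.

```python
# Minimal sketch, assuming pytesseract and pyautogui are installed and a
# Tesseract binary is available. ask_gptv() is a hypothetical placeholder.

import pyautogui
import pytesseract
from pytesseract import Output


def ocr_screen():
    """Screenshot the screen and OCR it, returning (screenshot, labeled elements)."""
    screenshot = pyautogui.screenshot()
    data = pytesseract.image_to_data(screenshot, output_type=Output.DICT)
    elements = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        if not text:
            continue
        # Center of the OCR bounding box: this is the coordinate we click later.
        x = data["left"][i] + data["width"][i] // 2
        y = data["top"][i] + data["height"][i] // 2
        elements.append((len(elements), text, x, y))
    return screenshot, elements


def ask_gptv(screenshot, elements, task):
    """Hypothetical: send the (annotated) screenshot plus the numbered element
    list to a GPT-V style model and get back the label it wants to click."""
    raise NotImplementedError("wire this up to your vision model of choice")


def run_step(task):
    """One iteration of the loop: look, ask the model, click."""
    screenshot, elements = ocr_screen()
    label = ask_gptv(screenshot, elements, task)  # model answers with a label
    _, text, x, y = elements[label]
    pyautogui.click(x, y)                         # click the chosen element
    print(f"clicked '{text}' at ({x}, {y})")
```

In the real thing you would also draw the numeric labels onto the screenshot before sending it to the model (so it can point at elements by number instead of by coordinate), and add typing and key presses alongside the click action.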
Users praised GPT-V's ability to control a computer, with OCR providing precise clicks, and appreciated the elegant simplicity of tagging screen elements. Commenters mentioned OthersideAI's self-operating computer, suggested adding AI assistance for greater efficiency, and recommended optimizing the tool by allowing multiple instructions per screenshot. A link to related work was also shared.
Users criticized GPT-V's inability to accurately pinpoint x,y coordinates on its own, suggested adding AI assistance for improved efficiency, and noted the limitation of allowing only one instruction per screenshot.