It would be pretty easy to get AI to identify a human, chair, getting on chair, "retrieving something" (vs changing a lightbulb), etc. It could even be easily programmed to see "falling" in a kitchen and identify it as "accident" and as something like a 4 on a one to five scale of "accident severity/relevance".
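Spelled out, that proposal is roughly a rule layer like the following sitting on top of some vision model (a sketch only; the scene labels, event labels, and severity numbers are all invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class FrameEvents:
    """What some upstream vision model (not shown here) reports for a stretch of video."""
    scene: str           # e.g. "kitchen"
    events: set[str]     # e.g. {"person_on_chair", "person_falling"}


# Hand-written rules of the kind described above: "falling" in a kitchen
# counts as an accident of severity 4 on a one-to-five scale.
SEVERITY_RULES = {
    ("kitchen", "person_falling"): 4,
    ("kitchen", "person_on_chair"): 2,
    ("kitchen", "person_retrieving_object"): 1,
}


def classify(frame: FrameEvents) -> tuple[bool, int]:
    """Return (is_accident, severity 0-5) for one batch of detected events."""
    severity = max(
        (SEVERITY_RULES.get((frame.scene, event), 0) for event in frame.events),
        default=0,
    )
    return severity >= 3, severity


print(classify(FrameEvents("kitchen", {"person_on_chair", "person_falling"})))
# -> (True, 4)
```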
If this would be pretty easy, you should do it straight away. A program that could run on a typical processor, and could monitor a video feed of a room and reliably detect when a person falls, would be worth hundreds of millions to the nursing home and home health care industries. (Reliability doesn't have to be perfect. A certain rate of false positives is tolerable. Think smoke detectors.) I'd invest in your start-up. It would be like free money.
Heck, if you demonstrated an AI that could monitor a video feed from a swimming pool and reliably tell the difference between someone jumping or diving in on purpose and someone falling in, then within a year every insurance company in the world would be requiring every pool owner to install one.
Now, just maybe, we're at the point where these AIs would be possible; they just wouldn't be practical for general use because they'd require a supercomputer to run on. (And yes, for some military applications of comparable difficulty that limitation might not matter.) That practical problem becomes a fundamental conceptual problem for the idea of reproducing the human brain's ability to compress raw sensory observations into narrative, as I'll show.
The problem here is that you're looking at the difficulty of a specific case, such as "detect falling in an indoor space and estimate its severity," as if it were representative of the difficulty of the problem in general. Sure, you could probably configure IBM's Watson to recognize "cooking in a kitchen" and "move a chair" and "get on chair" and "retrieve something" and "read from a book" and "put down a small object" and "fall off something" in a video stream. And perhaps one might not notice at first that the narrative thus created (see the sketch after this paragraph) is a terribly poor one compared with what a young child could manage: a string of detected events with no sense of the causal connections between them, no inferences from them (that the book is a cookbook; that the object reached for was a cooking ingredient), and no editing out of the unimportant details (moving the chair). But then you input a different video, say of a young child in a snowsuit climbing a snowbank with a snow saucer, sitting down in it, and then being pushed to slide down the slope by a large friendly dog, and your AI wouldn't be able to make any sense of it. You've solved zero percent of the general problem of turning a stream of sensory data into summary narrative. What you have instead is known in the business as a rigged demo.
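Here is that detector-level "narrative" next to what the child manages (both are mock-ups, obviously):

```python
# What the bank of recognizers gives you: a time-ordered list of detections,
# with no causal links, no inferences, and no editing of irrelevant detail.
detector_narrative = [
    "cooking in a kitchen",
    "move a chair",
    "get on chair",
    "retrieve something",
    "read from a book",
    "put down a small object",
    "fall off something",
]

# What a young child gives you: a compressed, causal account.
child_narrative = (
    "She was cooking from a cookbook, climbed on a chair to get an ingredient, "
    "and fell off."
)
```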
What practical present-day AIs do to overcome such problems is to exhaust all possibilities, by taking advantage of the incredible speed of present-day processors and by constraining the range of possibilities considered. "Alexa" doesn't really figure out what you're saying; it figures out which of a limited (though large) list of possible commands you're most likely giving it.
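A toy version of that trick, with an invented command list, looks something like this: don't try to understand the utterance at all, just score it against a fixed menu and take the best match.

```python
# Constrained possibilities: score the (mis)heard words against a fixed command
# list and return whichever command matches best. The list here is invented.
COMMANDS = [
    "turn on the kitchen lights",
    "turn off the kitchen lights",
    "set a timer for ten minutes",
    "play some music",
]


def best_command(utterance: str) -> str:
    heard = set(utterance.lower().split())
    # Most shared words wins; no parsing, no meaning, just lookup.
    return max(COMMANDS, key=lambda command: len(heard & set(command.split())))


print(best_command("uh, lights on in the kitchen please"))
# -> "turn on the kitchen lights"
```

The real thing is far more elaborate, but the shape is the same: a huge but finite menu, searched very fast.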
But that won't work for the problem at hand, because the range of possibilities is too large and the processing too intensive. Remember how you'd probably need a supercomputer to monitor a video feed and reliably detect whether there's a person falling? To simultaneously detect whether there's a person reaching for something would require another supercomputer. To simultaneously detect whether there's a dog pushing a snow saucer would require another one. And so forth. Long before you run out of possibilities, you reach the limits of processing power.
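Back-of-envelope, with numbers invented purely to show the shape of the problem: if every event type you want to brute-force needs its own heavyweight detector, the cost grows linearly with the size of the event vocabulary, and the vocabulary of things that can happen in front of a camera is effectively unbounded.

```python
# All numbers here are made up; only the linear scaling is the point.
flops_per_detector = 1e15   # suppose one real-time "falling person" detector needs ~1 PFLOP/s
event_vocabulary = 10_000   # falls, reaches, dogs pushing snow saucers, ...
available = 1e16            # suppose you have ~10 PFLOP/s to spend

needed = flops_per_detector * event_vocabulary
print(f"needed {needed:.0e} FLOP/s, have {available:.0e} FLOP/s "
      f"-> short by a factor of {needed / available:,.0f}")
# -> needed 1e+19 FLOP/s, have 1e+16 FLOP/s -> short by a factor of 1,000
```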
But maybe that's because the input is video, which is data-intensive. Maybe you could use just one supercomputer to identify all the objects in each frame. Then a second one to keep track of continuities (e.g. movements of the same objects) from frame to frame. Then a third to determine "actions" and "events" from the continuities. (The chair continues deforming/breaking; the person begins falling.) Then a fourth to judge causalities (the person falls because the chair broke). And so forth. Only the first two layers need to process actual video, to produce coded data such as words, maps, and trajectories that all the subsequent layers would use. Would that help?
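In code, that division of labor would be a pipeline of roughly this shape (stubs only; the stage boundaries and types are just one way of drawing them):

```python
from dataclasses import dataclass

Frame = bytes  # stand-in for raw video pixels


@dataclass
class Detection:
    label: str                         # "person", "chair", "canister"
    frame_idx: int
    box: tuple[int, int, int, int]     # where in the frame


@dataclass
class Track:
    label: str
    trajectory: list[tuple[int, int]]  # the same object followed across frames


@dataclass
class Event:
    description: str                   # "the chair breaks", "the person falls"


@dataclass
class CausalLink:
    cause: Event                       # "the chair breaks"
    effect: Event                      # "the person falls"


def detect_objects(frames: list[Frame]) -> list[Detection]:
    """Layer 1: one of only two stages that touch raw video."""
    raise NotImplementedError


def track_continuities(frames: list[Frame], detections: list[Detection]) -> list[Track]:
    """Layer 2: follows the same objects from frame to frame."""
    raise NotImplementedError


def infer_events(tracks: list[Track]) -> list[Event]:
    """Layer 3: works on coded data only -- labels, maps, trajectories."""
    raise NotImplementedError


def infer_causes(events: list[Event]) -> list[CausalLink]:
    """Layer 4: decides which events explain which others."""
    raise NotImplementedError


def narrate(video: list[Frame]) -> list[CausalLink]:
    detections = detect_objects(video)
    tracks = track_continuities(video, detections)
    return infer_causes(infer_events(tracks))
```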
It might, but it's still a formidable computational task. What would Google Inc. pay to buy a startup that had developed an AI that could reliably summarize documents? Documents are already just words, just about the lowest-bandwidth data you can have, but processing meanings is very difficult, which is one of the reasons why state-of-the-art AI language translation is poor. Can an AI read a Harry Potter novel and summarize it in a page? Not at present, or anywhere on the horizon. But a fourth grader can.
Sheesh, Myriad, what's the point? It's this. The reason you, and Marvin Minsky in 1966 when he assigned undergrads to solve machine vision in one summer, and just about everyone else, vastly underestimate the difficulty of AI is that you think the world is just there for you to see. You think your eyes are like transparent windows you look out of, at things like chairs and soup pots and "reaching for things" and cause and effect and dogs pushing snow saucers, that are just there. Minsky's colleagues in 1966 were making good progress in getting computers to play chess well, which they considered one of the most difficult cognitive tasks humans are capable of. How hard could it be, by comparison, to scan a photograph of a room and find the chess board? A three-year-old child or a trained rat can do that. The answer, going by Moore's Law and the approximately 40 years it took to get the latter capability working well, turns out to be about a million times harder.
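The million isn't pulled out of thin air, by the way; it's roughly what Moore's Law gives you over four decades, assuming a doubling of compute about every two years.

```python
# ~40 years of Moore's Law at one doubling of compute roughly every two years.
years = 40
years_per_doubling = 2
print(f"{2 ** (years // years_per_doubling):,}")  # -> 1,048,576, i.e. about a million
```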
The reason is that, although (at least in our empiricist world view) the objects and (arguably) events really are there, we don't perceive them anywhere close to directly. Our brains reconstruct them from the continuously changing blobs of light and color that our retinas transduce, the continuously changing frequency spectra our ears detect, and a few other signals. Our brains sort out not only the objects and positions and movements (a person climbs on a chair and reaches for a canister) but the causes and explanations (she's cooking in a kitchen and needs an ingredient from inside the canister). We don't perceive the enormous computational effort this requires. Or rather, we do, but we don't perceive it as effort. We perceive it, in part, as consciousness.