Diagnosing random TD crashes / getting useful logs

Since starting with TouchDesigner this year, I have ten sites running three different applications 24/7, and more are coming soon. It’s mostly been great, but one of my apps has a random crash and I can’t figure it out.

The crashing app is basically a dynamic video player. It takes in user-created videos (we control the smartphone app that makes the videos, and we have a server that transcodes the videos into a TD-friendly format) and renders an animation featuring the latest video, before returning to a grid of the last 12 videos. Also, on an idle timer, it switches to a promotional video.

This app uses a bunch of simple GLSL shaders to create transition effects, and python + threads for networking (as I’ve documented in my last post here: viewtopic.php?f=4&t=11395). It polls our local HTTP video server every 10 seconds to see if new content is available.

I have similar code running in TD in other installations that never crash. This app seems to crash anywhere from 1-50 hours after starting it. I think the biggest difference is that this app makes a network call every 10 seconds, whereas my other apps are triggered on user interaction and create threads and do socket i/o less often. AFAIK, from Windows’ Task Manager, it’s not a memory leak, and I haven’t found any log file or crash report that gives me any clues.

  • Is there any useful information that can be found in TD log files? I haven’t found anything yet.
  • Are crash logs written anywhere? I can’t find any.
  • Is there a way to log stdout and stderr to a file? My python code print()s useful info to stdout, and uncaught exceptions end up on stderr. I’m using TouchPlayer for remote installations and don’t know how to get logs.

As an aside, we use a Windows program called AlwaysUp to auto-launch our installations and re-start them if they crash. Unfortunately, Windows doesn’t seem to notice that TD is crashing until there’s some user interaction (moving the mouse / clicking the TD window). Then AlwaysUp quickly restarts the app. I hate to say it, but I’ve been experimenting with an auto-mouse-clicker app (murgee.com/auto-clicker/) to try to detect the crashes. :frowning: It’s a horrible fix, but the TD app can restart at any time and happily update with new content… I’d love to hear how other people keep their apps running on Windows!!

We use an in-house process manager that watches to see whether the app has crashed and then restarts it. nVoid makes a tool for this called Stalker:

github.com/nVoid/Stalker

There are also a handful of other apps that do this, ranging from free to cheap to pricey, so the direction you take is largely up to you. I know a good number of people who use TeamViewer to both remotely access and manage apps running as installations.
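To make the "watch and restart" idea concrete, here is a minimal sketch in plain Python. It is not how Stalker or AlwaysUp work internally; `keep_alive` and its parameters are hypothetical names, and the demo command is a stand-in that exits with an error code straight away.

```python
import subprocess
import sys
import time

def keep_alive(command, max_restarts=3, delay_s=0.1):
    """Run `command`, relaunching it whenever it exits with a non-zero code.

    Returns (last_exit_code, restart_count). Hypothetical helper for illustration.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)  # blocks until the process exits
        if exit_code == 0 or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
        time.sleep(delay_s)  # brief pause before relaunching

# Demo: a child process that always "crashes" with exit code 1.
crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(keep_alive(crashing, max_restarts=2))
```

Note that a loop like this only reacts when the process actually exits. An app that hangs without exiting would need some kind of heartbeat check on top.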

When it comes to logging, I made a simple logger that’s full-ish featured. It’ll catch any op errors and log them, and you can also make direct calls into your log file if you want to keep track of what’s happening in your installation:

derivative.ca/Forum/viewtopic.ph … ing#p41717

Thread access is a little dicey with Touch - it looks like you’re on the right track there, but it could also be something in your thread process that causes the crash. Anyway, a logging tool that catches errors or logs events should at least be a gateway towards being better able to track when things happen and when crashes occur.

That’s not a lot of help, but hopefully it gives you a push.

If you use Stalker, let us know if you have any issues. We’re planning on an update soon.

The first place I start with things like these is trying to make a control / replica of the install machine in my office. Then I can systematically disable parts of the network one by one until it stops crashing, and then I know where to start looking for the issue. As Matthew said, I could also see threading + network calls being the issue. I don’t see why GLSL or general Python things would cause an issue. Are you using queues to communicate with your thread?
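For reference, a queue-based hand-off could look roughly like this. It is a minimal sketch in plain Python rather than code from any real project: `poll_server`, `drain`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request. The point is that the worker thread only ever touches the queue, and the main thread drains it without blocking.

```python
import queue
import threading

results = queue.Queue()  # thread-safe channel from the worker to the main thread

def poll_server(fetch, out_queue):
    """Runs in a worker thread; it only talks to the queue."""
    try:
        out_queue.put(("ok", fetch()))
    except Exception as exc:
        # Report the failure instead of letting the thread die silently.
        out_queue.put(("error", repr(exc)))

def drain(in_queue):
    """Call from the main thread (e.g. on a timer); never blocks."""
    items = []
    while True:
        try:
            items.append(in_queue.get_nowait())
        except queue.Empty:
            return items

def fake_fetch():
    # Hypothetical stand-in for the real network call.
    return ["video_001.mov"]

worker = threading.Thread(target=poll_server, args=(fake_fetch, results), daemon=True)
worker.start()
worker.join()  # only for this demo; a real app would keep rendering
print(drain(results))
```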

I also use Process Explorer a lot for these kinds of things:
docs.microsoft.com/en-us/sysint … s-explorer

It’s like task manager but with much more detail and some GPU info as well.

Logging tools help a ton. You can log whatever you’re printing if it’s helpful to review later. Matthew’s logger is nice. We also have a logger in our component library: http://store.nvoid.com.

@raganmd - Ha, yes, recently I’ve spent a good portion of my life squinting at TeamViewer. And upgrading the account so more and more of us can all do it at the same time :wink:

It sounds like the answer for logging is “roll your own”, which is OK by me. It does mean that I won’t be able to capture stderr, which is kind of a bummer. I suppose I can write a decorator to wrap all of my code in try/except and use my logger to print exception info. I’ll definitely take a look at your logger… maybe you’ve done all the work for me already!!
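A decorator along those lines might look like the sketch below. It uses only the standard `logging` module, not either of the loggers mentioned in this thread. The log path, the `log_exceptions` name, and the `parse_count` example function are all hypothetical, and whether the file location suits a TouchPlayer install is an assumption.

```python
import functools
import logging
import os
import tempfile

# Hypothetical log location; pick whatever suits the installation.
LOG_PATH = os.path.join(tempfile.gettempdir(), "td_app.log")
logging.basicConfig(
    filename=LOG_PATH,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("td_app")

def log_exceptions(func):
    """Log any uncaught exception (with traceback) and return None instead."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # log.exception records the full traceback at ERROR level.
            log.exception("Unhandled exception in %s", func.__name__)
            return None
    return wrapper

@log_exceptions
def parse_count(text):
    # Example function: raises ValueError on bad input.
    return int(text)

print(parse_count("12"))
print(parse_count("oops"))
```

Swallowing the exception and returning `None` is a design choice for a 24/7 install; re-raising after logging would be the stricter option.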

@elburz - Process Explorer is a nice tip! Definitely going to try that. I ended up not using queues. Instead, I keep some shared memory for my threads and TD’s thread and mutex-lock it. I use a Timer in TD to poll for results. (So it’s more or less the same as using a queue, except I do the locking myself) Explained here: viewtopic.php?f=4&t=11395#p44538
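As a rough illustration of that lock-plus-polling pattern (a sketch only; the real code is in the linked post, and `SharedResult`, `publish`, and `take` are hypothetical names):

```python
import threading

class SharedResult:
    """A single slot shared between a worker thread and the main thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._fresh = False

    def publish(self, value):
        """Called from the worker thread when a result is ready."""
        with self._lock:
            self._value = value
            self._fresh = True

    def take(self):
        """Called from the main thread's timer; returns a new value once, else None."""
        with self._lock:
            if not self._fresh:
                return None
            self._fresh = False
            return self._value

shared = SharedResult()
worker = threading.Thread(target=shared.publish, args=(["video_002.mov"],))
worker.start()
worker.join()  # only for this demo
print(shared.take())  # the published list
print(shared.take())  # None: nothing new since the last poll
```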

I think we’ve bought all the available stock of Zotac EN1070 mini PCs that haven’t been grabbed by bitcoin miners!! I currently only have one machine available for dev/test :frowning: but the next pile of them should arrive soon. Ofc, long testing will be how I eventually find the bug…

Thanks all for taking time to help me out :slight_smile:

I want to be a good touch citizen and follow up on this…

I ended up getting in touch with TD support (many thanks for helping on a Saturday, Malcolm), and after we gathered a couple of crash logs, we decided the crash was due to a race/deadlock issue in my Python networking threads. We agreed that this crash shouldn’t happen, though it did. Luckily, my networking needs for this project were simple, and I was able to switch to the Web DAT’s asynchronous option. (Also, while I have a few installations running networking code like I’ve detailed previously, only this application was polling every 5 seconds for updates. This explains the random nature of this crash and why I haven’t seen it elsewhere: in the other apps, the probability of it happening is MUCH MUCH lower.)

I’ve been so busy with the latest installation running around the US that I haven’t followed up, but I expect that finding the source of this crash will be tricky. I’ve heard wisdom from TD folks that python threads are dangerous, and I assumed it was because most TD people are not experienced software engineers WRT threading. Now it seems like the wisdom has some weight :wink:

For my usage, the Web DAT is too simple – e.g. it doesn’t support enough HTTP verbs. Hopefully I can find the time to make a suitable replacement with C++ soon, and if I do, I’ll certainly share it with the community.