If you call someone on the other side of the world, perhaps you notice the latency in responses. For voice that’s okay, but for live music that’s disastrous.
Mark Goldstein thinks he’s solved the latency problem associated with the production of live musical performances online. Having one musician in Bangalore, another in California, and yet another in New York? No problem. Except, perhaps, for finding a mutually agreeable time for them to be awake and play together.
VAMOSI: I’m sure you’ve noticed this. You’re on a conference call with someone who’s halfway around the world. You say something, then you have to pause a moment before they respond. That’s latency. It’s a delay.
With speech, that’s not really a problem. We talk over each other all the time. But with music. That’s a problem.
With COVID there were performances by musicians who were each in different locations. And what we saw on screen was four or six boxes of musicians. But what we didn’t know was that it wasn’t live. Someone had to go and put those performances together, sync the instruments. Or sometimes not sync the instruments (at least on screen).
What if you could remove the latency with music entirely? What if the musician heard only his or her piece of the music, but the audience could hear the whole performance?
In a moment we’ll talk with someone who’s successfully done that, and what it might mean for future communications.
VAMOSI: Welcome to the Hacker Mind, an original podcast from ForAllSecure. It’s about challenging our expectations around the people who hack for a living.
I’m Robert Vamosi and in this episode I’m exploring latency in communications, how, given the way the internet was designed, there seems to be no way around that, except, my guest and his colleagues have come up with something old that works. I hope you’ll stick around.
VAMOSI: Ordinarily on the Hacker Mind I use a licensed music service. This, however, is a live recording and it’s different. The difference you might not be able to discern is that each musician is geographically separated from the others. For example, the ukulele and voice are from Hawaii, the bass is in Brooklyn, the keyboard is in another part of Hawaii, and the melodica is in San Francisco.
In a world where most of us are used to conferencing systems like Zoom, we might be tempted to say, so what? But to a musician, timing and latency are important, and disastrous if wrong.
The problem came to the fore with COVID, when people were not in general traveling, let alone being in the same room with another person not from their household. People had been working on this problem before COVID, and solutions were offered, but they weren’t great. Part of it is the way the internet itself was created: latency is a baked-in problem. So I turned to an expert in digital music and technology.
GOLDSTEIN: I'm Mark Goldstein. I'm a freelance musician and a computer scientist. I've been a freelance musician for a long time, and I'm a computer science graduate, now retired, and I've spent most of my career bouncing back and forth between technology and art, and art and technology. And I used to think I was going around in circles. But after moving to California and being introduced to the wider world of electronic music and computer music and recording technology, I see I've been working perhaps in ever-ascending spirals, oscillating between those two poles. And I've been very lucky to work in both of those areas most of my life.
VAMOSI: As we’ll hear, Mark’s experience with digital music is extensive. So, when the worldwide pandemic hit in 2020, he saw a lot of musicians experimenting with conferencing systems -- and with disastrous results.
GOLDSTEIN: Ah, well, since the pandemic we've seen a lot of musicians attempting to play with each other over the internet. Early experiments were done with conferencing systems like Zoom and Google Meet, and very quickly musicians and teachers learned that there was this problem of a time lag between playing and hearing the response from partners, one or two or many, on the call. And this did not provide a very satisfying musical experience no matter what the context was, whether it was an actual attempt to have a performance or rehearsal, or even something as routine as a lesson. So a few years ago, some very, very interesting software solutions began to appear.
One of the first ones was JamKazam, a commercial product, followed pretty quickly by an open source project called Jamulus. They had slightly different approaches to the problem, but they did give reasonable performance over a limited distance between the players, because of the technology that is employed. And both of those programs have been getting better and better as time goes on.
A third player in that space came from Stanford University, called JackTrip, which was originally an experimental project to connect musicians all over the world. And they realized, pretty much during the pandemic, that this technology would also be applicable for regular musicians who just want to have fun and play. Because one of my mantras is to play. It's all about playing, no matter what the context of the particular musical application is. What JackTrip brought to the table, actually JackTrip with the JackTrip board, came about because people were having problems installing basically experimental software and tweaking it and calibrating it. So they decided they would put out a Raspberry Pi-based appliance that users could just plug into their computer and maybe jump over some of the hurdles to making these projects, these applications, work.
VAMOSI: Oh, so that should work. I mean you get an appliance, you hook it up and we’re good right?
GOLDSTEIN: The downside of that is, you know, there's people that don't know how to put a stereo together either. And so for some of the folks that didn't know how to configure a computer, buying this appliance can help them. For others, they're just as confused when they're told to connect the ethernet cable to the modem. What's that? So there are very technical stumbling blocks to all of these solutions.
VAMOSI: Right, these are musicians, not technicians.
GOLDSTEIN: So what I came up with for Telejam, with a couple of other friends that helped, Boris Smus and Michael McNabb (we've been working together on this for a few years), was this. First of all, the biggest thing was I wanted to solve this problem: no matter how you design these systems, up till Telejam, at some point the distance between players is so great that the travel time of the sound from player A to B, no matter how you're shipping the sound over the internet, the latency, the time lag, just gets too long to play music together.
VAMOSI: So here’s that lag again. If I’m on the West Coast of the United States and Mark is in Europe, there’s going to be a lag between us. When talking, that might be okay. But if I’m trying to play a musical piece along with Mark’s beat -- that’s a challenge. And it doesn’t sound so great either.
GOLDSTEIN: I like to say synchronicity and simultaneity are not the same thing. You can play together, but it won't be synchronized if you're very, very far apart. It's just that sound over the internet travels slowly, and it's not the speed of light. Theoretically it should be, but it just isn't, because of the way the internet buffers, routes, and sometimes disrupts connections.
VAMOSI: So perhaps we need to introduce the concept of hops as well. The internet is distributed. It’s designed to withstand a nuclear attack, so data travels as packets, and the routing information in each packet ultimately leads it to its final destination. The way the signal gets from me to Mark is not direct, but through a series of hops, or way stations, that relay the signal. Sometimes the number of hops to get from point A to point B is small, so the latency is small, and sometimes the number of hops is so great that you start to notice that latency. That’s the way the internet was designed. But even if we had a direct wire, there’s still a natural delay. And even if we had a single beam of light, the speed of light is such that there would also be a natural delay. For example, there’s a natural two-and-a-half-second delay for a single photon to travel to the moon and back. Mark was looking to get around all that.
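To put rough numbers on that natural delay, here is a minimal Python sketch of the propagation arithmetic. It is only an illustration: the function name, the 14,000 km San Francisco to Bangalore distance, and the two-thirds-of-light-speed figure for optical fiber are assumptions chosen for the example, and real internet paths add routing and buffering time on top.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber moves at roughly 2/3 of that


def one_way_delay_ms(distance_km, speed_factor=1.0):
    """Minimum one-way propagation delay, in milliseconds, over a straight path."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * speed_factor) * 1000.0


EARTH_MOON_KM = 384_400    # average Earth-Moon distance
SF_BANGALORE_KM = 14_000   # assumption: rough great-circle distance

moon_round_trip_s = 2 * one_way_delay_ms(EARTH_MOON_KM) / 1000.0
fiber_one_way_ms = one_way_delay_ms(SF_BANGALORE_KM, FIBER_FACTOR)

print(f"Moon and back: {moon_round_trip_s:.2f} s")                      # 2.56 s
print(f"SF to Bangalore, ideal fiber, one way: {fiber_one_way_ms:.0f} ms")  # 70 ms
```

Even this best case, with no hops or buffering at all, leaves tens of milliseconds between distant players, which is the floor any design has to live with.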
GOLDSTEIN: So I was trying to come up with a model that would ship sound around the world enabling musicians to play together in some form without regard to the distance and that's where the architecture of a daisy chain came up. And that's the really, really important thing that we've done that separates us from all of the other applications. Having said that, there is a trade off and I'll get into that later. But the idea of using a daisy chain to conquer the basic latency issue is very, very important.
VAMOSI: So the way it works now for most teleconferencing systems is that you’re connecting to a central server and that server is broadcasting out to everyone.
GOLDSTEIN: Yeah, that's a hub and spoke architecture, pretty much. There's a server in the center of everything when you log into your Zoom call, for instance. Everybody logs into that server and you're communicating with everybody through the server. So one person speaks, it goes to the server, which then broadcasts it to everybody else on the call. And then somebody else responds to whatever speaker one was saying, it goes back to the server, and it gets rebroadcast to everybody else, and there's interesting audio processing going on there. There's echo cancellation happening so that you don't hear yourself coming back. Most of these systems also do active speaker selection and cancellation of everybody else, so only one person can speak at a time. You've probably experienced that on a Zoom call where a couple of people start to speak at once. And the system will decide who gets to stay and who has to take a hike, and it will cut everybody else out, which can be very frustrating, and you learn to be a little bit more polite.
VAMOSI: Telejam got rid of the talking stick idea of teleconferencing.
GOLDSTEIN: But using something like Telejam, where everybody can speak all at once, is more like sitting around the dining room table at Thanksgiving dinner with your family and everybody's just chattering away. It could still be difficult to understand who's saying what to whom, but at least that constraint that only one person can have the mic at a time goes away, and that's important when you're playing music with other people.
VAMOSI: So just having everyone be able to say anything at any time. That still doesn’t help with music, where the beats are what synchronize the musicians, and if those are ahead of where you are by location, that’s not good.
GOLDSTEIN: Before we compare it just to this hub and spoke, say that the other systems like JamKazam and JackTrip and Jamulus... well, actually, Jamulus is still just hub and spoke, but JamKazam, as I understand it, uses peer to peer, where every peer is connected to every other peer. So A is connected to B and C, and B is connected to A and C, and C is connected to A and B, and every one of them, you know, has to manage the latencies between all of the other players. And computationally this puts a heavy load on as you grow the size of the ensemble as well. Whereas with hub and spoke, all you have is one connection to the server and back. The server does all the work, which puts more load on the server when you have more people playing. So those are the two models: hub and spoke, and peer to peer, everyone to everyone.
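A small Python sketch can make the scaling difference between those models concrete by counting network links for a given ensemble size. The function and key names are hypothetical, and the daisy chain entry (the architecture Goldstein mentioned earlier) assumes each player connects only to the next one in line.

```python
def connection_counts(players):
    """Count the network links each topology needs for an ensemble of this size."""
    return {
        "hub_and_spoke": players,                   # one link from each player to the server
        "full_mesh": players * (players - 1) // 2,  # every peer linked to every other peer
        "daisy_chain": max(players - 1, 0),         # assumption: each player links only to the next
    }


for size in (3, 5, 10):
    print(size, connection_counts(size))
```

For three players (A, B, and C) the full mesh needs three links, matching the example above, but at ten players it needs 45, and each player has to manage nine of them.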
VAMOSI: So that’s the backend technology. There’s also the front end technology. What does the musician need to do to get this to work?
VAMOSI: The goal for Mark was to make it possible for people, no matter where they were on earth, to play together on beat, without latency. They decided to chain the musicians together serially.
GOLDSTEIN: So what we came up with is just a simple daisy chain from A to B to C to D, and sound moves in one direction. And the first person, basically, say it's someone who plays piano or might be playing the drum track, that sound then goes over the web to the next person. The next person doesn't really have to know how long it took that sound to get there. It's like somebody making a telephone call. It's in their ears. They hear it. They respond to it by playing along with it. In their computer we can perform a two-channel mix of the incoming player one with the accompaniment played in real time, in the moment, by player two. They are immediately mixed together. That's a zero-latency mix. There is no lag between those two players. And then that mixed audio track gets sent to player three, and again, player three doesn't really have any sense of how long it took for the sound to get from player two to player three. They hear two folks singing and playing together. That person playing on top of that will be immediately mixed in at zero latency with respect to what that person is hearing, and that gets shipped on to the next player, and the next, and the next. In the early days I called it sigma streaming, for the mathematicians in the audience. So it's sort of left-associative arithmetic: at every point what you're hearing is ((a + b) + c) + d. So it's similar to streaming. You could also consider it a pyramidal scheme where you're building and building and building a mix one player at a time.
VAMOSI: So for someone who doesn't know the recording process, basically the performers perform individually to control the sound quality of each. You go into a studio, someone lays down an initial beat or a track and then everybody builds on that. So, as a singer, you’re coming in with the music already laid down, and you’re listening to it in headphones while singing.
GOLDSTEIN: It's something a little bit like karaoke. Okay. Imagine that you're having a party and you want to sing some tune but you don't have a band to sing with you. Go online. Go to one of the karaoke sites, you pick your favorite tune, up it comes, and you press play. And your computer's spouting out what we would call a backing track. Buppa buppa buppa buppa. And you step up, if you're at a bar, or maybe if you're at home and you've got a little microphone, you step up and use your microphone.
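The running mix Goldstein describes, ((a + b) + c) + d, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Telejam's actual code: audio is a short list of numbers in "musical time", each player is assumed to play exactly in sync with what arrives in their headphones, and the function names and link delays are invented for the example.

```python
from itertools import accumulate


def mix(incoming, local):
    """Sum the incoming mix with the local part, sample by sample."""
    return [a + b for a, b in zip(incoming, local)]


def daisy_chain(parts, link_delays_ms):
    """Simulate a one-way chain of players.

    parts: one sample list per player, in musical time.
    link_delays_ms: network delay of the link leaving each player except the last.
    Returns (running_mixes, start_offsets_ms).
    """
    if len(link_delays_ms) != len(parts) - 1:
        raise ValueError("need exactly one link delay between each pair of neighbors")
    running_mixes = list(accumulate(parts, mix))               # a, (a+b), ((a+b)+c), ...
    start_offsets_ms = [0] + list(accumulate(link_delays_ms))  # wall-clock lag behind player one
    return running_mixes, start_offsets_ms


parts = [
    [1, 0, 1, 0],  # player A
    [0, 2, 0, 2],  # player B
    [4, 4, 0, 0],  # player C
]
mixes, offsets = daisy_chain(parts, link_delays_ms=[40, 180])
print(mixes[-1])  # [5, 6, 1, 2]
print(offsets)    # [0, 40, 220]
```

Note that the link delays never enter the mix itself. In this model they only shift when each player performs in wall-clock time, which is why the parts stay aligned however slow a link is.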
VAMOSI: The Karaoke approach is legitimate for learning music.
GOLDSTEIN: And for those of us who remember the good old Music Minus One records, which we used to practice with when we were kids, which was called MMO: you put the record on, it played the Mozart Clarinet Concerto, and you could play along with the orchestra. This is exactly the opposite. It's OMM, One More Musician. Every trip down the line, there's one more player in the mix, one more player in the mix, and it grows and grows and grows till it pops out at the end with a full ensemble, which can be live broadcast to an audience that's listening, or it can be recorded and then played back to the ensemble to get the idea of what they all sounded like playing together.
VAMOSI: So Mark is just expanding on this basic idea.
GOLDSTEIN: You're just, like, singing along with your backing track. Now imagine somebody has a tape recorder in the room, and they're holding their mic above you. And what that mic is picking up is both what you're singing and what the karaoke site is playing, and that is an overdub. You have something that was prerecorded that's being played back. You've got somebody with some live talent in the room that's singing, playing along with it. And then you've got another recording device that is recording both what was being played back and, at the same time, what that new live person is singing to it. And that's essentially what an overdub is. When you overdub, you have the choice of mixing the two together, which is what we do in Telejam so that we don't increase the computing load. When you go player, player, player, player, it's still just a stream of audio. And the first link only has one player in it. The second has the mix of the two. Non-fungible, mixed together. They're just mixed together. And again and again and again. Whereas some of these systems, like digital audio workstations, would work in a multitrack sort of way.
VAMOSI: So this is important. Mixing tracks together is different from multitrack. In mixing, the two or more tracks are blended. That makes it really hard, later, to separate them out, to make specific edits to one but not another. Multitrack means parallel tracks. If you want to make an edit, later, to one track, you can, without affecting the other tracks.
GOLDSTEIN: How would I talk about multitrack? You can think of them, obviously, as typewriters with red and black ribbons, but we don't have typewriters anymore. But if you just consider, let's say, streams: you've got a stream running, and water is flowing through that stream, and you've got a parallel stream running in another pipe. And sound is running through that too, but the two are synchronized. So instead of water running through those pipes, you could think of sound running through those pipes, and you're filling one with what's being played back. And you're filling the other one with what's being performed at the same time. So there's two tracks in those two pipes that are synchronized. And in the old days when we had tape, because the tape was only so wide, you could only fit so many players individually on their separate tracks. We had stereo, two tracks for the left and right, and then we had the cartridges and the cassettes with the four tracks and the eight tracks. And the advantage of multitracking is that every time you play the tracks back together, you can sweeten and, audio-wise, tinker with each of those tracks separately. Like you've got a separate track for the drums, the bass, and the piano, and the drums are too loud. If you didn't record them together, if you just recorded them on parallel tracks, you can turn that drum up or down and kind of balance it. This gets really nasty in the digital age. You can edit those tracks by slicing out clinkers and bad notes and putting good notes in. If it's the entire ensemble recorded together, it's really hard to say, oh, they really blew it, the oboe player in measure 72 hit a lemon, slice measure 72 out and try to find another take of the whole orchestra playing and slip it in there. But if you're multitrack, you know, if you're lucky, you can just grab that one clinker for the oboe player at that spot, pull it out, find another take where they did it right, and put it in.
This is something that astonished me when I started working in the recording industry with the Dyaxis people. I would go and visit some of the high-end recording studios, like Sony in New York, where people using this kind of realized Glenn Gould's dream. Glenn Gould was a very, very famous pianist who gave up performing live because he was really interested in recording, and he was fortunate that he was mostly a solo performer. And he was very, very compulsive about the quality of the performances that he put to tape, on records that are very, very popular.
VAMOSI: I was lucky to be in film school when everything was still transitioning over to digital. Rather, I had die-hard professors who insisted on teaching the old school film and audio techniques that included razor blades and Scotch tape. Yeah, frame by frame, beat by beat, all laid out on a Steenbeck editing table or over a reel-to-reel recorder. You have no idea how many minute edits I make in each episode of The Hacker Mind.
GOLDSTEIN: With digital it's all cut and paste and cut and paste and cut and paste, as easy as in a word processor. That's the way I used to describe digital workstations in the early days when nobody knew what we were doing. I'd say it's a word processor for sound. And it is, and it's kind of a wonderful thing that the world has turned into. It's all cut and paste, with Negroponte saying bits are bits. But it's a terrible thing that the world's gone all cut and paste, because now you can edit to your heart's content, and I would see these classical editors for audio bragging about how many edits there were in the finished product. They are so clean that you can't hear them. And there's this one idea that, you know, better is better. But the other is that, no, live is live. This is life, and the two will never quite see eye to eye, and they both are reasonable given what the context is and what your goal is. But yeah, so the idea when you're multitracking, to be able to edit individually, is very powerful, and Telejam gives you that option if that's where you want to go, and it was very, very easy to add that to the program. And this is what I discovered: we were very, very happy with the original performance, and the latency problem was solved almost immediately. I was like, wow, this works.
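A toy Python sketch shows why that distinction matters. It is an illustration only: the track names, sample values, and the mix_down helper are invented for the example, and real audio tools work on far longer buffers with more careful gain handling.

```python
def mix_down(tracks, gains=None):
    """Blend parallel tracks into one by summing gain-scaled samples."""
    gains = gains or {}
    length = len(next(iter(tracks.values())))
    return [
        sum(samples[i] * gains.get(name, 1.0) for name, samples in tracks.items())
        for i in range(length)
    ]


# Multitrack: each instrument is kept on its own parallel track.
tracks = {
    "drums": [8.0, 0.0, 8.0, 0.0],
    "bass": [1.0, 1.0, 0.0, 0.0],
    "piano": [0.0, 2.0, 0.0, 2.0],
}

rough_mix = mix_down(tracks)                         # everything blended at full level
rebalanced = mix_down(tracks, gains={"drums": 0.5})  # drums halved, other tracks untouched

print(rough_mix)   # [9.0, 3.0, 8.0, 2.0]
print(rebalanced)  # [5.0, 3.0, 4.0, 2.0]
```

With the separate tracks on hand, turning the drums down is a single gain change. Given only rough_mix, there is no way to tell how much of each sample came from the drums, so the same fix is no longer possible.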
VAMOSI: Problem solved, right? Latency was conquered.
GOLDSTEIN: But then we discovered all the different ways people wanted to use it and all the different, you know, I wouldn't say complaints, but the pain points they had: can we do this like this? Okay, and we just started a little bit of feature creep. I don't think it's too much. Because when I look at what's in there, it's a nice kind of summary of what my life's work has been. And in another way, it's a very interesting summary, because at my home consultancy (I haven't done much consulting at home) I used to call myself StoneSoup Studio. I still do. And you know the story of Stone Soup, with the soldiers coming into town, and they're really, really hungry. But there's no food, so they put out a pot of boiling water, and people say, what are you doing? Oh, we're making stone soup. They drop the stone in it. But man, if I only had some carrots. Oh, I have carrots. So originally this came to my mind when I started to get involved in computer and music software for myself rather than working for CCRMA or working for the DAW companies. First, I was doing some consulting for one of the companies at home. I couldn't work in their offices anymore because they had a sick building. I said, well, you know, I'd love to consult, but I need a computer. You know, I don't really own my own computer because I've never had to pay for one before. Early days in PCs. We'll lend you a Mac. Oh, great. Fine, the Mac comes into the house. I'm doing work for them. It's great. I've got this now. Boy, it would be fun to play with the Max programming language, which I had been using at CCRMA. But, you know, I don't really have the money to do this. And then one of my friends said, yes, I'll give you a copy of Max. Well, that's cool. But I really don't have a synth module to make the sounds that I need. Does anybody have an extra synth, you know? And I don't have a loudspeaker to listen to. And gradually, over time, I built up this little home studio, all from Stone Soup.
It'd be really nice, really nice, really nice, and then over time I realized what I needed and I made a little bit more money. So I started to buy on my own. So that's one genesis of Stone Soup: everybody likes contributing what they contribute.
VAMOSI: There is a sense of opening up the world. You don’t have to rely upon the local musicians you have. Now you can bring in talent from around the world. “If only I had a bass player …”
GOLDSTEIN: When I think of Telejam, it's like, well, we got this ensemble here together and I'm playing piano, I got the bass player, but I really could use a good singer. There's a singer over there in Bangalore. Can we bring her in? Sure. Let's put the singer in there. And then somebody else wants to play drums, but he's over in Brooklyn. Yeah, come on. Let's do it. Put him in, from Brooklyn. And it's Stone Soup. It's sigma streaming and it's Stone Soup. Which makes me very happy.
VAMOSI: All is not perfect with this method. Mark alluded earlier that there was a downside to using the daisy chain method.
GOLDSTEIN: The trade-off here is that when you're working in a one-way daisy chain, you can only hear the people who played before you, right? You can't hear the people after you. And when you first explain this to some musicians, they'll kind of screw up their face a little: you mean I can't hear the people that come after me? I can hear the people that came before me, but that's kind of weird. And the reply to that is, not really. We've been making music like this, and it's called overdubbing, in recording sessions for decades and decades, and people who are familiar with how to do that can be comfortable with it. Some musicians may not be; many, many are. This is not a new idea, to overdub. What's new is to do it in real time.
VAMOSI: What Mark’s talking about is the way in which studio recordings are done. Recording each track is done over a period of time. In this case, this is done live.
GOLDSTEIN: And that can be very, very pleasing, if you're ready to take the musical hit that everyone can't hear everyone, but you certainly can hear everyone who came before you. And this is a big deal, because when we hit the pandemic and everybody went to ground, we had choral directors, we had orchestras, all going, oh my gosh, we can't play together. We've got a service on Sunday. You know, we want to keep the orchestra members happy and keep our audience engaged. What are we going to do? And out came the Brady Bunch videos.
VAMOSI: What Mark’s talking about is Zoom, and Google Meet, and any other type of conferencing software we use today. The idea that we’re in these little boxes and can look around at each other, even if we’re thousands of miles apart. But here sound and picture are not in sync. Not really. See, it’s one thing to conquer latency with audio; it’s entirely another to do so with video streaming. Ah, but you saw that performance during the pandemic where everyone was in sync? Chances are they were edited together, after the fact. What Mark is trying to do is capture the spontaneous musical event, live.
GOLDSTEIN: Nothing new about Brady Bunch videos, where you see a matrix of faces and people are singing and playing together. None of that is done in real time, and so many people never understood that. I was amazed when these started to happen and proliferated on the web in the early days of the pandemic, and people would go, did you hear what the Berlin Philharmonic did? Or, my church choir is singing on Sunday services. They're all singing together. It sounds great. Or I think there was a Rolling Stones special at one point where they had the Stones, each in their living rooms, like, playing together. Like, no, no, this is not what's going on. It's a sound sculpture. It's an assembled mix, which is a wonderful thing. But it's all done out of real time. When somebody lays a track down, the track goes to somebody else to play against. This can be done in situ in a studio where you bring the talent in one person at a time. There are also web-based applications that allow you to do this online and on the fly, where, you know, the evolving mix is in the cloud. The engineer can call people in and they play what they have to play and then they go home, and then they record the next person, and that's all mixed together. This is all wonderful. This is what I call varieties of musical experience. It doesn't do much for the players, I think. Like, my orchestra, the Redwood Symphony, started to do this early on. And we were given backing tracks and told to go home and play to them on our cell phones and send them in, and it's not so exciting, I think, for me as a musician to lay a track down against everybody else, especially when I knew, as I was experimenting with this, that this technology could let us do it together.
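The trade-off can be stated in a couple of lines of Python. This is only a sketch: the player names and the audible_to helper are invented for illustration.

```python
def audible_to(chain):
    """Map each player to the players they can hear in a one-way daisy chain."""
    return {player: chain[:position] for position, player in enumerate(chain)}


chain = ["piano", "bass", "singer", "drums"]
heard = audible_to(chain)

print(heard["piano"])  # []
print(heard["drums"])  # ['piano', 'bass', 'singer']
```

The first player hears nobody, and the last player hears everyone who came before. Only a listener at the end of the chain gets the whole ensemble.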
VAMOSI: And we’re not just talking about the Rolling Stones performing together. We’re also talking about the local church choir. They can benefit from this as well.
GOLDSTEIN: How many people are listening... how many choir directors I personally know that would send an email saying, I need your track by Thursday. Please give me your tracks so I can mix them together and assemble the songs that we have to play for Sunday services. That was happening again and again and again. This led me to add other features to Telejam that take advantage of some of my previous experiences in the music world.
VAMOSI: Oh, yeah, we said earlier, Mark is quite accomplished in the digital music industry. Mark was part of the design for the original digital audio workstations or DAWs.
GOLDSTEIN: Besides being interested in sound qua sound and recording and mixing, and WebRTC technology and such, I was one of the people who participated in the digital audio workstation revolution, not online, but digital multitrack recording on computers, which blossomed into programs like Audacity and Pro Tools. But when the SoundDroid blew up at Pixar, I call that the Big Bang of digital audio, and so many people that were involved were at Pixar or Lucasfilm at the ranch. So many people who were involved in that project took off and started their own small companies. Andy Moorer was one, who developed the SoundDroid originally, and he did Sonic Solutions, which was a very high-end digital audio workstation. A couple of other fellows put together a high-quality D-to-A converter and then got a brilliant programmer named Adrian Freed to write a waveform editor for it called MacMix, and they formed a company called Integrated Media Systems, and I was employee number 10 with them, and we put out a system called Dyaxis MacMix. It was one of the first commercially available popular digital audio workstations. Dark Side of the Moon was remastered with it. A lot of the Mahler stuff done with the San Francisco Symphony is still being recorded on a follow-on system that I designed with Sony for the digital audio, one-bit DSD Super Audio CDs. So I was involved in the recording end of the industry early on, and I learned that there are many, many different ways to record. There's recording for broadcast, for video, for radio, for post production, for audio, for CDs. I had already been a recording performer.
In my days in Baltimore, I was staff percussionist at Maryland public radio, and for many, many years when Wall Street Week came on nationwide, the first thing you heard was my xylophone going, I think, and we did many, many overdubbing sessions there, once a month, so I'm very familiar with what recording's like. So I realized we already had this capability with Telejam to play down the line. And obviously I wanted to make recordings at the end anyway, to archive and verify what we were doing. And also, in the debugging process, to try to squeeze the kinks out and make sure we weren't fooling ourselves that we were synchronizing. We built basic recording capability into Telejam at the end. So the first reason for doing this, obviously, is you've got a one-way daisy chain, everybody's playing, but you only hear that person to your left. Well, we just did a wonderful take of Fly Me to the Moon. I wonder what the singer did in measure 17 when I played that special little chord?
VAMOSI: This may be subtle to a non-musician, but hearing a slight change in the live music can either throw off a mediocre musician or inspire a great musician. With one-way daisy chains, you don’t know until later how that change was interpreted.
GOLDSTEIN: What I wondered what that singer did to measure 70. When I slipped that specialty record, it's like the tail wagging the dog here. When you're in the beginning. You can't hear what's happening down the line. So it's very satisfying as soon as we finish the take. I use the word take as if we're in a recording session. Boom. I can then take what we just recorded and play it back to the group and we can all hear it. So there's this very, very fast, communal instant feedback. But again, if we're in daisy chain mode, we can't talk about it. So in the early days what I was doing was turning off the daisy chain and turning it on to zoom going back to zoom playing the recording into zoom. And then I thought, well, there's no reason for the daisy chain to run in one direction. The reason for one running in one direction is to synchronize. But we can ship audio in two directions about a peer to peer connection. It just won't be time synchronized. But who cares? If we're talking to each other? There'll be longer and longer gaps if we're as far away as Bangalore to America. You get used to it. Some people do, some people don't. So we built into a daisy chain in what we call talk mode for conversational work. We had the one way for recording and then we could blow play that recording back made just made everything just a lot more fun, effective faster. Then I realized, well if we're recording the mix at the end, why not record the individual contribution as a track at every player's position? No reason not to do it. It's just a recording. And because we're shipping compressed audio, peer to peer, you know for efficiency's sake and it's very, very good at Opus encoders on web and web RTC web RTC and they sound damn good. It's you know mp3 quality sound which is, quote unquote good enough for a lot of people, especially if the object is to have fun and play together not produce Deutsche Grammophon level audio file record. 
However, what's going on in each person's computer is a local recording, right there, which is a WAV file. No compression, full-quality audio, whatever the computer can do. And it's what each player plays. So now we've got a mix coming out the end, which is very rapidly produced and effective, but we've also got the individual tracks, or stems, recorded at each place, which can then be uploaded to the cloud, downloaded back to a producer's location, and remixed in any digital audio workstation of their choice. So we now have this hybrid mode of working. It's got the freedom and the immediacy of a live conference call or live session, but it also has the capability to track and remix and re-record and post-produce like you would do with a DAW, which is very, very powerful. So now a choir director can say, I'll see you all at six o'clock on Telejam, and the piano player can start playing and everybody can sing their part. Everything can be recorded. You can listen to the mix and say, I didn't quite like what you guys did at letter B, let's do another take. Boom. Do another take. Thank you all very much, I've got all your tracks, see you in church on Sunday, online. And now the producer has got all the tracks they want. They can either use the already-mixed rough cut if it's good enough, or they can start slicing and dicing and editing and do all the magic that we've been doing with digital audio workstations for decades. So my point here is that we've got a hybrid system that takes advantage of a lot of existing technology and existing working styles, and amps it up by adding this ability to play synchronized, everybody together.
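The hybrid workflow described above (a one-way chain that overdubs each player onto the incoming mix, plus an uncompressed local stem kept at every position) can be pictured with a small sketch. This is only an illustration under assumptions: the `Node` class, `play_along`, `run_chain`, and the sample values are hypothetical names and numbers, not Telejam's code, and the real system streams Opus-compressed audio between peers rather than passing Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One player's position in the chain (hypothetical model, not Telejam code)."""
    name: str
    stem: List[float] = field(default_factory=list)  # local, uncompressed track

    def play_along(self, incoming: List[float], own: List[float]) -> List[float]:
        # Keep an untouched local copy of this player's own contribution.
        self.stem = list(own)
        # Overdub: add the contribution to the mix received from upstream.
        return [a + b for a, b in zip(incoming, own)]


def run_chain(nodes: List[Node], performances: Dict[str, List[float]], length: int) -> List[float]:
    """Pass a mix down the chain in one direction; the last node emits the full mix."""
    mix = [0.0] * length
    for node in nodes:
        mix = node.play_along(mix, performances[node.name])
    return mix


# Made-up sample values, purely for illustration.
players = [Node("ukulele"), Node("bass"), Node("melodica")]
parts = {
    "ukulele": [0.10, 0.20, 0.10],
    "bass": [0.05, 0.05, 0.05],
    "melodica": [0.00, 0.10, 0.20],
}
final_mix = run_chain(players, parts, length=3)
stems = {p.name: p.stem for p in players}
```

The rough mix falls out of the end of the chain immediately, while `stems` holds the per-player tracks a producer could later remix in a DAW.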
VAMOSI: So we started out talking about latency. Here’s an example of a flutist in LA, bass and voice in separate homes in Baltimore, and percussion in San Francisco. But we still haven't resolved the issue: from here to Bangalore is a certain distance, and that takes a certain amount of time, whether that be nanoseconds or full-on seconds. You're daisy chaining this. It seems to me that you still have the latency; you're just pushing it out.
GOLDSTEIN: Right. The problem is you don't experience the latency. It's all relativistic, naturally. During the pandemic, I started reading all about the new physics and things, and wrapping my head around time travel and the speed of light. So if you've got the God view, and you're standing outside everything, you can look down and see that Mark in San Francisco just played boop, boop, boop, but the guy over in Bangalore hasn't heard it yet. And some milliseconds later (it's usually sub-second, but it's lots of milliseconds later), all of a sudden the guy in Bangalore hears boop, boop, boop, but I'm already going chunky, chunky, and he doesn't hear that. So the point is, from his point of view, there is no latency. Again, if I make a phone call to Paris and I go, hello, the person in Paris just hears hello. They don't know when I said it versus when they heard it. But from my point of view, in a full-duplex, two-way conversation, I go, hello, how are you? And I'm waiting, waiting. It goes over to Paris. They say, hello, how are you doing? And then it comes back. Wait, boom, boom, then I get it. So latency is a tricky thing. People go, oh, latency, look at the number. Okay, there's a number, but who's hearing it? If you're not hearing it, it's like the tree falling in the forest: is it really there?
GOLDSTEIN: And in fact, the original meaning of the term latency is hidden. That was when symptoms for a disease would not appear until after the disease has already taken over. There's this latent period before you even knew you had it. And in that sense, what we're doing to latency is hiding it. You don't really know it, we're just hiding it. It's not there. Right? It's good. It's just not there. It's magic.
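To make the "God view" concrete, here is a minimal sketch of the arithmetic, with invented hop latencies. The function names and the millisecond figures are assumptions for illustration only. In a one-way chain each listener simply receives the music later in wall-clock time and never waits for a reply, whereas a two-way exchange makes the speaker sit through the full round trip.

```python
from typing import List


def hear_times(played_at_ms: float, hop_latencies_ms: List[float]) -> List[float]:
    """Wall-clock time at which each position in a one-way chain hears a note
    that the first player produced at played_at_ms."""
    times = [played_at_ms]
    for hop in hop_latencies_ms:
        times.append(times[-1] + hop)
    return times


def round_trip_wait(one_way_ms: float, reply_delay_ms: float = 0.0) -> float:
    """How long a speaker in a two-way conversation waits to hear a reply."""
    return 2 * one_way_ms + reply_delay_ms


# Hypothetical chain: San Francisco -> New York -> Bangalore.
chain_times = hear_times(0.0, [40.0, 120.0])

# Hypothetical two-way call with the same total one-way delay.
wait = round_trip_wait(160.0)
```

With these made-up numbers the last player hears the note 160 ms after it was played but has nothing to compare it against, so nothing feels late. In the two-way call the speaker experiences a 320 ms gap before any reply can arrive.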
VAMOSI: So I'm wondering if there are any other applications where the daisy chain could be used to resolve latency problems.
GOLDSTEIN: I can say that one of the things that people do miss by working only in the audio domain is the visual. And originally I thought, well, yeah, we can probably just do exactly the same thing with the video as well, streaming video person to person to person to person. But again, here's the thing about copy-paste, which works and doesn't work. Nicholas Negroponte, the founder of the MIT Media Lab, was famous for saying bits are bits, which got me really mad when I read that, because not all bits are created equal. You can drop one audio bit from a very quiet passage in high-quality music, and if you're in an ideal listening environment, you will hear the tick. An audio bit is a larger part of the total data stream, really, of the audio stream. You drop one bit in a video, one pixel in a video, no one's gonna see that. So if you're copying and pasting bits, bits are not always bits. What they represent means something, and this is why it's a different set of problems to synchronize the video and the audio on a stream. Programs like Zoom today have that problem, because there's an audio stream that they have to process and there's a video stream that they have to process. And I don't think anybody is convinced, when you watch a Zoom production, that those lips and the sound, what you're seeing and what you're hearing, are actually in sync. That's a problem. There are all kinds of tricks to make it better, and Zoom and the rest of the field are getting better and better at it. So how would that apply to Telejam? Same problem. And probably there are even harder constraints, because the reason the musicians want to see video is that they want to see the conductor. They want to somehow see what the leader is doing, even if the leader is not playing anything. Okay, that's a hard problem.
And I started to think a little bit more about it. Originally, I thought we could just do a Brady Bunch thing, because you can't mix video like you can mix sound. You know, if I just slap your picture on top of my picture on top of another player's picture, you're not gonna see anything. So what happens? You need a matrix. You very much need what Zoom does. Okay, so you have a black matrix, maybe, and you've got six players, so you've got three by two, and the first player goes into his little corner and it gets streamed with the audio. And then, okay, you've got the second player playing their audio, and their audio gets mixed. That's not a problem. What are you gonna do with their video? Okay, well, we'll put it into their special little square as well. But it ain't that simple, because the processing that's going on in the machine at that node is in separate forks, probably, for the video processing and the audio processing. How do you wed them back together again so they're still in sync with each other? It's kind of like a meta-synchronization problem now. I think it can be solved. I'm not sure it can be solved in a web app, which has, you know, constraints on processing power and memory usage. It might require a native app with a large amount of space for buffering video. And one of the other wonderful things about Telejam is that there is no timecode represented in the stream, since all we're doing is taking a stream of audio and doing real-time overdubs. You don't need any timecode. Many other systems are time stamping and sending timecodes and having synchronization and extra buffering, and we don't have to do that. But if you're gonna do that with video, which is coming in at a different rate, with different processing, it sounds like something like a timestamp system might be the better way to go. And it's not going to be as simple as the Telejam daisy chain.
VAMOSI: While Telejam is an audio-only application, it can run alongside a video streaming or conferencing app to provide a simultaneous video feed. This will not be in sync with the audio. Players performing should not listen to the video app, but use it to see and be seen by each other and by a wider audience watching the video app. The Telejam output can be patched into the video app to send the synchronized audio to the audience.
GOLDSTEIN: Yep. Well, yeah, we have a Zoom channel on and people can see each other, but it's a really bad thing to watch because it's out of sync, even more than Zoom would be if everybody's playing together, right? Even muted. So that's kind of a sad thing. On the Telejam site, telejam.net, it's kind of fun: we have a few audio files of people around the world playing together, and we have one where a video was taken without my knowledge by one of the participants. Bobby (I'm blanking on Bobby's last name), who's the accordion player and also an amazing machine learning maven, has his machine learning algorithms writing polka music, and we recorded some of the stuff that his ML wrote for us, with him playing accordion, another music information retrieval expert, Cynthia Lim in Delft, playing melodica, and me sitting in California just clapping along, because I wanted to make sure everything was right on the money. Unbeknownst to me, Bobby was doing a screen capture of this, and he posted it. And I'm looking at it going, well, yeah, we're really playing together. Wow. I know we're playing together, I could hear it, but I'm looking at the video and thinking, that's impossible. He took a screen capture of his video. We were not playing together in that video; those videos have to be coming in at different times. But what he had done was post-produce the video and resync the audio. There were three images there: accordion, melodica, and me, and I had a very visible clap going on. So he synchronized the clap sound with my hands clapping. So when you look at the video, you hear clapping and see my hands move, and it's like, yep, I'm in sync. And then you watch the other players just blowing their instruments, and the trick is you can't tell what they're playing by looking at them. But I zeroed in on it because I know the piece, and I started to watch the thing. Uh huh. Nope.
Yeah, those videos are not in sync, but the hand clapping is in sync, and that's the trick. And I actually did a trick on the trick. I had two sisters, one in Zurich and one in North Carolina, who wanted to sing "Because I Knew You" (from Frozen, or from Wicked; from Wicked) as a birthday present to their mother. I hooked them up with a backing track that I played for them to hold them together, and they sang together. And at the same time I had them on YouTube, not in Zoom, and I said, don't look at it, it's gonna confuse you, just listen to your pianist. But I made a recording of it. And you can see them sitting there waiting for their cues and singing, and I could cue one up to the sound, but then the other one wouldn't be cued. So I went and excised the two singers into two separate videos, threw them into iMovie, synced the two singers up with the actual track, made another iMovie event, and off it went to mom as a birthday present. And it works. But that's like a Brady Bunch solution. That's an after-the-fact solution. You could imagine, however, using something like OBS. OBS is an open source online broadcasting system that a lot of people in streaming use nowadays. It enables you to grab pieces of the screen and set delays on them in real time. So you could imagine some sort of super OBS setup where you've got Telejam running, you've got some video app running via, you know, Zoom or Meet or Jitsi or all kinds of other things that are out there, and then have OBS running to snarf the particular players on the screen, apply a different delay to each one to synchronize them, and ship that in with the music. So there's an opportunity for what should be really a super system. There's a lot of work involved, right? Just a lot of labor.
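The "different delay to each one" idea reduces to simple arithmetic once you know how late each player's picture lines up against the music. The sketch below assumes those offsets have already been measured somehow (for example from a visible clap, as in the anecdote above); the function name and the offsets are hypothetical, and nothing here calls OBS or any real video tool.

```python
from typing import Dict


def alignment_delays(offsets_ms: Dict[str, float]) -> Dict[str, float]:
    """Given how late each player's video shows a given beat (in ms, relative
    to a common clock), return the extra delay to add to each video so that
    every picture shows that beat at the same moment as the latest one."""
    latest = max(offsets_ms.values())
    return {name: latest - offset for name, offset in offsets_ms.items()}


# Made-up offsets: players further down the chain play (and appear) later.
measured = {"accordion": 0.0, "melodica": 80.0, "clapping": 200.0}
delays = alignment_delays(measured)
```

The earliest video gets the largest added delay and the latest gets none. The resulting values would then be entered as whatever per-source delay the compositing tool offers, with the synchronized audio delayed to match.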
VAMOSI: So, throughout this journey, was there something that surprised Mark the most?
GOLDSTEIN: One thing that surprised me is that we've conquered the latency but we haven't conquered time. So if I want to play with my friend in Bangalore, which is like 13 and a half hours away, we have to find a time when we're both awake. Which is challenging, but we've managed. The other thing is, I'm still curious about scaling. When we first began, it was like, how many players can we mix together before we saturate the audio mix? And we were happy to discover some research work that was just doing the math on mixing: if we control the levels of each participant, keeping them down around minus 12 or minus 18 dB, the successive addition of new tracks does not peg the audio for quite some time. And indeed, when we carefully manage the audio levels of each player, we haven't had a problem with saturation yet. A chain is only as strong as its weakest link. It's still the internet. So you have to make sure you've got a peer-to-peer link for every player that's going to hold, because if one link in the chain breaks, the chain is gone. You have to restart it. We've had that happen occasionally, but not that often. Right, but it does happen. It's a risky take. The largest ensemble I've used so far has been about seven players of the Peabody Ragtime Ensemble, coming from their homes, playing very raucous ragtime music, and that seems to work. I would love to work with a string quartet or a wind quintet. Playing jazz, where you're improvising and there's no score, is enabling, because people can just play off what they're given. On the other hand, because it's not an all-way listening event, you can't really have the entire ensemble reacting to each other. But if you put the players in the correct order, it is fine, if the players are amenable to that. Written music is harder, where you really need a conductor.
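The headroom point can be checked with standard decibel arithmetic. The sketch below is general audio math rather than the research Goldstein refers to, and the function names are made up for illustration. It treats each player as peaking at a fixed level relative to digital full scale and asks how many such tracks can be summed before the worst case (all peaks coinciding) exceeds full scale.

```python
import math


def db_to_amplitude(level_db: float) -> float:
    """Convert a level in dB relative to full scale into a linear amplitude."""
    return 10 ** (level_db / 20)


def worst_case_peak(n_tracks: int, level_db: float) -> float:
    """Peak of the sum if every track hits its peak at the same instant."""
    return n_tracks * db_to_amplitude(level_db)


def max_tracks_without_clipping(level_db: float) -> int:
    """Largest track count whose worst-case summed peak stays within full scale."""
    return math.floor(1.0 / db_to_amplitude(level_db))


def uncorrelated_sum_db(n_tracks: int, level_db: float) -> float:
    """Typical level of the sum when the tracks are uncorrelated (power adds)."""
    return level_db + 10 * math.log10(n_tracks)


tracks_at_12 = max_tracks_without_clipping(-12.0)
tracks_at_18 = max_tracks_without_clipping(-18.0)
```

In this worst-case model, players held at -12 dB leave room for three coinciding peaks and players at -18 dB leave room for seven. Real performances rarely peak together, so the typical summed level grows more slowly, roughly 3 dB for each doubling of the number of players.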
VAMOSI: Mark does have a solution for this.
GOLDSTEIN: One of the things we can do is to put a click track in, which we really don't like to do. But again, one of the tricks we can do with Telejam, or basically with anything that has recording or multitracking capability, is this: I have a stereo signal going down the daisy chain. The left channel is the actual performance, but the right channel is what I call the ghost track. It can be as simple as a metronome. It can be like a karaoke backing track, like I did for the singing sisters, where I took the piano and sent that down on a separate track. It could be a rehearsal for a choir, where the conductor is conducting the pianist, the pianist is playing, and everybody in the choir drops in and drops out with the piano running all the way down. When it pops out at the end, you can mute the ghost track, and all of a sudden you've got an a cappella performance, or you've got the performance without the ticking metronome. So there are ways to hold an ensemble together that don't interrupt or corrupt the listening experience, depending on what you want to do. There's more work to be done with that.
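The ghost track idea can be sketched as a stereo stream in which each player adds only to the left (performance) channel and passes the right (ghost) channel through untouched, so the guide can be muted at the end. The function names and sample values below are hypothetical, and this is an illustration of the routing idea rather than Telejam's implementation.

```python
from typing import List, Tuple

Frame = Tuple[float, float]  # (left = performance mix, right = ghost track)


def start_stream(ghost: List[float]) -> List[Frame]:
    """Begin the chain with silence on the left and the guide on the right."""
    return [(0.0, g) for g in ghost]


def add_player(frames: List[Frame], own: List[float]) -> List[Frame]:
    """Overdub a player onto the left channel; the ghost channel is passed through."""
    return [(left + sample, right) for (left, right), sample in zip(frames, own)]


def final_output(frames: List[Frame], mute_ghost: bool = True) -> List[Frame]:
    """At the end of the chain, optionally silence the ghost channel."""
    return [(left, 0.0 if mute_ghost else right) for left, right in frames]


click = [1.0, 0.0, 1.0, 0.0]  # a made-up metronome pattern
stream = start_stream(click)
stream = add_player(stream, [0.1, 0.1, 0.1, 0.1])  # first singer
stream = add_player(stream, [0.2, 0.2, 0.2, 0.2])  # second singer
released = final_output(stream)
```

Every player hears the guide while performing, but `released` carries only the summed performance, which is how a choir rehearsed against a piano can come out as an a cappella take.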
VAMOSI: I’d really like to thank Mark Goldstein for sharing his work to make live musical performances possible no matter where the artist is physically located. Think of it, all the great musicians are able to perform, as one, in real time, without them leaving their homes.