The Hacker Mind Podcast: Hacking Biology
There are a lot of parallels between computer security and biology. If you think you already understand hacking systems, then I’ve got a story for you.
In this episode, Harrison Green talks about his experience creating exploits during capture the flag competitions and how it relates to his current day to day work with the Durrant Lab at the University of Pittsburgh on computational biology.
Vamosi: The show is called The Hacker Mind, and by listening to more than a few episodes, I think you know that I'm not talking about black hoodies, rad music, or expensive sports cars. Okay, maybe in a future episode.
But anyway, the story is going to be a little bit different.
I often talk with hackers and discuss their day-to-day work, or even their specific infosec journey. Usually we talk about other pursuits, if any. Lately I'm finding these outside interests have started to blur, or even combine, with hacking. I mean, what if their other pursuits were not even security related, but just as nerdy or as important, like the complex, fascinating world of synthetic biology? Did you know that you could program DNA to behave in certain ways?
In a moment we'll hear from someone who is actually using his years of experience creating exploits on the fly during numerous Capture the Flag competitions in his day-to-day work now with computational biology. Did I just say biology? Yeah. In this episode, we're going to be hacking biology.
Welcome to The Hacker Mind, an original podcast from ForAllSecure. It's about challenging our expectations about the people who hack for a living. I'm Robert Vamosi, and in this episode, I'm talking about how the hacker mindset might directly help us learn more about computer-aided drug discovery through the use of convolutional neural networks and machine learning. If you think hacking only pertains to computer systems, then I've got a story for you.
People are complex, and often, if you start talking with somebody and you ask the right questions, they'll surprise you. For example, in this interview, I started discussing structure-aware fuzz testing and ended up discussing cell-based proteins. I mean, it's all good, it's all science, it's just not what I expected. So let me start by introducing you to Harrison Green, a recent graduate of the University of Pittsburgh who is currently taking a year off from school and is now working as a researcher at the Durrant Lab at the University of Pittsburgh, doing biology research with his newly minted computer science degree.
Green: Yeah, it's tricky. I guess I have a lot of different interests. So, I think originally I was really interested in computer science and biology.
Vamosi: So right here, Harrison is perhaps a good person to start to explain how having a computer science degree can start to map to other fields.
Green: Yeah, I guess the way I think about it, computer science in general is sort of a field that can be applied to almost any other field. Like, biology benefits from computer science, physics benefits, engineering benefits. And for me at least, information security is just like a more complete version of computer science. It's not only creating things, but figuring out how things work, debugging things, breaking things, and those same skills can be applied to any other domain. So, for me, computer science and information security is the toolkit, and I can take that to any domain.
Vamosi: You might be thinking that biology is the polar opposite of computer science, but to hear Harrison explain it, there's a lot of commonality between the two.
Green: I find biology really interesting because, thinking about all the internal processes, it's like a really complex system, and trying to understand that system is interesting. I think that's also why I like reverse engineering.
Vamosi: Reverse engineering is where you don't have direct access to the source code. So you use a disassembler. For example, you might use IDA Pro, which generates assembly language source code from machine executable code. Not everyone can read assembly very well, and not everyone likes to do it.
Green: One of the things I've noticed with reverse engineering is when you spend 10 hours staring at assembly code, suddenly reading C code is really nice, like you see all the details in C code. So it's almost like by pushing yourself and doing something harder, then all these other tasks become easier in comparison.
Vamosi: So, diving into the world of biology is somewhat similar to computer science. You really can't see all the proteins and molecules, but you can still model them, and you can test them.
Green: So there are a lot of information security parallels. I think the main difference, in my opinion, is that trying to create this system in a biology lab is kind of like trying to develop an exploit without using GDB, trying to land a ROP chain without being able to debug it. You're doing everything blind, and then every experiment costs $1,000 and you're spending 10 hours in the lab each time.
Vamosi: So there's this desire in biology to make that time in the lab, more efficient, and there's a real need to peer into the darkness of the unknown to understand what's going on in the biological world. Ultimately in biology, you're always testing the unknowns, which starts to sound like software application testing, at least a little bit.
Green: So I spent this whole summer developing these DNA sequences, these protein sequences, and then a lot of time actually in the lab, developing things. And I found it really interesting because it's almost the same as developing an exploit for a specific target. Honestly, the exploit development mindset is the same as the science mindset. Like, you test something, it doesn't work, and you have to figure out why. So you need to come up with a hypothesis, why isn't this working, and you need to come up with some tests to figure out where it's failing. The benefit of computer science is we have things like debuggers to help us. Whereas in biology, you're standing from like a mile away. Everything is really small. You need to come up with some sort of test that will be able to distinguish two different outcomes.
Vamosi: Before entering the University of Pittsburgh, Harrison, like a lot of young hackers, played Capture the Flag.
Green: I was interested in computer science in general for a really long time, and I was first introduced to CTFs my senior year of high school. Before that, I felt like I'd done a lot of things related to CTFs, I just didn't know they existed. I had done a lot of pushing the limits of what I could do with computer science, you know, writing pseudo viruses that would, like, make files and things like this, but I just didn't know about this CTF thing. So once I started playing CTFs, I really enjoyed that. I think I've improved since my senior year of high school, but I was sort of interested from the beginning. I have seen a lot of people get a lot better through playing CTFs. You acquire a lot of specific knowledge about computers and security.
Vamosi: I'm wondering, is there any way to trigger curiosity?
Green: It's a good question. It's tough to say if everybody can do that or if that's just a specific trait. I think the competitive nature of CTFs helped sort of fuel my curiosity. If they weren't a competition, I'm not sure I would put as much effort into them. Like, you have wargame sites where, sort of 24/7, you can solve these challenges, but I don't find those as interesting because there's not a set amount of time, there's not a specific leaderboard.
Vamosi: Some of the computer science work that Harrison has done involves fuzz testing. Was he always interested in fuzz testing, or was it a more recent interest?
Green: Yeah, I think that was something definitely more recent. I've been interested in neuroscience for a long time and information security in general for a shorter amount of time. And recently there was one CTF challenge in particular, I think it was at CSAW, this competition in New York, and it involved fuzzing. I'd just never heard of fuzzing, didn't even know what it was.
Vamosi: Capture the Flag, or CTF, competitions like CSAW are often Jeopardy style, where they provide challenges based on specific categories, then rank the challenges by difficulty within the category. This is to make the competitions more fun, yes, but it also teaches different aspects of information security.
Green: I spent a long time trying to figure out that problem and didn't even end up solving it.
Vamosi: In this case, one of the purposes of this Jeopardy CTF board was to expose the competitors to the concept of fuzz testing.
Green: But then I looked at write-ups afterwards and learned about fuzzing, and I thought it was pretty cool. Yeah, it was a particularly unique problem. I think it's the only problem I've seen where the specific goal was to use a fuzzer to solve it.
Vamosi: Sometime after CSAW, Harrison became more interested in fuzz testing, and that interest got him an internship at ForAllSecure.
Green: I think I'd heard about ForAllSecure before from the DARPA Cyber Grand Challenge, and I saw that they were doing fuzzing, so I reached out to chip.
Vamosi: One of Harrison's first areas of research was in structure-aware fuzz testing. What's that? To better understand structure-aware fuzz testing, we're going to have to start with the basics of fuzzing itself. Basically, with fuzzing you are inputting invalid data. How does the application run? Does it error? Does it crash? The fuzzer doesn't generate just one input; it generates thousands of test cases, and how it generates these is different with each fuzzer. Basically, among all the fuzzers, you need to have a good example that's valid. From there, the fuzzer will generate invalid data based on the valid data that you provided. So there's random fuzzing, which is literally like monkeys pounding on a keyboard. Then there's generational fuzzing, which takes a known good example and follows the specs to generate invalid input. That's more efficient than random, but less effective than coverage-guided fuzzing, which monitors the feedback from the target and generates new input based on that. That's a lot of guessing, but with fuzzing, all of these test cases can be generated quickly and autonomously.
Green: So originally, the first fuzzing paper was basically these researchers throwing just random data at Unix programs and getting a lot of crashes that way. That kind of fuzzing is called generation-based fuzzing, when you're just generating inputs. But recently, there are these fuzzers called coverage-guided fuzzers, like libFuzzer and AFL, and they use coverage feedback to guide the fuzzing process. So, as inputs find new coverage in your target, which means new branches taken in the actual assembly or new instructions hit during execution, they will keep those inputs as interesting inputs. And moving forward, they can take those inputs, mutate them slightly, and maybe hit a slightly different part of execution. And as you keep doing this over and over, you get a really efficient way to explore all of the possible execution states of your target.
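To make the loop Harrison describes a little more concrete, here is a minimal Python sketch of coverage-guided fuzzing. It is only an illustration under stated assumptions: `toy_target`, `mutate`, and `fuzz` are hypothetical names, the target reports its own branch coverage instead of being instrumented, and real tools such as libFuzzer and AFL are far more sophisticated.

```python
import random

def toy_target(data: bytes):
    """Hypothetical program under test. Returns (branches hit, crashed?)."""
    branches = {"entry"}
    crashed = False
    if data[:1] == b"F":
        branches.add("saw_F")
        if data[1:2] == b"U":
            branches.add("saw_FU")
            if data[2:3] == b"Z":
                branches.add("saw_FUZ")
                crashed = True  # stands in for a real memory-safety bug
    return branches, crashed

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def fuzz(seed: bytes, max_iters: int = 200_000, rng_seed: int = 0):
    """Keep any input that reaches a new branch and mutate from that corpus."""
    rng = random.Random(rng_seed)
    corpus = [seed]
    seen = set(toy_target(seed)[0])
    for iteration in range(1, max_iters + 1):
        candidate = mutate(rng.choice(corpus), rng)
        branches, crashed = toy_target(candidate)
        if crashed:
            return candidate, iteration
        if not branches <= seen:      # new coverage: this input is "interesting"
            seen |= branches
            corpus.append(candidate)
    return None, max_iters

crash_input, tries = fuzz(b"AAAA")
print("crashing input:", crash_input, "after", tries, "executions")
```

Because each interesting input is kept, the search only has to guess one correct byte at a time rather than all three at once, which is the efficiency gain over purely random input.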
Vamosi: So what then is structure-aware fuzzing? With normal coverage-guided fuzzing, like we see with open source tools, we can capture information about how the target responds to our input. This feedback from the target then enables our coverage-guided fuzzers to guide their search to discover inputs that reach into new parts of the application. In other words, if we find an interesting new input, we can mutate it to explore all the execution space around it. Perhaps it would help if we compare structure-aware fuzz testing with the classic game Battleship. A random fuzzer is like calling shots without knowing if our previous shots were hits or misses: B10, C5, F2. With generational fuzz testing, we would then follow a pattern: A1, A2, A3. With both of these, we might hit a ship, but it would still be a rather hit-and-miss process. A third way then is coverage guided, where we monitor the results, and when we find a hit, we can explore the hit further to see if there's something really there. B2, hit. B1, miss. B3, hit. B4, hit. You sunk my battleship. So structure-aware fuzzing is yet another refinement in the criteria to improve this efficiency. A structure-aware fuzzer would, for example, know only to specify inputs using one of those letters followed by one of those numbers, while a normal coverage-guided fuzzer might mutate the inputs to potentially invalid strings, such as AA or 99B, or a string of special characters. Thus structure-aware guided fuzzers can help minimize the losses while maximizing the code coverage.
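The Battleship comparison can be sketched in a few lines of Python. This is a hypothetical illustration, not code from any real fuzzer: the board is assumed to be rows A through J and columns 1 through 10, and `byte_level_mutate` and `structure_aware_mutate` are made-up names for the two mutation styles.

```python
import random

ROWS = "ABCDEFGHIJ"
VALID_COLS = {str(c) for c in range(1, 11)}

def is_valid_shot(shot: str) -> bool:
    """A legal shot is one row letter followed by a column number from 1 to 10."""
    return len(shot) >= 2 and shot[0] in ROWS and shot[1:] in VALID_COLS

def byte_level_mutate(shot: str, rng: random.Random) -> str:
    """Structure-unaware: overwrite one character with any printable ASCII character."""
    chars = list(shot)
    chars[rng.randrange(len(chars))] = chr(rng.randrange(32, 127))
    return "".join(chars)

def structure_aware_mutate(shot: str, rng: random.Random) -> str:
    """Structure-aware: change the row or the column, but always stay on the board."""
    row, col = shot[0], int(shot[1:])
    if rng.random() < 0.5:
        row = rng.choice(ROWS)
    else:
        col = rng.randrange(1, 11)
    return f"{row}{col}"

rng = random.Random(1)
naive = [byte_level_mutate("B2", rng) for _ in range(1000)]
aware = [structure_aware_mutate("B2", rng) for _ in range(1000)]
naive_valid = sum(is_valid_shot(s) for s in naive)
aware_valid = sum(is_valid_shot(s) for s in aware)
print(naive_valid, "of 1000 byte-level mutants are legal shots")
print(aware_valid, "of 1000 structure-aware mutants are legal shots")
```

Every structure-aware mutant is a legal shot by construction, while most byte-level mutants are rejected before they ever reach the board.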
Green: There are sort of two reasons for why you'd use structure-aware fuzzing. One of the reasons is, maybe you have a function that doesn't take a byte stream argument. So for example, a lot of traditional fuzzer harnesses need some function that accepts just a byte sequence. It might be a decode image function or a decompress data function or something like this.
Vamosi: Remember when I said we were looking at an application's inputs? What if the application doesn't have readily accessible inputs? How are you going to fuzz that? It turns out you could write a harness, which is basically a script that invokes a target function over and over for the purposes of fuzz testing.
Green: When you have a target, you usually have some entry point, which is just the first thing you call when you use the program. So for an image conversion library, there's probably a single function that represents this conversion process, and the harness is simply a little wrapper program that the fuzzer will use to invoke this function over and over as it generates new inputs. But maybe you have a function that takes an integer and a struct and, like, two floats. So you need some way to retrieve those kinds of objects from the byte sequence the fuzzer gives you.
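One way to picture the problem Harrison raises is a harness that carves typed arguments out of the fuzzer's raw bytes. The sketch below uses Python's standard `struct` module; `Point`, `scale_points`, and the byte layout are all assumptions invented for illustration.

```python
import struct
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

def scale_points(count: int, origin: Point, sx: float, sy: float) -> list:
    """Hypothetical function under test: takes an integer, a struct, and two floats."""
    return [(origin.x + i * sx, origin.y + i * sy) for i in range(count)]

# Assumed layout, little-endian with no padding:
# uint8 count | int16 x | int16 y | float32 sx | float32 sy
LAYOUT = struct.Struct("<Bhhff")

def harness(data: bytes):
    """Decode typed arguments from the fuzzer's bytes, then call the target."""
    if len(data) < LAYOUT.size:
        return None  # not enough bytes to build every argument; skip this input
    count, x, y, sx, sy = LAYOUT.unpack_from(data)
    return scale_points(count, Point(x, y), sx, sy)

example = LAYOUT.pack(2, 1, -1, 0.5, 2.0)
print(harness(example))
```

A fuzzer would call `harness` with each generated byte string, so every mutation of those bytes becomes a mutation of the integer, the struct fields, or the floats.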
Vamosi: What you're trying to do is increase the code coverage while being efficient about it. Remember the random fuzzer? You don't want to spend hours throwing up random stuff that won't find any useful vulnerabilities. Structure-aware fuzzing is a way to define the structure and then iterate within that structure.
Green: So for example, if you're fuzzing any kind of structured language, maybe you're fuzzing a JSON parser. If the fuzzer understands the structure of the language, it can generate inputs that pass the parse stage or pass the syntax stage or something like this, and then you can reach later stages of execution. And in a lot of cases, this can be interesting because you're interested in bugs that are found during execution.
Vamosi: At this point perhaps another concrete example might be good. Take for example a PNG image file.
Green: The fuzzer would be fuzzing something that processes a PNG, for example, so it might be an image library that can convert PNG to JPG. During that process, it would need to read this PNG data, and PNG is a very structured file format, so it has different sections and the sections have requirements. And if any of the sections are invalid, for example, it'll just throw an error. But usually that error isn't especially interesting. Maybe you're interested in an error further down the line, after it's already parsed some amount of data. So building a fuzzer that has some knowledge of this format can be really useful.
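As a rough sketch of what "knowledge of this format" can mean for PNG: each chunk carries a length, a type, and a CRC-32 checksum, so a generator that fills those in correctly gets past the first validation step, while a blind byte flip usually does not. The helper names below are hypothetical, and only a tiny 8-bit grayscale subset of the format is covered.

```python
import random
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: big-endian length, type, payload, CRC-32 of type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

def make_png(width: int, height: int, pixels: bytes) -> bytes:
    """Assemble a minimal 8-bit grayscale PNG from raw pixel bytes."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline starts with a filter-type byte (0 means no filtering).
    rows = b"".join(b"\x00" + pixels[r * width:(r + 1) * width] for r in range(height))
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(rows)) + chunk(b"IEND", b""))

def structure_aware_png(rng: random.Random) -> bytes:
    """Randomize the content while keeping every chunk well formed."""
    width, height = rng.randrange(1, 9), rng.randrange(1, 9)
    pixels = bytes(rng.randrange(256) for _ in range(width * height))
    return make_png(width, height, pixels)

def chunks_are_valid(png: bytes) -> bool:
    """Stand-in for a parser's first stage: check the signature and every chunk CRC."""
    if not png.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        if pos + 8 > len(png):
            return False
        (length,) = struct.unpack_from(">I", png, pos)
        kind = png[pos + 4:pos + 8]
        payload = png[pos + 8:pos + 8 + length]
        crc_bytes = png[pos + 8 + length:pos + 12 + length]
        if len(payload) != length or len(crc_bytes) != 4:
            return False
        if struct.unpack(">I", crc_bytes)[0] != zlib.crc32(kind + payload) & 0xFFFFFFFF:
            return False
        pos += 12 + length
    return True

png = structure_aware_png(random.Random(0))
corrupted = png[:16] + bytes([png[16] ^ 0xFF]) + png[17:]  # flip one byte of the width
print(chunks_are_valid(png), chunks_are_valid(corrupted))
```

The generated file passes the checksum stage, so a target would go on to exercise its later decoding logic, whereas the byte-flipped copy is rejected at the first check.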
Vamosi: So how might an attacker be able to use this fuzz testing information?
Green: So a lot of the bugs that people look for with fuzzers are security vulnerabilities. People aren't usually just interested in, you know, a bug that crashes the program, but a bug that maybe an attacker can use to gain some control of the program, like remote code execution. So a lot of the time people will look at functions or APIs that might be exposed to an attacker. If you're using an image library in a web application, for example, and an attacker can upload an image, then they have control over the data in the image. They can make it whatever they want. They can make it an invalid image that crashes your program in a specific way that gives them control.
Vamosi: Beyond structure-aware fuzzing, another area where Harrison's been focused is fuzzing full libraries.
Green: So one of the areas we've been interested in is, if you have not just one function but a whole library, how do you fuzz a whole library at once? So for example, one of the targets we've been looking at is the Skia graphics library, which is a graphics library that's used in Chrome and Android, for example.
Vamosi: The Skia graphics engine is an open source graphics library written in C++, which abstracts away platform-specific graphics APIs. Skia Inc. originally developed the library. Google acquired it in 2005 and then released the software as open source.
Green: And the way that works is, whenever you want to draw something, you need this canvas object. So for example, if you want to draw a rectangle, you need a canvas, you need a rectangle object, and you need a paint object, which describes things like the color, the fill, the dash style, and things like this. And you would call, like, canvas.drawRect with the rect and with the paint object.
Vamosi: A rectangle is pretty basic, often you need to produce a more complicated object. And that complicates the number of functions and arguments that are needed,
Green: And some of the functions are more complicated. Maybe they need four arguments, and they're all these objects. Fuzzing something like this is traditionally challenging because you can construct a lot of individual test cases that test different scenarios, but finding a way to fuzz all of the possible scenarios is difficult, because a lot of the ways you can call things are maybe just invalid.
Vamosi: Just because there's the potential for something doesn't mean that it leads to a valid result. So there's a lot of scenarios to consider a lot of false leads.
Green: So you don't want to find a lot of bugs that are just due to trying to call, like, drawRect on an object that isn't a canvas, for example. So our goal has been basically, if you have a list of all of these APIs, we call them endpoints, how do you construct a harness that can fuzz all different combinations of them at once? So we've been working on a process called data flow fuzzing, which is basically treating this whole execution as a big graph. And by graph I mean the computer science kind of graph, where you have vertices and edges.
Vamosi: A data flow graph is a collection of arcs and nodes in which the nodes are places where variables are assigned or used, and the arcs show the relationship between those places. It expresses all the possibilities for concurrent execution of the program's parts.
Green: Then, during the fuzzing process the fuzzer will invoke each endpoint one by one. And if objects are created, it'll move those objects to the next endpoint, and so on. And we've done some interesting things this way.
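The general idea of invoking endpoints one by one and passing created objects along can be sketched as follows. This is not ForAllSecure's implementation and not Skia's API: the `Canvas`, `Rect`, and `Paint` classes, the endpoint table, and `run_dataflow` are all hypothetical stand-ins written for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Canvas:
    ops: list = field(default_factory=list)

@dataclass
class Rect:
    w: int
    h: int

@dataclass
class Paint:
    color: int

def make_canvas():
    return Canvas()

def make_rect(w, h):
    return Rect(w, h)

def make_paint(color):
    return Paint(color)

def draw_rect(canvas, rect, paint):
    canvas.ops.append(("rect", rect.w, rect.h, paint.color))

# Each endpoint: (function, required input types, output type or None).
ENDPOINTS = [
    (make_canvas, (), Canvas),
    (make_rect, (int, int), Rect),
    (make_paint, (int,), Paint),
    (draw_rect, (Canvas, Rect, Paint), None),
]

def run_dataflow(data: bytes):
    """Let fuzzer bytes choose endpoints; feed them objects made by earlier calls."""
    pool = {Canvas: [], Rect: [], Paint: []}
    stream = iter(data)
    calls = []
    for byte in stream:
        func, in_types, out_type = ENDPOINTS[byte % len(ENDPOINTS)]
        args = []
        for wanted in in_types:
            if wanted is int:
                args.append(next(stream, 0))          # plain values come from the bytes
            elif pool[wanted]:
                index = next(stream, 0) % len(pool[wanted])
                args.append(pool[wanted][index])      # objects come from earlier endpoints
            else:
                break                                 # required object does not exist yet
        if len(args) != len(in_types):
            continue                                  # skip calls that would be invalid
        result = func(*args)
        if out_type is not None:
            pool[out_type].append(result)
        calls.append(func.__name__)
    return calls, pool

calls, pool = run_dataflow(bytes([0, 1, 3, 4, 2, 7, 3, 0, 0, 0]))
print(calls, pool[Canvas][0].ops)
```

Because an endpoint only runs when objects of the right types already exist, the harness avoids the uninteresting crashes that come from calling a function on the wrong kind of object.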
Vamosi: There is another type of Capture the Flag: attack and defend, or king of the hill. That's what you see at DEF CON. That's where you're actively hacking into another team's server while defending your own server from their attacks.
Green: I have heard that some people on CTF teams are trying to use fuzzing, because with normal challenges, for example, a lot of exploitation problems give you this interface where you add notes, you edit notes, delete notes, and you can build a fuzzer for that. And maybe it helps you find a crash sooner, and then maybe lets you develop an exploit.
Vamosi: Harrison currently plays with DiceGang, a CTF team that did really, really well in the most recent DEF CON qualifiers. For DEF CON, 16 teams get to go on to the finals, hence the need for a qualifying round.
Green: Yeah, we did. We got 15th, which was very close for the finals.
Vamosi: Harrison has played in the DEF CON finals in Las Vegas on other teams. Because of COVID-19 most of these, however, are online now.
Green: I mean, that's pretty much the way we played before. Yes, most CTFs are online. Usually only the finals are in person for CTFs, so we're pretty used to that. Sometimes we set up Discord channels to talk to each other.
Vamosi: Often, that requires specialization.
Green: Yeah, everybody has their own specialization. So for example, I do a lot of reverse engineering, that's my main thing. And we usually create a different channel for each challenge. Our team has gotten pretty big. I think we probably have 20 to 30 people for a big CTF. So there's probably two or three people working on any given challenge.
Vamosi: And it's true. Today there are a lot more people playing Capture the Flag. And if you follow CTFtime, there's at least one competition every weekend. So are CTFs something that more and more people should be thinking about? I mean, even if you're not going to specialize in information security, might someone really benefit from having those skills in their other fields?
Green: I think that people in other fields should play CTFs and expand their skill set that way. Pretty much all of the people I've known who play CTFs are, you know, really competent programmers in general, so I think that's a function of just this experience of problem solving during weekends. When you solve a really difficult CTF problem, it gives you a lot of satisfaction, and you're pushed to work harder each weekend. So I think that's a useful attitude to have in all fields.
Vamosi: Coming up with a test to distinguish outcomes in the lab to make drug discovery and other advances more efficient, is one of the areas being addressed through the study, known as synthetic biology.
Green: So, synthetic biology is kind of this field in biology that's almost the equivalent of information security, but for biology. The idea is basically taking a lot of engineering principles and applying them to biology.
Vamosi: So what exactly does it mean to apply engineering principles to biology?
Green: So for example, you know, in our cells we have DNA, which codes for proteins. And we've actually developed a pretty good understanding of how specific sequences in your DNA contribute to proteins binding and initiating transcription, things like that. So there are a lot of specifically coded sections of DNA that do specific things, and you can construct your own DNA circuits.
Vamosi: DNA circuits. In this case we're talking about better ways to bind proteins, so we can produce more effective drugs. Turns out there are competitions in biology that support this research.
Green: So after my sophomore year at Pitt, I did this summer competition research thing called iGEM. iGEM stands for International Genetically Engineered Machine. And it's this competition among undergrad and graduate students where you build and test a synthetic biology project.
Vamosi: You can think of iGEM as kind of a Capture the Flag competition for biologists. Each summer, high school and university student teams are given a kit of biological parts and work over the summer to build and test biological systems in living cells, ranging from bacteria to mammalian cells.
Green: You know, I think I'm drawn to biology just because I'm really interested in that area. And I think working on stuff that can help people.
Vamosi: At the Durrant Lab at the University of Pittsburgh, Harrison worked on creating a deep convolutional neural network that predicts appropriate fragments given the structure of a receptor-ligand complex, something that might help researchers with new drug discovery. This "DeepFrag" model is open source, and it is the subject of a recently published article in the academic journal Chemical Science.
Green: The task we were trying to solve for the paper is predicting improvements on molecules so that they bind to proteins better. So, in the drug discovery world, a lot of drugs we use are these molecules called small molecules, and they're just like they sound: they're small molecules. And most of them work by binding to proteins. Proteins are sort of like the machines in our cells, and proteins are used in all sorts of biological pathways.
Vamosi: So, perhaps a quick biology refresher is necessary.
Green: Yeah, so all of these work by binding to proteins and either making them more effective in the pathway, or they can just prevent the effect of the protein. So one of the difficulties in drug discovery is trying to predict how molecules bind to proteins, and once you've identified a molecule that binds to a protein, how do you make it bind stronger? That usually corresponds to a stronger effect.
Vamosi: So there's a myriad of proteins that could combine or not combine. How do you sort through them all? This sounds a bit like the work done on fuzzing libraries, where you have multiple outcomes to consider. So it's not surprising that Harrison looked to computational science to help him with the challenge at hand.
Green: You know, the paper I worked on was developing a neural network for this specific drug prediction task. And having the mindset of just, how do I solve this problem, you start iterating quicker on solutions and you develop a good attitude in terms of how you debug your solutions, things like that. So I think it's a really useful attitude to have towards skills in general.
Vamosi: A neural network in artificial intelligence is a series of algorithms that attempts to model underlying relationships in a set of data through a process that mimics the way the human brain operates. An artificial neuron receives a signal that is a real number, and the output of each neuron is computed by some nonlinear function of the sum of its inputs, signaling to other neurons connected to it. The connections are called edges, and the neurons and edges also have weights that are adjusted as the learning proceeds. The weight increases or decreases the strength of the signal at the connection. And that sounds a little bit like the data flow graph that we mentioned earlier. Here's where computational biology gets really cool. Instead of wasting time and money in the lab hypothesizing on what might or might not happen, the machine learning takes over, processing various iterations much faster. Kind of like fuzzing, mutating all the options. Only in this case we're looking for protein matches.
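The description of a single artificial neuron maps onto a few lines of Python. The sketch below assumes a sigmoid as the nonlinear function and a plain gradient-descent update on squared error; both are common textbook choices picked here for illustration, not details taken from the episode.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a nonlinear (sigmoid) function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def train_step(inputs, weights, bias, target, learning_rate=0.5):
    """Adjust the weights slightly so the output moves toward the target."""
    out = neuron(inputs, weights, bias)
    # Gradient of 0.5 * (out - target)^2 with respect to the weighted sum.
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias

inputs, weights, bias = [1.0, 2.0], [0.0, 0.0], 0.0
before = neuron(inputs, weights, bias)
weights, bias = train_step(inputs, weights, bias, target=1.0)
after = neuron(inputs, weights, bias)
print(before, after)
```

After one update the output has moved closer to the target of 1.0, which is the "weights are adjusted as the learning proceeds" step in miniature.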
Green: Our process was to use a machine learning model to take part of a molecule and a protein and predict a small change to the molecule, or a small addition to the molecule, that would make it bind stronger. And we basically used a large data set of known pairs of molecules and proteins, and we cut those molecules in half and used one part as the input to try to get the machine learning to predict the other part of the molecule, the missing part.
Vamosi: I would think that getting machine learning to do this would be rather hard. Actually, Harrison disagrees.
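The "cut those molecules in half" step can be pictured as a small graph operation: remove one bond and see which atoms end up on each side. The sketch below is only a toy under stated assumptions. It treats a molecule as numbered atoms and bond pairs, uses a hypothetical `split_at_bond` helper, and is not the actual DeepFrag data pipeline.

```python
from collections import deque

def split_at_bond(num_atoms, bonds, cut):
    """Remove one bond and return (larger part, smaller part) as sets of atom indices.

    Returns None if the bond sits in a ring, because cutting it leaves one piece.
    """
    adjacency = {atom: [] for atom in range(num_atoms)}
    for a, b in bonds:
        if {a, b} != set(cut):
            adjacency[a].append(b)
            adjacency[b].append(a)

    def component(start):
        # Breadth-first search over the remaining bonds.
        seen, queue = {start}, deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    side_a, side_b = component(cut[0]), component(cut[1])
    if side_a & side_b:
        return None
    parent, fragment = sorted([side_a, side_b], key=len, reverse=True)
    return parent, fragment

# Heavy atoms of ethanol as a toy molecule: C0 - C1 - O2
bonds = [(0, 1), (1, 2)]
parent, fragment = split_at_bond(3, bonds, cut=(1, 2))
print("model input (parent):", parent, "label to predict (fragment):", fragment)
```

In this framing, the larger part together with the protein would be the model input and the removed piece would be the label, so a labeled training example comes for free from every known molecule and protein pair.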
Green: So the machine learning model itself was fairly simple. We used something called a convolutional neural network.
Vamosi: A convolutional neural network, or CNN, is a deep learning algorithm which can take, for example, an input image, assign importance through learnable weights and biases, and produce a result. CNNs are now used in autonomous vehicles, able to process inputs from a variety of camera, LiDAR, and radar sensors mounted to the vehicle, providing a 360-degree view of its environment, and passing those results on to the various parts of the car. For example, a CNN can map raw pixels from a single front-facing camera directly to the steering commands, and therefore avoid a pedestrian on the street.
Green: Yeah, definitely, anytime you have kind of spatial information. Like with vehicles, either they have LiDAR, which can use 3D convolutional networks, or they use their onboard cameras, which would use 2D networks.
Vamosi: CNNs can also be beneficial when used in biological sciences.
Green: It's the same kind of architecture that image classifiers use, for example. So if you're trying to distinguish between a dog and a cat, you'd use a 2D convolutional neural network. The only difference here is we used a 3D convolutional neural network, because our molecular structures are 3D. So we converted those to these 3D pixels called voxels, and then we used this 3D convolution.
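To show what voxels and a 3D convolution look like in code, here is a deliberately tiny pure-Python sketch. It assumes a single channel that just counts atom centers per voxel and a hand-written kernel; real models typically use several channels and learned kernels in a deep learning library, so `voxelize` and `conv3d` here are hypothetical teaching helpers only.

```python
def voxelize(atoms, grid_size, resolution=1.0):
    """Count atom centers per voxel. atoms is a list of (x, y, z) coordinates."""
    grid = [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for x, y, z in atoms:
        i, j, k = int(x // resolution), int(y // resolution), int(z // resolution)
        if 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size:
            grid[i][j][k] += 1.0
    return grid

def conv3d(grid, kernel):
    """3D sliding-window sum of products with no padding (cross-correlation)."""
    n, k = len(grid), len(kernel)
    size = n - k + 1
    out = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for l in range(size):
                out[i][j][l] = sum(
                    grid[i + a][j + b][l + c] * kernel[a][b][c]
                    for a in range(k) for b in range(k) for c in range(k)
                )
    return out

atoms = [(0.5, 0.5, 0.5), (1.2, 0.1, 0.3)]          # two made-up atom positions
grid = voxelize(atoms, grid_size=3)
ones = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d(grid, ones)
print(features[0][0][0], features[1][0][0], features[0][1][1])
```

With an all-ones kernel each output value is simply the number of atoms inside that 2 by 2 by 2 neighborhood; a trained network would learn kernels that respond to more useful local patterns.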
Vamosi: This is an important first step to helping identify the proteins needed, but they still needed to construct a database.
Green: I think one of the advancements we were trying to make was, well, one of the difficulties with machine learning is usually you don't have the right data. So in this case there was no data set of molecules with additions that were better, so we had to sort of construct this data set using a different kind of data. And I think this is one of the areas where I see a lot of new papers using these sort of constructed data sets. That's one of the main research areas in machine learning: how do you make good use of unsupervised data? By unsupervised I mean, like, unlabeled data.
Vamosi: The human brain does pattern recognition well. It knows how to quickly break down sensory data into quantifiable data by labeling things, either on a macro or micro level. With machine learning, all that has to be learned.
Green: Maybe you have a data set of just cats and dogs, but you don't actually have the labels cat and dog, you don't know which image is which. Can you still use a machine learning model to learn some features from this data set? And there are papers that do this.
Vamosi: Given the early successes with automotive and biology, can any of this machine learning knowledge be applied to probing the unknown vulnerabilities and finding software bugs in application security?
Green: A lot of the advancements in machine learning come from people finding new creative ways to use it. So I don't see an obvious way to plug machine learning into finding bugs, for example. It's not like an immediate thing where you can suddenly train machine learning models. But maybe there are some tasks where machine learning models can help you identify things. So for example, if you have a history of bugs that have been committed to a repository that are a particular variety of bug, maybe you can use machine learning models to help identify places where these might be more likely to occur. I think identifying smaller subtasks in the problem of finding bugs can be a viable strategy.
Vamosi: So what does the future hold for Harrison?
Green: Yeah, I am interested in research, so I want to stay in academia, I'm taking a gap year this year but then I'm looking at graduate school. Yeah but tough to tell about the future.
Vamosi: I'd like to thank Harrison Green for coming on the show. He's a significantly complex individual, one whose research I expect to hear more about in the near future as he continues to straddle the interests of both computer science and biology. And as I said before, if you start talking with somebody and you ask the right open-ended questions, they'll often surprise you. That said, let's strike up a conversation. Really, I'd like to hear from you. You can follow me and DM me on Twitter @RobertVamosi. The Hacker Mind is a podcast brought to you every two weeks, commercial free, by ForAllSecure. For The Hacker Mind, I remain just another structured fuzz test, Robert Vamosi.