Hello, everyone. My name is Mark Shawnee, and today I'm joined by John Shea, both of us here from Cloudera. I'd like to welcome all of you to our weekly demos. Today's topic is security and governance with Cloudera's Shared Data Experience, or SDX.

A few housekeeping items I have to cover before I'm allowed to continue with the good stuff. If you have questions, please add them to the Q&A widget, which is found on the left-hand side of your screen. At the end of today's demo we'll be answering the questions that you send in, so please do send them; making this interactive makes the whole experience a lot more interesting. Now, if you look at the right side of your screen, there should be some additional resources there for you. At the end there will be a four-question survey. The survey helps us learn what you're interested in so we can tailor the content for you in the future. And one last thing: this demo is being recorded, so don't panic if you miss something, or if you think someone else would really benefit from seeing what we have here today. Just share the link that you used to join this webinar, or, after this webinar is done, we'll send you a follow-up email with the link in there as well.

Before we dive into the demo, I'm going to cover a little bit about what the Cloudera Data Platform is. John has a good overview, so I'll make mine quick; this is for those of you who are learning about CDP for the first time. The Cloudera Data Platform is Cloudera's entry into the enterprise data cloud market. An enterprise data cloud may be a new concept for some of you, but don't worry, I have a definition right here. For a platform to be considered an enterprise data cloud, it must do several things. It must be hybrid and multi-cloud: not only should the platform operate in the public cloud, it should work with and run on data center installations as well. It must be multi-function: by this I mean you should be able to perform the different parts of your data lifecycle, from data warehousing to data engineering, machine learning, and more. And it must have common security and governance: users of the platform should be able to set security policies that are reflected across the entire platform, and then be able to track data movement and data flow through the platform as well, to help with compliance. The Cloudera Data Platform checks every box of the enterprise data cloud definition. But rather than steal all of John's thunder, I'll let him dive deeper into how CDP checks all those boxes. So go ahead, John, the floor is yours.

Thanks, Mark, for kicking us off. My name is John Shea, and I'm the director of product management for data management and data governance here at Cloudera; I focus on how we secure and share data across the different workloads in CDP. I'm going to talk to you today about how we secure and govern data within Cloudera's Shared Data Experience.

So here is our overview slide of the Cloudera Data Platform, and I'm going to talk about these three main areas, but focus on one in particular. First, CDP is capable of running on any public cloud. Today we directly support Azure and Amazon. It also supports usage in the data center, and in private cloud eventually, and we want to make sure that we can work with all of these clouds together in a hybrid cloud way. It also supports multiple workloads. We have traditional virtual-machine-based cluster setups, which we call Data Hubs, and we have modern, cloud-optimized, Kubernetes-based workloads that we call experiences.
Today we have the Data Warehouse experience and the Machine Learning experience, and in the coming months we'll have the Data Flow, Data Engineering, and Operational Database experiences as well. Now, what ties all of these things together is the focus of today's talk: Cloudera SDX, the Shared Data Experience. This is a shared layer that centralizes control of metadata, schema, migration, security, and governance. These are the core pieces that allow different workloads to work together, and that allow the decoupled storage and compute architecture to scale across multiple teams. SDX is made up of key elements within the control plane: the Data Catalog, the Replication Manager, the Workload Manager, and the Management Console. So let me hop into the Management Console and show you this live. The plan is that I'll give you a quick overview of the core governance features I'm going to demo, and then I'll jump into that demo as well.

OK, so the first thing I need to do to secure and govern the system is to log into the system in the first place, and here what I'm going to do is log in via single sign-on through my corporate Okta. When I log in here, my credentials and my identity will be propagated through all the different experiences and tools within CDP. As I mentioned before, we have three available workloads today, Data Hubs, the Data Warehouse, and Machine Learning, and then management tools around workload management, replication between environments, the Data Catalog, and the Management Console.

I'm going to jump into the Management Console right now to give you a flavor of what's going on in here, and head over to the dashboard. Here you can see a map of the world, and you can see that there are clusters, or environments, up all around the world. We have 37 Amazon environments, four Azure environments, 15 CDH classic clusters in different data centers around the world, and six classic HDP clusters. So you can see Azure being provisioned, Amazon, and classic clusters. Now, I want to emphasize the word "environment" here. An environment is an isolated set of machines that share one set of policies and governance, set up so that the only thing you need to do to share data within an environment is change access control policies. Everything is prewired within one environment. Each of these environments has shared data storage as well, using the native storage of the particular cloud you're on: if you're on Amazon you're using S3, if you're on Azure you're using ADLS, and if you're on-prem you're using either HDFS or, in the coming months, Ozone.

Now let's jump into the core demo and focus on how we secure and govern our data. Data governance is complex. It's basically an operational tax: it adds complexity and slows a lot of things down because of the interactions that need to happen in order to create policies and enforce them. Here I have an example of two different tables: a yellow table that is public information, and a blue table that has company-sensitive or personally identifying information in it. Each of these has to have a different policy applied to it, the yellow table a yellow policy, the blue table a blue policy. And I have a blue user and a gray user.
Now, the public information is open so that everybody can use it, but in the sensitive data set, the blue one, there is a particular column, the red column, that only certain users are allowed to see and most users are not, so the policy is set up to block access to this table for the gray user. One problem we have is that we want to combine datasets to get interesting insights, but that sensitive information can propagate. So with the yellow table and the blue table, I create a new green table by doing a join, and the problem is that the sensitive information propagates: I still have sensitive data, that red column, in the green table. Now, as an admin, I need to set a policy on that table to control access to it, and I also have to think about what that policy should be. So I've decided that the blue users can see this table, and the gray users can't access it because it contains sensitive information.

Now, the gray user still wants to be able to use the joined information; they just don't care about that particular column. What can we do? Well, we could work ourselves around this by creating new views. What we'll do is create a new blue table that has redacted the information in that red column, turning it a safe gray. Then we'll join that yellow table with the new blue table to create a new green table. Now that red column has been redacted, so it's safe and doesn't contain the sensitive information anymore. But as an admin, I need to think about these new tables and set policies on every new flavor of table that I create henceforth derived from the blue one. So here I'll have a fourth policy on the redacted blue table and a fifth policy on the new green one, and I had to think about each one along the way. Each of these interactions takes time; I have to have an interaction with an admin or a data steward, and as the number of policies grows, the number of tables grows, and the number of different flavors grows, this problem can scale exponentially. It can be very, very taxing.

In CDP, and with SDX, because we centralize all of the governance and security mechanisms, and because we have lineage, governance is simplified. We can use an elegant, auto-propagating mechanism called attribute-based access control to propagate this masking policy in a much simpler and less operationally burdensome way. Here's our scenario again: sensitive data can propagate. But this time, as an admin, what I'll do is apply a tag to the columns that have sensitive information, and that tag will automatically propagate through the lineage. Next, I'll apply a policy to that tag. So now when I try to access that table as the gray user, I get the redacted version of the data. And this propagates through the lineage: when I try to access the green table as the gray user, I also get the redacted data. The huge time savings here is that I've defined just one tag and just one policy, and this automatically gets applied to any of the derived tables. This is a huge time saver.
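To make that "one tag, one policy" idea concrete, here is a minimal sketch of what applying such a tag could look like against Apache Atlas's v2 REST API, which is the metadata service SDX uses under the hood. The host, credentials, GUID, and tag name below are placeholders made up for illustration; in the demo this is done through the Data Catalog UI instead.

```python
import requests

# All values below are placeholders for illustration, not the demo environment.
ATLAS_URL = "https://atlas.example.com:31443/api/atlas/v2"
AUTH = ("admin", "admin")  # in CDP you would authenticate via SSO/Knox instead

# Hypothetical GUID of the credit card column entity, as found through the
# Atlas search API or the Data Catalog UI.
column_guid = "3f2a9c1e-0000-0000-0000-000000000000"

# Attach a classification (tag) to the column. With "propagate" set to True,
# Atlas carries the tag along lineage to derived tables automatically, which
# is what lets a single tag cover every downstream copy of the column.
resp = requests.post(
    f"{ATLAS_URL}/entity/guid/{column_guid}/classifications",
    json=[{"typeName": "dp_credit_card", "propagate": True}],
    auth=AUTH,
)
resp.raise_for_status()
print("Tag applied; Atlas will propagate it through lineage.")
```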
OK, so now what I'm going to do is jump into the Data Catalog, demo this feature live, and show you some of the other features inside the Data Catalog that help you manage your data. So here is the Data Catalog, and I'm in my environment, the SDX demo CDP data lake. Full disclosure: I'm going to hop over to a different instance, the one that will be in production in about a week or two. I wanted to hop over because there are a lot of new features landing there that I really want to emphasize.

What I've done is gone to the Data Catalog, where I have the ability to pick any of the different environments that are present, any of those different clusters, and browse and search for things in the data. What I want to look for is ww_customers, because this is the example data set that I'll be doing merges against. You can see I just searched for it, and now I've selected it. The first thing you're going to see here is pretty basic lineage: this example has data that lives in S3, and you can see there's a create-external-table step and then the table information.

Now, the interesting thing here is actually the schema. When I look at the schema, a lot of information is presented to me about my data set: the schema name and type, plus counts and profiles of the columns that are present, how many unique values there are, whether there are null values, and what the min, max, and mean values are, as well as some items here marked in red as potentially sensitive information. So this number field looks like it's potentially a credit card number, and this one actually looks like a telephone number. There are also quick visualizations of the data. If I click on this icon for eye color, it looks like most people have blue eyes in this set of customers. Another one that might be interesting is height: if I look at the distribution, it looks like most people are somewhere around 160 centimeters tall, with min, max, and distributions of that information. Blood type as well. So you can see certain fields have been marked as potentially sensitive information, and there is a workflow here to select and approve or reject those suggestions.

The other thing to show in this overview is that not only do I have the lineage, I also have some statistical information about how often and how heavily this data set is used. This is a relatively newly spun-up instance, so here our top user is hive, one of the system users, and we have summaries of requests on the data, although there just isn't very much right now. There you go: you can see in the past month there have been a handful of requests. I just wanted to show that the instance we have in production doesn't have all of these bells and whistles on it yet, but they will be there shortly.

OK, so now I'm back in the prod instance, and I've dug a little deeper into what we have here. What you can see in the schema is that the credit card and email fields have been selected and tagged, so there are actually tags, classifications, that will be applied to these columns. If I click on the policy tab, you'll see which policies are being applied to this particular table: as it loads, you'll see that there are two tag-based policies in effect, redacting credit cards and redacting emails. So I'll click into this, and as Ranger loads, let me try one more thing: I'll go directly to the policy related to the credit card. As you can see here, this masking policy basically looks for items marked with the credit card tag, and for admins, Hive queries will be unmasked, whereas regular users will only get a partial mask showing the last four digits.
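For reference, a tag-based masking policy like the one John pulls up here can also be expressed through Ranger's public REST API. This is a rough sketch following the shape of Ranger's policy JSON (policy type 1 is a masking policy); the host, service name, group name, and tag name are assumptions for illustration, not values from the demo environment.

```python
import requests

RANGER_URL = "https://ranger.example.com:6182"  # placeholder host
AUTH = ("admin", "admin")                        # placeholder credentials

# A tag-based masking policy: any column tagged dp_credit_card is masked for
# the analysts group, who see only the last four digits on Hive SELECTs.
policy = {
    "service": "cm_tag",        # name of the tag service; varies per install
    "name": "redact-credit-cards",
    "policyType": 1,            # 1 = masking policy in Ranger's model
    "resources": {"tag": {"values": ["dp_credit_card"]}},
    "dataMaskPolicyItems": [
        {
            "groups": ["analysts"],  # hypothetical group name
            "accesses": [{"type": "hive:select", "isAllowed": True}],
            "dataMaskInfo": {"dataMaskType": "hive:MASK_SHOW_LAST_4"},
        }
    ],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy", json=policy, auth=AUTH
)
resp.raise_for_status()
print("Created masking policy with id:", resp.json().get("id"))
```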
Ranger is this repository of our security policies, masking policies, and access control policies, not just for Hadoop SQL engines but for all the other services that are hosted inside of CDP. There are settings for all the different kinds of things here, and we could dive into the details, but there are primarily two main ways you want to do it: either via tags, or by being very specific about particular resources.

All right, so let's pop back out to the Data Catalog. Here I can also go see the audits, and I can see who the last people to access this table were and whether they were allowed or denied. So you can see Joe Analyst was a regular user who had masking policies applied, where John did not.

OK, so now I'm going to switch over to a different table, ww_customers_enriched, because this table is even more interesting. What I have here is the lineage graph, and you can see that the core ww_customers table is here with those tags set, and that its data comes from S3. But you can also see that there is a tweet data set: I created a sample, did a little bit of cleanup on some of the country codes, and merged it all together to create a new table called ww_customers_enriched.

So the next thing I'm going to do is hop over to Hue and run this query on that particular table. And while that's running, I'm going to hop over as a different user, the regular user Joe Analyst, log in, and issue exactly the same query. So let me go over to the Data Warehouse, find that particular instance, open up Hue, and kick off that query. As you can see, it's exactly the same query, and here, as the power user, I can see the count, the country, the given name and surname, the full email address, and the full credit card number. When I move over to the less privileged user, I see exactly the same information, but now the email address and the credit card number have been redacted.

Now, the really interesting and really powerful thing about this is that I set the policies over on ww_customers, and because of the advanced lineage capabilities and the ABAC properties that we have, the tags on those columns automatically propagated to this table. Looking at the schema here, you can see that the tag is gray, meaning it has been automatically propagated, and I can look at the policies and audits of that information as well.
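To recap the effect of that side-by-side query, here is a small, self-contained simulation of what a MASK_SHOW_LAST_4 style transform does to a column value. This is purely illustrative: in CDP the redaction happens inside the query engine via the Ranger plugin, not in client code, and the real mask functions are more type-aware than this.

```python
# Purely illustrative: in CDP the redaction happens inside the query engine
# via the Ranger plugin; no client-side code is involved. This just mimics
# what each user sees for the same query.

def mask_show_last_4(value: str) -> str:
    """Simplified stand-in for Ranger's MASK_SHOW_LAST_4 mask type: hide
    everything except the last four characters. The real mask functions
    preserve character classes and handle types properly."""
    return "x" * max(len(value) - 4, 0) + value[-4:]

row = {  # fabricated sample values
    "given_name": "Ada",
    "email": "ada@example.com",
    "credit_card": "4539148803436467",
}

# What the unmasked power user sees:
print(row["credit_card"])                    # 4539148803436467

# What the masked regular user sees for the same column:
print(mask_show_last_4(row["credit_card"]))  # xxxxxxxxxxxx6467
```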
All right, so I've demonstrated in the product how we can apply tags and security policies and have them automatically propagate throughout your datasets and your lineage within your environment. This is a great feature, and it's going to save you a whole lot of time and effort. All of these security, governance, and lineage features are tied together throughout all these different workloads: when we collect data using our data collection tools such as NiFi, when we enrich data using Spark or Hive or any of the other tools that we have, and when we do reporting in the data warehouse, the lineage is captured across all of these things. We'll also be pulling that in for the operational database workloads, and eventually for machine learning models as well, so you'll be able to see how a machine learning model was generated from end to end, capturing governance and lineage from the edge to the AI models. OK, so that brings us to the end of today's planned demo, and we'll now start answering the questions that have been submitted along the way.

All right, John, that was great. Thank you so much for that presentation; lovely demo there. Now remember, for the Q&A part, the widget is on the left-hand side of your screen, so go ahead and send in those questions. We already have some questions in, and this is a good one to start with. Hey John, what is the definition of an environment, and how does that relate to a CDP cluster?

Yeah, so in my mind we're trying to get away from the word "cluster," because once we've separated workloads from storage and from the policies, we end up having lots of different clusters, and we need to group them together when they share the same policies, the same metadata, the same lineage and audit tracking. Specifically, any time we have a geographically co-located set of machines that need to have the same policy, that's what we call an environment. You could call that whole thing a CDP cluster as well, but the word "cluster" is so overloaded that we chose a different word for it.

All right, thanks. And here's another one: how is SDX integrated with Kubernetes, and what about Knox?

Yeah, so there are two parts to this question, and I'm not exactly sure which one is being asked, so I'll cover both. The services within CDP fall into basically two classes. There are the Data Hubs, which use traditional virtual machines, so those are not using Kubernetes under the covers. But the experiences, like the Data Warehouse experience and the ML experience, those are actually Kubernetes applications that run your traditional Hadoop stack tools. Now, the question is also about Knox. When you set up an environment, you're actually securing it as well: it's set up in its own private VPC, and there are limited egress and ingress points to get into that environment in a cloud account. I don't know all the specifics about how and where Knox plays in there, but that funnel point, where you only have a limited number of ingress and egress points into the network, I believe Knox is related to that. It's a bit outside my area of expertise, so that's about as far as I can go with it.

Yeah, no problem. Moving on, let's go with this one: did Cloudera customize Apache Ranger and Atlas for our use? Because generally, you know, Ranger is used for security and Atlas is used for governance, so maybe you can explain how we incorporated them.

Just like any of the other open source projects, all of the pieces of the enterprise data cloud in CDP are open source. Apache Ranger and Apache Atlas are Apache projects, so all of the software and all the pieces are done upstream in the open source Apache communities, and the versions we're using are almost identical to those upstream. There might be a few minor changes just so we can get all the pieces to work together, but there's no explicit customization that's hidden; just like the rest of the platform, they're all upstream and available for folks in the community.

OK, great. Next one: could you briefly cover row-level security and how it works in CDP?

Sure. Row-level security, and I believe you're asking about row-level masking or row-level filtering, is the ability to only see subsets of a table. This is a feature that's available when you're using resource-based access controls, as opposed to tag-based access controls. In the demo I showed an example of tag-based access control; row-level filtering is specific to a table and uses the properties of that table, so we don't have the capability of using tags there. But in Ranger you can set a row-masking or row-filtering policy: you put a predicate in there, and depending on who you are or what groups you belong to, a filter will be applied and you will only be able to retrieve the rows that your policy allows you to see.
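A resource-based row-filter policy of the kind John describes could be sketched against Ranger's REST API roughly like this (policy type 2 is a row-level filter policy). The host, service name, database, table, group, and filter predicate are all hypothetical examples, not values from the demo.

```python
import requests

RANGER_URL = "https://ranger.example.com:6182"  # placeholder host
AUTH = ("admin", "admin")                        # placeholder credentials

# A resource-based row-filter policy: members of a hypothetical eu_analysts
# group only get back rows matching the filter predicate on SELECT.
policy = {
    "service": "cm_hive",       # name of the Hive service; varies per install
    "name": "eu-analysts-row-filter",
    "policyType": 2,            # 2 = row-level filter policy in Ranger's model
    "resources": {
        "database": {"values": ["sales"]},     # hypothetical database
        "table": {"values": ["customers"]},    # hypothetical table
    },
    "rowFilterPolicyItems": [
        {
            "groups": ["eu_analysts"],
            "accesses": [{"type": "select", "isAllowed": True}],
            "rowFilterInfo": {"filterExpr": "country = 'DE'"},
        }
    ],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy", json=policy, auth=AUTH
)
resp.raise_for_status()
print("Created row-filter policy with id:", resp.json().get("id"))
```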
Got it. And then, going back to Atlas: could you explain the differences between Apache Atlas, as we have it in there, and the Data Catalog?

Great, that's a great question. I can answer it this way: I mentioned this concept of an environment, and every environment has an instance of Apache Atlas running in it. The Data Catalog allows you to straddle multiple environments. So if you had an environment on Amazon AWS in the US, and then you had an environment in the EU on Azure, each of those would have its own separate Atlas instance, but there would be one global Data Catalog that would talk to both of them and allow you to search, navigate, and apply tags and whatnot at the global level. Another major difference between them is that you can think of Atlas as sort of the back end, and the Data Catalog as a front end that captures the majority of the tasks a data steward, or someone looking for datasets, would be doing. It's not going to have all of the gory details that Ranger and Atlas may have, but it will let you get there if you need the advanced features.

Got it, great. Now I'm going to combine two questions here, because they're dealing with the same topic. Will the Data Catalog support connecting different data sources? And where does the data actually have to live for the Data Catalog and Atlas to be able to track it, or for Ranger to apply its security rules to it?

Yeah, great question. So again, drilling down on this difference between the Data Catalog and Atlas: the information that has to do with the policies and the lineage lives in Atlas, and that lives in the customer's account, right? It is hosted and owned by the machines that are running in your cloud account. The Data Catalog is a presentation layer at the control plane level, and what that means is that it runs inside the Cloudera account. The Cloudera account talks to the Atlas instances in your particular environments and then presents that data to you as a user. So one other point to make is that there is no customer-owned metadata stored in the Data Catalog; it only has pointers to the data in the Atlas instances hosted in the customers' accounts.

Right, and that data doesn't have to be in one specific Cloudera-run cluster, right? For example, it will track data that lives in S3 storage, as long as you've imported it? Can you expand on that?

Yeah, so by default, when you're using the public cloud version of CDP, all of your data is stored in the cloud-native object store. If you're in Amazon, your data is stored in S3; if you're in Azure, it's in ADLS; and if you're in an on-prem cluster, it would be in HDFS, and a couple of quarters from now, Ozone, the on-prem object store. And all of the information around that is stored either in databases in your data center, or in databases or the object store under your account.

And here's kind of a follow-up question to that: is the metadata stored in Atlas?

So the metadata about tags and lineage is stored in Atlas. The information about auditing is stored in Ranger's databases. And for the actual schema information, it will actually go out to the different services: in this particular case, with the table examples I gave, it would talk to the Hive metastore, the HMS, to get the table information. But I think there is a version of that data that lives in Atlas as well, so that we can track its lineage.
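Since the lineage and tags live in each environment's Atlas instance, they can also be read back over Atlas's REST API, which is essentially what the Data Catalog does on your behalf. A minimal sketch, again with a placeholder host, credentials, and GUID:

```python
import requests

ATLAS_URL = "https://atlas.example.com:31443/api/atlas/v2"  # placeholder
AUTH = ("admin", "admin")                                    # placeholder

# Hypothetical GUID of the ww_customers_enriched table entity.
table_guid = "9b7d2e40-0000-0000-0000-000000000000"

# Fetch the lineage graph around the table. Nodes are entities (tables,
# processes, S3 paths); edges are the input/output relations reported to
# Atlas by the hooks in Hive, Spark, NiFi, and so on.
resp = requests.get(
    f"{ATLAS_URL}/lineage/{table_guid}",
    params={"direction": "BOTH", "depth": 3},
    auth=AUTH,
)
resp.raise_for_status()
lineage = resp.json()

for guid, entity in lineage["guidEntityMap"].items():
    print(entity["typeName"], entity.get("displayText"))
```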
Got it. OK, so moving on: does SDX, and let me make sure I'm reading this correctly, sorry, does masking and encryption work with Kudu as well, or just Hive?

Yeah, so, I think that is work in progress. I know it's on the roadmap. I don't know if it's available in the production version today, or if it will be available in the next version, but it definitely is on the roadmap.

OK, great. And the last one, which is a nice clarifying point: how is SDX licensed? What does it cost?

So SDX today, if you're using public cloud, basically just comes when you're onboarded onto CDP. In the public cloud version of CDP we don't charge for SDX; we provide it, and the workloads that you run, the ones that use compute, that's actually where we charge customers. So there is no separate fee for it today. For the private cloud version, the pricing is still up in the air, and I don't know what it will be yet, but that will be determined in the coming months; that product is not available yet.

Got it, got it. Well, thanks John. Sorry everyone, we still have lots of questions, but we have to wrap it up. Thank you for joining, everyone, and thanks for sending in those questions and keeping it interactive. Before we leave, we have a survey with four questions, so please answer those to help us cater the content for you in the future. And don't forget that once the webinar is over, we'll be sending everyone an email with instructions on how to access the slides and the recorded webinar. Looking forward to seeing everyone again real soon. Thanks, bye. Thank you.
Enterprise IT is overwhelmed trying to manage data security and governance across fragmented systems and infrastructures. This impacts security and efficiency, and it also frustrates business users through a lack of agility and flexibility. With SDX, IT has the centralized data security, governance, and control it requires to safely and securely meet ever-changing business needs.
During this demo, you’ll learn how Cloudera’s SDX:
Provides IT with enterprise-grade data security and governance
Seamlessly and consistently applies IT policies across a complete analytics landscape