Data Visualization: A Practical Introduction, by Kieran Healy























This chapter shows how to make breathtaking maps using ggplot2. It starts with US presidential election data and shows how to make state-level and county-level maps. It is just fantastic, and I am looking forward to learning some mapping skills from this chapter. It is a perfect example of a chapter that is useful for beginners and advanced users alike.

The final chapter is all about how to refine your plot and make it better. Remember the 14 different scatter plots I mentioned in the third chapter? This final chapter introduces how to enhance your plots using the right colors, text, and themes. The chapter has a nice case study on a new dataset where one can put all of this to use to refine the plots.

In short, a great way to end the book. By now the upshot of this review must be clear: go grab a copy of the book if you are interested in R, ggplot, or data visualization, whether you are a beginner or an expert. A must-read for anyone who works with data. The book is broadly relevant, beautifully rendered, and engagingly written.

It is easily accessible for students at any level and will be an incredible teaching resource for courses on research methods, statistics, and data visualization. It is packed full of clear-headed and sage insights. There is no other book quite like this.

If we want to get more adventurous later, the tools are available to produce custom palettes that still have desirable perceptual qualities. Our decisions about color will focus more on when and how it should be used.

As we are about to see, color is a powerful channel for picking out visual elements of interest. They pop out at us from whatever they are surrounded by. For some kinds of object, or through particular channels, this can happen very quickly. But it is the existence of pop-out that is relevant to us, rather than its explanation.

Pop-out makes some things on a data graphic easier to see or find than others. Consider the panels in figure 1. Each one of them contains a single blue circle. Think of it as an observation of interest. Reading left to right, the first panel contains twenty circles, nineteen of which are yellow and one blue.

The blue circle is easy to find, as there are a relatively small number of observations to scan, and their color is the only thing that varies. The viewer barely has to search consciously at all before seeing the dot of interest. In the second panel, the search is harder, but not that much harder. There are a hundred dots now, five times as many, but again the blue dot is easily found. The third panel again has only twenty observations. But this time there is no variation in color. Instead nineteen observations are triangles and one is a circle.

On average, looking for the single circle among the triangles is noticeably harder than searching for the blue dot in the first panel, and it may even be more difficult than in the second panel, despite there being many fewer observations.

Think of shape and color as two distinct channels that can be used to encode information visually. In the fourth panel, the number of observations is again upped to one hundred. Finding the single blue dot may take noticeably longer. It seems that search performance on the shape channel degrades much faster than on the color channel.

Finally the fifth panel mixes color and shape for a large number of observations. Again there is only one blue dot on the graph, but annoyingly there are many blue triangles and yellow dots that make it harder to find what we are looking for.

Dual- or multiple-channel searches across large numbers of observations can be very slow. Similar effects can be demonstrated for search across other channels (for instance, with size, angle, elongation, and movement) and for particular kinds of searches within channels. For example, some kinds of angle contrasts are easier to see than others, as are some kinds of color contrasts.

Ware (27–33) has more discussion and examples. The consequences for data visualization are clear enough, as shown in figure 1. Even if our software allows us to, we should think carefully before representing different variables and their values by shape, color, and position all at once. It is possible for there to be exceptions, in particular as shown in the second panel of figure 1.

But even here, in all but the most straightforward cases a different visualization strategy is likely to do better.

Gestalt rules

At first glance, the points in the pop-out examples in figure 1 seem randomly located. In fact, they are not quite randomly located. Instead, I wrote a little code to lay them out in a way that spread them around the plotting area but prevented any two points from completely or partially overlapping each other.

I did this because I wanted the scatterplots to be programmatically generated but did not want to take the risk that the blue dot would end up plotted underneath one of the other dots or triangles. Compare the panels in figure 1; there are clearly differences in structure between them. Defining randomness, or ensuring that a process really is random, turns out to be a lot harder than you might think. But we gloss over those difficulties here.

In a model like this, points are again randomly distributed but are subject to some local constraints. In this case, after randomly generating a number of candidate points in order, the field is pruned to eliminate any point that appears too close to a point that was generated before it. If you ask people which of these panels has more structure in it, they will tend to say the Poisson field.
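The generate-then-prune layout just described is easy to sketch. The following is an illustrative Python version (the book's own code is in R); the function name and parameters are my own, not from the book:

```python
import random

def pruned_points(n_candidates, min_dist, seed=1):
    """Generate candidate points uniformly at random in the unit square,
    keeping each one only if it lies at least `min_dist` away from every
    point kept before it. This mimics the 'generate, then prune' model."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        x, y = rng.random(), rng.random()
        # Compare squared distances to avoid a square root per pair.
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist ** 2
               for px, py in kept):
            kept.append((x, y))
    return kept

pts = pruned_points(500, min_dist=0.1)
```

Far fewer than 500 points survive the pruning, and the survivors are spread more evenly than a purely random (Poisson-like) field would be.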

We associate randomness with a relatively even distribution across a space. But in fact, a random process like this is substantially clumpier than we tend to think; the local pruning is what produces the relatively even, but not random, distribution that results. I first saw a picture of this contrast in an essay by Stephen Jay Gould. The upper panel of the figure shows a random point pattern generated by a Poisson process, and the layout of the figure itself employs some of the perceptual principles discussed here, in addition to displaying them.

We look for structure all the time. We are so good at it that we will find it in random data, given time. This is one of the reasons that data visualization can hardly be a replacement for statistical modeling. The Gestalt rules describe our tendency to infer relationships between the objects we are looking at in a way that goes beyond what is strictly visible. What sorts of relationships are inferred, and under what circumstances? Proximity: Things that are spatially near one another seem to be related. Similarity: Things that look alike seem to be related.

Connection: Things that are visually tied to one another seem to be related. Continuity: Partially hidden objects are completed into familiar shapes. Closure: Incomplete shapes are perceived as complete. Figure and ground: Visual elements are taken to be either in the foreground or in the background.

Common fate: Elements sharing a direction of movement are perceived as a unit. Some kinds of visual cues outweigh others. For example, in the upper left of figure 1, proximity makes us see the circles as three distinct groups. In the upper right, the three groups are still salient, but the row of blue circles is now also seen as a grouped entity.

In the middle row of the figure, the left side shows mixed grouping by shape, size, and color. Meanwhile the right side of the row shows that direct connection outweighs shape. Finally the two schematic plots in the bottom row illustrate both connection and common fate, in that the lines joining the shapes tend to be read left-to-right as part of a series. Note also the points in the lower right plot where the lines cross. There is more involved besides that, however. Beyond core matters of perception lies the question of interpreting and understanding particular kinds of graphs.

The proportion of people who can read and correctly interpret a scatterplot is lower than you might think. At the intersection of perception and interpretation there are specific visual tasks that people need to perform in order to properly see the graph in front of them. Even if viewers understand all these things, they must still perform the visual task of interpreting the graph. A scatterplot is a visual representation of data, not a way to magically transmit pure understanding.

Even well-informed viewers may do worse than we think when connecting the picture to the underlying data (Doherty et al.). In the 1980s, William S. Cleveland and his colleagues conducted two influential experiments on how accurately people decode the quantities encoded in different chart types.

In both studies, participants were asked to make comparisons of highlighted portions of each chart type and say which was smaller. Cleveland went on to apply the results of this work, developing the trellis display system for data visualization in S, the statistical programming language developed at Bell Labs. R is a later implementation of S.

He also wrote two excellent books that describe and apply these principles (Cleveland). The original experiments were later replicated and extended with additional chart types. These include treemaps, where a square or rectangle is subdivided into further rectangular areas representing some proportion or percentage of the total. A treemap looks a little like a stacked bar chart with more than one column. The comparisons and graph types put to the research subjects are shown schematically in figure 1. As can be seen from the figure, the charts tested encoded data in different ways.

Types 1–3 use position encoding along a common scale, while types 4 and 5 use length encoding. The pie chart encodes values as angles, and the remaining charts as areas, using either circles, separate rectangles (as in a cartogram), or subrectangles (as in a treemap). Their results are shown in figure 1. The replication was quite good. The overall pattern of results seems clear, with performance worsening substantially as we move away from comparison on a common scale to length-based comparisons, then to angles, and finally to areas.

Area comparisons perform even worse than the justifiably maligned pie chart. The data values were encoded, or mapped, into the graph, and now we have to get them back out again. When doing this, we do best judging the relative position of elements aligned on a common scale, as, for example, when we compare the heights of bars on a bar chart, or the position of dots with reference to a fixed x- or y-axis. When elements are not aligned but still share a scale, comparison is a little harder but still pretty good.

It is more difficult again to compare the lengths of lines without a common baseline. Outside of position and length encodings, things generally become harder and the decoding process is more error prone. We tend to misjudge quantities encoded as angles. This is one reason pie charts are usually a bad idea. We are also poor judges of areas.

We have known for a long time that area-based comparisons of quantities are easily misinterpreted or exaggerated. For example, values in the data might be encoded as lengths, which are then squared to make the shape on the graph. The result is that the difference between the areas of the squares or rectangles will be much larger than the difference between the two numbers they represent. Comparing the areas of circles is prone to even more error, for the same reason. It is possible to offset these problems somewhat by choosing a more sophisticated method for encoding the data as an area.

Instead of letting the data value be the length of the side of a square or the radius of the circle, for example, we could map the value directly to area and back-calculate the side length or radius.
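The back-calculation is straightforward: if a data value is mapped to a circle's area, the radius to draw follows from area = pi * r^2, so r = sqrt(area / pi). A small Python sketch (the function name and `scale` parameter are illustrative, not from the book):

```python
import math

def radius_for_value(value, scale=1.0):
    """Map a data value directly to circle *area*, then back-calculate
    the radius needed to draw it: area = pi * r**2, so r = sqrt(area / pi)."""
    area = value * scale
    return math.sqrt(area / math.pi)

# Naive encoding maps the value to the radius: a value twice as large
# then yields a circle with four times the area, exaggerating the gap.
naive_area_ratio = (math.pi * 2 ** 2) / (math.pi * 1 ** 2)      # ~4.0

# Area encoding: the drawn areas differ by the same factor as the values.
r1, r2 = radius_for_value(1), radius_for_value(2)
encoded_area_ratio = (math.pi * r2 ** 2) / (math.pi * r1 ** 2)  # ~2.0
```

The drawn areas now track the data faithfully, though, as the text notes, viewers will still decode areas less accurately than positions or lengths.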

Still, the result will generally not be as good as the alternatives. And as we saw with the 3-D bar chart in figure 1, extra dimensions tend to make matters worse. Finally, we find it hard to judge changes in slope. The estimation of rates of change in lines or trends is strongly conditioned by the aspect ratio of the graph, as we saw in figure 1. Our relatively weak judgment of slopes also interacts badly with three-dimensional representations of data. For this reason, it can be disproportionately difficult to interpret data displays of point clouds or surfaces displayed with three axes.

They can look impressive, but they are also harder to grasp. Different sorts of variables, or attributes, can be represented more or less well by different kinds of visual marks or representations, such as points, lines, shapes, and colors.

Our task is to come up with methods that encode or map variables in the right way. As we do this, we face several constraints. First, the channel or mapping that we choose needs to be capable of representing the kind of data that we have.

If we want to pick out unordered categories, for example, choosing a continuous gradient to represent them will not make much sense. If our variable is continuous, it will not be helpful to represent it as a series of shapes. Second, given that the data can be comprehensibly represented by the visual element we choose, we will want to know how effective that representation is.

Following Tamara Munzner, the accompanying figures rank these channels by how effective they are for different kinds of data. If we have ordered data and we want the viewer to efficiently make comparisons, then we should try to encode it as a position on a common scale.

Encoding numbers as lengths absent a scale works too, but not as effectively. Encoding them as areas will make comparisons less accurate again, and so on. Third, the effectiveness of our graphics will depend not just on the channel that we choose but on the perceptual details of how we implement it. So, if we have a measure with four categories ordered from lowest to highest, we might correctly decide to represent it using a sequence of colors. But if we pick the wrong sequence, the data will still be hard to interpret, or actively misleading.

In a similar way, if we pick a bad set of hues for an unordered categorical variable, the result might not just be unpleasant to look at but actively misleading. Finally, bear in mind that these different channels or mappings for data are not in themselves kinds of graphs.

They are just the elements or building blocks for graphs. When we choose to encode a variable as a position, a length, an area, a shade of gray, or a color, we have made an important decision that narrows down what the resulting plot can look like.

But this is not the same as deciding what type of plot it will be, in the sense of choosing whether to make a dotplot or a bar chart, a histogram or a frequency polygon, and so on. Each of these plots is far less noisy than the junk-filled monstrosity we began with. For example, consider the scales on the x-axis in each case: the scale on the bar chart version goes to zero, while the scale on the dotplot version is confined to the range of values taken by the observations.

The left-hand panel in figure 1 is a bar chart. The scale starts at zero and extends to just beyond the level of the largest value. Meanwhile the right-hand panel is a Cleveland dotplot.

Each observation is represented by a point, and the scale is restricted to the range of the data as shown. But being honest with your data is a bigger problem than can be solved by rules of thumb about making graphs. In this case there is a moderate level of agreement that bar charts should generally include a zero baseline or equivalent given that bars make lengths salient to the viewer.
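The two scaling conventions can be captured in a pair of small helper functions. This is an illustrative Python sketch, not code from the book; the padding fraction is an arbitrary choice of mine:

```python
def bar_limits(values, pad=0.05):
    """Bar charts make length salient, so the scale starts at zero
    and extends just beyond the largest value."""
    top = max(values)
    return (0, top * (1 + pad))

def dot_limits(values, pad=0.05):
    """A Cleveland dotplot encodes position, not length, so the scale
    can be confined to the observed range of the data."""
    lo, hi = min(values), max(values)
    margin = (hi - lo) * pad
    return (lo - margin, hi + margin)

data = [37, 41, 44, 52]   # hypothetical observations
bar_limits(data)          # ~ (0, 54.6)
dot_limits(data)          # ~ (36.25, 52.75)
```

With the same data, the dotplot axis spans only the observed range, while the bar chart axis spans everything from zero up, which is exactly the difference in emphasis the two panels illustrate.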

But it would be a mistake to think that a dotplot was by the same token deliberately misleading, just because it kept itself to the range of the data instead. Which one is to be preferred?

It is tricky to give an unequivocal answer, because the reasons for preferring one type of scaling over another depend in part on how often people actively try to mislead others by preferring one sort of representation over another. On the one hand, there is a lot to be said in favor of showing the data over the range we observe it, rather than forcing every scale to encompass its lowest and highest theoretical value.

Many otherwise informative visualizations would become useless if it was mandatory to include a zero point on the x- or y-axis. Sometimes this is done out of active malice, other times out of passive bias, or even just a hopeful desire to see what you want to see in the data.

Remember, often the main audience for your visualizations is yourself. In those cases, the resulting graphic will indeed be misleading. Rushed, garish, and deliberately inflammatory or misleading graphics are a staple of social media sharing and the cable news cycle.

But the problem comes up in everyday practice as well, and the two can intersect if your work ends up in front of a public audience. Consider the example of a sharp decline in law school enrollments. Default settings and general rules of good practice have limited powers to stop you from doing the wrong thing.

But one thing they can do is provide not just tools for making graphs but also a framework or set of concepts that helps you think more clearly about the good work you want to produce. The first panel shows the trend in the number of students beginning law school each year over the past several decades. The y-axis starts from just below the lowest value in the series.

The second panel shows the same data but with the y-axis minimum set to zero instead. The columnist and writer Justin Fox saw the first version and remarked on how amazing it was. He was then quite surprised at the strong reactions he got from people who insisted the y-axis should have included zero. My own view is that the chart without the zero baseline shows you that, after almost forty years of mostly rising enrollments, law school enrollments suddenly and precipitously dropped to levels not seen for four decades.

The levels are clearly labeled, and the decline does look substantively surprising and significant. In a well-constructed chart the axis labels are a necessary guide to the reader, and we should expect readers to pay attention to them. The chart with the zero baseline, meanwhile, does not add much additional information beyond reminding you, at the cost of wasting some space, that 35,000 is a number quite a lot larger than zero.
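One way to see why the choice of baseline matters so much: the same absolute decline occupies a very different fraction of the plot's height depending on where the y-axis starts. A Python sketch with hypothetical numbers (not the actual enrollment figures):

```python
def visual_drop(start, end, axis_min, axis_max):
    """Fraction of the plot's vertical extent occupied by a decline
    from `start` to `end`, given the y-axis limits."""
    return (start - end) / (axis_max - axis_min)

# Hypothetical enrollment figures, for illustration only.
start, end = 52_000, 38_000
visual_drop(start, end, axis_min=35_000, axis_max=53_000)  # ~0.78 of plot height
visual_drop(start, end, axis_min=0, axis_max=53_000)       # ~0.26 of plot height
```

The underlying data are identical; only the axis limits change, yet the same drop looks roughly three times as dramatic when the axis is confined to the observed range.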

That said, I am sympathetic to people who got upset at the first chart. At a minimum, it shows they know to read the axis labels on a graph. That is less common than you might think. It likely also shows they know interfering with the axes is one way to make a chart misleading, and that it is not unusual for that sort of thing to be done deliberately.

[Figure 1: first-year enrollment by year.] When people begin making their own graphs, they quickly start formulating requests. They want to know how to make a particular kind of chart, or how to change the typeface for the whole graph, or how to adjust the scales, or how to move the title, customize the labels, or change the colors of the points. These requests involve different features of the graph. Some have to do with the details of how those elements are represented. If a variable is mapped to shape, which shapes will be chosen, exactly?

If another variable is represented by color, which colors in particular will be used? Some have to do with the framing or guiding features of the graph. If there are tickmarks on the x-axis, can I decide where they should be drawn? If the chart has a legend, will it appear to the right of the graph or on top? If data points have information encoded in both shape and color, do we need a separate legend for each encoding, or can we combine them into a single unified legend?

And some have to do with thematic features of the graph that may greatly affect how the final result looks but are not logically connected to the structure of the data being represented. Can I have a light blue background in all my graphs? A real strength of ggplot is that it implements a grammar of graphics to organize and make sense of these different elements (Wilkinson). When you write your code, you carry out each task using a function that controls that part of the job. At the beginning, ggplot will do most of the work for you.

Only two steps are required. First, you must give some information to the ggplot function. This establishes the core of the plot by saying what data you are using and what variables will be linked or mapped to features of the plot. Second, you choose a geom function. This decides what sort of plot will be drawn, such as a scatterplot, a bar chart, or a boxplot.

As you progress, you will gradually use other functions to gain more fine-grained control over other features of the plot, such as scales, legends, and thematic elements.

This also means that, as you learn ggplot, it is very important to grasp the core steps first, before worrying about adjustments and polishing. In the next chapter we will learn how to get up and running in R and make our first graphs. We will be producing sophisticated plots quite quickly, and we will keep working on them until we are in full control of what we are doing.

As we go, we will learn about some ideas and associated techniques and tricks to make R do what we want. If you would like to learn more about the relationship between perception and data visualization, follow up on some of the references in this chapter. Munzner, Ware, and Few are good places to start. Finally, foundational work by Bertin lies behind a lot of thinking on the relationship between data and visual elements.

R and ggplot are the tools we will use. The best way to learn them is to follow along and repeatedly write code as you go.

The material in this book is designed to be interactive and hands-on. If you work through it with me using the approach described below, you will end up with a book much like this one, with many code samples alongside your notes, and the figures or other output produced by that code shown nearby. I strongly encourage you to type out your code rather than copying and pasting the examples from the text. Typing it out will help you learn it.




