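This article is original work by SCY; please credit the source when reposting.
This article explains how plotting works in the ggplot2 package and walks through examples.
ggplot2 study notes
ggplot2 is based on the layered grammar of graphics (the Grammar of Graphics). Put simply, you draw each part of a figure separately and then add the parts together to form the finished figure. A few kinds of objects come up again and again when using ggplot2. Briefly: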
Geometric objects (geom): geometric objects such as points, lines, bars, and histograms. Each one is drawn by a matching geom_*() function. Load the packages first:

library(tidyverse)
library(ggplot2)
library(patchwork)
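One way to see which geoms are available is to list the functions that ggplot2 exports under the geom_ prefix. This is a minimal sketch that assumes ggplot2 is attached; the variable name geoms is only illustrative, and the exact list depends on the installed ggplot2 version.

```r
library(ggplot2)

# List every exported ggplot2 function whose name starts with "geom_".
geoms <- grep("^geom_", ls("package:ggplot2"), value = TRUE)

length(geoms)  # how many geoms this ggplot2 version exports
head(geoms)    # the first few names, in alphabetical order
```

Extension packages such as ggridges or ggforce add further geoms that do not appear in this list.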
Scales (scale): scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot.
Aesthetics (aes): mappings from data to aesthetic attributes (color, shape, size).
Coordinate system (coord): information about the plot's coordinate system. It describes how data are mapped onto the plane of the plot and also provides the axes and gridlines.
Statistical transformations (stat): statistical transformations of the data, that is, summaries of the data.
Facets (facet): a facet specifies how to break up and display subsets of data as small multiples. This is also known as conditioning or latticing/trellising.
Themes (theme): a theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.
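Principles
A plot built with ggplot() has 9 components:
Data (data)
Mapping (mapping)
Geometric object (geom)
Statistical transformation (stat)
Scale (scale)
Coordinate system (coord)
Facet (facet)
Theme (theme)
Saving and output (output)
Of these, the first three are required.
Hadley Wickham summarises the grammar as follows: a statistical graphic is a mapping from data to the aesthetic attributes (aes) of geometric objects (geom).
A plot may also contain statistical transformations (stat) of the data and is finally drawn in a specific coordinate system (coord), while faceting (facet) can be used to draw the same plot for different subsets of the data.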
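The component list above can be sketched on ggplot2's built-in mpg data. The three required components produce a complete plot, and the optional ones are added with +. The object names p, p_full, and drawn are illustrative, and the particular scale, coord, facet, and theme chosen here are assumptions made for the example, not the author's settings.

```r
library(ggplot2)

# The three required components: data, mapping, and a geom.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Optional components are added with `+`; each one replaces a default.
p_full <- p +
  scale_x_continuous(name = "displacement (litres)") +  # scale
  coord_cartesian() +                                    # coord
  facet_wrap(~ drv) +                                    # facet
  theme_minimal()                                        # theme

# layer_data() returns the data a layer actually draws,
# after the stat, the scales, and the facets have been applied.
drawn <- layer_data(p_full, 1)
nrow(drawn)
```

Printing p or p_full draws the plot. Whatever is left unspecified (stat, scale, coord, facet, theme) falls back to ggplot2's defaults.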
Syntax template

library(tidyverse)
library(ggplot2)
library(colorspace)

d <- read_csv("data/temp_carbon.csv")

ggplot(data = d, mapping = aes(x = year, y = carbon_emissions)) +
  geom_line() +
  xlab("year") +
  ylab("carbon emissions (metric tons)") +
  ggtitle("Annual global carbon emissions, 1880-2014")
Mapping
#> # A tibble: 234 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
#> # ℹ 224 more rows
#> tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#> $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#> $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
#> $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#> $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#> $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#> $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#> $ drv : chr [1:234] "f" "f" "f" "f" ...
#> $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#> $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#> $ fl : chr [1:234] "p" "p" "p" "p" ...
#> $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
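No.  Variable      Meaning
1    manufacturer  Manufacturer
2    model         Model name
3    displ         Engine displacement (litres)
4    year          Year of manufacture
5    cyl           Number of cylinders
6    trans         Type of transmission
7    drv           Drive type (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd)
8    cty           City miles per gallon
9    hwy           Highway miles per gallon
10   fl            Fuel type
11   class         Type of car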
What is the relationship between engine displacement and fuel economy?
Extract the relevant columns: displ, hwy, and class.

mpg %>%
  select(displ, hwy, class)
#> # A tibble: 234 × 3
#> displ hwy class
#> <dbl> <int> <chr>
#> 1 1.8 29 compact
#> 2 1.8 29 compact
#> 3 2 31 compact
#> 4 2 30 compact
#> 5 2.8 26 compact
#> 6 2.8 26 compact
#> 7 3.1 27 compact
#> 8 1.8 26 compact
#> 9 1.8 25 compact
#> 10 2 28 compact
#> # ℹ 224 more rows
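ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

ggplot() calls the plotting function, and data = mpg tells it to use the mpg data frame.
aes() describes the mapping between data and visual properties. aes(x = displ, y = hwy) means that the variable displ is mapped to position along the x axis and the variable hwy to position along the y axis.
Besides position, aes() can also map color, shape, transparency, and other visual properties.
geom_point() draws a scatter plot.
+ adds a layer.
The plot above maps position only. ggplot2 can also map color, shape, transparency, and other aesthetics. For example, adding color = class inside aes() shows each car class in a different color.

# color: one hue per class
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# size: works, but ggplot2 warns that size is not advised for a discrete variable
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, size = class)) +
  geom_point()

# shape: the default palette has 6 shapes, so with 7 classes one class is not drawn
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
  geom_point()

# alpha: works, with the same warning as size
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
  geom_point()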
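Mapping vs. setting
Setting the points to one specific color

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = "blue")) +
  geom_point() -> pa
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue") -> pb
pa / pb  # stacking plots with / requires patchwork

The difference between pa and pb: in pa, "blue" sits inside aes(), so it is treated as data, a constant variable with a single value, and mapped to color through the color scale. The points therefore get the scale's default color, and a legend appears. In pb, color = "blue" sits outside aes() and sets the points to blue directly.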
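The difference can be checked without looking at the picture by asking each plot which colours it actually draws. This is a small sketch using layer_data() from ggplot2; the names colour_mapped and colour_set are illustrative.

```r
library(ggplot2)

pa <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + geom_point()
pb <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue")

# Colours each layer really draws.
colour_mapped <- unique(layer_data(pa)$colour)
colour_set    <- unique(layer_data(pb)$colour)

colour_mapped  # a colour chosen by the default discrete colour scale
colour_set     # the literal value passed to geom_point()
```

In pa the string "blue" is only a label for a one-level variable, so the drawn colour comes from the scale rather than from the string itself.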
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(shape = 3)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)
Geometric objects

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() -> p1
p1

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth() -> p2
p2

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() -> p3
p3
Global vs. local mappings

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

A mapping written with aes() inside ggplot() is global. One written inside a geom_xxx() call is local to that layer.
When a layer has no local mapping for an aesthetic, it looks the aesthetic up in the global mapping.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth()

In the plot above, geom_point() and geom_smooth() have no local mappings, so both inherit the global mapping.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

geom_smooth() has no local mapping and inherits the global one. Because the global mapping contains only x and y and no color, a single fitted curve is drawn.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point(aes(color = factor(cyl)))

geom_point() has its own local color mapping, factor(cyl), which overrides the global color = class for that layer. x and y are still inherited.
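The inheritance rule can be made visible by counting how many groups the smoothing layer is fitted for. The sketch below uses method = "lm" with an explicit formula to keep the fits simple and quiet; that choice, and the names p_global, p_local, n_global, and n_local, are assumptions made for the illustration.

```r
library(ggplot2)

# Colour mapped globally: both layers inherit it,
# so geom_smooth() fits one line per class.
p_global <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped locally in geom_point(): geom_smooth() sees only x and y.
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth layer; count the groups it was fitted for.
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))

c(global = n_global, local = n_local)
```

A discrete aesthetic such as color also defines the grouping, which is why moving it from the global mapping to a single layer changes the number of fitted lines.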
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_smooth(method = lm) +
  geom_point()

ggplot(mpg, aes(displ, hwy)) +
  geom_smooth(method = lm) +
  geom_point(aes(color = class))
Saving plots
Use ggsave() to save a plot as a .png or .pdf file.

ggplot(mpg, aes(displ, hwy)) +
  geom_smooth(method = lm) +
  geom_point(aes(color = class)) +
  ggtitle("This is my first plot") -> p
ggsave("first.pdf", p, width = 8, height = 6, dpi = 300)
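A self-contained variant that writes into a temporary directory shows the same call without touching the working directory. The path built with tempdir() and the names out_dir and pdf_file are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

out_dir  <- tempdir()
pdf_file <- file.path(out_dir, "first.pdf")

# The output device is inferred from the file extension;
# width and height are in inches by default.
ggsave(pdf_file, plot = p, width = 8, height = 6)

file.exists(pdf_file)
```

For raster formats such as .png, the dpi argument sets the resolution. A vector format such as .pdf scales without it.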
Continuing with the more advanced tutorial

library(tidyverse)
library(gghighlight)
library(cowplot)
library(patchwork)
library(ggforce)

read_csv("data/datasaurus.csv") -> df

df %>% count(dataset)
#> # A tibble: 13 × 2
#> dataset n
#> <chr> <int>
#> 1 away 142
#> 2 bullseye 142
#> 3 circle 142
#> 4 dino 142
#> 5 dots 142
#> 6 h_lines 142
#> 7 high_lines 142
#> 8 slant_down 142
#> 9 slant_up 142
#> 10 star 142
#> 11 v_lines 142
#> 12 wide_lines 142
#> 13 x_shape 142
df %>%
  group_by(dataset) %>%
  summarise(
    across(everything(), list(mean = mean, sd = sd), .names = "{fn}_{col}")
  ) %>%
  # where() is required for predicate selection in current dplyr
  mutate(across(where(is.numeric), ~ round(.x, 3)))
#> # A tibble: 13 × 5
#> dataset mean_x sd_x mean_y sd_y
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 away 54.3 16.8 47.8 26.9
#> 2 bullseye 54.3 16.8 47.8 26.9
#> 3 circle 54.3 16.8 47.8 26.9
#> 4 dino 54.3 16.8 47.8 26.9
#> 5 dots 54.3 16.8 47.8 26.9
#> 6 h_lines 54.3 16.8 47.8 26.9
#> 7 high_lines 54.3 16.8 47.8 26.9
#> 8 slant_down 54.3 16.8 47.8 26.9
#> 9 slant_up 54.3 16.8 47.8 26.9
#> 10 star 54.3 16.8 47.8 26.9
#> 11 v_lines 54.3 16.8 47.8 26.9
#> 12 wide_lines 54.3 16.8 47.8 26.9
#> 13 x_shape 54.3 16.8 47.8 26.9
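ggplot(df, aes(x, y, color = dataset)) +
  geom_point() +
  theme(legend.position = "none") +
  facet_wrap(~ dataset, nrow = 3)

Although the 13 datasets share the same means and standard deviations, their plots look completely different. Seeing is believing: visualization is an essential part of data exploration.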
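The same point can be reproduced without the datasaurus file by swapping in Anscombe's quartet, which ships with base R as anscombe. This is a different dataset from the one used above and is chosen only so that the sketch runs as written. The helper names x_means, y_means, y_sds, and long are illustrative.

```r
library(ggplot2)

# Anscombe's quartet: four x/y pairs stored as columns x1..x4 and y1..y4.
x_means <- sapply(anscombe[, 1:4], mean)
y_means <- sapply(anscombe[, 5:8], mean)
y_sds   <- sapply(anscombe[, 5:8], sd)

round(x_means, 2)
round(y_means, 2)
round(y_sds, 2)

# Reshape to long form by hand: 11 rows per set, stacked set by set.
long <- data.frame(
  set = rep(paste("set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, 1:4], use.names = FALSE),
  y   = unlist(anscombe[, 5:8], use.names = FALSE)
)

# Near-identical summaries, clearly different pictures.
ggplot(long, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set)
```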
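As noted earlier, R has character, numeric, factor, logical, date, and other data types. By default ggplot2 treats character, factor, and logical variables as discrete, and numeric variables as continuous (dates are placed on a continuous date scale). A plot may combine several types of variables at once, for example:
one discrete variable
one continuous variable
two discrete variables
two continuous variables
one discrete and one continuous variable
three continuous variables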
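Which default scale family ggplot2 picks for a given vector can be queried with scale_type(), a function ggplot2 exports mainly for extension authors. This is a hedged sketch; the exact strings returned may differ between ggplot2 versions, so no outputs are shown.

```r
library(ggplot2)

# scale_type() reports the default scale family for a vector.
scale_type(c(1.5, 2.5))            # numeric
scale_type(c("a", "b"))            # character
scale_type(factor(c("a", "b")))    # factor
scale_type(c(TRUE, FALSE))         # logical
scale_type(as.Date("2020-06-10"))  # Date
```

When a numeric column really encodes categories (cyl in mpg, for instance), wrapping it in factor() inside aes() switches it to a discrete scale.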
read_csv("data/gapminder.csv") -> gapdata
gapdata
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ℹ 1,694 more rows
Check for missing values

gapdata %>%
  summarise(
    across(everything(), ~ sum(is.na(.)))
  )
#> # A tibble: 1 × 6
#> country continent year lifeExp pop gdpPercap
#> <int> <int> <int> <int> <int> <int>
#> 1 0 0 0 0 0 0
Choosing a chart type for the data
Bar charts
Commonly used for a single discrete variable.
geom_bar() does the counting automatically through stat_count().

ggplot(data = gapdata) +
  geom_bar(aes(x = continent))

Reorder the categories on the x axis with reorder():

ggplot(data = gapdata) +
  geom_bar(aes(x = reorder(continent, continent, length)))

Flip the x and y axes:

ggplot(data = gapdata) +
  geom_bar(aes(x = reorder(continent, continent, length))) +
  coord_flip()

# count countries (not rows) per continent
gapdata %>%
  distinct(continent, country) %>%
  ggplot(aes(x = continent)) +
  geom_bar()

# the same plot with the counts computed explicitly and drawn by geom_col()
gapdata %>%
  distinct(continent, country) %>%
  group_by(continent) %>%
  summarise(n = n()) %>%
  ggplot() +
  geom_col(aes(x = continent, y = n))
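That geom_bar() counts for you while geom_col() draws the heights it is given can be verified on the built-in mpg data. In this sketch the counting for geom_col() is done with base table(); the object names are illustrative.

```r
library(ggplot2)

# geom_bar() counts rows per category through stat_count().
p_bar <- ggplot(mpg) + geom_bar(aes(x = class))
bar_data <- layer_data(p_bar)
bar_heights <- bar_data$count[order(bar_data$x)]

# geom_col() draws the heights it is given, so the counting comes first.
counts <- as.data.frame(table(class = mpg$class))
p_col <- ggplot(counts) + geom_col(aes(x = class, y = Freq))
col_data <- layer_data(p_col)
col_heights <- col_data$y[order(col_data$x)]

rbind(bar_heights, col_heights)
```

Both plots draw the same bars. The only difference is where the counting happens.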
Histograms
For a continuous variable

gapdata %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 1)

gapdata %>%
  ggplot(aes(x = lifeExp, color = continent)) +
  geom_freqpoly()

gapdata %>%
  ggplot() +
  geom_density(aes(x = lifeExp, fill = continent)) +
  facet_wrap(~ continent, ncol = 1)

gapdata %>%
  ggplot(aes(x = lifeExp)) +
  geom_density(adjust = 1)

gapdata %>%
  ggplot(aes(x = lifeExp)) +
  geom_density(adjust = 0.3)

# alpha set to a fixed value
gapdata %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

# alpha mapped inside aes(): shown for comparison, see the discussion below
gapdata %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(aes(alpha = 0.3))
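In ggplot2, a plot is composed of data, geometric objects (geoms), and aesthetic mappings. Together these define what the plot looks like and what it represents.
aes() defines the aesthetic mappings. Whenever a variable in the data should determine a visual property (color, size, shape, and so on), you use aes(). Putting a variable inside aes() tells ggplot: "vary this visual property according to the values of this variable."
For example, aes(x = lifeExp, fill = continent) tells ggplot that the x position is determined by lifeExp and that the fill color varies with the categories of continent.
Assigning alpha directly, as in geom_density(alpha = 0.3), gives the property a fixed value: whatever the data, the transparency is 0.3. Writing geom_density(aes(alpha = 0.3)) instead tells ggplot that transparency should vary with data. Because the "data" here is a single constant, nothing actually varies. ggplot2 passes the constant through the alpha scale, so the drawn transparency is whatever the scale assigns, not necessarily 0.3, and a legend is added. This form is confusing, because aes() is expected to hold variables from the dataset, not fixed values.
Why the first form is better: when aes() maps data to a visual property, the thing being mapped should really be a variable in the dataset. Putting a fixed value inside aes() is unclear, because it leads the reader to believe the property varies with the data. For clear and explicit code, avoid it.

gapdata %>%
  dplyr::filter(continent != "Oceania") %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_histogram() +
  facet_grid(continent ~ .)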
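Drawing the histogram and the density curve together. Note that y = after_stat(density) means that y is a new variable computed by the stat from x. Older code spells this stat(density), which recent ggplot2 releases have deprecated in favor of after_stat(). Similar computed variables include after_stat(count) and after_stat(level).

gapdata %>%
  dplyr::filter(continent != "Oceania") %>%
  ggplot(aes(x = lifeExp, y = after_stat(density))) +
  geom_histogram(aes(fill = continent)) +
  geom_density() +
  facet_grid(continent ~ .)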
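What the computed density variable means can be checked numerically: with density on the y axis, the bar areas of a histogram sum to 1, which is what lets a histogram and a density curve share one axis. This sketch uses the built-in mpg data and assumes ggplot2 3.3.0 or later for after_stat(). bins = 20 is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy, y = after_stat(density))) +
  geom_histogram(bins = 20)

bars <- layer_data(p)

# With density on the y axis, bar area (height x width) sums to 1.
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```

The default, after_stat(count), instead makes the bar heights sum to the number of rows.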
Box plots
One discrete and one continuous variable

gapdata %>%
  ggplot(aes(x = factor(year), y = lifeExp)) +
  geom_boxplot()

Violin plots

gapdata %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_violin(aes(group = year)) +
  geom_jitter(alpha = 0.25) +
  geom_smooth(se = FALSE)

Jitter plots

gapdata %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_jitter()

# fun replaces the deprecated fun.y; the original also misspelled size as szie
gapdata %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_jitter() +
  stat_summary(fun = median, color = "red", geom = "point", size = 5)

# mean plus/minus one standard deviation drawn as a point range
gapdata %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  stat_summary(
    fun = mean,
    fun.max = function(x) {
      mean(x) + sd(x)
    },
    fun.min = function(x) {
      mean(x) - sd(x)
    },
    geom = "pointrange"
  )
Ridgeline plots
One continuous and one discrete variable

gapdata %>%
  ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
  ggridges::geom_density_ridges(alpha = 0.5)

gapdata %>%
  ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
  ggridges::geom_density_ridges(alpha = 0.5) +
  scale_fill_manual(
    values = c("#003f5c", "#58508d", "#bc5090", "#ff6361", "#ffa600")
  )

gapdata %>%
  ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
  ggridges::geom_density_ridges(alpha = 0.5) +
  scale_fill_manual(
    values = colorspace::sequential_hcl(5, palette = "Peach")
  )

Scatter plots
Two continuous variables

gapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# log-transform the variable itself: the axis shows log values
gapdata %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")

# log-transform the scale: the axis keeps the original units
gapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
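Theme settings
How themes work
There are four element functions, element_*():
element_text()   # text: controls the font of labels and titles
element_line()   # lines: controls line color, type, and width
element_rect()   # rectangles: controls the fill color of background rectangles and the style of their borders
element_blank()  # blank: allocates no drawing space and removes the element from the plot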
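A compact sketch on the built-in mpg data that uses each of the four element functions once. The theme elements picked here (plot.title, panel.grid.major, panel.background, panel.grid.minor) and the title string are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggtitle("Element functions") +
  theme(
    plot.title       = element_text(face = "bold", hjust = 0.5),       # text
    panel.grid.major = element_line(color = "grey80"),                 # lines
    panel.background = element_rect(fill = "white", color = "black"),  # rectangle
    panel.grid.minor = element_blank()                                 # removed
  )

p
```

Each theme element accepts only its matching element type. Passing element_text() to panel.background, for instance, is an error.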
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Prepare the data

mpg %>%
  filter(class != "2seater", manufacturer %in% c("toyota", "volkswagen")) -> df
df
#> # A tibble: 61 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 toyota 4runner 4… 2.7 1999 4 manu… 4 15 20 r suv
#> 2 toyota 4runner 4… 2.7 1999 4 auto… 4 16 20 r suv
#> 3 toyota 4runner 4… 3.4 1999 6 auto… 4 15 19 r suv
#> 4 toyota 4runner 4… 3.4 1999 6 manu… 4 15 17 r suv
#> 5 toyota 4runner 4… 4 2008 6 auto… 4 16 20 r suv
#> 6 toyota 4runner 4… 4.7 2008 8 auto… 4 14 17 r suv
#> 7 toyota camry 2.2 1999 4 manu… f 21 29 r mids…
#> 8 toyota camry 2.2 1999 4 auto… f 21 27 r mids…
#> 9 toyota camry 2.4 2008 4 manu… f 21 31 r mids…
#> 10 toyota camry 2.4 2008 4 auto… f 21 31 r mids…
#> # ℹ 51 more rows
df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy")

Modifying the theme
Plot-level elements

Description                    Theme element     Type
Whole plot background (plot)   plot.background   element_rect()
Plot title                     plot.title        element_text()
Plot margin                    plot.margin       margin()

df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy") +
  theme(
    # linewidth replaces the deprecated size argument for borders (ggplot2 >= 3.4.0)
    plot.background = element_rect(fill = "orange", color = "black", linewidth = 10),
    plot.title = element_text(hjust = 1, color = "red", face = "bold"),
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")
  )
Axis elements

Description        Theme element   Type
Axis ticks         axis.ticks      element_line()
Axis title         axis.title      element_text()
Axis tick labels   axis.text       element_text()
Axis line          axis.line       element_line()

df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy") +
  theme(
    # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
    axis.line = element_line(color = "orange", linewidth = 1),
    axis.title = element_text(color = "red", face = "italic"),
    axis.ticks = element_line(color = "purple", linewidth = 5),
    axis.text = element_text(color = "blue"),
    axis.text.x = element_text(angle = -45, hjust = 0)
  )
Panel elements

Description        Theme element      Type
Panel background   panel.background   element_rect()
Panel gridlines    panel.grid         element_line()
Panel border       panel.border       element_rect()

df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy") +
  theme(
    # fill = NA keeps the border from covering the panel contents
    panel.border = element_rect(color = "purple", fill = NA),
    panel.background = element_rect(fill = "orange", color = "red"),
    panel.grid = element_line(color = "grey80", linewidth = 0.5)
  )
Legend elements

Description         Theme element       Type
Legend background   legend.background   element_rect()
Legend keys         legend.key          element_rect()
Legend labels       legend.text         element_text()
Legend title        legend.title        element_text()
Legend margin       legend.margin       margin()
Legend position     legend.position     "top", "bottom", "left", "right"

df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy") +
  theme(
    legend.background = element_rect(color = "orange"),
    legend.title = element_text(color = "blue", size = 20),
    legend.key = element_rect(fill = "grey80", color = NA),
    legend.text = element_text(color = "red"),
    legend.margin = margin(t = 5),
    legend.position = "top"
  )
Facet elements

Description              Theme element      Type
Facet label background   strip.background   element_rect()
Facet label text         strip.text         element_text()
Spacing between panels   panel.spacing      unit()

df %>%
  ggplot(aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  facet_grid(vars(manufacturer), vars(class)) +
  ggtitle("This is my title") +
  labs(x = "x_displ", y = "y_hwy") +
  theme(
    strip.background = element_rect(fill = "orange"),
    strip.text = element_text(color = "red"),
    panel.spacing = unit(0.1, "cm")
  )
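Scales
A mapping turns data into aesthetics, meaning anything the eye can perceive, such as size, shape, color, and position. A scale is the function that controls this mapping: each scale is a function from a region of data space (the scale's domain) to a region of aesthetic space (the scale's range).
Every aesthetic has a scale behind it.

# set the axis title through labs()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  labs(x = "x axis")

# the same result through the x scale
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_x_continuous(name = "x axis")

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_color_brewer()

Naming rule for scale functions:
The name of a scale function has three parts separated by "_":
scale
the name of the aesthetic (e.g., color, shape or x)
the name of the scale (e.g., continuous, discrete, brewer)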
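The name argument: the title of the axis or legend. Use name = NULL to drop the title.
The limits argument: the range of the axis or legend. Continuous: c(n, m). Discrete: c("a", "b", "c").
The breaks argument: controls which values (elements) are shown on the axis or legend.
The labels argument: the labels shown at the breaks of the axis or legend.
    By default they are computed automatically.
    A character vector can be supplied by hand, matching the breaks one to one.
    A function can be supplied, which receives the breaks as its input.
    NULL removes the labels.
The values argument: the aesthetic values themselves (colors, shapes, and so on).
    Either in the same order as the data values,
    or of the same length as the vector given to breaks,
    or as a named vector c("data label" = "aesthetic value").
The expand argument: controls the padding added beyond the range of the data.
The range argument: sets the range of output sizes, for example the relative sizes of points.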
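The arguments listed above can be sketched together on the built-in mpg data. The axis title, limits, breaks, labels, and colors here are arbitrary illustrative values.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_x_continuous(
    name   = "engine displacement",   # axis title
    limits = c(1, 8),                 # continuous limits: c(n, m)
    breaks = c(2, 4, 6),              # where ticks are drawn
    labels = c("2 L", "4 L", "6 L")   # one label per break
  ) +
  scale_color_manual(
    name   = NULL,                    # drop the legend title
    # named vector: "data label" = "aesthetic value"
    values = c("4" = "red", "f" = "blue", "r" = "darkgreen"),
    labels = c("4" = "four-wheel", "f" = "front", "r" = "rear")
  )

# Colours actually drawn by the point layer.
unique(layer_data(p)$colour)
```

Naming the entries of values ties each color to a data value explicitly, so the result does not depend on the order of the levels.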
gapdata <- read_csv("data/gapminder.csv")

newgapdata <- gapdata %>%
  group_by(continent, country) %>%
  summarise(
    across(c(lifeExp, gdpPercap, pop), mean)
  )
newgapdata
#> # A tibble: 142 × 5
#> # Groups: continent [5]
#> continent country lifeExp gdpPercap pop
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Africa Algeria 59.0 4426. 19875406.
#> 2 Africa Angola 37.9 3607. 7309390.
#> 3 Africa Benin 48.8 1155. 4017497.
#> 4 Africa Botswana 54.6 5032. 971186.
#> 5 Africa Burkina Faso 44.7 844. 7548677.
#> 6 Africa Burundi 44.8 472. 4651608.
#> 7 Africa Cameroon 48.1 1775. 9816648.
#> 8 Africa Central African Republic 43.9 959. 2560963
#> 9 Africa Chad 46.8 1165. 5329256.
#> 10 Africa Comoros 52.4 1314. 361684.
#> # ℹ 132 more rows
Axes

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_continuous()

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10()

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10(
    name = "GDP per capita",
    breaks = c(500, 1000, 3000, 10000, 30000),
    labels = scales::unit_format(unit = "dollar")
  )
Colors

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  scale_color_viridis_d()

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  scale_color_brewer(type = "qual", palette = "Set1")

newgapdata %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  scale_color_manual(
    name = "continents",
    values = c(
      "Africa" = "red", "Americas" = "blue", "Asia" = "orange",
      "Europe" = "black", "Oceania" = "gray"
    ),
    breaks = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
    labels = c("africa", "americas", "asia", "europe", "oceania")
  ) +
  scale_size(
    name = "population size",
    breaks = c(2e8, 5e8, 7e8),
    labels = c("200 million", "500 million", "700 million")
  )
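Scale or theme?
When should you use a scale and when a theme? A rule of thumb: a theme never adds labels and never changes the range of a variable. It only changes styling such as fonts, sizes, and colors.
Tutorial (1)
Load packages
Load data

haven::read_dta("data/auto.dta") -> auto
auto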
#> # A tibble: 74 × 12
#> make price mpg rep78 headroom trunk weight length turn displacement
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Concord 4099 22 3 2.5 11 2930 186 40 121
#> 2 AMC Pacer 4749 17 3 3 11 3350 173 40 258
#> 3 AMC Spirit 3799 22 NA 3 12 2640 168 35 121
#> 4 Buick Cent… 4816 20 3 4.5 16 3250 196 40 196
#> 5 Buick Elec… 7827 15 4 4 20 4080 222 43 350
#> 6 Buick LeSa… 5788 18 3 4 21 3670 218 43 231
#> 7 Buick Opel 4453 26 NA 3 10 2230 170 34 304
#> 8 Buick Regal 5189 20 3 2 16 3280 200 42 196
#> 9 Buick Rivi… 10372 16 3 3.5 17 3880 207 43 231
#> 10 Buick Skyl… 4082 19 3 3.5 13 3400 200 42 231
#> # ℹ 64 more rows
#> # ℹ 2 more variables: gear_ratio <dbl>, foreign <dbl+lbl>
#> $class
#> [1] "tbl_df" "tbl" "data.frame"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
#>
#> $label
#> [1] "1978 Automobile Data"
#>
#> $notes
#> [1] "1"
#> [2] "from Consumer Reports with permission"
#>
#> $names
#> [1] "make" "price" "mpg" "rep78" "headroom"
#> [6] "trunk" "weight" "length" "turn" "displacement"
#> [11] "gear_ratio" "foreign"
Create a ggplot2 plot

ggplot() +
  geom_point(mapping = aes(x = weight, y = price), data = auto)

Aesthetic mappings

ggplot(data = auto) +
  geom_point(mapping = aes(x = weight, y = price, size = rep78, color = rep78, alpha = rep78))

rep78 is a discrete variable, but in the plot above the color mapping treats it as continuous by default. Converting rep78 to a factor makes the color legend discrete.

ggplot(data = auto) +
  geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78))

Merging the legends into one requires solving two problems. Judging from the code below, these are dropping the rows where rep78 is missing and giving the color legend the same title, rep78, as the other legends:

ggplot(data = auto %>% subset(!is.na(rep78))) +
  geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78)) +
  labs(color = "rep78")

ggplot(data = auto %>% subset(!is.na(rep78))) +
  geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78)) +
  scale_color_discrete(name = "rep78")

size inside aes() is a mapping; size outside aes() is a setting.

ggplot(data = auto %>% subset(!is.na(rep78))) +
  geom_point(mapping = aes(x = weight, y = price, color = factor(rep78)), size = 4)
The stroke aesthetic controls the width of a point's outline (the ring).

ggplot(data = subset(auto, !is.na(rep78))) +
  geom_point(mapping = aes(x = weight, y = price, stroke = rep78), shape = 1)

haven::read_dta("data/world-covid19.dta") -> df

# one line per country
ggplot(df) +
  geom_line(aes(date, confirmed, group = country))

# one colored line per country, with the very long legend hidden
ggplot(df) +
  geom_line(aes(date, confirmed, color = country)) +
  theme(legend.position = "none")
# draw the 25 point shapes on a 5 x 5 grid, each labelled with its number
crossing(x = 1:5, y = 1:5) %>%
  mutate(z = row_number()) %>%
  ggplot(aes(x, y, shape = I(z))) +
  geom_point(size = 5, color = "red", fill = "green") +
  geom_label(aes(label = z, y = y + 0.5), size = 1.5) +
  theme(axis.text = element_blank(), axis.title = element_blank())

auto %>%
  count(rep78) %>%
  ggplot(aes(x = rep78, y = n)) +
  ggchicklet::geom_chicklet()

# the country names are values stored in the data and must match it exactly
df %>%
  filter(country %in% c("中国", "美国")) %>%
  ggplot(aes(x = date, y = confirmed, color = country)) +
  geom_line()
Position adjustments
An ordinary bar chart

ggplot(auto) +
  geom_bar(aes(x = rep78, fill = factor(rep78)))

geom_col() is geom_bar() without the statistical transformation: it draws the y values as given instead of counting rows.

# position_identity(): bars are drawn where they fall and overlap each other
ggplot(auto) +
  geom_bar(
    aes(x = rep78, color = factor(foreign), fill = factor(foreign)),
    position = position_identity()
  ) +
  theme(legend.position = c(0.2, 0.7)) +
  labs(fill = "foreign", color = "foreign")

# position_dodge(): bars are placed side by side
ggplot(auto) +
  geom_bar(
    aes(x = rep78, color = factor(foreign), fill = factor(foreign)),
    position = position_dodge(width = 0.3)
  ) +
  theme(legend.position = c(0.2, 0.7)) +
  labs(fill = "foreign", color = "foreign")

# position_fill(): bars are stacked and rescaled to proportions
ggplot(auto) +
  geom_bar(
    aes(x = rep78, color = factor(foreign), fill = factor(foreign)),
    position = position_fill()
  ) +
  theme(
    legend.position = c(0.2, 0.7),
    legend.background = element_rect(fill = "white", color = "white")
  ) +
  labs(fill = "foreign", color = "foreign")
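The effect of the position adjustments can be compared on the built-in mpg data, which avoids the need for the auto.dta file. The variables class and drv stand in for rep78 and foreign, and the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # default: stacked, heights are counts
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked and rescaled to proportions

# Top of each stacked column under position = "fill".
fill_data   <- layer_data(p_fill)
column_tops <- tapply(fill_data$ymax, fill_data$x, max)
column_tops
```

Under position = "fill" every column is rescaled to the same total height, so the chart shows the composition within each category but not the category sizes.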
Coordinate systems
coord_flip()
coord_fixed()
coord_sf()

library(sf)

read_sf("data/world_high_resolution_mill.geo.json") -> wdmp
haven::read_dta("data/world-covid19.dta") -> df

wdmp %>%
  left_join(df, by = c("code" = "iso")) %>%
  subset(date == "2020-06-10") -> mydata
mydata
#> Simple feature collection with 180 features and 9 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -20015110 ymin: -6947577 xmax: 20015110 ymax: 12641180
#> Projected CRS: World_Miller_Cylindrical
#> # A tibble: 180 × 10
#> name name_en code geometry country country_en date
#> <chr> <chr> <chr> <MULTIPOLYGON [m]> <chr> <chr> <date>
#> 1 阿尔及… ALGERIA DZA (((-245324.1 4068138, -2… 阿尔及… Algeria 2020-06-10
#> 2 列支敦… Liecht… LIE (((1067880 5656162, 1059… 列支敦… Liechtens… 2020-06-10
#> 3 埃及 EGYPT EGY (((3961141 2619833, 3915… 埃及 Egypt 2020-06-10
#> 4 孟加拉… BANGLA… BGD (((10297641 2482880, 102… 孟加拉… Bangladesh 2020-06-10
#> 5 尼日尔 Niger NER (((400866.4 1306467, 394… 尼日尔 Niger 2020-06-10
#> 6 卡塔尔 QATAR QAT (((5699872 2791895, 5682… 卡塔尔 Qatar 2020-06-10
#> 7 纳米比… NAMIBIA NAM (((2223927 -2808902, 222… 纳米比… Namibia 2020-06-10
#> 8 保加利… Bulgar… BGR (((2486312 5005203, 2497… 保加利… Bulgaria 2020-06-10
#> 9 玻利维… Bolivia BOL (((-6966045 -2514366, -6… 玻利维… Bolivia 2020-06-10
#> 10 加纳 Ghana GHA (((-299106.5 1058207, -3… 加纳 Ghana 2020-06-10
#> # ℹ 170 more rows
#> # ℹ 3 more variables: confirmed <dbl>, recovered <dbl>, deaths <dbl>
mydata %>%
  ggplot() +
  # linewidth replaces size for polygon borders in ggplot2 >= 3.4.0
  geom_sf(aes(fill = confirmed), color = "white", linewidth = 0.01) +
  coord_sf() +
  guides(fill = guide_legend()) +
  theme(
    panel.grid.major = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  labs(
    fill = "Confirmed cases",
    title = "Global COVID-19 pandemic: 2020-06-10",
    caption = "Data source"
  )