(18) ggplot2 Study Notes

This article is an original work by SCY. Please credit the source when reposting.

This article explains the plotting principles of the ggplot2 package and works through examples.

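ggplot2 is built on a layered grammar of graphics (the Grammar of Graphics). Put simply, you draw each part of a plot separately and then add the parts together to form the finished plot. When using ggplot2 you will keep coming back to the following objects: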

  1. Geometric objects (geom): the shapes that represent the data, such as points, lines, bars, and histograms. ggplot2 provides them as geom_*() functions.
library(tidyverse)
library(ggplot2)
library(patchwork)
  2. Scales (scale): scales map values in the data space to values in the aesthetic space, such as colour, shape, or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot.

  3. Aesthetic attributes (aes): the mapping from data to aesthetic attributes (color, shape, size).

  4. Coordinate system (coord): describes how data are mapped onto the plane of the plot, and provides the axes and gridlines.

  5. Statistical transformations (stat): statistical transformations that summarise the data.

  6. Facets (facet): a facet specifies how to break up and display subsets of data as small multiples. This is also known as conditioning or latticing/trellising.

  7. Themes (theme): a theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.

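How it works

A ggplot2 plot is built from nine components:

  • data (data)
  • mapping (mapping)
  • geometric objects (geom)
  • statistical transformations (stat)
  • scales (scale)
  • coordinate system (coord)
  • facets (facet)
  • theme (theme)
  • saving and output (output)

The first three are required.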

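Hadley Wickham describes this grammar as follows: a statistical graphic is a mapping from data to the aesthetic attributes (aes) of geometric objects (geom).

In addition, a plot may contain statistical transformations (stat) of the data and is drawn in a specific coordinate system (coord), while faceting (facet) can be used to draw the same plot for different subsets of the data.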
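The component list above can be sketched with the mpg data that ships with ggplot2. This is only an illustration: the particular scale name, axis limits, and theme are arbitrary choices, not part of the original notes.

```r
library(ggplot2)

# Required components: data, an aesthetic mapping, and at least one geom.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Optional components are added with `+`, one per concept.
p_full <- p +
  geom_smooth(method = "lm", formula = y ~ x) +    # stat: fitted linear model
  scale_x_continuous(name = "displacement (l)") +  # scale
  coord_cartesian(ylim = c(10, 45)) +              # coord
  facet_wrap(~drv) +                               # facet
  theme_minimal()                                  # theme

# Rendering the plot to a grob checks that all components fit together.
grob <- ggplotGrob(p_full)
```

Each `+` adds one component, so a plot can be built up and inspected step by step.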

Syntax template

library(tidyverse)
library(ggplot2)
library(colorspace)

d <- read_csv("data/temp_carbon.csv")
ggplot(data = d, mapping = aes(x = year, y = carbon_emissions)) +
geom_line() +
xlab("year") +
ylab("carbon emissions (metric tons)") +
ggtitle("Annual global carbon emissions, 1880-2014")

Mappings

mpg
#> # A tibble: 234 × 11
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
#>  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
#>  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
#>  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
#>  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
#>  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
#>  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
#>  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
#>  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
#> 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
#> # ℹ 224 more rows
str(mpg)
#> tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#>  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
#>  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#>  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#>  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#>  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#>  $ drv         : chr [1:234] "f" "f" "f" "f" ...
#>  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#>  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#>  $ fl          : chr [1:234] "p" "p" "p" "p" ...
#>  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
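No.  Variable      Meaning
1    manufacturer  manufacturer
2    model         model name
3    displ         engine displacement (litres)
4    year          year of manufacture
5    cyl           number of cylinders
6    trans         type of transmission
7    drv           drive type (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd)
8    cty           city miles per gallon
9    hwy           highway miles per gallon
10   fl            fuel type
11   class         type of car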

What is the relationship between engine displacement and fuel economy?

Extract a subset of columns: displ, hwy, and class.

mpg %>% 
select(displ, hwy, class)
#> # A tibble: 234 × 3
#>    displ   hwy class  
#>    <dbl> <int> <chr>  
#>  1   1.8    29 compact
#>  2   1.8    29 compact
#>  3   2      31 compact
#>  4   2      30 compact
#>  5   2.8    26 compact
#>  6   2.8    26 compact
#>  7   3.1    27 compact
#>  8   1.8    26 compact
#>  9   1.8    25 compact
#> 10   2      28 compact
#> # ℹ 224 more rows
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()

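  • ggplot() starts a plot, and data = mpg tells it to draw from the mpg data frame.

  • aes() defines the mapping between data and visual properties. aes(x = displ, y = hwy) maps the variable displ to the position along the x axis and the variable hwy to the position along the y axis.

  • Besides position, aes() can also map variables to color, shape, transparency, and other visual properties.

  • geom_point() draws a scatter plot.

  • + adds a layer.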

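The example above maps position only. ggplot2 can also map color, shape, transparency, and other aesthetics. For example, adding color = class inside aes() draws each class of car in a different color.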

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()

Mapping vs. setting

Setting the points to one specific color

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = "blue")) +
geom_point() -> pa
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "blue") -> pb
pa / pb

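The difference between pa and pb: in pa, the constant "blue" is treated as a data value with a single level and is mapped to color through the color scale, so the points get the scale's default color and a legend appears. In pb, the color of the points is set directly to blue.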
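The difference can also be checked numerically. The sketch below rebuilds pa and pb and uses layer_data(), which returns a layer's data after the scales have been applied; the variable names mapped_colour and set_colour are made up for this illustration.

```r
library(ggplot2)

# "blue" inside aes(): treated as data and passed through the colour scale.
pa <- ggplot(mpg, aes(displ, hwy, color = "blue")) + geom_point()

# "blue" outside aes(): used directly as the colour of the points.
pb <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")

# Colours that are actually drawn in each plot.
mapped_colour <- unique(layer_data(pa)$colour)
set_colour <- unique(layer_data(pb)$colour)
```

Only set_colour is "blue"; mapped_colour is whatever the default discrete color scale assigns to its single level.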

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(shape= 3)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(alpha = 0.3)

Geometric objects

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() -> p1
p1

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth() -> p2
p2

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth() -> p3
p3

(p1 / p2) | p3 

Global vs. local mappings

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()

ggplot(data = mpg) +
geom_point( mapping = aes(x = displ, y = hwy, color = class))

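In fact, a mapping written with aes() inside ggplot() is global (shared by every layer), while a mapping written inside geom_xxx() is local to that layer.

When a layer has no local mapping for an aesthetic, it looks it up in the global mapping.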

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
geom_smooth()

In the plot above, geom_point() and geom_smooth() have no local mappings, so both inherit the global mapping.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth()

geom_smooth() has no local mapping and inherits the global one, but the global mapping contains only x and y (no color), so a single fitted curve is drawn.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point(aes(color = factor(cyl)))

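geom_point() has the local mapping color = factor(cyl), which overrides the global color = class for that layer. The x and y mappings are still inherited.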
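One way to see inheritance at work is to count the groups that geom_smooth() fits. The sketch below assumes method = "lm" (a simplification, used only to keep the fit quiet on small groups) and reads the number of fitted lines from layer_data().

```r
library(ggplot2)

# color = class is global, so geom_smooth() inherits it and fits one line per class.
p_global <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color = class is local to geom_point(), so geom_smooth() sees only x and y.
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is geom_smooth(); each group corresponds to one fitted line.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local <- length(unique(layer_data(p_local, 2)$group))
```

With the global color mapping there is one line per car class; with the local mapping there is a single line for all points.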

ggplot(mpg, aes(displ, hwy, color = class)) +
geom_smooth(method = lm) +
geom_point()

ggplot(mpg, aes(displ, hwy)) +
geom_smooth(method = lm) +
geom_point(aes(color = class))

Saving plots

Use ggsave() to save a plot as a .png or .pdf file.

ggplot(mpg, aes(displ, hwy)) +
geom_smooth(method = lm) +
geom_point(aes(color = class)) +
ggtitle("This is my first plot") -> p
ggsave("first.pdf", p, width = 8, height = 6, dpi = 300)
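ggsave() picks the output device from the file extension, and width and height are in inches unless units is given. The sketch below writes to a temporary directory (an assumption made so that the example leaves no files behind) and specifies the size in centimetres.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# The ".pdf" extension selects the PDF device; a ".png" name would select PNG,
# where dpi controls the pixel resolution.
out_file <- file.path(tempdir(), "first-plot.pdf")
ggsave(out_file, plot = p, width = 16, height = 12, units = "cm")

saved <- file.exists(out_file)
```

Passing the plot explicitly through plot = avoids relying on whichever plot was displayed last.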

A more advanced tutorial

library(tidyverse)
library(gghighlight)
library(cowplot)
library(patchwork)
library(ggforce)
read_csv("data/datasaurus.csv") -> df

df %>%
count(dataset)
#> # A tibble: 13 × 2
#>    dataset        n
#>    <chr>      <int>
#>  1 away         142
#>  2 bullseye     142
#>  3 circle       142
#>  4 dino         142
#>  5 dots         142
#>  6 h_lines      142
#>  7 high_lines   142
#>  8 slant_down   142
#>  9 slant_up     142
#> 10 star         142
#> 11 v_lines      142
#> 12 wide_lines   142
#> 13 x_shape      142
df %>% 
group_by(dataset) %>%
summarise(
across(everything(), list(mean = mean, sd =sd), .names = "{fn}_{col}")
) %>%
mutate(
across(where(is.numeric), ~ round(.x, 3))
)
#> # A tibble: 13 × 5
#>    dataset    mean_x  sd_x mean_y  sd_y
#>    <chr>       <dbl> <dbl>  <dbl> <dbl>
#>  1 away         54.3  16.8   47.8  26.9
#>  2 bullseye     54.3  16.8   47.8  26.9
#>  3 circle       54.3  16.8   47.8  26.9
#>  4 dino         54.3  16.8   47.8  26.9
#>  5 dots         54.3  16.8   47.8  26.9
#>  6 h_lines      54.3  16.8   47.8  26.9
#>  7 high_lines   54.3  16.8   47.8  26.9
#>  8 slant_down   54.3  16.8   47.8  26.9
#>  9 slant_up     54.3  16.8   47.8  26.9
#> 10 star         54.3  16.8   47.8  26.9
#> 11 v_lines      54.3  16.8   47.8  26.9
#> 12 wide_lines   54.3  16.8   47.8  26.9
#> 13 x_shape      54.3  16.8   47.8  26.9
ggplot(df, aes(x, y, color = dataset)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~dataset, nrow = 3)

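In fact, the datasets look completely different once plotted, even though their summary statistics are almost identical. Seeing is believing: visualization is an essential part of data exploration.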
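The same point can be reproduced without the datasaurus file by using Anscombe's quartet, which ships with R as the anscombe data frame. This is a stand-in chosen for the sketch, not the data used in the text above.

```r
library(ggplot2)

# Reshape the built-in `anscombe` data (columns x1..x4, y1..y4) into long form.
quartet <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = paste("set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# The per-set means are almost identical ...
means <- aggregate(cbind(x, y) ~ set, data = quartet, FUN = mean)

# ... but the scatter plots are clearly different.
p <- ggplot(quartet, aes(x, y)) +
  geom_point() +
  facet_wrap(~set)
```

Printing means and then p shows four sets with matching averages and visibly different shapes.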

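As mentioned earlier, R has character, numeric, factor, logical, and date types, among others. By default ggplot2 treats character, factor, and logical variables as discrete, and numeric and date variables as continuous. When presenting data you may use several types of variable at once, for example:

  • one discrete variable

  • one continuous variable

  • two discrete variables

  • two continuous variables

  • one discrete and one continuous variable

  • three continuous variables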
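The default choice can be inspected with scale_type(), a ggplot2 helper that reports which kind of scale a vector would get. The example vectors below are made up for the sketch.

```r
library(ggplot2)

# Ask ggplot2 which default scale type each kind of vector receives.
types <- list(
  character = scale_type(c("a", "b")),
  factor = scale_type(factor(c("a", "b"))),
  logical = scale_type(c(TRUE, FALSE)),
  numeric = scale_type(c(1.5, 2.5)),
  date = scale_type(as.Date("2020-06-10"))
)
```

Character, factor, and logical vectors are reported as discrete, numeric vectors as continuous, and dates get a date scale rather than a discrete one.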

read_csv("data/gapminder.csv") -> gapdata
gapdata
#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 1,694 more rows

Check for missing values

gapdata %>% 
summarise(
across(everything(), ~ sum(is.na(.)))
)
#> # A tibble: 1 × 6
#>   country continent  year lifeExp   pop gdpPercap
#>     <int>     <int> <int>   <int> <int>     <int>
#> 1       0         0     0       0     0         0

Choosing a chart type

Bar charts

Typically used for a single discrete variable.

geom_bar() does the counting automatically through stat_count().

ggplot(data = gapdata) +
geom_bar(aes(x = continent))

Use reorder() to sort the categories on the x axis.

ggplot(data = gapdata) +
geom_bar(aes(x = reorder(x = continent, continent, length)))

Flip the x and y axes

ggplot(data = gapdata) +
geom_bar(aes(x = reorder(x = continent, continent, length))) +
coord_flip()

gapdata %>% 
distinct(continent, country) %>%
ggplot(aes(x = continent)) +
geom_bar()

# distinct() in the code above keeps only unique rows and drops duplicates.
# Below, the counts are computed first and then plotted with geom_col().
gapdata %>%
distinct(continent, country) %>%
group_by(continent) %>%
summarise(n = n()) %>%
ggplot() +
geom_col(aes(x = continent, y = n))
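That geom_bar() and a pre-computed geom_col() draw the same bars can be verified on the mpg data. The sketch below counts rows with base R's table() and compares the bar heights reported by layer_data(); the names counts, bar_heights, and col_heights are illustrative.

```r
library(ggplot2)

# Count the rows per class by hand.
counts <- as.data.frame(table(class = mpg$class), responseName = "n")

# geom_bar() counts rows itself (stat_count) ...
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# ... while geom_col() plots the supplied y values unchanged.
p_col <- ggplot(counts, aes(x = class, y = n)) + geom_col()

bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
```

Both vectors hold one height per class and agree element by element.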

Histograms

For a continuous variable.

gapdata %>% 
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 1)

# binwidth in the histogram above sets the width of each bin.
gapdata %>% 
ggplot(aes( x= lifeExp, color = continent)) +
geom_freqpoly()

# smooth histogram = density plot
gapdata %>%
ggplot() +
geom_density(aes(x= lifeExp, fill = continent)) +
facet_wrap(~continent, ncol = 1)

gapdata %>% 
ggplot(aes(x = lifeExp)) +
geom_density(adjust = 1)

gapdata %>% 
ggplot(aes(x = lifeExp)) +
geom_density(adjust = 0.3)

gapdata %>% 
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)

gapdata %>% 
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(aes(alpha = 0.3))

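In ggplot2, a plot is made up of data, geometric objects (geoms), and aesthetic mappings. Together these define what the plot looks like and what it represents.

aes() defines aesthetic mappings. Put simply, use aes() whenever a visual property (color, size, shape, and so on) should be determined by a variable in the data. Placing a variable inside aes() tells ggplot2 to vary that property according to the values of the variable.

For example, aes(x = lifeExp, fill = continent) tells ggplot2 that the x position is determined by lifeExp and that the fill color varies with the categories of continent.

Writing geom_density(alpha = 0.3) sets alpha to a fixed value, so the property is the same regardless of the data. Writing geom_density(aes(alpha = 0.3)) instead asks ggplot2 to derive transparency from data. Because the value supplied is a constant, nothing actually varies: ggplot2 treats 0.3 as a data value, passes it through the alpha scale, and adds a legend for it. This is a confusing way to write it, because aes() is normally expected to contain variables from the dataset rather than fixed values.

Why the first form is better: aes() should map real variables from the dataset to visual properties. Putting a fixed value inside aes() misleads the reader into thinking that the property varies with the data, so avoid it for the sake of clear code.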

gapdata %>% 
dplyr::filter(continent != "Oceania") %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_histogram() +
facet_grid(continent~.)

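Draw the histogram and the density curve together. Note that y = after_stat(density) means that y is a new variable computed by the stat from x. This is a fixed idiom (older code writes it as stat(density), which is now deprecated); similar computed variables include after_stat(count) and after_stat(level).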

gapdata %>% 
dplyr::filter(continent != "Oceania") %>%
ggplot(aes(x = lifeExp, y = after_stat(density))) +
geom_histogram(aes(fill = continent)) +
geom_density() +
facet_grid(continent~.)
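What the computed density variable contains can be checked on the mpg data: the bar areas of a density-scaled histogram should add up to 1, which is what puts the histogram on the same scale as geom_density(). The binwidth and fill below are arbitrary choices for this sketch.

```r
library(ggplot2)

# y is taken from the `density` variable computed by the stat of each layer.
p <- ggplot(mpg, aes(x = hwy, y = after_stat(density))) +
  geom_histogram(binwidth = 2, fill = "grey70") +
  geom_density()

# Layer 1 is the histogram; each row of its data is one bar.
bars <- layer_data(p, 1)

# Total area of the bars: height (density) times width, summed over all bars.
area <- sum(bars$density * (bars$xmax - bars$xmin))
```

With the default count on the y axis the bars would be far taller than the density curve, and the two layers could not share an axis.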

Box plots

For one discrete and one continuous variable.

# year is numeric, so convert it to a discrete variable with factor() first
gapdata %>%
ggplot(aes(x = factor(year), y = lifeExp)) +
geom_boxplot()

Violin plots

gapdata %>% 
ggplot(aes(x = year, y = lifeExp)) +
geom_violin(aes(group = year)) +
geom_jitter(alpha = 0.25) +
geom_smooth(se = FALSE)

Jitter plots

gapdata %>% 
ggplot(aes(x = continent, y = lifeExp)) +
geom_jitter()

gapdata %>% 
ggplot(aes(x = continent, y = lifeExp)) +
geom_jitter() +
stat_summary(fun = median, color = "red", geom = "point", size = 5)

gapdata %>% 
ggplot(aes(x = continent, y = lifeExp)) +
geom_violin(
trim = FALSE,
alpha = 0.5
) +
stat_summary(
fun = mean,
fun.max = function(x){
mean(x) + sd(x)
},
fun.min = function(x){
mean(x) - sd(x)
},
geom = "pointrange"
)

Ridgeline plots

For one continuous and one discrete variable.

gapdata %>% 
ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
ggridges::geom_density_ridges(alpha = 0.5)

gapdata %>% 
ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
ggridges::geom_density_ridges(alpha = 0.5) +
scale_fill_manual(
values = c("#003f5c", "#58508d", "#bc5090", "#ff6361", "#ffa600")
)

gapdata %>% 
ggplot(aes(x = lifeExp, y = continent, fill = continent)) +
ggridges::geom_density_ridges(alpha = 0.5) +
scale_fill_manual(
values = colorspace::sequential_hcl(5, palette = "Peach")
)

# Adding a fill scale changes the fill colors.

Scatter plots

For two continuous variables.

gapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()

gapdata %>% 
ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm")

gapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10()
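Transforming inside aes() and transforming through the scale place the points identically; the difference is that the scale keeps the axis labelled in the original units. The sketch below uses the first 200 rows of the diamonds data from ggplot2 as a stand-in for gapdata, and log10() rather than the natural log so that the two versions are directly comparable.

```r
library(ggplot2)

d <- head(diamonds, 200)

# Transformation done by the scale: axis labels stay in the original units.
p_scale <- ggplot(d, aes(x = price, y = carat)) +
  geom_point() +
  scale_x_log10()

# Transformation done by hand: axis labels show the logged values.
p_manual <- ggplot(d, aes(x = log10(price), y = carat)) +
  geom_point()

# Internally both plots position the points at log10(price).
x_scale <- layer_data(p_scale)$x
x_manual <- layer_data(p_manual)$x
```

Because the positions agree, the choice between the two is mostly about how the axis should read.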

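Theme settings

How themes work

There are four element functions:

  • element_text() # text: controls the font of labels and titles

  • element_line() # lines: controls line color, type, and width

  • element_rect() # rectangles: controls the fill color and border of background rectangles

  • element_blank() # blank: allocates no space and removes the element from the plot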
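A minimal sketch that uses each of the four element functions once. The theme elements chosen here (plot.title, panel.grid.major, panel.background, panel.grid.minor) are arbitrary examples.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggtitle("Element functions") +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),  # text element
    panel.grid.major = element_line(color = "grey80"),      # line element
    panel.background = element_rect(fill = "white"),        # rectangle element
    panel.grid.minor = element_blank()                      # element removed
  )

# Rendering to a grob checks that every element type was accepted.
grob <- ggplotGrob(p)
```

Each theme element expects a particular element function, so passing element_line() to a text element such as plot.title would raise an error.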

glimpse(mpg)
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Prepare the data

mpg %>% 
filter(class != "2seater", manufacturer %in% c("toyota", "volkswagen")) -> df
df
#> # A tibble: 61 × 11
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 toyota       4runner 4…   2.7  1999     4 manu… 4        15    20 r     suv  
#>  2 toyota       4runner 4…   2.7  1999     4 auto… 4        16    20 r     suv  
#>  3 toyota       4runner 4…   3.4  1999     6 auto… 4        15    19 r     suv  
#>  4 toyota       4runner 4…   3.4  1999     6 manu… 4        15    17 r     suv  
#>  5 toyota       4runner 4…   4    2008     6 auto… 4        16    20 r     suv  
#>  6 toyota       4runner 4…   4.7  2008     8 auto… 4        14    17 r     suv  
#>  7 toyota       camry        2.2  1999     4 manu… f        21    29 r     mids…
#>  8 toyota       camry        2.2  1999     4 auto… f        21    27 r     mids…
#>  9 toyota       camry        2.4  2008     4 manu… f        21    31 r     mids…
#> 10 toyota       camry        2.4  2008     4 auto… f        21    31 r     mids…
#> # ℹ 51 more rows
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy")

Modifying the theme

Plot-level elements

Description            Theme element    Type
Whole-plot background  plot.background  element_rect()
Plot title             plot.title       element_text()
Plot margins           plot.margin      margin()
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy") +
theme(
plot.background = element_rect(fill = "orange", color = "black", size = 10),
plot.title = element_text(hjust = 1, color = "red", face = "bold"),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")
)

Axis elements

Description       Theme element  Type
Axis tick marks   axis.ticks     element_line()
Axis titles       axis.title     element_text()
Axis tick labels  axis.text      element_text()
Axis lines        axis.line      element_line()
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy") +
theme(
axis.line = element_line(color = "orange", size = 1),
axis.title = element_text(color = "red", face = "italic"),
axis.ticks = element_line(color = "purple", size = 5),
axis.text = element_text(color = "blue"),
axis.text.x = element_text(angle = -45, hjust = 0)
)

Panel elements

Description       Theme element     Type
Panel background  panel.background  element_rect()
Panel gridlines   panel.grid        element_line()
Panel border      panel.border      element_rect()
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy") +
theme(
panel.border = element_rect(color = "purple", fill = NA),
panel.background = element_rect(fill = "orange", color = "red"),
panel.grid = element_line(color = "grey80", size = 0.5)
)

Legend elements

Description        Theme element      Type
Legend background  legend.background  element_rect()
Legend keys        legend.key         element_rect()
Legend labels      legend.text        element_text()
Legend title       legend.title       element_text()
Legend margins     legend.margin      margin()
Legend position    legend.position    "top", "bottom", "left", "right"
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy") +
theme(
legend.background = element_rect(color = "orange"),
legend.title = element_text(color = "blue", size = 20),
legend.key = element_rect(fill = "grey80", color = NA),
legend.text = element_text(color = "red"),
legend.margin = margin(t = 5),
legend.position = "top"
)

Facet elements

Description             Theme element     Type
Facet label background  strip.background  element_rect()
Facet label text        strip.text        element_text()
Spacing between panels  panel.spacing     unit()
df %>%
ggplot(aes(x = displ,y = hwy, color = factor(cyl)))+
geom_point() +
facet_grid(vars(manufacturer), vars(class)) +
ggtitle("这是我的标题") +
labs(x = "x_displ", y = "y_hwy") +
theme(
strip.background = element_rect(fill = "orange"),
strip.text = element_text(color = "red"),
panel.spacing = unit(0.1, "cm")
)

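Scales

A mapping converts data into aesthetic attributes, that is, things that can be perceived visually, such as size, shape, color, and position. A scale is the function that controls this mapping: each scale is a function from a region of data space (the domain of the scale) to a region of aesthetic space (the range of the scale).

Every aesthetic has a scale behind it.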

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
labs(x = "x轴")

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
scale_x_continuous(name = "x轴")

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
scale_color_brewer()

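Naming rules for scale functions:

The name of a scale function has three parts separated by "_":

  • scale

  • the aesthetic name (e.g., color, shape or x)

  • the scale name (e.g., continuous, discrete, brewer)

Common arguments of scale functions:

  • name: the title of the axis or legend. Use name = NULL to drop the title.

  • limits: the range of the axis or legend. For continuous scales use c(n, m); for discrete scales use c("a", "b", "c").

  • breaks: the values (elements) shown on the axis or legend.

  • labels: the labels shown at the breaks on the axis or legend.

    • By default they are computed automatically.
    • A character vector can be supplied, matching the breaks one to one.
    • A function can be supplied, which takes the breaks as input and returns the labels.
    • NULL removes the labels.
  • values: the aesthetic values (colors, shapes, and so on).

    • Either in the same order as the data values,
    • or with the same length as the vector supplied to breaks,
    • or as a named vector c("data value" = "aesthetic value").
  • expand: controls the padding added around the data on each axis.

  • range: the output range of sizes, for example the relative sizes of points.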
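Several of these arguments can be seen together in a short sketch on the mpg data. The limits, breaks, label function, and colors below are arbitrary choices made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  scale_x_continuous(
    name = "displacement",                    # axis title
    limits = c(1, 8),                         # range of the axis
    breaks = c(2, 4, 6),                      # where ticks are drawn
    labels = function(b) paste0(b, " l")      # labels computed from the breaks
  ) +
  scale_color_manual(
    name = NULL,                              # drop the legend title
    values = c("4" = "red", "f" = "blue", "r" = "orange")  # named vector
  )

x_limits <- layer_scales(p)$x$get_limits()
used_colours <- unique(layer_data(p)$colour)
```

Using a named vector for values ties each color to a data value explicitly, so the result does not depend on the order of the levels.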

gapdata <- read_csv("data/gapminder.csv")
newgapdata <- gapdata %>% 
group_by(continent, country) %>%
summarise(
across(c(lifeExp, gdpPercap, pop), mean)
)
newgapdata
#> # A tibble: 142 × 5
#> # Groups:   continent [5]
#>    continent country                  lifeExp gdpPercap       pop
#>    <chr>     <chr>                      <dbl>     <dbl>     <dbl>
#>  1 Africa    Algeria                     59.0     4426. 19875406.
#>  2 Africa    Angola                      37.9     3607.  7309390.
#>  3 Africa    Benin                       48.8     1155.  4017497.
#>  4 Africa    Botswana                    54.6     5032.   971186.
#>  5 Africa    Burkina Faso                44.7      844.  7548677.
#>  6 Africa    Burundi                     44.8      472.  4651608.
#>  7 Africa    Cameroon                    48.1     1775.  9816648.
#>  8 Africa    Central African Republic    43.9      959.  2560963 
#>  9 Africa    Chad                        46.8     1165.  5329256.
#> 10 Africa    Comoros                     52.4     1314.   361684.
#> # ℹ 132 more rows

Axes

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_continuous()

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_log10()

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_log10(
name = "GDP per capita",
breaks = c(500, 1000, 3000, 10000, 30000),
labels = scales::unit_format(unit = "dollar"))

Colors

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_log10() +
scale_color_viridis_d()

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_log10() +
scale_color_brewer(type = "qual", palette = "Set1")

newgapdata %>% 
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent, size = pop)) +
scale_x_log10() +
scale_color_manual(
name = "continents",
values = c("Africa" = "red", "Americas" = "blue", "Asia" = "orange",
"Europe" = "black", "Oceania" = "gray"),
breaks = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
labels = c("africa", "americas", "asia", "europe", "oceania")
) +
scale_size(
name = "population size",
breaks = c(2e8, 5e8, 7e8),
labels = c("200 million", "500 million", "700 million")
)

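Scale or theme?

So when should you use a scale and when a theme? One rule of thumb: a theme never adds labels and never changes the range of a variable; it only changes fonts, sizes, colors, and similar details.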

Tutorial (1)

Load packages

Load data

haven::read_dta("data/auto.dta") -> auto
auto
#> # A tibble: 74 × 12
#>    make        price   mpg rep78 headroom trunk weight length  turn displacement
#>    <chr>       <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl> <dbl>        <dbl>
#>  1 AMC Concord  4099    22     3      2.5    11   2930    186    40          121
#>  2 AMC Pacer    4749    17     3      3      11   3350    173    40          258
#>  3 AMC Spirit   3799    22    NA      3      12   2640    168    35          121
#>  4 Buick Cent…  4816    20     3      4.5    16   3250    196    40          196
#>  5 Buick Elec…  7827    15     4      4      20   4080    222    43          350
#>  6 Buick LeSa…  5788    18     3      4      21   3670    218    43          231
#>  7 Buick Opel   4453    26    NA      3      10   2230    170    34          304
#>  8 Buick Regal  5189    20     3      2      16   3280    200    42          196
#>  9 Buick Rivi… 10372    16     3      3.5    17   3880    207    43          231
#> 10 Buick Skyl…  4082    19     3      3.5    13   3400    200    42          231
#> # ℹ 64 more rows
#> # ℹ 2 more variables: gear_ratio <dbl>, foreign <dbl+lbl>
attributes(auto)
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
#> 
#> $label
#> [1] "1978 Automobile Data"
#> 
#> $notes
#> [1] "1"                                    
#> [2] "from Consumer Reports with permission"
#> 
#> $names
#>  [1] "make"         "price"        "mpg"          "rep78"        "headroom"    
#>  [6] "trunk"        "weight"       "length"       "turn"         "displacement"
#> [11] "gear_ratio"   "foreign"

Create a ggplot2 plot

ggplot() + 
geom_point(mapping = aes(x = weight, y = price), data = auto)

Aesthetic mappings

ggplot(data = auto) + 
geom_point(mapping = aes(x = weight, y = price, size = rep78, color = rep78, alpha = rep78)) # map point size, color, and transparency to rep78

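rep78 is a discrete variable, but because it is stored as a number the color mapping in the plot above treats it as continuous by default. Converting rep78 to a factor makes the color legend discrete.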

ggplot(data = auto) + 
geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78))

To merge the legends into one, two conditions must be met:

  • the categories must be the same

  • the legend titles must be the same

ggplot(data = auto %>% 
subset(!is.na(rep78))) +
geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78)) +
labs(color = "rep78") # labs() changes axis, legend, and plot labels

ggplot(data = auto %>% 
subset(!is.na(rep78))) +
geom_point(mapping = aes(x = weight, y = price, size = rep78, color = factor(rep78), alpha = rep78)) +
scale_color_discrete(name = "rep78") # set the legend title through the scale

Inside aes(), size is a mapping; outside aes(), it is a setting.

ggplot(data = auto %>% 
subset(!is.na(rep78))) +
geom_point(mapping = aes(x = weight, y = price, color = factor(rep78)), size = 4)

The stroke aesthetic controls the width of the outline of each point.

ggplot(data = subset(auto, !is.na(rep78))) + 
geom_point(mapping = aes(x = weight, y = price, stroke = rep78), shape = 1)

haven::read_dta("data/world-covid19.dta") -> df

ggplot(df) +
geom_line(aes(date, confirmed, group = country))

ggplot(df) +
geom_line(aes(date, confirmed, color = country)) +
theme(
legend.position = "none"
)

# Setting legend.position = "none" in theme(), as in the plot above, removes the legend.
1:5 %>% 
crossing(1:5) %>%
set_names(c("x", "y")) %>%
mutate(z = as.numeric(row.names(.))) %>%
ggplot(aes(x, y, shape = I(z))) +
geom_point(size = 5, color = "red", fill = "green") +
geom_label(aes(label = z, y = y + 0.5), size = 1.5) +
theme(axis.text = element_blank(),
axis.title = element_blank())

# I(z) in the code above uses the values of z directly as point shapes, instead of passing them through a shape scale.
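A smaller version of the same idea, built with base R only. The data frame shapes is made up for this sketch; the check relies on layer_data() exposing the shape values that are finally drawn.

```r
library(ggplot2)

# A 5 x 5 grid of points, numbered 1 to 25.
shapes <- data.frame(
  x = rep(1:5, times = 5),
  y = rep(1:5, each = 5),
  z = 1:25
)

# I(z) marks z as "use as is", so each value is taken as a shape code directly.
p <- ggplot(shapes, aes(x, y, shape = I(z))) +
  geom_point(size = 5, color = "red", fill = "green")

used_shapes <- as.numeric(layer_data(p)$shape)
```

Without I(), a numeric z would be treated as a continuous variable, which ggplot2 cannot map to the shape aesthetic.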
auto %>% 
count(rep78) %>%
ggplot(aes(x = rep78, y = n)) +
ggchicklet::geom_chicklet()

# ggchicklet::geom_chicklet() above draws a bar chart with rounded corners.
df %>% 
filter(country %in% c("中国", "美国")) %>%
ggplot(aes(x= date, y = confirmed, color = country)) +
geom_line()

Position adjustments

A standard bar chart

ggplot(auto) + 
geom_bar(aes(x = rep78, fill = factor(rep78)))

geom_col() is geom_bar() without the statistical transformation: it plots the y values exactly as given.

ggplot(auto) + 
geom_bar(
aes(
x = rep78,
color = factor(foreign),
fill = factor(foreign)),
position = position_identity()
) +
theme(legend.position = c(0.2, 0.7)) +
labs(fill = "foreign",
color = "foreign")

ggplot(auto) + 
geom_bar(
aes(
x = rep78,
color = factor(foreign),
fill = factor(foreign)),
position = position_dodge(width = 0.3)
) +
theme(legend.position = c(0.2, 0.7)) +
labs(fill = "foreign",
color = "foreign")

ggplot(auto) + 
geom_bar(
aes(
x = rep78,
color = factor(foreign),
fill = factor(foreign)),
position = position_fill()
) +
theme(
legend.position = c(0.2, 0.7),
legend.background = element_rect(fill = "white", color = "white")
) +
labs(fill = "foreign",
color = "foreign")
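The effect of each position adjustment shows up in the height of the tallest bar. The sketch below uses mpg (class on the x axis, drv as fill) as a stand-in for the auto data and reads the bar tops from layer_data().

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # groups stacked on top of each other
p_dodge <- base + geom_bar(position = "dodge")  # groups placed side by side
p_fill <- base + geom_bar(position = "fill")    # stacked and rescaled to proportions

top_stack <- max(layer_data(p_stack)$ymax)
top_dodge <- max(layer_data(p_dodge)$ymax)
top_fill <- max(layer_data(p_fill)$ymax)
```

Stacking reaches the largest class total, dodging reaches the largest single class-by-drv count, and filling always tops out at 1.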

Coordinate systems

coord_flip()

coord_fixed()

coord_sf()
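A brief sketch of the first two coordinate functions on the mpg data. The variables and the ratio of 1 are arbitrary choices for illustration.

```r
library(ggplot2)

# coord_flip() swaps the x and y axes, turning vertical bars into horizontal ones.
p_flip <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()

# coord_fixed() fixes the aspect ratio: with ratio = 1, one unit on the x axis
# is drawn as long as one unit on the y axis.
p_fixed <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  coord_fixed(ratio = 1)

grob_flip <- ggplotGrob(p_flip)
grob_fixed <- ggplotGrob(p_fixed)
```

Coordinate functions change how positions are drawn, not the data of the layers, so they can be swapped without touching the geoms.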

library(sf)

read_sf("data/world_high_resolution_mill.geo.json") -> wdmp
haven::read_dta("data/world-covid19.dta") -> df

wdmp %>%
left_join(df, by = c("code" = "iso")) %>%
subset(date == "2020-06-10") -> mydata
mydata
#> Simple feature collection with 180 features and 9 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -20015110 ymin: -6947577 xmax: 20015110 ymax: 12641180
#> Projected CRS: World_Miller_Cylindrical
#> # A tibble: 180 × 10
#>    name    name_en code                   geometry country country_en date      
#>    <chr>   <chr>   <chr>        <MULTIPOLYGON [m]> <chr>   <chr>      <date>    
#>  1 阿尔及… ALGERIA DZA   (((-245324.1 4068138, -2… 阿尔及… Algeria    2020-06-10
#>  2 列支敦… Liecht… LIE   (((1067880 5656162, 1059… 列支敦… Liechtens… 2020-06-10
#>  3 埃及    EGYPT   EGY   (((3961141 2619833, 3915… 埃及    Egypt      2020-06-10
#>  4 孟加拉… BANGLA… BGD   (((10297641 2482880, 102… 孟加拉… Bangladesh 2020-06-10
#>  5 尼日尔  Niger   NER   (((400866.4 1306467, 394… 尼日尔  Niger      2020-06-10
#>  6 卡塔尔  QATAR   QAT   (((5699872 2791895, 5682… 卡塔尔  Qatar      2020-06-10
#>  7 纳米比… NAMIBIA NAM   (((2223927 -2808902, 222… 纳米比… Namibia    2020-06-10
#>  8 保加利… Bulgar… BGR   (((2486312 5005203, 2497… 保加利… Bulgaria   2020-06-10
#>  9 玻利维… Bolivia BOL   (((-6966045 -2514366, -6… 玻利维… Bolivia    2020-06-10
#> 10 加纳    Ghana   GHA   (((-299106.5 1058207, -3… 加纳    Ghana      2020-06-10
#> # ℹ 170 more rows
#> # ℹ 3 more variables: confirmed <dbl>, recovered <dbl>, deaths <dbl>
mydata %>% 
ggplot() +
geom_sf(aes(fill = confirmed), color = "white", size = 0.01) +
coord_sf() +
guides(fill = guide_legend()) +
theme(
panel.grid.major = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
) +
labs(
fill = "确诊人数",
title = "全球新冠疫情:2020-06-10",
caption = "数据来源"
)
