14  Introduction to Plotting

14.1 Introduction

We will now learn some techniques to visualize your data. We will learn how to create histograms, bar charts, line plots, scatter plots, among others, and how to customize them.

Base R (R without any packages) has some basic plotting functions. These are easy to use but they are not easily customizable and don’t look very elegant. For that reason we will also learn how to use the popular plotting package ggplot2. But in this chapter we stick to base R, leaving ggplot for later chapters.

14.2 Example Setting: Penguins

To get started on some basic plotting techniques we will use the famous “Palmer Penguins” dataset. This dataset contains several measurements of different penguins collected by researchers on Antwerp Island in the Palmer Archipelago of Antarctica. Interestingly, there is a smaller island next to this called Brabant Island.

The dataset contains data from three species of penguins: the Adelie, Chinstrap and Gentoo. A picture of each species is shown in the pictures below:

This dataset is convenient to use because we can load it into R straight from a package. First install the package with the dataset with:

install.packages("palmerpenguins")

Then load it with:

library(palmerpenguins)
data(penguins)

Running the command data(penguins) loads up two datasets: penguins and penguins_raw. We will ignore the penguins_raw dataset and only work with the penguins one.

14.3 Data Inspection

Before getting started with plotting, it’s good to first get a basic understanding of our data. Let’s get some summary statistics with summary() and find out how many observations with have with nrow():

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
nrow(penguins)
[1] 344

We see that we have data on 344 penguins with the following variables:

  • species: A factor variable indicating which of the 3 species the penguin is.
  • island: A factor variable indicating which island the penguin was on.
  • bill_length_mm: A numerical variable indicating how long the penguin’s bill (their beak) was (in mm).
  • bill_depth_mm: A numerical variable indicating how deep the penguin’s bill was (in mm). The depth is the distance between the top and bottom of their beak.
  • flipper_length_mm: A numerical variable indicating how long their flipper (wing) is (in mm).
  • body_mass_g: A numerical variable indicating how heavy the penguin is (in grams).
  • sex: A factor variable indicating the gender of the penguins (male or female).
  • year: A numerical variable indicating what year the data point is from.

We also see that we have 2 missing values for 4 of the variables and 11 missing values for sex. For our purposes here it is fine to just leave these missing values in the dataset. We don’t need to delete those rows.

14.4 Basic Plotting with Base R

We will now learn how to do some very simple plots with base R: the histogram, the bar plot and the scatter plot. The plots from base R are not very beautiful, but the idea is to learn how to make “quick and dirty” plots for you to quickly get a sense of your data, before making nicer customizable plots with ggplot.

14.4.1 Histograms

To describe the distribution of a single numeric variable, we can use a histogram. A histogram splits the data into “bins” and shows the number of observations in each bin. We can create a histogram by using the hist() function, putting the variable we want to plot as the argument inside:

hist(penguins$body_mass_g)

14.4.2 Bar Plot

For categorical variables, we can use a bar plot to visualize the relative frequencies of different categories. We already saw the table() function which counts the number of times each category appears:

table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 

If we want to plot these values, we can put this entire expression into the barplot() function:

barplot(table(penguins$species))

14.4.3 Scatter Plots

To quickly visualize the relationship between two variables we can make a scatter plot. We can do this by listing the two variables we want to plot as arguments in the plot() function:

plot(penguins$bill_length_mm, penguins$flipper_length_mm)

In each case, the base R commands to make plots are very short and easy to use. Therefore I use them very frequently in the console to learn what a dataset looks like. But because they do not look very nice I do not tend to use them in research papers. I prefer to use the plots from ggplot, which we will learn about next.