13 Data Cleaning

Often when we get a dataset we need to analyze, it’s not always exactly structured and formatted in a way that we can immediately work with it. Here are some examples of this:

The data don’t start at the top of the file because the first few rows contain some other information.
The dates are not formatted correctly.
Numbers are interpreted as characters.
The data contain extra columns that we don’t want.
There are rows with missing data that we want to omit.
The variable names are not what we want them to be.

In these cases we need to clean the data before we can work with it. By cleaning we don’t mean modifying the underlying data. It just means bringing the dataset into a format such that we can more easily work with it in R.

Some datasets are “dirtier” than others, sometimes so “dirty” that it can take weeks or even months to clean. Fortunately we will stick to “lightly unkempt” data for this course that can be cleaned with only a few lines of code.

In this chapter we will learn some basic data cleaning techniques to deal with the 6 example issues listed above. We will do this using stock price data for the company ASML from the Amsterdam Stock Exchange.

Download the following file: asml-trades.csv and put it in your project folder. The variable names and meanings are:

Date: The date the data from that row are from.
Open: The opening price of the stock on that day.
High: The highest price the stock traded at on that day.
Low: The lowest price the stock traded at on that day.
Last: The price of the last-traded stock at on that day.
Close: The closing price of the stock on that day.
Number.of.Shares: The number of shares traded that day.
Number.of.Trades: The number of trades made that day.

13.1 Skipping Rows

Before reading the dataset into R, let’s take a quick look at the text in the file itself. In the Files tab in RStudio, click on the asml-trades.csv file and click “View File”. The text of the file opens in RStudio. The first few lines should look like:

Historical Data,,,,,,,,
From 2021-08-31 to 2023-08-30,,,,,,,,
NL0010273215,,,,,,,,
Date,Open,High,Low,Last,Close,Number of Shares,Number of Trades,Print table
31/8/2021,713.8,717.7,699.9,704.5,704.5,978555,36599,
1/9/2021,712.4,715.5,706.4,714.2,714.2,674506,38527,
2/9/2021,713.9,723.4,708.4,722.2,722.2,533840,29229,

We immediately notice that the first 3 rows contain information about the data that we don’t want to include in our dataframe. The variable names are on line 4 instead of line 1.

One option would be to delete those rows and save the file. However, it is bad practice to directly edit a data file. One should always work with the raw CSV file as it was downloaded. This makes it easier to reproduce your work and show your steps through your R script. Fortunately the read.csv() function has an option to skip rows. We can use this with:

df <- read.csv("asml-trades.csv", skip = 3)

We could also have read in the data directly from the URL with:

df <- read.csv("https://walshc.github.io/ebi-prog/asml-trades.csv", skip = 3)

This way we wouldn’t have to set up an RStudio project or change the working directory.

Let’s take a first look at the data with summary():

summary(df)

     Date                Open            High            Low       
 Length:521         Min.   :394.7   Min.   :408.2   Min.   :375.8  
 Class :character   1st Qu.:535.5   1st Qu.:545.5   1st Qu.:525.8  
 Mode  :character   Median :592.1   Median :597.4   Median :582.5  
                    Mean   :589.2   Mean   :597.5   Mean   :579.6  
                    3rd Qu.:645.5   3rd Qu.:652.5   3rd Qu.:636.1  
                    Max.   :770.5   Max.   :777.5   Max.   :764.2  
                    NA's   :6       NA's   :6       NA's   :6      
      Last           Close       Number.of.Shares   Number.of.Trades  
 Min.   :397.4   Min.   :397.4   Length:521         Length:521        
 1st Qu.:535.9   1st Qu.:535.9   Class :character   Class :character  
 Median :589.4   Median :589.4   Mode  :character   Mode  :character  
 Mean   :588.4   Mean   :588.4                                        
 3rd Qu.:644.0   3rd Qu.:644.0                                        
 Max.   :770.5   Max.   :770.5                                        
 NA's   :6       NA's   :6                                            
 Print.table   
 Mode:logical  
 NA's:521

From this we can see a few problems:

From the summary of the Date variable, we can see that R read it in as a character. It did not recognize that it was a date.
The variables Open, High, Low, Last and Close contain 6 NAs. NA stands for “Not Available” and is what R uses to represent missing values.
For the variables Number.of.Shares and Number.of.Trades, we can see that they were read in as characters instead of numbers.
Print.table has 521 NAs (all values are NA), thus this variable is useless and should be deleted.
We also want to change some of the variable names and also change the names to lower case and replace the dots (.) with underscores (_).

We will work through these problems for the rest of this chapter.

13.2 Formatting Dates

13.2.1 Converting Dates in the ASML Example

We will start by converting the date variable from a character to a date variable. This is useful for doing operations with the date (such as subsetting the data on observations before/after a particular date) and for plotting.

We can convert the character to a date using the as.Date() function. The first argument of the function is the vector of dates that need to be converted, and the format argument specifies the format the date is written in. Let’s see what format the dates are written in using the head() command to see the first few rows:

head(df$Date)

[1] "31/8/2021" "1/9/2021"  "2/9/2021"  "3/9/2021"  "6/9/2021"  "7/9/2021"

When we look at the data, we can see that the format is dd/mm/yyyy. To specify this we need to write format = "%d/%m/%Y".

df$Date <- as.Date(df$Date, format = "%d/%m/%Y")

We can check what this did with the head() command again:

head(df$Date)

[1] "2021-08-31" "2021-09-01" "2021-09-02" "2021-09-03" "2021-09-06"
[6] "2021-09-07"

We can see that when R has dates correctly formatted they show up in the format yyyy-mm-dd. This is the default format.

Now that the date is formatted correctly, the summary() command shows the first and last dates in the data (31 August 2021 and 29 August 2023):

summary(df$Date)

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2021-08-31" "2022-03-01" "2022-08-30" "2022-08-29" "2023-02-28" "2023-08-29"

13.2.2 Converting Dates from Other Formats

Dates in the Netherlands are typically written like dd-mm-yyyy. If it was the case we would instead do "%d-%m-%Y". You can try this out with:

as.Date("1-9-2023", format = "%d-%m-%Y")

[1] "2023-09-01"

If the dates were in mm/dd/yyyy format, as is typical in USA, we would do "%m/%d/%Y".

as.Date("12/31/2022", format = "%m/%d/%Y")

[1] "2022-12-31"

Sometimes we omit the century from years. We might write the 1st of September 2023 as 01/09/23. In R we need to use %y instead of %Y for these abbreviated years:¹

as.Date("01/09/22", format = "%d/%m/%y")

[1] "2022-09-01"

13.2.3 Converting Dates with Month Names (Optional)

Sometimes the dates have the month name in words. To convert this we need to use the %b option for abbreviated month names and %B for full month names:

as.Date("Sep 1 2023", format = "%b %d %Y")

[1] "2023-09-01"

as.Date("January 1 2023", format = "%B %d %Y")

[1] "2023-01-01"

Note that if your computer is not in English it might not work as R expects to read months in the language of your computer. To get around this, you can first set the language for dates to English using the Sys.setlocale() function before formatting the dates. Because of the complications with month names in different languages, I will not ask questions on the assignments or exam involving this. I am just providing this information in case it may be useful for you later.

13.3 Converting Characters to Numbers

For the variables Number.of.Shares and Number.of.Trades, we saw that they were read in as characters instead of numbers. Usually this happens when there are some letters or other non-numeric symbols (like the % symbol) somewhere in the data. This is because all elements of vectors in R (the individual columns of a data.frame) must have the same data type. If there are any character elements in a vector, the remaining elements are coerced into characters.

If we look through the data we can see that some rows have "None" written instead of NA.²

To do this conversion we first replace the "None" values with NA. We do this by assigning NA to the subset of values where df$Number.of.Shares == "None":³

df$Number.of.Shares[df$Number.of.Shares == "None"] <- NA

Once we have done this we use the as.numeric() function to convert the values from character to numeric:

df$Number.of.Shares <- as.numeric(df$Number.of.Shares)

Now when we summarize we see that it’s treated as a number:

summary(df$Number.of.Shares)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  84141  571178  711067  783291  916000 2932273       6

We can do the same with the Number.of.Trades. We are also able to do this skipping the step of converting the "None" to NA:

df$Number.of.Trades <- as.numeric(df$Number.of.Trades)

Warning: NAs introduced by coercion

When we do this, however, we see that R warned us that it converted some observations to NA. A warning is different to an error in that R still completes the operation (in an error it will just stop). But it is warning us because we may not have expected some values to be forced to NA. In general it is better to code in a way that doesn’t generate warnings, so I recommend setting the non-numeric values to NA first.

13.4 Deleting columns

The last column of the data, Print.table, contains no data. It has NA for all rows.

To delete a variable we write over the variable with NULL. This essentially replaces it with nothing:

df$Print.table <- NULL

An alternative approach would be to use the column index of the variable we want to drop. Print.table is the 9th column in the data, and to drop the 9th column we could use:

df <- df[, -9]

If we view the data in RStudio, we can see that it is now deleted.

13.5 Dropping rows with missing data

Let’s take a look at how our data look now:

summary(df)

      Date                 Open            High            Low       
 Min.   :2021-08-31   Min.   :394.7   Min.   :408.2   Min.   :375.8  
 1st Qu.:2022-03-01   1st Qu.:535.5   1st Qu.:545.5   1st Qu.:525.8  
 Median :2022-08-30   Median :592.1   Median :597.4   Median :582.5  
 Mean   :2022-08-29   Mean   :589.2   Mean   :597.5   Mean   :579.6  
 3rd Qu.:2023-02-28   3rd Qu.:645.5   3rd Qu.:652.5   3rd Qu.:636.1  
 Max.   :2023-08-29   Max.   :770.5   Max.   :777.5   Max.   :764.2  
                      NA's   :6       NA's   :6       NA's   :6      
      Last           Close       Number.of.Shares  Number.of.Trades
 Min.   :397.4   Min.   :397.4   Min.   :  84141   Min.   :  4398  
 1st Qu.:535.9   1st Qu.:535.9   1st Qu.: 571178   1st Qu.: 26290  
 Median :589.4   Median :589.4   Median : 711067   Median : 33375  
 Mean   :588.4   Mean   :588.4   Mean   : 783291   Mean   : 35830  
 3rd Qu.:644.0   3rd Qu.:644.0   3rd Qu.: 916000   3rd Qu.: 42707  
 Max.   :770.5   Max.   :770.5   Max.   :2932273   Max.   :108957  
 NA's   :6       NA's   :6       NA's   :6         NA's   :6

We can see that there are 6 NAs for all variables except the Date variable. If we scroll through the data we notice that 6 rows with NAs are the same for all variables. Let’s have a look at what dates these are. We can do this using the is.na() function. This function, when applied to a vector, returns TRUE if the element is NA and FALSE if not. To see the dates when the variables are missing, we can do:

df$Date[is.na(df$Open)]

[1] "2022-04-15" "2022-04-18" "2022-12-26" "2023-04-07" "2023-04-10"
[6] "2023-05-01"

We can see that the missings were:

2022-04-15: Good Friday
2022-04-18: Easter Monday (Tweede paasdag)
2022-12-26: Day after Christmas (Tweede kerstdag)
2023-04-07: Good Friday
2023-04-10: Easter Monday (Tweede paasdag)
2023-05-01: Labor Day (Dag van de Arbeid)

These are weekdays where the Amsterdam stock market is closed. We can drop rows with any missing values using the na.omit() function. I will print the number of rows in df before and after the operation to show what is happening:

nrow(df)

[1] 521

df <- na.omit(df)
nrow(df)

[1] 515

We can see that we fell from 521 observations to 515 after deleting the 6 holidays from the data.

13.6 Renaming Variables

Although the variable names are quite okay, suppose we wanted to change some of them.

Suppose we wanted to change the name of "Number.of.Shares" to "num_shares" to make it shorter to type, and to replace the dot with an underscore. Because we know it occupies the 7th column, we can change the name with:

names(df)[7] <- "num_shares"

Suppose we also wanted to change the name of "Number.of.Trades" to "num_trades". Counting columns like we did above increases our chances of making a mistake (besides, we need to do a lot of counting). We can instead change the name using the old name as follows:

names(df)[names(df) == "Number.of.Trades"] <- "num_trades"

How this works is names(df) == "Number.of.Trades" is TRUE only in the 8th column when the name is actually "Number.of.Trades", and so it changes the name of only that column.

We can also change the names of multiple columns at the same time. Suppose we wanted to change Open, High, Low and Last to lower case. We can do:

names(df)[2:5] <- c("open", "high", "low", "last")

If we wanted to quickly change all variable names to lower case, we can use the tolower() function. The tolower() function converts upper case characters to lower case:

test <- c("hello!", "HELLO!", "Hello!", "HeLlO!")
tolower(test)

[1] "hello!" "hello!" "hello!" "hello!"

Let’s use it to change the names of the data set:

names(df) <- tolower(names(df))
names(df)

[1] "date"       "open"       "high"       "low"        "last"      
[6] "close"      "num_shares" "num_trades"

If the date is 01/01/69, the format %d/%m/%y will interpret it as January 1 1969. But if the date is 01/01/68, it will interpret it as January 1 2068. All short-format years after 69 are put in the 1900s and all short-format years before 69 are put in the 2000s. You don’t need to remember these details for the exam though because we won’t ever use dates outside of 1969-2068.↩︎
As a shortcut, we could have found all the non-numeric characters in a variable using the following command: unique(grep("[^0-9]", df$Number.of.Trades, value = TRUE)). You don’t need to remember this command for the exam.↩︎
We could also have replaced all elements with non-numeric characters without knowing what they are with the following: df$Number.of.Shares[grepl("[^0-9]", df$Number.of.Trades)] <- NA. You don’t need to remember this command for the exam.↩︎