With great power comes not only great responsibility, but often great complexity — and that sure can be the case with R. The open-source R Project for Statistical Computing, a programming language and environment, offers immense capabilities to investigate, manipulate and analyze data. But because of its sometimes complicated syntax, beginners may find it challenging to improve their skills after learning some basics.
If you're not even at the stage where you feel comfortable doing rudimentary tasks in R, we recommend you head right over to Computerworld's Beginner's Guide to R. But if you've got some basics down and want to take another step in your R skills development — or just want to see how to do one of these four tasks in R — please read on.
I've created a sample data set with three years of revenue and profit data from Apple, Google and Microsoft, looking at how the companies performed shortly after the 2008-09 "Great Recession." (The source of the data was the companies themselves; "fy" means fiscal year.) If you'd like to follow along, you can type (or copy and paste) this into your R terminal window:
fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012)
company <- c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft")
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723)
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978)
companiesData <- data.frame(fy, company, revenue, profit)
The code above will create a data frame like the one below, stored in a variable named "companiesData":
fy | company | revenue | profit | |
---|---|---|---|---|
1 | 2010 | Apple | 65225 | 14013 |
2 | 2011 | Apple | 108249 | 25922 |
3 | 2012 | Apple | 156508 | 41733 |
4 | 2010 | 29321 | 8505 | |
5 | 2011 | 37905 | 9737 | |
6 | 2012 | 50175 | 10737 | |
7 | 2010 | Microsoft | 62484 | 18760 |
8 | 2011 | Microsoft | 69943 | 23150 |
9 | 2012 | Microsoft | 73723 | 16978 |
(R adds its own row numbers if you don't include row names.)
If you run the str() function on the data frame to see its structure, you'll see that the year is being treated as a number and not as a year or factor:
str(companiesData)
'data.frame': 9 obs. of 4 variables:
$ fy : num 2010 2011 2012 2010 2011 ...
$ company: Factor w/ 3 levels "Apple","Google",..: 1 1 1 2 2 2 3 3 3
$ revenue: num 65225 108249 156508 29321 37905 ...
$ profit : num 14013 25922 41733 8505 9737 ...
I may want to group my data by year, but don't think I'm going to be doing specific time-based analysis, so I'll turn the fy column of numbers into a column that contains R categories (called factors) instead of dates with the following command:
companiesData$fy <- factor(companiesData$fy, ordered = TRUE)
Throughout the course of this tutorial, I'll also show how to accomplish these tasks using packages in the so-called "tidyverse" — an ecosystem initially championed by RStudio Chief Scientist Hadley Wickham and now backed by a number of open-source authors both within and outside of RStudio.
For creating ordered factors, the tidyverse forcats package has several options, including companiesData$fy <- forcats::as_factor(as.character(companiesData$fy))
.
Now we're ready to get to work.
Adding a column to an existing data frame
One of the easiest tasks to perform in R is adding a new column to a data frame based on one or more other columns. You might want to add up several of your existing columns, find an average or otherwise calculate some "result" from existing data in each row.
There are many ways to do this in R. Some will seem overly complicated for this easy task at hand, but for now you'll have to take my word for it that some more complex options can sometimes come in handy for advanced users with more robust needs. However, if you're looking for an easy, elegant way to do this now, skip to Syntax 5 and the dplyr package.
Syntax 1: By equation
Simply create a variable name for the new column and pass in a calculation formula as its value if, for example, you want a new column that's the sum of two existing columns:
dataFrame$newColumn <- dataFrame$oldColumn1 + dataFrame$oldColumn2
As you can probably guess, this creates a new column called "newColumn" with the sum of oldColumn1 + oldColumn2 in each row.
For our sample data frame called data, we could add a column for profit margin by dividing profit by revenue and then multiplying by 100:
companiesData$margin <- (companiesData$profit / companiesData$revenue) * 100
That gives us:
fy | company | revenue | profit | margin | |
---|---|---|---|---|---|
1 | 2010 | Apple | 65225 | 14013 | 21.48409 |
2 | 2011 | Apple | 108248 | 25922 | 23.94664 |
3 | 2012 | Apple | 156508 | 41733 | 26.66509 |
4 | 2010 | 29321 | 8505 | 29.00651 | |
5 | 2011 | 37905 | 9737 | 25.68790 | |
6 | 2012 | 50175 | 10737 | 21.39910 | |
7 | 2010 | Microsoft | 62484 | 18760 | 30.02369 |
8 | 2011 | Microsoft | 69943 | 23150 | 33.09838 |
9 | 2012 | Microsoft | 73723 | 16978 | 23.02945 |
Whoa — that's a lot of decimal places in the new margin column.
We can round that off to just one decimal place with the round() function; round() takes the format:
round(number(s) to be rounded, how many decimal places you want)
So, to round the margin column to one decimal place:
companiesData$margin <- round(companiesData$margin, 1)
And you'll get this result:
fy | company | revenue | profit | margin | |
---|---|---|---|---|---|
1 | 2010 | Apple | 65225 | 14013 | 21.5 |
2 | 2011 | Apple | 108248 | 25922 | 23.9 |
3 | 2012 | Apple | 156508 | 41733 | 26.7 |
4 | 2010 | 29321 | 8505 | 29.0 | |
5 | 2011 | 37905 | 9737 | 25.7 | |
6 | 2012 | 50175 | 10737 | 21.4 | |
7 | 2010 | Microsoft | 62484 | 18760 | 30.0 |
8 | 2011 | Microsoft | 69943 | 23150 | 33.1 |
9 | 2012 | Microsoft | 73723 | 16978 | 23.0 |