# R vs. Python, Which is better for data analysis?

There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head Python vs R matchup started!

## Importing a CSV

R

``````
library(readr)
nba <- read_csv("nba_2013.csv")
``````

Python

``````
import pandas
nba = pandas.read_csv("nba_2013.csv")
``````

The above code will load the CSV file `nba_2013.csv`, which contains data on NBA players from the 2013-2014 season, into the variable `nba` in both languages. The only real difference is that in Python, we need to import the pandas library to get access to dataframes. In R, while we could import the data using the base R function `read.csv()`, the `readr` library function `read_csv()` has the advantage of greater speed and consistent interpretation of data types. Dataframes are available in both R and Python, and are two-dimensional tables where each column can be of a different datatype. At the end of this step, the CSV file has been loaded by both languages into a dataframe.
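To see what the dataframe loaders take care of, here is a minimal sketch that reads CSV text with only Python's standard library. The inline text is a made-up two-row sample rather than the real `nba_2013.csv` (the second player is hypothetical), and the names `sample_csv`, `rows`, and `ages` are chosen for illustration. Every value arrives as a string, so numeric columns have to be converted by hand, which is the work `read_csv` does for us.

```python
import csv
import io

# Made-up sample in the same layout as the first columns of the dataset.
sample_csv = (
    "player,pos,age,bref_team_id\n"
    "Quincy Acy,SF,23,TOT\n"
    "Example Player,C,30,XYZ\n"  # hypothetical row
)

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

raw_age = rows[0]["age"]                   # still the string "23"
ages = [int(row["age"]) for row in rows]   # manual type conversion
shape = (len(rows), len(rows[0]))          # (row count, column count)

print(repr(raw_age), ages, shape)
```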

## Finding the number of rows

R

``````
dim(nba)
``````
```
481  31
```

Python

``````
nba.shape
``````
```
(481, 31)
```

This prints out the number of players and the number of columns in each. We have `481` rows, or players, and `31` columns containing data on the players.

## Looking at the first row of the data

R

``````
head(nba, 1)
``````
```
player pos age bref_team_id
1 Quincy Acy  SF  23          TOT
[output truncated]
```

Python

``````
nba.head(1)
``````
```
player pos  age bref_team_id
0  Quincy Acy  SF   23          TOT
[output truncated]
```

This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here, and `head` is a method on the dataframe object, and R has a separate `head` function. This is a common theme you’ll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.

## Find the average of each statistic

Let’s find the average value for each statistic. The columns, as you can see, have names like `fg` (field goals made), and `ast` (assists). These are the season statistics for the player. If you want a fuller explanation of all the stats, look here.

R

``````
library(purrr)
library(dplyr)

nba %>%
  select_if(is.numeric) %>%
  map_dbl(mean, na.rm = TRUE)
``````
```
player NA
pos NA
age 26.5093555093555
bref_team_id NA
[output truncated]
```

Python

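``````
nba.mean(numeric_only=True)
``````
```
age             26.509356
g               53.253638
gs              25.571726
[output truncated]
```

There are some major differences in approach here. In both, we’re applying a function across the dataframe columns. In Python, the `mean` method on dataframes finds the mean of each column; passing `numeric_only=True` restricts it to the numeric columns, which recent versions of pandas require when the dataframe also contains string columns.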

In R, we can use functions from two popular packages to select the columns we want to average and apply the `mean` function to them. The `%>%` operator, referred to as “the pipe”, passes the output of one function as input to the next. Taking the mean of string values would just result in `NA` (not available), so we use `select_if` to keep only the numeric columns. We also need to ignore `NA` values when we take the mean, which is why we pass `na.rm=TRUE` into the `mean` function. If we don’t, we end up with `NA` for the mean of columns like `x3p.`. This column is three point percentage; some players didn’t take three point shots, so their percentage is missing. The `.mean()` method in Python already ignores these values by default.
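As a rough illustration of what `na.rm=TRUE` and the pandas default both do, the following plain-Python sketch averages a column while skipping missing values. The helper name `mean_skipping_missing` and the sample percentages are made up for illustration and are not taken from the NBA dataset.

```python
import math

def mean_skipping_missing(values):
    """Average the values that are present, ignoring None and NaN entries."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        return math.nan  # nothing left to average
    return sum(present) / len(present)

# Hypothetical three point percentages; one player took no three point shots.
with_gap = [0.25, math.nan, 0.5, 0.75]

naive_mean = sum(with_gap) / len(with_gap)      # NaN propagates through the sum
robust_mean = mean_skipping_missing(with_gap)   # averages the three known values

print(naive_mean, robust_mean)
```

The naive version mirrors calling `mean` in R without `na.rm=TRUE`: a single missing value makes the whole result missing.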

## Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate to others. We’ll compare the `ast`, `fg`, and `trb` columns.

R

``````
library(GGally)

nba %>%
  select(ast, fg, trb) %>%
  ggpairs()
``````

Python

``````
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["ast", "fg", "trb"]])
plt.show()
``````

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).

## Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

R

``````
library(cluster)
set.seed(1)
# A column is usable if it is numeric and has no missing values
isGoodCol <- function(col){
  sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster
``````

Python

``````
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
# Keep numeric columns only, then drop any column with missing values
good_columns = nba.select_dtypes(include="number").dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_
``````

In order to cluster properly, we remove any non-numeric columns, and any columns with missing values (`NA`, `NaN`, etc). In R, we do this by applying a function across each column, and removing the column if it has any missing values or isn’t numeric. We then use the built-in `kmeans` function to find `5` clusters in our data; the cluster package is loaded for the plotting function used in the next step. We set a random seed using `set.seed` to be able to reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We prepare the data in much the same way as in R, except we use the `select_dtypes` and `dropna` methods to remove non-numeric columns and columns with missing values.
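For readers who want to see what a k-means fit does under the hood, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) on 2-D points. It is not how R or scikit-learn implement it: the starting centers are fixed by hand instead of being chosen randomly, and the function name `kmeans_2d` and the toy points are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_2d(points, centers, max_iterations=100):
    """Alternate between assigning points and moving centers until stable."""
    centers = list(centers)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(old_center)  # keep an empty cluster in place
        if new_centers == centers:
            break  # centers stopped moving, so the labels are final
        centers = new_centers
    return labels, centers

# Toy data: two well separated groups of made-up (x, y) pairs.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans_2d(points, centers=[(0, 0), (10, 10)])
print(labels, centers)
```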

## Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

R

``````
nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)
``````

Python

``````
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()
``````

We made a scatter plot of our data, and shaded or changed the icon of each point according to its cluster. In R, we used the `clusplot` function, which is part of the cluster library. We performed PCA via the `prcomp` function that is built into R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.

## Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.

R

``````
trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]
``````

Python

``````
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]
``````

You’ll notice that R has many more data-analysis focused builtins, like `floor`, `sample`, and `set.seed`, whereas these are called via packages in Python (`math.floor`, `random.sample`, `random.seed`). In Python, pandas provides a `sample` method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages to make sampling simpler, but they aren’t much more concise than using the built-in `sample` function. In both cases, we set a random seed to make the results reproducible.
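For comparison, here is a sketch of the same 80/20 split written with only the standard library modules mentioned above, following the shape of the R version (count the rows, sample the indices, take the rest as the test set). The function name `split_rows` and the placeholder rows are assumptions for illustration.

```python
import math
import random

def split_rows(rows, train_fraction=0.8, seed=1):
    """Return (train, test) lists that together hold every row exactly once."""
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    train_count = math.floor(train_fraction * len(rows))
    train_index = set(rng.sample(range(len(rows)), train_count))
    train = [row for i, row in enumerate(rows) if i in train_index]
    test = [row for i, row in enumerate(rows) if i not in train_index]
    return train, test

players = [f"player_{i}" for i in range(10)]  # placeholder rows
train, test = split_rows(players)
print(len(train), len(test))
```

Calling `split_rows` again with the same seed returns the same split, which is the role `set.seed` and `random_state` play in the code above.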

## Univariate linear regression

Let’s say we want to predict number of assists per player from field goals made per player.

R

``````
fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)
``````

Python

``````
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])
``````

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in `lm` and `predict` functions. `predict` will behave differently depending on the kind of fitted model that is passed into it — it can be used with a variety of fitted models.
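Both calls fit an ordinary least squares line. As a sketch of the arithmetic behind a single-predictor fit, the closed-form slope and intercept can be computed directly. The helper names `fit_line` and `predict_line` and the numbers below are invented for illustration; they are not values from the NBA data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_line(model, x):
    """Apply a fitted (intercept, slope) pair to new x values."""
    intercept, slope = model
    return [intercept + slope * xi for xi in x]

# Made-up season totals that lie exactly on ast = 10 + 0.5 * fg.
fg = [100, 200, 300, 400]
ast = [60, 110, 160, 210]

model = fit_line(fg, ast)
predictions = predict_line(model, [500])
print(model, predictions)
```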

## Calculate summary statistics for the model

R

``````
summary(fit)
``````
```
Call:
lm(formula = ast ~ fg, data = train)

Residuals:
Min      1Q  Median      3Q     Max
-228.26  -35.38  -11.45   11.99  559.61
[output truncated]
```

Python

``````
import statsmodels.formula.api as sm
model = sm.ols(formula='ast ~ fga', data=train)
fitted = model.fit()
fitted.summary()
``````
`OLS Regression Results`

Dep. Variable: ast
R-squared: 0.568
Model: OLS
[output truncated]

If we want to get summary statistics about the fit, like r-squared value, we’ll need to do a bit more in Python than in R. With R, we can use the built-in `summary` function to get information on the model. With Python, we need to use the statsmodels package, which enables many statistical methods to be used in Python. We get similar results, although generally it’s a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don’t exist in Python.
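The R-squared value reported in both summaries is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with made-up numbers shows the two extremes; the function name `r_squared` is chosen for illustration.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]  # made-up target values

perfect = r_squared(actual, actual)        # predictions match exactly
baseline = r_squared(actual, [2.5] * 4)    # always predicting the mean

print(perfect, baseline)
```

A model that always predicts the mean scores zero, so a value such as the 0.568 shown above means the fit explains a bit more than half of the variance.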

## Fit a random forest model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

R

``````
library(randomForest)
predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])
``````

Python

``````
from sklearn.ensemble import RandomForestRegressor
predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])
``````

The main difference here is that we needed to use the randomForest library in R to use the algorithm, whereas it was built in to scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there’s usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.

## Calculate error

Now that we’ve fit two models, let’s calculate error. We’ll use MSE.

R

``````
mean((test$ast - predictions)^2)
``````
```
4573.86778567462
```

Python

``````
from sklearn.metrics import mean_squared_error
mean_squared_error(test["ast"], predictions)
``````
```
4166.9202475632374
```

In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There’s a small difference in errors that is almost certainly due to parameter tuning, and isn’t a big deal.
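To back up the point that the manual calculation is easy, here is a sketch of MSE in plain Python. The name `mse_manual` and the assist counts are made up for illustration.

```python
def mse_manual(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical assist totals and model predictions for three players.
actual_assists = [10, 20, 30]
predicted_assists = [12, 18, 33]

mse = mse_manual(actual_assists, predicted_assists)  # (4 + 4 + 9) / 3
print(mse)
```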

## Download a webpage

Now that we have data on NBA players from 2013-2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.

R

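``````
library(RCurl)
data <- getURL(url)  # `url` holds the address of the box score page
``````

Python

``````
import requests
# `url` holds the address of the box score page
data = requests.get(url).content
``````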

In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparison’s sake.

## Extract player box scores

Now that we have the web page, we’ll need to parse it to extract scores for players.

R

``````
library(rvest)
page <- read_html(url)
table <- html_nodes(page, ".stats_table")
rows <- html_nodes(table, "tr")
cells <- html_nodes(rows, "td a")
teams <- html_text(cells)

extractRow <- function(rows, i){
  # Skip the first row of each table
  if(i == 1){
    return(NULL)
  }
  row <- rows[i]
  # The second row holds the headers
  tag <- "td"
  if(i == 2){
    tag <- "th"
  }
  items <- html_nodes(row, tag)
  html_text(items)
}

scrapeData <- function(team){
  teamData <- html_nodes(page, paste("#",team,"_basic", sep=""))
  rows <- html_nodes(teamData, "tr")
  lapply(seq_along(rows), extractRow, rows=rows)
}

data <- lapply(teams, scrapeData)
``````

Python

``````
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'html.parser')
box_scores = []
# Each team's table has an id like "CLE_basic"
for table in soup.find_all(id=re.compile("[A-Z]{3,}_basic")):
    rows = []
    for i, row in enumerate(table.find_all("tr")):
        if i == 0:
            # Skip the first row of the table
            continue
        elif i == 1:
            # The second row holds the column headers
            cell_tag = "th"
        else:
            cell_tag = "td"
        row_data = [item.get_text() for item in row.find_all(cell_tag)]
        rows.append(row_data)
    box_scores.append(rows)
``````

This will create a list containing two lists, the first with the box score for `CLE`, and the second with the box score for `GSW`. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our `nba` dataframe.

The R code is more complex than the Python code, because there isn’t a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using `for` loops in favor of applying functions along vectors. We use `lapply` to do this, but since we need to treat each row differently depending on whether it’s a header or not, we pass the index of the item we want, and the entire `rows` list into the function.
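The regular expression doing the selection in the Python version can be tried on its own. In this sketch only `CLE_basic` and `GSW_basic` correspond to the tables discussed here; the other ids are hypothetical, added to show what the pattern rejects. The pattern is applied with `search`, which is how BeautifulSoup applies a compiled pattern when filtering.

```python
import re

# Three or more capital letters followed by "_basic", as in the scraper.
TEAM_TABLE_ID = re.compile(r"[A-Z]{3,}_basic")

# Only the first two ids come from the box score; the rest are made up.
candidate_ids = ["CLE_basic", "GSW_basic", "CLE_advanced", "footer", "gsw_basic"]

matching_ids = [i for i in candidate_ids if TEAM_TABLE_ID.search(i)]
team_codes = [i.split("_")[0] for i in matching_ids]

print(matching_ids, team_codes)
```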

We use `rvest`, a widely-used R web scraping package, to extract the data we need. Note that we can pass a url directly into rvest, so the last step wasn’t needed in R.

In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.

## Python vs R in Conclusion

We’ve taken a look at how to analyze a dataset with R and Python. There are many tasks we didn’t dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. There is a lot more to discuss on this topic, but just based on what we’ve done above, we can draw some meaningful conclusions:

### R is more functional, Python is more object-oriented

As we saw from functions like `lm`, `predict`, and others, R lets functions do most of the work. Contrast this to the `LinearRegression` class in Python, and the `sample` method on dataframes.
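The two styles can be put side by side in Python. In this sketch, the method call is the object-oriented style, and a generic function built with `functools.singledispatch` loosely imitates the way `predict` in R picks its behavior from the kind of fitted model. The toy classes `MeanModel` and `LastValueModel` and the numbers are invented for illustration.

```python
from functools import singledispatch

class MeanModel:
    """Object-oriented style: the fitted model carries its own methods."""
    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, n):
        return [self.mean_] * n

class LastValueModel:
    """A second toy model, used only through the generic function below."""
    def fit(self, y):
        self.last_ = y[-1]
        return self

@singledispatch
def predict(model, n):
    """Function-first style: one generic name, behavior chosen by model type."""
    raise TypeError(f"no predict implementation for {type(model).__name__}")

@predict.register
def _(model: MeanModel, n):
    return [model.mean_] * n

@predict.register
def _(model: LastValueModel, n):
    return [model.last_] * n

assists = [2, 4, 6]  # made-up values
method_style = MeanModel().fit(assists).predict(2)
function_style = predict(MeanModel().fit(assists), 2)
other_model = predict(LastValueModel().fit(assists), 2)
print(method_style, function_style, other_model)
```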

### R has more data analysis built-in, Python relies on packages

When we looked at summary statistics, we could use the `summary` built-in function in R, but had to import the `statsmodels` package in Python. The dataframe is a built-in construct in R, but must be imported via the `pandas` package in Python.

### Python has “main” packages for data analysis tasks, R has a larger ecosystem of small packages

With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, `lm`, while `randomForest` is a separate package, etc).

### R has more statistical support in general

R was built as a statistical language, and it shows. `statsmodels` in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

### It’s usually more straightforward to do non-statistical tasks in Python

With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn’t look into closely, like saving to databases, deploying web servers, or running complex workflows.

### There are many parallels between the data analysis workflow in both

There are clear points of inspiration between R and Python (pandas dataframes were inspired by R dataframes, the rvest package was inspired by BeautifulSoup), and both ecosystems continue to grow stronger. It’s remarkable how similar the syntax and approaches are for many common tasks in both languages.
