Matrices Portfolio
library(Matrix)
library(tidyverse)
library(magrittr) # for the compound assignment pipe %<>% used below
library(igraph)
We work with the same dataset used for the Tidyverse portfolio, containing data on some of the top-voted Kaggle kernels.
# Use backticks for column names containing spaces (e.g. `Code Type`)
kaggle <- "kagglekernels.csv" %>%
  read_csv(col_types = cols(
    Votes = col_double(),
    Owner = col_factor(),
    Kernel = col_factor(),
    Dataset = col_factor(),
    Output = col_character(),
    `Code Type` = col_factor(),
    Language = col_factor(),
    Comments = col_double(),
    Views = col_double(),
    Forks = col_double())
  )
kaggle # tibbles print only the first 10 rows by default
## # A tibble: 971 x 12
## Votes Owner Kernel Dataset `Version Histor… Tags Output `Code Type` Language
## <dbl> <fct> <fct> <fct> <chr> <chr> <chr> <fct> <fct>
## 1 2130 Mega… Explo… Titani… Version 8,2017-… tuto… This … Script markdown
## 2 1395 Guid… Full … Data S… Version 19,2017… tuto… This … Notebook Python
## 3 1363 Pedr… Compr… House … Version 47,2018… begi… This … Notebook Python
## 4 1316 Anis… Intro… Titani… Version 93,2018… tuto… This … Notebook Python
## 5 1078 Kaan… Data … Pokemo… Version 389,201… begi… This … Notebook Python
## 6 1003 Phil… Explo… Zillow… Version 44,2017… begi… This … Script markdown
## 7 946 Mana… Titan… Titani… Version 16,2017… tuto… This … Notebook Python
## 8 826 Omar… A Jou… Titani… Version 6,2016-… begi… This … Notebook Python
## 9 814 anok… Data … Quora … <NA> inte… This … Notebook Python
## 10 726 SRK Simpl… Zillow… Version 19,2017… eda,… This … Notebook Python
## # … with 961 more rows, and 3 more variables: Comments <dbl>, Views <dbl>,
## # Forks <dbl>
Again, we can use the `Tags` column to create a number of new variables, each representing one tag.
# Build a kernel-by-tag indicator matrix: one row per kernel, one column per tag
tagmatrix <- kaggle %>%
  dplyr::select(Tags) %>%
  mutate(rn = row_number()) %>%               # row id: uniquely identifies each kernel
  separate_rows(Tags, sep = "\\s*,\\s*") %>%  # split the comma-separated tags (regex)
  mutate(i1 = 1) %>%                          # indicator value for each kernel-tag pair
  mutate(Tags = na_if(Tags, "")) %>%          # empty strings from trailing commas become NA
  pivot_wider(names_from = Tags,
              values_from = i1,
              values_fill = list(i1 = 0)) %>% # wide format: one column per tag
  dplyr::select(-rn, -"NA")                   # drop the row id and the NA/empty-tag column
tagmatrix
## # A tibble: 971 x 101
## tutorial beginner `feature engine… preprocessing eda `data cleaning`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 0 0 0
## 2 1 0 0 1 0 0
## 3 0 1 0 0 1 1
## 4 1 0 0 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 1 0
## 7 1 0 1 0 0 0
## 8 0 1 0 0 1 0
## 9 0 0 0 0 1 0
## 10 0 0 0 0 1 0
## # … with 961 more rows, and 95 more variables: ensembling <dbl>, xgboost <dbl>,
## # `data visualization` <dbl>, `model comparison` <dbl>, `random
## # forest` <dbl>, `logistic regression` <dbl>, intermediate <dbl>, nlp <dbl>,
## # `regression analysis` <dbl>, `time series` <dbl>, `geospatial
## # analysis` <dbl>, `linear regression` <dbl>, advanced <dbl>, cnn <dbl>,
## # classification <dbl>, `neural networks` <dbl>, linguistics <dbl>, `survey
## # analysis` <dbl>, `dimensionality reduction` <dbl>, pca <dbl>, `image
## # processing` <dbl>, `deep learning` <dbl>, storytelling <dbl>,
## # databases <dbl>, bigquery <dbl>, learning <dbl>, crime <dbl>,
## # finance <dbl>, forecasting <dbl>, healthcare <dbl>, `gradient
## # boosting` <dbl>, rnn <dbl>, animation <dbl>, geography <dbl>,
## # terrorism <dbl>, `model diagnosis` <dbl>, `k-means` <dbl>, `food and
## # drink` <dbl>, `decision tree` <dbl>, animals <dbl>, `recommender
## # systems` <dbl>, `video games` <dbl>, demographics <dbl>, internet <dbl>,
## # basketball <dbl>, sports <dbl>, optimization <dbl>, `marketing
## # analytics` <dbl>, cricket <dbl>, politics <dbl>, biology <dbl>, `network
## # analysis` <dbl>, `pipeline code` <dbl>, languages <dbl>, education <dbl>,
## # `machine learning` <dbl>, gan <dbl>, regression <dbl>, business <dbl>,
## # marketing <dbl>, clustering <dbl>, `5daychallenge` <dbl>, svm <dbl>,
## # lstm <dbl>, `image data` <dbl>, `object segmentation` <dbl>,
## # statistics <dbl>, housing <dbl>, economics <dbl>, `text mining` <dbl>,
## # banking <dbl>, memory <dbl>, `association football` <dbl>, violence <dbl>,
## # `visual arts` <dbl>, history <dbl>, `bayesian statistics` <dbl>,
## # cities <dbl>, `united states` <dbl>, `occupational safety` <dbl>,
## # programming <dbl>, countries <dbl>, immigration <dbl>, `universities and
## # colleges` <dbl>, `auto racing` <dbl>, probability <dbl>, `programming
## # languages` <dbl>, plants <dbl>, firefighting <dbl>, `stochastic
## # processes` <dbl>, safety <dbl>, `outlier analysis` <dbl>, weather <dbl>,
## # `signal processing` <dbl>, `reinforcement learning` <dbl>
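The wide format now has one indicator column per distinct tag. As a quick sanity check (a sketch, output omitted), we could confirm the dimensions and how many tags each kernel carries:
dim(tagmatrix)               # rows = kernels, columns = distinct tags
summary(rowSums(tagmatrix))  # distribution of the number of tags per kernel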
Each column now contains a 1 where the kernel was tagged with that specific tag, and a 0 otherwise. Data in this form is not very useful for a graph representation. Instead, we can build an adjacency matrix with tags on both rows and columns, where each entry counts how many times that pair of tags was used together in the same kernel. To compute these counts we can leverage the `crossprod` function. We also convert `tagmatrix`, which is a `tibble`, into a sparse `Matrix`.
tagmatrix %<>%
  as.matrix() %>%
  Matrix::Matrix(sparse = TRUE) %>% # convert the tibble into a sparse Matrix
  Matrix::crossprod() %>%           # t(X) %*% X: tag co-occurrence counts, preserves sparsity
  `diag<-`(0) %>%                   # a tag is not related to itself
  Matrix::drop0()                   # drop the explicit zeros introduced on the diagonal
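To see why `crossprod` yields co-occurrence counts, consider a minimal toy example with three hypothetical tags A, B and C (purely illustrative, not part of the dataset):
# toy kernel-by-tag indicator matrix: 3 kernels, 3 tags
toy <- Matrix::Matrix(c(1, 1, 0,   # kernel 1 tagged A, B
                        1, 0, 1,   # kernel 2 tagged A, C
                        1, 1, 1),  # kernel 3 tagged A, B, C
                      nrow = 3, ncol = 3, byrow = TRUE, sparse = TRUE,
                      dimnames = list(NULL, c("A", "B", "C")))
Matrix::crossprod(toy) # entry (A, B) is 2: kernels 1 and 3 carry both tags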
We can look at the sparsity pattern with the function Matrix::image()
to get an idea of what this matrix looks like.
Matrix::image(tagmatrix, main="Sparsity Pattern of Tag Pairings")
Alternatively, we can compute the proportion of nonzero entries, i.e. the density of the matrix.
nnzero(tagmatrix) / length(tagmatrix)
## [1] 0.08214881
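For completeness, a short sketch computing the complementary sparsity proportion (the share of exact zeros) alongside the density printed above:
dens <- nnzero(tagmatrix) / length(tagmatrix) # share of nonzero entries
c(density = dens, sparsity = 1 - dens)        # sparsity is the share of zeros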
We now proceed to use the `igraph` package to plot a network using `tagmatrix` as a weighted adjacency matrix. Since the matrix is very large, we restrict ourselves to the most "social" tags, i.e. tags that have been paired with some other tag more than a given number of times, in this case 5. To do this, we set the values at or below this threshold to zero and then drop the rows and columns that are left empty (since `tagmatrix` is symmetric).
# set values at or below the threshold to 0
tagmatrix[tagmatrix <= 5] <- 0
# drop empty rows and columns for graphing purposes
flag <- Matrix::rowSums(tagmatrix) > 0
tagmatrix <- tagmatrix[flag, flag]
# edge weights are the co-occurrence counts, i.e. how "social" the tags are
graph <- graph_from_adjacency_matrix(tagmatrix, weighted = TRUE)
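Since `tagmatrix` is symmetric, the default directed graph represents every pairing with two arrows. A possible alternative (a sketch, not used for the plots below) is to build the graph as undirected and drop any remaining self-loops explicitly:
# undirected graph from the symmetric co-occurrence matrix
ugraph <- graph_from_adjacency_matrix(tagmatrix, weighted = TRUE,
                                      mode = "undirected", diag = FALSE)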
We first choose to plot the network using a standard circle layout.
plot(graph, layout=layout_in_circle(graph), vertex.label.cex=1.0,
edge.arrow.size=1.0, vertex.label.color="black", vertex.size=24)
While at first glance this graph looks insightful, it is hard to make out any structure. Instead, we can specify a layout by choosing the coordinates of each node ourselves. After some trial and error, it is possible to come up with a layout similar to this one.
# manually chosen (x, y) coordinates, one row per node
layout <- matrix(c(0, 1, 1, 0, -1, 2, 3, -1, 3, 4, 4, 1, 1, -1, 2, 0, 3, 4, 4,
                   0, 1, 2, 2, 2, 1, 2, 1, 0, 0, 2, 3, 0, 0, 0, 3, 3, 3, 1),
                 nrow=19, ncol=2)
plot(graph, layout=layout, vertex.label.cex=1.0, edge.arrow.size=1.0,
vertex.label.color="black", vertex.size=24)
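As an aside, a force-directed layout such as Fruchterman-Reingold can often recover similar groupings without hand-tuning coordinates. A minimal sketch (the result is stochastic, so the seed is fixed for reproducibility):
set.seed(1) # force-directed layouts are stochastic
plot(graph, layout=layout_with_fr(graph, weights=E(graph)$weight),
     vertex.label.cex=1.0, edge.arrow.size=1.0,
     vertex.label.color="black", vertex.size=24)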
Looking at the hand-crafted layout, we can see a few important features. First of all, there appear to be three main clusters: one about deep learning, one about finance and crime data, and a third that gathers most of the remaining tags. In this last cluster, `data visualization` dominates.
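To check the impression of three clusters more formally, one option (again only a sketch, and the result depends on the threshold chosen above) is to run a community-detection algorithm such as the Louvain method, which requires an undirected graph:
# collapse the two directed edges per tag pair into one undirected weighted edge
ug <- as.undirected(graph, mode = "collapse", edge.attr.comb = "max")
communities <- cluster_louvain(ug, weights = E(ug)$weight)
membership(communities) # community label for each tag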