Tidyverse Portfolio
library(tidyverse)
library(plotly)
library(forcats)
library(reshape2)
library(magrittr)
Spotify: Genre Popularity by Artist
Import the Spotify data and drop the index column, because it only duplicates the row numbers.
spotify <- read_csv("top50.csv", col_names=c(
"index", "Song", "Artist", "Genre", "BPM", "Energy", "Danceability",
"Loudness", "Liveness", "Valence", "Length", "Acousticness",
"Speechiness", "Popularity"),
col_types=cols(
index=col_double(),
Song=col_factor(),
Artist=col_factor(),
Genre=col_factor()),
skip=1)
# The index column only contains row numbers, so drop it
spotify <- spotify %>% select(-c("index"))
For brevity, we shorten the genre names. We also reassign the artist levels, because one artist's name is not valid UTF-8 in the source file.
levels(spotify$Genre) <- c(
"CanPop", "ReggaeFlow", "DancePop", "Pop", "DfwRap", "Trap", "CountryRap",
"ElecPop", "Reggaeton", "PanPop", "CanadaHH", "Latin", "EscapeRoom",
"PopHouse", "AustrPop", "EDM", "AltHH", "BigRoom", "BoyBand", "R&Besp",
"Brostep"
)
levels(spotify$Artist) <- c(
"Shawn Mendes", "Anuel AA", "Ariana Grande", "Ed Sheeran", "Post Malone",
"Lil Tecca", "Sam Smith", "Lil Nas X", "Billie Eilish", "Bad Bunny",
"DJ Snake", "Lewis Capaldi", "Sech", "Drake", "Chris Brown",
"J Balvin", "Y2K", "Lizzo", "MEDUZA", "Jhay Cortez","Lunay", "Tones and I",
"Ali Gatie", "Daddy Yankee", "The Chainsmokers", "Maluma", "Young Thug",
"Katy Perry", "Martin Garrix", "Jonas Brothers", "Lauv", "Kygo",
"Taylor Swift", "Lady Gaga", "Khalid","ROSALìA", "Marshmello", "Nicky Jam"
)
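Assigning to levels() relabels by position, so the replacement vector has to be in exactly the same order as the existing levels (with col_factor() and no explicit levels, that is the order of first appearance in the file). A minimal sketch of a less order-sensitive variant, using toy data and a hypothetical lookup table rather than the real Spotify values:

```r
# Toy factor standing in for spotify$Genre (values are illustrative only)
genre <- factor(
  c("canadian pop", "dance pop", "pop", "dance pop"),
  levels = c("canadian pop", "dance pop", "pop")
)

# Hypothetical lookup table: old level name -> short label (any order)
short_names <- c(
  "pop" = "Pop",
  "canadian pop" = "CanPop",
  "dance pop" = "DancePop"
)

# Index the lookup by the current levels so the order always matches
levels(genre) <- unname(short_names[levels(genre)])

print(levels(genre))
print(table(genre))
```

Because the lookup is indexed by name, reordering the entries of short_names does not change the result, which is not true for a purely positional assignment.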
Let’s make a bar plot in polar coordinates that shows the number of songs per genre, with each bar stacked by artist.
# No pipe needed here: the data is passed to ggplot() directly
ggplot(data=spotify) +
  geom_bar(aes(x=Genre, fill=Artist)) +
  coord_polar() +
  theme(axis.text.x=element_text(size=12),
        plot.title=element_text(hjust=0.5)) +
  labs(y="Number of Songs per Genre",
       title="Top 50 Spotify Songs by Genre by Artist")
We can also build an interactive version with plotly, here as horizontal bars ordered by genre frequency, which is easier to read when several variables are shown at once. Notice how “Dance Pop” and “Pop” are the most common genres in the top 50 Spotify songs, and how Ed Sheeran dominates the “Pop” genre.
# First, create new dataframe where we re-order the factor
spotify_ordered <- within(
spotify,
Genre <- factor(Genre, levels=names(sort(table(Genre), decreasing=TRUE)))
)
# Alternative with forcats: mutate(Genre = fct_infreq(Genre))
p <- ggplot(data=spotify_ordered) +
geom_bar(aes(x=Genre, fill=Artist)) +
coord_flip() +
theme(axis.text.x=element_text(size=12)) +
labs(y="Number of Songs per Genre",
title="Top 50 Spotify Songs by Genre by Artist", x="Genre") +
theme(plot.title=element_text(hjust=0.5))
ggplotly(p)
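The reordering step is what makes the bars appear sorted by frequency: ggplot2 draws a discrete axis in the order of the factor levels, not in the order of the counts. A small sketch on made-up data (not the Spotify file) shows what the factor(..., levels=names(sort(table(...)))) idiom does:

```r
# Toy genre vector; by default the levels are sorted alphabetically
genre <- factor(c("Pop", "Latin", "Pop", "EDM", "Pop", "Latin"))

# Count each level and sort the counts from most to least frequent
counts <- sort(table(genre), decreasing = TRUE)

# Rebuild the factor with the levels in frequency order
genre_ordered <- factor(genre, levels = names(counts))

print(levels(genre))
print(levels(genre_ordered))
```

The values themselves are unchanged; only the level order, and therefore the plotting order, differs.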
Spotify: Correlation of Musical Features
Let’s see the correlation between the different features that describe a song.
# create function to set to NA all upper triangular part of the matrix
upper_tri_to_na <- function(matrix){
matrix[upper.tri(matrix)] <- NA
return(matrix)
}
# Compute the correlation matrix of the numeric features, blank out the
# upper triangle, and melt it to long format for plotting
melted_corrmatrix <- spotify %>%
select(-c(Genre, Artist, Song)) %>% cor %>%
upper_tri_to_na %>%
melt %>%
mutate(value=round(value, digits=2))
# Plot correlation plot
ggplot(melted_corrmatrix, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color="white") +
geom_text(aes(x=Var1, y=Var2, label=value), color="grey90", size=10) +
theme(axis.text.x=element_text(size=15),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_text(size=15),
plot.title=element_text(hjust=0.5, size=20),
legend.title=element_text(size=15),
legend.text=element_text(size=12)) +
ggtitle("Correlation between Song Features") +
labs(fill="Correlation") +
scale_fill_continuous(na.value="grey91")
We can see that Loudness and Energy are positively correlated, as one would expect. Surprisingly, Speechiness and BPM are also positively correlated. Energy and Acousticness, by contrast, are negatively correlated.
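To make the data preparation behind the heatmap concrete, here is a minimal base R sketch of the same idea: compute the correlation matrix, set the redundant upper triangle to NA, and reshape it to one row per pair. The data frame below is invented, and as.data.frame(as.table(...)) is used here as a stand-in for reshape2's melt.

```r
# Toy data with made-up values; the column names are borrowed for illustration
toy <- data.frame(
  Energy = c(1, 2, 3, 4, 5),
  Loudness = c(2, 4, 6, 8, 10),
  Acousticness = c(5, 4, 3, 2, 1)
)

# Correlation matrix; it is symmetric, so the upper triangle is redundant
corr <- cor(toy)
corr[upper.tri(corr)] <- NA

# Long format: one row per (Var1, Var2) pair, dropping the blanked cells
long <- as.data.frame(as.table(corr), responseName = "value")
long <- long[!is.na(long$value), ]

print(long)
```

upper.tri() excludes the diagonal by default, so each feature's correlation with itself is kept alongside the lower triangle.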
Most Popular Kaggle Kernels
We can also work with a dataset that describes the most popular Kaggle kernels. We cast the columns to the correct data types up front to avoid problems with factors later on.
# Use backticks for column names that contain spaces
kaggle <- "kagglekernels.csv" %>%
read_csv(col_types = cols(
Votes=col_double(),
Owner=col_factor(),
Kernel=col_factor(),
Dataset=col_factor(),
Output=col_character(),
`Code Type`=col_factor(),
Language=col_factor(),
Comments=col_double(),
Views=col_double(),
Forks=col_double())
)
We want to use kaggle$Output to extract the number of visualizations and the number of data files that each kernel outputs.
kaggle %<>%
  mutate(
    # Pull the counts out of the `Output` strings; kernels with no match get 0
    OutputVisualizations = replace_na(as.numeric(str_extract(Output, "\\d+(?= vis)")), 0),
    OutputFiles = replace_na(as.numeric(str_extract(Output, "\\d+(?= data fil)")), 0)) %>%
  select(-Output) # drop the Output column, which is no longer needed
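The two patterns rely on a lookahead: \\d+(?= vis) matches a run of digits only when it is immediately followed by " vis", without including that text in the match. The sketch below tries this on a few invented strings; the exact wording of the real Output column is an assumption here.

```r
library(stringr)

# Made-up strings in the assumed style of the Output column
output <- c(
  "3 visualizations, 2 data files",
  "1 data file",
  "12 visualizations",
  NA
)

# Digits followed by " vis" or " data fil"; NA where there is no match
vis <- as.numeric(str_extract(output, "\\d+(?= vis)"))
files <- as.numeric(str_extract(output, "\\d+(?= data fil)"))

# Treat "nothing reported" as zero outputs
vis[is.na(vis)] <- 0
files[is.na(files)] <- 0

print(data.frame(output, vis, files))
```

Matching on the prefix "data fil" rather than "data files" lets the same pattern cover both the singular and the plural form.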
We can use this data to see how the number of output visualizations and the number of output files vary with the programming language and the code type.
kaggle %>%
select(c(Language, OutputVisualizations, OutputFiles, `Code Type`)) %>%
group_by(`Code Type`, Language) %>%
summarise(
Visualizations=sum(OutputVisualizations),
Files=sum(OutputFiles)) %>%
gather("Outputs", "NumOutputs", -Language, -`Code Type`) %>%
ggplot(aes(x=Language, y=NumOutputs, fill=Outputs, color=Outputs)) +
geom_bar(stat="identity", alpha=0.3) +
facet_wrap(~`Code Type`) +
labs(
y="Total Outputs (Visualizations + Files)",
x="Programming Language",
title="Output Visualizations and Files by Language and Code Type")
Among the scripts, markdown outputs many more visualizations than any other language. R notebooks seem to output mainly visualizations and few files, which seems reasonable. SQLite is no surprise either: it only outputs files, because it is a query language.
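The plot depends on reshaping the summary from wide to long, so that one column holds the output type and another holds the count; that is what lets fill=Outputs split each bar. A minimal sketch of this step on invented numbers, using tidyr's pivot_longer(), the newer counterpart of the gather() call used in the pipeline:

```r
library(tidyr)

# Toy summary table with made-up totals (not taken from the Kaggle data)
wide <- data.frame(
  Language = c("R", "Python"),
  Visualizations = c(4, 7),
  Files = c(1, 3)
)

# One row per (Language, output type) pair
long <- pivot_longer(
  wide,
  cols = c(Visualizations, Files),
  names_to = "Outputs",
  values_to = "NumOutputs"
)

print(long)
```

Each language now appears once per output type, which is the shape ggplot2 needs to map the output type to a fill colour.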