install.packages(c("install.load", "rando", "randomNames", "generator", "randtoolbox",
"stringi", "ggplot2", "data.table", "maps", "pivottabler", "openxlsx", "round",
"pander", "basictabler", "magrittr", "DT", "sf", "USA.state.boundaries", "flextable",
"strex"))
# install the packages and their dependenciesCommon Microsoft Excel Operations Performed in R
Introduction
The following document is modeled after the article by Sunil Ray (Refer to the Works Cited section). The following are the 10 Microsoft Excel operations that Ray performed:
Vlookup()
CONCATENATE()
LEN()
LOWER(), UPPER() and PROPER()
TRIM()
If()
Pivot Table
Creating Charts (Data Visualization)
Remove duplicate values
Text to Columns
Replicate the R code
Install the packages
Note: If you wish to replicate the R code below, then you will need to copy and paste the following command in R first to make sure you have all the packages and their dependencies installed:
Load the packages
Note: If you wish to replicate the R code below, then you will need to copy and paste the following commands in R first to make sure you have all the packages loaded:
install.load::load_package("randomNames", "generator", "randtoolbox", "stringi",
"ggplot2", "data.table", "maps", "stats", "pivottabler", "openxlsx", "round",
"pander", "basictabler", "magrittr", "DT", "sf", "USA.state.boundaries", "flextable",
"strex")
# load needed packages using the load_package function from the install.load
# package (it is assumed that you have already installed these packages)
# set the pander options
panderOptions("table.continues", "")
panderOptions("table.caption.prefix", "")
panderOptions("missing", "")
# set pander options to have missing values in the tables as blank cells
Create the Data Set with 1000 Fake Policies
set.seed(1000)
# makes the example reproducible
# Create the fake data sets to be used in these examples
# create a data.table for the Policy data set
policy <- data.table(Customer = 1:1000, Policy = paste0(as.numeric(gsub("-", "",
(r_national_identification_numbers(1000)))), stri_rand_strings(1000, 3, pattern = "[A-Z]")),
Product = (as.numeric(r_phone_numbers(1000))), Premium = round_r3(sample(c(congruRand(1000) *
15073, congruRand(1000) * 17158, congruRand(1000) * 19710), 1000, replace = TRUE),
d = 2), City = sample(us.cities$name, 1000))
# change the column names
setnames(policy, c("Customer ID", "Policy ID", "Product ID", "Premium ($)", "City"))
# correct spelling of City name
change_column <- "City"
for (col in change_column) set(policy, i = which(grepl("Cincinati", policy[[col]])),
j = col, value = "Cincinnati") # Source 1
# display the first 5 customers
pander(head(policy))| Customer ID | Policy ID | Product ID | Premium ($) | City |
|---|---|---|---|---|
| 1 | 679259874WJO | 3.693e+09 | 17242 | Fullerton CA |
| 2 | 398734972ZAT | 5.471e+09 | 12097 | Pawtucket RI |
| 3 | 313512130KKG | 8.726e+09 | 6753 | Victorville CA |
| 4 | 150571626VHB | 4.525e+09 | 2202 | Quincy MA |
| 5 | 443391456VOC | 7.149e+09 | 16148 | Richmond VA |
| 6 | 384885628ZSK | 2.312e+09 | 7847 | Huntington WV |
# display the last 5 customers
pander(tail(policy))| Customer ID | Policy ID | Product ID | Premium ($) | City |
|---|---|---|---|---|
| 995 | 249325153DDN | 2.434e+09 | 19199 | Coeur d’Alene ID |
| 996 | 565883197KXH | 9.242e+09 | 12415 | Cleveland OH |
| 997 | 649838447HGH | 8.313e+09 | 9388 | Seattle WA |
| 998 | 445305026QNH | 3.968e+09 | 9250 | DeKalb IL |
| 999 | 838201152HIZ | 5.376e+09 | 5548 | Surprise AZ |
| 1000 | 292873799CKE | 3.847e+09 | 7238 | Poway CA |
# Customer data set create genders character vector
genders <- c("Female", "Male")
# create Gender character vector sampled from genders
Gender <- sample(genders, 1000, replace = TRUE)
# create random names based on the gender giving only the first names
Name <- randomNames(gender = Gender, which.names = "first")
# create a data.table for the Customer data set
customer <- data.table(Customer = 1:1000, Name = Name, Gender = Gender, Age = sample(1:120,
1000, replace = TRUE), City = policy$City, Vintage = sample(1:720, 1000, replace = TRUE))
# change the column names
setnames(customer, c("Customer ID", "Name", "Gender", "Age", "City", "Vintage (in months)"))
# display the first 5 customers
pander(head(customer))| Customer ID | Name | Gender | Age | City | Vintage (in months) |
|---|---|---|---|---|---|
| 1 | Devin | Male | 19 | Fullerton CA | 346 |
| 2 | Shaahida | Female | 75 | Pawtucket RI | 702 |
| 3 | Jordan | Male | 67 | Victorville CA | 627 |
| 4 | Krista | Female | 102 | Quincy MA | 317 |
| 5 | Abigail | Female | 14 | Richmond VA | 713 |
| 6 | Joan | Female | 99 | Huntington WV | 645 |
# display the last 5 customers
pander(tail(customer))| Customer ID | Name | Gender | Age | City | Vintage (in months) |
|---|---|---|---|---|---|
| 995 | Rachel | Female | 30 | Coeur d’Alene ID | 650 |
| 996 | Latasha | Female | 76 | Cleveland OH | 622 |
| 997 | Leah Frances | Female | 97 | Seattle WA | 589 |
| 998 | Ju-Young | Female | 72 | DeKalb IL | 310 |
| 999 | Haafil | Male | 22 | Surprise AZ | 96 |
| 1000 | Anerae | Female | 8 | Poway CA | 615 |
# write the policy and customer data.tables to a .xlsx workbook to be read back
# in use the R package openxlsx to create the .xlsx spreadsheet
## Lists elements are written to individual worksheets, using list names as
## sheet names if available
l <- list(Policy = policy, Customer = customer)
## different sheets can be given different parameters
write.xlsx(l, "Fake-User-Data.xlsx")
# read back in the worksheets from the workbook and test for identicalness read
# in the policy worksheet (sheet 1)
policy1 <- read.xlsx("Fake-User-Data.xlsx", sheet = 1)
# make policy1 a data.table
policy1 <- setDT(policy1)
# change the column names of policy1
setnames(policy1, names(policy1), names(policy))
# changing column to integer class
change_class1 <- "Customer ID"
for (col in change_class1) set(policy1, j = col, value = as.integer(policy1[[col]])) # Source 1
# Are policy and policy1 identical?
all(policy == policy1)[1] TRUE
# read in the customer worksheet (sheet 2)
customer1 <- read.xlsx("Fake-User-Data.xlsx", sheet = 2)
# make customer1 a data.table
customer1 <- setDT(customer1)
# change the column names of customer1
setnames(customer1, names(customer1), names(customer))
# changing column to integer class
change_class2 <- c("Customer ID", "Age", "Vintage (in months)")
for (col in change_class2) set(customer1, j = col, value = as.integer(customer1[[col]])) # Source 1
# Are customer and customer1 identical?
all(customer == customer1)[1] TRUE
Perform the data analysis
1) Vlookup() Example 1
# make a new column named Key which contains the first digit of the Product ID
policy[, Key := as.integer(stri_extract_first_regex(policy$`Product ID`, "[1-9]"))]
# set the key of policy
setkey(policy, Key)
# display the first 5 Policies
pander(head(policy))| Customer ID | Policy ID | Product ID | Premium ($) | City | |
|---|---|---|---|---|---|
| 1 | 7 | 454536322GIG | 1.869e+09 | 5709 | West Babylon NY |
| 1 | 19 | 538684138QQD | 1.268e+09 | 9522 | Columbia MD |
| 1 | 24 | 406299389SPF | 1.835e+09 | 11192 | Odessa TX |
| 1 | 45 | 627835714CWY | 1.799e+09 | 9166 | Lincoln NE |
| 1 | 53 | 756277121DFH | 1.878e+09 | 1524 | Country Club FL |
| 1 | 64 | 702156477GIV | 1.362e+09 | 15446 | Chino Hills CA |
# create the key_table to be used in the lookup
key_table <- data.table(Key = 1:9, Category = LETTERS[1:9])
# display the key_table
pander(key_table)| Key | Category |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
| 5 | E |
| 6 | F |
| 7 | G |
| 8 | H |
| 9 | I |
# set the key of key_table
setkey(key_table, Key)
# merge the policy and key_table data.tables thus doing a VLOOKUP()
policy <- policy[key_table]
# remove the unneeded Key column
policy[, Key := NULL]
# set the key of key_table
setkeyv(policy, "Customer ID")
# change the column name
setnames(policy, ncol(policy), "Product Category")
# change the column order
setcolorder(policy, c(1:3, 6, 4:5))
# display the first 5 Policies
pander(head(policy))| Policy ID | Product ID | Product Category | Premium ($) | City |
|---|---|---|---|---|
| 679259874WJO | 3.693e+09 | C | 17242 | Fullerton CA |
| 398734972ZAT | 5.471e+09 | E | 12097 | Pawtucket RI |
| 313512130KKG | 8.726e+09 | H | 6753 | Victorville CA |
| 150571626VHB | 4.525e+09 | D | 2202 | Quincy MA |
| 443391456VOC | 7.149e+09 | G | 16148 | Richmond VA |
| 384885628ZSK | 2.312e+09 | B | 7847 | Huntington WV |
1) Vlookup() Example 2
# add in the Latitude, Longitude, State Abbreviation, and State Region for each
# City in the policy data.table
# create the City data.table with the US city names and respective Latitude and
# Longitude
City <- data.table(City = us.cities$name, Latitude = us.cities$lat, Longitude = us.cities$long)
# get the State_names as a character vector
State_names <- str_after_last(getElement(City, "City"), " ")
# add the State abbreviation to the City data.table
City[, State := State_names]
# display the first 5 Cities
pander(head(City))| City | Latitude | Longitude | State |
|---|---|---|---|
| Abilene TX | 32.45 | -99.74 | TX |
| Akron OH | 41.08 | -81.52 | OH |
| Alameda CA | 37.77 | -122.3 | CA |
| Albany GA | 31.58 | -84.18 | GA |
| Albany NY | 42.67 | -73.8 | NY |
| Albany OR | 44.62 | -123.1 | OR |
# set the key of City
setkey(City, City)
# set the key of policy
setkey(policy, City)
# merge the policy and city data.tables thus doing a VLOOKUP()
policy <- City[policy]
# display the first 5 Policies
pander(head(policy))| Latitude | Longitude | State | Customer ID | Policy ID | |
|---|---|---|---|---|---|
| Abilene TX | 32.45 | -99.74 | TX | 900 | 882492748NOG |
| Akron OH | 41.08 | -81.52 | OH | 766 | 883527884MYY |
| Alameda CA | 37.77 | -122.3 | CA | 164 | 727715324KFW |
| Albany GA | 31.58 | -84.18 | GA | 713 | 259651667ONC |
| Albany NY | 42.67 | -73.8 | NY | 189 | 949624664QCF |
| Albany OR | 44.62 | -123.1 | OR | 632 | 308656677PPI |
| Product ID | Product Category | Premium ($) | |
|---|---|---|---|
| Abilene TX | 3.743e+09 | C | 1971 |
| Akron OH | 8.696e+09 | H | 6363 |
| Alameda CA | 8.43e+09 | H | 4275 |
| Albany GA | 7.533e+09 | G | 10324 |
| Albany NY | 3.787e+09 | C | 6196 |
| Albany OR | 9.22e+09 | I | 16547 |
# change the column order
setcolorder(policy, c(1:6, 9, 7:8))
# display the first 5 Policies
pander(head(policy))| Latitude | Longitude | State | Customer ID | Policy ID | |
|---|---|---|---|---|---|
| Abilene TX | 32.45 | -99.74 | TX | 900 | 882492748NOG |
| Akron OH | 41.08 | -81.52 | OH | 766 | 883527884MYY |
| Alameda CA | 37.77 | -122.3 | CA | 164 | 727715324KFW |
| Albany GA | 31.58 | -84.18 | GA | 713 | 259651667ONC |
| Albany NY | 42.67 | -73.8 | NY | 189 | 949624664QCF |
| Albany OR | 44.62 | -123.1 | OR | 632 | 308656677PPI |
| Premium ($) | Product ID | Product Category | |
|---|---|---|---|
| Abilene TX | 1971 | 3.743e+09 | C |
| Akron OH | 6363 | 8.696e+09 | H |
| Alameda CA | 4275 | 8.43e+09 | H |
| Albany GA | 10324 | 7.533e+09 | G |
| Albany NY | 6196 | 3.787e+09 | C |
| Albany OR | 16547 | 9.22e+09 | I |
# create region to be merged with the policy table
region <- data.table(State = state.fips$abb, Region = state.fips$region)
# display the first 5 Regions
pander(head(region))| State | Region |
|---|---|
| AL | 3 |
| AZ | 4 |
| AR | 3 |
| CA | 4 |
| CO | 4 |
| CT | 1 |
# create addregion to include Alaska and Hawai'i as region only includes 48
# states
addregion <- data.table(State = c("AK", "HI"), Region = 5:6)
# display addregion
pander(addregion)| State | Region |
|---|---|
| AK | 5 |
| HI | 6 |
# create a list of region and addregion to effectively combine the data.tables
# by row
l <- list(region, addregion)
# the new region now has 2 additional rows
region <- rbindlist(l)
# remove duplicate State abbreviations
region <- unique(region)
# display the first 5 Regions
pander(head(region))| State | Region |
|---|---|
| AL | 3 |
| AZ | 4 |
| AR | 3 |
| CA | 4 |
| CO | 4 |
| CT | 1 |
# set the key of Region
setkey(region, Region)
# create region_code for the VLOOKUP() with region
region_code <- data.table(Region = 1:6, Region1 = c("East", "Central", "Mid West",
"West", "Alaska", "Hawai'i"))
# display region_code
pander(region_code)| Region | Region1 |
|---|---|
| 1 | East |
| 2 | Central |
| 3 | Mid West |
| 4 | West |
| 5 | Alaska |
| 6 | Hawai’i |
# set the key of region_code
setkey(region_code, Region)
# merge the region and region_code data.tables thus doing a VLOOKUP()
region <- region[region_code]
# set the key of Region
setkey(region, State)
# set the key of policy
setkey(policy, State)
# merge the policy and city data.tables thus doing a VLOOKUP()
policy <- policy[region]
# display the first 5 Policies
pander(head(policy))| City | Latitude | Longitude | Customer ID | Policy ID | |
|---|---|---|---|---|---|
| AK | Anchorage AK | 61.18 | -149.2 | 532 | 384475938AMY |
| AK | Juneau AK | 58.3 | -134.4 | 795 | 837561119EAL |
| AL | Auburn AL | 32.59 | -85.48 | 803 | 980378167CVE |
| AL | Birmingham AL | 33.53 | -86.8 | 752 | 473542814SEE |
| AL | Decatur AL | 34.57 | -86.99 | 919 | 580577016YOF |
| AL | Dothan AL | 31.24 | -85.41 | 347 | 286944266ECV |
| Premium ($) | Product ID | Product Category | Region | Region1 | |
|---|---|---|---|---|---|
| AK | 9017 | 6.287e+09 | F | 5 | Alaska |
| AK | 8370 | 2.632e+09 | B | 5 | Alaska |
| AL | 7576 | 8.526e+09 | H | 3 | Mid West |
| AL | 18330 | 7.461e+09 | G | 3 | Mid West |
| AL | 13279 | 3.649e+09 | C | 3 | Mid West |
| AL | 936.2 | 8.434e+09 | H | 3 | Mid West |
# remove the Region column
policy[, Region := NULL]
# change the column name
setnames(policy, "Region1", "Region")
# change the column order
setcolorder(policy, c(5:9, 1, 4, 10, 2:3))
# display the first 5 Policies
pander(head(policy))| Customer ID | Policy ID | Premium ($) | Product ID | |
|---|---|---|---|---|
| AK | 532 | 384475938AMY | 9017 | 6.287e+09 |
| AK | 795 | 837561119EAL | 8370 | 2.632e+09 |
| AL | 803 | 980378167CVE | 7576 | 8.526e+09 |
| AL | 752 | 473542814SEE | 18330 | 7.461e+09 |
| AL | 919 | 580577016YOF | 13279 | 3.649e+09 |
| AL | 347 | 286944266ECV | 936.2 | 8.434e+09 |
| Product Category | City | Region | Latitude | Longitude | |
|---|---|---|---|---|---|
| AK | F | Anchorage AK | Alaska | 61.18 | -149.2 |
| AK | B | Juneau AK | Alaska | 58.3 | -134.4 |
| AL | H | Auburn AL | Mid West | 32.59 | -85.48 |
| AL | G | Birmingham AL | Mid West | 33.53 | -86.8 |
| AL | C | Decatur AL | Mid West | 34.57 | -86.99 |
| AL | H | Dothan AL | Mid West | 31.24 | -85.41 |
2) CONCATENATE()
# create 2 new columns
policy[, `:=`(Host = rep("https://www.customerunknown.com", nrow(policy)), Request = policy$"Policy ID")]
# concatenate the Host and Request columns into the URL column with '/'
# separators and extra space padding using the paste0 function
policy[, `:=`(URL, paste0(" ", Host, "/", Request, "/", " "))]
# remove the Host and Request columns
policy[, c("Host", "Request") := NULL]
# display the first 5 Policies
pander(head(policy))| Customer ID | Policy ID | Premium ($) | Product ID | |
|---|---|---|---|---|
| AK | 532 | 384475938AMY | 9017 | 6.287e+09 |
| AK | 795 | 837561119EAL | 8370 | 2.632e+09 |
| AL | 803 | 980378167CVE | 7576 | 8.526e+09 |
| AL | 752 | 473542814SEE | 18330 | 7.461e+09 |
| AL | 919 | 580577016YOF | 13279 | 3.649e+09 |
| AL | 347 | 286944266ECV | 936.2 | 8.434e+09 |
| Product Category | City | Region | Latitude | Longitude | |
|---|---|---|---|---|---|
| AK | F | Anchorage AK | Alaska | 61.18 | -149.2 |
| AK | B | Juneau AK | Alaska | 58.3 | -134.4 |
| AL | H | Auburn AL | Mid West | 32.59 | -85.48 |
| AL | G | Birmingham AL | Mid West | 33.53 | -86.8 |
| AL | C | Decatur AL | Mid West | 34.57 | -86.99 |
| AL | H | Dothan AL | Mid West | 31.24 | -85.41 |
| URL | |
|---|---|
| AK | https://www.customerunknown.com/384475938AMY/ |
| AK | https://www.customerunknown.com/837561119EAL/ |
| AL | https://www.customerunknown.com/980378167CVE/ |
| AL | https://www.customerunknown.com/473542814SEE/ |
| AL | https://www.customerunknown.com/580577016YOF/ |
| AL | https://www.customerunknown.com/286944266ECV/ |
3) LEN()
# gives the number of characters of the first 5 URLs
head(nchar(policy$URL))[1] 56 56 56 56 56 56
4) LOWER(), UPPER() and PROPER()
# transform the first 5 Customer Names to lowercase
head(stri_trans_tolower(customer$Name))[1] "devin" "shaahida" "jordan" "krista" "abigail" "joan"
# transform the first 5 City Names to uppercase
head(stri_trans_toupper(customer$City))[1] "FULLERTON CA" "PAWTUCKET RI" "VICTORVILLE CA" "QUINCY MA"
[5] "RICHMOND VA" "HUNTINGTON WV"
# transform the first 5 URLs to Title Case
head(stri_trans_totitle(policy$URL))[1] " Https://Www.customerunknown.com/384475938amy/ "
[2] " Https://Www.customerunknown.com/837561119eal/ "
[3] " Https://Www.customerunknown.com/980378167cve/ "
[4] " Https://Www.customerunknown.com/473542814see/ "
[5] " Https://Www.customerunknown.com/580577016yof/ "
[6] " Https://Www.customerunknown.com/286944266ecv/ "
5) TRIM()
# the URL column in Policy has extra spaces on both sides (left and right).
# Remove the spacing from both sides.
policy[, URL := stri_trim_both(policy$URL, pattern = "\\P{Wspace}")]
# display the first 5 Policies
pander(head(policy))| Customer ID | Policy ID | Premium ($) | Product ID | |
|---|---|---|---|---|
| AK | 532 | 384475938AMY | 9017 | 6.287e+09 |
| AK | 795 | 837561119EAL | 8370 | 2.632e+09 |
| AL | 803 | 980378167CVE | 7576 | 8.526e+09 |
| AL | 752 | 473542814SEE | 18330 | 7.461e+09 |
| AL | 919 | 580577016YOF | 13279 | 3.649e+09 |
| AL | 347 | 286944266ECV | 936.2 | 8.434e+09 |
| Product Category | City | Region | Latitude | Longitude | |
|---|---|---|---|---|---|
| AK | F | Anchorage AK | Alaska | 61.18 | -149.2 |
| AK | B | Juneau AK | Alaska | 58.3 | -134.4 |
| AL | H | Auburn AL | Mid West | 32.59 | -85.48 |
| AL | G | Birmingham AL | Mid West | 33.53 | -86.8 |
| AL | C | Decatur AL | Mid West | 34.57 | -86.99 |
| AL | H | Dothan AL | Mid West | 31.24 | -85.41 |
| URL | |
|---|---|
| AK | https://www.customerunknown.com/384475938AMY/ |
| AK | https://www.customerunknown.com/837561119EAL/ |
| AL | https://www.customerunknown.com/980378167CVE/ |
| AL | https://www.customerunknown.com/473542814SEE/ |
| AL | https://www.customerunknown.com/580577016YOF/ |
| AL | https://www.customerunknown.com/286944266ECV/ |
6) If()
# if the Premium > $5000, then it's High and if not, then it's Low
policy[, Category := ifelse(policy$"Premium ($)" > 5000, "High", "Low")]
# change the column order
setcolorder(policy, c(1:5, 12, 6:11))
# display the first 5 Policies
pander(head(policy))| Customer ID | Policy ID | Premium ($) | Product ID | |
|---|---|---|---|---|
| AK | 532 | 384475938AMY | 9017 | 6.287e+09 |
| AK | 795 | 837561119EAL | 8370 | 2.632e+09 |
| AL | 803 | 980378167CVE | 7576 | 8.526e+09 |
| AL | 752 | 473542814SEE | 18330 | 7.461e+09 |
| AL | 919 | 580577016YOF | 13279 | 3.649e+09 |
| AL | 347 | 286944266ECV | 936.2 | 8.434e+09 |
| Product Category | Category | City | Region | Latitude | |
|---|---|---|---|---|---|
| AK | F | High | Anchorage AK | Alaska | 61.18 |
| AK | B | High | Juneau AK | Alaska | 58.3 |
| AL | H | High | Auburn AL | Mid West | 32.59 |
| AL | G | High | Birmingham AL | Mid West | 33.53 |
| AL | C | High | Decatur AL | Mid West | 34.57 |
| AL | H | Low | Dothan AL | Mid West | 31.24 |
| Longitude | URL | |
|---|---|---|
| AK | -149.2 | https://www.customerunknown.com/384475938AMY/ |
| AK | -134.4 | https://www.customerunknown.com/837561119EAL/ |
| AL | -85.48 | https://www.customerunknown.com/980378167CVE/ |
| AL | -86.8 | https://www.customerunknown.com/473542814SEE/ |
| AL | -86.99 | https://www.customerunknown.com/580577016YOF/ |
| AL | -85.41 | https://www.customerunknown.com/286944266ECV/ |
7) Pivot Table
# melt policy using reshape
policy_pt <- melt(policy, id.vars = c(9, 5), measure.vars = 3)
# cast policy_pt using reshape
policy_pt <- dcast(policy_pt, Region ~ `Product Category`, sum, fill = NA) # Sources 4 and 5
# obtain the row sums
policy_pt[, "Grand Total" := rowSums(.SD, na.rm = TRUE), .SDcols = 2:ncol(policy_pt)]
# compute the column sums
addpt <- policy_pt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = 2:ncol(policy_pt)]
# create a new column with the name of 'Grand Total'
addpt[, `Grand Total2` := "Grand Total"]
# change the column order
setcolorder(addpt, c(ncol(addpt), 1:(ncol(addpt) - 1L)))
# create a list of policy_pt and addpt in order to combine them by row
l1 <- list(policy_pt, addpt)
# combine the list
policy_pt <- rbindlist(l1, fill = TRUE)
# display the pivot table created using data.table
pander(policy_pt)| Region | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Alaska | 8370 | 9017 | |||||
| Central | 159975 | 116100 | 286302 | 197884 | 228935 | 153537 | 224985 |
| East | 173858 | 161561 | 231200 | 118963 | 124928 | 99298 | 128164 |
| Hawai’i | 9613 | 4632 | |||||
| Mid West | 218196 | 313187 | 380994 | 342939 | 294637 | 372498 | 267178 |
| West | 314655 | 303071 | 250713 | 271445 | 310124 | 303904 | 421518 |
| 866683 | 902290 | 1158821 | 931231 | 958624 | 938254 | 1046476 |
| H | I | Grand Total | Grand Total2 |
|---|---|---|---|
| 17387 | |||
| 205900 | 180277 | 1753894 | |
| 113953 | 128149 | 1280072 | |
| 14246 | |||
| 251895 | 209061 | 2650585 | |
| 305813 | 271242 | 2752485 | |
| 877561 | 788728 | 8468669 | Grand Total |
# or using pivottabler
ptt <- PivotTable$new()
ptt$addData(policy)
ptt$addColumnDataGroups("Product Category")
ptt$addRowDataGroups("Region")
ptt$defineCalculation(calculationName = "Premium ($)", summariseExpression = "sum(`Premium ($)`)")
ptt$renderPivot()# convert to flextable to modify the formatting of the numeric data convert to
# a basictabler table first
tbll <- ptt$asBasicTable()
# now convert to a flextable
ftt <- tbll$asFlexTable()
# change the numeric formatting
ftt <- flextable::colformat_double(ftt, big.mark = ",", digits = 2, na_str = "N/A")
# autofit the table
ftt <- flextable::autofit(ftt)
# display the pivot table created using flextable
fttA | B | C | D | E | F | G | H | I | Total | |
Alaska | 8370.4 | 9016.87 | 17387.27 | |||||||
Central | 159974.75 | 116100.09 | 286301.5 | 197883.55 | 228935.48 | 153537 | 224984.84 | 205900.2 | 180276.67 | 1753894.08 |
East | 173857.9 | 161561.16 | 231199.79 | 118963.03 | 124927.69 | 99297.65 | 128163.66 | 113952.73 | 128148.62 | 1280072.23 |
Hawai'i | 9613.39 | 4632.22 | 14245.61 | |||||||
Mid West | 218196.15 | 313186.58 | 380993.97 | 342939.16 | 294636.58 | 372498.39 | 267177.58 | 251895.44 | 209060.84 | 2650584.69 |
West | 314654.69 | 303071.32 | 250712.51 | 271445.17 | 310124.31 | 303904.03 | 421518.06 | 305812.78 | 271242.16 | 2752485.03 |
Total | 866683.49 | 902289.55 | 1158821.16 | 931230.91 | 958624.06 | 938253.94 | 1046476.36 | 877561.15 | 788728.29 | 8468668.91 |
# the object ptt can be exported as a Microsoft Excel pivot table as well
wbpt <- createWorkbook()
addWorksheet(wbpt, "Pivot Table")
ptt$writeToExcelWorksheet(wb = wbpt, wsName = "Pivot Table", topRowNumber = 1, leftMostColumnNumber = 1,
outputValuesAs = "rawValue", applyStyles = TRUE, mapStylesFromCSS = TRUE)
saveWorkbook(wbpt, file = "Policy-pivot-table.xlsx", overwrite = TRUE)
8) A) Creating Charts (Data Visualization)
# create a bar chart based on the pivot table
# remove the last row
policy_ptuse <- policy_pt[-nrow(policy_pt)]
# remove the last column
policy_ptuse[, ncol(policy_pt) := NULL]
# melt to make it easier to plot with ggplot2
policy_ptmelt <- melt(policy_ptuse, id.vars = "Region", measure.vars = LETTERS[1:9])
setnames(policy_ptmelt, 2, "Legend")
# plot with ggplot2
ggplot(policy_ptmelt, aes(x = Region, y = value, fill = Legend)) + geom_bar(stat = "identity",
width = 1, position = position_dodge(width = 0.9)) + labs(title = "Total Premium Cost by Region",
x = "Region", y = "Premium Cost ($)") + scale_fill_brewer(palette = "Spectral") +
theme_dark() # Source 5
8) B) Creating a Customer Location Map
9) Remove duplicate values
# The quickest way to remove duplicate values in R is to use unique
# the dimensions of policy (number of rows by number of columns)
dim(policy)[1] 1000 12
# display the duplicated rows in policy
head(duplicated(policy))[1] FALSE FALSE FALSE FALSE FALSE FALSE
# the dimensions of the unique values of policy (with duplicated values
# removed)
dim(unique(policy))[1] 1000 12
10) Text to Columns
# the text used in the article can be converted from text to columns directly
# using the fread function from the data.table package
pander(fread("A001; 2345; North; A
A002; 3423; North; B
A003; 2124; South; C
A004; 5134; West; D
A004; 4321; East; D
A005; 7653; South; C",
header = FALSE, sep = ";"))| V1 | V2 | V3 | V4 |
|---|---|---|---|
| A001 | 2345 | North | A |
| A002 | 3423 | North | B |
| A003 | 2124 | South | C |
| A004 | 5134 | West | D |
| A004 | 4321 | East | D |
| A005 | 7653 | South | C |
Sources used in the R code
Source 1
r - Convert column classes in data.table - Stack Overflow answered by Matt Dowle on Dec 27 2013. https://stackoverflow.com//questions/7813578/convert-column-classes-in-data-table
Source 2
r - data.table operations by column name with spaces fails - Stack Overflow answered by grrgrrbla on Jun 10 2015 and edited by Arun on Jun 10 2015. https://stackoverflow.com/questions/30757316/data-table-operations-by-column-name-with-spaces-fails
Source 3
2657 Productions. Using the reshape package in R for pivot-table-like functionality, 8 Aug 2010, http://news.mrdwab.com/post/reshape-package/
Source 4
data.table vignette: Efficient reshaping using data.tables, See https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html
Source 5
r - ggplot side by side geom_bar() - Stack Overflow answered by David Arenburg on Jul 31 2014 & edited by David Arenburg on Aug 1 2014. See https://stackoverflow.com/questions/25070547/ggplot-side-by-side-geom-bar.
Source 6
How to map data with R: A hands-on tutorial to get you to start creating maps with R. By Abhinav Malasi, Jun 29, 2021. See https://medium.com/geekculture/how-to-map-data-with-r-8333110dff5b
Source 7
R Error in x$ed : $ operator is invalid for atomic vectors - Stack Overflow answered & edited by Jilber Urbina on Apr 25, 2014. See https://stackoverflow.com/questions/23299684/r-error-in-xed-operator-is-invalid-for-atomic-vectors
Works Cited
Sunil Ray, Analytics Vidhya, “Simple Yet Powerful Excel Tricks for Analyzing Data,” November 3, 2015, https://www.analyticsvidhya.com/blog/2015/11/excel-tips-tricks-data-analysis/, Accessed: November 11, 2016.
EcoC²S Links
EcoC²S Home
About EcoC²S
EcoC²S Services
1 Stop Shop
Online Store
EcoC²S Media
EcoC²S Resources
R Trainings and Resources provided by EcoC²S (Irucka Embry, EIT)
Donations accepted with Liberapay
If you would like to contribute to the continued development of Irucka Embry’s R packages and/or Irucka Embry’s R Examples, please feel free to donate via the link below:
Please feel free to review Irucka Embry (iaembry)’s profile on Liberapay.
Copyright and License
All R code written by Irucka Embry is distributed under the GPL-3 (or later) license, see the GNU General Public License (GPL) page.
All written content originally created by Irucka Embry is copyrighted under the Creative Commons Attribution-ShareAlike 4.0 International license. All other written content retains the copyright of the original author(s).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.