Data organization in R - code snippets

on May 16, 2022

Match two different data frames based on a common subject ID code

In this case, I want to match these two data frames based on the subject ID code called “Prolific_ID,” which is present in both data sets. I only want to include a subset of variables from each data set (here: only three variables from the first data set and only one variable from the second data set).

# Match subset of data_orig and data_FU based on Prolific ID 
temp1 <- data_orig[c("Prolific_ID", "AHQ1_total_orig", "CAARS_total", "Demo_ADHD")]
temp2 <- data_FU[c("Prolific_ID", "AHQ1_total_rep")]
data_rep <- merge(temp1, temp2, by="Prolific_ID")

Merge >2 different data frames (add columns)

Here I want to merge 5 data frames into one. In this case, I want to add columns and all of the rows are in the same order. That is, the first row is subject1, the second row is subject2, etc. in each of the data frames. This makes merging easy. I just want to tack on more variable columns to make one big spreadsheet.

See here for alternative options if your rows are ordered differently in each data frame and you instead need to sort by row name, or a subject ID code, to merge >2 data frames.

# Merge five data frames 
## Note: this command works *only* if all of your rows are in the same order in each data frame
##       This is the case for Osprey, but for other cases, you'll want a subjID or rowname to sort by 
leftSMG_00_baseline <- cbind.data.frame(leftSMG_00_baseline_subj, leftSMG_00_baseline_QM, 
                                        leftSMG_00_baseline_seg, leftSMG_00_baseline_quant_Tiss,
                                        leftSMG_00_baseline_quant_Alpha)

Merge >2 different data frames (add rows)

Similar to above, I want to combine separate data frames, except here, I want to add more rows instead of adding more columns. I have two data frames — one with 24 subjects at a baseline time point, and a second with 16 of those same 24 subjects at a second time point. Here I’ll use rbind instead of cbind to add more rows instead of adding more columns. This results in one data frame, now with 40 rows.

leftSMG <- rbind.data.frame(leftSMG_00_baseline, leftSMG_01_pre_tDCS)

Add a variable to your data frame

Try the mutate command in the dplyr package. This command adds a column named “timept” after the “subject” column in the leftSMG_01_pre_tDCS data frame. In this case, I wanted the timept value for every row to be “pre-tDCS.” However, you could define this new variable in any way that you like.

# Make a timept variable 
library(dplyr) 
leftSMG_01_pre_tDCS <- leftSMG_01_pre_tDCS %>% mutate(timept="pre-tDCS", .after=c("subject"))

Remove rows from your data frame

This code removes the last 6 rows from my data frame. That is, I ask for R to remove (-c) the number of rows in the data frame (nrow(leftSMG_int)) minus 5. In this case, I have 24 rows total – 5 = 19. So, nrow(leftSMG_int)-5) = 19, i.e., start removing row 19 and onwards. Then keep removing up until (:) the total number of rows in the data frame (nrow(leftSMG_int)) = 24. So, remove (-c) rows 19, 20, 21, 22, 23, and 24 == the last 6 rows.

# Remove the added rows from main data frame (last 6 rows)
leftSMG_int <- leftSMG_int[-c((nrow(leftSMG_int)-5):nrow(leftSMG_int)), ]

Remove a data frame from your workspace entirely

This one’s easy!

# Remove the leftSMG_00_baseline_subj data frame from workspace 
rm(leftSMG_00_baseline_subj)

One way (of many) to subset your data

There are a ton of ways to subset data, depending on your goals — for instance, do you want to create a new data frame? Or do you just want to categorize your data by levels of a factor for one command (e.g., subset by ‘young’ vs. ‘old’ only to count the number of subjects in each group)?

I like the subset command for the former instance — when I want to create a new data frame. This use of subset creates a new data frame for three levels of the timept variable. Here, I ask to retain any rows where timept is equal to (==) pre-tDCS or post-tDCS or tDCS-fu.

AH_tDCS <- subset(AH_data, timept=="pre-tDCS" | timept=="post-tDCS" | timept=="tDCS-fu")

This use of subset excludes certain subjects from my data frame. Here, I ask to exclude any rows where the subject ID is “sub-MAL” or the subject ID is “sub-RBL.”

leftSMG_int <- subset(leftSMG_int, subject!="sub-MAL" & subject!="sub-RBL")

Here I ask to retain only those rows where tCr_IFG and tCr_SM are not missing (i.e., =NA).

data_press_complete <- subset(data_press, !is.na(data_press$tCr_IFG) & !is.na(data_press$tCr_SM))

Here I don’t use the subset command and instead use square brackets. This command sets the variable Asc to NA only for those rows where subject is sub-WBR:

leftSMG_int$Asc[leftSMG_int$subject=="sub-WBR"] <- NA

This command does the same as the command above, except that we set Asc to NA if subject = sub-WBR and if timept = post-tDCS:

leftSMG_int$Asc[leftSMG_int$subject=="sub-DBR" & leftSMG_int$timept=="post-tDCS"] <- NA

Sanity checks: count your data

Use different variations of the count commands to return the number of cases in different groups. For instance, here, we count how many subjects fall into different diagnosis groups (which, in this case, is a factor with three levels).

count(data_press_FTD_demos$diagnosis_group)

Categories:

Tags:

code data organization R statistics

Data organization in R — code snippets