As always, start by opening a new script file, give your file a “good name” and save it in our folder (POL2045). Remove everything from RStudio’s memory and set your working directory
rm(list=ls())
setwd("~POLXXXX") # Remember XXXXis the code of your module
Although many data manipulation function exist in basic R sometimes it is easier to use packages build for the sole purpose of making data manipulation easier. dplyr
is such a package (https://dplyr.tidyverse.org/)
dplyr
is providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate( ) : adds new variables that are functions of existing variables
select( ) : picks variables based on their names
filter( ) : picks cases based on their values
arrange( ) : changes the ordering of the rows
summarise( ) : reduces multiple values down to a single summary
library(dplyr)
Although not required dplyr
make use of the pipe operator %>%
. The main value of the pipe operator is the ability to connect multiple functions together. To give you an example:
filter(my_data, my_variable == variable_value)
or
my_data %>% filter(my_variable == variable_value)
If you use RStudio, you can type the pipe ([%>%]) with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
We will use the European Value Survey (EVS) which includes a series of continuous and categorical variables.
select( )
functionselect ()
will keep only those variables in the dataset you are interested in. Sometimes, especially when we re working with big datasets we want to reduce the number of variables.
In our seminar we will work again with the European Value Study (2019). As you already know the dataset contains many variables.Our goal is to explore what predicts attitudes towards immigration. The exact wording of the question is the following:
Please look at the following statements and indicate where you would place your views on this scale? (scale of 1 to 10)
Where 1 means that immigrants take jobs away from [Nationality] and 10 that immigrants DO NOT take jobs away from [Nationality].load("EVS_UK_full.RData")
library(dplyr)
Our dependent variable is attitudes towards immigration (v185), our main idependent variables are: self-placement on the left-right spectrum (v102), attitudes towards nationality (v189,v190, v191, v192, v193), and finally three control variables namely age (v226), education (v243_edulvlb), and gender (v225). We also need the variable describing the name of the country in the dataset (country).
sub.sample<-EVS_UK_full %>% select(v102, v185, v189,v190, v191, v192, v193, v225, v226,v243_edulvlb,country)
head(sub.sample)
EVS includes all European countries, for our excercise we want to analyse data examining attitudes towards immigration in GB. To exclude all other countries from our dataset we need a) to know the value of the variable that corresponds to Great Britain (country==826) b) to delete all other values of the country variable from our dataset. To do so we will use the filter ()
function.
sub.sample %>% filter(country == 826)
Our next step is to let R know which values represent missing cases. According to the codebook all values ranging from -10 to -1 describe missing cases (Don’t know, No answer, not applicable)
sub.sample[sub.sample <=-1] <- NA
sub.sample[sub.sample <=-2] <- NA
Our dependent variable is a continuous variable with a range from 1 to 10, where 1 means that the respondent hold anti-immigration attitudes (Immigrants take jobs) and 10 that the respondent doesn’t hold anti-immigration attitudes. When the large number, in our example (10), indicate lower agreement with the phenomenon under study, then we say that the variables are negatively coded. This is not really wrong but it makes more sense, is more intuitive, to reverse the order of the coding. It is the case that readers and researchers expect higher values to indicate higher levels of agreement with the phenomenon under study, in this case anti-immigration attitudes.
Using the arrange( )
function we can reverse the order of the code of the values. In our example what we want is instead of 10
representing positive attitudes towards immigration to represent negative ones.
sub.sample<-sub.sample %>% mutate(immi.jobs=(max(v185,na.rm=TRUE)-(v185)))
Note: I used the na.rm=T function inside the max function max()
. This is because otherwise due to NAs R cannot make the maths.
To make sure we did everything correctly we use the table ()
function to compare the two variables- before and after reversing the codes.
table(sub.sample$immi.jobs)
table(sub.sample$v185)
We should apply the same method to reverse the order of three more variables v189 to v193.
sub.sample<-sub.sample %>% mutate(born.country=(max(v189,na.rm=TRUE)-(v189)))
sub.sample<-sub.sample %>% mutate(respect.inst=(max(v190,na.rm=TRUE)-(v190)))
sub.sample<-sub.sample %>% mutate(country.ancestry=(max(v191,na.rm=TRUE)-(v191)))
sub.sample<-sub.sample %>% mutate(speak.lang=(max(v192,na.rm=TRUE)-(v192)))
sub.sample<-sub.sample %>% mutate(share.cultr=(max(v193,na.rm=TRUE)-(v193)))
head(sub.sample)
Our next step is to create the age
variable, we know the year of birth and that the survey administrated in 2017. To calculate the responder’s name when the survey took place we substract respondents year of birth from the year the survey took place (2017- year of birth).
sub.sample<-sub.sample %>% mutate(age=2017-v226)
table(sub.sample$age)
Our last step is to use the rename( )
function, part of dplyr
to give meaningful names to our variables.
sub.sample<-sub.sample %>%
rename(education=v243_edulvlb, gender=v225, left_right=v102)
head(sub.sample)