HOMEWORK 4 Part II

PART II (Final Project)

1. What is your research question for the final project?

Is there a relationship between level of education (profile_educ5) and how one rates Planned Parenthood (ftpp)?

Information source: ANES 2020 Social Media Study

library(readr)  # provides read_csv()

# Earlier attempt at reading the Excel version (requires the readxl package):
# my_data <- read_excel(path = "C:\\Users\\erink\\OneDrive\\Documents\\2020 Social Media.xlxs.xlsx")

# Backslashes in a Windows path must be doubled inside an R string
my_data2 <- read_csv(file = "C:\\Users\\erink\\OneDrive\\Documents\\2020Social_Media.csv")
head(my_data2)

# A tibble: 6 x 13
  caseid profile_gender profile_age profile_racethnicity profile_educ5
   <dbl>          <dbl>       <dbl>                <dbl>         <dbl>
1   3824              2          32                    3             1
2    235              2          63                    1             3
3   1286              1          41                    1             3
4   4981              1          52                    2             5
5   1183              2          67                    1             3
6   3158              1          46                    2             4
# ... with 8 more variables: profile_marital <dbl>,
#   profile_income <dbl>, profile_region4 <dbl>,
#   profile_region9 <dbl>, profile_metro <dbl>, profile_relig <dbl>,
#   profile_born <dbl>, ftpp <dbl>

2. What is your hypothesis (i.e. an answer to the research question) that you want to test?

I expect that those with higher levels of education are more likely to rate Planned Parenthood highly.

3. Present some exploratory analysis. In particular: a. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

summary(my_data2)

     caseid     profile_gender   profile_age   profile_racethnicity
 Min.   :   1   Min.   :1.000   Min.   :18.0   Min.   :1.000       
 1st Qu.:1459   1st Qu.:1.000   1st Qu.:36.0   1st Qu.:1.000       
 Median :2916   Median :1.000   Median :51.0   Median :1.000       
 Mean   :2915   Mean   :1.495   Mean   :50.5   Mean   :1.636       
 3rd Qu.:4371   3rd Qu.:2.000   3rd Qu.:65.0   3rd Qu.:2.000       
 Max.   :5830   Max.   :2.000   Max.   :80.0   Max.   :4.000       
 profile_educ5   profile_marital profile_income  profile_region4
 Min.   :1.000   Min.   :1.000   Min.   : 1.00   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:1.000   1st Qu.: 7.00   1st Qu.:2.000  
 Median :3.000   Median :1.000   Median :11.00   Median :3.000  
 Mean   :3.397   Mean   :2.549   Mean   :10.36   Mean   :2.671  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:14.00   3rd Qu.:4.000  
 Max.   :5.000   Max.   :6.000   Max.   :18.00   Max.   :4.000  
 profile_region9 profile_metro    profile_relig     profile_born    
 Min.   :1.00    Min.   :0.0000   Min.   :-7.000   Min.   :-7.0000  
 1st Qu.:3.00    1st Qu.:1.0000   1st Qu.: 1.000   1st Qu.:-1.0000  
 Median :5.00    Median :1.0000   Median : 9.000   Median : 1.0000  
 Mean   :5.23    Mean   :0.8313   Mean   : 6.706   Mean   : 0.8106  
 3rd Qu.:8.00    3rd Qu.:1.0000   3rd Qu.:12.000   3rd Qu.: 2.0000  
 Max.   :9.00    Max.   :1.0000   Max.   :14.000   Max.   : 2.0000  
      ftpp       
 Min.   : -7.00  
 1st Qu.: 28.00  
 Median : 61.00  
 Mean   : 56.98  
 3rd Qu.: 88.00  
 Max.   :100.00

I am mostly interested in level of education (profile_educ5) and the rating of Planned Parenthood (ftpp). The dataset initially had 521 variables. I removed most of them but kept gender, age, race, marital status, income, region, and religion so that I can see how those variables factor into my research question.
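The summary above shows a minimum of -7 for ftpp (and negative minimums for profile_relig and profile_born). Negative values in ANES files are usually missing-data codes rather than real answers; that is an assumption here and should be checked against the codebook. If it holds, those values would need to be set to NA before computing means or fitting models. A minimal sketch, using a small made-up data frame (toy) in place of my_data2:

```r
# Hypothetical rows for illustration only; not taken from the ANES data
toy <- data.frame(
  profile_educ5 = c(1, 3, 3, 5, 4, 2),
  ftpp          = c(50, -7, 85, 100, 0, 60)
)

# Assumption: any negative ftpp value is a nonresponse code, not a rating
toy$ftpp[toy$ftpp < 0] <- NA

# summary() now reports the NA separately instead of treating -7 as a score
summary(toy$ftpp)

# na.rm = TRUE drops the recoded value from the mean
mean(toy$ftpp, na.rm = TRUE)
```

The same recoding could be applied to ftpp in my_data2 once the codebook confirms which values mark missing responses.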

b. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

install.packages("ggplot2")
library(ggplot2)

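# Scatter-plot matrix of education, religion, and the Planned Parenthood rating
pairs(~ profile_educ5 + profile_relig + ftpp, data = my_data2)

# Scatter plot of rating against education level
ggplot(data = my_data2) +
  geom_point(mapping = aes(x = ftpp, y = profile_educ5))

# Smoothed trend of education level across ratings
ggplot(data = my_data2) +
  geom_smooth(mapping = aes(x = ftpp, y = profile_educ5))

# Number of respondents at each education level
ggplot(data = my_data2) +
  geom_bar(mapping = aes(x = profile_educ5))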

install.packages("lsr")

library(lsr)

install.packages("PerformanceAnalytics")

library(PerformanceAnalytics)

chart.Correlation(my_data2, histogram = TRUE)

The geom_point plot isn't very useful at the moment; I will have to make some adjustments before it is readable. The geom_smooth plot does show that people who rated Planned Parenthood very highly tend to have high levels of education, but there is also another peak among people who gave Planned Parenthood a low rating.
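Because education has only five categories, the scatter plot stacks all points into five rows. One possible adjustment is a box plot with education on the x-axis (as a factor) and the rating on the y-axis. A minimal sketch, using a small made-up data frame (toy) in place of my_data2:

```r
library(ggplot2)

# Hypothetical values for illustration only; not taken from the ANES data
toy <- data.frame(
  profile_educ5 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  ftpp          = c(40, 50, 45, 55, 60, 50, 70, 60, 85, 75)
)

# factor() makes ggplot2 draw one box per education level
p <- ggplot(data = toy) +
  geom_boxplot(mapping = aes(x = factor(profile_educ5), y = ftpp)) +
  labs(x = "Education level (profile_educ5)",
       y = "Planned Parenthood rating (ftpp)")

print(p)
```

Putting the outcome (ftpp) on the y-axis also matches the direction of the hypothesis, where education is the explanatory variable.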

I have to review which models best suit this type of data, since there are only 5 options for level of education. I also want to explore the chart.Correlation() output further as I dig deeper into this project.
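Two candidate approaches for a five-level explanatory variable are a rank-based (Spearman) correlation, which only assumes the levels are ordered, and a linear model that treats education as a factor so each level gets its own mean. This is a sketch of what those might look like, not a decision on the final model; the toy data frame is made up for illustration:

```r
# Hypothetical values for illustration only; not taken from the ANES data
toy <- data.frame(
  profile_educ5 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  ftpp          = c(40, 50, 45, 55, 60, 50, 70, 60, 85, 75)
)

# Option 1: Spearman rank correlation treats education as ordered categories
rho <- cor(toy$profile_educ5, toy$ftpp, method = "spearman")
rho

# Option 2: linear model with education as a factor;
# the intercept is the mean rating at level 1, and each other coefficient
# is the difference between that level's mean and the level 1 mean
fit <- lm(ftpp ~ factor(profile_educ5), data = toy)
coef(fit)
```

Control variables such as gender or age could be added to the right-hand side of the model formula later.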